I spent way too much time trying out different approaches, so I may as well post the results here. I tested with
infile, a list of ten million pseudo-random numbers from 0 to 32767. Each command separates this into two output files - one containing every line with a five-digit number (about 70% of the input), and one with all the others. Here are the wall-clock times I got, with the fastest shown first and defined as 1x. Of course you'll get different results with your own input file, regex, and versions of the various tools.
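For anyone who wants to reproduce this, a similar input file can be generated with awk. This is just a sketch - the seed and exact invocation are my own, not part of the original test:

```shell
# Generate ten million pseudo-random integers in 0..32767, one per line.
# srand() with no argument seeds from the current time.
awk 'BEGIN { srand(); for (i = 0; i < 10000000; i++) print int(rand() * 32768) }' > infile
```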
Code:
1.00x
grep '[0-9][0-9][0-9][0-9][0-9]' infile > outfile1 ; grep -v '[0-9][0-9][0-9][0-9][0-9]' infile > outfile2
The naive
grep approach turned out to be the fastest, but it reads the input file twice. If the file is too large to stay in the page cache, that second read will likely slow things down considerably.
Code:
1.03x
tee >( grep '[0-9][0-9][0-9][0-9][0-9]' > outfile1 ) < infile | grep -v '[0-9][0-9][0-9][0-9][0-9]' > outfile2
1.03x
tee >( grep '[0-9][0-9][0-9][0-9][0-9]' > outfile1 ) >( grep -v '[0-9][0-9][0-9][0-9][0-9]' > outfile2 ) > /dev/null < infile
These use the
tee trick so the input file only needs to be read once. The second form is easier to understand, but may take a hair longer since tee writes an extra copy of the data to its stdout, which is then thrown away to /dev/null. Note that the >( ) process-substitution syntax is supported by bash, ksh93, and zsh, but not by a plain POSIX sh.
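For what it's worth, if you're stuck on a shell without >( ), the same single-pass split can be sketched with an explicit named pipe, which is essentially what process substitution does behind the scenes (fifo1 is just an arbitrary name I picked):

```shell
# One grep reads from the named pipe in the background...
mkfifo fifo1
grep '[0-9][0-9][0-9][0-9][0-9]' < fifo1 > outfile1 &
# ...while tee feeds it one copy and pipes the other to the second grep.
tee fifo1 < infile | grep -v '[0-9][0-9][0-9][0-9][0-9]' > outfile2
wait    # make sure the background grep has finished writing outfile1
rm fifo1
```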
Code:
1.62x
awk '{ if (/[0-9][0-9][0-9][0-9][0-9]/) print > "outfile1" ; else print > "outfile2" }' infile
Hey, don't forget the old-school text processing languages.
Code:
3.44x
perl -ne 'BEGIN { open(ONE, ">outfile1"); open(TWO, ">outfile2") } if (/[0-9][0-9][0-9][0-9][0-9]/) { print ONE } else { print TWO }' infile
Perl can do more than
awk, but in this case it takes about twice as long.
Code:
14.2x
sed -n -e 's/[0-9][0-9][0-9][0-9][0-9]/&/w outfile1
t
w outfile2' infile
15.1x
sed -n -e '/[0-9][0-9][0-9][0-9][0-9]/ w outfile1' -e '/[0-9][0-9][0-9][0-9][0-9]/! w outfile2' infile
I was a bit surprised to see
sed come out so poorly; I'm not really sure why. As expected, testing against the regex once (the first form, where the s command's w flag writes matching lines to outfile1 and the t command then skips the unconditional w outfile2) is faster than testing it twice (the second form), though the syntax is harder to follow.