Linode Community Forums
Author Message
 Post subject: looking for a grep trick
Posted: Thu Jan 14, 2010 12:32 pm
Offline
Senior Newbie

Joined: Wed Sep 28, 2005 8:50 pm
Posts: 19
Can anyone think of a way to split a file using grep: match lines and pipe them to one file, then somehow tee the non-matches to another file?

I know I can use the -v option, but my regular expression is going to be a very large OR statement run against multi-GB files. I'd like to avoid scanning each file twice (once with -v and once without).

Any thoughts or tips would be greatly appreciated!


 Post subject:
Posted: Thu Jan 14, 2010 12:50 pm
Offline
Senior Newbie

Joined: Thu Oct 09, 2008 10:21 am
Posts: 8
grep is designed to show matches or non-matches, but either way it only produces one output stream, whether that goes to stdout, a pipe, or a file.

The only way to bend it that way would be to modify grep's source so it creates two output streams: one for matches, one for non-matches.


 Post subject:
Posted: Thu Jan 14, 2010 1:20 pm
Offline
Senior Member

Joined: Sat May 03, 2008 4:01 pm
Posts: 569
Website: http://www.mattnordhoff.com/
There might be an easier way, but if you don't find it, it wouldn't be hard to write a little Perl or Python script for this...


 Post subject: found it
Posted: Thu Jan 14, 2010 4:23 pm
Offline
Senior Newbie

Joined: Wed Sep 28, 2005 8:50 pm
Posts: 19
I was thinking there had to be a way with tee, just couldn't nail it down until I found this example:

ls -A | tee >(grep '^[.]' > hidden-files) >(grep -v '^[.]' > normal-files) | less

Wasn't familiar with process substitution but I'm loving it.

from:

http://linux.byexamples.com/archives/14 ... processes/

:D


 Post subject:
Posted: Fri Jan 15, 2010 10:07 am
Offline
Senior Member

Joined: Sun Jan 18, 2009 2:41 pm
Posts: 830
I spent way too much time trying out different approaches, so may as well post the results here. I tested with infile, a list of ten million pseudo-random numbers from 0 to 32767. Each command separates this into two output files - one with all five-digit numbers (about 70%), and one with all the others. Here are the results I got in terms of wall-clock times, with the fastest shown first and defined as 1x. Of course you'll get different results with your own input file, regex, and version of the various tools.

Code:
1.00x
grep '[0-9][0-9][0-9][0-9][0-9]' infile > outfile1 ; grep -v '[0-9][0-9][0-9][0-9][0-9]' infile > outfile2


The naive grep approach turned out to be the fastest, but it reads the input file twice. If the file is larger than memory, that will likely slow things considerably.

Code:
1.03x
tee >( grep '[0-9][0-9][0-9][0-9][0-9]' > outfile1 ) < infile | grep -v '[0-9][0-9][0-9][0-9][0-9]' > outfile2

1.03x
tee >( grep '[0-9][0-9][0-9][0-9][0-9]' > outfile1 ) >( grep -v '[0-9][0-9][0-9][0-9][0-9]' > outfile2 ) > /dev/null < infile


These use the tee trick so the input file only needs to be read once. The second form is easier to understand, but may take a hair longer since tee writes an extra copy of the input to /dev/null. Note that the >() process-substitution syntax may not be available in shells besides bash (zsh and ksh93 have it; plain POSIX sh does not).

Code:
1.62x
awk '{ if (/[0-9][0-9][0-9][0-9][0-9]/) print > "outfile1" ; else print > "outfile2" }' infile


Hey, don't forget the old-school text processing languages.

Code:
3.44x
perl -ne 'BEGIN { open(ONE, ">outfile1"); open(TWO, ">outfile2") } if (/[0-9][0-9][0-9][0-9][0-9]/) { print ONE } else { print TWO }' infile


Perl can do more than awk, but in this case takes about twice the time.

Code:
14.2x
sed -n -e 's/[0-9][0-9][0-9][0-9][0-9]/&/w outfile1
t
w outfile2' infile

15.1x
sed -n -e '/[0-9][0-9][0-9][0-9][0-9]/ w outfile1' -e '/[0-9][0-9][0-9][0-9][0-9]/! w outfile2' infile


I was a bit surprised to see sed come out so poorly; I'm not sure why. As expected, testing against the regex once (the first form) is faster than doing it twice (the second form), though its syntax is harder to follow.


 Post subject:
Posted: Fri Jan 15, 2010 11:16 am
Offline
Senior Newbie

Joined: Wed Sep 28, 2005 8:50 pm
Posts: 19
Wow - thanks Vance, this is great stuff. I had thought about awk and sed but didn't get around to testing. Glad to see you confirmed grep is the fastest option.

