Linode Community Forums
PostPosted: Fri Jan 15, 2010 11:33 am 
Senior Newbie

Joined: Wed Sep 28, 2005 8:50 pm
Posts: 19
So this is an extension of the question I asked in this thread:

http://www.linode.com/forums/viewtopic.php?t=5054

The use case here is consuming the Twitter streaming API. You open an HTTP connection and it sends chunked responses indefinitely. Based on the keywords being tracked this can get very noisy and produce a lot of data.

What I'm doing is piping the stream through my grep "tree" using tee, which lets me run regexes on the incoming stream to filter out the noisy results. Because this could generate 300-400MB of unfiltered data daily, I also want to pipe the raw copy to gzip (I don't want to discard anything in case of false negatives).
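In rough shell terms, the shape of the pipeline is something like this (a simplified sketch with a fake stream; in my real setup the long-lived curl connection to the API feeds the pipe, and gzip sits directly on it, which is where the corruption risk comes in):

```shell
# stand-in for the long-lived streaming connection
printf 'cats purr\ndogs bark\ncats nap\n' \
  | tee raw_stream.log \
  | grep 'cats' > filtered.txt    # the noisy-keyword filter

# the verbatim copy gets compressed so nothing is lost to
# false negatives in the grep filters
gzip raw_stream.log
```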

The stream to the gzip process may be terminated at any point, on either end of the connection. I'm worried this puts me at risk of a corrupt gzip file.

I've been able to simulate a corrupt file by forcefully splitting the .gz. I can send the pieces to a Windows box, unzip them, and open them as plain text, although with an error.

I've been unable to get gunzip to process the file and spit out the raw ASCII.
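For reference, here's roughly how I simulated the corruption (a sketch; the sizes are made up):

```shell
# build a small archive, then chop it off mid-stream to mimic
# a gzip writer that was killed partway through
seq 1 100000 | gzip > stream.gz
head -c 2000 stream.gz > truncated.gz

# the integrity check fails on the truncated copy
gzip -t truncated.gz && echo intact || echo corrupt
```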

Questions: is there a way to open/recover the contents of an ASCII gzip whose transfer terminated early? And are there any options within gzip, or another compression utility, that support "transactions", so to speak? Meaning the utility would only write in full chunks and "roll back" if it received a SIGTERM in the middle of writing a line of text.

Thanks in advance for any tips.


PostPosted: Fri Jan 15, 2010 1:38 pm 
Senior Member

Joined: Sun Jan 18, 2009 2:41 pm
Posts: 830
Sounds like you could pipe your output to split, which would break it into files of a specified size. Then you could set up a cron job to periodically compress the resulting files.
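Untested sketch of the shape (tiny line-based chunks just for illustration; for your volumes you'd use something like split -b 500m instead):

```shell
# split reads from the pipe and starts a new plain-text file each
# time the current one fills up; only the chunk being written is
# ever at risk, and everything already on disk stays readable
printf '%s\n' one two three four five \
  | split -l 2 - tweets_

ls tweets_*    # three chunks: tweets_aa, tweets_ab, tweets_ac
```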


PostPosted: Fri Jan 15, 2010 3:45 pm 
Senior Newbie

Joined: Wed Sep 28, 2005 8:50 pm
Posts: 19
Hmm, that's a good idea. So if I pipe stdout into split at, say, 500MB, it will just write the output to a file until it reaches 500MB and then start a new one... then I just compress those files?

If so it seems like a great way to keep things in ASCII until I'm out of zone for potential failure. Going to give it a shot. Thanks!
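For the cron side I'm picturing something like this (a sketch, assumes GNU coreutils; the chunk files here are faked for the demo):

```shell
# fake three chunks as split would leave them on disk
printf 'a\n' > tweets_aa
printf 'b\n' > tweets_ab
printf 'c\n' > tweets_ac

# compress every completed chunk but skip the newest one,
# since split may still be appending to it
ls tweets_* | sort | head -n -1 | xargs -r gzip
```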

