Linode Forum
Linode Community Forums
Posted: Sun Apr 14, 2013 3:00 pm
Senior Member

Joined: Mon Jan 02, 2012 12:45 pm
Posts: 365
I realize asking these types of questions can result in some unpleasantness, so please be civil towards each other (and me).

I have a project coming up that will require me to batch process a large amount of data once a week (on the weekend). Here's the pertinent info:
- while processing the files I will have exclusive access to the VPS & DB.
- processing 20 files (containing csv or xml data) with a total of 300k to 350k records.
- adding ~450k records to the db: 10 fields (4 indexes) each record ~96 bytes each (incl indexes).

Once the new data has been added I'll need to update related records:
- updating ~100k records: 20 fields (4 indexes), each record ~160 bytes each (incl indexes).
- each update will require up to 30 simple math operations (add, sub, mult or div using two arguments).

The db is going to be kind of big:
- 15 to 18 months of history (65 - 78 weeks worth of records).
- 30M - 35M total records in DB.

Once each week's data is processed:
- 25 csv files will be produced (up to 100k records each).
- during the week up to 20 concurrent users could be accessing the system performing simple queries via a web app.
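Roughly, I'm picturing the ingest step as chunked batch inserts, something like this sketch (sqlite3 here is just a stand-in for whatever DB I end up on, and the `records` table and its columns are made up for illustration):

```python
import sqlite3
from itertools import islice

def insert_in_chunks(conn, rows, chunk_size=5000):
    """Insert rows (e.g. from a csv.reader) in fixed-size batches,
    so memory use stays flat and each transaction stays a sane size."""
    cur = conn.cursor()
    total = 0
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        cur.executemany(
            "INSERT INTO records (name, qty, price) VALUES (?, ?, ?)",
            chunk,
        )
        conn.commit()  # one commit per chunk, not per row
        total += len(chunk)
    return total
```

With MySQL or PostgreSQL the idea would be the same, just with the appropriate driver and placeholder style; one commit per chunk avoids both autocommit-per-row overhead and one giant transaction.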

I'm thinking of setting up three Linodes for this customer's services:
- an appropriately sized VPS to handle the heavy lifting and for user queries.
- a smaller VPS (separate data center) to function as a backup web server (will have copy of latest DB).
- a smaller VPS to function as dev/test machine as well as to store most recent DB backups.

Performance while processing the batch files and updating the DB is important, but so is being a good 'VPS neighbor'.

Though I'm familiar with CentOS & Ubuntu, and with MySQL & PHP for web-based projects, I don't want to limit myself to these if there are better solutions for the batch processing.

I'm not sure if any particular Linux distro is better suited to this type of batch processing. I'm guessing that VPS resources & tuning would have more of an impact than the distro, but I'm sure someone has some insight into this.

Is MySQL the best fit for this type of project? Does it scale well to this large of a DB? Would PostgreSQL be a better fit?

I'm aware that Python and Perl are faster than PHP at many things, but would either one of them be a better fit for this type of batch processing? Is there another language that would be a better fit? I've programmed in many different languages over the years, so I'm certainly open to suggestions.

Any insight would be greatly appreciated.


Thanks,
James


Posted: Sun Apr 14, 2013 5:09 pm
Senior Member

Joined: Wed Mar 17, 2004 4:11 pm
Posts: 554
Website: http://www.unixtastic.com
Location: Europe
The choice of which distro to use is religious, not practical. Use whichever one you are most familiar with; it doesn't make much difference as long as it's fairly recent and still gets security updates.

Both MySQL and Postgres will work fine. I'd use Postgres personally but again you are better off using the one you know best. Whatever you do you will need to index the right things and tune well.

PHP makes it easy to write code but it's a terrible language. It's good for user interface stuff though. The speed difference between PHP, Python, Perl, C, or whatever else doesn't matter as much as writing good code and using good algorithms. I'd write the user interface web stuff in PHP and the heavy lifting in Python.


Posted: Mon Apr 15, 2013 6:31 am
Senior Member

Joined: Mon Jan 02, 2012 12:45 pm
Posts: 365
Thanks for your input, sednet. The logic for the batch processing is straightforward and should be relatively easy to optimize, though tightening it up will be important when executing it over 300k times in a row.

I didn't think PHP was the way to go for the heavy lifting, but I do intend to use it for the UI since the web app should have less than 10k page views per month.


Posted: Mon Apr 15, 2013 6:35 am
Senior Member

Joined: Mon Dec 07, 2009 6:46 am
Posts: 331
Just be careful if you use Python, esp. with SQLAlchemy, to work with data in chunks, as Python* is notorious for not releasing memory back to the OS. And if you can parallelize your work, use the 'multiprocessing' lib, not threading, because of the GIL. Then again, if your work is mostly I/O-bound, the GIL might not matter.
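To sketch what I mean by sidestepping the GIL with 'multiprocessing' (the per-record arithmetic here is a made-up stand-in for your ~30 math operations):

```python
from multiprocessing import Pool

def transform(record):
    # stand-in for the real per-record math (add/sub/mult/div steps)
    value, factor = record
    return value * factor + 1

def process_all(records, workers=4, chunksize=1000):
    # Worker processes run truly in parallel (no GIL contention);
    # chunksize batches tasks to keep inter-process overhead low.
    with Pool(workers) as pool:
        return pool.map(transform, records, chunksize=chunksize)
```

One gotcha: on platforms that spawn rather than fork (Windows, newer macOS), the entry point needs an `if __name__ == "__main__":` guard, and the worker function must be importable at module level.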

Python also has Celery which is nice if you can parallelize across many nodes.

Otherwise, I'll second what sednet said. PHP for webapp, Python for heavy lifting, PostgreSQL over MySQL.

I have an app that does something similar, with a similar amount of data, except it's XML, not CSV. Because of the memory overhead of the various lists and dictionaries, combined with said Python problem of not releasing memory to the OS, I had to rewrite it to create the output files on disk "manually" (using tempfile.SpooledTemporaryFile, which automatically spills to disk if it grows over XX MB) instead of building everything with the xml.etree library -- which I now use only to construct small branches, not the whole XML tree.

------------
*) At least 2.6 on CentOS. I hear 3.x and 2.7 have patches that make it less of a problem.
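The branch-at-a-time approach looks roughly like this (record fields and the spool threshold are invented for the example):

```python
import tempfile
import xml.etree.ElementTree as ET

def write_records(records, max_mem=10 * 1024 * 1024):
    """Stream records into a spooled temp file: it stays in memory
    until max_mem bytes, then transparently spills to disk. Only one
    small ElementTree branch exists at a time, never the whole tree."""
    out = tempfile.SpooledTemporaryFile(max_size=max_mem, mode="w+b")
    out.write(b"<?xml version='1.0'?>\n<records>\n")
    for rec in records:
        # build just this record's branch, serialize it, discard it
        elem = ET.Element("record", id=str(rec["id"]))
        ET.SubElement(elem, "value").text = str(rec["value"])
        out.write(ET.tostring(elem) + b"\n")
    out.write(b"</records>\n")
    out.seek(0)
    return out
```

The document envelope is written as plain bytes; xml.etree only ever sees one record's subtree, so peak memory stays proportional to a single record rather than the whole file.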


Posted: Tue Apr 16, 2013 6:34 am
Senior Member

Joined: Mon Jan 02, 2012 12:45 pm
Posts: 365
Thanks, Azathoth. I'll make sure I'm running 2.7.

I hadn't thought of processing the batches across multiple nodes. I'll look into Celery.



Powered by phpBB® Forum Software © phpBB Group