mst wrote:
Don't do it in PHP. It has horrible memory management and plenty of bugs that leak memory. If you want to run long-lived processes (like crawlers), PHP is your enemy.
While this is true, I think the biggest failing here is the assumption that a crawler needs to be a long-running process. I would suggest writing a scheduler script, using a database like MySQL to hold a queue, and running many short-lived PHP tasks. The result: better resource usage, and each crawler invocation only has to handle one small task at a time. The aggregate data can be stored in another database. There's very little reason a crawler would need cross-crawl state, and just about every case I can think of can be handled through the queuing database. A very simple structure would be:

queueUid (primary key) | queuePriority (allows for more complex queuing) | queueData (serialized array; can include instructions, cookie data, referrer, or anything else you want to pass on)
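To make that concrete, here's a minimal sketch of one short-lived worker against that queue. It uses an in-memory SQLite database via PDO purely so the example runs standalone; in the setup described above this would be a MySQL connection instead. The table and column names (queue, queueUid, queuePriority, queueData) follow the structure above but are otherwise assumptions, not a fixed API.

```php
<?php
// Hypothetical short-lived worker: claim one task, process it, exit.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE queue (
    queueUid      INTEGER PRIMARY KEY AUTOINCREMENT,
    queuePriority INTEGER NOT NULL DEFAULT 0,
    queueData     TEXT    NOT NULL
)');

// The scheduler would enqueue tasks like this:
$stmt = $db->prepare('INSERT INTO queue (queuePriority, queueData) VALUES (?, ?)');
$stmt->execute([5, serialize(['url' => 'http://example.com/', 'referrer' => null])]);
$stmt->execute([1, serialize(['url' => 'http://example.com/page2'])]);

// A worker claims the highest-priority row and removes it from the queue.
// (Against MySQL with concurrent workers you'd wrap this in a transaction
// with SELECT ... FOR UPDATE so two workers can't grab the same row.)
$row = $db->query('SELECT queueUid, queueData FROM queue
                   ORDER BY queuePriority DESC LIMIT 1')
          ->fetch(PDO::FETCH_ASSOC);
if ($row === false) {
    exit(0);  // queue empty: nothing to do
}
$db->prepare('DELETE FROM queue WHERE queueUid = ?')->execute([$row['queueUid']]);

$task = unserialize($row['queueData']);
echo "fetching {$task['url']}\n";
// ... fetch the page, store results, enqueue any discovered URLs ...
// The process then exits, so any memory PHP leaked is reclaimed by the OS.
```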
This, imo, is far better than running a single long-term process in Python. For one, parallelism is much easier: since each task is an independent, short-lived process, you can spawn as many workers as you like and trivially control how many run at once. That's far simpler than multi-threading a single long-running process in any language.
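The "control the number of workers" part can be sketched like this. The worker script name (worker.php) and the cap of 4 are hypothetical; the point is that the scheduler just launches OS processes rather than managing threads.

```php
<?php
// Hypothetical scheduler: run at most $maxWorkers worker processes
// at a time. Each worker is an independent OS process, so there is
// no shared state to lock and no threading inside PHP itself.
$maxWorkers = 4;
$workers = [];

for ($i = 0; $i < $maxWorkers; $i++) {
    $workers[] = proc_open('php worker.php', [], $pipes);
}

// Wait for the batch to finish; the scheduler can then spawn more,
// or simply be re-run from cron every minute.
foreach ($workers as $w) {
    proc_close($w);  // blocks until that worker exits
}
```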