Re: [Boston.pm] Passing large complex data structures between process

2013-04-04 Thread Morse, Richard E. (MGH)
On Apr 3, 2013, at 10:34 AM, David Larochelle da...@larochelle.name wrote: Currently, the driver process periodically queries a database to get a list of URLs to crawl. It then stores these URLs to be downloaded in a complex in-memory data structure and pipes them to separate processes that do the actual
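A minimal sketch of that driver-to-fetcher piping, assuming a single forked worker that reads one URL per line; the URL list and the `print` stand-in for the real download are illustrative, not from the original post:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical work list -- the real driver gets these from the database.
my @urls = ( 'http://example.com/a', 'http://example.com/b' );

pipe( my $reader, my $writer ) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {
    # Child (fetcher): read URLs off the pipe until the parent closes it.
    close $writer;
    while ( my $url = <$reader> ) {
        chomp $url;
        print "fetching $url\n";    # stand-in for the actual download
    }
    exit 0;
}

# Parent (driver): send the work, then close the pipe so the child sees EOF.
close $reader;
print {$writer} "$_\n" for @urls;
close $writer;
waitpid( $pid, 0 );
```

Writing one URL per line keeps the protocol trivial; a real system would need framing (or JSON lines) once the payload is more than a plain string.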

Re: [Boston.pm] Passing large complex data structures between process

2013-04-04 Thread David Larochelle
Thanks for all the feedback. I left out a lot of details about the system because I didn't want to complicate things. The purpose of the system is to comprehensively study online media. We need the system to run 24 hours a day to download news articles from media sources such as the New York Times. We

Re: [Boston.pm] Passing large complex data structures between process

2013-04-04 Thread Anthony Caravello
This sounds like a perfect fit for a queuing service like RabbitMQ. Logstash uses Redis lists for this, as it's simple to set up and pretty reliable, but there are many such applications available. The queues would allow multiple backend processes to check for and take items as they became
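An in-process stand-in for that pattern, using Perl's core Thread::Queue so it runs without a broker: several workers drain one shared queue, the way RabbitMQ or a Redis list would let separate backend processes take items. The URLs and worker count here are illustrative only:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# One shared queue that several workers drain concurrently -- an in-process
# stand-in for the broker-based setup discussed above.
my $queue   = Thread::Queue->new;
my $workers = 3;

$queue->enqueue("http://example.com/page/$_") for 1 .. 6;
$queue->enqueue(undef) for 1 .. $workers;    # one "no more work" marker per worker

my @threads = map {
    threads->create(
        sub {
            my $count = 0;
            while ( defined( my $url = $queue->dequeue ) ) {
                $count++;    # a real worker would download $url here
            }
            return $count;
        }
    );
} 1 .. $workers;

my $total = 0;
$total += $_->join for @threads;
print "processed $total items\n";
```

The key property is the same one a real queue service gives you: each item is taken by exactly one worker, and workers pull at their own pace instead of being pushed to.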

Re: [Boston.pm] Passing large complex data structures between process

2013-04-04 Thread Gyepi SAM
On Thu, Apr 04, 2013 at 04:21:54PM -0400, David Larochelle wrote: My hope is to split the engine process into two pieces that run in parallel: one to query the database and another to send downloads to fetchers. This way it won't matter how long the db query takes, as long as we can get URLs

Re: [Boston.pm] Passing large complex data structures between process

2013-04-04 Thread John Redford
David Larochelle wrote: [...] We're using PostgreSQL 8.4 and running on Ubuntu. Almost all data is stored in the database. The system contains a list of media sources with associated RSS feeds. We have a downloads table that has all of the URLs that we want to download or have downloaded in
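A sketch of how a driver might pull a batch of pending URLs from such a downloads table with DBI. The column names (`url`, `state`), the DSN, and the credentials are guesses for illustration, not the poster's actual schema, and the snippet needs a live PostgreSQL server to run:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical schema: a downloads table with url and state columns.
# DSN and credentials are placeholders.
my $dbh = DBI->connect( 'dbi:Pg:dbname=mediacloud', 'user', 'password',
    { RaiseError => 1, AutoCommit => 1 } );

# Pull a batch of not-yet-fetched URLs to hand to the fetcher processes.
my $urls = $dbh->selectcol_arrayref(
    q{SELECT url FROM downloads WHERE state = 'pending' LIMIT 1000}
);

print "queued $_\n" for @$urls;
$dbh->disconnect;
```

Batching with a LIMIT keeps each query cheap, which matters if the dispatch side is waiting on query results, as discussed earlier in the thread.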