I'm trying to optimize a database-driven web crawler and I was wondering if
anyone could offer any recommendations for interprocess communication.
Currently, the driver process periodically queries a database to get a
list of URLs to crawl. It then stores these URLs to be downloaded in a
This would be a great topic for a meeting, either as a report later on what
you found and how you used it, or as a workshop to evaluate your options.
[FYI, the schedule for next Tuesday: Federico will update us on his embedded
Perl hardware-hacking project.]
bill@$dayjob
I've generally had luck with threads in Perl 5.10 and beyond, though if
you're sharing variables containing large amounts of data they can be
inefficient. Storing your data in Redis is often handy and fast, and
Redis::List works nicely as a queuing system, even if you use threads.
Hope that helps.
Thanks,
I think that Redis would work, but the system currently uses PostgreSQL, and
I'm looking for something simpler than having to maintain another service.
(We're actively looking at NoSQL solutions, but that would be a huge
rearchitecting of the system.)
Is there a clean way to do this with pipes?
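Pipes are certainly workable for this. A minimal sketch of the idea, where the driver writes URLs down a pipe to a forked worker process (the URLs and the print are illustrative stand-ins for the real fetch logic):

```perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe: $!";

my $pid = fork();
die "fork: $!" unless defined $pid;

if ($pid == 0) {                    # child: consume URLs, one per line
    close $writer;
    while (my $url = <$reader>) {
        chomp $url;
        print "crawling $url\n";    # a real crawler would fetch here
    }
    exit 0;
}

close $reader;                      # parent: produce URLs
print {$writer} "$_\n" for qw(http://example.com/a http://example.com/b);
close $writer;                      # EOF tells the child to stop
waitpid($pid, 0);
```

For multiple crawler processes you'd need one pipe per child (or a SysV message queue), since several readers on one pipe can interleave partial lines.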
To keep it simple, try threads and the Thread::Queue module. That will keep
it all 'in-house'.
On Wed, Apr 3, 2013 at 11:28 AM, David Larochelle <da...@larochelle.name> wrote:
Thanks,
I think that Redis would work, but the system currently uses
PostgreSQL, I'm looking for something simpler than
On Wed, Apr 03, 2013 at 10:34:17AM -0400, David Larochelle wrote:
I'm trying to optimize a database-driven web crawler and I was wondering if
anyone could offer any recommendations for interprocess communication.
Currently, the driver process periodically queries a database to get a
list
Another option that I've used in similar situations:
1. Have a process hit the database and generate a Storable file of the data.
2. Have multiple crawlers execute and thaw the Storable into memory.
3. Do what you need to do with the data, pushing back to the database when
necessary.
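The three steps above can be sketched like this (the file path and URL list are illustrative; @urls stands in for the rows returned by the periodic query):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Step 1: the driver queries the database and freezes the result set.
my @urls = ('http://example.com/a', 'http://example.com/b');
store(\@urls, '/tmp/crawl_batch.storable');

# Step 2: each crawler process thaws the batch into memory.
my $batch = retrieve('/tmp/crawl_batch.storable');

# Step 3: work through the batch, writing results back when needed.
for my $url (@$batch) {
    print "crawling $url\n";    # a real crawler would fetch here
}
```

One thing to watch: if the crawlers might run on a different architecture or perl build than the driver, use nstore/retrieve (network byte order) rather than store.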
Instead of