Generally, best practice for crawlers is that no single process runs for more
than an hour or so. All crawler processes update
a central state store with their progress, and they exit when they reach a
time limit, knowing that another process will
pick up the work where they left off. This avoids a multitude of ills.
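A minimal sketch of that pattern in Java, with an in-memory map standing in for the central state store and illustrative names (`TimeLimitedWorker`, `run`) that are not from any real crawler codebase: the worker checkpoints each completed item and exits once its time budget is spent, so a replacement process can resume from the store.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of a time-limited crawler worker. In production the
// state store would be a database, ZooKeeper, etc.; here it is a plain map.
public class TimeLimitedWorker {
    static final long TIME_LIMIT_MS = 60 * 60 * 1000; // roughly one hour

    // Stand-in for the central state store shared by all workers.
    static final Map<String, String> stateStore = new HashMap<>();

    public static int run(Queue<String> workQueue, long timeLimitMs) {
        long start = System.currentTimeMillis();
        int processed = 0;
        while (!workQueue.isEmpty()) {
            if (System.currentTimeMillis() - start >= timeLimitMs) {
                break; // time limit reached: exit and let another process resume
            }
            String url = workQueue.poll();
            // ... fetch and parse url here ...
            stateStore.put(url, "done"); // checkpoint progress centrally
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        Queue<String> q = new ArrayDeque<>();
        q.add("http://example.org/a");
        q.add("http://example.org/b");
        System.out.println("processed=" + run(q, TIME_LIMIT_MS));
    }
}
```

Because progress lives in the shared store rather than in process memory, killing or restarting a worker loses at most the one item in flight.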
On Tue, Sep 21, 2010 at 11:53 AM, Tim Robertson wrote:
> > On the topic of your application, why are you using processes instead of
> > threads? With threads, you can get your memory overhead down to 10's of
> > kilobytes as opposed to 10's of megabytes.
> I am just prototyping scaling out to many processes, potentially
> across multiple machines. Our live crawler runs in a single JVM, but
> some of these crawls take 4-6 weeks, and such long-running processes
> block others, so I was looking at alternatives. Our live crawler also
> uses DOM-based XML parsing, so it hits memory limits - SAX would
> address this. We also want to be able to deploy patches to the
> crawlers without interrupting those long-running jobs if possible.
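To illustrate the SAX point above: unlike DOM, a SAX parser streams events and never materializes the whole document, so memory stays roughly constant regardless of response size. A small sketch using the standard JAXP API (the `record` element name is made up for the example, not taken from the crawler):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: count <record> elements with SAX instead of loading a DOM tree.
public class SaxCountExample {
    public static int countRecords(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("record".equals(qName)) count[0]++; // react to events only
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<records><record/><record/><record/></records>";
        System.out.println("records=" + countRecords(xml));
    }
}
```

For a crawl response of any size, only the current event and whatever the handler chooses to keep are in memory at once.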