I really need overlapping crawl cycles to take advantage of a non-standard 
hardware platform I am required to use.
I will rephrase my question: if I set generate.update.crawldb=true, will I be 
able to call the generator more than once without explicitly calling updatedb 
in between those calls? I already know I can generate more than one segment in 
each call to the generator. I want rolling waves of multi-segment operations. 
It seems like it should work if generate.update.crawldb means what I think it 
means.
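For context, here is how that property would be set in conf/nutch-site.xml. As I understand it (worth verifying against the description in nutch-default.xml for your Nutch version), when it is true the generator writes the generate timestamp back into the CrawlDB for the selected URLs, so back-to-back generate runs without an intervening updatedb should produce disjoint fetchlists:

```xml
<!-- conf/nutch-site.xml overrides nutch-default.xml. Sketch, not a tested
     config: with this set to true, generate records which URLs it already
     selected, so successive generate calls without an updatedb in between
     should not hand out the same URLs twice. -->
<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
</property>
```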

----
Hi,

That pseudocode is hard to read, but in any case: don't use 
generate.update.crawldb=true unless you really do have overlapping crawl 
cycles, which is a bad idea in the Nutch world anyway.

What do you need to reduce the overall latency of? If you need a short crawl 
cycle but generate/update are too slow due to the size of the CrawlDB and your 
fetch speed, then add more hardware; Nutch cannot do low-latency crawls of 
large numbers of URLs in a short time span.

If you just need to crawl more URLs in an hour, a few hours, or a day, then 
generate a lot of segments in one go, fetch them all, and update them all with 
a single updatedb. This is a better fit for Nutch with regard to limited 
hardware.
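Concretely, that batch pattern looks roughly like this with the Nutch 1.x command line (a sketch, not tested; the crawl/crawldb and crawl/segments paths and the -topN and -maxNumSegments values are placeholders):

```shell
# Generate several segments in one go, fetch and parse each,
# then fold everything back into the CrawlDB with ONE updatedb.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 5

for seg in crawl/segments/*; do
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
done

# A single updatedb over the whole segments directory.
bin/nutch updatedb crawl/crawldb -dir crawl/segments
```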

Regards,
Markus

prevSegs = generate(segsPerWave)
in a "background" process (on other machines):
    fetch and parse prevSegs
repeat indefinitely:
    curSegs = generate(segsPerWave)
    in a "background" process (on other machines):
        fetch and parse curSegs
    wait for prevSegs to be fetched and parsed
    update, linkdb, and merge prevSegs
    prevSegs = curSegs
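For what it's worth, the rolling-wave idea above could be sketched in shell with background jobs standing in for the "other machines" (a sketch only, not tested; paths, -topN, the wave size of 3, and the generate_wave helper are all my own placeholders, and it assumes generate.update.crawldb=true so successive generates yield disjoint fetchlists):

```shell
# Hypothetical helper: run generate and print the paths of the new segments
# by diffing the segments directory before and after.
generate_wave() {
  before=$(ls crawl/segments)
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 3 >&2
  comm -13 <(echo "$before") <(ls crawl/segments) | sed 's|^|crawl/segments/|'
}

# Hypothetical helper: fetch and parse every segment passed in.
fetch_and_parse() {
  for seg in "$@"; do
    bin/nutch fetch "$seg"
    bin/nutch parse "$seg"
  done
}

prev_segs=$(generate_wave)
fetch_and_parse $prev_segs &      # first wave runs "in the background"
prev_pid=$!

while true; do
  cur_segs=$(generate_wave)       # overlaps with the previous wave's fetch
  fetch_and_parse $cur_segs &
  cur_pid=$!
  wait "$prev_pid"                # wait for prevSegs to be fetched and parsed
  bin/nutch updatedb crawl/crawldb $prev_segs
  bin/nutch invertlinks crawl/linkdb $prev_segs
  prev_segs=$cur_segs
  prev_pid=$cur_pid
done
```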
