In search of more effective parallelism, I have been experimenting with 
different schemes for organizing the Nutch jobs. I would like to know if the 
Generator can work in a way that supports what I'm trying to do.
Here is a pseudocode description of one approach. The variables curSegs and 
prevSegs refer to lists of segments; segsPerWave is typically 4 or more.

prevSegs = generate( segsPerWave )
in a "background" process (on other machines):
    fetch and parse prevSegs
repeat indefinitely:
    curSegs = generate( segsPerWave )
    in a "background" process (on other machines):
        fetch and parse curSegs
    wait for prevSegs to be fetched and parsed
    update, linkdb, and merge prevSegs
    prevSegs = curSegs
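
For concreteness, here is roughly how I would drive that cycle from a single 
controller script with the standard bin/nutch command-line tools. The paths 
(crawl/crawldb, crawl/segments, crawl/linkdb, crawl/merged), the topN value, 
and the local thread standing in for the "background" fetch/parse on other 
machines are only placeholders for illustration, not my actual setup.

#!/usr/bin/env python3
# Sketch of the pipelined cycle above; assumes generate.update.crawldb=true
# is already set (see below) so back-to-back generate calls do not overlap.
import os
import subprocess
import threading

NUTCH = "bin/nutch"
CRAWLDB = "crawl/crawldb"
SEGMENTS = "crawl/segments"
LINKDB = "crawl/linkdb"
SEGS_PER_WAVE = 4            # "typically 4 or more"
TOP_N = "50000"              # placeholder per-segment size

def nutch(*args):
    # Run one bin/nutch sub-command, failing loudly on error.
    subprocess.run([NUTCH, *args], check=True)

def generate_wave(n):
    # Call generate n times and return the n new segment paths.
    before = set(os.listdir(SEGMENTS)) if os.path.isdir(SEGMENTS) else set()
    for _ in range(n):
        nutch("generate", CRAWLDB, SEGMENTS, "-topN", TOP_N)
    after = set(os.listdir(SEGMENTS))
    return sorted(os.path.join(SEGMENTS, d) for d in after - before)

def fetch_and_parse(segments):
    # In the real setup this part runs on the other machines.
    for seg in segments:
        nutch("fetch", seg)
        nutch("parse", seg)

def start_background(segments):
    t = threading.Thread(target=fetch_and_parse, args=(segments,))
    t.start()
    return t

prev_segs = generate_wave(SEGS_PER_WAVE)
prev_job = start_background(prev_segs)

while True:                                      # repeat indefinitely
    cur_segs = generate_wave(SEGS_PER_WAVE)
    cur_job = start_background(cur_segs)
    prev_job.join()                              # wait for prevSegs
    nutch("updatedb", CRAWLDB, *prev_segs)       # update
    nutch("invertlinks", LINKDB, *prev_segs)     # linkdb
    nutch("mergesegs", "crawl/merged", *prev_segs)  # merge
    prev_segs, prev_job = cur_segs, cur_job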
As I understand it, this will not work correctly unless I set 
generate.update.crawldb = true: without it, subsequent calls to generate would 
produce duplicated (or partially duplicated) segments.
If I do set generate.update.crawldb = true, should it work right?  What, 
exactly, does generate.update.crawldb = true do? I assume it changes something 
in the crawldb, but I don't know what.
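
If it matters, I would set the property the usual way, via an override in 
conf/nutch-site.xml:

<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
</property>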
