Hi,

That pseudocode is hard to read, but in any case, don't set 
generate.update.crawldb=true unless you have overlapping crawl cycles, which are 
a bad idea in the Nutch world anyway.

What do you need to reduce the overall latency of? If you need a short crawl 
cycle but struggle with generate/update because of the size of the CrawlDB and 
the fetch speed, then add more hardware. Nutch cannot do low-latency crawls of 
large numbers of URLs in a short time span.

If you just need to crawl more URLs in an hour, a few hours, or a day, then just 
generate a lot of segments in one go, fetch them all, and update them all in one 
updatedb. This is a better fit for Nutch when hardware is limited.
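
A rough sketch of that batch cycle, assuming a Nutch 1.x install with the 
bin/nutch launcher, segments on a local filesystem (not HDFS), and the 
-topN / -maxNumSegments options of the generate command. The paths, the helper 
names and the default numbers are made up for illustration; check the options 
against your Nutch version before relying on them.

```python
import subprocess
from pathlib import Path

NUTCH = "bin/nutch"  # assumed location of the Nutch 1.x launcher script


def generate_cmd(crawldb, segments_dir, max_segments, top_n):
    # One generate call that may emit up to max_segments segments.
    return [NUTCH, "generate", crawldb, segments_dir,
            "-topN", str(top_n), "-maxNumSegments", str(max_segments)]


def per_segment_cmds(segment):
    # Fetch, then parse, a single segment.
    return [[NUTCH, "fetch", segment], [NUTCH, "parse", segment]]


def updatedb_cmd(crawldb, segments):
    # A single updatedb covering every segment from this cycle.
    return [NUTCH, "updatedb", crawldb, *segments]


def list_segments(segments_dir):
    # Names of the segment directories that currently exist (local FS only).
    root = Path(segments_dir)
    return {p.name for p in root.iterdir()} if root.is_dir() else set()


def run_cycle(crawldb, segments_dir, max_segments=4, top_n=50000):
    # Generate several segments in one go, fetch and parse them all,
    # then fold them into the CrawlDB with one updatedb.
    before = list_segments(segments_dir)
    subprocess.run(generate_cmd(crawldb, segments_dir, max_segments, top_n),
                   check=True)
    new = sorted(f"{segments_dir}/{name}"
                 for name in list_segments(segments_dir) - before)
    for segment in new:
        for cmd in per_segment_cmds(segment):
            subprocess.run(cmd, check=True)
    if new:
        subprocess.run(updatedb_cmd(crawldb, new), check=True)
    return new
```

Because updatedb runs once per cycle and the next generate only starts after 
it, the cycles never overlap, so generate.update.crawldb can stay at false.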

Regards,
Markus
 
-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Tuesday 23rd May 2017 2:08
> To: User <[email protected]>
> Subject: generating and updating segments
> 
> In search of more effective parallelism, I have been experimenting with 
> different schemes for organizing the nutch jobs. I would like to know if the 
> Generator can work in a way that supports what I'm trying to do.
> Here is a pseudocode description of one approach. I use variables named 
> curSegs and prevSegs to refer to lists of segments. SegsPerWave is typically 
> 4 or more.
> 
> prevSegs = generate( segsPerWave )
> in a "background" process (on other machines):
>     fetch and parse prevSegs
> repeat indefinitely
>     curSegs = generate( segsPerWave )
>     in a "background" process (on other machines):
>         fetch and parse curSegs
>     wait for prevSegs to be fetched and parsed
>     update, linkdb, and merge prevSegs
>     prevSegs = curSegs
> As I understand it, this will not work right if I do not set 
> generate.update.crawldb = true. In my subsequent calls to generate, it would 
> generate duplicated (or partially duplicated) segments.
> If I do set generate.update.crawldb = true, should it work right?  What, 
> exactly, does generate.update.crawldb = true do? I assume it changes 
> something in the crawldb, but I don't know what.
> 
> 
