Hi,

That pseudocode is hard to follow, but in any case: don't use generate.update.crawldb=true unless you have overlapping crawl cycles, which is a bad idea in the Nutch world anyway.
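For reference, generate.update.crawldb is a normal Nutch property, set in conf/nutch-site.xml. A minimal fragment might look like this (the false value matches what I believe is the stock nutch-default.xml default; treat it as an assumption and check your own distribution):

```xml
<!-- conf/nutch-site.xml fragment (illustrative) -->
<property>
  <name>generate.update.crawldb</name>
  <value>false</value>
  <description>
    When true, the Generator writes generation marks back into the
    CrawlDB so that later generate runs skip already-generated URLs.
  </description>
</property>
```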
What do you need? Reduce the overall latency of what? If you need a short crawl cycle but have a hard time with generate/update due to the size of the CrawlDB and fetch speed, then add hardware; Nutch cannot do low-latency crawls of large numbers of URLs in a short time span. If you just need to crawl more URLs in an hour, a few hours, or a day, then generate a lot of segments in one go, fetch them all, and update them all in one updatedb. That is a better fit for Nutch with regard to limited hardware.

Regards,
Markus

-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Tuesday 23rd May 2017 2:08
> To: User <[email protected]>
> Subject: generating and updating segments
>
> In search of more effective parallelism, I have been experimenting with
> different schemes for organizing the Nutch jobs. I would like to know if the
> Generator can work in a way that supports what I'm trying to do.
>
> Here is a pseudocode description of one approach. I use variables named
> curSegs and prevSegs to refer to lists of segments. segsPerWave is typically
> 4 or more.
>
>     prevSegs = generate( segsPerWave )
>     in a "background" process (on other machines): fetch and parse prevSegs
>     repeat indefinitely:
>         curSegs = generate( segsPerWave )
>         in a "background" process (on other machines): fetch and parse curSegs
>         wait for prevSegs to be fetched and parsed
>         update, linkdb, and merge prevSegs
>         prevSegs = curSegs
>
> As I understand it, this will not work right if I do not set
> generate.update.crawldb = true. In my subsequent calls to generate, it would
> generate duplicated (or partially duplicated) segments.
>
> If I do set generate.update.crawldb = true, should it work right? What,
> exactly, does generate.update.crawldb = true do? I assume it changes
> something in the crawldb, but I don't know what.
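The batch workflow Markus recommends (many segments in one generate pass, fetch and parse them all, then a single updatedb) can be sketched with the Nutch 1.x command-line tools. This is shown in dry-run form — NUTCH echoes the commands instead of executing them — because the paths, -topN value, and segment names are illustrative assumptions, not values from the thread:

```shell
# Dry-run sketch of the single-cycle batch workflow.
# NUTCH echoes each command; set NUTCH=bin/nutch to run for real
# against an actual Nutch 1.x installation.
NUTCH="echo bin/nutch"
CRAWLDB="crawl/crawldb"
SEGMENTS="crawl/segments"

# 1. Generate several segments in one generate pass
#    (-maxNumSegments is the Nutch 1.x generator option for this).
$NUTCH generate "$CRAWLDB" "$SEGMENTS" -maxNumSegments 4 -topN 100000

# 2. Fetch and parse every segment produced
#    (segment directory names here are hypothetical).
for seg in "$SEGMENTS/20170523120000" "$SEGMENTS/20170523120001"; do
  $NUTCH fetch "$seg"
  $NUTCH parse "$seg"
done

# 3. Fold all fetched segments back into the CrawlDB in one updatedb,
#    so no generate.update.crawldb bookkeeping is needed.
$NUTCH updatedb "$CRAWLDB" -dir "$SEGMENTS"
```

Because everything is folded back in a single updatedb at the end of the wave, later generate calls see an up-to-date CrawlDB without overlapping cycles.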

