I really need overlapping crawl cycles to take advantage of a non-standard hardware platform I am required to use. Let me rephrase my question: if I set generate.update.crawldb=true, will I be able to call the generator more than once without explicitly calling the CrawlDB update in between those calls? I already know I can generate more than one segment in each call to the generator. I want rolling waves of multi-segment operations. It seems like it should work, if generate.update.crawldb means what I think it means.
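If I understand the property's semantics correctly, setting it to true makes generate run an extra job that marks the selected URLs in the CrawlDB, so later generate calls skip them even without an intervening updatedb. A minimal nutch-site.xml fragment (a sketch, assuming this reading of the property is right):

```xml
<!-- nutch-site.xml: allow overlapping generate calls without an
     intervening updatedb. With this set to true, generate updates
     the CrawlDB (an extra job per generate) so that subsequent
     generate runs produce different fetchlists. -->
<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
</property>
```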
----
Hi,

That pseudocode is hard to read/understand, but in any case: don't use generate.update.crawldb=true unless you really have overlapping crawl cycles, which is a bad idea in the Nutch world anyway. What do you need to reduce the overall latency of? If you need a short crawl cycle but have a hard time with generate/update due to the size of the CrawlDB and your fetch speed, then add hardware; Nutch cannot do low-latency crawls of large numbers of URLs in a short time span. If you just need to crawl more URLs in an hour, a few hours, or a day, then simply generate a lot of segments in one go, fetch them all, and update them all in one updatedb. That is a better fit for Nutch with regard to limited hardware.

Regards,
Markus

> prevSegs = generate( segsPerWave )
> in a "background" process (on other machines):
>     fetch and parse prevSegs
> repeat indefinitely:
>     curSegs = generate( segsPerWave )
>     in a "background" process (on other machines):
>         fetch and parse curSegs
>     wait for prevSegs to be fetched and parsed
>     update, linkdb, and merge prevSegs
>     prevSegs = curSegs
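The rolling-wave schedule quoted above can be sketched as a small pipeline. The generate, fetch_and_parse, and updatedb functions below are hypothetical stand-ins for the corresponding bin/nutch steps (they just track segment names), and the sketch assumes generate.update.crawldb=true so that overlapping generate calls select disjoint URLs:

```python
from concurrent.futures import ThreadPoolExecutor

completed = []  # segments that have been through the updatedb step


def generate(wave, segs_per_wave):
    """Stand-in for `bin/nutch generate ... -maxNumSegments N`."""
    return [f"seg-{wave}-{i}" for i in range(segs_per_wave)]


def fetch_and_parse(segs):
    """Stand-in for fetching and parsing each segment on the
    background machines; returns the segments when done."""
    return segs


def updatedb(segs):
    """Stand-in for updatedb / invertlinks / merge on a finished wave."""
    completed.extend(segs)


def rolling_waves(n_waves, segs_per_wave):
    # One background worker plays the role of the "other machines".
    with ThreadPoolExecutor(max_workers=1) as bg:
        prev_segs = generate(0, segs_per_wave)
        prev_done = bg.submit(fetch_and_parse, prev_segs)
        for wave in range(1, n_waves):
            # Overlap: generate the next wave while the previous
            # one is still being fetched and parsed.
            cur_segs = generate(wave, segs_per_wave)
            cur_done = bg.submit(fetch_and_parse, cur_segs)
            # Wait for the previous wave, then fold it into the DBs.
            updatedb(prev_done.result())
            prev_segs, prev_done = cur_segs, cur_done
        updatedb(prev_done.result())  # drain the final wave


rolling_waves(3, 2)
```

Note that updatedb is only ever called on a fully fetched wave, so the CrawlDB sees each wave exactly once, in order, while fetching of the next wave proceeds concurrently.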

