I'm using this command to generate a segment:

bin/nutch generate -D mapred.child.java.opts=-Xmx6000m -D mapred.map.tasks.speculative.execution=false -D mapreduce.map.speculative=false -D mapred.reduce.tasks.speculative.execution=false -D mapreduce.reduce.speculative=false -D mapred.map.output.compress=true -Dgenerate.max.count=20000 -D mapred.reduce.tasks=100 crawldb segments -noFilter -noNorm -numFetchers 19
I'm seeing that the increase in fetched URLs after updatedb runs is much smaller than the number of successfully fetched documents in the segment. I'm wondering whether some of the URLs that were downloaded early in the life of the crawldb are being downloaded again, which would explain the lower delta. I'm going to try to debug this, but I thought I'd ask a few questions first:

* What's the easiest way to verify that the URLs in the segment have never been fetched before?
* If that's not the case, does someone know the appropriate command to fetch only unfetched URLs?
* I'm using generate.max.count in the hope that it will give the best throughput for each of our crawl cycles, i.e. maxing out thread usage. Does that sound sensible?

Cheers,
Harry
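P.S. For reference, here's the cross-check I'm planning to try for the first question: dump the segment's URLs and the crawldb's already-fetched URLs, then intersect the two lists. The nutch commands in the comments are the real ones; the file names and the stand-in data below them are just placeholders so the sketch runs end to end.

```shell
#!/bin/sh
# Sketch of the overlap check (file names are hypothetical).
# The real inputs would come from something like:
#   bin/nutch readseg -dump segments/<segment> seg_dump -nocontent -noparse -noparsedata -noparsetext
#   bin/nutch readdb crawldb -dump crawldb_dump -format csv
# then extracting one URL per line (fetched-only for the crawldb side)
# into seg_urls.txt and fetched_urls.txt.

# Stand-in data so the sketch is self-contained:
printf 'http://a.example/\nhttp://b.example/\nhttp://c.example/\n' > seg_urls.txt
printf 'http://b.example/\nhttp://d.example/\n' > fetched_urls.txt

# comm requires sorted input:
sort -u seg_urls.txt > seg_sorted.txt
sort -u fetched_urls.txt > fetched_sorted.txt

# URLs in the new segment that were already fetched before
# (-12 suppresses lines unique to either file, keeping only the common ones):
comm -12 seg_sorted.txt fetched_sorted.txt > refetched.txt

wc -l < refetched.txt
cat refetched.txt
```

If refetched.txt is non-empty, the generator is handing back URLs the crawldb already marked as fetched, which would account for the small delta.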

