I'm using this to generate a segment:

bin/nutch generate \
  -D mapred.child.java.opts=-Xmx6000m \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapreduce.map.speculative=false \
  -D mapred.reduce.tasks.speculative.execution=false \
  -D mapreduce.reduce.speculative=false \
  -D mapred.map.output.compress=true \
  -Dgenerate.max.count=20000 \
  -D mapred.reduce.tasks=100 \
  crawldb segments -noFilter -noNorm -numFetchers 19


I'm seeing that the increase in fetched URLs after updatedb runs is much
smaller than the number of successfully fetched documents in the segment.
I'm wondering if some of the URLs that were downloaded early in the life
of the crawldb are being downloaded again, which would explain the smaller
delta.

I'm going to try to debug this, but I thought I'd ask a few questions first:

 * what's the easiest way to verify that the URLs in the segment have
never been fetched before?
 * if that's not the case, does anyone know the appropriate command to
fetch only unfetched URLs?
 * I'm using generate.max.count in the hope that it gives the best
throughput for each of our crawl cycles, i.e. maxing out thread usage;
does that sound sensible?
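For the first question, the check I was planning to try is to compare the
segment's generated URLs against the crawldb's already-fetched URLs. This is
just a sketch using the standard readdb/readseg tools (the segment name and
output directories below are placeholders, and I believe -status filtering
on readdb is only available in more recent Nutch releases):

```shell
# Overall status breakdown of the crawldb (db_unfetched, db_fetched, ...)
bin/nutch readdb crawldb -stats

# Dump just the URL list for one segment (segment name is a placeholder);
# the -no* flags skip the content/parse data we don't need here
bin/nutch readseg -dump segments/20240101000000 seg_dump \
  -nocontent -noparse -noparsedata -noparsetext

# Dump only entries already marked db_fetched from the crawldb
bin/nutch readdb crawldb -dump fetched_dump -status db_fetched

# Any overlap between the two URL lists would mean the segment contains
# URLs that were fetched before
```

If the overlap is non-empty, that would confirm the re-fetching theory.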

Cheers
Harry
