Fantastic, thanks Markus

On Wed, Jul 20, 2016 at 5:30 PM Markus Jelsma <[email protected]> wrote:
> Hi Harry,
>
> The generator has Jexl support; check [1] for the available fields.
> Metadata is exposed as-is.
>
> It's very simple:
>
> # bin/nutch generate -expr "status == db_unfetched"
>
> Cheers
>
> [1]
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
>
>
> -----Original message-----
> > From: Harry Waye <[email protected]>
> > Sent: Wednesday 20th July 2016 15:40
> > To: [email protected]
> > Subject: Generate segment of only unfetched urls
> >
> > I'm using this to generate a segment:
> >
> > bin/nutch generate -D mapred.child.java.opts=-Xmx6000m \
> >   -D mapred.map.tasks.speculative.execution=false \
> >   -D mapreduce.map.speculative=false \
> >   -D mapred.reduce.tasks.speculative.execution=false \
> >   -D mapreduce.reduce.speculative=false \
> >   -D mapred.map.output.compress=true \
> >   -D generate.max.count=20000 \
> >   -D mapred.reduce.tasks=100 \
> >   crawldb segments -noFilter -noNorm -numFetchers 19
> >
> > I'm seeing that the increase in fetched urls after updatedb runs is
> > much smaller than the number of successfully fetched documents in the
> > segment. I'm wondering whether some urls that were downloaded early in
> > the life of the crawldb are being downloaded again, hence the lower
> > delta.
> >
> > I'm going to try to debug this, but I thought I'd ask a few questions
> > first:
> >
> > * What's the easiest way to verify that the urls in the segment have
> >   never been fetched?
> > * If some have, what would be the appropriate command to fetch only
> >   unfetched urls?
> > * I'm using generate.max.count in the hope that it will give the best
> >   throughput for each of our crawl cycles, i.e. maxing out thread
> >   usage. Does that sound sensible?
> >
> > Cheers
> > Harry
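
For reference, Markus's -expr tip slots straight into the original
invocation; a minimal sketch, assuming a Nutch version whose generate
command accepts -expr as in his example (the remaining -D options from
the quoted command are omitted for brevity):

  # bin/nutch generate -D mapred.reduce.tasks=100 \
      -D generate.max.count=20000 \
      -expr "status == db_unfetched" \
      crawldb segments -noFilter -noNorm -numFetchers 19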
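
To verify that a generated segment contains only never-fetched urls, one
approach is to dump both sides and compare; a sketch, where the segment
name and output paths are placeholders and the -status dump filter
assumes a reasonably recent 1.x crawldb reader:

  # bin/nutch readseg -dump segments/20160720170000 seg_dump \
      -nocontent -nofetch -noparse -noparsedata -noparsetext
  # bin/nutch readdb crawldb -dump fetched_dump -status db_fetched

The first command dumps only the crawl_generate part of the segment; the
second dumps every url the crawldb already considers fetched. Both
outputs are plain text, so sorting the URL:: lines from the segment dump
and comparing them (e.g. with comm) against the urls in the crawldb dump
shows any overlap; an empty intersection means nothing in the segment
has been fetched before.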
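
The fetched-count delta itself can be watched directly with the crawldb
statistics, run before and after updatedb:

  # bin/nutch readdb crawldb -stats

This prints per-status counts (db_unfetched, db_fetched, db_gone, ...);
the change in the db_fetched line across the two runs is the real
increase, which can then be compared against the fetcher's success count
for the segment.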

