I may have missed a common courtesy by not providing the Nutch version: I'm using 1.11. It looks like generate doesn't support Jexl in this version. I'm going to have a look to see whether it's easily back-portable, or whether a later 1.x release has support.
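
In the meantime, here's roughly how I plan to sanity-check what generate is picking up (a sketch from memory, so the readseg flags and the dump file layout may need checking against your version; <segment> is a placeholder for the newest segment directory):

# per-status counts in the crawldb (db_unfetched, db_fetched, etc.)
bin/nutch readdb crawldb -stats

# dump only the crawl_generate part of a segment, then count how many
# entries were still db_unfetched when the segment was generated
bin/nutch readseg -dump segments/<segment> seg_dump \
    -nocontent -nofetch -noparse -noparsedata -noparsetext
grep -c 'db_unfetched' seg_dump/dump

If that grep count is well below the total number of entries in the segment, the generator is re-selecting already-fetched urls.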
Cheers

On Thu, Jul 21, 2016 at 11:19 AM Harry Waye <[email protected]> wrote:

> Fantastic, thanks Markus
>
> On Wed, Jul 20, 2016 at 5:30 PM Markus Jelsma <[email protected]>
> wrote:
>
>> Hi Harry,
>>
>> The generator has Jexl support, check [1] for fields. Metadata is as-is.
>>
>> It's very simple:
>> # bin/nutch generate -expr "status == db_unfetched"
>>
>> Cheers
>>
>> [1]
>> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
>>
>> -----Original message-----
>> > From: Harry Waye <[email protected]>
>> > Sent: Wednesday 20th July 2016 15:40
>> > To: [email protected]
>> > Subject: Generate segment of only unfetched urls
>> >
>> > I'm using this to generate a segment:
>> >
>> > bin/nutch generate -D mapred.child.java.opts=-Xmx6000m -D
>> > mapred.map.tasks.speculative.execution=false -D
>> > mapreduce.map.speculative=false -D
>> > mapred.reduce.tasks.speculative.execution=false -D
>> > mapreduce.reduce.speculative=false -D mapred.map.output.compress=true
>> > -Dgenerate.max.count=20000 -D mapred.reduce.tasks=100 crawldb segments
>> > -noFilter -noNorm -numFetchers 19
>> >
>> > I'm seeing that the change in fetched urls after updatedb runs is much
>> > smaller than the number of successfully fetched documents for the
>> > segment. I'm wondering if some of the urls that were downloaded at the
>> > beginning of the life of the crawldb are being downloaded again, hence
>> > the delta being lower.
>> >
>> > I'm going to try to debug, but just thought I'd ask a few questions
>> > first:
>> >
>> > * what's the easiest way to verify that the urls in the segment are
>> > urls that have never been fetched?
>> > * if that's not the case, does someone know what would be the
>> > appropriate command to use to only fetch unfetched urls?
>> > * I'm using generate.max.count in the hope that it will give the best
>> > throughput for each of our crawl cycles, i.e. maxing out thread usage;
>> > does that sound sensible?
>> >
>> > Cheers
>> > Harry

