Fantastic, thanks Markus

On Wed, Jul 20, 2016 at 5:30 PM Markus Jelsma <[email protected]>
wrote:

> Hi Harry,
>
> The generator has Jexl support; check [1] for the available fields. Metadata keys are exposed as-is.
>
> It's very simple:
> # bin/nutch generate -expr "status == db_unfetched"
>
> Cheers
>
> [1]
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
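
For reference, since -expr takes a Jexl expression over the fields in [1],
conditions can be combined. A minimal sketch, assuming the field names from
[1] (e.g. status, retries) are exposed in this Nutch version:

  # e.g. only unfetched urls that still have retries left
  bin/nutch generate crawldb segments -expr "status == db_unfetched && retries < 3"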
>
> -----Original message-----
> > From: Harry Waye <[email protected]>
> > Sent: Wednesday 20th July 2016 15:40
> > To: [email protected]
> > Subject: Generate segment of only unfetched urls
> >
> > I'm using this to generate a segment:
> >
> > bin/nutch generate -D mapred.child.java.opts=-Xmx6000m \
> >   -D mapred.map.tasks.speculative.execution=false \
> >   -D mapreduce.map.speculative=false \
> >   -D mapred.reduce.tasks.speculative.execution=false \
> >   -D mapreduce.reduce.speculative=false \
> >   -D mapred.map.output.compress=true \
> >   -Dgenerate.max.count=20000 \
> >   -D mapred.reduce.tasks=100 \
> >   crawldb segments -noFilter -noNorm -numFetchers 19
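
Following up on Markus's reply above, combining that command with -expr would
look roughly like the sketch below; this assumes -expr can simply be appended
to the other generate options, with the remaining -D flags from above going in
the same place:

  bin/nutch generate -Dgenerate.max.count=20000 -D mapred.reduce.tasks=100 \
    crawldb segments -noFilter -noNorm -numFetchers 19 \
    -expr "status == db_unfetched"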
> >
> > I'm seeing that the increase in fetched urls after updatedb runs is much
> > smaller than the number of successfully fetched documents for the segment.
> > I'm wondering if some of the urls that were fetched early in the life of
> > the crawldb are being fetched again, which would explain the smaller delta.
> >
> > I'm going to try to debug this, but I thought I'd ask a few questions first:
> >
> >  * what's the easiest way to verify that the urls in the segment have
> > never been fetched?
> >  * if that's not the case, does anyone know the appropriate command for
> > fetching only unfetched urls?
> >  * I'm using generate.max.count in the hope that it gives the best
> > throughput for each of our crawl cycles, i.e. maxing out thread usage;
> > does that sound sensible?
> >
> > Cheers
> > Harry
>
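
To check the first question above in the meantime, one option is to dump the
segment's generate list and spot-check those urls against the crawldb. A rough
sketch, assuming Nutch 1.x readseg/readdb options; SEGMENT stands for the
segment directory name, and the dump file layout may differ between versions:

  # dump only the crawl_generate part of the segment
  bin/nutch readseg -dump segments/SEGMENT seg_dump \
    -nocontent -nofetch -noparse -noparsedata -noparsetext

  # pull the urls out of the dump (URL:: is the 1.x dump record prefix)
  grep '^URL::' seg_dump/dump | awk '{print $2}' | sort -u > seg_urls.txt

  # spot-check a sample against the crawldb; anything reported as
  # db_fetched here was generated despite having been fetched before
  for u in $(head -20 seg_urls.txt); do
    bin/nutch readdb crawldb -url "$u" | grep -m1 Status
  done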
