I may have missed a common courtesy by not providing the Nutch version: I'm using 1.11. It looks like generate doesn't support Jexl in this version. I'm going to have a look to see whether it's easily back-portable, or whether a later 1.x release has support.
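
In the meantime, here's roughly how I plan to sanity-check what generate is picking up (a sketch from memory, so the readseg flags and the dump file layout may need checking against your version; <segment> is a placeholder for the newest segment directory):

# per-status counts in the crawldb (db_unfetched, db_fetched, etc.)
bin/nutch readdb crawldb -stats

# dump only the crawl_generate part of a segment, then count how many
# entries were still db_unfetched when the segment was generated
bin/nutch readseg -dump segments/<segment> seg_dump \
    -nocontent -nofetch -noparse -noparsedata -noparsetext
grep -c 'db_unfetched' seg_dump/dump

If that grep count is well below the total number of entries in the segment, the generator is re-selecting already-fetched urls.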
Cheers

On Thu, Jul 21, 2016 at 11:19 AM Harry Waye <[email protected]> wrote:

> Fantastic, thanks Markus
>
> On Wed, Jul 20, 2016 at 5:30 PM Markus Jelsma <[email protected]>
> wrote:
>
>> Hi Harry,
>>
>> The generator has Jexl support, check [1] for fields. Metadata is as-is.
>>
>> It's very simple:
>> # bin/nutch generate -expr "status == db_unfetched"
>>
>> Cheers
>>
>> [1]
>> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
>>
>> -----Original message-----
>> > From: Harry Waye <[email protected]>
>> > Sent: Wednesday 20th July 2016 15:40
>> > To: [email protected]
>> > Subject: Generate segment of only unfetched urls
>> >
>> > I'm using this to generate a segment:
>> >
>> > bin/nutch generate -D mapred.child.java.opts=-Xmx6000m -D
>> > mapred.map.tasks.speculative.execution=false -D
>> > mapreduce.map.speculative=false -D
>> > mapred.reduce.tasks.speculative.execution=false -D
>> > mapreduce.reduce.speculative=false -D mapred.map.output.compress=true
>> > -Dgenerate.max.count=20000 -D mapred.reduce.tasks=100 crawldb segments
>> > -noFilter -noNorm -numFetchers 19
>> >
>> > I'm seeing that the change in fetched urls after updatedb runs is much
>> > smaller than the number of successfully fetched documents for the
>> > segment. I'm wondering if some of the urls that were downloaded at the
>> > beginning of the life of the crawldb are being downloaded again, hence
>> > the delta being lower.
>> >
>> > I'm going to try to debug, but just thought I'd ask a few questions
>> > first:
>> >
>> > * what's the easiest way to verify that the urls in the segment are
>> > urls that have never been fetched?
>> > * if that's not the case, does someone know what would be the
>> > appropriate command to use to only fetch unfetched urls?
>> > * I'm using generate.max.count in the hope that it will give the best
>> > throughput for each of our crawl cycles, i.e. maxing out thread usage;
>> > does that sound sensible?
>> >
>> > Cheers
>> > Harry

