Hi Joseph,

On Tue, May 3, 2016 at 7:53 AM, <[email protected]> wrote:

>
> From: Joseph Obernberger <[email protected]>
> To: [email protected]
> Cc:
> Date: Tue, 3 May 2016 09:04:09 -0400
> Subject: Nutch 2.3.1 - Fetch Phase - Only 2 Reducers
> Hello - I'm working with nutch 2.3.1 with HBase for the webpage table.  I
> have all the phases (inject, generate, fetch, parse, and updatedb) working
> fine.  Nutch is a crawling beast!
>

Glad to hear.


>
> On our cluster, the generate phase uses around 60 mappers and 128 reducers,
> but the fetch phase always uses just 2 reducers.  In a recent test, the
> fetch phase used 60 mappers and 2 reducers.
>

In Nutch 2.x the actual fetching is executed within the FetcherReducer [0].
More specifically, it happens in FetcherReducer.FetcherThread [1], which
picks items from the FetchItemQueues and fetches the pages.
The crux of the issue is politeness. It comes down to the URL partitioning
scheme [2] you use, which partitions URLs by host, domain name or IP
depending on the value of the parameter 'partition.url.mode' ('byHost',
'byDomain' or 'byIP'). URLs that share a partition key all go to the same
reducer, so politeness towards that host can be enforced in one place.
The issue was described a few weeks ago by Karanjeet and Sebastian:
http://www.mail-archive.com/user%40nutch.apache.org/msg14496.html


[0]
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
[1]
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java#L430
[2]
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/URLPartitioner.java
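
As a rough illustration of why the partition key limits how many reducers
get work, here is a minimal Python sketch. It is not Nutch's URLPartitioner
code: the function name partition_by_host, the CRC32 hash and the example
URLs are all assumptions made for the illustration. It only mimics the
'byHost' idea of hashing the host and taking the result modulo the number
of reducers.

```python
import zlib
from urllib.parse import urlsplit


def partition_by_host(url: str, num_reducers: int) -> int:
    """Hypothetical stand-in for a 'byHost' partitioner: hash the host,
    then map it onto one of num_reducers partitions."""
    host = urlsplit(url).hostname or ""
    # CRC32 is used only because it is deterministic across runs;
    # it is an assumption, not the hash Nutch uses.
    return zlib.crc32(host.encode("utf-8")) % num_reducers


# Made-up URLs: several pages, but only two distinct hosts.
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://example.org/x",
    "http://example.org/y",
]

# Set of partitions that actually receive at least one URL.
busy_reducers = {partition_by_host(u, 128) for u in urls}
print(len(busy_reducers))  # at most 2, however many reducers are configured
```

Every URL from the same host gets the same partition number, so the number
of reducers that receive work can never exceed the number of distinct keys.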


Please note the quite significant differences between the following
counters:



>                 Map input records=22514605
>                 Map output records=21459377
>
>

That is a Generator Map-phase delta of 1,055,228, and


>                 Reduce input records=21459377
>                 Reduce output records=7506045
>
>

That is a Fetch Map-phase delta of 13,953,332, and


>                 Reduce input records=7503906
>                 Reduce output records=609920
>
>

That is a Fetch Reducer-phase delta of 6,893,986.
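
For anyone who wants to re-check the three deltas, a small Python sketch
follows. The input/output numbers are copied from the counters quoted
above; the variable names and the phase labels used as dictionary keys are
just for illustration.

```python
# (input records, output records) for each pair of counters quoted above.
counters = {
    "Generator Map-phase": (22514605, 21459377),
    "Fetch Map-phase": (21459377, 7506045),
    "Fetch Reducer-phase": (7503906, 609920),
}

# Records that went in but did not come out, per pair.
deltas = {name: rec_in - rec_out for name, (rec_in, rec_out) in counters.items()}

for name, delta in deltas.items():
    rec_in = counters[name][0]
    # Also show the drop as a share of the input records.
    print(f"{name}: delta={delta:,} ({delta / rec_in:.1%} of input)")
```

The deltas come out as 1,055,228, 13,953,332 and 6,893,986 respectively.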


>         FetcherStatus
>                 ACCESS_DENIED=131
>                 EXCEPTION=36676
>                 GONE=295
>                 HitByTimeLimit-QueueFeeder=6883654
>                 HitByTimeLimit-Queues=10291
>                 MOVED=37141
>                 NOTFOUND=10490
>                 NOTMODIFIED=732
>                 SUCCESS=485083
>                 TEMP_MOVED=14589
>
>

Very interesting FetcherStatus stats. HitByTimeLimit-QueueFeeder=6883654 is
of particular interest.
If I were you I would create many more, smaller batches of URLs to fetch, as
opposed to these large batches which are simply... not being fetched. You
only fetched around 485K URLs (SUCCESS=485083) going by the above stats.
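
To put those numbers in proportion, here is a small Python sketch that sums
the FetcherStatus counters quoted above and works out the share taken by
each. The values are copied from the job output; the variable names are
only for illustration, and reading the result as "most of the batch hit the
time limit" is my interpretation rather than something the counters state
outright.

```python
# FetcherStatus counters copied from the job output quoted above.
fetcher_status = {
    "ACCESS_DENIED": 131,
    "EXCEPTION": 36676,
    "GONE": 295,
    "HitByTimeLimit-QueueFeeder": 6883654,
    "HitByTimeLimit-Queues": 10291,
    "MOVED": 37141,
    "NOTFOUND": 10490,
    "NOTMODIFIED": 732,
    "SUCCESS": 485083,
    "TEMP_MOVED": 14589,
}

total = sum(fetcher_status.values())

# The two counters that record URLs dropped because of the time limit.
hit_by_time_limit = (
    fetcher_status["HitByTimeLimit-QueueFeeder"]
    + fetcher_status["HitByTimeLimit-Queues"]
)

feeder_share = fetcher_status["HitByTimeLimit-QueueFeeder"] / total
success_share = fetcher_status["SUCCESS"] / total

print(f"total statuses:        {total:,}")              # 7,479,082
print(f"hit by time limit:     {hit_by_time_limit:,}")  # 6,893,945
print(f"QueueFeeder share:     {feeder_share:.1%}")
print(f"SUCCESS share:         {success_share:.1%}")
```

Roughly 92% of the recorded statuses are HitByTimeLimit-QueueFeeder and
only about 6.5% are SUCCESS, and the two HitByTimeLimit counters together
come within 41 of the Fetch Reducer-phase delta of 6,893,986.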


>
>
> Any idea on what I need to adjust to use more nodes for the reduce phase?


Hopefully the above has given you a decent amount to consider. Please let
us know if you have any more questions.
Thanks
Lewis
