Hi Joseph,

On Tue, May 3, 2016 at 7:53 AM, <[email protected]> wrote:
>
> From: Joseph Obernberger <[email protected]>
> To: [email protected]
> Cc:
> Date: Tue, 3 May 2016 09:04:09 -0400
> Subject: Nutch 2.3.1 - Fetch Phase - Only 2 Reducers
>
> Hello - I'm working with nutch 2.3.1 with HBase for the webpage table. I
> have all the phases (inject, generate, fetch, parse, and updatedb) working
> fine. Nutch is a crawling beast!
>

Glad to hear.

>
> On our cluster, the generate phase uses around 60 mappers and 128 reducers,
> but the fetch phase always uses just 2 reducers. In a recent test, the
> fetch phase used 60 mappers and 2 reducers.
>

In Nutch 2.X you will have noticed that the actual 'fetching' is executed
within the FetcherReducer [0]. More specifically, it happens in
FetcherReducer.FetcherThread [1], which picks items from the FetchItemQueues
and fetches the pages.

The crux of the issue here is politeness. It has to do with the URL
partitioning scheme [2] you use, which partitions URLs by host, domain name
or IP depending on the value of the parameter 'partition.url.mode'. That
parameter can be 'byHost', 'byDomain' or 'byIP'.
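To make the partitioning point concrete, here is a minimal Python sketch of host-based partitioning. It is not the URLPartitioner code from [2]: the function names are made up for illustration, and CRC32 is only a stand-in for whatever hash the real partitioner uses. It shows the general idea that every URL from one host lands on the same reducer, so the number of reducers doing any work cannot exceed the number of distinct hosts in the batch.

```python
from collections import Counter
from urllib.parse import urlparse
import zlib


def partition_by_host(url: str, num_reducers: int) -> int:
    """Return a reducer index for a URL, keyed on its host only.

    Hypothetical helper: CRC32 is an assumed stand-in hash, chosen because
    it is deterministic across runs.
    """
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_reducers


def reducer_load(urls, num_reducers: int) -> Counter:
    """Count how many URLs each reducer index would receive."""
    return Counter(partition_by_host(u, num_reducers) for u in urls)


# Made-up sample batch: many URLs, but only two distinct hosts.
SAMPLE_URLS = (
    [f"http://example.com/page/{i}" for i in range(500)]
    + [f"http://example.org/doc/{i}" for i in range(300)]
)
```

With the sample batch above, `reducer_load(SAMPLE_URLS, 128)` can have at most two non-empty entries, however many reducers are configured.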
The issue was described a few weeks ago by Karanjeet and Sebastian:
http://www.mail-archive.com/user%40nutch.apache.org/msg14496.html

[0] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
[1] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java#L430
[2] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/URLPartitioner.java

Please note that you have quite significant differences between the
following:

> Map input records=22514605
> Map output records=21459377
>

Above Generator Map-phase delta of 1,055,228, and

> Reduce input records=21459377
> Reduce output records=7506045
>

Above Fetch Map-phase delta of 13,953,332

> Reduce input records=7503906
> Reduce output records=609920
>

Above Fetch Reducer-phase delta of 6,893,986

> FetcherStatus
> ACCESS_DENIED=131
> EXCEPTION=36676
> GONE=295
> HitByTimeLimit-QueueFeeder=6883654
> HitByTimeLimit-Queues=10291
> MOVED=37141
> NOTFOUND=10490
> NOTMODIFIED=732
> SUCCESS=485083
> TEMP_MOVED=14589
>

Very interesting FetcherStatus stats. HitByTimeLimit-QueueFeeder=6883654 is
of particular interest. If I were you, I would create many more, smaller
batches of URLs to fetch, as opposed to these large batches which are
simply... not being fetched. You only fetched around 485K URLs going by the
above stats.

>
> Any idea on what I need to adjust to use more nodes for the reduce phase?

Hopefully the above has given you a decent amount to consider. Please let us
know if you have some more questions.

Thanks
Lewis
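P.S. If it helps to sanity-check the numbers, here is a small Python sketch that recomputes the three deltas and the share of each FetcherStatus counter. The counter values are copied from your job output quoted above; the variable and function names are made up, and treating the FetcherStatus counters as mutually exclusive (so that they can be summed into one total) is an assumption on my part.

```python
# Counter values copied from the quoted job output.
fetcher_status = {
    "ACCESS_DENIED": 131,
    "EXCEPTION": 36676,
    "GONE": 295,
    "HitByTimeLimit-QueueFeeder": 6883654,
    "HitByTimeLimit-Queues": 10291,
    "MOVED": 37141,
    "NOTFOUND": 10490,
    "NOTMODIFIED": 732,
    "SUCCESS": 485083,
    "TEMP_MOVED": 14589,
}

# (input records, output records) pairs, in the order they appear above.
record_pairs = [
    (22514605, 21459377),
    (21459377, 7506045),
    (7503906, 609920),
]

# Difference between input and output records for each pair.
deltas = [records_in - records_out for records_in, records_out in record_pairs]

# Assumption: the counters do not overlap, so their sum is a usable total.
total_status = sum(fetcher_status.values())


def share(counter_name: str) -> float:
    """Fraction of all counted items that fall under one counter."""
    return fetcher_status[counter_name] / total_status


if __name__ == "__main__":
    print("deltas:", deltas)
    for name in sorted(fetcher_status, key=fetcher_status.get, reverse=True):
        print(f"{name:28s} {share(name):7.2%}")
```

Under that assumption, HitByTimeLimit-QueueFeeder accounts for roughly 92% of the counted items, which is why smaller batches look like the first thing to try.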

