Hi Julien/Lewis (and all),

Alright, I found out how to achieve an even distribution of URLs for fetch across the nodes of the cluster. It does not depend on the generate.max.count setting suggested by Julien (though that is important too). The key is to set an appropriate value for the '*numLists*' parameter of the generate() method. This ensures that the fetch list is broken into that many files (blocks) at the end of the generate step, which yields the same number of map tasks for those URLs during fetch, and those tasks get evenly distributed across the cluster.
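To make the effect of 'numLists' concrete, here is a minimal Python sketch. It is not Nutch code: the function names are hypothetical, and hashing the host name with crc32 is only an assumed stand-in for however Nutch groups URLs for politeness. It shows that splitting the fetch list into N parts gives N map tasks, and roughly how many rounds of tasks a cluster with a given number of map slots would need.

```python
import zlib
from urllib.parse import urlparse


def assign_fetch_list(url, num_lists):
    """Pick a fetch list for a URL by hashing its host name.

    Assumption: URLs of one host stay together (politeness). crc32 is used
    only because it is deterministic; it is not what Nutch itself uses.
    """
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_lists


def partition_urls(urls, num_lists):
    """Split URLs into num_lists fetch lists, one per future map task."""
    lists = [[] for _ in range(num_lists)]
    for url in urls:
        lists[assign_fetch_list(url, num_lists)].append(url)
    return lists


def map_waves(num_lists, nodes, slots_per_node):
    """Rounds of map tasks needed if every slot runs one task at a time."""
    slots = nodes * slots_per_node
    return -(-num_lists // slots)  # ceiling division
```

With the numbers from this thread, map_waves(16, 5, 2) gives 2: the 16 fetch lists run as two rounds over the 10 available map slots. With a single fetch list there is only one map task, so only one node fetches, no matter how many URLs the list holds.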
So, with a 'numLists' setting of 16 during generate() and a Hadoop cluster of 5 nodes, I was able to achieve a fetch() throughput of about 400K URLs in 2 hours for the map phase. This was because the fetch was spread across the 5 nodes, using 16 mappers in total with 2 map slots per node.

*However*, the reduce phase got stuck on *one* node and sat there for more than 1 hour! I suspect this is because there is only one reducer aggregating the output from 16 map tasks. But the code doesn't specify a reducer class for this job. So:

*a)* Does that mean that the IdentityReducer is being used? The reducer seems to be CPU-bound, since the CPU has been at 100% for most of the time on the single node that's performing the reduce. What is it doing? The FetchOutputFormat class, to which the data is finally output, just funnels each key+value into an appropriate folder (parse_text, parse_data, content), so it should not by itself account for the 100% utilization. Is it the shuffle+sort on 1.4 million map output records causing this? Or something else?

*b)* Is there any setting whereby I can increase the number of reducers used during the fetch() job?

Thanks in advance!

Cheers,
Safdar

On Tue, Jun 12, 2012 at 11:57 PM, Ali Safdar Kureishy <[email protected]> wrote:

> Thanks Lewis and Julien, for your inputs.
>
> I will look into this a bit further and reply with some numbers, as seen
> with a fetchlist of 50K urls. It is late night here in my timezone, so will
> look at this first thing in the morning.
>
> Thanks,
> Safdar
>
> On Tue, Jun 12, 2012 at 4:56 PM, Julien Nioche <[email protected]> wrote:
>
>> Guys,
>>
>> This has to do with the way URLs are grouped for politeness and not so much
>> with the number of blocks in the input. Limiting the URLs by # host names,
>> domains or IP is a way of ensuring an even distribution across the cluster.
>> See nutch-default.xml for details
>>
>> J.
>>
>> On 12 June 2012 13:06, Lewis John Mcgibbney <[email protected]> wrote:
>>
>> > Hi Ali,
>> >
>> > Please check out this post [0] I found. I need to agree with the
>> > response in the thread and state that I don't know how Hadoop ensures
>> > even distribution of workload but we can assume that by explicitly
>> > specifying the mapper and reducers we can ensure that all 'will' be
>> > used across your cluster.
>> >
>> > hth
>> >
>> > [0] http://stackoverflow.com/questions/5748585/hadoop-workload
>> >
>> > On Tue, Jun 12, 2012 at 10:15 AM, Ali Safdar Kureishy
>> > <[email protected]> wrote:
>> > > Hi,
>> > >
>> > > I have a hadoop cluster of 5 nodes. I want to ensure that the fetch phase
>> > > is distributed evenly across all the nodes (to maximize bandwidth etc).
>> > > However, if I generate a fetchlist of size 1000 urls, does this get
>> > > distributed equally across the nodes? Doesn't the fact that the size of the
>> > > fetchlist is < 64MB (block size) result in it being fetched from a single
>> > > node? If not, how is this distributed across the mappers evenly? Is there a
>> > > rough formula I can use, to determine how many URLs I should fetch for an
>> > > equal distribution across my nodes, for a given block size setting?
>> > >
>> > > Thanks,
>> > > Safdar
>> >
>> > --
>> > Lewis
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble

