Hi, You should use set general number of reducers property in your cluster. Whenever a job (such a the Fetcher) does not explicitely set the number to a custom amount (using job.setNumReduceTasks(...)) it will just use the cluster-wide configured amount. Set mapred.reduce.tasks to the preferred amount.
Indeed the IndentityReducer is used for the fetcher. Now that I think about it, the Fetcher could be redesigned to exclude the reduce phase and directly output to the fs. (Not applicable to Nutchgora). This would mean the output is not sorted though, and I'm not sure what the consequence of that would be. Correct me if I'm wrong. Ferdy. On Wed, Jun 13, 2012 at 2:52 PM, Ali Safdar Kureishy < [email protected]> wrote: > Hi Julien/Lewis (and All), > > Alright, I found out how to achieve an even distribution of URLs for fetch > across the nodes of the cluster. It does not have to do with the > generate.max.count setting suggested by Julien (though that is important > too). The key is to set an appropriate value for the '*numLists*' parameter > to the generate() method. This ensures that the fetch-list is broken into > many files (blocks) at the end of the generate, which would yield the same > number of map tasks for those URLs during fetch, and those get evently > distributed across the cluster. > > So, with a 'numLists' setting of 16 during generate(), and a hadoop cluster > of about 5 nodes, I was able to achieve a fetch() throughput of about 400K > URLs in 2 hours for the map phase. This was because the fetch was spread > across the 5 nodes, using 16 mappers total - and 2 map slots per node. > * > * > *However*, the reduce phase got stuck on *one* node and sat there for more > than 1 hour! I feel this might be because there is only one reducer > aggregating the output from 16 map tasks. But the code doesn't specify a > reducer class for this job. So: > *a)* does that mean that the IdentityReducer is being used? The reducer > seems to be CPU-bound, since the CPU has been at 100% for most of the time > on the single node that's performing the reduce. What is it doing? The > FetchOutputFormat class, to which data is finally output just funnels each > key+value into an appropriate folder (parse_text, parse_data, content), so > should not by itself account for the 100% utilization. Is it the > shuffle+sort on 1.4 million map output records causing this? Or something > else? > *b)* Is there any setting whereby I can increase the number of reducers > used during the fetch() job? > > Thanks in advance! > > Cheers, > Safdar > > > On Tue, Jun 12, 2012 at 11:57 PM, Ali Safdar Kureishy < > [email protected]> wrote: > > > Thanks Lewis and Julien, for your inputs. > > > > I will look into this a bit further and reply with some numbers, as seen > > with a fetchlist of 50K urls. It is late night here in my timezone, so > will > > look at this first thing in the morning. > > > > Thanks, > > Safdar > > > > > > > > On Tue, Jun 12, 2012 at 4:56 PM, Julien Nioche < > > [email protected]> wrote: > > > >> Guys, > >> > >> This has to do with the way URLs are grouped for politeness and not so > >> much > >> with the number of blocks in the input. Limiting the URLs by # host > >> names, > >> domains or IP is a way of ensuring an even distribution across the > >> cluster. > >> See nutch-default.xml for details > >> > >> J. > >> > >> > >> On 12 June 2012 13:06, Lewis John Mcgibbney <[email protected] > >> >wrote: > >> > >> > Hi Ali, > >> > >> > >> > Please check out this post [0] I found. I need to agree with the > >> > response in the thread ans state that I don't know how Hadoop ensures > >> > even distribution of workload but we can assume that by explicitly > >> > specifying the mapper and reducers we can ensure that all 'will' be > >> > used across your cluster. > >> > > >> > hth > >> > > >> > [0] http://stackoverflow.com/questions/5748585/hadoop-workload > >> > > >> > On Tue, Jun 12, 2012 at 10:15 AM, Ali Safdar Kureishy > >> > <[email protected]> wrote: > >> > > Hi, > >> > > > >> > > I have a hadoop cluster of 5 nodes. I want to ensure that the fetch > >> phase > >> > > is distributed evenly across all the nodes (to maximize bandwidth > >> etc). > >> > > However, if I generate a fetchlist of size 1000 urls, does this get > >> > > distributed equally across the nodes? Doesn't the fact that the size > >> of > >> > the > >> > > fetchlist is < 64MB (block size) result in it being fetched from a > >> single > >> > > node? If not, how is this distributed across the mappers evenly? Is > >> > there a > >> > > rough formulate I can use, to determine how many URLs I should fetch > >> for > >> > an > >> > > equal distribution across my nodes, for a given block size setting? > >> > > > >> > > Thanks, > >> > > Safdar > >> > > >> > > >> > > >> > -- > >> > Lewis > >> > > >> > >> > >> > >> -- > >> * > >> *Open Source Solutions for Text Engineering > >> > >> http://digitalpebble.blogspot.com/ > >> http://www.digitalpebble.com > >> http://twitter.com/digitalpebble > >> > > > > >

