Guys,

This has to do with the way URLs are grouped for politeness, and not so much with the number of blocks in the input. Limiting the URLs by number of host names, domains or IPs is a way of ensuring an even distribution across the cluster. See nutch-default.xml for details.
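To make the grouping point concrete, here is a minimal sketch (not Nutch's actual code) of what partitioning by host name implies for a small fetchlist. The helper names `partition_for` and `distribution`, the `example.com` URLs, and the use of CRC32 as the hash are all assumptions made for illustration; Nutch uses its own partitioner and hash.

```python
from collections import Counter
from urllib.parse import urlparse
import zlib


def partition_for(url: str, num_partitions: int) -> int:
    """Choose a partition from the URL's host name only (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    # CRC32 is only a stable stand-in hash for this sketch.
    return zlib.crc32(host.encode("utf-8")) % num_partitions


def distribution(urls, num_partitions):
    """Return the number of URLs assigned to each partition."""
    counts = Counter(partition_for(u, num_partitions) for u in urls)
    return [counts.get(p, 0) for p in range(num_partitions)]


# 1000 URLs that all belong to just two hosts.
few_hosts = [f"http://site{i % 2}.example.com/page{i}" for i in range(1000)]
# 1000 URLs, each on a different host.
many_hosts = [f"http://site{i}.example.com/" for i in range(1000)]

print("two hosts:  ", distribution(few_hosts, 5))
print("1000 hosts: ", distribution(many_hosts, 5))
```

Under these assumptions, the two-host list can occupy at most two of the five partitions no matter how many URLs it holds, while the many-host list spreads out. In other words, the host mix of the fetchlist matters more for balance than its size in bytes.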
J.

On 12 June 2012 13:06, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Ali,
>
> Please check out this post [0] I found. I need to agree with the
> response in the thread and state that I don't know how Hadoop ensures
> even distribution of workload, but we can assume that by explicitly
> specifying the mapper and reducers we can ensure that all 'will' be
> used across your cluster.
>
> hth
>
> [0] http://stackoverflow.com/questions/5748585/hadoop-workload
>
> On Tue, Jun 12, 2012 at 10:15 AM, Ali Safdar Kureishy
> <[email protected]> wrote:
> > Hi,
> >
> > I have a Hadoop cluster of 5 nodes. I want to ensure that the fetch phase
> > is distributed evenly across all the nodes (to maximize bandwidth etc).
> > However, if I generate a fetchlist of size 1000 urls, does this get
> > distributed equally across the nodes? Doesn't the fact that the size of the
> > fetchlist is < 64MB (block size) result in it being fetched from a single
> > node? If not, how is this distributed across the mappers evenly? Is there a
> > rough formula I can use, to determine how many URLs I should fetch for an
> > equal distribution across my nodes, for a given block size setting?
> >
> > Thanks,
> > Safdar
>
> --
> Lewis

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

