Guys,

This has to do with the way URLs are grouped for politeness rather than with
the number of blocks in the input. Limiting the number of URLs per host name,
domain or IP is a way of ensuring an even distribution across the cluster.
See nutch-default.xml for details.
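
As a rough illustration of why the grouping matters more than the input size, here is a minimal Python sketch. It is not Nutch's actual partitioner: the function name `partition_by_host`, the CRC32 hash and the example URLs are all assumptions made for the example. It only shows that when URLs are keyed on host name, a fetchlist drawn from a single host lands in a single partition no matter how many URLs it contains, while a fetchlist spread over many hosts is spread over the partitions.

```python
import zlib
from collections import Counter
from urllib.parse import urlparse


def partition_by_host(urls, num_partitions):
    """Count how many URLs fall into each partition when keyed on host name.

    Hypothetical helper for illustration only; a real crawler may key on
    domain or IP instead and will use its own hash function.
    """
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Stable hash: every URL of the same host maps to the same partition,
        # which is what keeps per-host politeness enforceable in one task.
        part = zlib.crc32(host.encode("utf-8")) % num_partitions
        counts[part] += 1
    return counts


# Hypothetical fetchlists of 1000 URLs each, for a 5-node cluster.
single_host = [f"http://example.com/page{i}" for i in range(1000)]
many_hosts = [f"http://site{i}.example.org/" for i in range(1000)]

print("one host:  ", dict(partition_by_host(single_host, 5)))
print("many hosts:", dict(partition_by_host(many_hosts, 5)))
```

With the single-host list, all 1000 URLs end up under one partition key; with the many-host list they are divided among the partitions. Capping the number of URLs taken per host (or domain, or IP) pushes a fetchlist towards the second case.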

J.


On 12 June 2012 13:06, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Ali,


> Please check out this post [0] I found. I have to agree with the
> response in the thread and state that I don't know how Hadoop ensures
> even distribution of workload, but we can assume that by explicitly
> specifying the mappers and reducers we can ensure that all 'will' be
> used across your cluster.
>
> hth
>
> [0] http://stackoverflow.com/questions/5748585/hadoop-workload
>
> On Tue, Jun 12, 2012 at 10:15 AM, Ali Safdar Kureishy
> <[email protected]> wrote:
> > Hi,
> >
> > I have a hadoop cluster of 5 nodes. I want to ensure that the fetch phase
> > is distributed evenly across all the nodes (to maximize bandwidth etc).
> > However, if I generate a fetchlist of 1000 URLs, does this get
> > distributed equally across the nodes? Doesn't the fact that the size of the
> > fetchlist is < 64MB (block size) result in it being fetched from a single
> > node? If not, how is this distributed across the mappers evenly? Is there a
> > rough formula I can use to determine how many URLs I should fetch for an
> > equal distribution across my nodes, for a given block size setting?
> >
> > Thanks,
> > Safdar
>
>
>
> --
> Lewis
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
