Thanks, Lewis and Julien, for your input.

I will look into this a bit further and reply with some numbers from a
fetchlist of 50K URLs. It is late at night in my timezone, so I will
look at this first thing in the morning.

Thanks,
Safdar



On Tue, Jun 12, 2012 at 4:56 PM, Julien Nioche <
[email protected]> wrote:

> Guys,
>
> This has to do with the way URLs are grouped for politeness and not so much
> with the number of blocks in the input. Limiting the number of URLs per
> host name, domain or IP is a way of ensuring an even distribution across
> the cluster. See nutch-default.xml for details.
>
> J.
>
>
> On 12 June 2012 13:06, Lewis John Mcgibbney <[email protected]> wrote:
>
> > Hi Ali,
> >
> > Please check out this post [0] I found. I have to agree with the
> > response in the thread and state that I don't know how Hadoop ensures
> > even distribution of workload, but we can assume that by explicitly
> > specifying the number of mappers and reducers we can ensure that all
> > 'will' be used across your cluster.
> >
> > hth
> >
> > [0] http://stackoverflow.com/questions/5748585/hadoop-workload
> >
> > On Tue, Jun 12, 2012 at 10:15 AM, Ali Safdar Kureishy
> > <[email protected]> wrote:
> > > Hi,
> > >
> > > I have a Hadoop cluster of 5 nodes. I want to ensure that the fetch
> > > phase is distributed evenly across all the nodes (to maximize bandwidth
> > > etc.). However, if I generate a fetchlist of 1000 URLs, does this get
> > > distributed equally across the nodes? Doesn't the fact that the size of
> > > the fetchlist is < 64MB (block size) result in it being fetched from a
> > > single node? If not, how is this distributed across the mappers evenly?
> > > Is there a rough formula I can use to determine how many URLs I should
> > > fetch for an equal distribution across my nodes, for a given block size
> > > setting?
> > >
> > > Thanks,
> > > Safdar
> >
> >
> >
> > --
> > Lewis
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
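
To make the point about politeness grouping in the thread concrete, here is a minimal sketch of how assigning URLs to fetch tasks by host name affects distribution. It is not Nutch's actual code: the function names (`partition_for`, `distribute`), the CRC32 hash, and the example URLs are all assumptions chosen for illustration. It only shows the general idea that every URL of a host lands in the same partition, so a fetchlist dominated by one host cannot spread across nodes regardless of block size.

```python
from collections import Counter
from urllib.parse import urlparse
import zlib


def partition_for(url, num_partitions):
    """Pick a partition from the URL's host name (hypothetical helper).

    All URLs sharing a host map to the same partition, which is what
    keeps per-host politeness enforceable inside a single fetch task.
    """
    host = urlparse(url).hostname or ""
    # CRC32 is used here only because it is stable across runs;
    # it is an assumption, not the hash Nutch uses.
    return zlib.crc32(host.encode("utf-8")) % num_partitions


def distribute(urls, num_partitions):
    """Count how many URLs each partition would receive."""
    counts = Counter({p: 0 for p in range(num_partitions)})
    for url in urls:
        counts[partition_for(url, num_partitions)] += 1
    return counts


if __name__ == "__main__":
    nodes = 5
    # 1000 URLs from one host: all end up in a single partition.
    single_host = [f"http://example.com/page{i}" for i in range(1000)]
    # 1000 URLs from 1000 hosts: spread over the partitions.
    many_hosts = [f"http://site{i}.example.org/" for i in range(1000)]
    print("single host:", dict(distribute(single_host, nodes)))
    print("many hosts: ", dict(distribute(many_hosts, nodes)))
```

Under these assumptions, the evenness of the fetch depends on how many distinct hosts (or domains, or IPs) the fetchlist contains and how many URLs each contributes, not on the fetchlist's size relative to the HDFS block size.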
