Hi Julien/Lewis (and All),

Alright, I found out how to achieve an even distribution of URLs for fetch
across the nodes of the cluster. It does not have to do with the
generate.max.count setting suggested by Julien (though that is important
too). The key is to set an appropriate value for the '*numLists*' parameter
to the generate() method. This ensures that the fetch-list is broken into
many files (blocks) at the end of the generate, which would yield the same
number of map tasks for those URLs during fetch, and those get evently
distributed across the cluster.

So, with a 'numLists' setting of 16 during generate(), and a hadoop cluster
of about 5 nodes, I was able to achieve a fetch() throughput of about 400K
URLs in 2 hours for the map phase. This was because the fetch was spread
across the 5 nodes, using 16 mappers total - and 2 map slots per node.
*
*
*However*, the reduce phase got stuck on *one* node and sat there for more
than 1 hour! I feel this might be because there is only one reducer
aggregating the output from 16 map tasks. But the code doesn't specify a
reducer class for this job. So:
*a)* does that mean that the IdentityReducer is being used? The reducer
seems to be CPU-bound, since the CPU has been at 100% for most of the time
on the single node that's performing the reduce. What is it doing? The
FetchOutputFormat class, to which data is finally output just funnels each
key+value into an appropriate folder (parse_text, parse_data, content), so
should not by itself account for the 100% utilization. Is it the
shuffle+sort on 1.4 million map output records causing this? Or something
else?
*b)* Is there any setting whereby I can increase the number of reducers
used during the fetch() job?

Thanks in advance!

Cheers,
Safdar


On Tue, Jun 12, 2012 at 11:57 PM, Ali Safdar Kureishy <
[email protected]> wrote:

> Thanks Lewis and Julien, for your inputs.
>
> I will look into this a bit further and reply with some numbers, as seen
> with a fetchlist of 50K urls. It is late night here in my timezone, so will
> look at this first thing in the morning.
>
> Thanks,
> Safdar
>
>
>
> On Tue, Jun 12, 2012 at 4:56 PM, Julien Nioche <
> [email protected]> wrote:
>
>> Guys,
>>
>> This has to do with the way URLs are grouped for politeness and not so
>> much
>> with the number of blocks in the input. Limiting the URLs by #  host
>> names,
>> domains or IP is a way of ensuring an even distribution across the
>> cluster.
>> See nutch-default.xml for details
>>
>> J.
>>
>>
>> On 12 June 2012 13:06, Lewis John Mcgibbney <[email protected]
>> >wrote:
>>
>> > Hi Ali,
>>
>>
>> > Please check out this post [0] I found. I need to agree with the
>> > response in the thread ans state that I don't know how Hadoop ensures
>> > even distribution of workload but we can assume that by explicitly
>> > specifying the mapper and reducers we can ensure that all 'will' be
>> > used across your cluster.
>> >
>> > hth
>> >
>> > [0] http://stackoverflow.com/questions/5748585/hadoop-workload
>> >
>> > On Tue, Jun 12, 2012 at 10:15 AM, Ali Safdar Kureishy
>> > <[email protected]> wrote:
>> > > Hi,
>> > >
>> > > I have a hadoop cluster of 5 nodes. I want to ensure that the fetch
>> phase
>> > > is distributed evenly across all the nodes (to maximize bandwidth
>> etc).
>> > > However, if I generate a fetchlist of size 1000 urls, does this get
>> > > distributed equally across the nodes? Doesn't the fact that the size
>> of
>> > the
>> > > fetchlist is < 64MB (block size) result in it being fetched from a
>> single
>> > > node? If not, how is this distributed across the mappers evenly? Is
>> > there a
>> > > rough formulate I can use, to determine how many URLs I should fetch
>> for
>> > an
>> > > equal distribution across my nodes, for a given block size setting?
>> > >
>> > > Thanks,
>> > > Safdar
>> >
>> >
>> >
>> > --
>> > Lewis
>> >
>>
>>
>>
>> --
>> *
>> *Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>

Reply via email to