Thanks Ferdy.

With regards to excluding the reduce phase and your concern about the
sorted-ness of fetch output, if there were more than 1 reducer, there
wouldn't be a total ordering on the fetcher output anyway ... so perhaps
that renders the concern about the sort order moot.

Also, please correct me if I am wrong, but it seems that the jobs that
consume the fetcher output (updatecrawldb, invertlinks, solrindex etc) do
not require the input data to be in sorted order, so a map-only fetch()
might be fine.

Thanks,
Safdar


On Wed, Jun 13, 2012 at 5:35 PM, Ferdy Galema <[email protected]>wrote:

> Hi,
>
> You should use set general number of reducers property in your cluster.
> Whenever a job (such a the Fetcher) does not explicitely set the number to
> a custom amount (using job.setNumReduceTasks(...)) it will just use the
> cluster-wide configured amount. Set mapred.reduce.tasks to the preferred
> amount.
>
> Indeed the IndentityReducer is used for the fetcher. Now that I think
> about it, the Fetcher could be redesigned to exclude the reduce phase and
> directly output to the fs. (Not applicable to Nutchgora). This would mean
> the output is not sorted though, and I'm not sure what the consequence of
> that would be. Correct me if I'm wrong.
>
> Ferdy.
>
> On Wed, Jun 13, 2012 at 2:52 PM, Ali Safdar Kureishy <
> [email protected]> wrote:
>
>> Hi Julien/Lewis (and All),
>>
>> Alright, I found out how to achieve an even distribution of URLs for fetch
>> across the nodes of the cluster. It does not have to do with the
>> generate.max.count setting suggested by Julien (though that is important
>> too). The key is to set an appropriate value for the '*numLists*'
>> parameter
>>
>> to the generate() method. This ensures that the fetch-list is broken into
>> many files (blocks) at the end of the generate, which would yield the same
>> number of map tasks for those URLs during fetch, and those get evently
>> distributed across the cluster.
>>
>> So, with a 'numLists' setting of 16 during generate(), and a hadoop
>> cluster
>> of about 5 nodes, I was able to achieve a fetch() throughput of about 400K
>> URLs in 2 hours for the map phase. This was because the fetch was spread
>> across the 5 nodes, using 16 mappers total - and 2 map slots per node.
>> *
>> *
>> *However*, the reduce phase got stuck on *one* node and sat there for more
>>
>> than 1 hour! I feel this might be because there is only one reducer
>> aggregating the output from 16 map tasks. But the code doesn't specify a
>> reducer class for this job. So:
>> *a)* does that mean that the IdentityReducer is being used? The reducer
>>
>> seems to be CPU-bound, since the CPU has been at 100% for most of the time
>> on the single node that's performing the reduce. What is it doing? The
>> FetchOutputFormat class, to which data is finally output just funnels each
>> key+value into an appropriate folder (parse_text, parse_data, content), so
>> should not by itself account for the 100% utilization. Is it the
>> shuffle+sort on 1.4 million map output records causing this? Or something
>> else?
>> *b)* Is there any setting whereby I can increase the number of reducers
>>
>> used during the fetch() job?
>>
>> Thanks in advance!
>>
>> Cheers,
>> Safdar
>>
>>
>> On Tue, Jun 12, 2012 at 11:57 PM, Ali Safdar Kureishy <
>> [email protected]> wrote:
>>
>> > Thanks Lewis and Julien, for your inputs.
>> >
>> > I will look into this a bit further and reply with some numbers, as seen
>> > with a fetchlist of 50K urls. It is late night here in my timezone, so
>> will
>> > look at this first thing in the morning.
>> >
>> > Thanks,
>> > Safdar
>> >
>> >
>> >
>> > On Tue, Jun 12, 2012 at 4:56 PM, Julien Nioche <
>> > [email protected]> wrote:
>> >
>> >> Guys,
>> >>
>> >> This has to do with the way URLs are grouped for politeness and not so
>> >> much
>> >> with the number of blocks in the input. Limiting the URLs by #  host
>> >> names,
>> >> domains or IP is a way of ensuring an even distribution across the
>> >> cluster.
>> >> See nutch-default.xml for details
>> >>
>> >> J.
>> >>
>> >>
>> >> On 12 June 2012 13:06, Lewis John Mcgibbney <[email protected]
>> >> >wrote:
>> >>
>> >> > Hi Ali,
>> >>
>> >>
>> >> > Please check out this post [0] I found. I need to agree with the
>> >> > response in the thread ans state that I don't know how Hadoop ensures
>> >> > even distribution of workload but we can assume that by explicitly
>> >> > specifying the mapper and reducers we can ensure that all 'will' be
>> >> > used across your cluster.
>> >> >
>> >> > hth
>> >> >
>> >> > [0] http://stackoverflow.com/questions/5748585/hadoop-workload
>> >> >
>> >> > On Tue, Jun 12, 2012 at 10:15 AM, Ali Safdar Kureishy
>> >> > <[email protected]> wrote:
>> >> > > Hi,
>> >> > >
>> >> > > I have a hadoop cluster of 5 nodes. I want to ensure that the fetch
>> >> phase
>> >> > > is distributed evenly across all the nodes (to maximize bandwidth
>> >> etc).
>> >> > > However, if I generate a fetchlist of size 1000 urls, does this get
>> >> > > distributed equally across the nodes? Doesn't the fact that the
>> size
>> >> of
>> >> > the
>> >> > > fetchlist is < 64MB (block size) result in it being fetched from a
>> >> single
>> >> > > node? If not, how is this distributed across the mappers evenly? Is
>> >> > there a
>> >> > > rough formulate I can use, to determine how many URLs I should
>> fetch
>> >> for
>> >> > an
>> >> > > equal distribution across my nodes, for a given block size setting?
>> >> > >
>> >> > > Thanks,
>> >> > > Safdar
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Lewis
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> *
>> >> *Open Source Solutions for Text Engineering
>> >>
>> >> http://digitalpebble.blogspot.com/
>> >> http://www.digitalpebble.com
>> >> http://twitter.com/digitalpebble
>> >>
>> >
>> >
>>
>
>

Reply via email to