Thanks Ferdy. With regard to excluding the reduce phase and your concern about the sortedness of the fetch output: if there were more than one reducer, there wouldn't be a total ordering on the fetcher output anyway, so perhaps that renders the concern about the sort order moot.
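
To illustrate the point about total ordering, here is a minimal sketch in plain Python, not Nutch or Hadoop code. It assumes each reducer sorts only the keys routed to it, and it uses a CRC32-based `partition` function as a hypothetical stand-in for Hadoop's hash partitioner. The URLs and helper names are made up for the example.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    # Hypothetical stand-in for a hash partitioner. CRC32 is used only
    # because it is deterministic across runs, unlike Python's hash().
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_identity_reduce(records, num_reducers):
    """Return one key-sorted output list per reducer (part-00000, part-00001, ...)."""
    parts = [[] for _ in range(num_reducers)]
    for key, value in records:
        parts[partition(key, num_reducers)].append((key, value))
    # Each reducer sees its own keys in sorted order; an identity reducer
    # writes them out unchanged.
    return [sorted(part) for part in parts]


records = [(f"http://example.org/page{i}", f"content{i}") for i in range(20)]

parts = run_identity_reduce(records, 4)
concatenated = [kv for part in parts for kv in part]

# Every part is sorted on its own ...
each_part_sorted = all(part == sorted(part) for part in parts)
# ... but reading the parts back to back gives no global-order guarantee.
globally_sorted = concatenated == sorted(concatenated)

print("each part sorted:", each_part_sorted)
print("concatenation sorted:", globally_sorted)
```

With a single reducer the one output part is the fully sorted list. With several reducers each part is sorted, but the parts read in sequence generally are not.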
Also, please correct me if I am wrong, but it seems that the jobs that consume the fetcher output (updatecrawldb, invertlinks, solrindex etc) do not require the input data to be in sorted order, so a map-only fetch() might be fine.

Thanks,
Safdar

On Wed, Jun 13, 2012 at 5:35 PM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> You should set the general number-of-reducers property in your cluster.
> Whenever a job (such as the Fetcher) does not explicitly set the number to
> a custom amount (using job.setNumReduceTasks(...)) it will just use the
> cluster-wide configured amount. Set mapred.reduce.tasks to the preferred
> amount.
>
> Indeed the IdentityReducer is used for the fetcher. Now that I think
> about it, the Fetcher could be redesigned to exclude the reduce phase and
> directly output to the fs. (Not applicable to Nutchgora). This would mean
> the output is not sorted though, and I'm not sure what the consequence of
> that would be. Correct me if I'm wrong.
>
> Ferdy.
>
> On Wed, Jun 13, 2012 at 2:52 PM, Ali Safdar Kureishy <
> [email protected]> wrote:
>
>> Hi Julien/Lewis (and All),
>>
>> Alright, I found out how to achieve an even distribution of URLs for fetch
>> across the nodes of the cluster. It does not have to do with the
>> generate.max.count setting suggested by Julien (though that is important
>> too). The key is to set an appropriate value for the '*numLists*' parameter
>> to the generate() method. This ensures that the fetch-list is broken into
>> many files (blocks) at the end of the generate, which would yield the same
>> number of map tasks for those URLs during fetch, and those get evenly
>> distributed across the cluster.
>>
>> So, with a 'numLists' setting of 16 during generate(), and a hadoop cluster
>> of about 5 nodes, I was able to achieve a fetch() throughput of about 400K
>> URLs in 2 hours for the map phase. This was because the fetch was spread
>> across the 5 nodes, using 16 mappers total - and 2 map slots per node.
>>
>> *However*, the reduce phase got stuck on *one* node and sat there for more
>> than 1 hour! I feel this might be because there is only one reducer
>> aggregating the output from 16 map tasks. But the code doesn't specify a
>> reducer class for this job. So:
>> *a)* does that mean that the IdentityReducer is being used? The reducer
>> seems to be CPU-bound, since the CPU has been at 100% for most of the time
>> on the single node that's performing the reduce. What is it doing? The
>> FetchOutputFormat class, to which data is finally output, just funnels each
>> key+value into an appropriate folder (parse_text, parse_data, content), so
>> should not by itself account for the 100% utilization. Is it the
>> shuffle+sort on 1.4 million map output records causing this? Or something
>> else?
>> *b)* Is there any setting whereby I can increase the number of reducers
>> used during the fetch() job?
>>
>> Thanks in advance!
>>
>> Cheers,
>> Safdar
>>
>>
>> On Tue, Jun 12, 2012 at 11:57 PM, Ali Safdar Kureishy <
>> [email protected]> wrote:
>>
>> > Thanks Lewis and Julien, for your inputs.
>> >
>> > I will look into this a bit further and reply with some numbers, as seen
>> > with a fetchlist of 50K urls. It is late night here in my timezone, so will
>> > look at this first thing in the morning.
>> >
>> > Thanks,
>> > Safdar
>> >
>> >
>> > On Tue, Jun 12, 2012 at 4:56 PM, Julien Nioche <
>> > [email protected]> wrote:
>> >
>> >> Guys,
>> >>
>> >> This has to do with the way URLs are grouped for politeness and not so much
>> >> with the number of blocks in the input. Limiting the URLs by # host names,
>> >> domains or IP is a way of ensuring an even distribution across the cluster.
>> >> See nutch-default.xml for details
>> >>
>> >> J.
>> >>
>> >>
>> >> On 12 June 2012 13:06, Lewis John Mcgibbney <[email protected]> wrote:
>> >>
>> >> > Hi Ali,
>> >> >
>> >> > Please check out this post [0] I found. I need to agree with the
>> >> > response in the thread and state that I don't know how Hadoop ensures
>> >> > even distribution of workload but we can assume that by explicitly
>> >> > specifying the mapper and reducers we can ensure that all 'will' be
>> >> > used across your cluster.
>> >> >
>> >> > hth
>> >> >
>> >> > [0] http://stackoverflow.com/questions/5748585/hadoop-workload
>> >> >
>> >> > On Tue, Jun 12, 2012 at 10:15 AM, Ali Safdar Kureishy
>> >> > <[email protected]> wrote:
>> >> > > Hi,
>> >> > >
>> >> > > I have a hadoop cluster of 5 nodes. I want to ensure that the fetch phase
>> >> > > is distributed evenly across all the nodes (to maximize bandwidth etc).
>> >> > > However, if I generate a fetchlist of size 1000 urls, does this get
>> >> > > distributed equally across the nodes? Doesn't the fact that the size of the
>> >> > > fetchlist is < 64MB (block size) result in it being fetched from a single
>> >> > > node? If not, how is this distributed across the mappers evenly? Is there a
>> >> > > rough formula I can use, to determine how many URLs I should fetch for an
>> >> > > equal distribution across my nodes, for a given block size setting?
>> >> > >
>> >> > > Thanks,
>> >> > > Safdar
>> >> >
>> >> >
>> >> > --
>> >> > Lewis
>> >> >
>> >>
>> >>
>> >> --
>> >> Open Source Solutions for Text Engineering
>> >>
>> >> http://digitalpebble.blogspot.com/
>> >> http://www.digitalpebble.com
>> >> http://twitter.com/digitalpebble
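
As a rough check on the numbers quoted in the thread (16 fetch lists, about 5 nodes, 2 map slots per node, about 400K URLs in 2 hours), the arithmetic can be sketched as below. The helper name `map_waves` is made up for illustration. The sketch assumes equally sized fetch lists and that every map slot is free for the job, neither of which is guaranteed on a real cluster.

```python
import math


def map_waves(num_map_tasks: int, num_nodes: int, slots_per_node: int) -> int:
    """Number of scheduling rounds needed to run all map tasks."""
    concurrent_slots = num_nodes * slots_per_node
    return math.ceil(num_map_tasks / concurrent_slots)


# Figures taken from the thread; "about 5 nodes" and "about 400K" are approximate.
num_lists = 16
nodes = 5
slots_per_node = 2
total_urls = 400_000
map_phase_seconds = 2 * 60 * 60

waves = map_waves(num_lists, nodes, slots_per_node)
urls_per_task = total_urls / num_lists
urls_per_second = total_urls / map_phase_seconds

print("map waves:", waves)
print("URLs per map task:", urls_per_task)
print("overall URLs per second:", round(urls_per_second, 1))
```

With 10 concurrent slots and 16 tasks, the second round uses only 6 of the 10 slots, so a numLists value that is a multiple of the total slot count may keep the cluster busier.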

