> Alright, I found out how to achieve an even distribution of URLs for fetch > across the nodes of the cluster. It does not have to do with the > generate.max.count setting suggested by Julien (though that is important > too). >
> The key is to set an appropriate value for the '*numLists*' parameter > to the generate() method. This ensures that the fetch-list is broken into > many files (blocks) at the end of the generate, which would yield the same > number of map tasks for those URLs during fetch, and those get evently > distributed across the cluster. > Ok, let's get a few things right. What you are referring to is called [-numFetchers] when using the command nutch generate. It splits the output into separate files which are then used as input by the Fetcher - it's not about blocks as such. Indeed this means that this will allow the slaves on the clusters to fetch from them - I forgot to mention it. However this does not guarantee an even distribution of the URLs in the fetchlists, hence my comment about generate.count.mode + generate.max.count > > So, with a 'numLists' setting of 16 during generate(), and a hadoop cluster > of about 5 nodes, I was able to achieve a fetch() throughput of about 400K > URLs in 2 hours for the map phase. This was because the fetch was spread > across the 5 nodes, using 16 mappers total - and 2 map slots per node. > no. this depends mostly on the distribution of URLS per host/domain/IP in your fetchlist. If you have mostly URLs from a single hostname they will be dealt with by a single mapper and most other inputs will be empty regardless of the number of machines in your cluster. This is how we enforce politeness in Nutch. The results you are getting indicate that there probably is good distribution of URLs in your fetch list + increading the number of threads in the fetcher would have probably led to a comparable result I'm making this point as this is an important one and I want to make sure that anyone reading this thread later does not get confused by your comments > * > * > *However*, the reduce phase got stuck on *one* node and sat there for more > than 1 hour! I feel this might be because there is only one reducer > aggregating the output from 16 map tasks. But the code doesn't specify a > reducer class for this job. So: > just set the number of reducers that you want to use with the standard Hadoop parameter "mapred.reduce.tasks" > *a)* does that mean that the IdentityReducer is being used? The reducer > seems to be CPU-bound, since the CPU has been at 100% for most of the time > on the single node that's performing the reduce. What is it doing? The > FetchOutputFormat class, to which data is finally output just funnels each > key+value into an appropriate folder (parse_text, parse_data, content), so > should not by itself account for the 100% utilization. Is it the > shuffle+sort on 1.4 million map output records causing this? Or something > else? > use jstack and / or the hadoop web app to monitor your crawl and see where time is spent > *b)* Is there any setting whereby I can increase the number of reducers > used during the fetch() job? > see above Julien > > Thanks in advance! > > Cheers, > Safdar > > > On Tue, Jun 12, 2012 at 11:57 PM, Ali Safdar Kureishy < > [email protected]> wrote: > > > Thanks Lewis and Julien, for your inputs. > > > > I will look into this a bit further and reply with some numbers, as seen > > with a fetchlist of 50K urls. It is late night here in my timezone, so > will > > look at this first thing in the morning. > > > > Thanks, > > Safdar > > > > > > > > On Tue, Jun 12, 2012 at 4:56 PM, Julien Nioche < > > [email protected]> wrote: > > > >> Guys, > >> > >> This has to do with the way URLs are grouped for politeness and not so > >> much > >> with the number of blocks in the input. Limiting the URLs by # host > >> names, > >> domains or IP is a way of ensuring an even distribution across the > >> cluster. > >> See nutch-default.xml for details > >> > >> J. > >> > >> > >> On 12 June 2012 13:06, Lewis John Mcgibbney <[email protected] > >> >wrote: > >> > >> > Hi Ali, > >> > >> > >> > Please check out this post [0] I found. I need to agree with the > >> > response in the thread ans state that I don't know how Hadoop ensures > >> > even distribution of workload but we can assume that by explicitly > >> > specifying the mapper and reducers we can ensure that all 'will' be > >> > used across your cluster. > >> > > >> > hth > >> > > >> > [0] http://stackoverflow.com/questions/5748585/hadoop-workload > >> > > >> > On Tue, Jun 12, 2012 at 10:15 AM, Ali Safdar Kureishy > >> > <[email protected]> wrote: > >> > > Hi, > >> > > > >> > > I have a hadoop cluster of 5 nodes. I want to ensure that the fetch > >> phase > >> > > is distributed evenly across all the nodes (to maximize bandwidth > >> etc). > >> > > However, if I generate a fetchlist of size 1000 urls, does this get > >> > > distributed equally across the nodes? Doesn't the fact that the size > >> of > >> > the > >> > > fetchlist is < 64MB (block size) result in it being fetched from a > >> single > >> > > node? If not, how is this distributed across the mappers evenly? Is > >> > there a > >> > > rough formulate I can use, to determine how many URLs I should fetch > >> for > >> > an > >> > > equal distribution across my nodes, for a given block size setting? > >> > > > >> > > Thanks, > >> > > Safdar > >> > > >> > > >> > > >> > -- > >> > Lewis > >> > > >> > >> > >> > >> -- > >> * > >> *Open Source Solutions for Text Engineering > >> > >> http://digitalpebble.blogspot.com/ > >> http://www.digitalpebble.com > >> http://twitter.com/digitalpebble > >> > > > > > -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble

