Hey Sourajit,

I have seen the same thing when running crawls on a Hadoop cluster. After some experiments, I came to the following conclusion: the number of mappers spawned in the fetch phase is governed by the number of part files created by the generator, not by the number of nodes in the cluster. That number is simply the number of reducers of the last job in the generate phase. The generate command takes a parameter named numFetchers that controls this reducer count.
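As a rough sketch of what I mean, the generate step could be invoked along these lines. The crawldb and segments paths and the value 4 are placeholders I made up for illustration, and I am assuming the Nutch 1.x command-line form where the parameter is passed as -numFetchers; check the usage output of bin/nutch generate on your version before relying on it.

```shell
# Hypothetical paths and count; adjust them to your own crawl layout.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
NUM_FETCHERS=4   # desired number of fetch lists, i.e. fetch map tasks

# Build and print the command instead of running it, so it can be reviewed first.
# Assumption: Nutch 1.x syntax, where generate accepts -numFetchers.
GENERATE_CMD="bin/nutch generate $CRAWLDB $SEGMENTS -numFetchers $NUM_FETCHERS"
echo "$GENERATE_CMD"
```

With a larger value the generator should write more part files, and the next fetch job should then start that many map tasks, provided the cluster has enough free map slots.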
Thanks,
Tejas Patil

On Mon, Jan 28, 2013 at 2:49 AM, Sourajit Basak <[email protected]> wrote:

> A higher number of per host threads, etc might not be useful if the
> bandwidth doesn't scale out. I have a different observation though.
>
> We run nutch on a hadoop cluster. Even as we added new machines to the
> cluster, the fetch phase only creates two tasks. (the original number of
> nodes when we started) Why is it so ? I have checked that the tasks do get
> spawned in the newly added nodes.
> We have this setting in hadoop mapred-site.xml
> <property>
> <name>mapred.tasktracker.map.tasks.maximum</name>
> <value>20</value>
> </property>
>
> We have planned to double the number of websites and see if it still
> doesn't spawn tasks on each node. I will keep this forum updated with our
> results. In the meantime, can anyone point out if we have missed any
> particular configuration ?
>
> Thanks,
> Sourajit
>
> On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
>
> > Hey Peter,
> >
> > I am guessing that you have just increased the global thread count. Have
> > you even increased "fetcher.threads.per.host" ? This will improve the crawl
> > rate as multiple threads can attack the same site. Don't make it too high or
> > else the system will get overloaded. The nutch wiki has an article [0]
> > about the potential reasons for slow crawls and some good suggestions.
> >
> > [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> >
> > Thanks,
> > Tejas Patil
> >
> > On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <[email protected]> wrote:
> >
> > > I tried increasing the numbers of threads to 50 but the speed is not
> > > affected
> > >
> > > I tried changing the partition.url.mode value to byDomain and
> > > fetcher.queue.mode to byDomain but still it does not help the speed.
> > > It seems to get urls from 2 domains now and the other domains are not
> > > getting crawled. Is this due to the url score? if so how do i crawl urls
> > > from all the domains?
> > >
> > > lewis john mcgibbney wrote
> > > > Increase number of threads when fetching
> > > > Also please see nutch-default.xml for partitioning of urls, if you know your
> > > > target domains you may wish to adapt the policy.
> > > > Lewis
> > > >
> > > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > > >> I want to increase the number of urls fetched at a time in nutch. I have
> > > >> around 10 websites to crawl. so how can i crawl all the sites at a time ?
> > > >> right now i am fetching 1 site with a fetch delay of 2 second but it is too
> > > >> slow. How to concurrently fetch from different domain?
> > > >>
> > > >> --
> > > >> View this message in context:
> > > >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > >
> > > > --
> > > > *Lewis*
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.

