A higher number of per-host threads, etc., might not help if the bandwidth doesn't scale out. I have a different observation, though.
We run Nutch on a Hadoop cluster. Even as we added new machines to the cluster, the fetch phase still creates only two tasks (the number of nodes we had when we started). Why is that? I have checked that tasks do get spawned on the newly added nodes.

We have this setting in the Hadoop mapred-site.xml:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>20</value>
  </property>

We plan to double the number of websites and see whether it still fails to spawn tasks on each node. I will keep this forum updated with our results.

In the meantime, can anyone point out whether we have missed any particular configuration?

Thanks,
Sourajit

On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:

> Hey Peter,
>
> I am guessing that you have just increased the global thread count. Have
> you even increased "fetcher.threads.per.host"? This will improve the crawl
> rate as multiple threads can attack the same site. Don't make it too high
> or else the system will get overloaded. The Nutch wiki has an article [0]
> about the potential reasons for slow crawls and some good suggestions.
>
> [0] : https://wiki.apache.org/nutch/OptimizingCrawls
>
> Thanks,
> Tejas Patil
>
> On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <[email protected]> wrote:
>
> > I tried increasing the number of threads to 50 but the speed is not
> > affected.
> >
> > I tried changing the partition.url.mode value to byDomain and
> > fetcher.queue.mode to byDomain but still it does not help the speed.
> > It seems to get URLs from 2 domains now and the other domains are not
> > getting crawled. Is this due to the URL score? If so, how do I crawl
> > URLs from all the domains?
> >
> > lewis john mcgibbney wrote
> > > Increase number of threads when fetching.
> > > Also please see nutch-default.xml for partitioning of URLs; if you
> > > know your target domains you may wish to adapt the policy.
> > > Lewis
> > >
> > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > >> I want to increase the number of URLs fetched at a time in Nutch. I
> > >> have around 10 websites to crawl, so how can I crawl all the sites
> > >> at a time? Right now I am fetching 1 site with a fetch delay of
> > >> 2 seconds but it is too slow. How to concurrently fetch from
> > >> different domains?
> > >>
> > >> --
> > >> View this message in context:
> > >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> > > --
> > > *Lewis*
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
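
For reference, the fetcher settings mentioned in this thread would look roughly like this in conf/nutch-site.xml. This is only a sketch under a few assumptions: fetcher.threads.fetch is assumed to be the "global thread count" property referred to above, 50 is the thread count Peter tried, and the per-host value of 2 is a placeholder rather than a recommendation.

```xml
<!-- conf/nutch-site.xml: illustrative sketch, values are assumptions -->
<configuration>
  <!-- Total number of fetcher threads (assumed to be the "global thread count") -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
  <!-- Threads allowed to hit the same host at once; 2 is a placeholder -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>
  <!-- Group fetch queues by domain, as tried in the thread -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byDomain</value>
  </property>
  <!-- Partition URLs across fetch tasks by domain, as tried in the thread -->
  <property>
    <name>partition.url.mode</name>
    <value>byDomain</value>
  </property>
</configuration>
```

Property names can differ between Nutch releases (some versions may use fetcher.threads.per.queue for the per-host limit), so it is worth checking the nutch-default.xml shipped with your version before copying any of these.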

