I will try this out. How do I pass this parameter if we are doing a one-step crawl?

On Mon, Jan 28, 2013 at 4:28 PM, Tejas Patil <[email protected]> wrote:

> Hey Sourajit,
>
> I had seen such a thing when running crawls over a hadoop cluster. After some
> experiments, I came to the following conclusion:
> The number of mappers spawned is governed by the no of part files created
> by the generator (and not the #nodes in the cluster). And this is nothing
> but the reducers for the last job in the generate phase. There is a param
> passed to generate named numFetchers to control its #reducers.
>
> Thanks,
> Tejas Patil
>
>
> On Mon, Jan 28, 2013 at 2:49 AM, Sourajit Basak <[email protected]> wrote:
>
> > A higher number of per host threads, etc might not be useful if the
> > bandwidth doesn't scale out. I have a different observation though.
> >
> > We run nutch on a hadoop cluster. Even as we added new machines to the
> > cluster, the fetch phase only creates two tasks (the original number of
> > nodes when we started). Why is it so? I have checked that the tasks do get
> > spawned in the newly added nodes.
> > We have this setting in hadoop mapred-site.xml
> > <property>
> >   <name>mapred.tasktracker.map.tasks.maximum</name>
> >   <value>20</value>
> > </property>
> >
> > We have planned to double the number of websites and see if it still
> > doesn't spawn tasks on each node. I will keep this forum updated with our
> > results. In the meantime, can anyone point out if we have missed any
> > particular configuration?
> >
> > Thanks,
> > Sourajit
> >
> >
> > On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
> >
> > > Hey Peter,
> > >
> > > I am guessing that you have just increased the global thread count. Have
> > > you even increased "fetcher.threads.per.host"? This will improve the crawl
> > > rate as multiple threads can attack the same site. Don't make it too high or
> > > else the system will get overloaded. The nutch wiki has an article [0]
> > > about the potential reasons for slow crawls and some good suggestions.
> > >
> > > [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> > >
> > > Thanks,
> > > Tejas Patil
> > >
> > >
> > > On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <[email protected]> wrote:
> > >
> > > > I tried increasing the number of threads to 50 but the speed is not
> > > > affected.
> > > >
> > > > I tried changing the partition.url.mode value to byDomain and
> > > > fetcher.queue.mode to byDomain but still it does not help the speed.
> > > > It seems to get urls from 2 domains now and the other domains are not
> > > > getting crawled. Is this due to the url score? If so, how do I crawl urls
> > > > from all the domains?
> > > >
> > > >
> > > > lewis john mcgibbney wrote
> > > > > Increase number of threads when fetching
> > > > > Also please see nutch-default.xml for partitioning of urls, if you know your
> > > > > target domains you may wish to adapt the policy.
> > > > > Lewis
> > > > >
> > > > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > > > >> I want to increase the number of urls fetched at a time in nutch. I have
> > > > >> around 10 websites to crawl. So how can I crawl all the sites at a time?
> > > > >> Right now I am fetching 1 site with a fetch delay of 2 seconds but it is too
> > > > >> slow. How to concurrently fetch from different domains?
> > > > >>
> > > > >> --
> > > > >> View this message in context:
> > > > >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > > > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > >>
> > > > >
> > > > > --
> > > > > *Lewis*
> > > >
> > > > --
> > > > View this message in context:
> > > > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> > > > Sent from the Nutch - User mailing list archive at Nabble.com.
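
For the step-by-step case discussed in the thread, here is a rough sketch of how the numFetchers parameter might be passed to the generate step. It only assembles the command line and does not run anything. The `-numFetchers` and `-topN` spellings are assumed from the Nutch 1.x generate usage message, the crawl/crawldb and crawl/segments paths are hypothetical, and the helper name build_generate_command is made up for illustration. It does not cover the one-step crawl command.

```python
import shlex


def build_generate_command(crawldb, segments_dir, num_fetchers, top_n=None):
    """Assemble an argv list for the Nutch generate step.

    Assumption: the generate step accepts "-numFetchers <n>" and
    "-topN <n>", as in the Nutch 1.x usage message. Per the thread,
    numFetchers controls how many part files the generator writes,
    and therefore how many fetch map tasks get spawned.
    """
    if num_fetchers < 1:
        raise ValueError("num_fetchers must be at least 1")
    cmd = ["bin/nutch", "generate", crawldb, segments_dir]
    if top_n is not None:
        cmd += ["-topN", str(top_n)]
    cmd += ["-numFetchers", str(num_fetchers)]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; one fetch list per node on a 4-node cluster.
    argv = build_generate_command("crawl/crawldb", "crawl/segments", 4, top_n=1000)
    print(shlex.join(argv))
```

Matching numFetchers to the number of nodes (or a multiple of it) is one plausible starting point, given that the two fetch tasks seen above did not grow with the cluster.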

