I anticipated. If the number of sites crawled increases, have you seen whether Nutch generates more part files than the number of nodes? Maybe we will wait until we see the results from doubling the number of sites before forcing a non-default behavior.
Thanks,
Sourajit

On Mon, Jan 28, 2013 at 5:47 PM, Tejas Patil <[email protected]> wrote:

> Hey Sourajit,
>
> I don't think that it can be passed with the crawl command. You will have
> to use individual commands for that.
> I personally felt messy while running a full crawl with all these bunch of
> commands, so I had created a script to automate things.
>
> Thanks,
> Tejas Patil
>
>
> On Mon, Jan 28, 2013 at 3:46 AM, Sourajit Basak <[email protected]> wrote:
>
> > I will try this out.
> > How do I pass this parameter if we are doing a one step crawl ?
> >
> > On Mon, Jan 28, 2013 at 4:28 PM, Tejas Patil <[email protected]> wrote:
> >
> > > Hey Sourajit,
> > >
> > > I had seen such thing when running crawls over hadoop cluster. After some
> > > experiments, I came to following conclusion:
> > > The number of mappers spawned is governed by the no of part files created
> > > by the generator (and not the #nodes in the cluster). And this is nothing
> > > but the reducers for the last job in the generate phase. There is a param
> > > passed to generate named numFetchers to control its #reducers.
> > >
> > > Thanks,
> > > Tejas Patil
> > >
> > >
> > > On Mon, Jan 28, 2013 at 2:49 AM, Sourajit Basak <[email protected]> wrote:
> > >
> > > > A higher number of per host threads, etc might not be useful if the
> > > > bandwidth doesn't scale out. I have a different observation though.
> > > >
> > > > We run nutch on a hadoop cluster. Even as we added new machines to the
> > > > cluster, the fetch phase only creates two tasks. (the original number of
> > > > nodes when we started) Why is it so ? I have checked that the tasks do get
> > > > spawned in the newly added nodes.
> > > > We have this setting in hadoop mapred-site.xml
> > > > <property>
> > > >   <name>mapred.tasktracker.map.tasks.maximum</name>
> > > >   <value>20</value>
> > > > </property>
> > > >
> > > > We have planned to double the number of websites and see if it still
> > > > doesn't spawn tasks on each node. I will keep this forum updated with our
> > > > results. In the meantime, can anyone point out if we have missed any
> > > > particular configuration ?
> > > >
> > > > Thanks,
> > > > Sourajit
> > > >
> > > >
> > > > On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
> > > >
> > > > > Hey Peter,
> > > > >
> > > > > I am guessing that you have just increased the global thread count. Have
> > > > > you even increased "fetcher.threads.per.host" ? This will improve the crawl
> > > > > rate as multiple threads can attack the same site. Don't make it too high or
> > > > > else the system will get overloaded. The nutch wiki has an article [0]
> > > > > about the potential reasons for slow crawls and some good suggestions.
> > > > >
> > > > > [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> > > > >
> > > > > Thanks,
> > > > > Tejas Patil
> > > > >
> > > > >
> > > > > On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <[email protected]> wrote:
> > > > >
> > > > > > I tried increasing the number of threads to 50 but the speed is not
> > > > > > affected
> > > > > >
> > > > > >
> > > > > > I tried changing the partition.url.mode value to byDomain and
> > > > > > fetcher.queue.mode to byDomain but still it does not help the speed.
> > > > > > It seems to get urls from 2 domains now and the other domains are not
> > > > > > getting crawled. Is this due to the url score? if so how do i crawl urls
> > > > > > from all the domains?
> > > > > >
> > > > > >
> > > > > > lewis john mcgibbney wrote
> > > > > > > Increase number of threads when fetching
> > > > > > > Also please see nutch-default.xml for partitioning of urls, if you know your
> > > > > > > target domains you may wish to adapt the policy.
> > > > > > > Lewis
> > > > > > >
> > > > > > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > > > > > >> I want to increase the number of urls fetched at a time in nutch. I have
> > > > > > >> around 10 websites to crawl. so how can i crawl all the sites at a time ?
> > > > > > >> right now i am fetching 1 site with a fetch delay of 2 seconds but it is too
> > > > > > >> slow. How to concurrently fetch from different domains?
> > > > > > >>
> > > > > > >> --
> > > > > > >> View this message in context:
> > > > > > >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > > > > > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > > > >>
> > > > > > >
> > > > > > > --
> > > > > > > *Lewis*
> > > > > >
> > > > > >
> > > > > > --
> > > > > > View this message in context:
> > > > > > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> > > > > > Sent from the Nutch - User mailing list archive at Nabble.com.
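
For anyone reading this thread later: the step-by-step crawl Tejas mentions (individual commands instead of the one-step crawl command, with numFetchers passed to generate) could be sketched roughly as below. This is not Tejas's script, which was not posted. The `-numFetchers` and `-topN` spellings are my understanding of the Nutch 1.x generate usage, so please check them against the usage message printed by `bin/nutch generate` on your version. The function name, the paths and the segment name are all placeholders made up for illustration. The sketch only builds the command lines and prints them; it runs nothing.

```python
from typing import List


def crawl_round_commands(
    nutch: str,
    crawldb: str,
    segments_dir: str,
    segment: str,
    num_fetchers: int,
    top_n: int,
) -> List[List[str]]:
    """Build the commands for one generate/fetch/parse/updatedb round.

    Nothing is executed here; the caller decides how to run the commands.
    """
    if num_fetchers < 1:
        raise ValueError("num_fetchers must be at least 1")
    return [
        # Assumed flag spelling: -numFetchers. Per the thread, it controls the
        # number of part files written by the generator, which in turn governs
        # how many fetch map tasks are spawned.
        [nutch, "generate", crawldb, segments_dir,
         "-topN", str(top_n), "-numFetchers", str(num_fetchers)],
        [nutch, "fetch", segment],
        [nutch, "parse", segment],
        [nutch, "updatedb", crawldb, segment],
    ]


if __name__ == "__main__":
    # All paths and the segment name below are hypothetical placeholders.
    for cmd in crawl_round_commands(
        nutch="bin/nutch",
        crawldb="crawl/crawldb",
        segments_dir="crawl/segments",
        segment="crawl/segments/20130128000000",
        num_fetchers=4,
        top_n=1000,
    ):
        print(" ".join(cmd))
```

One caveat: generate creates a new timestamped segment directory on each run, so a real script would have to look up the newest directory under the segments path after generate finishes, rather than take the segment name as a fixed argument the way this sketch does.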

