A higher number of per-host threads, etc. might not be useful if the
bandwidth doesn't scale out. I have a different observation, though.

We run Nutch on a Hadoop cluster. Even as we added new machines to the
cluster, the fetch phase still creates only two tasks (the original number
of nodes when we started). Why is that? I have checked that the tasks do
get spawned on the newly added nodes.
We have this setting in Hadoop's mapred-site.xml:
 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>20</value>
 </property>

We plan to double the number of websites and see whether it still
doesn't spawn tasks on each node. I will keep this forum updated with our
results. In the meantime, can anyone point out whether we have missed any
particular configuration?
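
For reference, here is a rough sketch of how the Nutch-side properties named
in the quoted messages below might be set in nutch-site.xml. The values 50
and byDomain are the ones mentioned in the thread; the per-host value of 2 is
purely an illustrative assumption, and fetcher.threads.fetch is assumed to be
the "global thread count" being referred to. Property names can differ
between Nutch releases, so they should be checked against the
nutch-default.xml that ships with the version in use.

```xml
<!-- nutch-site.xml: illustrative sketch only, not a tested configuration -->
<configuration>
  <!-- Total number of fetcher threads (assumed to be the "global thread count") -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
  <!-- Threads allowed to fetch from the same host at once; value is an assumption -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>
  <!-- How fetch queues are keyed; byDomain is the value tried in the thread -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byDomain</value>
  </property>
  <!-- How URLs are partitioned into fetch lists; byDomain as tried in the thread -->
  <property>
    <name>partition.url.mode</name>
    <value>byDomain</value>
  </property>
</configuration>
```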

Thanks,
Sourajit



On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:

> Hey Peter,
>
> I am guessing that you have only increased the global thread count. Have
> you also increased "fetcher.threads.per.host"? This will improve the crawl
> rate, as multiple threads can hit the same site. Don't make it too high, or
> else the system will get overloaded. The Nutch wiki has an article [0]
> about the potential reasons for slow crawls and some good suggestions.
>
> [0] : https://wiki.apache.org/nutch/OptimizingCrawls
>
> Thanks,
> Tejas Patil
>
>
> On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <[email protected]> wrote:
>
> > I tried increasing the number of threads to 50, but the speed is not
> > affected.
> >
> >
> > I tried changing the partition.url.mode value to byDomain and
> > fetcher.queue.mode to byDomain, but it still does not help the speed.
> > It seems to get URLs from 2 domains now, and the other domains are not
> > getting crawled. Is this due to the URL score? If so, how do I crawl URLs
> > from all the domains?
> >
> >
> > lewis john mcgibbney wrote
> > > Increase the number of threads when fetching.
> > > Also please see nutch-default.xml for partitioning of URLs; if you know
> > > your target domains you may wish to adapt the policy.
> > > Lewis
> > >
> > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > >> I want to increase the number of URLs fetched at a time in Nutch. I have
> > >> around 10 websites to crawl, so how can I crawl all the sites at the
> > >> same time? Right now I am fetching 1 site with a fetch delay of 2
> > >> seconds, but it is too slow. How can I fetch concurrently from
> > >> different domains?
> > >>
> > >>
> > >>
> > >> --
> > >> View this message in context:
> > >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > >>
> > >
> > > --
> > > *Lewis*
> >
> >
>
