Re: increase the number of fetches at a given time on nutch 1.6 or 2.1

Tejas Patil Mon, 28 Jan 2013 02:59:01 -0800

Hey Sourajit,

I have seen this when running crawls on a Hadoop cluster. After some
experiments, I came to the following conclusion:
The number of mappers spawned for the fetch is governed by the number of part
files created by the generator (and not by the number of nodes in the
cluster). That, in turn, is simply the number of reducers of the last job in
the generate phase. The generate command takes a parameter named numFetchers
that controls this number of reducers.
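
For illustration, a generate invocation along the following lines should
produce more fetch lists and therefore more fetch map tasks. This is only a
sketch for Nutch 1.x: the crawldb and segments paths and the value 8 are
assumptions, not values from this thread, and the snippet only prints the
command so that it can be checked before it is run.

```shell
# Hypothetical locations; replace with your own crawl directories.
CRAWLDB="crawl/crawldb"
SEGMENTS_DIR="crawl/segments"

# Assumed value: the number of fetch lists (part files) to generate.
# Each part file becomes one map task in the following fetch job.
NUM_FETCHERS=8

# Build the Nutch 1.x generate command with the -numFetchers option.
GENERATE_CMD="bin/nutch generate $CRAWLDB $SEGMENTS_DIR -numFetchers $NUM_FETCHERS"

# Dry run: print the command. Run it from the Nutch home directory
# (for example with: eval "$GENERATE_CMD") once the values look right.
echo "$GENERATE_CMD"
```

If enough URLs are due for fetching, the new segment should then hold 8 part
files under crawl_generate, and the fetch job should start 8 map tasks that
Hadoop can spread over the nodes. As far as I know, the generator falls back
to a single partition when the job runs in local mode, so this only matters
on a real cluster.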


Thanks,
Tejas Patil


On Mon, Jan 28, 2013 at 2:49 AM, Sourajit Basak <[email protected]> wrote:

> A higher number of per-host threads, etc., might not be useful if the
> bandwidth doesn't scale out. I have a different observation, though.
>
> We run Nutch on a Hadoop cluster. Even as we added new machines to the
> cluster, the fetch phase still creates only two tasks (the number of nodes
> we originally started with). Why is that? I have checked that the tasks do
> get spawned on the newly added nodes.
> We have this setting in Hadoop's mapred-site.xml:
>  <property>
>    <name>mapred.tasktracker.map.tasks.maximum</name>
>    <value>20</value>
>  </property>
>
> We plan to double the number of websites and see whether it still
> doesn't spawn tasks on each node. I will keep this forum updated with our
> results. In the meantime, can anyone point out whether we have missed any
> particular configuration?
>
> Thanks,
> Sourajit
>
>
>
> On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
>
> > Hey Peter,
> >
> > I am guessing that you have only increased the global thread count. Have
> > you also increased "fetcher.threads.per.host"? This will improve the crawl
> > rate, as multiple threads can fetch from the same site. Don't set it too
> > high, or else the system will get overloaded. The Nutch wiki has an
> > article [0] about the potential reasons for slow crawls and some good
> > suggestions.
> >
> > [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> >
> > Thanks,
> > Tejas Patil
> >
> >
> > On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <[email protected]> wrote:
> >
> > > I tried increasing the number of threads to 50, but the speed is not
> > > affected.
> > >
> > > I tried changing the partition.url.mode value to byDomain and
> > > fetcher.queue.mode to byDomain, but it still does not help the speed.
> > > It now seems to get URLs from 2 domains, and the other domains are not
> > > getting crawled. Is this due to the URL score? If so, how do I crawl URLs
> > > from all the domains?
> > >
> > >
> > > lewis john mcgibbney wrote
> > > > Increase the number of threads when fetching.
> > > > Also, please see nutch-default.xml for the partitioning of URLs; if you
> > > > know your target domains, you may wish to adapt the policy.
> > > > Lewis
> > > >
> > > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > > >> I want to increase the number of URLs fetched at a time in Nutch. I have
> > > >> around 10 websites to crawl, so how can I crawl all the sites at the same
> > > >> time? Right now I am fetching 1 site with a fetch delay of 2 seconds, but
> > > >> it is too slow. How can I fetch concurrently from different domains?
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> View this message in context:
> > > >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > >>
> > > >
> > > > --
> > > > *Lewis*
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
