I will try this out.
How do I pass this parameter if we are doing a one-step crawl?
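For reference, this is how I understand the parameter is passed in the step-by-step mode, where generate is run as its own command. It is only a sketch: the crawldb and segments paths, the -topN value and the number of fetchers are placeholder assumptions, not values from our setup. It builds the command and prints it for review instead of running it.

```shell
# Placeholder paths and values (assumptions); adjust to your own crawl layout.
CRAWLDB="crawl/crawldb"
SEGMENTS="crawl/segments"
TOPN=1000
NUM_FETCHERS=4   # desired number of fetch lists, i.e. reducers of the last generate job

# -numFetchers is the generate parameter mentioned in the quoted reply below.
GENERATE_CMD="bin/nutch generate $CRAWLDB $SEGMENTS -topN $TOPN -numFetchers $NUM_FETCHERS"

# Print the command for review; run it from the Nutch runtime directory once it looks right.
echo "$GENERATE_CMD"
```

What I still do not know is the equivalent for the one-step crawl command.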

On Mon, Jan 28, 2013 at 4:28 PM, Tejas Patil <[email protected]> wrote:

> Hey Sourajit,
>
> I have seen this when running crawls on a Hadoop cluster. After some
> experiments, I came to the following conclusion:
> The number of mappers spawned is governed by the number of part files
> created by the generator (and not the number of nodes in the cluster). That
> number is simply the number of reducers for the last job in the generate
> phase. Generate takes a parameter named numFetchers that controls its
> number of reducers.
>
> Thanks,
> Tejas Patil
>
>
> On Mon, Jan 28, 2013 at 2:49 AM, Sourajit Basak <[email protected]> wrote:
>
> > A higher number of per-host threads, etc., might not be useful if the
> > bandwidth doesn't scale out. I have a different observation, though.
> >
> > We run Nutch on a Hadoop cluster. Even as we added new machines to the
> > cluster, the fetch phase only creates two tasks (the original number of
> > nodes when we started). Why is that? I have checked that the tasks do get
> > spawned on the newly added nodes.
> > We have this setting in Hadoop's mapred-site.xml:
> >  <property>
> >    <name>mapred.tasktracker.map.tasks.maximum</name>
> >    <value>20</value>
> >  </property>
> >
> > We plan to double the number of websites and see if it still
> > doesn't spawn tasks on each node. I will keep this forum updated with our
> > results. In the meantime, can anyone point out if we have missed any
> > particular configuration?
> >
> > Thanks,
> > Sourajit
> >
> >
> >
> > On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
> >
> > > Hey Peter,
> > >
> > > I am guessing that you have just increased the global thread count.
> > > Have you also increased "fetcher.threads.per.host"? This will improve
> > > the crawl rate, as multiple threads can fetch from the same site. Don't
> > > make it too high, or else the system will get overloaded. The Nutch
> > > wiki has an article [0] about the potential reasons for slow crawls and
> > > some good suggestions.
> > >
> > > [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> > >
> > > Thanks,
> > > Tejas Patil
> > >
> > >
> > > On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <[email protected]> wrote:
> > >
> > > > I tried increasing the number of threads to 50, but the speed is not
> > > > affected.
> > > >
> > > >
> > > > I tried changing the partition.url.mode value to byDomain and
> > > > fetcher.queue.mode to byDomain, but it still does not help the speed.
> > > > It seems to get URLs from 2 domains now, and the other domains are not
> > > > getting crawled. Is this due to the URL score? If so, how do I crawl
> > > > URLs from all the domains?
> > > >
> > > >
> > > > lewis john mcgibbney wrote
> > > > > Increase the number of threads when fetching.
> > > > > Also, please see nutch-default.xml for the partitioning of URLs. If
> > > > > you know your target domains, you may wish to adapt the policy.
> > > > > Lewis
> > > > >
> > > > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > > > >> I want to increase the number of URLs fetched at a time in Nutch.
> > > > >> I have around 10 websites to crawl, so how can I crawl all the
> > > > >> sites at the same time? Right now I am fetching 1 site with a
> > > > >> fetch delay of 2 seconds, but it is too slow. How do I fetch
> > > > >> concurrently from different domains?
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> View this message in context:
> > > > >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > > > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > >>
> > > > >
> > > > > --
> > > > > *Lewis*
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > View this message in context:
> > > > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> > > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > > >
> > >
> >
>
