Re: increase the number of fetches at a given time on nutch 1.6 or 2.1

Sourajit Basak Mon, 28 Jan 2013 04:24:46 -0800

I anticipated as much.

If the number of sites crawled increases, have you seen whether Nutch
generates more part files than the number of nodes? We may wait until we see
the results from doubling the sites before forcing non-default behavior.
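
For reference, here is a rough sketch of what one crawl round with the individual commands and numFetchers might look like. It assumes Nutch 1.x and a local crawl directory layout. The run and crawl_round helpers, the DRY_RUN switch, and all paths are hypothetical names made up for illustration. This is not the script Tejas mentioned.

```shell
#!/usr/bin/env bash
# Sketch only: one generate/fetch/parse/updatedb round with Nutch 1.x.
# NUTCH, DRY_RUN, run and crawl_round are illustrative names, not part of Nutch.

NUTCH="${NUTCH:-bin/nutch}"   # assumed path to the nutch launcher script
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Args: <crawldb> <segments_dir> <num_fetchers> [segment]
# -numFetchers sets how many fetch lists (part files) generate writes,
# which is what governs the number of fetch map tasks.
crawl_round() {
  local crawldb="$1" segments="$2" num_fetchers="$3" segment="${4:-}"

  run "$NUTCH" generate "$crawldb" "$segments" -numFetchers "$num_fetchers" || return 1

  # If no segment was given, pick the newest one on the local filesystem.
  # On a Hadoop cluster the segments live in HDFS, so this lookup would have
  # to go through the hadoop fs tooling instead (assumption, not shown here).
  if [ -z "$segment" ]; then
    segment="$(ls -d "$segments"/* | tail -1)"
  fi

  run "$NUTCH" fetch "$segment" &&
  run "$NUTCH" parse "$segment" &&
  run "$NUTCH" updatedb "$crawldb" "$segment"
}
```

Calling crawl_round crawl/crawldb crawl/segments 4 crawl/segments/SEGMENT with DRY_RUN=1 only prints the four commands, which makes it easy to check the arguments before running anything against the cluster.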


Thanks,
Sourajit

On Mon, Jan 28, 2013 at 5:47 PM, Tejas Patil <[email protected]> wrote:

> Hey Sourajit,
>
> I don't think it can be passed with the crawl command. You will have to
> use the individual commands for that.
> I personally found it messy to run a full crawl with this whole bunch of
> commands, so I created a script to automate things.
>
> Thanks,
> Tejas Patil
>
>
> On Mon, Jan 28, 2013 at 3:46 AM, Sourajit Basak <[email protected]> wrote:
>
> > I will try this out.
> > How do I pass this parameter if we are doing a one-step crawl?
> >
> > On Mon, Jan 28, 2013 at 4:28 PM, Tejas Patil <[email protected]> wrote:
> >
> > > Hey Sourajit,
> > >
> > > I have seen this when running crawls on a Hadoop cluster. After some
> > > experiments, I came to the following conclusion:
> > > The number of mappers spawned is governed by the number of part files
> > > created by the generator (and not by the number of nodes in the cluster).
> > > That number is simply the number of reducers for the last job in the
> > > generate phase. There is a parameter passed to generate, named
> > > numFetchers, that controls its number of reducers.
> > >
> > > Thanks,
> > > Tejas Patil
> > >
> > >
> > > On Mon, Jan 28, 2013 at 2:49 AM, Sourajit Basak <[email protected]> wrote:
> > >
> > > > A higher number of per-host threads, etc. might not be useful if the
> > > > bandwidth doesn't scale out. I have a different observation, though.
> > > >
> > > > We run Nutch on a Hadoop cluster. Even as we added new machines to the
> > > > cluster, the fetch phase only creates two tasks (the original number of
> > > > nodes when we started). Why is that? I have checked that tasks do get
> > > > spawned on the newly added nodes.
> > > > We have this setting in Hadoop's mapred-site.xml:
> > > >  <property>
> > > >    <name>mapred.tasktracker.map.tasks.maximum</name>
> > > >    <value>20</value>
> > > >  </property>
> > > >
> > > > We plan to double the number of websites and see whether it still
> > > > doesn't spawn tasks on each node. I will keep this forum updated with our
> > > > results. In the meantime, can anyone point out whether we have missed any
> > > > particular configuration?
> > > >
> > > > Thanks,
> > > > Sourajit
> > > >
> > > >
> > > >
> > > > On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
> > > >
> > > > > Hey Peter,
> > > > >
> > > > > I am guessing that you have only increased the global thread count. Have
> > > > > you also increased "fetcher.threads.per.host"? This will improve the crawl
> > > > > rate, as multiple threads can hit the same site. Don't make it too high,
> > > > > or else the system will get overloaded. The Nutch wiki has an article [0]
> > > > > about the potential reasons for slow crawls and some good suggestions.
> > > > >
> > > > > [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> > > > >
> > > > > Thanks,
> > > > > Tejas Patil
> > > > >
> > > > >
> > > > > On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <[email protected]> wrote:
> > > > >
> > > > > > I tried increasing the number of threads to 50, but the speed is not
> > > > > > affected.
> > > > > >
> > > > > > I tried changing the partition.url.mode value to byDomain and
> > > > > > fetcher.queue.mode to byDomain, but it still does not help the speed.
> > > > > > It seems to get URLs from 2 domains now, and the other domains are not
> > > > > > getting crawled. Is this due to the URL score? If so, how do I crawl URLs
> > > > > > from all the domains?
> > > > > >
> > > > > >
> > > > > > lewis john mcgibbney wrote
> > > > > > > Increase the number of threads when fetching.
> > > > > > > Also, please see nutch-default.xml for the partitioning of URLs; if you
> > > > > > > know your target domains you may wish to adapt the policy.
> > > > > > > Lewis
> > > > > > >
> > > > > > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > > > > > >> I want to increase the number of URLs fetched at a time in Nutch. I have
> > > > > > >> around 10 websites to crawl, so how can I crawl all the sites at the same
> > > > > > >> time? Right now I am fetching 1 site with a fetch delay of 2 seconds, but
> > > > > > >> it is too slow. How do I fetch concurrently from different domains?
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> View this message in context:
> > > > > > >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > > > > > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > > > >>
> > > > > > >
> > > > > > > --
> > > > > > > *Lewis*
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > View this message in context:
> > > > > > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> > > > > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > > >
> > > > >
> > > >
> > >
> >
>
