Hey Sourajit,

I don't think it can be passed with the crawl command. You will have
to use the individual commands for that.
I found running a full crawl with all of these separate commands messy,
so I wrote a script to automate it.
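
A minimal sketch of one round of such a script might look like this. It is
only an illustration, not the actual script mentioned above: the paths, the
default values, and the DRY_RUN switch are assumptions made up for the
example, and the segment lookup assumes a local filesystem (on HDFS the
segments would have to be listed with hadoop fs -ls instead).

```shell
#!/bin/sh
# Sketch of one crawl round using the individual Nutch 1.x commands.
# Assumptions (not from this thread): all paths, the default values,
# and the DRY_RUN switch are hypothetical and only for illustration.
set -e

NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"
NUM_FETCHERS="${NUM_FETCHERS:-4}"   # number of fetch lists the generator writes
DRY_RUN="${DRY_RUN:-1}"             # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

crawl_round() {
  # numFetchers is passed to the generate step, as discussed in the thread.
  run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -numFetchers "$NUM_FETCHERS"

  if [ "$DRY_RUN" = "1" ]; then
    segment="$SEGMENTS/NEWEST_SEGMENT"   # placeholder, nothing was generated
  else
    # Segment directories are named by timestamp, so the last one is the newest.
    segment=$(ls -d "$SEGMENTS"/* | tail -n 1)
  fi

  run "$NUTCH" fetch "$segment"
  run "$NUTCH" parse "$segment"
  run "$NUTCH" updatedb "$CRAWLDB" "$segment"
}

crawl_round
```

By default this only prints the four commands. Saved under a hypothetical
name such as crawl_round.sh, it could be run for real with
DRY_RUN=0 NUM_FETCHERS=8 sh crawl_round.sh.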

Thanks,
Tejas Patil


On Mon, Jan 28, 2013 at 3:46 AM, Sourajit Basak <[email protected]> wrote:

> I will try this out.
> How do I pass this parameter if we are doing a one-step crawl?
>
> On Mon, Jan 28, 2013 at 4:28 PM, Tejas Patil <[email protected]> wrote:
>
> > Hey Sourajit,
> >
> > I have seen this when running crawls on a Hadoop cluster. After some
> > experiments, I came to the following conclusion:
> > The number of mappers spawned is governed by the number of part files
> > created by the generator (and not by the number of nodes in the cluster).
> > That, in turn, is simply the number of reducers of the last job in the
> > generate phase. The generate command takes a parameter named numFetchers
> > that controls this number of reducers.
> >
> > Thanks,
> > Tejas Patil
> >
> >
> > On Mon, Jan 28, 2013 at 2:49 AM, Sourajit Basak <[email protected]> wrote:
> >
> > > A higher number of per-host threads, etc. might not be useful if the
> > > bandwidth doesn't scale out. I have a different observation, though.
> > >
> > > We run Nutch on a Hadoop cluster. Even as we added new machines to the
> > > cluster, the fetch phase only creates two tasks (the original number of
> > > nodes when we started). Why is that? I have checked that the tasks do get
> > > spawned on the newly added nodes.
> > > We have this setting in Hadoop's mapred-site.xml:
> > >  <property>
> > >    <name>mapred.tasktracker.map.tasks.maximum</name>
> > >    <value>20</value>
> > >  </property>
> > >
> > > We plan to double the number of websites and see whether it still
> > > doesn't spawn tasks on each node. I will keep this forum updated with our
> > > results. In the meantime, can anyone point out whether we have missed any
> > > particular configuration?
> > >
> > > Thanks,
> > > Sourajit
> > >
> > >
> > >
> > > On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
> > >
> > > > Hey Peter,
> > > >
> > > > I am guessing that you have only increased the global thread count. Have
> > > > you also increased "fetcher.threads.per.host"? This will improve the
> > > > crawl rate, as multiple threads can then fetch from the same site. Don't
> > > > set it too high, or else the system will get overloaded. The Nutch wiki
> > > > has an article [0] about the potential reasons for slow crawls and some
> > > > good suggestions.
> > > >
> > > > [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> > > >
> > > > Thanks,
> > > > Tejas Patil
> > > >
> > > >
> > > > On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <[email protected]> wrote:
> > > >
> > > > > I tried increasing the number of threads to 50, but the speed is not
> > > > > affected.
> > > > >
> > > > >
> > > > > I tried changing the partition.url.mode value to byDomain and
> > > > > fetcher.queue.mode to byDomain, but it still does not help the speed.
> > > > > It now seems to get URLs from 2 domains, and the other domains are not
> > > > > getting crawled. Is this due to the URL score? If so, how do I crawl
> > > > > URLs from all the domains?
> > > > >
> > > > >
> > > > > lewis john mcgibbney wrote
> > > > > > > Increase the number of threads when fetching.
> > > > > > > Also, please see nutch-default.xml for the partitioning of URLs; if
> > > > > > > you know your target domains, you may wish to adapt the policy.
> > > > > > Lewis
> > > > > >
> > > > > > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > > > > > >> I want to increase the number of URLs fetched at a time in Nutch. I
> > > > > > >> have around 10 websites to crawl, so how can I crawl all the sites at
> > > > > > >> a time? Right now I am fetching 1 site with a fetch delay of 2
> > > > > > >> seconds, but it is too slow. How do I fetch concurrently from
> > > > > > >> different domains?
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > > >> View this message in context:
> > > > > > >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > > > > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > > >>
> > > > > >
> > > > > > --
> > > > > > *Lewis*
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > > View this message in context:
> > > > > > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> > > > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > >
> > > >
> > >
> >
>
