Hi Sourajit,

I strongly believe that having more hosts / sites will not affect the number
of part files created. The number of part files is bounded by (number of
nodes) * (max number of reducers per node). That is the upper bound; in
practice the actual value is lower, as the cluster won't allocate all the
reducer slots on all the nodes to one particular job.
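
To make the bound concrete, here is a minimal sketch of the arithmetic in
Python. The function name and the cluster numbers (4 nodes, 2 reducer slots
per node) are hypothetical and only for illustration; they are not taken from
your setup or from Nutch / Hadoop itself.

```python
def max_part_files(num_nodes: int, max_reducers_per_node: int) -> int:
    """Upper bound on part files if every reducer slot on every node were used."""
    if num_nodes < 0 or max_reducers_per_node < 0:
        raise ValueError("counts must be non-negative")
    return num_nodes * max_reducers_per_node


# Hypothetical cluster: 4 nodes, 2 reducer slots per node.
bound = max_part_files(num_nodes=4, max_reducers_per_node=2)
print(bound)  # 8
```

A single job would normally get fewer reducer slots than this, so treat the
result as a ceiling rather than a prediction.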

Thanks,
Tejas Patil


On Mon, Jan 28, 2013 at 4:24 AM, Sourajit Basak <[email protected]> wrote:

> I anticipated as much.
>
> If the number of sites crawled increases, have you seen Nutch generate more
> part files than the number of nodes? We will probably wait until we see the
> results from doubling the sites before forcing non-default behavior.
>
> Thanks,
> Sourajit
>
> On Mon, Jan 28, 2013 at 5:47 PM, Tejas Patil <[email protected]> wrote:
>
> > Hey Sourajit,
> >
> > I don't think it can be passed with the crawl command. You will have to
> > run the individual commands for that.
> > Personally, I found running a full crawl with all of these commands messy,
> > so I created a script to automate it.
> >
> > Thanks,
> > Tejas Patil
> >
> >
> > On Mon, Jan 28, 2013 at 3:46 AM, Sourajit Basak <[email protected]> wrote:
> >
> > > I will try this out.
> > > How do I pass this parameter if we are doing a one-step crawl?
> > >
> > > On Mon, Jan 28, 2013 at 4:28 PM, Tejas Patil <[email protected]> wrote:
> > >
> > > > Hey Sourajit,
> > > >
> > > > I have seen this when running crawls on a Hadoop cluster. After some
> > > > experiments, I came to the following conclusion:
> > > > The number of mappers spawned is governed by the number of part files
> > > > created by the generator (and not by the number of nodes in the
> > > > cluster). That number is simply the number of reducers for the last job
> > > > in the generate phase. There is a parameter passed to generate, named
> > > > numFetchers, that controls its number of reducers.
> > > >
> > > > Thanks,
> > > > Tejas Patil
> > > >
> > > >
> > > > On Mon, Jan 28, 2013 at 2:49 AM, Sourajit Basak <[email protected]> wrote:
> > > >
> > > > > A higher number of per-host threads, etc. might not be useful if the
> > > > > bandwidth doesn't scale out. I have a different observation, though.
> > > > >
> > > > > We run Nutch on a Hadoop cluster. Even as we added new machines to the
> > > > > cluster, the fetch phase still creates only two tasks (the number of
> > > > > nodes we originally started with). Why is that? I have checked that
> > > > > tasks do get spawned on the newly added nodes.
> > > > > We have this setting in Hadoop's mapred-site.xml:
> > > > >  <property>
> > > > >    <name>mapred.tasktracker.map.tasks.maximum</name>
> > > > >    <value>20</value>
> > > > >  </property>
> > > > >
> > > > > We plan to double the number of websites and see whether it still
> > > > > doesn't spawn tasks on each node. I will keep this forum updated with
> > > > > our results. In the meantime, can anyone point out whether we have
> > > > > missed any particular configuration?
> > > > >
> > > > > Thanks,
> > > > > Sourajit
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
> > > > >
> > > > > > Hey Peter,
> > > > > >
> > > > > > I am guessing that you have only increased the global thread count.
> > > > > > Have you also increased "fetcher.threads.per.host"? This will improve
> > > > > > the crawl rate, as multiple threads can fetch from the same site.
> > > > > > Don't make it too high, or else the system will get overloaded. The
> > > > > > Nutch wiki has an article [0] about the potential reasons for slow
> > > > > > crawls, with some good suggestions.
> > > > > >
> > > > > > [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> > > > > >
> > > > > > Thanks,
> > > > > > Tejas Patil
> > > > > >
> > > > > >
> > > > > > On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <[email protected]> wrote:
> > > > > >
> > > > > > > I tried increasing the number of threads to 50, but the speed is
> > > > > > > not affected.
> > > > > > >
> > > > > > >
> > > > > > > I tried changing the partition.url.mode value to byDomain and
> > > > > > > fetcher.queue.mode to byDomain, but it still does not help the
> > > > > > > speed. It seems to get URLs from 2 domains now, and the other
> > > > > > > domains are not getting crawled. Is this due to the URL score? If
> > > > > > > so, how do I crawl URLs from all the domains?
> > > > > > >
> > > > > > >
> > > > > > > lewis john mcgibbney wrote
> > > > > > > > Increase the number of threads when fetching.
> > > > > > > > Also, please see nutch-default.xml for the partitioning of URLs;
> > > > > > > > if you know your target domains, you may wish to adapt the policy.
> > > > > > > > Lewis
> > > > > > > >
> > > > > > > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > > > > > > > >> I want to increase the number of URLs fetched at a time in
> > > > > > > > >> Nutch. I have around 10 websites to crawl, so how can I crawl
> > > > > > > > >> all the sites at the same time? Right now I am fetching 1 site
> > > > > > > > >> with a fetch delay of 2 seconds, but it is too slow. How can I
> > > > > > > > >> fetch concurrently from different domains?
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> --
> > > > > > > > >> View this message in context:
> > > > > > > > >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > > > > > > > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > > > > >>
> > > > > > > >
> > > > > > > > --
> > > > > > > > *Lewis*
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > View this message in context:
> > > > > > > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> > > > > > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
