Eeh, this reply was meant for the "Please share your experience of using Nutch
in production" topic.
Markus
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Sunday 22nd June 2014 22:54
> To: [email protected]
> Subject: RE: Relationship between fetcher.threads.fetch and
> fetcher.threads.per.host
>
> Hi Meraj,
>
> If you see things from another perspective, you may not even need a (very)
> small crawl delay. Even using a high delay, say 10 seconds, you can still
> recrawl websites up to 200,000 records large every month, and still quickly
> discover and index newly found content. If the sites you target are small,
> then there isn't really a problem, except that it takes a bit longer for the
> first iteration to finish.
>
> Markus
>
>
>
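The figure above can be checked with a little arithmetic. This is a minimal sketch, assuming one request per crawl delay to a single host, a 30-day month, and negligible download time; the function names are hypothetical and not part of Nutch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_fetches_per_host(crawl_delay_s, days=30):
    """Upper bound on sequential fetches from one host in `days` days,
    assuming exactly one request per crawl delay (fetch time ignored)."""
    return int(days * SECONDS_PER_DAY // crawl_delay_s)


def days_to_recrawl(pages, crawl_delay_s):
    """Days needed to fetch `pages` pages from one host at the given delay."""
    return pages * crawl_delay_s / SECONDS_PER_DAY


print(max_fetches_per_host(10))      # 259200 fetches in 30 days
print(days_to_recrawl(200_000, 10))  # roughly 23 days
```

Under these assumptions a 10-second delay leaves room for a 200,000-page site within a month, which is consistent with the estimate in the message above.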
> -----Original message-----
> > From:Meraj A. Khan <[email protected]>
> > Sent: Sunday 22nd June 2014 22:33
> > To: [email protected]
> > Subject: Re: Relationship between fetcher.threads.fetch and
> > fetcher.threads.per.host
> >
> > Sebastian,
> >
> > Thanks for the clear explanation. I have two similar questions.
> >
> >
> > 1. If I set fetcher.threads.per.host (or the renamed
> > fetcher.threads.per.queue property) to more than the default of 1, would
> > my crawler still stay within the crawl-delay limits that each host
> > specifies in its robots.txt?
> > 2. It looks like the maximum we set in fetcher.threads.per.host only
> > comes into play when the total number of threads for the map task is
> > less than the value we specify in the fetcher.threads.fetch property?
> >
> > Thanks.
> >
> >
> > On Sun, Jun 22, 2014 at 2:13 PM, Sebastian Nagel <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > > 1. fetcher.threads.per.host: 10*3 = 30
> > > Correct. But with 1000 hosts you would hardly set
> > > fetcher.threads.fetch to 3000; see question 2.
> > >
> > > Keep in mind that the property was renamed to
> > > fetcher.threads.per.queue in Nutch 1.4!
> > > A queue can be defined by host or IP; see fetcher.queue.mode.
> > >
> > > > 2. fetcher.threads.fetch
> > > If there are many hosts you would set fetcher.threads.per.host
> > > to 1 (the default), and use fetcher.threads.fetch to limit the
> > > load on your system (esp. to limit the network load).
> > >
> > > > 3. in distributed mode
> > > All URLs from the same host are placed in the same partition.
> > > This ensures that host-level blocking can be done in one single
> > > JVM.
> > >
> > > Sebastian
> > >
> > >
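The partitioning point above can be illustrated with a toy example. This is only a sketch of the idea, not Nutch's actual partitioner: the CRC32 hash and the function names are assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def partition_for(url, num_partitions):
    """Pick a partition from the URL's host only, so every URL of one
    host maps to the same partition (CRC32 is an illustrative hash)."""
    host = urlsplit(url).hostname or url
    return zlib.crc32(host.encode("utf-8")) % num_partitions


def partition_urls(urls, num_partitions):
    """Group URLs into partitions keyed by partition index."""
    parts = defaultdict(list)
    for url in urls:
        parts[partition_for(url, num_partitions)].append(url)
    return dict(parts)


urls = [
    "http://example.com/a",
    "http://example.com/b?x=1",
    "http://example.org/",
]
for index, members in sorted(partition_urls(urls, 4).items()):
    print(index, members)
```

Because the partition depends only on the host, one process sees all URLs of a given host and can enforce per-host limits locally, without coordinating with other fetcher tasks.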
> > > On 06/22/2014 05:51 PM, S.L wrote:
> > > > Hi All,
> > > >
> > > > I would like to know the relationship between the two config properties
> > > > *fetcher.threads.fetch* and *fetcher.threads.per.host*.
> > > >
> > > >
> > > > 1. If, let's say, I am crawling 10 hosts in my seed file and set the
> > > > fetcher.threads.per.host property to 3, should I set the
> > > > fetcher.threads.fetch property to 10*3, i.e. >= 30?
> > > > 2. I can understand the *fetcher.threads.per.host* property, as it is
> > > > self-explanatory: it is the number of concurrent connections to a
> > > > particular host. However, I cannot clearly follow what
> > > > *fetcher.threads.fetch* does.
> > > > 3. I would also like to know how the *fetcher.threads.per.host*
> > > > property comes into play in distributed mode.
> > > >
> > > >
> > > >
> > > > Thanks in advance.
> > > >
> > >
> > >
> >
>
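The thread arithmetic from the original question (10 hosts with 3 threads each) can be sketched as follows. This is a simplified model inferred from the discussion, assuming every queue always has URLs waiting; the function names are hypothetical and this is not how Nutch itself computes anything.

```python
def max_useful_threads(num_queues, threads_per_queue):
    """Most threads that can fetch at once if each queue (host or IP)
    admits at most `threads_per_queue` concurrent threads."""
    return num_queues * threads_per_queue


def effective_parallelism(threads_fetch, num_queues, threads_per_queue=1):
    """Concurrent fetches are capped both by the total thread count
    (fetcher.threads.fetch) and by the per-queue limit."""
    return min(threads_fetch, max_useful_threads(num_queues, threads_per_queue))


print(max_useful_threads(10, 3))            # 30, the 10*3 case
print(effective_parallelism(50, 10, 3))     # 30, extra threads stay idle
print(effective_parallelism(50, 1000, 1))   # 50, total thread count is the cap
```

With few hosts the per-queue limit is the bottleneck; with many hosts the total thread count is, which is why the advice in the thread is to leave the per-queue value at 1 and tune fetcher.threads.fetch instead.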