RE: Relationship between fetcher.threads.fetch and fetcher.threads.per.host

Markus Jelsma Sun, 22 Jun 2014 13:54:24 -0700

Hi Meraj,

Seen from another perspective, you may not even need a (very) small crawl
delay. Even with a high delay, say 10 seconds, you can still recrawl websites
of up to 200,000 records every month and still quickly discover and index
newly found content. If the sites you target are small, there isn't really a
problem, except that the first iteration takes a bit longer to finish.
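
A rough back-of-the-envelope check of that figure, as a sketch only. It is
not Nutch code: the function name is made up for illustration, and it assumes
one fetcher thread per host queue, a 30-day month, and download time that is
negligible next to the crawl delay.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_fetches_per_host(crawl_delay_s: float, days: int = 30) -> int:
    """Upper bound on fetches from one host when requests are spaced
    crawl_delay_s seconds apart (hypothetical helper, not a Nutch API)."""
    return int(days * SECONDS_PER_DAY // crawl_delay_s)


# With a 10 second delay, one host allows roughly 259,200 fetches in 30 days.
print(max_fetches_per_host(10))
```

Under those assumptions the ceiling sits above 200,000, so a site of that size
can still be refetched completely within a month.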


Markus

 
 
-----Original message-----
> From:Meraj A. Khan <[email protected]>
> Sent: Sunday 22nd June 2014 22:33
> To: [email protected]
> Subject: Re: Relationship between fetcher.threads.fetch and 
> fetcher.threads.per.host
> 
> Sebastian,
> 
> Thanks for the clear explanation. I have some similar questions.
> 
> 
>    1. If I set fetcher.threads.per.host (or the renamed
>    fetcher.threads.per.queue property) to more than the default of 1, would my
>    crawler still stay within the crawl-delay limits for each host as specified
>    in its robots.txt?
>    2. It looks like the maximum we set in fetcher.threads.per.host
>    only comes into play when the total number of threads for the map task is
>    less than the value we specify in the fetcher.threads.fetch property?
> 
> Thanks.
> 
> 
> On Sun, Jun 22, 2014 at 2:13 PM, Sebastian Nagel <[email protected]
> > wrote:
> 
> > Hi,
> >
> > > 1. fetcher.threads.per.host: 10*3 = 30
> > Correct. But if there are 1000 hosts you would hardly
> > set it to 3000, see question 2.
> >
> > Keep in mind that the property was renamed to
> > fetcher.threads.per.queue in Nutch 1.4!
> > A queue can be defined by host or IP, see fetcher.queue.mode.
> >
> > > 2. fetcher.threads.fetch
> > If there are many hosts you would set fetcher.threads.per.host
> > to 1 (the default), and use fetcher.threads.fetch to limit the
> > load on your system (esp. to limit the network load).
> >
> > > 3. in distributed mode
> > All URLs from the same host are placed in the same partition.
> > This ensures that host-level blocking can be done in one single
> > JVM.
> >
> > Sebastian
> >
> >
> > On 06/22/2014 05:51 PM, S.L wrote:
> > > Hi All,
> > >
> > > I would like to know the relationship between the two config properties
> > > *fetcher.threads.fetch* and *fetcher.threads.per.host*.
> > >
> > >
> > >    1. Let's say I am crawling 10 hosts in my seed file and set the
> > >    fetcher.threads.per.host property to 3. Should I set the
> > >    fetcher.threads.fetch property to 10*3, i.e. >= 30?
> > >    2. I can understand the *fetcher.threads.per.host* property, as it is
> > >    self-explanatory: it means the number of concurrent connections to a
> > >    particular host. However, I am not able to clearly follow what
> > >    *fetcher.threads.fetch* does.
> > >    3. I would also like to know how the *fetcher.threads.per.host*
> > >    property comes into play in distributed mode.
> > >
> > >
> > >
> > > Thanks in advance.
> > >
> >
> >
>
