RE: Relationship between fetcher.threads.fetch and fetcher.threads.per.host

Markus Jelsma Sun, 22 Jun 2014 13:57:25 -0700
Eeh, this reply was meant for the "Please share your experience of using Nutch 
in production" topic. 
Markus 
 
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Sunday 22nd June 2014 22:54
> To: [email protected]
> Subject: RE: Relationship between fetcher.threads.fetch and 
> fetcher.threads.per.host
> 
> Hi Meraj,
> 
> If you see things from another perspective, you may not even need a (very) 
> small crawl delay. Even using a high delay, say 10 seconds, you can still 
> recrawl websites of up to 200,000 records every month, and still quickly 
> discover and index newly found content. If the sites you target are small, 
> then there isn't really a problem, except that it takes a bit longer for the 
> first iteration to finish.
> 
> Markus
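
As a rough check of the figure above: with a 10-second delay between requests to one host, a 30-day month leaves room for about 259,200 fetches, which is above the 200,000 records mentioned. The sketch below assumes a single connection per host, negligible download time, and uninterrupted crawling; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_fetches_per_host(crawl_delay_s: int, days: int = 30) -> int:
    """Upper bound on fetches from one host when requests are spaced
    crawl_delay_s seconds apart (assumes one connection per host and
    ignores download time)."""
    return (days * SECONDS_PER_DAY) // crawl_delay_s


# 30 days * 86,400 s/day / 10 s per request
print(max_fetches_per_host(10))  # 259200
```

In practice the real number will be lower, since each fetch also takes time and crawls rarely run without pauses.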
> 
>  
>  
> -----Original message-----
> > From:Meraj A. Khan <[email protected]>
> > Sent: Sunday 22nd June 2014 22:33
> > To: [email protected]
> > Subject: Re: Relationship between fetcher.threads.fetch and 
> > fetcher.threads.per.host
> > 
> > Sebastian,
> > 
> > Thanks for the clear explanation. I have some similar questions.
> > 
> > 
> >    1. If I set fetcher.threads.per.host (or the renamed
> >    fetcher.threads.per.queue property) to more than the default of 1, would
> >    my crawler still stay within the crawl-delay limits for each host as
> >    specified in its robots.txt?
> >    2. It looks like the maximum we set in fetcher.threads.per.host only
> >    comes into play when the total number of threads for the map task is
> >    less than the value we specify in the fetcher.threads.fetch property?
> > 
> > Thanks.
> > 
> > 
> > On Sun, Jun 22, 2014 at 2:13 PM, Sebastian Nagel <[email protected]> wrote:
> > 
> > > Hi,
> > >
> > > > 1. fetcher.threads.per.host: 10*3 = 30
> > > Correct. But with 1000 hosts you would hardly set
> > > fetcher.threads.fetch to 3000; see question 2.
> > >
> > > Keep in mind that the property was renamed to
> > > fetcher.threads.per.queue in Nutch 1.4!
> > > A queue can be defined by host or IP; see fetcher.queue.mode.
> > >
> > > > 2. fetcher.threads.fetch
> > > If there are many hosts you would set fetcher.threads.per.host
> > > to 1 (the default), and use fetcher.threads.fetch to limit the
> > > load on your system (esp. to limit the network load).
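
The interplay described in points 1 and 2 can be pictured as a simple upper bound: a fetcher task cannot run more simultaneous fetches than fetcher.threads.fetch allows, nor more than the per-queue limit times the number of queues. This is a simplified model, not Nutch code; the function and parameter names are hypothetical.

```python
def max_parallel_fetches(threads_fetch: int,
                         threads_per_queue: int,
                         num_queues: int) -> int:
    """Upper bound on simultaneous fetches in one fetcher task.

    threads_fetch     -- total thread budget (fetcher.threads.fetch)
    threads_per_queue -- per-queue limit (fetcher.threads.per.queue)
    num_queues        -- number of host (or IP) queues with URLs to fetch
    """
    return min(threads_fetch, threads_per_queue * num_queues)


# 10 hosts, 3 threads per host, 30 threads in total: both limits coincide.
print(max_parallel_fetches(30, 3, 10))    # 30

# 1000 hosts, 1 thread per host: the total thread budget is the limit.
print(max_parallel_fetches(50, 1, 1000))  # 50

# Only 10 hosts, 1 thread per host: 40 of the 50 threads have nothing to do.
print(max_parallel_fetches(50, 1, 10))    # 10
```

The second case is the one recommended above for many hosts: keep the per-queue value at 1 and size the total to what the machine and network can handle.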
> > >
> > > > 3. in distributed mode
> > > All URLs from the same host are placed in the same partition.
> > > This ensures that host-level blocking can be done in one single
> > > JVM.
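
One way to picture the partitioning in point 3 is a stable hash of the host name taken modulo the number of partitions. This is only an illustration under assumptions: Nutch's actual partitioner may use a different hash, and partition_for and the CRC32 choice are invented here.

```python
from urllib.parse import urlsplit
from zlib import crc32


def partition_for(url: str, num_partitions: int) -> int:
    """Map a URL to a partition using only its host name, so that all
    URLs of one host end up in the same partition (illustrative hash)."""
    host = urlsplit(url).hostname or ""
    return crc32(host.encode("utf-8")) % num_partitions


urls = [
    "http://example.com/a",
    "http://example.com/b?page=2",
    "http://example.org/index.html",
]

# Both example.com URLs print the same partition number.
for url in urls:
    print(partition_for(url, 4), url)
```

Because the path and query string are ignored, every URL of a host is handled by the same task, which is what allows per-host limits to be enforced in one place.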
> > >
> > > Sebastian
> > >
> > >
> > > On 06/22/2014 05:51 PM, S.L wrote:
> > > > Hi All,
> > > >
> > > > I would like to know the relationship between the two config properties
> > > > *fetcher.threads.fetch* and *fetcher.threads.per.host*.
> > > >
> > > >
> > > >    1. Let's say I am crawling 10 hosts in my seed file and set the
> > > >    fetcher.threads.per.host property to 3. Should I set the
> > > >    fetcher.threads.fetch property to 10*3, i.e. >= 30?
> > > >    2. I can understand the *fetcher.threads.per.host* property, as it is
> > > >    self-explanatory: it means the number of concurrent connections to a
> > > >    particular host. However, I am not able to clearly follow what
> > > >    *fetcher.threads.fetch* does.
> > > >    3. Also, I would like to know how the *fetcher.threads.per.host*
> > > >    property comes into play in distributed mode.
> > > >
> > > >
> > > >
> > > > Thanks in advance.
> > > >
> > >
> > >
> > 
>