Thanks for replying, Sebastian. Just to be clear: I have multiple URLs to crawl, but the number of hosts is one. I take it that even in this case only one thread will be working. If that is the case, then what is the significance of the property fetcher.threads.per.queue?
In my case there would be only one queue, as all URLs reside on the same host, so what is the use of fetcher.threads.per.queue? (My exact settings are repeated in the sketch at the end of this mail.)

Thanks
Manish

> On Jan 6, 2016, at 12:40 PM, Sebastian Nagel <[email protected]> wrote:
>
> Hi,
>
> all requests to the same host are processed in the same
> fetch queue, which also takes care that the configured
> delay (or the one specified in robots.txt) is observed.
> With 10 threads and only one host to be crawled,
> 9 of the threads are just doing nothing. Things are
> different if there are multiple hosts to crawl (>= 10).
>
> Cheers,
> Sebastian
>
> On 01/06/2016 08:51 PM, Manish Verma wrote:
>> Hi,
>> I am using Nutch 1.10 and have some confusion about concurrency during the crawl.
>>
>> For example:
>>
>> fetcher.server.min.delay = .300
>> fetcher.threads.per.queue = 10
>> fetcher.queue.mode = byHost (for simplicity let's assume there is only one host)
>>
>> Now that we have defined 10 threads, how will this behave: will 10 requests be
>> sent to the host at the same time, or will the first thread hit and then, after
>> 300 ms, the second thread hit?
>> If the threads cannot hit at the same time, then what's the use of having
>> multiple threads, since each thread has to wait 300 ms?
>>
>> Thanks MV
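For reference, here is a minimal sketch of how the settings discussed above would appear in conf/nutch-site.xml. It is only a restatement of the values from my example using the standard Nutch fetcher property names; the reading of .300 as seconds (i.e. the 300 ms mentioned above) is my assumption, since Nutch delay values are specified in seconds.

  <!-- Sketch of the fetcher settings from the example above (nutch-site.xml). -->
  <configuration>
    <property>
      <name>fetcher.queue.mode</name>
      <value>byHost</value>       <!-- one fetch queue per host -->
    </property>
    <property>
      <name>fetcher.threads.per.queue</name>
      <value>10</value>           <!-- threads allowed on a single queue -->
    </property>
    <property>
      <name>fetcher.server.min.delay</name>
      <value>.300</value>         <!-- assumed to mean 0.3 s = 300 ms -->
    </property>
  </configuration>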

