The property fetcher.threads.per.queue allows multiple threads to fetch content from the same host in parallel.
Note that with fetcher.threads.per.queue > 1 the delay is configured by fetcher.server.min.delay.
Of course, due to the concurrent fetching there may still be open connections and fetches in progress
during the "delay".
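
For example, something along these lines in nutch-site.xml should enable parallel fetching from a
single host. This is only a sketch: the values are illustrative, and fetcher.server.min.delay is,
if I remember correctly, given in seconds:

  <configuration>

    <!-- one fetch queue per host -->
    <property>
      <name>fetcher.queue.mode</name>
      <value>byHost</value>
    </property>

    <!-- allow up to 10 threads to fetch from the same queue in parallel -->
    <property>
      <name>fetcher.threads.per.queue</name>
      <value>10</value>
    </property>

    <!-- minimum delay (seconds) between successive requests to the same
         server, applied when fetcher.threads.per.queue > 1 -->
    <property>
      <name>fetcher.server.min.delay</name>
      <value>0.3</value>
    </property>

  </configuration>

Whether such an aggressive setting is appropriate of course depends on the site and on what its
robots.txt allows.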
On 01/06/2016 09:54 PM, Manish Verma wrote:
> Thanks for replying, Sebastian.
>
> Just wanted to be clear here: I have multiple URLs to crawl but the number of
> hosts is one. I hope you mean that even in this case only one thread will be
> working.
> If that is the case, then what is the significance of the property
> fetcher.threads.per.queue?
>
> In my case there would be only one queue, as all URLs reside on the same host,
> so what's the use of fetcher.threads.per.queue?
>
> Thanks, Manish
>
>
>> On Jan 6, 2016, at 12:40 PM, Sebastian Nagel <[email protected]> wrote:
>>
>> Hi,
>>
>> all requests to the same host are processed in the same
>> fetch queue, which also takes care that the configured
>> delay (or the one specified in robots.txt) is observed.
>> With 10 threads and only one host to be crawled,
>> 9 of the threads are just doing nothing. Things are
>> different if there are multiple hosts to crawl (>= 10).
>>
>> Cheers,
>> Sebastian
>>
>> On 01/06/2016 08:51 PM, Manish Verma wrote:
>>> Hi,
>>> I am using Nutch 1.10 and have some confusion about concurrency and the
>>> crawl delay.
>>>
>>> For example:
>>>
>>> fetcher.server.min.delay = .300
>>> fetcher.threads.per.queue = 10
>>> fetcher.queue.mode = byHost (for simplicity let's assume there is only
>>> one host)
>>>
>>> Now that we have defined 10 threads, how will this behave: will 10 requests
>>> be sent to the host at the same time, or will the first thread hit and then,
>>> after 300 ms, the second thread hit?
>>> If the threads cannot hit at the same time, then what's the use of having
>>> multiple threads, since each thread has to wait 300 ms?
>>>
>>> Thanks, MV

