The property fetcher.threads.per.queue allows multiple threads to fetch content from the same host in parallel.
Note that with fetcher.threads.per.queue > 1 the delay is configured by fetcher.server.min.delay.
Of course, due to the concurrent fetching there may still be open connections and fetches in progress
during the "delay".
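
For example, something along these lines in nutch-site.xml should enable parallel fetching from a
single host. This is only a sketch: the values are illustrative, and fetcher.server.min.delay is,
if I remember correctly, given in seconds:

  <configuration>

    <!-- one fetch queue per host -->
    <property>
      <name>fetcher.queue.mode</name>
      <value>byHost</value>
    </property>

    <!-- allow up to 10 threads to fetch from the same queue in parallel -->
    <property>
      <name>fetcher.threads.per.queue</name>
      <value>10</value>
    </property>

    <!-- minimum delay (seconds) between successive requests to the same
         server, applied when fetcher.threads.per.queue > 1 -->
    <property>
      <name>fetcher.server.min.delay</name>
      <value>0.3</value>
    </property>

  </configuration>

Whether such an aggressive setting is appropriate of course depends on the site and on what its
robots.txt allows.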
On 01/06/2016 09:54 PM, Manish Verma wrote:
> Thanks for replying, Sebastian.
>
> Just wanted to be clear here: I have multiple URLs to crawl but the number of
> hosts is one. I hope you mean that even in this case only one thread will be
> working.
> If that is the case, then what is the significance of the property
> fetcher.threads.per.queue?
>
> In my case there would be only one queue, as all URLs reside on the same host,
> so what's the use of fetcher.threads.per.queue?
>
> Thanks, Manish
>
>
>> On Jan 6, 2016, at 12:40 PM, Sebastian Nagel <[email protected]> wrote:
>>
>> Hi,
>>
>> all requests to the same host are processed in the same
>> fetch queue, which also takes care that the configured
>> delay (or the one specified in robots.txt) is observed.
>> With 10 threads and only one host to be crawled,
>> 9 of the threads are just doing nothing. Things are
>> different if there are multiple hosts to crawl (>= 10).
>>
>> Cheers,
>> Sebastian
>>
>> On 01/06/2016 08:51 PM, Manish Verma wrote:
>>> Hi,
>>> I am using Nutch 1.10 and have some confusion about concurrency and the
>>> crawl delay.
>>>
>>> For example:
>>>
>>> fetcher.server.min.delay = .300
>>> fetcher.threads.per.queue = 10
>>> fetcher.queue.mode = byHost (for simplicity let's assume there is only
>>> one host)
>>>
>>> Now that we have defined 10 threads, how will this behave: will 10 requests
>>> be sent to the host at the same time, or will the first thread hit and then,
>>> after 300 ms, the second thread hit?
>>> If the threads cannot hit at the same time, then what's the use of having
>>> multiple threads, since each thread has to wait 300 ms?
>>>
>>> Thanks, MV

