Server delay feature conflicts with maxThreadsPerHost
-----------------------------------------------------

                 Key: NUTCH-385
                 URL: http://issues.apache.org/jira/browse/NUTCH-385
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
            Reporter: Chris Schneider


For some time I've been puzzled by the interaction between two parameters that 
control how often the fetcher can access a particular host:

1) The server delay, which comes back from the remote server during our 
processing of its robots.txt file, and which can be capped by 
fetcher.max.crawl.delay.

2) The fetcher.threads.per.host value, particularly when this is greater than 
the default of 1.
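For reference, both parameters live in the Nutch configuration. A minimal 
sketch of the relevant properties (the values shown are for illustration 
only, not recommendations):

```xml
<!-- nutch-site.xml: illustrative values for the two parameters discussed -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
</property>
<property>
  <!-- upper bound (in seconds) applied to the Crawl-delay from robots.txt -->
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
</property>
```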

According to my (limited) understanding of the code in HttpBase.java:

Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
ends up keeping either 1 or 2 fetcher threads pointed at a particular host 
continuously. In other words, it never tries to point a third thread at the 
host, and it always points a second thread at the host before the first 
thread finishes accessing it. Since HttpBase.unblockAddr never gets called 
with (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never 
puts System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for 
the host. Thus, the server delay will never be used at all. The fetcher will 
continuously retrieve pages from the host, often with 2 threads accessing it 
simultaneously.

Suppose instead that the fetcher finally does allow the last thread to complete 
before it gets around to pointing another thread at the target host. When the 
last fetcher thread calls HttpBase.unblockAddr, it will now put 
System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. 
This, in turn, will prevent any threads from accessing this host until the 
delay is complete, even though zero threads are currently accessing the host.
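The accounting behind both scenarios can be sketched as a toy model (the map 
names THREADS_PER_HOST_COUNT and BLOCKED_ADDR_TO_TIME follow the description 
above, but this is a simplified illustration, not the actual HttpBase source):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the per-host accounting described in this report.
// Names mirror the fields mentioned above; the real HttpBase differs.
public class HostBlocking {
    private static final int MAX_THREADS_PER_HOST = 2;  // fetcher.threads.per.host
    private static final long CRAWL_DELAY_MS = 5000;    // server delay from robots.txt

    private final Map<String, Long> blockedAddrToTime = new HashMap<>();
    private final Map<String, Integer> threadsPerHostCount = new HashMap<>();

    // Returns true if a thread may start fetching from this host now.
    public synchronized boolean tryBlockAddr(String host, long now) {
        int count = threadsPerHostCount.getOrDefault(host, 0);
        if (count >= MAX_THREADS_PER_HOST)
            return false;                       // thread limit reached
        Long until = blockedAddrToTime.get(host);
        if (count == 0 && until != null && now < until)
            return false;                       // server delay still in force
        // Note: the delay is only checked when count == 0, so a new thread
        // can join immediately as long as another thread is still active.
        threadsPerHostCount.put(host, count + 1);
        return true;
    }

    // Called when a thread finishes fetching from the host.
    public synchronized void unblockAddr(String host, long now) {
        int count = threadsPerHostCount.get(host);
        if (count == 1) {
            // Only when the LAST thread leaves is the server delay recorded,
            // which is the source of the inconsistency described above.
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, now + CRAWL_DELAY_MS);
        } else {
            threadsPerHostCount.put(host, count - 1);
        }
    }
}
```

Running threads through this model reproduces both behaviors: as long as at 
least one thread stays on the host, replacements are admitted with no delay at 
all; but once the count drops to zero, every thread is shut out for the full 
crawl delay.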

I see this behavior as inconsistent. More importantly, the current 
implementation certainly doesn't seem to answer my original question about 
appropriate definitions for what appear to be conflicting parameters. 

In a nutshell, how could we possibly honor the server delay if we allow more 
than one fetcher thread to simultaneously access the host?

It would be one thing if whenever (fetcher.threads.per.host > 1), this trumped 
the server delay, causing the latter to be ignored completely. That is 
certainly not the case in the current implementation, as it will wait for 
server delay whenever the number of threads accessing a given host drops to 
zero.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira