[
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794757#action_12794757
]
Mike Baranczak commented on NUTCH-385:
--
This is something that recently came up on a project that I'm working on (we're
using 1.0). I'd actually be OK with leaving the functionality as it is - as
long as it was explained properly in the config file. That is, make it clear
that fetcher.server.delay is applied to each fetcher thread individually.
Server delay feature conflicts with maxThreadsPerHost
-
Key: NUTCH-385
URL: https://issues.apache.org/jira/browse/NUTCH-385
Project: Nutch
Issue Type: Bug
Components: fetcher
Reporter: Chris Schneider
For some time I've been puzzled by the interaction between two parameters that
control how often the fetcher can access a particular host:
1) The server delay, which comes back from the remote server during our
processing of the robots.txt file, and which can be limited by
fetcher.max.crawl.delay.
2) The fetcher.threads.per.host value, particularly when this is greater than
the default of 1.
According to my (limited) understanding of the code in HttpBase.java:
Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher
ends up keeping either 1 or 2 fetcher threads pointing at a particular host
continuously. In other words, it never tries to point 3 at the host, and it
always points a second thread at the host before the first thread finishes
accessing it. Since HttpBase.unblockAddr never gets called with
(((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts
System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the
host. Thus, the server delay will never be used at all. The fetcher will be
continuously retrieving pages from the host, often with 2 fetchers accessing
the host simultaneously.
Suppose instead that the fetcher finally does allow the last thread to
complete before it gets around to pointing another thread at the target host.
When the last fetcher thread calls HttpBase.unblockAddr, it will now put
System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the
host. This, in turn, will prevent any threads from accessing this host until
the delay is complete, even though zero threads are currently accessing the
host.
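The two scenarios above can be reproduced with a small model of the
bookkeeping. This is only a sketch, assuming the logic works the way the
description says; it is not the actual HttpBase code. The class name
HostDelayModel, the tryBlockAddr and isDelayed methods, and the explicit nowMs
clock argument are hypothetical and exist only to make the behavior easy to
trace. Only unblockAddr, THREADS_PER_HOST_COUNT and BLOCKED_ADDR_TO_TIME
correspond to names mentioned in the report.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, single-threaded model of the per-host bookkeeping described
// above. Hypothetical illustration only, not the real HttpBase code.
final class HostDelayModel {
    // Models THREADS_PER_HOST_COUNT: host -> number of threads on that host.
    private final Map<String, Integer> threadsPerHostCount = new HashMap<>();
    // Models BLOCKED_ADDR_TO_TIME: host -> earliest time of the next access.
    private final Map<String, Long> blockedAddrToTime = new HashMap<>();
    private final int maxThreadsPerHost; // stands in for fetcher.threads.per.host
    private final long crawlDelayMs;     // stands in for the server delay

    HostDelayModel(int maxThreadsPerHost, long crawlDelayMs) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelayMs = crawlDelayMs;
    }

    // True if the host is still inside a recorded delay window at nowMs.
    boolean isDelayed(String host, long nowMs) {
        Long unblockAt = blockedAddrToTime.get(host);
        return unblockAt != null && nowMs < unblockAt;
    }

    // A thread asks to start fetching from host at nowMs (assumed helper).
    boolean tryBlockAddr(String host, long nowMs) {
        if (isDelayed(host, nowMs)) {
            return false; // delay window still open, nobody may access the host
        }
        blockedAddrToTime.remove(host); // window (if any) has expired
        int count = threadsPerHostCount.getOrDefault(host, 0);
        if (count >= maxThreadsPerHost) {
            return false; // per-host thread limit reached
        }
        threadsPerHostCount.put(host, count + 1);
        return true;
    }

    // A thread finishes with host at nowMs. The delay is recorded only when
    // this was the last thread on the host (count == 1).
    void unblockAddr(String host, long nowMs) {
        Integer count = threadsPerHostCount.get(host);
        if (count == null) {
            return; // no thread registered for this host
        }
        if (count == 1) {
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, nowMs + crawlDelayMs);
        } else {
            threadsPerHostCount.put(host, count - 1);
        }
    }
}
```

With maxThreadsPerHost = 2, a caller that always starts a second thread before
the first one finishes never reaches the count == 1 branch, so isDelayed stays
false and the host is fetched continuously. Once both threads finish, the
delay window opens and tryBlockAddr refuses every thread until it expires,
even though no thread is on the host.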
I see this behavior as inconsistent. More importantly, the current
implementation certainly doesn't seem to answer my original question about
appropriate definitions for what appear to be conflicting parameters.
In a nutshell, how could we possibly honor the server delay if we allow more
than one fetcher thread to simultaneously access the host?
It would be one thing if whenever (fetcher.threads.per.host > 1), this
trumped the server delay, causing the latter to be ignored completely. That
is certainly not the case in the current implementation, as it will wait for
server delay whenever the number of threads accessing a given host drops to
zero.
--
This message is automatically generated by JIRA.