Using Nutch 1.6, I am having a problem with the processing of
fetcher.max.crawl.delay.

 

The description for this property states that "If the Crawl-Delay in
robots.txt is set to greater than this value (in seconds) then the fetcher
will skip this page, generating an error report. If set to -1 the fetcher
will never skip such pages and will wait the amount of time retrieved from
robots.txt Crawl-Delay, however long that might be."

 

I have found that the processing is not as stated when the value is set to
-1. If I set fetcher.max.crawl.delay to -1, any URL on a site that has
Crawl-Delay specified in the applicable section of robots.txt is rejected
with robots_denied(18).
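
For example, a robots.txt section like the following (an illustrative
example, not taken from a real site) is enough to trigger the rejection
when fetcher.max.crawl.delay is -1:

    User-agent: *
    Crawl-delay: 10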

 

I am not a Java developer and I am completely new to using Nutch, but this
looks like it may be either a documentation error for the property or a
problem with the logic in Fetcher.java at line 682.
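
To illustrate, here is a minimal, self-contained sketch of the kind of
comparison I suspect is being made. The class, method, and variable names
are mine, and this is not the actual Nutch source, just a guess at the
shape of the check:

    public class CrawlDelayCheckSketch {

        // Hypothetical stand-in for the check I suspect Fetcher.java makes;
        // the names and millisecond units are my own, not the Nutch code.
        static boolean shouldSkip(long crawlDelayMs, long maxCrawlDelayMs) {
            return crawlDelayMs > maxCrawlDelayMs;
        }

        public static void main(String[] args) {
            long crawlDelayMs = 10000L;  // robots.txt Crawl-delay of 10 seconds
            long maxCrawlDelayMs = -1L;  // fetcher.max.crawl.delay set to -1

            // Prints true: any positive Crawl-Delay is greater than -1, so
            // the URL is skipped, matching the robots_denied(18) I see.
            System.out.println(shouldSkip(crawlDelayMs, maxCrawlDelayMs));

            // A check matching the documented behaviour would need an
            // explicit guard for the -1 (no limit) case, for example:
            boolean skipWithGuard = maxCrawlDelayMs >= 0
                    && crawlDelayMs > maxCrawlDelayMs;
            System.out.println(skipWithGuard);  // prints false
        }
    }

If the real check follows this pattern, either adding a guard like the one
above or correcting the property description would remove the mismatch.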

 

I can work around this by setting the property to some high value, but
perhaps this is a problem that someone would like to look at.

 

I am happy to raise this in Jira if someone can confirm my assessment, or
if that is the right way to get this investigated.

 

Thanks