Thanks Iain for raising this. I will look into it. Can you kindly share the URLs for which you see this behavior? I can run a crawl with those and try it at my end.
On Sat, Apr 27, 2013 at 1:13 PM, Iain Lopata <[email protected]> wrote:
> Using Nutch 1.6, I am having a problem with the processing of
> fetcher.max.crawl.delay.
>
> The description for this property states that "If the Crawl-Delay in
> robots.txt is set to greater than this value (in seconds) then the fetcher
> will skip this page, generating an error report. If set to -1 the fetcher
> will never skip such pages and will wait the amount of time retrieved from
> robots.txt Crawl-Delay, however long that might be."
>
> I have found that the processing is not as stated when the value is set to
> -1. If I set the value of fetcher.max.crawl.delay to -1, any URL on a site
> that has Crawl-Delay specified in the applicable section of robots.txt is
> rejected with a robots_denied(18).
>
> I am not a Java developer and I am completely new to using Nutch, but this
> looks like it may be either a documentation error for the property or a
> problem with the logic in Fetcher.java at Line 682.
>
> I can work around this by setting the property to some high value, but
> perhaps this is a problem that someone would like to look at.
>
> Happy to post in Jira if someone can confirm my assessment or if this is
> the right way to get this investigated.
>
> Thanks
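
In case it helps while I reproduce this, here is a minimal, self-contained sketch of the kind of comparison I suspect is at play. This is not the actual Fetcher.java code, just an illustration of why a configured value of -1 would cause every URL with a Crawl-Delay to be skipped if the value is converted to milliseconds and compared directly, without -1 being special-cased:

    // Illustrative sketch only -- not the actual Nutch Fetcher.java logic.
    public class CrawlDelayCheckSketch {
        public static void main(String[] args) {
            long configuredMax = -1;                      // fetcher.max.crawl.delay as set in nutch-site.xml
            long maxCrawlDelayMs = configuredMax * 1000;  // converted to ms without special-casing -1: -1000
            long robotsCrawlDelayMs = 5 * 1000;           // e.g. "Crawl-Delay: 5" in robots.txt

            if (robotsCrawlDelayMs > maxCrawlDelayMs) {
                // Any positive Crawl-Delay exceeds -1000, so the URL would be
                // skipped and reported as robots_denied instead of fetched.
                System.out.println("skip: robots_denied");
            } else {
                System.out.println("fetch after waiting " + robotsCrawlDelayMs + " ms");
            }
        }
    }

If the check does have this shape, then either special-casing -1 before the comparison or correcting the property description would resolve the mismatch Iain describes. I will confirm once I have run a crawl against the reported URLs.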

