I have verified the problem by stripping it down to its most basic form on one of my own sites. You can reproduce the behavior on any site by serving a robots.txt file that specifies Crawl-Delay (the value does not matter) and setting fetcher.max.crawl.delay to -1. The crawl will not retrieve the page. Then comment out the Crawl-Delay line in robots.txt and try again: the fetch will succeed.
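To illustrate why the Crawl-Delay value would not matter, here is a minimal sketch of the kind of comparison that could produce this behavior. It is only an assumption about the logic, not code copied from Fetcher.java; the class and method names are hypothetical, and I have not confirmed that this is what the fetcher actually does.

```java
// Hypothetical sketch; names are illustrative and not taken from Fetcher.java.
final class CrawlDelayCheck {

    // Unguarded comparison: treats a limit of -1 like any other limit,
    // so every positive Crawl-Delay exceeds it and the page is skipped.
    static boolean skipsUnguarded(long crawlDelaySec, long maxDelaySec) {
        return crawlDelaySec > 0 && crawlDelaySec > maxDelaySec;
    }

    // Guarded comparison: a negative limit means "never skip", which is
    // what the property description promises for -1.
    static boolean skipsGuarded(long crawlDelaySec, long maxDelaySec) {
        return maxDelaySec >= 0 && crawlDelaySec > maxDelaySec;
    }

    public static void main(String[] args) {
        long[] delays = {1, 5, 3600};
        for (long d : delays) {
            System.out.println("Crawl-Delay=" + d
                    + " unguarded skip=" + skipsUnguarded(d, -1)
                    + " guarded skip=" + skipsGuarded(d, -1));
        }
    }
}
```

If the real check resembles the unguarded form, any Crawl-Delay at all would be rejected when the limit is -1, which matches what I am seeing. The guarded form is one way the documented behavior could be expressed.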
I could point you to any number of sites with Crawl-Delay specified, but you really need to test this against a site where you have control of the robots.txt file to verify the behavior.

-----Original Message-----
From: Tejas Patil [mailto:[email protected]]
Sent: Saturday, April 27, 2013 3:17 PM
To: [email protected]
Subject: Re: Nutch 1.6 Processing of fetcher.max.crawl.delay

Thanks Iain for raising this. I will look into it. Can you kindly share URLs for which you see this behavior? I can run a crawl with those and try at my end.

On Sat, Apr 27, 2013 at 1:13 PM, Iain Lopata <[email protected]> wrote:

> Using Nutch 1.6, I am having a problem with the processing of
> fetcher.max.crawl.delay.
>
> The description for this property states that "If the Crawl-Delay in
> robots.txt is set to greater than this value (in seconds) then the
> fetcher will skip this page, generating an error report. If set to -1
> the fetcher will never skip such pages and will wait the amount of
> time retrieved from robots.txt Crawl-Delay, however long that might be."
>
> I have found that the processing is not as stated when the value is
> set to -1. If I set the value of fetcher.max.crawl.delay to -1, any
> URL on a site that has Crawl-Delay specified in the applicable section
> of robots.txt is rejected with a robots_denied(18).
>
> I am not a Java developer and I am completely new to using Nutch, but
> this looks like it may be either a documentation error for the
> property or a problem with the logic in Fetcher.java at Line 682.
>
> I can work around this by setting the property to some high value, but
> perhaps this is a problem that someone would like to look at.
>
> Happy to post in Jira if someone can confirm my assessment or if this
> is the right way to get this investigated.
>
> Thanks

