I have verified the problem by stripping it down to its most basic form
using one of my sites.  You can recreate this behavior on any site simply by
having a robots.txt file with Crawl-Delay specified (the value does not
matter) and fetcher.max.crawl.delay set to -1.  The crawl will not retrieve
the page.  Then comment out the Crawl-Delay in robots.txt and try again; it
will succeed.
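
For reference, the minimal setup I am describing looks like the following
(the delay value and file contents are just placeholders; any positive
Crawl-Delay triggers it):

    # robots.txt on the test site
    User-agent: *
    Crawl-Delay: 5

    <!-- nutch-site.xml -->
    <property>
      <name>fetcher.max.crawl.delay</name>
      <value>-1</value>
    </property>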

I could point you to any number of sites with Crawl-Delay specified, but you
really need to test this against a site where you have control of the
robots.txt file to verify the behavior.
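
For what it is worth, the behavior is exactly what you would get from an
unguarded numeric comparison against the configured maximum.  Here is a
small, self-contained sketch of what I suspect is happening around line 682
of Fetcher.java; this is plain illustrative Java with my own variable names,
not the actual Nutch source:

    public class CrawlDelayCheck {
      public static void main(String[] args) {
        // fetcher.max.crawl.delay is given in seconds; if the fetcher
        // converts it to milliseconds, -1 becomes -1000.
        int maxCrawlDelaySeconds = -1;
        long maxCrawlDelay = maxCrawlDelaySeconds * 1000L;

        // Crawl-Delay: 5 from robots.txt, in milliseconds.
        long robotsCrawlDelay = 5 * 1000L;

        // Unguarded comparison: any positive Crawl-Delay is greater
        // than a negative maximum, so the page is always skipped
        // (robots_denied), contradicting the documented -1 behavior.
        if (robotsCrawlDelay > maxCrawlDelay) {
          System.out.println("skipped: robots_denied");
        }

        // Guarding against a negative maximum would restore the
        // documented behavior of waiting however long robots.txt asks:
        if (maxCrawlDelay >= 0 && robotsCrawlDelay > maxCrawlDelay) {
          System.out.println("skipped: robots_denied");
        } else {
          System.out.println("fetch, waiting " + robotsCrawlDelay + " ms");
        }
      }
    }

If that is indeed the shape of the check, the documentation and the code
disagree, and a sign guard like the one above would be the fix.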

-----Original Message-----
From: Tejas Patil [mailto:[email protected]] 
Sent: Saturday, April 27, 2013 3:17 PM
To: [email protected]
Subject: Re: Nutch 1.6 Processing of fetcher.max.crawl.delay

Thanks Iain for raising this. I will look into it. Can you kindly share the
URLs for which you see this behavior? I can run a crawl with those and try at
my end.


On Sat, Apr 27, 2013 at 1:13 PM, Iain Lopata <[email protected]> wrote:

> Using Nutch 1.6, I am having a problem with the processing of 
> fetcher.max.crawl.delay.
>
> The description for this property states that "If the Crawl-Delay in 
> robots.txt is set to greater than this value (in seconds) then the 
> fetcher will skip this page, generating an error report. If set to -1 
> the fetcher will never skip such pages and will wait the amount of 
> time retrieved from robots.txt Crawl-Delay, however long that might be."
>
> I have found that the processing is not as stated when the value is
> set to -1.  If I set the value of fetcher.max.crawl.delay to -1, any
> URL on a site that has Crawl-Delay specified in the applicable section
> of robots.txt is rejected with a robots_denied(18).
>
> I am not a Java developer and I am completely new to using Nutch, but
> this looks like it may be either a documentation error for the
> property or a problem with the logic in Fetcher.java at line 682.
>
> I can work around this by setting the property to some high value, but 
> perhaps this is a problem that someone would like to look at.
>
> Happy to post this in Jira if someone can confirm my assessment, or if
> that is the right way to get this investigated.
>
> Thanks