Thanks Iain for raising this. I will look into it. Can you kindly share the URLs for which you see this behavior? I can run a crawl with those and try it at my end.
On Sat, Apr 27, 2013 at 1:13 PM, Iain Lopata <[email protected]> wrote:
> Using Nutch 1.6, I am having a problem with the processing of
> fetcher.max.crawl.delay.
>
> The description for this property states that "If the Crawl-Delay in
> robots.txt is set to greater than this value (in seconds) then the fetcher
> will skip this page, generating an error report. If set to -1 the fetcher
> will never skip such pages and will wait the amount of time retrieved from
> robots.txt Crawl-Delay, however long that might be."
>
> I have found that the processing is not as stated when the value is set to
> -1. If I set the value of fetcher.max.crawl.delay to -1, any URL on a site
> that has Crawl-Delay specified in the applicable section of robots.txt is
> rejected with a robots_denied(18).
>
> I am not a Java developer and I am completely new to using Nutch, but this
> looks like it may be either a documentation error for the property or a
> problem with the logic in Fetcher.java at Line 682.
>
> I can work around this by setting the property to some high value, but
> perhaps this is a problem that someone would like to look at.
>
> Happy to post in Jira if someone can confirm my assessment or if this is
> the right way to get this investigated.
>
> Thanks
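
In case it helps while I reproduce this, here is a minimal, self-contained sketch of the kind of comparison I suspect is at play. This is not the actual Fetcher.java code, just an illustration of why a configured value of -1 would cause every URL with a Crawl-Delay to be skipped if the value is converted to milliseconds and compared directly, without -1 being special-cased:

    // Illustrative sketch only -- not the actual Nutch Fetcher.java logic.
    public class CrawlDelayCheckSketch {
        public static void main(String[] args) {
            long configuredMax = -1;                      // fetcher.max.crawl.delay as set in nutch-site.xml
            long maxCrawlDelayMs = configuredMax * 1000;  // converted to ms without special-casing -1: -1000
            long robotsCrawlDelayMs = 5 * 1000;           // e.g. "Crawl-Delay: 5" in robots.txt

            if (robotsCrawlDelayMs > maxCrawlDelayMs) {
                // Any positive Crawl-Delay exceeds -1000, so the URL would be
                // skipped and reported as robots_denied instead of fetched.
                System.out.println("skip: robots_denied");
            } else {
                System.out.println("fetch after waiting " + robotsCrawlDelayMs + " ms");
            }
        }
    }

If the check does have this shape, then either special-casing -1 before the comparison or correcting the property description would resolve the mismatch Iain describes. I will confirm once I have run a crawl against the reported URLs.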

