Hi Iain, no bother at all. This was not the easiest one to find in Jira.
@Tejas, good job.
Have a great weekend.
Lewis
On Sat, Apr 27, 2013 at 1:35 PM, Iain Lopata <ilopa...@hotmail.com> wrote:

> Lewis -- Looks like a duplicate of NUTCH-1284. Sorry for not catching that
> before posting.
>
> -----Original Message-----
> From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> Sent: Saturday, April 27, 2013 3:30 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch 1.6 Processing of fetcher.max.crawl.delay
>
> Hi,
> @Tejas, you will remember that the work undertaken on NUTCH-1284 (the
> patch you submitted for it included the fix for NUTCH-1042) relates to
> this. I am not sure if the situations are identical, but by the looks
> of it they are closely linked.
> @Iain, can you look at the commentary and provide your input? Thank you
> so much.
> Also, note that this fix is not released yet; it is in the trunk and 2.x
> branches, which we hope to release soon.
> Thanks
> Lewis
>
>
> On Sat, Apr 27, 2013 at 1:16 PM, Tejas Patil <tejas.patil...@gmail.com> wrote:
>
> > Thanks Iain for raising this. I will look into it. Could you kindly
> > share the URLs for which you see this behavior? I can run a crawl with
> > those and try at my end.
> >
> >
> > On Sat, Apr 27, 2013 at 1:13 PM, Iain Lopata <ilopa...@hotmail.com> wrote:
> >
> > > Using Nutch 1.6, I am having a problem with the processing of
> > > fetcher.max.crawl.delay.
> > >
> > > The description for this property states that "If the Crawl-Delay in
> > > robots.txt is set to greater than this value (in seconds) then the
> > > fetcher will skip this page, generating an error report. If set to -1
> > > the fetcher will never skip such pages and will wait the amount of
> > > time retrieved from robots.txt Crawl-Delay, however long that might
> > > be."
> > >
> > > I have found that the processing is not as stated when the value is
> > > set to -1. If I set the value of fetcher.max.crawl.delay to -1, any
> > > URL on a site that has Crawl-Delay specified in the applicable
> > > section of robots.txt is rejected with a robots_denied(18).
> > >
> > > I am not a Java developer and I am completely new to using Nutch,
> > > but this looks like it may be either a documentation error for the
> > > property or a problem with the logic in Fetcher.java at Line 682.
> > >
> > > I can work around this by setting the property to some high value,
> > > but perhaps this is a problem that someone would like to look at.
> > >
> > > Happy to post in Jira if someone can confirm my assessment or if
> > > this is the right way to get this investigated.
> > >
> > > Thanks
>
> --
> *Lewis*

--
*Lewis*
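
P.S. For anyone following the NUTCH-1284 discussion, here is a minimal,
standalone sketch of the kind of guard that would produce the behavior Iain
describes. This is not the actual Fetcher.java source; the method and
variable names are illustrative assumptions, and the units are simplified
to milliseconds throughout:

    public class MaxCrawlDelaySketch {

        // Buggy form: with maxCrawlDelay == -1, any positive robots.txt
        // Crawl-Delay satisfies "robotsCrawlDelay > maxCrawlDelay", so
        // every such page is skipped -- the opposite of the documented
        // meaning of -1.
        static boolean shouldSkip(long robotsCrawlDelay, long maxCrawlDelay) {
            return robotsCrawlDelay > 0 && robotsCrawlDelay > maxCrawlDelay;
        }

        // Fixed form: -1 means "never skip", so the comparison only
        // applies when a non-negative limit was configured.
        static boolean shouldSkipFixed(long robotsCrawlDelay, long maxCrawlDelay) {
            return maxCrawlDelay >= 0
                    && robotsCrawlDelay > 0
                    && robotsCrawlDelay > maxCrawlDelay;
        }

        public static void main(String[] args) {
            long crawlDelay = 10000;  // robots.txt Crawl-Delay: 10 seconds
            long unlimited = -1;      // fetcher.max.crawl.delay = -1

            System.out.println("buggy: " + shouldSkip(crawlDelay, unlimited));      // true  (skipped)
            System.out.println("fixed: " + shouldSkipFixed(crawlDelay, unlimited)); // false (fetched)
        }
    }

With the buggy form, a configured value of -1 makes the comparison succeed
for any positive Crawl-Delay, so the URL ends up robots_denied instead of
never being skipped; guarding on maxCrawlDelay >= 0 restores the documented
meaning of -1. This would also explain why Iain's workaround of setting the
property to some high value behaves as expected.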