Hi Iain, no bother at all. This was not the easiest one to find in Jira.
@Tejas, good job.
Have a great weekend.
Lewis
On Sat, Apr 27, 2013 at 1:35 PM, Iain Lopata <ilopa...@hotmail.com> wrote:

> Lewis -- Looks like a duplicate of NUTCH-1284. Sorry for not catching that
> before posting.
>
> -----Original Message-----
> From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> Sent: Saturday, April 27, 2013 3:30 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch 1.6 Processing of fetcher.max.crawl.delay
>
> Hi,
> @Tejas, you will remember that the work undertaken on NUTCH-1284 (the
> patch you submitted for it included the fix for NUTCH-1042) relates to
> this. I am not sure if the situations are identical, but by the looks
> of it they are closely linked.
> @Iain, can you look at the commentary and provide your input? Thank you
> so much.
> Also, note that this fix is not released yet; it is in the trunk and 2.x
> branches, which we hope to release soon.
> Thanks
> Lewis
>
>
> On Sat, Apr 27, 2013 at 1:16 PM, Tejas Patil <tejas.patil...@gmail.com> wrote:
>
> > Thanks Iain for raising this. I will look into it. Could you kindly
> > share the URLs for which you see this behavior? I can run a crawl with
> > those and try at my end.
> >
> >
> > On Sat, Apr 27, 2013 at 1:13 PM, Iain Lopata <ilopa...@hotmail.com> wrote:
> >
> > > Using Nutch 1.6, I am having a problem with the processing of
> > > fetcher.max.crawl.delay.
> > >
> > > The description for this property states that "If the Crawl-Delay in
> > > robots.txt is set to greater than this value (in seconds) then the
> > > fetcher will skip this page, generating an error report. If set to -1
> > > the fetcher will never skip such pages and will wait the amount of
> > > time retrieved from robots.txt Crawl-Delay, however long that might
> > > be."
> > >
> > > I have found that the processing is not as stated when the value is
> > > set to -1. If I set the value of fetcher.max.crawl.delay to -1, any
> > > URL on a site that has Crawl-Delay specified in the applicable
> > > section of robots.txt is rejected with a robots_denied(18).
> > >
> > > I am not a Java developer and I am completely new to using Nutch,
> > > but this looks like it may be either a documentation error for the
> > > property or a problem with the logic in Fetcher.java at Line 682.
> > >
> > > I can work around this by setting the property to some high value,
> > > but perhaps this is a problem that someone would like to look at.
> > >
> > > Happy to post in Jira if someone can confirm my assessment or if
> > > this is the right way to get this investigated.
> > >
> > > Thanks
>
> --
> *Lewis*

--
*Lewis*
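
P.S. For anyone following the NUTCH-1284 discussion, here is a minimal,
standalone sketch of the kind of guard that would produce the behavior Iain
describes. This is not the actual Fetcher.java source; the method and
variable names are illustrative assumptions, and the units are simplified
to milliseconds throughout:

    public class MaxCrawlDelaySketch {

        // Buggy form: with maxCrawlDelay == -1, any positive robots.txt
        // Crawl-Delay satisfies "robotsCrawlDelay > maxCrawlDelay", so
        // every such page is skipped -- the opposite of the documented
        // meaning of -1.
        static boolean shouldSkip(long robotsCrawlDelay, long maxCrawlDelay) {
            return robotsCrawlDelay > 0 && robotsCrawlDelay > maxCrawlDelay;
        }

        // Fixed form: -1 means "never skip", so the comparison only
        // applies when a non-negative limit was configured.
        static boolean shouldSkipFixed(long robotsCrawlDelay, long maxCrawlDelay) {
            return maxCrawlDelay >= 0
                    && robotsCrawlDelay > 0
                    && robotsCrawlDelay > maxCrawlDelay;
        }

        public static void main(String[] args) {
            long crawlDelay = 10000;  // robots.txt Crawl-Delay: 10 seconds
            long unlimited = -1;      // fetcher.max.crawl.delay = -1

            System.out.println("buggy: " + shouldSkip(crawlDelay, unlimited));      // true  (skipped)
            System.out.println("fixed: " + shouldSkipFixed(crawlDelay, unlimited)); // false (fetched)
        }
    }

With the buggy form, a configured value of -1 makes the comparison succeed
for any positive Crawl-Delay, so the URL ends up robots_denied instead of
never being skipped; guarding on maxCrawlDelay >= 0 restores the documented
meaning of -1. This would also explain why Iain's workaround of setting the
property to some high value behaves as expected.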