As Jorge said, the interval can be set per URL in the seed file:
<URL>\tnutch.fetchInterval=86400
It is important to note that if you use AdaptiveFetchSchedule, this
interval will be overridden. In Nutch 1.6 this can be bypassed using
nutch.fetchInterval.fixed (issue NUTCH-1388), but it has not yet been
ported to Nutch 2.1 (issue NUTCH-1682).
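
Since the list of category pages is fixed, the seed file could be generated by a small script. Below is a minimal sketch in Python, assuming the interval is given in seconds and separated from the URL by a tab. The URLs, the two interval values and the helper names (seed_line, build_seed) are made up for illustration and are not part of Nutch.

```python
# Sketch: build seed-file lines with a per-URL fetch interval.
# The URLs and interval values are hypothetical examples.

CATEGORY_INTERVAL = 60 * 60            # 1 hour, in seconds (assumed value)
ARTICLE_INTERVAL = 30 * 24 * 60 * 60   # 30 days, in seconds (assumed value)


def seed_line(url, interval_seconds):
    """Return one seed line: URL, a tab, then nutch.fetchInterval=<seconds>."""
    return f"{url}\tnutch.fetchInterval={interval_seconds}"


def build_seed(category_urls, article_urls):
    """Return seed-file text with category pages first, then article pages."""
    lines = [seed_line(u, CATEGORY_INTERVAL) for u in category_urls]
    lines += [seed_line(u, ARTICLE_INTERVAL) for u in article_urls]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a two-line seed file for one category page and one article page.
    print(build_seed(
        ["http://news.example.com/sport/"],
        ["http://news.example.com/sport/some-article.html"],
    ), end="")
```

The output can be redirected to the seed file that is passed to the injector. With AdaptiveFetchSchedule enabled, the caveat above still applies.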



2014-02-18 9:53 GMT+01:00 Markus Jelsma <[email protected]>:

> Hi
>
> We do something similar using a parse filter plugin and a custom
> scheduler. The parse filter plugin contains an SVM classifier that gives a
> high score to hub pages, i.e. pages we consider unimportant: pages with no
> content, overviews, lists etc. This score is passed back to the CrawlDatum
> and used in the scheduler to adjust the fetch time, partially based on the
> hub score.
>
> Markus
>
> -----Original message-----
> > From:Jorge Luis Betancourt González <[email protected]>
> > Sent: Tuesday 18th February 2014 0:48
> > To: [email protected]
> > Subject: Re: Setting different fetch interval for some pages
> >
> > If I remember correctly, there was a patch on the list to accomplish
> > this by specifying the fetch interval in the seed file. It could also
> > serve as a base for implementing a custom plugin for your specific use
> > case.
> >
> > ----- Original Message -----
> > From: "Mateusz Zakarczemny" <[email protected]>
> > To: [email protected]
> > Sent: Monday, February 17, 2014 10:14:14 AM
> > Subject: Setting different fetch interval for some pages
> >
> > Hi,
> >
> > I'm going to crawl a set of news sites. Pages on those sites can be
> > divided into two types: category pages and article pages. I would like to
> > fetch category pages more frequently than article pages. The list of
> > categories is fairly fixed, so I could mark them manually.
> >
> > I know I could get similar behaviour using AdaptiveFetchSchedule, but it
> > requires some time to adjust the fetch time. This doesn't satisfy me
> > because I already know before the fetch how often pages should be
> > recrawled.
> >
> > I wonder if it is possible in Nutch to set different fetch intervals for
> > different pages. I know that I could extend AbstractFetchSchedule and
> > implement this behaviour manually. This would require adding an extra
> > field to the WebPage object that indicates what type of page we are
> > dealing with. Is it possible to add such a field to the WebPage object?
> > Maybe there is another approach?
> >
> > Regards,
> > Mateusz
> >
> >
>
