As Jorge said, it can be parametrized per URL in the seed file (\t is a tab character):

<URL>\tnutch.fetchInterval=86400

Note that if we use AdaptiveFetchSchedule, this interval will be overridden. In Nutch 1.6 that can be bypassed using nutch.fetchInterval.fixed (issue NUTCH-1388), but it has not yet been ported to Nutch 2.1 (issue NUTCH-1682).
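To make the seed line format concrete, here is a rough sketch of how such a line splits into a URL and tab-separated key=value metadata. It is not Nutch's actual injector code: the class SeedLineSketch, its methods, and the example.com URL are made up for illustration, and the default of 2592000 seconds (30 days) is only an assumed fallback.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT the Nutch injector.
class SeedLineSketch {

    // Returns the URL, i.e. everything before the first tab.
    static String parseUrl(String line) {
        return line.split("\t")[0].trim();
    }

    // Collects the tab-separated key=value pairs that follow the URL.
    static Map<String, String> parseMetadata(String line) {
        Map<String, String> meta = new LinkedHashMap<>();
        String[] parts = line.split("\t");
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                String key = parts[i].substring(0, eq).trim();
                String value = parts[i].substring(eq + 1).trim();
                meta.put(key, value);
            }
        }
        return meta;
    }

    // Reads nutch.fetchInterval (seconds) or falls back to the given default.
    static int fetchInterval(String line, int defaultSeconds) {
        String raw = parseMetadata(line).get("nutch.fetchInterval");
        if (raw == null) {
            return defaultSeconds;
        }
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return defaultSeconds; // malformed value: keep the default
        }
    }

    public static void main(String[] args) {
        int assumedDefault = 2592000; // assumed fallback: 30 days in seconds
        String category = "http://example.com/news/\tnutch.fetchInterval=86400";
        String article = "http://example.com/news/some-article.html";
        System.out.println(parseUrl(category) + " -> " + fetchInterval(category, assumedDefault));
        System.out.println(parseUrl(article) + " -> " + fetchInterval(article, assumedDefault));
    }
}
```

With this layout the category pages would each get their own seed line with a short interval, while article pages carry no metadata and fall back to the default.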

2014-02-18 9:53 GMT+01:00 Markus Jelsma <[email protected]>:

> Hi
>
> We do something similar using a parse filter plugin and a custom
> scheduler. The parse filter plugin contains an SVM classifier that gives a
> high score to hub pages, or pages we consider not important, no content,
> overviews, lists etc. This score is passed back to the CrawlDatum and used
> in the scheduler to adjust fetch time partially based on the hub score.
>
> Markus
>
> -----Original message-----
> > From: Jorge Luis Betancourt González <[email protected]>
> > Sent: Tuesday 18th February 2014 0:48
> > To: [email protected]
> > Subject: Re: Setting different fetch interval for some pages
> >
> > If I remember correctly, there was a patch on the list to accomplish
> > this by specifying the fetch interval in the seed file. It could also
> > serve as a base for implementing a custom plugin for your specific use
> > case.
> >
> > ----- Original Message -----
> > From: "Mateusz Zakarczemny" <[email protected]>
> > To: [email protected]
> > Sent: Monday, February 17, 2014 10:14:14 AM
> > Subject: Setting different fetch interval for some pages
> >
> > Hi,
> >
> > I'm going to crawl a set of news sites. Pages on those sites can be
> > divided into two types: category pages and article pages. I would like to
> > fetch category pages more frequently than article pages. The list of
> > categories is rather fixed, so I could mark them manually.
> >
> > I know I could get similar behaviour using AdaptiveFetchSchedule, but it
> > requires some time to adjust the fetch time. This doesn't satisfy me
> > because before the fetch I already know how often pages should be
> > re-crawled.
> >
> > I wonder if it is possible in Nutch to set different fetch intervals for
> > sites. I know that I could extend AbstractFetchSchedule and implement
> > this behaviour manually. This would require adding an extra field to the
> > WebPage object to indicate what type of page we are dealing with. Is it
> > possible to add such a field to the WebPage object? Maybe there is
> > another approach?
> >
> > Regards,
> > Mateusz
> >
> > ________________________________________________________________________________________________
> > III International Winter School at UCI, 17 to 28 February 2014.
> > See www.uci.cu

