I'm not very experienced with this, but it may help a bit. The fetchTime field is used by the Mapper to decide whether it is time to fetch a given URL. For a well-written overview, see http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
Also see the Nutch API documentation for AbstractFetchSchedule at
http://nutch.apache.org/apidocs-2.0/org/apache/nutch/crawl/AbstractFetchSchedule.html#setFetchSchedule%28java.lang.String,%20org.apache.nutch.storage.WebPage,%20long,%20long,%20long,%20long,%20int%29

The default re-fetch schedule is somewhat simplistic. Whether or not the page has changed, the fetchInterval remains unchanged, and the updated page's fetchTime is always set to fetchTime + fetchInterval * 1000 (with the default interval, that is a month in Nutch 2.0). See
http://nutch.apache.org/apidocs-2.0/org/apache/nutch/crawl/DefaultFetchSchedule.html

A better implementation for most cases is the AdaptiveFetchSchedule:
http://nutch.apache.org/apidocs-2.0/org/apache/nutch/crawl/AdaptiveFetchSchedule.html

The FetchSchedule implementation can be changed by copying the db.fetch.schedule.class property from conf/nutch-default.xml to conf/nutch-site.xml and changing its value.

-----Original Message-----
From: Sourajit Basak [mailto:[email protected]]
Sent: Tuesday, August 14, 2012 5:21 PM
To: [email protected]
Subject: Re: adaptive fetches

On a second thought, it doesn't seem that the 'generate' phase checks for the modified timestamp of every page. It seems to be pre-calculated by a previous generate-fetch-update cycle. Experienced guys can comment on how a next fetch time is calculated.

From the crawldb output, it seems to have added a month to the last fetch time, though I only checked my target site's home pages.

On Tue, Aug 14, 2012 at 1:26 PM, Sourajit Basak <[email protected]> wrote:

> What is "adaptive fetch schedule" as dictated by the property
> *db.fetch.schedule.adaptive.sync_delta*? If this is set to true how
> does property *db.fetch.interval.default* come to effect?
>
> I guess the 'generate' phase checks for the modified timestamp of
> every page in the crawldb. If a page does change, Nutch decides
> whether to re-fetch based on the property
> *db.fetch.schedule.adaptive.sync_delta_rate*. Is this assumption correct?
>
> If yes, what does the default fetch interval mean in this context. The
> re-fetch seems to be affected for such cases by how often I run "generate".
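
P.S. To make the difference between the two schedules in my reply a bit more concrete, here is a rough Java sketch. It is not the Nutch source: the class FetchScheduleSketch and its method names are made up for illustration, and the constants are what I believe the defaults in conf/nutch-default.xml to be, so please check them against your own copy. The adaptive part only shows the general idea (shorten the interval when a page changed, lengthen it when it did not, and clamp the result); it leaves out the sync_delta adjustment entirely.

```java
public class FetchScheduleSketch {

    // Assumed defaults; verify against conf/nutch-default.xml.
    static final int DEFAULT_INTERVAL_SEC = 30 * 24 * 3600;  // db.fetch.interval.default (about a month)
    static final double INC_RATE = 0.4;                      // db.fetch.schedule.adaptive.inc_rate (assumed)
    static final double DEC_RATE = 0.2;                      // db.fetch.schedule.adaptive.dec_rate (assumed)
    static final int MIN_INTERVAL_SEC = 60;                  // assumed lower bound
    static final int MAX_INTERVAL_SEC = 365 * 24 * 3600;     // assumed upper bound

    /** Default schedule: the interval never changes, so the next
     *  fetch time is simply fetchTime + fetchInterval * 1000 (ms). */
    static long defaultNextFetch(long fetchTimeMs, int intervalSec) {
        return fetchTimeMs + intervalSec * 1000L;
    }

    /** Adaptive idea: shrink the interval if the page was modified,
     *  grow it if it was not, then clamp to [MIN, MAX]. */
    static int adaptInterval(int intervalSec, boolean modified) {
        double next = modified
                ? intervalSec * (1.0 - DEC_RATE)
                : intervalSec * (1.0 + INC_RATE);
        next = Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, next));
        return (int) Math.round(next);
    }

    /** A URL is due for fetching once its fetchTime is not in the future. */
    static boolean isDue(long fetchTimeMs, long nowMs) {
        return fetchTimeMs <= nowMs;
    }

    public static void main(String[] args) {
        long lastFetch = 0L;  // pretend the page was fetched at time zero
        long next = defaultNextFetch(lastFetch, DEFAULT_INTERVAL_SEC);
        System.out.println("default next fetch (ms): " + next);
        System.out.println("due one day later? " + isDue(next, 24L * 3600 * 1000));

        // With the adaptive idea, the interval follows the page's behaviour.
        int interval = DEFAULT_INTERVAL_SEC;
        interval = adaptInterval(interval, true);   // page changed: fetch sooner next time
        System.out.println("interval after a change (s): " + interval);
        interval = adaptInterval(interval, false);  // page unchanged: back off
        System.out.println("interval after no change (s): " + interval);
    }
}
```

With the default schedule the "one month later" behaviour Sourajit saw in the crawldb output falls straight out of defaultNextFetch; how often "generate" is run only decides when a URL that is already due actually gets picked up.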

