Hi Joe,

> Ideally, it should take higher priority than the default interval. This is
> particularly important for sites such as cnn.com, where the leaf page
> doesn't really change, but the portal page is updated all the time.

AdaptiveFetchSchedule does exactly this: if a page is found to be modified
when it is re-fetched, its fetch interval is decreased; if it is not
modified, the interval is increased. You can enable it with:

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>

There are a couple of further properties to fine-tune
AdaptiveFetchSchedule, mainly

  db.fetch.schedule.adaptive.min_interval
  db.fetch.schedule.adaptive.max_interval

Sebastian

On 06/22/2013 05:06 AM, Joe Zhang wrote:
> Thanks, guys. So, just to confirm, lastModified is not used in the
> fetching logic at all.
>
> Ideally, it should take higher priority than the default interval. This is
> particularly important for sites such as cnn.com, where the leaf page
> doesn't really change, but the portal page is updated all the time.
>
> On Fri, Jun 21, 2013 at 7:40 PM, Tejas Patil <tejas.patil...@gmail.com> wrote:
>
>> On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang <smartag...@gmail.com> wrote:
>>
>>> Sorry, Nutch is certainly aware of page modification, and it does
>>> capture lastModified.
>>
>> Nutch does capture the "last modified" field, but I am not sure if its
>> value is used afterwards. I remember that it was not being used for any
>> logic in older versions, but I need to confirm whether the code has been
>> modified to take it into account.
>>
>>> The real question is, can Nutch get lastModified of a page
>>> before fetching, and use it to make fetching decisions (e.g., whether
>>> or not to override the default interval)?
>>
>> No. Nutch won't look up the lastModified of a page before fetching its
>> content.
>>
>>> On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang <smartag...@gmail.com> wrote:
>>>
>>>> If I don't change the default value of db.fetch.interval.default,
>>>> which is 30 days, does it mean that the URL in the db won't be
>>>> refetched before the due time even if it has been modified? In other
>>>> words, is Nutch aware of page modification?
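
For illustration, the adaptive rule described in the reply (shrink the
interval when a page was found modified, grow it when it was not, and keep
it between the min and max interval) can be sketched roughly as below. This
is a simplified sketch, not the actual Nutch code: the class name, the
method name, and the rate and bound values are assumptions chosen for the
example.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveIntervalSketch:
    """Toy model of an adaptive re-fetch interval (hypothetical, not Nutch's code)."""

    # All four values are illustrative assumptions, in seconds or as fractions.
    inc_rate: float = 0.4                    # growth factor when the page is unchanged
    dec_rate: float = 0.2                    # shrink factor when the page was modified
    min_interval: float = 60.0               # lower bound on the interval
    max_interval: float = 365 * 24 * 3600.0  # upper bound on the interval

    def next_interval(self, interval: float, modified: bool) -> float:
        """Return the new fetch interval after one re-fetch."""
        if modified:
            # Page changed since the last fetch: come back sooner.
            interval *= 1.0 - self.dec_rate
        else:
            # Page unchanged: wait longer before the next fetch.
            interval *= 1.0 + self.inc_rate
        # Clamp to the configured bounds.
        return max(self.min_interval, min(self.max_interval, interval))


if __name__ == "__main__":
    schedule = AdaptiveIntervalSketch()
    interval = 30 * 24 * 3600.0  # start from a 30-day interval
    # A portal page that changes on every fetch converges towards short intervals.
    for _ in range(5):
        interval = schedule.next_interval(interval, modified=True)
    print(f"interval after 5 modified fetches: {interval / 86400:.1f} days")
```

With a rule like this, a frequently updated portal page drifts towards
min_interval, while a leaf page that never changes drifts towards
max_interval.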