Yes, I've done that. Thanks.
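
For anyone reading this thread in the archive, here is a minimal sketch of the behaviour Sebastian describes below: the interval shrinks when a re-fetched page turns out to be modified and grows when it does not, clamped between a minimum and a maximum. This is not Nutch's actual code. The function name, the rate parameters (inc_rate, dec_rate), and all default values are illustrative assumptions; only the min/max interval idea comes from the properties named in the thread.

```python
DAY = 86400.0  # seconds in one day


def next_interval(interval, modified, inc_rate=0.4, dec_rate=0.2,
                  min_interval=60.0, max_interval=365 * DAY):
    """Return the next fetch interval in seconds.

    Simplified model of an adaptive schedule. The rates and bounds
    are assumed values for illustration, not Nutch defaults.
    """
    if modified:
        # Page changed since the last fetch: come back sooner.
        interval *= (1.0 - dec_rate)
    else:
        # Page unchanged: back off and fetch less often.
        interval *= (1.0 + inc_rate)
    # Keep the result inside the configured bounds.
    return max(min_interval, min(max_interval, interval))


if __name__ == "__main__":
    # Start from a 30-day interval and replay a hypothetical fetch history.
    interval = 30 * DAY
    for modified in (True, True, False):
        interval = next_interval(interval, modified)
        print(f"modified={modified}: next interval is {interval / DAY:.1f} days")
```

With a model like this, a frequently updated portal page drifts toward the minimum interval, while a leaf page that never changes drifts toward the maximum, which is the cnn.com case discussed below.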

On Sun, Jun 23, 2013 at 9:53 AM, Sebastian Nagel <[email protected]> wrote:

> Hi Joe,
>
> > Ideally, it should take higher priority than the default interval. This is
> > particularly important for sites such as cnn.com, where the leaf page
> > doesn't really change, but the portal page is updated all the time.
>
> AdaptiveFetchSchedule does exactly this: if a page is found modified when
> it is re-fetched, the fetch interval is decreased; if it's not modified,
> it's increased.
>
> You can enable it by:
> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
> </property>
>
> There are a couple of further properties to fine-tune AdaptiveFetchSchedule,
> mainly
>   db.fetch.schedule.adaptive.min_interval
>   db.fetch.schedule.adaptive.max_interval
>
> Sebastian
>
> On 06/22/2013 05:06 AM, Joe Zhang wrote:
> > Thanks, guys. So, just to confirm, lastModified is not used in the fetching
> > logic at all.
> >
> > Ideally, it should take higher priority than the default interval. This is
> > particularly important for sites such as cnn.com, where the leaf page
> > doesn't really change, but the portal page is updated all the time.
> >
> > On Fri, Jun 21, 2013 at 7:40 PM, Tejas Patil <[email protected]> wrote:
> >
> >> On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang <[email protected]> wrote:
> >>
> >>> Sorry, Nutch is certainly aware of page modification, and it does capture
> >>> lastModified.
> >>
> >> Nutch does capture the "last modified" field, but I am not sure if its
> >> value is used afterwards. I remember that it was not being used for any
> >> logic in older versions, but I need to confirm whether the code has been
> >> modified to take it into account.
> >>
> >>> The real question is, can Nutch get lastModified of a page
> >>> before fetching, and use it to make fetching decisions (e.g., whether or
> >>> not to override the default interval)?
> >>
> >> No. Nutch won't look up the lastModified of a page before fetching its
> >> content.
> >>
> >>> On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang <[email protected]> wrote:
> >>>
> >>>> If I don't change the default value of db.fetch.interval.default, which is
> >>>> 30 days, does it mean that the URL in the db won't be refetched before the
> >>>> due time even if it has been modified? In other words, is Nutch aware of
> >>>> page modification?

