Thanks, guys. So, just to confirm, lastModified is not used in the fetching logic at all.
Ideally, it should take higher priority than the default interval. This is particularly important for sites such as cnn.com, where the leaf pages don't really change but the portal page is updated all the time.

On Fri, Jun 21, 2013 at 7:40 PM, Tejas Patil <[email protected]> wrote:

> On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang <[email protected]> wrote:
>
> > Sorry, Nutch is certainly aware of page modification, and it does capture
> > lastModified.
>
> Nutch does capture the "last modified" field, but I am not sure if its
> value is used later on. I remember that it was not being used for any logic in
> older versions, but I need to confirm whether the code has been modified to take
> that into account.
>
> > The real question is, can Nutch get lastModified of a page
> > before fetching, and use it to make fetching decisions (e.g., whether or
> > not to override the default interval)?
>
> No. Nutch won't look up the lastModified of a page before fetching its
> content.
>
> > On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang <[email protected]> wrote:
> >
> > > If I don't change the default value of db.fetch.interval.default, which is
> > > 30 days, does it mean that the URL in the db won't be refetched before the
> > > due time even if it has been modified? In other words, is Nutch aware of
> > > page modification?

