Thanks.
On Fri, Jun 21, 2013 at 8:52 PM, Tejas Patil <[email protected]> wrote:

> I just checked the current code and it seems to me that lastModified
> (aka "Modified time" in the CrawlDatum class) is not used for any
> further logic. If you want to customize the fetch interval for a subset
> of pages, do as Lewis suggested, i.e. specify a customized fetch
> interval for the main pages in the inject command [0].
>
> [0] : http://wiki.apache.org/nutch/bin/nutch_inject
>
> On Fri, Jun 21, 2013 at 8:06 PM, Joe Zhang <[email protected]> wrote:
>
> > Thanks, guys. So, just to confirm, lastModified is not used in the
> > fetching logic at all.
> >
> > Ideally, it should take higher priority than the default interval.
> > This is particularly important for sites such as cnn.com, where the
> > leaf pages don't really change, but the portal page is updated all
> > the time.
> >
> > On Fri, Jun 21, 2013 at 7:40 PM, Tejas Patil <[email protected]> wrote:
> >
> > > On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang <[email protected]> wrote:
> > >
> > > > Sorry, Nutch is certainly aware of page modification, and it
> > > > does capture lastModified.
> > >
> > > Nutch does capture the "last modified" field, but I am not sure if
> > > its value is used later. I remember that it was not being used for
> > > any logic in older versions, but I need to confirm whether the code
> > > has been modified to take it into account.
> > >
> > > > The real question is, can Nutch get lastModified of a page
> > > > before fetching, and use it to make fetching decisions (e.g.,
> > > > whether or not to override the default interval)?
> > >
> > > No. Nutch won't look up the lastModified of a page before fetching
> > > its content.
> > >
> > > > On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang <[email protected]> wrote:
> > > >
> > > > > If I don't change the default value of
> > > > > db.fetch.interval.default, which is 30 days, does it mean that
> > > > > the URL in the db won't be refetched before the due time even
> > > > > if it has been modified? In other words, is Nutch aware of page
> > > > > modification?
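For reference, here is a rough sketch of how the seed list for the inject step suggested above could be generated. It assumes the Injector accepts tab-separated per-URL metadata of the form nutch.fetchInterval=<seconds>; that key is recalled from the inject documentation and should be checked against [0] for the Nutch version in use. The URLs, the helper names, and the one-day interval are made up for illustration.

```python
# Sketch only: builds seed-list lines for "bin/nutch inject".
# ASSUMPTION: the Injector reads tab-separated metadata after the URL and
# understands the key "nutch.fetchInterval" (value in seconds). Verify this
# against the inject documentation for your Nutch version.

SECONDS_PER_DAY = 24 * 60 * 60


def seed_line(url, fetch_interval_days=None):
    """Return one seed-list line, with an optional custom fetch interval."""
    if fetch_interval_days is None:
        # No override: the URL falls back to db.fetch.interval.default.
        return url
    seconds = fetch_interval_days * SECONDS_PER_DAY
    return f"{url}\tnutch.fetchInterval={seconds}"


def render_seed_file(entries):
    """Render (url, days-or-None) pairs as the text of a seed file."""
    return "\n".join(seed_line(url, days) for url, days in entries) + "\n"


# Hypothetical example: refetch the portal page daily, leave the leaf page
# on the default interval.
EXAMPLE_ENTRIES = [
    ("http://www.cnn.com/", 1),
    ("http://www.cnn.com/some/leaf/page.html", None),
]
EXAMPLE_SEED_TEXT = render_seed_file(EXAMPLE_ENTRIES)
```

The resulting text would be saved into the seed directory passed to the inject command, so that only the portal pages carry a shorter interval while everything else keeps the default.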

