I just checked the current code, and it seems to me that lastModified
(aka "Modified time" in the CrawlDatum class) is not used for any further
logic. If you want to customize the fetch interval for a subset of pages,
do as Lewis suggested, i.e. specify a customized fetch interval for the
main pages in the inject command [0].

[0] : http://wiki.apache.org/nutch/bin/nutch_inject
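
For illustration, here is a minimal sketch of how such a seed list could be
generated. It assumes the Injector's seed-file metadata format (URL, then
tab-separated key=value pairs) and the nutch.fetchInterval key, whose value
is in seconds. The helper names, the output file name, and the one-day
interval for the portal page are made up for this example.

```python
# Sketch: build seed-file lines for "bin/nutch inject" with per-URL fetch intervals.
# Assumption: the Injector reads "URL<TAB>key=value" metadata, and
# nutch.fetchInterval is given in seconds.

SECONDS_PER_DAY = 24 * 60 * 60


def seed_line(url, fetch_interval_days=None):
    """Return one seed-file line, optionally carrying a custom fetch interval."""
    if fetch_interval_days is None:
        # No metadata: the URL gets db.fetch.interval.default.
        return url
    interval_seconds = fetch_interval_days * SECONDS_PER_DAY
    return f"{url}\tnutch.fetchInterval={interval_seconds}"


def write_seeds(path, lines):
    """Write the seed lines to a file, one URL per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Hypothetical choice: refetch the portal page daily.
    seeds = [seed_line("http://www.cnn.com/", fetch_interval_days=1)]
    write_seeds("seed.txt", seeds)  # hypothetical output file
```

Note that this only affects the URLs listed in the seed file. Pages discovered
from them later still get the default interval.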


On Fri, Jun 21, 2013 at 8:06 PM, Joe Zhang <[email protected]> wrote:

> Thanks, guys. So, just to confirm, lastModified is not used in the fetching
> logic at all.
>
> Ideally, it should take higher priority than the default interval. This is
> particularly important for sites such as cnn.com, where the leaf pages
> don't really change, but the portal page is updated all the time.
>
> On Fri, Jun 21, 2013 at 7:40 PM, Tejas Patil <[email protected]> wrote:
>
> > On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang <[email protected]> wrote:
> >
> > > Sorry, Nutch is certainly aware of page modification, and it does
> > > capture lastModified.
> >
> > Nutch does capture the "last modified" field, but I am not sure if its
> > value is used later on. I remember that it was not being used for any
> > logic in older versions, but I need to confirm whether the code has been
> > modified to take it into account.
> >
> > > The real question is, can Nutch get lastModified of a page
> > > before fetching, and use it to make fetching decisions (e.g., whether
> > > or not to override the default interval)?
> > >
> >
> > No. Nutch won't look up the lastModified of a page before fetching its
> > content.
> >
> > >
> > >
> > > On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang <[email protected]> wrote:
> > >
> > > > If I don't change the default value of db.fetch.interval.default,
> > > > which is 30 days, does it mean that the URL in the db won't be
> > > > refetched before the due time even if it has been modified? In other
> > > > words, is Nutch aware of page modification?
> > > >
> > >
> >
>
