Thanks.

On Fri, Jun 21, 2013 at 8:52 PM, Tejas Patil <[email protected]> wrote:

> I just checked the current code, and it seems to me that lastModified
> (aka "Modified time" in the CrawlDatum class) is not used for any further
> logic. If you want to customize the fetch interval for a subset of pages,
> do as Lewis suggested, i.e. specify a customized fetch interval for the
> main pages in the inject command [0].
>
> [0] : http://wiki.apache.org/nutch/bin/nutch_inject
>
>
> On Fri, Jun 21, 2013 at 8:06 PM, Joe Zhang <[email protected]> wrote:
>
> > Thanks, guys. So, just to confirm, lastModified is not used in the
> > fetching logic at all.
> >
> > Ideally, it should take higher priority than the default interval. This
> > is particularly important for sites such as cnn.com, where the leaf
> > pages don't really change, but the portal page is updated all the time.
> >
> > On Fri, Jun 21, 2013 at 7:40 PM, Tejas Patil <[email protected]> wrote:
> >
> > > On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang <[email protected]> wrote:
> > >
> > > > Sorry, Nutch is certainly aware of page modification, and it does
> > > > capture lastModified.
> > >
> > > Nutch does capture the "last modified" field, but I am not sure
> > > whether its value is used afterwards. I remember that it was not used
> > > for any logic in older versions, but I need to confirm whether the code
> > > has been modified to take it into account.
> > >
> > > > The real question is, can Nutch get the lastModified of a page
> > > > before fetching, and use it to make fetching decisions (e.g.,
> > > > whether or not to override the default interval)?
> > > >
> > >
> > > No. Nutch won't look up the lastModified of a page before fetching its
> > > content.
> > >
> > > >
> > > >
> > > > On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang <[email protected]> wrote:
> > > >
> > > > > If I don't change the default value of db.fetch.interval.default,
> > > > > which is 30 days, does it mean that the URL in the db won't be
> > > > > refetched before the due time even if it has been modified? In other
> > > > > words, is Nutch aware of page modification?
> > > > >
> > > >
> > >
> >
>
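
For reference, the per-URL interval suggested above can be set in the seed
file passed to inject: the Injector reads optional tab-separated metadata
after each URL, and the reserved key nutch.fetchInterval (in seconds)
overrides db.fetch.interval.default for that URL. A minimal Python sketch
that generates such a seed file follows. The helper names, the one-day
interval, and the example.com leaf URL are illustrative assumptions, not
something taken from this thread.

```python
DAY = 24 * 60 * 60  # Nutch fetch intervals are expressed in seconds


def seed_line(url, fetch_interval_days=None):
    """Build one seed-file line (hypothetical helper).

    With no interval, the URL is written alone and Nutch falls back to
    db.fetch.interval.default. Otherwise the reserved metadata key
    nutch.fetchInterval is appended, separated from the URL by a tab.
    """
    if fetch_interval_days is None:
        return url
    seconds = int(fetch_interval_days * DAY)
    return f"{url}\tnutch.fetchInterval={seconds}"


def build_seed_file(portal_urls, leaf_urls, portal_interval_days=1):
    """Return seed-file text: portal pages get a short custom interval,
    leaf pages keep the default (assumed policy, for illustration only)."""
    lines = [seed_line(u, portal_interval_days) for u in portal_urls]
    lines += [seed_line(u) for u in leaf_urls]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = build_seed_file(
        portal_urls=["http://www.cnn.com/"],
        leaf_urls=["http://www.example.com/some-article.html"],  # placeholder
    )
    # Write the result to a file under the seed directory given to
    # bin/nutch inject; here it is only printed.
    print(text, end="")
```

As far as I know, only the injected URLs pick up the custom interval; pages
discovered later through outlinks still start with the default.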
