Yes, I've done that. Thanks.

On Sun, Jun 23, 2013 at 9:53 AM, Sebastian Nagel <[email protected]> wrote:

> Hi Joe,
>
> > Ideally, it should take higher priority than the default interval. This
> > is particularly important for sites such as cnn.com, where the leaf page
> > doesn't really change, but the portal page is updated all the time.
>
> AdaptiveFetchSchedule does exactly this: if a page is found modified when
> it is re-fetched, the fetch interval is decreased; if it is not modified,
> the interval is increased.
>
> You can enable it by:
>  <property>
>   <name>db.fetch.schedule.class</name>
>   <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
>  </property>
>
> There are a couple of further properties to fine-tune
> AdaptiveFetchSchedule, mainly:
>  db.fetch.schedule.adaptive.min_interval
>  db.fetch.schedule.adaptive.max_interval
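
For illustration, the two properties named above might be set in
nutch-site.xml roughly as follows. This is only a sketch: the values are
made-up examples rather than recommendations, and it assumes the intervals
are given in seconds, as with db.fetch.interval.default.

```xml
<!-- Illustrative values only; assumes both intervals are in seconds. -->
<property>
 <name>db.fetch.schedule.adaptive.min_interval</name>
 <!-- assumed example: never re-fetch more often than once per hour -->
 <value>3600.0</value>
</property>
<property>
 <name>db.fetch.schedule.adaptive.max_interval</name>
 <!-- assumed example: re-fetch at least once every 30 days -->
 <value>2592000.0</value>
</property>
```

With bounds like these, a frequently changing portal page could shrink
toward the minimum interval, while a static leaf page would drift toward
the maximum.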
>
> Sebastian
>
> On 06/22/2013 05:06 AM, Joe Zhang wrote:
> > Thanks, guys. So, just to confirm, lastModified is not used in the fetching
> > logic at all.
> >
> > Ideally, it should take higher priority than the default interval. This
> > is particularly important for sites such as cnn.com, where the leaf page
> > doesn't really change, but the portal page is updated all the time.
> >
> > On Fri, Jun 21, 2013 at 7:40 PM, Tejas Patil <[email protected]> wrote:
> >
> >> On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang <[email protected]> wrote:
> >>
> >>> Sorry, Nutch is certainly aware of page modification, and it does
> >>> capture lastModified.
> >>
> >> Nutch does capture the "last modified" field, but I am not sure whether
> >> its value is used later on. I remember that it was not used for any logic
> >> in older versions, but I need to confirm whether the code has since been
> >> modified to take it into account.
> >>
> >>> The real question is: can Nutch get the lastModified of a page
> >>> before fetching, and use it to make fetching decisions (e.g., whether
> >>> or not to override the default interval)?
> >>>
> >>
> >> No. Nutch won't look up the lastModified of a page before fetching its
> >> content.
> >>
> >>>
> >>>
> >>> On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang <[email protected]> wrote:
> >>>
> >>>> If I don't change the default value of db.fetch.interval.default, which
> >>>> is 30 days, does it mean that the URL in the db won't be refetched before
> >>>> the due time even if it has been modified? In other words, is Nutch
> >>>> aware of page modification?
> >>>>
> >>>
> >>
> >
>
>
