Hi Joe,

> Ideally, it should take higher priority than the default interval. This is
> particularly important for sites such as cnn.com, where the leaf page
> doesn't really change, but the portal page is updated all the time.

AdaptiveFetchSchedule does exactly this: if a page is found to be modified
when it is re-fetched, its fetch interval is decreased; if it is not
modified, the interval is increased.

You can enable it by adding this to conf/nutch-site.xml:
 <property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
 </property>

A few further properties fine-tune AdaptiveFetchSchedule, mainly the lower
and upper bounds on the fetch interval:
 db.fetch.schedule.adaptive.min_interval
 db.fetch.schedule.adaptive.max_interval
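
To illustrate the idea, here is a simplified sketch of how such an adaptive
interval could behave. It is not Nutch's actual code: the class name, the
method name and all constants are hypothetical and chosen only for
illustration. It shrinks the interval when a page was modified, grows it
otherwise, and keeps the result within a minimum and a maximum.

```java
public class AdaptiveIntervalSketch {
    // Illustrative values only (assumptions, not taken from the Nutch configuration).
    static final double INC_RATE = 0.4;               // growth factor when the page is unchanged
    static final double DEC_RATE = 0.2;               // shrink factor when the page was modified
    static final double MIN_INTERVAL = 60.0;          // lower bound, in seconds
    static final double MAX_INTERVAL = 365 * 86400.0; // upper bound, in seconds

    // Returns the next fetch interval (seconds) given the current one and
    // whether the page was found modified on the last re-fetch.
    static double nextInterval(double interval, boolean modified) {
        double next = modified
                ? interval * (1.0 - DEC_RATE)   // modified: re-fetch sooner
                : interval * (1.0 + INC_RATE);  // unchanged: re-fetch later
        // Clamp to the configured bounds.
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, next));
    }

    public static void main(String[] args) {
        double interval = 30 * 86400.0; // start from a 30-day interval
        boolean[] observed = {true, true, false};
        for (boolean modified : observed) {
            interval = nextInterval(interval, modified);
            System.out.printf("modified=%b -> next interval %.1f days%n",
                    modified, interval / 86400.0);
        }
    }
}
```

With a scheme like this, a frequently changing portal page drifts towards
the minimum interval, while a static leaf page drifts towards the maximum.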

Sebastian

On 06/22/2013 05:06 AM, Joe Zhang wrote:
> Thanks, guys. So, just to confirm, lastModified is not used in the fetching
> logic at all.
> 
> Ideally, it should take higher priority than the default interval. This is
> particularly important for sites such as cnn.com, where the leaf page
> doesn't really change, but the portal page is updated all the time.
> 
> On Fri, Jun 21, 2013 at 7:40 PM, Tejas Patil <tejas.patil...@gmail.com> wrote:
> 
>> On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang <smartag...@gmail.com> wrote:
>>
>>> Sorry, Nutch is certainly aware of page modification, and it does capture
>>> lastModified.
>>
>> Nutch does capture the "last modified" field, but I am not sure whether its
>> value is used later on. I remember that it was not used for any logic in
>> older versions, but I need to confirm whether the code has since been
>> changed to take it into account.
>>
>>> The real question is, can Nutch get lastModified of a page
>>> before fetching, and use it to make fetching decisions (e.g., whether or
>>> not to override the default interval)?
>>>
>>
>> No. Nutch won't look up the lastModified of a page before fetching its
>> content.
>>
>>>
>>>
>>> On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang <smartag...@gmail.com> wrote:
>>>
>>>> If I don't change the default value of db.fetch.interval.default, which is
>>>> 30 days, does it mean that the URL in the db won't be refetched before the
>>>> due time even if it has been modified? In other words, is Nutch aware of
>>>> page modification?
>>>>
>>>
>>
> 
