Thanks for that hint, which answers my original question. For even better performance, I would prefer to set the CrawlDatum's fetchInterval depending on the parsed contents of, say, an XML feed file: if the last entries are temporally close together, I want a shorter fetchInterval than if they lie further apart. Where would the right place be to set that?
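
To make clearer what I mean, here is an untested sketch of roughly what I'm imagining. The class name, the metadata key and the idea of a feed-parsing plugin writing that key are all my own invention; the setFetchSchedule() signature is what I see in the 1.x FetchSchedule interface, so please correct me if I'm off:

    package org.example.nutch;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.nutch.crawl.AbstractFetchSchedule;
    import org.apache.nutch.crawl.CrawlDatum;

    public class FeedAwareFetchSchedule extends AbstractFetchSchedule {

      // Hypothetical metadata key written by a feed-parsing plugin:
      // the average gap between the last feed entries, in seconds.
      private static final Text ENTRY_GAP_KEY = new Text("feed.entry.gap");

      @Override
      public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
          long prevFetchTime, long prevModifiedTime,
          long fetchTime, long modifiedTime, int state) {
        datum = super.setFetchSchedule(url, datum, prevFetchTime,
            prevModifiedTime, fetchTime, modifiedTime, state);

        Writable gap = datum.getMetaData().get(ENTRY_GAP_KEY);
        if (gap != null) {
          // Assuming the plugin stored a plain number of seconds:
          // feeds whose entries are close together get a short interval,
          // clamped between 1 hour and 30 days.
          int seconds = Integer.parseInt(gap.toString());
          int interval = Math.max(3600, Math.min(seconds, 30 * 24 * 3600));
          datum.setFetchInterval(interval);
        }
        // Recompute the next fetch time from the (possibly adjusted) interval.
        datum.setFetchTime(fetchTime + (long) datum.getFetchInterval() * 1000L);
        datum.setModifiedTime(modifiedTime);
        return datum;
      }
    }

The missing piece for me is how the feed-derived value gets into the CrawlDatum metadata in the first place -- presumably a parse plugin plus something along the lines of db.parsemeta.to.crawldb, but I'm not sure whether that's the intended route.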
Cheers,
Chris

On Wed, Oct 6, 2010 at 10:14 AM, reinhard schwab <[email protected]> wrote:
> implement your own schedule class and set the property in the nutch-site.xml
> in nutch-default.xml you have
>
> <property>
>  <name>db.fetch.schedule.class</name>
>  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
>  <description>The implementation of fetch schedule. DefaultFetchSchedule simply
>  adds the original fetchInterval to the last fetch time, regardless of
>  page changes.</description>
> </property>
>
> you can see in this class how to implement your own schedule class.
>
> Christopher Laux schrieb:
>> Hi all,
>>
>> thanks for the last answer. I have a more advanced question, if you don't
>> mind:
>>
>> What is the easiest way to make revisit times depend on the http/html
>> content-type, e.g. I want to revisit "application/rss+xml" pages every
>> 12 hours but "text/html" etc. can remain at 30 days?
>>
>> Do I have to modify the generate and update functions or could plugins
>> handle this?
>>
>> Thanks,
>> Chris
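
P.S. I assume registering such a class just means overriding the same property in nutch-site.xml, with the value pointing at the custom class (again, the class name above is hypothetical):

    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.example.nutch.FeedAwareFetchSchedule</value>
      <description>Custom fetch schedule that shortens the interval for
      frequently updated feeds.</description>
    </property>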

