Re: Re-crawling strategy

feng lu Fri, 08 Feb 2013 18:57:37 -0800

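Hi,

You can set the fetch interval for each list page directly in your seed list
text file, as tab-separated metadata after the URL. In the example below, \t
stands for a tab character and the whole entry goes on a single line:

http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source

The nutch.fetchInterval value is in seconds, so 2592000 is 30 days; use a
smaller value for the list pages you want re-fetched often.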
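
If the seed list is generated by a script, the same line format could be
produced with something like the following. This is only a minimal sketch:
the helper names seed_line and write_seed_file are made up for illustration
and are not part of Nutch, and the interval you pass in is your own choice.
Only the nutch.score and nutch.fetchInterval keys and the tab-separated
layout come from the example above.

```python
# Sketch: build tab-separated seed lines carrying per-URL metadata.
# seed_line and write_seed_file are hypothetical helpers, not Nutch APIs.

def seed_line(url, fetch_interval_s, score=None, **extra):
    """Return one seed-list line: the URL, then key=value pairs joined by tabs."""
    fields = [url]
    if score is not None:
        fields.append(f"nutch.score={score}")
    # Per-URL fetch interval, in seconds.
    fields.append(f"nutch.fetchInterval={fetch_interval_s}")
    # Any additional custom metadata, e.g. userType=open_source.
    fields.extend(f"{key}={value}" for key, value in extra.items())
    return "\t".join(fields)


def write_seed_file(path, urls, fetch_interval_s):
    """Write one seed line per URL, all sharing the same fetch interval."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(seed_line(url, fetch_interval_s) + "\n")
```

Calling seed_line("http://www.nutch.org/", 2592000, score=10,
userType="open_source") reproduces the example entry with real tab
characters between the fields. Article pages that are not in the seed list
would still fall back to db.fetch.interval.default.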




On Friday, February 8, 2013, 高睿 wrote:

>  Hi,
>
> I'm using Nutch 2.1 for dedicated crawling of several blogs.
> In my urls folder, seed.txt contains several blog article list pages.
> The blogs are not updated very frequently. I don't want to re-crawl an
> article content page once it has been crawled, but I want the article
> list pages to be crawled every time so that new article pages can be
> found.
>
> The parameter 'db.fetch.interval.default' is for this purpose, but I guess
> it will impact all urls including the article list page.
> So, is there any way to specify the re-crawling strategy based on url?
> Thanks.
>
> Regards,
> Rui
>


-- 
Don't Grow Old, Grow Up... :-)
