Hi

Maybe you can add the article URLs that are already crawled to a seed file.
Next, set db.injector.update to true and set the nutch.fetchInterval
metadata of each URL to a long interval. Finally, run the bin/nutch inject
command to update the fetchInterval of each article URL.
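A rough sketch of what that could look like (the URLs, paths, and the
one-year interval below are just examples; the injector reads tab-separated
key=value metadata after each URL):

```
# seed.txt -- already-crawled article URLs with a long fetch interval
# nutch.fetchInterval is in seconds; 31536000 is roughly one year
http://example.com/article/123	nutch.fetchInterval=31536000
http://example.com/article/456	nutch.fetchInterval=31536000
```

With db.injector.update set to true in nutch-site.xml, something like
`bin/nutch inject crawl/crawldb urls/seed.txt` should then update the
existing CrawlDb entries instead of skipping them (the crawldb path here is
an example; use your own).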

Or you can extend AbstractFetchSchedule and override the setFetchSchedule
method, using a URL filter to match the article URLs and assign them a long
fetchInterval.
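A minimal sketch of that idea, assuming the Nutch 1.x API (the class name,
the `/article/` substring check, and the interval are my own placeholders;
in practice you would plug in a real URLFilter):

```java
package org.example.nutch;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AbstractFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

public class ArticleAwareFetchSchedule extends AbstractFetchSchedule {

  // Example interval: roughly one year, in seconds.
  private static final int ARTICLE_FETCH_INTERVAL = 365 * 24 * 60 * 60;

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    // Let the base class compute the default schedule first.
    datum = super.setFetchSchedule(url, datum, prevFetchTime,
        prevModifiedTime, fetchTime, modifiedTime, state);
    // Hypothetical article check -- replace with your own filter logic.
    if (url.toString().contains("/article/")) {
      datum.setFetchInterval(ARTICLE_FETCH_INTERVAL);
    }
    return datum;
  }
}
```

You would then point db.fetch.schedule.class in nutch-site.xml at this
class so Nutch uses it instead of the default schedule.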


On Sun, Mar 10, 2013 at 9:29 PM, 高睿 <[email protected]> wrote:

>  Hi,
>
> Background: I have several article list URLs in seed.txt. Currently, the
> nutch crawl command crawls both the list URLs and the article URLs every
> time.
> I want to prevent re-crawling of the URLs (article URLs) that are already
> crawled, but I still want to crawl the URLs in seed.txt (the article list
> URLs).
> Do you have any ideas about this?
>
> Regards,
> Rui
>



-- 
Don't Grow Old, Grow Up... :-)
