Hi,

Maybe you can add the article URLs that are already crawled to a seed file. Next, set db.injector.update to true and set the nutch.fetchInterval metadata of each URL to a long interval. Finally, run the bin/nutch inject command to update the fetchInterval of each URL (the article URLs).
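A minimal sketch of that setup, assuming a hypothetical site layout and a one-year interval (31536000 seconds); the paths and URLs below are illustrative, and per-URL metadata in a seed file is tab-separated after the URL:

```
# urls/seed.txt -- already-crawled article URLs with a long per-URL interval
http://example.com/article/123.html	nutch.fetchInterval=31536000
http://example.com/article/456.html	nutch.fetchInterval=31536000

# conf/nutch-site.xml -- let the injector update existing CrawlDb entries
<property>
  <name>db.injector.update</name>
  <value>true</value>
</property>

# re-inject so the new fetchInterval is applied to the existing entries
bin/nutch inject crawl/crawldb urls/
```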
Alternatively, you can extend AbstractFetchSchedule and override the setFetchSchedule method: use a URL filter to match the article URLs and set a long fetchInterval on them.

On Sun, Mar 10, 2013 at 9:29 PM, 高睿 <[email protected]> wrote:
> Hi,
>
> Background: I have several article list URLs in seed.txt. Currently, the
> nutch crawl command crawls both the list URLs and the article URLs every
> time.
> I want to prevent re-crawling the URLs (article URLs) which are already
> crawled, but I still want to crawl the URLs in seed.txt (the article list
> URLs).
> Do you have any idea about this?
>
> Regards,
> Rui
> --
> Don't Grow Old, Grow Up... :-)
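For the second approach, the per-URL interval decision can be sketched in plain Java. The class name, URL pattern, and interval values below are assumptions for illustration; in a real Nutch plugin this logic would live inside the overridden setFetchSchedule of a class extending AbstractFetchSchedule, with the chosen interval applied via datum.setFetchInterval(...):

```java
import java.util.regex.Pattern;

// Sketch of a per-URL fetch-interval policy. In an actual plugin this
// decision would run inside setFetchSchedule(...) of a subclass of
// AbstractFetchSchedule, and the result would be written to the
// CrawlDatum with datum.setFetchInterval(...).
public class FetchIntervalPolicy {

    // Hypothetical pattern for article pages; adjust to your site layout,
    // or delegate the matching to a configured URL filter.
    private static final Pattern ARTICLE = Pattern.compile(".*/article/.*");

    static final int ONE_YEAR = 365 * 24 * 60 * 60; // seconds
    static final int ONE_DAY = 24 * 60 * 60;        // seconds

    // Article pages get a very long interval so they are effectively
    // never re-fetched; list pages keep a short one.
    public static int intervalFor(String url) {
        return ARTICLE.matcher(url).matches() ? ONE_YEAR : ONE_DAY;
    }
}
```

The same idea works with any predicate: the only Nutch-specific part is where the interval is stored, not how the URL is classified.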

