OK, thanks. I'll try the second approach. I'm using the 'nutch crawl' command, and the 'fetchInterval' setting doesn't seem to take effect there. Maybe I should build my own script from the basic commands instead.
At 2013-03-10 22:36:03, "feng lu" <[email protected]> wrote:
>Hi
>
>Maybe you can add the article URLs that are already crawled to a seed file.
>Next, set db.injector.update to true and set the metadata nutch.fetchInterval
>of each URL to a long time. Finally, use the bin/nutch inject command to
>update the fetchInterval of each URL (the article URLs).
>
>Or you can extend AbstractFetchSchedule and override the setFetchSchedule
>method, using a URL filter to match the article URLs and set a long
>fetchInterval for them.
>
>
>On Sun, Mar 10, 2013 at 9:29 PM, 高睿 <[email protected]> wrote:
>
>> Hi,
>>
>> Background: I have several article-list URLs in seed.txt. Currently, the
>> nutch crawl command crawls both the list URLs and the article URLs every
>> time.
>> I want to prevent re-crawling of the URLs (article URLs) that have already
>> been crawled, but I still want to crawl the URLs in seed.txt (the
>> article-list URLs).
>> Do you have any ideas about this?
>>
>> Regards,
>> Rui
>>
>
>
>
>--
>Don't Grow Old, Grow Up... :-)
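For reference, the first suggestion above might be sketched roughly like this. The URLs, paths, and interval value are placeholders; this assumes a Nutch 1.x layout where the Injector reads tab-separated per-URL metadata (nutch.fetchInterval, in seconds) from the seed file and where db.injector.update makes re-injection merge metadata into existing crawldb entries:

```shell
# Seed file: one already-crawled article URL per line, with per-URL
# metadata appended as tab-separated key=value pairs.
# nutch.fetchInterval is in seconds; ~1 year here.
cat > update_seeds/seed.txt <<'EOF'
http://example.com/article/123	nutch.fetchInterval=31536000
http://example.com/article/456	nutch.fetchInterval=31536000
EOF

# In conf/nutch-site.xml, db.injector.update must be true so that
# injecting a URL that already exists in the crawldb updates its
# metadata instead of being skipped:
#
#   <property>
#     <name>db.injector.update</name>
#     <value>true</value>
#   </property>

# Re-inject: the existing crawldb entries for these article URLs get
# the long fetch interval, so the generator will not schedule them
# for refetching any time soon.
bin/nutch inject crawl/crawldb update_seeds
```

The list URLs in the original seed.txt keep their normal (short) interval, so only the article pages are pushed far into the future.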

