Hi,

Background: I have several article-list URLs in seed.txt. Currently, the nutch 
crawl command crawls both the list URLs and the article URLs on every run.
I want to prevent re-crawling of the article URLs that have already been 
crawled, but I still want to re-crawl the list URLs in seed.txt each time.
Does anyone have an idea how to do this?
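
For concreteness, one setup that might achieve this (just a sketch, assuming the Injector honors per-URL nutch.fetchInterval metadata in the seed file, and using the db.fetch.interval.default property from nutch-default.xml; the URLs and values below are made up):

```
# seed.txt -- give the list pages a short re-fetch interval (1 hour),
# so every crawl cycle picks them up again. The separator between the
# URL and the metadata must be a TAB character:
http://example.com/news/list1	nutch.fetchInterval=3600
http://example.com/news/list2	nutch.fetchInterval=3600

<!-- conf/nutch-site.xml -- raise the default interval very high so that
     article pages, once fetched, are not due again for a long time -->
<property>
  <name>db.fetch.interval.default</name>
  <value>31536000</value> <!-- one year, in seconds -->
</property>
```

If I understand the fetch scheduling correctly, the seed list URLs would then become due again every hour, while already-fetched article URLs wait out the long default interval, so only the list pages (plus newly discovered article links) get fetched on each run.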

Regards,
Rui
