OK, thanks.
I'll try the 2nd approach.
I'm using the 'nutch crawl' command, and 'fetchInterval' doesn't seem to take
effect there. Maybe I should build my own script out of the individual commands
instead.
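For reference, the one-shot crawl can be broken into the individual commands it wraps; a minimal sketch of such a script (the directory names and round count are assumptions, adjust for your setup):

```shell
#!/bin/sh
# Sketch of a step-wise crawl loop replacing the all-in-one 'nutch crawl'.
# Paths (crawl/, urls/) and the round count are assumptions.
NUTCH=bin/nutch
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

# Inject the seed URLs (the article-list pages) into the crawldb.
$NUTCH inject $CRAWLDB urls

for round in 1 2 3; do
  # Generate a fetch list; URLs whose fetch time has not yet arrived are
  # skipped, which is where a long fetchInterval prevents re-crawling.
  $NUTCH generate $CRAWLDB $SEGMENTS
  SEGMENT=$SEGMENTS/$(ls $SEGMENTS | tail -1)

  $NUTCH fetch $SEGMENT
  $NUTCH parse $SEGMENT

  # Fold the fetch results back into the crawldb.
  $NUTCH updatedb $CRAWLDB $SEGMENT
done
```

Running the steps separately also makes it possible to insert custom filtering or db manipulation between rounds.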

At 2013-03-10 22:36:03,"feng lu" <[email protected]> wrote:
>Hi
>
>Maybe you can add the article URLs that are already crawled to a seed file.
>Then set db.injector.update to true and set the metadata nutch.fetchInterval
>of each URL to a long time. Finally, use the bin/nutch inject command to
>update the fetchInterval of each URL (the article URLs).
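The suggestion above could look roughly like this: the Nutch injector accepts per-URL metadata as tab-separated key=value pairs in the seed file. A sketch, where the URL is a placeholder and the interval value (in seconds, here about ten years) is an assumption:

```shell
# Seed file: one already-crawled article URL per line, with a long
# nutch.fetchInterval so it is not re-fetched soon. The URL is a placeholder;
# the separator between URL and metadata must be a tab.
cat > long_interval_seeds.txt <<'EOF'
http://example.com/article/123	nutch.fetchInterval=315360000
EOF

# Re-inject with db.injector.update=true so existing crawldb entries are
# updated instead of skipped. The property can also be set in nutch-site.xml.
bin/nutch inject -Ddb.injector.update=true crawl/crawldb long_interval_seeds.txt
```

After this, generate should stop selecting those article URLs until their (now distant) fetch time arrives.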
>
>Alternatively, you can extend AbstractFetchSchedule and override the
>setFetchSchedule method: use a URL filter to match the article URLs and set a
>long fetchInterval for them.
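A sketch of that subclass against the Nutch 1.x FetchSchedule API (the URL pattern and the ten-year interval are assumptions; the class would be selected via db.fetch.schedule.class in nutch-site.xml):

```java
// Sketch: a FetchSchedule that gives article pages a very long re-fetch
// interval so they are effectively crawled once, while the list pages in
// seed.txt keep the default schedule. The URL pattern is hypothetical.
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AbstractFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

public class ArticleOnceFetchSchedule extends AbstractFetchSchedule {

  // Hypothetical pattern matching article URLs; adapt to the real site.
  private static final Pattern ARTICLE = Pattern.compile(".*/article/.*");

  // Roughly ten years, in seconds.
  private static final int LONG_INTERVAL = 10 * 365 * 24 * 60 * 60;

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    // Let the base class fill in the default schedule first.
    datum = super.setFetchSchedule(url, datum, prevFetchTime,
        prevModifiedTime, fetchTime, modifiedTime, state);
    if (ARTICLE.matcher(url.toString()).matches()) {
      // Push the next fetch far into the future for article pages.
      datum.setFetchInterval(LONG_INTERVAL);
      datum.setFetchTime(fetchTime + LONG_INTERVAL * 1000L);
    }
    return datum;
  }
}
```

This keeps the scheduling decision inside Nutch itself, so no separate re-inject step is needed after each crawl round.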
>
>
>On Sun, Mar 10, 2013 at 9:29 PM, 高睿 <[email protected]> wrote:
>
>>  Hi,
>>
>> Background: I have several article-list URLs in seed.txt. Currently, the
>> nutch crawl command crawls both the list URLs and the article URLs every
>> time.
>> I want to prevent re-crawling of the URLs that are already crawled (the
>> article URLs), but I still want to crawl the URLs in seed.txt (the
>> article-list URLs).
>> Do you have any ideas about this?
>>
>> Regards,
>> Rui
>>
>
>
>
>-- 
>Don't Grow Old, Grow Up... :-)
