yes, using "nutch crawl" command can not affect the 'fetchInterval', Currently it will be affected by these factors.
1. The db.fetch.interval.default property in nutch-site.xml - the default
number of seconds between re-fetches of a page.
2. The nutch.fetchInterval metadata in the nutch inject process - allows you
to set a custom fetch interval for a specific URL.
3. The adaptive fetch schedule class, if you use it - it can continuously
monitor a site and crawl updates [0] (see the first snippet after the quoted
thread below).

Another method that could implement your requirement: add all article list
URLs to a seed list, set a custom fetchInterval for them, set
db.fetch.interval.default to a long time, and then inject the article list
URLs into the crawldb (the second snippet below sketches this). With this
method you need to know all the list pages in advance; otherwise the
fetchInterval of a newly discovered article list URL will be set to
db.fetch.interval.default.

[0] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

On Mon, Mar 11, 2013 at 12:26 AM, 高睿 <[email protected]> wrote:

> OK, thanks.
> I'll try the 2nd approach.
> I'm using the 'nutch crawl' command, and it seems 'fetchInterval' doesn't
> really work. Maybe I should build my own script based on the basic
> commands.
>
>
> At 2013-03-10 22:36:03, "feng lu" <[email protected]> wrote:
> >Hi
> >
> >Maybe you can add the article URLs that are already crawled to a seed
> >file. Next, set db.injector.update to true and set the
> >nutch.fetchInterval metadata of each URL to a long time. Finally, use
> >the bin/nutch inject command to update the fetchInterval of each URL
> >(the article URLs).
> >
> >Or you can extend AbstractFetchSchedule and override the
> >setFetchSchedule method, using a urlfilter to match the article URLs
> >and set a long fetchInterval for them.
> >
> >
> >On Sun, Mar 10, 2013 at 9:29 PM, 高睿 <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Background: I have several article list URLs in seed.txt. Currently,
> >> the nutch crawl command crawls both the list URLs and the article
> >> URLs every time.
> >> I want to prevent re-crawling of the URLs that are already crawled
> >> (the article URLs), but I still want to crawl the URLs in seed.txt
> >> (the article list URLs).
> >> Do you have any ideas about this?
> >>
> >> Regards,
> >> Rui
> >>
> >
> >
> >--
> >Don't Grow Old, Grow Up... :-)

--
Don't Grow Old, Grow Up... :-)
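First snippet: for point 3, switching to the adaptive schedule is a
one-property change in nutch-site.xml. This only shows the class switch;
see [0] for how to tune the adaptive increment/decrement and min/max
interval properties:

  <!-- nutch-site.xml: use the adaptive schedule instead of the default -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>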
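Second snippet: a sketch of the seed-list method. The URLs, the directory
names, and the interval values (86400 s = 1 day for list pages, 31536000 s
= ~1 year as the default) are made-up examples; the injector reads
tab-separated name=value metadata after each seed URL:

  # urls/seed.txt -- <TAB> stands for a real tab character
  http://www.example.com/news/list1.html<TAB>nutch.fetchInterval=86400
  http://www.example.com/news/list2.html<TAB>nutch.fetchInterval=86400

  <!-- nutch-site.xml -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>31536000</value> <!-- discovered article URLs wait ~1 year -->
  </property>
  <property>
    <name>db.injector.update</name>
    <value>true</value> <!-- re-injecting updates existing crawldb entries -->
  </property>

  # inject (or re-inject) the list URLs into the crawldb
  bin/nutch inject crawl/crawldb urls/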
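And for the AbstractFetchSchedule approach from my earlier mail in the
thread above, a minimal sketch, assuming Nutch 1.x. The package, class
name, article regex, and one-year interval are placeholders you would
adapt to your site:

  package org.example.nutch; // hypothetical package name

  import java.util.regex.Pattern;

  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.AbstractFetchSchedule;
  import org.apache.nutch.crawl.CrawlDatum;

  /**
   * Sketch: give article pages a very long fetch interval so they are
   * effectively crawled once, while list pages keep their own schedule.
   */
  public class ArticleOnceFetchSchedule extends AbstractFetchSchedule {

    // Assumed URL shape for article pages; adjust to your site's layout.
    private static final Pattern ARTICLE = Pattern.compile(".*/article/.*");

    private static final int ONE_YEAR = 365 * 24 * 60 * 60; // seconds

    @Override
    public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
        long prevFetchTime, long prevModifiedTime,
        long fetchTime, long modifiedTime, int state) {
      datum = super.setFetchSchedule(url, datum, prevFetchTime,
          prevModifiedTime, fetchTime, modifiedTime, state);
      if (ARTICLE.matcher(url.toString()).matches()) {
        // Articles: push the next fetch a year out, i.e. crawl once.
        datum.setFetchInterval(ONE_YEAR);
        datum.setFetchTime(fetchTime + ONE_YEAR * 1000L);
      }
      return datum;
    }
  }

Then point Nutch at the class in nutch-site.xml:

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.example.nutch.ArticleOnceFetchSchedule</value>
  </property>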

