If you want records to be fetched at a fixed interval, it's easier to inject them
with a fixed fetch interval.

nutch.fetchInterval.fixed=86400
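
As a rough sketch of how that could be added to a seed list: the injector reads metadata as tab-separated key=value pairs after the URL, so a small shell function can append the fixed interval to each URL line. The function name add_fixed_interval and the example URL are made up for illustration, and the key name nutch.fetchInterval.fixed is the one I know from the injector, so check it against your Nutch version.

```shell
# Append a fixed fetch interval (in seconds) to every URL line read from stdin.
# Blank lines and '#' comment lines are passed through unchanged.
# add_fixed_interval is a hypothetical helper, not part of Nutch.
add_fixed_interval() {
  awk -v interval="$1" '
    /^[[:space:]]*(#|$)/ { print; next }
    { print $0 "\tnutch.fetchInterval.fixed=" interval }
  '
}

# Example: 86400 seconds = 1 day
printf 'http://example.com/\n' | add_fixed_interval 86400
```

The output would be redirected into a file in your seed directory and then injected as usual with bin/nutch inject.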

 
 
-----Original message-----
> From:kemical <[email protected]>
> Sent: Thu 14-Feb-2013 10:15
> To: [email protected]
> Subject: Re: Nutch Incremental Crawl
> 
> Hi David,
> 
> You can also consider setting a shorter fetch interval with nutch inject.
> This way you set a higher score (so the url is always taken first when you
> generate a segment) and a fetch interval of about 1 day.
> 
> If your case is similar to mine, you'll often want some homepages fetched
> each day but not their inlinks. What you can do is inject all your seed urls
> again (assuming those urls are only homepages).
> 
> #change nutch option so existing urls can be injected again in
> conf/nutch-default.xml or conf/nutch-site.xml 
> db.injector.update=true 
> 
> #Add metadata to update score/fetch interval
> #the following line appends the new score / new interval to each line of
> #your seed url files (tab-separated, keeping the trailing newline)
> perl -pi -e 's/^(.*)\n$/$1\tnutch.score=100\tnutch.fetchInterval=80000\n/' [your_seed_url_dir]/*
> 
> #run command 
> bin/nutch inject crawl/crawldb [your_seed_url_dir]
> 
> Now, the following crawl will take your urls in top priority and crawl them
> about once a day. I've used my situation to illustrate the concept but I
> guess you can tweak the params to fit your needs.
> 
> This way is useful when you want a regular fetch on some urls; if it only
> happens rarely, I guess freegen is the right choice.
> 
> Best,
> Mike
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
