If you want records to be fetched at a fixed interval, it's easier to inject them with a fixed fetch interval:
nutch.fetchInterval.fixed=86400

-----Original message-----
> From:kemical <[email protected]>
> Sent: Thu 14-Feb-2013 10:15
> To: [email protected]
> Subject: Re: Nutch Incremental Crawl
>
> Hi David,
>
> You can also consider setting a shorter fetch interval with nutch inject.
> This way you'll set a higher score (so the url is always taken in priority
> when you generate a segment) and a fetch interval of about 1 day.
>
> If your case is similar to mine, you'll often want some homepages fetched
> each day but not their inlinks. What you can do is inject all your seed urls
> again (assuming those urls are only homepages).
>
> #change nutch option so existing urls can be injected again in
> #conf/nutch-default.xml or conf/nutch-site.xml
> db.injector.update=true
>
> #Add metadata to update score/fetch interval
> #the following line appends the new score / new interval to each line of
> #your seed url files
> perl -pi -e 's/^(.*)\n$/$1\tnutch.score=100\tnutch.fetchInterval=80000\n/' [your_seed_url_dir]/*
>
> #run command
> bin/nutch inject crawl/crawldb [your_seed_url_dir]
>
> Now, the following crawl will take your urls in top priority and crawl them
> once a day. I've used my situation to illustrate the concept, but I guess you
> can tweak the params to fit your needs.
>
> This way is useful when you want a regular fetch on some urls; if it only
> happens rarely, I guess freegen is the right choice.
>
> Best,
> Mike
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
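
For what it's worth, the seed-annotation step from the quoted message can also be sketched in Python if a Perl one-liner feels fragile. This is only a rough sketch under a few assumptions: seed files hold one URL per line, metadata is appended as tab-separated key=value pairs (the nutch.score and nutch.fetchInterval keys are taken from the message above), and lines starting with "#" are treated as comments. The helper names add_seed_metadata and annotate_seed_file are made up for illustration and are not part of Nutch.

```python
from pathlib import Path


def add_seed_metadata(line: str, metadata: dict) -> str:
    """Append tab-separated key=value pairs to a single seed line.

    Hypothetical helper, not part of Nutch. Metadata already present on the
    line is kept as-is, so running this twice would duplicate the keys.
    """
    entry = line.strip()
    # Leave blank lines and comment lines untouched (assumed seed conventions).
    if not entry or entry.startswith("#"):
        return entry
    pairs = [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join([entry] + pairs)


def annotate_seed_file(path: Path, metadata: dict) -> None:
    """Rewrite one seed file in place, annotating every URL line."""
    lines = path.read_text(encoding="utf-8").splitlines()
    annotated = [add_seed_metadata(line, metadata) for line in lines]
    path.write_text("\n".join(annotated) + "\n", encoding="utf-8")


# Values mirror the quoted message, with a full day (86400 s) as the interval.
EXAMPLE_METADATA = {"nutch.score": 100, "nutch.fetchInterval": 86400}
print(add_seed_metadata("http://example.com/", EXAMPLE_METADATA))
```

To annotate a whole seed directory you would call annotate_seed_file on each file in it, then run bin/nutch inject as described above. Keep a backup of the seed files first, since the rewrite happens in place.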

