To crawl news sites as promptly as possible, I want to be able to manipulate 
the refetch intervals and scores in the crawldb. I thought I could accomplish 
that by re-injecting the urls that should be refetched most often. According 
to the documentation, I should be able to do that using the 
db.injector.overwrite property. However, it does not work for me.


Here is the injection command I use:
$NUTCH_HOME/runtime/deploy/bin/nutch inject \
  -D db.score.injected=10 \
  -D db.injector.overwrite=true \
  -D db.fetch.interval.default=1800 \
  /crawls/news0/data/crawldb /crawls/news0/seeds/reuters.txt

After re-injecting, I inspect the crawldb dump and see that the intervals and 
scores have not been overwritten. I have also tried db.injector.update=true, 
with similar results.
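
For anyone reproducing this, here is a minimal sketch of how the check against 
the dump could be automated. It assumes the text dump prints each record as a 
"<url><TAB>Version: N" line followed by a "Retry interval: N seconds (...)" 
line; the helper name check_intervals and the file paths are hypothetical, 
not part of Nutch.

```shell
# Hypothetical helper: list urls in a crawldb text dump whose retry interval
# differs from the expected value (in seconds).
# Assumed dump layout per record:
#   http://example.com/<TAB>Version: 7
#   Retry interval: 1800 seconds (0 days)
check_intervals() {
  dump_file=$1
  expected=$2
  awk -v want="$expected" '
    # A record starts with a line of the form "<url><TAB>Version: N".
    /\tVersion: / { split($0, parts, "\t"); url = parts[1] }
    # On the "Retry interval:" line, field 3 is the interval in seconds.
    /^Retry interval: / { if ($3 != want) print url "\t" $3 }
  ' "$dump_file"
}
```

A possible usage, assuming the dump was written with 
"nutch readdb /crawls/news0/data/crawldb -dump <out_dir>" and copied to a 
local file, would be "check_intervals <dump_file> 1800". Empty output would 
mean every url in the dump has the 1800 second interval; otherwise each 
printed line shows a url and the interval it still has.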

I suspect that my db.fetch.interval.default does not affect existing urls. Is 
there any way to change the refetch intervals of existing urls?



For a test case, one could inject a few of the following urls, crawl several 
iterations, and then inject all of them. The result should be that all of them 
have the 1800 interval.

http://mobile.reuters.com/
http://mobile.reuters.com/business
http://mobile.reuters.com/finance
http://mobile.reuters.com/news/entertainment
http://mobile.reuters.com/news/entertainment/arts
http://mobile.reuters.com/news/environment
http://mobile.reuters.com/news/health
http://mobile.reuters.com/news/lifestyle
http://mobile.reuters.com/news/oddlyEnough
http://mobile.reuters.com/news/science
http://mobile.reuters.com/news/sports
http://mobile.reuters.com/news/technology
http://mobile.reuters.com/news/us
http://mobile.reuters.com/news/world
http://mobile.reuters.com/politics
http://www.reuters.com/subjects/healthcare
https://www.reuters.com/
https://www.reuters.com/energy-environment
https://www.reuters.com/finance
https://www.reuters.com/money
https://www.reuters.com/news/entertainment
https://www.reuters.com/news/health
https://www.reuters.com/news/technology
https://www.reuters.com/news/world
https://www.reuters.com/politics
