I also tried including metadata in the seeds file (TAB-delimited) as follows.


http://mobile.reuters.com/      nutch.score=100 nutch.fetchInterval=1800
http://mobile.reuters.com/business      nutch.score=100 nutch.fetchInterval=1800


That did not change the stored values either, so I am still looking for a way 
to manipulate the refetch intervals and scores in the crawl db.
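
Since TABs are easily turned into spaces by an editor or mail client, here is a minimal sketch of generating such a seeds file with real TAB characters. The SEEDS, SCORE and INTERVAL variable names and the output file name are illustrative assumptions, not Nutch settings; only the nutch.score and nutch.fetchInterval keys come from the seed lines above.

```shell
#!/bin/sh
# Hypothetical helper: write a seeds file whose fields are separated by real TABs.
# SEEDS, SCORE and INTERVAL are illustrative names chosen for this sketch.
SEEDS="${SEEDS:-reuters.txt}"
SCORE=100
INTERVAL=1800

: > "$SEEDS"   # start with an empty file
for url in \
    http://mobile.reuters.com/ \
    http://mobile.reuters.com/business
do
    # printf turns \t into a literal TAB between the URL and each metadata pair
    printf '%s\tnutch.score=%s\tnutch.fetchInterval=%s\n' \
        "$url" "$SCORE" "$INTERVAL" >> "$SEEDS"
done

# Sanity check: every line should have exactly three TAB-separated fields
awk -F '\t' 'NF != 3 { bad = 1 } END { exit bad }' "$SEEDS" && echo "seeds file OK"
```

Running the awk check against a hand-edited seeds file is a quick way to rule out space-separated metadata as the cause before looking at the injector settings.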


________________________________
From: Michael Coffey <[email protected]>
To: User <[email protected]> 
Sent: Friday, November 24, 2017 3:13 PM
Subject: need to override refetch intervals



In order to achieve the most timely crawling of news sites, I want to be able 
to manipulate the refetch intervals and scores in the crawl db. I thought I 
could accomplish that by re-injecting the urls that should be re-fetched most 
often. According to the documentation, it seems I should be able to do that 
using the db.injector.overwrite property. However, it does not actually work 
for me.



Here is the injection command I use:

$NUTCH_HOME/runtime/deploy/bin/nutch inject \
  -D db.score.injected=10 \
  -D db.injector.overwrite=true \
  -D db.fetch.interval.default=1800 \
  /crawls/news0/data/crawldb /crawls/news0/seeds/reuters.txt


After re-injecting, I inspect the crawldb dump and see that the intervals and 
scores have not been overwritten. I have also tried db.injector.overwrite=true, 
with similar results.
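
As an alternative to reading a full dump, a single record can be checked with readdb -url. The following is a rough sketch only: the function name and the NUTCH_BIN variable are hypothetical, and the grep pattern assumes the record is printed with lines beginning "Retry interval:" and "Score:", which may differ between Nutch versions.

```shell
#!/bin/sh
# NUTCH_BIN is a hypothetical override; it defaults to the deploy-mode script
# used in the inject command above.
NUTCH_BIN="${NUTCH_BIN:-$NUTCH_HOME/runtime/deploy/bin/nutch}"

# Hypothetical helper: print only the interval and score lines for one URL.
# Assumes the readdb output contains lines starting with "Retry interval:"
# and "Score:"; adjust the pattern if your version prints them differently.
show_interval_and_score() {
    crawldb="$1"
    url="$2"
    "$NUTCH_BIN" readdb "$crawldb" -url "$url" \
        | grep -E '^(Retry interval|Score):'
}

# Example call (not run here):
#   show_interval_and_score /crawls/news0/data/crawldb http://mobile.reuters.com/
```

Checking the same URL before and after re-injecting makes it easy to see whether the injected interval and score actually reached the existing record.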


I suspect that my db.fetch.interval.default does not affect existing urls. Is 
there any way to change the refetch intervals of existing urls?




For a test case, one could inject a few of the following urls, crawl several 
iterations, and then inject all of them. The result should be that all of them 
have the 1800 interval.


http://mobile.reuters.com/

http://mobile.reuters.com/business

http://mobile.reuters.com/finance

http://mobile.reuters.com/news/entertainment

http://mobile.reuters.com/news/entertainment/arts

http://mobile.reuters.com/news/environment

http://mobile.reuters.com/news/health

http://mobile.reuters.com/news/lifestyle

http://mobile.reuters.com/news/oddlyEnough

http://mobile.reuters.com/news/science

http://mobile.reuters.com/news/sports

http://mobile.reuters.com/news/technology

http://mobile.reuters.com/news/us

http://mobile.reuters.com/news/world

http://mobile.reuters.com/politics

http://www.reuters.com/subjects/healthcare

https://www.reuters.com/

https://www.reuters.com/energy-environment

https://www.reuters.com/finance

https://www.reuters.com/money

https://www.reuters.com/news/entertainment

https://www.reuters.com/news/health

https://www.reuters.com/news/technology

https://www.reuters.com/news/world

https://www.reuters.com/politics
