Hi Michael,

> http://mobile.reuters.com/    nutch.score=100 nutch.fetchInterval=1800

works (make sure you have tabs as separators).
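
One way to be sure the separators really are tabs is to generate the seed file with printf rather than typing it in an editor. The following is only a sketch: the SEED_DIR and SEED_FILE names are made up for illustration, and the two URLs are the ones from the quoted seed list.

```shell
# Hypothetical location for the seed list; adjust to your setup.
SEED_DIR="${SEED_DIR:-$(mktemp -d)}"
SEED_FILE="$SEED_DIR/reuters.txt"

# printf expands \t to a real TAB, so no editor can silently turn the
# separators into spaces. The format string is reused for every group
# of three arguments (URL, score, fetch interval in seconds).
printf '%s\tnutch.score=%s\tnutch.fetchInterval=%s\n' \
  'http://mobile.reuters.com/' 100 1800 \
  'http://mobile.reuters.com/business' 100 1800 > "$SEED_FILE"

# Sanity check: every line should have exactly three TAB-separated fields.
awk -F '\t' 'NF != 3 { bad = 1; print "line " NR ": " NF " fields" }
             END { exit bad + 0 }' "$SEED_FILE" && echo "seed file OK"
```

If the awk check reports a line with a single field, the metadata on that line was separated by spaces.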

Of course, if the URLs are already in CrawlDb you need to "overwrite" them.

   nutch inject  ...   -overwrite
      (-D db.injector.overwrite=true does not work: the property is set from
      the -overwrite flag, and reset to false if -overwrite is absent ;( )

or "update"

   nutch inject  ...   -update
     (-update will only overwrite the fetch interval if it's not the default,
      otherwise it preserves the fetch interval, which might have been changed
      adaptively)
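
Put together with the paths from the original command, the two variants could look roughly like the sketch below. The run_nutch helper is hypothetical: it only echoes the command when the nutch script is not found, so the snippet can be tried as a dry run. The readdb -url check assumes a Nutch 1.x installation; compare against the usage message of your version.

```shell
# Paths taken from the original command; adjust as needed.
NUTCH="${NUTCH_HOME:-.}/runtime/deploy/bin/nutch"
CRAWLDB="/crawls/news0/data/crawldb"
SEEDS="/crawls/news0/seeds/reuters.txt"

# Hypothetical helper: print the command, and execute it only when the
# nutch script is actually present and executable.
run_nutch() {
  echo "+ $NUTCH $*"
  if [ -x "$NUTCH" ]; then
    "$NUTCH" "$@"
  fi
}

# Replace existing CrawlDb entries with the injected score and interval.
run_nutch inject "$CRAWLDB" "$SEEDS" -overwrite

# Alternative: update existing entries instead of replacing them.
# run_nutch inject "$CRAWLDB" "$SEEDS" -update

# Inspect a single record afterwards to see whether the interval changed.
run_nutch readdb "$CRAWLDB" -url http://mobile.reuters.com/
```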

Best,
Sebastian

On 11/27/2017 09:23 PM, Michael Coffey wrote:
> I also tried including metadata in the seeds file (TAB-delimited) as follows.
> 
> 
> http://mobile.reuters.com/      nutch.score=100 nutch.fetchInterval=1800
> http://mobile.reuters.com/business      nutch.score=100 nutch.fetchInterval=1800
> 
> 
> So, I am still looking for a way to manipulate the refetch intervals and 
> scores in the crawl db.
> 
> 
> ________________________________
> From: Michael Coffey <[email protected]>
> To: User <[email protected]> 
> Sent: Friday, November 24, 2017 3:13 PM
> Subject: need to override refetch intervals
> 
> 
> 
> In order to achieve the most timely crawling of news sites, I want to be able 
> to manipulate the refetch intervals and scores in the crawl db. I thought I 
> could accomplish that by re-injecting the urls that should be re-fetched most 
> often. According to the documentation, it seems I should be able to do that 
> using the db.injector.overwrite property. However, it does not actually work 
> for me.
> 
> 
> 
> Here is the injection command I use:
> 
> $NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=10 \
>   -D db.injector.overwrite=true -D db.fetch.interval.default=1800 \
>   /crawls/news0/data/crawldb /crawls/news0/seeds/reuters.txt
> 
> 
> After re-injecting, I inspect the crawldb dump and see that the intervals and 
> scores have not been overwritten. I have also tried 
> db.injector.overwrite=true, with similar results.
> 
> 
> I suspect that my db.fetch.interval.default does not affect existing urls. Is 
> there any way to change the refetch intervals of existing urls?
> 
> 
> 
> 
> For a test case, one could inject a few of the following urls, crawl several 
> iterations, and then inject all of them. The result should be that all of 
> them have the 1800 interval.
> 
> 
> http://mobile.reuters.com/
> 
> http://mobile.reuters.com/business
> 
> http://mobile.reuters.com/finance
> 
> http://mobile.reuters.com/news/entertainment
> 
> http://mobile.reuters.com/news/entertainment/arts
> 
> http://mobile.reuters.com/news/environment
> 
> http://mobile.reuters.com/news/health
> 
> http://mobile.reuters.com/news/lifestyle
> 
> http://mobile.reuters.com/news/oddlyEnough
> 
> http://mobile.reuters.com/news/science
> 
> http://mobile.reuters.com/news/sports
> 
> http://mobile.reuters.com/news/technology
> 
> http://mobile.reuters.com/news/us
> 
> http://mobile.reuters.com/news/world
> 
> http://mobile.reuters.com/politics
> 
> http://www.reuters.com/subjects/healthcare
> 
> https://www.reuters.com/
> 
> https://www.reuters.com/energy-environment
> 
> https://www.reuters.com/finance
> 
> https://www.reuters.com/money
> 
> https://www.reuters.com/news/entertainment
> 
> https://www.reuters.com/news/health
> 
> https://www.reuters.com/news/technology
> 
> https://www.reuters.com/news/world
> 
> https://www.reuters.com/politics
> 
