Hi Michael,
> http://mobile.reuters.com/ nutch.score=100 nutch.fetchInterval=1800
works (make sure you have tabs as separators).
Of course, if the URLs are already in the CrawlDb you need to either
"overwrite" them
  nutch inject ... -overwrite
or "update" them
  nutch inject ... -update
Note that -D db.injector.overwrite=true does not work: the property is
overridden by the -overwrite flag and is set to false if -overwrite is
absent ;(
Also, -update will only overwrite the fetch interval if the injected value
is not the default. Otherwise it preserves the existing fetch interval,
which might have been changed adaptively.
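
As a rough sketch of the above (not tested against your setup): the script
below writes two seed lines with real TAB separators, checks that every line
has exactly three TAB-separated fields, and only then re-injects with
-overwrite. The SEEDS and CRAWLDB paths are placeholders I made up, and it
assumes that bin/nutch is on your PATH.

```shell
#!/bin/sh
# Sketch: write a seed file with real TAB separators and re-inject it.
# SEEDS and CRAWLDB are placeholder paths (assumptions), adjust to your setup.
SEEDS="${SEEDS:-seeds.txt}"
CRAWLDB="${CRAWLDB:-crawldb}"

# printf turns \t into a literal TAB, so an editor cannot replace it with spaces.
# The format string is reused for each group of three arguments.
printf '%s\tnutch.score=%s\tnutch.fetchInterval=%s\n' \
  'http://mobile.reuters.com/' 100 1800 \
  'http://mobile.reuters.com/business' 100 1800 > "$SEEDS"

# Count lines that do not have exactly three TAB-separated fields.
BAD_LINES=$(awk -F '\t' 'NF != 3 { n++ } END { print n + 0 }' "$SEEDS")
echo "malformed seed lines: $BAD_LINES"

# Only inject when the seed file is clean and nutch is on the PATH.
if [ "$BAD_LINES" -eq 0 ] && command -v nutch >/dev/null 2>&1; then
  nutch inject "$CRAWLDB" "$SEEDS" -overwrite
fi
```

Replace -overwrite with -update if you would rather keep the existing
CrawlDb entries and only apply a non-default injected fetch interval.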
Best,
Sebastian
On 11/27/2017 09:23 PM, Michael Coffey wrote:
> I also tried including metadata in the seeds file (TAB-delimited) as follows.
>
>
> http://mobile.reuters.com/ nutch.score=100 nutch.fetchInterval=1800
> http://mobile.reuters.com/business nutch.score=100 nutch.fetchInterval=1800
>
>
> So, I am still looking for a way to manipulate the refetch intervals and
> scores in the crawl db.
>
>
> ________________________________
> From: Michael Coffey <[email protected]>
> To: User <[email protected]>
> Sent: Friday, November 24, 2017 3:13 PM
> Subject: need to override refetch intervals
>
>
>
> In order to achieve the most timely crawling of news sites, I want to be able
> to manipulate the refetch intervals and scores in the crawl db. I thought I
> could accomplish that by re-injecting the urls that should be re-fetched most
> often. According to the documentation, it seems I should be able to do that
> using the db.injector.overwrite property. However, it does not actually work
> for me.
>
>
>
> Here is the injection command I use:
>
> $NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=10 -D
> db.injector.overwrite=true -D db.fetch.interval.default=1800
> /crawls/news0/data/crawldb /crawls/news0/seeds/reuters.txt
>
>
> After re-injecting, I inspect the crawldb dump and see that the intervals and
> scores have not been overwritten. I have also tried
> db.injector.overwrite=true, with similar results.
>
>
> I suspect that my db.fetch.interval.default does not affect existing urls. Is
> there any way to change the refetch intervals of existing urls?
>
>
>
>
> For a test case, one could inject a few of the following urls, crawl several
> iterations, and then inject all of them. The result should be that all of
> them have the 1800 interval.
>
>
> http://mobile.reuters.com/
>
> http://mobile.reuters.com/business
>
> http://mobile.reuters.com/finance
>
> http://mobile.reuters.com/news/entertainment
>
> http://mobile.reuters.com/news/entertainment/arts
>
> http://mobile.reuters.com/news/environment
>
> http://mobile.reuters.com/news/health
>
> http://mobile.reuters.com/news/lifestyle
>
> http://mobile.reuters.com/news/oddlyEnough
>
> http://mobile.reuters.com/news/science
>
> http://mobile.reuters.com/news/sports
>
> http://mobile.reuters.com/news/technology
>
> http://mobile.reuters.com/news/us
>
> http://mobile.reuters.com/news/world
>
> http://mobile.reuters.com/politics
>
> http://www.reuters.com/subjects/healthcare
>
> https://www.reuters.com/
>
> https://www.reuters.com/energy-environment
>
> https://www.reuters.com/finance
>
> https://www.reuters.com/money
>
> https://www.reuters.com/news/entertainment
>
> https://www.reuters.com/news/health
>
> https://www.reuters.com/news/technology
>
> https://www.reuters.com/news/world
>
> https://www.reuters.com/politics
>