UpdateDbJob increases fetchtime of unfetched pages

Günter Ladwig Wed, 20 Nov 2013 03:08:42 -0800

Hi all,

I’m currently using Nutch 2.2.1 and noticed what seems to be a bug in the 
update step. Every time I run a crawl (using a modified bin/crawl script), the 
fetchtime is updated even for pages that were not fetched during the current 
crawl.


I found the related bug report NUTCH-1457 [1] through a previous post on this 
list [2].

For me this means that Nutch 2.2.1 is unusable. I want to run continuous crawls 
in order to keep a Solr index of a website up-to-date. This bug basically 
ensures that most pages will never be fetched again as their fetchtime is 
increased on each updatedb.
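
To make the effect concrete, here is a toy simulation of the behaviour as I 
understand it. This is not Nutch code: the function name, the 30-day refetch 
interval and the one-crawl-per-day schedule are assumptions chosen purely for 
illustration.

```python
from datetime import datetime, timedelta


def simulate(days, interval_days, buggy):
    """Count how often one page is fetched over `days` daily crawl cycles.

    Toy model only (hypothetical, not the Nutch implementation).
    """
    interval = timedelta(days=interval_days)
    now = datetime(2013, 11, 20)
    fetch_time = now  # the page is due right away
    fetches = 0
    for _ in range(days):
        # Generate/fetch step: only pages whose fetch time has passed are fetched.
        fetched = fetch_time <= now
        if fetched:
            fetches += 1
        # Update step.
        if fetched:
            # Expected behaviour: reschedule a page that was actually fetched.
            fetch_time = now + interval
        elif buggy:
            # Assumed buggy behaviour: the fetch time is pushed forward
            # even though the page was not fetched in this cycle.
            fetch_time = fetch_time + interval
        now += timedelta(days=1)
    return fetches


if __name__ == "__main__":
    print("expected behaviour:", simulate(90, 30, buggy=False), "fetches")
    print("buggy behaviour:   ", simulate(90, 30, buggy=True), "fetches")
```

In this model the page is refetched every 30 days when only fetched pages are 
rescheduled. With the assumed buggy update, its fetch time moves 30 days 
further away on every daily cycle, so it is never due again after the first 
fetch.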

Is there a workaround? Does this problem appear in Nutch 1.7?

Cheers,
Günter

[1] https://issues.apache.org/jira/browse/NUTCH-1457
[2] http://lucene.472066.n3.nabble.com/updatedb-in-nutch-2-0-increases-fetch-time-of-all-pages-td4008429.html
