Re: UpdateDbJob increases fetchtime of unfetched pages

Julien Nioche Wed, 20 Nov 2013 03:36:52 -0800

Hi Gunter

Nutch 1.x is a lot more stable than 2.x, which is still very much a work in
progress. This particular issue will hopefully be fixed in 2.x soon, but you
won't have it in 1.x for sure.


Julien


On 20 November 2013 11:07, Günter Ladwig <[email protected]> wrote:

> Hi all,
>
> I’m currently using Nutch 2.2.1 and noticed what seems to be a bug in
> the update step. Every time I run a crawl (using a modified bin/crawl
> script), the fetchtime is updated even for pages that were not fetched
> during the current crawl.
>
> I found the related bug report NUTCH-1457 [1] through a previous post on
> this list [2].
>
> For me this means that Nutch 2.2.1 is unusable. I want to run continuous
> crawls in order to keep a Solr index of a website up-to-date. This bug
> basically ensures that most pages will never be fetched again as their
> fetchtime is increased on each updatedb.
>
> Is there a workaround? Does this problem appear in Nutch 1.7?
>
> Cheers,
> Günter
>
> [1] https://issues.apache.org/jira/browse/NUTCH-1457
> [2] http://lucene.472066.n3.nabble.com/updatedb-in-nutch-2-0-increases-fetch-time-of-all-pages-td4008429.html




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
