Re: UpdateDbJob increases fetchtime of unfetched pages

Talat UYARER Thu, 21 Nov 2013 01:00:04 -0800


In addition to my previous mail: if you use the -all parameter, you may have the same problem.


Talat

On 21-11-2013 at 10:30, Talat UYARER wrote:

Hi Günter,

You are right, UpdatedbJob in Nutch 2.2.1 doesn't accept a batchId. But
this problem is fixed by the patch in NUTCH-1556 [1]. If you apply this
patch, you will no longer have this problem.

We use 2.x in production and have not found any big issues. I think it
is safe to use.

Thanks
Talat

[1] https://issues.apache.org/jira/browse/NUTCH-1556

On 20-11-2013 at 13:35, Julien Nioche wrote:

Hi Gunter

Nutch 1.x is a lot more stable than 2.x, which is very much a work in
progress. This particular issue will hopefully be fixed in 2.x soon,
but you won't have it in 1.x for sure.

Julien


On 20 November 2013 11:07, Günter Ladwig <[email protected]> wrote:

Hi all,

I’m currently using Nutch 2.2.1 and noticed what seems to be a bug in
the update step. Every time I run a crawl (using a modified bin/crawl
script), the fetchtime is updated even for pages that were not fetched
during the current crawl.

I found the related bug report NUTCH-1457 [1] through a previous post on
this list [2].

For me this means that Nutch 2.2.1 is unusable. I want to run continuous
crawls in order to keep a Solr index of a website up-to-date. This bug
basically ensures that most pages will never be fetched again as their
fetchtime is increased on each updatedb.

Is there a workaround? Does this problem appear in Nutch 1.7?

Cheers,
Günter

[1] https://issues.apache.org/jira/browse/NUTCH-1457
[2] http://lucene.472066.n3.nabble.com/updatedb-in-nutch-2-0-increases-fetch-time-of-all-pages-td4008429.html
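
The effect described in this thread can be illustrated with a toy model. The sketch below is not Nutch code: the Page class, the 30-day interval, the one-page-per-cycle limit and both update functions are assumptions made up for illustration. It only contrasts an update step that reschedules every page with one that reschedules only the pages marked with the current batch ID.

```python
from dataclasses import dataclass
from typing import Optional

DAY = 24 * 60 * 60
FETCH_INTERVAL = 30 * DAY  # assumed re-fetch interval, purely illustrative


@dataclass
class Page:
    url: str
    fetch_time: int                  # time at which the page is next due (seconds)
    batch_id: Optional[str] = None   # mark of the last batch that fetched the page


def update_all(pages, now, batch_id):
    """Behaviour reported in the thread: every page is rescheduled."""
    for page in pages:
        page.fetch_time = now + FETCH_INTERVAL


def update_batch(pages, now, batch_id):
    """Reschedule only the pages fetched in the given batch."""
    for page in pages:
        if page.batch_id == batch_id:
            page.fetch_time = now + FETCH_INTERVAL


def crawl(update_fn, cycles=60, top_n=1):
    """Run one crawl cycle per day and count how often each page is fetched."""
    pages = [Page("a", 0), Page("b", 0)]
    fetched = {page.url: 0 for page in pages}
    for cycle in range(cycles):
        now = cycle * DAY
        batch_id = f"batch-{cycle}"
        # Select pages that are due, limited to top_n per cycle (hypothetical limit).
        due = [page for page in pages if page.fetch_time <= now][:top_n]
        for page in due:
            page.batch_id = batch_id
            fetched[page.url] += 1
        update_fn(pages, now, batch_id)
    return fetched


if __name__ == "__main__":
    print("update all pages: ", crawl(update_all))
    print("update batch only:", crawl(update_batch))
```

Under these assumptions, crawl(update_all) fetches page a once and page b never, because every update pushes both fetch times 30 days past the current cycle. crawl(update_batch) fetches each page twice over the 60 daily cycles.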
