Hi,

I stopped using de-duplication in Nutch versions 0.9 to 1.2 because too many URLs were being removed for no apparent reason. I did not report the problem to the list at the time, though. I am now working with version 1.4, tried de-duplication again, and the problem still appears to be there.

Significant numbers of URLs are removed when de-duplication is applied. I could blame this on duplicated content, but it is hard to believe that so much is duplicated. One small site is represented by 1639 URLs in the index, and this number drops to 1068 after de-duplication (571 URLs removed, roughly 35%). OK, theoretically, this can happen, but here is another example. Another site has just one (root) page in the index, and this entry gets removed by de-duplication. How can this happen? There could be a collision in digests, but this is hard to believe, especially given the other suspicious phenomena.
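
To make the expectation concrete, here is a minimal sketch of what purely digest-based de-duplication would do. It is not Nutch's actual code: the MD5-over-raw-content digest, the function names, the example URLs, and the rule for which copy is kept are all assumptions made for illustration. Under this model an entry is dropped only if some other entry, anywhere in the index, has exactly the same digest.

```python
import hashlib
from collections import defaultdict


def content_digest(content: bytes) -> str:
    """Assumed digest: MD5 over the raw page content (illustrative only)."""
    return hashlib.md5(content).hexdigest()


def group_by_digest(pages: dict) -> dict:
    """Map each digest to the list of URLs whose content produced it."""
    groups = defaultdict(list)
    for url, content in pages.items():
        groups[content_digest(content)].append(url)
    return groups


def would_be_removed(pages: dict) -> list:
    """Return the URLs a purely digest-based de-duplication would drop.

    Which copy survives is an arbitrary choice in this sketch: the
    lexicographically smallest URL in each group is kept.
    """
    removed = []
    for urls in group_by_digest(pages).values():
        _keep, *drop = sorted(urls)
        removed.extend(drop)
    return sorted(removed)


# Hypothetical index contents, keyed by URL.
pages = {
    "http://a.example/": b"hello",
    "http://a.example/index.html": b"hello",  # same content as the root page
    "http://b.example/": b"unique",           # only page of its site
}

removed = would_be_removed(pages)
print(removed)
```

Running the same kind of grouping over the (URL, digest) pairs dumped from a real index would show whether each removed URL really shares its digest with a surviving entry, possibly one on a different site, or whether entries with unique digests are being dropped.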
I am not going to use de-duplication anyway, because duplicated entries may exist in the Arch index for a valid reason (e.g. different owners). However, it seems that I have a good test case that could help to pinpoint the problem, if it does indeed exist. If anyone wants to look into it, I am happy to help.

Regards,
Arkadi

