An alternative to Nutch deduplication (which I've found to fail when multiple document sources don't provide a 'digest' field in the SOLR index) is to have SOLR detect duplicates at update time - I've done this with a SOLR update-processor plugin.
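For reference, Solr ships with a SignatureUpdateProcessorFactory that supports exactly this kind of dedup-on-update; a minimal solrconfig.xml sketch might look like the following (the chain name "dedupe" and the field names `url`, `content`, and `signature` are assumptions, not something from the thread):

```xml
<!-- Hypothetical solrconfig.xml fragment: compute a fuzzy signature over
     selected fields at update time and overwrite earlier documents that
     produced the same signature. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Field (declared in schema.xml) that stores the computed signature. -->
    <str name="signatureField">signature</str>
    <!-- Delete older documents whose signature matches the incoming one. -->
    <bool name="overwriteDupes">true</bool>
    <!-- Fields whose content feeds the signature computation. -->
    <str name="fields">url,content</str>
    <!-- TextProfileSignature does near-duplicate matching;
         Lookup3Signature is an exact-hash alternative. -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The `signatureField` must also be declared in schema.xml as an indexed string field, and the update handler has to route updates through the named chain for the processor to run.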
On Wed, Nov 2, 2011 at 5:38 PM, <[email protected]> wrote:
> Hi,
>
> I stopped using de-duplication in Nutch versions 0.9-1.2 because too many
> URLs were being removed for no apparent reason. I did not report the
> problem to the list though. I am working with version 1.4 now, tried
> de-duplication again, and the problem appears to be still there. There are
> significant numbers of URLs being removed when de-duplication is applied. I
> could blame it on duplicated content, but it is hard to believe that so
> much is duplicated. One small site is represented by 1639 URLs in the
> index, and this number goes down to 1068 after de-duplication is done. OK,
> theoretically, this can happen, but here is another example. Another site
> has just one (root) page in the index. This entry gets removed by
> de-duplication. How can this happen? There can be a collision in digests,
> but this is hard to believe, especially given other suspicious phenomena.
>
> I am not going to use de-duplication anyway, because duplicated entries
> may exist in the Arch index for a valid reason (e.g. different owners).
> However, it seems that I have a good case that could help to pinpoint the
> problem, if it indeed exists. If anyone would want to do it, I am happy to
> help.
>
> Regards,
>
> Arkadi

