Dear list,

I have a problem with removing duplicates from my Nutch index. If I understood it correctly, the dedup option should do the work for me, i.e. remove entries with the same URL or the same content (MD5 hash). But unfortunately it doesn't.
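(For reference: as far as I understand it, the default MD5Signature simply hashes the raw bytes the fetcher returns. The snippet below is my own little test class, not part of Nutch, which reproduces that kind of hash for a URL so one can quickly check whether two fetches of the "same" page really return identical bytes.)

import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;

/**
 * Prints an md5sum-style hash of the raw response body for each URL
 * given on the command line. My own test code, not part of Nutch.
 */
public class FetchMd5 {
  public static void main(String[] args) throws Exception {
    for (String u : args) {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      InputStream in = new URL(u).openStream();
      try {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
          md5.update(buf, 0, n);
        }
      } finally {
        in.close();
      }
      StringBuilder hex = new StringBuilder();
      for (byte b : md5.digest()) {
        hex.append(String.format("%02x", b));
      }
      System.out.println(hex + "  " + u);
    }
  }
}

Running it twice against the same URL should show whether the live responses really are byte-identical or whether something (an embedded session id, for example) changes between fetches.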
The strange thing is that if I check the index with Luke, the pages in question do in fact have different hash sums and different URLs. That of course explains why dedup "fails". But if I take two of these URLs, which obviously lead to the same content, store the pages with all their content locally and calculate the hash with md5sum, the result is that they have the same hash value and are binary identical. Do you have any hints as to why these pages are indexed with different hash values? What point am I missing here?

Example URLs:

1) http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
2) http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
3) http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true

From what I've read so far about the TextProfileSignature class, it would not help me much, since many of the pages I am trying to index are not that text heavy.

Since the indexing took quite some time and the number of duplicates is large, I would be thankful for any idea on how to remove them.

Thanks for any suggestions,
André
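PS: In case the session id in URL 3 (the sid_... segment) turns out to be the culprit, here is a rough, untested sketch of a custom signature that strips such volatile tokens before hashing. The class and method usage follows the org.apache.nutch.crawl.Signature interface as I understand it from the sources, so please treat all names here as my assumption rather than working code:

import java.security.MessageDigest;

import org.apache.nutch.crawl.Signature;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

public class NormalizingMD5Signature extends Signature {

  public byte[] calculate(Content content, Parse parse) {
    try {
      byte[] raw = content.getContent();
      if (raw == null) {
        raw = new byte[0];
      }
      // Strip tokens that vary between otherwise identical pages
      // (the session id as in URL 3 and the varying nn_... segment)
      // before hashing, so repeated fetches collapse to one signature.
      String text = new String(raw)
          .replaceAll("sid_[0-9A-Fa-f]+", "")
          .replaceAll("nn_[0-9]+", "");
      return MessageDigest.getInstance("MD5").digest(text.getBytes());
    } catch (java.security.NoSuchAlgorithmException e) {
      throw new RuntimeException(e);
    }
  }
}

If I read nutch-default.xml correctly, the signature class is chosen via the db.signature.class property, but new signatures would only apply to newly fetched pages, so the question of how to get rid of the duplicates already in the index still stands.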

