Dear list,

I have a problem with removing duplicates from my Nutch index. If I
understood it correctly, the dedup option should do this for me, i.e. remove
entries with the same URL or the same content (MD5 hash). Unfortunately,
it doesn't.
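
Just to make concrete what I mean by the content hash: I picture the
signature as essentially an MD5 digest over the raw page bytes, roughly
like the sketch below (plain java.security.MessageDigest, not the actual
Nutch Signature implementation, whose details I am not sure about):

    import java.security.MessageDigest;

    public class ContentSignature {
        // Hex-encoded MD5 over the raw page bytes; two pages count as
        // duplicates only if these digests are identical.
        static String md5Hex(byte[] content) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content))
                sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }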

The strange thing is that when I inspect the index with Luke, the pages in
question do in fact have different hash sums and different URLs. That of
course explains why dedup "fails". But if I take two of these URLs, which
obviously lead to the same content, save the pages with all their content
locally and compute the hash with md5sum, they turn out to have the same
hash value and to be binary-identical.
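
In case it helps to reproduce this, the local check I did amounts to the
following (a small hypothetical helper, assuming the URLs are still
reachable and the server keeps serving the same bytes):

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.Arrays;

    public class BinaryCompare {
        // Download a URL fully into memory.
        static byte[] fetch(String url) throws Exception {
            InputStream in = new URL(url).openStream();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            } finally {
                in.close();
            }
            return out.toByteArray();
        }

        public static void main(String[] args) throws Exception {
            // Pass two of the example URLs below; prints true when the
            // served bytes are identical, which is what I observe.
            System.out.println(Arrays.equals(fetch(args[0]), fetch(args[1])));
        }
    }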

Do you have any hints as to why these pages are indexed with different hash
values? What am I missing here?

Example URLs:
1) http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
2) http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
3) http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
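
As far as I can tell, the three URLs differ only in the nn_<digits> and
sid_<hex> path segments, which look like session artefacts to me. If I had
to collapse such URLs myself, I would try a normalization along these lines
(the two patterns are pure guesses derived from the three examples above):

    import java.util.regex.Pattern;

    public class NormalizeUrl {
        // Strip the session-like path segments; with these guessed
        // patterns all three example URLs collapse to the same string.
        private static final Pattern SESSION_SEGMENTS =
            Pattern.compile("(?:nn_\\d+|sid_[0-9A-Fa-f]+)/");

        static String normalize(String url) {
            return SESSION_SEGMENTS.matcher(url).replaceAll("");
        }
    }

I believe Nutch ships a regex-based URL normalizer plugin where a rule like
this could live, but I have not tried that yet.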

From what I've read so far, the TextProfileSignature class would not help
me much, since many of the pages I am trying to index are not very
text-heavy. As the indexing took quite some time and the number of
duplicates is large, I would be grateful for any idea on how to remove
them.
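
For completeness, my understanding is that the signature implementation is
selected via the db.signature.class property, so trying TextProfileSignature
anyway would mean something like the following in nutch-site.xml (please
correct me if I have the property name wrong):

    <property>
      <name>db.signature.class</name>
      <value>org.apache.nutch.crawl.TextProfileSignature</value>
      <description>Implementation used to compute page signatures
      for deduplication.</description>
    </property>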

Thanks for any suggestions,
André