On 2010-08-23 18:11, Andre Pautz wrote:
Dear list,

I have a problem with removing duplicates from my Nutch index. If I understood
it correctly, the dedup option should do the work for me, i.e. remove entries
with the same URL or the same content (MD5 hash). Unfortunately, it doesn't.
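For illustration, a content-based signature works roughly like the sketch below
(a simplified stand-in, not Nutch's actual MD5Signature code): it hashes the raw
page bytes, so two pages get the same signature only if their fetched bytes are
identical. One possible explanation for mismatches, offered here only as an
assumption, is a server that embeds per-request state such as session IDs into
the HTML, so the bytes fetched at crawl time differ even though the pages look
identical when downloaded later within a single session.

import java.security.MessageDigest;

// Simplified content signature: hash the raw fetched bytes.
// Byte-identical pages yield identical signatures; any difference in
// the fetched bytes (such as an embedded session ID) changes the hash.
public class ContentSignatureSketch {
  public static String md5Hex(byte[] content) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest(content)) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}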

The strange thing is that if I inspect the index with Luke, the pages in question
do in fact have different hash sums and different URLs. That, of course, explains
why the dedup option "fails". But if I take two of these URLs, which obviously
lead to the same content, save the pages with all their content locally and
compute the hash with md5sum, it turns out that they have the same hash value
and are binary identical.

Do you have any hints as to why these pages are indexed with different hash
values? What am I missing here?

Example URLs:
1) http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
2) http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
3) http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
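Judging purely from these examples, the URLs differ only in the nn_... segment
and an optional sid_... segment, which look like navigation/session artifacts.
A common approach is to collapse such variants via URL normalization before they
are fetched; here is a rough Java sketch of the idea (the patterns are guesses
derived from the three URLs above, untested against the real site):

import java.util.regex.Pattern;

// Sketch: strip segments such as "nn_12345/" and "sid_<hex>/" from the
// URL so that session variants of the same page compare as equal.
// The patterns are assumptions based on the example URLs only.
public class UrlNormalizerSketch {
  private static final Pattern SESSION_SEGMENTS =
      Pattern.compile("(?:nn_\\d+/|sid_[0-9A-Fa-f]+/)");

  public static String normalize(String url) {
    return SESSION_SEGMENTS.matcher(url).replaceAll("");
  }

  public static void main(String[] args) {
    // URLs 1) and 3) above normalize to the same string.
    System.out.println(normalize(
        "http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true"));
    System.out.println(normalize(
        "http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true"));
  }
}

In Nutch itself an equivalent rule would normally go into conf/regex-normalize.xml
(used by the urlnormalizer-regex plugin) so that the variants collapse to a single
URL at crawl time instead of ending up as separate index entries; that placement
is an assumption about the usual setup, so check the plugin documentation for
your version.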

From what I've read so far about the TextProfileSignature class, it would not
help me much, since many of the pages I am trying to index are not that
text-heavy. Since the indexing took quite some time and the number of
duplicates is large, I would be thankful for any idea on how to remove them.

You didn't say what version of Nutch you are using, but take a look at this issue:

https://issues.apache.org/jira/browse/NUTCH-835

This has been fixed in 1.2 and 2.0.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
