Hello,

I was running the indexing with Nutch 1.0, so the bug mentioned in NUTCH-835 is most probably the problem. I will give version 1.2 a try. If that doesn't help, I will try the TextProfileSignature, as Reinhard Schwab suggested.
Again, thanks everyone for their help.

Regards,
André

-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]]
Sent: Monday, 23 August 2010 22:38
To: [email protected]
Subject: Re: obvious duplicates with different hash-values

On 2010-08-23 18:11, Andre Pautz wrote:
> Dear list,
>
> I have a problem with removing duplicates from my Nutch index. If I
> understood it right, the dedup option should do the work for me, i.e.
> remove entries with the same URL or the same content (MD5 hash).
> Unfortunately, it doesn't.
>
> The strange thing is that if I check the index with Luke, the pages in
> question do in fact have different hash sums and different URLs. This of
> course explains why the dedup option "fails". But if I take two of these
> URLs, which obviously lead to the same content, store the pages with all
> their content locally and calculate the hash with md5sum, the result is
> that they have the same hash value and are binary identical.
>
> Do you have any hints why these pages are indexed with different hash
> values? What point am I missing here?
>
> Example URLs:
> 1) http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 2) http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 3) http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>
> From what I've read so far about the TextProfileSignature class, it would
> not help me that much, since many of the pages I am trying to index are
> not that text heavy. Since the indexing took quite some time and the
> number of duplicates is large, I would be thankful for any idea on how to
> remove these duplicates.
You didn't say what version of Nutch you are using, but take a look at this issue:

https://issues.apache.org/jira/browse/NUTCH-835

This has been fixed in 1.2 and 2.0.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
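
For anyone who wants to repeat the manual check described in the original question (saving the pages locally and comparing their md5sum output), here is a minimal Python sketch. It is not Nutch's signature code and says nothing about how Nutch computes its hashes; it only groups already-downloaded page bodies by the MD5 digest of their raw bytes. The function names and the sample data are hypothetical and chosen for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def md5_hex(content: bytes) -> str:
    """Return the MD5 digest of the raw bytes, as md5sum would print it."""
    return hashlib.md5(content).hexdigest()


def group_by_digest(pages: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map each MD5 digest to the list of URLs whose content produced it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url, content in pages.items():
        groups[md5_hex(content)].append(url)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample data: in practice these bytes would be the
    # page bodies saved locally, keyed by the URL they were fetched from.
    pages = {
        "http://example.org/nn_1/page.html": b"<html>same body</html>",
        "http://example.org/nn_2/page.html": b"<html>same body</html>",
        "http://example.org/other.html": b"<html>different body</html>",
    }
    for digest, urls in group_by_digest(pages).items():
        if len(urls) > 1:
            print(digest, "is shared by", len(urls), "URLs:")
            for url in urls:
                print("   ", url)
```

If several URLs end up under one digest here while the index still shows different hash values for them, the difference comes from how the index signature was computed, not from the page content itself.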

