Hello,

I was running the indexing with Nutch 1.0, so the bug mentioned,
NUTCH-835, is most probably the problem. I will give version 1.2 a try.
If that doesn't help, I will try the TextProfileSignature as Reinhard Schwab
suggested.

Again, thanks everyone for their help.

Regards, André




-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]]
Sent: Monday, 23 August 2010 22:38
To: [email protected]
Subject: Re: obvious duplicates with different hash-values

On 2010-08-23 18:11, Andre Pautz wrote:
> Dear list,
>
> I have a problem with removing duplicates from my Nutch index. If I
> understood it right, the dedup option should do the work for me, i.e.
> remove entries with the same URL or the same content (MD5 hash). But
> unfortunately it doesn't.
>
> The strange thing is that, if I check the index with Luke, the pages in
> question do in fact have different hash sums and different URLs. This of
> course explains why the dedup option "fails". But if I take two of these
> URLs, which obviously lead to the same content, store the pages with all
> their content locally and calculate the hash with md5sum, the result is that
> they have the same hash value and are binary identical.
>
> Do you have any hints why these pages are indexed with different hash values?
> What point am I missing here?
>
> Example URLs:
> 1)
> http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 2)
> http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 3)
> http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>
> From what I've read so far about the TextProfileSignature class, it would
> not help me much, since many of the pages I am trying to index are not
> very text-heavy. Since the indexing took quite some time and the number of
> duplicates is large, I would be thankful for any idea on how to remove these
> duplicates.

You didn't say what version of Nutch you are using, but take a look at this 
issue:

https://issues.apache.org/jira/browse/NUTCH-835

This has been fixed in 1.2 and 2.0.
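
For illustration, the content-hash deduplication discussed above (entries with the same MD5 hash are treated as duplicates) can be sketched as follows. This is a minimal Python sketch, not Nutch's implementation: the function names, the in-memory `pages` mapping, and the `example.org` URLs are assumptions made for the example. It mirrors the manual check with md5sum on locally stored pages.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def md5_hex(content: bytes) -> str:
    """Return the MD5 digest of the raw page bytes as a hex string."""
    return hashlib.md5(content).hexdigest()


def group_by_content_hash(pages: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map each content hash to the list of URLs that produced it.

    `pages` is a hypothetical mapping of URL -> raw page bytes.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for url, content in pages.items():
        groups[md5_hex(content)].append(url)
    return dict(groups)


def find_duplicates(pages: Dict[str, bytes]) -> List[List[str]]:
    """Return groups of URLs (each sorted) whose content is byte-identical."""
    return [
        sorted(urls)
        for urls in group_by_content_hash(pages).values()
        if len(urls) > 1
    ]


if __name__ == "__main__":
    # Hypothetical sample data: two URLs serve identical bytes, one differs.
    sample = {
        "http://example.org/nn_1/page.html": b"<html>same</html>",
        "http://example.org/nn_2/page.html": b"<html>same</html>",
        "http://example.org/other.html": b"<html>different</html>",
    }
    for group in find_duplicates(sample):
        print("duplicates:", group)
```

If two pages that are byte-identical on disk end up in different groups inside the index, the stored signature was not computed over the same bytes, which is the kind of symptom described in this thread.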


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

