On 2010-08-23 18:11, Andre Pautz wrote:
Dear list,
I have a problem with removing duplicates from my Nutch index. If I understood
it correctly, the dedup option should do the work for me, i.e. remove entries
with the same URL or the same content (MD5 hash). But unfortunately it doesn't.
The strange thing is that when I check the index with Luke, the pages in question
do in fact have different hash sums and different URLs. That, of course, explains
why the dedup option "fails". But if I take two of these URLs, which obviously
lead to the same content, save the pages with all their content locally, and
compute the hash with md5sum, the result is that they have the same hash value
and are binary identical.
Do you have any hints as to why these pages are indexed with different hash values?
What point am I missing here?
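For reference, this is roughly the check I did (on the URLs listed below), as a
minimal Java sketch: a plain HTTP GET with no special headers, hashing the raw
response bytes. Md5Check is just a throwaway name of mine, nothing from Nutch:

import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;

// Throwaway check: fetch two URLs and compare the MD5 digests of the
// raw response bytes, mirroring what md5sum reports on saved copies.
public class Md5Check {

    static String md5(String url) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new URL(url).openStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String a = md5(args[0]);
        String b = md5(args[1]);
        System.out.println(a + "\n" + b);
        System.out.println(a.equals(b) ? "identical" : "different");
    }
}

Running it on any two of the URLs below prints the same digest twice, which is
what made me suspicious in the first place.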
Example URLs:
1)
http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
2)
http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
3)
http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
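As far as I can tell, the three URLs differ only in the numeric nn_... path
segment and the session id (sid_...). One thing I am considering as a workaround
(untested, and the patterns below are just my guess from these examples) is to
normalize those segments away with the urlnormalizer-regex plugin, e.g. in
conf/regex-normalize.xml, so the URLs collapse to a single form:

<?xml version="1.0"?>
<regex-normalize>
  <!-- untested guess: drop session-id path segments like sid_A75D.../ -->
  <regex>
    <pattern>sid_[0-9A-F]+/</pattern>
    <substitution></substitution>
  </regex>
  <!-- untested guess: drop the numeric nn_... navigation segment -->
  <regex>
    <pattern>nn_[0-9]+/</pattern>
    <substitution></substitution>
  </regex>
</regex-normalize>

Though I don't know yet whether the site still answers on the collapsed URL form,
so I haven't dared to recrawl with this.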
What I've read so far about the TextProfileSignature class suggests it would not
help me that much, since many of the pages I am trying to index are not that
text-heavy. Since the indexing took quite some time and the number of
duplicates is large, I would be thankful for any idea on how to remove these
duplicates.
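(In case it matters: switching the signature implementation would only be a
config change, so I could still try it despite my doubts. If I understand the
docs correctly, the property is db.signature.class in conf/nutch-site.xml,
with MD5Signature as the default; please correct me if I have this wrong:

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>Signature implementation; the default is
  org.apache.nutch.crawl.MD5Signature.</description>
</property>

And if I read the code right, new signatures would only take effect for pages
that are re-fetched afterwards, which is why I'd rather avoid a full recrawl.)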
You didn't say what version of Nutch you are using, but take a look at
this issue:
https://issues.apache.org/jira/browse/NUTCH-835
This has been fixed in 1.2 and 2.0.
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com