If I had to guess, the MD5 hash isn't a hash of the content but rather of
the CrawlDatum object that Nutch stores.
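
A quick way to test that theory is to repeat André's md5sum experiment in
code and compare the result against the digest field Luke shows for each
document. A minimal, self-contained sketch (plain Java, no Nutch
dependencies; the class and method names are mine):

import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;

// Reproduces the md5sum experiment: download each URL given on the
// command line and print the MD5 of the raw bytes. If two URLs print the
// same digest here but carry different digests in the index, the indexed
// value can't be a plain MD5 of the page content.
public class DigestCheck {
    static String md5Of(String url) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new URL(url).openStream();
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        for (String url : args) {
            System.out.println(md5Of(url) + "  " + url);
        }
    }
}

If the downloads hash identically here while the indexed digests differ,
then whatever ends up in the digest field isn't a plain MD5 of the page
bytes, and the db.signature.class property (MD5Signature by default, if I
remember the config right) would be the knob to look at.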

Scott

On Mon, Aug 23, 2010 at 9:11 AM, Andre Pautz <[email protected]> wrote:

> Dear list,
>
> I have a problem with removing duplicates from my Nutch index. If I
> understood it correctly, the dedup option should do the work for me, i.e.
> remove entries with the same URL or the same content (MD5 hash).
> Unfortunately, it doesn't.
>
> The strange thing is that if I check the index with Luke, the pages in
> question do in fact have different hash sums and different URLs. That, of
> course, explains why the dedup option "fails". But if I take two of these
> URLs, which obviously lead to the same content, save the pages with all
> their content locally and compute the hash with md5sum, they turn out to
> have the same hash value and to be binary-identical.
>
> Do you have any hints as to why these pages are indexed with different
> hash values? What point am I missing here?
>
> Example URLs:
> 1)
> http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 2)
> http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 3)
> http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>
> From what I've read so far, the TextProfileSignature class would not help
> me much, since many of the pages I am trying to index are not that
> text-heavy. Since the indexing took quite some time and the number of
> duplicates is large, I would be thankful for any idea on how to remove
> these duplicates.
>
> Thanks for any suggestions,
> André
>
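
One workaround that sidesteps the signature question entirely would be to
normalize the URLs before Nutch fetches them, so that the varying
nn_<digits> and sid_<hex> path segments collapse into a single canonical
URL and the duplicates never enter the index in the first place. A
standalone sketch of the regexes (not wired into Nutch's URL-normalizer
plugin; you would port the patterns into your own normalizer
configuration):

import java.util.regex.Pattern;

// Strips the varying nn_<digits> and sid_<hex> path segments from the
// example URLs so that all three variants map to the same canonical URL.
public class UrlCanonicalizer {
    private static final Pattern NN  = Pattern.compile("/nn_\\d+");
    private static final Pattern SID = Pattern.compile("/sid_[0-9A-Fa-f]+");

    public static String canonicalize(String url) {
        url = NN.matcher(url).replaceAll("");
        return SID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // The third example URL; after canonicalization it equals what
        // the first two reduce to as well.
        System.out.println(canonicalize(
            "http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E"
                + "/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true"));
    }
}

Run against all three example URLs, this yields the same string, so a
normalizer built on these patterns would make the variants fetch and index
as a single page.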
