Were I to guess, the MD5 hash isn't a hash of the page content but rather of the CrawlDatum object that Nutch stores.
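One quick way to check that guess would be to compare what Nutch actually stored against an MD5 of the raw page. A minimal sketch, assuming a Nutch 1.x setup with the crawldb under crawl/crawldb (adjust the path to your layout; the URL is the first one from your list):

  # Dump the stored CrawlDatum for one URL; the output includes the Signature field
  bin/nutch readdb crawl/crawldb -url "http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true"

  # For comparison: MD5 of the raw bytes fetched directly
  curl -s "http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true" | md5sum

If the Signature printed by readdb doesn't match the md5sum of the raw content, that would support the idea that something other than the plain content bytes is being hashed.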
Scott

On Mon, Aug 23, 2010 at 9:11 AM, Andre Pautz <[email protected]> wrote:
> Dear list,
>
> I have a problem with removing duplicates from my Nutch index. If I
> understood it correctly, the dedup option should do the work for me, i.e.
> remove entries with the same URL or the same content (MD5 hash). But
> unfortunately it doesn't.
>
> The strange thing is that if I check the index with Luke, the pages in
> question do in fact have different hash values and different URLs. This
> of course explains why the dedup option "fails". But if I take two of
> these URLs, which obviously lead to the same content, store the pages
> with all their content locally, and calculate the hash with md5sum, the
> result is that they have the same hash value and are binary identical.
>
> Do you have any hints as to why these pages are indexed with different
> hash values? What point am I missing here?
>
> Example URLs:
> 1) http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 2) http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 3) http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>
> What I have read so far about the TextProfileSignature class suggests it
> would not help me much, since many of the pages I am trying to index are
> not that text heavy. Since the indexing took quite some time and the
> number of duplicates is large, I would be thankful for any idea on how
> to remove these duplicates.
>
> Thanks for any suggestions,
> André
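For what it's worth, if the stored signatures do turn out to be the problem, the implementation that computes them is pluggable. A sketch of the relevant nutch-site.xml property, assuming Nutch 1.x, where db.signature.class selects the Signature class (org.apache.nutch.crawl.MD5Signature is the stock default, TextProfileSignature the stock alternative, though as André notes it may be a poor fit for pages that aren't text heavy):

  <!-- nutch-site.xml: pick which Signature implementation dedup compares.
       The default is org.apache.nutch.crawl.MD5Signature. -->
  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
  </property>

Note that signatures are computed when pages are parsed, so existing crawldb entries keep their old values until the pages are fetched and parsed again.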

