Use another signature class, e.g. TextProfileSignature (see the property below);
it is tolerant of small changes in the page content.

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>The default implementation of a page signature. Signatures
  created with this implementation will be used for duplicate detection
  and removal.</description>
</property>
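
Roughly, the idea behind TextProfileSignature is to hash a quantized
term-frequency profile of the extracted text rather than the raw bytes, so
small differences (session IDs, timestamps, rotating boilerplate) do not
change the signature. A minimal sketch of that idea -- not the actual Nutch
code; the class name, the 1/8 cutoff and the tokenization below are made up
for illustration:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class ProfileHashSketch {
  public static String profileHash(String text) throws Exception {
    // Count term frequencies over a crude tokenization.
    Map<String, Integer> freq = new HashMap<String, Integer>();
    int max = 1;
    for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
      if (t.length() < 3) continue;
      Integer c = freq.get(t);
      int n = (c == null) ? 1 : c + 1;
      freq.put(t, n);
      if (n > max) max = n;
    }
    // Keep only terms whose frequency is at least 1/8 of the maximum
    // (arbitrary cutoff for this sketch), sorted into a canonical order.
    List<String> kept = new ArrayList<String>();
    for (Map.Entry<String, Integer> e : freq.entrySet()) {
      if (e.getValue() * 8 >= max) kept.add(e.getKey() + ":" + e.getValue());
    }
    Collections.sort(kept);
    // Hash the profile, not the raw page bytes.
    StringBuilder profile = new StringBuilder();
    for (String k : kept) profile.append(k).append(' ');
    byte[] d = MessageDigest.getInstance("MD5")
        .digest(profile.toString().getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : d) hex.append(String.format("%02x", b));
    return hex.toString();
  }
}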

Scott Gonyea wrote:
> Were I to guess, the md5 hash isn't a hash of the content but, rather, of
> the CrawlDatum object that Nutch stores.
>
> Scott
>
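
If that guess were right, it would explain the symptom: two byte-identical
bodies still get different digests as soon as any per-URL data is mixed into
the hashed bytes. A toy illustration in plain Java -- nothing here is Nutch
code, and the URLs are placeholders:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DigestDemo {
  static String md5Hex(byte[] data) throws Exception {
    byte[] d = MessageDigest.getInstance("MD5").digest(data);
    StringBuilder sb = new StringBuilder();
    for (byte b : d) sb.append(String.format("%02x", b));
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    String body = "<html>the very same page body</html>";
    String url1 = "http://example.com/nn_111/page.html";   // placeholder URLs
    String url2 = "http://example.com/nn_222/page.html";

    // Hashing the body alone gives the same digest for both URLs ...
    System.out.println(md5Hex(body.getBytes(StandardCharsets.UTF_8)));
    System.out.println(md5Hex(body.getBytes(StandardCharsets.UTF_8)));

    // ... but once per-URL data is folded into the hashed bytes (a stand-in
    // for "hashing the stored record rather than the content"), byte-identical
    // bodies end up with different digests.
    System.out.println(md5Hex((url1 + body).getBytes(StandardCharsets.UTF_8)));
    System.out.println(md5Hex((url2 + body).getBytes(StandardCharsets.UTF_8)));
  }
}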
> On Mon, Aug 23, 2010 at 9:11 AM, Andre Pautz <[email protected]> wrote:
>
>   
>> Dear list,
>>
>> I have a problem with removing duplicates from my Nutch index. If I
>> understood it right, the dedup option should do the work for me, i.e.
>> remove entries with the same URL or the same content (MD5 hash). But
>> unfortunately it doesn't.
>>
>> The strange thing is that if I check the index with Luke, the pages in
>> question do in fact have different hash sums and different URLs, which of
>> course explains why the dedup option "fails". But if I take two of these
>> URLs, which obviously lead to the same content, save the pages with all
>> their content locally and calculate the hash with md5sum, they turn out to
>> have the same hash value and to be binary identical.
>>
>> Do you have any hints as to why these pages are indexed with different
>> hash values? What am I missing here?
>>
>> Example URLs:
>> 1)
>> http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>> 2)
>> http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>> 3)
>> http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
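
For anyone who wants to reproduce that check without saving the files by
hand, something along these lines (plain Java, no Nutch involved) fetches
each URL given on the command line and prints the MD5 of the raw response
body. Note that it hashes whatever the server returns at that moment, which
may already differ between requests:

import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;

public class FetchAndHash {
  public static void main(String[] args) throws Exception {
    for (String u : args) {                 // pass the URLs as arguments
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      InputStream in = new URL(u).openStream();
      try {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) md5.update(buf, 0, n);
      } finally {
        in.close();
      }
      StringBuilder sb = new StringBuilder();
      for (byte b : md5.digest()) sb.append(String.format("%02x", b));
      System.out.println(sb + "  " + u);
    }
  }
}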
>>
>> From what I've read so far, the TextProfileSignature class would not help
>> me much, since many of the pages I am trying to index are not very
>> text-heavy. Since the indexing took quite some time and the number of
>> duplicates is large, I would be thankful for any ideas on how to remove
>> these duplicates.
>>
>> Thanks for any suggestions,
>> André
