Use another signature. TextProfileSignature is tolerant of small changes. Set it in nutch-site.xml:

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>The default implementation of a page signature. Signatures created with this implementation will be used for duplicate detection and removal.</description>
</property>

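To illustrate why a profile-based signature can survive small differences such as an embedded session id, here is a rough Python sketch. It is only loosely modeled on the idea behind TextProfileSignature and is not Nutch's actual code: the constants MIN_TOKEN_LEN and QUANT_RATE, the tokenizer, and the sample strings page_a and page_b are assumptions made up for the example.

```python
import hashlib
import re
from collections import Counter

MIN_TOKEN_LEN = 2   # assumption: tokens shorter than this are ignored
QUANT_RATE = 0.01   # assumption: quantization step as a fraction of the top frequency


def md5_signature(raw: bytes) -> str:
    """Exact signature: any single differing byte changes the hash."""
    return hashlib.md5(raw).hexdigest()


def text_profile_signature(text: str) -> str:
    """Tolerant signature: hash a quantized word-frequency profile of the text."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= MIN_TOKEN_LEN]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    max_freq = max(counts.values())
    quant = max(1, round(max_freq * QUANT_RATE))
    if quant < 2 and max_freq > 1:
        quant = 2  # drop tokens that occur only once when others repeat

    # Round every count down to a multiple of quant and drop rare tokens.
    profile = {}
    for token, count in counts.items():
        quantized = (count // quant) * quant
        if quantized >= quant:
            profile[token] = quantized

    # Sort by descending frequency, then by token, so the hash is deterministic.
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    serialized = "\n".join(f"{count} {token}" for token, count in ordered)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


# Two made-up pages that differ only in a session id that occurs once.
body = "service service service bau bau bau fach fach"
page_a = body + " sid A75D796C"
page_b = body + " sid 0F3E11AA"

print("md5 equal:    ", md5_signature(page_a.encode()) == md5_signature(page_b.encode()))
print("profile equal:", text_profile_signature(page_a) == text_profile_signature(page_b))
```

The one-off session id falls below the quantization threshold and is dropped from the profile, so both pages get the same signature, while the plain MD5 of the bytes differs. On pages with very little text the profile has few tokens to work with, so the benefit may be smaller.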
Scott Gonyea wrote:
> Were I to guess, the md5 hash isn't a hash of the content but, rather, of
> the CrawlDatum object that Nutch stores.
>
> Scott
>
> On Mon, Aug 23, 2010 at 9:11 AM, Andre Pautz <[email protected]> wrote:
>
>> Dear list,
>>
>> I have a problem with removing duplicates from my Nutch index. If I
>> understood it right, the dedup option should do the work for me, i.e.
>> remove entries with the same URL or the same content (MD5 hash). But
>> unfortunately it doesn't.
>>
>> The strange thing is that if I check the index with Luke, the pages in
>> question do in fact have different hash sums and different URLs. This of
>> course explains why the dedup option "fails". But if I take two of these
>> URLs, which obviously lead to the same content, store the pages with all
>> their content locally and calculate the hash with md5sum, the result is
>> that they have the same hash value and are binary identical.
>>
>> Do you have any hints why these pages are indexed with different hash
>> values? What point am I missing here?
>>
>> Example URLs:
>> 1)
>> http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>> 2)
>> http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>> 3)
>> http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>>
>> From what I've read so far about the TextProfileSignature class, it would
>> not help me that much, since many of the pages I am trying to index are
>> not that text-heavy. Since the indexing took quite some time and the
>> number of duplicates is large, I would be thankful for any idea on how to
>> remove these duplicates.
>>
>> Thanks for any suggestions,
>> André

