Hi David, see
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.MD5Signature</value>
  <description>The default implementation of a page signature. Signatures
  created with this implementation will be used for duplicate detection
  and removal.</description>
</property>

See the extending classes MD5Signature, TextMD5Signature,
TextProfileSignature of
http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/crawl/Signature.html

Best,
Sebastian

On 06/09/2017 04:32 PM, David Parker wrote:
> Hello,
>
> I am running Nutch 1.13 and was wondering if the digest field in the crawl
> results can be configured. Ideally, I would like the digest to be a hash
> of the page content only. A bit of Googling landed me at
> https://wiki.apache.org/nutch/IndexStructure which describes the digest
> field as follows:
>
> "Adds a *message digest* field to a document. Can be MD5 over content and
> headers or more sophisticated text profile of the content."
>
> This makes it sound like the contents of the digest can be configured, but
> I can't seem to figure out how. Any help is greatly appreciated. Thanks!
>
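
For illustration, here is a rough sketch of how the db.signature.class
property from the reply might be overridden. It rests on two assumptions
that are not stated in the thread: that local overrides go into
conf/nutch-site.xml, and that TextMD5Signature (which, per its Javadoc,
hashes the parsed text of a page rather than the raw fetched content) is the
closest match for "a hash of the page content only". The description text is
illustrative wording, not copied from nutch-default.xml.

```xml
<?xml version="1.0"?>
<!-- Assumed location: conf/nutch-site.xml. Properties set here are
     expected to take precedence over those in nutch-default.xml. -->
<configuration>
  <property>
    <name>db.signature.class</name>
    <!-- One of the Signature subclasses named in the reply. -->
    <value>org.apache.nutch.crawl.TextMD5Signature</value>
    <!-- Illustrative description, written for this sketch. -->
    <description>Compute the page signature (digest) from the parsed text
    instead of the raw content.</description>
  </property>
</configuration>
```

Pages that were already fetched would presumably keep their old signature
until they are fetched and parsed again, so it may be worth checking the
digest on a fresh crawl.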

