Hi David,

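yes, the digest is configurable. See the property db.signature.class,
shown here with its default from nutch-default.xml: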

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.MD5Signature</value>
  <description>The default implementation of a page signature. Signatures
  created with this implementation will be used for duplicate detection
  and removal.</description>
</property>

See the classes that extend Signature (MD5Signature, TextMD5Signature,
TextProfileSignature):

http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/crawl/Signature.html
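
For a digest over the page text only, TextMD5Signature is probably the
closest fit: it hashes the extracted text rather than the raw fetched
content. A minimal sketch of the override, assuming you put it in
conf/nutch-site.xml (the description text is my own wording, not taken
from the Nutch sources):

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextMD5Signature</value>
  <description>Compute the page signature as an MD5 hash of the parsed
  text instead of the raw fetched content.</description>
</property>

As far as I know, pages crawled before the change keep their old
signature until they are fetched and parsed again.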

Best,
Sebastian

On 06/09/2017 04:32 PM, David Parker wrote:
> Hello,
> 
> I am running Nutch 1.13 and was wondering if the digest field in the crawl
> results can be configured.  Ideally, I would like the digest to be a hash
> of the page content only.  A bit of Googling landed me at
> https://wiki.apache.org/nutch/IndexStructure which describes the digest
> field as follows:
> 
> "Adds a *message digest* field to a document. Can be MD5 over content and
> headers or more sophisticated text profile of the content."
> 
> This makes it sound like the contents of the digest can be configured, but
> I can't seem to figure out how.  Any help is greatly appreciated.  Thanks!
> 
