Thanks, Sebastian. Changing db.signature.class to TextProfileSignature seems to have done exactly what I needed. Much appreciated!
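
For anyone finding this thread later, here is a minimal sketch of that override. It assumes the property is set in conf/nutch-site.xml (which takes precedence over nutch-default.xml) and that the fully qualified class name is org.apache.nutch.crawl.TextProfileSignature, following the package of the default value quoted below. The description text is illustrative and is not copied from the Nutch distribution.

```xml
<!-- conf/nutch-site.xml: sketch of overriding the page signature class -->
<configuration>
  <property>
    <name>db.signature.class</name>
    <!-- Assumed class name, same package as the default MD5Signature -->
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
    <description>Illustrative override: compute the page signature from a
    profile of the page text instead of an MD5 of the raw content.</description>
  </property>
</configuration>
```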

On Sun, Jun 11, 2017 at 5:45 AM, Sebastian Nagel <[email protected]> wrote:

> Hi David,
>
> see
>
> <property>
>   <name>db.signature.class</name>
>   <value>org.apache.nutch.crawl.MD5Signature</value>
>   <description>The default implementation of a page signature. Signatures
>   created with this implementation will be used for duplicate detection
>   and removal.</description>
> </property>
>
> See the extending classes MD5Signature, TextMD5Signature,
> TextProfileSignature of
> http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/crawl/Signature.html
>
> Best,
> Sebastian
>
> On 06/09/2017 04:32 PM, David Parker wrote:
> > Hello,
> >
> > I am running Nutch 1.13 and was wondering if the digest field in the crawl
> > results can be configured. Ideally, I would like the digest to be a hash
> > of the page content only. A bit of Googling landed me at
> > https://wiki.apache.org/nutch/IndexStructure which describes the digest
> > field as follows:
> >
> > "Adds a *message digest* field to a document. Can be MD5 over content and
> > headers or more sophisticated text profile of the content."
> >
> > This makes it sound like the contents of the digest can be configured, but
> > I can't seem to figure out how. Any help is greatly appreciated. Thanks!

--
Dave Parker
Database & Systems Administrator
Utica College
Integrated Information Technology Services
(315) 792-3229
Registered Linux User #408177

