Thanks, Sebastian.  Changing db.signature.class to TextProfileSignature
seems to have done exactly what I needed.  Much appreciated!
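
For anyone who finds this thread later, here is a minimal sketch of the override. It assumes the property is placed in conf/nutch-site.xml, the usual location for local overrides of nutch-default.xml. The description text is my own wording, not taken from the Nutch distribution.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element -->
<property>
  <name>db.signature.class</name>
  <!-- Replaces the default org.apache.nutch.crawl.MD5Signature -->
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>Compute the page signature (digest) from a profile of the
  page text instead of an MD5 hash of the raw content.</description>
</property>
```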

On Sun, Jun 11, 2017 at 5:45 AM, Sebastian Nagel <[email protected]> wrote:

> Hi David,
>
> See the db.signature.class property:
>
> <property>
>   <name>db.signature.class</name>
>   <value>org.apache.nutch.crawl.MD5Signature</value>
>   <description>The default implementation of a page signature. Signatures
>   created with this implementation will be used for duplicate detection
>   and removal.</description>
> </property>
>
> See the extending classes MD5Signature, TextMD5Signature,
> TextProfileSignature of
>   http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/crawl/Signature.html
>
> Best,
> Sebastian
>
> On 06/09/2017 04:32 PM, David Parker wrote:
> > Hello,
> >
> > I am running Nutch 1.13 and was wondering if the digest field in the
> > crawl results can be configured.  Ideally, I would like the digest to be
> > a hash of the page content only.  A bit of Googling landed me at
> > https://wiki.apache.org/nutch/IndexStructure which describes the digest
> > field as follows:
> >
> > "Adds a *message digest* field to a document. Can be MD5 over content and
> > headers or more sophisticated text profile of the content."
> >
> > This makes it sound like the contents of the digest can be configured,
> > but I can't seem to figure out how.  Any help is greatly appreciated.
> > Thanks!
> >
>
>


-- 
Dave Parker
Database & Systems Administrator
Utica College
Integrated Information Technology Services
(315) 792-3229
Registered Linux User #408177
