This was reported in NUTCH-1559 and is planned to be fixed in Nutch 1.11. Can
you provide a patch? The root of the problem appears to be using the
parse-tika plugin in combination with parse-metatags.

Regards,

----- Original Message -----
From: "Jeff Cocking" <[email protected]>
To: "Nutch User MailList" <[email protected]>
Sent: Thursday, April 30, 2015 1:05:33 PM
Subject: [MASSMAIL]Re: Duplicate Metatag.Description Values

OK, after further investigation, I believe I have found the culprit. It
appears we may have conflicting activities occurring:

I was able to remove the error by only running the parse-html plugin.  When
the tika plugin is activated the duplicate value occurs.  The Tika plugin
copies all the Tika metadata into the nutch metadata. (see code below). The
MetaTagsParser is set up to parse both Tika metadata and Nutch metadata.
This is the reason for the duplicate values.

I do not know how these values are used elsewhere within the system.  It
would appear we could remove the MetaTagsParser execution of the Tika
metadata.  Thoughts?

Source Code:
TikaParser.java (around line 184):
        // populate Nutch metadata with Tika metadata
        String[] TikaMDNames = tikamd.names();
        for (String tikaMDName : TikaMDNames) {
            if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
                continue;
            // TODO what if multivalued?
            nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
        }

MetaTagsParser.java (around line 104)
    // check in the metadata first : the tika-parser
    // might have stored the values there already
    for (String mdName : metadata.names()) {
      addIndexedMetatags(metadata, mdName, metadata.getValues(mdName));
    }

    Metadata generalMetaTags = metaTags.getGeneralTags();
    for (String tagName : generalMetaTags.names()) {
      addIndexedMetatags(metadata, tagName,
          generalMetaTags.getValues(tagName));
    }
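
To illustrate the suspected mechanism, here is a minimal, self-contained sketch. It does not use the Nutch or Tika classes: the map-based store, the class name DuplicateMetatagSketch, and the helpers add, addIfAbsent and index are hypothetical stand-ins. It only assumes what is described above, namely that the same description value reaches the indexed metadata once from the copied Tika metadata and once from the general meta tags.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateMetatagSketch {

    // Hypothetical multi-valued store; a stand-in, not the Nutch Metadata class.
    static void add(Map<String, List<String>> md, String name, String value) {
        md.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Same as add(), but skips a value already stored under that name.
    static void addIfAbsent(Map<String, List<String>> md, String name, String value) {
        List<String> values = md.computeIfAbsent(name, k -> new ArrayList<>());
        if (!values.contains(value)) {
            values.add(value);
        }
    }

    // Simulates the two passes: first over metadata copied from the parser,
    // then over the general meta tags of the page.
    static List<String> index(boolean dedupe) {
        String description = "Example page description";

        // Assumption: both sources carry the same description value.
        Map<String, String> copiedFromTika = Map.of("description", description);
        Map<String, String> generalMetaTags = Map.of("description", description);

        Map<String, List<String>> indexed = new LinkedHashMap<>();
        List<Map<String, String>> passes = List.of(copiedFromTika, generalMetaTags);
        for (Map<String, String> source : passes) {
            for (Map.Entry<String, String> e : source.entrySet()) {
                String key = "metatag." + e.getKey();
                if (dedupe) {
                    addIfAbsent(indexed, key, e.getValue());
                } else {
                    add(indexed, key, e.getValue());
                }
            }
        }
        return indexed.get("metatag.description");
    }

    public static void main(String[] args) {
        System.out.println("without guard: " + index(false));
        System.out.println("with guard:    " + index(true));
    }
}
```

Without the guard the list holds the description twice; with it, once. Whether a value check like this or dropping one of the two passes is the better fix for the real plugin is an open question.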

I hope this makes sense....

jeff



On Thu, Apr 30, 2015 at 11:22 AM, Jeff Cocking <[email protected]>
wrote:

> I am getting duplicate metatag.description values in my indexed results.
> When running a parse checker, I am picking up meta name=description and the
> meta property=og:description values.
>
> Has anyone else run into this issue?  If so, how have you fixed it?
>
> If not, any clues on how to resolve it?
>
> Thank you in advance,
> jeff
>
>
> configuration: Nutch 1.9
> nutch-site.xml(partial):
> <!-- Plugin Control statement -->
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-(prefix|suffix|regex)|feed|headings|parse-(tika|html|metatags)|urlmeta|index-(basic|anchor|metadata|img)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description></description>
> </property>
>
> <!-- Parse Meta Tag parameters -->
> <property>
>   <name>metatags.names</name>
>   <value>description</value>
> </property>
>
> <!-- Parse - Tika Controls -->
> <property>
>   <name>tika.boilerpipe</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>tika.boilerpipe.extractor</name>
>   <value>JeffExtractor</value>
> </property>
>
> <!-- Index-Metadata Plugin -->
> <property>
>   <name>index.parse.md</name>
>   <value>metatag.description</value>
> </property>
> <property>
>   <name>index.content.md</name>
>   <value>description</value>
> </property>
>
