OK, after further investigation, I believe I have found the culprit. It
appears two plugins are doing overlapping work.

I was able to remove the error by running only the parse-html plugin. When
the tika plugin is activated, the duplicate value occurs. The Tika plugin
copies all of the Tika metadata into the Nutch metadata (see code below).
The MetaTagsParser is set up to parse both the Tika metadata and the Nutch
metadata, so the same tag is added twice. This is the reason for the
duplicate values.

I do not know how these values are used elsewhere within the system. It
would appear we could remove the MetaTagsParser pass over the Tika
metadata. Thoughts?
Source Code:

TikaParser.java (around line 184):

    // populate Nutch metadata with Tika metadata
    String[] TikaMDNames = tikamd.names();
    for (String tikaMDName : TikaMDNames) {
      if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
        continue;
      // TODO what if multivalued?
      nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
    }
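
On the TODO in the loop above: a single-value getter normally returns only the first value stored under a name, so any further values would be dropped by the copy. Here is a minimal sketch of the difference. SimpleMetadata is a hypothetical stand-in written for this example, not the Tika or Nutch Metadata class, and the "keywords" values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedCopyDemo {

    // Hypothetical stand-in for a multi-valued metadata container.
    static class SimpleMetadata {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        // Returns only the first value for a name, or null if there is none.
        String get(String name) {
            List<String> v = values.get(name);
            return (v == null || v.isEmpty()) ? null : v.get(0);
        }

        // Returns every value for a name (empty array if there is none).
        String[] getValues(String name) {
            List<String> v = values.get(name);
            return v == null ? new String[0] : v.toArray(new String[0]);
        }

        String[] names() {
            return values.keySet().toArray(new String[0]);
        }
    }

    // Same shape as the quoted loop: one value copied per name.
    static SimpleMetadata copyFirstOnly(SimpleMetadata source) {
        SimpleMetadata target = new SimpleMetadata();
        for (String name : source.names()) {
            target.add(name, source.get(name));
        }
        return target;
    }

    // Possible alternative: copy every value stored under each name.
    static SimpleMetadata copyAll(SimpleMetadata source) {
        SimpleMetadata target = new SimpleMetadata();
        for (String name : source.names()) {
            for (String value : source.getValues(name)) {
                target.add(name, value);
            }
        }
        return target;
    }

    // Made-up sample data with one multi-valued name.
    static SimpleMetadata sample() {
        SimpleMetadata md = new SimpleMetadata();
        md.add("keywords", "nutch");
        md.add("keywords", "tika");
        return md;
    }

    public static void main(String[] args) {
        SimpleMetadata source = sample();
        System.out.println(Arrays.toString(copyFirstOnly(source).getValues("keywords")));
        System.out.println(Arrays.toString(copyAll(source).getValues("keywords")));
    }
}
```

Whether copying every value is desirable in TikaParser is a separate question, since it could also increase the number of values the MetaTagsParser later picks up.
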

MetaTagsParser.java (around line 104):

    // check in the metadata first : the tika-parser
    // might have stored the values there already
    for (String mdName : metadata.names()) {
      addIndexedMetatags(metadata, mdName, metadata.getValues(mdName));
    }
    Metadata generalMetaTags = metaTags.getGeneralTags();
    for (String tagName : generalMetaTags.names()) {
      addIndexedMetatags(metadata, tagName,
          generalMetaTags.getValues(tagName));
    }
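
To make the interaction between the two snippets easier to see, here is a minimal, self-contained sketch of how the two passes could produce a duplicate. It does not use the Nutch classes: SimpleMetadata is a hypothetical stand-in, the "metatag." prefix and lower-casing in addIndexedMetatags are my assumption about what the real method does, and the skipDuplicates flag is only an illustration of one possible guard, not an existing option.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class DuplicateMetatagDemo {

    // Hypothetical stand-in for a multi-valued metadata container.
    static class SimpleMetadata {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        // Returns a copy so callers can iterate while the store is modified.
        List<String> getValues(String name) {
            List<String> v = values.get(name);
            return v == null ? new ArrayList<String>() : new ArrayList<String>(v);
        }

        // Snapshot of the names currently stored.
        List<String> names() {
            return new ArrayList<String>(values.keySet());
        }
    }

    // Assumed behaviour: store wanted tags under "metatag.<lowercase name>".
    static void addIndexedMetatags(SimpleMetadata md, Set<String> wanted,
                                   String name, List<String> vals,
                                   boolean skipDuplicates) {
        String lc = name.toLowerCase(Locale.ROOT);
        if (!wanted.contains(lc)) {
            return;
        }
        String key = "metatag." + lc;
        for (String value : vals) {
            if (skipDuplicates && md.getValues(key).contains(value)) {
                continue; // illustrative guard: value already recorded
            }
            md.add(key, value);
        }
    }

    static List<String> run(boolean skipDuplicates) {
        Set<String> wanted = Collections.singleton("description");

        // Parse metadata after the Tika plugin has copied its values in.
        SimpleMetadata metadata = new SimpleMetadata();
        metadata.add("description", "A page about widgets");

        // The same tag as seen in the HTML meta tags.
        SimpleMetadata generalMetaTags = new SimpleMetadata();
        generalMetaTags.add("description", "A page about widgets");

        // Pass 1: values already present in the parse metadata.
        for (String mdName : metadata.names()) {
            addIndexedMetatags(metadata, wanted, mdName,
                    metadata.getValues(mdName), skipDuplicates);
        }
        // Pass 2: the general meta tags.
        for (String tagName : generalMetaTags.names()) {
            addIndexedMetatags(metadata, wanted, tagName,
                    generalMetaTags.getValues(tagName), skipDuplicates);
        }
        return metadata.getValues("metatag.description");
    }

    public static void main(String[] args) {
        System.out.println("both passes:     " + run(false).size() + " value(s)");
        System.out.println("with dedup guard: " + run(true).size() + " value(s)");
    }
}
```

Under these assumptions, running both passes stores the description twice, while the guard keeps a single copy. Dropping the first pass altogether, as suggested above, would be the other way to get the same result.
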
I hope this makes sense....
jeff
On Thu, Apr 30, 2015 at 11:22 AM, Jeff Cocking <[email protected]>
wrote:
> I am getting duplicate metatag.description values in my indexed results.
> When running a parse checker, I am picking up meta name=description and the
> meta property=og:description values.
>
> Has anyone else run into this issue? If so, how have you fixed it?
>
> If not, any clues on how to resolve it?
>
> Thank you in advance,
> jeff
>
>
> configuration: Nutch 1.9
> nutch-site.xml (partial):
> <!-- Plugin Control statement -->
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-(prefix|suffix|regex)|feed|headings|parse-(tika|html|metatags)|urlmeta|index-(basic|anchor|metadata|img)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description></description>
> </property>
>
> <!-- Parse Meta Tag parameters -->
> <property>
>   <name>metatags.names</name>
>   <value>description</value>
> </property>
>
> <!-- Parse - Tika Controls -->
> <property>
>   <name>tika.boilerpipe</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>tika.boilerpipe.extractor</name>
>   <value>JeffExtractor</value>
> </property>
>
> <!-- Index-Metadata Plugin -->
> <property>
>   <name>index.parse.md</name>
>   <value>metatag.description</value>
> </property>
> <property>
>   <name>index.content.md</name>
>   <value>description</value>
> </property>
>