This was reported in NUTCH-1559 and is planned to be fixed in Nutch 1.11. Can you provide a patch? The root of the problem appears to be the use of the tika-parser plugin in combination with parse-metatags.

Regards,

----- Original Message -----
From: "Jeff Cocking" <[email protected]>
To: "Nutch User MailList" <[email protected]>
Sent: Thursday, April 30, 2015 1:05:33 PM
Subject: [MASSMAIL]Re: Duplicate Metatag.Description Values

Ok, after further investigation, I believe I have found the culprit. It
appears we may have conflicting activities occurring:

I was able to remove the error by running only the parse-html plugin. When
the Tika plugin is activated, the duplicate value occurs.

The Tika plugin copies all the Tika metadata into the Nutch metadata (see
code below). The MetaTagsParser is set up to parse both the Tika metadata and
the Nutch metadata. This is the reason for the duplicate values.

I do not know how these values are used elsewhere within the system. It
would appear we could remove the MetaTagsParser execution over the Tika
metadata.

Thoughts?

Source Code:

TikaParser.java (around line 184):

    // populate Nutch metadata with Tika metadata
    String[] TikaMDNames = tikamd.names();
    for (String tikaMDName : TikaMDNames) {
      if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
        continue;
      // TODO what if multivalued?
      nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
    }

MetaTagsParser.java (around line 104):

    // check in the metadata first : the tika-parser
    // might have stored the values there already
    for (String mdName : metadata.names()) {
      addIndexedMetatags(metadata, mdName, metadata.getValues(mdName));
    }

    Metadata generalMetaTags = metaTags.getGeneralTags();
    for (String tagName : generalMetaTags.names()) {
      addIndexedMetatags(metadata, tagName, generalMetaTags.getValues(tagName));
    }

I hope this makes sense....

jeff

On Thu, Apr 30, 2015 at 11:22 AM, Jeff Cocking <[email protected]> wrote:

> I am getting duplicate metatag.description values in my indexed results.
> When running a parse checker, I am picking up meta name=description and the
> meta property=og:description values.
>
> Has anyone else run into this issue? If so, how have you fixed it?
>
> If not, any clues on how to resolve.
>
> Thank you in advance,
> jeff
>
>
> configuration: Nutch 1.9
> nutch-site.xml(partial):
> <!-- Plugin Control statement -->
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-(prefix|suffix|regex)|feed|headings|parse-(tika|html|metatags)|urlmeta|index-(basic|anchor|metadata|img)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description></description>
> </property>
>
> <!-- Parse Meta Tag parameters -->
> <property>
>   <name>metatags.names</name>
>   <value>description</value>
> </property>
>
> <!-- Parse - Tika Controls -->
> <property>
>   <name>tika.boilerpipe</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>tika.boilerpipe.extractor</name>
>   <value>JeffExtractor</value>
> </property>
>
> <!-- Index-Metadata Plugin -->
> <property>
>   <name>index.parse.md</name>
>   <value>metatag.description</value>
> </property>
> <property>
>   <name>index.content.md</name>
>   <value>description</value>
> </property>
>
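
To make the duplication described in the thread concrete, here is a minimal standalone sketch. It does not use the Nutch or Tika classes: a plain map of lists stands in for the multi-valued metadata, and the class name, the helper names (add, addIfAbsent, copyInto, index) and the "metatag." + lowercased-name key are assumptions made for illustration only. It shows two sources that carry the same description being written to the same key, and one possible way to avoid the repeat by skipping values that are already present. It is not the NUTCH-1559 patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; a Map<String, List<String>> stands in
// for a multi-valued metadata object.
public class MetatagDedupSketch {

    // Naive add: always appends, so a repeated value is stored twice.
    static void add(Map<String, List<String>> md, String name, String value) {
        md.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Guarded add: appends only if this exact value is not stored yet.
    static void addIfAbsent(Map<String, List<String>> md, String name, String value) {
        List<String> values = md.computeIfAbsent(name, k -> new ArrayList<>());
        if (!values.contains(value)) {
            values.add(value);
        }
    }

    // Copies one source into the output under an assumed "metatag." prefix.
    static void copyInto(Map<String, List<String>> out,
                         Map<String, String> source, boolean dedupe) {
        for (Map.Entry<String, String> e : source.entrySet()) {
            String key = "metatag." + e.getKey().toLowerCase(Locale.ROOT);
            if (dedupe) {
                addIfAbsent(out, key, e.getValue());
            } else {
                add(out, key, e.getValue());
            }
        }
    }

    // Mimics the two loops quoted above: first the metadata that was
    // already copied over, then the general meta tags.
    static Map<String, List<String>> index(Map<String, String> copiedMetadata,
                                           Map<String, String> generalTags,
                                           boolean dedupe) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        copyInto(out, copiedMetadata, dedupe);
        copyInto(out, generalTags, dedupe);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> copied = Collections.singletonMap("description", "A sample page");
        Map<String, String> general = Collections.singletonMap("description", "A sample page");

        // Prints {metatag.description=[A sample page, A sample page]}
        System.out.println(index(copied, general, false));
        // Prints {metatag.description=[A sample page]}
        System.out.println(index(copied, general, true));
    }
}
```

Skipping identical values keeps genuinely different values (for example a description and a differing og:description) while dropping exact repeats. Whether that is preferable to removing the first loop entirely depends on how the values are used elsewhere, which the thread leaves open.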

