I have updated NUTCH-1559 with the info I found and a patch. I would like someone to review the approach from an architectural perspective. I am too unfamiliar with the underpinnings of Nutch to know whether the proposed solution will cause any downstream issues. I have been testing this on my test system, but my testing only covers the plugins and scope of my project (i.e., it lacks regression testing).
jeff

On Thu, Apr 30, 2015 at 12:35 PM, Jorge Luis Betancourt González <
[email protected]> wrote:

> This was reported in NUTCH-1559 and planned to be solved for Nutch 1.11,
> can you provide a patch?. The root of the problem appears to be using the
> plugin tika-parser in combination with parse-metatags.
>
> Regards,
>
> ----- Original Message -----
> From: "Jeff Cocking" <[email protected]>
> To: "Nutch User MailList" <[email protected]>
> Sent: Thursday, April 30, 2015 1:05:33 PM
> Subject: [MASSMAIL]Re: Duplicate Metatag.Description Values
>
> Ok, after further investigations, I believe I have found the culprit. It
> appears we may have conflicting activities occurring:
>
> I was able to remove the error by only running the parse-html plugin. When
> the tika plugin is activated the duplicate value occurs. The Tika plugin
> copies all the Tika metadata into the nutch metadata. (see code below). The
> MetaTagsParser is setup to parse both Tika metadata and Nutch metadata.
> This is the reason for the duplicate values.
>
> I do not know how these values are used else where within the system. It
> would appear we could remove the MetaTagsParser execution of the Tika
> metadata. Thoughts?
>
> Source Code:
> TikaParser.java (around line 184):
>     // populate Nutch metadata with Tika metadata
>     String[] TikaMDNames = tikamd.names();
>     for (String tikaMDName : TikaMDNames) {
>       if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
>         continue;
>       // TODO what if multivalued?
>       nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
>     }
>
> MetaTagsParser.java (around line 104)
>     // check in the metadata first : the tika-parser
>     // might have stored the values there already
>     for (String mdName : metadata.names()) {
>       addIndexedMetatags(metadata, mdName, metadata.getValues(mdName));
>     }
>
>     Metadata generalMetaTags = metaTags.getGeneralTags();
>     for (String tagName : generalMetaTags.names()) {
>       addIndexedMetatags(metadata, tagName,
>           generalMetaTags.getValues(tagName));
>     }
>
> I hope this makes sense....
>
> jeff
>
>
>
> On Thu, Apr 30, 2015 at 11:22 AM, Jeff Cocking <[email protected]>
> wrote:
>
> > I am getting duplicate metatag.description values in my indexed results.
> > When running a parse checker, I am picking up meta name=description and the
> > meta property=og:description values.
> >
> > Has anyone else ran into this issue? If so, how have you fixed it?
> >
> > If not, any clues on how to resolve.
> >
> > Thank you in advance,
> > jeff
> >
> >
> > configuration: Nutch 1.9
> > nutch-site.xml(partial):
> > <!-- Plugin Control statement -->
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-httpclient|urlfilter-(prefix|suffix|regex)|feed|headings|parse-(tika|html|metatags)|urlmeta|index-(basic|anchor|metadata|img)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description></description>
> > </property>
> >
> > <!-- Parse Meta Tag parameters -->
> > <property>
> >   <name>metatags.names</name>
> >   <value>description</value>
> > </property>
> >
> > <!-- Parse - Tika Controls -->
> > <property>
> >   <name>tika.boilerpipe</name>
> >   <value>true</value>
> > </property>
> >
> > <property>
> >   <name>tika.boilerpipe.extractor</name>
> >   <value>JeffExtractor</value>
> > </property>
> >
> > <!-- Index-Metadata Plugin -->
> > <property>
> >   <name>index.parse.md</name>
> >   <value>metatag.description</value>
> > </property>
> > <property>
> >   <name>index.content.md</name>
> >   <value>description</value>
> > </property>
> >
>
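
For anyone following the thread, here is a minimal sketch of the duplication described in the quoted analysis above. It is not the NUTCH-1559 patch and does not use the real Nutch or Tika classes: the class `MetatagDuplicationSketch`, its methods, and the `skipDuplicates` flag are all hypothetical stand-ins, and a plain map of lists takes the place of Nutch's multi-valued `Metadata`. It only shows why running one pass over metadata that the Tika plugin already populated, and a second pass over the HTML meta tags, can yield two `metatag.description` values, and how a simple "already present" check would avoid that.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the behaviour discussed in the thread.
// A Map<String, List<String>> plays the role of a multi-valued metadata store.
class MetatagDuplicationSketch {

  static final String PREFIX = "metatag.";

  // Appends a value under a name and keeps any earlier values.
  static void add(Map<String, List<String>> md, String name, String value) {
    md.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  // Mimics the two passes quoted from MetaTagsParser: first over the parse
  // metadata (already populated by the Tika plugin), then over the HTML meta
  // tags. For simplicity the results go into a separate output map.
  static List<String> indexDescription(Map<String, List<String>> parseMd,
      Map<String, List<String>> htmlMetaTags, boolean skipDuplicates) {
    String key = PREFIX + "description";
    Map<String, List<String>> out = new LinkedHashMap<>();
    for (Map<String, List<String>> source : List.of(parseMd, htmlMetaTags)) {
      for (String value : source.getOrDefault("description", List.of())) {
        // Assumed guard (not taken from the patch): skip values seen before.
        if (skipDuplicates && out.getOrDefault(key, List.of()).contains(value)) {
          continue;
        }
        add(out, key, value);
      }
    }
    return out.getOrDefault(key, List.of());
  }

  public static void main(String[] args) {
    Map<String, List<String>> copiedFromTika = new LinkedHashMap<>();
    add(copiedFromTika, "description", "Example page description");

    Map<String, List<String>> htmlTags = new LinkedHashMap<>();
    add(htmlTags, "description", "Example page description");

    // Without the guard, both passes contribute the same value.
    System.out.println(indexDescription(copiedFromTika, htmlTags, false));
    // With the guard, the second pass is skipped for values already present.
    System.out.println(indexDescription(copiedFromTika, htmlTags, true));
  }
}
```

Whether a guard like this or dropping one of the two passes is the better fix depends on how other plugins rely on these values, which is the architectural question raised at the top of this message.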

