Hello Julien,

Thanks for the reply. Unfortunately, undoing the changes I made to parse-plugins.xml and only removing parse-html from plugin.includes does not fix the double indexing issue. It might also be worth mentioning that this also happens on a fresh install of Nutch 1.8, without using Boilerpipe. It seems that as long as I'm parsing with Tika, I get back duplicate metadata. Do you have any other thoughts?

Best,
Jonathan

On Wed, Jul 9, 2014 at 4:11 AM, Julien Nioche <[email protected]> wrote:

> Hi Jonathan
>
> You shouldn't need to modify parse-plugins.xml to parse HTML docs with
> Tika : just remove parse-html from plugin.includes from nutch-site.xml.
> Could you please try that instead and see if that fixes your problem?
>
> Thanks
>
> Julien
>
>
> On 8 July 2014 19:41, Jonathan Cooper-Ellis <[email protected]> wrote:
>
> > Hello,
> >
> > I'm using Nutch 1.8 and trying to index HTML metadata in Solr. I followed
> > the steps for parsing metatags and had no issues while using parse-html for
> > parsing HTML. The problem arises when I modify parse-plugins.xml to parse
> > HTML docs with Tika. When Tika parses the doc and plugin.includes has
> > parse-metatags and index-metadata listed, the specified metadata fields
> > show up twice. So, running indexchecker will list metatag.description
> > twice, with identical content.
> >
> > eg.
> >
> > *metatag.description : CONCORD, N.H. -- September's primary for the
> > Republican nomination for governor pits Walt Havenstein*
> >
> > *metatag.description : CONCORD, N.H. -- September's primary for the
> > Republican nomination for governor pits Walt Havenstein*
> >
> > Likewise, actually trying to index with Solr will cause Solr to complain
> > that the field must allow multiple values, and setting multiValued="true"
> > will cause two identical values to be indexed for the field.
> >
> > I need to parse HTML pages with Tika because I'm using Boilerpipe, so I
> > can't just use parse-html, and I can't figure out why this issue is showing
> > up with Tika. Any ideas?
> >
> > Best,
> > Jonathan
> >
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

