Hi Jonathan,

You shouldn't need to modify parse-plugins.xml to parse HTML docs with Tika: just remove parse-html from plugin.includes in nutch-site.xml. Could you please try that instead and see if it fixes your problem?

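As a rough sketch of what that might look like in nutch-site.xml: the plugin list below is an assumption based on a typical Nutch 1.x setup with parse-metatags and index-metadata added, not your actual configuration, so keep whatever other plugins you already have and only drop parse-html from the parse group.

```xml
<!-- Hypothetical nutch-site.xml fragment: the plugins other than
     parse-tika, parse-metatags and index-metadata are assumed here
     and should match your existing setup. -->
<property>
  <name>plugin.includes</name>
  <!-- parse-html is left out, so HTML documents fall through to Tika -->
  <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With parse-html absent from the list, parse-plugins.xml can stay at its default.
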
Thanks

Julien

On 8 July 2014 19:41, Jonathan Cooper-Ellis <[email protected]> wrote:

> Hello,
>
> I'm using Nutch 1.8 and trying to index HTML metadata in Solr. I followed
> the steps for parsing metatags and had no issues while using parse-html for
> parsing HTML. The problem arises when I modify parse-plugins.xml to parse
> HTML docs with Tika. When Tika parses the doc and plugin.includes has
> parse-metatags and index-metadata listed, the specified metadata fields
> show up twice. So, running indexchecker will list metatag.description
> twice, with identical content.
>
> eg.
>
> *metatag.description : CONCORD, N.H. -- September's primary for the
> Republican nomination for governor pits Walt Havenstein*
>
> *metatag.description : CONCORD, N.H. -- September's primary for the
> Republican nomination for governor pits Walt Havenstein*
>
> Likewise, actually trying to index with Solr will cause Solr to complain
> that the field must allow multiple values, and setting multiValued="true"
> will cause two identical values to be indexed for the field.
>
> I need to parse HTML pages with Tika because I'm using Boilerpipe, so I
> can't just use parse-html, and I can't figure out why this issue is showing
> up with Tika. Any ideas?
>
> Best,
> Jonathan
>

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

