Hi Jonathan

You shouldn't need to modify parse-plugins.xml to parse HTML docs with
Tika: just remove parse-html from plugin.includes in nutch-site.xml.
Could you please try that instead and see if it fixes your problem?
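
As a rough sketch, the property could end up looking something like the
following. The plugin list here is an assumption (roughly the Nutch 1.8
defaults plus the parse-metatags and index-metadata plugins you mention),
so keep whatever other entries you already have and only drop parse-html:

```xml
<!-- nutch-site.xml: illustrative sketch only; the plugin list below is an
     assumption, so adapt it to your existing plugin.includes value -->
<property>
  <name>plugin.includes</name>
  <!-- parse-html is not listed, so HTML documents are handled by parse-tika;
       parse-metatags and index-metadata remain enabled -->
  <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```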

Thanks

Julien


On 8 July 2014 19:41, Jonathan Cooper-Ellis <[email protected]> wrote:

> Hello,
>
> I'm using Nutch 1.8 and trying to index HTML metadata in Solr. I followed
> the steps for parsing metatags and had no issues while using parse-html for
> parsing HTML. The problem arises when I modify parse-plugins.xml to parse
> HTML docs with Tika. When Tika parses the doc and plugin.includes has
> parse-metatags and index-metadata listed, the specified metadata fields
> show up twice. So, running indexchecker will list metatag.description
> twice, with identical content.
>
> e.g.
>
> *metatag.description : CONCORD, N.H. -- September's primary for the
> Republican nomination for governor pits Walt Havenstein*
>
> *metatag.description : CONCORD, N.H. -- September's primary for the
> Republican nomination for governor pits Walt Havenstein*
>
> Likewise, actually trying to index with Solr will cause Solr to complain
> that the field must allow multiple values, and setting multiValued="true"
> will cause two identical values to be indexed for the field.
>
> I need to parse HTML pages with Tika because I'm using Boilerpipe, so I
> can't just use parse-html, and I can't figure out why this issue is showing
> up with Tika. Any ideas?
>
> Best,
> Jonathan
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
