Hello,

I'm using Nutch 1.8 and trying to index HTML metadata in Solr. I followed
the steps for parsing metatags and had no issues while using parse-html to
parse HTML. The problem appears once I modify parse-plugins.xml so that
HTML docs are parsed with Tika. When Tika parses the doc and plugin.includes
lists parse-metatags and index-metadata, each configured metadata field
shows up twice. For instance, running indexchecker lists
metatag.description twice, with identical content.

For example:

*metatag.description : CONCORD, N.H. -- September's primary for the
Republican nomination for governor pits Walt Havenstein*

*metatag.description : CONCORD, N.H. -- September's primary for the
Republican nomination for governor pits Walt Havenstein*
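
The setup described above corresponds roughly to the configuration below.
This is a sketch, not a copy of the files in use: the plugin.includes value
is an assumed, abbreviated list, and the metatags.names and index.parse.md
values are assumed from the standard metatags setup, limited to the one
field shown in the output.

```xml
<!-- conf/parse-plugins.xml: route HTML to Tika instead of parse-html -->
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<!-- conf/nutch-site.xml: assumed values, for illustration only -->
<property>
  <name>plugin.includes</name>
  <!-- assumed list; the relevant parts are parse-tika, parse-metatags
       and index-metadata -->
  <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <!-- metatag to extract; stored in the parse metadata as metatag.description -->
  <name>metatags.names</name>
  <value>description</value>
</property>
<property>
  <!-- parse metadata key that index-metadata copies into the index -->
  <name>index.parse.md</name>
  <value>metatag.description</value>
</property>
```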

Likewise, indexing into Solr fails with Solr complaining that the field
must allow multiple values, and setting multiValued="true" on the field
results in two identical values being indexed for it.

I need to parse HTML pages with Tika because I'm using Boilerpipe, so I
can't just use parse-html, and I can't figure out why this issue is showing
up with Tika. Any ideas?

Best,
Jonathan
