Hello Julien,

Thanks for the reply. Unfortunately, undoing the changes I made to
parse-plugins.xml and only removing parse-html from plugin.includes does
not fix the double indexing issue. It might also be worth mentioning that
this also happens on a fresh install of Nutch 1.8, without using
Boilerpipe. Seems like as long as I'm parsing with Tika I get back
duplicate metadata. Do you have any other thoughts?
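
For concreteness, here is a minimal sketch of the nutch-site.xml setup under
discussion. The plugin list is an assumption based on the Nutch 1.8 defaults
with parse-html removed and the metatag plugins added; it is not a verbatim
copy of the configuration in use, and the single "description" tag is only
an illustration.

```xml
<!-- Sketch only: values are assumed from the Nutch 1.8 defaults -->
<property>
  <name>plugin.includes</name>
  <!-- parse-html removed so that Tika handles HTML; metatag plugins added -->
  <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>metatags.names</name>
  <!-- meta tags to extract at parse time -->
  <value>description</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- parse metadata keys that index-metadata should send to the index -->
  <value>metatag.description</value>
</property>
```

With a setup along these lines, indexchecker still lists metatag.description
twice.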

Best,
Jonathan


On Wed, Jul 9, 2014 at 4:11 AM, Julien Nioche <[email protected]> wrote:

> Hi Jonathan
>
> You shouldn't need to modify parse-plugins.xml to parse HTML docs with
> Tika: just remove parse-html from plugin.includes in nutch-site.xml.
> Could you please try that instead and see if that fixes your problem?
>
> Thanks
>
> Julien
>
>
> On 8 July 2014 19:41, Jonathan Cooper-Ellis <[email protected]> wrote:
>
> > Hello,
> >
> > I'm using Nutch 1.8 and trying to index HTML metadata in Solr. I followed
> > the steps for parsing metatags and had no issues while using parse-html for
> > parsing HTML. The problem arises when I modify parse-plugins.xml to parse
> > HTML docs with Tika. When Tika parses the doc and plugin.includes has
> > parse-metatags and index-metadata listed, the specified metadata fields
> > show up twice. So, running indexchecker will list metatag.description
> > twice, with identical content.
> >
> > e.g.
> >
> > *metatag.description : CONCORD, N.H. -- September's primary for the
> > Republican nomination for governor pits Walt Havenstein*
> >
> > *metatag.description : CONCORD, N.H. -- September's primary for the
> > Republican nomination for governor pits Walt Havenstein*
> >
> > Likewise, actually trying to index with Solr will cause Solr to complain
> > that the field must allow multiple values, and setting multiValued="true"
> > will cause two identical values to be indexed for the field.
> >
> > I need to parse HTML pages with Tika because I'm using Boilerpipe, so I
> > can't just use parse-html, and I can't figure out why this issue is showing
> > up with Tika. Any ideas?
> >
> > Best,
> > Jonathan
> >
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
