Hi,

Can you please open a JIRA issue on
https://issues.apache.org/jira/browse/NUTCH and include a URL which can be
used to reproduce the problem?

Thanks

Julien


On 9 July 2014 14:37, Jonathan Cooper-Ellis <[email protected]> wrote:

> Hello Julien,
>
> Thanks for the reply. Unfortunately, undoing the changes I made to
> parse-plugins.xml and only removing parse-html from plugin.includes does
> not fix the double indexing issue. It might also be worth mentioning that
> this also happens on a fresh install of Nutch 1.8, without using
> Boilerpipe. It seems that as long as I'm parsing with Tika, I get
> duplicate metadata back. Do you have any other thoughts?
>
> Best,
> Jonathan
>
>
> On Wed, Jul 9, 2014 at 4:11 AM, Julien Nioche <[email protected]> wrote:
>
> > Hi Jonathan
> >
> > You shouldn't need to modify parse-plugins.xml to parse HTML docs with
> > Tika: just remove parse-html from plugin.includes in nutch-site.xml.
> > Could you please try that instead and see if that fixes your problem?
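> >
> > A minimal sketch of that change in nutch-site.xml. Only the removal of
> > parse-html and the presence of parse-metatags and index-metadata come
> > from this thread; the remaining plugin names are assumed from a typical
> > Nutch 1.x plugin.includes value and may differ in your setup.
> >
> > ```xml
> > <!-- nutch-site.xml: parse HTML with Tika by leaving parse-html out. -->
> > <!-- Assumption: every plugin other than parse-tika, parse-metatags -->
> > <!-- and index-metadata is illustrative; keep your own list. -->
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > </property>
> > ```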
> >
> > Thanks
> >
> > Julien
> >
> >
> > On 8 July 2014 19:41, Jonathan Cooper-Ellis <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I'm using Nutch 1.8 and trying to index HTML metadata in Solr. I
> > > followed the steps for parsing metatags and had no issues while using
> > > parse-html for parsing HTML. The problem arises when I modify
> > > parse-plugins.xml to parse HTML docs with Tika. When Tika parses the
> > > doc and plugin.includes has
> > > parse-metatags and index-metadata listed, the specified metadata fields
> > > show up twice. So, running indexchecker will list metatag.description
> > > twice, with identical content.
> > >
> > > e.g.
> > >
> > > *metatag.description : CONCORD, N.H. -- September's primary for the
> > > Republican nomination for governor pits Walt Havenstein*
> > >
> > > *metatag.description : CONCORD, N.H. -- September's primary for the
> > > Republican nomination for governor pits Walt Havenstein*
> > >
> > > Likewise, actually trying to index with Solr will cause Solr to
> > > complain that the field must allow multiple values, and setting
> > > multiValued="true" will cause two identical values to be indexed for
> > > the field.
> > >
> > > I need to parse HTML pages with Tika because I'm using Boilerpipe, so I
> > > can't just use parse-html, and I can't figure out why this issue is
> > > showing up with Tika. Any ideas?
> > >
> > > Best,
> > > Jonathan
> > >
> >
> >
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
