Hi,

Can you please open a JIRA issue on https://issues.apache.org/jira/browse/NUTCH and include a URL which can be used to reproduce the problem?
Thanks

Julien

On 9 July 2014 14:37, Jonathan Cooper-Ellis <[email protected]> wrote:

> Hello Julien,
>
> Thanks for the reply. Unfortunately, undoing the changes I made to parse-plugins.xml and only removing parse-html from plugin.includes does not fix the double indexing issue. It might also be worth mentioning that this also happens on a fresh version of Nutch 1.8, without using Boilerpipe. It seems that as long as I'm parsing with Tika, I get back duplicate metadata. Do you have any other thoughts?
>
> Best,
> Jonathan
>
> On Wed, Jul 9, 2014 at 4:11 AM, Julien Nioche <[email protected]> wrote:
>
> > Hi Jonathan
> >
> > You shouldn't need to modify parse-plugins.xml to parse HTML docs with Tika: just remove parse-html from plugin.includes in nutch-site.xml. Could you please try that instead and see if that fixes your problem?
> >
> > Thanks
> >
> > Julien
> >
> > On 8 July 2014 19:41, Jonathan Cooper-Ellis <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I'm using Nutch 1.8 and trying to index HTML metadata in Solr. I followed the steps for parsing metatags and had no issues while using parse-html for parsing HTML. The problem arises when I modify parse-plugins.xml to parse HTML docs with Tika. When Tika parses the doc and plugin.includes has parse-metatags and index-metadata listed, the specified metadata fields show up twice. So, running indexchecker will list metatag.description twice, with identical content.
> > >
> > > eg.
> > >
> > > *metatag.description : CONCORD, N.H. -- September's primary for the Republican nomination for governor pits Walt Havenstein*
> > >
> > > *metatag.description : CONCORD, N.H. -- September's primary for the Republican nomination for governor pits Walt Havenstein*
> > >
> > > Likewise, actually trying to index with Solr will cause Solr to complain that the field must allow multiple values, and setting multiValued="true" will cause two identical values to be indexed for the field.
> > >
> > > I need to parse HTML pages with Tika because I'm using Boilerpipe, so I can't just use parse-html, and I can't figure out why this issue is showing up with Tika. Any ideas?
> > >
> > > Best,
> > > Jonathan
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
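For readers following the suggestion to remove parse-html from plugin.includes, here is a rough sketch of how that setting selects plugins. The plugin.includes property is a regular expression over plugin ids, so dropping `html` from the `parse-(...)` group should leave Tika as the HTML parser. The two property values and the list of plugin ids below are illustrative assumptions, not the poster's actual configuration, and the full-match behaviour is my understanding of how Nutch applies the pattern rather than something stated in this thread.

```python
import re

# Hypothetical plugin.includes values; compare against your own nutch-site.xml.
WITH_PARSE_HTML = (
    "protocol-http|urlfilter-regex|parse-(html|tika|metatags)"
    "|index-(basic|anchor|metadata)|indexer-solr"
)
TIKA_ONLY = (
    "protocol-http|urlfilter-regex|parse-(tika|metatags)"
    "|index-(basic|anchor|metadata)|indexer-solr"
)

# A few plugin ids mentioned in the thread, used here only for illustration.
PLUGIN_IDS = [
    "parse-html",
    "parse-tika",
    "parse-metatags",
    "index-metadata",
    "indexer-solr",
]


def included_plugins(plugin_includes, plugin_ids):
    """Return the plugin ids that fully match the plugin.includes regex."""
    pattern = re.compile(plugin_includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


if __name__ == "__main__":
    print("with parse-html:", included_plugins(WITH_PARSE_HTML, PLUGIN_IDS))
    print("tika only:      ", included_plugins(TIKA_ONLY, PLUGIN_IDS))
```

This only checks which plugins a given pattern would activate. It does not explain the duplicate metatag values reported above, which, per the thread, persist even after parse-html is removed.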

