RE: parsing issue - content and title fields combined

Markus Jelsma Tue, 04 Oct 2016 11:29:56 -0700
That doesn't mean a thing. If you need it, patch the sources and compile it 
yourself.
Markus
 
-----Original message-----
> From:KRIS MUSSHORN <[email protected]>
> Sent: Tuesday 4th October 2016 18:51
> To: [email protected]
> Subject: Re: parsing issue - content and title fields combined
> 
> This is slated for a fix in v1.13. 
> Great. 
> K 
> 
> ----- Original Message -----
> 
> From: "Markus Jelsma" <[email protected]> 
> To: [email protected] 
> Sent: Tuesday, October 4, 2016 12:34:33 PM 
> Subject: RE: parsing issue - content and title fields combined 
> 
> Hi - this is a known and open issue, but it has a patch: 
> https://issues.apache.org/jira/browse/NUTCH-1749 
> 
> 
> 
> -----Original message----- 
> > From:KRIS MUSSHORN <[email protected]> 
> > Sent: Tuesday 4th October 2016 16:53 
> > To: [email protected] 
> > Subject: parsing issue - content and title fields combined 
> > 
> > Nutch 1.12 
> > Solr 5.4.1 
> > 
> > I have a simple webpage that I am crawling with Nutch (attached). 
> > 
> > Nutch picks it up as application/xhtml+xml according to the doc type 
> > definition. 
> > 
> > In parse-plugins I am specifically telling Nutch to use parse-html. 
> > 
> > <mimeType name="application/xhtml+xml"> 
> >   <plugin id="parse-html" /> 
> >   <!-- <plugin id="parse-tika" /> --> 
> > </mimeType> 
> > 
> > I am using parse-(html|tika|metatags) to extract the description, keywords, 
> > and date into Solr. 
> > 
> > This all works fine, except that the content field in Solr shows both the 
> > title and the body text. 
> > 
> > I want just the body text in the content field. 
> > 
> > Solr schema.xml does NOT perform any kind of copy into content. 
> > 
> > Solr schema.xml defines content as: 
> > 
> > <field name="content" type="text" indexed="true" stored="true" 
> >        termVectors="true"/> 
> > 
> > I have attached the nutch dump and the parseText:: shows title and body. 
> > 
> > How do I get the result I need? 
> > 
> > I have tried using parse-tika, with boilerpipe-default/article/canola, 
> > instead of parse-html, but parsing with Tika does not produce the desired 
> > result either. 
> > 
> > In fact parsing with Tika produces duplicate entries in the metadata 
> > fields. 
> > 
> > TIA for any assistance. 
> > 
> > 
> > 
> 
>
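
For anyone who cannot patch and rebuild right away, here is a minimal sketch of a stopgap: strip the title from the front of the extracted text before it is indexed. This is not the NUTCH-1749 patch and does not use any Nutch API. The class name TitleStripper and the method stripLeadingTitle are hypothetical, and the sketch assumes the title appears verbatim at the very start of the parsed text, as the parseText:: dump described above suggests. Where you would call it (for example from a custom indexing filter) is left open.

```java
// Hypothetical helper, not part of Nutch: removes a leading copy of the
// page title from the extracted text.
public final class TitleStripper {

    private TitleStripper() {
        // static helper only
    }

    /**
     * Returns the content with a leading copy of the title removed.
     * The title is only stripped when it is followed by whitespace or by
     * the end of the text, so a title "My" does not eat into "Myself ...".
     * The result is always trimmed; null content yields null.
     */
    public static String stripLeadingTitle(String title, String content) {
        if (content == null) {
            return null;
        }
        String c = content.trim();
        if (title == null) {
            return c;
        }
        String t = title.trim();
        if (t.isEmpty() || !c.startsWith(t)) {
            return c;
        }
        boolean atEnd = c.length() == t.length();
        if (!atEnd && !Character.isWhitespace(c.charAt(t.length()))) {
            return c; // prefix matches only part of a longer word
        }
        return c.substring(t.length()).trim();
    }

    public static void main(String[] args) {
        // prints "Body text here."
        System.out.println(stripLeadingTitle("My Page", "My Page Body text here."));
    }
}
```

If the parser normalizes whitespace in the title differently from the body text, the prefix check will not match and the text is returned unchanged, so this fails safe rather than dropping body content.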