I was not complaining

Sent from my iPhone

> On Oct 4, 2016, at 2:29 PM, Markus Jelsma <[email protected]> wrote:
>
> That doesn't mean a thing. If you need it, patch the sources and compile it
> yourself.
> Markuss
>
> -----Original message-----
>> From:KRIS MUSSHORN <[email protected]>
>> Sent: Tuesday 4th October 2016 18:51
>> To: [email protected]
>> Subject: Re: parsing issue - content and title fields combined
>>
>> this is slated for fix in v1.13.
>> Great.
>> K
>>
>> ----- Original Message -----
>>
>> From: "Markus Jelsma" <[email protected]>
>> To: [email protected]
>> Sent: Tuesday, October 4, 2016 12:34:33 PM
>> Subject: RE: parsing issue - content and title fields combined
>>
>> Hi - this is a known and open issue, but it has a patch:
>> https://issues.apache.org/jira/browse/NUTCH-1749
>>
>>
>>
>> -----Original message-----
>>> From:KRIS MUSSHORN <[email protected]>
>>> Sent: Tuesday 4th October 2016 16:53
>>> To: [email protected]
>>> Subject: parsing issue - content and title fields combined
>>>
>>> Nutch 1.12
>>> Solr 5.4.1
>>>
>>> I have a simple webpage that I am crawling with Nutch (attached).
>>>
>>> Nutch picks it up as application/xhtml according to the doc type
>>> definition.
>>>
>>> In parse-plugins I am specifically telling nutch to use parse-html.
>>>
>>> <mimeType name="application/xhtml+xml">
>>> <plugin id="parse-html" />
>>> <!-- <plugin id="parse-tika" /> -->
>>> </mimeType>
>>> I am using parse-(html|tika|metatags) to extract the description, keywords,
>>> and date into solr.
>>>
>>> this all works fine except....
>>>
>>> the content field in solr shows the title and the body text.
>>>
>>> I want just the body text in the contents field.
>>>
>>> Solr schema.xml does NOT perform any kind of copy into contents.
>>>
>>> Solr schema.xml defines content as:
>>>
>>> <field name="content" type="text" indexed="true" stored="true"
>>> termVectors="true"/>
>>> I have attached the nutch dump and the parseText:: shows title and body.
>>>
>>> How do I get the result i need?
>>>
>>> I have tried using parse-tika, with boilerpipe-default/article/canola,
>>> instead of parse-html and parsing with Tika does not produce the desired
>>> result.
>>>
>>> In fact parsing with Tika produces duplicate entries in the metadata
>>> fields.
>>>
>>> TIA for assistance?
>>>
>>>
>>>
>>
>>
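
For anyone who cannot patch and rebuild right away, one possible stopgap is to strip the title from the content outside of Nutch, for example in a script that post-processes documents before or after they reach Solr. The sketch below is not part of Nutch or Solr. `strip_leading_title` is a hypothetical helper, and it assumes the extracted text begins with the page title verbatim, which should be checked against the actual dump first.

```python
def strip_leading_title(title: str, content: str) -> str:
    """Return content without a leading copy of title (hypothetical helper).

    Assumption: the extracted text starts with the page title. If it
    does not, the content is returned unchanged apart from having its
    surrounding whitespace trimmed.
    """
    title = title.strip()
    body = content.strip()
    if title and body.startswith(title):
        # Drop the title and any whitespace separating it from the body.
        return body[len(title):].lstrip()
    return body


if __name__ == "__main__":
    # Illustrative values only, not taken from the attached dump.
    print(strip_leading_title("My Page", "My Page This is the body text."))
```

This only handles the case where the title is a plain prefix of the content. It does nothing about the duplicate metadata entries seen with parse-tika, and it is no substitute for the patch attached to NUTCH-1749.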

