----- Original Message -----
From: "KRIS MUSSHORN" <[email protected]>
To: [email protected]
Sent: Tuesday, October 4, 2016 10:52:43 AM
Subject: parsing issue - content and title fields combined

Nutch 1.12
Solr 5.4.1

I have a simple webpage that I am crawling with Nutch (attached). Nutch picks it up as application/xhtml according to the doc type definition. In parse-plugins.xml I am specifically telling Nutch to use parse-html:

    <mimeType name="application/xhtml+xml">
      <plugin id="parse-html" />
      <!-- <plugin id="parse-tika" /> -->
    </mimeType>

I am using parse-(html|tika|metatags) to extract the description, keywords, and date into Solr. This all works fine, except that the content field in Solr shows both the title and the body text. I want just the body text in the content field.

Solr schema.xml does NOT perform any kind of copy into content. It defines content as:

    <field name="content" type="text" indexed="true" stored="true" termVectors="true"/>

I have attached the Nutch dump; its parseText:: section shows both the title and the body.

How do I get the result I need?

I have tried using parse-tika with boilerpipe (default/article/canola) instead of parse-html, but parsing with Tika does not produce the desired result. In fact, parsing with Tika produces duplicate entries in the metadata fields.

TIA for any assistance.

