Hi - this is a known and open issue, but it has a patch: https://issues.apache.org/jira/browse/NUTCH-1749
-----Original message-----
> From: KRIS MUSSHORN <[email protected]>
> Sent: Tuesday 4th October 2016 16:53
> To: [email protected]
> Subject: parsing issue - content and title fields combined
>
> Nutch 1.12
> Solr 5.4.1
>
> I have a simple webpage that I am crawling with Nutch (attached).
>
> Nutch picks it up as application/xhtml according to the doc type definition.
>
> In parse-plugins I am specifically telling nutch to use parse-html.
>
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-html" />
>   <!-- <plugin id="parse-tika" /> -->
> </mimeType>
>
> I am using parse-(html|tika|metatags) to extract the description, keywords,
> and date into solr.
>
> this all works fine except....
>
> the content field in solr shows the title and the body text.
>
> I want just the body text in the contents field.
>
> Solr schema.xml does NOT perform any kind of copy into contents.
>
> Solr schema.xml defines content as:
>
> <field name="content" type="text" indexed="true" stored="true"
>   termVectors="true"/>
>
> I have attached the nutch dump and the parseText:: shows title and body.
>
> How do I get the result i need?
>
> I have tried using parse-tika, with boilerpipe-default/article/canola,
> instead of parse-html and parsing with Tika does not produce the desired
> result.
>
> In fact parsing with Tika produces duplicate entries in the metadata fields.
>
> TIA for assistance?

