RE: parsing issue - content and title fields combined

Markus Jelsma Tue, 04 Oct 2016 09:35:28 -0700

Hi - this is a known and open issue, but it has a patch:
https://issues.apache.org/jira/browse/NUTCH-1749
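
Until the patch is applied, one possible stopgap is to strip the title from the extracted text before it is indexed. The snippet below is only a minimal sketch of that idea, not part of Nutch or of the NUTCH-1749 patch. It assumes the parse text starts with the page title, and the function name strip_leading_title is hypothetical.

```python
def strip_leading_title(title: str, content: str) -> str:
    """Return content without a leading copy of title, if one is present.

    Assumption: the extracted text begins with the title, followed by
    the body text. If it does not, the content is returned unchanged.
    """
    title = title.strip()
    text = content.lstrip()
    if title and text.startswith(title):
        # Drop the title and any whitespace separating it from the body.
        return text[len(title):].lstrip()
    return content


if __name__ == "__main__":
    print(strip_leading_title("My Page", "My Page Body text of the page."))
```

This only handles a title at the very start of the text; a title repeated elsewhere in the body is left alone on purpose.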

-----Original message-----
> From:KRIS MUSSHORN <[email protected]>
> Sent: Tuesday 4th October 2016 16:53
> To: [email protected]
> Subject: parsing issue - content and title fields combined
> 
> Nutch 1.12 
> Solr 5.4.1
> 
> I have a simple webpage that I am crawling with Nutch (attached).
> 
> Nutch picks it up as application/xhtml+xml according to the doc type definition.
> 
> In parse-plugins.xml I am specifically telling Nutch to use parse-html.
> 
> <mimeType name="application/xhtml+xml">
>         <plugin id="parse-html" />
>         <!-- <plugin id="parse-tika" /> -->
> </mimeType>
> 
> I am using parse-(html|tika|metatags) to extract the description, keywords, 
> and date into Solr.
> 
> This all works fine, except...
> 
> The content field in Solr shows the title and the body text.
> 
> I want just the body text in the content field.
> 
> Solr schema.xml does NOT perform any kind of copy into content.
> 
> Solr schema.xml defines content as: 
> 
> <field name="content" type="text" indexed="true" stored="true" termVectors="true"/>
> 
> I have attached the Nutch dump, and its parseText:: section shows both the 
> title and the body.
> 
> How do I get the result I need?
> 
> I have tried using parse-tika, with boilerpipe-default/article/canola, 
> instead of parse-html, but parsing with Tika does not produce the desired 
> result either.
> 
> In fact, parsing with Tika produces duplicate entries in the metadata fields.
> 
> TIA for assistance.
> 
> 
>
