Re: parsing issue - content and title fields combined

KRIS MUSSHORN Tue, 04 Oct 2016 07:54:55 -0700


----- Original Message -----

From: "KRIS MUSSHORN" <[email protected]> 
To: [email protected] 
Sent: Tuesday, October 4, 2016 10:52:43 AM 
Subject: parsing issue - content and title fields combined 

Nutch 1.12 
Solr 5.4.1 

I have a simple webpage that I am crawling with Nutch (attached). 

Nutch detects it as application/xhtml+xml, based on the document type definition. 

In parse-plugins.xml I explicitly tell Nutch to use parse-html: 

<mimeType name="application/xhtml+xml"> 
<plugin id="parse-html" /> 
<!-- <plugin id="parse-tika" /> --> 
</mimeType> 

I am using parse-(html|tika|metatags) to extract the description, keywords, and 
date into Solr. 
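
For context, a plugin selection like this usually lives in the plugin.includes property of conf/nutch-site.xml. A minimal sketch follows; the full plugin list and the metatag names are assumptions for illustration, not the actual configuration from this crawl, and the date field is left out because its source is not stated here:

```xml
<!-- conf/nutch-site.xml (sketch; values are assumed, not copied from the real setup) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <!-- meta tags that parse-metatags should collect -->
  <name>metatags.names</name>
  <value>description,keywords</value>
</property>
<property>
  <!-- parse metadata that index-metadata should pass on to Solr -->
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
</property>
```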

This all works fine, except... 

The content field in Solr contains both the title and the body text. 

I want just the body text in the content field. 

Solr schema.xml does NOT perform any kind of copy into content. 
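
For comparison, a copy directive that could cause this symptom on the Solr side would look roughly like the hypothetical line below. Nothing like it is present in the schema described here, which suggests the title is already in the text by the time Nutch sends the document:

```xml
<!-- Hypothetical example only: NOT present in the schema.xml in question -->
<copyField source="title" dest="content"/>
```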

Solr schema.xml defines content as: 

<field name="content" type="text" indexed="true" stored="true" termVectors="true"/> 

I have attached the Nutch dump; its ParseText:: section shows both the title and the body. 

How do I get the result I need? 

I have tried using parse-tika with boilerpipe (default/article/canola) instead 
of parse-html, but parsing with Tika does not produce the desired result either. 
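
The boilerpipe variants are typically selected through two parse-tika properties in conf/nutch-site.xml. A sketch, assuming the property names documented in nutch-default.xml; the algorithm shown is just one of the three that were tried:

```xml
<!-- conf/nutch-site.xml (sketch; assumes the stock parse-tika boilerpipe properties) -->
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- one of DefaultExtractor, ArticleExtractor, CanolaExtractor -->
  <value>ArticleExtractor</value>
</property>
```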

In fact, parsing with Tika produces duplicate entries in the metadata fields. 

TIA for any assistance.
