Re: parsing issue - content and title fields combined

Comcast Tue, 04 Oct 2016 11:32:51 -0700

I was not complaining.

Sent from my iPhone


> On Oct 4, 2016, at 2:29 PM, Markus Jelsma <[email protected]> wrote:
> 
> That doesn't mean a thing. If you need it, patch the sources and compile it 
> yourself.
> Markus
> 
> -----Original message-----
>> From:KRIS MUSSHORN <[email protected]>
>> Sent: Tuesday 4th October 2016 18:51
>> To: [email protected]
>> Subject: Re: parsing issue - content and title fields combined
>> 
>> This is slated for a fix in v1.13. 
>> Great. 
>> K 
>> 
>> ----- Original Message -----
>> 
>> From: "Markus Jelsma" <[email protected]> 
>> To: [email protected] 
>> Sent: Tuesday, October 4, 2016 12:34:33 PM 
>> Subject: RE: parsing issue - content and title fields combined 
>> 
>> Hi - this is a known and open issue, but it has a patch: 
>> https://issues.apache.org/jira/browse/NUTCH-1749 
>> 
>> 
>> 
>> -----Original message----- 
>>> From:KRIS MUSSHORN <[email protected]> 
>>> Sent: Tuesday 4th October 2016 16:53 
>>> To: [email protected] 
>>> Subject: parsing issue - content and title fields combined 
>>> 
>>> Nutch 1.12 
>>> Solr 5.4.1 
>>> 
>>> I have a simple webpage that I am crawling with Nutch (attached). 
>>> 
>>> Nutch picks it up as application/xhtml+xml according to the doc type 
>>> definition. 
>>> 
>>> In parse-plugins.xml I am specifically telling Nutch to use parse-html: 
>>> 
>>> <mimeType name="application/xhtml+xml"> 
>>> <plugin id="parse-html" /> 
>>> <!-- <plugin id="parse-tika" /> --> 
>>> </mimeType> 
>>> 
>>> I am using parse-(html|tika|metatags) to extract the description, keywords, 
>>> and date into Solr. 
>>> 
>>> This all works fine, except that the content field in Solr shows both the 
>>> title and the body text. 
>>> 
>>> I want just the body text in the content field. 
>>> 
>>> Solr schema.xml does NOT perform any kind of copy into content. 
>>> 
>>> Solr schema.xml defines content as: 
>>> 
>>> <field name="content" type="text" indexed="true" stored="true" 
>>> termVectors="true"/> 
>>> 
>>> I have attached the Nutch dump; its parseText:: section shows both the title 
>>> and the body. 
>>> 
>>> How do I get the result I need? 
>>> 
>>> I have tried using parse-tika, with boilerpipe-default/article/canola, 
>>> instead of parse-html, but parsing with Tika does not produce the desired 
>>> result either. 
>>> 
>>> In fact parsing with Tika produces duplicate entries in the metadata 
>>> fields. 
>>> 
>>> TIA for any assistance. 
>>> 
>>> 
>>> 
>> 
>>
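
Until a fix for NUTCH-1749 is available, one possible stopgap for the question in the original message is to strip a leading copy of the title from the extracted text before it is indexed. The following is only a minimal Python sketch under stated assumptions: it assumes the parsed text begins with the title verbatim (the thread does not show the dump, so this is a guess), and strip_leading_title is a hypothetical helper name, not part of Nutch or Solr. It is not the approach taken by the patch attached to the JIRA issue.

```python
def strip_leading_title(title: str, content: str) -> str:
    """Return content without a leading copy of title.

    Hypothetical helper for illustration only. Assumes the extracted
    text starts with the page title, which may not hold for every page.
    Whitespace runs in both inputs are collapsed to single spaces.
    """
    title_norm = " ".join(title.split())
    content_norm = " ".join(content.split())
    # Only strip when there is a non-empty title and the text starts with it.
    if title_norm and content_norm.startswith(title_norm):
        return content_norm[len(title_norm):].lstrip()
    return content_norm


if __name__ == "__main__":
    print(strip_leading_title("My Page", "My Page  Body text here"))
```

Note that this would also remove the title when the body legitimately opens with the same words, so it is a rough workaround rather than a real fix.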
