RE: parsing issue - content and title fields combined

Markus Jelsma Tue, 04 Oct 2016 11:38:48 -0700

No, I don't mean to suggest that at all. All I meant was that the Fix Version
field never guarantees that an issue will actually be fixed in that version. In
some extreme cases, it can take a long time.


This is not a difficult issue, so it might be fixed for 1.13. But committers can
also forget issues.

Cheers
Markus

 
 
-----Original message-----
> From:Comcast <[email protected]>
> Sent: Tuesday 4th October 2016 20:32
> To: [email protected]
> Subject: Re: parsing issue - content and title fields combined
> 
> I was not complaining.
> 
> > On Oct 4, 2016, at 2:29 PM, Markus Jelsma <[email protected]> 
> > wrote:
> > 
> > That doesn't mean a thing. If you need it, patch the sources and compile it 
> > yourself.
> > Markus
> > 
> > -----Original message-----
> >> From:KRIS MUSSHORN <[email protected]>
> >> Sent: Tuesday 4th October 2016 18:51
> >> To: [email protected]
> >> Subject: Re: parsing issue - content and title fields combined
> >> 
> >> This is slated for a fix in v1.13.
> >> Great. 
> >> K 
> >> 
> >> ----- Original Message -----
> >> 
> >> From: "Markus Jelsma" <[email protected]> 
> >> To: [email protected] 
> >> Sent: Tuesday, October 4, 2016 12:34:33 PM 
> >> Subject: RE: parsing issue - content and title fields combined 
> >> 
> >> Hi - this is a known and open issue, but it has a patch: 
> >> https://issues.apache.org/jira/browse/NUTCH-1749 
> >> 
> >> 
> >> 
> >> -----Original message----- 
> >>> From:KRIS MUSSHORN <[email protected]> 
> >>> Sent: Tuesday 4th October 2016 16:53 
> >>> To: [email protected] 
> >>> Subject: parsing issue - content and title fields combined 
> >>> 
> >>> Nutch 1.12 
> >>> Solr 5.4.1 
> >>> 
> >>> I have a simple webpage that I am crawling with Nutch (attached). 
> >>> 
> >>> Nutch picks it up as application/xhtml+xml according to the doc type
> >>> definition.
> >>> 
> >>> In parse-plugins.xml I am specifically telling Nutch to use parse-html:
> >>>
> >>> <mimeType name="application/xhtml+xml">
> >>>   <plugin id="parse-html" />
> >>>   <!-- <plugin id="parse-tika" /> -->
> >>> </mimeType>
> >>>
> >>> I am using parse-(html|tika|metatags) to extract the description,
> >>> keywords, and date into Solr.
> >>> 
> >>> This all works fine, except that the content field in Solr shows both
> >>> the title and the body text.
> >>>
> >>> I want just the body text in the content field.
> >>>
> >>> Solr schema.xml does NOT perform any kind of copy into content.
> >>> 
> >>> Solr schema.xml defines content as:
> >>>
> >>> <field name="content" type="text" indexed="true" stored="true"
> >>>        termVectors="true"/>
> >>>
> >>> I have attached the Nutch dump, and its parseText:: section shows both
> >>> the title and the body.
> >>> 
> >>> How do I get the result I need?
> >>> 
> >>> I have tried using parse-tika, with boilerpipe-default/article/canola,
> >>> instead of parse-html, but parsing with Tika does not produce the
> >>> desired result either.
> >>> 
> >>> In fact parsing with Tika produces duplicate entries in the metadata 
> >>> fields. 
> >>> 
> >>> TIA for any assistance.
> >>> 
> >>> 
> >>> 
> >> 
> >> 
> 
>
