No, I don't mean to suggest that at all. All I meant was that the fix version never guarantees that an issue will actually be fixed in that version. In some extreme cases, it can take a long time.
This is not a difficult issue, so it might be fixed for 1.13. But committers can also forget issues.

Cheers,
Markus

-----Original message-----
> From:Comcast <[email protected]>
> Sent: Tuesday 4th October 2016 20:32
> To: [email protected]
> Subject: Re: parsing issue - content and title fields combined
>
> I was not complaining
>
> Sent from my iPhone
>
> > On Oct 4, 2016, at 2:29 PM, Markus Jelsma <[email protected]>
> > wrote:
> >
> > That doesn't mean a thing. If you need it, patch the sources and compile it
> > yourself.
> > Markuss
> >
> > -----Original message-----
> >> From:KRIS MUSSHORN <[email protected]>
> >> Sent: Tuesday 4th October 2016 18:51
> >> To: [email protected]
> >> Subject: Re: parsing issue - content and title fields combined
> >>
> >> this is slated for fix in v1.13.
> >> Great.
> >> K
> >>
> >> ----- Original Message -----
> >>
> >> From: "Markus Jelsma" <[email protected]>
> >> To: [email protected]
> >> Sent: Tuesday, October 4, 2016 12:34:33 PM
> >> Subject: RE: parsing issue - content and title fields combined
> >>
> >> Hi - this is a known and open issue, but it has a patch:
> >> https://issues.apache.org/jira/browse/NUTCH-1749
> >>
> >>
> >>
> >> -----Original message-----
> >>> From:KRIS MUSSHORN <[email protected]>
> >>> Sent: Tuesday 4th October 2016 16:53
> >>> To: [email protected]
> >>> Subject: parsing issue - content and title fields combined
> >>>
> >>> Nutch 1.12
> >>> Solr 5.4.1
> >>>
> >>> I have a simple webpage that I am crawling with Nutch (attached).
> >>>
> >>> Nutch picks it up as application/xhtml according to the doc type
> >>> definition.
> >>>
> >>> In parse-plugins I am specifically telling nutch to use parse-html.
> >>>
> >>> <mimeType name="application/xhtml+xml">
> >>> <plugin id="parse-html" />
> >>> <!-- <plugin id="parse-tika" /> -->
> >>> </mimeType>
> >>> I am using parse-(html|tika|metatags) to extract the description,
> >>> keywords, and date into solr.
> >>>
> >>> this all works fine except....
> >>>
> >>> the content field in solr shows the title and the body text.
> >>>
> >>> I want just the body text in the contents field.
> >>>
> >>> Solr schema.xml does NOT perform any kind of copy into contents.
> >>>
> >>> Solr schema.xml defines content as:
> >>>
> >>> <field name="content" type="text" indexed="true" stored="true"
> >>> termVectors="true"/>
> >>> I have attached the nutch dump and the parseText:: shows title and body.
> >>>
> >>> How do I get the result i need?
> >>>
> >>> I have tried using parse-tika, with boilerpipe-default/article/canola,
> >>> instead of parse-html and parsing with Tika does not produce the desired
> >>> result.
> >>>
> >>> In fact parsing with Tika produces duplicate entries in the metadata
> >>> fields.
> >>>
> >>> TIA for assistance?
> >>>
> >>>
> >>>
> >>
> >>
> >

