RE: How to extract only body

Markus Jelsma Wed, 17 Feb 2016 11:34:59 -0800

Hello - by default, both parse-html and parse-tika emit only the title tag and 
the entire text content within the body. No HTML metadata is emitted as text. 
Do you perhaps mean that you want the body text without the menu, header/footer 
and other clutter/boilerplate text?
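
To make the title/body distinction concrete, here is a minimal standalone sketch 
in Python using only the standard library. It is not Nutch code and not how 
parse-html or parse-tika are implemented; the names TitleBodyExtractor and 
split_title_and_body are made up for illustration. It only shows what "title 
plus body text, no metadata" looks like as two separate fields:

```python
from html.parser import HTMLParser


class TitleBodyExtractor(HTMLParser):
    """Hypothetical helper: collect <title> text and <body> text separately."""

    SKIP = {"script", "style"}  # text inside these is never page content

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # <meta> tags carry their values in attributes, so they add no text here.
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(text)
        elif self._in_body:
            self.body_parts.append(text)


def split_title_and_body(html):
    """Return the title and the body text as two separate fields."""
    parser = TitleBodyExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title_parts),
        "body": " ".join(parser.body_parts),
    }
```

Note that this keeps every piece of text inside the body, including menus and 
footers; removing that kind of boilerplate is a separate problem.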


Markus
 
 
-----Original message-----
> From:Zara Parst <[email protected]>
> Sent: Wednesday 17th February 2016 19:34
> To: [email protected]; [email protected]
> Subject: How to extract only body
> 
> Hi everybody,
> 
> I am trying to build search for my own website. For that I am using Nutch
> and Solr.
> 
> The problem with Nutch is that the HTML parser seems to me to be a flat
> parser which concatenates everything
> 
> (title, meta tags, body) into one single field, content. That is not my
> desired search result. Is it possible to somehow separate the body part out
> of content, or is it possible to create a body field that will have only
> the <body></body> content of the HTML page?
> 
> thanks
>
