Hello - by default, both parse-html and parse-tika emit only the title tag and
the entire text content within the body. No HTML metadata is emitted as text.
Do you mean that you want the body text without the menu, header/footer and
other clutter/boilerplate text?
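
To illustrate the behaviour described above, here is a minimal sketch in Python. It is not Nutch's parser code, only a stand-in built on the standard library's html.parser; the class name TitleAndBodyText and the function extract are hypothetical names chosen for this example. It keeps the title and the visible body text, and ignores meta tags because their data lives in attributes rather than in text nodes.

```python
from html.parser import HTMLParser


class TitleAndBodyText(HTMLParser):
    """Collects the <title> text and the visible text inside <body>.

    Illustration only: this is an assumption-laden sketch, not Nutch code.
    """

    SKIP = {"script", "style"}  # elements whose text is not visible content

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in self.SKIP:
            self._skip_depth += 1
        # <meta> tags carry their data in attributes, so they add no text.

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(text)
        elif self._in_body:
            self.body_parts.append(text)


def extract(html):
    """Return (title, body_text) for an HTML string."""
    parser = TitleAndBodyText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.body_parts)
```

Note that menu or footer text inside the body still ends up in the body text with this approach, which is why removing it would need a separate boilerplate-removal step.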

Markus
 
 
-----Original message-----
> From:Zara Parst <edotserv...@gmail.com>
> Sent: Wednesday 17th February 2016 19:34
> To: user@nutch.apache.org; d...@nutch.apache.org
> Subject: How to extract only body
> 
> Hi everybody,
> 
> I am trying to build search for my own website. For that I am using Nutch
> and Solr.
> 
> The problem with Nutch is that the HTML parser seems to me to be a flat
> parser which concatenates everything (title, meta tags, body) into one
> single content field, which is not my desired search result. Is it possible
> to somehow separate the body part out of the content, or is it possible to
> create a body field that will have only the <body></body> content of the
> HTML page?
> 
> thanks
> 
