Thank you very much.
I've tried replacing the ToHTMLContentHandler with the ToTextContentHandler,
but the result is (almost) the same.
I want to retrieve only the useful text, meaning that I don't want these items:
BBC navigation
Home
Home
News
News
Sport
Weather
I want only the text inside the body; in other words, I want to eliminate
all text related to the navigation bar.
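
To make the goal concrete, here is a minimal sketch of the kind of filtering I mean. It is not Tika code: it uses only the JDK SAX classes, and the class name SkipElementsTextHandler and the set of skipped elements (nav, header, footer) are my own assumptions, chosen for illustration. It collects character data but ignores anything nested inside one of the skipped elements.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects character data, except text nested inside "skipped" elements. */
public class SkipElementsTextHandler extends DefaultHandler {
    // Hypothetical choice of elements to treat as navigation chrome.
    private static final Set<String> SKIPPED = Set.of("nav", "header", "footer");

    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // greater than 0 while inside a skipped element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Once inside a skipped element, every nested element deepens the skip.
        if (skipDepth > 0 || SKIPPED.contains(qName.toLowerCase(Locale.ROOT))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only when we are outside every skipped element.
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        // Collapse runs of whitespace so the result is easier to read.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Parses a well-formed XHTML string with the JDK SAX parser and returns the kept text. */
    public static String extract(String xhtml) throws Exception {
        SkipElementsTextHandler handler = new SkipElementsTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<nav><ul><li>Home</li><li>News</li><li>Sport</li></ul></nav>"
                + "<p>Article text.</p>"
                + "</body></html>";
        System.out.println(extract(page)); // prints "Article text."
    }
}
```

Since this is an ordinary SAX ContentHandler, something like it could presumably be passed to parser.parse(...) in place of textHandler. I have not checked whether HtmlParser actually forwards elements such as nav to the handler, so that part is an assumption.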
Thank you very much.
2017-09-13 21:32 GMT+02:00 Allison, Timothy B. <[email protected]>:
> Hmmm…
>
> What are you expecting? What version of Tika are you using?
>
> With master, the parse works as expected. I get the links from the
> LinkContentHandler and the full content from the ToHTMLContentHandler.
>
> It is odd to escape the html, though; what is your goal?
>
> If you’re trying to get just the text out, use the ToTextContentHandler
> instead of the ToHTMLContentHandler?
>
> *From:* Francesco Viscomi [mailto:[email protected]]
> *Sent:* Wednesday, September 13, 2017 12:39 PM
> *To:* [email protected]
> *Subject:* Fwd: possible a bug?
>
> ---------- Forwarded message ----------
> From: *Francesco Viscomi* <[email protected]>
> Date: 2017-09-13 18:37 GMT+02:00
> Subject: possible a bug?
> To: [email protected]
>
> Hi all,
>
> I'm trying to extract the content of a web page, and I found the following
> example on the internet:
>
> =======START CODE======
>
> String url = "http://www.bbc.com/news/uk-england-41255962";
>
> URL _url = new URL(url);
> InputStream input = _url.openStream();
>
> LinkContentHandler linkHandler = new LinkContentHandler();
> ContentHandler textHandler = new BodyContentHandler();
> ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
>
> TeeContentHandler teeHandler =
>     new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);
>
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> HtmlParser parser = new HtmlParser();
>
> parser.parse(input, teeHandler, metadata, parseContext);
>
> content = StringEscapeUtils.escapeHtml(textHandler.toString());
> System.out.println("il contenuto " + content);
>
> =======END CODE========
>
> But the output is useless, as i
>
> ===============START OUTPUT==================
>
> Accessibility links
> Skip to content
> Accessibility Help
>
> BBC iD
>
> Notifications
>
> BBC navigation
> Home
> Home
> News
> News
> Sport
> Weather
> Shop
>
> ==============END PART OF OUTPUT=============
>
> How can I understand why this happens, and how can I solve it? For some
> other web pages, for example
> http://www.vogella.com/tutorials/AndroidTestingEspresso/article.html,
> it works fine.
>
> Can you please help me?
>
> Thank you very much.
>
> --
>
> Ing. Viscomi Francesco
>
--
Ing. Viscomi Francesco