Ah, thank you. Have you tried the BoilerpipeContentHandler? It isn’t perfect, but it tries to strip out boilerplate/navigation stuff.
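For intuition about what a boilerplate filter of this kind does, here is a toy sketch in plain Java. It is not Boilerpipe's or Tika's actual implementation: the `LinkDensityFilter` and `Block` names, the two thresholds, and the sample article sentence are all made up for illustration. The idea is only that navigation bars tend to be short blocks in which nearly every word is a link, while article text tends to be longer with few links.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: NOT the Boilerpipe algorithm and NOT a Tika API.
class LinkDensityFilter {

    /** One block of page text plus the number of its words that sit inside links. */
    static final class Block {
        final String text;
        final int linkedWords;

        Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        int wordCount() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        /** Fraction of words that are link text; an empty block counts as all links. */
        double linkDensity() {
            int words = wordCount();
            return words == 0 ? 1.0 : (double) linkedWords / words;
        }
    }

    // Hypothetical thresholds chosen for this example, not tuned on real pages.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** Keeps blocks that are long enough and not dominated by link text. */
    static List<String> keepContent(List<Block> blocks) {
        List<String> kept = new ArrayList<>();
        for (Block block : blocks) {
            if (block.wordCount() >= MIN_WORDS
                    && block.linkDensity() <= MAX_LINK_DENSITY) {
                kept.add(block.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> page = new ArrayList<>();
        // Navigation text from the output quoted in this thread; link count assumed.
        page.add(new Block("BBC navigation Home Home News News Sport Weather Shop", 7));
        // Made-up stand-in for an article paragraph.
        page.add(new Block("The article body usually consists of longer sentences "
                + "with only an occasional link in the running text.", 1));

        for (String text : keepContent(page)) {
            System.out.println(text);
        }
    }
}
```

Running `main` prints only the stand-in article sentence; the navigation block is dropped because it is both too short and too link-heavy. A real filter works on the parsed page rather than hand-built blocks, which is why using the existing handler is preferable to rolling your own.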

From: Francesco Viscomi [mailto:[email protected]]
Sent: Thursday, September 14, 2017 4:49 AM
To: [email protected]
Subject: Re: possible a bug?

Thanks very much; I've tried replacing the toHTMLHandler with a ToTextContentHandler, but the result is (almost) the same. I want to retrieve only the useful text, meaning that I don't want these items:

BBC navigation Home Home News News Sport Weather

I want only the text inside the body; in other words, I want to eliminate all text related to the navigation bar.

Thanks very much

2017-09-13 21:32 GMT+02:00 Allison, Timothy B. <[email protected]>:

Hmmm… What are you expecting? What version of Tika are you using? With master, the parse works as expected. I get the links from the LinkContentHandler and the full content from the ToHTMLContentHandler.

It is odd to escape the HTML, though; what is your goal? If you’re trying to get just the text out, use the ToTextContentHandler instead of the ToHTMLContentHandler?

From: Francesco Viscomi [mailto:[email protected]]
Sent: Wednesday, September 13, 2017 12:39 PM
To: [email protected]
Subject: Fwd: possible a bug?

---------- Forwarded message ----------
From: Francesco Viscomi <[email protected]>
Date: 2017-09-13 18:37 GMT+02:00
Subject: possible a bug?
To: [email protected]

Hi all,

I'm trying to extract content from a web page, and I found the following example on the internet:

=======START CODE======
String url = "http://www.bbc.com/news/uk-england-41255962";
URL _url = new URL(url);
InputStream input = _url.openStream();

// One handler per kind of output: links, body text, and HTML
LinkContentHandler linkHandler = new LinkContentHandler();
ContentHandler textHandler = new BodyContentHandler();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);

Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
HtmlParser parser = new HtmlParser();
parser.parse(input, teeHandler, metadata, parseContext);

String content = StringEscapeUtils.escapeHtml(textHandler.toString());
System.out.println("il contenuto " + content); // "il contenuto" = "the content"
=======END CODE========

But the output is useless:

===============START OUTPUT==================
Accessibility links Skip to content Accessibility Help BBC iD Notifications BBC navigation Home Home News News Sport Weather Shop
==============END PART OF OUTPUT=============

How can I understand why this happens, and how can I solve it? For some other web pages (for example http://www.vogella.com/tutorials/AndroidTestingEspresso/article.html) it works fine. Can you please help me?

Thanks very much

--
Ing. Viscomi Francesco
