Thank you, Timothy; the result is acceptable now.

2017-09-14 12:56 GMT+02:00 Allison, Timothy B. <[email protected]>:
> Ah, thank you. Have you tried the BoilerpipeContentHandler?
>
> It isn't perfect, but it tries to strip out boilerplate/navigation stuff.
>
> From: Francesco Viscomi [mailto:[email protected]]
> Sent: Thursday, September 14, 2017 4:49 AM
> To: [email protected]
> Subject: Re: possible a bug?
>
> Thanks very much.
>
> I've tried replacing the ToHTMLContentHandler with the ToTextContentHandler,
> but the result is (almost) the same.
>
> I want to retrieve only the useful text, meaning that I don't want these items:
>
> BBC navigation
> Home
> Home
> News
> News
> Sport
> Weather
>
> I want only the text inside the body; in other words, I want to eliminate all
> text related to the navigation bar.
>
> Thanks very much
>
> 2017-09-13 21:32 GMT+02:00 Allison, Timothy B. <[email protected]>:
>
> Hmmm…
>
> What are you expecting? What version of Tika are you using?
>
> With master, the parse works as expected. I get the links from the
> LinkContentHandler and the full content from the ToHTMLContentHandler.
>
> It is odd to escape the HTML, though; what is your goal?
>
> If you're trying to get just the text out, use the ToTextContentHandler
> instead of the ToHTMLContentHandler?
>
> From: Francesco Viscomi [mailto:[email protected]]
> Sent: Wednesday, September 13, 2017 12:39 PM
> To: [email protected]
> Subject: Fwd: possible a bug?
>
> ---------- Forwarded message ----------
> From: Francesco Viscomi <[email protected]>
> Date: 2017-09-13 18:37 GMT+02:00
> Subject: possible a bug?
> To: [email protected]
>
> Hi all,
>
> I'm trying to extract the content from a web page, and I found the following
> example on the internet:
>
> =======START CODE======
> String url = "http://www.bbc.com/news/uk-england-41255962";
>
> URL _url = new URL(url);
> InputStream input = _url.openStream();
>
> LinkContentHandler linkHandler = new LinkContentHandler();
> ContentHandler textHandler = new BodyContentHandler();
> ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
>
> TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);
>
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> HtmlParser parser = new HtmlParser();
>
> parser.parse(input, teeHandler, metadata, parseContext);
> content = (StringEscapeUtils.escapeHtml(textHandler.toString()));
> System.out.println("the content " + content);
> =======END CODE========
>
> But the output is useless, as i
>
> ===============START OUTPUT==================
> Accessibility links
> Skip to content
> Accessibility Help
>
> BBC iD
>
> Notifications
>
> BBC navigation
> Home
> Home
> News
> News
> Sport
> Weather
> Shop
> ==============END PART OF OUTPUT=============
>
> How can I understand why this happens, and how can I solve it? (For some
> other web pages, for example
> http://www.vogella.com/tutorials/AndroidTestingEspresso/article.html,
> it works fine.)
>
> Can you please help me?
>
> Thanks very much
>
> --
> Ing. Viscomi Francesco

--
Ing. Viscomi Francesco
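
For anyone who finds the boilerplate handler too coarse: the handlers in the code above are ordinary SAX content handlers, so another option is a small handler that ignores character data inside navigation elements. The following is only a minimal sketch using the JDK's own SAX parser, not Tika and not Boilerpipe. The class name NavSkippingHandler, the extract helper, and the set of skipped element names are all hypothetical, and it assumes well-formed XHTML input. Tika's HTML mapping may drop or rename elements before they reach a handler, so the element names would need to be checked against the actual SAX events.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects character data, ignoring anything nested inside "skipped" elements. */
final class NavSkippingHandler extends DefaultHandler {
    // Hypothetical choice of elements treated as navigation/boilerplate.
    private static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("nav", "header", "footer"));

    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // greater than 0 while inside a skipped element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Once inside a skipped element, count every nested element too.
        if (skipDepth > 0 || SKIPPED.contains(qName.toLowerCase(Locale.ROOT))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        // Collapse runs of whitespace so the result is easy to read.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Parses a well-formed XHTML string and returns the text outside skipped elements. */
    static String extract(String xhtml) throws Exception {
        NavSkippingHandler handler = new NavSkippingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<nav><ul><li>Home</li><li>News</li><li>Sport</li></ul></nav>"
                + "<p>Actual article text.</p>"
                + "</body></html>";
        System.out.println(extract(page)); // prints "Actual article text."
    }
}
```

In the code from the original message, a handler like this would take the place of textHandler in the call to parser.parse(...), again assuming the navigation markup actually arrives as nav/header/footer elements.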
