Re: possible a bug?

Francesco Viscomi Tue, 19 Sep 2017 05:27:49 -0700

Thank you, Timothy.
The result is acceptable now.
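
For anyone finding this thread later, the general idea of a handler that drops navigation text while collecting everything else can be shown without Tika. The sketch below is not Boilerpipe, which scores text blocks heuristically. It is a hand-rolled SAX handler that ignores any text inside `<nav>` elements, using only JDK classes. The class name `NavSkippingTextHandler` and the `extract` helper are made up for illustration, and the input is assumed to be well-formed XHTML. Real pages such as the BBC one would need Tika's HtmlParser in front to clean up the markup first.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects character data, skipping anything inside <nav>.
class NavSkippingTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int navDepth = 0; // greater than 0 while inside one or more <nav> elements

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("nav".equalsIgnoreCase(qName)) {
            navDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("nav".equalsIgnoreCase(qName)) {
            navDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only when we are outside every <nav> element.
        if (navDepth == 0) {
            text.append(ch, start, length);
        }
    }

    /** Returns the collected text with runs of whitespace collapsed to one space. */
    public String getText() {
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Parses a well-formed XHTML string and returns its text minus <nav> content. */
    public static String extract(String xhtml) throws Exception {
        NavSkippingTextHandler handler = new NavSkippingTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<nav><a href=\"/\">Home</a> <a href=\"/news\">News</a></nav>"
                + "<p>Article text.</p>"
                + "</body></html>";
        System.out.println(extract(page)); // prints "Article text."
    }
}
```

A fixed tag rule like this only works when the page marks its navigation with `<nav>`. That is why a heuristic approach is the more practical choice for arbitrary sites.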

2017-09-14 12:56 GMT+02:00 Allison, Timothy B. <[email protected]>:


> Ah, thank you. Have you tried the BoilerpipeContentHandler?
>
>
>
> It isn’t perfect, but it tries to strip out boilerplate/navigation stuff.
>
>
>
> *From:* Francesco Viscomi [mailto:[email protected]]
> *Sent:* Thursday, September 14, 2017 4:49 AM
> *To:* [email protected]
> *Subject:* Re: possible a bug?
>
>
>
> Thank you very much.
>
> I tried replacing toHTMLHandler with a ToTextContentHandler, but the result
> is almost the same.
>
>
> I want to retrieve only the useful text, meaning that I don't want items like these:
> BBC navigation
>
>           Home
>
>         Home
>
>         News
>
>         News
>
>         Sport
>
>         Weather
>
>
>
> I want only the text inside the body; in other words, I want to eliminate all
> text related to the navigation bar.
>
>
>
> Thank you very much.
>
>
>
>
>
> 2017-09-13 21:32 GMT+02:00 Allison, Timothy B. <[email protected]>:
>
> Hmmm…
>
>
>
> What are you expecting?  What version of Tika are you using?
>
>
>
> With master, the parse works as expected.  I get the links from the
> LinkContentHandler and the full content from the ToHTMLContentHandler.
>
>
>
>
>
> It is odd to escape the html, though; what is your goal?
>
>
>
>
>
> If you’re trying to get just the text out, why not use the ToTextContentHandler
> instead of the ToHTMLContentHandler?
>
>
>
>
>
> *From:* Francesco Viscomi [mailto:[email protected]]
> *Sent:* Wednesday, September 13, 2017 12:39 PM
> *To:* [email protected]
> *Subject:* Fwd: possible a bug?
>
>
>
>
>
>
> ---------- Forwarded message ----------
> From: *Francesco Viscomi* <[email protected]>
> Date: 2017-09-13 18:37 GMT+02:00
> Subject: possible a bug?
> To: [email protected]
>
> Hi all,
>
> I'm trying to extract the content of a web page, and I found the following
> example on the internet:
>
> =======START CODE======
>
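> import java.io.InputStream;
> import java.net.URL;
> import org.apache.commons.lang.StringEscapeUtils;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.html.HtmlParser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.apache.tika.sax.LinkContentHandler;
> import org.apache.tika.sax.TeeContentHandler;
> import org.apache.tika.sax.ToHTMLContentHandler;
> import org.xml.sax.ContentHandler;
>
> String url = "http://www.bbc.com/news/uk-england-41255962";
>
> URL _url = new URL(url);
> InputStream input = _url.openStream();
>
> // Collect links, body text, and HTML from a single parse.
> LinkContentHandler linkHandler = new LinkContentHandler();
> ContentHandler textHandler = new BodyContentHandler();
> ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
>
> TeeContentHandler teeHandler =
>         new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);
>
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> HtmlParser parser = new HtmlParser();
>
> parser.parse(input, teeHandler, metadata, parseContext);
>
> // Print the HTML-escaped body text.
> String content = StringEscapeUtils.escapeHtml(textHandler.toString());
> System.out.println("il contenuto   " + content);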
>
> =======END CODE========
>
> But the output is not useful, as shown here:
>
>
>
>
>
> ===============START OUTPUT==================
>
>  Accessibility links
>
>          Skip to content
>
>         Accessibility Help
>
>
>
>
>
>       BBC iD
>
>
>
>
>
>
>
>         Notifications
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>     BBC navigation
>
>           Home
>
>         Home
>
>         News
>
>         News
>
>         Sport
>
>         Weather
>
>         Shop
>
> ==============END PART OF OUTPUT=============
>
>
>
>
>
>
> How can I understand why this happens, and how can I solve it? For some other
> web pages, for example
> http://www.vogella.com/tutorials/AndroidTestingEspresso/article.html, it works well.
>
>
>
>
>
> Can you please help me?
>
> Thank you very much.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> --
>
> Ing. Viscomi Francesco



-- 
Ing. Viscomi Francesco
