Hi all,

I'm trying to extract the content of a web page, and I found the following
example on the internet:

=======START CODE======

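// Fragment: the statements below belong inside a method that throws Exception.
import java.io.InputStream;
import java.net.URL;
import org.apache.commons.lang.StringEscapeUtils;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;

String url = "http://www.bbc.com/news/uk-england-41255962";

URL _url = new URL(url);
InputStream input = _url.openStream();

// Collect links, plain body text and an HTML rendering in a single parse.
LinkContentHandler linkHandler = new LinkContentHandler();
ContentHandler textHandler = new BodyContentHandler();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();

TeeContentHandler teeHandler =
        new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);

Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
HtmlParser parser = new HtmlParser();

parser.parse(input, teeHandler, metadata, parseContext);

// textHandler now holds the text found inside <body>.
String content = StringEscapeUtils.escapeHtml(textHandler.toString());

// "il contenuto" is Italian for "the content".
System.out.println("il contenuto   " + content);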
=======END CODE========


But the output is useless, as it looks like this:


===============START OUTPUT==================

 Accessibility links

         Skip to content

        Accessibility Help





      BBC iD







        Notifications















    BBC navigation

          Home

        Home

        News

        News

        Sport

        Weather

        Shop
==============END PART OF OUTPUT=============



How can I understand why this happens, and how can I solve it (also for
other web pages, for example
http://www.vogella.com/tutorials/AndroidTestingEspresso/article.html)?
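
For what it's worth, one workaround I can imagine is to post-filter the
extracted text myself. This is only a rough sketch that uses plain Java and
no Tika API. It assumes that menu entries such as "Home" or "Skip to content"
are short lines, while article paragraphs are long ones. The class name
LongLineFilter, the method keepLongLines and the threshold of 5 words are my
own hypothetical choices, not something from the example above.

```java
import java.util.ArrayList;
import java.util.List;

public class LongLineFilter {

    // Hypothetical helper: keep only the lines that contain at least
    // minWords whitespace-separated words, dropping blank and short lines.
    static List<String> keepLongLines(String text, int minWords) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip the blank lines between blocks
            }
            if (trimmed.split("\\s+").length >= minWords) {
                kept.add(trimmed);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample that imitates the shape of the output above.
        String sample = "  Skip to content\n\n  Home\n  News\n"
                + "This is a longer sentence that looks like real article text.\n";
        for (String line : keepLongLines(sample, 5)) {
            System.out.println(line);
        }
    }
}
```

In my case I would call keepLongLines(textHandler.toString(), 5) after the
parse. I expect this to be fragile (short headlines would be dropped too), so
a proper solution would still be welcome.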







-- 
Ing. Viscomi Francesco
