Ah, thank you. Have you tried the BoilerpipeContentHandler?

It isn’t perfect, but it tries to strip out boilerplate/navigation stuff.
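
For cases where the navigation elements are known in advance, the idea of dropping their text can be sketched with the JDK's SAX API alone. This is not Boilerpipe's algorithm and not a Tika class: NavSkippingTextHandler, its extract method, and the list of skipped element names are all hypothetical, and the sample page is made up. The handler simply ignores character events while it is inside one of the skipped elements.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects text, ignoring anything inside skipped elements. */
public class NavSkippingTextHandler extends DefaultHandler {
    private final Set<String> skipped;                      // element names whose text is dropped
    private final StringBuilder text = new StringBuilder(); // text seen outside skipped elements
    private int skipDepth = 0;                              // > 0 while inside a skipped element

    public NavSkippingTextHandler(Set<String> skipped) {
        this.skipped = skipped;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Enter (or go deeper into) a skipped subtree.
        if (skipDepth > 0 || skipped.contains(qName)) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Leave one level of a skipped subtree.
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        // Collapse runs of whitespace so the result is easy to read.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Parses well-formed XHTML and returns the text outside the skipped elements. */
    public static String extract(String xhtml, Set<String> skipped) throws Exception {
        NavSkippingTextHandler handler = new NavSkippingTextHandler(skipped);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<nav><a href=\"/\">Home</a> <a href=\"/news\">News</a></nav>"
                + "<p>Article text.</p>"
                + "</body></html>";
        System.out.println(extract(page, Set.of("nav", "header", "footer")));
    }
}
```

Because the class is an ordinary SAX ContentHandler, a handler of this shape could in principle be handed to a parser that reports XHTML SAX events. Which element names actually reach the handler depends on the parser's HTML mapping, so the skipped-element list should be treated as an assumption to verify against real pages.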

From: Francesco Viscomi [mailto:[email protected]]
Sent: Thursday, September 14, 2017 4:49 AM
To: [email protected]
Subject: Re: possible a bug?

Thanks very much.
I've tried replacing the toHTMLHandler with a ToTextContentHandler, but the
result is (almost) the same.

I want to retrieve only the useful text, meaning that I don't want items like these:
BBC navigation
          Home
        Home
        News
        News
        Sport
        Weather

I only want the text inside the body; in other words, I want to eliminate all
text related to the navigation bar.

Thanks very much.


2017-09-13 21:32 GMT+02:00 Allison, Timothy B. <[email protected]>:
Hmmm…

What are you expecting?  What version of Tika are you using?

With master, the parse works as expected.  I get the links from the
LinkContentHandler and the full content from the ToHTMLContentHandler.


It is odd to escape the html, though; what is your goal?


If you’re trying to get just the text out, how about using the
ToTextContentHandler instead of the ToHTMLContentHandler?


From: Francesco Viscomi [mailto:[email protected]]
Sent: Wednesday, September 13, 2017 12:39 PM
To: [email protected]
Subject: Fwd: possible a bug?



---------- Forwarded message ----------
From: Francesco Viscomi <[email protected]>
Date: 2017-09-13 18:37 GMT+02:00
Subject: possible a bug?
To: [email protected]
Hi all,
I'm trying to extract the content of a web page, and I found the following
example on the internet:
=======START CODE======
String url = "http://www.bbc.com/news/uk-england-41255962";

// Open the page; try-with-resources closes the stream after parsing.
try (InputStream input = new URL(url).openStream()) {
    // One handler per output: links, plain body text, and serialized HTML.
    LinkContentHandler linkHandler = new LinkContentHandler();
    ContentHandler textHandler = new BodyContentHandler();
    ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();

    // Fan the parser's SAX events out to all three handlers.
    TeeContentHandler teeHandler =
            new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);

    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    HtmlParser parser = new HtmlParser();

    parser.parse(input, teeHandler, metadata, parseContext);

    // Only the BodyContentHandler text is printed, HTML-escaped.
    String content = StringEscapeUtils.escapeHtml(textHandler.toString());
    // "il contenuto" is Italian for "the content".
    System.out.println("il contenuto   " + content);
}
=======END CODE========
But the output is useless, as i


===============START OUTPUT==================
 Accessibility links
         Skip to content
        Accessibility Help


      BBC iD



        Notifications







    BBC navigation
          Home
        Home
        News
        News
        Sport
        Weather
        Shop
==============END PART OF OUTPUT=============



How can I understand why this happens, and how can I solve it? For some other
web pages, for example
http://www.vogella.com/tutorials/AndroidTestingEspresso/article.html, it works
fine.


Can you please help me?
Thanks very much.







--
Ing. Viscomi Francesco


