Please, please let us know of any other problems you find! And thank
you again.

On Tue, Jan 2, 2024 at 8:58 AM Tim Allison <[email protected]> wrote:
>
> yakivy opened a PR for exactly this: https://github.com/apache/tika/pull/1522
>
> Once the checks pass, I'll merge that, and we should be good to go.
> Thank you so much for letting us know of this bug.
>
> On Mon, Dec 18, 2023 at 7:57 AM Andreas Hubold
> <[email protected]> wrote:
> >
> > Hi,
> >
> > I'm currently testing the upgrade to 3.0.0-BETA and one of our test
> > cases failed because of changed behavior when extracting text from HTML.
> > That's probably related to the switch from TagSoup to JSoup (TIKA-1599).
> >
> > The test uses really old but real-world HTML, which contains a script
> > tag at the start of the body. With 3.0.0-BETA, the body text below the
> > script tag is not returned anymore.
> > I don't think that's a blocker for us, but I just wanted to tell you. I
> > also don't know how common such HTML actually is.
> >
> > A reduced example file:
> >
> > <html>
> > <body>
> > <script type="text/javascript">alert("A");</script>
> > Hello World
> > </body>
> > </html>
> >
> > If I pass that to the Tika app, I get the text "Hello World" back with
> > 2.9.1, but not with 3.0.0-BETA:
> >
> >   $ java -jar tika-app-3.0.0-BETA.jar test.html
> > <?xml version="1.0" encoding="UTF-8"?><html
> > xmlns="http://www.w3.org/1999/xhtml">
> > <head>
> > <meta name="X-TIKA:Parsed-By"
> > content="org.apache.tika.parser.DefaultParser"/>
> > <meta name="X-TIKA:Parsed-By"
> > content="org.apache.tika.parser.html.JSoupParser"/>
> > <meta name="Content-Encoding" content="ISO-8859-1"/>
> > <meta name="resourceName" content="test.html"/>
> > <meta name="Content-Length" content="94"/>
> > <meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
> > <title/>
> > </head>
> > <body>
> >
> > [~/dev/tika] $ java -jar tika-app-2.9.1.jar test.html
> > <?xml version="1.0" encoding="UTF-8"?><html
> > xmlns="http://www.w3.org/1999/xhtml">
> > <head>
> > <meta name="X-TIKA:Parsed-By"
> > content="org.apache.tika.parser.DefaultParser"/>
> > <meta name="X-TIKA:Parsed-By"
> > content="org.apache.tika.parser.html.HtmlParser"/>
> > <meta name="Content-Encoding" content="ISO-8859-1"/>
> > <meta name="resourceName" content="test.html"/>
> > <meta name="Content-Length" content="94"/>
> > <meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
> > <title/>
> > </head>
> > <body>
> >
> > Hello World
> > </body></html>
> >
> >
> > Cheers
> > Andreas
> >
> >

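For anyone who wants to re-check the reduced example against a given tika-app jar, here is a minimal sketch of a regression check. It is not part of Tika or of the PR. The helper names (`body_text`, `run_tika`, `hello_world_survives`) are hypothetical, and it assumes that `java` is on the PATH and that tika-app prints well-formed XHTML to stdout, as in the transcripts above.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

# Reduced example from the report (94 bytes with LF line endings,
# matching the Content-Length shown in the transcripts).
SAMPLE_HTML = (
    "<html>\n"
    "<body>\n"
    '<script type="text/javascript">alert("A");</script>\n'
    "Hello World\n"
    "</body>\n"
    "</html>\n"
)

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def body_text(xhtml: str) -> str:
    """Return the whitespace-normalized text inside <body> of XHTML output."""
    root = ET.fromstring(xhtml.encode("utf-8"))
    body = root.find(f"{XHTML_NS}body")
    if body is None:
        return ""
    return " ".join("".join(body.itertext()).split())


def run_tika(jar: str, html_path: str) -> str:
    """Run `java -jar <jar> <file>` and return stdout (assumed to be XHTML)."""
    result = subprocess.run(
        ["java", "-jar", jar, html_path],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


def hello_world_survives(jar: str, workdir: str = ".") -> bool:
    """Write the reduced example to disk and check the extracted body text."""
    path = Path(workdir) / "test.html"
    path.write_bytes(SAMPLE_HTML.encode("ascii"))
    return "Hello World" in body_text(run_tika(jar, str(path)))
```

Calling `hello_world_survives("tika-app-2.9.1.jar")` from the directory that holds the jar should return `True` according to the report above, while the 3.0.0-BETA jar should return `False` until a build containing the fix is used.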