Please, please let us know of any other problems you find! And, thank you, again.
On Tue, Jan 2, 2024 at 8:58 AM Tim Allison <[email protected]> wrote:
>
> yakivy opened a PR for exactly this: https://github.com/apache/tika/pull/1522
>
> Once the checks pass, I'll merge that, and we should be good to go.
> Thank you so much for letting us know of this bug.
>
> On Mon, Dec 18, 2023 at 7:57 AM Andreas Hubold
> <[email protected]> wrote:
> >
> > Hi,
> >
> > I'm currently testing the upgrade to 3.0.0-BETA and one of our test
> > cases failed because of changed behavior when extracting text from HTML.
> > That's probably related to the switch from TagSoup to JSoup (TIKA-1599).
> >
> > The test uses really old but real-world HTML, which contains a script
> > tag at the start of the body. With 3.0.0-BETA, the body text below the
> > script tag is not returned anymore.
> > I don't think that's a blocker for us, but I just wanted to tell you. I
> > also don't know how common such HTML actually is.
> >
> > A reduced example file:
> >
> > <html>
> > <body>
> > <script type="text/javascript">alert("A");</script>
> > Hello World
> > </body>
> > </html>
> >
> > If I pass that to the Tika app, I get the text "Hello World" back with
> > 2.9.1, but not with 3.0.0-BETA:
> >
> > $ java -jar tika-app-3.0.0-BETA.jar test.html
> > <?xml version="1.0" encoding="UTF-8"?><html
> > xmlns="http://www.w3.org/1999/xhtml">
> > <head>
> > <meta name="X-TIKA:Parsed-By"
> > content="org.apache.tika.parser.DefaultParser"/>
> > <meta name="X-TIKA:Parsed-By"
> > content="org.apache.tika.parser.html.JSoupParser"/>
> > <meta name="Content-Encoding" content="ISO-8859-1"/>
> > <meta name="resourceName" content="test.html"/>
> > <meta name="Content-Length" content="94"/>
> > <meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
> > <title/>
> > </head>
> > <body>
> >
> > [~/dev/tika] $ java -jar tika-app-2.9.1.jar test.html
> > <?xml version="1.0" encoding="UTF-8"?><html
> > xmlns="http://www.w3.org/1999/xhtml">
> > <head>
> > <meta name="X-TIKA:Parsed-By"
> > content="org.apache.tika.parser.DefaultParser"/>
> > <meta name="X-TIKA:Parsed-By"
> > content="org.apache.tika.parser.html.HtmlParser"/>
> > <meta name="Content-Encoding" content="ISO-8859-1"/>
> > <meta name="resourceName" content="test.html"/>
> > <meta name="Content-Length" content="94"/>
> > <meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
> > <title/>
> > </head>
> > <body>
> >
> > Hello World
> > </body></html>
> >
> >
> > Cheers
> > Andreas
