yakivy opened a PR for exactly this: https://github.com/apache/tika/pull/1522
Once the checks pass, I'll merge that, and we should be good to go. Thank you so much for letting us know of this bug.

On Mon, Dec 18, 2023 at 7:57 AM Andreas Hubold <[email protected]> wrote:
>
> Hi,
>
> I'm currently testing the upgrade to 3.0.0-BETA and one of our test
> cases failed because of changed behavior when extracting text from HTML.
> That's probably related to the switch from TagSoup to JSoup (TIKA-1599).
>
> The test uses really old but real-world HTML, which contains a script
> tag at the start of the body. With 3.0.0-BETA, the body text below the
> script tag is not returned anymore.
> I don't think that's a blocker for us, but I just wanted to tell you. I
> also don't know how common such HTML actually is.
>
> A reduced example file:
>
> <html>
> <body>
> <script type="text/javascript">alert("A");</script>
> Hello World
> </body>
> </html>
>
> If I pass that to the Tika app, I get the text "Hello World" back with
> 2.9.1, but not with 3.0.0-BETA:
>
> $ java -jar tika-app-3.0.0-BETA.jar test.html
> <?xml version="1.0" encoding="UTF-8"?><html
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By"
> content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By"
> content="org.apache.tika.parser.html.JSoupParser"/>
> <meta name="Content-Encoding" content="ISO-8859-1"/>
> <meta name="resourceName" content="test.html"/>
> <meta name="Content-Length" content="94"/>
> <meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
> <title/>
> </head>
> <body>
>
> [~/dev/tika] $ java -jar tika-app-2.9.1.jar test.html
> <?xml version="1.0" encoding="UTF-8"?><html
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By"
> content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By"
> content="org.apache.tika.parser.html.HtmlParser"/>
> <meta name="Content-Encoding" content="ISO-8859-1"/>
> <meta name="resourceName" content="test.html"/>
> <meta name="Content-Length" content="94"/>
> <meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
> <title/>
> </head>
> <body>
>
> Hello World
> </body></html>
>
>
> Cheers
> Andreas
>
>
