HTML Parsing Changes in 3.0.0-BETA

Andreas Hubold Mon, 18 Dec 2023 04:57:51 -0800

Hi,

I'm currently testing the upgrade to 3.0.0-BETA, and one of our test cases failed because of changed behavior when extracting text from HTML. That's probably related to the switch from TagSoup to JSoup (TIKA-1599).

The test uses really old but real-world HTML, which contains a script tag at the start of the body. With 3.0.0-BETA, the body text below the script tag is not returned anymore. I don't think that's a blocker for us, but I just wanted to tell you. I also don't know how common such HTML actually is.


A reduced example file:

<html>
<body>
<script type="text/javascript">alert("A");</script>
Hello World
</body>
</html>

If I pass that to the Tika app, I get the text "Hello World" back with 2.9.1, but not with 3.0.0-BETA:


 $ java -jar tika-app-3.0.0-BETA.jar test.html

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.html.JSoupParser"/>

<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="resourceName" content="test.html"/>
<meta name="Content-Length" content="94"/>
<meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title/>
</head>
<body>

[~/dev/tika] $ java -jar tika-app-2.9.1.jar test.html

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.html.HtmlParser"/>

<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="resourceName" content="test.html"/>
<meta name="Content-Length" content="94"/>
<meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title/>
</head>
<body>

Hello World
</body></html>
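
To reproduce the comparison, something like the following shell sketch may help. It recreates the reduced test file and runs each version in turn. It assumes both tika-app jars are in the current directory (that location is an assumption, adjust as needed) and uses the Tika app's --text option to print only the extracted text rather than the XHTML shown above. A version whose jar is missing is skipped.

```shell
# Recreate the reduced example file from this report.
cat > test.html <<'EOF'
<html>
<body>
<script type="text/javascript">alert("A");</script>
Hello World
</body>
</html>
EOF

# Run each Tika app version against the file, if its jar is present.
# Assumption: the jars are in the current directory under these names.
for version in 2.9.1 3.0.0-BETA; do
  jar="tika-app-${version}.jar"
  if [ -f "$jar" ]; then
    echo "== ${version} =="
    java -jar "$jar" --text test.html
  else
    echo "skipping ${version}: ${jar} not found"
  fi
done
```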


Cheers
Andreas
