Hi,

I'm currently testing the upgrade to 3.0.0-BETA and one of our test cases failed because of changed behavior when extracting text from HTML. That's probably related to the switch from TagSoup to JSoup (TIKA-1599).

The test uses really old but real-world HTML that contains a script tag at the start of the body. With 3.0.0-BETA, the body text below the script tag is no longer returned. I don't think that's a blocker for us, but I wanted to report it. I also don't know how common such HTML actually is.

A reduced example file:

<html>
<body>
<script type="text/javascript">alert("A");</script>
Hello World
</body>
</html>

If I pass that to the Tika app, I get the text "Hello World" back with 2.9.1, but not with 3.0.0-BETA:

 $ java -jar tika-app-3.0.0-BETA.jar test.html
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.html.JSoupParser"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="resourceName" content="test.html"/>
<meta name="Content-Length" content="94"/>
<meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title/>
</head>
<body>

[~/dev/tika] $ java -jar tika-app-2.9.1.jar test.html
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.html.HtmlParser"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="resourceName" content="test.html"/>
<meta name="Content-Length" content="94"/>
<meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title/>
</head>
<body>

Hello World
</body></html>
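
In case it helps with reproducing, here is a rough script for the comparison above. It is only a sketch: it assumes both tika-app jars sit in the current directory under the names used above, and it uses the app's --text option to print plain text instead of XHTML. Jars that are not found are skipped.

```shell
#!/bin/sh
# Recreate the reduced example. With LF line endings this should come to
# 94 bytes, matching the Content-Length reported in the output above.
cat > test.html <<'EOF'
<html>
<body>
<script type="text/javascript">alert("A");</script>
Hello World
</body>
</html>
EOF

# Extract plain text with each release and report whether the body text
# survives. The jar locations are an assumption; adjust as needed.
for jar in tika-app-2.9.1.jar tika-app-3.0.0-BETA.jar; do
    if [ ! -f "$jar" ]; then
        echo "$jar: not found, skipping"
        continue
    fi
    if java -jar "$jar" --text test.html | grep -q "Hello World"; then
        echo "$jar: body text present"
    else
        echo "$jar: body text missing"
    fi
done
```

Based on the output above, I would expect "body text present" for 2.9.1 and "body text missing" for 3.0.0-BETA.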


Cheers
Andreas

