Hi,
I'm currently testing the upgrade to 3.0.0-BETA and one of our test
cases failed because of changed behavior when extracting text from HTML.
That's probably related to the switch from TagSoup to JSoup (TIKA-1599).
The test uses really old but real-world HTML that contains a script
tag at the start of the body. With 3.0.0-BETA, the body text after the
script tag is no longer returned.
I don't think that's a blocker for us, but I wanted to let you know. I
also don't know how common such HTML actually is.
A reduced example file:
<html>
<body>
<script type="text/javascript">alert("A");</script>
Hello World
</body>
</html>
If I pass that to the Tika app, I get the text "Hello World" back with
2.9.1, but not with 3.0.0-BETA:
$ java -jar tika-app-3.0.0-BETA.jar test.html
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.html.JSoupParser"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="resourceName" content="test.html"/>
<meta name="Content-Length" content="94"/>
<meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title/>
</head>
<body>
$ java -jar tika-app-2.9.1.jar test.html
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.html.HtmlParser"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="resourceName" content="test.html"/>
<meta name="Content-Length" content="94"/>
<meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title/>
</head>
<body>
Hello World
</body></html>
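In case it helps with reproducing this, here is a minimal sketch for
recreating the reduced file from a shell. It assumes plain LF line
endings and a single trailing newline, which is what makes the file come
out at the 94 bytes reported as Content-Length above. The byte count is
only a sanity check and has nothing to do with Tika itself.

```shell
# Recreate the reduced example. The quoted 'EOF' keeps the shell from
# expanding anything inside the here-document.
cat > test.html <<'EOF'
<html>
<body>
<script type="text/javascript">alert("A");</script>
Hello World
</body>
</html>
EOF

# Sanity check: should report 94 bytes, matching the Content-Length
# above (assumes LF line endings and one trailing newline).
wc -c < test.html

# Then run both versions against the file, as in the commands above:
#   java -jar tika-app-2.9.1.jar test.html
#   java -jar tika-app-3.0.0-BETA.jar test.html
```

If the byte count differs, the file probably has CRLF line endings or is
missing the final newline. That may or may not matter for the parser.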
Cheers
Andreas