Hello,

I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran all
995 unit tests and observed three failures: two encoding issues and one other
weird thing. The tests use real HTML.

Where we previously extracted text such as 'Spokane, Wash. [— The solar' we
now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could take
["weeks, or' but we now get 'could take [“weeks, or' extracted. Our tests
pass with 1.17 but fail with 1.18 and 1.19.1.
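
The first symptom looks like what happens when UTF-8 bytes are decoded as
windows-1252, although that is only an assumption and has not been confirmed
for Tika here. A minimal sketch that reproduces the same kind of garbling (the
class name MojibakeDemo and the helper misdecode are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode those bytes as
    // windows-1252. This mimics a charset misdetection; it is an assumed
    // cause, not a confirmed one.
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // \u2014 is the em dash from the first failing test.
        System.out.println(misdecode("Spokane, Wash. \u2014 The solar"));
    }
}
```

If the garbled output matches what the tests extract, comparing the charset
detected for the same file under 1.17 and 1.18 would be a reasonable next step.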

The other test fails because we suddenly extract a bunch of JavaScript as
text content, even though it is actually a script tag with inline base64. This
inline code is decoded and reported to the characters() method of our custom
ContentHandler, and ends up as extracted text, but it seems the start tag of
the script element itself is never reported to startElement(). The JavaScript
is reported to characters() after we have left the head and entered the body.
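
To narrow this down, the order of SAX events can be recorded with a small
tracing handler. This is only a sketch, not our actual ContentHandler; the
class name TraceHandler and its events list are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TraceHandler extends DefaultHandler {
    // Recorded events in arrival order, e.g. "start:script", "chars:...".
    public final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        events.add("start:" + name(localName, qName));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + name(localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            // Keep only a short prefix so long inline scripts stay readable.
            events.add("chars:" + text.substring(0, Math.min(40, text.length())));
        }
    }

    // Fall back to qName when the parser is not namespace-aware.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }
}
```

Passing such a handler to the parser in place of the custom one and dumping
events for the failing file should show whether a "start:script" entry ever
appears before the JavaScript characters.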

Any idea on how to fix this encoding issue and the weird inline base64
JavaScript? Are there any Tika options that I am unaware of? Are these bugs?

Of course, I can share the HTML files if needed.

Many thanks,
Markus
