Hello Tim,

Opened two issues to track the problems:
https://issues.apache.org/jira/browse/TIKA-2758
https://issues.apache.org/jira/browse/TIKA-2759

Many thanks,
Markus

-----Original message-----
> From: Tim Allison <[email protected]>
> Sent: Wednesday 17th October 2018 16:53
> To: [email protected]
> Subject: Re: Encoding issues when upgrading Tika 1.17 to 1.19.1
>
> Hi Markus,
>
> On the scripts...we added an "extractScripts" option, but the
> default is false, and the idea is that the scripts should be extracted
> as embedded documents, which with xhtml, would be inlined. But, with
> the default as false, you shouldn't be seeing anything from scripts.
>
> On charset detection, that was likely caused by our "upgrade" to a
> more recent copy of icu4j's charset detector.
>
> Thank you for letting us know about these. Please do open issues
> and share files.
>
> Cheers,
>
> Tim
> On Wed, Oct 17, 2018 at 10:24 AM Markus Jelsma
> <[email protected]> wrote:
> >
> > Hello,
> >
> > I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran
> > all 995 unit tests and observed three failures, two encoding issues and one
> > other weird thing. The tests use real HTML.
> >
> > Where we previously extracted text such as 'Spokane, Wash. [— The solar'
> > we now got 'Spokane, Wash. [â€" The solar' in one test. The other had
> > 'could take ["weeks, or' but we now get 'could take [“weeks, or'
> > extracted. Our tests pass with 1.17 but fail with 1.18 and 1.19.1.
> >
> > The other test fails because we suddenly extracted a bunch of Javascript as
> > text content while instead it is actually a script tag with base64 inline.
> > This inline code is decoded and reported in the characters() method of our
> > custom ContentHandler, and ends up as text being extracted, but it seems
> > the Javascript start tag itself is never reported to startElement(). The
> > Javascript is reported to characters() after we left the head and entered
> > the body.
> >
> > Any idea on how to fix this encoding issue and the weird inline base64
> > Javascript? Are there any Tika options that I am unaware of? Are these bugs?
> >
> > Of course, I can share the HTML files if needed.
> >
> > Many thanks,
> > Markus
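
For anyone trying to reproduce the first symptom outside of Tika: the garbled 'â€"' sequence is what one would expect if UTF-8 bytes were decoded as windows-1252. That the detector actually picked windows-1252 for these files is an assumption, not something confirmed in this thread. A minimal sketch using only the JDK (the class and method names are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes as windows-1252.
    // This mimics a charset detector guessing the wrong encoding
    // (assumed here; the thread does not say which charset was chosen).
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+2014 is the dash from the 'Spokane, Wash.' test string.
        String original = "Spokane, Wash. [\u2014 The solar";
        String garbled = misdecode(original);

        System.out.println(garbled);
        // The single dash becomes three characters:
        // U+00E2 (a with circumflex), U+20AC (euro sign), U+201D (right double quote).
        System.out.println(garbled.length() - original.length()); // 2 extra chars
    }
}
```

If the output matches what the failing test extracts, that would point at charset detection rather than at the parser itself.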
