Re: Encoding issues when upgrading Tika 1.17 to 1.19.1

Tim Allison Wed, 17 Oct 2018 07:53:50 -0700

Hi Markus,

  On the scripts...we added an "extractScripts" option, but the
default is false, and the idea is that the scripts should be extracted
as embedded documents, which with xhtml, would be inlined.  But, with
the default as false, you shouldn't be seeing anything from scripts.


  On charset detection, that was likely caused by our "upgrade" to a
more recent copy of icu4j's charset detector.
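
  For anyone trying to reproduce the symptom, here is a minimal sketch
of how this kind of garbling arises. It assumes the bytes are UTF-8 and
the detector picked windows-1252; that charset is an assumption, since
the thread does not say what was actually detected. The class and method
names (MojibakeDemo, misdecode) are made up for illustration and are not
part of Tika.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with the wrong
    // charset (windows-1252, assumed here), as a misdetection would.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+2014 (em dash) is E2 80 94 in UTF-8; read as windows-1252
        // those bytes become U+00E2, U+20AC, U+201D.
        System.out.println(misdecode("\u2014"));
        // U+201C (left double quote) is E2 80 9C in UTF-8; read as
        // windows-1252 those bytes become U+00E2, U+20AC, U+0153.
        System.out.println(misdecode("\u201C"));
    }
}
```

  If the failing files show the same three-character sequences, that
would be consistent with UTF-8 content being read as a single-byte
charset, though it does not by itself confirm which detector change is
responsible.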

  Thank you for letting us know about these.  Please do open issues
and share files.

   Cheers,

              Tim
On Wed, Oct 17, 2018 at 10:24 AM Markus Jelsma
<[email protected]> wrote:
>
> Hello,
>
> I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran 
> all 995 unit tests and observed three failures, two encoding issues and one 
> other weird thing. The tests use real HTML.
>
> Where we previously extracted text such as 'Spokane, Wash. [— The solar' we 
> now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could 
> take ["weeks, or' but we now get 'could take [â€œweeks, or' extracted. Our 
> tests pass with 1.17 but fail with 1.18 and 1.19.1.
>
> The other test fails because we suddenly extracted a bunch of Javascript as 
> text content while instead it is actually a script tag with base64 inline. 
> This inline code is decoded and reported in the characters() method of our 
> custom ContentHandler, and ends up as text being extracted, but it seems the 
> Javascript start tag itself is never reported to startElement(). The 
> Javascript is reported to characters() after we left the head and entered the 
> body.
>
> Any idea on how to fix this encoding issue and the weird inline base64 
> Javascript? Are there any Tika options that I am unaware of? Are these bugs?
>
> Of course, I can share the HTML files if needed.
>
> Many thanks,
> Markus
