Hello Tim,

Opened two issues to track the problems:
https://issues.apache.org/jira/browse/TIKA-2758
https://issues.apache.org/jira/browse/TIKA-2759

Many thanks,
Markus
 
-----Original message-----
> From: Tim Allison <[email protected]>
> Sent: Wednesday 17th October 2018 16:53
> To: [email protected]
> Subject: Re: Encoding issues when upgrading Tika 1.17 to 1.19.1
> 
> Hi Markus,
> 
>   On the scripts...we added an "extractScripts" option, but the
> default is false, and the idea is that the scripts should be extracted
> as embedded documents, which with xhtml, would be inlined.  But, with
> the default as false, you shouldn't be seeing anything from scripts.
> 
>   On charset detection, that was likely caused by our "upgrade" to a
> more recent copy of icu4j's charset detector.
> 
>   Thank you for letting us know about these.  Please do open issues
> and share files.
> 
>    Cheers,
> 
>               Tim
> On Wed, Oct 17, 2018 at 10:24 AM Markus Jelsma
> <[email protected]> wrote:
> >
> > Hello,
> >
> > I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran 
> > all 995 unit tests and observed three failures, two encoding issues and one 
> > other weird thing. The tests use real HTML.
> >
> > Where we previously extracted text such as 'Spokane, Wash. [— The solar' 
> > we now got 'Spokane, Wash. [â€" The solar' in one test. The other had 
> > 'could take ["weeks, or' but we now get 'could take [“weeks, or' 
> > extracted. Our tests pass with 1.17 but fail with 1.18 and 1.19.1.
> >
> > The other test fails because we suddenly extracted a bunch of Javascript as 
> > text content while instead it is actually a script tag with base64 inline. 
> > This inline code is decoded and reported in the characters() method of our 
> > custom ContentHandler, and ends up as text being extracted, but it seems 
> > the Javascript start tag itself is never reported to startElement(). The 
> > Javascript is reported to characters() after we left the head and entered 
> > the body.
> >
> > Any idea on how to fix this encoding issue and the weird inline base64 
> > Javascript? Are there any Tika options that I am unaware of? Are these bugs?
> >
> > Of course, I can share the HTML files if needed.
> >
> > Many thanks,
> > Markus
> 
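
The first garbled string in the original report ('â€' in place of a dash) looks like what you get when UTF-8 bytes are decoded as windows-1252. That is an assumption; the thread does not say which charset the detector actually picked. A minimal sketch of the effect using only the JDK (the class and method names are made up for illustration and are not part of Tika):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Assumed wrong guess; the real detector output is not stated in the thread.
    private static final Charset WRONG = Charset.forName("windows-1252");

    // Encode as UTF-8, then decode the same bytes with the wrong charset.
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, WRONG);
    }

    // Undo the damage. This only round-trips when every byte is defined in
    // windows-1252; bytes such as 0x9D are not and would be lost.
    public static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(WRONG);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u2014 is the em dash from the first failing test string.
        String original = "Spokane, Wash. [\u2014 The solar";
        String garbled = misdecode(original);
        System.out.println(garbled);
        System.out.println(repair(garbled).equals(original));
    }
}
```

The em dash is three bytes in UTF-8 (E2 80 94), and each byte becomes its own character under windows-1252, which is why one character turns into three.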

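On the script problem: a custom ContentHandler usually keeps script content out of the extracted text by tracking startElement()/endElement() for the script tag and ignoring characters() in between. The sketch below shows that pattern with the JDK's own SAX parser rather than Tika, so it is self-contained; ScriptSkippingHandler and extract() are hypothetical names. If, as described in the original message, characters() is called without a preceding startElement() for the script, a filter like this never fires and the script body ends up in the text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ScriptSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int scriptDepth = 0; // greater than 0 while inside a script element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isScript(localName, qName)) {
            scriptDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isScript(localName, qName) && scriptDepth > 0) {
            scriptDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only collect text that is not inside a script element.
        if (scriptDepth == 0) {
            text.append(ch, start, length);
        }
    }

    private static boolean isScript(String localName, String qName) {
        return "script".equalsIgnoreCase(localName) || "script".equalsIgnoreCase(qName);
    }

    public String getText() {
        return text.toString();
    }

    // Parse well-formed XHTML with the JDK SAX parser and return the non-script text.
    public static String extract(String xhtml) {
        try {
            ScriptSkippingHandler handler = new ScriptSkippingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Logging every startElement() and characters() call in the same way is one way to confirm whether the script start tag really goes missing before attaching files to the issues.
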