Re: Encoding issues when upgrading Tika 1.17 to 1.19.1

2018-10-17 Thread Tim Allison
Hi Markus, On the scripts: we added an "extractScripts" option, but the default is false. The idea is that scripts should be extracted as embedded documents, which, with XHTML output, would be inlined. With the default of false, though, you shouldn't be seeing anything from scripts. On

Encoding issues when upgrading Tika 1.17 to 1.19.1

2018-10-17 Thread Markus Jelsma
Hello, I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran all 995 unit tests and observed three failures, two encoding issues and one other weird thing. The tests use real HTML. Where we previously extracted text such as 'Spokane, Wash. [— The solar' we now got

Re: Sample Rate / Audio Sample Rate not included in XML output

2018-10-17 Thread Tim Allison
> IIRC some of the metadata is only known once all parsing is finished, e.g. the audio duration, which may be why it's currently done as it is
Yes, I completely agree, but I don't think anything is written during the parsing. I _think_ all info is stored in memory during the parse, and then we write

Re: Sample Rate / Audio Sample Rate not included in XML output

2018-10-17 Thread Nick Burch
On Wed, 17 Oct 2018, Tim Allison wrote: This is one of the limitations of a streaming write. As I look at the code of the MP3Parser, I _think_ it would be trivial to write the metadata before writing any content, and it wouldn't get in the way of a streaming parse because the parser reads the

Re: Sample Rate / Audio Sample Rate not included in XML output

2018-10-17 Thread Tim Allison
Nick, I'm sorry for my delay. The XHTMLContentHandler writes everything that is in the Metadata object when the parser writes the first "content" element, and in the MP3Parser, this is the element, which is written before the sample rate is added to the Metadata object. Any metadata that is
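
The ordering problem described above can be illustrated with a small, self-contained sketch. This is not Tika code: `SketchHandler`, `parse_late`, `parse_early`, the `content` element name, and the `samplerate` key and value are all hypothetical stand-ins. It only assumes what the message states, namely that the handler writes out whatever is in the metadata at the moment the first content element is written.

```python
class SketchHandler:
    """Toy stand-in (hypothetical) for a handler that writes all known
    metadata when the first content element is started."""

    def __init__(self, metadata):
        self.metadata = metadata
        self.output = []
        self.flushed = False

    def start_element(self, name):
        # Metadata is written exactly once, on the first content element.
        if not self.flushed:
            for key, value in self.metadata.items():
                self.output.append(f'<meta name="{key}" content="{value}"/>')
            self.flushed = True
        self.output.append(f"<{name}>")


def parse_late(metadata):
    """Starts content first, then adds the sample rate (too late to be written)."""
    handler = SketchHandler(metadata)
    handler.start_element("content")   # hypothetical element name
    metadata["samplerate"] = "44100"   # hypothetical key and value
    return handler.output


def parse_early(metadata):
    """Adds the sample rate before any content element is started."""
    metadata["samplerate"] = "44100"
    handler = SketchHandler(metadata)
    handler.start_element("content")
    return handler.output


if __name__ == "__main__":
    print(parse_late({"title": "example"}))
    print(parse_early({"title": "example"}))
```

In the first variant the sample rate lands in the metadata but never reaches the output; in the second it is written along with everything else, which is the reordering the thread is discussing.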