Can you create a JIRA issue and provide a sample of the file? Does the file have any embedded objects, like Excel, PPT, ...? Or text inserted as a text box?
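If it helps, here is a rough way to check for both on the DOCX copy of the file. This is only a sketch using the Python standard library, not Tika: it relies on a .docx being a ZIP archive in which Word normally stores embedded objects under word/embeddings/ and marks text-box content with w:txbxContent elements in word/document.xml. The function name inspect_docx is made up for illustration, and the "w:" prefix is the conventional one rather than a guaranteed one.

```python
import zipfile


def inspect_docx(path):
    """Report embedded objects and text-box markers found in a .docx file."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        # Embedded Excel/PPT/etc. objects are normally stored under this folder.
        embedded = [n for n in names if n.startswith("word/embeddings/")]
        # The main document body; text boxes appear as <w:txbxContent> elements.
        body = zf.read("word/document.xml").decode("utf-8", errors="replace")
    # Raw count of opening tags. Word often writes a fallback copy of each
    # text box, so one visible box may be counted twice.
    return {"embedded": embedded, "text_boxes": body.count("<w:txbxContent")}
```

Calling inspect_docx("sample.docx") (file name is just an example) returns a dict with the embedded part names and the text-box tag count. A non-empty result would suggest the missing text lives in embedded objects or text boxes, which is worth mentioning in the JIRA issue.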
Steven White at "Sat, 14 Sep 2019 17:29:19 -0400" wrote:
 SW> Hi everyone,
 SW>
 SW> I'm using Tika to extract raw text from a Microsoft Word 9.0 file. Tika is giving me back 1/3 of the data. If I save the
 SW> file as DOCX using MS Word 2017, I still see the problem. However, if I save the file as PDF using MS Word 2017, the PDF file
 SW> gets processed just fine (I get all the raw text).
 SW>
 SW> How can I debug this to find out what's the issue?
 SW>
 SW> Thanks
 SW>
 SW> Steven

--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
