If Tesseract is installed and was triggered: there is an ‘equ’ language file (equ.traineddata) on the Tesseract GitHub page, and it should be installed alongside the other language files. It does what its name suggests: it detects equation symbols. The ‘equ’ and ‘osd’ files are universal across Tesseract versions. I’m looking at FreeBSD’s port, which includes them by default, but I’m not sure about Linux distros. Debian seems to split the languages into individual packages, so you may not have those two unless they’re in the base package.
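A quick way to check whether those two files are actually present is to ask Tesseract which languages it can see. The sketch below is only an illustration, not anything Tika does itself: it assumes the `tesseract` binary is on your PATH and relies on its `--list-langs` option. The helper names (`parse_langs`, `missing_langs`, `list_langs_output`) are made up for this example.

```python
import shutil
import subprocess

# Assumption: these are the two language files discussed above.
REQUIRED = ("equ", "osd")


def parse_langs(output):
    """Return the set of language names found in `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def missing_langs(output, required=REQUIRED):
    """Return the required language names that do not appear in the output."""
    installed = parse_langs(output)
    return [name for name in required if name not in installed]


def list_langs_output():
    """Run `tesseract --list-langs`; return its text, or None if not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some builds print the list on stderr, so look at both streams.
    return result.stdout + "\n" + result.stderr


if __name__ == "__main__":
    text = list_langs_output()
    if text is None:
        print("tesseract not found on PATH")
    else:
        print("missing language files:", missing_langs(text) or "none")
```

If `equ` or `osd` shows up as missing, installing the matching traineddata file (or the distro package that carries it) would be the next thing to try.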

> On Aug 6, 2018, at 12:27 PM, Morkus <[email protected]> wrote:
>
> Hello all,
>
> For the first time ever, a PDF I tried to extract with Tika, failed.
>
> A scientific article with lots of symbols and such, by these authors:
>
> Beyond the Words: Predicting User Personality from
> Heterogeneous Information
> Honghao Weiy;, Fuzheng Zhangy, Nicholas Jing Yuanz,
> Chuan Caoz, Hao Fuz, Xing Xiey, Yong Ruiy, Wei-Ying May
> yMicrosoft ResearchzMicrosoft
> Department of Computer Science and Technology, Tsinghua University
> [email protected],
> {fuzzhang, nicholas.yuan, chcao, fuha, xingx, yongrui, wyma}@microsoft.com
>
> ------------
>
> I have tika-core 1.18 and tika-parsers 1.18.
>
> Is it unusual to have a failed PDF translation?
>
> Suggestions?
>
> I can include the PDF in an email, but wanted to ask first.
>
> Thanks!
>
> Sent from ProtonMail <https://protonmail.com/>, Swiss-based encrypted email.
