Re: PDF Extraction Failed for scientific document

Robert Neal Clayton Mon, 06 Aug 2018 10:43:21 -0700

If Tesseract is installed and was triggered: 

There is an ‘equ’ language file (equ.traineddata) on Tesseract’s GitHub page 
which should be installed alongside the other language files. It does what its 
name suggests: it detects equation symbols. The ‘equ’ and ‘osd’ files are 
universal for Tesseract versions. I’m looking at FreeBSD’s port, which includes 
them by default, but I’m not sure about Linux distros. It seems that Debian 
splits the languages into individual packages, so you might not have those two 
unless they’re in the base package.
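
One way to check whether those two files are present is to ask Tesseract 
itself with `tesseract --list-langs`. The sketch below is only an illustration: 
the helper names (parse_langs, missing_langs, installed_output) are made up 
for this example, and it assumes the `tesseract` binary is on your PATH and 
prints one language code per line after a "List of available languages" 
header.

```python
import shutil
import subprocess

# The two language files discussed above.
REQUIRED = ("equ", "osd")


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def missing_langs(output, required=REQUIRED):
    """Return the required language codes that are absent from the output."""
    installed = set(parse_langs(output))
    return [lang for lang in required if lang not in installed]


def installed_output():
    """Run `tesseract --list-langs`; return None if the binary is not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    proc = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some versions write the list to stderr, so read both streams.
    return proc.stdout + "\n" + proc.stderr


if __name__ == "__main__":
    out = installed_output()
    if out is None:
        print("tesseract not found on PATH")
    else:
        missing = missing_langs(out)
        print("missing:", ", ".join(missing) if missing else "none")
```

If either name is reported missing, installing the corresponding language 
package for your distro (or dropping the .traineddata file into the tessdata 
directory) would be the next thing to try.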


> On Aug 6, 2018, at 12:27 PM, Morkus <[email protected]> wrote:
> 
> Hello all,
> 
> For the first time ever, a PDF I tried to extract with Tika failed.
> 
> A scientific article with lots of symbols and such, by these authors:
> 
> Beyond the Words: Predicting User Personality from
> Heterogeneous Information
> Honghao Wei†, Fuzheng Zhang†, Nicholas Jing Yuan‡,
> Chuan Cao‡, Hao Fu‡, Xing Xie†, Yong Rui†, Wei-Ying Ma†
> †Microsoft Research ‡Microsoft
> Department of Computer Science and Technology, Tsinghua University
> [email protected] <mailto:[email protected]>,
> {fuzzhang, nicholas.yuan, chcao, fuha, xingx, yongrui, wyma}@microsoft.com
> 
> ------------
> 
> I have tika-core 1.18 and tika-parsers 1.18.
> 
> Is it unusual to have a failed PDF translation?
> 
> Suggestions?
> 
> I can include the PDF in an email, but wanted to ask first.
> 
> Thanks!
> 
> 
> Sent from ProtonMail <https://protonmail.com/>, Swiss-based encrypted email.
