Try this as well: http://wiki.apache.org/tika/GrobidJournalParser 

 

 

 

From: Tim Allison <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, August 6, 2018 at 10:52 AM
To: "[email protected]" <[email protected]>, "[email protected]" 
<[email protected]>
Subject: Re: PDF Extraction Failed for scientific document

 

Well...um...it isn't common, but it does happen, and PDFs are
notoriously bad transport containers for text.

 

Some things are fixable, and some things aren't.

 

I downloaded this pdf:

https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/WSDM_personality.pdf

I opened it in AdobeDC and "saved as text".  There are some definite,
um, areas for improvement.

 

Typically, if Adobe didn't do a good job, then we can assume that
there are some underlying, er, features that we can't expect Tika or
PDFBox to fix.  Adobe has problems with spacing: "isa
psychologicallexicon,hasbeenusedtoevaluate user personality".  This
does happen with PDFs because spaces often aren't stored as
characters, but rather have to be calculated from glyph positions,
font widths etc.
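
To make the spacing point concrete, here is a minimal Java sketch of
the general idea: an extractor sees positioned glyphs and has to guess
where the spaces go from the gaps between them.  This is not PDFBox's
actual algorithm.  The class SpaceInference, the Glyph type, the
layout helper and the 0.5 gap ratio are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SpaceInference {

    /** A positioned glyph: the character, its left x coordinate, and its advance width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Rebuilds text from glyphs, inserting a space whenever the horizontal gap
     * to the previous glyph exceeds gapRatio times the previous glyph's width.
     */
    static String rebuild(List<Glyph> glyphs, double gapRatio) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapRatio * prev.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    /** Hypothetical helper: lays out words with a fixed glyph width and a fixed gap between words. */
    static List<Glyph> layout(String[] words, double glyphWidth, double wordGap) {
        List<Glyph> out = new ArrayList<>();
        double x = 0;
        for (String w : words) {
            for (char c : w.toCharArray()) {
                out.add(new Glyph(c, x, glyphWidth));
                x += glyphWidth;
            }
            x += wordGap; // no space character is stored, only extra horizontal distance
        }
        return out;
    }
}
```

With glyphs 5.0 units wide, laying out {"is", "a"} with a word gap of
3.0 rebuilds as "is a", while a tighter gap of 1.0 falls under the
threshold and rebuilds as "isa", which is the kind of output quoted
above.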

 

When I compared the output with Tika, it looks like we (and PDFBox!)
are actually doing better in this case and several others.
Tika-eval reports 7215 tokens extracted by Tika and 6473 tokens
extracted by Adobe, with a drop of 374 "common tokens" in Adobe.  In
short, our extract has more common words in it than Adobe does.
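
As a rough illustration of why a "common tokens" count is a useful
signal, here is a small Java sketch.  It assumes "common tokens" means
tokens that appear in a reference list of known words.  The class
CommonTokenCount and its tiny word list are invented for this example
and are not tika-eval's code or data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class CommonTokenCount {

    // Toy reference list; a real list would be far larger and language-specific.
    static final Set<String> COMMON = new HashSet<>(Arrays.asList(
            "is", "a", "psychological", "lexicon", "has", "been",
            "used", "to", "evaluate", "user", "personality"));

    /** Lowercases the text and splits it on runs of non-letter characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Counts the tokens that appear in the reference list. */
    static int commonTokens(String text) {
        int count = 0;
        for (String t : tokenize(text)) {
            if (COMMON.contains(t)) {
                count++;
            }
        }
        return count;
    }
}
```

Run against the Adobe snippet quoted earlier, "isa
psychologicallexicon,hasbeenusedtoevaluate user personality" yields 5
tokens of which 2 are in the list, while the correctly spaced sentence
yields 11 tokens, all 11 in the list.  Run-together words drop out of
the common count, which is why a lower count hints at a worse extract.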

 

And "where Ti;m;Ei;n;Ai;o;Si;p representsan instanceofatweet"
suggests that there are no Unicode equivalents stored in the PDF for
some fonts.

 

PDFBox notes: "WARN  No Unicode mapping for summationdisplay (88) in font RBRLOC+CMEX9"
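
A simplified Java sketch of what a missing mapping means in practice:
the extractor knows a glyph only by name, and if that name has no
Unicode equivalent it has nothing sensible to emit.  The class
GlyphNameLookup, its three-entry table and the use of U+FFFD as a
placeholder are assumptions for illustration, not how PDFBox is
implemented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GlyphNameLookup {

    // Toy glyph-name table; real tables hold thousands of names.
    static final Map<String, String> TOY_TABLE = new HashMap<>();
    static {
        TOY_TABLE.put("T", "T");
        TOY_TABLE.put("i", "i");
        TOY_TABLE.put("summation", "\u2211");
    }

    /**
     * Maps glyph names to text. Unknown names are recorded in warnings and
     * replaced with U+FFFD (a placeholder chosen for this sketch).
     */
    static String decode(List<String> glyphNames, List<String> warnings) {
        StringBuilder sb = new StringBuilder();
        for (String name : glyphNames) {
            String unicode = TOY_TABLE.get(name);
            if (unicode == null) {
                warnings.add("No Unicode mapping for " + name);
                sb.append('\uFFFD');
            } else {
                sb.append(unicode);
            }
        }
        return sb.toString();
    }
}
```

Decoding the names "summationdisplay", "T", "i" with this table gives
one warning and a placeholder followed by "Ti": the symbol is lost
even though the letters around it survive.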

 

 

On Mon, Aug 6, 2018 at 1:27 PM Morkus <[email protected]> wrote:

 

Hello all,

 

For the first time ever, a PDF I tried to extract with Tika failed.

 

A scientific article with lots of symbols and such, by these authors:

 

Beyond the Words: Predicting User Personality from

Heterogeneous Information

Honghao Weiy;, Fuzheng Zhangy, Nicholas Jing Yuanz,

Chuan Caoz, Hao Fuz, Xing Xiey, Yong Ruiy, Wei-Ying May

yMicrosoft ResearchzMicrosoft

Department of Computer Science and Technology, Tsinghua University

[email protected],

{fuzzhang, nicholas.yuan, chcao, fuha, xingx, yongrui, wyma}@microsoft.com

 

------------

 

I have tika-core 1.18 and tika-parsers 1.18.

 

Is it unusual for a PDF extraction to fail?

 

Suggestions?

 

I can include the PDF in an email, but wanted to ask first.

 

Thanks!

 

 

Sent from ProtonMail, Swiss-based encrypted email.

 


 

 

 
