It isn't common, but it does happen, and PDFs are
notoriously bad transport containers for text.

Some things are fixable, and some things aren't.

I downloaded this PDF:
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/WSDM_personality.pdf
I opened it in Adobe Acrobat DC and "saved as text".  There are some
definite areas for improvement in Adobe's output, too.

Typically, if Adobe didn't do a good job, we can assume that the PDF
has some underlying features that we can't expect Tika or PDFBox to
fix.  Adobe has problems with spacing here: "isa
psychologicallexicon,hasbeenusedtoevaluate user personality".  This
happens with PDFs because spaces often aren't stored as characters;
extractors have to infer them from glyph positions, font widths, etc.
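
To make the spacing point concrete, here is a minimal sketch of how an extractor might guess at spaces from glyph geometry. This is not PDFBox's actual algorithm; the Glyph class, the join method and the tolerance parameter are all hypothetical names made up for illustration.

```java
import java.util.List;

public class SpaceInference {
    /** Hypothetical glyph record: character, left edge, and advance width (same units). */
    public static final class Glyph {
        final char ch;
        final double x;
        final double width;

        public Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs into a string, inserting a space whenever the horizontal gap
     * to the previous glyph exceeds tolerance * (previous glyph's width).
     * The threshold rule is an assumption, not what any real library does.
     */
    public static String join(List<Glyph> glyphs, double tolerance) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > tolerance * prev.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'i' and 's' touch; 'a' starts 2 units after 's' ends.
        List<Glyph> glyphs = List.of(
                new Glyph('i', 0.0, 3.0),
                new Glyph('s', 3.0, 5.0),
                new Glyph('a', 10.0, 5.0));
        System.out.println(join(glyphs, 0.3)); // gap 2.0 > 1.5, so a space is inserted
        System.out.println(join(glyphs, 0.5)); // gap 2.0 <= 2.5, so the words run together
    }
}
```

The same glyphs come out as "is a" or "isa" depending only on the threshold, which is why two extractors can disagree on the same PDF.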

When I compared that output with Tika's, it looks like we (and PDFBox!)
are actually doing better in this case and several others.
tika-eval reports 7215 tokens extracted by Tika and 6473 tokens
extracted by Adobe, with a drop of 374 "common tokens" in Adobe.  In
short, our extract has more common words in it than Adobe's does.
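
For anyone unfamiliar with the "common tokens" idea, here is a rough sketch of that kind of comparison. It is not tika-eval's implementation: the tokenizer, the tiny COMMON word set and the class name are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CommonTokenCount {
    // Hypothetical, very small list of "common" English words.
    static final Set<String> COMMON = Set.of("is", "a", "has", "been", "used", "to", "user");

    /** Lowercases and splits on anything that is not a letter or digit. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Counts how many tokens appear in the common-word set. */
    public static long countCommon(List<String> tokens) {
        return tokens.stream().filter(COMMON::contains).count();
    }

    public static void main(String[] args) {
        String fused = "isa psychologicallexicon,hasbeenusedtoevaluate user personality";
        String spaced = "is a psychological lexicon, has been used to evaluate user personality";
        for (String s : List.of(fused, spaced)) {
            List<String> tokens = tokenize(s);
            System.out.println(tokens.size() + " tokens, " + countCommon(tokens) + " common");
        }
    }
}
```

Fused words like "hasbeenusedtoevaluate" still count as tokens, but they stop matching the common-word list, so a lower common-token count is a reasonable signal that spacing was lost.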

And output like "where Ti;m;Ei;n;Ai;o;Si;p
representsan instanceofatweet" suggests that the PDF stores no Unicode
equivalents for the glyphs in some of its fonts.

PDFBox notes: "WARN  No Unicode mapping for summationdisplay (88) in
font RBRLOC+CMEX9"
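
As a rough illustration of what that warning means: when a font only gives the extractor a glyph name, the name has to be looked up in some table to get a Unicode character. The sketch below is not PDFBox code; the KNOWN table is a tiny hypothetical subset and the class and method names are made up.

```java
import java.util.Map;
import java.util.Optional;

public class GlyphNameLookup {
    // Hypothetical subset of a glyph-name-to-Unicode table.
    static final Map<String, String> KNOWN = Map.of(
            "summation", "\u2211",
            "a", "a",
            "space", " ");

    /** Returns the Unicode text for a glyph name, or empty if the name is unknown. */
    public static Optional<String> toUnicode(String glyphName) {
        return Optional.ofNullable(KNOWN.get(glyphName));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"summation", "summationdisplay"}) {
            String result = toUnicode(name).orElse("<no Unicode mapping>");
            System.out.println(name + " -> " + result);
        }
    }
}
```

A font-specific name such as "summationdisplay" falls through the table, and at that point the extractor has nothing reliable to emit for that glyph.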


On Mon, Aug 6, 2018 at 1:27 PM Morkus <[email protected]> wrote:
>
> Hello all,
>
> For the first time ever, a PDF I tried to extract with Tika, failed.
>
> A scientific article with lots of symbols and such, by these authors:
>
> Beyond the Words: Predicting User Personality from
> Heterogeneous Information
> Honghao Weiy;, Fuzheng Zhangy, Nicholas Jing Yuanz,
> Chuan Caoz, Hao Fuz, Xing Xiey, Yong Ruiy, Wei-Ying May
> yMicrosoft ResearchzMicrosoft
> Department of Computer Science and Technology, Tsinghua University
> [email protected],
> {fuzzhang, nicholas.yuan, chcao, fuha, xingx, yongrui, wyma}@microsoft.com
>
> ------------
>
> I have tika-core 1.18 and tika-parsers 1.18.
>
> Is it unusual to have a failed PDF translation?
>
> Suggestions?
>
> I can include the PDF in an email, but wanted to ask first.
>
> Thanks!
>
>
> Sent from ProtonMail, Swiss-based encrypted email.
>
>