Well...um...it isn't common, but it does happen, and PDFs are notoriously bad transport containers for text.
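One reason, as a rough illustration: a PDF often stores positioned glyphs rather than explicit space characters, so an extractor has to guess word breaks from the gaps between glyphs. The Python below is only a toy sketch of that idea, not how PDFBox or Tika actually do it. The `(char, x, width)` glyph tuples, the `threshold` value, and both function names are my own assumptions for the example.

```python
def glyphs_to_text(glyphs, threshold=0.3):
    """Join positioned glyphs into a string.

    Each glyph is a (char, x, width) tuple. A space is inserted wherever the
    horizontal gap to the previous glyph exceeds `threshold` times the mean
    glyph width. This is a toy heuristic, assumed for illustration only.
    """
    if not glyphs:
        return ""
    mean_width = sum(w for _, _, w in glyphs) / len(glyphs)
    out = [glyphs[0][0]]
    for (_, prev_x, prev_w), (ch, x, _) in zip(glyphs, glyphs[1:]):
        gap = x - (prev_x + prev_w)
        if gap > threshold * mean_width:
            out.append(" ")
        out.append(ch)
    return "".join(out)


def layout(words, gap):
    """Hypothetical helper: lay out words as glyphs 5 units wide,
    leaving `gap` units of empty space between consecutive words."""
    glyphs, x = [], 0.0
    for word in words:
        for ch in word:
            glyphs.append((ch, x, 5.0))
            x += 5.0
        x += gap
    return glyphs


# Same words, two different typesetting choices for the inter-word gap.
loose = layout(["is", "a", "lexicon"], gap=2.5)
tight = layout(["is", "a", "lexicon"], gap=1.0)

print(glyphs_to_text(loose))  # gaps are wide enough to be seen as spaces
print(glyphs_to_text(tight))  # gaps fall under the threshold, words run together
```

With a tightly set font the gaps drop below whatever threshold the extractor uses, and the words run together. Tuning the threshold helps one document and hurts another, which is why this is hard to fix in general.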
Some things are fixable, and some things aren't.

I downloaded this PDF:
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/WSDM_personality.pdf

I opened it in AdobeDC and "saved as text". There are some definite, um, areas for improvement. Typically, if Adobe didn't do a good job, then we can assume that there are some underlying, er, features that we can't expect Tika or PDFBox to fix.

Adobe has problems with spacing: "isa psychologicallexicon,hasbeenusedtoevaluate user personality". This does happen with PDFs because sometimes spaces aren't stored, but rather are calculated based on font widths etc.

When I compared Adobe's output with Tika's, it looked like we (and PDFBox!) are actually doing better in this case and several others. Tika-eval reports 7215 tokens extracted by Tika and 6473 tokens extracted by Adobe, with a drop of 374 "common tokens" in Adobe. In short, our extract has more common words in it than Adobe's does.

And "where Ti;m;Ei;n;Ai;o;Si;p representsan instanceofatweet" suggests that there are no Unicode equivalents stored in the PDF for some fonts. PDFBox notes:

"WARN No Unicode mapping for summationdisplay (88) in font RBRLOC+CMEX9"

On Mon, Aug 6, 2018 at 1:27 PM Morkus <[email protected]> wrote:
>
> Hello all,
>
> For the first time ever, a PDF I tried to extract with Tika, failed.
>
> A scientific article with lots of symbols and such, by these authors:
>
> Beyond the Words: Predicting User Personality from
> Heterogeneous Information
> Honghao Weiy;, Fuzheng Zhangy, Nicholas Jing Yuanz,
> Chuan Caoz, Hao Fuz, Xing Xiey, Yong Ruiy, Wei-Ying May
> yMicrosoft ResearchzMicrosoft
> Department of Computer Science and Technology, Tsinghua University
> [email protected],
> {fuzzhang, nicholas.yuan, chcao, fuha, xingx, yongrui, wyma}@microsoft.com
>
> ------------
>
> I have tika-core 1.18 and tika-parsers 1.18.
>
> Is it unusual to have a failed PDF translation?
>
> Suggestions?
>
> I can include the PDF in an email, but wanted to ask first.
>
> Thanks!
>
> Sent from ProtonMail, Swiss-based encrypted email.
