Speaking of scientific papers, you’re right, you are doing a better job than most. A couple of German comp-sci professors have done a study and published it ;)
http://ad-publications.informatik.uni-freiburg.de/benchmark.pdf

> On Aug 6, 2018, at 12:51 PM, Tim Allison <[email protected]> wrote:
>
> Well...um...it isn't common, but it does happen, and PDFs are
> notoriously bad transport containers for text.
>
> Some things are fixable, and some things aren't.
>
> I downloaded this pdf:
> https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/WSDM_personality.pdf
> I opened it in AdobeDC and "saved as text". There are some definite,
> um, areas for improvement.
>
> Typically, if Adobe didn't do a good job, then we can assume that
> there are some underlying, er, features that we can't expect Tika or
> PDFBox to fix. Adobe has problems with spacing: "isa
> psychologicallexicon,hasbeenusedtoevaluate user personality". This
> does happen with PDFs because sometimes spaces aren't stored, but
> rather are calculated based on font widths etc.
>
> When I compared the output with Tika, it looks like we (and PDFBox!)
> are actually doing better in this case and several others.
> Tika-eval reports 7215 tokens extracted by Tika and 6473 tokens
> extracted by Adobe, with a drop of 374 "common tokens" in Adobe. In
> short, our extract has more common words in it than Adobe does.
>
> And "where Ti;m;Ei;n;Ai;o;Si;p
> representsan instanceofatweet" suggests that there are no Unicode
> equivalents stored in the PDF for some fonts.
>
> PDFBox notes: "WARN No Unicode mapping for summationdisplay (88) in
> font RBRLOC+CMEX9"
>
>
> On Mon, Aug 6, 2018 at 1:27 PM Morkus <[email protected]> wrote:
>>
>> Hello all,
>>
>> For the first time ever, a PDF I tried to extract with Tika, failed.
>>
>> A scientific article with lots of symbols and such, by these authors:
>>
>> Beyond the Words: Predicting User Personality from
>> Heterogeneous Information
>> Honghao Weiy;, Fuzheng Zhangy, Nicholas Jing Yuanz,
>> Chuan Caoz, Hao Fuz, Xing Xiey, Yong Ruiy, Wei-Ying May
>> yMicrosoft ResearchzMicrosoft
>> Department of Computer Science and Technology, Tsinghua University
>> [email protected],
>> {fuzzhang, nicholas.yuan, chcao, fuha, xingx, yongrui, wyma}@microsoft.com
>>
>> ------------
>>
>> I have tika-core 1.18 and tika-parsers 1.18.
>>
>> Is it unusual to have a failed PDF translation?
>>
>> Suggestions?
>>
>> I can include the PDF in an email, but wanted to ask first.
>>
>> Thanks!
>>
>>
>> Sent from ProtonMail, Swiss-based encrypted email.
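
For anyone who wants to try the kind of comparison Tim describes above on their own extracts, here is a rough Python sketch. It is not tika-eval and makes no claim about how tika-eval tokenizes or which word lists it uses. The `COMMON_WORDS` set, the `tokenize` and `token_stats` helpers, and the re-spaced sentence are all my own assumptions for illustration; only the garbled Adobe string comes from the thread. It counts total tokens and tokens that appear in a small common-word list, so an extract that lost its spaces shows up with fewer recognizable words.

```python
import re

# Hypothetical, tiny stand-in for a per-language common-words list.
COMMON_WORDS = {"is", "a", "has", "been", "used", "to", "user", "the", "of", "and"}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_stats(text):
    """Return (total token count, count of tokens found in COMMON_WORDS)."""
    tokens = tokenize(text)
    common = sum(1 for t in tokens if t in COMMON_WORDS)
    return len(tokens), common


# Garbled string quoted in the thread (Adobe "save as text" output).
adobe_like = "isa psychologicallexicon,hasbeenusedtoevaluate user personality"
# Assumed re-spacing of the same phrase, written by hand for this example.
respaced = "is a psychological lexicon, has been used to evaluate user personality"

for label, text in (("adobe-like", adobe_like), ("re-spaced", respaced)):
    total, common = token_stats(text)
    print(f"{label}: {total} tokens, {common} common")
```

The missing spaces fuse short, frequent words into long unknown ones, so both the token count and the common-token count drop. That is the same direction as the gap Tim reports between the Tika and Adobe extracts, though the numbers here are from a toy word list and say nothing about the real documents.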
