Re: PDF Extraction Failed for scientific document

Robert Neal Clayton Mon, 06 Aug 2018 11:12:02 -0700

Speaking of scientific papers, you’re right: you are doing a better job than 
most. A couple of German comp-sci professors ran a benchmark study and 
published it ;)


http://ad-publications.informatik.uni-freiburg.de/benchmark.pdf 


> On Aug 6, 2018, at 12:51 PM, Tim Allison <[email protected]> wrote:
> 
> Well...um...it isn't common, but it does happen, and PDFs are
> notoriously bad transport containers for text.
> 
> Some things are fixable, and some things aren't.
> 
> I downloaded this pdf:
> https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/WSDM_personality.pdf
> I opened it in AdobeDC and "saved as text".  There are some definite,
> um, areas for improvement.
> 
> Typically, if Adobe didn't do a good job, then we can assume that
> there are some underlying, er, features that we can't expect Tika or
> PDFBox to fix.  Adobe has problems with spacing: "isa
> psychologicallexicon,hasbeenusedtoevaluate user personality".  This
> does happen with PDFs because sometimes spaces aren't stored, but
> rather are calculated based on font widths etc.
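
To make the point about calculated spaces concrete, here is a minimal sketch of how an extractor might decide where to insert a space from glyph positions alone. It is not PDFBox's or Adobe's actual algorithm; the function name `rebuild_line`, the `(char, x_start, width)` tuple layout, and the `gap_ratio` threshold are all assumptions made for illustration.

```python
def rebuild_line(glyphs, gap_ratio=0.3):
    """Rebuild one line of text from positioned glyphs.

    glyphs: list of (char, x_start, width) tuples in reading order.
    gap_ratio: hypothetical threshold; a horizontal gap wider than
    gap_ratio * (width of the next glyph) is treated as a space.
    """
    out = []
    prev_end = None
    for ch, x_start, width in glyphs:
        # No space character is stored in the input; the gap is the only clue.
        if prev_end is not None and (x_start - prev_end) > gap_ratio * width:
            out.append(" ")
        out.append(ch)
        prev_end = x_start + width
    return "".join(out)


# "is a" laid out with a 3-unit gap between "s" and "a", and no stored space.
sample = [("i", 0, 2), ("s", 2, 4), ("a", 9, 4)]
loose = rebuild_line(sample, gap_ratio=0.3)
strict = rebuild_line(sample, gap_ratio=1.0)
```

With the looser threshold the gap is read as a space; with the stricter one the words run together, which is the kind of output quoted above.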
> 
> When I compared the output with Tika, it looks like we (and PDFBox!)
> are actually doing better in this case and several others.
> Tika-eval reports 7215 tokens extracted by Tika and 6473 tokens
> extracted by Adobe, with a drop of 374 "common tokens" in Adobe.  In
> short, our extract has more common words in it than Adobe does.
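
The comparison above can be approximated with a few lines of Python. This is only a sketch of the idea, not tika-eval's implementation: the tokenizer, the tiny `COMMON_WORDS` set, and the function name `token_stats` are assumptions, and a real comparison would use a much larger common-word list for the document's language.

```python
import re

# Hypothetical stand-in for a per-language list of common words.
COMMON_WORDS = {
    "is", "a", "psychological", "lexicon", "has", "been",
    "used", "to", "evaluate", "user", "personality",
}


def token_stats(text, common_words=COMMON_WORDS):
    """Return (total_tokens, common_tokens) for an extracted text."""
    tokens = re.findall(r"\w+", text.lower())
    common = sum(1 for tok in tokens if tok in common_words)
    return len(tokens), common


well_spaced = "is a psychological lexicon, has been used to evaluate user personality"
run_together = "isa psychologicallexicon,hasbeenusedtoevaluate user personality"

good_stats = token_stats(well_spaced)
bad_stats = token_stats(run_together)
```

Missing spaces fuse several words into one unrecognized token, so both the total token count and the common-token count drop for the run-together extract.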
> 
> And "where Ti;m;Ei;n;Ai;o;Si;p
> representsan instanceofatweet"  suggests that there are no Unicode
> equivalents stored in the PDF for some fonts.
> 
> PDFBox notes: "WARN  No Unicode mapping for summationdisplay (88) in
> font RBRLOC+CMEX9"
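
As a rough illustration of what a missing Unicode mapping means for an extractor, the sketch below decodes a run of character codes using a font's code-to-Unicode table and reports the codes it cannot map. Everything here is hypothetical (the function `decode_glyphs`, the dictionaries, and the small glyph-name fallback table); it does not reproduce how PDFBox resolves glyphs.

```python
# Hypothetical fallback from glyph names to Unicode, for illustration only.
FALLBACK_GLYPHS = {"summationdisplay": "\u2211"}


def decode_glyphs(codes, to_unicode, glyph_names, fallback=FALLBACK_GLYPHS):
    """Decode character codes to text.

    to_unicode: dict of code -> Unicode string (may be incomplete).
    glyph_names: dict of code -> glyph name from the font's encoding.
    Returns (text, unmapped), where unmapped lists (code, name) pairs
    that had neither a Unicode mapping nor a known fallback.
    """
    out = []
    unmapped = []
    for code in codes:
        if code in to_unicode:
            out.append(to_unicode[code])
            continue
        name = glyph_names.get(code)
        if name in fallback:
            out.append(fallback[name])
        else:
            out.append("\ufffd")  # replacement character marks the gap
            unmapped.append((code, name))
    return "".join(out), unmapped


text, unmapped = decode_glyphs(
    [65, 88, 99],
    to_unicode={65: "A"},
    glyph_names={88: "summationdisplay", 99: "mysteryglyph"},
)
```

When no mapping and no fallback exist, the extractor can only emit a placeholder or a wrong character, which is consistent with the garbled symbols in the extract quoted above.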
> 
> 
> On Mon, Aug 6, 2018 at 1:27 PM Morkus <[email protected]> wrote:
>> 
>> Hello all,
>> 
>> For the first time ever, a PDF I tried to extract with Tika failed.
>> 
>> A scientific article with lots of symbols and such, by these authors:
>> 
>> Beyond the Words: Predicting User Personality from
>> Heterogeneous Information
>> Honghao Weiy;, Fuzheng Zhangy, Nicholas Jing Yuanz,
>> Chuan Caoz, Hao Fuz, Xing Xiey, Yong Ruiy, Wei-Ying May
>> yMicrosoft ResearchzMicrosoft
>> Department of Computer Science and Technology, Tsinghua University
>> [email protected],
>> {fuzzhang, nicholas.yuan, chcao, fuha, xingx, yongrui, wyma}@microsoft.com
>> 
>> ------------
>> 
>> I have tika-core 1.18 and tika-parsers 1.18.
>> 
>> Is it unusual to have a failed PDF translation?
>> 
>> Suggestions?
>> 
>> I can include the PDF in an email, but wanted to ask first.
>> 
>> Thanks!
>> 
>> 
>> Sent from ProtonMail, Swiss-based encrypted email.
>> 
>>
