Hi Christopher,

On 21.04.2010 at 02:13, Christopher Mason wrote:

> 
> I'm investigating libraries for rendering and extracting text from PDF.  
> Across the half dozen I've looked at, both commercial and open source, I 
> think pdfbox is the cleanest.

I agree. I have just extracted text from around 40,000 mathematical PDF files, 
and in my experience pdfbox is the best tool for that.
There are a few exceptions, though.

> However, I've run across a number of pdfs that pdfbox does not render 
> properly.  One I'm particularly concerned about is:
> 
> http://www.cmason.com/tmp/Sowa.pdf

I only extract text and am not concerned with rendering.
As far as I can see, the text I get is as good as it can be, so I don't think 
there are problems with the fonts and/or glyphs; see the excerpt below.
But I have to agree that org.apache.pdfbox.PDFToImage doesn't give me anything 
useful either (actually one very long image consisting of all the pages of the 
document, with errors like those in the image you mentioned).

Cheers
Thomas


> 
> It looks to have encoding or char -> glyph issues in pdfbox, but looks okay in 
> every other reader/library I've tried.  I've tried with both pdfbox-1.1.0 and 
> with the trunk.  Here's how it looks in pdfbox trunk versus Preview:
> 
> http://www.cmason.com/tmp/Sowa.png
> 
> Any help or suggestions would be most appreciated.
> 
> -c
> 
> 
> 
> java -cp \
>   ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar:pdfbox-1.1.0.jar:fontbox-1.1.0.jar \
>   org.apache.pdfbox.PDFToImage -color rgba -startPage 1 -endPage 1 \
>   -resolution 100 -imageType png -outputPrefix Sowa ~/Sites/docs/Sowa.pdf
> 
> 

Beginning of text:

The Challenge 
Of Knowledge Soup
John F. Sowa 
26 August 2004 
PerMIS 2004 Workshop at NIST 
Gaithersburg, Maryland 
Outline of This Talk
1. Thesis: 
Support interoperability among heterogeneous systems 
by defining all concepts precisely and unambiguously. 
2. Antithesis: 
"There are more things in heaven and earth, Horatio, 
Than are dreamt of in your philosophy." 
William Shakespeare 
3. Synthesis: 
Develop more flexible methods of knowledge acquisition 
by simulating the human cognitive cycle. 
Aristotle's Syllogisms
System of logic based on four sentence patterns: 
1. Universal affirmative.  Every employee is human. 
2. Particular affirmative.  Some employees are customers. 
3. Universal negative.  No employee is a competitor. 
4. Particular negative.  Some customers are not employees. 
Affirmative patterns for stating inheritance. 
Negative patterns for stating constraints. 
Description logics are based on Aristotle's syllogisms. 
Tree of Porphyry

