Hi Christopher,

On 21.04.2010 at 02:13, Christopher Mason wrote:
>
> I'm investigating libraries for rendering and extracting text from PDF.
> Across the half dozen I've looked at, both commercial and open source, I
> think pdfbox is the cleanest.

I agree. I've just extracted text from around 40,000 mathematical PDF files, and in my experience pdfbox is the best tool. There are also a few exceptions, though.

> However, I've run across a number of pdfs that pdfbox does not render
> properly. One I'm particularly concerned about is:
>
> http://www.cmason.com/tmp/Sowa.pdf

I only extract text and am not concerned with rendering. As far as I can see, the text I get is as good as it can be, so I don't think there should be problems with the fonts or glyphs; see the excerpt below. But I have to agree that org.apache.pdfbox.PDFToImage doesn't give me anything useful either (actually one very long image consisting of all the pages of the document, with errors like those in the image you mentioned).

Cheers
Thomas

>
> It looks to have encoding or char -> glyph issues in pdfbox, but look okay in
> every other reader/library I've tried. I've tried with both pdfbox-1.1.0 and
> with the trunk. Here's how it looks in pdfbox trunk versus Preview:
>
> http://www.cmason.com/tmp/Sowa.png
>
> Any help or suggestions would be most appreciated.
>
> -c
>
>
>
> java -cp
> ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar:pdfbox-1.1.0.jar:fontbox-1.1.0.jar
> org.apache.pdfbox.PDFToImage -color rgba -startPage 1 -endPage 1 -resolution
> 100 -imageType png -outputPrefix Sowa ~/Sites/docs/Sowa.pdf
>
>

Beginning of text:

The Challenge Of Knowledge Soup
John F. Sowa
26 August 2004
PerMIS 2004 Workshop at NIST
Gaithersburg, Maryland

Outline of This Talk
1. Thesis: Support interoperability among heterogeneous systems by defining all concepts precisely and unambiguously.
2. Antithesis: "There are more things in heaven and earth, Horatio, Than are dreamt of in your philosophy." William Shakespeare
3. Synthesis: Develop more flexible methods of knowledge acquisition by simulating the human cognitive cycle.

Aristotle's Syllogisms
System of logic based on four sentence patterns:
1. Universal affirmative. Every employee is human.
2. Particular affirmative. Some employees are customers.
3. Universal negative. No employee is a competitor.
4. Particular negative. Some customers are not employees.
Affirmative patterns for stating inheritance.
Negative patterns for stating constraints.
Description logics are based on Aristotle's syllogisms.

Tree of Porphyry

