I'm investigating libraries for rendering and extracting text from PDF. Across the half dozen I've looked at, both commercial and open source, I think pdfbox is the cleanest.

However, I've run across a number of PDFs that pdfbox does not render properly. One I'm particularly concerned about is:

http://www.cmason.com/tmp/Sowa.pdf

It appears to have encoding or char -> glyph mapping issues in pdfbox, but looks okay in every other reader/library I've tried. I've tried with both pdfbox-1.1.0 and with the trunk. Here's how it looks in pdfbox trunk versus Preview:

http://www.cmason.com/tmp/Sowa.png

Any help or suggestions would be most appreciated.

-c



For reference, this is the command I ran with pdfbox-1.1.0 to render the first page:

java -cp ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar:pdfbox-1.1.0.jar:fontbox-1.1.0.jar org.apache.pdfbox.PDFToImage -color rgba -startPage 1 -endPage 1 -resolution 100 -imageType png -outputPrefix Sowa ~/Sites/docs/Sowa.pdf

