I'm investigating libraries for rendering PDFs and extracting text from
them. Of the half dozen I've looked at, both commercial and open source,
I think pdfbox is the cleanest.
However, I've run across a number of PDFs that pdfbox does not render
properly. One I'm particularly concerned about is:
http://www.cmason.com/tmp/Sowa.pdf
It appears to have encoding or char -> glyph issues in pdfbox, but looks
okay in every other reader/library I've tried. I've tried both
pdfbox-1.1.0 and the trunk. Here's how it looks in pdfbox trunk
versus Preview:
http://www.cmason.com/tmp/Sowa.png
Any help or suggestions would be most appreciated.
-c
java -cp ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar:pdfbox-1.1.0.jar:fontbox-1.1.0.jar \
  org.apache.pdfbox.PDFToImage -color rgba -startPage 1 -endPage 1 \
  -resolution 100 -imageType png -outputPrefix Sowa ~/Sites/docs/Sowa.pdf