Hi Zachary,

On 10.03.2011 at 23:58, Zachary Mitchell wrote:
> Does this mean it is impossible to use pdfBox to read a pdf file?
>
> 1. You get text like "G38G43G36G51G5" instead of what you expect when you are
> extracting text. This is because the characters are a meaningless internal
> encoding that point to glyphs that are embedded in the PDF document. The
> only way to access the text is to use OCR. This may be a future
> enhancement."

No, but it means that there are PDF files that cannot be read using pdfBox (and most likely not by any other PDF reader either, i.e. you cannot extract the text, e.g. with a copy command). The information in such PDF files refers to embedded glyphs (images of characters) which appear as characters when rendered, but which cannot be recognised as characters without OCR, because there is no reference to any standard character set.

Across thousands of PDF files I have come across this phenomenon, though rather rarely, and more often with older PDF files than with newer ones.

Cheers
Thomas
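
P.S. If you want to flag affected files automatically, a rough heuristic on the extracted text may help. The following is only a minimal sketch in Python, not part of pdfBox: it assumes the garbage always looks like the sample above (runs of "G" followed by digits), which will not hold for every file. The names GLYPH_CODE_RUN, glyph_code_ratio and looks_like_glyph_codes, and the 0.5 threshold, are made up for illustration.

```python
import re

# Assumption: unmapped glyphs show up as runs of "G" + 1-3 digits,
# as in the sample "G38G43G36G51G5". Other files may use other patterns.
GLYPH_CODE_RUN = re.compile(r"(?:G\d{1,3}){3,}")


def glyph_code_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that sit inside glyph-code runs."""
    stripped = "".join(text.split())  # drop all whitespace
    if not stripped:
        return 0.0
    covered = sum(len(m.group(0)) for m in GLYPH_CODE_RUN.finditer(stripped))
    return covered / len(stripped)


def looks_like_glyph_codes(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: True if most of the extracted text is glyph-code garbage."""
    return glyph_code_ratio(text) >= threshold


if __name__ == "__main__":
    for sample in ("G38G43G36G51G5", "This is ordinary extracted text."):
        print(repr(sample), looks_like_glyph_codes(sample))
```

Files flagged this way would still need OCR to get at the actual text; the check only tells you that plain extraction did not work.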

