Hi Zachary,

On 10.03.2011 at 23:58, Zachary Mitchell wrote:

> Does this mean it is impossible to use pdfBox to read a pdf file?
> 
> 1. You get text like "G38G43G36G51G5" instead of what you expect when you are
> extracting text. This is because the characters are a meaningless internal
> encoding that point to glyphs that are embedded in the PDF document. The
> only way to access the text is to use OCR. This may be a future
> enhancement."


No, but it does mean that there are PDF files whose text cannot be extracted
with pdfBox (and most likely not with any other PDF reader either, e.g. by
using a copy command). Such a PDF file refers only to embedded glyphs (images
of characters). When rendered they appear as characters, but without OCR they
cannot be recognised as characters, because there is no reference to any
standard character set.
Having worked with thousands of PDF files, I have come across this phenomenon,
though only rarely, and more often with older PDF files than with newer ones.
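
If you want to spot affected files automatically, a rough check on the
extracted text could look like the sketch below. It is only a heuristic and
has nothing to do with the pdfBox API: it assumes the output resembles the
"G38G43G36G51G5" pattern from your quote, and the function names and the 0.8
threshold are made up for illustration.

```python
import re

# Assumption: affected output is made of tokens like "G38", as in the quoted
# example. Other files may use a different pattern.
GLYPH_TOKEN = re.compile(r"G\d{1,3}")


def glyph_code_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters inside G<digits> tokens."""
    total = sum(1 for ch in text if not ch.isspace())
    if total == 0:
        return 0.0
    covered = sum(len(m.group()) for m in GLYPH_TOKEN.finditer(text))
    return covered / total


def looks_like_glyph_codes(text: str, threshold: float = 0.8) -> bool:
    """Guess whether extracted text is mostly glyph codes (hypothetical threshold)."""
    return glyph_code_ratio(text) >= threshold
```

Files flagged this way would then be candidates for OCR instead of plain
text extraction.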

Cheers
Thomas
