Hi,
Sorry for the late reply.
On 13.07.2011 at 18:37, Michael Jeier wrote:
Hi,
I looked at the fonts in Adobe Reader:

IDRGagrotesc
    Type: Type 1
    Encoding: Ansi
    Actual Font: Adobe Sans MM
    Actual Font Type: Type 1
IDRGagrotesc
    Type: Type 1
    Encoding: Roman
    Actual Font: Adobe Sans MM
    Actual Font Type: Type 1
TimesAcapitals (Embedded Subset)
    Type: Type 1
    Encoding: Custom
TimesAcursivNormal (Embedded Subset)
    Type: Type 1
    Encoding: Custom
TimesAfoneticaNormal (Embedded Subset)
    Type: Type 1
    Encoding: Custom
TimesAgrass (Embedded Subset)
    Type: Type 1
    Encoding: Custom
TimesAngrec (Embedded Subset)
    Type: Type 1
    Encoding: Custom
TimesAstabil (Embedded Subset)
    Type: Type 1
    Encoding: Custom

So, I guess, custom encoding means I am screwed? :(
I'm sorry, but yes.
But how can the Adobe Reader display the characters correctly? Shouldn't
that be reflected somehow in the PDFBox API??
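The characters are stored as glyphs (small pieces of graphics). A viewer only
needs those glyphs to draw the page, which is why Adobe Reader displays the
text correctly. In many cases a readable mapping is used to address the glyphs,
so the character codes can be used to extract the text. But in some cases a PDF
uses a custom mapping that cannot be translated back into readable characters.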
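
To make the difference concrete, here is a toy model in plain Java. It is not
PDFBox code and it simplifies how a PDF font works: the class name, the
decode() helper, and the glyph names G01 to G03 are made up for illustration.
A font's encoding maps each character code to a glyph name. Drawing only needs
the glyph, but text extraction also needs to know which character a glyph name
stands for, and that only works if the name is a well-known one.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not PDFBox): character code -> glyph name -> Unicode character. */
final class GlyphMappingSketch {

    // Tiny excerpt of well-known glyph names and the characters they stand for.
    private static final Map<String, Character> KNOWN_GLYPH_NAMES = new HashMap<>();
    static {
        KNOWN_GLYPH_NAMES.put("H", 'H');
        KNOWN_GLYPH_NAMES.put("i", 'i');
        KNOWN_GLYPH_NAMES.put("exclam", '!');
    }

    /**
     * Translates character codes to text using the font's encoding.
     * Codes whose glyph name is unknown become U+FFFD (replacement character).
     */
    static String decode(int[] codes, Map<Integer, String> encoding) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String glyphName = encoding.get(code);
            Character ch = (glyphName == null) ? null : KNOWN_GLYPH_NAMES.get(glyphName);
            text.append(ch != null ? ch.charValue() : '\uFFFD');
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int[] codes = {72, 105, 33};

        // Readable mapping: codes point at well-known glyph names.
        Map<Integer, String> readable = new HashMap<>();
        readable.put(72, "H");
        readable.put(105, "i");
        readable.put(33, "exclam");

        // Custom mapping: the same glyphs exist, but under arbitrary
        // (hypothetical) names that say nothing about the character.
        Map<Integer, String> custom = new HashMap<>();
        custom.put(72, "G01");
        custom.put(105, "G02");
        custom.put(33, "G03");

        System.out.println(decode(codes, readable)); // the text comes back
        System.out.println(decode(codes, custom));   // only replacement characters
    }
}
```

Both fonts would render identically on screen, because the renderer just looks
up the glyph by whatever name the font uses. Only the extraction step fails.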
Where in the code is the encoding handled? If someone could point me in that
direction I can maybe just add a workaround
there. Feeling a bit lost here... :/
I guess there is no workaround. Just do the ultimate test: open the PDF in
question in Adobe Reader, select the text, then copy and paste it into an
editor. If the pasted text is readable, PDFBox should be able to extract it too.
But if it is unreadable, you won't find any way to extract the text directly.
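
If you have many files to check, a rough automated version of that test could
look like the sketch below. It is plain Java, not part of PDFBox; the class
name, the method names, and the 10% threshold are my own assumptions. It only
catches the obvious cases (control characters, private-use code points,
replacement characters). Text that comes out as wrong but printable letters
will still pass, so looking at the pasted text yourself remains the real test.

```java
/** Heuristic sketch: does a string of extracted text look like real text? */
final class ReadabilityCheckSketch {

    // Assumed threshold: more than 10% suspicious code points counts as unreadable.
    private static final double MAX_SUSPICIOUS_RATIO = 0.10;

    /** Returns the share of code points that are unlikely to occur in normal text. */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;

            // Control characters other than whitespace such as tab or newline.
            boolean control = Character.isISOControl(cp) && !Character.isWhitespace(cp);
            // Private-use code points often show up when a font has no usable mapping.
            boolean privateUse = Character.getType(cp) == Character.PRIVATE_USE;
            boolean unassigned = !Character.isDefined(cp);

            if (cp == 0xFFFD || control || privateUse || unassigned) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    static boolean looksReadable(String text) {
        return suspiciousRatio(text) <= MAX_SUSPICIOUS_RATIO;
    }

    public static void main(String[] args) {
        // Pass in whatever text you pasted from the viewer or got from an extractor.
        String sample = args.length > 0 ? args[0] : "Hello world";
        System.out.println(looksReadable(sample) ? "looks readable" : "looks unreadable");
    }
}
```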
Thanks for helping!
Regards, Robin
SNIP
BR
Andreas Lehmkühler