Hi,

I'm sorry for the late reply ...

On 13.07.2011 18:37, Michael Jeier wrote:
Hi,

I looked at the fonts in Adobe Reader:

IDRGagrotesc
     Type: Type 1
     Encoding: Ansi
     Actual Font: Adobe Sans MM
     Actual Font Type: Type 1

IDRGagrotesc
     Type: Type 1
     Encoding: Roman
     Actual Font: Adobe Sans MM
     Actual Font Type: Type 1

TimesAcapitals (Embedded Subset)
     Type: Type 1
     Encoding: Custom

TimesAcursivNormal (Embedded Subset)
     Type: Type 1
     Encoding: Custom

TimesAfoneticaNormal (Embedded Subset)
     Type: Type 1
     Encoding: Custom

TimesAgrass (Embedded Subset)
     Type: Type 1
     Encoding: Custom

TimesAngrec (Embedded Subset)
     Type: Type 1
     Encoding: Custom

TimesAstabil (Embedded Subset)
     Type: Type 1
     Encoding: Custom

So, I guess, custom encoding means I am screwed? :(
I'm sorry, but yes.

But how can the Adobe Reader display the characters correctly? Shouldn't
that be reflected somehow in the PDFBox API??
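The characters are stored as glyphs (small pieces of graphics), and drawing
those glyphs is all Adobe Reader needs to display the text. In many cases a
readable mapping is used to address the glyphs, so the character codes can be
used to extract the text. In some cases, however, the PDF uses a custom mapping
that isn't readable, and then the codes can't be translated back into text.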
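
To illustrate the difference, here is a toy Java sketch. It is not how PDFBox
handles encodings internally; the class GlyphMappingSketch, the decode method
and the code values are all made up for this example. It only assumes that a
"readable" mapping means the character codes agree with a known table, while a
custom mapping gives codes that have no table to look them up in.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model, not PDFBox code.
class GlyphMappingSketch {

    /** Translates character codes to text; codes without a mapping become '?'. */
    public static String decode(int[] codes, Map<Integer, Character> codeToUnicode) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            Character ch = codeToUnicode.get(code);
            text.append(ch != null ? ch : '?');
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Readable mapping: the codes match a well-known table (ASCII here).
        Map<Integer, Character> readable = new HashMap<>();
        for (int c = 32; c < 127; c++) {
            readable.put(c, (char) c);
        }

        // Custom mapping (assumption for this sketch): the glyphs are simply
        // numbered 1, 2, 3, ... and no table back to characters is available.
        Map<Integer, Character> custom = new HashMap<>();

        int[] standardCodes = {80, 68, 70}; // 'P', 'D', 'F' in ASCII
        int[] customCodes = {1, 2, 3};      // same three glyphs, arbitrary codes

        System.out.println(decode(standardCodes, readable)); // prints "PDF"
        System.out.println(decode(customCodes, custom));     // prints "???"
    }
}
```

In both cases a viewer can still draw the three glyphs, because drawing only
needs the glyph for each code, not the character behind it.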

Where in the code is the encoding handled? If someone could point me in that
direction I can maybe just add a workaround
there. Feeling a bit lost here... :/
I guess there is no workaround. Just do the ultimate test: open the PDF in
question in Adobe Reader, select the text, then copy and paste it into an
editor. If the pasted text is readable, PDFBox should be able to extract it too.
But if it is unreadable, you won't find any way to extract the text directly.

Thanks for helping!

Regards, Robin
SNIP

BR
Andreas Lehmkühler
