Hi,
On 23.07.2011 19:36, Franklin Antony wrote:
Hi Andreas,
Isn't there any kind of hack that can be done to get this working?
If I knew such a hack I would have already shared it with the project.
BR
Andreas Lehmkühler
Regards,
Franklin
On Sat, Jul 23, 2011 at 7:48 PM, Andreas Lehmkuehler <[email protected]> wrote:
Hi,
I'm sorry for the late answer ...
On 13.07.2011 18:37, Michael Jeier wrote:
Hi,
I looked at the fonts in Adobe Reader:
IDRGagrotesc
Type: Type 1
Encoding: Ansi
Actual Font: Adobe Sans MM
Actual Font Type: Type 1
IDRGagrotesc
Type: Type 1
Encoding: Roman
Actual Font: Adobe Sans MM
Actual Font Type: Type 1
TimesAcapitals (Embedded Subset)
Type: Type 1
Encoding: Custom
TimesAcursivNormal (Embedded Subset)
Type: Type 1
Encoding: Custom
TimesAfoneticaNormal (Embedded Subset)
Type: Type 1
Encoding: Custom
TimesAgrass (Embedded Subset)
Type: Type 1
Encoding: Custom
TimesAngrec (Embedded Subset)
Type: Type 1
Encoding: Custom
TimesAstabil (Embedded Subset)
Type: Type 1
Encoding: Custom
So, I guess, custom encoding means I am screwed? :(
I'm sorry but yes.
But how can the Adobe Reader display the characters correctly? Shouldn't
that be reflected somehow in the PDFBox API??
The characters are stored as glyphs (small pieces of graphics). In many
cases readable mappings are used to address those glyphs, so the character
code can be used to extract the text. But in some cases the PDF uses a
custom mapping which isn't readable.
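To illustrate the difference, here is a minimal sketch in plain Java. It does not use the PDFBox API; the class name, the tiny glyph-name table, and the glyph names "g1" to "g3" are made up for the example. The same character codes decode to text when the encoding uses well-known glyph names, and to nothing useful when the names are private to the font:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only, not PDFBox code.
class GlyphMappingSketch {

    // Stand-in for a glyph-name-to-Unicode table (assumption: only three entries).
    static final Map<String, String> KNOWN_GLYPH_NAMES = new HashMap<>();
    static {
        KNOWN_GLYPH_NAMES.put("H", "H");
        KNOWN_GLYPH_NAMES.put("i", "i");
        KNOWN_GLYPH_NAMES.put("exclam", "!");
    }

    // An encoding that maps character codes to well-known glyph names.
    static Map<Integer, String> readableEncoding() {
        Map<Integer, String> enc = new HashMap<>();
        enc.put(72, "H");
        enc.put(105, "i");
        enc.put(33, "exclam");
        return enc;
    }

    // A custom encoding: same codes, but glyph names that mean nothing outside the font.
    static Map<Integer, String> customEncoding() {
        Map<Integer, String> enc = new HashMap<>();
        enc.put(72, "g1");
        enc.put(105, "g2");
        enc.put(33, "g3");
        return enc;
    }

    // Translate codes to text; codes without a known glyph name become '?'.
    static String decode(int[] codes, Map<Integer, String> encoding) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String glyphName = encoding.get(code);
            String text = glyphName == null ? null : KNOWN_GLYPH_NAMES.get(glyphName);
            sb.append(text != null ? text : "?");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {72, 105, 33};
        System.out.println(decode(codes, readableEncoding())); // prints "Hi!"
        System.out.println(decode(codes, customEncoding()));   // prints "???"
    }
}
```

A viewer can still draw the page in the second case, because drawing only needs the glyph outlines, not their meaning.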
Where in the code is the encoding handled? If someone could point me in
that direction I can maybe just add a workaround there. Feeling a bit lost
here... :/
I guess there is no workaround. Just do the ultimate test: open the PDF in
question in Adobe Reader, select the text, then copy and paste it into an
editor. If the text is readable, PDFBox should be able to extract it too.
But if it is unreadable, you won't find any way to extract the text
directly.
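If you have many files and don't want to check each one by hand, a rough heuristic over the pasted or extracted text might help. The sketch below is plain Java and not part of PDFBox; the class name, the punctuation list, and the 0.9 threshold are assumptions picked for the example. It only catches output made of control characters or replacement symbols, not garbage that happens to consist of ordinary letters:

```java
// Hypothetical helper, not part of PDFBox.
class ReadabilityCheck {

    // Punctuation counted as plausible text (assumption: adjust for your documents).
    static final String PUNCTUATION = ".,;:!?'\"()-";

    // Share of code points that are letters, digits, whitespace or common punctuation.
    static double plausibleRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int plausible = 0;
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (Character.isLetterOrDigit(cp)
                    || Character.isWhitespace(cp)
                    || PUNCTUATION.indexOf(cp) >= 0) {
                plausible++;
            }
        }
        return (double) plausible / total;
    }

    // Threshold of 0.9 is an arbitrary starting point, not a tested value.
    static boolean looksReadable(String text) {
        return plausibleRatio(text) >= 0.9;
    }

    public static void main(String[] args) {
        System.out.println(looksReadable("Hello, world!"));
        System.out.println(looksReadable("\u0001\u0002\u0003\uFFFD\uFFFD"));
    }
}
```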
Thanks for helping!
Regards, Robin
SNIP
BR
Andreas Lehmkühler