Re: How do I get the encoding of a PDF File and set the encoding of the input stream??

Andreas Lehmkuehler Sun, 24 Jul 2011 05:53:23 -0700

Hi,

On 23.07.2011 19:36, Franklin Antony wrote:

Hi Andreas,
   Isn't there any kind of hack that could get this working?

If I knew of such a hack, I would have already shared it with the project.


BR
Andreas Lehmkühler

Regards,
Franklin

On Sat, Jul 23, 2011 at 7:48 PM, Andreas Lehmkuehler <[email protected]> wrote:

Hi,

I'm sorry for the late answer ...

On 13.07.2011 18:37, Michael Jeier wrote:

Hi,

I looked at the fonts in Adobe Reader:

IDRGagrotesc
     Type: Type 1
     Encoding: Ansi
     Actual Font: Adobe Sans MM
     Actual Font Type: Type 1

IDRGagrotesc
     Type: Type 1
     Encoding: Roman
     Actual Font: Adobe Sans MM
     Actual Font Type: Type 1

TimesAcapitals (Embedded Subset)
     Type: Type 1
     Encoding: Custom

TimesAcursivNormal (Embedded Subset)
     Type: Type 1
     Encoding: Custom

TimesAfoneticaNormal (Embedded Subset)
     Type: Type 1
     Encoding: Custom

TimesAgrass (Embedded Subset)
     Type: Type 1
     Encoding: Custom

TimesAngrec (Embedded Subset)
     Type: Type 1
     Encoding: Custom

TimesAstabil (Embedded Subset)
     Type: Type 1
     Encoding: Custom

So, I guess, custom encoding means I am screwed? :(

I'm sorry but yes.

  But how can Adobe Reader display the characters correctly? Shouldn't
that be reflected somehow in the PDFBox API?

The characters are stored as glyphs (small pieces of graphics). In many
cases readable mappings are used to address those glyphs, so that the
character code can be used to extract the text. But in some cases the PDF
uses a custom mapping which isn't readable.
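
To illustrate the difference between a readable and a custom mapping, here is a minimal Java sketch. It does not use PDFBox at all; the class name `GlyphMappingSketch`, the `decode` helper and the sample code tables are hypothetical and only meant to show the idea. A viewer can still draw the glyph for each code, but without a table from codes back to characters there is nothing to extract.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphMappingSketch {

    // Turn raw character codes into text using a code -> character table.
    // Codes without a known mapping become U+FFFD (the replacement character).
    static String decode(int[] codes, Map<Integer, String> codeToText) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(codeToText.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Readable mapping (assumed sample): the codes match well-known character codes.
        Map<Integer, String> readable = new HashMap<>();
        readable.put(72, "H");
        readable.put(105, "i");

        // Custom mapping (assumed sample): the embedded subset numbers its glyphs
        // arbitrarily and provides no table back to characters.
        Map<Integer, String> custom = new HashMap<>();

        System.out.println(decode(new int[] {72, 105}, readable)); // prints "Hi"
        System.out.println(decode(new int[] {1, 2}, custom));      // two replacement characters
    }
}
```

In both cases the glyphs themselves can be rendered, which is why the page looks fine on screen even when the extracted text is garbage.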

  Where in the code is the encoding handled? If someone could point me in
that direction I can maybe just add a workaround there. Feeling a bit
lost here... :/

I guess there is no workaround. Just do the ultimate test: open the PDF in
question in Acrobat Reader, select the text, then copy and paste it into an
editor. If the text is readable, PDFBox should be able to extract it too.
But if it is unreadable, you won't find any way to extract the text
directly.
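
If you want to run a rough version of that check over many files, something like the following Java sketch might help. It is only a heuristic under stated assumptions: the class `ReadabilityCheck`, its method names and the 0.9 threshold are hypothetical, and the input is whatever text you pasted from the viewer or got from your extraction step. It only catches obvious garbage (control characters, private-use code points, replacement characters); a custom encoding can also produce ordinary-looking but wrong letters, so looking at the pasted text yourself remains the real test.

```java
public class ReadabilityCheck {

    // Fraction of code points that look like ordinary text.
    static double readableRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int readable = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            // Treat replacement, private-use, unassigned, lone surrogate and
            // non-whitespace control characters as signs of a failed mapping.
            boolean suspicious = cp == 0xFFFD
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE
                    || (type == Character.CONTROL && !Character.isWhitespace(cp));
            if (!suspicious) {
                readable++;
            }
        }
        return (double) readable / total;
    }

    // The 0.9 threshold is an arbitrary assumption; tune it for your documents.
    static boolean looksReadable(String text) {
        return readableRatio(text) >= 0.9;
    }

    public static void main(String[] args) {
        System.out.println(looksReadable("Hello, world!\n"));
        System.out.println(looksReadable("\u0001\u0002\uFFFD\uE000"));
    }
}
```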

  Thanks for helping!


Regards, Robin
SNIP


BR
Andreas Lehmkühler
