Re: How do I get the encoding of a PDF File and set the encoding of the input stream??

Franklin Antony Sun, 24 Jul 2011 05:55:04 -0700

Ok no worries.

Thanks,
Franklin


On Sun, Jul 24, 2011 at 4:52 PM, Andreas Lehmkuehler <[email protected]> wrote:

> Hi,
>
> On 23.07.2011 19:36, Franklin Antony wrote:
>
>> Hi Andreas,
>> Isn't there any kind of hack that can be done to get this working?
>>
> If I knew such a hack, I would have already shared it with the project.
>
> BR
> Andreas Lehmkühler
>
>
>
>> Regards,
>> Franklin
>>
>> On Sat, Jul 23, 2011 at 7:48 PM, Andreas Lehmkuehler <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'm sorry for the late answer ...
>>>
>>> On 13.07.2011 18:37, Michael Jeier wrote:
>>>
>>>> Hi,
>>>>
>>>> I looked at the fonts in Adobe Reader:
>>>>
>>>> IDRGagrotesc
>>>>     Type: Type 1
>>>>     Encoding: Ansi
>>>>     Actual Font: Adobe Sans MM
>>>>     Actual Font Type: Type 1
>>>>
>>>> IDRGagrotesc
>>>>     Type: Type 1
>>>>     Encoding: Roman
>>>>     Actual Font: Adobe Sans MM
>>>>     Actual Font Type: Type 1
>>>>
>>>> TimesAcapitals (Embedded Subset)
>>>>     Type: Type 1
>>>>     Encoding: Custom
>>>>
>>>> TimesAcursivNormal (Embedded Subset)
>>>>     Type: Type 1
>>>>     Encoding: Custom
>>>>
>>>> TimesAfoneticaNormal (Embedded Subset)
>>>>     Type: Type 1
>>>>     Encoding: Custom
>>>>
>>>> TimesAgrass (Embedded Subset)
>>>>     Type: Type 1
>>>>     Encoding: Custom
>>>>
>>>> TimesAngrec (Embedded Subset)
>>>>     Type: Type 1
>>>>     Encoding: Custom
>>>>
>>>> TimesAstabil (Embedded Subset)
>>>>     Type: Type 1
>>>>     Encoding: Custom
>>>>
>>>> So, I guess, custom encoding means I am screwed? :(
>>>>
>>> I'm sorry but yes.
>>>
>>>> But how can the Adobe Reader display the characters correctly? Shouldn't
>>>> that be reflected somehow in the PDFBox API??
>>>>
>>> The characters are stored as glyphs (small pieces of graphics). In many
>>> cases, readable mappings are used to address those glyphs, so that the
>>> character code can be used to extract the text. But in some cases the PDF
>>> uses a custom mapping which isn't readable.
>>>
>>>> Where in the code is the encoding handled? If someone could point me in
>>>> that direction I can maybe just add a workaround
>>>> there. Feeling a bit lost here... :/
>>>>
>>> I guess there is no workaround. Just do the ultimate test: open the PDF
>>> in question using the Acrobat Reader, select the text, then copy and paste
>>> it into an editor. If the text is readable, PDFBox should be able to
>>> extract it too. But if it is unreadable, you won't find any way to extract
>>> the text directly.
>>>
>>>> Thanks for helping!
>>>>
>>>> Regards, Robin
>>>> SNIP
>>>>
>>>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>>
>>
>
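
To make the point above about readable versus custom mappings a bit more concrete, here is a minimal Java sketch. It deliberately does not use the PDFBox API: the class CustomEncodingDemo, its methods, and the sample character codes are all hypothetical and only model the idea. It assumes a font whose codes coincide with ASCII (a "readable" mapping) and a subset font whose glyphs are just numbered 1, 2, ... with no information about which characters they represent. The readableRatio heuristic is likewise an assumption for illustration, a rough automated stand-in for the copy-and-paste test.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from PDFBox.
class CustomEncodingDemo {

    // A "readable" mapping: character codes 32..126 coincide with ASCII.
    static Map<Integer, String> asciiTable() {
        Map<Integer, String> table = new HashMap<>();
        for (int code = 32; code < 127; code++) {
            table.put(code, String.valueOf((char) code));
        }
        return table;
    }

    // Translate character codes to text. A code with no known Unicode
    // value becomes U+FFFD, the Unicode replacement character.
    static String decode(int[] codes, Map<Integer, String> codeToUnicode) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            text.append(codeToUnicode.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    // Crude heuristic (an assumption, not a standard measure): the share of
    // characters that are letters, digits, whitespace or common punctuation.
    static double readableRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int readable = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)
                    || ".,;:!?'\"()-".indexOf(c) >= 0) {
                readable++;
            }
        }
        return (double) readable / text.length();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = asciiTable();

        // Font with a readable encoding: the codes are the ASCII values of "Hi".
        int[] readableCodes = {72, 105};
        // Subset font with a custom encoding: the same two glyphs are simply
        // numbered 1 and 2, and nothing says which characters they stand for.
        int[] customCodes = {1, 2};

        String good = decode(readableCodes, table);
        String bad = decode(customCodes, table);

        System.out.println(good + " -> " + readableRatio(good));
        System.out.println(bad + " -> " + readableRatio(bad));
    }
}
```

A viewer can still draw both fonts correctly, because drawing only needs the glyph outlines. Text extraction additionally needs the code-to-character table, and in the custom case that table is simply not available.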
