Re: How do I get the encoding of a PDF File and set the encoding of the input stream??

Michael Jeier Wed, 13 Jul 2011 09:37:51 -0700

Hi,

I looked at the fonts in Adobe Reader:


IDRGagrotesc
    Type: Type 1
    Encoding: Ansi
    Actual Font: Adobe Sans MM
    Actual Font Type: Type 1

IDRGagrotesc
    Type: Type 1
    Encoding: Roman
    Actual Font: Adobe Sans MM
    Actual Font Type: Type 1

TimesAcapitals (Embedded Subset)
    Type: Type 1
    Encoding: Custom

TimesAcursivNormal (Embedded Subset)
    Type: Type 1
    Encoding: Custom

TimesAfoneticaNormal (Embedded Subset)
    Type: Type 1
    Encoding: Custom

TimesAgrass (Embedded Subset)
    Type: Type 1
    Encoding: Custom

TimesAngrec (Embedded Subset)
    Type: Type 1
    Encoding: Custom

TimesAstabil (Embedded Subset)
    Type: Type 1
    Encoding: Custom

So, I guess, custom encoding means I am screwed? :(
But how can the Adobe Reader display the characters correctly? Shouldn't
that be reflected somehow in the PDFBox API??
Where in the code is the encoding handled? If someone could point me in that
direction I can maybe just add a workaround there. Feeling a bit lost
here... :/

Thanks for helping!

Regards, Robin
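
On the question of how Adobe Reader can display text that cannot be
extracted: a viewer only has to draw the glyph outlines embedded in the font,
while text extraction also has to map every character code to Unicode. The
toy sketch below is plain Java and does not use PDFBox. The class
FontEncodingSketch, its tables, and the glyph names g1/g2 are hypothetical,
chosen only to illustrate why a font listed with "Custom" encoding may or may
not be extractable.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-font text decoding. Illustration only, not PDFBox code. */
class FontEncodingSketch {

    /** Tiny excerpt of a glyph-name-to-Unicode table (standard glyph names). */
    static final Map<String, String> GLYPH_TO_UNICODE = new HashMap<>();
    static {
        GLYPH_TO_UNICODE.put("A", "A");
        GLYPH_TO_UNICODE.put("B", "B");
        GLYPH_TO_UNICODE.put("alpha", "\u03B1");
        GLYPH_TO_UNICODE.put("beta", "\u03B2");
    }

    /** Font with a standard encoding: code 0x41 is glyph "A", 0x42 is "B". */
    static final Map<Integer, String> STANDARD = table("A", "B");

    /** Custom encoding that reuses the same codes but with known glyph names. */
    static final Map<Integer, String> CUSTOM_NAMED = table("alpha", "beta");

    /** Custom encoding with made-up glyph names (hypothetical "g1", "g2"). */
    static final Map<Integer, String> CUSTOM_UNNAMED = table("g1", "g2");

    /** Builds a code-to-glyph-name table for the two codes 0x41 and 0x42. */
    static Map<Integer, String> table(String nameFor41, String nameFor42) {
        Map<Integer, String> t = new HashMap<>();
        t.put(0x41, nameFor41);
        t.put(0x42, nameFor42);
        return t;
    }

    /** Decodes character codes with one font's table; unknown glyphs become '?'. */
    static String decode(byte[] codes, Map<Integer, String> codeToGlyphName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            String glyphName = codeToGlyphName.get(b & 0xFF);
            String unicode = glyphName == null ? null : GLYPH_TO_UNICODE.get(glyphName);
            sb.append(unicode != null ? unicode : "?");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] codes = {0x41, 0x42};                        // same bytes for every font
        System.out.println(decode(codes, STANDARD));       // the Latin letters A and B
        System.out.println(decode(codes, CUSTOM_NAMED));   // Greek alpha and beta
        System.out.println(decode(codes, CUSTOM_UNNAMED)); // two '?' placeholders
    }
}
```

The same two bytes give three different results depending on the font's
table, which is why no single "input encoding" can be set for the whole file.
A font may also carry a ToUnicode map that an extractor can use instead of
glyph names; when neither is usable, the codes cannot be mapped reliably.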

On Sun, Jul 10, 2011 at 4:42 PM, Andreas Lehmkuehler <[email protected]> wrote:

> Hi,
>
> On 10.07.2011 16:08, Michael Jeier wrote:
>
>  Hi,
>>
>> well, though it's a research project, I am not allowed to publish the
>> original PDF file. :(
>> Maybe the outcome will be published though. But this is beyond my control.
>>
> Check the embedded fonts using a pdf reader, e.g. acrobat reader. Load the
> pdf and go to the document properties. Most of the readers provide a list of
> the used fonts including their encoding.
>
>
>> There must be a way to set the encoding of the input stream and to find
>> out the encoding of the PDF!! Somebody please push me in the right
>> direction.
>>
> There must be a misunderstanding. A pdf doesn't have an encoding like a
> simple text file; it is a binary file. The font encodings are used to
> convert the character codes in the content into readable text. Sometimes
> pdfs don't provide the information needed to do that conversion.
>
>
>  Regards, Michael
>>
>> On Sat, Jul 9, 2011 at 7:57 PM, Andreas Lehmkuehler <[email protected]> wrote:
>>
>>  Hi,
>>>
>>> On 09.07.2011 15:38, Michael Jeier wrote:
>>>
>>>  Hi,
>>>
>>>>
>>>> I am trying to export a PDF with the help of the PDFTextStripper class.
>>>> Unfortunately, I have some character encoding issues.
>>>> 1.) How do I get the encoding information of the PDF file? Or is that
>>>> information tied to the individual fonts??
>>>>
>>> Each font has its own encoding, and the fonts of a pdf don't have to
>>> share the same one.
>>>
>>>
>>>> 2.) How do I set the encoding of the input stream PDFTextStripper is
>>>> using?
>>>>
>>> Nowhere, it doesn't make sense to change that.
>>>
>>>
>>>  Here's my code snippet:
>>>
>>>>
>>>> Writer output = null;
>>>>         PDDocument document = null;
>>>>         URL url = new URL( "file:\\mypdf.pdf" );
>>>>         document = PDDocument.load(url, true);
>>>>
>>>>         output = new OutputStreamWriter(
>>>>                 new FileOutputStream( "C:\\out2.txt" ), "ISO-8859-1" ); // set encoding
>>>>
>>>>         PDFTextStripper stripper = null;
>>>>         stripper = new PDFTextStripper("ISO-8859-1"); // set encoding
>>>>         stripper.setStartPage( 1 );
>>>>         stripper.setEndPage( 2 );
>>>>         stripper.writeText( document, output );
>>>>
>>>>         output.close();
>>>>         document.close();
>>>>
>>>> I tried all kinds of encodings (UTF-8 and all possible ISO-8859-X) at the
>>>> "set encoding" comments in the code, but nothing works. It seems that you
>>>> can only configure the encoding of the output.
>>>>
>>>>  That's exactly the expected behaviour.
>>>
>>>
>>>> Maybe someone can point me to the class and line in PDFBox which contains
>>>> the actual opening of the input stream.
>>>>
>>>> Any help would be highly appreciated! THANKS!
>>>>
>>> I guess the problem is the pdf itself. It probably uses some unsupported
>>> font encodings, or the text can't be extracted because of the encoding.
>>>
>>> But without the pdf in question everything is just a guess.
>>>
>>>  Regards, Michael
>>>
>>>>
>>>>
>>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>>
>>
> BR
> Andreas Lehmkühler
>
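
To illustrate the point made in the quoted replies that the charset given to
the writer only affects the output: the sketch below is plain Java with no
PDFBox involved, and the class name OutputEncodingSketch and the sample
strings are made up. It assumes the extracted text has already been decoded
into a Java String, which is the stage where the font encodings matter.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Shows that the writer's charset changes the bytes written, not the text. */
class OutputEncodingSketch {

    /** Writes already-decoded text through an OutputStreamWriter and returns the bytes. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String good = "\u00E4"; // a-umlaut that was decoded correctly
        String bad = "??";      // stand-in for text that was decoded wrongly

        // The same character is stored as a different number of bytes per charset.
        System.out.println(encode(good, StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(encode(good, StandardCharsets.UTF_8).length);      // 2

        // Wrongly decoded text stays wrong whichever output charset is chosen.
        byte[] utf8 = encode(bad, StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(bad)); // true
    }
}
```

Changing the output charset therefore cannot repair characters that were
already mapped incorrectly before the text reached the writer.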
