Re: How do I get the encoding of a PDF File and set the encoding of the input stream??

Michael Jeier Sun, 10 Jul 2011 07:09:29 -0700

Hi,

well, though it's a research project, I am not allowed to publish the
original PDF file. :(
Maybe the outcome will be published though. But this is beyond my control.


There must be a way to set the encoding of the input stream and to find out
the encoding of the PDF!! Somebody please push me in the right direction.

Regards, Michael

On Sat, Jul 9, 2011 at 7:57 PM, Andreas Lehmkuehler <[email protected]> wrote:

> Hi,
>
> On 09.07.2011 15:38, Michael Jeier wrote:
>
>  Hi,
>>
>> I am trying to export a PDF with the help of the PDFTextStripper class.
>> Unfortunately, I have some character encoding issues.
>> 1.) How do I get the encoding information of the PDF file? Or is that
>> information tied to the individual fonts?
>>
> Each font has its own encoding, and the fonts of a PDF don't have to share
> the same one.
>
>
>  2.) How do I set the encoding of the input stream that PDFTextStripper is
>> using?
>>
> Nowhere; it doesn't make sense to change that.
>
>
>  Here's my code snippet:
>>
>> Writer output = null;
>>         PDDocument document = null;
>>         URL url = new URL( "file:\\mypdf.pdf" );
>>         document = PDDocument.load(url, true);
>>
>>         // set encoding
>>         output = new OutputStreamWriter(
>>                 new FileOutputStream( "C:\\out2.txt" ), "ISO-8859-1" );
>>
>>         PDFTextStripper stripper = null;
>>         stripper = new PDFTextStripper("ISO-8859-1"); // set encoding
>>         stripper.setStartPage( 1 );
>>         stripper.setEndPage( 2 );
>>         stripper.writeText( document, output );
>>
>>         output.close();
>>         document.close();
>>
>> I tried all kinds of encodings (UTF-8 and all possible ISO-8859-X) at the
>> "set encoding" comments in the code, but nothing works. It seems that you
>> can only configure the encoding of the output.
>>
> That's exactly the expected behaviour.
>
>
>  Maybe someone can
>> point me to the class and line in PDFBox which contains the actual opening
>> of the input stream.
>>
>> Any help would be highly appreciated! THANKS!
>>
> I guess the problem is the PDF itself. It probably uses some unsupported
> font encodings, or the text can't be extracted because of the encoding.
>
> But without the pdf in question everything is just a guess.
>
>  Regards, Michael
>>
>
>
> BR
> Andreas Lehmkühler
>
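
To illustrate Andreas's point that every font carries its own encoding, here
is a toy model in plain Java. It is not PDFBox code and makes no claim about
how PDFBox works internally: the two "fonts", their code tables, and the
decode() helper are all hypothetical. It only shows why a single input
encoding for the whole file would not make sense: the same character codes
give different text depending on which font's table is used, and a code with
no Unicode mapping cannot be recovered at all.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model, NOT PDFBox code: each "font" has its own code-to-Unicode table.
class PerFontEncodingSketch {

    // Hypothetical font whose codes follow a standard Latin-style layout.
    static Map<Integer, String> standardFont() {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x41, "A");
        table.put(0x42, "B");
        return table;
    }

    // Hypothetical subsetted font that reuses the same codes differently.
    static Map<Integer, String> customFont() {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x41, "\u00e4"); // a with diaeresis
        // Code 0x42 deliberately has no Unicode mapping in this font.
        return table;
    }

    // Maps each character code through the given font's table.
    // Codes without a mapping become U+FFFD (the replacement character).
    static String decode(int[] codes, Map<Integer, String> fontTable) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            text.append(fontTable.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int[] codes = {0x41, 0x42};
        System.out.println(decode(codes, standardFont()));
        System.out.println(decode(codes, customFont()));
    }
}
```

With the first table the two codes decode to "AB". With the second they
decode to an a-umlaut followed by the replacement character, and no choice of
output encoding can repair that afterwards.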

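On the "you can only configure the encoding of the output" observation: the
sketch below uses only the Java standard library (no PDFBox) to show what the
encoding passed to OutputStreamWriter actually controls. The class name and
the encode() helper are made up for illustration. The text is already a Java
String by the time it reaches the writer, so the encoding only decides which
bytes end up in the file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; shows the effect of the writer's encoding only.
class OutputEncodingSketch {

    // Writes already-extracted text through an OutputStreamWriter and
    // returns the bytes that would have been written to the output file.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String extracted = "\u00fc"; // u with diaeresis
        // One byte in ISO-8859-1, two bytes in UTF-8: same text, different bytes.
        System.out.println(encode(extracted, StandardCharsets.ISO_8859_1).length);
        System.out.println(encode(extracted, StandardCharsets.UTF_8).length);
        // The euro sign does not exist in ISO-8859-1, so the writer
        // substitutes the charset's replacement byte, which is '?'.
        System.out.println((char) encode("\u20ac", StandardCharsets.ISO_8859_1)[0]);
    }
}
```

If the String handed to the writer is already wrong, for example because a
font's encoding could not be mapped to Unicode, trying other output encodings
will not fix it. That would be consistent with "nothing works" above.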