How do I get the encoding of a PDF File and set the encoding of the input stream??

Michael Jeier Sat, 09 Jul 2011 06:38:48 -0700

Hi,

I am trying to export the text of a PDF with the help of the
PDFTextStripper class.
Unfortunately, I have some character encoding issues.
1.) How do I get the encoding information of the PDF file? Or is that
information tied to the individual fonts?
2.) How do I set the encoding of the input stream that PDFTextStripper is
using?


Here's my code snippet:

        Writer output = null;
        PDDocument document = null;
        URL url = new URL( "file:\\mypdf.pdf" );
        document = PDDocument.load(url, true);

        // set encoding
        output = new OutputStreamWriter(
                new FileOutputStream( "C:\\out2.txt" ), "ISO-8859-1" );

        PDFTextStripper stripper = null;
        stripper = new PDFTextStripper("ISO-8859-1"); // set encoding
        stripper.setStartPage( 1 );
        stripper.setEndPage( 2 );
        stripper.writeText( document, output );

        output.close();
        document.close();

I tried all kinds of encodings (UTF-8 and all possible ISO-8859-X) at the
"set encoding" comments in the code, but nothing works. It seems that you
can only configure the encoding of the output. Maybe someone can point me
to the class and line in PDFBox that actually opens the input stream.
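
To illustrate what configuring "only the encoding of the output" means,
here is a minimal sketch in plain Java with no PDFBox involved. The class
name OutputEncodingDemo and the helper encode() are made up for this
example. It shows that the charset given to an OutputStreamWriter only
decides how text that is already decoded gets turned into bytes; it cannot
repair characters that were decoded wrongly before they reached the Writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical demo class, not part of PDFBox.
public class OutputEncodingDemo {

    // Encodes already-decoded Java text to bytes through an
    // OutputStreamWriter, the same kind of Writer used for the output file.
    static byte[] encode(String text, String charsetName) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(bytes, charsetName);
        writer.write(text);
        writer.close(); // flushes the encoder into the byte buffer
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u00e4\u20ac"; // a-umlaut followed by the euro sign
        for (String cs : new String[] { "ISO-8859-1", "UTF-8" }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(text, cs)) {
                hex.append(String.format("%02x ", b & 0xff));
            }
            System.out.println(cs + ": " + hex.toString().trim());
        }
    }
}
```

With ISO-8859-1 the euro sign has no mapping and is written as "?", while
UTF-8 can represent both characters. So if the text is already wrong when it
reaches the Writer, changing the output charset would not be expected to
help.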

Any help would be highly appreciated! THANKS!

Regards, Michael
