Hi,

I am trying to export a PDF with the help of the PDFTextStripper class.
Unfortunately, I have some character encoding issues.
1.) How do I get the encoding information of the PDF file? Or is that
information tied to the individual fonts?
2.) How do I set the encoding for the input stream that PDFTextStripper
is using?

Here's my code snippet:

Writer output = null;
        PDDocument document = null;
        URL url = new URL( "file:\\mypdf.pdf" );
        document = PDDocument.load(url, true);

        output = new OutputStreamWriter(
                new FileOutputStream( "C:\\out2.txt" ), "ISO-8859-1" ); // set encoding

        PDFTextStripper stripper = null;
        stripper = new PDFTextStripper("ISO-8859-1"); // set encoding
        stripper.setStartPage( 1 );
        stripper.setEndPage( 2 );
        stripper.writeText( document, output );

        output.close();
        document.close();

I tried all kinds of encodings (UTF-8 and every ISO-8859-X variant) at the
two "set encoding" comments in the code, but nothing works. It seems that
you can only configure the encoding of the output. Maybe someone can point
me to the class and line in PDFBox where the input stream is actually
opened.
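
To show what I mean by "only the output", here is a minimal sketch that uses
only the JDK (no PDFBox involved; the class name OutputEncodingDemo and the
encode helper are made up for illustration). As far as I understand it, the
charset given to the OutputStreamWriter only controls how characters that
have already been extracted are turned into bytes, so it cannot repair
anything that was decoded wrongly at an earlier stage:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class OutputEncodingDemo {

    // Hypothetical helper: encodes already-decoded text with the given
    // charset, the same way the OutputStreamWriter in my snippet does.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // "a" with umlaut (U+00E4) followed by the euro sign (U+20AC)
        String text = "\u00e4\u20ac";

        // ISO-8859-1 has no euro sign, so the writer substitutes '?'
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length);

        // UTF-8 can represent both characters, using more bytes
        System.out.println(encode(text, StandardCharsets.UTF_8).length);
    }
}
```

So if the text coming out of the stripper is already wrong, switching the
writer's charset only changes how the wrong characters are written.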

Any help would be highly appreciated! THANKS!

Regards, Michael
