Hi,

On 10.07.2011 16:08, Michael Jeier wrote:
Hi,

well, though it's a research project, I am not allowed to publish the
original PDF file. :(
Maybe the outcome will be published though. But this is beyond my control.
Check the embedded fonts using a PDF reader, e.g. Acrobat Reader. Load the PDF and go to the document properties. Most readers provide a list of the fonts used, including their encodings.

There must be a way to set the encoding of the input stream and to find out
the encoding of the PDF!! Somebody please push me in the right direction.
There must be a misunderstanding. A PDF doesn't have an encoding the way a simple text file does; it is a binary file. The font encodings are used to convert the character codes stored in the PDF into readable text. Sometimes PDFs don't provide the information needed to do that conversion.
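
To make that concrete, here is a minimal sketch in plain Java (no PDFBox involved). The class name, the two font tables and their contents are made up for illustration; a real PDF carries this information in each font's encoding data, and this is not how PDFBox is implemented internally. It only shows that the same character codes turn into different text depending on which font drew them, and that a code without a mapping cannot be recovered.

```java
import java.util.HashMap;
import java.util.Map;

public class FontEncodingSketch {

    // Hypothetical per-font tables: character code -> Unicode text.
    static final Map<Integer, String> FONT_A = new HashMap<>();
    static final Map<Integer, String> FONT_B = new HashMap<>();

    static {
        FONT_A.put(0x41, "A");
        FONT_A.put(0x80, "\u20AC"); // euro sign in font A
        FONT_B.put(0x41, "\u0391"); // same code, Greek capital alpha in font B
        // FONT_B has no entry for 0x80: the mapping was never supplied.
    }

    /** Decodes character codes with the table of the font that drew them. */
    static String decode(int[] codes, Map<Integer, String> fontTable) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            // Unmapped codes are marked with the Unicode replacement character.
            text.append(fontTable.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int[] codes = {0x41, 0x80};
        System.out.println(decode(codes, FONT_A));
        System.out.println(decode(codes, FONT_B));
    }
}
```

With font B the second code ends up as a replacement character, and no choice of file encoding afterwards can bring the original character back.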

Regards, Michael

On Sat, Jul 9, 2011 at 7:57 PM, Andreas Lehmkuehler wrote:

Hi,

On 09.07.2011 15:38, Michael Jeier wrote:

  Hi,

I am trying to export a PDF with the help of the PDFTextStripper class.
Unfortunately, I have some character encoding issues.
1.) How do I get the encoding information of the PDF file? Or is that
information tied to the individual fonts??

Each font has its own encoding, and the fonts of a PDF don't have to share
the same one.


  2.) How do I set the encoding of the input stream that PDFTextStripper is
using?

Nowhere, it doesn't make sense to change that.


  Here's my code snippet:

         Writer output = null;
         PDDocument document = null;
         URL url = new URL( "file:\\mypdf.pdf" );
         document = PDDocument.load(url, true);

         output = new OutputStreamWriter(
                 new FileOutputStream( "C:\\out2.txt" ), "ISO-8859-1" ); // set encoding

         PDFTextStripper stripper = null;
         stripper = new PDFTextStripper("ISO-8859-1"); // set encoding
         stripper.setStartPage( 1 );
         stripper.setEndPage( 2 );
         stripper.writeText( document, output );

         output.close();
         document.close();

I tried all kinds of encodings (UTF-8 and all possible ISO-8859-X) at the
"set encoding" comments in the code, but nothing works. It seems that you
can only configure the encoding of the output.

That's exactly the expected behaviour.
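
As a rough illustration of why only the output side is configurable, the sketch below uses nothing but the JDK. The class and method names are hypothetical, and the strings stand in for text that has already been extracted. The output encoding only decides how an existing Java String is turned into bytes; it cannot repair a character that was already wrong when it came out of the PDF.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class OutputEncodingSketch {

    /** Writes already-extracted text with the given charset and returns the bytes. */
    static byte[] encode(String extractedText, String charsetName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, charsetName)) {
            out.write(extractedText);
        } catch (IOException e) {
            // Also covers an unknown charset name.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String good = "\u00FC"; // u with diaeresis, decoded correctly from the font
        String bad = "\uFFFD";  // stand-in for a character the font could not map

        System.out.println(encode(good, "ISO-8859-1").length); // one byte
        System.out.println(encode(good, "UTF-8").length);      // two bytes
        System.out.println(encode(bad, "UTF-8").length);       // three bytes, still the stand-in
    }
}
```

Switching the charset changes the bytes written for the correctly decoded character, but the unmapped one stays a placeholder in every encoding.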


  Maybe someone can
point me to the class and line in PDFBox which contains the actual opening
of the input stream.

Any help would be highly appreciated! THANKS!

I guess the problem is the PDF itself. It probably uses some unsupported
font encodings, or the text can't be extracted because of the encoding.

But without the pdf in question everything is just a guess.

  Regards, Michael



BR
Andreas Lehmkühler



