Extracting plain text from PDF

Piotr Rychlik Tue, 06 Apr 2010 04:50:27 -0700

Hi,

I have a problem extracting plain text from PDF documents that contain 
Polish characters.
I am using the following approach to extract the text:
 ......
 // Parse the PDF file into its low-level COS document.
 File f = new File(fileName);
 PDFParser parser = new PDFParser(new FileInputStream(f));
 parser.parse();

 // Wrap the COS document and extract its text.
 COSDocument cosDoc = parser.getDocument();
 PDFTextStripper pdfStripper = new PDFTextStripper();
 PDDocument pdDoc = new PDDocument(cosDoc);
 String parsedText = pdfStripper.getText(pdDoc);
 ......

parsedText is then written to a file using UTF-8 encoding.
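
The writing step is not shown in the message. A minimal sketch of how it 
might look, assuming only the standard Java I/O classes, follows; the class 
name Utf8TextWriter and its write method are hypothetical and not part of 
the original code. Naming the charset explicitly avoids depending on the 
platform default encoding.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the code in the original message.
public class Utf8TextWriter {

    // Writes the text to the given path as UTF-8, independent of the
    // platform default charset.
    public static void write(String path, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(path), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }
}
```

With this helper, the extracted string would be saved with a call such as 
Utf8TextWriter.write(outputFileName, parsedText), where outputFileName is 
again an assumed variable.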

The above code works fine in most cases: text containing Polish characters is 
extracted correctly.
However, I found an unusual .pdf file for which this method does not work. 
Polish characters are replaced by other characters, e.g. the Polish crossed l 
(ł) is replaced by %. Is there any way to fix this problem?
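
One way to narrow the problem down is to inspect the code points of the 
extracted string before it is written to the file. If % (U+0025) already 
appears there instead of ł (U+0142), the substitution happens before the 
file-writing step. The sketch below uses only the standard Java library; 
the class name CodePointDump and its describe method are hypothetical.

```java
// Hypothetical diagnostic helper; not part of the original code.
public class CodePointDump {

    // Returns every code point of the text as "U+XXXX", separated by spaces.
    public static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Calling System.out.println(CodePointDump.describe(parsedText)) on a short 
excerpt would show which characters the extraction actually produced.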

Regards,
Piotr Rychlik
