PDFBox extracts fi char wrong

Hesham G. Thu, 26 May 2011 09:41:59 -0700

Hello,

I am using PDFBox version 1.4 to extract text from a PDF, but every word 
containing "fi" is extracted incorrectly. You can reproduce this with the 
following one-page sample PDF: 
http://www.4shared.com/document/GAMnpE9A/the_fi_char.html


I am aware of the issue https://issues.apache.org/jira/browse/PDFBOX-860, which 
says this is fixed when the ICU4J jar is used. As far as I can see, ICU4J is 
already bundled inside the PDFBox 1.4 jar, but such words are still extracted 
incorrectly. Am I missing something here?
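
In case it helps to narrow this down: if the wrong character turns out to be 
the single Unicode ligature U+FB01 rather than the two letters "f" and "i", a 
post-processing step along the following lines might work around it. This is 
only a sketch that uses java.text.Normalizer from the JDK, not the PDFBOX-860 
fix itself. The class name LigatureFix, the method expandLigatures and the 
sample string are made up for illustration, and I am assuming (not confirming) 
that U+FB01 is what the extractor returns.

```java
import java.text.Normalizer;

public class LigatureFix {

    // Applies Unicode NFKC normalization, which replaces compatibility
    // characters such as U+FB01 (the "fi" ligature) with plain letters.
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the text returned by the extractor.
        String extracted = "\uFB01le con\uFB01guration";
        System.out.println(expandLigatures(extracted));
    }
}
```

Note that NFKC also folds other compatibility characters (for example 
full-width forms), so it changes more than just ligatures.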


Best regards,
Hesham
