Re: PDFBox extracts fi char wrong

Hesham G. Fri, 27 May 2011 06:06:08 -0700

Thomas,

Thanks a lot for the information.
I have just tested it and it extracts those characters correctly.


I will upgrade to this version now.


Best regards,
Hesham

---------------------------------------------
Included message :

Hi Hesham,
> I am using PDFBox version 1.4 to extract text from a PDF, but all the words having "fi" inside them are extracted wrong. You can test the following 1 page PDF sample: http://www.4shared.com/document/GAMnpE9A/the_fi_char.html
> I am aware of the post: https://issues.apache.org/jira/browse/PDFBOX-860 which mentions that this is now fixed if I use the ICU4J jar, which already exists inside the PDFBox 1.4 jar as I can see, but still such words are parsed wrong. Am I missing something here?

I checked your file and get the same results with PDFBox 1.4. But from version 1.5 on everything seems to be OK (apart from my own ligature problems…). So I suggest updating your PDFBox version. Note that version 1.6 definitely needs the additional ICU jar (see https://issues.apache.org/jira/browse/PDFBOX-970).
All the best
Thomas
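
For anyone who wants to check their own extracted text after upgrading, here is a minimal sketch using only the JDK. It is not the PDFBox fix itself and does not call PDFBox. It assumes the extracted string may still contain the single ligature code point U+FB01 and shows how to detect it and expand it to "fi" with java.text.Normalizer (NFKC). The class name LigatureCheck, its method names, and the sample string are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: checks a string (for example,
// text already extracted from a PDF) for the "fi" ligature code point.
class LigatureCheck {
    // U+FB01 LATIN SMALL LIGATURE FI
    static final char FI_LIGATURE = '\uFB01';

    // True if the text still contains the single-character ligature.
    static boolean containsFiLigature(String text) {
        return text.indexOf(FI_LIGATURE) >= 0;
    }

    // NFKC applies compatibility decomposition, which maps U+FB01 to "fi".
    // Note that NFKC also folds other compatibility characters.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumed sample input standing in for extracted PDF text.
        String extracted = "de\uFB01ne the \uFB01le";
        System.out.println(containsFiLigature(extracted)); // true
        System.out.println(expandLigatures(extracted));    // define the file
    }
}
```

If containsFiLigature returns false but words are still wrong (for example the "fi" is missing altogether), the problem is in the extraction step and normalizing afterwards will not help; upgrading as described above is then the thing to try.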
