Re: PDFBox extracts fi char wrong

Thomas Fischer Fri, 27 May 2011 02:05:12 -0700

Hi Hesham,

> I am using PDFBox version 1.4 to extract text from a PDF, but all the words 
> having "fi" inside them are extracted wrong. You can test the following 1 
> page PDF sample: http://www.4shared.com/document/GAMnpE9A/the_fi_char.html
> 
> I am aware of the post: https://issues.apache.org/jira/browse/PDFBOX-860 
> which mentions that this is now fixed if I use the ICU4J jar, which already 
> exists inside the PDFBox 1.4 jar as I can see, but still such words are 
> parsed wrong. Am I missing something here ?


I checked your file and got the same results with PDFBox 1.4. From version 
1.5 on, though, everything seems to be OK (apart from my own ligature problems…). 
So I suggest updating your PDFBox version. Note that version 1.6 definitely 
needs the additional ICU jar (see 
https://issues.apache.org/jira/browse/PDFBOX-970).
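
If upgrading is not possible right away, one possible stopgap is to 
post-process the extracted string yourself. The sketch below assumes the 
text comes back with the single ligature code point U+FB01 in place of the 
two letters "f" and "i"; I have not confirmed that this is what happens with 
your file. It uses only java.text.Normalizer from the JDK, and the class and 
method names (LigatureFold, fold) are made up for illustration, not part of 
PDFBox.

```java
import java.text.Normalizer;

public class LigatureFold {

    // Hypothetical helper, not a PDFBox API: applies Unicode NFKC
    // normalization, which replaces compatibility characters such as
    // the "fi" ligature (U+FB01) with their plain-letter equivalents.
    static String fold(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Stand-in for text returned by the extractor; the ligature is
        // written as a Unicode escape to avoid source-encoding issues.
        String sample = "\uFB01nal of\uFB01ce";
        System.out.println(fold(sample)); // prints: final office
    }
}
```

Note that NFKC also folds other compatibility characters (full-width 
letters, superscript digits and so on), so it may change more than just the 
ligatures.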

All the best
Thomas
