Andreas ,

Thanks for the explanation.


Best regards ,
Hesham

---------------------------------------------
Included message :


Hi,

On 01.09.2011 05:50, Hesham G. wrote:
Mirko ,

Thanks a lot for your reply.
Shouldn't PDFBox handle those ligatures automatically, as was stated for
earlier PDFBox versions?
Yes, but only if they can be recognized as ligatures. One font in your PDF uses a custom encoding, and I guess it doesn't provide a mapping to readable characters. Even Acrobat Reader can't extract those ligatures.
IMHO it's impossible to extract that kind of text without some
pdf2image/OCR approach, which has already been discussed theoretically on this list.
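
Short of OCR, one crude workaround for a single known document is a hand-built substitution table applied to the extracted string. The sketch below is only an illustration, not something PDFBox provides: the class name GlyphFixups and its table are hypothetical, and the three entries are taken from the examples in the original question ("<" for fi, "+" for Th, ">" for ft). It assumes the whole text comes from the affected font, so any genuine "<", "+" or ">" in the document would be corrupted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps the stray characters seen in this one PDF
// back to the letter sequences their ligature glyphs stand for.
final class GlyphFixups {
    private static final Map<Character, String> FIXUPS = new LinkedHashMap<>();
    static {
        FIXUPS.put('<', "fi"); // cruci<xion => crucifixion
        FIXUPS.put('+', "Th"); // +ey        => They
        FIXUPS.put('>', "ft"); // a>er       => after
    }

    // Replaces every character that has a table entry; copies the rest unchanged.
    static String apply(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            String replacement = FIXUPS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The table has to be rebuilt by inspection for every font with a broken encoding, which is why this does not generalize the way OCR would.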

Best regards ,
Hesham


---------------------------------------------
Included message :


These are most likely ligatures in the original PDF. Ligatures for fi, fl,
ffl, and ft are pretty common, and some word processing programs
automatically replace the original character sequences by their
corresponding ligatures. I haven't really seen a Th ligature before, but it makes sense because the vertical bar of the T and the vertical bar of the h
typically appear visually too far apart without custom kerning.
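
When a font does map its ligature glyphs to the Unicode ligature code points (U+FB00 to U+FB04 cover ff, fi, fl, ffi and ffl), the extracted text can be expanded back to plain letters with the JDK's own normalizer. This is a minimal sketch, assuming the extracted string already contains those code points; the class name LigatureExpander is made up for illustration. It does not help for ft or Th, which have no Unicode code point of their own, and NFKC also rewrites other compatibility characters (superscripts, full-width forms), which may or may not be wanted.

```java
import java.text.Normalizer;

// Hypothetical helper: expands Unicode presentation-form ligatures
// into their plain letter sequences.
final class LigatureExpander {
    // NFKC applies compatibility decomposition, so U+FB01 becomes "fi".
    static String expand(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // \uFB01 is the single-character "fi" ligature.
        System.out.println(expand("cruci\uFB01xion")); // prints "crucifixion"
    }
}
```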

HTH,

Mirko


On Wed, Aug 31, 2011 at 12:59 PM, Hesham G. <[email protected]> wrote:

Hello ,

I have a PDF whose text I extract using PDFBox. The PDF reads fine
in Mac's Preview, but in PDFBox some words are read in a strange way.
Examples:
crucifixion => cruci<xion
They => +ey
after => a>er

You can check a 1 page PDF sample here :
http://www.4shared.com/document/F5DG_rHu/pdf_with_strange_text.html

Is this a problem with the PDF, or with PDFBox?


Best regards ,
Hesham


BR
Andreas Lehmkühler
