Re: PDFbox text extraction: l for i

Villu Ruusmann Mon, 15 Feb 2010 15:35:13 -0800

Hello there,

>
> I know about ligatures, and normally PDFBox handles them well, e.g. ﬀ ﬃ ﬁ ﬂ 
> are quite common in TeX-produced PDF documents.
> But why should PDFBox reproduce a fi (FI) ligature as fl (FL)?
>


When does this problem occur? Are you receiving "fl" instead of "fi"
when performing text extraction (e.g. with the PDFTextStripper utility)
or are you seeing it when performing PDF rendering (e.g. with the
PageDrawer utility)?

Debugging could be more or less rewarding depending on what tools you
are using and how familiar you are with font encodings and charsets.
The basic idea would be to find out the value of the "problematic"
byte in the PDF text object, and then to look up its character name.
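As a first step on the extraction side, a minimal sketch along the
following lines could show which ligature code points actually end up
in the extracted text. It only inspects an already extracted string and
does not read the PDF text object itself. The class name LigatureProbe,
the method findLigatures and the sample string are made up for
illustration; in practice the input would be whatever PDFTextStripper
returned.

```java
import java.util.ArrayList;
import java.util.List;

public class LigatureProbe {

    // Reports every code point from the Latin ligature range
    // U+FB00..U+FB06 found in the given text, together with its
    // offset and its Unicode character name.
    static List<String> findLigatures(String extracted) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < extracted.length(); ) {
            int cp = extracted.codePointAt(i);
            if (cp >= 0xFB00 && cp <= 0xFB06) {
                hits.add(String.format("offset %d: U+%04X %s",
                        i, cp, Character.getName(cp)));
            }
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample; replace with the real extracted text.
        String sample = "of\uFB01ce and \uFB02ow";
        for (String hit : findLigatures(sample)) {
            System.out.println(hit);
        }
    }
}
```

If the reported name is LATIN SMALL LIGATURE FL at a place where the
page clearly shows "fi", the next thing to check is the byte value in
the content stream and the glyph name the font encoding assigns to it.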

If you could share the PDF document, I might take a look at it
sometime. It could be another Type1C font issue where I am to blame.


VR
