Hi,

I'm trying to resolve PDFBOX-1216, which I reported a while ago, by
debugging the PDFBox source code, and I need some advice on what to
do next. In brief, the issue is that PDFBox doesn't use presentation
forms when rendering Arabic / Persian text from a PDF to an image, so
the characters are shown disconnected. I'm not sure yet, but I guess
this is called "ligature"?

Anyway, here's what I have concluded so far. If anyone could guide me,
I may be able to fix this and provide a patch.

* In the PDF file, different codes are used for the different
presentation forms of a single Unicode character (in the content
stream of the PDF file, under the "TJ" operator, which is "show text,
allowing individual glyph positioning").

* In the "ToUnicode" table of the PDF file (which is read into the
"cmap" variable of the PDFont class), all the presentation forms are
mapped to the same Unicode character (which is not in the presentation
forms range).

* When PDFBox draws text on the graphics canvas, it puts that Unicode
value in a string and calls the "PDSimpleFont.drawString" method.

* Since the single character is isolated, it is either not found in
the Font, or the isolated form (if present) is rendered.

Example:

You can check characters in the following address:
http://en.wikipedia.org/wiki/Arabic_characters_in_Unicode

When a U+0647 character ( ه ) in the file should be connected to the
character before it, it should appear as U+FEEA ( ﻪ ).
In the attached PDF file, this character appears in two different
fonts. The internal PDF codes for this character in those fonts are
"00C4" and "03EA".

When I set a breakpoint in the "PDSimpleFont.drawString" method and
manually replace the string content with the appropriate presentation
form (like "\ufeea" for the above character), everything else works
fine and the output image is correct (the presentation form is found
in the Font, whereas the original character, "\u0647", is not embedded
in it).
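
To make the manual replacement above a bit more concrete, here is a
minimal sketch (plain Java, not PDFBox code) of how the candidate
presentation forms for a base character could be listed. It assumes
that the presentation forms of interest live in the Arabic
Presentation Forms-B block (U+FE70 to U+FEFC) and that their Unicode
compatibility decomposition (NFKC) gives back the base character. The
class and method names are made up for illustration. It does not
answer the real question of which of the candidates is the right one
for a given PDF character code.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class PresentationForms {
    // Assumed search range: letter forms in Arabic Presentation Forms-B.
    private static final char RANGE_START = '\uFE70';
    private static final char RANGE_END = '\uFEFC';

    // Returns every presentation form in the range whose NFKC
    // normalization is exactly the given base character.
    static List<Character> candidates(char base) {
        List<Character> result = new ArrayList<>();
        for (char c = RANGE_START; c <= RANGE_END; c++) {
            if (!Character.isDefined(c)) {
                continue; // skip unassigned code points
            }
            String normalized =
                Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC);
            if (normalized.length() == 1 && normalized.charAt(0) == base) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print the candidate forms of U+0647 as hex code points.
        for (char c : candidates('\u0647')) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

For U+0647 the candidate list should include U+FEEA, the form I
substituted by hand. Each candidate could then be tested against the
AWT font with "java.awt.Font.canDisplay(char)", but choosing between
the isolated, initial, medial and final forms still needs information
about the neighbouring characters, which this sketch does not have.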

PDF viewers have some way of figuring out the presentation forms,
because the PDF is displayed correctly in all viewers.

But I could not find out how to determine which character code should
be mapped to which presentation form. I'm not very familiar with the
internals of the PDF format. If any of the developers can guide me on
where to look next, I'd hopefully be able to figure out a way to fix
this.

Thanks in advance
Hamed
