Ligature Substitutions, glyph reverse split reorder and gsub in PDFbox

Jeyan Mon, 28 Feb 2022 10:53:01 -0800

Hi Team,

PDF viewers do not render all Tamil letters as expected in PDFs
generated with PDFBox. It seems I have to perform the required
substitutions myself while generating the PDF to get the text rendered
correctly.

I am attempting these substitutions; any help would be appreciated.

Ligature Substitutions - Tamil Use Cases

Below are the five possible cases for a base character joining with a
vowel sign. There are 18 base characters; the examples use க, and the
cases are the same for the remaining seventeen.

Case 1 - Vowel follows the base character - No change required. PDF
viewers render as expected.

க + ா = கா

Case 2 - Vowel on top of base character - No change required. PDF
viewers render as expected.

க + ி = கி

க + ீ = கீ

க + ் = க்

Case 3 - Base character follows the vowel - Need to reverse the
glyphs

க + ெ = கெ -> ெ + க = கெ

க + ே = கே -> ே + க = கே

க + ை = கை -> ை + க = கை
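The reversal in Case 3 can be sketched in plain Java on the code point sequence, before the text is handed to PDFBox. This is only a minimal sketch: the class and method names are hypothetical, and it assumes each vowel sign directly follows the single base character it belongs to (conjuncts are not handled).

```java
public class TamilPreBaseReorder {

    // Tamil vowel signs drawn to the left of the base character:
    // U+0BC6, U+0BC7 and U+0BC8 (the three signs listed in Case 3).
    static boolean isPreBaseVowelSign(int cp) {
        return cp == 0x0BC6 || cp == 0x0BC7 || cp == 0x0BC8;
    }

    // Swaps every pre-base vowel sign with the code point just before it,
    // turning logical order (base, sign) into visual order (sign, base).
    static String reorderPreBaseVowels(String logical) {
        int[] cps = logical.codePoints().toArray();
        for (int i = 1; i < cps.length; i++) {
            if (isPreBaseVowelSign(cps[i])) {
                int base = cps[i - 1];
                cps[i - 1] = cps[i];
                cps[i] = base;
            }
        }
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        // Logical order: U+0B95 (base) followed by U+0BC6 (vowel sign).
        String visual = reorderPreBaseVowels("\u0B95\u0BC6");
        visual.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

Characters that are not pre-base vowel signs pass through unchanged, so Case 1 and Case 2 input is left as it is.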

Case 4 - Base character follows the composite vowel - Need to split and
reorder the glyphs

க + ொ = கொ -> க + ெ + ா -> ெ + க + ா = கொ

க + ோ = கோ -> க + ே + ா -> ே + க + ா = கோ

க + ௌ = கௌ -> க + ெ + ள -> ெ + க + ள = கௌ
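A possible way to do the split and reorder of Case 4 on code points is sketched below. The class and method names are hypothetical. One assumption to check against the font: for ௌ (U+0BCC) the sketch emits the Unicode canonical decomposition, U+0BC6 followed by the AU length mark U+0BD7, rather than the letter ள (U+0BB3) written above; whether the font draws both with the same glyph is not something this sketch can know.

```java
public class TamilSplitVowelReorder {

    // Rewrites (base, composite vowel sign) as (left part, base, right part).
    // Assumes the composite sign directly follows its single base character.
    static String splitAndReorder(String logical) {
        int[] in = logical.codePoints().toArray();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length; i++) {
            int next = (i + 1 < in.length) ? in[i + 1] : -1;
            int left;
            int right;
            switch (next) {
                case 0x0BCA: left = 0x0BC6; right = 0x0BBE; break; // sign O
                case 0x0BCB: left = 0x0BC7; right = 0x0BBE; break; // sign OO
                case 0x0BCC: left = 0x0BC6; right = 0x0BD7; break; // sign AU
                default:     left = -1;     right = -1;
            }
            if (left != -1) {
                out.appendCodePoint(left)
                   .appendCodePoint(in[i])
                   .appendCodePoint(right);
                i++; // the composite sign has been consumed
            } else {
                out.appendCodePoint(in[i]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String visual = splitAndReorder("\u0B95\u0BCA");
        visual.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

If the font really expects the ள glyph for the right-hand part, only the `right` value in the U+0BCC branch would need to change.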

Case 5 - Base character and vowel need to point to a new glyph ID - The
resulting glyph has no Unicode character of its own - Substitute a new
glyph for a series of glyphs

க + ு = கு -> கு

க + ூ = கூ -> கூ
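Case 5 amounts to replacing a sequence of glyph IDs with a single ligature glyph ID. The sketch below shows that idea with a hand-filled lookup; the class name and the map are hypothetical, and the glyph IDs (1828, 1854, 1855, 6698, 6716) are the ones observed for this particular font in this mail. A real solution would presumably read such pairs from the font's GSUB table instead of hard-coding them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LigatureSubstitution {

    // Hypothetical, font-specific lookup: glyph ID pair -> ligature glyph ID.
    static final Map<List<Integer>, Integer> LIGATURES = new HashMap<>();
    static {
        LIGATURES.put(Arrays.asList(1828, 1854), 6698);
        LIGATURES.put(Arrays.asList(1828, 1855), 6716);
    }

    // Scans left to right and replaces every known pair by its ligature.
    static List<Integer> substitute(List<Integer> gids) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < gids.size()) {
            Integer ligature = (i + 1 < gids.size())
                    ? LIGATURES.get(Arrays.asList(gids.get(i), gids.get(i + 1)))
                    : null;
            if (ligature != null) {
                out.add(ligature);
                i += 2; // both source glyphs are consumed
            } else {
                out.add(gids.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(substitute(Arrays.asList(1828, 1854, 1828, 1851)));
    }
}
```

Only pairs are handled here; longer sequences would need a longest-match search over the lookup.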

The same cases in table form. Column groups: Input text; JDK (char
sequence, decimal code points, Unicode); TTF (gid); PDFBox-generated PDF
(Actual*, Expected).

Input text | Char sequence | Code points | Unicode | gid | Actual* | Expected
க் | க + ் | 2965 3021 | U+0B95 U+0BCD | 1828 1862 | க் | All good
கா | க + ா | 2965 3006 | U+0B95 U+0BBE | 1828 1851 | கா | All good
கி | க + ி | 2965 3007 | U+0B95 U+0BBF | 1828 1852 | கி | All good
கீ | க + ீ | 2965 3008 | U+0B95 U+0BC0 | 1828 1853 | கீ | All good
கு | க + ு | 2965 3009 | U+0B95 U+0BC1 | 1828 1854 | கு | கு (gid = 6698); new glyph expected
கூ | க + ூ | 2965 3010 | U+0B95 U+0BC2 | 1828 1855 | கூ | கூ (gid = 6716); new glyph expected
கெ | க + ெ | 2965 3014 | U+0B95 U+0BC6 | 1828 1856 | கெ | ெ + க = கெ; reversing the glyphs expected
கே | க + ே | 2965 3015 | U+0B95 U+0BC7 | 1828 1857 | கே | ே + க = கே; reversing the glyphs expected
கை | க + ை | 2965 3016 | U+0B95 U+0BC8 | 1828 1858 | கை | ை + க = கை; reversing the glyphs expected
கொ | க + ொ | 2965 3018 | U+0B95 U+0BCA | 1828 1859 | கொ | க + ெ + ா -> ெ + க + ா = கொ; split and reorder expected
கோ | க + ோ | 2965 3019 | U+0B95 U+0BCB | 1828 1860 | கோ | க + ே + ா -> ே + க + ா = கோ; split and reorder expected
கௌ | க + ௌ | 2965 3020 | U+0B95 U+0BCC | 1828 1861 | கௌ | க + ெ + ள -> ெ + க + ள = கௌ; split and reorder expected

* Actual - the dotted circle will be invisible.
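For reference, the JDK code point values listed in the table can be reproduced with a few lines of plain Java. This is a minimal sketch with a hypothetical class name; it prints each code point in decimal together with its U+ hexadecimal form.

```java
public class CodePointDump {

    // Returns e.g. "2965=U+0B95 3021=U+0BCD" for the two-character input.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(cp).append('=').append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0B95 followed by U+0BCD, the first row of the table.
        System.out.println(describe("\u0B95\u0BCD"));
    }
}
```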

I have attached the actual output and the expected output. To obtain the
expected output I made hard-coded substitutions: for glyph IDs that have
no Unicode code point, I hard-coded the mapping in
PDCIDFontType2#encode(int unicode); I reverse, split and reorder the
input character sequence before calling showText; and I added the glyph
IDs that have no Unicode code point to the subsetter in TrueTypeEmbedder
so that those glyphs are embedded in the generated PDF.

How can these substitutions be handled efficiently? I have been looking
at GlyphSubstitutionTable, fontbox.cmap.Identity-H and
fontbox.unicode.Scripts.txt, but could not work it out so far. Any help
would be appreciated.

Thank you,

Jeyan
