Re: Ligature Substitutions, glyph reverse split reorder and gsub in PDFbox

Tilman Hausherr Mon, 28 Feb 2022 11:00:05 -0800

That is an unfinished thing, sadly.

https://issues.apache.org/jira/browse/PDFBOX-4189

It works for Bengali, but only visually; the text extraction doesn't work. I assume it might be possible to add Tamil, but the text extraction problem would still be there.


Tilman

On 28.02.2022 at 19:51, Jeyan wrote:


Hi Team,

PDF viewers are not rendering all of the Tamil letters as expected in the PDF generated using PDFbox. It seems I have to do the required substitutions while generating the PDF to get it rendered as expected.


I am attempting the substitutions myself; any help would be appreciated.


Ligature Substitutions - Tamil Use Cases

Below are the 5 possible cases for a base character to join with vowels. There are 18 base characters; however, the cases will be the same for the remaining seventeen.

Case 1 - Vowel follows the base character - No change required. PDF viewers render as expected.



            க + ா  =  கா

Case 2 - Vowel on top of base character - No change required. PDF viewers render as expected.


            க + ி  = கி

            க + ீ  = கீ

            க + ்  =  க்

Case 3 - Base character follows the vowel - Need to reverse the glyphs


            க + ெ=  கெ  -> ெ+ க = கெ

            க + ே=  கே  -> ே+ க = கே

            க + ை =  கை-> ை+ க = கை
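The reversal in Case 3 can be sketched in plain Java, without any PDFBox calls. This is only an illustration: the class and method names are hypothetical, and it assumes the swap is done on the input text before it is handed to showText. It treats U+0BC6, U+0BC7 and U+0BC8 as the vowel signs that are drawn to the left of the base and swaps each one with the code point before it.

```java
import java.util.Set;

/** Hypothetical helper, not part of PDFBox: moves pre-base vowel signs in front of their base. */
final class TamilPreBaseReorder {
    // Vowel signs drawn to the left of the base: U+0BC6 (E), U+0BC7 (EE), U+0BC8 (AI).
    private static final Set<Integer> PRE_BASE = Set.of(0x0BC6, 0x0BC7, 0x0BC8);

    /** Swaps every pre-base vowel sign with the code point that precedes it. */
    static String reorder(String logical) {
        int[] cps = logical.codePoints().toArray();
        for (int i = 1; i < cps.length; i++) {
            if (PRE_BASE.contains(cps[i])) {
                int base = cps[i - 1];
                cps[i - 1] = cps[i];
                cps[i] = base;
            }
        }
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        // KA + vowel sign E in logical order
        String visual = reorder("\u0B95\u0BC6");
        visual.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

Running main prints the vowel sign first and the base second. The sketch does not check that the preceding code point really is a consonant, which a real implementation would need to do.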

Case 4 - Base character follows the composite vowel - Need to split and reorder the glyphs


            க + ொ  = கொ  ->க + ெ+ ா-> ெ+ க + ா= கொ

            க + ோ  = கோ  -> க + ே+ ா     -> ே+ க + ா= கோ

            க + ௌ= கௌ-> க + ெ+ ள     -> ெ+ க + ள = கௌ
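The split-and-reorder step of Case 4 might be sketched as follows, again in plain Java with hypothetical names. It relies on java.text.Normalizer: the three two-part vowel signs U+0BCA, U+0BCB and U+0BCC have canonical decompositions, so NFD splits them into a left and a right part. One caveat: NFD yields U+0BD7 (the AU length mark) as the right half of U+0BCC, whereas the example above writes that half with the look-alike letter U+0BB3; which glyph a given font expects is not something this sketch can decide. NFD also decomposes unrelated characters such as accented Latin letters, so it would need to be limited to Tamil runs in practice.

```java
import java.text.Normalizer;

/** Hypothetical helper, not part of PDFBox: splits two-part vowel signs, then reorders. */
final class TamilSplitReorder {
    // U+0BC6..U+0BC8 are the vowel signs drawn to the left of the base.
    static boolean isPreBase(int cp) {
        return cp >= 0x0BC6 && cp <= 0x0BC8;
    }

    static String toVisualOrder(String logical) {
        // Step 1: split, e.g. U+0BCA becomes U+0BC6 followed by U+0BBE.
        String split = Normalizer.normalize(logical, Normalizer.Form.NFD);
        // Step 2: move each left-side part in front of the preceding code point.
        int[] cps = split.codePoints().toArray();
        for (int i = 1; i < cps.length; i++) {
            if (isPreBase(cps[i])) {
                int base = cps[i - 1];
                cps[i - 1] = cps[i];
                cps[i] = base;
            }
        }
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        // KA + vowel sign O in logical order
        toVisualOrder("\u0B95\u0BCA").codePoints()
                .forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```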

Case 5 - Base character and vowel need to map to a new glyph ID - The resulting glyph has no Unicode code point of its own - Substitute the new glyph for the series of glyphs


            க + ு= கு    -> கு

            க + ூ= கூ  - > கூ
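Case 5 is a ligature substitution on glyph IDs rather than on characters. A minimal sketch, assuming the glyph IDs listed in the table of this message (1828 and 1854 replaced by 6698, 1828 and 1855 replaced by 6716), could look like this. Those numbers are specific to the one font used here; a real implementation would read the pairs from the font's GSUB table instead of hard-coding them. The class name and the map are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of PDFBox: replaces glyph ID pairs with a ligature glyph ID. */
final class LigatureSketch {
    // Glyph IDs copied from this message; they are only valid for that specific font.
    static final Map<List<Integer>, Integer> LIGATURES = Map.of(
            List.of(1828, 1854), 6698,
            List.of(1828, 1855), 6716);

    static List<Integer> substitute(List<Integer> gids) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < gids.size()) {
            if (i + 1 < gids.size()) {
                // Look up the current pair; List equality is element-wise.
                Integer ligature = LIGATURES.get(gids.subList(i, i + 2));
                if (ligature != null) {
                    out.add(ligature);
                    i += 2;
                    continue;
                }
            }
            out.add(gids.get(i));
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(substitute(List.of(1828, 1854, 1828, 1851)));
    }
}
```

Only pairs are handled here; GSUB ligature rules can cover longer sequences, which would need a longest-match lookup.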




The same cases in table form. The "Char sequence" and "Code points" columns come from the JDK, the "gid" column from the TTF, and the "Actual*" and "Expected" columns from the PDF generated by PDFbox.

Input text | Char sequence | Code points | gid | Actual* | Expected | Remark

க் | க + ் | 2965 (U+0B95), 3021 (U+0BCD) | 1828, 1862 | க் | க் | All good
கா | க + ா | 2965 (U+0B95), 3006 (U+0BBE) | 1828, 1851 | கா | கா | All good
கி | க + ி | 2965 (U+0B95), 3007 (U+0BBF) | 1828, 1852 | கி | கி | All good
கீ | க + ீ | 2965 (U+0B95), 3008 (U+0BC0) | 1828, 1853 | கீ | கீ | All good
கு | க + ு | 2965 (U+0B95), 3009 (U+0BC1) | 1828, 1854 | கு | கு (gid = 6698) | New glyph expected.
கூ | க + ூ | 2965 (U+0B95), 3010 (U+0BC2) | 1828, 1855 | கூ | கூ (gid = 6716) | New glyph expected.
கெ | க + ெ | 2965 (U+0B95), 3014 (U+0BC6) | 1828, 1856 | கெ | ெ+ க = கெ | Reversing the glyphs expected.
கே | க + ே | 2965 (U+0B95), 3015 (U+0BC7) | 1828, 1857 | கே | ே+ க = கே | Reversing the glyphs expected.
கை | க + ை | 2965 (U+0B95), 3016 (U+0BC8) | 1828, 1858 | கை | ை+ க = கை | Reversing the glyphs expected.
கொ | க + ொ | 2965 (U+0B95), 3018 (U+0BCA) | 1828, 1859 | கொ | க + ெ+ ா -> ெ+ க + ா= கொ | Split and reorder expected.
கோ | க + ோ | 2965 (U+0B95), 3019 (U+0BCB) | 1828, 1860 | கோ | க + ே+ ா -> ே+ க + ா= கோ | Split and reorder expected.
கௌ | க + ௌ | 2965 (U+0B95), 3020 (U+0BCC) | 1828, 1861 | கௌ | க + ெ+ ள -> ெ+ க + ள = கௌ | Split and reorder expected.


* Actual - the dotted circle will be invisible.
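For reference, the decimal code points and their hexadecimal Unicode values for any input string can be listed with a few lines of plain Java. This is only a sketch of how such a listing might be produced; the class and method names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical helper: lists the code points of a string in decimal and U+XXXX form. */
final class CodePointDump {
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        // codePoints() iterates by code point, not by UTF-16 code unit.
        text.codePoints().forEach(cp ->
                sb.append(String.format(Locale.ROOT, "%d (U+%04X) ", cp, cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // KA + vowel sign U
        System.out.println(describe("\u0B95\u0BC1"));
    }
}
```

For KA followed by vowel sign U this prints 2965 (U+0B95) 3009 (U+0BC1).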

I have attached the actual output and the expected output. Just to obtain the expected output, I did a hard-coded substitution:

- For the glyph IDs that have no Unicode code point, I hard-coded the mapping in PDCIDFontType2#public byte[] encode(int unicode).
- I reverse, split and reorder the input text char sequence before calling showText.
- I also added the glyph IDs that have no Unicode code point to the TrueTypeEmbedder subsetter, so that those glyphs are embedded into the generated PDF.

How can these substitutions be handled in an efficient way? I have been looking at GlyphSubstitutionTable, fontbox.cmap.Identity-H and fontbox.unicode.Scripts.txt, but couldn't work it out so far. Any help would be appreciated.


thank you,

Jeyan


