That is an unfinished thing, sadly.

https://issues.apache.org/jira/browse/PDFBOX-4189

It works for Bengali, but only visually; text extraction doesn't work. I assume it might be possible to add Tamil, but the text extraction problem would still be there.

Tilman

On 28.02.2022 at 19:51, Jeyan wrote:

Hi Team,

PDF viewers do not render all Tamil letters as expected in PDFs generated with PDFBox. It seems I have to perform the required substitutions myself while generating the PDF to get it rendered as expected.

I am attempting the substitutions; any help would be appreciated.


Ligature Substitutions - Tamil Use Cases


Below are the 5 possible cases for a base character joining with a vowel. There are 18 base characters; the cases are the same for the remaining seventeen.


Case 1  -   vowel follows the base character -  No change required. PDF viewers render as expected.


            க + ா  =  கா


Case 2  -   Vowel on top of base character - No change required. PDF viewers render as expected.

            க + ி  = கி

            க + ீ  = கீ

            க + ்  =  க்


Case 3  -   Base character follows the vowel - Need to reverse the glyphs

            க + ெ = கெ  ->  ெ + க = கெ

            க + ே = கே  ->  ே + க = கே

            க + ை = கை  ->  ை + க = கை


Case 4  -   Base character follows the composite vowel - Need to split and reorder the glyphs

            க + ொ = கொ  ->  க + ெ + ா  ->  ெ + க + ா = கொ

            க + ோ = கோ  ->  க + ே + ா  ->  ே + க + ா = கோ

            க + ௌ = கௌ  ->  க + ெ + ள  ->  ெ + க + ள = கௌ
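
Cases 3 and 4 can be applied to the text before it is handed to showText. The following is only a minimal sketch of that idea, not PDFBox API: the class TamilReorderSketch and the method toVisualOrder are hypothetical names, the consonant range U+0B95..U+0BB9 is an assumption, and only a single consonant followed by one vowel sign is handled (no conjuncts). For ௌ it follows the split shown above (ெ + base + ள); Unicode's canonical decomposition uses U+0BD7 for the right-hand part instead.

```java
final class TamilReorderSketch {

    /** Assumption: treat U+0B95..U+0BB9 as Tamil base consonants. */
    static boolean isConsonant(char c) {
        return c >= '\u0B95' && c <= '\u0BB9';
    }

    /** Rewrites consonant + vowel-sign pairs into left-to-right visual order. */
    static String toVisualOrder(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char base = text.charAt(i);
            char sign = (i + 1 < text.length()) ? text.charAt(i + 1) : '\0';
            if (!isConsonant(base)) {
                out.append(base);
                continue;
            }
            switch (sign) {
                case '\u0BC6': // vowel sign E
                case '\u0BC7': // vowel sign EE
                case '\u0BC8': // vowel sign AI
                    // Case 3: the vowel sign is drawn before the base.
                    out.append(sign).append(base);
                    i++;
                    break;
                case '\u0BCA': // vowel sign O: E + base + AA
                    out.append('\u0BC6').append(base).append('\u0BBE');
                    i++;
                    break;
                case '\u0BCB': // vowel sign OO: EE + base + AA
                    out.append('\u0BC7').append(base).append('\u0BBE');
                    i++;
                    break;
                case '\u0BCC': // vowel sign AU: E + base + LLA (as in the message)
                    out.append('\u0BC6').append(base).append('\u0BB3');
                    i++;
                    break;
                default:
                    // Cases 1 and 2: leave the order alone.
                    out.append(base);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Print the reordered code points for KA + vowel sign O.
        for (char c : toVisualOrder("\u0B95\u0BCA").toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Note that reordering the input string this way changes the text stored in the PDF, so it fixes only the visual result; text extraction would return the reordered sequence.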


Case 5  -   Base character and vowel must map to a new glyph ID - The resulting glyph has no Unicode character of its own - Substitute the new glyph for the series of glyphs

            க + ு = கு  ->  கு

            க + ூ = கூ  ->  கூ




The same cases in table form. Code points come from the JDK, glyph IDs (gid) from the TTF, and the Actual*/Expected columns describe the PDF generated by PDFBox.

| Input text | Char sequence | Code points (decimal, Unicode) | gid | Actual* | Expected | Remark |
|------------|---------------|--------------------------------|-----|---------|----------|--------|
| க் | க + ் | 2965 (U+0B95), 3021 (U+0BCD) | 1828, 1862 | க் | க் | All good |
| கா | க + ா | 2965 (U+0B95), 3006 (U+0BBE) | 1828, 1851 | கா | கா | All good |
| கி | க + ி | 2965 (U+0B95), 3007 (U+0BBF) | 1828, 1852 | கி | கி | All good |
| கீ | க + ீ | 2965 (U+0B95), 3008 (U+0BC0) | 1828, 1853 | கீ | கீ | All good |
| கு | க + ு | 2965 (U+0B95), 3009 (U+0BC1) | 1828, 1854 | கு | கு (gid = 6698) | New glyph expected. |
| கூ | க + ூ | 2965 (U+0B95), 3010 (U+0BC2) | 1828, 1855 | கூ | கூ (gid = 6716) | New glyph expected. |
| கெ | க + ெ | 2965 (U+0B95), 3014 (U+0BC6) | 1828, 1856 | கெ | ெ + க = கெ | Reversing the glyphs expected. |
| கே | க + ே | 2965 (U+0B95), 3015 (U+0BC7) | 1828, 1857 | கே | ே + க = கே | Reversing the glyphs expected. |
| கை | க + ை | 2965 (U+0B95), 3016 (U+0BC8) | 1828, 1858 | கை | ை + க = கை | Reversing the glyphs expected. |
| கொ | க + ொ | 2965 (U+0B95), 3018 (U+0BCA) | 1828, 1859 | கொ | க + ெ + ா, then ெ + க + ா = கொ | Split and reorder expected. |
| கோ | க + ோ | 2965 (U+0B95), 3019 (U+0BCB) | 1828, 1860 | கோ | க + ே + ா, then ே + க + ா = கோ | Split and reorder expected. |
| கௌ | க + ௌ | 2965 (U+0B95), 3020 (U+0BCC) | 1828, 1861 | கௌ | க + ெ + ள, then ெ + க + ள = கௌ | Split and reorder expected. |



* Actual - the dotted circle will be invisible.
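
The code point values in the table can be reproduced with plain JDK calls. A small sketch for checking other syllables, where CodePointDump and describe are hypothetical helper names:

```java
import java.util.Locale;
import java.util.StringJoiner;

final class CodePointDump {

    /** Lists each code point of the text as "decimal (U+HEX)". */
    static String describe(String text) {
        StringJoiner joiner = new StringJoiner(", ");
        text.codePoints().forEach(cp ->
                joiner.add(String.format(Locale.ROOT, "%d (U+%04X)", cp, cp)));
        return joiner.toString();
    }

    public static void main(String[] args) {
        // KA followed by vowel sign AA
        System.out.println(describe("\u0B95\u0BBE")); // 2965 (U+0B95), 3006 (U+0BBE)
    }
}
```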

I have attached the actual output and the expected output. To obtain the expected output I used hard-coded substitutions: glyph IDs that have no Unicode code point are hard-coded in PDCIDFontType2#public byte[] encode(int unicode); the input character sequence is reversed, split and reordered before calling showText; and the glyph IDs that have no Unicode code point are added to the subsetter in TrueTypeEmbedder so that those glyphs are embedded in the generated PDF.


How can these substitutions be handled in an efficient way? I have been looking at GlyphSubstitutionTable, fontbox.cmap.Identity-H and fontbox.unicode.Scripts.txt, but have not figured it out so far. Any help would be appreciated.

Thank you,

Jeyan



