Sadly, that is unfinished work:
https://issues.apache.org/jira/browse/PDFBOX-4189
It works for Bengali, but only visually; text extraction doesn't
work. I assume it would be possible to add Tamil, but the text
extraction problem would still be there.
Tilman
On 28.02.2022 at 19:51, Jeyan wrote:
Hi Team,
PDF viewers do not render all Tamil letters as expected in PDFs
generated with PDFBox. It seems I have to perform the required
substitutions myself while generating the PDF to get it rendered as
expected. I am attempting these substitutions; any help would be
appreciated.
Ligature Substitutions - Tamil Use Cases
Below are the 5 possible cases for a base character joining with a
vowel sign. There are 18 base characters; the examples use க, and the
cases are the same for the remaining seventeen.
Case 1 - Vowel follows the base character - No change required.
PDF viewers render as expected.
க + ா = கா
Case 2 - Vowel on top of base character - No change required. PDF
viewers render as expected.
க + ி = கி
க + ீ = கீ
க + ் = க்
Case 3 - Base character follows the vowel - Need to reverse the
glyphs
க + ெ = கெ -> ெ + க = கெ
க + ே = கே -> ே + க = கே
க + ை = கை -> ை + க = கை
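The reversal in Case 3 could be sketched in Java roughly as follows. This is only an illustration under simplifying assumptions: the class and method names are made up and are not part of PDFBox, the base is assumed to be a single consonant directly before the vowel sign, and the result is in visual order, so text extracted from such a PDF would come back reordered as well.

```java
/** Sketch: move Tamil prefix vowel signs in front of their base consonant. */
final class TamilPrefixVowelReorder {

    // Vowel signs drawn to the left of the base: U+0BC6, U+0BC7, U+0BC8.
    private static boolean isPrefixVowelSign(char c) {
        return c == '\u0BC6' || c == '\u0BC7' || c == '\u0BC8';
    }

    // Tamil consonants occupy the range U+0B95..U+0BB9.
    private static boolean isConsonant(char c) {
        return c >= '\u0B95' && c <= '\u0BB9';
    }

    /** Returns the text in visual (glyph) order; the logical order is lost. */
    static String toVisualOrder(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int last = out.length() - 1;
            if (isPrefixVowelSign(c) && last >= 0 && isConsonant(out.charAt(last))) {
                // Place the vowel sign before the consonant that precedes it.
                out.insert(last, c);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // KA + vowel sign E, in logical order.
        String visual = toVisualOrder("\u0B95\u0BC6");
        visual.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

The reordered string would then be passed to showText in place of the original text.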
Case 4 - Base character follows the composite vowel - Need to split
and reorder the glyphs
க + ொ = கொ -> க + ெ + ா -> ெ + க + ா = கொ
க + ோ = கோ -> க + ே + ா -> ே + க + ா = கோ
க + ௌ = கௌ -> க + ெ + ள -> ெ + க + ள = கௌ
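A minimal Java sketch of the split-and-reorder step in Case 4, under the same caveats: the names are hypothetical and not PDFBox API, and the base is assumed to be the single character directly before the vowel sign. The split follows the Unicode canonical decompositions of the three two-part vowel signs. Note that for U+0BCC the right-hand part in Unicode is the AU length mark U+0BD7, whose glyph usually looks like ள, rather than the letter ள (U+0BB3) itself.

```java
/** Sketch: split Tamil two-part vowel signs and put the left part before the base. */
final class TamilTwoPartVowels {

    /** Returns {left, right} for a two-part vowel sign, or null for any other character. */
    static char[] split(char c) {
        switch (c) {
            case '\u0BCA': return new char[] {'\u0BC6', '\u0BBE'}; // O  = E  + AA
            case '\u0BCB': return new char[] {'\u0BC7', '\u0BBE'}; // OO = EE + AA
            case '\u0BCC': return new char[] {'\u0BC6', '\u0BD7'}; // AU = E  + AU length mark
            default: return null;
        }
    }

    /** Returns the text in visual (glyph) order; the logical order is lost. */
    static String toVisualOrder(String text) {
        StringBuilder out = new StringBuilder(text.length() + 2);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char[] parts = split(c);
            if (parts != null && out.length() > 0) {
                out.insert(out.length() - 1, parts[0]); // left part goes before the base
                out.append(parts[1]);                   // right part stays after the base
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // KA + vowel sign O, in logical order.
        String visual = toVisualOrder("\u0B95\u0BCA");
        visual.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```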
Case 5 - Base character and vowel need to map to a new glyph ID -
The resulting glyph has no Unicode code point of its own - Substitute
the new glyph for the sequence of glyphs
க + ு = கு -> கு
க + ூ = கூ -> கூ
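The substitution in Case 5 amounts to replacing a sequence of glyph IDs with one ligature glyph ID. The sketch below shows the idea with a hand-written lookup table; the class and method names are invented for illustration, and the glyph IDs are the ones listed in the table in this message, so they apply to this particular font only. In a real implementation the pairs would presumably come from the font's GSUB table rather than being hard-coded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: replace pairs of glyph IDs with a single ligature glyph ID. */
final class LigatureSubstitutionSketch {

    /** Scans left to right and substitutes every pair that has an entry in the table. */
    static List<Integer> substitute(List<Integer> gids, Map<List<Integer>, Integer> ligatures) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < gids.size()) {
            if (i + 1 < gids.size()) {
                Integer ligature = ligatures.get(Arrays.asList(gids.get(i), gids.get(i + 1)));
                if (ligature != null) {
                    out.add(ligature);
                    i += 2; // both input glyphs are consumed
                    continue;
                }
            }
            out.add(gids.get(i));
            i++;
        }
        return out;
    }

    /** Font-specific example pairs taken from the table in this message. */
    static Map<List<Integer>, Integer> exampleTable() {
        Map<List<Integer>, Integer> table = new HashMap<>();
        table.put(Arrays.asList(1828, 1854), 6698); // KA + vowel sign U
        table.put(Arrays.asList(1828, 1855), 6716); // KA + vowel sign UU
        return table;
    }

    public static void main(String[] args) {
        // KA + U followed by KA + AA: only the first pair has a ligature.
        System.out.println(substitute(Arrays.asList(1828, 1854, 1828, 1851), exampleTable()));
    }
}
```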
The same cases in table form. The "Char sequence" and "Code points"
columns come from the JDK, the "gids" column from the TTF, and the
"Actual" and "Expected" columns from the PDF generated by PDFBox.

Input | Char sequence | Code points | gids | Actual* | Expected | Note
------|---------------|-------------|------|---------|----------|-----
க் | க + ் | 2965 (U+0B95), 3021 (U+0BCD) | 1828, 1862 | க் | க் | All good
கா | க + ா | 2965 (U+0B95), 3006 (U+0BBE) | 1828, 1851 | கா | கா | All good
கி | க + ி | 2965 (U+0B95), 3007 (U+0BBF) | 1828, 1852 | கி | கி | All good
கீ | க + ீ | 2965 (U+0B95), 3008 (U+0BC0) | 1828, 1853 | கீ | கீ | All good
கு | க + ு | 2965 (U+0B95), 3009 (U+0BC1) | 1828, 1854 | கு | கு (gid = 6698) | New glyph expected
கூ | க + ூ | 2965 (U+0B95), 3010 (U+0BC2) | 1828, 1855 | கூ | கூ (gid = 6716) | New glyph expected
கெ | க + ெ | 2965 (U+0B95), 3014 (U+0BC6) | 1828, 1856 | கெ | ெ + க = கெ | Reversed glyphs expected
கே | க + ே | 2965 (U+0B95), 3015 (U+0BC7) | 1828, 1857 | கே | ே + க = கே | Reversed glyphs expected
கை | க + ை | 2965 (U+0B95), 3016 (U+0BC8) | 1828, 1858 | கை | ை + க = கை | Reversed glyphs expected
கொ | க + ொ | 2965 (U+0B95), 3018 (U+0BCA) | 1828, 1859 | கொ | க + ெ + ா -> ெ + க + ா = கொ | Split and reorder expected
கோ | க + ோ | 2965 (U+0B95), 3019 (U+0BCB) | 1828, 1860 | கோ | க + ே + ா -> ே + க + ா = கோ | Split and reorder expected
கௌ | க + ௌ | 2965 (U+0B95), 3020 (U+0BCC) | 1828, 1861 | கௌ | க + ெ + ள -> ெ + க + ள = கௌ | Split and reorder expected
* Actual - the dotted circle will be invisible.
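For reference, the decimal and hexadecimal code points listed in the table can be reproduced with plain JDK calls. This is a small sketch with a made-up class name; it does not involve PDFBox, and the glyph IDs still have to be looked up in the font.

```java
/** Sketch: list the code points of a string in decimal and U+XXXX form. */
final class CodePointDump {

    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(cp).append(String.format(" (U+%04X)", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // KA + virama; prints: 2965 (U+0B95) 3021 (U+0BCD)
        System.out.println(describe("\u0B95\u0BCD"));
    }
}
```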
Attached are the actual and the expected output. Just to obtain the
expected output, I made some hard-coded substitutions. For the glyph
IDs that have no Unicode code point, I hard-coded the mapping in
PDCIDFontType2#encode(int unicode). I reverse, split and reorder the
character sequence of the input text before calling showText. I also
added the glyph IDs that have no Unicode code point in the
TrueTypeEmbedder Subsetter, so that those glyphs get embedded into the
generated PDF.
How can these substitutions be handled efficiently? I have been
looking at GlyphSubstitutionTable, fontbox.cmap.Identity-H and
fontbox.unicode.Scripts.txt, but couldn't work it out so far. Any help
would be appreciated.
Thank you,
Jeyan