Hi,
This is a known weakness. There is an implementation for Bengali, but
not for other Indian languages; see
https://issues.apache.org/jira/browse/PDFBOX-4189
That implementation exists only in 3.0, and it doesn't handle text
extraction properly.
The code might be extended to Telugu if Telugu uses the same concepts
from the GSUB table.
If you, or anyone, is interested in this:
- get the source code
- look at GsubWorkerForBengali
- look at
https://learn.microsoft.com/en-us/typography/script-development/bengali
and compare with
https://learn.microsoft.com/en-us/typography/script-development/telugu
to see what might have to be done for a new GsubWorkerForTelugu
- possibly (not sure if needed) implement the TODOs in
GlyphSubstitutionTable and in GlyphSubstitutionDataExtractor
- possibly (not sure if needed) implement GPOS handling
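
As a starting point for the comparison above, here is a minimal sketch in
plain Java (no PDFBox calls) that lists the code points of the reported
word and finds the consonant + virama + consonant sequences, which are
the places where a GSUB-based worker would have to substitute glyphs.
The class and method names are hypothetical and not part of PDFBox, and
the consonant check only covers the main Telugu range U+0C15..U+0C39, so
treat it as an illustration and not as a shaping algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class TeluguConjunctFinder {
    // TELUGU SIGN VIRAMA, which suppresses the inherent vowel of a consonant.
    static final int VIRAMA = 0x0C4D;

    // Assumption: only the main consonant range KA..HA is checked.
    static boolean isConsonant(int cp) {
        return cp >= 0x0C15 && cp <= 0x0C39;
    }

    // Returns every consonant + virama + consonant sequence in the text.
    static List<String> findConjuncts(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 2 < cps.length; i++) {
            if (isConsonant(cps[i]) && cps[i + 1] == VIRAMA && isConsonant(cps[i + 2])) {
                result.add(new String(cps, i, 3));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The word from the report, written with escapes to avoid encoding issues.
        String word = "\u0C35\u0C3E\u0C15\u0C4D\u0C2F\u0C42\u0C2E\u0C4D";
        for (int cp : word.codePoints().toArray()) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
        System.out.println("Conjunct candidates: " + findConjuncts(word));
    }
}
```

For the reported word this finds a single cluster, KA + virama + YA,
which matches the missing half "y" described in the original mail. The
trailing MA + virama is not followed by a consonant, so it is not
reported.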
Tilman
On 21.06.2023 03:26, Ravi, Swetha wrote:
Hi Apache Pdfbox team,
I am working with the MediaConvert team at AWS Elemental. We use the
ttt:ttpe tool for rendering captions in TTML onto the video file. We
found issues when rendering a few words in the Telugu language using
the PDFBox tool. For example, the Telugu word వాక్యూమ్ is not rendered
properly. I have attached the rendering as a PDF file and the input as
an image file with this email. To be specific, the word is `vacuum`,
and the rendering of the half "y" sound is missing in the image, so I
suspect that half-consonant rendering is the issue. I tried using the
latest version of PDFBox to create a PDF for this text (output is
attached).
Could you please take a look at this issue and let me know if we have
any workaround, or if we can have a fix for this issue in the near future?
Thank you,
Swetha Ravi
Software Development Engineer
AWS Elemental Mediaconvert