Hi,

This is a known weakness. There is an implementation for Bengali, but not for other Indian languages:
https://issues.apache.org/jira/browse/PDFBOX-4189
It is only available in 3.0, and text extraction doesn't work properly with it.
That code might be extended for Telugu if Telugu uses the same concepts from the GSUB table.

If you, or anyone, is interested in this:
- get the source code
- look at GsubWorkerForBengali
- look at https://learn.microsoft.com/en-us/typography/script-development/bengali
and compare with
https://learn.microsoft.com/en-us/typography/script-development/telugu
to see what might have to be done for a new GsubWorkerForTelugu
- possibly (not sure if needed) implement the TODOs in GlyphSubstitutionTable and in GlyphSubstitutionDataExtractor
- possibly (not sure if needed) implement GPOS handling
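
To get a feeling for what a GsubWorkerForTelugu would have to handle, here is a rough sketch (not PDFBox code; the class and method names are made up for illustration) that finds "consonant + virama + consonant" sequences in a string. These are the places where a font would normally apply GSUB substitutions to form a half or subjoined consonant. It assumes only the basic Telugu consonant range U+0C15..U+0C39, which may not cover every case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of PDFBox.
class TeluguConjunctSketch {
    // U+0C4D TELUGU SIGN VIRAMA
    static final int VIRAMA = 0x0C4D;

    // Assumption: only the basic consonant range KA (U+0C15) .. HA (U+0C39) is checked.
    static boolean isConsonant(int cp) {
        return cp >= 0x0C15 && cp <= 0x0C39;
    }

    // Returns the code point indices where a consonant is followed by
    // a virama and another consonant, i.e. a conjunct candidate.
    static List<Integer> findConjuncts(String text) {
        int[] cps = text.codePoints().toArray();
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i + 2 < cps.length; i++) {
            if (isConsonant(cps[i]) && cps[i + 1] == VIRAMA && isConsonant(cps[i + 2])) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The word from the report, written as escapes: VA, AA, KA, VIRAMA, YA, UU, MA, VIRAMA
        String word = "\u0C35\u0C3E\u0C15\u0C4D\u0C2F\u0C42\u0C2E\u0C4D";
        for (int i : findConjuncts(word)) {
            System.out.printf("conjunct candidate at code point index %d%n", i);
        }
    }
}
```

For the word from the report this flags the KA + VIRAMA + YA sequence, which matches the missing "half y" described below. The trailing MA + VIRAMA is not flagged because no consonant follows it.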

Tilman

On 21.06.2023 03:26, Ravi, Swetha wrote:

Hi Apache Pdfbox team,

I am working with the Mediaconvert team at AWS Elemental. We use the ttt:ttpe tool to render captions in TTML onto the video file. We found issues when rendering a few words in the Telugu language using the pdfbox tool. For example, the Telugu word వాక్యూమ్ is not rendered properly. I have attached the rendering as a PDF file and the input as an image file with this email. To be specific, the word is `vacuum`, and the rendering of the half y sound is missing in the image. So I suspect half-consonant rendering is the issue. I tried using the latest version of pdfbox to create a PDF for this text (output is attached).

Could you please take a look at this issue and let me know if we have any workaround, or if we can have a fix for this issue in the near future?

Thank you,

Swetha Ravi

Software Development Engineer

AWS Elemental Mediaconvert

