[jira] [Created] (PDFBOX-5126) Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on text extraction

Jira Mon, 08 Mar 2021 14:58:07 -0800

Gábor Stefanik created PDFBOX-5126:
--------------------------------------


             Summary: Complex Unicode glyphs (surrogate pairs, combining 
diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on 
text extraction
                 Key: PDFBOX-5126
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5126
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.22
            Reporter: Gábor Stefanik
         Attachments: rovasvegyes.pdf

The attached PDF contains Old Hungarian runic script, which is both 
right-to-left and outside Unicode's Basic Multilingual Plane (and thus encoded 
as surrogate pairs in Java's internal UTF-16-like representation). When this 
text is extracted, each surrogate pair is itself reversed (low surrogate before 
high surrogate) because the reversal is done naively at the "char" level, 
producing malformed Unicode output.
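
To illustrate the surrogate-pair problem described above, here is a minimal 
Java sketch. It is not PDFBox's actual code: the class and method names 
(SurrogateReversal, naiveReverse, codePointReverse, isWellFormed) are 
hypothetical, and the input is simply two Old Hungarian code points (U+10C80 
and U+10C81) chosen for the demonstration.

```java
class SurrogateReversal {
    // Naive reversal: swaps individual UTF-16 code units, so every
    // surrogate pair comes out as low surrogate followed by high surrogate.
    static String naiveReverse(String s) {
        char[] in = s.toCharArray();
        char[] out = new char[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[in.length - 1 - i];
        }
        return new String(out);
    }

    // Code-point-aware reversal: walks backwards one code point at a time,
    // so each surrogate pair is kept intact.
    static String codePointReverse(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = s.length();
        while (i > 0) {
            int cp = s.codePointBefore(i);
            sb.appendCodePoint(cp);
            i -= Character.charCount(cp);
        }
        return sb.toString();
    }

    // Returns true if every surrogate in s is part of a valid high+low pair.
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two Old Hungarian letters, each outside the Basic Multilingual Plane.
        String s = new String(Character.toChars(0x10C80))
                 + new String(Character.toChars(0x10C81));
        System.out.println("naive well-formed:      " + isWellFormed(naiveReverse(s)));
        System.out.println("code point well-formed: " + isWellFormed(codePointReverse(s)));
    }
}
```

The JDK's own StringBuilder.reverse() already treats surrogate pairs as single 
characters, so it yields the same result as codePointReverse here. It does not 
keep combining marks with their base character, however.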

Likewise, when combining diacritics/modifiers occur in a right-to-left context, 
they end up on the wrong side of their base ("parent") character after 
reversal, and so they attach to the wrong glyph, as demonstrated by the Hebrew 
sample in the same PDF. I imagine the same thing would also happen to emoji 
using the "zero-width joiner" in an RTL context.
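
The combining-mark case can be sketched in the same way. The following is an 
assumption-laden illustration, not a proposed patch: the names ClusterReversal, 
isCombiningMark and reverseByCluster are hypothetical, the input is assumed to 
store each mark directly after its base character, and only Unicode general 
categories Mn, Mc and Me are treated as combining. Zero-width-joiner sequences 
are deliberately not handled; full coverage would need extended grapheme 
cluster segmentation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ClusterReversal {
    // True for code points in general categories Mn, Mc or Me.
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Reverses the order of clusters, where a cluster is one base code point
    // plus any combining marks that immediately follow it. The content of
    // each cluster is left in its original order.
    static String reverseByCluster(String s) {
        List<String> clusters = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int start = i;
            i += Character.charCount(s.codePointAt(i)); // base code point
            while (i < s.length() && isCombiningMark(s.codePointAt(i))) {
                i += Character.charCount(s.codePointAt(i)); // trailing marks
            }
            clusters.add(s.substring(start, i));
        }
        Collections.reverse(clusters);
        return String.join("", clusters);
    }

    public static void main(String[] args) {
        // Hebrew bet + dagesh (a combining point), followed by alef.
        String s = "\u05D1\u05BC\u05D0";
        String perCodePoint = new StringBuilder(s).reverse().toString();
        String perCluster = reverseByCluster(s);
        // Per-code-point reversal moves the dagesh onto the alef;
        // per-cluster reversal keeps it with the bet.
        System.out.println(perCodePoint.equals(perCluster));
    }
}
```

With this input, a per-code-point reversal places the dagesh after the alef, 
whereas the cluster-based version keeps it attached to the bet.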



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
