Gábor Stefanik created PDFBOX-5126: --------------------------------------
Summary: Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on text extraction Key: PDFBOX-5126 URL: https://issues.apache.org/jira/browse/PDFBOX-5126 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 2.0.22 Reporter: Gábor Stefanik Attachments: rovasvegyes.pdf The attached PDF contains old Hungarian runic script, which is both right-to-left and outside Unicode's Basic Multilingual Plane (and thus encoded as surrogate pairs in Java's internal UTF-16-like representation). When this text is extracted, the surrogate pairs are reversed due to an overly naive use of "char"-level reversal, leading to malformed Unicode output. Likewise, when combining diacritics/modifiers occur in a right-to-left context, their position relative to the "parent" character is reversed, and so they appear on the wrong glyph, as demonstrated by the Hebrew sample in the same PDF. I imagine the same thing would also happen to emoji using the "zero-width joiner" in an RTL context. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org