On 10/21/19 4:06 PM, Jonathan Lennox wrote:
> The right concept here is probably "grapheme clusters", as defined in
> Unicode Standard Annex 29.  ICU has support for this.

We should refrain from using things like grapheme clusters in wire formats. Their definition is subject to change in upcoming Unicode versions, so the wire format would be interpreted differently depending on the Unicode version the client implements.

Technically, we could also agree on using a certain Unicode version now and for all eternity, but this sounds like a stupid concept: people will end up using ICU or a similar library, which will break eventually as the standard changes.

We should strive for maximum compatibility. This gives us basically two options: bytes and codepoints. As our encoding is fixed to UTF-8 per RFC 6120, both would be equally understandable by clients. However, there are two good reasons against bytes:

1) At some point we might want to allow the use of UTF-16 or another encoding. Byte counts would then have to be translated when re-encoding, which a server is probably unable to do generically.

2) Starting a link or bold span inside a codepoint has no useful meaning. Depending on the tech stack used, byte offsets might lead developers to unintentionally allow the generation of invalidly encoded strings, causing all kinds of issues (including potential security impact).
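To illustrate both points, here is a minimal Python sketch. The sample string and the helper is_valid_utf8 are made up for this example and not taken from any XEP. It shows that the same position has different offsets depending on the counting unit, and that a byte offset can land inside a codepoint.

```python
# Sample text: "naïve", a two-codepoint flag, then the word "flag".
text = "na\u00efve \U0001F1E9\U0001F1EA flag"

# Offset of "flag" counted in codepoints (Python str indexes by codepoint).
cp_start = text.index("flag")

# The same position counted in UTF-8 bytes and in UTF-16 code units.
utf8_start = len(text[:cp_start].encode("utf-8"))
utf16_start = len(text[:cp_start].encode("utf-16-le")) // 2


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


data = text.encode("utf-8")
# Byte offset 3 falls inside the two-byte encoding of U+00EF,
# so cutting there leaves a truncated, invalid sequence.
cut_inside = data[:3]
cut_boundary = data[:4]

print(cp_start, utf8_start, utf16_start)
print(is_valid_utf8(cut_inside), is_valid_utf8(cut_boundary))
```

The codepoint offset stays the same no matter how the string is encoded, while the byte offset is only meaningful for UTF-8.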

Thus, I would vote for using codepoints. This of course raises the question of what happens if multiple codepoints form a single grapheme and an offset points inside that grapheme. The rule should simply be that clients should not do that on outgoing data. If a client receives input pointing inside a grapheme, it is implementation-defined whether the grapheme is included, excluded or split. In practice this shouldn't happen, so I doubt it is really worth defining a rule for it in the respective XEP, but that would also be an option.
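As a rough sketch of what codepoint-based ranges look like on the receiving side (the function split_span and the sample message are hypothetical, and "split" is just one of the three implementation-defined behaviours mentioned above):

```python
def split_span(text, start, end):
    """Split text into (before, marked, after) using codepoint offsets.

    Hypothetical helper: start is inclusive, end is exclusive.
    """
    if not 0 <= start <= end <= len(text):
        raise ValueError("span out of range")
    return text[:start], text[start:end], text[end:]


# A flag is two regional indicator codepoints rendered as one grapheme.
flag = "\U0001F1E9\U0001F1EA"
msg = "Hi " + flag + "!"

# This span covers only the first half of the flag, i.e. it points
# inside the grapheme. This implementation simply splits it.
before, marked, after = split_span(msg, 3, 4)

# Every part is still a valid string and can be encoded as UTF-8,
# which would not be guaranteed with byte offsets.
encoded_parts = [part.encode("utf-8") for part in (before, marked, after)]
```

Whatever a client chooses to render here, the parts remain validly encoded, so the worst case is an odd-looking glyph rather than a broken string.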

By the way, the often-mentioned flag example is not consistent across browsers either; try https://larma.de/splitflag.html with various browsers and browser versions. (Bonus task: build a browser detector based on flag rendering.)
