On Thu, Oct 24, 2019, at 18:32, Marvin W wrote:
> 1) At some point we might want to allow the usage of UTF-16 or any
> other encoding. Byte counts would have to be translated when
> re-encoding which a server is probably unable to do generically.

XMPP uses UTF-8, and there's almost no reason to use anything but UTF-8.
On the public network, I think it's safe to operate under the assumption
that this will never change. If it ever does, we'll have lots of work
and bad assumptions to modify anyways, so one more won't hurt. Assuming
UTF-8 drastically simplifies a lot, so it doesn't seem worth changing
that assumption for a hypothetical.

> 2) There is no useful meaning of starting a link or bold inside a
> codepoint. Depending on the tech stack used, it might cause
> developers to unintentionally allow the generation of invalidly
> encoded strings, causing all kind of issues (including potential
> security impact)

This problem exists with codepoints too, though to a lesser extent, and
it may be less clear how it should be handled in all cases: for example,
a marker in the middle of a multi-codepoint emoji or country flag. By
contrast, if the start or end marker falls between the bytes of the
UTF-8 encoding of a single codepoint, it is easier to detect, and is
clearly an error.

There's also the minor problem of having to decode all the bytes up to
the start position at the application layer if we have to count
codepoints. With bytes you only have two checks: are the start and the
end markers each on a codepoint boundary? If so, the string in the
middle can be assumed to be valid.

—Sam

-- 
Sam Whited
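
For illustration, here is a minimal Python sketch of the two-check idea
described above. The function names is_boundary and valid_span are
hypothetical, and the sketch assumes the message body as a whole has
already been validated as UTF-8. It relies only on the fact that UTF-8
continuation bytes always have the bit pattern 10xxxxxx, so no decoding
of the preceding bytes is needed.

```python
def is_boundary(data: bytes, offset: int) -> bool:
    """Return True if the byte offset falls between two codepoints.

    Assumes `data` is already known to be valid UTF-8.
    """
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True
    # Continuation bytes look like 0b10xxxxxx; a codepoint never starts with one.
    return (data[offset] & 0xC0) != 0x80


def valid_span(data: bytes, start: int, end: int) -> bool:
    """Check only the two markers; the bytes in between need no inspection."""
    return start <= end and is_boundary(data, start) and is_boundary(data, end)


# "naive" with a diaeresis on the i, a space, then the two regional
# indicator codepoints that together render as a country flag.
text = "na\u00efve \U0001F1E9\U0001F1EA".encode("utf-8")

inside_one_codepoint = valid_span(text, 3, 4)   # starts inside the 2-byte "ï"
around_one_codepoint = valid_span(text, 2, 4)   # exactly covers "ï"
splits_the_flag = valid_span(text, 7, 11)       # first half of the flag only
```

Note that the last span passes the check even though it cuts a flag in
half: both markers sit on codepoint boundaries, which is the case where
counting codepoints instead of bytes would not help either.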