On Thu, Oct 24, 2019, at 18:32, Marvin W wrote:
> 1) At some point we might want to allow the usage of UTF-16 or any
>    other encoding. Byte counts would have to be translated when re-
>    encoding which a server is probably unable to do generically.

XMPP uses UTF-8, and there's almost no reason to use anything but UTF-8.
On the public network, I think it's safe to operate under the assumption
that this will never change. If it ever does, we'll have lots of work
and bad assumptions to modify anyways, so one more won't hurt. Assuming
UTF-8 simplifies a great deal, so it doesn't seem worth dropping that
assumption for a hypothetical.

> 2) There is no useful meaning of starting a link or bold inside a
>    codepoint. Depending on the tech stack used, it might cause
>    developers to unintentionally allow the generation of invalidly
>    encoded strings, causing all kind of issues (including potential
>    security impact)

This problem exists with codepoints too, though to a lesser extent, and
there it may be less clear how it should be handled in all cases: for
example, when a marker falls in the middle of a multi-codepoint emoji or
a country flag. By contrast, if the start or end marker falls between
bytes in the UTF-8 encoding of a single codepoint, it is easier to
detect, and is clearly an error.
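
To make the difference concrete, here is a rough Python sketch (purely
illustrative, not taken from any XEP or implementation; the helper name
is_valid_utf8 is made up for this example). It splits a flag emoji, which
is two regional-indicator codepoints, first between codepoints and then
inside the UTF-8 bytes of one codepoint:

```python
# The German flag: U+1F1E9 U+1F1EA (regional indicators D and E).
flag = "\U0001F1E9\U0001F1EA"
data = flag.encode("utf-8")  # 4 bytes per codepoint, 8 bytes total


def is_valid_utf8(b: bytes) -> bool:
    """Hypothetical helper: True if b decodes as strict UTF-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Split between the two codepoints: both halves are well-formed strings,
# so nothing signals that the visible flag was torn apart.
left, right = flag[:1], flag[1:]

# Split inside the first codepoint's bytes: both halves are invalid UTF-8,
# which is easy to detect and unambiguous to reject.
bad_left, bad_right = data[:2], data[2:]
```

The codepoint split goes unnoticed by any encoding check, whereas the
byte split fails validation on both sides.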

There's also the minor problem of having to decode all the bytes up to
the start position at the application layer if we have to count
codepoints. With bytes you only have two checks: are the start and the
end markers each on a codepoint boundary? If so, and the message as a
whole is valid UTF-8, the string in the middle can be assumed to be
valid too.
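
A minimal sketch of those two checks in Python, assuming the body has
already been validated as UTF-8 and that offsets are byte offsets with an
exclusive end (the function names on_codepoint_boundary and valid_span
are hypothetical, not from any spec or library). It relies only on the
fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx:

```python
def on_codepoint_boundary(data: bytes, offset: int) -> bool:
    """True if offset does not fall inside a multi-byte UTF-8 sequence.

    Assumes data is already known to be valid UTF-8.
    """
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True  # the end of the buffer is always a boundary
    # Continuation bytes look like 0b10xxxxxx; anything else starts a codepoint.
    return (data[offset] & 0xC0) != 0x80


def valid_span(data: bytes, start: int, end: int) -> bool:
    """The two checks: both markers must sit on codepoint boundaries."""
    return (
        start <= end
        and on_codepoint_boundary(data, start)
        and on_codepoint_boundary(data, end)
    )
```

Each check inspects a single byte, so nothing before the start offset
has to be decoded.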


—Sam

-- 
Sam Whited
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________
