On 10/25/19 3:15 PM, Sam Whited wrote:
On Thu, Oct 24, 2019, at 18:32, Marvin W wrote:
XMPP uses UTF-8, and there's almost no reason to use anything but UTF-8.
I do agree that this is true inside XMPP, but the data being transported
inside XMPP might be transcoded to a non-XMPP transport (examples: bridges
to other networks, clients that don't do XMPP on c2s connections) and
for those use-cases different encodings might occur. We shouldn't focus
on non-UTF-8 encodings, but considering it also doesn't hurt.
This problem exists with codepoints too, though to a lesser extent and
it may be less clear how it should be handled in all cases. For example,
in the middle of a multi-codepoint emoji or country flag.
Yes and no. Multi-codepoint emojis are still valid characters when
split, whereas multi-byte codepoints cannot be split. There is nothing
wrong with displaying the flag 🇪🇺 as 🇪🇺 *, so your implementation
is always capable of strictly following any markup being done on a
codepoint basis, even if the markup border is inside a multi-codepoint emoji.
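
A minimal Python sketch of that difference, under the assumption that the
text arrives as UTF-8. The variable names are purely illustrative and not
taken from any XEP:

```python
# The EU flag is two regional indicator codepoints (E and U).
flag = "\U0001F1EA\U0001F1FA"

# Splitting between the two codepoints gives two strings that are each
# valid on their own and survive a UTF-8 round trip.
first, second = flag[:1], flag[1:]
first_ok = first.encode("utf-8").decode("utf-8") == first
second_ok = second.encode("utf-8").decode("utf-8") == second

# Splitting inside a multi-byte codepoint gives bytes that cannot be decoded.
raw = flag.encode("utf-8")  # 8 bytes, 4 per codepoint
try:
    raw[:2].decode("utf-8")
    byte_split_ok = True
except UnicodeDecodeError:
    byte_split_ok = False
```

Here the codepoint split only changes how the flag renders, while the byte
split produces data that is not valid UTF-8 at all.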
There's also the minor problem of having to decode all the bytes up to
the start position at the application layer if we have to count codepoints.
Some programming languages handle strings in Unicode codepoints instead
of bytes. I agree that this would be an issue for non-messaging content
(i.e. large files), but I don't think that is what we are talking about. For
messaging content, it's no issue that the client has to decode all the bytes - it
will be required to do so anyway for displaying.
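
As a rough illustration of the conversion cost being discussed, here is a
Python sketch that maps between codepoint offsets and UTF-8 byte offsets.
The helper names are hypothetical, and Python is used only because its
strings are indexed by codepoint:

```python
def codepoint_to_byte_offset(text: str, cp_offset: int) -> int:
    """Return the UTF-8 byte offset that corresponds to a codepoint offset."""
    # Encoding the prefix is the extra work a byte-oriented client would do.
    return len(text[:cp_offset].encode("utf-8"))


def byte_to_codepoint_offset(data: bytes, byte_offset: int) -> int:
    """Return the codepoint offset that corresponds to a UTF-8 byte offset.

    Raises UnicodeDecodeError if byte_offset falls inside a codepoint.
    """
    # Decoding the prefix is the extra work a codepoint-oriented client would do.
    return len(data[:byte_offset].decode("utf-8"))


message = "h\u00e9llo \U0001F1EA\U0001F1FA"
flag_start_bytes = codepoint_to_byte_offset(message, 6)
flag_start_codepoints = byte_to_codepoint_offset(message.encode("utf-8"), 7)
```

Either direction walks the text up to the offset, which is cheap for
message-sized content.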
With bytes you only have two checks: is the start and the
end marker on a byte boundary? If so the string in the middle can be
assumed to be valid.
Assuming you meant codepoint boundary instead of byte boundary, I agree
that this would also be an option, as long as we make sure people
actually do these checks. I personally prefer codepoints, but both are
valid and sane options - as long as we don't go with grapheme clusters or
anything like this, we are fine IMO.
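
For the byte-offset variant, the check could look roughly like the
following Python sketch. It assumes the payload is already valid UTF-8, and
the function names are made up for illustration:

```python
def is_codepoint_boundary(data: bytes, offset: int) -> bool:
    """Check whether a byte offset falls between two UTF-8 codepoints.

    Assumes `data` is valid UTF-8.
    """
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def markers_are_valid(data: bytes, start: int, end: int) -> bool:
    """Run the two checks: both markers must sit on codepoint boundaries."""
    return (
        start <= end
        and is_codepoint_boundary(data, start)
        and is_codepoint_boundary(data, end)
    )


payload = "a\u00e9\U0001F1EA".encode("utf-8")  # 1 + 2 + 4 = 7 bytes
```

With this payload, offsets 0, 1, 3 and 7 are boundaries, and every other
offset falls inside a codepoint.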
* I put a zero-width space in there to ensure your mail client is not
going to merge the two characters.
Standards mailing list