>> The decoding _should_ be done upfront - that's how you get a valid XML 
>> document.

> I don't think this is true. XML is defined as UTF-8 (in this case),
> which is a collection of bytes. They don't have to be separated out and
> transformed into some higher representation of code points. Just because
> Python et al. convert things into UTF-32 strings first doesn't mean
> everything has to.
>
> Regardless of what language you're using it's trivial to deal with this
> as a UTF-8 byte stream, it is not always trivial to handle this as a UTF-
> 32 integer stream as the example shows.

XML is defined as a sequence of characters; it doesn't specify how those 
characters must be encoded (though it does require support for both UTF-8 and 
UTF-16). UTF-7/8/16/32 are encoding schemes, not character representations - 
people do make the mistake of conflating the two things, but that doesn't mean 
they are the same.
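
To make that distinction concrete, here's a minimal Python 3 sketch (Python 
being one of the languages mentioned above) showing that the same sequence of 
characters produces entirely different byte sequences under different 
encoding schemes:

    text = '<body>caf\u00e9</body>'   # 17 characters, regardless of encoding
    print(len(text))                  # 17 (Python 3 counts code points)
    print(text.encode('utf-8'))       # 18 bytes: 'é' becomes b'\xc3\xa9'
    print(text.encode('utf-16-le'))   # 34 bytes: two bytes per character here
    print(text.encode('iso-8859-1'))  # 17 bytes: 'é' is the single byte b'\xe9'

The character sequence is the invariant; the bytes are an artefact of 
whichever scheme happened to be chosen.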

Unicode doesn't specify the size of characters - they don't have a specific 
bit-width; they are as large as required. An encoding scheme is then a method 
for transforming characters into a sequence of bytes. It shouldn't matter 
which encoding scheme is used - UTF-8, UTF-16, ISO-8859-9, ISO-2022-JP, 
Shift_JIS, and EBCDIC are all possibilities - because you're supposed to 
decode the data into characters before doing anything with it.
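
For example (again a small Python 3 sketch), a single character outside the 
Basic Multilingual Plane is one code point, but how many bytes it occupies 
depends entirely on the scheme:

    ch = '\U0001F600'             # one character: code point U+1F600
    print(len(ch))                # 1
    print(ch.encode('utf-8'))     # 4 bytes: b'\xf0\x9f\x98\x80'
    print(ch.encode('utf-16-be')) # 4 bytes: a surrogate pair, b'\xd8=\xde\x00'
    print(ch.encode('utf-32-be')) # 4 bytes: b'\x00\x01\xf6\x00'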

The fact that you're able to take advantage of the foreknowledge that your 
data is encoded using UTF-8 is purely because XMPP happens to define it that 
way, not because XML is defined in terms of any specific encoding scheme. 
Basing your entire implementation around the expectation of UTF-8 allows you 
to take some convenient short-cuts, but much of that only works because XML 
markup uses ASCII-compatible characters, which conveniently have an 
equivalent single-byte representation when encoded as UTF-8; with almost any 
other encoding it simply wouldn't work without some form of decoding first. 
If you insist on not decoding, and then run into difficulties handling 
characters - because you're purposely avoiding handling characters while 
simultaneously using XML, which is defined as a sequence of characters - then 
an appropriate response is "what did you expect?"
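
As a quick illustration of why those short-cuts are UTF-8-specific: naive 
byte-level scanning for markup only works when the encoding is 
ASCII-compatible (a Python 3 sketch):

    xml = '<a>caf\u00e9</a>'
    print(b'<a>' in xml.encode('utf-8'))      # True: markup bytes are plain ASCII
    print(b'<a>' in xml.encode('utf-16-le'))  # False: every code unit is two bytes
    print(b'<a>' in xml.encode('utf-32-le'))  # False: four bytes per character

With UTF-16 or UTF-32 you have to decode before you can even locate the 
markup.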

It's not trivial to handle everything as UTF-8 in implementations where the 
application receives already-decoded strings (a sequence of characters, not 
bytes) from the XML parser. The most likely approach to dealing with that is 
to re-encode the already-decoded data back into UTF-8 just to deal with the 
offsets, which is precisely the kind of inefficient processing you're 
suggesting should be avoided. And considering that the whole purpose of 
references is to mark sequences of characters, those characters are going to 
be decoded at some point; you're trying to avoid decoding early, while still 
validating offsets, only for the decoding to be done later anyway.
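
To sketch what that re-encoding looks like in practice (Python 3; the 
byte-based offset here is hypothetical, purely for illustration):

    body = 'na\u00efve text'       # already decoded by the XML parser
    byte_offset = 4                # hypothetical UTF-8 byte offset
    # Translating it into a character index means re-encoding the data
    # the parser just decoded:
    prefix = body.encode('utf-8')[:byte_offset]
    char_offset = len(prefix.decode('utf-8'))  # 3: 'n', 'a', 'ï'
    # And a byte offset of 3 would land inside the two-byte sequence
    # for 'ï', raising UnicodeDecodeError - it wouldn't denote a
    # character boundary at all.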

Regardless, your argument is still "bytes is more convenient for me, so 
everyone else should do what's best for me." I don't think that's a good 
argument.
