Mark said in the UTF-8 / UTF-16 discussion:
However, I am thinking of following Simon's excellent suggestion.

What do you think of his suggestion of using "code point" counting for the length and position attributes? That would essentially turn XMPP RTT into a standard for editing an array of 32-bit integers (allowing use of native UCS-4 string functions in programming languages that store strings in UCS-4 format). It makes my 16-bit programming slightly more complicated, but it is much easier than counting in UTF-8. It might be a better long-term goal.

Opinion?

Yes, counting in code points is the right decision. You do not need to comment on what that means for the programmer. Some may want to work in native UTF-8. There, a Unicode code point is well defined as a UTF-8 sequence of 1 to 4 bytes, easily isolated because every byte after the first has the form 10xxxxxx.
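A minimal sketch of the UTF-8 case, assuming the input is already well-formed UTF-8 (no validation is attempted). The function name is hypothetical and not part of any protocol text; it counts every byte that is not a continuation byte, which gives one count per code point.

```python
def count_code_points_utf8(data: bytes) -> int:
    """Count code points in well-formed UTF-8.

    Assumption: `data` is valid UTF-8. Continuation bytes have the bit
    pattern 10xxxxxx, so every other byte starts a new code point.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# "h", "e with acute", "l", "l", "o", euro sign, U+1F600: 7 code points, 13 bytes
sample = "h\u00e9llo\u20ac\U0001F600".encode("utf-8")
print(len(sample), count_code_points_utf8(sample))  # prints "13 7"
```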

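Some may want to work in UTF-16. They then need to watch out for 16-bit values in the surrogate range 0xD800 to 0xDFFF and count each pair of such values (a high surrogate, 0xD800 to 0xDBFF, followed by a low surrogate, 0xDC00 to 0xDFFF) as 1 code point, while every other 16-bit value is 1 code point on its own.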
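The UTF-16 case could be sketched as follows. Both function names are hypothetical, and counting an unpaired surrogate as 1 code point is an assumption made here for robustness, not something stated in this thread.

```python
def to_utf16_units(text: str) -> list:
    """Helper for the example: split a string into 16-bit UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def count_code_points_utf16(units) -> int:
    """Count code points in a sequence of UTF-16 code units."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A high surrogate followed by a low surrogate is one code point.
        # Assumption: anything else, including a lone surrogate, counts as one.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


units = to_utf16_units("a\U0001F600b")  # [0x61, 0xD83D, 0xDE00, 0x62]
print(len(units), count_code_points_utf16(units))  # prints "4 3"
```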

And some may want to work in 32-bit Unicode (UTF-32 / UCS-4), where every 32-bit value is exactly 1 code point.

Just specify that in the protocol, p and n are counted in Unicode code points.
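For an implementation that stores text as UTF-16, a code point position then has to be translated into a native index before editing. A rough sketch, assuming p is a zero-based count of code points from the start of the text; the function name is hypothetical and the handling of lone surrogates is an assumption.

```python
def utf16_index_for_code_point(units, p: int) -> int:
    """Return the UTF-16 code-unit index at which code point number `p` starts.

    Assumption: `p` is zero-based; p equal to the number of code points
    yields the index just past the end of the text.
    """
    index = 0
    for _ in range(p):
        if index >= len(units):
            raise IndexError("position is past the end of the text")
        is_high = 0xD800 <= units[index] <= 0xDBFF
        has_low = index + 1 < len(units) and 0xDC00 <= units[index + 1] <= 0xDFFF
        # Skip two code units for a surrogate pair, one otherwise.
        index += 2 if (is_high and has_low) else 1
    return index


# "a", U+1F600 (a surrogate pair), "b" as UTF-16 code units
example_units = [0x61, 0xD83D, 0xDE00, 0x62]
print(utf16_index_for_code_point(example_units, 2))  # prints "3"
```

A length n counted in code points can be handled the same way, by converting both p and p + n to native indices and taking the difference.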

/Gunnar
