Mark said in the UTF-8 / UTF-16 discussion:
However, I am thinking of following Simon's excellent suggestion.
What do you think of his suggestion of using "code point" counting for
length and position attributes?
That would essentially turn XMPP RTT into a standard for editing an
array of 32-bit integers (allowing use of native UCS4 string functions
in programming languages that store strings in UCS4 format). It makes
my 16-bit programming slightly more complicated, but it is much easier
than counting in UTF-8. It might be a better long-term goal.
Opinion?
Yes, counting in code points is the right decision. You do not need to
comment on what that means for the programmer.
Some may want to work in native UTF-8. There, a Unicode code point is
well defined as a UTF-8 sequence of 1 to 4 bytes, easily isolated.
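
To illustrate the UTF-8 case, here is a minimal sketch in Python. It
assumes the input is well-formed UTF-8 and does no validation; the
function name is only illustrative and not part of any specification.

```python
def count_code_points_utf8(data: bytes) -> int:
    """Count code points in well-formed UTF-8 (no validation is done)."""
    # Continuation bytes look like 10xxxxxx. Every other byte starts a
    # new 1 to 4 byte sequence, which is exactly one code point.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# "a" (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
print(len(sample), count_code_points_utf8(sample))
```

The byte length and the code point count differ as soon as any
non-ASCII character is present, which is why the unit of counting has
to be stated.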
Some may want to work in UTF-16. They then need to watch out for 16-bit
values in the surrogate range 0xD800 to 0xDFFF and count each pair of
such code units as 1 code point, while every other 16-bit code unit is
1 code point on its own.
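
A sketch of the UTF-16 case in Python, under the assumption that the
text is available as a list of 16-bit code unit values. Both helper
names are hypothetical. An unpaired surrogate is ill-formed input; this
sketch simply counts it as 1.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


def count_code_points_utf16(units):
    """Count code points, treating each surrogate pair as one."""
    count = 0
    i = 0
    while i < len(units):
        is_pair = (0xD800 <= units[i] <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        i += 2 if is_pair else 1  # skip both halves of a pair
        count += 1
    return count


units = utf16_units("a\U0001F600b")  # U+1F600 needs a surrogate pair
print(len(units), count_code_points_utf16(units))
```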
And some may want to work in the 32-bit expanded Unicode (UTF-32),
where each value is one code point.
Just specify in the protocol that p and n are counted in Unicode code
points.
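
For an implementation that stores text as UTF-16, a position p given in
code points has to be mapped to a code unit index before editing. A
rough sketch in Python follows; the function name and the choice to
raise IndexError are assumptions for illustration, not protocol
behaviour.

```python
def utf16_index_for_code_point(units, p):
    """Return the code unit index at which code point number p starts."""
    index = 0
    for _ in range(p):
        if index >= len(units):
            raise IndexError("position p is past the end of the text")
        is_pair = (0xD800 <= units[index] <= 0xDBFF
                   and index + 1 < len(units)
                   and 0xDC00 <= units[index + 1] <= 0xDFFF)
        index += 2 if is_pair else 1  # a surrogate pair is one code point
    return index


# "a", U+1F600 as a surrogate pair, "b"
text_units = [0x61, 0xD83D, 0xDE00, 0x62]
print([utf16_index_for_code_point(text_units, p) for p in range(4)])
```

A length n can be handled the same way, by mapping both p and p + n to
code unit indexes and editing the range between them.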
/Gunnar