On Fri Jun 24 02:54:12 2011, Peter Saint-Andre wrote:
On 6/23/11 12:41 AM, Mark Rejhon wrote:
> Opinion?
On the wire there is no such thing as a bare code point; there are only
code points that are encoded using an encoding form like UTF-8 or
UTF-16. For details, see:
http://tools.ietf.org/html/draft-ietf-appsawg-rfc3536bis-02
Given that XMPP is pure UTF-8, I don't see a compelling reason to count
UTF-16-encoded code points or UTF-32-encoded code points.
I think UTF-16 and UTF-32 encodings would both be a bad idea; XMPP is
purely UTF-8 as you say.
However, I don't think that we should refer to UTF-8 octets here
either, for two reasons:
1) Processing software may have decoded the UTF-8 into "something",
making octet offsets awkward to manage.
2) Referring to UTF-8 octets means we have silly states where we
could edit inside characters. It's even possible this may be used
intentionally, in some languages.
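To illustrate point 2, here is a minimal Python sketch of how an octet offset can land inside a character. The helper name `splits_character` and the sample word are made up for illustration; nothing here comes from a spec.

```python
# Illustrative sketch only: splits_character is a hypothetical helper.

def splits_character(text: str, octet_offset: int) -> bool:
    """Return True if octet_offset falls inside a multi-octet UTF-8 sequence."""
    data = text.encode("utf-8")
    if octet_offset <= 0 or octet_offset >= len(data):
        # The start and end of the string are always character boundaries.
        return False
    # UTF-8 continuation octets have the bit pattern 10xxxxxx.
    return (data[octet_offset] & 0xC0) == 0x80


word = "na\u00efve"  # "naive" with U+00EF, which encodes to two octets in UTF-8

for offset in range(len(word.encode("utf-8")) + 1):
    print(offset, splits_character(word, offset))
```

An edit addressed at octet offset 3 of this word would cut U+00EF in half, which is exactly the kind of state a code-point-based position cannot express.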
So I'd say that we should refer to characters in a string, and deal
with Unicode code points in the abstract. I'd expect that
implementations would convert this internally into whatever made
sense for them.
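As a rough sketch of what that internal conversion might look like, assuming positions are exchanged as code point indexes: Python strings already index by code point, so the edit itself is trivial, and the two offset helpers show how an implementation that stores UTF-8 octets or UTF-16 code units could map the same position. All function names are hypothetical.

```python
# Illustrative sketch only: all three helpers are hypothetical names.

def insert_at_code_point(text: str, index: int, new_text: str) -> str:
    """Insert new_text before the code point at the given index."""
    return text[:index] + new_text + text[index:]


def utf8_offset(text: str, index: int) -> int:
    """Octet offset in the UTF-8 form for a given code point index."""
    return len(text[:index].encode("utf-8"))


def utf16_offset(text: str, index: int) -> int:
    """Code unit offset in the UTF-16 form for a given code point index."""
    return len(text[:index].encode("utf-16-le")) // 2


sample = "a\U0001F600b"  # U+1F600 needs 4 octets in UTF-8, 2 units in UTF-16

print(insert_at_code_point("h\u00e9llo", 2, "X"))
print(utf8_offset(sample, 2), utf16_offset(sample, 2))
```

The same abstract position (code point index 2) maps to octet offset 5 in UTF-8 and code unit offset 3 in UTF-16, so each implementation can translate locally without the protocol committing to either form.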
Dave.
--
Dave Cridland - mailto:[email protected] - xmpp:[email protected]
- acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
- http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade