On Fri Jun 24 02:54:12 2011, Peter Saint-Andre wrote:
> On 6/23/11 12:41 AM, Mark Rejhon wrote:
> > Opinion?
>
> On the wire there is no such thing as a code point; there are only code
> points that are encoded using an encoding form like UTF-8 or UTF-16. For
> details, see:
>
> http://tools.ietf.org/html/draft-ietf-appsawg-rfc3536bis-02
>
> Given that XMPP is pure UTF-8, I don't see a compelling reason to count
> UTF-16-encoded code points or UTF-32-encoded code points.


I think UTF-16 and UTF-32 encodings would both be a bad idea; XMPP is purely UTF-8 as you say.

However, I don't think that we should refer to UTF-8 octets either, here, for a number of reasons:

1) Processing software may already have decoded the UTF-8 into "something" else, making octet offsets awkward to manage.

2) Referring to UTF-8 octets means we have silly states where an edit could land inside a multi-octet character. It's even possible this may be used intentionally, in some languages.

So I'd say that we should refer to characters in a string, and deal with Unicode code points in the abstract. I'd expect that implementations would convert this internally into whatever made sense for them.
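
As a rough illustration of the difference (a sketch only, not anything taken from a spec; the function names and the sample string are made up for the example), here is how the same string measures up in code points, UTF-8 octets and UTF-16 code units, and how an octet offset can land inside a character while a code point offset cannot:

```python
# Illustrative sketch: all names here are hypothetical, not from any XMPP spec.

def lengths(text: str) -> dict:
    """Measure one string in three different units."""
    return {
        "code_points": len(text),  # a Python str is indexed by code point
        "utf8_octets": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }

def octet_offset_is_character_boundary(text: str, offset: int) -> bool:
    """Return True if the first `offset` UTF-8 octets of text decode cleanly,
    i.e. the offset does not fall inside a multi-octet character."""
    data = text.encode("utf-8")
    try:
        data[:offset].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def insert_at_code_point(text: str, index: int, new: str) -> str:
    """Insert `new` before the code point at `index`; this can never split
    a character, whatever the implementation stores internally."""
    return text[:index] + new + text[index:]

# "a" is 1 octet in UTF-8, the euro sign is 3, and the emoji is 4
# (and a surrogate pair in UTF-16).
sample = "a\u20ac\U0001F600"
print(lengths(sample))
print(octet_offset_is_character_boundary(sample, 2))  # inside the euro sign
print(insert_at_code_point(sample, 1, "b"))
```

An implementation that keeps UTF-8 or UTF-16 internally would translate code point positions into its own offsets at the edges, which is the conversion I mean above.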

Dave.
--
Dave Cridland - mailto:[email protected] - xmpp:[email protected]
 - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
 - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade
