"Kent Johnson" <ken...@tds.net> wrote in message
news:1c2a2c590903310357m682e16acr9d94b12b60993...@mail.gmail.com...
On Tue, Mar 31, 2009 at 1:52 AM, Mark Tolonen <metolone+gm...@gmail.com>
wrote:
Unicode is simply code points. How the code points are represented
internally is another matter. The code below is from a 16-bit Unicode build
of Python but should look exactly the same on a 32-bit Unicode build;
however, the internal representation is different.
Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x=u'\U00012345'
>>> x.encode('utf8')
'\xf0\x92\x8d\x85'
However, I wonder if this should be considered a bug. I would think the
length of a Unicode string should be the number of code points in the
string, which for my string above should be 1. Does anyone have a 32-bit
Unicode build of Python handy? The following exposes the implementation as
UTF-16:
>>> len(x)
2
>>> x[0]
u'\ud808'
>>> x[1]
u'\udf45'
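
As a rough sketch of where those two values come from: the UTF-16 rule subtracts 0x10000 from the code point and splits the remaining 20 bits across a high and a low surrogate. The helper name to_surrogates is made up for illustration and is not part of Python.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000              # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogates(0x12345)
print("%04x %04x" % (high, low))  # d808 df45, the same values as x[0] and x[1]
```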
In standard Python the representation of unicode is 16 bits, without
correct handling of surrogate pairs (which is what your string
contains). I think this is called UCS-2, not UTF-16.
There is a compile switch to enable 32-bit representation of
unicode. See PEP 261 and the "Internal Representation" section of the
second link below for more details.
http://www.python.org/dev/peps/pep-0261/
http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
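
A quick way to tell which kind of build is running is sys.maxunicode, which is 0xFFFF on a 16-bit (narrow) build and 0x10FFFF on a 32-bit (wide) build. A minimal sketch; the function name unicode_build is only illustrative, and Python 3.3 and later always report 0x10FFFF.

```python
import sys

def unicode_build():
    """Return 'narrow' for a 16-bit Unicode build, 'wide' otherwise."""
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

print(unicode_build())
```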
Kent
My string above is UTF-16 because it *does* handle surrogate pairs. See
http://en.wikipedia.org/wiki/UTF-16. "UCS-2 (2-byte Universal Character
Set) is an obsolete character encoding which is a predecessor to UTF-16. The
UCS-2 encoding form is identical to that of UTF-16, except that it *does
not* support surrogate pairs...". The single character \U00012345 was
stored by Python as the surrogate pair \ud808\udf45 and was correctly
encoded as the 4-byte UTF-8 '\xf0\x92\x8d\x85' in my example. Also,
"Because of the technical similarities and upwards compatibility from UCS-2
to UTF-16, the two encodings are often erroneously conflated and used as if
interchangeable, so that strings encoded in UTF-16 are sometimes
misidentified as being encoded in UCS-2." Python isn't strictly UCS-2
anymore, but it doesn't completely implement UTF-16 either: len() and
indexing count 16-bit code units rather than code points, so string
functions return incorrect results for characters outside the BMP.
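
To illustrate the gap, here is a sketch of counting code points from UTF-16 code units by hand. The names utf16_units and count_code_points are hypothetical helpers, and the count assumes well-formed input (no unpaired surrogates).

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of a string as a list of ints."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def count_code_points(units):
    """Count code points, treating each surrogate pair as one."""
    # Skip low surrogates (0xDC00-0xDFFF); each one completes a pair
    # whose high surrogate has already been counted.
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

units = utf16_units(u"a\U00012345")
print(len(units))                # 3 code units
print(count_code_points(units))  # 2 code points
```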
-Mark