Re: [Tutor] how are unicode chars represented?

Mark Tolonen Tue, 31 Mar 2009 18:09:36 -0700

"Kent Johnson" <[email protected]> wrote in message news:[email protected]...
On Tue, Mar 31, 2009 at 1:52 AM, Mark Tolonen <[email protected]> wrote:

Unicode is simply code points. How the code points are represented
internally is another matter. The below code is from a 16-bit Unicode build
of Python but should look exactly the same on a 32-bit Unicode build;
however, the internal representation is different.

Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x=u'\U00012345'
>>> x.encode('utf8')
'\xf0\x92\x8d\x85'
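
To see where those four bytes come from, here is a minimal sketch of the 4-byte UTF-8 bit layout applied to the code point above. The helper name utf8_encode_4byte is made up for illustration and is not part of Python; it only handles code points from U+10000 to U+10FFFF.

```python
def utf8_encode_4byte(cp):
    """Return the four UTF-8 byte values for a code point in U+10000..U+10FFFF.

    Hypothetical helper for illustration only; real code should call
    the built-in encode('utf8') instead.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the 4-byte UTF-8 range")
    return [
        0xF0 | (cp >> 18),            # 11110xxx: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (cp & 0x3F),           # 10xxxxxx: low 6 bits
    ]


print([hex(b) for b in utf8_encode_4byte(0x12345)])
# ['0xf0', '0x92', '0x8d', '0x85']
```

The result depends only on the code point, not on how the interpreter stores the string internally, which is why the encoded output is the same on 16-bit and 32-bit builds.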

However, I wonder if this should be considered a bug. I would think the
length of a Unicode string should be the number of code points in the
string, which for my string above should be 1. Anyone have a 32-bit Unicode
build of Python handy? This exposes the implementation as UTF-16.


>>> len(x)
2
>>> x[0]
u'\ud808'
>>> x[1]
u'\udf45'
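
The two values above follow from the UTF-16 surrogate arithmetic. A rough sketch, with the function names to_surrogate_pair and from_surrogate_pair invented here for illustration:

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print("%04x %04x" % to_surrogate_pair(0x12345))
# d808 df45
```

On a 16-bit build each surrogate occupies one slot in the string, so indexing returns the halves separately.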


In standard Python the representation of unicode is 16 bits, without
correct handling of surrogate pairs (which is what your string
contains). I think this is called UCS-2, not UTF-16.

There is a compile switch to enable 32-bit representation of
unicode. See PEP 261 and the "Internal Representation" section of the
second link below for more details.
http://www.python.org/dev/peps/pep-0261/
http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
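
One way to check which kind of build is running is sys.maxunicode, which is 0xFFFF on a 16-bit build and 0x10FFFF on a 32-bit build. A small sketch, where the function name unicode_build_width is made up for this example. Note that on Python 3.3 and later the distinction no longer exists and the value is always 0x10FFFF.

```python
import sys


def unicode_build_width():
    """Report whether this interpreter is a narrow or wide Unicode build.

    Hypothetical helper for illustration; it only inspects sys.maxunicode.
    """
    if sys.maxunicode == 0xFFFF:
        return "narrow"   # 16-bit code units, surrogate pairs visible
    return "wide"         # every code point fits in one unit


print(unicode_build_width())
```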

Kent

My string above is UTF-16 because it *does* handle surrogate pairs. See
http://en.wikipedia.org/wiki/UTF-16. "UCS-2 (2-byte Universal Character
Set) is an obsolete character encoding which is a predecessor to UTF-16. The
UCS-2 encoding form is identical to that of UTF-16, except that it *does
not* support surrogate pairs...".

The single character \U00012345 was stored by Python as the surrogate pair
\ud808\udf45 and was correctly encoded as the 4-byte UTF-8
'\xf0\x92\x8d\x85' in my example.

Also, "Because of the technical similarities and upwards compatibility from
UCS-2 to UTF-16, the two encodings are often erroneously conflated and used
as if interchangeable, so that strings encoded in UTF-16 are sometimes
misidentified as being encoded in UCS-2."

Python isn't strictly UCS-2 anymore, but it doesn't completely implement
UTF-16 either, since string functions return incorrect results for
characters outside the BMP.
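
To illustrate what a surrogate-aware length would have to do, here is a sketch that counts code points in a sequence of 16-bit code units. The name count_code_points is hypothetical, and treating an unpaired surrogate as one code point is an assumption of this sketch, not something Python does for you.

```python
def count_code_points(units):
    """Count code points in a sequence of UTF-16 code unit values.

    A high surrogate followed by a low surrogate counts once.
    Unpaired surrogates are counted as one code point each
    (an assumption made for this sketch).
    """
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


print(count_code_points([0xD808, 0xDF45]))
# 1
```

The plain number of code units for the same input is 2, which is the gap between the stored representation and the number of characters.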


-Mark


_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor
