On 07Aug2017 21:44, boB Stepp <robertvst...@gmail.com> wrote:
> py3: s = 'Hello!'
> py3: len(s.encode("UTF-8"))
> 6
> py3: len(s.encode("UTF-16"))
> 14
> py3: len(s.encode("UTF-32"))
> 28
> How is len() getting these values? And I am sure it will turn out not
> to be a coincidence that 2 * (6 + 1) = 14 and 4 * (6 + 1) = 28. Hmm
The result of str.encode is a bytes object holding the specified encoding of
the original text, so len() here is counting bytes, not characters.
Your sample string contains only ASCII characters, which encode 1-to-1 in
UTF-8. So as you might expect, your UTF-8 encoding is 6 bytes. The others have
a slight twist. Let's see:
Python 3.6.1 (default, Apr 24 2017, 06:17:09)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = 'Hello!'
>>> s.encode()
b'Hello!'
>>> s.encode('utf-16')
b'\xff\xfeH\x00e\x00l\x00l\x00o\x00!\x00'
>>> s.encode('utf-32')
b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00'
The utf-8 encoding (the default) is as you would expect.
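
To illustrate the 1-to-1 point, here is a small sketch. The non-ASCII sample
characters are my own choice for illustration and are not from the original
question. ASCII characters take one byte each in UTF-8, while other characters
take two to four:

```python
# UTF-8 is variable width: ASCII takes 1 byte, other code points take 2 to 4.
# The sample characters below are illustrative, not from the original post.
samples = {
    'H': 'H',                   # U+0048, plain ASCII
    'e-acute': '\u00e9',        # U+00E9
    'euro sign': '\u20ac',      # U+20AC
    'emoji': '\U0001F600',      # U+1F600, outside the 16-bit range
}

utf8_lengths = {name: len(ch.encode('utf-8')) for name, ch in samples.items()}

for name, nbytes in utf8_lengths.items():
    print(name, nbytes)
# H 1
# e-acute 2
# euro sign 3
# emoji 4
```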
The UTF-16 and UTF-32 encodings encode each of these code points into 2 and 4
byte units respectively (UTF-16 needs two such units for code points above
U+FFFF, but none occur here). Unlike the UTF-8 encoding, however, it is
necessary to know whether these byte sequences are big-endian (most significant
byte first) or little-endian (least significant byte first).
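
If you want to see the two byte orders side by side, Python has codecs with
the byte order spelled out in the name. A minimal sketch, encoding just the
single character 'H' (these explicit codecs write no marker bytes in front):

```python
# Encode one character with an explicit byte order; no BOM is written.
ch = 'H'  # U+0048

le16 = ch.encode('utf-16-le')
be16 = ch.encode('utf-16-be')
le32 = ch.encode('utf-32-le')
be32 = ch.encode('utf-32-be')

print(le16)  # b'H\x00'          least significant byte first
print(be16)  # b'\x00H'          most significant byte first
print(le32)  # b'H\x00\x00\x00'
print(be32)  # b'\x00\x00\x00H'
```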
The machine I'm on here is writing little endian UTF-16 and UTF-32: the 'H'
(0x48) comes out as H\x00, least significant byte first.
As you note, the 16 and 32 forms are (6 + 1) times 2 or 4 respectively. This is
because each encoding has a leading byte order marker (U+FEFF, occupying one
2 or 4 byte unit) to indicate the big endianness or little endianness. For
little endian UTF-16 data that is \xff\xfe, as above; for big endian data it
would be \xfe\xff.
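
You can check the marker and the arithmetic yourself. This is a rough sketch
using the BOM constants from the standard codecs module and sys.byteorder; it
relies on the plain 'utf-16' and 'utf-32' codecs writing the machine's native
byte order, which is what they do when encoding:

```python
import codecs
import sys

s = 'Hello!'
u16 = s.encode('utf-16')
u32 = s.encode('utf-32')

# Pick the byte order marks matching this machine's native byte order.
if sys.byteorder == 'little':
    bom16, bom32 = codecs.BOM_UTF16_LE, codecs.BOM_UTF32_LE
else:
    bom16, bom32 = codecs.BOM_UTF16_BE, codecs.BOM_UTF32_BE

print(u16.startswith(bom16))  # True
print(u32.startswith(bom32))  # True

# One extra code unit for the marker: (6 + 1) * 2 and (6 + 1) * 4.
print(len(u16), len(u32))     # 14 28

# The codecs with an explicit byte order write no marker.
no_bom16 = s.encode('utf-16-le')
no_bom32 = s.encode('utf-32-be')
print(len(no_bom16), len(no_bom32))  # 12 24
```

If you are writing data for something else to read, naming the byte order
explicitly like this avoids depending on whatever machine you happen to be on.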
Cheers,
Cameron Simpson <c...@cskk.id.au> (formerly c...@zip.com.au)