On 07Aug2017 21:44, boB Stepp <robertvst...@gmail.com> wrote:
py3: s = 'Hello!'
py3: len(s.encode("UTF-8"))
6
py3: len(s.encode("UTF-16"))
14
py3: len(s.encode("UTF-32"))
28

How is len() getting these values?  And I am sure it will turn out not
to be a coincidence that 2 * (6 + 1) = 14 and 4 * (6 + 1) = 28.  Hmm

The result of str.encode is a bytes object with the specified encoding of the original text.

Your sample string contains only ASCII characters, which encode 1-to-1 in UTF-8. So as you might expect, your UTF-8 encoding is 6 bytes. The others have a slight twist. Let's see:

   Python 3.6.1 (default, Apr 24 2017, 06:17:09)
   [GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> s = 'Hello!'
   >>> s.encode()
   b'Hello!'
   >>> s.encode('utf-16')
   b'\xff\xfeH\x00e\x00l\x00l\x00o\x00!\x00'
   >>> s.encode('utf-32')
   b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00'

The utf-8 encoding (the default) is as you would expect.

The UTF-16 and UTF-32 encodings encode each of these code points into a 2-byte or 4-byte unit respectively, as you might expect. Unlike the UTF-8 encoding, however, it is necessary to know whether these byte sequences are big-endian (most significant byte first) or little-endian (least significant byte first).
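
To see the two byte orders side by side, here is a small sketch using the explicit-endian codec names 'utf-16-le' and 'utf-16-be' (standard Python codecs, though not ones used in the session above). These write no byte order mark, so each result is exactly 2 bytes per character:

```python
s = 'Hello!'

# Explicit-endian codecs: the byte order is fixed by the codec name,
# so no byte order mark is written.
le = s.encode('utf-16-le')   # least significant byte first
be = s.encode('utf-16-be')   # most significant byte first

print(le)                # b'H\x00e\x00l\x00l\x00o\x00!\x00'
print(be)                # b'\x00H\x00e\x00l\x00l\x00o\x00!'
print(len(le), len(be))  # 12 12
```

Both forms hold the same six code units; only the order of the two bytes within each unit differs.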

The machine I'm on here is writing little-endian UTF-16 and UTF-32: the least significant byte of each unit comes first, so 'H' appears as H\x00 rather than \x00H.

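As you note, the 16 and 32 forms are (6 + 1) times 2 or 4 respectively. This is because each encoding has a leading byte order mark (BOM), the code point U+FEFF encoded as one extra 2- or 4-byte unit, to indicate the big endianness or little endianness. For little-endian UTF-16 data, as here, that is \xff\xfe; for big-endian data it would be \xfe\xff. In UTF-32 the mark is the same code point in 4 bytes: \xff\xfe\x00\x00 for little endian, \x00\x00\xfe\xff for big endian.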
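
If you want to check the arithmetic yourself, here is a minimal sketch. It assumes the plain 'utf-16' and 'utf-32' codecs write in the machine's native byte order (which is what CPython does when encoding), and it uses the BOM constants from the standard codecs module:

```python
import codecs
import sys

s = 'Hello!'
data16 = s.encode('utf-16')
data32 = s.encode('utf-32')

# Pick the byte order mark that matches this machine's native byte order.
little = sys.byteorder == 'little'
bom16 = codecs.BOM_UTF16_LE if little else codecs.BOM_UTF16_BE
bom32 = codecs.BOM_UTF32_LE if little else codecs.BOM_UTF32_BE

# Each encoding starts with its byte order mark ...
print(data16.startswith(bom16), data32.startswith(bom32))

# ... and the mark accounts for the "+ 1" in 2 * (6 + 1) and 4 * (6 + 1).
print(len(bom16) + 2 * len(s))   # 14
print(len(bom32) + 4 * len(s))   # 28
```

The decoder reads the mark to pick the byte order and then discards it, so data16.decode('utf-16') gives back the original 6-character string.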

Cheers,
Cameron Simpson <c...@cskk.id.au> (formerly c...@zip.com.au)