Re: [Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

Cameron Simpson Tue, 08 Aug 2017 21:32:38 -0700

On 08Aug2017 22:30, boB Stepp <robertvst...@gmail.com> wrote:

On Mon, Aug 7, 2017 at 10:20 PM, Cameron Simpson <c...@cskk.id.au> wrote:

On 07Aug2017 21:44, boB Stepp <robertvst...@gmail.com> wrote:

py3: s = 'Hello!'
py3: len(s.encode("UTF-8"))
6
py3: len(s.encode("UTF-16"))
14
py3: len(s.encode("UTF-32"))
28


How is len() getting these values?  And I am sure it will turn out not
to be a coincidence that 2 * (6 + 1) = 14 and 4 * (6 + 1) = 28.  Hmm


The result of str.encode is a bytes object with the specified encoding of
the original text.

Your sample string contains only ASCII characters, which encode 1-to-1 in
UTF-8. So as you might expect, your UTF-8 encoding is 6 bytes. The others
have a slight twist. Let's see:

   Python 3.6.1 (default, Apr 24 2017, 06:17:09)
   [GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> s = 'Hello!'
   >>> s.encode()
   b'Hello!'
   >>> s.encode('utf-16')
   b'\xff\xfeH\x00e\x00l\x00l\x00o\x00!\x00'
   >>> s.encode('utf-32')
   b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00'

[...]

As I just posted in my response to Ben, I am missing something
probably quite basic in translating the bytes representation above
into "bytes".

The machine I'm on here is writing little endian UTF-16 and UTF-32.

As you note, the 16 and 32 forms are (6 + 1) times 2 or 4 respectively. This
is because each encoding has a leading byte order mark (BOM) to indicate the
big endianness or little endianness. For little endian data that is \xff\xfe;
for big endian data it would be \xfe\xff.
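
To see the BOM's contribution to the lengths, here is a small sketch, assuming Python 3. The variable names are mine, chosen only for illustration. The codecs that name a byte order explicitly ('utf-16-le', 'utf-16-be' and so on) write no BOM, so the lengths drop to exactly 2 or 4 bytes per character:

```python
import codecs

s = 'Hello!'

# The plain 'utf-16' / 'utf-32' codecs prepend a BOM in the machine's
# native byte order.
with_bom_16 = s.encode('utf-16')
with_bom_32 = s.encode('utf-32')

# Naming the byte order explicitly suppresses the BOM.
le_16 = s.encode('utf-16-le')
be_16 = s.encode('utf-16-be')
le_32 = s.encode('utf-32-le')

print(len(with_bom_16), len(le_16), len(be_16))  # 14 12 12
print(len(with_bom_32), len(le_32))              # 28 24

# The codecs module exposes the BOM byte sequences as constants.
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF32_LE)  # b'\xff\xfe\x00\x00'
```

The difference between the two lengths in each row is the size of the BOM: 2 bytes for UTF-16, 4 bytes for UTF-32.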


The arithmetic as I mentioned in my original post is what I am
expecting in "bytes", but my current thinking is that if I have for
the BOM you point out "\xff\xfe", I translate that as 4 hex digits,
each having 16 bits, for a total of 64 bits or 8 bytes.  What am I
misunderstanding here?

A hex digit expresses 4 bits, not 16. "Hex"/"hexadecimal" is base 16, but that is 2^4, so just four bits per hex digit. So the BOM is 2 bytes long in UTF-16 and 4 bytes long (\xff\xfe\x00\x00) in UTF-32.
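
If it helps to check that arithmetic at the prompt, here is a rough sketch (Python 3 assumed; the names bom and encoded are just for illustration). Each \xNN escape in a bytes repr is one byte written as two hex digits, and printable ASCII bytes are displayed as the character itself instead:

```python
bom = b'\xff\xfe'

# Each \xNN escape is ONE byte, written as two hex digits.
print(len(bom))    # 2
print(list(bom))   # [255, 254]

# One hex digit covers 16 values (4 bits); two digits cover 256 values (8 bits).
print(int('f', 16).bit_length())    # 4
print(int('ff', 16).bit_length())   # 8

# Printable ASCII bytes are shown as the character rather than as \xNN,
# so 'H' below is the single byte 0x48.
encoded = 'Hello!'.encode('utf-16-le')
print(encoded)         # b'H\x00e\x00l\x00l\x00o\x00!\x00'
print(encoded.hex())   # 480065006c006c006f002100
print(len(encoded))    # 12
```

The hex() output has 24 hex digits, which is 12 bytes at two digits per byte.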

Is a definition of "byte" meaning something
other than 8 bits here?  I vaguely recall reading somewhere that
"byte" can mean different numbers of bits in different contexts.

There used to be machines with different "word" or "memory cell" sizes, such as 6 or 9 bits etc, and these were still referred to as bytes. They're pretty much defunct, and these days the word "byte" always means 8 bits unless someone goes out of their way to say otherwise.

You'll find all the RFCs talk about "octets" for this very reason: a value consisting of 8 bits ("oct" meaning 8).

And is len() actually counting "bytes" or something else for these encodings?


Just bytes, exactly as you expect.
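
A quick way to see the distinction, as a sketch assuming Python 3: len() of a str counts characters (code points), while len() of the bytes from .encode() counts bytes. The sample string below is my own, picked because it contains one non-ASCII character:

```python
s = 'na\u00efve'   # 'naïve': 5 characters, one of them non-ASCII

print(len(s))                       # 5  (characters in the str)
print(len(s.encode('utf-8')))       # 6  (the 'ï' takes 2 bytes in UTF-8)
print(len(s.encode('utf-16-le')))   # 10 (2 bytes each, no BOM)
print(len(s.encode('utf-32-le')))   # 20 (4 bytes each, no BOM)
```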

Cheers,
Cameron Simpson <c...@cskk.id.au> (formerly c...@zip.com.au)
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
