On Mon, Aug 7, 2017 at 10:20 PM, Cameron Simpson <c...@cskk.id.au> wrote:
> On 07Aug2017 21:44, boB Stepp <robertvst...@gmail.com> wrote:
>> py3: s = 'Hello!'
>> py3: len(s.encode("UTF-8"))
>> 6
>> py3: len(s.encode("UTF-16"))
>> 14
>> py3: len(s.encode("UTF-32"))
>> 28
>> How is len() getting these values? And I am sure it will turn out not
>> to be a coincidence that 2 * (6 + 1) = 14 and 4 * (6 + 1) = 28. Hmm
> The result of str.encode is a bytes object with the specified encoding of
> the original text.
> Your sample string contains only ASCII characters, which encode 1-to-1 in
> UTF-8. So as you might expect, your UTF-8 encoding is 6 bytes. The others
> have a slight twist. Let's see:
> Python 3.6.1 (default, Apr 24 2017, 06:17:09)
> [GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
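> >>> s = 'Hello!'
> >>> s.encode()
> b'Hello!'
> >>> s.encode('utf-16')
> b'\xff\xfeH\x00e\x00l\x00l\x00o\x00!\x00'
> >>> s.encode('utf-32')
> b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00'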
> The utf-8 encoding (the default) is as you would expect.
> The UTF-16 and UTF-32 encodings encode code points into 2 and 4 byte
> sequences as you might expect. Unlike the UTF-8 encoding, however, it is
> necessary to know whether these byte sequences are big-endian (most
> significant byte first) or little-endian (least significant byte first).
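The byte-order point can be seen directly by asking for an explicit byte order. This is only a small sketch: the codec names "utf-16-be", "utf-16-le", "utf-32-be" and "utf-32-le" are standard Python codecs, and the variable names are just for illustration. With an explicit byte order no byte order mark is written, so the lengths are exactly 2 or 4 bytes per character.

```python
# Encode a single character with an explicit byte order (no BOM is written).
ch = "H"  # code point U+0048

utf16_be = ch.encode("utf-16-be")  # most significant byte first
utf16_le = ch.encode("utf-16-le")  # least significant byte first
utf32_be = ch.encode("utf-32-be")
utf32_le = ch.encode("utf-32-le")

# Show the raw bytes as hex digits, two digits per byte.
print(utf16_be.hex(), utf16_le.hex())
print(utf32_be.hex(), utf32_le.hex())

# With an explicit byte order, 'Hello!' takes 2 or 4 bytes per character.
s = "Hello!"
len16 = len(s.encode("utf-16-le"))
len32 = len(s.encode("utf-32-le"))
print(len16, len32)
```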
As I just posted in my response to Ben, I am missing something
probably quite basic in translating the bytes representation above
> The machine I'm on here is writing little endian UTF-16 and UTF-32.
> As you note, the 16 and 32 forms are (6 + 1) times 2 or 4 respectively. This
> is because each encoding has a leading byte order mark (BOM) to indicate
> big endianness or little endianness. For little endian UTF-16 data that is
> \xff\xfe; for big endian data it would be \xfe\xff.
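Rather than remembering which BOM goes with which byte order, it can be checked against the constants in the standard codecs module. The sketch below assumes only the standard library; the names encoded, bom and order are illustrative. It also checks the (6 + 1) arithmetic from the original post.

```python
import codecs
import sys

s = "Hello!"
encoded = s.encode("utf-16")  # plain "utf-16" prepends a BOM

# The first two bytes are the BOM; compare them with the known constants.
bom = encoded[:2]
if bom == codecs.BOM_UTF16_LE:    # b'\xff\xfe'
    order = "little"
elif bom == codecs.BOM_UTF16_BE:  # b'\xfe\xff'
    order = "big"
else:
    order = "unknown"

# CPython writes the BOM in the machine's native byte order.
print(order, sys.byteorder)

# One 2-byte BOM plus six 2-byte characters; likewise 4 bytes each for UTF-32.
print(len(encoded) == 2 * (len(s) + 1))
print(len(s.encode("utf-32")) == 4 * (len(s) + 1))
```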
The arithmetic as I mentioned in my original post is what I am
expecting in "bytes", but my current thinking is that if I have for
the BOM you point out "\xff\xfe", I translate that as 4 hex digits,
each having 16 bits, for a total of 64 bits or 8 bytes. What am I
misunderstanding here? Is a definition of "byte" meaning something
other than 8 bits here? I vaguely recall reading somewhere that
"byte" can mean different numbers of bits in different contexts.
And is len() actually counting "bytes" or something else for these encodings?
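One way to test the bit counting is to ask the interpreter directly. In a bytes literal each \xNN escape stands for a single byte written as two hex digits, and each hex digit carries 4 bits, so "\xff\xfe" is 2 bytes (16 bits). The sketch below uses only built-ins; the variable names are illustrative.

```python
bom = b"\xff\xfe"

# len() of a bytes object counts bytes, and each \xNN escape is one byte.
print(len(bom))
print(list(bom))  # the two byte values as integers

# Two hex digits per byte, 4 bits per hex digit.
hex_digits = bom.hex()
bits = len(hex_digits) * 4
print(hex_digits, bits)

# The same count applied to the encodings from the original post.
s = "Hello!"
sizes = {}
for name in ("utf-8", "utf-16", "utf-32"):
    data = s.encode(name)
    sizes[name] = len(data)
    print(name, len(data), "bytes =", len(data) * 8, "bits")
```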
Tutor maillist - Tutor@python.org