On 08Aug2017 22:30, boB Stepp <robertvst...@gmail.com> wrote:
On Mon, Aug 7, 2017 at 10:20 PM, Cameron Simpson <c...@cskk.id.au> wrote:
On 07Aug2017 21:44, boB Stepp <robertvst...@gmail.com> wrote:
py3: s = 'Hello!'
How is len() getting these values? And I am sure it will turn out not
to be a coincidence that 2 * (6 + 1) = 14 and 4 * (6 + 1) = 28. Hmm
The result of str.encode is a bytes object with the specified encoding of
the original text.
Your sample string contains only ASCII characters, which encode 1-to-1 in
UTF-8. So as you might expect, your UTF-8 encoding is 6 bytes. The others
have a slight twist. Let's see:
Python 3.6.1 (default, Apr 24 2017, 06:17:09)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = 'Hello!'
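The session above is cut short in this copy. A minimal sketch of the comparison under discussion, assuming the three encodings in question are 'utf-8', 'utf-16' and 'utf-32' (the names utf8, utf16, utf32 and lengths are just for illustration):

```python
s = 'Hello!'

# Encode the same six-character text three ways.
utf8 = s.encode('utf-8')     # 1 byte per ASCII character, no BOM
utf16 = s.encode('utf-16')   # 2-byte BOM, then 2 bytes per character here
utf32 = s.encode('utf-32')   # 4-byte BOM, then 4 bytes per character here

# len() of a bytes object is simply the number of bytes it holds.
lengths = {
    'utf-8': len(utf8),
    'utf-16': len(utf16),
    'utf-32': len(utf32),
}
print(lengths)
```

The byte order of the UTF-16 and UTF-32 results depends on the machine, but the lengths do not.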
As I just posted in my response to Ben, I am missing something
probably quite basic in translating the bytes representation above
The machine I'm on here is writing little endian UTF-16 and UTF-32.
As you note, the 16 and 32 forms are (6 + 1) times 2 or 4 respectively. This
is because each encoding has a leading byte order marker to indicate the big
endianness or little endianness. For little endian data that is \xff\xfe; for
big endian data it would be \xfe\xff.
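If in doubt about which marker goes with which byte order, the standard codecs module carries the markers as constants. A small sketch (native_bom and no_bom are illustrative names, not anything from the session above):

```python
import codecs
import sys

# The byte order mark is U+FEFF written out in each byte order.
print(codecs.BOM_UTF16_LE)   # little endian UTF-16 marker
print(codecs.BOM_UTF16_BE)   # big endian UTF-16 marker
print(codecs.BOM_UTF32_LE)   # little endian UTF-32 marker
print(codecs.BOM_UTF32_BE)   # big endian UTF-32 marker

# The plain 'utf-16' codec writes the marker in the machine's native order.
native_bom = 'Hello!'.encode('utf-16')[:2]
print(sys.byteorder, native_bom)

# Naming the byte order explicitly writes no marker at all.
no_bom = 'Hello!'.encode('utf-16-le')
print(len(no_bom))
```

With an explicit byte order there is no "+ 1" for the marker, so the length is just 2 bytes per character.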
The arithmetic as I mentioned in my original post is what I am
expecting in "bytes", but my current thinking is that if I have for
the BOM you point out "\xff\xfe", I translate that as 4 hex digits,
each having 16 bits, for a total of 64 bits or 8 bytes. What am I
A hex digit expresses 4 bits, not 16. "Hex"/"hexadecimal" is base 16, but that
is 2^4, so just four bits per hex digit. So the BOM is 2 bytes long in UTF-16
and 4 bytes long (\xff\xfe\x00\x00) in UTF-32.
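The digit counting can be checked directly. A quick sketch, using the two-byte marker quoted above (the variable names are just for illustration):

```python
bom = b'\xff\xfe'

# Each \xNN escape is ONE byte, written as two hex digits.
print(len(bom))              # number of bytes

hex_digits = bom.hex()       # the same bytes as a string of hex digits
print(hex_digits)

# A hex digit covers 16 possible values, which is 2**4, so 4 bits each.
bits = len(hex_digits) * 4
print(bits, bits // 8)       # total bits, and the same figure in bytes

# The largest hex digit, f, as four binary digits.
print(format(0xf, '04b'))
```

So four hex digits come to 16 bits, which is 2 bytes, not 8.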
Is a definition of "byte" meaning something
other than 8 bits here? I vaguely recall reading somewhere that
"byte" can mean different numbers of bits in different contexts.
There used to be machines with different "word" or "memory cell" sizes, such as
6 or 9 bits etc, and these were still referred to as bytes. They're pretty much
defunct, and these days the word "byte" always means 8 bits unless someone goes
out of their way to say otherwise.
You'll find all the RFCs talk about "octets" for this very reason: a value
consisting of 8 bits ("oct" meaning 8).
And is len() actually counting "bytes" or something else for these encodings?
Just bytes, exactly as you expect.
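One way to convince yourself: a bytes object behaves like a sequence of small integers, one per 8-bit byte, and len() counts those. A short sketch, assuming the 'utf-16' encoding of the same sample string:

```python
data = 'Hello!'.encode('utf-16')

# Iterating over bytes yields one integer per byte.
values = list(data)
print(values)

# Every value fits in 8 bits, i.e. lies in the range 0..255.
print(all(0 <= v <= 255 for v in values))

# len() of the bytes object is the count of those values.
print(len(data), len(values))
```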
Cameron Simpson <c...@cskk.id.au> (formerly c...@zip.com.au)