On Mon, Aug 7, 2017 at 10:01 PM, Ben Finney <ben+pyt...@benfinney.id.au> wrote:
> boB Stepp <robertvst...@gmail.com> writes:
>> How is len() getting these values?
> By asking the objects themselves to report their length. You are
> creating different objects with different content::
> >>> s = 'Hello!'
> >>> s_utf8 = s.encode("UTF-8")
> >>> s == s_utf8
> False
> >>> s_utf16 = s.encode("UTF-16")
> >>> s == s_utf16
> False
> >>> s_utf32 = s.encode("UTF-32")
> >>> s == s_utf32
> False
> So it shouldn't be surprising that, with different content, they will
> have different lengths::
> >>> type(s), len(s)
> (<class 'str'>, 6)
> >>> type(s_utf8), len(s_utf8)
> (<class 'bytes'>, 6)
> >>> type(s_utf16), len(s_utf16)
> (<class 'bytes'>, 14)
> >>> type(s_utf32), len(s_utf32)
> (<class 'bytes'>, 28)
> What is it you think ‘str.encode’ does?
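The length differences in the quoted session are easier to see from the raw
bytes themselves. A quick sketch, using the same string as above:

```python
s = 'Hello!'  # 6 ASCII characters

# UTF-8 uses one byte per ASCII character: 6 bytes total.
print(s.encode("UTF-8"))   # b'Hello!'

# UTF-16 prepends a 2-byte BOM, then uses 2 bytes per character:
# 2 + 6*2 = 14 bytes.
print(len(s.encode("UTF-16")))   # 14

# UTF-32 prepends a 4-byte BOM, then uses 4 bytes per character:
# 4 + 6*4 = 28 bytes.
print(len(s.encode("UTF-32")))   # 28
```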
It translates the Unicode code points into the bit patterns defined by
the specified encoding. I know this. I was reading some examples from a
book that demonstrated the different lengths resulting from encoding
into UTF-8, UTF-16 and UTF-32. I was mildly surprised that len() even
worked on these encoding results. But for the life of me I can't figure
out how the lengths are determined for UTF-16 and UTF-32. For instance,
just looking at a single character:
py3: h = 'h'
py3: h16 = h.encode("UTF-16")
From Cameron's response, I know that \xff\xfe is a little-endian BOM.
But in my mind 0xff takes up 4 bytes, as each hex digit occupies
16 bits of space. Likewise 0x00 looks to be 4 bytes -- is this
representing EOL? So far I have 8 bytes for the BOM and 4 bytes for
what I am guessing is the end of the line, for a byte length of 12, and
I haven't even gotten to the "h" yet! So my question is actually as
stated: for these encoded bytes, how are these lengths calculated?
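The counting confusion above comes from reading each \xNN escape as more
than one byte. In a bytes repr, each \xNN escape is exactly ONE byte:
two hex digits describe 8 bits. A sketch of the single-character case
(byte order shown is what CPython produces on a typical little-endian
machine):

```python
h16 = 'h'.encode("UTF-16")

# Repr shows b'\xff\xfeh\x00': \xff is 1 byte, \xfe is 1 byte,
# 'h' is shown as itself (1 byte, 0x68), \x00 is 1 byte.
print(h16)

# 2-byte BOM + 2 bytes for 'h' = 4 bytes total.
print(len(h16))       # 4

# Listing the integer values makes the byte count explicit.
print(list(h16))      # four integers, one per byte
```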
Tutor maillist - Tutor@python.org