Re: [Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

Mats Wichmann Tue, 08 Aug 2017 20:33:07 -0700

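Eh? The bytes are ff fe h 0: four bytes in total, which is why len() reports 4.
0xff is not literally four bytes; it's the hex repr of a single 8-bit quantity
with all bits on. Each hex digit stands for 4 bits, so two hex digits make one
byte. The \x00 is not an EOL either: it's the high byte of the 16-bit code unit
for 'h', stored low byte first, and ff fe is the little-endian BOM.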
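
To make the byte counting concrete, here is a minimal sketch that takes the
bytes object from the quoted session apart. The helper name expected_len is
hypothetical, made up for this illustration, and it assumes every character
fits in a single code unit (true for plain ASCII like 'Hello!', not true for
characters outside the BMP in UTF-16).

```python
# The bytes object from the quoted session, written out literally.
h16 = b'\xff\xfeh\x00'

# Each element of a bytes object is one byte: an integer from 0 to 255.
byte_values = list(h16)                        # four integers, one per byte
hex_pairs = [format(b, "02x") for b in h16]    # two hex digits per byte

# First two bytes are the BOM; the remaining two are the code unit for 'h'.
bom, payload = h16[:2], h16[2:]
code_point = int.from_bytes(payload, "little")  # low byte first

def expected_len(text, unit_size):
    """Hypothetical helper: byte length of BOM + text.

    Assumes one code unit per character and a BOM that is one code unit,
    which matches the "UTF-16" (unit_size=2) and "UTF-32" (unit_size=4)
    codecs for simple text.
    """
    return unit_size * (1 + len(text))

print(byte_values, hex_pairs, code_point == ord("h"))
print(expected_len("Hello!", 2), len("Hello!".encode("UTF-16")))
print(expected_len("Hello!", 4), len("Hello!".encode("UTF-32")))
```

So the 14 and 28 from earlier in the thread are just (1 BOM + 6 characters)
times 2 or 4 bytes per code unit.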


On August 8, 2017 9:17:49 PM MDT, boB Stepp <[email protected]> wrote:
>On Mon, Aug 7, 2017 at 10:01 PM, Ben Finney
><[email protected]> wrote:
>> boB Stepp <[email protected]> writes:
>>
>>> How is len() getting these values?
>>
>> By asking the objects themselves to report their length. You are
>> creating different objects with different content::
>>
>>     >>> s = 'Hello!'
>>     >>> s_utf8 = s.encode("UTF-8")
>>     >>> s == s_utf8
>>     False
>>     >>> s_utf16 = s.encode("UTF-16")
>>     >>> s == s_utf16
>>     False
>>     >>> s_utf32 = s.encode("UTF-32")
>>     >>> s == s_utf32
>>     False
>>
>> So it shouldn't be surprising that, with different content, they will
>> have different length::
>>
>>     >>> type(s), len(s)
>>     (<class 'str'>, 6)
>>     >>> type(s_utf8), len(s_utf8)
>>     (<class 'bytes'>, 6)
>>     >>> type(s_utf16), len(s_utf16)
>>     (<class 'bytes'>, 14)
>>     >>> type(s_utf32), len(s_utf32)
>>     (<class 'bytes'>, 28)
>>
>> What is it you think ‘str.encode’ does?
>
>It is translating the Unicode code points into bits patterned by the
>encoding specified.  I know this.  I was reading some examples from a
>book and it was demonstrating the different lengths resulting from
>encoding into UTF-8, 16 and 32.  I was mildly surprised that len()
>even worked on these encoding results.  But for the life of me I can't
>figure out for UTF-16 and 32 how these lengths are determined.  For
>instance just looking at a single character:
>
>py3: h = 'h'
>py3: h16 = h.encode("UTF-16")
>py3: h16
>b'\xff\xfeh\x00'
>py3: len(h16)
>4
>
>From Cameron's response, I know that \xff\xfe is a Little-Endian BOM.
>But in my mind 0xff takes up 4 bytes as each hex digit occupies
>16-bits of space.  Likewise 0x00 looks to be 4 bytes -- Is this
>representing EOL?  So far I have 8 bytes for the BOM and 4 bytes for
>what I am guessing is the end-of-the-line for a byte length of 12 and
>I haven't even gotten to the "h" yet!  So my question is actually as
>stated:  For these encoded bytes, how are these lengths calculated?
>
>
>
>-- 
>boB
>_______________________________________________
>Tutor maillist  -  [email protected]
>To unsubscribe or change subscription options:
>https://mail.python.org/mailman/listinfo/tutor

