Eh? The bytes are ff, fe, 'h', 0.

0xff is not literally four bytes; it's the hex repr of an 8-bit quantity with all bits on. Each hex digit is 4 bits, so two hex digits make one byte.
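If it helps, here is a rough sketch you can paste into the interpreter to count the bytes yourself. The variable names (h16, bom, payload, bom_sizes, unit_sizes) are only mine for illustration, and the per-character sizes in the last part hold only for ASCII-only text such as 'h' and 'Hello!'. The byte order noted in the comments assumes a little-endian machine, which matches the output quoted below.

```python
import codecs

h16 = 'h'.encode('utf-16')

# Each \xNN escape in the repr is ONE byte (two hex digits = 8 bits).
# A printable byte such as 0x68 is displayed as its ASCII character, 'h'.
print(list(h16))                  # four integers, one per byte
print(len(h16))                   # 4
print(0xff == 0b11111111 == 255)  # True: 0xff is 8 bits, all set

# The first two bytes are the byte order mark; the rest is the 16-bit 'h'.
bom, payload = h16[:2], h16[2:]
print(bom == codecs.BOM_UTF16)    # True (BOM in this machine's byte order)

# b'\xff\xfe' is the little-endian BOM, so U+0068 is stored low byte first
# and the trailing \x00 is the high byte of 'h', not an end-of-line marker.
print(codecs.BOM_UTF16_LE == b'\xff\xfe')  # True
print('h'.encode('utf-16-le'))             # b'h\x00', no BOM this time

# Length = BOM bytes + bytes per character * number of characters
# (assumption: ASCII-only text, so every character is one code unit).
s = 'Hello!'
bom_sizes = {'utf-8': 0, 'utf-16': 2, 'utf-32': 4}
unit_sizes = {'utf-8': 1, 'utf-16': 2, 'utf-32': 4}
for name in bom_sizes:
    expected = bom_sizes[name] + unit_sizes[name] * len(s)
    print(name, len(s.encode(name)), expected)
```

The loop prints 6, 14 and 28 for the three encodings, the same lengths Ben showed.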

On August 8, 2017 9:17:49 PM MDT, boB Stepp <robertvst...@gmail.com> wrote:
>On Mon, Aug 7, 2017 at 10:01 PM, Ben Finney
><ben+pyt...@benfinney.id.au> wrote:
>> boB Stepp <robertvst...@gmail.com> writes:
>>
>>> How is len() getting these values?
>>
>> By asking the objects themselves to report their length. You are
>> creating different objects with different content::
>>
>>     >>> s = 'Hello!'
>>     >>> s_utf8 = s.encode("UTF-8")
>>     >>> s == s_utf8
>>     False
>>     >>> s_utf16 = s.encode("UTF-16")
>>     >>> s == s_utf16
>>     False
>>     >>> s_utf32 = s.encode("UTF-32")
>>     >>> s == s_utf32
>>     False
>>
>> So it shouldn't be surprising that, with different content, they will
>> have different length::
>>
>>     >>> type(s), len(s)
>>     (<class 'str'>, 6)
>>     >>> type(s_utf8), len(s_utf8)
>>     (<class 'bytes'>, 6)
>>     >>> type(s_utf16), len(s_utf16)
>>     (<class 'bytes'>, 14)
>>     >>> type(s_utf32), len(s_utf32)
>>     (<class 'bytes'>, 28)
>>
>> What is it you think ‘str.encode’ does?
>
>It is translating the Unicode code points into bits patterned by the
>encoding specified. I know this. I was reading some examples from a
>book and it was demonstrating the different lengths resulting from
>encoding into UTF-8, 16 and 32. I was mildly surprised that len()
>even worked on these encoding results. But for the life of me I can't
>figure out for UTF-16 and 32 how these lengths are determined. For
>instance just looking at a single character:
>
>py3: h = 'h'
>py3: h16 = h.encode("UTF-16")
>py3: h16
>b'\xff\xfeh\x00'
>py3: len(h16)
>4
>
>From Cameron's response, I know that \xff\xfe is a Big-Endian BOM.
>But in my mind 0xff takes up 4 bytes as each hex digit occupies
>16-bits of space. Likewise 0x00 looks to be 4 bytes -- Is this
>representing EOL? So far I have 8 bytes for the BOM and 4 bytes for
>what I am guessing is the end-of-the-line for a byte length of 12 and
>I haven't even gotten to the "h" yet! So my question is actually as
>stated: For these encoded bytes, how are these lengths calculated?
>
>--
>boB

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor