On Mon, Aug 7, 2017 at 11:30 PM, eryk sun <eryk...@gmail.com> wrote:
> On Tue, Aug 8, 2017 at 3:20 AM, Cameron Simpson <c...@cskk.id.au> wrote:
>>
>> As you note, the 16 and 32 forms are (6 + 1) times 2 or 4 respectively. This
>> is because each encoding has a leading byte order marker to indicate the big
>> endianness or little endianness. For big endian data that is \xfe\xff; for
>> little endian data it would be \xff\xfe.
>
> To avoid encoding a byte order mark (BOM), use an "le" or "be" suffix, e.g.
>
>     >>> 'Hello!'.encode('utf-16le')
>     b'H\x00e\x00l\x00l\x00o\x00!\x00'
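
To check my understanding, here is a minimal sketch of the encode/decode round trip using only the standard library. The variable names are my own, and the comment about the native byte order assumes CPython's usual behaviour for the plain 'utf-16' codec.

```python
import codecs
import sys

text = 'Hello!'

# Plain 'utf-16' prepends a BOM and uses the machine's native byte order.
with_bom = text.encode('utf-16')
native_bom = (codecs.BOM_UTF16_LE if sys.byteorder == 'little'
              else codecs.BOM_UTF16_BE)
print(with_bom.startswith(native_bom), len(with_bom))

# An explicit "le" or "be" suffix fixes the byte order and writes no BOM.
le = text.encode('utf-16le')
be = text.encode('utf-16be')
print(le)
print(be)

# Without a BOM, the decoder must be told the same byte order.
round_trip = le.decode('utf-16le')

# The wrong suffix raises no error here; it just yields different characters.
wrong = le.decode('utf-16be')
print(round_trip == text, wrong == text)

# With a BOM present, plain 'utf-16' detects the order and strips the BOM.
sniffed = (codecs.BOM_UTF16_BE + be).decode('utf-16')
print(sniffed == text)
```

The silent mismatch in the `wrong` case is the part that worries me: nothing fails, so the byte order has to be tracked somewhere outside the data.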
If I do this, then I guess it becomes my responsibility to use the correct
"le" or "be" suffix when I later decode these bytes back into Unicode code
points.

> Sometimes a data format includes the byte order, which makes using a
> BOM redundant. For example, strings in the Windows registry use
> UTF-16LE, without a BOM.

Are there Windows booby traps that I need to watch out for because of this?
I already know that the code pages that cmd.exe uses have caused me some
grief in displaying (or not displaying!) code points which I have wanted to
use.

-- 
boB