On Thu, Apr 23, 2015 at 10:08:05PM +0100, Mark Lawrence wrote:

> Slight aside, why a BOM, all I ever think of is Inspector Clouseau? :)
:-)

I'm not sure if you mean that as a serious question or not.

BOM stands for Byte Order Mark, and it is needed for the UTF-16 and UTF-32 encodings because there is some ambiguity due to hardware differences.

UTF-16 uses two-byte "code units", that is, each character is represented by two bytes (or four, but we can ignore that). Similarly, UTF-32 uses four-byte code units. The problem is, different platforms store multi-byte quantities in opposite orders: "big-endian" and "little-endian". (In the Bad Old Days, there were "middle-endian" platforms too, and you really don't want to know about them.) For example, the two-byte quantity 01FE (in hex) might have:

(1) the 01 byte at address 100 and the FE byte at address 101 (big end first)

(2) the 01 byte at address 101 and the FE byte at address 100 (little end first)

We always write the two bytes as 01FE in hexadecimal notation, with the "little end" (units) on the right, just as we do with decimals:

    01FE = E units + F sixteens + 1 sixteen-squares + 0 sixteen-cubes

is the same as 510 in decimal:

    510 = 0 units + 1 ten + 5 ten-squares

but in computer memory those two bytes could be stored either big-end first:

    01 has the lower address
    FE has the higher address

or little-end first:

    FE has the lower address
    01 has the higher address

So when you read a text file, and you see the two bytes 01 and FE, unless you know whether they came from a big-endian system or a little-endian system, you don't know how to interpret them:

- If the sender and the receiver have the same byte order (both big-endian, or both little-endian), then the receiver sees the value the sender wrote: 01FE, or 510 in decimal, which is LATIN CAPITAL LETTER O WITH STROKE AND ACUTE in UTF-16.

- If the sender and the receiver have opposite byte orders (one big-, the other little-, it doesn't matter which), then the receiver sees the two bytes swapped, as FE01 (65025 in decimal), which is VARIATION SELECTOR-2 in UTF-16.
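To make the two readings concrete, here is a small sketch using Python's struct module and the BOM-less UTF-16 codecs. The variable names are mine, chosen for illustration. It stores 01FE both ways, then reads the big-endian bytes back with each byte order:

```python
import struct
import unicodedata

value = 0x01FE  # the two-byte quantity from the example, 510 in decimal

# Store the value with each byte order.
big = struct.pack(">H", value)     # big end first:    bytes 01 FE
little = struct.pack("<H", value)  # little end first: bytes FE 01

# A receiver with the same byte order as the sender recovers 01FE.
same = struct.unpack(">H", big)[0]

# A receiver with the opposite byte order sees FE01 instead.
swapped = struct.unpack("<H", big)[0]

# The same thing happens when the bytes are decoded as text.
right = big.decode("utf-16-be")
wrong = big.decode("utf-16-le")

print(big.hex(), little.hex())
print(same, swapped)
print(unicodedata.name(right))
print(unicodedata.name(wrong))
```

The last two lines print the Unicode names of the two characters, so you can compare them with the names given above.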
I've deliberately used the terms "sender" and "receiver", because this problem is fundamental to networking too. Whenever you transfer data from a (big-endian) Motorola-based platform to a (little-endian) Intel-based platform, the byte order has to be swapped like this. If it's not swapped, the data looks weird. (TIFF files also have to take endian issues into account.)

UTF-16 and UTF-32 solve this problem by putting a Byte Order Mark at the start of the file. The BOM is two bytes in the case of UTF-16 (a single code unit):

    py> ''.encode('utf-16')
    b'\xff\xfe'

That output comes from a little-endian machine. So when you read a UTF-16 file back on such a machine, if it starts with the two bytes FF FE (in hex), the byte order matches yours and you can continue. If it starts with FE FF, then the bytes are in the opposite order.

If you treat the file as Latin-1 (one of the many versions of "extended ASCII"), then you will see the BOM as either:

    ÿþ    # byte order matches
    þÿ    # byte order doesn't match

Normally you don't need to care about any of this! The encoding handles all the gory details. You just pick the right encoding:

- UTF-16 if the file will have a BOM;

- UTF-16BE if it is big-endian, and doesn't have a BOM;

- UTF-16LE if it is little-endian, and doesn't have a BOM;

and the encoding deals with the BOM.

-- 
Steve
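P.S. Here is a minimal sketch of how the three codecs treat the BOM, in case you want to try it yourself. The variable names are only for illustration, and the BOM constants come from the standard codecs module:

```python
import codecs

text = "\u01fe"  # LATIN CAPITAL LETTER O WITH STROKE AND ACUTE

# 'utf-16' writes a BOM in the machine's native byte order, then the text.
with_bom = text.encode("utf-16")

# The explicit-endian codecs write no BOM at all.
big = text.encode("utf-16-be")        # bytes 01 FE
little = text.encode("utf-16-le")     # bytes FE 01

# When decoding, 'utf-16' reads the BOM, picks the matching byte order,
# and drops the BOM from the result, whichever order the file is in.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")

# The explicit-endian codecs do not strip a BOM; it comes through as
# the character U+FEFF at the start of the string.
kept = (codecs.BOM_UTF16_LE + little).decode("utf-16-le")

print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(from_big == text, from_little == text)
print(kept == "\ufeff" + text)
```

So if you know the file has a BOM, plain 'utf-16' is the convenient choice; use the BE or LE variants only for files without one.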