On 03/05/10 23:45, Lech Lorens wrote:
[...]
> I might be totally wrong basing my understanding of BOM and character
> sets mainly on Wikipedia, but I thought that setting 'bomb' for utf-8
> encoded files (which does not pose a risk of misinterpreting the
> contents due to endianness difference) didn't make much sense. For
> utf-16 that would be another thing.
>
> http://en.wikipedia.org/wiki/Byte-order_mark


Notwithstanding its name, the BOM provides more than just endianness detection. It is in effect an "encoding signal" that lets a reader tell all five of the following encodings apart, assuming a UTF-16le file won't start with a NUL character (otherwise FF FE followed by 00 00 would be indistinguishable from the UTF-32le BOM):

utf-16be    FE FF
utf-16le    FF FE
utf-8       EF BB BF
utf-32be    00 00 FE FF
utf-32le    FF FE 00 00
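
To illustrate how the table above can be used, here is a minimal Python sketch of BOM sniffing. The names BOMS and detect_bom are made up for this example; this is not how Vim itself does it. The 4-byte signatures are tested before the 2-byte ones, since the UTF-32le BOM begins with the UTF-16le BOM.

```python
# Hypothetical helper, for illustration only.
# Longer signatures come first: FF FE 00 00 (utf-32le) starts with
# FF FE (utf-16le), so the order of the checks matters.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32be"),
    (b"\xff\xfe\x00\x00", "utf-32le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
]


def detect_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

Only the first four bytes of the file are needed, e.g. detect_bom(open(path, "rb").read(4)). As noted above, a UTF-16le file whose first character is NUL would be reported as utf-32le.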

For instance, when I was still on XP, I noticed that WordPad could read UTF-8 files but only if they started with a BOM. When writing what it called "Unicode", what it produced was UTF-16le with BOM.

Any file starting with 0xEF 0xBB 0xBF can be assumed to be in UTF-8. Without a BOM, distinguishing UTF-8 from Latin1 or Windows-1252 would require scanning the whole file and checking for invalid UTF-8 byte sequences.
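
For comparison, the BOM-less check can be sketched in Python as follows. The function name looks_like_utf8 is made up for this example, and it relies on Python's strict UTF-8 decoder to do the scanning. It has to look at the entire contents, whereas the BOM test needs only the first three bytes.

```python
def looks_like_utf8(data):
    """Return True if the whole byte string is valid UTF-8.

    Hypothetical helper: it decodes everything and reports whether
    any invalid byte sequence was found along the way.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII passes this test too, since it is valid in UTF-8, Latin1 and Windows-1252 alike, so a True result only means "no invalid UTF-8 sequence was found", not "this is certainly UTF-8".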


Best regards,
Tony.
--
Life is a gift, living is an art.               (Bram Moolenaar)

--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
