On 03/05/10 23:45, Lech Lorens wrote:
[...]
> I might be totally wrong basing my understanding of BOM and character
> sets mainly on Wikipedia, but I thought that setting 'bomb' for utf-8
> encoded files (which does not pose a risk of misinterpreting the
> contents due to endianness difference) didn't make much sense. For
> utf-16 that would be another thing.
> http://en.wikipedia.org/wiki/Byte-order_mark
Notwithstanding its name, the BOM provides more than just endianness
detection. It is really an "encoding signal" which allows detecting
all five of the following encodings, assuming that a UTF-16le file won't
start with a NUL character right after its BOM (FF FE 00 00 would be
indistinguishable from the UTF-32le BOM):
    utf-16be    FE FF
    utf-16le    FF FE
    utf-8       EF BB BF
    utf-32be    00 00 FE FF
    utf-32le    FF FE 00 00
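
To illustrate how the table above can be used, here is a rough sketch in
Python (not how Vim itself does it; the name sniff_bom is made up for this
example). It checks the 4-byte signatures first, because the UTF-32le BOM
begins with the same two bytes as the UTF-16le one:

```python
import codecs

# Longest signatures first: the UTF-32le BOM (FF FE 00 00) starts with
# the UTF-16le BOM (FF FE), so it has to be tested before it.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32le"),   # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16be"),   # FE FF
    (codecs.BOM_UTF16_LE, "utf-16le"),   # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if there is no BOM.

    Hypothetical helper, for illustration only.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

With this ordering, a UTF-16le file whose first character is NUL is
reported as utf-32le, which is exactly the ambiguity mentioned above.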
For instance, when I was still on XP, I noticed that WordPad could read
UTF-8 files but only if they started with a BOM. When writing what it
called "Unicode", what it produced was UTF-16le with BOM.
Any file starting with the bytes 0xEF 0xBB 0xBF can be assumed to be in UTF-8.
Distinguishing UTF-8 from Latin1 or Windows-1252 would otherwise require
scanning the whole file, checking for invalid UTF-8 byte sequences.
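
For comparison, the BOM-less check could look something like this in
Python (again only a sketch; looks_like_utf8 is a made-up name). It has to
decode the entire contents before it can answer:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data contains no invalid UTF-8 byte sequence.

    Hypothetical helper, for illustration only. The whole buffer is
    decoded, so the cost grows with the size of the file.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Note that a pure ASCII file passes this test too, since it is valid in
UTF-8, Latin1 and Windows-1252 alike; the scan can only rule UTF-8 out.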
Best regards,
Tony.
--
Life is a gift, living is an art. (Bram Moolenaar)