On 09/11/12 23:39, Jay Heyl wrote:
I have some files that came from an outside organization containing Byte
Order Marks. Looking at these files with a hex editor I can see the BOM
is that for a UTF-8 file. I don't think I configured the 'fileencodings'
for Vim, but checking the variables it is using
fileencodings=ucs-bom,utf-8,default,latin1. With this, Vim fails to read
these files properly. I've seen oddly varying behavior as I try
different things, but it usually changes the BOM to indicate UTF-16
(big-endian). This results in improper display of many characters.

If I change my configuration so 'encoding' is utf-8, then the file is
displayed correctly, though the BOM sometimes shows up as UTF-16 in hex
(<FE FF>) and other times as UTF-8 as normal, though funny looking,
characters.

Since I don't need to send these files back out anywhere and the BOM is
just unnecessary junk to me, I've used the hex editor to get rid of them
and Vim behaves like normal. But I'm still curious what is going on with
Vim and the BOMs. Can anyone explain why Vim is apparently thinking
these files are or should be UTF-16 when the BOM clearly indicates
they're UTF-8? Or perhaps just suggest some better settings so Vim will
behave in a logical manner with regard to file encoding?

To handle Unicode files correctly, Vim needs 'encoding' set to a Unicode value such as UTF-8 (if it is set to UTF-16 or UCS-4, of any endianness, Vim will use UTF-8 internally, because null bytes terminate C strings) or to GB18030 (which is not recommended except maybe for CJK).
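
As a small illustration of the null-byte remark, here is a Python sketch (the helper name contains_nul is made up for this example, and this is not how Vim itself is written). It only shows that plain ASCII text picks up 0x00 bytes once it is encoded as UTF-16 or UTF-32, while UTF-8 leaves it free of them.

```python
def contains_nul(text: str, encoding: str) -> bool:
    """Return True if encoding `text` produces at least one 0x00 byte."""
    return b"\x00" in text.encode(encoding)


# Plain ASCII text is NUL-free in UTF-8 but not in the wider encodings.
for enc in ("utf-8", "utf-16-le", "utf-32-be"):
    print(enc, contains_nul("Vim", enc))
```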

With 'fileencodings' [plural] set to "ucs-bom,utf-8,default,latin1", any file starting with the hex bytes EF BB BF will get 'fileencoding' [singular] set to "utf-8" and 'bomb' set to TRUE unless it is not a valid UTF-8 file (see below). The BOM will not be visible while you edit but it will be written back as you save the file.
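
To give a rough idea of what a BOM check such as the "ucs-bom" entry amounts to, here is a Python sketch. It is only an approximation under my own assumptions: the function name sniff_bom and the encoding labels are invented for the example, and Vim's real detection may differ in details.

```python
import codecs

# Longest signatures are tried before shorter ones that share a prefix:
# the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def sniff_bom(data: bytes):
    """Return (encoding label, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

With this sketch, a file starting with EF BB BF is reported as "utf-8" with a 3-byte BOM, which corresponds to the case where Vim sets 'fileencoding' to "utf-8" and turns 'bomb' on.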

See http://vim.wikia.com/wiki/Working_with_Unicode (and the Vim help tags listed there) for more details.

If Vim displays <feff>, it doesn't necessarily mean it thinks the file is in UTF-16. It means that, at that point in the file, there is the codepoint U+FEFF ZERO WIDTH NO-BREAK SPACE, which is deprecated (in favour of U+2060 WORD JOINER) except when used as a byte order mark (or, more precisely, an encoding mark, since UTF-8 is invariant for byte order) placed at the very start of the file. That codepoint takes up three bytes (hex EF BB BF) in UTF-8, two (little-endian FF FE or big-endian FE FF) in UTF-16, and four in UTF-32 (little-endian FF FE 00 00, big-endian 00 00 FE FF, or FE FF 00 00 or 00 00 FF FE in the rarer 3412 and 2143 orderings respectively), but its Unicode scalar value is always 0xFEFF.
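
The byte sequences above can be checked with a few lines of Python. This is just a sketch (the helper name bom_hex is invented here) that encodes the single codepoint U+FEFF in the common encodings. The rarer 3412 and 2143 orderings are left out because Python has no codec for them.

```python
BOM = "\ufeff"  # one codepoint, scalar value 0xFEFF


def bom_hex(encoding: str) -> str:
    """Return the bytes of U+FEFF in `encoding` as spaced upper-case hex."""
    return " ".join(f"{b:02X}" for b in BOM.encode(encoding))


for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(f"{enc:10} {bom_hex(enc)}")
```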

If Vim displays the BOM as three funny-looking characters, it means it has *not* recognized the file as UTF-8: for instance in Latin1 you would see ï»¿ (i-diaeresis, closing French quote, and Spanish inverted question mark). In order for the file to be recognized as UTF-8, it must not contain any byte sequence which would be illegal for UTF-8. This means in particular that (in hex):
- bytes FE and FF are forbidden
- any byte in the range C0 to FD is the "leading byte" (the first byte) of a multi-byte sequence whose total length in bytes is exactly equal to the number of consecutive high-order one bits in the leading byte, and whose other ("trailing") bytes are in the range 80 to BF
- trailing bytes may not appear elsewhere
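
The three rules above can be sketched in Python as follows. The function name follows_rules is invented for the example, and it implements only the rules as listed here. It is not Vim's actual check, and the current UTF-8 definition (RFC 3629) is stricter still: it also rejects overlong forms, surrogates, and leading bytes C0, C1 and F5 to FD.

```python
def follows_rules(data: bytes) -> bool:
    """Check `data` against the three byte-level rules listed above."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:              # plain ASCII byte
            i += 1
            continue
        if b in (0xFE, 0xFF):     # rule 1: FE and FF are forbidden
            return False
        if b < 0xC0:              # rule 3: trailing byte with no leading byte
            return False
        # Rule 2: sequence length = number of leading one bits in `b`.
        length, mask = 0, 0x80
        while b & mask:
            length += 1
            mask >>= 1
        trail = data[i + 1:i + length]
        if len(trail) != length - 1:
            return False          # sequence cut short at end of data
        if any(not 0x80 <= t <= 0xBF for t in trail):
            return False          # trailing byte outside 80..BF
        i += length
    return True


# The UTF-8 BOM read as Latin1 gives the three funny-looking characters.
print(b"\xef\xbb\xbf".decode("latin-1"))
```

For example, a Latin1 file containing the single byte E9 (e-acute) followed by ASCII text fails the check, because E9 announces a three-byte sequence whose trailing bytes are missing. A file like that would not be accepted as UTF-8 even if it started with EF BB BF.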

In UTF-8, the BOM can be useful, harmful or indifferent depending on where it is used:

- in a file beginning with #! it is harmful, because it hides the magic shebang and the name of the program which should handle the script
- similarly, it is harmful in any file to be used as input by a program which doesn't know about BOMs
- it is useful in a file to be used as input by a program which knows about it but would use a different setting if it weren't there: for instance on Windows, if you want to use WordPad to edit a UTF-8 file (as opposed to UTF-16le or Windows-1252), it had better start with a UTF-8 BOM
- in some filetypes it serves as a confirmation that the file is in UTF-8. For instance, HTML documents must have an encoding declaration, as one or more of a BOM, an HTTP Content-Type header, and a charset-declaring <meta> element
- for other filetypes, the BOM may be indifferent if the file would be correctly interpreted by any program using it even if it didn't have a BOM
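
For the harmful cases, here is a short Python sketch of why the shebang gets hidden and of one way to drop the BOM outside Vim. The function name strip_utf8_bom and the sample script are invented for the example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (EF BB BF), if it has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

# With the BOM in front, the file no longer starts with the bytes "#!".
print(script.startswith(b"#!"))                   # False
print(strip_utf8_bom(script).startswith(b"#!"))   # True
```

Inside Vim, the same result can be had without a hex editor by doing ":setlocal nobomb" and then writing the file.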


Best regards,
Tony.
--
[Nuclear war] ... may not be desirable.
                -- Edwin Meese III
