Re: Vim BOMing out

Tony Mechelynck Fri, 09 Nov 2012 23:51:05 -0800

On 09/11/12 23:39, Jay Heyl wrote:

I have some files that came from an outside organization containing Byte
Order Marks. Looking at these files with a hex editor I can see the BOM
is that for a UTF-8 file. I don't think I configured the 'fileencodings'
for Vim, but checking the variables it is using
fileencodings=ucs-bom,utf-8,default,latin1. With this, Vim fails to read
these files properly. I've seen oddly varying behavior as I try
different things, but it usually changes the BOM to indicate UTF-16
(big-endian). This results in improper display of many characters.


If I change my configuration so 'encoding' is utf-8, then the file is
displayed correctly, though the BOM sometimes shows up as UTF-16 in hex
(<FE FF>) and other times as UTF-8 as normal, though funny looking,
characters.

Since I don't need to send these files back out anywhere and the BOM is
just unnecessary junk to me, I've used the hex editor to get rid of them
and Vim behaves like normal. But I'm still curious what is going on with
Vim and the BOMs. Can anyone explain why Vim is apparently thinking
these files are or should be UTF-16 when the BOM clearly indicates
they're UTF-8? Or perhaps just suggest some better settings so Vim will
behave in a logical manner in regards to file encoding?

In order to handle Unicode files correctly, Vim needs 'encoding' set to a Unicode value such as UTF-8 (if set to UTF-16 or UCS-4, of any endianness, Vim will handle it as UTF-8 internally because null bytes terminate C strings) or to GB18030 (which is not recommended except maybe for CJK).

With 'fileencodings' [plural] set to "ucs-bom,utf-8,default,latin1", any file starting with the hex bytes EF BB BF will get 'fileencoding' [singular] set to "utf-8" and 'bomb' set to TRUE unless it is not a valid UTF-8 file (see below). The BOM will not be visible while you edit but it will be written back as you save the file.

See http://vim.wikia.com/wiki/Working_with_Unicode (and the Vim help tags listed there) for more details.
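
To make the "ucs-bom" entry of 'fileencodings' more concrete, here is a rough Python sketch of BOM sniffing. It is not Vim's actual code; the function name sniff_bom and the table order are my own assumptions, and it only shows the general idea of matching the first bytes of a file against known signatures.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so it has to be tested before it.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbfhello"))
    print(sniff_bom(b"hello"))
```

When no signature matches, a detector of this kind has to fall through to the next candidate, which is the role the remaining entries of 'fileencodings' play.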

If Vim displays <feff> it doesn't necessarily mean it thinks the file is in UTF-16, it means that, at that point in the file, there is the codepoint U+FEFF ZERO WIDTH NO-BREAK SPACE which is deprecated (in favour of U+2060 WORD JOINER) except when used as a byte order mark (or, more precisely, an encoding mark, since UTF-8 is invariant for byte order) placed at the very start of the file. That codepoint takes up three bytes (hex EF BB BF) in UTF-8, two (little-endian FF FE or big-endian FE FF) in UTF-16, four in UTF-32 (little-endian FF FE 00 00, big-endian 00 00 FE FF, or FE FF 00 00 or 00 00 FF FE in the rarer 3412 and 2143 orderings respectively), but its Unicode scalar value is always 0xFEFF.
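
The byte sequences above can be checked with a few lines of Python. This is only an illustration of how one codepoint maps to different bytes; the variable names are mine, and the rarer 3412 and 2143 orderings are left out because Python has no codec for them.

```python
BOM = "\ufeff"  # the single codepoint U+FEFF

ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

# Map each encoding name to the hex bytes it produces for U+FEFF.
encoded = {
    name: " ".join(f"{byte:02X}" for byte in BOM.encode(name))
    for name in ENCODINGS
}

if __name__ == "__main__":
    print(f"scalar value: 0x{ord(BOM):04X}")
    for name, hex_bytes in encoded.items():
        print(f"{name:10} {hex_bytes}")
```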

If Vim displays the BOM as three funny-looking characters it means it has *not* recognized the file as UTF-8: for instance in Latin1 you would see ï»¿ (i-diaeresis, closing French quote, and Spanish inverted question mark). In order for the file to be recognized as UTF-8 it must not contain any byte sequence which would be illegal for UTF-8. This means in particular that (in hex):

- bytes FE and FF are forbidden

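- any byte in the range C0 to FD is the "leading byte" (the first byte) of a multi-byte sequence whose total length in bytes is exactly equal to the number of consecutive high-order one bits in the leading byte, and whose other ("trailing") bytes are in the range 80 to BF (this is the original UTF-8 scheme; current UTF-8 as defined by RFC 3629 allows only leading bytes C2 to F4, for sequences of at most four bytes)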

- trailing bytes may not appear elsewhere
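
As a rough illustration of the three rules above, here is a Python sketch of a structural check. It is not what Vim runs internally, and the names leading_ones and passes_structure_check are made up for this example. It tests only the rules listed here; a strict decoder (Python's own "utf-8" codec, for instance) rejects more, such as overlong forms and surrogates.

```python
def leading_ones(byte: int) -> int:
    """Count the consecutive high-order one bits of a byte."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def passes_structure_check(data: bytes) -> bool:
    """Apply only the three structural rules listed above."""
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:            # plain ASCII
            i += 1
            continue
        if byte in (0xFE, 0xFF):   # forbidden bytes
            return False
        if byte < 0xC0:            # trailing byte with no leading byte
            return False
        length = leading_ones(byte)
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            return False           # sequence cut short
        if any(not 0x80 <= t <= 0xBF for t in tail):
            return False           # a non-trailing byte inside the sequence
        i += length
    return True


if __name__ == "__main__":
    good = b"\xef\xbb\xbfcaf\xc3\xa9\n"   # UTF-8 BOM, then "café" in UTF-8
    bad = b"\xef\xbb\xbfcaf\xe9\n"        # UTF-8 BOM, then "café" in Latin1
    print(passes_structure_check(good))
    print(passes_structure_check(bad))
    # What the BOM bytes look like once the file is read as Latin1:
    print(b"\xef\xbb\xbf".decode("latin-1"))
```

The second sample shows how a single Latin1 byte after a UTF-8 BOM is enough to fail the check, at which point the BOM bytes end up displayed as ordinary characters.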

In UTF-8, the BOM can be useful, harmful or indifferent depending on where it is used:

- in a file beginning with #! it is harmful because it hides the magic shebang and the name of the program which should handle the script

- similarly, it is harmful in any file to be used as input by a program which doesn't know about BOMs

- it is useful in a file to be used as input by a program which knows about it but would use a different setting if it weren't there: for instance on Windows, if you want to use WordPad to edit a UTF-8 file (as opposed to UTF-16le or Windows-1252) it had better start with a UTF-8 BOM

- in some filetypes it serves as a confirmation that the file is in UTF-8. For instance HTML documents must have an encoding declaration, as one or more of a BOM, an HTTP Content-Type header, and a charset-declaring <meta> element

- for other filetypes, the BOM may be indifferent if the file would be correctly interpreted by any program using it even if it didn't have a BOM
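
For the harmful cases, a small Python sketch shows why a BOM hides the shebang and how the three bytes can be dropped outside of a hex editor. The helper name strip_utf8_bom and the sample script are invented for the example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if it has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hi\n"
    # With the BOM in front, the file no longer starts with "#!".
    print(script.startswith(b"#!"))
    print(strip_utf8_bom(script).startswith(b"#!"))
    # Python's "utf-8-sig" codec skips a leading BOM while decoding.
    print(repr(script.decode("utf-8-sig")))
```

Inside Vim the same result can be had without any external tool: once the file has been read as UTF-8, ":setlocal nobomb" followed by ":w" writes it back without the BOM.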



Best regards,
Tony.
--
[Nuclear war] ... may not be desirable.
                -- Edwin Meese III

