Re: How does vim decide the encoding of an existing file?

Tony Mechelynck Fri, 14 Nov 2008 06:41:44 -0800

On 13/11/08 15:10, James Kanze wrote:
> How does vim decide what encoding(s) to use when it opens an
> existing file?
>
> I ask this because in the past, with text files, it seems to
> have "just worked", and with C++ files and shell scripts, it
> never mattered, since they only contained ASCII.  However, I've
> now got some C++ files which have French (with accents) in their
> comments.  The standard header that we use (copyright, etc.) is
> in English, as is all of the program text itself, which means
> that there is a large block of pure ASCII at the start.  I'm
> gradually converting everything from Latin 1 to UTF-8, however;
> I use vim for the conversion (read, change fileencoding,
> rewrite), which works fine, but the next time I read the file,
> vim still treats it as if it were Latin 1 unless I manually
> change encoding (and fileencoding?)
>
> As far as I can tell, I've nothing in any of my configuration
> files which specify an encoding.


There are a number of settings related to encodings in [g]vim.

* 'encoding' is global; it governs how the data is represented in Vim's 
internal memory. As already said, you should set it once in your vimrc, 
or not at all.

* 'termencoding' tells Vim how the keyboard encodes data, and also, in 
Console Vim but not in gvim, how the display understands text sent to 
it. Its default is empty, which means "use 'encoding'"; however, if your 
vimrc changes 'encoding', you should first save here the "old" 
'encoding' value as set from your OS's locale, in order to avoid 
"misunderstandings" between Vim, its keyboard, and in Console mode also 
its display.

* 'fileencoding' (singular) is buffer-local; it tells Vim which encoding 
is used on disk for the file in question. If empty, there is no 
translation (i.e., 'encoding' is used); if nonempty, you should make 
sure that all characters actually used in the file can be represented in 
memory (which is always the case if 'encoding' is UTF-8).

* The ++enc argument (see ":help ++opt") to several reading or writing 
commands (such as ":e[dit]", ":r[ead]", ":w[rite]", ":sav[eas]", etc.) 
tells Vim which encoding to use on disk for that particular command. If 
you use it, it overrides 'fileencodings' (see below). In the case of 
commands which read a whole disk file into a new buffer, or (like 
":saveas") change the filename for the current buffer, it also sets 
'fileencoding' (see above).

* 'fileencodings' (plural) is global; it defines the heuristics used by 
Vim to set 'fileencoding' (singular) for an existing file. Its 
comma-separated values are used from left to right; the following can be 
used:
   - ucs-bom (which should be first) means that if a Unicode BOM is 
found at the start of a file, the corresponding Unicode encoding (as 
well as the local boolean 'bomb' option) will be set, as follows:
     o EF BB BF        UTF-8
     o 00 00 FE FF     UTF-32be
     o FF FE 00 00     UTF-32le
     o FE FF           UTF-16be
     o FF FE           UTF-16le
     o For proper recognition of UTF-16le (which can represent 
codepoints above U+FFFF) in preference to UCS-2le (which cannot, but 
uses the same representation as UTF-16le for valid codepoints below 
U+10000), Vim version 7.2.033 or later is required.
     o For proper recognition of UTF-16be in preference to UCS-2be (same 
remark), 7.1.261 or later is required.
     o As can be seen above, BOM detection assumes that no UTF-16le 
file starts with a null codepoint right after its BOM (FF FE 00 00 would 
otherwise be read as the UTF-32le BOM). I believe that this is a 
reasonable assumption.
   - a multibyte encoding name: when coming to that element of the 
heuristic, Vim will test the file for that encoding, accept it if no 
invalid code is found, and proceed to the next element of the heuristic 
otherwise.
   - an 8-bit encoding name (which should be last): since 8-bit 
encodings can never give a "fail" signal, this _tells_ Vim which 
encoding to use if all previous elements (if any) are found wanting.
   - if no 8-bit encoding is included, and all included elements give 
"fail" results, Vim will use Latin1 as a fallback, unless 
'fileencodings' is totally empty, in which case no test will be done and 
the setting of 'fileencoding' (singular) will not be changed.
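
As a rough illustration of how the options above fit together for the
Latin1-to-UTF-8 conversion described in the question, the commands below
are a sketch only; foo.cpp is a hypothetical file name, and I am assuming
that the file on disk is still Latin1:

    " open the file, telling Vim what is on disk instead of letting
    " 'fileencodings' guess (foo.cpp is a made-up name)
    :e ++enc=latin1 foo.cpp
    " check what Vim recorded for this buffer
    :setlocal fileencoding? bomb?
    " change the on-disk encoding for this buffer, then write it out
    :setlocal fileencoding=utf-8
    :w

On the next read, the file is recognized as UTF-8 only if 'fileencodings'
lists utf-8 ahead of any 8-bit encoding, or if the file carries a BOM and
ucs-bom is listed.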

The following is what I use near the top of my vimrc for those 
settings. I'm adding comments to make it more self-explanatory.

if has("multi_byte")
        " (optional) remember OS locale
        let g:locale_encoding = &encoding
        " if already Unicode, no need to change it
        if &encoding !~? '^u'
                " avoid clobbering the keyboard's encoding
                " (and the display's in Console mode)
                if &termencoding == ""
                        let &termencoding = &encoding
                endif
                " now we can change the setting for Vim memory
                set encoding=utf-8
        endif
        " define default heuristics for existing files
        " (can be overridden by ++enc on a file-by-file basis)
        set fileencodings=ucs-bom,utf-8,latin1
        " Finally, let's set defaults for new files
        " -- The following line is optional.
        " If setting 'fileencoding' to some non-Unicode value,
        " it is still possible to set 'bomb' on to mean that
        " new Unicode files should have a BOM by default.
        " 'bomb' has no effect on non-Unicode files.
        setglobal bomb fileencoding=utf-8
endif
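
To check what these settings ended up as, and what Vim detected for a
given file, something like the following can be typed at the command
line (a sketch using only standard option queries):

    " global values, with the script that last set them
    :verbose set encoding? termencoding? fileencodings?
    " what was detected for the current buffer
    :setlocal fileencoding? bomb?
    " defaults that new buffers will start with
    :setglobal fileencoding? bomb?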



Best regards,
Tony.
-- 
ARTHUR: Right! Knights! Forward!
    ARTHUR leads a charge toward the castle.  Various shots of them
    battling on, despite being hit by a variety of farm animals.
                  "Monty Python and the Holy Grail" PYTHON (MONTY)
                  PICTURES LTD

