Re: How does vim decide the encoding of an existing file?

Charles Campbell Fri, 14 Nov 2008 09:14:21 -0800

Tony Mechelynck wrote:
> On 13/11/08 15:10, James Kanze wrote:
>   
>> How does vim decide what encoding(s) to use when it opens an
>> existing file?
>>
>> I ask this because in the past, with text files, it seems to
>> have "just worked", and with C++ files and shell scripts, it
>> never mattered, since they only contained ASCII.  However, I've
>> now got some C++ files which have French (with accents) in their
>> comments.  The standard header that we use (copyright, etc.) is
>> in English, as is all of the program text itself, which means
>> that there is a large block of pure ASCII at the start.  I'm
>> gradually converting everything from Latin 1 to UTF-8, however;
>> I use vim for the conversion (read, change fileencoding,
>> rewrite), which works fine, but the next time I read the file,
>> vim still treats it as if it were Latin 1 unless I manually
>> change encoding (and fileencoding?)
>>
>> As far as I can tell, I've nothing in any of my configuration
>> files which specifies an encoding.
>>     
>
> There are a number of settings related to encodings in [g]vim.
>
> * 'encoding' is global, it governs how the data is represented in Vim's 
> internal memory. As already said, you should set it once in your vimrc, 
> or not at all.
>
> * 'termencoding' tells Vim how the keyboard encodes data, and also, in 
> Console Vim but not in gvim, how the display understands text sent to 
> it. Its default is empty, which means "use 'encoding'"; however, if your 
> vimrc changes 'encoding', you should first save here the "old" 
> 'encoding' value as set from your OS's locale, in order to avoid 
> "misunderstandings" between Vim, its keyboard, and in Console mode also 
> its display.
>
> * 'fileencoding' (singular) is buffer-local, it tells Vim which encoding 
> is used on disk for the file in question. If empty there is no 
> translation (i.e., 'encoding' is used); if nonempty, you should make 
> sure that all characters actually used in the file can be represented in 
> memory (which is always the case if 'encoding' is UTF-8).
>
> * The ++enc argument (see ":help ++opt") to several reading or writing 
> commands (such as ":e[dit]", ":r[ead]", ":w[rite]", ":sav[eas]", etc.) 
> tells Vim which encoding to use on disk for that particular command. If 
> you use it, it overrides 'fileencodings' (see below). In the case of 
> commands which read a whole disk file into a new buffer, or (like 
> ":saveas") change the filename for the current buffer, it also sets 
> 'fileencoding' (see above).
>
> * 'fileencodings' (plural) is global; it defines the heuristics used by 
> Vim to set 'fileencoding' (singular) for an existing file. Its 
> comma-separated values are used from left to right; the following can be 
> used:
>    - ucs-bom (which should be first) means that if a Unicode BOM is 
> found at the start of a file, the corresponding Unicode encoding (as 
> well as the local boolean 'bomb' option) will be set, as follows:
>      o EF BB BF        UTF-8
>      o 00 00 FE FF     UTF-32be
>      o FF FE 00 00     UTF-32le
>      o FE FF           UTF-16be
>      o FF FE           UTF-16le
>      o For proper recognition of UTF-16le (which can represent 
> codepoints above U+FFFF) in preference to UCS-2le (which cannot, but 
> uses the same representation as UTF-16le for valid codepoints below 
> U+10000), Vim version 7.2.033 or later is required.
>      o For proper recognition of UTF-16be in preference to UCS-2be (same 
> remark), 7.1.261 or later is required.
>      o As can be seen above, the use of a BOM assumes that no UTF-16le 
> file will start with a null codepoint. I believe that this is a 
> reasonable assumption.
>    - a multibyte encoding name: when coming to that element of the 
> heuristic, Vim will test the file for that encoding, accept it if no 
> invalid code is found, and proceed to the next element of the heuristic 
> otherwise.
>    - an 8-bit encoding name (which should be last): since 8-bit 
> encodings can never give a "fail" signal, this _tells_ Vim which 
> encoding to use if all previous elements (if any) are found wanting.
>    - if no 8-bit encoding is included, and all included elements give 
> "fail" results, Vim will use Latin1 as a fallback, unless 
> 'fileencodings' is totally empty, in which case no test will be done and 
> the setting of 'fileencoding' (singular) will not be changed.
>
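
To tie this back to James's conversion, here is a minimal sketch of one way
to do it. It assumes the file on disk really is Latin1, that the buffer has
no unsaved changes, and that 'fileencodings' has the value used in the vimrc
quoted further down. I have not tried it on his files.

```vim
" Assumed heuristic: BOM first, then UTF-8, with Latin1 as the 8-bit fallback.
set fileencodings=ucs-bom,utf-8,latin1

" Reread the current file, forcing Latin1 as the on-disk encoding.
" ++enc overrides 'fileencodings' for this one read and sets 'fileencoding'.
edit ++enc=latin1

" Ask for UTF-8 on disk, then write the buffer back.
setlocal fileencoding=utf-8
write

" Reread with the normal heuristic and show what Vim settled on.
edit
setlocal fileencoding? bomb?
```

If the last command still shows latin1 after the reread, ":verbose set
fileencodings?" should show where that option was last set.
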
> The following is what I use near the top of my vimrc to set those 
> settings. I'm adding comments to make it more self-explanatory.
>
> if has("multi_byte")
>       " (optional) remember OS locale
>       let g:locale_encoding = &encoding
>       " if already Unicode, no need to change it
>       if &encoding !~? '^u'
>               " avoid clobbering the keyboard's encoding
>               " (and the display's in Console mode)
>               if &termencoding == ""
>                       let &termencoding = &encoding
>               endif
>               " now we can change the setting for Vim memory
>               set encoding=utf-8
>       endif
>       " define default heuristics for existing files
>       " (can be overridden by ++enc on a file-by-file basis)
>       set fileencodings=ucs-bom,utf-8,latin1
>       " Finally, let's set defaults for new files
>       " -- The following line is optional.
>       " If setting 'fileencoding' to some non-Unicode value,
>       " it is still possible to set 'bomb' on to mean that
>       " new Unicode files should have a BOM by default.
>       " 'bomb' has no effect on non-Unicode files.
>       setglobal bomb fileencoding=utf-8
> endif
>   
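
A small sketch for checking that a vimrc like the one above took effect. It
only uses ":set" queries, and what it prints depends on your locale and on
whatever else your own setup sources, so treat it as a starting point rather
than a definitive test.

```vim
" Show the effective values; :verbose also reports where each was last set.
verbose set encoding? termencoding? fileencodings?

" Global defaults that new buffers will inherit.
setglobal fileencoding? bomb?

" Values chosen for the buffer currently being edited.
setlocal fileencoding? bomb?
```
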
Tony -- perhaps you should consider making an addition to usr_45.txt 
with the above and submitting it to Bram...  Good explanation!


Regards,
Chip Campbell

