On 14/11/08 18:14, Charles Campbell wrote:
> Tony Mechelynck wrote:
>> On 13/11/08 15:10, James Kanze wrote:
>>
>>> How does vim decide what encoding(s) to use when it opens an
>>> existing file?
>>>
>>> I ask this because in the past, with text files, it seems to
>>> have "just worked", and with C++ files and shell scripts, it
>>> never mattered, since they only contained ASCII. However, I've
>>> now got some C++ files which have French (with accents) in their
>>> comments. The standard header that we use (copyright, etc.) is
>>> in English, as is all of the program text itself, which means
>>> that there is a large block of pure ASCII at the start. I'm
>>> gradually converting everything from Latin 1 to UTF-8, however;
>>> I use vim for the conversion (read, change fileencoding,
>>> rewrite), which works fine, but the next time I read the file,
>>> vim still treats it as if it were Latin 1 unless I manually
>>> change encoding (and fileencoding?)
>>>
>>> As far as I can tell, I've nothing in any of my configuration
>>> files which specifies an encoding.
>>>
>> There are a number of settings related to encodings in [g]vim.
>>
>> * 'encoding' is global, it governs how the data is represented in Vim's
>> internal memory. As already said, you should set it once in your vimrc,
>> or not at all.
>>
>> * 'termencoding' tells Vim how the keyboard encodes data, and also, in
>> Console Vim but not in gvim, how the display understands text sent to
>> it. Its default is empty, which means "use 'encoding'"; however, if your
>> vimrc changes 'encoding', you should first save here the "old"
>> 'encoding' value as set from your OS's locale, in order to avoid
>> "misunderstandings" between Vim, its keyboard, and in Console mode also
>> its display.
>>
>> * 'fileencoding' (singular) is buffer-local, it tells Vim which encoding
>> is used on disk for the file in question. If empty there is no
>> translation (i.e., 'encoding' is used); if nonempty, you should make
>> sure that all characters actually used in the file can be represented in
>> memory (which is always the case if 'encoding' is UTF-8).
>>
>> * The ++enc argument (see ":help ++opt") to several reading or writing
>> commands (such as ":e[dit]", ":r[ead]", ":w[rite]", ":sav[eas]", etc.)
>> tells Vim which encoding to use on disk for that particular command. If
>> you use it, it overrides 'fileencodings' (see below). In the case of
>> commands which read a whole disk file into a new buffer, or (like
>> ":saveas") change the filename for the current buffer, it also sets
>> 'fileencoding' (see above).
>>
>> * 'fileencodings' (plural) is global; it defines the heuristics used by
>> Vim to set 'fileencoding' (singular) for an existing file. Its
>> comma-separated values are used from left to right; the following can be
>> used:
>> - ucs-bom (which should be first) means that if a Unicode BOM is
>> found at the start of a file, the corresponding Unicode encoding (as
>> well as the local boolean 'bomb' option) will be set, as follows:
>>     o EF BB BF       UTF-8
>>     o 00 00 FE FF    UTF-32ge
>>     o FF FE 00 00    UTF-32le
>>     o FE FF          UTF-16ge
>>     o FF FE          UTF-16le
>> o For proper recognition of UTF-16le (which can represent
>> codepoints above U+FFFF) in preference to UCS-2le (which cannot, but
>> uses the same representation as UTF-16le for valid codepoints below
>> U+10000), Vim version 7.2.033 or later is required.
>> o For proper recognition of UTF-16ge in preference to UCS-2ge (same
>> remark), 7.1.261 or later is required.
>> o As can be seen above, the use of a BOM assumes that no UTF-16le
>> file will start with a null codepoint. I believe that this is a
>> reasonable assumption.
>> - a multibyte encoding name: when coming to that element of the
>> heuristic, Vim will test the file for that encoding, accept it if no
>> invalid code is found, and proceed to the next element of the heuristic
>> otherwise.
>> - an 8-bit encoding name (which should be last): since 8-bit
>> encodings can never give a "fail" signal, this _tells_ Vim which
>> encoding to use if all previous elements (if any) are found wanting.
>> - if no 8-bit encoding is included, and all included elements give
>> "fail" results, Vim will use Latin1 as a fallback, unless
>> 'fileencodings' is totally empty, in which case no test will be done and
>> the setting of 'fileencoding' (singular) will not be changed.
>>
>> The following is what I use near the top of my vimrc to set those
>> settings. I'm adding comments to make it more self-explanatory.
>>
>> if has("multi_byte")
>>   " (optional) remember OS locale
>>   let g:locale_encoding = &encoding
>>   " if already Unicode, no need to change it
>>   if &encoding !~? '^u'
>>     " avoid clobbering the keyboard's encoding
>>     " (and the display's in Console mode)
>>     if &termencoding == ""
>>       let &termencoding = &encoding
>>     endif
>>     " now we can change the setting for Vim memory
>>     set encoding=utf-8
>>   endif
>>   " define default heuristics for existing files
>>   " (can be overridden by ++enc on a file-by-file basis)
>>   set fileencodings=ucs-bom,utf-8,latin1
>>   " Finally, let's set defaults for new files
>>   " -- The following line is optional.
>>   " If setting 'fileencoding' to some non-Unicode value,
>>   " it is still possible to set 'bomb' on to mean that
>>   " new Unicode files should have a BOM by default.
>>   " 'bomb' has no effect on non-Unicode files.
>>   setglobal bomb fileencoding=utf-8
>> endif
>>
> Tony -- perhaps you should consider making an addition to usr_45.txt
> with the above and submitting it to Bram... Good explanation!
>
> Regards,
> Chip Campbell

The snippet (or maybe a slightly older version of it), and a substantial
part of the explanation, are already at
http://vim.wikia.org/Working_with_Unicode . That said, if Bram wants to
follow up on your suggestion and use all or part of the above in the
help, he's welcome. Even if some of the things I write here are
original, or at least presented in an original way, I don't claim
intellectual property rights on any of it. ;-)

And -- oops -- wherever I wrote "ge" above (UTF-32ge, UTF-16ge, UCS-2ge)
it should be "be" for "big endian". Or no suffix at all, because
big-endian is the Vim default, even on little-endian machines.
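
For what it's worth, here is a minimal sketch of how one could check what
the 'fileencodings' heuristic decided for a file, and override it when it
guessed wrong. It assumes 'encoding' is already utf-8 (as the vimrc
snippet quoted above arranges), and the file name foo.txt is only a
placeholder, not a file from this thread.

```vim
" Show what was detected for the current buffer, and the heuristic in use
:setlocal fileencoding? bomb?
:set fileencodings?

" Re-read the current file, forcing Latin1 for this one command
" (++enc overrides 'fileencodings' and sets 'fileencoding')
:e ++enc=latin1

" Convert the buffer to UTF-8 on disk: change 'fileencoding', then write
:setlocal fileencoding=utf-8
:w

" Open a hypothetical file as big-endian UTF-16 (no suffix needed)
:e ++enc=utf-16 foo.txt
```

If the first command still reports latin1 for a file that has already
been converted, it is worth checking that utf-8 comes before latin1 in
'fileencodings', since the list is tried from left to right.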
Best regards,
Tony.
--
One man's theology is another man's belly laugh.