Re: enc,fenc (again!?)

Tony Mechelynck Thu, 04 Nov 2010 22:41:37 -0700

On 05/11/10 03:59, Alessandro Antonello wrote:

Since latin1 is an 8-bit encoding, it cannot give a "fail" signal:
fencs=ucs-bom,latin1,default,utf-8 means the same as fencs=ucs-bom,latin1
i.e. whenever there is no BOM, the file will be detected as Latin1 because
none of the 256 possible byte values, in any sequence, is invalid for Latin1
-- and if it is actually UTF-8 without BOM, anything above U+007F will appear
as two or more characters of gibberish.


In the 'fileencodings' option, "ucs-bom", if present, should be first, and
an 8-bit encoding, if present, should be last (which means that at most one
8-bit encoding should be used), because anything that comes after the first
8-bit encoding will never be used.

Setting fencs=ucs-bom,utf-8,latin1 means the following:

1) Is there a BOM at the very start of the file? Then setlocal bomb, "eat"
the BOM, and setlocal the corresponding Unicode 'fileencoding', otherwise
setlocal nobomb and:

2) Are the full contents of the file valid for UTF-8? (and note: 7-bit ASCII
is valid for both UTF-8 and Latin1 and is displayed the same in both) -- if
yes, setlocal fenc=utf-8; otherwise

3) Unconditionally setlocal fenc=latin1
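
As a minimal sketch of the above (assuming the setting goes in your vimrc and
the check is typed by hand once a file is loaded), the three steps correspond
to:

```vim
" Step 1: BOM check. Step 2: strict UTF-8 validation.
" Step 3: Latin1 as the catch-all, which can never fail.
set fileencodings=ucs-bom,utf-8,latin1

" After opening a file, show which heuristic won for this buffer.
setlocal fileencoding? bomb?
```

The second command only reports the current values; it changes nothing.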


Hi!

I see what you mean now.

1) Yes, I have a BOM at the start of every UTF-8 file. You are saying that I
should set 'fencs=ucs-bom,utf-8,latin1'. What would happen if I open a UTF-8
file with a BOM from the command line using just 'gvim filename.ext'? Would
Vim recognize the BOM and set 'fenc=utf-8'? And if I use 'gvim filename.ext'
for a file in latin1 encoding with no BOM, would Vim recognize that it has no
BOM and set 'fenc=latin1', assuming that I have 'enc=latin1' defined?

If a file has a BOM, it is recognised by the first heuristic (ucs-bom), and
nothing else comes into play.

A Latin1 file never has a BOM, so the "ucs-bom" heuristic will fail and the
next heuristics will be tried in turn. If that file contains characters above
0x7F, it will be found "invalid for UTF-8" by the second (utf-8) heuristic,
which will also fail. The latin1 heuristic cannot fail and the file will get
the equivalent of ":setlocal nobomb fenc=latin1". The 'bomb' option is
immaterial when fenc=latin1, but it is set to false by the failing ucs-bom
heuristic, which cannot know whether or not a Unicode encoding will later be
detected for this file.

A file entirely in 7-bit ASCII is valid for both UTF-8 and Latin1, so
whichever of these is tried first will give a "success" signal, and that
encoding will be used as the file's 'fileencoding'. As long as the file
contains only 7-bit ASCII data, it makes no difference whether it is read and
written as UTF-8 (without BOM) or Latin1, since 7-bit ASCII is represented
identically in both.

If 'encoding' is set to latin1, however, you have a bigger problem: in that
case most Unicode codepoints (in fact, any codepoint above U+00FF) cannot be
represented in Vim's _internal memory_, and such "unrepresentable" codepoints
will be garbled, probably replaced by inverted question marks or something
like that. See the code snippet I wrote in some earlier post in this thread,
or the page http://vim.wikia.com/Working_with_Unicode , about how to make
sure at startup (in your vimrc) that Vim can edit anything, including if
necessary a page like my own homepage,
http://users.skynet.be/antoine.mechelynck/index.htm , which contains not only
text in several "Western" languages, but also in Esperanto (which is Latin
script but not Latin1-compatible; however the "incompatible" accented letters
don't appear in that page) and in Russian, Arabic, Chinese and Japanese -- or
like http://users.skynet.be/antoine.mechelynck/other/imbecile.htm , which
contains a single sentence in many languages including Portuguese (I don't
know if from Portugal, Angola, Mozambique, Macau, Brazil, or which
combination of them); and how to do it cleanly (because if you change
'encoding' after Vim has started, once some editfile(s) has been loaded in
memory, there's a good chance the data in memory for such file(s) will get
hopelessly corrupt).
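
For illustration, a vimrc fragment along the following lines would do it.
This is only a sketch, not necessarily identical to the snippet from the
earlier post or to the one on the wiki page, and it assumes a Vim compiled
with +multi_byte:

```vim
if has("multi_byte")
  " Remember the locale encoding for keyboard/terminal I/O before
  " 'encoding' is changed.
  if &termencoding == ""
    let &termencoding = &encoding
  endif
  " Internal representation: can hold any Unicode codepoint.
  set encoding=utf-8
  " Default 'fileencoding' for newly created files.
  setglobal fileencoding=utf-8
  " Detection order for existing files.
  set fileencodings=ucs-bom,utf-8,latin1
endif
```

If most of your new files should be Latin1 on disk, "setglobal
fileencoding=latin1" could be used instead of the utf-8 default shown here;
'encoding' stays at utf-8 either way.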


2) No, not all files are valid for both UTF-8 and Latin1. Some files are in
Portuguese (Brazilian) with accents, cedillas, etc.

Latin1 files with accents, cedillas, etc. will not be accepted by Vim as
UTF-8 files, see my second paragraph from top, above, and the discussion
about the difference between 7-bit ASCII (which is valid in both) and Latin1
with "higher-ASCII" characters in the hex range 80-FF (which isn't).


Right now I don't trust Vim's automatic behavior. Almost all source
files that I have are in latin1 encoding. This is why I put 'set enc=latin1'
in my *.vimrc*, even on the Mac that is UTF-8 by default. Just a few XML files
are in 'utf-8'. For these I always use '++enc=utf-8' when I open or create
them.

That won't work. ++enc=utf-8 changes 'fileencoding', not 'encoding' (and
rightly so, because of the risk of corrupting other already loaded files'
data by changing 'encoding'); if 'encoding' is set to Latin1, Vim won't be
able to represent in memory any codepoint above U+00FF. For instance you
won't be able to load correctly into Vim either of the two webpages I
mentioned above if 'encoding' is set to Latin1. Not only will the non-Latin1
letters not be visible, they will also be garbled in Vim's memory.
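
To see the difference for yourself, something like the following can be
typed at the command line (a sketch; "some.xml" is a hypothetical file name):

```vim
" Read the file, forcing UTF-8 as its on-disk encoding.
edit ++enc=utf-8 some.xml
" The buffer's 'fileencoding' is now utf-8 ...
setlocal fileencoding?
" ... but 'encoding' is untouched and still shows whatever the vimrc set.
set encoding?
```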


Don't get me wrong. I love Vim. I don't trust its behavior because I had
problems in the past with utf-16le encoded files. Maybe I just didn't get the
configuration right at that time. Since then I have used the configuration I
showed. But I am open to your advice, and I'll try the way you said.

Thanks again.
Alessandro Antonello


If 'encoding' is set to UTF-8, Vim will correctly load and edit:

- any UTF-16 (be or le) file with BOM, if "ucs-bom" comes first in
  'fileencodings';
- any UTF-16 (be or le) file without BOM, if read with the proper ++enc=
  modifier. If the file is read as Latin1 by the "automatic" process, about
  every second character will be shown as ^@ (a null); then reread it with
  e.g. ":e ++enc=utf-16le" and all will be OK.
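
For the second case, the reread might look like this (a sketch, assuming the
file is already loaded in the current buffer and has no unsaved changes):

```vim
" Reread the current file, forcing little-endian UTF-16.
edit ++enc=utf-16le
" Confirm what the buffer is using now.
setlocal fileencoding? bomb?
```

Use utf-16 (big-endian by default) instead of utf-16le if the ^@ appears
before each character rather than after it.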


Best regards,
Tony.
--
Acquaintance, n.:
        A person whom we know well enough to borrow from, but not well
enough to lend to.
                -- Ambrose Bierce, "The Devil's Dictionary"
