On 05/11/10 03:59, Alessandro Antonello wrote:
Since latin1 is an 8-bit encoding, it cannot give a "fail" signal:
fencs=ucs-bom,latin1,default,utf-8 means the same as fencs=ucs-bom,latin1
i.e. whenever there is no BOM, the file will be detected as Latin1 because
none of the 256 possible byte values, in any sequence, is invalid for Latin1
-- and if it is actually UTF-8 without BOM, anything above U+007F will appear
as two or more characters of gibberish.
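
The effect is easy to reproduce outside Vim. The following is only a minimal Python sketch, with Python's codecs standing in for Vim's own conversion (an assumption made for illustration, not a description of Vim's internals): every byte sequence decodes as Latin1, and UTF-8 bytes read that way turn into gibberish.

```python
# All 256 byte values are valid Latin1, so this decode can never raise.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# "ç" and "ã" each take two bytes in UTF-8 ...
utf8_bytes = "ação".encode("utf-8")

# ... so reading those bytes as Latin1 shows two characters per accented letter.
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```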

In the 'fileencodings' option, "ucs-bom", if present, should be first, and
an 8-bit encoding, if present, should be last (which means that at most one
8-bit encoding should be used), because anything that comes after the first
8-bit encoding will never be used.

Setting fencs=ucs-bom,utf-8,latin1 means the following:

1) Is there a BOM at the very start of the file? Then setlocal bomb, "eat"
the BOM, and setlocal the corresponding Unicode 'fileencoding', otherwise
setlocal nobomb and:

2) Are the full contents of the file valid for UTF-8? (and note: 7-bit ASCII
is valid for both UTF-8 and Latin1 and is displayed the same in both) -- if
yes, setlocal fenc=utf-8; otherwise

3) Unconditionally setlocal fenc=latin1
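
The three steps above can be sketched as a small Python function. This is an illustration under stated assumptions, not Vim's actual code: the function name detect() is hypothetical, the encoding names are Python codec names rather than Vim's, and a file whose BOM is followed by invalid data is simply left to raise an error.

```python
import codecs

# Longer BOMs first, so a UTF-32LE BOM is not mistaken for a UTF-16LE one.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect(data: bytes):
    """Return (encoding, has_bom, text), mimicking fencs=ucs-bom,utf-8,latin1."""
    # Step 1 (ucs-bom): a BOM decides the encoding and is "eaten".
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, True, data[len(bom):].decode(name)
    # Step 2 (utf-8): succeeds only if the whole content is valid UTF-8.
    try:
        return "utf-8", False, data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Step 3 (latin1): cannot fail, so it is the unconditional fallback.
    return "latin-1", False, data.decode("latin-1")


print(detect("ação".encode("latin-1")))
```

With this order, a BOM-less Latin1 file containing accented letters falls through to the last step, while pure ASCII is already accepted at step 2.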

Hi!

I see what you mean now.

1) Yes, I have a BOM at the start of every utf-8 file. You are saying that I
should set 'fencs=ucs-bom,utf-8,latin1'. What would happen if I open a utf-8
file from the command line using just 'gvim filename.ext', assuming that the
file has a BOM? Would Vim recognize the BOM and set 'fenc=utf-8'? What if I
use 'gvim filename.ext' for a file in latin1 encoding and no BOM? Would Vim
recognize that it has no BOM and set 'fenc=latin1', assuming that I have
'enc=latin1' defined?

If a file has a BOM, it is recognised by the first heuristic (ucs-bom), and nothing else comes into play.

A Latin1 file never has a BOM, so the "ucs-bom" heuristic will fail and the next heuristics will be tried in turn. If that file contains characters above 0x7F, it will be found "invalid for UTF-8" by the second (utf-8) heuristic, which will also fail. The latin1 heuristic cannot fail and the file will get the equivalent of ":setlocal nobomb fenc=latin1". The 'bomb' option is immaterial when fenc=latin1, but it is set to false by the failing ucs-bom heuristic, which cannot know whether or not a Unicode encoding will later be detected for this file.

A file entirely in 7-bit ASCII is valid for both UTF-8 and Latin1, so whichever of these is tried first will give a "success" signal, and that encoding will be used as the file's 'fileencoding'. As long as the file contains only 7-bit ASCII data, it makes no difference whether it is read and written as UTF-8 (without BOM) or Latin1, since 7-bit ASCII is represented identically in both.
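
As a quick check of that last point, here is a small Python sketch (the sample strings are arbitrary, chosen only for illustration) comparing the bytes produced by the two encodings:

```python
# Pure 7-bit ASCII text: both encodings produce exactly the same bytes.
text = "set fileencodings=ucs-bom,utf-8,latin1"
same_bytes = text.encode("utf-8") == text.encode("latin-1")

# A single accented letter is enough to make the two representations differ.
accented_utf8 = "é".encode("utf-8")      # two bytes
accented_latin1 = "é".encode("latin-1")  # one byte
print(same_bytes, accented_utf8, accented_latin1)
```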

If 'encoding' is set to latin1, however, you have a bigger problem: in that case most Unicode codepoints (in fact, any codepoint above U+00FF) cannot be represented in Vim's _internal memory_, and such "unrepresentable" codepoints will be garbled, probably replaced by inverted question marks or something like that. See the code snippet I wrote in some earlier post in this thread, or the page http://vim.wikia.com/Working_with_Unicode , about how to make sure at startup (in your vimrc) that Vim can edit anything, and how to do it cleanly: if you change 'encoding' after Vim has started, once some file(s) have been loaded in memory, there's a good chance the data in memory for such file(s) will get hopelessly corrupted.

"Anything" includes, if necessary, a page like my own homepage, http://users.skynet.be/antoine.mechelynck/index.htm , which contains text not only in several "Western" languages but also in Esperanto (which is Latin script but not Latin1-compatible; however the "incompatible" accented letters don't appear in that page) and in Russian, Arabic, Chinese and Japanese. Another example is http://users.skynet.be/antoine.mechelynck/other/imbecile.htm , which contains a single sentence in many languages including Portuguese (I don't know if from Portugal, Angola, Mozambique, Macau, Brazil, or which combination of them).


2) No, not all files are valid for both UTF-8 and Latin1. Some files are in
Portuguese (Brazilian) with accents, cedillas, etc.

Latin1 files with accents, cedillas, etc. will not be accepted by Vim as UTF-8 files, see my second paragraph from top, above, and the discussion about the difference between 7-bit ASCII (which is valid in both) and Latin1 with "higher-ASCII" characters in the hex range 80-FF (which isn't).


Right now I don't trust Vim's automatic behavior. Almost all source
files that I have are in latin1 encoding. This is why I put 'set enc=latin1'
in my *.vimrc*, even on the Mac, which is UTF-8 by default. Just a few XML files
are in 'utf-8'. For these I always use '++enc=utf-8' when I open or create them.

That won't work. ++enc=utf-8 changes 'fileencoding', not 'encoding' (and rightly so, because of the risk of corrupting other already loaded files' data by changing 'encoding'); if 'encoding' is set to Latin1, Vim won't be able to represent in memory any codepoint above U+00FF. For instance you won't be able to load correctly into Vim either of the two webpages I mentioned above if 'encoding' is set to Latin1. Not only will the non-Latin1 letters not be visible, they will also be garbled in Vim's memory.
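
The U+00FF limit is a property of Latin1 itself, so it can be checked outside Vim. The following Python sketch is only an illustration: the helper name fits_latin1() is hypothetical, and Python substitutes a plain "?" for unrepresentable characters, which is not necessarily what Vim shows.

```python
def fits_latin1(ch: str) -> bool:
    """True if the character has a codepoint Latin1 can represent."""
    return ord(ch) <= 0xFF


# Portuguese, Esperanto, Russian and Chinese sample letters.
samples = ["ç", "ĉ", "Ж", "中"]
results = {ch: fits_latin1(ch) for ch in samples}
print(results)

# Forcing a non-Latin1 character into Latin1 loses it for good:
lost = "Ж".encode("latin-1", errors="replace")
print(lost)
```

Only the Portuguese letter survives; the Esperanto, Russian and Chinese ones have codepoints above U+00FF.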


Don't get me wrong. I love Vim. I don't trust its behavior because I had
problems in the past with utf-16le encoded files. Maybe I just didn't have the
right configuration at that time. Since then I have used the configuration I showed.
But I am open to your advice, and I'll try the way you said.

Thanks again.
Alessandro Antonello


If 'encoding' is set to UTF-8, Vim will correctly load and edit:
- any UTF-16 (be or le) file with BOM, if "ucs-bom" comes first in 'fileencodings';
- any UTF-16 (be or le) file without BOM, if read with the proper ++enc= modifier.

If the file is read as Latin1 by the "automatic" process, about every second character will be shown as ^@ (a null); then reread it with e.g. ":e ++enc=utf-16le" and all will be OK.
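
The "every second character is a null" symptom can be seen in a short Python sketch (again using Python's codecs as a stand-in for Vim's conversion, purely for illustration):

```python
# "Vim" stored as UTF-16LE without BOM: each ASCII letter is followed by a zero byte.
raw = "Vim".encode("utf-16-le")

# Misread as Latin1, every zero byte becomes a NUL character (shown as ^@ in Vim).
misread = raw.decode("latin-1")
print(repr(misread))

# Read with the right encoding, the text is intact.
reread = raw.decode("utf-16-le")
print(reread)
```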

Best regards,
Tony.
--
Acquaintance, n.:
        A person whom we know well enough to borrow from, but not well
enough to lend to.
                -- Ambrose Bierce, "The Devil's Dictionary"
