On Mon, 4 Oct 2010, esquifit wrote:

On 4 Oct, 15:42, Ben Fritz wrote:

You can also set fileencoding manually after a file read, so that you can convert it to a different encoding when writing the file. You will probably want this new encoding in your fileencodings option so it can be detected,

If I set fileencoding manually, I see no changes on the screen. What exactly does this option control?

It controls the encoding used when writing the file.


According to the help: "Sets the character encoding for the file of this buffer." But honestly, I don't get it. What does the statement mean? I have a number of fundamental questions about the subject of (vim and) encoding:

1) As far as I know, a text file stores no information about which encoding turns its series of bytes into meaningful text. When opening and displaying the file, an editor makes a guess based on the first N bytes, on certain patterns, etc., but in the end it is always a guess, and sometimes the editor gets it wrong. Is this right?

If you tell the editor to only ever consider certain encodings, you can improve its "guess". Also, various Unicode formats support a Byte-Order Mark (BOM). This is common with UTF-16, and discouraged with UTF-8[1]. The BOM prevents the need for guessing, but so does explicitly specifying what character sets you want to use.

[1] http://www.unicode.org/faq/utf_bom.html#bom4
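
To illustrate the BOM point, here is a minimal Python sketch (not Vim's actual code) that checks the start of a byte string for a known BOM. The function name sniff_bom and the table layout are made up for this example; the BOM constants come from Python's standard codecs module.

```python
import codecs

# Known byte-order marks. The UTF-32 entries come first because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# A UTF-16 LE file with a BOM: no guessing is needed.
sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
enc, skip = sniff_bom(sample)
text = sample[skip:].decode(enc)
print(enc, repr(text))
```

Without a BOM the function returns (None, 0), and the caller is back to guessing or to an explicitly specified encoding.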


2) When a file is loaded from disk into vim, what exactly happens with the bytes? Is there any option in vim that influences this process? My guess is that the editor interprets the original sequence of bytes (as on disk) according to the rules of some character encoding; for vim, this would be the value of the 'encoding' option. Is this correct?

Vim uses the 'fileencodings' option (short name 'fencs') to choose, unless it is explicitly given a ++enc= argument. See :help 'fileencodings' for the sequence of what Vim tries. See :help ++opt for how to use ++enc=
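
The general idea of trying a list of candidate encodings in order can be sketched in Python as below. This is only an illustration of the approach, not Vim's implementation; decode_first_match and the candidate list are hypothetical names chosen for the example.

```python
def decode_first_match(data: bytes, candidates):
    """Try each candidate encoding in order; return the first that decodes."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next
    raise ValueError("no candidate encoding fits the data")


# latin-1 accepts every byte sequence, so it only makes sense as the last entry.
candidates = ["utf-8", "latin-1"]

utf8_bytes = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

print(decode_first_match(utf8_bytes, candidates))
print(decode_first_match(latin1_bytes, candidates))
```

Both calls recover the same four characters, but from different byte sequences and via different candidates. The order of the list matters: an encoding that never fails would shadow everything listed after it.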


3) Based on these rules, the editor knows when to take one or two or more bytes to build a single *character*, and if more than one, in which order. From that, the editor has decided which *characters* (not bytes) the text contains. So for example, the sequence 1A 2B F3 E5 66 could be interpreted as
(1A 2B) (F3) (E5 66) according to encoding 1
(2B 1A) (E5 F3 66) according to encoding 2
where each () group represents a 'character' in the respective encoding. Thus, according to encoding 1 one would have for example: "small a", "capital z" and "digit 8", whereas according to encoding 2 one would have "question mark" and "small u umlaut". Is this description correct?

Yes, that's roughly it.
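
A concrete version of the same idea, as a small Python sketch with real encodings (the byte values are picked only for illustration): the same bytes group into a different number of characters, and the byte order within a group changes which character results.

```python
# Grouping: the same three bytes give two characters in UTF-8
# but three characters in Latin-1.
raw = bytes([0xC3, 0xA9, 0x41])
as_utf8 = raw.decode("utf-8")      # (C3 A9) (41)
as_latin1 = raw.decode("latin-1")  # (C3) (A9) (41)
print(len(as_utf8), len(as_latin1))

# Byte order: the same two bytes form one 16-bit unit either way,
# but the resulting code point depends on which byte comes first.
pair = bytes([0x3A, 0x26])
le = pair.decode("utf-16-le")  # read as 0x263A
be = pair.decode("utf-16-be")  # read as 0x3A26
print(hex(ord(le)), hex(ord(be)))
```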


4) What decides how the bytes are displayed on the screen? My understanding is that the font now comes into play; for each *character*, a glyph is provided by the font, and this is what is displayed on the screen. Is this description correct?

Oversimplified, but yes. In some encodings (e.g. Unicode), there are also "combining characters"[2]. Languages that are written right-to-left need their text laid out in the correct direction. For scripts that have letters whose shapes differ depending on their context, there is also "shaping" (e.g. Arabic[3], or Urdu[4]). Other characters might have different glyphs depending on the locale (e.g. Simplified or Traditional Chinese characters[5]).

Depending on whether you're using Vim or Gvim, this might be handled by Gvim or the underlying terminal (in Vim). Most of these things aren't well-supported by Vim, particularly in Vim proper, as Vim frequently relies on the assumption that characters can be arranged in a grid (many discussions on this list if interested).

[2] http://en.wikipedia.org/wiki/Combining_character
[3] Arabic: http://www.w3.org/International/tests/tests-html-css/tests-webfonts/generate?test=5
[4] Urdu: http://www.w3.org/International/tests/tests-html-css/tests-webfonts/generate?test=6
[5] http://en.wikipedia.org/wiki/Help:Multilingual_support_(East_Asian)

[*] (generally interesting tests) http://www.w3.org/International/tests/tests-html-css/list-fonts
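
To make the combining-character point concrete, here is a short Python sketch using the standard unicodedata module. It shows that "one glyph on screen" and "one character in the buffer" are not always the same thing; the variable names are just for this example.

```python
import unicodedata

composed = "\u00fc"     # LATIN SMALL LETTER U WITH DIAERESIS, one code point
decomposed = "u\u0308"  # 'u' followed by COMBINING DIAERESIS, two code points

# Both normally render as a single glyph, but they differ in length.
print(len(composed), len(decomposed))

# A non-zero combining class marks a character that attaches to the
# preceding base character instead of occupying its own cell.
print(unicodedata.combining("\u0308"), unicodedata.combining("u"))

# Normalization converts between the two forms.
print(unicodedata.normalize("NFC", decomposed) == composed)
```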


If yes, how can I make vim reinterpret the sequence of bytes according to a different encoding? Is it necessary to reload the file?

Yes, it's necessary to reload. Once fully loaded in a buffer, the characters are characters (not bytes).


If I use 'set fileencoding=blah', no change is visible, whereas when I use ':e ++enc=blah', the displayed glyphs do change. This is probably due to the fact that ':e ++enc' effectively reloads the sequence from disk (or rereads the original sequence of bytes from memory), and in doing so it resolves the bytes into characters according to the newly specified character encoding. On the other hand, 'set fileencoding=blah' does not seem to reload/reread anything. What is the effect of this option?

(as above: the encoding that will be written to disk)


I have a couple of ideas, but I would first like to know the answer to the following question.

5) What happens when I type something on the keyboard? This is a similar situation to reading from the disk; in the end, it is about a sequence of bytes being inserted at some place in the file; there is also the need to interpret them as characters and look for glyphs in some font to represent them (in case the file is being displayed or printed). Also in this case I would expect some option in vim to control how the bytes sent by my keyboard are to be interpreted. Which are these options?

If in Gvim, it uses the underlying library functionality. If in a terminal emulator, see:
:help mbyte-terminal

Is it the current value of 'encoding'? Or of 'fileencoding'? Or of 'termencoding'? And when? Only on terminals, or also in the GUI? And does it make a difference whether I am on Win32 or on *nix? Or if I use GTK or not? Or if I use Cygwin or not?

To oversimplify: Basically the only option that is significantly different between OSes (Win32 vs. *nix vs. Cygwin) is 'ff'/'ffs' (the end-of-line format), which isn't really even discussed in your email.

The defaults for the rest ('tenc','fenc','fencs','enc') generally depend on the locale (external to Vim -- can affect 'enc', and by association 'fencs') or being in Gvim vs. Vim (Gvim defaults 'tenc' to utf-8).

Each option's help text describes its defaults.


6) What happens when the file is written to disk (:w)? My guess is: after the bytes have been read and resolved into characters, and a glyph has been found for each character and displayed on the screen, the editor works exclusively 'on characters, not on bytes'. According to this, when writing back to disk, the editor would then convert the characters back into bytes according to the rules of some encoding option. Which option would this be: 'encoding', 'fileencoding', something derived from 'fileencodings', or what?

'fileencoding' if set, otherwise 'encoding'.
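
As a rough illustration of what "the encoding used when writing" means, the Python sketch below turns the same in-memory characters into different byte sequences depending on the target encoding. It is an analogy, not Vim's code; the sample text and the helper name can_write are invented for the example.

```python
text = "gr\u00fc\u00dfe"  # five characters in memory, whatever the file encoding

# The same characters become different bytes on disk.
as_utf8 = text.encode("utf-8")        # non-ASCII letters take two bytes each
as_latin1 = text.encode("latin-1")    # one byte per character
as_utf16 = text.encode("utf-16-le")   # two bytes per character here
print(len(as_utf8), len(as_latin1), len(as_utf16))


def can_write(s: str, encoding: str) -> bool:
    """Report whether every character of s is representable in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Changing the target encoding does not change what is shown, only what is
# written; a character missing from the target encoding cannot be written.
print(can_write(text, "latin-1"), can_write("\u263a", "latin-1"))
```

This also matches the observation earlier in the thread: changing the write-side encoding alters nothing on screen, because the characters in memory stay the same.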


As you see, too many basic questions that cannot be answered with 'fileencoding: Sets the character encoding for the file of this buffer'.

But that's just the summary: If it's set, fileencoding does exactly that: it sets the character encoding for the file (on disk) of this buffer.

The text after that (in :help 'fileencoding') explains what happens if you don't choose something explicitly. Vim tries to pick a reasonable default. If you use 'encoding=utf-8', that default is usually what you want. If you don't, Vim has to fall back on more heuristic approaches (it "guesses").

--
Best,
Ben
