Re: How to set utf-8 locally (for a buffer) on loading the file

Benjamin R. Haskell Mon, 04 Oct 2010 15:08:22 -0700

On Mon, 4 Oct 2010, esquifit wrote:

On 4 Oct, 15:42, Ben Fritz wrote:
You can also set fileencoding manually after a file read, so that you can convert it to a different encoding when writing the file. You will probably want this new encoding in your fileencodings option so it can be detected,
If I set fileencoding manually, I see no changes on the screen. What exactly does this option control?


It controls the encoding used when writing the file.

According to the help: "Sets the character encoding for the file of this buffer." But honestly, I don't get it. What does the statement mean? I have a number of fundamental questions about the subject of (vim and) encoding:
1) As far as I know, there is no information stored with a text file about the encoding in which the series of bytes makes sense as text. An editor makes a guess when trying to open and display the file, based on the first N bytes, on certain patterns, etc., but in the end it is always a guess, and sometimes the editor gets it wrong. Is this right?

If you tell the editor to only ever consider certain encodings, you can improve its "guess". Also, various Unicode formats support a Byte-Order Mark (BOM). This is common with UTF-16, and discouraged with UTF-8[1]. The BOM prevents the need for guessing, but so does explicitly specifying what character sets you want to use.


[1] http://www.unicode.org/faq/utf_bom.html#bom4
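
To make the BOM point concrete, here is a minimal Python sketch of BOM sniffing. It only illustrates the idea and is not how Vim implements it; the function name sniff_bom and the table _BOMS are made up for this example.

```python
import codecs

# Hypothetical lookup table for this sketch. Longer marks come first because
# the UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A file without a BOM yields (None, 0), which is the case where the editor has to fall back on a list of candidate encodings or on an explicit choice.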

2) When a file is loaded from disk into vim, what exactly happens with the bytes? Is there any option in vim that influences this process? My guess is that the editor interprets the original sequence of bytes (as on disk) according to the rules of some character encoding; for vim, this would be the value of the 'encoding' option. Is this correct?

Vim uses the 'fileencodings' option ('fencs' for short) to choose (unless explicitly given a ++enc= argument). See :help 'fileencodings' for the sequence of what Vim tries. See :help ++opt for how to use ++enc=
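
The "try each candidate in order" idea can be sketched in Python as follows. This is an analogy under stated assumptions, not Vim's actual detection code: the function name first_fitting_encoding is invented here, and the candidate list stands in for a 'fileencodings'-style setting.

```python
def first_fitting_encoding(data: bytes, candidates):
    """Return (name, text) for the first candidate that decodes without error.

    `candidates` plays the role of an ordered list of encodings to try
    (an assumption of this sketch).
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding, try the next
    raise ValueError("no candidate encoding fits")
```

Note that an 8-bit encoding such as latin-1 accepts any byte sequence, so it only makes sense as the last candidate: anything listed after it would never be tried.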

3) Based on these rules, the editor knows when to take one or two or more bytes to build a single *character*, and if more than one, in which order. From that, the editor has decided which *characters* (not bytes) the text contains. So for example, the sequence 1A 2B F3 E5 66 could be interpreted as
(1A 2B) (F3) (E5 66) according to encoding 1
(2B 1A) (E5 F3 66) according to encoding 2
where each () group represents a 'character' in the respective encoding. Thus, according to encoding 1 one would have for example: "small a", "capital z" and "digit 8", whereas according to encoding 2 one would have "question mark" and "small u umlaut". Is this description correct?


Yes, that's roughly it.
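
A small Python illustration of the same bytes resolving to different characters. The two bytes C3 A9 were picked for this example and do not come from the thread.

```python
# The same two bytes, decoded under three different encodings.
raw = bytes([0xC3, 0xA9])

as_utf8 = raw.decode("utf-8")         # one character: U+00E9
as_latin1 = raw.decode("latin-1")     # two characters: U+00C3, U+00A9
as_utf16le = raw.decode("utf-16-le")  # one character: U+A9C3 (bytes swapped)

for label, text in [("utf-8", as_utf8), ("latin-1", as_latin1), ("utf-16-le", as_utf16le)]:
    print(label, len(text), [f"U+{ord(ch):04X}" for ch in text])
```

The byte sequence never changes; only the grouping of bytes into characters and the order in which they are read does.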

4) What decides how the bytes are displayed on the screen? My understanding is that the font now comes into play; for each *character*, a glyph is provided by the font, and this is what is displayed on the screen. Is this description correct?

Oversimplified, but yes. In some encodings (e.g. Unicode), there are also "combining characters"[2]. Languages that are written right-to-left need to be laid out accordingly. For scripts that have letters whose shapes differ depending on their context, there is also "shaping" (e.g. Arabic[3], or Urdu[4]). Other characters might have different glyphs depending on the locale (e.g. Simplified or Traditional Chinese characters[5]).
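
As a rough illustration of combining characters, the Python sketch below counts only the characters that are not combining marks. The helper name count_base_characters is made up for this example, and the approach is a simplification: it is not full grapheme-cluster segmentation and says nothing about how Vim or a terminal actually lays out text.

```python
import unicodedata


def count_base_characters(text: str) -> int:
    """Count characters whose canonical combining class is 0 (non-combining)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


precomposed = "\u00e9"                                  # e with acute, one code point
decomposed = unicodedata.normalize("NFD", precomposed)  # "e" + U+0301 combining acute

print(len(precomposed), len(decomposed))      # code point counts differ
print(count_base_characters(decomposed))      # but there is one base character
```

Both strings are expected to render as a single glyph, even though one of them contains two characters.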

Depending on whether you're using Vim or Gvim, this might be handled by Gvim or the underlying terminal (in Vim). Most of these things aren't well-supported by Vim, particularly in Vim proper, as Vim frequently relies on the assumption that characters can be arranged in a grid (many discussions on this list if interested).


[2] http://en.wikipedia.org/wiki/Combining_character
[3] Arabic: http://www.w3.org/International/tests/tests-html-css/tests-webfonts/generate?test=5
[4] Urdu: http://www.w3.org/International/tests/tests-html-css/tests-webfonts/generate?test=6
[5] http://en.wikipedia.org/wiki/Help:Multilingual_support_(East_Asian)

[*] (generally interesting tests) http://www.w3.org/International/tests/tests-html-css/list-fonts

If yes, how can I change the way vim interprets the sequence of bytes according to a different encoding? Is it necessary to reload the file?

Yes, it's necessary to reload. Once fully loaded in a buffer, the characters are characters (not bytes).

If I use 'set fileencoding=blah', no change is visible, whereas when I use ':e ++enc=blah', the displayed glyphs do change. This is probably due to the fact that ':e ++enc' effectively reloads the sequence from disk (or rereads the original sequence of bytes from memory), and in doing so it resolves the bytes into characters according to the newly specified character encoding. On the other hand, 'set fileencoding=blah' does not seem to reload/reread anything. What is the effect of this option?


(as above: the encoding that will be written to disk)

I have a couple of ideas, but I'd first like to know the answer to the following question.
5) What happens when I type something on the keyboard? This is a similar situation to reading from the disk; in the end, it is about a sequence of bytes being inserted at some place in the file; there is also the need to interpret them as characters and look for glyphs in some font to represent them (in case the file is being displayed or printed). Also in this case I would expect some option in vim to control how the bytes sent by my keyboard are to be interpreted. Which are these options?

If in Gvim, it uses the underlying library functionality. If in a terminal emulator, see:

:help mbyte-terminal

Is it the current value of 'encoding'? Or of 'fileencoding'? Or of 'termencoding'? And when? Only on terminals, or also in the GUI? And does it make a difference whether I am on Win32 or on *nix? Or if I use GTK or not? Or if I use Cygwin or not?

To oversimplify: Basically the only option that is significantly different between OSes (Win32 vs. *nix vs. Cygwin) is 'ff'/'ffs' (the end-of-line format), which isn't really even discussed in your email.

The defaults for the rest ('tenc', 'fenc', 'fencs', 'enc') generally depend on the locale (external to Vim -- can affect 'enc', and by association 'fencs') or being in Gvim vs. Vim (Gvim defaults 'tenc' to utf-8).


Each option's help text describes its defaults.

6) What happens when the file is written to disk (:w)? My guess is: after reading the bytes, resolving them into characters, finding a glyph for each character and displaying it on the screen, the editor works exclusively 'on characters, not on bytes'. According to this, when writing back to disk, the editor would then reverse-engineer the characters into bytes according to the rules of some encoding option. What would this option be: 'encoding', 'fileencoding', something derived from 'fileencodings', what?


'fileencoding' if set, otherwise 'encoding'.
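
The read/write asymmetry discussed in this thread can be modelled with a toy Python class. Everything here is hypothetical and only mirrors the behaviour described above: ToyBuffer, INTERNAL_ENCODING and the sample bytes are invented for the sketch and are not Vim internals.

```python
from dataclasses import dataclass

INTERNAL_ENCODING = "utf-8"  # stands in for 'encoding' (assumption of this sketch)


@dataclass
class ToyBuffer:
    text: str               # characters, already decoded from bytes
    fileencoding: str = ""  # empty means "fall back to the internal encoding"

    @classmethod
    def load(cls, data: bytes, enc: str) -> "ToyBuffer":
        # Decoding happens once, at load time (compare ':e ++enc=...').
        return cls(text=data.decode(enc), fileencoding=enc)

    def to_bytes(self) -> bytes:
        # Encoding happens at write time: 'fileencoding' if set, else the fallback.
        return self.text.encode(self.fileencoding or INTERNAL_ENCODING)


on_disk = b"caf\xe9"                     # sample bytes chosen for this example
buf = ToyBuffer.load(on_disk, "latin-1")
before = buf.text
buf.fileencoding = "utf-8"               # like ':set fileencoding=utf-8'
after = buf.text                         # the characters are unchanged
written = buf.to_bytes()                 # only the bytes written out differ
reloaded = ToyBuffer.load(on_disk, "cp437")  # re-reading decodes differently
```

Changing fileencoding leaves the text in the buffer alone and only affects the bytes produced on write, while loading the same bytes with a different encoding produces different characters.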

As you see, too many basic questions that cannot be answered with 'fileencoding: Sets the character encoding for the file of this buffer'.

But that's just the summary: If it's set, fileencoding does exactly that: it sets the character encoding for the file (on disk) of this buffer.

The text after that (in :help 'fileencoding') explains what happens if you don't choose something explicitly. Vim tries to pick a reasonable default. If you use 'encoding=utf-8', that default is usually what you want. If you don't, Vim has to fall back on more heuristic approaches (it "guesses").


--
Best,
Ben

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
