On 03/11/13 14:56, John Joche wrote:
OK. Thank you for your help...

I can put the command

    set guifont=Lucida_Console:h12:cDEFAULT

inside C:\Users\JSonderson\_gvimrc and this font family, font size
and character set are loaded each time I start gvim.

------------------------------------------------------------------------

However, a question still remains: how come UTF-8 is not on the
list of character sets?

tl;dr: see the last paragraph before your next question.

UTF-8 is one of the ways to represent Unicode in memory. Unicode is the Universal character set, a superset of all possible character sets known to computer software.

The following encodings can represent all Unicode codepoints ("characters"):
- UTF-8, with between 1 and 4 bytes per character (originally up to 6 bytes had been foreseen, but then it was decided that codepoints above U+10FFFF would never be assigned). UTF-8 has the property that the 128 US-ASCII characters are represented in UTF-8 by one byte in exactly the same way as in US-ASCII, Latin1, and most other ASCII-derived encodings. (EBCDIC is of course a world apart.) A quick way to look at the byte representation of a character from inside Vim is shown just after this list.
- UTF-16, with one or two 2-byte words per character;
- UTF-32 (aka UCS-4), with one 4-byte doubleword per character;
- GB18030, with 1, 2 or 4 bytes per character but biased in favour of Chinese (this is the current official standard encoding of the PRC). Conversion between GB18030 and the other ones is possible but not trivial, and requires bulky tables. The iconv utility can usually do it, and so can Vim if built with +iconv, or with +iconv/dyn and it can find the iconv or libiconv library.
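As announced above, here is the quick check from inside Vim (assuming 'encoding' is set to utf-8): put the cursor on a character in Normal mode and hit ga to see its codepoint, or g8 to see the bytes of its UTF-8 representation. For example, on the euro sign € (U+20AC):

    ga    ->  <€> 8364, Hex 20ac, Octal 20254
    g8    ->  e2 82 ac              (three bytes in UTF-8)

(The exact wording of the messages may vary a little between Vim versions.)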

UTF-16 and UTF-32 can be big-endian (the default) or little-endian (e.g. UTF-16le). UTF-32 even supports the rarely used 3412 and 2143 byte orderings, but I'm not sure Vim knows about them.
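In Vim you can force a particular Unicode encoding (and endianness) for a single file with the ++enc argument, independently of the global settings; for example, to read a little-endian UTF-16 file and write it back as UTF-8 (the file name is of course just a placeholder):

    " read the file as little-endian UTF-16
    :e ++enc=utf-16le somefile.txt
    " write it back, converted to UTF-8
    :w ++enc=utf-8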

Vim internally represents UTF-16 and UTF-32 as UTF-8 in memory, because a NUL codepoint is a null word in UTF-16 and a null doubleword in UTF-32, and the many other null bytes which occur in such files (e.g. in every ASCII character encoded as UTF-16) would play havoc with Vim's use of null-terminated C strings. OTOH, in UTF-8 nothing other than the NUL codepoint U+0000 may validly include a null byte in its representation.

With some filetypes, it is possible to tell user applications which Unicode encoding and endianness to use by adding the codepoint U+FEFF at the very start of the file. That codepoint is usually called the BOM (byte-order mark) but it can even identify UTF-8 which has no endianness variants. It is supported for at least HTML and CSS; it is not recognized (and should not be present) in executable scripts in UTF-8, especially those where the first line starts with #! — I've been caught by that in the past, and now I know better.
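In Vim, whether a BOM is written is governed by the per-buffer 'bomb' option; for example:

    :set bomb?     " check whether the current buffer will get a BOM
    :set bomb      " write a BOM at the start of the file on the next :w
    :set nobomb    " don't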

Note that when Windows people say "Unicode" they usually mean UTF-16le. That's e.g. how one must decode the sentence "The file is not in UTF-8, it's in Unicode" (which, taken literally, is nonsense) in the mouth of a Microsoft engineer.

You set the 'encoding' option, preferably near the top of your vimrc, to tell Vim how characters are to be represented in memory. The advantage of using ":set enc=utf8" is that it allows Vim to represent in memory any character of any charset known to computer people. OTOH, using e.g. Latin1 as your 'encoding' value only allows Vim to represent the 256 characters which are part of the Latin1 charset; those are also the first 256 codepoints (U+0000 to U+00FF) of Unicode.
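A minimal example for the top of a vimrc (the wiki page linked just below discusses more complete variants; the exact 'fileencodings' list is a matter of taste):

    if has("multi_byte")
      " how Vim represents text internally
      set encoding=utf-8
      " default 'fileencoding' for newly created files
      setglobal fileencoding=utf-8
      " heuristics for detecting the encoding of existing files
      set fileencodings=ucs-bom,utf-8,latin1
    endif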

See also http://vim.wikia.com/wiki/Working_with_Unicode

All of the above is independent of the 'guifont' setting. Why is there nothing relating to Unicode in the :cXX parameter of Windows 'guifont' settings? I'm not sure. Either :cDEFAULT means Unicode, or else it's a Windows mystery.
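For what it's worth, the :cXX values accepted by Windows gvim are the Windows GDI charset names listed under :help gui-font (cANSI, cDEFAULT, cRUSSIAN, cSHIFTJIS, and so on), and "Unicode" simply isn't one of them; e.g.:

    " explicitly ask for the Cyrillic variant of the font
    set guifont=Lucida_Console:h12:cRUSSIAN
    " let Windows pick the charset (what you are using now)
    set guifont=Lucida_Console:h12:cDEFAULT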


Isn't the character set something separate from the font anyways?

Yes, it is; but each font file has glyphs for a certain set of languages only, and usually not for all the Unicode codepoints which are defined: there are an enormous number of them.
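If your main font lacks glyphs for some scripts (CJK, typically), gvim can use a second font for double-width characters via the 'guifontwide' option; for example (MS Gothic is just one font commonly present on Windows, any double-width font will do):

    set guifont=Lucida_Console:h12:cDEFAULT
    set guifontwide=MS_Gothic:h12:cDEFAULT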


What's the difference between character set and character encoding?

Not much. In most situations they can be used as synonyms. When they are not synonymous, the character set is the array of characters, and the character encoding is the exact manner in which those characters are represented (by how many bytes, and which ones) in memory, on disk, on tape, etc. Sometimes one word is used for the other: e.g. in HTTP or mail headers, the Content-Type line uses "charset=" to tell the receiving application which encoding is used in the document.
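For instance, a typical header line reads

    Content-Type: text/html; charset=UTF-8

even though UTF-8 is, strictly speaking, an encoding of the Unicode character set rather than a character set of its own.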

Unicode can be regarded as one abstract character set with room for more than a million characters (originally two thousand million, but then the number was reduced), which ATM can be represented in at least 8 different encodings if all byte-ordering variants are considered. Not all the Unicode "slots" have already received an assignment; some are reserved "for private use" and others have been blocked as "noncharacters". For details, see http://www.unicode.org/ and in particular http://www.unicode.org/charts/


How can I display the actual character set which is being used when I
use the DEFAULT setting?

You don't. The font either has a glyph for the character you're trying to display (and you should see that glyph), or it doesn't (and you should see some placeholder glyph instead, e.g. an empty frame or a reverse-video question mark).
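If you want to test whether your font has a glyph for a given codepoint, you can enter the character directly: in Insert mode, type CTRL-V u followed by four hex digits (or CTRL-V U and eight hex digits for codepoints above U+FFFF), e.g.:

    CTRL-V u20ac        inserts € (U+20AC EURO SIGN)
    CTRL-V U0001F600    inserts 😀 (U+1F600), or a placeholder if the font has no glyph for it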


Thanks.



Best regards,
Tony.
--
Love in your heart wasn't put there to stay.
Love isn't love 'til you give it away.
                -- Oscar Hammerstein II
