On 17/04/11 16:19, Benjamin R. Haskell wrote:
[...]
Err, if you're using the 7-bit Control Sequence Introducer (\e[ = <Esc>
+ <[> = \033 \133), then CSI sequences are virtually always valid UTF-8.
The 8-bit single-byte variant does avoid UTF-8, though. In a
properly formed stream of bytes, 0x9b is never the first byte of a
UTF-8 sequence, since it has the form of a continuation byte.
[...]

U+009B is a non-printing codepoint in Unicode, reserved to mean <CSI>. Its UTF-8 representation is the two bytes 0xC2 0x9B. Couldn't that be used in a UTF-8 stream? It is valid UTF-8; it is simply a *control* code rather than a printable one, and it still means <CSI>.

See http://www.unicode.org/charts/PDF/U0080.pdf which says:

009B <control>
     = CONTROL SEQUENCE INTRODUCER
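
To make the byte-level point concrete, here is a small Python sketch (my own illustration, not anything taken from Vim or a terminal's source; the helper name is_continuation_byte is made up for the example). It only checks the encoding facts stated above and says nothing about whether a given terminal actually accepts 0xC2 0x9B as <CSI>.

```python
import unicodedata

# U+009B is the C1 control code CSI (CONTROL SEQUENCE INTRODUCER).
csi = "\u009b"

# Unicode general category "Cc" means "control": not a printable character.
category = unicodedata.category(csi)

# Its UTF-8 encoding is the two-byte sequence 0xC2 0x9B.
encoded = csi.encode("utf-8")


def is_continuation_byte(b: int) -> bool:
    """Hypothetical helper: true if b has the bit pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80


# A lone 0x9B byte looks like a continuation byte, so it cannot start
# a UTF-8 sequence and does not decode on its own.
lone_csi_is_valid = True
try:
    b"\x9b".decode("utf-8")
except UnicodeDecodeError:
    lone_csi_is_valid = False

# The 7-bit form, ESC followed by "[", is plain ASCII and so also valid UTF-8.
seven_bit = b"\x1b[".decode("utf-8")

print(category, encoded.hex(), is_continuation_byte(0x9B), lone_csi_is_valid)
```

So the raw 8-bit byte is not valid UTF-8 by itself, while the codepoint it stands for in Latin1 is perfectly encodable.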


Yes, it might confuse some Windows users whose OS misguidedly pretends that its Windows-1252 is ISO-8859-1; but cp1252 and Latin1 are *not* the same, whatever Bill Gates may decree: cp1252 assigns printable characters to bytes in the range 0x80-0x9F, where Latin1 has the C1 control codes. (OTOH, it is intentional that the first 256 codepoints of Unicode are the same as in Latin1, and that the first half of those even have the same byte representation in UTF-8 as in Latin1 and US-ASCII.)
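
For anyone who wants to check those claims, a minimal Python sketch follows. It assumes Python's built-in "cp1252" and "latin-1" codecs are faithful to Windows-1252 and ISO-8859-1 respectively; the variable names are only illustrative.

```python
# Decoding each byte 0x00-0xFF as Latin-1 gives the codepoint of the same value,
# i.e. the first 256 Unicode codepoints coincide with Latin-1.
all_bytes = bytes(range(256))
latin1_matches_unicode = all(
    ord(ch) == i for i, ch in enumerate(all_bytes.decode("latin-1"))
)

# The lower half (0x00-0x7F) has identical bytes in US-ASCII, Latin-1 and UTF-8.
ascii_text = bytes(range(128)).decode("ascii")
same_bytes = (
    ascii_text.encode("utf-8")
    == ascii_text.encode("latin-1")
    == ascii_text.encode("ascii")
)

# The upper half does NOT keep its byte form in UTF-8: U+00E9 is one byte
# in Latin-1 but two bytes in UTF-8.
e_acute_latin1 = "\u00e9".encode("latin-1")
e_acute_utf8 = "\u00e9".encode("utf-8")

# cp1252 and Latin-1 disagree on byte 0x9B: a printable quotation mark
# (U+203A) in cp1252, the CSI control code (U+009B) in Latin-1.
byte_9b_cp1252 = b"\x9b".decode("cp1252")
byte_9b_latin1 = b"\x9b".decode("latin-1")

print(latin1_matches_unicode, same_bytes)
print(e_acute_latin1.hex(), e_acute_utf8.hex())
print(hex(ord(byte_9b_cp1252)), hex(ord(byte_9b_latin1)))
```

The byte 0x9B is thus exactly one of the places where treating cp1252 as Latin1 goes wrong.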


Best regards,
Tony.
--
Ye gods! Give me strength to suffer what cannot be changed, courage to change
what must be changed, and wisdom to tell the two apart.
                -- Marcus Aurelius

--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
