On 17/04/11 16:19, Benjamin R. Haskell wrote:
[...]
Err, if you're using the 7-bit Control Sequence Introducer (\e[ = <Esc>
+ <[> = \033 \133), then CSI sequences are virtually always valid UTF-8.
The 8-bit single-byte variant (0x9b) is not valid UTF-8 on its own,
though. In a properly formed stream of bytes, 0x9b is never the first
byte of a UTF-8 sequence, since it has the form of a continuation byte
(10xxxxxx).
[...]
U+009B is an unprintable codepoint in Unicode, set apart to mean <CSI>.
Its UTF-8 representation is 0xC2 0x9B. Couldn't that be used in UTF-8?
It is valid UTF-8, but a *control* code, not a printable one, and it
still means <CSI>.
See http://www.unicode.org/charts/PDF/U0080.pdf which says:
009B <control>
= CONTROL SEQUENCE INTRODUCER
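
To make the byte-level picture concrete, here is a minimal sketch in
Python (the language choice and the helper name is_valid_utf8 are
assumptions for illustration, not something from this thread). It checks
that U+009B is a control character, that it encodes to 0xC2 0x9B, and
that a lone 0x9B byte is rejected by a UTF-8 decoder:

```python
import unicodedata

csi = "\u009b"

# General category "Cc" means "control": the codepoint is not printable.
category = unicodedata.category(csi)

# In UTF-8, U+009B becomes a two-byte sequence: lead byte 0xC2,
# continuation byte 0x9B.
encoded = csi.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The 7-bit introducer ESC [ is plain ASCII, hence valid UTF-8.
seven_bit = b"\x1b[31m"

# The bare 8-bit introducer starts with a continuation byte.
eight_bit_raw = b"\x9b31m"

# The same introducer, properly encoded as UTF-8.
eight_bit_utf8 = b"\xc2\x9b31m"
```

Whether a given terminal actually honours the 0xC2 0x9B form as <CSI>
is a separate question that this sketch does not answer.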
Yes, it might confuse some Windows users whose OS misguidedly pretends
that its Windows-1252 is ISO-8859-1; but cp1252 and Latin1 are *not* the
same, whatever Bill Gates may decree. (OTOH, it is intentional that
the first 256 codepoints of Unicode are the same as in Latin1, and that
the first half of those even have the same disk representation in UTF-8
as in Latin1 and US-ASCII.)
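
A short Python sketch of those three claims, for anyone who wants to
check them (the variable names are made up for illustration, and it
relies on Python's built-in "latin-1" and "cp1252" codecs):

```python
# Every byte value decodes under Latin-1 to the codepoint with the
# same number, so Unicode's first 256 codepoints mirror Latin-1.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_codepoints = [ord(ch) for ch in latin1_text]

# The first 128 codepoints have identical bytes in US-ASCII, Latin-1
# and UTF-8.
low_half = "".join(chr(i) for i in range(128))
ascii_bytes = low_half.encode("ascii")
latin1_bytes = low_half.encode("latin-1")
utf8_bytes = low_half.encode("utf-8")

# Byte 0x9B is where the two 8-bit encodings part ways:
# Latin-1 gives the control code U+009B, cp1252 gives U+203A.
as_latin1 = b"\x9b".decode("latin-1")
as_cp1252 = b"\x9b".decode("cp1252")
```

The U+203A that cp1252 produces is a printable angle quotation mark,
which is exactly the kind of confusion mentioned above.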
Best regards,
Tony.
--
Ye gods! Give me strength to suffer what cannot be changed, courage to
change what must be changed, and wisdom to tell the two apart.
-- Marcus Aurelius
--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php