mbbill wrote:
Hello A.J.Mechelynck,
Thursday, November 30, 2006, 1:15:14 PM, you wrote:
> A.J.Mechelynck wrote:
> mbbill wrote:
> I met a very strange problem recently, that is
> when I set the following options:
> set encoding=utf-8
> set ignorecase
> then the expression: if "\xe4"=="\xe4" fails.
> I test it using:
> if "\xe4"=="\xe4"
>  echo "test"
> endif
> but I got nothing output, why ?
>
> I confirm this:
>  :echo ("\xe4" == "\xe4")
> outputs 0
> I guess the strings, or at least one of them, are not evaluated as "the
> U+00E4 codepoint, i.e., 0xC3 0xA4" but as "the one-byte string 0xE4,
> which is not a valid Unicode codepoint when followed by a null". The
> latter would be NaS (Not a String) in evaluations, and give the same
> kind of strange results as NaN (Not a Number) in floating-point
> comparisons.
> This conjecture seems to be confirmed by
>  :echo ("\xe4")
> which outputs <e4> in blue, not ä (a-umlaut) in black, which is output by
>  :echo "ä"
> and by
>  :echo ("\<Char-0xe4>")
> Bug or feature?
> Best regards,
> Tony.
> P.S.
>  :echo ("ä" == "\xc3\xa4")
> outputs 1 (one, i.e., TRUE). I think this proves my conjecture above.
Yes, I agree with your opinion.
When I test it elsewhere, I sometimes cannot reproduce the "bug";
maybe some other options affect the result of the expression.
In all 8-bit encodings, "\xe4" is (IIUC) whatever is represented in that
encoding by the byte 0xe4, which is usually a valid character. In Unicode
(always internally UTF-8 in Vim) the byte 0xE4 is not a valid character on its
own; it must be followed by exactly two bytes (no more, no less) in the range
0x80-0xBF, because in UTF-8 each codepoint is represented by one to six bytes,
and these bytes are as follows:
0x00-0x7F: standalone byte
0x80-0xBF: trailing byte (any byte but the first, in a multibyte sequence)
0xC0-0xDF: leading byte of a two-byte sequence
0xE0-0xEF: leading byte of a three-byte sequence
0xF0-0xF7: leading byte of a four-byte sequence
0xF8-0xFB: leading byte of a five-byte sequence
0xFC-0xFD: leading byte of a six-byte sequence
0xFE-0xFF: invalid
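
As a rough illustration of the table above, here is a minimal sketch in Python (not Vim script, and not how Vim itself does it) that looks up the role of a single byte. The names UTF8_BYTE_ROLES and byte_role are made up for this example, and the ranges are copied from the table as written, i.e. the one-to-six-byte scheme.

```python
# Byte ranges taken from the table above (one-to-six-byte UTF-8 scheme).
# UTF8_BYTE_ROLES and byte_role are hypothetical names for this sketch.
UTF8_BYTE_ROLES = [
    (0x00, 0x7F, "standalone byte"),
    (0x80, 0xBF, "trailing byte"),
    (0xC0, 0xDF, "leading byte of a two-byte sequence"),
    (0xE0, 0xEF, "leading byte of a three-byte sequence"),
    (0xF0, 0xF7, "leading byte of a four-byte sequence"),
    (0xF8, 0xFB, "leading byte of a five-byte sequence"),
    (0xFC, 0xFD, "leading byte of a six-byte sequence"),
    (0xFE, 0xFF, "invalid"),
]


def byte_role(b: int) -> str:
    """Return the role of byte value b according to the table."""
    for low, high, role in UTF8_BYTE_ROLES:
        if low <= b <= high:
            return role
    raise ValueError("not a byte value: %r" % (b,))


# 0xE4 on its own only announces a three-byte sequence;
# 0xC3 0xA4 is a two-byte lead followed by one trailing byte.
for value in (0xE4, 0xC3, 0xA4):
    print(hex(value), "->", byte_role(value))
```

By this table, "\xe4" alone is a leading byte with nothing after it, while "\xc3\xa4" is a complete two-byte sequence.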
I don't know how "\xe4" tests in non-Unicode multibyte encodings such as those
used for Chinese, Japanese, Korean, etc.
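
For comparison outside Vim, here is a small Python sketch that tries to decode the lone byte 0xE4 in a few encodings. It only shows how Python's own codecs treat the byte and says nothing about what Vim does internally; decode_or_none is a made-up helper name, and the list of encodings is just a sample.

```python
from typing import Optional


def decode_or_none(raw: bytes, encoding: str) -> Optional[str]:
    """Decode raw with the given codec; return None if the bytes are not valid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The lone byte 0xE4 in an 8-bit encoding, in UTF-8, and in some CJK encodings.
for enc in ("latin-1", "utf-8", "gbk", "shift_jis", "euc_kr"):
    print(enc, "->", repr(decode_or_none(b"\xe4", enc)))

# The two-byte sequence 0xC3 0xA4 is the UTF-8 form of U+00E4.
print("utf-8 (c3 a4) ->", repr(decode_or_none(b"\xc3\xa4", "utf-8")))
```

In these CJK codecs 0xE4 falls in the lead-byte range of a double-byte character, so a lone 0xE4 should fail to decode there as well, just as it does in UTF-8.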
Best regards,
Tony.