mbbill wrote:
Hello A.J.Mechelynck,

Thursday, November 30, 2006, 1:15:14 PM, you wrote:

> A.J.Mechelynck wrote:
>> mbbill wrote:
>> I ran into a very strange problem recently:
>> when I set the following options:
>> set encoding=utf-8
>> set ignorecase
>> then the expression if "\xe4"=="\xe4" fails.
>> I tested it using:
>> if "\xe4"=="\xe4"
>>    echo "test"
>> endif
>> but I got no output. Why?

>

> I confirm this:

>     :echo ("\xe4" == "\xe4")

> outputs 0

> I guess the strings, or at least one of them, are not evaluated as "the U+00E4 codepoint, i.e., 0xC3 0xA4" but as "the one-byte string 0xE4, which is not a valid Unicode codepoint when followed by a null". The latter would be NaS (Not a String) in evaluations, and give the same kind of strange results as NaN (Not a Number) in floating-point comparisons.

> This conjecture seems to be confirmed by

>     :echo ("\xe4")

> which outputs <e4> in blue, not ä (a-umlaut) in black, which is output by

>     :echo "ä"

> and by

>     :echo ("\<Char-0xe4>")


> Bug or feature?


> Best regards,
> Tony.


> P.S.

>         :echo ("ä" == "\xc3\xa4")

> outputs 1 (one, i.e., TRUE). I think this proves my conjecture above.

Yes, I agree with you.
When I test it elsewhere, I cannot always reproduce the "bug";
maybe some other options affect the result of the expression.




In all 8-bit encodings, "\xe4" is (IIUC) whatever that encoding represents by the byte 0xE4, which is usually a valid character. In Unicode (always UTF-8 internally in Vim), 0xE4 is not a valid character unless it is followed by exactly two bytes (no more, no less) in the range 0x80-0xBF. This is because each UTF-8 codepoint is represented by one to six bytes, and these bytes are as follows:
0x00-0x7F: standalone byte
0x80-0xBF: trailing byte (any byte but the first, in a multibyte sequence)
0xC0-0xDF: leading byte of a two-byte sequence
0xE0-0xEF: leading byte of a three-byte sequence
0xF0-0xF7: leading byte of a four-byte sequence
0xF8-0xFB: leading byte of a five-byte sequence
0xFC-0xFD: leading byte of a six-byte sequence
0xFE-0xFF: invalid
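
The table above can be turned into a small lookup. What follows is only a minimal Python sketch, not Vim's own code: the function name `utf8_byte_role` and the role labels are made up for illustration, and it follows the one-to-six-byte ranges exactly as listed here.

```python
def utf8_byte_role(byte: int) -> str:
    """Classify one byte value (0-255) using the ranges listed above."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a byte value in 0..255")
    if byte <= 0x7F:
        return "standalone"
    if byte <= 0xBF:
        return "trailing"
    if byte <= 0xDF:
        return "lead-2"
    if byte <= 0xEF:
        return "lead-3"
    if byte <= 0xF7:
        return "lead-4"
    if byte <= 0xFB:
        return "lead-5"
    if byte <= 0xFD:
        return "lead-6"
    return "invalid"


# 0xE4 announces a three-byte sequence, so on its own it is incomplete.
print(utf8_byte_role(0xE4))                        # lead-3
# U+00E4 (a-umlaut) is encoded as the two bytes 0xC3 0xA4.
print("\u00e4".encode("utf-8"))                    # b'\xc3\xa4'
print(utf8_byte_role(0xC3), utf8_byte_role(0xA4))  # lead-2 trailing
```

This matches the observation in the P.S. above: the character ä is the byte pair 0xC3 0xA4, while a lone 0xE4 is the start of a sequence that never gets completed.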

I don't know how "\xe4" tests in non-Unicode multibyte encodings such as those used for Chinese, Japanese, Korean, etc.
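
The byte-level side of that question can at least be explored outside Vim. The sketch below uses Python's standard codecs to try decoding the lone byte 0xE4 under a few encodings. This is an assumption-laden illustration: it shows how Python's codecs treat the byte, and says nothing about how Vim itself compares such strings. `try_decode` is a hypothetical helper name.

```python
def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Try the single byte 0xE4 under one 8-bit encoding, UTF-8,
# and a few East Asian multibyte encodings.
for enc in ("latin-1", "utf-8", "gbk", "euc_jp", "shift_jis", "euc_kr"):
    print(enc, repr(try_decode(b"\xe4", enc)))
```

With Python's codecs, only the 8-bit encoding yields a character for the lone byte. In the multibyte encodings tried here, 0xE4 is a lead byte and needs at least one more byte after it.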


Best regards,
Tony.
