Re: if "\xe4"=="\xe4" failes,why?

A.J.Mechelynck Wed, 29 Nov 2006 23:13:51 -0800

mbbill wrote:

Hello A.J.Mechelynck,


Thursday, November 30, 2006, 1:15:14 PM, you wrote:

> A.J.Mechelynck wrote:
>
> mbbill wrote:
>
> I met a very strange problem recently, that is
> when I set the following options:
> set encoding=utf-8
> set ignorecase
> then the expression: if "\xe4"=="\xe4" fails.
> I test it using:
> if "\xe4"=="\xe4"
>    echo "test"
> endif
> but I got nothing output, why ?
>
> I confirm this:
>
>     :echo ("\xe4" == "\xe4")
>
> outputs 0
>
> I guess the strings, or at least one of them, are not evaluated as "the
> U+00E4 codepoint, i.e., 0xC3 0xA4" but as "the one-byte string 0xE4,
> which is not a valid Unicode codepoint when followed by a null". The
> latter would be NaS (Not a String) in evaluations, and give the same
> kind of strange results as NaN (Not a Number) in floating-point
> comparisons.
>
> This conjecture seems to be confirmed by
>
>     :echo ("\xe4")
>
> which outputs <e4> in blue, not ä (a-umlaut) in black, which is output by
>
>     :echo "ä"
>
> and by
>
>     :echo ("\<Char-0xe4>")
>
> Bug or feature?
>
> Best regards,
> Tony.
>
> P.S.
>
>         :echo ("ä" == "\xc3\xa4")
>
> outputs 1 (one, i.e., TRUE). I think this proves my conjecture above.
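
As a byte-level illustration of the conjecture quoted above, here is a minimal Python sketch. It only looks at how U+00E4 is encoded and does not run Vim or model Vim's string comparison; the helper name `is_valid_utf8` is made up for this example.

```python
# Byte-level view only: this says nothing about how Vim compares strings.
a_umlaut = "\u00e4"  # U+00E4, LATIN SMALL LETTER A WITH DIAERESIS

utf8_bytes = a_umlaut.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = a_umlaut.encode("latin-1")  # one byte in an 8-bit encoding

print(utf8_bytes.hex(" "))    # c3 a4
print(latin1_bytes.hex(" "))  # e4


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"\xc3\xa4"))  # True
print(is_valid_utf8(b"\xe4"))      # False: a lone 0xE4 is an incomplete sequence
```

This matches the P.S.: "ä" and "\xc3\xa4" are the same two bytes under UTF-8, while a lone 0xE4 byte is not a complete UTF-8 sequence.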


Yes, I agree with your opinion.
When I test it elsewhere, I sometimes cannot reproduce the "bug";
maybe some other options affect the result of the expression.

In all 8-bit encodings, "\xe4" is (IIUC) whatever is represented in that encoding by the byte 0xe4, which is usually a valid character. In Unicode (always internally UTF-8 in Vim) 0xE4 is not a valid character, unless it is followed by exactly two bytes (no more, no less) in the range 0x80-0xBF, because UTF-8 codepoints are represented by one to six bytes each, and these bytes are as follows:

0x00-0x7F: standalone byte
0x80-0xBF: trailing byte (any byte but the first, in a multibyte sequence)
0xC0-0xDF: leading byte of a two-byte sequence
0xE0-0xEF: leading byte of a three-byte sequence
0xF0-0xF7: leading byte of a four-byte sequence
0xF8-0xFB: leading byte of a five-byte sequence
0xFC-0xFD: leading byte of a six-byte sequence
0xFE-0xFF: invalid
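
The table above can be sketched as a small Python check. This is only an illustration of the byte ranges as listed in this message (including the five- and six-byte forms); it is not Vim's implementation, and it does not apply the further checks a strict UTF-8 validator would make. The function names `sequence_length` and `follows_table` are made up for the example.

```python
def sequence_length(lead: int) -> int:
    """Expected sequence length for a leading byte, per the table above.

    Returns 0 for a trailing byte (0x80-0xBF) or an invalid byte (0xFE-0xFF).
    """
    if lead <= 0x7F:
        return 1  # standalone byte
    if lead <= 0xBF:
        return 0  # trailing byte, cannot start a sequence
    if lead <= 0xDF:
        return 2
    if lead <= 0xEF:
        return 3
    if lead <= 0xF7:
        return 4
    if lead <= 0xFB:
        return 5
    if lead <= 0xFD:
        return 6
    return 0  # 0xFE-0xFF: invalid


def follows_table(data: bytes) -> bool:
    """True if every sequence in data has the structure the table describes."""
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0:
            return False
        tail = data[i + 1:i + n]
        # The leading byte must be followed by exactly n - 1 trailing bytes.
        if len(tail) != n - 1 or any(not 0x80 <= b <= 0xBF for b in tail):
            return False
        i += n
    return True


print(sequence_length(0xE4))          # 3
print(follows_table(b"\xe4"))         # False: the two trailing bytes are missing
print(follows_table(b"\xe4\x00"))     # False: 0x00 is not a trailing byte
print(follows_table(b"\xc3\xa4"))     # True: the UTF-8 form of U+00E4
```

Under these ranges, 0xE4 announces a three-byte sequence, so a lone 0xE4 (or 0xE4 followed by a null) does not form a complete codepoint.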

I don't know how "\xe4" tests in non-Unicode multibyte encodings such as those used for Chinese, Japanese, Korean, etc.



Best regards,
Tony.
