Kenneth Whistler wrote:

If you read through those definitions from Unicode 4.0 carefully,
you will see that UTF-8 representing a noncharacter is perfectly
valid, but UTF-8 representing an unpaired surrogate code point
is ill-formed (and therefore disallowed).



I see a hole here. What about UTF-8 that represents a pair of surrogate code points as two 3-octet sequences instead of a single 4-octet UTF-8 sequence? That should be ill-formed too, since it is also a non-shortest form, right? But we really need to watch the language used there so that we don't create a new problem. I DO NOT want people to think that a single 3-octet UTF-8 sequence for a high or low surrogate is ill-formed, but that a 3-octet sequence for a high surrogate followed by a 3-octet sequence for a low surrogate is legal.
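To make the case concrete, here is a rough sketch in Python of the byte sequences I mean. It assumes that Python 3's strict "utf-8" codec is a fair stand-in for a conformant decoder; the helper name is_well_formed_utf8 is my own, not anything from the standard. It uses U+10000 as the example, whose UTF-16 form is the surrogate pair D800 DC00.

```python
# U+10000 as well-formed UTF-8: one four-octet sequence.
well_formed = "\U00010000".encode("utf-8")

# The same character's UTF-16 surrogates, each written out as a
# three-octet sequence: U+D800 -> ED A0 80, U+DC00 -> ED B0 80.
high = bytes([0xED, 0xA0, 0x80])
low = bytes([0xED, 0xB0, 0x80])


def is_well_formed_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the strict UTF-8 codec accepts data."""
    try:
        data.decode("utf-8")  # default error handler is "strict"
        return True
    except UnicodeDecodeError:
        return False


print(is_well_formed_utf8(well_formed))  # four-octet form
print(is_well_formed_utf8(high))         # lone high surrogate
print(is_well_formed_utf8(low))          # lone low surrogate
print(is_well_formed_utf8(high + low))   # high followed by low, six octets
```

With that decoder only the four-octet form is accepted. The six-octet pair is rejected just like either surrogate on its own, which is the behavior I would like the wording to make unambiguous.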



