Lars asked:

> BTW, what are the properties of U+FFFD? In English please, do not point me
> to the standard.

?!

It has the general category of "Symbol, Other" [gc=So].

> Like, can it be a part of an identifier,

It does not have the ID_Start or the ID_Continue property, which
you could determine for yourself by referring to the standard:

http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

That doesn't prevent a formal syntax definition for a language
from including it within the BNF for defining an identifier,
but in general, no, it would not appear in identifiers, just
as most other symbols would not.

> is it an 'alphanumeric'?

No.

> Let me speculate. It should be a letter

No.

> (it probably more
> often originally was than wasn't).

You are referring here to speculation regarding what uninterpretable
sequence in some other character encoding was *converted* to
U+FFFD on conversion to Unicode. But that is irrelevant to the
properties of U+FFFD itself.

That is tantamount, for example, to claiming that the C0 control
code 0x1A SUBSTITUTE should be defined as a "letter", simply
because it is often used to signal a conversion substitution
in 8-bit tables.

> I would accept it for identifiers (variables, filenames).

If you are defining your own language, that would be your
prerogative, of course. But if you are using standard languages
like C, C++, Java, C#, SQL, etc., it is unlikely that you
would be correct in that approach.

> It has no case properties. And it is obviously not a
> space.

True.

There is much, much more to know about Unicode character
properties than just what can be inferred from an attempt to
apply the POSIX model to UTF-8. A good place to start would
be Unicode Technical Report #23, The Unicode Character Property
Model:

http://www.unicode.org/reports/tr23/

And after that, yes, I would point you to the standard.

--Ken
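
For readers who would rather check these answers than take them on trust, here is a minimal sketch in Python. It is not part of the original exchange. The helper name describe() is made up for illustration, and the results come from whatever version of the Unicode Character Database ships with the Python in use. Note also that Python's str.isidentifier() tests XID_Start/XID_Continue, which are closely related to ID_Start/ID_Continue but not identical, and that the "cased" entry is only a rough proxy based on whether upper() or lower() changes the character.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def describe(ch: str) -> dict:
    """Collect a few properties of one character, as reported by Python's bundled Unicode data."""
    return {
        # Control characters have no name, so supply a default instead of raising.
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        # Python identifiers follow XID_Start / XID_Continue (an assumption
        # that this is close enough to ID_Start / ID_Continue for U+FFFD).
        "identifier_start": ch.isidentifier(),
        "identifier_continue": ("a" + ch).isidentifier(),
        "alphanumeric": ch.isalnum(),
        "letter": ch.isalpha(),
        "space": ch.isspace(),
        # Rough proxy for "has case mappings": does upper() or lower() change it?
        "cased": ch.upper() != ch or ch.lower() != ch,
    }


if __name__ == "__main__":
    # Compare U+FFFD with the C0 control code 0x1A SUBSTITUTE mentioned above.
    for label, ch in (("U+FFFD", REPLACEMENT), ("U+001A", "\x1a")):
        print(label, describe(ch))

    # U+FFFD is what a decoder substitutes for input it cannot interpret;
    # that says nothing about the properties of U+FFFD itself.
    decoded = b"caf\xff".decode("utf-8", errors="replace")
    print(decoded == "caf" + REPLACEMENT)
```

Run against a current Python, this should report U+FFFD as category So and 0x1A as Cc, with every identifier, letter, alphanumeric, space and case check coming back false for both, which matches the answers given above.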

