"Marcin 'Qrczak' Kowalczyk" writes:
> "D. Starner" writes:
>
> > This implies that every programmer needs an indepth knowledge of
> > Unicode to handle simple strings.
>
> There is no way to avoid that.

Then there's no way that we're ever going to get reliable Unicode
support.

> If the runtime automatically performed NFC on input, then a part of a
> program which is supposed to pass a string unmodified would sometimes
> modify it. Similarly with NFD.

No. By the same logic you used above, I can expect programmers to
understand their tools, and if they need to pass strings unmodified,
they shouldn't load them using methods that normalize the string.

> You can't expect each and every program which compares strings to
> perform normalization (e.g. Linux kernel with filenames).

As has been pointed out here, POSIX filenames are not character
strings; they are byte strings. They quite likely aren't even valid
UTF-8 strings.

> > So S should _sometimes_ match an accented S? Again, I feel extended misery
> > of explaining to people why things aren't working right coming on.
>
> Well, otherwise things get ambiguous, similarly to these XML issues.

Things get ambiguous when one day ŝ is matched by s and the next day it
isn't. That's absolutely wrong behavior; the program must serve the
user, not the programmer. 's' cannot, should not, must not match 'ŝ';
and if it must, then it absolutely always must match 'ŝ', and some way
to make a regex that matches s but not ŝ must be designed. It doesn't
matter what problems exist in the world of programming; that is the
entirely reasonable expectation of the end user.

> Does "\n" followed by a combining code point start a new line?

The Standard says no; that's a defective combining sequence.

> Does
> a double quote followed by a combining code point start a string
> literal?

That would depend on your language. I'd prefer no, but it's obvious
many have made other choices.

> Does a slash followed by a combining code point separate
> subdirectory names?

In Unix, yes; that's because filenames in Unix are byte streams with
the byte 0x2F acting as a path separator.

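As a concrete illustration of the s versus ŝ point above, here is a
minimal Python sketch. It assumes a naive matcher that works on code
points, and the helper name matches_plain_s is hypothetical; it is not
any particular regex engine's documented Unicode behavior. The same
visible text matches or fails to match depending only on which
normalization form it happens to be stored in.

```python
import re
import unicodedata

# The same user-visible character in two canonically equivalent forms.
composed = "\u015d"                                  # ŝ as one code point (U+015D)
decomposed = unicodedata.normalize("NFD", composed)  # "s" + U+0302 COMBINING CIRCUMFLEX ACCENT

def matches_plain_s(text):
    # Hypothetical helper: does a code-point-level regex find "s" in text?
    return re.search("s", text) is not None

print(matches_plain_s(composed))    # False
print(matches_plain_s(decomposed))  # True

# Normalizing back to NFC recovers the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

A matcher that wants consistent answers would have to normalize both
the pattern and the text to one form first, or else match on whole
combining sequences rather than on code points.
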
> It's hard enough to convince them that a
> character is not the same as a byte.

That contradicts your statement above, that every programmer needs an
in-depth knowledge of Unicode.

> In case I want to circumvent security or deliberately cause a piece of
> software to misbehave. Robustness require unambiguous and simple rules.

The rules you are offering are only simple and unambiguous to the
programmer; they appear completely random to the end user. To have ≮
sometimes start a tag means that a user can't look at the XML and tell
whether something opens a tag or is just text. You might be able to
expect all programmers to know these rules, but you can't expect all
end users to.
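
The ≮ case can be sketched the same way in Python. U+226E NOT
LESS-THAN is canonically equivalent to "<" followed by U+0338
COMBINING LONG SOLIDUS OVERLAY, so a scanner that looks for the code
point U+003C sees a tag opener in one form and plain text in the other.
The function contains_tag_open is a hypothetical stand-in for such a
scanner, not the behavior of any real XML parser.

```python
import unicodedata

not_less_than = "\u226e"  # ≮ NOT LESS-THAN as a single code point
nfd = unicodedata.normalize("NFD", not_less_than)

def contains_tag_open(text):
    # Hypothetical naive scanner: treats any U+003C as the start of a tag.
    return "<" in text

print([f"U+{ord(c):04X}" for c in nfd])  # ['U+003C', 'U+0338']
print(contains_tag_open(not_less_than))  # False
print(contains_tag_open(nfd))            # True
```

Both strings look identical on screen, which is exactly why a reader
cannot tell by eye whether the text opens a tag.
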

