On 06/03/2002 05:56:38 PM Kenneth Whistler wrote:
>Peter,

>The problem, of course, is that not all eventualities could be
>foreseen at the time the decisions had to be made -- when normalization
>and Unicode 3.0 were looming...

>So hindsight is 20/20. But at the time, the editors and participants
>in the UTC couldn't get experts to pay enough attention to the
>potential implications for Thai and other Southeast Asian scripts,
>so now we are stuck with a few anomalies that people will just have
>to program around, I am afraid.

I understand. I'm not arguing at this point that the combining classes
should be changed (though I would were it a possibility) -- if you look
at my earlier post in this thread, you'll see that I explained to Khun
Samphan that this is not a possibility. At this point, I'm merely
explaining *why* the combining classes as they stand present issues for
implementers.

>> The result is that string comparisons that rely on normalisation into
>> any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC)
>> will fail to consider these as equal.
>
>I think you are missing a point here. It is true that if you just
>take the two strings, normalize them, and then compare binary, they
>will compare unequal. But for most user's expectations of equivalent
>string comparisons, simply comparing binary for normalized strings
>is insufficient, anyway. There may be embedded (invisible) format
>control characters (ZWJ and its ilk) which should be ignored on
>comparison -- but a simple binary compare won't do that.
True, but I think there's a categorical difference between, on the one
hand, the need to remove ZWJ and its ilk and the other kinds of issues
you raise, and, on the other, the issues I've raised in relation to
combining classes for SE Asian scripts and Hebrew. The former are things
that implementers have been aware of for a while. The latter is
something they are likely not aware of: it is exactly the kind of thing
people would have expected normalisation to have dealt with, and so they
are not likely to notice it. Implementers need to have the issues
pointed out to them, which is exactly my intent -- for at least one
potential implementer -- with the comments I have made in this thread.

>The ordinary way to deal with this is to enhance the comparisons,
>often in language-specific ways, to match user expectations of what
>should and should not compare equal under various circumstances.

Is that true everywhere? What about systems for file naming, security,
domain naming, etc., for which language-specific processing is rarely if
ever done? Even in word processors, I doubt that language-tailored
collation-based comparisons are used. But clearly, if the combining
classes can't be changed, then some or all of these will have to start
dealing at least with the issues that these combining class values
raise. At least, given all the hoopla in recent months about spoofing
and security, I'd think people with concerns in this area would want to
deal with the issues presented by these combining class values.

And if my memory is serving me in relation to Hebrew, we're also going
to have to look at that again and figure out a way to encode needed
distinctions that the fixed position classes cause to be neutralised in
normalisation.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
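
For implementers who want to see the normalise-then-compare behaviour
concretely, here is a minimal Python sketch using only the standard
`unicodedata` module. The particular Thai sequences are my own
illustrative choice and are not necessarily the ones discussed earlier
in this thread; the helper names (`normalized_equal`,
`strip_format_controls`, `loose_equal`) are hypothetical. Whether the
two SARA I orderings actually display identically depends on the font
and rendering engine, so treat that pair only as an example of mark
orders that normalisation leaves distinct.

```python
import unicodedata

KO_KAI = "\u0e01"  # THAI CHARACTER KO KAI (base consonant)
SARA_I = "\u0e34"  # THAI CHARACTER SARA I (above vowel)
SARA_U = "\u0e38"  # THAI CHARACTER SARA U (below vowel)
MAI_EK = "\u0e48"  # THAI CHARACTER MAI EK (tone mark)
ZWJ = "\u200d"     # ZERO WIDTH JOINER (general category Cf)


def normalized_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Normalise both strings, then do a plain binary comparison."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def strip_format_controls(s: str) -> str:
    """Drop format control characters (general category Cf), e.g. ZWJ."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")


def loose_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Assumed policy: remove Cf characters first, then normalise and compare."""
    return normalized_equal(strip_format_controls(a),
                            strip_format_controls(b), form)


# Canonical combining classes as reported by the Unicode Character Database.
for ch in (SARA_I, SARA_U, MAI_EK):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"ccc={unicodedata.combining(ch)}")

# Both marks have non-zero, distinct classes, so canonical ordering
# puts them in one order regardless of how they were typed.
u_then_tone = KO_KAI + SARA_U + MAI_EK
tone_then_u = KO_KAI + MAI_EK + SARA_U
print(normalized_equal(u_then_tone, tone_then_u))

# SARA I has combining class 0, so it blocks reordering: the two
# orders survive normalisation as different strings.
i_then_tone = KO_KAI + SARA_I + MAI_EK
tone_then_i = KO_KAI + MAI_EK + SARA_I
print(normalized_equal(i_then_tone, tone_then_i))

# An embedded ZWJ defeats the binary compare but not the looser one.
print(normalized_equal(u_then_tone, KO_KAI + ZWJ + SARA_U + MAI_EK))
print(loose_equal(u_then_tone, KO_KAI + ZWJ + SARA_U + MAI_EK))
```

Stripping Cf characters before normalising is only one possible policy;
a system for file names or domain names might well want to reject such
characters outright rather than ignore them.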

