Peter,

> On 06/02/2002 05:40:05 AM Samphan Raruenrom wrote:
>
> >> My opinion is that they should have been simplified, but that setting the
> >> bulk of them to 0 was a mistake and creates some significant problems
> >> (which go a step beyond the questions you raise here).
> >
> >Can you elaborate on this?
>
> Given the characters
>
> : 0E35;THAI CHARACTER SARA II;Mn;0
> : 0E39;THAI CHARACTER SARA UU;Mn;103
>
> consider the sequences
>
> < 0e35, 0e39 > vs. < 0e39, 0e35 >
>
> I'm guessing your first reaction will be to say that these cannot co-occur.
> That is true for the Thai language, but may not be true for other languages
> written with Thai script.

The problem, of course, is that not all eventualities could be foreseen
at the time the decisions had to be made -- when normalization and
Unicode 3.0 were looming. It might have been possible to marginally
improve on the assignments that eventually were made -- but both the
original assignment to fixed position classes, and the later
simplification of the fixed position classes, had to be made *prior* to
the accumulation of experience based on normalization being locked down
in the standard. So hindsight is 20/20. But at the time, the editors and
participants in the UTC couldn't get experts to pay enough attention to
the potential implications for Thai and other Southeast Asian scripts,
so now we are stuck with a few anomalies that people will just have to
program around, I am afraid.

> Now, the problem with the sequences above is that they are visually
> indistinct, meaning that they could not possibly be used by users for a
> semantically-relevant distinction. From the user's perspective, they are
> identical. Moreover, it would fit a user's expectations to have string
> comparisons to equate them (e.g. a search for < 0e35, 0e39 > should find a
> match if the data contains < 0e39, 0e35 >). They are both
> canonically-ordered sequences, however, since U+0E35 has a combining class
> of 0. The result is that string comparisons that rely on normalisation into
> any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC)
> will fail to consider these as equal.

I think you are missing a point here. It is true that if you just take
the two strings, normalize them, and then compare binary, they will
compare unequal. But for most users' expectations of equivalent string
comparisons, simply comparing binary for normalized strings is
insufficient, anyway. There may be embedded (invisible) format control
characters (ZWJ and its ilk) which should be ignored on comparison --
but a simple binary compare won't do that.
The presence of a ZWSP might or might not be considered as indicative of
a string difference by a user, but would definitively cause the strings
to compare unequal without a corresponding visual difference. On the
other hand, the presence of some types of visual punctuation might be
considered insignificant by a user, and to be ignored, even though
causing a visual difference.

The ordinary way to deal with this is to enhance the comparisons, often
in language-specific ways, to match user expectations of what should and
should not compare equal under various circumstances. And a commonly
used technology for that is one form or another of collation tailoring
for culturally expected string comparison. If such technology is being
used to provide better results, there is no particular reason why the
language-specific tailorings for it cannot also take into account the
few anomalous cases resulting from canonical ordering of dependent
vowels in Brahmi-derived scripts in Southeast Asia, so that, under those
circumstances, < 0e35, 0e39 > vs. < 0e39, 0e35 > *would* compare equal.

> >IMO, it'll be the best if we could change that. But apart from that, it
> >still be useful to note what is right or wrong than not to say about it.
> >After all, this happends to other (Indic) scripts too, right?
>
> There are some similar problems in at least Lao, Khmer and Myanmar. I don't
> recall for certain, but there may also be similar problems in Hebrew.

And each of the cases is fairly limited and amenable to the same kinds
of solutions, script by script, and language by language. In any case, I
think one is going to have to have some rather specific string
comparison extensions to get Khmer and Myanmar string orderings and
matchings to behave appropriately.
And people who need to make those extensions aren't going to be
particularly misled by the few anomalous instances of above or below
vowel signs having zero combining classes, which make it technically
possible to have non-canonically equivalent spellings of visually
similar combinations.

--Ken
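
For illustration only, the contrast between a binary compare of
normalized strings and an enhanced comparison can be sketched in Python
with the standard unicodedata module. This is a rough sketch under
stated assumptions, not a recommended implementation: match_key and
REORDERABLE are hypothetical names, the choice of U+0E34..U+0E39 as the
set of order-insensitive vowel signs is an assumption made just for this
example, and real systems would more likely express the same idea as a
collation tailoring.

```python
import unicodedata

SARA_II = "\u0e35"  # THAI CHARACTER SARA II, combining class 0
SARA_UU = "\u0e39"  # THAI CHARACTER SARA UU, combining class 103

# Assumption for illustration: treat the Thai above/below vowel signs
# U+0E34..U+0E39 as order-insensitive when they occur next to each other.
REORDERABLE = {chr(cp) for cp in range(0x0E34, 0x0E3A)}


def match_key(text):
    """Hypothetical comparison key: normalize to NFC, drop format
    controls (category Cf), then sort each run of adjacent reorderable
    vowel signs by code point."""
    out, run = [], []
    for ch in unicodedata.normalize("NFC", text):
        if unicodedata.category(ch) == "Cf":
            continue  # e.g. ZWJ: invisible, ignored for matching
        if ch in REORDERABLE:
            run.append(ch)
            continue
        out.extend(sorted(run))  # flush the pending run of vowel signs
        run = []
        out.append(ch)
    out.extend(sorted(run))
    return "".join(out)


a = "\u0e01" + SARA_II + SARA_UU  # KO KAI, SARA II, SARA UU
b = "\u0e01" + SARA_UU + SARA_II  # KO KAI, SARA UU, SARA II

# Combining classes of the two vowel signs.
classes = [unicodedata.combining(c) for c in (SARA_II, SARA_UU)]

# Binary equality of the two strings after each normalization form.
equal_after = {
    form: unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

print(classes)                       # [0, 103]
print(equal_after)                   # every form maps to False
print(match_key(a) == match_key(b))  # True
```

Whether a given format control (ZWSP, for instance) should be dropped,
and which marks belong in the reorderable set, are exactly the
language-specific decisions discussed above; the sketch hard-codes one
possible answer.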

