From: "Peter Kirk" <[EMAIL PROTECTED]> > I can see that there might be some problems in the changeover phase. But > these are basically the same problems as are present anyway, and at > least putting them into a changeover phase means that they go away > gradually instead of being standardised for ever, or however long > Unicode is planned to survive for.
I had already thought about it. But this may cause more trouble in the future for handling languages (like modern Hebrew) in which those combining classes are not a problem, and where the ordering of combining characters is a real bonus that would be lost if combining classes are merged, notably for full-text searches: the number of order combinations to search for could explode, because the effective order of marks in occurrences would become unpredictable.

Of course, if the combining class values were really bogus, a much simpler way would be to deprecate some existing characters, allow new applications to use the new replacement characters, and slowly adapt the existing documents to the replacements, whose combining classes would be more language-friendly. This last solution only seems better in the case where a unique combining class can be assigned to these characters. As someone said previously on this list, there are languages in which that assumption causes problems, meaning that, with the current model, the problematic combining characters would have to be encoded with combining class zero and linked to the preceding combining sequence either through a character property (for their combining behaviour in grapheme clusters and for rendering) or through a specific joiner control (ZWJ?) if that property is not universal for the character.

> It isn't a problem for XML etc as in such cases normalisation is
> recommended but not required, thankfully.

In practice, "recommended" means that many processes will perform this normalization as part of their internal job, so it would cause interoperability problems if the result of that normalization is later retrieved by an unaware client which submitted the data to a service it expected to preserve the normalization identity of the string. I also have doubts about the validity of this change with respect to the stability pact signed between Unicode and the W3C for XML.

> As for requirements that lists are normalised and sorted, I would
> consider that a process that makes assumptions, without checking, about
> data received from another process under separate control is a process
> badly implemented and asking for trouble.

The problem here is that it is not only a matter of separate processes, but also of utility libraries: if such a library is upgraded separately, the application using it may start experiencing problems. For example, consider the implied sort order in SQL databases for table indices: what happens if the SQL server is stopped just long enough to upgrade a standard library that implements normalization among many other services, because a security bug such as a buffer overrun was fixed in another API? When the SQL server restarts with the new library implementing the new normalization, apparently nothing happens, but the sort order is no longer guaranteed, and stored sorted indices start being "corrupted" in a way that invalidates binary searches (meaning that some unique keys could become duplicated, or not be found, producing unpredictable results, which is critical if they are relied on for, say, user authentication or file existence). Of course such an upgrade should be documented, but it would occur at a very intimate level of a utility library incidentally used by the server. Will all administrators and programmers be able to find and know all the intimate details of this change, when Unicode has stated to them that normalized forms should never change?
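To make the earlier interoperability point about "recommended" normalization concrete, here is a minimal Python sketch (my own illustration, not anything taken from the XML specifications): an intermediary that normalizes to NFC as part of its internal job silently reorders the marks of a Hebrew string, so the submitting client's exact comparison fails on retrieval.

```python
import unicodedata

# U+05D1 HEBREW LETTER BET, then U+05BC DAGESH (ccc 21), then U+05B8 QAMATS (ccc 18)
submitted = "\u05D1\u05BC\u05B8"          # marks in the order the user typed them

# The service normalizes "as part of its internal job" ...
stored = unicodedata.normalize("NFC", submitted)

# ... and the unaware client later compares what it gets back with what it sent.
print(submitted == stored)                # False
print([hex(ord(c)) for c in submitted])   # ['0x5d1', '0x5bc', '0x5b8']
print([hex(ord(c)) for c in stored])      # ['0x5d1', '0x5b8', '0x5bc']
```

The two strings are canonically equivalent, but any process that compares code point sequences (or bytes) without normalizing on its own side sees two different strings.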
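And to make the SQL index scenario above concrete, here is a rough sketch of what "corrupted" means for a binary search. Everything in it is hypothetical: `canonical_order` is a toy reimplementation of the canonical ordering step, and the two combining-class tables stand in for the behaviour of the library before and after an upgrade that merged classes; no real Unicode version has done this.

```python
import bisect

# Hypothetical combining-class tables: the "old" one distinguishes qamats (18)
# from dagesh (21); the "new" one stands in for an upgrade that merged them.
OLD_CCC = {"\u05B8": 18, "\u05BC": 21}
NEW_CCC = {"\u05B8": 10, "\u05BC": 10}

def canonical_order(s, ccc):
    """Toy canonical ordering: stable sort of each run of non-starter marks."""
    out, run = [], []
    for ch in s:
        if ccc.get(ch, 0) == 0:
            out += sorted(run, key=lambda c: ccc[c])
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out += sorted(run, key=lambda c: ccc[c])
    return "".join(out)

key = "\u05D1\u05BC\u05B8"    # bet + dagesh + qamats, as typed by the user

# Index built and sorted before the upgrade, using the old classes.
index = sorted(canonical_order(k, OLD_CCC) for k in ["\u05D0", key, "\u05D2"])

# After the upgrade, the lookup path normalizes with the merged classes ...
probe = canonical_order(key, NEW_CCC)
pos = bisect.bisect_left(index, probe)
print(pos < len(index) and index[pos] == probe)   # False: the stored key is not found
```

The row is still in the index; what is broken is the binary search's assumption that keys are normalized and sorted once, consistently, forever, which is exactly the kind of corruption an administrator would not expect from a routine security-fix upgrade of a system library.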
Will it be possible to scan and rebuild the corrupted data with a check-and-repair tool, if the programmers of the system assumed that the Unicode statement was definitive and allowed such assumptions to be used to build optimized systems?

When I read the stability pact, I conclude from it that any text valid and normalized in one version of Unicode will remain normalized in any version of Unicode (including previous ones), provided that the normalized strings contain only characters that were already defined in the previous version. This means there is an upward _and_ backward compatibility of encoded strings and normalizations on their common defined subset (excluding only characters that were assigned in later versions but not in previous versions). The only thing that is allowed to change is the absolute value of non-zero combining classes (and only in a limited way, as for now they are limited to an 8-bit value range, also specified in the current stability pact with the XML working group), but not their relative order: merging neighbouring classes changes their relative order, removes requirements on the sort order, and thus modifies the result of the normalization algorithm applied to the same initial source strings.
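To illustrate that last point with a sketch (my own toy code, not the reference implementation): the canonical ordering step of normalization is a stable sort of each run of non-starters keyed on their combining class, so only the relative order of the class values matters. Rescaling every class value leaves the result untouched; merging classes (here, hypothetically, treating all marks as one class) changes it.

```python
import unicodedata

def canonical_order(s, ccc_of):
    """Toy version of the canonical ordering step: stable-sort each maximal
    run of characters with a nonzero combining class."""
    out, run = [], []
    for ch in s:
        if ccc_of(ch) == 0:
            out += sorted(run, key=ccc_of)
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out += sorted(run, key=ccc_of)
    return "".join(out)

real_ccc = unicodedata.combining                 # real classes: qamats 18, dagesh 21
rescaled = lambda ch: 2 * real_ccc(ch)           # new absolute values, same relative order
merged   = lambda ch: 1 if real_ccc(ch) else 0   # hypothetical: all marks in one class

s = "\u05D1\u05BC\u05B8"                         # bet + dagesh + qamats, as typed

print(canonical_order(s, real_ccc) == canonical_order(s, rescaled))  # True
print(canonical_order(s, real_ccc) == canonical_order(s, merged))    # False
```

This is the sense in which merging neighbouring classes removes an ordering requirement: the same source string normalizes to different code point sequences before and after the change, while merely renumbering the classes does not.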

