2013/2/5 Richard Wordingham <[email protected]>:
> Philippe Verdy <[email protected]> wrote:
>> But if the W3C needs to update
>> something, it's to say that ALL forms that are canonically equivalent
>> should be treated equally. This means that it is to the recipient of
>> encoded documents to perform their own normalization.
>
> The problem comes with applications that ignore canonical
> normalisation. The stability of Unicode normalisation is guaranteed so
> that application can ignore the normalisation process!
I've not spoken here about any recommended normalization, but ONLY about canonical equivalence, which should be preserved by conforming processes. A process can be FULLY conforming by preserving canonical equivalence and treating ALL canonically equivalent strings identically, without having to normalize them to any recommended form or perform any reordering in its backing store; or it can choose to normalize to whatever form is convenient for that process (so it could be NFC or NFD, or something else).

For example, when a web browser offers a plain-text search tool to look for text in the displayed page, it typically just needs to perform collation at strength 1. Collation at level 1 does not require ANY normalization: it can be performed with a simple 1-to-1 mapping, where canonically decomposable characters are mapped to a single simpler form, a simple 1-to-1 case folding is applied, and all combining diacritics are then filtered out as ignorable (if that is the rule for level-1 collation in the searched language). Such a process will be FULLY conformant and will not depend on ANY normalization form being used in the input web page. In addition it is easy to implement and very efficient: the 1-to-1 mapping table needed for such strength-1 collation is very small, so you could implement it with a binary search or with a small hash table with very few collisions, or even none at all, depending on the hash function used.
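To make that concrete, here is a minimal sketch in Python. The names fold_level1 and page_contains are invented for illustration, and for brevity it applies canonical decomposition per character via the standard unicodedata module instead of the small precomputed 1-to-1 table described above; a real implementation would precompute that table, and a full UCA level-1 collation has more machinery than this.

import unicodedata

def fold_level1(text: str) -> str:
    """Reduce text to a rough level-1 key: strip diacritics, fold case."""
    out = []
    for ch in text:
        # Canonical decomposition splits precomposed letters
        # (e.g. 'é') into a base letter plus combining marks.
        for part in unicodedata.normalize('NFD', ch):
            # At level 1, combining diacritics are ignorable: skip them.
            if unicodedata.combining(part):
                continue
            out.append(part.casefold())  # simple 1-to-1 case folding
    return ''.join(out)

def page_contains(page_text: str, query: str) -> bool:
    """Search independent of the page's normalization form."""
    return fold_level1(query) in fold_level1(page_text)

With this, page_contains('re\u0301sume\u0301 attached', 'RÉSUMÉ') returns True whether the page was encoded in NFC, NFD, or any other canonically equivalent form; the search never has to know or care which form was used.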

