On Tue, 5 Feb 2013 12:16:47 +0100 Philippe Verdy wrote:

> A process can be FULLY conforming by preserving the canonical
> equivalence and treating ALL strings that are canonically equivalent,
> without having to normalize them in any recommended form,...

Try doing UCA collation with <U+0302 COMBINING CIRCUMFLEX ACCENT, U+0067
LATIN SMALL LETTER G> being a collation element (with arbitrary
collation elements) without doing normalisation.  Consider how you would
handle <U+011D LATIN SMALL LETTER G WITH CIRCUMFLEX, U+011D, U+011D>!

> For example, typically when a web browser has a plain-text search tool
> to look for some text present in the displayed page, it just needs to
> perform collation with a level 1 strength.

Perhaps, but I note that Firefox does at least level 2 matching for
Thai, and will therefore be vulnerable to vowels below that follow tone
marks, which are canonically equivalent to vowels below that precede
tone marks.  The former may be regarded as invalid by processes that are
not Unicode compliant (or are not processing Unicode text).

> Collation at level 1 does
> not require ANY normalization, and can be performed by a simple 1-to-1
> mapping, where canonically decomposable characters are mapped to a
> single simpler form and a simple 1-to-1 case folding, and where all
> combining diacritics are then filtered out as ignorable (if this is
> the rule for level 1 collation in the searched language).

Under the UCA defaults, Tibetan script vowels need some form of
normalisation for level 1 collation: length and quality indications can
be interchanged while preserving canonical equivalence, and both
contribute level 1 differences.  These differences should be real for
Sanskrit.

Richard.
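
The canonical equivalences mentioned above can be checked with a few lines of Python. This is only a sketch under stated assumptions: it uses the standard `unicodedata` module, the helper names (`nfd`, `canonically_equivalent`) and the sample base letters are made up for illustration, and it demonstrates only the normalisation step, not UCA collation or any browser's search code.

```python
import unicodedata


def nfd(s: str) -> str:
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms match."""
    return nfd(a) == nfd(b)


# Thai: tone mark followed by vowel below, versus vowel below first.
# KO KAI is just an illustrative base consonant.
thai_tone_first = "\u0E01\u0E48\u0E38"   # KO KAI, MAI EK, SARA U
thai_vowel_first = "\u0E01\u0E38\u0E48"  # KO KAI, SARA U, MAI EK

# Tibetan: quality (VOWEL SIGN I) and length (VOWEL SIGN AA) interchanged.
# KA is just an illustrative base letter.
tib_quality_first = "\u0F40\u0F72\u0F71"  # KA, VOWEL SIGN I, VOWEL SIGN AA
tib_length_first = "\u0F40\u0F71\u0F72"   # KA, VOWEL SIGN AA, VOWEL SIGN I

# Three precomposed LATIN SMALL LETTER G WITH CIRCUMFLEX in a row.
g_circumflex_x3 = "\u011D" * 3
# The sequence <U+0302, U+0067> from the collation example.
accent_then_g = "\u0302g"

if __name__ == "__main__":
    print(canonically_equivalent(thai_tone_first, thai_vowel_first))
    print(canonically_equivalent(tib_quality_first, tib_length_first))
    # The sequence is invisible in the precomposed string but shows up
    # once the string is decomposed.
    print(g_circumflex_x3.count(accent_then_g))
    print(nfd(g_circumflex_x3).count(accent_then_g))
```

A code-point-by-code-point comparison treats the members of each Thai and Tibetan pair as different strings, and a matcher working on the precomposed g-circumflex string never sees the <U+0302, U+0067> sequence at all, which is the difficulty with skipping normalisation.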

