2013/8/6 Richard Wordingham <[email protected]> > On Tue, 6 Aug 2013 19:27:56 +0200 > Philippe Verdy <[email protected]> wrote: > > > But there's an admitted exception : sorting with UCA may change the > > relative order between the source strings, simply because sort > > stability is not always wanted (it has a cost), and binary sorting > > the results using the code point values as an additional collation > > level is not always wanted, and normalization remains optional in > > UCA. > > No
Why no? You don't give any real objection here, because you modify the rules. LEt's suppose that you use a stable sort algorithm and that the compare function does not normalize the input, but still performs UCA comparison up to some strength level (e.g. only at primary level), and then just performs a simple binary sort, the list will not be be necessarily listing items in the same ouput order : if you convert this soeted list as a single plain-text with one lie per item (assuming that items do not contain any newline) and separate these items by newlines. And then you normalize this plain text file, then the resulting sorted then normalized list shoudl compare exactly equal to the sorted list generated from indivudually normalized input data. i.e. we should assert: implode_list(sort(items[], UCA_compare_function)) == implode_list(sort(normalize(items[]), UCA_compare_function) But the sort() function here is not required to replace the input by its normalized version so that only normalized strigns will be seen by the compare_function (an UCA compare function does not modify its input, it will normalize it *internally* before replying that item1 is lower than, equal to, or higher than item2). The sort function just swaps the items without modifying them, and outputs them exactly in the same form that was present in the input. So the first expression (before the equals sign) will not be necessarily canonically equivalent to the second expression, even if the two inputs: items[] and normalize(items[]) are canonically equivalent (item per item in each input array of texts, both having the same total number of items). The canonical equivalent od the two ouputs require that inputs are all normalized FIRST before sorting. But UCA sorting does not have to perform this and generally it won't do that, even if it performs an internal normalization, unless this internal normalization replaces the data of the input before sorting it; but then it won't be able to output the original items : let's suppose that the input contained ONLY unique elements without duplicates and that they were unique identifiers, the replacement of the input list with normalized identifiers will violate the uniqueness constraint. If you count the number of unique elements in the output it may be *lower* than the number of unique element in the non-normalized input source. In summary, UCA-sorting results in canonically equivalent outputs (same order) *only* if the input is warrantied to be already normalized (using the same normalization form, which could be NF(K)C, or NF(K)D, or any other non-standard normalization that generates exactly the same binary ouput from all possible inputs that are canonically equivalent), or if there's no pair of items that are considered "equal" by the UCA_compare_function). Even when using an UCA_compare_function with its maximum strength there are still lots of binary distinct items which are compared equal at all strengths by UCA collations, even when the collation is tailored using UCA rulesets. And this exactly concerns ALL pairs of items that are canonically equvalent but are using different normalizations forms (their difference is only binary at the codepoint level). Note however that the UCA_compare_function is conforming : in this case. it will return "equal", but it has nothing to do with the sort function which actually do not change items, but swaps them conditionally according to the result (lower, equal, higher) of the compare function. And what we call "UCA sorting" is a standard binary sort using an conforming UCA compare function. UCA sorting by itself, with just this definition is then **not necessarily** a conforming process, even if the UCA compare function used is conforming...

