"Kenneth Whistler" <k...@sybase.com> wrote: > Huh? That is just preprocessing to delete portions of strings > before calculating keys. If you want to do so, be my guest, > but building in arbitrary rules of content suppression into > the UCA algorithm itself is a non-starter.
I have definitely not asked for such preprocessing to be added. My examples simply showed practical cases where the least significant parts of the strings are almost always at the end of the text, and there is no reason that a backwards pass should consider them more important than the beginning of the text. I have thought about this at length: 2-3 years ago I wanted to fix the definitely broken sorts in French Wiktionary categories (notably the categories containing full lists of words per language, where items appeared to be missing just because they were sorted many pages away).

I also noted that *storing* additional collation strings introduces a significant cost. This cost cannot be ignored, so in practice only the leading part of the sort keys / collation strings will effectively be stored. Wikipedia, for example, truncates ALL the sort keys used in its categories to a limit of about 64 bytes, even though page names can be up to 255 UTF-8 bytes, so not even all the primary differences are stored (it replaces the truncated part with an arbitrary unique numeric id, used only to ensure sort stability and avoid bugs when navigating heavily populated categories, even if this means that small groups of entries will not be in the optimal order). The linguistic sort is then effectively partial, but this is a practical situation that does occur when sorting long lists.

Then, during the effective sort (using QuickSort or similar), you end up comparing many pairs of strings multiple times in case of equality, without relying on stored collation strings. And yes, of course, you will process the string pairs level by level; but the case where two long strings compare equal at the primary level is not exceptional, which means you must reparse both strings from their beginning for the next level, even though you could have detected very early that there was a secondary difference on the first word, without even fully processing the first level of the two long strings.

I have run experiments, and the performance counters (counting the number of pairs compared, the number of times they need to be reread from the backing store, and accumulating their relative distance) clearly show that comparing word by word (all levels for each word resolved at once before moving to the next word) gives a significant and measurable advantage due to data locality, but also because you effectively scan shorter sequences to detect a secondary difference in the first words (such as accents) or tertiary differences (such as letter case). It also gives much more consistent results once you are using truncated collation strings (and only those truncated strings, without additional comparisons of pairs in extra sort passes).

The case where strings compare equal on the first level is in fact EXTREMELY frequent in SQL queries performing master-detail joins, ordered by a master key then by detail keys, or when computing aggregates (such as SELECT SUM()... GROUP BY master key, or with a clause like HAVING column1=aggregate(column2)). This occurs in almost all statistical/financial reports.
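To show what I mean by "all levels for each word resolved at once", here is a minimal sketch in Java using java.text.Collator; the class name WordByWordComparator and the whitespace tokenization are illustrative assumptions of mine, not part of any existing API:

import java.text.CollationKey;
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

/**
 * Illustrative comparator: instead of comparing the two whole strings
 * level by level (a primary pass over everything, then a secondary pass,
 * and so on), it compares them word by word, resolving ALL levels for
 * each word pair before moving to the next word. A secondary difference
 * (accent) or tertiary difference (case) on the first word is therefore
 * detected without ever scanning the rest of either string.
 */
public final class WordByWordComparator implements Comparator<String> {
    private final Collator collator;

    public WordByWordComparator(Locale locale) {
        this.collator = Collator.getInstance(locale);
        this.collator.setStrength(Collator.TERTIARY); // accents + case
    }

    @Override
    public int compare(String a, String b) {
        String[] wa = a.split("\\s+");
        String[] wb = b.split("\\s+");
        int n = Math.min(wa.length, wb.length);
        for (int i = 0; i < n; i++) {
            // Full multi-level comparison of this word pair only.
            CollationKey ka = collator.getCollationKey(wa[i]);
            CollationKey kb = collator.getCollationKey(wb[i]);
            int c = ka.compareTo(kb);
            if (c != 0) {
                return c; // difference found: stop early
            }
        }
        // All shared words equal at every level: shorter text sorts first.
        return Integer.compare(wa.length, wb.length);
    }
}

This ordering is of course not identical to a strict whole-string UCA sort (a primary difference in a later word no longer outranks a secondary difference in an earlier word); the sketch is only meant to illustrate the access-pattern argument above.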
And given that you work for Sybase, you should already know that sorting/grouping is the most costly operation performed by an SQL engine, where every optimisation is expected: multi-field collation/sorting is also a very basic operation found in nearly all queries (including inserts, and unsorted/ungrouped selects using unique indexes or primary/foreign key indexes, because indexes always have to maintain their collation order in an accurate but fast way).
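To make the storage-cost argument concrete, here is a minimal sketch, again in Java, of the kind of truncated stored key I described above for the Wikipedia categories: keep only the leading bytes of the full collation key and append a unique numeric id so that stored keys still compare deterministically with a plain byte-wise comparison. The 64-byte budget, the class name TruncatedSortKey and the choice of a French collator are illustrative assumptions only:

import java.nio.ByteBuffer;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

/**
 * Illustrative builder for truncated stored sort keys: only the leading
 * MAX_KEY_BYTES of the full collation key are kept (so not even all the
 * primary weights may survive), and a unique numeric id is appended so
 * that ties introduced by the truncation still compare deterministically.
 */
public final class TruncatedSortKey {
    private static final int MAX_KEY_BYTES = 64; // assumed storage budget
    private static final Collator COLLATOR = Collator.getInstance(Locale.FRENCH);

    /** Builds the stored key: truncated collation bytes + 8-byte row id. */
    public static byte[] build(String text, long rowId) {
        byte[] full = COLLATOR.getCollationKey(text).toByteArray();
        int kept = Math.min(full.length, MAX_KEY_BYTES);
        ByteBuffer buf = ByteBuffer.allocate(kept + Long.BYTES);
        buf.put(full, 0, kept);
        buf.putLong(rowId); // stability tiebreaker
        return buf.array();
    }

    /** Stored keys compare with a plain unsigned byte-wise comparison. */
    public static int compareKeys(byte[] k1, byte[] k2) {
        return Arrays.compareUnsigned(k1, k2);
    }
}

Because the appended id participates in the byte-wise comparison, small groups of entries that tie on the truncated prefix end up ordered by id rather than linguistically, which is exactly the compromise described above.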