"Kenneth Whistler" <k...@sybase.com> wrote: > Huh? That is just preprocessing to delete portions of strings > before calculating keys. If you want to do so, be my guest, > but building in arbitrary rules of content suppression into > the UCA algorithm itself is a non-starter.
I have definitely not asked for such preprocessing to be added. My examples simply showed practical cases where the least significant parts of the strings are almost always at the end of the text, and there is no reason that a backwards pass should consider them more important than the beginning of the text. I have thought about this at length: 2-3 years ago I wanted to fix the definitely broken sorts in French Wiktionary categories (notably the categories containing full lists of words per language, where items appeared to be missing just because they were sorted many pages away).

I also noted that *storing* additional collation strings introduces a significant cost. This cost cannot be ignored, so in practice only the leading part of the sort keys / collation strings will effectively be stored. Wikipedia, for example, truncates ALL the sort keys used in its categories to a limit of about 64 bytes, even though page names can be up to 255 UTF-8 bytes, so not even all the primary differences are stored (it replaces the truncated part with an arbitrary unique numeric id, used only to ensure sort stability and avoid bugs when navigating heavily populated categories, even if this means that small groups of entries will not be in the optimal order). The linguistic sort is then effectively partial, but this is a practical situation that does occur when sorting long lists.

Then, during the effective sort (using QuickSort or similar), you end up comparing many pairs of strings multiple times in case of equality, without relying on stored collation strings. And yes, of course, you will process the string pairs level by level; but the case where two long strings compare equal at the primary level is not exceptional, which means you must reparse both strings from their beginning for the next level, even though you could have detected very early that there was a secondary difference on the first word, without even fully processing the first level of the two long strings.

I have run experiments, and the performance counters (counting the number of pairs compared, the number of times they need to be reread from the backing store, and accumulating their relative distance) clearly show that comparing word by word (all levels for each word resolved at once before moving to the next word) gives a significant and measurable advantage due to data locality, but also because you effectively scan shorter sequences to detect a secondary difference in the first words (such as accents) or tertiary differences (such as letter case). It also gives much more consistent results once you are using truncated collation strings (and only those truncated strings, without additional comparisons of pairs in extra sort passes).

The case where strings compare equal on the first level is in fact EXTREMELY frequent in SQL queries performing master-detail joins, ordered by a master key then by detail keys, or when computing aggregates (such as SELECT SUM()... GROUP BY master key, or with a clause like HAVING column1=aggregate(column2)). This occurs in almost all statistical/financial reports.
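To show what I mean by "all levels for each word resolved at once", here is a minimal sketch in Java using java.text.Collator; the class name WordByWordComparator and the whitespace tokenization are illustrative assumptions of mine, not part of any existing API:

import java.text.CollationKey;
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

/**
 * Illustrative comparator: instead of comparing the two whole strings
 * level by level (a primary pass over everything, then a secondary pass,
 * and so on), it compares them word by word, resolving ALL levels for
 * each word pair before moving to the next word. A secondary difference
 * (accent) or tertiary difference (case) on the first word is therefore
 * detected without ever scanning the rest of either string.
 */
public final class WordByWordComparator implements Comparator<String> {
    private final Collator collator;

    public WordByWordComparator(Locale locale) {
        this.collator = Collator.getInstance(locale);
        this.collator.setStrength(Collator.TERTIARY); // accents + case
    }

    @Override
    public int compare(String a, String b) {
        String[] wa = a.split("\\s+");
        String[] wb = b.split("\\s+");
        int n = Math.min(wa.length, wb.length);
        for (int i = 0; i < n; i++) {
            // Full multi-level comparison of this word pair only.
            CollationKey ka = collator.getCollationKey(wa[i]);
            CollationKey kb = collator.getCollationKey(wb[i]);
            int c = ka.compareTo(kb);
            if (c != 0) {
                return c; // difference found: stop early
            }
        }
        // All shared words equal at every level: shorter text sorts first.
        return Integer.compare(wa.length, wb.length);
    }
}

This ordering is of course not identical to a strict whole-string UCA sort (a primary difference in a later word no longer outranks a secondary difference in an earlier word); the sketch is only meant to illustrate the access-pattern argument above.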
And given that you work for Sybase, you should already know that sorting/grouping is the most costly operation performed by an SQL engine, where every optimisation is expected: multi-field collation/sorting is also a very basic operation found in nearly all queries (including inserts, and unsorted/ungrouped selects using unique indexes or primary/foreign key indexes, because indexes always have to maintain their collation order in an accurate but fast way).
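To make the storage-cost argument concrete, here is a minimal sketch, again in Java, of the kind of truncated stored key I described above for the Wikipedia categories: keep only the leading bytes of the full collation key and append a unique numeric id so that stored keys still compare deterministically with a plain byte-wise comparison. The 64-byte budget, the class name TruncatedSortKey and the choice of a French collator are illustrative assumptions only:

import java.nio.ByteBuffer;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

/**
 * Illustrative builder for truncated stored sort keys: only the leading
 * MAX_KEY_BYTES of the full collation key are kept (so not even all the
 * primary weights may survive), and a unique numeric id is appended so
 * that ties introduced by the truncation still compare deterministically.
 */
public final class TruncatedSortKey {
    private static final int MAX_KEY_BYTES = 64; // assumed storage budget
    private static final Collator COLLATOR = Collator.getInstance(Locale.FRENCH);

    /** Builds the stored key: truncated collation bytes + 8-byte row id. */
    public static byte[] build(String text, long rowId) {
        byte[] full = COLLATOR.getCollationKey(text).toByteArray();
        int kept = Math.min(full.length, MAX_KEY_BYTES);
        ByteBuffer buf = ByteBuffer.allocate(kept + Long.BYTES);
        buf.put(full, 0, kept);
        buf.putLong(rowId); // stability tiebreaker
        return buf.array();
    }

    /** Stored keys compare with a plain unsigned byte-wise comparison. */
    public static int compareKeys(byte[] k1, byte[] k2) {
        return Arrays.compareUnsigned(k1, k2);
    }
}

Because the appended id participates in the byte-wise comparison, small groups of entries that tie on the truncated prefix end up ordered by id rather than linguistically, which is exactly the compromise described above.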