[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Fri, 23 Jul 2010 22:54:14 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #186 from Philippe Verdy <[email protected]> 2010-07-24 05:53:30 UTC ---
(In reply to comment #183)
> Upgrading the collation can be done in-place.  The worst case is that
> categories sort weirdly for a few hours.  Also, we would only realistically
> have to change the collation often on smaller wikis, since the largest wikis
> should have high-quality collations that cover almost all their pages to begin
> with.  I don't think we need to adjust the schema to prepare for this.

That's a bad assumption: even the highest-quality collations will need to be
updated from time to time:
- Unicode evolves and new characters get encoded (new versions are published
about every 1-2 years, after synchronization and final balloting at both ISO
WG2 and the UTC).
- The content of the Unicode DUCET is NOT stable: characters are inserted into
the sequence, so the collation weights of everything after each insertion
point have to be shifted.
- Collations for languages get corrected. We should be able to upgrade these
rules when the CLDR project produces new tailorings (CLDR updates are published
separately, about every 1-2 years).

These corrections may be rare (every few months), but when they occur, an
upgrade could take many hours and could force the site to go offline while
sortkeys are recomputed, or else NO correction will be possible. Upgrading "in
place" is effectively what I proposed, but how will you track which pages need
to be reindexed?

A collation ID in the stored index can really help determine which collation
rule was used to generate the stored sortkey. In addition, it will make it
possible to support multiple collations. This is the means by which the
"in place" recomputation can safely be done.
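
A minimal sketch of how a stored collation ID could drive the "in place"
upgrade in small batches. Everything named here (CategoryLink,
CURRENT_COLLATION_ID, compute_sortkey, upgrade_batch) is hypothetical and not
part of the existing schema, and the "collator" is a stand-in, not real UCA:

```python
from dataclasses import dataclass

# Hypothetical: bumped each time the collation rules are upgraded.
CURRENT_COLLATION_ID = 2


@dataclass
class CategoryLink:
    page_name: str
    sortkey: bytes
    collation_id: int  # which rule set produced `sortkey`


def compute_sortkey(text: str, collation_id: int) -> bytes:
    # Stand-in for a real collator: rule set 1 sorts by raw UTF-8,
    # rule set 2 by case-folded UTF-8.
    if collation_id >= 2:
        text = text.casefold()
    return text.encode("utf-8")


def upgrade_batch(rows, batch_size=2):
    """Recompute at most `batch_size` stale rows; return how many changed."""
    updated = 0
    for row in rows:
        if updated >= batch_size:
            break
        if row.collation_id != CURRENT_COLLATION_ID:
            row.sortkey = compute_sortkey(row.page_name, CURRENT_COLLATION_ID)
            row.collation_id = CURRENT_COLLATION_ID
            updated += 1
    return updated
```

Rows whose collation_id already matches are skipped, so the job can be
interrupted and resumed at any time without taking the site offline.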

Note: truncating the sortkeys will ALWAYS be needed, simply because the
database column will still have a length limit. Truncating is not so bad
anyway, because:
- the compact binary sequence of primary collation weights comes at the
beginning of the sort key. The remaining length is used to store the compacted
sequence of secondary collation weights, then the sequence of tertiary
collation weights.
- if truncation occurs, the effect will be that only the smallest differences
will not be represented.
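
To illustrate why truncation only loses the finest distinctions, here is a toy
sketch. It is NOT the real UCA algorithm: the weights are invented (base letter
as primary, "has an accent" as secondary, "is uppercase" as tertiary) and the
function names are hypothetical. Only the layout of the key matters:

```python
import unicodedata


def toy_sortkey(text: str) -> bytes:
    """Toy three-level key: primary weights, 0x00, secondary, 0x00, tertiary."""
    primary, secondary, tertiary = [], [], []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        # Primary: the base letter, ignoring case and accents (clamped to 1..255).
        primary.append(max(1, min(ord(base.lower()[0]), 0xFF)))
        # Secondary: whether the character carries any combining mark.
        secondary.append(2 if len(decomposed) > 1 else 1)
        # Tertiary: whether the base letter is uppercase.
        tertiary.append(2 if base.isupper() else 1)
    return bytes(primary) + b"\x00" + bytes(secondary) + b"\x00" + bytes(tertiary)


def truncate_key(key: bytes, limit: int) -> bytes:
    """Cut the key to the column length; the low-order levels are lost first."""
    return key[:limit]
```

With a 6-byte limit, "resume" and "résumé" collapse to the same key (the accent
difference is lost), while words that differ in their base letters still sort
correctly.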

But even if you decide to store only non-truncated sort keys, you'll still have
the case where some pages have a long name, plus the case where someone has
given that page a very long {{DEFAULTSORT:sortkey}} or very long text in the
second parameter of [[category:...|sortkey]]. To avoid this:
- page names already have a length limit. This also limits the length of sort
keys computed from them alone.
- we should already truncate the string given in {{DEFAULTSORT:sortkey}} or
[[category:...|sortkey]] so that the concatenation of this string and of the
page name can be used to compute the binary sortkey.
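
A possible sketch of that truncation step. The 512-byte budget, the newline
separator between hint and page name, and both function names are assumptions
made for illustration only:

```python
# Hypothetical budget for the text handed to the collator.
MAX_INPUT_BYTES = 512


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    raw = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a multi-byte sequence left incomplete by the cut.
    return raw.decode("utf-8", errors="ignore")


def collation_input(hint: str, page_name: str, separator: str = "\n") -> str:
    """Concatenate the (truncated) sort hint and the full page name.

    The separator is an assumption: it keeps "ab" + "c" distinct from
    "a" + "bc". The page name is kept whole; only the hint is shortened.
    """
    reserved = len(page_name.encode("utf-8")) + len(separator.encode("utf-8"))
    budget = max(MAX_INPUT_BYTES - reserved, 0)
    return truncate_utf8(hint, budget) + separator + page_name
```

Because only the hint is shortened, the page name always contributes in full to
the string from which the binary sortkey is computed.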

If you can accept arbitrary lengths, then go with it, but it will be unsafe and
your schema will not be able to put that in a sortable column (you'll only be
able to put it in a referenced BLOB, just like the text of articles, and
databases can never sort external BLOBs).

Anyway, you did not reply to the idea of first developing the parser functions
and testing them. Developing the SQL schema extension should not be attempted
before at least the first function {{SORTKEY:text|locale|level}} has been fully
developed and tested on specific pages (it can be tested easily in tables).

And with just this function, it should be possible on specific wikis to use it
immediately to sort specific categories (for example by using templates using
that function).

The second function {{COLLATIONMAP:text|locale|level|clusters}} is not needed
immediately to develop the schema, but will be useful to restore the
functionality of headings. Headings don't need to be stored, as they can be
computed on the fly, directly by reading the sorted result set from the SQL
query sequentially:

You can compute headings from the returned page names, or from the existing
stored "cl_sortkey", which should now be used ONLY to store the plain text
specified in articles with {{DEFAULTSORT:sortkey}} and
[[category:...|sortkey]].
The existing cl_sortkey is just a forced "hint"; it does not make the sort
order unique. Otherwise it should remain completely empty with the new schema.
It will always be locale-neutral and will take precedence over the page name:
to sort the pages effectively, the content of cl_sortkey and the page name
should always be concatenated internally to compute the binary sortkey for the
various locales.
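
A rough sketch of computing headings on the fly from an already sorted result
set. The rows are assumed to be (cl_sortkey hint, page name) pairs in sorted
order, and heading_for() is only a placeholder for what COLLATIONMAP would
return (here simply the uppercased first character):

```python
from itertools import groupby


def heading_for(hint: str, page_name: str) -> str:
    """Placeholder heading: first character of the hint, else of the page name."""
    text = hint or page_name
    return text[:1].upper()


def with_headings(sorted_rows):
    """Yield (heading, [page names]) by reading the sorted rows sequentially."""
    for heading, group in groupby(sorted_rows, key=lambda row: heading_for(*row)):
        yield heading, [page_name for _hint, page_name in group]
```

Nothing is stored: a new heading is emitted whenever the computed value changes
between two consecutive rows, and a non-empty hint takes precedence over the
page name.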

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
