[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Thu, 22 Jul 2010 03:49:13 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #178 from Philippe Verdy <[email protected]> 2010-07-22 10:48:39 UTC ---
Because his specification is really incomplete, and he said that Bug#164 was
useless (despite the fact that I had described my solution extensively in
Bug#164 long before Ariyeh started working on it).

And yes, before ever attempting to change the schema, I support the prior
development and extensive testing of builtin parser functions backed by PHP
functions, which will be shared later to support the updated SQL schema.

This development alone will already have significant difficulties:
* notably integrating ICU in a PHP module installation, or
* rewriting the collation algorithms entirely in PHP;
* having to support the DUCET updates caused by new Unicode versions or
corrections;
* having to support multiple collation orders through per-locale tailorings
(coming from CLDR or from other sources).

The need to support upgraded collation orders is also an important decision
factor for the schema if sortkeys are stored in a SQL backend, which is why I
raise it very early. Collations supported by SQL backends have very strong
limitations:
* any upgrade would require shutting down the servers for hours or days to
perform the upgrade of collated indexes;
* they are missing a full ISO 10646 "level 3 implementation" for the support
of supplementary planes.

All of this can be avoided completely by using ICU, which supports the many
more collation locales that we need in our international projects, instead of
depending on SQL backends:

* the schema just needs to be able to store multiple sortkeys, so that newer
sortkeys (computed with the new rules) can be progressively computed in the
background by a bot or server script, or by upgrades occurring on the fly when
processing articles.
* older sortkeys that were using an older generation rule can be deleted in a
simple DELETE operation after the new collation rule for a corrected locale has
been made the default one, or can be deleted one by one each time a new
generation sortkey is recomputed and has been inserted (there's not even the
need to perform the two successive operations in a transaction if the first
INSERT with the new rule has been successful).
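
The upgrade path in the two points above can be sketched roughly as follows.
This is only an illustration under assumptions: the table name "page_sortkey",
its columns, the integer "generation" number and the helper functions are all
hypothetical (nothing here is the actual MediaWiki schema), and SQLite stands in
for whatever SQL backend is used.

```python
import sqlite3


def make_db():
    # Hypothetical table: several sortkeys may coexist for one page,
    # distinguished by locale and by the generation of the collation rule.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE page_sortkey (
            page_id    INTEGER NOT NULL,
            locale     TEXT    NOT NULL,
            generation INTEGER NOT NULL,
            sortkey    BLOB    NOT NULL,
            PRIMARY KEY (page_id, locale, generation)
        )
        """
    )
    return conn


def add_sortkey(conn, page_id, locale, generation, sortkey):
    # Store one opaque sortkey computed with a given rule generation.
    conn.execute(
        "INSERT INTO page_sortkey VALUES (?, ?, ?, ?)",
        (page_id, locale, generation, sortkey),
    )


def upgrade_one(conn, page_id, locale, new_generation, new_sortkey):
    # One-by-one variant: insert the new-generation key first, then drop
    # the older keys of that page for the same locale.
    add_sortkey(conn, page_id, locale, new_generation, new_sortkey)
    conn.execute(
        "DELETE FROM page_sortkey"
        " WHERE page_id = ? AND locale = ? AND generation < ?",
        (page_id, locale, new_generation),
    )


def purge_old(conn, locale, current_generation):
    # Bulk variant: once the new rule is the default, a single DELETE
    # removes every remaining older-generation key for the locale.
    cur = conn.execute(
        "DELETE FROM page_sortkey WHERE locale = ? AND generation < ?",
        (locale, current_generation),
    )
    return cur.rowcount
```

While the background recomputation is running, old and new keys simply coexist,
and listings keep using whichever generation is currently the default.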

Because we now have multiple sortkeys per indexed page in a category, we can
conveniently support multiple sortkeys for different locales and offer a good
experience for users that will want alternate sort orders (notably Chinese
users that will want presentations in radical/stroke order, or in pinyin
order).

----

Another note about how to serialize the opaque sortkeys:
the builtin function {{SORTKEY:text|locale|level}} described above will not
limit the length of the generated binary sortkey; however, it should serialize
it as valid Unicode text that can be used in tables.

A convenient serialization of bytes to characters that will sort correctly is
Base-36 using the alphabet [0-9A-Z] (no padding necessary) or Base-32 (it
avoids modular arithmetic but will serialize into longer strings).
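
A minimal sketch of the Base-32 variant, in Python for illustration only. The
Base-32 alphabet is not specified above, so this assumes the "extended hex"
alphabet 0-9A-V from RFC 4648, whose characters sort in the same order as the
5-bit values they represent (the more common A-Z2-7 alphabet would not, since
digits sort before letters). The function name is hypothetical.

```python
# Extended-hex Base-32 alphabet: character order matches value order.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def encode_sortable(key: bytes) -> str:
    """Serialize a binary sortkey to text that compares like the bytes do."""
    bits = 0    # bit accumulator
    nbits = 0   # number of pending bits in the accumulator
    out = []
    for byte in key:
        bits = (bits << 8) | byte
        nbits += 8
        # Emit one character per complete group of 5 bits (shifts only,
        # no division or modulo needed).
        while nbits >= 5:
            nbits -= 5
            out.append(ALPHABET[(bits >> nbits) & 0x1F])
        bits &= (1 << nbits) - 1  # keep only the pending bits
    if nbits:
        # Left-align the last partial group, filling with zero bits.
        # No "=" padding is appended.
        out.append(ALPHABET[(bits << (5 - nbits)) & 0x1F])
    return "".join(out)
```

With this alphabet, sorting the encoded strings by code point gives the same
order as sorting the original byte sequences, at the cost of 8 characters for
every 5 bytes.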

If sortkeys are to be stored in and retrieved from the SQL schema, and sorted
by the SQL clause "ORDER BY...sortkey...", then:

- either the SQL backend allows storing and sorting binary sequences of bytes
as VARBINARY(N): then no extra serialization is needed; store the opaque
sortkey directly, after truncation to the max length value (N) indicated in the
SQL type of the "sortkey" table column.

- or the SQL backend does not support sortable binary sequences of arbitrary
bytes, but can only sort VARCHAR(N): then use a similar Base-32 or Base-36
conversion to create compatible sortkeys, and store the converted string
after truncating it to the max length value (N) indicated in the SQL type of
the "sortkey" table column.

- in both cases, the stored sortkeys will NEVER be exposed to users; their sole
purpose is to make the SQL "ORDER BY" clause work properly.

To start listing a category from a given arbitrary Unicode text, use the
"start=" HTTP query parameter and compute internally the sortkey associated
with it to generate the value used in the SQL clause "WHERE sortkey >= 'value'".
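
A rough sketch of the binary-column case and of the "start=" lookup, again in
Python with SQLite standing in for the real backend. Everything named here is an
assumption made for illustration: the table "category_member", the 255-byte
limit, and above all toy_sortkey(), which is only a placeholder (case-folded
UTF-8 bytes) for a real ICU collation key and does NOT implement the Unicode
collation algorithm.

```python
import sqlite3

MAX_KEY_BYTES = 255  # hypothetical N of the "sortkey" column type


def toy_sortkey(text: str) -> bytes:
    # Placeholder for a real collation key: NOT a UCA/ICU implementation.
    # The key is truncated to the column length before being stored.
    return text.casefold().encode("utf-8")[:MAX_KEY_BYTES]


def make_category(titles):
    conn = sqlite3.connect(":memory:")
    # SQLite compares BLOB values byte by byte, like a binary column.
    conn.execute(
        "CREATE TABLE category_member (title TEXT NOT NULL, sortkey BLOB NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO category_member VALUES (?, ?)",
        [(t, toy_sortkey(t)) for t in titles],
    )
    conn.execute("CREATE INDEX member_sortkey ON category_member (sortkey)")
    return conn


def list_from(conn, start: str, limit: int = 200):
    # The user-supplied "start=" text is converted with the same function,
    # so the comparison happens entirely on opaque keys.
    rows = conn.execute(
        "SELECT title FROM category_member"
        " WHERE sortkey >= ? ORDER BY sortkey, title LIMIT ?",
        (toy_sortkey(start), limit),
    )
    return [title for (title,) in rows]
```

For example, with the titles "banana", "Apple", "cherry" and "apricot",
list_from(conn, "B") returns "banana" and "cherry": only page titles come back,
and the stored keys never leave the query.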

- Section headings in categories will never need to be stored; they are
generated on the fly by reading the page names retrieved in the SQL result set
using the {{COLLATIONMAP:}} function, with the locale specified in the
"uselang=" HTTP query parameter, and the specified (or default) "clusters="
parameter (whose default will be 1 or 0 as indicated above). They will be
directly readable by users and do not require decoding anything from the stored
sortkey.

- the readable collation mappings and the opaque sortkeys should be coherent in
the same locale, but they are clearly different: pagenames that are
collation-mapped should sort in the same natural order as the section headings
generated from them, so there is absolutely no need to generate sortkeys from
collation-mapped headings computed on the fly.
