[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Thu, 19 Nov 2009 07:51:24 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=164






--- Comment #139 from Philippe Verdy <[email protected]>  2009-11-19 15:51:05 UTC ---
#88, #100, #101: "Does UCA explicitly specify the binary representation of the
generated key?"

Yes and no:
- Yes: there is an interface to retrieve the computed sort key.
- But no: the key is not stable across versions of Unicode (which can add
characters to the DUCET), of ICU (which can change the representation), or of
CLDR (which can also modify the locale-dependent tailorings).

In other words: precomputed (stored) collation keys will need to be versioned
too, and any upgrade will require recomputing them. This is a real problem if
collation is implemented in the native SQL engine, because the database index
will have to be FULLY rescanned and updated, which may mean considerable
downtime on large wikis.

This will be much less problematic if the collation keys are not part of the
SQL engine itself but are stored in a separate table that can hold multiple
keys for distinct collations, because the same table can also hold successive
versions. In that case, recomputing the collation keys can be deferred and done
one category at a time: once the new keys for a category are complete, the
collation version recorded for that category is updated, the keys for the
previous version are deleted, and the next category can be handled. This
implies NO downtime on the server, as new collations can be added on the fly
(and separately for each category).

Consequence: add to the MediaWiki data model a table of supported locale
collations (those that can be specified as the site-wide default) with their
supported versions. Attach a unique id to each versioned collation, and
create a separate category collation table containing the collation keys.

Conceptually (some details omitted):

CREATE TABLE collations (
  coll_id INTEGER NOT NULL,              -- primary key for this table
  -- e.g. "root", "fr", "en-US", "en-GB", "zh-Latn", "zh-Bopo",
  -- "zh-Hans.radical-strokes"
  coll_name VARCHAR(32) NOT NULL,
  coll_ispreferred SMALLINT NOT NULL,    -- marks the preferred version
  coll_version VARCHAR(32) NOT NULL,     -- unique description of the version
  PRIMARY KEY (coll_id),
  UNIQUE KEY (coll_name, coll_version),  -- strong constraint
  -- optional, for fast retrieval of the preferred version of a given collation
  INDEX (coll_name, coll_ispreferred)
);

CREATE TABLE categorysort (
  cl_id INTEGER NOT NULL,    -- in fact, the primary key of the categorylinks table
  coll_id INTEGER NOT NULL,  -- primary key of the collations table
  -- in fact a binary value, computed by ICU (e.g. Collator::getSortKey())
  cs_key VARBINARY(255),
  PRIMARY KEY (cl_id, coll_id)
);

Then no need to change the categorylinks table, which will continue to store
the full pagename, and the custom sortkey.
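
The per-category migration described above could look roughly like the
following sketch. It uses Python's sqlite3 module only so that it is
self-contained. Everything beyond the two tables proposed above is an
assumption made for illustration: the simplified categorylinks columns, the
hypothetical "category" table that records which collation a category
currently uses, and toy_sort_key(), which merely stands in for a real ICU
sort key.

```python
import sqlite3

# Simplified, hypothetical schema; only collations and categorysort follow
# the proposal, the other two tables exist just to make the sketch runnable.
SCHEMA = """
CREATE TABLE collations (
  coll_id INTEGER PRIMARY KEY,
  coll_name TEXT NOT NULL,
  coll_ispreferred INTEGER NOT NULL,
  coll_version TEXT NOT NULL,
  UNIQUE (coll_name, coll_version)
);
CREATE TABLE category (
  cat_title TEXT PRIMARY KEY,
  cat_coll_id INTEGER NOT NULL
);
CREATE TABLE categorylinks (
  cl_id INTEGER PRIMARY KEY,
  cl_to TEXT NOT NULL,
  cl_sortkey TEXT NOT NULL
);
CREATE TABLE categorysort (
  cl_id INTEGER NOT NULL,
  coll_id INTEGER NOT NULL,
  cs_key BLOB,
  PRIMARY KEY (cl_id, coll_id)
);
"""


def toy_sort_key(text, version):
    """Stand-in for an ICU sort key (assumption): case-folded UTF-8 bytes.

    The version prefix only mimics keys changing between releases.
    """
    return version.encode("utf-8") + b"\x00" + text.casefold().encode("utf-8")


def migrate_category(db, category, old_id, new_id, new_version):
    """Move one category from collation old_id to new_id."""
    rows = db.execute(
        "SELECT cl_id, cl_sortkey FROM categorylinks WHERE cl_to = ?",
        (category,),
    ).fetchall()
    # 1. Store keys for the new version next to the old ones.
    db.executemany(
        "INSERT OR REPLACE INTO categorysort (cl_id, coll_id, cs_key) "
        "VALUES (?, ?, ?)",
        [(cl_id, new_id, toy_sort_key(key, new_version)) for cl_id, key in rows],
    )
    # 2. Switch the category to the new collation version.
    db.execute(
        "UPDATE category SET cat_coll_id = ? WHERE cat_title = ?",
        (new_id, category),
    )
    # 3. Drop the keys of the previous version for this category only.
    db.execute(
        "DELETE FROM categorysort WHERE coll_id = ? AND cl_id IN "
        "(SELECT cl_id FROM categorylinks WHERE cl_to = ?)",
        (old_id, category),
    )
    db.commit()


def build_demo_db():
    """Create an in-memory database with two categories on collation 1."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT INTO collations VALUES (?, ?, ?, ?)",
        [(1, "fr", 0, "v1"), (2, "fr", 1, "v2")],
    )
    db.executemany(
        "INSERT INTO category VALUES (?, ?)", [("Fruits", 1), ("Villes", 1)]
    )
    links = [(10, "Fruits", "Pomme"), (11, "Fruits", "abricot"),
             (12, "Villes", "Brest")]
    db.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)", links)
    db.executemany(
        "INSERT INTO categorysort VALUES (?, ?, ?)",
        [(cl_id, 1, toy_sort_key(key, "v1")) for cl_id, _, key in links],
    )
    db.commit()
    return db
```

Calling migrate_category(db, "Fruits", 1, 2, "v2") on the demo database
leaves the other category on the old collation, so the work can be spread
over time and each category stays sortable throughout.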

----

To support the firstChar() conceptual API, each entry in the collations table
above would also need another table containing the possible lowest strings (in
collation order) that use the same weight value at the primary level:

CREATE TABLE collations_headings (
  coll_id INTEGER NOT NULL,        -- primary key of the collations table
  ch_weight INTEGER NOT NULL,      -- primary collation weight value
  -- single default grapheme cluster from the first string starting with this
  -- primary weight
  ch_cluster VARCHAR(32) NOT NULL,
  PRIMARY KEY (coll_id, ch_cluster)
);

One problem is that ICU does not currently contain such a list of default
grapheme clusters suitable for all primary weight values in each collation. Is
there a way to generate it anyway? Note that some primary weights exist only
for sequences of multiple characters.

E.g. "ch" is a default grapheme cluster in the Breton collation, LC_COLLATE=br.
It makes no sense in Breton to mix "c" and "ch" together under the same
heading, given that they sort separately. The same thing occurs in many
languages (e.g. Spanish, Swedish...).

However, this probably does not occur (?) in the DUCET (I did not verify this
assertion), so maybe we can avoid storing every possible Unicode character
together with its default primary heading, and instead store only the
graphemes specified in the exemplar set for the language (or other graphemes
that are explicitly present in the locale-specific tailoring rules).

The other graphemes usable as first char can then be generated automatically
from the first Unicode character present in the MediaWiki custom cl_sortkey
(whose default is the full pagename, including the namespace name localized to
the default locale of the wiki), according to the DUCET, which is shared as the
base of all locales.
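
As a rough illustration of that two-step lookup, here is a minimal Python
sketch of a firstChar()-like helper. The function name first_heading(), the
BRETON_CLUSTERS set, and the plain first-character fallback are assumptions
for this example only; a real implementation would take the clusters from the
locale's tailoring and the fallback heading from DUCET primary weights (which
would also fold accents, something this sketch does not do).

```python
# Illustrative tailored clusters for LC_COLLATE=br (assumed, not from CLDR).
BRETON_CLUSTERS = {"ch", "c'h"}


def first_heading(sortkey, tailored_clusters):
    """Return the heading under which a sort key would be listed."""
    if not sortkey:
        return ""
    folded = sortkey.casefold()
    # Try the longest tailored cluster first, so "ch" wins over plain "c".
    for cluster in sorted(tailored_clusters, key=len, reverse=True):
        if folded.startswith(cluster):
            return cluster.upper()
    # Fallback (simplification): the first character of the sort key.
    return folded[0].upper()
```

With the Breton set, first_heading("Chom", BRETON_CLUSTERS) yields "CH",
while the same call with an empty set falls back to "C".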

