https://bugzilla.wikimedia.org/show_bug.cgi?id=164
--- Comment #139 from Philippe Verdy <verd...@wanadoo.fr> 2009-11-19 15:51:05 UTC ---

#88, #100, #101: "Does UCA explicitly specify the binary representation of the generated key?"

Yes and no:
- Yes: there is an interface to retrieve the computed sort key.
- No: the key is not stable across versions of Unicode (which can add characters to the DUCET), of ICU (which can change the representation), or of CLDR (which can also modify the locale-dependent tailorings).

In other words: precomputed (stored) collation keys will need to be versioned too, and any upgrade will require recomputing them. This is a real problem if collation is implemented in the native SQL engine, because the database index will have to be FULLY rescanned and updated, which may mean considerable downtime on large wikis.

This will be much less problematic if the collation keys are not part of the SQL engine itself but are stored in a separate table that can hold multiple keys for distinct collations, because the same table can also hold successive versions. In that case, recomputing the collation keys can be deferred: once the new keys for a category are complete, the collation version is updated in the category, the keys for the previous version are deleted, and the next category can be handled. This implies NO downtime on the server, as new collations can be added on the fly (and separately for each category).

Consequence: add to the MediaWiki data model a table of supported locale collations (those that can be specified in the site-wide default) with their supported versions. Attach a unique id to each versioned collation, and create a separate category collation table containing the collation keys. Conceptually (some details omitted):

  CREATE TABLE collations (
    coll_id INTEGER NOT NULL, -- primary key for this table
    coll_name VARCHAR(32) NOT NULL, -- e.g. "root", "fr", "en-US", "en-GB", "zh-Latn", "zh-Bopo", "zh-Hans.radical-strokes"
    coll_ispreferred SMALLINT NOT NULL, -- flags the preferred version of this collation
    coll_version VARCHAR(32) NOT NULL, -- unique description of the version
    PRIMARY KEY (coll_id),
    UNIQUE KEY (coll_name, coll_version), -- strong constraint
    INDEX (coll_name, coll_ispreferred) -- optional, for fast retrieval of the preferred version of a given collation
  );

  CREATE TABLE categorysort (
    cl_id INTEGER NOT NULL, -- in fact, the primary key of the categorylinks table
    coll_id INTEGER NOT NULL, -- primary key of the collations table
    cs_key VARCHAR(255), -- in fact a binary value, computed by ICU's Collator (e.g. getSortKey())
    PRIMARY KEY (cl_id, coll_id)
  );

Then there is no need to change the categorylinks table, which will continue to store the full pagename and the custom sortkey.

----

To support the firstChar() conceptual API, each entry in the collations table above would also need entries in another table containing the lowest possible strings (in collation order) that use the same weight value at the primary level:

  CREATE TABLE collations_headings (
    coll_id INTEGER NOT NULL, -- primary key of the collations table
    ch_weight INTEGER NOT NULL, -- primary collation weight value
    ch_cluster VARCHAR(32) NOT NULL, -- single default grapheme cluster from the first string starting with this primary weight
    PRIMARY KEY (coll_id, ch_cluster)
  );

One problem is that ICU does not currently contain such a list of default grapheme clusters suitable for all primary weight values in each collation. Is there a way to generate it anyway?

Note that some primary weights exist only for sequences of multiple characters. E.g. "ch" is a default grapheme cluster in the Breton collation (LC_COLLATE=br). It makes no sense in Breton to mix "c" and "ch" together under the same heading, given that they sort separately. The same thing occurs in many languages (e.g. Spanish, Swedish...). However, it probably does not occur (?)
in the DUCET (I did not verify this assertion), so maybe we can avoid storing every possible Unicode character in order to convert it to its default primary heading, and instead store only the graphemes specified in the exemplar set for the language (or other graphemes that are explicitly present in the locale-specific tailoring rules). The other graphemes usable as first char can then be generated automatically from the first Unicode character present in the MediaWiki custom cl_sortkey (whose default is the full pagename, including the namespace name localized to the default locale of the wiki), according to the DUCET, which is shared as the base of all locales.
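
The deferred, per-category recomputation proposed in this comment could be sketched roughly as follows. This is only an illustration under loud assumptions: Python with an in-memory SQLite database stands in for MediaWiki and its database, toy_sort_key() is a made-up stand-in for a real collator (it is not ICU and does not implement the UCA), and the category table with its cat_coll_id column, the simplified categorylinks layout, and all function names are hypothetical.

```python
import sqlite3
import unicodedata


def toy_sort_key(version, text):
    """Made-up stand-in for a collator; NOT ICU. Version "2" also ignores accents."""
    if version == "2":
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    return text.casefold().encode("utf-8")


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE collations (
            coll_id INTEGER PRIMARY KEY,
            coll_name TEXT NOT NULL,
            coll_version TEXT NOT NULL,
            coll_ispreferred INTEGER NOT NULL,
            UNIQUE (coll_name, coll_version)
        );
        -- Hypothetical, simplified stand-ins for MediaWiki's own tables.
        CREATE TABLE category (cat_title TEXT PRIMARY KEY, cat_coll_id INTEGER NOT NULL);
        CREATE TABLE categorylinks (
            cl_id INTEGER PRIMARY KEY, cl_to TEXT NOT NULL, cl_sortkey TEXT NOT NULL
        );
        CREATE TABLE categorysort (
            cl_id INTEGER NOT NULL,
            coll_id INTEGER NOT NULL,
            cs_key BLOB,
            PRIMARY KEY (cl_id, coll_id)
        );
    """)
    return conn


def migrate_category(conn, category, old_id, new_id):
    """Move one category from collation old_id to new_id, one category at a time."""
    (version,) = conn.execute(
        "SELECT coll_version FROM collations WHERE coll_id = ?", (new_id,)
    ).fetchone()
    links = conn.execute(
        "SELECT cl_id, cl_sortkey FROM categorylinks WHERE cl_to = ?", (category,)
    ).fetchall()
    # 1. Add keys for the new version alongside the old ones.
    conn.executemany(
        "INSERT OR REPLACE INTO categorysort (cl_id, coll_id, cs_key) VALUES (?, ?, ?)",
        [(cl_id, new_id, toy_sort_key(version, key)) for cl_id, key in links],
    )
    # 2. Switch the category over to the new collation version.
    conn.execute(
        "UPDATE category SET cat_coll_id = ? WHERE cat_title = ?", (new_id, category)
    )
    # 3. Delete this category's keys for the previous version.
    conn.execute(
        "DELETE FROM categorysort WHERE coll_id = ? AND cl_id IN "
        "(SELECT cl_id FROM categorylinks WHERE cl_to = ?)",
        (old_id, category),
    )
    conn.commit()


def sorted_titles(conn, category):
    """List a category's members ordered by the keys of its current collation."""
    rows = conn.execute(
        "SELECT cl.cl_sortkey FROM categorylinks cl "
        "JOIN category c ON c.cat_title = cl.cl_to "
        "JOIN categorysort cs ON cs.cl_id = cl.cl_id AND cs.coll_id = c.cat_coll_id "
        "WHERE cl.cl_to = ? ORDER BY cs.cs_key",
        (category,),
    ).fetchall()
    return [row[0] for row in rows]


def build_demo():
    """Create one category whose keys were computed with collation version "1"."""
    conn = make_db()
    conn.executemany(
        "INSERT INTO collations VALUES (?, ?, ?, ?)",
        [(1, "root", "1", 0), (2, "root", "2", 1)],
    )
    conn.execute("INSERT INTO category VALUES ('Cities', 1)")
    for cl_id, title in enumerate(["Zurich", "Évian", "Eton"], start=1):
        conn.execute("INSERT INTO categorylinks VALUES (?, 'Cities', ?)", (cl_id, title))
        conn.execute(
            "INSERT INTO categorysort VALUES (?, 1, ?)", (cl_id, toy_sort_key("1", title))
        )
    conn.commit()
    return conn
```

Calling migrate_category(conn, "Cities", 1, 2) on the database from build_demo() changes the listing order only for that category; other categories would keep using their old keys until their own turn comes, which is the point of keeping the keys outside the SQL engine's native collation.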
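
The heading lookup suggested in this comment (store only the locale's tailored clusters, and fall back to the first character otherwise) might look something like the sketch below. It is a simplification under stated assumptions: first_heading() and its parameters are hypothetical names, the cluster list is passed in by hand instead of being read from a collations_headings table, and the fallback takes the first character as-is without mapping it to its DUCET primary weight, so accented letters would not yet be folded under their base letter.

```python
from typing import Iterable


def first_heading(sortkey: str, tailored_clusters: Iterable[str]) -> str:
    """Return the heading under which sortkey would be listed.

    tailored_clusters holds the multi-character clusters that the locale
    sorts as separate letters (e.g. "ch" for Breton, per the comment above).
    """
    folded = sortkey.casefold()
    # Try the longest clusters first so that "ch" wins over a plain "c".
    for cluster in sorted(tailored_clusters, key=len, reverse=True):
        if folded.startswith(cluster.casefold()):
            return cluster.upper()
    # Fallback: the first character of the sort key (empty string if none).
    return folded[:1].upper()
```

With tailored_clusters = ["ch"], a sort key starting with "Ch" is listed under "CH" while one starting with "Ca" stays under "C"; with an empty cluster list both fall under "C".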