https://bugzilla.wikimedia.org/show_bug.cgi?id=164
--- Comment #180 from Philippe Verdy <[email protected]> 2010-07-22 11:42:55 UTC ---
Your issue ***IS*** addressed in my proposal:

*Both* {{COLLATIONMAP:Áa}} and {{COLLATIONMAP:Aa}} will be unambiguously "aa" at the primary collation level for that locale. They differ only at the secondary collation level (with accents).

You did not understand that a collation mapping is DIFFERENT from an opaque sort key, even if both are built using the same collation rules for the same locale.

The case of "Albert Einstein" sorting as "Einstein, Albert" will go through the standard generation of the sort key from the string "Einstein, Albert", given explicitly in the MediaWiki source code as the parameter of the {{DEFAULTSORT:Einstein, Albert}} special function or as the second parameter of the [[category:...|Einstein, Albert]] link.

----

So here is a development and deployment map:

1) Develop PHP functions that will compute:

function sortkey($text, $locale, $level=1)
- it will return an opaque array of byte values
- $locale may be given a default value from the project's content language; this is not specified here, as it is part of the integration in MediaWiki
- $level may take the default value of 1
- the algorithm transforms the $text parameter into normalized form, splits it into collation elements, and then looks up each collation element in the tailored collation table, which is indexed by collation element value and returns an array of collation weights, one for each level
- it packs all the collation weights into the returned opaque array of byte values, by appending all non-zero collation weights for each collation element at the same collation level before appending the collation weights for the successive higher levels
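The packing order described for sortkey() can be illustrated with a minimal Python sketch (Python rather than the proposed PHP, for brevity). The three-level TOY_TABLE, its weights, and the single zero byte used as a level separator are assumptions made for illustration only; they are not part of the proposal, and real weights would come from the DUCET with locale tailorings. Each collation element is assumed to be a single character after NFD normalization, and the table argument stands in for $locale.

```python
import unicodedata

# Hypothetical toy collation table:
# collation element -> (primary, secondary, tertiary) weights.
TOY_TABLE = {
    "a": (1, 1, 1), "A": (1, 1, 2),
    "b": (2, 1, 1), "B": (2, 1, 2),
    "e": (3, 1, 1), "E": (3, 1, 2),
    "\u0301": (0, 2, 1),  # combining acute accent, ignorable at the primary level
}

def sortkey(text, table=TOY_TABLE, level=1):
    """Return an opaque byte string: all level-1 weights, then level-2, and so on."""
    # NFD splits an accented letter into its base letter plus combining marks.
    elements = unicodedata.normalize("NFD", text)
    key = bytearray()
    for lvl in range(level):
        if lvl > 0:
            key.append(0)  # assumed level separator, lower than any real weight
        for ch in elements:
            weight = table[ch][lvl]  # raises KeyError for characters not in the toy table
            if weight != 0:          # zero weights are not stored in the key
                key.append(weight)
    return bytes(key)
```

With this table, sortkey("Áa"), sortkey("Aa") and sortkey("aa") are identical at level 1, while at level 2 the key for "Áa" compares greater than the key for "Aa" because of the secondary weight of the accent.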
function collationmap($text, $locale, $level=1, $clusters)
- it will return a remapped text using the same $locale and $level parameters
- it will use the same DUCET table and the same per-locale tailorings
- the same parsing steps are performed
- but instead of packing the collation weights, it scans the collation table in the reverse order, locating the first collation element (a small Unicode string, often limited to a single character) that has the same collation weights up to the specified level. When this smallest collation element is found, it is appended to the result string.

function base36($bytes)
- it packs the opaque binary array of bytes into plain ASCII that has a safe binary order and can safely be stored in a VARCHAR(N) table field, or returned by a MediaWiki function.

This module should use ICU and implement the locale tailorings. It should be able to support a full DUCET table, and allow lookups from a collation element to an array of collation weights, or the reverse (and ordered) lookup from collation weights to a collation element (for the function collationmap()).

2) Integrate these functions in a MediaWiki extension for built-in parser functions.

{{SORTKEY:text|locale|level}}
- this will return base36(sortkey($text, $locale, $level))
- by default take $level=1
- by default take $locale={{CONTENTLANGUAGE}} (or {{int:lang}} ?)
- it can be used as a much better implementation of Wikipedia templates like [[Template:Sort]]

{{COLLATIONMAP:text|locale|level|clusters}}
- this will return collationmap($text, $locale, $level, $clusters)
- it can be used to simulate the generation of headings in categories, but also within MediaWiki tables
- by default take $clusters=null (no limitation of length)
- by default take $level=1
- by default take $locale={{CONTENTLANGUAGE}} (or {{int:lang}} ?)
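As a rough illustration of the reverse lookup in collationmap() and of an order-preserving base36(), here is a minimal Python sketch (not the proposed PHP). The TOY_TABLE and its weights are hypothetical, the $clusters parameter is left out, and the encoding of each byte as two base-36 digits is only one possible scheme that keeps binary order; the proposal does not specify the exact encoding.

```python
import unicodedata

# Hypothetical toy collation table:
# collation element -> (primary, secondary, tertiary) weights.
TOY_TABLE = {
    "a": (1, 1, 1), "A": (1, 1, 2),
    "b": (2, 1, 1), "B": (2, 1, 2),
    "e": (3, 1, 1), "E": (3, 1, 2),
    "\u0301": (0, 2, 1),  # combining acute accent, ignorable at the primary level
}

def collationmap(text, table=TOY_TABLE, level=1):
    """Replace each element by the smallest element sharing its weights up to `level`."""
    reverse = {}
    # Walk the table in full collation order, so the first element recorded
    # for each truncated weight tuple is the smallest one.
    for element, weights in sorted(table.items(), key=lambda item: item[1]):
        reverse.setdefault(weights[:level], element)
    out = []
    for ch in unicodedata.normalize("NFD", text):
        prefix = table[ch][:level]  # raises KeyError for characters not in the toy table
        if any(prefix):             # drop elements that are ignorable up to this level
            out.append(reverse[prefix])
    return "".join(out)

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # ASCII order matches digit value

def base36(data):
    """Encode each byte as two base-36 digits (36 * 36 = 1296 >= 256)."""
    # A fixed width per byte keeps the text order identical to the byte order.
    return "".join(DIGITS[b // 36] + DIGITS[b % 36] for b in data)
```

With this table, collationmap("Áa") and collationmap("Aa") both return "aa" at level 1, and base36(b"\x01\x01") returns "0101".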
3) Build a function for mapping category sort keys into stored sort keys; this will depend on the SQL backend capabilities and on the schema length constraint for the sortkey data columns:

function sqlsortkey($text, $locale, $level)
- it will return either: substring(sortkey($text, $locale, $level), 0, $maxlength)
- or: substring(base36(sortkey($text, $locale, $level)), 0, $maxlength)
- the choice will depend on whether the SQL engine supports VARBINARY(N) and can sort it, or supports only VARCHAR(N)
- the sortkey will not have to be UTF-8, and will not need any support of the same locales for collation tailoring in the SQL backend.

4) Update the schema to support the description of supported internal collation ids.
- add a mapping function from the human-readable $locale parameter to the collation id associated with the current version of the collation rules applicable to a locale.
- support this mapping with a simple "collations" relational table

5) Build a new category index based on: (categoryid, collationid, sortkey, pageid)
- categoryid and pageid are standard MediaWiki page ids (in any modification version).
- collationid will come from the previous mapping (using the locale identifier string, where the locale will be determined by HTTP query parameters like "uselang=", i.e. {{int:lang}}, or from the project's default {{CONTENTLANGUAGE}}).
- the sortkey column will be computed using PHP's sqlsortkey($text, $locale, $level) described above.

....

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
