https://bugzilla.wikimedia.org/show_bug.cgi?id=164
--- Comment #119 from Philippe Verdy <[email protected]> 2009-09-29 06:31:02 UTC ---

I think that you should implement it in two steps:

== STEP ONE ==

First create a function that computes a collation key from a given string within a given locale, and provide it as a string function, something like {{#collate:locale|string}}.

* For example, "{{#collate:fr|Clé}}" could return something like "cle !élc !Clé", using the " !" substring as a separator between the collation elements of each level. Note that the secondary key is reversed here, as French collation specifies.
* The exact collation keys should be made readable by taking, at each collation level, the first character that is part of the same collation group (at that level).
* If you can find a more compact way to represent collation keys, you could also accept the TAB character (instead of space plus first non-space character) as the collation level separator, provided the database accepts storing and displaying it. However, such a key will be difficult to use in prefix searches and in links, even if it is encoded with {{urlencode:}}.
* This will allow Wiktionaries to compute collation keys more reliably, while still allowing those strings to be used to populate the sort key differently for specific categories. The locale parameter can simply be the language identifier: at least the languages already supported in Wikimedia projects, each of which we can bind to a default collation order appropriate for its script.
* The computed collation key should use only 3 levels (the fourth is the string itself in its binary Unicode form; it is handled implicitly and does not need to be specified or stored).
** Most of the time, the primary level will read as if it were written in a very simplified script: ignorable characters, diacritics, and apostrophes dropped, other separators changed to a single space, and all characters in lowercase.
** The most complex substring will be the one for the secondary level: this is the one that cannot be computed easily today. It should not preserve case differences or ignorable characters, but the accent differences should be in it. (This is the key that the French Wiktionary requests as a parameter of its template "[[Modèle:clé de tri]]"; in fact the template requests it with its original case, generates the secondary key by converting it to lowercase, and uses the parameter directly as the third-level key.)
** Most of the time, the third level will be very similar to the original string (with its significant case), with just the ignorable characters removed.
* Some consideration should be given to languages with complex scripts: the primary key should at least make it possible to extract a meaningful first character usable when rendering categories. An initial Hangul syllable can be decomposed to its initial jamo; Chinese ideographs should be mapped to a radical/strokes key, so that the radical can be used as the first "character". Unicode's Unihan database can be helpful.
* For locales that use contractions, the "first" character in a collation group may be a digram or trigram: displaying the content of a category sorted this way should be able to use that digram/trigram as the title, instead of just the first physical Unicode character. We could imagine that a string function would return this "first letter" even if it is a digram/trigram, with something like {{#collatefirst:locale|string}}. For example, {{#collatefirst:br|Christian}} would return "ch", because "ch" is a single letter in Breton: it is the first element of the primary collation group that also contains "cH", "Ch" and "CH", and it sorts between "c" and the trigram "c'h" (the latter is also distinct and comes before "d"). Such cases (named contractions in UCA) are frequent and needed in Spanish and Nordic languages.
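The three-level key and the "first letter" function proposed in STEP ONE can be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code and not a real UCA implementation: the function names mirror the proposed {{#collate:}} and {{#collatefirst:}}, the `CONTRACTIONS` table is a hypothetical stand-in for real locale data, and ignorable characters are handled only at the primary level.

```python
import unicodedata

SEP = " !"  # level separator suggested in the comment

# Hypothetical contraction table; real data would come from locale tailorings.
CONTRACTIONS = {"br": ("ch", "c'h")}


def _strip_marks(text):
    """Drop combining marks (accents) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def _primary(text):
    """Lowercase, no accents, apostrophes dropped, other separators -> one space."""
    chars = []
    for c in _strip_marks(text.lower()):
        if c in "'\u2019":
            continue  # apostrophes are dropped entirely
        chars.append(c if c.isalnum() else " ")
    return " ".join("".join(chars).split())


def collate(locale, string):
    """Return a readable 3-level key: primary, secondary, tertiary."""
    tertiary = unicodedata.normalize("NFC", string)
    secondary = tertiary.lower()
    if locale == "fr":
        # French compares accents from the end of the word. Reversing code
        # points is only safe for precomposed (NFC) letters: an assumption.
        secondary = secondary[::-1]
    return SEP.join([_primary(tertiary), secondary, tertiary])


def collatefirst(locale, string):
    """Return the first collation element, which may be a digram/trigram."""
    lowered = unicodedata.normalize("NFC", string).lower()
    for group in sorted(CONTRACTIONS.get(locale, ()), key=len, reverse=True):
        if lowered.startswith(group):
            return group
    return _strip_marks(lowered[:1])


print(collate("fr", "Clé"))
print(collatefirst("br", "Christian"))
```

With these assumptions, `collate("fr", "Clé")` reproduces the key from the example above, and sorting "cote", "côte", "coté", "côté" by their "fr" keys gives the usual French dictionary order, because the reversed secondary level compares the last accent first.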
== STEP TWO ==

* Then, if articles are categorized without any DEFAULTSORT: key and without a sort key parameter, use the same function to generate their collation key automatically, but ONLY when displaying the category, using the project's default locale or the locale specified within that target category. Don't store the collation key with the article, unless you are ready to have a server update task that can recompute the collation keys of the pages and subcategories categorized in it, because the locale of a category could change over time, or could be set much later.
* Category pages would contain their own option to specify their preferred collation, something like {{DEFAULTCOLLATE:languagecode}} in the text of the category page, to change it from the project's default locale/language. This additional magic keyword would only be meaningful on category pages, and is distinct from the string function above.
* But this would not prevent pages from setting their own sort key, if they wish, when they categorize themselves in such a category.

I am convinced that there's no need to choose collation locales according to the user's own preferences. The locale should be a property of both the project site and the target category (which should be specific to a given reference language, so that it can effectively specify its own preferred locale). For this reason, the collation key can be stored as it is today, when assigning different sort keys to the same page within distinct categories.

Having the string function {{#collate:language|string}} would still allow sorting some pages into specific groups (like "*" today in many wikis, for special elements that require higher priority in categories with large enough populations).
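The display-time behaviour described in STEP TWO could look roughly like the Python sketch below. Everything here is hypothetical: `PROJECT_DEFAULT_LOCALE`, `collation_key` (a simplified stand-in for the proposed {{#collate:}} function) and the `(title, explicit_sort_key)` member tuples are illustrative names, and using an explicit sort key verbatim as the primary level is an assumption about how existing keys such as "*" would keep their priority.

```python
import unicodedata

PROJECT_DEFAULT_LOCALE = "en"  # hypothetical project-wide default


def collation_key(locale, title):
    """Simplified stand-in for the proposed {{#collate:locale|string}}."""
    tertiary = unicodedata.normalize("NFC", title)
    lowered = tertiary.lower()
    primary = "".join(
        c for c in unicodedata.normalize("NFD", lowered)
        if not unicodedata.combining(c)
    )
    # French reverses the accent level; other locales keep it as is.
    secondary = lowered[::-1] if locale == "fr" else lowered
    return (primary, secondary, tertiary)


def sorted_category_members(members, category_locale=None):
    """Sort (title, explicit_sort_key) pairs when the category is displayed.

    The locale comes from the category (as a DEFAULTCOLLATE-style setting
    would provide) or falls back to the project default. Keys for pages
    without an explicit sort key are computed here and never stored.
    """
    locale = category_locale or PROJECT_DEFAULT_LOCALE

    def key(member):
        title, explicit = member
        if explicit is not None:
            # Assumption: an explicit sort key is used verbatim.
            return (explicit, "", title)
        return collation_key(locale, title)

    return [title for title, _ in sorted(members, key=key)]


members = [("Zèbre", None), ("Été", None), ("Aide", "*"), ("eau", None)]
print(sorted_category_members(members, category_locale="fr"))
```

Because nothing is stored per article, changing the category's locale only changes the result of the next display; no recomputation task is needed in this sketch.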
