[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Mon, 28 Sep 2009 23:31:21 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #119 from Philippe Verdy <[email protected]>  2009-09-29 06:31:02 UTC ---
I think you should implement this in two steps:

== STEP ONE ==

First create a function that computes a collation key from a given string
within a given locale, and provide it as a string function, something like
{{#collate:locale|string}}

* e.g. "{{#collate:fr|Clé}}" could return something like "cle !élc !Clé",
using the " !" substring as the separator between the keys of successive
collation levels. Note that the secondary key is reversed here, as French
collation requires.

* The collation keys should be kept readable by representing each collation
group, at each level, by the first character of that group.

* If you can find a more compact way to represent collation keys, you could
also accept the TAB character (instead of a space followed by a non-space
character) as the collation level separator, provided the database can store
and display it. However, such a key would be difficult to use in prefix
searches and in links, even if it is URL-encoded with {{urlencode:}}.

* This will allow Wiktionaries to compute collation keys more reliably, while
still allowing those strings to be used to populate the sort key differently
for specific categories. The locale parameter can simply be a language
identifier: at least the languages already supported in Wikimedia projects,
each of which can be bound to a default collation order appropriate for its
script.

* The computed collation key should use only 3 levels. The fourth level is the
string itself in its binary Unicode form; it is handled implicitly and does
not need to be specified or stored.
** Most of the time, the primary level will read as if it were written in a
very simplified script: ignorable characters, diacritics and apostrophes are
dropped, other separators are changed to a single space, and all characters
are lowercased.
** The most complex substring will be the one for the secondary level, which
cannot be computed easily today. It should drop case differences and
ignorable characters but keep the accent differences. This is the key that
the French Wiktionary requests as a parameter of its template
"[[Modèle:clé de tri]]"; in fact the template requests it with its original
case, generates the secondary key by converting it to lowercase, and uses the
parameter unchanged as the third-level key.
** Most of the time, the third level will be very similar to the original
string (with its significant case), with just the ignorable characters
removed.
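
The three levels described above can be sketched in Python. This is only a
rough illustration, not an implementation of UCA: the function name
collate(), the small set of "ignorable" characters and the whole-string
reversal used for the French secondary level are assumptions made for this
example.

```python
import unicodedata

SEPARATOR = " !"  # level separator suggested in this comment


def _strip_marks(text):
    """Drop combining marks (accents) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def _is_ignorable(ch):
    # Assumption: apostrophes and format characters count as ignorable.
    return ch in "'\u2019" or unicodedata.category(ch) == "Cf"


def collate(locale, string):
    """Return a readable 3-level key: primary, secondary, tertiary."""
    text = unicodedata.normalize("NFC", string)
    kept = "".join(ch for ch in text if not _is_ignorable(ch))
    # Any remaining separator becomes a single space.
    spaced = " ".join(
        "".join(
            ch if ch.isalnum() or unicodedata.combining(ch) else " "
            for ch in kept
        ).split()
    )
    primary = _strip_marks(spaced).lower()   # no accents, no case
    secondary = spaced.lower()               # accents kept, no case
    if locale == "fr":
        # Simplification: French compares accents from the end of the
        # string, modelled here by reversing the whole secondary key.
        secondary = secondary[::-1]
    tertiary = kept                          # original case, ignorables removed
    return SEPARATOR.join((primary, secondary, tertiary))
```

With these assumptions, collate("fr", "Clé") gives "cle !élc !Clé", the
value used as the example earlier in this comment.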

* Some thought should be given to languages with complex scripts. The primary
key should at least make it possible to extract a meaningful first character
for use when rendering categories: an initial Hangul syllable can be
decomposed to its initial jamo, and Chinese ideographs should be mapped to a
radical/stroke-count pair, so that the radical can be used as the first
"character". Unicode's Unihan database can be helpful here.
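
For the Hangul case, the initial jamo can be obtained arithmetically from the
code point of a precomposed syllable. The sketch below uses the standard
Unicode layout of the Hangul Syllables block (19 initials x 21 vowels x 28
finals, starting at U+AC00); the function name initial_jamo() is made up for
this example, and the Chinese radical mapping is left out because it needs
Unihan data.

```python
HANGUL_BASE = 0xAC00             # first precomposed syllable
HANGUL_COUNT = 19 * 21 * 28      # 11172 precomposed syllables
SYLLABLES_PER_INITIAL = 21 * 28  # 588 syllables share each initial consonant
INITIAL_JAMO_BASE = 0x1100       # first conjoining initial jamo


def initial_jamo(ch):
    """Return the initial jamo of a precomposed Hangul syllable.

    Any other character is returned unchanged.
    """
    offset = ord(ch) - HANGUL_BASE
    if 0 <= offset < HANGUL_COUNT:
        return chr(INITIAL_JAMO_BASE + offset // SYLLABLES_PER_INITIAL)
    return ch
```

A category listing could then group a title under initial_jamo(title[0])
instead of under the full first syllable.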

* For locales that use contractions, the "first" character in a collation
group may be a digram or trigram. Displaying the content of a category sorted
this way should be able to use that digram or trigram as the heading, instead
of just the first physical Unicode character. We could imagine a string
function that returns this "first letter" even when it is a digram or
trigram, with something like {{#collatefirst:locale|string}}. For example,
{{#collatefirst:br|Christian}} would return "ch", because "ch" is a single
letter in Breton: it is the first element of the primary collation group that
also contains "cH", "Ch" and "CH", and it sorts after "c" and before the
trigram "c'h", which is also a distinct letter and comes before "d". Such
cases (named contractions in UCA) are frequent and needed in Spanish and
Nordic languages.
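
A possible shape for the "first letter" function is a longest-match lookup in
a per-locale contraction table. In this sketch the table only holds the two
Breton contractions mentioned above; the table name, the function name
collate_first() and the lowercasing of the result are assumptions, and a real
table would come from the locale's collation rules.

```python
# Hypothetical contraction table, limited to the Breton examples above.
CONTRACTIONS = {
    "br": ("c'h", "ch"),
}


def collate_first(locale, string):
    """Return the first collation 'letter' of string, lowercased.

    The longest contraction matching the start of the string wins, so
    "c'h" is tried before "ch", and "ch" before the single letter "c".
    """
    text = string.lower().replace("\u2019", "'")  # typographic apostrophe
    candidates = sorted(CONTRACTIONS.get(locale, ()), key=len, reverse=True)
    for contraction in candidates:
        if text.startswith(contraction):
            return contraction
    return text[:1]
```

Here collate_first("br", "Christian") returns "ch", while the same title in a
locale without contractions returns "c".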

== STEP TWO ==

* Then, if articles are categorized without any DEFAULTSORT: key and without
a sort key parameter, use the same function to generate their collation key
automatically, but ONLY when displaying the category, using the project's
default locale or the locale specified within the target category. Don't
store the collation key with the article unless you are ready to run a server
update task that can recompute the collation keys of the pages and
subcategories categorized in it, because the locale of a category could
change over time, or could be set much later.

* Category pages would contain their own option to specify their preferred
collation, something like {{DEFAULTCOLLATE:languagecode}} in the text of the
category page, to change it from the project's default locale/language. This
additional magic keyword would only be meaningful for category pages, and is
distinct from the string function above.

* But this would not prevent pages from setting their own sort key if they
wish when they categorize themselves in such a category.
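
The display-time behaviour of step two could look roughly like the following
sketch. Everything here is hypothetical (the Member record, computed_key() as
a primary-level stand-in for the collation function, category_listing()); it
only shows that explicit sort keys are used as stored, while missing keys are
computed on the fly from the category's locale and never saved.

```python
import unicodedata
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    title: str
    sortkey: Optional[str] = None  # explicit key set by the page, if any


def computed_key(locale: str, title: str) -> str:
    """Stand-in for the collation function: accent-free lowercase primary
    key, then the title itself. This simplified version ignores locale."""
    decomposed = unicodedata.normalize("NFD", title)
    primary = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).lower()
    return primary + " !" + title


def category_listing(
    members: List[Member],
    category_locale: Optional[str] = None,
    project_locale: str = "en",
) -> List[str]:
    """Sort members when the category is displayed; nothing is stored."""
    locale = category_locale or project_locale

    def key(member: Member) -> str:
        if member.sortkey is not None:
            return member.sortkey  # the page's own key always wins
        return computed_key(locale, member.title)

    return [m.title for m in sorted(members, key=key)]
```

With members "Zèbre", "abeille" and "Étoile" and a page "Index" carrying the
explicit key "*", the listing comes out as Index, abeille, Étoile, Zèbre,
whereas plain binary ordering of the titles would put "Zèbre" before
"abeille" and "Étoile" last.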


I am convinced that there's no need to select collation locales according to
the user's own preferences. The locale should be a property of both the
project site and the target category (which should be specific to a given
reference language, so that it can itself specify its preferred locale). For
this reason, the collation key can be stored as it is today, when assigning
different sort keys to the same page within distinct categories.

Having the string function {{#collate:language|string}} would still allow
sorting some pages into specific groups (like "*" today in many wikis, for
special elements that require higher priority in categories with large enough
populations).

