[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Mon, 26 Jul 2010 12:49:58 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #194 from Aryeh Gregor <[email protected]> 2010-07-26 19:49:15 UTC ---
(In reply to comment #190)
> Yes Language::firstLetterforList(s) maps more or less to COLLATIONMAP, but
> COLLATIONMAP is a more generic concept which reflects what is defined in
> Unicode standard annexes, which speak about various mappings (including
> collation mappings, but also case mappings)

Which particular Unicode standards are relevant here?  It would probably be a
good idea for me to take a look at them.

> For some categories, it should be convenient also to be able to use longer
> substrings, containing more than one grapheme cluster (in Wiktionary for
> lists of terms belonging to a language, or in lists of people names, a
> category may need to be indexed and anchored with section headings
> containing the first 2 or 3 grapheme clusters, because the first grapheme is
> not discriminant enough and will remain identical in all columns of the
> displayed list on one page, and even sometimes on several or many successive
> pages: the first letter heading does not help, and is just unneeded visual
> pollution)
> 
> For other categories that have very few contents, too many headings are
> frequently added that also appear as pollution. Being able to suppress all of
> them, by specifying 0 grapheme clusters in that category will better help
> readers locate the wanted item.

This is all beyond the scope of what I'm currently doing, but it should be
possible to add on without too much trouble as a separate project.
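
As a rough illustration of the heading idea quoted above (not part of the
current work), here is a minimal Python sketch. It assumes a simplified notion
of a grapheme cluster (a base character plus any combining marks that follow
it) and uses plain case-insensitive ordering as a stand-in for a real
collation. The function names are hypothetical.

```python
import unicodedata
from itertools import groupby


def clusters(text):
    """Approximate grapheme clusters: a base character plus following combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the combining mark to the previous base character
        else:
            out.append(ch)
    return out


def heading(title, n):
    """Heading for a title: its first n clusters, uppercased (empty when n == 0)."""
    return "".join(clusters(title)[:n]).upper()


def group_by_heading(titles, n):
    """Sort titles case-insensitively (an assumption, not a real collation) and
    group consecutive titles that share the same heading."""
    ordered = sorted(titles, key=str.casefold)
    return [(h, list(g)) for h, g in groupby(ordered, key=lambda t: heading(t, n))]
```

With n = 2, "Smith" and "Smythe" share the heading "SM" while "Snow" falls
under "SN". With n = 0, every title lands under a single empty heading, which
amounts to suppressing the headings.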

> Why do I think that exposing the functions as parser functions will be
> useful? That's because it allows the implementation to be tested extensively
> on lots of cases, but only within a limited set of pages, long before the
> schema is developed, finalized and finally deployed.

This presupposes that the sortkey generation algorithm is settled upon before
the database changes are.  In fact, it's exactly the opposite: I've been
contracted to do the backend changes, and other people will figure out how
exactly to generate the sortkeys later.  Really, we could do the changes in
either order.

> Both functions will be deployable rapidly, even on wikis that won't want to
> apply the schema change (so they will continue to use a single collation order
> for ALL their categories, and will anyway be able to sort specific categories
> using another supplied locale matching the category name).
> 
> If you think about it, changing the SQL schema may be rejected in the end by
> lots of people.

The schema change will be part of the core software and will not be an optional
update.  Anyone who doesn't want to accept it will likely have to stick with
1.16 forever, because we're not going to support the old schema.

> Exposing the parser functions will provide a convenient alternative
> that can be deployed much more simply, and with MUCH LESS risk, using the
> existing facilities offered by [[category:...|sortkey]] and
> {{DEFAULTSORT:sortkey}}, except that their parameter will be computed using
> the exposed {{SORTKEY:}} function:
> 
>   {{DEFAULTSORT:{{SORTKEY:text|locale|level}}}}
> 
> or:
> 
>   [[category:...|{{SORTKEY:text|locale|level}}]]
> 
> both being generalizable through helper templates.

It's not particularly less risky.  It does encourage each wiki to hack around
the problem locally instead of pushing for a proper fix in the software.  You
shouldn't have to do DEFAULTSORT on any pages where the right sorting is
human-detectable -- it should only have to be for things like "Abraham Lincoln"
being sorted as "Lincoln, Abraham", which the software can't do automatically.
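
A minimal Python sketch of that division of labour, assuming a hypothetical
`overrides` table for editor-supplied keys and plain case-insensitive ordering
in place of a real collation:

```python
def sortkey(title, overrides):
    """Use the editor-supplied key when one exists; otherwise derive it from the title."""
    return overrides.get(title, title).casefold()


def sort_category(titles, overrides):
    """Sort page titles by their effective sortkey."""
    return sorted(titles, key=lambda t: sortkey(t, overrides))


# Only the case the software cannot work out by itself needs a manual entry.
overrides = {"Abraham Lincoln": "Lincoln, Abraham"}
pages = ["Zebra", "Abraham Lincoln", "Kansas", "Apple"]
result = sort_category(pages, overrides)
```

Here "Abraham Lincoln" sorts between "Kansas" and "Zebra" because of its
override, while the other pages need no manual key at all.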

> (Note also that section headings ("first letter") will have to be "translated"
> to correctly report the "first letter" of the Pinyin romanization, because the
> page names listed will continue to display their distinctive ideographs ! The
> only way to do that is to use the collation mapping exposed by
> {{COLLATIONMAP:}})

Surely the most sensible idea is just to disable the section headings
altogether for CJK?

> My opinion is that the same category should be sortable using different
> locales, and that's why they should support multiple sortkeys per indexed
> page, one for each specified locale. Some wikis will only sort on the
> {{CONTENTLANGUAGE}} by default, but the Chinese Wiktionary will benefit from
> sorting automatically all categories using at least the default "zh" locale
> which is an alias for "zh-hans", plus the "zh-hant" locale for traditional
> radical/stroke order, "zh-latn" for the Pinyin order, and "zh-bpmf" for the
> Bopomofo order.
> 
> The exact locale to which "zh" corresponds will be a user preference, but one
> will be able to navigate by clicking the automatically generated links that
> will allow them to specify the other collation orders supported specifically
> by the category or by default throughout the wiki project.

This is doable if it's desired, as long as the number of locales is very
limited (like four, not a hundred).  However, it will not be part of my initial
implementation unless Tim asks me to do it.
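
To make the cost concrete, here is a small Python sketch of storing one
sortkey per locale for each page. The locale names and key functions are
hypothetical stand-ins for real collators. Each page carries one key per
configured locale, so storage grows linearly with the number of locales.

```python
# Hypothetical stand-ins for real per-locale collators.
KEY_FUNCS = {
    "x-codepoint": lambda title: title,            # raw code point order
    "x-caseless": lambda title: title.casefold(),  # case-insensitive order
}


def build_sortkeys(title):
    """Compute one sortkey per configured locale for a single page."""
    return {locale: func(title) for locale, func in KEY_FUNCS.items()}


def sort_category(titles, locale):
    """Sort the same set of pages using the keys stored for the chosen locale."""
    rows = {title: build_sortkeys(title) for title in titles}
    return sorted(titles, key=lambda title: rows[title][locale])


pages = ["banana", "Cherry", "apple"]
by_codepoint = sort_category(pages, "x-codepoint")
by_caseless = sort_category(pages, "x-caseless")
```

The same category comes out in a different order depending on which stored
key is consulted, without recomputing anything at display time.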

> In the English Wiktionary or on Commons, that will only use the "en" default
> collation order (identical to {{CONTENTLANGUAGE}}), it will be possible to
> specify, for specific categories, an additional sort order when the category
> is directly related to a specific language.

This is certainly a useful feature for some wikis (especially Wiktionaries),
and it could be added fairly easily.  It might make it into my initial
implementation.

(In reply to comment #191)
> In all this discussion it appears that the development can be made in two
> separate projects developed independently.
> 
> You can continue to develop the SQL schema extension, provided that:
> 
> - you call a PHP function/method that will be developed separately in a
> "Collator" class, to compute the collation sortkeys for each locale and
> specified collation level.
> 
> - you call a PHP function/method that will be developed separately in a
> "Collator" class, to compute the collation mappings for each locale and
> specified collation level and maximum grapheme clusters, in order to generate
> the headings on the fly in the category list

This is what I'm doing.

> - you think about an HTTP query parameter that will allow changing the locale
> (this parameter exists, it's "uselang", but another one will be needed
> eventually to specify sort options like "-x-lc" or "-x-uc"; the "-x-"
> separator will be used implicitly, so the HTTP query may contain:
> &uselang=fr&opt=lc). These parameters will be used later when you'll be able
> to store multiple sortkeys.

These don't need to be developed until that feature is actually implemented,
which will probably not be in my initial implementation, unless Tim Starling
asks me to.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
