[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Mon, 26 Jul 2010 15:24:27 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #199 from Philippe Verdy <[email protected]> 2010-07-26 22:23:47 UTC ---
>> (Note also that section headings ("first letter") will have to be
>> "translated" to correctly report the "first letter" of the Pinyin
>> romanization, because the page names listed will continue to display their
>> distinctive ideographs! The only way to do that is to use the collation
>> mapping exposed by {{COLLATIONMAP:}})
>
>Surely the most sensible idea is just to disable the section headings
>altogether for CJK?

There's no need to do that.

The Collator instance returned by the factory for $locale="zh-latn" (which
sorts using Pinyin) just has to return an empty string from its map() method,
for as long as it is a stub that can't safely map ideographs to the Latin
initial consonants of the Pinyin syllable.
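
As a rough illustration of that stub behaviour, here is a minimal sketch. The
interface and class names (CollatorSketch, ZhLatnStubCollator) and the helper
sectionHeading() are hypothetical and only serve the example; it also assumes
that the heading is taken from the map() result and that an empty result means
"emit no heading" rather than disabling headings for the whole category.

```php
<?php
// Hypothetical interface: only the map() method discussed above.
interface CollatorSketch
{
    public function map(string $text): string;
}

// Stub for "zh-latn": it cannot map ideographs to Pinyin initials yet,
// so it always returns an empty string.
final class ZhLatnStubCollator implements CollatorSketch
{
    public function map(string $text): string
    {
        return '';
    }
}

// Hypothetical caller: an empty mapping means "no heading for this page".
function sectionHeading(CollatorSketch $collator, string $pageName): ?string
{
    $mapped = $collator->map($pageName);
    return $mapped === '' ? null : $mapped;
}
```

With such a stub the category listing can simply skip the heading whenever
sectionHeading() returns null, and start showing headings once map() is
really implemented.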

Note that the syntax of the Latin Pinyin mapping should be quite similar to the
syntax used in Minnan, as the Minnan syllables have more or less the same
structure as the Pinyin syllables. For ideographs that are not supported in
the Pinyin romanization, there will be no substitution, so they will sort at
the end and will not start with a Latin letter.

Here also, it is possible to infer a common heading for all of them, such as
their radical component, or just the first ideographic radical encoded in the
DUCET with a non-zero primary weight, or even the first stroke.

The first CJK stroke is:
31C0  ; [*10E0.0020.0002.31C0] # CJK STROKE T
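
For reading such DUCET lines programmatically, here is a minimal sketch. The
function name parseDucetLine() is hypothetical, and it only extracts the first
collation element of a line (code points, the "*" variable flag, and the first
three weights); the fourth field and any further collation elements are
ignored.

```php
<?php
// Parses one line in the format shown above, e.g.
// "31C0  ; [*10E0.0020.0002.31C0] # CJK STROKE T".
// Returns null if the line does not match.
function parseDucetLine(string $line): ?array
{
    $pattern = '/^\s*([0-9A-F]{4,6}(?:\s+[0-9A-F]{4,6})*)\s*;\s*'
             . '\[([.*])([0-9A-F]{4,6})\.([0-9A-F]{4,6})\.([0-9A-F]{4,6})/';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    return [
        'codepoints' => array_map('hexdec', preg_split('/\s+/', $m[1])),
        'variable'   => $m[2] === '*',   // "*" marks a variable element
        'primary'    => hexdec($m[3]),
        'secondary'  => hexdec($m[4]),
        'tertiary'   => hexdec($m[5]),
    ];
}

$entry = parseDucetLine('31C0  ; [*10E0.0020.0002.31C0] # CJK STROKE T');
```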

But I should look at the exact sequence in the "zh" tailoring of the DUCET, in
the CLDR database.

There's a demo page for "zh-hans" collation here (in the ICU project, which is
used by Unicode's CLDR project as a reference implementation):

http://demo.icu-project.org/icu-bin/locexp?d_=fr&_=zh_Hans

The interesting part is the data for "Rules", which orders the exemplar
sinograms, whereas the data for "Set" just shows them in numeric code point
order, or as ranges of characters with ascending numeric code point values.
But in both cases they just concentrate on the most common basic subset used in
GB2312; this is not the complete set defined in GB18030 and Unicode.

For actual transforms from Sinograms to Latin (Pinyin) there's this demo page:

http://demo.icu-project.org/icu-bin/translit

To see how the DUCET orders ideographs, look at:

http://unicode.org/charts/collation/

The first sinogram (non-stroke) defined with a non-zero primary weight in the
DUCET sequence is U+4E00 (一), and it seems that it provides a very convenient
heading for every sinogram that we can't sort or convert to Pinyin.

Note that in all cases all the sinograms are at the end of the DUCET, just
before the unsupported/reserved/unassigned characters.

The "zh-Hans" tailoring just moves all the other letters starting in the
"Variable" subset upward (in fact it just moves upward the letters starting at
Latin), to fit the sinograms before them.

The "zh-Hant" collation is like "zh-Hans" but swaps some positions, according
to their expected radical, but also because they differ in their stroke count.

Only about 31000 sinograms have known transcriptions to Pinyin (this number is
growing); all the others will then appear under the remaining heading group
starting with U+4E00 (一), except those in the "CJK-Extension blocks", which
should be listed under the heading U+3400 (㐀).

Most non-extension CJK characters now have a Pinyin transcription; most CJK
extension characters don't (but they are also the rarest characters in use, so
there should not be a lot of pages indexed there)...
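
The heading choice described above can be sketched as follows. The function
names and the $pinyinInitials lookup table are hypothetical, and the code point
ranges used for "non-extension" (U+4E00..U+9FFF) and "extension" (U+3400..U+4DBF
plus plane 2) sinograms are my assumption for the example, not something taken
from the CLDR data. Input is assumed to be non-empty, valid UTF-8.

```php
<?php
// Decodes the first code point of a non-empty, valid UTF-8 string.
function firstCodePoint(string $s): int
{
    $b0 = ord($s[0]);
    if ($b0 < 0x80) {
        return $b0;
    }
    if ($b0 < 0xE0) {
        return (($b0 & 0x1F) << 6) | (ord($s[1]) & 0x3F);
    }
    if ($b0 < 0xF0) {
        return (($b0 & 0x0F) << 12) | ((ord($s[1]) & 0x3F) << 6)
            | (ord($s[2]) & 0x3F);
    }
    return (($b0 & 0x07) << 18) | ((ord($s[1]) & 0x3F) << 12)
        | ((ord($s[2]) & 0x3F) << 6) | (ord($s[3]) & 0x3F);
}

// $pinyinInitials maps a sinogram to a Latin heading letter (hypothetical data).
function fallbackHeading(string $pageName, array $pinyinInitials): string
{
    preg_match('/^./us', $pageName, $m);   // first character as a string
    $first = $m[0];
    if (isset($pinyinInitials[$first])) {
        return $pinyinInitials[$first];    // known Pinyin transcription
    }
    $cp = firstCodePoint($first);
    if (($cp >= 0x3400 && $cp <= 0x4DBF) || ($cp >= 0x20000 && $cp <= 0x2FFFF)) {
        return "\u{3400}";                 // CJK extension blocks (assumed ranges)
    }
    if ($cp >= 0x4E00 && $cp <= 0x9FFF) {
        return "\u{4E00}";                 // other sinograms without Pinyin
    }
    return $first;                         // not a sinogram: use the character itself
}
```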

>> For example in people's names whose page name is "FirstNames LastName" but
>> that we want to sort as if they were "LastName, FirstNames" by indicating
>> only {{DEFAULTSORT:LastName !}} (it should not be needed to include the
>> FirstNames in the wiki text, as this sort hint will not be unique and the
>> group of pages using the same hint will still need to sort within this group
>> using their natural order). Even in that case, there's no need to bogusly
>> tack the cl_from field onto the stored field.
>
> How do you propose this be implemented?  We would need some character that
> sorts before all characters that can legitimately occur in a sort key.

This is exactly what UCA defines as a "tailoring". We have such a character
available that we don't use in pagenames and in sortkey prefixes: Control
characters.

Note that control characters are present in the DUCET, and they are NOT ALL
ignorable. The first of them is TAB (U+0009) and it is the first character that
we DON'T USE, and that has a primary weight in the DUCET and that is not
ignorable or null.

All the characters that have null weights are listed within the NULL group,
which is immediately followed by the ignorable characters.

TAB has a primary weight of 0201 in the DUCET (it comes far later, after all
the ignorables).

The first ignorable character has a primary weight of 0021, so the primary
weight 0001 is free to serve as the separator.

All we have to do is then to tailor the primary weight of TAB, assigning it the
weight 0001 instead of 0201. In that case, the source text given to the
collator's sortkey() method just has to use the TAB character between the two
fields.
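
To show the effect of a separator that sorts before everything else, here is a
minimal sketch. The helper buildSortText() is hypothetical, and plain
byte-wise strcmp() stands in for the comparison of real collation keys (an
assumption that works here only because TAB, byte 0x09, is lower than any
printable ASCII character, mimicking the tailored weight 0001).

```php
<?php
// Joins the sortkey prefix and the page name with a TAB, or uses the
// page name alone when there is no prefix.
function buildSortText(string $sortkeyPrefix, string $pageName): string
{
    if ($sortkeyPrefix !== '') {
        return $sortkeyPrefix . "\t" . $pageName;
    }
    return $pageName;
}

$texts = [
    buildSortText('Smithson', 'Bob Smithson'),
    buildSortText('Smith', 'Zoe Smith'),
    buildSortText('Smith', 'Alice Smith'),
    buildSortText('', 'Smith'),
];

// Byte-wise comparison as a stand-in for comparing collation keys.
usort($texts, 'strcmp');
```

All pages sharing the prefix "Smith" stay grouped together and are ordered by
page name within the group, ahead of "Smithson", because the separator compares
lower than the "s" that continues the longer prefix.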

If there's no sortkey prefix, don't generate the TAB; use the pagename
directly:

<?php

require('.../CollatorFactory.php'); // choose the script for the implementation
// assuming CollatorFactory is a class provided by that script
$wgCollatorFactory = new CollatorFactory();

// ...
$collator = $wgCollatorFactory->get($locale, $level);

// ...
if ($sortkeyprefix != '')
  $text = $sortkeyprefix . "\t" . $pagename; // "." concatenates; "\t" is a real TAB
else
  $text = $pagename;
$sortkey = $collator->sortkey($text);

// optional: convert VARBINARY(N) to VARCHAR(N) (depends on SQL backend)
$sortkey = varbinaryToVarchar($sortkey); // e.g. Base-32

// you may also need to truncate to N characters,
// to fit the SQL sortable field maximum length constraint
// (according to the schema for this SQL backend)...
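
For the optional VARBINARY-to-VARCHAR step, one possible sketch is an unpadded
Base-32 encoding using the "base32hex" alphabet of RFC 4648 (0-9 then A-V),
whose characters are in ascending ASCII order, so that byte-wise comparison of
the encoded strings gives the same order as comparison of the binary keys. The
function name base32hexEncode() is hypothetical; it is only one way
varbinaryToVarchar() could be written.

```php
<?php
// Order-preserving, unpadded Base-32 encoding (RFC 4648 base32hex alphabet).
function base32hexEncode(string $bytes): string
{
    $alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUV';
    $out = '';
    $buffer = 0;  // bit accumulator
    $bits = 0;    // number of pending bits in the accumulator
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $buffer = ($buffer << 8) | ord($bytes[$i]);
        $bits += 8;
        while ($bits >= 5) {
            $bits -= 5;
            $out .= $alphabet[($buffer >> $bits) & 0x1F];
        }
        $buffer &= (1 << $bits) - 1;  // keep only the leftover bits
    }
    if ($bits > 0) {
        // pad the last group with zero bits on the right
        $out .= $alphabet[($buffer << (5 - $bits)) & 0x1F];
    }
    return $out;
}
```

Each input byte becomes 1.6 characters, so the length limit N of the VARCHAR
field has to be budgeted accordingly before truncating.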

