[Bug 30675] Use allkeys_CLDR.txt - the CLDR tailored DUCET instead of allkeys.txt

bugzilla-daemon Wed, 31 Aug 2011 22:55:13 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=30675


Philippe Verdy <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #2 from Philippe Verdy <[email protected]> 2011-09-01 05:55:04 UTC ---
I also agree, simply because the most recent version of CLDR no longer uses
the DUCET in its root locale. It now defines an alternate collation, initially
derived from the DUCET but with many rearrangements, notably among the variable
collation elements: these are now subgrouped more logically, while preserving
within each subgroup the relative order they had in the DUCET.

The CLDR still allows using the default DUCET, but the standard DUCET is now
considered a tailoring of the new CLDR root version.

The differences mainly concern non-letters, but there are a few other
rearrangements (notably within letter-like symbols, currency symbols, and some
format controls) that also make language-specific tailorings easier to define.
These include definitions that simplify the relative reordering of distinct
scripts within a language that is written using multiple scripts.

The CLDR version of the DUCET is therefore much better, as it requires much
less maintenance work for each language-specific tailoring.

To make this work, the CLDR version of the DUCET used in the root locale adds
pseudo collation elements. These are not based on standard characters; they
serve only as markers separating subgroups of collation elements, and the CLDR
version defines specific primary collation weights for them.

The CLDR version also defines new pseudo collation elements usable as
separators for sorting rows of data structured in separate fields, so that all
fields are first compared in parallel at the primary level, before all fields
are compared at the next level (that's something you can't do simply by using
a stable sort that starts with the fields of lower importance and ends with
the most important field).
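
To illustrate the multi-field point, here is a minimal Python sketch. It does
not use the CLDR data or the real separator element at all: the primary() and
secondary() helpers are hypothetical toy weights (base letters versus accents)
that I made up for the example, and the "merged" key only imitates the effect
of comparing every field at the primary level before any field at the next
level.

```python
import unicodedata

def primary(s):
    # Toy primary weight (assumption): base letters only, case and accents ignored.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def secondary(s):
    # Toy secondary weight (assumption): the combining marks, in order.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.combining(c))

def per_field_key(row):
    # Field 1 is compared at all levels before field 2 is looked at.
    return tuple((primary(f), secondary(f)) for f in row)

def merged_key(row):
    # All fields at the primary level first, then all fields at the secondary level.
    return (tuple(primary(f) for f in row), tuple(secondary(f) for f in row))

rows = [("cote", "b"), ("côte", "a")]

# Stable sort from the least important field to the most important one.
by_field_2 = sorted(rows, key=lambda r: (primary(r[1]), secondary(r[1])))
stable = sorted(by_field_2, key=lambda r: (primary(r[0]), secondary(r[0])))

field_by_field = sorted(rows, key=per_field_key)
merged = sorted(rows, key=merged_key)
```

With these two rows, the stable sort gives the same order as the
field-by-field key: the accent in the first field decides before the second
field is considered. The merged key instead lets the second field decide at
the primary level, so the two rows come out in the opposite order.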

For MediaWiki itself, if it uses ICU on the server side, there's nothing to
change except upgrading ICU.

But if MediaWiki uses its own code, it may not be able to process the
pseudo-collation elements defined as markers between subgroups of collation
elements (notably between whitespace, symbols and punctuation within the
variable elements, then at the start of the group of numbers, then within the
group of letters, which is now split by script, each script having its own
marker). These markers are only needed to define tailorings, so as long as
this specific code cannot instantiate the language-specific tailorings, the
pseudo-markers may simply be skipped (ignored). You can easily detect them
because they are either defined using a specific syntax between [square
brackets with a marker type followed by a value], such as "[script Arab]", or
remapped to non-character code points. The latter are NOT encoded as Unicode
characters, but are displayed using an escape syntax such as \uFFFF in the
parsable text formats used by the CLDR data. This syntax is not visible in
the new binary format now documented in the UCA specification and, more
precisely, in the LDML specifications used by the CLDR.
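
As a rough sketch of how such markers could be skipped, the Python below
filters them out of a list of lines. The line shapes are an assumption made
for illustration (a bracketed "[type value]" marker on its own line, or hex
code points followed by ";"), not the exact grammar of the CLDR files, and the
weights in the sample are made up. The names is_noncharacter(),
is_pseudo_element() and real_entries() are hypothetical helpers, not part of
MediaWiki or ICU.

```python
import re

# Assumed marker shape: "[type value]" at the start of the line, e.g. "[script Arab]".
MARKER_RE = re.compile(r"^\[\s*\w+\s+[^\]]*\]")

def is_noncharacter(cp):
    # Unicode noncharacters: U+FDD0..U+FDEF, plus the last two code points of each plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_pseudo_element(line):
    line = line.strip()
    if MARKER_RE.match(line):
        return True
    # Otherwise assume hex code points before the first ';'.
    head = line.split(";", 1)[0].split()
    try:
        code_points = [int(h, 16) for h in head]
    except ValueError:
        return False  # not a code point entry; leave it to the caller
    return any(is_noncharacter(cp) for cp in code_points)

def real_entries(lines):
    # Yield only the entries that map real characters to collation elements.
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if is_pseudo_element(stripped):
            continue
        yield stripped

# Illustrative input only; the weights are invented for the example.
sample = [
    "# comment",
    "[script Arab]",
    "FFFF ; [.FFFF.0020.0002]",
    "0041 ; [.15EF.0020.0008]",
]
kept = list(real_entries(sample))
```

Here only the entry for U+0041 is kept; the bracketed marker and the entry
mapped to the non-character U+FFFF are both skipped.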

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
