https://bugzilla.wikimedia.org/show_bug.cgi?id=30675
Philippe Verdy <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #2 from Philippe Verdy <[email protected]> 2011-09-01 05:55:04 UTC ---
I also agree, simply because the most recent version of CLDR no longer uses the DUCET in its root locale. It now defines an alternate collation, initially derived from the DUCET but with many rearrangements, notably in the variable collation elements. These are now subgrouped more logically, while each subgroup preserves the relative order its elements had in the DUCET. The CLDR still allows using the default DUCET, but the standard DUCET is now considered a tailoring of the new CLDR root collation.

The differences mainly concern non-letters, but there are a few other rearrangements (notably within letter-like symbols, currency symbols, and some format controls). These also make language-specific tailorings easier to define, including the relative reordering of distinct scripts for a language that is written in multiple scripts. The CLDR version of the DUCET is therefore much better, as it requires much less maintenance work for each language-specific tailoring.

To make this work, the CLDR version of the DUCET used in the root locale adds pseudo collation elements. They are not based on standard characters; they serve only as markers separating subgroups of collation elements, and the CLDR assigns them specific primary collation weights.

The CLDR version also defines new pseudo collation elements usable as separators for sorting rows of data structured in separate fields. With them, all fields are first compared in parallel at the primary level, and only then are all fields compared at the next level. That is something you cannot do simply by running a stable sort on each field in turn, from the least important field up to the most important one.
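To illustrate the multi-field point, here is a minimal Python sketch. It does not use the real CLDR weights or separators: the `primary` and `secondary` functions are toy stand-ins I am assuming for this example (primary ignores accents and case, secondary sees accents), and `record_key` and `fieldwise_key` are hypothetical names. It only shows why comparing all fields at the primary level first gives a different order than comparing field by field.

```python
import unicodedata

def primary(s):
    # Toy primary key (assumption): base letters only, accents and case ignored.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def secondary(s):
    # Toy secondary key (assumption): the decomposed string, so accents count.
    return unicodedata.normalize("NFD", s).casefold()

def record_key(fields):
    # Level by level: every field at primary level, then every field at secondary.
    return (tuple(primary(f) for f in fields),
            tuple(secondary(f) for f in fields))

def fieldwise_key(fields):
    # Field by field: each field is fully compared before the next one is looked at.
    return tuple((primary(f), secondary(f)) for f in fields)

rows = [("resume", "b"), ("résumé", "a")]

# Level by level, the first fields tie at primary level, so the second
# field ("a" before "b") decides and the accented row comes first.
by_level = sorted(rows, key=record_key)

# Field by field, the accent difference in the first field decides
# before the second field is ever consulted.
by_field = sorted(rows, key=fieldwise_key)

print(by_level)
print(by_field)
```

A stable sort applied to one field at a time behaves like `fieldwise_key`, which is why it cannot reproduce the level-by-level order.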
For MediaWiki itself, there is nothing to change if it uses ICU on the server side, apart from upgrading it. But if MediaWiki uses its own code, it may not be able to process the pseudo collation elements defined as markers between subgroups of collation elements: between whitespace, symbols, and punctuation within the variable elements, then at the start of the group of numbers, and then in the group of letters, which is now split by script, each script with its own marker.

These markers are only needed to define tailorings. As long as this specific code cannot instantiate the language-specific tailorings, the pseudo-markers may simply be skipped (ignored). You can easily detect them: either they are defined with a specific syntax in square brackets, a marker type followed by a value, such as "[script Arab]", or they are remapped to noncharacter code points. In the latter case they are not encoded as Unicode characters, but are written with an escape syntax such as \uFFFF in the parsable text formats used by the CLDR data. This syntax is not visible in the new binary format now documented in the UCA specification and, more precisely, in the LDML specification used by the CLDR.
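As a rough sketch of the "skip the markers" approach, the Python below filters a list of already-tokenized entries. The input format is a simplification I am assuming for illustration, not the actual CLDR file layout, and `strip_pseudo_elements` is a hypothetical helper name. The noncharacter test itself follows the Unicode definition (U+FDD0..U+FDEF, plus the last two code points of every plane).

```python
import re

# A bracketed marker: a marker type, whitespace, then a value, e.g. "[script Arab]".
MARKER = re.compile(r"\[[A-Za-z]+\s+[^\]\s]+\]")

def is_noncharacter(cp):
    # Unicode noncharacters: U+FDD0..U+FDEF and U+nFFFE / U+nFFFF in each plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def strip_pseudo_elements(entries):
    """Drop entries that are only markers; keep real collation elements."""
    kept = []
    for entry in entries:
        if MARKER.fullmatch(entry):
            continue  # bracketed marker such as "[script Arab]"
        if entry and all(is_noncharacter(ord(ch)) for ch in entry):
            continue  # marker remapped to noncharacter code points
        kept.append(entry)
    return kept

# Hypothetical token list, for illustration only.
tokens = ["[script Arab]", "a", "\uFFFF", "b", "\uFDD0"]
print(strip_pseudo_elements(tokens))
```

Code that later needs the tailorings would have to interpret these markers rather than drop them.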
