https://bugzilla.wikimedia.org/show_bug.cgi?id=30287
Philippe Verdy <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #12 from Philippe Verdy <[email protected]> 2011-09-18 21:04:12 UTC ---
The collation table for Persian (a.k.a. "Farsi") is documented in the MimerSQL
documentation for developers at
http://developer.mimer.com/charts/persian.htm
which gives this rule:

CREATE COLLATION persian FROM eor USING '[Arabic]'
  '&#064E#<<#0650#<<#064F#<<#064B#<<#064D#<<#064C#'
  '&#0621#<#0622#'
  '&#0627#<<#0671#<#0621#<<#0623#<<#0672#<<#0625#'
  ' <<#0673#<<#0624#<<#06CC##0654#<<<#0649##0654#<<<#0626#'
  '&#06A9#<<#06AA#<<#06AB#<<#0643#<<#06AC#<<#06AD#<<#06AE#'
  '&#06CF#<#0647#<<#06D5#<<#06C1#<<#0629#<<#06C3#<<#06C0#<<#06BE#'
  '&#06CC#<<#0649#<<#06D2#<<#064A#<<#06D0#<<#06D1#<<#06CD#<<#06CE#'

where "eor" is the base collation for the standard "European Ordering Rules"
(defined as both an ISO standard and a CEN standard), on which most other
collation orders are based, with very small tailorings. A few other settings
need specific adjustments, which are indicated by the "[Arabic]" tailoring
attribute: it reorders all Arabic blocks before the letters of all other
scripts (but still after the ignorables, whitespace, variables, common length
marks, common currency symbols, and common digits). The rule above adds a
specific reordering of a few other letters (look at the collation chart).
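
To make a rule like the Persian one easier to read, its fragments can be split into (level, characters) pairs. The following is only a minimal sketch: the reading of the syntax (`&` as a reset, `<`, `<<` and `<<<` as primary, secondary and tertiary differences, `#hhhh#` as a hexadecimal code point) is an assumption inferred from the rules themselves, and `parse_rule` and `decode` are hypothetical helper names, not part of any Mimer tool.

```python
import re

# Assumed reading of the rule syntax, inferred from the quoted rules
# rather than taken from the Mimer documentation:
#   &        reset to the sequence that follows
#   <        primary difference; << secondary; <<< tertiary
#   #hhhh#   one code point in hexadecimal; adjacent escapes form a sequence
TOKEN = re.compile(r"(&|<<<|<<|<)\s*((?:#[0-9A-Fa-f]{4,6}#)+)")
LEVELS = {"&": "reset", "<": "primary", "<<": "secondary", "<<<": "tertiary"}


def decode(escapes):
    """Turn '#06CC##0654#' into the two-character string it denotes."""
    return "".join(chr(int(h, 16)) for h in re.findall(r"#([0-9A-Fa-f]+)#", escapes))


def parse_rule(rule):
    """Return (level, sequence) pairs in the order they appear in the rule."""
    return [(LEVELS[op], decode(seq)) for op, seq in TOKEN.findall(rule)]


if __name__ == "__main__":
    # A rule fragment written in the assumed form.
    for level, seq in parse_rule("&#064E#<<#0650#<<#064F#"):
        print(level, " ".join(f"U+{ord(c):04X}" for c in seq))
```

Under this reading, every operator before the next reset positions its sequence relative to the entry just before it, which is how a whole chain of vowel marks can be declared as secondary variants in one fragment.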

Yes, this is different from the standard collation for the Arabic language,
which is a bit simpler (and only adjusts secondary differences):

CREATE COLLATION arabic FROM eor USING '[Arabic]'
  '&#0627#<<#0622#<<#0627#<<#0621#<<#0623#<<#0625#<<#0624#<<#0626#'
  '&#064A#<<#0649#'

and it is also different from the Urdu collation, which is a bit more complex:

CREATE COLLATION urdu FROM eor USING '[Arabic]'
  '&#064B#<<#0652#<<#064E#<<#0650#<<#064F#<<#0670#<<#0656#<<#0657#'
  ' <<#064B#<<#064D#<<#064C#<<#0654#<<#0651#<<#0658#<<#0653#'
  '&#0627#<<#0623#<#0622#'
  '&#0648#<<#0624#'
  '&#06CF#<#06C1#<<#0647#<#06BE#<#06C3#<<#0629#<#0621#'
  '&#06CC#<<#0649#<<#064A#<<#0626#'
  '&#0628#<#0628##06BE#'
  '&#067E#<#067E##06BE#'
  '&#062A#<#062A##06BE#'
  '&#0679#<#0679##06BE#'
  '&#062C#<#062C##06BE#'
  '&#0686#<#0686##06BE#'
  '&#062F#<#062F##06BE#'
  '&#0688#<#0688##06BE#'
  '&#0631#<#0631##06BE#'
  '&#0691#<#0691##06BE#'
  '&#06A9#<#06A9##06BE#'
  '&#06AF#<#06AF##06BE#'
  '&#0644#<#0644##06BE#'
  '&#0645#<#0645##06BE#'
  '&#0646#<#0646##06BE#'
  '&#06BA#<#06BA##06BE#'
  '&#0648#<#0648##06BE#'
  '&#06CC#<#06CC##06BE#';

MimerSQL has defined these rules using EOR as the base collation; the CLDR
project was initially based on the DUCET collation, but now uses a different
base collation (a modified DUCET), which is closer to the standard EOR (but
still different).

Note that MimerSQL, like MySQL, the default Java runtime library and the
.NET CLR library, still does not support the newer syntax for contextual rules
or for reordering script blocks, which for now is supported only by the most
recent version of ICU; it also lacks support for the newer attributes.

The DUCET will soon be changed to become closer to the CLDR version made for
ICU, but the modified DUCET in the CLDR also does not use any contextual rules
(for compatibility with lots of other implementations of the UCA). For this
reason, some scripts will still not sort as expected using only the CLDR rules,
without the extended syntax. For example, in the Devanagari script, see the
final vowelless consonant clusters at the end of syllables. This is even more
critical for Lao, which requires a very complex syllabification that cannot be
represented by a collation table, but only by a specific [Lao] attribute
triggering its own syllabification by code and sometimes dictionary lookups.
The same case occurs with the collations for the Thai and Khmer languages, but
in a less critical way.

So don't assume that any single DUCET (or modified DUCET from CLDR, or even the
EOR collation table) will make things correct for all languages. We still need
tailorings on top of any base collation, for almost all languages in all
scripts!
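
The need for a tailoring on top of a base table can be illustrated with a toy model. This is only a sketch under loud assumptions: plain code point order stands in for a real base table such as DUCET or EOR, the weights are invented for the example, and `sort_key` and `ARABIC_LIKE` are hypothetical names. The single tailoring entry follows the spirit of the Arabic rule that makes U+0649 a secondary variant of U+064A.

```python
# Toy model: code point order stands in for a real base table such as
# DUCET or EOR (an assumption made only to keep the example small).
def sort_key(word, tailoring=None):
    """Two-level key: all primary weights first, then all secondary weights."""
    tailoring = tailoring or {}
    weights = [tailoring.get(ch, (ord(ch), 0)) for ch in word]
    primaries = tuple(p for p, _ in weights)
    secondaries = tuple(s for _, s in weights)
    return primaries, secondaries


# Hypothetical tailoring: U+0649 gets the primary weight of U+064A and a
# larger secondary weight, so it only differs from U+064A at the second level.
ARABIC_LIKE = {"\u0649": (ord("\u064A"), 1)}

words = ["\u064A\u0627", "\u0649\u0628", "\u064A", "\u0649"]

base_order = sorted(words, key=sort_key)
tailored_order = sorted(words, key=lambda w: sort_key(w, ARABIC_LIKE))

if __name__ == "__main__":
    for name, order in (("base", base_order), ("tailored", tailored_order)):
        print(name, [" ".join(f"U+{ord(c):04X}" for c in w) for w in order])
```

With the base key alone, every word starting with U+0649 sorts before every word starting with U+064A. With the tailoring, the two letters interleave according to the letters that follow them, and the variant only breaks ties, which is the kind of language-specific behaviour a single shared table cannot provide.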
