[Bug 30287] bug on sorting Persian characters

bugzilla-daemon Sun, 18 Sep 2011 14:04:19 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=30287


Philippe Verdy <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #12 from Philippe Verdy <[email protected]> 2011-09-18 21:04:12 UTC ---
The collation table for Persian (a.k.a. "Farsi") is described in the MimerSQL
documentation for developers at
http://developer.mimer.com/charts/persian.htm
which defines it with this rule:

CREATE COLLATION persian FROM eor USING
'[Arabic]'
'&#064E#<<#0650#<<#064F#<<#064B#<<#064D#<<#064C#'
'&#0621#<#0622#'
'&#0627#<<#0671#<#0621#<<#0623#<<#0672#<<#0625#'
'       <<#0673#<<#0624#<<#06CC##0654#<<<#0649##0654#<<<#0626#'
'&#06A9#<<#06AA#<<#06AB#<<#0643#<<#06AC#<<#06AD#<<#06AE#'
'&#06CF#<#0647#<<#06D5#<<#06C1#<<#0629#<<#06C3#<<#06C0#<<#06BE#'
'&#06CC#<<#0649#<<#06D2#<<#064A#<<#06D0#<<#06D1#<<#06CD#<<#06CE#'

where "eor" is the base collation used for the standard "European Ordering
Rules" (defined as both an ISO standard and a CEN standard), on which most
other collation orders are based, with very small tailorings. It has a few
other settings that require specific adjustments, indicated by the "[Arabic]"
tailoring attribute, which has the effect of reordering all Arabic blocks
before all letters of other scripts (but still after the ignorables,
whitespaces, variables, common length marks, common currency symbols, and
common digits). The rule above adds specific reordering of a few other letters
(look at the collation chart).
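
To make the rule strings above easier to read, here is a minimal Python
sketch that splits one of them into (relation, text) pairs. It assumes, from
the look of the rules rather than from any Mimer documentation, that "&"
resets to an existing position, that "<", "<<" and "<<<" mark primary,
secondary and tertiary differences, and that each "#XXXX#" is one hexadecimal
code point, with adjacent ones forming a sequence. The names parse_rule and
decode are made up for this illustration.

```python
import re

# Assumed token shapes (inferred from the rules quoted above):
#   '&'               reset to an existing position
#   '<', '<<', '<<<'  primary, secondary, tertiary difference
#   '#XXXX#'          one hex code point; adjacent ones form a sequence
TOKEN = re.compile(r"\s*(&|<{1,3}|(?:#[0-9A-Fa-f]{4,6}#)+)")
STRENGTH = {"&": "reset", "<": "primary", "<<": "secondary", "<<<": "tertiary"}


def decode(token):
    """Turn '#06CC##0654#' into the two-character string U+06CC U+0654."""
    return "".join(chr(int(h, 16)) for h in re.findall(r"#([0-9A-Fa-f]+)#", token))


def parse_rule(rule):
    """Return a list of (relation, text) pairs for one rule string."""
    rule = rule.rstrip()
    pairs, pos, relation = [], 0, None
    while pos < len(rule):
        m = TOKEN.match(rule, pos)
        if m is None:
            raise ValueError(f"unexpected text at offset {pos}: {rule[pos:]!r}")
        tok, pos = m.group(1), m.end()
        if tok in STRENGTH:
            relation = STRENGTH[tok]
        elif relation is None:
            raise ValueError(f"code points without an operator: {tok!r}")
        else:
            pairs.append((relation, decode(tok)))
            relation = None
    return pairs


# One line of the Persian rule: Arabic kaf variants after KEHEH (U+06A9).
for relation, text in parse_rule("&#06A9#<<#06AA#<<#06AB#<<#0643#"):
    print(relation, " ".join(f"U+{ord(c):04X}" for c in text))
```

A continuation line such as the one starting with "<<#0673#" parses the same
way; it simply carries on from the previous reset position.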

Yes, this is different from the standard collation for the Arabic language,
which is a bit simpler (and only adjusts secondary differences):

CREATE COLLATION arabic FROM eor USING
'[Arabic]'
'&#0627#<<#0622#<<#0627#<<#0621#<<#0623#<<#0625#<<#0624#<<#0626#'
'&#064A#<<#0649#'

and it is also different from the Urdu collation which is a bit more complex:

CREATE COLLATION urdu FROM eor USING
'[Arabic]'
'&#064B#<<#0652#<<#064E#<<#0650#<<#064F#<<#0670#<<#0656#<<#0657#'
'       <<#064B#<<#064D#<<#064C#<<#0654#<<#0651#<<#0658#<<#0653#'
'&#0627#<<#0623#<#0622#'
'&#0648#<<#0624#'
'&#06CF#<#06C1#<<#0647#<#06BE#<#06C3#<<#0629#<#0621#'
'&#06CC#<<#0649#<<#064A#<<#0626#'
'&#0628#<#0628##06BE#'
'&#067E#<#067E##06BE#'
'&#062A#<#062A##06BE#'
'&#0679#<#0679##06BE#'
'&#062C#<#062C##06BE#'
'&#0686#<#0686##06BE#'
'&#062F#<#062F##06BE#'
'&#0688#<#0688##06BE#'
'&#0631#<#0631##06BE#'
'&#0691#<#0691##06BE#'
'&#06A9#<#06A9##06BE#'
'&#06AF#<#06AF##06BE#'
'&#0644#<#0644##06BE#'
'&#0645#<#0645##06BE#'
'&#0646#<#0646##06BE#'
'&#06BA#<#06BA##06BE#'
'&#0648#<#0648##06BE#'
'&#06CC#<#06CC##06BE#';
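
The difference between "<" and "<<" in the rules above can be illustrated
with a toy two-level sort key in Python. This is only a sketch: the weights
are hypothetical numbers chosen for the example (a real collator takes them
from its base table), and it models just the Persian line
'&#06A9#<<...<<#0643#', where ARABIC KAF differs from KEHEH only at the
secondary level.

```python
# Hypothetical (primary, secondary) weights, for illustration only.
WEIGHTS = {
    "\u0628": (1, 0),  # BEH
    "\u06A9": (2, 0),  # KEHEH (Persian kaf), the reset position
    "\u0643": (2, 1),  # ARABIC KAF: same primary as KEHEH, secondary differs
    "\u0644": (3, 0),  # LAM
}


def sort_key(word):
    """Compare all primary weights first; secondary weights only break ties."""
    primary = tuple(WEIGHTS[c][0] for c in word)
    secondary = tuple(WEIGHTS[c][1] for c in word)
    return (primary, secondary)


kaf_lam = "\u0643\u0644"    # ARABIC KAF + LAM
keheh_beh = "\u06A9\u0628"  # KEHEH + BEH
keheh_lam = "\u06A9\u0644"  # KEHEH + LAM

# Plain code point order puts every ARABIC KAF word before every KEHEH word.
print(sorted([kaf_lam, keheh_beh]) == [kaf_lam, keheh_beh])
# The two-level key interleaves them: BEH < LAM decides, not the kaf variant.
print(sorted([kaf_lam, keheh_beh], key=sort_key) == [keheh_beh, kaf_lam])
# The kaf variant matters only when everything else is equal.
print(sorted([kaf_lam, keheh_lam], key=sort_key) == [keheh_lam, kaf_lam])
```

With a primary difference ("<") instead, the two kaf forms would behave as
separate letters and their words would no longer interleave.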

MimerSQL has defined these rules using EOR as the base collation; the CLDR
project was initially based on the DUCET collation, but is now using a
different base collation (a modified DUCET), which is closer to the standard
EOR (but still different).

Note that MimerSQL, like MySQL, the default Java runtime library, and the
.NET CLR library, still does not support the newer syntax for contextual
rules and for reordering script blocks, which for now is supported only by
the most recent version of ICU; it also lacks support for the newer
attributes.

The DUCET will soon be changed to become closer to the CLDR version made for
ICU, but the modified DUCET in the CLDR also does not use any contextual rules
(for compatibility with lots of other implementations of the UCA). For this
reason, some scripts will still not sort as expected using only the CLDR rules,
without the extended syntax. One example is the Devanagari script: see the
final vowelless consonant clusters at the end of syllables.

This is even more critical for Lao, which requires a very complex
syllabification that cannot be represented by a collation table, but only as a
specific [Lao] attribute triggering its specific syllabification by code and
sometimes dictionary lookups. The same issue also occurs with the collations
for the Thai and Khmer languages, but in a less critical way.

So don't assume that any single DUCET (or modified DUCET from CLDR, or even the
EOR collation table) will make things correct for all languages. We still need
tailorings on top of any base collation, for almost all languages in all
scripts!
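
As a small illustration of a tailoring applied on top of a base order, the
Python sketch below takes the primary step '&#06CF#<#0647#' from the Persian
rule quoted earlier, which moves HEH after the waw letters. The base order
used here is an assumption made for the example (a five-letter excerpt with
HEH before WAW), not a copy of EOR or the DUCET, and apply_primary and
key_for are hypothetical helper names.

```python
def apply_primary(order, anchor, moved):
    """Return a new order with `moved` placed immediately after `anchor`."""
    result = [c for c in order if c != moved]
    result.insert(result.index(anchor) + 1, moved)
    return result


def key_for(order):
    """Build a sort key function that ranks letters by their position in `order`."""
    rank = {c: i for i, c in enumerate(order)}
    return lambda word: [rank[c] for c in word]


# Assumed base excerpt: NOON, HEH, WAW, WAW WITH DOT ABOVE, FARSI YEH.
BASE = ["\u0646", "\u0647", "\u0648", "\u06CF", "\u06CC"]
# Tailoring taken from the Persian rule: & U+06CF < U+0647.
PERSIAN = apply_primary(BASE, "\u06CF", "\u0647")

words = ["\u0647\u0646", "\u0648\u0646"]  # HEH+NOON, WAW+NOON

print(sorted(words, key=key_for(BASE)) == ["\u0647\u0646", "\u0648\u0646"])
print(sorted(words, key=key_for(PERSIAN)) == ["\u0648\u0646", "\u0647\u0646"])
```

The same word list sorts differently under the base order and under the
tailored order, which is why one shared table cannot serve both languages.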

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
You are on the CC list for the bug.

