On Thu, May 14, 2009 at 7:38 AM, Domas Mituzas <[email protected]> wrote:
> [13:36:06]      GerardM-        so how do we currently deal with the languages
> from India where the order of Unicode is almost certainly to be wrong
> [13:36:17]      domas   well, currently we're using byte order
> [13:36:24]      domas   it is not any kind of unicode order
> [13:36:35]      GerardM-        so there is no proper sorting
> [13:36:36]      domas   as utf8 is variable length, offsets of character
> starts are different

Well, a binary sort of UTF-8 is code point order.  The lead byte of a
one-byte character starts with the bit 0, of a two-byte character with
110, of a three-byte character with 1110, and of a four-byte character
with 11110, so lead bytes always sort as 1-byte < 2-byte < 3-byte <
4-byte, and the variable length makes no difference.  But code point
order isn't very good: even in English, Z < a, let alone languages
with diacritics or whatnot.
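
To make that concrete, here is a minimal sketch in Python (not the
MediaWiki code; the helper names and the sample words are made up for
illustration).  It sorts the same list once by raw UTF-8 bytes and once
by code point, and the two orders should come out identical:

```python
def utf8_sort(words):
    """Sort strings by their raw UTF-8 bytes (a plain binary sort)."""
    return sorted(words, key=lambda w: w.encode("utf-8"))


def code_point_sort(words):
    """Sort strings by Unicode code point (Python compares str this way)."""
    return sorted(words)


# Hypothetical sample: ASCII, a Latin-1 letter, CJK, and a 4-byte emoji.
words = ["zebra", "Apple", "apple", "Zulu", "éclair", "日本", "😀", "eclair"]

by_bytes = utf8_sort(words)
by_code_points = code_point_sort(words)

# The byte order and the code point order agree.
print(by_bytes == by_code_points)

# Uppercase sorts before all lowercase, and "éclair" lands after "zebra".
print(by_bytes)
```

So the byte sort is consistent, just not what a reader expects: "Zulu"
comes before "apple", and "éclair" ends up nowhere near "eclair".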

An interesting discussion, anyway.

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
