On Thu, May 14, 2009 at 7:38 AM, Domas Mituzas <[email protected]> wrote:
> [13:36:06] GerardM- so how do we currently deal with the languages
> from India where the order of Unicode is almost certainly to be wrong
> [13:36:17] domas well, currently we're using byte order
> [13:36:24] domas it is not any kind of unicode order
> [13:36:35] GerardM- so there is no proper sorting
> [13:36:36] domas as utf8 is variable length, offsets of character
> starts are different

Well, a binary sort of UTF-8 is code point order. One-byte characters start with the bit 0, two-byte characters with 110, three-byte characters with 1110, and four-byte characters with 11110, so they always sort as 1-byte < 2-byte < 3-byte < 4-byte. Within each length the code point bits are stored most-significant first, so the variable length makes no difference. But code point order isn't very good: even in English, Z < a, let alone languages with diacritics or whatnot.

An interesting discussion, anyway.

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
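
For anyone who wants to check the byte-order claim, here is a minimal Python sketch. The sample strings and variable names are my own, chosen only for illustration. It relies on Python 3 comparing `str` values code point by code point and `bytes` values byte by byte. It shows the two orders agreeing, and it shows the uppercase-before-lowercase problem in plain ASCII.

```python
# Hypothetical sample: a mix of 1-, 2-, 3- and 4-byte UTF-8 characters.
words = ["zebra", "Apple", "apple", "Zoo", "éclair", "हिन्दी", "தமிழ்", "😀"]

# Sort by the raw UTF-8 bytes (what a binary sort of the stored text does).
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# Sort by code points (Python's default str comparison).
by_code_points = sorted(words)

# The two orders agree for valid UTF-8.
same_order = by_bytes == by_code_points
print(same_order)  # True

# Code point order is still not a linguistic order: every uppercase
# ASCII letter sorts before every lowercase one.
ascii_sorted = sorted(["apple", "Zoo", "zebra", "Apple"])
print(ascii_sorted)  # ['Apple', 'Zoo', 'apple', 'zebra']
```

Neither order is a real collation. That needs locale-aware rules, which this sketch does not attempt.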
