Okay, Dar, I tried your idea. It works like a dream, at least for my problem (Unicode text in the Cyrillic range). I didn't even have to convert uppercase to lowercase! In fact, I'm not even sure exactly why this works.

On Jun 1, 2006, at 5:21 PM, Dar Scott wrote:

Wow!  Great news for sorting Unicode!

On May 30, 2006, at 5:08 PM, Devin Asay wrote:

I got your code to work by making some simple changes in the sortCodeFromRussian function:

Devin, I've been processing some bits of UTF-8, and something dawned on me that is probably known by the Unicode experts.

**** A lexical byte sort of well-formed UTF-8 will result in a Unicode code point sort! *****
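For anyone who wants to check that claim outside Revolution, here is a minimal sketch in Python (not part of the original scripts; the word list is made up for illustration). It relies on the fact that Python compares strings code point by code point, and then sorts the same list by raw UTF-8 bytes.

```python
# Sample words mixing ASCII, Cyrillic, CJK, and one character outside the BMP.
words = ["яблоко", "Ёж", "apple", "Zebra", "ёлка", "дом", "Арбуз", "😀", "中文"]

# Python orders str values by Unicode code point.
by_code_point = sorted(words)

# Order by the encoded bytes instead: a plain lexical byte sort of UTF-8.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_code_point == by_utf8_bytes)  # prints True
print(by_code_point)
```

Note that code point order is not dictionary order for Russian: Ё (U+0401) sorts before А (U+0410), and ё (U+0451) sorts after я (U+044F), so folding "yo" into "ye" before sorting still matters.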

Here's what I did:

The call:

set the unicodeText of fld 1 to uniEncode(sortRuss(the unicodeText of fld 1), "UTF8")

The function:

function sortRuss utf16RussList
  -- convert the UTF-16 list to UTF-8 so that sort sees no NUL bytes
  put uniDecode(utf16RussList, "UTF8") into utf8RussList
  sort lines of utf8RussList text
  return utf8RussList
end sortRuss


That avoids the NUL problem in sort. That means that russianLex() can return the UTF-8 of the string with your character conversions.
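To make the NUL point concrete, here is a small Python sketch (an illustration only, and it assumes the "NUL problem" means zero bytes inside UTF-16 text). Any ASCII character, such as a space or a Latin letter, has a zero high byte in UTF-16, while UTF-8 only produces a zero byte for U+0000 itself.

```python
sample = "Aб"  # one ASCII letter and one Cyrillic letter (U+0431)

utf16 = sample.encode("utf-16-be")  # big-endian, no byte-order mark
utf8 = sample.encode("utf-8")

print(list(utf16))  # [0, 65, 4, 49]
print(list(utf8))   # [65, 208, 177]

# The UTF-16 form contains a NUL byte; the UTF-8 form does not.
print(0 in utf16, 0 in utf8)  # True False
```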

I think the replace command will work with UTF-8, so you can even avoid a character loop. All you need is 34 replaces and then a return. OK, that might actually be slower than a character loop.

FWIW, at first I did do an uppercase-to-lowercase conversion. The replaces were very fast: less than 1 second on a list of more than 1350 Unicode lines. It's just a list of 35 replaces.

function russToLC lList
  replace "А" with "а" in lList
  replace "Б" with "б" in lList
  replace "В" with "в" in lList
  replace "Г" with "г" in lList
  replace "Д" with "д" in lList
  replace "Е" with "е" in lList
  replace "Ё" with "е" in lList -- convert "yo" to "ye"
  replace "ё" with "е" in lList
  replace "Ж" with "ж" in lList
  replace "З" with "з" in lList
  replace "И" with "и" in lList
  replace "Й" with "й" in lList
  replace "К" with "к" in lList
  replace "Л" with "л" in lList
  replace "М" with "м" in lList
  replace "Н" with "н" in lList
  replace "О" with "о" in lList
  replace "П" with "п" in lList
  replace "Р" with "р" in lList
  replace "С" with "с" in lList
  replace ""&quote with "т" in lList -- uppercase Russian Т (U+0422) has 0x22 as its second byte (the ASCII quote char)
  replace "У" with "у" in lList
  replace "Ф" with "ф" in lList
  replace "Х" with "х" in lList
  replace "Ц" with "ц" in lList
  replace "Ч" with "ч" in lList
  replace "Ш" with "ш" in lList
  replace "Щ" with "щ" in lList
  replace "Ъ" with "ъ" in lList
  replace "Ы" with "ы" in lList
  replace "Ь" with "ь" in lList
  replace "Э" with "э" in lList
  replace "Ю" with "ю" in lList
  replace "Я" with "я" in lList
  return lList
end russToLC
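For comparison, the same folding can be sketched in a language with built-in Unicode case mapping. This is Python, not Revolution, and the name russ_to_lc is made up for the example. It lowercases the text and then folds "yo" into "ye", as the replace list does.

```python
def russ_to_lc(text: str) -> str:
    """Lowercase the text, then fold 'ё' into 'е' (hypothetical helper)."""
    # str.lower() maps every cased letter, not only the Russian alphabet.
    return text.lower().replace("ё", "е")

print(russ_to_lc("Ёлка\nЧай\nЩука"))
```

Unlike the replace list, this also lowercases any Latin or other cased letters in the text, which may or may not be what you want for a sort key.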




Dar
Unicode Sophomore

Devin
Still in Unicode Prep School

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University
