Re: Unicode sorting

Dar Scott Sat, 27 May 2006 13:05:09 -0700


On May 27, 2006, at 9:12 AM, Devin Asay wrote:

For the Russian (don't know if this will come thru in your email reader):
  Я вижу вас.
The unicode is (omitting the "U+" convention):
  042F 0020 0432 0438 0436 0443 0020 0432 0430 0441 002E
But what rev is seeing during sort is a series of single-byte chars, with leading null bytes of basic latin range chars ignored:
  04 2F 20 04 32 04 38 04 36 04 43 20 04 32 04 30 04 41 2E
Since all of the first bytes of the Cyrillic range are 04 (< 20), they are always sorted *before* virtually everything in the lower ASCII range.


The letters came through.

But the NUL characters are not dropped unless you are doing something to drop them. If they are being dropped, though, then the sort order you describe is what will happen.
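For anyone who wants to check the byte sequences outside Rev, here is a small Python sketch (Python is my choice for illustration only; it is not part of the Rev code in this thread). It encodes the sample sentence as big-endian UTF-16 and then drops the NUL bytes, assuming that is the step that happens somewhere along the way.

```python
# Illustration only: reproduce the two byte sequences quoted above.
text = "Я вижу вас."

# Big-endian UTF-16, two bytes per character for this text.
utf16 = text.encode("utf-16-be")
full_hex = " ".join(f"{b:02X}" for b in utf16)

# Assumption: something drops the NUL (00) bytes before the sort sees them.
stripped = bytes(b for b in utf16 if b != 0)
stripped_hex = " ".join(f"{b:02X}" for b in stripped)

print(full_hex)
print(stripped_hex)
```

With the NULs gone, every Cyrillic character starts with byte 04, which is below the space (20), so those lines sort ahead of anything starting with printable ASCII.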

I see you have a period in your characters. Perhaps you have other characters outside Russian Cyrillic.

I forgot about the line ends in sort. In general, a byte that looks like a line end can come up in the middle of a UTF-16 character (for example, U+040A contains the byte 0A).

I wonder if a sort of the utf8 of the Cyrillic to-lower would be close. The idea below is probably better in general.
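To see how close the UTF-8 to-lower sort gets, here is a rough Python sketch (illustration only, not Rev code; the word list and the name utf8_lower_key are made up for the example). It sorts by the UTF-8 bytes of the lowercased line, which is the same as sorting by code point.

```python
# Illustration only: sort by the UTF-8 bytes of the lowercased line.
def utf8_lower_key(line: str) -> bytes:
    # UTF-8 byte order matches Unicode code point order.
    return line.lower().encode("utf-8")

# Hypothetical sample words, chosen to include the letter yo (Ё/ё).
words = ["яблоко", "Ёж", "арбуз", "ель"]
result = sorted(words, key=utf8_lower_key)
print(result)  # ['арбуз', 'ель', 'яблоко', 'Ёж']
```

The basic lowercase letters а through я come out in alphabetical order, but ё (U+0451) lands after я instead of after е, so this is close rather than exact.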


Try something roughly like this (not tested; typed in raw):

function sortRussian utf16RussianList
   -- use utf8 to get rid of NULs and extra line ends
   put uniDecode(utf16RussianList, "UTF8") into utf8RussianList
   sort lines of utf8RussianList text by russianLex(each)
   return utf8RussianList
end sortRussian

-- returns string suitable for lexical comparison (Rev sort text)
-- of a utf8 string made up of Russian subset of Cyrillic plus some ASCII
function russianLex utf8RussianLine
   -- Add adjustments for special words here
   put uniEncode(utf8RussianLine, "UTF8") into utf16RussianLine
   put empty into lex

   repeat with i = 1 to length(utf16RussianLine)-1 step 2 -- uniCode char loop
      put char i to i+1 of utf16RussianLine into utf16RussianChar
      -- Add char dropping tests here
      put sortCodeFromRussianChar( utf16RussianChar) into sortNumber
      put numToChar( sortNumber ) after lex -- use 1-byte chars for sorting
   end repeat
   return lex
end russianLex

-- returns number in range 1 to 255 indicating sort position of
-- allowed characters
function sortCodeFromRussianChar utf16Char
   set the useUnicode to true
   put charToNum(utf16Char) into unicodePoint
   switch unicodePoint
   case 0x0020 -- space
     get 1
     break
   ...
   default
     get 255
   end switch
   return it
end sortCodeFromRussianChar

This will take some debugging.

In the approach above, one-byte "chars" are used for sorting. An alternative is to use two ASCII chars per character: a space followed by the char for the ASCII subset, and two letters that sort correctly for the Cyrillic. That would make testing easier for russianLex().

I remember from yesterday that yo or ye or something (two dots over e) was not in the basic Russian group, so you will need to handle it separately from the basic Russian range.
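Here is a rough Python sketch of the same sort-code idea, with yo handled by its place in the alphabet (illustration only, not Rev code). The names ALPHABET, SORT_CODES, russian_lex and sort_russian are made up for the example, and the code assignments (space = 1, letters from 2 up, everything else 255) plus the case folding are my assumptions, not a standard.

```python
# Illustration only: map each character to a one-byte sort code.
# The 33 letters of the Russian alphabet, with yo (ё) right after е.
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# Assumed code assignments: space sorts first, then the letters in order.
SORT_CODES = {" ": 1}
for pos, letter in enumerate(ALPHABET):
    SORT_CODES[letter] = 2 + pos  # codes 2..34

def russian_lex(line: str) -> bytes:
    # Lowercase first (assumption: case is ignored); unknown chars get 255.
    return bytes(SORT_CODES.get(ch, 255) for ch in line.lower())

def sort_russian(lines):
    # The key is a byte string, so comparison is plain byte order.
    return sorted(lines, key=russian_lex)

print(sort_russian(["яблоко", "Ёж", "арбуз", "ель"]))
# ['арбуз', 'ель', 'Ёж', 'яблоко']
```

Because ё gets its code from its position in the alphabet string rather than from its code point, it sorts right after е.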

BTW, for those not familiar with using customFun(each) in sort, customFun() seems to be called only once for each line; it is not called twice for each comparison.
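For comparison, Python's sorted() with a key function works the same way: the key is computed once per item and the stored keys are compared. This sketch (illustration only; counted_key is a made-up name) counts the calls. It says nothing about Rev's internals, only that the pattern is a common one.

```python
# Illustration only: count how often a sort key function is called.
calls = []

def counted_key(line: str) -> str:
    calls.append(line)  # record every call
    return line.lower()

lines = ["d", "B", "a", "C", "e"]
result = sorted(lines, key=counted_key)

print(len(calls), "key calls for", len(lines), "lines")
```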

I am not in favor of a Unicode sort option. I'll elaborate later. I have a couple goals to meet by tonight.


Dar




_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
