On May 27, 2006, at 9:12 AM, Devin Asay wrote:


For the Russian (don't know if this will come thru in your email reader):
  Я вижу вас.
The Unicode is (omitting the "U+" convention):
  042F 0020 0432 0438 0436 0443 0020 0432 0430 0441 002E

But what Rev sees during the sort is a series of single-byte chars, with the leading null bytes of Basic Latin characters ignored:
  04 2F 20 04 32 04 38 04 36 04 43 20 04 32 04 30 04 41 2E

Since the first byte of every character in this Cyrillic range is 04 (< 20), these characters always sort *before* virtually everything in the lower ASCII range.
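
The byte values quoted above can be reproduced outside Rev. The following is only a sketch in Python (not Transcript), and the variable names are made up for illustration. It encodes the sample text as big-endian UTF-16, then drops the leading 00 byte of the Basic Latin characters to mimic the byte dump.

```python
# Sample text from the message above.
text = "Я вижу вас."

# Big-endian UTF-16: two bytes per character for this text.
utf16 = text.encode("utf-16-be")
pairs = [utf16[i:i + 2] for i in range(0, len(utf16), 2)]
code_units = " ".join(p.hex().upper() for p in pairs)
print(code_units)

# Mimic the dump above: drop the leading 00 byte of Basic Latin characters.
seen = b"".join(p[1:] if p[0] == 0 else p for p in pairs)
seen_hex = seen.hex(" ").upper()
print(seen_hex)

# A Cyrillic character starts with byte 0x04, which is below 0x20 (space),
# so in a bytewise sort it lands before the space and before "A".
bytewise = sorted([b"A", b" ", b"\x04\x2F"])
print(bytewise)
```

The first two printed lines should match the two dumps above, and the last line shows the Cyrillic bytes sorting ahead of the ASCII ones.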

The letters came through.

But the NUL bytes are not dropped unless something in your script drops them. If they are being dropped, then the sort will behave the way you describe.

I see you have a period in your characters. Perhaps you have other characters outside Russian Cyrillic.

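I forgot about the line ends in sort. In UTF-16, a byte with the value of a line end (0A or 0D) can be one half of a character (U+040A, for example, is the bytes 04 0A), so sorting the lines of raw UTF-16 can split a character in the middle.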

I wonder if sorting the UTF-8 of the lowercased Cyrillic would be close. The idea below is probably better in general.
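
To see how close a plain UTF-8 sort of the lowercased text gets, here is a small Python sketch (Python is used only for illustration, and the word list is invented). UTF-8 byte order follows code point order, so the basic Russian letters come out in alphabetical order, but ё (U+0451) sorts after я (U+044F) instead of after е.

```python
# Invented sample words; "Ёж" starts with the letter yo (U+0401).
words = ["яблоко", "Ёж", "арбуз", "ель", "Борис"]

# Sort by the UTF-8 bytes of the lowercased word.
by_utf8 = sorted(words, key=lambda w: w.lower().encode("utf-8"))
print(by_utf8)

# Dictionary order would put "Ёж" between "ель" and "яблоко";
# here it ends up last because U+0451 is above the а..я range.
yo_is_last = by_utf8[-1] == "Ёж"
print(yo_is_last)
```

So the result is close, as guessed, and the remaining error is confined to characters outside the contiguous а..я block.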

Try something roughly like this (not tested; typed in raw):

function sortRussian utf16RussianList
   -- use utf8 to get rid of NULs and extra line ends
   put uniDecode(utf16RussianList, "UTF8") into utf8RussianList
   sort lines of utf8RussianList text by russianLex(each)
   return utf8RussianList
end sortRussian

-- returns string suitable for lexical comparison (Rev sort text)
-- of a utf8 string made up of Russian subset of Cyrillic plus some ASCII
function russianLex utf8RussianLine
   -- Add adjustments for special words here
   put uniEncode(utf8RussianLine, "UTF8") into utf16RussianLine
   put empty into lex
   repeat with i = 1 to length(utf16RussianLine)-1 step 2 -- Unicode char loop
      put char i to i+1 of utf16RussianLine into utf16RussianChar
      -- Add char dropping tests here
      put sortCodeFromRussianChar(utf16RussianChar) into sortNumber
      put numToChar(sortNumber) after lex -- use 1-byte chars for sorting
   end repeat
   return lex
end russianLex

-- returns number in range 1 to 255 indicating sort position of
-- allowed characters
function sortCodeFromRussianChar utf16Char
   set the useUnicode to true
   put charToNum(utf16Char) into unicodePoint
   switch unicodePoint
   case 0x0020 -- space
     get 1
     break
   ...
   default
     get 255
   end switch
   return it
end sortCodeFromRussianChar

This will take some debugging.

In the approach above, one-byte "chars" are used for sorting. An alternative is to use two ASCII chars per character: a space followed by the char itself for the ASCII subset, and two letters that sort correctly for each Cyrillic character. That would make testing easier for russianLex(), since the keys would be readable.
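
A rough sketch of that two-char alternative, written in Python rather than Rev just to show the shape of the keys. The names RUSSIAN, two_char_key and lex_key are hypothetical, as is the choice of "~~" for characters that are not in the table.

```python
# Lowercase Russian alphabet in dictionary order, with yo right after ye.
RUSSIAN = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

def two_char_key(ch):
    """Return a two-character ASCII sort key for one character."""
    lower = ch.lower()
    if lower in RUSSIAN:
        # 33 letters do not fit in a..z, so spread them over two letters:
        # index 0 -> "aa", index 25 -> "az", index 26 -> "ba", ...
        high, low = divmod(RUSSIAN.index(lower), 26)
        return chr(ord("a") + high) + chr(ord("a") + low)
    if ord(ch) < 128:
        # ASCII subset: space plus the char, which sorts before any letter pair.
        return " " + ch
    return "~~"  # anything else sorts last (an arbitrary choice)

def lex_key(line):
    """Concatenate the two-char keys of every character in the line."""
    return "".join(two_char_key(ch) for ch in line)

words = ["яблоко", "Ёж", "арбуз", "ель", "Борис"]
in_order = sorted(words, key=lex_key)
print(in_order)
print(lex_key("Ёж"))
```

Because every key is plain ASCII, a wrong table entry is easy to spot by printing the keys next to the words.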

I remember from yesterday that yo (ё, the e with two dots over it) is not in the basic Russian group: Ё is U+0401 and ё is U+0451, both outside the contiguous U+0410 to U+044F range. You will need to handle it separately from the basic Russian range.

BTW, for those not familiar with using customFun(each) in sort, customFun() seems to be called only once for each line; it is not called twice for each comparison.


I am not in favor of a Unicode sort option. I'll elaborate later. I have a couple of goals to meet by tonight.

Dar



