On May 26, 2006, at 5:26 PM, Dar Scott wrote:


On May 26, 2006, at 3:57 PM, Devin Asay wrote:

A 'sort lines' command, after converting upper case to lower, works fairly well, except that, curiously, a space sorts *after* all Cyrillic chars.


I think I figured out what it is. 'sort' seems to see NUL as the end of the string and U+0020 has virtually a NUL in it. Try this test:

on mouseUp
  put "a" & NULL & "z" & lf & "a" & NULL & "b" into d
  sort d
  replace NULL with "x" in d
  put d
end mouseUp
==>
axz
axb

We have been bitten by C again.
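
For anyone who wants to poke at this idea outside Rev, here is a rough Python model of the hypothesis, not of Rev's actual sort code. It assumes the comparison only looks at each line up to its first NUL, the way a C string routine would; the helper name c_string_key is made up for the illustration.

```python
# Hypothetical model of the "NUL ends the string" theory.
# This is NOT Rev's implementation, only a sketch of the suspected behaviour.

def c_string_key(line: str) -> str:
    """Return the part of the line a C-style comparison would see."""
    return line.split("\0", 1)[0]

lines = ["a\0z", "a\0b"]

# Both keys are just "a", so the (stable) sort leaves the input order alone.
result = sorted(lines, key=c_string_key)

# Make the NULs visible, as the Rev test does with 'replace'.
shown = [line.replace("\0", "x") for line in result]
print("\n".join(shown))
```

Under that assumption the two lines compare as equal, so "axz" stays ahead of "axb", which matches the output of the Rev test above. A sort that looked at the whole line would put "axb" first.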

Hmmm. Maybe this is it. But I thought the reason was different: We know (I think) that when we enter unicode text in Rev, lower ascii range characters are still encoded as lower ascii. So space and cr are still ascii 32 (#20) and ascii 10 (#0A). (I think RunRev do this so that chunk expressions like word and line still work for unicode text.) To see what I mean, type some unicode-range text into a field, then look at the htmlText.

The sort command is an ascii sort by default, so when rev sorts a field it sees all the characters, ascii or unicode, as ascii bytes. It's sort of like what you get if you have a field with unicode text in it and do:

  put fld "myUnicodeText" into fld "anyOldField"

When your unicode is in a lower unicode range, like my Cyrillic example, all of the first bytes are the same (#04 for Cyrillic), so rev is seeing something like:

For the Russian (don't know if this will come thru in your email reader):
  Я вижу вас.
The unicode is (omitting the "U+" convention):
  042F 0020 0432 0438 0436 0443 0020 0432 0430 0441 002E

But what rev is seeing during sort is a series of single-byte chars, with the leading null bytes of the basic Latin range chars dropped:
  04 2F 20 04 32 04 38 04 36 04 43 20 04 32 04 30 04 41 2E

Since all of the first bytes of the Cyrillic range are 04 (< 20), they are always sorted *before* virtually everything in lower ascii range.
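
Here is the same idea as a quick Python comparison, again only a model of what I suspect is happening and not of Rev's sort itself. The byte_key helper is hypothetical and uses the assumed encoding (ascii as one byte, everything else as two bytes, high byte first); the three test lines are made up.

```python
# Compare a byte-wise sort (the suspected behaviour) with a code point sort.
# byte_key is a hypothetical helper built on the ASSUMED mixed encoding.

def byte_key(text: str) -> bytes:
    return b"".join(
        bytes([ord(ch)]) if ord(ch) < 0x80 else ord(ch).to_bytes(2, "big")
        for ch in text
    )

lines = ["ва", "в а", " в"]

byte_order = sorted(lines, key=byte_key)   # 04 sorts before 20
codepoint_order = sorted(lines)            # U+0020 sorts before U+0432

print(byte_order)
print(codepoint_order)
```

With the byte keys, the line starting with a space lands last and "ва" lands ahead of "в а", because every Cyrillic character leads with 04 and the space is 20. Sorting by code point gives the opposite order, which is what I would expect from a true unicode sort.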

At least I think this is what's going on, and if so, it explains a lot of the weirdness that happens when working with unicode in Rev.

Anyway, I think the moral of the story is that we need beefed-up unicode support in Rev, including a true unicode sort option. I've submitted an enhancement request for this, BZ 3646. Please consider voting for it if you have some votes to spare.

Devin

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University

_______________________________________________
use-revolution mailing list
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
