Re: Jumping cursors

Richmond Mathewson Thu, 05 Jan 2017 03:26:00 -0800

Thank you: 373 wonky results!

Well, to be honest, I'm not going to wait for you and yours to sort that out; I shall use the list to help me avoid wonky Unicode addresses.


On 1/5/17 1:07 pm, Mark Waddingham wrote:

On 2017-01-05 11:01, Richmond Mathewson wrote:
Ha, Ha, Ha: possibly the first time ever that it hasn't been the latter :)
http://quality.livecode.com/show_bug.cgi?id=19045
By 'stupid engine' do you mean the LiveCode engine, something else, or
code that has been co-opted
from elsewhere and folded into the LC engine?
Specifically the internal routine which fetches the Unicode 'properties' for a run of characters is currently computing a surrogate pair's codepoint incorrectly - in this case U+0FF001 is being treated as U+07BC - which is an undefined codepoint and as such the property info being fetched (in this case, BiDi class) is undefined.
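To make the arithmetic concrete, here is a small sketch in Python (not LiveCode, and not the engine's actual code). The function names are made up for illustration. The "buggy" formula is an assumption taken from the tWrongCodepoint line of the script quoted further down in this message: the leading and trailing halves swap roles and the 0x10000 offset is dropped.

```python
def to_surrogates(codepoint):
    """Split a codepoint above U+FFFF into (leading, trailing) UTF-16 code units."""
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def decode_correct(leading, trailing):
    """Standard UTF-16 decoding of a surrogate pair."""
    return 0x10000 + ((leading - 0xD800) << 10) + (trailing - 0xDC00)


def decode_buggy(leading, trailing):
    """Assumed form of the miscalculation: halves swapped, 0x10000 offset missing."""
    return (leading - 0xD800) + ((trailing - 0xDC00) << 10)


leading, trailing = to_surrogates(0xFF001)
print(hex(leading), hex(trailing))             # 0xdbbc 0xdc01
print(hex(decode_correct(leading, trailing)))  # 0xff001
print(hex(decode_buggy(leading, trailing)))    # 0x7bc
```

With the pair for U+FF001 the assumed faulty formula yields 0x7BC, which matches the U+07BC mentioned above.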
I, like a fool, had assumed that post LiveCode 7.0 the engine was,
somehow, avoiding surrogate pairs
altogether, rather than fudging around so things were *very pleasant
indeed* for people like me when
leveraging glyphs occupying Unicode areas above the first plane.

Obviously things were slightly too good to be true.
The engine does 'automatically' deal with surrogate pairs in UTF-16. Indeed, the fact that they exist at all in the engine's internal representation is generally not something the developer has to worry about (modulo bugs, like the one above).
You can use the codeunit chunk to access a string's individual UTF-16 components, codepoint chunk to access a string as a sequence of actual codepoints, and char to access a string as a sequence of graphemes (approximation to what most people call 'letters' or 'characters').
Do you have any idea which other surrogate pairs it might be getting wrong?
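The difference between code units and codepoints can be shown outside LiveCode as well. The following is only a Python sketch with a hypothetical helper name, not a description of how the LiveCode chunks are implemented. Graphemes are left out because Python's standard library has no grapheme segmentation.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


sample = chr(0xFF001)  # one private-use codepoint above U+FFFF

# Two UTF-16 code units (a surrogate pair), roughly what "codeunit" chunks see.
print([hex(unit) for unit in utf16_code_units(sample)])

# One codepoint, roughly what "codepoint" chunks see (Python's len counts codepoints).
print(len(sample))
```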
Until (if ?) things get sorted out that would be a useful reference
list so as to know which Unicode slots
to avoid.
This should list all the codepoints in the SPUA-A which will cause directionality problems (due to incorrect property lookup):
   local tList
   repeat with tCodepoint = 0xF0000 to 0xFFFFD
      get numToCodepoint(tCodepoint)

      local tLeading, tTrailing
      put codepointToNum(codeunit 1 of it) into tLeading
      put codepointToNum(codeunit 2 of it) into tTrailing

      local tWrongCodepoint
      put (tLeading - 0xD800) + ((tTrailing - 0xDC00) * 2^10) into tWrongCodepoint
      get codepointProperty(numToCodepoint(tWrongCodepoint), "BidiClass")
      if it contains "Right_To_Left" or it contains "Arabic" then
         put format("U+0x%6x has wrong bidi class - %s\n", tCodepoint, it) after tList
      end if
   end repeat
   put tList

Anyone who wants to mess around with this (I am on a Macintosh at the moment) on Windows or Linux can download this:

https://www.dropbox.com/s/i8ba0viztujs0dq/bad%20Unicode.livecode.zip?dl=0

Writing as a lazy slob I feel no screaming urge to go back and recode
all those (0x4FFF6), (0x3EEDA)
hex codes as surrogate pairs . . .
Doing so wouldn't do you any good anyway. The bug lies in the processing of the string *after* it has been constructed - whether it is constructed directly from codepoints, or codeunits wouldn't make a difference.
I've submitted a PR for a fix to the problem against the 8.1 branch here:

   https://github.com/livecode/livecode/pull/5020


Presumably that also holds for the LiveCode 9 series.


Warmest Regards,

Mark.


Best, Richmond.
_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
