I am trying to understand the mysteries of unicode encodings; the following may 
(or may not) be useful (or confusing) to others.

The docs say the full chunk expression for a unicode character is
      byte i of codeunit j of codepoint k of character c of str
(with the warning that this is 'not of general utility' ... indeed!)
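
Spelled out with a plain ASCII character (the variable name is my own), the
hierarchy reads like this:

     put "A" into tStr
     put the number of codepoints in character 1 of tStr                 -- 1
     put the number of codeunits in codepoint 1 of character 1 of tStr   -- 1
     put codepointToNum(codepoint 1 of character 1 of tStr)              -- 65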

Taking a look at the Emoji 'flag of Scotland' character 🏴󠁧󠁢󠁳󠁣󠁴󠁿 (which won't 
display here, but exists in the Apple Color Emoji font and in the corresponding 
fonts on other platforms), I get:

put 🏴󠁧󠁢󠁳󠁣󠁴󠁿 into str
number of chars of str: 1

char 1 of str: 🏴󠁧󠁢󠁳󠁣󠁴󠁿
number of codepoints of char 1 of str: 7
     codepoint 1:   1F3F4   with 2 codeunits (D83C DFF4)
     codepoint 2:   0       with 0 codeunits - seems to be a placeholder
                            rather than an actual codepoint
     codepoint 3:   E0067   (DB40 DC67)
     codepoint 4:   0
     codepoint 5:   E0062   (DB40 DC62)
     codepoint 6:   0
     codepoint 7:   E0073   (DB40 DC73)

number of codepoints of str: 7
number of codeunits of str: 14
number of codeunits of char 1 of str: 14
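
For reference, a loop along these lines reproduces the report above (str is the
variable set up earlier; tReport and tNum are names of my own choosing):

     put "number of chars of str:" && the number of characters in str & return into tReport
     repeat with k = 1 to the number of codepoints in char 1 of str
        put codepointToNum(codepoint k of char 1 of str) into tNum
        put "codepoint" && k & ":" && baseConvert(tNum, 10, 16) && "with" && \
              the number of codeunits in codepoint k of char 1 of str && "codeunits" & return after tReport
     end repeat
     put tReport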

So there are 6 codeunits which are not in any codepoint (or at least not as 
reported by LC). They can be enumerated by looping over "codeunit j of str" 
rather than "codeunit j of codepoint k of ...", or by textEncode(str, "UTF-16") 
and then enumerating the bytes of the binary-encoded str.
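
Both routes as sketches (tAllUnits, tData and tHex are names of my own choosing):

     -- route 1: walk the codeunits of str directly
     repeat with j = 1 to the number of codeunits in str
        put codeunit j of str after tAllUnits   -- all 14 turn up, including the 6 the codepoint chunks miss
     end repeat

     -- route 2: encode to UTF-16 and walk the raw bytes
     put textEncode(str, "UTF-16") into tData
     repeat with i = 1 to the number of bytes in tData
        put baseConvert(byteToNum(byte i of tData), 10, 16) & space after tHex
     end repeat
     put tHex   -- 28 bytes = 14 codeunits, in the platform's (little-endian) byte order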

Bytes in the binary encoding = all the codeunits (the encoding is actually in 
little-endian byte order, but given here in big-endian order, which is the order 
reported by enumerating the codeunits):
       D83C DFF4 DB40 DC67 DB40 DC62 DB40 DC73 DB40 DC63 DB40 DC74 DB40 DC7F

Which should correspond to codepoints
       1F3F4 E0067 E0062 E0073 E0063 E0074 E007F
And indeed, if I manually build a UTF-16 string with these codepoints, it does 
display as the flag of Scotland. So the lesson is that the reported chunks are 
not to be naively trusted (though not exactly a bug, given the documentation 
warning).
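
One way to do that rebuild, a sketch that lets numToCodepoint handle the
surrogate pairs (the field name "Test" is invented):

     put "1F3F4 E0067 E0062 E0073 E0063 E0074 E007F" into tHexList
     put empty into tFlag
     repeat for each word tHex in tHexList
        put numToCodepoint(baseConvert(tHex, 16, 10)) after tFlag
     end repeat
     put tFlag into field "Test"   -- displays as the flag of Scotland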

1F3F4, by the way, is a black flag; the remaining codepoints are tag characters 
from the Tags unicode block, spelling out 'gbsct' followed by a cancel tag. 
Amusingly, the Rainbow flag emoji is made up of 3 characters: char 1 is a white 
flag, char 2 is an invisible zero-width joiner, char 3 is a rainbow (a sketch 
rebuilding it follows below). BTW, backspacing over the displayed Rainbow flag 
actually has to be done in three steps to remove the displayed glyph, which I 
think is not correct behaviour for an editor, since it appears to the user as 
one unicode character. Apple's TextEdit, for example, deletes the Rainbow flag 
with a single backspace. There are nasties lurking here for text-manipulation 
LC code. Perhaps there should be a new string element 'unicodeChar'? BTW I have 
nothing but huge admiration for the LC unicode implementation team; it is a 
subject of extreme complexity.
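
As promised, the Rainbow flag rebuilt from its published codepoint sequence 
(white flag U+1F3F3, variation selector U+FE0F, zero-width joiner U+200D, 
rainbow U+1F308), again letting numToCodepoint do the work:

     put empty into tRainbow
     repeat for each word tHex in "1F3F3 FE0F 200D 1F308"
        put numToCodepoint(baseConvert(tHex, 16, 10)) after tRainbow
     end repeat
     put the number of characters in tRainbow   -- 3, per the behaviour described above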

Another question (which I think has been raised before but I don’t think there 
was an answer?). When a character (codepoint) in a string is displayed, if the 
requested font does not have that codepoint the OS substitutes a glyph from 
another font (or the missing character glyph if no font supports the 
codepoint). So for example if you change the font of the above flag of Scotland 
to Arial, it still displays as the flag of Scotland, even though this glyph is 
not in Arial. LC will still report that the font of this character is Arial: 
from what I can gather this is not the fault of LC, the OS is doing this 
substitution behind its back (TextEdit does the same). But is there any way to 
find out (programmatically) the actual font being used?
   
   

