Re: [sqlite] Things you shouldn't assume when you store names

Igor Tandetnik Mon, 11 Nov 2019 12:08:37 -0800

On 11/11/2019 2:56 PM, Igor Tandetnik wrote:

On 11/11/2019 12:30 PM, Jose Isaias Cabrera wrote:


Igor Tandetnik, on Monday, November 11, 2019 11:02 AM, wrote...

Most people have to figure out what Unicode they are using, count the bytes, divide
by... and on, and on.  Not me, I just take that UTF8, or UTF16 string, convert it to
UTF32, and do a count.


And then what do you do with that count? What do you use it for?


Say that I am writing a report and I only want to print the first 20 characters of a string

A sequence of Unicode codepoints U+006F U+0302 U+0301 should be rendered as a single grapheme (ố),
what a human would think of as a "character". This is an actual character in Vietnamese. Now, if
you have several such triplets in a row in your string, and you chop it at 20 codepoints, you'll
only print 7 graphemes / "characters". Moreover, you'll end up dropping the last combining accent,
producing a different grapheme (ô) and potentially altering the meaning of the text. (Don't know
how much of a danger this is in Vietnamese, but I know that combining viramas
https://www.compart.com/en/unicode/combining/9 are vital to Indic languages, and dropping one will
in fact often produce a valid but different word).
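
To make the arithmetic above concrete, here is a rough Python sketch (Python is my choice for
illustration, nothing in this thread uses it). The helper name truncate_keep_marks is made up,
and it only looks at the canonical combining class, so it is a simplification and not a full
grapheme cluster segmenter: it would not keep an Indic conjunct together, for instance.

```python
import unicodedata

# o (U+006F) + combining circumflex (U+0302) + combining acute (U+0301)
triplet = "\u006F\u0302\u0301"
text = triplet * 7        # 21 codepoints, 7 graphemes

# Naive approach: chop at 20 codepoints. The 7th grapheme loses its acute accent.
naive = text[:20]
print(len(naive))         # 20

def truncate_keep_marks(s, limit):
    """Hypothetical helper: cut s to at most `limit` codepoints without
    separating a base character from the combining marks that follow it.
    Assumption: a nonzero canonical combining class marks a combining
    codepoint, which is only an approximation of grapheme boundaries."""
    if len(s) <= limit:
        return s
    end = limit
    # If the first dropped codepoint is a combining mark, back up to the
    # base character that owns it, so the whole cluster is dropped together.
    while end > 0 and unicodedata.combining(s[end]) != 0:
        end -= 1
    return s[:end]

safe = truncate_keep_marks(text, 20)
print(len(safe))          # 18, i.e. six complete triplets
```

The naive slice ends in U+006F U+0302, which is the different grapheme mentioned above, while
the second version gives up two codepoints of the budget to avoid printing a mangled one.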


A more colorful example: some emoji are composed of a long sequence of Unicode
codepoints. 👨‍👩‍👶‍👶 would be U+1F468 U+200D U+1F469 U+200D U+1F476 U+200D U+1F476
( https://emojipedia.org/family-man-woman-baby-baby/ ). Truncating such a sequence
at an arbitrary point is likely to produce a valid emoji with a very different meaning.
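
The same point can be sketched in Python for the emoji case. The helper truncate_keep_zwj is a
made-up name, and it only knows about the zero width joiner (U+200D); it ignores variation
selectors, skin tone modifiers and everything else a real grapheme segmenter handles, so treat
it as an illustration under those assumptions and not as a recipe.

```python
ZWJ = "\u200D"  # zero width joiner

# The sequence from the example above: man, ZWJ, woman, ZWJ, baby, ZWJ, baby.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F476\u200D\U0001F476"
print(len(family))        # 7

# Naive cut at 3 codepoints keeps man + ZWJ + woman: a different, shorter sequence.
naive = family[:3]

def truncate_keep_zwj(s, limit):
    """Hypothetical helper: cut s to at most `limit` codepoints, but never
    directly before or after a ZWJ, so a joined sequence is kept or dropped
    as a whole. Assumption: ZWJ is the only joiner that matters here."""
    if len(s) <= limit:
        return s
    end = limit
    # Move the cut point left while it would fall inside a joined sequence.
    while end > 0 and (s[end] == ZWJ or s[end - 1] == ZWJ):
        end -= 1
    return s[:end]

print(len(truncate_keep_zwj("ab" + family, 4)))   # 2, only "ab" survives
```

With a budget of 4 codepoints the naive slice would have printed "ab" plus half of the family;
the sketch prints just "ab" and leaves the emoji out entirely.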

Igor Tandetnik

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
