On 11/11/2019 2:56 PM, Igor Tandetnik wrote:
On 11/11/2019 12:30 PM, Jose Isaias Cabrera wrote:
Igor Tandetnik, on Monday, November 11, 2019 11:02 AM, wrote...
Most people have to figure out what Unicode they are using, count the bytes, divide by... and on,
and on. Not me, I just take that UTF8, or UTF16 string, convert it to UTF32, and do a count.
And then what do you do with that count? What do you use it for?
Say that I am writing a report and I only want to print the first 20 characters of a string.
A sequence of Unicode codepoints U+006F U+0302 U+0301 should be rendered as a single grapheme (ố),
which is what a human would think of as a "character". This is an actual character in Vietnamese.
Now, if you have several such triplets in a row in your string, and you chop it at 20 codepoints,
you'll only print 7 graphemes / "characters": six complete triplets plus the first two codepoints
of a seventh. Moreover, you'll end up dropping the last combining accent of that seventh triplet,
producing a different grapheme (ô) and potentially altering the meaning of the text. (I don't know
how much of a danger this is in Vietnamese, but I know that combining viramas
https://www.compart.com/en/unicode/combining/9 are vital to Indic languages, and dropping one will
in fact often produce a valid but different word.)
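
To make the arithmetic concrete, here is a rough Python sketch of the truncation described above. The helper name clusters() is made up for this illustration, and it is only an approximation: it attaches codepoints with a non-zero canonical combining class to the preceding codepoint, which is enough for this example but is not the full grapheme cluster algorithm from UAX #29.

```python
import unicodedata


def clusters(text):
    """Group each codepoint with the combining marks that follow it.

    Hypothetical helper for illustration only. It is a simplification, not
    full UAX #29 segmentation: no ZWJ, Hangul, or emoji handling.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch          # combining mark: attach to previous cluster
        else:
            out.append(ch)         # anything else starts a new cluster
    return out


triplet = "\u006F\u0302\u0301"     # o + combining circumflex + combining acute
text = triplet * 10                # 30 codepoints, 10 graphemes

chopped = text[:20]                # naive "first 20 characters" by codepoint
chopped_clusters = clusters(chopped)

print(len(text), len(clusters(text)))        # codepoints vs. clusters
print(len(chopped_clusters))                 # clusters left after the chop
print([f"U+{ord(c):04X}" for c in chopped_clusters[-1]])  # last, damaged cluster

# Truncating on cluster boundaries instead keeps every grapheme intact.
safe = "".join(clusters(text)[:7])
```

Slicing the list of clusters rather than the string of codepoints, as in the last line, is one way to get "the first N characters" without splitting a grapheme, within the limits of the simplified helper.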
A more colorful example: many emoji are composed of a long sequence of Unicode codepoints joined
by zero-width joiners (U+200D): 👨‍👩‍👶‍👶 would be U+1F468 U+200D U+1F469 U+200D U+1F476 U+200D
U+1F476 ( https://emojipedia.org/family-man-woman-baby-baby/ ). Truncating such a sequence at an
arbitrary point is likely to produce a valid emoji with a very different meaning.
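
As a small illustration, the Python sketch below builds that sequence from the codepoints listed above and shows how its length depends on what you count. The variable names are only for this example. Note that even the UTF-32 count (7) is nowhere near the one "character" a reader sees.

```python
# The family sequence from the message, written out codepoint by codepoint.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F476\u200D\U0001F476"

codepoints = len(family)                            # UTF-32 units / codepoints
utf16_units = len(family.encode("utf-16-le")) // 2  # UTF-16 code units
utf8_bytes = len(family.encode("utf-8"))            # UTF-8 bytes

print(codepoints, utf16_units, utf8_bytes)

# Every prefix is still a well-formed string, so nothing stops a
# codepoint-based truncation from cutting the sequence in the middle.
for n in range(1, codepoints + 1):
    prefix = family[:n]
    print(n, " ".join(f"U+{ord(c):04X}" for c in prefix))
```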
Igor Tandetnik
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users