On 11/11/19 12:30 PM, Jose Isaias Cabrera wrote:
> Igor Tandetnik, on Monday, November 11, 2019 11:02 AM, wrote...
>> On 11/11/2019 10:49 AM, Jose Isaias Cabrera wrote:
>>> So, yes, it's bulky, but if you want to count characters in languages
>>> such as Arabic, Hebrew, Chinese, Japanese, etc., the easiest way is to
>>> convert that string to UTF-32 and do a character count on that UTF-32
>>> variable.
>> Between ligatures and combining diacritics, the number of Unicode code
>> points in a string has little practical meaning. E.g. it is not
>> necessarily correlated with the width of the string as displayed on the
>> screen or on paper, or with the number of graphemes a human would say
>> the string contains, if asked.
> That could be true, but have you tried to display a specific number of
> characters from a UTF-8 string containing Hebrew, Arabic, Chinese, or
> Japanese text (see below)?
>
>>> Most people have to figure out which Unicode encoding they are using,
>>> count the bytes, divide by... and on, and on. Not me; I just take that
>>> UTF-8 or UTF-16 string, convert it to UTF-32, and do a count.
>> And then what do you do with that count? What do you use it for?
> Say that I am writing a report and I only want to print the first 20
> characters of a string. That would be something like,
>
>     if (var.length > 20)
>     {
>         writefln(var[0 .. 20]);
>     }
>     else
>     {
>         writefln((var ~ "                    ")[0 .. 20]);
>     }
>
> If var is declared as UTF-8 and the string contains Chinese text, or
> text in some other multi-byte language, this will never print 20
> Chinese characters; it will print fewer. If I convert that UTF-8 string
> to UTF-32, then each multi-byte character fits in one UTF-32 element.
> So,
>
>     dchar[] var32 = std.utf.toUTF32(var);
>     if (var32.length > 20)
>     {
>         writefln(var32[0 .. 20]);
>     }
>     else
>     {
>         writefln((var32 ~ "                    "d)[0 .. 20]);
>     }
>
> This will always print 20 characters, whether these are ASCII or
> multi-byte language characters. Thanks.
>
> josé
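For readers outside D, the slicing idea in the quoted code can be sketched in Python, where `str` is a sequence of Unicode code points (so slicing it behaves like slicing the `dchar[]` array above). The helper name `first20` and the sample strings are invented for illustration:

```python
def first20(var: str) -> str:
    """Return exactly 20 code points, padding with spaces if needed."""
    return (var + " " * 20)[:20]

ascii_s = "hello"
cjk_s = "日本語のテキストです"  # 10 CJK code points

# Code-point slicing gives a fixed count regardless of the script:
assert len(first20(ascii_s)) == 20
assert len(first20(cjk_s)) == 20

# Slicing the UTF-8 *bytes* instead cuts multi-byte characters short:
raw = cjk_s.encode("utf-8")  # 30 bytes (3 bytes per character here)
assert len(raw[:20]) == 20   # 20 bytes, but not 20 characters
assert len(raw[:20].decode("utf-8", errors="ignore")) < 20
```

This is the same contrast the quoted D code draws between slicing a `char[]` (UTF-8 code units) and a `dchar[]` (UTF-32 elements).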
Writing 20 UTF-32 characters may ALSO print fewer than 20 glyphs to the
screen. One quick way to see this is that there is a need for NFD and NFC
representations: some characters can be decomposed from a combined
character into a base character + a combining character, so a string in
NFD form may naturally 'compress' itself when being printed.

Then you need to remember that one of the reasons for providing the
combining characters was that it was decided not to create code points
for every possible composed character; many would be expressible only in
decomposed form, as a base character + combining character(s). Thus
code-point count is not the same as output glyph count. In fact, Unicode
works hard at avoiding the term 'character', as it isn't well defined:
it is often thought of as a glyph, but also sometimes as a code point.

--
Richard Damon
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
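The NFC/NFD point above is easy to demonstrate concretely; here is a minimal sketch in Python (chosen over the thread's D because the standard `unicodedata` module exposes normalization directly):

```python
import unicodedata

# "é" as one precomposed code point (NFC) vs. base letter + combining
# accent (NFD). Both normal forms render as the same single glyph.
nfc = "\u00e9"                             # LATIN SMALL LETTER E WITH ACUTE
nfd = unicodedata.normalize("NFD", nfc)    # 'e' + U+0301 COMBINING ACUTE

assert nfd == "e\u0301"
assert len(nfc) == 1   # one code point
assert len(nfd) == 2   # two code points, still one glyph on screen

# So slicing N code points from an NFD string can display fewer than N
# glyphs -- and a slice can even separate a base letter from its accent:
assert nfd[:1] == "e"  # the accent is lost
```

This is exactly why code-point count and glyph count diverge even in UTF-32, where every element is a full code point.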