On 11/11/19 12:30 PM, Jose Isaias Cabrera wrote:
> Igor Tandetnik, on Monday, November 11, 2019 11:02 AM, wrote...
>> On 11/11/2019 10:49 AM, Jose Isaias Cabrera wrote:
>>> So, yes, it's bulky, but if you want to count characters in languages
>>> such as Arabic, Hebrew, Chinese, Japanese, etc., the easiest way is to
>>> convert that string to UTF-32 and do a character count on that UTF-32
>>> variable.
>> Between ligatures and combining diacritics, the number of Unicode code
>> points in a string has little practical meaning. E.g. it is not
>> necessarily correlated with the width of the string as displayed on the
>> screen or on paper, or with the number of graphemes a human would say
>> the string contains, if asked.
> That could be true, but have you tried to display a specific number of
> characters from a UTF-8 string containing Hebrew, Arabic, Chinese, or
> Japanese text (see below)?
>
>>> Most people have to figure out which Unicode encoding they are using,
>>> count the bytes, divide by... and on, and on. Not me; I just take that
>>> UTF-8 or UTF-16 string, convert it to UTF-32, and do a count.
>> And then what do you do with that count? What do you use it for?
> Say that I am writing a report and I only want to print the first 20
> characters of a string. That would be something like,
>
>     if (var.length > 20)
>     {
>         writefln(var[0 .. 20]);
>     }
>     else
>     {
>         writefln((var ~ "                    ")[0 .. 20]);
>     }
>
> If var is declared as UTF-8 and the string contains Chinese text, or
> text in some other multi-byte language, this will never print 20
> Chinese characters; it will print fewer. If I convert that UTF-8 string
> to UTF-32, then each multi-byte character fits in one UTF-32 element.
> So,
>
>     dchar[] var32 = std.utf.toUTF32(var);
>     if (var32.length > 20)
>     {
>         writefln(var32[0 .. 20]);
>     }
>     else
>     {
>         writefln((var32 ~ "                    "d)[0 .. 20]);
>     }
>
> This will always print 20 characters, whether these are ASCII or
> multi-byte language characters. Thanks.
>
> josé
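For readers outside D, the slicing idea in the quoted code can be sketched in Python, where `str` is a sequence of Unicode code points (so slicing it behaves like slicing the `dchar[]` array above). The helper name `first20` and the sample strings are invented for illustration:

```python
def first20(var: str) -> str:
    """Return exactly 20 code points, padding with spaces if needed."""
    return (var + " " * 20)[:20]

ascii_s = "hello"
cjk_s = "日本語のテキストです"  # 10 CJK code points

# Code-point slicing gives a fixed count regardless of the script:
assert len(first20(ascii_s)) == 20
assert len(first20(cjk_s)) == 20

# Slicing the UTF-8 *bytes* instead cuts multi-byte characters short:
raw = cjk_s.encode("utf-8")  # 30 bytes (3 bytes per character here)
assert len(raw[:20]) == 20   # 20 bytes, but not 20 characters
assert len(raw[:20].decode("utf-8", errors="ignore")) < 20
```

This is the same contrast the quoted D code draws between slicing a `char[]` (UTF-8 code units) and a `dchar[]` (UTF-32 elements).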
Writing 20 UTF-32 characters may ALSO print fewer than 20 glyphs to the
screen. One quick way to see this is that there is a need for NFD and NFC
representations: some characters can be decomposed from a combined
character into a base character + a combining character, so a string in
NFD form may naturally 'compress' itself when being printed.

Then you need to remember that one of the reasons for providing the
combining characters was that it was decided not to create code points
for every possible composed character; many would be expressible only in
decomposed form, as a base character + combining character(s). Thus
code-point count is not the same as output glyph count. In fact, Unicode
works hard at avoiding the term 'character', as it isn't well defined:
it is often thought of as a glyph, but also sometimes as a code point.

--
Richard Damon
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
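The NFC/NFD point above is easy to demonstrate concretely; here is a minimal sketch in Python (chosen over the thread's D because the standard `unicodedata` module exposes normalization directly):

```python
import unicodedata

# "é" as one precomposed code point (NFC) vs. base letter + combining
# accent (NFD). Both normal forms render as the same single glyph.
nfc = "\u00e9"                             # LATIN SMALL LETTER E WITH ACUTE
nfd = unicodedata.normalize("NFD", nfc)    # 'e' + U+0301 COMBINING ACUTE

assert nfd == "e\u0301"
assert len(nfc) == 1   # one code point
assert len(nfd) == 2   # two code points, still one glyph on screen

# So slicing N code points from an NFD string can display fewer than N
# glyphs -- and a slice can even separate a base letter from its accent:
assert nfd[:1] == "e"  # the accent is lost
```

This is exactly why code-point count and glyph count diverge even in UTF-32, where every element is a full code point.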