> On Jun 10, 2017, at 11:24 AM, Richard Damon <rich...@damon-family.org> wrote:
> 
> If the field was declared as char[38], and UNICODE collation, many systems 
> will allocate 4 bytes per character to allow for any possible character

Maybe, but it’s a bad design! It’s a huge amount of bloat for most text that 
would be stored in a database (even most Asian characters fit in 16 bits.) 
And worse, it doesn’t actually make working with text easier, because you can 
_not_ treat every 32-bit code point as a character. There are arcane Unicode 
rules for combining code points: an accented letter may be represented as the 
base letter followed by a combining accent mark, and some glyphs (including many 
emoji!) are composed of multiple code points joined together. So even something 
as simple as “how many characters are in this string?” requires scanning the 
string, not just a simple array lookup.
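To make that concrete, here’s a small Python sketch (standard library only) showing the mismatch between code-point counts and user-perceived characters:

```python
import unicodedata

# "é" as a single precomposed code point, U+00E9
precomposed = "\u00e9"
# The same "é" as base letter "e" (U+0065) plus a combining acute accent (U+0301)
decomposed = "e\u0301"

# Both render identically, but their code-point counts differ
print(len(precomposed))  # 1
print(len(decomposed))   # 2

# NFC normalization folds the base-plus-accent pair into the precomposed form
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# A family emoji: three emoji joined by zero-width joiners (U+200D)
family = "\U0001f468\u200d\U0001f469\u200d\U0001f467"
print(len(family))  # 5 code points for what displays as one glyph
```

So len() over code points, whether they’re 8, 16, or 32 bits wide, does not answer “how many characters does the user see?”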

In the end, UTF-8 almost always becomes the best encoding, since everything has 
to be treated as variable-width anyway.

—Jens
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users