> On Aug 7, 2017, at 8:29 AM, x <tam118...@hotmail.com> wrote:
> 
> I thought I had learned enough about this string lunacy to get by but finding 
> out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped 
> me over the edge. I assumed they both used the same codes but UTF16 allowed 
> some characters UTF8 didn’t have.

UTF-8 is backwards-compatible with ASCII. All 7-bit bytes (00-7f) represent the 
same characters as their ASCII equivalents. Beyond that, UTF-8 uses a sequence 
of two to four bytes in the range 80-ff to encode a single Unicode 
character/code-point. (You can sort of think of this as every byte holding a 
few bits of the actual character number, with its MSB set to 1: the first byte 
carries 3 to 5 bits and each following byte carries 6. That is why U+0085 comes 
out as the two bytes c2 85.)
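
To make the bit layout concrete, here is a minimal sketch in Python of encoding 
one code point by hand. The function name utf8_encode is my own invention for 
illustration, and it skips validation (it does not reject surrogates or values 
above U+10FFFF), so in real code you would just call str.encode("utf-8").

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative, no validation)."""
    if cp < 0x80:
        # 0xxxxxxx: plain ASCII, one byte
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 payload bits
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# U+0085 from the question: two bytes, c2 85
print(utf8_encode(0x85).hex())
```

The hand-rolled result should agree with Python's built-in encoder for any 
valid, non-surrogate code point.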

IMHO UTF-8 is the best general purpose text encoding. Code that works with 
ASCII (real 7-bit ASCII, not the nonstandard “extended” stuff) will generally 
work with UTF-8; the main thing to watch out for tends to be breaking or 
trimming strings, because you don’t want to cut part of a multibyte sequence. 
UTF-8 is also quite compact for languages written in the Latin alphabet 
(although less so for others: most CJK text takes three bytes per character, 
versus two in UTF-16).
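
As a rough illustration of the trimming problem, here is one way to cut a UTF-8 
byte string to a maximum length without splitting a multibyte sequence. The 
helper name truncate_utf8 is hypothetical, and the sketch assumes the input is 
already valid UTF-8 and that max_bytes is not negative.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Trim valid UTF-8 to at most max_bytes without cutting a sequence."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes look like 10xxxxxx. If the first byte we would drop
    # is one of those, the cut lands mid-sequence, so back up to the lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


text = "h\u00e9llo".encode("utf-8")   # b'h\xc3\xa9llo', the e-acute is 2 bytes
print(truncate_utf8(text, 2))          # drops the whole e-acute, not half of it
```

This only keeps code points whole. A base letter followed by a combining accent 
can still be separated, which needs grapheme-aware handling that this sketch 
does not attempt.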

16-bit encodings used to seem like a good idea back when Unicode had fewer than 
65,536 characters, so you could assume that one unichar = one character. Those 
days are long gone. Now dealing with UTF-16 has all the same problems as 
dealing with UTF-8 (i.e. multi-unit sequences, in this case surrogate pairs) 
without the benefits of compactness or ASCII compatibility.
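
For anyone who has not run into them, this is roughly how a code point above 
U+FFFF turns into two 16-bit units in UTF-16. The function name to_surrogates 
is made up for this sketch, and it assumes the input really is in the range 
U+10000 to U+10FFFF.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    v = cp - 0x10000              # leaves a 20-bit value
    high = 0xD800 | (v >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04x} {low:04x}")

# One character, but two 16-bit units once encoded
print(len("\U0001F600".encode("utf-16-le")) // 2)
```

So code that counts or indexes 16-bit units has the same "don't cut in the 
middle" problem as code that counts UTF-8 bytes.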

32-bit encodings are just silly, unless for some reason you really really have 
to optimize for speed over size (and even then the added size may well blow out 
your CPU caches and negate the speed boost.)
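
If you want to see the size trade-off for your own data, a quick comparison 
along these lines is enough. The helper name encoded_sizes is hypothetical, and 
the sample strings are just ones I picked; real text will vary.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


# ASCII text: UTF-8 is half the size of UTF-16 and a quarter of UTF-32
print(encoded_sizes("hello"))

# Three CJK characters: UTF-8 needs 3 bytes each, UTF-16 only 2
print(encoded_sizes("\u65e5\u672c\u8a9e"))
```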

—Jens

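PS: Apparently C++11 allows Unicode string literals by putting a prefix in 
front of the initial quote: u8 gives UTF-8 in a string of char, u gives UTF-16 
in a string of char16_t, and U gives UTF-32 in a string of char32_t. (The older 
L prefix is the one that gives a string of wchar_t.)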
