> On Aug 7, 2017, at 8:29 AM, x <tam118...@hotmail.com> wrote:
>
> I thought I had learned enough about this string lunacy to get by but finding
> out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped
> me over the edge. I assumed they both used the same codes but UTF16 allowed
> some characters UTF8 didn’t have.
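
The "c285" you ran into is not a different character code. It is the two bytes C2 85 that UTF-8 uses to store the same code point, U+0085. Here is a minimal sketch of how those bytes fall out of the bit layout. `encode_utf8` is a hypothetical helper written for this message, not part of SQLite or any library, and it assumes its argument is a valid code point:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not from any library): encode one Unicode code point
// as UTF-8. Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encode_utf8(0x85)` takes the two-byte branch: the top bits of 0x85 go into the lead byte (0xC0 | 0x02 = 0xC2) and the low six bits into the continuation byte (0x80 | 0x05 = 0x85). In UTF-16 the same code point is the single 16-bit unit 0x0085, so the two encodings agree on the character number and differ only in how they store it.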

UTF-8 is backwards-compatible with ASCII. All 7-bit bytes (00-7f) represent the same characters as their ASCII equivalents. Beyond that, UTF-8 uses a sequence of two to four bytes, all in the range 80-ff, to encode a single Unicode character/code-point. (You can sort of think of each byte as carrying a few bits of the actual character number, with its MSB set to 1. More precisely, each continuation byte carries 6 bits, and the lead byte carries the remaining bits along with the length of the sequence.)

IMHO UTF-8 is the best general-purpose text encoding. Code that works with ASCII (real 7-bit ASCII, not the nonstandard “extended” stuff) will generally work with UTF-8; the main thing to watch out for tends to be breaking or trimming strings, because you don’t want to cut a multibyte sequence in half. UTF-8 is also quite compact for languages written in the Latin alphabet (less so for others: most CJK characters, for example, take three bytes in UTF-8 versus two in UTF-16).

16-bit encodings used to seem like a good idea back when Unicode had fewer than 65,536 characters, so you could assume that one unichar = one character. Those days are long gone. Now dealing with UTF-16 has all the same problems as dealing with UTF-8 (i.e. multi-word sequences, in this case surrogate pairs) without the benefits of compactness or ASCII compatibility.

32-bit encodings are just silly, unless for some reason you really, really have to optimize for speed over size (and even then the added size may well blow out your CPU caches and negate the speed boost).

--Jens

PS: C++11 allows Unicode string literals by putting a prefix in front of the initial quote: u8"..." gives a UTF-8 string of char, u"..." gives a UTF-16 string of char16_t, and U"..." gives a UTF-32 string of char32_t. A string of wchar_t comes from the older L"..." prefix, not from U.

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users