Actually, at most 4 bytes (not five) are required to encode a single valid 
code point in UTF-8, since Unicode code points stop at U+10FFFF.
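
To make the byte counts concrete, here is a minimal Python sketch (standard 
library only). The helper name utf8_hex and the sample code points are my own 
picks for illustration, not anything from the thread. It also shows why U+0085 
comes out as the two bytes C2 85 in UTF-8: the code point is the same in both 
encodings, only the byte representation differs.

```python
# Sample code points chosen for illustration: one per UTF-8 length class,
# plus U+0085 from the original question and the highest code point.
SAMPLES = ["\u0041", "\u0085", "\u20ac", "\U0001F600", "\U0010FFFF"]


def utf8_hex(ch):
    """Return the UTF-8 bytes of a one-character string as spaced hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


def utf16_hex(ch):
    """Return the big-endian UTF-16 bytes of a one-character string as hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-16-be"))


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}  utf-8: {utf8_hex(ch):<12} utf-16: {utf16_hex(ch)}")
```

U+0085 is a single 16-bit unit (00 85) in UTF-16 and the two-byte sequence 
c2 85 in UTF-8; even U+10FFFF, the highest code point, needs only four bytes.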


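On the trimming point in the quoted message below: one way to avoid cutting a 
multibyte sequence is to back up from the cut position past any continuation 
bytes (those of the form 10xxxxxx). This is only a sketch; truncate_utf8 is a 
hypothetical helper name, and it assumes the input is already valid UTF-8. It 
keeps code points whole, but does not know about combining characters.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut valid UTF-8 to at most `limit` bytes without splitting a code point."""
    if len(data) <= limit:
        return data
    end = limit
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a sequence,
    # so move the cut back to the start of that sequence.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


if __name__ == "__main__":
    raw = "naïve".encode("utf-8")  # the i-diaeresis takes two bytes: c3 af
    print(truncate_utf8(raw, 3))   # the cut moves back so c3 af is not split
    print(truncate_utf8(raw, 4))
```
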
> On Aug 8, 2017, at 2:44 AM, Jens Alfke <j...@mooseyard.com> wrote:
> 
> 
>> On Aug 7, 2017, at 8:29 AM, x <tam118...@hotmail.com> wrote:
>> 
>> I thought I had learned enough about this string lunacy to get by but 
>> finding out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 
>> has tipped me over the edge. I assumed they both used the same codes but 
>> UTF16 allowed some characters UTF8 didn’t have.
> 
> UTF-8 is backwards-compatible with ASCII. All 7-bit bytes (00-7f) represent 
> the same characters as their ASCII equivalents. Beyond that, UTF-8 uses a 
> sequence of two to five bytes in the range 80-ff to encode a single Unicode 
> character/code-point. (You can sort of think of this as every byte holding 7 
> bits of the actual character number, with its MSB set to 1. It’s not exactly 
> like that, but close.)
> 
> IMHO UTF-8 is the best general purpose text encoding. Code that works with 
> ASCII (real 7-bit ASCII, not the nonstandard “extended” stuff) will generally 
> work with UTF-8; the main thing to watch out for tends to be breaking or 
> trimming strings, because you don’t want to cut part of a multibyte sequence. 
> UTF-8 is also quite compact for Roman languages (although not non-Roman ones.)
> 
> 16-bit encodings used to seem like a good idea back when Unicode had fewer 
> than 65,536 characters, so you could assume that one unichar = one character. 
> Those days are long gone. Now dealing with UTF-16 has all the same problems 
> of dealing with UTF-8 (i.e. multi-word sequences) without the benefits of 
> compactness or ASCII compatibility.
> 
> 32-bit encodings are just silly, unless for some reason you really really 
> have to optimize for speed over size (and even then the added size may well 
> blow out your CPU caches and negate the speed boost.)
> 
> —Jens
> 
> PS: Apparently C++11 allows Unicode string literals by putting a letter U in 
> front of the initial quote. The result will be a string of char32_t (the 
> older L prefix is the one that gives wchar_t).
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
