Re: Fwd: Re: Unicode, SMS and year 2012

Doug Ewell Sat, 28 Apr 2012 12:21:54 -0700

<anbu at peoplestring dot com> wrote:

> This clearly shows that my design yields number of values more than
> double that of UTF8

I didn't know we were competing against UTF-8 on efficiency. That's easy. UTF-8 is not at all guaranteed to be the most efficient encoding possible, or even reasonably possible. It was originally scoped to be "not extravagant" in terms of space, while providing other design features like byte boundaries, full ASCII transparency, easy detection, and prefixes that quickly indicate the length of the sequence.

It's easy to beat the efficiency of UTF-8 in a byte-based encoding, if many of its other design features are ignored:


0xxxxxxx - encodes U+0000 through U+007F
1xxxxxxx 0xxxxxxx - encodes U+0080 through U+3FFF
1xxxxxxx 1xxxxxxx 0xxxxxxx - encodes U+4000 through U+10FFFF
(and onward to 0x1FFFFF)

This is a well-known and freely available technique, sometimes called "self-delimiting numeric values" (RFC 6256) and sometimes by other names.
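
For illustration only, here is a rough Python sketch of the scheme laid out above: seven payload bits per byte, with the high bit set on every byte except the last one of a sequence. The function names sdnv_encode and sdnv_decode are made up for this example and are not taken from RFC 6256 or from any library.

```python
def sdnv_encode(cp: int) -> bytes:
    """Encode one non-negative integer (e.g. a code point) as 7-bit groups."""
    if cp < 0:
        raise ValueError("value must be non-negative")
    # The final byte carries the low 7 bits with the high bit clear.
    groups = [cp & 0x7F]
    cp >>= 7
    # Every earlier byte carries 7 more bits with the high bit set.
    while cp:
        groups.append((cp & 0x7F) | 0x80)
        cp >>= 7
    return bytes(reversed(groups))


def sdnv_decode(data: bytes) -> list[int]:
    """Decode a concatenation of encoded values back into integers."""
    values = []
    value = 0
    pending = False
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        pending = True
        if not byte & 0x80:  # high bit clear: last byte of this value
            values.append(value)
            value = 0
            pending = False
    if pending:
        raise ValueError("input ends in the middle of a sequence")
    return values


if __name__ == "__main__":
    for ch in ("A", "\u20ac", "\U0010ffff"):
        packed = sdnv_encode(ord(ch))
        print(f"U+{ord(ch):04X}: {len(packed)} byte(s) here, "
              f"{len(ch.encode('utf-8'))} in UTF-8")
```

Note what is given up: a byte with the high bit clear can be either an ASCII character or the tail of a longer sequence, so the ASCII transparency and easy detection that UTF-8 provides are gone.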

There are many reasons why a new encoding that is merely more efficient than UTF-8, especially one that sacrifices byte-based processing or other design features, will face a severe uphill battle in trying to displace UTF-8.


--
Doug Ewell | Thornton, Colorado, USA

http://www.ewellic.org | @DougEwell
