<anbu at peoplestring dot com> wrote:
> This clearly shows that my design yields number of values more than
> double that of UTF8
I didn't know we were competing against UTF-8 on efficiency. That's
easy. UTF-8 is not at all guaranteed to be the most efficient encoding
possible, or even reasonably possible. It was originally scoped to be
"not extravagant" in terms of space, while providing other design
features like byte boundaries, full ASCII transparency, easy detection,
and prefixes that quickly indicate the length of the sequence.
It's easy to beat the efficiency of UTF-8 in a byte-based encoding, if
many of its other design features are ignored:
0xxxxxxx - encodes U+0000 through U+007F
1xxxxxxx 0xxxxxxx - encodes U+0080 through U+3FFF
1xxxxxxx 1xxxxxxx 0xxxxxxx - encodes U+4000 through U+10FFFF
(and onward to 0x1FFFFF)
This is a well-known and freely available technique, sometimes called
"self-delimiting numeric values" (RFC 6256) and sometimes by other
names.
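For illustration, here is a minimal Python sketch of the byte patterns above: seven payload bits per byte, with the high bit set on every byte except the last. The function names sdnv_encode and sdnv_decode are hypothetical, chosen for this example and not taken from RFC 6256, and the sketch makes no attempt to reject surrogates or non-shortest forms.

```python
def sdnv_encode(cp):
    """Encode one non-negative integer as 7-bit groups, most significant first."""
    if cp < 0:
        raise ValueError("value must be non-negative")
    groups = [cp & 0x7F]            # final byte: high bit clear
    cp >>= 7
    while cp:
        groups.append(0x80 | (cp & 0x7F))  # earlier bytes: high bit set
        cp >>= 7
    return bytes(reversed(groups))


def sdnv_decode(data):
    """Decode a byte string into the list of integers it contains."""
    values = []
    value = 0
    pending = False                 # True while inside an unfinished sequence
    for b in data:
        value = (value << 7) | (b & 0x7F)
        pending = True
        if not (b & 0x80):          # high bit clear marks the last byte
            values.append(value)
            value = 0
            pending = False
    if pending:
        raise ValueError("truncated sequence")
    return values


if __name__ == "__main__":
    for cp in (0x41, 0x800, 0x10000, 0x10FFFF):
        packed = sdnv_encode(cp)
        utf8 = chr(cp).encode("utf-8")
        print(f"U+{cp:04X}: {len(packed)} byte(s) here, {len(utf8)} in UTF-8")
```

Under this sketch U+0800 takes two bytes and U+10000 takes three, one fewer than UTF-8 in each case. The cost is that a final byte looks exactly like an ASCII byte, so a decoder that starts in the middle of a stream cannot tell which it is seeing.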
There are many reasons why a new encoding that is merely more efficient
than UTF-8, especially one that sacrifices byte-based processing or
other design features, will face a severe uphill battle in trying to
displace UTF-8.
--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell