RE: Least used parts of BMP.

Doug Ewell Fri, 04 Jun 2010 09:08:28 -0700

Mark Davis ☕ <mark at macchiato dot com> replied to Otto Stolz <Otto
dot Stolz at uni dash konstanz dot de>:


>> The problem with this encoding is that the trailing bytes
>> are not clearly marked: they may start with any of
>> '0', '10', or '110'; only '111' would mark a byte
>> unambiguously as a trailing one.
>>
>> In contrast, in UTF-8 every single byte carries a marker
>> that unambiguously marks it as either a single ASCII byte,
>> a starting, or a continuation byte; hence you do not have to
>> go back to the beginning of the whole data stream to recognize,
>> and decode, a group of bytes.
>
> In a compression format, that doesn't matter; you can't expect random
> access, nor many of the other features of UTF-8.

That said, if Kannan were to go with the alternative format suggested on
this list:

0xxxxxxx
1xxxxxxx 0yyyyyyy
1xxxxxxx 1yyyyyyy 0zzzzzzz

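then he would at least have this one feature of UTF-8 (a reader can find
the next sequence boundary from any byte, without rescanning the stream
from the beginning), at no additional cost in bits compared to the format
he is using today.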
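
To make the layout above concrete, here is a minimal sketch in Python. It
is not Kannan's actual code. It assumes the 7-bit groups are stored most
significant first and that the values are plain integers of up to 21
bits; the post does not specify either. The names encode, decode and
next_boundary are hypothetical.

```python
def encode(value: int) -> bytes:
    """Encode one value (0..0x1FFFFF) as 1 to 3 bytes.

    Assumption: 7-bit groups are written most significant first.
    Every byte except the last has its high bit set.
    """
    if not 0 <= value < (1 << 21):
        raise ValueError("value does not fit in 21 bits")
    if value < (1 << 7):
        return bytes([value])
    if value < (1 << 14):
        return bytes([0x80 | (value >> 7), value & 0x7F])
    return bytes([0x80 | (value >> 14),
                  0x80 | ((value >> 7) & 0x7F),
                  value & 0x7F])


def decode(data: bytes) -> list:
    """Decode a whole byte string into a list of values."""
    out = []
    value = 0
    length = 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
        length += 1
        if b < 0x80:            # high bit clear: last byte of a sequence
            out.append(value)
            value = 0
            length = 0
        elif length == 3:       # third byte must have its high bit clear
            raise ValueError("sequence longer than 3 bytes")
    if length:
        raise ValueError("truncated sequence")
    return out


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first sequence start at or after pos.

    A sequence starts at offset 0 or right after a byte whose high bit
    is clear, so only the neighbouring bytes need to be examined.
    """
    while pos < len(data) and pos > 0 and data[pos - 1] >= 0x80:
        pos += 1
    return pos
```

With this sketch, a reader dropped at an arbitrary offset needs to look
at no more than a few neighbouring bytes to resynchronize, which is the
property Otto was asking for.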

Of course, he will not have other UTF-8-like features, such as avoidance
of ASCII values in the final trail byte, and "fast forward parsing" by
looking at the first byte.  He may not care.  One thing I've noted about
descriptions of UTF-8, in the context of alternative formats for private
protocols, is that they always assume these features are important to
everyone, when they may not be.
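
The two missing features can be illustrated with a short Python sketch.
As before, this assumes the 7-bit groups of the alternative format are
stored most significant first, and the function names are hypothetical.

```python
def utf8_length(lead: int) -> int:
    """Length of a UTF-8 sequence, known from the lead byte alone."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def alt_length(data: bytes, start: int) -> int:
    """Length of a sequence in the alternative format.

    There is no length in the first byte, so the reader has to scan
    forward until it meets a byte with the high bit clear.
    """
    n = 1
    while data[start + n - 1] >= 0x80:
        n += 1
    return n


def alt_encode(value: int) -> bytes:
    """Alternative format, assuming most significant group first."""
    if not 0 <= value < (1 << 21):
        raise ValueError("value does not fit in 21 bits")
    if value < (1 << 7):
        return bytes([value])
    if value < (1 << 14):
        return bytes([0x80 | (value >> 7), value & 0x7F])
    return bytes([0x80 | (value >> 14),
                  0x80 | ((value >> 7) & 0x7F),
                  value & 0x7F])


# U+00C1 becomes 0x81 0x41 in the alternative format, so a byte-level
# search for ASCII "A" (0x41) finds a false match in the final byte.
alt_bytes = alt_encode(0xC1)
utf8_bytes = "\u00C1".encode("utf-8")   # 0xC3 0x81, no ASCII byte values
false_match_in_alt = b"A" in alt_bytes
false_match_in_utf8 = b"A" in utf8_bytes
```

Whether either difference matters depends on the protocol: a private
format that is never searched bytewise and is always read sequentially
loses nothing here.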

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org 
RFC 5645, 4645, UTN #14 | ietf-languages: is dot gd slash 2kf0s
