Mark Davis ☕ <mark at macchiato dot com> replied to Otto Stolz <Otto dot Stolz at uni dash konstanz dot de>:
>> The problem with this encoding is that the trailing bytes
>> are not clearly marked: they may start with any of
>> '0', '10', or '110'; only '111' would mark a byte
>> unambiguously as a trailing one.
>>
>> In contrast, in UTF-8 every single byte carries a marker
>> that unambiguously marks it as either a single ASCII byte,
>> a starting, or a continuation byte; hence you have not to
>> go back to the beginning of the whole data stream to recognize,
>> and decode, a group of bytes.
>
> In a compression format, that doesn't matter; you can't expect random
> access, nor many of the other features of UTF-8.

That said, if Kannan were to go with the alternative format suggested on
this list:

0xxxxxxx
1xxxxxxx 0yyyyyyy
1xxxxxxx 1yyyyyyy 0zzzzzzz

then he would at least have this one feature of UTF-8, at no additional
cost in bits compared to the format he is using today.

Of course, he will not have other UTF-8-like features, such as avoidance
of ASCII values in the final trail byte, and "fast forward parsing" by
looking at the first byte. He may not care. One thing I've noted about
descriptions of UTF-8, in the context of alternative formats for private
protocols, is that they always assume these features are important to
everyone, when they may not be.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages: is dot gd slash 2kf0s
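
For illustration, the three-row format quoted in the message above (0xxxxxxx / 1xxxxxxx 0yyyyyyy / 1xxxxxxx 1yyyyyyy 0zzzzzzz) might be sketched in Python as follows. This is only a sketch under assumptions the message does not state: the x bits are taken to be the most significant, the 21 payload bits are taken to hold a Unicode code point directly, and the names `encode`, `sequence_start`, and `decode_at` are hypothetical. It shows the one UTF-8-like feature under discussion: since only the last byte of a sequence has its high bit clear, a reader can find the start of a sequence from any offset without rescanning the stream.

```python
def encode(cp: int) -> bytes:
    """Encode one code point in 1 to 3 bytes; high bit set means 'more bytes follow'."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x4000:                    # 1xxxxxxx 0yyyyyyy
        return bytes([0x80 | (cp >> 7), cp & 0x7F])
    # 1xxxxxxx 1yyyyyyy 0zzzzzzz (assumed bit order: x is most significant)
    return bytes([0x80 | (cp >> 14), 0x80 | ((cp >> 7) & 0x7F), cp & 0x7F])


def sequence_start(data: bytes, pos: int) -> int:
    """Back up from an arbitrary offset to the first byte of its sequence.

    A sequence starts at offset 0 or right after a byte whose high bit is clear.
    """
    while pos > 0 and data[pos - 1] & 0x80:
        pos -= 1
    return pos


def decode_at(data: bytes, start: int) -> tuple[int, int]:
    """Decode the sequence beginning at 'start'; return (code point, next offset)."""
    cp = 0
    pos = start
    while True:
        b = data[pos]
        cp = (cp << 7) | (b & 0x7F)
        pos += 1
        if not b & 0x80:               # high bit clear: this was the final byte
            return cp, pos
        if pos - start == 3:
            raise ValueError("sequence longer than 3 bytes")


data = encode(0x41) + encode(0xE9) + encode(0x1F600)
start = sequence_start(data, 4)        # offset 4 is in the middle of the last sequence
code_point, next_pos = decode_at(data, start)
```

Under the same assumptions, U+1F600 encodes with a final byte of 0x00, which illustrates the point that this format does not keep ASCII values out of the final trail byte.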

