Hello,

On 2010-06-03 07:07, Kannan Goundan wrote:
This is currently what I do (I was referring to this as the "compact
UTF-8-like encoding").  The one difference is that I put all the
marker bits in the first byte (instead of in the high bit of every
byte):
   0xxxxxxx
   10xxxxxx xyyyyyyy
   110xxxxx xxyyyyyy yzzzzzzz

The problem with this encoding is that the trailing bytes
are not clearly marked: they may start with any of
'0', '10', or '110'; only '111' would mark a byte
unambiguously as a trailing one.
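
To make the ambiguity concrete, here is a minimal Python sketch of the three-form layout quoted above. The function names `encode` and `decode_from` are hypothetical and chosen only for illustration; this is not the original poster's actual code, merely one reading of the bit layout shown.

```python
def encode(n: int) -> bytes:
    """Encode n with all marker bits in the first byte (layout as quoted)."""
    if n < 0:
        raise ValueError("negative values are not supported")
    if n < 1 << 7:        # 0xxxxxxx
        return bytes([n])
    if n < 1 << 14:       # 10xxxxxx xyyyyyyy
        return bytes([0x80 | (n >> 8), n & 0xFF])
    if n < 1 << 21:       # 110xxxxx xxyyyyyy yzzzzzzz
        return bytes([0xC0 | (n >> 16), (n >> 8) & 0xFF, n & 0xFF])
    raise ValueError("value too large for this three-form sketch")


def decode_from(data: bytes, pos: int):
    """Decode all values starting at pos, assuming pos is a first byte."""
    out = []
    while pos < len(data):
        first = data[pos]
        if first < 0x80:
            length, value = 1, first
        elif first < 0xC0:
            length, value = 2, first & 0x3F
        elif first < 0xE0:
            length, value = 3, first & 0x1F
        else:
            raise ValueError("first byte starts with '111'")
        if pos + length > len(data):
            raise ValueError("truncated group")
        # Trailing bytes carry eight payload bits each and no marker.
        for b in data[pos + 1:pos + length]:
            value = (value << 8) | b
        out.append(value)
        pos += length
    return out


stream = encode(300) + encode(65)   # bytes 0x81 0x2C 0x41
print(decode_from(stream, 0))       # correct start
print(decode_from(stream, 1))       # start inside the first group
```

Starting at offset 1, the decoder lands on the trailing byte 0x2C, which looks exactly like a single-byte value, so it returns a plausible but wrong result without raising any error.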

In contrast, in UTF-8 every single byte carries a marker
that unambiguously identifies it as either a single ASCII byte,
a lead byte, or a continuation byte; hence you do not have to
go back to the beginning of the whole data stream to recognize,
and decode, a group of bytes.
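
As a rough illustration of this property, the following Python sketch finds the first byte of a character from an arbitrary position in a UTF-8 byte string. The helper names `is_continuation` and `char_start` are hypothetical, and the sketch assumes the input is well-formed UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """A UTF-8 continuation byte always has the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, pos: int) -> int:
    """Step back from pos to the first byte of the character containing it."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


data = "a\u00e9\u20ac".encode("utf-8")   # 61 | C3 A9 | E2 82 AC
start = char_start(data, 5)              # pos 5 is the last byte of the euro sign
print(start, data[start:].decode("utf-8"))
```

At most three backward steps are ever needed for well-formed input, regardless of how long the stream is.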

Best wishes,
  Otto Stolz


