Re: Least used parts of BMP.

Otto Stolz Fri, 04 Jun 2010 06:45:16 -0700

Hello,

On 2010-06-03 07:07, Kannan Goundan wrote:

> This is currently what I do (I was referring to this as the "compact
> UTF-8-like encoding").  The one difference is that I put all the
> marker bits in the first byte (instead of in the high bit of every
> byte):
>    0xxxxxxx
>    10xxxxxx xyyyyyyy
>    110xxxxx xxyyyyyy yzzzzzzz


The problem with this encoding is that the trailing bytes
are not clearly marked: a trailing byte may start with any of
'0', '10', or '110', exactly like a leading byte; only '111'
would mark a byte unambiguously as a trailing one.
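
To make the ambiguity concrete, here is a minimal Python sketch of the
three forms quoted above. The names compact_encode and looks_like are
hypothetical, and the sketch assumes the payload bits simply hold the
value with no offset, which the quoted post does not specify.

```python
def compact_encode(value: int) -> bytes:
    """Encode value with all marker bits in the first byte (hypothetical helper)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 0x80:                      # 0xxxxxxx
        return bytes([value])
    if value < 0x4000:                    # 10xxxxxx + 1 unmarked byte
        return bytes([0x80 | (value >> 8), value & 0xFF])
    if value < 0x200000:                  # 110xxxxx + 2 unmarked bytes
        return bytes([0xC0 | (value >> 16), (value >> 8) & 0xFF, value & 0xFF])
    raise ValueError("value too large for the three forms shown")


def looks_like(byte: int) -> str:
    """Classify a byte by its leading bits, as a reader landing mid-stream would."""
    if (byte & 0x80) == 0x00:
        return "single byte"
    if (byte & 0xC0) == 0x80:
        return "start of 2-byte group"
    if (byte & 0xE0) == 0xC0:
        return "start of 3-byte group"
    return "trailing byte"  # leading bits 111: not a start byte in the forms shown


encoded = compact_encode(0x141)           # two bytes: 0x81 0x41
print([looks_like(b) for b in encoded])
```

The second byte of that group is 0x41, which is also the complete
encoding of the value 0x41, so a reader who starts at that byte cannot
tell a trailing byte from a one-byte group.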

In contrast, in UTF-8 every single byte carries a marker
that unambiguously identifies it as a single ASCII byte,
a leading byte, or a continuation byte; hence you do not have to
go back to the beginning of the whole data stream to recognize,
and decode, a group of bytes.
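
As an illustration of that property, the following Python sketch finds
the start of the character containing an arbitrary byte offset by
looking only at nearby bytes. The helper names utf8_kind and
utf8_char_start are hypothetical, and the sketch assumes well-formed
UTF-8 input (it does not validate lead bytes or sequence lengths).

```python
def utf8_kind(byte: int) -> str:
    """Classify one UTF-8 byte from its own marker bits."""
    if (byte & 0x80) == 0x00:
        return "ascii"          # 0xxxxxxx
    if (byte & 0xC0) == 0x80:
        return "continuation"   # 10xxxxxx
    return "lead"               # 11xxxxxx (assumed valid, not checked further)


def utf8_char_start(data: bytes, i: int) -> int:
    """Step back from offset i to the first byte of the enclosing character."""
    while i > 0 and utf8_kind(data[i]) == "continuation":
        i -= 1
    return i


data = "a€b".encode("utf-8")     # 'a', three bytes for the euro sign, 'b'
start = utf8_char_start(data, 3) # offset 3 is the last byte of the euro sign
print(start, data[start:start + 3].decode("utf-8"))
```

For well-formed input the loop steps back at most three bytes, since a
UTF-8 sequence is at most four bytes long.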

Best wishes,
  Otto Stolz
