In a compression format, that doesn't matter; you can't expect random
access, nor many of the other features of UTF-8.

The minimal expectation for this kind of simple compression is that when
you write a string with a particular *write* method, and then read it back
with the corresponding *read* method, you get exactly the original string
contents back, and you consume exactly as many bytes as you had written.
There are really no other guarantees.
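
As a rough illustration of that round-trip expectation, here is a minimal
Python sketch built on the lead-byte layout quoted below. The function names
(write_cp, read_cp, write_string, read_string) are hypothetical, and the
length prefix (a code-point count stored in the same variable-length form)
is an assumption made here so that the reader knows where to stop; the
thread does not say how string boundaries are stored.

```python
def write_cp(cp, out):
    """Append one code point to `out` (a bytearray)."""
    if cp < 0 or cp >= 0x200000:
        raise ValueError("value does not fit in 21 bits")
    if cp < 0x80:                       # 0xxxxxxx
        out.append(cp)
    elif cp < 0x4000:                   # 10xxxxxx xyyyyyyy
        out.append(0x80 | (cp >> 8))
        out.append(cp & 0xFF)
    else:                               # 110xxxxx xxyyyyyy yzzzzzzz
        out.append(0xC0 | (cp >> 16))
        out.append((cp >> 8) & 0xFF)
        out.append(cp & 0xFF)


def read_cp(buf, pos):
    """Return (code point, position just past it)."""
    b0 = buf[pos]
    if b0 < 0x80:
        return b0, pos + 1
    if b0 < 0xC0:
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if b0 < 0xE0:
        return ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    raise ValueError("lead byte 111xxxxx is not used by this layout")


def write_string(s, out):
    """Write an assumed length prefix, then every code point of `s`."""
    write_cp(len(s), out)
    for ch in s:
        write_cp(ord(ch), out)


def read_string(buf, pos):
    """Return (string, position just past the bytes the writer produced)."""
    count, pos = read_cp(buf, pos)
    chars = []
    for _ in range(count):
        cp, pos = read_cp(buf, pos)
        chars.append(chr(cp))
    return "".join(chars), pos
```

The only property this sketch aims for is the one stated above: reading
returns the original string and stops exactly where the writer stopped, even
when unrelated bytes follow in the same buffer.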

Mark

— Il meglio è l’inimico del bene —


On Fri, Jun 4, 2010 at 06:35, Otto Stolz <[email protected]> wrote:

> Hello,
>
> On 2010-06-03 07:07, Kannan Goundan wrote:
>
>> This is currently what I do (I was referring to this as the "compact
>> UTF-8-like encoding").  The one difference is that I put all the
>> marker bits in the first byte (instead of in the high bit of every
>> byte):
>>   0xxxxxxx
>>   10xxxxxx xyyyyyyy
>>   110xxxxx xxyyyyyy yzzzzzzz
>>
>
> The problem with this encoding is that the trailing bytes
> are not clearly marked: they may start with any of
> '0', '10', or '110'; only '111' would mark a byte
> unambiguously as a trailing one.
>
> In contrast, in UTF-8 every single byte carries a marker
> that unambiguously marks it as either a single ASCII byte,
> a starting, or a continuation byte; hence you do not have to
> go back to the beginning of the whole data stream to recognize,
> and decode, a group of bytes.
>
> Best wishes,
>  Otto Stolz
>
>
>
>
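
The contrast drawn in the quoted message can be seen in a short Python
sketch. The helper names (utf8_char_start, compact_lead_kind) are
hypothetical, and the byte pairs for the compact layout are worked out by
hand from the bit patterns quoted above, so treat this as an illustration
rather than anyone's actual implementation.

```python
def utf8_char_start(buf, pos):
    """Back up from any offset to the first byte of the UTF-8 character there."""
    # Continuation bytes always look like 10xxxxxx, so they can be skipped.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


def compact_lead_kind(b):
    """Classify a byte as if it were a first byte in the compact layout."""
    if (b & 0x80) == 0x00:
        return "single byte"
    if (b & 0xC0) == 0x80:
        return "lead of 2"
    if (b & 0xE0) == 0xC0:
        return "lead of 3"
    return "unused (111xxxxx)"


# UTF-8: "a", EURO SIGN, "b" is 61 E2 82 AC 62; offsets 2 and 3 fall inside
# the euro sign, and both resolve to its first byte at offset 1.
utf8_data = "a\u20acb".encode("utf-8")
starts = [utf8_char_start(utf8_data, i) for i in range(len(utf8_data))]

# Compact layout: the second byte of a 2-byte sequence is 8 unmarked payload
# bits, so in isolation it can look like any kind of first byte.
trailing_bytes = {
    "U+0141 -> 81 41": 0x41,   # trailing byte looks like ASCII "A"
    "U+0081 -> 80 81": 0x81,   # trailing byte looks like a 2-byte lead
    "U+00C1 -> 80 C1": 0xC1,   # trailing byte looks like a 3-byte lead
}
kinds = {label: compact_lead_kind(b) for label, b in trailing_bytes.items()}
```

A reader dropped into the middle of compact data therefore cannot tell
whether it is looking at the start of a character, which is why decoding has
to begin at a known boundary.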
