In a compression format, that doesn't matter; you can't expect random access, nor many of the other features of UTF-8.
The minimal expectation for this kind of simple compression is that when you write a string with a particular *write* method, and then read it back with the corresponding *read* method, you get exactly the original string contents back, and you consume exactly as many bytes as you had written. There are really no other guarantees.

Mark

— Il meglio è l’inimico del bene —

On Fri, Jun 4, 2010 at 06:35, Otto Stolz <[email protected]> wrote:

> Hello,
>
> On 2010-06-03 07:07, Kannan Goundan wrote:
>
>> This is currently what I do (I was referring to this as the "compact
>> UTF-8-like encoding"). The one difference is that I put all the
>> marker bits in the first byte (instead of in the high bit of every
>> byte):
>>   0xxxxxxx
>>   10xxxxxx xyyyyyyy
>>   110xxxxx xxyyyyyy yzzzzzzz
>
> The problem with this encoding is that the trailing bytes
> are not clearly marked: they may start with any of
> '0', '10', or '110'; only '111' would mark a byte
> unambiguously as a trailing one.
>
> In contrast, in UTF-8 every single byte carries a marker
> that unambiguously marks it as either a single ASCII byte,
> a starting byte, or a continuation byte; hence you do not have to
> go back to the beginning of the whole data stream to recognize,
> and decode, a group of bytes.
>
> Best wishes,
> Otto Stolz
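To make the round-trip expectation concrete, here is a rough Python sketch of the three-row layout Kannan describes (all marker bits in the first byte). The function names, the length prefix in front of the string, and the error handling are my own assumptions for illustration, not anything taken from his code.

```python
from typing import Tuple


def write_uint(value: int, out: bytearray) -> None:
    """Append value using the 1-, 2- or 3-byte layout quoted above."""
    if value < 0 or value > 0x1FFFFF:
        raise ValueError("value does not fit in 21 bits")
    if value < 0x80:                        # 0xxxxxxx
        out.append(value)
    elif value < 0x4000:                    # 10xxxxxx xyyyyyyy
        out.append(0x80 | (value >> 8))
        out.append(value & 0xFF)
    else:                                   # 110xxxxx xxyyyyyy yzzzzzzz
        out.append(0xC0 | (value >> 16))
        out.append((value >> 8) & 0xFF)
        out.append(value & 0xFF)


def read_uint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode one value starting at pos; return (value, next position)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        return ((first & 0x3F) << 8) | data[pos + 1], pos + 2
    if first < 0xE0:
        value = ((first & 0x1F) << 16) | (data[pos + 1] << 8) | data[pos + 2]
        return value, pos + 3
    raise ValueError("lead byte 111xxxxx is unused in this layout")


def write_string(s: str) -> bytes:
    """Assumed framing: code point count first, then each code point."""
    out = bytearray()
    write_uint(len(s), out)
    for ch in s:
        write_uint(ord(ch), out)
    return bytes(out)


def read_string(data: bytes, pos: int) -> Tuple[str, int]:
    """Inverse of write_string; returns (string, position after it)."""
    count, pos = read_uint(data, pos)
    chars = []
    for _ in range(count):
        cp, pos = read_uint(data, pos)
        chars.append(chr(cp))
    return "".join(chars), pos
```

Reading back what write_string produced returns the same string and stops exactly at the end of the written bytes, which is the only guarantee claimed above. It also shows Otto's point: U+4E16 is written as C0 4E 16, and the two trailing bytes look just like ASCII bytes, so a reader has to start from a known boundary.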

