On 6/4/2010 8:34 AM, Mark Davis ☕ wrote:
In a compression format, that doesn't matter; you can't expect random
access, nor many of the other features of UTF-8.
The minimal expectation for these kinds of simple compression is that
when you write a string with a particular /write/ method, and then
read it back with the corresponding /read/ method, you get exactly the
original string contents back, and you consume exactly as many bytes
as you had written. There are really no other guarantees.
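
The read/write contract described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: write_string and read_string are hypothetical names, and the 4-byte length prefix is a stand-in format chosen for brevity, not any compression scheme discussed in this thread. The point is the two checks at the end: the string comes back unchanged, and the reader stops exactly where the writer stopped.

```python
import io

def write_string(out, s):
    """Hypothetical writer: 4-byte big-endian length, then the UTF-8 payload.

    Returns the number of bytes written.
    """
    data = s.encode("utf-8")
    out.write(len(data).to_bytes(4, "big"))
    out.write(data)
    return 4 + len(data)

def read_string(inp):
    """Hypothetical reader matching write_string."""
    n = int.from_bytes(inp.read(4), "big")
    return inp.read(n).decode("utf-8")

buf = io.BytesIO()
written = write_string(buf, "naïve ☕")
buf.write(b"TAIL")            # unrelated data that follows the string

buf.seek(0)
result = read_string(buf)
consumed = buf.tell()         # how far the reader advanced

# The contract: same contents back, same number of bytes consumed.
assert result == "naïve ☕"
assert consumed == written
```

Whatever follows the string in the stream (here the bytes TAIL) is left untouched for the next read call.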
Actually, SCSU makes an additional guarantee, which is that you can edit
the compressed string. In other words, you can insert a substring such
that the new string remains a valid compressed string and the parts
preceding and following the insertion, when read, match the
corresponding portion of the original after decoding. I remember that
this was an important design criterion for the precursor RCSU. Their
implementation required the ability to deliver a "patch" to a compressed
string, something that isn't possible with many other compression formats.
So there is a sliding scale in features, each compression method being
designed to address the specific requirements of a given application.
A./
Mark
— Il meglio è l’inimico del bene —
On Fri, Jun 4, 2010 at 06:35, Otto Stolz wrote:
Hello,
On 2010-06-03 07:07, Kannan Goundan wrote:
This is currently what I do (I was referring to this as the "compact
UTF-8-like encoding"). The one difference is that I put all the
marker bits in the first byte (instead of in the high bit of every
byte):
0xxxxxxx
10xxxxxx xyyyyyyy
110xxxxx xxyyyyyy yzzzzzzz
The problem with this encoding is that the trailing bytes
are not clearly marked: they may start with any of
'0', '10', or '110'; only '111' would mark a byte
unambiguously as a trailing one.
In contrast, in UTF-8 every single byte carries a marker
that unambiguously marks it as either a single ASCII byte,
a starting byte, or a continuation byte; hence you do not have to
go back to the beginning of the whole data stream to recognize,
and decode, a group of bytes.
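
The difference can be made concrete with a small Python sketch. It assumes my reading of the three bit patterns quoted above (7, 14 and 21 payload bits, big-endian, all marker bits in the first byte); the names encode_compact, decode_compact and utf8_resync are hypothetical and not taken from anyone's actual implementation.

```python
def encode_compact(text):
    """Encode using the layout quoted above: marker bits in the lead byte only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:                       # 0xxxxxxx
            out.append(cp)
        elif cp < 0x4000:                   # 10xxxxxx xyyyyyyy
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])
        else:                               # 110xxxxx xxyyyyyy yzzzzzzz
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    return bytes(out)

def decode_compact(data, start=0):
    """Decode from `start`, trusting that data[start] is a lead byte."""
    chars = []
    i = start
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n, cp = 1, lead
        elif lead < 0xC0:
            n, cp = 2, lead & 0x3F
        elif lead < 0xE0:
            n, cp = 3, lead & 0x1F
        else:
            raise ValueError("lead byte 111xxxxx is not used by this format")
        if i + n > len(data):
            raise ValueError("truncated sequence")
        for b in data[i + 1:i + n]:         # trailing bytes carry 8 payload bits
            cp = (cp << 8) | b
        chars.append(chr(cp))
        i += n
    return "".join(chars)

def utf8_resync(data, start):
    """Skip UTF-8 continuation bytes (10xxxxxx) to reach the next boundary."""
    i = start
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

text = "A☕B"
compact = encode_compact(text)              # 41 A6 15 42
utf8 = text.encode("utf-8")                 # 41 E2 98 95 42

# Start reading at offset 2, i.e. in the middle of the encoded U+2615.
mid_compact = decode_compact(compact, start=2)
mid_utf8 = utf8[utf8_resync(utf8, 2):].decode("utf-8")

assert decode_compact(compact) == text      # fine when read from the start
assert mid_compact == "\x15B"               # trailing byte 0x15 taken for U+0015
assert mid_utf8 == "B"                      # UTF-8 skips ahead to a real boundary
```

In the compact form the trailing byte 0x15 is indistinguishable from a one-byte character, so a reader dropped into the middle of the stream silently produces a wrong character. In UTF-8 the bytes 0x98 and 0x95 are visibly continuation bytes, so the reader can skip them and resume at the next character boundary.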
Best wishes,
Otto Stolz