On 6/4/2010 8:34 AM, Mark Davis ☕ wrote:
In a compression format, that doesn't matter; you can't expect random access, nor many of the other features of UTF-8.

The minimal expectation for these kinds of simple compression is that when you write a string with a particular /write/ method, and then read it back with the corresponding /read/ method, you get exactly the original string contents back, and you consume exactly as many bytes as you had written. There are really no other guarantees.
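
The round-trip contract described above can be sketched in Python. This is only an illustration under stated assumptions: the length-prefixed format and the names write_string and read_string are hypothetical stand-ins for whatever paired write/read methods a real compression scheme provides.

```python
import struct

# Hypothetical format (an assumption for illustration, not a real scheme):
# a 4-byte big-endian byte count followed by the UTF-8 payload.

def write_string(s: str) -> bytes:
    """Serialize one string; the caller may append more data afterwards."""
    payload = s.encode("utf-8")
    return struct.pack(">I", len(payload)) + payload

def read_string(buf: bytes, pos: int = 0):
    """Return (string, bytes_consumed) for the string starting at pos."""
    (n,) = struct.unpack_from(">I", buf, pos)
    start = pos + 4
    return buf[start:start + n].decode("utf-8"), 4 + n

# The contract: reading returns the original string and consumes exactly
# the bytes that were written, so the next item starts where this one ends.
first = write_string("h\u00e9llo")
stream = first + write_string("next")
text, consumed = read_string(stream)
assert text == "h\u00e9llo" and consumed == len(first)
```

Nothing in this contract promises that a reader can start in the middle of the stream; it only promises that paired calls stay in step.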
Actually, SCSU makes an additional guarantee, which is that you can edit the compressed string. In other words, you can insert a substring such that the new string remains a valid compressed string and the parts preceding and following the insertion, when read, match the corresponding portion of the original after decoding. I remember that this was an important design criterion for the precursor RCSU. Their implementation required the ability to deliver a "patch" to a compressed string, something that isn't possible with many other compression formats.

So there is a sliding scale in features, each compression method being designed to address the specific requirements of a given application.

A./

Mark

— Il meglio è l’inimico del bene —


On Fri, Jun 4, 2010 at 06:35, Otto Stolz wrote:

    Hello,

    On 2010-06-03 07:07, Kannan Goundan wrote:

        This is currently what I do (I was referring to this as the
        "compact UTF-8-like encoding").  The one difference is that I
        put all the marker bits in the first byte (instead of in the
        high bit of every byte):
          0xxxxxxx
          10xxxxxx xyyyyyyy
          110xxxxx xxyyyyyy yzzzzzzz


    The problem with this encoding is that the trailing bytes
    are not clearly marked: they may start with any of
    '0', '10', or '110'; only '111' would mark a byte
    unambiguously as a trailing one.
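
The ambiguity can be seen in a minimal Python sketch of the first-byte-marker layout quoted above. The names encode_cp and decode_cp are hypothetical, and the sketch assumes the payload bits are stored most significant first, which the quoted layout suggests but does not state.

```python
def encode_cp(cp: int) -> bytes:
    """Encode one code point; all marker bits live in the first byte."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x4000:                     # 10xxxxxx + 1 raw byte (14 bits)
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp < 0x200000:                   # 110xxxxx + 2 raw bytes (21 bits)
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("code point out of range for this sketch")

def decode_cp(buf: bytes, pos: int):
    """Decode the sequence at pos, trusting that pos is a lead byte."""
    b0 = buf[pos]
    if b0 < 0x80:
        return b0, 1
    if b0 < 0xC0:
        return ((b0 & 0x3F) << 8) | buf[pos + 1], 2
    if b0 < 0xE0:
        return ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], 3
    raise ValueError("'111' prefix is not a lead byte in this layout")

data = encode_cp(0x20AC) + encode_cp(0x41)   # EURO SIGN, then 'A'
good = decode_cp(data, 0)   # started at a real boundary
bad = decode_cp(data, 1)    # started on a trailing byte, no error raised
```

Here U+20AC encodes as the bytes A0 AC. The trailing byte AC begins with the bits 10, so a decoder that lands on it cannot tell it apart from the lead byte of a two-byte sequence and silently returns a wrong code point.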

    In contrast, in UTF-8 every single byte carries a marker
    that unambiguously identifies it as a single ASCII byte,
    a starting byte, or a continuation byte; hence you do not
    have to go back to the beginning of the whole data stream to
    recognize, and decode, a group of bytes.
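
The self-synchronizing property of UTF-8 described above can be sketched as follows. The helper names is_continuation and char_start are hypothetical, and the sketch assumes the input is well-formed UTF-8.

```python
def is_continuation(b: int) -> bool:
    """UTF-8 continuation bytes are exactly those of the form 10xxxxxx."""
    return (b & 0xC0) == 0x80

def char_start(buf: bytes, pos: int) -> int:
    """Back up from an arbitrary offset to the start of its character.

    In well-formed UTF-8 this moves back at most three bytes, because a
    character has at most three continuation bytes after its lead byte.
    """
    while pos > 0 and is_continuation(buf[pos]):
        pos -= 1
    return pos

data = "a\u20acb".encode("utf-8")   # b'a\xe2\x82\xacb'
# Offsets 1, 2 and 3 all fall inside the three-byte EURO SIGN sequence.
starts = [char_start(data, i) for i in range(len(data))]
```

Because the check looks only at the byte under the cursor, a reader dropped at any offset can find the enclosing character without scanning from the beginning of the stream.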

    Best wishes,
     Otto Stolz





