Re: Least used parts of BMP.

Asmus Freytag Fri, 04 Jun 2010 10:31:01 -0700

On 6/4/2010 8:34 AM, Mark Davis ☕ wrote:

In a compression format, that doesn't matter; you can't expect random access, nor many of the other features of UTF-8.
The minimal expectation for these kinds of simple compression is that when you write a string with a particular /write/ method, and then read it back with the corresponding /read/ method, you get exactly the original string contents back, and you consume exactly as many bytes as you had written. There are really no other guarantees.

Actually, SCSU makes an additional guarantee, which is that you can edit the compressed string. In other words, you can insert a substring such that the new string remains a valid compressed string and the parts preceding and following the insertion, when read, match the corresponding portion of the original after decoding. I remember that this was an important design criterion for the precursor RCSU. Their implementation required the ability to deliver a "patch" to a compressed string, something that isn't possible with many other compression formats.

So there is a sliding scale in features, each compression method being designed to address the specific requirements of a given application.

A./


Mark

— Il meglio è l’inimico del bene —

On Fri, Jun 4, 2010 at 06:35, Otto Stolz wrote:


    Hello,

    On 2010-06-03 07:07, Kannan Goundan wrote:

        This is currently what I do (I was referring to this as the
        "compact UTF-8-like encoding").  The one difference is that I put
        all the marker bits in the first byte (instead of in the high bit
        of every byte):
          0xxxxxxx
          10xxxxxx xyyyyyyy
          110xxxxx xxyyyyyy yzzzzzzz


    The problem with this encoding is that the trailing bytes
    are not clearly marked: they may start with any of
    '0', '10', or '110'; only '111' would mark a byte
    unambiguously as a trailing one.
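
To make the ambiguity concrete, here is a minimal Python sketch of the layout quoted above. It assumes plain big-endian bit packing with no offsets and rejects anything beyond the three forms shown; the function names are hypothetical and not taken from Kannan's implementation.

```python
def encode_cp(cp: int) -> bytes:
    """Encode one code point with all marker bits in the first byte."""
    if cp < 0x80:            # 0xxxxxxx
        return bytes([cp])
    if cp < 0x4000:          # 10xxxxxx xyyyyyyy (14 payload bits)
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp < 0x200000:        # 110xxxxx xxyyyyyy yzzzzzzz (21 payload bits)
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("code point out of range")


def encode(s: str) -> bytes:
    return b"".join(encode_cp(ord(c)) for c in s)


def decode(data: bytes) -> str:
    """Decode from the start of `data`; the length comes from the first byte only."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 1, b
        elif b < 0xC0:
            n = 2
            if i + n > len(data):
                raise ValueError("truncated sequence")
            cp = ((b & 0x3F) << 8) | data[i + 1]
        elif b < 0xE0:
            n = 3
            if i + n > len(data):
                raise ValueError("truncated sequence")
            cp = ((b & 0x1F) << 16) | (data[i + 1] << 8) | data[i + 2]
        else:
            raise ValueError("invalid lead byte")
        out.append(chr(cp))
        i += n
    return "".join(out)


# U+0141 encodes as 0x81 0x41; the trailing byte 0x41 is also a valid
# single-byte sequence ("A"), so a reader that starts one byte late
# decodes it without noticing anything is wrong.
encoded = encode("\u0141")
misread = decode(encoded[1:])
```

Reading from the correct offset round-trips the string, but reading from offset 1 silently yields "A": nothing in the trailing byte says it belongs to a longer sequence.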

    In contrast, in UTF-8 every single byte carries a marker
    that unambiguously marks it as either a single ASCII byte,
    a starting byte, or a continuation byte; hence you do not have to
    go back to the beginning of the whole data stream to recognize,
    and decode, a group of bytes.
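
As a rough illustration of that property, the following Python sketch classifies UTF-8 bytes by their high bits and backs up from an arbitrary offset to the start of the enclosing character. The helper names are hypothetical, and the sketch assumes the input is already well-formed UTF-8.

```python
def classify(b: int) -> str:
    """Classify a single UTF-8 byte by its marker bits."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if (b & 0xC0) == 0x80:
        return "continuation"   # 10xxxxxx
    return "lead"               # 11xxxxxx


def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing data[i]."""
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1                  # step back over continuation bytes
    return i


data = "a\u20acb".encode("utf-8")   # b'a\xe2\x82\xacb'
start = char_start(data, 3)         # offset 3 is the last byte of U+20AC
```

Because a well-formed character has at most three continuation bytes, the loop inspects only a few bytes near the offset and never needs to scan from the beginning of the stream.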

    Best wishes,
     Otto Stolz
