When we were doing BOCU, I thought of a way to allow semi-random access.
This also applies to DUDE-8.

Reserve a particular byte value (or sequence if you have to) as a
synchronization token. Every 33 (or so) characters, emit one of those
sequences. You can then access into a buffer, and backup at most 33 x 3
bytes to find a synchronization point. The overhead is only about 33% (if
you can use a single byte, as in BOCU).

A stronger notion of random access allows you to not only get to character
boundaries, but also to determine which character (from 0 .. n) you are on
in the buffer. That can also be done, with a little bit more work. Left to
the reader.

Mark

----- Original Message -----
From: "John Cowan" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, June 22, 2001 18:23
Subject: DUDE-8, a compression proposal


> It's DUDE-8!  It's quick!  It's easy!  It does it all!
>
> The DUDE-8 algorithm compresses 21-bit Unicode Scalar Values into bytes.
> It is based on the DUDE(-6) algorithm currently being proposed for
> i18n of DNS names.
>
> 1.  Let prev = 0.
> 2.  For each Unicode Scalar value usv, let prev = prev xor USV.
> 3.  Let the value of prev be expressed in 21 bits as
xxxxxxxyyyyyyyzzzzzzz.
> 4.  Encode prev in 3 bytes as 0xxxxxxx 0yyyyyyy 1zzzzzzz.
> 5.  Emit all non-zero bytes.
> 6.  Repeat until done.
>
> The optional signature of DUDE-8 is 03 FD FF, which is how U+FEFF appears
at the
> beginning of a DUDE-8 compression stream.
>
> DUDE-8 is simpler and simpler than SCSU, but doesn't allow recovery from
> garbles or even partial random or backwards access.
>
> --
> John Cowan                                   [EMAIL PROTECTED]
> One art/there is/no less/no more/All things/to do/with sparks/galore
> --Douglas Hofstadter
>


Reply via email to