Think you are missing a negative, see below.

Mark
__________________________________
http://www.macchiato.com
----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Fri, 2003 Dec 05 08:43
Subject: Re: Compression through normalization

> Kenneth Whistler <kenw at sybase dot com> wrote:
>
> > Canonical equivalence is about not modifying the interpretation of the
> > text. That is different from considerations about not changing the
> > text, period.
> >
> > If some process using text is sensitive to *any* change in the text
> > whatsoever (CRC-checking or any form of digital signaturing, memory
> > allocation), then, of course, *any* change to the text, including any
> > normalization, will make a difference.
> >
> > If some process using text is sensitive to the *interpretation* of the
> > text, i.e. it is concerned about the content and meaning of the
> > letters involved, then normalization, to forms NFC or NFD, which only
> > involve canonical equivalences, will *not* make a difference.
>
> All right. I think that is the missing piece I needed.
>
> How's this:
>
> Compression techniques may optionally replace certain sequences with
> canonically equivalent sequences to improve efficiency, but *only* if
> the output of the decompressed text is expected to be is not required to be
> codepoint-for-codepoint equivalent to the original. Whether this is
> true or not depends on the user and the intended use of the text.
>
> Text compression techniques are generally assumed to be "lossless,"
> meaning that no information -- including meta-information -- is altered
> by compressing and decompressing the text. However, this is not always
> the case for other types of data. In particular, video and audio
> formats often incorporate some form of "lossy" compression where the
> benefit of reduced size outweighs the potential degradation of the
> original image or sample.
>
> Because Unicode incorporates the notion of canonical equivalence, the
> line between "lossless" and "lossy" is not as clear as with other
> character encoding standards. Conformance clause C10 says (roughly)
> that a process may choose any canonical-equivalent sequence for a run of
> text without altering the interpretation of the text. Compression of
> Unicode text may be assumed either to (a) retain only the
> interpretation, in which case this is acceptable, or (b) retain the
> exact code points, in which case it is not.
>
> Mark indicated that a compression-decompression cycle should not only
> stick to canonical-equivalent sequences, which is what C10 requires, but
> should convert text only to NFC (if at all). Ken mentioned
> normalization "to forms NFC or NFD," but I'm not sure this was in the
> same context. (Can we find a consensus on this?)
>
> No substitution of compatibility equivalents or other privately defined
> equivalents is acceptable. A compressor can obviously convert its input
> to whatever representation it likes, but it must be able to recover the
> original input exactly, or "equivalently" as described above.
>
> > Or to be more subtle about it, it might make a difference, but it is
> > nonconformant to claim that a process which claims it does not make a
> > difference is nonconformant.
> >
> > If you can parse that last sentence, then you are well on the way to
> > understanding the Tao of Unicode.
>
> I had to read it a few times, but such things are necessary along the
> Path of Enlightenment.
>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
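
Ken's distinction above is easy to see in practice. The following is a minimal Python sketch (using only the standard unicodedata and zlib modules; the sample string is purely illustrative, not from the discussion): the NFC and NFD spellings of the same word are canonically equivalent, so an interpretation-sensitive process treats them as the same text, but they differ code point for code point, so a bit-sensitive process such as a CRC check sees a change.

    import unicodedata
    import zlib

    # Two canonically equivalent spellings of the same word.
    precomposed = "caf\u00e9"                               # U+0063 U+0061 U+0066 U+00E9
    decomposed = unicodedata.normalize("NFD", precomposed)  # ends in U+0065 U+0301

    # Interpretation-sensitive view: the two are the same text.
    assert unicodedata.normalize("NFC", decomposed) == precomposed

    # Bit-sensitive view: the code points (and bytes) differ, so a CRC changes.
    assert precomposed != decomposed
    assert zlib.crc32(precomposed.encode("utf-8")) != zlib.crc32(decomposed.encode("utf-8"))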

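Doug's (a)/(b) distinction can likewise be read as two different round-trip contracts for a compressor. The sketch below is only an illustration of that reading, not a description of any real compression scheme (the function names are invented, and zlib merely stands in for whatever back-end is actually used): the "exact" contract must return the original code points, while the "canonical" contract is only required to return something canonically equivalent, so it is free to normalize to NFC on the way in.

    import unicodedata
    import zlib

    def compress_exact(text: str) -> bytes:
        # Contract (b): decompression must reproduce the exact code points.
        return zlib.compress(text.encode("utf-8"))

    def compress_canonical(text: str) -> bytes:
        # Contract (a): only the interpretation must survive, so the
        # compressor may substitute a canonically equivalent sequence
        # (here, normalization to NFC) before encoding.
        return zlib.compress(unicodedata.normalize("NFC", text).encode("utf-8"))

    def decompress(data: bytes) -> str:
        return zlib.decompress(data).decode("utf-8")

    original = "cafe\u0301"            # decomposed spelling of "café"

    assert decompress(compress_exact(original)) == original
    restored = decompress(compress_canonical(original))
    assert restored != original        # the code points changed ...
    # ... but the result is still canonically equivalent to the original.
    assert unicodedata.normalize("NFD", restored) == unicodedata.normalize("NFD", original)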
