Think you are missing a negative, see below.

Mark
__________________________________
http://www.macchiato.com
----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Fri, 2003 Dec 05 08:43
Subject: Re: Compression through normalization

> Kenneth Whistler <kenw at sybase dot com> wrote:
>
> > Canonical equivalence is about not modifying the interpretation of the
> > text. That is different from considerations about not changing the
> > text, period.
> >
> > If some process using text is sensitive to *any* change in the text
> > whatsoever (CRC-checking or any form of digital signaturing, memory
> > allocation), then, of course, *any* change to the text, including any
> > normalization, will make a difference.
> >
> > If some process using text is sensitive to the *interpretation* of the
> > text, i.e. it is concerned about the content and meaning of the
> > letters involved, then normalization, to forms NFC or NFD, which only
> > involve canonical equivalences, will *not* make a difference.
>
> All right. I think that is the missing piece I needed.
>
> How's this:
>
> Compression techniques may optionally replace certain sequences with
> canonically equivalent sequences to improve efficiency, but *only* if
> the output of the decompressed text is expected to be is not required to be
> codepoint-for-codepoint equivalent to the original. Whether this is
> true or not depends on the user and the intended use of the text.
>
> Text compression techniques are generally assumed to be "lossless,"
> meaning that no information -- including meta-information -- is altered
> by compressing and decompressing the text. However, this is not always
> the case for other types of data. In particular, video and audio
> formats often incorporate some form of "lossy" compression where the
> benefit of reduced size outweighs the potential degradation of the
> original image or sample.
>
> Because Unicode incorporates the notion of canonical equivalence, the
> line between "lossless" and "lossy" is not as clear as with other
> character encoding standards. Conformance clause C10 says (roughly)
> that a process may choose any canonical-equivalent sequence for a run of
> text without altering the interpretation of the text. Compression of
> Unicode text may be assumed either to (a) retain only the
> interpretation, in which case this is acceptable, or (b) retain the
> exact code points, in which case it is not.
>
> Mark indicated that a compression-decompression cycle should not only
> stick to canonical-equivalent sequences, which is what C10 requires, but
> should convert text only to NFC (if at all). Ken mentioned
> normalization "to forms NFC or NFD," but I'm not sure this was in the
> same context. (Can we find a consensus on this?)
>
> No substitution of compatibility equivalents or other privately defined
> equivalents is acceptable. A compressor can obviously convert its input
> to whatever representation it likes, but it must be able to recover the
> original input exactly, or "equivalently" as described above.
>
> > Or to be more subtle about it, it might make a difference, but it is
> > nonconformant to claim that a process which claims it does not make a
> > difference is nonconformant.
> >
> > If you can parse that last sentence, then you are well on the way to
> > understanding the Tao of Unicode.
>
> I had to read it a few times, but such things are necessary along the
> Path of Enlightenment.
>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
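
Ken's distinction above is easy to see in practice. The following is a minimal Python sketch (using only the standard unicodedata and zlib modules; the sample string is purely illustrative, not from the discussion): the NFC and NFD spellings of the same word are canonically equivalent, so an interpretation-sensitive process treats them as the same text, but they differ code point for code point, so a bit-sensitive process such as a CRC check sees a change.

    import unicodedata
    import zlib

    # Two canonically equivalent spellings of the same word.
    precomposed = "caf\u00e9"                               # U+0063 U+0061 U+0066 U+00E9
    decomposed = unicodedata.normalize("NFD", precomposed)  # ends in U+0065 U+0301

    # Interpretation-sensitive view: the two are the same text.
    assert unicodedata.normalize("NFC", decomposed) == precomposed

    # Bit-sensitive view: the code points (and bytes) differ, so a CRC changes.
    assert precomposed != decomposed
    assert zlib.crc32(precomposed.encode("utf-8")) != zlib.crc32(decomposed.encode("utf-8"))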

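Doug's (a)/(b) distinction can likewise be read as two different round-trip contracts for a compressor. The sketch below is only an illustration of that reading, not a description of any real compression scheme (the function names are invented, and zlib merely stands in for whatever back-end is actually used): the "exact" contract must return the original code points, while the "canonical" contract is only required to return something canonically equivalent, so it is free to normalize to NFC on the way in.

    import unicodedata
    import zlib

    def compress_exact(text: str) -> bytes:
        # Contract (b): decompression must reproduce the exact code points.
        return zlib.compress(text.encode("utf-8"))

    def compress_canonical(text: str) -> bytes:
        # Contract (a): only the interpretation must survive, so the
        # compressor may substitute a canonically equivalent sequence
        # (here, normalization to NFC) before encoding.
        return zlib.compress(unicodedata.normalize("NFC", text).encode("utf-8"))

    def decompress(data: bytes) -> str:
        return zlib.decompress(data).decode("utf-8")

    original = "cafe\u0301"            # decomposed spelling of "café"

    assert decompress(compress_exact(original)) == original
    restored = decompress(compress_canonical(original))
    assert restored != original        # the code points changed ...
    # ... but the result is still canonically equivalent to the original.
    assert unicodedata.normalize("NFD", restored) == unicodedata.normalize("NFD", original)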
