Peter Kirk [peterkirk at qaya dot org] writes:

> On 25/11/2003 16:38, Doug Ewell wrote:
>
> >Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> >
> >>So SCSU and BOCU-* formats are NOT general purpose compressors. As
> >>they are defined only in terms of streams of Unicode code points, they
> >>are assumed to follow the conformance clauses of Unicode. As they
> >>recognize their input as Unicode text, they can recognize canonical
> >>equivalence, and thus this creates an opportunity for them to consider
> >>if a (de)normalization or de/re-composition would result in higher
> >>compression (interestingly, the composition exclusions could be
> >>reconsidered in the case of BOCU-1 and SCSU compressed streams,
> >>provided that the decompression to code points will redecompose the
> >>excluded compositions).
> >
> >I have to say, if there's a flaw in Philippe's logic here, I don't see
> >it. Anyone?
>
> Yes, the compressor can make any canonically equivalent change, not just
> composing composition exclusions but reordering combining marks in
> different classes. The only flaw I see is that the compressor does not
> have to undo these changes on decompression; at least no other process
> is allowed to rely on it having done so.
Being able to undo these changes on decompression is needed only if one wants to restore a canonically equivalent text that preserves all of its initial semantics. I am not saying that decompressors need to undo all of these changes to be lossless, as long as the result of decompression is canonically equivalent to the original: the decompressor may keep sequences that were composed even though they are normally excluded from recomposition. (This restriction applies only to encoded streams that claim to be in NFC or NFKC form when parsed as streams of code points; in practice, in applications that handle code points as binary code units, it extends to streams of _code units_, not to streams of _bytes_ of a UTF encoding _scheme_.)

I see good reasons why a fully Unicode-compliant application, process or system could be built that handles Unicode text symbolically rather than through code units. For example, a Unicode text could be fully handled (and transformed with Unicode algorithms) simply as a linked list of items, where the items are symbolic abstract characters, or complete objects with their own interfaces for accessing their properties, transformation methods and associations, or enumerated XML elements with distinct names. For such applications, the normalization form makes sense as the internal representation, and it has nothing to do with the glyph representation. There may even exist an object interface for interchange that does not use or transmit any code unit, or even a binary byte representation. In that case, the most important thing is not the code unit, nor even the code point itself, but the supported enumerated objects, i.e. the assigned abstract characters that are part of the Unicode CCS (coded character set). To me, code points are more symbolic than they appear; they are not numeric values.
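To illustrate the losslessness criterion above, here is a small Python sketch (my own example, not from the thread) showing that the precomposed and decomposed forms of "é" have different code point sequences yet are canonically equivalent, which is why a decompressor may emit either one:

```python
import unicodedata

# Two spellings of "é": different code point sequences, but canonically
# equivalent, so exchanging one for the other is lossless in the sense
# discussed above.
composed = "\u00E9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT

assert composed != decomposed                         # raw sequences differ
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
```

A process that compares the two strings code point by code point sees a difference; a process that normalizes first (as conformant processes comparing for canonical equivalence must) does not.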
If they were numeric values, we wouldn't need the concept of code points at all; we could just use the code units of the UTF-32 encoding. What I mean here is that the numeric code assigned in GB18030 to an abstract character is as valid as a UTF-32 code unit: both represent the same abstract character, so UTF-32BE and GB18030 (for example) encode the same set of abstract characters (ISO/IEC 10646 would say they share the same subset but use distinct numeric code positions, so they are two distinct coded character sets, a.k.a. CCS).

As long as ISO/IEC 10646 and Unicode had not formally merged their character set and normative references so that they fully interoperate, it was impossible to think about normalizing Unicode texts within compressors. But now that there is a normative stability policy for canonically equivalent strings, it is clear that even ISO/IEC 10646 is more than just a coded character set: it includes the definition of canonically equivalent strings, bound very tightly to the code points assigned in the CCS.

Ensuring compliance with canonical equivalence then requires indicating which character subset is supported, i.e. the version of the Unicode standard, or of the ISO/IEC 10646 standard (which is augmented with new assignments more often than Unicode, until the new repertoires are merged by formal agreement between the two parties). Interoperability is guaranteed only if the character sets used in documents are strictly bound to the code points assigned in published, versioned standards; but once this is done, you can immediately assume the rules for canonical equivalence of strings using these new characters. That's why I think that both standards (Unicode and ISO/IEC 10646) MUST clearly and formally specify which versions they correspond to regarding their common CCS.
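The GB18030/UTF-32BE point can be made concrete with a short Python sketch (my own illustration, using Python's standard codecs): the two encodings assign different numeric codes to the same abstract character, yet both round-trip to the identical code point.

```python
ch = "\u4E2D"  # U+4E2D CJK UNIFIED IDEOGRAPH-4E2D ("中")

gb_bytes = ch.encode("gb18030")       # GB18030 code position for the character
utf32_bytes = ch.encode("utf-32-be")  # UTF-32BE code unit for the same character

assert gb_bytes != utf32_bytes                 # distinct numeric codes...
assert gb_bytes.decode("gb18030") == ch        # ...but both decode back to
assert utf32_bytes.decode("utf-32-be") == ch   # the same abstract character
```

In ISO/IEC 10646 terms, the two codecs realize two distinct CCS over a shared repertoire: same abstract characters, different code positions.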
I note that this was not the case before Unicode 4.0, but it has been formally indicated since the official publication of Unicode 4.0, and I hope that this normative reference will be kept in the future.

