Peter Kirk [peterkirk at qaya dot org] writes:

> On 25/11/2003 16:38, Doug Ewell wrote:
>
> >Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> >
> >>So SCSU and BOCU-* formats are NOT general purpose compressors. As
> >>they are defined only in terms of streams of Unicode code points, they
> >>are assumed to follow the conformance clauses of Unicode. As they
> >>recognize their input as Unicode text, they can recognize canonical
> >>equivalence, and thus this creates an opportunity for them to consider
> >>if a (de)normalization or de/re-composition would result in higher
> >>compression (interestingly, the composition exclusions could be
> >>reconsidered in the case of BOCU-1 and SCSU compressed streams,
> >>provided that the decompression to code points will redecompose the
> >>excluded compositions).
> >
> >I have to say, if there's a flaw in Philippe's logic here, I don't see
> >it. Anyone?
>
> Yes, the compressor can make any canonically equivalent change, not just
> composing composition exclusions but reordering combining marks in
> different classes. The only flaw I see is that the compressor does not
> have to undo these changes on decompression; at least no other process
> is allowed to rely on it having done so.
Being able to undo these changes on decompression is needed only if one wants to restore a canonically equivalent text that preserves all of its initial semantics. I am not saying that decompressors need to undo all of these changes to be lossless, as long as the result of decompression is canonically equivalent to the original: the decompressor may keep sequences that were composed even though they are normally excluded from recomposition. (This restriction applies only to encoded streams that claim to be in NFC or NFKC form when parsed as streams of code points; in practice, in applications that handle code points as binary code units, it extends to streams of _code units_, not to streams of _bytes_ of a UTF encoding _scheme_.)

I see good reasons why a fully Unicode-compliant application, process or system could be built that handles Unicode text symbolically rather than through code units. For example, a Unicode text could be fully handled (and transformed with Unicode algorithms) simply as a linked list of items, where the items are symbolic abstract characters, or complete objects with their own interfaces for accessing their properties, transformation methods and associations, or enumerated XML elements with distinct names. For such applications, the normalization form makes sense as the internal representation, and it has nothing to do with the glyph representation. There may even exist an object interface for interchange that does not use or transmit any code unit, or even a binary byte representation. In that case, the most important thing is not the code unit, nor even the code point itself, but the supported enumerated objects, i.e. the assigned abstract characters that are part of the Unicode CCS (coded character set). To me, code points are more symbolic than they appear; they are not numeric values.
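To illustrate the losslessness criterion above, here is a small Python sketch (my own example, not from the thread) showing that the precomposed and decomposed forms of "é" have different code point sequences yet are canonically equivalent, which is why a decompressor may emit either one:

```python
import unicodedata

# Two spellings of "é": different code point sequences, but canonically
# equivalent, so exchanging one for the other is lossless in the sense
# discussed above.
composed = "\u00E9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT

assert composed != decomposed                         # raw sequences differ
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
```

A process that compares the two strings code point by code point sees a difference; a process that normalizes first (as conformant processes comparing for canonical equivalence must) does not.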
If they were numeric values, we wouldn't need the concept of code points at all; we could just use the code units of the UTF-32 encoding. What I mean here is that the numeric code assigned in GB18030 to an abstract character is as valid as a UTF-32 code unit: both represent the same abstract character, so UTF-32BE and GB18030 (for example) encode the same set of abstract characters (ISO/IEC 10646 would say they share the same subset but use distinct numeric code positions, so they are two distinct coded character sets, a.k.a. CCS).

As long as ISO/IEC 10646 and Unicode had not formally merged their character set and normative references so that they fully interoperate, it was impossible to think about normalizing Unicode texts within compressors. But now that there is a normative stability policy for canonically equivalent strings, it is clear that even ISO/IEC 10646 is more than just a coded character set: it includes the definition of canonically equivalent strings, bound very tightly to the code points assigned in the CCS.

Ensuring compliance with canonical equivalence then requires indicating which character subset is supported, i.e. the version of the Unicode standard, or of the ISO/IEC 10646 standard (which is augmented with new assignments more often than Unicode, until the new repertoires are merged by formal agreement between the two parties). Interoperability is guaranteed only if the character sets used in documents are strictly bound to the code points assigned in published, versioned standards; but once this is done, you can immediately assume the rules for canonical equivalence of strings using these new characters. That's why I think that both standards (Unicode and ISO/IEC 10646) MUST clearly and formally specify which versions they correspond to regarding their common CCS.
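The GB18030/UTF-32BE point can be made concrete with a short Python sketch (my own illustration, using Python's standard codecs): the two encodings assign different numeric codes to the same abstract character, yet both round-trip to the identical code point.

```python
ch = "\u4E2D"  # U+4E2D CJK UNIFIED IDEOGRAPH-4E2D ("中")

gb_bytes = ch.encode("gb18030")       # GB18030 code position for the character
utf32_bytes = ch.encode("utf-32-be")  # UTF-32BE code unit for the same character

assert gb_bytes != utf32_bytes                 # distinct numeric codes...
assert gb_bytes.decode("gb18030") == ch        # ...but both decode back to
assert utf32_bytes.decode("utf-32-be") == ch   # the same abstract character
```

In ISO/IEC 10646 terms, the two codecs realize two distinct CCS over a shared repertoire: same abstract characters, different code positions.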
I note that this was not the case before Unicode 4.0, but it has been formally indicated since the official publication of Unicode 4.0, and I hope that this normative reference will be kept in the future.

