RE: Compression through normalization

Philippe Verdy Fri, 05 Dec 2003 15:43:13 -0800

Mark Davis writes:
> Doug Ewell writes:
> > OK. So it's Mark, not me, who is unilaterally extending C10.
> 
> Where on earth do you get that? I did say that, in practice, NFC should be
> produced, but that is simply a practical guideline, independent of C10.


I also think that the NFC form is not required for the result of the
decompression to respect clause C10. So if your intent is to create a
compressor/decompressor that respects canonical equivalence, NFC is not
required.
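
As a minimal sketch of that point (an illustration only, not code from this thread; the helper name `canonically_equivalent` and the sample strings are assumptions), the check below treats a decompressor's output as acceptable when it has the same NFD form as the input, even though the output is not in NFC:

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Hypothetical round trip: the input uses a precomposed letter,
# the decompressed output uses a base letter plus a combining mark.
original = "caf\u00e9"        # U+00E9 LATIN SMALL LETTER E WITH ACUTE (NFC)
decompressed = "cafe\u0301"   # e + U+0301 COMBINING ACUTE ACCENT (not NFC)

# Canonical equivalence is preserved by the round trip...
print(canonically_equivalent(original, decompressed))
# ...although the output is not in NFC (this comparison is False).
print(unicodedata.normalize("NFC", decompressed) == decompressed)
```

Producing NFC would make the output code-point-identical to an NFC input, which is the practical guideline, but the equivalence check itself does not depend on it.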

Of course, clause C10 cannot be fully respected for charset mappings:
non-Unicode Korean charsets are one example where canonical equivalence
cannot be guaranteed, and where the Unicode canonical equivalence is in fact
a source of pollution. Mappings to/from a non-Unicode charset do not need to
respect Unicode canonical equivalence when that charset has its own canonical
equivalence rules.

It's just a shame that what was considered equivalent in the Korean
standards is considered canonically distinct (and even compatibility
distinct) in Unicode. This means that the exact same abstract Korean text
can have two distinct representations in Unicode, and there's no way to match
these Unicode representations together. It also means that when mapping Korean
charsets to Unicode, care must be taken, before making the mapping, that
compound jamos are used wherever possible.
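
A small sketch of the problem, assuming one compound jamo and its spelled-out sequence of simple jamos as the example pair (my choice of code points, not taken from any Korean standard's mapping table): none of the four Unicode normalization forms brings the two representations together.

```python
import unicodedata

compound = "\u1101"        # HANGUL CHOSEONG SSANGKIYEOK (compound jamo)
sequence = "\u1100\u1100"  # HANGUL CHOSEONG KIYEOK twice (simple jamos)

def same_under(form: str, a: str, b: str) -> bool:
    """True if a and b become identical after the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Check canonical forms (NFC, NFD) and compatibility forms (NFKC, NFKD).
results = {form: same_under(form, compound, sequence)
           for form in ("NFC", "NFD", "NFKC", "NFKD")}
print(results)
```

Every entry comes out False, so a converter from a Korean charset has to choose the compound jamo itself; normalization applied afterwards will not repair the choice.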

If the text is then stored and handled entirely in Unicode without returning
to the KSC standard, you won't have any tool other than the UCA to collate
strings (but collation does not produce strings, just collation weights,
and there's currently no tool to reverse a list of weights back to a
Unicode string)...

... unless the table of UCA collation weights is built as if it were a
bidirectional mapping to a legacy charset, which would then become
reversible and usable to perform various Unicode algorithms including case
folding, or many other similar foldings defined in UTR...
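
To make the reversibility issue concrete, here is a toy sketch. It is not the UCA: `toy_primary_key` is a hypothetical stand-in that ignores case and combining marks, roughly in the spirit of a primary-strength comparison. Distinct strings collapse to the same key, so the key alone cannot be mapped back to one Unicode string unless the weight table is designed to be one-to-one.

```python
import unicodedata

def toy_primary_key(s: str) -> tuple:
    """Toy collation key: case-folded, NFD-decomposed, combining marks dropped.

    This is an assumption for illustration, not the UCA weight table.
    """
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return tuple(ord(c) for c in decomposed if not unicodedata.combining(c))

words = ["resume", "R\u00e9sum\u00e9", "RESUME"]

# Group the strings by key: one key, several possible source strings.
by_key = {}
for w in words:
    by_key.setdefault(toy_primary_key(w), []).append(w)

for key, sources in by_key.items():
    print(len(sources), "strings share key", key)
```

All three strings land on a single key here, which is why a reversible scheme would need the weights to behave like a bidirectional charset mapping.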

If someone ventures to define such a collation charset and map it
to Unicode, he will effectively create as many charsets as there are
collation orders tailored for a particular language.

