Mark Davis writes:
> I would say that a compressor can normalize, if (a) when decompressing it
> produces NFC, and (b) it advertises that it normalizes.
Why condition (a)? NFD could be used as well, and even another normalization where combining characters are sorted differently, or partly recomposed, or even recomposed by ignoring the composition exclusions, as long as the result is a canonical equivalent of the input.

Whatever the compressor produces, there's no way to specify the normalization form in the result: there is no standard for indicating it in the output stream. The relevant standard is using a MIME or IANA charset, which only specifies a pair consisting of a CCS (coded character set, i.e. for us the Unicode/ISO/IEC 10646 assigned code points) and a CES (for us, the character encoding scheme). There is no standard convention for advertising the normalization form.

This implies that no transport protocol can assume any normalization form of Unicode text, even if the text is labeled as UTF-*, UCS-*, BOCU-* or SCSU. Normalization becomes a normal step in all interchanges, including for compression purposes.

Unicode already says that all normalization forms are canonically equivalent and must be treated equally. I see no justification for accepting some VALID Unicode text and rejecting some other VALID text when both texts are canonically equivalent.

The interaction of C9 and C10 implies that any process that claims to respect canonical equivalence must either normalize its input, or be SURE that the input is already normalized in the form it expects. There is no way to be SURE of that if the two processes are not part of the same local system and do not share the same normalization library at ALL times. If there is a delay between the two processes and the system is upgraded in between, you will experience problems, unless the intermediate results from the first process are renormalized with the newer implementation before the second process uses them.
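To make the point concrete, here is a minimal sketch in Python using the standard unicodedata module (the helper function name is my own, not anything from the discussion): two canonically equivalent strings compare unequal byte-for-byte, and only become equal after both sides are brought to the same normalization form.

```python
import unicodedata

# "é" composed (NFC) vs decomposed (NFD): canonically equivalent,
# but different code point sequences.
nfc = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"     # U+0065 + U+0301 COMBINING ACUTE ACCENT

assert nfc != nfd                                   # raw comparison fails
assert unicodedata.normalize("NFC", nfd) == nfc     # equal after NFC
assert unicodedata.normalize("NFD", nfc) == nfd     # equal after NFD

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings under canonical equivalence (hypothetical helper):
    normalize both to the same form, then compare."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(canonically_equivalent(nfc, nfd))  # True
```

Any of the four forms would do for the comparison, as long as both sides use the same one; the point is only that a process cannot skip the normalization step and still claim to respect canonical equivalence.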
If the intermediate result is, for example, an RDBMS database, the database needs to be checked and cleaned up with the new normalization to allow correct access to tables through binary-sorted indices with the upgraded RDBMS engine. In practice, this means rebuilding the indices, unless the database also records which normalization form its indices use, and the engine performs the necessary normalization on the fly to match the storage requirements...

For me, a process that accepts some text but not another, canonically equivalent text is NOT conforming to the claim that it respects canonical equivalence, and so it is only a partial implementation of Unicode.
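The "normalize on the fly to match storage" idea can be sketched in a few lines of Python (a toy stand-in for an RDBMS index; the class and its methods are hypothetical): every key is normalized to the index's declared form on both insertion and lookup, so canonically equivalent spellings reach the same row regardless of how the client spelled them.

```python
import unicodedata

class NormalizedIndex:
    """Toy key-value index that stores keys in one declared normalization
    form, so canonically equivalent keys collide as they should."""

    def __init__(self, form: str = "NFC"):
        self.form = form          # the form recorded for this index
        self._rows = {}

    def put(self, key: str, value) -> None:
        # Normalize on the way in, matching the storage requirement.
        self._rows[unicodedata.normalize(self.form, key)] = value

    def get(self, key: str):
        # Normalize the probe the same way before the binary comparison.
        return self._rows.get(unicodedata.normalize(self.form, key))

idx = NormalizedIndex()
idx.put("\u00e9l\u00e8ve", 1)        # NFC spelling of "élève"
print(idx.get("e\u0301le\u0300ve"))  # NFD spelling finds the same row: 1
```

An engine that does not record the form and normalize at the boundary is exactly the partial implementation described above: its binary-sorted index will miss rows whose keys were stored under a different but canonically equivalent spelling.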

