RE: Is it roundtripping or transfer-encoding

From: Lars Kristan
> OK, so it introduces multiple representations of the same code points. Every escaping technique does that, and it is not a problem. All you need to do is define the normalization procedure and use it where it applies. In many cases its use is not even necessary. Specifically, a Unicode system does not need to (and should not) normalize the escape code points. The need for normalization only has to be determined for an application that uses the TES itself, and it applies only in a few cases.

Please don't use the term "normalize" in this context. Normalization in Unicode involves transformation of the stream of *code points*, but is independent of their encoding form or encoding scheme. Normalization is defined in terms of combining sequences, mostly via the "combining class" property of characters and the character decomposition mapping property (plus some values of the "general category" property, to take control characters into account when delimiting combining sequences).
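
As a rough illustration (a minimal Python sketch using only the standard unicodedata module; the sample characters are chosen arbitrarily), these are the character properties involved:

    import unicodedata

    # Inspect the properties that drive normalization: the general
    # category, the combining class, and the decomposition mapping.
    for ch in ("e", "\u0301", "\u00E9"):   # e, combining acute, e-acute
        print(f"U+{ord(ch):04X}",
              "category:", unicodedata.category(ch),
              "combining class:", unicodedata.combining(ch),  # 0 = starter
              "decomposition:", unicodedata.decomposition(ch) or "(none)")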


Unicode defines only four *standard* normalization forms (NFC, NFD, NFKC, NFKD), but other *non-standard* normalization forms are possible.
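
To make the four standard forms concrete, here is a small sketch (again plain Python with unicodedata; the sample strings are invented for the example) printing the code points each form produces:

    import unicodedata

    def show_forms(s: str) -> None:
        # Print the code point sequence of s under each standard form.
        for form in ("NFC", "NFD", "NFKC", "NFKD"):
            normalized = unicodedata.normalize(form, s)
            points = " ".join(f"U+{ord(c):04X}" for c in normalized)
            print(f"{form}: {points}")

    show_forms("\u00E9")  # NFC keeps U+00E9; NFD splits it into e + U+0301
    show_forms("\uFB01")  # the fi ligature survives NFC/NFD, but NFKC/NFKD
                          # replace it with plain "f" + "i"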

Normalization involves transformation of strings of abstract characters that should be considered "equivalent" for text processing (notably for the input text of these processes; it may also apply, optionally and less importantly, to their output text).

Unicode defines two equivalence classes for encoded texts. "Canonical" equivalence (NFC or NFD, or the non-standard special decomposition form used on MacOS for HFS+ volumes) is important for some other standards that depend on Unicode. "Compatibility" equivalence (NFKC, NFKD) is important only for fallback mechanisms, and compatibility mappings can lose some information from the source text. Each equivalence type is defined with a "composed" and a "decomposed" form.
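
A short sketch of the two equivalence tests (the function names are mine, and the examples are just well-known pairs; plain Python with unicodedata):

    import unicodedata

    def canonically_equal(a: str, b: str) -> bool:
        # Canonical equivalence: same code points after NFD (NFC works too).
        return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    def compatibility_equal(a: str, b: str) -> bool:
        # Compatibility equivalence: same code points after NFKD; this
        # folding can lose formatting information from the source text.
        return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

    # U+212B ANGSTROM SIGN is canonically equivalent to U+00C5:
    assert canonically_equal("\u212B", "\u00C5")
    # Superscript two (U+00B2) is only compatibility-equivalent to "2";
    # the superscript styling is lost by the NFKD folding:
    assert not canonically_equal("\u00B2", "2")
    assert compatibility_equal("\u00B2", "2")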