Asmus replied:

> On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
> >> FA47 is a "compatibility character", and would have a
> >> compatibility mapping.
> >
> > Faulty syllogism.
>
> Formally correct answer but only because of something of a design flaw
> in Unicode. When the type of mapping was decided on, people didn't fully
> expect that NFC might become widely used/enforced, making these
> distinctions appear wherever text is normalized in a distributed
> architecture.
O.k., I'm gonna have to intervene again. *hehe*

Yes, there is a design flaw here, but Asmus' explanation is also
somewhat faulty, because it flattens out the history in a way that is
liable to be misunderstood.

There is a *reason* why "when the type of mapping was decided on" that
"people didn't fully expect that NFC might become widely used/enforced"
-- but it wasn't that they were goofing up in understanding the
implications of normalization. Rather, at that point in Unicode history
NFC didn't *exist* yet, nor had the normalization algorithm been
designed.

Here, for the benefit of the standards geeks out there, are the
relevant highlights of the historical timeline involved.

June, 1992. The canonical mappings for the CJK Compatibility characters
were *printed* (with off-by-one errors for some of them!) in Unicode
1.0, volume 2 (= Unicode 1.0.1). Actually, at the time, we didn't know
they were "canonical" mappings, because that concept hadn't formally
been invented yet, but the intention was clear. They were the mappings
from the "CJK compatibility ideographs" to the "real" unified Han
ideographs in the standard. The CJK compatibility characters were all
considered to be duplicates in the source standards that didn't follow
the unification rules.

July, 1996. The formal definitions of "canonical decomposition" and
"compatibility decomposition" were first published in Unicode 2.0.
There wasn't a data file for the CJK Compatibility Ideographs block,
but the canonical mappings were *printed* (correctly, this time) on
pp. 7-470 to 7-472 of the standard.

August 4, 1998. The first published version of UnicodeData.txt that
contained the canonical mappings for the CJK Compatibility Ideographs
was UnicodeData-2.1.5.txt for Unicode 2.1.5. (Actually, they got into
UnicodeData-2.1.4.txt on July 9, 1998, but that wasn't a published
version of the data file.)

July 23, 1999. This was the publication date of the first approved
version of UAX #15 (Revision 15), and so is the first published
definition of NFC. (Of course UAX #15 had been in draft for some time
earlier than that, so the term "NFC" can be tracked back in the drafts
to mid-1998.)

September, 1999. Release of Unicode 3.0 -- the first release of Unicode
formally tied to the Unicode Normalization Algorithm. (The revision of
UAX #15 for the release was actually Revision 18, dated November 11,
1999.)

March 23, 2001. UAX #15, Version 3.1.0. This was the version of the
Unicode Normalization Algorithm that specified the composition version
to be Version 3.1.0 and locked down normalization forever more.

So essentially, there was a 9-year period between when the first
mappings were defined for the CJK Compatibility Ideographs and the date
beyond which it became impossible to reinterpret or change a canonical
mapping because of the lockdown of normalization. The problems
resulting from the normalization of CJK Compatibility Ideographs only
started to become visible to people *after* the lockdown, and when
Unicode normalization started to become a regular feature of actual
processing.

And it wasn't because "people didn't fully expect that NFC might become
widely used/enforced" -- or at least not the people in the UTC. The UAX
#15 text published with Unicode 3.0 already stated:

"The W3C Character Model for the World Wide Web requires the use of
Normalization Form C for XML and related standards..."

And it wasn't because of some oversight about the canonical mappings
involving the CJK Compatibility Ideographs per se. That same UAX #15
for Unicode 3.0 also stated:

"With *all* normalization forms singleton characters (those with
singleton canonical mappings) are replaced."

So the ground facts for the FA10 --> (NFC/NFD/NFKC/NFKD) 585C
normalization pattern were well-established and explicitly stated in
1999.
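If you want to verify those ground facts for yourself, here is a quick
sketch using Python's unicodedata module (offered here just as one
convenient window onto the UCD data -- any environment that exposes the
normalization forms would do):

    import unicodedata

    # U+FA10's mapping in UnicodeData.txt is a *canonical* singleton:
    # unicodedata.decomposition() returns an untagged code point for a
    # canonical mapping, and a "<tag> ..." form for a compatibility one.
    print(unicodedata.decomposition("\uFA10"))  # '585C' -- canonical
    print(unicodedata.decomposition("\uFF12"))  # '<wide> 0032' -- compatibility

    # Because the mapping is canonical, *all four* normalization forms
    # replace the singleton U+FA10 with U+585C:
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, "\uFA10") == "\u585C"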
> > FA47 is a CJK Compatibility character, which means it was encoded
> > for compatibility purposes -- in this case to cover the round-trip
> > mapping needed for JIS X 0213.
> >
> > However, it has a *canonical* decomposition mapping to U+6F22.
>
> And that, of course, destroys the desired "round-trip" behavior if it is
> inadvertently applied while the data are encoded in Unicode. Hence the
> need to recreate a solution to the issue of variant forms with a
> different mechanism, the ideographic variation sequence (and
> corresponding database).

Yes, that is basically correct. But this architectural "design flaw"
actually results from two additional requirements that accrued to the
Unicode Standard well after its initial design:

1. The requirement to be able to carry "round-trip" behavior through
distributed environments. In the original design, the notion of how one
would deal with legacy data was conceived of primarily as a controlled
and contained conversion issue. An application/system would convert
legacy data to Unicode, and if it needed to convert back, it could use
compatibility characters for round-trip conversion. The system would
know how and when it could normalize, because it controlled the data
and the conversion.

2. The requirement to be able to maintain CJK variant glyph
distinctions in plain text data. Again, that was not at all a part of
the original Unicode Standard design.

So the essential nature of the problem is that these new requirements
have mostly accrued to Unicode implementations *after* 2001, more or
less at the point when the lockdown of Unicode normalization made it
impossible for normalization to be adjusted in any way to account for
them. Hence the need to construct an *alternative* approach involving
variation selectors, which would be robust and invariant under
normalization transformations.
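To make that invariance concrete, here is another small Python sketch.
(U+6F22 plus VS17, U+E0100, is used purely for illustration -- whether
any particular sequence is registered is a matter for the IVD -- but
the normalization behavior shown holds for any <base, variation
selector> pair, since variation selectors have no decomposition
mappings and never participate in composition.)

    import unicodedata

    # An ideographic variation sequence: base ideograph + variation
    # selector. (Illustrative only; consult the IVD registry for
    # actually registered sequences.)
    ivs = "\u6F22\U000E0100"

    # The sequence survives every normalization form intact...
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, ivs) == ivs

    # ...whereas the compatibility ideograph U+FA47 does not:
    assert unicodedata.normalize("NFC", "\uFA47") == "\u6F22"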
> > The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.
> >
> > Easily verified, for example, by checking the FA47 entry in
> > NormalizationTest.txt in the UCD.
>
> While correct, it's something that remains a bit of a gotcha.

Yeah, well, the basic gotcha is that no matter how many times I say it
or what the Unicode Standard says, people will continue to just assume
"compatibility character" implies "compatibility decomposition". For
everybody on the list, I recommend frequent re-reading of Section 2.3,
Compatibility Characters, of the standard:

http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf

whenever somebody mentions "compatibility" in discussion of Unicode.
Yes, I suspect that people will find their heads hurting -- but this
subject *is* complex, and generalizations that people make about
"compatibility characters" are often wrong when they don't pay
attention to the details.

> Especially
> now that Unicode has charts that go to great length showing the
> different glyphs for these characters,

Well, even there the issue is complicated, because there are CJK
Compatibility Ideographs, and then there are CJK Compatibility
Ideographs. They fall into at least 3 important classes:

1. Ones which really are *unified* ideographs, despite their names.

2. Ones which are *pronunciation* variants from KS X 1001, and which
are *not* intended to show different glyphs.

3. Ones which are *graphical* variants from other legacy standards, and
which *are* intended to show different glyphs.

And even class 3 has subtypes, because some show variants that are
distinguished only in one legacy standard, whereas some are themselves
cross-mapped between more than one legacy standard -- putatively
because each legacy standard shows the same variant glyph.

It is class 3 that may be adversely affected *visually* by the
application of normalization in a distributed environment.
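Incidentally, class 2 is easy to see in the data. One more Python
sketch: the KS X 1001 pronunciation variants U+F914, U+F95C, and
U+F9BF all carry singleton canonical mappings to the same unified
ideograph, U+6A02, and no glyph distinction among them was ever
intended:

    import unicodedata

    # Class 2: KS X 1001 duplicates encoded only to round-trip the
    # multiple Korean readings of the same ideograph U+6A02.
    variants = ["\uF914", "\uF95C", "\uF9BF"]
    assert all(unicodedata.normalize("NFC", v) == "\u6A02"
               for v in variants)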
> I would suggest adding a note to
> the charts that make clear that these distinctions are *removed* anytime
> the text is normalized, which, in a distributed architecture may happen
> anytime.

The CJK Compatibility Ideographs already have warnings attached to them
in the standard. They are repeatedly documented as "only for round-trip
compatibility with XYZ" and "They should not be used for any other
purpose."

However, I think your point is a valid one. Now that the clear answer
for maintaining legacy CJK glyph variant distinctions in a distributed
environment is via ideographic variation sequences as registered in the
IVD, it would make sense to beef up the CJK Compatibility Ideograph
documentation with better pointers (and with accompanying rationale
text) to UTS #37 and the IVD, and to post stronger warning labels in
the code charts for CJK Compatibility Ideographs.

Perhaps someone would like to make a detailed proposal to the UTC for
how to fix the text and charts? ;-)

--Ken
