Asmus replied:

> On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
> >> FA47 is a "compatibility character", and would have a
> >> compatibility mapping.
> >
> > Faulty syllogism.
>
> Formally correct answer but only because of something of a design flaw
> in Unicode. When the type of mapping was decided on, people didn't fully
> expect that NFC might become widely used/enforced, making these
> distinctions appear wherever text is normalized in a distributed
> architecture.
O.k., I'm gonna have to intervene again. *hehe*

Yes, there is a design flaw here, but Asmus' explanation is also
somewhat faulty, because it flattens out the history in a way that is
liable to be misunderstood.

There is a *reason* why "when the type of mapping was decided on" that
"people didn't fully expect that NFC might become widely used/enforced"
-- but it wasn't that they were goofing up in understanding the
implications of normalization. Rather, at that point in Unicode history
NFC didn't *exist* yet, nor had the normalization algorithm been
designed.

Here, for the benefit of the standards geeks out there, are the
relevant highlights of the historical timeline involved.

June, 1992. The canonical mappings for the CJK Compatibility characters
were *printed* (with off-by-one errors for some of them!) in Unicode
1.0, volume 2 (= Unicode 1.0.1). Actually, at the time, we didn't know
they were "canonical" mappings, because that concept hadn't formally
been invented yet, but the intention was clear. They were the mappings
from the "CJK compatibility ideographs" to the "real" unified Han
ideographs in the standard. The CJK compatibility characters were all
considered to be duplicates in the source standards that didn't follow
the unification rules.

July, 1996. The formal definitions of "canonical decomposition" and
"compatibility decomposition" were first published in Unicode 2.0.
There wasn't a data file for the CJK Compatibility Ideographs block,
but the canonical mappings were *printed* (correctly, this time) on
pp. 7-470 to 7-472 of the standard.

August 4, 1998. The first published version of UnicodeData.txt that
contained the canonical mappings for the CJK Compatibility Ideographs
was UnicodeData-2.1.5.txt for Unicode 2.1.5. (Actually, they got into
UnicodeData-2.1.4.txt on July 9, 1998, but that wasn't a published
version of the data file.)

July 23, 1999. This was the publication date of the first approved
version of UAX #15 (Revision 15), and so is the first published
definition of NFC. (Of course UAX #15 had been in draft for some time
earlier than that, so the term "NFC" can be tracked back in the drafts
to mid-1998.)

September, 1999. Release of Unicode 3.0 -- the first release of Unicode
formally tied to the Unicode Normalization Algorithm. (The revision of
UAX #15 for the release was actually Revision 18, dated November 11,
1999.)

March 23, 2001. UAX #15, Version 3.1.0. This was the version of the
Unicode Normalization Algorithm that specified the composition version
to be Version 3.1.0 and locked down normalization forever more.

So essentially, there was a 9-year period between when the first
mappings were defined for the CJK Compatibility Ideographs and the date
beyond which it became impossible to reinterpret or change a canonical
mapping because of the lockdown of normalization. The problems
resulting from the normalization of CJK Compatibility Ideographs only
started to become visible to people *after* the lockdown, and when
Unicode normalization started to become a regular feature of actual
processing.

And it wasn't because "people didn't fully expect that NFC might become
widely used/enforced" -- or at least not the people in the UTC. The UAX
#15 text published with Unicode 3.0 already stated:

"The W3C Character Model for the World Wide Web requires the use of
Normalization Form C for XML and related standards..."

And it wasn't because of some oversight about the canonical mappings
involving the CJK Compatibility Ideographs per se. That same UAX #15
for Unicode 3.0 also stated:

"With *all* normalization forms singleton characters (those with
singleton canonical mappings) are replaced."

So the ground facts for the FA10 --> (NFC/NFD/NFKC/NFKD) 585C
normalization pattern were well-established and explicitly stated in
1999.
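If you want to verify those ground facts for yourself, here is a quick
sketch using Python's unicodedata module (offered here just as one
convenient window onto the UCD data -- any environment that exposes the
normalization forms would do):

    import unicodedata

    # U+FA10's mapping in UnicodeData.txt is a *canonical* singleton:
    # unicodedata.decomposition() returns an untagged code point for a
    # canonical mapping, and a "<tag> ..." form for a compatibility one.
    print(unicodedata.decomposition("\uFA10"))  # '585C' -- canonical
    print(unicodedata.decomposition("\uFF12"))  # '<wide> 0032' -- compatibility

    # Because the mapping is canonical, *all four* normalization forms
    # replace the singleton U+FA10 with U+585C:
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, "\uFA10") == "\u585C"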
> > FA47 is a CJK Compatibility character, which means it was encoded
> > for compatibility purposes -- in this case to cover the round-trip
> > mapping needed for JIS X 0213.
> >
> > However, it has a *canonical* decomposition mapping to U+6F22.
>
> And that, of course, destroys the desired "round-trip" behavior if it is
> inadvertently applied while the data are encoded in Unicode. Hence the
> need to recreate a solution to the issue of variant forms with a
> different mechanism, the ideographic variation sequence (and
> corresponding database).

Yes, that is basically correct. But this architectural "design flaw"
actually results from two additional requirements that accrued to the
Unicode Standard well after its initial design:

1. The requirement to be able to carry "round-trip" behavior through
distributed environments. In the original design, the notion of how one
would deal with legacy data was conceived of primarily as a controlled
and contained conversion issue. An application/system would convert
legacy data to Unicode, and if it needed to convert back, it could use
compatibility characters for round-trip conversion. The system would
know how and when it could normalize, because it controlled the data
and the conversion.

2. The requirement to be able to maintain CJK variant glyph
distinctions in plain text data. Again, that was not at all a part of
the original Unicode Standard design.

So the essential nature of the problem is that these new requirements
have mostly accrued to Unicode implementations *after* 2001, more or
less at the point when the lockdown of Unicode normalization made it
impossible for normalization to be adjusted in any way to account for
them. Hence the need to construct an *alternative* approach involving
variation selectors, which would be robust and invariant under
normalization transformations.
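To make that invariance concrete, here is another small Python sketch.
(U+6F22 plus VS17, U+E0100, is used purely for illustration -- whether
any particular sequence is registered is a matter for the IVD -- but
the normalization behavior shown holds for any <base, variation
selector> pair, since variation selectors have no decomposition
mappings and never participate in composition.)

    import unicodedata

    # An ideographic variation sequence: base ideograph + variation
    # selector. (Illustrative only; consult the IVD registry for
    # actually registered sequences.)
    ivs = "\u6F22\U000E0100"

    # The sequence survives every normalization form intact...
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, ivs) == ivs

    # ...whereas the compatibility ideograph U+FA47 does not:
    assert unicodedata.normalize("NFC", "\uFA47") == "\u6F22"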
> > The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.
> >
> > Easily verified, for example, by checking the FA47 entry in
> > NormalizationTest.txt in the UCD.
>
> While correct, it's something that remains a bit of a gotcha.

Yeah, well, the basic gotcha is that no matter how many times I say it
or what the Unicode Standard says, people will continue to just assume
"compatibility character" implies "compatibility decomposition". For
everybody on the list, I recommend frequent re-reading of Section 2.3,
Compatibility Characters, of the standard:

http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf

whenever somebody mentions "compatibility" in discussion of Unicode.
Yes, I suspect that people will find their heads hurting -- but this
subject *is* complex, and generalizations that people make about
"compatibility characters" are often wrong when they don't pay
attention to the details.

> Especially
> now that Unicode has charts that go to great length showing the
> different glyphs for these characters,

Well, even there the issue is complicated, because there are CJK
Compatibility Ideographs, and then there are CJK Compatibility
Ideographs. They fall into at least 3 important classes:

1. Ones which really are *unified* ideographs, despite their names.

2. Ones which are *pronunciation* variants from KS X 1001, and which
are *not* intended to show different glyphs.

3. Ones which are *graphical* variants from other legacy standards, and
which *are* intended to show different glyphs.

And even class 3 has subtypes, because some show variants that are
distinguished only in one legacy standard, whereas some are themselves
cross-mapped between more than one legacy standard -- putatively
because each legacy standard shows the same variant glyph.

It is class 3 that may be adversely affected *visually* by the
application of normalization in a distributed environment.
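Incidentally, class 2 is easy to see in the data. One more Python
sketch: the KS X 1001 pronunciation variants U+F914, U+F95C, and
U+F9BF all carry singleton canonical mappings to the same unified
ideograph, U+6A02, and no glyph distinction among them was ever
intended:

    import unicodedata

    # Class 2: KS X 1001 duplicates encoded only to round-trip the
    # multiple Korean readings of the same ideograph U+6A02.
    variants = ["\uF914", "\uF95C", "\uF9BF"]
    assert all(unicodedata.normalize("NFC", v) == "\u6A02"
               for v in variants)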
> I would suggest adding a note to
> the charts that make clear that these distinctions are *removed* anytime
> the text is normalized, which, in a distributed architecture may happen
> anytime.

The CJK Compatibility Ideographs already have warnings attached to them
in the standard. They are repeatedly documented as "only for round-trip
compatibility with XYZ" and "They should not be used for any other
purpose."

However, I think your point is a valid one. Now that the clear answer
for maintaining legacy CJK glyph variant distinctions in a distributed
environment is via ideographic variation sequences as registered in the
IVD, it would make sense to beef up the CJK Compatibility Ideograph
documentation with better pointers (and with accompanying rationale
text) to UTS #37 and the IVD, and to post stronger warning labels in
the code charts for CJK Compatibility Ideographs.

Perhaps someone would like to make a detailed proposal to the UTC for
how to fix the text and charts? ;-)

--Ken
