On Thursday, January 24, 2002, at 12:29 PM, John Cowan wrote:
> John H. Jenkins wrote:
>
> {TC1, SC1, SC2, TC2, TC3, SC3} constitute a "Han simplification
> class" (HSC), and are all the same when appearing in IDNs.
>
> Correct?
>

Oui.

>> The caveat is that this must be understood to be a first-order,
>> computer-appropriate equivalence and is not in any way to be held to be
>> a generalized solution to the lexically appropriate conversion between
>> SC and TC.
>
> Is there any danger that these classes will turn out to be a
> "small world", in the sense that we wind up with a few huge classes
> which include almost all the characters?
>

Nope.

>> (Maybe we should refer to *zhengguihua* instead of "Han normalization"…)
>
> Can you explain the joke?
>

It's just to make Ken happy. He doesn't like me talking about "Han
normalization," since "normalization" is Unicodespeak for something
else. "Zhengguihua" is Mandarin for "normalization."

>> It will also mean that we will no longer be able to accept both the TC
>> and SC form for a character as a candidate for separate encoding in the
>> future,
>
> I don't understand this part. Since this is neither compatibility nor
> canonical equivalence, it will not affect any of the known normalization
> forms. Nor are we defining a new normalization form here, since in
> HSCs like the above there is no particular reason to pick any of the
> six characters as *the* normalized form, although by convention we can
> pick one -- say, the one with the smallest Unicode scalar
> value, or the one which appears in the largest number of legacy
> sets -- to aid in description and implementation.
>
> It's just another of those sets of equivalence classes provided for
> special purposes, like the Arabic/Syriac shaping classes or the
> canonical combining classes.
>

Well, first of all, the UTC is already on record as refusing to encode
new SC separately. Secondly, we would break IDN equivalence. If we add
a new SC which is equivalent to two TC, then suddenly domains which
could be distinguished on the basis of the old TC pair no longer can be.

> Or are you saying that this new information should be represented
> as a Unicode compatibility equivalence? If so, that would
> wreak havoc with existing NFC and NFKC code.
>

No.

>> (Actually, you could save yourself some grief right off by excluding Han
>> radicals and all compatibility ideographs.)
>
> This would be a Bad Thing in Korean, though, because the whole point
> of Korean compatibility ideographs is to preserve differences in
> reading. Or are ideographs not used in (modern) Korean names?
>

These compatibility ideographs are *not* there to provide
phonetic-specific distinctions between various Korean hanja. They're
for compatibility with an older standard only, which did make that
distinction. IMHO it would be more confusing to Chinese, Japanese,
*and* Korean readers to have some domain names distinguished when the
only thing different about them is the Korean pronunciation of the
hanja used to write them.

==========
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/
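
For anyone who wants to see the IDN-equivalence point concretely, here is a rough sketch, not anything from the thread's participants. It models simplification classes as a union-find structure and uses the "smallest Unicode scalar value" convention mentioned above to pick a representative. The class name HanClasses, its methods, and the private-use code points standing in for TC/SC characters are all assumptions made up for illustration; no real simplification data is used.

```python
# Illustrative sketch only. HanClasses and the code points below are
# hypothetical stand-ins, not real Han simplification data.

class HanClasses:
    """Union-find over code points; each set is one simplification class."""

    def __init__(self):
        self.parent = {}

    def find(self, cp):
        """Return the class representative (smallest scalar value in the class)."""
        self.parent.setdefault(cp, cp)
        while self.parent[cp] != cp:
            self.parent[cp] = self.parent[self.parent[cp]]  # path halving
            cp = self.parent[cp]
        return cp

    def union(self, a, b):
        """Declare a and b equivalent, keeping the smaller root as representative."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            lo, hi = min(ra, rb), max(ra, rb)
            self.parent[hi] = lo

    def fold(self, label):
        """Map every character of a label to its class representative."""
        return "".join(chr(self.find(ord(ch))) for ch in label)


# Private-use code points standing in for two TC characters and a new SC.
TC_A, TC_B, NEW_SC = 0xE000, 0xE001, 0xE002

classes = HanClasses()
label_a = chr(TC_A) + "x"
label_b = chr(TC_B) + "x"

# Before the new SC exists, the two labels fold to different strings.
before = classes.fold(label_a) == classes.fold(label_b)

# Adding an SC that is equivalent to both TCs merges their classes.
classes.union(NEW_SC, TC_A)
classes.union(NEW_SC, TC_B)
after = classes.fold(label_a) == classes.fold(label_b)

print(before, after)
```

Because equivalence is transitive, tying one new character to two previously separate classes collapses them into one, which is why two labels that were distinct before the addition compare equal afterwards.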