On Jan 21, 2004, at 6:36 AM, Andrew C. West wrote:
If a simplified form of a given CJK ideograph is used, then it deserves encoding
properly. There are newly-coined simplified forms in CJK-B and CJK-C, so why not
add newly used simplified forms to CJK-C or wherever if they are really needed?
To borrow Michael's term, this use of variation selectors is simply
pseudo-coding.
Well, first of all, there were a *lot* of mistakes made in Extension B. And Extension C isn't encoded yet. The UTC intends to lobby WG2 to do the encoding of such forms via variation selectors.
The whole point of using variation selectors is that the line between character and glyph can sometimes be a fuzzy one, and Han is probably the worst case. In the case of TC and SC, it's just as easy (in the many cases where there's a one-to-one, algorithmic relationship) to see the two forms as glyphic avatars of a single, Platonic character. Such a representation, via variation selectors, aids a number of processes, such as fuzzy searching and text-to-speech, because you don't require new tables to do a match.
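To make the "no new tables" point concrete, here is a minimal sketch in Python. It assumes only the published variation selector ranges (U+FE00..U+FE0F and U+E0100..U+E01EF); the function names are hypothetical, and the turtle sequences in the usage note are chosen purely for illustration, not as registered variants.

```python
def is_variation_selector(ch: str) -> bool:
    """True for VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text: str) -> str:
    """Drop every variation selector, leaving only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


def fuzzy_contains(haystack: str, needle: str) -> bool:
    """Substring match that ignores which glyph variant was requested.

    No equivalence table is consulted: both strings are reduced to their
    base characters and then compared directly.
    """
    return strip_variation_selectors(needle) in strip_variation_selectors(haystack)
```

With this, fuzzy_contains("a\u9F9C\uFE00b", "\u9F9C") is true: a search for the plain turtle finds text that asked for a particular variant, and the reverse holds as well. Had the variant been given its own code point, the same match would need a table relating the two.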
Indeed, right now I have to periodically run checks on the Unihan database to make sure that TC/SC pairs have the same readings. It's a pain.
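For anyone curious what such a check amounts to, here is a rough sketch in Python. It is not the script I actually run. It assumes tab-separated Unihan-style records (code point, field, value) and uses the kSimplifiedVariant and kMandarin fields; the sample records are made up for illustration, with one reading deliberately altered so the check has something to report.

```python
from collections import defaultdict

# Illustrative records only, not copied from Unihan; the last reading is
# deliberately wrong so that one pair is flagged.
SAMPLE = """\
U+9F9C\tkSimplifiedVariant\tU+9F9F
U+9F9C\tkMandarin\tgui1
U+9F9F\tkMandarin\tgui1
U+99AC\tkSimplifiedVariant\tU+9A6C
U+99AC\tkMandarin\tma3
U+9A6C\tkMandarin\tma4
"""


def parse_unihan(text):
    """Build {code point: {field: value}} from tab-separated records."""
    fields = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        fields[code][field] = value
    return fields


def mismatched_pairs(fields, reading_field="kMandarin"):
    """Return (traditional, simplified) pairs whose reading sets differ."""
    problems = []
    for trad, props in fields.items():
        trad_readings = set(props.get(reading_field, "").split())
        for simp in props.get("kSimplifiedVariant", "").split():
            simp_readings = set(fields.get(simp, {}).get(reading_field, "").split())
            if trad_readings != simp_readings:
                problems.append((trad, simp))
    return problems
```

Calling mismatched_pairs(parse_unihan(SAMPLE)) flags only the U+99AC / U+9A6C pair. With variation selectors there would be a single base character carrying a single set of readings, so there would be nothing to keep in sync.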
From an end-user perspective, there is *NO DIFFERENCE* between representing these characters using variation selectors and direct encoding. They can show up in input methods and fonts just the same.
1. Unicode Design Principle 3: "The Unicode Standard encodes characters, not
glyphs."
This is a simple glyph variant. I insist on writing the "A" in my name with two
cross-bars. Will the UTC kindly accommodate me by providing an appropriate
standardised variant for U+0041? (In fact, come to think of it I have
idiosyncratic ways of writing all of the letters in my name ...)
Well, a personal name ideograph is perhaps not the best example, since the size of the "personal name" problem is unknown. IIRC nobody's won Rick's contest yet. The goal was to come up with an instance where some people make a distinction and others don't. In any event, the example is not entirely tongue-in-cheek. First of all, all three of my Cantonese-English dictionaries contain a variant turtle ideograph which isn't encoded yet. (I haven't looked in Extension C, BTW.) Secondly, the original Korean proposal for Extension C contained literally dozens of variant turtle ideographs.
The difficulty here -- and this leads into the third example -- is that the Koreans derived their characters from a soft copy of the Korean Tripitaka. Now, I would assert that these variant turtles are probably just variant turtles, chosen idiosyncratically by the scribe for whatever reason. (Rather the way that 16th- and 17th-century English books have fairly random and inconsistent spelling.) If it is absolutely necessary to embody this variation, it would be better to use rich text. Unfortunately, it's impossible to know for certain whether the variation is merely idiosyncratic, and so variation selectors are available to make a distinction possible in plain text for those who care about it.
Granted, epigraphy is tough on plain text. As Unicode starts to deal with dead scripts, we have to deal with the issues it raises. Variation selectors are one way of doing it.
The plain fact of the matter is that the *character* turtle is already encoded,
and if someone wants to use a different glyph form for this character then he or
she should design their own font with the appropriate glyph mapped to U+9F9C or
U+9F9F.
Or any of the other turtles we already have.
======== John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jhjenkins/

