On Tuesday, November 12, 2002, at 09:03 AM, Andrew C. West wrote:
BTW, what is "CJK Unified Ideographs Extension C" intended to include ? Surely
not any more ordinary Han ideographs - with over 70,000 ideographs already
encoded, there can't be so many genuine ideographs that still need encoding as
to warrant a whole new plane. However there is a real need to encode oracle bone
characters and other ancient epigraphic forms of Han ideographs. Is this
(hopefully) what Extension C is intended for ?
Nope. We're still doing modern stuff. It is unlikely in the extreme that we'll actually *need* a whole plane for new ideographs. Extension C is currently big enough, however, that if we were to accommodate it via separate encoding of everything we'd use up the rest of Plane 2. And there's still no end in sight.
To some extent, we're having to deal with massive turtle--er, fecal matter being dumped uncritically into the bin, consisting largely of things which are obviously variants of existing characters. This we will deal with to an extent by using variation selectors. (Many of Unicode's proposed additions are unofficial simplifications which will also be handled via variation selectors.)
Beyond that, it is incredible just how many obscure characters there are once you start looking for them. The PRC's submission includes large numbers of place names, for example, and I dread to think how many more of *those* there may be. HKSAR has come up with more Cantonese- or Hong Kong-specific characters. The only non-Mandarin dialect to receive *any* attention at all is Cantonese, and despite the efforts of the HKSAR that's been rather unsystematic. Unicode's proposed characters include a few Cantonese-specific ones that we were able to dig up without much effort.
And all this leaves out stuff like cute names for Hong Kong race horses, frogs-in-wells, and things like that.
All in all, I wouldn't be surprised if there were as many as ten thousand or so genuinely distinct characters in modern use which have yet to be encoded. And there are a number of borderline cases from pre-modern texts where it looks like it's probably a variant but it may not be. (Of course, I also estimated the total number of genuine Han ideographs to be under eighty thousand, which just goes to show how much *I* know.)
Oracle bone forms and other older versions of the Han ideographs are something we haven't even got a good model for how to handle yet.
==========
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.tejat.net/

