Hi again. The reason I fudge to March/April is that it doesn't do us any good if we can't see what's in the Han section of extension B right now. (I just checked this morning and the charts for the Han do not yet seem to be available on the site.)

> . . .
> I dug it out of my own email archives, and append it below. I added
> Thomas Chan's second email address. You can take it up with him.

(Above and beyond the call of duty. Thanks.)

> . . .

Concerning where Kanji originated, there have been a number of archaeological finds in Japan that include what appear to be Kanji on items that predate the historical influence of China, and may in fact predate the Han characters in China. I don't know if any of this is on the web, but I have read about it in newspapers like the Kobe Shinbun or Nikkei Shinbun. Newspapers here are no more accurate than in the USA, and archaeological dating techniques are known to have problems, but the standard interpretation of history must always be taken with a grain of salt anyway. Look how much we have had to revise history in the USA, even though there is far less official tampering with the official version of the facts there. But that's beside the point. I would assume it really isn't in the UNICODE Consortium's interest to try to determine who invented Kanji.

> . . .

A great many of the "hidden issues with Kanji" are just as you say, mystical and ephemeral, and not really subject to being stored in bits and bytes at this point in time. Here's a concrete example I am beginning to get a handle on: the average Japanese will tell you that they do not write by radical. The only time they even study the radicals per se is when they study calligraphy. (A fairly large percentage of them do take up calligraphy, by the way.) But I watch them write. They may not realize it, but they do write and read by radical. When they get stuck looking up a word the "easy" way (by pronunciation), they don't hesitate to dig into the radical index.
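Incidentally, the first mechanical step behind either kind of index, by pronunciation or by radical, is telling the scripts apart, and that becomes nearly trivial once you have uniform code points. A minimal sketch in Python, keyed on Unicode block ranges rather than JIS rows; the function name and classification scheme are my own illustration, not any existing library:

```python
def jctype(ch):
    """Rough ctype-style class test for Japanese text, keyed on Unicode
    block ranges. An illustration only: a real JIS-based library would
    classify by row/cell (kuten) instead."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"       # CJK Unified Ideographs, base block
    if 0xFF01 <= cp <= 0xFF5E:
        return "fullwidth"   # full-width ASCII variants
    return "other"

# classify a mixed string character by character
print([jctype(c) for c in "漢字かなカナ"])
# -> ['kanji', 'kanji', 'hiragana', 'hiragana', 'katakana', 'katakana']
```

This is exactly the sort of thing a JIS-aware ctype library would provide, and exactly the sort of thing that tends not to get written.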
When they need to specify over the telephone one character among several with the same pronunciation, they name the radicals, generally in the order they write them. It is the same thing as English speakers not recognizing how much they depend on root words for spelling and cognition. Or perhaps it is a mystical attitude toward computers, trying to figure out why their most beautiful simplified shorthand script is too high a level for the magic box to handle.

UNICODE does have all the radicals, and that is great. I still wonder why the JIS committees did not bother to include at least the base representation of all the radicals. I admit that encoding the Kanji by radical does not seem to make sense, but the present encoding by whole character hides certain character issues from them.

I picked up on this when I was trying to explain the convenience of the ctype library to some co-workers at a previous company, and then realized that the Japanese have simply never bothered to write and publish a ctype library that works with JIS. Too much trouble, and they can't see the benefits. "You can't do that with Japanese" is the usual response to any suggestions in that direction. There are some companies that have something like the ctype library, but it is viewed as not being necessary for ordinary programming use (and too valuable for what they do use it for to release outside the company). Contradiction on contradiction.

> . . .
> The Unicode Standard is *not* intended to put historians of Han characters
> out of business. It is not the ultimate, final catalog. It does not attempt
> to resolve all the scholastic questions that will continue to be of
> interest. Heck, Richard S. Cook recently wrote a 250 page monograph on
> The Etymology of Chinese Chen2 (the scorpion character). He lists 208 Oracle
> bone exempla and 35 bronze exempla, and tracks the whole set of related
> forms through Shuowen and other documents.

(There's the guy who clued me in on extension B.
Thanks, Richard, if you're monitoring the list today.)

> But for global information interchange on computers, *somebody* had
> to put a stake in the ground for Han characters. The alternative was
> a dozen different stakes being moved by different committees from
> different points of view and in different directions. It already was
> chaotic, and the needs of the Internet are slowly pushing that kind
> of chaos aside, in favor of (relatively) simple, interoperable standards.

Yes, we need some basis for international communication. But five good fonts with 90K+ characters each, plus a lot of convoluted rendering and input rules, is more than it seems reasonable to put in every desktop. Well, I suppose that 100G HDs and quad-GHz processors are going to be standard in another four years. What do you do with the keyboard? Oh, never mind, we can just use a low-to-medium resolution LCD touch panel that changes the keycaps according to whatever we want to type at the moment. I guess I am probably not being sarcastic after all. Yikes.

Anyway, what I would have wished for, if you will permit a little fantasizing, was about 8K (maybe 16K) characters in the common font, just enough to produce something readable for ordinary business in each language. Japanese, for instance, would have included only the bare kana, dakuten, base radicals, and the education characters, for a little less than 1,500 total. No fancy rendering or direction rules, just enough to get each included character on the screen in some recognizable form.

Characters outside the common set would have been assigned code points by simple additive translation of existing standards into a 32-bit (or larger) code space. Each country or other organization registering a code set would have been assigned its own sub-space(s) in the international code space, and would have been primarily responsible for tables or rules to transform to the common set or to other sets. Ditto most of the really difficult rendering issues.
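For what it's worth, the "additive translation" I am fantasizing about would be almost trivial arithmetic. Here is a sketch in Python, assuming a JIS-style 94x94 row/cell (kuten) layout; the registry, the subspace base values, and the function names are all invented for illustration, not any real assignment:

```python
# Hypothetical registry: each national standard gets its own 2^16-point
# subspace in a 32-bit code space (bases invented for illustration).
SUBSPACE_BASE = {
    "JIS-X-0208": 0x01010000,
    "GB-2312":    0x01020000,
    "KS-X-1001":  0x01030000,
}

def to_universal(standard, row, cell):
    """Additive translation: a 94x94 row/cell (kuten) point maps into
    the standard's subspace by pure arithmetic, no lookup table."""
    assert 1 <= row <= 94 and 1 <= cell <= 94
    return SUBSPACE_BASE[standard] + (row - 1) * 94 + (cell - 1)

def from_universal(code):
    """Invert the mapping by finding the owning subspace."""
    for name, base in SUBSPACE_BASE.items():
        if base <= code < base + 94 * 94:
            offset = code - base
            return name, offset // 94 + 1, offset % 94 + 1
    raise ValueError("code point not in a registered subspace")

# Round-trip JIS X 0208 kuten 16-01 (the first kanji row)
code = to_universal("JIS-X-0208", 16, 1)
assert from_universal(code) == ("JIS-X-0208", 16, 1)
```

The point of the sketch is that the registering body, not a central committee, owns everything inside its subspace, including the tables mapping it down to the common set.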
(Did I hear that the ISO group originally planned for something like this, but hadn't planned on a common set?)

This sort of idea requires that the standard include some way of inserting rendering information for codes outside the common set. But since the bulk of most transmissions would consist of characters from the common set, the burden of putting the rendering data in with the text would not be so great. Rendering data would have been inserted at the top of a file, to keep it out of the way of the text data. Since the 32-bit code point would also be part of the text data, a computer containing its own fonts for a specific language would be free to substitute, especially in plain text.

Tying this fantasy into my comments on radicals, the presence of the broader list of fully rendered characters seems to me to discourage attempts at encoding characters as lists of radicals. I note that UNICODE 3.1 contains the ideographic description characters: not enough for rendering, of course, but apparently enough for search purposes, and for doing _something_ with those characters that get invented for various technical purposes each year.

> . . .

I'd better quit fantasizing in public and get back to work. Thanks again.

Joel Rees, Media Fusion KK
Amagasaki, Japan
