On Fri, 14 Feb 2003 07:45:44 -0800 (PST), Thomas Chan wrote:

> I think zhe4 'this' (simp U+8FD9 / trad U+9019) might be better for a very
> simple heuristic for modern text, since it occupies position #11 in at
> least one frequency list (compared to #15 for the above-cited ge4), and as
> far as I know, U+8FD9 is not one of those ancient characters that have
> been promoted/reused as a simplified form.

On the other hand I don't think that zhe4 is used in Cantonese, whereas I think that ge4 is, so it wouldn't be so good for pages written in Cantonese (not that I have ever seen any, but I'm sure there must be some). Probably even a simple heuristic would need to try several common characters such as ge4 and zhe4.

> Aren't such texts by default "traditional"? "Simplified" text, besides
> using simplified form characters, usually also entails refraining from
> using variant forms (according to PRC definitions of what is a variant).

Probably true, but the point that I was making is that the simplified ge4 in the text would confuse a simple heuristic.

> There are even some cases of semi-simplified forms where one half of a
> character might have been simplified according to pre-1964 rules, but the
> simplification rule for the other half has to wait until 1964. But I
> think these might've been missed by Unicode, like some of the
> ultra-simplified forms in the short-lived 1977 scheme, and Singapore's
> temporarily different (from the PRC's) schemes prior to 1976.

I think that most of the 1977 simplifications have already been encoded in Unicode, but any that haven't, and the hybrid semi-simplified forms found in some printed books from the 50s and 60s, will probably be included in CJK-C along with the rest of its unnecessary baggage. (Excuse my distaste for CJK-C, but I think that the Ideographic Rapporteur Group is indiscriminately collecting characters that in most cases probably do not need to be encoded, just for the sake of encoding as many characters as possible: 24,000+ and counting. See the "CJK Extension C Project" at http://www.cse.cuhk.edu.hk/~irg/irg/extc/CJK_Ext_C.htm for details.)

> >Now if Hanyu Da Cidian were to be put onto the internet ...
>
> How about the one here? http://202.109.114.220/

Yes, this is an excellent resource.
Although the Hanyu Da Cidian look-up only gives definitions, and none of the extremely useful quotations found in the printed book, it still mixes traditional form head words with simplified definitions, so that both simplified and traditional ge4 are found together on the same page if you search under U+500B and look at the appended compound words. I guess that according to Thomas's definition of Simplified Chinese, this makes it a Traditional Chinese page, even though most of the text is in simplified Chinese!?

Incidentally, for those interested in UTF-16 Chinese web pages, I noticed that this site is encoded as UTF-16LE.

On a related matter, I was wondering about language tagging for Chinese. "zh-CN" and "zh-TW" are used quite frequently, but what do they imply? Is an HTML page tagged as "zh-CN" expected to be composed of simplified characters, and a page tagged as "zh-TW" expected to be composed of traditional characters? Or does the CN or TW imply nothing about the orthography of the text, in which case the CN or TW may simply allow selection of an appropriate font? What if I am writing a Chinese page here in England: should I put "zh-UK", or should I make a political decision as to whose side I'm on, and use "zh-CN" or "zh-TW"?

On the other hand, "zh-simplified" and "zh-traditional" are sometimes found. These tags are less politically charged, but miss out on mixed simplified/traditional pages. Is there a "zh-mixed"?

Andrew
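
P.S. For illustration only, the kind of very simple heuristic discussed in this thread might be sketched roughly as follows. It counts the simplified and traditional forms of ge4 (U+4E2A / U+500B) and zhe4 (U+8FD9 / U+9019), the two characters mentioned above. The names MARKERS and guess_script are hypothetical, and the "mixed" and "unknown" results are my own assumption about how such a heuristic could report pages like the Hanyu Da Cidian one; a real detector would need many more characters and would still be confused by traditional texts that happen to contain U+4E2A.

```python
# Hypothetical marker pairs: (simplified form, traditional form) of two
# very common characters named in this thread. Not an exhaustive list.
MARKERS = [
    ("\u4e2a", "\u500b"),  # ge4
    ("\u8fd9", "\u9019"),  # zhe4 'this'
]


def guess_script(text):
    """Guess the orthography of `text` from a handful of marker characters.

    Returns "simplified", "traditional", "mixed" (both kinds of marker
    were seen) or "unknown" (no marker was seen at all).
    """
    simp = sum(text.count(s) for s, _ in MARKERS)
    trad = sum(text.count(t) for _, t in MARKERS)
    if simp and trad:
        return "mixed"
    if simp:
        return "simplified"
    if trad:
        return "traditional"
    return "unknown"
```

Trying several characters rather than one is what lets a Cantonese page, which may contain ge4 but not zhe4, still be classified; the "mixed" result is what a page with traditional head words and simplified definitions would produce.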

