Re: [Fwd: Re: Swastika to be banned by Microsoft?]
On Mon, 15 Dec 2003, Mark E. Shoulson wrote: Is this like baseball scoreboards showing the third consecutive strikeout symbol (which is a K) reversed? Is that to avoid KKK or is it for another reason? Which of course begs the question of whether we should encode a LATIN CAPITAL REVERSED K character. How about getting a taboo variation indicator, like the ideographic one proposed in WG2 N2475 (http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n2475.pdf)? Then one can also represent the Nazi swastika (if it were encoded) the way it is shown on censored products (e.g., model kits, toys/games, etc.), shorn of its arms so it looks like an x, or replaced with the black cross (shown on aircraft of the period). Thomas Chan [EMAIL PROTECTED]
Re: Swastika to be banned by Microsoft?
On Sun, 14 Dec 2003, Michael Everson wrote: At 15:40 +0100 2003-12-14, Stefan Persson wrote: Aren't the U+534D and U+5350 only defined for Asian usage, so that different code points (which seem not to be defined in the current version of the standard) have to be used for ancient European purpose? All of the characters in the Unicode Standard are for anyone's use. So would all swastikas be unifiable as U+534D and U+5350? The entry for U+534D in the _Hanyu Da Zidian_, vol. 1, p. 51 (as indicated in unihan.txt) includes a quote that it was originally not a Han character, wan ben fei zi ..., suggesting that it now is. There are also serifs shown in that dictionary and the _Kangxi Zidian_ for both characters. Couldn't the above two characters be considered a CJK or IDEOGRAPHIC version (like the spaces, zero, punctuation, brackets, etc. in the CJK Symbols and Punctuation block)? Thomas Chan [EMAIL PROTECTED]
Re: Klingons and their allies - Beyond 17 planes
On Sat, 18 Oct 2003 [EMAIL PROTECTED] wrote: In addition to the problem of the OS substituting improper glyphs from inappropriate fonts unexpectedly, there's often a problem with line breaking. Since the PUA has no properties, some applications seem to ignore the space character and break lines arbitrarily, splitting words in the middle. No properties, or Han properties? Thomas Chan [EMAIL PROTECTED]
Re: [OT] Meaning of U+24560?
On Sun, 12 Oct 2003, Patrick Andries wrote: I'm a bit lost now... I'm looking for U+24560, radical 89 (double x) and 6 strokes. Here's the _Kangxi Zidian_ entry on U+24560, as well as U+2455F's entry, which is referenced by the former (whose first of three definitions--the one I suspect you are looking for--references everyday U+758F): http://deall.ohio-state.edu/grads/chan.200/misc/kangxi_zidian_u24560.jpg Thomas Chan [EMAIL PROTECTED]
Re: When is a character a currency sign?
On Tue, 8 Jul 2003, Philippe Verdy wrote: On Tuesday, July 08, 2003 3:35 AM, Thomas Chan [EMAIL PROTECTED] wrote: On Mon, 7 Jul 2003, Philippe Verdy wrote: Would Euro also be a (four-character) currency sign? Certainly not: this would be a word, whose orthograph varies with language. See the banknotes, where it is written in Greek letters, the capitalization also changes with language or context (all uppercase on banknotes, lowercase in normal French text, titlecase in German), as well as the plural forms according to language rules. We could say the same thing about the terms dollar, pound/livre, mark, escudo, peseta, yen, yuan, ruppie/roupie, sucre... (see also the Japanese Kana square characters created for these terms: they are not really currency signs, but an orthographic representation of these names adapted to a script, mostly like a transliteration)... But what does one do for a script like Han characters where those tests don't apply? e.g., in Chinese, U+938A is used for 'pound'--is that a word, or a currency sign? U+5713 or U+5143 for 'yuan'? Etc. Thomas Chan [EMAIL PROTECTED]
Re: When is a character a currency sign?
On Mon, 7 Jul 2003, Philippe Verdy wrote: On Monday, July 07, 2003 9:41 PM, Michael Everson [EMAIL PROTECTED] wrote: At 15:03 -0400 2003-07-07, Tex Texin wrote: When is a character properly called a currency sign? Hunh? When you use it to represent currency. DM was two characters used as a character sign in Germany. As well as now the EUR international currency code, usable also as a symbol when the Euro sign is not available. Would Euro also be a (four-character) currency sign? Thomas Chan [EMAIL PROTECTED]
Re: pinyin syllable `rua'
On Fri, 14 Mar 2003, Werner LEMBERG wrote: Some lists of pinyin syllables contain `rua', but I actually can't find any Chinese character with this name. Not all words or utterances necessarily have written forms, but... Does it exist at all? Or is it just there for completeness of pinyin? ...I found a rua2 in the _Xiandai Hanyu Cidian_, U+633C, glossed as 1) (zhi huo bu) zhou ((paper or cloth) wrinkle) and 2) kuai yao po (about to break). It's marked fang (dialect). There's a pointer to the same character, pronounced ruo2, which in turn is glossed as roucuo (to rub), and marked shu (bookish). Thomas Chan [EMAIL PROTECTED]
Re: traditional vs simplified chinese
On Thu, 13 Feb 2003, Zhang Weiwu wrote: Take it easy, if you find one 500B (the measure word) it is usually enough to say it is traditional Chinese, one 4E2A (measure word) is in simplified Chinese. They never happen together in a logically correct document. Others have already given examples of logically correct documents with both characters, but one cannot always have the luxury of assuming the data is not deviant. For example, there are many electronic texts online that are a hybrid of simplified and traditional text, because they contain erroneous conversions from a simplified source document (typically GB2312) to a traditional one (typically Big5). I think zhe4 'this' (simp U+8FD9 / trad U+9019) might be better for a very simple heuristic for modern text, since it occupies position #11 in at least one frequency list (compared to #15 for the above-cited ge4), and as far as I know, U+8FD9 is not one of those ancient characters that have been promoted/reused as a simplified form. On Thu, 13 Feb 2003, Andrew C. West wrote: Take, for example, this Web page -- http://uk.geocities.com/Morrison1782/Texts/TianguanCifu.html -- which transcribes a short one-act play from the Cantonese Opera tradition, published during the Qing dynasty (probably early 19th century). It has U+4E2A (simplified ge4) but not U+500B (traditional ge4), and yet is written mostly in traditional characters. How would your algorithm classify such a page ? Aren't such texts by default traditional? Simplified text, besides using simplified form characters, usually also entails refraining from using variant forms (according to PRC definitions of what is a variant). And depending on how far one wants to stretch the definition, PRC-style vocabulary, etc., cf., http://www.cjk.org/cjk/reference/chinvar.htm and http://www.cjk.org/cjk/c2c/c2cbasis.htm . 
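A rough sketch of the very simple heuristic discussed above, in Python. The function name, the marker table, and the "mixed"/"unknown" labels are illustrative assumptions and not something proposed in the thread. It only counts the zhe4 and ge4 pairs, so it would misjudge hybrid texts and pages like the Cantonese opera transcription.

```python
# Marker pairs taken from the discussion: (simplified, traditional).
MARKERS = [
    ("\u8fd9", "\u9019"),  # zhe4 'this'
    ("\u4e2a", "\u500b"),  # ge4 (measure word)
]

def guess_script(text):
    """Return 'simplified', 'traditional', 'mixed', or 'unknown'.

    Hypothetical helper: counts only the marker characters above.
    """
    simp = sum(text.count(s) for s, _ in MARKERS)
    trad = sum(text.count(t) for _, t in MARKERS)
    if simp and trad:
        return "mixed"
    if simp:
        return "simplified"
    if trad:
        return "traditional"
    return "unknown"
```

A "mixed" result is the interesting case: it may be a logically correct document, a bad GB2312-to-Big5 conversion, or an older text that uses U+4E2A in otherwise traditional orthography.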
On Thu, 13 Feb 2003, Marco Cimarosti wrote: The easiest way to do it is folding both the user's query and the content being sought to the same form (either traditional or simplified, it doesn't matter). It may also help to fold other kinds of variants beside simplified and traditional. It would help to at least fold the Unicode z-variants together. For example, with the possibility of Unicode data, authors have the choice of U+6236, U+6237, and U+6238 for hu4 'door', but these are not meaningful distinctions, and certainly a lot harder to detect than the typical traditional/simplified case. On Thu, 13 Feb 2003, Edward H Trager wrote: And I've seen books printed in the beginning years of the PRC era using mostly simplified, but with smatterings of traditional characters here and there. These books were printed in the days of lead type, so I [...] Those must be the ones printed before the final 1964 version of the simplification (drafts dating back to 1956, and some earlier pre-1949 usages in Communist-occupied areas), so that they do not utilize all the simplified characters that eventually exist in the 1964 version. There are even some cases of semi-simplified forms where one half of a character might have been simplified according to pre-1964 rules, but the simplification rule for the other half has to wait until 1964. But I think these might've been missed by Unicode, like some of the ultra-simplified forms in the short-lived 1977 scheme, and Singapore's temporarily different (from the PRC's) schemes prior to 1976. On Fri, 14 Feb 2003, Andrew C. West wrote: Now if Hanyu Da Cidian were to be put onto the internet ... How about the one here? http://202.109.114.220/ Thomas Chan [EMAIL PROTECTED]
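The folding idea above could be sketched like this in Python. The FOLD table is a tiny hand-made assumption covering only characters mentioned in this thread, and the choice of target form is arbitrary. A real table would presumably be generated from variant data such as the Unihan kZVariant, kSimplifiedVariant, and kTraditionalVariant fields.

```python
# Hypothetical folding table: variant code point -> arbitrarily chosen target.
FOLD = {
    0x6237: 0x6236,  # hu4 'door', z-variant
    0x6238: 0x6236,  # hu4 'door', z-variant
    0x4E2A: 0x500B,  # ge4, simplified -> traditional
    0x8FD9: 0x9019,  # zhe4, simplified -> traditional
}

def fold(text):
    """Map every character listed in FOLD to its target form."""
    return text.translate(FOLD)

def matches(query, content):
    """Substring search after folding both sides to the same form."""
    return fold(query) in fold(content)
```

Since both the query and the content go through the same table, it does not matter which form is picked as the target, as Marco says.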
Re: 4701
On Tue, 4 Feb 2003, Andrew C. West wrote: I have a half-finished page that gives the names of the twelve calendrical animals in the languages of various peoples within and bordering China that have adopted the Chinese calendrical system, available at : http://uk.geocities.com/BabelStone1357/Calendar/index.html That makes a very nice multilingual Unicode demonstration. I've typed up my notes on the Vietnamese and Korean ones here, for you: http://deall.ohio-state.edu/grads/chan.200/misc/cal.html I've checked the native orthography with dictionaries (but would appreciate double-checking by natives) but not the definitions (would also appreciate that being checked/confirmed). For Korean, I haven't provided a transcription nor transliteration (that affects for example, the second from last, 'chicken', with regards to the underlying r). I don't know if any of these are bound forms, or if any are not the everyday term for those animals, cf., the alternative Chinese term in your list for 'dog', gou3, which replaces the quan3 vocabulary item in almost all dialects. As I recall, in Vietnamese the rabbit is replaced by a cat, and in I've heard of that, but my teacher also said that tho? 'rabbit' was also possible. Thomas Chan [EMAIL PROTECTED]
Re: 4701
On Sat, 1 Feb 2003, Michael Everson wrote: At 10:19 -0800 2003-02-01, Eric Muller wrote: Michael Everson wrote: Happy New Year of the Yáng to everybody! (I can't work out whether it's the Year of the Sheep, the Goat, or the Ram.) Ram. europe.cnn.com (which I was looking at for other, sadder reasons), says Goat. My local Superquinn's (large grocery chain) has had signs on all the Chinese food for weeks which says Ram. My Chinese dictionary says Sheep. And the website of the Pearl River (www.pearlriver.com) department store in New York City says lamb! unihan.txt says that U+7F8A is sheep, goat; KangXi radical 123. On Google, year of the goat has the lead. And it is 4701 or 4700?--the only thing that is certain is that it is the guiwei year of the sixty-year cycle of Cathay. Thomas Chan [EMAIL PROTECTED]
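The guiwei part at least can be checked mechanically. A small Python sketch, assuming the usual anchor that 1984 was a jiazi year. The function name is made up, and the result really applies to the year beginning at the lunar new year, not on 1 January.

```python
STEMS = ["jia", "yi", "bing", "ding", "wu", "ji", "geng", "xin", "ren", "gui"]
BRANCHES = ["zi", "chou", "yin", "mao", "chen", "si",
            "wu", "wei", "shen", "you", "xu", "hai"]

def sexagenary(year):
    """Stem-branch name of the cycle year that starts in the given Gregorian year."""
    offset = (year - 1984) % 60  # assumption: 1984 = jiazi, first year of a cycle
    return STEMS[offset % 10] + BRANCHES[offset % 12]
```

sexagenary(2003) gives "guiwei". The sketch says nothing on 4701 versus 4700, which depends on the epoch one picks.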
RE: Precomposed Tibetan
On Wed, 18 Dec 2002, Marco Cimarosti wrote: Andrew C. West wrote: If anyone thinks that a mapping table would be useful as a weapon in the fight against the Chinese proposal, I would be happy to provide one. Do you have the relevant data? As I said, so far I found little or nothing about BrdaRten or about the Founders System mentioned by Ken Whistler. Previously, Ken Whistler said: One additional detail for people. The BrdaRten stacks are currently implemented, in the Founders System software in Tibet, as an extension to GB 2312. This sounds like they might have been implemented as a vendor extension in the private/end-user area of GB 2312, if it is anything like how as-yet-unencoded Han characters are treated. If so, then one'd probably need access to a font itself to see. Looking at Founder's site, I found this, a bunch of Tibetan fonts they make: http://font.founder.com.cn/chanpinzl/CP_zangwen.htm In the body text, they describe Tibetan as 600+ (pre-composed) characters, and 4,400+ if Sanskrit is included. But next to each font, it says 4000+ for the first one (Tibetan and Sanskrit), 2000+ for the second one (Tibetan and Sanskrit), and for the last three, 800+ (Tibetan). WG2 N2558 only proposes 956 pre-composed, so I'm not sure what these different numbers mean, except that counts sometimes cavalierly include irrelevant stuff like punctuation and symbols to pad the number. But so far I haven't seen anything strongly linking Founder to WG2 N2558, except that the latter mentions Founder as an *example* of a precomposed Tibetan implementation (2). We don't necessarily want to be making vendor/legacy/font-based to unicode mapping tables for every potential vendor, do we? Thomas Chan [EMAIL PROTECTED]
Re: CJK fonts
(I've merged Andrew's two messages--12/13 and 12/16--together, below.) On Fri, 13 Dec 2002, Andrew C. West wrote: On Fri, 13 Dec 2002 01:33:08 -0800 (PST), Thomas Chan wrote: I can't imagine where the yi4 reading comes from, although I note I was thinking along the same lines. The Kangxi Zidian gives U+3CBC a reading of YI4 (as does the Unihan database - the CHA4 reading seems to be as a variant form of U+6C4A). What edition of the _Kangxi Zidian_ are you using that gives explicit Mandarin readings like yi4, or are you interpreting the fanqie notation yourself? I use the 1958 edition, 1997 2nd printing published by Zhonghua, ISBN 7-101-00518-7. I find self-interpretation of fanqie to be fraught with peril, partially as fanqie was never a completely perfect transcription system, not to mention that fanqie from old dictionaries does not necessarily tell one anything about contemporary pronunciation. e.g., U+5B7B is a Yue (Cantonese), Hakka, and Min character, meaning 'last (child)' (derived from 'last child of an old man', hence the character's appearance as 'child' + 'to use up'), pronounced laai1 or lai1 in Cantonese.[1] However, the old dictionaries including Kangxi give a fanqie of U+6CE5 U+53F0 U+5207, which would yield an artificial nai2 in Mandarin, which is exactly what the _Hanyu Da Zidian_ says explicitly. Either the pronunciation has changed from [n-] to [l-] and reading old dictionaries fails to account for modern developments, or whoever chose U+6CE5 to indicate the onset was pronouncing U+6CE5 as *l-. [1] While there is a long-standing ongoing sound change in Cantonese from [n-] to [l-], this is probably no longer one of them, and *naai1/nai1 would now be regarded as hypercorrection. [...] At any rate, what I think is important is that we do not assume that YI4 is wrong and throw it out just because none of us recognise the reading ... though I guess if it is that obscure, it really hasn't got a place in the Unihan database. 
But what if the character is obscure, and the reading thusly also obscure? I think there are diminishing benefits to overly-proofing the unihan database for such characters--if they are so rare, then no one will find the character by searching on an obscure/artificial reading, and if it is so rare, then those interested should be consulting actual comprehensive dictionaries (like the Kangxi or _Hanyu Da Zidian_) instead of relying on a text file. In a way, we currently have this situation--the Plane 2 characters are, on average, more obscure than the BMP characters, and the lack of information is kind of saying look it up yourself if you really, really need to know. If Hanyu Da Zidian and Hanyu Da Cidian both give GAN4 for the modern reading of U+5481 I for one would prefer that reading to GEM4. Ci Hai also has such non-Mandarin syllables as NGU2 for U+5514. The principle of Pinyin are clearly defined (and like most PRC dictionaries Ci Hai includes a copy of the Hanyu Pinyin Fang'an as an appendix - even if it does not fully adhere to it), and syllables like GEM4 and NGU2 are simply not allowed. I agree with your sentiment that gem4 is an aberration, despite my support of the _Cihai_ (PRC 1979) in that it did not get included in the unihan database from out of nowhere. When U+5481 was reinvented by the Cantonese, it was patterned both graphically and phonologically on U+7518, which is gan1 'sweet' in Mandarin (gam1 in Cantonese). U+5481 is in Cantonese gam3 'so (quantity)' (3 = yinqu tone); hence gan4 is an appropriate Mandarin reflex. ngu2 for U+5514 is also an aberration--yet another case of a quixotic attempt to mimic dialect pronunciation in Mandarin. Sure, it's m4 (a syllabic nasal [m]) 'not' in Cantonese, but this is just a re-use of a pre-existing semi-homophonous character, ng4 (another syllabic nasal; considered close enough to m4 in Cantonese), a sound in singing. 
As that is wu2 in Mandarin, so thus should 'not' be given an artificial *wu2 reading (which is what the unihan database has currently--no doubt that piece of data was inputted from a more sensible dictionary). But elsewhere, this battle is lost--U+5187 'to not have' (among other meanings), perhaps the most recognizable Cantonese character to non-Cantonese, is nowadays given the pronunciation mao3[2], despite earlier dictionary compilers such as Samuel Wells Williams, in his 1877 dictionary, having recognized it as derived from U+7121 with a tone change, and assigned it a Mandarin wu3 reading accordingly. [2] I note that even mao is a poor approximation; *mou would've been closer (and still a valid and normal Mandarin syllable). On the other hand, a reading of FIAO4 for the dialectal ideograph U+8985 may sound odd to a Mandarin speaker, but it is perfectly acceptable according to the rules of Pinyin (F is a valid initial, and IAO4 is a valid final). FIAO4 is the only reading for this ideograph given in Hanyu Da Cidian, Ci Hai
Re: CJK fonts
On Thu, 12 Dec 2002, Andrew C. West wrote: On Thu, 12 Dec 2002 03:26:07 -0800 (PST), Raymond Mercier wrote: For example, the simplified form of the character Han itself (U+6C49) is given the Pinyin reading Yi, the traditional form U+6F22 is the correct reading Han. This is probably another example of misplaced secondary Mandarin readings - I reckon that about 10% of the CJK block (i.e. a couple of thousand of characters) are affected. Unihan Version 3.0 (the latest version to have the correct Mandarin readings for the CJK Unified Ideographs block) gives : U+6C49 kMandarin YI4 HAN4 In Unihan 3.2 this becomes : U+6C49 kMandarin YI4 and the reading of HAN4 is mislocated to U+6C44 : U+6C44 kMandarin HAN4 ZE4 (plain ZE4 in Unihan 3.0) It is quite possible that YI4 is a reading for U+6C49 when not a simplified form of U+6F22 (I'll have to check this when I get home this evening ... no dictionaries here I'm afraid). The _Hanyu Da Zidian_ (3: 1549) only says that U+6C49 is a simplified form of U+6F22, as expected, which in turn has a han4 and a tan1 reading (3: 1714)--I don't know if the rarer tan1 (and its associated definition) can really be inherited by U+6C49, though. U+6C44 is given as interchangeable (3: 1548) with U+3CC1, which has a ze4 reading (3: 1560). I can't imagine where the yi4 reading comes from, although I note that U+3CBC, which looks somewhat similar to U+6C49, is given both yi4 and cha4 readings (3: 1549). U+5481 kMandarin GEM4 - GEM4 is Cantonese pinyin (it is a common Cantonese ideograph) - I don't think this ideograph has a Mandarin reading ... but if it did it would presumably be GAN4 ... which is the reading I give it in BabelMap The _Hanyu Da Zidian_ (1: 598) has han2 'breast; milk', xian2 'to hold in the mouth', and gan4 'so (quantity)' for readings. 
But I don't think gem4 is an error--the 1979 PRC _Ci Hai_ has gem4 'so (quantity)' and han1 'so (quantity)' for readings, where gem4 corresponds to the same vocabulary item as gan4 in the former dictionary. (The han1 reading is not Cantonese usage of the character, but Hunanese, despite the identical meaning.) It's rare, but sometimes there are unusual Mandarin syllables like gem4 given in dictionaries. However, I don't expect much reason when it comes to interpreting and creating artificial Mandarin cognate readings of ancient or dialect words, e.g., U+5C58 'child' is given as man3, but this is based on a misreading of a pronunciation gloss--a phonologically reasonable cognate reading of this Taiwanese dialect character would have been *man1; likewise, U+55F2 'coquettish' is given as the unusual Mandarin syllable dia3--a phonologically reasonable cognate reading of this Cantonese dialect character would have been *die3 (patterning on U+7239 die1 'father' both graphically and phonologically). U+4C5B kMandarin XU4M - this is from CJK-A in Unihan 3.2 ... I assume that the M is spurious xu4 and yi4 in the _Hanyu Da Zidian_ (7: 4695). U+6F71 kMandarin YIE - this should be YI1 _Hanyu Da Zidian_ (3: 1736) says ye1 here, but I bet you have a source that backs yi1 just as well... Thomas Chan [EMAIL PROTECTED]
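Entries like XU4M and YIE could be caught mechanically before anyone argues about which dictionary is right. A minimal Python sketch, assuming tab-separated "U+XXXX, field, value" lines as in unihan.txt and assuming a reading is uppercase letters followed by a single tone digit 1-5. The pattern and the function name are my own, not anything defined by Unihan.

```python
import re

# Assumed shape of one reading: uppercase letters, then one tone digit.
READING = re.compile(r"^[A-ZÜ]+[1-5]$")

def malformed_readings(line):
    """Return the kMandarin readings on one line that fail the assumed pattern."""
    codepoint, field, value = line.rstrip("\n").split("\t", 2)
    if field != "kMandarin":
        return []
    return [r for r in value.split() if not READING.match(r)]
```

This only checks the shape of a reading. GEM4 passes, since it is well-formed even if it is not a legal Pinyin syllable, and a misplaced reading like HAN4 on U+6C44 cannot be detected this way at all.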
Re: The result of the Plane 14 tag characters review
On Mon, 18 Nov 2002, Michael Everson wrote: At 13:37 -0800 2002-11-18, Kenneth Whistler wrote: Go to any Japanese newspaper. There is no required change of typographic style when Chinese names and placenames are mentioned in the context of Japanese articles about China. Go to any Chinese newspaper. There is no required change of typographic style when Japanese names and placenames are mentioned in the context of Chinese articles about Japan. Just to be sure: this means that a Japanese newspaper uses the glyphs its readers prefer for Chinese names, not glyphs which Chinese readers may prefer? Does this extend to the Simplified/Traditional instance, so that if a Chinese name has the word for horse in it, it uses the Japanese glyph for horse, not either the S or T version of the glyph (assuming for the sake of argument that both occur and that both are different from the preferred Japanese glyph)? Yes. Not only are Japan's preferred glyphs used, but the actual characters are changed, if necessary. e.g., the famous eighteenth-century Chinese novel, _Hongloumeng_ (Dream of Red Chamber), is studied in Japan, where it is known as _Kouroumu_. In TC, one writes U+7D05 U+6A13 U+5922 for the title, while in SC, U+7EA2 U+697C U+68A6--all three characters are different. Searching on Japanese pages only at Google for the TC form, I only find 153 matches, whereas the actual Japanese form of the title, U+7D05 U+697C U+5922, which differs from the TC in the second character (or, to look at it in another way, differs from SC in all but the second character), finds 2,730 matches. Granted, looking only at electronically-stored instances has its flaws, such as the limitations of legacy character sets, but since both U+6A13 (form of second character used in TC) and U+697C (actual second character used in the Japanese form of the title) are available in a Japanese character set, the choice of the latter is clearly a deliberate choice. Thomas Chan [EMAIL PROTECTED]
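The position-by-position comparison of the three forms of the title can be spelled out in a few lines of Python. The code points are the ones given above. The dictionary keys and the helper name are only illustrative.

```python
# The three written forms of the title, from the code points in the message.
TITLES = {
    "TC": "\u7d05\u6a13\u5922",
    "SC": "\u7ea2\u697c\u68a6",
    "JA": "\u7d05\u697c\u5922",
}

def differing_positions(a, b):
    """1-based positions at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]
```

differing_positions(TITLES["JA"], TITLES["TC"]) yields only position 2, while the JA and SC forms differ at positions 1 and 3, matching the description.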
Re: A .notdef glyph
On Thu, 7 Nov 2002, John Hudson wrote: At 13:07 11/7/2002, John Cowan wrote: Wouldn't the glyph for the GETA SIGN be suitable as a .notdef glyph? That seems to be just what GETA is for. Aha! Thank you, I'd never noticed that before. I think the GETA MARK would be ambiguous to a non CJK user, but I like the idea of the strong horizontal bars very much. GETA MARK is also ambiguous to Chinese readers; an M-sized WHITE SQUARE or WHITE CIRCLE (or LARGE CIRCLE) are more familiar. I'm not familiar with how the GETA MARK is supposed to be used in Japanese, but I hesitate to blur the possible distinction between 1) there's a character here but you don't see it because the font is missing a glyph, 2) there's no character here for you to see because what the author would like to put there is not encoded in Unicode, and 3) there is expected to be something here (e.g., a letter, an ideograph, etc) but the author doesn't even know what it is (e.g., transcribing a tablet with broken pieces or paper with insect damage, or undecipherable/illegible source text). I don't think the distinction between #2 and #3 need or should be standardized at this level--it is up to a convention that the author should establish with the reader, as with any specialized notation--but there is certainly a difference between #1 (author succeeds in writing but reader fails in viewing) and #2/#3 (author fails in writing). Given the current white box/rectangle (or other symbols) for notdef, if I see one of those, I really don't know if my font is defective, or if the author voluntarily put it there to signify something. Thomas Chan [EMAIL PROTECTED]
RE: In defense of Plane 14 language tags (long)
On Tue, 5 Nov 2002, Marco Cimarosti wrote: Doug Ewell wrote: Readers are asked to consider the following arguments individually, so that any particular argument that seems untenable or contrary to consensus does not affect the validity of other arguments. 1. Language tags may be useful for display issues. The most commonly suggested use, and the original impetus, for Plane 14 language tags is to suggest to the display subsystem that “Chinese-style” or “Japanese-style” glyphs are preferred for unified Han characters. [...] IMHO, there has never been any practical need to consider these glyphic differences in plain text. It is a non-issue raised to the rank of issue because of obscure political reasons. It is false that Japanese is unreadable if displayed with Chinese-style glyphs, or that Polish is unreadable if displayed with Spanish-styles acute accents. It is also not even an issue of language, but national glyph preferences. It is only through situations such as Chinese (language) used in mainland China and Chinese (language) used in Taiwan that we are able to disambiguate whether glyph preference differences are due to language or country--and it is country, because mainland China has implemented glyph reforms that reduce the differences between printing and handwritten print forms by making the former resemble the latter[1]. Certainly, it is confusing what part of Japanese-style glyphs is due to language or country, but it is misleading to present it as a language issue, and use the worst case scenario (comparison with glyphs from mainland China) for illustration; many of the differences vanish if a comparison is made with more conservative-looking glyphs from Taiwan. 
[1] A xin-jiu zixing duizhao julie chart (select examples of comparison of new and old character glyph forms), from the 1979 PRC _Cihai_ dictionary: http://deall.ohio-state.edu/grads/chan.200/misc/cn_newold_glyphs.jpg Each of the four columns contains three subcolumns, where the leftmost subcolumn is the new form, the center subcolumn is the old form, and rightmost subcolumn gives examples of characters utilizing the new glyph forms. The circled number refers to the number of strokes. Among the examples is the fifth example in the third column from the left, of U+76F4 'straight/direct', and the thirteenth example in the same column, U+9AA8 'bone'. (It is true that some of these glyph reforms have been encoded separately as separate characters, although some cases are due to source separation.) Thomas Chan [EMAIL PROTECTED]
RE: The Currency Symbol of China
On Mon, 30 Sep 2002, Thomas Chan wrote: (Was U+56ED what you saw, James?--I don't have my Krause catalog by me at the moment, but I think it was present on older PRC coinage.) A correction to myself here--I thought I had seen U+56ED as a currency unit, but now I cannot find a reference in my notes, so I'm retracting this one. James Kass said: I don't blame you. According to Krause... One Dollar (Yuan) = 100 Cents (Fen/Hsien) = 1000 Cash (Wen/Ch'ien) = (=) 0.72 Tael (Liang) = 7 Mace and 2 Candareens ...and, that's just for starters. Well, the last part is a different system--mace and candareens are weight measures for silver coins as part of the tael system: liang/qian/fei/li (tael/mace/candareen/?). Hence, there are three systems: dollar, cash, and tael. The 1/100th units fen and xian (hsien in Krause) are part of different systems: yuan-jiao-fen in the north, and yuan-hao-xian in the south. (xian U+4ED9 English 'cent', even in Macau, where 1/100th of a pataca is an avo.) The northern and southern systems may be seen residually in contemporary Hong Kong and Macau, and historically during the early 20th century during a period of provincial minting in mainland China, where people used their local terminology on their coins, with the exception of the 1.0 unit. The situation is similar for the 1/10th unit; jiao in the north and hao in the south. Marco Cimarosti said: U+5143 U+5186 U+5706 U+570E U+5713 Thank you for finding these--I didn't realize that U+570E was encoded independently of U+5713, and not as a font variant of the latter. (And I had forgotten the obvious U+5713 ~ U+5706 connection.) I checked Krause--U+5713 may be seen on pre-war Japanese coinage for yen. Alan Wood said: I have added all of the symbols from this discussion to the second table on my page at: http://www.alanwood.net/unicode/currency_symbols.html Please remove U+56ED--that was my mistake. 
U+6587 is not entirely appropriate there--while it was a currency unit (approx. 1/1000th yuan), it was gone in all regions by the early 1930s, and now it is just (at least) a colloquial Cantonese synonym for yuan, sort of like northern kuai4 U+584A/U+5757 'piece'. I can provide you with a bunch of other terms for 1/10th and 1/100th units, but once one steps into the realm of Han characters, one is no longer dealing with symbols but words, and the list can inflate very quickly unless restrictions are set, such as primary currency units (not 1/10th or 1/100th units) in contemporary use (not historical) that appear on currency (not other terms like bucks, benjamins, etc). U+5713 I wouldn't list as yen/yuan variant--it should be on the same level as U+5143 and U+5186, as U+5713 (Yuan) is the unit used in Taiwan and Hong Kong on the currency (despite being dollars in English). Thomas Chan [EMAIL PROTECTED]
RE: The Currency Symbol of China
Lots of confusion. I don't know the origin of yen for the Japanese currency, aside from hearing that it was the way it was spelled (perhaps in Hepburn's dictionary) and adopted as such in English, and that the source of the ye might have either historical and/or regional pronunciation--i.e., not a phonemic difference distinct from e. (Corrections appreciated here.) I do have the following (other) remarks on the characters that some have brought up, though: Marco Cimarosti wrote: Similarly, yen is just the Japanese (kun) pronunciation of Chinese yuan. Stefan Persson wrote: Yen is U+5186, while yuan is U+5143. Yen is an ancient on pronunciation for U+5186; today it's pronounced en. James Kass wrote: How about U+5143 ? (smile) Looking at pictures of Chinese coins in the Krause catalog, some coins used an ideograph other than U+5143, but a quick search of CJK BMP ranges didn't find it. (Doesn't mean it's not there.) This other character looks like rad. 31 surrounding stacked rads 30, 72, and 9. (The pictures are a bit fuzzy, though.) The Japanese currency may be U+5186 today, but that is just a simplification of U+5713. Chinese took a different path of simplification and variants, including U+56ED and today's (PRC) U+5143. (The Korean won currency is of the same etymology, though not U+571C hwan, although the theme of a circular object--rounds?--is still present.) (Was U+56ED what you saw, James?--I don't have my Krause catalog by me at the moment, but I think it was present on older PRC coinage.) I wouldn't seriously advocate U+5143 :) --that is a word and not a symbol, cf., $100 vs. 100 dollars--the symbol is prefixed out of speech order, but the word is suffixed per pronunciation. 
But if we are to get into writing out currency amounts in longhand words, there is at least also U+6587, formerly approximately 1/1000th of a yuan, but now promoted to equal status as the yuan in Cantonese-speaking areas unofficially (i.e., it appears on price tags, but not money). This man is also ridiculously written U+868A 'mosquito'. (I'm not going to get into 1/10th and 1/100th units at this time.) Thomas Chan [EMAIL PROTECTED]
Re: [CHN] 2 geography questions
On Mon, 30 Sep 2002, villafea wrote: my self-induced geography study is an ongoing disaster. can someone please give me the tones or characters for _Lushun_, before i go insane? Port Arthur? Luu3shun4 旅順. also, an old map seems to refer to a _Taku_ in Hebei. is there now a modern name for Taku? the only gloss i can find refers instead to a municipality in Taiwan. You mean Taku as in the Taku forts? It's Da4gu1 大沽. Thomas Chan [EMAIL PROTECTED]
Re: glyph selection for Unicode in browsers
On Thu, 26 Sep 2002 [EMAIL PROTECTED] wrote: Tex Texin wrote, Given the (un)workable approach, do you then intend to have variants of code2000 for CJKT, so one can make the appropriate assignments? (ugh!) Code2000's coverage of CJKTV ideographs isn't adequate to support any language yet. Eventually and hopefully the repertoire will be completed. Given the current ceiling of 65536 max glyphs per font, it might not be feasible to try to have one font cover all scripts and variants, but time will tell. I don't mean to detract from the point of this discussion, nor to criticize a particular font, but I think the Han glyphs in Code2000 are aesthetically disappointing in that they are distorted enough (shape, proportions, and positioning) that they differ from any typical CJK font more than two comparable CJK fonts differ from each other due to language/country glyph preferences. Compare, for instance, with other sans serif CJK fonts like Arial Unicode MS, (cn) MS Hei, or (ja) MS Gothic. But changing the example to fonts like Arial Unicode MS doesn't completely solve everything--a sans serif font is not the norm for non-trivial quantities of CJK text (compare any book or newspaper). These problems would cause rejection of a font faster than adverse reactions to foreign/unfamiliar glyph designs. (The aging serifed Bitstream Cyberbit font might be a better example in this respect.) Thomas Chan [EMAIL PROTECTED]
RE: Latin vowels?
On Mon, 9 Sep 2002, Marco Cimarosti wrote: Mark Davis wrote: 4. List Nonvowels - ambiguous letters that are probably vowels: U+0059 # (Y) LATIN CAPITAL LETTER Y U+0079 # (y) LATIN SMALL LETTER Y I would consider all these as vowels, although I know there is much room for errors: - Y is historically a vowel, and it still is mainly a vowel in all languages using it (including English and French: système, quickly). In English and French, however, it can be a consonant (e.g., yes). In orthographies derived from English-based romanizations (e.g., Pinyin), it is always a consonant. This vowel vs. consonant distinction is really unsatisfyingly simplistic. It sounds like the (US) grade-school list of vowels: a, e, i, o, u ... and sometimes y. About Pinyin, some sources would disagree and set up a zero initial, so that (initial) y is just a way to write i [i] (clearly a vowel) occurring at the beginning of syllables. I know you meant to give an example of y used as a glide or approximant (which laymen would consider a consonant), and there are surely better examples of it, but we can't always judge the origin or inspirations of romanization systems, either (Pinyin comes up again--some people are still unconvinced that it is free of Cyrillic or Albanian influences). Thomas Chan [EMAIL PROTECTED]
Re: Strange resemblances and weird sisters
On Wed, 10 Jul 2002, Kenneth Whistler wrote: Then in Extension B there are many, many weird and wonderful candidates for strangest CJK characters. Some of my personal favorites include: Looks like they are all attempts to create modern reflexes of characters that never made it past the seal script stage ~2,000 years ago. U+26B99 U+20137 Neither of these two is in the _Kangxi Zidian_ or _Hanyu Da Zidian_--what are they? They seem to be the fault? of CNS 11643. U+20572 Ancient form of U+96E8, yu3 'rain'. Looks like it, too! But U+96E8 is more concise. U+2069C Ancient form of U+20698, tao1, a basket for feeding cows. But this one seems like a redundancy--certainly, neighboring U+206A1 would be the closest attempt at creating a modern reflex that mimics the appearance of the ancient form, but also neighboring U+20698 looks more reasonable as a modern reflex that looks modern. U+2069C just seems like an in-between form that is neither here nor there, although _Hanyu Da Zidian_ does have it. It seems like critics prefer U+20698--both _Kangxi Zidian_ and _Hanyu Da Zidian_ include it, as well as a North Korean character set, and CNS 11643, whereas U+206A1 is backed only by CNS 11643. I wonder why these three were not unified--surely, they are just three attempts to convert a dead character into a modern form that are within the range of font or glyph variation. I think I know the answer is source separation, but that would not prevent U+2069C... U+2696E Ancient form of U+7232, wei2 'to make'; wei4 'on behalf of'. This doesn't look like it is the direct ancestor of U+7232 like the 'rain' case above, but an ancient form that lost out to the ancient form which was the ancestor of U+7232. With such genetic defects, one would have expected such characters to die out long ago, but Unicode has brought them back to life. Like U+2069C, which was born maybe 15-20 years ago... And of course, there is always the miraculous proliferation of turtles... 
;-) Look through the 'tiger' radical section of the SIP and there's a dozen or so used as signs and flag signals just by the _Tian Di Hui_ organization. I suppose some of them are technically logos... Mind sharing a list of those turtles? Thomas Chan [EMAIL PROTECTED]
Re: Phaistos in ConScript
At 20:48 -0400 2002-07-08, John Cowan wrote: Michael Everson scripsit: My point being that though Beijing and Hong Kong newspaper headlines might present LTR or RTL directionality without mirroring, this practice is rare or indeed unknown in Europe at 1700 BCE. In any event, the so-called RTL CJK is more like TTB-RTL with columns of size one. (That sentence deserves a jargon award of some sort.) That analysis may appear to be the case for headlines or signs, but it doesn't hold up for some multi-row photo captions or some movie subtitles. (I'd provide a scan, but the newspaper I was going to use has since switched from top-to-bottom to left-to-right. I should've kept some of the issues that had 911 embedded in rtl headlines, too.) Thomas Chan [EMAIL PROTECTED]
Re: Encoding of symbols and a lock/unlock pre-proposal
On Tue, 21 May 2002, William Overington wrote: Yes, I feel that it is worth putting forward a proposal for the open and closed padlock symbols, yet wonder if I may make mention that maybe the words should be unlocked and locked as adjectives rather than unlock and lock as imperative verbs. Surely, a padlock is either unlocked or locked, so that the symbols indicate the state in which a system now exists. This then raises the question as to whether there should be symbols for unlock and lock as imperative verbs, such that those symbols would indicate where to click so as to change from being in an unsecure state to a secure state or from being in a secure state to an unsecure state. This then gets into the fact that with a padlock one needs a key to unlock it but one does not need a key to lock it, yet using a key symbol to mean unlock would seem to go against the way that computer systems are organized in that a key might seem more naturally to mean lock, notwithstanding that one does not need a key to lock a padlock. I'm not a big fan of pictographs and prefer to see real writing, but as an alternative to a locked and unlocked padlock, isn't there also an intact key and a broken key as allographs? I think Netscape Navigator once used these. Thomas Chan [EMAIL PROTECTED]
Re: CJK Unified Ideographs Extension B
On Mon, 13 May 2002, William Overington wrote: I have been looking at the characters in the CJK Unified Ideographs Extension B document. These are the characters from U+020000 through to U+02A6DF, which, as I understand it, are the rarer CJK characters. I wonder if any of the people who read this list who understand the languages involved might please like to say what any one or two of these characters, of their choice, mean please, just as a matter of general cultural interest for people who see these characters in the Unicode specification and, though not themselves knowledgeable of the languages, find the characters interesting for their artistry and history. Culturally, the majority of them are really not that interesting. Here are ten random ones from Plane 2: U+224D3 is an ancient form of U+4F5C, zuo4 'to make'. U+22984 comes from a Vietnamese source--I don't have any info on it. U+230C4 is xin1; meaning unknown. U+24ECB is an erroneous form of U+765F, bie3 'shrivelled'. U+25BF6 is ku3, the name of a kind of bamboo. U+27028 is lie4 'movement of grass'. U+28966 is qi2 'sharp'. U+294DF is kan3, something having to do with a distorted or ugly head/face (I don't quite understand its definition.) U+2A1FD is a variant form of U+6B4E, tan4 'to sigh'. U+2A606 is xiu1; meaning unknown. Thomas Chan [EMAIL PROTECTED]
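As a rough illustration of the range being discussed, here is a minimal Python sketch that checks whether a code point falls inside the CJK Unified Ideographs Extension B block (U+20000 through U+2A6DF). The names `in_extension_b` and `SAMPLES` are made up for this example; whether the characters actually display depends on the fonts installed.

```python
# Block boundaries for CJK Unified Ideographs Extension B.
EXT_B_FIRST = 0x20000
EXT_B_LAST = 0x2A6DF


def in_extension_b(code_point: int) -> bool:
    """Return True if code_point lies inside the Extension B block."""
    return EXT_B_FIRST <= code_point <= EXT_B_LAST


# The ten sample code points listed in the message above.
SAMPLES = [0x224D3, 0x22984, 0x230C4, 0x24ECB, 0x25BF6,
           0x27028, 0x28966, 0x294DF, 0x2A1FD, 0x2A606]

if __name__ == "__main__":
    for cp in SAMPLES:
        # chr() yields the character itself; rendering needs a Plane 2 font.
        print(f"U+{cp:04X}", chr(cp), in_extension_b(cp))
```

The BMP characters the samples are compared against (e.g., U+4F5C) fall outside this range, so the same function returns False for them.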
on U+7384 (was Re: Synthetic scripts (was: Re: Private Use Agreementsand Unappr oved Characters))
On Sat, 16 Mar 2002, John H. Jenkins wrote: On Friday, March 15, 2002, at 07:39 PM, Thomas Chan wrote: Is this open to names written with taboo-avoiding forms of characters which omit strokes? e.g., U+7384 less the final stroke. Or are these unified with the normal forms? The UTC will consider at its next meeting a proposal for a IDEOGRAPHIC TABOO VARIATION INDICATOR for precisely this reason. Sorry. I just found U+248E5 as the four-stroke taboo-avoiding form of U+7384. (That disqualifies me there! :) I didn't expect to find it disunified.) The kIRGKangXi fields for both also suggests that they are not unified (although kAlternateKangXi and kKangXi sometimes say otherwise). However, it seems that the four- and five-stroke forms are unified and interchangeable when used as radicals for U+7385 .. U+7388, U+248E6 .. U+248E8--these characters are given in the _Kangxi Zidian_ with the four-stroke form radical, but kIRGKangXi maps them anyway. I guess it'd stink to have to encode dupes just for the sake of taboo. (Taboos aside, I find many cases of this elsewhere, where two characters are not unified in isolation, but apparently only one participates as a component in the formation of other characters.) And to think that U+248E5 could've been avoided if Kangxi was published post-Qing, or if a post-Qing corrected edition (i.e., taboos removed and orig. characters restored) had been used (I have no idea if such a thing exists, though). Thomas Chan [EMAIL PROTECTED]
Re: Character indices (was: Unicode Humor)
On Fri, 26 Apr 2002, ろ〇〇〇〇 ろ〇〇〇 wrote: In the Unicode 3.0 book, WHY ON EARTH are the Han digits (you know them) not listed directly with the other numerics? They are given their own category. (I have always wondered why the Han digit 1 (一) is not called HAN DIGIT ONE, etc.) How would you distinguish the common form U+4E00 from the ancient form U+5F0C, the Japanese anti-fraud form U+58F1, the Chinese anti-fraud form U+58F9, and the (historical) variant form U+2092A? Thomas Chan [EMAIL PROTECTED]
sources for plane 2 characters?
Hi all, I was looking at the plane 2 characters in the March 15, 2001 version of the unihan.txt file, and found five that did not have an IRG source: U+20957, U+221EC, U+22FDD, U+24FB9, and U+2A13A. (The last one, U+2A13A, however, has kIRGHanyuDaZidian and kIRGKangXi information showing that it can be found in those dictionaries. Still, shouldn't there be an IRG source for it?) Where are the first four from? Thanks, Thomas Chan [EMAIL PROTECTED]
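The kind of check described above can be sketched in a few lines of Python. This is only a sketch under assumptions: it assumes each data line has the form `U+XXXXX<TAB>kField<TAB>value`, that comment lines start with `#`, and that IRG source fields follow the `kIRG_...Source` naming pattern (which is why `kIRGKangXi` and `kIRGHanyuDaZidian` do not count as sources here). The function name and the sample lines are made up, and the sample values are placeholders, not real Unihan data.

```python
from collections import defaultdict


def code_points_without_irg_source(lines):
    """Return plane 2 code points that carry no kIRG_*Source field."""
    fields = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) < 2 or not parts[0].startswith("U+"):
            continue  # skip anything that is not a data line
        cp = int(parts[0][2:], 16)
        if 0x20000 <= cp <= 0x2FFFF:  # plane 2 only
            fields[cp].add(parts[1])
    return sorted(
        cp for cp, names in fields.items()
        if not any(n.startswith("kIRG_") and n.endswith("Source")
                   for n in names)
    )


# Made-up sample lines for illustration only (values are placeholders).
SAMPLE = [
    "# sample data, not real Unihan content",
    "U+20000\tkIRG_GSource\tPLACEHOLDER",
    "U+20000\tkIRGKangXi\tPLACEHOLDER",
    "U+2A13A\tkIRGKangXi\tPLACEHOLDER",
    "U+2A13A\tkIRGHanyuDaZidian\tPLACEHOLDER",
    "U+4E00\tkIRGKangXi\tPLACEHOLDER",
]

if __name__ == "__main__":
    for cp in code_points_without_irg_source(SAMPLE):
        print(f"U+{cp:04X}")
```

Note that a code point with no fields at all in the file would not show up in this output, since the sketch only sees code points that appear on at least one line.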
Re: Unicode Font Pros and Cons
On Sun, 31 Mar 2002, Jungshik Shin wrote: On Sat, 30 Mar 2002, Doug Ewell wrote: Maggie Yeung wrote: Can someone think of any other issues related to using Unicode font. I find it mildly annoying that Outlook Express picks a font on the basis of the encoding chosen for a given message. On this list and the IDN list, a message encoded as JIS or EUC-KR or BIG5 will likely be displayed in a different font from the one used to display Latin-1 or UTF-8 messages. Did you mean that different fonts are used to render US-ASCII part of messages depending on the encoding used in messages, ISO-2022-JP, EUC-KR, Big5, ISO-8859-X, UTF-8? If this is what was meant, then I find this annoying as well. Reading a thread in English (ASCII) and then, because someone includes a small bit of East Asian text and their client tags it as such (or if it's tagged as such anyway, even with only ASCII-representable text), the font suddenly changes from a nice proportional font to the monospaced and uneven kind (i.e., the letters float up and down) seen in East Asian fonts that looks worse than Courier. (I know about the fonts that have proportional Latin glyphs, but they're still ugly.) In a mixed language context (e.g., English/Chinese or English/Japanese), I don't mind as much, as I appreciate the added clarity of information, but if it's really just English-only text, then it's plain unattractive. Thomas Chan [EMAIL PROTECTED]
Re: The Arrogants and the sillies (RE: Euros and cents)
On Tue, 26 Mar 2002, Doug Ewell wrote: I'm surprised nobody took Dan the Silly Man to task on this one. English enjoys new words on-the-fly. What a pity Kanji on-the-fly is a taboo, at least on Unicode ;) I think these were meant as rhetorical questions, but I'll bite, particularly #3... Can you name a character encoding standard, anywhere in the world, invented by anybody -- government, industry consortium, private company, individual, kwijibo, ANYBODY -- that can do better in this regard than Unicode? Besides the giant 70K+ repertoire which reduces the likelihood of an unavailable character, there's always the PUA option. Some other competitors in the Han character area don't even have that (i.e., a gaiji area), instead forcing one to submit such characters for registration. Can you name a font technology that will support the display of these invented-on-the-fly Kanji? For that matter, can you invent a Kanji on the fly that cannot be represented (perhaps in a rather cumbersome way) with Ideographic Description Characters? Yes, it's possible but uncommon. Unlike some other character description schemes, IDS can only form characters by composition. e.g., there's no way to gut out everything except the right half of U+8BD1 (yi4 'to translate') and use the former right half as a component in describing another character (as of Unicode 2.1--I haven't checked later versions.) Such a component would need to be separately encoded for it to participate in an IDS. Sometimes such components are not independent characters, or they are rare independent characters that have been overlooked for encoding. In this particular example, when U+776A occurs as part of a character in unsimplified Chinese, then the simplified Chinese form would have U+776A converted into the component mentioned above by application of simplification rules (standing alone, U+776A is identical in simplified form). 
Find all the characters containing U+776A as a component and create the simplified forms by applying the rule--that'll generate plenty of characters that IDS's can't represent. Another case is a character for 'Marxism'--it is U+9A6C with the final stroke gutted out, and replaced with U+4E49 (Again, this example only checked to be true as of Unicode 2.1). There are also an almost negligible number of cases such as U+4E52 and U+4E53 (used to write ping1pang1 'ping pong') or U+5187 (used to write Cantonese mou5 'to not have', among other words), which are created by deleting single strokes from U+5175 and U+6709, respectively. A number of Vietnamese chu+~ no^m characters are also created in such fashion. This is at a level smaller than the components that IDS work on, and is really not a flaw of IDS. IDS's, unlike some other description schemes, also don't handle rotation--there are also an almost negligible number of cases where a character (or a component) is formed by rotating another 180 degrees, e.g., U+20114, which is U+4E88 rotated 180. However, this is so rare that it wouldn't be a productive IDC if it were to exist. IDS's also don't handle cases of ligaturing, e.g., U+21155 (xi3 'double happiness'), which is two U+559C side-by-side in origin. Distinguish from U+56CD of the same meaning as U+21155, where ligaturing doesn't take place. IDS's also don't handle cases of guwen 'ancient character', which are characters in pre-modern form that have been converted to modern form, e.g., U+20A30, a tortured character which is really the zhuan 'seal' form of U+5973 (nuu3 'woman; female') modernized. IDS's might handle it, but clumsily. Others such as U+20066 are just impossible with IDS's. However, this type is not likely to be created in this age, except as modernizations of ancient forms. 
Despite these counterexamples, IDS do handle the majority of unencoded Han characters, most of which are the left to right or above to below variety with respect to the particular IDC's used. Thomas Chan [EMAIL PROTECTED]
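To make the "composition only" point concrete, here is a minimal Python sketch of a well-formedness check for Ideographic Description Sequences in prefix notation. It covers only the twelve operators U+2FF0..U+2FFB (U+2FF2 and U+2FF3 take three operands, the others two) and ignores any length limits the standard places on a sequence; the function names are made up for this example. Every operator combines whole components, so there is nothing here that could express deleting a stroke or rotating a component.

```python
# Operand counts for the Ideographic Description Characters U+2FF0..U+2FFB:
# U+2FF2 and U+2FF3 take three operands, all the others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def _parse(seq: str, pos: int) -> int:
    """Parse one description starting at pos; return next position or -1."""
    if pos >= len(seq):
        return -1  # ran out of input while an operand was still expected
    ch = seq[pos]
    pos += 1
    # A non-operator character is a leaf (arity 0); an operator recurses.
    for _ in range(IDC_ARITY.get(ch, 0)):
        pos = _parse(seq, pos)
        if pos < 0:
            return -1
    return pos


def is_well_formed_ids(seq: str) -> bool:
    """True if seq is exactly one description, with nothing left over."""
    return _parse(seq, 0) == len(seq)


if __name__ == "__main__":
    # U+2FF0 (left to right) applied to U+5973 and U+5B50.
    print(is_well_formed_ids("\u2FF0\u5973\u5B50"))
    # Missing second operand.
    print(is_well_formed_ids("\u2FF0\u5973"))
```

A sequence such as U+2FF0 U+559C U+559C is well formed under this check, but as noted above it only describes two U+559C placed side by side, not the ligatured shape.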
Re: Talk about Unicode Myths...
On Wed, 20 Mar 2002, John Cowan wrote: Dan Kogai scripsit: And as for Chinese, how do you tell whether Traditional or Simplified is more appropriate? Traditional and Simplified characters are *not* unified in Unicode, (BTW, this should go on the list of Unicode Myths), so that would be up to the author, not the browser. That's true, but you still can't distinguish preferred glyph variants of mainland China from those of Taiwan (nor those from those of Japan, either). cf., the often-cited U+76F4 and U+9AA8. What is often claimed as a difference between Japanese and Chinese is misportrayed and/or misunderstood as a difference between Japanese language and Chinese language practices, but is really a difference in national glyph preferences, e.g., U+9AA8 is often said to be a case where Japanese and Chinese differ, but this is true only when comparing the glyph preference of Japan (not Japanese language) and mainland China, and false if compared to Taiwan. I think this might be what Dan was referring to. BTW, there are some people who would've preferred to have traditional and simplified Chinese characters unified, so that conversion may be performed by changing the font. e.g., Founder (sorry, don't have the url at the moment) has some fonts which come in J and F versions. The J versions are perfectly normal in being Unicode fonts with simplified Chinese characters. The F versions, however, are a different story--although they are also Unicode fonts, at the codepoint for a simplified Chinese one actually finds the glyph for its traditional analogue (in most cases). Thus, one can store master data in simplified Chinese, and generate what appears to be traditional Chinese by changing the font (and making a few minor edits because of lack of 1-to-1 correspondence). e.g., type U+56FD with the J font, change the font to the F version and you see the glyph for U+570B, but it's still really U+56FD. 
Certainly, this sort of thing can't help improve understanding that simplified and traditional Chinese characters are not unified in Unicode. Thomas Chan [EMAIL PROTECTED]
Re: 31 Angry Watanabes (or the Itaiji problem)
On Mon, 18 Mar 2002, Dan Kogai wrote: However, if one is to pick over little details, then I still don't know what U+5F3E is (in the context of Dan's name)--does the upper right corner have two or three strokes? Three. That's the only official 'Dan' with 'Bow' and 'Single' in Modern Japanese. Two stroke form is for Simplified Chinese and two mouths Traditional. You're right. I was mistaken in believing that U+5F39 (two dot version) and U+5F3E (three dot version) were unified. Thomas Chan [EMAIL PROTECTED]
Re: Synthetic scripts
On Sun, 17 Mar 2002, Andy Heninger wrote: From: Miikka-Markus Alhonen Stefan Persson replied Can you prove that this doesn't apply to any of the scripts already in the Standard? No, you can't, as it is not known under which circumstances Latin, Greek, Kanji, etc., were created. What about a script that was invented by one person with the principal intention of representing an artificially constructed language? Tighten up the definition of an artificially constructed language to be one that has never had native speakers, and you're there. Separate the evolution of the spoken language from the evolution of the script. That sounds better, but that definition of artificially constructed language would still include some planned languages and artificial standard versions of languages. Thomas Chan [EMAIL PROTECTED]
Re: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)
On Fri, 15 Mar 2002, Jungshik Shin wrote: On Sat, 16 Mar 2002, Dan Kogai wrote: (*) My parents wanted to name me 彈 (U+5F48), a classical form, but it was not listed on the table of Kanjis allowed for names so I was named U+5F3E. Frankly speaking, I find it rather hard to understand what difference there is between using U+5F48 and using U+5F3E in spelling your name. They're the same character with the same meaning but with a bit of variation in shape. However, I should be careful because this is about one's name. This particular case in a Chinese context wouldn't be respected. However, if one is to pick over little details, then I still don't know what U+5F3E is (in the context of Dan's name)--does the upper right corner have two or three strokes? Thomas Chan [EMAIL PROTECTED]
Re: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)
On Fri, 15 Mar 2002, John H. Jenkins wrote: On Friday, March 15, 2002, at 11:38 AM, Dan Kogai wrote: There are so many Watanabe-sans, Saito-san, and others whose name cannot be spelled in Unicode. Can you document this? You know, there's a prize offered for the first person to document the existence of someone whose Japanese name cannot be represented by Unicode 3.2. Is this open to names written with taboo-avoiding forms of characters which omit strokes? e.g., U+7384 less the final stroke. Or are these unified with the normal forms? (I'm aware that not all taboo-avoidance works by stroke deletion.) The Kangxi emperor (r. 1662-1722) writes the xuan of his personal name, Xuanye, with U+7384, but everyone else has to use a similar form, but less the final stroke--see p. 725 of the edition of the _Kangxi Zidian_ that the IRG uses. (This taboo-avoidance has apparently extended to other characters that utilized U+7384 as a component.) Of course, my example here is not a Japanese one... Thomas Chan [EMAIL PROTECTED]
Re: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)
On Fri, 15 Mar 2002, Kenneth Whistler wrote: Ben Monroe wrote: As it is a personal spelling, I never expected Unicode to map a code point to this character to me. For those not following the Japanese in the UTF-8, Ben's name is Monryuu Ben in kanji. This is a sound-based name coinage for an English name. Mon 'gate' ryuu ~ ryoo 'dragon'. (Sorry, but I can't tell just from the kanji just exactly what pronunciation you would use.) And sticking the dragon inside the gate, which is then structured like a radical, is creating a phonological rebus that departs from the ordinary way that radical + phonological component characters are constructed. No wonder your teacher marvelled at how to pronounce it -- Han characters aren't constructed by putting two syllables together in one character to create a disyllabic pronunciation. I thought it was a case of an obscure personal name character, until I saw the connection to Monroe. There are a small number of these with polysyllabic Chinese or Sino-Xenic readings where the reading is a concatenation of the readings of its components, such as U+55E7 jia1lun2 'gallon' (U+52A0 U+4F96), which are really ligatures. Some like U+337B heisei (Japanese era name) (U+5E73 U+6210) are easy to recognize, but it would be easy for characters like the Monroe one to slip through in the absence of information about it. This is different from another small set where a polysyllabic Chinese/Sino-Xenic reading is not a concatenation of the readings of its components, such as U+544E ying1chi3 'English foot' (U+5C3A chi3 'foot' plus a 'mouth' radical--indicating a semantic connection/modification?). 
However, not everything that looks like a ligature really is, such as U+6B6A wai1 'crooked' appears to spell out the phrase bu4 zheng4 'not straight' (U+4E0D U+6B63); U+81AD Cantonese chun 'animal egg' appears to spell out the phrase mei sing yuk 'not yet become flesh' (U+672A U+6210 U+8089); or U+7526 su1 'to revive' appears to spell out a synonymous word geng4sheng1 'to revive' (U+66F4 U+751F). Should I really have any reason to expect Unicode to deal with this? Nope. Any more than it should deal with the fanciful but ubiquitous good luck coinages like the shuang1xi3 'double happiness' character. U+56CD and U+21155--the latter seems to be more common on printed matter. Almost no Chinese dictionaries include them, but Korean ones seem to. However, I think this case may have made the jump from ligature to independent character, as it has acquired a monosyllabic xi3 reading, and appears in the title of at least two movies (rather than appearing independently as decoration). But more examples of this class can be found at the shinji page[1] of Kanji no shashin jiten[2] (content is in Japanese). Apparently Mojikyou has been okay with encoding some of them. [1] http://homepage2.nifty.com/Gat_Tin/kanji/sinji.htm [2] http://homepage2.nifty.com/Gat_Tin/kanji/kaindex.htm Thomas Chan [EMAIL PROTECTED]
Re: This spoofing and security thread
On Tue, 12 Feb 2002, Michael Everson wrote: At 18:37 + 2002-02-11, Juliusz Chroboczek wrote: - a cross-reference of characters whose associated glyphs are identical, whatever the font (applies to symbols and ``modifier letters''); But the letter b isn't identical from font to font in Latin. (piggybacking on your message, Michael) Nor U+0061 LATIN SMALL LETTER A, which sometimes looks like U+0251 LATIN SMALL LETTER ALPHA, and sometimes doesn't. - a cross-reference of characters whose associated glyphs could be confused by a non-technical user; Out of the entire standard? Who's going to do that for free? :-) And where would the data come from? :/ A turned R (similar to U+042F CYRILLIC CAPITAL LETTER YA) is sometimes used whimsically in place of U+0052 LATIN CAPITAL LETTER R to 1) imitate children's handwriting mistakes, e.g., in the logo of the toy store Toys R Us (http://www.toyrsus.com/), or to 2) imitate Russian, e.g., the Tetris logo. As with Han characters, it's not only what looks similar, but also what's considered similar... Just yesterday, I saw some product packaging where a grave accent was used where an acute accent was meant--no doubt, the error resulting from (US) English speakers' general unfamiliarity with diacritics (and who'd consider them to be optional adornments such that e = e acute = e grave). Thomas Chan [EMAIL PROTECTED]
Re: Phonetic grouping in UniHan
On Mon, 4 Feb 2002, Marco Cimarosti wrote: I also take the occasion to suggest a new field that could be very useful: the frequency of usage of each character. This information may be derived from good on-line sources. E.g., for Chinese, from Chi-Ho Tsai's research (http://www.geocities.com/hao510/charfreq/) and, for Japanese, from the KanjiDic database, (http://www.csse.monash.edu.au/~jwb/kanjidic_doc.html). (I don't know the licensing terms for using these data.) I think whatever frequency data is included, the particulars of how they were arrived at (or where to find such information) should be included, e.g., Tsai's findings were based on 1993-1994 Big5 Usenet postings. There's also frequency data buried under the kFenn field (as yet unpopulated), where A, B, C, D, E, F, G, H, I, K (J is omitted) indicates if it falls in the first, second, third, etc group of five hundred characters, based on earliness of occurrence in the textbooks of 1926. (The P code is also used for something that is not quite clear to me from the explanation in the dictionary alone--I presume it might refer to characters in the dictionary that were not in the 1926 study.) P.S. Recently you asked about estimates of usage of Plane 2 characters--since a large percentage are CNS 11643-1992 characters (and perhaps the oldest IT source), that may provide a clue. In the Concluding Remarks section of Christian Wittern's Taming the Masses[1], the higher CNS planes (ignore 1 and 2, which are in the BMP, and perhaps some parts of 3) are rarely used in historic texts, and he expects even lower usage in modern texts. [1] http://www.gwdg.de/~cwitter/cw/taming.html Thomas Chan [EMAIL PROTECTED]
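The letter scheme described above for the frequency groups can be written out as a small Python sketch. It only handles the single group letter; how that letter is embedded in an actual kFenn field value is not covered here, and the function name is made up for this example. The P code is deliberately left unmapped, since its meaning is left open above.

```python
# Letters used for the frequency groups; J is skipped.
FENN_GROUP_LETTERS = "ABCDEFGHIK"
GROUP_SIZE = 500


def fenn_rank_range(letter: str):
    """Return (first_rank, last_rank) for a group letter, or None.

    None is returned for anything else, including P, whose meaning
    is not settled in the message above.
    """
    if len(letter) != 1:
        return None
    index = FENN_GROUP_LETTERS.find(letter)
    if index < 0:
        return None
    first = index * GROUP_SIZE + 1
    return (first, first + GROUP_SIZE - 1)


if __name__ == "__main__":
    for code in "ABKJP":
        print(code, fenn_rank_range(code))
```

Under this reading, the ten letters together cover the first five thousand characters of the 1926 textbook study.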
Re: TC/SC mapping
On Wed, 23 Jan 2002, John H. Jenkins wrote: Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being different? The only dictionary I have which contains both is the (traditional) CiHai, it and it claims they're variants of each other. Belated, but a little more on these two. Annex T: Procedure for the Unification and Arrangement of CJK Ideographs[1] (AMD8 of ISO/IEC 10646-1:1993) at the very end of section T.3, gives this pair as an example of unification blocked by source separation (a T source is the culprit). [1] ftp://ftp.cse.cuhk.edu.hk/pub/irg/AnnexT.rtf Thomas Chan [EMAIL PROTECTED]
Re: Wade - Pinyin transliteration (Unihan ?)
On Thu, 24 Jan 2002, Patrick Andries wrote: John Cowan wrote: Patrick Andries scripsit: Let's assume I want to transliterate a large Wade-Giles database into pinyin. Is this a purely algorithmic process? For all nouns? Common and proper (cf. Chiang Kai-Shek vs Jiang Jieshi)? Even for dialectal words? Chiang Kai-Shek isn't Wade-Giles; it isn't even Mandarin. I did mention dialectal forms (I believe final -k no longer occurs in Mandarin), I just wondered whether I would find such nouns (proper or common) in a dictionary edited in Taiwan. I asked because I could see no algorithmic way of converting this name using traditional Wade to Pinyin tables. Incidentally, if this is not Wade-Giles applied to a dialectal pronunciation, what is it? Genuinely interested. It should be noted that Wade-Giles is commonly misused as a cover term for many old, ad hoc, non-Mandarin-based, or non-Pinyin romanization systems. Chiang Kai-shek is a mixture of what looks like Wade-Giles (surname CHIANG) and some kind of archaic romanization based on Cantonese (given name Kai-shek). For placenames, there are many postal romanizations that are often erroneously considered to be Wade-Giles, e.g., the city Nanking (postal)/Nan-ching (Wade-Giles)/Nanjing (Pinyin). In any case, one should also beware of degenerate Wade-Giles forms where details such as apostrophes (denoting aspiration) are omitted, e.g., the city Changchun (degenerate Wade-Giles)/Ch'ang-ch'un (Wade-Giles)/Changchun (Pinyin). If Changchun were accepted as proper Wade-Giles input, then a corrupt *Zhangzhun pinyin form would be generated. Thomas Chan [EMAIL PROTECTED]
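The apostrophe problem can be shown with a tiny table-driven converter in Python. This is a sketch only: the table holds just the handful of toneless syllables needed for the examples above (a full Wade-Giles table has several hundred), input is assumed to be hyphen-separated, and the function name is made up for this example.

```python
# Tiny illustrative syllable table (toneless, far from complete).
WG_TO_PINYIN = {
    "ch'ang": "chang",   # aspirated: the apostrophe is significant
    "chang": "zhang",    # unaspirated
    "ch'un": "chun",
    "chun": "zhun",
    "nan": "nan",
    "ching": "jing",
    "chiang": "jiang",
}


def wade_giles_to_pinyin(name: str) -> str:
    """Convert a hyphen-separated Wade-Giles name by table lookup.

    Raises KeyError for any syllable that is not in the table.
    """
    syllables = name.lower().split("-")
    pinyin = "".join(WG_TO_PINYIN[s] for s in syllables)
    return pinyin.capitalize()


if __name__ == "__main__":
    print(wade_giles_to_pinyin("Ch'ang-ch'un"))  # proper Wade-Giles input
    print(wade_giles_to_pinyin("Chang-chun"))    # apostrophes dropped
    print(wade_giles_to_pinyin("Nan-ching"))
```

With the apostrophes intact the lookup yields Changchun; with them dropped, the same table yields the corrupt Zhangzhun. A syllable such as shek is simply absent from any Wade-Giles table, so a lookup-based converter fails on it rather than converting it.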
Re: TC/SC mapping
On Thu, 24 Jan 2002, John H. Jenkins wrote: However, this is already a problem in Unicode. shuowen.org will have to register both U+8AAAU+6587.org and U+8AACU+6587.org; Jingwa, Inc., will need both U+4E3CU+86D9 and U+4E95U+86D9. U+8AAA and U+8AAC are given on p. 265 of TUS3.0 as an example of what would have been unified had it not been for source separation. Is it possible to acquire data on other z-variants? The kZVariant fields do not seem to contain exactly that data. Had that example not been pointed out, I wouldn't have known that both were encoded. Thomas Chan [EMAIL PROTECTED]
Re: TC/SC mapping
On Wed, 23 Jan 2002, John H. Jenkins wrote: On Wednesday, January 23, 2002, at 09:05 AM, Thomas Chan wrote: In other words, yao1 'small'TC U+4E48 or U+5E7A - SC U+4E48 me (as in shen2me 'what') TC U+9EBC or U+9EBD - SC U+4E48 mo2 (as in yao1mo2 'insignificant') TC U+9EBC or U+9EBD - SC U+9EBD Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being different? The only dictionary I have which contains both is the (traditional) CiHai, it and it claims they're variants of each other. Well, first, the Jianhuazi Zongbiao that defines the PRC simplifications juxtaposes U+9EBD and U+4E48 for the me pronunciation of the former (non-me usage of the former are not simplified); U+9EBC is not mentioned. In the PRC's _Ci Hai_ from 1979 (the third dictionary to bear that name), U+9EBC is a pointer to U+9EBD for all usages of U+9EBD. In the _Hanyu Da Zidian_ (PRC, 1986), U+9EBD has the following usages: 1) mo2 'small' 2) ma2 of gan4ma2 'what for'. (It says that nowadays this particular ma2 is written U+55CE.) 3.1) ma, a particle, which can sometimes be written U+55CE. 3.2) ma, a particle, which can sometimes be written U+561B. 4) me of zhe4me 'so; like this'; also used as padding in songs. However, for U+9EBC, it says it is the same as U+9EBD, but the only examples given have the 'small' meaning, including one from the _Shuowen Jiezi_ (China, AD 100) that says that U+9EBD is a vulgar (su2) form of U+9EBC. Apparently, U+9EBC is the more orthodox version as far as mo2 'small' is concerned, but U+9EBD has become more common, including becoming used to write various modern/colloquial words. I would revise the mapping as follows: me (as in shen2me 'what')TC U+9EBD - SC U+4E48 mo2 (as in yao1mo2 'insignificant') TC U+9EBC - TC U+9EBD - SC U+9EBD I think the choice whether to regard U+9EBC and U+9EBD as different or not depends on the application. I would lean towards treating them as the same. Meanwhile, both Sanseido and KangXi say that U+5C1B (尛) is a member of the family. 
(KangXi says that anciently U+9EBC (麼) was written U+5C1B (尛) . Mathews and Sanseido also remind us that U+5E85 (庅) is another variant, and Sanseido *also* lists U+5692 (嚒). In the _Hanyu Da Zidian_, U+5C1B points to U+9EBC. (I see on the same page that U+21B6F also points to U+9EBC, and the _Hanyu Da Zidian_ is citing this pointer from the same source.) It doesn't say, but I would presume these refer only to the original mo2 'usage', given the age of the cited source, _Longkan Shoujian_ (China, AD 997), and the composition of U+5C1B (three 'smalls') and U+21B6F ('three' + 'small'). U+5E85 is understandable as an abbreviated form of U+9EBD, and I'll add that it's also documented in Samuel Wells Williams' 1874 dictionary (pushes back the usage given in Mathews by at least half a century). U+5692 seems understandable--it is just U+9EBC with a mouth radical tacked on--I presume this is only for the modern/colloquial me usages, and not mo2 'small'. (I wouldn't be surprised if somewhere there is attested a U+9EBD with a mouth radical tacked on.) I would further revise the (partial) mapping as follows: me (as in shen2me 'what'): TC U+9EBC - TC U+9EBD - TC U+5E85 - SC U+4E48 TC U+9EBC - TC U+5692 mo2 (as in yao1mo2 'insignificant'): TC U+9EBC - TC U+9EBD - SC U+9EBD And this is not finished, yet! The _Hanyu Da Zidian_ also lists some other variant forms of U+9EBD--I suspect they are probably all/mostly for the mo2 'small' usage. I should point out that the _Hanyu Da Zidian_ is in no way the final word despite its comprehensiveness, e.g., U+5E85 and U+5692 are not included in it. So, Doug, you see that U+4E48 (么) could conceivably be a traditional character in its own right *or* the simplified form for no fewer than six (!) other ideographs. This is the kind of mess that has discouraged anybody from doing a systematic survey of simplifications for the Unihan database. Part of this is because there is the orthogonal complexity of variant TC forms. 
Before converting TC to SC, one should resolve all TC variants to the most common or standard TC form (good luck deciding what that means). e.g., in the above case, resolve to U+9EBD. I think we are also complicating things by treating the entire process of variants and simplifications as operating solely on the orthography (cf., upper and lower case); in some cases, it is simpler to conceptualize it as the spelling of words being changed. The other example (U+8721 kTraditionalVariant U+8721 U+881F) is a mistake--the TraditionalVariant should only be U+881F. Actually, no. Both KangXi and the Cihai list U+8721 (蜡) as a traditional character in its own right, although I assume it's rare as I can't find it in my other dictionaries. You're right. The presence of U+8721 in Big5 should have been a preliminary hint to me
Re: Fun with UDCs in Shift-JIS
On Thu, 17 Jan 2002, Thomas Chan wrote: - NTT-DoCoMo pictographs[1] in webpages for cell phones [1] http://www.nttdocomo.co.jp/tag/emoji/ (Shift-JIS 0xF89F to 0xF971) It was pointed out to me that the above URL didn't work. It should have read: http://www.nttdocomo.co.jp/i/tag/emoji/ . Thomas Chan [EMAIL PROTECTED]
Re: GBK Traditional to Simplified mapping table
On Thu, 10 Jan 2002, Ken Krugler wrote: I've got GBK-encoded text that contains a number of Traditional Hanzi characters. I'd like to convert all of these to their Simplified equivalents. So does anybody know of a GBK table that maps each Traditional form to its Simplified form? If converting to simplified equivalents means reducing the text so that it can be representable in GB2312, then I'd recommend: 1) If the GBK character is in GB2312, keep it as-is. 2) Otherwise, convert to Big5 using Unicode as an intermediary. Take the characters that converted to Big5 successfully and use one of those many Big5-GB2312 converters as suggested by Frank Tang, which will perform the traditional-simplified conversion. 3) If there are any characters that weren't handled by step #2 (e.g., traditional Chinese characters not in Big5[1]; traditional Chinese characters in Big5 but not treated by most Big5-GB2312 converters[2]; non-Chinese characters used in Japanese[3]/Korean since the source text *is* GBK), then turning them and the surrounding context over to a human with access to a number of good dictionaries would probably be the best way to (hopefully) find a best fit within the circumstances (e.g., if it happens to be a variant of a character that is in GB2312[4]). If even that fails, perhaps the character in question can be described graphically a la A+B[5] or the text in question rewritten[6]. [1] e.g., U+5700 (GBK 0x87F3) is a variant form of guo2 'country' that is not in Big5, but one can substitute U+56FD (GB2312 0xB9FA), the form of guo2 'country' used in simplified Chinese. [2] e.g., U+5187 (GBK 0x83D3) is in Big5, used primarily to write mou 'not' in Cantonese (but other meanings also exist), but I haven't seen a converter to GB2312 yet that'll substitute U+65E0 (GB2312 0xCEDE), a near-synonym and etymologically-related character. [3] e.g., U+7A93 (GBK 0xB799) is a Japanese form of chuang1 'window', but one can substitute U+7A97 (GB2312 0xB4B0). 
[4] See [1], [2], [3]. [5] i.e., as the combination of its components. [6] e.g., U+72C6 (GBK 0xA0F0) occurs in Big5, and in most Chinese texts encountered it means 'Japanese spaniel dog; Japanese Chin' (and not a pejorative ethnonym), which will have to be rewritten to whatever phrasing that dog breed goes under in GB2312 simplified Chinese texts. Thomas Chan [EMAIL PROTECTED]
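The three steps above can be sketched roughly as follows. This is a minimal sketch rather than a tested converter: the names `encodable`, `triage`, and `t2s` are hypothetical, the `t2s` dict merely stands in for whichever Big5-GB2312 converter is used in step 2, and Python's built-in `gb2312` and `big5` codecs are assumed to be acceptable stand-ins for the character sets discussed.

```python
def encodable(ch, codec):
    """Return True if the single character `ch` exists in `codec`."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def triage(text, t2s):
    """Apply steps 1-3 to `text`; return (converted text, leftovers).

    `t2s` is a hypothetical traditional-to-simplified dict standing in for
    a Big5-GB2312 converter. Leftovers are (index, character) pairs that
    need a human with dictionaries (step 3).
    """
    out, leftovers = [], []
    for i, ch in enumerate(text):
        if encodable(ch, "gb2312"):           # step 1: already in GB2312
            out.append(ch)
            continue
        # step 2: only characters that survive the trip to Big5 are
        # handed to the converter table
        simplified = t2s.get(ch) if encodable(ch, "big5") else None
        if simplified is not None and encodable(simplified, "gb2312"):
            out.append(simplified)
        else:                                  # step 3: flag for review
            out.append(ch)
            leftovers.append((i, ch))
    return "".join(out), leftovers
```

GBK input would first be turned into text with `data.decode("gbk")`. The leftovers list is deliberately returned rather than silently dropped, since step 3 is a judgment call made with the surrounding context in view.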
Re: Bird headed CJK variants?
Name: Yael email: [EMAIL PROTECTED] #1, Posted Dec 26th 02001 10:59:33 AM I am looking for online resources that have graphical samples of the unusual Chinese script that is called bird script because all lines end in little bird heads. It is from the Han dynasty. May somebody help me? niaozhuan \u9ce5\u7bc6 'bird seal' is a decorative seal script typeface. Here are two samples of what is supposed to be the seal of Qin Shihuang \u79e6\u59cb\u7687 (r. 221-210 B.C.), the first Chinese emperor: http://deall.ohio-state.edu/grads/chan.200/misc/niaozhuan-qinshihuang_seal-1.jpg from p. 30 of GUO Bingguang's \u90ed\u51b0\u5149 _Zhuanke rumen_ \u7bc6\u523b\u5165\u9580 [Introduction to Seal-carving] (Hong Kong: Mingtian \u660e\u5929, 1989). http://deall.ohio-state.edu/grads/chan.200/misc/niaozhuan-qinshihuang_seal-2.jpg from p. 176 of R.W.L. Guisso's _The First Emperor of China_ (New York: Carol Publishing Group, 1989). The seal consists of four columns of two characters each, read top-to-bottom, right-to-left:
7 5 3 1
8 6 4 2
It's supposed to be: \u53d7\u547d\u65bc\u5929\u65e2\u58fd\u6c38\u660c . However, notice that there are a number of differences between the two samples!--I do not know the reason for this. 'bird seal' and other decorative and/or imaginary scripts (mostly just font variants) are also covered in Knud LUNDBAEK's _The Traditional History of the Chinese Script from a Seventeenth Century Jesuit Manuscript_ (Aarhus, Denmark: Aarhus University Press, 1988). Thomas Chan [EMAIL PROTECTED]
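The reading order described for the seal (columns of two characters, top-to-bottom, columns taken right-to-left) can be illustrated with a small sketch. The function name `layout_columns` is hypothetical, and the sketch assumes the text fills every column completely.

```python
def layout_columns(text, rows):
    """Return the rows of a grid holding `text` in vertical columns.

    Columns are `rows` characters tall and are read right-to-left, so the
    first column of the reading order ends up at the right edge.
    """
    if len(text) % rows != 0:
        raise ValueError("text must fill every column completely")
    # Cut the reading-order string into columns, then reverse them so the
    # first-read column is rightmost.
    columns = [text[i:i + rows] for i in range(0, len(text), rows)]
    columns.reverse()
    return ["".join(col[r] for col in columns) for r in range(rows)]


# The eight characters given in the message, in reading order.
SEAL_TEXT = "\u53d7\u547d\u65bc\u5929\u65e2\u58fd\u6c38\u660c"

for row in layout_columns(SEAL_TEXT, 2):
    print(row)
```

Calling `layout_columns("12345678", 2)` gives the same position grid as in the message, which is a quick way to check a transcription against a photograph of the seal.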
Re: Fact vs. fiction
On Sun, 6 Jan 2002, James Kass wrote: Michael Everson wrote, I was unable to find http://www.thelordoftherings.com and therefore could not see any Tengwar links. Is this the right address? Yes, it's right. Here's the direct link to their Tengwar links page: http://www.thelordoftherings.com/tengwar/ I can't find the beginning of this thread now, but that's not New Line Cinema's official site (lordoftherings.net). I spent a few minutes with whois and found a number of similar domain names, with or without preceding the, followed by an optional -movie or movie, and under .com, .net, or .org. That's even more confusing than the unicode.com imposter... Thomas Chan [EMAIL PROTECTED]
Re: Character display problem example
On Sat, 22 Dec 2001, Michael (michka) Kaplan wrote: (See my reply below--I'd like to retain Michael's ASCII art for purposes of illustration, hence the length.) Robert (11 digit boy) said: font is used to display Japanese or such. I think that there is a certain 5-stroke character that will answer it. It is U+5E73. Well, there is a difference here: Japanese/CHS version: -- \ | / \ | / \ | / +- | | Korean/CHT version: -- / | \ / | \ / | \ +- | | Although I suppose this could be font differences, too? Pseudo Verified on a WinXP system with the following fonts: Yes, there are simply font differences. The latter form, with the diagonal strokes arranged like / \, is the more canonical form, typically seen in printing when using the kinds of fonts that you tested with. However, the former form, with the diagonal strokes positioned like \ /, is more of a handwritten form, although you may see it in fonts that more resemble handwriting, like the brush-like kaishu(zh)/kaisho(ja) styles (which were not represented in a limited font survey). Both forms are fine in Traditional Chinese practice. PRC practice (i.e., Simplified Chinese) tends to have made even the printing forms resemble the handwritten form, although I do not doubt that a Simplified Chinese reader would accept the / \ form too. I won't presume to speak for Japanese and Koreans, but I suspect the two forms are interchangeable for them too (comments, please). In any case, note the last example in Table 10-4 Ideographs Unified in TUS3.0 p. 265 shows that the rotated strokes/dots are unified. I'd like to caution against the use of fonts to show national differences (or lack of them), not only because a font can only show one glyph (and does not account for scenarios where two interchangeable glyphs are acceptable), but also because such font surveys are often poorly controlled for variables. 
For example, in Michael's survey of Win XP fonts (I'm not criticizing you specifically, Michael, so I hope you do not take this personally) there were two font styles represented for most locales: the serifed Ming(zh)/Song(zh)/Mincho(ja) and sans serif Hei(zh)/Gothic(ja). However, additional styles such as the brush-like Kai are not represented, which sometimes will yield different conclusions, such as for the appearance of the lower left corner of U+5317 'north'. Second, in such font surveys, the fonts for each locale often come from different vendors. For instance, see section #4 of a webpage of samples I created for U+76F4 'straight'[1], whose URL I gave to Suzanne Topping (though not on this list). The MingLiU and AR Mingti2L Big5 fonts differ between vendors (Microsoft/Dynalab vs. Arphic), although they are for the same locale (Taiwan) and are the same style (the serifed Ming). [1] http://deall.ohio-state.edu/grads/chan.200/cjkv/u76f4/ Furthermore, I see there is a tendency for PRC font vendors to create fonts with completely wrong glyphs, i.e., the codepoint for the simplified form is populated with the glyph for the traditional form (simplified and traditional not being unified, remember). The idea is apparently so that a user can type in Simplified Chinese, and then produce a Traditional Chinese document by simply (and erroneously) changing the font. While none of these sorts of fonts are encountered here, they do exist out there, and would contaminate any font-based studies. Thomas Chan [EMAIL PROTECTED]
Re: Microsoft input method, 950, and Unicode mapping
On Tue, 18 Dec 2001, Tex Texin wrote: I am glad Sybase gave different character sets different names. There's a Big5-HKSCS tag[1]--is anyone using that? [1] http://www.iana.org/assignments/character-sets (see MIBenum 2101; I don't understand why it's in the vendor range, though) For that matter I wonder what a user in HK does when their Windows operating system is upgraded and their files that had HKSCS characters in the private use area now expect them in other locations. Or distinguishing between data in HKSCS, GCCS, pre-GCCS vendor extensions, and privately-created extensions, all of which can occupy the same encoding space. Too bad that GCCS and HKSCS first existed as government-anointed waizi/gaiji extensions, and were (and still are) implemented that way, rather than as part of a proper and separate character set. (I wish GCCS and HKSCS had proper numbers and dates to refer to them by--the names are really too similar, and easily garbled and confused. Recently I gave feedback on an article where HKSCS was used and discussed, but under the GCCS name and with arguments that were only true for GCCS.) Thomas Chan [EMAIL PROTECTED]
Re: Microsoft input method, 950, and Unicode mapping
On Tue, 18 Dec 2001, Kenneth Whistler wrote: And to add to the chaos and confusion, note that the HKSCS patch for Windows Code Page 950 does not map exactly the same as the HK Government mapping table. And that the HK And that's in addition to the confusion caused by the semi-official, semi-published precursor version, GCCS. I've got here a 2001 edition (post-publication of HKSCS) atlas+cdrom of Hong Kong which includes a GCCS (!) support add-on (and not the same as the one that used to be available from http://www.info.gov.hk/gccs/ before that URL became a redirect to the present HKSCS site). Thomas Chan [EMAIL PROTECTED]
Re: Ext-B fonts updated
On Wed, 17 Oct 2001, Asmus Freytag wrote: Even with a number of errata not yet corrected, the current font is a vast improvement over the previously used TTF font (which did not contain any glyphs at all for several hundred positions). In the meantime, John said it all when he wrote that the identity of the ideograph is given by its source mapping, not its glyph in the table. That works for those that have a character set source mapping (e.g., T- sources for higher CNS 11643 planes) or dictionary source page and serial number mappings (e.g., kHanyuDaZidian field), but what do we do about those that don't? For example, although I own a copy of the _Ci Hai_ dictionary (G-CH source), I can't tell which character in it that U+206C5 is meant to be. (No doubt this can be resolved with more cross-references, although that still leaves the problem of non-dictionary sources such as G-FZ/FZ_BK and G-4K.) Also, there's no way I can determine what U+20850 is, as it doesn't come from a real character set (K-4 source). Thomas Chan [EMAIL PROTECTED]
Re: [OT] o-circumflex
On Mon, 10 Sep 2001, てんどうりゅうじ wrote: If they can't agree on the pronunciation for these cities, can they agree on the Hanzi for them? What ARE the Hanzi for these cities, anyway?? Are you asking for the names of cities in Chinese? Copenhagen is ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839. The Han characters used to write the names of cities depend on many factors, including but not limited to source spelling/pronunciation, language/dialect of the rendering party, mapping rules used by the renderer, time period, etc. For example, New York is rendered in Chinese as Mandarin niu3yue4 \u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in Japanese it was at one time rendered as \u7d10\u80b2, lit. 'button-rearing'. Asking for the hanzi (from your wording, I don't think you are just talking about Chinese usage of Han characters) is like asking for a single Latin script rendering. (I think you need to get yourself an English-Chinese dictionary or something, btw...) Thomas Chan [EMAIL PROTECTED]
RE: Re[2]: Errata in language/script list
On Mon, 13 Aug 2001, Ayers, Mike wrote: From: Thomas Chan [mailto:[EMAIL PROTECTED]] No, they do. While the dominant way that Chinese languages are written today, which is based on Mandarin Chinese, has been well supported since pre-Unicode 3.0 days, other Chinese languages have faced the problem of many unencoded (or yet-to-be-encoded) characters. I've written on this matter on this list before in the past, principally about Yue Chinese (=~ Cantonese), but also applicable to other Chinese languages. Since those all will get coded into the Chinese alphabet (if they get coded), what's the point? It's pretty simple. Just because enough of a script is encoded for the needs of one language doesn't mean that is necessarily true for other languages that use that script. In time, those omissions are patched up in newer versions of Unicode. Latin, Cyrillic, Arabic, and other scripts have all had new characters added to them in successive versions of Unicode. e.g., If someone asked 1-2 (pre-Unicode 3.1) years ago the question, Can I write Cantonese with Unicode?, the answer would have been no or not really. If it were asked today, the answer would be yes. But try that question today with other minority Chinese languages substituted in it, and the answer is still pretty much a no or not really. Some also require different scripts, such as the Dungan living in the former Soviet Union, who write in Cyrillic (I've been told all the characters they need are encoded), or some Min Chinese, who write in whole or part using the characters in the Bopomofo Extended block (Unicode 3.0) and/or Latin (using certain letter and diacritics that weren't always If you get genuine exceptions, then list them (i.e. list Min Chinese). I get the feeling that you're talking about a darn small userbase here, though. According to the SIL Ethnologue 14th ed.[1], Dungan (SIL DNG): 38,000 in Kyrgyzstan (1993 Johnstone). 
Mother tongue speakers were 95% out of an ethnic population of 52,000 in the former USSR (1979 census). Population total all countries 49,400 out of an ethnic population of 100,000. [1] http://www.ethnologue.com/show_language.asp?code=DNG I don't have figures for the size of the userbase of Min Chinese written in Latin script offhand, but see for instance Proposal to add Latin characters required by Latinized Taiwanese languages to ISO/IEC 10646[2] (1997.6.26) under the user community questions. [2] http://www.egt.ie/standards/la/taioan.html (Did this ever become a WG2 document? I recall seeing discussions of this once, but can't find them offhand at the moment.) BTW, what do you consider to be a darn small userbase, numberwise? Would the UCAS or Cherokee userbases be too small by your standards to include a mention of them? encoded). There's also the Hunan women who write in the unencoded Nushu script that was discussed on this list rather recently. Discussed well enough for me to know that we're talking about a userbase of approximately twelve and counting down. This is not a very pressing case. No, it's probably not pressing at the moment. I'm sure there are more than twelve people who use it for writing and/or research, though. Start counting with the number of people who write in it, and add to that figure the researchers and their assistants (i.e., their students) who are doing the surveys... And this is without going into historical alternative ways of writing Chinese, such as the prolific Guanhua Zimu alphabet/syllabary used in the 1900s-1920s. ...which we don't really need to do, I think, since we're trying to stick to the useful stuff. What do you consider useful? What one person considers useless is useful to someone else. Without specific requirements like userbase size, economic power, cultural significance, extant writings, etc, I don't think we can start making any claims about usefulness. 
The Bible (or portions of it) has been published using Guanhua Zimu[3]. Is that not useful to someone? [3] From Eugene A. Nida, ed., _Book of a Thousand Tongues_, 2nd ed. (London: United Bible Societies, 1972): http://deall.ohio-state.edu/grads/chan.200/misc/guanhua_zimu.jpg If you think historical scripts are not useful, then perhaps the four Philippine scripts, Ogham, Runic, etc should not be mentioned on the list. Anyway, I don't see usefulness as one of the requisites for inclusion on the list in question. And then there are various transliteration schemes, which although they are not anyone's primary script, but which are widely employed, such as Hanyu Pinyin (people do ask, as legacy GB2312 and Big5 character sets don't have them, or only include ugly full-width versions) for Mandarin, or Yale for Cantonese (e.g., people ask if a precomposed m with a grave accent is encoded, as that is needed to transcribe the negative). Transliteration
RE: Re[2]: Errata in language/script list
On Wed, 1 Aug 2001, Ayers, Mike wrote: From: Marco Cimarosti [mailto:[EMAIL PROTECTED]] BTW, I notice that a single Chinese entry is listed. This should probably be split in several entries for the various Chinese languages (or dialects, e.g. Mandarin, Cantonese, Hakka, etc.). This split may be handy because the different languages could need different information. They don't. The joy of unification! No, they do. While the dominant way that Chinese languages are written today, which is based on Mandarin Chinese, has been well supported since pre-Unicode 3.0 days, other Chinese languages have faced the problem of many unencoded (or yet-to-be-encoded) characters. I've written on this matter on this list before in the past, principally about Yue Chinese (=~ Cantonese), but also applicable to other Chinese languages. Some also require different scripts, such as the Dungan living in the former Soviet Union, who write in Cyrillic (I've been told all the characters they need are encoded), or some Min Chinese, who write in whole or part using the characters in the Bopomofo Extended block (Unicode 3.0) and/or Latin (using certain letter and diacritics that weren't always encoded). There's also the Hunan women who write in the unencoded Nushu script that was discussed on this list rather recently. And this is without going into historical alternative ways of writing Chinese, such as the prolific Guanhua Zimu alphabet/syllabary used in the 1900s-1920s. There is also the blind, for which Braille schemes exist for at least Mandarin and Cantonese, although I'll concede that Braille could be listed for almost any language. 
And then there are various transliteration schemes, which are not anyone's primary script but are widely employed, such as Hanyu Pinyin (people do ask, as legacy GB2312 and Big5 character sets don't have them, or only include ugly full-width versions) for Mandarin, or Yale for Cantonese (e.g., people ask if a precomposed m with a grave accent is encoded, as that is needed to transcribe the negative). Thomas Chan [EMAIL PROTECTED]
RE: Re[2]: Errata in language/script list
On Tue, 31 Jul 2001, Marco Cimarosti wrote: BTW, I notice that a single Chinese entry is listed. This should probably be split in several entries for the various Chinese languages (or dialects, e.g. Mandarin, Cantonese, Hakka, etc.). This split may be handy because the different languages could need different information. In the absence of additional qualifying information, I think Chinese would be interpreted as the most salient variety, the modern standard written Chinese (based on Mandarin Chinese; SIL CHN) in dominant use today by speakers of all Chinese languages. However, some people might have questions asking for details like Does Unicode have traditional characters? and/or Does Unicode have simplified characters?--it might even be worth pointing out that both can be used concurrently, which is not what people accustomed to the likes of GB2312, Big5, etc would expect. Still others might ask, Does Unicode have Cantonese/Hong Kong characters? (the terms are not exactly synonymous, but often interchanged). Prior to Unicode 3.1's introduction of the Han characters in Plane 2, I'd say that support for Yue Chinese (SIL YUH; ~= Cantonese) was not really usable. With a logosyllabic script, it'll never be possible to exhaustively check that all its characters are included, but it looks very usable now--I've had high success rates finding them in Plane 2, partially due to sourcing from the HKSCS character set (H source) from Hong Kong, and partially due to sourcing from large dictionaries such as the _Hanyu Da Zidian_ (G-HZ source) where characters (and the words they transcribe) have died out in Mandarin, but are preserved in Yue and other Chinese languages. However, I'm not so sure what the situation is for other Chinese languages, other than a vague impression that they are not well supported--probably the stage that Yue Chinese was at with Unicode 2.1. 
e.g., U+20547 is used only in Min Chinese (MNP, CFR), meaning 'hard, durable', with a pseudo-Mandarin reading of dian4. It's in Unicode only because it happened to be in HKSCS, and to my knowledge that is the only character set it appears in, perhaps for the use of Chaozhou speakers (Chiuchow, Teochew), a linguistic minority in Hong Kong (Chaozhou is a dialect of Minnan Chinese, CFR). U+20547 is also documented only in very few dictionaries, none of which were apparently a source for Unicode. I think any support for Min Chinese at this point is probably accidental. (FYI, U+20547 looks like U+6709 with the two center strokes removed and replaced by U+4E36.) Thomas Chan [EMAIL PROTECTED]
Re: Erratum in Unicode book
On Sun, 8 Jul 2001, James Kass wrote: An ideal index for the casual or non-CJK user might be quite different in approach. Perhaps the first component drawn in For the less than proficient user, I think it would be beneficial to have a means to restrict the pool of characters that they are searching amongst--consider the circumstances under which they are likely to have encountered the character they are looking up. The radical-strokes index in TUS3.0 covers over 27,000 characters, many times more than most dictionaries and character sets, and in some places, there are just too many characters falling under a particular radical+residual stroke count for one to scan the page efficiently. Thomas Chan [EMAIL PROTECTED]
Re: Erratum in Unicode book
On Mon, 9 Jul 2001, Richard Cook wrote: On a related note, I have 9000 word/char frequencies from Hanyu Pinlu Cidian (a mainland text; I typed the entries in back in the early 90's, and this is the freq data currently used in Wenlin). I'd be happy to give the Consortium access to this data for the purpose of sorting characters with identical rad/str numbers by frequency. Wouldn't that bias sorting according to Chinese language usage frequencies? e.g., \u7684, \u4f60, \u5403 are very common in Chinese, but rare or obscure in Japanese. Subsorting by pronunciation would also be language-dependent. For a language-neutral method of sorting characters with otherwise the same radical and # of residual strokes, how about the method used in the _Hanyu Da Zidian_ (and some other dictionaries) of sorting by the type of stroke of the first stroke, second stroke, etc., by whether it is one of the five basic types of strokes as exemplified in the first five Kangxi radicals? This requires such data be available for all 70,000+ characters, though... Thomas Chan [EMAIL PROTECTED]
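The stroke-type subsort suggested above might look something like the sketch below. Everything here is illustrative: the names are hypothetical, the five ranks are assumed to follow the order of the first five Kangxi radicals (a given dictionary may rank the types differently), and the stroke sequences for the five sample characters were entered by hand and should be checked against a real data source.

```python
# Rank of each basic stroke type. Assumption: ranks follow the order of the
# first five Kangxi radicals (horizontal, vertical, dot, left-falling,
# turning); a particular dictionary may order the five types differently.
STROKE_RANK = {"h": 1, "v": 2, "d": 3, "p": 4, "z": 5}

# Hand-entered illustrative records:
# (character, radical, residual strokes, stroke-type sequence).
# All five share radical 1 with two residual strokes, so only the
# stroke-type sequence can separate them.
SAMPLE = [
    ("\u4e0a", 1, 2, "vhh"),
    ("\u4e07", 1, 2, "hzp"),
    ("\u4e08", 1, 2, "hpd"),
    ("\u4e0b", 1, 2, "hvd"),
    ("\u4e09", 1, 2, "hhh"),
]


def sort_key(record):
    """Order by radical, then residual strokes, then stroke types in turn."""
    _, radical, residual, strokes = record
    return (radical, residual, tuple(STROKE_RANK[s] for s in strokes))


def sorted_chars(records):
    """Return just the characters, in index order."""
    return [r[0] for r in sorted(records, key=sort_key)]


print(" ".join(sorted_chars(SAMPLE)))
```

The key needs no frequency or pronunciation data, which is the point of the suggestion, but it does need a stroke-type sequence for every character in the index.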
Re: Re: Erratum in Unicode book
On Sun, 8 Jul 2001, Michael (michka) Kaplan wrote: From: James Kass [EMAIL PROTECTED] Perhaps he (てんどうりゅうじ) was lamenting the character's absence in the Han Radical Index section under radical # 85. If all the characters made from the water radical were listed under that radical in the Han Radical Index (and so forth), where would the sport be in looking up CJK code points? The Han Radical Index is particularly useful when the significant radical is known, kind of like having to know the correct spelling of an English word before it can be looked up in an English dictionary. I suspect you are correct -- but since Unicode does not promise to support such a thing (complete decomposability into radical and stroke of all unified CJK ideographs), it might be a stretch to consider it an erratum? I don't think there's any mistake. U+9152 is filed under radical 164 as a three residual-stroke character because that's where dictionary #1, the _Kangxi Zidian_, places it as per the rules in Han Ideograph Arrangement on p. 266 of TUS3.0. What 11 prefers is a more progressive system of filing characters under radicals that does not require one to know what a character means, which part is not the phonetic element, etc. e.g., U+554F, wen 'to ask' (also part of wenti 'problem'; Japanese mondai) is filed under radical 30 mouth as an eight residual-stroke character, since that's where dictionary #1 places it, although some people might prefer to see it as a three residual-stroke character under radical 169 gate, such as what is done in dictionary #3, the _Hanyu Da Zidian_ (but too bad, it's only #3). Indeed, consider the almost similar character U+95EE, which is the Chinese simplified form, which is filed as a three residual-stroke character under radical 169, because that is where dictionary #3 places it (and dictionaries #1 and #2, not containing it, thus have no say otherwise concerning where it is filed). Thomas Chan [EMAIL PROTECTED]
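The filing rule described above (use the radical given by the highest-priority dictionary that actually contains the character) can be sketched as a simple lookup. The names below are hypothetical, `dict2` is a placeholder because the message does not name dictionary #2, and the table holds only the radical assignments stated in the message.

```python
# Dictionaries in priority order. Only #1 and #3 are named in the message;
# "dict2" is a placeholder for the unnamed dictionary #2.
PRIORITY = ["kangxi", "dict2", "hanyu_da_zidian"]

# Radical number per dictionary, limited to what the message states.
FILINGS = {
    0x9152: {"kangxi": 164},
    0x554F: {"kangxi": 30, "hanyu_da_zidian": 169},
    0x95EE: {"hanyu_da_zidian": 169},
}


def filing(codepoint):
    """Return (dictionary, radical) from the first dictionary that has it."""
    entries = FILINGS.get(codepoint, {})
    for name in PRIORITY:
        if name in entries:
            return name, entries[name]
    return None  # not covered by this illustrative table


for cp in (0x9152, 0x554F, 0x95EE):
    print("U+%04X" % cp, filing(cp))
```

This makes the U+554F versus U+95EE contrast mechanical: the two characters differ only in which dictionaries contain them, and that alone decides the radical.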
status of Jindai scripts?
Hi all, I'd like to ask about the encoding status of the Japanese Jindai scripts, which are mentioned in older documents[1], and until a certain point in time, versions of the Roadmap. Here's a scan of a partial table of over a dozen Jindai scripts (except the rightmost column, which is modern katakana a to sa), from part of the appendix to He 2000[2]: http://deall.ohio-state.edu/grads/chan.200/misc/jindai.jpg [1] e.g., Concerning Future Allocations (April 11, 1993) http://www.unicode.org/Public/TEXT/ALLOC.TXT [2] HE Qunxiong \u4f55\u7fa4\u96c4's _Hanzi zai Riben_ \u6f22\u5b57\u5728\u65e5\u672c (Han Characters in Japan) (Hong Kong: Shangwu \u5546\u52d9, 2001). Book is in Chinese. Thomas Chan [EMAIL PROTECTED]
Re: status of Jindai scripts?
On Tue, 3 Jul 2001, Rick McGowan wrote: Thomas Chan wrote... I'd like to ask about the encoding status of the Japanese Jindai scripts, which are mentioned in older documents[1], and until a certain point in time, versions of the Roadmap. Do you have a paper on the topic? You say over a dozen 'Jindai' scripts. What does this mean? Is it a style of stylization? A style of its own? Something else entirely? A cipher on Chinese characters? Of Kana? I don't know anyone who knows enough about it to even answer basic questions. We know it's Japanese, and is probably associated with Shinto. Sorry, I don't know more about what they are. My count of over a dozen is based on the number of xxx moji's listed in the caption of the illustration I cited. They seem to be ciphers of kana. I'm just puzzled by the disappearance of mentions of them in what I can find on the publicly available parts of the unicode.org and WG2 websites, even if it's just to say that not enough is known about them to do anything, e.g., WG2 N1955 (1999.1.26) mentions that there is not enough data, WG2 N2046 (1999.8.15) and later don't mention them at all, as far as I can see. Thomas Chan [EMAIL PROTECTED]
RE: Innovative use of Latin ?!
On Mon, 2 Jul 2001, Ayers, Mike wrote: From: Martin Duerst [mailto:[EMAIL PROTECTED]] For people interested in new scripts, and new uses of existing scripts :-) http://www.google.com/intl/xx-hacker/ This looks like what is called L33T (elite) writing. It's popular among online gamers. Kinda like computer pig latin... /|/|ike The way you sign your messages is related to that, isn't it? :) I've seen ]\/[, too. There doesn't seem to be any standard scheme--the goal appears to obfuscate writing by substituting graphically similar characters. What Google is using is pretty tame--there were 1980's and 1990's versions that made use of all the characters in CP437--Greek, math, linedrawing, etc. Thomas Chan [EMAIL PROTECTED]
RE: Nushu
On Tue, 26 Jun 2001, Marco Cimarosti wrote: Michael Everson wrote: 600 characters it is then. If Nüshu is actually logographic, think that this may be too early a conclusion. If the figure is based on a single person's sample, it could only reflect the vocabulary of that lady or, even worse, the vocabulary of the topics that she was dealing with in the sample texts. If that were the case, 600 is low, as it could mean there is almost a 1-to-1 correspondence between a syllable and Nushu character when not taking tone into account (c.f. the 400 or so extant syllables in the Mandarin spoken in Beijing, when not counting tones), implying that most/all homophonous words are written with the same Nushu character, and that tonal distinctions are omitted in the orthography--neither of which is the case in the logographic Chinese model. Alternatively, the 600 could simply represent what is written in the extant corpus (Marco's point above), and those Nushu writers do only write on a limited range of formulaic topics. Chiang 1995 (which I do not have at the moment) did in fact provide an analysis of the phonology of the language/dialect written down by Nushu and comparison to its characters. If it is not possible to do more investigation on Nüshu, I would suggest to reserve a bigger area (= 2000 entries), because that is the minimal number that I would expect from a logographic script. Somewhere around 2000 is the bare minimum expected for a high school education, e.g., the 1945 Jouyou Kanji of Japanese and its slightly lower predecessor Touyou Kanji, and something similar in the Korean of South Korea. For Chinese, the figure is higher--around the 3000s. (Marco, do you have the figure from the Yin and Rohsenow book?) 
Cheng (2000: 109-110) gives various figures for frequently used characters: 4261 in CHEN Heqin's early 1928 study _Yutiwen Yingyong Zihui_ (Practical Lexicon for Colloquial Style), and from 2000 to over 7000 in the other works in his list of practical lexicons and pedagogically oriented frequency books. The number 4000 or thereabouts shows up quite often--he cites newspapers in Taipei, Hong Kong, and Singapore (110) as using about 4000; 4501 in the corpus of the novel _Honglou Meng_ (110); and about 4000-8000 in each of the dynastic histories (which are written on a range of topics). Norman (1988: 73) provides similar figures and conclusions, 3000-4000 for ordinary literacy. The question is perhaps how literate these Nushu writers are in relation to the benchmark of mainstream Chinese, and how varied their writing is. Maybe also a unification of all the forms in use by all the writers involved? I have a book that has a nice chapter about the frequency of Chinese characters (*). From evaluations made on a modern corpus, they obtained this statistical progression: [snip] - the first 2400 cover the 99%, - the first 4000 cover the 99.9%, - 6359 practically covers 100% of all modern text. But we know that Unicode, as well as any CJK character set, has much more than 7000 characters. Not to detract from your point, but I believe the 6359 figure represents the GB2312 character set, which represents the ceiling for the number of characters in the study. (Even then, GB2312 is a pruned repertoire that is meant only for everyday mainstream use, not literature, technical fields, dialects, etc.) This discussion reminds me that I have a scan of the illustration from the Nushu entry in Florian Coulmas' _Blackwell Encyclopedia of Writing Systems_ (Cambridge, MA: Blackwell Publishers, 1996) lying around from a while ago: http://deall.ohio-state.edu/grads/chan.200/misc/nushu.jpg (It's accompanied by a transliteration in Han characters, simplified spelling.) 
Coulmas in fact says very little about Nushu, but provides two references. References Cheng, Chin-Chuan. 2000. Frequently-Used Chinese Characters and Language Cognition. Studies in the Linguistic Sciences 30, no. 1 (Spring 2000): 107-118. Chiang, William. 1995. _We Two Know the Script; We Have Become Good Friends_. Lanham, MD: University Press of America. (This is a revision of his Ph.D. dissertation.) Norman, Jerry. 1988. _Chinese_. Cambridge: Cambridge University Press. Thomas Chan [EMAIL PROTECTED]
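The coverage figures quoted in this message (the first N characters covering 99%, 99.9%, and so on) come from a cumulative frequency count over a corpus. A minimal sketch of that calculation follows. The function names are hypothetical, and the toy string in the usage note is not data from any of the cited studies.

```python
from collections import Counter


def coverage_curve(text):
    """Return (rank, cumulative share of tokens) pairs, most frequent character first."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    curve = []
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        curve.append((rank, running / total))
    return curve


def types_needed(text, target):
    """Smallest number of distinct characters whose occurrences cover `target` of the text.

    Returns 0 for an empty text.
    """
    for rank, share in coverage_curve(text):
        if share >= target:
            return rank
    return 0
```

For example, `types_needed("aaaabbc", 0.8)` returns 2, since the two most frequent characters account for six of the seven tokens. A figure such as 6359 for 100% coverage is then bounded by the character set the corpus was encoded in, which is the point made above about GB2312.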
Re: How to tell Japanese from Chinese.
On Fri, 8 Jun 2001, [ISO-2022-JP] $B$F$s$I$$j$e$$8(B wrote: My very simple rule of thumb for telling Japanese from Chinese is to look for kana. If I see even one kana, I am looking at Japanese, right? (Warning: A few kanji resemble katakana.) So if I see so much as a hiragana to, it's Japanese, right? But sometimes there are stretches of many kanji. Yes, that rule of thumb works for most everyday cases that one'll run into. However, manyougana would be classified as Chinese under that rule, as well as kanbun. I'm not sure that one would want to classify the more deviant (from a classical Chinese POV) and more Japanized forms of kanbun as Chinese. Have you seen hentaigana before?--that straddles the boundary between being kanji used for transliteration/transcription and being kana. (How would such text be encoded in Unicode, if at all?) Doesn't this kanji bad-ascii-art [snip] /bad ascii art NOT to be confused with hiragana e (oy vey), usually only appear in Chinese? In other words, \u4e4b vs. \u3048. I presume you're asking for purposes of a human reader, as a machine could easily detect the difference between U+4E4B and U+3048. But then, a sufficiently literate human could probably read the text and eliminate one of the two choices. Marco has already explained U+4E4B. But if you want a simple system based only on presence (or lack) of certain characters, then I'd look for common Chinese ones such as: \u9019 (\u8fd9) zhe 'this' \u5011 (\u4eec) men (plural suffix) \u9ebc (\u4e48) me (as in \u751a\u9ebc, \u751a\u4e48, \u4ec0\u4e48 shenme 'what') \u55ce (\u5417) ma (question particle) \u4f60 ni 'you' Pardon my incoherence. I haven't had enough sake. \u9152 on a sign or label--is that Chinese or Japanese? Hard to tell. If you're familiar with some differences in simplification, you can also make corroborating conclusions, e.g., if I see \u6226 embroidered on someone's baseball cap, rather than \u6230 or \u6218, then that strikes me as Japanese. 
(Not that the person who made it or wearing it probably knows or cares.) Thomas Chan [EMAIL PROTECTED]
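The rule of thumb in this message (any kana means Japanese; otherwise look for a few very common Chinese function characters) can be sketched as below. This is only an illustration of the heuristic as described, with hypothetical names. The kana ranges are limited to the letter portions of the Hiragana and Katakana blocks as an assumption, and, as noted above, manyougana and kanbun would defeat the kana test.

```python
# Marker characters listed in the message: zhe, men, me, ma (traditional and
# simplified forms) and ni.
CHINESE_MARKERS = set("\u9019\u8fd9\u5011\u4eec\u9ebc\u4e48\u55ce\u5417\u4f60")


def is_kana(ch):
    """True for hiragana letters (U+3041..U+3096) and katakana letters (U+30A1..U+30FA)."""
    cp = ord(ch)
    return 0x3041 <= cp <= 0x3096 or 0x30A1 <= cp <= 0x30FA


def guess_language(text):
    """Guess 'japanese', 'chinese', or 'undetermined' from character presence alone."""
    if any(is_kana(ch) for ch in text):
        return "japanese"
    if any(ch in CHINESE_MARKERS for ch in text):
        return "chinese"
    return "undetermined"
```

A lone \u9152 on a sign comes back as "undetermined", which matches the "hard to tell" verdict above. Corroborating evidence such as \u6226 versus \u6230 or \u6218 would need a separate table of simplification differences.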
Re: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)
On Wed, 6 Jun 2001, John Cowan wrote: Marco Cimarosti scripsit: Or Hanyu, in fact, which is the normal name for Mandarin in Mandarin. I believe, however, that this term is relatively recent in its current sense, and is part of the effort the PRC government makes to distinguish between zhongguo as a political term and han as an ethnic one. Hanyu usually does refer to Mandarin (in the same way that Chinese in English usage usually refers to Mandarin), but I think that is because Mandarin is considered the standard or is the most salient form of Chinese. There are usages such as in the title of the book _Hanyu Fangyan Cihui_[1], a Swadesh-esque comparison of terms in various fangyan (topolects; usu. trans. as dialects), where non-standard Mandarin and non-Mandarin forms of Chinese are included under the umbrella Hanyu term. (FYI, here's a sample from the 2nd ed. of the tomato entry: http://deall.ohio-state.edu/grads/chan.200/misc/tomato.jpg) [1] _Hanyu Fangyan Cihui_ \u6c49\u8bed\u65b9\u8a00\u8bcd\u6c47, 1st ed. (Beijing: Wenzi Gaige, 1964); 2nd ed. (Beijing: Yuwen, 1995). One of the later chapters of Jerry Norman's _Chinese_ (Cambridge: Cambridge University Press, 1988) discusses terms including hanyu, putonghua, zhongwen, guanhua, etc. Thomas Chan [EMAIL PROTECTED]
Re: Unicode under fire again
On Tue, 5 Jun 2001, Mark Leisher wrote: http://www.hastingsresearch.com/net/04-unicode-limitations.shtml Hmm, lots of ink spilled just to rehash a weak argument against Han Unification. Beneath the misinterpretations and typos, I do see hints of valid concerns about yet-unencoded Han characters (e.g., neglect/bias of/against Taoism and old texts--the sort of thing he works with), which would have made a better argument. It wouldn't hurt for him to have newer and more advanced references, too. Thomas Chan [EMAIL PROTECTED]
Re: New kana letters (was RE: Oriyan Language)
On Tue, 5 Jun 2001, Marco Cimarosti wrote: 11 wrote: [...] if you look at the chart for hiragana, you see that they left space for four new kana, just in case somebody decided to invent new kana. [...] I think they're Both things, I think. For new letters, see the yellow squares in: [snip] http://www.unicode.org/charts/draftunicode32/U32-31F0.pdf Interesting. Why is the U+31F0 to U+31FF block currently named Katakana Phonetic Extensions? Phonetic seems like unnecessary wordiness--is the standard set of katakana not phonetic? There also seem to be multiple ways to name an extension/supplement to an existing block/script:
Basic Latin: Latin-1 Supplement, Latin Extended-A, Latin Extended-B, IPA Extensions, Latin Extended Additional
Greek: Greek Extended
Kangxi Radicals: CJK Radicals Supplement
Bopomofo: Bopomofo Extended
CJK Unified Ideographs: CJK Unified Ideographs Extension A, CJK Unified Ideographs Extension B
CJK Compatibility Ideographs: CJK Compatibility Ideographs Supplement
Any pattern to this, or is it just historical? Thomas Chan [EMAIL PROTECTED]
RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)
On Tue, 5 Jun 2001, Marco Cimarosti wrote: BTW, I don't see a problem of political correctness here. The terms used by Japanese or Korean themselves (kanji and hanja, respectively) literally mean Han or Chinese characters. They have the advantage of having more than one morpheme/word that maps to English Chinese, e.g., in Japanese, there is chuu vs. kan, such as the near-minimal pair chuu-goku-go 'Chinese language (lit. language of China)' vs. kan-go 'word composed of Sino-Japanese morphemes'. Perhaps if Han is too unfamiliar a word to be used directly, Sino or Sinitic could be used as translations to convey the same meaning without using the overloaded term Chinese (language, culture, origin, ethnicity, nationality, etc.), e.g., Sino characters, Sinitic characters. Rather, there is a possible technical incorrectness, because the term hides the fact that Japanese and Korean also use their own local phonetic characters. Unfortunately, similar technical incorrectness already exists, e.g., the Japanese-enabled (and probably localized as well) version of a product, website, etc. is sometimes known as the kanji version, despite the presence of kana; similarly, the Korean version is sometimes (almost always?) labeled as the hangul version, despite the rare use of ideographs. Actually, what's worse is that hangul is often used rather than the name of the language--one'll see a list of choices like English, Francais, Deutsch, Nihongo, Hangul... (the last not in Latin script, of course, but its own). (I won't go into how simplified Chinese is an abused and misunderstood term, for now.) Thomas Chan [EMAIL PROTECTED]
RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)
On Mon, 4 Jun 2001, Ayers, Mike wrote: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] For the Han characters, I have found in the past that people whose native language does not use these characters usually refer to them as Chinese. Obviously (to us anyway), calling them Chinese characters is not adequate, so we search for alternatives. Consider: Hanzi - Chinese characters Kanji - Chinese characters Hanja - Chinese characters ...so it would seem that everyone but us agrees on what to call 'em. I think the problem that Doug might be suggesting (correct me if I'm wrong, Doug) is that Chinese is also the name of a language (or languages). The term Han doesn't have this problem, but the term is unfamiliar to most people. Thomas Chan [EMAIL PROTECTED]
Re: Term Asian is not used properly on Computers and NET
On Tue, 29 May 2001, David Gallardo wrote: Please excuse the unintended querulousness, but isn't the Greenwich meridian merely the reification of this bias? [snip] Nonetheless, and more to my point, the terms Near East and Far East were in use long before this. There are also terms like the West or Western (world, languages, civilization, etc) which have referents that are not completely west of the Greenwich Meridian, whose usage cannot be simply explained or justified by it. Thomas Chan [EMAIL PROTECTED]
RE: Braille vs Bidi
On Wed, 30 May 2001, Marco Cimarosti wrote: Kenneth Whistler wrote: I doubt it. But if Marco is correct that Hebrew braille is left-to-right, there could conceivably be some exemplary printed materials in Hebrew, with braille examples, [...] There is a very nice book about how braille is used in each language. Unluckily I can't remember any reference off-hand (I think it was edited by UNESCO), but it was mentioned time ago also on this list. There are at least two: the 1953 edition of _World Braille Usage_ by Sir Clutha Mackenzie published by UNESCO, and the 1990 edition of the same title (but not by him), published by UNESCO and the US Library of Congress (or a dept of it--I don't have the book in front of me to give full bib details). They're both worth looking at for comparative purposes at the changes over about four decades. There are some mistakes and omissions, though, which can be partially detected if one is familiar with the language's sound system, syllable structure, or writing system, e.g., handling of Vietnamese tones is omitted in the 1990 edition--but the bibliography contained within allows one to track down the original sources. The 1953 edition also contains interesting background on their goals and history of braille, including multiple incompatible 6-dot systems for English and German at one point in the past (of course, those readers couldn't simply convert their books into a readable format), and trying to create unified standards. Neither book really contains extended writing samples to show directionality, though. Thomas Chan [EMAIL PROTECTED]
RE: Term Asian is not used properly on Computers and NET
On Tue, 29 May 2001, Marco Cimarosti wrote: Doug Ewell wrote: Peter has an excellent solution -- much better than trying to explain the term CJK to ordinary people -- and I plan to use the term East Asian in the future. But, if by East Asian you mean languages written with Han ideographs, you fall in another pitfall, because Mongolian, Russian, Vietnamese and many other languages spoken in East Asia aren't accounted for. There are many pitfalls. Does the definition exclude Korean when written solely in Hangul? Is Vietnamese clearly East Asian? How about Yi (TUS3.0 thinks so)? Does it include the Cyrillic-writing Dungan Chinese? How about Zhuang written with Han characters? Min Chinese in Latin script? Etc. I think what one wants is something like languages usually and currently possibly including Han characters in their written form. That frees us from worrying about historical or aberrant cases, I think. Or how about just languages written with a very large collection of characters? Then we can include the Tangut, et al too, without including some of the medium-sized syllabaries. (This does require a distorted analysis of hangul, though.) Personally, I got used to the acronym CJK and, so far, I haven't met many people who are so ordinary not to understand the explanation that CJK means Chinese/Japanese/Korean. One problem with that term is that its members are very transparent; some people would like to add V to that to include historical Vietnamese usage. Is one going to add Z for Zhuang and whatever other letters tomorrow? Rather, my problem with the acronym is that I don't know how to translate it in Italian: CGC (for cinese/giapponese/coreano) is horrible, especially if you consider that it is pronounced chee-jee-chee. Moreover, unlike the English acronym, the three initials are not in alphabetical order, and this could be seen as politically incorrect. So we should have chee-chee-jee, which is even worse, if possible. 
Feel free to switch them around for local consumption--the Japanese term nichi-chuu-kan arranges them Japanese-Chinese-Korean. I don't think you can win if you try to be politically correct to everyone--e.g., there is a faction that is unhappy with Korea spelled with K rather than C, as it alphabetizes after J. Thomas Chan [EMAIL PROTECTED]
RE: Term Asian is not used properly on Computers and NET
On Tue, 29 May 2001 [EMAIL PROTECTED] wrote: On 05/29/2001 02:37:55 PM Thomas Chan wrote: I think what one wants is something like languages usually and currently possibly including Han characters in their written form. That frees us from worrying about historical or aberrant cases, I think. Folks, this discussion was about how to label a control in a dialog box, as in the attached image. You can't use a label like that. It may have begun with that (Word XP font dialog box), which was just one of the original poster's (Liwal) examples--he also mentioned websurfing in general--but I think the discussion has expanded beyond that into a discussion of the meanings and implications of various other terms. I don't disagree that a label has to be concise. Thomas Chan [EMAIL PROTECTED]
Re: Radical of U+4E71
On Mon, 28 May 2001, [ISO-2022-JP] $B$F$s$I$$j$e$$8(B wrote: I thought the radical was tongue, not hook. It depends on what source you use as your authority. The _Kangxi Zidian_ says it goes under radical #5, U+2F04 KANGXI RADICAL SECOND (not #6, U+2F05 KANGXI RADICAL HOOK, btw), and it seems for this particular character U+4E71, the other three dictionaries (Dai Kanwa Jiten, Hanyu Da Zidian, and Dae Jaweon) happen to agree. I don't see a pointer on p. 895 of TUS3.0 under radical #135, U+2F86 KANGXI RADICAL TONGUE, but maybe some people who are unfamiliar with the character would need it (and it would be logical to try looking under what's on the left half of the character first). In case you're wondering, the kanji in question is ran as in random. That IS what it is, isn't it? Yes, U+4E71 is 'chaos' and other related/derived words; luan in Chinese, ran in Japanese. It's what appears on the box art for Kurosawa's movie _Ran_. However, there are cases such as U+7551, hatake 'dry field', which is, surprise! (well, maybe just to Chinese people), filed under radical #102, U+2F65 KANGXI RADICAL FIELD, rather than #86, U+2F55 KANGXI RADICAL FIRE. According to the priorities in the Han Ideograph Arrangement section on p. 266 of TUS3.0, it failed to be found in the _Kangxi Zidian_ (dictionary #1), so it was up to the _Dai Kanwa Jiten_ (dictionary #2) to dictate where it goes in Unicode, despite dictionary #3, the _Hanyu Da Zidian_, saying it should go under #86, etc. (If and when U+7551 does appear in Chinese dictionaries, it is filed under radical #86. Why include it if this character is used in Japanese and not Chinese?--the story I have heard is that it appears as part of the name of a Japanese soldier in one of MAO Zedong's works. People must want to know what it means and how to read it, so it acquired an artificial tian2 reading in Chinese.) Some -- a very few -- kanji actually have proper names in Unicode, if you know how to look for them, such as IDEOGRAPH FIRE. 
I think that's a bit misleading. You must be referring to either or both of these two: U+322B PARENTHESIZED IDEOGRAPH FIRE U+328B CIRCLED IDEOGRAPH FIRE In this particular case, not only is there something different about them (parenthesized and circled), but if you look at the context of the source they probably came from, they were probably meant as abbreviations for 'Tuesday' (in Japanese and Korean use), and not 'fire', e.g., U+322A PARENTHESIZED IDEOGRAPH MOON for 'Monday' .. U+3230 PARENTHESIZED IDEOGRAPH SUN for 'Sunday', and likewise for U+328A .. U+3290. Thomas Chan [EMAIL PROTECTED]
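The radical numbers and code points cited in this message line up because the Kangxi Radicals block lists the 214 radicals in order starting at U+2F00. A short sketch using Python's standard `unicodedata` module shows the mapping and prints the names of the weekday-style enclosed ideographs. The helper names are hypothetical, and the names returned depend on the Unicode data shipped with the Python build.

```python
import unicodedata


def kangxi_radical(number):
    """Return the Kangxi Radicals block character for radical `number` (1..214)."""
    if not 1 <= number <= 214:
        raise ValueError("Kangxi radical numbers run from 1 to 214")
    return chr(0x2F00 + number - 1)


def names_in_range(first, last):
    """Return (code point label, character name) pairs for an inclusive range."""
    return [(f"U+{cp:04X}", unicodedata.name(chr(cp))) for cp in range(first, last + 1)]


if __name__ == "__main__":
    # Radicals mentioned above: second, hook, fire, field, tongue.
    for n in (5, 6, 86, 102, 135):
        ch = kangxi_radical(n)
        print(n, f"U+{ord(ch):04X}", unicodedata.name(ch))
    # The parenthesized run from MOON to SUN.
    for label, name in names_in_range(0x322A, 0x3230):
        print(label, name)
```

Note that this only names the radical characters themselves. Which radical a given ideograph is filed under comes from the dictionary data (e.g., the radical-stroke fields in unihan.txt), not from this arithmetic.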
Re: name this hanzi
On Sat, 26 May 2001, Richard Cook wrote: Gaspar Sinai wrote: On Sat, 26 May 2001, Richard Cook wrote: Here's a puzzle: Any idea 1.) what this character is, and 2.) if it's in Unicode? http://linguistics.berkeley.edu/~rscook/bishop/Picture1.gif This CJK millet in English (kibi in Japanese) U+9ECD Yes, that's right. You're the winner. :-) This is a *variant* of [U+9ecd], with [U+efe2] in place of [U+6c3a]. What is U+EFE2? Compare e.g. [U+257e6] and also [U+22863]. I know from context that it is in fact a variant writing of [U+9ecd] ... since it appears in DUAN Yucai's gloss at 113.410 (SWJZZ). But what I don't know is why he wrote it this way ... This form is not in Ext. B, nor HYDZD, nor Kangxi, as far as I can tell. So, did you find this exact form in any character dictionary or list? I don't know where Gaspar found it, but you may want to use the Jiaoyubu Yitizi Zidian (Ministry of Education Dictionary of Chinese Character Variants) by the Zhonghua Minguo Jiaoyubu Guoyu Tuixing Weiyuanhui (Mandarin Promotion Council, Ministry of Education, Republic of China): http://140.111.1.40/ I believe it only came out earlier this year. Site is in Chinese, of course. Big5 encoding. The particular variant you are looking for is here, indexed as a04768-009: http://140.111.1.40/yitia/fra/fra04768.htm It gives three sources where it appears, including the _Yupian_--perhaps DUAN or his printer got it from there? What's nice is that scans of the entries from most of their sources are available (primarily the older ones). Despite all the attention _Kangxi Zidian_ and _Hanyu Da Zidian_ get as being comprehensive, they are sometimes selective in what they inherit from earlier dictionaries. Thomas Chan [EMAIL PROTECTED]
Re: supplementary planes support
On Fri, 25 May 2001, Markus Scherer wrote: Thomas Chan wrote: than Italian's 37 million (http://www.sil.org/ethnologue/top100.html). Italy has about 60 million people. Do you not count at least most of them as speakers of Italian, plus some in Switzerland etc.? I'm just quoting the SIL figure (for comparative purposes to Hakka). If you dispute it, take it up with them, not me. Thomas Chan [EMAIL PROTECTED]
supplementary planes support
Hi all, A while ago, there were questions about the applicability of Han characters in Plane 2 for contemporary everyday use (as opposed to historical or specialist use), to which I offered some Cantonese (SIL YUH) examples. I believe I've found a better example. U+2028E is ngai, the Hakka (SIL HAK) first person pronoun. According to the online edition of the SIL Ethnologue, 13th ed., there are 34 million speakers worldwide (http://www.sil.org/ethnologue/countries/Chin.html#HAK)--slightly less than Italian's 37 million (http://www.sil.org/ethnologue/top100.html). The language is on the decline, though, as they are scattered and tend to be linguistic minorities where they live, and they are losing speakers to the more prestigious Mandarin (SIL CHN) or Cantonese (SIL YUH). To my knowledge, no other character sets (including CNS 11643 and HKSCS) include this certainly common character--I'm actually surprised. Thomas Chan [EMAIL PROTECTED]
Re: Single Unicode Font
On Tue, 22 May 2001, David Starner wrote: In any case, some scripts just go together. Mathematicians and linguists frequently use Latin and Greek together (cf. IPA) in ways that require consistent font looks. Do these need to go together? IPA is Latin, except for a few unfortunate unifications with Greek. Thomas Chan [EMAIL PROTECTED]
RE: Word, Asian characters, and Arial Unicode
On Sun, 6 May 2001, David J. Perry wrote: In classical studies, characters with the shape of U+3008/09, 300A-300F, 3016/17, and 301A/1B are sometimes used to mark various kinds of editorial uncertainty or conjecture in a text. The first and last pairs in my list are the most common by far (I know 3008/09 has another version somewhere in the math block). The guillemets (U+00AB/BB) and the greater than/less than signs do not have the appropriate shapes. Since these are already in Unicode, it would seem best to use them rather than proposing them for inclusion in another range. I'm sorry, but I don't understand how U+00AB/U+00BB or U+226A/U+226B don't have the appropriate shape, but that U+300A/U+300B would. U+300A/U+300B are taller than the guillemets U+00AB/U+00BB, but they tend to not be proportionally-spaced and are not centered inside their square (i.e., U+300A, as a left bracket, is padded out with white space to the left of it), not to mention having vertical forms (probably not relevant for you). In lieu of U+3008/U+3009, why not U+003C/U+003E, U+2039/U+203A, or U+2329/U+232A? (All of these are suggested on p. 568 of TUS3.0.) Similarly, in lieu of U+300C/U+300D, why not U+2308/U+230B (also suggested, but they are really math symbols)? I wasn't aware that brackets similar to U+300E/U+300F, U+3016/U+3017, U+301A/U+301B existed outside of CJK usage, although they don't seem to be as common, and I have no idea what U+301A/U+301B would be used for. Thomas Chan [EMAIL PROTECTED]
Re: Word, Asian characters, and Arial Unicode
On Sun, 6 May 2001, David J. Perry wrote: Word 2000 (under Win98) insists on using Arial Unicode MS whenever you insert a character in the CJK Punctuation range. There are some characters here that might be useful in non-CJK situations, such as the double brackets. I have made a font with these characters but Word will not let me By double brackets, do you mean U+300A and U+300B (LEFT/RIGHT DOUBLE ANGLE BRACKET)? Those are used to delimit titles of books and articles. What kind of usage do you have in mind that U+00AB and U+00BB (LEFT/RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK) or U+226A and U+226B (MUCH LESS/GREATER-THAN) couldn't be used? Not all of the symbols in that block have analogues elsewhere, though--I'm not sure what you can do about that. Thomas Chan [EMAIL PROTECTED]
Re: FW: IDS question
On Mon, 30 Apr 2001, Ayers, Mike wrote: From: Thomas Chan [mailto:[EMAIL PROTECTED]] There are some characters in LENG Yulong and WEI Yixin's _Zhonghua Zihai_ dictionary (Beijing: Zhonghua, 1994), such as gu2 on p. 31 and lin2 on p. 32 that incorporate a circular component. I'd probably describe them as: [snip] (Both look somewhat like crosshairs.) Could you please scan the characters in question? If you can't post them to a web page, you may mail them to me personally. I think I know what this is, but need to see. I really should have a picture, but I was only able to use the dictionary in question briefly while in Minneapolis, so I only have my notes: http://deall.ohio-state.edu/grads/chan.200/misc/gu-lin.jpg (If anyone has access to this dictionary and can scan the entries in question, I'd appreciate it.) p. 31's gu2 reads yin gu yangping 'sound is the same as that of the character gu1 '(paternal) aunt', but in the yangping tone class [hence gu2 in Mandarin]'; yi wei xiang jian _Bian Hai_ 'meaning not yet clear [to the dictionary compilers], seen in the _Bian Hai_'. p. 32's lin2 reads yin lin 'sound is same as that of the character lin2 'forest''; and then the same compiler comment and source reference. Rather weird cases, but LENG Yulong, WEI Yixin, et al. felt they were Han characters. Thomas Chan [EMAIL PROTECTED]
IDS question
Hi all, I've recently been using Ideographic Description Sequences to describe some Han characters that are not in Unicode 3.1, and I noticed that U+3007 is not included in the set of UnifiedIdeographs, despite having the ideographic property (TUS3.0, p. 269; UAX #27, section 10.1). I understand that compatibility ideographs are not allowed to participate in IDS, but U+3007 doesn't have a clone, as far as I know. There are some characters in LENG Yulong and WEI Yixin's _Zhonghua Zihai_ dictionary (Beijing: Zhonghua, 1994), such as gu2 on p. 31 and lin2 on p. 32 that incorporate a circular component. I'd probably describe them as: gu2, p. 31: U+2FFB IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID U+5341 (shi 'ten') U+3007 (ling 'zero') lin2, p. 32: U+2FFB IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID U+2FFB IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID U+5341 (shi 'ten') U+3007 (ling 'zero') U+3405 (x-like shape) (Both look somewhat like crosshairs.) However, those aren't valid sequences. I realize the above two characters are rather odd, but the likes of U+3AB3 and U+3AC8 would have faced the same problem, since they also incorporate a circular component. What would be the advisable way to handle these cases, besides creating invalid IDS sequences, using the PUA, or giving a prose description? Thomas Chan [EMAIL PROTECTED]
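The validity problem described in this message can be illustrated with a small checker. This is a minimal sketch, assuming the rule as stated above: operands must be unified ideographs or radicals, so U+3007 is rejected. The operand ranges are block-level approximations chosen for illustration, the function names are hypothetical, and the length limits that the standard also placed on sequences are not enforced.

```python
# Ideographic Description Characters, U+2FF0..U+2FFB.
TERNARY = {0x2FF2, 0x2FF3}
BINARY = set(range(0x2FF0, 0x2FFC)) - TERNARY

# Assumed operand ranges (approximate blocks, not exact assigned ranges).
OPERAND_RANGES = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2E80, 0x2EFF),    # CJK Radicals Supplement
    (0x2F00, 0x2FDF),    # Kangxi Radicals
]


def is_operand(cp):
    return any(lo <= cp <= hi for lo, hi in OPERAND_RANGES)


def _parse(seq, i):
    """Return the index just past one complete description starting at i, or -1."""
    if i >= len(seq):
        return -1
    cp = ord(seq[i])
    if is_operand(cp):
        return i + 1
    if cp in BINARY:
        arity = 2
    elif cp in TERNARY:
        arity = 3
    else:
        return -1
    i += 1
    for _ in range(arity):
        i = _parse(seq, i)
        if i < 0:
            return -1
    return i


def is_valid_ids(seq):
    """True if seq is exactly one well-formed description under the assumptions above."""
    return _parse(seq, 0) == len(seq)
```

Under these assumptions, `is_valid_ids("\u2FFB\u5341\u3007")` (the gu2 description) is False because U+3007 is not an accepted operand, while a sequence such as `"\u2FF0\u53E3\u7DCA"` passes.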
RE: Moving mail lists
On Tue, 20 Mar 2001, Mike Lischke wrote: Our old list manglement software will be retired to a far far bitter place in the bucket, and in its stead we are going to an open-source package called Listar which will be much more flexible, and will include digest mode. Just out of curiosity, why do you use an own mailing list server if you can use a free one (Yahoo Groups)? The Unicode list is mirrored there anyway, so why not make the "backup list" being the actual list. Not all posts that make it to [EMAIL PROTECTED] end up at the "unicode" group at Egroups/Yahoo Groups--some are mysteriously lost. Quoting with greater-than signs is also messed up with the HTML interface there, too. Thomas Chan [EMAIL PROTECTED]
Re: TRON
On Wed, 14 Mar 2001 [EMAIL PROTECTED] wrote: From: "Suzanne M. Topping" [EMAIL PROTECTED] After doing some surfing on the topic, it appears to be in at least some use (primarily in Japan?) I hadn't thought there were any viable alternatives to Unicode out there, and was surprised to see TRON looking as alive as it does. A recent version of the commercial implementation of BTRON from Personal Media, called Cho Kanji 3 (Cho means Super in Japanese), claims 171,500 characters are supported. While we take the approach of unifying glyphic variants, their approach on this is to distinguish them all. I believe the website is http://www.chokanji.com/ . However, that 171,500 figure should at least be halved before even beginning discussion or comparison to Unicode, as it inherits the collections of both Mojikyo (http://www.mojikyo.org/) and Tokyo University's GT Mincho fonts (http://www.l.u-tokyo.ac.jp/GT), which both have included the 48,000+ kanji from the _Dai Kanwa Jiten_ (aka Morohashi) dictionary, i.e., flat-out overlap, and not an issue of who considers what to be a glyph variation. Contents of Mojikyo (~80,000): http://www.mojikyo.org/html/download/pdf/pdflist.htm Contents of GT Mincho (~64,000): http://www.l.u-tokyo.ac.jp/KanjiWEB/04_01.html Contents of Chokanji (~171,500): http://www.chokanji.com/ck3/webp/soft.html#mojikind http://www.chokanji.com/ck3/webp/feature-moji.html (Above websites are in Japanese, but there are some pictures.) I'm not sure about the "universality" of any of these; the emphasis seems to be mostly on kanji--they can only be a regional [Japanese] alternative to Unicode. Thomas Chan [EMAIL PROTECTED]
RE: UTF8 vs. Unicode (UTF16) in code
On Mon, 12 Mar 2001, Marco Cimarosti wrote: Thomas Chan wrote: How about the case of a retailer who needs to deal with parts for elevators and needs U+282E2, lip 'elevator'? Or neckties, requiring U+27639, taai 'tie'. I am not seeking excuses to not implement UTF-16 -- rather examples of characters that *do* justify it. I did not mean to imply that you were looking for excuses; sorry if it came across that way. However, there are people who have potentially legitimate reasons to conserve costs and resources by implementing a subset, e.g., the "CJK Unified Ideographs" block in the BMP is one of the first things to go when people want to make fonts lightweight, or do not want, or lack the expertise, to draw all those glyphs. If the perception is that effort for supplementary characters is only for "rare/obscure/historic CJKV" (which is admittedly true for most supplementary characters at the moment), then some people might not bother with support for surrogates, UTF-16, etc. (And Plane 1 users like musicians, mathematicians, and LDS would be "hurt" in the crossfire, too, since it is an "all-or-nothing" matter.) And all your examples are perfectly valid: it would be crazy to tell users: "Sorry: because of software limitations, you cannot order ties or elevators". I neglected to describe U+27639, taai 'tie'. It looks like a left-to-right horizontal arrangement of U+8864 U+592A. OT Out of curiosity, are these loanwords from English? Or is it just a coincidence that they sound like "lift" and "tie"? /OT I don't think your question is entirely off-topic. Loanwords are one way for a language to gain new words and morphemes, some of which will be assimilated enough to eventually find a written representation. 
In the case of Cantonese, which is liberal enough to accept loanwords (rather than preferring calques made of native morphemes, like Mandarin), but conservative enough to prefer writing in Han characters (rather than romanization, as preferred for Southern Min in some quarters), that means there'll be new characters invented for some of these new words (when existing characters are not reused, such as diksi 'taxi' \u7684\u58eb), which results in new candidates to be added to Unicode. To answer your question, yes, "lip" and "taai" are loanwords from (British) English lift and tie. See for instance SUN Zehua's "Xianggang de wailaici" (English Loanwords in Hong Kong)[1] for more examples. [1] http://home.ust.hk/~lbsun/hkloan.html The page is in Big5, and written within the limitations of pure Big5, such as the graph U+6064 for seut 'shirt', instead of the more preferable (and recent) U+88C7. That is, the emphasis of Sun's page is on loanwords, and not the orthography. However, I guess that Cantonese speakers might use dialectal terms (like "lip" and "taai" above) even when writing in literary Mandarin. And certainly they would not Mandarinize proper names. Yes, as the written language is not the spoken language, "errors" do show up, making the text less universally intelligible. In addition, there is also a register difference between the Mandarinesque singgonggei 'elevator' \u5347\u964d\u6a5f, and the above-mentioned "lip", that a writer may wish to make use of. Pentagrams? I haven't seen those... where are they? Hmmm... This is possibly an Italian word badly Anglicized. I just meant "musical notation". Okay. I thought perhaps there were additions to "Misc Symbols" U+2600 .. U+267F or elsewhere that I had missed. Thomas Chan [EMAIL PROTECTED]
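Since the characters in this message (U+282E2, U+27639) lie outside the BMP, UTF-16 represents each as a surrogate pair, which is the "support for surrogates" being weighed above. A minimal sketch of the standard UTF-16 arithmetic follows, with a hypothetical helper name, cross-checked against Python's built-in codec.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogate pairs")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the 20-bit offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the 20-bit offset
    return high, low


if __name__ == "__main__":
    for cp in (0x282E2, 0x27639):
        high, low = to_surrogates(cp)
        # The built-in encoder should produce the same two 16-bit units.
        encoded = chr(cp).encode("utf-16-be")
        assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
        print(f"U+{cp:X} -> {high:04X} {low:04X}")
```

For U+282E2 this gives D860 DEE2. Software that treats every 16-bit unit as a whole character would see two unpaired halves here, which is the failure mode behind "you cannot order ties or elevators".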
RE: UTF8 vs. Unicode (UTF16) in code
On Fri, 9 Mar 2001, Marco Cimarosti wrote: Addison P. Phillips wrote: [...] currently there are no characters "up there" this isn't a really big deal. Shortly, when Unicode 3.1 is official, there will be 40K or so characters in the supplemental planes... but they'll be relatively rare. This reminds me of a question that I wanted to ask since a lot time: how rare is the most common of characters in the extended planes? H... Maybe I should be clearer. Does it exist at least one character U+ that is commonly used in at least one modern language? How about music and math notation? But, yes. U+21075,[1] gan, is an aspect marker in Cantonese, that when placed after a verb, denotes continuing action (roughly equivalent to -ing in English). I don't think anyone would dispute the indispensability or high frequency of this character. [1] It looks like a left-to-right horizontal arrangement of U+53E3 U+7DCA. There is pre-existing data with that character, such as: HKSCS (Hong Kong Supplementary Character Set) has it at 0x9E44, as does its predecessor GCCS (Government Chinese Character Set). One can buy Chinese handwriting recognition and OCR software that support at least GCCS. Vendor extensions to Big5 which predate GCCS and HKSCS, and which have been smaller in size (i.e., only the more frequently-used characters) include it as well, 0xFA5E in Dynalab HK A, and 0xFAD9 in one of Monotype's extensions. I am wondering especially about the CJK characters in Extension B. We all know that the majority of them are rare, ancient or idiosyncratic characters, but I am not quite sure that this is true for *all* of them. I probably wouldn't use "idiosyncratic" as an adjective to describe the *majority* of them, but "rare" and "ancient" (perhaps "historical"[2] would be a better word choice?) are correct. [2] e.g., the "recently deceased", such as Vietnamese chu+~ no^m characters in Plane 2, or even Deseret in Plane 1. 
I think that this is an important question for deciding whether an application should use 32 or 16 bit characters internally, and whether an application has to be fully UTF-16 aware or it can be "UTF-16 ignorant". E.g., imagine designing an application that will be localized in Cantonese: it is important to know whether all characters needed in Cantonese are in the BMP, or if some of them are in Extension B. Some of them are in Extension B. HKSCS is unfortunately a mix of characters needed in Cantonese, and characters needed in Hong Kong (the two are not necessarily the same thing). Rather than trying to figure out what all the characters used in writing Cantonese are, which is an open-ended set, it is simpler to make the assumption that any characters needed for Cantonese that are worth supporting have already made it into HKSCS. Then make a decision based on whether one will support legacy data from HKSCS. (In some cases, one does not have a decision to make, if it is mandatory--e.g., a product that will be used by the HKSAR government.) It doesn't have to be an application localized into Cantonese necessarily even; just one that can process Cantonese text, e.g., for court transcription purposes. It seems that current practice in software is to stuff the characters from HKSCS into the BMP's PUA area, sans unification. Hopefully this will only be a temporary phase. Thomas Chan [EMAIL PROTECTED]
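A minimal sketch of what "UTF-16 aware" means for the Cantonese example above, assuming Python 3. The names surrogate_pair and needs_surrogates are hypothetical and only for illustration; the arithmetic is the standard UTF-16 surrogate calculation. It shows that U+21075 takes two 16-bit code units, which is exactly what a "UTF-16 ignorant" application would mishandle.

```python
def surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def needs_surrogates(text):
    """True if any character in text lies outside the BMP."""
    return any(ord(ch) > 0xFFFF for ch in text)


GAN = "\U00021075"  # the Cantonese aspect marker discussed above

print([hex(unit) for unit in surrogate_pair(ord(GAN))])
print(len(GAN.encode("utf-16-be")) // 2, "16-bit code units")
print(needs_surrogates("\u7684\u58eb"))  # diksi 'taxi' stays within the BMP
```

An application that stores 16-bit units internally could run a check like needs_surrogates over its legacy HKSCS data to see whether the question matters for it at all.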
RE: UTF8 vs. Unicode (UTF16) in code
ae, or an aborted orthographic for English, or the script used in Viet-Nam centuries ago)... Pentagrams? I haven't seen those... where are they? Thomas Chan [EMAIL PROTECTED]
Re: CJKV ideographic, was Re: Perception that Unicode is 16-bit
On Tue, 27 Feb 2001, John Jenkins wrote: On Tuesday, February 27, 2001, at 10:46 AM, Richard Cook wrote: * 'Kanji' in Cantonese Chinese (kahn jee; "k" as in 'can', "a" as in 'father', "jee" as in 'jeep'); I'm afraid it's not "kanji" but "hanji" in Cantonese. Sorry. But is a romanized version of U+6F22 U+5B57 based on the Cantonese pronunciation ever used in English writing the way hanzi (based on Mandarin pronunciation) is? For those familiar with "ASCII IPA", it's /hOn33 tSi22/. (O denotes U+0254 LATIN SMALL LETTER OPEN O; S denotes U+0283 LATIN SMALL LETTER ESH.)[1] Yale romanization would write it honjih, a modified Yale would write it hon3ji6, etc. [1] I wish I could assume that everyone can view IPA, and not go through contortions like this. Thomas Chan [EMAIL PROTECTED]
Re: CJKV ideographic, was Re: Perception that Unicode is 16-bit
On Tue, 27 Feb 2001, Richard Cook wrote: * 'chunom' in Vietnamese [similar to (i.e., analogical) Chinese characters]. If one is going to talk about Vietnamese chu+~ no^m '"southern" characters', then one might as well mention the Japanese kokuji 'national characters' and Korean gugja 'national characters' as well, which are their equivalents of "homemade" characters that do not exist in Chinese.[1] There is also a similar phenomenon in Chinese, called fangyanzi '"dialect" character', which may be considered analogous to the above, the most well known being the Cantonese ones, although others (Wu, Hakka, etc.) do exist. [1] There is a small chance that they might exist in Chinese, or even in other languages, depending on the criteria for being a "national character". Thomas Chan [EMAIL PROTECTED]
Re: Possibilities of future expansion (from Perception etc thread
On Mon, 26 Feb 2001 [EMAIL PROTECTED] wrote: On 02/25/2001 08:01:38 PM "Joel Rees" wrote: I know this has been hashed over time and time again, and the answer has been handed down as if by edict time and again, but _your_ attitude as expressed below is taken by many who are not involved as rather arrogant. Michael and I don't always see eye-to-eye, but I certainly back him on this one. His statement is not arrogant. It is merely fact. Facts are fine. But perhaps the FAQ (if it doesn't have one already) needs an entry with a Question along the lines of: "I've heard that Adobe/Apple/CSUR/Linux/Microsoft/etc are already using certain parts of the Private Use Area (PUA). Does that mean I can't/shouldn't use those codepoints in my application?" To many people, it seems like the UNICODE has taken in hand to define language itself. The explanations, no matter how well founded, sound to ordinary people like slick lawyers trying to cover up something bad with legalese. Like it or not, Unicode is the property of the Unicode Consortium and its members, not ordinary people. Clearly, ordinary people have an interest in its development, but it will benefit them only as the members of the Consortium deem to provide service to them. Now, this sounds cold and legal, and that's really not necessary. The members of the Consortium have it in their best interest to provide good service to ordinary people - that's how they earn their living. Thanks for the diplomatic explanation. Personally, I think the PUA is a wonderful compromise. I know a linguist These sorts of user-defined areas are a great idea. I'm reminded of an old (1996) usage of such space in JIS X 0208 for Ethiopic: http://www.abyssiniacybergateway.net/admas/jis/ Thomas Chan [EMAIL PROTECTED]
Re: Possibilities of future expansion (from Perception etc thread
On Sun, 25 Feb 2001, William Overington wrote: I find the Private Use Areas of great interest and a valuable resource. However, use of the private use characters requires agreement between users if private use characters are to be used for exchanging information between people. Already there is a development of the ConScript registry. This has its influence. I am researching a concept that I am hoping to call a uniengine that uses a few more than 1024 characters. For research purposes I am placing it in the private use area. From the unicode documentation, I have decide to place it in the middle, using U+EC00 to U+EFFF as a block and placing the additional character codes in U+EB00 to U+EBFF. Yet I checked at the ConScript registry to ensure that I was not clashing with that research work. If the uniengine concept becomes popular maybe it will become encoded by the committees into the standard. I feel that the interesting point though is to ask whether, just because there has been mention in the unicode list of that range for a particular line of research work, notes will be made of the fact in documents here and there amongst researchers in the unicode area, so that any possibilities of clashes of meaning with some other person's use of a particular code in those ranges is noted. The very fact that I felt it desirable to check at the ConScript registry is to my mind a demonstration that the private use area is already something other than a private use area. On the other hand, apparently neither CSUR nor you were aware of (or chose to ignore) the clashes with the mappings between the PUA and legacy CJK encodings and character sets, which [the mappings] have already been implemented for 5+ years now on CJK versions of Windows. For the particular ranges you've chosen (U+EB00 .. 
U+EFFF), you clash with:

CP936:
  9A41-A0FE  U+E000-U+E4DD
  AA41-AFFE  U+E4DE-U+E909
  F8A1-FEFE  U+E90A-U+EDE7

CP950:
  FA40-FEFE  U+E000-U+E310
  8E40-A0FE  U+E311-U+EEB7
  8140-8DFE  U+EEB8-U+F6B0
  C6A1-C8FE  U+F6B1-U+F848

Various groups such as academics, input projects, newspapers, etc use these ranges; perhaps the most prominent is HKSCS (and its predecessor GCCS), which had been placed in CP950's user-defined zones, and thus a mapping to the PUA exists (and .tte fonts are created by using PUA codepoints in the cmap). If you chose other ranges, you might clash with CP932 or CP949's. Similar ranges are used on the Macintosh as well. (I don't know about other platforms.) Of course, all of these clash with each other, CSUR, Microsoft's "Symbol", etc in whole or part. I don't see why there's any particular reason why anyone should really care about clashes. Thomas Chan [EMAIL PROTECTED]
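For anyone who does want to check a proposed PUA block against the CP936 and CP950 mappings listed in this message, here is a rough sketch in Python 3. The table simply restates the ranges quoted above; the function name clashes and the tuple layout are my own assumptions, not part of any existing tool.

```python
# (code page, legacy byte range, first PUA code point, last PUA code point),
# copied from the ranges quoted in the message above.
LEGACY_PUA_MAPPINGS = [
    ("CP936", "9A41-A0FE", 0xE000, 0xE4DD),
    ("CP936", "AA41-AFFE", 0xE4DE, 0xE909),
    ("CP936", "F8A1-FEFE", 0xE90A, 0xEDE7),
    ("CP950", "FA40-FEFE", 0xE000, 0xE310),
    ("CP950", "8E40-A0FE", 0xE311, 0xEEB7),
    ("CP950", "8140-8DFE", 0xEEB8, 0xF6B0),
    ("CP950", "C6A1-C8FE", 0xF6B1, 0xF848),
]


def clashes(first, last):
    """Return the overlapping part of each legacy mapping for first..last."""
    return [
        (code_page, legacy, max(first, lo), min(last, hi))
        for code_page, legacy, lo, hi in LEGACY_PUA_MAPPINGS
        if lo <= last and first <= hi      # the two closed intervals intersect
    ]


# The block proposed for the "uniengine": U+EB00 .. U+EFFF
for code_page, legacy, lo, hi in clashes(0xEB00, 0xEFFF):
    print(f"{code_page} {legacy}: U+{lo:04X}-U+{hi:04X}")
```

For U+EB00 .. U+EFFF this reports overlaps with one CP936 zone and two CP950 zones, which is the point being made: almost any PUA block will collide with something already deployed.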
Re: Possibilities of future expansion (from Perception etc thread
On Sun, 25 Feb 2001, Michael Everson wrote: At 10:15 -0800 2001-02-25, Thomas Chan wrote: On the other hand, apparently neither CSUR nor you were aware of (or chose to ignore) the clashes with the mappings between the PUA and legacy CJK encodings and character sets, which [the mappings] have already been implemented for 5+ years now on CJK versions of Windows. So what? Anyone can use the PUA for anything they want. The code positions are NOT standardized. If the "characters" those CJK people are using need to be interchangeable, they need to be properly proposed and put into Unicode. If they are not, then they are NOT characters. I've never denied that anyone can use the PUA for what they want. If you read my post more carefully in entirety, you'll see that I said: Of course, all of these clash with each other, CSUR, Microsoft's "Symbol", etc in whole or part. I don't see why there's any particular reason why anyone should really care about clashes. If someone wants to use the PUA, there's no need to check CSUR or any other source for "clashes". As well as: Various groups such as academics, input projects, newspapers, etc use these ranges; perhaps the most prominent is HKSCS (and its predecessor GCCS), which had been placed in CP950's user-defined zones, and thus has a mapping to the PUA exists (and .tte fonts are created by using PUA codepoints in the cmap). If you chose other ranges, you might clash with CP932 or CP949's. Similar ranges are used on the Macintosh as well. (I don't know about other platforms.) 
If you want examples of these characters (and in some cases, publicly available .tte fonts), there's: http://www.info.gov.hk/gccs/ (GCCS, from the Hong Kong SAR government) http://www.info.gov.hk/digital21/eng/hkscs/index.html (HKSCS, also from the Hong Kong SAR government) http://www.microsoft.com/hk/hkscs/ (HKSCS, from Microsoft HK) http://www.sinica.edu.tw/~tdbproj/handy1/ (for the Scripta Sinica project, from Academia Sinica Computing Centre, a part of Academia Sinica, a research institution, in Taiwan) http://www.appledaily.com.hk/ (for Apple Daily, a Hong Kong newspaper) http://www.ust.hk/itsc/chinese/infra/eudc/ (for HKUST, a Hong Kong university) http://www.dynalab.com.hk/font/gaigi.htm (for one of DynaLab's sets, from same, a foundry) etc etc--this is not an exhaustive list. Someone else can probably provide examples of Japanese or other Chinese ones. As for whether proposals exist or not, that doesn't matter if one is concerned about clashes--e.g., many people will avoid using U+F000 .. U+F0FF because Microsoft Symbol (which itself has multiple variants) and others already use those codepoints on a widespread scale. Thomas Chan [EMAIL PROTECTED]
Re: More rambling about Han
On Thu, 22 Feb 2001, Joel Rees wrote: What I want is to be able to send a piece of text with one or more characters that I know the recipient will not have in his collection of fonts, and have some hope that he or she will be able to see the glyph in a meaningful manner. The IDC's can help do part of the job. I'm not sure this sort of thing belongs in Unicode, but the Big5+ spec described two ways of transmitting the glyphs for user-defined characters. One of them referred to CNS13479/ISO6429 to use an escape-code to switch into a mode where a bitmap of 16x16, 24x24, 32x32, 40x40, 48x48, 64x64, 80x80, 96x96, or 128x128 size could be embedded, followed by its user-defined codepoint. The other referred to an ISO9541 SGML-based solution. (See section 3.6 of big5pm2.doc file in the big5p-1.zip package from http://www.cmex.org.tw/ . It's all in Chinese, though.) Thomas Chan [EMAIL PROTECTED]
Re: fictional scripts revisited
On Thu, 22 Feb 2001, David Starner wrote: On Wed, Feb 21, 2001 at 10:58:06PM -0800, Thomas Chan wrote: First, there are the 4000 new[4] "CJK Ideographs" that he created solely for a work called _Tianshu_ (A Book from the Sky)[5] (1987-1991), which Xu spent three years carving movable wooden type for. There is no doubt that these are bona fide Han characters, albeit without readings and meanings. Idiosyncratic and personal characters are not encoded in Unicode. I know they aren't, but they have in the past, and some have just sneaked in with Plane 2. Of course, policy can also change--it wasn't that long ago that musical notation and braille were barred. I'm not suggesting that Xu's 4000 bogus characters deserve to be included; this is merely an example of a possibility to think about. If say, they were included in a futuristic cjk vertical extension "M" (to pick a letter safely in the distant future), who'll know to object to them? (Someone will probably dig up this thread, now that I've mentioned it...) However, the lack of readings and/or meanings, or nonce usage, has not stopped characters before from being included in Unicode or precursor dictionaries and standards, e.g., U+20091 and U+219CC, created as a "I know these two characters and you don't" one-upmanship stunt; or the various typos inherited from JIS standards. I believe Unicode, as a general rule, does not encode meaningless characters. Any currently in Unicode are either mistakes, or come from preexisting standards. Mistakes are mistakes; they happen. But how does one decide how to handle pre-existing sources? Set 1991 as a cutoff date? It really becomes a delicate issue. But consider that these represent potentially 4000 codepoints that could be gobbled up by "fictional characters", and it only took a a single individual three years to come up with them. But that's not true. No one is proposing that every newly created script that comes along be encoded in Unicode. 
For them to gobble up 4000 codepoints, it would take a body of work by a number of authors, like Tengwar and Cirth have had. A number of the characters in Plane 2 were grandfathered in because of their inclusion in dictionaries, despite lacking reading, meaning, or both, or being outright typos. If those bogus 4000 made their way into a dictionary or standard, and some large country(s) pressured to include them, then there'd be an ugly situation--one can probably think of a few examples of compromises made, Unicode and elsewhere. The second example I would like to raise are the "Square Words" or "New English Calligraphy"[6] (I don't know which name is more appropriate, but I will refer to it hereafter as "NEC"), which is a Sinoform script. NEC is a system where each letter of the English alphabet[7] is equated with one (?) component of Han characters, and each orthographic word is written within the confines of a square block, in imitation of Chinese writing [... CJK ideographs are precomposed in Unicode ...] Thus, there's no reason to expect that NEC would be encoded any differently. I disagree. Say, for instance, some small* country decided to adopt NEC as a writing style, and hence Unicode had to include it. There are 1,000,000 words in English by some counts, so it's not feasible to encode them all in Unicode, or even some semi-complete subset. So it would be encoded by component and treated like any other complex script. (* I say a small country, because a large country might be able to get a large chunk of precomposed characters stored in Unicode. I still don't think that it would be done soley precomposed.) Even very small countries using a certain script would have more users than scholars of dead scripts, even if the margin is measured in thousands, and countries have political clout that loose federations scholars do not. 
Yet, the historical Sinoform scripts of the Khitan, Jurchen, and Tanguts[1] are given many rows in WG2 N2314[2] (2001.1.9), the Plane 1 roadmap, and there are no modern successors who can champion their cause for cultural reasons, like Vietnam for chu+~ no^m characters, or Ireland for ogham. I'm not privy to WG2's workings, but what else would one conclude based on this roadmap and the treatment of Han characters and Hangul, except that a script like NEC would be treated precomposed? Of course, this is only a transient roadmap. [1] On the same roadmap are other South American, Near Eastern, and other large scripts in the same position. [2] http://www.egt.ie/standards/iso10646/plane1-roadmap-table.html At the inception of various other fictional scripts, no one could foresee the growth of scholarly and/or amateur interest in them; True. That's why we wait until there is, before we consider encoding a script. Yes, I agree. It is harder to find historical scripts and characte
Re: New BMP characters (was Re: [very OT] Documentation: beyond
On Wed, 21 Feb 2001, Werner LEMBERG wrote: Section 10.1 of PDUTR #27 "Unicode 3.1" (2000.1.17) gives the sources of the 42,711 new characters as: ... CNS 11643-1992, 15th plane Really? I thought this should be CNS 11643-1986. I think there isn't a 15th plane in the 1992 version. I thought so too, but both PDUTR #27 and IRG N777 give the 1992 version and the 15th plane. Perhaps it might be a simple typo, since lower-numbered planes are given on preceding lines as the 1992 version. South Korea's PKS 5700 This is a North Korean standard AFAIK. I thought the PKS standards were too, but then I heard that North Korean ones (such as KPS 9566) were filed under the KP- sources. I'm sure the KS C and KS X standards are South Korean, and filed under the K- sources, but it doesn't seem to make much sense to have North Korean sources split across K- and KP- sources. Now I'm not sure anymore what is going on. Thomas Chan [EMAIL PROTECTED]
Re: New BMP characters (was Re: [very OT] Documentation: beyond
On Wed, 21 Feb 2001, Jungshik Shin wrote: On Wed, 21 Feb 2001, Jungshik Shin wrote: On Wed, 21 Feb 2001, Werner LEMBERG wrote: South Korea's PKS 5700 This is a North Korean standard AFAIK. No. AFAIK, PKS stands for 'Proposed Korean Standard' and as such PKS 5700 became KS C 5700 which in turn was renamed KS X 1005-1. Then, what is KS X 1005-1? It's just the Korean version of ISO 10646 (aligned with Unicode 2.0). I could be wrong in saying that PKS C 5700 became KS C 5700 although it's (almost) certain that PKS represents 'Proposed Korean Standard' (where Korean means South Korean). Unicode 3.0 (p. 259) lists two PKS C's as K source 2 and K source 3 (PKS C 5700-1 1994 and PKS C 5700-2 1994) and http://www.cse.cuhk.edu.hk/~irg/irg/N777_CJK_B_CoverNote.pdf lists PKS C 5700-3 1998 as another K source. What is this mysterious PKS C 5700-[1-3]? I asked around in the past but haven't obtained the definitive answer. Perhaps, I should ask someone in IRG. The unihan.txt file ver 3.0b1 (1999.7.2) lists four K- sources as: K0 KS C 5601-1987 K1 KS C 5657-1991 K2 PKS C 5700-1 1994 K3 PKS C 5700-2 1994 It's very clear what K0 and K1 are, and they are given as GR ranges arranged by pronunciation, and it is okay that these ranges overlap, since K0 and K1 are two different character sets. K2 has what appears to be GL ranges given for it (0x2121 .. 0x7530), and arranged by radical+strokes. K3 looks similar, having what appear to be GL ranges (0x2121 .. 0x3771), arranged by radical+strokes, but they all fall within CJK Extension A. The ranges given for K2 and K3 also overlap. (They seem reminiscent of the "planes" of CNS 11643 / EUC-TW .) According to the 02n34428_cjk_b_fcd_mapping.txt file[1] (May 2000?), the K source (#4?) is given as a decimal number from 0002 .. 0269, arranged by radical+strokes, and all within CJK Extension B (but this file only deals with Ext B, so that doesn't mean much). There seem to be some gaps in the numbering, though. 
I'm not sure what to make of this in relation to K2 and K3, or the whole "PKS C 5700" thing. The later date (1998 vs. 1994) must also be of some significance. [1] Available at http://anubis.dkuug.dk/JTC1/SC2/open/02n3442list.htm Thomas Chan [EMAIL PROTECTED]
Re: New BMP characters (was Re: [very OT] Documentation: beyond
On Wed, 21 Feb 2001, Werner LEMBERG wrote: Section 10.1 of PDUTR #27 "Unicode 3.1" (2000.1.17) gives the sources of the 42,711 new characters as: ... CNS 11643-1992, 15th plane Really? I thought this should be CNS 11643-1986. I think there isn't a 15th plane in the 1992 version. Looking more at this, the TF source in the unihan.txt file ver 3.0b1 (1999.7.2) also says "CNS 11643-1992, plane 15", as does p. 259 of TUS 3.0. (Is this what ISO documents have always said too?) Thomas Chan [EMAIL PROTECTED]
Re: New BMP characters (was Re: [very OT] Documentation: beyond
On Wed, 21 Feb 2001, Jungshik Shin wrote: On Wed, 21 Feb 2001, Thomas Chan wrote: The unihan.txt file ver 3.0b1 (1999.7.2) lists four K- sources as: K0 KS C 5601-1987 K1 KS C 5657-1991 K2 PKS C 5700-1 1994 K3 PKS C 5700-2 1994 It's very clear what K0 and K1 are, and they are given as GR ranges arranged by pronunciation, and it is okay that these ranges overlap, since K0 and K1 are two different character sets. Hmm, it's not a big deal but I wonder why they're given as GR ranges instead of just row-column values (or GL). Somebody must have mixed up ... Sorry, this was my mistake. K0 and K1 are given as GL. K2 has what appears to be GL ranges given for it (0x2121 .. 0x7530), and arranged by radical+strokes. K3 looks similar, having what appear to be GL ranges (0x2121 .. 0x3771), arranged by radical+strokes, but they all fall within CJK Extension A. The ranges given for K2 and K3 also overlap. (They seem reminiscent of the "planes" of CNS 11643 / EUC-TW .) By K2 and K3 overlapping, you do not mean some characters in Ext. B are given references to both K2 and K3, do you? If not, it's natural and all right by the same token you said about the overlap of K0 and K1 ranges because it indicates that K2 and K3 have repertoirs disjoint from each other (i.e. The intersection of K2 and K3 is a null set) just like K0 and K1 do. No, I don't mean that some characters are given references to both K2 and K3, which is impossible in the format the unihan.txt file is in. (That doesn't mean it can't happen, though--e.g., a character can be in both GB 2312 and GB 12345, but only a reference to the former, the G0 source, is given.) Thomas Chan [EMAIL PROTECTED]
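Since GL, GR and row-column values got mixed up in this thread, here is a small Python 3 sketch of how the three notations relate for a two-byte 94x94 character set. The helper names are hypothetical; the only assumption is the usual ISO 2022 layout, where GL bytes run 0x21 .. 0x7E and GR is the same value with the high bit of each byte set.

```python
def is_gl(code):
    """True if both bytes of a two-byte code lie in the GL range 0x21..0x7E."""
    high, low = code >> 8, code & 0xFF
    return 0x21 <= high <= 0x7E and 0x21 <= low <= 0x7E


def gl_to_gr(code):
    """Set the high bit of both bytes, e.g. GL 0x2121 becomes GR 0xA1A1."""
    if not is_gl(code):
        raise ValueError("not a two-byte GL code")
    return code | 0x8080


def gl_to_row_col(code):
    """Convert a GL code to 1-based (row, column) values."""
    if not is_gl(code):
        raise ValueError("not a two-byte GL code")
    return (code >> 8) - 0x20, (code & 0xFF) - 0x20


# End points of the K2 range quoted above (0x2121 .. 0x7530)
for gl in (0x2121, 0x7530):
    print(f"GL 0x{gl:04X} = GR 0x{gl_to_gr(gl):04X} = row/col {gl_to_row_col(gl)}")
```

The same conversion applies to any of the K0 .. K3 values in unihan.txt, provided they really are GL values as corrected above.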
fictional scripts revisited
thing is a greater threat--imagine a NEC Sinoform "ideograph" for each orthographic word in English, Spanish, and French--just to take three arbitrary languages--and we'll even give them the benefit of "unification". From what I see now, the only way to handle this sort of thing would be to throw ever more precious codepoints at it. [6] See http://www.echinaart.com/Advisor/xubing/adv_xubing_gallery04.htm , especially the "men" and "women" restroom signs! [7] Technically, Latin script letters--see the Spanish surnames in the "Your surname Please" exhibit (1998) in the URL listed in footnote #6. [8] http://www.hanshan.com/specials/xubingsw.html Thomas Chan [EMAIL PROTECTED]
RE: Unicode Transcriptions
On Fri, 16 Feb 2001, Marco Cimarosti wrote: 2) Which Chinese dialect to adopt for transliterating. Mandarin would be the most likely. Notice the particularities of Bopomofo spelling: - the sound (spelled "ong" in pinyin) is spelled "u-eng"; - there is no "y" in "yi"; - there is no sign to indicate the 1st tone. [snip] Also notice that you may have a few typographical problems in producing the picture: a) In most fonts, the glyph for vowel i is a horizontal line. This is only valid for vertical texts: in horizontal writing it should be vertical. (Suggestion: you may substitute it with an uppercase I from a sans-serif font). Yes, you are right about this. I don't know why TUS3.0 p. 278 says "The character U+3127 BOPOMOFO LETTER I is usually written as a vertical stroke when Bopomofo text is set vertically.", which is *wrong*. b) The glyph for the "combining breve" (3rd tone) is normally designed to fit on western lowercase vowels. (Suggestion: if you use a bigger size for the combining marks, you might get a correct result). I've made two .gif files demonstrating Bopomofo typography: http://deall.ohio-state.edu/grads/chan.200/misc/biaozhunwanguoma.gif http://deall.ohio-state.edu/grads/chan.200/misc/tongyima.gif Both depict left-to-right Han character text, and each character is annotated on its right side with top-to-bottom Bopomofo text. (Alternatively, I could have created versions where the Han character text runs top-to-bottom, and each character is annotated on its right side with top-to-bottom Bopomofo text, but I didn't.) Note the place of the tone diacritics, which is "stacked" even more to the right than the Bopomofo consonants and vowels. Thomas Chan [EMAIL PROTECTED]
[OT] RE: FW: extracting words
On Sun, 11 Feb 2001, Mike Lischke wrote: If you are willing to give up precision, then you can use heuristics. It's ugly but perhaps ok for a simple editor. You can improve the precision with better heuristics and more data, so you get to decide how much is good enough... So using white spaces for general word breaking and ideographs for CJK would be an acceptable approach? What I wonder about is how to handle No, that is not acceptable for Chinese. Chinese text does not use white space anywhere.[1] What was described was that it is tolerable (but not perfect--e.g., punctuation is not handled properly) to break *lines* in Chinese text between Chinese characters. To break *words* properly in Chinese text, you really need a dictionary.[2] [1] There is some Chinese text with spaces, where a space is inserted after each Chinese character, but that is a hack to make word-wrapping behave properly on Chinese-unaware software (which would otherwise treat an entire paragraph of Chinese text as a single "word"). [2] You might get away with treating each Chinese character as a "word", but this is technically wrong from a linguistic standpoint, despite cultural claims to the contrary, and will have implications. The handling of Japanese and Korean text is different from that of Chinese (lumping them together as "CJK" is inappropriate in this context), but I will leave them for others to provide a better treatment. (Jungshik Shin has already explained the Korean case.) Thomas Chan [EMAIL PROTECTED]
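To illustrate why a dictionary is needed for word breaking, here is a rough Python 3 sketch of greedy forward maximum matching, one common dictionary-based heuristic. It is not a technique named in this thread, and the three-word dictionary is a toy assumption; a real segmenter needs a large lexicon and some way of resolving ambiguity.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if nothing matches."""
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary (hypothetical): 'Hong Kong', 'university', 'student'
TOY_DICTIONARY = {"\u9999\u6e2f", "\u5927\u5b78", "\u5b78\u751f"}

# 'Hong Kong' + 'university' + 'student', written without spaces
text = "\u9999\u6e2f\u5927\u5b78\u5b78\u751f"

print(segment(text, TOY_DICTIONARY))   # three two-character words
print(segment(text, set()))            # no dictionary: one "word" per character
```

With an empty dictionary the function degenerates to the one-character-per-word treatment described in footnote [2], which is usable for line breaking but not for word extraction.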
Re: ConScript registry?
On Wed, 31 Jan 2001, Michael Everson wrote: Ar 13:23 -0800 2001-01-30, scríobh Thomas Chan: I don't think that CSUR is conclusive proof that there wouldn't be a deluge of demands for encoding fictional or constructed scripts if the likes of Tengwar or Klingon were encoded. Well, I think what David was saying is that there don't seem to be all that many of them. My primary objection was that we don't have conclusive evidence for either scenario. CSUR is just a pair of websites with nowhere near the high profile or authority of Unicode. I thought one of the Unicode web pages linked it. I could be wrong. And the CSUR states explicitly that it is just for fun. Having said that, I do know of some folks who have done implementations of one sort or another based on its specifications. I don't see any wording along the lines of "just for fun" on either CSUR website itself, except for a link on your http://www.egt.ie/sc2wg2.html page. The only thing that suggests their unofficialness and volatility is mention of the Private Use Area, but perhaps that is not clear to people who see the words "Unicode" and "Registry", and think it is the real thing, or there are problems comprehending the concept of a Private Use Area. Or perhaps they have heard about it secondhand. For example, look through the Usenet newsgroup archives at deja.com or any discussion board online and see how often people believe Klingon is in Unicode, or "going to be in the next version" of Unicode, when there has only been a proposal. (And I doubt they are looking at the WG2 proposal itself, but the CSUR registration or derivative information.) If say, a fictional script were included and published by Unicode and ISO, then people all over would suddenly be aware of the fact that a fictional script got included, and perhaps they might conclude that they should submit their own pet scripts as well. 
Thomas, if a script like Tengwar, which has thousands of users who are actually interested in writing texts in it, sorting, searching, and all that, gets into the UCS it is because there is a credible requirement to encode it. Plenty of "nonfictional" historical scripts have fewer users than Tengwar. For some of them we have a handful of texts. Tengwar on the other hand is studied by linguists, used by enthusiasts, and at any rate is an integral part of the work of one of the 20th century's finest and most influential writers. Please note that I did not single Tengwar out for criticism. I believe it has a valid argument to be encoded because of the size of the user community. It is the fictional scripts with small user communities that are the problem, and how that relates to treatment of real-world historical scripts with small user communities. For example, it is easy to find a variety of fonts for fantasy runes or other alphabets that people have created, some based off a description in published fiction, but they have not gotten in touch with CSUR. Actually there aren't all that many. Are we sure about this? It remains to be examined how they would be treated, but there are Chinese fictional scripts that have the potential capability of gobbling up codepoints like "ideographs" have done. e.g., http://deall.ohio-state.edu/grads/chan.200/misc/100fu.jpg http://deall.ohio-state.edu/grads/chan.200/misc/100shou.jpg each show a single character in what are supposedly a hundred different scripts. Most of these "scripts" could probably be conflated and treated as font variants, but a few are distinct. Multiply that by 4000-8000 each, and you might have an explosion. Or take the case of bunch of obsoleted reformist alphabets and syllabaries of the late-19th and early 20th century, such as the Guanhua Zimu ("Mandarin letters") alphabet, which is to my knowledge only partially described in one Western source. 
If I understand correctly, these would be in the same category as Deseret or Visible Speech. Or take the case of the Hotsuma Tsutae syllabary, created in modern times to provide a fictional pre-Chinese writing system (http://www.jtc.co.jp/hotsuma/index-e.htm) for what is supposedly Old Japanese, which has books and articles published about it, and fonts in existence, but it has no contact with CSUR. In fact, I *have* seen this. As I recall Ken Whistler and I looked at it when we were at the WG2 meeting in Fukuoka. How did that discussion turn out? Thomas Chan [EMAIL PROTECTED]