from:"Thomas Chan"

Re: [Fwd: Re: Swastika to be banned by Microsoft?]

2003-12-15 Thread Thomas Chan

On Mon, 15 Dec 2003, Mark E. Shoulson wrote:

  Is this like baseball scoreboards showing the third consecutive 
  strikeout symbol (which is a K) reversed?  Is that to avoid KKK or 
  is it for another reason? 
 
 Which of course begs the question of whether we should encode a LATIN 
 CAPITAL REVERSED K character.

How about getting a taboo variation indicator, like the ideographic one
proposed in WG2 N2475
(http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n2475.pdf)?

Then one can also represent the Nazi swastika (if it were encoded) the way
it is shown on censored products (e.g., model kits, toys/games, etc.),
shorn of its arms so it looks like an x, or replaced with the black
cross (shown on aircraft of the period).


Thomas Chan
[EMAIL PROTECTED]

Re: Swastika to be banned by Microsoft?

2003-12-14 Thread Thomas Chan

On Sun, 14 Dec 2003, Michael Everson wrote:
 At 15:40 +0100 2003-12-14, Stefan Persson wrote:
 Aren't the U+534D and U+5350 only defined for Asian usage, so that 
 different code points (which seem not to be defined in the current 
 version of the standard) have to be used for ancient European 
 purpose?
 
 All of the characters in the Unicode Standard are for anyone's use.

So would all swastikas be unifiable as U+534D and U+5350?

The entry for U+534D in the _Hanyu Da Zidian_, vol. 1, p. 51 (as indicated
in unihan.txt) includes a quote that it was originally not a Han
character, wan ben fei zi ..., suggesting that it now is.  There are
also serifs shown in that dictionary and the _Kangxi Zidian_ for both
characters.

Couldn't the above two characters be considered a CJK or IDEOGRAPHIC
version (like the spaces, zero, punctuation, brackets, etc. in the CJK
Symbols and Punctuation block)?


Thomas Chan
[EMAIL PROTECTED]

Re: Klingons and their allies - Beyond 17 planes

2003-10-18 Thread Thomas Chan

On Sat, 18 Oct 2003 [EMAIL PROTECTED] wrote:

 In addition to the problem of the OS substituting improper glyphs
 from inappropriate fonts unexpectedly, there's often a problem with
 line breaking.
 Since the PUA has no properties, some applications seem to ignore the
 space character and break lines arbitrarily, splitting words in the
 middle.

No properties, or Han properties?


Thomas Chan
[EMAIL PROTECTED]

Re: [OT] Meaning of U+24560?

2003-10-12 Thread Thomas Chan

On Sun, 12 Oct 2003, Patrick Andries wrote:

 I'm a bit lost now... I'm looking for U+24560, radical 89 (double x) and 6
 strokes.

Here's the _Kangxi Zidian_ entry on U+24560, as well as U+2455F's entry,
which is referenced by the former (whose first of three definitions--the
one I suspect you are looking for--references everyday U+758F):
  http://deall.ohio-state.edu/grads/chan.200/misc/kangxi_zidian_u24560.jpg


Thomas Chan
[EMAIL PROTECTED]

Re: When is a character a currency sign?

2003-07-08 Thread Thomas Chan

On Tue, 8 Jul 2003, Philippe Verdy wrote:
 On Tuesday, July 08, 2003 3:35 AM, Thomas Chan [EMAIL PROTECTED] wrote:
  On Mon, 7 Jul 2003, Philippe Verdy wrote:
  Would Euro also be a (four-character) currency sign?
 
 Certainly not: this would be a word, whose orthography varies with
 language. See the banknotes, where it is written in Greek letters, the
 capitalization also changes with language or context (all uppercase on
 banknotes, lowercase in normal French text, titlecase in German), as
 well as the plural forms according to language rules.
 
 We could say the same thing about the terms dollar, pound/livre,
 mark, escudo, peseta, yen, yuan, rupee/roupie,
 sucre... (see also the Japanese Kana square characters created for
 these terms: they are not really currency signs, but an orthographic
 representation of these names adapted to a script, mostly like a
 transliteration)...

But what does one do for a script like Han characters where those tests
don't apply?  e.g., in Chinese, U+938A is used for 'pound'--is that a
word, or a currency sign?  U+5713 or U+5143 for 'yuan'?  Etc.


Thomas Chan
[EMAIL PROTECTED]

Re: When is a character a currency sign?

2003-07-07 Thread Thomas Chan

On Mon, 7 Jul 2003, Philippe Verdy wrote:

 On Monday, July 07, 2003 9:41 PM, Michael Everson [EMAIL PROTECTED] wrote:
  At 15:03 -0400 2003-07-07, Tex Texin wrote:
   When is a character properly called a currency sign?
  
  Hunh? When you use it to represent currency. DM was two characters
  used as a character sign in Germany.
 
 As well as now the EUR international currency code, usable also
 as a symbol when the Euro sign is not available.

Would Euro also be a (four-character) currency sign?


Thomas Chan
[EMAIL PROTECTED]

Re: pinyin syllable `rua'

2003-03-14 Thread Thomas Chan

On Fri, 14 Mar 2003, Werner LEMBERG wrote:

 Some lists of pinyin syllables contain `rua', but I actually can't
 find any Chinese character with this name.

Not all words or utterances necessarily have written forms, but...

 
 Does it exist at all?  Or is it just there for completeness of pinyin?

...I found a rua2 in the _Xiandai Hanyu Cidian_, U+633C, glossed as 1)
(zhi huo bu) zhou ((paper or cloth) wrinkle) and 2) kuai yao po (about
to break).  It's marked fang (dialect).  There's a pointer to the same
character, pronounced ruo2, which in turn is glossed as roucuo (to rub),
and marked shu (bookish).


Thomas Chan
[EMAIL PROTECTED]

Re: traditional vs simplified chinese

2003-02-14 Thread Thomas Chan

On Thu, 13 Feb 2003, Zhang Weiwu wrote:
Take it easy, if you find one 500B (the measure word)  it is usually enough to
say it is traditional Chinese, one 4E2A (measure word)  is in simplified
Chinese. They never happen together in a logically correct document.

Others have already given examples of logically correct documents with
both characters, but one cannot always have the luxury of assuming the
data is not deviant.  For example, there are many electronic texts online
that are a hybrid of simplified and traditional text, because they contain
erroneous conversions from a simplified source document (typically GB2312)
to a traditional one (typically Big5).

I think zhe4 'this' (simp U+8FD9 / trad U+9019) might be better for a very
simple heuristic for modern text, since it occupies position #11 in at
least one frequency list (compared to #15 for the above-cited ge4), and as
far as I know, U+8FD9 is not one of those ancient characters that have
been promoted/reused as a simplified form.
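
A minimal sketch of the zhe4 heuristic in Python, for illustration only.
The function name guess_script and its four return labels are assumptions
made for this example, not anything proposed in the thread.  Hybrid texts
of the kind described above would come out as "mixed".

```python
# Hypothetical helper: guess the script of modern Chinese text from the
# two forms of zhe4 'this'.  A very rough heuristic, not a classifier.
SIMP_ZHE = "\u8fd9"  # simplified zhe4
TRAD_ZHE = "\u9019"  # traditional zhe4


def guess_script(text: str) -> str:
    """Return 'simplified', 'traditional', 'mixed', or 'unknown'."""
    simp = text.count(SIMP_ZHE)
    trad = text.count(TRAD_ZHE)
    if simp and trad:
        return "mixed"        # e.g. a botched GB2312 -> Big5 conversion
    if simp:
        return "simplified"
    if trad:
        return "traditional"
    return "unknown"          # zhe4 absent; the heuristic says nothing
```

Short or pre-modern texts may not contain zhe4 at all, which is why the
sketch returns "unknown" instead of guessing.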


On Thu, 13 Feb 2003, Andrew C. West wrote:
Take, for example, this Web page --
http://uk.geocities.com/Morrison1782/Texts/TianguanCifu.html -- which
transcribes a short one-act play from the Cantonese Opera tradition, published
during the Qing dynasty (probably early 19th century). It has U+4E2A
(simplified ge4) but not U+500B (traditional ge4), and yet is written mostly in
traditional characters. How would your algorithm classify such a page ?

Aren't such texts by default traditional?  Simplified text, besides
using simplified form characters, usually also entails refraining from
using variant forms (according to PRC definitions of what is a variant).  
And depending on how far one wants to stretch the definition, PRC-style
vocabulary, etc., cf., http://www.cjk.org/cjk/reference/chinvar.htm and
http://www.cjk.org/cjk/c2c/c2cbasis.htm .


On Thu, 13 Feb 2003, Marco Cimarosti wrote:
The easiest way to do it is folding both the user's query and the content
being sought to the same form (either traditional or simplified, it doesn't
matter). It may also help to fold also other kinds of variants beside
simplified and traditional.

It would help to at least fold the Unicode z-variants together.  For
example, with the possibility of Unicode data, authors have the choice of
U+6236, U+6237, and U+6238 for hu4 'door', but these are not meaningful
distinctions, and certainly a lot harder to detect than the typical
traditional/simplified case.
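
A small Python sketch of such folding, under stated assumptions: the
table below covers only the three hu4 code points mentioned above, and
the choice of U+6236 as the folding target is arbitrary.  A fuller table
might be derived from the kZVariant field of the Unihan database; the
names Z_FOLD, fold_variants, and matches are hypothetical.

```python
# Illustrative z-variant folding table: maps U+6237 and U+6238 to U+6236.
# The folding target is an arbitrary choice made for this example.
Z_FOLD = str.maketrans({"\u6237": "\u6236", "\u6238": "\u6236"})


def fold_variants(text: str) -> str:
    """Fold known z-variants to a single representative code point."""
    return text.translate(Z_FOLD)


def matches(query: str, content: str) -> bool:
    """Substring search that ignores z-variant differences."""
    return fold_variants(query) in fold_variants(content)
```

Because the query and the content are folded the same way, it does not
matter which of the three forms either author happened to type.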


On Thu, 13 Feb 2003, Edward H Trager wrote:
And I've seen books printed in the beginning years of the PRC era using
mostly simplified, but with smatterings of traditional characters here and
there.  These books were printed in the days of lead type, so I

Those must be the ones printed before the final 1964 version of the
simplification (drafts dating back to 1956, and some earlier pre-1949
usages in Communist-occupied areas), so that they do not utilize all the
simplified characters that eventually exist in the 1964 version.

There are even some cases of semi-simplified forms where one half of a
character might have been simplified according to pre-1964 rules, but the
simplification rule for the other half has to wait until 1964.  But I
think these might've been missed by Unicode, like some of the
ultra-simplified forms in the short-lived 1977 scheme, and Singapore's
temporarily different (from the PRC's) schemes prior to 1976.


On Fri, 14 Feb 2003, Andrew C. West wrote:
Now if Hanyu Da Cidian were to be put onto the internet ...

How about the one here?  http://202.109.114.220/


Thomas Chan
[EMAIL PROTECTED]

Re: 4701

2003-02-04 Thread Thomas Chan

On Tue, 4 Feb 2003, Andrew C. West wrote:
 I have a half-finished page that gives the names of the twelve
 calendrical animals in the languages of various peoples within and
 bordering China that have adopted the Chinese calendrical system,
 available at :
 http://uk.geocities.com/BabelStone1357/Calendar/index.html

That makes a very nice multilingual Unicode demonstration.

I've typed up my notes on the Vietnamese and Korean ones here, for you:
  http://deall.ohio-state.edu/grads/chan.200/misc/cal.html
I've checked the native orthography with dictionaries (but would
appreciate double-checking by natives) but not the definitions (would also
appreciate that being checked/confirmed).  For Korean, I haven't provided
a transcription nor transliteration (that affects for example, the second
from last, 'chicken', with regards to the underlying r).

I don't know if any of these are bound forms, or if any are not the
everyday term for those animals, cf., the alternative Chinese term in
your list for 'dog', gou3, which replaces the quan3 vocabulary item in
almost all dialects.


 As I recall, in Vietnamese the rabbit is replaced by a cat, and in

I've heard of that, but my teacher also said that tho? 'rabbit' was also
possible.


Thomas Chan
[EMAIL PROTECTED]

Re: 4701

2003-02-01 Thread Thomas Chan

On Sat, 1 Feb 2003, Michael Everson wrote:

 At 10:19 -0800 2003-02-01, Eric Muller wrote:
 Michael Everson wrote:
 Happy New Year of the Yáng to everybody! (I can't work out whether 
 it's the Year of the Sheep, the Goat, or the Ram.)
 
 Ram.
 
 europe.cnn.com (which I was looking at for other, sadder reasons), 
 says Goat. My local Superquinn's (large grocery chain) has had signs 
 on all the Chinese food for weeks which says Ram. My Chinese 
 dictionary says Sheep.

And the website of the Pearl River (www.pearlriver.com) department store
in New York City says lamb!  unihan.txt says that U+7F8A is 
sheep, goat; KangXi radical 123.  On Google, year of the goat has the
lead.

And is it 4701 or 4700?--the only thing that is certain is that it is the
guiwei year of the sixty-year cycle of Cathay.


Thomas Chan
[EMAIL PROTECTED]

RE: Precomposed Tibetan

2002-12-18 Thread Thomas Chan

On Wed, 18 Dec 2002, Marco Cimarosti wrote:
 Andrew C. West wrote:
  If anyone thinks that a mapping table would be
  useful as a weapon in the fight against the Chinese proposal, 
  I would be happy to provide one.
 
 Do you have the relevant data?  As I said, so far I found little or nothing
 about BrdaRten or about the Founders System mentioned by Ken Whistler.

Previously, Ken Whistler said:
  One additional detail for people. The BrdaRten stacks are currently
  implemented, in the Founders System software in Tibet, as an extension
  to GB 2312.

This sounds like they might have been implemented as a vendor extension in
the private/end-user area of GB 2312, if it is anything like how
as-yet-unencoded Han characters are treated.  If so, then one'd probably
need access to a font itself to see.  Looking at Founder's site, I found
this, a bunch of Tibetan fonts they make:
  http://font.founder.com.cn/chanpinzl/CP_zangwen.htm

In the body text, they describe Tibetan as 600+ (pre-composed) characters,
and 4,400+ if Sanskrit is included.  But next to each font, it says 4000+
for the first one (Tibetan and Sanskrit), 2000+ for the second one
(Tibetan and Sanskrit), and for the last three, 800+ (Tibetan).  WG2 N2558
only proposes 956 pre-composed, so I'm not sure what these
different numbers mean, except that counts sometimes cavalierly include
irrelevant stuff like punctuation and symbols to pad the number.

But so far I haven't seen anything strongly linking Founder to WG2 N2558,
except that the latter mentions Founder as an *example* of a precomposed
Tibetan implementation (2).  We don't necessarily want to be making
vendor/legacy/font-based to Unicode mapping tables for every potential
vendor, do we?


Thomas Chan
[EMAIL PROTECTED]

Re: CJK fonts

2002-12-16 Thread Thomas Chan

(I've merged Andrew's two messages--12/13 and 12/16--together, below.)

On Fri, 13 Dec 2002, Andrew C. West wrote:
 On Fri, 13 Dec 2002 01:33:08 -0800 (PST), Thomas Chan wrote:
  I can't imagine where the yi4 reading comes from, although I note
 
 I was thinking along the same lines. The Kangxi Zidian gives U+3CBC a reading of
 YI4 (as does the Unihan database - the CHA4 reading seems to be as a variant
 form of U+6C4A).

What edition of the _Kangxi Zidian_ are you using that gives explicit
Mandarin readings like yi4, or are you interpreting the fanqie notation
yourself?  I use the 1958 edition, 1997 2nd printing published by
Zhonghua, ISBN 7-101-00518-7.

I find self-interpretation of fanqie to be fraught with peril, partially
as fanqie was never a completely perfect transcription system, not to
mention that fanqie from old dictionaries does not necessarily tell one
anything about contemporary pronunciation.

e.g., U+5B7B, is a Yue (Cantonese), Hakka, and Min character, meaning
'last (child)' (derived from 'last child of an old man', hence the
character's appearance as 'child' + 'to use up'), pronounced laai1 or lai1
in Cantonese.[1]  However, the old dictionaries including Kangxi give a
fanqie of U+6CE5 U+53F0 U+5207, which would yield an artificial nai2 in
Mandarin, which is exactly what the _Hanyu Da Zidian_ says explicitly.
Either the pronunciation has changed from [n-] to [l-] and reading old
dictionaries fails to account for modern developments, or whoever chose
U+6CE5 to indicate the onset was pronouncing U+6CE5 as *l-.

[1] While there is a long-standing ongoing sound change in Cantonese from
[n-] to [l-], this is probably no longer one of them, and *naai1/nai1
would now be regarded as hypercorrection.
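
To illustrate how a mechanical reading of this fanqie yields nai2, here
is a deliberately naive Python sketch: it takes the onset from the first
speller and the rime and tone from the second, using modern Mandarin
values (U+6CE5 ni2, U+53F0 tai2).  The READINGS table and the function
name are assumptions for this example, and the method ignores historical
sound change entirely, which is exactly the peril described above.

```python
# Naive fanqie: onset of the upper speller + rime and tone of the lower.
# Readings are modern Mandarin, split by hand into (onset, rime, tone).
READINGS = {
    "\u6ce5": ("n", "i", 2),   # ni2, supplies the onset
    "\u53f0": ("t", "ai", 2),  # tai2, supplies the rime and tone
}


def naive_fanqie(upper: str, lower: str) -> str:
    """Combine two spellers mechanically, with no historical phonology."""
    onset, _, _ = READINGS[upper]
    _, rime, tone = READINGS[lower]
    return f"{onset}{rime}{tone}"


print(naive_fanqie("\u6ce5", "\u53f0"))  # prints nai2
```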

[...]


 At any rate, what I think is important is that we do not assume that YI4
 is wrong and throw it out just because none of us recognise the reading ...
 though I guess if it is that obscure, it really hasn't got a place in the Unihan
 database.

But what if the character is obscure, and the reading thus also obscure?
I think there are diminishing returns to over-proofreading the
unihan database for such characters--if they are so rare, then no one will
find the character by searching on an obscure/artificial reading, and if
it is so rare, then those interested should be consulting actual
comprehensive dictionaries (like the Kangxi or _Hanyu Da Zidian_) instead
of relying on a text file.  In a way, we currently have this 
situation--the Plane 2 characters are, on average, more obscure than the
BMP characters, and the lack of information is kind of saying look it up
yourself if you really, really need to know.


 If Hanyu Da Zidian and Hanyu Da Cidian both give GAN4 for the modern
 reading of U+5481 I for one would prefer that reading to GEM4. Ci Hai
 also has such non-Mandarin syllables as NGU2 for U+5514. The principles
 of Pinyin are clearly defined (and like most PRC dictionaries Ci Hai
 includes a copy of the Hanyu Pinyin Fang'an as an appendix - even if it
 does not fully adhere to it), and syllables like GEM4 and NGU2 are
 simply not allowed.

I agree with your sentiment that gem4 is an aberration, despite my
support of the _Cihai_ (PRC 1979) in that it did not get included in the
unihan database from out of nowhere.  When U+5481 was reinvented by the
Cantonese, it was patterned both graphically and phonologically on U+7518,
which is gan1 'sweet' in Mandarin (gam1 in Cantonese).  U+5481 is in
Cantonese gam3 'so (quantity)' (3 = yinqu tone); hence gan4 is an
appropriate Mandarin reflex.

ngu2 for U+5514 is also an aberration--yet another case of a quixotic
attempt to mimic dialect pronunciation in Mandarin.  Sure, it's m4 (a
syllabic nasal [m]) 'not' in Cantonese, but this is just a re-use of a
pre-existing semi-homophonous character, ng4 (another syllabic nasal;
considered close enough to m4 in Cantonese), a sound in singing.  As that
is wu2 in Mandarin, so too should 'not' be given an artificial *wu2
reading (which is what the unihan database has currently--no doubt that
piece of data was input from a more sensible dictionary).

But elsewhere, this battle is lost--U+5187 'to not have' (among other
meanings), perhaps the most recognizable Cantonese character to
non-Cantonese, is nowadays given the pronunciation mao3[2], even though
earlier dictionary compilers such as Samuel Wells Williams, in his 1877
dictionary, recognized it as derived from U+7121 with a tone change and
assigned it a Mandarin wu3 reading accordingly.

[2] I note that even mao is a poor approximation; *mou would've been
closer (and still a valid and normal Mandarin syllable).


 On the other hand, a reading of FIAO4 for the dialectal ideograph U+8985
 may sound odd to a Mandarin speaker, but it is perfectly acceptable
 according to the rules of Pinyin (F is a valid initial, and IAO4 is
 a valid final). FIAO4 is the only reading for this ideograph given in
 Hanyu Da Cidian, Ci Hai

Re: CJK fonts

2002-12-12 Thread Thomas Chan

On Thu, 12 Dec 2002, Andrew C. West wrote:

 On Thu, 12 Dec 2002 03:26:07 -0800 (PST), Raymond Mercier wrote:
  For example, the simplified form of the character Han itself (U+6C49) is 
  given the Pinyin reading Yi, the traditional form U+6F22 is the correct 
  reading Han.
 This is probably another example of misplaced secondary Mandarin readings - I
 reckon that about 10% of the CJK block (i.e. a couple of thousand of characters)
 are affected. Unihan Version 3.0 (the latest version to have the correct
 Mandarin readings for the CJK Unified Ideographs block) gives :
 U+6C49  kMandarin  YI4 HAN4
 In Unihan 3.2 this becomes :
 U+6C49  kMandarin  YI4
 and the reading of HAN4 is mislocated to U+6C44 :
 U+6C44  kMandarin  HAN4 ZE4 (plain ZE4 in Unihan 3.0)
 It is quite possible that YI4 is a reading for U+6C49 when not a simplified form
 of U+6F22 (I'll have to check this when I get home this evening ... no
 dictionaries here I'm afraid).

The _Hanyu Da Zidian_ (3: 1549) only says that U+6C49 is a simplified form
of U+6F22, as expected, which in turn has a han4 and a tan1 reading (3:
1714)--I don't know if the rarer tan1 (and its associated definition) can
really be inherited by U+6C49, though.

U+6C44 is given as interchangeable (3: 1548) with U+3CC1, which has a ze4
reading (3: 1560).

I can't imagine where the yi4 reading comes from, although I note
that U+3CBC, which looks somewhat similar to U+6C49, is given both yi4 and
cha4 readings (3: 1549).
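
Shifts like the HAN4 one quoted above can be spotted mechanically by
comparing the kMandarin field between two releases.  A rough Python
sketch follows, assuming the tab-separated Unihan layout (code point,
field name, value); the function names are hypothetical, and the sample
data are just the two entries quoted in this thread.

```python
def parse_kmandarin(lines):
    """Map code point -> list of kMandarin readings from Unihan-style lines."""
    readings = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.rstrip("\n").split("\t", 2)
        if field == "kMandarin":
            readings[code] = value.split()
    return readings


def changed(old, new):
    """Return {code point: (old readings, new readings)} where they differ."""
    return {
        cp: (old.get(cp, []), new.get(cp, []))
        for cp in sorted(set(old) | set(new))
        if old.get(cp, []) != new.get(cp, [])
    }


v30 = parse_kmandarin(["U+6C49\tkMandarin\tYI4 HAN4", "U+6C44\tkMandarin\tZE4"])
v32 = parse_kmandarin(["U+6C49\tkMandarin\tYI4", "U+6C44\tkMandarin\tHAN4 ZE4"])
for cp, (before, after) in changed(v30, v32).items():
    print(cp, before, "->", after)
```

A reading that disappears from one code point and turns up on a nearby
one in the same release is a reasonable candidate for a misplaced entry,
though only a dictionary check can confirm it.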


 U+5481  kMandarin  GEM4 - GEM4 is Cantonese pinyin (it is a common Cantonese
 ideograph) - I don't think this ideograph has a Mandarin reading ... but if it
 did it would presumably be GAN4 ... which is the reading I give it in BabelMap

The _Hanyu Da Zidian_ (1: 598) has han2 'breast; milk', xian2 'to hold in
the mouth', and gan4 'so (quantity)' for readings.  But I don't think
gem4 is an error--the 1979 PRC _Ci Hai_ has gem4 'so (quantity)' and
han1 'so (quantity)' for readings, where gem4 corresponds to the same
vocabulary item as gan4 in the former dictionary.  (The han1 reading is
not Cantonese usage of the character, but Hunanese, despite the identical 
meaning.)  It's rare, but sometimes there are unusual Mandarin syllables
like gem4 given in dictionaries.

However, I don't expect much reason when it comes to interpreting and
creating artificial Mandarin cognate readings of ancient or dialect
words, e.g., U+5C58 'child' is given as man3, but this is based on
a misreading of a pronunciation gloss--a phonologically reasonable
cognate reading of this Taiwanese dialect character would have been
*man1; likewise, U+55F2 'coquettish' is given as the unusual Mandarin
syllable dia3--a phonologically reasonable cognate reading of this
Cantonese dialect character would have been *die3 (patterning on U+7239
die1 'father' both graphically and phonologically).

 
 U+4C5B  kMandarin  XU4M - this is from CJK-A in Unihan 3.2 ... I assume that the M
 is spurious

xu4 and yi4 in the _Hanyu Da Zidian_ (7: 4695).

 
 U+6F71  kMandarin  YIE - this should be YI1

_Hanyu Da Zidian_ (3: 1736) says ye1 here, but I bet you have a source
that backs yi1 just as well...

 
Thomas Chan
[EMAIL PROTECTED]

Re: The result of the Plane 14 tag characters review

2002-11-18 Thread Thomas Chan

On Mon, 18 Nov 2002, Michael Everson wrote:

 At 13:37 -0800 2002-11-18, Kenneth Whistler wrote:
 Go to any Japanese newspaper. There is no required change of
 typographic style when Chinese names and placenames are mentioned
 in the context of Japanese articles about China.
 Go to any Chinese newspaper. There is no required change of
 typographic style when Japanese names and placenames are mentioned
 in the context of Chinese articles about Japan.
 
 Just to be sure: this means that a Japanese newspaper uses 
 the glyphs its readers prefer for Chinese names, not glyphs which 
 Chinese readers may prefer? Does this extend to the 
 Simplified/Traditional instance, so that if a Chinese name has the 
 word for horse in it, it uses the Japanese glyph for horse, not either 
 the S or T version of the glyph (assuming for the sake of argument 
 that both occur and that both are different from the preferred 
 Japanese glyph)?

Yes.  Not only are Japan's preferred glyphs used, but the actual
characters are changed, if necessary.  e.g., the famous eighteenth-century
Chinese novel, _Hongloumeng_ (Dream of the Red Chamber), is studied in Japan,
where it is known as _Kouroumu_.  In TC, one writes U+7D05 U+6A13 U+5922
for the title, while in SC, U+7EA2 U+697C U+68A6--all three characters are
different.  Searching on Japanese pages only at Google for the TC form,
I only find 153 matches, whereas the actual Japanese form of the title,
U+7D05 U+697C U+5922, which differs from the TC in the second character
(or, to look at it in another way, differs from SC in all but the second
character), finds 2,730 matches.
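
The relationship between the three forms of the title can be checked
position by position.  A short Python sketch, using only the code points
given above; the dictionary keys and the function name are illustrative
assumptions.

```python
# The three forms of the title, from the code points cited above.
TITLES = {
    "TC": "\u7d05\u6a13\u5922",
    "SC": "\u7ea2\u697c\u68a6",
    "JA": "\u7d05\u697c\u5922",
}


def differing_positions(a: str, b: str) -> list:
    """Return the 0-based positions at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(differing_positions(TITLES["TC"], TITLES["SC"]))  # [0, 1, 2]
print(differing_positions(TITLES["TC"], TITLES["JA"]))  # [1]
print(differing_positions(TITLES["SC"], TITLES["JA"]))  # [0, 2]
```

So the Japanese form matches neither Chinese form as a whole string,
which is why a plain search for the TC title finds so few Japanese pages.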

Granted, looking only at electronically-stored instances has its flaws,
such as the limitations of legacy character sets, but since both U+6A13
(form of second character used in TC) and U+697C (actual second character
used in the Japanese form of the title) are available in a Japanese
character set, the use of the latter is clearly deliberate.


Thomas Chan
[EMAIL PROTECTED]

Re: A .notdef glyph

2002-11-07 Thread Thomas Chan

On Thu, 7 Nov 2002, John Hudson wrote:

 At 13:07 11/7/2002, John Cowan wrote:
 Wouldn't the glyph for the GETA SIGN be suitable as a .notdef glyph?
 That seems to be just what GETA is for.
 
 Aha! Thank you, I'd never noticed that before. I think the GETA MARK would 
 be ambiguous to a non CJK user, but I like the idea of the strong 
 horizontal bars very much.

GETA MARK is also ambiguous to Chinese readers; an M-sized WHITE SQUARE
or WHITE CIRCLE (or LARGE CIRCLE) are more familiar.

I'm not familiar with how the GETA MARK is supposed to be used in
Japanese, but I hesitate to blur the possible distinction between
1) there's a character here but you don't see it because the font is
missing a glyph, 2) there's no character here for you to see because
what the author would like to put there is not encoded in Unicode, and
3) there is expected to be something here (e.g., a letter, an ideograph,
etc) but the author doesn't even know what it is (e.g., transcribing a
tablet with broken pieces or paper with insect damage, or  
undecipherable/illegible source text).  I don't think the distinction
between #2 and #3 need or should be standardized at this level--it is up
to a convention that the author should establish with the reader, as with
any specialized notation--but there is certainly a difference between #1
(author succeeds in writing but reader fails in viewing) and #2/#3 (author
fails in writing).  Given the current white box/rectangle (or other
symbols) for notdef, if I see one of those, I really don't know if my font
is defective, or if the author voluntarily put it there to signify
something.


Thomas Chan
[EMAIL PROTECTED]

RE: In defense of Plane 14 language tags (long)

2002-11-05 Thread Thomas Chan

On Tue, 5 Nov 2002, Marco Cimarosti wrote:

 Doug Ewell wrote:
  Readers are asked to consider the following arguments individually, so
  that any particular argument that seems untenable or contrary to
  consensus does not affect the validity of other arguments.
  1.  Language tags may be useful for display issues.
  The most commonly suggested use, and the original impetus, 
  for Plane 14 language tags is to suggest to the display
  subsystem that “Chinese-style” or “Japanese-style” glyphs
  are preferred for unified Han characters. [...]
 
 IMHO, there has never been any practical need to consider these glyphic
 differences in plain text. It is a non-issue raised to the rank of issue
 because of obscure political reasons.
 
 It is false that Japanese is unreadable if displayed with Chinese-style
 glyphs, or that Polish is unreadable if displayed with Spanish-styles acute
 accents.

It is also not even an issue of language, but national glyph preferences.

It is only through situations such as Chinese (language) used in
mainland China and Chinese (language) used in Taiwan that we are able to
disambiguate whether glyph preference differences are due to language or
country--and it is country, because mainland China has implemented glyph
reforms that reduce the differences between printing and handwritten
print forms by making the former resemble the latter[1].  Certainly, it is
confusing what part of Japanese-style glyphs is due to language or
country, but it is misleading to present it as a language issue, and use
the worst case scenario (comparison with glyphs from mainland China) for
illustration; many of the differences vanish if a comparison is made
with more conservative-looking glyphs from Taiwan.

[1] A xin-jiu zixing duizhao juli chart (select examples of comparison
of new and old character glyph forms), from the 1979 PRC _Cihai_
dictionary:
  http://deall.ohio-state.edu/grads/chan.200/misc/cn_newold_glyphs.jpg
Each of the four columns contains three subcolumns, where the leftmost
subcolumn is the new form, the center subcolumn is the old form, and
rightmost subcolumn gives examples of characters utilizing the new glyph
forms.  The circled number refers to the number of strokes.  Among the
examples is the fifth example in the third column from the left, of
U+76F4 'straight/direct', and the thirteenth example in the same column,
U+9AA8 'bone'.  (It is true that some of these glyph reforms have been
encoded separately as separate characters, although some cases are due to
source separation.)


Thomas Chan
[EMAIL PROTECTED]

RE: The Currency Symbol of China

2002-10-01 Thread Thomas Chan


On Mon, 30 Sep 2002, Thomas Chan wrote:

 (Was U+56ED what you saw, James?--I don't have my Krause catalog by me at
 the moment, but I think it was present on older PRC coinage.)

A correction to myself here--I thought I had seen U+56ED as a currency
unit, but now I cannot find a reference in my notes, so I'm retracting
this one.


James Kass said:
I don't blame you.  According to Krause...
One Dollar (Yuan) = 100 Cents (Fen/Hsien) = 1000 Cash (Wen/Ch'ien) =
(=)  0.72 Tael (Liang) = 7 Mace and 2 Candareens
...and, that's just for starters.

Well, the last part is a different system--mace and candareens are weight
measures for silver coins as part of the tael system: liang/qian/fen/li
(tael/mace/candareen/?).  Hence, there are three systems: dollar, cash,
and tael.
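
The weight figure James quotes is internally consistent, since the tael
system is decimal (1 tael = 10 mace = 100 candareens).  A tiny Python
sketch of the arithmetic, working in whole candareens to avoid floating
point; the function name is an assumption for this example.

```python
def split_weight(candareens: int):
    """Split a weight in candareens into (taels, mace, candareens)."""
    taels, rest = divmod(candareens, 100)  # 100 candareens per tael
    mace, cand = divmod(rest, 10)          # 10 candareens per mace
    return taels, mace, cand


# 0.72 tael is 72 candareens, i.e. 7 mace and 2 candareens.
print(split_weight(72))  # (0, 7, 2)
```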

The 1/100th units fen and xian (hsien in Krause) are part of different
systems: yuan-jiao-fen in the north, and yuan-hao-xian in the south.
(xian U+4ED9 < English 'cent', even in Macau, where 1/100th of a pataca is
an avo.)  The northern and southern systems may be seen residually in
contemporary Hong Kong and Macau, and historically during the early 20th
century during a period of provincial minting in mainland China, where
people used their local terminology on their coins, with the exception
of the 1.0 unit.  The situation is similar for the 1/10th unit; jiao in
the north and hao in the south.


Marco Cimarosti said:
U+5143 元
U+5186 円
U+5706 圆
U+570E 圎
U+5713 圓

Thank you for finding these--I didn't realize that U+570E was encoded
independently of U+5713, and not as a font variant of the latter.  (And I
had forgotten the obvious U+5713 ~ U+5706 connection.)

I checked Krause--U+5713 may be seen on pre-war Japanese coinage for
yen.


Alan Wood said:
I have added all of the symbols from this discussion to the second table
on
my page at:
http://www.alanwood.net/unicode/currency_symbols.html

Please remove U+56ED--that was my mistake.

U+6587 is not entirely appropriate there--while it was a currency unit
(approx. 1/1000th yuan), it was gone in all regions by the early 1930s,
and now it is just (at least) a colloquial Cantonese synonym for yuan, sort
of like northern kuai4 U+584A/U+5757 'piece'.  I can provide you with a
bunch of other terms for 1/10th and 1/100th units, but once one steps into
the realm of Han characters, one is no longer dealing with symbols but
words, and the list can inflate very quickly unless restrictions are set,
such as primary currency units (not 1/10th or 1/100th units) in
contemporary use (not historical) that appear on currency (not
other terms like bucks, benjamins, etc).

U+5713 I wouldn't list as a yen/yuan variant--it should be on the same
level as U+5143 and U+5186, as U+5713 (Yuan) is the unit used in Taiwan
and Hong Kong on the currency (despite being dollars in English).


Thomas Chan
[EMAIL PROTECTED]

RE: The Currency Symbol of China

2002-09-30 Thread Thomas Chan


Lots of confusion.  I don't know the origin of yen for the Japanese
currency, aside from hearing that it was the way it was spelled (perhaps
in Hepburn's dictionary) and adopted as such in English, and that the
source of the ye might have either historical and/or regional 
pronunciation--i.e., not a phonemic difference distinct from e.
(Corrections appreciated here.)  I do have the following (other) remarks
on the characters that some have brought up, though:


Marco Cimarosti wrote:
Similarly, yen is just the Japanese (kun) pronunciation of Chinese
yuan.

Stefan Persson wrote:
Yen (円) is U+5186, while yuan (元) is U+5143.

Yen is an ancient on pronunciation for U+5186; today it's pronounced
en.

James Kass wrote:
How about U+5143 ?  (smile)
Looking at pictures of Chinese coins in the Krause catalog, some
coins used an ideograph other than U+5143, but a quick search of
CJK BMP ranges didn't find it.  (Doesn't mean it's not there.)
This other character looks like rad. 31 surrounding stacked rads
30, 72, and 9.  (The pictures are a bit fuzzy, though.)

The Japanese currency may be U+5186 today, but that is just a
simplification of U+5713.  Chinese took a different path of simplification
and variants, including U+56ED and today's (PRC) U+5143.  (The Korean
won currency is of the same etymology, though not U+571C hwan,
although the theme of a circular object--rounds?--is still present.)
(Was U+56ED what you saw, James?--I don't have my Krause catalog by me at
the moment, but I think it was present on older PRC coinage.)

I wouldn't seriously advocate U+5143 :) --that is a word and not a symbol,
cf., $100 vs. 100 dollars--the symbol is prefixed out of speech order,
but the word is suffixed per pronunciation. But if we are to get into
writing out currency amounts in longhand words, there is at least also
U+6587, formerly approximately 1/1000th of a yuan, but now promoted to
equal status as the yuan in Cantonese-speaking areas unofficially (i.e.,
it appears on price tags, but not money).  This man is also
ridiculously written U+868A 'mosquito'.  (I'm not going to get into 1/10th
and 1/100th units at this time.)


Thomas Chan
[EMAIL PROTECTED]

Re: [CHN] 2 geography questions

2002-09-30 Thread Thomas Chan


On Mon, 30 Sep 2002, villafea wrote:

 my self-induced geography study is an ongoing disaster.  
can someone please give me the tones or characters for _Lushun_, 
before i go insane?  

Port Arthur?  Luu3shun4 旅順.


also, an old map seems to refer to a _Taku_ in Hebei.  is there now 
a modern name for Taku?  the only gloss i can find refers instead 
to a municipality in Taiwan.

You mean Taku as in the Taku forts?  It's Da4gu1 大沽.


Thomas Chan
[EMAIL PROTECTED]

Re: glyph selection for Unicode in browsers

2002-09-26 Thread Thomas Chan


On Thu, 26 Sep 2002 [EMAIL PROTECTED] wrote:

 Tex Texin wrote,
  Given the (un)workable approach, do you then intend to have variants of
  code2000 for CJKT, so one can make the appropriate assignments? (ugh!)
 
 Code2000's coverage of CJKTV ideographs isn't adequate to support any language
 yet.  Eventually and hopefully the repertoire will be completed.  Given the
 current ceiling of 65536 max glyphs per font, it might not be feasible to
 try to have one font cover all scripts and variants, but time will tell.

I don't mean to detract from the point of this discussion, nor to
criticize a particular font, but I think the Han glyphs in Code2000 are
aesthetically disappointing in that they are distorted enough (shape,
proportions, and positioning) that they differ from any typical CJK font
more than two comparable CJK fonts may differ due to language/country
glyph preferences.  Compare, for instance, with other sans serif CJK
fonts like Arial Unicode MS, (cn) MS Hei, or (ja) MS Gothic.

But changing the example to fonts like Arial Unicode MS doesn't completely
solve everything--a sans serif font is not the norm for non-trivial
quantities of CJK text (compare any book or newspaper).  These problems
would cause rejection of a font faster than adverse reactions to
foreign/unfamiliar glyph designs.  (The aging serifed Bitstream Cyberbit
font might be a better example in this respect.)


Thomas Chan
[EMAIL PROTECTED]

RE: Latin vowels?

2002-09-09 Thread Thomas Chan


On Mon, 9 Sep 2002, Marco Cimarosti wrote:

 Mark Davis wrote:
 4. List Nonvowels - ambiguous letters that are probably vowels:
  U+0059 # (Y) LATIN CAPITAL LETTER Y
  U+0079 # (y) LATIN SMALL LETTER Y

 I would consider all these as vowels, although I know there is much room for
 errors:
 - Y is historically a vowel, and it still is mainly a vowel in all languages
 using it  (including English and French: système,  quickly). In English
 and French, however, it can be a consonant (e.g., yes). In orthographies
 derived from English-based  romanizations (e.g., Pinyin), it is always a
 consonant.

This vowel vs. consonant distinction is really unsatisfyingly simplistic.
It sounds like the (US) grade-school list of vowels: a, e, i, o, u ... 
and sometimes y.  About Pinyin, some sources would disagree and set up a
zero initial, so that (initial) y is just a way to write i [i]
(clearly a vowel) occurring at the beginning of syllables.  I know you
meant to give an example of y used as a glide or approximant (which
laymen would consider a consonant), and there are surely better
examples of it, but we can't always judge the origin or inspirations of
romanization systems, either (Pinyin comes up again--some people are still
unconvinced that it is free of Cyrillic or Albanian influences).


Thomas Chan
[EMAIL PROTECTED]

Re: Strange resemblances and weird sisters

2002-07-11 Thread Thomas Chan


On Wed, 10 Jul 2002, Kenneth Whistler wrote:

 Then in Extension B there are many, many weird and wonderful
 candidates for strangest CJK characters. Some of my
 personal favorites include:

Looks like they are all attempts to create modern reflexes of characters
that never made it past the seal script stage ~2,000 years ago.

 
 U+26B99
 U+20137

Neither of these two are in the _Kangxi Zidian_ or _Hanyu Da Zidian_--what
are they?  They seem to be the fault? of CNS 11643.


 U+20572

Ancient form of U+96E8, yu3 'rain'.  Looks like it, too!  But U+96E8 is
more concise.


 U+2069C

Ancient form of U+20698, tao1, a basket for feeding cows.  But this one
seems like a redundancy--certainly, neighboring U+206A1 would be the closest
attempt at creating a modern reflex that mimics the appearance of the
ancient form, but also neighboring U+20698 looks more reasonable as a
modern reflex that looks modern.  U+2069C just seems like an in-between
form that is neither here nor there, although _Hanyu Da Zidian_ does have it.

It seems like critics prefer U+20698--both _Kangxi Zidian_ and _Hanyu Da
Zidian_ include it, as well as a North Korean character set, and
CNS 11643, whereas U+206A1 is backed only by CNS 11643.

I wonder why these three were not unified--surely, they are just three
attempts to convert a dead character into a modern form that are within
the range of font or glyph variation.  I think I know the answer is
source separation, but that would not prevent U+2069C...


 U+2696E

Ancient form of U+7232, wei2 'to make'; wei4 'on behalf of'.  This doesn't
look like it is the direct ancestor of U+7232 like the 'rain' case above,
but an ancient form that lost out to the ancient form which was the
ancestor of U+7232.

 
 With such genetic defects, one would have expected such
 characters to die out long ago, but Unicode has brought them
 back to life.

Like U+2069C, which was born maybe 15-20 years ago...


 And of course, there is always the miraculous proliferation
 of turtles... ;-)

Look through the 'tiger' radical section of the SIP and there's a dozen or
so used as signs and flag signals just by the _Tian Di Hui_ organization.
I suppose some of them are technically logos...

Mind sharing a list of those turtles?


Thomas Chan
[EMAIL PROTECTED]

Re: Phaistos in ConScript

2002-07-09 Thread Thomas Chan


At 20:48 -0400 2002-07-08, John Cowan wrote:
Michael Everson scripsit:
  My point being that though Beijing and Hong Kong newspaper headlines
  might present LTR or RTL directionality without mirroring, this
  practice is rare or indeed unknown in Europe at 1700 BCE.

In any event, the so-called RTL CJK is more like TTB-RTL with columns
of size one.  (That sentence deserves a jargon award of some sort.)

That analysis may appear to be the case for headlines or signs, but it
doesn't hold up for some multi-row photo captions or some movie subtitles.

(I'd provide a scan, but the newspaper I was going to use has since
switched from top-to-bottom to left-to-right.  I should've kept some of
the issues that had 911 embedded in rtl headlines, too.)


Thomas Chan
[EMAIL PROTECTED]

Re: Encoding of symbols and a lock/unlock pre-proposal

2002-05-21 Thread Thomas Chan


On Tue, 21 May 2002, William Overington wrote:

 Yes, I feel that it is worth putting forward a proposal for the open and
 closed padlock symbols, yet wonder if I may make mention that maybe the
 words should be unlocked and locked as adjectives rather than unlock
 and lock as imperative verbs.
 Surely, a padlock is either unlocked or locked, so that the symbols indicate
 the state in which a system now exists.  This then raises the question as to
 whether there should be symbols for unlock and lock as imperative verbs,
 such that those symbols would indicate where to click so as to change from
 being in an unsecure state to a secure state or from being in a secure state
 to an unsecure state.  This then gets into the fact that with a padlock one
 needs a key to unlock it but one does not need a key to lock it, yet using a
 key symbol to mean unlock would seem to go against the way that computer
 systems are organized in that a key might seem more naturally to mean lock,
 notwithstanding that one does not need a key to lock a padlock.

I'm not a big fan of pictographs and prefer to see real writing, but as an
alternative to a locked and unlocked padlock, isn't there also an intact  
key and a broken key as allographs?  I think Netscape Navigator once used
these.


Thomas Chan
[EMAIL PROTECTED]

Re: CJK Unified Ideographs Extension B

2002-05-13 Thread Thomas Chan


On Mon, 13 May 2002, William Overington wrote:

 I have been looking at the characters in the CJK Unified Ideographs
 Extension B document.  These are the characters from U+020000 through to
 U+02A6DF, which, as I understand it, are the rarer CJK characters.
 I wonder if any of the people who read this list who understand the
 languages involved might please like to say what any one or two of these
 characters, of their choice, mean please, just as a matter of general
 cultural interest for people who see these characters in the Unicode
 specification and, though not themselves knowledgeable of the languages,
 find the characters interesting for their artistry and history.

Culturally, the majority of them are really not that interesting.  Here's
ten random ones from Plane 2:

U+224D3 is an ancient form of U+4F5C, zuo4 'to make'.
U+22984 comes from a Vietnamese source--I don't have any info on it.
U+230C4 is xin1; meaning unknown.
U+24ECB is an erroneous form of U+765F, bie3 'shrivelled'.
U+25BF6 is ku3, the name of a kind of bamboo.
U+27028 is lie4 'movement of grass'.
U+28966 is qi2 'sharp'.
U+294DF is kan3, something having to do with a distorted or ugly head/face
  (I don't quite understand its definition.)
U+2A1FD is a variant form of U+6B4E, tan4 'to sigh'.
U+2A606 is xiu1; meaning unknown.


Thomas Chan
[EMAIL PROTECTED]

on U+7384 (was Re: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters))

2002-05-10 Thread Thomas Chan


On Sat, 16 Mar 2002, John H. Jenkins wrote:

 On Friday, March 15, 2002, at 07:39 PM, Thomas Chan wrote:
  Is this open to names written with taboo-avoiding forms of characters
  which omit strokes?  e.g., U+7384 less the final stroke.  Or are these
  unified with the normal forms?
 
 The UTC will consider at its next meeting a proposal for a IDEOGRAPHIC 
 TABOO VARIATION INDICATOR for precisely this reason.  Sorry.

I just found U+248E5 as the four-stroke taboo-avoiding form of U+7384.
(That disqualifies me there! :)  I didn't expect to find it disunified.)  

The kIRGKangXi fields for both also suggest that they are not unified
(although kAlternateKangXi and kKangXi sometimes say otherwise).  

However, it seems that the four- and five-stroke forms are unified and
interchangeable when used as radicals for U+7385 .. U+7388, U+248E6 ..
U+248E8--these characters are given in the _Kangxi Zidian_ with the
four-stroke form radical, but kIRGKangXi maps them anyway.  I guess it'd
stink to have to encode dupes just for the sake of taboo.  (Taboos
aside, I find many cases of this elsewhere, where two characters are not
unified in isolation, but apparently only one participates as a component
in the formation of other characters.)

And to think that U+248E5 could've been avoided if Kangxi was published
post-Qing, or if a post-Qing corrected edition (i.e., taboos removed
and orig. characters restored) had been used (I have no idea if such a
thing exists, though).


Thomas Chan
[EMAIL PROTECTED]

Re: Character indices (was: Unicode Humor)

2002-04-30 Thread Thomas Chan


On Fri, 26 Apr 2002, ろ〇〇〇〇 ろ〇〇〇 wrote:

 In the Unicode 3.0 book, WHY ON EARTH are the Han digits (you know them) 
 not listed directly with the other numerics? They are given their own 
 category. (I have always wondered why the Han digit 1 (一) is not called 
 HAN DIGIT ONE, etc.)

How would you distinguish the common form U+4E00 from the ancient form
U+5F0C, the Japanese anti-fraud form U+58F1, the Chinese anti-fraud
form U+58F9, and the (historical) variant form U+2092A?


Thomas Chan
[EMAIL PROTECTED]

sources for plane 2 characters?

2002-04-30 Thread Thomas Chan


Hi all,

I was looking at the plane 2 characters in the March 15, 2001 version of
the unihan.txt file, and found five that did not have an IRG source:
U+20957, U+221EC, U+22FDD, U+24FB9, and U+2A13A.  (The last one, U+2A13A,
however, has kIRGHanyuDaZidian and kIRGKangXi information showing that it
can be found in those dictionaries.  Still, shouldn't there be an IRG
source for it?)  Where are the first four from?
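
For anyone who wants to repeat the check, here is a rough sketch of how
such a list might be produced.  It assumes the tab-separated
"U+XXXX<TAB>field<TAB>value" layout of unihan.txt and assumes the IRG
source fields are the ones named kIRG_*Source; the sample lines and
their values below are made-up placeholders, not real data.

```python
from collections import defaultdict


def missing_irg_source(lines):
    """Return plane 2 code points that have entries but no kIRG_*Source field."""
    fields = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) < 2 or not parts[0].startswith("U+"):
            continue  # skip anything that is not a data line
        cp = int(parts[0][2:], 16)
        if 0x20000 <= cp <= 0x2FFFF:
            fields[cp].add(parts[1])
    return sorted(
        cp
        for cp, names in fields.items()
        if not any(n.startswith("kIRG_") and n.endswith("Source") for n in names)
    )


# Placeholder sample in the assumed layout (values are invented).
SAMPLE = [
    "# comment line",
    "U+20000\tkIRG_GSource\tplaceholder",
    "U+2A13A\tkIRGKangXi\tplaceholder",
    "U+4E00\tkIRG_GSource\tplaceholder",
]

print([f"U+{cp:X}" for cp in missing_irg_source(SAMPLE)])
```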

Thanks,


Thomas Chan
[EMAIL PROTECTED]

Re: Unicode Font Pros and Cons

2002-03-31 Thread Thomas Chan


On Sun, 31 Mar 2002, Jungshik Shin wrote:

 On Sat, 30 Mar 2002, Doug Ewell wrote:
  Maggie Yeung wrote:
   Can someone think of any other issues related to using Unicode font.
  
  I find it mildly annoying that Outlook Express picks a font on the
  basis of the encoding chosen for a given message.  On this list and
  the IDN list, a message encoded as JIS or EUC-KR or BIG5 will likely be
  displayed in a different font from the one used to display Latin-1 or
  UTF-8 messages.  
 
   Did you mean that different fonts are used to render US-ASCII part
 of messages depending on the encoding used in messages, ISO-2022-JP,
 EUC-KR, Big5, ISO-8859-X, UTF-8? 

If this is what was meant, then I find this annoying as well.  Reading a
thread in English (ASCII) and then, because someone includes a small bit
of East Asian text and their client tags it as such (or if it's tagged
as such anyway, even with only ASCII-representable text), the font 
suddenly changes from a nice proportional font to the monospaced and
uneven kind (i.e., the letters float up and down) seen in East Asian fonts
that looks worse than Courier.  (I know about the fonts that have
proportional Latin glyphs, but they're still ugly.)  In a mixed language
context (e.g., English/Chinese or English/Japanese), I don't mind as much,
as I appreciate the added clarity of information, but if it's really just
English-only text, then it's plain unattractive.


Thomas Chan
[EMAIL PROTECTED]

Re: The Arrogants and the sillies (RE: Euros and cents)

2002-03-27 Thread Thomas Chan


On Tue, 26 Mar 2002, Doug Ewell wrote:

 I'm surprised nobody took Dan the Silly Man to task on this one.
  English enjoys new words on-the-fly.
  What a pity Kanji on-the-fly is a taboo, at least on Unicode ;)

I think these were meant as rhetorical questions, but I'll bite,
particularly #3...

 
 Can you name a character encoding standard, anywhere in the world,
 invented by anybody -- government, industry consortium, private company,
 individual, kwijibo, ANYBODY -- that can do better in this regard than
 Unicode?

Besides the giant 70K+ repertoire which reduces the likelihood of an
unavailable character, there's always the PUA option.  Some other
competitors in the Han character area don't even have that (i.e., a
gaiji area), instead forcing one to submit such characters for
registration.

 
 Can you name a font technology that will support the display of these
 invented-on-the-fly Kanji?
 
 For that matter, can you invent a Kanji on the fly that cannot be
 represented (perhaps in a rather cumbersome way) with Ideographic
 Description Characters?

Yes, it's possible but uncommon.  Unlike some other character description
schemes, IDS can only form characters by composition.  e.g., there's no
way to gut out everything except the right half of U+8BD1 (yi4 'to
translate') and use the former right half as a component in describing
another character (as of Unicode 2.1--I haven't checked later versions.) 
Such a component would need to be separately encoded for it to participate
in an IDS.  Sometimes such components are not independent characters, or
they are rare independent characters that have been overlooked for
encoding.  In this particular example, when U+776A occurs as part of a
character in unsimplified Chinese, then the simplified Chinese form would
have U+776A converted into the component mentioned above by application of
simplification rules (standing alone, U+776A is identical in simplified
form).  Find all the characters containing U+776A as a component and
create the simplified forms by applying the rule--that'll generate plenty  
of characters that IDS's can't represent.  Another case is a character for
'Marxism'--it is U+9A6C with the final stroke gutted out, and replaced
with U+4E49 (Again, this example only checked to be true as of Unicode
2.1).

There are also an almost negligible number of cases such as U+4E52 and
U+4E53 (used to write ping1pang1 'ping pong') or U+5187 (used to write
Cantonese mou5 'to not have', among other words), which are created by
deleting single strokes from U+5175 and U+6709, respectively.  A
number of Vietnamese chu+~ no^m characters are also created in such
fashion.  This is at a level smaller than the components that IDS work on,
and is really not a flaw of IDS.

IDS's, unlike some other description schemes, also don't handle
rotation--there are also an almost negligible number of cases where a
character (or a component) is formed by rotating another 180 degrees,
e.g., U+20114, which is U+4E88 rotated 180.  However, this is so rare that
it wouldn't be a productive IDC if it were to exist.

IDS's also don't handle cases of ligaturing, e.g., U+21155 (xi3 'double
happiness'), which is two U+559C side-by-side in origin.  Distinguish from
U+56CD of the same meaning as U+21155, where ligaturing doesn't take
place.

IDS's also don't handle cases of guwen 'ancient character', which are
characters in pre-modern form that have been converted to modern form,
e.g., U+20A30, a tortured character which is really the zhuan 'seal' form
of U+5973 (nuu3 'woman; female') modernized.  IDS's might handle it, but
clumsily.  Others such as U+20066 are just impossible with IDS's.
However, this type are not likely to be created in this age, except as
modernizations of ancient forms.

Despite these counterexamples, IDS do handle the majority of unencoded Han
characters, most of which are the left to right or above to below
variety with respect to the particular IDC's used.
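
To make the "composition only" point concrete, here is a small sketch
that checks whether a string is a well-formed IDS in prefix notation.
It assumes the twelve IDCs at U+2FF0..U+2FFB, of which U+2FF2 and U+2FF3
take three operands and the rest take two; the function name is my own,
and it does not enforce any length limits a given version of the
standard may impose.

```python
# Ideographic Description Characters: U+2FF2 and U+2FF3 are ternary,
# the other ten in U+2FF0..U+2FFB are binary.
TERNARY = {0x2FF2, 0x2FF3}
BINARY = set(range(0x2FF0, 0x2FFC)) - TERNARY


def is_well_formed(ids: str) -> bool:
    """Check that every IDC in a prefix-notation IDS gets all its operands."""
    needed = 1  # number of components still required
    for ch in ids:
        if needed == 0:
            return False  # extra material after a complete description
        cp = ord(ch)
        if cp in TERNARY:
            needed += 2  # fills one slot, opens three
        elif cp in BINARY:
            needed += 1  # fills one slot, opens two
        else:
            needed -= 1  # an ordinary component fills one slot
    return needed == 0


print(is_well_formed("\u2ff0\u5973\u5b50"))              # left-right: U+5973 U+5B50
print(is_well_formed("\u2ff1\u8279\u2ff0\u65e5\u6708"))  # nested description
print(is_well_formed("\u2ff0\u5973"))                    # missing an operand
```

Note that every operand has to be an encoded character or component;
there is no operator for "this character minus a stroke" or "this
character rotated", which is where the counterexamples above come from.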


Thomas Chan
[EMAIL PROTECTED]

Re: Talk about Unicode Myths...

2002-03-20 Thread Thomas Chan


On Wed, 20 Mar 2002, John Cowan wrote:

 Dan Kogai scripsit:
 And as for Chinese, how do you tell whether Traditional or Simplified 
  is more appropriate?
 
 Traditional and Simplified characters are *not* unified in Unicode,
 (BTW, this should go on the list of Unicode Myths),
 so that would be up to the author, not the browser.

That's true, but you still can't distinguish preferred glyph variants of
mainland China from those of Taiwan (nor those from that of Japan,
either).  cf., the often-cited U+76F4 and U+9AA8.  What is often claimed
as a difference between Japanese and Chinese is misportrayed and/or
misunderstood as a difference between Japanese language and Chinese
language practices, but is really a difference in national glyph
preferences, e.g., U+9AA8 is often said to be a case where Japanese and
Chinese differ, but this is true only when comparing the glyph 
preference of Japan (not Japanese language) and mainland China, and false
if compared to Taiwan.  I think this might be what Dan was referring to.

BTW, there are some people who would've preferred to have traditional and
simplified Chinese characters unified, so that conversion may be performed
by changing the font.  e.g., Founder (sorry, don't have the url at the
moment) has some fonts which come in J and F versions.  The J
versions are perfectly normal in being Unicode fonts with simplified
Chinese characters.  The F versions, however, are a different
story--although they are also Unicode fonts, at the codepoint for a
simplified Chinese one actually finds the glyph for its traditional
analogue (in most cases).  Thus, one can store master data in simplified
Chinese, and generate what appears to be traditional Chinese by changing
the font (and making a few minor edits because of lack of 1-to-1
correspondence).  e.g., type U+56FD with the J font, change
the font to the F version and you see the glyph for U+570B, but it's
still really U+56FD.  Certainly, this sort of thing can't help improve
understanding that simplified and traditional Chinese characters are not
unified in Unicode.
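
A toy model of what such a font does, as I understand it (the table and
function names are hypothetical, and the only pair in the table is the
one from the example above): the stored text never changes, only the
glyph chosen for it.

```python
# Hypothetical glyph substitution table for an "F" style font:
# simplified code point -> code point whose glyph is actually drawn.
F_FONT_GLYPHS = {0x56FD: 0x570B}


def rendered_glyph(cp: int, font: str) -> int:
    """Return the code point whose glyph a given font would display."""
    if font == "F":
        return F_FONT_GLYPHS.get(cp, cp)
    return cp  # the "J" font shows the glyph one would expect


text = "\u56fd"  # stored data: simplified U+56FD
shown = "".join(chr(rendered_glyph(ord(c), "F")) for c in text)

print(f"stored U+{ord(text):04X}, displayed as the glyph of U+{ord(shown):04X}")
```

Searching, sorting, or copying the text still operates on U+56FD,
whatever the screen shows.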


Thomas Chan
[EMAIL PROTECTED]

Re: 31 Angry Watanabes (or the Itaiji problem)

2002-03-18 Thread Thomas Chan


On Mon, 18 Mar 2002, Dan Kogai wrote:

However,
  if one is to pick over little details, then I still don't know what 
  U+5F3E
  is (in the context of Dan's name)--does the upper right corner have two 
  or
  three strokes?
 
Three.  That's the only official 'Dan' with 'Bow' and 'Single' in 
 Modern Japanese.  Two stroke form is for Simplified Chinese and two 
 mouths Traditional.

You're right.  I was mistaken in believing that U+5F39 (two dot version)
and U+5F3E (three dot version) were unified.


Thomas Chan
[EMAIL PROTECTED]

Re: Synthetic scripts

2002-03-17 Thread Thomas Chan


On Sun, 17 Mar 2002, Andy Heninger wrote:

 From: Miikka-Markus Alhonen
  Stefan Persson replied
Can you prove that this doesn't apply to any of the scripts
already in the
Standard? No, you can't, as it is not known under which circumstances
Latin, Greek, Kanji, etc., were created.
 
  What about a script that was invented by one person with the principal
  intention of representing an artificially constructed language?
 
 Tighten up the definition of an artificially constructed language to
 be one that has never had native speakers, and you're there.  Separate
 the evolution of the spoken language from the evolution of the script.

That sounds better, but that definition of artificially constructed
language would still include some planned languages and artificial
standard versions of languages.


Thomas Chan
[EMAIL PROTECTED]

Re: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)

2002-03-15 Thread Thomas Chan


On Fri, 15 Mar 2002, Jungshik Shin wrote:

 On Sat, 16 Mar 2002, Dan Kogai wrote:
  (*) My parents wanted me to name me 彈 (U+5F48), a classical form, but 
  it was not listed on the table of Kanjis allowed for names so I was 
  named  U+5F3E.
 
   Frankly speaking, I find it rather hard to understand what difference
 there is between using U+5F48 and using U+5F3E in spelling your
 name. They're the same character with the same meaning but with a bit of
 variation in shape. However, I should be careful because this is about
 one's name.

This particular case in a Chinese context wouldn't be respected.  However,
if one is to pick over little details, then I still don't know what U+5F3E
is (in the context of Dan's name)--does the upper right corner have two or
three strokes?


Thomas Chan
[EMAIL PROTECTED]

Re: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)

2002-03-15 Thread Thomas Chan


On Fri, 15 Mar 2002, John H. Jenkins wrote:

 On Friday, March 15, 2002, at 11:38 AM, Dan Kogai wrote:
 
  There are so many Watanabe-sans, Saito-san, and others whose name cannot 
  be spelled in Unicode.
 
 Can you document this?  You know, there's a prize offered for the first 
 person to document the existence of someone whose Japanese name cannot be 
 represented by Unicode 3.2.

Is this open to names written with taboo-avoiding forms of characters
which omit strokes?  e.g., U+7384 less the final stroke.  Or are these
unified with the normal forms?  (I'm aware that not all taboo-avoidance
works by stroke deletion.)  The Kangxi emperor (r. 1662-1722) writes the
xuan of his personal name, Xuanye, with U+7384, but everyone else has to
use a similar form, but less the final stroke--see p. 725 of the edition
of the _Kangxi Zidian_ that the IRG uses.  (This taboo-avoidance has
apparently extended to other characters that utilized U+7384 as a
component.)  Of course, my example here is not a Japanese one...


Thomas Chan
[EMAIL PROTECTED]

Re: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)

2002-03-15 Thread Thomas Chan


On Fri, 15 Mar 2002, Kenneth Whistler wrote:

 Ben Monroe wrote:
  As it is a personal spelling, I never expected
  Unicode to map a code point to this character to me. 
 
 For those not following the Japanese in the UTF-8, Ben's name
 is Monryuu Ben in kanji. This is a sound-based name coinage
 for an English name. Mon 'gate' ryuu ~ ryoo 'dragon'. (Sorry,
 but I can't tell just from the kanji just exactly what
 pronunciation you would use.)
 And sticking the dragon inside the gate, which is then
 structured like a radical, is creating a phonological rebus
 that departs from the ordinary way that radical + phonological
 component characters are constructed. No wonder your teacher
 marvelled at how to pronounce it -- Han characters aren't
 constructed by putting two syllables together in one character
 to create a disyllabic pronunciation. 

I thought it was a case of an obscure personal name character, until I saw
the connection to Monroe.

There are a small number of these with polysyllabic Chinese or
Sino-Xenic readings where the reading is a concatenation of the readings
of its components, such as U+55E7 jia1lun2 'gallon' (U+52A0 U+4F96),
which are really ligatures.  Some like U+337B heisei (Japanese era name) 
(U+5E73 U+6210) are easy to recognize, but it would be easy for characters
like the Monroe one to slip through in the absence of information about
it.

This is different from another small set where a polysyllabic
Chinese/Sino-Xenic reading is not a concatenation of the readings of its
components, such as U+544E ying1chi3 'English foot' (U+5C3A chi3 'foot'
plus a 'mouth' radical--indicating a semantic connection/modification?).

However, not everything that looks like a ligature really is, such as
U+6B6A wai1 'crooked' appears to spell out the phrase bu4 zheng4 'not
straight' (U+4E0D U+6B63); U+81AD Cantonese chun 'animal egg' appears to
spell out the phrase mei sing yuk 'not yet become flesh' (U+672A U+6210
U+8089); or U+7526 su1 'to revive' appears to spell out a synonymous word 
geng4sheng1 'to revive' (U+66F4 U+751F).

 
  Should I really have any reason to expect Unicode to deal with this?
 
 Nope. Any more than it should deal with the fanciful but
 ubiquitous good luck coinages like the shuang1xi3 'double happiness'
 character.

U+56CD and U+21155--the latter seems to be more common on printed matter.
Almost no Chinese dictionaries include them, but Korean ones seem to.
However, I think this case may have made the jump from ligature to
independent character, as it has acquired a monosyllabic xi3 reading, and
appears in the title of at least two movies (rather than appearing
independently as decoration).  But more examples of this class can be
found at the shinji page[1] of Kanji no shashin jiten[2]
(content is in Japanese).  Apparently Mojikyou has been okay with encoding
some of them.

[1] http://homepage2.nifty.com/Gat_Tin/kanji/sinji.htm
[2] http://homepage2.nifty.com/Gat_Tin/kanji/kaindex.htm

 
Thomas Chan
[EMAIL PROTECTED]

Re: This spoofing and security thread

2002-02-12 Thread Thomas Chan


On Tue, 12 Feb 2002, Michael Everson wrote:

 At 18:37 + 2002-02-11, Juliusz Chroboczek wrote:
- a cross-reference of characters whose associated glyphs are
  identical, whatever the font (applies to symbols and ``modifier
  letters'');
 
 But the letter b isn't identical from font to font in Latin.

(piggybacking on your message, Michael)

Nor U+0061 LATIN SMALL LETTER A, which sometimes looks like U+0251
LATIN SMALL LETTER ALPHA, and sometimes doesn't.

 
- a cross-reference of characters whose associated glyphs could be
  confused by a non-technical user;
 
 Out of the entire standard? Who's going to do that for free? :-)

And where would the data come from? :/  A turned R (similar to U+042F
CYRILLIC CAPITAL LETTER YA) is sometimes used whimsically in place of
U+0052 LATIN CAPITAL LETTER R to 1) imitate children's handwriting
mistakes, e.g., in the logo of the toy store Toys R Us
(http://www.toyrsus.com/), or to 2) imitate Russian, e.g., the Tetris
logo.  Like with Han characters, it's not only what looks similar, but also
what's considered similar...

Just yesterday, I saw some product packaging where a grave accent was used
where an acute accent was meant--no doubt, the error resulting from (US)
English speakers' general unfamiliarity with diacritics (and who'd
consider them to be optional adornments such that e = e acute = e grave).


Thomas Chan
[EMAIL PROTECTED]

Re: Phonetic grouping in UniHan

2002-02-04 Thread Thomas Chan


On Mon, 4 Feb 2002, Marco Cimarosti wrote:

 I also take the occasion to suggest a new field that could be very useful:
 the frequency of usage of each character. This information may be derived
 from good on-line sources. E.g., for Chinese, from Chi-Ho Tsai's research
 (http://www.geocities.com/hao510/charfreq/) and, for Japanese, from the
 KanjiDic database, (http://www.csse.monash.edu.au/~jwb/kanjidic_doc.html).
 (I don't know the licensing terms for using these data.)

I think whatever frequency data is included, the particulars of how they
were arrived at (or where to find such information) should be included,
e.g., Tsai's findings were based on 1993-1994 Big5 Usenet postings.

There's also frequency data buried under the kFenn field (as yet
unpopulated), where A, B, C, D, E, F, G, H, I, K (J is omitted)
indicates if it falls in the first, second, third, etc group of five
hundred characters, based on earliness of occurrence in the textbooks of
1926.  (The P code is also used for something that is not quite clear to
me from the explanation in the dictionary alone--I presume it might refer
to characters in the dictionary that were not in the 1926 study.)
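
As a sketch of how the letter codes described above could be decoded
(the function name is my own, and it deliberately treats P and anything
else as unknown rather than guessing at its meaning):

```python
# Group letters in order; J is omitted, so K is the tenth group.
FENN_GROUPS = "ABCDEFGHIK"


def fenn_rank_range(letter: str):
    """Map a group letter to its (first, last) rank among 5000 characters."""
    if len(letter) != 1:
        return None
    idx = FENN_GROUPS.find(letter)
    if idx < 0:
        return None  # e.g. J (unused) or P (meaning unclear)
    return (idx * 500 + 1, (idx + 1) * 500)


for code in "ABKJP":
    print(code, fenn_rank_range(code))
```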

P.S. Recently you asked about estimates of usage of Plane 2
characters--since a large percentage are CNS 11643-1992 characters (and
perhaps the oldest IT source), that may provide a clue.  In the
Concluding Remarks section of Christian Wittern's Taming the
Masses[1], the higher CNS planes (ignore 1 and 2, which are in the
BMP, and perhaps some parts of 3) are rarely used in historic texts, and
he expects even lower usage in modern texts.

[1] http://www.gwdg.de/~cwitter/cw/taming.html


Thomas Chan
[EMAIL PROTECTED]

Re: TC/SC mapping

2002-02-04 Thread Thomas Chan


On Wed, 23 Jan 2002, John H. Jenkins wrote:

 Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being 
 different?  The only dictionary I have which contains both is the 
 (traditional) CiHai, and it claims they're variants of each other.

Belated, but a little more on these two.  Annex T: Procedure for the
Unification and Arrangement of CJK Ideographs[1] (AMD8 of ISO/IEC
10646-1:1993) at the very end of section T.3, gives this pair as an
example of unification blocked by source separation (a T source is the
culprit).

[1] ftp://ftp.cse.cuhk.edu.hk/pub/irg/AnnexT.rtf


Thomas Chan
[EMAIL PROTECTED]

Re: Wade - Pinyin transliteration (Unihan ?)

2002-01-25 Thread Thomas Chan


On Thu, 24 Jan 2002, Patrick Andries wrote:

 John Cowan wrote:
 Patrick Andries scripsit:
 Let's assume I want to transliterate a large Wade-Giles database into 
 pinyin. Is this a purely algorithmic process? For all nouns? Common and 
 proper (cf.  Chiang Kai-Shek vs Jiang Jieshi)? Even for dialectal words?
 
 Chiang Kai-Shek isn't Wade-Giles; it isn't even Mandarin.

 I did mention dialectal forms (I believe final -k no longer occurs 
 in Mandarin), I just wondered whether I would find such nouns (proper or 
 common) in a dictionary edited in Taiwan. I asked because I could see no 
 algorithmic way of converting this name using traditional Wade to Pinyin 
 tables.
 
 Incidentally, if this is not Wade-Giles applied to a dialectal 
 pronunciation, what is it? Genuinely interested.

It should be noted that Wade-Giles is commonly misused as a cover term
for many old, ad hoc, non-Mandarin-based, or non-Pinyin romanization
systems.

Chiang Kai-shek is a mixture of what looks like Wade-Giles (surname
CHIANG) and some kind of archaic romanization based on Cantonese (given
name Kai-shek).  For placenames, there are many postal romanizations
that are often erroneously considered to be Wade-Giles, e.g., the city
Nanking (postal)/Nan-ching (Wade-Giles)/Nanjing (Pinyin).

In any case, one should also beware of degenerate Wade-Giles forms where
details such as apostrophes (denoting aspiration) are omitted, e.g., the
city Changchun (degenerate Wade-Giles)/Ch'ang-ch'un
(Wade-Giles)/Changchun (Pinyin).  If Changchun were accepted as proper  
Wade-Giles input, then a corrupt *Zhangzhun pinyin form would be
generated.
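
The effect of a dropped apostrophe can be shown with a toy converter. This is only a sketch under narrow assumptions: it knows a single pair of Wade-Giles initials (ch' and ch), expects hyphen-separated syllables, and ignores the many other rules a real converter needs (for example, the different treatment of ch before i or ü). The table and function names are made up for illustration.

```python
# Tiny, deliberately incomplete table of Wade-Giles initials (an assumption
# for illustration): the apostrophe marks aspiration and changes the result.
WG_INITIALS = [
    ("ch'", "ch"),  # aspirated:   ch'ang -> chang
    ("ch", "zh"),   # unaspirated: chang  -> zhang
]


def wg_syllable_to_pinyin(syllable):
    """Convert one Wade-Giles syllable using only the initials listed above."""
    s = syllable.lower()
    for wg, py in WG_INITIALS:  # the longer initial is listed first
        if s.startswith(wg):
            return py + s[len(wg):]
    return s  # anything else is passed through unchanged


def wg_to_pinyin(name):
    """Convert a hyphen-separated Wade-Giles name, e.g. "Ch'ang-ch'un"."""
    syllables = name.split("-")
    return "".join(wg_syllable_to_pinyin(s) for s in syllables).capitalize()
```

With these rules, "Ch'ang-ch'un" comes out as "Changchun", while the degenerate input "Chang-chun" comes out as the corrupt "Zhangzhun", which is the failure described above.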


Thomas Chan
[EMAIL PROTECTED]

Re: TC/SC mapping

2002-01-24 Thread Thomas Chan


On Thu, 24 Jan 2002, John H. Jenkins wrote:

 However, this is already a problem in Unicode.  shuowen.org will have to 
 register both U+8AAAU+6587.org and U+8AACU+6587.org; Jingwa, 
 Inc., will need both U+4E3CU+86D9 and U+4E95U+86D9.

U+8AAA and U+8AAC are given on p. 265 of TUS3.0 as an example of what
would have been unified had it not been for source separation.  Is it
possible to acquire data on other z-variants?  The kZVariant fields do not
seem to contain exactly that data.  Had that example not been pointed out,
I wouldn't have known that both were encoded.


Thomas Chan
[EMAIL PROTECTED]

Re: TC/SC mapping

2002-01-23 Thread Thomas Chan


On Wed, 23 Jan 2002, John H. Jenkins wrote:

 On Wednesday, January 23, 2002, at 09:05 AM, Thomas Chan wrote:
  In other words,
    yao1 'small'                         TC U+4E48 or U+5E7A - SC U+4E48
    me (as in shen2me 'what')            TC U+9EBC or U+9EBD - SC U+4E48
    mo2 (as in yao1mo2 'insignificant')  TC U+9EBC or U+9EBD - SC U+9EBD
 
 Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being 
 different?  The only dictionary I have which contains both is the 
 (traditional) CiHai, and it claims they're variants of each other.

Well, first, the Jianhuazi Zongbiao that defines the PRC
simplifications juxtaposes U+9EBD and U+4E48 for the me pronunciation
of the former (non-me usages of the former are not simplified);
U+9EBC is not mentioned.

In the PRC's _Ci Hai_ from 1979 (the third dictionary to bear that
name), U+9EBC is a pointer to U+9EBD for all usages of U+9EBD.

In the _Hanyu Da Zidian_ (PRC, 1986), U+9EBD has the following
usages:
  1)   mo2 'small'
  2)   ma2 of gan4ma2 'what for'.  (It says that nowadays this
   particular ma2 is written U+55CE.)
  3.1) ma, a particle, which can sometimes be written U+55CE.
  3.2) ma, a particle, which can sometimes be written U+561B.
  4)   me of zhe4me 'so; like this'; also used as padding in songs.

However, for U+9EBC, it says it is the same as U+9EBD, but the
only examples given have the 'small' meaning, including one from
the _Shuowen Jiezi_ (China, AD 100) that says that U+9EBD is a
vulgar (su2) form of U+9EBC.

Apparently, U+9EBC is the more orthodox version as far as mo2
'small' is concerned, but U+9EBD has become more common,
including becoming used to write various modern/colloquial words.

I would revise the mapping as follows:
  me (as in shen2me 'what')            TC U+9EBD - SC U+4E48
  mo2 (as in yao1mo2 'insignificant')  TC U+9EBC - TC U+9EBD - SC U+9EBD

I think the choice whether to regard U+9EBC and U+9EBD as different or not
depends on the application.  I would lean towards treating them as the
same.

 
 Meanwhile, both Sanseido and KangXi say that U+5C1B (尛) is a member of 
 the family.  (KangXi says that anciently U+9EBC (麼) was written U+5C1B
 (尛).)  Mathews and Sanseido also remind us that U+5E85 (庅) is another
 variant, and Sanseido *also* lists U+5692 (嚒).

In the _Hanyu Da Zidian_, U+5C1B points to U+9EBC.  (I see on the same
page that U+21B6F also points to U+9EBC, and the _Hanyu Da Zidian_ is
citing this pointer from the same source.)  It doesn't say, but I would
presume these refer only to the original mo2 'small' usage, given the age
of the cited source, _Longkan Shoujian_ (China, AD 997), and the
composition of U+5C1B (three 'smalls') and U+21B6F ('three' + 'small').

U+5E85 is understandable as an abbreviated form of U+9EBD, and I'll
add that it's also documented in Samuel Wells Williams' 1874
dictionary (pushes back the usage given in Mathews by at least half a
century).

U+5692 seems understandable--it is just U+9EBC with a mouth radical
tacked on--I presume this is only for the modern/colloquial me usages,
and not mo2 'small'.  (I wouldn't be surprised if somewhere there is
attested a U+9EBD with a mouth radical tacked on.)

I would further revise the (partial) mapping as follows:

  me (as in shen2me 'what'):
    TC U+9EBC - TC U+9EBD - TC U+5E85 - SC U+4E48
    TC U+9EBC - TC U+5692

  mo2 (as in yao1mo2 'insignificant'):
    TC U+9EBC - TC U+9EBD - SC U+9EBD

And this is not finished, yet!  The _Hanyu Da Zidian_ also lists
some other variant forms of U+9EBD--I suspect they are probably
all/mostly for the mo2 'small' usage.  I should point out that the _Hanyu
Da Zidian_ is in no way the final word despite its comprehensiveness,
e.g., U+5E85 and U+5692 are not included in it.

 
 So, Doug, you see that U+4E48 (么) could conceivably be a traditional 
 character in its own right *or* the simplified form for no fewer than six 
 (!) other ideographs.
 
 This is the kind of mess that has discouraged anybody from doing a 
 systematic survey of simplifications for the Unihan database.

Part of this is because there is the orthogonal complexity of variant TC
forms.  Before converting TC to SC, one should resolve all TC variants to
the most common or standard TC form (good luck deciding what that
means).  e.g., in the above case, resolve to U+9EBD.
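
The two-stage idea (fold TC variants first, then simplify) might be sketched as follows. Everything here is a minimal illustration: the variant table only covers the characters discussed in this message, the word list has a single hypothetical entry, and folding everything to U+9EBD is just the choice suggested above, not an established standard.

```python
# Variant traditional forms folded to one form before simplification.
# The choice of U+9EBD as the target follows the suggestion above.
TC_VARIANTS = {
    "\u9ebc": "\u9ebd",
    "\u5e85": "\u9ebd",
    "\u5692": "\u9ebd",
}

# Word-level simplifications (hypothetical, one entry only): the "me" of
# shen2me is simplified to U+4E48, while U+9EBD elsewhere is left alone.
SC_WORDS = {
    "\u4ec0\u9ebd": "\u4ec0\u4e48",  # shen2me 'what'
}


def normalize_tc(text):
    """Stage 1: replace each known TC variant with the chosen standard form."""
    return "".join(TC_VARIANTS.get(ch, ch) for ch in text)


def tc_to_sc(text):
    """Stage 2: simplify whole words, so that mo2 'small' keeps U+9EBD."""
    text = normalize_tc(text)
    for tc_word, sc_word in SC_WORDS.items():
        text = text.replace(tc_word, sc_word)
    return text
```

Working on words rather than single characters is what lets the same character end up as U+4E48 in shen2me but stay U+9EBD in yao1mo2.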

I think we are also complicating things by treating the entire process of
variants and simplifications as operating solely on the orthography (cf., 
upper and lower case); in some cases, it is simpler to conceptualize it as
the spelling of words being changed.

 
  The other example (U+8721 kTraditionalVariant U+8721 U+881F) is a
  mistake--the TraditionalVariant should only be U+881F.
 
 Actually, no.  Both KangXi and the Cihai list U+8721 (蜡) as a traditional 
 character in its own right, although I assume it's rare as I can't find it 
 in my other dictionaries.

You're right.  The presence of U+8721 in Big5 should have been a
preliminary hint to me

Re: Fun with UDCs in Shift-JIS

2002-01-17 Thread Thomas Chan


On Thu, 17 Jan 2002, Thomas Chan wrote:

   - NTT-DoCoMo pictographs[1] in webpages for cell phones
 
 [1] http://www.nttdocomo.co.jp/tag/emoji/
 (Shift-JIS 0xF89F to 0xF971)

It was pointed out to me that the above URL didn't work.  It should have
read: http://www.nttdocomo.co.jp/i/tag/emoji/ .


Thomas Chan
[EMAIL PROTECTED]

Re: GBK Traditional to Simplified mapping table

2002-01-10 Thread Thomas Chan


On Thu, 10 Jan 2002, Ken Krugler wrote:

 I've got GBK-encoded text that contains a number of Traditional Hanzi 
 characters. I'd like to convert all of these to their Simplified 
 equivalents. So does anybody know of a GBK table that maps each 
 Traditional form to its Simplified form?

If converting to simplified equivalents means reducing the text so that
it can be representable in GB2312, then I'd recommend:
  1) If the GBK character is in GB2312, keep it as-is.
  2) Otherwise, convert to Big5 using Unicode as an intermediary.  Take
 the characters that converted to Big5 successfully and use one of
 those many Big5-GB2312 converters as suggested by Frank Tang,
 which will perform the traditional-simplified conversion.
  3) If there are any characters that weren't handled by step #2 (e.g.,
 traditional Chinese characters not in Big5[1]; traditional Chinese
 characters in Big5 but not treated by most Big5-GB2312 converters[2];
 non-Chinese characters used in Japanese[3]/Korean since the source text
 *is* GBK), then turning them and the surrounding context
 over to a human with access to a number of good dictionaries would
 probably be the best way to (hopefully) find a best fit within
 the circumstances (e.g., if it happens to be a variant of a
 character that is in GB2312[4]).  If even that fails, perhaps the
 character in question can be described graphically à la A+B[5] or
 the text in question rewritten[6].

[1] e.g., U+5700 (GBK 0x87F3) is a variant form of guo2 'country' that
is not in Big5, but one can substitute U+56FD (GB2312 0xB9FA),
the form of guo2 'country' used in simplified Chinese.
[2] e.g., U+5187 (GBK 0x83D3) is in Big5, used primarily to write mou
'not' in Cantonese (but other meanings also exist), but I haven't seen
a converter to GB2312 yet that'll substitute U+65E0 (GB2312 0xCEDE),
a near-synonym and etymologically-related character.
[3] e.g., U+7A93 (GBK 0xB799) is a Japanese form of chuang1 'window', but
one can substitute U+7A97 (GB2312 0xB4B0).
[4] See [1], [2], [3].
[5] i.e., as the combination of its components.
[6] e.g., U+72C6 (GBK 0xA0F0) occurs in Big5; in most Chinese texts
encountered, it means 'Japanese spaniel dog; Japanese Chin' (and not
a pejorative ethnonym), so it will have to be rewritten to whatever
phrasing that dog breed goes under in GB2312 simplified Chinese texts.
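
The three steps above can be sketched as a triage pass over the decoded text. This assumes Python's built-in "gb2312" and "big5" codecs are an acceptable stand-in for the character set repertoires meant here, which may not hold exactly, since vendor tables differ. The function names are made up for illustration, and the actual Big5-GB2312 conversion in step 2 is left to an external converter.

```python
def can_encode(ch, codec):
    """True if the single character ch is representable in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def triage(text):
    """Sort the characters of a (decoded) GBK text into the three steps above."""
    keep, via_big5, manual = [], [], []
    for ch in text:
        if can_encode(ch, "gb2312"):
            keep.append(ch)       # step 1: already in GB2312, keep as-is
        elif can_encode(ch, "big5"):
            via_big5.append(ch)   # step 2: hand to a Big5-GB2312 converter
        else:
            manual.append(ch)     # step 3: needs a human and dictionaries
    return keep, via_big5, manual


# Usage: decode the GBK bytes first, then triage the resulting string, e.g.
#   keep, via_big5, manual = triage(raw_bytes.decode("gbk"))
```

Characters that land in the second list may still be ones that a given converter does not treat, as in footnote [2], so the third step can also apply to them.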


Thomas Chan
[EMAIL PROTECTED]

Re: Bird headed CJK variants?

2002-01-08 Thread Thomas Chan


 Name: Yael
 email: [EMAIL PROTECTED]
 #1, Posted Dec 26th 02001 10:59:33 AM next
 
 I am looking for an online resources that have graphical samples of 
 the unusual chinese script that is called bird script because all 
 lines end in little bird heads. It is from the han dynasty. May 
 somebody help me?

niaozhuan \u9ce5\u7bc6 'bird seal' is a decorative seal script typeface.
Here are two samples of what is supposed to be the seal of Qin Shihuang
\u79e6\u59cb\u7687 (r. 221-210 B.C.), the first Chinese emperor:

http://deall.ohio-state.edu/grads/chan.200/misc/niaozhuan-qinshihuang_seal-1.jpg
from p. 30 of GUO Bingguang's \u90ed\u51b0\u5149 _Zhuanke rumen_
\u7bc6\u523b\u5165\u9580 [Introduction to Seal-carving] (Hong Kong:
Mingtian \u660e\u5929, 1989).

http://deall.ohio-state.edu/grads/chan.200/misc/niaozhuan-qinshihuang_seal-2.jpg
from p. 176 of R.W.L. Guisso's _The First Emperor of China_ (New York:
Carol Publishing Group, 1989).

The seal consists of four columns of two characters each, read
top-to-bottom, right-to-left:
  7 5 3 1
  8 6 4 2
It's supposed to be: \u53d7\u547d\u65bc\u5929\u65e2\u58fd\u6c38\u660c .
However, notice that there are a number of differences between
the two samples!--I do not know the reason for this.
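
For readers who want to see the layout, here is a small Python sketch that turns the inscription (in reading order, using the same escapes as above) into the two display rows of the seal. It assumes the text length is an exact multiple of the column height; the function name is made up for illustration.

```python
# The eight characters in reading order, written with the escapes used above.
INSCRIPTION = "\u53d7\u547d\u65bc\u5929\u65e2\u58fd\u6c38\u660c"


def seal_rows(text, per_column=2):
    """Arrange text read top-to-bottom, right-to-left into display rows.

    Assumes len(text) is a multiple of per_column.
    """
    columns = [text[i:i + per_column] for i in range(0, len(text), per_column)]
    columns.reverse()  # the first column read is the rightmost one
    return ["".join(col[r] for col in columns) for r in range(per_column)]


# Usage: print each row of the seal as it would appear left to right, e.g.
#   for row in seal_rows(INSCRIPTION):
#       print(row)
```

Applied to the string "12345678", the function returns the rows "7531" and "8642", matching the diagram above.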

'bird seal' and other decorative and/or imaginary scripts (mostly just
font variants) are also covered in Knud LUNDBAEK's _The Traditional
History of the Chinese Script from a Seventeenth Century Jesuit
Manuscript_ (Aarhus, Denmark: Aarhus University Press, 1988).


Thomas Chan
[EMAIL PROTECTED]

Re: Fact vs. fiction

2002-01-06 Thread Thomas Chan


On Sun, 6 Jan 2002, James Kass wrote:

 Michael Everson wrote,
  I was unable to find  http://www.thelordoftherings.com and therefore
  could not see any Tengwar links. Is this the right address?
 
 Yes, it's right.  Here's the direct link to their Tengwar links page:
 http://www.thelordoftherings.com/tengwar/

I can't find the beginning of this thread now, but that's not New Line
Cinema's official site (lordoftherings.net).  I spent a few minutes with
whois and found a number of similar domain names, with or without a
preceding "the", followed by an optional "-movie" or "movie", and under
.com, .net, or .org.  That's even more confusing than the
unicode.com imposter...


Thomas Chan
[EMAIL PROTECTED]

Re: Character display problem example

2001-12-22 Thread Thomas Chan


On Sat, 22 Dec 2001, Michael (michka) Kaplan wrote:

(See my reply below--I'd like to retain Michael's ASCII art for purposes
of illustration, hence the length.)

 Robert (11 digit boy) said:
  font is used to display Japanese or such. I think that
  there is a certain 5-stroke character that will answer it.
  It is U+5E73.
 
 Well, there is a difference here:
 
 Japanese/CHS version:
 --
  \   |   /
   \  |  /
    \ | /
 +-
  |
  |
 
 Korean/CHT version:
 --
    / | \
   /  |  \
  /   |   \
 +-
  |
  |
 
 Although I suppose this could be font differences, too? Pseudo Verified on
 a WinXP system with the following fonts:

Yes, there are simply font differences.  The latter form, with the
diagonal strokes arranged like / \, is the more canonical form, typically
seen in printing when using the kinds of fonts that you tested with.
However, the former form, with the diagonal strokes positioned like \ /,
is more of a handwritten form, although you may see it in fonts that more
resemble handwriting, like the brush-like kaishu(zh)/kaisho(ja) styles
(which were not represented in a limited font survey).  Both forms are
fine in Traditional Chinese practice.  PRC practice (i.e., Simplified
Chinese) tends to have made even the printing forms resemble the
handwritten form, although I do not doubt that a Simplified Chinese
reader would accept the / \ form too.  I won't presume to speak for
Japanese and Koreans, but I suspect the two forms are interchangeable for
them too (comments, please).

In any case, note the last example in Table 10-4 Ideographs Unified in
TUS3.0 p. 265 shows that the rotated strokes/dots are unified.

I'd like to caution against the use of fonts to show national differences
(or lack of them), not only because a font can only show one glyph (and
does not account for scenarios where two interchangeable glyphs are
acceptable), but also because such font surveys are often poorly
controlled for variables.  For example, in Michael's survey of Win XP
fonts (I'm not criticizing you specifically, Michael, so I hope you
do not take this personally) there were two font styles represented for
most locales: the serifed Ming(zh)/Song(zh)/Mincho(ja) and sans serif 
Hei(zh)/Gothic(ja).  However, additional styles such as the brush-like Kai
are not represented, which sometimes will yield different conclusions,
such as for the appearance of the lower left corner of U+5317 'north'.

Second, in such font surveys, the fonts for each locale often come from
different vendors.  For instance, see section #4 of a webpage of samples I
created for U+76F4 'straight'[1], whose URL I gave to Suzanne Topping
(though not on this list).  The MingLiU and AR Mingti2L Big5 fonts
differ between vendors (Microsoft/Dynalab vs. Arphic), although they are
for the same locale (Taiwan) and are the same style (the serifed Ming).
[1] http://deall.ohio-state.edu/grads/chan.200/cjkv/u76f4/

Furthermore, I see there is a tendency for PRC font vendors to create
fonts with completely wrong glyphs.  i.e., the codepoint for the
simplified form is populated with the glyph for the traditional form
(simplified and traditional not being unified, remember).  The idea is
apparently so that a user can type in Simplified Chinese, and then produce
a Traditional Chinese document by simply (and erroneously) changing the
font.  While none of these sorts of fonts are encountered here, they do
exist out there, and would contaminate any font-based studies.


Thomas Chan
[EMAIL PROTECTED]

Re: Microsoft input method, 950, and Unicode mapping

2001-12-18 Thread Thomas Chan


On Tue, 18 Dec 2001, Tex Texin wrote:

 I am glad Sybase gave different character sets different names.

There's a Big5-HKSCS tag[1]--is anyone using that?

[1] http://www.iana.org/assignments/character-sets (see MIBenum 2101;
I don't understand why it's in the vendor range, though)


 For that matter I wonder what a user in HK does when their Windows
 operating system is upgraded and their files that had HKSCS characters
 in the private use area now expect them in other locations.

Or distinguishing between data in HKSCS, GCCS, pre-GCCS vendor extensions,
and privately-created extensions, all of which can occupy the same
encoding space.  Too bad that GCCS and HKSCS first existed as
government-anointed waizi/gaiji extensions, and were (and still are) 
implemented that way, rather than as part of a proper and separate
character set.

(I wish GCCS and HKSCS had proper numbers and dates to refer to them
by--the names are really too similar, and easily garbled and confused.
Recently I gave feedback on an article where HKSCS was used and discussed,
but under the GCCS name and with arguments that were only true for GCCS.)


Thomas Chan
[EMAIL PROTECTED]

Re: Microsoft input method, 950, and Unicode mapping

2001-12-18 Thread Thomas Chan


On Tue, 18 Dec 2001, Kenneth Whistler wrote:

 And to add to the chaos and confusion, note that the HKSCS
 patch for Windows Code Page 950 does not map exactly the
 same as the HK Government mapping table. And that the HK

And that's in addition to the confusion caused by the semi-official,
semi-published precursor version, GCCS.  I've got here a 2001
edition (post-publication of HKSCS) atlas+cdrom of Hong Kong which
includes a GCCS (!) support add-on (and not the same as the one that used
to be available from http://www.info.gov.hk/gccs/ before that URL became
a redirect to the present HKSCS site).


Thomas Chan
[EMAIL PROTECTED]

Re: Ext-B fonts updated

2001-10-24 Thread Thomas Chan


On Wed, 17 Oct 2001, Asmus Freytag wrote:

 Even with a number of errata not yet corrected, the current font is a vast
 improvement over the previously used TTF font (which did not contain any
 glyphs at all for several hundred positions). In the meantime, John said
 it all when he wrote that the identity of the ideograph is given by its
 source mapping, not its glyph in the table.

That works for those that have a character set source mapping (e.g., T-
sources for higher CNS 11643 planes) or dictionary source page and serial
number mappings (e.g., kHanyuDaZidian field), but what do we do about
those that don't?  For example, although I own a copy of the _Ci Hai_
dictionary (G-CH source), I can't tell which character in it that U+206C5
is meant to be.  (No doubt this can be resolved with more
cross-references, although that still leaves the problem of non-dictionary
sources such as G-FZ/FZ_BK and G-4K.)  Also, there's no way I can
determine what U+20850 is, as it doesn't come from a real character set
(K-4 source).
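
Checking what the database records for such a character is easy to script. The sketch below assumes the usual Unihan text layout (code point, field name, value, separated by tabs, with '#' starting a comment line) and assumes that IRG source fields have names beginning with kIRG_, as in current Unihan releases. The function names and the file path in the usage note are illustrative only.

```python
def unihan_fields(lines, code_point):
    """Collect the fields recorded for one code point, e.g. "U+206C5".

    lines is any iterable of text lines in the Unihan layout:
    code point, field name, value, separated by tabs.
    """
    fields = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) == 3 and parts[0] == code_point:
            fields[parts[1]] = parts[2]
    return fields


def source_fields(fields):
    """Keep only the IRG source fields (names assumed to start with kIRG_)."""
    return {k: v for k, v in fields.items() if k.startswith("kIRG_")}


# Usage (the path is an assumption; use whichever Unihan file you have):
#   with open("Unihan.txt", encoding="utf-8") as f:
#       print(source_fields(unihan_fields(f, "U+206C5")))
```

This only shows what the database says; it does not resolve the underlying problem that a source code alone may not identify the character in a printed dictionary.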


Thomas Chan
[EMAIL PROTECTED]

Re: [OT] o-circumflex

2001-09-10 Thread Thomas Chan


On Mon, 10 Sep 2001, てんどうりゅうじ wrote:

 If they can't agree on the pronunciation for these cities, can they
 agree on the Hanzi for them? What ARE the Hanzi for these cities,
 anyway??

Are you asking for the names of cities in Chinese?  Copenhagen is
ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839.  The Han characters used to write
the names of cities depend on many factors, including but not
limited to source spelling/pronunciation, language/dialect of the
rendering party, mapping rules used by the renderer, time period, etc.
For example, New York is rendered in Chinese as Mandarin niu3yue4
\u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in
Japanese it was at one time rendered as \u7d10\u80b2, lit.
'button-rearing'.  Asking for "the hanzi" (from your wording, I don't
think you are just talking about Chinese usage of Han characters) is like
asking for a single Latin script rendering.

(I think you need to get yourself an English-Chinese dictionary or
something, btw...)


Thomas Chan
[EMAIL PROTECTED]

RE: Re[2]: Errata in language/script list

2001-08-13 Thread Thomas Chan


On Mon, 13 Aug 2001, Ayers, Mike wrote:

  From: Thomas Chan [mailto:[EMAIL PROTECTED]] 
  No, they do.  While the dominant way that Chinese languages are written
  today, which is based on Mandarin Chinese, has been well supported since
  pre-Unicode 3.0 days, other Chinese languages have faced the problem of
  many unencoded (or yet-to-be-encoded) characters.  I've written on this
  matter on this list before in the past, principally about Yue Chinese (=~
  Cantonese), but also applicable to other Chinese languages.
 
   Since those all will get coded into the Chinese alphabet (if they
 get coded), what's the point?

It's pretty simple.  Just because enough of a script is encoded for the
needs of one language doesn't mean that is necessarily true for other
languages that use that script.  In time, those omissions are patched up
in newer versions of Unicode.  Latin, Cyrillic, Arabic, and other scripts
have all had new characters added to them in successive versions of
Unicode.

e.g., If someone had asked 1-2 years ago (pre-Unicode 3.1) the question "Can
I write Cantonese with Unicode?", the answer would have been "no" or "not
really".  If it were asked today, the answer would be "yes".  But try that
question today with other minority Chinese languages substituted in it,
and the answer is still pretty much a "no" or "not really".

 
  Some also require different scripts, such as the Dungan living in the
  former Soviet Union, who write in Cyrillic (I've been told all the
  characters they need are encoded), or some Min Chinese, who write in whole
  or part using the characters in the Bopomofo Extended block (Unicode
  3.0) and/or Latin (using certain letters and diacritics that weren't always
 
   If you get genuine exceptions, then list them (i.e. list Min
 Chinese).  I get the feeling that you're talking about a darn small
 userbase here, though.

According to the SIL Ethnologue 14th ed.[1], Dungan (SIL DNG):
  38,000 in Kyrgyzstan (1993 Johnstone). Mother tongue speakers were 95%
  out of an ethnic population of 52,000 in the former USSR (1979
  census). Population total all countries 49,400 out of an ethnic
  population of 100,000.

[1] http://www.ethnologue.com/show_language.asp?code=DNG


I don't have figures for the size of the userbase of Min Chinese written
in Latin script offhand, but see for instance Proposal to add Latin
characters required by Latinized Taiwanese languages to ISO/IEC 10646[2]
(1997.6.26) under the user community questions.

[2] http://www.egt.ie/standards/la/taioan.html
(Did this ever become a WG2 document?  I recall seeing discussions of
this once, but can't find them offhand at the moment.)


BTW, what do you consider to be a darn small userbase, numberwise?
Would the UCAS or Cherokee userbases be too small by your standards to
include a mention of them?


  encoded).  There's also the Hunan women who write in the unencoded Nushu
  script that was discussed on this list rather recently.
 
   Discussed well enough for me to know that we're talking about a
 userbase of approximately twelve and counting down.  This is not a very
 pressing case.

No, it's probably not pressing at the moment.

I'm sure there are more than twelve people who use it for writing and/or
research, though.  Start counting with the number of people who write
in it, and add to that figure the researchers and their assistants (i.e.,
their students) who are doing the surveys...

 
  And this is without going into historical alternative ways of writing
  Chinese, such as the prolific Guanhua Zimu alphabet/syllabary used in the
  1900s-1920s.
 
   ...which we don't really need to do, I think, since we're trying to
 stick to the useful stuff.

What do you consider useful?  What one person considers useless is
useful to someone else.  Without specific requirements like userbase
size, economic power, cultural significance, extant writings, etc, I don't
think we can start making any claims about usefulness.

The Bible (or portions of it) has been published using
Guanhua Zimu[3].  Is that not useful to someone?

[3] From Eugene A. Nida, ed., _Book of a Thousand Tongues_, 2nd ed. 
(London: United Bible Societies, 1972):
  http://deall.ohio-state.edu/grads/chan.200/misc/guanhua_zimu.jpg

If you think historical scripts are not useful, then perhaps the four
Philippine scripts, Ogham, Runic, etc should not be mentioned on the list.

Anyway, I don't see usefulness as one of the requisites for inclusion on
the list in question.

 
  And then there are various transliteration schemes, which, although they
  are not anyone's primary script, are widely employed, such as
  Hanyu Pinyin (people do ask, as legacy GB2312 and Big5 character sets
  don't have them, or only include ugly full-width versions) for Mandarin,
  or Yale for Cantonese (e.g., people ask if a precomposed m with a grave
  accent is encoded, as that is needed to transcribe the negative).
 
   Transliteration

RE: Re[2]: Errata in language/script list

2001-08-01 Thread Thomas Chan


On Wed, 1 Aug 2001, Ayers, Mike wrote:

  From: Marco Cimarosti [mailto:[EMAIL PROTECTED]] 
  BTW, I notice that a single Chinese entry is listed. This should probably
  be split in several entries for the various Chinese languages (or
  dialects, e.g. Mandarin, Cantonese, Hakka, etc.). This split may be handy
  because the different languages could need different information.
 
   They don't.  The joy of unification!

No, they do.  While the dominant way that Chinese languages are written
today, which is based on Mandarin Chinese, has been well supported since
pre-Unicode 3.0 days, other Chinese languages have faced the problem of
many unencoded (or yet-to-be-encoded) characters.  I've written on this
matter on this list before, principally about Yue Chinese (=~
Cantonese), but also applicable to other Chinese languages.

Some also require different scripts, such as the Dungan living in the
former Soviet Union, who write in Cyrillic (I've been told all the
characters they need are encoded), or some Min Chinese, who write in whole
or part using the characters in the Bopomofo Extended block (Unicode 
3.0) and/or Latin (using certain letters and diacritics that weren't always
encoded).  There's also the Hunan women who write in the unencoded Nushu
script that was discussed on this list rather recently.

And this is without going into historical alternative ways of writing
Chinese, such as the prolific Guanhua Zimu alphabet/syllabary used in the
1900s-1920s.

There are also the blind, for whom Braille schemes exist for at least
Mandarin and Cantonese, although I'll concede that Braille could be listed
for almost any language.

And then there are various transliteration schemes, which, although they
are not anyone's primary script, are widely employed, such as
Hanyu Pinyin (people do ask, as legacy GB2312 and Big5 character sets
don't have them, or only include ugly full-width versions) for Mandarin,
or Yale for Cantonese (e.g., people ask if a precomposed m with a grave
accent is encoded, as that is needed to transcribe the negative).


Thomas Chan
[EMAIL PROTECTED]

RE: Re[2]: Errata in language/script list

2001-07-31 Thread Thomas Chan


On Tue, 31 Jul 2001, Marco Cimarosti wrote:

 BTW, I notice that a single Chinese entry is listed. This should probably
 be split in several entries for the various Chinese languages (or
 dialects, e.g. Mandarin, Cantonese, Hakka, etc.). This split may be handy
 because the different languages could need different information.

In the absence of additional qualifying information, I think Chinese
would be interpreted as the most salient variety, the modern standard
written Chinese (based on Mandarin Chinese; SIL CHN) in dominant use
today by speakers of all Chinese languages.

However, some people might have questions asking for details like
"Does Unicode have traditional characters?" and/or "Does Unicode have
simplified characters?"--it might even be worth pointing out that both can
be used concurrently, which is not what people accustomed to the likes of
GB2312, Big5, etc would expect.

Still others might ask, "Does Unicode have Cantonese/Hong Kong
characters?" (the terms are not exactly synonymous, but often
interchanged).  Prior to Unicode 3.1's introduction of the Han characters
in Plane 2, I'd say that support for Yue Chinese (SIL YUH; ~= Cantonese)
was not really usable.  With a logosyllabic script, it'll never be
possible to exhaustively check that all its characters are included, but it
looks very usable now--I've had high success rates finding them in
Plane 2, partially due to sourcing from the HKSCS character set (H source)
from Hong Kong, and partially due to sourcing from large dictionaries such
as the _Hanyu Da Zidian_ (G-HZ source) where characters (and the words
they transcribe) have died out in Mandarin, but are preserved in Yue and
other Chinese languages.

However, I'm not so sure what the situation is for other Chinese
languages, other than a vague impression that they are not well
supported--probably the stage that Yue Chinese was at with Unicode 2.1.
e.g., U+20547 is used only in Min Chinese (MNP, CFR), meaning 'hard,
durable', with a pseudo-Mandarin reading of dian4.  It's in Unicode only
because it happened to be in HKSCS, and to my knowledge that is the only
character set it appears in, perhaps for the use of Chaozhou
speakers (Chiuchow, Teochew), a linguistic minority in Hong Kong
(Chaozhou is a dialect of Minnan Chinese, CFR).  U+20547 is also
documented only in very few dictionaries, none of which were
apparently a source for Unicode.  I think any support for Min Chinese at
this point is probably accidental.  (FYI, U+20547 looks like U+6709 with
the two center strokes removed and replaced by U+4E36.)
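
Since characters like U+20547 live outside the BMP, it can be useful to check whether a given text depends on them at all. The following Python 3 sketch simply reports every character in a string that lies beyond Plane 0. The function names are made up for illustration, and the check says nothing about whether a font or input method actually supports the characters it finds.

```python
def plane(ch):
    """Return the Unicode plane number (0 = BMP, 2 = Plane 2) of a character."""
    return ord(ch) >> 16


def beyond_bmp(text):
    """List (code point label, plane) for every character outside the BMP."""
    return [("U+%04X" % ord(ch), plane(ch)) for ch in text if plane(ch) > 0]
```

For a string containing U+6709 followed by U+20547, only the second character is reported, with plane number 2.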


Thomas Chan
[EMAIL PROTECTED]

Re: Erratum in Unicode book

2001-07-09 Thread Thomas Chan


On Sun, 8 Jul 2001, James Kass wrote:

 An ideal index for the casual or non-CJK user might be quite 
 different in approach.  Perhaps the first component drawn in 

For the less than proficient user, I think it would be beneficial to have
a means to restrict the pool of characters that they are searching
amongst--consider the circumstances under which they are likely to have
encountered the character they are looking up.  The radical-strokes index
in TUS3.0 covers over 27,000 characters, many times more than most
dictionaries and character sets, and in some places, there are just too
many characters falling under a particular radical+residual stroke count
for one to scan the page efficiently.


Thomas Chan
[EMAIL PROTECTED]

Re: Erratum in Unicode book

2001-07-09 Thread Thomas Chan


On Mon, 9 Jul 2001, Richard Cook wrote:

 On a related note, I have 9000 word/char frequencies from Hanyu Pinlu
 Cidian (a mainland text; I typed the entries in back in the early 90's,
 and this is the freq data currently used in Wenlin). I'd be happy to
 give the Consortium access to this data for the purpose of sorting
 characters with identical rad/str numbers by frequency.

Wouldn't that bias sorting according to Chinese language usage 
frequencies?  e.g., \u7684, \u4f60, \u5403 are very common in Chinese, but
rare or obscure in Japanese.  Subsorting by pronunciation would also be
language-dependent.

For a language-neutral method of sorting characters with otherwise the
same radical and # of residual strokes, how about the method used in the
_Hanyu Da Zidian_ (and some other dictionaries) of sorting by the type of
stroke of the first stroke, second stroke, etc., by whether it is one of
the five basic types of strokes as exemplified in the first five Kangxi
radicals?  This requires such data be available for all 70,000+
characters, though...
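
As a sketch of how such a language-neutral order could be computed, the Python below sorts records by radical, then residual stroke count, then the sequence of stroke types. The stroke types are numbered 1-5; which basic stroke gets which number is an assumption and would follow whatever dictionary is being imitated. The records are placeholders, not real character data, precisely because the real stroke data is what would have to be collected.

```python
# Placeholder records (hypothetical ids and stroke sequences, not real data).
# "strokes" lists the type (1-5) of the first stroke, second stroke, etc.
records = [
    {"id": "char_a", "radical": 30, "residual": 3, "strokes": (3, 1, 2)},
    {"id": "char_b", "radical": 30, "residual": 3, "strokes": (1, 2, 5)},
    {"id": "char_c", "radical": 30, "residual": 3, "strokes": (1, 2, 1)},
    {"id": "char_d", "radical": 30, "residual": 2, "strokes": (5, 4)},
]


def sort_key(rec):
    """Radical first, then residual strokes, then stroke types in order."""
    return (rec["radical"], rec["residual"], rec["strokes"])


ordered = sorted(records, key=sort_key)
```

Tuples compare element by element, so ties on the first stroke type are broken by the second, and so on, without any reference to pronunciation or frequency.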


Thomas Chan
[EMAIL PROTECTED]

Re: Re: Erratum in Unicode book

2001-07-08 Thread Thomas Chan


On Sun, 8 Jul 2001, Michael (michka) Kaplan wrote:

 From: James Kass [EMAIL PROTECTED]
  Perhaps he (てんどうりゅうじ) was lamenting the character's absence
  in the Han Radical Index section under radical # 85.
  If all the characters made from the water radical were listed
  under that radical in the Han Radical Index (and so forth),
  where would the sport be in looking up CJK code points?
  The Han Radical Index is particularly useful when the significant
  radical is known, kind of like having to know the correct
  spelling of an English word before it can be looked up in an
  English dictionary.
 
 I suspect you are correct -- but since Unicode does not promise to support
 such a thing (complete decomposability into radical and stroke of all
 unified CJK ideographs), it might be a stretch to consider it an erratum?

I don't think there's any mistake.  U+9152 is filed under radical 164 as a
three residual-stroke character because that's where dictionary #1, the
_Kangxi Zidian_, places it as per the rules in Han Ideograph Arrangement
on p. 266 of TUS3.0.

What 11 prefers is a more progressive system of filing characters under
radicals that does not require one to know what a character means, which
part is not the phonetic element, etc.  e.g., U+554F, wen 'to ask' (also
part of wenti 'problem'; Japanese mondai) is filed under radical 30
mouth as an eight residual-stroke character, since that's where
dictionary #1 places it, although some people might prefer to see it as a
three residual-stroke character under radical 169 gate, such as what is
done in dictionary #3, the _Hanyu Da Zidian_ (but too bad, it's only #3).
Indeed, consider the nearly identical character U+95EE, the Chinese
simplified form, which is filed as a three residual-stroke character
under radical 169, because that is where dictionary #3 places it (and
dictionaries #1 and #2, not containing it, thus have no say
concerning where it is filed).
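
The precedence rule can be sketched roughly as follows (Python).  The
table holds only the radical numbers mentioned in this message; the
dictionary labels for #2 and #4 are my assumption about the TUS3.0 list,
and the function name is hypothetical:

```python
# Sketch: the radical used for filing comes from the highest-priority
# dictionary that actually contains the character.
# PRIORITY labels for #2 and #4 are assumptions; FILINGS is illustrative
# and records radical numbers only.
PRIORITY = ["Kangxi Zidian", "Dai Kanwa Jiten", "Hanyu Da Zidian", "Dae Jaweon"]

FILINGS = {
    0x9152: {"Kangxi Zidian": 164},
    0x554F: {"Kangxi Zidian": 30, "Hanyu Da Zidian": 169},
    0x95EE: {"Hanyu Da Zidian": 169},  # absent from dictionaries #1 and #2
}


def radical_for(code_point):
    """Return (deciding dictionary, radical number) for a code point."""
    sources = FILINGS[code_point]
    for name in PRIORITY:
        if name in sources:
            return name, sources[name]
    raise KeyError(f"no dictionary lists U+{code_point:04X}")


for cp in (0x554F, 0x95EE):
    print(f"U+{cp:04X}", radical_for(cp))
```

U+554F comes out under radical 30 because dictionary #1 wins, while
U+95EE falls through to dictionary #3 and lands under radical 169.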


Thomas Chan
[EMAIL PROTECTED]

status of Jindai scripts?

2001-07-03 Thread Thomas Chan


Hi all,

I'd like to ask about the encoding status of the Japanese Jindai 
scripts, which are mentioned in older documents[1], and until a certain
point in time, versions of the Roadmap.

Here's a scan of a partial table of over a dozen Jindai scripts (except 
the rightmost column, which is modern katakana a to sa), from part of 
the appendix to He 2000[2]:
  http://deall.ohio-state.edu/grads/chan.200/misc/jindai.jpg

[1] e.g., Concerning Future Allocations (April 11, 1993)
http://www.unicode.org/Public/TEXT/ALLOC.TXT

[2] HE Qunxiong \u4f55\u7fa4\u96c4's _Hanzi zai Riben_ 
\u6f22\u5b57\u5728\u65e5\u672c (Han Characters in Japan) (Hong Kong:
Shangwu \u5546\u52d9, 2001).  Book is in Chinese.


Thomas Chan
[EMAIL PROTECTED]

Re: status of Jindai scripts?

2001-07-03 Thread Thomas Chan


On Tue, 3 Jul 2001, Rick McGowan wrote:

 Thomas Chan wrote...
  I'd like to ask about the encoding status of the Japanese Jindai
  scripts, which are mentioned in older documents[1], and until a certain
  point in time, versions of the Roadmap.
 
 Do you have a paper on the topic?  You say over a dozen 'Jindai'  
 scripts.  What does this mean?  Is it a style of stylization?  A style of  
 its own?  Something else entirely?  A cipher on Chinese characters?  Of  
 Kana?
 I don't know anyone who knows enough about it to even answer basic  
 questions.  We know it's Japanese, and is probably associated with Shinto.

Sorry, I don't know more about what they are.  My count of over a dozen
is based on the number of xxx moji's listed in the caption of the
illustration I cited.  They seem to be ciphers of kana.

I'm just puzzled by the disappearance of mentions of them in what I can
find on the publicly available parts of the unicode.org and WG2
websites, even if it's just to say that not enough is known about them to
do anything, e.g., WG2 N1955 (1999.1.26) mentions that there is not enough
data, WG2 2046 (1999.8.15) and later don't mention them at all, as far as
I can see.


Thomas Chan
[EMAIL PROTECTED]

RE: Innovative use of Latin ?!

2001-07-02 Thread Thomas Chan


On Mon, 2 Jul 2001, Ayers, Mike wrote:

  From: Martin Duerst [mailto:[EMAIL PROTECTED]] 
  For people interested in new scripts, and new uses
  of existing scripts :-)
  http://www.google.com/intl/xx-hacker/
 
   This looks like what is called L33T (elite) writing.  It's popular
 among online gamers.  Kinda like computer pig latin...
 
 /|/|ike

The way you sign your messages is related to that, isn't it? :)  I've seen
]\/[, too.

There doesn't seem to be any standard scheme--the goal appears to be to
obfuscate writing by substituting graphically similar characters.  What
Google is using is pretty tame--there were 1980's and 1990's versions
that made use of all the characters in CP437--Greek, math, linedrawing,
etc.


Thomas Chan
[EMAIL PROTECTED]

RE: Nushu

2001-06-26 Thread Thomas Chan


On Tue, 26 Jun 2001, Marco Cimarosti wrote:

 Michael Everson wrote:
  600 characters it is then.
 
 If Nüshu is actually logographic, think that this may be too early a
 conclusion.
 If the figure is based on a single person's sample, it could only reflect
 the vocabulary of that lady or, even worse, the vocabulary of the topics
 that she was dealing with in the sample texts.

If that were the case, 600 is low, as it could mean there is almost a
1-to-1 correspondence between a syllable and Nushu character when
not taking tone into account (cf. the 400 or so extant syllables in the
Mandarin spoken in Beijing, when not counting tones), implying that
most/all homophonous words are written with the same Nushu character, and
that tonal distinctions are omitted in the orthography--neither of which
is the case in the logographic Chinese model.

Alternatively, the 600 could simply represent what is written in the
extant corpus (Marco's point above), and those Nushu writers do only write
on a limited range of formulaic topics.  Chiang 1995 (which I do not have
at the moment) did in fact provide an analysis of the phonology of the
language/dialect written down by Nushu and comparison to its characters.
 

 If it is not possible to do more investigation on Nüshu, I would suggest to
 reserve a bigger area (= 2000 entries), because that is the minimal number
 that I would expect from a logographic script.

Somewhere around 2000 is the bare minimum expected for a high school
education, e.g., the 1945 Jouyou Kanji of Japanese and its slightly
lower predecessor Touyou Kanji, and something similar in the Korean of
South Korea.  For Chinese, the figure is higher--around the 3000s.
(Marco, do you have the figure from the Yin and Rohsenow book?)

Cheng (2000: 109-110) gives various figures for frequently used
characters: 4261 in CHEN Heqin's early 1928 study _Yutiwen Yingyong
Zihui_ (Practical Lexicon for Colloquial Style), and from 2000 to over
7000 for the other works in his list of practical lexicons and
pedagogically oriented frequency books.  The number 4000 or
thereabouts shows up quite often--he cites newspapers in Taipei, Hong
Kong, and Singapore (110) as using about 4000; 4501 in the corpus of the
novel _Honglou Meng_ (110); and about 4000-8000 in each of the dynastic
histories (which are written on a range of topics).  Norman (1988: 73)
provides similar figures and conclusions, 3000-4000 for ordinary literacy.

The question is perhaps how literate are these Nushu writers in relation
to the benchmark of mainstream Chinese, and what is the variety of 
writing.  Maybe also a unification of all the forms in use by all the
writers involved?

 
 I have a book that has a nice chapter about the frequency of Chinese
 character (*). From evaluations made on a modern corpus, they obtained this
 statistical progression:
[snip]
 - the first 2400 cover the 99%,
 - the first 4000 cover the 99.9%,
 - 6359 practically covers 100% of all modern text.
 But we know that Unicode, as well as any CJK character set, has much more
 than 7000 characters.

Not to detract from your point, but I believe the 6359 figure represents
the GB2312 character set, which represents the ceiling for the number of
characters in the study.  (Even then, GB2312 is a pruned repertoire that
is meant only for everyday mainstream use, not literature, technical
fields, dialects, etc.)
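
To illustrate how such coverage figures are derived, and why the
repertoire acts as a ceiling, here is a small Python sketch.  The
function name and the toy sample string are hypothetical; no real corpus
data is used:

```python
# Sketch: how many of the most frequent characters are needed to cover a
# given fraction of a text.  The answer can never exceed the number of
# distinct characters in the sample, i.e. the repertoire is a ceiling.
from collections import Counter


def chars_needed(text, target):
    """Smallest count of top-frequency characters whose occurrences
    make up at least `target` (0.0-1.0) of all characters in `text`."""
    counts = Counter(text)
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)


sample = "aaaaabbbcd"  # toy stand-in for a corpus: a x5, b x3, c x1, d x1
print(chars_needed(sample, 0.5))
print(chars_needed(sample, 0.8))
print(chars_needed(sample, 1.0))
```

With the toy sample, full coverage needs all four distinct characters and
no more, however large the target; a corpus encoded in a fixed character
set behaves the same way.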


This discussion reminds me that I have a scan of the illustration
from the Nushu entry in Florian Coulmas' _Blackwell Encyclopedia of
Writing Systems_ (Cambridge, MA: Blackwell Publishers, 1996) lying around
from a while ago:
  http://deall.ohio-state.edu/grads/chan.200/misc/nushu.jpg
(It's accompanied by a transliteration in Han characters, simplified 
spelling.)  Coulmas in fact says very little about Nushu, but provides two
references.


References

Cheng, Chin-Chuan.  2000.  Frequently-Used Chinese Characters and
  Language Cognition.  Studies in the Linguistic Sciences 30, no. 1
  (Spring 2000): 107-118.
 
Chiang, William.  1995.  _We Two Know the Script; We Have Become Good
  Friends_.  Lanham, MD: University Press of America.  (This is a
  revision of his Ph.D. dissertation.)

Norman, Jerry.  1988.  _Chinese_.  Cambridge: Cambridge University Press.


Thomas Chan
[EMAIL PROTECTED]

Re: How to tell Japanese from Chinese.

2001-06-08 Thread Thomas Chan


On Fri, 8 Jun 2001, てんどうりゅうじ wrote:

 My very simple rule of thumb for telling Japanese from Chinese is to
 look for kana. If I see even one kana, I am looking at Japanese,
 right? (Warning: A few kanji resemble katakana.) So if I see so much
 as a hiragana to, it's Japanese, right? But sometimes there are
 stretches of many kanji.

Yes, that rule of thumb works for most everyday cases that one'll run
into.

However, manyougana would be classified as Chinese under that rule, as
well as kanbun.  I'm not sure that one would want to classify the more
deviant (from a classical Chinese POV) and more Japanized forms of
kanbun as Chinese.

Have you seen hentaigana before?--that straddles the boundary between
being kanji used for transliteration/transcription and being kana.  (How
would such text be encoded in Unicode, if at all?)

 
 Doesn't this kanji
 bad-ascii-art
[snip]
 /bad ascii art NOT to be confused with hiragana e (oy vey),
 usually only appear in Chinese?

In other words, \u4e4b vs. \u3048.

I presume you're asking for purposes of a human reader, as a machine could
easily detect the difference between U+4E4B and U+3048.  But then, a
sufficiently literate human could probably read the text and eliminate one
of the two choices.

Marco has already explained U+4E4B.

But if you want a simple system based only on presence (or lack) of
certain characters, then I'd look for common Chinese ones such as:

  \u9019 (\u8fd9) zhe 'this'
  \u5011 (\u4eec) men (plural suffix)
  \u9ebc (\u4e48) me (as in \u751a\u9ebc, \u751a\u4e48, \u4ec0\u4e48
 shenme 'what')
  \u55ce (\u5417) ma (question particle)
  \u4f60 ni 'you'
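
For the machine side of this, a rough Python sketch of both rules of
thumb (kana presence first, then the marker characters listed above).
The function names are hypothetical, and the kana test is simply a range
check over the Hiragana and Katakana blocks:

```python
# Sketch: guess Japanese vs. Chinese from the presence of certain characters.
# This is only the rule of thumb from the thread, not real language detection.

# Marker characters from the list above (traditional and simplified forms).
CHINESE_MARKERS = set(
    "\u9019\u8fd9"   # zhe 'this'
    "\u5011\u4eec"   # men (plural suffix)
    "\u9ebc\u4e48"   # me, as in shenme 'what'
    "\u55ce\u5417"   # ma (question particle)
    "\u4f60"         # ni 'you'
)


def has_kana(text):
    # Hiragana block U+3040-U+309F and Katakana block U+30A0-U+30FF.
    return any(0x3040 <= ord(ch) <= 0x30FF for ch in text)


def guess_language(text):
    if has_kana(text):
        return "Japanese"
    if any(ch in CHINESE_MARKERS for ch in text):
        return "Chinese"
    return "undetermined"


print(guess_language("\u3048"))        # hiragana e
print(guess_language("\u4f60\u597d"))  # contains ni 'you'
print(guess_language("\u4e4b"))        # U+4E4B alone decides nothing
```

As noted above, text in manyougana or kanbun contains no kana, so it
would not be caught by the first test.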

 
 Pardon my incoherence. I haven't had enough sake.

\u9152 on a sign or label--is that Chinese or Japanese?  Hard to tell.

If you're familiar with some differences in simplification, you can also
make corroborating conclusions, e.g., if I see \u6226 embroidered
on someone's baseball cap, rather than \u6230 or \u6218, then that
strikes me as Japanese.  (Not that the person who made it or wearing it
probably knows or cares.)


Thomas Chan
[EMAIL PROTECTED]

Re: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

2001-06-06 Thread Thomas Chan


On Wed, 6 Jun 2001, John Cowan wrote:

 Marco Cimarosti scripsit:
  Or Hanyu, in fact, which is the normal name for Mandarin in Mandarin.
 
 I believe, however, that this term is relatively recent in its current
 sense, and is part of the effort the PRC government makes to distinguish
 between zhongguo as a political term and han as an ethnic one.

Hanyu usually does refer to Mandarin (in the same way that Chinese in
English usage usually refers to Mandarin), but I think that is because
Mandarin is considered the standard or is the most salient form of
Chinese.  There are usages such as in the title of the book _Hanyu Fangyan
Cihui_[1], a Swadesh-esque comparison of terms in various fangyan
(topolects; usu. trans. as dialects), where non-standard Mandarin and
non-Mandarin forms of Chinese are included under the umbrella Hanyu
term.

(FYI, here's a sample from the 2nd ed. of the tomato entry:
  http://deall.ohio-state.edu/grads/chan.200/misc/tomato.jpg)

[1] _Hanyu Fangyan Cihui_  \u6c49\u8bed\u65b9\u8a00\u8bcd\u6c47,
1st ed. (Beijing: Wenzi Gaige, 1964); 2nd ed. (Beijing: Yuwen,
1995).


There is a discussion in one of the later chapters of Jerry Norman's
_Chinese_ (Cambridge: Cambridge University Press, 1988) of
terms including hanyu, putonghua, zhongwen, guanhua, etc.


Thomas Chan
[EMAIL PROTECTED]

Re: Unicode under fire again

2001-06-05 Thread Thomas Chan


On Tue, 5 Jun 2001, Mark Leisher wrote:

 http://www.hastingsresearch.com/net/04-unicode-limitations.shtml

Hmm, lots of ink spilled just to rehash a weak argument against Han
Unification.

Beneath the misinterpretations and typos, I do see hints of valid concerns
about yet-unencoded Han characters (e.g., neglect/bias of/against Taoism
and old texts--the sort of thing he works with), which would have made a
better argument.

It wouldn't hurt for him to have newer and more advanced references, too.


Thomas Chan
[EMAIL PROTECTED]

Re: New kana letters (was RE: Oriyan Language)

2001-06-05 Thread Thomas Chan


On Tue, 5 Jun 2001, Marco Cimarosti wrote:

 11 wrote:
  [...] if you look at the chart for hiragana, you see that 
  they left space for four new kana, just in case somebody 
  decided to invent new kana. [...] I think they're
 
 Both things, I think.
 For new letters, see the yellow squares in:
[snip]
 http://www.unicode.org/charts/draftunicode32/U32-31F0.pdf

Interesting.  Why is the U+31F0 to U+31FF block currently named Katakana
Phonetic Extensions?

Phonetic seems like unnecessary wordiness--is the standard set of
katakana not phonetic?

There also seem to be multiple ways to name an extension/supplement to
an existing block/script:

  Basic Latin                  : Latin-1 Supplement, Latin Extended-A,
                                 Latin Extended-B, IPA Extensions, Latin
                                 Extended Additional
  Greek                        : Greek Extended
  Kangxi Radicals              : CJK Radicals Supplement
  Bopomofo                     : Bopomofo Extended
  CJK Unified Ideographs       : CJK Unified Ideographs Extension A, CJK
                                 Unified Ideographs Extension B
  CJK Compatibility Ideographs : CJK Compatibility Ideographs Supplement

Any pattern to this, or is it just historical?


Thomas Chan
[EMAIL PROTECTED]

RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

2001-06-05 Thread Thomas Chan


On Tue, 5 Jun 2001, Marco Cimarosti wrote:

 BTW, I don't see a problem of political correctness here. The terms used by
 Japanese or Korean themselves (kanji and hanja, respectively) literally
 means Han or Chinese characters.

They have the advantage of having more than one morpheme/word that maps to
English Chinese.  e.g., in Japanese, there is chuu vs. kan, such as
the near-minimal pair chuu-goku-go 'Chinese language (lit. language of
China)' vs. kan-go 'word composed of Sino-Japanese morphemes'.

Perhaps if Han is too unfamiliar a word to be used directly, Sino or
Sinitic could be used as translations to convey the same meaning without
using the overloaded term Chinese (language, culture, origin, ethnicity,
nationality, etc), e.g., Sino characters, Sinitic characters.


 Rather, there is a possible technical
 incorrectness, because the term hides the fact that Japanese and Korean also
 use their own local phonetic characters.

Unfortunately, similar technical incorrectness already exists, e.g., the
Japanese-enabled (and probably localized as well) version of a product,
website, etc is sometimes known as the kanji version, despite the
presence of kana; similarly, the Korean version is sometimes (almost
always?) labeled as the hangul version, despite the rare use
of ideographs.  Actually, what's worse is that hangul is often used
rather than the name of the language--one'll see a list of choices like
English, Francais, Deutsch, Nihongo, Hangul... (the last not in Latin
script, of course, but in its own).  (I won't go into how simplified
Chinese is an abused and misunderstood term, for now.)


Thomas Chan
[EMAIL PROTECTED]

RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

2001-06-04 Thread Thomas Chan


On Mon, 4 Jun 2001, Ayers, Mike wrote:

  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
  For the Han characters, I have found in the past that people 
  whose native 
  language does not use these characters usually refer to them 
  as Chinese.  
  Obviously (to us anyway), calling them Chinese characters 
  is not adequate, 
  so we search for alternatives.
 
   Consider:
   Hanzi - Chinese characters
   Kanji - Chinese characters
   Hanja - Chinese characters
   ...so it would seem that everyone but us agrees on what to call 'em.

I think the problem that Doug might be suggesting (correct me if I'm
wrong, Doug) is that Chinese is also the name of a language(s).  The
term Han doesn't have this problem, but the term is unfamiliar to most
people.


Thomas Chan
[EMAIL PROTECTED]

Re: Term Asian is not used properly on Computers and NET

2001-05-30 Thread Thomas Chan


On Tue, 29 May 2001, David Gallardo wrote:

 Please excuse the unintended querulousness, but isn't the Greenwich meridian
 merely the reification of this bias?
[snip]
 Nonetheless, and more to my point,  the terms Near East and Far East
 were in use long before this.

There are also terms like the West or Western (world, languages,
civilization, etc) which have referents that are not completely west of
the Greenwich Meridian, whose usage cannot be simply explained or
justified by it.


Thomas Chan
[EMAIL PROTECTED]

RE: Braille vs Bidi

2001-05-30 Thread Thomas Chan


On Wed, 30 May 2001, Marco Cimarosti wrote:

 Kenneth Whistler wrote:
  I doubt it. But if Marco is correct that Hebrew braille is 
  left-to-right, there could conceivably be some exemplary
  printed materials in Hebrew, with braille examples, [...]
 
 There is a very nice book about how braille is used in each language.
 Unluckily I can't remember any reference off-hand (I think it was edited by
 UNESCO), but it was mentioned time ago also on this list.

There are at least two: the 1953 edition of _World Braille Usage_ by
Sir Clutha Mackenzie published by UNESCO, and the 1990 edition of the same
title (but not by him), published by UNESCO and the US Library of Congress
(or a dept of it--I don't have the book in front of me to give full bib
details).

They're both worth looking at to compare the changes over
about four decades.  There are some mistakes and omissions, though, which
can be partially detected if one is familiar with the language's sound
system, syllable structure, or writing system, e.g., handling of
Vietnamese tones is omitted in the 1990 edition--but the bibliography
contained within allows one to track down the original sources.  The 1953
edition also contains interesting background on their goals and history of
braille, including multiple incompatible 6-dot systems for English and
German at one point in the past (of course, those readers couldn't simply
convert their books into a readable format), and trying to create unified
standards.  Neither book really contains extended writing samples to show
directionality, though.


Thomas Chan
[EMAIL PROTECTED]

RE: Term Asian is not used properly on Computers and NET

2001-05-29 Thread Thomas Chan


On Tue, 29 May 2001, Marco Cimarosti wrote:

 Doug Ewell wrote:
  Peter has an excellent solution -- much better than trying to 
  explain the 
  term CJK to ordinary people -- and I plan to use the term 
  East Asian in the future.
 
 But, if by East Asian you mean languages written with Han ideographs,
 you fall in another pitfall, because Mongolian, Russian, Vietnamese and many
 other languages spoken in East Asia aren't accounted for.

There are many pitfalls.  Does the definition exclude Korean when written
solely in Hangul?  Is Vietnamese clearly East Asian?  How about Yi
(TUS3.0 thinks so)?  Does it include the Cyrillic-writing Dungan Chinese?
How about Zhuang written with Han characters?  Min Chinese in Latin
script?  Etc.

I think what one wants is something like languages usually and currently
possibly including Han characters in their written form.  That frees us
from worrying about historical or aberrant cases, I think.

Or how about just languages written with a very large collection of
characters?  Then we can include the Tangut, et al too, without including
some of the medium-sized syllabaries.  (This does require a distorted
analysis of hangul, though.)

 
 Personally, I got used to the acronym CJK and, so far, I haven't met many
 people who are so ordinary not to understand the explanation that CJK
 means Chinese/Japanese/Korean.

One problem with that term is that its members are very transparent; some
people would like to add V to that to include historical Vietnamese
usage.  Is one going to add Z for Zhuang and whatever other letters
tomorrow?

 
 Rather, my problem with the acronym is that I don't know how to translate it
 in Italian: CGC (for cinese/giapponese/coreano) is horrible, especially
 if you consider that it is pronounced chee-jee-chee.
 Moreover, unlike the English acronym, the three initials are not in
 alphabetical order, and this could be seen as politically incorrect. So we
 should have chee-chee-jee, which is even worse, if possible.

Feel free to switch them around for local consumption--the Japanese term
nichi-chuu-kan arranges them Japanese-Chinese-Korean.  I don't think you
can win if you try to be politically correct to everyone--e.g., there is
a faction that is unhappy with Korea spelled with K rather than C, as
it alphabetizes after J.


Thomas Chan
[EMAIL PROTECTED]

RE: Term Asian is not used properly on Computers and NET

2001-05-29 Thread Thomas Chan


On Tue, 29 May 2001 [EMAIL PROTECTED] wrote:

 On 05/29/2001 02:37:55 PM Thomas Chan wrote:
 I think what one wants is something like languages usually and currently
 possibly including Han characters in their written form.  That frees us
 from worrying about historical or aberrant cases, I think.
 
 Folks, this discussion was about how to label a control in a dialog box, as
 in the attached image. You can't use a label like that.

It may have begun with that (Word XP font dialog box), which was just one
of the original poster's (Liwal) examples--he also mentioned websurfing in
general--but I think the discussion has expanded beyond that into a
discussion of the meanings and implications of various other terms.

I don't disagree that a label has to be concise.


Thomas Chan
[EMAIL PROTECTED]

Re: Radical of U+4E71

2001-05-28 Thread Thomas Chan


On Mon, 28 May 2001, てんどうりゅうじ wrote:

 I thought the radical was tongue, not hook.

It depends on what source you use as your authority.

The _Kangxi Zidian_ says it goes under radical #5, U+2F04 KANGXI RADICAL
SECOND (not #6, U+2F05 KANGXI RADICAL HOOK, btw), and it seems for this
particular character U+4E71, the other three dictionaries (Dai Kanwa
Jiten, Hanyu Da Zidian, and Dae Jaweon) happen to agree.

I don't see a pointer on p. 895 of TUS3.0 under radical #135, U+2F86
KANGXI RADICAL TONGUE, but maybe some people who are unfamiliar with the
character would need it (and it would be logical to try looking
under what's on the left half of the character first).
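
The radical numbers and the U+2Fxx code points quoted here line up
because the Kangxi Radicals block runs in radical order from U+2F00.  A
quick Python check (the helper name is hypothetical; the character names
come from Python's bundled unicodedata tables):

```python
# Sketch: map a Kangxi radical number (1-214) to its code point in the
# Kangxi Radicals block, which starts at U+2F00 with radical #1.
import unicodedata


def kangxi_radical(number):
    if not 1 <= number <= 214:
        raise ValueError("Kangxi radical numbers run from 1 to 214")
    return chr(0x2F00 + number - 1)


for n in (5, 6, 135):
    ch = kangxi_radical(n)
    print(n, f"U+{ord(ch):04X}", unicodedata.name(ch))
```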


 In case you're wondering, the kanji in question is ran as in
 random. That IS what it is, isn't it?

Yes, U+4E71 is 'chaos' and other related/derived words; luan in Chinese,
ran in Japanese.  It's what appears on the box art for Kurosawa's movie
_Ran_.


However, there are cases such as U+7551, hatake 'dry field', which is,
surprise! (well, maybe just to Chinese people), filed under radical #102,
U+2F65 KANGXI RADICAL FIELD, rather than #86, KANGXI RADICAL FIRE.
According to the priorities in the Han Ideograph Arrangement section on
p. 266 of TUS3.0, it failed to be found in the _Kangxi Zidian_ (dictionary
#1), so it was up to the _Dai Kanwa Jiten_ (dictionary #2) to dictate
where it goes in Unicode, despite dictionary #3, the _Hanyu Da Zidian_,
saying it should go under #86, etc.

(If and when U+7551 does appear in Chinese dictionaries, it is filed under
radical #86.  Why include it if this character is used in Japanese and not
Chinese?--the story I have heard is that it appears as part of the name of
a Japanese soldier in one of MAO Zedong's works.  People must want to know
what it means and how to read it, so it acquired an artificial tian2
reading in Chinese.)

 
 Some -- a very few -- kanji actually have proper names in Unicode, if
 you know how to look for them, such as IDEOGRAPH FIRE.

I think that's a bit misleading.

You must be referring to either or both of these two:
  U+322B PARENTHESIZED IDEOGRAPH FIRE
  U+328B CIRCLED IDEOGRAPH FIRE

In this particular case, not only is there something different about them
(parenthesized and circled), but if you look at the context of the source
they probably came from, they were probably meant as abbreviations for
'Tuesday' (in Japanese and Korean use), and not 'fire'.  e.g., U+322A
PARENTHESIZED IDEOGRAPH MOON for 'Monday' .. U+3230 PARENTHESIZED
IDEOGRAPH SUN for 'Sunday', and likewise for U+328A .. U+3290.
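
The two runs of seven can be listed with a few lines of Python; the
names come from Python's bundled unicodedata tables, and the weekday
reading is the interpretation given above, not something the character
names state:

```python
# Sketch: print the parenthesized and circled ideograph runs that look
# like day-of-week abbreviations (moon, fire, water, wood, metal, earth, sun).
import unicodedata

RUN_STARTS = (0x322A, 0x328A)  # parenthesized run, circled run


def element(code_point):
    # Last word of the character name, e.g. "FIRE".
    return unicodedata.name(chr(code_point)).split()[-1]


for start in RUN_STARTS:
    for cp in range(start, start + 7):
        print(f"U+{cp:04X}", unicodedata.name(chr(cp)))

parenthesized = [element(cp) for cp in range(0x322A, 0x3231)]
circled = [element(cp) for cp in range(0x328A, 0x3291)]
```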


Thomas Chan
[EMAIL PROTECTED]

Re: name this hanzi

2001-05-27 Thread Thomas Chan


On Sat, 26 May 2001, Richard Cook wrote:

 Gaspar Sinai wrote:
  On Sat, 26 May 2001, Richard Cook wrote:
   Here's a puzzle: Any idea 1.) what this character is, and 2.) if 
   it's in Unicode?
   http://linguistics.berkeley.edu/~rscook/bishop/Picture1.gif
 
  This CJK millet in English (kibi in Japanese) U+9ECD
 
 Yes, that's right. You're the winner. :-)
 This is a *variant* of [U+9ecd], with [U+efe2] in place of [U+6c3a].

What is U+EFE2?


 Compare e.g. [U+257e6] and also [U+22863].
 I know from context that it is in fact a variant writing of [U+9ecd] ...
 since it appears in DUAN Yucai's gloss at 113.410 (SWJZZ). But what i
 don't know is why he wrote it this way ... 
 This form is not in Ext. B, nor HYDZD, nor Kangxi, as far as I can tell
 So, did you find this exact form in any character dictionary or list?

I don't know where Gaspar found it, but you may want to use the Jiaoyubu
Yitizi Zidian (Ministry of Education Dictionary of Chinese Character
Variants) by the Zhonghua Minguo Jiaoyubu Guoyu Tuixing Weiyuanhui
(Mandarin Promotion Council, Ministry of Education, Republic of China):
  http://140.111.1.40/
I believe it only came out earlier this year.  Site is in Chinese, of
course.  Big5 encoding.

The particular variant you are looking for is here, indexed as a04768-009:
  http://140.111.1.40/yitia/fra/fra04768.htm
It gives three sources where it appears, including the _Yupian_--perhaps
DUAN or his printer got it from there?  What's nice is that scans of
the entries from most of their sources are available (primarily the older
ones).

Despite all the attention _Kangxi Zidian_ and _Hanyu Da Zidian_ get as
being comprehensive, they are sometimes selective in what they
inherit from earlier dictionaries.


Thomas Chan
[EMAIL PROTECTED]

Re: supplementary planes support

2001-05-25 Thread Thomas Chan


On Fri, 25 May 2001, Markus Scherer wrote:

 Thomas Chan wrote:
  than Italian's 37 million (http://www.sil.org/ethnologue/top100.html).
 
 Italy has about 60 million people. Do you not count at least most of
 them as speakers of Italian, plus some in Switzerland etc.?

I'm just quoting the SIL figure (for comparative purposes to Hakka).  If
you dispute it, take it up with them, not me.


Thomas Chan
[EMAIL PROTECTED]

supplementary planes support

2001-05-24 Thread Thomas Chan


Hi all,

A while ago, there were questions about the applicability of Han
characters in Plane 2 for contemporary everyday use (as opposed to
historical or specialist use), to which I offered some Cantonese (SIL
YUH) examples.

I believe I've found a better example.  U+2028E is ngai, the Hakka (SIL
HAK) first person pronoun.

According to the online edition of the SIL Ethnologue, 13th ed., there are
34 million speakers worldwide 
(http://www.sil.org/ethnologue/countries/Chin.html#HAK)--slightly less
than Italian's 37 million (http://www.sil.org/ethnologue/top100.html).

The language is on the decline, though, as they are scattered and tend to
be linguistic minorities where they live, and they are losing speakers to
the more prestigious Mandarin (SIL CHN) or Cantonese (SIL YUH).

To my knowledge, no other character sets (including CNS 11643 and HKSCS)
include this certainly common character--I'm actually surprised.
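
For anyone checking whether their software handles a Plane 2 character
like this one, here is a small Python sketch of what U+2028E looks like
in UTF-16 and UTF-8.  The helper name is hypothetical; the arithmetic is
the standard surrogate-pair calculation:

```python
# Sketch: a supplementary-plane code point needs a surrogate pair in
# UTF-16 and four bytes in UTF-8.


def utf16_surrogates(code_point):
    """Return the (high, low) surrogate pair for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


ngai = "\U0002028E"
high, low = utf16_surrogates(ord(ngai))
print(f"UTF-16: {high:04X} {low:04X}")
print("UTF-16BE bytes:", ngai.encode("utf-16-be").hex())
print("UTF-8 bytes:   ", ngai.encode("utf-8").hex())
```

Software that assumes one 16-bit unit per character will see this as two
units, which is the usual way supplementary-plane support breaks.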


Thomas Chan
[EMAIL PROTECTED]

Re: Single Unicode Font

2001-05-22 Thread Thomas Chan


On Tue, 22 May 2001, David Starner wrote:

 In any case, some scripts just go together. Mathematicians and
 linguists frequently use Latin and Greek together (cf. IPA) in ways
 that require consistent font looks.

Do these need to go together?  IPA is Latin, except for a few unfortunate
unifications with Greek.


Thomas Chan
[EMAIL PROTECTED]

RE: Word, Asian characters, and Arial Unicode

2001-05-07 Thread Thomas Chan


On Sun, 6 May 2001, David J. Perry wrote:

 In classical studies, characters with the shape of U+3008/09, 300A-300F,
 3016/17, and 301A/1B are sometimes used to mark various kinds of editorial
 uncertainty or conjecture in a text.  The first and last pairs in my list
 are the most common by far (I know 3008/09 has another version somewhere in
 the math block). The guillemets (U+00AB/BB) and the greater than/less than
 signs do not have the appropriate shapes.  Since these are already in
 Unicode, it would seem best to use them rather than proposing them for
 inclusion in another range.

I'm sorry, but I don't understand how U+00AB/U+00BB or U+226A/U+226B don't
have the appropriate shape, but that U+300A/U+300B would.  U+300A/U+300B
are taller than the guillemets U+00AB/U+00BB, but they tend to not be
proportionally-spaced and are not centered inside their square (i.e.,
U+300A, as a left bracket, is padded out with white space to the left of
it), not to mention having vertical forms (probably not relevant for you).

In lieu of U+3008/U+3009, why not U+003C/U+003E, U+2039/U+203A, or
U+2329/U+232A?  (All of these are suggested on p. 568 of TUS3.0.)
Similarly, in lieu of U+300C/U+300D, why not U+2308/U+230B (also
suggested, but they are really math symbols)?

I wasn't aware that brackets similar to U+300E/U+300F, U+3016/U+3017,
U+301A/U+301B existed outside of CJK usage, although they don't seem to be
as common, and I have no idea what U+301A/U+301B would be used for.
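
To compare the candidates side by side, a short Python listing of names
and East Asian Width values may help.  The pair list is just the code
points mentioned in this thread, and describe() is a hypothetical helper;
the data comes from Python's bundled unicodedata tables, so widths
reflect whatever Unicode version that build carries:

```python
# Sketch: list the name and East Asian Width of each bracket candidate.
# A width of "W" (wide) marks characters that normally occupy a full
# CJK cell.
import unicodedata

PAIRS = [
    (0x00AB, 0x00BB),  # guillemets
    (0x226A, 0x226B),  # much less-than / much greater-than
    (0x300A, 0x300B),  # CJK double angle brackets
    (0x3008, 0x3009),  # CJK angle brackets
    (0x2329, 0x232A),  # pointing angle brackets
    (0x300C, 0x300D),  # CJK corner brackets
    (0x2308, 0x230B),  # left ceiling / right floor
]


def describe(code_point):
    ch = chr(code_point)
    return (
        f"U+{code_point:04X}",
        unicodedata.east_asian_width(ch),
        unicodedata.name(ch),
    )


for left, right in PAIRS:
    print(*describe(left))
    print(*describe(right))
```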


Thomas Chan
[EMAIL PROTECTED]

Re: Word, Asian characters, and Arial Unicode

2001-05-06 Thread Thomas Chan


On Sun, 6 May 2001, David J. Perry wrote:

 Word 2000 (under Win98) insists on using Arial Unicode MS whenever you
 insert a character in the CJK Punctuation range.  There are some characters
 here that might be useful in non-CJK situations, such as the double
 brackets.  I have made a font with these characters but Word will not let me

By double brackets, do you mean U+300A and U+300B (LEFT/RIGHT DOUBLE
ANGLE BRACKET)?  Those are used to delimit titles of books and articles.
What kind of usage do you have in mind that U+00AB and U+00BB
(LEFT/RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK) or U+226A and U+226B
(MUCH LESS/GREATER-THAN) couldn't be used?

Not all of the symbols in that block have analogues elsewhere, though--I'm
not sure what you can do about that.


Thomas Chan
[EMAIL PROTECTED]

Re: FW: IDS question

2001-04-30 Thread Thomas Chan


On Mon, 30 Apr 2001, Ayers, Mike wrote:

 From: Thomas Chan [mailto:[EMAIL PROTECTED]]
 There are some characters in LENG Yulong and WEI Yixin's 
 _Zhonghua Zihai_
 dictionary (Beijing: Zhonghua, 1994), such as gu2 on p. 31 
 and lin2 on p.
 32 that incorporate a circular component.  I'd probably 
 describe them as:
[snip]
 (Both look somewhat like crosshairs.)
 
   Could you please scan the characters in question?  If you can't post
 them to a web page, you may mail them to me personally.  I think I know what
 this is, but need to see. 

I really should have a picture, but I was only able to use the dictionary
in question briefly while in Minneapolis, so I only have my notes:
  http://deall.ohio-state.edu/grads/chan.200/misc/gu-lin.jpg

(If anyone has access to this dictionary and can scan the entries in question,
I'd appreciate it.)

p. 31's gu2 reads yin gu yangping 'sound is the same as that of the
character gu1 '(paternal) aunt', but in the yangping tone class [hence gu2
in Mandarin]'; yi wei xiang jian _Bian Hai_ 'meaning not yet clear [to
the dictionary compilers], seen in the _Bian Hai_'.

p. 32's lin2 reads yin lin 'sound is same as that of the character lin2
'forest''; and then the same compiler comment and source reference.

Rather weird cases, but LENG Yulong, WEI Yixin, et al. felt they were Han
characters.


Thomas Chan
[EMAIL PROTECTED]

IDS question

2001-04-28 Thread Thomas Chan


Hi all,

I've recently been using Ideographic Description Sequences to describe
some Han characters that are not in Unicode 3.1, and I noticed that
U+3007 is not included in the set of UnifiedIdeographs, despite having
the ideographic property (TUS3.0, p. 269; UAX #27, section 10.1).  I
understand that compatibility ideographs are not allowed to participate
in IDS, but U+3007 doesn't have a clone, as far as I know.

There are some characters in LENG Yulong and WEI Yixin's _Zhonghua Zihai_
dictionary (Beijing: Zhonghua, 1994), such as gu2 on p. 31 and lin2 on p.
32 that incorporate a circular component.  I'd probably describe them as:

gu2, p. 31:
  U+2FFB IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID
  U+5341 (shi 'ten')
  U+3007 (ling 'zero')

lin2, p. 32:
  U+2FFB IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID
  U+2FFB IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID
  U+5341 (shi 'ten')
  U+3007 (ling 'zero')
  U+3405 (x-like shape)

(Both look somewhat like crosshairs.)

However, those aren't valid sequences.  I realize the above two characters
are rather odd, but the likes of U+3AB3 and U+3AC8 would have faced the
same problem, since they also incorporate a circular component.
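
To make the objection concrete, here is a minimal sketch of an IDS checker
in Python.  It is a simplification under stated assumptions: the code point
ranges are my reading of the unified ideograph repertoire as of Unicode 3.1,
radicals are left out as operands, and no length limits are enforced.  The
names UNIFIED_RANGES, parse() and is_valid_ids() are made up for this
illustration.

```python
# Assumed ranges for unified ideographs as of Unicode 3.1
# (URO, Extension A, Extension B); not copied from the standard.
UNIFIED_RANGES = [(0x4E00, 0x9FA5), (0x3400, 0x4DB5), (0x20000, 0x2A6D6)]

# U+2FF2 and U+2FF3 take three operands; the rest of U+2FF0..U+2FFB take two.
TRINARY_IDC = {0x2FF2, 0x2FF3}
BINARY_IDC = set(range(0x2FF0, 0x2FFC)) - TRINARY_IDC


def is_unified(cp):
    return any(lo <= cp <= hi for lo, hi in UNIFIED_RANGES)


def parse(cps, i=0):
    """Consume one IDS starting at index i and return the index after it.

    Raises ValueError at the first code point that does not fit the grammar.
    """
    if i >= len(cps):
        raise ValueError("sequence ends early")
    cp = cps[i]
    if cp in TRINARY_IDC:
        arity = 3
    elif cp in BINARY_IDC:
        arity = 2
    elif is_unified(cp):
        return i + 1
    else:
        raise ValueError("U+%04X is neither an IDC nor a unified ideograph" % cp)
    i += 1
    for _ in range(arity):
        i = parse(cps, i)
    return i


def is_valid_ids(cps):
    """True if cps is exactly one well-formed IDS under the assumptions above."""
    try:
        return parse(cps) == len(cps)
    except ValueError:
        return False


GU2 = [0x2FFB, 0x5341, 0x3007]
LIN2 = [0x2FFB, 0x2FFB, 0x5341, 0x3007, 0x3405]

print(is_valid_ids(GU2), is_valid_ids(LIN2))
```

Both sequences are rejected solely because of U+3007; swapping it for any
unified ideograph makes them well-formed under this grammar.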

What would be the advisable way to handle these cases, besides
creating invalid IDS sequences, using the PUA, or giving a prose
description?


Thomas Chan
[EMAIL PROTECTED]

RE: Moving mail lists

2001-03-20 Thread Thomas Chan


On Tue, 20 Mar 2001, Mike Lischke wrote:

  Our old list manglement software will be retired
  to a far far bitter place in the bucket, and
  in its stead we are going to an open-source
  package called Listar which will be much more
  flexible, and will include digest mode.
 
 Just out of curiosity, why do you use an own mailing list server if
 you can use a free one (Yahoo Groups)? The Unicode list is mirrored
 there anyway, so why not make the "backup list" being the actual list.

Not all posts that make it to [EMAIL PROTECTED] end up at the "unicode"
group at Egroups/Yahoo Groups--some are mysteriously lost.  Quoting with
greater-than signs is also messed up with the HTML interface there, too.


Thomas Chan
[EMAIL PROTECTED]

Re: TRON

2001-03-14 Thread Thomas Chan


On Wed, 14 Mar 2001 [EMAIL PROTECTED] wrote:

  From: "Suzanne M. Topping" [EMAIL PROTECTED]
  After doing some surfing on the topic, it appears to be in at least some
  use (primarily in Japan?) I hadn't thought there were any viable
  alternatives to Unicode out there, and was surprised to see TRON looking
  as alive as it does.
 
 A recent version of the commercial implementation of BTRON from
 Personal Media, called Cho Kanji 3(Cho means Super in Japanese),
 claims 171,500 characters are supported. While we take the
 approach of unifying glyphic variants, their approach on this is to
 distinguish them all.

I believe the website is http://www.chokanji.com/ .  However, that
171,500 figure should at least be halved before even beginning discussion
or comparison to Unicode, as it inherits the collections of both Mojikyo
(http://www.mojikyo.org/) and Tokyo University's GT Mincho fonts
(http://www.l.u-tokyo.ac.jp/GT), which both have included the 48,000+
kanji from the _Dai Kanwa Jiten_ (aka Morohashi) dictionary, i.e.,
flat-out overlap, and not an issue of who considers what to be a glyph
variation.

Contents of Mojikyo (~80,000):
  http://www.mojikyo.org/html/download/pdf/pdflist.htm

Contents of GT Mincho (~64,000):
  http://www.l.u-tokyo.ac.jp/KanjiWEB/04_01.html

Contents of Chokanji (~171,500):
  http://www.chokanji.com/ck3/webp/soft.html#mojikind
  http://www.chokanji.com/ck3/webp/feature-moji.html

(Above websites are in Japanese, but there are some pictures.)

I'm not sure about the "universality" of any of these; the emphasis seems
to be mostly on kanji--they can only be a regional [Japanese] alternative
to Unicode.


Thomas Chan
[EMAIL PROTECTED]

RE: UTF8 vs. Unicode (UTF16) in code

2001-03-12 Thread Thomas Chan


On Mon, 12 Mar 2001, Marco Cimarosti wrote:

 Thomas Chan wrote:
  How about the case of a retailer who needs to deal with parts for
  elevators and needs U+282E2, lip 'elevator'?  Or neckties, requiring
  U+27639, taai 'tie'.
 
 I am not seeking excuses to not implement UTF-16 -- rather examples of
 characters that *do* justify it.

I did not mean to imply that you were looking for excuses; sorry if it
came across that way.  However, there are people who have potentially
legitimate reasons to conserve costs and resources by implementing a
subset, e.g., the "CJK Unified Ideographs" block in the BMP is one of the
first things to go when people want to make fonts lightweight, or do not
want or have expertise to draw all those glyphs.  If the perception is
that effort for supplementary characters is only for 
"rare/obscure/historic CJKV" (which is admittedly true for most 
supplementary characters at the moment), then some people might not bother
with support for surrogates, UTF-16, etc.  (And Plane 1 users like
musicians, mathematicians, and LDS would be "hurt" in the crossfire, too,
since it is an "all-or-nothing" matter.)

 
 And all your examples are perfectly valid: it would be crazy to tell users:
 "Sorry: because of software limitations, you cannot order ties or
 elevators".

I neglected to describe U+27639, taai 'tie'.  It looks like a 
left-to-right horizontal arrangement of U+8864 U+592A.

 
 OT
 Out of curiosity, are these loanwords from English? Or is it just a
 coincidence that they sound like "lift" and "tie"?
 /OT

I don't think your question is entirely off-topic.  Loanwords are one way
for a language to gain new words and morphemes, some of which will be
assimilated enough to eventually find a written representation.  In the
case of Cantonese which is liberal enough to accept loanwords (rather than
preferring calques made of native morphemes, like Mandarin), but
conservative enough to prefer writing in Han characters (rather than
romanization, as preferred for Southern Min in some quarters), that means
there'll be new characters invented for some of these new words (when
existing characters are not reused, such as diksi 'taxi' \u7684\u58eb),
which results in new candidates to be added to Unicode.

To answer your question, yes, "lip" and "taai" are loanwords from
(British) English lift and tie.  See for instance SUN Zehua's
"Xianggang de wailaici" (English Loanwords in Hong Kong)[1] for more
examples.

[1] http://home.ust.hk/~lbsun/hkloan.html  The page is in Big5, and
written within the limitations of pure Big5, such as the graph U+6064 for
seut 'shirt', instead of the more preferable (and recent) U+88C7.  i.e.,
Sun's page's emphasis is on loanwords, and not the orthography.

 
 However, I guess that Cantonese speakers might use dialectal terms (like
 "lip" and "taai" above) even when writing in literary Mandarin. And
 certainly they would not Mandarinize proper names.

Yes, as the written language is not the spoken language, "errors" do show
up, making the text less universally intelligible.  In addition, there is
also a register difference between the Mandarinesque singgonggei 'elevator' 
\u5347\u964d\u843d, and the above-mentioned "lip", that a writer may wish
to make use of.

 
  Pentagrams?  I haven't seen those... where are they?
 
 Hmmm... This is possibly an Italian word badly Anglicized. I just meant
 "musical notation".

Okay.  I thought perhaps there were additions to "Misc Symbols" U+2600 ..
U+267F or elsewhere that I had missed.


Thomas Chan
[EMAIL PROTECTED]

RE: UTF8 vs. Unicode (UTF16) in code

2001-03-09 Thread Thomas Chan


On Fri, 9 Mar 2001, Marco Cimarosti wrote:

 Addison P. Phillips wrote:
  [...]
  currently there are no characters "up there" this isn't a really big
  deal. Shortly, when Unicode 3.1 is official, there will be 40K or so
  characters in the supplemental planes... but they'll be 
  relatively rare.
 
 This reminds me of a question that I wanted to ask since a lot time: how
 rare is the most common of characters in the extended planes? H... Maybe
 I should be clearer.
 Does it exist at least one character  U+ that is commonly used in at
 least one modern language?

How about music and math notation?


But, yes.  U+21075,[1] gan, is an aspect marker in Cantonese, that when
placed after a verb, denotes continuing action (roughly equivalent to
-ing in English).  I don't think anyone would dispute the 
indispensability or high frequency of this character.

[1] It looks like a left-to-right horizontal arrangement of U+53E3 U+7DCA.

There is pre-existing data with that character, such as:

HKSCS (Hong Kong Supplementary Character Set) has it at 0x9E44, as does
its predecessor GCCS (Government Chinese Character Set).  One can buy
Chinese handwriting recognition and OCR software that support at least
GCCS.

Vendor extensions to Big5 which predate GCCS and HKSCS, and which have
been smaller in size (i.e., only the more frequently-used characters)
include it as well, 0xFA5E in Dynalab HK A, and 0xFAD9 in one of 
Monotype's extensions.


 I am wondering especially about the CJK characters in Extension B. We all
 know that the majority of them are rare, ancient or idiosyncratic
 characters, but I am not quite sure that this is true for *all* of them.

I probably wouldn't use "idiosyncratic" as an adjective to describe the
*majority* of them, but "rare" and "ancient" (perhaps "historical"[2]
would be a better word choice?) are correct.

[2] e.g., the "recently deceased", such as Vietnamese chu+~ no^m 
characters in Plane 2, or even Deseret in Plane 1.

 
 I think that this is an important question for deciding whether an
 application should use 32 or 16 bit characters internally, and whether an
 application has to be fully UTF-16 aware or it can be "UTF-16 ignorant".
 
 E.g., imagine designing an application that will be localized in Cantonese:
 it is important to know whether all characters needed in Cantonese are in
 the BMP, or if some of them are in Extension B.

Some of them are in Extension B.  HKSCS is unfortunately a mix of
characters needed in Cantonese, and characters needed in Hong Kong (the
two are not necessarily the same thing).  Rather than trying to figure out
what all the characters used in writing Cantonese are, which is an
open-ended set, it is simpler to make the assumption that any characters
needed for Cantonese that are worth supporting have already made it into
HKSCS.  Then make a decision based on whether one will support legacy data
from HKSCS.  (In some cases, one does not have a decision to make, if it
is mandatory--e.g., a product that will be used by the HKSAR government.)

It doesn't have to be an application localized into Cantonese necessarily
even; just one that can process Cantonese text, e.g., for court 
transcription purposes.
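
To show what being "UTF-16 aware" means for the characters mentioned in
this thread, here is a minimal sketch in Python 3.  The surrogate
arithmetic is the standard UTF-16 algorithm; the helper names
to_surrogates() and count_utf16_units() are made up for this illustration.

```python
def to_surrogates(cp):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("U+%04X is not a supplementary code point" % cp)
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def count_utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


# gan (aspect marker), lip 'elevator', taai 'tie'
for cp in (0x21075, 0x282E2, 0x27639):
    high, low = to_surrogates(cp)
    print("U+%05X -> %04X %04X" % (cp, high, low))

# One character, but two 16-bit units: code that assumes one unit per
# character will miscount or split it.
print(count_utf16_units(chr(0x21075)))
```

An application that indexes or truncates by 16-bit unit needs to treat each
such pair as a single character.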

It seems that current practice in software is to stuff the characters from
HKSCS into the BMP's PUA area, sans unification.  Hopefully this will only
be a temporary phase.


Thomas Chan
[EMAIL PROTECTED]

RE: UTF8 vs. Unicode (UTF16) in code

2001-03-09 Thread Thomas Chan

ae, or an aborted orthographic
 for English, or the script used in Viet-Nam centuries ago)...

Pentagrams?  I haven't seen those... where are they?


Thomas Chan
[EMAIL PROTECTED]

Re: CJKV ideographic, was Re: Perception that Unicode is 16-bit

2001-02-27 Thread Thomas Chan


On Tue, 27 Feb 2001, John Jenkins wrote:

 On Tuesday, February 27, 2001, at 10:46 AM, Richard Cook wrote:
  * 'Kanji' in Cantonese Chinese  (kahn jee; "k" as in 'can', "a" as in
  'father', "jee" as in 'jeep');
 
 I'm afraid it's not "kanji" but "hanji" in Cantonese.  Sorry.

But is a romanized version of U+6F22 U+5B57 based on the Cantonese
pronunciation ever used in English writing the way hanzi (based on
Mandarin pronunciation) is?

For those familiar with "ASCII IPA", it's /hOn33 tSi22/.  (O denotes
U+0254 LATIN SMALL LETTER OPEN O; S denotes U+0283 LATIN SMALL LETTER
ESH.)[1]  Yale romanization would write it honjih, a modified Yale would
write it hon3ji6, etc.

[1] I wish I could assume that everyone can view IPA, and not go through
contortions like this.


Thomas Chan
[EMAIL PROTECTED]

Re: CJKV ideographic, was Re: Perception that Unicode is 16-bit

2001-02-27 Thread Thomas Chan


On Tue, 27 Feb 2001, Richard Cook wrote:

 * 'chunom' in Vietnamese [similar to (i.e., analogical) Chinese characters].

If one is going to talk about Vietnamese chu+~ no^m '"southern"
characters', then one might as well mention the Japanese kokuji 'national
characters' and Korean gugja 'national characters' as well, which are
their equivalents of "homemade" characters that do not exist in
Chinese.[1]

There is also a similar phenomenon in Chinese, called fangyanzi '"dialect"
character', which may be considered analogous to the above, the most well
known being the Cantonese ones, although others (Wu, Hakka, etc) do exist.

[1] There is a small chance that they might exist in Chinese, or even in
other languages, depending on the criteria for being a "national
character".


Thomas Chan
[EMAIL PROTECTED]

Re: Possibilities of future expansion (from Perception etc thread

2001-02-26 Thread Thomas Chan


On Mon, 26 Feb 2001 [EMAIL PROTECTED] wrote:

 On 02/25/2001 08:01:38 PM "Joel Rees" wrote:
 I know this has been hashed over time and time again, and the answer has
 been handed down as if by edict time and again, but _your_ attitude as
 expressed below is taken by many who are not involved as rather arrogant.
 
 Michael and I don't always see eye-to-eye, but I back certainly him on this
 one. His statement is not arrogant. It is merely fact.

Facts are fine.

But perhaps the FAQ (if it doesn't have one already)
needs an entry with a Question along the lines of: "I've heard that
Adobe/Apple/CSUR/Linux/Microsoft/etc are already using certain parts of
the Private Use Area (PUA).  Does that mean I can't/shouldn't use those
codepoints in my application?" .


 To
 many people, it seems like the UNICODE has taken in hand to define
 language
 itself. The explanations, no matter how well founded, sound to ordinary
 people like slick lawyers trying to cover up something baad with legalize.
 
 Like it or not, Unicode is the property of the Unicode Consortium and its
 members, not ordinary people. Clearly, ordinary people have an interest in
 its development, but it will benefit them only as the members of the
 Consortium deem to provide service to them. Now, this sounds cold and
 legal, and that's really not necessary. The members of the Consortium have
 it in their best interest to provide good service to ordinary people -
 that's how they earn their living.

Thanks for the diplomatic explanation.

 
 Personally, I think the PUA is a wonderful compromise. I know a linguist

These sorts of user-defined areas are a great idea.  I'm reminded of an
old (1996) usage of such space in JIS X 0208 for Ethiopic:
  http://www.abyssiniacybergateway.net/admas/jis/


Thomas Chan
[EMAIL PROTECTED]

Re: Possibilities of future expansion (from Perception etc thread

2001-02-25 Thread Thomas Chan


On Sun, 25 Feb 2001, William Overington wrote:

 I find the Private Use Areas of great interest and a valuable resource.
 However, use of the private use characters requires agreement between users
 if private use characters are to be used for exchanging information between
 people.  Already there is a development of the ConScript registry.  This has
 its influence.  I am researching a concept that I am hoping to call a
 uniengine that uses a few more than 1024 characters.  For research purposes
 I am placing it in the private use area.  From the unicode documentation, I
 have decide to place it in the middle, using U+EC00 to U+EFFF as a block and
 placing the additional character codes in U+EB00 to U+EBFF.  Yet I checked
 at the ConScript registry to ensure that I was not clashing with that
 research work.  If the uniengine concept becomes popular maybe it will
 become encoded by the committees into the standard.  I feel that the
 interesting point though is to ask whether, just because there has been
 mention in the unicode list of that range for a particular line of research
 work, notes will be made of the fact in documents here and there amongst
 researchers in the unicode area, so that any possibilities of clashes of
 meaning with some other person's use of a particular code in those ranges is
 noted.  The very fact that I felt it desirable to check at the ConScript
 registry is to my mind a demonstration that the private use area is already
 something other than a private use area.

On the other hand, apparently neither CSUR nor you were aware of (or
chose to ignore) the clashes with the mappings between the PUA and legacy
CJK encodings and character sets, which [the mappings] have already been
implemented for 5+ years now on CJK versions of Windows.

For the particular ranges you've chosen (U+EB00 .. U+EFFF), you clash
with:

  CP936:
9A41-A0FE  U+E000-U+E4DD
AA41-AFFE  U+E4DE-U+E909
F8A1-FEFE  U+E90A-U+EDE7

  CP950:
FA40-FEFE  U+E000-U+E310
8E40-A0FE  U+E311-U+EEB7
8140-8DFE  U+EEB8-U+F6B0
C6A1-C8FE  U+F6B1-U+F848

Various groups such as academics, input projects, newspapers, etc use
these ranges; perhaps the most prominent is HKSCS (and its predecessor
GCCS), which had been placed in CP950's user-defined zones, and thus
a mapping to the PUA exists (and .tte fonts are created by using PUA 
codepoints in the cmap).  If you chose other ranges, you might clash with
CP932 or CP949's.  Similar ranges are used on the Macintosh as well.  (I
don't know about other platforms.)
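
The overlap can be checked mechanically.  A minimal sketch in Python,
using only the ranges listed above; the table names and the clashes()
helper are made up for this illustration:

```python
# (legacy byte range, first PUA code point, last PUA code point),
# as listed in this message.
CP936 = [
    ("9A41-A0FE", 0xE000, 0xE4DD),
    ("AA41-AFFE", 0xE4DE, 0xE909),
    ("F8A1-FEFE", 0xE90A, 0xEDE7),
]
CP950 = [
    ("FA40-FEFE", 0xE000, 0xE310),
    ("8E40-A0FE", 0xE311, 0xEEB7),
    ("8140-8DFE", 0xEEB8, 0xF6B0),
    ("C6A1-C8FE", 0xF6B1, 0xF848),
]


def clashes(lo, hi, table):
    """Return (legacy range, first, last) for each overlap with lo..hi."""
    found = []
    for label, start, end in table:
        first, last = max(lo, start), min(hi, end)
        if first <= last:
            found.append((label, first, last))
    return found


for name, table in (("CP936", CP936), ("CP950", CP950)):
    for label, first, last in clashes(0xEB00, 0xEFFF, table):
        print("%s %s: U+%04X-U+%04X" % (name, label, first, last))
```

For U+EB00 .. U+EFFF this reports one overlapping zone in CP936 and two in
CP950.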

Of course, all of these clash with each other, CSUR, Microsoft's "Symbol",
etc in whole or part.  I don't see why there's any particular reason why
anyone should really care about clashes.


Thomas Chan
[EMAIL PROTECTED]

Re: Possibilities of future expansion (from Perception etc thread

2001-02-25 Thread Thomas Chan


On Sun, 25 Feb 2001, Michael Everson wrote:

 At 10:15 -0800 2001-02-25, Thomas Chan wrote:
 On the other hand, apparently neither CSUR nor you were aware of (or
 chose to ignore) the clashes with the mappings between the PUA and legacy
 CJK encodings and character sets, which [the mappings] have already been
 implemented for 5+ years now on CJK versions of Windows.
 
 So what?
 Anyone can use the PUA for anything they want. The code positions are 
 NOT standardized. If the "characters" those CJK people are using need 
 to be interchangeable, they need to be properly proposed and put into 
 Unicode. If they are not, then they are NOT characters.

I've never denied that anyone can use the PUA for what they want.  If you
read my post more carefully in entirety, you'll see that I said:

  Of course, all of these clash with each other, CSUR, Microsoft's
  "Symbol", etc in whole or part.  I don't see why there's any particular
  reason why anyone should really care about clashes.

If someone wants to use the PUA, there's no need to check CSUR or any
other source for "clashes".

As well as:

  Various groups such as academics, input projects, newspapers, etc use
  these ranges; perhaps the most prominent is HKSCS (and its predecessor
  GCCS), which had been placed in CP950's user-defined zones, and thus
  a mapping to the PUA exists (and .tte fonts are created by using PUA
  codepoints in the cmap).  If you chose other ranges, you might clash
  with CP932 or CP949's.  Similar ranges are used on the Macintosh as
  well.  (I don't know about other platforms.)

If you want examples of these characters (and in some cases, publicly
available .tte fonts), there's:

  http://www.info.gov.hk/gccs/ (GCCS, from the Hong Kong SAR government)
  http://www.info.gov.hk/digital21/eng/hkscs/index.html (HKSCS, also from
the Hong Kong SAR government)
  http://www.microsoft.com/hk/hkscs/ (HKSCS, from Microsoft HK)
  http://www.sinica.edu.tw/~tdbproj/handy1/ (for the Scripta Sinica
project, from Academia Sinica Computing Centre, a part of Academia
Sinica, a research institution, in Taiwan)
  http://www.appledaily.com.hk/ (for Apple Daily, a Hong Kong newspaper)
  http://www.ust.hk/itsc/chinese/infra/eudc/ (for HKUST, a Hong Kong
university)
  http://www.dynalab.com.hk/font/gaigi.htm (for one of DynaLab's sets,
from same, a foundry)

etc etc--this is not an exhaustive list.  Someone else can probably
provide examples of Japanese or other Chinese ones.

As for whether proposals exist or not, that doesn't matter if one is
concerned about clashes--e.g., many people will avoid using U+F000 ..
U+F0FF because Microsoft Symbol (which itself has multiple variants) and
others already use those codepoints on a widespread scale.


Thomas Chan
[EMAIL PROTECTED]

Re: More rambling about Han

2001-02-22 Thread Thomas Chan


On Thu, 22 Feb 2001, Joel Rees wrote:

 What I want is to be able to send a piece of text with one or more
 characters that I know the recipient will not have in his collection of
 fonts, and have some hope that he or she will be able to see the glyph in a
 meaningful manner.

The IDC's can help do part of the job.

I'm not sure this sort of thing belongs in Unicode, but the Big5+ spec
described two ways of transmitting the glyphs for user-defined characters.
One of them referred to CNS13479/ISO6429 to use an escape-code to switch
into a mode where a bitmap of 16x16, 24x24, 32x32, 40x40, 48x48, 64x64,
80x80, 96x96, or 128x128 size could be embedded, followed by its 
user-defined codepoint.  The other referred to an ISO9541 SGML-based
solution.  (See section 3.6 of big5pm2.doc file in the big5p-1.zip package
from http://www.cmex.org.tw/ .  It's all in Chinese, though.)


Thomas Chan
[EMAIL PROTECTED]

Re: fictional scripts revisited

2001-02-22 Thread Thomas Chan


On Thu, 22 Feb 2001, David Starner wrote:

 On Wed, Feb 21, 2001 at 10:58:06PM -0800, Thomas Chan wrote:
  First, there are the 4000 new[4] "CJK Ideographs" that he created solely
  for a work called _Tianshu_ (A Book from the Sky)[5] (1987-1991), which Xu
  spent three years carving movable wooden type for.  There is no doubt that
  these are bona fide Han characters, albeit without readings and meanings.
 
 Idiosyncratic and personal characters are not encoded in Unicode. 

I know they aren't, but they have been in the past, and some have just sneaked
in with Plane 2.  Of course, policy can also change--it wasn't that long
ago that musical notation and braille were barred.

I'm not suggesting that Xu's 4000 bogus characters deserve to be included;
this is merely an example of a possibility to think about.  If say, they
were included in a futuristic cjk vertical extension "M" (to pick a letter
safely in the distant future), who'll know to object to them?  

(Someone will probably dig up this thread, now that I've mentioned it...)

 
  However, the lack of readings and/or meanings, or nonce usage, has not
  stopped characters before from being included in Unicode or precursor
  dictionaries and standards, e.g., U+20091 and U+219CC, created as a "I
  know these two characters and you don't" one-upmanship stunt; or the
  various typos inherited from JIS standards.  
 
 I believe Unicode, as a general rule, does not encode meaningless
 characters. Any currently in Unicode are either mistakes, or come from 
 preexisting standards.

Mistakes are mistakes; they happen.  But how does one decide how to handle
pre-existing sources?  Set 1991 as a cutoff date?  It really becomes a
delicate issue.

 
  But consider that these represent potentially 4000
  codepoints that could be gobbled up by "fictional characters", and it
  only took a a single individual three years to come up with them.
 
 But that's not true. No one is proposing that every newly created script
 that comes along be encoded in Unicode. For them to gobble up 4000 
 codepoints, it would take a body of work by a number of authors, like 
 Tengwar and Cirth have had. 

A number of the characters in Plane 2 were grandfathered in because of
their inclusion in dictionaries, despite lacking reading, meaning, or
both, or being outright typos.  If those bogus 4000 made their way into a
dictionary or standard, and some large country(s) pressured to include
them, then there'd be an ugly situation--one can probably think of a few
examples of compromises made, Unicode and elsewhere.

 
  The second example I would like to raise are the "Square Words" or "New
  English Calligraphy"[6] (I don't know which name is more appropriate,
  but I will refer to it hereafter as "NEC"), which is a Sinoform script.
  NEC is a system where each letter of the English alphabet[7] is equated
  with one (?) component of Han characters, and each orthographic word is
  written within the confines of a square block, in imitation of Chinese
  writing [... CJK ideographs are precomposed in Unicode ...]
  Thus, there's no reason to expect that NEC would be encoded any
  differently.
 
 I disagree. Say, for instance, some small* country decided to adopt NEC 
 as a writing style, and hence Unicode had to include it. There are 
 1,000,000 words in English by some counts, so it's not feasible to 
 encode them all in Unicode, or even some semi-complete subset. So
 it would be encoded by component and treated like any other complex
 script. (* I say a small country, because a large country might be
 able to get a large chunk of precomposed characters stored in Unicode.
 I still don't think that it would be done solely precomposed.)

Even very small countries using a certain script would have more users
than scholars of dead scripts, even if the margin is measured in
thousands, and countries have political clout that loose federations of
scholars do not.  Yet, the historical Sinoform scripts of the Khitan,
Jurchen, and Tanguts[1] are given many rows in WG2 N2314[2] (2001.1.9),
the Plane 1 roadmap, and there are no modern successors who can champion
their cause for cultural reasons, like Vietnam for chu+~ no^m characters,
or Ireland for ogham.  I'm not privy to WG2's workings, but what else
would one conclude based on this roadmap, and the treatment of Han characters
and Hangul, except that a script like NEC would be treated precomposed?
Of course, this is only a transient roadmap.

[1] On the same roadmap are other South American, Near Eastern, and other
large scripts in same position.

[2] http://www.egt.ie/standards/iso10646/plane1-roadmap-table.html

 
  At the inception of various other fictional scripts, no one could foresee
  the growth of scholarly and/or amateur interest in them; 
 
 True. That's why we wait until there is, before we consider encoding
 a script.

Yes, I agree.  It is harder to find historical scripts and characte

Re: New BMP characters (was Re: [very OT] Documentation: beyond

2001-02-21 Thread Thomas Chan


On Wed, 21 Feb 2001, Werner LEMBERG wrote:

  Section 10.1 of PDUTR #27 "Unicode 3.1" (2000.1.17) gives the sources of
  the 42,711 new characters as:
...
CNS 11643-1992, 15th plane
 
 Really?  I thought this should be CNS 11643-1986.  I think there isn't
 a 15th plane in the 1992 version.

I thought so too, but both PDUTR #27 and IRG N777 give the 1992 version
and the 15th plane.  Perhaps it might be a simple typo, since 
lower-numbered planes are given on preceding lines as the 1992 version.

 
  South Korea's PKS 5700
 
 This is a North Korean standard AFAIK.

I thought the PKS standards were too, but then I heard that North Korean 
ones (such as KPS 9566) were filed under the KP- sources.  I'm sure
the KS C and KS X standards are South Korean, and filed under the K-
sources, but it doesn't seem to make much sense to have North Korean
sources split across K- and KP- sources.  Now I'm not sure anymore what is
going on.


Thomas Chan
[EMAIL PROTECTED]

Re: New BMP characters (was Re: [very OT] Documentation: beyond

2001-02-21 Thread Thomas Chan


On Wed, 21 Feb 2001, Jungshik Shin wrote:

 On Wed, 21 Feb 2001, Jungshik Shin wrote:
  On Wed, 21 Feb 2001, Werner LEMBERG wrote:
South Korea's PKS 5700
   This is a North Korean standard AFAIK.
 
  No. AFAIK, PKS stands for 'Proposed Korean Standard' and as such PKS 5700
  became KS C 5700 which in turn was renamed KS X 1005-1.  Then, what is
  KS X 1005-1? It's just the Korean version of ISO 10646 (aligned with
  Unicode 2.0).
 
 I could be wrong in saying that PKS C 5700 became KS C 5700 although
 it's (almost) certain that PKS represents 'Proposed Korean Standard'
 (where Korean means South Korean).  Unicode 3.0 (p. 259) lists two PKS
 C's as K source 2 and K source 3 (PKS C 5700-1 1994 and  PKS C 5700-2
 1994) and http://www.cse.cuhk.edu.hk/~irg/irg/N777_CJK_B_CoverNote.pdf
 lists PKS C 5700-3 1998 as another K source. What is this mysterious
 PKS C 5700-[1-3]? I asked around in the past but haven't obtained the
 definitive answer. Perhaps, I should ask someone in IRG.

The unihan.txt file ver 3.0b1 (1999.7.2) lists four K- sources as:
  K0  KS C 5601-1987
  K1  KS C 5657-1991
  K2  PKS C 5700-1 1994
  K3  PKS C 5700-2 1994

It's very clear what K0 and K1 are, and they are given as GR ranges 
arranged by pronunciation, and it is okay that these ranges overlap, since
K0 and K1 are two different character sets.

K2 has what appears to be GL ranges given for it (0x2121 .. 0x7530), and
arranged by radical+strokes.  K3 looks similar, having what appear to be
GL ranges (0x2121 .. 0x3771), arranged by radical+strokes, but they all
fall within CJK Extension A.  The ranges given for K2 and K3 also overlap.
(They seem reminiscent of the "planes" of CNS 11643 / EUC-TW .)
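
For readers unfamiliar with the GL/GR distinction, here is a minimal sketch
in Python of how a two-byte 94x94 code in GL form relates to its GR form and
to row-cell numbering.  This is the usual ISO 2022 convention; the helper
names are made up for this illustration, and nothing here says anything
about what PKS C 5700 actually contains.

```python
def _check_gl(code):
    """Ensure both bytes of a two-byte code lie in the GL range 0x21..0x7E."""
    hi, lo = code >> 8, code & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("0x%04X is not a two-byte GL code" % code)
    return hi, lo


def gl_to_gr(code):
    """Set the high bit of each byte: 0x2121 becomes 0xA1A1."""
    _check_gl(code)
    return code | 0x8080


def gl_to_row_cell(code):
    """Convert a GL code to 1-based (row, cell) numbers."""
    hi, lo = _check_gl(code)
    return hi - 0x20, lo - 0x20


# End points of the K2 and K3 ranges mentioned above.
for code in (0x2121, 0x7530, 0x3771):
    row, cell = gl_to_row_cell(code)
    print("GL 0x%04X = GR 0x%04X = row %d, cell %d" % (code, gl_to_gr(code), row, cell))
```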

According to the 02n34428_cjk_b_fcd_mapping.txt file[1] (May 2000?), the
K source (#4?) is given as a decimal number from 0002 .. 0269, arranged by 
radical+strokes, and all within CJK Extension B (but this file only deals
with Ext B, so that doesn't mean much).  There seem to be some gaps in the
numbering, though.  I'm not sure what to make of this in relation to K2
and K3, or the whole "PKS C 5700" thing.  The later date (1998 vs. 1994)
must also be of some significance.

[1] Available at http://anubis.dkuug.dk/JTC1/SC2/open/02n3442list.htm


Thomas Chan
[EMAIL PROTECTED]

Re: New BMP characters (was Re: [very OT] Documentation: beyond

2001-02-21 Thread Thomas Chan


On Wed, 21 Feb 2001, Werner LEMBERG wrote:

 
  Section 10.1 of PDUTR #27 "Unicode 3.1" (2000.1.17) gives the sources of
  the 42,711 new characters as:
...
CNS 11643-1992, 15th plane
 
 Really?  I thought this should be CNS 11643-1986.  I think there isn't
 a 15th plane in the 1992 version.

Looking more at this, the TF source in the unihan.txt file ver 3.0b1
(1999.7.2) also says "CNS 11643-1992, plane 15", as does p. 259 of TUS 
3.0.  (Is this what ISO documents have always said too?)


Thomas Chan
[EMAIL PROTECTED]

Re: New BMP characters (was Re: [very OT] Documentation: beyond

2001-02-21 Thread Thomas Chan


On Wed, 21 Feb 2001, Jungshik Shin wrote:

 On Wed, 21 Feb 2001, Thomas Chan wrote:
  The unihan.txt file ver 3.0b1 (1999.7.2) lists four K- sources as:
K0  KS C 5601-1987
K1  KS C 5657-1991
K2  PKS C 5700-1 1994
K3  PKS C 5700-2 1994
  It's very clear what K0 and K1 are, and they are given as GR ranges
  arranged by pronunciation, and it is okay that these ranges overlap, since
  K0 and K1 are two different character sets.
 
 Hmm, it's not a big deal but I wonder why they're given as GR ranges
 instead of just row-column values (or GL). Somebody must have mixed
 up ...

Sorry, this was my mistake.  K0 and K1 are given as GL.
 

  K2 has what appears to be GL ranges given for it (0x2121 .. 0x7530), and
  arranged by radical+strokes.  K3 looks similar, having what appear to be
  GL ranges (0x2121 .. 0x3771), arranged by radical+strokes, but they all
  fall within CJK Extension A.  The ranges given for K2 and K3 also overlap.
  (They seem reminiscent of the "planes" of CNS 11643 / EUC-TW .)
 
 By K2 and K3 overlapping, you do not  mean some characters in Ext. B are
 given references to both K2 and K3, do you? If not, it's natural and all
 right by the same token you said about the overlap of K0 and K1 ranges
 because it indicates that K2 and K3 have repertoires disjoint from each
 other (i.e., the intersection of K2 and K3 is a null set) just like K0
 and K1 do.

No, I don't mean that some characters are given references to both K2 and
K3, which is impossible in the format the unihan.txt file is in.  (That
doesn't mean it can't happen, though--e.g., a character can be in both GB
2312 and GB 12345, but only a reference to the former, the G0 source, is
given.)
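
A rough sketch of why the format allows only one reference per source field: if the file is read as tab-separated "code point, field, value" lines, each (character, field) pair holds a single value. The field names, the "N-XXXX" value shape, and the sample lines below are assumptions for illustration, not copied from unihan.txt.

```python
def parse_sources(lines):
    """Collect IRG source fields into {code point: {field: value}}."""
    sources = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        codepoint, field, value = line.split("\t", 2)
        if not field.startswith("kIRG_"):
            continue
        # One slot per (character, field): a second K reference for the
        # same character would have nowhere to go.
        sources.setdefault(codepoint, {})[field] = value
    return sources


if __name__ == "__main__":
    # Made-up sample lines in the assumed layout.
    sample = [
        "# illustrative data only",
        "U+4E00\tkIRG_GSource\t0-2121",
        "U+4E00\tkIRG_KSource\t0-2122",
        "U+4E01\tkIRG_KSource\t2-2121",
    ]
    for cp, fields in sorted(parse_sources(sample).items()):
        print(cp, fields)
```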

 
Thomas Chan
[EMAIL PROTECTED]

fictional scripts revisited

2001-02-21 Thread Thomas Chan

 thing is a greater threat--imagine a NEC Sinoform "ideograph"
for each orthographic word in English, Spanish, and French--just to take
three arbitrary languages--and we'll even give them the benefit of
"unification".  From what I see now, the only way to handle this sort of
thing would be to throw ever more precious codepoints at it.

[6] See http://www.echinaart.com/Advisor/xubing/adv_xubing_gallery04.htm ,
especially the "men" and "women" restroom signs!

[7] Technically, Latin script letters--see the Spanish surnames in the
"Your surname Please" exhibit (1998) in the URL listed in footnote #6.

[8] http://www.hanshan.com/specials/xubingsw.html


Thomas Chan
[EMAIL PROTECTED]

RE: Unicode Transcriptions

2001-02-16 Thread Thomas Chan


On Fri, 16 Feb 2001, Marco Cimarosti wrote:

 2) Which Chinese dialect to adopt for transliterating.

Mandarin would be the most likely.


 Notice the particularities of Bopomofo spelling:
 
 - the sound (spelled "ong" in pinyin) is spelled "u-eng";
 - there is no "y" in "yi";
 - there is no sign to indicate the 1st tone.

[snip]
 
 Also notice that you may have a few typographical problems in producing the
 picture:
 
 a) In most fonts, the glyph for vowel i is a horizontal line. This is only
 valid for vertical texts: in horizontal writing it should be vertical.
 (Suggestion: you may substitute it with an uppercase I from a sans-serif
 font).

Yes, you are right about this.  I don't know why TUS3.0 p. 278 says "The 
character U+3127 BOPOMOFO LETTER I is usually written as a vertical
stroke when Bopomofo text is set vertically.", which is *wrong*.
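
The spelling points quoted above can be checked against the character names with a small sketch like the one below. The syllable choices (pinyin "zhong" and "yi") and the helper name are mine, chosen for illustration. The tone marks listed are the spacing modifier letters commonly used with Bopomofo.

```python
import unicodedata

# Pinyin "zhong" is spelled ZH + U + ENG; pinyin "yi" is the single letter I.
ZHONG = "\u3113\u3128\u3125"
YI = "\u3127"

# The 1st tone has no mark; the other tones use spacing modifier letters.
TONE_MARKS = {1: "", 2: "\u02CA", 3: "\u02C7", 4: "\u02CB"}


def spell(letters, tone):
    """Append the tone mark, if any, after the Bopomofo letters."""
    return letters + TONE_MARKS[tone]


if __name__ == "__main__":
    for ch in ZHONG + YI:
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))
    print(spell(ZHONG, 1), spell(YI, 4))
```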

 
 b) The glyph for the "combining breve" (3rd tone) is normally designed to
 fit on western lowercase vowels. (Suggestion: if you use a bigger size for
 the combining marks, you might get a correct result).

I've made two .gif files demonstrating Bopomofo typography:

  http://deall.ohio-state.edu/grads/chan.200/misc/biaozhunwanguoma.gif
  http://deall.ohio-state.edu/grads/chan.200/misc/tongyima.gif

Both depict left-to-right Han character text, and each character is
annotated on its right side with top-to-bottom Bopomofo text.

(Alternatively, I could have created versions where the Han character text
runs top-to-bottom, and each character is annotated on its right side with
top-to-bottom Bopomofo text, but I didn't.)

Note the placement of the tone diacritics, which are "stacked" even further
to the right than the Bopomofo consonants and vowels.


Thomas Chan
[EMAIL PROTECTED]

[OT] RE: FW: extracting words

2001-02-11 Thread Thomas Chan


On Sun, 11 Feb 2001, Mike Lischke wrote:

  If you are willing to give up precision, then you can use heuristics.
 
  It's ugly but perhaps ok for a simple editor. You can improve the
  precision
  with better heuristics and more data, so you get to decide how much is
  good enough...
 
 So using white spaces for general word breaking and ideographs for CJK
 would be an acceptable approach? What I wonder about is how to handle

No, that is not acceptable for Chinese.  Chinese text does not use white 
space anywhere.[1]  What was described was that it is tolerable (but not
perfect--e.g., punctuation is not handled properly) to break *lines* in
Chinese text between Chinese characters.  To break *words* properly in
Chinese text, you really need a dictionary.[2]

[1] There is some Chinese text with spaces, where a space is inserted
after each Chinese character, but that is a hack to make word-wrapping
behave properly on Chinese-unaware software (which would otherwise treat
an entire paragraph of Chinese text as a single "word").

[2] You might get away with treating each Chinese character as a "word",
but this is technically wrong from a linguistic standpoint, despite cultural
claims to the contrary, and will have implications.
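
As a rough illustration of the dictionary approach, here is a minimal sketch of forward maximum matching, one simple technique among several. The tiny word list and the function name are made up for the example. Real use would need a large lexicon and some handling of ambiguity.

```python
def segment(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word
    starting at each position, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    # Toy dictionary, for illustration only.
    toy_dictionary = {"中文", "文本", "处理"}
    text = "中文文本处理"
    print(segment(text, toy_dictionary))  # dictionary-based words
    print(list(text))                     # one "word" per character
```

The second line of output is the per-character fallback described in footnote [2]. It is simpler, but it splits every multi-character word.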


The handling of Japanese and Korean text is different from that of Chinese
(lumping them together as "CJK" is inappropriate in this context), but I
will leave them for others to provide a better treatment.  (Jungshik Shin
has already explained the Korean case.)


Thomas Chan
[EMAIL PROTECTED]

Re: ConScript registry?

2001-01-31 Thread Thomas Chan


On Wed, 31 Jan 2001, Michael Everson wrote:

 Ar 13:23 -0800 2001-01-30, scríobh Thomas Chan:
 I don't think that CSUR is conclusive proof that there wouldn't be a
 deluge of demands for encoding fictional or constructed scripts if the
 likes of Tengwar or Klingon were encoded.
 
 Well, I think what David was saying is that there don't seem to be all that
 many of them.

My primary objection was that we don't have conclusive evidence for either
scenario.

 
 CSUR is just a pair of websites
 with nowhere near the high profile or authority of Unicode.
 
 I thought one of the Unicode web pages linked it. I could be wrong. And the
 CSUR states explicitly that it is just for fun. Having said that, I do know
 of some folks who have done implementations of one sort or another based on
 its specifications.

I don't see any wording along the lines of "just for fun" on either CSUR
website itself, except for a link on your http://www.egt.ie/sc2wg2.html
page.

The only thing that suggests their unofficialness and volatility is
mention of the Private Use Area, but perhaps that is not clear to
people who see the words "Unicode" and "Registry", and think it is the
real thing, or there are problems comprehending the concept of a Private
Use Area.  Or perhaps they have heard about it secondhand.  For example,
look through the Usenet newsgroup archives at deja.com or any discussion
board online and see how often people believe Klingon is in Unicode, or
"going to be in the next version" of Unicode, when there has only been a
proposal.  (And I doubt they are looking at the WG2 proposal itself, but
the CSUR registration or derivative information.)

 
 If say, a
 fictional script were included and published by Unicode and ISO, then
 people all over would suddenly be aware of the fact that a fictional
 script got included, and perhaps they might conclude that they should
 submit their own pet scripts as well.
 
 Thomas, if a script like Tengwar, which has thousands of users who are
 actually interested in writing texts in it, sorting, searching, and all
 that, gets into the UCS it is because there is a credible requirement to
 encode it. Plenty of "nonfictional" historical scripts have fewer users
 than Tengwar. For some of them we have a handful of texts. Tengwar on the
 other hand is studied by linguists, used by enthusiasts, and at any rate is
 an integral part of the work of one of the 20th century's finest and most
 influential writers.

Please note that I did not single Tengwar out for criticism.  I believe it
has a valid argument to be encoded because of the size of the user
community.  It is the fictional scripts with small user communities that
are the problem, and how that relates to treatment of real-world
historical scripts with small user communities.

 
 For example, it is easy to
 find a variety of fonts for fantasy runes or other alphabets that people
 have created, some based off a description in published fiction, but they
 have not gotten in touch with CSUR.
 
 Actually there aren't all that many.

Are we sure about this?  It remains to be examined how they would be
treated, but there are Chinese fictional scripts that have the potential
capability of gobbling up codepoints like "ideographs" have done.  e.g.,
  http://deall.ohio-state.edu/grads/chan.200/misc/100fu.jpg
  http://deall.ohio-state.edu/grads/chan.200/misc/100shou.jpg
each show a single character in what are supposedly a hundred different
scripts.  Most of these "scripts" could probably be conflated and treated
as font variants, but a few are distinct.  Multiply that by 4000-8000
each, and you might have an explosion.
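
To put rough numbers on that multiplication, here is a back-of-the-envelope sketch. The count of distinct scripts (5) is a placeholder I picked for illustration. Only the 4000-8000 figure comes from the text above.

```python
BMP_SIZE = 0x10000  # 65,536 code points in one plane


def codepoints_needed(distinct_scripts, chars_per_script):
    """Code points consumed if every script gets its own full repertoire."""
    return distinct_scripts * chars_per_script


if __name__ == "__main__":
    distinct_scripts = 5  # hypothetical value, not a count from the images
    for chars in (4000, 8000):
        total = codepoints_needed(distinct_scripts, chars)
        share = 100.0 * total / BMP_SIZE
        print("%d scripts x %d chars = %d code points (%.0f%% of a plane)"
              % (distinct_scripts, chars, total, share))
```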

Or take the case of a bunch of obsolete reformist alphabets and
syllabaries of the late-19th and early 20th century, such as the Guanhua
Zimu ("Mandarin letters") alphabet, which is to my knowledge only
partially described in one Western source.  If I understand correctly,
these would be in the same category as Deseret or Visible Speech.


 Or take the case of the Hotsuma
 Tsutae syllabary, created in modern times to provide a fictional
 pre-Chinese writing system (http://www.jtc.co.jp/hotsuma/index-e.htm) for
 what is supposedly Old Japanese, which has books and articles published
 about it, and fonts in existence, but it has no contact with CSUR.
 
 In fact, I *have* seen this. As I recall Ken Whistler and I looked at it
 when we were at the WG2 meeting in Fukuoka.

How did that discussion turn out?


Thomas Chan
[EMAIL PROTECTED]
