from:"Richard Cook"

Re: Counting rods alternate forms

2016-08-17 Thread Richard Cook

On Aug 13, 2016, at 2:06 PM, eduardo marin  wrote:
> 
> It is well known that the Southern Song style of counting rods had different 
> forms for the digits 4, 5 and 9 https://en.wikipedia.org/wiki/Counting_rods , 
> however currently there is no way to represent such forms,

〤[U+3024] 〥[U+3025] 〩[U+3029]

> a proposal to add them would only occupy five code points, since number four 
> is identical both vertical and horizontally.
> 
> Counting rods - Wikipedia, the free encyclopedia
> en.wikipedia.org
> Counting rods represent digits by the number of rods, and the perpendicular 
> rod represents five. To avoid confusion, vertical and horizontal forms are 
> alternately used.
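
For anyone who wants to check the three code points cited in the reply above against the character database, here is a minimal Python sketch. It assumes only the standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point label, character name, numeric value) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.numeric(ch))

# The three characters cited in the reply: U+3024, U+3025, U+3029.
for ch in "\u3024\u3025\u3029":
    print(*describe(ch))
```

The names and numeric values come from whatever version of the Unicode Character Database ships with the Python build in use.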

Re: Emoji characters for food allergens

2015-07-28 Thread Richard Cook

On Jul 28, 2015, at 8:56 AM, Asmus Freytag  wrote:
> 
>> On 7/28/2015 8:07 AM, Richard Cook wrote:
>>> On Jul 28, 2015, at 7:53 AM, Doug Ewell  wrote:
>>> Richard Cook  wrote:
>>> 
>>>> And, what is the emotion playfully expressed by 🍔🍟 ?
>>> "I'm having a burger and fries for lunch but can't be bothered to type
>>> all that into this text message lol"
>>> 
>> Is all that one emotion or two?
> 
> Remember:
> e-moji == picto-graph
> 
> and 
> 
> emoji != emoticon.
> 

hey Michael, 

You want 🍟 with that? 😳

-R

> A./
>> 
>>> --
>>> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>>> 
>>> 
>

Re: Emoji characters for food allergens

2015-07-28 Thread Richard Cook

On Jul 28, 2015, at 7:53 AM, Doug Ewell  wrote:
> 
> Richard Cook  wrote:
> 
>> And, what is the emotion playfully expressed by 🍔🍟 ?
> 
> "I'm having a burger and fries for lunch but can't be bothered to type
> all that into this text message lol"
> 
Is all that one emotion or two?

> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
> 
>

Re: Emoji characters for food allergens

2015-07-28 Thread Richard Cook

On Jul 28, 2015, at 6:00 AM, Michael Everson  allegedly 
wrote:
> 
> Emojis are not for labelling things. They’re for the playful expression of 
> emotions.

Is that what they're for? I thought they were (encoded) to satisfy certain 
device manufacturers. And, what is the emotion playfully expressed by 🍔🍟 ?

Re: vexillology, was: Adding RAINBOW FLAG to Unicode

2015-07-07 Thread Richard Cook

On Jul 7, 2015, at 7:53 AM, Richard Cook  wrote:
> 
> Ken Whistler wrote:
>>> vexillology
> 
> 
>> Garth Wallace wrote:
>> 
>> Tangentially, I recently ran across something called International
>> Flag Identification Symbols. It's a symbolic notation for vexillology
>> that describes their use of flags and some aspects of their design but
>> not enough to reproduce them.
> 
> Ken,
> 
> Hasn't any vexillogist

=> vexillologist

> defined a full blown FDL (Flag Description Language) yet? That would be a 
> sub-discipline of heraldic arms blazoning, I guess.
> 
> -Richard
> 
> <http://wenlin.com/cdl> ☯

vexillology, was: Adding RAINBOW FLAG to Unicode

2015-07-07 Thread Richard Cook

Ken Whistler wrote:
>> vexillology

> Garth Wallace wrote:
> 
> Tangentially, I recently ran across something called International
> Flag Identification Symbols. It's a symbolic notation for vexillology
> that describes their use of flags and some aspects of their design but
> not enough to reproduce them.

Ken,

Hasn't any vexillogist defined a full blown FDL (Flag Description Language) 
yet? That would be a sub-discipline of heraldic arms blazoning, I guess.

-Richard

<http://wenlin.com/cdl> ☯

Re: Adding RAINBOW FLAG to Unicode

2015-06-30 Thread Richard Cook

> On Jun 30, 2015, at 9:11 AM, Garth Wallace  wrote:
> 
> I don't think display of U+1F308 as a rainbow flag would be expected 
> behavior. It risks turning a text like "It's a beautiful day! 🌈" into a 
> political statement.

Garth, 

Any statement can be a political statement, in the right context. But I think 
the main point of my earlier comment was that the specific glyph for U+1F308 
might be indistinguishable from a flag. For example, this is the glyph in iOS 8:

Not a cloud in the sky.

 ☯

Re: Adding RAINBOW FLAG to Unicode

2015-06-29 Thread Richard Cook

Ken, 

I know that U+1F308 is RAINBOW ... because my nameslist lookup tool tells me so 
...

T   C   UTF-8   Codepoint : Name : Annotations
1   🌈   F0_9F_8C_88   1F308   RAINBOW



... but could 🌈 also be a 'rainbow (flag)'? 

-Richard


[☯ iMM (iPhone Mangled Message)]
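
A lookup of this kind can be approximated with Python's standard library. This is only a sketch, not the nameslist tool mentioned above: the function name `lookup` and the underscore-separated byte format are assumptions made for illustration.

```python
import unicodedata

def lookup(ch):
    """Return (code point, UTF-8 bytes as hex, character name) for one character."""
    utf8_hex = "_".join(f"{b:02X}" for b in ch.encode("utf-8"))
    return (f"{ord(ch):04X}", utf8_hex, unicodedata.name(ch))

print(lookup("\U0001F308"))
```

The character name only says RAINBOW; whether a given font draws it so that it could pass for a flag is outside anything the database records.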

Biang,was: And what happened to...

2014-10-07 Thread Richard Cook

On Oct 7, 2014, at 5:23 PM, Mark E. Shoulson  wrote:
> 
> The infamous Biang-Biang Noodle

Mark, 

You seem to know as much as anyone about biang. All I can say is, biang is 
attested in tones 2, 4 and 1, and enshrined (along with a glyph variant) in 
Wenlin CDL PUA at U+E999, with 51 or 57 strokes (your stroke count may vary). 
Yes, I just happened to remember the code point and trivia. If you'd like to 
see the CDL let me know ...

-Richard

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Is this the oldest d20 on Earth?

2014-09-20 Thread Richard Cook

On Sep 20, 2014, at 5:35 PM, Jonathan Coxhead  
wrote:
> 
> Here's an icosahedral dice from the Ptolemaic period:
> 
> http://www.metmuseum.org/collection/the-collection-online/search/551070
> 
> I find myself idly wondering whether the identities of the characters are all 
> known and encoded ...
> 

The enlarged image doesn't show all of the sides.

> Cheers
> 
> —Jonathan

Re: Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database - repost all ascii

2014-09-09 Thread Richard COOK

>> On Sep 9, 2014, at 8:28 AM, Richard COOK  wrote:
> On Sep 8, 2014, at 12:03 PM, John Armstrong  
> wrote:

Mr. Armstrong,

I see that my reply to your message bounced from the main Unicode list, due to 
length constraints.

At any rate, the message did go through on the Unihan list, where people 
involved in Unihan development can read it.

In sum, I was suggesting that you might prepare a list of variant property 
values for kRSUnicode and kTotalStrokes.

This would feed into ongoing work on those properties.

---
Richard Cook
文林研究所 Wénlín Institute, Inc.
<http://www.wenlin.com/cdl/>
☯

Re: Corrigendum #9

2014-07-04 Thread Richard COOK

On Jul 3, 2014, at 1:48 PM, Asmus Freytag  wrote:

> On 7/3/2014 11:02 AM, Richard COOK wrote:
>> On Jul 2, 2014, at 8:02 AM, Karl Williamson  wrote:
>> 
>>> Corrigendum #9 has changed this so much that people are coming to me and 
>>> saying that inputs may very well have non-characters, and that the default 
>>> should be to pass them through.  Since we have no published wording for how 
>>> the TUS will absorb Corrigendum #9, I don't know how this will play out.  
>>> But this abrupt a change seems wrong to me, and it was done without public 
>>> input or really adequate time to consider its effects.
>> Asmus,
>> 
>> I think you will recall that in late 2012 and early 2013, when the subject 
>> of the proposed changes (or clarifications) to text relating to 
>> noncharacters first arose, we (at Wenlin) expressed our concerns. Some 
>> concerns were grave, and some of the discussion and comments were captured 
>> in this web page:
>> 
>> <http://wenlininstitute.org/UnicodeNoncharacters/>
>> 
>> There was much back and forth on the editorial list. Discussion clarified 
>> some of the issues for me, and mollified some of my concerns.
>> 
>> At that time we did implement support for noncharacters in Wenlin, 
>> controlled by an Advanced Option to:
>> 
>>  Replace noncharacters with [U+FFFD]
>> 
>> This user preference is turned on by default.
>> 
>> Not sure if revisiting any of our prior discussion would help clarify the 
>> evolution of thinking on this issue.
>> 
>> But I did want to mention that the comment “without public input” is not 
>> quite correct.
> 
> Richard,
> 
> "public input" is best understood as PRI or similar process, not discussions 
> by members or other people closely associated with the project.  Also, in 
> particular, discussions on the editorial list are invisible to the public.

Asmus,

The document (L2/13-015, see link above) which we submitted to UTC in response 
to the original proposal (L2/13-006) advocated caution. When L2/13-006 came to 
our attention it was perhaps rather late in the game (as Karl suggests in his 
reply). The changes were perhaps already a foregone conclusion in the minds of 
the proposers. I don’t recall if anyone even proposed doing a PRI, but in 
retrospect that would have been a good idea; a PRI would have been ideal and 
someone should have suggested it.

> 
>> As is so often the case, and as the web page above shows, there was input 
>> and discussion. Whether the amount of time given to this was really adequate 
>> is another question. Work required may expand to fill the available time, 
>> and perhaps more time is now available.
> 
> Given the wide ranging nature of implementations this "clarification" 
> affected, I believe the process failed to provide the necessary safeguards.
> 
> Conformance changes are really significant, and a Corrigendum, no matter how 
> much it is presented as harmless clarification, does affect conformance.
> 
> The UTC would be well served to formally adopt a process that requires a PRI 
> as well as resolutions taken at two separate UTCs to approve any Corrigendum.
> 
> There are changes to properties and algorithms that would also benefit from 
> such an extended process that has a guaranteed minimum number of times for 
> the change to be debated, to surface in minutes and to surface in calls for 
> public input, rather than sailing quietly and quickly into the standard.
> 
> The threshold for this should really be rather low -- as the standard has 
> matured, the number and nature of implementations that depend on it have 
> multiplied, to the point where even a diverse membership is no guarantee that 
> issues can be correctly identified and averted.
> 
> With the minutes from the UTC only recording decisions, one change, to 
> require an initial and a confirming resolution at separate meetings would 
> allow more issues to surface. It would also help if proposal documents were 
> updated to reflect the initial discussion, much as it is done with character 
> encoding proposals that are updated to address additional concerns identified 
> or resolved.
> 
> That said, I could imagine a possible exception for true errata (typos), 
> where correcting a clear mistake should not be unnecessarily drawn out, so 
> the error can be removed promptly. Such cases usually are turning on facts 
> (was there an editing mistake, was there new data about how a character is 
> used that makes an original property assignment a mistake (rather than a less 
> than optimal choice).
> 
> Despite being called a "clarification"

Re: Corrigendum #9

2014-07-03 Thread Richard COOK

On Jul 2, 2014, at 8:02 AM, Karl Williamson  wrote:

> Corrigendum #9 has changed this so much that people are coming to me and 
> saying that inputs may very well have non-characters, and that the default 
> should be to pass them through.  Since we have no published wording for how 
> the TUS will absorb Corrigendum #9, I don't know how this will play out.  But 
> this abrupt a change seems wrong to me, and it was done without public input 
> or really adequate time to consider its effects.

Asmus,

I think you will recall that in late 2012 and early 2013, when the subject of 
the proposed changes (or clarifications) to text relating to noncharacters 
first arose, we (at Wenlin) expressed our concerns. Some concerns were grave, 
and some of the discussion and comments were captured in this web page:

<http://wenlininstitute.org/UnicodeNoncharacters/>

There was much back and forth on the editorial list. Discussion clarified some 
of the issues for me, and mollified some of my concerns.

At that time we did implement support for noncharacters in Wenlin, controlled 
by an Advanced Option to:

Replace noncharacters with [U+FFFD]

This user preference is turned on by default.
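
To make the option concrete, here is a minimal Python sketch of such a filter. It is not Wenlin's actual code: the names `is_noncharacter` and `replace_noncharacters` and the `enabled` flag are hypothetical stand-ins for the user preference described above.

```python
REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER

def is_noncharacter(cp):
    """True for the 66 noncharacters: U+FDD0..U+FDEF, plus the last two
    code points (xxFFFE and xxFFFF) of each of the 17 planes."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def replace_noncharacters(text, enabled=True):
    """Replace each noncharacter with U+FFFD when the (assumed) option is on."""
    if not enabled:
        return text  # pass noncharacters through untouched
    return "".join(REPLACEMENT if is_noncharacter(ord(c)) else c for c in text)
```

With `enabled=False` the text passes through unchanged, which is the behaviour Karl reports people now asking for as the default.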

Not sure if revisiting any of our prior discussion would help clarify the 
evolution of thinking on this issue.

But I did want to mention that the comment “without public input” is not quite 
correct. As is so often the case, and as the web page above shows, there was 
input and discussion. Whether the amount of time given to this was really 
adequate is another question. Work required may expand to fill the available 
time, and perhaps more time is now available.

-Richard

Re: Bidi Brackets for Dummies

2014-04-24 Thread Richard COOK

On Apr 24, 2014, at 2:16 PM, Whistler, Ken wrote:

> Given the incredible level of interest shown on this list during
> the last week, I am glad that I can finally announce the publication
> of Bidi Brackets for Dummies:
>  

Dear Dr. Ken, 

Thanks ever so much for that enlightening course in BFD. It is rather long, and 
I dozed off in the middle, but as soon as I woke up I thought to give you some 
important feedback.

I'd like to make one small suggestion (in addition to Markus' that you change 
the "r" to an "n" in the URL, which I have taken the liberty of doing above).

You are using "pair" not only as a noun in several places, but as a *singular* 
noun. For example:

>> And that tells you that U+005D is the pair for U+005B LEFT SQUARE BRACKET.

>> And that tells you that U+005B is the pair for U+005D RIGHT SQUARE BRACKET.

Maybe that pair of nominal singular "pair" usages is all of them. And maybe 
that's like "maths" in an earlier sentence.

>> It's probably something to do with maths, but it's a "bracket", anyway.

But, I'd think if you are going to use "pair" as a nominal singular, you might 
at least add a chapter on the subject. (And one on "maths" too, for that 
matter.)

However, it might be easier just to use the word "mate" instead. For example,

 "And that tells you that U+005D is the *mate* of U+005B LEFT SQUARE BRACKET."

But then, that may be a bit racy for the Unicode Censors ...

> I had wanted to publish that several weeks ago, but unfortunately,
> publication was held up for more than three weeks while I
> struggled to get the document past the Unicode Censors!

... so perhaps just say ...

 "And that tells you that U+005D matches U+005B LEFT SQUARE BRACKET."

... or something like that, to avoid suggestively suggesting that these BFD 
things are actually, ahem, re-productive.
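
Whatever the relationship ends up being called, it amounts to a lookup table. A small Python sketch follows, assuming a hand-typed excerpt laid out like the data file BidiBrackets.txt (code point; paired code point; "o" for open or "c" for close). The excerpt and the name `parse_pairs` are for illustration only; Python's own `unicodedata` module does not expose this property.

```python
# Hand-typed excerpt, assumed to follow the field layout of BidiBrackets.txt.
EXCERPT = """\
0028; 0029; o # LEFT PARENTHESIS
0029; 0028; c # RIGHT PARENTHESIS
005B; 005D; o # LEFT SQUARE BRACKET
005D; 005B; c # RIGHT SQUARE BRACKET
"""

def parse_pairs(text):
    """Map each bracket code point to (matching code point, 'o' or 'c')."""
    pairs = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        cp, mate, kind = (field.strip() for field in data.split(";"))
        pairs[int(cp, 16)] = (int(mate, 16), kind)
    return pairs

PAIRS = parse_pairs(EXCERPT)
mate, kind = PAIRS[0x5B]
print(f"U+{mate:04X} matches U+005B ({'opening' if kind == 'o' else 'closing'})")
```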

> Enjoy.

Yes, thanks!

-Richard

> --Ken

Re: CJK stroke order data: kRSUnicode v. kRSKangXi

2014-03-12 Thread Richard COOK

On Mar 12, 2014, at 2:59 AM, Adam Nohejl wrote:
>> 
>> Since kRSUnicode is a Normative property, a formal proposal to modify that 
>> data is required, for review in WG2. I have added notes on the items you 
>> mention below, for consideration in that process, and in the meantime, if 
>> you identify any other issues, please bring them to our attention.
> 
> OK, I will prepare a more comprehensive list. Do you mean that you would 
> submit such a formal proposal? Or can I submit it myself somehow?

You are welcome to prepare a proposal, or just send us your list.

We have already started a proposal to augment kRSUnicode, but I'm not sure 
about the timeframe for completion.

The proofing of the various Kangxi properties is separate from this, but is 
aimed at the specific KX edition used by IRG in Extension B work.

-Richard

Re: ?MP = Multilingual plane?

2014-03-10 Thread Richard COOK


On Feb 27, 2014, at 7:23 AM, Michael Everson wrote:

> On 27 Feb 2014, at 02:32, Shriramana Sharma  wrote:
> 
>> Given that Unicode encodes scripts and not languages, how appropriate is it 
>> to call the BMP and the SMP as the multi*lingual* planes?
> 
> You are more than two decades late in asking this.
> 
> It may have seemed more appropriate in an 8-bit code page world where rather 
> small subsets limited the number of languages accessible by one or another 
> part of ISO/IEC 8859. 
> 
> A new term like “multiscriptal” would not have been appropriate. File this 
> under “We know the term ‘ideograph’ is a misnomer."

'When I use a word,' Humpty Dumpty said, in rather a scornful tone, 
'it means just what I choose it to mean — neither more nor less.'

'The question is,' said Alice, 'whether you can make words mean so many 
different things.'

'The question is,' said Humpty Dumpty, 'which is to be master — that's all.'

Alice was too much puzzled to say anything; so after a minute Humpty Dumpty 
began again.



> Michael Everson * http://www.evertype.com/

Re: CJK stroke order data: kRSUnicode v. kRSKangXi

2014-03-10 Thread Richard COOK

Mr. Nohejl,

About the property data you mention below. kRSUnicode property data permits 
multiple/variant (space-delimited) radical/stroke values, and I think we will 
see important variants added in the future. Where a specific value attested in 
a specific Kangxi edition is missing from kRSUnicode, it would indeed be useful 
to add it, and perhaps to give it priority (move it to the front of the list). 
Likewise, if a common variant value is missing (even one not associated with 
Kangxi), it might be added for convenience. And if there are any outright 
errors, of course those should be identified and corrected (but clear errors 
are harder to find these days). 
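
As a rough illustration of the multi-value format, here is a Python sketch that parses a space-delimited radical/stroke field and moves a preferred value to the front. The regular expression is an assumed approximation of the value syntax, the function names are made up, and the combined field "13.10 73.8" is hypothetical rather than actual Unihan data.

```python
import re

# Assumed shape of one value: radical number, optional apostrophe (simplified
# radical form), a dot, then the residual stroke count (possibly negative).
RS_VALUE = re.compile(r"^(\d{1,3})('?)\.(-?\d{1,2})$")

def parse_rs(field):
    """Split a radical/stroke field into (radical, simplified, strokes) tuples."""
    values = []
    for token in field.split():
        m = RS_VALUE.match(token)
        if m is None:
            raise ValueError(f"unrecognized radical/stroke value: {token!r}")
        values.append((int(m.group(1)), m.group(2) == "'", int(m.group(3))))
    return values

def prioritize(field, preferred):
    """Move the preferred value to the front of the list, adding it if absent."""
    tokens = [t for t in field.split() if t != preferred]
    return " ".join([preferred] + tokens)

print(parse_rs("13.10 73.8"))
print(prioritize("13.10", "73.8"))
```

The first value in the list is the one most consumers would treat as primary, which is why giving a Kangxi-attested value priority means reordering and not just appending.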

Note that because kRSUnicode covers *all* Unihan CJK, even those characters not 
present in the original Kangxi, some of the radical/stroke values are so-called 
"virtual" assignments (those should be omitted from consideration, in proofing 
original KX data).

Several years ago we (at Wenlin.com) produced consolidated Kangxi data for our 
Zidian (Wenlin 4.X), taking these four properties (among other data) as input:

The last of these may not have any obvious connection with Kangxi, until one 
reads the kIRG_GSource property description and sees this "sub-property" 
description:

"GKX Kangxi Dictionary ideographs (康熙字典) 9th edition (1958) including the 
addendum (康熙字典)補遺"

PRC researchers have done much work proofing G-Source Kangxi data, to address 
many aspects of the complex original text. 

The Kangxi work we did at Wenlin has several dimensions, and some of this has 
not yet rippled back into UCD.

We have in fact already identified many important omissions from kRSUnicode, 
which we plan to propose for a future data release. 

Since kRSUnicode is a Normative property, a formal proposal to modify that data 
is required, for review in WG2. I have added notes on the items you mention 
below, for consideration in that process, and in the meantime, if you identify 
any other issues, please bring them to our attention.

-Richard

PS: About the subject line of your message. Please note that despite the "CJK 
stroke order" subject line in your message, we are not talking about CJK stroke 
order here at all, but about Kangxi and UCS radical assignment, and residual 
stroke *count* data. Such data can indeed be used to "order" (collate) CJK 
data, but "stroke order" is a separate issue, involving the particular sequence 
of CJK Strokes (see The Unicode Standard, Appendix F) in the writing of a given 
character (stroke-order data can also be used for collation and indexing). 
Wenlin's CDL database (which inspired the CJK Stroke block, and also produced 
Appendix F) contains a comprehensive analysis of CJK Stroke order *and* 
Radical/Stroke data for all UCS CJK, primarily focused on PRC norms, but also 
including a great many variants (variant forms, variant stroke counts, and 
variant radical assignments).

On Feb 28, 2014, at 10:56 AM, Adam Nohejl wrote:

> 
> (1) A very common character for "most, maximum".
> 最[U+6700] kRSKangXi   73.8
> 最[U+6700] kRSUnicode  13.10
> 
> (2) A funny character for autumn containing the turtle component.
> 龝[U+9F9D] kRSKangXi   115.16
> 龝[U+9F9D] kRSKanWa    115.16
> 龝[U+9F9D] kRSUnicode  213.5
> 
> There are also characters that actually are not included in the Kang Xi 
> dictionary**, but the Unihan data contain both a purported Kang Xi radical 
> and in addition to that a _different_ Unicode radical.
> 
> (3) The simplified turtle character (commonly assigned to the traditional 
> radical #213):
> 亀[U+4E80] kRSKangXi   213.0
> 亀[U+4E80] kRSUnicode  5.10
> 
> (4) Character with the radical #72/73 at the top, i.e. IMHO an arbitrary 
> decision, but unexpectedly the fields differ:
> 曻[U+66FB] kRSKangXi   72.7
> 曻[U+66FB] kRSUnicode  73.7
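
A comparison like the one quoted above is easy to script. The Python sketch below uses only the four value pairs from the quoted examples; the function names are made up, and a real check would read the values from the Unihan data files instead of a hard-coded table.

```python
# Values copied from the four quoted examples: (kRSKangXi, kRSUnicode).
EXAMPLES = {
    0x6700: ("73.8", "13.10"),
    0x9F9D: ("115.16", "213.5"),
    0x4E80: ("213.0", "5.10"),
    0x66FB: ("72.7", "73.7"),
}

def radical(value):
    """Radical number of a single 'radical.strokes' value."""
    return int(value.split(".")[0])

def radical_mismatches(examples):
    """Yield (code point, KangXi radical, Unicode radical) where the radicals differ."""
    for cp, (kangxi, uni) in sorted(examples.items()):
        if radical(kangxi) != radical(uni):
            yield cp, radical(kangxi), radical(uni)

for cp, kx, uni in radical_mismatches(EXAMPLES):
    print(f"U+{cp:04X} {chr(cp)}: kRSKangXi radical {kx}, kRSUnicode radical {uni}")
```

This compares radicals only; the residual stroke counts would need the same treatment.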

> Hello,
> 
> I am comparing radical data for CJK characters from different sources, 
> including the Unihan database. According to the Unihan documentation* the 
> kRSUnicode radical should correspond to kRSKangXi radical, which in turn 
> should be based on the Kang Xi dictionary.
> 
> Is there any explanation for the following discrepancies? Did I miss any 
> other rules or reasoning behind the content of these two fields?
> 
> Examples of the discrepancies:
> 
> (1) A very common character for "most, maximum".
> U+6700 kRSKangXi   73.8
> U+6700 kRSUnicode  13.10
> 
> (2) A funny character for autumn containing the turtle component.
> U+9F9D kRSKangXi   115.16
> U+9F9D kRSKanWa    115.16
> U+9F9D kRSUnicode  213.5
> 
> There are also characters that actually are not included in the Kang Xi 
> dictionary**, but the Unihan data contain both a purported Kang

Re: Simplified Chinese radical set in Unihan

2004-12-19 Thread Richard Cook

On Dec 16, 2004, at 3:20 PM, Tom Emerson wrote:
> Ah, I don't have my copy of the Comprehensive ABC here at home with me.

If you have Wenlin, you have it in electronic form. Wenlin does the 
typesetting (and sub-licensing) for ABC, and the ABC data is accessible 
from within the Wenlin app.

But on the subject of a Simplified Chinese radical set for Unihan:
Please see the new field kHDZRadBreak coming in the Unihan 4.1 beta. 
This field shows a way to add additional radical info to Unihan. That 
is, for a lexical kSource in Unihan, one can associate kSource mappings 
with radical transitions. The Hanyu Da Zidian radical set is in fact a 
simplification of Kang Xi, though not one using simplified characters. 
When lexical mappings for a good simplified PRC lexicon are included in 
Unihan, a similar table can be built. We've got mapping and pinyin data 
for all of Xiandai Hanyu Cidian, accepted by UTC for inclusion in 
future Unihan. This will hopefully be added to Unihan in the coming 
year (pending final proofing).

Re: Unicode for words?

2004-12-07 Thread Richard Cook

On Dec 5, 2004, at 07:02 PM, Doug Ewell wrote:
> A word-based encoding for English could automatically assume spaces
> where they are appropriate.  The sentence:
> "What means this, my lord?"
> would have seven encodable elements: the five words, the comma, and the
> question mark.  Spaces would be automatically filled in as needed, not
> explicitly encoded.  This implies "standard" English punctuation and
> spacing conventions, however that is defined.  For French conventions,
> there would probably be a space before the question mark as well.

Well, why stop with words, my lord? Why not just encode all sentences, 
paragraphs, pages, chapters, books, libraries, or your higher level 
unit of choice, for that matter.

For example, in my library, the single code point U+10 happens to 
contain hi-res color images of all pages of an edition of Moby Dick 
that I happen to like very much.

Or consider an image-based encoding, which joins standard text to 
image. Images of the text to be encoded are indexed using some private 
indexing scheme, and the index elements are then mapped to a standard 
encoding. The relatively lo-res standard encoding (which must 
necessarily collapse some distinctions that are less generally 
important), is augmented with hi-res indexing of images of the specific 
text to be digitized.

Whether you choose to associate a single glyph with your private-use 
code point, or an entire book, why, that's up to you (and your 
software).
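
For what it is worth, the seven-element count in the scheme quoted at the top of this message is easy to reproduce. The Python sketch below is only an illustration: the function names are made up, and the single spacing rule in `respace` is an assumption standing in for "standard" English conventions, however those are defined.

```python
import re

def elements(sentence):
    """Split a sentence into words and punctuation marks; spaces are dropped."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def respace(tokens):
    """Rebuild text under one assumed rule: a space goes before every word
    except the first, and never before a punctuation mark."""
    out = []
    for tok in tokens:
        if out and tok[0].isalnum():
            out.append(" ")
        out.append(tok)
    return "".join(out)

tokens = elements("What means this, my lord?")
print(len(tokens), tokens)
print(respace(tokens))
```

A French variant would need a different rule in `respace`, which is exactly the kind of convention the word-level encoding would have to carry implicitly.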

Re: Unicode for words?

2004-12-05 Thread Richard Cook

On Dec 5, 2004, at 12:27 AM, Tim Finney wrote:
> my co-worker suggested encoding entire words in Unicode.

The "word" is considerably less well-defined than the character. The 
set of words is open-ended. If you'd like to see where you go when you 
start trying to encode words, take a look at CJK Extension B. CJK 
ideographs are much like words, in that they are both comprised of more 
basic units. English words are composed of letters, while ideographs 
are composed of strokes. If you encode only higher level constructs, 
then you must address the issue of input/indexing via lower-level 
units. So, there's no way to escape from defining the lower-level 
units. If you mean to suggest encoding words as shorthand for sequences 
of encoded low-level units, that might work for very specific, 
well-defined purposes. But whenever someone creates a neologism (and 
word-creation is an on-going process in all living languages), you need 
to revisit the encoding process, and encode a new unit. This is 
burdensome, to say the least. I think that most people who work on 
encoding like to imagine that it is mostly a finite task. Maintenance 
of the standard is infinite, but encoding should taper off, 
comparatively, over time. Except for encoding of CJK ideographs.

script complexity, was Re: OpenType vs TrueType (was current version of unicode-font)

2004-12-04 Thread Richard Cook

On Dec 4, 2004, at 12:15 PM, John Hudson wrote:
> I think Peter's point was that complex script require font layout 
> tables

Script complexity is not so easily quantified. Has anyone tried to sort 
scripts by complexity? In terms of the present discussion, Han would be 
viewed as a simple script, and yet it is "simple" only in terms of the 
script model in which ideographs are the smallest unit. In a 
stroke-based Han script model, Han is at least as complex as any.

Re: current version of unicode-font

2004-12-02 Thread Richard Cook

On Thu, 2 Dec 2004, John Cowan xiele:

> Paul Hastings scripsit:
>
> > speaking of which, *are* there any open source fonts that come even
> > close to Arial Unicode MS?
>
> In what, breadth of coverage or aesthetics?  The GNU Unifont has very
> wide coverage though it is a bitmap font; James Kass's CODE 2000 and CODE
> 2001 probably have the widest coverage of any font, though it costs US$5
> to use them.  Both of them IMHO are a tad on the ugly side.

In all fairness, the CODE 2000 font from James Kass is quite beautiful,
conceptually speaking. If the current execution is a tad ungainly here and
there, I ask 3 questions: (0) "What do you want for nothing (if you have
not yet paid the shareware fee)?"; (1) "What do you want for $5?"; and (2)
"What do you want from a $5 shareware font that aspires to perfect coverage
of the *entire* BMP?"

Code2000 is not open source, but Kass is remarkably responsive to user
input.

I urge everyone to download a copy of Code2000, and provide the developer
with feedback, both in terms of suggestions to improve the TrueType font,
and in terms of money to fund development.

http://home.att.net/~jameskass/

James is doing some great work, using some relatively low-level
programming tools. In my experience (admittedly somewhat limited, since I
don't care about *everything* in the BMP) his font works where other
fonts, professional and amateur, completely fail. If a font has the glyph
you need in any form, that's far better than having a glyph of last
resort, or no glyph at all.

Disclaimer: I have no commercial relation to Kass, and have received no
compensation for this endorsement. This review should also not be taken as
expressing approval of the shape of any glyph in the Code2000 font,
especially the Capital Letter J, which I think even Kass himself has
called "quirky at best". Note however that the Code2000 "hexagram" block
characters do look quite nice, and better yet, they work in Adobe
Illustrator CS, though no one (neither Kass nor Adobe) seems to know why
yet :-)

RE: Ideograph?!?

2004-11-30 Thread Richard Cook

On Mon, 29 Nov 2004, Kenneth Whistler opined contemplatively:

> Allen Haaheim provided some further detailed clarification:
>
> > Note that Han characters are logographic, not ideographic. That is,
> > they are graphemes that represent words (or at least morphemes),
> > not ideas.
>
> This correctly states the situation for the normal case for
> Chinese characters used writing the Chinese language in most
> instances. But as is not unusual for real writing systems, the
> situation gets blurred all around the edges.
>
> For one thing, Chinese has characters which are simply used for
> their sound, as syllabics. In some instances, they are characters
> in dual use, as logographs *or* as syllabics, but in either
> instance they are used to "spell out" foreign words irrespective
> of the morphemic status of the original characters -- or the
> morphemes of the foreign word, for that matter.
>
> And the situation is also not so clear when considered in
> the dynamic context of the historical borrowing of the Chinese
> writing system to write unrelated languages such as Japanese,
> Korean, and Vietnamese. Much of the writing system borrowing
> was *attached* to words -- in other words, the vocabulary itself was
> borrowed in from Chinese, using the Chinese characters to
> write it. But Japanese and other languages faced the problem of how
> to adapt the writing system for preexisting, *native* vocabulary,
> as well as for all the borrowed words from Chinese. And a
> variety of strategies evolved, some of which involved
> abstracting the *meaning* of a Chinese character, and then
> reapplying the character to write an unrelated word in Japanese
> (for example) which had a similar meaning. This semantic-based
> transference of Chinese characters completely ignored
> morphemic status in Chinese, as the whole point was to simply
> find the appropriate character to express the lexical semantics
> of the historically unrelated (but semantically similar)
> word(s) in the borrowing language.
>
> During such a borrowing transition, you can conceive of
> the process as many Chinese characters temporarily "floating off"
> their morphemic anchors in Chinese, being considered
> purely semantically, and then reattaching to a new set of
> morphemic anchors in Japanese, where they subsequently
> evolve with new lexical histories in another language.

As usual, Dr. Whistler has hit upon some key and interesting distinctions.
The association of a Chinese-derived character with any particular
morpheme is a slippery slope, as is morphemics or semantics in general, I
reckon.  Much less slippery is phonology, and one is on firmer ground to
assert that Chinese characters (and characters derived from the Chinese
tradition, either directly, or indirectly) are regularly associated with
specific quantifiable pronunciations, in whatever language. And even more
specifically, they are associated with one or more monosyllabic readings.
This is why I call the Chinese character an element of a heterographic
syllabary. In a particular dialect one can identify a set of possible
syllables, each of which may have one or more writings via a single
character. In an isographic syllabary, on the other hand, each syllable in
the canon has but a single graphical rendering. It is the case in Chinese
that a given morpheme with a certain well-defined pronunciation may in
fact be representable with more than one character. This is due to
character variants, or simply orthographic variation. And the
graphical-semantic-phonological complex shows variations in all three of these
dimensions, in time and locale ...
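
The heterographic/isographic distinction above can be put as a small data-structure check. This is only a sketch: the function names and the two sample tables are made up for illustration (a handful of Mandarin readings and three kana), not drawn from any real lexicon.

```python
from collections import defaultdict

def syllable_writings(readings):
    """Invert a character -> syllable table into syllable -> set of characters."""
    by_syllable = defaultdict(set)
    for char, syllable in readings.items():
        by_syllable[syllable].add(char)
    return dict(by_syllable)

def is_isographic(readings):
    """True if every syllable in the table has exactly one written form."""
    return all(len(chars) == 1 for chars in syllable_writings(readings).values())

# Illustrative samples only (hypothetical mini-tables, not a full canon).
mandarin_sample = {"馬": "mǎ", "碼": "mǎ", "瑪": "mǎ", "媽": "mā"}
kana_sample = {"か": "ka", "き": "ki", "く": "ku"}

print(is_isographic(mandarin_sample))  # False: "mǎ" has three writings
print(is_isographic(kana_sample))      # True: one writing per syllable
```

A fuller model would also let one character carry several readings, which this one-reading-per-character table leaves out.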

> > But somehow "ideograph" has become the standard term in use outside
> > the field of experts in Chinese linguistics (because of Ezra
> > Pound et al., perhaps?).
>
> I don't think you have to look to Ezra Pound's poetic
> misrepresentations of the nature of Chinese to find
> reasons here.

Pound was famous for being very *bad* at Chinese, but at least he was
enthusiastic (and only mildly fascist). Prior to Pound some would argue
that early use of "ideograph" was due to a Jesuit misunderstanding. The
Jesuits are in fact the ones who also brought us "radical" due to a
creative analogy with Hebrew "root". It is however less clear to me that
early use of ideograph is always a mistaken interpretation of Chinese
writing as "more semantic" and "less phonographic". It is true that Chinese
is poorly phonographic. But is it better at ideography? The term ideograph
needn't be understood as "idea" writing (skipping speech completely),
which interpretation I think follows one etymology of the "ideo"
component, associating it with the Greek counterpart of English "idea".
But the Greek root of "idea" is actually the same as that of English
"vision", a root related to 'seeing'.  Greek "eidon" is also related, and
means 'image' or 'form' (that which is or has been seen). So, the term
"ideograph" can be understood in some sense as a pure Greek equivalent to
Greco-L

Re: Ideograph?!?

2004-11-29 Thread Richard Cook

The term ideograph has special meaning in Unicode/ISO usage. "Ideograph"
is short for "CJK Unified Ideograph", and denotes any of the characters
with mapping or reference data in the Unihan.txt database.

Likewise, "Radical" has special meaning. CJK Radicals are found in two
places, in the "Kangxi Radicals" block, and in the "CJK Radicals
Supplement".  (Actually, there is also a third block of radicals, "Yi
Radicals", but these are not CJK).

CDL provides a way for precise description of any CJK Unified Ideograph or
Radical. Please see , and the "Jargon Notes".

In other contexts (beyond Unicode) both of these terms have different or
broader usages. Radical, for example, is a 'lexicographic indexing
component' (used in Radical/Stroke indexes), and ideograph is 'idea
writing' ...

On Mon, 29 Nov 2004, Clark Cox wrote:

> On Mon, 29 Nov 2004 15:13:51 -0500, Flarn <[EMAIL PROTECTED]> wrote:
> > What's an ideograph?
>
> An ideograph (aka ideogram) is (from www.m-w.com):
>
> "a picture or symbol used in a system of writing to represent a thing
> or an idea but not a particular word or phrase for it"
>
> > Also, what's a radical?
>
> A radical is, in the set of Han characters, a symbol that occurs as
> part of other ideographic characters that often serves to show common
> meaning or history to the character. In many ways, radicals are to Han
> characters as Greek and Latin roots are to English words.
>
> for instance, in Japanese, the character 妊 (U+598A) means "pregnant",
> and contains, as a radical the character 女(U+5973), which means
> "woman".
>
> --
> Clark S. Cox III
> [EMAIL PROTECTED]
> http://www.livejournal.com/users/clarkcox3/
> http://homepage.mac.com/clarkcox3/
>
>
>
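
To make the Unicode-specific senses of "ideograph" and "radical" concrete, here is a small sketch using Python's standard unicodedata module (my choice of tool, not something used in the thread). It looks up the character names for the two characters in the quoted example plus the corresponding "Kangxi Radicals" block character, and shows that the radical character is only compatibility-equivalent to the unified ideograph.

```python
import unicodedata

def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

ideograph = "\u598a"       # the character glossed as "pregnant" above
component = "\u5973"       # the "woman" character, a CJK Unified Ideograph
kangxi_radical = "\u2f25"  # the "woman" radical in the Kangxi Radicals block

for ch in (ideograph, component, kangxi_radical):
    print(describe(ch))

# Compatibility normalization folds the radical character onto the ideograph.
folded = unicodedata.normalize("NFKC", kangxi_radical)
print(folded == component)  # True
```

So the "radical" as an encoded character and the "radical" as a component of another ideograph are distinct things, even when they look the same.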

Re: outside decomposed, inside precomposed

2004-10-14 Thread Richard Cook

On Oct 13, 2004, at 1:42 PM, Eric Muller wrote:

> Going back to the original scenario, to make my point clearer:
>
> System A, a subset of FileMaker, has {U+0065, U+0303, U+1EBD} as its
> repertoire. When presented with the input , it
> produces the output .
>
> System B, my rendering system, has {U+0065, U+0303} as its repertoire.
> When presented with the input , it produces a correct
> rendering. When presented with the input  it outputs a smiley.
>
> Both systems are conformant (I would hope), yet putting them together
> does not mean that the result is automatically conformant, even on the
> intersection of their repertoire. Hence, one cannot attribute the
> problem that Richard is seeing to either system. The problem belongs
> to Richard, when he did put the two systems together.

Well, if it belongs to me, I'll happily give it away for free to anyone
willing to take it off my hands.

> If we want some automatic guarantee of conformance for combinations of
> conformant systems, then we need at least to impose that the
> repertoire supported by conformant implementations be closed under
> canonical equivalence. Such a condition is not there today. It has
> interesting consequences: e.g. U+2FA1C is canonically equivalent to
> U+9F3B, so the BMP is not closed under canonical equivalence, so no
> conformant system could make its repertoire exactly the BMP.

If someone wants to normalize my text into precomposed things, that's
all well and good, so long as there's a fallback mechanism for
rendering via a font which may have only the decomposed parts.

I don't know if this is the same as saying that "we need at least to 
impose that the repertoire supported by conformant implementations be 
closed under canonical equivalence".

But, I say this after spending *way* too much time yesterday adding 
precomposed things to my font. You might think that I can rationalize 
it, thinking that now the diacritic placement will be somewhat better 
than it was before. But I thought the diacritic placement was adequate 
to begin with. And I can't predict what new precomposed things I might 
need to support in the future.

-Richard
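
The repertoire argument in this message can be tried out with a few lines of Python. The sketch below is not a full test of closure under canonical equivalence; it only asks whether the NFC and NFD forms of a given string stay inside a repertoire. The helper name leaves_repertoire is my own invention, and the two sets mirror System A and System B as described above.

```python
import unicodedata

def leaves_repertoire(text, repertoire):
    """Map each normalization form of `text` to the code points it needs
    that fall outside `repertoire` (a set of integers)."""
    escapes = {}
    for form in ("NFC", "NFD"):
        normalized = unicodedata.normalize(form, text)
        missing = {f"U+{ord(c):04X}" for c in normalized if ord(c) not in repertoire}
        if missing:
            escapes[form] = missing
    return escapes

system_a = {0x0065, 0x0303, 0x1EBD}
system_b = {0x0065, 0x0303}
decomposed = "\u0065\u0303"

print(leaves_repertoire(decomposed, system_a))  # {}
print(leaves_repertoire(decomposed, system_b))  # {'NFC': {'U+1EBD'}}

# The BMP example: the supplementary-plane compatibility ideograph
# normalizes to a BMP character.
print(unicodedata.normalize("NFC", "\U0002FA1C") == "\u9f3b")  # True
```

System B's repertoire is fine for the decomposed input but is left behind as soon as anything upstream composes it, which is the situation described in the thread.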

Re: outside decomposed, inside precomposed

2004-10-13 Thread Richard Cook

Jon,
Thanks for your reply.
On Oct 13, 2004, at 3:15 AM, you wrote:

> > imported UTF-8 sequences like [U+0065][U+0303]  get
> > remapped internally to [U+1ebd] LATIN SMALL LETTER E WITH TILDE.
> > Is this kind of behavior what one would expect?
>
> That's conformant, if it causes problems with any other process
> (including other processes that are part of the system in question)

Like, for example, a rendering process?

> then that other process isn't complying with conformance clause C9.
> At a guess I'd say it's probably normalising to NFC which is
> advantageous in a lot of ways (for example you should do this with
> data that has to conform with the web's [draft] character model).
>
> One of the clearest advantages is that it makes searching a lot more
> efficient, as only one of the potentially very many canonically
> equivalent sequences will have to be searched for

Yes.

> (though case-insensitive and/or diacritical-insensitive searches will
> still have many possible matching strings).

Yup.

> On the other hand there are potential security risks with such
> normalisation, and perhaps therefore it is something that should be
> configurable.
>
> > It's problematic (and buglike) for at least one reason: one needs to
> > put all these precomposed things in one's font, or FileMaker doesn't
> > display them properly.
>
> That's where the problem lies, not in the normalisation.

Maybe they ought to be rendering the glyphs according to the characters
in the font, with a fallback via decomposition. If they normalize and
simply throw up the missing character empty box, this is not very
helpful.

I built a tidy IPA transcription font, lacking many precomposed things. 
Importing and exporting a data subset in FM7 reveals a total of 113 
characters not displaying properly. This is annoying, to say the least.

One reason I wanted a *small* font is that in PDF generation big fonts 
may not always be subsetted properly, and even a single page PDF will 
end up embedding the whole font.

Also, there is extra overhead with a big font that seems to slow things 
up a bit, even on a fast machine.

> > I'm assuming it will export the data in decomposed form ...
> > but haven't actually tried that yet ...
>
> I wouldn't assume anything of the sort. Normalising to NFD would be
> quite unusual.

Yes, I realize that now. And my test confirms that the internal
normalization is also what you get on export. And hence those 113 empty
boxes ...

> > BTW, this application supports import of UTF-8, but will not export
> > UTF-8. That's odd, isn't it? It'll only export UTF-16 (its internal
> > storage form).
>
> Odd indeed.

Well, maybe they're saving UTF-8 export for a future release ... though
I can't imagine why.

-Richard
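
The "fallback via decomposition" suggested above might look something like the following sketch. It is purely illustrative: glyph_run and the coverage set are hypothetical stand-ins for whatever a real renderer and font would provide, and real shaping would also have to position the combining mark.

```python
import unicodedata

def glyph_run(text, font_coverage):
    """Choose the code points to draw, preferring what the font covers.

    font_coverage is a set of integer code points (a hypothetical stand-in
    for a font's character map).
    """
    out = []
    for ch in text:
        if ord(ch) in font_coverage:
            out.append(ch)            # the font has the character as given
            continue
        parts = unicodedata.normalize("NFD", ch)
        if parts != ch and all(ord(p) in font_coverage for p in parts):
            out.extend(parts)         # fall back to base + combining marks
        else:
            out.append("\ufffd")      # stand-in for the empty box
    return "".join(out)

small_font = {0x0065, 0x0303}         # e and combining tilde only
print(glyph_run("\u1ebd", small_font) == "e\u0303")  # True
```

With a fallback of this kind, a small font holding only the decomposed parts would still display text that an upstream process had normalized to NFC.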

outside decomposed, inside precomposed

2004-10-12 Thread Richard Cook

Using a certain newly "Unicode-aware" database application which shall 
remain nameless (FileMaker 7):

imported UTF-8 sequences like [U+0065][U+0303]  get remapped 
internally to [U+1ebd] LATIN SMALL LETTER E WITH TILDE.

Is this kind of behavior what one would expect?
It's problematic (and buglike) for at least one reason: one needs to 
put all these precomposed things in one's font, or FileMaker doesn't 
display them properly.

I'm assuming it will export the data in decomposed form ... but haven't 
actually tried that yet ...

BTW, this application supports import of UTF-8, but will not export 
UTF-8. That's odd, isn't it? It'll only export UTF-16 (its internal 
storage form).

-Richard
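
If an application will only hand back UTF-16, the data can be converted outside it. A minimal sketch, assuming the export is UTF-16 with a byte order mark; the function name is made up, and nothing here is FileMaker-specific:

```python
import unicodedata

def reexport_decomposed(utf16_bytes):
    """Decode a UTF-16 export, decompose it (NFD), and re-encode as UTF-8."""
    text = utf16_bytes.decode("utf-16")   # the BOM, if present, picks the byte order
    return unicodedata.normalize("NFD", text).encode("utf-8")

exported = "\u1ebd".encode("utf-16")      # simulated export of the precomposed letter
print(reexport_decomposed(exported))      # b'e\xcc\x83'
```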

RE: Doulos SIL (was: French typographic thin space)

2004-04-07 Thread Richard Cook

On Wed, 7 Apr 2004, Peter Constable wrote:

> They were encoded that way some while before they were accepted in
> Unicode. Also, until Unicode 4.1 is published, there is a possibility
> that codepoints may change.

I see. I assumed the codepoint assignments were already firm.

Re: Tai Xuan Jing Symbols, any background information ?

2003-10-12 Thread Richard Cook

Patrick,

Also, the Chinese names (for hexagrams and tetragrams) are in the  
original proposals, copies here:

http://linguistics.berkeley.edu/~rscook/pdf/UniProp-Final/02089- 
n2416.pdf
http://linguistics.berkeley.edu/~rscook/pdf/UniProp-Final/01283- 
n2363.pdf

-Richard
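
For anyone who just wants the English character names for the block in one place, they can be pulled from the character database. A sketch using Python's unicodedata module (my choice of tool; it assumes the bundled database includes this block, which was added in Unicode 4.0):

```python
import unicodedata

TXJ_FIRST, TXJ_LAST = 0x1D300, 0x1D356   # block range as given in this thread

def txj_names():
    """Return {code point: character name} for the Tai Xuan Jing symbols."""
    return {cp: unicodedata.name(chr(cp)) for cp in range(TXJ_FIRST, TXJ_LAST + 1)}

names = txj_names()
print(len(names))                        # 87
for cp in (0x1D344, 0x1D351, 0x1D352):   # the three names asked about
    print(f"U+{cp:04X} {names[cp]}")
```

The names alone will not settle which sense of "watch" or "compliance" is meant; for that the Chinese names in the proposals are still needed.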

On Saturday, Oct 11, 2003, at 15:55 US/Pacific, Richard Cook wrote:

> The English TXJ names come from Michael Nylan's book. You'll have to
> find that book to learn what she meant. Or better, get a copy of the
> Chinese original. -Richard
>
> On Saturday, Oct 11, 2003, at 13:28 US/Pacific, Patrick Andries wrote:
>
> > Would anyone know where I could find some background information on
> > the Tai Xuan Symbols (U+1D300-U+1D356)? Any JTC1/SC2 document? The
> > Chinese names?
> >
> > I'm having problems understanding a lot of names in this block. For
> > instance:
> >   What does "watch" mean in U+1D344? Timepiece? Period of duty? A
> >   guardsman? A division of the night?
> >   What does "aggravation" mean in U+1D351? Worsening? Exasperation?
> >   What does "compliance" mean in U+1D352? Acceptance, conformity or
> >   submission?
> >
> >   Etc.
> >
> > P. A.

Re: Tai Xuan Jing Symbols, any background information ?

2003-10-11 Thread Richard Cook

The English TXJ names come from Michael Nylan's book. You'll have to 
find that book to learn what she meant. Or better, get a copy of the 
Chinese original. -Richard

On Saturday, Oct 11, 2003, at 13:28 US/Pacific, Patrick Andries wrote:

> Would anyone know where I could find some background information on
> the Tai Xuan Symbols (U+1D300-U+1D356)? Any JTC1/SC2 document? The
> Chinese names?
>
> I'm having problems understanding a lot of names in this block. For
> instance:
>   What does "watch" mean in U+1D344? Timepiece? Period of duty? A
>   guardsman? A division of the night?
>   What does "aggravation" mean in U+1D351? Worsening? Exasperation?
>   What does "compliance" mean in U+1D352? Acceptance, conformity or
>   submission?
>
>   Etc.
>
> P. A.

Re: TAI NÜA , TAI LE

2003-09-11 Thread Richard Cook

Gedney says "nuea"/"nü" is a Thai word for 'north/northern' ... looks 
as if the syllable in this name gets written many different ways ... 
le, lu, lü, lüe, lue, nü, nüa, nüe, neua, nuea ... at least it's 
possibly the same syllable.

Here are some references:

Gedney, William J. 1976. "Notes on Tai Nuea". In _Tai Linguistics in 
Honor of Fang-kuei Li_ (Gething et al., eds.). Bangkok. [62-102].

Morev, L. N. 1978. _Jazyk Ly_. Moscow: Nauka.

Seree Weroha. 1974. _Tai Lue / English Dictionary_. 2 Vol. Univ. 
Michigan.

Re: TAI NÜA , TAI LE

2003-09-11 Thread Richard Cook

On Thursday, Sep 11, 2003, at 10:45 US/Pacific, Michael Everson wrote:

> At 10:02 -0700 2003-09-11, Richard Cook wrote:
>
> > I'm guessing that "Tai Le" would be the exonym (Chinese name), while
> > "TAI NÜA" is the autonym.
>
> Don't guess. The Chinese name is Dehong Dai.

Well, "Le" is a Chinese (Mandarin) syllable, while "NÜA" is not ...

-Richard

Re: TAI NÜA , TAI LE

2003-09-11 Thread Richard Cook

On Thursday, Sep 11, 2003, at 09:42 US/Pacific, Michael Everson wrote:

> At 11:04 -0400 2003-09-11, Patrick Andries wrote:
>
> > Does TAI LE, encoded in Unicode 4.0, refer to the same language as
> > TAI NÜA ?
>
> Yes.
>
> > If so, isn't TAI NÜA the most frequently used form of this language ?
>
> According to the Ethnologue it is. According to the Chinese experts
> who transcribed the name working with me, it's Tai Le. The Le by the
> way is pronounced just like the masculine definite article in French.
> Donc « Taï Le ».

I'm guessing that "Tai Le" would be the exonym (Chinese name), while
"TAI NÜA" is the autonym.

Re: missing .GIF's for ideographs on unicode.org?

2003-07-16 Thread Richard Cook

"Ostermueller, Erik" wrote:
> 
> I apologize if you all have already discussed this.
> 
> At unicode.org, when I click this link,
> 
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=2
> 
> I'm expecting to see a little square GIF that displays U+2.
> Instead, I see "N/A".
> 
> Shouldn't there be a link like this?
> http://www.unicode.org/cgi-bin/refglyph?24-2
> 
> What am I doing wrong here?
> 
Erik, I think you are correct. The link should be like so:

http://www.unicode.org/cgi-bin/refglyph?24-2

I'm guessing this just hasn't been implemented yet.

-Richard

Re: Chinese language support for Unicode

2003-07-08 Thread Richard Cook

Sourav,

You wrote:
> 
> Hi All,
> 
> Does Unicode support both Simplified as well as Traditional Chinese ?
> 
Yes, it does, though the Simplified support is rather lacking in
comparison with the Traditional, since the Traditional character set is
rather large, if not completely open-ended, and simplifications may be
applied to many traditional forms, by regular processes.

> If it supports, then could you please let me know what are the respective
> character blocks in Unicode support these two?
> 

Traditional and Simplified characters are not in distinct blocks.
Rather, you should be concerned with Traditional <=> Simplified mappings.

I can see how it might seem to make some sense to put all purely
simplified characters together; but when you consider that some
simplified characters are really traditional simplifications, and when
you consider that many traditional characters have no modern
simplifications (the same characters are used in both Simplified and
Traditional contexts), and when you consider that there are sometimes
many-to-one mappings ... well, perhaps you see where I am going with this.

-Richard
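
A tiny example of why the mapping view matters more than any block layout. The table below is a hand-picked sample of four well-known pairs, not an extract from Unihan.txt (whose kSimplifiedVariant and kTraditionalVariant fields are the place to look for real data), and the function name is my own.

```python
from collections import defaultdict

# Hand-picked sample: traditional form -> simplified form.
TRAD_TO_SIMP = {
    "\u767c": "\u53d1",  # 發 -> 发
    "\u9aee": "\u53d1",  # 髮 -> 发 (a second traditional form, same target)
    "\u99ac": "\u9a6c",  # 馬 -> 马
    "\u4eba": "\u4eba",  # 人 is used unchanged in both contexts
}

def invert(mapping):
    """Build simplified -> [traditional, ...] from traditional -> simplified."""
    inverse = defaultdict(list)
    for trad, simp in mapping.items():
        inverse[simp].append(trad)
    return dict(inverse)

simp_to_trad = invert(TRAD_TO_SIMP)
ambiguous = {s: t for s, t in simp_to_trad.items() if len(t) > 1}
print(ambiguous)  # {'发': ['發', '髮']}
```

Going from Traditional to Simplified is a table lookup; going back requires choosing among candidates, which no arrangement of code points into blocks would do for you.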

Re: Problem with Arial Unicode MS font for BOLD/ITALICS in PDF

2003-06-20 Thread Richard Cook

On Friday, June 20, 2003, at 02:44 , Kenneth Whistler wrote:

> What is true is that use of italicized text is unusual
> in Chinese or Japanese body text--certainly not with the frequency
> or same range of functions as occurs in Latin typography.
> Bold text is not that unusual, however.

In pre-computer Chinese, it would be very unusual to see italics or bold. 
The place of both is filled with point size differences, brackets/quotes 
of various styles, underlining (straight or saw-toothed, single or 
double). In later times, even with computerized font faces, it's my 
impression that italics and bold are not quite suitable for formal 
writing. Of course, in pop e-print, nearly everything that can be done 
to a character is done ... including Bold-Ital-Outline-Shadow ...

Re: Unicode not in Quark 6

2003-06-20 Thread Richard Cook

Michael Everson wrote:
> 
> I wonder what Quark would do if we all wrote to [EMAIL PROTECTED] to
> ask for Unicode support.
> 
Good idea. I just did. But, Quark is just the tip of the iceberg. I
still need a good (Mac OS X) database that can do Unicode Chinese
(including supplemental planes). Any recommendations? -Richard

Re: Ext-B fonts updated

2001-10-17 Thread Richard Cook


James Kass wrote:
> 
> Richard Cook wrote:
> 
> > >
> > > > Are there any instructions for reporting errata such as the glyphs
> > > > at U+29FD7 and U+29FCE being identical?
> > > >
> > [U+29FD7] and [U+29FCE] are not identical. They are (admittedly rather
> > close) graphical variants. If you want to ID all graphical variants,
> > you've got a long row to hoe.
> >
> 
> The row's long enough without mapping all the graphical variants.
> 
See my comments below.

> Attached are two small gifs, 29fce and 29fd7.  The glyphs used for
> these two characters on the new chart are identical, as far as I can
> tell.  Can someone point out a difference?

Aha! I was looking at a bound version of 10646-2-2000-12-05 (SC2/WG2
N2309) in which the forms are not identical, but betray the variation
which causes the codepoints to be separate. It seems that the font
vendor has done some unification here ...
> 
> > For an example of even closer graphical variants (some might even say
> > *exactly* identical forms), compare [U+20a37] and [U+200ae] ... which I
> > mentioned to Mr. Jenkins a few weeks ago. As he pointed out, they both
> > have T-source numbers, and were perhaps deunified because they're
> > separate in CNS 11643 ...
> >
> 
> The difference between the glyphs used for U+20A37 and U+200AE on
> the chart is obvious.  The two glyphs are similar but not identical.
> They are stored under different base radicals.
> 
> > [U+20a37] and [U+200ae] along with [U+28443], [U+20a31] and [U+20a5f]
> > are of course all variants of [U+8fb0].
> >
> 
> The variances are clear on the chart(s) and the glyphs look quite
> different in some cases.  If these characters are all variants of
> U+8FB0, which is a Chinese radical (#161), shouldn't they all be
> stored under that radical?
> 
As the esteemed Dr. Whistler wrote, graphical variation sometimes leads
to classificational variation ... and when variants get variously
classified, they may also wind up being variously encoded.

What we really need is a field in Unihan.txt which could be used to
unify Han graphical variants. Of course, unification is a judgement
call, some cases more open to contention than others, but I think that
on the whole such a field would be rather useful, at least as useful as
the kRSUnicode and the kRSKangXi fields.

-Richard
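
To show what such a field could look like in practice, here is a sketch that reads lines in the tab-separated Unihan.txt layout and groups code points by the form they are variants of. The field name kGraphicalVariant is hypothetical (invented here for illustration), and the sample lines simply restate the U+8FB0 variants mentioned earlier in this thread.

```python
from collections import defaultdict

# Hypothetical data: "kGraphicalVariant" is not an existing Unihan field.
SAMPLE = """\
U+20A37\tkGraphicalVariant\tU+8FB0
U+200AE\tkGraphicalVariant\tU+8FB0
U+28443\tkGraphicalVariant\tU+8FB0
U+20A31\tkGraphicalVariant\tU+8FB0
U+20A5F\tkGraphicalVariant\tU+8FB0
"""

def variant_groups(lines, field="kGraphicalVariant"):
    """Group code points by the target(s) named in the given field."""
    groups = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comment lines
        codepoint, name, value = line.split("\t", 2)
        if name == field:
            for target in value.split():  # a value may list several targets
                groups[target].add(codepoint)
    return dict(groups)

groups = variant_groups(SAMPLE.splitlines())
print(sorted(groups["U+8FB0"]))
```

Which pairs deserve an entry remains the judgement call; the data structure itself is the easy part.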

Re: Ext-B fonts updated

2001-10-17 Thread Richard Cook

> On Tuesday, October 16, 2001, at 08:00 PM, James Kass wrote:
> 
> > Are there any instructions for reporting errata such as the glyphs
> > at U+29FD7 and U+29FCE being identical?
> >
[U+29FD7] and [U+29FCE] are not identical. They are (admittedly rather
close) graphical variants. If you want to ID all graphical variants,
you've got a long row to hoe.

For an example of even closer graphical variants (some might even say
*exactly* identical forms), compare [U+20a37] and [U+200ae] ... which I
mentioned to Mr. Jenkins a few weeks ago. As he pointed out, they both
have T-source numbers, and were perhaps deunified because they're
separate in CNS 11643 ... 

[U+20a37] and [U+200ae] along with [U+28443], [U+20a31] and [U+20a5f]
are of course all variants of [U+8fb0].

-Richard

Re: Erratum in Unicode book

2001-07-09 Thread Richard Cook


Thomas Chan wrote:
> 
> On Mon, 9 Jul 2001, Richard Cook wrote:
> 
> > On a related note, I have 9000 word/char frequencies from Hanyu Pinlu
> > Cidian (a mainland text; I typed the entries in back in the early 90's,
> > and this is the freq data currently used in Wenlin). I'd be happy to
> > give the Consortium access to this data for the purpose of sorting
> > characters with identical rad/str numbers by frequency.
> 
> Wouldn't that bias sorting according to Chinese language usage
> frequencies?  e.g., \u7684, \u4f60, \u5403 are very common in Chinese, but
> rare or obscure in Japanese.  Subsorting by pronunciation would also be
> language-dependent.

I thought of the lang. dependency for freq. ... but I don't know. The
Kang Xi Radical system itself is biased toward Chinese usage, albeit a
widespread one.

But it might be interesting to get frequency lists for representative
CJKV usages, and average them for index sorting. How'd that be for
unbiased? John? Want to start collecting that data? :-)
> 
> For a language-neutral method of sorting characters with otherwise the
> same radical and # of residual strokes, how about the method used in the
> _Hanyu Da Zidian_ (and some other dictionaries) of sorting by the type of
> stroke of the first stroke, second stroke, etc., by whether it is one of
> the five basic types of strokes as exemplified in the first five Kangxi
> radicals?  This requires such data be available for all 70,000+
> characters, though...

This is a good idea too.
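
The frequency-averaging idea floated earlier in this message could be sketched as below. All counts are invented placeholders, the language tags are just labels, and a plain mean of per-language relative frequencies is only one of several possible weightings.

```python
def mean_relative_frequency(char, tables):
    """Average a character's share of each per-language frequency table."""
    shares = []
    for counts in tables.values():
        total = sum(counts.values())
        shares.append(counts.get(char, 0) / total)
    return sum(shares) / len(shares)

# Made-up counts, for illustration only.
tables = {
    "zh": {"\u7684": 900, "\u4eba": 100},
    "ja": {"\u7684": 100, "\u4eba": 300, "\u65e5": 600},
}

chars = set().union(*tables.values())
ranked = sorted(chars, key=lambda c: mean_relative_frequency(c, tables), reverse=True)
print(ranked)
```

A character that dominates one list but is rare in another ends up in the middle, which is about as unbiased as a single shared ordering can get.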

Re: What should be radicals

2001-07-09 Thread Richard Cook


"Becker, Joseph" wrote:
> 
> > Unicode is going to stick with the KangXi radical system
> 
> There Unicode goes again, flouting the will of the people ... while
> meanwhile in another thread an esteemed Unicode elder has proposed the death
> radical.  It's time to bring this system into the 21st Century: where's the
> plastics radical, the fast-food radical, the unix radical?!
> 
Joe,

It sounds like you're talking about productive innovation in the script.
Watch out! Here come another 50,000 characters!

-Richard

Re: Erratum in Unicode book

2001-07-09 Thread Richard Cook


"Michael (michka) Kaplan" wrote:
> 
> From: "John H. Jenkins" <[EMAIL PROTECTED]>
> 
> > >Has the UNIHAN.TXT file been updated to include radical-stroke data
> > >for Plane Two characters?
> 
> > Yes.  Ever since Unicode 3.1 was released.  (We still don't have an
> > Extension B font, however.)
> 
> There is one in Office XP's CHS and CHP language packs (along with an IME).
> The name of the font is "Simsun (Founder Extended)" and the parentheses in
> the names are causing no end to problems in many tools! ).
> 
> I am working on test pages that will show all the characters, using WEFT
> created .EOT files for those who are using Internet Explorer as their
> browser (I am actually helping the WEFT folks with some bugs that these
> pages exposed!).
> 
This must be the Beijing Zhong Yi Electronics font ... I heard that
Microsoft was licensing it, but didn't imagine they'd release it so soon ...

Can anyone here speak of whether Apple will be licensing it?

Funny that Microsoft has it before Unicode, no? They have deeper
pockets, and that matters?

Re: Erratum in Unicode book

2001-07-09 Thread Richard Cook

"John H. Jenkins" wrote:
> 
> At 11:29 AM -0400 7/9/01, Thomas Chan wrote:
> >On Sun, 8 Jul 2001, James Kass wrote:
> >
> >>  An ideal index for the casual or non-CJK user might be quite
> >>  different in approach.  Perhaps the first component drawn in
> >
> >For the less than proficient user, I think it would be beneficial to have
> >a means to restrict the pool of characters that they are searching
> >amongst--consider the circumstances under which they are likely to have
> >encountered the character they are looking up.  The radical-strokes index
> >in TUS3.0 cover over 27,000 characters, many times more than most
> >dictionaries and character sets, and in some places, there are just too
> >many characters falling under a particular radical+residual stroke count
> >for one to scan the page efficiently.
> > 
> I've been thinking the same thing.  Adding another 40,000+ ideographs
> isn't going to help it.  What will be best will be to prepare, again,
> multiple indices, one for just the original Unihan, one for Unihan +
> Extension A, and one for Unihan + Extension A + Extension B.
> 
> The other thing I need to do is to make the chart-generating program
> a bit more sophisticated in the order in which it puts the
> ideographs.  Right now, all the ideographs for a single
> radical-stroke count are sorted by Unicode scalar value, which means
> that the rare ideographs in Extension A come before the common
> ideographs in the original Unihan block.  Either they should be
> ordered the other way or they should be put in strict KangXi order,
> or something. The way it's done now is definitely bad, bad, bad.

John,

I could imagine that it's best not to have to search multiple separate
printed indices, if that's what you mean above.

Rather, simply sort the Ext A and then Ext B items at the end of each
rad/str. The Ext B chars generally have a very low frequency in common
usage, and Ext A a bit higher; a user seeking one of them would then
know to look toward the end of the rad/str count.

One thing: having 5-digit Ext B numbers in the index is going to throw
off your neat grid tabulation. Perhaps the numbers for Ext B can be set
in a smaller typeface?

On a related note, I have 9000 word/char frequencies from Hanyu Pinlu
Cidian (a mainland text; I typed the entries in back in the early 90's,
and this is the freq data currently used in Wenlin). I'd be happy to
give the Consortium access to this data for the purpose of sorting
characters with identical rad/str numbers by frequency.

You're most sensitive to what its failings might be, but I think you did
a very good job on the 3.0 index, very nicely done indeed. I really do
look forward to the next version.

-Richard
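
The ordering suggested above (original Unihan first, then Ext A, then Ext B within each radical/stroke group) amounts to a compound sort key. In this sketch the block ranges are the real ones, but the radical and residual-stroke values attached to the sample entries are placeholders rather than values looked up in Unihan.txt.

```python
def block_rank(cp):
    """Rank a code point by block: original Unihan, then Ext A, then Ext B."""
    if 0x4E00 <= cp <= 0x9FFF:
        return 0
    if 0x3400 <= cp <= 0x4DBF:
        return 1
    if 0x20000 <= cp <= 0x2A6DF:
        return 2
    return 3

def index_key(entry):
    """Sort by radical, residual strokes, block rank, then scalar value."""
    cp, radical, residual = entry
    return (radical, residual, block_rank(cp), cp)

# (code point, radical, residual strokes) with placeholder rad/str values.
entries = [(0x3400, 1, 4), (0x20000, 1, 4), (0x4E15, 1, 4), (0x4E00, 1, 0)]

by_scalar = [cp for cp, _, _ in sorted(entries)]
by_block = [cp for cp, _, _ in sorted(entries, key=index_key)]
print([f"U+{cp:04X}" for cp in by_scalar])
print([f"U+{cp:04X}" for cp in by_block])
```

A frequency value, where one is available, could be slotted into the key after the block rank.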

Re: Erratum in Unicode book

2001-07-08 Thread Richard Cook

James Kass wrote:
> 
> Richard Cook wrote:
> 
> > "John H. Jenkins" wrote:
> > >
> > > It is on occasion something of an art figuring out the correct
> > > radical/stroke position for a character in this kind of an index, sad
> > > to say.
> >
> > I'd say, when 2 radicals are possible, put it under both. When 3, well
> > ... you probably get the idea ...
> >
> 
> This is a swell item to add to a "wish list", but imagine the
> challenge faced by anyone wanting to set-up such a database:
> existing information is sorted by residual strokes after the
> significant radical.  When you want to add each character
> under every one of its components, these residual stroke
> counts would need to be re-counted for each 'permutation'
> of every character!

Well, not all components are Kang Xi radicals.

What you're talking about is not a Kang Xi index, but a complete
component index, and this is not quite the technical feat you imagine.

E.g., I have complete component and rad/str data that is lexicon
specific (Shuowen), and somewhat less complete general data.

The most comprehensive collection of such data is from
http://www.wenlin.com . When compiling wish lists, watch Wenlin's development.

> 
> The Han Radical Index is set up for people familiar with CJK,
> the rest of us will just have to guess (and learn something during
> each look-up process, I'd suspect.)
> 

Even for experts, there are cases in which the choice of a single
classifier is completely arbitrary, or at least apparently so to the
casual user. In these cases, putting the character under both is a good
idea.
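
Listing a character under more than one radical need not mean recounting strokes by hand: the residual count is just the total stroke count minus the strokes of whichever radical is used. A small sketch with two familiar characters (the data layout and function name are my own):

```python
from collections import defaultdict

RADICAL_STROKES = {"木": 4, "目": 5, "口": 3, "門": 8}

# character -> (total strokes, candidate radicals)
CHARACTERS = {
    "相": (9, ["目", "木"]),
    "問": (11, ["口", "門"]),
}

def build_index(characters, radical_strokes):
    """Index each character under every candidate radical."""
    index = defaultdict(list)
    for char, (total, radicals) in characters.items():
        for radical in radicals:
            residual = total - radical_strokes[radical]
            index[(radical, residual)].append(char)
    return dict(index)

index = build_index(CHARACTERS, RADICAL_STROKES)
for (radical, residual), chars in sorted(index.items()):
    print(radical, residual, "".join(chars))
```

The hard part is deciding which components count as plausible radicals for a given character, not the arithmetic.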

Re: Erratum in Unicode book

2001-07-08 Thread Richard Cook


"John H. Jenkins" wrote:
> 
> It is on occasion something of an art figuring out the correct
> radical/stroke position for a character in this kind of an index, sad
> to say.

I'd say, when 2 radicals are possible, put it under both. When 3, well
... you probably get the idea ...

Re: Shavian (was: Re: UTF-17)

2001-07-04 Thread Richard Cook


Michael Everson wrote:
> 
> At 11:10 -0700 2001-07-04, Richard Cook wrote:
> >Michael Everson wrote:
> >>
> >>  UTC approved it and there's a new document from John Jenkins and me
> >>  on Shavian for WG2, so it should get approved for ballotting at the
> >>  next meeting of WG2.
> >
> >Hi Michael,
> >
> >I'm new to the idea that anyone would care to have Shavian encoded. Will
> >you enlighten me?
> 
> Easily: just read http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2362.pdf.

> C. Technical -- Justification
> 1. Contact with the user community?
> Yes, such as it is.

funny :-)

now, I know of other phonemic alphabets for English ... e.g., I think
Ben Franklin invented one, ... and I have one of my own. Are any of
these slated for encoding too?

Re: Shavian (was: Re: UTF-17)

2001-07-04 Thread Richard Cook

Michael Everson wrote:
> 
> UTC approved it and there's a new document from John Jenkins and me
> on Shavian for WG2, so it should get approved for ballotting at the
> next meeting of WG2.

Hi Michael,

I'm new to the idea that anyone would care to have Shavian encoded. Will
you enlighten me?

Best,
Richard

Re: status of Jindai scripts?

2001-07-03 Thread Richard Cook

"John H. Jenkins" wrote:
> 
> At 8:07 PM +0200 7/3/01, Genenz wrote:
> >Should one consider the Chinese oracle bone
> >inscriptions (1200 BC) for entry to the unicode list?
> >They really did exist.
> >
> 
> As a rule, historical scripts (in which I'll include OBI, even though
> their descendant is with us today), are encoded when the following
> criteria are met:
> 
> 1)  They are sufficiently well understood that a definitive catalog
> of signs can be made, for at least part of the collection, and
> 
> 2)  Representatives of the scholarly community are involved in the
> encoding process.
> 
> The problem with OBI is that, as I understand it, the only signs
> which are sufficiently well understood that they would meet criterion
> (1) are already in Unicode in the form of their modern forms.  I
> could be wrong and am starting to research the matter myself.

John,

There is another class of signs in OBI (and other inscriptions) which
would be candidates for encoding, and those are the ones which are
composed entirely of recognized components, but which configuration of
components doesn't happen to relate to any existing modern form. 

So, for example, say we have an OBI character composed of 2 components A
and B. Both of these components are known (in "everyone's" opinion) to
relate regularly to modern characters, but either the particular
combination of A with B doesn't relate to any modern character yet in
Unicode, or else the combination occurs, but with different relative placement.

Let's see if I can find an example ... OK, here's one off the top of my
head: look at [U+668A]. If you compare this form with the OBI characters
on p. 26 of my LTBA Monograph, you'll see that although the
[U+65e5] and [U+9801] components are both in [U+668A], the arrangement
of these components is different in the known OBI variants of the
diviner name. In many of the OBI forms, the sun/star is over the
kneeling person's head, rather than to the left.

So, the questions here are, can we reliably relate the OBI variant forms
to one another? and can we relate any of the OBI variants to the modern
form in [U+668A] ?
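
One way to write down "same components, different arrangement" is with the Ideographic Description Characters (U+2FF0 for left-to-right, U+2FF1 for above-to-below). That notation is my addition here, not something proposed in this thread, and the helper names are made up; the sketch just shows that the two descriptions share components but differ in layout.

```python
LEFT_RIGHT = "\u2ff0"    # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
ABOVE_BELOW = "\u2ff1"   # IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW

def describe(layout, first, second):
    """Build a two-component description: layout operator plus components."""
    return layout + first + second

def same_components(ids_a, ids_b):
    """Compare the components of two descriptions, ignoring the layout."""
    return sorted(ids_a[1:]) == sorted(ids_b[1:])

sun, head = "\u65e5", "\u9801"
modern = describe(LEFT_RIGHT, sun, head)        # arrangement as in U+668A
obi_variant = describe(ABOVE_BELOW, sun, head)  # sun over the kneeling figure

print(same_components(modern, obi_variant))  # True
print(modern == obi_variant)                 # False
```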

It is sometimes the case, though admittedly a rather rare one, that 2
characters, e.g. in Seal forms, have the exact same components, but
different relative positioning of the components, and that this
difference is *distinctive*. That is, the two characters have different
meanings and distinct usages. I don't know if anyone has argued for
distinctiveness in the case I mention ... but they could choose to write
the character with relative component placement as in OBI ... 

creating another one of these virtual kaishu forms based on a modern
interpretation of forms in older inscriptions ...

Re: New characters query (Hexagrams)

2001-07-03 Thread Richard Cook


Michael Everson wrote:
> 
> At 13:59 -0700 2001-07-03, Edward Cherlin wrote:
> 
> >>But I thought proposals for  characters with  decompositions into existing
> >>characters are no longer being accepted.
> >
> >True for accented letters where the combining marks already exist,
> >but  I don't think we want to have two sets of trigrams, one spacing
> >and the other combining. Do we?
> 
> Gods, no.
> 
There are arguments for seeing many signs as having decompositions, for
example, most Hanzi are composite ... and I've even seen decomposition
schemes that beautifully decompose roman text into a small number of
graphical primitives ... the thing is, I think most people would agree
that graphical decomposition, for all its elegance, is not always the
way to go when encoding semantic units ... larger composite units are
more manageable to humans ...

> >>(Couldn't a ZWJ be used as a way of "joining" two trigrams as a
> >>hexagram?)
> >
> >That would put them side by side. Don't even think about suggesting
> >special case semantics.
> 
> Yuck, yuck.

Re: New characters query

2001-07-03 Thread Richard Cook


John Cowan wrote:
> 
> Rick McGowan scripsit:
> 
> > I don't think there's any point in encoding 64 hexagrams; especially when
> > we have the pieces already.  Use the pieces of three and position them with
> > a drawing program.  We don't have combining thingies for putting chess
> > pieces on board squares, either.
> 
> No.  But don't the hexagrams appear in running text with hanzi?  If so,
> then IMHO they should be encoded separately.
> 
Yes, that's right. In running text with Hanzi, and also in running text
with Kanji, and also in running text with English, and what have you ...

Re: New characters query

2001-07-03 Thread Richard Cook


Another list member mentioned (off-list) the system of 9 bigrams and 81 tetragrams.

These appear in the text of a book called [U+592a][U+7384][U+7d93]
by [U+63da][U+96c4] Yang Xiong (c. 53 BC - c. 18 AD).

Where the 64 hexagrams are based on a binary system,
the 81 tetragrams are based on a trinary system.

They're much less well-known, a relatively recent innovation, and a much
less influential imitation of .

I don't think anyone is proposing to encode these ... are you?
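
The counts above follow directly from the number of line types and line positions. A minimal sketch, assuming only that each hexagram line has two possible forms and each bigram or tetragram line has three (the integers are placeholders, not any agreed encoding):

```python
from itertools import product

# Six lines, each broken (0) or unbroken (1): a binary system.
hexagrams = list(product((0, 1), repeat=6))

# Two or four lines, each in one of three forms (0, 1, 2): a ternary system.
bigrams = list(product((0, 1, 2), repeat=2))
tetragrams = list(product((0, 1, 2), repeat=4))

print(len(hexagrams), len(bigrams), len(tetragrams))
```

That is 2^6, 3^2 and 3^4 respectively.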

Re: New characters query

2001-07-03 Thread Richard Cook

Rick McGowan wrote:
> 
> I don't think there's any point in encoding 64 hexagrams; especially when
> we have the pieces already.  Use the pieces of three and position them with
> a drawing program.  We don't have combining thingies for putting chess
> pieces on board squares, either.
> 

Hi Rick,

I was half in this camp with you to begin with, with my comments about
the IDC, but if John Jenkins says I'm in favor of this, then I guess I
should take a stab at defending it ...

Encoding the 64 hexagrams has surely come up in the past, and on the off
chance that I can put a different spin on it ... here are points 1-3 in
favor, rebuttals of points A-C against, and a quote from John Lennon and
the Plastic Ono Band:

--1: The 64 hexagrams are semantically distinct written signs associated
with specific words. Each of the 64 hexagrams has a unique name, of one
or two syllables (see my earlier post). Each name is intimately
connected with the sequence and meaning of the 6 lines.

--2: They represent a very important feature of the most important of
the Chinese Classics. This text, _Zhou Yi_ ('the Zhou Dynasty [classic
of] change'), was considered by early Chinese, and is considered by many
modern people, to be the most abstruse and subtle book in the world. In
these respects, these signs represent a primary semantic level of a book
which is at least tantamount to a religious text, if not actually one in
many people's minds (depending on the definition of religion).

--3: They are attested in use all over the world, anciently and modernly
(China, Tibet, Japan, US ...). They appear in many many printed books,
both in Asia and elsewhere. For a sample of English titles in print, go to
http://www.amazon.com and search for "I Ching" (~357 hits) or "Book of
Changes" (~89).

Now, examining some points against:

--A: They are compositionally formed from the 8 trigrams.

Rebuttal: By this reasoning, the 8 trigrams themselves ought not to have
been encoded, since the 8 trigrams can be generated from simple broken
and unbroken lines. This alone is not a reason to encode them, but it is precedent.

--B: They derive their distinct meanings from the composition of the 2
composing trigrams.

Rebuttal: It is agreed that their meanings are distinct from the
meanings of the 8 trigrams. However, many would contend that the
meanings are compositionally derived from the broken and unbroken lines.
See A above.

--C: They are primarily used in China, and a proposal to encode them
ought to come from China.

Rebuttal: See point 3 above.

---

"... I don't believe in I Ching ..."

"God", by John Lennon and the Plastic Ono Band
http://members.aol.com/pop1rock1/JohnLennon/Lyrics/lyric5.html

Re: New characters query

2001-07-02 Thread Richard Cook


"John H. Jenkins" wrote:
> 
> At 7:07 PM -0700 7/2/01, Richard Cook wrote:
> >Evidence? There's ample evidence, starting c. 1000 BC, with
> >[U+5468][U+6613] _Zhou Yi_ (aka _Yi Jing_ aka _I Ching_ aka _The Book of
> >Changes_), an artifact of the Zhou Dynasty ...
> >
> 
> I agree with Richard here.  It's silly to have the trigrams and not
> the hexagrams, although I know why it worked out that way.  Richard,
> are they used much *outside* of the Yi?  If so, I think it's
> reasonable to add them.

I think this PDF makes the traditional arrangement more explicit:

http://linguistics.berkeley.edu/~rscook/pdf/64Gua-TradOrder-dec.pdf

Re: 64 Hexagrams, was re: New characters query

2001-07-02 Thread Richard Cook

"John H. Jenkins" wrote:
> 
> At 7:07 PM -0700 7/2/01, Richard Cook wrote:
> >Evidence? There's ample evidence, starting c. 1000 BC, with
> >[U+5468][U+6613] _Zhou Yi_ (aka _Yi Jing_ aka _I Ching_ aka _The Book of
> >Changes_), an artifact of the Zhou Dynasty ...
> >
> 
> I agree with Richard here.  It's silly to have the trigrams and not
> the hexagrams, although I know why it worked out that way.  Richard,
> are they used much *outside* of the Yi?  If so, I think it's
> reasonable to add them.

Well, the thing is that the Yi itself is a major industry in publishing.
One of the largest topical bibliographies I've ever seen is a Zhou Yi
bibliography. Thousands of books in many different languages spanning
thousands of years. The system of divination is all over Asia in various
permutations. And to call them "Daoist" as I believe the original poster
did, is rather beside the point: these symbols originated in China long
before there was anything called Daoism ...

If they're going to be encoded, I believe that they ought to be encoded
in the order in which they appear in Zhou Yi, which is not a strict
binary order. A binary ordering is given in the table at the top of this page:

http://socrates.berkeley.edu/~rscook/html/Da4Xiang4.html

On that same page, the traditional ordering as handed down in Zhou Yi is
the sequence followed by the list further down.

This PDF also has the traditional order, reading from top, left to right:

http://linguistics.berkeley.edu/~rscook/pdf/64GuaTradOrder.pdf

I made TrueType fonts for these a while back, if you'd like them to
craft the proposal.

This file has the traditional ordering, with naming, pronunciation and
HYDZD references:

http://linguistics.berkeley.edu/~rscook/text/64Gua-TradNamesPY.txt

Re: New characters query

2001-07-02 Thread Richard Cook

Michael Everson wrote:
> 
> At 12:33 -0700 2001-07-02, Edward Cherlin wrote:
> >Has anyone proposed the following for inclusion in Unicode? If so,
> >what is their status?
> >
> >Daoist Hexagrams, 64 forms (the trigrams are already included, but
> >with no combining mechanism)
> 
> You're welcome to, if you have evidence for these.

Evidence? There's ample evidence, starting c. 1000 BC, with
[U+5468][U+6613] _Zhou Yi_ (aka _Yi Jing_ aka _I Ching_ aka _The Book of
Changes_), an artifact of the Zhou Dynasty ...

Here they are with the _Da Xiang_ ('The Great Symbolism') commentary:

http://socrates.berkeley.edu/~rscook/html/Da4Xiang4.html

But for combining mechanisms ... Hey! another use for an IDC: what about
[U+2FF1] ...
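
As a purely speculative sketch of that IDC idea: the 8 trigrams already encoded at U+2630..U+2637 could be paired under U+2FF1 to describe all 64 hexagrams. Using an IDC with non-ideographs is an assumption here, not anything the standard defines, and the enumeration order below is simply that of the code points, not the traditional order.

```python
import unicodedata
from itertools import product

ABOVE_TO_BELOW = "\u2ff1"  # IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW
trigrams = [chr(cp) for cp in range(0x2630, 0x2638)]  # the 8 encoded trigrams

# Assumption: describe a hexagram as IDC + upper trigram + lower trigram.
hexagram_sequences = [
    ABOVE_TO_BELOW + upper + lower
    for upper, lower in product(trigrams, repeat=2)
]

print(len(hexagram_sequences))
print([unicodedata.name(t) for t in trigrams])
```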

Re: Book review: Cang Method

2001-06-29 Thread Richard Cook

Edward Cherlin wrote:
> 
> I use Cangjie to access my character database, since it is usually much
> faster than radical and stroke count, and I usually don't know the Chinese
> pronunciation of characters I need to look up. The database gives me
> Radical number, Stroke count, Chinese, Japanese, and Korean pronunciations,
> and the numbers for the character entries in the Nelson and Mathews
> dictionaries. I would like to have a professional-quality electronic Han
> dictionary that included all of these lookup techniques and more.

You should check out http://www.wenlin.com/

A lot of people on this list know and love Wenlin and can recommend it.
It's the smartest thing out there for Chinese. The current release (demo
on the website) has very flexible look-up methods, including look-up by
component and pinyin. The current alpha version has full Ext A and B
support. Look for a new release soon.

Re: Not the Roadmap was Re: UTF-17

2001-06-23 Thread Richard Cook


Michael Everson wrote:
> 
> At 14:52 -0700 2001-06-22, Yves Arrouye wrote:
> >Isn't UTF-17 just a sarcastic comment on all of this UTF- discussion?
> 
> I think UTF-11digit would be clearly sarcastic. UTF-17, well, I don't
> know. I've been deleting the threads. Not my area.
> 
> Didj'all like the Osmanya document?
> 
> Y'all happy about Shavian being encoded?
> 
> Have you seen the really cool new "Not the Roadmap" page? (See
> http://www.egt.ie/standards/iso10646/ucs-roadmap.html)
> --
> Michael Everson

Ooh. That's a good one:

http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n1688/n1688.htm

Were you hanging around with Ed Pulleyblank?

Yikes!

Re: Why call kanji/hanji/hanja 'ideographs' when almost none are?

2001-06-02 Thread Richard Cook


Jon Babcock wrote:
> 
> The Asia/East Asian/CJK thread reminded me of one of my own pet peeves,
> the use of 'ideograph' to refer to kanji.
> 
>  Perhaps some of the professionals on this list can enlighten me here. I
> thought that an ideograph meant that the graph stood for an idea, not a
> sound or a zographic image. Since only a very small percentage of kanji
> do this ... I can think of only about ten ...  why do writers on Unicode
> lend credence to a fundamental misconception by using this term to refer
> to the whole lot?
> 
oh, and BTW, Jon, what ~10 are you thinking of? I can't think of any ...

Re: Why call kanji/hanji/hanja 'ideographs' when almost none are?

2001-06-02 Thread Richard Cook


"John H. Jenkins" wrote:
> 
> At 4:16 PM -0600 6/1/01, Jon Babcock wrote:
> >The Asia/East Asian/CJK thread reminded me of one of my own pet
> >peeves, the use of 'ideograph' to refer to kanji.
> >
> >Perhaps some of the professionals on this list can enlighten me
> >here. I thought that an ideograph meant that the graph stood for an
> >idea, not a sound or a zographic image. Since only a very small
> >percentage of kanji do this ... I can think of only about ten ...
> >why do writers on Unicode lend credence to a fundamental
> >misconception by using this term to refer to the whole lot?
> 
> We use the term "ideograph" because it's traditional, not because
> it's correct.  We can make this more explicit in the next edition of
> the book, of course.  Meanwhile, the glossary does give the
> definition, "(1) Any symbol that primarily denotes an idea (or
> meaning) in contrast to a sound (or pronunciation); (2) A common term
> used to refer to Han characters."

"common" among whom? of course, Unicode is a Standard, and if it says
it's common, fiat lux! :-)

Re: Radical of U+4E71

2001-05-30 Thread Richard Cook


Marco Cimarosti wrote:
> So it seems there is in fact an etymological "hook" radical here, although
> its modern shape has become identical to "second".

The placement of [U+4e82] under a [U+4e59] [U+90e8][U+9996] goes back at
least as far as [U+8aaa][U+6587] (121 AD).

ex presidents

2001-05-28 Thread Richard Cook


Anyone know which US president is [U+704c][U+6027][U+704c] ?

Someone told me this (admittedly silly) joke in Japanese, with

[U+85ea][U+6027][U+85ea]

Re: name this hanzi

2001-05-27 Thread Richard Cook


Thomas Chan wrote:
> 
> On Sat, 26 May 2001, Richard Cook wrote:
> 
> > Gaspar Sinai wrote:
> > > On Sat, 26 May 2001, Richard Cook wrote:
> > > > Here's a puzzle: Any idea 1.) what this character is, and 2.) if
> > > > it's in Unicode?
> > > > http://linguistics.berkeley.edu/~rscook/bishop/Picture1.gif
> > >
> > > This CJK millet in English ("kibi" in Japanese) U+9ECD
> >
> > Yes, that's right. You're the winner. :-)
> > This is a *variant* of [U+9ecd], with [U+efe2] in place of [U+6c3a].
> 
> What is U+EFE2?

oh, sorry, that's a private use code ... I guess you can tell what it is ...
> 
> > Compare e.g. [U+257e6] and also [U+22863].
> > I know from context that it is in fact a variant writing of [U+9ecd] ...
> > since it appears in DUAN Yucai's gloss at 113.410 (SWJZZ). But what I
> > don't know is why he wrote it this way ...
> > This form is not in Ext. B, nor HYDZD, nor Kangxi, as far as I can tell
> > So, did you find this exact form in any character dictionary or list?
> 
> I don't know where Gaspar found it, but you may want to use the "Jiaoyubu
> Yitizi Zidian" ("Ministry of Education Dictionary of Chinese Character
> Variants") by the Zhonghua Minkuo Jiaoyubu Guoyu Tuixing Weiyuanhui
> (Mandarin Promotion Council, Ministry of Education, Republic of China):
>   http://140.111.1.40/
> I believe it only came out earlier this year.  Site is in Chinese, of
> course.  Big5 encoding.
> 
> The particular variant you are looking for is here, indexed as a04768-009:
>   http://140.111.1.40/yitia/fra/fra04768.htm
> It gives three sources where it appears, including the _Yupian_--perhaps
> DUAN or his printer got it from there?  What's nice is that scans of
> the entries from most of their sources are available (primarily the older
> ones).

Yes, that is nice.
> 
> Despite all the attention _Kangxi Zidian_ and _Hanyu Da Zidian_ get as
> being "comprehensive", they are sometimes selective in what they
> inherit from earlier dictionaries.
> 
I often encounter characters not in HYDZD ... but a new edition of HYDZD
is in the works ... I've heard.

Re: name this hanzi

2001-05-26 Thread Richard Cook

Yes, that's right. You're the winner. :-)

This is a *variant* of [U+9ecd], with [U+efe2] in place of [U+6c3a].
Compare e.g. [U+257e6] and also [U+22863].

I know from context that it is in fact a variant writing of [U+9ecd] ...
since it appears in DUAN Yucai's gloss at 113.410 (SWJZZ). But what I
don't know is why he wrote it this way ... 

This form is not in Ext. B, nor HYDZD, nor Kangxi, as far as I can tell
... 

So, did you find this exact form in any character dictionary or list?

Gaspar Sinai wrote:
> 
> This CJK millet in English ("kibi" in Japanese)
> 
>U+9ECD
> 
> Gaspar Sinai <[EMAIL PROTECTED]>
> http://www.yudit.org/
> 
> On Sat, 26 May 2001, Richard Cook wrote:
> 
> > Here's a puzzle: Any idea 1.) what this character is, and 2.) if it's in Unicode?
> >
> > http://linguistics.berkeley.edu/~rscook/bishop/Picture1.gif
> >
> >

name this hanzi

2001-05-26 Thread Richard Cook


Here's a puzzle: Any idea 1.) what this character is, and 2.) if it's in Unicode?

http://linguistics.berkeley.edu/~rscook/bishop/Picture1.gif

[Fwd: www.perl.com - Larry Wall Apocalypse Two]

2001-05-03 Thread Richard Cook


at http://www.perl.com/pub/2001/05/03/wall.html Larry Wall writes:
>
> Perl 6 programs are notionally written in Unicode, and assume 
> Unicode semantics by default even when they happen to be 
> processing other character sets behind the scenes. Note that 
> when we say that Perl is written in Unicode, we're speaking of 
> an abstract character set, not any particular encoding. (The 
> typical program will likely be written in UTF-8 in the West, 
> and in some 16-bit character set in the East.)
>
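
The distinction drawn above between an abstract character set and a particular encoding can be illustrated with a short Python sketch (Python rather than Perl 6 is an arbitrary choice, and the sample character is just an arbitrary Han character):

```python
text = "\u9ecd"  # one abstract character

utf8 = text.encode("utf-8")        # typical 8-bit-oriented encoding
utf16 = text.encode("utf-16-be")   # a 16-bit encoding form, big-endian

# Different byte sequences on disk ...
print(utf8.hex(), utf16.hex())

# ... but the same abstract character once decoded.
print(utf8.decode("utf-8") == utf16.decode("utf-16-be") == text)
```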

[unicode] Re: What is Unicode?

2001-03-23 Thread Richard Cook


Another web page, for your collective amusement:

http://linguistics.berkeley.edu/~rscook/html/Unicode-tetralog.html

[unicode] Re: Spam being sent to the list?

2001-03-22 Thread Richard Cook



I thought Sarasvati was immune to this. Parvati?

Re: Unicode complaints

2001-03-15 Thread Richard Cook


Kenneth Whistler wrote:
> 
> > In hunting around for negative opinions about Unicode, I've found that
> > the majority of complaints relate to CJK character sets. Would listers
> > agree that this is the largest area of unrest? Or is it just that people
> > involved with CJK are vocal?
> 
> Maybe it's just that since Han ideographs now constitute slightly more
> than 75% of the standard by count (and probably 90% of the standard
> by weight), they have more to complain about.
> 
> --Ken ;-)

Suzanne, what exactly are the complaints you heard?

Re: Unicode market acceptance

2001-03-09 Thread Richard Cook


Tex Texin wrote:
> 
> not the same as work for execs. The success of Unicode is obvious
> to us (techies) is not clear to them.

Tex,

Recently looking at and talking about this

http://i18n.homepage.com/UnicodeBenefits.html

with some people, initiated and uninitiated, I quickly wrote this:

http://linguistics.berkeley.edu/~rscook/UnicodeGives.html

trying to make absolutely clear, in non-technical language what some of
the benefits are.

Comments? 4-letter words?

--Richard

Oh Unicode

2001-03-01 Thread Richard Cook


Is the Unicode anthem from the CD on a server somewhere, hopefully in mp3?

Re: CJKV ideographic, - Cantonese

2001-02-27 Thread Richard Cook

"akerbeltz.alba" wrote:
> 
> Not kvite, the Cantonese for Kanji is "Hòn Jih" - although the term is
> rather uncommon, sounding rather outlandish to Cantonese ears. "Jung Màhn
> Jih" [U+4E2D] [U+6587] [U+5B57] is a lot more common.
>
> Cantonese, highly conservative in its sound system generally, has been more
> innovative than Mandarin in one respect, that is the loss of initial k in
> certain words cf. Mand. kè 'guest' Cant. hahk, Mand. kou 'mouth' Cant. háu
> etc.

Quite right. But there was never a velar stop in this word in any
dialect of Chinese. In reconstructions of Middle Chinese the initial is
actually more like [x] (velar fricative) or by some even further back
[X] (uvular fricative). I think the Japanese phonologization "k-" of
"kanji" probably reflects a Medieval Chinese velar fricative ... not a
velar stop.

Modern Beijing and Xiang dialects (northern) and some southern dialects
still bear traces of this ... not [h] (glottal fricative) here ... but
something more front.

> And for the core English vocab you'd have to add "Cantonese ideographs",
> because there's quite a bunch of ideographs that have been encoded which are
> 'strictly Cantonese' such as 5497 [past tense marker], 54CB [plural pronoun
> marker] etc

Quite right again. There's no end to local variation in the script ...
and so no end of names for local varieties. In fact, one's typology
should clearly distinguish such things ...

I was wondering later what the Mandarin term for local dialect
characters is ... I think the term [U+767d][U+5b57] is sometimes used ...

Re: CJKV ideographic, was Re: Perception that Unicode is 16-bit

2001-02-27 Thread Richard Cook


Jungshik Shin wrote:
> 
> On Tue, 27 Feb 2001, Thomas Chan wrote:
> 
> > On Tue, 27 Feb 2001, Richard Cook wrote:
> >
> > > * 'chunom' in Vietnamese [similar to (i.e., analogical) Chinese characters].
> >
> > If one is going to talk about Vietnamese chu+~ no^m '"southern"
> > characters', then one might as well mention the Japanese kokuji 'national
> > characters' and Korean gugja 'national characters' as well, which are
> > their equivalents of "homemade" characters that do not exist in
> > Chinese.[1]
> 
>As for 'gugja' in Korean, its meaning is ambiguous (it could mean
> Hangul as well as home-grown Hanjas in Korea) and most people in Korea
> would NOT recognize the word at all.  When I was asked about it by Ken
> Lunde (the author of CJKV information processing), I had to ask around
> (my Korean dictionary does NOT explain the word as such although some -
> not all - dictionaries do ) and virtually everyone told me they had never
> heard of the word as being used to mean Korean-made Hanja.  We just refer
> to Korean-made Hanja  as 'Han-kuk-shik Hanja' (or something like that).
> 
I just looked in 2 Korean dictionaries, and didn't see gugja either
...maybe I need a bigger dictionary.

Re: Kana, was re: CJKV ideographic, was Re: Perception that Unicode is

2001-02-27 Thread Richard Cook


All this talk of ideographs made me think that people on-list might
enjoy this (Acrobat 4) PDF:

http://linguistics.berkeley.edu/~rscook/pdf/HanKana.fp3.pdf

It illustrates the evolution of the Hiragana and Katakana from Chinese
characters, with kaishu (simplified and traditional) and small seal
forms. There are some minor errors in it ... which I'll fix, if anyone
can point them out :-)

Re: CJKV ideographic, was Re: Perception that Unicode is 16-bit

2001-02-27 Thread Richard Cook

Thomas Chan wrote:
> 
> There is also a similar phenomenon in Chinese, called fangyanzi '"dialect"
> character', which may be considered analogous to the above, the most well
> known being the Cantonese ones, although others (Wu, Hakka, etc) do exist.
> 
> [1] There is a small chance that they might exist in Chinese, or even in
> other languages, depending on the criteria for being a "national
> character".

Yes, [U+65b9][U+8a00][U+5b57] 'dialect character' is also
[U+767d][U+5b57] though I think the latter may have pejorative
connotations ...

Any given dialect is likely to show local variation in the script ...
another gazillion characters for Unihan!

Re: CJKV ideographic, was Re: Perception that Unicode is 16-bit

2001-02-27 Thread Richard Cook


Thomas Chan wrote:
> 
> But is a romanized version of U+6F22 U+5B57 based on the Cantonese
> pronunciation ever used in English writing the way  (based on
> Mandarin pronunciation) is?

it could be ... it might even be used as a special term to distinguish
"Cantonese Ideographs" ...
> 
> For those familiar with "ASCII IPA", it's /hOn33 tSi22/.  ("O" denotes
> U+0254 LATIN SMALL LETTER OPEN O; "S" denotes U+0283 LATIN SMALL LETTER
> ESH.)[1]  Yale romanization would write it , a modified Yale would
> write it , etc.
> 

I think that modern uses of romanized Cantonese are few and far between
...
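
For anyone who would rather see real IPA than the "ASCII IPA" quoted above, a small sketch follows. It assumes that the capital O and S in /hOn33 tSi22/ are the two letters standing for U+0254 and U+0283; the table name `ASCII_TO_IPA` is made up for illustration.

```python
# Assumed mapping from the quoted "ASCII IPA" letters to IPA code points.
ASCII_TO_IPA = str.maketrans({
    "O": "\u0254",  # LATIN SMALL LETTER OPEN O
    "S": "\u0283",  # LATIN SMALL LETTER ESH
})

ascii_ipa = "hOn33 tSi22"
ipa = ascii_ipa.translate(ASCII_TO_IPA)
print(ipa)
```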

Re: CJKV ideographic, was Re: Perception that Unicode is 16-bit (was:

2001-02-27 Thread Richard Cook

Kenneth Whistler wrote:
> 
> Doug Ewell asked, on this hopelessly wandering thread:
> 
> > (Is
> > there an English-language term for the subset of the CJK ideographic script
> > that is used by a given language, say, Japanese?)
> 
> Well, since "kanji" by now has been borrowed into English, at least among
> a rather large class of specialists who are at least somewhat knowledgeable about
> Japanese, I would say that the relevant English-language phrase to cover
> this is "the Japanese kanji". I know, not a good, core English word like
> "alphabet" or "syllabary" or "abjad", is it. But wait. Hmmm. alpha, beta, gamma...
> syllaba, syllabae, syllabarum ... syllabé, syllabídzo ...
> 
> *wanders off muttering to himself*
> 

And not only "kanji". These terms are all used by specialists:

* 'Hanzi' in Beijing Chinese (with reference to "American English", "ha"
as in 'hard'; "zi" pronounced like "tsz" where "z" here represents a
vowel sound similar to English "z" with the tongue tip lowered slightly,
near also to English "r");
* 'Kanji' in Cantonese Chinese  (kahn jee; "k" as in 'can', "a" as in
'father', "jee" as in 'jeep');
* 'Kanji' in Japanese (pronunciation similar to that in Cantonese);
* 'Hanja' in Korean (Han as in Beijing Chinese, "ja" as in English "jar");
* 'chuhan' in Vietnamese [real Chinese chars];
* 'chunom' in Vietnamese [similar to (i.e., analogical) Chinese characters].

But for core English vocabulary, I don't think "Chinese ideographs",
"Japanese ideographs", "Korean ideographs", or "Vietnamese ideographs"
would be objectionable terms to anyone ... that is, to anyone who
doesn't find the term "ideograph" objectionable.

Re: Question on Unicode data files

2001-02-26 Thread Richard Cook


"John H. Jenkins" wrote:
> 
> At 7:57 AM -0800 2/26/01, Richard Zhang wrote:
> >Hello, Marco,
> >
> >Unihan is the official site I think. You can visit www.unihan.com.cn for
> >more information about this, if you know Chinese :).

Knowing Chinese is not enough. You and your browser need to know
Simplified Chinese (GBK?) ... arguably not Chinese at all ...
> >
> >If you sign up for cooperation with them, you will get full access to their
> >database.

what does "cooperation" mean?
> >
> 
> No, Unihan is *NOT* the official site.  They are not in any way
> associated with Unicode.  The official Unihan database is available
> only from unicode.org.
> 

Is there any connection between this http://www.unihan.com.cn/ site and
IRG? What is UniHan Digital Tech Co.? Their website has some rather
annoying graphics and windows, but no basic info that I can see ... the
bottom buttons don't work at all, no?

Re: bijective (was re: An Absurdly Brief Introduction to Unicode (was

2001-02-24 Thread Richard Cook


Tom Lord wrote:
> 
>> I think I'd like bijective too, if I knew what it meant. Someone?
> 
> It would be a lot more fun to answer this question in plain-text
> Unicode (using math notation) than in ASCII.
> 
> Informally:
> 
> "Bijective" describes a mapping between two sets.  Every element of
> the source set ("the domain") is mapped to a unique element of the
> destination set ("the codomain") AND there are no left over elements
> in the codomain.  A one to one mapping.  An invertible mapping.
> 
> "Injective" describes a mapping where every element of the domain maps
> to a unique element, but there may be left over elements in the
> codomain.
> 
> So, if you have a legacy character set for which Unicode provides a
> loss-less transcoding, then there is an injective mapping from that
> legacy set to Unicode, and, equivalently, a bijective mapping from the
> legacy set to a certain subset of Unicode.
> 
> In the revised version of the absurdly short introduction, I have
> avoided the term "bijective".
> 
> This is now way off topic, so please think twice about following up.
> 
> Thomas Lord

Um, this terminology isn't exactly off-topic, I think. But I see that
neither "Bijective" nor "Injective" is in the Unicode 3.0 glossary.

Whence does this terminology derive? Set or Mapping theory? Anyone
recommend a definitive text? I imagine there are more such terms ...
e.g., what is it called if there are elements left over in the domain
(but not in the codomain)? "Ejective"? I'm feeling "Dejective" for not
knowing these terms already ...
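
The two definitions quoted above can be checked mechanically. The sketch below is only an illustration: the function names are made up, Python's latin-1 codec stands in for "a legacy character set", and the set of BMP code points stands in for "Unicode".

```python
def is_injective(mapping):
    """True if no two domain elements share an image."""
    images = list(mapping.values())
    return len(images) == len(set(images))

def is_bijective(mapping, codomain):
    """True if injective and no element of the codomain is left over."""
    return is_injective(mapping) and set(mapping.values()) == set(codomain)

# Legacy set: all 256 single bytes, transcoded with the latin-1 codec.
legacy_to_unicode = {b: bytes([b]).decode("latin-1") for b in range(256)}

bmp = {chr(cp) for cp in range(0x10000)}          # stand-in for "Unicode"
latin1_subset = {chr(cp) for cp in range(0x100)}  # "a certain subset"

print(is_injective(legacy_to_unicode))
print(is_bijective(legacy_to_unicode, bmp))
print(is_bijective(legacy_to_unicode, latin1_subset))
```

The "nothing left over in the codomain" half of the bijective test is what is usually called surjective.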

Re: bijective (was re: An Absurdly Brief Introduction to Unicode (was Re:

2001-02-23 Thread Richard Cook


Mark Davis wrote:

> > that must be made about what counts as an abstract character and what
> > does not; and the generally acknowledged desirability of supporting
> > bijective mappings between a variety of older character sets and
> 
> while I like bijective, it is not a commonly understood term.

I think I'd like bijective too, if I knew what it meant. Someone?

Re: Benefits of Unicode

2001-02-23 Thread Richard Cook


Sorry, I tuned out for a moment: is there a URL for the final version of
Tex's tabulation of benefits?

Also, I'd appreciate any similar links that might be used in a page of
info for the uninitiated.

Best,
Richard

Re: Radical Index online? (was Re: Chemistry on chinesse. (CJK))

2001-01-29 Thread Richard Cook


Just a correction. Someone previously asked about 

http://www.wenlin.com/

and its support for Vertical Ext. A. It turns out that this support has
not yet made it into the public release ...

Best,
Richard

Re: Benefits of Unicode

2001-01-27 Thread Richard Cook


Has anybody played devil's advocate to this, with a list of "Failings of
Unicode"? Are there any? :-) This question might in fact result in a
longer Benefits list 

> Tex Texin wrote:
> 
> I was asked to produce a list of the benefits of Unicode to be used
> as a sidebar with an article referencing Unicode.
> 
> Ideally, it would be a brief set of bullets for an audience that
> doesn't know a lot about internationalization. The bullets shouldn't
> be too detailed or technical.
> 
> I came up with the attached table. I think with some minor amendments
> I can drop the left column now and just use the benefits with
> examples.
> 
> I have until Monday morning to turn it in, so I thought I'd ask
> for some review.
> 
> Anything missing? Any constructive suggestions, gratefully
> appreciated.
> 
> tex
> 
> ---
> 
> 
> 
>                         Benefits of Unicode
>
> Unicode Properties    Benefits                    Example
> --------------------  --------------------------  --------------------------
> All the characters    Multi-lingual documents:    Invoice or ticketing
> of all the languages  use any or all the          applications can print
> you might ever need   languages you want          native language names
>
> Defines one set of    Reduced development and     Sales to multiple
> algorithms for        support costs and reduced   countries the day of
> processing text       time-to-market, with one    initial release
>                       version of source code
>                       that works world-wide
>
> An ISO standard       Standards insure            Any applications reading
>                       interoperability            the same text file will
>                                                   interpret it correctly
>
> Accepted globally     Worldwide deployment        Text sent from any part
>                       capability                  of the world to any
>                                                   other part
>
> Supported by most,    Ease of integration         Applications can exchange
> if not all modern                                 text without conversion
> technologies                                      loss or errors
>
> Web standards are     Internet-readiness          XML, the format for
> based on it                                       structured documents and
>                                                   data on the Web is
>                                                   Unicode-based
>
> Undergoes continuous  Evolution extends           Unicode Version 3.0 added
> development           application lifetime and    25,000+ characters and
>                       expands capabilities to     new technical
>                       meet future needs           specifications that
>                                                   improved, for example,
>                                                   Middle Eastern language
>                                                   support.

Re: Radical Index online? (was Re: Chemistry on chinesse. (CJK))

2001-01-26 Thread Richard Cook


Kenneth Whistler wrote:
> 
> > > > I cannot check now if these characters are included in Unicode as I don't
> > > > have TUS handy in this moment.
> > >
> > > http://www.unicode.org/unicode/uni2book/u2.html  (The Online Edition)
> > >
> > > and
> > >
> > > http://www.unicode.org/charts/draftunicode31/  (for CJK Extension B, etc.)
> 
> As noted, for now, go through PDUTR #27 to get to the CJK Extension B
> charts. This will be updated once Unicode 3.1 is officially released.
> 
> >
> > I could not find the radical index. Has this been put online too?
> 
> No. The CJK radical index was generated and printed with custom
> software from the Unihan database. It was too much effort to try
> to convert that software to produce a postable .pdf file, so the
> radical index is omitted from the online version.
> 

Ken, What software was used to produce that index? I'm interested in any
and all details.

Best,
Richard

Re: Radical Index online? (was Re: Chemistry on chinesse. (CJK))

2001-01-25 Thread Richard Cook


John Jenkins wrote:
> 
> On Thursday, January 25, 2001, at 03:14 AM, Pierpaolo BERNARDI wrote:
> 
> > I was talking about the index for the hanzi's ordered by radical+strokes
> > which can be found at the end of the book, since I wanted to check
> > whether
> > high numbered elements were there. I know the look and pronunciations of
> > these characters, but don't know any code (in whatsoever charset).
> >
> 
> Don't forget the online Unihan database,
> .  Even if its radical-stroke
> lookup should prove inadequate, it lets you get at the blocks which are
> (very nearly) in radical-stroke order in any event within each of the
> two blocks.

Also, don't forget . It has the 3.0 hanzi, and
various indices and transformations. If you haven't looked at the
program, check out the demo, by all means.

Re: Radical Index online? (was Re: Chemistry on chinesse. (CJK))

2001-01-24 Thread Richard Cook


> > Kenneth Whistler wrote:
> > > >
> > > > I could not find the radical index. Has this been put online too?
> > >
> > > No. The CJK radical index was generated and printed with custom
> > > software from the Unihan database. It was too much effort to try
> > > to convert that software to produce a postable .pdf file, so the
> > > radical index is omitted from the online version.
> > >
> >
Um, I think I misunderstood. What Radical index are you talking about?
The one for Ext. B?

Re: Radical Index online? (was Re: Chemistry on chinesse. (CJK))

2001-01-24 Thread Richard Cook


Richard Cook wrote:
> 
> Kenneth Whistler wrote:
> > >
> > > I could not find the radical index. Has this been put online too?
> >
> > No. The CJK radical index was generated and printed with custom
> > software from the Unihan database. It was too much effort to try
> > to convert that software to produce a postable .pdf file, so the
> > radical index is omitted from the online version.
> >
> 
> Here's one I just generated:
> 
> http://linguistics.berkeley.edu/~rscook/pdf/214KangxiRadicals-Unicode.pdf
> 
> using
> 
> http://www.wenlin.com/
> 
> If you need a higher-res version, please let me know.
> 
Also, let me know if you need one for the range from [U+2f00] to [U+2fd5].
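For anyone who wants to list that range themselves, here is a minimal sketch in Python. It uses only the standard unicodedata module; it is not how the PDF above was generated, and the function name is just an illustration.

```python
import unicodedata

# Range quoted in the message above: U+2F00 through U+2FD5.
FIRST, LAST = 0x2F00, 0x2FD5

def kangxi_radicals():
    """Return (code point, character, Unicode name) for each code point in the range."""
    return [(cp, chr(cp), unicodedata.name(chr(cp)))
            for cp in range(FIRST, LAST + 1)]

radicals = kangxi_radicals()
# Print the first few entries as a quick check.
for cp, ch, name in radicals[:3]:
    print(f"U+{cp:04X} {ch} {name}")
print(len(radicals), "code points in range")
```

The range holds one code point per Kangxi radical, so the count should come out to 214.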

Re: Radical Index online? (was Re: Chemistry on chinesse. (CJK))

2001-01-24 Thread Richard Cook


Kenneth Whistler wrote:
> >
> > I could not find the radical index. Has this been put online too?
> 
> No. The CJK radical index was generated and printed with custom
> software from the Unihan database. It was too much effort to try
> to convert that software to produce a postable .pdf file, so the
> radical index is omitted from the online version.
> 

Here's one I just generated:

http://linguistics.berkeley.edu/~rscook/pdf/214KangxiRadicals-Unicode.pdf

using

http://www.wenlin.com/

If you need a higher-res version, please let me know.

Best,
Richard

Re: Transcriptions of "Unicode"

2001-01-16 Thread Richard Cook


Michael Everson wrote:
> 
> >The stress is definitely on the first syllable. One does hear some normal
> >generative English variations such as ˈjunəˌkoːd (schwa instead of
> >short-i),
> 
> The pronunciation ['juni:ko:d] with [i:] or [i] instead of schwa irritates
> me a lot. No one would pronounce "universe" with an [i].
> 
I agree. I just don't understand why people would want to say
['juni:ko:d]. These same people seem to want to pronounce GIF and
Gigabyte as though they were spelt JIFF and Jigabyte :-)

-R

Re: Representation of aspiration (was: Re: Transcriptions of "Unicode")

2001-01-12 Thread Richard Cook


Kenneth Whistler wrote:
> 
> Richard Cook surmised:
> 
> > BTW, in a very close transcription, if one is using superscription
> > (position above baseline) and relative size reduction to indicate
> > aspiration, I suppose that degree of superscription or the size or both
> > could be modulated to indicate degree of aspiration?
> 
> Nah, if you tried to go down that path, you'd just end up with
> unrepresentable transcriptions and unreliable reproduction. I doubt
> that there are many transcribers who could reliably record more than
> three degrees of aspiration, anyway (roughly: slight aspiration,
> "normal" aspiration, and superaspiration).

Ken, I was only kidding ... mostly. I should have put a smiley in there
:-) But I was also thinking of the superscription question, which I
think Peter C. might like to discuss.
> 
> Once you go past that level, which could be reliably indicated with
> appropriate use of diacritics, you are really into the realm of
> instrumental phonetics. I'd just hook up the machine and let it
> give you precise timings of voice delays post consonantal release
> in milliseconds.
> 
> >
> > Or perhaps just mark-up the unsuperscripted aspiration indicator, to
> > note degree of aspiration ... however you would like to measure that.
> 
> No need to "mark it up". Just add another diacritic. That's how
> most transcribers would work, in practice.
> 
Well, I was thinking of linking the transcription to the machine data
... so that the relation would be set on a compound key (aspiration
diacritic & measurement reference) ...
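To make the compound-key idea concrete, here is a rough sketch in Python. Everything in it is hypothetical: the record IDs, the millisecond values, and the names AspirationKey and lookup are placeholders for illustration, not real measurements or an existing tool.

```python
from typing import NamedTuple

class AspirationKey(NamedTuple):
    """Compound key: the diacritic used in the transcription plus a
    reference to the instrumental measurement it is linked to."""
    diacritic: str        # e.g. U+02B0 MODIFIER LETTER SMALL H
    measurement_ref: str  # hypothetical ID of a recording or measurement

# Placeholder data only: voice onset delay in milliseconds per key.
delay_ms = {
    AspirationKey("\u02b0", "rec-001"): 45.0,
    AspirationKey("\u02b0", "rec-002"): 80.0,
}

def lookup(diacritic: str, measurement_ref: str) -> float:
    """Fetch the machine measurement linked to one transcribed diacritic."""
    return delay_ms[AspirationKey(diacritic, measurement_ref)]

print(lookup("\u02b0", "rec-001"))
```

The transcription itself stays plain text with an ordinary diacritic; the degree of aspiration lives in the linked data rather than in the size or height of the superscript.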

Re: Transcriptions of "Unicode"

2001-01-12 Thread Richard Cook

[EMAIL PROTECTED] wrote:
> 
> On 01/12/2001 10:33:48 AM Marco Cimarosti wrote:
> 
> >Is that k aspirated?
> 
> It is for any English speakers I've ever met.
> 
I think the question about aspiration of this k relates to the fact that
it is at the onset of a word-medial unstressed syllable. So, although it
is aspirated, the aspiration might not be as pronounced as it would be
in a stressed and/or word-initial environment.

BTW, in a very close transcription, if one is using superscription
(position above baseline) and relative size reduction to indicate
aspiration, I suppose that degree of superscription or the size or both
could be modulated to indicate degree of aspiration?

Or perhaps just mark-up the unsuperscripted aspiration indicator, to
note degree of aspiration ... however you would like to measure that.

I guess you see what I'm hinting at ...

Re: Transcriptions of "Unicode"

2001-01-12 Thread Richard Cook


Thomas Chan wrote:
> 
> On Thu, 11 Jan 2001, Richard Cook wrote:
> 
> > I see 2 Traditional Chinese translations here:
> > > http://www.macchiato.com/unicode/Unicode_transcriptions.html
> > Which one do people like?
> >
> > http://my.ispchannel.com/~markdavis//unicode/Unicode_transcription_images/U_Chinese2.gif
> > http://my.ispchannel.com/~markdavis//unicode/Unicode_transcription_images/U_Chinese3.gif
> 
> It seems the former ("tongyi ma") rather than the latter ("biaozhun wanguo
> ma").
> 
> Some searches...
> 
> "tongyi ma" (U_Chinese2.gif):
> 
> Altavista: 66 matches
> Yahoo (Chinese/Hong Kong/Taiwan): 78 matches
> Microsoft Taiwan: 100 matches
> 
> ("Yahoo Chinese" != "Yahoo China".  I couldn't get through to
> Microsoft Hong Kong's search page.)
> 
> Also IUC10 page (http://www.unicode.org/iuc/iuc10/languages.html)
> and Java glossary (http://java.sun.com/docs/glossaries/glossary.print.html)
> agree.

Others have suggested to me that the full form for Unicode Standard
should be

[U+7d71][U+4e00][U+78bc][U+6a19][U+6e96]

tongyi ma biaozhun, or

[U+7d71][U+4e00][U+78bc][U+898f][U+7bc4]
[U+7d71][U+4e00][U+78bc][U+8ecc][U+7bc4]

tongyi ma guifan.

Which of these do people prefer?

> 
> "biaozhun wanguo ma" (U_Chinese3.gif):
> 
> Altavista: 7 matches
> Yahoo (Chinese/Hong Kong/Taiwan): 1 match
> Microsoft Taiwan: 78 matches
> 
> I do wonder, however, if "biaozhun wanguo ..." was meant as a translation
> of "ISO ...".
> 
Yes, that's a very good point.

Re: Transcriptions of "Unicode"

2001-01-11 Thread Richard Cook

Jon Babcock wrote:
> 
> At first glance, I agreed. But then if the U_Chinese3.gif, gets
> shortened to the last three characters, wanguo ma, as I suspect it
> would in practice, I'd favor it slightly over the three-character
> tongyi ma of U_Chinese2.gif. FWIW. To me, wanguo ma emphasizes the
> multilingual aspect, whereas tongyi ma emphasizes the unifying aspect,
> but it isn't fully apparent, from the name (tongyi ma) alone, what is
> being unified.
> 

Well, I'd say a problem with wanguo ma [lit. 'standard myriad-country
code'] is that it would be a better translation of Globalcode, rather
than of Unicode. All in favor of changing the standard name, say aye?

And is it apparent from the name "Unicode" alone that "Uni-" stands for
"Unified" and not, um, "Unicorn"? :-)

tongyi ma seems much more natural, less clunky to me ... but some people
prefer what I think is clunky, so I'm willing to admit that my opinion
of clunkiness may be completely subjective.

Here's the Unicode, courtesy of http://www.wenlin.com/ :

[U+6a19][U+6e96][U+842c][U+570b][U+78bc] biao1zhun3 wan4guo2 ma3
[U+7d71][U+4e00][U+78bc] tong3yi1 ma3

UTF8:

標準萬國碼 biao1zhun3 wan4guo2 ma3
統一碼 tong3yi1 ma3
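
For anyone who wants to reproduce the UTF-8 form from the code points listed above, here is a minimal Python sketch. It is only an illustration using built-in string functions, not how Wenlin produced the output, and the helper name is made up.

```python
# Code points as listed above for the two candidate translations.
BIAOZHUN_WANGUO_MA = [0x6A19, 0x6E96, 0x842C, 0x570B, 0x78BC]
TONGYI_MA = [0x7D71, 0x4E00, 0x78BC]

def from_code_points(code_points):
    """Build a string from a list of Unicode code points."""
    return "".join(chr(cp) for cp in code_points)

for cps in (BIAOZHUN_WANGUO_MA, TONGYI_MA):
    text = from_code_points(cps)
    utf8 = text.encode("utf-8")
    # Show the characters, then the UTF-8 bytes in hex.
    print(text, utf8.hex(" "))
```

Each of these characters takes three bytes in UTF-8, so a mail client that guesses the wrong encoding will show three stray characters per hanzi.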

Re: Transcriptions of "Unicode"

2001-01-11 Thread Richard Cook


John Jenkins wrote:
> 
> On Thursday, January 11, 2001, at 10:25 AM, Richard Cook wrote:
> 
>> Which one do people like?
>>
>> http://my.ispchannel.com/~markdavis//unicode/Unicode_transcription_images/U_Chinese2.gif
> 
> Is much better.  "Unified Code"
> 
This was my opinion too. I like "tongyima". And so far I haven't heard
from anyone advocating

>> http://my.ispchannel.com/~markdavis//unicode/Unicode_transcription_images/U_Chinese3.gif
>
> Stinks.  "Standard International Code"
>
Although "stinks" might be a little harsh. I'd opt for "opaque" :-)
Anyone else?

Re: Transcriptions of "Unicode"

2001-01-11 Thread Richard Cook


I see 2 Traditional Chinese translations here:

> http://www.macchiato.com/unicode/Unicode_transcriptions.html

Which one do people like?

http://my.ispchannel.com/~markdavis//unicode/Unicode_transcription_images/U_Chinese2.gif
http://my.ispchannel.com/~markdavis//unicode/Unicode_transcription_images/U_Chinese3.gif

Linguistic Transcription list

2000-12-21 Thread Richard Cook


Greetings Unicoders,

As people on the Unicode list obviously have considerable experience in
such matters, we would like to invite your recommendations and
suggestions on the following.

A new Linguistic Transcription mailing list has been set up, preliminary
description and subscription info as follows:

> The Transcription mailing list exists for the purpose of the exchange
> of information relating to the assemblage of a comprehensive inventory
> and typological classification of symbols and variants used to represent
> aspects of human speech in the modern phonetic and phonologic traditions.
> 
> The result of this collaboration will be joint submission of a formal
> proposal to the Unicode Consortium for augmentation of the Unicode
> Standard's current treatment of linguistic transcription symbols.
> 
> Linguists with specific knowledge of transcription symbols which ought
> to be addressed in such a proposal are encouraged to join and contribute
> to this list.

Linguists at various institutions, in various areas of specialization
have already joined or agreed to join. 

The initial list agenda is to determine a number of procedural issues,
including project participants, sponsoring institutions, scope and
timeline.

Subscription instructions follow:
---
To join the transcription mailing list, send mail to 

<[EMAIL PROTECTED]>

with the following command in the body of your email message:

subscribe transcription email@address

Replacing "email@address" with your actual email address.

If you ever need to get in contact with the owner of the list,
(if you have trouble unsubscribing, or have questions about the
list itself) send email to <[EMAIL PROTECTED]> .
This is the general rule for most mailing lists when you need
to contact a human.



Richard S. COOK
STEDT Project, Linguistics Department
University of California, Berkeley
mailto:[EMAIL PROTECTED]
http://stedt.berkeley.edu/

Re: Mongolian and Uighur (was Re: I have a drem one day...)

2000-12-19 Thread Richard Cook

Kenneth Whistler wrote:
> 
> Thus the Uighur script is the direct ancestor of the Mongolian
> script, and is also a term used for the modern Mongolian script
> itself, to distinguish it from Mongolian written in one of the other
> scripts (including Latin and Tibetan).

And the Uighur script has itself apparently also been adapted to other
languages. It was employed also by Anatolian Turks, according to
Kornfilt (Comrie, p. 621):

"From the very beginning of its Anatolian period, Turkish was written in
the Arabic script, until the Latin script was adopted in the course of
the so-called 'writing reform' of 1928 (put into force in 1929), one of
the various reforms introduced after the founding of the Turkish
Republic with the aim of westernizing the country. However, the Uighur
script was also employed by the Anatolian Turks up to the 15th century,
which might explain some of the features of the Arabic script as used by
the Turks of that period and which differ from standard Arabic usage,
e.g. vowels are written out in Turkish words. This point, incidentally,
has often been brought up to motivate the so-called writing reform,
arguing that the multiple ambiguities that arise in Turkish within a
non-vocalized orthography made the Arabic system highly inadequate for Turkish."

-Richard

Re: Information about curly-tailed phonetic letters

2000-12-17 Thread Richard Cook

"J%ORG KNAPPEN" wrote:
> 
> The curly-tail consonants t, d, n, l, c, z are also included in the
> TeX IPA (tipa fonts). The documentation of those fonts is available
> on
> 
> ftp://ftp.dante.de/texarchive/fonts/tipa/tipaman.ps.gz
> 
> --J"org Knappen

Hi J"org,
It looks as if you sent the wrong url. The right path is, I believe:

ftp://ftp.dante.de/tex-archive/fonts/tipa/

And as for the consonant symbols, why stop with t, d, n, l, c, z? Why
not include the rest of the curly-tail and other symbols in the
following chart:

http://stedt.berkeley.edu/pdf/curly-tail-table3.pdf

There are a few other bits of data you might glean also, including usage
of the apical vowel symbols.

-Richard

Chinese Support

2000-12-13 Thread Richard Cook


I've been meaning to mention this program on-list. Tom Bishop's Wenlin at

http://www.wenlin.com/

is a self-contained, Mac/Win means of editing Unicode Chinese. I've
heard Unicoders speak well of it before. At the last conference one
presenter said in his presentation, concluding his praise of Wenlin:

"No one who does Chinese should be without this program."

I couldn't agree more. It's a commercial program (in which I have no
financial stake), incorporating work over many years, including the
DeFrancis ABC dictionary. I'm looking forward to Extension A and B
support. Anyone else?

Re: curly-tailed phonetic letters

2000-12-08 Thread Richard Cook


This table has undergone some further revision:

http://stedt.berkeley.edu/pdf/curly-tail-table3.pdf

Please note in the center of the table:

U+0291/U+0293 and U+0255/U+0286

These 4 may in fact be 2 pairs of functional equivalents (synographs),
pointing to the same place of articulation. According to Pullum &
Ladusaw (1996), IPA approval of U+0286 and U+0293 was withdrawn in 1989.

Please note that also in the above table are symbols for the 2 pairs of
so-called "apical" vowels. These include U+0285 and U+027F (the
unrounded apicals, relatively front and back, respectively), as well as
their rounded counterparts. These are all 4 non-IPA-sanctioned symbols.
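
For reference, the character names of the code points mentioned above can be looked up with a short Python sketch. It uses only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

# Code points cited in the message above.
SYMBOLS = ["\u0291", "\u0293", "\u0255", "\u0286", "\u0285", "\u027f"]

# Map "U+XXXX" labels to the formal Unicode character names.
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in SYMBOLS}

for label, name in names.items():
    print(label, name)
```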


Richard S. COOK, Jr.
STEDT Project, Linguistics Department
University of California, Berkeley

Re: curly-tailed phonetic letters

2000-12-05 Thread Richard Cook


With regard to the curly-tail character set, here's a link to an
IPA-style chart of this I made:

http://stedt.berkeley.edu/pdf/curly-tail-table2.pdf

The curly-tail series is in red. As always, comments, suggestions and
corrections are welcome.


Richard S. COOK, Jr.
STEDT Project, Linguistics Department
University of California, Berkeley
