Re: proposal for a "creative commons" character

2004-06-15 Thread John H. Jenkins
On Jun 15, 2004, at 2:22 PM, [EMAIL PROTECTED] wrote:
Michael Tiemann scripsit:
Without getting greedy, I'd like to propose the adoption of the (cc)
symbol in whatever way would be most expedient (so that creative 
commons
authors can identify their work more appropriately), and leave for 
later
the question of the other symbols.
It's a logo.  We normally don't do logos.
To be a little less terse: in the case of symbols like this, there is a 
strong preference not to encode them as a means of encouraging their use.

====
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: number of bytes for simplified chinese

2004-06-28 Thread John H. Jenkins
On Jun 27, 2004, at 11:37 PM, Duraivel wrote:
hi,
 
I would like to know the number of bytes required for the simplified 
Chinese language. Can we represent all the characters of simplified 
Chinese in Unicode using just two bytes?

No.  It will take up to four bytes per character, whether you're using 
UTF-8, UTF-16, or UTF-32.
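The point can be checked directly; a quick Python sketch (the specific characters chosen here are just illustrative):

```python
# Byte counts for a BMP Han character vs. a CJK Extension B character,
# showing that all three UTFs can need up to four bytes per character.
bmp_char = "\u4e2d"        # 中 (U+4E2D, inside the BMP)
ext_b_char = "\U00020000"  # U+20000 (CJK Extension B, outside the BMP)

for ch in (bmp_char, ext_b_char):
    print(
        f"U+{ord(ch):04X}:",
        len(ch.encode("utf-8")), "bytes in UTF-8,",
        len(ch.encode("utf-16-le")), "in UTF-16,",
        len(ch.encode("utf-32-le")), "in UTF-32",
    )
# U+4E2D: 3 bytes in UTF-8, 2 in UTF-16, 4 in UTF-32
# U+20000: 4 bytes in UTF-8, 4 in UTF-16, 4 in UTF-32
```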


John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: Looking for transcription or transliteration standards latin->arabic

2004-07-02 Thread John H. Jenkins
On Jul 2, 2004, at 11:17 AM, Chris Harvey wrote:
Perhaps one could think of "Ha Tinh" as the English word for the city, 
like "Rome" (English) for "Roma" (Italian), or Tokyo (English) for 
"Tōkyō" (English transliteration of Japanese), or Kahnawake 
(English/French) for Kahnawà:ke (Mohawk).
Or Peking for Běijīng.  :-)

John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: Chinese Simplified - How many bytes

2004-07-06 Thread John H. Jenkins
On Jul 6, 2004, at 3:10 AM, Duraivel wrote:
Hi,
I browsed through the ICU library and it looks similar to the 
gettext library which GNU provides, with more functionality added. But 
we are developing our product on Qt, which has its own translations, so 
I don't want to use another library for translations. There is also a 
class QString which says it takes care of byte issues. Basically it 
is overloaded and acts accordingly for a two-byte Unicode char set. It also 
states that QString supports Chinese (simplified). I'm not getting 
how two bytes can support simplified Chinese. Is it true 
that, to represent simplified Chinese programmatically, two bytes will 
do?

Unicode in the UTF-16 encoding will cover almost all the simplified 
Chinese characters people use today in two bytes.  There are the 
occasional exceptions which will require four bytes.
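To illustrate the distinction in Python terms (Qt itself is not involved; the characters are arbitrary examples):

```python
# Count UTF-16 code units (two bytes each) versus characters.
def utf16_units(s: str) -> int:
    return len(s.encode("utf-16-le")) // 2

common = "简体中文"         # everyday simplified characters, all in the BMP
rare = "\U00020087"         # a CJK Extension B ideograph, needs a surrogate pair

assert utf16_units(common) == len(common)   # one 2-byte unit per character
assert utf16_units(rare) == 2               # two units (4 bytes) for one character
```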


John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: Unicode v. 4 font software for Mac

2004-07-15 Thread John H. Jenkins
On Jul 15, 2004, at 12:13 PM, David Branner wrote:
I have tried AsiaFont Studio 4 and FontLab, but they are not compatible
with version 4 of the Unicode Standard and hence are not suitable for 
my
purposes.

I assume that by saying they're not compatible, you mean that they 
don't support characters off of the BMP.  If this is the problem, you 
can use Apple's tool ftxdumperfuser to alter the cmap after FontLab has 
generated it.  Apple's font tool suite is available at 
<http://developer.apple.com/fonts>.  (Alternatively, if you give a 
character a name of the form "ux," e.g., "u2" I'm told that the 
latest version of FontLab will generate an appropriate cmap entry for 
it, but I don't know for sure.)


John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: Unicode v. 4 font software for Mac

2004-07-15 Thread John H. Jenkins
On Jul 15, 2004, at 2:54 PM, David Branner wrote:
>> I assume that by saying they're not compatible, you mean that 
>> they don't support characters off of the BMP.

They can neither generate such characters nor (apparently) open fonts 
that
contain such characters.

Then move the non-BMP characters to the PUA using ftxdumperfuser (or 
remove their Unicode mappings altogether), and re-add (or re-shift) the 
Unicode mappings after using FontLab with the same tool.
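The parking-in-the-PUA idea can be sketched abstractly in Python; the dict below merely stands in for a cmap, and the offset scheme is a hypothetical choice for illustration, not what ftxdumperfuser actually does:

```python
# Sketch: shift supplementary-plane (non-BMP) mappings into Plane 15's
# Private Use Area so a BMP-only tool can round-trip the font, then shift
# them back afterward.  The dict stands in for a font cmap; the parking
# offset is a hypothetical choice.
PUA_B_START = 0xF0000  # Plane 15 PUA (U+F0000..U+FFFFD)

def shift_to_pua(cmap: dict[int, str]) -> dict[int, str]:
    out = {}
    for cp, glyph in cmap.items():
        if cp > 0xFFFF:                        # non-BMP character
            cp = PUA_B_START + (cp - 0x20000)  # park it in the PUA
        out[cp] = glyph
    return out

def shift_back(cmap: dict[int, str]) -> dict[int, str]:
    out = {}
    for cp, glyph in cmap.items():
        if 0xF0000 <= cp <= 0xFFFFD:           # parked entry
            cp = 0x20000 + (cp - PUA_B_START)
        out[cp] = glyph
    return out

cmap = {0x4E2D: "zhong", 0x20000: "extb-one"}
assert shift_back(shift_to_pua(cmap)) == cmap  # round trip preserves mappings
```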

========
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: Problem with accented characters

2004-08-23 Thread John H. Jenkins
On Aug 23, 2004, at 3:34 PM, Doug Ewell wrote:
Deborah Goldsmith  wrote:
FYI, by far the largest source of text in NFD (decomposed) form in
Mac OS X is the file system. File names are stored this way (for
historical reasons), so anything copied from a file name is in (a
slightly altered form of) NFD.
"Slightly altered"?
Yes, the specification for the Mac file system was frozen before NFD 
had been developed by the UTC, so it isn't exactly the same.  But it's 
close.
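The NFC/NFD distinction itself (leaving aside the file system's slightly altered decomposition table) can be seen with Python's unicodedata module:

```python
import unicodedata

# "é" as one precomposed character (NFC) vs. base letter + combining accent (NFD)
nfc = "caf\u00e9"
nfd = unicodedata.normalize("NFD", nfc)

assert nfd == "cafe\u0301"       # decomposed: 'e' + U+0301 COMBINING ACUTE ACCENT
assert nfc != nfd                # different code point sequences...
assert unicodedata.normalize("NFC", nfd) == nfc  # ...same text once renormalized
```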


John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: Arial Unicode MS

2004-12-06 Thread John H. Jenkins
On Dec 6, 2004, at 10:23 AM, Johannes Bergerhausen wrote:

From some discussions here I learned that Arial Unicode MS contains 
about 50,000 glyphs, which is about the number of characters encoded in 
Unicode 2.0, and that it was last shipped bundled with Office for 
Windows 2003.

A pan-Unicode font is a beautiful idea.
Why did Microsoft/Monotype stop the development of further versions?
The TrueType and OpenType font formats do not allow a font to contain 
more than 65,536 glyphs (glyph indices are 16-bit). Since there are well over 65,000 
characters in Unicode, plus additional glyphic forms that would be 
necessary for proper support for various scripts, it is no longer 
possible to produce a single font like Arial Unicode MS.
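A back-of-the-envelope check of the arithmetic, assuming an approximate character count for Unicode 4.0:

```python
# TrueType/OpenType glyph indices are 16-bit, so a font tops out at 65,536 glyphs.
MAX_GLYPHS = 2 ** 16           # glyph IDs 0..0xFFFF

# Assumed figure for illustration: Unicode 4.0 assigned over 96,000 characters,
# before counting the extra glyphs complex scripts need for shaping.
unicode_4_characters = 96_000  # approximate
assert unicode_4_characters > MAX_GLYPHS
print("shortfall: at least", unicode_4_characters - MAX_GLYPHS, "characters uncovered")
```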

There are other issues -- making a single typeface which covers all the 
scripts in Unicode and has a common esthetic design is really not 
possible; loading a huge font can consume a significant chunk of the 
resources on a system, most of which is wasted; and so on.






Re: IUC27 Unicode, Cultural Diversity, and Multilingual Computing / Africa is forgotten once again.

2004-12-08 Thread John H. Jenkins
On Dec 8, 2004, at 3:57 PM, Patrick Andries wrote:

Azzedine Ait Khelifa wrote:
Hello All,
The subject of this conference is really interesting and very useful.
But once again Africa is forgotten.
I want to know if we can have the same conference "Africa-oriented" 
scheduled.
If not, what should we do to have this conference scheduled in a 
city accessible to the African community (like Paris)?

If this is possible, I would also add « and with much more content 
in a language understood in Africa and the host country: French ».

Well, and as with everything else associated with Unicode, feel free to 
volunteer.



Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-13 Thread John H. Jenkins
On Dec 10, 2004, at 1:25 PM, Tim Greenwood wrote:

Is that like the 'Please RSVP' that I see all too often? Or should
that not be excused?

Or -- my own personal favorite -- "in the year AD 2004."




Re: Simplified Chinese radical set in Unihan

2004-12-16 Thread John H. Jenkins
As you say, the main problem is that there are so many different 
possible sets. Some will be proprietary, which would limit their 
usefulness, although there would, I believe, otherwise be no objection 
to their inclusion. If you can come up with a reasonably standard set and 
reasonably consistent data across several dictionaries referencing it, 
I'm sure there'd be no objection to including it.

On Dec 16, 2004, at 2:19 PM, Erik Peterson wrote:

Hello,
 I've found many uses for the UniHan data file the past few years. 
It's a great source of information.

 One potential addition that I've wanted is a field listing the 
simplified Chinese radical for at least the simplified Chinese 
characters, like what exists for the Xinhua Zidian ("Xinhua 
Dictionary") and other mainland Chinese dictionaries. I was wondering 
if this has been discussed before?

 Some potential difficulties I could see include the fact that 
mainland dictionaries use a variety of different radical schemes. The 
most standard one that I can find is the Chinese Academy of Social 
Sciences (CASS) set with 189 different radicals. Even for dictionaries 
that use this set the ordering is often different. Could the radical 
set also be proprietary in some way?

 Anyway, I was curious. I've been working on something like this 
myself that I could also contribute when it's farther along.

Regards,
Erik Peterson




Re: Preparing a proposal for encoding a portable interpretable object code into Unicode (from Re: IUC 34 - call for participation open until May 26)

2010-06-01 Thread John H. Jenkins
First of all, as Michael says, this isn't character encoding.  You're not 
interchanging plain text.  This is essentially machine language you're writing 
here, and there are entirely different venues for developing this kind of 
thing.  

Secondly, I have virtually no idea what problem this is attempting to solve 
unless it's attempting to embed a text rendering engine within plain text.  If 
so, it's both entirely superfluous (there are already projects to provide for 
cross-platform support for text rendering) and woefully inadequate and 
underspecified.  Even if this were sufficient to be able to draw a currently 
unencoded script, the fact of the matter is that it doesn't allow for doing 
anything with the script other than drawing.  (Spell-checking?  Sorting?  
Text-to-speech?)

Unicode and ISO/IEC 10646 are attempts to solve a basic, simply-described 
problem:  provide for a standardized computer representation of plain text 
written using existing writing systems.  That's it.  Any attempt to use the two 
to do something different is not going to fly.  Creating new writing systems, 
directly embedding language, directly embedding mathematics or machine 
language--all of these are entirely outside of Unicode's purview and WG2's 
remit.  They simply will not be adopted.

Your enthusiasm may be commendable, but you're spending your energy developing 
something which is not appropriate for inclusion within Unicode.





Re: Preparing a proposal for encoding a portable interpretable object code into Unicode (from Re: IUC 34 - call for participation open until May 26)

2010-06-02 Thread John H. Jenkins

On Jun 2, 2010, at 3:51 AM, William_J_G Overington wrote:

> I know of no reason to think that a person "skilled in the art" would be 
> unable to write an iPad app to receive a program written in the portable 
> interpretable object code arriving within a Unicode text message and then for 
> the program to run in a virtual machine within the app, displaying a 
> graphical result on the screen of the iPad. Could such an app be written 
> based on the information in the paper_draft_005.pdf document? 
>  

OK, one very last note.  The answer to this question is, "No."  

=
John H. Jenkins
jenk...@apple.com



Re: Preparing a proposal for encoding a portable interpretable object code into Unicode (from Re: IUC 34 - call for participation open until May 26)

2010-06-02 Thread John H. Jenkins

On Jun 2, 2010, at 3:51 AM, William_J_G Overington wrote:

> 
>> Unicode and ISO/IEC 10646 are attempts to solve a basic,
>> simply-described problem:  provide for a standardized
>> computer representation of plain text written using existing
>> writing systems.
> 
> Well, that might well be the case historically, yet then the emoji were 
> invented and they were encoded. The emoji existed at the time that they were 
> encoded, yet they did not exist at the time that the standards were started. 
> So, if the idea of the portable interpretable object code gathers support, 
> then maybe the defined scope of the standards will become extended.

*If* the idea of a portable, interpretable object code embedded in plain text 
garners support and actual implementation outside of Unicode itself, then yes, 
it's conceivable that the UTC might consider it.  Emoji were encoded because 
they were already widely implemented in Japanese cell phones.  If the emoji set 
had been submitted to the UTC as is *without* prior, widespread implementation, 
it would likely not have been approved.

And in any event, Unicode already included significant collections of dingbat 
and dingbat-like elements and has from the first.  Whatever one may feel about 
the merits of encoding this particular set, the fact is that there was ample 
precedent already there.  Encoding emoji did not alter the τό τί ἦν εἶναι, the 
essence, of the standard.   

> 
>> That's it.  Any attempt to use
>> the two to do something different is not going to fly.
> 
> Well, I appreciate that the use of the phrase "not going to fly" is a 
> metaphor and I could use a creative writing metaphor of it soaring on 
> thermals above olive groves, yet to what exactly are you using the metaphor 
> "not going to fly" to refer please?

I mean that there is no chance at all that the UTC would approve this proposal 
as matters stand, and that pursuing such a concept through Unicode channels is 
a waste of everybody's time, yours not excepted. If you seriously want to get 
such a radical redefinition of "plain text" included in Unicode, you'll need to 
start elsewhere.  

And I don't have time myself to really comment further than I already have.

=
John H. Jenkins
jenk...@apple.com





Re: A question about "user areas"

2010-06-02 Thread John H. Jenkins

On Jun 2, 2010, at 3:49 AM, Vinodh Rajan wrote:

> If there are similar projects that encode Ancient Characters in PUA, may be 
> you can co-ordinate with them. Similar to the ConScript Unicode Registry.
>  

There is a proposal for "Old Hanzi" being worked on by the IRG.  You can peruse 
the IRGs documents on the subject at their Web site, 
<http://appsrv.cse.cuhk.edu.hk/~irg/>.

=
John H. Jenkins
jenk...@apple.com





Re: Hexadecimal digits

2010-06-04 Thread John H. Jenkins
Unicode has Roman numerals for compatibility reasons, not for serious use as 
Roman numerals. If you *really* want to work with roman numerals, even in the 
year MMDCCLXIII AUC, use the letters, just like the Romans did.

And in any event, you're undermining your own case, because a *lot* of 
societies have used the same symbols for letters and numerals.  People learn to 
live with it, just the way we live with cough and slough, minute and minute, 
and 1750 hours and 1750 days.  This is where gematria had its start.
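Python's unicodedata module shows how the compatibility Roman numerals behave; U+216B ROMAN NUMERAL TWELVE folds to ordinary letters under NFKC:

```python
import unicodedata

# U+216B ROMAN NUMERAL TWELVE exists for compatibility; NFKC folds it to letters.
twelve = "\u216b"                                  # Ⅻ
assert unicodedata.normalize("NFKC", twelve) == "XII"
assert unicodedata.numeric(twelve) == 12.0         # it still carries a numeric value
```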

Sent from my iPhone

On Jun 4, 2010, at 12:39 PM, Luke-Jr wrote:

> Unicode has Roman numerals and bar counting (base 0); why should base 16 be 
> denied unique characters?
> 
> From another perspective, the English-language Arabic-numeral world came up 
> with ASCII. Unicode was created to unlimit the character set to include  
> coverage of other languages' characters. Why shouldn't a variety of numeric 
> systems also be supported?
> 
> 




Re: Hexadecimal digits

2010-06-04 Thread John H. Jenkins

On Jun 4, 2010, at 2:48 PM, Luke-Jr wrote:

> The computer industry already has units of 'kilobyte' and such referring to 
> powers of 1024. 
> 

You mean, of course, kibibyte.  A kilobyte is 1000 bytes.  
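The difference is easy to see numerically; a small Python sketch:

```python
# Decimal (SI) vs. binary (IEC) prefixes: a kilobyte is 1000 bytes, a kibibyte 1024.
KILO, KIBI = 10 ** 3, 2 ** 10

size = 65_536  # bytes
print(size / KILO, "kB")   # 65.536 kB
print(size / KIBI, "KiB")  # 64.0 KiB
assert KIBI - KILO == 24   # a 2.4% gap, and it grows with each prefix step
```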






Re: Overloading Unicode

2010-06-07 Thread John H. Jenkins

On Jun 7, 2010, at 2:48 AM, William_J_G Overington wrote:

> I am hoping to submit a document to the Unicode Technical Committee in the 
> hope that the Unicode Technical Committee will institute a Public Review.
> 

I don't believe that the UTC will institute a Public Review on this proposal 
because it is so patently outside the scope of the Unicode Standard.  

> I feel that the possibility of the Unicode Technical Committee instituting 
> such a Public Review would be increased if there were support for such a 
> Public Review to take place.
> 

If there were support, the possibility might be increased from 0% to 0.001%.  
But there isn't any support.  

> I feel that a Public Review conducted by the Unicode Technical Committee 
> would be a good way to decide whether to encode a portable interpretable 
> object code into Unicode.
> 

Public Reviews aren't intended to help the UTC decide whether or not a 
particular proposal is within the scope of the standard.  

Nobody's stopping you from submitting a proposal, but bear in mind that nobody 
on this list has shown any support for it and you have been told repeatedly by 
a number of people that it's outside of Unicode's scope.  There is absolutely 
no chance that the UTC will do anything on this proposal other than reject it.

This really isn't the proper venue to pursue the proposal, and you're wasting 
your time by doing so.  Implement it, get support for it, get it adopted 
outside of a narrow group of supporters.  If there is a *demonstrated* problem 
that this is a *demonstrated* solution for, then *maybe* the UTC would look at 
it.  Until then, discussing the proposal here is simply tilting at windmills.  

=
John H. Jenkins
jenk...@apple.com






Re: Octal

2010-06-07 Thread John H. Jenkins
For me, the biggest advantage for octal is that you can still count easily on 
your fingers.  (And yes, I do count on my fingers.  I also still use a slide 
rule and have been known to do long division in Roman numerals.)

On Jun 5, 2010, at 11:16 AM, Jonathan Rosenne wrote:

> When I started using computers we used octal, so I suggest new characters for 
> the octal digits “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”.
>  
> BTW, octal has all the benefits claimed for hexadecimal with the advantage 
> that it is much simpler.
>  
> Jony
>  
> From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On 
> Behalf Of Peter Constable
> Sent: Saturday, June 05, 2010 6:45 PM
> To: Unicode Discussion
> Subject: base-9 digits
>  
> Can we please encode new characters for base-9 digits “0”, “1”, “2”, “3”, 
> “4”, “5”, “6”, “7”, “8”?
>  
>  
>  
> Peter

=
John H. Jenkins
jenk...@apple.com




Re: Hexadecimal digits

2010-06-09 Thread John H. Jenkins
Both a decimal 2 and a hexadecimal 2 are an ideogram representing the abstract 
concept of "two-ness," and the latter is derived typographically from the 
former (and, indeed, currently looks exactly like it).  This is comparable to a 
Chinese 二 and a Japanese 二, which we've unified.

Unicode encodes characters, not glyphs.  In order to encode a 
hexadecimal-2 separately from a decimal-2, you'd either have to show 
that the two are, in fact, inherently different characters (in which case you'd 
better be prepared to separately encode the octal-2 and the duodecimal-2 et 
al.), or you'd have to show that widespread existing practice treats them as 
distinct or at least draws them distinctly.  

(And before anybody raises the objection, nobody treats the Chinese 二 and 
Japanese 二 as distinct.  There are other sinograms which look different when 
designed for Chinese use and Japanese use and some people would like to treat 
them as distinct for that reason, but historically and in current practice, 
this is not actually done.)

Indeed, current practice universally treats decimal-0 through decimal-9 as 
hexadecimal-0 through hexadecimal-9 and letter-A/a through letter-F/f as 
hexadecimal-10 through hexadecimal-15.  That practice would have to change 
before any serious attempt at encoding "hexadecimal digits" would be 
considered.  And using letters for numerals has a long and distinguished 
history despite the inherent ambiguities, so there is ample precedent for the 
current practice.
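That universal practice is baked into ordinary programming languages; for example, in Python:

```python
# Current practice: 0-9 plus the letters A-F/a-f serve as the sixteen hex digits.
assert int("2F", 16) == 47
assert int("2f", 16) == 47          # case-insensitive, both letter forms accepted
assert format(255, "X") == "FF"     # and letters come back out when formatting
```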

Yes, this does create a chicken-and-egg problem, and whether or not this will 
have a long-term impact on the creation or adoption of new alphabets or new 
typographic practice is an interesting one.  That, however, is irrelevant to 
how Unicode does things.  

In re the tonal system specifically, I note that it uses a glyph for 
hexadecimal-10 which looks (to me, at least) identical with a glyph for 
decimal-9.  This IMHO represents a serious impediment to the system ever being 
adopted.  I will, however, gladly be proven wrong.

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Are Unihan variant relations expected to be symmetrical?

2010-06-29 Thread John H. Jenkins
The kZVariant field has bad data in it that we haven't had time to clean up.  
It should, in theory, be symmetrical, and it should, in theory, contain only 
unifiable forms, but as you note, it doesn't.  In addition to the use of the 
source separation rule, it should also cover characters which were added to the 
standard in error.  

In any event, I'm afraid that right now it's probably best not to rely on it 
for anything.
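Anyone who does want to rely on the field can at least detect the asymmetries first. A minimal Python sketch, using the U+4E80/U+9F9C example from the question (sample data only, not the real Unihan file):

```python
# Find asymmetric entries in a variant mapping before trusting its "cliques".
variants = {
    0x4E80: {0x9F9C},   # 亀 lists 龜 as a variant...
    0x9F9C: set(),      # ...but not vice versa (the reported asymmetry)
}

def asymmetric_pairs(rel: dict[int, set[int]]) -> list[tuple[int, int]]:
    """Return (a, b) pairs where a points to b but b does not point back."""
    return [
        (a, b)
        for a, targets in rel.items()
        for b in targets
        if a not in rel.get(b, set())
    ]

assert asymmetric_pairs(variants) == [(0x4E80, 0x9F9C)]
```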

On Jun 29, 2010, at 8:25 AM, Uriah Eisenstein wrote:

> Hi,
> To clarify my question with an example :) The character 亀 (U+4E80) is listed 
> in Unihan as a Z-variant of 龜 (U+9F9C). However, the opposite is not true. 
> Similarly, 疍 (U+758D) is listed as a semantic variant of 蛋 (U+86CB), but not 
> vice versa. From the definitions of these variant types in UAX#38, one would 
> naturally expect them to be symmetrical, and both characters to show each 
> other as variants. There are quite a few other such cases, although it does 
> appear that in most cases the relation is symmetrical.
> My reason for asking, BTW, is that I'm thinking of grouping characters which 
> are Z-variants of each other in some application, so I need to understand 
> whether Z-variants are expected to have clear "cliques" in which each 
> character is a Z-variant of all others.
> I realize that the semantic variant relation, at least, is based on external 
> sources and not determined by Unicode; regarding Z-variants I'm not clear. 
> I'd like to know though whether the relation is expected to be symmetrical, 
> and the above cases are to be considered errors; or there is some meaning to 
> a one-directional relation; or something else.
> On a side note, some Z-variants I've looked at seem to have very different 
> abstract shapes, in some cases looking more like simplified/traditional 
> pairs. As I said I don't know clearly how they are determined. Are they 
> supposed to be exactly those pairs which would be unified if it were not for 
> the Source Separation Rule?
> 
> TIA,
> Uriah

=
John H. Jenkins
jenk...@apple.com




Re: 001B, 001D, 001C

2010-07-07 Thread John H. Jenkins
I see "Escape" used (or at least the "esc" key on my keyboard) in a lot of 
applications still as a kind of "get me out of here" key.  And it's used a lot 
by emacs as the meta key, IIRC.

On Jul 7, 2010, at 9:00 AM, Michael S. Kaplan wrote:

> Not for any terribly interesting reason, but mainly for all kinds of
> ancient features like I mention here:
> 
> http://blogs.msdn.com/b/michkap/archive/2008/11/04/9037027.aspx
> 
> and here:
> 
> http://blogs.msdn.com/b/michkap/archive/2007/05/28/2954171.aspx
> 
> Michael
> 
>> Hello!
>> 
>> 001B, 001D, 001C are present in some keyboard layouts. What are these
>> characters used for?
>> 
>> 
> 
> 
> 

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: Status of Unihan

2010-07-12 Thread John H. Jenkins
We hope to have it back in the next few days.

On Jul 12, 2010, at 8:34 AM, Martin Heijdra wrote:

> When will Unihan be back? It has been down for quite a while now, and there 
> are librarians for whom checking this is part of their workflow…
>  
> Martin

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com




Re: ? Reasonable to propose stability policy on numeric type = decimal

2010-07-26 Thread John H. Jenkins

On Jul 24, 2010, at 7:09 PM, Michael Everson wrote:

> On 25 Jul 2010, at 02:02, Bill Poser wrote:
> 
>> As I said, it isn't a huge issue, but scattering the digits makes the 
>> programming a bit more complex and error-prone and the programs a little 
>> less efficient.
> 
> But it would still *work*. So my hyperbole was not outrageous. And nobody has 
> actually scattered them.
> 

The set of Chinese numerals used in decimal notation is rather spectacularly 
scattered.

(FWIW I'm in the "Yes, it's *very* useful, and yes, it's the way we should do 
it wherever possible, but no, a formal policy is probably not best" camp.)
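The scattering is easy to verify in Python (the usual ten characters of Chinese decimal notation are used here):

```python
# The ten characters used in Chinese decimal notation sit at scattered code
# points, unlike ASCII 0-9, which occupy a contiguous run.
ascii_digits = "0123456789"
han_digits = "〇一二三四五六七八九"

ascii_cps = [ord(c) for c in ascii_digits]
han_cps = [ord(c) for c in han_digits]

assert ascii_cps == list(range(0x30, 0x3A))   # contiguous block
assert sorted(han_cps) != list(range(min(han_cps), min(han_cps) + 10))  # scattered
print([hex(cp) for cp in han_cps])
```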

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Indian new rupee sign

2010-07-30 Thread John H. Jenkins

On Jul 30, 2010, at 5:01 AM, William_J_G Overington wrote:

> Is there any good reason why people cannot arrange that the new symbol is 
> fully encoded into Unicode and ISO 10646 by 31 December 2010, that is, before 
> the end of the present decade, ready to use in the next decade?
> 
> If there is progress over getting the encoding done, then maybe other people 
> will join in the effort and update fonts and whatever else needs updating by 
> the same date.
> 

Unicode is a complex standard whose structure involves code charts, data files, 
and various standard annexes and reports.  Any change to the standard involves 
changes to at least some of these, if not all of them.  This work is done by 
several individuals scattered around the world.  Time is needed to make sure 
the changes are properly coordinated and made with due care.  

WG2 is governed by ISO rules.  ISO is a large organization and involves 
national bodies from all over the globe.  The ISO voting process involves 
several rounds in order to make sure that any objections are properly discussed 
and responded to.  Even in the age of electronic communications, this takes 
time.

And many of the people involved in both UTC and WG2 have substantial 
responsibilities in addition to character encoding work.  (Some, indeed, do the 
character encoding work on their own time.)  It's not necessarily easy for them 
to find the time to look everything over carefully.

All of this is done at a deliberate pace because experience has taught that 
inasmuch as *any* change may have unintended consequences, making even a small 
change quickly may prove to create more problems than it solves.  

Note, for example, the "early adopter" who simply slapped in support for the new 
rupee symbol by overlaying it on top of `.  For a lot of people, that's a cool 
solution because it means that everything works *right* *now*.  The problem is 
that it breaks a lot of other things that the person in question (and his 
supporters) obviously didn't even think of, and now they've got a pile of 
unintended consequences.  

Obviously this is an important new symbol, and I'm sure that WG2 and the UTC 
will make every effort to encode it as expeditiously as possible.  As for 
exactly how long it will take, neither WG2 nor the UTC has even *met* since 
this hit the news.  While it's exciting to have the new symbol, and while one 
does want to strike while the iron is hot, ten years from now it won't have 
made much difference whether it was encoded in 2010 or 2011--unless the job got 
botched through over-haste.

Festina lente.

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Most complete (free) Chinese font?

2010-07-30 Thread John H. Jenkins
The Han Nom fonts cover everything through Extension B and look OK.  They're 
TrueType.

On Jul 30, 2010, at 1:41 PM, jander...@talentex.co.uk wrote:

> Does anybody know what the most complete, Chinese font is called? This is for 
> Linux, but I think I can use just about any format. I know about the one 
> called Unifont, which is possibly as ugly as one can make it :-) so I was 
> hoping to find something a little bit nicer.
> 
> The problem I have is that there are so many holes in most of the fonts, and 
> it seems to be quite hard to judge which font is more complete. Are there any 
> tools around that could show this - perhaps something that could tell how 
> many glyphs are defined in a given interval?
> 
> 
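For the tool question, a coverage counter is a few lines of Python; the sketch below works on any code-point-to-glyph mapping, and one could (as an assumption, untested here) feed it a real cmap via the fontTools library's TTFont(path).getBestCmap():

```python
# Sketch: count how many code points in a range a font's cmap actually covers.
def coverage(cmap: dict[int, object], start: int, end: int) -> float:
    """Fraction of code points in [start, end] present in the mapping."""
    covered = sum(1 for cp in range(start, end + 1) if cp in cmap)
    return covered / (end - start + 1)

# Toy cmap covering three of the first five Extension B code points.
toy = {0x20000: "g1", 0x20001: "g2", 0x20004: "g3"}
assert coverage(toy, 0x20000, 0x20004) == 0.6
```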

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Unihan is back, but...

2010-08-03 Thread John H. Jenkins
Thanks for the report; it's been fixed.  

BTW, problems with the Unihan database should be reported via 
http://www.unicode.org/reporting.html.  They're less likely to slip through the 
cracks that way.

On Aug 3, 2010, at 9:51 AM, Martin Heijdra wrote:

> We were glad to find this week that Unihan is back up.
>  
> However, there are some teething problems. If you look up by pronunciation, 
> say in Mandarin for jing3, you get as a result
>  
> 
>  
> Without any actual characters, unlike previously. I am sure that’s not the 
> way it is supposed to work…
>  
> Martin Heijdra
> Martin J. Heijdra
> Chinese Studies/East Asian Studies Bibliographer 
> East Asian Library and the Gest Collection 
> Frist Campus Center, Room 314 
> Princeton University 
> Princeton, NJ 08544 
> United States

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com




Re: Unihan is back, but...

2010-08-03 Thread John H. Jenkins

On Aug 3, 2010, at 12:00 PM, Robert Abel wrote:

> On 2010/08/03 18:17, John H. Jenkins wrote:
>> 
>> Thanks for the report; it's been fixed.  
>> 
>> BTW, problems with the Unihan database should be reported via 
>> http://www.unicode.org/reporting.html.  They're less likely to slip through 
>> the cracks that way.
> Speaking of slipping through cracks. Are there any plans to update the 
> reference glyphs for all Han characters added after approximately Unicode 
> 3.1? I filed an error report on said page some time ago and got back that 
> Unicode just didn't get around to producing them. So is there an estimate on 
> when that will be the case?
> 

Alas, no.  We do still plan to do this, but we can't give any sense of when it 
will be done.

=
John H. Jenkins
jenk...@apple.com




Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-06 Thread John H. Jenkins

On Aug 6, 2010, at 3:03 AM, William_J_G Overington wrote:

> The standards organizations have a great opportunity to advance typography by 
> defining some of the Latin letter plus variation selector pairs so that 
> alternate glyphs within a font may be accessed directly from plain text.
> 

This is another case of a solution in search of a problem.  It isn't Unicode's 
business to advance typography, and in any event, typesetting plain text isn't 
the path to good typography.  Other technologies, such as OpenType, AAT, and 
Graphite, *do* have the job of making good typography easy and accessible.  
And, mirabile dictu, they can already do what you are suggesting here for plain 
text.  

Unicode's responsibility is to deal with existing needs.  If it is common for 
poets to use various letter shapes at the end of words to convey some semantic 
meaning, and if they do this in their emails or tweets, or if they're 
complaining that this is something that they want to do but can't, then Unicode 
and plain text provide a proper way to help them.  

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-09 Thread John H. Jenkins

On Aug 7, 2010, at 10:40 AM, Doug Ewell wrote:

> I'd like to see an FAQ page on "What is Plain Text?" written primarily by UTC 
> officers.  That might go a long way toward resolving the differences between 
> William's interpretation of what plain text is, which people like me think is 
> too broad, and mine, which some people have said is too narrow.
> 

Well, we do have <http://www.unicode.org/faq/ligature_digraph.html#10> and 
related FAQs?

The basic idea is that "plain text" is the minimum amount of information needed to 
process the given language in a "normal" way.  FOR EXAMPLE, ALTHOUGH ENGLISH 
CAN BE WRITTEN IN ALL-CAPS, IT USUALLY ISN'T, AND DOING IT LOOKS WRONG.  We 
therefore have both upper- and lower-case letters for English.  On the other 
hand, although English *is* usually written with some facility to provide 
emphasis, different media have different ways of providing that facility 
(asterisks, underlining, italicizing), and English written without any of these 
looks perfectly fine.  

Arabic, on the other hand, absolutely must have some way of allowing for 
different letter shapes in different contexts, or it looks just wrong, so 
Arabic "plain text" must have facility to allow for that, either by explicitly 
having different characters for the different shapes the letters take, or by 
providing a default layout algorithm that defines them.  

Beyond rendering, there are also considerations as to the minimal amount of 
information necessary for other text-based processes, such as sorting, 
searching, and text-to-speech.
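To make the searching point concrete, here is a small illustration (a sketch using Python's standard unicodedata module; the variable names are mine, not anything defined by Unicode): two spellings of "café" that render identically are different code point sequences, and a search process needs normalization to treat them as the same plain text.

```python
import unicodedata

composed = "caf\u00e9"     # "café" with precomposed U+00E9
decomposed = "cafe\u0301"  # "café" with "e" + combining acute U+0301

# The two look identical when rendered, but they are different sequences,
# so a naive comparison fails; normalizing both to NFC makes them match.
same_raw = composed == decomposed
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))
print(same_raw, same_nfc)  # False True
```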

Yes, there are issues which end up being judgment calls, and it's easy to come 
up with cases where you can't really capture the full semantic intent of the 
author without what Unicode calls "rich text."  My favorite example is "The 
Mouse's Tale" in _Alice in Wonderland_.   Plain text isn't intended to capture 
all the nuances of the original's semantics, but to provide at the least a very 
close approximation.

Variation selectors are intended to cover cases where more information is 
needed for rendering than is required for other processes such as searching 
(Mongolian), or cases where different user communities disagree on whether two 
forms must be unified or must be deunified.

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Accessing alternate glyphs from plain text

2010-08-11 Thread John H. Jenkins

On Aug 11, 2010, at 8:18 AM, Doug Ewell wrote:

> But to imply that because text always has a specific appearance, determining 
> the underlying plain text is an artificial process that was imposed on us by 
> computers seems wrong.  We (meaning "readers of alphabetic scripts, at least 
> Latin and Cyrillic") learn to recognize letters at an early age, but quickly 
> run into additional glyphs we don't recognize, like certain cursive uppercase 
> letters (especially G and Q) and the two-tier vs. one-tier lowercase a and g. 
>  Then we find out they are different forms of the same letter, and learn to 
> read them the same, and that is the essence of "plain text"—the underlying 
> letters behind potentially differing glyphs.
> 

Just to illustrate Doug's point, suppose someone hands you a hand-written 
letter and asks you to copy it.  To what extent do you attempt to fully 
recreate the format of the original?  Most likely, you'll simply copy the 
letters and punctuation.  If the letter has some specific formatting (such as 
underlining), you may attempt to recreate that.  By and large, however, there 
would be no effort to recreate the non-paragraphing line breaks and definitely 
not any effort to recreate the original letter shapes.  Copying the letter in 
this fashion is certainly acceptable under almost all circumstances--indeed, in 
many cases it would be preferred over, say, a photocopy--and it strongly 
suggests the existence of some sort of Platonic "plain text" which is the 
essence of what was written.

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: Accessing alternate glyphs from plain text

2010-08-12 Thread John H. Jenkins
You seem to be missing a couple of important points here which Peter is 
illustrating.

First of all, what you want to do can be done with existing technology.  
There's no need to add variation selectors or other mechanisms to achieve your 
goal.

Secondly, fonts are themselves works of art, and a well-designed face will have 
a set of swashes appropriate face but not necessarily another face.  Simply 
saying "I want a swash here" isn't enough.  On a Mac, for example, Hoefler Text 
Italic has one swash available for the "t", whereas Zapfino has three, none of 
which are like the swash Hoefler Text Italic provides, and one of which is 
inappropriate for use at the end of a line.  Most fonts won't have any, because 
swashes are usually seen as the purview of calligraphic fonts.  

So what do you do?  Do you provide a variation selector for every kind of swash 
a font designer might include to make sure you get the "right" one?  Or do you 
just say, "Put a swash in here, I don't care what it looks like?"  Neither 
seems like a good idea.  

Note, too, that Peter used swashes where you didn't ask for them.  Since we're 
trying to embody the swashing in plain text, doesn't that mean that he's 
violating what the poet was intending to say?

When you're doing real-life typography, it's really meaningless to talk about 
alternate glyph shapes without knowing what font you're working with.  

Typography is not done with plain text.  

Just to illustrate *my* point, I'm adding a PDF of four of the huge number of 
possibilities for laying out your first stanza with Zapfino on a Mac.  Which 
one did the poet intend?



Poem.pdf
Description: Adobe PDF document


On Aug 12, 2010, at 5:38 AM, William_J_G Overington wrote:

> Thank you for taking the time to produce the pdf and thank you also for 
> sharing the result.
> 
> I had not known of the Gabriola font previously.
> 
> I found the following page on the web.
> 
> http://www.microsoft.com/typography/fonts/family.aspx?FID=372
> 
> Best regards
> 
> William Overington
> 
> 12 August 2010
> 
> On Thursday 12 August 2010, Peter Constable  wrote:
> 
>> See the attached PDF showing Unicode
>> 5.2 text set in Word 2010 using the Gabriola font with
>> line-ending characters formatted with the Stylistic Set 7
>> OpenType Feature. No PUA; no variation selectors. Just
>> flourishing, OpenType glyphs.
>> 
>> 
>> Peter
>> 
> 
> 
> 

=
John H. Jenkins
jenk...@apple.com




Re: U-Source ideographs mapped to themselves

2010-08-30 Thread John H. Jenkins

On Aug 29, 2010, at 6:07 AM, Uriah Eisenstein wrote:

> Hi,
> UAX #38 (Unihan) defines the kIRG_USource field as a reference into the 
> U-source ideograph database described in UTR #45, having the form "UTCn". 
> However, several CJK Compatibility Ideographs are mapped to their own code 
> point values, e.g. "U+FA0C kIRG_USource U+FA0C". The formal syntax of 
> kIRG_USource allows this, but I've found no explanation as to the meaning of 
> such a mapping; there is also no such mapping from a code point to another 
> code point.
> Thanks,
> Uriah


This is being changed with the 6.0.0 release.  The U-source for all such 
ideographs has been turned into a UTR #45 index, e.g., the U-source for U+FA0C 
is now UTC00915.  

What it means is that the character is a unifiable variant derived from one of 
the industrial (and not national) sources used by Unicode during the 
development of the original URO.   

=
John H. Jenkins
jenk...@apple.com




Re: Unihan SQL access

2010-09-12 Thread John H. Jenkins
I'll raise the possibility with the appropriate individuals, but I think it 
likely that the Consortium would prefer that third parties not host clones of 
the Unihan database.  

On Sep 12, 2010, at 9:57 AM, Uriah Eisenstein wrote:

> Hello,
> I'm nearing completion of a simple Java program which loads Unihan data from 
> the source files into a DB, and provides SQL access to it. There's still at 
> least a week or so of work on issues I consider essential, but once ready I'd 
> be happy to make it available on the Internet if anyone's interested.
> So far I've used it to search for possibly erroneous data in Unihan; my 
> latest find is that 73 characters have a kTaiwanTelegraph value of , 
> which seems doubtful. It can also be useful for various statistical 
> information such as how many characters are listed under each radical, or 
> which blocks include IICore characters.
> I'm also considering adding the contents of the Unicode Character Database as 
> well at a later phase.
> Regards,
> Uriah Eisenstein

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com




Re: Unihan SQL access

2010-09-12 Thread John H. Jenkins
Got it.  Yes, that would be no problem at all.  Go for it.

On Sep 12, 2010, at 4:23 PM, Uriah Eisenstein wrote:

> OK, I should probably clarify: the program does not provide direct web access 
> to Unihan in any way, nor even contain it. It rather expects the user to have 
> downloaded Unihan.zip, and direct the program to the location of that file, 
> which it would then process. This shouldn't be more of a problem than having 
> a local copy of Unihan.zip, to the best of my understanding.
> (It might also be possible to direct it to read Unihan.zip directly from the 
> Unicode site, but I'd be sure to ask permission before trying that out).
> 
> Uriah
> 
> On Mon, Sep 13, 2010 at 12:03 AM, John H. Jenkins  wrote:
> I'll raise the possibility with the appropriate individuals, but I think it 
> likely that the Consortium would prefer that third parties not host clones of 
> the Unihan database.  
> 
> On Sep 12, 2010, at 9:57 AM, Uriah Eisenstein wrote:
> 
>> Hello,
>> I'm nearing completion of a simple Java program which loads Unihan data from 
>> the source files into a DB, and provides SQL access to it. There's still at 
>> least a week or so of work on issues I consider essential, but once ready 
>> I'd be happy to make it available on the Internet if anyone's interested.
>> So far I've used it to search for possibly erroneous data in Unihan; my 
>> latest find is that 73 characters have a kTaiwanTelegraph value of , 
>> which seems doubtful. It can also be useful for various statistical 
>> information such as how many characters are listed under each radical, or 
>> which blocks include IICore characters.
>> I'm also considering adding the contents of the Unicode Character Database 
>> as well at a later phase.
>> Regards,
>> Uriah Eisenstein
> 
> =
> Siôn ap-Rhisiart
> John H. Jenkins
> jenk...@apple.com
> 
> 
> 

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com




Re: OpenType update for Unicode 5.2/6.0?

2010-10-11 Thread John H. Jenkins
You might start at http://www.microsoft.com/typography/otspec/otlist.htm.  

On Oct 11, 2010, at 5:11 AM, Saqqara wrote:

> Given that OpenType is the de-facto standard for fonts, it is disappointing 
> to see the 'Script tag' list for OpenType has not been updated in almost 
> three years. I'm a patient person but the lack of inclusion of new scripts in 
> Unicode 5.2 a year after the fact seems like carelessness. I've elaborated a 
> little further on my jtotobsc blog, see 
> http://jtotobsc.blogspot.com/2010/10/isounicode-scripts-missing-in-opentype.html.
>  
> My particular interest being 𓌃𓂧𓏏𓏯𓀁𓏪𓆎𓅓𓊖 (mdt-kmt, the Egyptian language in 
> hieroglyphs).
>  
> Any ideas who needs to be prodded to make an update happen? It would also be 
> very useful if HTML5/WOFF could spec Unicode 6.0 or later as a step towards a 
> multiscript web.
>  
> Bob Richmond
>  
>  
>  
>  

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com




Re: Creative people on Twitter

2010-10-14 Thread John H. Jenkins

On Oct 14, 2010, at 4:12 AM, William_J_G Overington wrote:

> What is the position regarding the 32-bit code point space above U+10 
> please?
> 

Its use is incompatible with Unicode.  Fundamentally, it cannot be represented 
using UTF-16 (without a major rearchitecture), so it doesn't exist.

> Does the Unicode Consortium and/or ISO or indeed anyone else make any claims 
> upon it?
> 

Yes, the claim is that if you use it, you're generating invalid Unicode.  

Don't do it, don't contemplate it, don't think about it.  
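The arithmetic behind that ceiling is easy to sketch (a back-of-envelope check in Python; nothing here beyond the surrogate ranges defined in the standard): 1,024 high surrogates times 1,024 low surrogates give UTF-16 exactly 0x100000 supplementary code points on top of the BMP, so the code space ends at U+10FFFF.

```python
# UTF-16 surrogate-pair arithmetic: why the code space ends at U+10FFFF.
high_surrogates = 0xDBFF - 0xD800 + 1   # 1024 possible lead surrogates
low_surrogates = 0xDFFF - 0xDC00 + 1    # 1024 possible trail surrogates

bmp = 0x10000                                     # U+0000..U+FFFF
supplementary = high_surrogates * low_surrogates  # 0x100000 code points

limit = bmp + supplementary
print(hex(limit))  # 0x110000, i.e. code points run U+0000..U+10FFFF
```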

=
John H. Jenkins
jenk...@apple.com




Re: Errors in Unihan data : simplified/traditional variants

2010-11-01 Thread John H. Jenkins

On 2010/10/30, at 下午8:42, Koxinga wrote:

> My quickly done parsing program counted 1154 such pairs, where the head 
> character was the same as the character above. It seems to be always in the 
> order "kTraditionalVariant" then "kSimplifiedVariant", so can maybe be 
> automatically corrected. It seems to be a very evident mistake, and the 
> correction should be easy. I can help with that, I am just waiting to see if 
> this is the right place to report problems in Unihan. I also 
> considered http://www.unicode.org/reporting.html, would it be better?
> 

Yes, that would be better.  That way it will be tracked and it's less likely to 
slip through the cracks in my schedule.  For general questions, you can email 
me directly.

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: ch ligature in a monospace font

2011-06-28 Thread John H. Jenkins

On 28 Jun, 2011, at 11:29 AM, Jean-François Colson wrote:

> * In the C’HWERTY layout on Linux, the digraph and trigraph had to be 
> replaced by six PUA characters and an input method such as xim must be used 
> to get the correct character sequences. Since they are PUA characters, those 
> substitutions are not installed by default and the user has to add them 
> him/herself in his/her ~/.XCompose file. I’ve made a bug report at 
> Freedesktop.org to ask 6 new keysyms, but I don’t know when I’ll get an 
> answer if I get one at all. If there were Unicode characters such as LJ Lj lj NJ 
> Nj nj etc. for ch and c’h, such a problem wouldn’t occur.
> 

Why do you need to process them as single characters?  The typical way of 
handling these things is to use multiple characters, as is done in Welsh for 
"dd," "ff," and "ll" (among many other examples from many other languages).  
This is a well-known problem and with modern systems, there's no aspect of text 
processing that can't be handled this way. Keyboards can emit multiple 
characters with one keystroke, sorting can be tailored to account for 
multiple-character "letters," and so on.  
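As a sketch of the tailored-sorting point (pure Python with a made-up alphabet; a real deployment would use a tailored collator such as ICU's, and the `ALPHABET` list here is purely hypothetical): treating "ch" as a single letter that sorts after "c" only requires a longest-match sort key.

```python
# Toy collation: treat the digraph "ch" as one letter sorting after "c".
# Only a sketch -- real systems would use a tailored collator (e.g. ICU),
# and words are assumed to contain only letters from this toy alphabet.
ALPHABET = ["a", "b", "c", "ch", "d", "e"]  # hypothetical ordering

def sort_key(word):
    key, i = [], 0
    while i < len(word):
        # Prefer the longest match, so "ch" wins over "c" followed by "h".
        if word[i:i + 2] in ALPHABET:
            key.append(ALPHABET.index(word[i:i + 2]))
            i += 2
        else:
            key.append(ALPHABET.index(word[i]))
            i += 1
    return key

words = ["chad", "cab", "dab"]
print(sorted(words, key=sort_key))  # ['cab', 'chad', 'dab']
```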

> * Since those two letters must be encoded in 2 or 3 characters, with a 
> monospace font, they are twice or 3 times larger than the other letters.
> 
> To solve this last problem, would it be possible to make a font in which c 
> ZWJ h would be displayed as a new glyph?
> 

Yes, it's fairly trivial to do.  

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread John H. Jenkins
I'll try to arrange for an official corporate response to this document for the 
next UTC, but informally, I note that the charts include a number of variants 
of the Apple corporate logo, which Apple wants *not* to be encoded in any form. 
 

Beyond this—and speaking purely for myself and not for Apple (and unfortunately 
aware that some people don't understand or will not respect the distinction)—I 
think that this whole discussion is starting up a little too quickly.  The mere 
fact that they're in fonts some corporation ships is not evidence that they are 
appropriate even for consideration, let alone encoding, particularly in the 
absence of clones or other widely-distributed fonts which contain these glyphs. 
 I think it's fair to say that if Apple felt that these glyphs were needed in 
general text interchange, Apple would have proposed them.  

In any event, I would personally prefer that the whole discussion be dropped 
until Apple has had a chance to at least look over the document and respond.  
To do otherwise strikes me as at the least discourteous and at best premature.  

=
井作恆
John H. Jenkins







Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread John H. Jenkins
Not to sideline this discussion, but it's been brought to my attention that I 
wasn't clear on an important point.  What Karl provides in N4127 is just a 
summary of what PUA characters are found in Apple fonts, and not a proposal for 
encoding *anything*.  Kudos to Karl for digging up the data and formatting it 
so nicely, but there seems to be an awful lot of discussion going on for what 
is really a non-proposal, especially when the owner of the data in question has 
yet to comment on it.  

I guess my own background is leaking through here, but to me, it feels like the 
equivalent of the IRG using a copy of Adobe-Japan1-6 to work out issues on Han 
unification without waiting for Adobe to say anything about the set.  

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Endangered Alphabets

2011-08-19 Thread John H. Jenkins
I think you want ISO 2022.  

In any event, this will never happen in Unicode, because this is the exact 
opposite of what Unicode is all about, unless I misunderstand you.  Unicode's 
goal is for every code unit to have a fixed interpretation.  So far as many 
people involved in the original design of Unicode were concerned, code pages were a disaster.  

srivas sinnathurai 於 2011年8月19日 上午7:14 寫道:

> PUA is not structured and not officially programmable to accommodate numerous 
> code pages.
>  
> Take the ISO 8859-1, 2, 3, and so on .
> These are now allocating the same code points to many languages and for other 
> purposes.
> Similarly, structured and official allocations for many requirements can 
> be done using the same codes, say 16,000 of them.
>  
> Sinnathurai
> 
> On 19 August 2011 13:53, Doug Ewell  wrote:
> In what way is this not what the PUA is all about?
>  
> --
> Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
> www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­
> From: srivas sinnathurai
> Sent: Friday, August 19, 2011 5:13
> To: Michael Everson
> Cc: unicode Unicode Discussion ; unicore UnicoRe Discussion
> Subject: Re: Endangered Alphabets
>  
> This is about time we allocate a significant space within the Unicode code 
> space to work in the old fashion code page provisioning mode.
>  
> I'm not calling for any change to existing major allocations. However, this 
> is about time we allocate (not PUA) a large number of codes to a code page 
> based sub codes so that not only all 7000+ languages can Freely use it 
> without INTERFERENCE from Unicode and have the freedom to carry out research 
> works, like we were doing with the legacy 8bit codes.
>  
> All those in favour of creating code pages, please say yes, and others please 
> say why not.
>  
> Kind Regards
> Sinnathurai Srivas
> On 19 August 2011 10:55, Michael Everson  wrote:
> I'd like to invite everyone to support this worthwhile project:
> 
> http://www.kickstarter.com/projects/1496420787/the-endangered-alphabets-project/
> 
> Michael Everson * http://www.evertype.com/
> 
> 
> 
>  
> 

=
井作恆
John H. Jenkins
jenk...@apple.com





Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread John H. Jenkins

srivas sinnathurai 於 2011年8月19日 上午9:40 寫道:

> Why this suggestion?
> With current flat space, one code point is only allocated to one and only one 
> purpose.
> We can run out of code space soon.
> 


There are a couple of problems here.

We currently have over 860,000 unassigned code points.  Surveys of all known 
writing systems indicate that only a small fraction of these will be needed.  
Indeed, although it looks likely that Han will spill out of the SIP into plane 
3, all non-Han will likely fit into the SMP.  (Michael, you can correct me on 
this if I'm wrong.)

Even if we allow for the possibility that there are a lot of writing systems 
out there we don't know about, there would have to be a *lot* of writing 
systems out there we don't know about to fill up planes 4 through 14.  If the 
average script requires 256 code points, there would have to be some 2800 
unencoded scripts to do that.  

Moreover, it's taken us 20 years to use 250,000 code points.  Even if that rate 
remained steady (and it's been going down), it will take us something on the 
order of a century to fill up the remaining space, if that's even possible, and 
that hardly qualifies as "soon."

And there already is a code page switching mechanism such as you propose.  It's 
called ISO 2022 and it supports Unicode.  

In order to get the UTC and WG2 to agree to a major architectural change such 
as you're suggesting, you'd have to have some very solid evidence that it's 
needed—not an interesting idea, not potentially useful, but seriously *needed*. 
 That's how surrogates and the astral planes came about—people came up with 
solid figures showing that 65,536 code points was not nearly enough.  So far, 
the evidence suggests that we're in no danger of running out of code points.  

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com




Re: RTL PUA?

2011-08-19 Thread John H. Jenkins

Michael Everson 於 2011年8月19日 上午11:15 寫道:

> On 19 Aug 2011, at 18:01, Shriramana Sharma wrote:
> 
>>> Even though it isn't encoded?  That is, my understanding is that we *can't* 
>>> change the PUA to ON now, but that there is a suggestion that some *new* 
>>> hunk of PUA be created that is R, in order to balance the existing L. Is 
>>> that right?
>> 
>> Right, Michael is suggesting that, but since the properties of the PUA 
>> characters aren't binding as said above, this is also unnecessary.
> 
> Saying that does not make it possible for people to use PUA characters with 
> RTL directionality, since all the OSes treat them as LTR.
> 

Mac OS has a mechanism to override that default assumption, the 'prop' table.  
And hopefully people support RLO and LRO properly, which provides a 
general-purpose mechanism.  

>> Would mean yet another chunk of space where we aren't allowed to encode 
>> anything. (Yes yes I know all that about "plenty of space", but that space 
>> gets filled up pretty quickly. I predict/expect the SMP will be filled soon.)
> 
> Put a RTL PUA zone in Plane 14, which is mostly empty, and expected to remain 
> so, and you're done. 
> 

No, you're not, because the OSs/rendering engines would have to rev, and to be 
honest, there won't be a lot of enthusiasm for doing something like 
this so long as it isn't actually *required* in order to be Unicode conformant. 
(It's hard enough to get people to do the required stuff.) RTL, PUA support, 
and optional features are usually pretty low on most people's priority lists. 

I'm very sympathetic with the frustration people feel over the current 
situation, but, again, before you could convince the UTC to do this, you'd have 
to present pretty solid evidence that  the current solution doesn't work and 
that this would.  

=
John H. Jenkins
jenk...@apple.com






Re: Code pages and Unicode

2011-08-19 Thread John H. Jenkins

Benjamin M Scarborough 於 2011年8月19日 下午3:53 寫道:

> Whenever somebody talks about needing 31 bits for Unicode, I always think of 
> the hypothetical situation of discovering some extraterrestrial civilization 
> and trying to add all of their writing systems to Unicode. I imagine there 
> would be little to unify outside of U+002E FULL STOP.

Oh, I imagine they'll have one or two turtle ideographs.  :-)

Seriously, though, if and when we run into ETs with all their myriad writing 
systems, I really don't think that we'll be using Unicode to represent them.

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Code pages and Unicode

2011-08-22 Thread John H. Jenkins

Christoph Päper 於 2011年8月20日 上午2:31 寫道:

> Mark Davis ☕:
> 
>> Under the original design principles of Unicode, the goal was a bit more 
>> limited; we envisioned […] a generative mechanism for infrequent CJK 
>> ideographs,
> 
> I'd still like having that as an option.
> 


Et voilà!  We have Ideographic Description Sequences.  Or, if you're more 
ambitious, CDL.  

Generative mechanisms for Han are very attractive given the nature of the 
script, but once you try to support something other than display, or even try 
to write a rendering engine, all sorts of nasty problems crop up that have 
proven difficult to solve.  We won't even get into the problem of wanting to 
discourage people from making up new ad hoc characters for Han. 

I won't say some sort of generative mechanism will never become the preferred 
way of handling unencoded ideographs, but there is a lot of work to be done 
before that would be practical.
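For anyone unfamiliar with them, an Ideographic Description Sequence is itself just plain text: ordinary characters, no new mechanism. A quick Python illustration (assuming a Python build with a reasonably current Unicode database): 好 (U+597D) can be described as ⿰女子, with U+2FF0 meaning "arrange the next two components left to right."

```python
import unicodedata

# "⿰女子" describes an ideograph with 女 on the left and 子 on the right --
# i.e., the already-encoded 好 (U+597D).
ids = "\u2ff0\u5973\u5b50"  # ⿰ + 女 + 子

print(unicodedata.name("\u2ff0"))  # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
print(len(ids))                    # 3 -- just an ordinary character sequence
```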

=
John H. Jenkins
jenk...@apple.com






Re: RTL PUA?

2011-08-22 Thread John H. Jenkins

Doug Ewell 於 2011年8月22日 上午10:59 寫道:

> Petr Tomasek  wrote:
> 
>>> Some PUA properties, like glyph shapes and maybe directionality, can
>>> be stored in a font.  Others, like numeric values and casing, might
>>> not or cannot.  An interchangeable format needs to be agreed upon for
>> 
>> Why not?
> 
> Where does one store numeric values in a font?  Maybe this should be
> taken off-list.
> 


This is actually a relevant point.  The major TrueType variants all work 
primarily with glyphs, not characters.  Using them as a place to store 
information about the *characters* in the text is therefore not a reliable way 
to provide an override for default system behavior.  By the time the rendering 
engine consults the fonts for layout specifics, large chunks of the text 
processing will already be completed.  

OpenType, for example, expects that the bidi algorithm is largely run in 
character space, not glyph space, and therefore without regard for the specific 
font involved.  (AAT does almost everything in glyph space, including bidi.  
I'm not sure about Graphite.)  

The net result is that a font is an unreliable way of storing 
character-specific information useful on multiple platforms.  This is one 
reason why embedding the existing directionality controls within the text 
itself is currently the most reliable way of getting the behavior one might 
want in a platform-agnostic way.
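A small illustration of the character-space point (Python's unicodedata module, which simply reflects the Unicode Character Database): bidi classes attach to characters, with no font in sight, and PUA code points default to class L, which is exactly why the directional controls in the text, rather than the font, are the portable override.

```python
import unicodedata

# Bidirectional classes are character properties, independent of any font.
latin = unicodedata.bidirectional("A")        # 'L'
hebrew = unicodedata.bidirectional("\u05d0")  # 'R'  (Hebrew alef)
pua = unicodedata.bidirectional("\ue000")     # 'L'  (PUA defaults to LTR)

print(latin, hebrew, pua)  # L R L
```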

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: RTL PUA?

2011-08-22 Thread John H. Jenkins

William_J_G Overington 於 2011年8月22日 上午10:49 寫道:

> In the Description section of the Macintosh Roman section of a TrueType font, 
> include a line of text in a plain text format of which the following line of 
> text is an example.
> 
> PUA.RTL="$E000-$E1FF,$E440-$E447,$E541,$E549,$E57C,$EA00-$EA0F,$EC07";
> 

Forgive my asking, but this reference to the "description section of the 
Macintosh Roman section of a TrueType font" has me puzzled, because I don't 
know what you're talking about.  What table contains this string?

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: RTL PUA?

2011-08-22 Thread John H. Jenkins

William_J_G Overington 於 2011年8月22日 下午12:36 寫道:

> On Monday 22 August 2011, John H. Jenkins  wrote:
> 
>> Forgive my asking, but this reference to the "description section of the 
>> Macintosh Roman section of a TrueType font" has me puzzled, because I don't 
>> know what you're talking about.  What table contains this string?
> 
> When I use FontCreator, made by High-Logic, http://www.high-logic.com is the 
> webspace: with a font file open, I can select Format from the menu bar and 
> then select Naming... from the drop down menu.
> 
> That leads to a dialogue panel.
> 
> From that dialogue panel one may select, for an ordinary, basic Unicode font, 
> either of two platforms, namely Macintosh Roman and Microsoft Unicode BMP 
> only.
> 
> Having selected a platform, one may view the text content of various fields 
> for that platform, such as font family name and copyright notice, version 
> string and postscript name. There is then a button that is labelled 
> Advanced... that, if clicked, opens another dialogue panel with various other 
> text fields, including Font Designer and Description, which are the two that 
> I often use.
> 
> Now, when the text values in the fields are stored in the font file, the 
> values for the Macintosh Roman platform are stored in plain text and the 
> values for the Microsoft Unicode BMP only platform are stored in some encoded 
> format.
> 
> So, if one opens a TrueType font file in WordPad and one searches for an item 
> of plain text that is in one of the fields of the font, then the text that is 
> in the Macintosh platform can be found, yet the text that is in the Microsoft 
> Unicode BMP only platform cannot be found.
> 
> So, I thought that if a manufacturer of a wordprocessing application or a 
> desktop publishing application decided to make a "special researcher's 
> edition" of the software, then that software could, when a font is selected, 
> first scan the font for a PUA.RTL string and, if one is found, override the 
> left-to-right nature of the identified characters to be a right-to-left 
> nature, just while that font is selected.
> 
> Whether such a software package ever becomes available is something that only 
> time will tell, yet it seems to me that it is a method that could be used 
> without needing any changes by any committee.
> 

Ah.  You're referring to an entry in the 'name' table, then.  The intention of 
the 'name' table is to provide localizable strings for the UI.  Using it to 
store data of any sort for the rendering engine would be very, very 
inappropriate.  

In general, one should not be using a text editor to examine the contents of a 
TrueType font. It would be like using a text editor to examine the contents of 
an application.  Even if you see some plain text, you really don't have any 
sense for how it's actually being used.  

You may want to bone up on the structure of TrueType/OpenType fonts.

=
John H. Jenkins
井作恆
𐐖𐐱𐑌 𐐐. 𐐖𐐩𐑍𐐿𐐮𐑌𐑆
jenk...@apple.com







Re: RTL PUA?

2011-08-23 Thread John H. Jenkins

John Hudson 於 2011年8月23日 下午2:33 寫道:

> Behdad Esfahbod wrote:
> 
>>> I can see the advantages of such an approach -- performing GSUB prior to 
>>> BiDi
>>> would enable cross-directional contextual substitutions, which are currently
>>> impossible -- but the existing model in which BiDi is applied to characters
>>> *not glyphs* isn't likely to change. Switching from processing GSUB lookups 
>>> in
>>> logical order rather than reading order would break too many things.
> 
>> You can't get cross-directional-run GSUB either way because  by definition
>> GSUB in an RTL run runs RTL, and GSUB in an LTR run runs LTR.  If you do it
>> before Bidi, you get, eg, kerning between two glyphs which end up being
>> reordered far apart from eachother.  You really want GSUB to be applied on 
>> the
>> visual glyph string, but which direction it runs is a different issue.
> 
> Kerning is GPOS, not GSUB.
> 
> But generally I agree. My point was that Philippe's suggestion, although it 
> could be the basis of an alternative form of layout that might have some 
> benefits if fully worked out, is a radical departure from how OpenType works.
> 

I'll toss in my obligatory, "That's how AAT does it" reference.  It has 
advantages and disadvantages—but, as you say, OT would have to be heavily 
redesigned to do it.  

=
John H. Jenkins
井作恆
𐐖𐐱𐑌 𐐐. 𐐖𐐩𐑍𐐿𐐮𐑌𐑆
jenk...@apple.com







Re: RTL PUA?

2011-08-24 Thread John H. Jenkins

John Hudson 於 2011年8月23日 下午9:08 寫道:

> I think you may be right that quite a lot of existing OTL functionality 
> wouldn't be affected by applying BiDi after glyph shaping: logical order and 
> resolved order are often identical in terms of GSUB input. But it is in the 
> cases where they are not identical that there needs to be a clearly defined 
> and standard way to do things on which font developers can rely. [A parallel 
> is canonical combining class ordering and GPOS mark positioning: there are 
> huge numbers of instances, even for quite complicated combinations of base 
> plus multiple marks, in which it really doesn't matter what order the marks 
> are in for the typeform to display correctly; but there are some instances in 
> which you absolutely need to have a particular mark sequence.]

And this is really the key point.  There really isn't anything inherent to 
OpenType that absolutely *requires* the bidi algorithm be run in character 
space.  It would theoretically be possible to manage things in a fashion so 
that it's run afterwards, à la AAT.  But font designers *must* know which way 
it's being done in practice, and, in practice, all OT engines run the bidi 
algorithm in character space and not in glyph space.  At this point, trying to 
arrange things so that it can be done in glyph space instead is a practical 
impossibility.

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Code pages and Unicode

2011-08-24 Thread John H. Jenkins

Asmus Freytag 於 2011年8月23日 下午2:00 寫道:

> 
> Until then, I find further speculation rather pointless and would love if it 
> moved off this list (until such time).
> 


That would be wonderful, because we could then turn our attention to more 
urgent subjects, such as what to do when the sun reaches its red giant stage 
and threatens to engulf the Earth. ☺ 

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: Code pages and Unicode

2011-08-24 Thread John H. Jenkins
It has ceased to be. It's expired and gone to meet its maker. It's a stiff. 
Bereft of life, it rests in peace. … Its metabolic processes are now history. 
It's off the twig. It's kicked the bucket, it's shuffled off its mortal coil, 
run down the curtain and joined the bleedin' choir invisible.  This is an 
ex-possibility.

And even if that *weren't* true, there are nowhere *near* enough kanji to have 
a serious impact on Ken's analysis.  

Richard Wordingham 於 2011年8月24日 下午4:51 寫道:

> Has Japanese
> disunification been completely killed, or merely scotched?

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Glaring mistake in the code list for South Asian Script

2011-09-08 Thread John H. Jenkins
The Latin script covers alphabets for languages other than Latin.

The Arabic script covers alphabets for languages other than Arabic. 

CJK Ideographs aren't ideographs.  

U+FE18 PRESENTATION FORM FOR VERTICAL WHITE LENTICULAR BRAKCET isn't a brakcet. 
 

And so on.

I realize it is frustrating when Unicode and related standards show apparent 
indifference to getting names absolutely right.  The practical reality is that 
Unicode's intention is to use names which reflect standard or common English 
usage, not to be incontrovertibly correct.  Experts (or native speakers) may 
well use or prefer different terminology.  In some cases, such as Burmese, the 
terminology involved can be controversial, often for political reasons.  Almost 
never is it true that *everybody* agrees on a name/term.  

Moreover, for stability reasons, Unicode names can well be frozen.  Even if 
everybody comes to agree that a given name is absolutely and completely wrong, 
we can get stuck with it.  There was a time when Unicode was willing to change 
names, but that proved to be a very bad idea.  

The net result is that Unicode is loaded with misnomers.  And yes, this is 
unfortunate and often very embarrassing—but so long as Unicode does its 
intended job and makes it possible for people to represent texts written in the 
various languages it covers, it's something we just have to live with.

See also <http://www.unicode.org/faq/basic_q.html#4>.

=====
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: Controls, gliphs, flies, lemonade

2011-09-13 Thread John H. Jenkins

QSJN 4 UKR 於 2011年9月12日 下午9:06 寫道:

> I know it is sacred cow, but let me just ask, how do you people think.
> Is it good or bad that the codepoint means all about character: what,
> where, how... (see theme)? Maybe have we separate graph & control
> codes - wellnt have many problems, from banal ltr (( rtl instead ltr
> (rtl) to placing one tilde above 3, 4, anymore letters, or egyptian
> hierogliphs in rows'n'cols. Conceptually, I mean! Each letter in text
> is at least two codepoints ("what" and "where") in file. Is it stupid?
> Trying to render the text we anyway must generate this data.
> 


It's not really a sacred cow per se, but it is a fundamental architectural 
decision which would be pretty much impossible to revisit now.

Almost all writing is done using a small set of script-specific rules which are 
pretty straightforward.  English, for example, is laid out in horizontal lines 
running left-to-right and arranged top-to-bottom of the writing surface.  East 
Asian languages were traditionally laid out in vertical lines running from 
top-to-bottom and arranged right-to-left on the writing surface.  

Because some scripts are right-to-left and ltr and rtl text can be freely 
intermingled on a single line, Unicode provides plain-text directionality 
controls.  The preference, however, is to use higher-level protocols where 
possible.
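As a small illustration of the character-level data this rests on (my sketch, not part of the original message), Python's standard unicodedata module exposes the bidi class that the Unicode Bidirectional Algorithm (UAX #9) reads for each character, including the invisible directional controls:

```python
import unicodedata

# Each character carries a bidi class used to resolve display order:
# 'L' = left-to-right, 'R' = right-to-left, 'EN' = European number.
# U+200F RIGHT-TO-LEFT MARK is an invisible control with class 'R'.
for ch in ["A", "\u05D0", "1", "\u200F"]:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
```

Note that these classes are properties of the characters themselves, which is why the bidi algorithm can run in character space before any glyph processing.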

As for the scripts which are inherently two-dimensional (such as hieroglyphics, 
mathematics, and music), it's almost impossible to provide "plain text" support 
for them.  There is too much dependence on additional information such as the 
specifics of font and point size.  Because of this, the UTC decided long ago 
that layout for such scripts absolutely must be done using a higher-level 
protocol to handle all the details.

There are occasionally suggestions that positioning controls be added to plain 
text in Unicode, but so far the UTC has felt that the benefits are too marginal 
to overcome its reasons for having left them out in the first place.

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Controls, gliphs, flies, lemonade

2011-09-20 Thread John H. Jenkins
In re CJK, that's already a FAQ: http://www.unicode.org/faq/han_cjk.html#16.  
The short version is: if all you want to do is to draw something, then yes, 
making up new hanzi on the fly is a solvable problem.  If you want to do 
anything that deals with the *content* (lexical analysis, sorting, 
text-to-speech), it's an incredibly difficult problem.  

And, actually, there's already a way to insert nonstandard hanzi into text 
(well, two, if you count the Ideographic Variation Indicator), namely 
Ideographic Description Sequences.  They're clumsy and awkward, but they do 
make it possible to exchange text with unencoded hanzi in a vaguely standard 
fashion.  
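As a minimal sketch of the mechanism (using ⿰鱼丹, a combination requested elsewhere on this list), an Ideographic Description Sequence is just ordinary plain text: a description operator code point followed by its components, exchangeable anywhere Unicode text is:

```python
import unicodedata

# IDS for an unencoded hanzi: operator U+2FF0 (left-to-right composition)
# followed by the components 鱼 (U+9C7C) and 丹 (U+4E39).
ids = "\u2FF0\u9C7C\u4E39"
print(ids)
for c in ids:
    print(f"U+{ord(c):04X}", unicodedata.name(c))
```

Nothing here tells a renderer how to draw the composed character, which is exactly why IDSes are "clumsy and awkward": they describe, rather than display, the unencoded hanzi.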

And yes, Unicode is very complicated, but that's because of the problem it's 
intended to solve.  If all you're interested in is drawing text in a couple of 
common scripts, such as Latin and Japanese, then you really don't need Unicode 
with all of its complexity.  Unicode is trying to provide a basis for handling 
all aspects of plain text processing for all the languages of the world in a 
single application.  

Just go to Wikipedia and look down the long list of different languages that a 
popular subject has articles in.  *That* is what Unicode is trying to provide.  
It's very tough to implement, but fortunately on all the major platforms, there 
are libraries that make it unnecessary for you to do all the work yourself.

QSJN 4 UKR 於 2011年9月20日 下午9:01 寫道:

> Yes, i had written 'egyptian hieroglyphs' but how about banal CJK? We
> still have no way to insert nonstandard ideogramme into text. Isn't it
> a simple task? There are just 20 basic strokes :)  ok, 500 basic
> symbols. Or 20? However  we can't combine it together :( !
> Unicode is to complex standard. I even don't know how many properties
> have one character (did you know about unicode-coloured characters? -
> there was somewhere that my theme in this list), how can i know how my
> application has to render 'plain' text with bidi, noncanonicordered
> diacritics, and korean script. Right, i don't know that. And my
> application render it in my way, some else in another (a_a / aa_ -
> double comb. char., sure you seen that), so we have no standard at
> all.
> Off course, i can learn this complex standard, but what for? Most of
> them i never use.
> There must be a simpler system, not so many aprior data for it work.
> 
> 2011/9/13, John H. Jenkins :
>> 
>> QSJN 4 UKR 於 2011年9月12日 下午9:06 寫道:
>> 
>>> I know it is sacred cow, but let me just ask, how do you people think.
>>> Is it good or bad that the codepoint means all about character: what,
>>> where, how... (see theme)? Maybe have we separate graph & control
>>> codes - wellnt have many problems, from banal ltr (( rtl instead ltr
>>> (rtl) to placing one tilde above 3, 4, anymore letters, or egyptian
>>> hierogliphs in rows'n'cols. Conceptually, I mean! Each letter in text
>>> is at least two codepoints ("what" and "where") in file. Is it stupid?
>>> Trying to render the text we anyway must generate this data.
>>> 
>> 
>> 
>> It's not really a sacred cow per se, but it is a fundamental architectural
>> decision which would be pretty much impossible to revisit now.
>> 
>> Almost all writing is done using a small set of script-specific rules which
>> are pretty straightforward.  English, for example, is laid out in horizontal
>> lines running left-to-right and arranged top-to-bottom of the writing
>> surface.  East Asian languages were traditionally laid out in vertical lines
>> running from top-to-bottom and arranged right-to-left on the writing
>> surface.
>> 
>> Because some scripts are right-to-left and ltr and rtl text can be freely
>> intermingled on a single line, Unicode provides plain-text directionality
>> controls.  The preference, however, is to use higher-level protocols where
>> possible.
>> 
>> As for the scripts which are inherently two-dimensional (such as
>> hieroglyphics, mathematics, and music), it's almost impossible to provide
>> "plain text" support for them.  There is too much dependence on additional
>> information such as the specifics of font and point size.  Because of this,
>> the UTC decided long ago that layout for such scripts absolutely must be
>> done using a higher-level protocol to handle all the details.
>> 
>> There are occasionally suggestions that positioning controls be added to
>> plain text in Unicode, but so far the UTC has felt that the benefits are too
>> marginal to overcome its reasons for having left them out in the first
>> place.
>> 
>> =
>> Hoani H. Tinikini
>> John H. Jenkins
>> jenk...@apple.com
>> 
>> 
>> 
>> 
>> 
> 
> 

=
John H. Jenkins
jenk...@apple.com






Re: New version of UTR #45 published

2011-10-03 Thread John H. Jenkins

Philippe Verdy 於 2011年9月30日 下午11:32 寫道:

> What is the current status of this UTC's Extension E ?
> - If it's still not validated, then the description of the field in
> the last paragraph quoted below should not be there, but in a pending
> update of this UTR.
> - If it's approved, then the first paragraph should list E, and there
> should not be any reference to a "proposal" in the paragraph
> describing it.
> 

It's still being looked at by the IRG along with all the other Extension E 
submissions.  It's not a part of the standard yet, and it's subject to change, 
but it is a well-defined set of interest to people who are tracking IRG work.  

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: Unihan data for U+2B5B8 error

2011-10-19 Thread John H. Jenkins

Jukka K. Korpela 於 2011年10月19日 上午3:06 寫道:

> I don’t know what issue Shi Zhao is referring to, but there is definitively 
> an error on the page. Under the heading “Glyphs,” the small table contains, 
> under the header cell “The Unicode Standard”, a cell that appears to be 
> empty. 

This is a known (and, alas, long-standing) problem.  We really do intend to get 
it fixed, but it's impossible to say when.

=
John H. Jenkins
jenk...@apple.com






Re: Unihan data for U+2B5B8 error

2011-10-19 Thread John H. Jenkins

Andrew West 於 2011年10月19日 上午4:14 寫道:

> On 19 October 2011 10:43, shi zhao  wrote:
>> The page said kTraditionalVariant of U+2B5B8 is U+9858 願.
> 
> which is correct.
> 
>> ) said U+2B5B8 𫖸 is kSimplifiedVariant of U+9858 願, U+613F 愿 is
>> kSemanticVariant, but 愿 is simplified of 願, not U+2B5B8 𫖸.
> 
> which I agree is not correct.  It's not always clear how asymmetrical
> cases like this should be handled.  For U+9918 餘, which is analogous,
> with a common simplified form U+4F59 余 and an alternate simplified
> form U+9980 馀, the Unihan database lists them both as simplified
> variants of U+9918:
> 
> U+9918kSimplifiedVariant  U+4F59 U+9980
> 
> On this precedent, I would expect:
> 
> U+9858kSimplifiedVariant  U+613F U+2B5B8
> 

Actually, it's a bit more complicated than that.  Note that the 
kSemanticVariant field for U+613F is actually "U+9858…"

> I suggest you report this issue on the Unicode Error Reporting form:
> 
> <http://www.unicode.org/reporting.html>
> 

Always sage advice, since you can't count on there being anybody reading this 
mailing list who can make the change.  When you do so, *please* include a 
source for your information.  We get all kinds of offered corrections to the 
Unihan data which we can't use because there's no authoritative source. 

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Unihan data for U+2B5B8 error

2011-10-20 Thread John H. Jenkins

Andrew West 於 2011年10月20日 上午3:25 寫道:

> On 19 October 2011 18:41, John H. Jenkins  wrote:
>> 
>> U+613F kDefinition (variant/simplification of U+9858 願) desire, want, wish; 
>> (archaic) prudent, cautious
>> U+613F kSemanticVariant U+9858
>> U+613F kSpecializedSemanticVariant U+9858
>> U+613F kTraditionalVariant U+613F U+9858
>> U+613F kSimplifiedVariant U+613F
>> U+9858 kSimplifiedVariant U+613F U+2B5B8
>> U+9858 kSemanticVariant U+9613F
>> 
>> Andrew, does that look like it covers everything correctly?
> 
> Looks OK to me (except for the typo on the last line), although I
> wonder about the necessity for:
> 
> U+613F kSimplifiedVariant U+613F
> 
> Where a character can either traditionalify (what is the opposite of
> simplify?) to another character or stay the same then it is useful to
> have (e.g.):
> 
> U+613F kTraditionalVariant U+613F U+9858
> 
> But where a character does not change on simplification, is it not
> redundant to give it a kSimplifiedVariant mapping to itself ?  

Per the latest draft of UAX #38, if, when mapping from SC to TC, a character 
may change or may be left alone depending on context, it should be included 
among both its simplified and traditional variants.  And so…

> But there are other characters that fit this paradigm that do not have
> kSimplifiedVariant mappings to themself, such as:
> 
> U+5E72 干
> 
> But maybe that is a reflection of this line:
> 
> U+5E72kTraditionalVariant U+4E7E U+5E79
> 
> which I think should be:
> 
> U+5E72kTraditionalVariant U+4E7E U+5E72 U+5E79
> 


Yes, this should be fixed.  If you know of any others, please let me know.
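The convention under discussion can be sketched as data: when a simplified character may either change or stay the same under SC-to-TC conversion, it is listed among its own traditional variants. The dictionaries below use the mappings from this thread; the helper function and its name are illustrative, not part of UAX #38.

```python
# Variant mappings taken from the examples in this thread.
kTraditionalVariant = {
    "\u5E72": ["\u4E7E", "\u5E72", "\u5E79"],  # 干 -> 乾/干/幹, context-dependent
    "\u613F": ["\u613F", "\u9858"],            # 愿 -> 愿/願, context-dependent
}
kSimplifiedVariant = {
    "\u9858": ["\u613F", "\U0002B5B8"],        # 願 -> 愿 (common), 𫖸 (alternate)
}

def is_context_dependent(sc_char):
    # A character is context-dependent if it maps to itself *and* to
    # at least one other character on SC-to-TC conversion.
    variants = kTraditionalVariant.get(sc_char, [])
    return sc_char in variants and len(variants) > 1

print(is_context_dependent("\u5E72"))  # True: 干 sometimes stays 干
```

A converter consuming this data would then need context (a dictionary or language model, say) to pick among the listed traditional forms.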

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-21 Thread John H. Jenkins

Michael Everson 於 2011年11月21日 上午3:37 寫道:

> On 21 Nov 2011, at 07:23, Julian Bradfield wrote:
> 
>> Marking the (usually automatic) elisions is markup for elementary students.
> 
> I can't think of any reason why this shouldn't be achievable in plain text. 
> Many encoded characters exist for paedagogical reasons.


Well, on a theoretical level, the issue is whether or not this is needed for 
minimal legibility, that is, whether or not the essential meaning of the text 
can be conveyed without it.  Personally, I don't think this is needed for 
minimal legibility, but that's a judgement call.

On a more pragmatic level, there's the issue of how many people would actually 
implement this, were it to become part of the standard.  This is of pretty 
marginal utility—we have, after all, managed to go for twenty years encoding 
Latin texts without it—and it would be very difficult to implement.  From a 
cost/benefit perspective, it's a pretty sure bet that virtually nobody would go 
to the trouble.

Now, granted, just because almost nobody would implement it, that doesn't mean 
that it shouldn't be part of the standard.  There's a lot in the standard 
already that is implemented but rarely, if at all. And granted, there are other 
portions of the standard which are similar enough to this that if you implement 
them, you may as well implement this, too.  Still, this strikes me as being of 
such very marginal utility that efforts to get it implemented as part of a 
plain-text standard seem pretty quixotic to me.  

(And before anybody accuses me of being overly cynical, I should point out that 
I'm probably the person putting in the greatest effort to get the Deseret 
Alphabet to be actually *used*.  How quixotic is *that*?)

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Upside Down Fu character

2012-01-03 Thread John H. Jenkins
There are really three choices:

1) Don't encode it at all and rely on higher-level protocols to display it.  
(After all, it's only used in specialized contexts and does not have a distinct 
meaning or pronunciation from the regular 福.)

2) Use a registered ideographic variation sequence to support it.  (This is 
really a variation of #1.)

3) Add it to UTR #45 and submit it to the IRG for inclusion in Extension F.  

My own feeling is that either #1 or #2 would be best, given its specialized 
nature.  

On 2011年12月30日, at 上午8:34, Andre Schappo wrote:

> The character 福 means happiness 
> http://www.mdbg.net/chindict/chindict.php?page=chardict&cdcanoce=0&cdqchi=福
> 
> Unicode entry: U+798F  CJK UNIFIED IDEOGRAPH-798F
> 
> It is customary to use an upside-down version of 福 during the Spring Festival 
> http://en.wikipedia.org/wiki/Fu_character
> 
> I am considering proposing an upside-down version of 福 for inclusion in 
> Unicode. Not sure where it should go. Maybe - Enclosed Ideographic Supplement
> 
> Thoughts?
> 
> André 小山 Schappo
> ❀❀
> http://weibo.com/andreschappo
> http://blog.sina.com.cn/andreschappo
> http://twitter.com/andreschappo
> http://schappo.blogspot.com/
> http://me2day.net/andreschappo
> 



Re: Upside Down Fu character

2012-01-12 Thread John H. Jenkins
Kang-Hao (Kenny) Lu 於 2012年1月12日 上午12:13 寫道:

> * Three folks think this is rather unnecessary (including me). Some
> people go more and say "What about a code point for XXX and YYY?"

Do they have specific XXXs and YYYs in mind?

In general, the process is outlined at 
http://www.unicode.org/pending/proposals.html.  For hanzi, the characters need 
to be added to UTR #45 first, but I'm going to propose that for both the 
upside-down fuk1—er, fu, and the upside-down chun, since they have been 
discussed.  UTR #45 lets us track such discussions.

=
井作恆
John H. Jenkins
jenk...@apple.com



Re: Upside Down Fu character

2012-01-13 Thread John H. Jenkins
Hanzi have a slightly different way of getting into the standard because it's 
all done through the IRG, which receives submissions from each member body. 
Submissions from the UTC start by being added to UTR #45.  That, however, is 
merely a database to track potential characters we're aware of. It doesn't mean 
that the UTC plans to request their encoding. Characters generally start out 
with a status of X, meaning that no decision has been made.

From everything I've seen so far, my own recommendation would be that the 
upside-down fu, at least, be given status W (meaning "inappropriate for 
encoding").  If anybody wants to advocate encoding it, they need to write a 
document and submit it to the UTC. They would need to either provide evidence 
of actual use as a text element in plain-text (not as a graphic embedded in 
plain text—the emoji were a special case), or that it would be "widely" used as 
such (given a reasonable definition of "widely").  The UTC might well respond 
by asking for more information.  The current submission form is certainly a 
good template for providing information on a requested hanzi.

Assuming the UTC approves a status of "N" (to be encoded), the character would 
be included in the UTC's submission to the IRG for Extension F.  Work on 
Extension F will likely start in 2013.

Andre Schappo 於 2012年1月13日 上午8:36 寫道:

> 
> On 12 Jan 2012, at 16:54, John H. Jenkins wrote:
> 
>> Kang-Hao (Kenny) Lu 於 2012年1月12日 上午12:13 寫道:
>> 
>>> * Three folks think this is rather unnecessary (including me). Some
>>> people go more and say "What about a code point for XXX and YYY?"
>> 
>> Do they have specific XXXs and YYYs in mind?
>> 
>> In general, the process is outlined at 
>> http://www.unicode.org/pending/proposals.html.  For hanzi, the characters 
>> need to be added to UTR #45 first, but I'm going to propose that for both 
>> the upside-down fuk1—er, fu, and the upside-down chun, since they have been 
>> discussed.  UTR #45 lets us track such discussions.
>> 
>> =
>> 井作恆
>> John H. Jenkins
>> jenk...@apple.com
>> 
> 
> I have received a request for an upside-down 钱 (=qián = money = U+94B1).
> 
> I have talked with a small number of Chinese students about having an 
> upside-down fu character and they were all enthusiastic. I will be talking 
> with more Chinese students next week which is when the new term starts.
> 
> John: As you are progressing upside-down fu and chun characters into UTR #45 
> does this mean that I no longer need to submit a "Proposal Summary Form" for 
> upside-down fu? I have not yet actually started on said form.
> 
> André 小山 Schappo
> 



Re: Standaridized variation sequences for the Desert alphabet?

2017-03-22 Thread John H. Jenkins
My own take on this is "absolutely not." This is a font issue, pure and simple. 
There is no dispute as to the identity of the characters in question, just 
their appearance. 

In any event, these two letters were never part of the "standard" Deseret 
Alphabet used in printed materials. To the extent they were used, it was in 
hand-written material only, where you're going to see a fair amount of 
variation anyway. There were also two recensions of the DA used in printed 
materials which are materially different, and those would best be handled via 
fonts.

It isn't unreasonable to suggest we change the glyphs we use in the Standard. 
Ken Beesley and I have discussed the possibility, and we both feel that 
it's very much on the table.


Re: Standaridized variation sequences for the Desert alphabet?

2017-03-27 Thread John H. Jenkins

> On Mar 27, 2017, at 2:04 AM, James Kass  wrote:
> 
>> 
>> If we have any historic metal types, are there
>> examples where a font contains both ligature
>> variants?
> 
> Apparently not.
> 
> John H. Jenkins mentioned early in this thread that these ligatures
> weren't used in printed materials and were not part of the official
> Deseret set.  They were only used in manuscript.
> 

This is correct. Neither of the nineteenth century metal types included the 
letters in question. Nor were they included in any electronic fonts that I'm 
aware of before they were included in Unicode. 



Re: Standaridized variation sequences for the Desert alphabet?

2017-03-27 Thread John H. Jenkins

> On Mar 27, 2017, at 9:56 AM, John H. Jenkins  wrote:
> 
> 
>> On Mar 27, 2017, at 2:04 AM, James Kass <jameskass...@gmail.com> wrote:
>> 
>>> 
>>> If we have any historic metal types, are there
>>> examples where a font contains both ligature
>>> variants?
>> 
>> Apparently not.
>> 
>> John H. Jenkins mentioned early in this thread that these ligatures
>> weren't used in printed materials and were not part of the official
>> Deseret set.  They were only used in manuscript.
>> 
> 
> This is correct. Neither of the nineteenth century metal types included the 
> letters in question. Nor were they included in any electronic fonts that I'm 
> aware of before they were included in Unicode. 
> 

This should teach me to double-check before posting. Apparently, the earlier 
typeface *did* include all forty letters; it just didn't use these two. I don't 
know what glyphs were used.



Re: Standaridized variation sequences for the Desert alphabet?

2017-03-29 Thread John H. Jenkins

> On Mar 29, 2017, at 4:12 AM, Martin J. Dürst  wrote:
> 
> Let me start with a short summary of where I think we are at, and how we got 
> there.
> 
> - The discussion started out with two letters,
>  with two letter forms each. There is explicit talk of the
>  40-letter alphabet and glyphs in the Wikipedia page, not
>  of two different letters.
> - That suggests that IF this script is in current use, and the
>  shapes for these diphthongs are interchangeable (for those
>  who use the script day-to-day, not for meta-purposes such
>  as historic and typographic texts), keeping things unified
>  is preferable.
> - As far as we have heard (in the course of the discussion,
>  after questioning claims made without such information),
>  it seems that:
>  - There may not be enough information to understand how the
>creators and early users of the script saw this issue,
>on a scale that may range between "everybody knows these
>are the same, and nobody cares too much who uses which,
>even if individual people may have their preferences in
>their handwriting" to something like "these are different
>choices, and people wouldn't want their texts be changed
>in any way when published".

I see this part of the problem more one of proper transcription of existing 
materials, and less of one of what the original authors saw the issues as. 
Handwritten material is very important in the study of 19th century LDS 
history, and although the materials actually in the DA are scant (at best), the 
peculiarities of the spelling can be instructive. As such, I certainly agree 
that being able to transcribe material "faithfully" is important.

I'm not an expert in this area, though, so I can't speak for myself whether 
this separate encoding or variation selectors or some other mechanism is the 
best way to provide support for this. I'm more than happy to defer to Michael 
and other people who *are* experts. If paleographers think separate encoding is 
best, then I'm for separate encoding. 

>  - Similarly, there seem to be not enough modern practitioners
>of the script using the ligatures that could shed any
>light on the question asked in the previous item in a
>historical context, first apparently because there are not
>that many modern practitioners at all, and second because
>modern practitioners seem to prefer spelling with
>individual letters rather than using the ligatures.

Well, as one of the people in this camp, and as Michael has pointed out, I 
eschew use of these letters altogether. I restrict myself to the 1869 version 
of the alphabet, which is used in virtually all of the printed materials and 
has only thirty-eight letters. 

> - IF the above is true, then it may be that these ligatures
>  are mostly used for historic purposes only, in which case
>  it wouldn't do any harm to present-day users if they were separated.
> 
> If the above is roughly correct, then it's important that we reached that 
> conclusion after explicitly considering the potential of a split to create 
> inconvenience and confusion for modern practitioners, not after just looking 
> at the shapes only, coming up with separate historical derivations for each 
> of them, and deciding to split because history is way more important than 
> modern practice.


Fortunately, since the existing Deseret block is full, any separately encoded 
entities will have to be put somewhere else, making it easier to document the 
nature and purpose of the symbols involved. 

Not that we can be confident that it will help. 
(http://www.deseretalphabet.info/XKCD/1726.html)





Re: Upside Down Fu character

2012-01-13 Thread John H. Jenkins
Asmus Freytag 於 2012年1月13日 上午11:01 寫道:

> Nobody has written a formal proposal yet.
> 
> When that is done, then one of the questions that needs to be decided in 
> initial triage is whether these are elements of the han script proper or 
> iconic symbols that happen to be derived from han characters. (The proposal 
> may suggest a particular resolution of this issue). If, with all facts on the 
> table, the consensus is that they are "regular" han characters, then their 
> further evaluation starts with tracking them under TR#45 and potentially 
> taking them to IRG for possible consideration in extension F.

It's been suggested that one way of handling them would be as encoded hanzi. 
That's one criterion for going into UTR #45 as it is part of the paper trail of 
the UTC's decision process.  

And the UTC could always refuse to put them into UTR #45.  My job is to make 
the recommendation.

In either case, somebody other than me (that is, somebody who wants them added 
to Unicode) needs to write a document/proposal to the UTC justifying that and 
giving the options for encoding.  

=
John H. Jenkins
井作恆
𐐖𐐱𐑌 𐐐. 𐐖𐐩𐑍𐐿𐐮𐑌𐑆
jenk...@apple.com



Re: Unihan database

2012-04-13 Thread John H. Jenkins
Yes, this is very much possible, although I can't predict how soon we'll get it 
done.

Martin Heijdra  於 2012年4月13日 上午10:26 寫道:

> Librarians are certainly a group of users using Unihan a lot, to identify 
> encodings for rare characters.
>  
> Several of them have complained that it gets more and more difficult to use 
> for them. One issue is, that the database itself started to use encodings 
> rather than images; which made it impossible to find characters in versions 
> their standard SimSun fonts did not support. That of course had a solution; 
> they should now choose “use images”.
>  
> But now they report that the radical-stroke page itself has changed to 
> encodings rather than images; and the radicals are not in the standard fonts. 
> Hence, the search pages (clicking on the number of strokes of the radical)  
> shows something like
>  
> [inline image not preserved in archive]
>  
> Can there be a change so that also these pages (based upon the number of 
> strokes of the radical) has an option to show these pages, not only the 
> result, have a “display with images” option?
>  
> Martin J. Heijdra
> Chinese Studies/East Asian Studies Bibliographer 
> East Asian Library and the Gest Collection 
> Frist Campus Center, Room 314 
> Princeton University 
> Princeton, NJ 08544 
> United States

=
John H. Jenkins
井作恆
𐐖𐐱𐑌 𐐐. 𐐖𐐩𐑍𐐿𐐮𐑌𐑆
jenk...@apple.com





Re: A new character to encode from the Onion? :)

2012-04-30 Thread John H. Jenkins

Asmus Freytag  於 2012年4月30日 下午1:59 寫道:

> On 4/30/2012 12:27 PM, Bill Poser wrote:
>> 
>> Digital typography has reached The Onion: 
>> http://www.theonion.com/articles/errant-keystroke-produces-character-never-before-s,28030/.
>> 
> Quote:
> 
> , it is, in all likelihood, "probably just another goddamn fertility 
> symbol."
> 
> Make that: "currency symbol" and ship it.
> 
> 

Maybe a "turtle ideograph"?

=
井作恆
John H. Jenkins
jenk...@apple.com





Re: Plese add a Chinese Hanzi

2012-05-28 Thread John H. Jenkins

On 2012年5月28日, at 上午10:21, Charlie Ruland  wrote:

> Zhao,
> 1. If the character 鱼⿰丹 that you would like to have encoded is a contemporary 
> Standard Chinese word or morpheme, then what is its pronunciation?

FWIW, the correct syntax is ⿰鱼丹.  I take it that he would also like ⿰魚丹.
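(Well-formedness of such an Ideographic Description Sequence is mechanical to check: the operator comes first, in prefix order, with the right number of operands. A rough sketch, covering only the original operators U+2FF0..U+2FFB; later Unicode versions added more:)

```python
# Sketch: check that an Ideographic Description Sequence is well-formed,
# i.e. operators come first (prefix order) with the right operand count.
# Covers only the original IDC operators U+2FF0..U+2FFB.
def ids_arity(ch: str) -> int:
    cp = ord(ch)
    if cp in (0x2FF2, 0x2FF3):   # ⿲ and ⿳ take three operands
        return 3
    if 0x2FF0 <= cp <= 0x2FFB:   # the other IDCs take two
        return 2
    return 0                     # an ordinary character is a leaf

def is_well_formed_ids(s: str) -> bool:
    def consume(i: int) -> int:  # parse one description, return end index
        if i >= len(s):
            raise ValueError("truncated sequence")
        end = i + 1
        for _ in range(ids_arity(s[i])):
            end = consume(end)
        return end
    try:
        return consume(0) == len(s)
    except ValueError:
        return False

print(is_well_formed_ids("⿰鱼丹"))   # True: operator-first form
print(is_well_formed_ids("鱼⿰丹"))   # False: operator in the middle
```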

> 2. Can you provide material (for example photos, scans from books, etc.) that 
> clearly shows that 鱼⿰丹 is used as a single character? By which group of 
> people is it used?

Exactly.  *No* hanzi will be added to Unicode/ISO 10646 without solid evidence 
of actual use. Generally, this means authoritative, printed materials (a 
dictionary, government ID). Handwritten materials could conceivably be used, 
but they would have to be awfully convincing. Well-known websites with the 
character embedded *as a graphic* have been used in the past, but in those cases 
the character was quite well-known.


> Charlie
> 
> * shi zhao  [2012-05-28 17:07]:
>> PS:
>>  zh-hans:  鱼+丹 
>> zh-hant: 魚+丹
>> 
>> 
>> 2012/5/28 shi zhao 
>> Plese add a Hanzi to Unihan: a fish name 鱼+丹 = Danio.
>> 
>> see:
>>  https://en.wikipedia.org/wiki/Danio
>> https://zh.wikipedia.org/wiki/Category:%28%E9%AD%9A%E4%B8%B9%29%E5%B1%AC
>> http://www.cnffd.com/index.php?route=product/category&path=3_11_64_284
>> http://zd1.brim.ac.cn/Mnamelist.asp?start=1982
>> http://hello.area.com.tw/is_bs.cgi?areacode=nt097&bsid=2.9.1.1.3
>> https://www.google.com/search?q=Danio+ 魚丹
>> 
>> 
>> Chinese wikipedia: http://zh.wikipedia.org/
>> My blog: http://shizhao.org
>> twitter: https://twitter.com/shizhao
>> 
>> [[zh:User:Shizhao]]
>> 
>> 

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com




Re: Plese add a Chinese Hanzi

2012-05-30 Thread John H. Jenkins
Making a proposal directly to the IRG isn't possible under the present 
procedures.  What's usually done for this kind of thing is to have the UTC 
propose them.  

On May 30, 2012, at 8:14 AM, Andrew West wrote:

> I personally think that rather than add characters such as this
> piecemeal, it would be more useful if someone or some organization
> could research what newly devised, unencoded characters are in use in
> biology, chemistry, etc., and make a proposal to encode them all,
> either via the Chinese national body or directly to IRG.  Characters
> used in modern scientific literature should be considered urgent use,
> in my opinion, and encoded sooner rather than later.
> 

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Flag tags

2012-05-31 Thread John H. Jenkins

On May 31, 2012, at 11:57 AM, Michael Everson wrote:

> When you encode a flag for Germany and the US, you automatically get a demand 
> for the encoding of a flag for Ireland and Iceland. That's the way it is. 


Oh, c'mon, Michael, next you'll be saying that because some countries have 
currency symbols with dedicated code points, other countries will make *new* 
currency symbols and demand that *they* get dedicated code points, too. We all 
know how unrealistic a scenario *that* is.


=
John H. Jenkins
jenk...@apple.com






Re: Offlist: complex rendering

2012-06-18 Thread John H. Jenkins

On Jun 18, 2012, at 3:50 PM, Naena Guru wrote:

> Unicode says that it is all about codes and not shapes. It gave two examples, 
> Fraktur and Gaelic as scripts allowed to reside on Latin-1 but have shapes 
> not expected of Latin-1. That makes me wonder if Singhala is frowned upon 
> because it is not European. There is no other excuse because English was one 
> time romanized from fuþorc.


I'm going to regret this, but:

Unicode specifies semantics, not shapes.  The reason that drawing Latin 
characters with Sinhalese glyphs is incorrect is that they have different 
character semantics—that is, they behave differently.  It has nothing to do 
with Unicode failing to specify shapes.  

=====
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com




Re: Unicode Core

2012-06-22 Thread John H. Jenkins

On Jun 22, 2012, at 3:49 PM, vanis...@boil.afraid.org wrote:

> Wait a minute. Isn't 6.2 just adding the Turkish Lira? Does that really take 
> the chart people more than about 10 minutes?
> 

The only *character* change is the Turkish lira.  There are numerous updates to 
UAXes and other parts of the documentation.  

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: texteditors that can process and save in different encodings

2012-10-04 Thread John H. Jenkins
BBEdit and TextWrangler on OS X both do a good job at handling different 
encodings.

On Oct 3, 2012, at 10:58 PM, Stephan Stiller  wrote:

> Dear all,
> 
> In your experience, what are the best (plaintext) texteditors or word 
> processors for Linux / Mac OS X / Windows that have the ability to save in 
> many different encodings?
> 
> This question is more specific than asking which editors have the best 
> knowledge of conversion tables for codepages (incl their different versions), 
> which I'm interested in as well. There are a number of programs that appear 
> to be able to read many different encodings – though I prefer the type that 
> actually tells me about where format errors are when a file is loaded. Then, 
> many editors that claim to be able to read all those encodings cannot display 
> them; as for that, I don't care about font choice and the aesthetics of 
> display, as I'm only interested in plaintext.
> 
> Some things I have seen that are no good:
> the editor not telling me about the encoding and line breaks it has detected 
> and not letting me choose
> the editor displaying a BOM in hex mode even if there is none (a version of 
> UltraEdit I worked with at some point)
> 
> Stephan
> 
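(The error-position reporting Stephan asks for is exactly what a codec layer can expose; a minimal sketch in Python, with an invented byte string, of loading a file, reporting where the first decode error sits, falling back, and re-saving in another encoding:)

```python
# Sketch: decode bytes, report the offset of the first format error,
# fall back to Latin-1, then save under a different encoding.
# The input bytes are invented for illustration.
data = b"caf\xe9, na\xefve"          # Latin-1 bytes, not valid UTF-8

try:
    text = data.decode("utf-8")
except UnicodeDecodeError as err:
    print(f"not UTF-8: bad byte {data[err.start]:#04x} at offset {err.start}")
    text = data.decode("latin-1")    # lossless fallback for arbitrary bytes

utf16 = text.encode("utf-16")        # re-save under another encoding
print(text)                          # café, naïve
```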



Re: xkcd: ‮LTR

2012-11-26 Thread John H. Jenkins
Or, if one prefers:

http://www.井作恆.net/XKCD/1137.html

On Nov 21, 2012, at 10:22 AM, Deborah Goldsmith  wrote:

> 
> http://xkcd.com/1137/ 
> 
> Finally, an xkcd for Unicoders. :-)
> 
> Debbie
> 



Re: xkcd: LTR

2012-11-26 Thread John H. Jenkins
That's because the domain does, in fact, use sinograms and not Deseret.  (It's 
my Chinese name.)

On Nov 26, 2012, at 1:54 PM, Philippe Verdy  wrote:

> I wonder why this IDN link appears to me using sinograms in its domain name, 
> instead of Deseret letters. The link works, but my browser cannot display it 
> and its displays the Punycoded name instead without decoding it.
> 
> This is strange because I do have Deseret fonts installed and I can view 
> "Unicoded" HTML pages containing Deseret letters.
> 
> 
> 2012/11/26 John H. Jenkins 
> Or, if one prefers:
> 
> http://www.井作恆.net/XKCD/1137.html
> 
> On Nov 21, 2012, at 10:22 AM, Deborah Goldsmith  wrote:
> 
>> 
>> http://xkcd.com/1137/ 
>> 
>> Finally, an xkcd for Unicoders. :-)
>> 
>> Debbie
>> 
> 
> 



Re: ‮LTR

2012-11-29 Thread John H. Jenkins
I double-checked *very* carefully, and I didn't see anything wrong at all.  :-)

You got sharp eyes there, Doug.

On Nov 28, 2012, at 10:58 PM, Doug Ewell  wrote:

> John H. Jenkins wrote:
> 
>> Or, if one prefers:
>> 
>> http://www.井作恆.net/XKCD/1137.html
> 
> In all the ensuing discussion about this page, did anyone notice the typo in 
> the Deseret cartoon?
> 
> ☺
> 
> --
> Doug Ewell | Thornton, Colorado, USA
> http://www.ewellic.org | @DougEwell ­ 
> 





Re: I missed my self-imposed deadline for the Mayan numeral proposal

2012-12-21 Thread John H. Jenkins
http://xkcd.com/998/

On Dec 21, 2012, at 4:22 PM, Doug Ewell  wrote:

> And as you've no doubt heard to death by now, real Maya don't believe in that 
> apocalyptic mumbo-jumbo anyway. Today was a celebration.
> 
> --
> Doug Ewell | Thornton, Colorado, USA
> http://www.ewellic.org | @DougEwell
> From: Julian Bradfield
> Sent: ‎12/‎21/‎2012 15:55
> To: unicode@unicode.org
> Subject: Re: I missed my self-imposed deadline for the Mayan numeral proposal
> 
> On 2012-12-21, Clive Hohberger  wrote:
> > Don't worry, I think you now have another 5351 years until the next Mayan
> > Doomsday...
> 
> It's only 394 years till the next b'ak'tun.
> 
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 
> 



Re: Ideograms (was: Spiral symbol)

2013-01-30 Thread John H. Jenkins

On Jan 30, 2013, at 4:50 AM, Andreas Stötzner  wrote:

> Most ideographs in use are pictographs, for obvious reasons. But it would be 
> nice indeed to have ideograms for “thanks”,

謝

> “please”,

請

> “yes”,

對

> “no”,

不

> “perhaps”

許

> – all those common notions which cannot be de-*picted* in the true sense of 
> the word.
> 


I'm not being entirely snarky here. The whole reason why the term "ideograph" 
got attached to Chinese characters in the first place is that they can convey 
the same meaning while representing different words in different languages. 
Chinese writing was one of the inspirations for Leibniz' Characteristica 
universalis, for example.  

Personally, I think that extensive reliance on ideographs for communication is 
a bad idea. Again, Chinese illustrates this. The grammars of Chinese and 
Japanese are so very different that although hanzi are perfectly adequate for 
the writing of a large number of Sinitic languages, they are completely 
inadquate for Japanese.  Ideographs are fine for some short, simple messages 
("The lady's room lieth behind yon door"), but not for actually expressing 
*language*.

And, in any event, if you *really* want non-pictographic ways of conveying 
abstract ideas, most of the work has been already done for you.




Re: Ideograms

2013-01-30 Thread John H. Jenkins
I happen to disagree slightly with De Francis on this point, BTW.  Sensu 
stricto, he is correct, but looking at it in such a limited way minimizes the 
cross-language utility of sinograms. 花 *means* "flower" whether it's Mandarin 
"huā" or Japanese "ka" or Japanese "hana". Indeed, the fact that the same kanji 
can be used for both native Japanese words and Chinese loan-words illustrates 
my point.

De Francis' point is that you can't use hanzi for real communication other than 
the most basic (e.g., street signs). 花 means "flower" in China and Japan 
because it represents the Chinese morpheme for "flower" and the Japanese 
equivalent, not because it has any inherent "meaning" per se.  I feel that 
since many hanzi represent equivalent morphemes in several different languages, 
they can actually be said to have inherent "meaning" for all practical intents 
and purposes.

A Japanese reader can see the sentence "我有一只猫。" and come away with a general 
sense that it has something to do with "a cat," but they can't *read* it any 
more than a Chinese speaker can truly read the sentence "私は猫を所有している。" OTOH, 
both Japanese and Chinese can find 日本 on a map without any trouble, since it 
means "day-root" in both languages.  (Actually, it means "Japan" in both 
languages, but it literally means "day-root", too, and I think that sounds more 
poetic.)

On Jan 30, 2013, at 12:08 PM, Charlie Ruland  wrote:

> Yes, and on page 145 DeFrancis comes to the following conclusion:
> 
> Chinese characters represent words (or better, morphemes), not ideas, and 
> they represent them phonetically, for the most part, as do all real writing 
> systems despite their diverse techniques and differing effectiveness in 
> accomplishing the task.
> 
> The chapter these lines are from is also on-line: 
> http://www.pinyin.info/readings/texts/ideographic_myth.html .
> 
> Charlie
> 
> 
> * Tim Greenwood  [2013-01-30 20:17]:
>> A very accessible book on all this is "The Chinese Language: Fact and 
>> Fantasy" by John De Francis, published  in 1984 by University of Hawaii 
>> Press. There is a brief synopsis on Wikipedia 
>> http://en.wikipedia.org/wiki/The_Chinese_Language:_Fact_and_Fantasy
>> 
>> - Tim
>> 
>> 
>> 
>> On Wed, Jan 30, 2013 at 1:46 PM, John H. Jenkins  wrote:
>> 
>> On Jan 30, 2013, at 4:50 AM, Andreas Stötzner  wrote:
>> 
>>> Most ideographs in use are pictographs, for obvious reasons. But it would 
>>> be nice indeed to have ideograms for “thanks”,
>> 
>> 謝
>> 
>>> “please”,
>> 
>> 請
>> 
>>> “yes”,
>> 
>> 對
>> 
>>> “no”,
>> 
>> 不
>> 
>>> “perhaps”
>> 
>> 許
>> 
>>> – all those common notions which cannot be de-*picted* in the true sense of 
>>> the word.
>>> 
>> 
>> 
>> I'm not being entirely snarky here. The whole reason why the term 
>> "ideograph" got attached to Chinese characters in the first 
>> place is that they can convey the same meaning while representing different 
>> words in different languages. Chinese writing was one of the inspirations 
>> for Leibniz' Characteristica universalis, for example.  
>> 
>> Personally, I think that extensive reliance on ideographs for communication 
>> is a bad idea. Again, Chinese illustrates this. The grammars of Chinese and 
>> Japanese are so very different that although hanzi are perfectly adequate 
>> for the writing of a large number of Sinitic languages, they are completely 
>> inadequate for Japanese.  Ideographs are fine for some short, simple messages 
>> ("The lady's room lieth behind yon door"), but not for actually expressing 
>> *language*.
>> 
>> And, in any event, if you *really* want non-pictographic ways of conveying 
>> abstract ideas, most of the work has been already done for you.
>> 
>> 
>> 



Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-01 Thread John H. Jenkins

On Feb 1, 2013, at 6:07 AM, "Costello, Roger L."  wrote:

> So why would one ever generate text in decomposed form (NFD)?
> 

The Unihan database is stored in NFD because it makes the regular expressions 
used to qualify its contents much, *much* simpler.  I imagine that things like 
fuzzy text matching are easier in NFD.  At worst, it's about as useful as 
UTF-32: occasionally very handy in internal processing, but not terribly 
attractive overall.
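(A small illustration of why NFD simplifies such expressions, in Python; the sample string is arbitrary. In NFC an accented letter is one code point, so a pattern must enumerate every precomposed form; in NFD the base letter and its marks are separate, so one simple pattern, or a mark-stripping pass, covers them all:)

```python
import re
import unicodedata

s = "Běijīng"
nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)

# "ě" is the single code point U+011B in NFC, but "e" + U+030C in NFD,
# so a pattern over base letter + mark only matches the decomposed form.
print(re.findall("e\u030C", nfd))   # one match ("e" plus combining caron)
print(re.findall("e\u030C", nfc))   # [] -- the precomposed form hides it

# Accent-insensitive matching becomes "strip the combining marks":
base = "".join(c for c in nfd if not unicodedata.combining(c))
print(base)                         # Beijing
```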






Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-19 Thread John H. Jenkins

On Apr 19, 2013, at 1:52 PM, Stephan Stiller  wrote:

> But I'd argue that the distance of the information content of such 
> low-quality translations to the information content conveyed by correct and 
> polished language is often tolerable. Grammar isn't that important for 
> getting one's point across.

As my daughter says, "Talking is for to be understood, so if the meaning 
conveyed, the point happened."



Re: The pointless thread continues

2002-07-07 Thread John H. Jenkins


On Friday, July 5, 2002, at 08:54 AM, John Hudson wrote:

> Actually, this isn't nonsense. A single buggy font is quite capable of 
> crashing an operating system. Obviously the damage is not permanent, 
> presuming one is able to get the system started in safe mode and remove 
> the offending font. I've seen some spectacularly nasty fonts over the 
> years, as have many of my colleagues (including engineers in the type 
> group at Apple, so this isn't simply a Windows issue).
>

C'est vrai.  One of the fonts we used to print Unicode 2.0 killed *all* 
text display on the system if it were to be used with ATSUI.  It was kind 
of cool, actually.  We have a "font zoo" stashed away full of 
pathological fonts which have been known to do all kinds of interesting 
things if someone should be foolish enough to install them.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: How do I encode HTML documents in old languages ſuch as 17th century Swediſh in Unicode?

2002-07-07 Thread John H. Jenkins


On Wednesday, July 3, 2002, at 11:10 AM, Stefan Persson wrote:

> There is a big problem in the current Unicode ſtandard, ſince
> Fraktur letters aren't ſupported in any ſuitable manner.

Aargh!  Medial long-s!  Run away!  Run away!  :-)

======
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: ZWJ and Latin Ligatures (was Re: (long) Re: Chromatic font research)

2002-07-07 Thread John H. Jenkins


On Saturday, July 6, 2002, at 03:42 AM, James Kass wrote:

>
> We certainly agree that ligature use is a choice.  I think we diverge
> on just what kind of choice is involved.  You consider that ligature
> use is generally similar to bold or italic choices.  I consider use of
> ligatures to be more akin to differences in spelling.  If you're
> quoting from a source which used the word "fount", it is wrong to
> change it to "font".  And, if you're quoting from a source which
> used "hæmoglobin", anything other than "hæmoglobin" is incorrect.
> If the source used "&c.", it should never be changed to "etc.".
> So, if the source used the "ct" ligature...
>
>

I see your point, but I think we're to the stage where we'll just have to 
agree to disagree.  We *do* agree that ligation is a choice, but you're 
quite accurate in your assessment of where precisely we diverge.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: [OpenType] Proposal: Ligatures w/ ZWJ in OpenType

2002-07-07 Thread John H. Jenkins


On Saturday, July 6, 2002, at 04:11 PM, John Hudson wrote:

> There are going to be documents containing this character -- and ZWNJ -- 
> and fonts that do not contain these characters may display them with 
> .notdef glyphs. The only solution is system or application intelligence 
> that is able to ensure that no attempt is made to display glyphs for 
> these characters. This issue seems to have already been resolved in MS 
> text processing, at least as far as I have tested it in WordPad. I have 
> inserted a ZWJ character in a string of text using a standard PS Type 1 
> font, and the character is treated as a zero-width, no outline control 
> character.
>

Well, by default no attempt is made to display glyphs for these characters.
   (Somebody may have a "show invisibles" or equivalent on.  BTW, does OT 
have a "show invisibles" feature?  I'm too lazy to check right now.)  We 
also have a list of "invisible" characters which should, ordinarily, be 
left undisplayed including ZWJ, ZWNJ, the bidi overrides, and so on.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: Proposal: Ligatures w/ ZWJ in OpenType

2002-07-15 Thread John H. Jenkins


On Monday, July 15, 2002, at 09:58 AM, Doug Ewell wrote:
> No, what bothers me is that the ZWJ/ZWNJ ligation scheme is starting to
> look just like the DOA (deprecated on arrival) Plane 14 language tags.
> In each case, Unicode has created a mechanism to solve a genuine (if
> limited) need, but then told us -- officially or unofficially -- that we
> should not use it, or that it is "reserved for use with special
> protocols" which are never defined or mentioned again.
>

I'm not sure I agree with you here.  The position of the UTC is not that 
ZWJ should never be used and we're sorry we added it, which is the case of 
the Plane 14 language tags.  It's that the ZWJ should not be the primary 
mechanism for providing ligature support in many cases.  That's as far as 
it goes.

> The UTC may have "intended" that ZWJ ligation be used only in rare and
> exceptional circumstances, but UAX #27, revised section 13.2 doesn't say
> that.

The latest word is the Unicode 3.2 document, not the Unicode 3.1 document.  It says:

Ligatures and Latin Typography (addition)

It is the task of the rendering system to select a ligature (where 
ligatures are possible) as part of the task of creating the most pleasing 
line layout. Fonts that provide more ligatures give the rendering system 
more options.

However, defining the locations where ligatures are possible cannot be 
done by the rendering system, because there are many languages in which 
this depends not on simple letter pair context but on the meaning of the 
word in question. 

ZWJ and ZWNJ are to be used for the latter task, marking the non-regular 
cases where ligatures are required or prohibited. This is different from 
selecting a degree of ligation for stylistic reasons. Such selection is 
best done with style markup. See Unicode Technical Report #20, “Unicode in 
XML and other Markup Languages” for more information.

>  It says that ZWJ and ZWNJ *may be used* to request ligation or
> non-ligation, and that "font vendors should add ZWJ to their ligature
> mapping tables as appropriate."  It does acknowledge that some fonts
> won't (or shouldn't) include glyphs for every possible ligature, and
> never claims that they must (or should).  It specifically does *not* say
> that ZWJ ligation is to be restricted to certain orthographies, or to
> cases where ligation changes the meaning of the text.
>

This is correct.  Nor is this changed in Unicode 3.2.  The goal is to make 
the ZWJ mechanism available to people who feel it is appropriate to meet 
their needs, but to try to inform them that in the majority of cases, a 
higher-level protocol would be better.

Adobe doesn't have to revise InDesign, for example, to insert ZWJ all over 
when a user selects text and turns optional ligatures on.  OTOH, the hope 
is that if ligatures are available InDesign will honor the ZWJ-marked ones, even if ligation has been turned off.

John Hudson has recommended what seems a reasonable way to handle this in 
OT.  Apple will be releasing new versions of its font tools in the near 
future, and the documentation will include a recommendation for how this 
can be done with AAT.  We've been revising our own fonts as the 
opportunity presents itself to support ZWJ as well.  (The system and 
ATSUI-savvy applications require no revision.)

The push-back coming from the font community on the issue has to do mostly 
with the communications problem that they weren't aware of it in as timely 
a fashion as would have been best,  and the concern that font developers 
and application/OS developers will be forced to add ligature support where 
they have felt it inappropriate in the past.

> ZWJ/ZWNJ for ligation control is part of Unicode.  It is not always the
> best solution, but it is *a* solution, and should be available to the
> user without restriction or discouragement.
>

It's discouraged when it's inappropriate.  It isn't deprecated.  There are 
numerous places where Unicode provides multiple ways of representing 
something.  In this instance, Unicode is trying to delineate where a 
particular mechanism is appropriate and where inappropriate.
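(In code, the division of labor looks roughly like this; whether a ligature actually appears remains entirely up to the font and layout engine, and the example words are my own:)

```python
ZWJ  = "\u200D"   # ZERO WIDTH JOINER: request ligation at this point
ZWNJ = "\u200C"   # ZERO WIDTH NON-JOINER: prohibit ligation here

# German "Auflage": f and l straddle a morpheme boundary, so a careful
# author may prohibit the default fl ligature:
no_fl = "Auf" + ZWNJ + "lage"

# Conversely, request a ct ligature when quoting an archaic source:
want_ct = "chara" + "c" + ZWJ + "ter"

# Both characters are invisible and default-ignorable, so stripping
# them recovers the plain text for searching and comparison:
print(no_fl.replace(ZWNJ, ""))    # Auflage
print(want_ct.replace(ZWJ, ""))   # character
```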

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: Tamil Text Messaging in Mobile Phones

2002-07-28 Thread John H. Jenkins


On Sunday, July 28, 2002, at 02:43 PM, Doug Ewell wrote:

> Script reforms sometimes are successful, but often are not.  The
> benefits must be seen to outweigh the costs by a *significant* margin,
> and in most cases the proponents of reform do not adequately consider
> the costs.
>

OK, now's the time for us all to chant together,

Deseret! Deseret! Shaw! Shaw! Shaw!
Deseret! Deseret! Shaw! Shaw! Shaw!

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: "Missing character" glyph

2002-07-31 Thread John H. Jenkins


On Tuesday, July 30, 2002, at 08:58 PM, Doug Ewell wrote:

> Have Last Resort symbols been devised for all the blocks in Unicode,
> including the new ones like Tagalog?  Neither Mark Leisher's page nor
> the Apple typography page contains a complete list.
>
>

Yes.  It covers all of Unicode 3.2; but the font has been entirely 
redesigned.  We really need to update our documentation.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: "Missing character" glyph

2002-08-02 Thread John H. Jenkins


On Friday, August 2, 2002, at 01:24 AM, Martin Kochanski wrote:

> 1. Since all existing fonts already display the new character 
> correctly, there would be no overwhelming need for any font designer 
> to alter any font at all. If they choose, despite this, to copy their 
> own interpretation of 'missing character' from Glyph ID zero into the 
> new slot, this will also give exactly the display that is required.
>

This is where I start to get uncomfortable.  Most (not all) *TrueType* 
fonts display the new character correctly.  Other font technologies 
which are still in use and may be used, for example, by a Web browser, 
do *not* automatically use an empty box or anything visual for an 
undefined character.

There has been considerable uproar in the font development community 
lately about Unicode making unwarranted assumptions about how fonts 
work.  I think it would be improper for us to add a character to the 
standard on the basis of "font technology X solves the problem".  In 
particular, if we want to have a Web page that includes a visual 
representation for "this is what you'll see if your system can't 
support this character"—there are too many variables depending on the 
system and the particular application to make any guarantees.  This 
isn't a font issue.

Even mandating that the character should have the visual appearance of 
an unsupported character on the given platform/application/font is 
really meaningless.  On Mac OS X, the precise appearance of such a 
character can have any of several dozen appearances, depending on the 
Unicode block in which it's found.
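(That block-dependent behavior amounts to a lookup like the following; the table here is a tiny invented excerpt, not Apple's actual Last Resort data:)

```python
# Sketch: a Last Resort-style fallback chooses a placeholder glyph by
# Unicode block. The block table is a tiny invented excerpt.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0600, 0x06FF, "Arabic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x10400, 0x1044F, "Deseret"),
]

def last_resort_glyph(cp: int) -> str:
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return f"<{name} placeholder>"
    return "<undefined code point>"

print(last_resort_glyph(0x10425))   # <Deseret placeholder>
print(last_resort_glyph(0x4E95))    # <CJK Unified Ideographs placeholder>
```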

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/





Re: Taboo Variants (was Re: Digraphs as Distinct Logical Units )

2002-08-09 Thread John H. Jenkins


On Friday, August 9, 2002, at 06:45 AM, Andrew C. West wrote:

> The secondary examples (where the taboo-form is used as a phonetic 
> component in a more
> complex character) could be currently coded using Ideographic 
> Description Characters - e.g.  U+2E98, U+22606> and . Is there still a need 
> for an Ideographic Taboo
> Variation Indicator ?
>

Yes, because you do not *encode* characters using IDC's.  You describe 
them.  This is carefully explained in the standard.

Of course, using the taboo variant selector is about as vague as an 
IDC, so it doesn't make that much difference.

As to the proposed location, note that the byte-order mark got stuck 
with a bunch of Arabic compatibility forms.  Sometimes the odd 
character gets stuck in an odd place; as you say, there wasn't any room 
left in the more logical location, and this spot in the KangXi radicals 
block was pretty much never going to be used otherwise.  Six of one, as 
it were.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/





Re: Digraphs as Distinct Logical Units

2002-08-09 Thread John H. Jenkins


On Friday, August 9, 2002, at 03:54 AM, Andrew C. West wrote:

> And in China, historically the personal names of emperors (for 
> emperors read dictators) have been
> tabooed

An Ideographic Taboo Variation Indicator has been approved by the UTC 
for addition to the standard to handle precisely this kind of situation 
(see <http://www.unicode.org/unicode/alloc/Pipeline.html>.  It works on 
the theory that you rarely need to know the precise *form* of the taboo 
variant, just that a taboo form is being used.  There was some 
disagreement in WG2 about its utility, however, and there is the 
problem that, as you note, some taboo variants have already been 
encoded.  It's currently scheduled to be reconsidered by the UTC.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/





Re: Taboo Variants

2002-08-09 Thread John H. Jenkins


On Friday, August 9, 2002, at 11:38 AM, Andrew C. West wrote:

> For argument's sake, what are you going to do when I publish the 
> manuscript copy of a draft edition
> of the Kangxi dictionary that I recently purchased in a second-hand 
> bookstore in London that
> includes ten supplementary radicals not found in the printed editions ?
>
>

Realistically, they'd probably go in the CJK Radicals Supplement block.

> Given that there's going to be proposals for additional CJK symbols 
> and punctuation marks in the
> future (if no-one else does I've got a few I'll propose), surely it 
> would be better to simply create
> a "CJK Symbols and Punctuation B" block for the proposed IDEOGRAPHIC 
> TABOO VARIATION INDICATOR. It's
> irrelevant that the block will only have one charcacter to start with. 
> It's got to be better than
> poluting other blocks with characters that just don't belong there.
>

Well, nobody is strongly wedded to the current proposed allocation, to 
be frank.  It can always change.  I think the one thing people would 
hope is that the new character goes somewhere in the neighborhood of 
other Han characters.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/





Re: Taboo Variants

2002-08-09 Thread John H. Jenkins


On Friday, August 9, 2002, at 11:38 AM, Andrew C. West wrote:

> My point is that if the commonly encountered taboo variants are 
> already encoded in CJK-B, then
> either the other taboo variants should also be added to CJK-B or they 
> could be *described* using
> IDCs.

Encoding them was a mistake, pure and simple.  We didn't monitor the 
IRG well enough in the CJK-B encoding process, or we would have 
objected to this kind of cruft.

And describing them is a valid approach.  It depends on what's more 
important to you—the appearance (which IDS's are better at), or the 
semantic (which is explicit with the TVS).

> Adding a taboo variant selector does make a difference, because then 
> there'll be more than one
> way to reference the same character.
>

Well, yes and no.  Even though we've already got taboo variants 
encoded, we have no way to flag in a text that the purpose they're 
serving is taboo variants.  The interesting thing about the taboo 
variants is precisely that meaning:  This is character X written in a 
deliberately distorted way.  You identified the taboo variants you 
found in Ext B not based on anything in the standard, but because of 
your outside knowledge.  A student encountering them in a text may well 
be stymied until she goes to her professor.

Meanwhile, multiple encodings of the same Han character are *already* a 
major problem.  This is one reason why the UTC is determined to be 
stricter in the future to keep it from continuing to happen.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/





Re: Keys. (derives from Re: Sequences of combining characters.)

2002-09-27 Thread John H. Jenkins


On Friday, September 27, 2002, at 09:52 AM, [EMAIL PROTECTED] 
wrote:

> I doubt there's anyone on this list that always agrees with me


I think you're wrong, there, Peter.  I *never* disagree with you.  :-)

======
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.tejat.net/





Re: script or block detection needed for Unicode fonts

2002-09-29 Thread John H. Jenkins


On Saturday, September 28, 2002, at 03:19 PM, David Starner wrote:

> On Sat, Sep 28, 2002 at 01:19:58PM -0700, Murray Sargent wrote:
>> Michael Everson said:
>>> I don't understand why a particular bit has to be set in
>>> some table. Why can't the OS just accept what's in the font?
>>
>> The main reason is performance. If an application has to check the 
>> font
>> cmap for every character in a file, it slows down reading the file.
>
> Try, for example, opening a file for which you have no font coverage in
> Mozilla on Linux. It will open every font on the system looking for the
> missing characters, and it will take quite a while, accompanied by much
> disk thrashing to find they aren't there.
>

This just seems wildly inefficient to me, but then I'm coming from an 
OS where this isn't done.  The app doesn't keep track of whether or not 
a particular font can draw a particular character; that's handled at 
display time.  If a particular font doesn't handle a particular 
character, then a fallback mechanism is invoked by the system, which 
caches the necessary data.  I really don't see why an application needs 
to check every character as it reads in a file to make sure it can be 
drawn with the set font.
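(The display-time approach described above can be sketched with a per-character cache; the font objects here are stand-ins, not any real font API:)

```python
from functools import lru_cache

# Stand-in fonts: each exposes its cmap as a set of code points.
FONTS = {
    "Latin Font": set(range(0x0020, 0x007F)),
    "CJK Font": set(range(0x4E00, 0xA000)),
}

@lru_cache(maxsize=None)
def fallback_font(cp: int):
    """Resolve a code point to a covering font once, at display time;
    repeated characters hit the cache instead of rescanning every cmap
    (the disk-thrashing case described above)."""
    for name, cmap in FONTS.items():
        if cp in cmap:
            return name
    return None   # no coverage anywhere: draw a last-resort glyph

print(fallback_font(ord("A")))    # Latin Font
print(fallback_font(ord("井")))   # CJK Font
print(fallback_font(0x10400))     # None (Deseret not covered here)
```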

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.tejat.net/




