Re: Property-Problems

2000-12-06 Thread Tobias Hunger


On Wednesday 06 December 2000 02:29, Kenneth Whistler wrote:
> Tobias Hunger asked:
> > 1.) What are the EastAsian Width properties of the characters in the new
> > Private Use areas (Plane 15/16)?
>
> "A", the same as for the private use area in the BMP:
>
> > 2.) What are the Linebreaking Properties for those characters?
>
> "AI", the same as for the private use area in the BMP:

That is what I expected. Thank you for verifying that for me.
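
The width answer above can be spot-checked against whatever character database is at hand. The following is a minimal sketch, assuming Python's standard unicodedata module as the data source; the sample code points and the helper name pua_widths are illustrative only. That module does not expose the line-breaking class, so only the East Asian Width is compared.

```python
import unicodedata

# Sample code points: the BMP private use area and the Plane 15/16 private use areas.
SAMPLES = {
    "BMP PUA (U+E000)": "\uE000",
    "Plane 15 PUA (U+F0000)": "\U000F0000",
    "Plane 16 PUA (U+100000)": "\U00100000",
}

def pua_widths():
    """Return {label: (general category, East Asian Width)} for each sample."""
    return {
        label: (unicodedata.category(ch), unicodedata.east_asian_width(ch))
        for label, ch in SAMPLES.items()
    }

if __name__ == "__main__":
    for label, (category, width) in pua_widths().items():
        print(f"{label}: category={category}, east_asian_width={width}")
```

If the database agrees with the answer quoted above, all three samples report the same width value.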

> > 3.) How do you generate the PropList File? Some of the properties are
> > quite obvious (for example the Bidi-Properties), but others are a mystery
> > to me.
<snip>
> Some of the properties currently in PropList.txt are completely
> derivative from information in UnicodeData.txt, but were included
> in PropList.txt despite their redundancy, since PropList.txt gives
> a different "view" on properties. It gives a property by property
> list of all the characters with a particular property.

Yes, I noticed that this file is informational. But it is very useful to
have :-) From what I read about it, I guessed that it was completely derivable
from the other, normative, data. So these deltas were a surprise to me.

Thank you for pointing out this misconception.
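
For the part of PropList.txt that really is derivable, the "property by property" view amounts to inverting a per-character table. The following is a rough sketch of that inversion, assuming Python's unicodedata module stands in for UnicodeData.txt; the function names are made up for illustration, and this is not how the actual file is generated.

```python
import unicodedata
from collections import defaultdict

def chars_by_bidi_class(start=0x0000, end=0x07FF):
    """Group code points in [start, end] by their bidirectional class."""
    groups = defaultdict(list)
    for cp in range(start, end + 1):
        bidi = unicodedata.bidirectional(chr(cp))
        if bidi:  # an empty string means no value is defined for this code point
            groups[bidi].append(cp)
    return groups

def to_ranges(code_points):
    """Collapse a sorted list of code points into inclusive (first, last) ranges."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges

if __name__ == "__main__":
    for first, last in to_ranges(chars_by_bidi_class()["R"]):
        print(f"{first:04X}..{last:04X}")
```

Properties that are not derivable from the per-character data would still need their own hand-maintained lists, which is the point made in the reply quoted above.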

A side note: The standard is a great book to have when you need to work with
characters, as it points out many pitfalls that are not obvious to the
Latin-1-using programmer. It is well written and easy to understand. The only
problem I encounter from time to time is figuring out exactly which characters
you mean when you talk about groups of characters. Maybe it is possible
to define properties for all those groups in the next version of the
standard? This information is redundant, but it would help me greatly.

-- 
Regards,
Tobias

---
Tobias Hunger  The box said: 'Windows 95 or better'
[EMAIL PROTECTED]  So I installed Linux.
---




Re: Transcriptions of Unicode

2000-12-06 Thread addison

But NN6 *does* select a font for characters outside the so-called user's
locale when said characters are in a UTF-8 page. It appears that this
mechanism is somewhat haphazard for CJK unified ideographs: I usually get a
mix of fonts (probably because 'ja' is in my locale "stack" currently and
'zh' and 'ko' are not, so I guess Japanese fonts are preferred for
characters that are in JIS X 0208?).

AP

===
Addison P. Phillips    Principal Consultant
Inter-Locale LLC       http://www.inter-locale.com
Los Gatos, CA, USA     mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering & Consulting Services

On Mon, 4 Dec 2000, Erik van der Poel wrote:

> Mark Davis wrote:
> >
> > What wasn't clear from his message
> > is whether Mozilla picks a reasonable font if the language is not there.
>
> Sorry about the lack of clarity. When there is no LANG attribute in the
> element (or in a parent element), Mozilla uses the document's charset as
> a fallback. Mozilla has font preferences for each language group. The
> language groups have been set up to have a one-to-one correspondence
> with charsets (roughly). E.g. iso-8859-1 -> Western, shift_jis -> ja.
> When the charset is a Unicode-based one (e.g. UTF-8), then Mozilla uses
> the language group that contains the user's locale's language.
>
> In other words, Mozilla does not (yet) use the Unicode character codes
> to select fonts. We may do this in the future.
>
> Erik
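
The fallback order described in the quoted message (LANG attribute, then charset, then the user's locale for Unicode-based charsets) can be sketched as below. This is only an illustration of that order, not Mozilla's code: the table holds just the two mappings given in the message, the set of Unicode charsets is an assumption, the LANG value is returned unmapped for simplicity, and falling back to the locale group for an unknown charset is also an assumption.

```python
# Hypothetical charset -> language-group table with the two examples from the message.
CHARSET_TO_GROUP = {
    "iso-8859-1": "Western",
    "shift_jis": "ja",
}
# Assumed set of Unicode-based charsets.
UNICODE_CHARSETS = {"utf-8", "utf-16"}

def language_group(lang_attr, charset, locale_group):
    """Pick the language group whose font preferences should be used.

    lang_attr    -- LANG value from the element or a parent element, or None
    charset      -- the document's charset
    locale_group -- language group containing the user's locale language
    """
    if lang_attr:                        # 1. an explicit LANG attribute wins
        return lang_attr
    charset = charset.lower()
    if charset in UNICODE_CHARSETS:      # 2. Unicode charset: use the locale's group
        return locale_group
    # 3. otherwise the charset implies the group (unknown charsets: assumed fallback)
    return CHARSET_TO_GROUP.get(charset, locale_group)
```

With this order, a UTF-8 page without LANG attributes gets the locale's group for every character, which matches the "Japanese fonts are preferred" observation above when the locale is 'ja'.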
 




OT (Kind of): Determining whether Locales are left-to-right or right-to-left.

2000-12-06 Thread David Tooke



Is there a general mechanism for determining the directionality of a locale?

I am using Java Servlets to create HTML pages. Is there something that will
tell me when it is appropriate to generate the HTML in right-to-left as
opposed to left-to-right?

At the moment it looks like I have to maintain a table of right-to-left
locales myself. If that is the way to go, apart from Arabic (ar), Hebrew (he),
and Urdu (ur), for which other locales is it appropriate to set the
directionality to right-to-left? Is there a standard document somewhere
that would tell me?


Thanks in advance.

David Tooke
[EMAIL PROTECTED]



Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.

2000-12-06 Thread Michael (michka) Kaplan

Well, there are lots of other Arabic script locales. Here is a list from a
message from Elaine Keown just the other day:

Arabic  Balti  Baluchi  Berber  Farsi  Hausa  Karaite  Kashmiri  Kazakh
Kirghiz  Kurmanji  Luri  Mazanderani  Moplah  Panjabi---Pakistani  Pashto
Pulaar  Sindhi  Siraiki (also known as Saraiki or Lahnda or Western Panjabi)
Sulu  Uighur  Urdu  Uzbek  Wolof

There are also several Hebrew ones such as Yiddish, Aramaic, etc.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/







Re: OT (Kind of): Determining whether Locales are left-to-right or

2000-12-06 Thread Roozbeh Pournader



On Wed, 6 Dec 2000, David Tooke wrote:

> At the moment it looks like I have to maintain a table of right to left
> locales myself.   If that is the way to go, apart from the Arabic (ar);
> Hebrew (he); Urdu (ur) which other locales is it appropriate to set the
> directionality to right-to-left?  Is there a standard document somewhere
> that would tell me?

You can add this list:

   Persian (fa), Iranian and Iraqi Kurdish (ku_IR, ku_IQ), Pashtu (ps),
   and Yiddish (yi).

There are also others, but I believe them all to be in the three letter
(ISO 639-2) world: Baluchi (bal), Syriac (syr), etc.

--roozbeh
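
A hand-maintained table of the kind asked about could look roughly like this. It is a sketch only: the entries are just the codes named so far in this thread (certainly not a complete list), and the function name is_rtl_locale is made up for illustration.

```python
# Languages named in this thread as written right-to-left (not a complete list).
RTL_LANGUAGES = {"ar", "he", "ur", "fa", "ps", "yi"}
# Kurdish was listed only for particular countries (Iran, Iraq).
RTL_LOCALES = {("ku", "IR"), ("ku", "IQ")}

def is_rtl_locale(locale):
    """Return True if a locale such as 'ar-EG' or 'ku_IQ' is in the RTL table."""
    parts = locale.replace("-", "_").split("_")
    language = parts[0].lower()
    country = parts[1].upper() if len(parts) > 1 else ""
    return language in RTL_LANGUAGES or (language, country) in RTL_LOCALES
```

For example, is_rtl_locale("ar-EG") and is_rtl_locale("ku_IQ") are true, while is_rtl_locale("ku_TR") is not. Note that Java's Locale class has historically reported the older codes "iw" and "ji" for Hebrew and Yiddish, so a Java version of this table would probably want both spellings.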





Re: OT (Kind of): Determining whether Locales are left-to-right or

2000-12-06 Thread Antoine Leca

Michael Kaplan wrote:
>
> Well, there are lots of other Arabic script locales. Here is from a message
> from Elaine Keown just the other day:
>
> Arabic  Balti  Baluchi  Berber  Farsi  Hausa  Karaite  Kashmiri  Kazakh
> Kirghiz  Kurmanji  Luri  Mazanderani  Moplah  Panjabi---Pakistani  Pashto
> Pulaar  Sindhi  Siraiki (also known as Saraiki or Lahnda or Western Panjabi)
> Sulu  Uighur  Urdu  Uzbek  Wolof

Urdu written in Nagari script is left-to-right? This is new to me...

Of course, a similar pitfall exists for a number of other "locales" when one
equates them with the mere language.


OTOH, don't forget the "other" RTL scripts, such as Thaana (Maldivian,
for the Divehi language) and the Syriac scripts.


Antoine

> ----- Original Message -----
> From: "David Tooke" [EMAIL PROTECTED]
>
> I am using Java Servlets to create HTML pages.   Is there something that
> will tell me when it is appropriate to generate the HTML in right to left as
> opposed to left to right?

Why do you want to generate HTML in "right to left"?
Isn't HTML just a stream of characters that runs from "begin" to "end"?

Just do nothing; it is the browser's job to do the visual reversing.

> At the moment it looks like I have to maintain a table of right to left
> locales myself.   If that is the way to go, apart from the Arabic (ar);
> Hebrew (he); Urdu (ur) which other locales is it appropriate to set the
> directionality to right-to-left?  Is there a standard document somewhere
> that would tell me?

Now, can you tell me how this scheme will handle boustrophedon, unless you
know in advance the size of the displaying window...


Antoine



Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.

2000-12-06 Thread David Tooke

Thanks for your prompt replies.

I noticed from that list that there are quite a few languages that do not
have two-character ISO 639 codes:

Balti  Baluchi  Berber  Hausa  Karaite  Kurmanji  Luri  Mazanderani  Moplah
Pulaar  Siraiki (also known as Saraiki or Lahnda or Western Panjabi)  Sulu

Is it true that one would not be able to set their browser locale to these
languages, as it appears an ISO 639 code is a prerequisite for this?

plus...
dumb question 1. Is Aramaic (which doesn't seem to have a two-character ISO
code) the same as Amharic (which does: AM)? If not, Amharic appears to be
a Semitic language too; is that written right-to-left as well?

dumb question 2. Are there any known cases where the full locale name
(language+country+variant) has a different directionality from that of the
root language? I know that some languages are written in different scripts
depending on the locale; are there any cases where two scripts share the
same language code in their locale but differ in their writing direction?








Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.

2000-12-06 Thread Michael (michka) Kaplan

From: "David Tooke" [EMAIL PROTECTED]

> I noticed from that list that there are quite a few languages that do not
> have 2 character ISO 639 codes.
>
> Balti  Baluchi  Berber  Hausa  Karaite  Kurmanji  Luri  Mazanderani  Moplah
> Pulaar  Siraiki (also known as Saraiki or Lahnda or Western Panjabi)  Sulu
>
> Is it true that one would not be able to set their browser locales to these
> languages as it appears ISO 639 is a pre-requisite for this?

I do not think that is universally true, no.

> plus...
> dumb question 1.  Is Aramaic (which doesn't seem to have a 2 character ISO
> code) the same as Amharic (which does...AM)?   If not, Amharic appears to be
> a Semitic language too, is that written right-to-left too?

Amharic uses the Ethiopic script, and is not RTL as far as I know. Aramaic
has no native speakers (unless you count Hugh Nibley, who reportedly wigged
out during a class one day and started lecturing in Aramaic -- witnessed by
two people I know among the 50+ in the class!) so while you may have
Aramaic content, you probably would not have your machine set to use it as a
locale. :-)

> dumb question 2. Are there any known cases where the full locale name
> (language+country+variant) has a different directionality as for the root
> language?   I know that some languages are written in different scripts
> based on the locale; are there any cases where there are two scripts that
> have the same language code in their locale but differ in their writing
> direction?

Well, there are some languages in the former Soviet Union that are
considering an Arabic script either instead of or in addition to existing
Latin/Cyrillic scripts. Not sure if any have been officially adopted?

BTW - I try not to answer stupid questions, so you can assume I disagree with
your characterization since I answered them. :-)

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/





Re: OT (Kind of): Determining whether Locales are left-to-right or

2000-12-06 Thread John Cowan

"Michael (michka) Kaplan" wrote:

> Well, there are some languages in the former Soviet Union that are
> considering an Arabic script either instead of or in addition to existing
> Latin/Cyrillic scripts. Not sure if any have been officially adopted?

I missed this bit before.

Mongolian (not a language of the former USSR) has Cyrillic and Mongolian-script
representations; they are not automatically interconvertible.  Cyrillic is
L2R, of course; Mongolian is T2B.

-- 
There is / one art   || John Cowan [EMAIL PROTECTED]
no more / no less    || http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein



Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.

2000-12-06 Thread David Tooke

> > Is it true that one would not be able to set their browser locales to these
> > languages as it appears ISO 639 is a pre-requisite for this?
>
> I do not think that is universally true, no.

But according to RFC 1766, which governs the language tags in HTML and in
HTTP, only two-character ISO 639 language codes, 'i' tags registered with
the IANA, and 'x' private tags are valid.
There seem to be very few languages registered with IANA, and certainly none
of the ones mentioned earlier.
Similarly, the same seems to hold for Java locales: they do not appear to
actually validate the language, but from the documentation it seems that a
two-character ISO 639 code is what is expected. Do you think it is possible
that some user agents could have language strings using (say) the
three-character ISO language identifiers, e.g. "syr"?

> BTW - I try not to answer stupid questions, so you can assume I disagree with
> your characterization since I answered them. :-)

You're very gracious.  :-)


David Tooke
[EMAIL PROTECTED]
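
The RFC 1766 rules summarised above can be turned into a small classifier. This is a sketch under assumptions: the function name and the returned labels are invented for illustration, and it checks only the shape of the primary tag, not whether a two-letter code is actually assigned in ISO 639 or whether an 'i' tag is really registered.

```python
import re

# RFC 1766 syntax: a primary tag plus optional subtags, each 1 to 8 ASCII letters.
TAG_RE = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z]{1,8})*$")

def classify_language_tag(tag):
    """Classify a language tag by its primary tag, following the rules above."""
    if not TAG_RE.match(tag):
        return "invalid syntax"
    primary = tag.split("-")[0].lower()
    if len(primary) == 2:
        return "ISO 639 two-letter code"
    if primary == "i":
        return "IANA-registered"
    if primary == "x":
        return "private use"
    return "not defined by RFC 1766"
```

Under these rules "ar-EG" is accepted as a two-letter code, "x-klingon" is private use, and a bare three-letter tag such as "syr" falls into the last bucket, which is the gap being asked about.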









Re: OT (Kind of): Determining whether Locales are left-to-right

2000-12-06 Thread Michael Everson

At 10:54 -0800 2000-12-06, David Tooke wrote:
> But according to RFC-1766 that governs the language tags in HTML and in
> HTTP, only two character ISO 639 language codes, 'i' tags registered with
> the IANA and 'x' private tags are valid.

This is being revised to include the 639-2 codes.

> There seem very few languages registered with IANA and certainly none of the
> ones mentioned earlier.

You are welcome to register them.

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Mob +353 86 807 9169 ** Fax +353 1 478 2597 ** Vox +353 1 478 2597
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





Re: OT (Kind of): Determining whether Locales are left-to-right or

2000-12-06 Thread David Tooke

> In general, directionality is a script property, not a language
> or locale property: any language written using Arabic, Hebrew,
> Syriac, or Thaana script remains right-to-left even when embedded
> in some foreign locale, unless it is transliterated into Latin script.

Yes, I realise that is true. I am, however, trying to determine when it is
appropriate to generate a web page or an applet in right-to-left as opposed
to left-to-right. I am assuming that the browser (and/or operating system)
is going to render the actual text in the correct visual order as defined by
the Unicode Bidi Algorithm.

However, I still need to indicate whether the page itself should be oriented
in right-to-left format (i.e. with labels to form fields on the right, not
the left).

I would like to be able to determine, as automatically as possible, what
would be best for the user, which means trying to figure it out based on
their locale.

I think, for example, it would be appropriate to show a form oriented
right-to-left to someone who has their browser set to 'ar-EG', even if the
application has not been translated into Arabic.

Unfortunately, the application is such that maintaining preferences for each
user is not possible, so I am trying to make a best guess at it.

----- Original Message -----
From: "John Cowan" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, December 06, 2000 12:35 PM
Subject: Re: OT (Kind of): Determining whether Locales are left-to-right or


> David Tooke wrote:
>
> > dumb question 1.  Is Aramaic (which doesn't seem to have a 2 character ISO
> > code) the same as Amharic (which does...AM)?
>
> No.
>
> > If not, Amharic appears to be
> > a Semitic language too, is that written right-to-left too?
>
> No, Amharic is written with Ethiopic script, which is left-to-right.
> In general, directionality is a script property, not a language
> or locale property: any language written using Arabic, Hebrew,
> Syriac, or Thaana script remains right-to-left even when embedded
> in some foreign locale, unless it is transliterated into Latin script.
>
> --
> There is / one art   || John Cowan [EMAIL PROTECTED]
> no more / no less    || http://www.reutershealth.com
> to do / all things   || http://www.ccil.org/~cowan
> with art- / lessness \\ -- Piet Hein
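
The point that directionality belongs to the script rather than to the language or locale can be checked character by character. The following is a small sketch, assuming Python's unicodedata module as the source of bidirectional classes; the one-letter-per-script sample table and the function name are illustrative only.

```python
import unicodedata

# One representative letter per script mentioned in the thread.
SCRIPT_SAMPLES = {
    "Arabic": "\u0627",    # ARABIC LETTER ALEF
    "Hebrew": "\u05D0",    # HEBREW LETTER ALEF
    "Syriac": "\u0710",    # SYRIAC LETTER ALAPH
    "Thaana": "\u0780",    # THAANA LETTER HAA
    "Ethiopic": "\u1200",  # ETHIOPIC SYLLABLE HA
    "Latin": "A",
}

RTL_CLASSES = {"R", "AL"}  # the strong right-to-left bidirectional classes

def script_directions():
    """Map each sample script to 'rtl' or 'ltr' from its letter's bidi class."""
    return {
        script: "rtl" if unicodedata.bidirectional(ch) in RTL_CLASSES else "ltr"
        for script, ch in SCRIPT_SAMPLES.items()
    }
```

Arabic, Hebrew, Syriac, and Thaana come out right-to-left, and Ethiopic (the script used for Amharic) comes out left-to-right, whatever locale the text happens to appear in.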




Re: OT (Kind of): Determining whether Locales are left-to-right or

2000-12-06 Thread John Cowan

David Tooke wrote:

> I am assuming that the browser (and/or operating system)
> is going to render the actual text in the correct visual order as defined by
> the Unicode Bidi Algorithm.
> However I still need to indicate whether the page itself should be oriented
> in right-to-left format (i.e. with labels to form fields on the right not
> the left).

If the text is right-to-left, then widgets/controls embedded in the text
will be rendered to the right of the text they follow, so you shouldn't
need to do anything different at all.

> I think, for example, it would be appropriate to show a form oriented
> right-to-left to someone who has their browser set to 'ar-EG', even if the
> application has not been translated into arabic.

Ah, I see.

I think it would be very weird to render an English-language application with
labels on the right of their fields, just because the user also understands
Arabic.  Overall directionality, like local directionality, is a property of
the script in which the current language is written, not a question of
cultural preference.

Would you expect a Hebrew-speaking person to want to start reading at the back
of a book written in English?

-- 
There is / one art   || John Cowan [EMAIL PROTECTED]
no more / no less    || http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein



OT (Kind of): Determining whether Locales are left-to-right or

2000-12-06 Thread David Tooke

> I think it would be very weird to render an English-language application
> with labels on the right of their fields, just because the user also
> understands Arabic.

The application is a database application where the majority of fields are
from a Unicode database and user-entered.  Their text is likely to be in
Arabic.  Therefore, as far as I am concerned, the *content* of the page is
in Arabic, not English, despite it being an English application.  So the page
should be formatted as if it is an Arabic page with some English text.

As it is a Unicode database, I do not want to try to determine what
language/script *exactly* is being used.  That would involve scanning the
Unicode characters and a lot more jiggery-pokery than I need.






Re: OT (Kind of): Determining whether Locales are left-to-right or

2000-12-06 Thread David Tooke

> You're the boss, but it still sounds like an English page with embedded
> Arabic text to me.

Just because the application used to create the content is in English, that
doesn't make the content English, any more than if your Hebrew speaker wrote
a book using an English version of his word processing software.
The fact that the application has to expose certain utilitarian English
labels to the user does not make the content of the page any less Arabic.

> The Unicode folks have nicely arranged that the RTL characters are all going
> to be in the ranges U+0590 through U+08FF and U+10800 to U+10FFF, of which
> only the first range matters just yet.  This is a rather modest test, and
> probably more reliable than using the browser setting.

But again, just because there are *some* RTL characters in the output, that
does not make *all* the content RTL.  Plus, this would result in some
weirdness where the same user could go into the same page with two different
parameters and get it first in LTR, then in RTL, just because the database
hit an RTL character the second time.

Obviously, there's no ideal way of handling this.   We could just say f*k
it...everybody sees it in LTR.   But I thought trying to figure it out from
the browser might be more user friendly.
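
The "figure it out from the browser" approach could be sketched as follows: read the Accept-Language header, take the most preferred language, and choose a value for the HTML dir attribute. This only illustrates the heuristic proposed above, which not everyone in the thread agrees with. The language table holds just the codes named in this thread, and both function names are hypothetical.

```python
RTL_LANGUAGES = {"ar", "he", "ur", "fa", "ps", "yi"}  # codes named in this thread

def preferred_languages(accept_language):
    """Parse an Accept-Language header into tags ordered by descending q-value."""
    entries = []
    for item in accept_language.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        entries.append((quality, tag.strip().lower()))
    # sorted() is stable, so entries with equal q-values keep header order
    return [tag for quality, tag in sorted(entries, key=lambda e: -e[0])]

def page_direction(accept_language):
    """Return 'rtl' or 'ltr' for the HTML dir attribute, from the top language."""
    tags = preferred_languages(accept_language)
    if tags and tags[0].split("-")[0] in RTL_LANGUAGES:
        return "rtl"
    return "ltr"
```

With a header of "ar-EG, en;q=0.8" this yields "rtl", so the servlet would emit dir="rtl" on the page; an empty or English-first header yields "ltr". The result depends only on the request, so the same user sees the same orientation regardless of what the database returns.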



----- Original Message -----
From: "John Cowan" [EMAIL PROTECTED]
To: "David Tooke" [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, December 06, 2000 3:49 PM
Subject: Re: OT (Kind of): Determining whether Locales are left-to-right or


> David Tooke wrote:
>
> > The application is a database application where the majority of fields are
> > from a Unicode database and user-entered.  Their text is likely to be in
> > Arabic.  Therefore, as far as I am concerned, the *content* of the page is
> > in Arabic not English despite it being an English application.  So the page
> > should be formatted as if it an Arabic page with some English text.
>
> > As it is a Unicode database; I do not want to try to determine what
> > language/script *exactly* is being used.  That would involve scanning the
> > Unicode characters and a lot more jiggery-pokery than I need.
>
> The Unicode folks have nicely arranged that the RTL characters are all going
> to be in the ranges U+0590 through U+08FF and U+10800 to U+10FFF, of which
> only the first range matters just yet.  This is a rather modest test, and
> probably more reliable than using the browser setting.
>
> --
> There is / one art   || John Cowan [EMAIL PROTECTED]
> no more / no less    || http://www.reutershealth.com
> to do / all things   || http://www.ccil.org/~cowan
> with art- / lessness \\ -- Piet Hein
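
The "rather modest test" on code point ranges might look like this. It is a sketch assuming the two quoted ranges are inclusive; the function names are invented, and the share calculation is one possible response to the objection that *some* RTL characters should not flip the whole page.

```python
# The two ranges quoted above, assumed here to be inclusive.
RTL_RANGES = ((0x0590, 0x08FF), (0x10800, 0x10FFF))

def is_rtl_code_point(cp):
    """Return True if the code point lies in one of the quoted RTL ranges."""
    return any(low <= cp <= high for low, high in RTL_RANGES)

def rtl_share(text):
    """Fraction of the alphabetic characters in text that fall in the RTL ranges."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_rtl_code_point(ord(ch)) for ch in letters) / len(letters)
```

A page could then be oriented right-to-left only when rtl_share of its field contents passes some threshold (the threshold itself would be a judgment call). The compatibility presentation forms for Hebrew and Arabic, which start at U+FB1D, lie outside both ranges, so data containing them would need an extra range.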




Re: Transcriptions of Unicode

2000-12-06 Thread James Kass


Erik van der Poel wrote:

> The font selection is indeed somewhat haphazard for CJK when there are
> no LANG attributes and the charset doesn't tell us anything either, but
> then, what do you expect in that situation anyway? I suppose we could
> deduce that the language is Japanese for Hiragana and Katakana, but what
> should we do about ideographs? Don't tell me the browser has to start
> guessing the language for those characters. I've had enough of the
> guessing game. We have been doing it for charsets for years, and it has
> led to trouble that we can't back out of now. I think we need to draw
> the line here, and tell Web page authors to mark their pages with LANG
> attributes or with particular fonts, preferably in style sheets.


A Universal Character Set should not require mark-up/tags.

If the Japanese version of a Chinese character looks different
than the Chinese character, it *is* different.  In many cases,
"variant" does not mean "same".

When limited to BMP code points, CJK unification kind of made
sense.  In light of the new additional planes...

The IRG seems to be doing a fine job.

Best regards,

James Kass.






Re: Transcriptions of Unicode

2000-12-06 Thread John H. Jenkins

At 3:57 PM -0800 12/6/00, James Kass wrote:
> A Universal Character Set should not require mark-up/tags.

Au contraire, it's been implicit in the design of Unicode from the
beginning that markup/tags would be required in certain situations.

> If the Japanese version of a Chinese character looks different
> than the Chinese character, it *is* different.  In many cases,
> "variant" does not mean "same".

But as a rule, the Japanese and Chinese would disagree with you here.
Certainly the IRG would disagree.  Few in the west would argue over
the fundamental unity of Fraktur and Roman variations of the Latin
alphabet; most of the Chinese/Japanese variations are on that order
or less.

> When limited to BMP code points, CJK unification kind of made
> sense.  In light of the new additional planes...
>
> The IRG seems to be doing a fine job.

Here you've really lost me.  The IRG is unifying in plane 2, as well.
Nobody in the IRG has suggested that we abandon unification for plane
2.

-- 
=
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/



Re: Transcriptions of Unicode

2000-12-06 Thread Erik van der Poel

James Kass wrote:
>
> Erik van der Poel wrote:
>
> > The font selection is indeed somewhat haphazard for CJK when there are
> > no LANG attributes and the charset doesn't tell us anything either, but
> > then, what do you expect in that situation anyway? I suppose we could
> > deduce that the language is Japanese for Hiragana and Katakana, but what
> > should we do about ideographs? Don't tell me the browser has to start
> > guessing the language for those characters. I've had enough of the
> > guessing game. We have been doing it for charsets for years, and it has
> > led to trouble that we can't back out of now. I think we need to draw
> > the line here, and tell Web page authors to mark their pages with LANG
> > attributes or with particular fonts, preferably in style sheets.
>
> A Universal Character Set should not require mark-up/tags.
>
> If the Japanese version of a Chinese character looks different
> than the Chinese character, it *is* different.  In many cases,
> "variant" does not mean "same".

I was referring to the CJK Unified Ideographs in the range U+4E00 to
U+9FA5. I agree that those codes do not *require* mark-up/tags, but if
the author wishes to have them displayed with a "Japanese font", then
they must indicate the language or specify the font directly. The latter
may be problematic. I don't think it's reasonable to expect a browser to
apply various heuristics to determine the language.

> When limited to BMP code points, CJK unification kind of made
> sense.  In light of the new additional planes...
>
> The IRG seems to be doing a fine job.
Somehow I get the impression that you have more to say, but you just
aren't saying it. Cough it up already. :-)

Erik



Re: Transcriptions of Unicode

2000-12-06 Thread James Kass

Erik van der Poel wrote:

> > > The font selection is indeed somewhat haphazard for CJK when there are
> > > no LANG attributes and the charset doesn't tell us anything either, but
> > > then, what do you expect in that situation anyway? I suppose we could
> > > deduce that the language is Japanese for Hiragana and Katakana, but what
> > > should we do about ideographs? Don't tell me the browser has to start
> > > guessing the language for those characters. I've had enough of the
> > > guessing game. We have been doing it for charsets for years, and it has
> > > led to trouble that we can't back out of now. I think we need to draw
> > > the line here, and tell Web page authors to mark their pages with LANG
> > > attributes or with particular fonts, preferably in style sheets.
> >
> > A Universal Character Set should not require mark-up/tags.
> >
> > If the Japanese version of a Chinese character looks different
> > than the Chinese character, it *is* different.  In many cases,
> > "variant" does not mean "same".
>
> I was referring to the CJK Unified Ideographs in the range U+4E00 to
> U+9FA5. I agree that those codes do not *require* mark-up/tags, but if
> the author wishes to have them displayed with a "Japanese font", then
> they must indicate the language or specify the font directly. The latter
> may be problematic. I don't think it's reasonable to expect a browser to
> apply various heuristics to determine the language.


I completely agree that it is not reasonable to expect a browser
to guess the language.  Since browsers primarily display
information, the browser doesn't really need to be language-aware
in most cases.  Exceptions like word-breaks for Thai and related
scripts exist, of course.  Even scripts which don't use spaces
or other word breaks can be encoded with the special spacing
variants available in the Unicode Standard, though.

> > When limited to BMP code points, CJK unification kind of made
> > sense.  In light of the new additional planes...
> >
> > The IRG seems to be doing a fine job.
>
> Somehow I get the impression that you have more to say, but you just
> aren't saying it. Cough it up already. :-)


Sorry, I'm trying to learn how to be brief (!) and hoped the
inference would be apparent.  Although the IRG still
considers unification relevant, it seems to me that they
are much tighter now in their definition of 'sameness'
than was previously the case.  Not all of the approx 4
"new" characters in Plane 2 are the names of race horses;
some of them, as far as I can tell, would have been unified
before.

Consider the "teeth" ideograph(s).  (Radical number 211, in
some radical lists.)  Because this is a radical, CJK encoders
can select the specific desired character:  
U+2FD2 for Traditional Chinese
U+2EED for Japanese
U+2EEE for Simplified Chinese

Since anyone encoding U+9F52 might see any of the above
three versions, my opinion is that encoders (authors) would 
wish to explicitly encode their expected character and would
do so whenever they have the option.  I believe that they
should have the option.  The abundance of unassigned code
points offered by additional Unicode planes makes this
possible and would eliminate the need for a browser
(or any other application) to "guess" a language in order
to display material as its authors and users desire.
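
As a rough illustration of the three radical code points listed above, here is a minimal Python sketch using only the standard unicodedata module. The names TEETH_RADICALS, UNIFIED_TEETH and describe are hypothetical helpers chosen for this example, and the output depends on the Unicode version bundled with the interpreter. It prints each character's name and what compatibility normalization (NFKC) does to it; in current versions of the character database the Kangxi radical U+2FD2 folds onto the unified ideograph U+9F52.

```python
import unicodedata

# The three "teeth" radical code points named in the message above.
TEETH_RADICALS = {
    "Traditional Chinese": "\u2FD2",
    "Japanese": "\u2EED",
    "Simplified Chinese": "\u2EEE",
}
# The unified ideograph mentioned alongside them.
UNIFIED_TEETH = "\u9F52"


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))


for label, ch in TEETH_RADICALS.items():
    # NFKC applies compatibility mappings, if the character has any.
    folded = unicodedata.normalize("NFKC", ch)
    print(label, "|", describe(ch), "| NFKC ->", describe(folded))
```

Plain-text processes that normalize with NFKC would therefore lose the distinction for at least the Kangxi form, which is worth keeping in mind when treating these radicals as a way to pin down a glyph.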

Best regards,

James Kass.





Re: Transcriptions of Unicode

2000-12-06 Thread John H. Jenkins

At 6:40 PM -0800 12/6/00, James Kass wrote:
Consider the "teeth" ideograph(s).  (Radical number 211, in
some radical lists.)  Because this is a radical, CJK encoders
can select the specific desired character: 
U+2FD2 for Traditional Chinese
U+2EED for Japanese
U+2EEE for Simplified Chinese

Since anyone encoding U+9F52 might see any of the above
three versions, my opinion is that encoders (authors) would
wish to explicitly encode their expected character and would
do so whenever they have the option.

This doesn't reflect, however, the way people actually use these 
ideographs.  By and large, the Japanese reader wants to see them 
drawn with the Japanese glyph, whether or not the originator was 
Chinese.

There are some cases where the specific glyph *does* matter, largely 
in personal names.  (We had a mildly heated discussion this morning 
at the ongoing IRG meeting about how to show one particular glyph 
for precisely this reason.) By and large, however, it is recognized 
that the glyph differences do *not* affect meaning and should be up 
to the reader, not forced by the originator.

I believe that they
should have the option.  The abundance of unassigned code
points offered by additional Unicode planes makes this
possible and would eliminate the need for a browser
(or any other application) to "guess" a language in order
to display material as its authors and users desire.


But then why not deunify the English and French alphabets?  Or French 
and Polish accents?  Or Fraktur and Italic and Roman styles of Latin?

-- 
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/



Re: displaying Unicode text (was re: Transcriptions of Unicode)

2000-12-06 Thread James Kass

John H. Jenkins wrote:

 At 3:57 PM -0800 12/6/00, James Kass wrote:
 A Universal Character Set should not require mark-up/tags.
 
 Au contraire, it's been implicit in the design of Unicode from the 
 beginning that markup/tags would be required in certain situations. 


Because of the 65536 character limitation?  (Which no
longer applies.)
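
For readers wondering how code points beyond the first 65536 are reached, here is a small Python sketch of the UTF-16 surrogate-pair arithmetic. The function name to_surrogates is a hypothetical helper for this example; U+20000, the first code point of Plane 2, is used only as a sample input.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogates(0x20000)
print("U+20000 -> %04X %04X" % (high, low))

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
assert chr(0x20000).encode("utf-16-be") == expected
```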
 
 If the Japanese version of a Chinese character looks different
 than the Chinese character, it *is* different.  In many cases,
 "variant" does not mean "same".
 
 But as a rule, the Japanese and Chinese would disagree with you here. 
 Certainly the IRG would disagree.  Few in the west would argue over 
 the fundamental unity of Fraktur and Roman variations of the Latin 
 alphabet; most of the Chinese/Japanese variations are on that order 
 or less.
 

As our Asian friends come on-line, they will hopefully
contribute to the discussion in this regard.  The reason
I suspect that the Japanese would tend to agree with me is that
Unicode has not been widely accepted by the Japanese
user community.  

Perhaps if Unicode originated elsewhere, we would have 
had to deal with Greek/Latin/Cyrillic unification?  
(And we could say that since the "W" is really a ligature 
of two "V"s, it shouldn't have an explicit encoding...)

 
 When limited to BMP code points, CJK unification kind of made
 sense.  In light of the new additional planes...
 
 The IRG seems to be doing a fine job.
 
 
 Here you've really lost me.  The IRG is unifying in plane 2, as well. 
 Nobody in the IRG has suggested that we abandon unification for plane 
 2.
 

I tried to respond to this in an earlier letter.  We don't 
even have CJK unification in the BMP, witness the blocks
U+8A00 to U+8B9F versus U+8BA0 to U+8C36.  Many of
the characters in the latter block are simplified versions
of the former.

U+8A02/U+8BA2
U+8A03/U+8BA3
U+8A0C/U+8BA7
U+8A41/U+8BC2
etc.
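
A quick way to see that these pairs really are separately encoded is sketched below in Python, using only the standard unicodedata module. PAIRS and is_folded_by_normalization are hypothetical names for this example. The check only asks whether any of the four standard normalization forms maps the two members of a pair to the same string; it says nothing about whether a given pair ought to have been unified.

```python
import unicodedata

# (traditional, simplified) pairs as listed in the message above.
PAIRS = [(0x8A02, 0x8BA2), (0x8A03, 0x8BA3), (0x8A0C, 0x8BA7), (0x8A41, 0x8BC2)]


def is_folded_by_normalization(a, b):
    """True if some normalization form gives the same result for a and b."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in forms
    )


for trad, simp in PAIRS:
    folded = is_folded_by_normalization(chr(trad), chr(simp))
    print("U+%04X / U+%04X  folded by normalization: %s" % (trad, simp, folded))
```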

Fraktur and roman are both adaptations of the Latin
script, stylistic variations just as italic and roman are.  
The Japanese writing system is Japanese, but derived 
from Chinese.  As you say, some of the differences
are minimal, perhaps slight variation in stroke order,
but other differences are substantial.  In some cases,
the Japanese version may use a variant of a certain
radical component, or even a different radical.  I said
I think the IRG is doing a fine job because it is such a
monumental task, much progress is being made, and the
results of their work seem to reflect the expectations
of the various user communities involved.

Best regards,

James Kass.





Unicode Technical Reports (Formerly: RE: TR22)

2000-12-06 Thread Robert Wheelock




From: "Mark Davis" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Subject: TR22
Date: Mon, 4 Dec 2000 12:58:53 -0800 (GMT-0800)

As per the instructions of the Unicode Technical Committee, TR#22: Character
Mapping Markup Language (CharMapML) has been advanced from draft TR to full
TR.  See http://www.unicode.org/unicode/reports/tr22/ for more information.

Note: The UTC intends to continue development of this TR to also encompass
complex mappings such as 2022 and glyph-based mappings.

Mark

P.S. I will be out of town for a few days, so will be unable to address any
questions that come up until I get back.



Hello, UniCoders!
Whatever happened to Unicode Technical Report *#12*? What is it about?!  Is 
TR12 closer to adoption by Unicode?

Robert Lloyd Wheelock
Augusta, ME  USA







Re: displaying Unicode text (was re: Transcriptions of

2000-12-06 Thread Doug Ewell

"James Kass" [EMAIL PROTECTED] wrote:

 I tried to respond to this in an earlier letter.  We don't 
 even have CJK unification in the BMP, witness the blocks
 U+8A00 to U+8B9F versus U+8BA0 to U+8C36.  Many of
 the characters in the latter block are simplified versions
 of the former.

 U+8A02/U+8BA2
 U+8A03/U+8BA3
 U+8A0C/U+8BA7
 U+8A41/U+8BC2
 etc.

I usually stay out of CJK discussions since they are typically outside
any expertise I may claim, but I thought there was a BIG difference
between the issue of Chinese vs. Japanese glyphs (which may differ only
in stroke weight and number of minor strokes) and the issue of
traditional vs. simplified characters (which may appear completely
different from each other and are not even necessarily convertible from
one set to the other).  Unicode unifies the former and does not unify
the latter.

-Doug Ewell
 Fullerton, California