Re: Property-Problems
On Wednesday 06 December 2000 02:29, Kenneth Whistler wrote, answering Tobias Hunger:

Tobias Hunger asked: 1.) What are the East Asian Width properties of the characters in the new Private Use areas (Plane 15/16)? "A", the same as for the private use area in the BMP. 2.) What are the Linebreaking Properties for those characters? "AI", the same as for the private use area in the BMP.

That is what I expected. Thank you for verifying that for me.

3.) How do you generate the PropList file? Some of the properties are quite obvious (for example the Bidi properties), but others are a mystery to me. [snip] Some of the properties currently in PropList.txt are completely derivative from information in UnicodeData.txt, but were included in PropList.txt despite their redundancy, since PropList.txt gives a different "view" on properties: it gives a property-by-property list of all the characters with a particular property.

Yes, I noticed that this file is informational. But it is very useful to have :-) From what I read about it I guessed that it was completely derivable from the other, normative, data, so these deltas were a surprise for me. Thank you for pointing out this misconception.

A sidenote: the standard is a great book to have when you need to work with characters, as it points out many pitfalls that are not obvious to the Latin-1-using programmer. It is well written and easy to understand. The only problem I encounter from time to time is figuring out which characters you mean exactly when you talk about groups of characters. Maybe it is possible to define properties for all those groups in the next version of the standard? This information is redundant, but it would help me greatly.

Gruss, Tobias

Tobias Hunger [EMAIL PROTECTED]
The box said: 'Windows 95 or better' So I installed Linux.
Re: Transcriptions of Unicode
But NN6 *does* select a font for characters outside the so-called user's locale when said characters are in a UTF-8 page. It appears that this mechanism is somewhat haphazard for CJK unified ideographs: I usually get a mix of fonts (probably because 'ja' is in my locale "stack" currently and 'zh' and 'ko' are not, so I guess Japanese fonts are preferred for characters that are in JIS X 0208??).

AP

Addison P. Phillips, Principal Consultant
Inter-Locale LLC, http://www.inter-locale.com
Los Gatos, CA, USA, mailto:[EMAIL PROTECTED]
+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
Globalization Engineering Consulting Services

On Mon, 4 Dec 2000, Erik van der Poel wrote: Mark Davis wrote: What wasn't clear from his message is whether Mozilla picks a reasonable font if the language is not there.

Sorry about the lack of clarity. When there is no LANG attribute in the element (or in a parent element), Mozilla uses the document's charset as a fallback. Mozilla has font preferences for each language group. The language groups have been set up to have a one-to-one correspondence with charsets (roughly), e.g. iso-8859-1 -> Western, shift_jis -> ja. When the charset is a Unicode-based one (e.g. UTF-8), then Mozilla uses the language group that contains the user's locale's language. In other words, Mozilla does not (yet) use the Unicode character codes to select fonts. We may do this in the future.

Erik
OT (Kind of): Determining whether Locales are left-to-right or right-to-left.
Is there a general mechanism for determining the directionality of a locale? I am using Java Servlets to create HTML pages. Is there something that will tell me when it is appropriate to generate the HTML right-to-left as opposed to left-to-right? At the moment it looks like I have to maintain a table of right-to-left locales myself. If that is the way to go, apart from Arabic (ar), Hebrew (he), and Urdu (ur), for which other locales is it appropriate to set the directionality to right-to-left? Is there a standard document somewhere that would tell me? Thanks in advance.

David Tooke [EMAIL PROTECTED]
Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.
Well, there are lots of other Arabic-script locales. Here is a list from a message from Elaine Keown just the other day: Arabic, Balti, Baluchi, Berber, Farsi, Hausa, Karaite, Kashmiri, Kazakh, Kirghiz, Kurmanji, Luri, Mazanderani, Moplah, Panjabi (Pakistani), Pashto, Pulaar, Sindhi, Siraiki (also known as Saraiki or Lahnda or Western Panjabi), Sulu, Uighur, Urdu, Uzbek, Wolof. There are also several Hebrew-script ones such as Yiddish, Aramaic, etc.

MichKa
Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

- Original Message - From: "David Tooke" [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Sent: Wednesday, December 06, 2000 8:48 AM Subject: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.
Re: OT (Kind of): Determining whether Locales are left-to-right or
On Wed, 6 Dec 2000, David Tooke wrote: At the moment it looks like I have to maintain a table of right to left locales myself. If that is the way to go, apart from the Arabic (ar); Hebrew (he); Urdu (ur) which other locales is it appropriate to set the directionality to right-to-left? Is there a standard document somewhere that would tell me? You can add this list: Persian (fa), Iranian and Iraqi Kurdish (ku_IR, ku_IQ), Pashtu (ps), and Yiddish (yi). There are also others, but I believe them all to be in the three letter (ISO 639-2) world: Baluchi (bal), Syriac (syr), etc. --roozbeh
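The hand-maintained table David mentions might look like the following. This is a minimal Java sketch (the original poster mentions Java Servlets); the class name and exact membership of the list are assumptions compiled from the lists in this thread, and, as later replies point out, directionality is really a property of the script rather than the language, so any language-only table is inherently approximate:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a hard-coded set of primary language subtags whose
// dominant script is right-to-left, compiled from this thread's lists.
// Deliberately incomplete; e.g. Kurdish ("ku") is Arabic-script in Iran
// and Iraq but Latin-script in Turkey, as Roozbeh's ku_IR/ku_IQ hints.
public class RtlLocales {
    private static final Set<String> RTL_LANGS = new HashSet<>(Arrays.asList(
        "ar",        // Arabic
        "he", "iw",  // Hebrew (Java historically used "iw")
        "ur",        // Urdu
        "fa",        // Persian
        "ps",        // Pashto
        "yi", "ji",  // Yiddish (legacy code "ji")
        "ku",        // Kurdish (script varies by country!)
        "ug",        // Uighur
        "sd",        // Sindhi
        "dv",        // Divehi (Thaana script)
        "syr", "bal" // three-letter ISO 639-2 codes: Syriac, Baluchi
    ));

    /** True if the locale tag's primary language subtag is on the RTL list. */
    public static boolean isRtl(String localeTag) {
        String lang = localeTag.toLowerCase().split("[-_]")[0];
        return RTL_LANGS.contains(lang);
    }

    public static void main(String[] args) {
        System.out.println(isRtl("ar-EG")); // true
        System.out.println(isRtl("en-US")); // false
    }
}
```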
Re: OT (Kind of): Determining whether Locales are left-to-right or
Michael Kaplan wrote: Well, there are lots of other Arabic-script locales. Here is a list from a message from Elaine Keown just the other day: Arabic, Balti, Baluchi, Berber, Farsi, Hausa, Karaite, Kashmiri, Kazakh, Kirghiz, Kurmanji, Luri, Mazanderani, Moplah, Panjabi (Pakistani), Pashto, Pulaar, Sindhi, Siraiki (also known as Saraiki or Lahnda or Western Panjabi), Sulu, Uighur, Urdu, Uzbek, Wolof.

Urdu written in Nagari script is left-to-right? This is new to me... Of course, a similar pitfall exists for a number of other "locales" when one equates the locale with the mere language. OTOH, don't forget the "other" RTL scripts, such as Thaana (used in the Maldives for the Divehi language) and the Syriac script.

Antoine

- Original Message - From: "David Tooke" [EMAIL PROTECTED]: I am using Java Servlets to create HTML pages. Is there something that will tell me when it is appropriate to generate the HTML in right to left as opposed to left to right?

Why do you want to generate HTML "right to left"? Isn't HTML just a stream of characters that runs from "begin" to "end"? Just do nothing; it is the browser's job to do the visual reversing.

At the moment it looks like I have to maintain a table of right to left locales myself. If that is the way to go, apart from the Arabic (ar); Hebrew (he); Urdu (ur) which other locales is it appropriate to set the directionality to right-to-left? Is there a standard document somewhere that would tell me?

Now, can you tell me how this scheme will handle boustrophedon, unless you know in advance the size of the displaying window...

Antoine
Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.
Thanks for your prompt replies. I noticed from that list that there are quite a few languages that do not have 2-character ISO 639 codes: Balti, Baluchi, Berber, Hausa, Karaite, Kurmanji, Luri, Mazanderani, Moplah, Pulaar, Siraiki (also known as Saraiki or Lahnda or Western Panjabi), Sulu. Is it true that one would not be able to set their browser locale to these languages, as it appears an ISO 639 code is a prerequisite for this?

plus... dumb question 1. Is Aramaic (which doesn't seem to have a 2-character ISO code) the same as Amharic (which does... AM)? If not, Amharic appears to be a Semitic language too; is that written right-to-left as well?

dumb question 2. Are there any known cases where the full locale name (language+country+variant) has a different directionality than the root language? I know that some languages are written in different scripts based on the locale; are there any cases where two scripts have the same language code in their locale but differ in their writing direction?

- Original Message - From: "Michael (michka) Kaplan" [EMAIL PROTECTED] To: "David Tooke" [EMAIL PROTECTED]; "Unicode List" [EMAIL PROTECTED] Sent: Wednesday, December 06, 2000 11:36 AM Subject: Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.
Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.
From: "David Tooke" [EMAIL PROTECTED]: I noticed from that list that there are quite a few languages that do not have 2-character ISO 639 codes: Balti, Baluchi, Berber, Hausa, Karaite, Kurmanji, Luri, Mazanderani, Moplah, Pulaar, Siraiki (also known as Saraiki or Lahnda or Western Panjabi), Sulu. Is it true that one would not be able to set their browser locale to these languages, as it appears ISO 639 is a prerequisite for this?

I do not think that is universally true, no.

plus... dumb question 1. Is Aramaic (which doesn't seem to have a 2-character ISO code) the same as Amharic (which does... AM)? If not, Amharic appears to be a Semitic language too; is that written right-to-left too?

Amharic uses the Ethiopic script, and is not RTL as far as I know. Aramaic has no native speakers (unless you count Hugh Nibley, who reportedly wigged out during a class one day and started lecturing in Aramaic -- witnessed by two people I know among the 50+ in the class!) so while you may have Aramaic content, you probably would not have your machine set to use it as a locale. :-)

dumb question 2. Are there any known cases where the full locale name (language+country+variant) has a different directionality than the root language? I know that some languages are written in different scripts based on the locale; are there any cases where two scripts have the same language code in their locale but differ in their writing direction?

Well, there are some languages in the former Soviet Union that are considering an Arabic script either instead of or in addition to existing Latin/Cyrillic scripts. Not sure if any have been officially adopted?

BTW - I try not to answer stupid questions, so you can assume I disagree with your characterization, since I answered them. :-)

MichKa
Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/
Re: OT (Kind of): Determining whether Locales are left-to-right or
"Michael (michka) Kaplan" wrote: Well, there are some languages in the former Soviet Union that are considering an Arabic script either instead of or in addition to existing Latin/Cyrillic scripts. Not sure if any have been officially adopted? I missed this bit before. Mongolian (not a language of the former USSR) has Cyrillic and Mongolian-script representations; they are not automatically interconvertible. Cyrillic is L2R, of course; Mongolian is T2B. -- There is / one art || John Cowan [EMAIL PROTECTED] no more / no less|| http://www.reutershealth.com to do / all things || http://www.ccil.org/~cowan with art- / lessness \\ -- Piet Hein
Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.
Is it true that one would not be able to set their browser locale to these languages, as it appears ISO 639 is a prerequisite for this?

I do not think that is universally true, no.

But according to RFC 1766, which governs the language tags in HTML and in HTTP, only two-character ISO 639 language codes, 'i-' tags registered with the IANA, and 'x-' private-use tags are valid. There seem to be very few languages registered with IANA, and certainly none of the ones mentioned earlier. Similarly, this seems to be the case for Java locales too: they do not, it seems, actually validate the language, but from the documentation that is what is expected. Do you think it is possible that some user agents could have language strings using (say) the 3-character ISO language identifiers, i.e. "syr"?

BTW - I try not to answer stupid questions, so you can assume I disagree with your characterization since I answered them. :-)

You're very gracious. :-)

David Tooke [EMAIL PROTECTED]

- Original Message - From: "Michael (michka) Kaplan" [EMAIL PROTECTED] To: "David Tooke" [EMAIL PROTECTED]; "Unicode List" [EMAIL PROTECTED] Sent: Wednesday, December 06, 2000 12:37 PM Subject: Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.
Re: OT (Kind of): Determining whether Locales are left-to-right
At 10:54 -0800 2000-12-06, David Tooke wrote: But according to RFC 1766 that governs the language tags in HTML and in HTTP, only two-character ISO 639 language codes, 'i-' tags registered with the IANA, and 'x-' private tags are valid.

This is being revised to include the ISO 639-2 codes.

There seem to be very few languages registered with IANA and certainly none of the ones mentioned earlier.

You are welcome to register them.

Michael Everson ** Everson Gunn Teoranta ** http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Mob +353 86 807 9169 ** Fax +353 1 478 2597 ** Vox +353 1 478 2597
27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
Re: OT (Kind of): Determining whether Locales are left-to-right or
In general, directionality is a script property, not a language or locale property: any language written using Arabic, Hebrew, Syriac, or Thaana script remains right-to-left even when embedded in some foreign locale, unless it is transliterated into Latin script.

Yes, I realise that is true. I am, however, trying to determine when it is appropriate to generate a web page or an applet in right-to-left as opposed to left-to-right format. I am assuming that the browser (and/or operating system) is going to render the actual text in the correct visual order as defined by the Unicode Bidi Algorithm. However, I still need to indicate whether the page itself should be oriented in right-to-left format (i.e. with labels to form fields on the right, not the left). I would like to determine, as automatically as possible, what would be best for the user, which means trying to figure it out based on their locale. I think, for example, it would be appropriate to show a form oriented right-to-left to someone who has their browser set to 'ar-EG', even if the application has not been translated into Arabic. Unfortunately, the application is such that maintaining preferences for each user is not possible, so I am trying to make a best guess at it.

- Original Message - From: "John Cowan" [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Sent: Wednesday, December 06, 2000 12:35 PM Subject: Re: OT (Kind of): Determining whether Locales are left-to-right or

David Tooke wrote: dumb question 1. Is Aramaic (which doesn't seem to have a 2-character ISO code) the same as Amharic (which does... AM)? No. If not, Amharic appears to be a Semitic language too, is that written right-to-left too? No, Amharic is written with Ethiopic script, which is left-to-right.
In general, directionality is a script property, not a language or locale property: any language written using Arabic, Hebrew, Syriac, or Thaana script remains right-to-left even when embedded in some foreign locale, unless it is transliterated into Latin script. -- There is / one art || John Cowan [EMAIL PROTECTED] no more / no less|| http://www.reutershealth.com to do / all things || http://www.ccil.org/~cowan with art- / lessness \\ -- Piet Hein
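David's "best guess from the locale" could be sketched as follows, under the assumption that the guess comes from the HTTP Accept-Language header a browser sends. The class and method names are invented for illustration, the RTL list is deliberately minimal, and q-weights are ignored for brevity (a real implementation should sort by them):

```java
// Hypothetical sketch: choose an overall page direction from the first
// language in an Accept-Language header whose script is right-to-left.
// Not part of any servlet API; a servlet would pass in
// request.getHeader("Accept-Language").
public class PageDirection {
    public static String dirFor(String acceptLanguage) {
        if (acceptLanguage == null) return "ltr";
        // Accept-Language is a comma-separated, q-weighted list,
        // e.g. "ar-EG, en;q=0.7". For brevity, take languages in
        // listed order and ignore the q-values.
        for (String part : acceptLanguage.split(",")) {
            String tag = part.trim().split(";")[0].trim().toLowerCase();
            String lang = tag.split("-")[0];
            // Minimal RTL language list drawn from this thread.
            if (lang.equals("ar") || lang.equals("he") || lang.equals("iw")
                    || lang.equals("ur") || lang.equals("fa")
                    || lang.equals("yi")) {
                return "rtl";
            }
        }
        return "ltr";
    }

    public static void main(String[] args) {
        System.out.println(dirFor("ar-EG, en;q=0.7")); // rtl
        System.out.println(dirFor("en-GB"));           // ltr
    }
}
```

The returned string could then be emitted as the `dir` attribute on the page's root element.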
Re: OT (Kind of): Determining whether Locales are left-to-right or
David Tooke wrote: I am assuming that the browser (and/or operating system) is going to render the actual text in the correct visual order as defined by the Unicode Bidi Algorithm. However I still need to indicate whether the page itself should be oriented in right-to-left format (i.e. with labels to form fields on the right not the left). If the text is right-to-left, then widgets/controls embedded in the text will be rendered to the right of the text they follow, so you shouldn't need to do anything different at all. I think, for example, it would be appropriate to show a form oriented right-to-left to someone who has their browser set to 'ar-EG', even if the application has not been translated into arabic. Ah, I see. I think it would be very weird to render an English-language application with labels on the right of their fields, just because the user also understands Arabic. Overall directionality, like local directionality, is a property of the script in which the current language is written, not a question of cultural preference. Would you expect a Hebrew-speaking person to want to start reading at the back of a book written in English? -- There is / one art || John Cowan [EMAIL PROTECTED] no more / no less|| http://www.reutershealth.com to do / all things || http://www.ccil.org/~cowan with art- / lessness \\ -- Piet Hein
OT (Kind of): Determining whether Locales are left-to-right or
I think it would be very weird to render an English-language application with labels on the right of their fields, just because the user also understands Arabic.

The application is a database application where the majority of fields are from a Unicode database and user-entered. Their text is likely to be in Arabic. Therefore, as far as I am concerned, the *content* of the page is in Arabic, not English, despite it being an English application, so the page should be formatted as if it were an Arabic page with some English text. As it is a Unicode database, I do not want to try to determine what language/script *exactly* is being used. That would involve scanning the Unicode characters and a lot more jiggery-pokery than I need.
Re: OT (Kind of): Determining whether Locales are left-to-right or
You're the boss, but it still sounds like an English page with embedded Arabic text to me.

Just because the application used to create the content is in English, that doesn't make the content English, any more than if your Hebrew speaker wrote a book using an English version of his word-processing software. The fact that the application has to expose certain utilitarian English labels to the user does not make the content of the page any less Arabic.

The Unicode folks have nicely arranged that the RTL characters are all going to be in the ranges U+0590 through U+08FF and U+10800 to U+10FFF, of which only the first range matters just yet. This is a rather modest test, and probably more reliable than using the browser setting.

But again, just because there are *some* RTL characters in the output, that does not make *all* the content RTL. Plus, this would result in some weirdness where the same user could go into the same page with two different parameters and get it first in LTR, then in RTL, just because the database hit an RTL character the second time. Obviously, there's no ideal way of handling this. We could just say f*k it... everybody sees it in LTR. But I thought trying to figure it out from the browser might be more user friendly.

- Original Message - From: "John Cowan" [EMAIL PROTECTED] To: "David Tooke" [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Wednesday, December 06, 2000 3:49 PM Subject: Re: OT (Kind of): Determining whether Locales are left-to-right or

David Tooke wrote: The application is a database application where the majority of fields are from a Unicode database and user-entered. Their text is likely to be in Arabic. Therefore, as far as I am concerned, the *content* of the page is in Arabic not English despite it being an English application. So the page should be formatted as if it were an Arabic page with some English text. As it is a Unicode database, I do not want to try to determine what language/script *exactly* is being used.
That would involve scanning the Unicode characters and a lot more jiggery-pokery than I need. The Unicode folks have nicely arranged that the RTL characters are all going to be in the ranges U+0590 through U+08FF and U+10800 to U+10FFF, of which only the first range matters just yet. This is a rather modest test, and probably more reliable than using the browser setting. -- There is / one art || John Cowan [EMAIL PROTECTED] no more / no less|| http://www.reutershealth.com to do / all things || http://www.ccil.org/~cowan with art- / lessness \\ -- Piet Hein
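Cowan's "modest test" above can be sketched as a simple range scan. This is a hypothetical helper, not anyone's actual code, and it only checks the BMP range he cites; note David's caveat that the presence of some RTL characters does not by itself make the whole page RTL:

```java
// Sketch of the "modest test": scan for any character in the BMP RTL
// block range John Cowan mentions (U+0590 through U+08FF). The Plane 1
// range (U+10800 onward) is omitted since, as he notes, only the first
// range matters just yet.
public class RtlSniffer {
    /** True if any char in the string falls within U+0590..U+08FF. */
    public static boolean containsRtl(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0590 && c <= 0x08FF) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsRtl("hello"));        // false
        System.out.println(containsRtl("\u05D0\u05D1")); // true (Hebrew alef, bet)
    }
}
```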
Re: Transcriptions of Unicode
Erik van der Poel wrote: The font selection is indeed somewhat haphazard for CJK when there are no LANG attributes and the charset doesn't tell us anything either, but then, what do you expect in that situation anyway? I suppose we could deduce that the language is Japanese for Hiragana and Katakana, but what should we do about ideographs? Don't tell me the browser has to start guessing the language for those characters. I've had enough of the guessing game. We have been doing it for charsets for years, and it has led to trouble that we can't back out of now. I think we need to draw the line here, and tell Web page authors to mark their pages with LANG attributes or with particular fonts, preferably in style sheets.

A Universal Character Set should not require mark-up/tags. If the Japanese version of a Chinese character looks different than the Chinese character, it *is* different. In many cases, "variant" does not mean "same". When limited to BMP code points, CJK unification kind of made sense. In light of the new additional planes... The IRG seems to be doing a fine job.

Best regards, James Kass.
Re: Transcriptions of Unicode
At 3:57 PM -0800 12/6/00, James Kass wrote: A Universal Character Set should not require mark-up/tags. Au contraire, it's been implicit in the design of Unicode from the beginning that markup/tags would be required in certain situations. If the Japanese version of a Chinese character looks different than the Chinese character, it *is* different. In many cases, "variant" does not mean "same". But as a rule, the Japanese and Chinese would disagree with you here. Certainly the IRG would disagree. Few in the west would argue over the fundamental unity of Fraktur and Roman variations of the Latin alphabet; most of the Chinese/Japanese variations are on that order or less. When limited to BMP code points, CJK unification kind of made sense. In light of the new additional planes... The IRG seems to be doing a fine job. Here you've really lost me. The IRG is unifying in plane 2, as well. Nobody in the IRG has suggested that we abandon unification for plane 2. -- = John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jenkins/
Re: Transcriptions of Unicode
James Kass wrote: Erik van der Poel wrote: The font selection is indeed somewhat haphazard for CJK when there are no LANG attributes and the charset doesn't tell us anything either, but then, what do you expect in that situation anyway? I suppose we could deduce that the language is Japanese for Hiragana and Katakana, but what should we do about ideographs? Don't tell me the browser has to start guessing the language for those characters. I've had enough of the guessing game. We have been doing it for charsets for years, and it has led to trouble that we can't back out of now. I think we need to draw the line here, and tell Web page authors to mark their pages with LANG attributes or with particular fonts, preferably in style sheets.

A Universal Character Set should not require mark-up/tags. If the Japanese version of a Chinese character looks different than the Chinese character, it *is* different. In many cases, "variant" does not mean "same".

I was referring to the CJK Unified Ideographs in the range U+4E00 to U+9FA5. I agree that those codes do not *require* mark-up/tags, but if the author wishes to have them displayed with a "Japanese font", then they must indicate the language or specify the font directly. The latter may be problematic. I don't think it's reasonable to expect a browser to apply various heuristics to determine the language.

When limited to BMP code points, CJK unification kind of made sense. In light of the new additional planes... The IRG seems to be doing a fine job.

Somehow I get the impression that you have more to say, but you just aren't saying it. Cough it up already. :-)

Erik
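Erik's suggestion, marking the language with LANG attributes or naming fonts in style sheets, could look something like the following hypothetical fragment. The font names are placeholders, not recommendations:

```html
<!-- Hypothetical fragment: mark the language on each element so the
     browser can pick a script-appropriate font without guessing. -->
<p lang="ja">(Japanese text using unified ideographs)</p>
<p lang="zh">(Chinese text using the same code points)</p>

<style>
  /* CSS2 :lang() selectors; the font names are placeholders. */
  p:lang(ja) { font-family: "MS Mincho", serif; }
  p:lang(zh) { font-family: "MingLiU", serif; }
</style>
```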
Re: Transcriptions of Unicode
Erik van der Poel wrote: The font selection is indeed somewhat haphazard for CJK when there are no LANG attributes and the charset doesn't tell us anything either, but then, what do you expect in that situation anyway? I suppose we could deduce that the language is Japanese for Hiragana and Katakana, but what should we do about ideographs? Don't tell me the browser has to start guessing the language for those characters. I've had enough of the guessing game. We have been doing it for charsets for years, and it has led to trouble that we can't back out of now. I think we need to draw the line here, and tell Web page authors to mark their pages with LANG attributes or with particular fonts, preferably in style sheets.

A Universal Character Set should not require mark-up/tags. If the Japanese version of a Chinese character looks different than the Chinese character, it *is* different. In many cases, "variant" does not mean "same".

I was referring to the CJK Unified Ideographs in the range U+4E00 to U+9FA5. I agree that those codes do not *require* mark-up/tags, but if the author wishes to have them displayed with a "Japanese font", then they must indicate the language or specify the font directly. The latter may be problematic. I don't think it's reasonable to expect a browser to apply various heuristics to determine the language.

I completely agree that it is not reasonable to expect a browser to guess the language. Since browsers primarily display information, the browser doesn't really need to be language-aware in most cases. Exceptions like word breaks for Thai and related scripts exist, of course. Even scripts which don't use spaces or other word breaks can be encoded with the special spacing variants available in the Unicode Standard, though.

When limited to BMP code points, CJK unification kind of made sense. In light of the new additional planes... The IRG seems to be doing a fine job.
Somehow I get the impression that you have more to say, but you just aren't saying it. Cough it up already. :-) Sorry, I'm trying to learn how to be brief (!) and hoped the inference would be apparent. Although the IRG still considers unification relevant, it seems to me that they are much tighter now in their definition of 'sameness' than was previously the case. Not all of the approx 4 "new" characters in Plane 2 are the names of race horses, some of them, as far as I can tell, would have been unified before. Consider the "teeth" ideograph(s). (Radical number 211, in some radical lists.) Because this is a radical, CJK encoders can select the specific desired character: U+2FD2 for Traditional Chinese U+2EED for Japanese U+2EEE for Simplified Chinese Since anyone encoding U+9F52 might see any of the above three versions, my opinion is that encoders (authors) would wish to explicitly encode their expected character and would do so whenever they have the option. I believe that they should have the option. The abundance of unassigned code points offered by additional Unicode planes makes this possible and would eliminate the need for a browser (or any other application) to "guess" a language in order to display material as its authors and users desire. Best regards, James Kass.
Re: Transcriptions of Unicode
At 6:40 PM -0800 12/6/00, James Kass wrote: Consider the "teeth" ideograph(s). (Radical number 211, in some radical lists.) Because this is a radical, CJK encoders can select the specific desired character: U+2FD2 for Traditional Chinese U+2EED for Japanese U+2EEE for Simplified Chinese Since anyone encoding U+9F52 might see any of the above three versions, my opinion is that encoders (authors) would wish to explicitly encode their expected character and would do so whenever they have the option. This doesn't reflect, however, the way people actually use these ideographs. By and large, the Japanese reader wants to see them drawn with the Japanese glyph, whether or not the originator was Chinese. There are some cases where the specific glyph *does* matter, largely in personal names. (We had a mildly heated discussion this morning in the IRG meeting going on about how to show one particular glyph for precisely this reason.) By and large, however, it is recognized that the glyph differences do *not* affect meaning and should be up to the reader, not forced by the originator. I believe that they should have the option. The abundance of unassigned code points offered by additional Unicode planes makes this possible and would eliminate the need for a browser (or any other application) to "guess" a language in order to display material as its authors and users desire. But then why not deunify the English and French alphabets? Or French and Polish accents? Or Fraktur and Italic and Roman styles of Latin? -- = John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jenkins/
Re: displaying Unicode text (was re: Transcriptions of Unicode)
John H. Jenkins wrote:

At 3:57 PM -0800 12/6/00, James Kass wrote: A Universal Character Set should not require mark-up/tags.

Au contraire, it's been implicit in the design of Unicode from the beginning that markup/tags would be required in certain situations.

Because of the 65536-character limitation? (Which no longer applies.)

If the Japanese version of a Chinese character looks different from the Chinese character, it *is* different. In many cases, "variant" does not mean "same".

But as a rule, the Japanese and Chinese would disagree with you here. Certainly the IRG would disagree. Few in the West would argue over the fundamental unity of the Fraktur and Roman variations of the Latin alphabet; most of the Chinese/Japanese variations are on that order or less.

As our Asian friends come on-line, they will hopefully contribute to the discussion in this regard. The reason I suspect that the Japanese would tend to agree is that Unicode has not been widely accepted by the Japanese user community. Perhaps if Unicode had originated elsewhere, we would have had to deal with Greek/Latin/Cyrillic unification? (And we could say that since the "W" is really a ligature of two "V"s, it shouldn't have an explicit encoding...) When limited to BMP code points, CJK unification kind of made sense. In light of the new additional planes...

The IRG seems to be doing a fine job.

Here you've really lost me. The IRG is unifying in Plane 2 as well. Nobody in the IRG has suggested that we abandon unification for Plane 2.

I tried to respond to this in an earlier letter. We don't even have complete CJK unification in the BMP; witness the blocks U+8A00 to U+8B9F versus U+8BA0 to U+8C36. Many of the characters in the latter block are simplified versions of the former:

U+8A02/U+8BA2
U+8A03/U+8BA3
U+8A0C/U+8BA7
U+8A41/U+8BC2
etc.

Fraktur and Roman are both adaptations of the Latin script, or stylistic variations just as italic and roman. The Japanese writing system is Japanese, but derived from Chinese.
As you say, some of the differences are minimal, perhaps a slight variation in stroke order, but other differences are substantial. In some cases, the Japanese version may use a variant of a certain radical component, or even a different radical. I said I think the IRG is doing a fine job because it is such a monumental task, much progress is being made, and the results of their work seem to reflect the expectations of the various user communities involved.

Best regards,

James Kass.
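(Editorial aside, not part of the original exchange: the traditional/simplified pairs cited above can be verified with Python's standard `unicodedata` module. A minimal sketch; each member of a pair is an independent CJK unified ideograph with no decomposition, so no Unicode normalization form maps one onto the other.)

```python
import unicodedata

# Traditional/simplified pairs from the U+8A00..U+8B9F and
# U+8BA0..U+8C36 blocks; normalization leaves every member unchanged,
# so the two halves of each pair are never conflated.
pairs = [(0x8A02, 0x8BA2), (0x8A03, 0x8BA3), (0x8A0C, 0x8BA7), (0x8A41, 0x8BC2)]
for trad, simp in pairs:
    t, s = chr(trad), chr(simp)
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, t) != unicodedata.normalize(form, s)
    print(f"U+{trad:04X} and U+{simp:04X} remain distinct under all normalization forms")
```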
Unicode Technical Reports (Formerly: RE: TR22)
From: "Mark Davis" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Subject: TR22
Date: Mon, 4 Dec 2000 12:58:53 -0800 (GMT-0800)

As per the instructions of the Unicode Technical Committee, TR #22: Character Mapping Markup Language (CharMapML) has been advanced from draft TR to full TR. See http://www.unicode.org/unicode/reports/tr22/ for more information.

Note: The UTC intends to continue development of this TR to also encompass complex mappings such as 2022 and glyph-based mappings.

Mark

P.S. I will be out of town for a few days, so will be unable to address any questions that come up until I get back.

Hello, Unicoders! Whatever happened to Unicode Technical Report *#12*? What's it about? Is TR12 closer to adoption by Unicode?

Robert Lloyd Wheelock
Augusta, ME USA
Re: displaying Unicode text (was re: Transcriptions of
"James Kass" [EMAIL PROTECTED] wrote:

I tried to respond to this in an earlier letter. We don't even have CJK unification in the BMP, witness the blocks U+8A00 to U+8B9F versus U+8BA0 to U+8C36. Many of the characters in the latter block are simplified versions of the former:

U+8A02/U+8BA2
U+8A03/U+8BA3
U+8A0C/U+8BA7
U+8A41/U+8BC2
etc.

I usually stay out of CJK discussions since they are typically outside any expertise I may claim, but I thought there was a BIG difference between the issue of Chinese vs. Japanese glyphs (which may differ only in stroke weight and number of minor strokes) and the issue of traditional vs. simplified characters (which may appear completely different from each other and are not even necessarily convertible from one set to the other). Unicode unifies the former and does not unify the latter.

-Doug Ewell
Fullerton, California