Latvian palatalised consonants
Is the use of the existing precomposed characters in the Latin Extended-A block considered 'right' for encoding Latvian palatal consonants, or is it considered 'wrong', so that I will have to use composites with U+0326 'Combining comma below' instead?

I am aware that many people use those precomposed cedilla characters, but nevertheless it does not look Latvian to me...

Romanian did get its precomposed letters - can one expect that to serve as a precedent for Latvian? :-)

--
Herman Ranes
Høgskolen i Sør-Trøndelag, Avdeling for teknologi, Institutt for elektroteknikk
N-7004 Trondheim, NOREG
Telefon +47 73559606, Telefaks +47 73559581
[EMAIL PROTECTED]
http://www.hist.no/~herman/
Re: Latvian palatalised consonants
On Tue, 17 Oct 2000, Herman Ranes wrote:

> Is the use of the existing precomposed characters in the Latin Extended-A block considered 'right' for encoding Latvian palatal consonants, or is it considered 'wrong', so that I will have to use composites with U+0326 'Combining comma below' instead?
>
> I am aware that many people use those precomposed cedilla characters, but nevertheless it does not look Latvian to me...
>
> Romanian did get its precomposed letters - can one expect that to serve as a precedent for Latvian?

As far as I know, there is an official decision by the Romanian Standards Institute to regard certain Romanian characters as containing a comma below and not a cedilla, making ISO 8859-2 (originally designed to cover Romanian too) inadequate for writing Romanian. There is a committee draft for ISO 8859-16 intended to solve this problem:
http://www.egt.ie/standards/iso8859/cd8859-16-en.pdf

What is the official position in Latvia on the nature of the diacritic mark we're discussing? Unofficial documents, like
http://www.geocities.com/tuksnesis/valoda/diacrtic.html
seem to call it "cedilla" - and display glyphs where it is clearly comma-like in appearance.

_If_ there were an official statement saying that it's a comma and not a cedilla, then one _might_ refer to the Romanian case as a precedent. But then the question would arise whether one really needs to make a distinction between comma and cedilla. The problem with s and t with comma or cedilla was that they are also used outside Romanian.

The Unicode attitude, expressed in the description of Latin Extended-A,
http://www.unicode.org/charts/PDF/U0100.pdf
is somewhat confusing. For example, U+015F LATIN SMALL LETTER S WITH CEDILLA is, according to it, used in Turkish, Azerbaijani, Romanian, ..., but "a glyph variant with comma below is preferred for Romanian"; on the other hand, that "glyph variant" appears as U+0219 LATIN SMALL LETTER S WITH COMMA BELOW, with the note "Romanian, when distinct comma below form is required".
So are the characters you're referring to "Latvian only"? Unfortunately, there doesn't seem to be any collection of information that could be used as a reference on the use of letters in different languages.

The ISO 8859 series implicitly constitutes a partial (but very partial) reference, since each standard in the series lists the languages to which it is applicable. (See my http://www.hut.fi/u/jkorpela/8859.html which summarizes the coverage of European languages by the ISO 8859 alphabets.) Then there's the rather detailed
http://www.eki.ee/itstandard/docs/draft-alvestrand-lang-char-03.txt
but it is old, and has the status of an expired Internet-Draft. There are some notes in the Unicode standard, but they are typically just _examples_ of languages in which a character is used. And there's a nice online database at http://www.eki.ee/letter/ which is based on various sources.

For example, for U+0137 LATIN SMALL LETTER K WITH CEDILLA, all the information available to me suggests that it is used in Latvian only, with a glyph where the diacritic part is a comma below the "k", not connected to it in any cedilla-like manner. So what would be the problem in using it? There _would_ be a problem if some other language used the character so that the diacritic part is somehow cedilla-like. (But even then, it might be regarded as something to be handled at a higher protocol level, based on language information.)

So I don't think it's a problem; the only real problem appears to be the _name_, which contains the word CEDILLA, but it's just a name, and diacritics may vary in appearance anyway. (Consider how differently an acute accent can be displayed.)

--
Yucca, http://www.hut.fi/u/jkorpela/ or http://yucca.hut.fi/yucca.html
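The difference between the character names and the combining marks discussed in this thread can be checked against the Unicode Character Database. What follows is only a minimal sketch, assuming Python 3 and its standard unicodedata module (neither is mentioned in the thread); the helper name describe is hypothetical. It lists the canonical decomposition of the Latvian k, the Turkish/Romanian s with cedilla, and the Romanian s with comma below.

```python
import unicodedata


def describe(ch):
    """Return the names of the characters in ch's canonical decomposition."""
    parts = unicodedata.decomposition(ch).split()
    return [unicodedata.name(chr(int(p, 16))) for p in parts]


for ch in "\u0137\u015f\u0219":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {describe(ch)}")

# U+0137 decomposes to k + U+0327 COMBINING CEDILLA, so a sequence built
# with U+0326 COMBINING COMMA BELOW does not compose to it under NFC.
print(unicodedata.normalize("NFC", "k\u0327") == "\u0137")  # True
print(unicodedata.normalize("NFC", "k\u0326") == "\u0137")  # False
```

So, as far as normalization is concerned, "k" followed by U+0326 and the precomposed U+0137 are distinct encodings, whatever the glyphs look like; only the s and t with comma below have precomposed forms that decompose to U+0326.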
Re: utf-8 != latin-1
One of the main features of XML is that it has quite strict rules about how to handle errors. The goal, I believe, is to ensure that we are not awash in malformed files that have no clear interpretation. And this is clearly an error; the acceptable code points are quite clearly stated:

http://www.w3.org/TR/2000/REC-xml-20001006#dt-character

Converting an illegal UTF-8 sequence into a valid -- BUT WRONG -- sequence of valid code points is clearly against the intent of this production rule. XML could have taken the opposite tack -- that illegal code points and illegal code unit sequences are to be ignored. But it didn't.

Mark

BTW, I have a simple browser-based UTF converter (in JavaScript) at http://www.macchiato.com/unicode/charts.html (click on Converter). It lets you convert back and forth between different UTFs, with various choices of format. And it checks for illegal UTF-8 sequences!

----- Original Message -----
From: "Doug Ewell" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Friday, October 13, 2000 21:59
Subject: Re: utf-8 != latin-1

"Steven R. Loomis" [EMAIL PROTECTED] wrote:

> What happened was that the sequence AD 63 61 73 was interpreted as U+E54E U+DC73..

Why? As an illegal UTF-8 sequence, it shouldn't be interpreted as anything.

John Cowan's "utf" perl script (which carries the appropriate disclaimers about no error checking) converts that sequence to U+D94E U+DC73, which seems a bit more reasonable -- at least it's a complete surrogate pair.

-Doug Ewell
 Fullerton, California
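The strict behaviour argued for above can be illustrated with the byte sequence from this thread. This is a minimal sketch, assuming Python 3 and its built-in UTF-8 codec as a stand-in for a conforming decoder (it is not an XML parser, and the helper name try_decode is hypothetical). The byte AD is a continuation byte and cannot start a UTF-8 sequence, so a strict decoder rejects the input instead of inventing code points.

```python
def try_decode(data):
    """Decode data as UTF-8, returning None if it is not well-formed."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        print(f"rejected: {err.reason} at byte offset {err.start}")
        return None


sample = bytes([0xAD, 0x63, 0x61, 0x73])  # the sequence AD 63 61 73

print(try_decode(sample))   # prints a rejection message, then None
print(try_decode(b"cas"))   # well-formed input decodes normally

# A lenient decoder has to substitute something for the bad byte;
# Python's "replace" handler uses U+FFFD REPLACEMENT CHARACTER.
print(ascii(sample.decode("utf-8", errors="replace")))
```

The point of the sketch is only the contrast: the strict path reports an error and yields nothing, while the lenient path yields text that was never in the input.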
Re: Can anyone help me!!!
hi,

can't i use unicode to generate and show the fonts in any browser, irrespective of whether it supports unicode? for example by writing a plugin or something like that, so that when a user whose browser doesn't support unicode wants to access the webpage, he/she needs to install that plugin. will that be possible?

On Thu, 28 Sep 2000, Yung-Fong Tang wrote:

> Antoine Leca wrote:
>
> > sanatan mohanty wrote:
> >
> > > i have a project to make a webpage, which will be unicode enabled.
> >
> > Good.
> >
> > > i can show indian language fonts. i can type those fonts on the webpage itself in text boxes!
> >
> > Ah! How do you do that? Or do you mean "would/should" instead?
> >
> > > and it should at least work on netscape and windows explorer!, and at least LINUX and Windows OS should support it!
> >
> > I am not aware that Netscape, even in version 6, is able to display Indian sentences encoded in Unicode (although it is able to display individual characters). The problem is in the rendering (displaying) of the conjuncts, and the reordering of the left-positioned matras.
>
> Does Netscape6 on Win2K have this problem? If so, can you put together a test page for us? We know there are problems when we try to select the conjuncts. However, since we use TextOutW, in theory TextOutW should handle conjuncts and the reordering of the left-positioned matras.
>
> > > so, can you people give me some brief ideas about keyboard mapping,
> >
> > Keyboard layout is unrelated to the problem. You can use whatever you want (or are comfortable with). However, you certainly need a Unicode-capable editor. Very few of them are Indian-enabled (Microsoft's are the best choice, but not the cheapest, particularly since they practically need Win2000).
> >
> > > unicode font setting,
> >
> > There are very few Indian "Unicode" fonts for the moment. And even fewer work with X11/Linux. In fact, I am not aware of any such font. Which is the main reason why I ask the questions above.
> >
> > > display setting
> >
> > What do you mean by display setting? The display setting is on the client side. You are not going to have any form of control over this setting... (and no, I do not like browsing a web site and encountering a page that says "please, change all your settings in order to browse my site"; actually, I often switch away).
> >
> > Antoine
RE: Can anyone help me!!!
Hi,

Writing a plugin would not be enough. There are quite a few issues to deal with when rendering Indian text in a browser without Unicode support (as you all know). I assume that you are looking for a solution that works for more than just one browser on one platform!?

Some browsers may support neither Unicode text encoding formats (e.g. UTF-8) nor rendering of 16-bit characters. They would probably also be unable to deal with the complex character shaping, positioning and text direction issues found in Indian and other languages. Some browsers do not support downloading (partial) fonts yet, so these browsers may not be able to show the text even if they did support Unicode. There are other issues as well.

It's not impossible to solve these problems, but it is *very* hard. We (at BorWare AB) are working on a product with which we intend to support Unicode, CSS level 2 and font embedding on many platforms and browsers. Specifically, it will support Indian Unicode fonts (OpenType Layout) and non-Unicode Indian fonts (TT, T1, etc.) in IE 4.x, IE 5.x, Nav 4.x, Nav 6.x, Op4 and WebTV on (non-Indian) Windows, Unix and Mac. It's being beta tested right now and should be available sometime next year...

Regards,
- Michael

-----Original Message-----
From: sanatan mohanty [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 17, 2000 5:33 PM
To: Unicode List
Cc: Unicode List; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: Can anyone help me!!!

hi,

can't i use unicode to generate and show the fonts in any browser, irrespective of whether it supports unicode? for example by writing a plugin or something like that, so that when a user whose browser doesn't support unicode wants to access the webpage, he/she needs to install that plugin. will that be possible?
C Programming for Unicode
Hi,

I would like to modify an existing C application so that it supports Unicode. Does anybody know of any references or samples that would help?

Thanks.
SoHee
Preliminary charts for Unicode 3.2 draft
Preliminary character charts are now available for those characters that are proposed to go into Unicode 3.2 (and into AMD1 to ISO/IEC 10646-1:2000). The majority of the proposed characters are mathematical symbols and arrows. The new URL is:

http://www.unicode.org/charts/draftunicode32/

There is also a link to the draft charts from the Pipeline Table. The link is in a new paragraph reading:

    Charts of the characters proposed for addition in Unicode Version 3.2 are currently available for review. The charts provide preliminary information only. Click here for the index.

Unlike other proposal documents, these charts show the new characters in context with the existing characters, using the standard charts and nameslist format. The file format is PDF.

The charts are made available to allow implementers to prepare products that will eventually support Unicode 3.2 when it is published. Please note the cautionary language and disclaimers that accompany these charts.

If you note any errors, omissions, inaccuracies etc., you may send your detailed comments to me.

A./
Re: C Programming for Unicode
There are a few options, depending on what you mean by "supports unicode". If all you care about is code page conversion so that your program can process Unicode code points, glibc is freely available on many platforms, http://www.gnu.org. If your application requires more sophisticated Unicode support, such as collation and word breaking, take a look at ICU, http://oss.software.ibm.com/icu. It's also freely available for many interesting environments. Qt also provides a great set of features, again for free.

A more complete list of internationalization libraries can be found at http://www.unicode.org/unicode/onlinedat/products.html. Some of them are commercial products and some are not.

----- Original Message -----
From: "SoHee Kim" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Tuesday, October 17, 2000 1:54 PM
Subject: C Programming for Unicode

Hi,

I would like to modify an existing C application so that it supports Unicode. Does anybody know of any references or samples that would help?

Thanks.
SoHee
Re: C Programming for Unicode
On Tue, 17 Oct 2000, Helena Shih wrote:

> There are a few options, depending on what you mean by "supports unicode". If all you care about is code page conversion so that your program can process Unicode code points, glibc is freely available on many platforms, http://www.gnu.org.

I'm afraid this is a bit of an understatement of what glibc can do (among other things, glibc can do collation, like any other C library, given appropriate locales). It would be a good description of iconv and friends in glibc (or in any C library whose iconv supports many encodings).

In case glibc is too big to install on the target platform, there's also a standalone free (LGPLed) libiconv (developed by Bruno Haible) that offers iconv(3) for a lot of encodings.

http://clisp.cons.org/~haible/packages-libiconv.html

Jungshik Shin
Re: C Programming for Unicode
My apologies, I didn't realize glibc also supports the Unicode collation algorithm. If so, yes, my statement underestimated the support in glibc quite a bit. Sorry.

----- Original Message -----

I'm afraid this is a bit of an understatement of what glibc can do (among other things, glibc can do collation, like any other C library, given appropriate locales). It would be a good description of iconv and friends in glibc (or in any C library whose iconv supports many encodings).

In case glibc is too big to install on the target platform, there's also a standalone free (LGPLed) libiconv (developed by Bruno Haible) that offers iconv(3) for a lot of encodings.

http://clisp.cons.org/~haible/packages-libiconv.html

Jungshik Shin
Re: Korean syllable decomposition (was: CJK combining components)
On Tue, 17 Oct 2000 [EMAIL PROTECTED] wrote:

> So, do they have a table that says "This hangul syllable is made up of components X, Y, and Z"?) Maybe Unicode should have one.

Well, Unicode will never have one for dynamic glyph composition of Hangul syllables ;-) because there are so many possibilities (how many different sets of glyphs to use for initial consonants, medial vowels and final consonants; the higher the quality you want, the more sets you need).

One example of such a table, though, is provided by the source code of Hanterm (a Korean xterm, http://elf.kaist.ac.kr/hanterm), which can make use of both precomposed Hangul fonts (with only the 2350 syllables of KS X 1001) and fonts made up of Jamos (10 sets of initial consonants, 4? sets of medial vowels and 4? sets of final consonants) for on-the-fly composition of glyphs (for all 11,172 modern syllables and thousands of antique syllables). Mozilla supports that, and you may find it interesting to go through nsUnicodeToX11Johab.cpp (at www.mozilla.org, follow the link for the source code and type in the file name).

The Unix/X11 JDK used to allow this kind of on-the-fly composition by simply editing the font.properties file and providing a simple Java class to take care of dynamic composition, but at least the Linux port of JDK 1.2 stopped working that way.

> When you make a Korean font, you only need to make the components and have a program combine them for you, correct?

It's not that simple, unfortunately. In principle that's possible, but in reality it still needs a lot of manual intervention to get a high-quality font. (I'm not familiar with the way foundries in Korea make Hangul fonts.) Anyway, if you look inside some of the TrueType fonts with Hangul syllables, you'll find a lot of components (Jamos) that, I presume, make up syllables using the facilities TrueType provides for 'dynamic' composition(??).

Jungshik Shin
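For the character-level half of the question ("this hangul syllable is made up of components X, Y, and Z"), no table is needed at all: the Unicode Standard defines the decomposition of the 11,172 precomposed syllables arithmetically. The following is a minimal sketch of that arithmetic, assuming Python 3; the function name decompose_hangul is hypothetical, and the constants are the ones the standard gives for Hangul syllable decomposition. It says nothing about which glyph variant of each jamo a font should pick, which is the hard part described above.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per initial consonant
S_COUNT = L_COUNT * N_COUNT   # 11,172 precomposed syllables in total


def decompose_hangul(syllable):
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    jamo = [chr(lead), chr(vowel)]
    if trail != T_BASE:           # index 0 means "no final consonant"
        jamo.append(chr(trail))
    return jamo


for syllable in "\ud55c\uae00":   # the two syllables of the word "hangul"
    parts = decompose_hangul(syllable)
    print(syllable, [f"U+{ord(j):04X}" for j in parts])
    # Cross-check against the library's own canonical decomposition.
    assert "".join(parts) == unicodedata.normalize("NFD", syllable)
```

The result is a sequence of jamo code points only; mapping each jamo to one of several positional glyph sets, as Hanterm and Mozilla do, still requires font-specific tables.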
RE: C Programming for Unicode
SoHee,

See http://oss.software.ibm.com/developerworks/opensource/icu/project/index.html

This has a library of Unicode C APIs. Most of the docs are for the C++ APIs, but if you look at the user guide

http://oss.software.ibm.com/developerworks/opensource/icu/project/userguide/index.html

you can see examples of the C API.

Carl

-----Original Message-----
From: SoHee Kim [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 17, 2000 1:54 PM
To: Unicode List
Subject: C Programming for Unicode

Hi,

I would like to modify existing C application so that it supports unicode. Does anybody know any references any samples that would help?

Thanks.
SoHee