Re: Indic Devanagari Query
Hi Aditya, --- Aditya Gokhale [EMAIL PROTECTED] wrote: I had few query regarding representation of Devanagari script in Unicode (Code page - 0x0900 - 0x097F). Devanagari is a writing script, is used in Hindi, Marathi and Sanskrit languages. I have following questions - In the same script code page, how do I use these two different Glyphs, to represent the same character ? Is there any way by which I can do it in an Open type font and Free type font implementation ? Yes, it is certainly possible with OpenType font. Please note that FreeType is not a font format but it is a rendering library used to rasterize different kind of fonts including TrueType and OpenType fonts. In an Opentype font, you can include all glyphs with alternate shapes and then select one of them depending upon the script and language. Application should specify script and language tag while sending character codes to the opentype rendering library/engine. All substitution will be taken place depending on the language and/or script selection. There should be a default script in the font. Similarly there will be a default language for that script which will be used as fallback language if application does not specify which language to be used for processing. From the list of alternate glyphs you may want to use the glyph for default language for an entry in cmap table. This default glyph can be substituted by alternate glyph depending upon the language specification. You have to use GSUB table and write language dependent lookup for substitution. 2. Implementation Query - In an implementation where I need to send / process Hindi, Marathi and Sanskrit data, how do I differentiate between languages (Hindi, Marathi and Sanskrit). Say for example, I am writing a translation engine, and I want to translate a document having Hindi, Marathi and Sanskrit Text in it, how do I know from the code points between 0x0900 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ? 
Unicode is not divided into code pages. Unlike few old encodings there is only one code page for entire Unicode standard. However, for better readability and quick user reference the entire chart has been divided into different sections which you might interpret as code pages. I would suggest that we should give different code pages for Marathi, Hindi and Sanskrit. May be current code page of Devanagari can be traded as Hindi and two new code pages for Marathi and Sanskrit be added. This could solve these issues. If there is any better way of solving this, any one suggest. Unicode gives code points to script only and not language. In fact it is not desirable to give code points to individual languages falling under the same script. Also, Unicode encodes characters which have abstract meaning and properties. Unicode does not encode glyphs. The shapes of glyphs shown in the Unicode chart have been given just for convenience and not actually represent the shapes to be used in the font. The shape of the glyph for a Unicode character may vary from one font to another. Since it is already possible to select proper glyph(s) depending upon language selection, this scheme is suitable for all Indian languages. 3. Character codes for jna, shra, ksh - In Sanskrit and Marathi jna, shra and ksh are considered as separate characters and not ligatures. How do we take care of this ? Can I get over all views on the matter from the group ? In my opinion they should be given different code points in the specific language code page. Please find below the character glyphs - jna shra ksh All of the above can be composed through following consonant clusters: jna - ja halant nya shra - sha halant ra ksh - ka halant ssha The point that the above sequences are considered as characters in some of the Indian languages has merit. If there is demand from native speakers then a proposal can be submitted to Unicode. There is a predefined procedure for proposal submission. 
Once this is discussed with the people concerned and agreed upon, these ligatures can be added to the Devanagari script itself, because the Devanagari script represents all three languages you mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile you can write rules for composing them from the consonant clusters. Regards, Keyur
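As a small illustration of the consonant clusters listed above (jna, shra, ksh), the following Python sketch spells each one out as a code point sequence and prints the standard character names. The dictionary keys and the helper name `describe` are made up for this example; only the code points come from the Devanagari block.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)

# Code point sequences for the three conjuncts discussed in the message.
CLUSTERS = {
    "jna":  "\u091C" + VIRAMA + "\u091E",  # JA + virama + NYA
    "shra": "\u0936" + VIRAMA + "\u0930",  # SHA + virama + RA
    "ksh":  "\u0915" + VIRAMA + "\u0937",  # KA + virama + SSA
}

def describe(text):
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]

if __name__ == "__main__":
    for label, seq in CLUSTERS.items():
        print(label, seq, describe(seq))
```

Whether a sequence is drawn as a single conjunct glyph depends on the font and the rendering engine; the stored text is three code points either way.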
Re: Indic Devanagari Query
Hi,

I forgot to reply to the implementation query. The reply is inline.

--- Aditya Gokhale [EMAIL PROTECTED] wrote:

> 2. Implementation Query - In an implementation where I need to send / process Hindi, Marathi and Sanskrit data, how do I differentiate between languages (Hindi, Marathi and Sanskrit). Say for example, I am writing a translation engine, and I want to translate a document having Hindi, Marathi and Sanskrit Text in it, how do I know from the code points between 0x0900 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ? I would suggest that we should give different code pages for Marathi, Hindi and Sanskrit. May be current code page of Devanagari can be traded as Hindi and two new code pages for Marathi and Sanskrit be added. This could solve these issues. If there is any better way of solving this, any one suggest.

Instead of changing, or recommending a change to, an encoding standard, your problem is best solved in your application. You can use tags in your text to specify the language. Unicode also provides tag characters for this purpose, but their use is strongly discouraged. So you can use a markup language such as XML or HTML to mark language boundaries. Then parse your text, identify the language boundaries, and do further processing depending on the language.

If you don't want to use tags in your text, you can predict the language with a heuristic. The heuristic can be based on language properties that differ among the three languages. In this case your processing is divided into two phases: the first applies the heuristic to identify language boundaries in plain text, and the second actually processes the text for translation. But beware that the result of such heuristic processing will not always be accurate. Hence the use of tags is recommended.

Regards, Keyur
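A minimal sketch of the tagging approach described above, assuming the text is wrapped in XML with the standard `xml:lang` attribute. The element names `doc` and `p`, the sample sentences, and the function name `language_runs` are invented for illustration; the parsing uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative document: same script, three languages, tagged explicitly.
SAMPLE = (
    "<doc>"
    '<p xml:lang="hi">यह हिन्दी है।</p>'
    '<p xml:lang="mr">हे मराठी आहे.</p>'
    '<p xml:lang="sa">इदं संस्कृतम्।</p>'
    "</doc>"
)

def language_runs(xml_text):
    """Return (language tag, text) pairs, one per <p> element."""
    root = ET.fromstring(xml_text)
    # "und" (undetermined) is used here when a paragraph carries no tag.
    return [(p.get(XML_LANG, "und"), p.text or "") for p in root.iter("p")]

if __name__ == "__main__":
    for lang, text in language_runs(SAMPLE):
        print(lang, text)
```

Each run can then be handed to the language-specific stage of the translation engine; no inspection of the code points is needed to tell the languages apart.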
Re: Indic Devanagari Query
Hello, Thanks for the reply. I will check the points you mentioned as far as the font issues are concerned. We all know how jna, shra and ksh are formed in Unicode and ISCII, but the point I wanted to make was that if we have to sort / search / process data in the Devanagari script, then we have to keep track of at least three characters and not one. This becomes tedious, though not impossible. If a single code point were present it would be very easy to process. With regard to predicting the language with a heuristic, in my opinion it is a very risky solution, at least when I don't have much information at stage one of my application. I am running an OCR engine on a Devanagari page, then tagging the language based on the formatting. So I think tagging, as I am doing right now, is the better solution. I also agree with the view expressed by Asmus Freytag that if we go on including all 6000 languages, it will be practically impossible to cross-correlate these 'code pages'. -Aditya
RE: Indic Devanagari Query
Aditya Gokhale wrote: Hello Everybody, I had few query regarding representation of Devanagari script in Unicode All your questions are FAQ's, so I'll just reference the entries which answers them. (Code page - 0x0900 - 0x097F). Devanagari is a writing script, is used in Hindi, Marathi and Sanskrit languages. I have following questions - Unicode has no code pages: http://www.unicode.org/faq/basic_q.html#18 1. In Marathi and Sanskrit language two characters glyphs of 'la' and 'sha' are represented differently as shown in the image below - (First glyph is 'la' and second one is 'sha') as compared to Hindi where these character glyphs are represented as shown in the image below - (First glyph is 'la' and second one is 'sha') Unicode encodes (abstract) characters, not glyphs: http://www.unicode.org/faq/han_cjk.html#3 (This FAQ is in the Chinese/Japanese/Korean section because it is more often raised for Chinese ideograms.) In the same script code page, how do I use these two different Glyphs, to represent the same character ? Is there any way by which I can do it in an Open type font and Free type font implementation ? Unicode's requirements for fonts: http://www.unicode.org/faq/font_keyboard.html#1 A few links to OpenType stuff: http://www.unicode.org/faq/font_keyboard.html#4 2. Implementation Query - In an implementation where I need to send / process Hindi, Marathi and Sanskrit data, how do I differentiate between languages (Hindi, Marathi and Sanskrit). Say for example, I am writing a translation engine, and I want to translate a document having Hindi, Marathi and Sanskrit Text in it, how do I know from the code points between 0x0900 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ? What you need here is some sort of language tagging: http://www.unicode.org/faq/languagetagging.html I would suggest that we should give different code pages for Marathi, Hindi and Sanskrit. 
May be current code page of Devanagari can be traded as Hindi and two new code pages for Marathi and Sanskrit be added. This could solve these issues. If there is any better way of solving this, any one suggest. Characters are encoded per script, not per language: http://www.unicode.org/faq/basic_q.html#17 3. Character codes for jna, shra, ksh - In Sanskrit and Marathi jna, shra and ksh are considered as separate characters and not ligatures. How do we take care of this ? Can I get over all views on the matter from the group ? In my opinion they should be given different code points in the specific language code page. Please find below the character glyphs - Unicode encodes Indic analytically: http://www.unicode.org/faq/indic.html#17 thanks, For more details about Devanagari in Unicode, see Chapter 9 of the Standard: http://www.unicode.org/uni2book/ch09.pdf _ Marco
Re: Indic Devanagari Query
--- Asmus Freytag [EMAIL PROTECTED] wrote: All of the above can be composed through following consonant clusters: jna - ja halant nya shra - sha halant ra ksh - ka halant ssha The point that the above sequences are considered as characters in some of the Indian languages has merit. If there is demand from native speakers then a proposal can be submitted to Unicode. There is a predefined procedure for proposal submission. Once this is discussed with concerned people and agreed upon then these ligatures can be added in Devanagari script itself because Devenagari script represent all three languages you mentioned namely Sanskrit, Marathi, and Hindi. Meanwhile you can write rules for composing them from the consonant clusters. I wouldn't go so far. The fact that clusters belong together is something that can be handled by the software. Collation and other data processing needs to deal with such issues already for many other languages. See http://www.unicode.org/reports/tr10 on the collation algorithm. I beg to differ with you on this point. Merely having some provision for composing a character doesn't mean that the character is not a candidate for inclusion as separate code point. India is a big country with millions of people geographically divided and speaking variety of languages. Sentiments are attached with cultures which may vary from one geographical area to another. So when one of the many languages falling under the same script dominate the entire encoding for the script, then other group of people may feel that their language has not been represented properly in the encoding. While Unicode encodes scripts only, the aim was to provide sufficient representation to as many languages as possible. In Unicode many characters have been given codepoints regardless of the fact that the same character could have been rendered through some compose mechanism. This includes Indic scripts as well as other scripts. 
For example, in the Devanagari script some code points are allocated to characters (Consonant+Nukta) even though the same characters could be produced by combining the consonant and the Nukta. Similarly, in the Latin-1 range [U+0080-U+00FF] there are a few characters which can be produced otherwise. That is why text should be normalized to either a pre-composed or a de-composed character sequence before further processing in operations like searching and sorting. Also, processing of text often depends on the smallest addressable unit of the language. Again, as discussed in earlier e-mails, this may vary from one language to another within the same script. Consider a case where a language processor/application wants to count the number of characters in some text in order to find the number of keystrokes required to input the text. Further assume that the API functions used for this purpose are based on either WChar (wide characters) or UTF-8. In this case it is very much necessary that you assign the character, say Kssha, to the class consonant. Since assignment to the class consonant applies to a single code point (the smallest addressable unit) and not to a sequence of codes, it is very much necessary to have a single code point for the character Kssha. This is my understanding. Please enlighten me if I am wrong. Regards, Keyur
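To make the normalization point concrete, here is a short Python sketch using the standard `unicodedata` module. It compares the precomposed Consonant+Nukta letter U+0958 (QA) with the sequence KA + NUKTA, and does the same for a Latin-1 letter. The helper name `same_text` is made up for this example.

```python
import unicodedata

QA = "\u0958"              # DEVANAGARI LETTER QA (precomposed Consonant+Nukta)
KA_NUKTA = "\u0915\u093C"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA

def same_text(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

if __name__ == "__main__":
    print(QA == KA_NUKTA)                    # raw code points differ
    print(same_text(QA, KA_NUKTA))           # canonically equivalent
    print(same_text("\u00E4", "a\u0308"))    # Latin-1 a-diaeresis, same idea
```

One detail worth knowing: U+0958 is a composition exclusion, so NFC also yields the two-code-point form for it, whereas the Latin sequence `a` + U+0308 does recompose to U+00E4 under NFC. Either form works for searching and sorting as long as both sides are normalized the same way.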
Suggestions in Unicode Indic FAQ
Hello, There are few discrepancies in Indic FAQ. Though it was reported earlier by Andy White, I see they still have place there in the FAQ. I also clarified it but by mistake I sent the mail to Yahoo groups where this mailing list is archived and hence my mail never reached to this mailing list. You can refer to the link http://groups.yahoo.com/group/unicode/message/16352 The following are the suggestions. SUGGESTION-1: In the FAQ http://www.unicode.org/faq/indic.html#2 it is mentioned that ISCII: Unicode: Halant + Halant Halant + ZWJ produce similar result. This is wrong. In ISCII, Halant+Halant is known as explicit halant and its Unicode equivalent sequence is Halant+ZWNJ. So ZWJ should be replaced by ZWNJ. SUGGESTION-2: In the FAQ http://www.unicode.org/faq/indic.html#16 It is mentioned that following are equivalent ISCII Unicode KA halant INV KA virama ZWJ RA halant INV RAsup (i.e., repha) In fact there is no way in Unicode to produce RAsup directly, i.e., without using base consonant. The sequence RA virama ZWJ will actually produce half-RA (or eyelash-RA) which is used commonly in Marathi. eyelash-RA can also be produced with the sequence RA Halant Nukta sequence both in ISCII (known as soft halant) and Unicode (just for conformance with ISCII). Also, in the same answer the following sequence is recommended. ISCII Unicode INV halant RA SPACE virama RA (RAsub) SUGGESTION-3: Use of SPACE character as consonant may create problem for state machine which finds language/syllable boundary. In fact we need a codepoint for one invisible consonant (similar to INV in ISCII) in Unicode which can solve this problem with Unicode. After inclusion of INV character the following can be recommended. ISCII Unicode KA halant INV KA virama INV RA halant INV RA virama INV (i.e., repha) INV halant RA INV virama RA (RAsub) The INV character in Unicode can also be used for displaying dependent vowel matras without dotted circle. Unicode INV Vowel sign O INV Vowel sign AI etc. 
This can replace the existing definition of SPACE as an invisible consonant depending on the context. Any other pointers? - Keyur
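For readers who want to try the sequences discussed in these suggestions, the Python sketch below writes them out as code points. The labels paraphrase the message and the FAQ entries it quotes; the dictionary and the helper name `code_points` are invented for this example, and what actually appears on screen depends on the font and rendering engine.

```python
import unicodedata

KA, RA = "\u0915", "\u0930"
VIRAMA = "\u094D"   # halant
ZWJ = "\u200D"      # ZERO WIDTH JOINER
ZWNJ = "\u200C"     # ZERO WIDTH NON-JOINER

# Labels follow the wording of the message above, not an official table.
SEQUENCES = {
    "explicit halant (ISCII halant+halant)": KA + VIRAMA + ZWNJ,
    "KA virama ZWJ (FAQ: ISCII KA halant INV)": KA + VIRAMA + ZWJ,
    "RA virama ZWJ (eyelash-RA per the message)": RA + VIRAMA + ZWJ,
}

def code_points(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

if __name__ == "__main__":
    for label, seq in SEQUENCES.items():
        names = ", ".join(unicodedata.name(ch) for ch in seq)
        print(f"{label}: {code_points(seq)} ({names})")
```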
Re: Indic Devanagari Query
Keyur Shroff scripsit:

> Sentiments are attached with cultures which may vary from one geographical area to another. So when one of the many languages falling under the same script dominate the entire encoding for the script, then other group of people may feel that their language has not been represented properly in the encoding.

Indeed, they may have such beliefs, but those beliefs are based on two incorrect notions: that what the charts show is normative, and that the codepoint is the proper unit of processing.

> In Unicode many characters have been given codepoints regardless of the fact that the same character could have been rendered through some compose mechanism.

In every case this was done for backward compatibility with existing encodings. No new codepoints of this type will be added in future.

> That is why the text should be normalized to either pre-composed or de-composed character sequence before going for further processing in operations like searching and sorting.

The collation algorithm makes allowance for these points. It will be quite typical to tailor the algorithm to take language-specific rules into account.

> Also, many times processing of text depends on the smallest addressable unit of that language. Again as discussed in earlier e-mails this may vary from one language to another in the same script. Consider a case when a language processor/application wants to count the number of characters in some text in order to find number of keystrokes required to input the text.

This will not work without knowledge of the keyboard layout in any case. To enter Latin-1 characters on the Windows U.S. keyboard requires 5 keystrokes, but they are represented by one or two Unicode characters.

--
John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan
Henry S. Thompson said, / Syntactic, structural, Value constraints we / Express on the fly. Simon St. Laurent: Your / Incomprehensible Abracadabralike / schemas must die!
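The remark that the codepoint is not the proper unit of processing can be illustrated with a small Python sketch that counts "letters" by treating selected clusters as single units. The cluster list and the function name `letters` are assumptions made for this example; a real application would choose the list per language and would likely also attach vowel signs to their base.

```python
import re

# Hypothetical, language-specific list of clusters to count as one letter.
LETTER_CLUSTERS = [
    "\u0915\u094D\u0937",  # KA + virama + SSA
    "\u091C\u094D\u091E",  # JA + virama + NYA
    "\u0936\u094D\u0930",  # SHA + virama + RA
]

# Try the listed clusters first, then fall back to any single code point.
_UNIT = re.compile("|".join(map(re.escape, LETTER_CLUSTERS)) + "|.", re.DOTALL)

def letters(text):
    """Split text into units, keeping the listed clusters together."""
    return _UNIT.findall(text)

if __name__ == "__main__":
    word = "\u0915\u094D\u0937\u092E\u093E"  # five code points
    print(len(word), len(letters(word)), letters(word))
```

The stored text is unchanged; only the application's notion of a unit differs, which is the kind of handling the replies in this thread recommend over new code points.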
Re: Indic Devanagari Query
At 02:13 -0800 2003-01-29, Keyur Shroff wrote:

> I beg to differ with you on this point. Merely having some provision for composing a character doesn't mean that the character is not a candidate for inclusion as separate code point.

Yes, it does.

> India is a big country with millions of people geographically divided and speaking variety of languages. Sentiments are attached with cultures which may vary from one geographical area to another. So when one of the many languages falling under the same script dominate the entire encoding for the script, then other group of people may feel that their language has not been represented properly in the encoding.

A lot of these feelings are simply WRONG, and that has to be faced. The syllable KSSA may be treated as a single letter, but this does not change the fact that it is a ligature of KA and SSA and that it can be represented in Unicode by a string of three characters.

> In Unicode many characters have been given codepoints regardless of the fact that the same character could have been rendered through some compose mechanism. This includes Indic scripts as well as other scripts. For example, in Devanagari script some code points are allocated to characters (ConsonantNukta) even though the same characters could be produced with combination of the consonant and Nukta.

There are historical and compatibility reasons that most of this stuff, as well as the similar stuff in the Latin range, was encoded. At one point some years ago the line was drawn, normalization was enacted, and that was that.

> Also, many times processing of text depends on the smallest addressable unit of that language. Again as discussed in earlier e-mails this may vary from one language to another in the same script. Consider a case when a language processor/application wants to count the number of characters in some text in order to find number of keystrokes required to input the text.

I can't think of any reason why this would be useful. And what if you were not typing, but speaking to your computer? Then there would be no keystrokes at all!

> Further assume that API functions used for this purpose are based on either WChar (wide characters) or UTF-8. In this case it is very much necessary that you assign the character, say Kssha, to the class consonant. Since assignment to this class consonant applies to single code point (the smallest addressable unit) and not to the sequence of codes, it is very much necessary to have single code point for the character Kssha.

We are not going to encode KSSA as a single character. It is a ligature of KA and SSA, and can already be represented in Unicode. You need to handle this consonant issue with some other protocol.

--
Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Suggestions in Unicode Indic FAQ
Keyur Shroff wrote: In the FAQ http://www.unicode.org/faq/indic.html#16 It is mentioned that following are equivalent ISCII Unicode KA halant INV KA virama ZWJ RA halant INV RAsup (i.e., repha) The last line is really bizarre! I would agree that it is plain wrong... What is supposed to appear in column Unicode is the Unicode *encoding* equivalent to the RA halant INV in the ISCII column. But RAsup (i.e., repha) is the description of a *glyph*. In fact there is no way in Unicode to produce RAsup directly, i.e., without using base consonant. [...] I agree. This issue has been raised several times, and several viable solutions have been proposed, but I don't remember that Unicode officials ever showed to even acknowledge the problem. But probably this has been noted down and discussed. I hope to see an official solution in TUS 4.0. SUGGESTION-3: Use of SPACE character as consonant may create problem for state machine which finds language/syllable boundary. In fact we need a codepoint for one invisible consonant (similar to INV in ISCII) in Unicode which can solve this problem with Unicode. After inclusion of INV character the following can be recommended. ISCII Unicode KA halant INV KA virama INV RA halant INV RA virama INV (i.e., repha) INV halant RA INV virama RA (RAsub) Why not representing INV with a double ZWJ? E.g.: ISCII Unicode KA halant INV KA virama ZWJ ZWJ RA halant INV RA virama ZWJ ZWJ (i.e., repha) INV halant RA ZWJ ZWJ virama RA (RAsub) This has the advantage that the most common sequences will work OK also on old display engines implemented *before* the double-ZWJ convention is introduced. E.g., sequence KA virama ZWJ ZWJ works well also on an old engine, for the simple reason that the first ZWJ is enough to do the work, and the second ZWJ is invisible. 
Of course, an old engine will still display a RA[eyelash] for RA virama ZWJ ZWJ, but that is not worse than displaying RA+virama followed by a white box, which is what would happen with your new INV character. _ Marco
RE: Suggestions in Unicode Indic FAQ
> The [new] INV character in Unicode can also be used for displaying dependent vowel matras without dotted circle.

A space followed by a dependent vowel sign should display just the dependent vowel sign, no dotted circle. Indeed, (except for a show invisibles mode, or a character chart display mode) no (Indic or other) text that does not contain the *character* DOTTED CIRCLE should ever display a dotted circle as part of the displayed text. Systems that do display a dotted circle (in normal display mode) where there is no such *character* in the displayed text are buggy!

/Kent K

(B.t.w. the chart dotted circle glyph for combining characters looks a bit different from the (normal) glyph for DOTTED CIRCLE.)
RE: Suggestions in Unicode Indic FAQ
--- Marco Cimarosti [EMAIL PROTECTED] wrote: Why not representing INV with a double ZWJ? E.g.: ISCII Unicode KA halant INV KA virama ZWJ ZWJ RA halant INV RA virama ZWJ ZWJ (i.e., repha) INV halant RA ZWJ ZWJ virama RA (RAsub) This has the advantage that the most common sequences will work OK also on old display engines implemented *before* the double-ZWJ convention is introduced. E.g., sequence KA virama ZWJ ZWJ works well also on an old engine, for the simple reason that the first ZWJ is enough to do the work, and the second ZWJ is invisible. Of course, an old engine will still display a RA[eyelash] for RA virama ZWJ ZWJ, but that is not worse than displaying RA+virama followed by a white box, which is what would happen with your new INV character. Certainly. This looks more promising because even RAsub has two alternate forms. One form is used with consonants KA, KHA, GHA, etc and the other form is used with consonants TTA, TTHA, DDA, DDHA, etc. With your ZWJ based scheme we can insert as many ZWJ as we wish to produce all possible alternate forms! But sometimes a user may want visual representation of these symbols in two different ways: with dotted circle and without dotted circle. Example of this could be RAsup on top of dotted circle and RAsup on top of space character. Current use of space character to eliminate dotted circle is really painful and may create problems in determining language and syllable boundaries. The main problem with space character is that unlike ZWJ/ZWNJ/Dotted Circle, it falls within the range of other important script Latin. Finally it may affect all important text processing which uses Unicode characters to find language boundaries. Use of INV character in one shot can solve all these problems. We can put it in consonant class which can help text processing applications. Moreover, it will be difficult for all possible to provide upward compatibility all the time even though it is desirable. 
Implementations of Unicode will need to be upgraded with every introduction of new glyphs or rules. Otherwise applications have to explicitly declare the version of Unicode used in the implementation. - Keyur
RE: Indic Devanagari Query
>> I wouldn't go so far. The fact that clusters belong together is something that can be handled by the software. Collation and other data processing needs to deal with such issues already for many other languages. See http://www.unicode.org/reports/tr10 on the collation algorithm.

> I beg to differ with you on this point. Merely having some provision for composing a character doesn't mean that the character is not a candidate for inclusion as separate code point.

At this point, having some provision for composing a particular letter is very much preventing it from being encoded at a separate code position. This is due mostly to the fixation of normal forms (except for very rare error corrections).

> In Unicode many characters have been given codepoints regardless of the fact that the same character could have been rendered through some compose mechanism. This includes Indic scripts as well as other scripts. For

For legacy reasons, yes. These reasons no longer apply for not-yet-encoded compositions.

> Also, many times processing of text depends on the smallest addressable unit of that language. Again as discussed in earlier e-mails this may vary from one language to another in the same script. Consider a case when a language processor/application wants to count the number of characters in some text in order to find number of keystrokes required to input the text.

You cannot find the number of keystrokes that way. Not even if you know which keyboard (and disregarding backspace). E.g. ä can be produced by one or two (or more, if you count hex input) keystrokes on (most) Swedish keyboards.

> Further assume that API functions used for this purpose are based on either WChar (wide characters) or UTF-8. In this case it is very much necessary that you assign the character, say Kssha, to the class consonant. Since assignment to this class consonant applies to single code point (the smallest addressable unit) and not to the sequence of codes, it is very much necessary to have single code point for the character Kssha.

No, that is not the case. E.g. Hungarian (Magyar) has gy, ny, ly (and more) as letters (look in a Hungarian dictionary, and its headings). Similarly, Albanian has dh, rr, th (and more) as letters. None of these combinations are candidates for single code point allocation. For compatibility reasons the Dutch ij got a single code point, but it is better to just use i followed by j (though that has some difficulties; e.g. the titlecase of ijs is IJs, not Ijs).

/Kent K
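The Hungarian example can be sketched in Python as a sort key that treats digraphs such as gy and cs as single letters without any dedicated code points. This is a deliberately simplified stand-in for a tailored collation, not the Unicode Collation Algorithm: the alphabet list, the greedy matching, and the names `hu_letters` and `hu_key` are assumptions made for this example, and accented vowels are ranked as separate letters only for brevity.

```python
# Simplified Hungarian alphabet; multi-character entries are single letters.
HU_ALPHABET = [
    "a", "á", "b", "c", "cs", "d", "dz", "dzs", "e", "é", "f", "g", "gy",
    "h", "i", "í", "j", "k", "l", "ly", "m", "n", "ny", "o", "ó", "ö", "ő",
    "p", "q", "r", "s", "sz", "t", "ty", "u", "ú", "ü", "ű", "v", "w", "x",
    "y", "z", "zs",
]
RANK = {letter: i for i, letter in enumerate(HU_ALPHABET)}
MAX_LEN = max(len(letter) for letter in HU_ALPHABET)

def hu_letters(word):
    """Split a word into letters, preferring the longest known letter."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        for n in range(MAX_LEN, 0, -1):
            chunk = word[i:i + n]
            if chunk in RANK or n == 1:
                out.append(chunk)
                i += len(chunk)
                break
    return out

def hu_key(word):
    """Sort key: alphabet rank per letter; unknown symbols sort last."""
    return [RANK.get(letter, len(RANK)) for letter in hu_letters(word)]

if __name__ == "__main__":
    words = ["csak", "cukor", "gyár"]
    print(sorted(words))              # plain code point order
    print(sorted(words, key=hu_key))  # digraph-aware order
```

Greedy matching mis-splits words where two letters merely happen to be adjacent (for example across a compound boundary), which is one reason real collation tailorings are more involved than this sketch.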
RE: Suggestions in Unicode Indic FAQ
--- Kent Karlsson [EMAIL PROTECTED] wrote: A space followed by a dependent vowel sign should display just the dependent vowel sign, no dotted circle. Indeed, (except for a show invisibles mode, or a character chart display mode) no (Indic or other) text that does not contain the *character* DOTTED CIRCLE should ever display a dotted circle as part of the displayed text. Systems that do display a dotted circle (in normal display mode) where there is no such *character* in the displayed text are buggy! In Indic scripts any sign that appear in text not in conjunction with a valid consonant base may be rendered with dotted circle as fallback mechanism (Section 5.14 Rendering Nonspacing Marks http://www.unicode.org/uni2book/ch05.pdf). Any system implementing this as default behaviour should not be considered buggy. What should be the default rendering behaviour (i.e., show hidden or not) may vary from one script to another script and also depends on implementation policy. For scripts other than Indic scripts, it may be useful to render the nonspacing mark without dotted circle because even after rendering it as an overlap glyph, the result is recognizable. However, for Indic scripts use of dotted circle is very useful as default behaviour since it gives immediate feedback to the user that there may be some defective combining character in the text. Most of the time such errors are unintentional rather than intentional. Unicode has provision to remove this dotted circle. Space character is used to give indication to fallback mechanism that no dotted circle should be used while rendering this stand alone sign which is normally attached to other characters. This is useful when sometimes user want to display the sign without any circle. Also, with this scheme it is possible to show some combining marks with dotted circle and some without dotted circle. - Keyur __ Do you Yahoo!? Yahoo! Mail Plus - Powerful. Affordable. Sign up now. http://mailplus.yahoo.com
RE: Suggestions in Unicode Indic FAQ
Keyur Shroff wrote:

> But sometimes a user may want visual representation of these symbols in two different ways: with dotted circle and without dotted circle.

Why not use a dotted circle character explicitly, when you want to see one?

> Example of this could be RAsup on top of dotted circle and RAsup on top of space character. Current use of space character to eliminate dotted circle is really painful and may create problems in determining language and syllable boundaries.

Language or syllable boundaries have nothing to do with this. These special sequences should *never* be part of any syllable or word in any language: they are just a way of showing the shape of a glyph, to be used when, e.g., talking about typography or spelling.

> The main problem with space character is that unlike ZWJ/ZWNJ/Dotted Circle, it falls within the range of other important script Latin.

Plain wrong! White-space characters and punctuation do not belong to any script: characters such as ",", "!" and "?" are used for many scripts and languages. Even the danda punctuation, which is in the Devanagari range, does not belong to Devanagari: it is also used for other Indic scripts.

> Use of INV character in one shot can solve all these problems. We can put it in consonant class which can help text processing applications. [...]

How can calling a consonant something which has nothing to do with consonants help anybody doing anything?

_ Marco
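The point that white space and punctuation belong to no script can be reflected in a boundary finder by treating such characters as neutral, so that they join whatever run surrounds them. The Python sketch below is a rough approximation under stated assumptions: the standard library has no Script property, so it uses the block range 0x0900-0x097F plus the General Category as a stand-in, and the names `classify` and `script_runs` are invented for this example.

```python
import unicodedata

def classify(ch):
    """Rough stand-in for the Unicode Script property (an assumption)."""
    # Separators (Z*), punctuation (P*, which includes the danda) and the
    # joiners are treated as neutral: they carry no script of their own.
    if unicodedata.category(ch)[0] in "ZP" or ch in "\u200C\u200D":
        return None
    if 0x0900 <= ord(ch) <= 0x097F:
        return "Devanagari"
    return "Other"

def script_runs(text):
    """Split text into (label, substring) runs; neutrals join a neighbour."""
    runs = []  # list of [label, substring]
    for ch in text:
        label = classify(ch)
        if runs and (label is None or runs[-1][0] in (None, label)):
            if runs[-1][0] is None:
                runs[-1][0] = label   # leading neutrals adopt the next label
            runs[-1][1] += ch
        else:
            runs.append([label, ch])
    return [tuple(run) for run in runs]

if __name__ == "__main__":
    print(script_runs("abc \u0915\u0964"))
    print(script_runs("\u0915 \u093F"))  # SPACE does not split the run
```

With this treatment a SPACE followed by a dependent vowel sign stays inside the Devanagari run, so the space does not register as a Latin character.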
Re: Indic Devanagari Query
Michael Everson wrote:

> At 02:13 -0800 2003-01-29, Keyur Shroff wrote:
>
>> I beg to differ with you on this point. Merely having some provision for composing a character doesn't mean that the character is not a candidate for inclusion as separate code point.
>
> Yes, it does.
>
>> India is a big country with millions of people geographically divided and speaking variety of languages. Sentiments are attached with cultures which may vary from one geographical area to another. So when one of the many languages falling under the same script dominate the entire encoding for the script, then other group of people may feel that their language has not been represented properly in the encoding.
>
> A lot of these feelings are simply WRONG, and that has to be faced. The syllable KSSA may be treated as a single letter, but this does not change the fact that it is a ligature of KA and SSA and that it can be represented in Unicode by a string of three characters.

Of course an anomaly is that KSSA *is* encoded in the Tibetan block at U+0F69. In normal Tibetan or Dzongkha words KSSA U+0F69 (or the combination U+0F40 U+0FB5) does not occur - AFAIK it is *only* used when writing Sanskrit words containing KSSA in Tibetan script. I had thought that the argument for including KSSA as a separate character in the Tibetan block (rather than only having U+0F40 and U+0FB5) was originally for compatibility / cross mapping with Devanagari and other Indic scripts.

- Chris
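The relationship between U+0F69 and the pair U+0F40 U+0FB5 mentioned above can be checked with Python's standard `unicodedata` module. The helper name `canonical_parts` is made up for this example.

```python
import unicodedata

KSSA = "\u0F69"  # TIBETAN LETTER KSSA

def canonical_parts(ch):
    """Names of the code points in the canonical decomposition (NFD) of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

if __name__ == "__main__":
    print(unicodedata.name(KSSA))
    print(unicodedata.decomposition(KSSA))  # decomposition mapping as hex
    print(canonical_parts(KSSA))
```

Because the precomposed letter is canonically equivalent to the two-character sequence, normalized text processing treats both spellings as the same text.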
Re: Indic Devanagari Query
Aditya Gokhale wrote: 1. In Marathi and Sanskrit language two characters glyphs of 'la' and 'sha' are represented differently as shown in the image below - Actually, for everyone's information: these allographs for Marathi were recently brought to our attention, and Unicode 4.0 will have a mention of the allographs, including pictures of the variant glyphs. Rick
RE: Suggestions in Unicode Indic FAQ
Keyur Shroff wrote:

> Kent Karlsson [EMAIL PROTECTED] wrote:
>> A space followed by a dependent vowel sign should display just the dependent vowel sign, no dotted circle. Indeed, (except for a show invisibles mode, or a character chart display mode) no (Indic or other) text that does not contain the *character* DOTTED CIRCLE should ever display a dotted circle as part of the displayed text. Systems that do display a dotted circle (in normal display mode) where there is no such *character* in the displayed text are buggy!

> In Indic scripts any sign that appears in text not in conjunction with a valid consonant base may be rendered with dotted circle as fallback mechanism (Section 5.14 Rendering Nonspacing Marks http://www.unicode.org/uni2book/ch05.pdf).

I don't know where you find support for that position in that text. Can you please quote? There are no invalid base consonants for any dependent vowel (for Indic scripts; similarly for any other script).

> Any system implementing this as default behaviour should not be considered buggy.

Indeed they are. And it should certainly not be default behaviour. Any combining characters can be placed on any base characters without there being any dotted circles displayed. In particular, any combining Devanagari characters (note: including, in principle, several dependent vowels, even if that does not occur in any (existing) orthography) can be placed on any Devanagari base character as well as SPACE (and other punctuation). What should result is a reasonable composed glyph, no dotted circle in sight (except in show invisibles mode, which I'm not discussing here). Spelling errors should be indicated otherwise, since they are of a very different nature.

> For scripts other than Indic scripts, it may be useful to render the nonspacing mark without dotted circle because even after rendering it as an overlap glyph, the result is recognizable. However, for Indic scripts use of dotted circle is very useful as default behaviour since it gives immediate feedback to the user that there may be some defective combining character in the text. Most of the time such errors are unintentional rather than intentional.

No combination of base + combining characters is defective per se. Even if the scripts are different within the combining sequence. (Note also that the 0300 block of combining characters are script independent.) Spelling errors are something else entirely.

> Unicode has provision to remove this dotted circle.

I'm not sure what you are talking about here.

> Space character is used to give indication to fallback mechanism that no dotted circle should be used while rendering this stand alone sign which is normally attached to other characters. This is useful when sometimes a user wants to display the sign without any circle. Also, with this scheme it is possible to show some combining marks with dotted circle and some without dotted circle.

The fallback mechanisms talked about in section 5.14 of TUS 3.0 are the use of less than ideal (typographically!) mechanisms to display an *approximation* of the glyph(s) for the combining sequence. An exceedingly bad approximation is displaying a dotted circle as a fake base (again: disregarding show invisibles, or chart modes, which, however, should be consistent and show a dotted circle fake base for ALL combining characters occurring in the text). The use of this exceedingly bad approximation (in normal display mode) does in no way indicate that the combining sequence is at all defective. It may indicate that the display engine (or the font) is defective...

/Kent K
RE: Indic Devanagari Query
Christopher John Fynn wrote:

> I had thought that the argument for including KSSA as a separate character in the Tibetan block (rather than only having U+0F40 and U+0FB5) was originally for compatibility / cross mapping with Devanagari and other Indic scripts.

Which is not a valid reason either, considering that U+0F69 and the combination U+0F40 U+0FB5 are *canonically* equivalent. This means that normalizing applications are not allowed to treat U+0F69 differently from U+0F40 U+0FB5, including displaying them differently or mapping them differently to something else.

Marco
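The canonical equivalence mentioned here can be observed with the normalizer in the Java standard library. This is a minimal sketch, assuming a JDK whose `java.text.Normalizer` carries the usual decomposition data for U+0F69; the class name `KssaEquivalence` and its helper are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical class, for illustration only.
final class KssaEquivalence {
    static final String PRECOMPOSED = "\u0F69";     // TIBETAN LETTER KSSA
    static final String SEQUENCE = "\u0F40\u0FB5";  // TIBETAN LETTER KA + SUBJOINED LETTER SSA

    // Two strings are canonically equivalent when their NFD forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        // A plain code unit comparison sees two different strings...
        System.out.println(PRECOMPOSED.equals(SEQUENCE));
        // ...while a comparison after normalization does not.
        System.out.println(canonicallyEquivalent(PRECOMPOSED, SEQUENCE));
    }
}
```

A process that normalizes its input therefore cannot rely on U+0F69 surviving as a distinct code point, which is the practical consequence Marco describes.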
RE: Suggestions in Unicode Indic FAQ
--- Marco Cimarosti [EMAIL PROTECTED] wrote:

> Keyur Shroff wrote:
>> But sometimes a user may want visual representation of these symbols in two different ways: with dotted circle and without dotted circle.

> Why not use a dotted circle character explicitly, when you want to see one?

Note that whenever I mention the word combining mark I am really talking about vowel signs (matras) and other modifiers in Indic scripts which are script dependent. I am sorry if I have confused you with the combining diacritical marks in the block [U+0300-U+036F] which I really didn't mean.

Let me give a proper example this time. Consider a Vowel Sign E [U+0947] appearing after any non-consonant character. This sign is generally attached to the consonants. It has zero advance width with negative left side bearing in the font. Clearly, since in this case the sign is not preceded by any consonant base, it has to be rendered using one of the mechanisms specified in fallback rendering of non-spacing marks.

If we render it with space, as you said, then we have to insert space character at the time of fallback rendering (which can be taken care of in the rendering pipeline) even though space character is not present in backing store of the application. Now in order to render it with dotted circle if we introduce the circle in the text before this sign then also the circle is invalid base for this Vowel Sign E. As a result, again fallback rendering will take place with rendering circle and the vowel sign positionally separate. In this case first dotted circle will appear which will be followed by vowel sign (matra) on top of space character.

If you know any other way to solve this problem then please explain. Also let me know if I have misinterpreted the text written in Unicode standard.

>> Example of this could be RAsup on top of dotted circle and RAsup on top of space character. Current use of space character to eliminate dotted circle is really painful and may create problems in determining language and syllable boundaries.

> Languages or syllable boundaries have nothing to do with this. These special sequences should *never* be part of any syllable or word in any language: they are just a way of showing the shape of a glyph, to be used when, e.g., talking about typography or spelling.

Then how can we take care of fallback mechanism?

Thanks for taking pain for answering my queries :-)

- Keyur
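The situation Keyur describes, a dependent vowel sign such as U+0947 that is not preceded by a letter, can at least be located in the backing store without touching the text. The sketch below is an assumption-laden illustration and not anything specified by the Unicode Standard: the class `IsolatedSigns`, both method names, and the rule "a base is any letter" are invented here, and what a renderer should then do with such positions is exactly the point the thread disagrees on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class IsolatedSigns {
    // True for Devanagari combining marks (general category Mn or Mc).
    static boolean isDevanagariSign(int cp) {
        int type = Character.getType(cp);
        return Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.DEVANAGARI
                && (type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK);
    }

    // Offsets (in UTF-16 code units) of signs that do not follow a letter or another sign.
    // Assumption: only letters count as bases; SPACE and DOTTED CIRCLE do not.
    static List<Integer> isolatedSignOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        int prev = -1; // no previous code point yet
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean prevCarries = prev >= 0
                    && (Character.isLetter(prev) || isDevanagariSign(prev));
            if (isDevanagariSign(cp) && !prevCarries) {
                offsets.add(i);
            }
            prev = cp;
            i += Character.charCount(cp);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(isolatedSignOffsets("\u0915\u0947")); // KA + VOWEL SIGN E
        System.out.println(isolatedSignOffsets(" \u0947"));      // SPACE + VOWEL SIGN E
    }
}
```

Because the check only reports positions, an application can use it for a spelling hint or a "show invisibles" mode while leaving normal display alone.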
urban legends just won't go away!
http://archive.devx.com/free/tips/tipview.asp?content_id=4151

Who knew in this day and age flipping bits to change case is still publishable (this is from today!)

Barry Caplan
www.i18n.com
Vendor Showcase: http://Showcase.i18n.com

--
Use Logical Bit Operations to Changing Character Case

This is a simple example demonstrating my own personal method.

    // to lower case
    public char lower(int c) {
        return (char)((c >= 65 && c <= 90) ? c |= 0x20 : c);
    }

    //to upper case
    public char upper(int c) {
        return (char)((c >= 97 && c <= 122) ? c ^= 0x20 : c);
    }

    /* If I would I could create a method for converting an entire string to lower, like this: */
    public String getLowerString(String s) {
        char[] c = s.toCharArray();
        char[] cres = new char[s.length()];
        for (int i = 0; i < c.length; ++i)
            cres[i] = lower(c[i]);
        return String.valueOf(cres);
    }

    /* even converting in capital: */
    public String capital(String s) {
        return String.valueOf(upper(s.toCharArray()[0])).concat(s.substring(1));
    }

    /* using it*/
    public static void main(String args[]) {
        x xx = new x();
        System.out.println(xx.getLowerString("LOWER: " + "FRAME"));
        System.out.println(xx.upper('f'));
        System.out.println(xx.capital("randomaccessfile"));
    }
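Why the bit trick counts as an urban legend can be shown by putting it next to the case mapping the Java standard library already provides. This is a sketch for comparison only: the class `CaseMapping` and its two methods are made up for this note, and `bitFlipLower` is a restatement of the tip's idea rather than the tip's own code.

```java
import java.util.Locale;

// Hypothetical comparison class, for illustration only.
final class CaseMapping {
    // The ASCII-only bit trick: sets bit 0x20 for 'A'..'Z', ignores everything else.
    static String bitFlipLower(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            char c = out[i];
            out[i] = (c >= 'A' && c <= 'Z') ? (char) (c | 0x20) : c;
        }
        return new String(out);
    }

    // Case mapping through the standard library, with an explicit locale.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        String word = "\u00C9COLE"; // capital E with acute, then "COLE"
        System.out.println(bitFlipLower(word));        // accented capital is left untouched
        System.out.println(lower(word, Locale.ROOT));  // accented capital is lowercased too
        // Turkish maps capital I to dotless i (U+0131), so case depends on the locale.
        System.out.println(lower("TITLE", Locale.forLanguageTag("tr")));
        // Case mapping can change the string length: sharp s uppercases to "SS".
        System.out.println("stra\u00DFe".toUpperCase(Locale.ROOT));
    }
}
```

None of the last three cases can be expressed as flipping one bit per character, which is why the library methods take a whole string and a locale.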