Arabic Presentation Forms
Hello Unicode-List,

I need to figure out a method to convert Arabic Unicode text encoded in its normal form to Arabic Unicode text encoded in Arabic presentation forms. So the code point of each character in the basic Arabic block (e.g. 0622) would be converted to the equivalent Arabic presentation form code point (e.g. Fxxx). Thus, if the character was a letter "Lam" in the middle of a word, it would be encoded with the corresponding medial presentation form.

Do you have any suggestions on how I could convert a piece of Unicode text in this manner? Are there any programs that could do this?

Thank you very much for the information.

Mete Kural
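The conversion asked about here amounts to picking, for each letter, the isolated, final, initial or medial form according to whether its neighbours join to it. A minimal sketch of that idea in Java follows. The class name, the method name and the five-letter table are illustrative assumptions, not an existing library; the sketch ignores the mandatory lam-alef ligatures, treats every character outside the table (including vowel marks) as non-joining, and covers only a handful of letters, so it is nowhere near a complete shaper.

```java
import java.util.HashMap;
import java.util.Map;

public class ArabicShapingSketch {
    // Presentation Forms-B code points in the order {isolated, final, initial, medial}.
    // A 0 in the last two slots means the letter never joins to the following letter.
    private static final Map<Character, char[]> FORMS = new HashMap<>();
    static {
        FORMS.put('\u0622', new char[] {'\uFE81', '\uFE82', 0, 0});               // ALEF WITH MADDA ABOVE
        FORMS.put('\u0627', new char[] {'\uFE8D', '\uFE8E', 0, 0});               // ALEF
        FORMS.put('\u0628', new char[] {'\uFE8F', '\uFE90', '\uFE91', '\uFE92'}); // BEH
        FORMS.put('\u0644', new char[] {'\uFEDD', '\uFEDE', '\uFEDF', '\uFEE0'}); // LAM
        FORMS.put('\u0645', new char[] {'\uFEE1', '\uFEE2', '\uFEE3', '\uFEE4'}); // MEEM
    }

    /** Replaces each letter found in the table by its contextual presentation form. */
    public static String shape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char[] forms = FORMS.get(c);
            if (forms == null) {          // not in the table: copy unchanged
                out.append(c);
                continue;
            }
            char[] prev = i > 0 ? FORMS.get(text.charAt(i - 1)) : null;
            char[] next = i + 1 < text.length() ? FORMS.get(text.charAt(i + 1)) : null;
            // The previous letter connects to this one only if it has an initial form.
            boolean joinsPrev = prev != null && prev[2] != 0;
            // This letter connects to the next one only if it has an initial form itself.
            boolean joinsNext = next != null && forms[2] != 0;
            int index = joinsPrev ? (joinsNext ? 3 : 1) : (joinsNext ? 2 : 0);
            out.append(forms[index]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // BEH + LAM + MEEM: prints the chosen presentation form code points in hex.
        for (char c : shape("\u0628\u0644\u0645").toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

In this sketch the Lam in the middle of the three-letter word comes out as its medial form, which is the case described in the question.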
Re: vietnamese font
> correct.
> Plain old Arial actually isn't your best choice, because it displays the

actually i meant ms arial unicode. it's handy but huge.

> Fonts on my Windows 2000 system (at work) that support the Vietnamese
> precomposed characters *and* display these combinations correctly
> include:

good to know. thanks.

> settle for a font that only supports the "Windows Vietnamese" (CP1258)
> set. And do *not* let yourself get involved with VNI fonts.

ah more good advice, thanks again.
Re: vietnamese font
Paul Hastings wrote:

> besides ms arial would anybody like to recommend a font suitable for
> vietnamese?

Plain old Arial actually isn't your best choice, because it displays the circumflex-plus-grave and circumflex-plus-acute combinations incorrectly. For Vietnamese they're supposed to be side-by-side, not one stacked on top of the other. Courier New and Times New Roman have the same problem.

Fonts on my Windows 2000 system (at work) that support the Vietnamese precomposed characters *and* display these combinations correctly include:

- Arial Unicode MS
- Code2000
- Gentium and GentiumAlt
- Microsoft Sans Serif (not "MS Sans Serif")
- Palatino Linotype (not "Palatino")
- Tahoma
- TITUS Cyberbit Basic
- Verdana
- Verdana Ref

Palatino Linotype is interesting: it displays the grave tone mark to the right of the circumflex (^`), unlike the others (`^), but according to some Vietnamese the first method is preferable. On Verdana Ref, the vertical spacing is a bit too tight for comfort for stacked combinations like circumflex-plus-tilde, but you may be able to adjust this.

For best typography, you want to make sure your font supports the Vietnamese precomposed characters in the U+1Exx range. The decomposed combinations are canonically equivalent, of course, but they may not display as nicely on currently available rendering engines.

Don't settle for a font that only supports the "Windows Vietnamese" (CP1258) set. And do *not* let yourself get involved with VNI fonts.

-Doug Ewell
Fullerton, California
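The canonical equivalence between precomposed and decomposed Vietnamese characters can be checked, and decomposed text can be composed before display, with the normalization support in the Java standard library. The following is a small sketch using java.text.Normalizer; the class and method names are illustrative assumptions, and whether composing first actually improves rendering depends on the font and rendering engine in use.

```java
import java.text.Normalizer;

public class VietnameseNormalizationSketch {
    /** True if the two strings are canonically equivalent (identical after NFD). */
    public static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    /** Composes base-plus-mark sequences into precomposed characters where they exist (NFC). */
    public static String precompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u1EA7";       // LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
        String decomposed = "a\u0302\u0300"; // a + COMBINING CIRCUMFLEX ACCENT + COMBINING GRAVE ACCENT
        System.out.println(canonicallyEquivalent(precomposed, decomposed));
        System.out.println(precompose(decomposed).equals(precomposed));
    }
}
```

Note that the order of the two marks matters: circumflex followed by grave is equivalent to U+1EA7, while grave followed by circumflex is a different sequence.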
Re: urban legends just won't go away!
> This is a simple example demonstrating my own personal method.
>
> //to upper case
public char upper(int c)
{
    return (char)((c >= 97 && c <=122) ? VisitSewers(c) : c);
}

static int VisitSewers(int c)
{
    return AlligatorByte(c);
}

static int AlligatorByte(int c)
{
    // Remove SPACE from character and return mangled remnant.
    return (c - 0x20);
}
Fwd: News: W3C home page genuinely served as UTF-8
This came in recently:

From: Martin Duerst <[EMAIL PROTECTED]>
Subject: News: W3C home page genuinely served as UTF-8

This is just a very small news item that I wanted to share (probably too little too late, but a step in the right direction anyway):

As of a few minutes ago, the W3C home page at http://www.w3.org is finally served genuinely as UTF-8. It had already been labelled with UTF-8 as the charset/encoding for a while, but numeric character references were used for the few non-US-ASCII characters (copyright, registered) in the page, so that the charset/encoding was not particularly relevant.

Regards,
Martin.
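The difference described in this news item can be made concrete with a short Java sketch: a page that writes the copyright sign as a numeric character reference contains only ASCII bytes whatever charset is declared, while a page that contains the character itself needs the two-byte UTF-8 sequence. The class and method names below are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

public class NcrVersusUtf8Sketch {
    /** Replaces every non-ASCII code point with a decimal numeric character reference. */
    public static String toNumericReferences(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);                 // ASCII is copied unchanged
            } else {
                out.append("&#").append(cp).append(';'); // e.g. U+00A9 becomes &#169;
            }
        });
        return out.toString();
    }

    /** Encodes the text as UTF-8 bytes. */
    public static byte[] utf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toNumericReferences("\u00A9 W3C\u00AE"));
        for (byte b : utf8("\u00A9")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

With references, the bytes on the wire are the same under US-ASCII, ISO 8859-1 or UTF-8, which is why the declared encoding "was not particularly relevant" until the characters themselves were used.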
Re: Indic Devanagari Query
On 2003.01.29, 05:52, Aditya Gokhale <[EMAIL PROTECTED]> wrote:

> 1. In Marathi and Sanskrit language two characters glyphs of 'la' and
> 'sha' are represented differently as shown in the image below -
>
> (First glyph is 'la' and second one is 'sha')
>
> as compared to Hindi where these character glyphs are represented as
> shown in the image below -
>
> (First glyph is 'la' and second one is 'sha')

Not very different from the Serbian vs. Russian rendition of Cyrillic lower case "i" in italics. There are more examples, though (almost?) none in the Latin script.

-- 
António MARTINS-Tuválkin <[EMAIL PROTECTED]>
R. Laureano de Oliveira, 64 r/c esq.
PT-1885-050 MOSCAVIDE (LRS)
+351 917 511 459
http://www.tuvalkin.web.pt/bandeira/
http://pagina.de/bandeiras/

Não me invejo de quem tem
carros, parelhas e montes
só me invejo de quem bebe
a água em todas as fontes
Re[2]: ISO 639 arg - Esperanto
On 2003.01.22, 17:04, Markus Scherer <[EMAIL PROTECTED]> wrote:

> If only Ferran used "real" characters (ĉar) instead of
> transliterations (cxar) - oddly intermixed with using
> "real" é in "aragonés" :-}
>
> Sorry for being off-topic.

Not too off-topic, IMHO, as it concerns the still incomplete penetration of UTF encodings in e-mail practice. The quoted message used, as is usual in Esperanto, a 7-bit surrogate along with cp1252. It is still very rare to see properly presented Esperanto in e-mails (unlike, e.g., on web pages).

Replacing U+0302 with U+0078 (seldom others) is, BTW, not a transliteration but a surrogate, AFAIK first used in the 1920s for Esperanto telegraphy.

-- 
António MARTINS-Tuválkin <[EMAIL PROTECTED]>
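Undoing the x-surrogate is mechanical, which is part of why it survives in 7-bit mail. A minimal Java sketch follows, assuming the common convention in which c, g, h, j, s and u followed by x stand for the six accented Esperanto letters; the class and method names are illustrative. Note that the u carries a breve rather than a circumflex, and that a blind replacement like this would also rewrite any foreign word or name that happens to contain one of these letter pairs.

```java
public class EsperantoXSystemSketch {
    // Base letters and the accented letters they stand for when followed by x.
    private static final String BASES = "cghjsuCGHJSU";
    private static final String ACCENTED =
            "\u0109\u011D\u0125\u0135\u015D\u016D\u0108\u011C\u0124\u0134\u015C\u016C";

    /** Converts x-surrogate spelling (cxar) to real characters (U+0109 followed by "ar"). */
    public static String fromXSystem(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int k = BASES.indexOf(c);
            boolean followedByX = i + 1 < text.length()
                    && (text.charAt(i + 1) == 'x' || text.charAt(i + 1) == 'X');
            if (k >= 0 && followedByX) {
                out.append(ACCENTED.charAt(k));
                i++; // skip the surrogate x
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fromXSystem("cxar"));
    }
}
```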
vietnamese font
besides ms arial would anybody like to recommend a font suitable for vietnamese?

thanks.

-- 
Paul Hastings [EMAIL PROTECTED]
Member Team Macromedia (ColdFusion)
RE: Suggestions in Unicode Indic FAQ
Kent Karlsson wrote,

> > I add that this is a good way of displaying a combining mark that has no
> > base character, i.e. one occurring at the begin of a line or paragraph.
>
> No, those should be displayed *as if* preceded by a SPACE (TUS 3.0 page 121).

So it says. But the 'space method' could be interpreted as belonging to the "Simple Overlap" fallback rendering method, since the paragraph from which you are quoting appears immediately after the paragraph dealing with the "Simple Overlap" method.

In the first paragraph under "Fallback Rendering" (bottom of page 120), the "Show Hidden" method is described. It would have been redundant to say that degenerate cases **under the "Show Hidden" method** need to be displayed as if on a dotted circle, since the "Show Hidden" method is **all about displaying on dotted circles** whenever there is an inability to draw the sequence, as in the event of an invalid sequence.

Whether to use the "Show Hidden" or the "Simple Overlap" method to display invalid or degenerate sequences should be left up to the various software developers. It does seem to be very easy to spot bad input when the dotted circle appears in the display. It stands out like a sore thumb, which was probably the intention.

Perhaps this section of T.U.S. could stand some clarification.

Best regards,

James Kass
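The two positions in this thread differ only in which placeholder a renderer supplies for a combining mark that has no base at the start of a line or paragraph. The Java sketch below makes that choice a parameter. It is only an illustration under stated assumptions: the class, method and constant names are hypothetical, and a real rendering engine would make this decision at the glyph level without altering the stored text.

```java
public class DefectiveMarkFallbackSketch {
    public static final int DOTTED_CIRCLE = 0x25CC; // placeholder used by the dotted-circle display
    public static final int SPACE = 0x0020;         // "as if preceded by a SPACE"

    /** True for nonspacing, spacing combining and enclosing marks (categories Mn, Mc, Me). */
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /**
     * Returns a display string in which every combining mark that starts the text,
     * or directly follows a line or paragraph break, is given placeholderBase as its base.
     */
    public static String addFallbackBase(String text, int placeholderBase) {
        StringBuilder out = new StringBuilder();
        boolean atStart = true; // true at the start of the text and right after a break
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (atStart && isCombiningMark(cp)) {
                out.appendCodePoint(placeholderBase);
            }
            out.appendCodePoint(cp);
            atStart = cp == '\n' || cp == '\r' || cp == 0x2028 || cp == 0x2029;
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String loneMatra = "\u0947"; // DEVANAGARI VOWEL SIGN E with no base
        System.out.println(addFallbackBase(loneMatra, DOTTED_CIRCLE));
        System.out.println(addFallbackBase(loneMatra, SPACE));
    }
}
```

Passing DOTTED_CIRCLE gives the display that makes bad input stand out; passing SPACE gives the freestanding glyph. Marks that already follow a base character are left untouched either way.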
RE: Suggestions in Unicode Indic FAQ
At 01:20 AM 1/30/2003, Marco Cimarosti wrote:

> However, I totally agree with Kent that this funny rendering is *not* a
> requirement of the Unicode standard, as Keyur Shroff seems to suggest. It
> is just an example of many "several methods [that] are available to deal
> with" strange sequences.

Perhaps there is some confusion here because the use of the dotted circle is an explicit recommendation of the MS Indic OpenType spec, which is what the majority of Indian font developers are now working with. So there may be some confusion between what is expected in the MS spec and what is required by Unicode.

John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC [EMAIL PROTECTED]

A book is a visitor whose visits may be rare, or frequent, or so continual that it haunts you like your shadow and becomes a part of you.
- al-Jahiz, The Book of Animals
RE: Suggestions in Unicode Indic FAQ
Keyur Shroff wrote:

> > However, I totally agree with Kent that this funny rendering is *not* a
> > requirement of the Unicode standard, as Keyur Shroff seems to suggest. It
> > is just an example of many "several methods [that] are available to deal
> > with" strange sequences.
>
> A sequence should not be treated as "strange" sequence if it has been
> written intentionally. It may have some contextual meaning.

I said "strange" in the sense of character sequences that are not part of the ordinary spelling of any language.

In fact, a thing like a matra floating in the air or on a dotted circle is something that you'd only see in a text (not necessarily *in* an Indian language) which talks about spelling, character sets, and the like.

> Also, what is good or bad is also subjective. It may also
> vary from one script to another.

Yes, but what is mandatory and what is not in Unicode should not be too subjective, else we could not call it a "standard".

_ Marco
RE: Suggestions in Unicode Indic FAQ
> Let me give a proper example this time. Consider a "Vowel Sign E" [U+0947]
> appearing after any non-consonant character. This sign is generally
> attached to the consonants. It has zero advance width with negative left
> side bearing in the font.

Ok.

> Clearly, since in this case the sign is not
> preceded by any consonant base, it has to be rendered using one of the
> mechanisms specified in fallback rendering of non-spacing marks.

If it is preceded by a SPACE (or is first in a string/paragraph/similar) it should be rendered as a "freestanding" glyph (no dotted circle).

If it is preceded, in the source string, by, say, FULL STOP, a typographically acceptable rendering would be to have the vowel sign E glyph float on top of the glyph for the FULL STOP (no dotted circle). Similarly for a vowel sign E that follows a LATIN CAPITAL LETTER A. (But I don't expect good positioning, just readable.)

Again similarly, a vowel sign I that follows an EQUAL SIGN should be rendered as a vowel sign I glyph to the left of an EQUAL SIGN glyph. No dotted circle. (I know that the reordrant vowel signs may reorder over more than the preceding base character IF it is a (sub)string in an Indic script.)

Again similarly, a string of KA followed by two vowel signs II should be rendered as a KA + II + II glyph sequence (invoking any ligature for KA + II if there is one in the font; II + II is unlikely to have any ligature, since it is not used by any orthography). No dotted circle(s).

The fallback hinted at in TUS 3.0 that uses dotted circles is 1) typographically horrible, and 2) cannot indicate that there is any error in the given character sequence.

...

> the application. Now in order to render it with dotted circle if we
> introduce the circle in the text before this sign then also
> the circle is invalid base for this "Vowel Sign E".

No base character is invalid for any combining character.

...

> > Languages or syllable boundaries have nothing to do with this. These
> > special
> > sequences should *never* be part of any syllable or word in any language:
> > they are just a way of showing the shape of a glyph, to be used when,
> > e.g., talking about typography or spelling.
>
> Then how can we take care of fallback mechanism?

By removing that particular fallback mechanism from implementations as well as the TUS text! (I'm serious!) This particular fallback mechanism is NOT recommended as it stands. But since its mention is erroneously taken as a recommendation, I'd suggest removing also its mention. That mechanism is as bad as misplacing glyphs for combining marks on the glyph(s) for the follow-on character, if not worse.

("Show invisibles" (for all of the text or a "user" selected run of the text) is an entirely different story.)

/Kent K
RE: Suggestions in Unicode Indic FAQ
> > I don't know where you find support for that position in that text.
> > Can you please quote? There are no "invalid base consonants" for
> > any dependent vowel (for Indic scripts; similarly for any
> > other script).
>
> Actually, there is a mention of displaying combining marks on dotted
> circles:

I know. But there is no mention (that I have found) of "invalid base characters" or any recommendation for using dotted circles especially for Indic scripts.

> I add that this is a good way of displaying a combining mark that has no
> base character, i.e. one occurring at the begin of a line or paragraph.

No, those should be displayed *as if* preceded by a SPACE (TUS 3.0 page 121).

/Kent K
RE: urban legends just won't go away!
Barry,

If you think that this is bad, try 390 mainframe EBCDIC shift to upper case. You can shift up to 256 characters at a time with a single machine language instruction by ORing a line of spaces to your character field. Now that is bit flipping, and it is still heavily used.

Carl

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Barry Caplan
> Sent: Wednesday, January 29, 2003 10:01 PM
> To: [EMAIL PROTECTED]
> Subject: urban legends just won't go away!
>
> http://archive.devx.com/free/tips/tipview.asp?content_id=4151
>
> Who knew in this day and age flipping bits to change case is
> still publishable (this is from today!)
>
> Barry Caplan
> www.i18n.com
> Vendor Showcase: http://Showcase.i18n.com
>
> --
>
> Use Logical Bit Operations to Changing Character Case
>
> This is a simple example demonstrating my own personal method.
>
> // to lower case
> public char lower(int c)
> {
>     return (char)((c >= 65 && c <= 90) ? c |= 0x20 : c);
> }
>
> //to upper case
> public char upper(int c)
> {
>     return (char)((c >= 97 && c <=122) ? c ^= 0x20 : c);
> }
> /*
> If I would I could create a method for converting an entire
> string to lower, like this:
> */
> public String getLowerString(String s)
> {
>     char[] c = s.toCharArray();
>     char[] cres = new char[s.length()];
>     for(int i=0;i<c.length;i++)
>         cres[i] = lower(c[i]);
>     return String.valueOf(cres);
> }
> /*
> even converting in capital:
> */
> public String capital(String s)
> {
>     return
>     String.valueOf(upper(s.toCharArray()[0])).concat(s.substring(1));
> }
> /* using it*/
> public static void main(String args[])
> {
>     x xx = new x();
>     System.out.println(xx.getLowerString("LOWER: " + "FRAME"));
>     System.out.println(xx.upper('f'));
>     System.out.println(xx.capital("randomaccessfile"));
> }
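To show why the bit trick counts as an urban legend on this list, here is a short Java sketch that puts it next to the case mapping the standard library already provides. The class and method names are illustrative assumptions. The EBCDIC method only mirrors the "OR with spaces" idea described above, relying on the EBCDIC space being 0x40 and lowercase a (0x81) differing from uppercase A (0xC1) in that one bit; it is not a model of the actual mainframe instruction.

```java
import java.util.Locale;

public class CaseMappingSketch {
    /** The bit trick from the quoted tip: only meaningful for ASCII a-z. */
    public static char asciiUpperByBit(char c) {
        return (char) ((c >= 'a' && c <= 'z') ? c & ~0x20 : c);
    }

    /** EBCDIC variant: OR every byte of the field with the EBCDIC space, 0x40. */
    public static byte[] ebcdicOrWithSpaces(byte[] field) {
        byte[] out = new byte[field.length];
        for (int i = 0; i < field.length; i++) {
            out[i] = (byte) (field[i] | 0x40);
        }
        return out;
    }

    /** Library case mapping, which knows about non-ASCII letters and one-to-many mappings. */
    public static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(asciiUpperByBit('f'));
        System.out.println(asciiUpperByBit('\u00E9') == '\u00E9'); // e-acute is left unchanged
        System.out.println(upper("stra\u00DFe"));                  // sharp s expands to two letters
    }
}
```

The bit trick silently leaves every letter outside a-z alone, and no single-character method can express a mapping such as sharp s to "SS", so string-level, locale-aware APIs are the safer default.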
RE: Suggestions in Unicode Indic FAQ
--- Marco Cimarosti <[EMAIL PROTECTED]> wrote:

> I add that this is a good way of displaying a combining mark that has no
> base character, i.e. one occurring at the begin of a line or paragraph.
>
> However, I totally agree with Kent that this funny rendering is *not* a
> requirement of the Unicode standard, as Keyur Shroff seems to suggest. It
> is just an example of many "several methods [that] are available to deal
> with" strange sequences.

A sequence should not be treated as a "strange" sequence if it has been written intentionally. It may have some contextual meaning.

> > Any combining characters can be placed on any base characters without
> > there being any dotted circles displayed.

Not only that, but it is also desirable. How can one write a vowel matra both with and without a dotted circle in a single document if Unicode recommends placing it only on top of a space character? A matra with a dotted circle is sometimes useful, as in the case of printing or explaining the Unicode standard. A user may want to hide the dotted circle in the same document in order to show the actual shape of the matra character, i.e., without the dotted circle. Both kinds of rendering behaviour are possible. There should be some mechanism to turn the dotted circle on or off, depending on the default behaviour.

Also, what is good or bad is subjective. It may also vary from one script to another.

- Keyur
RE: Suggestions in Unicode Indic FAQ
Kent Karlsson wrote:

> Keyur Shroff wrote
> [...]
> > In Indic scripts any sign that appear in text not in
> > conjunction with a
> > valid consonant base may be rendered with dotted circle as fallback
> > mechanism (Section 5.14 "Rendering Nonspacing Marks"
> > http://www.unicode.org/uni2book/ch05.pdf).
>
> I don't know where you find support for that position in that text.
> Can you please quote? There are no "invalid base consonants" for
> any dependent vowel (for Indic scripts; similarly for any
> other script).

Actually, there is a mention of displaying combining marks on dotted circles:

"Several methods are available to deal with an unknown composed character sequence that is outside of a fixed, renderable set [...]. One method (Show Hidden) indicates the inability to draw the sequence by drawing the base character first and then rendering the nonspacing mark as an individual unit - with the nonspacing mark positioned on a dotted circle."
(The Unicode Standard 3.0, page 120 - 5.14 Rendering Nonspacing Marks - Fallback Rendering)

I add that this is a good way of displaying a combining mark that has no base character, i.e. one occurring at the begin of a line or paragraph.

However, I totally agree with Kent that this funny rendering is *not* a requirement of the Unicode standard, as Keyur Shroff seems to suggest. It is just an example of many "several methods [that] are available to deal with" strange sequences.

> > Any system implementing this as
> > default behaviour should not be considered buggy.
>
> Indeed they are. And it should certainly not be default behaviour.

In this case, I disagree with Kent: displaying these dotted circles is not mandatory, but certainly not a bug.

> Any combining characters can be placed on any base characters without
> there being any dotted circles displayed.

True. But notice that Kent (against his own opinion) correctly wrote "can", not "must".

> [...]

_ Marco
Re: Suggestions in Unicode Indic FAQ
To support what Keyur has to say, I will add a few more things.

Take for instance a "vowel sign" (matras, as we call them here in India), say i (U+093F). When it is combined with a consonant like ka (U+0915) in the sequence ka + i, it forms ki. (Please see the first image.) The repositioning of the shape happens automatically. If anyone puts the i matra (U+093F) first and then the consonant ka (U+0915), then this should be highlighted, because putting the matra on a space can still make the pair look like ki. So to make this mistake very explicit, we have to put a dotted circle. If a wrong combination is stored, it will create a lot of problems in searching the data as well as in sorting.

Please refer to the BIS (Bureau of Indian Standards) ISCII 91 / 88 documentation for this where in

-Aditya

----- Original Message -----
From: "Keyur Shroff" <[EMAIL PROTECTED]>
To: "'Unicode Mailing List'" <[EMAIL PROTECTED]>

> --- Marco Cimarosti <[EMAIL PROTECTED]> wrote:
> > Keyur Shroff wrote:
> > > But sometimes a user may want visual representation of these
> > > symbols in two different ways: with dotted circle and
> > > without dotted circle.
> >
> > Why not use a dotted circle character explicitly, when you want to see
> > one?
>
> Note that whenever I mention the word "combining mark" I am really talking
> about "vowel signs (matras)" and other modifiers in Indic scripts which is
> script dependent. I am sorry if I have confused you with the combining
> diacritical marks in the block [U+0300-U+036F] which I really didn't mean.
>
> Let me give a proper example this time. Consider a "Vowel Sign E" [U+0947]
> appearing after any non-consonant character. This sign is generally
> attached to the consonants. It has zero advance width with negative left
> side bearing in the font. Clearly, since in this case the sign is not
> preceded by any consonant base, it has to be rendered using one of the
> mechanisms specified in fallback rendering of non-spacing marks. If we
> render it with space, as you said, then we have to insert "space" character
> at the time of fallback rendering (which can be taken care in rendering
> pipeline) even though space character is not present in backing store of
> the application. Now in order to render it with dotted circle if we
> introduce the circle in the text before this sign then also the circle is
> invalid base for this "Vowel Sign E". As a result, again fallback rendering
> will take place with rendering circle and the vowel sign positionally
> separate. In this case first dotted circle will appear which will be
> followed by vowel sign (matra) on top of space character.
>
> If you know any other way to solve this problem then please explain. Also
> let me know if I have misinterpreted the text written in Unicode standard.
>
> > > Example of
> > > this could be RAsup on top of dotted circle and RAsup on top of space
> > > character. Current use of space character to eliminate dotted
> > > circle is really painful and may create problems in determining
> > > language and syllable boundaries.
> >
> > Languages or syllable boundaries have nothing to do with this. These
> > special
> > sequences should *never* be part of any syllable or word in any language:
> > they are just a way of showing the shape of a glyph, to be used when,
> > e.g., talking about typography or spelling.
>
> Then how can we take care of fallback mechanism?
>
> Thanks for taking pain for answering my queries :-)
>
> - Keyur