Re: An unexpected sight...
Michael Everson [EMAIL PROTECTED] writes: It is common enough. It is more common in Sweden than it is in Germany. I can't compare with Germany, but I wouldn't say that it's common. I could think of it as a gimmick, but I would be inclined to say that it is more common to use Cyrillic letter shapes. (I'm using "shapes", since they are supposed to be read as their Latin lookalikes, e.g. "ja" should be read as R.) It was more common in Germany, Sweden, and Estonia earlier this century than it is today. You mean that there was a fad last year? I have to confess that I missed it. -- Erland Sommarskog, Stockholm, [EMAIL PROTECTED]
Re: An unexpected sight...
Michael Everson had written: It was more common in Germany, Sweden, and Estonia earlier this century than it is today. On 2001-01-17 at 09:22 h UCT, Erland Sommarskog wrote: You mean that there was a fad last year? I have to confess that I missed it. You mean, this very month? (Rather than last year, which belongs to the previous, viz. 20th, century.) Best wishes, Otto Stolz
Re: conjuncts beginning with independent vowel?
At 13:50 -0800 2001-01-16, [EMAIL PROTECTED] wrote: In the better known Indic scripts, are there ever cases of conjuncts formed with independent vowels and a following consonant? Not in the better-known ones, except possibly in esoteric manuscripts. One finds weird stacking behaviour in Tibetan in such magical texts. Abracadabra, Michael Everson ** Everson Gunn Teoranta ** http://www.egt.ie 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland Mob +353 86 807 9169 ** Fax +353 1 478 2597 ** Vox +353 1 478 2597 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
Re: conjuncts beginning with independent vowel?
Michael Everson wrote: At 13:50 -0800 2001-01-16, [EMAIL PROTECTED] wrote: Now, suppose a VC conjunct were to occur, as described above; "al", for example. Would it seem preferable to treat the vowel like a consonant, and encode as A + virama + L, or to treat the consonant like a dependent vowel, and encode as A + Ldep? No such thing as Ldep in our model. I see two candidates: - U+0962 (dependent vocalic l) and all its variations in the other scripts - U+0D32 ("normal" la in Malayalam) which behaves very much like a dependent vowel (like the ra "vattu" in Nagari) The second is not special (it would be encoded as L anyway! so it returns to the first case). A "problem" with the first is that I was taught that A + Vdep (which A + dependent lri really is) is used as a pedagogical way to teach the alphabet, and should mean the same as the stand-alone form of Lri (and indeed some earlier encodings of Nagari went this way; I am unsure if the telegraph still does). I do not know how extensive this behaviour is (and how it may compete with Peter's proposal). Of course, in regular Nagari, one ought to encode A + virama + La/0932 (+ virama if followed by a consonant or at end of the word in Sanskrit), as this is the way it is written. Antoine
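For readers who want to see the candidate encodings side by side, here is a minimal sketch that spells out the code point sequences named in this message. The variable and function names are illustrative only, and the sketch says nothing about which sequence a font or rendering engine would actually join.

```python
import unicodedata

def describe(seq):
    """Return 'U+XXXX NAME' for every code point in seq."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]

# Candidate sequences discussed in the thread (names are illustrative).
A_VIRAMA_LA = "\u0905\u094D\u0932"   # independent A + virama + consonant LA
A_VOCALIC_L = "\u0905\u0962"         # independent A + dependent vocalic L
MALAYALAM_LA = "\u0D32"              # the "normal" la in Malayalam

for label, seq in [
    ("A + virama + LA", A_VIRAMA_LA),
    ("A + dependent vocalic L", A_VOCALIC_L),
    ("Malayalam LA", MALAYALAM_LA),
]:
    print(label)
    for line in describe(seq):
        print("   ", line)
```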
UNICODE application on IBM Mainframe
I am investigating using the Unicode standard to store and forward Chinese characters in a mainframe (IMS) environment. Basically we want to receive Chinese into the system, encode into UNICODE, send it to the mainframe and store on the IMSDB. At a later stage, then decode back into Chinese for forwarding out of the system. Any advice or feedback from anyone who has done anything similar would be appreciated. How would the Unicode look stored in EBCDIC? For example, code point 006D for 'm' - stored as character '006D' or hex x'006D'? What about the 'U' - or does one HAVE to use one of the UTFs? As you can tell, this is all still new to me. Any hints and tips would be appreciated as well as whether this is feasible or not. In future we would also want to store and forward other languages, as well as possibly update the values using a front-end interface. Regards, Tracey
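The store-and-forward round trip described here can be sketched in a few lines. This is only an illustration under stated assumptions: GB2312 as the incoming legacy encoding and UTF-16BE as the stored form are both guesses, the function names are hypothetical, and nothing below talks to IMS.

```python
def store(incoming: bytes, source_encoding: str) -> bytes:
    """Decode legacy-encoded Chinese text and serialize it as UTF-16BE bytes."""
    text = incoming.decode(source_encoding)   # legacy bytes -> Unicode text
    return text.encode("utf-16-be")           # Unicode text -> bytes to store

def forward(stored: bytes, target_encoding: str) -> bytes:
    """Read the stored UTF-16BE bytes and re-encode them for the outgoing system."""
    return stored.decode("utf-16-be").encode(target_encoding)

# Assumed input: two Chinese characters arriving in GB2312.
incoming = "中文".encode("gb2312")
stored = store(incoming, "gb2312")
outgoing = forward(stored, "gb2312")

print(stored.hex())          # 4e2d6587, i.e. U+4E2D U+6587 as raw 16-bit values
print(outgoing == incoming)  # True: the round trip is lossless for this input
```

The stored bytes are the binary values x'4E2D' x'6587', not the printable characters "4E2D"; the database only has to treat the field as opaque binary data.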
Re: UNICODE application on IBM Mainframe
Unicode is always serialized in a UTF: UTF-8, UTF-16*, or UTF-32*. The definition of each of these is invariant across systems: in UTF-8 an 'a' is always stored as 0x61. There is a special UTF for use on EBCDIC systems. Check out the technical reports and FAQs on www.unicode.org. Mark
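A short sketch may make the point about invariant serialization concrete. It prints the bytes each UTF produces for a few characters and, for contrast, the single byte an EBCDIC code page uses for the same letter. The choice of code page 037 is an assumption made for illustration; UTF-EBCDIC itself is not shown because Python's standard library has no codec for it.

```python
UTFS = ("utf-8", "utf-16-be", "utf-32-be")

def serializations(ch: str) -> dict:
    """Map each UTF name to the hex bytes it produces for one character."""
    return {enc: ch.encode(enc).hex() for enc in UTFS}

for ch in ("a", "m", "中"):
    print(f"U+{ord(ch):04X}", serializations(ch))

# For contrast: the same letter in EBCDIC code page 037 (an assumed code page).
print("m in cp037:", "m".encode("cp037").hex())
```

The letter 'm' (U+006D) is the byte 6d in UTF-8 and the two bytes 00 6d in UTF-16BE on every platform, while in code page 037 it is the byte 94. Converting Unicode data as if it were EBCDIC text would therefore corrupt it.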
Re: UNICODE application on IBM Mainframe
Within the IMS database, any form of data can be stored. Beware, however, that certain parameters, such as the transaction name, must always be in EBCDIC. While the database itself can handle Unicode in any format, you have to be careful about how you work with that data - the IMS Transaction Manager cannot handle Unicode. No 3270-based product can work with Unicode unless it is in the UTF-EBCDIC format. IMS shipped support for Unicode in V7, October 2000, to support working with Unicode using Java and IMS Connect. Lisa
Re: conjuncts beginning with independent vowel?
On 01/17/2001 06:05:15 AM Antoine Leca wrote: Of course, in regular Nagari, one ought to encode A + virama + La/0932 (+ virama if followed by a consonant or at end of the word in Sanskrit), as this is the way it is written. This is actually done? I got the impression from reading chapter 9 in TUS3 that in Devanagari virama occurs only after a consonant, which seems reasonable if you consider that it doesn't make sense to kill an inherent vowel on an independent vowel. I'm trying to sort out what should be proposed for Syloti Nagri. There are four consonants that can be conjoined to a preceding independent vowel. From what I understand, these are mainly used for Arabic borrowings in Islamic texts, but possibly also in English borrowings. So, for example, Allah is written as al-la-h. I hadn't noticed the vocalic L and LL in Devanagari and Bengali before. These do give a precedent of consonantal sounds encoded as combining marks. There is a difference from the Syloti case, though: in D and B, these are distinct marks, discontiguous from the base character, whereas the marks in the Syloti case are conjoined, being obligatorily attached to the vertical stem of the base (independent vowel) character. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
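The expectation that a Devanagari virama follows only a consonant can be checked mechanically. The sketch below is a rough illustration, not a validator from the standard: the consonant ranges (U+0915..U+0939 and U+0958..U+095F) and the handling of an intervening nukta are simplifying assumptions, and the function names are hypothetical.

```python
VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA
NUKTA = "\u093C"    # DEVANAGARI SIGN NUKTA

def is_devanagari_consonant(ch: str) -> bool:
    """True for KA..HA and the nukta consonants QA..YYA (a simplification)."""
    cp = ord(ch)
    return 0x0915 <= cp <= 0x0939 or 0x0958 <= cp <= 0x095F

def virama_after_non_consonant(text: str) -> list:
    """Return the indexes of viramas that do not follow a consonant."""
    hits = []
    for i, ch in enumerate(text):
        if ch != VIRAMA:
            continue
        j = i - 1
        if j >= 0 and text[j] == NUKTA:   # skip a nukta between consonant and virama
            j -= 1
        if j < 0 or not is_devanagari_consonant(text[j]):
            hits.append(i)
    return hits

print(virama_after_non_consonant("\u0915\u094D\u0937"))  # KA + virama + SSA
print(virama_after_non_consonant("\u0905\u094D\u0932"))  # A + virama + LA
```

The first sequence yields no hits; the second flags the virama at index 1, which is the A + virama + L pattern under discussion.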
Re: conjuncts beginning with independent vowel?
On Wed, 17 Jan 2001 [EMAIL PROTECTED] wrote: On 01/17/2001 05:13:25 AM Michael Everson wrote: A + Ldep No such thing as Ldep in our model, so you'd have to rely on A + virama + L. Well, if a script had such behaviour, one possibility could be to propose a combining CONSONANT SIGN L for what we would be choosing to think of as a dependent form of the consonant. I.e. it may not be in an existing model, but for a new script one could create a new model. I hear you saying, though, that you think it would be preferable to fit this into the existing model that uses a virama. - Peter Wednesday, January 17, 2001 A virama after other than a consonant seems un-Indian. My novice's understanding of virama is that it means: If the available rendering capabilities allow it, consider the implicit 'a' expunged and combine the preceding consonant with the next one to form a conjunct; otherwise (i.e. if the rendering capabilities do not allow this) insert the virama glyph beneath the preceding consonant. This would mean the last example in Unicode 3.0 figure 9-3 could be ignored and instead RA + vocalic R vowel sign (U+0930, U+0943 with no virama) would be rendered as independent vocalic R (U+090B) with "reph hook" above it. Regards, Jim Agenbroad ( [EMAIL PROTECTED] ) The above are purely personal opinions, not necessarily the official views of any government or any agency of any. Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
Re: A real bug in bidi
On Tue, 16 Jan 2001, Mark Davis wrote: Doug Felt here confirmed that this is a bug in the implementation section. While it does not affect the conformance of the main algorithm, it would affect people trying to use that optimization strategy. (we here don't use that strategy, by the way). We think that the implementation strategy could be changed to still work, but for now we would recommend removing the characters. Will there be a note in the online version of the technical report to mention this? There may be poor developers just like us ;) who won't know that these recommendations will make their application nonconforming. In our case, we read and reread the spec many times, even by developers who had not heard about the Unicode bidi before, because we simply thought that it's our implementation or interpretation bug. --roozbeh
RE: conjuncts beginning with independent vowel?
In Bengali Vowel_A can form a conjunct with letter_Ya (Ya taking its zophola form.) It has been suggested that this should be encoded as Vowel_A ZWJ Ya I believe that the series V ZWJ C is much more logical than V Virama C as the semantics of virama are to suppress the vowel. Abdul
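The two proposals differ only in the middle code point, which the following sketch makes visible along with the relevant character properties. It is purely illustrative: it shows the two candidate sequences as described in this message, makes no claim about which one fonts or the standard treat as the conjunct, and the names are hypothetical.

```python
import unicodedata

A_ZWJ_YA = "\u0985\u200D\u09AF"      # Vowel_A + ZWJ + Ya
A_VIRAMA_YA = "\u0985\u09CD\u09AF"   # Vowel_A + virama + Ya

def properties(seq):
    """Return (code point, name, general category, combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch),
         unicodedata.category(ch), unicodedata.combining(ch))
        for ch in seq
    ]

for label, seq in [("V ZWJ C", A_ZWJ_YA), ("V Virama C", A_VIRAMA_YA)]:
    print(label)
    for row in properties(seq):
        print("   ", row)
```

ZWJ is a format character (category Cf) with no vowel-suppressing meaning of its own, whereas the virama is a combining mark with its own combining class, which is the distinction the argument above rests on.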
Re: conjuncts beginning with independent vowel?
On 01/17/2001 02:52:41 PM John Hudson wrote: Are these four consonants always joined in this way when following an independent vowel? Or is this behaviour exceptional and limited to borrowed words, etc.? My understanding is the latter. Thus, I don't think obligatory ligation would work. - Peter
RE: conjuncts beginning with independent vowel?
On 01/17/2001 03:10:22 PM "AbdulMalik" wrote: In Bengali Vowel_A can form a conjunct with letter_Ya (Ya taking its zophola form.) It has been suggested that this should be encoded as Vowel_A ZWJ Ya I believe that the series V ZWJ C is much more logical than V Virama C as the semantics of virama are to suppress the vowel. I had thought about this earlier, but forgot to include this among the possibilities when I raised the question. Thanks for bringing it up. This matches the general use of ZWJ for requesting ligation, which UTC decided to add to the semantics of ZWJ last year (thus it can be considered an existing mechanism). But, it doesn't match the use of ZWJ in Indic scripts for forcing half forms rather than conjuncts. This use of ZWJ isn't needed for Syloti Nagri, which does not have half forms. Since that use is applied to other Indic scripts, though, I don't know if it would be a problem to introduce the other use of ZWJ for an Indic script. - Peter
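The existing Indic use of ZWJ referred to here can be written out as code point sequences. The sketch follows the Devanagari conventions described in chapter 9 of Unicode 3.0 as I understand them (consonant + virama + ZWJ requests a half form, consonant + virama + ZWNJ requests a visible virama); the dictionary keys and function names are hypothetical, and actual rendering depends on the font and shaping engine.

```python
VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"      # ZERO WIDTH JOINER
ZWNJ = "\u200C"     # ZERO WIDTH NON-JOINER

def joining_variants(first: str, second: str) -> dict:
    """Build the three sequences that request different joins of two consonants."""
    return {
        "conjunct_or_default": first + VIRAMA + second,
        "half_form_requested": first + VIRAMA + ZWJ + second,
        "explicit_virama_requested": first + VIRAMA + ZWNJ + second,
    }

def code_points(s: str) -> str:
    """Format a string as space-separated U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# KA (U+0915) followed by SSA (U+0937).
for label, seq in joining_variants("\u0915", "\u0937").items():
    print(f"{label:28} {code_points(seq)}")
```

In these sequences ZWJ always follows a virama, whereas the V ZWJ C proposal places it directly after an independent vowel, which is why the two uses would need to be distinguished by context.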
Re: A real bug in bidi
Yes, I have already proposed an agenda item for the next UTC, to get this fix into 3.1. Mark ___ Mark Davis, IBM GCoC, Cupertino (408) 777-5850 [fax: 5891], [EMAIL PROTECTED], [EMAIL PROTECTED] http://maps.yahoo.com/py/maps.py?Pyt=Tmapaddr=10275+N.+De+Anzacsz=95014
Teletext mappings
Hi everyone, I'm preparing some mappings of teletext character sets to Unicode. You can see my results so far at http://www.sneezes.freeserve.co.uk/teletext/tech/charenc/teletextcharencs.html [hope that URL doesn't get split..] This is a LARGE page, btw (150k). In IE5+, hover over the character to get its name. As you can see, I have some ambiguous characters and unknowns, and am wondering whether anyone would like to answer these questions :)
1) I'm not sure about the forms in G0_ARABIC. I've had some excellent help from an Arabic-speaker, but am wondering whether it could be further refined. I've uploaded the tables in the teletext spec to http://www.sneezes.freeserve.co.uk/teletext/tech/charenc/teletextarabic.gif so you can make a comparison. I haven't finished G2_ARABIC yet, so there's a few gaps.
2) Hyphens or dashes - what's the difference?
3) Which to use: 2016: DOUBLE VERTICAL LINE, or 0x2225 PARALLEL TO, or 0x2551 BOX DRAWINGS DOUBLE VERTICAL, or 0x01C1 LATIN LETTER LATERAL CLICK?
4) Turkish Lira - the teletext spec represents this with a combined ligature 'TL', which I can't find a Unicode character for. I've put in 20A4 LIRA SIGN, but I don't think this is what the teletext designers had in mind. Is this a case for a new Unicode character?
5) G0_LATIN_LETTISH_LITHUIAN looks to have a LATIN SMALL LETTER I WITH CEDILLA, which I can't find in Unicode (so I've stuck in i with ogonek instead). Is this missing?
6) Is there a 041F CYRILLIC CAPITAL LETTER PE with a curved top, like 0x22C2 N-ARY INTERSECTION, in both uppercase and lowercase forms? Perhaps this is a particular glyph of the PE character, represented as a separate entry in the teletext table.
7) Misc. other characters: Couldn't decide between
a) 2126: OHM SIGN or GREEK CAPITAL LETTER OMEGA, 03A9
b) 0110: LATIN CAPITAL LETTER D WITH STROKE, or LATIN CAPITAL LETTER ETH, 00D0
c) 00DF: LATIN SMALL LETTER SHARP S, or GREEK SMALL LETTER BETA, 03B2
d) 0251: LATIN SMALL LETTER ALPHA, or GREEK SMALL LETTER ALPHA, 03B1
e) 00B0: DEGREE SIGN, or MASCULINE ORDINAL INDICATOR, 00BA
8) And some others I'm not sure of:
a) Character 0x28 of G2_GREEK, looks like a colon
b) Character 0x6e of G2_LATIN, looks like a tall Greek eta
c) Character 0x7e of G2_LATIN, looks like an eta
d) Character 0x52 of G0_GREEK, I've put it in as 0374 GREEK NUMERAL SIGN but can't be sure
Perhaps there's some 7-bit sets knocking about which the teletext ones were based on, which would help. The full teletext spec is available from http://www.etsi.org, named ETSI 300 706 (you'll have to register to download, but it's free). I suspect the designers of the spec would use a single glyph to represent two characters in some cases, e.g. D with a stroke would mean both 0110 and 00D0, seeing as both lowercase forms are further up in the same set. Hope I haven't asked too much in my first posting to this list :) Regards, Rob.
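When weighing lookalike candidates for a mapping table, it can help to list their formal names and see how normalization treats them. The sketch below is one way to do that with Python's `unicodedata` module; the grouping labels and the `CANDIDATES` table are made up for illustration and are not part of the teletext spec or of the mapping page.

```python
import unicodedata

# Hypothetical grouping of some of the lookalikes raised in the questions above.
CANDIDATES = {
    "double vertical bar": [0x2016, 0x2225, 0x2551, 0x01C1],
    "ohm or omega": [0x2126, 0x03A9],
    "D with stroke or eth": [0x0110, 0x00D0],
    "degree or ordinal": [0x00B0, 0x00BA],
}

def names(code_points):
    """Return (U+XXXX, character name) pairs for a list of code points."""
    return [(f"U+{cp:04X}", unicodedata.name(chr(cp))) for cp in code_points]

for label, cps in CANDIDATES.items():
    print(label)
    for cp, name in names(cps):
        print("   ", cp, name)

# OHM SIGN is canonically equivalent to GREEK CAPITAL LETTER OMEGA,
# so normalization folds the first into the second.
ohm_folds_to_omega = unicodedata.normalize("NFC", "\u2126") == "\u03A9"
print("OHM SIGN normalizes to OMEGA:", ohm_folds_to_omega)

# i with cedilla has no precomposed character; it stays as i + COMBINING CEDILLA.
i_cedilla = unicodedata.normalize("NFC", "i\u0327")
print("i + cedilla code points:", [f"U+{ord(c):04X}" for c in i_cedilla])
```

A mapping table could therefore represent the Lettish/Lithuanian i with cedilla as the two-code-point sequence rather than substituting i with ogonek, if the target system accepts combining sequences.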
PDUTR #27: Unicode 3.1
Proposed Draft Unicode Technical Report #27: Unicode 3.1 is now available at http://www.unicode.org/unicode/reports/tr27/ Please take a look at it and report any problems you may find. It is approximately 60 pages long. Julie Allen Editor, Unicode, Inc.