Re: An unexpected sight...

2001-01-17 Thread Erland Sommarskog

Michael Everson [EMAIL PROTECTED] writes:
 It is common enough. It is more common in Sweden than it is in Germany.

I can't compare with Germany, but I wouldn't say that it's common.
I could think of it as a gimmick, but I would be inclined to say that
it is more common to use Cyrillic letter shapes. (I'm using "shapes",
since they are supposed to be read as their Latin lookalikes, e.g.
"Я" should be read as R.)

 It was more common in Germany, Sweden, and Estonia earlier this century than
 it is today.

You mean that there was a fad last year? I have to confess that I missed
it.
--
Erland Sommarskog, Stockholm, [EMAIL PROTECTED]



Re: An unexpected sight...

2001-01-17 Thread Otto Stolz

Michael Everson had written:
 It was more common in Germany, Sweden, and Estonia earlier this century than
 it is today.

On 2001-01-17 at 09:22 h UTC, Erland Sommarskog wrote:
 You mean that there was a fad last year? I have to confess that I missed
 it.

You mean, this very month? (Rather than last year, which belongs to the
previous, viz. 20th, century.)

Best wishes,
   Otto Stolz



Re: conjuncts beginning with independent vowel?

2001-01-17 Thread Michael Everson

Ar 13:50 -0800 2001-01-16, scríobh [EMAIL PROTECTED]:
In the better known Indic scripts, are there ever cases of conjuncts formed
with independent vowels and a following consonant?

Not in the better-known ones, except possibly in esoteric manuscripts. One
finds weird stacking behaviour in Tibetan in such magical texts.

Abracadabra,


Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Mob +353 86 807 9169 ** Fax +353 1 478 2597 ** Vox +353 1 478 2597
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





Re: conjuncts beginning with independent vowel?

2001-01-17 Thread Antoine Leca

Michael Everson wrote:
 
 Ar 13:50 -0800 2001-01-16, scríobh [EMAIL PROTECTED]:
 
 Now, suppose a VC conjunct were to occur, as described above; "al", for
 example. Would it seem preferable to treat the vowel like a consonant, and
 encode as
 
  A + virama + L
 
 or to treat the consonant like a dependent vowel, and encode as
 
  A + Ldep
 
 No such thing as Ldep in our model

I see two candidates:
- U+0962 (dependent vocalic l) and all its variations in the other scripts
- U+0D32 ("normal" la in Malayalam) which behaves very much like a dependent
  vowel (like the ra "vattu" in Nagari)

The second is nothing special (it would be encoded as L anyway, so it comes
back to the first case).

A "problem" with the first is that I was taught that A + Vdep (which A +
dependent lri really is) is used as a pedagogical way to teach the alphabet,
and should mean the same as the stand-alone form of Lri (and indeed some
earlier encodings of Nagari went this way; I am unsure if the telegraph still
does).
I do not know how widespread this behaviour is (and how it may compete with
Peter's proposal). Of course, in regular Nagari, one ought to encode A +
virama + La/0932 (+ virama if followed by a consonant or at end of the word
in Sanskrit), as this is the way it is written.


Antoine



UNICODE application on IBM Mainframe

2001-01-17 Thread tracey kelly

I am investigating using the Unicode standard to store and forward
Chinese characters in a mainframe (IMS) environment.

Basically we want to receive Chinese into the system, encode into
UNICODE, send it to the mainframe and store on the IMSDB. At a later
stage, then decode back into Chinese for forwarding out of the system.

Any advice or feedback from anyone who has done anything similar
would be appreciated. How would Unicode look when stored in EBCDIC?
For example, code point 006D for 'm' - stored as the characters '006D'
or as hex x'006D'? What about the 'U' - or does one HAVE to use one of
the UTFs?

As you can tell, this is all still new to me.

Any hints and tips would be appreciated as well as whether this is
feasible or not. In future we would also want to store and forward
other languages, as well as possibly update the values using a front-
end interface.

Regards,
Tracey




Re: UNICODE application on IBM Mainframe

2001-01-17 Thread Mark Davis

Unicode is always serialized in a UTF: UTF-8, UTF-16*, or UTF-32*. The
definition of each of these is invariant across systems: in UTF-8 an 'a' is
always stored as 0x61. There is a special UTF for use on EBCDIC systems,
UTF-EBCDIC. Check out the technical reports and FAQs on www.unicode.org.
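
To make the point about invariant serialization concrete, here is a minimal
Python sketch. The helper names (hex_bytes, serializations) are made up for
this illustration, and the sample characters are arbitrary. Python's standard
codecs do not include UTF-EBCDIC, so the last line only shows a plain EBCDIC
code page (cp037) for contrast; it is not the special UTF mentioned above.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def serializations(ch):
    """Return the byte sequences for one character in several UTFs."""
    encodings = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be")
    return {enc: hex_bytes(ch.encode(enc)) for enc in encodings}


# 'a' (U+0061), 'm' (U+006D) and a Chinese character (U+4E2D)
for ch in ("a", "m", "\u4E2D"):
    print(f"U+{ord(ch):04X}", serializations(ch))

# For contrast only: the single-byte EBCDIC code page 037 value of 'm'.
# This is a legacy code page, not a Unicode transformation format.
print("cp037 'm':", hex_bytes("m".encode("cp037")))
```

Whatever the platform, U+0061 comes out as 61 in UTF-8 and U+4E2D as E4 B8 AD;
only the choice of UTF changes the bytes, not the machine the code runs on.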

Mark

- Original Message -
From: "tracey kelly" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, January 17, 2001 06:30
Subject: UNICODE application on IBM Mainframe


 I am investigating using the Unicode standard to store and forward
 Chinese characters in a mainframe (IMS) environment.

 Basically we want to receive Chinese into the system, encode into
 UNICODE, send it to the mainframe and store on the IMSDB. At a later
 stage, then decode back into Chinese for forwarding out of the system.

 Any advice or feedback from anyone who has done anything similar
 would be appreciated. How would Unicode look when stored in EBCDIC?
 For example, code point 006D for 'm' - stored as the characters '006D'
 or as hex x'006D'? What about the 'U' - or does one HAVE to use one of
 the UTFs?

 As you can tell, this is all still new to me.

 Any hints and tips would be appreciated as well as whether this is
 feasible or not. In future we would also want to store and forward
 other languages, as well as possibly update the values using a front-
 end interface.

 Regards,
 Tracey





Re: UNICODE application on IBM Mainframe

2001-01-17 Thread lisam



Within the IMS database, any form of data can be stored.  Beware, however,
that certain parameters, such as the transaction name, must always be in
EBCDIC.  While the database itself can handle Unicode in any format, you
have to be careful about how you work with that data - the IMS Transaction
Manager cannot handle Unicode.   No 3270-based product can work with
Unicode unless it is in the UTF-EBCDIC format.  IMS shipped support for
Unicode in V7, October 2000, to support working with Unicode using Java and
IMS Connect.

Lisa





Re: conjuncts beginning with independent vowel?

2001-01-17 Thread Peter_Constable


On 01/17/2001 06:05:15 AM Antoine Leca wrote:

Of course, in regular Nagari, one ought to encode A +
virama + La/0932 (+ virama if followed by a consonant or at end of the word
in Sanskrit), as this is the way it is written.

This is actually done? I got the impression from reading chapter 9 in TUS3
that in Devanagari virama occurs only after a consonant, which seems
reasonable if you consider that it doesn't make sense to kill an inherent
vowel on an independent vowel.
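
The "virama only after a consonant" reading can be sketched as a small check.
This is only an illustration of that reading, not a rule taken from the
standard: the function names are hypothetical, and the consonant ranges
(U+0915..U+0939 plus the nukta letters U+0958..U+095F) and the allowance for
an intervening nukta are assumptions made for the example.

```python
DEVANAGARI_VIRAMA = "\u094D"
DEVANAGARI_NUKTA = "\u093C"


def is_devanagari_consonant(ch):
    """True for KA..HA (U+0915..U+0939) and QA..YYA (U+0958..U+095F)."""
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def viramas_not_after_consonant(text):
    """Return the indexes of viramas that do not follow a consonant.

    A nukta between the consonant and the virama is skipped over.
    """
    bad = []
    for i, ch in enumerate(text):
        if ch != DEVANAGARI_VIRAMA:
            continue
        j = i - 1
        if j >= 0 and text[j] == DEVANAGARI_NUKTA:
            j -= 1
        if j < 0 or not is_devanagari_consonant(text[j]):
            bad.append(i)
    return bad


ka_virama_la = "\u0915\u094D\u0932"  # KA + virama + LA
a_virama_la = "\u0905\u094D\u0932"   # A + virama + LA, the sequence in question

print(viramas_not_after_consonant(ka_virama_la))  # []
print(viramas_not_after_consonant(a_virama_la))   # [1]
```

Under these assumptions the sequence A + virama + LA is flagged, which is
exactly the case being asked about.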

I'm trying to sort out what should be proposed for Syloti Nagri. There are
four consonants that can be conjoined to a preceding independent vowel.
From what I understand, these are mainly used for Arabic borrowings in
Islamic texts, but possibly also in English borrowings. So, for example,
Allah is written as al-la-h.

I hadn't noticed the vocalic L and LL in Devanagari and Bengali before.
These do give a precedent of consonantal sounds encoded as combining marks.
There is a difference from the Syloti case, though: in Devanagari and Bengali, these are
distinct marks, discontiguous from the base character, whereas the marks in
the Syloti case are conjoined, being obligatorily attached to the vertical
stem of the base (independent vowel) character.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: conjuncts beginning with independent vowel?

2001-01-17 Thread James E. Agenbroad

On Wed, 17 Jan 2001 [EMAIL PROTECTED] wrote:

 
 On 01/17/2001 05:13:25 AM Michael Everson wrote:
 
  A + Ldep
 
 No such thing as Ldep in our model, so you'd have to rely on A + virama +
 L.
 
 Well, if a script had such behaviour, one possibility could be to propose a
 combining CONSONANT SIGN L for what we would be choosing to think of as a
 dependent form of the consonant. I.e. it may not be in an existing model,
 but for a new script one could create a new model. I hear you saying,
 though, that you think it would be preferable to fit this into the existing
 model that uses a virama.
 
 
 
 - Peter
 
 
 ---
 Peter Constable
 
 Non-Roman Script Initiative, SIL International
 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
 Tel: +1 972 708 7485
 E-mail: [EMAIL PROTECTED]
 
 
 
  Wednesday, January 17, 2001
A virama after other than a consonant seems un-Indian.  My novice's
understanding of virama is that it means: If the available rendering
capabilities allow it, consider the implicit 'a' expunged and combine the
preceding consonant with the next one to form a conjunct; otherwise
(i.e. if the rendering capabilities do not allow this) insert the virama
glyph beneath the preceding consonant.  This would mean the last example
in Unicode 3.0 figure 9-3 could be ignored and instead RA + vocalic R
vowel sign (U+0930, U+0943 with no virama) would be rendered as
independent vocalic R (U+090B) with "reph hook" above it.  
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  




Re: A real bug in bidi

2001-01-17 Thread Roozbeh Pournader



On Tue, 16 Jan 2001, Mark Davis wrote:

 Doug Felt here confirmed that this is a bug in the implementation section.
 While it does not affect the conformance of the main algorithm, it would
 affect people trying to use that optimization strategy. (we here don't use
 that strategy, by the way). We think that the implementation strategy could
 be changed to still work, but for now we would recommend removing the
 characters.

Will there be a note in the online version of the technical report to
mention this? There may be poor developers just like us ;) who won't know
that these recommendations will make their application nonconforming.

In our case, we read and reread the spec many times, and even had it read by
developers who had not heard of the Unicode bidi algorithm before, because we
simply thought that the bug was in our implementation or interpretation.

--roozbeh





RE: conjuncts beginning with independent vowel?

2001-01-17 Thread AbdulMalik

In Bengali Vowel_A can form a conjunct with letter_Ya (Ya taking its zophola form.)

It has been suggested that this should be encoded as Vowel_A ZWJ Ya

I believe that the series V ZWJ C is much more logical than V Virama C as the 
semantics of virama are to suppress the vowel.
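
For reference, the two candidate sequences can be spelled out in code points.
This is just a sketch that lists what each proposal would store; it says
nothing about how a renderer would treat either one. The variable and function
names are made up for the example.

```python
import unicodedata

BENGALI_A = "\u0985"       # BENGALI LETTER A (independent vowel)
BENGALI_YA = "\u09AF"      # BENGALI LETTER YA
BENGALI_VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA
ZWJ = "\u200D"             # ZERO WIDTH JOINER

candidates = {
    "V ZWJ C": BENGALI_A + ZWJ + BENGALI_YA,
    "V virama C": BENGALI_A + BENGALI_VIRAMA + BENGALI_YA,
}


def describe(seq):
    """List each character of seq as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]


for label, seq in candidates.items():
    print(label)
    for line in describe(seq):
        print("   " + line)
```

The two sequences differ only in the middle character, so the question is
purely which joining mechanism should carry the meaning.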


Abdul




Re: conjuncts beginning with independent vowel?

2001-01-17 Thread Peter_Constable


On 01/17/2001 02:52:41 PM John Hudson wrote:

Are these four consonants always joined in this way when following an
independent vowel? Or is this behaviour exceptional and limited to borrowed
words, etc.?

My understanding is the latter. Thus, I don't think obligatory ligation
would work.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





RE: conjuncts beginning with independent vowel?

2001-01-17 Thread Peter_Constable


On 01/17/2001 03:10:22 PM "AbdulMalik" wrote:

In Bengali Vowel_A can form a conjunct with letter_Ya (Ya taking its zophola
form.)

It has been suggested that this should be encoded as Vowel_A ZWJ Ya

I believe that the series V ZWJ C is much more logical than V Virama C as the
semantics of virama are to suppress the vowel.

I had thought about this earlier, but forgot to include this among the
possibilities when I raised the question. Thanks for bringing it up.

 This matches the general use of ZWJ for requesting ligation, which UTC
decided to add to the semantics of ZWJ last year (thus it can be considered
an existing mechanism). But, it doesn't match the use of ZWJ in Indic
scripts for forcing half forms rather than conjuncts. This use of ZWJ isn't
needed for Syloti Nagri, which does not have half forms. Since that use is
applied to other Indic scripts, though, I don't know if it would be a
problem to introduce the other use of ZWJ for an Indic script.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: A real bug in bidi

2001-01-17 Thread Mark Davis


Yes, I have already proposed an agenda item for the next UTC, to get this
fix into 3.1.

Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], [EMAIL PROTECTED], [EMAIL PROTECTED]
http://maps.yahoo.com/py/maps.py?Pyt=Tmapaddr=10275+N.+De+Anzacsz=95014



Roozbeh Pournader [EMAIL PROTECTED] on 01-17-2001 12:56:57

To:   Mark Davis/Cupertino/IBM@IBMUS
cc:   Unicode List [EMAIL PROTECTED], [EMAIL PROTECTED], Behdad
  Esfahbod [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject:  Re: A real bug in bidi





On Tue, 16 Jan 2001, Mark Davis wrote:

 Doug Felt here confirmed that this is a bug in the implementation section.
 While it does not affect the conformance of the main algorithm, it would
 affect people trying to use that optimization strategy. (we here don't use
 that strategy, by the way). We think that the implementation strategy could
 be changed to still work, but for now we would recommend removing the
 characters.

Will there be a note in the online version of the technical report to
mention this? There may be poor developers just like us ;) who won't know
that these recommendations will make their application nonconforming.

In our case, we read and reread the spec many times, and even had it read by
developers who had not heard of the Unicode bidi algorithm before, because we
simply thought that the bug was in our implementation or interpretation.

--roozbeh








Teletext mappings

2001-01-17 Thread Rob Hardy

Hi everyone,

I'm preparing some mappings of teletext character sets to Unicode.  You can
see my results so far at
http://www.sneezes.freeserve.co.uk/teletext/tech/charenc/teletextcharencs.html
[hope that URL doesn't get split..]   This is a LARGE page, btw (150k).  In
IE5+, hover over the character to get its name.

As you can see, I have some ambiguous characters and unknowns, and am
wondering whether anyone would like to answer these questions :)

1) I'm not sure about the forms in G0_ARABIC.  I've had some excellent help
from an Arabic-speaker, but am wondering whether it could be further
refined.  I've uploaded the tables in the teletext spec to
http://www.sneezes.freeserve.co.uk/teletext/tech/charenc/teletextarabic.gif
so you can make a comparison.  I haven't finished G2_ARABIC yet, so there's
a few gaps.

2) Hyphens or dashes - what's the difference?

3) Which to use:  0x2016 DOUBLE VERTICAL LINE, or 0x2225 PARALLEL TO, or
0x2551 BOX DRAWINGS DOUBLE VERTICAL, or 0x01C1 LATIN LETTER LATERAL CLICK ?

4) Turkish Lira - the teletext spec represents this with a combined ligature
'TL', which I can't find a Unicode character for.  I've put in 20A4 LIRA
SIGN, but I don't think this is what the teletext designers had in mind.  Is
this a case for a new Unicode character?

5) G0_LATIN_LETTISH_LITHUIAN looks to have a LATIN SMALL LETTER I WITH
CEDILLA, which I can't find in Unicode (so I've stuck in i with ogonek
instead).  Is this missing?

6) Is there a 041F CYRILLIC CAPITAL LETTER PE with a curved top, like 0x22C2
N-ARY INTERSECTION, in both uppercase and lowercase forms?  Perhaps this is a
particular glyph of the PE character, represented as a separate entry in the
teletext table.

7) Misc. other characters: Couldn't decide between
a) 2126: OHM SIGN or GREEK CAPITAL LETTER OMEGA, 03A9
b) 0110: LATIN CAPITAL LETTER D WITH STROKE, or LATIN CAPITAL LETTER
ETH, 00D0
c) 00DF: LATIN SMALL LETTER SHARP S, or GREEK SMALL LETTER BETA, 03B2
d) 0251: LATIN SMALL LETTER ALPHA, or GREEK SMALL LETTER ALPHA, 03B1
e) 00B0: DEGREE SIGN, or MASCULINE ORDINAL INDICATOR, 00BA

8) And some others I'm not sure of:
a) Character 0x28 of G2_GREEK, looks like a colon
b) Character 0x6e of G2_LATIN, looks like a tall Greek eta
c) Character 0x7e of G2_LATIN, looks like an eta
d) Character 0x52 of G0_GREEK, I've put it in as 0374 GREEK NUMERAL SIGN
but can't be sure
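
When weighing lookalike candidates such as those in questions 3 and 7, it can
help to print the character names straight from the Unicode character
database. The sketch below uses Python's unicodedata module; the grouping
labels and the helper name are made up for the example, and the choice of
which pairs to list is only a sample. The box-drawing candidate is listed
under its code point U+2551.

```python
import unicodedata

# Candidate code points grouped by the glyph they might represent.
candidates = {
    "double vertical bar": [0x2016, 0x2225, 0x2551, 0x01C1],
    "ohm / omega": [0x2126, 0x03A9],
    "D with stroke / eth": [0x0110, 0x00D0],
    "degree / masculine ordinal": [0x00B0, 0x00BA],
}


def names(code_points):
    """Return (hex code point, character name) pairs."""
    return [(f"{cp:04X}", unicodedata.name(chr(cp))) for cp in code_points]


for glyph, cps in candidates.items():
    print(glyph)
    for cp, name in names(cps):
        print(f"   {cp} {name}")

# OHM SIGN has a canonical decomposition to GREEK CAPITAL LETTER OMEGA,
# so normalization folds the two together.
print(unicodedata.normalize("NFC", "\u2126") == "\u03A9")  # True
```

The normalization check only settles the ohm/omega pair; the other pairs are
distinct characters and the choice has to come from what the teletext set
intends.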

Perhaps there are some 7-bit sets knocking about which the teletext ones were
based on, which would help.   The full teletext spec is available from
http://www.etsi.org, named ETSI 300 706 (you'll have to register to
download, but it's free).  I suspect the designers of the spec would use a
single glyph to represent two characters in some cases, e.g. D with a stroke
would mean both 0110 and 00D0, seeing as both lowercase forms are further up
in the same set.

Hope I haven't asked too much in my first posting to this list :)

Regards,
Rob.






PDUTR #27: Unicode 3.1

2001-01-17 Thread Julie Doll Allen

Proposed Draft Unicode Technical Report #27: Unicode 3.1 is now
available at
http://www.unicode.org/unicode/reports/tr27/

Please take a look at it and report any problems you may find. It is
approximately 60 pages long.

Julie Allen
Editor, Unicode, Inc.