Gautam--
[Gautam]: Well, too bad. I guess we still have an obligation to explore the extent of sub-optimal solutions that are being imposed upon South-Asian scripts for the sake of *backward compatibility* or simply because they are "fait accomplis". (See Peter Kirk's posting on this issue). However, I am by no means suggesting that the fault lies with the Unicode Consortium.
I'm a little
confused by this statement. What would be the difference between
sticking with a suboptimal solution because it's a fait accompli and sticking
with it out of the need for backward compatibility? The need for
backward compatibility exists because the suboptimal solution is a fait
accompli. Or are you stating that backward compatibility is a specious
argument because the encoding is so broken nobody's actually using
it?
[Gautam]: This is again the "fait accompli" argument. We need to *know* whether adopting an alternative model WOULD HAVE BEEN PREFERABLE, even if the option to do so is no longer available to us.
I
don't understand. If the option to go to an alternative model is not
available, why is it important to know that the alternative model would have
been preferable?
[Gautam]: I think there is a slight misunderstanding here. The ZWJ I am proposing is script-specific (each script would have its own), call it "ZWJ PRIME" or even "JWZ" (in order to avoid confusion with ZWJ). It doesn't exist yet and hence has no semantics.
Okay. Maybe I'm dense, but this wasn't clear to me from your other
emails. You're not proposing that U+200D be used to join Indic consonants
together; you're basically arguing for virama-like functionality that goes far
enough beyond what the virama does that you're not comfortable calling it a
virama anymore.
[Gautam]: JWZ is a piece of formalism. Its meaning would be precisely what we choose to assign to it. It behaves like the existing (script-specific) VIRAMAs except that it also occurs between a consonant and an independent vowel, forcing the latter to show up in its combining form.
Aha! This is what I wasn't parsing out of your previous
emails. It was there, but I somehow didn't grok it. To
summarize:
Tibetan deals with consonant clusters by encoding each of the consonants
twice: One series of codes is to be used for the first consonant in a cluster,
and the other series is to be used for the others. The Indian scripts
don't do this; they use a single series of codes for the consonants and cause
consonants to form clusters by adding a VIRAMA code between them. But the
Indian scripts still have two series of VOWELS more or less analogous to the two
series of consonants in Tibetan. When you want a non-joining vowel, you
use one series, and when you want a joining vowel, you use the
other.
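To make that concrete, here's a minimal sketch, in Python string literals, of
how the status quo looks at the code point level (the names in the comments
are the standard Unicode character names):

# Tibetan encodes each consonant twice: a base series, and a subjoined
# series for the non-initial members of a cluster.
TIB_KA           = "\u0F40"   # TIBETAN LETTER KA
TIB_SUBJOINED_KA = "\u0F90"   # TIBETAN SUBJOINED LETTER KA

# Devanagari has one consonant series and forms clusters with a VIRAMA...
DEV_KA     = "\u0915"   # DEVANAGARI LETTER KA
DEV_TA     = "\u0924"   # DEVANAGARI LETTER TA
DEV_VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA
kta = DEV_KA + DEV_VIRAMA + DEV_TA   # the KTA conjunct

# ...but it does have two vowel series, loosely parallel to Tibetan's two
# consonant series: an independent (non-joining) letter and a dependent
# (joining) vowel sign.
DEV_I_INDEPENDENT = "\u0907"   # DEVANAGARI LETTER I
DEV_I_DEPENDENT   = "\u093F"   # DEVANAGARI VOWEL SIGN I
ki = DEV_KA + DEV_I_DEPENDENT  # the syllable KI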
You want to have one series of vowels and extend the virama model to
combining vowels. Thus, you'd represent KI as KA + VIRAMA + I; KA + I
would represent two syllables: KA-I. Since a real virama never does this,
you're using a different term ("JWZ" in your most recent message) for the
character that causes the joining to happen. You're not proposing any
difference in how consonants are treated, other than having this new character
serve the sticking-together function that the VIRAMA now serves and changing
the existing VIRAMA to always display explicitly.
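If I've got that right, the syllable KI would look like this under the two
models. (The JWZ code point below is purely hypothetical; I've parked it in
the private-use area just so the sketch is well formed.)

KA      = "\u0915"   # DEVANAGARI LETTER KA
I_DEP   = "\u093F"   # DEVANAGARI VOWEL SIGN I (dependent)
I_INDEP = "\u0907"   # DEVANAGARI LETTER I (independent)
JWZ     = "\uE000"   # hypothetical placeholder; no such character exists

ki_today      = KA + I_DEP           # KI as Unicode encodes it now
ki_proposed   = KA + JWZ + I_INDEP   # KI under your model
ka_i_proposed = KA + I_INDEP         # two syllables, KA-I, under your model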
Now do I understand you? Sorry for my earlier
misunderstandings.
[Gautam]: Now that we have freed up all those code points occupied by the combining forms of vowels by introducing the VIRAMA with extended function, let us introduce an explicit (always visible) VIRAMA. That's all.
As far as Unicode is concerned, you can't "free up" any code
points. Once a code point is assigned, it's always assigned. You can
deprecate code points, but that doesn't free them up to be reused; it only (with
luck) keeps people from continuing to use them.
It seems to me that a system could support the usage you want and the old
usage at the same time. I could be wrong, but I'm guessing that KA +
VIRAMA + I isn't a sequence that makes any sense with current implementations
and isn't being used. It would be possible to extend the meaning of the
current VIRAMA to turn the independent vowels into dependent vowels.
Future use of the dependent-vowel code points could be discouraged in favor of
VIRAMA plus the independent-vowel code points. Old documents would
continue to work, but new documents could use the model you're after. (You
get the explicit virama the same way you do now: VIRAMA + ZWNJ.) This
solution would involve encoding no new characters and no removal of existing
characters, but just a change in the semantics of the
VIRAMA.
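To sketch what I mean: a rendering engine or conversion step could
(hypothetically) fold the new spelling onto today's dependent-vowel code
points, so old and new documents come out the same. The normalize() function
and the partial vowel table here are just illustrations, not anything that
exists today:

VIRAMA = "\u094D"                  # DEVANAGARI SIGN VIRAMA
INDEP_TO_DEP = {
    "\u0906": "\u093E",            # LETTER AA -> VOWEL SIGN AA
    "\u0907": "\u093F",            # LETTER I  -> VOWEL SIGN I
    "\u0909": "\u0941",            # LETTER U  -> VOWEL SIGN U
    # ...and so on for the remaining vowels
}

def normalize(text):
    out, i = [], 0
    while i < len(text):
        if (text[i] == VIRAMA and i + 1 < len(text)
                and text[i + 1] in INDEP_TO_DEP):
            out.append(INDEP_TO_DEP[text[i + 1]])   # new spelling: VIRAMA + independent vowel
            i += 2
        else:
            out.append(text[i])                     # old spellings pass through untouched
            i += 1
    return "".join(out)

assert normalize("\u0915\u094D\u0907") == "\u0915\u093F"   # KA + VIRAMA + I -> KI
assert normalize("\u0915\u093F")       == "\u0915\u093F"   # existing KI is unchanged

Old documents would pass through unchanged, and KA + VIRAMA + consonant would
still form conjuncts, since the rewrite only fires when an independent vowel
follows the VIRAMA.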
That said, I'm not sure this is a good idea. If what you're really
concerned about is typing and editing of text, you can have that work the way
you want without changing the underlying encoding model. It involves
somewhat more complicated keyboard handling, but I'm pretty sure all the major
operating systems allow this. The basic idea is that you have one set of
vowel keys that normally generate the independent-vowel code points, but if one
of them is preceded by the VIRAMA key, the two keystrokes map to a single
character: the dependent-vowel code point. This is a simple solution that
can be implemented today with very little fuss and involves no changes to
Unicode or to the various fonts and rendering engines that would be required if
the VIRAMA code point took on a new meaning. From a user's point of view,
things work the way they're supposed to, and they work that way sooner than if
Unicode is changed. Only programmers have to worry about the actual
encoding details, and unless keeping the existing model makes THEIR jobs
significantly harder, the encoding itself shouldn't change.
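Here's roughly what I have in mind, boiled down to a toy input-method loop.
The key names and the type_keys() function are invented for illustration; a
real layout would go through the platform's keyboard machinery rather than a
Python function:

VIRAMA_KEY = "virama"              # invented key name
VOWEL_KEYS = {                     # key -> (independent, dependent)
    "i": ("\u0907", "\u093F"),     # LETTER I, VOWEL SIGN I
    "u": ("\u0909", "\u0941"),     # LETTER U, VOWEL SIGN U
}

def type_keys(keys):
    out, pending_virama = [], False
    for key in keys:
        if key == VIRAMA_KEY:
            pending_virama = True
        elif key in VOWEL_KEYS and pending_virama:
            out.append(VOWEL_KEYS[key][1])   # VIRAMA key + vowel key -> dependent vowel
            pending_virama = False
        elif key in VOWEL_KEYS:
            out.append(VOWEL_KEYS[key][0])   # plain vowel key -> independent vowel
        else:
            if pending_virama:
                out.append("\u094D")         # a real VIRAMA goes out before a consonant
                pending_virama = False
            out.append(key)                  # consonant keys emit their characters directly
    if pending_virama:
        out.append("\u094D")
    return "".join(out)

# KA key, VIRAMA key, I key -> KA + VOWEL SIGN I (the syllable KI)
assert type_keys(["\u0915", VIRAMA_KEY, "i"]) == "\u0915\u093F"
# KA key, VIRAMA key, TA key -> an ordinary KTA conjunct
assert type_keys(["\u0915", VIRAMA_KEY, "\u0924"]) == "\u0915\u094D\u0924"

The point is that the user sees the behavior you want, while the code points
underneath stay exactly as they are today.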
I hope this makes sense...
--Rich Gillam
Language Analysis Systems, Inc.

