Re: Conflicting principles

Philippe Verdy Thu, 14 Aug 2003 12:29:00 -0700

On Thursday, August 07, 2003 11:29 PM, Michael Everson <[EMAIL PROTECTED]> wrote:


> Ken's point of course is that however bizarre the backing store for
> Sindarin and English Tengwar modes may be, combining characters per
> se must follow their base characters no matter what.

Even if that breaks the logical analysis of the text?
How does the Sindarin mode affect line or word breaking, for example?
Suppose that the combining character is coded after the next logical base character.
Would it be valid to break at this base character and thus send the combining vowel to
the next line, when what is actually intended is a vowel carrier for a combining
character that is logically attached to the previous base character?
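To make the question concrete, here is a minimal sketch of the grouping that a
line breaker typically relies on: each mark is attached to the base character
encoded before it, and a break can only fall between the resulting groups. Tengwar
is not encoded, so the letters "t" and "n" and U+0301 COMBINING ACUTE ACCENT are
stand-ins chosen purely for illustration, and the function name
combining_sequences is hypothetical. This is not the full Unicode line breaking
or grapheme cluster algorithm.

```python
import unicodedata

def combining_sequences(text):
    """Group each character with the marks (category M*) encoded after it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # A mark always joins the cluster of the character before it.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

ACUTE = "\u0301"  # stand-in for a vowel sign; an assumption for illustration

# Mark encoded after the first letter: it travels with "t".
print(combining_sequences("t" + ACUTE + "n"))

# Mark encoded after the second letter: it travels with "n", so a break
# between "t" and "n" sends the mark to the next line along with "n".
print(combining_sequences("t" + "n" + ACUTE))
```

Under this grouping the encoded order alone decides which base the mark follows
across a line break, whatever the logical reading order may be.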

I don't know Tengwar's Sindarin mode well enough to see how word breaking can affect
the interpretation of text. But preserving the logical ordering of letters seems much
more important for actual text encoding than being constrained by combining rules that
were created with only the first encoded scripts in mind: the Latin, Greek, Cyrillic,
Hebrew, Arabic and Hiragana/Katakana scripts that use combining characters.

The answer to that question also matters for other scripts that are still unencoded;
you cited some of them that have similar difficulties. They are not extinct, and they
have a huge amount of existing text to represent, including many modern languages
whose speakers are only partly literate and who would benefit from a written form
modelled on similar languages spoken and written in the same cultural region, notably
in Africa, Central Asia, and Oceania (regions that have suffered for too long from the
lack of an easy-to-adapt, easy-to-learn writing system for minority languages).

Even in India, there is still no consensus on the use of the ISCII-based writing
system for Brahmic scripts, and the current work on Tibetan or on Indo-Aryan languages
shows that the officially adopted system does not meet the cultural demands of
minority users, because the official writing system does not fit their language very
well.

There will certainly not be a huge revolution in writing systems (families of scripts
with similar behaviors), but existing systems will continue to be adapted to fit the
local cultural demands of minorities and specialized fields, and the overly strict
encoding model now proposed by Unicode cannot accommodate that well. Examples include
text that uses a non-linear layout in which the layout carries important semantics.
Such examples are numerous among hieroglyphic writing systems, one of which is in
modern use, and they fit Unicode poorly because clusters often cannot be represented
as simple combining sequences made of a base character and diacritics.

If one looks at Korean jamos, the problem was solved only by actually *reducing* the
number of layout combinations and by creating artificial "letters" (jamos) for some
combinations that are logically perceived as multiple letters (for example the
SSANGKIEOK jamo, which is really a pair of KIEOK letters). Clusters are therefore only
partly decomposed into their component letters, and the composition layout is greatly
simplified but does not correctly match the historic Hangul clusters.
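The partial decomposition can be observed with Python's standard unicodedata
module. The sketch below is only an illustration; the helper name nfd_codepoints
and the choice of the syllable U+AE4C are mine. It shows that canonical
decomposition of a precomposed syllable stops at the SSANGKIEOK jamo (U+1101)
and does not go on to a pair of KIEOK jamos (U+1100).

```python
import unicodedata

SSANGKIEOK = "\u1101"  # HANGUL CHOSEONG SSANGKIEOK
SYLLABLE = "\uAE4C"    # a precomposed Hangul syllable starting with SSANGKIEOK

def nfd_codepoints(s):
    """Return the code points of the canonical decomposition (NFD) of s."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)]

# The syllable decomposes into a leading jamo and a vowel jamo.
print(nfd_codepoints(SYLLABLE))

# The double-consonant jamo itself is left untouched by NFD...
print(nfd_codepoints(SSANGKIEOK))

# ...because it carries no decomposition mapping at all.
print(repr(unicodedata.decomposition(SSANGKIEOK)))
```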

Probably the same can be said about Han ideographs, which are constantly extended with
new clusters, and even about Hiragana/Katakana clusters that are currently represented
as single code points although they are really composed, and that are constantly
enriched with new clusters, notably in the scientific area. To allow users to create
their own clusters, Unicode has added ideographic description characters, which are
controls used as prefixes for a combining sequence containing base "letters". This is
already a break with the axiomatic view of combining sequences built on a single base
letter.

Other areas where combining sequences do not follow this model are of course the
Hangul script, the CGJ character used between two base letters, double (width)
diacritics, and so on. There already exist many exceptions to the axiomatic view of
combining sequences, and I don't see why there could not be a model allowing new
classes of combining characters attached to a *following* base character, such as for
Tengwar Sindarin vowels (supposing that Sindarin vowels are encoded separately from
Quenya vowels, because of their distinct combining properties, and because the
Tengwar "script" is really a family of related scripts, with many more differences
among them than there are between the separate Latin, Greek and Cyrillic scripts).
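For reference, the properties of two of the characters mentioned above can be
inspected with Python's standard unicodedata module. This is only a small sketch
with a hypothetical helper named describe. It prints the name, general category
and canonical combining class of CGJ (U+034F), of one double diacritic (U+0361)
and, for comparison, of an ordinary accent (U+0301).

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical combining class) for ch."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

for ch in ("\u034F", "\u0361", "\u0301"):
    name, category, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {name}: category={category}, combining class={ccc}")
```

All three are nonspacing marks, so in the encoded text each one follows a base
character. The double diacritic has its own combining class, and CGJ has class 0,
which is what sets them apart from an ordinary accent.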

So one cannot be satisfied with the currently limited model of a single base letter
plus combining modifiers, which would create an artificial hierarchy between letters
that does not fit the cultural semantics of the encoded language.

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.
