From: "Arcane Jill" <[EMAIL PROTECTED]> > Ignoring all compatibility characters; ignoring everything that has gone > before; and considering only present and future characters (that is, > characters currently under consideration for inclusion in Unicode, and > characters which will be under consideration in the future), which of > the following is the PRINCIPLE which decides whether or not a character > is suitable: > > (A) A proposed character will be rejected if its glyph is identical in > appearance to that of an extant glyph, regardless of its semantic > meaning, or > (B) A proposed character will be rejected if its semantic meaning is > identical to that of an extant character, regardless of the appearance > of its glyph, or > (C) A proposed character will be rejected if either (A) or (B) are true, or > (D) None of the above > ? > > Although this is a question about the future, no clairvoyance is > required, since I am asking about the principle behind decisions, not > about specific characters.
(D), unambiguously. There is no normative glyph in Unicode: the standard specifies only a single representative glyph, whose purpose is to exhibit the identity of the character and to distinguish it from the other encoded characters of the same script (or sometimes of other scripts as well; and there are many counterexamples where even the representative glyphs of distinct characters look the same).

In my opinion, the main reason a new "similar" character needs to be encoded is that the existing character's normative properties do not fit some linguistic usage, or create false interpretations of text in some language. Or the character's glyph was borrowed from another script which behaves very differently overall (see the various symbols and letters that look like a Greek uppercase Lambda but have very distinct histories of use, very different applications and properties, and would be used inconsistently if they were simply borrowed from a foreign script without gaining a new identity in the new script). If a problem cannot be corrected by adding more glyph substitution rules in fonts, to render the text the way the authors want, or if basic text handling produces wrong results because of a normative behavior (for example Bidi properties, case mappings, decompositions and canonical reordering of diacritics), then a new character needs to be added.

Look, for example, at how the various forms of D with stroke, which look very similar or identical in uppercase, are given distinct code points: this is needed because they have very distinct lowercase mappings, and the lowercase versions should not be mixed, as they have different identities (the small code sketch below makes this concrete). Another example is some Greek letters whose letterforms were borrowed into Latin but with distinct case mappings: the uppercase version of Latin Esh looks very similar or identical to the Greek uppercase Sigma. Yet another example is the new mathematical symbols, for which no case mappings are acceptable, since lowercase and uppercase versions need to remain distinct symbols. We will probably soon see new characters added to Hebrew, because of problems in the interpretation of Biblical texts, or simply because the need cannot be met by symbols or letters borrowed from other scripts, as those have the wrong character properties for use in Hebrew.

Unicode just needs to encode what is needed to preserve the identity of the encoded text without losing parts of its semantics. Unicode will also make efforts to ensure that a single script is enough to represent a given language, at least at the lexical level (exceptions exist, for example in Japanese, which mixes several scripts in the same text: Hiragana, Katakana and Han; but I think this does not affect the lexical level), so that a text in some language need not mix characters from many blocks. This simplifies implementations, as it reduces the number of code point blocks to support for a language (and I see it as a good reason why letters borrowed into romanized text from other scripts such as Cyrillic and Greek were added to the Latin blocks with separate code points).
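To make the case-mapping point concrete, here is a minimal sketch in Python using the standard unicodedata module (the specific characters are my own choice of illustration, not taken from the question):

    import unicodedata

    # Look-alike capitals that carry distinct identities and, hence,
    # distinct lowercase mappings (or none at all).
    capitals = [
        "\u00D0",      # LATIN CAPITAL LETTER ETH
        "\u0110",      # LATIN CAPITAL LETTER D WITH STROKE
        "\u0189",      # LATIN CAPITAL LETTER AFRICAN D
        "\u03A3",      # GREEK CAPITAL LETTER SIGMA
        "\u01A9",      # LATIN CAPITAL LETTER ESH
        "\U0001D400",  # MATHEMATICAL BOLD CAPITAL A (no case mapping)
    ]

    for ch in capitals:
        low = ch.lower()
        name = unicodedata.name(ch)
        if low != ch:
            print(f"U+{ord(ch):04X} {name} -> "
                  f"U+{ord(low):04X} {unicodedata.name(low)}")
        else:
            print(f"U+{ord(ch):04X} {name} -> no lowercase mapping")

Running this shows Eth, D with stroke and African D lowercasing to three different letters (U+00F0, U+0111, U+0256), and Esh lowercasing to U+0283 while Sigma lowercases to U+03C3, whereas the mathematical bold capital stays unchanged. Identifying these characters by their uppercase glyphs alone would destroy those distinctions.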
Maybe Unicode members have distinct views about it, but this seems to be what is needed to allow consistent handling of text in its encoded form, without reference to graphical considerations such as glyph processing, positioning, or reordering. This lets a renderer use whatever font design respects the character identity (see the wide range of glyph styles that exist in Latin or Arabic, for which a very rich and complex set of calligraphic designs has been created throughout centuries and millennia).
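As a small illustration of what "handling text in its encoded form" means in practice, here is one more hedged sketch (Python again; the combining sequence is my own example). Canonical reordering is defined entirely by the characters' combining classes, with no reference to how any font draws the marks:

    import unicodedata

    # e + COMBINING ACUTE ACCENT (ccc 230) + COMBINING DOT BELOW (ccc 220),
    # entered in either order; both sequences render identically.
    a = "e\u0301\u0323"
    b = "e\u0323\u0301"

    # Canonical decomposition reorders the marks by combining class,
    # so both spellings become the same sequence of code points.
    print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # True

    for ch in unicodedata.normalize("NFD", a):
        print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")

The same holds for the other normative behaviors mentioned above (Bidi classes, case mappings): they are looked up per character, so a character encoded with the wrong properties cannot be repaired by the renderer.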

