On Wednesday, July 02, 2003 12:55 PM, Jony Rosenne <[EMAIL PROTECTED]> wrote:

> I would like to summarize my understanding:
> 
> 1. The sequence Lamed Patah Hiriq is invalid for Hebrew. It is
> invalid in Hebrew to have two vowels for one letter. It may or may
> not be a valid Unicode sequence, but there are many examples of valid
> Unicode sequences that are invalid.

Only invalid for Modern Hebrew. Besides, we are not discussing the *validity* of the 
Unicode/ISO 10646 encoding: any Unicode string is valid even if it is not normalized, 
provided that it uses assigned codepoints and respects a few constraints, such as 
approved variant sequences and the correct use of surrogate code units (isolated 
surrogate codepoints being forbidden).

The issue is created by the Unicode normalization of text, which is NOT required for 
Unicode encoding validity but only for text processing (notably with legacy HTML and 
SGML, or the newer XML, XHTML, and related standards based on XML).

You have not understood the issue with *Traditional Hebrew*, where there actually are 
two or more vowels on one base letter, notably in Biblical texts but certainly also in 
many other manuscripts of the same epochs, and probably later ones up to today. These 
texts are important to human culture and have been (and will continue to be) studied 
by scholars, researchers, and interested people, whether they were (are, or will be) 
historians, sociologists, economists, linguists, translators, theologians, religious 
adepts, or specialists of the many other disciplines studied for millennia (including 
mathematics, astronomy, medicine...).

What has been demonstrated here is that the combining classes currently defined for 
Hebrew characters were not needed for Modern Hebrew (which could have been written 
perfectly well with all vowels assigned CC=0), but the vowels were encoded with 
"randomly assigned" combining classes (for which the 220 and 230 classes were not 
usable).
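The swap in question can be sketched in a few lines of Python with the standard 
unicodedata module (using today's Unicode data, where PATAH has CC=17 and HIRIQ has 
CC=14, so canonical ordering moves HIRIQ before PATAH regardless of what the 
transcriber typed):

```python
import unicodedata

# LAMED (base), then PATAH (ccc=17), then HIRIQ (ccc=14),
# in the order a transcriber of a traditional text might enter them.
typed = "\u05DC\u05B7\u05B4"

# Canonical reordering sorts the two vowels by combining class,
# so normalization swaps them behind the transcriber's back.
normalized = unicodedata.normalize("NFC", typed)

print(normalized == typed)                 # False: the text was changed
print(normalized == "\u05DC\u05B4\u05B7")  # True: HIRIQ now precedes PATAH
```

The same reordering happens under NFD, NFKC, and NFKD, since all four forms apply the 
canonical ordering algorithm.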

The initial encoding may have been done by studying only some fragments of the 
traditional texts, which exposed some combinations of vowels, without really searching 
through such important traditional texts as the Hebrew Bible (and certainly also some 
old versions of the Torah, old Hebrew translations of the Quran, or famous Roman 
Latin, Greek, Phoenician, or Syriac manuscripts, in a Middle-East region that has seen 
many foreign invasions and sat at the crossroads of the most famous cultures and trade 
routes). For all vowels among which no preferred order could be demonstrated (in the 
studied fragments of text), the combining classes were mostly assigned in an order 
matching the codepoint order of the legacy 8-bit encodings, on the assumption that 
occurrences of those vowel combinations would be rare and would not cause problems.
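That assignment order can be checked against the Unicode character database as 
shipped today; a small Python sketch listing the Hebrew vowel points shows their 
combining classes simply increase with the codepoint:

```python
import unicodedata

# Hebrew points SHEVA (U+05B0) through QUBUTS (U+05BB): their combining
# classes run 10..20 in plain codepoint order, mirroring the order of
# the legacy 8-bit encodings rather than any documented vowel hierarchy.
for cp in range(0x05B0, 0x05BC):
    ch = chr(cp)
    if unicodedata.name(ch, None) is None:
        continue  # skip slots unassigned in older Unicode versions
    print(f"U+{cp:04X}  ccc={unicodedata.combining(ch):2}  "
          f"{unicodedata.name(ch)}")
```

Running this shows a monotonic sequence of classes (10, 11, 12, ...), which is 
consistent with the "codepoint order" assignment described above.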

When new old scripts are added to Unicode, I do think that Unicode should not make 
assumptions from a small set of text fragments: further research may demonstrate that 
a definition of non-zero combining classes introduces too many problems to allow 
encoding new texts, for which the existing normalization would incorrectly swap 
combining characters and change the semantics of the encoded text. These old texts 
should be handled on the assumption that the typist who entered and encoded them 
transcribed them correctly, and an NF* normalization should not silently override that 
decision, as it would waste all the effort the transcriber made to produce an accurate 
transcript of the source text.

I think that if there are reasons to define combining classes for the normalization of 
some categories of text, then either we should accept sacrificing the unification of 
characters each time it causes a problem, or Unicode and ISO 10646 should accept to 
define/assign a generic codepoint with category "Mn" and CC=0, whose only role would 
be to bypass the currently assigned non-zero CC value of combining characters, even 
if this temporarily causes some problems for text rendering engines (which can later 
be corrected to treat this character as ignorable for all rendering purposes, 
including searches for possible ligatures).

I suggest that such a codepoint be allocated in the U+03XX block for generic combining 
characters, so that it can be used in any script, including the existing ones. This 
character would be named "Combining Variant Selector" (CVS); it would preserve the 
semantics of the diacritic it is prefixed to, and it would not override the current 
semantics of the "Combining Grapheme Joiner" (CGJ), which may have specific uses for 
creating ligatures between diacritics and which should still continue to be 
canonically ordered. So if diacritic <A> has CC=a and diacritic <B> has CC=b, with 
a < b, the sequence <A, CGJ, B> would be valid, but not <B, CGJ, A> unless the 
combining class of A is overridden as <B, CGJ, CVS, A>.
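For what it is worth, the blocking effect such a CC=0 character would provide can 
already be observed in practice: in the Unicode data as published today, the CGJ 
(U+034F) itself carries CC=0, so inserting it between two marks splits the 
reorderable run. This Python sketch illustrates only that blocking mechanism, not the 
ligature semantics discussed above:

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, ccc=0 in current Unicode data

# Without a ccc=0 separator, PATAH (ccc=17) and HIRIQ (ccc=14) get swapped.
plain = "\u05DC\u05B7\u05B4"
print(unicodedata.normalize("NFC", plain) == plain)      # False

# With a ccc=0 character between them, the reorderable run is split,
# so normalization leaves the typed vowel order intact.
blocked = "\u05DC\u05B7" + CGJ + "\u05B4"
print(unicodedata.normalize("NFC", blocked) == blocked)  # True
```

Any CC=0 combining character would split the run the same way; the proposed CVS would 
differ from CGJ only in its declared semantics, not in this ordering behavior.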

This definition preserves the current semantics of the CGJ (without extending it too 
far in a way that was not intended when it was defined), and it makes it possible to 
define combining classes for the most usual cases of an encoded script without 
compromising the future, should rarer texts be discovered for which the initial 
unification work violates the old text's semantics under normalization.

