I cannot agree with some of these statements. My comments are inserted.

Jony

> -----Original Message-----
> From: Philippe Verdy [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, July 02, 2003 2:43 PM
> To: Jony Rosenne
> Cc: [EMAIL PROTECTED]
> Subject: Re: Yerushala(y)im - or Biblical Hebrew
> 
> 
> On Wednesday, July 02, 2003 12:55 PM, Jony Rosenne 
> <[EMAIL PROTECTED]> wrote:
> 
> > I would like to summarize my understanding:
> > 
> > 1. The sequence Lamed Patah Hiriq is invalid for Hebrew. It 
> is invalid 
> > in Hebrew to have two vowels for one letter. It may or may not be a 
> > valid Unicode sequence, but there are many examples of 
> valid Unicode 
> > sequences that are invalid.
> 
> Only invalid for Modern Hebrew. 

No - it is true also for Biblical Hebrew and any other form of
Hebrew. The extra vowel belongs to another letter, one that is
known to exist but is not printed.
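
For reference, a minimal sketch (Python, using only the standard
unicodedata module; the class values are whatever the installed
UnicodeData.txt says) of what canonical ordering does with this
sequence:

    import unicodedata

    # U+05DC LAMED + U+05B7 PATAH + U+05B4 HIRIQ
    s = "\u05DC\u05B7\u05B4"
    for ch in s:
        ccc = unicodedata.combining(ch)
        print(f"U+{ord(ch):04X}  ccc={ccc:2d}  {unicodedata.name(ch)}")

    # Hiriq has a lower combining class than Patah, so canonical
    # ordering (applied by both NFC and NFD) moves it in front:
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)])
    # -> ['U+05DC', 'U+05B4', 'U+05B7']

Whatever order was typed, both normalization forms store the Hiriq
before the Patah.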

> In addition, we are not discussing the *validity* of the
> Unicode/ISO 10646 encoding: any Unicode string is valid even
> if it is not normalized, provided that it uses valid code
> points and respects a few constraints, such as approved
> variation sequences, correct use of surrogate code units,
> and no use of surrogate code points in the text itself.

I tried to say that although it may be valid Unicode, it is not valid
Hebrew. 

> 
> The issue is created by the Unicode normalization of text,
> which is NOT required for Unicode encoding validity, but only
> for text processing (notably with the legacy HTML and SGML,
> or the newer XML, XHTML and related standards based on XML).
> 
> You have not understood the issue with *Traditional Hebrew*,
> where there actually are two or more vowels on one base
> letter, notably in Biblical texts but certainly also in many
> other manuscripts of the same epochs, and probably later and
> still today, as long as these texts, important to human
> culture, have been (and will be) studied by scholars,
> researchers and interested people, whether they were (are, or
> will be) historians, sociologists, economists, linguists,
> translators, theologians, religious adepts, or researchers in
> the many other fields studied for millennia (including
> mathematics, astronomy, medicine...).

See above.

> 
> What has been demonstrated here is that the current combining
> classes defined for the Hebrew characters were not needed for
> Modern Hebrew (which could have been written perfectly well
> with all vowels given CC=0), yet the vowels were encoded with
> "randomly assigned" combining classes (the 220 and 230
> classes not being usable for them).

Unicode Hebrew points and cantillation marks were defined with Biblical
Hebrew in mind.

> 
> The initial encoding may have been done by studying only some
> fragments of the traditional texts, which exposed some
> combinations of vowels, without really searching through such
> important traditional texts as the Hebrew Bible (and also
> certainly some old versions of the Torah, or old Hebrew
> translations of the Quran, or of famous Latin, Greek,
> Phoenician, or Syriac manuscripts, in a Middle East region
> that has seen many foreign invasions and has been at the
> crossroads of the most famous cultures and trade routes). For
> all the vowels for which there did not seem to be a
> demonstrated preference order (in the studied fragments of
> text), the combining classes were mostly defined in an order
> matching the code point order of the legacy 8-bit encodings,
> on the assumption that occurrences of those vowels would be
> rare and would not cause problems.

There are no such cases, barring misunderstandings.
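
To make this concrete, a short sketch (Python, standard
unicodedata) that lists the combining classes actually assigned to
the Hebrew points; none of them use 220 (below) or 230 (above), and
they roughly follow the code point order of the block:

    import unicodedata

    # Hebrew points U+05B0..U+05C2; non-combining characters in this
    # range (such as MAQAF and PASEQ) are skipped.
    for cp in range(0x05B0, 0x05C3):
        ccc = unicodedata.combining(chr(cp))
        if ccc:
            name = unicodedata.name(chr(cp))
            print(f"U+{cp:04X}  ccc={ccc:2d}  {name}")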

> 
> When new historic scripts are added to Unicode, I do think
> that Unicode should not make assumptions from a small set of
> text fragments: further research may demonstrate that a
> definition of non-zero combining classes introduces too many
> problems to allow encoding new texts, for which an existing
> normalization would incorrectly swap combining letters and
> change the semantics of the encoded text. These old texts
> should be handled assuming that the typist who entered and
> encoded them was correct in the transcription, and an NF*
> normalization should not change this decision automatically,
> as it would frustrate all the effort made by the transcriber
> to produce an accurate transcript of the encoded text.
> 
> I think that if there are reasons to define combining classes
> for the normalization of some categories of text, we should
> accept sacrificing the unification of characters each time it
> causes a problem, or Unicode and ISO 10646 should accept
> defining/assigning a generic code point with category "Mn"
> and CC=0, whose only role would be to bypass the currently
> assigned non-zero CC value of combining characters, even if,
> temporarily, this causes some problems for text rendering
> engines (which can be corrected later to treat this character
> as ignorable for all rendering purposes, including searches
> for possible ligatures).
> 
> I suggest that such a code point be allocated in the U+03xx
> block of generic combining characters, so that it can be used
> in any script, including the existing ones. This character
> would be named "Combining Variant Selector" (CVS); it would
> preserve the semantics of the diacritic to which it is
> prefixed, and it would not override the current semantics of
> the "Combining Grapheme Joiner" (CGJ), which may have a
> specific use for creating ligatures between diacritics and
> which should continue to be canonically ordered, so that if
> diacritic <A> has CC=a and diacritic <B> has CC=b, and a < b,
> the sequence <A, CGJ, B> would be valid, but not
> <B, CGJ, A>, unless the combining class of A is overridden,
> as in <B, CGJ, CVS, A>.
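
Purely as an illustration of the mechanism such a character would
rely on (the proposed CVS does not exist, so CGJ, U+034F, which
already has combining class 0, stands in for it here), a short
Python sketch showing that a class-0 combining character blocks
canonical reordering of the marks on either side of it:

    import unicodedata

    # Lamed + Patah + CGJ + Hiriq: U+034F has ccc=0, so canonical
    # ordering cannot move the Hiriq across it.
    blocked = "\u05DC\u05B7\u034F\u05B4"
    print(unicodedata.normalize("NFC", blocked) == blocked)  # True

    # Without a class-0 character in between, NFC swaps the vowels:
    plain = "\u05DC\u05B7\u05B4"
    print(unicodedata.normalize("NFC", plain)
          == "\u05DC\u05B4\u05B7")  # True
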
> 
> This definition preserves the current semantics of the CGJ
> (without extending it too far in a way that was not intended
> when it was defined), and it makes it possible to define
> combining classes for the most usual cases of an encoded
> script without compromising the future, should rarer texts be
> discovered for which the initial unification work, through
> normalization, violates the semantics of the old text.
> 

