I cannot agree with some of these statements. My comments are inserted. Jony
> -----Original Message-----
> From: Philippe Verdy [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 02, 2003 2:43 PM
> To: Jony Rosenne
> Cc: [EMAIL PROTECTED]
> Subject: Re: Yerushala(y)im - or Biblical Hebrew
>
> On Wednesday, July 02, 2003 12:55 PM, Jony Rosenne
> <[EMAIL PROTECTED]> wrote:
>
> > I would like to summarize my understanding:
> >
> > 1. The sequence Lamed Patah Hiriq is invalid for Hebrew. It is invalid
> > in Hebrew to have two vowels for one letter. It may or may not be a
> > valid Unicode sequence, but there are many examples of valid Unicode
> > sequences that are invalid.
>
> Only invalid for Modern Hebrew.

No - it is true also for Biblical Hebrew and any other. The extra vowel
belongs to another letter, which is known to exist but isn't printed.

> In addition, we are not discussing the *validity* of the Unicode/ISO 10646
> encoding (any Unicode string is valid even if it is not normalized,
> provided that it uses valid code points, respects a few constraints such
> as approved variant sequences, and makes valid use of surrogate code
> units while avoiding surrogate code points).

I tried to say that although it may be valid Unicode, it is not valid Hebrew.

> The issue is created by the Unicode normalization of text, which is NOT
> required for Unicode encoding validity, but only for text processing
> (notably with the legacy HTML and SGML, or the newer XML, XHTML and
> related standards based on XML).
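The reordering under discussion is easy to observe. A minimal sketch in Python, using only the standard `unicodedata` module, showing what canonical normalization does to the typed sequence Lamed, Patah, Hiriq:

```python
import unicodedata

# Lamed (U+05DC), then Patah (U+05B7, ccc=17), then Hiriq (U+05B4, ccc=14)
s = "\u05DC\u05B7\u05B4"

# Canonical ordering sorts adjacent combining marks by combining class,
# so normalization moves the Hiriq (ccc=14) in front of the Patah (ccc=17).
n = unicodedata.normalize("NFC", s)

print([f"U+{ord(c):04X}" for c in n])
# The typed order Patah-then-Hiriq does not survive normalization:
# the result is <Lamed, Hiriq, Patah>.
```

This is exactly the Yerushala(y)im problem: whether or not the typed order is "valid Hebrew", normalization silently changes it.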
> You have not understood the issue with *Traditional Hebrew*, where there
> actually are two or more vowels on one base letter, notably in Biblical
> texts but certainly also in many other manuscripts of the same epochs,
> and probably later and still today, as long as these texts, important to
> human culture, have been (and will be) studied by scholars and
> researchers or interested people, whether they were (are, or will be)
> historians, sociologists, economists, linguists, translators,
> theologians, religious adherents, or workers in the many other
> scientific fields studied for millennia (including mathematics,
> astronomy, medicine...).

See above.

> What has been demonstrated here is that the current combining classes
> defined on Hebrew characters were not needed for Modern Hebrew (which
> could have been written perfectly well with all vowels defined with
> CC=0), but were encoded with "randomly assigned" combining classes on
> vowels (for which the 220 and 230 classes were not usable).

Unicode Hebrew points and cantillation marks were defined with Biblical
Hebrew in mind.

> The initial encoding may have been done by studying only some fragments
> of the traditional texts, which exposed some combinations of vowels,
> without really searching such important traditional texts as the Hebrew
> Bible (and also certainly some old versions of the Torah, or some old
> Hebrew translations of the Quran, or of famous Roman Latin, Greek,
> Phoenician, or Syriac manuscripts, in a Middle East region that has seen
> many foreign invasions and been at the crossroads of the most famous
> cultures and trade routes). For all vowels for which there did not seem
> to be a demonstrated preference order (in the studied fragments of
> text), the combining classes were mostly defined in an order matching
> the code point order of the legacy 8-bit encodings, on the assumption
> that occurrences of those vowels would be rare and would not cause
> problems.
There are no such cases, barring misunderstandings.

> When new old scripts are added to Unicode, I do think that Unicode
> should not make assumptions from a small set of text fragments: further
> research may demonstrate that a definition of non-zero combining classes
> would introduce too many problems to allow encoding new texts, for which
> an existing normalization would incorrectly swap combining marks and
> change the semantics of the encoded text. These old texts should be
> handled on the assumption that the typist who entered and encoded them
> was correct in their transcription, and an NF* normalization should not
> change this decision automatically, as that would frustrate all the
> efforts made by the transcriber to produce an accurate transcript of the
> encoded text.
>
> I think that if there are reasons to define combining classes for the
> normalization of some categories of text, we should accept sacrificing
> the unification of characters each time it causes a problem, or Unicode
> and ISO 10646 should accept defining/assigning a generic code point with
> class "Mn", CC=0, whose only role would be to bypass the currently
> assigned non-zero CC value of combining characters, even if, temporarily,
> this causes some problems for text rendering engines (which can be
> corrected later to treat this character as ignorable for all rendering
> purposes, including searches for possible ligatures).
>
> I suggest that such a code point be allocated in the U+03XX block for
> generic combining characters, so that it can be used in any script,
> including the existing ones.
> This character would be named "Combining Variant Selector" (CVS); it
> would preserve the semantics of the diacritic to which it is prefixed,
> and it would not override the current semantics of the "Combining
> Grapheme Joiner" (CGJ), which may have a specific usage for creating
> ligatures between diacritics and should still continue to be canonically
> ordered, so that if diacritic <A> has CC=a and diacritic <B> has CC=b,
> and if a < b, the sequence <A, CGJ, B> would be valid, but not
> <B, CGJ, A> unless the combining class of A is overridden with
> <B, CGJ, CVS, A>.
>
> This definition preserves the current semantics of the CGJ (without
> extending it too much in a way that was not intended when it was
> defined), and it makes it possible to define combining classes for the
> most usual cases of an encoded script without compromising the future,
> should rarer texts be discovered where the first unification work
> violates the semantics of the old text under normalization.
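For reference on the mechanism being discussed: CGJ (U+034F) itself already has combining class 0, so merely placing it between two marks splits the run of reorderable characters and preserves the typed order under normalization. A minimal Python sketch of that blocking effect, using only the standard `unicodedata` module:

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

# Lamed, then Patah (ccc=17), CGJ, Hiriq (ccc=14) -
# the "wrong" class order that normalization would otherwise swap.
s = "\u05DC\u05B7" + CGJ + "\u05B4"

# Because CGJ has combining class 0, it terminates the reorderable
# sequence of marks, so normalization leaves the typed order untouched.
assert unicodedata.normalize("NFD", s) == s
assert unicodedata.normalize("NFC", s) == s
```

Whether a separate CVS is needed on top of this behavior is exactly the design question raised above.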

