See some further comments on the details below.
On 28/10/2003 18:49, Philippe Verdy wrote:
I just finished an Excel spreadsheet that shows the Hebrew composition model, and all the problems caused by the canonical order of Hebrew diacritics.
In summary, most problems come from consonant modifiers which have a combining class higher than vowels or vowel modifiers.
If vowels had been assigned a null combining class, such problems would not
have appeared. The idea of generating a CGJ before all vowels in input
methods (and then letting a prenormalization process remove unnecessary CGJ in
composed strings) seems interesting, as it forces vowels to behave like base
characters, but it does not solve all the problems, only the ordering
problem caused by the wrong combining classes 21, 24 and 25 assigned
respectively to DAGESH/MAPIQ, SHIN DOT and SIN DOT, which come logically
before the vowels (in classes 10 to 20) or vowel modifiers (classes 22, 23
and 26).

Actually rafe, in class 23, and varika, class 26 but not used in Hebrew, should be considered consonant modifiers. Rafe basically indicates the absence of dagesh, and so these two fit in the same logical class. The only vowel modifier in this sense is meteg. But meteg is best considered as an accent, although it is sometimes used in texts which are not otherwise accented. Typographically, the only ways in which meteg differs from other accents are that it can appear to the right of a low vowel or in the middle of one of the hataf vowels.
Note that, as an exception to the neat rules here, when vav with shuruq is used as a vowel, i.e. the vowel is a separate base character, any meteg or accent is attached not to the vav with shuruq but to the consonant. For the accents are more syllable modifiers than vowel modifiers. But it is most sensible for a number of reasons to continue to order them after the combining vowels.
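The ordering problem with classes 21, 24 and 25 discussed above can be checked directly. The following is only a minimal sketch, assuming Python's standard `unicodedata` module as the source of combining classes; the sample string (shin, shin dot, dagesh, patah) is an illustration and is not taken from any particular text.

```python
import unicodedata

# Canonical combining classes of a vowel and the three consonant modifiers.
marks = {
    "PATAH": "\u05B7",
    "DAGESH": "\u05BC",
    "SHIN DOT": "\u05C1",
    "SIN DOT": "\u05C2",
}
classes = {name: unicodedata.combining(ch) for name, ch in marks.items()}

# Logical order: consonant, then its modifiers, then the vowel.
logical = "\u05E9\u05C1\u05BC\u05B7"  # shin, shin dot, dagesh, patah

# Canonical ordering sorts the marks by ascending combining class,
# so the vowel (17) moves in front of dagesh (21) and shin dot (24).
nfd = unicodedata.normalize("NFD", logical)

for name, value in classes.items():
    print(f"{name}: class {value}")
print([f"U+{ord(ch):04X}" for ch in nfd])
```

The normalized string keeps the same characters but puts the vowel before both consonant modifiers.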
We could specify a rule for inserting CGJ only when it is useful:
- before (any vowel, vowel modifier or cantillation mark) if it follows
(DAGESH/MAPIQ, SHIN DOT or SIN DOT),
- just before a second vowel on the same consonant (in that case it
plays the role of the "missing consonant" in Yerushala(y)im). But this
requires some more specific rules to remove other superfluous CGJ, as it is
not always needed: this depends on the relative combining classes of the
corresponding vowels (in classes 10 to 20). See below when this is needed.

A good rule. This will greatly simplify rendering and collation. But it does need to be agreed by all.
Well, if this CGJ is inserted by a keyboard utility or a code conversion routine, these more specific rules can be programmed. But in practice only a very small number of superfluous CGJs will be added if this rule is used unmodified; I found only one case in the whole Hebrew Bible of two vowel points on one base character which happen to be in canonical order.
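The two-part insertion rule discussed above could be sketched roughly as follows. This is an illustration under stated assumptions, not an agreed algorithm: the function name `insert_cgj` and the helper `is_vowel` are hypothetical, vowels are identified simply by combining classes 10 to 20, and the rule is applied unmodified, so no superfluous CGJ is removed.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0
# DAGESH/MAPIQ, SHIN DOT, SIN DOT (classes 21, 24, 25)
CONSONANT_MODIFIERS = {"\u05BC", "\u05C1", "\u05C2"}


def is_vowel(ch):
    """Hypothetical helper: Hebrew vowel points have classes 10 to 20."""
    return 10 <= unicodedata.combining(ch) <= 20


def insert_cgj(text):
    """Insert CGJ (a) before any other mark that directly follows a
    consonant modifier and (b) before a second vowel on the same base."""
    out = []
    vowel_seen = False
    for ch in text:
        if unicodedata.combining(ch) == 0:
            vowel_seen = False  # a new base character starts here
        else:
            after_modifier = bool(out) and out[-1] in CONSONANT_MODIFIERS
            if after_modifier and ch not in CONSONANT_MODIFIERS:
                out.append(CGJ)
            elif is_vowel(ch) and vowel_seen:
                out.append(CGJ)
            vowel_seen = vowel_seen or is_vowel(ch)
        out.append(ch)
    return "".join(out)


bet_dagesh_patah = insert_cgj("\u05D1\u05BC\u05B7")
lamed_patah_hiriq_mem = insert_cgj("\u05DC\u05B7\u05B4\u05DD")
unprotected = unicodedata.normalize("NFD", "\u05DC\u05B7\u05B4")
```

Because CGJ has combining class 0, the marks on either side of it are never reordered across it, so both protected strings survive NFD unchanged, while the unprotected patah, hiriq pair is swapped.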
Another use for CGJ which you have not specified is to ensure proper positioning of meteg, to the right or left of vowels or accents. (Medial meteg needs a different mechanism.)
Another solution could be to duplicate these 3 consonant modifiers, so:
NEW DAGESH/MAPIQ: class 10 (central position)
NEW SHIN DOT: class 11 (above-right position)
NEW SIN DOT: class 12 (above-left position)
and remap all vowels and vowel signs starting at class 13 and higher...
It would be appropriate in that case to not name them "POINT", but "LETTER MODIFIER".
Also I note that most usages of dagesh/mapiq, shin dots and sin dots with base consonants were mapped in Unicode by encoding precomposed consonants. The bad thing is that they are canonically decomposed and prohibited from recomposition.
If one had encoded and used the original text with a legacy Hebrew encoding
which included these precomposed characters, without considering the case of
Unicode-specific normalizations, there were no such problems. So the problem
has been introduced by Unicode, which made them canonical decompositions in
all NF forms, instead of defining them only for NFK*. Worse, the canonical
composition exclusions are blocking us from using these precomposed
characters in an NFC text.

There would indeed have been fewer problems if these precomposed forms had not been specified as composition exclusions. Well, it would have enabled rendering of NFC Hebrew by non-compliant rendering engines, i.e. ones which don't render all canonically equivalent sequences the same. It would not have simplified the collation issue, as collation is based on NFD.
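The effect of the composition exclusions can be seen with a single precomposed letter. A small sketch, again assuming Python's `unicodedata` module; U+FB31 HEBREW LETTER BET WITH DAGESH is chosen here only as a representative example.

```python
import unicodedata

precomposed = "\uFB31"  # HEBREW LETTER BET WITH DAGESH

# The decomposition mapping is canonical (it carries no <compat> tag).
mapping = unicodedata.decomposition(precomposed)

# NFD decomposes it, and NFC does not put it back together,
# because the character is listed as a composition exclusion.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", precomposed)

print(mapping)
print([f"U+{ord(ch):04X}" for ch in nfc])
```

Both normalization forms yield the two-character sequence bet, dagesh, so the precomposed character cannot appear in normalized text.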
Assigning new codepoints could ease the transcription of texts from legacy Hebrew encodings (including the Windows Hebrew and ISO Hebrew character sets) to Unicode without experiencing all these common problems: this would affect the mapping to Unicode of these legacy charsets, but certainly it would be beneficial in the long term.
There are however two more subtle problems:
1) Within the set of vowels U+05B0 to U+05BB (classes 10 to 20):
They all combine at a position "below", except U+05B9 POINT HOLAM (class 19).
The U+05BB POINT QUBUTS vowel (class 20) is not grouped with the other "below" vowels. In fact the canonical ordering attempts to force a unique order for all vowels, which does not take into account their real layout and combining properties.
In reality, these vowels should have been given only 2 possible combining classes, such as 13 (position below) for all vowels, except POINT HOLAM which would have class 14 (or could be kept at its existing class 19, position above-left).
The ordering problem can be solved using a CGJ before the U+05B9 POINT
HOLAM vowel (class 19, above-left) if it needs to follow the U+05BB POINT
QUBUTS vowel (class 20, below).

This would never be necessary. As holam really does not interact typographically with low vowels, there can be no significance in their relative ordering, and it is appropriate that they have different combining classes. The problem is that the various low vowels have different combining classes although they do interact typographically, in breach of the standard itself. That is why CGJ often needs to be inserted between vowel pairs.
The alternative would be to encode a new POINT HOLAM or a new POINT QUBUTS with a more correct class that respects the combining groups in the Hebrew script.
2) Within cantillation marks (U+0591 to U+05AF, plus U+05C4 MARK UPPER DOT):
The accents coded with class 220 (below), 222 (below-right), 228
(below-left) have no problem.

228 is actually above-left, surely.
However the remaining 19 accents and marks at class 230 (above) do not
belong to the same combining category, the problems are for these 5
characters:
U+059D ACCENT GERESH MUQDAM (alias "gerich mouqdam")
U+05A0 ACCENT TELISHA GEDOLA (alias "talchah")
which are combining at position above-right, and
U+0599 ACCENT PASHTA (alias "qadma")
U+05A1 ACCENT PAZER
U+05A9 ACCENT TELISHA QETANA (alias "tarsa")
which are combining at position above-left.

There is also a problem with U+0592 which is also positioned above-left, at least in many texts, but is in class 230. But U+05A1 is usually centred above. See http://www.lrz-muenchen.de/~hr/teamim/tables.html for a useful summary of accent positions. But these positions vary - and accent names vary even more.
This is not strictly a problem for preserving the semantics of the text, as they share the same combining class, and so the normalization process will not reorder them. ...

There is a potential problem in that U+05AE, also positioned above left (although wrongly shown as below left in your chart), is in class 228, and does interact typographically with the other above left accents which are in class 230. But this is probably of theoretical importance only, and CGJ can be used if really necessary.

... But it still prevents a more complete normalization that
considers the case of these 5 accents which may (should?) be reordered.
However, the case of multiple cantillation marks on the same vowel may be
quite rare even in historic texts (but I don't have a copy of the large
liturgical Hebrew texts to verify this.)

I posted before the results of my analysis of the rare cases of multiple accents in the Hebrew Bible. The only cases in which there was a potential normalisation issue were combinations of meteg and other accents. Accents are very rarely used in other texts.
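The behaviour of the class 230 and class 228 accents under normalization can be illustrated as below. This is only a sketch using Python's `unicodedata`; the accent combinations on alef are artificial examples chosen to show the mechanism, not sequences attested in any text.

```python
import unicodedata

ALEF = "\u05D0"
GERESH_MUQDAM = "\u059D"   # class 230
TELISHA_GEDOLA = "\u05A0"  # class 230
PASHTA = "\u0599"          # class 230
ZINOR = "\u05AE"           # class 228


def nfd(s):
    return unicodedata.normalize("NFD", s)


# Two accents in the same class are never reordered, so the two
# possible input orders remain distinct after normalization.
order_a = nfd(ALEF + GERESH_MUQDAM + TELISHA_GEDOLA)
order_b = nfd(ALEF + TELISHA_GEDOLA + GERESH_MUQDAM)

# An accent in class 228 is moved in front of one in class 230.
mixed = nfd(ALEF + PASHTA + ZINOR)

# A CGJ (class 0) between them keeps the typed order.
protected = nfd(ALEF + PASHTA + "\u034F" + ZINOR)
```

Here `order_a` and `order_b` stay different strings, `mixed` comes out as alef, zinor, pashta, and `protected` is left as typed.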
If one is interested, I attach my Excel sheet which makes all this more visual, and that includes also the existing decompositions of compatibility characters (U+FBxx), shown in italic rows. The table is ordered by logical semantics and grouping. The combining classes that cause problems are shown with bold white on red squares, and the positioning constraints partly explained in the Unicode reference chapter can better be explained by looking at the positioning columns in the table.
If errors remain in this table, please don't shout at me too much...

Thank you. I have pointed out above a few small errors of detail, but the principle is good.
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Hebrew character model.zip
Description: Zip compressed data

