Thank you, Philippe. I include the full text of your posting plus the attachment for the benefit of those on the Unicode Hebrew list who have missed out on this. Some of the issues here have already been discussed on that list. Also I wonder if you have seen http://scripts.sil.org/cms/sites/nrsi/media/BibHebAltCharsProposal.pdf, which includes a good analysis of the issues and proposes new characters with more suitable combining classes. The problem with that proposal was not with the technical details but that the principle of using a separate encoding for biblical Hebrew is unacceptable. That problem would be avoided if these new characters (with names adjusted) were used for all pointed Hebrew and the existing characters deprecated. But there are other reasons which make that suggestion difficult to accept - although there is probably not much existing pointed modern Hebrew text.

See some further comments on the details below.


On 28/10/2003 18:49, Philippe Verdy wrote:

I just finished an Excel spreadsheet that shows the Hebrew composition model,
and all the problems caused by the canonical order of Hebrew diacritics.

In summary, most problems come from consonant modifiers which have a
combining class higher than vowels or vowel modifiers.

If vowels had been assigned a null combining class, such problems would not
have appeared. The idea of generating a CGJ before all vowels in input
methods (and then letting a prenormalization process remove unnecessary CGJs
in composed strings) seems interesting, as it forces vowels to behave like
base characters. But it does not solve all the problems, only the ordering
problem caused by the wrong combining classes 21, 24 and 25 assigned
respectively to DAGESH/MAPIQ, SHIN DOT and SIN DOT, which come logically
before the vowels (in classes 10 to 20) or vowel modifiers (classes 22, 23
and 26).
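
The reordering described above can be reproduced with Python's standard unicodedata module. This is only a sketch: the sample sequence (bet, dagesh, patah) and the variable names are chosen here for illustration and do not come from the spreadsheet.

```python
import unicodedata

BET, DAGESH, PATAH = "\u05D1", "\u05BC", "\u05B7"
SHIN_DOT, SIN_DOT = "\u05C1", "\u05C2"

# Canonical combining classes as recorded in the Unicode Character Database.
for ch in (PATAH, DAGESH, SHIN_DOT, SIN_DOT):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: class {unicodedata.combining(ch)}")

# Logical order: consonant, then its modifier (dagesh), then the vowel.
logical = BET + DAGESH + PATAH

# Canonical ordering sorts adjacent combining marks by class, so the
# vowel (class 17) is moved in front of the dagesh (class 21).
normalized = unicodedata.normalize("NFC", logical)
print([f"U+{ord(c):04X}" for c in normalized])
```

The same reordering happens in NFD, since both forms apply the canonical ordering step.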


Actually rafe, in class 23, and varika, class 26 but not used in Hebrew, should be considered consonant modifiers. Rafe basically indicates the absence of dagesh, and so these two fit in the same logical class. The only vowel modifier in this sense is meteg. But meteg is best considered as an accent, although it is sometimes used in texts which are not otherwise accented. Typographically, the only ways in which meteg differs from other accents are that it can appear to the right of a low vowel or in the middle of one of the hataf vowels.

Note that, as an exception to the neat rules here, when vav with shuruq is used as a vowel, i.e. the vowel is a separate base character, any meteg or accent is attached not to the vav with shuruq but to the consonant. This is because the accents are syllable modifiers rather than vowel modifiers. But it is most sensible for a number of reasons to continue to order them after the combining vowels.

We could specify a rule for inserting CGJ only when it is useful:

- before (any vowel, vowel modifier or cantillation mark) if it follows
(DAGESH/MAPIQ, SHIN DOT or SIN DOT),


A good rule. This will greatly simplify rendering and collation. But it does need to be agreed by all.
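
As a rough sketch of how a keyboard utility or conversion routine might apply this first rule, assuming Python and its standard unicodedata module (the function name insert_cgj and the test for "any combining mark other than the three consonant modifiers" are my own reading of the rule, not an agreed specification):

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0
# DAGESH/MAPIQ, SHIN DOT, SIN DOT
CONSONANT_MODIFIERS = {"\u05BC", "\u05C1", "\u05C2"}

def insert_cgj(text: str) -> str:
    """Insert CGJ before any combining mark that directly follows
    a dagesh/mapiq, shin dot or sin dot (hypothetical helper)."""
    out = []
    prev = ""
    for ch in text:
        is_mark = unicodedata.combining(ch) != 0
        if prev in CONSONANT_MODIFIERS and is_mark and ch not in CONSONANT_MODIFIERS:
            out.append(CGJ)
        out.append(ch)
        prev = ch
    return "".join(out)

# bet + dagesh + patah: the CGJ keeps the dagesh in front of the vowel.
protected = insert_cgj("\u05D1\u05BC\u05B7")
print([f"U+{ord(c):04X}" for c in protected])
print(unicodedata.normalize("NFC", protected) == protected)
```

Applied literally, the rule also puts a CGJ before a cantillation mark that follows a dagesh, although marks in classes 220 and above would not be reordered across it anyway.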

- just before a second vowel on the same consonant (in that case it
plays the role of the "missing consonant" in Yerushala(y)im). But this
requires some more specific rules to remove other superfluous CGJs, as it is
not always needed: this depends on the relative combining classes of the
corresponding vowels (in classes 10 to 20). See below when this is needed.


Well, if this CGJ is inserted by a keyboard utility or a code conversion routine, these more specific rules can be programmed. But in practice only a very small number of superfluous CGJs will be added if this rule is used unmodified; I found only one case in the whole Hebrew Bible of two vowel points on one base character which happen to be in canonical order.
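
The more specific rule could be programmed along these lines. This is a minimal sketch in Python using the standard unicodedata module; the helper names needs_cgj and join_vowels are hypothetical, and the sample (lamed carrying patah and then hiriq) is only an illustration of the two-vowel case.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0
LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"

def needs_cgj(first_vowel: str, second_vowel: str) -> bool:
    """True when canonical ordering would swap the two vowels, i.e. when
    the second has a lower combining class than the first."""
    return unicodedata.combining(second_vowel) < unicodedata.combining(first_vowel)

def join_vowels(base: str, first_vowel: str, second_vowel: str) -> str:
    """Attach two vowels to one base, adding CGJ only when it is required."""
    sep = CGJ if needs_cgj(first_vowel, second_vowel) else ""
    return base + first_vowel + sep + second_vowel

# Without CGJ, patah (class 17) followed by hiriq (class 14) is reordered.
unprotected = LAMED + PATAH + HIRIQ
print(unicodedata.normalize("NFC", unprotected) == LAMED + HIRIQ + PATAH)

# With the conditional CGJ the intended order survives normalization.
protected = join_vowels(LAMED, PATAH, HIRIQ)
print(unicodedata.normalize("NFC", protected) == protected)
```

Inserting the CGJ unconditionally, as in the unmodified rule, would differ from this only for vowel pairs that are already in canonical order.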

Another use for CGJ which you have not specified is to ensure proper positioning of meteg, to the right or left of vowels or accents. (Medial meteg needs a different mechanism.)
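
To illustrate the meteg case, here is a small sketch in Python (standard unicodedata module). It assumes, purely for the example, that an encoder records a meteg before the vowel to request the less usual position; that convention is an assumption of mine, not something settled on the list.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0
BET, QAMATS, METEG = "\u05D1", "\u05B8", "\u05BD"

# Meteg is in class 22 and qamats in class 18, so a meteg typed before
# the vowel is moved after it by canonical ordering.
meteg_first = BET + METEG + QAMATS
reordered = unicodedata.normalize("NFC", meteg_first)
print(reordered == BET + QAMATS + METEG)

# A CGJ between the two marks blocks the reordering.
protected = BET + METEG + CGJ + QAMATS
print(unicodedata.normalize("NFC", protected) == protected)
```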

Another solution could be to duplicate these 3 consonant modifiers, so:
   NEW DAGESH/MAPIQ: class 10 (central position)
   NEW SHIN DOT: class 11 (above-right position)
   NEW SIN DOT: class 12 (above-left position)
   and remap all vowels and vowel signs starting at class 13 and higher...
It would be appropriate in that case to not name them "POINT", but "LETTER
MODIFIER"

Also I note that most usages of dagesh/mapiq, shin dots and sin dots with
base consonants were mapped in Unicode by encoding precomposed consonants.
The bad thing is that they are canonically decomposed and prohibited from
recomposition.

If one had encoded and used the original text with a legacy Hebrew encoding
which included these precomposed characters, without considering the case of
Unicode-specific normalizations, there would have been no such problems. So
the problem has been introduced by Unicode, which made them canonical
decompositions in all NF forms, instead of defining them only for NFK*.
Worse, the canonical composition exclusions are blocking us from using these
precomposed characters in an NFC text.


There would indeed have been fewer problems if these precomposed forms had not been specified as composition exclusions. Specifically, it would have enabled rendering of NFC Hebrew by non-compliant rendering engines, i.e. ones which don't render all canonically equivalent sequences the same. It would not have simplified the collation issue, as collation is based on NFD.
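
The behaviour of the precomposed forms is easy to check. The following sketch uses Python's standard unicodedata module and takes U+FB31 HEBREW LETTER BET WITH DAGESH as a sample; the choice of character is mine.

```python
import unicodedata

BET_WITH_DAGESH = "\uFB31"  # HEBREW LETTER BET WITH DAGESH

# The decomposition mapping carries no compatibility tag, so it is canonical.
mapping = unicodedata.decomposition(BET_WITH_DAGESH)
print(mapping)

# Because the character is a composition exclusion, every normalization
# form, NFC included, yields the decomposed pair bet + dagesh.
results = {
    form: unicodedata.normalize(form, BET_WITH_DAGESH)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
for form, text in results.items():
    print(form, [f"U+{ord(c):04X}" for c in text])
```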

Assigning new codepoints could ease the transcription of texts from legacy
Hebrew encodings (including the Windows Hebrew and ISO Hebrew character
sets) to Unicode without experiencing all these common problems: this would
affect the mapping to Unicode of these legacy charsets, but certainly it
would be beneficial in the long term.

There are however two more subtle problems:

1) Within the set of vowels U+05B0 to U+05B9 (classes 10 to 20):

   They are all combining with a position "below", except U+05B9 POINT
HOLAM (class 19).

   The U+05BB POINT QUBUTS vowel (class 20) is not grouped with the other
"below" vowels. In fact the canonical ordering attempts to force a unique
order for all vowels, which does not take into account their real layout
and combining properties.

   In reality, these vowels should have been given only 2 possible
combining classes, such as 13 (position below) for all vowels, except POINT
HOLAM which would have class 14 (or could be kept at its existing class 19,
position above-left).

The ordering problem can be solved using a CGJ before the U+05B9 POINT
HOLAM vowel (class 19, above-left) if it needs to follow the U+05BB POINT
QUBUTS vowel (class 20, below).


This would never be necessary. As holam really does not interact typographically with low vowels, there can be no significance in their relative ordering, and it is appropriate that they have different combining classes. The problem is that the various low vowels have different combining classes although they do interact typographically, in breach of the standard itself. That is why CGJ often needs to be inserted between vowel pairs.
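
To make the point about the low vowels concrete, this sketch (Python, standard unicodedata module) lists their combining classes and counts how many ordered pairs of distinct low vowels would be swapped by normalization. The alef base and the variable names are illustrative choices only.

```python
import unicodedata
from itertools import permutations

ALEF = "\u05D0"
# Sheva (U+05B0) through qamats (U+05B8), plus qubuts (U+05BB).
LOW_VOWELS = [chr(cp) for cp in range(0x05B0, 0x05B9)] + ["\u05BB"]

classes = {v: unicodedata.combining(v) for v in LOW_VOWELS}
for v, ccc in classes.items():
    print(f"U+{ord(v):04X} {unicodedata.name(v)}: class {ccc}")

# Every low vowel has its own class, so any pair typed in descending
# class order is swapped by canonical ordering.
swapped = [
    (a, b)
    for a, b in permutations(LOW_VOWELS, 2)
    if unicodedata.normalize("NFD", ALEF + a + b) != ALEF + a + b
]
print(len(swapped), "of", len(LOW_VOWELS) * (len(LOW_VOWELS) - 1), "ordered pairs are reordered")
```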

   The alternative would be to encode a new POINT HOLAM or a new POINT
QUBUTS with a more correct class that respects the combining groups in
the Hebrew script.

2) Within cantillation marks (U+0591 to U+05AF, plus U+05C4 MARK UPPER DOT):

The accents coded with class 220 (below), 222 (below-right), 228
(below-left) have no problem.


228 is actually above-left, surely.

However the remaining 19 accents and marks at class 230 (above) do not
belong to the same combining category. The problems are with these 5
characters:
U+059D ACCENT GERESH MUQDAM (alias "gerich mouqdam")
U+05A0 ACCENT TELISHA GEDOLA (alias "talchah")
which are combining at position above-right, and
U+0599 ACCENT PASHTA (alias "qadma")
U+05A1 ACCENT PAZER
U+05A9 ACCENT TELISHA QETANA (alias "tarsa")
which are combining at position above-left.


There is also a problem with U+0592 which is also positioned above-left, at least in many texts, but is in class 230. But U+05A1 is usually centred above. See http://www.lrz-muenchen.de/~hr/teamim/tables.html for a useful summary of accent positions. But these positions vary - and accent names vary even more.

   This is not strictly a problem for preserving the semantics of the text, as they
share the same combining class, and so the normalization process will not
reorder them. ...

There is a potential problem in that U+05AE, also positioned above left (although wrongly shown as below left in your chart), is in class 228, and does interact typographically with the other above left accents which are in class 230. But this is probably of theoretical importance only, and CGJ can be used if really necessary.
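
The classes mentioned in this part of the discussion can be checked as follows. This is a sketch in Python with the standard unicodedata module; the selection of accents and the alef base are mine, picked to cover the classes named above.

```python
import unicodedata

ALEF = "\u05D0"
ZINOR, PASHTA, GERESH_MUQDAM = "\u05AE", "\u0599", "\u059D"
SAMPLE = ["\u0591", "\u059A", "\u05AD", ZINOR, "\u0592", PASHTA,
          GERESH_MUQDAM, "\u05A0", "\u05A1", "\u05A9"]

accent_classes = {a: unicodedata.combining(a) for a in SAMPLE}
for a, ccc in accent_classes.items():
    print(f"U+{ord(a):04X} {unicodedata.name(a)}: class {ccc}")

# Two class-230 accents keep whatever order they were entered in.
same_class = ALEF + PASHTA + GERESH_MUQDAM
print(unicodedata.normalize("NFD", same_class) == same_class)

# U+05AE (class 228) is moved in front of a class-230 accent.
mixed = ALEF + PASHTA + ZINOR
mixed_normalized = unicodedata.normalize("NFD", mixed)
print(mixed_normalized == ALEF + ZINOR + PASHTA)
```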

... But it still prevents a more complete normalization that
considers the case of these 5 accents which may (should?) be reordered.
However, the case of multiple cantillation marks on the same vowel may be
quite rare even in historic texts (but I don't have a copy of the large
liturgical Hebrew texts to verify this.)


I posted before the results of my analysis of the rare cases of multiple accents in the Hebrew Bible. The only cases in which there was a potential normalisation issue were combinations of meteg and other accents. Accents are very rarely used in other texts.

If anyone is interested, I attach my Excel sheet which makes all this more
visual, and which also includes the existing decompositions of compatibility
characters (U+FBxx), shown in italic rows.
The table is ordered by logical semantic and grouping. The combining classes
that cause problems are shown with bold white on red squares, and the
positioning constraints partly explained in the Unicode reference chapter
can better be explained by looking at the positioning columns in the table.

If errors remain in this table, please don't shout at me too much...


Thank you. I have pointed out above a few small errors of detail, but the principle is good.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/


Attachment: Hebrew character model.zip
Description: Zip compressed data
