OK, it sounds like we have clarity on two items:
- change the non-final characters to <isolated>
- moving dagesh to after RAFE, SHIN DOT, SIN DOT, VARIKA
I'll talk to Ken about whether we have time to get them into UCA 4.0.0, and in
any event we can get them into ICU 2.8 for Hebrew.

These changes are certainly a move in the right direction, but only part of the way. If we can get these in quickly, that would be good. But we mustn't let things rest there.
As far as the strength issue of final vs dagesh, I don't think we should take any immediate action. The collation strength also affects matching. If a user sets the sorting or matching level to "ignore accents", for example, they probably expect the dots to be ignored then, as well as graves, acutes, etc. If this showed up in a lot of words, then it would still be worth doing, I suspect. But because the number of cases is so very small where you would have a combination of dageshes and finals that would make a difference, ...

Yes, the number of cases where the relative ordering of dagesh and final forms is important is vanishingly small, because final forms are nearly always predictable anyway.

Nevertheless, this is an important issue. It is important, certainly in the biblical context, that the difference between regular and final forms is ignored in a basic "ignore accents" type of search. And Mati seems to agree: he wrote: "in most cases, the difference between Final vs non-Final must be ignored for searches". Compare for example ignoring upper and lower case differences in English. I would propose putting the final/non-final difference at the same level as that one.

... I would
recommend that SII approach this very carefully. If we are going to do anything,
it should be in the next version of UCA so that we have time to consider all of
the ramifications. I would not recommend it for ICU 2.8 either, even though we
have more time (and flexibility) there.

Indeed. The issue is a lot more complex than it seems here.
You raise one other issue in the following:
I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
so Dagesh should go after Varika.
From that, it would also appear that VARIKA should either have the same weight as RAFE or at least be adjacent to it. This would only be an issue for users of that character, so probably difficult to establish the right behavior, and thus one we would not even try to get into UCA this round.
We should probably take this discussion off of [EMAIL PROTECTED], and just have it on [EMAIL PROTECTED] and [EMAIL PROTECTED]. Any people interested in this topic should be on those groups anyway.

Agreed. But it seems, Mark, that you are not on the Hebrew list, as your posting has not reached there. So I am copying your whole posting, plus my additions, to the Hebrew list.
By the way, I am not on the bidi group because I am interested mainly in the kinds of Hebrew issues which are independent of specific bidi matters. Am I in fact missing out on important discussion of Hebrew?
Mark __________________________________ http://www.macchiato.com ► “Eppur si muove” ◄
----- Original Message ----- From: "Matitiahu Allouche" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, August 21, 2003 00:55
Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
Hello, Mark!
In order to address your points in order, I will put excerpts of your note within <MARK> . . . </MARK> tags, and my comments as untagged text.
<MARK>
A. Final.
1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
or absence of Dagesh is a Secondary difference, while Final/non-Final is a
Tertiary difference. This is relevant only for letters Kaf and Pe. My
gut feeling says that Final/non-Final should have precedence over
Dagesh/no-Dagesh.
Note that the number of actual cases where this would make a difference is
probably *very* small.

So there are two issues for final vs non-final: strength and ordering.
A1. Ordering is easy to change; in ICU or UCA we could put the final values before the independent letters. In ICU they are just rules, while in UCA they follow http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table. The easiest in UCA would be to give the 5 independent forms that have finals the value <isolated>.
Note: there is one minor fallout in ICU: we optimize the sortkey compression of tertiary values of NONE; if we change the ordering then each instance of the <isolated> letters will mean about a 2-3 byte increase in sort-key sizes. </MARK>
I like giving the value <isolated> to the 5 independent forms that have finals. As for the increase in sort-key sizes, this is what cheap memory is made for :-)
<MARK> A2. For Strength, it is not as clear cut. If Final vs non-Final is more important than dagesh, etc, the easiest thing is to make it a primary difference; but that would make
Zayin Yod PeFinal
sort before all words
Zayin Yod Pe XXX
But I'm guessing that is probably not desired for Hebrew. </MARK>
Why? This is exactly what I desire for Hebrew. But I am afraid that making primary differences for Final vs non-Final will make searches using a Final form not match a non-Final form and vice-versa, which is bad: in most cases, the difference between Final vs non-Final must be ignored for searches.
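To make the trade-off between A1 and A2 concrete, here is a minimal sketch in Python. It is a toy model, not the UCA or ICU implementation: the primary weights, the `collation_key` and `matches_at_primary` helpers, and the use of tertiary weight 1 as a stand-in for <isolated> are all assumptions made for illustration.

```python
# Toy model only: hypothetical weights, not the real UCA or ICU tables.
PRIMARY = {"zayin": 10, "yod": 20, "pe": 30}
HAS_FINAL = {"pe"}  # letters in this toy alphabet that have a final form


def collation_key(word, final_is_primary):
    """Build (primary, tertiary) weight tuples for a list of (letter, is_final) pairs."""
    primary, tertiary = [], []
    for letter, is_final in word:
        weight = PRIMARY[letter]
        if final_is_primary:
            # Option A2: the final form gets its own, lower primary weight.
            primary.append(weight - 1 if is_final else weight)
            tertiary.append(0)
        else:
            # Option A1: same primary weight; the independent form of a letter
            # that has a final carries the higher tertiary weight (standing in
            # for <isolated>), so the final form sorts first.
            primary.append(weight)
            tertiary.append(1 if letter in HAS_FINAL and not is_final else 0)
    return tuple(primary), tuple(tertiary)


def matches_at_primary(a, b, final_is_primary):
    """Simulate a search that compares primary weights only."""
    return collation_key(a, final_is_primary)[0] == collation_key(b, final_is_primary)[0]


with_final = [("zayin", False), ("yod", False), ("pe", True)]
without_final = [("zayin", False), ("yod", False), ("pe", False)]

# Tertiary difference: a primary-strength search still matches.
print(matches_at_primary(with_final, without_final, final_is_primary=False))
# Primary difference: the same search no longer matches.
print(matches_at_primary(with_final, without_final, final_is_primary=True))
```

In this toy model the final form sorts first under both options; only the primary-strength match changes, which is the search concern raised above.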
<MARK> In ICU we could make Final vs non-Final be a secondary difference, and have Dagesh, etc. be tertiary differences. The disadvantage is that people tend to expect the 2nd level to be 'accent-like', and there might be more inconsistencies in practice than you would gain by having the current situation. </MARK>
I don't think that there is enough experience accumulated to create expectations among people. If this is the right thing (and I think it is), it is still early enough to do it now.
<MARK> In Unicode, the UCA has more production restrictions as per http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table, so it would be a bit harder to make that change.
So if SII would like this change, I'd recommend that we make the ordering change in UCA (which will then affect ICU), but not make a strength change (it would have to be extremely exotic for that to make a difference). </MARK>
Personally, I would go for the strength change, but I understand the adverse considerations. I will have to take the matter to SII.
<MARK>
B. Dagesh
2) There is something strange in the combinations of Shin with Dagesh and
dots: for all other letters, the form without Dagesh sorts before the form
with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
combinations with Dagesh. I cannot imagine a justification for that.
We have currently in UCA the following (from UCA 4.0.0d1 (beta)):
05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA
To make this change, we would move Dagesh to after SIN DOT. Question: should it also go after VARIKA or not? </MARK>
I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant of 05BF Rafe, it makes sense that Dagesh be in the same relation to both, so Dagesh should go after Varika.
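The effect of moving Dagesh can be sketched with the secondary weights from the UCA 4.0.0d1 (beta) lines quoted above. This is only an illustration: the weight 0x00C4 for the moved Dagesh is a placeholder (a real table would renumber the entries), and `secondary_key` and `plain_sorts_first` are hypothetical helpers that compare only the marks, assuming both strings share the same base letter and are already in canonical order.

```python
# Secondary weights copied from the UCA 4.0.0d1 (beta) lines quoted above.
CURRENT = {
    "\u05BC": 0x00BD,  # HEBREW POINT DAGESH OR MAPIQ
    "\u05BF": 0x00C0,  # HEBREW POINT RAFE
    "\u05C1": 0x00C1,  # HEBREW POINT SHIN DOT
    "\u05C2": 0x00C2,  # HEBREW POINT SIN DOT
    "\uFB1E": 0x00C3,  # HEBREW POINT JUDEO-SPANISH VARIKA
}

# Hypothetical proposed table: Dagesh moved to after VARIKA.
# 0x00C4 is a placeholder value, not taken from any released table.
PROPOSED = dict(CURRENT)
PROPOSED["\u05BC"] = 0x00C4

SHIN = "\u05E9"
BET = "\u05D1"


def secondary_key(text, weights):
    """Secondary weights of the marks in text; base letters are skipped."""
    return tuple(weights[ch] for ch in text if ch in weights)


def plain_sorts_first(without_dagesh, with_dagesh, weights):
    """True if the form without Dagesh sorts before the form with Dagesh."""
    return secondary_key(without_dagesh, weights) < secondary_key(with_dagesh, weights)


shin_dot = SHIN + "\u05C1"
shin_dagesh_dot = SHIN + "\u05BC\u05C1"  # marks already in canonical order

print(plain_sorts_first(BET, BET + "\u05BC", CURRENT))         # ordinary letter
print(plain_sorts_first(shin_dot, shin_dagesh_dot, CURRENT))   # the anomaly
print(plain_sorts_first(shin_dot, shin_dagesh_dot, PROPOSED))  # after the move
```

With the current weights the Shin forms with Dagesh sort first, because Dagesh (00BD) is compared against Shin Dot (00C1); once Dagesh carries a weight above the dots, Rafe and Varika, the form without Dagesh sorts first, as it does for other letters.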
Shalom (Regards), Mati Bidi Architect Globalization Center Of Competency - Bidirectional Scripts IBM Israel Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52 554160
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/

