On 19/08/2003 14:23, Mark Davis wrote:
Three points.
First, while we try to make the UCA collation table (DUCET) as reasonable as possible for the main languages of a given script, it is not guaranteed to produce the correct sorting for any particular language. The UCA *is* designed so that it provides a default base ordering for all of Unicode, and individual languages can be given tailorings of the DUCET that handle the specifics of their string comparison requirements.
Thus if there are changes that improve the handling of the UCA for the major
languages using a given script, and do not destabilize others, those are
candidates for change in a version. For example, if it turned out that a
particular Tamil character (or sequence of characters!) was not sorted correctly
according to the DUCET (e.g. on http://www.unicode.org/charts/collation/beta/),
then it would be a candidate, and should be submitted on the form.
Understood. On this basis, the DUCET sorting for the Hebrew block should be based on the requirements for modern Hebrew, with Yiddish, Ladino etc. also being taken into account.
Second, we do and should favor modern language communities when making
incompatible tradeoffs. So if we have the choice between making French sort
correctly without tailoring, or have Latin sort correctly without tailoring, we
should choose the modern community. The Latin community can always use a
tailored UCA, in any event.
Understood. I accept the primacy of the modern language in this case. There may be some issues on which the modern language has no preference, especially for characters only used in older Hebrew, and in such cases it would make sense to follow the preferences of ancient Hebrew scholars. If it becomes necessary to use a tailored UCA for biblical work, so be it, but I would prefer not to. We have come close to having to use a separate set of vowels for biblical Hebrew simply because decisions were rushed and then frozen on the basis of modern Hebrew requirements. I don't want any danger of falling into the same kind of trap with collation.
Third, there is often a serious confusion between sorting weight and canonical
ordering. The fact that a grave accent precedes a cedilla in canonical order is
*completely independent of* whatever collation weights each of them has, either
in a tailoring or in the DUCET. The only substantive issue is how each of these
sorts separately or in combination. And making the combination (sequence) of
grave and cedilla sort before grave, after grave, before cedilla, or after
cedilla are all possible; all of those can be handled by the UCA as
contractions. See http://www.unicode.org/reports/tr10/tr10-10.html for more
information.
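As a rough illustration of the contraction point (a toy sketch only, not the real UCA algorithm and not DUCET data): the weights, the two-level key and the names make_table and sort_key below are all made up for the example. The pair of accents is given a single collation element of its own, and changing that one weight moves the combination before or after the single accents.

```python
import unicodedata

GRAVE, CEDILLA = "\u0300", "\u0327"
# The accent pair as it appears in normalized (NFD) text.
SEQ = unicodedata.normalize("NFD", GRAVE + CEDILLA)

def make_table(seq_secondary):
    """Hypothetical collation elements as (primary, secondary) weights."""
    return {
        "a": (10, 0),
        GRAVE: (0, 20),
        CEDILLA: (0, 30),
        SEQ: (0, seq_secondary),  # contraction: the pair maps to one element
    }

def sort_key(text, table):
    """Toy two-level key: all primary weights, then all secondary weights."""
    s = unicodedata.normalize("NFD", text)
    elements = []
    i = 0
    while i < len(s):
        # Try the longest match first so the contraction wins over single marks.
        for length in (2, 1):
            chunk = s[i:i + length]
            if chunk in table:
                elements.append(table[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no collation element for {s[i]!r}")
    primaries = tuple(p for p, _ in elements if p)
    secondaries = tuple(sec for _, sec in elements if sec)
    return (primaries, secondaries)

words = ["a", "a" + GRAVE, "a" + CEDILLA, "a" + GRAVE + CEDILLA]

def order(seq_secondary):
    table = make_table(seq_secondary)
    return sorted(words, key=lambda w: sort_key(w, table))
```

With these made-up weights, order(15) puts the combination before the grave, order(25) puts it between grave and cedilla, and order(35) puts it after the cedilla, whatever the canonical order of the two marks is.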
Yes, I understand that the collation weights are quite independent of the canonical combining classes. But collation does become trickier when the canonical ordering is not the expected one, because of the assumption that collation is based on the order of the string, i.e. based on the first character, then the second, etc.
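To make the "unexpected order" concrete, here is a small sketch using Python's standard unicodedata module. The constant names and the ccc_list helper are mine; the combining class values come from whatever Unicode data ships with the Python in use. A shin typed in the logical order <shin, shin dot, dagesh, patah> comes out of normalization with the vowel first and the shin dot last:

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ
PATAH = "\u05B7"     # HEBREW POINT PATAH

def ccc_list(text):
    """Canonical combining class of each character in the string."""
    return [unicodedata.combining(ch) for ch in text]

typed = SHIN + SHIN_DOT + DAGESH + PATAH
nfd = unicodedata.normalize("NFD", typed)

print(ccc_list(typed))  # classes in the order typed
print(ccc_list(nfd))    # classes after canonical reordering (ascending)
print([unicodedata.name(ch) for ch in nfd])
```

So a collation process that sees normalized text meets the vowel before the dagesh and the shin dot, which is why a simple character-by-character comparison does not give the expected result here.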
Well, I am glad that contractions provide a way around that problem. So perhaps we ought to be looking at using them for Hebrew in DUCET. I guess we should consider defining contractions for each case of <consonant, dagesh> which differ from the consonant at the second level only, perhaps also the same for rafe, and similarly for each combination of shin, shin/sin dot and dagesh. The problem is that the vowels intrude between the consonant and the dagesh, and meteg comes before shin/sin dot, so there is a potential need for a rather large number of contractions, especially if we consider a shin with a right meteg which might come out as:
<shin, dagesh, meteg, CGJ, {any one of 11 vowels}, {optional shin dot | sin dot}, masora circle>
with the CGJ inhibiting complete canonical reordering, and the shin/sin dot must be contracted with the shin.
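A quick check of that sequence with Python's unicodedata (a sketch only; patah stands in for "any one of 11 vowels", and the constant names are mine). The CGJ has combining class 0, so normalization does not move marks across it:

```python
import unicodedata

SHIN = "\u05E9"           # HEBREW LETTER SHIN
DAGESH = "\u05BC"         # HEBREW POINT DAGESH OR MAPIQ
METEG = "\u05BD"          # HEBREW POINT METEG
CGJ = "\u034F"            # COMBINING GRAPHEME JOINER
PATAH = "\u05B7"          # HEBREW POINT PATAH (example vowel)
SHIN_DOT = "\u05C1"       # HEBREW POINT SHIN DOT
MASORA_CIRCLE = "\u05AF"  # HEBREW MARK MASORA CIRCLE

with_cgj = SHIN + DAGESH + METEG + CGJ + PATAH + SHIN_DOT + MASORA_CIRCLE
without_cgj = SHIN + DAGESH + METEG + PATAH + SHIN_DOT + MASORA_CIRCLE

nfd_with = unicodedata.normalize("NFD", with_cgj)
nfd_without = unicodedata.normalize("NFD", without_cgj)

# With the CGJ, the meteg stays in front of the vowel (a "right meteg").
print(nfd_with == with_cgj)
# Without it, the vowel is reordered in front of the dagesh and meteg.
print(nfd_without == SHIN + PATAH + DAGESH + METEG + SHIN_DOT + MASORA_CIRCLE)
# Distance of the shin dot from its base letter in the CGJ form.
print(nfd_with.index(SHIN_DOT))
```

In the CGJ form the shin dot ends up five positions away from the shin, on the far side of the CGJ, which is the separation I am worried about.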
Perhaps we need to specify that dagesh and shin/sin dot must always come BEFORE any CGJ in such combinations so that they don't get separated too far from the base character. In fact I think I will change my document to specify that.
PS Is there a problem with the Unicode Hebrew list? Nothing seems to have appeared on it today, including my previous posting on this thread and Mark's reply to it.
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/

