On 21/08/2003 08:21, Mark Davis wrote:

OK, it sounds like we have clarity on two items:

- change the non-final characters to <isolated>
- move dagesh to after RAFE, SHIN DOT, SIN DOT, VARIKA

I'll talk to Ken about whether we have time to get them into UCA 4.0.0, and in
any event we can get them into ICU 2.8 for Hebrew.


These changes are certainly a move in the right direction, but only part of the way. If we can get these in quickly, that would be good. But we mustn't let things rest there.

As far as the strength issue of final vs dagesh, I don't think we should take
any immediate action. The collation strength also affects matching. If a user
sets the sorting or matching level to "ignore accents", for example, they
probably expect the dots to be ignored then, as well as graves, acutes, etc. If
this showed up in a lot of words, then it would still be worth doing, I suspect.
But because the number of cases is so very small where you would have a
combination of dageshes and finals that would make a difference, ...

Yes, the number of cases where the relative ordering of dagesh and final forms is important is vanishingly small, because final forms are nearly always predictable anyway.

Nevertheless, this is an important issue. It is important, certainly in the biblical context, that the difference between regular and final forms is ignored in a basic "ignore accents" type of search. Mati seems to agree; he wrote: "in most cases, the difference between Final vs non-Final must be ignored for searches". Compare, for example, ignoring upper and lower case differences in English. I would propose putting the final/non-final difference at the same level as that one.

... I would
recommend that SII approach this very carefully. If we are going to do anything,
it should be in the next version of UCA so that we have time to consider all of
the ramifications. I would not recommend it for ICU 2.8 either, even though we
have more time (and flexibility) there.


Indeed. The issue is a lot more complex than it seems here.

You raise one other issue in the following:



I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
so Dagesh should go after Varika.



From that, it would also appear that VARIKA should either have the same weight
as RAFE or at least be adjacent to it. This would only be an issue for users
of that character, so probably difficult to establish the right behavior, and
thus one we would not even try to get into UCA this round.

We should probably take this discussion off of [EMAIL PROTECTED], and just
have it on [EMAIL PROTECTED] and [EMAIL PROTECTED] Any people interested in
this topic should be on those groups anyway.

Agreed. But it seems, Mark, that you are not on the Hebrew list, as your posting has not reached there. So I am copying your whole posting, plus my additions, to the Hebrew list.

By the way, I am not on the bidi group because I am interested mainly in the kinds of Hebrew issues which are independent of specific bidi matters. Am I in fact missing out on important discussion of Hebrew?


Mark __________________________________ http://www.macchiato.com ► “Eppur si muove” ◄

----- Original Message ----- From: "Matitiahu Allouche" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, August 21, 2003 00:55
Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)





Hello, Mark!

In order to address your points in order, I will put excerpts of your note
within <MARK> . . . </MARK> tags, and my comments as untagged text.

<MARK>
A. Final.


1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
or absence of Dagesh is a Secondary difference, while Final/non-Final is a
Tertiary difference. This is relevant only for letters Kaf and Pe. My
gut feeling says that Final/non-Final should have precedence over
Dagesh/no-Dagesh.
Note that the number of actual cases where this would make a difference is
probably *very* small.


So there are two issues for final vs non-final: strength and ordering.

A1. Ordering is easy to change; in ICU or UCA we could put the final values
before the independent letters. In ICU they are just rules, while in UCA they
follow http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table.
The easiest in UCA would be to give the 5 independent forms that have finals
the value <isolated>.

Note: there is one minor fallout in ICU: we optimize the sortkey compression of
tertiary values of NONE; if we change the ordering then each instance of the
<isolated> letters will mean about a 2-3 byte increase in sort-key sizes.
</MARK>

I like giving the value <isolated> to the 5 independent forms that have
finals.  As for the increase in sort-key sizes, this is what cheap memory
is made for :-)
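
To make the ordering point concrete, here is a minimal Python sketch of the idea, not the actual UCA or ICU data. It assumes the two words already tie at the primary and secondary levels, so that the tertiary level decides. The numeric weights are placeholders: the only property taken from the discussion is the relative order NONE < <final> < <isolated>. The helper name tertiary_key and the letter-name spelling of the words are hypothetical.

```python
# Placeholder tertiary weights: only the relative order NONE < FINAL < ISOLATED
# matters here; the real values live in the UCA tertiary weight table.
NONE, FINAL, ISOLATED = 2, 3, 4

# The five Hebrew letters that have distinct final forms.
HAS_FINAL = {"KAF", "MEM", "NUN", "PE", "TSADI"}


def tertiary_key(word, independent_weight):
    """Tertiary weights for a word given as a list of letter names.

    `independent_weight` is the weight given to the independent (non-final)
    form of a letter that also has a final form.
    """
    key = []
    for letter in word:
        if letter.startswith("FINAL "):
            key.append(FINAL)
        elif letter in HAS_FINAL:
            key.append(independent_weight)
        else:
            key.append(NONE)
    return key


with_final = ["ZAYIN", "YOD", "FINAL PE"]
independent = ["ZAYIN", "YOD", "PE"]

# Independent forms carry NONE: the independent form sorts before the final.
print(tertiary_key(independent, NONE) < tertiary_key(with_final, NONE))          # True
# Independent forms carry ISOLATED: the final form now sorts first.
print(tertiary_key(with_final, ISOLATED) < tertiary_key(independent, ISOLATED))  # True
```

Only the five letters in HAS_FINAL change weight, which is also why the sort-key growth mentioned above is limited to words containing those letters.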

<MARK>
A2. For Strength, it is not as clear cut. If Final vs non-Final is more
important than dagesh, etc, the easiest thing is to make it a primary
difference; but that would make

Zayin Yod PeFinal

sort before all words

Zayin Yod Pe XXX

But I'm guessing that is probably not desired for Hebrew.
</MARK>

Why?  This is exactly what I desire for Hebrew.  But I am afraid that
making primary differences for Final vs non-Final will make searches using
a Final form not match a non-Final form and vice-versa, which is bad:
in most cases, the difference between Final vs non-Final must be ignored
for searches.
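
The search concern can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the primary weights are invented for illustration (not taken from UCA or ICU), and a "primary-strength search" is modelled simply as equality of primary keys. The names primary_key and matches are hypothetical.

```python
# Hypothetical primary weights, for illustration only; not UCA or ICU values.
PRIMARY = {"ZAYIN": 14, "YOD": 20, "PE": 34}


def primary_key(word, final_is_primary):
    """Primary-level key under one of the two treatments of final forms."""
    key = []
    for letter in word:
        if letter.startswith("FINAL "):
            base = PRIMARY[letter[len("FINAL "):]]
            # Primary treatment: the final form gets its own weight just before
            # the base letter. Otherwise it shares the base letter's weight and
            # differs only at a lower level.
            key.append(base - 1 if final_is_primary else base)
        else:
            key.append(PRIMARY[letter])
    return key


def matches(a, b, final_is_primary):
    """Model a primary-strength search as equality of primary keys."""
    return primary_key(a, final_is_primary) == primary_key(b, final_is_primary)


with_final = ["ZAYIN", "YOD", "FINAL PE"]
without_final = ["ZAYIN", "YOD", "PE"]

print(matches(with_final, without_final, final_is_primary=True))   # False
print(matches(with_final, without_final, final_is_primary=False))  # True
```

With a primary difference the two spellings no longer match at the loosest search level, which is the drawback described above; keeping the difference below the primary level preserves the match.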

<MARK>
In ICU we could make Final vs non-Final be a secondary difference, and have
Dagesh, etc. be tertiary differences. The disadvantage is that people tend to
expect the 2nd level to be 'accent-like', and there might be more
inconsistencies in practice than you would gain by having the current
situation.
</MARK>

I don't think that enough experience has accumulated yet to create
expectations among users.  If this is the right thing (and I think it is), it
is still early enough to do it now.

<MARK>
In Unicode, the UCA has more production restrictions as per
http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table, so it
would be a bit harder to make that change.

So if SII would like this change, I'd recommend that we make the ordering
change in UCA (which will then affect ICU), but not make a strength change (it
would have to be extremely exotic for that to make a difference).
</MARK>

Personally, I would go for the strength change, but I understand the
adverse considerations.  I will have to take the matter to SII.

<MARK>
B. Dagesh


2) There is something strange in the combinations of Shin with Dagesh and
dots: for all other letters, the form without Dagesh sorts before the form
with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
combinations with Dagesh. I cannot imagine a justification for that.


We have currently in UCA the following (from UCA 4.0.0d1 (beta))
05B0  ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
05B1  ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
05B2  ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
05B3  ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
05B4  ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
05B5  ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
05B6  ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
05B7  ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
05B8  ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
05B9  ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
05BB  ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
05BC  ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BF  ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05C1  ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2  ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
FB1E  ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA

To make this change, we would move Dagesh to after SIN DOT. Question: should it
also go after VARIKA or not?
</MARK>

I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
so Dagesh should go after Varika.
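
As a rough illustration of what the move would do, here is a minimal Python sketch. The "current" secondary weights are copied from the UCA 4.0.0d1 (beta) lines quoted above. The "proposed" weights are an assumption: they simply reuse the same five values with Dagesh placed last, and the values in the final table may well be different. The helpers move_to_end and secondary_key are hypothetical, and only the marks on a single Shin are compared.

```python
# Secondary weights copied from the UCA 4.0.0d1 (beta) excerpt quoted above.
CURRENT = {
    "DAGESH": 0x00BD,
    "RAFE": 0x00C0,
    "SHIN DOT": 0x00C1,
    "SIN DOT": 0x00C2,
    "VARIKA": 0x00C3,
}


def move_to_end(weights, name):
    """Return a new table with `name` ordered last, reusing the same values.

    Assumption for illustration: the actual reassigned values may differ.
    """
    order = [n for n in sorted(weights, key=weights.get) if n != name] + [name]
    return dict(zip(order, sorted(weights.values())))


PROPOSED = move_to_end(CURRENT, "DAGESH")


def secondary_key(marks, weights):
    """Secondary weights of the marks on one base letter, in the order given."""
    return [weights[m] for m in marks]


shin_plain = ["SHIN DOT"]             # shin + shin dot
shin_dagesh = ["DAGESH", "SHIN DOT"]  # shin + dagesh + shin dot

for label, table in (("current", CURRENT), ("proposed", PROPOSED)):
    plain_first = secondary_key(shin_plain, table) < secondary_key(shin_dagesh, table)
    print(label, "-> form without dagesh sorts first:", plain_first)
```

Under the current weights the form with Dagesh sorts first, which is the anomaly noted for Shin; with Dagesh ordered after the other four marks, the form without Dagesh sorts first, as it does for the other letters.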


Shalom (Regards),
Mati
Bidi Architect
Globalization Center Of Competency - Bidirectional Scripts
IBM Israel
Phone: +972 2 5888802    Fax: +972 2 5870333    Mobile: +972 52 554160














--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




