On 09/07/2004 17:06, Mark Davis wrote:

I agree with Michael -- diacritic folding is a useful folding to add,
independent of the UCA.

Also, Peter's remark that: "And it is already covered by the Unicode
collation algorithm and default table..." is incorrect. ...


Well, I think this depends on whether the stroke in characters like U+00D8 and similar additional marks are considered to be diacritics. I am not sure that they are diacritics in the strict sense, and the current DUCET mappings don't treat them as such, but John Cowan's list does treat them as such.


... The UCA generally
follows our decompositions in determining many primary weights, and we do
not decompose characters like U+00D8 LATIN CAPITAL LETTER O WITH STROKE. [I
have felt from the beginning that it was a mistake to not be consistent in
our decompositions -- but that is water under the bridge.] If you look at
John's suggested file for diacritic
folding (http://www.ccil.org/~cowan/DiacriticFolding.txt), ...


I have just reviewed this list and found it odd that Hebrew presentation forms are included but Arabic ones are not. But surely not only the Hebrew presentation forms but also most of the precomposed characters in this list are redundant, because the basic folding algorithm (in http://www.unicode.org/reports/tr30/) is:


a. Apply optional folding operations
b. Apply canonical decomposition
c. Repeat (a) and (b) until stable
d. Apply composition if necessary
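
The four steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the "optional folding" is taken here to be deletion of nonspacing marks (general category Mn) plus a tiny hypothetical table, EXPLICIT, for letters such as U+00D8 that have no canonical decomposition. Neither choice is taken from TR30 itself.

```python
import unicodedata

# Hypothetical explicit entries for letters with no canonical decomposition.
EXPLICIT = {"\u00D8": "O", "\u00F8": "o"}

def optional_fold(s):
    """Step (a), as assumed here: drop nonspacing marks, apply EXPLICIT."""
    out = []
    for ch in s:
        if unicodedata.category(ch) == "Mn":
            continue  # a combining mark: delete it
        out.append(EXPLICIT.get(ch, ch))
    return "".join(out)

def diacritic_fold(s):
    # Steps (a)-(c): fold, decompose canonically, repeat until stable.
    while True:
        folded = unicodedata.normalize("NFD", optional_fold(s))
        if folded == s:
            break
        s = folded
    # Step (d): recompose whatever is left.
    return unicodedata.normalize("NFC", s)
```

With these assumptions, a precomposed letter needs no table entry of its own: the first pass decomposes it and the second pass deletes the mark.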


Step (b) will decompose not only presentation forms but also all precomposed characters with canonical decompositions, and the combining marks will be deleted by the repeat of step (a). The specification of the folding therefore needs to list only the combining marks (all of them?), which are to be deleted, and those precomposed characters which do *not* have canonical decompositions. Letters like O with stroke are presumably in this latter list, along with many of the listed Cyrillic characters.
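
Whether a given character falls into the "no canonical decomposition" group can be checked mechanically. The sketch below uses Python's unicodedata module; the helper name has_canonical_decomposition is made up for this illustration.

```python
import unicodedata

def has_canonical_decomposition(ch):
    """True if ch has a canonical (not compatibility) decomposition."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<font>" or "<isolated>";
    # canonical mappings are a bare list of code points.
    return bool(mapping) and not mapping.startswith("<")

def needs_explicit_entry(ch):
    """Characters that step (b) cannot take apart need their own entry."""
    return not has_canonical_decomposition(ch)
```

For example, needs_explicit_entry("\u00D8") is true for O with stroke, while for U+0419, which decomposes to 0418 plus breve, it is false.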

But I would suggest some caution about listing for diacritic folding some of the Cyrillic characters below, especially those with descenders. I note that 0429 is not folded to 0428, etc., and this is correct because within the Cyrillic writing system these are entirely separate characters. But the difference between these two is in fact exactly the same descender which is removed in 0496, etc.

I am also surprised to note that no folding is given for 0419/0439. In some ways this is desirable, because Russians do not consider this breve to be a diacritic (and after all we would not want the dot on i to be removed as a diacritic!). But these characters have canonical decompositions to 0418/0438 plus breve, and the principle of canonical equivalence and the folding algorithm (which works on decomposed characters) more or less demand that the breve be deleted. In that case 048A/048B should also fold to 0418/0438 rather than 0419/0439.
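
The point about 048A/048B can be traced with a small Python sketch. It is a simplification, not the TR30 procedure verbatim: each pass applies a two-entry table (the 048A and 048B entries from the list), decomposes, and strips characters with a non-zero combining class. The names TABLE, fold_once and fold are invented for the example.

```python
import unicodedata

# The two SHORT I WITH TAIL entries as they appear in the list.
TABLE = {"\u048A": "\u0419", "\u048B": "\u0439"}

def fold_once(s):
    mapped = "".join(TABLE.get(ch, ch) for ch in s)
    decomposed = unicodedata.normalize("NFD", mapped)
    # Delete combining marks (non-zero canonical combining class).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fold(s):
    prev = None
    while s != prev:  # repeat until stable
        prev, s = s, fold_once(s)
    return s
```

Under these assumptions fold("\u048A") ends at 0418, not at the listed target 0419, because 0419 itself decomposes to 0418 plus breve. By contrast, 0429 is left alone, since it has no decomposition and no table entry.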

...
04D0; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH BREVE
04D2; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH DIAERESIS
0490; 0413; !nfd+remove_marks; #CYRILLIC CAPITAL LETTER GHE WITH UPTURN
0492; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH STROKE
0494; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
04D6; 0415; ; !uca #CYRILLIC CAPITAL LETTER IE WITH BREVE
0496; 0416; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
04DC; 0416; ; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
0498; 0417; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZE WITH DESCENDER
04DE; 0417; ; !uca #CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
04E4; 0418; ; !uca #CYRILLIC CAPITAL LETTER I WITH DIAERESIS
048A; 0419; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER SHORT I WITH TAIL
049A; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH DESCENDER
049C; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH VERTICAL STROKE
049E; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH STROKE
04C3; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH HOOK
04C5; 041B; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EL WITH TAIL
04CD; 041C; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EM WITH TAIL
04A2; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH DESCENDER
04C7; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH HOOK
04C9; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH TAIL
04E6; 041E; ; !uca #CYRILLIC CAPITAL LETTER O WITH DIAERESIS
04A6; 041F; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
048E; 0420; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ER WITH TICK
04AA; 0421; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ES WITH DESCENDER
04AC; 0422; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER TE WITH DESCENDER
04F0; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DIAERESIS
04F2; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
04B2; 0425; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER HA WITH DESCENDER
04B3; 0425; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER HA WITH DESCENDER
04F4; 0427; ; !uca #CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
04F8; 042B; ; !uca #CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
04EC; 042D; ; !uca #CYRILLIC CAPITAL LETTER E WITH DIAERESIS
04D1; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH BREVE
04D3; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH DIAERESIS
0491; 0433; !nfd+remove_marks; #CYRILLIC SMALL LETTER GHE WITH UPTURN
0493; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH STROKE
0495; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
04D7; 0435; ; !uca #CYRILLIC SMALL LETTER IE WITH BREVE
0497; 0436; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZHE WITH DESCENDER
04DD; 0436; ; !uca #CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
0499; 0437; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZE WITH DESCENDER
04DF; 0437; ; !uca #CYRILLIC SMALL LETTER ZE WITH DIAERESIS
04E5; 0438; ; !uca #CYRILLIC SMALL LETTER I WITH DIAERESIS
048B; 0439; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER SHORT I WITH TAIL
049B; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH DESCENDER
049D; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE
049F; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH STROKE
04C4; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH HOOK
04C6; 043B; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EL WITH TAIL
04CE; 043C; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EM WITH TAIL
04A3; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH DESCENDER
04C8; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH HOOK
04CA; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH TAIL
04E7; 043E; ; !uca #CYRILLIC SMALL LETTER O WITH DIAERESIS
04A7; 043F; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
048F; 0440; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ER WITH TICK
04AB; 0441; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ES WITH DESCENDER
04AD; 0442; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER TE WITH DESCENDER
04F1; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DIAERESIS
04F3; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
04B9; 0447; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE
04F5; 0447; ; !uca #CYRILLIC SMALL LETTER CHE WITH DIAERESIS
04F9; 044B; ; !uca #CYRILLIC SMALL LETTER YERU WITH DIAERESIS
04ED; 044D; ; !uca #CYRILLIC SMALL LETTER E WITH DIAERESIS
047C; 0460; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
047D; 0461; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER OMEGA WITH TITLO
0476; 0474; ; !uca #CYRILLIC CAPITAL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
0477; 0475; ; !uca #CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
04B0; 04AE; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
04B1; 04AF; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
04B6; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH DESCENDER
04B7; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH DESCENDER
04B8; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH VERTICAL STROKE
04BE; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
04BF; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
04CB; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
04CC; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KHAKASSIAN CHE
04DA; 04D8; ; !uca #CYRILLIC CAPITAL LETTER SCHWA WITH DIAERESIS
04DB; 04D9; ; !uca #CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
04EA; 04E8; ; !uca #CYRILLIC CAPITAL LETTER BARRED O WITH DIAERESIS
04EB; 04E9; ; !uca #CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS



-- Peter Kirk http://www.qaya.org/



