On Wed, 5 Apr 2017 10:00:25 +0700 "Gerriet M. Denkmann" <[email protected]> wrote:
> Any two strings which look (almost?) identical should be normalised
> into some canonical form. Reason: not to have identical looking
> filenames in a filesystem. With the current rules of normalisation
> there could be 8 different filenames all looking identical to
> “กินครึ่งทิ้งครึ่ง”.

> E.g. :
> - both NIKHAHIT + Sara Aa and Sara Am should be normalised into the
> same string (whatever this is)

I think the answer to this is for renderers to insert a dotted circle
in the former. I hope no-one is going to argue that NIKHAHIT + SARA AA
is appropriate for Sanskrit. NFKC is not the answer; NFKC(น้ำ) = น้ํา.

> - both top-vowel + tone-mark and tone-mark + top-vowel should be
> normalised into the same string (whatever this is). etc.

TUS declares that กิ่ (vowel then tone mark) and ก่ิ (tone mark then
vowel) should render differently. Unfortunately, there is a tendency
for mark-to-mark positioning, if employed at all, to be restricted to
combinations that actually occur in correctly spelt Thai. A
particularly nasty example is that doubled vowels above can be
indistinguishable from single vowels above. I got an angry response
when I suggested that mark-to-mark positioning should be used for all
combinations of marks above - allegedly it makes the GPOS tables 'too
big'.

There's also the very high confusability of <SARA I, NIKHAHIT> and
<SARA UE>. Traditionally, SARA UE is SARA I plus NIKHAHIT, and I
suspect this is the origin of the etymologically odd form of ลึงค์
'lingam'.

> If, as Richard Wordingham wrote: "Unicode combining classes cannot be
> changed. All that can be done is to enforce the order of characters
> in normalised text.” then the Unicode Normalisation algorithms should
> be updated.

I think it will be a long time before canonical equivalence is replaced
by canonical equivalence Version 2, but we may not have to wait many
centuries. In the meantime, you will have to work with your own
folding.

Richard.
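
To make the NFKC point and the idea of a private folding concrete, here
is a minimal Python sketch. The function name fold_thai and its rules
are assumptions made up for illustration; nothing like it is defined by
Unicode. It decomposes SARA AM into NIKHAHIT + SARA AA and sorts each
run of Thai nonspacing marks by code point, so it is suitable only for
building comparison keys (e.g. to detect look-alike filenames), not for
text that will be stored or displayed. It does not attempt to handle
<SARA I, NIKHAHIT> versus <SARA UE>, nor doubled vowels.

```python
import re
import unicodedata

SARA_AA = "\u0e32"   # THAI CHARACTER SARA AA
SARA_AM = "\u0e33"   # THAI CHARACTER SARA AM
NIKHAHIT = "\u0e4d"  # THAI CHARACTER NIKHAHIT

# Runs of two or more Thai nonspacing marks:
# U+0E31, U+0E34..U+0E3A, U+0E47..U+0E4E.
_THAI_MARK_RUN = re.compile("[\u0e31\u0e34-\u0e3a\u0e47-\u0e4e]{2,}")


def fold_thai(text):
    """Return a comparison key for Thai text (hypothetical folding)."""
    # Start from canonical normalisation, which only reorders marks
    # that have non-zero combining classes.
    text = unicodedata.normalize("NFC", text)
    # Treat SARA AM as NIKHAHIT + SARA AA, as its compatibility
    # decomposition does.
    text = text.replace(SARA_AM, NIKHAHIT + SARA_AA)
    # Impose one arbitrary but fixed order on each run of marks.
    return _THAI_MARK_RUN.sub(lambda m: "".join(sorted(m.group())), text)


# NFKC leaves the tone mark before NIKHAHIT: <NO NU, MAI THO, NIKHAHIT, SARA AA>.
nfkc_nam = unicodedata.normalize("NFKC", "\u0e19\u0e49\u0e33")

# Vowel-then-tone and tone-then-vowel stay distinct under NFC ...
vowel_tone = "\u0e01\u0e34\u0e48"   # KO KAI, SARA I, MAI EK
tone_vowel = "\u0e01\u0e48\u0e34"   # KO KAI, MAI EK, SARA I
distinct_under_nfc = (
    unicodedata.normalize("NFC", vowel_tone)
    != unicodedata.normalize("NFC", tone_vowel)
)
# ... but share one key under the private folding.
same_under_fold = fold_thai(vowel_tone) == fold_thai(tone_vowel)
```

With this sketch, the three spellings <MAI THO, SARA AM>, <NIKHAHIT,
MAI THO, SARA AA> and <MAI THO, NIKHAHIT, SARA AA> after NO NU all fold
to the same key. Sorting by code point is merely the simplest fixed
order; any other consistent order would serve equally well for
comparison.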

