On Sat, 31 May 2014 21:27:55 +0200 Mark Davis ☕️ <[email protected]> wrote:

> The structure of the data is based on the use of NFKC characters in
> identifiers. So SARA AM and the Lao equivalent are both not NFKC
> characters, and are categorized as such, and would need to be
> represented by their NFKC forms. The process is in
> http://www.unicode.org/reports/tr39/proposed.html#IDMOD_Data_Collection

There's no absolute IETF prohibition on NFKC characters.

> > Now, U+0E4D THAI
> > CHARACTER NIKHAHIT is classified as 'allowed; recommended', although
> > its main use is in writing Pali, which would suggest that it should
> > be 'restricted; historic' or 'restricted; limited-use'.

> For that, it would be best to submit via
> http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a
> feedback form at http://www.unicode.org/reporting.html, just to be
> sure.

I have no desire to restrict NIKHAHIT simply because of limited use.
The problem is simply the confusion caused by the existence of SARA AM.

Unicode support for the compatibility decomposition of SARA AM is
incomplete, in part irremediably so. The problem is that <tonemark,
SARA AM> has a different appearance to <tonemark, NIKHAHIT, SARA AA>.
In the former, the tone mark is the topmost glyph; in the latter, the
nikkhahit is the topmost glyph. <tonemark, SARA AM> usually has the
same appearance as <NIKHAHIT, tonemark, SARA AA>, which is what
Uniscribe effectively converts it to.

There used to be filters in place to stop <NIKHAHIT, SARA AA> being
typed. It's not unknown for <tonemark, SARA AM> to be mistyped as
<NIKHAHIT, tonemark, SARA AA>, and that too used to be blocked. DUCET
has a contraction for <NIKHAHIT, SARA AA> to reduce the ill-effects,
but of course the contraction doesn't work for the sequence <NIKHAHIT,
tonemark, SARA AA>. (Action on me: CLDR ticket on omission for th
locale.)

In short, the co-existence of NIKHAHIT with ccc=0 and SARA AM causes
problems. The simplest solution is to restrict NIKHAHIT, which should
be tolerable.
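
The ordering problem described above can be reproduced with a short
sketch. It assumes Python's standard unicodedata module and takes
MAI EK as a representative tone mark (an arbitrary choice for
illustration, not one made in this thread); the variable names are
hypothetical.

```python
import unicodedata

MAI_EK = "\u0E48"    # THAI CHARACTER MAI EK, standing in for "tonemark"
NIKHAHIT = "\u0E4D"  # THAI CHARACTER NIKHAHIT
SARA_AA = "\u0E32"   # THAI CHARACTER SARA AA
SARA_AM = "\u0E33"   # THAI CHARACTER SARA AM


def names(text):
    """List the Unicode character names in text, in order."""
    return [unicodedata.name(ch) for ch in text]


# <tonemark, SARA AM> as normally typed.
typed = MAI_EK + SARA_AM

# NFKC applies the compatibility decomposition of SARA AM in place,
# leaving the tone mark in front of NIKHAHIT.
nfkc = unicodedata.normalize("NFKC", typed)

# The order that looks like the typed text: NIKHAHIT below the tone mark.
rendered_like = NIKHAHIT + MAI_EK + SARA_AA

print(names(nfkc))
print(names(rendered_like))

# NIKHAHIT has canonical combining class 0, so normalization never
# reorders it relative to the tone mark; the two sequences stay distinct.
print(unicodedata.combining(NIKHAHIT), unicodedata.combining(MAI_EK))
```

Because neither sequence is reordered by normalization, NFKC alone
cannot turn one into the other.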
Ideally, one would merely prohibit the sequence
\p{Mn}*\u0E4D\p{Mn}*\u0E32.

There is no virtue in making both NIKHAHIT and SARA AM 'restricted'.
Indeed, one could argue that applying the compatibility decomposition
to SARA AM brings NIKHAHIT into 'high frequency modern use' - it
depends on the frequency of NFKC and NFKD conversions. However, the
compatibility decomposition of SARA AM is simply *wrong* as Thai text.

It would be good to hear from someone at Thailand's National
Electronics and Computer Technology Center (NECTEC) on the matter of
SARA AM in domain names.

The sequence-prohibiting solution ought to extend to Lao, but there may
be the additional problem of the tone mark being applied to the SARA
AM. The m17n Lao keyboard on my computer actually comes with a single
keystroke for the sequence <SARA AM, MAI THO>! (Action on me: File a
bug report against the keyboard.)

Richard.
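
A minimal sketch of a check for the sequence
\p{Mn}*\u0E4D\p{Mn}*\u0E32, for illustration only. Python's standard
re module has no \p{Mn}, so the sketch tests the general category with
unicodedata instead. The function name is hypothetical, and the check
covers Thai only, not the Lao equivalents. The leading \p{Mn}* is
omitted because it does not affect whether a match exists.

```python
import unicodedata

NIKHAHIT = "\u0E4D"  # THAI CHARACTER NIKHAHIT
SARA_AA = "\u0E32"   # THAI CHARACTER SARA AA


def has_nikhahit_sara_aa(text):
    """Return True if text contains NIKHAHIT, then zero or more
    nonspacing marks (gc=Mn), then SARA AA."""
    for i, ch in enumerate(text):
        if ch != NIKHAHIT:
            continue
        # Skip any nonspacing marks (e.g. tone marks) after NIKHAHIT.
        j = i + 1
        while j < len(text) and unicodedata.category(text[j]) == "Mn":
            j += 1
        if j < len(text) and text[j] == SARA_AA:
            return True
    return False


KO_KAI = "\u0E01"  # a base consonant, chosen arbitrarily
MAI_EK = "\u0E48"  # a tone mark, chosen arbitrarily
SARA_AM = "\u0E33"

# <consonant, tonemark, SARA AM> is untouched by the check ...
print(has_nikhahit_sara_aa(KO_KAI + MAI_EK + SARA_AM))
# ... but the spelled-out forms are caught.
print(has_nikhahit_sara_aa(KO_KAI + NIKHAHIT + SARA_AA))
print(has_nikhahit_sara_aa(KO_KAI + NIKHAHIT + MAI_EK + SARA_AA))
```

Note that the NFKC form of <consonant, tonemark, SARA AM> also
contains the sequence, so such a check would have to run before any
compatibility normalization, not after it.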

