On Tue, 10 Jan 2017 13:12:47 -0800 Asmus Freytag <asm...@ix.netcom.com> wrote:
> Unicode clearly doesn't forbid most sequences in complex scripts,
> even if they cannot be expected to render properly and otherwise
> would stump the native reader.

Is this expectation based on sequence enforcement in the renderer? The main problem with getting text to render reasonably (not necessarily as desired) is now anti-phishing. The Unicode standard does define what short sequences of characters mean. The problem is that, outside the Apple world, it then seems to be left to Microsoft to decide what longer sequences they will allow.

> The advantage of the text I brought to your attention is the way it
> is formalized and that it was created with local expertise. The
> disadvantage from your perspective is that the scope does not match
> with your intended use case.

Perhaps ICANN will be the industry-wide definer.

However, to stay with Indic rendering, one may have cases where CVC and CCV orthographic syllables have little to no visible difference. The Khmer writing system once made much greater use of CVC syllables. For reproducing older texts, one might be forced to encode phonetic CVC as though it were CCV. This is already the case, through error rather than design, with the Thai script in Tai Tham. This affects about 30% of the Northern Thai lexicon*, and I believe an even higher proportion when adjusted for word frequency.

Now, to fight phishing, I have always believed that some brutal folding would be required for Tai Tham, which is why I suggested that the S.SA ligature be encoded (U+1A54 TAI THAM LETTER GREAT SA).

*I've sampled the MFL dictionary. I suspect a bias to untruncated forms in loans from Pali, such as _kathina_ rather than _kathin_. If my suspicion is correct, the proportion would be even higher.

However, I believe there is some advantage in distinguishing CVC and CCV at the code level, even where there is no visual difference.
To display small visual differences, perhaps we will be forced to beg for mark-up to make the distinction visible. In Tai Tham, there are very few CCV-CVC visual homographs in native words because of the phonological structure of Northern Thai, and one can usually guess whether the word is CCV or CVC.

Richard.