When assembling a rendering system for a script with combining marks, is there a guide to deciding which strings one should exclude and which one should strive to support?
There will also be characters outside the script that should be supported. For a font, there are lists of characters for Microsoft Word and for the Universal Shaping Engine, and it is frequently desirable for a font to be able to display its own name. There are also various control and formatting characters, and punctuation characters from outside the script.

I believe compromises are necessary. There are issues with stacking combining marks: at what point does one throw oneself on the mercy of the application? Making characters small enough to accommodate a stack of 20 within the nominal line separation is usually not acceptable! (There are Sanskrit manuscripts where a stack extends across several lines.) There are also problems if glyphs cannot simply be stacked. It is not unknown for a 'subscript' glyph to obligatorily have a part on the baseline, and a preposed 'subscript' RA can require different glyphs depending on how deeply it is stacked.

If canonical equivalence does not eliminate homographs, there is the question of which homographs to tolerate. I have hit this issue with Tai Tham. The essence of the problem is that a CVCV word with identical consonants can be abbreviated to CVV, as in some other scripts, and dependent vowels can be written using several vowel symbols. All vowels have ccc=0. Now, the accepted proposal (i.e. the one accepted by the UTC for the ISO process) gave an order for the vowel symbols in such polygraphs, and most combinations resulting from such contraction comply with this order.

The existence of such a contraction can be indicated in writing by the (ambiguous) mark MAI SAM, and in such cases the proposed encoding of Tai Tham text is of the form CVxV, where 'x' is MAI SAM. In such cases I allow the constraint on vowel order to apply separately to the vowel on each side of MAI SAM. This allows homographs, but I take the view that I am rejecting homographs to facilitate searching, not to prevent spoofing.
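As a rough illustration of the per-segment check just described, here is a minimal Python sketch. It splits the marks that follow a consonant at each MAI SAM and tests each run of vowel symbols against an order table. The function names and the table `TOY_RANK` are hypothetical, and the ranks in it are placeholders chosen for the example, not the order given in the accepted proposal. Treating a repeated vowel symbol within one run as out of order is also an assumption.

```python
MAI_SAM = "\u1A7B"  # TAI THAM SIGN MAI SAM

# Placeholder ranks for illustration only; this is NOT the vowel order
# from the accepted proposal, which would have to be filled in.
TOY_RANK = {
    "\u1A6E": 0,  # TAI THAM VOWEL SIGN E
    "\u1A69": 1,  # TAI THAM VOWEL SIGN U
    "\u1A65": 2,  # TAI THAM VOWEL SIGN I
    "\u1A63": 3,  # TAI THAM VOWEL SIGN AA
}


def vowel_runs(marks, separator=MAI_SAM):
    """Split the marks after a consonant into runs at each separator."""
    runs = [[]]
    for ch in marks:
        if ch == separator:
            runs.append([])  # a new phonetic syllable's vowels start here
        else:
            runs[-1].append(ch)
    return runs


def run_in_order(run, rank):
    """True if the ranked symbols in one run appear in strictly rising rank.

    Characters absent from the table (tone marks, say) are ignored.
    """
    ranks = [rank[ch] for ch in run if ch in rank]
    return all(a < b for a, b in zip(ranks, ranks[1:]))


def accept(marks, rank=TOY_RANK, separator=MAI_SAM):
    """Apply the order constraint to each run separately."""
    return all(run_in_order(run, rank) for run in vowel_runs(marks, separator))
```

With the placeholder table, `accept("\u1A63\u1A6E")` fails because the two symbols are out of order within a single run, whereas `accept("\u1A63" + MAI_SAM + "\u1A6E")` passes because each run is checked on its own.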
The prevention of spoofing would use stricter rules, which would ban some words, just as the English word "café" is prohibited in British domain names. (The doublet "cafe" refers to a lower class of establishment in British English.)

However, the mark MAI SAM is not always used. Now, if Tai Tham vowels had non-zero combining classes, I would separate the vowels of the two phonetic syllables with the general disruptor, CGJ (COMBINING GRAPHEME JOINER), to facilitate sorting. At the very least the word should then be sorted with other words starting with the same CV, and with preprocessing the CGJ could be replaced by the omitted consonant. As it is, Tai Tham vowels have ccc=0, but I favour retaining the CGJ to mark the location of the repeated consonant. This CGJ also enables me to make some check as to whether the vowel symbols of the individual phonetic syllables are in the correct order.

So:

(a) If the vowel symbols in CVV are in the permitted order, the string is accepted.

(b) If the word is typed as CV<CGJ>V and the vowels on either side of CGJ are in the correct order, the string is accepted.

(c) If the word is typed as CVV and the vowel symbols are not in the permitted order, and I can detect this, I allow the implementation of the Universal Shaping Engine (be it Microsoft, AAT or HarfBuzz) to insert its dotted circles. More precisely, I don't remove them.

Is this a reasonable approach to both allowing collation and suppressing needless homographs? My contribution to the rendering is only the provision of a font.

Richard.