On Sat, 17 May 2014 11:56:35 +0100 Richard Wordingham <[email protected]> wrote:
> I've reviewed the application of the revised categories as set forth
> in L2/14-126
> (http://www.unicode.org/L2/L2014/14126r-indic-properties.pdf) as
> applied to the Thai, Lao and Tai Tham scripts, and noted a few other
> characters, and come up with the following proposed changes of
> syllabic category.

I've just submitted a slightly different set of changes via the Unicode report function. They were updated to take into account other proposed changes and also Microsoft's new 'Universal Shaping Engine'. The submitted comment follows.

Richard.

I've reviewed the application of the revised categories as set forth in L2/14-126 (http://www.unicode.org/L2/L2014/14126r-indic-properties.pdf) as applied to the Thai, Lao and Tai Tham scripts, and noted a few other characters, and come up with the following proposed changes of syllabic category. I have also taken into account the proposals of Roozbeh Pournader of 24 February 2015 related to work on the Universal Shaping Engine.

I've come up with 3 new characters of category Bindu:

0303 ; Bindu # Mn COMBINING TILDE
0310 ; Bindu # Mn COMBINING CANDRABINDU
1A74 ; Bindu # Mn TAI THAM SIGN MAI KANG (currently Vowel_Dependent)

Note that both U+0ECD LAO NIGGAHITA and U+1A74 function both as Bindu and as Vowel_Dependent. U+0303 is used in Patani Malay in the Thai script - see UTC document L2/10-451. U+0310 is used for Sanskrit in Tamil script, according to Indic list email 'Re: Tamil Punctuation', 27/7/12 9:24 +0530 from Shriramana Sharma.

I've found 4 new characters of category Visarga:

0E30 ; Visarga # Lo THAI CHARACTER SARA A
0EB0 ; Visarga # Lo LAO VOWEL SIGN A
1A61 ; Visarga # Mc TAI THAM VOWEL SIGN A
19B0 ; Visarga # Mc (to be Lo) NEW TAI LUE VOWEL SIGN VOWEL SHORTENER

Note that the tone (or voice modulation) character U+1038 MYANMAR SIGN VISARGA is currently classified as Visarga. U+0E30 is used as visarga in Sanskrit, e.g. in the Royal Institute Dictionary.
The typical sound of the four visargas above is /ʔ/ rather than /h/, and, through a feature of Tai (SW Tai?) phonology, they all have the additional function of shortening a vowel. As a vowel shortener, U+1A61 and U+19B0 may follow a final consonant. These 4 characters are currently classified as Vowel_Dependent. Except for the Lao script, that usage can easily be interpreted as a modification of the implicit vowel. Modern Lao does not acknowledge the existence of an implicit vowel, so that interpretation may be harder to accept. (Vowel_Dependent U+0EB1 LAO VOWEL SIGN MAI KAN is also a vowel shortener; in the 19th century it was denied that Vowel_Dependent U+0E31 THAI CHARACTER MAI HAN-AKAT was a vowel in Thai.)

U+1A61 occasionally has the sound /k/, especially when used in conjunction with U+1A62 TAI THAM VOWEL SIGN MAI SAT. I think we should regard this as just one of the uses of visarga.

I've found 3 new nuktas, at least, so long as the application of nukta is not restricted to *foreign* consonants.

0331 ; Nukta # Mn COMBINING MACRON BELOW
0359 ; Nukta # Mn COMBINING ASTERISK BELOW
1A7F ; Nukta # Mn TAI THAM COMBINING CRYPTOGRAMMIC DOT

U+0331 is used in Patani Malay in the Thai script - see L2/10-451 and the consonant chart on p16 of http://mlenetwork.org/sites/default/files/Patani%20Malay%20Presentation%20-%20Part%202.pdf. U+0331 and U+0359 have been used in English-Thai dictionaries to represent English sounds, very much a nukta role. They were previously classified as 'Other', though there is a proposal to make U+1A7F 'Syllable_Modifier'.

U+0EC8 LAO TONE MAI EK functions as Nukta in Khmu as well as performing its principal rôle of Tone_Mark in Lao.

U+0E3A THAI CHARACTER PHINTHU is used both as Nukta and as Pure_Killer; the latter is its traditional rôle.

I've found 4 new pure killers, all currently classified as 'Other', though there is a proposal to classify U+0E4C (along with U+17CD) as 'Consonant_Killer'.
They are:

0E4C ; Pure_Killer # Mn THAI CHARACTER THANTHAKHAT
0ECC ; Pure_Killer # Mn LAO CANCELLATION MARK
1A7C ; Pure_Killer # Mn TAI THAM SIGN KHUEN-LUE KARAN
1A7A ; Pure_Killer # Mn TAI THAM SIGN RA HAAM

U+0E4C THAI CHARACTER THANTHAKHAT and U+0E4E THAI CHARACTER YAMAKKAN once divided the role of vowel killing - U+0E4E formed clusters and U+0E4C removed final vowels. The use of U+0E4C came to be largely restricted to vowels associated with clusters of consonants. Removing the vowel made the final consonant of the cluster silent (spoken Thai does not permit final consonant clusters), and from this effect it has been reinterpreted as a consonant-killer.

U+0ECC probably had the same behaviour as U+0E4C. I don't know if it is still used in Laos - foreign loanwords often don't follow the rules.

The Tai Tham marks are still at the transitional stage - they are sometimes found on final unsubscripted consonants to indicate that they have no vowel. There is an unfortunate overlap with the final consonant mark for <r> (pronunciation necessarily /n/). The Khuen and Lue form of the final consonant symbol has the same shape as the Thai and Lao form of the pure killer. Consequently U+1A7A serves as Consonant_Final in Tai Khuen and Tai Lue. In Tai Khuen, at least, the use as a final consonant seems to have recently fallen into disfavour, so it seems most appropriate to classify U+1A7A as 'Pure_Killer'.

I noted above that the 'Pure_Killer' U+0E3A THAI CHARACTER PHINTHU also serves as a nukta. I have a vague recollection that U+0E4C THAI CHARACTER THANTHAKHAT serves as a register mark in an orthography for the Chong language, so that would count as an auxiliary rôle as Tone_Mark.

If 'Consonant_Killer' is to be separated from 'Pure_Killer', then we need a separate category 'Dual_Mode_Killer' for U+1A7A and U+1A7C.

It should be noted that U+1A62 TAI THAM VOWEL SIGN MAI SAT serves not only as Vowel_Dependent but also as Consonant_Final.
This seems to be chiefly relevant to anyone attempting to deduce the pronunciation from the spelling.

There are 4 characters currently categorised as 'Consonant' which I think are better categorised as 'Vowel':

0E24 ; Vowel # Lo THAI CHARACTER RU
0E26 ; Vowel # Lo THAI CHARACTER LU
1A42 ; Vowel # Lo TAI THAM LETTER RUE
1A44 ; Vowel # Lo TAI THAM LETTER LUE

They serve both as independent and dependent vowels. Note that U+0E24 and U+0E26 may be followed by the length mark U+0E45 THAI CHARACTER LAKKHANGYAO, which is categorised as 'Vowel_Dependent'. I am not aware of any usage of U+0E45 as a true vowel.

The sequence <U+1AAD TAI THAM SIGN CAANG, U+1A63 TAI THAM VOWEL SIGN AA> occurs with the same meaning, 'elephant', as U+1AAD. I don't know whether this justifies changing U+1AAD from 'Other' to 'Consonant_Placeholder'.

I've found two new Consonants:

0EBD ; Consonant # Lo LAO SEMIVOWEL SIGN NYO (was Consonant_Medial)
0EDE ; Consonant # Lo LAO LETTER KHMU GO (was Other)

U+0EBD is used as an initial consonant in Khmu, so U+0EBD has been used in all rôles in the Lao script, like U+0EA7 LAO LETTER WO, which is of category Consonant. For information on Khmu usage, see UTC document L2/10-335 (http://www.unicode.org/L2/L2010/10335r-n3893r-lao-hosken.pdf). The Khmu alphabet chart included backs up the text. (It also shows U+0EC8 LAO TONE MAI EK acting as a Nukta!)

If 'repha' can be used as a general category, including for example Myanmar script kinzi, then there are two arguable new examples, currently categorised as Consonant_Final:

1A58 ; Consonant_Preceding_Repha? # Mn TAI THAM SIGN MAI KANG LAI
1A5A ; Consonant_Succeeding_Repha? # Mn TAI THAM CONSONANT SIGN LOW PA

There are significant issues with U+1A58; while traditionally it behaves as repha/kinzi, some modern styles are better served by treating it as Consonant_Final.
It takes some juggling for a single OTL-style rendering engine to be able to render either style depending on the lookups while oblivious to the difference, but it can be done.

I've found 5 new instances of Consonant_Subjoined:

1A57 ; Consonant_Subjoined # Mc TAI THAM CONSONANT SIGN LA TANG LAI
1A5B ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN HIGH RATHA OR LOW PA
1A5C ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN MA
1A5D ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN BA
1A5E ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN SA

They were all previously categorised as Consonant_Final.

Note that U+1A57 is an abbreviation. It is derived by the addition of a stroke to the subscript form <U+1A60 TAI THAM SIGN SAKOT, U+1A43 TAI THAM LETTER LA>. Abbreviations of the word _tanglaai_ 'all' using U+1A57 normally include at least <U+1A57, U+1A63 TAI THAM VOWEL SIGN AA>, so U+1A57 is not Consonant_Final. An example, apparently spelt <U+1A26 TAI THAM LETTER NGA, U+1A57, U+1A76 TAI THAM SIGN TONE-2, U+1A63 TAI THAM VOWEL SIGN AA>, is given in Table 16 at http://www.seasite.niu.edu/tai/TaiLue/graphic%20blends.htm.

The word ᨶᩥᨻᩛᩤᨶ <U+1A36 TAI THAM LETTER NA, U+1A65 TAI THAM VOWEL SIGN I, U+1A3B TAI THAM LETTER LOW PA, U+1A5B, U+1A64 TAI THAM VOWEL SIGN TALL AA, U+1A36> _nippa:na_ 'nirvana' immediately demonstrates that U+1A5B is not a final consonant.

U+1A5C occurs in Pali proper names ending -mmo <U+1A3E TAI THAM LETTER MA, U+1A5C, U+1A6E TAI THAM VOWEL SIGN E, U+1A63 TAI THAM VOWEL SIGN AA>, so is clearly not a final consonant.

U+1A5D occurs in Northern Thai principally in one word, whose pronunciation is roughly /kɔbɔː/. U+1A5D is not Consonant_Final in its phonetic effect. The word is a compound word (or perhaps just a visual compound), formed by chaining two syllables and striking out the duplicated characters.
I have a text in which the constituents are to be encoded <U+1A20 TAI THAM LETTER HIGH KA, U+1A74 TAI THAM SIGN MAI KANG> and <U+1A37 TAI THAM LETTER BA, U+1A74, U+1A75 TAI THAM SIGN TONE-1>, so the chained word may reasonably be encoded <U+1A20, U+1A74, U+1A5D, U+1A75> or <U+1A20, U+1A5D, U+1A74, U+1A75>.

While all my examples of U+1A5E are word final, it seems to differ from <U+1A60, U+1A48 TAI THAM LETTER HIGH SA> on the basis of the room available for it. Both forms are used as a word final consonant. The only Pali consonant cluster ending in /s/ is /ss/, and that is written using U+1A54 TAI THAM LETTER GREAT SA, so a non-final <s> will be rare. (I'm finding /ks/ written with U+1A47 TAI THAM LETTER HIGH SSA due to the application of RUKI.) However, I feel it would be rash to presume that every example of U+1A5E will be a final consonant.

I have one new Consonant_Final:

0EDF ; Consonant_Final # Lo LAO LETTER KHMU NYO (was Consonant)

See UTC document L2/10-335 for evidence.

I have one possible new Consonant_Subjoined:

1A7B ; Consonant_Subjoined # Mn TAI THAM SIGN MAI SAM

The value of its Indic_Matra_Category, if relevant, should be recorded as Top.

U+1A7B is principally a repetition mark, indicating the repetition of a word. As extensions of this role, it can also do at least the following:

(1) Indicate a repeated (not geminate) consonant
(2) Indicate an omitted implicit vowel (one omits an implicit vowel by replacing it with U+1A60)
(3) Indicate an epenthetic vowel (extension of Role 2).

In rôle (1), it serves as a subjoined consonant. In rôles (2) and (3), it serves as a dependent vowel. For a shaper that does not constrain appearance, such as the Universal Shaping Engine, the best categorisation is probably 'Consonant_Subjoined'.

Although U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA and U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA are named as medial consonants, too much should not be read into such a description.
Both are, very occasionally, immediately preceded by vowels, and both may be followed by <U+1A60 TAI THAM SIGN SAKOT, U+1A40 TAI THAM LETTER HIGH YA> and <U+1A60, U+1A45 TAI THAM LETTER WA>. While the latter two sequences most commonly represent vowels, the strictly consonantal cluster <U+1A49 TAI THAM LETTER HIGH HA, U+1A56, U+1A60, U+1A45> starts a few words beginning with the cluster /lw/. This is a behaviour the Universal Shaping Engine of Microsoft currently disallows for medial consonants. We should therefore have:

1A55 ; Consonant_Subjoined # Mc TAI THAM CONSONANT SIGN MEDIAL RA
1A56 ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN MEDIAL LA

I actually see no benefits for rendering engines in distinguishing Consonant_Medial and Consonant_Subjoined, though the contrast may help in locating phonetic syllable boundaries.
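
For anyone wanting to check entries like those proposed in this message mechanically, here is a minimal sketch, assuming Python 3 and its bundled unicodedata module. It parses a handful of the proposed lines (copied from this message, with the parenthetical annotations dropped) and compares the general category and character name in each comment against the character database. The names PROPOSED and parse_line are hypothetical, chosen for illustration only; the 'code ; value # gc NAME' layout is simply the one used by the lines in this message.

```python
import unicodedata

# A few proposed lines copied from this message (not the complete set;
# annotations such as "(currently Vowel_Dependent)" are left out).
PROPOSED = """\
0303 ; Bindu # Mn COMBINING TILDE
0E4C ; Pure_Killer # Mn THAI CHARACTER THANTHAKHAT
0ECC ; Pure_Killer # Mn LAO CANCELLATION MARK
1A74 ; Bindu # Mn TAI THAM SIGN MAI KANG
1A5D ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN BA
"""


def parse_line(line):
    """Split 'XXXX ; Category # Gc NAME' into (code point, category, gc, name)."""
    data, _, comment = line.partition("#")
    code, _, category = data.partition(";")
    gc, _, name = comment.strip().partition(" ")
    return int(code.strip(), 16), category.strip(), gc, name.strip()


entries = [parse_line(ln) for ln in PROPOSED.splitlines() if ln.strip()]

for cp, category, gc, name in entries:
    ch = chr(cp)
    notes = []
    # Compare the comment fields with what this Python's Unicode data says.
    if unicodedata.name(ch, "<unknown>") != name:
        notes.append("name differs: " + unicodedata.name(ch, "<unknown>"))
    if unicodedata.category(ch) != gc:
        notes.append("gc differs: " + unicodedata.category(ch))
    print("U+%04X %-20s %s %s" % (cp, category, gc, "; ".join(notes)))
```

The proposed syllabic category itself cannot be checked this way, since unicodedata does not expose Indic_Syllabic_Category; the sketch only guards against typos in the code point, general category and name fields.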

