On 1/9/2017 2:24 PM, Richard Wordingham wrote:

Richard,

> Where, if anywhere, is the encoding of plain text specified? I am particularly concerned with the arrangement of the code sequences for non-spacing abstract characters once one has determined an encoding for the abstract characters.
>
> For example, a naive reading of TUS 9.0 Section 16.4 Subsection "Ordering of Syllable Components" would lead one to believe that the word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA, U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL SIGN U, U+17C6 KHMER SIGN NIKAHIT>.

The group of Khmer experts that developed the recent label generation rules for root zone domain names considers that ordering the only one supported. You can find their specification here:

https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf

That document states:

    7.4 Context of COENG Sign (U+17D2)

    The sign ្ KHMER SIGN COENG (U+17D2) used for subscripting consonants must occur between two consonants. If it occurs between any other categories, it is not in a valid context so the label is not well formed. Further, the consonant following it must not include ឡ KHMER LETTER LA (U+17A1), ...

So, you are not alone in thinking that the COENG goes between consonants.

Did they just make this up? No, they followed what is laid out in the standard. On page 621 of Unicode 9.0.0 (http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) you find:

    Subscript Consonants. Subscript consonant signs differ from independent consonant characters and are called coeng (literally, “foot, leg”) after their subscript position. While a consonant character can constitute an orthographic syllable by itself, a subscript consonant sign cannot. Note that U+17A1 ឡ khmer letter la does not have a corresponding subscript consonant sign in standard Khmer.... Subscript consonant signs are used to represent any consonant following the first consonant in an orthographic syllable.

and on page 624:

    .... each of these [subscript consonant] signs is represented by the sequence of two characters: a special control character (U+17D2 khmer sign coeng) and a corresponding consonant character.

That text fixes the order MAIN CONSONANT + COENG OPERATOR + SUBSCRIPT CONSONANT with sufficient clarity (as do all the examples and tables).

> However, on further investigation, I cannot find any text that says that <U+1781, U+17C6, U+17D2, U+1789, U+17BB> would not be compliant with the Unicode standard. Have I missed anything?

In this example, your coeng operator U+17D2 is out of order: while it is followed by a consonant, it does not in turn immediately follow the main consonant, because a sign NIKAHIT is inserted before it.

Again, from the Root Zone LGR document we find an explicit rule:

    7.10 Context of NIKAHIT SIGN (U+17C6)

    The sign ◌ំ KHMER SIGN NIKAHIT (U+17C6) can only be preceded by a consonant or a shifter or one of the subset of dependent vowels tagged “dependent-vowel-1” in the repertoire table (◌ា ◌ុ), i.e. vowel signs AA and U.

That would allow the NIKAHIT to be placed where you suggest, if it were not for the rule on the coeng operator (7.4).

Now, it is a known fact that the label generation rules are slightly more restrictive than the rules for general text. (See also section 5 in that document.)

See the text on p. 622 in TUS 9.0.0, where the following exception is noted:

    "The subscript consonant signs in the Khmer script can be used to denote a final consonant, although this practice is uncommon."

The associated example shows

    MAIN CONSONANT + VOWEL + NIKAHIT + COENG + FINAL CONSONANT

Another exception, noted on p. 623, is the following:

    "While these subscript consonant signs are usually attached to a consonant character, they can also be attached to an independent vowel character. Although this practice is relatively rare, it is used in one very common word, meaning “to give.”"

Taken together, it would appear that, unless your example fits the first of these two exceptions, the NIKAHIT in it is out of order.

(The label generation rules disallow both of these exceptions in an attempt to streamline the rules, sacrificing a number of potential domain names. Equivalent rule sets for validating text would have to be more complete.)

> One might hope that the subsection about 'logical order' in TUS 9.0 Section 2.2 Unicode Design Principles would help, but:
>
> 1) Section 3 'Conformance' says nothing about logical order; and
>
> 2) The subsection about 'logical order' seems to assume that there exists a common practice; it does not actually place any requirement on this common practice.
>
> Richard.

I don't think either of these general sections is intended to provide the correct

http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf

A./
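
To make the interaction of the two quoted LGR rules concrete, here is a minimal Python sketch that checks a code point sequence against only the parts of 7.4 and 7.10 quoted above. It is not the LGR's actual rule set: the function name `check_label`, the treatment of U+1780..U+17A2 as "consonants", and the choice of U+17C9 and U+17CA as "shifters" are assumptions made for illustration, and the elided remainder of rule 7.4 is not modelled. It also does not implement the two general-text exceptions from pp. 622 and 623.

```python
# Illustrative check of the quoted parts of LGR rules 7.4 and 7.10.
# All class definitions below are assumptions, not the LGR repertoire table.
COENG = 0x17D2
NIKAHIT = 0x17C6
LA = 0x17A1
SHIFTERS = {0x17C9, 0x17CA}           # assumed: MUUSIKATOAN, TRIISAP
DEPENDENT_VOWEL_1 = {0x17B6, 0x17BB}  # vowel signs AA and U, per rule 7.10


def is_consonant(cp):
    """Assumed consonant class: KHMER LETTER KA..QA (U+1780..U+17A2)."""
    return cp is not None and 0x1780 <= cp <= 0x17A2


def check_label(cps):
    """Return a list of (index, message) for each violated context rule."""
    problems = []
    for i, cp in enumerate(cps):
        prev = cps[i - 1] if i > 0 else None
        nxt = cps[i + 1] if i + 1 < len(cps) else None
        if cp == COENG:
            # Rule 7.4: COENG must occur between two consonants,
            # and the following consonant must not be LA.
            if not (is_consonant(prev) and is_consonant(nxt)):
                problems.append((i, "7.4: COENG not between two consonants"))
            elif nxt == LA:
                problems.append((i, "7.4: COENG followed by LA"))
        elif cp == NIKAHIT:
            # Rule 7.10: NIKAHIT may only follow a consonant, a shifter,
            # or one of the vowel signs AA and U.
            if not (is_consonant(prev) or prev in SHIFTERS
                    or prev in DEPENDENT_VOWEL_1):
                problems.append((i, "7.10: NIKAHIT in invalid context"))
    return problems


# The two orderings discussed in this thread.
standard_order = [0x1781, 0x17D2, 0x1789, 0x17BB, 0x17C6]
nikahit_first = [0x1781, 0x17C6, 0x17D2, 0x1789, 0x17BB]

print(check_label(standard_order))
print(check_label(nikahit_first))
```

Under these assumptions the first sequence passes both checks. The second passes the NIKAHIT check (it follows a consonant) but fails the COENG check, because the COENG follows NIKAHIT rather than a consonant.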

