On Sat, 24 Oct 2015 08:40:32 +0300 Eli Zaretskii <[email protected]> wrote:
> > Date: Fri, 23 Oct 2015 23:16:32 +0100
> > From: Richard Wordingham <[email protected]>
> >
> > "C6: A process shall not assume that the interpretations of two
> > canonical-equivalent character sequences are distinct."
> >
> > Firstly, I have grave difficulties assigning mental activities to
> > processes.
> >
> > Secondly, it may be possible to interpret "A process shall not
> > assume X" as "A process shall function correctly regardless of
> > whether X holds."
> >
> > However, let image(Y) be the bitmap depicting the string Y.  Then
> > the following logic would be non-compliant:
> >
> > if A and B are canonically equivalent and image(A) and image(B) are
> > different, then
> >     write(A, " and ", B, " are canonically equivalent but have
> >     different images ", image(A), " and ", image(B));
> > end if
> >
> > The logic is non-compliant, for if it is invoked then the write
> > statement will only work correctly if image(A) and image(B) are
> > different, i.e. if A and B are interpreted differently.  Apparently
> > it is permissible to render canonically equivalent sequences
> > differently, so image(A) and image(B) might be different even
> > though canonically equivalent.
> >
> > I therefore conclude that C6 is in some language that I do not
> > adequately understand.
>
> AFAIU, Unicode is about processing text, and only mentions display
> rarely, where it's directly related to the processing part.  So the
> above is about _processing_ canonically-equivalent sequences, not
> about their display.  When looked at in this way, I see no
> difficulties in understanding the text.

Display is part of interpretation - indeed, it is currently the most
important part.  At least, I would interpret displaying U+0041 with a
glyph like 'X' (an example in 'D2 Character identity') as violating:

"C4: A process shall interpret a coded character sequence according to
the character semantics established by this standard, if that process
does interpret that coded character sequence."

I chose the complicated function image() as being less controversial.
However, as you do not think it interprets a string, consider the
full, default toUppercase() instead.  The problem lies with the
troublesome U+0345 COMBINING GREEK YPOGEGRAMMENI (subscript iota) with
ccc=240, which uppercases to U+0399 GREEK CAPITAL LETTER IOTA with
ccc=0.  While U+0345 commutes with Greek accents, U+0399 does not.
Thus U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
uppercases, in full mode, to <U+1F08 GREEK CAPITAL LETTER ALPHA WITH
PSILI, U+0399>, but the canonically equivalent lower case form <U+1FB3
GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0313 COMBINING COMMA
ABOVE> uppercases, in full mode, to the inequivalent upper case
<U+0391 GREEK CAPITAL LETTER ALPHA, U+0399, U+0313>.

The brute force solution to this issue, which is minor in practice, is
to convert strings to NFD before upper-casing, but this would fall
foul of one guess at the meaning of C6, namely "An author shall not
assume that the interpretations of two canonical-equivalent character
sequences are distinct".  Of course, if that is the meaning,
determining whether

    X = toNFC(toUppercase(toNFD(X)))

is compliant depends on answering the question, "Did the author think
he could get a different result if he omitted the conversion to NFD?".
I'm not sure whether the code would be compliant under my
interpretation if the author was unsure as to whether omitting the
conversion would get a different result.

> The Hebrew script is never an alphabet, AFAIU, it's likely an abugida
> when the vowel marks are used.

No, the definition of an abugida is that there is a default vowel
which is indicated by the absence of any vowel mark.  In fully pointed
Hebrew, it's only final, silent and quiescent consonants that lack
vowel marks.  I don't like the definitions, because they are extremely
vulnerable to small changes in use.
Indeed, having taken the name from the consonant system underlying the
Ethiopic syllabary, the inventors of the term subsequently concluded
that the eponymous abugida was not actually an abugida!

> The so-called "full spelling", where
> some vowels are indicated by consonants, does not replace all the
> vowels with consonants, so it isn't, strictly speaking, an alphabet in
> the above sense.

Nor would I claim it as such.

Richard.
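
P.S. The U+0345 uppercasing example above can be checked mechanically.
What follows is only a minimal sketch in Python, on the assumption that
the interpreter's str.upper() applies the full, context-free default
case mapping and that its unicodedata module supplies NFD and NFC.  The
helper names (canonically_equivalent, upper_via_nfd) are mine and are
not taken from the standard.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def nfc(s: str) -> str:
    """Canonical composition (NFC) of s."""
    return unicodedata.normalize("NFC", s)


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms match."""
    return nfd(a) == nfd(b)


def upper_via_nfd(s: str) -> str:
    """The brute-force approach: toNFC(toUppercase(toNFD(s)))."""
    return nfc(nfd(s).upper())


# U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
precomposed = "\u1F80"
# <U+1FB3 ALPHA WITH YPOGEGRAMMENI, U+0313 COMBINING COMMA ABOVE>
partly_decomposed = "\u1FB3\u0313"

if __name__ == "__main__":
    # The two lower-case inputs are canonically equivalent ...
    print(canonically_equivalent(precomposed, partly_decomposed))
    # ... but their direct full uppercasings are not.
    print(canonically_equivalent(precomposed.upper(),
                                 partly_decomposed.upper()))
    # Going through NFD first gives the same answer for both.
    print(upper_via_nfd(precomposed) == upper_via_nfd(partly_decomposed))
```

Whether running upper_via_nfd() instead of a bare upper() counts as
"assuming the interpretations are distinct" is, of course, exactly the
question the sketch cannot answer.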

