On Tue, 08 Sep 2015 08:19:03 -0700 "Doug Ewell" <[email protected]> wrote:
> Mark Davis 🍱️ <mark at macchiato dot com> wrote:
>
> >> TUS 8.0 Chapter 3 C6: "A process shall not assume that the
> >> interpretations of two canonical-equivalent character sequences are
> >> distinct."
> >
> > A compiler will take source code containing String x="á"; and
> > compile it to a certain binary. If that same source code is NFD'd,
> > the compiler will produce a different result.
> >
> > Do you really think that such compiler is not compliant to Unicode??
> > If so, then we should add some more clarifications around C6.

It's not me who put mens rea into the conformance requirements.  If a
compiler does no more than check strings for validity, then it may
simply naively copy the sequence of scalar values without being
non-compliant, so long as the *intent* is not to preserve differences.

For example, if a process changes strings to preferred canonically
equivalent strings, but treats characters with ccc=9 as though they had
ccc=0, it probably is in breach.  On the other hand, if it treated
characters with ccc=9 as though they had ccc=300 (not a possible value
of ccc), it is compliant.

I think it is quite possible to have two identical pieces of code of
which one is compliant and the other is non-compliant.  It all depends
on the code's motive, which I can only think refers to the motives of
the intelligent entity that caused the code to be as it is.

> I agree. The word "interpretations" in C6 can't have been intended to
> include the interpretation of code points qua code points. That would
> make a great many internal processes impossible.

I would make it even more extreme by saying that the intent is that the
rule apply to encoded text, as opposed to mere strings of code units.
The problem is that some procedures allow a character to represent
itself even where that is not consistent because the data will be seen
as text.
For example, it is my opinion that combining marks and control
characters only belong in the representation of Unicode sets when they
are part of a non-defective string element.

> I think of C6 as meaning that spell-checkers, for example, should not
> treat José (NFC, four code points) and José (NFD, five code points)
> as separate entries.

C6 does not prohibit spell-checkers from neglecting to normalise.  The
authors of the code of a spell-checker could take the view that the
database writers should have included all canonically equivalent forms.
Practically, that allows a spell-checker to enforce normalisation.

There's another, subtle feature for spell checkers.  By any reading, C6
does not require a spell-checker to realise that 'find' might be spelt
with U+FB01 LATIN SMALL LIGATURE FI.  Applying NFKC or NFKD to the Thai
word for 'water' would be wrong, for that converts <NA, MAI THO, SARA
AM> to <NA, MAI THO, NIKHAHIT, SARA AA>, which is wrong and looks quite
different.  Moreover, U+FB01 is not an acceptable alternative to <f, i>
in Turkish.

Richard.
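
The normalisation behaviour discussed above can be checked with a short
Python sketch built on the standard unicodedata module.  The helper name
canonically_equivalent is made up for illustration and is not part of
any library; the sketch only shows what the normalisation forms do to
these strings, and takes no position on what C6 requires of a process.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    exactly when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# "José" in NFC (precomposed U+00E9) and in NFD (e + U+0301 COMBINING ACUTE).
jose_nfc = "Jos\u00e9"    # four code points
jose_nfd = "Jose\u0301"   # five code points

# A naive code-point comparison treats them as different strings,
# while a comparison after normalisation treats them as the same entry.
print(jose_nfc == jose_nfd)                        # False
print(canonically_equivalent(jose_nfc, jose_nfd))  # True

# The ligature U+FB01 is only compatibility-equivalent to <f, i>,
# so canonical normalisation does not relate it to "find".
print(canonically_equivalent("\ufb01nd", "find"))  # False

# Thai word for 'water': U+0E19, U+0E49 MAI THO, U+0E33 SARA AM.
nam = "\u0e19\u0e49\u0e33"

# NFC leaves the word alone; NFKC splits SARA AM into
# U+0E4D NIKHAHIT + U+0E32 SARA AA, after the tone mark.
nam_nfc = unicodedata.normalize("NFC", nam)
nam_nfkc = unicodedata.normalize("NFKC", nam)
print([f"U+{ord(c):04X}" for c in nam_nfc])
print([f"U+{ord(c):04X}" for c in nam_nfkc])
```

A spell-checker that folds both dictionary entries and input through
NFC or NFD would match the two spellings of José, whereas folding
through NFKC or NFKD would also rewrite the Thai word as shown.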

