IMHO, a programming language that accepts non-ASCII identifiers should always normalize the identifiers it accepts before entering them in its hashed symbol table.
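As a rough illustration of what I mean (a minimal sketch in Python, assuming NFC as the chosen normalization form and using the standard unicodedata module; the function and table names are mine, not from any particular compiler):

    import unicodedata

    def intern_identifier(symtab, ident):
        # Normalize once, before the identifier ever reaches the hashed
        # symbol table, so canonically equivalent spellings map to the
        # same symbol.
        key = unicodedata.normalize('NFC', ident)
        # Reject code points that are unassigned in the Unicode version
        # known to this implementation: their normalization behaviour
        # could change once they become assigned.
        for ch in key:
            if unicodedata.category(ch) == 'Cn':
                raise ValueError('unassigned code point in identifier: %r' % ch)
        return symtab.setdefault(key, len(symtab))

With something like that in place, a source text that an editor has silently recomposed or decomposed still resolves to the same symbol.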
And for this type of usage we strongly need normalization to be stable, much more so than under the existing stability rules: normalization stability is not guaranteed if the language can accept unassigned code points that may be allocated later and will then normalize differently (normalization of unassigned code points just assumes a default combining class of 0, so no reordering or recomposition can occur, but once those code points pass from unassigned to assigned this may no longer be true). For this reason, a reasonable programming language should restrict itself to the characters assigned in a defined Unicode version and should not accept code points that are unassigned in that version.

Alternatively, compiled programs should record the Unicode version they were built with, so that later reusers of compiled programs will link properly to the older compiled programs, by making sure that identifiers used in newer programs can never match an identifier defined by an older compiled program under a different normalization. Programming languages should follow the practices used in IDNA for security reasons.

Extending the allowed subset should then be done with care: the extension will be compatible *only* if the newly assigned characters added to the extended subset have combining class 0 and are not listed among the restricted recompositions. All other characters added in the extension will not be compatible with older versions of the language (if the language cannot check the Unicode version, or does not want to be incompatible with past versions, it cannot safely extend its allowed subset for identifiers, and notably cannot add any combining character with a non-zero combining class).

2014-06-05 19:24 GMT+02:00 Jeff Senn <s...@maya.com>:
>
> On Jun 5, 2014, at 12:41 PM, Hans Aberg <haber...@telia.com> wrote:
>
> > On 5 Jun 2014, at 17:46, Jeff Senn <s...@maya.com> wrote:
> >
> >> That is: are identifiers merely sequences of characters or intended to
> be comparable as “Unicode strings” (under some sort of compatibility rule)?
> >
> > In computer languages, identifiers are normally compared only for
> equality, as it reduces lookup time complexity.
>
> Well in this case we are talking about parsing a source file and
> generating internal symbols, so the complexity of the comparison operation
> is a red herring.
>
> The real question is how does the source identifier get mapped into a
> (compiled) symbol. (e.g. in C++ this is not an obvious operation)
>
> If your implication is that there should be no canonicalization (the
> string from the source is used as a sequence of characters only directly
> mapped to a symbol), then I predict sticky problems in the future. The
> most obvious of which is that in some cases I will be able to change the
> semantics of the compiled program by (accidentally) canonicalizing the
> source text (an operation, I will point out, that is invisible to the user
> in many (most?) Unicode aware editors).
>