> That is, why is conforming to UAX #31 worth the risk of prohibiting the use of characters that some users might want to use?
One could parse for certain sequences, putting characters into a number of broad categories. Very approximately:

- junk ~= [[:cn:][:cs:][:co:]]+
- whitespace ~= [[:z:][:c:]-junk]+
- syntax ~= [[:s:][:p:]] // broadly speaking, including both the language syntax & user-named operators
- identifiers ~= [all-else]+

UAX #31 specifies several different kinds of identifiers, and takes roughly that approach for http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the focus there is on immutability. So an implementation could choose to follow that course, rather than the more narrowly defined identifiers in http://unicode.org/reports/tr31/#Default_Identifier_Syntax.

Alternatively, one can conform to the Default Identifiers but declare a profile that expands the allowable characters. One could take a Swiftian approach <http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/>, for example...

Mark

On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode <unicode@unicode.org> wrote:

> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:
> > Considering that ruling out too much can be a problem later, but just
> > treating anything above ASCII as opaque hasn't caused trouble (that I
> > know of) for HTML other than compatibility issues with XML's stricter
> > stance, why should a programming language, if it opts to support
> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> > complexity of UAX #31 instead of allowing everything above ASCII in
> > identifiers? In other words, what problem does making a programming
> > language conform to UAX #31 solve?
>
> After refreshing my memory of XML history, I realize that mentioning
> XML does not helpfully illustrate my question despite the mention of
> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
> ignore the XML part.
>
> Trying to rephrase my question more clearly:
>
> Let's assume that we are designing a computer-parseable syntax where
> tokens consisting of user-chosen characters can't occur next to each
> other and, instead, always have some syntax-reserved characters
> between them. That is, I'm talking about syntaxes that look like this
> (could be e.g. Java):
>
>   ab.cd();
>
> Here, ab and cd are tokens with user-chosen characters whereas space
> (the indent), period, parenthesis and the semicolon are
> syntax-reserved. We know that ab and cd are distinct tokens, because
> there is a period between them, and we know the opening parenthesis
> ends the cd token.
>
> To illustrate what I'm explicitly _not_ talking about, I'm not talking
> about a syntax like this:
>
>   αβ⊗γδ
>
> Here αβ and γδ are user-named variable names and ⊗ is a user-named
> operator and the distinction between different kinds of user-named
> tokens has to be known somehow in order to be able to tell that there
> are three distinct tokens: αβ, ⊗, and γδ.
>
> My question is:
>
> When designing a syntax where tokens with the user-chosen characters
> can't occur next to each other without some syntax-reserved characters
> between them, what advantages are there from limiting the user-chosen
> characters according to UAX #31 as opposed to treating any character
> that is not a syntax-reserved character as a character that can occur
> in user-named tokens?
>
> I understand that taking the latter approach allows users to mint
> tokens that on some aesthetic measure don't make sense (e.g. minting
> tokens that consist of glyphless code points), but why is it important
> to prescribe that this is prohibited as opposed to just letting users
> choose not to mint tokens that are inconvenient for them to work with
> given the behavior that their plain text editor gives to various
> characters? That is, why is conforming to UAX #31 worth the risk of
> prohibiting the use of characters that some users might want to use?
> The introduction of XID after ID and the introduction of Extended
> Hashtag Identifiers after XID is indicative of over-restriction having
> been a problem.
>
> Limiting user-minted tokens to UAX #31 does not appear to be necessary
> for security purposes considering that HTML and CSS exist in a
> particularly adversarial environment and get away with taking the
> approach that any character that isn't a syntax-reserved character is
> collected as part of a user-minted identifier. (Informally, both treat
> non-ASCII characters the same as an ASCII underscore. HTML even treats
> non-whitespace, non-U+0000 ASCII controls that way.)
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
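
For illustration, the four broad categories in the reply at the top of this message (junk, whitespace, syntax, identifiers) can be sketched in Python. This is only a rough sketch under stated assumptions, not something UAX #31 prescribes: the names classify and tokenize are hypothetical, the mapping relies on the General_Category reported by the standard unicodedata module, and the result for a given character depends on the Unicode version bundled with the Python build.

```python
import unicodedata
from itertools import groupby


def classify(ch: str) -> str:
    """Map one character to a broad class (hypothetical helper)."""
    cat = unicodedata.category(ch)
    # junk ~= [[:cn:][:cs:][:co:]]: unassigned, surrogate, private use
    if cat in ("Cn", "Cs", "Co"):
        return "junk"
    # whitespace ~= [[:z:][:c:]-junk]: separators plus remaining C* (Cc, Cf)
    if cat[0] in "ZC":
        return "whitespace"
    # syntax ~= [[:s:][:p:]]: symbols and punctuation
    if cat[0] in "SP":
        return "syntax"
    # identifiers ~= all else: letters, marks, numbers
    return "identifier"


def tokenize(text: str) -> list[tuple[str, str]]:
    """Split text into (class, token) pairs (hypothetical helper).

    Runs of junk, whitespace and identifier characters are merged,
    matching the '+' in the sketch; syntax characters stay single,
    since the syntax class there has no '+'.
    """
    tokens = []
    for kind, run in groupby(text, key=classify):
        if kind == "syntax":
            tokens.extend((kind, ch) for ch in run)
        else:
            tokens.append((kind, "".join(run)))
    return tokens


if __name__ == "__main__":
    for sample in ("ab.cd();", "αβ⊗γδ"):
        print(sample, tokenize(sample))
```

With these assumptions, ab.cd(); splits into the identifier ab, the syntax character ".", the identifier cd, and three single syntax characters, and αβ⊗γδ splits into αβ, ⊗, γδ. Note that ASCII underscore (Pc) and most emoji (So) fall into "syntax" here, so a real language would need to adjust the sets, which fits the "very approximately" caveat.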