On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode <unicode@unicode.org> 
wrote:
> 
> The Rust community is considering adding non-ascii identifiers, which follow 
> UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for 
> identifiers to be treated as equivalent under NFKC.
> 
> Are there any cases where this will lead to inconsistencies? I.e. can the 
> NFKC of a valid UAX 31 ident be invalid UAX 31?
> 
> (In general, are there other problems folks see with this proposal?)
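
(As background to the NFKC question above: normalization equivalence is easy to experiment with using Python's standard unicodedata module. A minimal sketch — the identifier here is a hypothetical example, not one from the Rust proposal:)

```python
import unicodedata

# Two spellings of the "same" identifier: the first begins with
# U+FB01 LATIN SMALL LIGATURE FI, a compatibility character that
# NFKC decomposes to the two letters "f" + "i".
ligated = "\ufb01le"   # "ﬁle"
plain = "file"

print(ligated == plain)                                 # False
print(unicodedata.normalize("NFKC", ligated) == plain)  # True
```

Under the equivalence rule in the quoted proposal, these two spellings would name the same identifier.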

IMO the major issue with non-ASCII identifiers is not a technical one, but 
rather that it runs the risk of fragmenting the developer community.  Everyone 
can *type* ASCII and everyone can read Latin characters (for reasonably wide 
values of “everyone”, at any rate… most computer users aren’t going to have a 
problem).  Not everyone can type Hangul, Chinese or Arabic (for instance), and 
there is no good fix or workaround for this.

Note that this is orthogonal to issues such as which language identifiers or 
comments are written in (indeed, there’s no problem with comments written in 
any script you please); the problem is that e.g. given a function

  func الطول(s : String)

it isn’t obvious to a non-Arabic speaking user how to enter الطول in order to 
call it.  This isn’t true of e.g.

  func pituus(s : String)

Even though “pituus” is Finnish, it’s still ASCII and everyone knows how to 
type that.

Copy and paste is not always a good solution here, I might add; in bidi text in 
particular, copy and paste can have confusing results (and results that vary 
depending on the editor being used).  There is also the issue of additional 
confusions that might be introduced; even if you stick to Latin scripts, this 
could be a problem sometimes (e.g. at small sizes, it’s hard to distinguish ă 
and ǎ or ȩ and ę), and of course there are Cyrillic and Greek characters that 
are indistinguishable from their Latin counterparts in most fonts.  UAX #31 
also manages (I suspect unintentionally?) to give a good example of a pair of 
Farsi identifiers that might be awkward to tell apart in certain fonts, namely 
نامهای and نامه‌ای; I think those are OK in monospaced fonts, where the join is 
reasonably wide, but at small point sizes in proportional fonts the difference 
in appearance is very subtle, particularly for a non-Arabic speaker.
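
(Both confusions are easy to demonstrate in Python — the strings below are chosen purely for illustration. The Latin/Cyrillic pair compares unequal despite being visually identical, and the Farsi pair differs only by an invisible U+200C ZERO WIDTH NON-JOINER:)

```python
import unicodedata

# Latin 'a' (U+0061) vs Cyrillic 'а' (U+0430): indistinguishable in most
# fonts, but distinct code points, hence distinct identifiers.
latin_a, cyrillic_a = "a", "\u0430"
print(latin_a == cyrillic_a)                                 # False
print(unicodedata.name(cyrillic_a))                          # CYRILLIC SMALL LETTER A
# NFKC normalization does not unify them:
print(unicodedata.normalize("NFKC", cyrillic_a) == latin_a)  # False

# The Farsi pair above differs only by the invisible ZWNJ:
joined   = "نامهای"
unjoined = "نامه\u200cای"
print(joined == unjoined)          # False
print(len(joined), len(unjoined))  # 6 7
```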

You could avoid *some* of these issues by restricting the allowable scripts 
somehow (e.g. requiring that an identifier that had Latin characters could not 
also contain Cyrillic and so on) or perhaps by establishing additional 
canonical equivalences between similar looking characters (so that e.g. while a 
and а - or, more radically, ă and ǎ - might be different characters, you might 
nevertheless regard them as the same for symbol lookup).  It might be worth 
looking at UTR #36 and maybe UTS #39, not so much from a security standpoint, 
but more because those documents already have to deal with the problem of 
confusables.
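
(The script-restriction idea can be sketched quickly. This is a toy check, not UTS #39 — a real implementation should use the Unicode Script property from UAX #24; the two hard-coded ranges here are rough and for illustration only:)

```python
def rough_script(ch: str) -> str:
    # Toy classifier: two approximate ranges only, for illustration.
    cp = ord(ch)
    if 0x0041 <= cp <= 0x024F:   # roughly Basic Latin .. Latin Extended-B
        return "Latin"
    if 0x0400 <= cp <= 0x04FF:   # Cyrillic block
        return "Cyrillic"
    return "Other"

def mixes_latin_and_cyrillic(ident: str) -> bool:
    scripts = {rough_script(c) for c in ident}
    return "Latin" in scripts and "Cyrillic" in scripts

# "pаyload" spelled with Cyrillic 'а' (U+0430) mixes scripts:
print(mixes_latin_and_cyrillic("p\u0430yload"))  # True
print(mixes_latin_and_cyrillic("payload"))       # False
```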

You could also recommend that people stick to ASCII unless there’s a good 
reason to do otherwise (and note that using non-ASCII characters might impact 
on their ability to collaborate with teams in other countries).

None of this is necessarily a reason *not* to support non-ASCII identifiers, 
but it *is* something to be cautious about.  Right now, most programming 
languages operate as a lingua franca, with code written by a wide range of 
people, not all of whom speak English, but all of whom can collaborate 
to a greater or lesser degree by virtue of the fact that they all understand 
and can write code.  Going down this particular rabbit hole risks changing 
that, and not for the better, and IMO it’s important to understand that when 
considering whether the trade-off of being able to use non-ASCII characters in 
identifiers is genuinely worth it.

Kind regards,

Alastair.

--
http://alastairs-place.net

