On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode <unicode@unicode.org> 
wrote:
> 
> The Rust community is considering adding non-ascii identifiers, which follow 
> UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for 
> identifiers to be treated as equivalent under NFKC.
> 
> Are there any cases where this will lead to inconsistencies? I.e. can the 
> NFKC of a valid UAX 31 ident be invalid UAX 31?
> 
> (In general, are there other problems folks see with this proposal?)
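
(As background to the NFKC question above: normalization equivalence is easy to experiment with using Python's standard unicodedata module. A minimal sketch — the identifier here is a hypothetical example, not one from the Rust proposal:)

```python
import unicodedata

# Two spellings of the "same" identifier: the first begins with
# U+FB01 LATIN SMALL LIGATURE FI, a compatibility character that
# NFKC decomposes to the two letters "f" + "i".
ligated = "\ufb01le"   # "ﬁle"
plain = "file"

print(ligated == plain)                                 # False
print(unicodedata.normalize("NFKC", ligated) == plain)  # True
```

Under the equivalence rule in the quoted proposal, these two spellings would name the same identifier.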

IMO the major issue with non-ASCII identifiers is not a technical one, but 
rather that it runs the risk of fragmenting the developer community.  Everyone 
can *type* ASCII and everyone can read Latin characters (for reasonably wide 
values of “everyone”, at any rate… most computer users aren’t going to have a 
problem).  Not everyone can type Hangul, Chinese or Arabic (for instance), and 
there is no good fix or workaround for this.

Note that this is orthogonal to issues such as which language identifiers or 
comments are written in (indeed, there’s no problem with comments written in 
any script you please); the problem is that e.g. given a function

  func الطول(s : String)

it isn’t obvious to a non-Arabic speaking user how to enter الطول in order to 
call it.  This isn’t true of e.g.

  func pituus(s : String)

Even though “pituus” is Finnish, it’s still ASCII and everyone knows how to 
type that.

Copy and paste is not always a good solution here, I might add; in bidi text in 
particular, copy and paste can have confusing results (and results that vary 
depending on the editor being used).  There is also the issue of additional 
confusions that might be introduced; even if you stick to Latin scripts, this 
could be a problem sometimes (e.g. at small sizes, it’s hard to distinguish ă 
and ǎ or ȩ and ę), and of course there are Cyrillic and Greek characters that 
are indistinguishable from their Latin counterparts in most fonts.  UAX #31 
also manages (I suspect unintentionally?) to give a good example of a pair of 
Farsi identifiers that might be awkward to tell apart in certain fonts, namely 
نامهای and نامه‌ای; I think those are OK in monospaced fonts, where the join is 
reasonably wide, but at small point sizes in proportional fonts the difference 
in appearance is very subtle, particularly for a non-Arabic speaker.
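
(Both confusions are easy to demonstrate in Python — the strings below are chosen purely for illustration. The Latin/Cyrillic pair compares unequal despite being visually identical, and the Farsi pair differs only by an invisible U+200C ZERO WIDTH NON-JOINER:)

```python
import unicodedata

# Latin 'a' (U+0061) vs Cyrillic 'а' (U+0430): indistinguishable in most
# fonts, but distinct code points, hence distinct identifiers.
latin_a, cyrillic_a = "a", "\u0430"
print(latin_a == cyrillic_a)                                 # False
print(unicodedata.name(cyrillic_a))                          # CYRILLIC SMALL LETTER A
# NFKC normalization does not unify them:
print(unicodedata.normalize("NFKC", cyrillic_a) == latin_a)  # False

# The Farsi pair above differs only by the invisible ZWNJ:
joined   = "نامهای"
unjoined = "نامه\u200cای"
print(joined == unjoined)          # False
print(len(joined), len(unjoined))  # 6 7
```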

You could avoid *some* of these issues by restricting the allowable scripts 
somehow (e.g. requiring that an identifier that had Latin characters could not 
also contain Cyrillic and so on) or perhaps by establishing additional 
canonical equivalences between similar looking characters (so that e.g. while a 
and а - or, more radically, ă and ǎ - might be different characters, you might 
nevertheless regard them as the same for symbol lookup).  It might be worth 
looking at UTR #36 and maybe UTS #39, not so much from a security standpoint, 
but more because those documents already have to deal with the problem of 
confusables.
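
(The script-restriction idea can be sketched quickly. This is a toy check, not UTS #39 — a real implementation should use the Unicode Script property from UAX #24; the two hard-coded ranges here are rough and for illustration only:)

```python
def rough_script(ch: str) -> str:
    # Toy classifier: two approximate ranges only, for illustration.
    cp = ord(ch)
    if 0x0041 <= cp <= 0x024F:   # roughly Basic Latin .. Latin Extended-B
        return "Latin"
    if 0x0400 <= cp <= 0x04FF:   # Cyrillic block
        return "Cyrillic"
    return "Other"

def mixes_latin_and_cyrillic(ident: str) -> bool:
    scripts = {rough_script(c) for c in ident}
    return "Latin" in scripts and "Cyrillic" in scripts

# "pаyload" spelled with Cyrillic 'а' (U+0430) mixes scripts:
print(mixes_latin_and_cyrillic("p\u0430yload"))  # True
print(mixes_latin_and_cyrillic("payload"))       # False
```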

You could also recommend that people stick to ASCII unless there’s a good 
reason to do otherwise (and note that using non-ASCII characters might impact 
on their ability to collaborate with teams in other countries).

None of this is necessarily a reason *not* to support non-ASCII identifiers, 
but it *is* something to be cautious about.  Right now, most programming 
languages operate as a lingua franca, with code written by a wide range of 
people, not all of whom speak English, but all of whom can collaborate 
to a greater or lesser degree by virtue of the fact that they all understand 
and can write code.  Going down this particular rabbit hole risks changing 
that, and not for the better, and IMO it’s important to understand that when 
considering whether the trade-off of being able to use non-ASCII characters in 
identifiers is genuinely worth it.

Kind regards,

Alastair.

--
http://alastairs-place.net

