Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Alastair Houghton via Unicode Thu, 07 Jun 2018 09:03:29 -0700

On 7 Jun 2018, at 15:51, Frédéric Grosshans via Unicode <unicode@unicode.org> 
wrote:
> 
>> IMO the major issue with non-ASCII identifiers is not a technical one, but 
>> rather that it runs the risk of fragmenting the developer community.  
>> Everyone can *type* ASCII and everyone can read Latin characters (for 
>> reasonably wide values of “everyone”, at any rate… most computer users 
>> aren’t going to have a problem). Not everyone can type Hangul, Chinese or 
>> Arabic (for instance), and there is no good fix or workaround for this.
> Well, your “reasonable” value of everyone excludes many kids,


Every keyboard I’ve ever seen, including Chinese ones, is marked with ASCII 
characters as well. Typing ASCII on a machine in the Chinese locale might not 
be entirely straightforward, but entering Chinese characters, even on such a 
machine, takes significant training, and on a machine not set to Chinese locale 
it might even require the installation of additional software. It isn’t even 
the case, as I understand it, that all machines set to Chinese locales use the 
same input method, so being able to enter Chinese on one system doesn’t 
necessarily mean you’ll be able to do so on another. (I imagine it makes it 
easier to learn, once you’ve done it once, but still…)

I appreciate that the upshot of the Anglicised world of software engineering is 
that native English speakers have an advantage, and those for whom Latin isn’t 
their usual script are at a particular disadvantage, and I’m sure that seems 
unfair to many of us — but that doesn’t mean that allowing the use of other 
scripts everywhere, desirable as it is, is entirely unproblematic.

>> it isn’t obvious to a non-Arabic speaking user how to enter الطول in order 
>> to call it.
> OK. Clearly, someone not knowing the Arabic alphabet will have difficulties 
> with this one, but if one has good reason to think the targeted developer 
> community is literate in Arabic and has a lower mastery of the Latin 
> alphabet, it still may be a good idea.
> If I understand you correctly, an Arabic speaker should always transliterate 
> the function name to ASCII,

That’s one option; or they could write it in Arabic, but they need to be aware 
of the consequences of doing so (and those they are working for or with also 
need to understand that); or they could choose some other language, perhaps one 
shared with other teams who are likely to work on the code. Imagine you 
outsourced development to a team that happened to be Arabic speaking, and they 
developed (let’s say) French language software for you, but later you wanted to 
bring development in house and found all the identifiers were in Arabic script, 
which made the code very difficult for your developers to work with. That isn’t 
exactly going to make your day, and if it isn’t a problem that anyone has 
mentioned, it might not have been obvious when you originally outsourced your 
development that you needed to make sure people weren’t going to do that.

>>  UAX #31 also manages (I suspect unintentionally?) to give a good example of 
>> a pair of Farsi identifiers that might be awkward to tell apart in certain 
>> fonts, namely نامهای and نامه‌ای; I think those are OK in monospaced fonts, 
>> where the join is reasonably wide, but at small point sizes in proportional 
>> fonts the difference in appearance is very subtle, particularly for a 
>> non-Arabic speaker.
> In ASCII, identifiers with I, l, and 1 can be difficult to tell apart. And it 
> is not an artificial problem: I’ve once had some difficulties with an 
> automatically generated login which was do11y but which I tried to type as 
> dolly, despite my familiarity with ASCII. So I guess this problem is not 
> specific to the ASCII vs non-ASCII debate.

It isn’t, though fonts used by programmers typically emphasise the differences 
between I, l and 1 as well as 0 and O, 5 and S and so on specifically to avoid 
this problem.
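
To make the Farsi example concrete, here is a minimal Python sketch. It assumes 
the two identifiers are spelled with the code points shown in the comments (the 
second one has U+200C ZERO WIDTH NON-JOINER after the heh); the variable names 
are only illustrative. It shows that the strings differ by a single invisible 
character and that NFKC normalisation does not merge them.

```python
import unicodedata

# Assumed spellings of the two identifiers discussed above.
joined = "\u0646\u0627\u0645\u0647\u0627\u06CC"           # noon alef meem heh alef yeh
with_zwnj = "\u0646\u0627\u0645\u0647\u200C\u0627\u06CC"  # same, with ZWNJ after heh


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalisation."""
    return unicodedata.normalize("NFKC", text)


# The identifiers are different strings, before and after NFKC.
print(joined == with_zwnj)              # False
print(nfkc(joined) == nfkc(with_zwnj))  # False

# The only extra code point is the zero-width non-joiner.
extra = [f"U+{ord(ch):04X}" for ch in with_zwnj if ch not in joined]
print(extra)                            # ['U+200C']
```

So a compiler that compares identifiers after NFKC alone would still treat the 
two as distinct names, however similar they look on screen.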

But please don’t misunderstand; I am not — and have not been — arguing against 
non-ASCII identifiers. We were asked whether there were any problems. These are 
problems (or perhaps we might call them “trade-offs”). We can debate the 
severity of them, and whether, and what, it’s worthwhile doing anything to 
mitigate any of them. What we shouldn’t do is sweep them under the carpet.

Personally I think a combination of documentation to explain that it’s worth 
thinking carefully about which script(s) to use, and some steps to consider 
certain characters to be equivalent even though they aren’t the same (and 
shouldn’t be the same even when normalised) might be a good idea. Is that 
really so controversial a position?
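
As a rough sketch of what "considering certain characters to be equivalent" 
could look like in a linter, the Python below maps each identifier to a 
comparison key and reports identifiers whose keys coincide. The table is a 
deliberately tiny, hypothetical one made up for this illustration (real 
confusables data would be far larger), and the names skeleton and 
find_collisions are assumptions, not part of any existing tool.

```python
import unicodedata

# Hypothetical, deliberately tiny equivalence table (illustration only).
CONFUSABLE = str.maketrans({
    "1": "l",
    "I": "l",
    "0": "O",
    "5": "S",
    "\u200c": None,  # drop ZERO WIDTH NON-JOINER
    "\u200d": None,  # drop ZERO WIDTH JOINER
})


def skeleton(identifier: str) -> str:
    """Return a comparison key; equal keys mean 'may look alike'."""
    return unicodedata.normalize("NFC", identifier).translate(CONFUSABLE)


def find_collisions(identifiers):
    """Group distinct identifiers whose comparison keys coincide."""
    groups = {}
    for name in identifiers:
        groups.setdefault(skeleton(name), set()).add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(find_collisions(["do11y", "dolly", "length"]))  # [['do11y', 'dolly']]
```

The identifiers themselves are left untouched and remain distinct; the key is 
used only to warn that two names in the same scope are easy to confuse.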

Kind regards,

Alastair.

--
http://alastairs-place.net
