Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Richard Wordingham via Unicode
On Mon, 4 Jun 2018 12:49:20 -0700
Manish Goregaokar via Unicode  wrote:

> Hi,
> 
> The Rust community is considering
>  adding non-ascii
> identifiers, which follow UAX #31
>  (XID_Start XID_Continue*, with
> tweaks). The proposal also asks for identifiers to be treated as
> equivalent under NFKC.
> 
> Are there any cases where this will lead to inconsistencies? I.e. can
> the NFKC of a valid UAX 31 ident be invalid UAX 31?
> 
> (In general, are there other problems folks see with this proposal?)
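
A brute-force way to probe that question, assuming the unicode-xid and
unicode-normalization crates, is to NFKC-fold every XID_Start code point
and re-test the result (XID_Continue would be checked analogously):

  use unicode_normalization::UnicodeNormalization;
  use unicode_xid::UnicodeXID;

  fn main() {
      let mut violations = 0u32;
      for c in (0u32..=0x10FFFF).filter_map(char::from_u32) {
          if !c.is_xid_start() {
              continue;
          }
          // NFKC-fold the single code point and check that the result
          // is still a valid identifier (XID_Start XID_Continue*).
          let folded: String = c.to_string().nfkc().collect();
          let mut chars = folded.chars();
          let ok = matches!(chars.next(), Some(f) if f.is_xid_start())
              && chars.all(|f| f.is_xid_continue());
          if !ok {
              violations += 1;
              println!("U+{:04X} breaks under NFKC", c as u32);
          }
      }
      println!("{} violations", violations);
  }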

Confusable checking may need to be reviewed.  There are several cases
where, sometimes depending on the font, anagrams (which differ even
after normalisation) can render the same.  The examples I know of are
from SE Asia, and fall into these categories:

a) Swapping subscript letters - a big issue in the Myanmar script, but
Sanskrit grv- and gvr- can easily be rendered the same.  I don't know
how easily confusion arises by 'finger trouble'.

b) Vowel-subscript consonant and subscript consonant-vowel often look
the same in Khmer and Tai Tham.  The former spelling was supposedly
dropped in Khmer a century ago (the consonant ceasing to be subscript),
but lingered on in a few words and is acknowledged by Unicode but not by
the Microsoft font developer's guide.

c) Unresolved grammar.  In Thai minority languages, U+0E3A THAI
CHARACTER PHINTHU and a mark above (U+0E34 THAI CHARACTER SARA I, I
believe) can and do occur in either order, with no difference in
appearance or meaning.

The obvious humane solution is a brutal folding of the sequences.
(Spell-checkers work wonders on normal text, but spell-checking code is
tricky.)
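
A sketch of what such a fold might look like for case (c) above,
assuming a project-specific pass applied after NFKC (the folded order
chosen here is arbitrary):

  // Fold the two visually identical orders of U+0E34 THAI CHARACTER
  // SARA I (above) and U+0E3A THAI CHARACTER PHINTHU (below) into one;
  // canonical ordering never reorders them, since U+0E34 has ccc 0.
  fn fold_phinthu_sara_i(s: &str) -> String {
      let mut out: Vec<char> = s.chars().collect();
      let mut i = 0;
      while i + 1 < out.len() {
          if out[i] == '\u{0E3A}' && out[i + 1] == '\u{0E34}' {
              out.swap(i, i + 1); // normalise to <SARA I, PHINTHU>
          }
          i += 1;
      }
      out.into_iter().collect()
  }

  fn main() {
      let a = "\u{0E3A}\u{0E34}";
      let b = "\u{0E34}\u{0E3A}";
      assert_eq!(fold_phinthu_sara_i(a), fold_phinthu_sara_i(b));
  }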

I actually suggested a character (U+1A54 TAI THAM LETTER GREAT SA) so
that folding 'ses' to 'sse' would not result in the 'ss' conjunct being
used; the conjunct is not used in 'ses'.

Richard.


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Asmus Freytag via Unicode
On 6/6/2018 2:25 PM, Hans Åberg via Unicode wrote:

>> On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode wrote:
>> 
>> The Rust community is considering adding non-ascii identifiers, which
>> follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal
>> also asks for identifiers to be treated as equivalent under NFKC.
> 
> So, in this language, if one defines a projection function 𝜋 and the
> usual constant π, what is 𝜋(π) supposed to mean? - Just curious.

In a language where one writes ASCII "pi" instead, what is pi(pi)
supposed to mean?

A./


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Hans Åberg via Unicode


> On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode  
> wrote:
> 
> The Rust community is considering adding non-ascii identifiers, which follow 
> UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for 
> identifiers to be treated as equivalent under NFKC.

So, in this language, if one defines a projection function 𝜋 and the usual 
constant π, what is 𝜋(π) supposed to mean? - Just curious.
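
For concreteness: NFKC folds U+1D70B MATHEMATICAL ITALIC SMALL PI into
U+03C0 GREEK SMALL LETTER PI, so the two names collide.  A minimal
sketch of the collision, assuming the unicode-normalization crate (not
something the Rust proposal itself specifies):

  use unicode_normalization::UnicodeNormalization;

  fn main() {
      let math_pi = "\u{1D70B}"; // 𝜋 MATHEMATICAL ITALIC SMALL PI
      let greek_pi = "\u{03C0}"; // π GREEK SMALL LETTER PI
      let folded: String = math_pi.nfkc().collect();
      // Under NFKC the two identifiers become the same string,
      // so 𝜋(π) collapses to π(π).
      assert_eq!(folded, greek_pi);
  }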





Re: Requiring typed text to be NFKC

2018-06-06 Thread Richard Wordingham via Unicode
On Tue, 5 Jun 2018 19:48:53 -0700
Manish Goregaokar via Unicode  wrote:

> Following up from my previous email
> ,
> one of the ideas that was brought up was that if we're going to
> consider NFKC forms equivalent, we should require things to be typed
> in NFKC.
> 
> 
> I'm a bit wary of this. As Richard brought up in that thread, some
> Thai NFKC forms are untypable. I *suspect* there are Hangul keyboards
> (perhaps physical non-IME based ones) that have this problem.
> 
> Do folks have other examples? Interested in both:

I don't know of any different problems for NFKC, but there are problems
with getting people to enter normalised data.

>  - Words (as in, real things people will want to type) where a
> keyboard/IME does not type the NFKC form

There are problems with insisting that users type normalised text.
Vietnamese is probably a real issue here; the standard keyboard is set
up to enter vowels (some of which are accented) and tone marks
separately.  Indeed, with the nặng tone (as in the vowel of its name),
one is likely to find the codepoint sequence <U+0103 LATIN SMALL LETTER
A WITH BREVE, U+0323 COMBINING DOT BELOW>, which is not NFC, not NFD and
not even FCD.
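
A sketch of that sequence, assuming the unicode-normalization crate
(and assuming the input method emits the sequence described above):

  use unicode_normalization::{is_nfc, UnicodeNormalization};

  fn main() {
      // ă typed first, the nặng tone (dot below) added afterwards.
      let typed = "\u{0103}\u{0323}";
      assert!(!is_nfc(typed));
      // NFC recomposes it to the single code point ặ.
      let nfc: String = typed.nfc().collect();
      assert_eq!(nfc, "\u{1EB7}");
  }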

>  - Words where the NFKC form is *visually* distinct enough that it
> will look weird to native speakers

There may be issues with BMP CJK compatibility ideographs.  I don't
know how far they've been replaced by variation sequences requesting
the same appearance.

>  - Words where a keyboard/IME *can* type the NFKC form but users are
> not used to it

Well, typing Tai Khuen in normalised form is hideously
counter-intuitive, but at present the USE (Universal Shaping Engine)
makes displaying correctly spelt text a struggle for a font.  The
problem there is that the usual way of typing a closed syllable with a
tone mark gets normalised at the end into a different codepoint order;
that normalisation broke early pre-USE OpenType-based fonts as databases
caught up with Unicode 5.2.  That problem was promptly cured by HarfBuzz
tweaking its internal normalisation, until the USE unintentionally
outlawed the correct spelling.

A universal keyboard for entering large swathes of the Latin script is
not a very big problem, but entering text with diacritics in NFC form is
a real pain.  This problem might arise when editing a Hungarian program
without a Hungarian keyboard.  The program development environment would
have to provide a normalisation tool.
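
The normalisation tool itself is small; a sketch, again assuming the
unicode-normalization crate (the command-line handling is illustrative
only, not from any real tool):

  use std::{env, fs};
  use unicode_normalization::UnicodeNormalization;

  // A normalise-on-save pass a development environment could run:
  // rewrite the given source file into NFC if it isn't already.
  fn main() -> std::io::Result<()> {
      let path = env::args().nth(1).expect("usage: nfc-fix <file>");
      let src = fs::read_to_string(&path)?;
      let normalised: String = src.nfc().collect();
      if normalised != src {
          fs::write(&path, normalised)?;
      }
      Ok(())
  }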

Richard.



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Henri Sivonen via Unicode
On Mon, Jun 4, 2018 at 10:49 PM, Manish Goregaokar via Unicode
 wrote:
> The Rust community is considering adding non-ascii identifiers, which follow
> UAX #31 (XID_Start XID_Continue*, with tweaks).

UAX #31 is rather light on documenting its rationale.

I realize that XML is a different case from Rust: the Rust compiler is
something a programmer runs locally, whereas XML documents and XML
processors, especially over time, are significantly less tightly
coupled.

Still, the experience from XML and HTML suggests that, if non-ASCII is
to be allowed in identifiers at all, restricting the value space of
identifiers a priori easily ends up restricting too much.  HTML went
with the approach of collecting everything up to the next ASCII code
point that acts as a delimiter in HTML (with a later check for names
eligible for Custom Element treatment, which mainly achieves
compatibility with XML, but no such check on what the parser can
actually put in the document tree), while keeping the actual vocabulary
ASCII-only (except for Custom Elements, whose seemingly arbitrary
restrictions are inherited from XML).

XML 1.0 codified, for element and attribute names, what was then the
understanding of the topic that UAX #31 now covers, and made other
cases a hard failure.  Later, it turned out that XML had originally
ruled out too much, and the whole mess that was XML 1.1 and XML 1.0
5th ed. resulted from trying to relax the rules.

Considering that ruling out too much can be a problem later, but just
treating anything above ASCII as opaque hasn't caused trouble (that I
know of) for HTML other than compatibility issues with XML's stricter
stance, why should a programming language, if it opts to support
non-ASCII identifiers in an otherwise ASCII core syntax, implement the
complexity of UAX #31 instead of allowing everything above ASCII in
identifiers? In other words, what problem does making a programming
language conform to UAX #31 solve?
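
To make the contrast concrete, here is the permissive rule sketched
next to the UAX #31 one, assuming the unicode-xid crate for the latter
(both functions are illustrative, not from any real parser):

  use unicode_xid::UnicodeXID;

  // UAX #31 default rule: XID_Start XID_Continue*.
  fn is_uax31_ident(s: &str) -> bool {
      let mut chars = s.chars();
      matches!(chars.next(), Some(c) if c.is_xid_start())
          && chars.all(|c| c.is_xid_continue())
  }

  // HTML-style permissive rule: the usual ASCII identifier grammar,
  // with everything above ASCII allowed opaquely.
  fn is_permissive_ident(s: &str) -> bool {
      !s.is_empty()
          && !s.starts_with(|c: char| c.is_ascii_digit())
          && s.chars()
              .all(|c| c == '_' || c.is_ascii_alphanumeric() || !c.is_ascii())
  }

  fn main() {
      assert!(is_uax31_ident("π"));
      // U+2029 PARAGRAPH SEPARATOR: rejected by UAX #31,
      // accepted by the permissive rule.
      assert!(!is_uax31_ident("\u{2029}"));
      assert!(is_permissive_ident("\u{2029}"));
  }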

Allowing anything above ASCII will lead to some cases that obviously
don't make sense, such as declaring a function whose name is a
paragraph separator, but why is it important to prohibit that kind of
thing when prohibiting things risks prohibiting too much, as happened
with XML, and people just don't mint identifiers that aren't practical
to them? Is there some important badness prevention concern that
applies to programming languages more than it applies to HTML? The key
thing here in terms of considering if badness is _prevented_ isn't
what's valid HTML but what the parser can actually put in the DOM, and
the HTML parser can actually put any non-ASCII code point in the DOM
as an element or attribute name (after the initial ASCII code point).

(The above question is orthogonal to normalization. I do see the value
of normalizing identifiers to NFC or requiring them to be in NFC to
begin with. I'm inclined to consider NFKC as a bug in the Rust
proposal.)
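
The difference is easy to demonstrate: NFC leaves a compatibility
character such as U+FB01 LATIN SMALL LIGATURE FI alone, while NFKC
folds it, merging identifiers a programmer may have meant to keep
distinct.  A sketch, assuming the unicode-normalization crate:

  use unicode_normalization::UnicodeNormalization;

  fn main() {
      let ident = "\u{FB01}le"; // "ﬁle" spelt with the fi ligature
      let nfc: String = ident.nfc().collect();
      let nfkc: String = ident.nfkc().collect();
      assert_eq!(nfc, "\u{FB01}le"); // NFC: ligature preserved
      assert_eq!(nfkc, "file");      // NFKC: folded to plain ASCII
  }
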
-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Philippe Verdy via Unicode
It could be argued that "modern" languages could use unique internal
identifiers for their syntax or API, independently of the name being
rendered.  The problem is that translated names may collide in
non-obvious ways and become ambiguous.
We've already seen the problems this caused in Excel with its translated
function names in spreadsheets (things being worse when the spreadsheet
itself does not contain a language identifier to indicate which language
its identifiers are defined in, so English-only installations of Excel,
without the MUI/LUI installed, cannot correctly open or process
spreadsheets created in other languages).

In practice, ASCII-only or ISO 8859-1-only identifiers work relatively
well, but there's always a problem entering these identifiers; a
solution would be to allow identifiers to have an ASCII-only alias, even
if the alias is not so friendly for the original authors.  But I've not
seen any programming language or API that allows defining aliases for
identifiers with exactly the same semantics as the few translated ones
that non-English users would prefer to see and use.  In C/C++ you may
have aliases, but this requires special support in the binary object or
library format to allow equivalent bindings and resolution.
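
A sketch of the aliasing idea in Rust (the names are invented for
illustration; Rust's use-renaming gives an alias with identical
semantics, though nothing obliges libraries to provide one):

  // A function named in the author's own language...
  fn longueur(s: &str) -> usize {
      s.chars().count()
  }

  // ...re-exported under an ASCII-only alias with the same semantics.
  use self::longueur as length;

  fn main() {
      assert_eq!(longueur("été"), length("été"));
  }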

For programming languages that are too near the machine level (assembly,
C, C++), or for common libraries intended to be used worldwide, in most
cases these names are English-only or use "augmented English" with
approximate transliterations of borrowed words (notably proper names) or
invented words (company names, trademarks, custom neologisms specific to
an app or service, and a lot of acronyms).  These APIs and languages
tend to create their own "jargon" with their own definitions (which may
be translated in their documentation).
Programmer comments, however, are very frequently written in any
language or script, because they are not constrained by uniqueness or by
name resolution and binding mechanisms.
But newer scripting languages are now very liberal (notably
JavaScript/ECMAScript) and are somewhat easy to rebind to other names to
generate an "equivalent" library, except where the library needs to work
through reflection and introspection mechanisms.  Scripting languages
designed for user personalisation, however, should be user-friendly and
need only be designed to work well with the language of the initial
author for his own usage (but cooperation will then be limited on the
Internet, and if one wants to share one's code, one will have to create
some basic translation or transliteration).

Most system-level APIs (filesystem or I/O,
multiprocessing/multithreading, networking) and data format options are
specified using English terms only (or near-English).  The various IDEs,
however, can make this language more friendly by providing documentation
searches, contextual helpers in the editor itself, hinting popups, or
various "machine learning" tools (including "natural language" query
wizards to help create and document the technical language using the
English-like jargon).

Most programming languages, however, do not define many reserved
keywords (in English), and there's rarely a need to translate them
(though I've seen several programming languages translating them into a
few well-known languages), notably languages designed for children or
for learning programming.  Some of these languages do not use a
plain-text syntax at all but use graphic diagrams with symbols, arrows
and boxes; programmers navigate the graphic layout or rearrange it to
fit new items or remove/combine them (an "advanced" view can then
present this layout in plain text using partly translated terms).  This
is easier if there's a clear syntactic separation between custom
identifiers created by users (not translated) and core keywords of the
language; generally this separation uses quotation marks around custom
identifiers, though this is not even needed everywhere for data-oriented
syntaxes like JSON, which needs no "reserved" identifiers and reserves
only some punctuation.

Anyway, all programming jobs require a basic proficiency in reading and
writing English, and require acquiring a common English-like technical
jargon (that jargon does not have to be perfect English; it is used as a
de facto standard, which evolves too fast to be correctly translated).
This jargon is still NOT normal English, and using it means that
documentation should still be adapted/translated into better English for
native English readers.  If you look at some well-known projects in
China, you'll see that many projects are documented and supported only
in Chinese, by programmers who have a very limited knowledge of English
(so their usage of English in the created technical jargon is
linguistically incorrect, but still correct for the technical needs; and
to translate/adapt these programs into other languages, Chinese is the
source of all translations, and must be present in all translation
files to 

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Alastair Houghton via Unicode
On 5 Jun 2018, at 07:09, Martin J. Dürst via Unicode  
wrote:
> 
> Hello Rebecca,
> 
> On 2018/06/05 12:43, Rebecca T via Unicode wrote:
> 
>> Something I’d love to see is translated keywords; shouldn’t be hard with a
>> line in the cargo.toml for a rudimentary lookup. Again, I’m of the opinion
>> that an imperfect implementation is better than no attempt. I remember
>> reading an article about a professor who translated the keywords in...
>> maybe it was Python? And found their students were much more engaged with
>> the material. Anecdotal, of course, but it’s stuck with me.
> 
> It would be good to have a reference for this. I can certainly see the point. 
> But on the other hand, I have also heard that using keywords in a foreign 
> language makes it clear that there may be a difference between the everyday 
> use of the word and the specific formal meaning in the programming language. 
> Then, there's also the problem that just translating keywords may work for 
> languages with the same sentence structure, but not for languages with a 
> completely different sentence structure. On top of that, keywords are just a 
> start; class/function/method names in libraries would have to be translated, 
> too, which would be much more work (especially if one wants to do a good job).

ALGOL68 was apparently localised (the standard explicitly supported that; it 
wasn’t an extension but rather something explicitly encouraged).  AppleScript 
was also designed to be (French and Japanese syntaxes were defined), and I have 
an inkling that someone once told me that at least one translation had actually 
shipped, though the translated variants are now deprecated as far as I’m aware.

Translated keywords are in some ways better than allowing non-ASCII 
identifiers, because they’re typically amenable to machine translation (indeed, 
in AppleScript, the scripts are not usually saved in ASCII anyway, but IIRC as 
a set of Apple Event Descriptors, so the “language” is just a matter for 
rendering to the user), which means that they don’t suffer from the problem of 
community fragmentation that non-ASCII identifiers *could* cause.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Alastair Houghton via Unicode
On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode  
wrote:
> 
> The Rust community is considering adding non-ascii identifiers, which follow 
> UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for 
> identifiers to be treated as equivalent under NFKC.
> 
> Are there any cases where this will lead to inconsistencies? I.e. can the 
> NFKC of a valid UAX 31 ident be invalid UAX 31?
> 
> (In general, are there other problems folks see with this proposal?)

IMO the major issue with non-ASCII identifiers is not a technical one, but 
rather that it runs the risk of fragmenting the developer community.  Everyone 
can *type* ASCII and everyone can read Latin characters (for reasonably wide 
values of “everyone”, at any rate… most computer users aren’t going to have a 
problem).  Not everyone can type Hangul, Chinese or Arabic (for instance), and 
there is no good fix or workaround for this.

Note that this is orthogonal to issues such as which language identifiers or 
comments are written in (indeed, there’s no problem with comments written in 
any script you please); the problem is that e.g. given a function

  func الطول(s : String)

it isn’t obvious to a non-Arabic speaking user how to enter الطول in order to 
call it.  This isn’t true of e.g.

  func pituus(s : String)

Even though “pituus” is Finnish, it’s still ASCII and everyone knows how to 
type that.

Copy and paste is not always a good solution here, I might add; in bidi text in 
particular, copy and paste can have confusing results (and results that vary 
depending on the editor being used).  There is also the issue of additional 
confusions that might be introduced; even if you stick to Latin scripts, this 
could be a problem sometimes (e.g. at small sizes, it’s hard to distinguish ă 
and ǎ or ȩ and ę), and of course there are Cyrillic and Greek characters that 
are indistinguishable from their Latin counterparts in most fonts.  UAX #31 
also manages (I suspect unintentionally?) to give a good example of a pair of 
Farsi identifiers that might be awkward to tell apart in certain fonts, namely 
نامهای and نامه‌ای; I think those are OK in monospaced fonts, where the join is 
reasonably wide, but at small point sizes in proportional fonts the difference 
in appearance is very subtle, particularly for a non-Arabic speaker.

You could avoid *some* of these issues by restricting the allowable scripts 
somehow (e.g. requiring that an identifier that had Latin characters could not 
also contain Cyrillic and so on) or perhaps by establishing additional 
canonical equivalences between similar looking characters (so that e.g. while a 
and а - or, more radically, ă and ǎ - might be different characters, you might 
nevertheless regard them as the same for symbol lookup).  It might be worth 
looking at UTR #36 and maybe UTR #39, not so much from a security standpoint, 
but more because those documents already have to deal with the problem of 
confusables.
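
A sketch of the script-restriction idea, assuming the unicode-script
crate (this deliberately crude check only rejects Latin mixed with
Cyrillic or Greek; UTR #39 defines the real mixed-script rules):

  use unicode_script::{Script, UnicodeScript};

  fn mixes_latin_with_cyrillic_or_greek(ident: &str) -> bool {
      let scripts: Vec<Script> = ident.chars().map(|c| c.script()).collect();
      let has = |s: Script| scripts.contains(&s);
      has(Script::Latin) && (has(Script::Cyrillic) || has(Script::Greek))
  }

  fn main() {
      // "pаyment" spelt with Cyrillic а (U+0430) mixed into Latin.
      assert!(mixes_latin_with_cyrillic_or_greek("p\u{0430}yment"));
      assert!(!mixes_latin_with_cyrillic_or_greek("payment"));
  }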

You could also recommend that people stick to ASCII unless there’s a good 
reason to do otherwise (and note that using non-ASCII characters might impact 
on their ability to collaborate with teams in other countries).

None of this is necessarily a reason *not* to support non-ASCII identifiers, 
but it *is* something to be cautious about.  Right now, most programming 
languages operate as a lingua franca, with code written by a wide range of 
people, not all of whom speak English, but all of whom can collaborate together 
to a greater or lesser degree by virtue of the fact that they all understand 
and can write code.  Going down this particular rabbit hole risks changing 
that, and not for the better, and IMO it’s important to understand that when 
considering whether the trade-off of being able to use non-ASCII characters in 
identifiers is genuinely worth it.

Kind regards,

Alastair.

--
http://alastairs-place.net