Re: Mixed-Script confusables in prog.languages

Richard Wordingham Thu, 15 Dec 2016 12:35:16 -0800

On Wed, 14 Dec 2016 18:44:39 +0100
Reini Urban <[email protected]> wrote:


> On Dec 5, 2016, at 3:31 PM, Richard Wordingham
> <[email protected]> wrote:

> > The choice with PHI includes:
> > 
> > U+0278 LATIN SMALL LETTER PHI
> > U+03C6 GREEK SMALL LETTER PHI
> > 
> > a Greek (!) script character with compatibility decomposition to
> > U+03C6
> > 
> > U+03D5 GREEK PHI SYMBOL
> > 
> > and a whole host of common script characters with compatibility
> > decomposition to U+03C6:
> > 
> > U+1D6D7 MATHEMATICAL BOLD SMALL PHI
> > U+1D6DF MATHEMATICAL BOLD PHI SYMBOL
> > U+1D711 MATHEMATICAL ITALIC SMALL PHI
> > U+1D719 MATHEMATICAL ITALIC PHI SYMBOL
> > U+1D74B MATHEMATICAL BOLD ITALIC SMALL PHI
> > U+1D753 MATHEMATICAL BOLD ITALIC PHI SYMBOL
> > U+1D785 MATHEMATICAL SANS-SERIF BOLD SMALL PHI
> > U+1D78D MATHEMATICAL SANS-SERIF BOLD PHI SYMBOL
> > U+1D7BF MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PHI
> > U+1D7C7 MATHEMATICAL SANS-SERIF BOLD ITALIC PHI SYMBOL
> > 
> > They are all ID_Start.  
> 
> Oh my. Here be dragons. So I need to add some trie tables to add
> warnings for those rules as well. I don’t want to raise an error based
> only on some obscure confusables rule yet. perl doesn’t even ship the
> security tables, so people are not aware of it.

Another solution would be to treat two identifiers as the same if they
have the same NFKC normalisation.
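
A minimal sketch of that idea in Python, assuming the standard
unicodedata module; the names identifier_key and same_identifier are
made up for illustration and come from no existing tool:

```python
import unicodedata

# The Greek and mathematical phi characters listed earlier in the thread.
PHI_VARIANTS = [
    0x03C6, 0x03D5,
    0x1D6D7, 0x1D6DF, 0x1D711, 0x1D719, 0x1D74B, 0x1D753,
    0x1D785, 0x1D78D, 0x1D7BF, 0x1D7C7,
]

def identifier_key(name):
    """Hypothetical helper: the NFKC form used to compare identifiers."""
    return unicodedata.normalize("NFKC", name)

def same_identifier(a, b):
    """Treat two identifiers as the same if their NFKC forms are equal."""
    return identifier_key(a) == identifier_key(b)

# Every listed variant folds to U+03C6 GREEK SMALL LETTER PHI.
folded = {identifier_key(chr(cp)) for cp in PHI_VARIANTS}
```

U+0278 LATIN SMALL LETTER PHI has no decomposition, so a comparison of
this kind would still keep it distinct from U+03C6; it would need a
confusables rule rather than normalisation.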

> > You didn't mention the inherited script.  Is that automatically
> > allowed, e.g. φ̈ᵣ <U+03C6, U+0308 COMBINING DIAERESIS, U+1D63 LATIN
> > SUBSCRIPT SMALL LETTER R> (scripts: Greek, inherited, Latin)?  I
> > encountered that variable name in a radar specification last week.  
> 
> Inherited is allowed with ID_Continue, yes. Not in ID_Start position.
> Combiners are normalized to NFC.

<U+03C6, U+0308, U+1D63> is unchanged under normalisation to NFC and
NFD.  NFKC and NFKD change only U+1D63, which has a compatibility
decomposition to U+0072 LATIN SMALL LETTER R.

> > There might be issues - it's possible that क̐ <U+0915 DEVANAGARI
> > LETTER KA, U+0310 COMBINING CANDRABINDU> might spoof कँ <U+0915,
> > U+0901 DEVANAGARI SIGN CANDRABINDU>.  

> \x{915}\x{310} is legal Devanagari normalized to one char.

I don't know what you mean by this statement. <U+0915, U+0310> is
also unchanged under the 4 normalisations.
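
This is easy to check mechanically. A small sketch, again assuming
Python's unicodedata module; changed_forms is a made-up helper name:

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def changed_forms(s):
    """Return the normalisation forms under which s is altered."""
    return [form for form in FORMS if unicodedata.normalize(form, s) != s]

# <U+0915, U+0310>: KA followed by COMBINING CANDRABINDU.
ka_combining = changed_forms("\u0915\u0310")

# <U+0915, U+0901>: KA followed by DEVANAGARI SIGN CANDRABINDU.
ka_devanagari = changed_forms("\u0915\u0901")
```

Both lists come out empty, so no normalisation form maps either
sequence to the other, or to a single character.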
 
> \x{915}\x{901} are two legal Devanagari characters.
> but they are confusables. This would need special confusable rules.

Additionally, U+0310 can be confused quite readily with the sequence
<U+0306 COMBINING BREVE, U+0307 COMBINING DOT ABOVE>.
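
Since normalisation does not help here, detection needs a mapping
table.  A toy sketch of a skeleton-style comparison in Python follows.
The table is hand-written for this example and is an assumption, not
data taken from the Unicode confusables.txt file; toy_skeleton is a
hypothetical name:

```python
import unicodedata

# Hypothetical, hand-written prototypes for illustration only.
TOY_PROTOTYPES = {
    "\u0901": "\u0310",        # DEVANAGARI SIGN CANDRABINDU
    "\u0306\u0307": "\u0310",  # COMBINING BREVE + COMBINING DOT ABOVE
}

def toy_skeleton(s):
    """Map s to a comparison key: NFD, apply the toy table, NFD again."""
    s = unicodedata.normalize("NFD", s)
    # Try longer source sequences first so multi-character keys win.
    for src in sorted(TOY_PROTOTYPES, key=len, reverse=True):
        s = s.replace(src, TOY_PROTOTYPES[src])
    return unicodedata.normalize("NFD", s)

def toy_confusable(a, b):
    """Two strings are flagged when their toy skeletons coincide."""
    return toy_skeleton(a) == toy_skeleton(b)
```

With this table, <U+0915, U+0901>, <U+0915, U+0310> and
<U+0915, U+0306, U+0307> all share one skeleton, although no two of them
are equal under NFKC.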

Richard.
