I'm working on adding mixed-script confusable protection to a programming
language, cperl (a perl5 fork), for security reasons. It applies to
identifiers, i.e. variable names, package names, function names and literals.

This is a bit different from the typical use cases of libidna, in email or
browsers.

Is anybody aware of any other language implementation that does confusable or
mixed-script protection?
I think R has something, because it ships this header:
 https://cran.r-project.org/bin/windows/extsoft/3.4/include/unicode/uspoof.h
but I found nothing else, which is quite annoying.

My approach is as follows:

* Normalize identifiers (NFC) and only store the normalized variants. This
  should catch bidi spoofs, combining characters and such.
* Check each unicode code point for its Script property. Besides Latin,
  Common and Inherited, only the first script encountered is allowed; any
  other mixed script is an error. Additional scripts need to be declared.
  https://github.com/perl11/cperl/issues/229

Declaring additional scripts looks like this in perl:
    use utf8 'Greek', 'Cyrillic';
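
The normalize-then-check-scripts approach can be sketched roughly as follows, in Python for brevity. This is only an illustration under stated assumptions, not the cperl parser code: SCRIPT_RANGES is a hypothetical, heavily truncated stand-in for Unicode's Scripts.txt, check_identifier and script_of are made-up names, and the "first script" is tracked per identifier here, which may differ from the scope the real check uses.

```python
import unicodedata

# Hypothetical, heavily truncated script table: (first, last, script).
# A real implementation would generate this from Unicode's Scripts.txt.
SCRIPT_RANGES = [
    (0x0030, 0x0039, "Common"),     # ASCII digits
    (0x005F, 0x005F, "Common"),     # underscore
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x00C0, 0x00D6, "Latin"),
    (0x00D8, 0x00F6, "Latin"),
    (0x00F8, 0x024F, "Latin"),
    (0x0300, 0x036F, "Inherited"),  # combining diacritical marks
    (0x038E, 0x03A1, "Greek"),
    (0x03A3, 0x03E1, "Greek"),
    (0x0400, 0x0484, "Cyrillic"),
]

ALWAYS_ALLOWED = {"Latin", "Common", "Inherited"}


def script_of(ch):
    """Look up the script of one character in the truncated table."""
    cp = ord(ch)
    for first, last, script in SCRIPT_RANGES:
        if first <= cp <= last:
            return script
    return "Unknown"  # assumption: anything outside the table is its own script


def check_identifier(name, declared=()):
    """Return the NFC form of name, or raise ValueError on an undeclared mix."""
    name = unicodedata.normalize("NFC", name)
    allowed = ALWAYS_ALLOWED | set(declared)
    first_script = None
    for ch in name:
        script = script_of(ch)
        if script in allowed:
            continue
        if first_script is None:
            first_script = script  # the first extra script is accepted as-is
        elif script != first_script:
            raise ValueError(
                f"mixed scripts {first_script} and {script} in {name!r}; "
                f"declare {script} to allow it"
            )
    return name
```

With this sketch, check_identifier("\u0430\u03b4") (Cyrillic а followed by Greek δ) raises, while check_identifier("\u0430\u03b4", declared=("Greek",)) passes, mirroring the effect of declaring the extra script.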

utf8 is the pragma that allows unicode identifiers (not strings) to be added
to the symbol table.
Obviously unicode identifiers carry risks when reviewing a codebase, and
spoofed names might even bypass test suites.
The script check itself is fast enough, and has no measurable cost in the
parser.

Unicode has a nice security/confusables.txt table which could be used for
more fine-grained checks, yes.
But I fear this is too much overhead for the generic parser, and I think that
avoiding the problem by forbidding mixed scripts, or requiring them to be
declared, is much easier and more declarative.
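
For comparison, a table-based check would look roughly like the sketch below, which follows the skeleton idea (decompose, map each character to its prototype, decompose again). It is an illustration only: CONFUSABLES is a tiny hand-picked excerpt standing in for the full table, and skeleton and confusable are hypothetical helper names. Every new identifier would need a skeleton plus a lookup against the skeletons of the existing ones, which is the overhead in question.

```python
import unicodedata

# Tiny hand-picked excerpt in the spirit of confusables.txt
# (source character -> prototype). The real table is far larger.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(name):
    """Decompose, replace each character by its prototype, decompose again."""
    decomposed = unicodedata.normalize("NFD", name)
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(a, b):
    """True if two distinct identifiers collapse to the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)
```

Here confusable("cope", "\u0441\u043e\u0440\u0435") is true, because the all-Cyrillic string maps to the same skeleton as the Latin one.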

Of course there exist several languages which require more than one script,
like
Japanese = Hiragana and Katakana and maybe Han,
Korean = Hangul + Han, …
or African languages, as some have other than Latin roots, e.g. Ethiopian
from Semitic.
Indian languages also sound problematic, as do all the Old_<script> scripts.

For these I just add aliases to allow multiple Scripts.
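
Such aliases could be expanded along these lines. This is a sketch only: the alias names and their contents are taken from the examples above, and SCRIPT_ALIASES and expand_declared are hypothetical names, not what cperl actually uses.

```python
# Hypothetical alias table: one declared name allows several scripts.
SCRIPT_ALIASES = {
    "Japanese": {"Hiragana", "Katakana", "Han"},
    "Korean": {"Hangul", "Han"},
}


def expand_declared(declared):
    """Expand alias names into the set of script names they allow."""
    allowed = set()
    for name in declared:
        # A name that is not an alias is taken as a plain script name.
        allowed |= SCRIPT_ALIASES.get(name, {name})
    return allowed
```

Declaring "Japanese" would then allow Hiragana, Katakana and Han at once, while a plain name such as "Greek" passes through unchanged.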

Reini Urban
