How about package names like ロシアМС21 (note that the МС is Cyrillic), or πr²の秘密, or エリ_хорошо_μ'sic_4⃣ever? They aren't names that people would usually choose for packages or variables, but they are meaningful names...
On 5 Dec 2016 at 16:39, "Reini Urban" <[email protected]> wrote:
>
> On Dec 4, 2016, at 11:45 PM, Richard Wordingham <[email protected]> wrote:
> >
> > On Sun, 4 Dec 2016 12:09:36 +0100
> > Reini Urban <[email protected]> wrote:
> >
> >> * normalize identifiers (NFC) and only store normalized variants.
> >> this should catch bidi spoofs, combining characters and such.
> >
> > That doesn't catch bidi spoofs.
>
> Right. Bidi spoofs are already caught by the IDStart, IDContinue rule.
>
> i.e. google <U+202E (right-to-left override), g, o, o, g, U+202C (pop
> directional formatting), l, e>
> is already caught as illegal.
>
> Mixing RTL scripts, such as Arabic with Latin, is not caught by the
> mixed-script rule per se.
>
> >> * check each unicode code point for its Script property and besides
> >> Latin, Common and Inherited only allow the first script, but error on
> >> any other mixed script. Additional scripts need to be declared.
> >> https://github.com/perl11/cperl/issues/229
> >>
> >> in perl like this:
> >> use utf8 'Greek', 'Cyrillic';
> >
> > Your rule isn't clear. Would an identifier like ψ_S be automatically
> > allowed?
>
> ψ_S contains Greek U+03C8, Common and Latin. Since Latin and Common are
> always allowed, the only new script is Greek. The first non-default
> script is automatically and silently allowed; only a mix with another
> non-default script, such as Cyrillic, would error or need an explicit
> declaration.
>
> So ψ_S alone is fine, if everything else is Greek.
> But mixing with the Cyrillic version would lead to an error.
>
> > I presume you're handling the spoofing of the SMALL PHI characters by
> > other means.
>
> The spoof attempt would be ѱ_S with Cyrillic U+0471, Common, Latin.
> 2 mixed scripts which are illegal, if undeclared.
> Same with PHI, which exists as Greek or Cyrillic. Most Greek characters
> have confusable Cyrillic counterparts, that’s why a declaration of
> use utf8 'Greek', 'Cyrillic';
> i.e.
> mixing those two sounds highly dangerous.
> With the UCD confusable table this would be an error. In my rule not,
> since the user declared those two scripts to be mixed.
>
> > For multilingual support, you would want rules more like
> >
> > 'After script X, allow script Y'.
>
> Can you expand on that with an example? I’m no expert on this.
>
> Like after Hangul, allow Han? After Hiragana, allow Katakana?
>
> >> Of course there exist several languages which require more than one
> >> script,
> > <snip>
> >> or African languages as some have other than Latin roots, e.g.
> >> Ethiopian from Semitic.
> >
> > I don't see your problem here. What problem do you see with Amharic?
>
> Amharic is not defined as a UCD script property. Its alphabet is called
> Ge’ez, which we call Ethiopic in the UCD. But that’s all I know. I’m not
> a domain expert. Does Ethiopic use other Semitic scripts in its alphabet
> or is it complete? I learned some CJK languages, where you historically
> allow mixed scripts. But for other scripts I’m clueless.
> The examples I got mix it with Runic. Valid or nonsense?
>
> The problem is to decide which scripts are commonly mixed in which
> languages to allow them to be valid identifiers.
>
> How about the many Indian scripts? Do they mix?
> Being an Indian movie expert tells me that Indian languages usually
> don’t mix.
> They make Tamil and Bengali versions of Hindi movies, and usually fall
> back to English to get common points across the barrier. But their
> scripts? No idea.
>
> >> Indian languages also sound problematic,
> >
> > Is this the ZWJ/ZWNJ issue? That surely is a problem within a script.
> >
> >> and
> >> all the Old_<script>
> >
> > Now I am confused. What problem do you see that you don't have in the
> > Latin script?
>
> That I have no idea if those Old_<script> alphabets are still in use to
> create aliases for them.
>
> In the examples in perl which partially came from parrot there’s a wild
> eclectic mix of various scripts which make no sense at all. So I don’t
> know if I can trust those tests, that they make sense and are readable
> at all. My guess is that the authors just liked code golfing and picked
> random unicode characters. It’s from perl after all.
>
> Such as this perl test t/mro/isa_c3_utf8.t
>
> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam Hiragana );
>
> ...
> package 캎oẃ;
> package urḲḵk;
> @urḲḵk::ISA = 'kഌoんḰ';
> package к;
> @urḲḵk::ISA = ('kഌoんḰ', '캎oẃ');
> package ṭ화ckэ;
> ...
>
> These identifiers are unreadable, because I don’t assume that anybody will
> be able to understand Hangul, Cyrillic, Ethiopic, Canadian_Aboriginal,
> Malayalam and Hiragana at once.
> I understand a bit of Hangul, Cyrillic and Hiragana, but the mix sounds
> highly illegal to me.
>
> So my rule makes sense. You need to declare non-default scripts used in
> your identifiers if mixed.
> (not strings. these can be everything, even illegal UTF-8).
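For readers who want to experiment with the rule discussed in the quoted message, here is a minimal Python sketch of one possible reading of it. It is not cperl's implementation. The names ScriptChecker, MixedScriptError, script_of and SCRIPT_RANGES are hypothetical, and the range table is a rough approximation of the Unicode Script property that covers only a handful of scripts (real code would load the UCD Scripts.txt data). It also assumes that declared scripts are always allowed and that, in addition, the first undeclared non-default script seen is silently accepted; the thread does not spell out that interaction.

```python
import unicodedata

# Scripts that are always allowed, per the rule quoted in the thread.
DEFAULT_SCRIPTS = {"Latin", "Common", "Inherited"}

# Hypothetical, heavily simplified stand-in for the UCD Script property.
# Whole blocks are assigned to one script, which is only an approximation.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x00C0, 0x024F, "Latin"),
    (0x1E00, 0x1EFF, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3041, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7A3, "Hangul"),
]


def script_of(ch):
    """Return an approximate script name for a single character."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    # Assumption: letters outside the table count as their own "Unknown"
    # script; digits, underscore, marks and the like are treated as Common.
    return "Unknown" if ch.isalpha() else "Common"


class MixedScriptError(ValueError):
    """Raised when an identifier uses an undeclared additional script."""


class ScriptChecker:
    """Tracks allowed scripts across all identifiers of one compilation unit."""

    def __init__(self, declared=()):
        self.allowed = set(DEFAULT_SCRIPTS) | set(declared)
        self.first_script = None  # first undeclared non-default script seen

    def check(self, identifier):
        """Normalize to NFC, then enforce the mixed-script rule."""
        normalized = unicodedata.normalize("NFC", identifier)
        for ch in normalized:
            script = script_of(ch)
            if script in self.allowed:
                continue
            if self.first_script is None:
                # The first non-default script is silently allowed.
                self.first_script = script
                self.allowed.add(script)
                continue
            raise MixedScriptError(
                f"{identifier!r}: script {script} mixed with "
                f"{self.first_script}; declare it explicitly"
            )
        return normalized
```

With a fresh ScriptChecker(), check("ψ_S") is accepted and makes Greek the first script, after which check("ѱ_S") raises MixedScriptError because U+0471 is Cyrillic. Constructing the checker with declared={"Greek", "Cyrillic"} accepts both, which mirrors the `use utf8 'Greek', 'Cyrillic';` declaration. Under the same sketch, ロシアМС21 is rejected unless Katakana and Cyrillic are both declared.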

