Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy rep...@bugs.python.org wrote
on Fri, 12 Aug 2011 23:05:27 -:
Ouch!
Do the rejected characters qualify as identifier characters as defined
in Reference 2.3 Identifiers and keywords?
http://docs.python.org/py3k/reference/lexical_analysis.html#identifiers
Yes, that's right, they do. You're using the standard IDS and IDC, and
XIDS and XIDC, definitions. Here were the three identifiers that were
a problem:
𝔘𝔫𝔦𝔠𝔬𝔡𝔢 = super
𐐔𐐯𐑅𐐨𐑉𐐯𐐻 = Deseret
𐌰𐍄𐍄𐌰‿𐌿𐌽𐍃𐌰𐍂 = Gothic our father
If you cannot read those, then when piped through `uniquote -v` they are:
\N{MATHEMATICAL FRAKTUR CAPITAL U}\N{MATHEMATICAL FRAKTUR SMALL N}
\N{MATHEMATICAL FRAKTUR SMALL I}\N{MATHEMATICAL FRAKTUR SMALL C}
\N{MATHEMATICAL FRAKTUR SMALL O}\N{MATHEMATICAL FRAKTUR SMALL D}
\N{MATHEMATICAL FRAKTUR SMALL E} = super
\N{DESERET CAPITAL LETTER DEE}\N{DESERET SMALL LETTER SHORT E}
\N{DESERET SMALL LETTER ES}\N{DESERET SMALL LETTER LONG I}
\N{DESERET SMALL LETTER ER}\N{DESERET SMALL LETTER SHORT E}
\N{DESERET SMALL LETTER TEE} = Deseret
\N{GOTHIC LETTER AHSA}\N{GOTHIC LETTER TEIWS}\N{GOTHIC LETTER TEIWS}
\N{GOTHIC LETTER AHSA}\N{UNDERTIE}\N{GOTHIC LETTER URUS}
\N{GOTHIC LETTER NAUTHS}\N{GOTHIC LETTER SAUIL}\N{GOTHIC LETTER AHSA}
\N{GOTHIC LETTER RAIDA} = Gothic our father
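For readers without uniquote installed, roughly the same named-character rendering can be sketched in Python with the standard unicodedata module. This is only an approximation of what `uniquote -v` does, not its implementation: the helper name verbose_names is hypothetical, the U+XXXX fallback for unnamed characters is an assumption of this sketch, and it assumes Python 3.3 or later, where iterating over a string yields whole code points.

```python
import unicodedata


def verbose_names(text):
    """Render each non-ASCII character as \\N{NAME}; ASCII passes through."""
    parts = []
    for ch in text:
        if ord(ch) < 0x80:
            parts.append(ch)
        else:
            # Fall back to a U+XXXX label (an assumption of this sketch)
            # when this Python's Unicode database has no name for ch.
            name = unicodedata.name(ch, "U+%04X" % ord(ch))
            parts.append("\\N{%s}" % name)
    return "".join(parts)


# The first two letters of the Fraktur identifier, written as escapes
# so that this source file itself stays pure ASCII.
print(verbose_names("\U0001D518\U0001D52B = super"))
```

The names come from whichever Unicode version the running Python ships with, so a very old interpreter could fall back to the U+XXXX form for characters it does not know.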
I'm not sure whether you recognize the scripts they belong to, but they're
all in the astral planes. Using `uniquote -x` on them shows:
\x{1D518}\x{1D52B}\x{1D526}\x{1D520}\x{1D52C}\x{1D521}\x{1D522} = super
\x{10414}\x{1042F}\x{10445}\x{10428}\x{10449}\x{1042F}\x{1043B} = Deseret
\x{10330}\x{10344}\x{10344}\x{10330}\x{203F}\x{1033F}\x{1033D}\x{10343}\x{10330}\x{10342} = Gothic our father
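The hex rendering, the astral-plane claim, and the identifier question can all be checked from Python itself. The following is a minimal sketch, assuming Python 3.3 or later (where strings are sequences of code points rather than UTF-16 code units). The names IDENTIFIERS, hex_escapes, and bmp_code_points are hypothetical helpers introduced for illustration, and hex_escapes only imitates the look of `uniquote -x`.

```python
# The three rejected identifiers, written with escapes so that this
# source file stays pure ASCII.
IDENTIFIERS = {
    "super": "\U0001D518\U0001D52B\U0001D526\U0001D520"
             "\U0001D52C\U0001D521\U0001D522",
    "Deseret": "\U00010414\U0001042F\U00010445\U00010428"
               "\U00010449\U0001042F\U0001043B",
    "Gothic our father": "\U00010330\U00010344\U00010344\U00010330\u203F"
                         "\U0001033F\U0001033D\U00010343\U00010330\U00010342",
}


def hex_escapes(text):
    """Render each non-ASCII code point as \\x{HEX}; ASCII passes through."""
    return "".join(
        ch if ord(ch) < 0x80 else "\\x{%X}" % ord(ch) for ch in text
    )


def bmp_code_points(text):
    """Return the code points of text that lie in plane 0 (the BMP)."""
    return [ord(ch) for ch in text if ord(ch) <= 0xFFFF]


for label, name in IDENTIFIERS.items():
    print(hex_escapes(name), "=", label)
    print("  BMP code points:", ["U+%04X" % cp for cp in bmp_code_points(name)])
    print("  str.isidentifier():", name.isidentifier())
```

Every letter in the three names lies outside the BMP; the only BMP code point is the UNDERTIE (U+203F) joining the two Gothic words. On an interpreter that handles astral code points correctly, `str.isidentifier()` accepts all three strings.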
As to whether they're proper identifiers per your reference above, I
will take the first letter from each of 𝔘𝔫𝔦𝔠𝔬𝔡𝔢, 𐐔𐐯𐑅𐐨𐑉𐐯𐐻, and 𐌰𐍄𐍄𐌰‿𐌿𐌽𐍃𐌰𐍂,
which are respectively 𝔘, 𐐔, and 𐌰, or
MATHEMATICAL FRAKTUR CAPITAL U
DESERET CAPITAL LETTER DEE
GOTHIC LETTER AHSA
or
1D518
10414
10330
and show you the full Unicode properties of these rejected code points.
This requires the uniprops command, with which these three commands
are equivalent:
% uniprops -ga 𝔘 𐐔 𐌰
% uniprops -ga 1D518 10414 10330
% uniprops -ga MATHEMATICAL FRAKTUR CAPITAL U DESERET CAPITAL LETTER DEE GOTHIC LETTER AHSA
and produce this output:
U+1D518 ‹𝔘› \N{MATHEMATICAL FRAKTUR CAPITAL U}
\w \pL \p{LC} \p{L_} \p{L} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols
Cased Cased_Letter LC Changes_When_NFKC_Casefolded
CWKCF Common Zyyy Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue
IDC ID_Start IDS Letter L_ Uppercase_Letter Math
Mathematical_Alphanumeric_Symbols Print Upper Uppercase Word
XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L
Block=Mathematical_Alphanumeric_Symbols Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR
Canonical_Combining_Class=NR General_Category=Cased_Letter Script=Common
Decomposition_Type=Font DT=Font Decomposition_Type=Non_Canon
Decomposition_Type=Non_Canonical DT=NonCanon
East_Asian_Width=Neutral GC=LC General_Category=L
General_Category=Letter General_Category=L_ General_Category=LC GC=L
General_Category=Lu General_Category=Uppercase_Letter GC=Lu
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup
Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL
Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2
Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy
Sentence_Break=UP Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE
Word_Break=LE _X_Begin
U+10414 ‹𐐔› \N{DESERET CAPITAL LETTER DEE}
\w \pL \p{LC} \p{L_} \p{L} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC
Changes_When_Casefolded CWCF Changes_When_Casemapped
CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF
Deseret Dsrt Lu L Gr_Base Grapheme_Base Graph GrBase
ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper
Uppercase Word XID_Continue XIDC XID_Start XIDS
X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper
X_POSIX_Word
Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Deseret
Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR
Canonical_Combining_Class=NR General_Category=Cased_Letter