Technical Report issues would be fine. I think #1 is worth considering. For #2, see other message to Peter Kirk.
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Marco Cimarosti" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, August 22, 2003 06:04
Subject: RE: Proposed Draft UTR #31 - Syntax Characters

> Rick McGowan wrote:
> > the process as possible so that it can be considered
> > The draft is found at http://www.unicode.org/reports/tr31/
> > and feedback can be submitted as described there.
>
> (Before submitting official feedback, I'd like to discuss my comments here.
> BTW, which "Type of Message" should I use in the feedback form? Is it OK to
> use "Technical Report or Tech Note issues"?)
>
> My two cents are both about adding characters to class <Pattern_Syntax> in
> "4.1 Proposed Pattern Properties".
>
> IMHO:
>
> 1. Full-width, half-width, and "small" punctuation characters should be
> in class <Pattern_Syntax>, like their "normal width" counterparts.
>
> 2. Non-Latin punctuation characters should be in class
> <Pattern_Syntax>, like their Latin counterparts.
>
> The rationale for suggestion 1 is that <wide>, <narrow> and <small>
> compatibility characters are substantially identical (in appearance and
> function) to their "normal width" counterparts. A parser that allows an
> unquoted full-width punctuation character in an identifier is guaranteed to
> confuse the user.
>
> E.g., consider the following expression:
>
>     foo,bar
>
> To me, it *definitely* looks like two identifiers separated by a comma, and
> I expect my parser to agree with me on this, even if the "comma" is actually
> a full-width comma. I am not saying that the parser must necessarily accept
> a full-width comma in that position: it is perfectly OK if the above
> expression causes a syntax error such as: "Illegal character U+FF0C
> (FULLWIDTH COMMA) after identifier <foo>".
>
> But what the parser should absolutely *not* do, IMHO, is handle "foo,bar"
> as a *single* identifier! Doing such a thing is guaranteed to cause me
> trouble. E.g., I might receive a puzzling error message saying: "Parameter
> missing: this statement requires 2 parameters", while I can *see* that there
> *are* two parameters: "foo" and "bar"...
>
> The rationale for suggestion 2 is very similar. E.g., the following
> expression looks like a perfectly legal C++ or Java statement:
>
>     return;
>
> If the compiler tells me: "Undeclared identifier", I may go crazy for the
> whole day trying to figure out what's going on... But if it tells me "Illegal
> character U+037E (GREEK QUESTION MARK) after keyword <return>", then I
> immediately understand that something is wrong with that "semicolon".
>
> The reason I keep suggestions 1 and 2 separate is that, in the case of
> <wide>, <narrow> and <small> compatibility characters, it is trivial to
> determine the corresponding regular character, while in the case of
> non-Latin punctuation there is room for discussing which punctuation
> characters are similar enough (in function or appearance) to which Latin
> punctuation character.
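
The parser behaviour asked for above could be sketched roughly as follows. This is only an illustration in Python, not anything taken from the UTR #31 draft: the `tokenize` function, the `EXTRA_SYNTAX` and `ASCII_SYNTAX` sets, and the tiny grammar are all assumptions made up for the example, and only the two characters mentioned in the message are covered.

```python
import unicodedata

# Hypothetical stand-in for the proposed additions to <Pattern_Syntax>:
# only the two characters used as examples in the message.
EXTRA_SYNTAX = {"\uFF0C", "\u037E"}  # FULLWIDTH COMMA, GREEK QUESTION MARK
ASCII_SYNTAX = set(",;()")           # tiny illustrative grammar, not a real language


def tokenize(text):
    """Split text into identifiers and ASCII punctuation tokens.

    A character in EXTRA_SYNTAX is never absorbed into an identifier;
    it is reported as an illegal character instead.
    """
    tokens, ident = [], ""
    for ch in text:
        if ch in EXTRA_SYNTAX:
            name = unicodedata.name(ch)
            where = f" after identifier <{ident}>" if ident else ""
            raise ValueError(f"Illegal character U+{ord(ch):04X} ({name}){where}")
        if ch in ASCII_SYNTAX or ch.isspace():
            # Punctuation or whitespace ends the current identifier.
            if ident:
                tokens.append(ident)
                ident = ""
            if ch in ASCII_SYNTAX:
                tokens.append(ch)
        else:
            ident += ch
    if ident:
        tokens.append(ident)
    return tokens
```

With an ASCII comma, `tokenize("foo,bar")` yields three tokens. With U+FF0C in the same position, the sketch raises an error naming the character rather than returning one long identifier, which is the point of suggestion 1.
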
>
> For full-width, half-width, and "small" punctuation characters, my
> suggestion is to add the following lines to "4.1 Proposed Pattern
> Properties":
>
> FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
> FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
> FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
> FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
> FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
> FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
> FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
> FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
> FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
> FF64       ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
> FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
> FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
> FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE
>
> For non-Latin punctuation characters, this is my tentative list of
> characters that may cause trouble if used in identifiers, and which,
> consequently, should be added to class <Pattern_Syntax>:
>
> 037E GREEK QUESTION MARK
> 0387 GREEK ANO TELEIA
> 055C ARMENIAN EXCLAMATION MARK
> 055D ARMENIAN COMMA
> 055E ARMENIAN QUESTION MARK
> 0589 ARMENIAN FULL STOP
> 060C ARABIC COMMA
> 060D ARABIC DATE SEPARATOR
> 061B ARABIC SEMICOLON
> 061F ARABIC QUESTION MARK
> 066A ARABIC PERCENT SIGN
> 066B ARABIC DECIMAL SEPARATOR
> 066C ARABIC THOUSANDS SEPARATOR
> 06D4 ARABIC FULL STOP
> 0964 DEVANAGARI DANDA
> 0965 DEVANAGARI DOUBLE DANDA
> 10FB GEORGIAN PARAGRAPH SEPARATOR
> 1362 ETHIOPIC FULL STOP
> 1363 ETHIOPIC COMMA
> 1364 ETHIOPIC SEMICOLON
> 1365 ETHIOPIC COLON
> 1366 ETHIOPIC PREFACE COLON
> 1367 ETHIOPIC QUESTION MARK
> 1368 ETHIOPIC PARAGRAPH SEPARATOR
> 166E CANADIAN SYLLABICS FULL STOP
> 1802 MONGOLIAN COMMA
> 1803 MONGOLIAN FULL STOP
> 1804 MONGOLIAN COLON
> 1808 MONGOLIAN MANCHU COMMA
> 1809 MONGOLIAN MANCHU FULL STOP
> 1944 LIMBU EXCLAMATION MARK
> 1945 LIMBU QUESTION MARK
>
> But I am not 100% sure about all the above characters. Should any of them be
> removed from the list (i.e., allowed in identifiers)?
>
> The following list includes all the non-Latin punctuation characters which I
> feel are not worth including in class <Pattern_Syntax>, because I think that,
> for one reason or another, they would cause no problem in identifiers:
>
> 055A ARMENIAN APOSTROPHE
> 055B ARMENIAN EMPHASIS MARK
> 055F ARMENIAN ABBREVIATION MARK
> 058A ARMENIAN HYPHEN
> 05BE HEBREW PUNCTUATION MAQAF
> 05C0 HEBREW PUNCTUATION PASEQ
> 05C3 HEBREW PUNCTUATION SOF PASUQ
> 05F3 HEBREW PUNCTUATION GERESH
> 05F4 HEBREW PUNCTUATION GERSHAYIM
> 066D ARABIC FIVE POINTED STAR
> 0700 SYRIAC END OF PARAGRAPH
> 0701 SYRIAC SUPRALINEAR FULL STOP
> 0702 SYRIAC SUBLINEAR FULL STOP
> 0703 SYRIAC SUPRALINEAR COLON
> 0704 SYRIAC SUBLINEAR COLON
> 0705 SYRIAC HORIZONTAL COLON
> 0706 SYRIAC COLON SKEWED LEFT
> 0707 SYRIAC COLON SKEWED RIGHT
> 0708 SYRIAC SUPRALINEAR COLON SKEWED LEFT
> 0709 SYRIAC SUBLINEAR COLON SKEWED RIGHT
> 070A SYRIAC CONTRACTION
> 070B SYRIAC HARKLEAN OBELUS
> 070C SYRIAC HARKLEAN METOBELUS
> 070D SYRIAC HARKLEAN ASTERISCUS
> 0970 DEVANAGARI ABBREVIATION SIGN
> 0DF4 SINHALA PUNCTUATION KUNDDALIYA
> 0E4F THAI CHARACTER FONGMAN
> 0E5A THAI CHARACTER ANGKHANKHU
> 0E5B THAI CHARACTER KHOMUT
> 0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA
> 0F05 TIBETAN MARK CLOSING YIG MGO SGAB MA
> 0F06 TIBETAN MARK CARET YIG MGO PHUR SHAD MA
> 0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA
> 0F08 TIBETAN MARK SBRUL SHAD
> 0F09 TIBETAN MARK BSKUR YIG MGO
> 0F0A TIBETAN MARK BKA- SHOG YIG MGO
> 0F0B TIBETAN MARK INTERSYLLABIC TSHEG
> 0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
> 0F0D TIBETAN MARK SHAD
> 0F0E TIBETAN MARK NYIS SHAD
> 0F0F TIBETAN MARK TSHEG SHAD
> 0F10 TIBETAN MARK NYIS TSHEG SHAD
> 0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
> 0F12 TIBETAN MARK RGYA GRAM SHAD
> 0F3A TIBETAN MARK GUG RTAGS GYON
> 0F3B TIBETAN MARK GUG RTAGS GYAS
> 0F3C TIBETAN MARK ANG KHANG GYON
> 0F3D TIBETAN MARK ANG KHANG GYAS
> 0F85 TIBETAN MARK PALUTA
> 104A MYANMAR SIGN LITTLE SECTION
> 104B MYANMAR SIGN SECTION
> 104C MYANMAR SYMBOL LOCATIVE
> 104D MYANMAR SYMBOL COMPLETED
> 104E MYANMAR SYMBOL AFOREMENTIONED
> 104F MYANMAR SYMBOL GENITIVE
> 1361 ETHIOPIC WORDSPACE
> 166D CANADIAN SYLLABICS CHI SIGN
> 169B OGHAM FEATHER MARK
> 169C OGHAM REVERSED FEATHER MARK
> 16EB RUNIC SINGLE PUNCTUATION
> 16EC RUNIC MULTIPLE PUNCTUATION
> 16ED RUNIC CROSS PUNCTUATION
> 1735 PHILIPPINE SINGLE PUNCTUATION
> 1736 PHILIPPINE DOUBLE PUNCTUATION
> 17D4 KHMER SIGN KHAN
> 17D5 KHMER SIGN BARIYOOSAN
> 17D6 KHMER SIGN CAMNUC PII KUUH
> 17D8 KHMER SIGN BEYYAL
> 17D9 KHMER SIGN PHNAEK MUAN
> 17DA KHMER SIGN KOOMUUT
> 1800 MONGOLIAN BIRGA
> 1801 MONGOLIAN ELLIPSIS
> 1805 MONGOLIAN FOUR DOTS
> 1806 MONGOLIAN TODO SOFT HYPHEN
> 1807 MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
> 180A MONGOLIAN NIRUGU
> 10100 AEGEAN WORD SEPARATOR LINE
> 10101 AEGEAN WORD SEPARATOR DOT
> 1039F UGARITIC WORD DIVIDER
>
> Should any of the above characters be added to <Pattern_Syntax> (i.e. *not*
> allowed in identifiers)?
>
> _ Marco
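
For anyone who wants to experiment with the proposed lines, here is a rough Python sketch that reads entries in the `XXXX..YYYY ; Pattern_Syntax # comment` layout used above and collects the code points into a set. The function name `parse_property_lines` and the `PROPOSED` sample (three of the lines from the message) are assumptions for illustration; this is not how the Unicode data files are officially processed.

```python
import unicodedata

# Three of the proposed lines, copied from the message as sample input.
PROPOSED = """\
FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
FF64       ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
"""


def parse_property_lines(text, prop="Pattern_Syntax"):
    """Return the set of code points assigned to `prop` in `text`."""
    code_points = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        rng, name = (field.strip() for field in data.split(";"))
        if name != prop:
            continue
        first, _, last = rng.partition("..")  # single code point or a range
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        code_points.update(range(lo, hi + 1))
    return code_points


syntax = parse_property_lines(PROPOSED)
```

With this sample, `0xFF0C in syntax` is true, so a parser consulting the set would treat the full-width comma as syntax. The "trivial" mapping mentioned for suggestion 1 can be checked with the standard library: `unicodedata.normalize("NFKC", "\uFF0C")` returns the ordinary ASCII comma.
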

