On Thu, 17 Jan 2019 18:44:50 -0500 "J. S. Choi" via Unicode <unicode@unicode.org> wrote:
> I’m implementing a Unicode names library. I’m confused about loose
> character-name matching, even after rereading The Unicode Standard §
> 4.8, UAX #34 § 4, #44 § 5.9.2 – as well as
> [L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt),
> [L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035), and
> the [meeting in which those two items were
> resolved](https://www.unicode.org/L2/L2014/14026.htm).
>
> In particular, I’m confused by the claim in The Unicode Standard §
> 4.8 saying, “Because Unicode character names do not contain any
> underscore (“_”) characters, a common strategy is to replace any
> hyphen-minus or space in a character name by a single “_” when
> constructing a formal identifier from a character name. This strategy
> automatically results in a syntactically correct identifier in most
> formal languages. Furthermore, such identifiers are guaranteed to be
> unique, because of the special rules for character name matching.”

Unfortunately, the loose matching rules don't distinguish '__' and '_'.
Note that '__' is sometimes forbidden in identifiers.

> I’m also confused by the relationship between UAX34-R3 and UAX44-LM2.
>
> To make these issues concrete, let’s say that my library provides a
> function called getCharacter that takes a name argument, tries to
> find a loosely matching character, and then returns it (or a null
> value if there is no currently loosely matching character). So then
> what should the following expressions return?

Loose matching of names may be looser than prescribed; it shall not be
stricter.

> getCharacter(“HANGUL-JUNGSEONG-O-E”)

U+1180 HANGUL JUNGSEONG O-E, or just possibly null.

> getCharacter(“HANGUL_JUNGSEONG_O_E”)

U+116C HANGUL JUNGSEONG OE*

> getCharacter(“HANGUL_JUNGSEONG_O_E_”)

U+116C

> getCharacter(“HANGUL_JUNGSEONG_O__E”)

U+116C

> getCharacter(“HANGUL_JUNGSEONG_O_-E”)

U+1180

> getCharacter(“HANGUL JUNGSEONGCHARACTERO E”)

null or U+116C - up to you. The sequence 'CHARACTER' shall not
distinguish names, but loose matching is not required to know this fact.

> getCharacter(“HANGUL JUNGSEONG CHARACTER OE”)

null or U+116C - up to you.

> getCharacter(“TIBETAN_LETTER_A”)

U+0F68 TIBETAN LETTER A

> getCharacter(“TIBETAN_LETTER__A”)

U+0F68 TIBETAN LETTER A**

> getCharacter(“TIBETAN_LETTER _A”)

U+0F68

> getCharacter(“TIBETAN_LETTER_-A”)

U+0F60 TIBETAN LETTER -A

*This is unfortunate, as the usual symbolic name for U+1180 would be
HANGUL_JUNGSEONG_O_E.

**This is also unfortunate, as the usual symbolic name for U+0F60 would
be TIBETAN_LETTER__A.

The key problem here is that the hyphen after a space is required in
names as understood by the name property. The hyphen is also required
in "HANGUL JUNGSEONG O-E".

The simple tactic is:

1) Canonicalise, by stripping out spaces, underscores and medial
hyphens and lowercasing. (It's probably better to fold the character
U+0131 LATIN SMALL LETTER DOTLESS I to 'i'.)

2) Look the result up.

3) If you get the result U+116C but the input matches
".*[oO]-[eE][_ -]*$", convert to U+1180.

Symbolic identifiers in programs need not match the name; one may
choose to depend on the compiler or interpreter to catch duplicates;
some will, some won't. Replacing '-' by '_' to convert a name to an
identifier loses the distinction between a hyphen and an arbitrarily
inserted space.

Richard.
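P.S. A minimal sketch of the simple tactic above, in Python, in case it
helps. Everything named here is an assumption for illustration: the
functions canonicalise and get_character, the four-entry NAMES table (a
real library would load every name from the UCD), returning the code
point as an integer, and reading "medial hyphen" as a hyphen that sits
directly between two ASCII letters or digits in the input.

```python
import re

# Tiny illustrative table; a real library would load all character
# names from the Unicode Character Database.
NAMES = {
    0x116C: "HANGUL JUNGSEONG OE",
    0x1180: "HANGUL JUNGSEONG O-E",
    0x0F68: "TIBETAN LETTER A",
    0x0F60: "TIBETAN LETTER -A",
}

# Assumption: "medial" means directly between two ASCII alphanumerics.
MEDIAL_HYPHEN = re.compile(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])")
# Step 3 pattern: input ends in O-E, optionally followed by '_', ' ' or '-'.
O_E_TAIL = re.compile(r".*[oO]-[eE][_ -]*$")


def canonicalise(name):
    """Step 1: strip medial hyphens, spaces and underscores; lowercase."""
    name = MEDIAL_HYPHEN.sub("", name)
    name = name.replace(" ", "").replace("_", "")
    # Fold U+0131 (dotless i) to 'i' after lowercasing.
    return name.lower().replace("\u0131", "i")


# U+1180 is left out of the table because its canonical form collides
# with U+116C; step 3 recovers it from the raw input instead.
TABLE = {canonicalise(n): cp for cp, n in NAMES.items() if cp != 0x1180}


def get_character(name):
    """Return the loosely matching code point, or None if there is none."""
    cp = TABLE.get(canonicalise(name))          # step 2
    if cp == 0x116C and O_E_TAIL.match(name):   # step 3
        cp = 0x1180
    return cp


print(get_character("HANGUL-JUNGSEONG-O-E") == 0x1180)
print(get_character("TIBETAN_LETTER__A") == 0x0F68)
print(get_character("TIBETAN_LETTER_-A") == 0x0F60)
```

With this reading of "medial", the input HANGUL_JUNGSEONG_O_-E finds
nothing (None) rather than U+1180, because the hyphen follows an
underscore and so is kept in the lookup key. A looser matcher would
need an extra rule for that case.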