On Mon, Dec 24, 2012 at 2:51 AM, Albert-Jan Roskam <fo...@yahoo.com> wrote:
>
> First, check if the first character is a (unicode) letter

You can use unicode.isalpha, with a caveat. On a narrow build, isalpha fails for characters in the supplementary planes. That's about 50% of all alphabetic characters, give or take depending on the version of Unicode. But those are mostly the less common CJK characters (over 90% of them), dead languages (e.g. Linear B, Cuneiform, Egyptian Hieroglyphs), and mathematical script. Instead, you could check whether index 0 is category 'Cs' (surrogate). If so, check the category of the slice [:2].

> Having unicode versions of the classes \d, \w, etc (let's call them
> \ud, \uw) would be cool.

(?u) enables re.U, in case you want to keep the flag setting in the pattern itself. \d and \w are already defined for Unicode; it's just that the available categories are insufficient. Matthew Barnett's regex module implements level 1 (and much of level 2) of UTS #18: Unicode Regular Expressions. See RL1.2 and Annex C:

http://unicode.org/reports/tr18/

> def isUnicodeChar(c):
>     assert len(c) == 1
>     c = c.decode("utf-8") if isinstance(c, str) else c
>     return 'L' in unicodedata.category(c)

For UTF-8, len(c) == 1 only for ASCII codes; otherwise character codes have a leading byte and up to 3 continuation bytes. Since the assert runs before the decode, it rejects any non-ASCII character passed as a UTF-8 byte string. Also, on a narrow build len(c) is 2 for codes in the supplementary planes.

Also keep in mind canonical composition/decomposition equivalence and compatibility when thinking about 'characters' in terms of comparison, sorting, dictionary keys, sets, etc. You might want to first normalize a string to canonical composed form (NFC). Python 3.x uses NFKC for identifiers. For example:

    >>> d = {}
    >>> exec("e\u0301 = 1", d)
    >>> d["e\u0301"]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'é'
    >>> d["\xe9"]
    1
    >>> "\xe9"
    'é'

http://en.wikipedia.org/wiki/Unicode_equivalence
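
To make the surrogate check and the NFC suggestion concrete, here is a rough sketch. It is written for Python 3, where chr() covers the full code point range, so it combines a leading surrogate pair by hand instead of relying on narrow-build behavior. The names first_code_point, starts_with_letter and nfc_key are made up for this illustration and are not part of any library.

```python
import unicodedata


def first_code_point(text):
    """Return the first code point of text as a one-character string.

    If text starts with a UTF-16 surrogate pair, combine the pair into
    the single supplementary-plane character it encodes.
    """
    if not text:
        raise ValueError("empty string")
    head = text[0]
    # Category 'Cs' marks a surrogate code unit.
    if unicodedata.category(head) == "Cs" and len(text) >= 2:
        hi, lo = ord(text[0]), ord(text[1])
        if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
            # Standard UTF-16 decoding of a high/low surrogate pair.
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
    return head


def starts_with_letter(text):
    """True if the first character is in a Unicode letter category (L*)."""
    return unicodedata.category(first_code_point(text)).startswith("L")


def nfc_key(text):
    """Normalize to NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)
```

With this, starts_with_letter("1abc") is False, and nfc_key("e\u0301") == "\xe9" holds, so both spellings of 'é' map to the same dictionary key. On Python 3.3 and later, strings are indexed by code point, so the surrogate branch only matters for strings that actually contain surrogate code units.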