On Sat, Dec 22, 2012 at 9:53 PM, Albert-Jan Roskam <fo...@yahoo.com> wrote:
> Hi,
>
> Is the code below the only/shortest way to match unicode characters? I
> would like to match whatever is defined as a character in the unicode
> reference database. So letters in the broadest sense of the word, but not
> digits, underscore or whitespace. Until just now, I was convinced that the
> re.UNICODE flag generalized the [a-z] class to all unicode letters, and
> that the absence of re.U was an implicit 're.ASCII'. Apparently that mental
> model was *wrong*.
> But [^\W\s\d_]+ is kind of hard to read/write.
>
> import re
> s = unichr(956)  # mu sign
> m = re.match(ur"[^\W\s\d_]+", s, re.I | re.U)
>

One thought would be to rely on the general category of the character, as
listed in the Unicode database. unicodedata.category will give you what you
need. Here is a list of the categories in the Unicode standard:

http://www.fileformat.info/info/unicode/category/index.htm

So, if you wanted only letters, you could say:

    import unicodedata

    def is_unicode_character(c):
        # The letter categories (Lu, Ll, Lt, Lm, Lo) all start with 'L'.
        assert len(c) == 1
        return unicodedata.category(c).startswith('L')

If the Letter category alone gets you what you need, this is pretty simple,
but if you also need symbols and marks or something, it will start to get
more complicated.

Another thought is to match against two separate regexes, one being \w for
alphanumerics and the other being [^\d] to leave you with only alpha. Not
exactly ideal either, and since \w also matches the underscore, you would
still have to exclude that separately.

The last option is to just go with the regex, make sure you write it only
once, and leave a nice comment. That's not too bad.

Hugo
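
For reference, here is a minimal sketch of the "write the regex once and comment it" option next to the category-based check. It assumes Python 3, where str is already Unicode, so unichr and the ur"" prefix from the original snippet are not needed. The names LETTERS_RE, is_letter and letter_runs are made up for illustration. The \s from the original pattern is left out because whitespace is already excluded by [^\W...].

```python
import re
import unicodedata

# "Word characters that are neither digits nor the underscore."
# \W already covers whitespace, so no separate \s is needed.
# re.UNICODE is the default for str patterns in Python 3; it is
# spelled out here only to make the intent explicit.
LETTERS_RE = re.compile(r"[^\W\d_]+", re.UNICODE)


def is_letter(c):
    """Return True if the single character c has a Letter category (L*)."""
    if len(c) != 1:
        raise ValueError("expected a single character")
    return unicodedata.category(c).startswith("L")


def letter_runs(text):
    """Return all maximal runs of characters matched by LETTERS_RE."""
    return LETTERS_RE.findall(text)
```

The two approaches are not guaranteed to agree on every code point. As far as I know, \w also covers some numeric characters that \d does not (superscript digits, for example), so the category check is the stricter one if "letters only" really matters.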