On Sat, Dec 22, 2012 at 11:12 PM, Steven D'Aprano <st...@pearwood.info> wrote:
>
> No. You could install a more Unicode-aware regex engine, and use it instead
> of Python's re module, where Unicode support is at best only partial.
>
> Try this one:
>
> http://pypi.python.org/pypi/regex

Looking over the old docs, I count 4 regex implementations up to 2.0:

    regexp
    regex (0.9.5)
    re / pcre (1.5)
    re / sre (2.0)

It would be nice to see Matthew Barnett's regex module added as an option in 3.4, just as sre was added to 1.6 before taking the place of pcre in 2.0.

> The failures are all numbers with category Nl or No ("letterlike
> numeric character" and "numeric character of other type").

The pattern basically matches any word character that's not a decimal digit or an underscore (the \s is redundant AFAIK). Any character that's numeric but not decimal also matches. For example, the following are all numeric:

    \N{SUPERSCRIPT ONE}: category "No", digit, not decimal
    \N{ROMAN NUMERAL ONE}: category "Nl", not digit, not decimal
    \u4e00 (1, CJK): category "Lo", not digit, not decimal

Regarding the last one, if the pattern shouldn't match numeric characters in a broad sense, then it should be OK to exclude CJK numeric ideograms in category "Lo", but it's like excluding the word "one".
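
To check those three characters yourself, here's a rough sketch using the stdlib unicodedata and re modules under Python 3. The original pattern isn't repeated in this message, so `[^\W\d_]` is only my assumed stand-in for "word character that's not a decimal digit or underscore", and the names `SAMPLES`, `WORD_NOT_DECIMAL` and `describe` are made up for the example.

```python
import re
import unicodedata

# The three numeric-but-not-decimal characters listed above.
SAMPLES = ["\N{SUPERSCRIPT ONE}", "\N{ROMAN NUMERAL ONE}", "\u4e00"]

# Assumed stand-in for the pattern under discussion: a word character
# that is neither a decimal digit nor an underscore.
WORD_NOT_DECIMAL = re.compile(r"[^\W\d_]")


def describe(ch):
    """Return (category, isnumeric, isdigit, isdecimal, matches pattern)."""
    return (
        unicodedata.category(ch),
        ch.isnumeric(),
        ch.isdigit(),
        ch.isdecimal(),
        WORD_NOT_DECIMAL.fullmatch(ch) is not None,
    )


for ch in SAMPLES:
    # Print code point and name rather than the character itself,
    # so the output is safe on ASCII-only terminals.
    print("U+%04X" % ord(ch), unicodedata.name(ch), describe(ch))
```

With the re module, \d only covers decimal digits while \w covers anything numeric, which is why all three get through the assumed pattern.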