Re: [Tutor] regex: matching unicode

eryksun Sun, 23 Dec 2012 20:19:26 -0800

On Sat, Dec 22, 2012 at 11:12 PM, Steven D'Aprano <[email protected]> wrote:
>
> No. You could install a more Unicode-aware regex engine, and use it instead
> of Python's re module, where Unicode support is at best only partial.
>
> Try this one:
>
> http://pypi.python.org/pypi/regex


Looking over the old docs, I count 4 regex implementations up to 2.0:

    regexp
    regex (0.9.5)
    re / pcre (1.5)
    re / sre (2.0)

It would be nice to see Matthew Barnett's regex module added as an
option in 3.4, just as sre was added to 1.6 before taking the place of
pcre in 2.0.

> The failures are all numbers with category Nl or No ("letterlike
> numeric character" and "numeric character of other type").

The pattern basically matches any word character that's not a decimal
digit or an underscore (the \s is redundant AFAIK, since whitespace
isn't a word character to begin with). Any character that's numeric
but not decimal therefore also matches. For example, the following are
all numeric:

    \N{SUPERSCRIPT ONE}: category "No", digit, not decimal
    \N{ROMAN NUMERAL ONE}: category "Nl", not digit, not decimal
    \u4e00 (1, CJK): category "Lo", not digit, not decimal

Regarding the last of these, if the pattern shouldn't match numeric
characters in the broad sense, then it should be OK to exclude the CJK
numeric ideograms in category "Lo", but that's like excluding the word
"one".
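
A quick way to check the three examples above, as a rough sketch for
Python 3. The pattern here is only my stand-in for the one under
discussion (an assumption, not the original regex): a word character
that is neither a decimal digit nor an underscore. The names CHARS,
describe and PATTERN are made up for the illustration.

```python
import re
import unicodedata

# The three numeric-but-not-decimal characters listed above.
CHARS = ["\N{SUPERSCRIPT ONE}", "\N{ROMAN NUMERAL ONE}", "\u4e00"]


def describe(ch):
    """Return (category, isdecimal, isdigit, isnumeric) for one character."""
    return (unicodedata.category(ch), ch.isdecimal(), ch.isdigit(),
            ch.isnumeric())


# Hypothetical stand-in for the pattern being discussed (an assumption):
# any word character except decimal digits and the underscore.
PATTERN = re.compile(r"[^\W\d_]")

for ch in CHARS:
    # Print the code point, its properties, and whether the pattern matches.
    print("U+%04X" % ord(ch), describe(ch), bool(PATTERN.fullmatch(ch)))
```

With the stock re module all three should be reported as matching,
while a plain "1" or "_" is rejected. Excluding them as well would
take an explicit test such as str.isnumeric() on each matched
character.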
