[issue30838] re \w does not match some valid Unicode characters

2017-07-05 Thread Matthew Barnett

Matthew Barnett added the comment:

Python identifiers match the regex:

[_\p{XID_Start}]\p{XID_Continue}*

The standard re module doesn't support \p{...}, but the third-party "regex" 
module does.
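
As a rough illustration of the pattern above, here is a minimal sketch. It assumes the third-party "regex" package is installed and falls back to str.isidentifier() when it is not. The names IDENTIFIER_PATTERN and looks_like_identifier are made up for this example, and the Unicode data bundled with "regex" may differ in version from the interpreter's, so edge cases can disagree.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None

# The pattern quoted above; \p{...} is only understood by the "regex" module.
IDENTIFIER_PATTERN = r'[_\p{XID_Start}]\p{XID_Continue}*'


def looks_like_identifier(text):
    """Return True if the whole string has the shape of a Python identifier."""
    if regex is not None:
        return regex.fullmatch(IDENTIFIER_PATTERN, text) is not None
    # Fallback (assumption for this sketch): the built-in check relies on the
    # same XID_Start / XID_Continue properties.
    return text.isidentifier()


print(looks_like_identifier('\u212e'))
print(looks_like_identifier('1abc'))
```

Like str.isidentifier(), this only checks the shape of the name; it does not reject keywords.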

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30838] re \w does not match some valid Unicode characters

2017-07-05 Thread David Lord

David Lord added the comment:

After thinking about it more, I guess I misunderstood what \w was doing 
compared to isidentifier. Since Python just relies on the Unicode database, 
there's not much to be done anyway. Closing this.

For anyone interested, we ended up with a hybrid approach for lexing 
identifiers: build a regex group that includes all valid ranges not matched by 
\w, then validate with isidentifier later. 
https://github.com/pallets/jinja/pull/731/files
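
The hybrid approach described above could look roughly like the sketch below. It is not the code from the linked pull request; extra_identifier_class, CANDIDATE and lex_identifier are hypothetical names chosen for illustration. The second step matters because `\w` also accepts some characters that are not valid in identifiers.

```python
import re
import sys


def extra_identifier_class():
    # Build the body of a character class covering every code point that may
    # appear in an identifier but is not matched by \w, collapsed into ranges.
    word = re.compile(r'\w')
    ranges = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Prefixing 'a' accepts continue characters as well as start characters.
        if ('a' + ch).isidentifier() and not word.match(ch):
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp  # extend the current run
            else:
                ranges.append([cp, cp])  # start a new run
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append('\\U%08x' % start)
        else:
            parts.append('\\U%08x-\\U%08x' % (start, end))
    return ''.join(parts)


# Step 1: a permissive pattern, \w plus the extra ranges.
CANDIDATE = re.compile(r'[\w' + extra_identifier_class() + r']+')


def lex_identifier(text, pos=0):
    """Return the identifier starting at pos, or None if there is none."""
    match = CANDIDATE.match(text, pos)
    if match is None:
        return None
    candidate = match.group()
    # Step 2: validate, since the pattern alone also accepts e.g. '1abc'.
    return candidate if candidate.isidentifier() else None


print(lex_identifier('\u212ex = 1'))
print(lex_identifier('1abc'))
```

The full scan over all code points runs once at import time; a real lexer would probably precompute and store the resulting character class instead.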

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread Matthew Barnett

Matthew Barnett added the comment:

In Unicode 9.0.0, U+1885 and U+1886 changed from being 
General_Category=Other_Letter (Lo) to General_Category=Nonspacing_Mark (Mn).

U+2118 is General_Category=Math_Symbol (Sm) and U+212E is 
General_Category=Other_Symbol (So).

\w doesn't include Mn, Sm or So.

The str.isidentifier method uses the Unicode properties XID_Start and XID_Continue, 
which include these codepoints.
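
The categories can be checked from the interpreter. This is a small sketch using only the standard library; the helper name describe is invented for the example, and the category reported for the two Mongolian letters depends on the Unicode version shipped with the running Python.

```python
import re
import unicodedata

# The four code points from the original report.
CODE_POINTS = [0x1885, 0x1886, 0x2118, 0x212E]


def describe(cp):
    """Return (general category, isidentifier result, whether \\w matches)."""
    ch = chr(cp)
    return (
        unicodedata.category(ch),
        ch.isidentifier(),
        re.match(r'\w', ch) is not None,
    )


for cp in CODE_POINTS:
    category, is_ident, matches_w = describe(cp)
    print('U+%04X %s isidentifier=%s \\w=%s' % (cp, category, is_ident, matches_w))
```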




[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread David Lord

David Lord added the comment:

Adding `or ('a' + s).isidentifier()`, to catch valid id_continue characters, to 
the test in the previous script reveals many more characters that seem like 
valid word characters but aren't matched by `\w`.
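
For reference, the extended test might look like the sketch below. The function name unmatched_identifier_chars is made up here, and the exact count depends on the Unicode version of the running interpreter, so no number is quoted.

```python
import re
import sys

word = re.compile(r'\w')


def unmatched_identifier_chars():
    """Yield code points valid somewhere in an identifier but not matched by \\w."""
    for cp in range(sys.maxunicode + 1):
        s = chr(cp)
        # s.isidentifier() covers start characters; prefixing 'a' covers
        # characters that are only valid after the first position.
        if (s.isidentifier() or ('a' + s).isidentifier()) and not word.match(s):
            yield cp


missing = list(unmatched_identifier_chars())
print(len(missing))
```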




[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread ThiefMaster

Changes by ThiefMaster :


--
nosy: +ThiefMaster




[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread STINNER Victor

Changes by STINNER Victor :


--
nosy: +serhiy.storchaka




[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread David Lord

New submission from David Lord:

This came up while writing a regex for Jinja to match characters that are valid 
in Python identifiers (https://github.com/pallets/jinja/pull/731). `\w` matches 
all valid identifier characters except for 4 special cases:

import unicodedata
import re
import sys

cre = re.compile(r'\w')

for cp in range(sys.maxunicode + 1):
    s = chr(cp)

    if s.isidentifier() and not cre.match(s):
        print(hex(cp), unicodedata.name(s))

0x1885 MONGOLIAN LETTER ALI GALI BALUDA
0x1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
0x2118 SCRIPT CAPITAL P
0x212e ESTIMATED SYMBOL

Python < 3.6 matches the two Mongolian characters; I'm not sure why 3.6 stopped 
matching them.

For our case, we just added them to a character set, 
`[\w\u1885\u1886\u2118\u212e]`.

It can cause unexpected behavior when using `\b`, since that's defined as the 
transition from `\w` to `\W` and those 4 characters aren't in `\w`. 
`re.match(r'\b[\w\u212e]', '℮')` fails to match.
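
A short sketch of the `\b` problem and one possible workaround follows. The lookaround-based boundary is an assumption of this example, not something proposed in the report, and IDENT_CHAR and IDENT_START are hypothetical names.

```python
import re

# \w plus the four code points from the report, as a hand-written class.
IDENT_CHAR = r'[\w\u1885\u1886\u2118\u212e]'

# Hypothetical replacement for a leading \b: the previous character (if any)
# is not an identifier character and the next one is.
IDENT_START = r'(?<!' + IDENT_CHAR + r')(?=' + IDENT_CHAR + r')'

# \b sees no word boundary before U+212E because that character is not in \w.
plain = re.match(r'\b' + IDENT_CHAR + '+', '\u212ex')

# The explicit lookarounds use the extended class instead.
custom = re.match(IDENT_START + IDENT_CHAR + '+', '\u212ex')

print(plain)
print(custom.group())
```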

--
components: Regular Expressions, Unicode
messages: 297603
nosy: davidism, ezio.melotti, haypo, mrabarnett
priority: normal
severity: normal
status: open
title: re \w does not match some valid Unicode characters
type: behavior
versions: Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7
