[issue25743] Clarify exactly what \w matches in UNICODE mode

2016-01-03 Thread Ezio Melotti

Changes by Ezio Melotti:


--
components: +Regular Expressions
nosy: +ezio.melotti, mrabarnett
stage:  -> needs patch
type:  -> enhancement
versions:  -Python 3.2, Python 3.3, Python 3.4

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25743] Clarify exactly what \w matches in UNICODE mode

2015-11-27 Thread Zack Weinberg

New submission from Zack Weinberg:

The `re` module documentation does not do a good job of explaining exactly what 
`\w` matches.  Quoting https://docs.python.org/3.5/library/re.html :

> \w
> For Unicode (str) patterns:
> Matches Unicode word characters; this includes most characters
> that can be part of a word in any language, as well as numbers
> and the underscore.

Empirically, this appears to mean "everything in Unicode general categories L* 
and N*, plus U+005F (underscore)".  That is a perfectly sensible definition and 
the documentation should state it in those terms.  "Unicode word characters" 
could mean any number of different things; note for instance that UTS#18 gives 
a very different definition.
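
The empirical definition above can be spot-checked directly. The following is a minimal sketch, assuming Python 3 (where `str` patterns use Unicode matching by default); the helper names `matches_w`, `predicted_by_category`, and `disagreements` are illustrative and not part of `re`.

```python
import re
import unicodedata

# str pattern, so Unicode matching is the default in Python 3.
WORD = re.compile(r"\w")


def matches_w(ch):
    # True if the single character ch is matched by \w.
    return WORD.fullmatch(ch) is not None


def predicted_by_category(ch):
    # Hypothesis from the report: general category L* or N*, or underscore.
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


def disagreements(code_points):
    # Characters where the regex engine and the hypothesis differ.
    return [chr(cp) for cp in code_points
            if matches_w(chr(cp)) != predicted_by_category(chr(cp))]


if __name__ == "__main__":
    samples = ["a", "\u00e9", "_", "7", "\u0663", "\u2167", "\u00bd", "-", "\u0301"]
    for ch in samples:
        print("U+%04X" % ord(ch), unicodedata.category(ch), matches_w(ch))
    print(disagreements(range(0x80)))
```

Passing a wider range to `disagreements` (up to `sys.maxunicode + 1`) extends the check to all of Unicode; any characters it returns are counterexamples to the "L*, N*, plus underscore" description on that interpreter.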

(Further reading: https://gist.github.com/zackw/3077f387591376c7bf67 plus links 
therefrom).

--
assignee: docs@python
components: Documentation
messages: 255463
nosy: docs@python, zwol
priority: normal
severity: normal
status: open
title: Clarify exactly what \w matches in UNICODE mode
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6




[issue25743] Clarify exactly what \w matches in UNICODE mode

2015-11-27 Thread Andi McClure

Andi McClure added the comment:

I would also like to request a clear explanation in the documentation for the 
2.7 branch. From https://docs.python.org/2.7/library/re.html :

"\w ... If UNICODE is set, this will match the characters [0-9_] plus whatever 
is classified as alphanumeric in the Unicode character properties database"

This is ambiguous. Does it mean the "Alphabetic" property from UAX#44? Does it 
mean something else?
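
One way to narrow this down empirically is to test a character that is understood to carry the UAX#44 Alphabetic property without being a letter by general category. The sketch below uses U+093E DEVANAGARI VOWEL SIGN AA (general category Mc) for that purpose. It is written in Python 3 syntax; on 2.7 the pattern would need a `u` prefix and the `re.UNICODE` flag. The `unicodedata` module does not expose Alphabetic, so that part is an assumption stated in the comments, and `describe` is a hypothetical helper.

```python
import re
import unicodedata

# Assumption (not checkable through unicodedata): U+093E has Alphabetic=Yes
# via Other_Alphabetic, although its general category is Mc, not L*.
VOWEL_SIGN_AA = "\u093e"
LETTER_KA = "\u0915"  # DEVANAGARI LETTER KA, general category Lo


def describe(ch):
    # Return (name, general category, whether \w matches the character).
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            re.fullmatch(r"\w", ch) is not None)


if __name__ == "__main__":
    print(describe(VOWEL_SIGN_AA))
    print(describe(LETTER_KA))
```

If `\w` matches the letter but not the vowel sign, the implemented rule is not the Alphabetic property, which points back to a definition based on general categories.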

--
nosy: +Andi McClure




[issue25743] Clarify exactly what \w matches in UNICODE mode

2015-11-27 Thread Zack Weinberg

Zack Weinberg added the comment:

FWIW, the actual behavior of \w matching "everything in Unicode general 
categories L* and N*, plus U+005F (underscore)" is consistent across all 
versions I can conveniently test (2.7, 3.4, 3.5).

In 2.7, there are four characters in general category Nl that \w doesn't match, 
but I believe that is just a bug, not an intentional difference of behavior.
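
To look for gaps like these on a given interpreter, one can list every code point in categories L* and N* that `\w` fails to match. This is a sketch in Python 3 syntax, and `unmatched_in_categories` is a hypothetical helper name. The result depends on the interpreter and its Unicode database version, so no expected output is given.

```python
import re
import sys
import unicodedata

WORD = re.compile(r"\w")


def unmatched_in_categories(prefixes=("L", "N")):
    # Collect characters whose general category starts with one of the
    # given prefixes but which \w does not match.
    missed = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch)[0] in prefixes and WORD.fullmatch(ch) is None:
            missed.append(ch)
    return missed


if __name__ == "__main__":
    for ch in unmatched_in_categories():
        print("U+%04X %s %s" % (ord(ch),
                                unicodedata.category(ch),
                                unicodedata.name(ch, "<unnamed>")))
```

An empty listing means the "L* and N*" description holds in one direction on that interpreter; it does not show that `\w` matches nothing outside those categories.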

--
