[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2021-05-26 Thread Antoine Pitrou


Change by Antoine Pitrou :


--
stage: test needed -> needs patch
versions: +Python 3.11 -Python 3.6, Python 3.7, Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2020-02-03 Thread STINNER Victor


Change by STINNER Victor :


--
nosy:  -vstinner




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2020-01-31 Thread Terry J. Reedy


Change by Terry J. Reedy :


--
assignee: docs@python -> 
components: +Unicode -Documentation
nosy: +benjamin.peterson, lemburg, serhiy.storchaka




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2020-01-31 Thread Henry S. Thompson


Henry S. Thompson  added the comment:

[One year and 2 days later... :-[

Is this fixed in 3.9?  If not, the Versions list above should be updated.

The failure of lower() to preserve 'alpha-ness' is a serious bug: it causes 
significant failures in e.g. Turkish NLP, and it's _not_ just a failure of the 
documentation!

Please can we move this to category Unicode and get at least this aspect of the 
problem fixed?  Should I raise a separate issue on isalpha() etc.?

--




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2019-09-07 Thread Justin Arthur


Change by Justin Arthur :


--
nosy: +JustinTArthur




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2019-01-29 Thread Henry S. Thompson

Henry S. Thompson  added the comment:

This issue is also implicated in a failure of isalpha and friends.
An easy way to see this is to compare
>>> 'İ'.isalpha()
True
>>> 'İ'.lower().isalpha()
False

This results from the use of a combining character to encode lower-case Turkish 
dotted i:
>>> import unicodedata
>>> len('İ'.lower())
2
>>> unicodedata.category('İ'.lower()[1])
'Mn'

--
nosy: +HThompson




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2018-03-14 Thread Terry J. Reedy

Terry J. Reedy  added the comment:

Whatever I may have said before, I favor supporting the Unicode standard for 
\w, which is related to the standard for identifiers.

This is one of 2 issues about \w being defined too narrowly.  I am somewhat 
arbitrarily closing #1693050 as a duplicate of this (fewer digits ;-).

There are 3 issues about tokenize.tokenize failing on valid identifiers, 
defined as \w sequences whose first char is an identifier itself (and therefore 
a start char).  In msg313814 of #32987, Serhiy indicates which start and 
continue identifier characters are matched by \W for re and regex.  I am 
leaving #24194 open as the tokenizer name issue.

--
stage: needs patch -> test needed
versions: +Python 3.6, Python 3.7, Python 3.8 -Python 2.7, Python 3.3, Python 
3.4




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2013-07-10 Thread Terry J. Reedy

Changes by Terry J. Reedy tjre...@udel.edu:


--
versions: +Python 3.4 -Python 3.2

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12731
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-09-29 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

The failing re tests after PEP 393 are:
FAIL lib re found non alphanumeric string  'cafe'
FAIL lib re found non alphanumeric string  'Ⓚ'
FAIL lib re found non alphanumeric string  ''
FAIL lib re found non alphanumeric string  ''
FAIL lib re found non alphanumeric string  'connector‿punctuation'
FAIL lib re found non alphanumeric string  'Ὰ_Στο_Διάολο'
FAIL lib re found non alphanumeric string  '̰̰̈́̈́‿̰̿̽̓͂‿̸̿‿̹̽‿̷̹̼̹̰̼̽'
FAIL lib re found all alphanumeric string  '¹²³'
FAIL lib re found all alphanumeric string  '₁₂₃'
FAIL lib re found all alphanumeric string  '¼½¾'
FAIL lib re found all alphanumeric string  '⑶'

--




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-28 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

 Or the re module should be *replaced* by the code from the regex module
 (but renamed to re, and with certain backwards compatibilities
 restored, probably).

This is what I meant.

 But I really hope the re module (really: the _sre extension module)
 can be fixed.

Starting to fix these issues from scratch doesn't make much sense IMHO.  We could 
extract the fixes from regex and merge them in re, but then again it's 
probably easier to just replace the whole module.

 We should also make a habit in our docs of citing specific versions
 of the Unicode standard, and specific TR numbers and versions where 
 they apply.

While this is a good thing it's not always doable.  Usually someone reports a 
bug related to something specified in some standard and only that part gets 
fixed.  Sometimes everything else is also updated to follow the whole standard, 
but often this happens incrementally, so we can't say, e.g., the re module 
supports Unicode x.y unless we go through the whole standard and 
fix/implement everything.

--




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-28 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

 But I really hope the re module (really: the _sre extension module)
 can be fixed.

If you mean on 2.7/3.2, then I guess we could extract the fixes from regex, but 
we have to see if it's doable and someone will have to do it.

Also consider that the regex module is available for 2.7/3.2, so we could 
suggest that users switch to it if they have problems with the re bugs (even if 
that means having an additional dependency).

ISTM that the current plan is:
  * replace re with regex (and rename it) on 3.3 and fix all these bugs;
  * leave 2.7 and 3.2 with the old re and its bugs;
  * let people use the external regex module on 2.7/3.2 if they need to.

If this is not ok, maybe it should be discussed on python-dev.

--




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-28 Thread Guido van Rossum

Guido van Rossum gu...@python.org added the comment:

[me]
 But I really hope the re module (really: the _sre extension module)
 can be fixed.

[Ezio]
 Starting to fix these issues from scratch doesn't make much sense IMHO.  We 
 could extract the fixes from regex and merge them in re, but then again 
 it's probably easier to just replace the whole module.

I have changed my mind at least half-way. I am open to having regex
(with some changes, details TBD) replace re in 3.3. (I am not yet 100%
convinced, but I'm not rejecting it as strongly as I was when I wrote
that comment in this bug. See the ongoing python-dev discussion on
this topic.)

 We should also make a habit in our docs of citing specific versions
 of the Unicode standard, and specific TR numbers and versions where
 they apply.

 While this is a good thing it's not always doable.  Usually someone reports a 
 bug related to something specified in some standard and only that part gets 
 fixed.  Sometimes everything else is also updated to follow the whole 
 standard, but often this happens incrementally, so we can't say, e.g., the 
 re module supports Unicode x.y unless we go through the whole standard and 
 fix/implement everything.

Hm. I think that for Unicode it may actually be important enough to be
consistent in following the whole standard that we should attempt to
be consistent and not just chase bug reports. Now, we may consciously
decide not to implement a certain recommendation of the standard. E.g.
I'm not going to require that IronPython or Jython have string objects
that support O(1) indexing of code points, even though (assuming PEP 393
gets accepted) CPython will have them. But these decisions should be made
explicitly, and documented clearly.

Ideally, we need a Unicode czar -- a core developer whose job it is
to keep track of Python's compliance with various parts and versions
of the Unicode standard and who can nudge other developers towards
fixing bugs or implementing features, or update the documentation in
case things don't get added. (I like Tom's approach to Java 1.7, where
he submitted proposed doc fixes explaining the deviations from the
standard. Perhaps a bit passive-aggressive, but it was effective. :-)

--




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-28 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

 Ideally, we need a Unicode czar -- a core developer whose job it is
 to keep track of Python's compliance with various parts and versions
 of the Unicode standard and who can nudge other developers towards
 fixing bugs or implementing features, or update the documentation in
 case things don't get added.

We should first do a full review of the latest Unicode standard and see what's 
missing.  I think there might be parts of older Unicode versions (even 
Unicode 5) that are not yet implemented.  Chapter 3 is a good place to 
start, but I'm not sure that's enough -- there are a few TRs that should be 
considered as well.
If we manage to catch up with Unicode 6, then it shouldn't be too difficult to 
review the changes that every new version will introduce and open an issue for 
each (or a single issue if the changes are limited).
FWIW I'm planning to look at the conformance of the UTF codecs and fix them (if 
necessary) whenever I have time.

--




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-26 Thread Guido van Rossum

Guido van Rossum gu...@python.org added the comment:

Really?  The re module cannot be salvaged and we should add regex but keep the 
(buggy) re?  That does not make a lot of sense to me.  I think it should just 
be fixed in the re module.  Or the re module should be *replaced* by the code 
from the regex module (but renamed to re, and with certain backwards 
compatibilities restored, probably).  But I really hope the re module (really: 
the _sre extension module) can be fixed.  We should also make a habit in our 
docs of citing specific versions of the Unicode standard, and specific TR 
numbers and versions where they apply.  (And hopefully we can supply URLs to 
the Unicode consortium's canonical copies of those documents.)

--
nosy: +gvanrossum




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-15 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

If the regex module works fine here, I think it's better to leave the re module 
alone and include the regex module in 3.3.

--




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-13 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
nosy: +haypo




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-13 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 However, because the \wc issues are bigger, Java addressed the tr18 RL1.2a
 issues differently, this time by creating a new compilation flag called
 UNICODE_CHARACTER_CLASSES (with corresponding embedded (?U) regex flag.)
 
 Truth be told, even Perl has secret pattern compilation flags to govern
 this sort of thing (ascii, locale, unicode), but we (well, I) hope you
 never have to use or even notice them.  
 
 That too might be a route forward for Python, although I am not quite sure
 how much flexibility and control of your lexical scope you have.  However,
 the from __future__ imports suggest you may have enough to do something
 slick so that only people who ask for it get it, and also importantly that
 they get it all over the place so don't have to add an extra flag or u'...'
 or whatever every single time.  

If the current behaviour is buggy or sub-optimal, I think we should
simply fix it (which might be done by replacing re with regex if
someone wants to shepherd its inclusion in the stdlib).

By the way, thanks for the detailed explanations, Tom.
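
For comparison with the compilation-flag approach quoted above, Python 3's re module does have one flag in this spirit: re.ASCII (inline form (?a)). The sketch below only shows how that flag switches \w between ASCII-only matching and Python's own Unicode notion of a word character. It does not select the UTS#18 definition, and the sample text is made up for illustration.

```python
import re

text = "caf\u00e9 na\u00efve"   # precomposed é and ï

# Default for str patterns in Python 3: \w uses Unicode-aware matching.
unicode_words = re.findall(r"\w+", text)

# re.ASCII, or the inline (?a) flag, restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
inline_ascii_words = re.findall(r"(?a)\w+", text)

print(unicode_words)
print(ascii_words)
print(inline_ascii_words)
```

A flag of this kind governs which definition applies, but it cannot by itself make the Unicode branch conform to RL1.2a.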

--




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-12 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:


--
nosy: +Arfrever




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-12 Thread Terry J. Reedy

Terry J. Reedy tjre...@udel.edu added the comment:

However desirable it would be, I do not believe there is any claim in the 
manual that the re module follows the evolving Unicode consortium r.e. 
standard. If I understand, you are saying that this statement in the doc, 
"Matches Unicode word characters;" is not now correct and should be revised. 
Was it once correct? Could we add "by an older definition of 'word' character"?

There has been some discussion of adding regex to the stdlib, possibly as a 
replacement for re. Your posts indicate that regex is more improved than some 
realized, and hence has more incompatibilities than we realized, and hence is 
less suitable as a strictly backwards-compatible replacement. So I think it 
needs to be looked at as a parallel addition. I do not know Matthew's current 
position on the subject.

--
assignee:  -> docs@python
components: +Documentation
nosy: +docs@python, pitrou, terry.reedy
stage:  -> needs patch
versions: +Python 3.2, Python 3.3




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-12 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

 Terry J. Reedy tjre...@udel.edu added the comment:

 However desirable it would be, I do not believe there is any claim in the
 manual that the re module follows the evolving Unicode consortium r.e.
 standard.

My from the hip thought is that if re cannot be fixed to follow
the Unicode Standard, it should be deprecated in favor of code
that can if such is available, because you cannot process Unicode
text with regular expressions otherwise.

 If I understand, you are saying that this statement in the doc, "Matches
 Unicode word characters;" is not now correct and should be revised. Was
 it once correct? Could we add "by an older definition of 'word' character"?

Yes, your hunch is exactly correct.  They once had a lesser definition than
they have now.  It is very very old.  I had to track this down for Java
once.  There is some discussion of a word_character class at least 
as far back as tr18v3 from back in 1998.

http://www.unicode.org/reports/tr18/tr18-3.html

By the time tr18v5 rolled around just a year later in 1999, the overall
document had changed substantially, and you can clearly see its current
shape there.  Word characters are supposed to include all code points with
the Alphabetic property, for example.  

http://www.unicode.org/reports/tr18/tr18-5.html

However, the word alphabetic has *never* been synonymous in 
Unicode with 

\p{gc=Lu}
\p{gc=Ll}
\p{gc=Lt}
\p{gc=Lm}
\p{gc=Lo}

as many people incorrectly assume, nor certainly to 

\p{gc=Lu}
\p{gc=Ll}
\p{gc=Lt}

let alone to 

\p{gc=Lu}
\p{gc=Ll}

Rather, it has since its creation included code points that are not
letters, such as all GC=Nl and also certain GC=So code points.  And,
notoriously, U+0345. Indeed it is here I first noticed that Python had
already broken with the Standard, because U+0345 COMBINING GREEK
YPOGEGRAMMENI is GC=Mn, but Alphabetic=True, yet I have shown that 
Python's title method is messing up there.  

I wouldn't spend too much time on archaeological digs, though, because lots of
stuff has changed since the last millennium.  It was in tr18v7 from 2003-05
that we hit paydirt, because this is when the famous Annex C of RL1.2a 
fame first appeared:

http://www.unicode.org/reports/tr18/tr18-7.html#Compatibility_Properties

Notice how it defines \w to be nothing more than \p{alpha}, \p{digit}, and
\p{gc=Pc}.  It does not yet contain the requirement that all Marks be
counted as part of the word, just the few that are alphas -- which the
U+0345 counts for, since it has an uppercase map of a capital iota!

That particular change did not occur until tr18v8 in 2003-08, barely
a scant three months later.

http://www.unicode.org/reports/tr18/tr18-8.html#Compatibility_Properties

Now at last we see word characters defined in the modern way that we 
have become used to.  They must match any of:

\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}

BTW, Python is matching  all of 

\p{GC=N}

meaning

\p{GC=Nd}
\p{GC=Nl}
\p{GC=No}

instead of the required 

\p{GC=Nd}

which is a synonym for \p{digit}.

I don't know how that happened, because \w has never included
all number code points in Unicode, only the decimal number ones.
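
The tr18v8 definition listed above can be approximated in pure Python and compared with what the standard library's \w accepts. This is only a sketch under a stated assumption: unicodedata does not expose the Alphabetic property, so letters (L*) and letter numbers (Nl) stand in for \p{alpha}, which misses code points that are alphabetic only through Other_Alphabetic and are not marks. The helper names is_uts18_word_char and re_word_char are hypothetical, not part of any library.

```python
import re
import unicodedata

def is_uts18_word_char(ch: str) -> bool:
    """Approximate the tr18 Annex C definition of \\w for one code point.

    Assumption: L* and Nl stand in for \\p{alpha}, because the Alphabetic
    property itself is not available from unicodedata.
    """
    category = unicodedata.category(ch)
    return (
        category.startswith("L")      # letters, part of \p{alpha}
        or category == "Nl"           # letter numbers, also alphabetic
        or category.startswith("M")   # \p{gc=Mark}
        or category == "Nd"           # \p{digit}
        or category == "Pc"           # \p{gc=Connector_Punctuation}
    )

def re_word_char(ch: str) -> bool:
    """Report whether the stdlib re module treats ch as a \\w character."""
    return re.fullmatch(r"\w", ch) is not None

samples = ["\u0345", "\u203f", "\u00b9", "5", "\u00e9", "_"]
comparison = [(f"U+{ord(ch):04X}", is_uts18_word_char(ch), re_word_char(ch))
              for ch in samples]

for code_point, expected, actual in comparison:
    print(code_point, "approx. UTS#18:", expected, "re:", actual)
```

The rows where the two columns disagree are the kinds of code points this report is about: a mark and a connector that re rejects, and a non-decimal number that re accepts.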

That all goes to show why, when citing conformance to some aspect of 
The Unicode Standard, one must be exceedingly careful just how one 
does so!

The Unicode Consortium recognizes this is an issue, and I am pretty
sure I can hear it in your own subtext as well.  

Kindly bear with and forgive me for momentarily sounding like a standard
lawyer.  I do this to show not just why it is important to get
references to the Unicode Standard correct, but indeed, how to do so.

After I have given the formal requirements, I will then produce
illustrations of various purported claims, some of which meet the
citation requirements, and others which do not.

===

To begin with, there is an entire technical report on conformance.
It includes:

http://unicode.org/reports/tr33/

The Unicode Standard [Unicode] is a very large and complex standard.
Because of this complexity, and because of the nature and role of the
standard, it is often rather difficult to determine, in any particular
case, just exactly what conformance to the Unicode Standard means.

...

Conformance claims must be specific to versions of the Unicode
Standard, but the level of specificity needed for a claim may vary
according to the nature of the particular conformance claim. Some
standards developed by the Unicode Consortium require separate
conformance to a specific version (or later), of the Unicode Standard.
This version is sometimes called the  base version. In such cases, the
version of the standard and the version of the Unicode Standard to
which the conformance claim 

[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-12 Thread Matthew Barnett

Changes by Matthew Barnett pyt...@mrabarnett.plus.com:


--
nosy: +mrabarnett




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

You cannot use Python's lib re for handling Unicode regular expressions because 
it violates the standard set out for the same in UTS#18 on Unicode Regular 
Expressions in RL1.2a on compatibility properties.  What \w is allowed to match 
is clearly explained there, but Python has its own idea. Because it is in clear 
violation of the standard, it is misleading and wrong for Python to claim that 
the re.UNICODE flag makes \w and friends match Unicode.  Here are the failed 
test cases when the attached file is run under v3.2; there are further failures 
when run under v2.7.

FAIL lib re found non alphanumeric string café
FAIL lib re found non alphanumeric string Ⓚ
FAIL lib re found non alphanumeric string ͅ
FAIL lib re found non alphanumeric string ְ
FAIL lib re found non alphanumeric string ퟘ
FAIL lib re found non alphanumeric string ́
FAIL lib re found non alphanumeric string 픘픫픦픠픬픡픢
FAIL lib re found non alphanumeric string ДЯхШщЯл
FAIL lib re found non alphanumeric string connector‿punctuation
FAIL lib re found non alphanumeric string Ὰͅ_Στο_Διάολο
FAIL lib re found non alphanumeric string ̰̰̈́̈́‿̰̿̽̓͂‿̸̿‿̹̽‿̷̹̼̹̰̼̽
FAIL lib re found all alphanumeric string ¹²³
FAIL lib re found all alphanumeric string ₁₂₃
FAIL lib re found all alphanumeric string ¼½¾
FAIL lib re found all alphanumeric string ⑶

Note that Matthew Barnett's regex lib for Python handles all of these cases in 
conformance with The Unicode Standard.
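
A few of the reported cases can be reproduced with the standard library alone. The script below is not the attached alnum.python file, which is not shown in this thread. It is a minimal sketch with made-up labels and a hypothetical helper, all_word_chars, and its results describe CPython's re as I understand its current behaviour.

```python
import re

def all_word_chars(s: str) -> bool:
    """True if every code point in s is matched by the stdlib's \\w."""
    return re.fullmatch(r"\w+", s) is not None

cases = {
    "decomposed cafe + U+0301": "cafe\u0301",
    "connector punctuation": "connector\u203fpunctuation",
    "superscript digits": "\u00b9\u00b2\u00b3",
    "vulgar fractions": "\u00bc\u00bd\u00be",
}

results = {label: all_word_chars(s) for label, s in cases.items()}

for label, matched in results.items():
    print(f"{label}: all \\w = {matched}")
```

Per RL1.2a the first two strings should consist entirely of word characters and the last two should not, which is the opposite of what re reports here.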

--
components: Regular Expressions
files: alnum.python
messages: 141920
nosy: tchrist
priority: normal
severity: normal
status: open
title: python lib re uses obsolete sense of \w in full violation of UTS#18 
RL1.2a
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file22881/alnum.python




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-11 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti
