[issue47152] Reorganize the re module sources

2022-04-04 Thread Matthew Barnett


Matthew Barnett  added the comment:

For reference, I also implemented .regs in the regex module for compatibility, 
but I've never used it myself. I had to do some investigating to find out what 
it did!

It returns a tuple of the spans of the groups.

Perhaps I might have used it if it had a less cryptic name and/or were 
documented.
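
As a minimal sketch of what .regs holds (the pattern and input below are made up for illustration, and since the attribute is undocumented this assumes CPython's re behaves as described above):

```python
import re

# Hypothetical pattern and input, chosen only to show the shape of .regs.
m = re.match(r"(\w+) (\w+)", "hello world")

# .regs is a tuple of (start, end) spans: group 0 first, then each group.
regs = m.regs

# The same information built from the documented Match.span() method.
spans = tuple(m.span(i) for i in range(m.re.groups + 1))

print(regs)
print(spans)
```

Building the tuple from span() is the documented route, so that's probably the safer one to rely on.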

--

___
Python tracker 
<https://bugs.python.org/issue47152>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue47081] Replace "qualifiers" with "quantifiers" in the re module documentation

2022-03-21 Thread Matthew Barnett


Matthew Barnett  added the comment:

I don't think it's a typo, and you could argue the case for "qualifiers", but I 
still agree with the proposal as it's a more meaningful term in the context.

--

___
Python tracker 
<https://bugs.python.org/issue47081>
___



[issue47023] re.sub shows key error on regex escape chars provided in repl param

2022-03-17 Thread Matthew Barnett


Matthew Barnett  added the comment:

I'd just like to point out that to a user it could _look_ like a bug, that an 
error occurred while reporting, because the traceback isn't giving a 'clean' 
report; the stuff about the KeyError is an internal detail.

--

___
Python tracker 
<https://bugs.python.org/issue47023>
___



[issue46825] slow matching on regular expression

2022-02-22 Thread Matthew Barnett


Matthew Barnett  added the comment:

The expression is a repeated alternative where the first alternative is a 
repeat. Repeated repeats can result in a lot of attempts and backtracking and 
should be avoided.

Try this instead:

(0|1(01*0)*1)+
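
A quick way to sanity-check the suggested pattern: it is the textbook expression for binary strings whose value is a multiple of three. That is an observation about the pattern itself, not something stated in the report, and the helper name below is made up:

```python
import re

PATTERN = re.compile(r"(0|1(01*0)*1)+")

def is_multiple_of_three(n):
    # Hypothetical helper: test the binary representation of n against the pattern.
    return PATTERN.fullmatch(format(n, "b")) is not None

# Compare the pattern's verdict with plain arithmetic for small numbers.
results = [is_multiple_of_three(n) for n in range(64)]
expected = [n % 3 == 0 for n in range(64)]
print(results == expected)
```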

--

___
Python tracker 
<https://bugs.python.org/issue46825>
___



[issue46627] Regex hangs indefinitely

2022-02-03 Thread Matthew Barnett


Matthew Barnett  added the comment:

That pattern has:

(?P[^]]+)+

Is that intentional? It looks wrong to me.

--

___
Python tracker 
<https://bugs.python.org/issue46627>
___



[issue46515] Benefits Of Phool Makhana

2022-01-25 Thread Matthew Barnett


Change by Matthew Barnett :


--
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue46515>
___



[issue46410] TypeError when parsing regexp with unicode named character sequence escape

2022-01-18 Thread Matthew Barnett


Matthew Barnett  added the comment:

They're not supported in string literals either:

Python 3.10.1 (tags/v3.10.1:2cd268a, Dec  6 2021, 19:10:37) [MSC v.1929 64 bit 
(AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> "\N{KEYCAP NUMBER SIGN}"
  File "<stdin>", line 1
    "\N{KEYCAP NUMBER SIGN}"
    ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in 
position 0-21: unknown Unicode character name

--

___
Python tracker 
<https://bugs.python.org/issue46410>
___



[issue45899] NameError on if clause of class-level list comprehension

2021-11-25 Thread Matthew Barnett


Matthew Barnett  added the comment:

It's not just in the 'if' clause:

>>> class Foo:
...     a = ['a', 'b']
...     b = ['b', 'c']
...     c = [b for x in a]
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in Foo
  File "<stdin>", line 4, in <listcomp>
NameError: name 'b' is not defined

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue45899>
___



[issue45869] Unicode and acii regular expressions do not agree on ascii space characters

2021-11-22 Thread Matthew Barnett


Matthew Barnett  added the comment:

For comparison, the regex module says that 0x1C..0x1F aren't whitespace, and 
the Unicode property White_Space ("\p{White_Space}" in a pattern, where 
supported) also says that they aren't whitespace.
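
For anyone wanting to see the disagreement in the re module itself, here is a small sketch. It only reports what the running interpreter does, so the printed lists may differ between versions:

```python
import re

# Compare how re classifies the characters 0x1C..0x1F in its two modes.
unicode_hits = []
ascii_hits = []
for codepoint in range(0x1C, 0x20):
    ch = chr(codepoint)
    unicode_hits.append(re.fullmatch(r"\s", ch) is not None)
    ascii_hits.append(re.fullmatch(r"\s", ch, flags=re.ASCII) is not None)

# What str.isspace() says about the same characters, for reference.
isspace_hits = [chr(cp).isspace() for cp in range(0x1C, 0x20)]

print(unicode_hits)  # str pattern, default (Unicode) matching
print(ascii_hits)    # same pattern with re.ASCII
print(isspace_hits)
```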

--

___
Python tracker 
<https://bugs.python.org/issue45869>
___



[issue45539] Negative lookaround assertions sometimes leak capture groups

2021-10-21 Thread Matthew Barnett


Matthew Barnett  added the comment:

It's definitely a bug.

In order for the pattern to match, the negative lookaround must match, which 
means that its subexpression mustn't match, so none of the groups in that 
subexpression have captured.

--
versions: +Python 3.10

___
Python tracker 
<https://bugs.python.org/issue45539>
___



[issue45461] UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 8191: \ at end of string

2021-10-13 Thread Matthew Barnett


Matthew Barnett  added the comment:

It can be shortened to this:

buffer = b"a" * 8191 + b"\\r\\n"

with open("bug_csv.csv", "wb") as f:
    f.write(buffer)

with open("bug_csv.csv", encoding="unicode_escape", newline="") as f:
    f.readline()

To me it looks like it's reading in blocks of 8K and then decoding them, but 
it isn't correctly handling an escape sequence that happens to cross a block 
boundary.

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue45461>
___



[issue45155] Add default arguments for int.to_bytes()

2021-09-13 Thread Matthew Barnett


Matthew Barnett  added the comment:

I wonder whether there should be a couple of other endianness values, namely, 
"native" and "network", for those cases where you want to be explicit about it. 
If you use "big" it's not clear whether that's because you want network 
endianness or because the platform is big-endian.
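
Until something like that exists, the intent can be spelled out with named constants. A sketch only: the constant names are hypothetical, and int.to_bytes() itself accepts just "big" or "little":

```python
import sys

# Hypothetical constants; they are not values that int.to_bytes() understands.
NETWORK_ORDER = "big"          # network byte order is big-endian
NATIVE_ORDER = sys.byteorder   # whatever the running platform uses

value = 258  # 0x0102
wire_bytes = value.to_bytes(2, NETWORK_ORDER)
native_bytes = value.to_bytes(2, NATIVE_ORDER)

print(wire_bytes)
print(native_bytes)
```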

--

___
Python tracker 
<https://bugs.python.org/issue45155>
___



[issue45155] Add default arguments for int.to_bytes()

2021-09-13 Thread Matthew Barnett


Matthew Barnett  added the comment:

I'd probably say "In the face of ambiguity, refuse the temptation to guess".

As there's disagreement about the 'correct' default, make it None and require 
either "big" or "little" if length > 1 (the default).

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue45155>
___



[issue44699] Simple regex appears to take exponential time in length of input

2021-07-21 Thread Matthew Barnett


Matthew Barnett  added the comment:

It's called "catastrophic backtracking". Think of the number of ways it could 
match, say, 4 characters: 4, 3+1, 2+2, 2+1+1, 1+3, 1+2+1, 1+1+2, 1+1+1+1. Now 
try 5 characters...
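
The counting can be checked directly. A minimal sketch, assuming the pattern is a repeat nested in a repeat, such as (a+)+ (the reported pattern isn't quoted here): each way of cutting n characters into consecutive chunks is one way the nested repeat could match, and the number of ways doubles with every extra character.

```python
def compositions(n):
    """Yield every ordered way to write n as a sum of positive integers."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest

# Number of ways for 1..10 characters.
counts = {n: sum(1 for _ in compositions(n)) for n in range(1, 11)}
print(counts)
```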

--

___
Python tracker 
<https://bugs.python.org/issue44699>
___



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-21 Thread Matthew Barnett


Matthew Barnett  added the comment:

I've only just realised that the test cases don't cover all eventualities: none 
of them test what happens with multiple spaces _between_ the letters, such as:

'  a  b c '.split(maxsplit=1) == ['a', 'b c ']

Comparing that with:

'  a  b c '.split(' ', maxsplit=1)

you see that passing None as the split character does not mean "any whitespace 
character". There's clearly a little more to it than that.
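
The two calls side by side, as a small sketch of current behaviour (no keepempty involved):

```python
text = "  a  b c "

# No separator: runs of whitespace act as one delimiter, and leading
# whitespace is skipped before the first field.
default_split = text.split(maxsplit=1)

# Explicit ' ': every single space is a delimiter, so the leading space
# produces an empty first field and uses up the one allowed split.
space_split = text.split(" ", maxsplit=1)

print(default_split)
print(space_split)
```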

--

___
Python tracker 
<https://bugs.python.org/issue28937>
___



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Matthew Barnett


Matthew Barnett  added the comment:

We have that already, although it's spelled:

'   x y z'.split(maxsplit=1) == ['x', 'y z']

because the keepempty option doesn't exist yet.

--

___
Python tracker 
<https://bugs.python.org/issue28937>
___



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Matthew Barnett


Matthew Barnett  added the comment:

The best way to think of it is that .split() is like .split(' '), except that 
it's splitting on any whitespace character instead of just ' ', and keepempty 
is defaulting to False instead of True.

Therefore:

'   x y z'.split(maxsplit=1, keepempty=True) == ['', '  x y z']

because:

'   x y z'.split(' ', maxsplit=1) == ['', '  x y z']

but:

'   x y z'.split(maxsplit=1, keepempty=False) == ['x y z']

At least, I think that's the case!

--

___
Python tracker 
<https://bugs.python.org/issue28937>
___



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Matthew Barnett


Matthew Barnett  added the comment:

The case:

'  a b c  '.split(maxsplit=1) == ['a', 'b c  ']

suggests that empty strings don't count towards maxsplit, otherwise it would 
return [' a b c  '] (i.e. the split would give ['', ' a b c  '] and dropping 
the empty strings would give [' a b c  ']).

--

___
Python tracker 
<https://bugs.python.org/issue28937>
___



[issue43714] re.split(), re.sub(): '\Z' must consume end of string if it matched

2021-04-03 Thread Matthew Barnett


Matthew Barnett  added the comment:

Do any other regex implementations behave the way you want?

In my experience, there's no single "correct" way for a regex to behave; 
different implementations might give slightly different results, so if the most 
common ones behave a certain way, then that's the de facto standard, even if it's 
not what you'd expect or want.

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue43714>
___



[issue43535] Make str.join auto-convert inputs to strings.

2021-03-19 Thread Matthew Barnett


Matthew Barnett  added the comment:

I'm also -1, for the same reason as Serhiy gave. However, if it was opt-in, 
then I'd be OK with it.

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue43535>
___



[issue43156] Python windows installer has a confusing name - add setup to its name

2021-02-07 Thread Matthew Barnett


Matthew Barnett  added the comment:

Sorry to bikeshed, but I think it would be clearer to keep the version next to 
the "python" and the "setup" at the end:

python-3.10.0a5-win32-setup.exe
python-3.10.0a5-win64-setup.exe

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue43156>
___



[issue42871] Regex compilation crashed if I change order of alternatives under quantifier

2021-01-08 Thread Matthew Barnett


Matthew Barnett  added the comment:

Example 1:

((a)|b\2)*
 ^^^   Group 2

((a)|b\2)*
      ^^   Reference to group 2

The reference refers backwards to the group.

Example 2:

(b\2|(a))*
     ^^^   Group 2

(b\2|(a))*
  ^^   Reference to group 2

The reference refers forwards to the group.

As I said, the re module doesn't support forward references to groups.

If you have a regex where forward references are unavoidable, try the 3rd-party 
'regex' module instead. It's available on PyPI.
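
The difference between the two examples can be seen at compile time. A small sketch with a made-up helper name:

```python
import re

def compiles(pattern):
    # Hypothetical helper: report whether the re module accepts the pattern.
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

backward_ok = compiles(r"((a)|b\2)*")   # Example 1: group 2 is defined before \2
forward_ok = compiles(r"(b\2|(a))*")    # Example 2: \2 appears before group 2

print(backward_ok, forward_ok)
```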

--

___
Python tracker 
<https://bugs.python.org/issue42871>
___



[issue42871] Regex compilation crashed if I change order of alternatives under quantifier

2021-01-08 Thread Matthew Barnett


Matthew Barnett  added the comment:

It's not a crash. It's complaining that you're referring to group 2 before 
defining it. The re module doesn't support forward references to groups, but 
only backward references to them.

--

___
Python tracker 
<https://bugs.python.org/issue42871>
___



[issue42668] re.escape does not correctly escape newlines

2020-12-17 Thread Matthew Barnett


Matthew Barnett  added the comment:

In a regex, putting a backslash before any character that's not an ASCII-range 
letter or digit makes it a literal. re.escape doesn't special-case control 
characters. Its purpose is to make a string that might contain metacharacters 
into one that's a literal, and it already does that, although it sometimes 
escapes more than necessary.
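
A short sketch of that purpose, using a made-up string that contains both metacharacters and a newline:

```python
import re

raw = "a.b\n(c)"            # contains metacharacters and a newline
escaped = re.escape(raw)

# The escaped pattern matches the original text literally...
literal_match = re.fullmatch(escaped, raw) is not None
# ...and no longer matches text that merely fits the unescaped pattern.
rejects_other = re.fullmatch(escaped, "aXb\nc") is None
# Unescaped, '.' matches any character and '(c)' is a group.
loose_match = re.fullmatch(raw, "aXb\nc") is not None

print(literal_match, rejects_other, loose_match)
```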

--

___
Python tracker 
<https://bugs.python.org/issue42668>
___



[issue42475] wrongly cache pattern by re.compile

2020-11-26 Thread Matthew Barnett


Matthew Barnett  added the comment:

That behaviour has nothing to do with re.

This line:

samples = filter(lambda sample: not pttn.match(sample), data)

creates a lazy iterator that, when it's evaluated, will use the value of 'pttn' 
_at that time_.

However, you then bind 'pttn' to something else.

Here's a simple example:

>>> x = 1
>>> func = lambda: print(x)
>>> func()
1
>>> x = 2
>>> func()
2

A workaround is to capture the current value with a default argument:

>>> x = 1
>>> func = lambda x=x: print(x)
>>> func()
1
>>> x = 2
>>> func()
1
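
The same thing with filter and a compiled pattern, as a sketch. The data and patterns are made up; only the names 'pttn' and 'sample' come from the line quoted above:

```python
import re

data = ["apple", "banana", "cherry"]   # made-up sample data

pttn = re.compile("a")
# Looks up 'pttn' each time the lambda runs, i.e. when the filter is consumed.
late = filter(lambda sample: not pttn.match(sample), data)
# Captures the current value of 'pttn' as a default argument.
bound = filter(lambda sample, pttn=pttn: not pttn.match(sample), data)

pttn = re.compile("b")   # rebinding before either filter is consumed

late_result = list(late)     # filtered with the "b" pattern
bound_result = list(bound)   # filtered with the original "a" pattern

print(late_result)
print(bound_result)
```

Calling list() on the filter before rebinding 'pttn' would also avoid the surprise.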

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue42475>
___



[issue42473] re.sub ignores flag re.M

2020-11-26 Thread Matthew Barnett


Matthew Barnett  added the comment:

Not a bug.

Argument 4 of re.sub is the count:

sub(pattern, repl, string, count=0, flags=0)

not the flags.

--
nosy: +mrabarnett
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue42473>
___



[issue41885] Unexpected behavior re.sub() with raw f-strings

2020-09-29 Thread Matthew Barnett


Matthew Barnett  added the comment:

Arguments are evaluated first and then the results are passed to the function. 
That's true throughout the language.

In this instance, you can use \g<1> in the replacement string to refer to group 
1:

re.sub(r'([a-z]+)', fr"\g<1>{REPLACEMENT}", 'something')
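
Spelled out step by step, with a hypothetical value for REPLACEMENT (the value used in the report isn't shown here):

```python
import re

REPLACEMENT = "123"   # hypothetical value, for illustration only

# The f-string is evaluated first, so re.sub() only ever sees the result.
template = fr"\g<1>{REPLACEMENT}"
result = re.sub(r"([a-z]+)", template, "something")

print(template)
print(result)
```

With \g<1> the group reference ends at the '>', so digits that follow it can't be read as part of the group number.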

--

___
Python tracker 
<https://bugs.python.org/issue41885>
___



[issue41764] sub function would not work without the flags but the search would work fine

2020-09-11 Thread Matthew Barnett


Matthew Barnett  added the comment:

The arguments are: re.sub(pattern, repl, string, count=0, flags=0).

Therefore:

re.sub("pattern","replace", txt, re.IGNORECASE | re.DOTALL)

is passing re.IGNORECASE | re.DOTALL as the count, not the flags.

It's in the documentation and the interactive help.

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue41764>
___



[issue41664] re.sub does NOT substitute all the matching patterns when re.IGNORECASE is used

2020-08-29 Thread Matthew Barnett


Matthew Barnett  added the comment:

The 4th argument of re.sub is 'count', not 'flags'.

re.IGNORECASE has the numeric value of 2, so:

re.sub(r'[aeiou]', '#', 'all is fair in love and war', re.IGNORECASE)

is equivalent to:

re.sub(r'[aeiou]', '#', 'all is fair in love and war', count=2)
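
A sketch comparing the two spellings. The count is passed by keyword here to make the equivalence explicit:

```python
import re

text = "all is fair in love and war"

flag_value = int(re.IGNORECASE)

# What the mistaken call amounts to: only the first 2 matches are replaced.
mistaken = re.sub(r"[aeiou]", "#", text, count=flag_value)
# Passing the flag by keyword does what was intended.
intended = re.sub(r"[aeiou]", "#", text, flags=re.IGNORECASE)

print(flag_value)
print(mistaken)
print(intended)
```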

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue41664>
___



[issue41531] Python 3.9 regression: Literal dict with > 65535 items are one item shorter

2020-08-12 Thread Matthew Barnett


Matthew Barnett  added the comment:

I think what's happening is that in 'compiler_dict' (Python/compile.c), it's 
checking whether 'elements' has reached a maximum (0x). However, it's not 
doing this after incrementing; instead, it's checking before incrementing and 
resetting 'elements' to 0 when it should be resetting to 1. The 65535th element 
isn't counted.

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue41531>
___



[issue40043] RegExp Conditional Construct (?(id/name)yes-pattern|no-pattern) Problem

2020-03-26 Thread Matthew Barnett


Matthew Barnett  added the comment:

That's what searching does!

Does the pattern match here? If not, advance by one character and try again. 
Repeat until a match is found or you've reached the end.
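
In code, that description looks roughly like this. It's a sketch of the idea with a made-up function name, not how the re module is implemented internally:

```python
import re

def search_by_hand(pattern, string):
    # Try to match at each position in turn, advancing by one character.
    compiled = re.compile(pattern)
    for pos in range(len(string) + 1):
        m = compiled.match(string, pos)
        if m is not None:
            return m
    return None   # reached the end without a match

by_hand = search_by_hand(r"\d+", "abc123")
built_in = re.search(r"\d+", "abc123")
no_match = search_by_hand(r"\d", "abc")

print(by_hand.span(), built_in.span(), no_match)
```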

--

___
Python tracker 
<https://bugs.python.org/issue40043>
___



[issue40043] RegExp Conditional Construct (?(id/name)yes-pattern|no-pattern) Problem

2020-03-22 Thread Matthew Barnett


Matthew Barnett  added the comment:

The documentation is talking about whether it'll match at the current position 
in the string. It's not a bug.

--
resolution:  -> not a bug

___
Python tracker 
<https://bugs.python.org/issue40043>
___



[issue40027] re.sub inconsistency beginning with 3.7

2020-03-20 Thread Matthew Barnett


Matthew Barnett  added the comment:

Duplicate of Issue39687.

See https://docs.python.org/3/library/re.html#re.sub and 
https://docs.python.org/3/whatsnew/3.7.html#changes-in-the-python-api.

--
resolution:  -> duplicate
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue40027>
___



[issue38826] Regular Expression Denial of Service in urllib.request.AbstractBasicAuthHandler

2020-03-03 Thread Matthew Barnett


Matthew Barnett  added the comment:

A smaller change to the regex would be to replace the "(?:.*,)*" with 
"(?:[^,]*,)*".

I'd also suggest using a raw string instead:

rx = re.compile(r'''(?:[^,]*,)*[ \t]*([^ \t]+)[ \t]+realm=(["']?)([^"']*)\2''', 
re.I)

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue38826>
___



[issue39436] Strange behavior of comparing int and float numbers

2020-01-23 Thread Matthew Barnett


Change by Matthew Barnett :


--
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue39436>
___



[issue39436] Strange behavior of comparing int and float numbers

2020-01-23 Thread Matthew Barnett


Matthew Barnett  added the comment:

Python floats have 53 bits of precision, so ints larger than 2**53 will lose 
their lower bits (assumed to be 0) when converted.
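
A small sketch of the effect, using the first int that doesn't fit (the numbers in the report aren't quoted here):

```python
big = 2**53 + 1            # needs 54 bits, one more than a float can hold

as_float = float(big)      # rounded to a value a float can represent
round_trip = int(as_float)

same_as_int = (big == as_float)   # int/float comparison is exact
lost = big - round_trip           # what the conversion dropped

print(round_trip, same_as_int, lost)
```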

--
nosy: +mrabarnett
resolution:  -> not a bug

___
Python tracker 
<https://bugs.python.org/issue39436>
___



[issue38974] using filedialog.askopenfilename() freezes python 3.8

2019-12-04 Thread Matthew Barnett


Matthew Barnett  added the comment:

I've just tried it on Windows 10 with Python 3.8 64-bit and Python 3.8 32-bit 
without issue.

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue38974>
___



[issue38764] Deterministic globbing.

2019-11-11 Thread Matthew Barnett


Matthew Barnett  added the comment:

I could also add: would sorting be case-sensitive or case-insensitive? Windows 
is case-insensitive, Linux is case-sensitive.

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue38764>
___



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-11-04 Thread Matthew Barnett


Matthew Barnett  added the comment:

It's been many years since I looked at the code, and there have been changes 
since then, so some of the details might not be correct.

As to how it should behave:

re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')

Iteration 1.
Match the repeated part. Group 1 matches.
Iteration 2.
Match the repeated part. Group 1 matches.
Has group 2 matched? No.
Try to match 'z'. Fail and backtrack.
Retry the repeated part.
Iteration 2.
Has group 1 matched? Yes.
Group 2 matches.
Has group 2 matched? Yes.
Try to match 'a'. Success. Group 1 matched and group 2 matched.


re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')

Iteration 1.
Match the repeated part. Group 1 matches.
Iteration 2.
Match the repeated part. Group 1 matches.
Has group 1 matched? Yes.
Try to match 'a'. Success. Group 1 matched and group 2 didn't match.

--

___
Python tracker 
<https://bugs.python.org/issue23692>
___



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-10-27 Thread Matthew Barnett


Matthew Barnett  added the comment:

Suppose you had a pattern:

.*

It would advance one character on each iteration of the * until the . failed to 
match. The text is finite, so it would stop matching eventually.

Now suppose you had a pattern:

(?:)*

On each iteration of the * it wouldn't advance, so it would keep matching 
forever.

A way to avoid that is to stop the * if it hasn't advanced.
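
The effect of that rule is visible from Python. A tiny sketch showing only that the match returns instead of looping:

```python
import re

# An empty group repeated with *: each iteration matches without advancing.
m = re.match(r"(?:)*", "abc")
span = m.span()

print(span)
```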

The example pattern shows that there's still a problem. It advances if a group 
has matched, but that group doesn't match until the first iteration, after the 
test, and does not, itself, advance. The * stops because it hasn't advanced, 
but, in this instance, that doesn't mean it never will.

The solution is for the * to check not only whether it has advanced, but also 
whether a group has changed. (Strictly speaking, the latter check is needed 
only if the repeated part tests whether a group also in the repeated part has 
changed, but it's probably not worth "optimising" for that possibility.)

In the regex module, it increments a "capture changed" counter whenever any 
group is changed (a group's first match or a change to a group's span). That 
makes it easier for the * to check. The code needs to save that counter for 
backtracking and restore it when backtracking.

I've mentioned only the *, but the same remarks apply to + and {...}, except 
that the {...} should keep repeating until it has reached its prescribed 
minimum.

--

___
Python tracker 
<https://bugs.python.org/issue23692>
___



[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread Matthew Barnett


Matthew Barnett  added the comment:

If we did decide to remove it, but there was still a demand for octal escapes, 
then I'd suggest introducing \oXXX.

--

___
Python tracker 
<https://bugs.python.org/issue38582>
___



[issue38582] re: backreference number in replace string can't >= 100

2019-10-24 Thread Matthew Barnett


Matthew Barnett  added the comment:

A numeric escape of 3 digits is an octal (base 8) escape; the octal escape 
"\100" gives the same character as the hexadecimal escape "\x40".

In a replacement template, you can use "\g<100>" if you want group 100 because 
\g<...> accepts both numeric and named group references.

However, \g<...> is not accepted in a pattern.

(By the way, in the "regex" module I added support for it in a pattern too.)
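
A sketch of the difference in a replacement template, using a made-up pattern with 100 single-character groups:

```python
import re

# "\100" is an octal escape: the same character as "\x40".
same_char = ("\100" == "\x40" == "@")

# Made-up pattern: 100 groups, each capturing one character.
pattern = "(.)" * 100
text = "a" * 99 + "Z"

group_100 = re.sub(pattern, r"\g<100>", text)   # refers to group 100
octal = re.sub(pattern, r"\100", text)          # read as an octal escape

print(same_char, group_100, octal)
```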

--

___
Python tracker 
<https://bugs.python.org/issue38582>
___



[issue37996] 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions

2019-08-31 Thread Matthew Barnett


Matthew Barnett  added the comment:

You wrote "the u had already been removed by hand". By removing the u in the 
_Python 2_ code, you changed that string from a Unicode string to a bytestring.

In a bytestring, \u is not an escape; b"\u" == b"\\u".

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue37996>
___



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread Matthew Barnett


Matthew Barnett  added the comment:

I've just had a look at _uniq, and the code surprises me.

The obvious way to detect duplicates is with a set, but that requires the items 
to be hashable. Are they?

Well, the first line of the function uses 'set', so they are.

Why, then, isn't it using a set to detect the duplicates?

How about this:

def _uniq(items):
    newitems = []
    seen = set()
    for item in items:
        if item not in seen:
            newitems.append(item)
            seen.add(item)
    return newitems
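
For hashable items, a shorter order-preserving variant relies on dicts keeping insertion order (guaranteed since Python 3.7). This is a sketch of the same idea under a hypothetical name, not a claim about what the re module uses:

```python
def uniq_via_dict(items):
    # dict.fromkeys() keeps the first occurrence of each key, in order.
    return list(dict.fromkeys(items))

deduped = uniq_via_dict([3, 1, 3, 2, 1])
print(deduped)
```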

--

___
Python tracker 
<https://bugs.python.org/issue37723>
___



[issue37687] Invalid regexp should rise exception

2019-07-25 Thread Matthew Barnett


Matthew Barnett  added the comment:

For historical reasons, if it isn't valid as a repeat then it's a literal. This 
is true in other regex implementations, and is by no means unique to the re 
module.
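
A small sketch of that rule, with made-up patterns (the pattern from the report isn't quoted here):

```python
import re

# "{a}" is not a valid repeat, so the braces are matched literally.
literal_braces = re.fullmatch(r"x{a}", "x{a}") is not None

# "{2}" is a valid repeat, so it means "two x's", not the literal text.
real_repeat = re.fullmatch(r"x{2}", "xx") is not None
literal_text = re.fullmatch(r"x{2}", "x{2}") is not None

print(literal_braces, real_repeat, literal_text)
```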

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue37687>
___



[issue37327] python re bug

2019-06-18 Thread Matthew Barnett


Matthew Barnett  added the comment:

The problem is the "(?:[^<]+|<(?!/head>))*?".

If I simplify it a little I get "(?:[^<]+)*?", which is a repeat within a 
repeat.

There are many ways in which it could match, and if what follows fails to match 
(it doesn't because there's no "

[issue36468] Treeview: wrong color change

2019-05-16 Thread Matthew Barnett


Matthew Barnett  added the comment:

I've just come across the same problem.

For future reference, adding the following code before using a Treeview widget 
will fix the problem:

from tkinter import ttk

def fixed_map(option):
    # Fix for setting text colour for Tkinter 8.6.9
    # From: https://core.tcl.tk/tk/info/509cafafae
    #
    # Returns the style map for 'option' with any styles starting with
    # ('!disabled', '!selected', ...) filtered out.

    # style.map() returns an empty list for missing options, so this
    # should be future-safe.
    return [elm for elm in style.map('Treeview', query_opt=option) if
            elm[:2] != ('!disabled', '!selected')]

style = ttk.Style()
style.map('Treeview', foreground=fixed_map('foreground'),
          background=fixed_map('background'))

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue36468>
___



[issue36653] Dictionary Key is without ' ' quotes

2019-04-17 Thread Matthew Barnett


Matthew Barnett  added the comment:

That should be:

def __repr__(self):
    return repr(self.name)

Not a bug.
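
In context, with a hypothetical class standing in for the one in the report:

```python
class Key:
    # Hypothetical class; only the 'name' attribute comes from the snippet above.
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return repr(self.name)   # repr() of a str includes the quotes

shown = repr({Key("a"): 1})
print(shown)
```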

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue36653>
___



[issue32308] Replace empty matches adjacent to a previous non-empty match in re.sub()

2019-04-12 Thread Matthew Barnett


Matthew Barnett  added the comment:

Consider re.findall(r'.{0,2}', 'abcde').

It finds 'ab', then continues where it left off to find 'cd', then 'e'.

It can also find ''; re.match(r'.*', '') does match, after all.

It could, in fact, find an infinite number of ''.

And what about re.match(r'()*', '')?

What should it do? Run forever? Raise an exception?

At some point you have to make a decision as to what should happen, and the 
general consensus has been to match once.
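
What that decision looks like in practice, as a short sketch:

```python
import re

# The empty string at the end is found once, not endlessly.
found = re.findall(r".{0,2}", "abcde")

# An empty group under * also matches once and stops.
empty_group = re.match(r"()*", "")

print(found)
print(empty_group.span())
```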

--

___
Python tracker 
<https://bugs.python.org/issue32308>
___



[issue32308] Replace empty matches adjacent to a previous non-empty match in re.sub()

2019-04-11 Thread Matthew Barnett


Matthew Barnett  added the comment:

It's now consistent with Perl, PCRE and .Net (C#), as well as re.split(), 
re.sub(), re.findall() and re.finditer().

--

___
Python tracker 
<https://bugs.python.org/issue32308>
___



[issue36397] re.split() incorrectly splitting on zero-width pattern

2019-03-23 Thread Matthew Barnett


Matthew Barnett  added the comment:

The list alternates between substrings (s, between the splits) and captures (c):

['1', '1', '2', '2', '11']
 -s-  -c-  -s-  -c-  -s--

You can use slicing to extract the substrings:

>>> re.split(r'(?<=(\d))(?!\1)(?=\d)', '12111')[ : : 2]
['1', '2', '111']

--

___
Python tracker 
<https://bugs.python.org/issue36397>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36397] re.split() incorrectly splitting on zero-width pattern

2019-03-21 Thread Matthew Barnett


Matthew Barnett  added the comment:

From the docs:

"""If capturing parentheses are used in pattern, then the text of all groups in 
the pattern are also returned as part of the resulting list."""

The pattern does contain a capture, so that's why the result has additional '1' 
and '2'.

Presumably, Java's split doesn't do that.

Not a bug.
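
A minimal sketch of the documented behaviour (not part of the original comment); the input string and separators are made up for illustration:

```python
import re

text = 'a,b;c'

# No capture group: only the substrings between the splits are returned.
without_capture = re.split(r'[,;]', text)

# Capture group: the captured separators are interleaved with the substrings.
with_capture = re.split(r'([,;])', text)

# A non-capturing group (?:...) groups without adding anything to the result.
non_capturing = re.split(r'(?:,|;)', text)
```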

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue36397>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35155] Clarify Protocol Handlers in urllib.request Docs

2019-02-12 Thread Matthew Barnett


Matthew Barnett  added the comment:

You could italicise the "protocol" part using asterisks, like this:

*protocol*_request

or this:

*protocol*\ _request

depending on the implementation of the rst software.

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue35155>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35859] Capture behavior depends on the order of an alternation

2019-01-30 Thread Matthew Barnett


Matthew Barnett  added the comment:

It matches, and the span is (0, 2).

The only way that it can match like that is for the capture group to match the 
'a', and the final 'b' to match the 'b'.

Therefore, re.search(r'(ab|a)*b', 'ab').groups() should be ('a', ), as it is 
for the pattern with a greedy repeat.

--

___
Python tracker 
<https://bugs.python.org/issue35859>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35859] Capture behavior depends on the order of an alternation

2019-01-30 Thread Matthew Barnett


Matthew Barnett  added the comment:

It looks like a bug in re to me.

--

___
Python tracker 
<https://bugs.python.org/issue35859>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35653] All regular expression match groups are the empty string

2019-01-03 Thread Matthew Barnett


Matthew Barnett  added the comment:

Look at the spans of the groups:

>>> import re
>>> re.search(r'^(?:(\d*)(\D*))*$', "42AZ").span(1)
(4, 4)
>>> re.search(r'^(?:(\d*)(\D*))*$', "42AZ").span(2)
(4, 4)

They're telling you that the groups are matching twice (because of the outer 
*). The first time, they match ('42', 'AZ'); the second time, they match ('', 
'') at the end of the string.

Not a bug.
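
One way to see both iterations (a sketch, not part of the original comment) is to run just the repeated part with finditer(), which approximates what the outer * does:

```python
import re

# Each element is (span of the whole match, groups) for one iteration.
iterations = [(m.span(), m.groups())
              for m in re.finditer(r'(\d*)(\D*)', '42AZ')]

# With the outer '*', only the last iteration's groups are kept.
last_groups = re.search(r'^(?:(\d*)(\D*))*$', '42AZ').groups()
```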

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue35653>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35645] Alarm usage

2019-01-03 Thread Matthew Barnett


Matthew Barnett  added the comment:

@Steven: The complaint is that the BEL character ('\a') doesn't result in a 
beep when printed.

@Siva: These days, you shouldn't be relying on '\a' because it's not always 
supported. If you want to make a beep, do so with the appropriate function 
call. Ask Google!

--

___
Python tracker 
<https://bugs.python.org/issue35645>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35546] String formatting produces incorrect result with left-aligned zero-padded format

2018-12-20 Thread Matthew Barnett


Matthew Barnett  added the comment:

A similar issue exists with centring:

>>> format(42, '^020')
'00000000042000000000'

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue35546>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35538] splitext does not seems to handle filepath ending in .

2018-12-19 Thread Matthew Barnett


Matthew Barnett  added the comment:

It always returns the dot.

For example:

>>> posixpath.splitext('.blah.txt')
('.blah', '.txt')

If there's no extension (no dot):

>>> posixpath.splitext('blah')
('blah', '')

Not a bug.
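
A short sketch covering the trailing-dot case from the issue title (not part of the original comment); the file names are made up:

```python
import posixpath

# A trailing dot is returned as the (empty-looking) extension.
trailing_dot = posixpath.splitext('archive.')

# Leading dots belong to the name, not the extension.
hidden_file = posixpath.splitext('.bashrc')

# Only the last dot starts the extension.
double_ext = posixpath.splitext('backup.tar.gz')
```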

--
nosy: +mrabarnett
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 
<https://bugs.python.org/issue35538>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35072] re.sub does not play nice with chr(92)

2018-10-26 Thread Matthew Barnett


Matthew Barnett  added the comment:

@Ezio: the value of stringy_thingy is irrelevant because it never gets that 
far; it fails when it tries to parse the replacement, which occurs before 
attempting any matching.

I can't reproduce the difference either.

--
status: pending -> open

___
Python tracker 
<https://bugs.python.org/issue35072>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34694] Dismiss To Avoid Slave/Master wording cause it easier for non English spoken programmers

2018-09-26 Thread Matthew Barnett


Change by Matthew Barnett :


--
nosy:  -mrabarnett

___
Python tracker 
<https://bugs.python.org/issue34694>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34763] Python lacks 0x4E17

2018-09-21 Thread Matthew Barnett


Change by Matthew Barnett :


--
Removed message: https://bugs.python.org/msg326012

___
Python tracker 
<https://bugs.python.org/issue34763>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34763] Python lacks 0x4E17

2018-09-21 Thread Matthew Barnett


Change by Matthew Barnett :


--
Removed message: https://bugs.python.org/msg326014

___
Python tracker 
<https://bugs.python.org/issue34763>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34763] Python lacks 0x4E17

2018-09-21 Thread Matthew Barnett


Change by Matthew Barnett :


--
Removed message: https://bugs.python.org/msg326013

___
Python tracker 
<https://bugs.python.org/issue34763>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34763] Python lacks 0x4E17

2018-09-21 Thread Matthew Barnett


Change by Matthew Barnett :


--
Removed message: https://bugs.python.org/msg326015

___
Python tracker 
<https://bugs.python.org/issue34763>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34763] Python lacks 0x4E17

2018-09-21 Thread Matthew Barnett

Matthew Barnett  added the comment:

Unicode 11.0.0 has 卅 (U+5345) as being numeric and having the value 30.

What's the difference between that and U+4E17?

I notice that they look a lot alike. Are they different variants, perhaps 
traditional vs simplified?

--

___
Python tracker 
<https://bugs.python.org/issue34763>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34738] Distutils: ZIP files don't include directory entries

2018-09-19 Thread Matthew Barnett


Matthew Barnett  added the comment:

I don't see a problem with this. If the zip file has 'dist/file1.py' then you 
know to create a directory when unzipping. If you want to indicate that there's 
an empty directory 'foo', then put 'foo/' in the zip file.
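
A minimal sketch of that convention using the standard zipfile module (not part of the original comment); the in-memory buffer and file contents are assumptions for illustration:

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    # No explicit 'dist/' entry: the directory is implied by the path.
    zf.writestr('dist/file1.py', 'print("hello")\n')
    # An empty directory needs its own entry, with a trailing slash.
    zf.writestr('foo/', '')

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    dir_entries = [info.filename for info in zf.infolist() if info.is_dir()]
```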

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue34738>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34605] Avoid master/slave terminology

2018-09-07 Thread Matthew Barnett


Matthew Barnett  added the comment:

Not all uses of the word "master" are associated with slavery, e.g. "master 
craftsman", "master copy", "master file table".

I think it's best to avoid use of master/slave where practicable, but other 
uses of "master" are not necessarily a problem.

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue34605>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33785] Crash caused by pasting ̖̈ into python

2018-06-06 Thread Matthew Barnett


Matthew Barnett  added the comment:

For clarity, the first is '\U00010308\U00010316' and the second is 
'\U00010306\U00010300\U0001030B'.

The BMP is the Basic Multilingual Plane, which covers the codepoints in the 
range U+0000 to U+FFFF. Some software has a problem dealing with codepoints 
outside the BMP.
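
A small sketch for spotting such codepoints (not part of the original comment); the helper name outside_bmp is hypothetical:

```python
def outside_bmp(text):
    """Return the codepoints in text that lie above U+FFFF."""
    return ['U+%04X' % ord(ch) for ch in text if ord(ch) > 0xFFFF]


first = '\U00010308\U00010316'
astral = outside_bmp(first)
```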

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue33785>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33721] os.path.exists() ought to return False if pathname contains NUL

2018-05-31 Thread Matthew Barnett


Matthew Barnett  added the comment:

It also raises a ValueError on Windows. For other invalid paths on Windows it 
returns False.

--
nosy: +mrabarnett

___
Python tracker 
<https://bugs.python.org/issue33721>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33566] re.findall() dead locked whent the expected ending char not occur until end of string

2018-05-18 Thread Matthew Barnett

Matthew Barnett <pyt...@mrabarnett.plus.com> added the comment:

You don't give the value of 'newlines', but the problem is probably 
catastrophic backtracking, not deadlock.

--
nosy: +mrabarnett

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue33566>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32982] Parse out invisible Unicode characters?

2018-03-02 Thread Matthew Barnett

Matthew Barnett <pyt...@mrabarnett.plus.com> added the comment:

For the record, '\u200e' is '\N{LEFT-TO-RIGHT MARK}'.

--
nosy: +mrabarnett

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue32982>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25054] Capturing start of line '^'

2017-12-02 Thread Matthew Barnett

Matthew Barnett <pyt...@mrabarnett.plus.com> added the comment:

findall() and finditer() consist of multiple uses of search(), basically, as do 
sub() and split(), so we want the same rule to apply to them all.

--

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue25054>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25054] Capturing start of line '^'

2017-12-02 Thread Matthew Barnett

Matthew Barnett <pyt...@mrabarnett.plus.com> added the comment:

The pattern:

\b|:+

will match a word boundary (zero-width) before colons, so if there's a word 
followed by colons, finditer will find the boundary and then the colons. You 
_can_ get a zero-width match (ZWM) joined to the start of a nonzero-width match 
(NWM). That's not really surprising.

If you wanted to avoid a ZWM joined to either end of a NWM, you'd need to keep 
looking for another match at a position even after you'd already found a match 
if what you'd found was zero-width. That would also affect re.search and 
re.match.

For regex on Python 3.7, I'm going with avoiding a ZWM joined to the end of a 
NWM, unless re's going a different way, in which case I have more work to do to 
remain compatible! The change I did for Python 3.7+ was trivial.
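
A sketch of the \b|:+ case (not part of the original comment), assuming Python 3.7 or later, where a non-empty match may start just after an empty one; the input string is made up:

```python
import re

text = 'a::b'

# Collect the span of every match, zero-width or not.
spans = [m.span() for m in re.finditer(r'\b|:+', text)]

# Zero-width matches have equal start and end positions.
zero_width = [s for s in spans if s[0] == s[1]]
```

The zero-width match at the word boundary (1, 1) is followed by the colons at (1, 3), i.e. a ZWM joined to the start of a NWM.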

--

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue25054>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31969] re.groups() is not checking the arguments

2017-11-08 Thread Matthew Barnett

Matthew Barnett <pyt...@mrabarnett.plus.com> added the comment:

@Narendra: The argument, if provided, is merely a default. Checking whether it 
_could_ be used would not be straightforward, and raising an exception if it 
would never be used would have little, if any, benefit.

It's not a bug, and it's not worth changing.
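
A minimal sketch of how the default is used (not part of the original comment); the version-number pattern is made up for illustration:

```python
import re

pattern = re.compile(r'(\d+)(?:\.(\d+))?')

# The optional group did not participate, so the default fills it in.
no_default = pattern.match('24').groups()
with_default = pattern.match('24').groups('0')

# Every group participated, so the default is silently unused.
unused_default = pattern.match('24.5').groups('0')
```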

--
status: open -> closed

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue31969>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31856] Unexpected behavior of re module when VERBOSE flag is set

2017-10-23 Thread Matthew Barnett

Matthew Barnett <pyt...@mrabarnett.plus.com> added the comment:

Your verbose examples put the pattern into raw triple-quoted strings, which is 
OK, but their first character is a backslash, which makes the next character (a 
newline) an escaped literal whitespace character. Escaped whitespace is 
significant in a verbose pattern.
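
A sketch of the difference (not part of the original comment); the pattern is made up and is not the one from the report:

```python
import re

# The pattern's first character is a backslash, which escapes the newline
# that follows it, so the pattern begins with a literal '\n'.
escaped_newline = re.compile(r'''\
    \d+   # digits
''', re.VERBOSE)

# Without the backslash, the newline is ordinary whitespace and is ignored.
plain = re.compile(r'''
    \d+   # digits
''', re.VERBOSE)
```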

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue31856>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31803] Remove not portable time.clock(), replaced by time.perf_counter() and time.process_time()

2017-10-17 Thread Matthew Barnett

Matthew Barnett <pyt...@mrabarnett.plus.com> added the comment:

@Victor: True, people often ignore DeprecationWarning anyway, but that's their 
problem, at least you can say "well, you were warned". They might not have read 
the documentation on it recently because they have not felt the need to read 
again about a function with which they are already familiar.

--
nosy: +mrabarnett

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue31803>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31759] re wont recover nor fail on runaway regular expression

2017-10-13 Thread Matthew Barnett

Matthew Barnett <pyt...@mrabarnett.plus.com> added the comment:

@Tim: the regex module includes some extra checks to reduce the chance of 
excessive backtracking. In the case of the OP's example, they seem to be 
working. However, it's difficult to know when adding such checks will help, and 
your example is one case where they are being done but aren't helping, with the 
result that it's slower.

--

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue31759>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31759] re wont recover nor fail on runaway regular expression

2017-10-11 Thread Matthew Barnett

Matthew Barnett <pyt...@mrabarnett.plus.com> added the comment:

You shouldn't assume that, just because it takes a long time on one 
implementation, it'll take a long time on all of the others, because it's 
sometimes possible to include additional checks to reduce the problem. (I doubt 
you could eliminate the problem entirely, however.)

My regex module, for example, includes some additional checks, and it seems to 
be OK with these tests.

--

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue31759>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

2017-08-14 Thread Matthew Barnett

Matthew Barnett added the comment:

The re module works with codepoints; it doesn't understand canonical 
equivalence.

For example, it doesn't recognise that "\N{LATIN CAPITAL LETTER E}\N{COMBINING 
ACUTE ACCENT}" is equivalent to "\N{LATIN CAPITAL LETTER E WITH ACUTE}".

This is true for Python in general, except for identifiers, which are 
normalised:

>>> "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
'É'
>>> É = 0
>>> "\N{LATIN CAPITAL LETTER E WITH ACUTE}"
'É'
>>> É
0

This also means that, say, '.' will match only 1 _codepoint_.
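
A sketch of the consequence, and of one common workaround that is not from the original comment: normalising both sides with unicodedata.normalize before matching.

```python
import re
import unicodedata

decomposed = 'E\u0301'   # LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT
composed = '\u00c9'      # LATIN CAPITAL LETTER E WITH ACUTE

# '.' matches one codepoint, so it cannot cover the two-codepoint form.
dot_match = re.fullmatch(r'.', decomposed)

# The two spellings are not treated as equivalent.
direct_match = re.fullmatch(composed, decomposed)

# After NFC normalisation the text is a single codepoint.
normalised = unicodedata.normalize('NFC', decomposed)
normalised_match = re.fullmatch(r'.', normalised)
```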

--
nosy: +mrabarnett

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue31193>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30802] datetime.datetime.strptime('200722', '%Y%U')

2017-07-25 Thread Matthew Barnett

Matthew Barnett added the comment:

I think the relevant standard is ISO 8601:

https://en.wikipedia.org/wiki/ISO_8601

The first day of the week is Monday.

Note particularly the examples it gives:

Monday 29 December 2008 is written "2009-W01-1"
Sunday 3 January 2010 is written "2009-W53-7"

So the first few days of January can be in the last week of the previous year!
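
The two examples can be checked with date.isocalendar(), which returns the ISO year, week number and weekday (a sketch, not part of the original comment):

```python
import datetime

# Monday 29 December 2008 belongs to week 1 of ISO year 2009.
monday = tuple(datetime.date(2008, 12, 29).isocalendar())

# Sunday 3 January 2010 still belongs to week 53 of ISO year 2009.
sunday = tuple(datetime.date(2010, 1, 3).isocalendar())
```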

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30802>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30973] Regular expression "hangs" interpreter

2017-07-20 Thread Matthew Barnett

Matthew Barnett added the comment:

The regex module is much better in this respect, but it's not foolproof. With 
this particular example it completes quickly.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30973>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30927] re.sub() does not work correctly on '.' pattern and \n

2017-07-13 Thread Matthew Barnett

Matthew Barnett added the comment:

The 4th parameter is the count, not the flags:

sub(pattern, repl, string, count=0, flags=0)

>>> re.sub(r'X.', '+', '-X\n-', flags=re.DOTALL)
'-+-'
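
A sketch of what goes wrong when the flag lands in the count position (not part of the original comment). The count is passed by keyword here, as int(re.DOTALL), to mimic the positional call:

```python
import re

text = '-X\n-'

# re.DOTALL is just the number 16, so as the 4th argument it means
# "replace at most 16 times"; '.' still does not match the newline.
as_count = re.sub(r'X.', '+', text, count=int(re.DOTALL))

# Passed by keyword, the flag takes effect.
as_flags = re.sub(r'X.', '+', text, flags=re.DOTALL)
```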

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30927>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30838] re \w does not match some valid Unicode characters

2017-07-05 Thread Matthew Barnett

Matthew Barnett added the comment:

Python identifiers match the regex:

[_\p{XID_Start}]\p{XID_Continue}*

The standard re module doesn't support \p{...}, but the third-party "regex" 
module does.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30838>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread Matthew Barnett

Matthew Barnett added the comment:

In Unicode 9.0.0, U+1885 and U+1886 changed from being 
General_Category=Other_Letter (Lo) to General_Category=Nonspacing_Mark (Mn).

U+2118 is General_Category=Math_Symbol (Sm) and U+212E is 
General_Category=Other_Symbol (So).

\w doesn't include Mn, Sm or So.

The str.isidentifier() method uses the Unicode properties XID_Start and 
XID_Continue, which include these codepoints.
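
A sketch comparing the two checks for U+2118 and U+212E (not part of the original comment):

```python
import re

# For each codepoint: (is it a valid identifier?, does \w match it?)
results = {
    ch: (ch.isidentifier(), re.fullmatch(r'\w', ch) is not None)
    for ch in ('\u2118', '\u212e')
}
```

Both are accepted as identifiers but rejected by \w, because their general categories (Sm and So) are not ones that \w includes.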

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30838>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30802] datetime.datetime.strptime('200722', '%Y%U')

2017-06-29 Thread Matthew Barnett

Matthew Barnett added the comment:

Expected result is datetime.datetime(2017, 6, 25, 0, 0).

--
nosy: +mrabarnett

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30802>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30772] If I make an attribute "[a unicode version of B]", it gets assigned to "[ascii B]", and so on.

2017-06-26 Thread Matthew Barnett

Matthew Barnett added the comment:

See PEP 3131 -- Supporting Non-ASCII Identifiers

It says: """All identifiers are converted into the normal form NFKC while 
parsing; comparison of identifiers is based on NFKC."""

>>> import unicodedata
>>> unicodedata.name(unicodedata.normalize('NFKC', '\N{MATHEMATICAL DOUBLE-STRUCK CAPITAL B}'))
'LATIN CAPITAL LETTER B'
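
A sketch showing the effect on an actual assignment (not part of the original comment); exec() and the namespace dict are used only so that the result can be inspected:

```python
import unicodedata

fancy_b = '\N{MATHEMATICAL DOUBLE-STRUCK CAPITAL B}'

# Compile and run source code whose identifier is the double-struck B.
namespace = {}
exec(fancy_b + ' = 1', namespace)

# The parser normalised the identifier to NFKC before storing it.
stored_as_plain_b = 'B' in namespace
stored_as_fancy_b = fancy_b in namespace
```

The normalisation happens while parsing source code; a name passed as a string to setattr() or getattr() is not normalised.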

--
nosy: +mrabarnett

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30772>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30736] Support Unicode 10.0

2017-06-22 Thread Matthew Barnett

Matthew Barnett added the comment:

@Steven: Python 3.6 supports Unicode 9.


Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.unidata_version
'9.0.0'

--
nosy: +mrabarnett

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30736>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30209] some UTF8 symbols

2017-04-29 Thread Matthew Barnett

Matthew Barnett added the comment:

IDLE uses tkinter, which wraps tcl/tk. Versions up to tcl/tk 8.6 can't handle 
'astral' codepoints.

See also:

Issue #30019: IDLE freezes when opening a file with astral characters

Issue #21084: IDLE can't deal with characters above the range (U+0000-U+FFFF)

--
nosy: +mrabarnett

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30209>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30157] csv.Sniffer.sniff() regex error

2017-04-25 Thread Matthew Barnett

Matthew Barnett added the comment:

There are 4 patterns. They try to determine the delimiter and quote by looking 
for matches. Each pattern supposedly covers one of 4 cases:

1. Delimiter, quote, value, quote, delimiter.

2. Start of line/text, quote, value, quote, delimiter.

3. Delimiter, quote, value, quote, end of line/text.

4. Start of line/text, quote, value, quote, end of line/text.

On that basis, case 3 looks wrong because the pattern for delimiter is:

>[^\w\n"\']

instead of the expected:

[^\w\n"\']

Looks like a bug to me.

--
nosy: +mrabarnett

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30157>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30148] Pathological regex behaviour

2017-04-23 Thread Matthew Barnett

Matthew Barnett added the comment:

If 'ignores' is '', you get this:

(?:\b(?:extern|G_INLINE_FUNC|)\s*)

which can match an empty string, and it's tried repeatedly.

That's inadvisable.

There's also:

(?:\s+|\*)+

which can match whitespace in multiple ways.

That's inadvisable too.

If the pattern really doesn't match the string (and it doesn't!), then it won't 
find out until it has tried _all_ of the possibilities.

Some implementations, such as Perl's, have extra checks to try to reduce the 
problem.
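
As an illustration of removing the ambiguity (a sketch, not a fix proposed in the original comment): the nested repeat can be replaced by a single character set that accepts the same strings, leaving the engine only one way to consume each character. The sample strings are made up.

```python
import re

nested = re.compile(r'(?:\s+|\*)+')   # whitespace can be split up many ways
flat = re.compile(r'[\s*]+')          # one way to consume each character

samples = ['', ' ', '*', ' * ', '**  *', 'a', ' *a']

# Both patterns accept and reject exactly the same sample strings.
same = all(
    (nested.fullmatch(s) is None) == (flat.fullmatch(s) is None)
    for s in samples
)
```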

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30148>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30133] Strings that end with properly escaped backslashes cause error to be thrown in re.search/sub/etc. functions.

2017-04-21 Thread Matthew Barnett

Matthew Barnett added the comment:

The function solution does have a larger overhead than a literal.

Could the template be made more accepting of backslashes without breaking 
anything? (There's also issue29995 "re.escape() escapes too much", which might 
help.)

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30133>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30133] Strings that end with properly escaped backslashes cause error to be thrown in re.search/sub/etc. functions.

2017-04-21 Thread Matthew Barnett

Matthew Barnett added the comment:

Yes, the second argument is a replacement template, not a literal.

This issue does point out a different problem, though: re.escape will add 
backslashes that will then be treated as literals in the template, for example:

>>> re.sub(r'a', re.escape('(A)'), 'a')
'\\(A\\)'

re.escape doesn't always help.

The solution here is to pass a replacement function instead:

>>> re.sub(r'a', lambda m: '(A)', 'a')
'(A)'
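
A sketch comparing the function form with another option that is not from the original comment: doubling the backslashes in the literal before using it as a template. The literal below is made up.

```python
import re

literal = 'C:\\new(A)'   # the text C:\new(A), with one backslash

# Option 1: a replacement function; its return value is used as-is.
via_function = re.sub(r'a', lambda m: literal, 'a')

# Option 2: escape only the backslashes so the template parser
# turns each doubled backslash back into a single one.
via_template = re.sub(r'a', literal.replace('\\', '\\\\'), 'a')
```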

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30133>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue29977] re.sub stalls forever on an unmatched non-greedy case

2017-04-04 Thread Matthew Barnett

Matthew Barnett added the comment:

A slightly shorter form:

/\*(?:(?!\*/).)*\*/

Basically it's:

match start

while not match end:
    consume character

match end

If the "match end" is a single character, you can use a negated character set, 
for example:

[^\n]*

otherwise you need a negative lookahead, for example:

(?:(?!\r\n).)*
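
A sketch applying the shorter form to some made-up C-like text (not part of the original comment). The re.DOTALL flag is an addition here so that comments may span lines.

```python
import re

# match start, then (not match end, consume character)*, then match end
comment = re.compile(r'/\*(?:(?!\*/).)*\*/', re.DOTALL)

source = ('int a; /* first */ int b; /* second\n line */ '
          'int c; /* unterminated')

found = comment.findall(source)
```

The unterminated comment at the end is simply not matched.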

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29977>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17441] Do not cache re.compile

2017-03-07 Thread Matthew Barnett

Matthew Barnett added the comment:

If we were doing it today, maybe we wouldn't cache them, but, as you say, it's 
been like that for a long time. (The regex module also caches them, because the 
re module does.) Unless someone can demonstrate that it's a problem, I'd say 
just leave it as it is.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue17441>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue29571] test_re is failing when local is set for `en_IN`

2017-02-15 Thread Matthew Barnett

Matthew Barnett added the comment:

The report says "==  encodings: locale=UTF-8, FS=utf-8".

It says that "test_locale_caching" was skipped, but also that 
"test_locale_flag" failed.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29571>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue29571] test_re is failing when local is set for `en_IN`

2017-02-15 Thread Matthew Barnett

Matthew Barnett added the comment:

I'm just wondering whether the problem is just due to the locale's encoding 
being UTF-8. The locale support in re really only works with encodings that use 
1 byte/character.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29571>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22594] Add a link to the regex module in re documentation

2017-02-08 Thread Matthew Barnett

Matthew Barnett added the comment:

Ah, well, if it hasn't changed after this many years, it never will. Expect one 
or two changes to the text. :-)

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22594>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22594] Add a link to the regex module in re documentation

2017-02-07 Thread Matthew Barnett

Matthew Barnett added the comment:

With the VERSION0 flag (the default behaviour), it should behave the same as 
the re module, and that's not going to change.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22594>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22594] Add a link to the regex module in re documentation

2017-02-07 Thread Matthew Barnett

Matthew Barnett added the comment:

I agree with Marco that it shouldn't be too verbose. I'd like to suggest that 
it says that it's compatible (i.e. has the same API), but with additional 
features.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22594>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com


