[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-31 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

Thanks for putting up the hg repo, makes it much easier to follow.

Getting back to the performance regression I reported in msg124904:

I've verified that if I take the hg commit 7abd9f9bb1 and manually back out the
guards changes while leaving the FAST_INIT changes in, performance on my full
regression suite returns to normal (i.e. the 30-40% penalty disappears).

I've repeated my tests a few times to make sure I'm not mistaken, since the
guard changes don't look like they should impact performance much, but they do.

I've attached the diff that restored the speed for me (as usual, using Python
2.6.5 on Linux x86_64).

BTW, now that we have the code on Google Code, can we log individual issues
over there?  That might make it easier for those interested to follow specific
issues than combing through every individual detail in this
super-issue-thread.

--
Added file: http://bugs.python.org/file20203/remove_guards.diff

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-31 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

You're correct, after the change:

regex.search(r'\d{4}(\s*\w)?\W*((?!\d)\w){2}', XX)

doesn't match (i.e. as before commit 7abd9f9bb1).

I was, however, just trying to narrow down which part of the code change killed 
the performance on my regression tests :-)

Happy new year to all out there.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-29 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

More an observation than a bug:

I understand that we're trading memory for performance, but I've noticed that 
the peak memory usage is rather high, e.g.:

$ cat test.py
import os
import regex as re

def resident():
    for line in open('/proc/%d/status' % os.getpid(), 'r').readlines():
        if line.startswith("VmRSS:"):
            return line.split(":")[-1].strip()

cache = {}

print resident()
for i in xrange(0,1000):
    cache[i] = re.compile(str(i)+"(abcd12kl|efghlajsdf|ijkllakjsdf|mnoplasjdf|qrstljasd|sdajdwxyzlasjdf|kajsdfjkasdjkf|kasdflkasjdflkajsd|klasdfljasdf)")

print resident()

Execution output on my machine (Linux x86_64, Python 2.6.5):
4328 kB
32052 kB

with the standard regex library:
3688 kB
5428 kB

So it looks like around 16x the memory per pattern compared with the standard
regex module.

Now, the example is pretty silly; the difference is even larger for more complex
regexes.  I also understand that once the patterns are GC-ed, Python can
reuse the memory (pymalloc doesn't return it to the OS, unfortunately).
However, I have some applications that use large numbers (many thousands) of
regexes and need to keep them cached (compiled) indefinitely (especially
because compilation is expensive).  This causes some pain (long story).

I've played around with increasing RE_MIN_FAST_LENGTH, and it makes a 
significant difference, e.g.:

RE_MIN_FAST_LENGTH = 10:
4324 kB
25976 kB

In my use-cases, having a larger RE_MIN_FAST_LENGTH doesn't make a huge 
performance difference, so that might be the way I'll go.
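
For readers without /proc, the same kind of measurement can be approximated with the standard tracemalloc module. This is only a sketch under assumptions: it uses the stock re module on Python 3, it measures traced Python-level allocations rather than VmRSS, and the helper names traced_growth and build_cache are made up for illustration.

```python
import re
import tracemalloc


def traced_growth(build):
    """Run build() and return (result, growth in traced bytes).

    This counts allocations made through Python's allocators, which is
    not the same number as the process RSS reported in /proc.
    """
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    result = build()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, after - before


def build_cache(count=200):
    """Compile `count` distinct patterns and keep them all alive."""
    alternatives = "(abcd12kl|efghlajsdf|ijkllakjsdf|mnoplasjdf)"
    return {i: re.compile(str(i) + alternatives) for i in range(count)}


cache, growth = traced_growth(build_cache)
print("%d patterns, roughly %d bytes each" % (len(cache), growth // len(cache)))
```

Swapping the import for another engine would give a per-pattern figure to compare against, with the caveat that memory allocated outside Python's allocators would not be counted.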

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-29 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

Yeah, issue2636-20101230.zip DOES reduce memory usage significantly (30-50%) in 
my use cases;  however, it also tanks performance overall by 35% for me, so 
I'll prefer to stick with issue2636-20101229.zip (or some variant of it).

Maybe a regex compile-time option, although that's not necessary.

Thanks for the effort.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-29 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

re.search('\d{4}(\s*\w)?\W*((?!\d)\w){2}', XX)

matches on stock 2.6.5 regex module, but not on issue2636-20101230.zip or 
issue2636-20101229.zip (which I've fallen back to for now)

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-29 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

Another one that diverges between stock regex and issue2636-20101229.zip:

re.search('A\s*?.*?(\n+.*?\s*?){0,2}\(X', 'A\n1\nS\n1 (X')

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-28 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Thanks, issue2636-20101228a.zip also resolves the compilation speed issues I had
with other (very) complex regexes.

Found this one:

re.search("(X.*?Y\s*){3}(X\s*)+AB:", "XY\nX Y\nX  Y\nXY\nXX AB:")

produces a search hit with stock python 2.6.5 regex library, but not with 
issue2636-20101228a.zip.

re.search("(X.*?Y\s*){3,}(X\s*)+AB:", "XY\nX Y\nX  Y\nXY\nXX AB:")

matches on both, however.
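
Divergences like this are easier to track with a small side-by-side harness. The following is a sketch, not part of the original report: compare_search is a hypothetical helper, it is written for Python 3, and it treats the regex module as optional so that it still runs with only the stock re module installed.

```python
import re

try:
    import regex  # the module under test; optional for this sketch
except ImportError:
    regex = None


def compare_search(pattern, text):
    """Return {engine name: matched text or None} for each available engine."""
    engines = {"re": re}
    if regex is not None:
        engines["regex"] = regex
    results = {}
    for name, module in engines.items():
        match = module.search(pattern, text)
        results[name] = match.group(0) if match else None
    return results


text = "XY\nX Y\nX  Y\nXY\nXX AB:"
for pattern in (r"(X.*?Y\s*){3}(X\s*)+AB:", r"(X.*?Y\s*){3,}(X\s*)+AB:"):
    results = compare_search(pattern, text)
    verdict = "DIVERGES" if len(set(results.values())) > 1 else "same"
    print(pattern, results, verdict)
```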

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-28 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

Here is a somewhat crazy pattern (slimmed down from something much larger and 
more complex, which didn't finish compiling even after several minutes): 

re.compile((?:(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)|(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)(?:(?:[\-\s\.,/]){0,4}?)(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??)\W*(?:[79][0-9]|2[0-4]|\d)(?:[\.:Aa])?(?:[0-5][0-9])\W*(?:(?:[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{3})?)?|[Aa]{3}(?:[Aa]{5}[Aa])?|[Aa]{3}(?:[Aa](?:[Aa]{4})?)?|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{3})?)|(?:[Aa][Aa](?:[Aa](?:[Aa]{3})?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{4})?)?)?|[Aa][Aa](?:[Aa](?:[Aa]{3}(?:[Aa](?:[Aa]{3})?)?)?)?
 
|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?|[Aa]{3}(?:[Aa](?:[Aa](?:[Aa]{4})?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?))\s*(\-\s*)?(?:(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??(?:(?:[\-\s\.,/]){0,4}?)(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)|(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)(?:(?:[\-\s\.,/]){0,4}?)(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??)(?:(?:(?:[\-\s\.,/]){0,4}?)(?:(?:68)?[7-9]\d|(?:2[79])?\d{2}))?\W*(?:[79][0-9]|2[0-4]|\d)(?:[\.:Aa])?(?:[0-5][0-9]))


Compiling it takes about 10.5 seconds on my machine with issue2636-20101228a.zip,
versus less than 0.03 seconds with the stock Python 2.6.5 regex engine.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-27 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Testing issue2636-20101224.zip:

Nested modifiers seem to hang the regex compilation when used in a
non-capturing group, e.g.:

re.compile("(?:(?i)foo)")

or

re.compile("(?:(?u)foo)")


No problem on stock Python 2.6.5 regex engine.

The unnested version of the same regex compiles fine.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-27 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Another re.compile performance issue (I've seen a couple of others, but I'm 
still trying to simplify the test-cases):

re.compile("(?ui)(a\s?b\s?c\s?d\s?e\s?f\s?g\s?h\s?i\s?j\s?k\s?l\s?m\s?n\s?o\s?p\s?q\s?r\s?s\s?t\s?u\s?v\s?w\s?y\s?z\s?a\s?b\s?c\s?d)")

completes in around 0.01s on my machine using Python 2.6.5 standard regex 
library, but takes around 12 seconds using issue2636-20101228.zip
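
Compile times like these can be measured reproducibly with a few lines of Python. This is a rough sketch under assumptions: time_compile is a hypothetical helper, it targets Python 3 and the stock re module, and the pattern is rebuilt with only the (?i) flag because str patterns are already Unicode there.

```python
import re
import time


def time_compile(pattern, compile_func=re.compile, purge=re.purge):
    """Return (compiled pattern, seconds) for one uncached compilation."""
    purge()  # empty the module-level cache so the pattern is really recompiled
    start = time.perf_counter()
    compiled = compile_func(pattern)
    return compiled, time.perf_counter() - start


# Same shape as the pattern above: single letters separated by optional whitespace.
letters = "abcdefghijklmnopqrstuvwyzabcd"
pattern = r"(?i)(" + r"\s?".join(letters) + ")"

compiled, seconds = time_compile(pattern)
print("compiled in %.4f s" % seconds)
```

Passing another engine's compile and purge functions as compile_func and purge would time that engine the same way, assuming it exposes both.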

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-01 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

OK, I think this might be the last one I will find for the moment:

$ cat test.py
import re, regex

text = "test?"
regexp = "test\?"
sub_value = "result\?"
print repr(re.sub(regexp, sub_value, text))
print repr(regex.sub(regexp, sub_value, text))


$ python test.py
'result\\?'
'result?'
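
When the replacement is meant as literal text, one way to sidestep differences in how engines treat unknown escapes is to pass a function instead of a template string. A minimal sketch for Python 3 with the stock re module; sub_literal is a name invented here.

```python
import re


def sub_literal(pattern, replacement, text):
    """Substitute `replacement` verbatim, with no template or escape processing."""
    # A callable's return value is inserted as-is, so backslashes survive untouched.
    return re.sub(pattern, lambda match: replacement, text)


print(repr(sub_literal(r"test\?", r"result\?", "test?")))
print(repr(sub_literal(r"test\?", "result?", "test?")))
```

The same call shape should work with any engine that accepts a callable replacement, which makes the intended output explicit instead of relying on escape handling.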

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-01 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Spoke too soon, although this might be a valid divergence in behavior:

$ cat test.py 
import re, regex

text = "test: 2"

print regex.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)
print re.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)


$ python test.py 
2 test,  
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print re.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)
  File "/usr/lib64/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib64/python2.7/re.py", line 278, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/lib64/python2.7/sre_parse.py", line 787, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group
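
Code that has to behave the same on both engines can avoid the template altogether and map unmatched groups to empty strings explicitly. A sketch for Python 3 with the stock re module; the reorder function is illustrative only.

```python
import re

PATTERN = r"(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?"


def reorder(match):
    """Rebuild '\\2 \\1, \\4 \\3', treating unmatched groups as empty strings."""
    first, number, second, digit = (group or "" for group in match.groups())
    return "%s %s, %s %s" % (number, first, digit, second)


print(repr(re.sub(PATTERN, reorder, "test: 2")))
print(repr(re.sub(PATTERN, reorder, "test: 2 TEST 3")))
```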

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-01 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Another, with backreferences:

import re, regex

text = "TEST, BEST; LEST ; Lest 123 Test, Best"
regexp = "(?i)(.{1,40}?),(.{1,40}?)(?:;)+(.{1,80}).{1,40}?\\3(\ |;)+(.{1,80}?)\\1"
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py 
[('TEST', ' BEST', ' LEST', ' ', '123 ')]
[('T', ' BEST', ' ', ' ', 'Lest 123 ')]

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-31 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

And another, a bit less pathological, testcase.  Sorry for the ugly testcase;  it
was much worse before I boiled it down :-)

$ cat test.py 
import re, regex

text = "\nTest\nxyz\nxyz\nEnd"

regexp = '(\nTest(\n+.+?){0,2}?)?\n+End'
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py
[('\nTest\nxyz\nxyz', '\nxyz')]
[('', '')]

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-30 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Here's one that really falls in the category of "don't do that";  but I found
this because I was limiting the system recursion level to somewhat less than 
the standard 1000 (for other reasons), and I had some shorter duplicate 
patterns in a big regex.  Here is the simplest case to make it blow up with the 
standard recursion settings:

$ cat test.py
import re, regex
regexp = '(abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ|abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ)'
re.compile(regexp)
regex.compile(regexp)

$ python test.py
<snip big traceback except for last few lines>

  File "/tmp/test/src/lib/_regex_core.py", line 2024, in optimise
    subpattern = subpattern.optimise(info)
  File "/tmp/test/src/lib/_regex_core.py", line 1552, in optimise
    branches = [_Branch(branches)]
RuntimeError: maximum recursion depth exceeded
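
To check whether a given compile call survives a reduced recursion limit, the limit can be lowered temporarily and restored afterwards. This is a sketch for Python 3 (where the error is RecursionError); recursion_limit and runs_within are hypothetical helpers, and the limit of 500 is an arbitrary example value.

```python
import contextlib
import sys


@contextlib.contextmanager
def recursion_limit(limit):
    """Temporarily set the interpreter recursion limit, then restore it."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def runs_within(limit, func, *args):
    """Return True if func(*args) completes without hitting the limit."""
    with recursion_limit(limit):
        try:
            func(*args)
        except RecursionError:
            return False
    return True


def depth(n):
    """Deliberately recursive stand-in for a recursive optimiser pass."""
    return 0 if n == 0 else 1 + depth(n - 1)


print(runs_within(500, depth, 10))
print(runs_within(500, depth, 2000))
```

Calling runs_within(limit, some_engine.compile, regexp) would apply the same check to a pattern compiler.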

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-29 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Do we expect this to work on 64-bit Linux and Python 2.6.5?  I've compiled and
run some of my code through this, and there seem to be issues with non-greedy
quantifier matching (at least relative to the old re module):

$ cat test.py
import re, regex

text = "(MY TEST)"
regexp = '\((?P<test>.{0,5}?TEST)\)'
print re.findall(regexp, text)
print regex.findall(regexp, text)


$ python test.py
['MY TEST']
[]

python 2.7 produces the same results for me.

However, making the quantifier greedy (removing the '?') gives the same result 
for both re and regex modules.

--
nosy: +jacques




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-29 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Here's another inconsistency (same setup as before, running 
issue2636-20101029.zip code):

$ cat test.py
import re, regex

text = "\n  S"

regexp = '[^a]{2}[A-Z]'
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py
['  S']
[]


I might flush out some more as I exercise this over the next few days.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-29 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

And another (with issue2636-20101030.zip):

$ cat test.py 
import re, regex
text = "XYABCYPPQ\nQ DEF"
regexp = 'X(Y[^Y]+?){1,2}(\ |Q)+DEF'
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py 
[('YPPQ\n', ' ')]
[]

--




[issue5102] urllib2.py timeouts do not propagate across redirects for 2.6.1 (and 3.x?)

2009-01-29 Thread Jacques Grove

New submission from Jacques Grove jacq...@tripitinc.com:

When doing a urllib2 fetch of a url that results in a redirect, the
connection to the redirect does not pass along the timeout of the
original url opener.  The result is that the redirected url fetch (which
is a new request) will get the default socket timeout, instead of the
timeout that the user requested originally.  This is obviously a bug.

So we have in urllib2.py in 2.6.1:

    def http_error_302(self, req, fp, code, msg, headers):
        .
        return self.parent.open(new)

this should be:
        return self.parent.open(new, timeout=req.timeout)

or something in that vein.


Of course, to be 100% correct, you should probably keep track of how
much time has elapsed since the original url fetch went out, and reduce
the timeout based on this, but I'm not asking for miracles :-)
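
The "reduce the timeout by the time already elapsed" idea can be sketched as a small deadline object. This is only an illustration in Python 3, not urllib2 code: the Deadline class and its clock parameter are assumptions made for the example.

```python
import time


class Deadline:
    """Track one overall time budget shared by a chain of requests."""

    def __init__(self, timeout, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + timeout

    def remaining(self):
        """Return the seconds left, or raise TimeoutError once the budget is spent."""
        left = self._expires - self._clock()
        if left <= 0:
            raise TimeoutError("overall timeout exhausted")
        return left


deadline = Deadline(2.0)
# Each hop of a redirect chain would pass timeout=deadline.remaining()
# instead of the full original value.
print(deadline.remaining() <= 2.0)
```

A redirect handler using this would open the redirected URL with whatever remaining() returns at that moment, so the total wait stays within the user's original timeout.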


Jacques

--
components: Library (Lib)
messages: 80787
nosy: jacques
severity: normal
status: open
title: urllib2.py timeouts do not propagate across redirects for 2.6.1 (and 
3.x?)
type: behavior
versions: Python 2.6

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue5102
___



[issue5103] ssl.SSLSocket timeout not working correctly when remote end is hanging

2009-01-29 Thread Jacques Grove

New submission from Jacques Grove jacq...@tripitinc.com:

In ssl.py of Python 2.6.1 we have this code in SSLSocket.__init__():

        if do_handshake_on_connect:
            timeout = self.gettimeout()
            try:
                self.settimeout(None)
                self.do_handshake()
            finally:
                self.settimeout(timeout)

The problem is, what happens if the remote end (server) is hanging when
do_handshake() is called?  The result is that the user-requested timeout
will be ignored, and the connection will hang until the TCP socket
timeout expires.

This is easily reproducible with this test code:


import urllib2
urllib2.urlopen("https://localhost:9000/", timeout=2.0)


and running netcat on port 9000, i.e.:

nc -l -p 9000 localhost

If you use http instead of https, the timeout works as expected
(after 2 seconds in this case).
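
For comparison, here is a sketch of a handshake that keeps the caller's timeout in force. It assumes a modern Python 3 ssl module rather than the 2.6.1 code quoted above, and handshake_with_timeout is a name invented for this example.

```python
import socket
import ssl


def handshake_with_timeout(host, port, timeout):
    """Connect and run the TLS handshake with `timeout` left in place."""
    context = ssl.create_default_context()
    sock = socket.create_connection((host, port), timeout=timeout)
    # Defer the handshake so it runs explicitly, under the socket's timeout.
    tls = context.wrap_socket(sock, server_hostname=host,
                              do_handshake_on_connect=False)
    try:
        tls.do_handshake()  # raises a timeout error if the peer stays silent
    except Exception:
        tls.close()
        raise
    return tls
```

Against a listener that accepts the TCP connection but never answers (such as the netcat command above), the call should fail after roughly `timeout` seconds instead of hanging.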

--
components: Library (Lib)
messages: 80790
nosy: jacques
severity: normal
status: open
title: ssl.SSLSocket timeout not working correctly when remote end is hanging
type: behavior
versions: Python 2.6

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue5103
___