[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <aquara...@gmail.com> added the comment:

Thanks for putting up the hg repo; it makes this much easier to follow.

Getting back to the performance regression I reported in msg124904: I've verified that if I take hg commit 7abd9f9bb1 and back out the guards changes manually, while leaving the FAST_INIT changes in, performance is back to normal on my full regression suite (i.e. the 30-40% penalty disappears).

I've repeated my tests a few times to make sure I'm not mistaken, since the guard changes don't look like they should impact performance much, but they do.

I've attached the diff that restored the speed for me (as usual, using Python 2.6.5 on Linux x86_64).

BTW, now that we have the code on Google Code, can we log individual issues over there? That might make it easier for those interested to follow certain issues than trying to comb through every individual detail in this super-issue-thread.

--
Added file: http://bugs.python.org/file20203/remove_guards.diff

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2636>
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <aquara...@gmail.com> added the comment:

You're correct, after the change:

    regex.search(r'\d{4}(\s*\w)?\W*((?!\d)\w){2}', "XX")

doesn't match (i.e. as before commit 7abd9f9bb1).

I was, however, just trying to narrow down which part of the code change killed the performance on my regression tests :-)

Happy new year to all out there.
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <aquara...@gmail.com> added the comment:

More an observation than a bug: I understand that we're trading memory for performance, but I've noticed that the peak memory usage is rather high, e.g.:

    $ cat test.py
    import os
    import regex as re

    def resident():
        for line in open('/proc/%d/status' % os.getpid(), 'r').readlines():
            if line.startswith("VmRSS:"):
                return line.split(":")[-1].strip()

    cache = {}
    print resident()
    for i in xrange(0,1000):
        cache[i] = re.compile(str(i)+"(abcd12kl|efghlajsdf|ijkllakjsdf|mnoplasjdf|qrstljasd|sdajdwxyzlasjdf|kajsdfjkasdjkf|kasdflkasjdflkajsd|klasdfljasdf)")
    print resident()

Execution output on my machine (Linux x86_64, Python 2.6.5):

    4328 kB
    32052 kB

With the standard regex library:

    3688 kB
    5428 kB

So it looks like around 16x the memory per pattern vs. the standard regex module.

Now, the example is pretty silly; the difference is even larger for more complex regexes.

I also understand that once the patterns are GC-ed, Python can reuse the memory (pymalloc doesn't return it to the OS, unfortunately). However, I have some applications that use large numbers (many thousands) of regexes and need to keep them cached (compiled) indefinitely, especially because compilation is expensive. This causes some pain (long story).

I've played around with increasing RE_MIN_FAST_LENGTH, and it makes a significant difference, e.g.:

    RE_MIN_FAST_LENGTH = 10:
    4324 kB
    25976 kB

In my use cases, having a larger RE_MIN_FAST_LENGTH doesn't make a huge performance difference, so that might be the way I'll go.
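The measurement above reads VmRSS from /proc, which only exists on Linux. A minimal, portable sketch of the same idea follows, assuming the standard `tracemalloc` module and the stdlib `re` engine rather than the `regex` module under discussion. It counts only allocations made through Python's allocators, so the numbers are not comparable with the RSS figures in the report; the name `compiled_pattern_growth` and the shortened alternation are illustrative.

```python
import re
import tracemalloc

# Shortened version of the alternation used in the report (illustrative only).
ALTERNATION = "(abcd12kl|efghlajsdf|ijkllakjsdf|mnoplasjdf|qrstljasd)"


def compiled_pattern_growth(count=200):
    """Return (bytes retained, patterns kept) after compiling `count` patterns."""
    re.purge()  # empty the module's own pattern cache first
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    cache = {}
    for i in range(count):
        # Each pattern differs by its numeric prefix, so nothing is shared.
        cache[i] = re.compile(str(i) + ALTERNATION)
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before, len(cache)


growth, kept = compiled_pattern_growth()
print("%d patterns retain roughly %d bytes" % (kept, growth))
```

Dividing the growth by the pattern count gives a rough per-pattern cost that can be tracked from one build to the next.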
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <aquara...@gmail.com> added the comment:

Yeah, issue2636-20101230.zip DOES reduce memory usage significantly (30-50%) in my use cases; however, it also tanks performance overall by 35% for me, so I'll prefer to stick with issue2636-20101229.zip (or some variant of it).

Maybe a regex compile-time option, although that's not necessary.

Thanks for the effort.
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <aquara...@gmail.com> added the comment:

    re.search('\d{4}(\s*\w)?\W*((?!\d)\w){2}', "XX")

matches on the stock 2.6.5 regex module, but not on issue2636-20101230.zip or issue2636-20101229.zip (which I've fallen back to for now).
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <aquara...@gmail.com> added the comment:

Another one that diverges between stock regex and issue2636-20101229.zip:

    re.search('A\s*?.*?(\n+.*?\s*?){0,2}\(X', 'A\n1\nS\n1 (X')
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

Thanks, issue2636-20101228a.zip also resolves the compilation speed issues I had on other (very) complex regexes.

Found this one:

    re.search("(X.*?Y\s*){3}(X\s*)+AB:", "XY\nX Y\nX Y\nXY\nXX AB:")

produces a search hit with the stock Python 2.6.5 regex library, but not with issue2636-20101228a.zip.

    re.search("(X.*?Y\s*){3,}(X\s*)+AB:", "XY\nX Y\nX Y\nXY\nXX AB:")

matches on both, however.
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <aquara...@gmail.com> added the comment:

Here is a somewhat crazy pattern (slimmed down from something much larger and more complex, which didn't finish compiling even after several minutes):

    re.compile("(?:(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)|(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)(?:(?:[\-\s\.,/]){0,4}?)(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??)\W*(?:[79][0-9]|2[0-4]|\d)(?:[\.:Aa])?(?:[0-5][0-9])\W*(?:(?:[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{3})?)?|[Aa]{3}(?:[Aa]{5}[Aa])?|[Aa]{3}(?:[Aa](?:[Aa]{4})?)?|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{3})?)|(?:[Aa][Aa](?:[Aa](?:[Aa]{3})?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{4})?)?)?|[Aa][Aa](?:[Aa](?:[Aa]{3}(?:[Aa](?:[Aa]{3})?)?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?|[Aa]{3}(?:[Aa](?:[Aa](?:[Aa]{4})?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?))\s*(\-\s*)?(?:(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??(?:(?:[\-\s\.,/]){0,4}?)(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)|(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)(?:(?:[\-\s\.,/]){0,4}?)(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??)(?:(?:(?:[\-\s\.,/]){0,4}?)(?:(?:68)?[7-9]\d|(?:2[79])?\d{2}))?\W*(?:[79][0-9]|2[0-4]|\d)(?:[\.:Aa])?(?:[0-5][0-9])")

Runs about 10.5 seconds on my machine with issue2636-20101228a.zip, less than 0.03 seconds with the stock Python 2.6.5 regex engine.
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

Testing issue2636-20101224.zip:

Nested modifiers seem to hang the regex compilation when used in a non-capturing group, e.g.:

    re.compile("(?:(?i)foo)")

or

    re.compile("(?:(?u)foo)")

No problem on the stock Python 2.6.5 regex engine.

The unnested version of the same regex compiles fine.
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

Another re.compile performance issue (I've seen a couple of others, but I'm still trying to simplify the test cases):

    re.compile("(?ui)(a\s?b\s?c\s?d\s?e\s?f\s?g\s?h\s?i\s?j\s?k\s?l\s?m\s?n\s?o\s?p\s?q\s?r\s?s\s?t\s?u\s?v\s?w\s?y\s?z\s?a\s?b\s?c\s?d)")

completes in around 0.01s on my machine using the Python 2.6.5 standard regex library, but takes around 12 seconds using issue2636-20101228.zip.
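Compile-time figures like the ones above are easy to skew, because the module caches compiled patterns. A minimal sketch of a repeatable timing follows, assuming the stdlib `re` engine on Python 3; the helper name `time_compile` is hypothetical. The pattern is rebuilt from the letter sequence in the report.

```python
import re
import time

# Same letter sequence as the reported pattern (note that it skips "x").
LETTERS = "abcdefghijklmnopqrstuvwyzabcd"
PATTERN = "(?ui)(" + r"\s?".join(LETTERS) + ")"


def time_compile(pattern, flags=0, repeats=5):
    """Return the best wall-clock time in seconds for an uncached compile."""
    best = float("inf")
    for _ in range(repeats):
        re.purge()  # drop cached patterns so this run really compiles
        start = time.perf_counter()
        re.compile(pattern, flags)
        best = min(best, time.perf_counter() - start)
    return best


print("best of 5: %.6f s" % time_compile(PATTERN))
```

Taking the best of several runs reduces noise from other processes. Timing another engine means replacing both the `re.purge()` and `re.compile()` calls with that engine's equivalents.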
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

OK, I think this might be the last one I will find for the moment:

    $ cat test.py
    import re, regex

    text = "test?"
    regexp = "test\?"
    sub_value = "result\?"
    print repr(re.sub(regexp, sub_value, text))
    print repr(regex.sub(regexp, sub_value, text))

    $ python test.py
    'result\\?'
    'result?'
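The difference above comes from how each engine treats an unknown escape such as `\?` in the replacement template. A minimal sketch, assuming only the stdlib `re` module on Python 3, shows the template behaviour next to a callable replacement, which bypasses escape processing and so should behave the same on any engine:

```python
import re

text = "test?"
pattern = r"test\?"

# Template replacement: the stdlib engine leaves the unknown escape "\?" as is.
templated = re.sub(pattern, r"result\?", text)

# Callable replacement: the returned string is inserted literally,
# with no escape processing at all.
literal = re.sub(pattern, lambda match: "result?", text)

print(repr(templated))
print(repr(literal))
```

Code that must give identical results under both engines can use the callable form whenever the replacement text contains backslashes.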
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

Spoke too soon, although this might be a valid divergence in behavior:

    $ cat test.py
    import re, regex

    text = "test: 2"

    print regex.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)
    print re.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)

    $ python test.py
    2 test,
    Traceback (most recent call last):
      File "test.py", line 6, in <module>
        print re.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)
      File "/usr/lib64/python2.7/re.py", line 151, in sub
        return _compile(pattern, flags).sub(repl, string, count)
      File "/usr/lib64/python2.7/re.py", line 278, in filter
        return sre_parse.expand_template(template, match)
      File "/usr/lib64/python2.7/sre_parse.py", line 787, in expand_template
        raise error, "unmatched group"
    sre_constants.error: unmatched group
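Whether an unmatched group in a replacement template is an error or an empty string depends on the engine and version. A minimal sketch of a replacement that makes the choice explicit follows, assuming the stdlib `re` module on Python 3; the function name `swap_groups` is hypothetical, and it deliberately treats unmatched groups as empty strings.

```python
import re

PATTERN = r"(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?"


def swap_groups(match):
    """Build '\\2 \\1, \\4 \\3' by hand, mapping unmatched groups to ''."""
    first, second, third, fourth = (group or "" for group in match.groups())
    return "%s %s, %s %s" % (second, first, fourth, third)


partial = re.sub(PATTERN, swap_groups, "test: 2")
full = re.sub(PATTERN, swap_groups, "test: 2 TEST: 3")
print(repr(partial))
print(repr(full))
```

Because the callable never touches the template machinery, the result does not depend on how a given engine expands references to groups that did not participate in the match.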
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

Another, with backreferences:

    import re, regex

    text = "TEST, BEST; LEST ; Lest 123 Test, Best"
    regexp = "(?i)(.{1,40}?),(.{1,40}?)(?:;)+(.{1,80}).{1,40}?\\3(\ |;)+(.{1,80}?)\\1"
    print re.findall(regexp, text)
    print regex.findall(regexp, text)

    $ python test.py
    [('TEST', ' BEST', ' LEST', ' ', '123 ')]
    [('T', ' BEST', ' ', ' ', 'Lest 123 ')]
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

And another, a bit less pathological, testcase. Sorry for the ugly testcase; it was much worse before I boiled it down :-)

    $ cat test.py
    import re, regex

    text = "\nTest\nxyz\nxyz\nEnd"

    regexp = '(\nTest(\n+.+?){0,2}?)?\n+End'
    print re.findall(regexp, text)
    print regex.findall(regexp, text)

    $ python test.py
    [('\nTest\nxyz\nxyz', '\nxyz')]
    [('', '')]
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

Here's one that really falls in the category of "don't do that"; but I found this because I was limiting the system recursion level to somewhat less than the standard 1000 (for other reasons), and I had some shorter duplicate patterns in a big regex. Here is the simplest case to make it blow up with the standard recursion settings:

    $ cat test.py
    import re, regex
    regexp = '(abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ|abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ)'
    re.compile(regexp)
    regex.compile(regexp)

    $ python test.py
    <snip big traceback except for last few lines>
      File "/tmp/test/src/lib/_regex_core.py", line 2024, in optimise
        subpattern = subpattern.optimise(info)
      File "/tmp/test/src/lib/_regex_core.py", line 1552, in optimise
        branches = [_Branch(branches)]
    RuntimeError: maximum recursion depth exceeded
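For code that runs with a lowered recursion limit and still has to compile long patterns, one workaround is to raise the limit only around the compile call. A minimal sketch follows, assuming the stdlib `re` module on Python 3; the context manager `recursion_limit` and the value 5000 are illustrative choices, not taken from the report. The pattern has the same shape as the one above: two identical branches of four 62-character chunks each.

```python
import re
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise (never lower) the interpreter recursion limit."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Always restore the caller's limit, even if compilation fails.
        sys.setrecursionlimit(previous)


CHUNK = "abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
BRANCH = CHUNK * 4
regexp = "(" + BRANCH + "|" + BRANCH + ")"

with recursion_limit(5000):
    compiled = re.compile(regexp)

print(len(regexp), compiled.fullmatch(BRANCH) is not None)
```

This only hides the symptom: an optimiser that recurses once per pattern character will still hit the limit on a long enough literal.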
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

Do we expect this to work on 64-bit Linux and Python 2.6.5?

I've compiled and run some of my code through this, and there seem to be issues with non-greedy quantifier matching (at least relative to the old re module):

    $ cat test.py
    import re, regex

    text = "(MY TEST)"
    regexp = '\((?P<test>.{0,5}?TEST)\)'
    print re.findall(regexp, text)
    print regex.findall(regexp, text)

    $ python test.py
    ['MY TEST']
    []

Python 2.7 produces the same results for me.

However, making the quantifier greedy (removing the '?') gives the same result for both the re and regex modules.

--
nosy: +jacques
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

Here's another inconsistency (same setup as before, running issue2636-20101029.zip code):

    $ cat test.py
    import re, regex

    text = "\n  S"

    regexp = '[^a]{2}[A-Z]'
    print re.findall(regexp, text)
    print regex.findall(regexp, text)

    $ python test.py
    ['  S']
    []

I might flush out some more as I exercise this over the next few days.
[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
Jacques Grove <jacq...@tripitinc.com> added the comment:

And another (with issue2636-20101030.zip):

    $ cat test.py
    import re, regex

    text = "XYABCYPPQ\nQ DEF"
    regexp = 'X(Y[^Y]+?){1,2}(\ |Q)+DEF'
    print re.findall(regexp, text)
    print regex.findall(regexp, text)

    $ python test.py
    [('YPPQ\n', ' ')]
    []
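The divergence reports in this thread all follow one recipe: run the same `findall` on both engines and compare against the stock result. A minimal sketch of that recipe as a reusable check follows, assuming Python 3 and using the stdlib `re` results quoted in two of the reports as the expected values; `CASES` and `find_divergences` are hypothetical names.

```python
import re

# (pattern, text, expected findall result from the stock engine),
# taken from two of the reports in this thread.
CASES = [
    (r"\((?P<test>.{0,5}?TEST)\)", "(MY TEST)", ["MY TEST"]),
    (r"X(Y[^Y]+?){1,2}(\ |Q)+DEF", "XYABCYPPQ\nQ DEF", [("YPPQ\n", " ")]),
]


def find_divergences(engine, cases=CASES):
    """Return (pattern, expected, actual) for every case the engine gets wrong."""
    failures = []
    for pattern, text, expected in cases:
        actual = engine.findall(pattern, text)
        if actual != expected:
            failures.append((pattern, expected, actual))
    return failures


print(find_divergences(re))
```

Any module that exposes a compatible `findall(pattern, text)` function can be passed as `engine`, for example the `regex` module if it is installed.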
[issue5102] urllib2.py timeouts do not propagate across redirects for 2.6.1 (and 3.x?)
New submission from Jacques Grove <jacq...@tripitinc.com>:

When doing a urllib2 fetch of a url that results in a redirect, the connection to the redirect does not pass along the timeout of the original url opener. The result is that the redirected url fetch (which is a new request) will get the default socket timeout, instead of the timeout that the user requested originally. This is obviously a bug.

So we have in urllib2.py in 2.6.1:

    def http_error_302(self, req, fp, code, msg, headers):
        .
        return self.parent.open(new)

this should be:

        return self.parent.open(new, timeout=req.timeout)

or something in that vein.

Of course, to be 100% correct, you should probably keep track of how much time has elapsed since the original url fetch went out, and reduce the timeout based on this, but I'm not asking for miracles :-)

Jacques

--
components: Library (Lib)
messages: 80787
nosy: jacques
severity: normal
status: open
title: urllib2.py timeouts do not propagate across redirects for 2.6.1 (and 3.x?)
type: behavior
versions: Python 2.6

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue5102>
___
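The "100% correct" variant mentioned at the end, shrinking the timeout by the time already spent, can be sketched as a small deadline object. This is an illustration under assumptions, not urllib2 code: the class `Deadline` and its methods are hypothetical, and the injected clock exists only to make the example deterministic.

```python
import time


class Deadline:
    """Track one overall time budget across several network hops."""

    def __init__(self, timeout, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self.remaining() == 0.0


# Deterministic demonstration with a fake clock: created at t=100.0,
# first redirect at t=100.5, second redirect at t=103.0.
ticks = iter([100.0, 100.5, 103.0])
deadline = Deadline(2.0, clock=lambda: next(ticks))
first_hop = deadline.remaining()
second_hop = deadline.remaining()
print(first_hop, second_hop)
```

A redirect handler built on this would pass `deadline.remaining()` as the timeout of each follow-up request and raise a timeout error itself once the budget reaches zero, since a socket timeout of 0.0 means non-blocking mode rather than "fail at once".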
[issue5103] ssl.SSLSocket timeout not working correctly when remote end is hanging
New submission from Jacques Grove <jacq...@tripitinc.com>:

In ssl.py of Python 2.6.1 we have this code in SSLSocket.__init__():

    if do_handshake_on_connect:
        timeout = self.gettimeout()
        try:
            self.settimeout(None)
            self.do_handshake()
        finally:
            self.settimeout(timeout)

The problem is, what happens if the remote end (server) is hanging when do_handshake() is called? The result is that the user-requested timeout will be ignored, and the connection will hang until the TCP socket timeout expires.

This is easily reproducible with this test code:

    import urllib2
    urllib2.urlopen("https://localhost:9000/", timeout=2.0)

and running netcat on port 9000, i.e.:

    nc -l -p 9000 localhost

If you use http instead of https, the timeout works as expected (after 2 seconds in this case).

--
components: Library (Lib)
messages: 80790
nosy: jacques
severity: normal
status: open
title: ssl.SSLSocket timeout not working correctly when remote end is hanging
type: behavior
versions: Python 2.6

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue5103>
___
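The reproduction above needs netcat and a fixed port. A minimal self-contained sketch of the same scenario follows, assuming Python 3's `ssl` module and a working loopback interface. A local listener that never answers stands in for the hanging server, and the function name `handshake_times_out` is hypothetical. It shows the behaviour the report asks for: the socket timeout stays in force during the handshake.

```python
import socket
import ssl


def handshake_times_out(timeout=0.5):
    """Return True if a TLS handshake against a silent peer honours `timeout`."""
    # The listener never accept()s or replies: the kernel completes the TCP
    # connection from its backlog, but no TLS bytes ever come back.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    raw = socket.create_connection(("127.0.0.1", port), timeout=timeout)
    context = ssl.create_default_context()
    # Defer the handshake so the timeout set above is still active for it.
    tls = context.wrap_socket(raw, server_hostname="localhost",
                              do_handshake_on_connect=False)
    try:
        tls.do_handshake()
    except socket.timeout:
        return True
    finally:
        tls.close()
        listener.close()
    return False


print(handshake_times_out())
```

Deferring the handshake with `do_handshake_on_connect=False` keeps the timed step explicit in the example; the timeout itself is the one set by `create_connection()`.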