Re: [Python-Dev] interesting article on regex performance
Collin Winter wrote: On Fri, Mar 12, 2010 at 8:12 AM, Nick Coghlan ncogh...@gmail.com wrote: There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support). I don't see why C++ would be a deal-breaker in this case, since it would be restricted to an extension module. It wouldn't be a dealbreaker for an optional regex acceleration module, I agree. It's still a practical problem in the sense that C++ code is very much in the minority in the CPython code base. I would say it is better to let re2 bake for a while and see if anyone is motivated to come up with Python bindings for it and release them on PyPI. FWIW, re2 is heavily, heavily used in production at Google. Stabilizing any proposed Python bindings would be a good idea, but I'm not sure how much more baking re2's core functionality needs. It was more the bindings side of things I was referring to - the algorithm's performance is no doubt well established, but how easy it is for non-Google users to pick up and integrate with other software is still to be determined. We considered such a hybrid approach for Unladen Swallow, but rejected it on the advice of the V8 team [1]: you end up maintaining two totally separate, incompatible regex engines; the hybrid system comes with interesting, possibly unintuitive performance/correctness issues when bailing from one implementation to another; performance is unstable as small changes are made to the regexes; and it's not obvious that enough Python regexes would benefit from re2's performance/restrictions tradeoff to make such a hybrid system worthwhile in the long term. (There is no representative corpus of real-world Python regexes weighted for dynamic execution frequency to use in assessing such tradeoffs empirically like there is for JavaScript.) re2 is very useful when you want to run user-provided regexes and want to protect your backends against pathological/malicious regex input, but I'm not sure how applicable it is to Python. I think there are more promising strategies to improve regex performance, such as reusing the new JIT infrastructure to JIT-compile regular expressions to machine code (along the lines of V8's irregexp). Some work has already been done in this direction, and I'd be thrilled to mentor any GSoC students interested in working on such a project this summer. Lastly, anyone interested in working on Python regex performance should take a look at the regex benchmarks in the standard benchmark suite [2]. This is the kind of baking that I was really talking about - there's an awful lot of investigation to be done before moving Python in the direction of an NFA based regex engine could be seriously considered. Good to hear the perspective of someone that has already looked into this in some detail though. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia --- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] interesting article on regex performance
Peter Portante wrote: http://code.google.com/p/re2/ On 3/11/10 8:52 PM, Neal Becker ndbeck...@gmail.com wrote: http://swtch.com/~rsc/regexp/regexp1.html Both interesting links. I'll add another one to the list: http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html To bring this on-topic for python-dev by considering how it could apply to Python's default re engine, I think the key issue is that any updates to the default engine would need to remain backwards compatible with all of the tricks that re2 doesn't support. There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support). I would say it is better to let re2 bake for a while and see if anyone is motivated to come up with Python bindings for it and release them on PyPI. Once that happens (and assuming the bindings earn a good reputation), the first step towards integration would be to include a See Also in the re module documentation to point people towards the more limited (but more robust) regex engine implementation. The next step would probably be a hybrid third party library that exploits the NFA approach when it can, but resorts to backtracking when it has to in order to handle full regex functionality. (Although developers would need to be able to disable the backtracking support in order to restore re2's guarantees of linear time execution) Only once such a hybrid library had been proven in practice could the default re module be considered for possible update. Basically, I see some interesting ideas there, but I don't see anything that will impact the Python core or standard library in the short term. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia --- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] interesting article on regex performance
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Nick Coghlan wrote: Peter Portante wrote: http://code.google.com/p/re2/ On 3/11/10 8:52 PM, Neal Becker ndbeck...@gmail.com wrote: http://swtch.com/~rsc/regexp/regexp1.html Both interesting links. I'll add another one to the list: http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html To bring this on-topic for python-dev by considering how it could apply to Python's default re engine, I think the key issue is that any updates to the default engine would need to remain backwards compatible with all of the tricks that re2 doesn't support. There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support). I would say it is better to let re2 bake for a while and see if anyone is motivated to come up with Python bindings for it and release them on PyPI. Once that happens (and assuming the bindings earn a good reputation), the first step towards integration would be to include a See Also in the re module documentation to point people towards the more limited (but more robust) regex engine implementation. The next step would probably be a hybrid third party library that exploits the NFA approach when it can, but resorts to backtracking when it has to in order to handle full regex functionality. (Although developers would need to be able to disable the backtracking support in order to restore re2's guarantees of linear time execution) Only once such a hybrid library had been proven in practice could the default re module be considered for possible update. Basically, I see some interesting ideas there, but I don't see anything that will impact the Python core or standard library in the short term. Agreed on the whole. However, it might be interesting to explore the hybrid mode using some of the older, C-based libraries referenced by the original (Russ Cox) article, e.g. Rob Pike's Unicode-aware 1992 implmentation, as ported for Unix: http://swtch.com/plan9port/unix/ Tres. - -- === Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software Excellence by Designhttp://palladion.com -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkuac7AACgkQ+gerLs4ltQ6t3wCgxQPsnpbyE1N9BEGg7nHqcBVP q04AnA6RMJw83+uc5u6Oyud89SXEoP5C =+0aL -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] interesting article on regex performance
I have some regex-intensive Python that might benefit from this, but I don't think I have enough skill to do a set of Python bindings. An ideal first cut would be to enable this: import re2 as re I live in hope... Brent L ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] interesting article on regex performance
On Fri, Mar 12, 2010 at 8:12 AM, Nick Coghlan ncogh...@gmail.com wrote: [snip] To bring this on-topic for python-dev by considering how it could apply to Python's default re engine, I think the key issue is that any updates to the default engine would need to remain backwards compatible with all of the tricks that re2 doesn't support. There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support). I don't see why C++ would be a deal-breaker in this case, since it would be restricted to an extension module. I would say it is better to let re2 bake for a while and see if anyone is motivated to come up with Python bindings for it and release them on PyPI. FWIW, re2 is heavily, heavily used in production at Google. Stabilizing any proposed Python bindings would be a good idea, but I'm not sure how much more baking re2's core functionality needs. Once that happens (and assuming the bindings earn a good reputation), the first step towards integration would be to include a See Also in the re module documentation to point people towards the more limited (but more robust) regex engine implementation. The next step would probably be a hybrid third party library that exploits the NFA approach when it can, but resorts to backtracking when it has to in order to handle full regex functionality. (Although developers would need to be able to disable the backtracking support in order to restore re2's guarantees of linear time execution) We considered such a hybrid approach for Unladen Swallow, but rejected it on the advice of the V8 team [1]: you end up maintaining two totally separate, incompatible regex engines; the hybrid system comes with interesting, possibly unintuitive performance/correctness issues when bailing from one implementation to another; performance is unstable as small changes are made to the regexes; and it's not obvious that enough Python regexes would benefit from re2's performance/restrictions tradeoff to make such a hybrid system worthwhile in the long term. (There is no representative corpus of real-world Python regexes weighted for dynamic execution frequency to use in assessing such tradeoffs empirically like there is for JavaScript.) re2 is very useful when you want to run user-provided regexes and want to protect your backends against pathological/malicious regex input, but I'm not sure how applicable it is to Python. I think there are more promising strategies to improve regex performance, such as reusing the new JIT infrastructure to JIT-compile regular expressions to machine code (along the lines of V8's irregexp). Some work has already been done in this direction, and I'd be thrilled to mentor any GSoC students interested in working on such a project this summer. Lastly, anyone interested in working on Python regex performance should take a look at the regex benchmarks in the standard benchmark suite [2]. Thanks, Collin Winter [1] - http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html#c4843826268005492354 [2] - http://hg.python.org/benchmarks/file/5b8fe389710b/performance ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] interesting article on regex performance
There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support). Collin I don't see why C++ would be a deal-breaker in this case, since Collin it would be restricted to an extension module. Traditionally Python has run on some (minority) platforms where C++ was unavailable. While the re module is a dynamically linked extension module and thus could be considered optional, I doubt anybody thinks of it as optional nowadays. It's used in the regression test suite anyway. It would be tough to run unit tests on such minority platforms without it. You'd have to maintain both the current sre implementation and the new re2 implementation for a long while into the future. As I was reading the code I thought, Great! This stuff is so simple. It's even all written in C. Then I looked at the re2 page. :-( Skip ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] interesting article on regex performance
On Fri, Mar 12, 2010 at 11:29 AM, s...@pobox.com wrote: There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support). Collin I don't see why C++ would be a deal-breaker in this case, since Collin it would be restricted to an extension module. Traditionally Python has run on some (minority) platforms where C++ was unavailable. While the re module is a dynamically linked extension module and thus could be considered optional, I doubt anybody thinks of it as optional nowadays. It's used in the regression test suite anyway. It would be tough to run unit tests on such minority platforms without it. You'd have to maintain both the current sre implementation and the new re2 implementation for a long while into the future. re2 is not a full replacement for Python's current regex semantics: it would only serve as an accelerator for a subset of the current regex language. Given that, it makes perfect sense that it would be optional on such minority platforms (much like the incoming JIT). Collin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] interesting article on regex performance
Am 12.03.2010 20:29, schrieb s...@pobox.com: There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support). Collin I don't see why C++ would be a deal-breaker in this case, since Collin it would be restricted to an extension module. Traditionally Python has run on some (minority) platforms where C++ was unavailable. While the re module is a dynamically linked extension module and thus could be considered optional, I doubt anybody thinks of it as optional nowadays. Actually, _sre is a builtin module by default since at least Python 2.5. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] interesting article on regex performance
Collin re2 is not a full replacement for Python's current regex Collin semantics: it would only serve as an accelerator for a subset of Collin the current regex language. Given that, it makes perfect sense Collin that it would be optional on such minority platforms (much like Collin the incoming JIT). Sure, but over the years Python has supported at least four different regular expression modules that I'm aware of (regex, regexp, and the current re module with different extension modules underneath it, perhaps there were others). During some of that time more than one module was distributed with Python proper. I think the desire today would be that only one regular expression module be distributed with Python (that would be my vote anyway). Getting people to move off the older libraries was difficult. If re2 can't replace sre under the covers than I think it belongs in PyPI, not the Python distribution. That said, that suggests to me that a different NFA or DFA implementation written in C would replace sre, one not written in C++. Hopefully that provides some context for my earlier response. Skip ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] interesting article on regex performance
On 12 Mar 2010, at 15:22, s...@pobox.com wrote: Collin re2 is not a full replacement for Python's current regex Collin semantics: it would only serve as an accelerator for a subset of Collin the current regex language. Given that, it makes perfect sense Collin that it would be optional on such minority platforms (much like Collin the incoming JIT). Sure, but over the years Python has supported at least four different regular expression modules that I'm aware of (regex, regexp, and the current re module with different extension modules underneath it, perhaps there were others). During some of that time more than one module was distributed with Python proper. I think the desire today would be that only one regular expression module be distributed with Python (that would be my vote anyway). Getting people to move off the older libraries was difficult. If re2 can't replace sre under the covers than I think it belongs in PyPI, not the Python distribution. That said, that suggests to me that a different NFA or DFA implementation written in C would replace sre, one not written in C++. re2 would be a supplement to re -- it is not a replacement, and Python would run fine if it's not present on some platforms. It's like a floating-point processor: you can do all math you need with just an integer processor. But if you have an FPU present, then it makes sense to use it for the FP operations. Jared ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] interesting article on regex performance
http://code.google.com/p/re2/ On 3/11/10 8:52 PM, Neal Becker ndbeck...@gmail.com wrote: http://swtch.com/~rsc/regexp/regexp1.html ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/peter.a.portante%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com