Re: [Python-Dev] interesting article on regex performance

2010-03-13 Thread Nick Coghlan
Collin Winter wrote:
 On Fri, Mar 12, 2010 at 8:12 AM, Nick Coghlan ncogh...@gmail.com wrote:
 There are major practical problems associated with making such a leap
 directly (Google's re2 engine is in C++ rather than C and we'd have to
 keep the existing implementation around regardless to handle the
 features that re2 doesn't support).
 
 I don't see why C++ would be a deal-breaker in this case, since it
 would be restricted to an extension module.

It wouldn't be a dealbreaker for an optional regex acceleration module,
I agree. It's still a practical problem in the sense that C++ code is
very much in the minority in the CPython code base.

 I would say it is better to let re2 bake for a while and see if anyone
 is motivated to come up with Python bindings for it and release them on
 PyPI.
 
 FWIW, re2 is heavily, heavily used in production at Google.
 Stabilizing any proposed Python bindings would be a good idea, but I'm
 not sure how much more baking re2's core functionality needs.

It was more the bindings side of things I was referring to - the
algorithm's performance is no doubt well established, but how easy it is
for non-Google users to pick up and integrate with other software is
still to be determined.

 We considered such a hybrid approach for Unladen Swallow, but rejected
 it on the advice of the V8 team [1]: you end up maintaining two
 totally separate, incompatible regex engines; the hybrid system comes
 with interesting, possibly unintuitive performance/correctness issues
 when bailing from one implementation to another; performance is
 unstable as small changes are made to the regexes; and it's not
 obvious that enough Python regexes would benefit from re2's
 performance/restrictions tradeoff to make such a hybrid system
 worthwhile in the long term. (There is no representative corpus of
 real-world Python regexes weighted for dynamic execution frequency to
 use in assessing such tradeoffs empirically like there is for
 JavaScript.)
 
 re2 is very useful when you want to run user-provided regexes and want
 to protect your backends against pathological/malicious regex input,
 but I'm not sure how applicable it is to Python. I think there are
 more promising strategies to improve regex performance, such as
 reusing the new JIT infrastructure to JIT-compile regular expressions
 to machine code (along the lines of V8's irregexp). Some work has
 already been done in this direction, and I'd be thrilled to mentor any
 GSoC students interested in working on such a project this summer.
 
 Lastly, anyone interested in working on Python regex performance
 should take a look at the regex benchmarks in the standard benchmark
 suite [2].

This is the kind of baking that I was really talking about - there's an
awful lot of investigation to be done before moving Python in the
direction of an NFA based regex engine could be seriously considered.

Good to hear the perspective of someone that has already looked into
this in some detail though.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] interesting article on regex performance

2010-03-12 Thread Nick Coghlan
Peter Portante wrote:
 http://code.google.com/p/re2/
 
 On 3/11/10 8:52 PM, Neal Becker ndbeck...@gmail.com wrote:
 
 http://swtch.com/~rsc/regexp/regexp1.html

Both interesting links. I'll add another one to the list:
http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

To bring this on-topic for python-dev by considering how it could apply
to Python's default re engine, I think the key issue is that any updates
to the default engine would need to remain backwards compatible with all
of the tricks that re2 doesn't support.

There are major practical problems associated with making such a leap
directly (Google's re2 engine is in C++ rather than C and we'd have to
keep the existing implementation around regardless to handle the
features that re2 doesn't support).

I would say it is better to let re2 bake for a while and see if anyone
is motivated to come up with Python bindings for it and release them on
PyPI. Once that happens (and assuming the bindings earn a good
reputation), the first step towards integration would be to include a
See Also in the re module documentation to point people towards the
more limited (but more robust) regex engine implementation.

The next step would probably be a hybrid third party library that
exploits the NFA approach when it can, but resorts to backtracking when
it has to in order to handle full regex functionality. (Although
developers would need to be able to disable the backtracking support in
order to restore re2's guarantees of linear time execution)

Only once such a hybrid library had been proven in practice could the
default re module be considered for possible update.

Basically, I see some interesting ideas there, but I don't see anything
that will impact the Python core or standard library in the short term.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] interesting article on regex performance

2010-03-12 Thread Tres Seaver
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Nick Coghlan wrote:
 Peter Portante wrote:
 http://code.google.com/p/re2/

 On 3/11/10 8:52 PM, Neal Becker ndbeck...@gmail.com wrote:

 http://swtch.com/~rsc/regexp/regexp1.html
 
 Both interesting links. I'll add another one to the list:
 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html
 
 To bring this on-topic for python-dev by considering how it could apply
 to Python's default re engine, I think the key issue is that any updates
 to the default engine would need to remain backwards compatible with all
 of the tricks that re2 doesn't support.
 
 There are major practical problems associated with making such a leap
 directly (Google's re2 engine is in C++ rather than C and we'd have to
 keep the existing implementation around regardless to handle the
 features that re2 doesn't support).
 
 I would say it is better to let re2 bake for a while and see if anyone
 is motivated to come up with Python bindings for it and release them on
 PyPI. Once that happens (and assuming the bindings earn a good
 reputation), the first step towards integration would be to include a
 See Also in the re module documentation to point people towards the
 more limited (but more robust) regex engine implementation.
 
 The next step would probably be a hybrid third party library that
 exploits the NFA approach when it can, but resorts to backtracking when
 it has to in order to handle full regex functionality. (Although
 developers would need to be able to disable the backtracking support in
 order to restore re2's guarantees of linear time execution)
 
 Only once such a hybrid library had been proven in practice could the
 default re module be considered for possible update.
 
 Basically, I see some interesting ideas there, but I don't see anything
 that will impact the Python core or standard library in the short term.

Agreed on the whole.  However, it might be interesting to explore the
hybrid mode using some of the older, C-based libraries referenced by
the original (Russ Cox) article, e.g. Rob Pike's Unicode-aware 1992
implmentation, as ported for Unix:

  http://swtch.com/plan9port/unix/


Tres.
- --
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Designhttp://palladion.com
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkuac7AACgkQ+gerLs4ltQ6t3wCgxQPsnpbyE1N9BEGg7nHqcBVP
q04AnA6RMJw83+uc5u6Oyud89SXEoP5C
=+0aL
-END PGP SIGNATURE-

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] interesting article on regex performance

2010-03-12 Thread Brent Longborough
I have some regex-intensive Python that might benefit from this, but I don't
think I have enough skill to do a set of Python bindings.

An ideal first cut would be to enable this:

  import re2 as re

I live in hope...
Brent L 

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] interesting article on regex performance

2010-03-12 Thread Collin Winter
On Fri, Mar 12, 2010 at 8:12 AM, Nick Coghlan ncogh...@gmail.com wrote:
[snip]
 To bring this on-topic for python-dev by considering how it could apply
 to Python's default re engine, I think the key issue is that any updates
 to the default engine would need to remain backwards compatible with all
 of the tricks that re2 doesn't support.

 There are major practical problems associated with making such a leap
 directly (Google's re2 engine is in C++ rather than C and we'd have to
 keep the existing implementation around regardless to handle the
 features that re2 doesn't support).

I don't see why C++ would be a deal-breaker in this case, since it
would be restricted to an extension module.

 I would say it is better to let re2 bake for a while and see if anyone
 is motivated to come up with Python bindings for it and release them on
 PyPI.

FWIW, re2 is heavily, heavily used in production at Google.
Stabilizing any proposed Python bindings would be a good idea, but I'm
not sure how much more baking re2's core functionality needs.

 Once that happens (and assuming the bindings earn a good
 reputation), the first step towards integration would be to include a
 See Also in the re module documentation to point people towards the
 more limited (but more robust) regex engine implementation.

 The next step would probably be a hybrid third party library that
 exploits the NFA approach when it can, but resorts to backtracking when
 it has to in order to handle full regex functionality. (Although
 developers would need to be able to disable the backtracking support in
 order to restore re2's guarantees of linear time execution)

We considered such a hybrid approach for Unladen Swallow, but rejected
it on the advice of the V8 team [1]: you end up maintaining two
totally separate, incompatible regex engines; the hybrid system comes
with interesting, possibly unintuitive performance/correctness issues
when bailing from one implementation to another; performance is
unstable as small changes are made to the regexes; and it's not
obvious that enough Python regexes would benefit from re2's
performance/restrictions tradeoff to make such a hybrid system
worthwhile in the long term. (There is no representative corpus of
real-world Python regexes weighted for dynamic execution frequency to
use in assessing such tradeoffs empirically like there is for
JavaScript.)

re2 is very useful when you want to run user-provided regexes and want
to protect your backends against pathological/malicious regex input,
but I'm not sure how applicable it is to Python. I think there are
more promising strategies to improve regex performance, such as
reusing the new JIT infrastructure to JIT-compile regular expressions
to machine code (along the lines of V8's irregexp). Some work has
already been done in this direction, and I'd be thrilled to mentor any
GSoC students interested in working on such a project this summer.

Lastly, anyone interested in working on Python regex performance
should take a look at the regex benchmarks in the standard benchmark
suite [2].

Thanks,
Collin Winter

[1] - 
http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html#c4843826268005492354
[2] - http://hg.python.org/benchmarks/file/5b8fe389710b/performance
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] interesting article on regex performance

2010-03-12 Thread skip

 There are major practical problems associated with making such a leap
 directly (Google's re2 engine is in C++ rather than C and we'd have
 to keep the existing implementation around regardless to handle the
 features that re2 doesn't support).

Collin I don't see why C++ would be a deal-breaker in this case, since
Collin it would be restricted to an extension module.

Traditionally Python has run on some (minority) platforms where C++ was
unavailable.  While the re module is a dynamically linked extension module
and thus could be considered optional, I doubt anybody thinks of it as
optional nowadays.  It's used in the regression test suite anyway.  It would
be tough to run unit tests on such minority platforms without it.  You'd
have to maintain both the current sre implementation and the new re2
implementation for a long while into the future.

As I was reading the code I thought, Great! This stuff is so simple.  It's
even all written in C.  Then I looked at the re2 page.  :-(

Skip
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] interesting article on regex performance

2010-03-12 Thread Collin Winter
On Fri, Mar 12, 2010 at 11:29 AM,  s...@pobox.com wrote:

     There are major practical problems associated with making such a leap
     directly (Google's re2 engine is in C++ rather than C and we'd have
     to keep the existing implementation around regardless to handle the
     features that re2 doesn't support).

    Collin I don't see why C++ would be a deal-breaker in this case, since
    Collin it would be restricted to an extension module.

 Traditionally Python has run on some (minority) platforms where C++ was
 unavailable.  While the re module is a dynamically linked extension module
 and thus could be considered optional, I doubt anybody thinks of it as
 optional nowadays.  It's used in the regression test suite anyway.  It would
 be tough to run unit tests on such minority platforms without it.  You'd
 have to maintain both the current sre implementation and the new re2
 implementation for a long while into the future.

re2 is not a full replacement for Python's current regex semantics: it
would only serve as an accelerator for a subset of the current regex
language. Given that, it makes perfect sense that it would be optional
on such minority platforms (much like the incoming JIT).

Collin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] interesting article on regex performance

2010-03-12 Thread Georg Brandl
Am 12.03.2010 20:29, schrieb s...@pobox.com:
 
  There are major practical problems associated with making such a leap
  directly (Google's re2 engine is in C++ rather than C and we'd have
  to keep the existing implementation around regardless to handle the
  features that re2 doesn't support).
 
 Collin I don't see why C++ would be a deal-breaker in this case, since
 Collin it would be restricted to an extension module.
 
 Traditionally Python has run on some (minority) platforms where C++ was
 unavailable.  While the re module is a dynamically linked extension module
 and thus could be considered optional, I doubt anybody thinks of it as
 optional nowadays.

Actually, _sre is a builtin module by default since at least Python 2.5.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] interesting article on regex performance

2010-03-12 Thread skip

Collin re2 is not a full replacement for Python's current regex
Collin semantics: it would only serve as an accelerator for a subset of
Collin the current regex language. Given that, it makes perfect sense
Collin that it would be optional on such minority platforms (much like
Collin the incoming JIT).

Sure, but over the years Python has supported at least four different
regular expression modules that I'm aware of (regex, regexp, and the current
re module with different extension modules underneath it, perhaps there were
others).  During some of that time more than one module was distributed with
Python proper.  I think the desire today would be that only one regular
expression module be distributed with Python (that would be my vote anyway).
Getting people to move off the older libraries was difficult.  If re2 can't
replace sre under the covers than I think it belongs in PyPI, not the Python
distribution.  That said, that suggests to me that a different NFA or DFA
implementation written in C would replace sre, one not written in C++.

Hopefully that provides some context for my earlier response.

Skip
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] interesting article on regex performance

2010-03-12 Thread Jared Grubb
On 12 Mar 2010, at 15:22, s...@pobox.com wrote:
 
  Collin re2 is not a full replacement for Python's current regex
  Collin semantics: it would only serve as an accelerator for a subset of
  Collin the current regex language. Given that, it makes perfect sense
  Collin that it would be optional on such minority platforms (much like
  Collin the incoming JIT).
 
 Sure, but over the years Python has supported at least four different
 regular expression modules that I'm aware of (regex, regexp, and the current
 re module with different extension modules underneath it, perhaps there were
 others).  During some of that time more than one module was distributed with
 Python proper.  I think the desire today would be that only one regular
 expression module be distributed with Python (that would be my vote anyway).
 Getting people to move off the older libraries was difficult.  If re2 can't
 replace sre under the covers than I think it belongs in PyPI, not the Python
 distribution.  That said, that suggests to me that a different NFA or DFA
 implementation written in C would replace sre, one not written in C++.

re2 would be a supplement to re -- it is not a replacement, and Python would 
run fine if it's not present on some platforms. 

It's like a floating-point processor: you can do all math you need with just an 
integer processor. But if you have an FPU present, then it makes sense to use 
it for the FP operations. 

Jared
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] interesting article on regex performance

2010-03-11 Thread Peter Portante
http://code.google.com/p/re2/


On 3/11/10 8:52 PM, Neal Becker ndbeck...@gmail.com wrote:

 http://swtch.com/~rsc/regexp/regexp1.html
 
 ___
 Python-Dev mailing list
 Python-Dev@python.org
 http://mail.python.org/mailman/listinfo/python-dev
 Unsubscribe: 
 http://mail.python.org/mailman/options/python-dev/peter.a.portante%40gmail.com


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com