[Python-ideas] Re: Regex timeouts

2022-03-21 Thread Tim Peters
[Tim, on trying to match only the next instance of "spam"] > ,,, > It's actually far easier if assertions are used, and I'm too old to > bother trying to repair the non-assertion mess: > > ([^s]|s(?!pam))*spam Since then, Serhiy rehabilitated an old patch to add "atomic groups" and "possessive
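For readers who want to see what the assertion-based pattern quoted above does, here is a minimal sketch using the standard library re module. The sample text and the names NEXT_SPAM, text, m and first_end are made up for illustration, and the group has been made non-capturing, which is not how it was written in the thread.

```python
import re

# Pattern from the thread, with the group made non-capturing (an editorial
# change): eat a character if it is not "s", or if it is an "s" that does
# not start "spam"; then require a literal "spam".
NEXT_SPAM = re.compile(r"(?:[^s]|s(?!pam))*spam")

text = "eggs, sausage and spam, spam and more spam"
m = NEXT_SPAM.match(text)
first_end = text.index("spam") + len("spam")

print(m.group())             # eggs, sausage and spam
print(m.end() == first_end)  # True

# The loop can never step over the "s" of a "spam", so the match cannot be
# stretched to cover a later occurrence.
print(NEXT_SPAM.fullmatch("spam and spam"))  # None
```

The negative lookahead is what keeps the repeated group from ever consuming the start of a "spam", so no amount of backtracking can push the match past the first occurrence.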

[Python-ideas] Re: Regex timeouts

2022-02-18 Thread Stephen J. Turnbull
Paul Moore writes: > However, I do like the idea of having a better parser library in the > stdlib. But it's pretty easy to write such a thing and publish it on > PyPI, It's not easy to write a good one. I've tried in two languages (Python and Lisp). I'm not saying I'm competent, so that's

[Python-ideas] Re: Regex timeouts

2022-02-17 Thread Tim Peters
[J.B. Langston ] > And unfortunately it does appear that my app took almost a 20% > performance hit from using regex instead of re. > Processing time for a test dataset with 700MB of logs went from > 77 seconds with the standard library re to 92 seconds with regex. > Profiling

[Python-ideas] Re: Regex timeouts

2022-02-17 Thread J.B. Langston
And unfortunately it does appear that my app took almost a 20% performance hit from using regex instead of re. Processing time for a test dataset with 700MB of logs went from 77 seconds with the standard library re to 92 seconds with regex. Profiling confirms that the time

[Python-ideas] Re: Regex timeouts

2022-02-17 Thread J.B. Langston
And, as it turns out, I already had it installed via some transitive dependency. The bigger task will be testing. Indeed it may be as easy as "import regex as re" but I need to test and make sure my regexes still work as expected and that the performance doesn't take a big hit. Trust, but

[Python-ideas] Re: Regex timeouts

2022-02-17 Thread J.B. Langston
I have no problem installing software via pip. This is a large project that has many other dependencies already managed via pip and virtualenv. ___ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to

[Python-ideas] Re: Regex timeouts

2022-02-17 Thread Tim Peters
[Steven D'Aprano ] > [skipping FUD about pip] If the OP has problems with pip they can't easily resolve, I expect they'll say so. In my experience, effective help consists of sticking to the simplest things that can possibly work at first, not burying a questioner with an exhaustive account of

[Python-ideas] Re: Regex timeouts

2022-02-17 Thread Steven D'Aprano
On Wed, Feb 16, 2022 at 07:01:42PM -0600, Tim Peters wrote: > You may not realize how easy this is? Just in case: go to a shell and type > > pip install regex > > (or, on Windows, "python -m pip install regex" in a DOS box). > > That's it. You're done. Easier said than actually done.

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread Barry
> On 17 Feb 2022, at 01:04, Tim Peters wrote: > > [J.B. Langston ] >> Thanks for the conclusive answer. > > Not conclusive - just my opinion. Which is informed, but not infallible ;-) > >> I will check out the regex library soon. > > You may not realize how easy this is? Just in case: go

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread Tim Peters
[J.B. Langston ] > Thanks for the conclusive answer. Not conclusive - just my opinion. Which is informed, but not infallible ;-) > I will check out the regex library soon. You may not realize how easy this is? Just in case: go to a shell and type pip install regex (or, on Windows, "python -m

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread J.B. Langston
Thanks for the conclusive answer. I will check out the regex library soon.

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread Tim Peters
[MRAB ] > I eventually decided against having it added to the standard library > because that would tie fixes and additions to Python's release cycle, > and there's that adage that Python has "batteries included", but not > nuclear reactors. PyPI is a better place for it, for those who need more >

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread MRAB
On 2022-02-16 22:13, Tim Peters wrote: [J.B. Langston ] Well, I certainly sparked a lot of interesting discussion, which I have quite enjoyed reading. But to bring this thread back around to its original topic, is there support among the Python maintainers for adding a timeout feature to the

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread Tim Peters
[J.B. Langston ] > Well, I certainly sparked a lot of interesting discussion, which I have > quite enjoyed reading. But to bring this thread back around to its > original topic, is there support among the Python maintainers for > adding a timeout feature to the Python re library? Buried in the

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread Chris Angelico
On Thu, 17 Feb 2022 at 08:33, J.B. Langston wrote: > > Well, I certainly sparked a lot of interesting discussion, which I have quite > enjoyed reading. But to bring this thread back around to its original topic, > is there support among the Python maintainers for adding a timeout feature to >

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread J.B. Langston
Well, I certainly sparked a lot of interesting discussion, which I have quite enjoyed reading. But to bring this thread back around to its original topic, is there support among the Python maintainers for adding a timeout feature to the Python re library? I will look at the third-party regex

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread Ricky Teachey
On Wed, Feb 16, 2022, 5:46 AM Paul Moore wrote: > On Wed, 16 Feb 2022 at 10:23, Chris Angelico wrote: > > > > On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull > > wrote: > > > > What I think is more interesting than simpler (but more robust for > > > what they can do) facilities is better

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread Paul Moore
On Wed, 16 Feb 2022 at 10:23, Chris Angelico wrote: > > On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull > wrote: > > What I think is more interesting than simpler (but more robust for > > what they can do) facilities is better parser support in standard > > libraries (not just Python's), and

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread Chris Angelico
On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull wrote: > > Chris Angelico writes: > > On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull > > wrote: > > > > That is, all regexp implementations support the same basic > > > language which is sufficient for most tasks most programmers want > >

[Python-ideas] Re: Regex timeouts

2022-02-16 Thread Stephen J. Turnbull
Chris Angelico writes: > On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull > wrote: > > That is, all regexp implementations support the same basic > > language which is sufficient for most tasks most programmers want > > regexps for. > > The problem is that that's an illusion. It isn't

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread David Mertz, Ph.D.
I know this is probably too much self promotion, but I really enjoyed writing this less than a year ago: https://gnosis.cx/regex/ (The Puzzling Quirks of Regular Expressions). It's like other puzzle books, but for programmers. You should certainly still get Friedl's book if you don't have it. You

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Tim Peters
[Chris Angelico ] > Is there any sort of standardization of regexp syntax and semantics, Sure. "The nice thing about standards is that you have so many to choose from" ;-) For example, POSIX defines a regexp flavor so it can specify what things like grep do. The ECMAScript standard defines its

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread MRAB
On 2022-02-16 02:11, Chris Angelico wrote: On Wed, 16 Feb 2022 at 12:56, Tim Peters wrote: Regexps keep "evolving"... Once upon a time, a "regular expression" was a regular grammar. That is no longer the case. Once upon a time, a regular expression could be broadly compatible with multiple

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Chris Angelico
On Wed, 16 Feb 2022 at 12:56, Tim Peters wrote: > Regexps keep "evolving"... Once upon a time, a "regular expression" was a regular grammar. That is no longer the case. Once upon a time, a regular expression could be broadly compatible with multiple different parser engines. That is being

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Tim Peters
[Steven D'Aprano ] > After this thread, I no longer trust that "easy" regexes will do what > they "obviously" look like they should do :-( > > I'm not trying to be funny or snarky. I *thought* I had a reasonable > understanding of regexes, and now I have learned that I don't, and that > the

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Chris Angelico
On Wed, 16 Feb 2022 at 10:15, Steven D'Aprano wrote: > > On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote: > > > scanf just isn't powerful enough. For example, consider parsing user > > input dates: scanf("%d/%d/%d", , , ). This is nice and > > simple, but handling

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Chris Angelico
On Wed, 16 Feb 2022 at 09:28, Steven D'Aprano wrote: > > On Wed, Feb 16, 2022 at 01:02:44AM +1100, Chris Angelico wrote: > > > Yeah, regexes always look terrible when they're used for simple > > examples :) But try matching a line that has (somewhere in it) the > > word "spam", then whitespace,

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Steven D'Aprano
On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote: > scanf just isn't powerful enough. For example, consider parsing user > input dates: scanf("%d/%d/%d", , , ). This is nice and > simple, but handling "2022-02-15" as well requires a bit of thinking > and several extra

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Steven D'Aprano
On Wed, Feb 16, 2022 at 01:02:44AM +1100, Chris Angelico wrote: > Yeah, regexes always look terrible when they're used for simple > examples :) But try matching a line that has (somewhere in it) the > word "spam", then whitespace, then a number (or if you prefer: then a > sequence of ASCII

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread J.B. Langston
How embarrassing... I apologize for all the signature garbage at the end of my message.

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Chris Angelico
On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull wrote: > The Zawinski quote is motivated by the perception that people seem to > think that simplicity lies in minimizing the number of tools you need > to learn. REXX and SNOBOL pattern matching quite a bit more > specialized to particular tools

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread J.B. Langston
Tim Peters wrote: > """ > Some people, when confronted with a problem, think “I know, I'll use > regular expressions.” Now they have two problems. > - Jamie Zawinski > """ Maybe so, but I'm committed now :). I have dozens of regexes to parse specific log messages I'm interested in. I made a

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread J.B. Langston
> > A regex that's vulnerable to pathological behavior is a DoS attack waiting >> to happen. Especially when used for parsing log data (which might contain >> untrusted data). If possible, we should make it harder for people to shoot >> themselves in the feet. >> > And this is exactly what

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread MRAB
On 2022-02-15 06:05, Tim Peters wrote: [Steven D'Aprano ] I've been interested in the existence of SNOBOL string scanning for a long time, but I know very little about it. How does it differ from regexes, and why have programming languages pretty much standardised on regexes rather than other

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Stephen J. Turnbull
Tim Peters writes: > Chris didn't say this, but I will: I'm amazed that things much > _simpler_ than regexps, like his scanf and REXX PARSE > examples, haven't spread more. scanf just isn't powerful enough. For example, consider parsing user input dates: scanf("%d/%d/%d", , , ). This is
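As a rough illustration of the point about date input, here is a small sketch of how one regular expression can accept both "2022/02/15" and "2022-02-15". The year-month-day field order, the names DATE_RE and parse_date, and the rule that both separators must be the same are assumptions made for this example; none of them come from the thread.

```python
import re
from datetime import date

# Accept "/" or "-" as the separator, and require the same separator both
# times via the backreference (?P=sep). The year-month-day order is an
# assumption of this sketch.
DATE_RE = re.compile(
    r"(\d{4})(?P<sep>[/-])(\d{1,2})(?P=sep)(\d{1,2})",
    re.ASCII,
)


def parse_date(text):
    """Parse 'YYYY/MM/DD' or 'YYYY-MM-DD' into a datetime.date."""
    m = DATE_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unrecognised date: {text!r}")
    year, _sep, month, day = m.groups()
    # date() itself rejects impossible values such as month 13.
    return date(int(year), int(month), int(day))


print(parse_date("2022/02/15"))  # 2022-02-15
print(parse_date("2022-02-15"))  # 2022-02-15
```

A fixed scanf-style format string would need one call per accepted layout, whereas here the separator is simply one more thing the pattern matches.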

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Chris Angelico
On Wed, 16 Feb 2022 at 00:55, Steven D'Aprano wrote: > > On Tue, Feb 15, 2022 at 05:39:33AM -0600, Tim Peters wrote: > > > ([^s]|s(?!pam))*spam > > > > Bingo. That pattern is easy enough to understand > > You and I have very different definitions of the word "easy" :-) > > > (if not to invent the

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Steven D'Aprano
On Tue, Feb 15, 2022 at 05:39:33AM -0600, Tim Peters wrote: > ([^s]|s(?!pam))*spam > > Bingo. That pattern is easy enough to understand You and I have very different definitions of the word "easy" :-) > (if not to invent the > first time): we can chew up a character if it's not an "s", or if

[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Tim Peters
[Tim, on trying to match only the next instance of "spam"] > Assertions aren't needed, but it is nightmarish to get right. Followed by a nightmare that got it wrong. My apologies - that's what I get for trying to show off ;-) It's actually far easier if assertions are used, and I'm too old to

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Tim Peters
[Steven D'Aprano ] > I've been interested in the existence of SNOBOL string scanning for > a long time, but I know very little about it. > > How does it differ from regexes, and why have programming languages > pretty much standardised on regexes rather than other forms of string > matching? What

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Tim Peters
[Tim] >> In SNOBOL, as I recall, it could be spelled >> >> ARB "spam" FENCE [Chris] > Ah, so that's a bit more complicated than the "no-backtracking" > parsing style of REXX and scanf. Oh, a lot more complex. In SNOBOL, arbitrary computation can be performed at any point in pattern

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Chris Angelico
On Tue, 15 Feb 2022 at 13:57, Tim Peters wrote: > In SNOBOL, as I recall, it could be spelled > > ARB "spam" FENCE > > Those are all pattern objects, and infix whitespace is a binary > pattern catenation operator. > > ARB is a builtin pattern that matches the empty string at first, and >

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Tim Peters
[Tim] >>> That leaves the happy 5% who write "[^X]*X", which >>> finally says what they intended from the start. [Steven] >> Doesn't that only work if X is literally a single character? Right. It was an example, not a meta-example. Even for a _single character_, "match up to the next, but never

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Chris Angelico
On Tue, 15 Feb 2022 at 11:47, Steven D'Aprano wrote: > > > Another 20% will write ".*?X", with scant understanding that may > > extend beyond _just_ "the next" X in some cases. > > But this surprises me. Do you have an example? Nongreedy means it'll prefer the next X, but it has to be open to

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Steven D'Aprano
On Mon, Feb 14, 2022 at 05:13:38PM -0600, Tim Peters wrote: > An interesting lesson nobody wants to learn: the original major > string-processing language, SNOBOL, had powerful pattern matching but > no regexps. Griswold's more modern successor language, Icon, found no > reason to change that.

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Steven D'Aprano
On Mon, Feb 14, 2022 at 03:58:49PM -0600, Nick Timkovich wrote: > While definitely not as bad and not as likely as SQL injection, I think the > possibility of regex DoS is totally missing in the stdlib re docs. Should > there be something added there about if you need to put user input into an >

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Tim Peters
""" Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. - Jamie Zawinski """ Even more true of regexps than of floating point, and even of daemon threads ;-) regex is a terrific module, incorporating many features that newer

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Nick Timkovich
> > A regex that's vulnerable to pathological behavior is a DoS attack waiting > to happen. Especially when used for parsing log data (which might contain > untrusted data). If possible, we should make it harder for people to shoot > themselves in the feet. > While definitely not as bad and not

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Bruce Leban
On Mon, Feb 14, 2022 at 9:55 AM J.B. Langston wrote: > ... more generally I think it would be good to have a timeout option that > could be configurable when compiling the regex so that if the regex didn't > complete within the specified timeframe, it would abort and throw an > exception. > >

[Python-ideas] Re: Regex timeouts

2022-02-14 Thread Jonathan Slenders
For what it's worth, the "regex" library on PyPI (not "re") supports timeouts: https://pypi.org/project/regex/ On Mon, Feb 14, 2022, 6:54 PM J.B. Langston wrote: > Hello, > > I had opened this bug because I had a bad regex in my code that was > causing python to hang in the regex evaluation: >
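A minimal sketch of what using that timeout might look like, assuming the third-party regex module is installed (pip install regex). The helper name match_with_timeout and its return convention are made up for this example; the timeout keyword argument, given in seconds, and the built-in TimeoutError it raises are as documented by the regex project.

```python
import regex  # third-party module: pip install regex


def match_with_timeout(pattern, text, seconds):
    """Return (match_or_None, timed_out) for a match attempt at the start of text."""
    try:
        return regex.match(pattern, text, timeout=seconds), False
    except TimeoutError:
        # regex raises the built-in TimeoutError once the limit is exceeded.
        return None, True


m, timed_out = match_with_timeout(r"\d+", "123abc", 1.0)
print(m.group(), timed_out)  # 123 False
```

Returning the timed_out flag separately keeps "no match" distinct from "gave up", which matters when a timeout should be logged or treated as suspicious input.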