[Tim, on trying to match only the next instance of "spam"]
> ,,,
> It's actually far easier if assertions are used, and I'm too old to
> bother trying to repair the non-assertion mess:
>
> ([^s]|s(?!pam))*spam
Since then, Serhiy rehabilitated an old patch to add "atomic groups"
and "possessive
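
A minimal sketch of how the quoted assertion-based pattern behaves in Python's re module. The helper name up_to_next_spam and the sample strings are illustrative assumptions, not taken from the thread.

```python
import re

# Tim's pattern: consume any character that is not "s", or an "s" that
# does not begin "spam", then require "spam" itself.
NEXT_SPAM = re.compile(r"([^s]|s(?!pam))*spam")

def up_to_next_spam(text):
    """Return the prefix of text through the first "spam", or None."""
    m = NEXT_SPAM.match(text)
    return m.group(0) if m else None

print(up_to_next_spam("sspam and spam"))   # sspam
print(up_to_next_spam("eggs spam spam"))   # eggs spam
print(up_to_next_spam("no match here"))    # None
```

Because the repeated group can never consume the "s" that starts "spam", backtracking cannot carry the match past the first occurrence.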
Paul Moore writes:
> However, I do like the idea of having a better parser library in the
> stdlib. But it's pretty easy to write such a thing and publish it on
> PyPI,
It's not easy to write a good one. I've tried in two languages
(Python and Lisp). I'm not saying I'm competent, so that's
[J.B. Langston]
> Unfortunately, it does appear that my app took almost a 20%
> performance hit from using regex instead of re.
> Processing time for a test dataset with 700MB of logs went from
> 77 seconds with the standard library re to 92 seconds with regex.
> Profiling
Unfortunately, it does appear that my app took almost a 20% performance
hit from using regex instead of re. Processing time for a test
dataset with 700MB of logs went from 77 seconds with the standard library re to
92 seconds with regex. Profiling confirms that the time
And, as it turns out, I already had it installed via some transitive
dependency. The bigger task will be testing. Indeed it may be as easy as
"import regex as re" but I need to test and make sure my regexes still work as
expected and that the performance doesn't take a big hit. Trust, but
I have no problem installing software via pip. This is a large project that has
many other dependencies already managed via pip and virtualenv.
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to
[Steven D'Aprano]
> [skipping FUD about pip]
If the OP has problems with pip they can't easily resolve, I expect
they'll say so. In my experience, effective help consists of sticking
to the simplest things that can possibly work at first, not burying a
questioner with an exhaustive account of
On Wed, Feb 16, 2022 at 07:01:42PM -0600, Tim Peters wrote:
> You may not realize how easy this is? Just in case: go to a shell and type
>
> pip install regex
>
> (or, on Windows, "python -m pip install regex" in a DOS box).
>
> That's it. You're done.
Easier said than actually done.
> On 17 Feb 2022, at 01:04, Tim Peters wrote:
>
> [J.B. Langston]
>> Thanks for the conclusive answer.
>
> Not conclusive - just my opinion. Which is informed, but not infallible ;-)
>
>> I will check out the regex library soon.
>
> You may not realize how easy this is? Just in case: go
[J.B. Langston]
> Thanks for the conclusive answer.
Not conclusive - just my opinion. Which is informed, but not infallible ;-)
> I will check out the regex library soon.
You may not realize how easy this is? Just in case: go to a shell and type
pip install regex
(or, on Windows, "python -m
Thanks for the conclusive answer. I will check out the regex library soon.
[MRAB]
> I eventually decided against having it added to the standard library
> because that would tie fixes and additions to Python's release cycle,
> and there's that adage that Python has "batteries included", but not
> nuclear reactors. PyPI is a better place for it, for those who need more
>
On 2022-02-16 22:13, Tim Peters wrote:
[J.B. Langston]
Well, I certainly sparked a lot of interesting discussion, which I have
quite enjoyed reading. But to bring this thread back around to its
original topic, is there support among the Python maintainers for
adding a timeout feature to the
[J.B. Langston]
> Well, I certainly sparked a lot of interesting discussion, which I have
> quite enjoyed reading. But to bring this thread back around to its
> original topic, is there support among the Python maintainers for
> adding a timeout feature to the Python re library?
Buried in the
On Thu, 17 Feb 2022 at 08:33, J.B. Langston wrote:
>
> Well, I certainly sparked a lot of interesting discussion, which I have quite
> enjoyed reading. But to bring this thread back around to its original topic,
> is there support among the Python maintainers for adding a timeout feature to
>
Well, I certainly sparked a lot of interesting discussion, which I have quite
enjoyed reading. But to bring this thread back around to its original topic, is
there support among the Python maintainers for adding a timeout feature to the
Python re library? I will look at the third-party regex
On Wed, Feb 16, 2022, 5:46 AM Paul Moore wrote:
> On Wed, 16 Feb 2022 at 10:23, Chris Angelico wrote:
> >
> > On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull wrote:
>
> > > What I think is more interesting than simpler (but more robust for
> > > what they can do) facilities is better
On Wed, 16 Feb 2022 at 10:23, Chris Angelico wrote:
>
> On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull wrote:
> > What I think is more interesting than simpler (but more robust for
> > what they can do) facilities is better parser support in standard
> > libraries (not just Python's), and
On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull wrote:
>
> Chris Angelico writes:
> > On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull wrote:
>
> > > That is, all regexp implementations support the same basic
> > > language which is sufficient for most tasks most programmers want
> >
Chris Angelico writes:
> On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull wrote:
> > That is, all regexp implementations support the same basic
> > language which is sufficient for most tasks most programmers want
> > regexps for.
>
> The problem is that that's an illusion.
It isn't
I know this is probably too much self promotion, but I really enjoyed
writing this less than a year ago: https://gnosis.cx/regex/ (The Puzzling
Quirks of Regular Expressions).
It's like other puzzle books, but for programmers. You should certainly
still get Friedl's book if you don't have it. You
[Chris Angelico]
> Is there any sort of standardization of regexp syntax and semantics,
Sure. "The nice thing about standards is that you have so many to
choose from" ;-) For example, POSIX defines a regexp flavor so it can
specify what things like grep do. The ECMAScript standard defines its
On 2022-02-16 02:11, Chris Angelico wrote:
On Wed, 16 Feb 2022 at 12:56, Tim Peters wrote:
Regexps keep "evolving"...
Once upon a time, a "regular expression" was a regular grammar. That
is no longer the case.
Once upon a time, a regular expression could be broadly compatible
with multiple
On Wed, 16 Feb 2022 at 12:56, Tim Peters wrote:
> Regexps keep "evolving"...
Once upon a time, a "regular expression" was a regular grammar. That
is no longer the case.
Once upon a time, a regular expression could be broadly compatible
with multiple different parser engines. That is being
[Steven D'Aprano]
> After this thread, I no longer trust that "easy" regexes will do what
> they "obviously" look like they should do :-(
>
> I'm not trying to be funny or snarky. I *thought* I had a reasonable
> understanding of regexes, and now I have learned that I don't, and that
> the
On Wed, 16 Feb 2022 at 10:15, Steven D'Aprano wrote:
>
> On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote:
>
> > scanf just isn't powerful enough. For example, consider parsing user
> > input dates: scanf("%d/%d/%d", , , ). This is nice and
> > simple, but handling
On Wed, 16 Feb 2022 at 09:28, Steven D'Aprano wrote:
>
> On Wed, Feb 16, 2022 at 01:02:44AM +1100, Chris Angelico wrote:
>
> > Yeah, regexes always look terrible when they're used for simple
> > examples :) But try matching a line that has (somewhere in it) the
> > word "spam", then whitespace,
On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote:
> scanf just isn't powerful enough. For example, consider parsing user
> input dates: scanf("%d/%d/%d", , , ). This is nice and
> simple, but handling "2022-02-15" as well requires a bit of thinking
> and several extra
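
A minimal Python sketch of the point made in the quoted text: accepting the dash-separated form alongside the slash-separated one takes only a small pattern. The names DATE and parse_three_ints are illustrative assumptions, and no field order (year first or last) is assumed here.

```python
import re

# One pattern for both "2/15/2022"-style and "2022-02-15"-style input.
# The backreference \2 requires the same separator in both positions.
DATE = re.compile(r"(\d+)([/-])(\d+)\2(\d+)")

def parse_three_ints(s):
    """Return the three numeric fields as a tuple, or None if malformed."""
    m = DATE.fullmatch(s.strip())
    if m is None:
        return None
    return int(m.group(1)), int(m.group(3)), int(m.group(4))

print(parse_three_ints("2022-02-15"))  # (2022, 2, 15)
print(parse_three_ints("2/15/2022"))   # (2, 15, 2022)
print(parse_three_ints("2022-02/15"))  # None
```

Deciding which field is the year, and range-checking month and day, would still need separate logic.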
On Wed, Feb 16, 2022 at 01:02:44AM +1100, Chris Angelico wrote:
> Yeah, regexes always look terrible when they're used for simple
> examples :) But try matching a line that has (somewhere in it) the
> word "spam", then whitespace, then a number (or if you prefer: then a
> sequence of ASCII
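
One way the quoted challenge could be written with re, as a sketch only. It assumes the truncated sentence continues with "ASCII digits", and the names SPAM_NUMBER and spam_number are hypothetical.

```python
import re

# The word "spam" anywhere in the line, then whitespace, then ASCII digits.
SPAM_NUMBER = re.compile(r"\bspam\s+([0-9]+)")

def spam_number(line):
    """Return the number following "spam" as an int, or None."""
    m = SPAM_NUMBER.search(line)
    return int(m.group(1)) if m else None

print(spam_number("I ate spam   42 times"))  # 42
print(spam_number("spam and eggs"))          # None
```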
How embarrassing... I apologize for all the signature garbage at the end of my
message.
On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull wrote:
> The Zawinski quote is motivated by the perception that people seem to
> think that simplicity lies in minimizing the number of tools you need
> to learn. REXX and SNOBOL pattern matching are quite a bit more
> specialized to particular tools
Tim Peters wrote:
> """
> Some people, when confronted with a problem, think “I know, I'll use
> regular expressions.” Now they have two problems.
> - Jamie Zawinski
> """
Maybe so, but I'm committed now :). I have dozens of regexes to parse specific
log messages I'm interested in. I made a
>
> A regex that's vulnerable to pathological behavior is a DoS attack waiting
> to happen. Especially when used for parsing log data (which might contain
> untrusted data). If possible, we should make it harder for people to shoot
> themselves in the feet.
>
And this is exactly what
On 2022-02-15 06:05, Tim Peters wrote:
[Steven D'Aprano]
I've been interested in the existence of SNOBOL string scanning for
a long time, but I know very little about it.
How does it differ from regexes, and why have programming languages
pretty much standardised on regexes rather than other
Tim Peters writes:
> Chris didn't say this, but I will: I'm amazed that things much
> _simpler_ than regexps, like his scanf and REXX PARSE
> examples, haven't spread more.
scanf just isn't powerful enough. For example, consider parsing user
input dates: scanf("%d/%d/%d", , , ). This is
On Wed, 16 Feb 2022 at 00:55, Steven D'Aprano wrote:
>
> On Tue, Feb 15, 2022 at 05:39:33AM -0600, Tim Peters wrote:
>
> > ([^s]|s(?!pam))*spam
> >
> > Bingo. That pattern is easy enough to understand
>
> You and I have very different definitions of the word "easy" :-)
>
> > (if not to invent the
On Tue, Feb 15, 2022 at 05:39:33AM -0600, Tim Peters wrote:
> ([^s]|s(?!pam))*spam
>
> Bingo. That pattern is easy enough to understand
You and I have very different definitions of the word "easy" :-)
> (if not to invent the
> first time): we can chew up a character if it's not an "s", or if
[Tim, on trying to match only the next instance of "spam"]
> Assertions aren't needed, but it is nightmarish to get right.
Followed by a nightmare that got it wrong. My apologies - that's what
I get for trying to show off ;-)
It's actually far easier if assertions are used, and I'm too old to
[Steven D'Aprano]
> I've been interested in the existence of SNOBOL string scanning for
> a long time, but I know very little about it.
>
> How does it differ from regexes, and why have programming languages
> pretty much standardised on regexes rather than other forms of string
> matching?
What
[Tim]
>> In SNOBOL, as I recall, it could be spelled
>>
>> ARB "spam" FENCE
[Chris]
> Ah, so that's a bit more complicated than the "no-backtracking"
> parsing style of REXX and scanf.
Oh, a lot more complex. In SNOBOL, arbitrary computation can be
performed at any point in pattern
On Tue, 15 Feb 2022 at 13:57, Tim Peters wrote:
> In SNOBOL, as I recall, it could be spelled
>
> ARB "spam" FENCE
>
> Those are all pattern objects, and infix whitespace is a binary
> pattern catenation operator.
>
> ARB is a builtin pattern that matches the empty string at first, and
>
[Tim]
>>> That leaves the happy 5% who write "[^X]*X", which
>>> finally says what they intended from the start.
[Steven]
>> Doesn't that only work if X is literally a single character?
Right. It was an example, not a meta-example. Even for a _single
character_, "match up to the next, but never
On Tue, 15 Feb 2022 at 11:47, Steven D'Aprano wrote:
>
> > Another 20% will write ".*?X", with scant understanding that may
> > extend beyond _just_ "the next" X in some cases.
>
> But this surprises me. Do you have an example?
Nongreedy means it'll prefer the next X, but it has to be open to
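
A small sketch of that behavior in Python's re, contrasted with the negated-class form "[^X]*X" mentioned elsewhere in the thread. The sample text and the trailing "c" in the patterns are illustrative assumptions.

```python
import re

text = "aXbXc"

# Non-greedy: ".*?" first tries to stop at the first X, but "c" does not
# follow it, so the engine backtracks and lets ".*?" run past that X.
lazy = re.match(r"a.*?Xc", text)

# Negated class: "[^X]*" can never step over an X, so the match must end
# at the first X or fail outright.
strict = re.match(r"a[^X]*Xc", text)

print(lazy.group(0))   # aXbXc
print(strict)          # None
```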
On Mon, Feb 14, 2022 at 05:13:38PM -0600, Tim Peters wrote:
> An interesting lesson nobody wants to learn: the original major
> string-processing language, SNOBOL, had powerful pattern matching but
> no regexps. Griswold's more modern successor language, Icon, found no
> reason to change that.
On Mon, Feb 14, 2022 at 03:58:49PM -0600, Nick Timkovich wrote:
> While definitely not as bad and not as likely as SQL injection, I think the
> possibility of regex DoS is totally missing in the stdlib re docs. Should
> there be something added there about if you need to put user input into an
>
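
The quoted sentence is cut off, so the following is only a sketch under the assumption that it concerns embedding user input in a pattern. re.escape is a real stdlib function; the helper name find_literal is hypothetical.

```python
import re

def find_literal(user_text, haystack):
    """Search for user_text as a literal string, not as a pattern."""
    # re.escape backslash-escapes regex metacharacters, so the user's
    # input cannot introduce groups or quantifiers of its own.
    return re.search(re.escape(user_text), haystack)

print(find_literal("(a+)+$", "x (a+)+$ y").group(0))  # (a+)+$
print(find_literal("a.c", "abc"))                     # None
```

Escaping only helps when the input is meant literally; it does nothing for patterns that users are allowed to write themselves.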
"""
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
- Jamie Zawinski
"""
Even more true of regexps than of floating point, and even of daemon threads ;-)
regex is a terrific module, incorporating many features that newer
>
> A regex that's vulnerable to pathological behavior is a DoS attack waiting
> to happen. Especially when used for parsing log data (which might contain
> untrusted data). If possible, we should make it harder for people to shoot
> themselves in the feet.
>
While definitely not as bad and not
On Mon, Feb 14, 2022 at 9:55 AM J.B. Langston wrote:
> ... more generally I think it would be good to have a timeout option that
> could be configurable when compiling the regex so that if the regex didn't
> complete within the specified timeframe, it would abort and throw an
> exception.
>
>
For what it's worth, the "regex" library on PyPI (not "re") supports
timeouts:
https://pypi.org/project/regex/
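
Where a third-party package is not an option, the timeout idea can be roughly approximated with the standard library alone by running the search in a child process. This is a sketch, not how regex implements its timeouts; the names _CHILD and search_with_timeout are hypothetical, and starting a process per search is far too slow for bulk log parsing.

```python
import subprocess
import sys

# Child program: reads the pattern and text from argv, prints 1 or 0.
_CHILD = (
    "import re, sys\n"
    "print(1 if re.search(sys.argv[1], sys.argv[2]) else 0)\n"
)

def search_with_timeout(pattern, text, seconds):
    """Return True/False for a match, or raise TimeoutError."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            capture_output=True, text=True, timeout=seconds, check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        raise TimeoutError(f"regex did not finish within {seconds}s") from None
    return done.stdout.strip() == "1"
```

A pattern prone to catastrophic backtracking, such as (a+)+$ applied to a long run of "a" followed by "b", then raises TimeoutError instead of hanging the caller.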
On Mon, Feb 14, 2022, 6:54 PM J.B. Langston wrote:
> Hello,
>
> I had opened this bug because I had a bad regex in my code that was
> causing python to hang in the regex evaluation:
>