Re: [Python-Dev] re performance

2017-01-28 Thread Armin Rigo
Hi Sven,

On 26 January 2017 at 22:13, Sven R. Kunze  wrote:
> I recently refreshed regular expressions theoretical basics *indulging in
> reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html

Theoretical regular expressions and what Python/Perl/etc. call regular
expressions are a bit different.  You can read more about it at
https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
.

Discussions about why they are different often focus on
backreferences, which is a rare feature.  Let me add two other points.

The theoretical kind of regexp is about giving a "yes/no" answer,
whereas the concrete "re" or "regexp" modules gives a match object,
which lets you ask for the subgroups' location, for example.  Strange
at it may seem, I am not aware of a way to do that using the
linear-time approach of the theory---if it answers "yes", then you
have no way of knowing *where* the subgroups matched.

Another issue is that the theoretical engine has no notion of
greedy/non-greedy matching.  Basically, you walk over the source
character and it answers "yes" or "no" after each of them.  This is
different from a typical backtracking implementation.  In Python:

>>> re.match(r'a*', 'aaa')
>>> re.match(r'a*?', 'aaa')

This matches either three or zero characters in Python.  The two
versions are however indistinguishable for the theoretical engine.


A bientôt,

Armin.
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Nick Coghlan
On 27 January 2017 at 22:24, MRAB  wrote:
> On 2017-01-27 17:03, Łukasz Langa wrote:
>>
>>
>>> On Jan 26, 2017, at 5:16 PM, MRAB >> > wrote:
>>
>>
 So, it seems as if regex already uses a better algorithm although I
 couldn't find any reference to any regex theoretical framework like dfa,
 nfa, thompson multiple-state simulation or something.

>>> It still uses backtracking, like in the re module.
>>
>>
>> What’s the status of regex inclusion in the stdlib?
>>
>>
> I'm not bothered about it. It's quite a bit bigger than the re module, and,
> anyway, keeping it as a third-party module gives me more freedom to make
> updates, which are available for a range of Python versions.

I still think it could be a good candidate for a first "bundled"
module, where we don't migrate it fully into the CPython development
process, but *do* officially bless it and provide it by default in the
form of a bundled wheel file (similar to what we do with pip).

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Terry Reedy

On 1/28/2017 9:43 AM, Nick Coghlan wrote:

On 27 January 2017 at 22:24, MRAB  wrote:

On 2017-01-27 17:03, Łukasz Langa wrote:



What’s the status of regex inclusion in the stdlib?


I'm not bothered about it. It's quite a bit bigger than the re module, and,
anyway, keeping it as a third-party module gives me more freedom to make
updates, which are available for a range of Python versions.


I still think it could be a good candidate for a first "bundled"
module, where we don't migrate it fully into the CPython development
process, but *do* officially bless it and provide it by default in the
form of a bundled wheel file (similar to what we do with pip).


So like pip, we would bundle the current version, install it in 
site-packages, and users who use it could upgrade as desired, even after 
x.y bugfix ends.


What would be nice would be to have a short entry in the docs for 
bundled modules that links to the real doc, wherever it is.


--
Terry Jan Reedy


___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Barry Warsaw
On Jan 28, 2017, at 03:43 PM, Nick Coghlan wrote:

>I still think it could be a good candidate for a first "bundled"
>module, where we don't migrate it fully into the CPython development
>process, but *do* officially bless it and provide it by default in the
>form of a bundled wheel file (similar to what we do with pip).

How would that work exactly.  I.e. is there a PEP?

While I think it could be a good idea to bundle (more?) third party packages
for a variety of reasons, I want to make sure it's done in a way that's still
friendly to downstream repackagers, as I'm sure you do to. :)

For Debian/Ubuntu, we already have some additional machinations to make
ensurepip and such work properly in a distro packaging environment.  We'd want
to be able to do the same for any bundled packages as well.

Cheers,
-Barry
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Brett Cannon
On Fri, 27 Jan 2017 at 13:26 MRAB  wrote:

> On 2017-01-27 17:03, Łukasz Langa wrote:
> >
> >> On Jan 26, 2017, at 5:16 PM, MRAB  >> > wrote:
> >
> >>> So, it seems as if regex already uses a better algorithm although I
> >>> couldn't find any reference to any regex theoretical framework like
> dfa,
> >>> nfa, thompson multiple-state simulation or something.
> >>>
> >> It still uses backtracking, like in the re module.
> >
> > What’s the status of regex inclusion in the stdlib?
> >
> >
> I'm not bothered about it. It's quite a bit bigger than the re module,
> and, anyway, keeping it as a third-party module gives me more freedom to
> make updates, which are available for a range of Python versions.
>

Maybe regex should get a mention in the docs like requests does under
urllib.request?
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Serhiy Storchaka

On 28.01.17 20:04, Brett Cannon wrote:

Maybe regex should get a mention in the docs like requests does under
urllib.request?


https://bugs.python.org/issue22594

___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com