Re: [Python-ideas] Complicate str methods

2018-02-13 Thread Michel Desmoulin


Le 03/02/2018 à 23:04, Franklin? Lee a écrit :
> Let s be a str. I propose to allow these existing str methods to take
> params in new forms.
> 
> s.replace(old, new):
>     Allow passing in a collection of olds.
>     Allow passing in a single argument, a mapping of olds to news.
>     Allow the olds in the mapping to be tuples of strings.
> 
> s.split(sep), s.rsplit, s.partition:
>     Allow sep to be a collection of separators.
> 
> s.startswith, s.endswith:
>     Allow argument to be a collection of strings.
> 
> s.find, s.index, s.count, x in s:
>     Similar.
>     These methods are also in `list`, which can't distinguish between
> items, subsequences, and subsets. However, `str` is already inconsistent
> with `list` here: list.M looks for an item, while str.M looks for a
> subsequence.
> 
> s.[r|l]strip:
>     Sadly, these functions already interpret their str arguments as
> collections of characters.
> 

I second that proposal. I regularly need those and feel frustrated when
it doesn't work.

I even wrote a wrapper that does exactly this :
https://github.com/Tygs/ww/blob/master/src/ww/wrappers/strings.py

But because it's pure Python, it's guaranteed to be slow. Plus you need
to install it every time you need it.
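[For illustration, a rough pure-Python sketch of the simultaneous-replacement semantics the proposal (and wrappers like the one above) aim for; `multi_replace` is a hypothetical helper, and a real implementation would use an optimized multi-pattern search rather than this naive scan:]

```python
def multi_replace(s, mapping):
    # Scan left to right; at each position the longest matching "old"
    # wins, and replacements never overlap or re-match new text.
    needles = sorted(mapping, key=len, reverse=True)
    out, i = [], 0
    while i < len(s):
        for old in needles:
            if old and s.startswith(old, i):
                out.append(mapping[old])
                i += len(old)
                break
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

# Swapping simultaneously, which chained .replace() calls cannot do:
print(multi_replace("ab", {"a": "b", "b": "a"}))  # -> ba
```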
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Franklin? Lee
On Feb 8, 2018 13:06, "Serhiy Storchaka"  wrote:

08.02.18 12:45, Franklin? Lee пише:

> Could it be that re uses an optimization that can also be used in str?
> CPython uses a modified Boyer-Moore for str.find:
> https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h
> http://effbot.org/zone/stringlib.htm
> Maybe there's a minimum length after which it's better to precompute a
> table.
>

Yes, there is a special optimization in re here. It isn't free: you need to
spend some time preparing it, and you need a special object that keeps an
optimized representation for faster search. This makes it very unlikely to be
used in str, because you would either have to spend the time on compilation
for every search, or use some kind of caching, which is not free either, adds
complexity, and increases memory consumption. Note also that in the case of
re the compiler is implemented in Python, which reduces the complexity.


The performance of the one-needle case isn't really relevant, though, is
it? This idea is for the multi-needle case, and my tests showed that re
performs even worse than a loop of `.find`s. How do re and .find scale with
both number and lengths of needles on your machine?


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Franklin? Lee
On Thu, Feb 8, 2018 at 5:45 AM, Franklin? Lee
 wrote:
> On Feb 7, 2018 17:28, "Serhiy Storchaka"  wrote:
> > The name of the complicated str methods is regular expressions. To do these
> > operations efficiently you need to convert the arguments into a special
> > optimized form. This is what re.compile() does. If the compilation were
> > done on every invocation of a str method, it would add too much overhead
> > and kill performance.
> >
> > Even for simple string search a regular expression can be more efficient
> > than a str method.
> >
> > $ ./python -m timeit -s 'import re; p = re.compile("spam"); s =
> > "spa"*100+"m"' -- 'p.search(s)'
> > 500000 loops, best of 5: 680 nsec per loop
> >
> > $ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
> > 200000 loops, best of 5: 1.09 usec per loop

I ran Serhiy's tests (3.5.2) and got different results.

# Setup:
   __builtins__.basestring = str  #hack for re2 import in py3
import re, re2, regex

n = 10000
s = "spa"*n+"m"
p = re.compile("spam")
pgex = regex.compile("spam")
p2 = re2.compile("spam")

# Tests:
%timeit s.find("spam")
%timeit p.search(s)
%timeit pgex.search(s)
%timeit p2.search(s)

n = 100
350 ns ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
554 ns ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
633 ns ± 8.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.62 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

n = 1000
2.17 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.57 µs ± 27.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.46 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
7.8 µs ± 72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

n = 10000
17.3 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
33.5 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
31.7 µs ± 396 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.5 µs ± 400 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Conclusions:
- `.find` is fastest. On 3.6.1 (Windows), it's about the same speed as
re: 638 ns vs 662 ns; 41.3 µs vs 43.8 µs.
- re and regex have similar performance, probably due to a similar backend.
- re2 is slowest. I suspect it's due to the wrapper. It may be copying
the strings to a format suitable for the backend.

P.S.: I also tested `"spam" in s`, which was linearly slower than
`.find`. However, `in` is consistently faster than `.find` in my 3.6,
so the discrepancy has likely been fixed.

More curious is that, on `.find`, my MSVC-compiled 3.6.1 and 3.5.2 are
twice as slow as my 3.5.2 for Ubuntu For Windows, but the re
performance is similar. It's probably a compiler thing.


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Serhiy Storchaka

08.02.18 12:45, Franklin? Lee пише:
On Feb 7, 2018 17:28, "Serhiy Storchaka" wrote:

Even for simple string search a regular expression can be more
efficient than a str method.

$ ./python -m timeit -s 'import re; p = re.compile("spam"); s =
"spa"*100+"m"' -- 'p.search(s)'
500000 loops, best of 5: 680 nsec per loop

$ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
200000 loops, best of 5: 1.09 usec per loop


That's an odd result. Python regexes use backtracking, not a DFA. I gave 
a timing test earlier in the thread:

https://mail.python.org/pipermail/python-ideas/2018-February/048879.html
I compared using repeated .find()s against a precompiled regex, then 
against a pure Python and unoptimized tree-based algorithm.


Could it be that re uses an optimization that can also be used in str? 
CPython uses a modified Boyer-Moore for str.find:

https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h
http://effbot.org/zone/stringlib.htm
Maybe there's a minimum length after which it's better to precompute a 
table.


Yes, there is a special optimization in re here. It isn't free: you need to
spend some time preparing it, and you need a special object that keeps an
optimized representation for faster search. This makes it very unlikely to be
used in str, because you would either have to spend the time on compilation
for every search, or use some kind of caching, which is not free either, adds
complexity, and increases memory consumption. Note also that in the case of
re the compiler is implemented in Python, which reduces the complexity.


Patches that add optimization for other common cases are welcomed.



Re: [Python-ideas] Complicate str methods

2018-02-08 Thread INADA Naoki
I think it shouldn't be a str method.

It should be a separate class, to reuse the internal tree.

There are some Aho-Corasick implementations on PyPI.
As far as I know, AC does longest match.

On the other hand, Go's Replacer (it's trie-based too) is:

> Replacements are performed in order, without overlapping matches.
https://golang.org/pkg/strings/#NewReplacer
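[A sketch of the ordered, non-overlapping semantics described for Go's Replacer, in illustrative Python only; the real Go implementation is trie-based and far more efficient:]

```python
def replace_in_order(s, pairs):
    # At each position, try the (old, new) pairs in the order given;
    # the first match wins, matches never overlap, and replacements
    # are never re-scanned.
    out, i = [], 0
    while i < len(s):
        for old, new in pairs:
            if old and s.startswith(old, i):
                out.append(new)
                i += len(old)
                break
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)
```

Note the contrast with longest-match (Aho-Corasick style) semantics: here the earlier pair wins even when a later pair would match a longer substring at the same position.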


On Sun, Feb 4, 2018 at 7:04 AM, Franklin? Lee
 wrote:
> Let s be a str. I propose to allow these existing str methods to take params
> in new forms.
>
> s.replace(old, new):
> Allow passing in a collection of olds.
> Allow passing in a single argument, a mapping of olds to news.
> Allow the olds in the mapping to be tuples of strings.
>
> s.split(sep), s.rsplit, s.partition:
> Allow sep to be a collection of separators.
>
> s.startswith, s.endswith:
> Allow argument to be a collection of strings.
>
> s.find, s.index, s.count, x in s:
> Similar.
> These methods are also in `list`, which can't distinguish between items,
> subsequences, and subsets. However, `str` is already inconsistent with
> `list` here: list.M looks for an item, while str.M looks for a subsequence.
>
> s.[r|l]strip:
> Sadly, these functions already interpret their str arguments as
> collections of characters.
>
> These new forms can be optimized internally, as a search for multiple
> candidate substrings can be more efficient than searching for one at a time.
> See
> https://stackoverflow.com/questions/3260962/algorithm-to-find-multiple-string-matches
>
> The most significant change is on .replace. The others are simple enough to
> simulate with a loop or something. It is harder to make multiple
> simultaneous replacements using one .replace at a time, because previous
> replacements can form new things that look like replaceables. The easiest
> Python solution is to use regex or install some package, which uses (if
> you're lucky) regex or (if unlucky) doesn't simulate simultaneous
> replacements. (If possible, just use str.translate.)
>
> I suppose .split on multiple separators is also annoying to simulate. The
> two-argument form of .split may be even more of a burden, though I don't
> know when a limited multiple-separator split is useful. The current best
> solution is, like before, to use regex, or install a package and hope for
> the best.
>



-- 
INADA Naoki  


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Franklin? Lee
On Thu, Feb 8, 2018 at 11:24 AM, Steve Dower  wrote:
> Easily fixed by installing one of the alternate regex libraries.

MRAB's regex library, the most prominent alternative, does not use the
linear-time search algorithm. The only libraries I know that do are
the ones with re2, though I haven't looked deeply.

Let it be known that I tried to install pyre2 (re2 on PyPI) on Ubuntu
For Windows for the tests, and after hours of no success, I decided
that the alternative libraries were not the point. I eventually got it
working (https://github.com/axiak/pyre2/issues/51), and here are the
results:

# Same setup as before, with findsub_re2 using `find=re2.search,
esc=re2.escape`.

pattern2 = re2.compile('|'.join(map(re.escape, needles)))

%timeit findsub(haystack, needles)
#=> 1.26 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit findsub_re(haystack, needles)
#=> 745 ms ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit findsub_re_cached(haystack, pattern)
#=> 733 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit findsub_regex(haystack, needles)
#=> 639 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit findsub_re2(haystack, needles)
#=> 34.1 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit findsub_re_cached(haystack, pattern2)
#=> 9.56 µs ± 268 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In any case, installing re2 is even more of an advanced solution for a
basic problem than using re.

> re performance and its edge cases have been discussed endlessly. Please look
> it up before restarting that discussion.

I'm not trying to restart the discussion. I'm trying to say that the
assumptions being made about its superior performance are unfounded.
Two members have suggested that re is the performant option, and
that's just not true.


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Steve Dower
Easily fixed by installing one of the alternate regex libraries.

re performance and its edge cases have been discussed endlessly. Please look it 
up before restarting that discussion.

Top-posted from my Windows phone

From: Franklin? Lee
Sent: Thursday, February 8, 2018 2:46
To: Serhiy Storchaka
Cc: Python-Ideas
Subject: Re: [Python-ideas] Complicate str methods

On Feb 7, 2018 17:28, "Serhiy Storchaka"  wrote:
04.02.18 00:04, Franklin? Lee пише:

Let s be a str. I propose to allow these existing str methods to take params in 
new forms.

s.replace(old, new):
     Allow passing in a collection of olds.
     Allow passing in a single argument, a mapping of olds to news.
     Allow the olds in the mapping to be tuples of strings.

s.split(sep), s.rsplit, s.partition:
     Allow sep to be a collection of separators.

s.startswith, s.endswith:
     Allow argument to be a collection of strings.

s.find, s.index, s.count, x in s:
     Similar.
     These methods are also in `list`, which can't distinguish between items, 
subsequences, and subsets. However, `str` is already inconsistent with `list` 
here: list.M looks for an item, while str.M looks for a subsequence.

s.[r|l]strip:
     Sadly, these functions already interpret their str arguments as 
collections of characters.

The name of the complicated str methods is regular expressions. To do these
operations efficiently you need to convert the arguments into a special
optimized form. This is what re.compile() does. If the compilation were done on
every invocation of a str method, it would add too much overhead and kill
performance.

Even for simple string search a regular expression can be more efficient than a 
str method.

$ ./python -m timeit -s 'import re; p = re.compile("spam"); s = "spa"*100+"m"' 
-- 'p.search(s)'
500000 loops, best of 5: 680 nsec per loop

$ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
200000 loops, best of 5: 1.09 usec per loop

That's an odd result. Python regexes use backtracking, not a DFA. I gave a 
timing test earlier in the thread:
https://mail.python.org/pipermail/python-ideas/2018-February/048879.html
I compared using repeated .find()s against a precompiled regex, then against a 
pure Python and unoptimized tree-based algorithm.

Could it be that re uses an optimization that can also be used in str? CPython 
uses a modified Boyer-Moore for str.find:
https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h
http://effbot.org/zone/stringlib.htm
Maybe there's a minimum length after which it's better to precompute a table.

In any case, once you have branches in the regex, which is necessary to emulate 
these features, it will start to slow down because it has to travel down both 
branches in the worst case.



Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Sven R. Kunze

Same here.

On 07.02.2018 22:57, Mike Miller wrote:

+1

I have the need for one or two of these in every project (of a certain 
size) and have to come up with solutions each time with the re module 
or a loop.


Not a fan of regex's for trivial tasks, or those that require a real 
parser.


On 2018-02-03 14:04, Franklin? Lee wrote:





Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Franklin? Lee
On Feb 7, 2018 17:28, "Serhiy Storchaka"  wrote:

04.02.18 00:04, Franklin? Lee пише:

Let s be a str. I propose to allow these existing str methods to take
> params in new forms.
>
> s.replace(old, new):
>  Allow passing in a collection of olds.
>  Allow passing in a single argument, a mapping of olds to news.
>  Allow the olds in the mapping to be tuples of strings.
>
> s.split(sep), s.rsplit, s.partition:
>  Allow sep to be a collection of separators.
>
> s.startswith, s.endswith:
>  Allow argument to be a collection of strings.
>
> s.find, s.index, s.count, x in s:
>  Similar.
>  These methods are also in `list`, which can't distinguish between
> items, subsequences, and subsets. However, `str` is already inconsistent
> with `list` here: list.M looks for an item, while str.M looks for a
> subsequence.
>
> s.[r|l]strip:
>  Sadly, these functions already interpret their str arguments as
> collections of characters.
>

The name of the complicated str methods is regular expressions. To do these
operations efficiently you need to convert the arguments into a special
optimized form. This is what re.compile() does. If the compilation were done
on every invocation of a str method, it would add too much overhead and kill
performance.

Even for simple string search a regular expression can be more efficient
than a str method.

$ ./python -m timeit -s 'import re; p = re.compile("spam"); s =
"spa"*100+"m"' -- 'p.search(s)'
500000 loops, best of 5: 680 nsec per loop

$ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
200000 loops, best of 5: 1.09 usec per loop


That's an odd result. Python regexes use backtracking, not a DFA. I gave a
timing test earlier in the thread:
https://mail.python.org/pipermail/python-ideas/2018-February/048879.html
I compared using repeated .find()s against a precompiled regex, then
against a pure Python and unoptimized tree-based algorithm.

Could it be that re uses an optimization that can also be used in str?
CPython uses a modified Boyer-Moore for str.find:
https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h
http://effbot.org/zone/stringlib.htm
Maybe there's a minimum length after which it's better to precompute a
table.
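[For illustration, a Boyer-Moore-Horspool-style sketch of the "precomputed table" idea mentioned above. This is an editorial example, not CPython's actual fastsearch, which uses a different, more refined variant with a Bloom-filter-like mask:]

```python
def horspool_find(haystack, needle):
    # Precompute, for each character of the needle except the last,
    # how far the search window may shift when that character is the
    # last one in the current window.
    m = len(needle)
    if m == 0:
        return 0
    shift = {c: m - i - 1 for i, c in enumerate(needle[:-1])}
    i = m - 1
    while i < len(haystack):
        j, k = m - 1, i
        # Compare needle to window right-to-left.
        while j >= 0 and haystack[k] == needle[j]:
            j -= 1
            k -= 1
        if j < 0:
            return k + 1  # full match; k walked one past its start
        i += shift.get(haystack[i], m)  # unseen char: skip whole needle
    return -1
```

Building the table costs O(len(needle)) up front, which is the one-time cost being discussed: worthwhile for long needles or repeated searches, overhead for a single short one.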

In any case, once you have branches in the regex, which is necessary to
emulate these features, it will start to slow down because it has to travel
down both branches in the worst case.


Re: [Python-ideas] Complicate str methods

2018-02-07 Thread Serhiy Storchaka

04.02.18 00:04, Franklin? Lee пише:
Let s be a str. I propose to allow these existing str methods to take 
params in new forms.


s.replace(old, new):
     Allow passing in a collection of olds.
     Allow passing in a single argument, a mapping of olds to news.
     Allow the olds in the mapping to be tuples of strings.

s.split(sep), s.rsplit, s.partition:
     Allow sep to be a collection of separators.

s.startswith, s.endswith:
     Allow argument to be a collection of strings.

s.find, s.index, s.count, x in s:
     Similar.
     These methods are also in `list`, which can't distinguish between 
items, subsequences, and subsets. However, `str` is already inconsistent 
with `list` here: list.M looks for an item, while str.M looks for a 
subsequence.


s.[r|l]strip:
     Sadly, these functions already interpret their str arguments as 
collections of characters.


The name of the complicated str methods is regular expressions. To do these
operations efficiently you need to convert the arguments into a special
optimized form. This is what re.compile() does. If the compilation were done
on every invocation of a str method, it would add too much overhead and kill
performance.


Even for simple string search a regular expression can be more efficient 
than a str method.


$ ./python -m timeit -s 'import re; p = re.compile("spam"); s = 
"spa"*100+"m"' -- 'p.search(s)'

500000 loops, best of 5: 680 nsec per loop

$ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
200000 loops, best of 5: 1.09 usec per loop



Re: [Python-ideas] Complicate str methods

2018-02-07 Thread Mike Miller

+1

I have the need for one or two of these in every project (of a certain size) and 
have to come up with solutions each time with the re module or a loop.


Not a fan of regex's for trivial tasks, or those that require a real parser.

On 2018-02-03 14:04, Franklin? Lee wrote:



Re: [Python-ideas] Complicate str methods

2018-02-07 Thread Neil Girdhar


On Saturday, February 3, 2018 at 8:10:38 PM UTC-5, Steven D'Aprano wrote:
>
> On Sun, Feb 04, 2018 at 10:54:53AM +1100, Chris Angelico wrote: 
>
> > Picking up this one as an example, but this applies to all of them: 
> > the transformation you're giving here is dangerously flawed. If there 
> > are any regex special characters in the strings, this will either bomb 
> > with an exception, or silently do the wrong thing. The correct way to 
> > do it is (at least, I think it is): 
> > 
> > re.match("|".join(map(re.escape, strings)), testme) 
> > 
> > With that gotcha lurking in the wings, I think this should not be 
> > cavalierly dismissed with "just 'import re' and be done with it". 
>
> Indeed. 
>
> This is not Perl and "just use a regex" is not a close fit to the 
> culture of Python. 
>
> Regexes are a completely separate mini-language, and one which is the 
> opposite of Pythonic. Instead of "executable pseudo-code", regexes are 
> excessively terse and cryptic once you get past the simple examples. 
> Doing anything complicated using regexes is painful. 
>
> Even Larry Wall has criticised regex syntax for choosing poor defaults 
> and information density. (Rarely used symbols get a single character, 
> while frequently needed symbols are coded as multiple characters, so 
> Perlish syntax has the worst of both worlds: too terse for casual users, 
> too verbose for experts, hard to maintain for everyone.) 
>
> Any serious programmer should have at least a passing familiarity with 
> regexes. They are ubiquitous, and useful, especially as a common 
> mini-language for user-specified searching. 
>
> But I consider regexes to be the fall-back for when Python doesn't 
> support the kind of string matching operation I need, not the primary 
> solution. I would never write: 
>
> re.match('start', text) 
> re.search('spam', text) 
>
> when 
>
> text.startswith('start') 
> text.find('spam') 
>
> will do. I think this proposal to add more power to the string methods 
> is worth some serious consideration. 
>

Completely agree with the sentiment.  I don't know about this proposal, but 
complicated regular expressions are never a good solution even when they 
are the best solution.  

>
>
>
> -- 
> Steve 


Re: [Python-ideas] Complicate str methods

2018-02-03 Thread Franklin? Lee
On Sat, Feb 3, 2018 at 6:43 PM, Terry Reedy  wrote:
> On 2/3/2018 5:04 PM, Franklin? Lee wrote:
>>
>> Let s be a str. I propose to allow these existing str methods to take
>> params in new forms.
>
>
> Thanks for the honest title.  As you sort of indicate, these can all be done
> with re module.  However, you imply loops are needed besides, which is
> mostly not true.  Your complications mostly translate to existing calls and
> hence are not needed.

My point wasn't that they _needed_ loops, but that they were simple
enough to write (correctly) using loops, while simultaneous multiple
.replace isn't.

In any case, due to the backtracking algorithm used by re, a real
specialized implementation can be much more efficient for the
functions I listed, and will add value.

> Perhaps 'Regular Expression HOWTO' could use more examples, or even a
> section on generalizing string methods.  Perhaps the string method doc
> needs suggestion to use re for multiple string args and references to the re
> howto.  Please consider taking a look at both.

Nah, I'm fine. I can take the time to write inefficient simulations,
or find them online. I proposed this not because I myself need it in
the language, but because I think it's useful for many, yet easy to
get wrong. It's the same with `sort`.

In my design notes for my own toolkit, I also allow regular
expressions as "olds". I didn't add that feature, since it'd cause a
dependency.

> >>> import re
>
>> s.replace(old, new):
>>  Allow passing in a collection of olds.
>
>
> >>> re.sub('Franklin|Lee', 'user', 'Franklin? Lee')
> 'user? user'
>
> Remembering the name change is a nuisance

Three issues:
1. As Chris Angelico pointed out, the input strings need to be escaped.
2. The input strings should also be reverse sorted by length. Longer
strings should be higher priority than shorter strings, or else
'foobar' will never outbid 'foo'.
3. The substitution needs to be escaped (which is probably why it has
a different name), using repl.replace('\\', r'\\'). As the docs note,
using re.escape here is wrong (at least before 3.7).

Like I said, easy to get wrong.
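[Putting those three fixes together, a sketch of the `dictsub` helper Terry suggested; a hypothetical function, and it assumes the news are plain strings rather than templates:]

```python
import re

def dictsub(s, mapping):
    # Issue 1: escape the olds so metacharacters match literally.
    # Issue 2: longest olds first, so 'foobar' can outbid 'foo'.
    # Issue 3: substitute via a callback, so the news are inserted
    # verbatim and need no escaping at all.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(map(re.escape, keys)))
    return pattern.sub(lambda m: mapping[m.group(0)], s)
```

Because the replacement is a callable, a new string like `r'\1'` is inserted as-is, sidestepping issue 3 entirely.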

>>  Allow passing in a single argument, a mapping of olds to news.
>
>
> This needs to be a separate function, say 'dictsub', that joins the keys
> with '|' and calls re.sub with a function that does the lookup as the 2nd
> parameter.  This would be a nice example for the howto.
>
> As you noted, this is generalization of str.translate, and might be proposed
> as a new re module function.

If this function is added to re, it should also allow patterns as
keys, and we may also want to add
re.compile(collection_of_strings_and_patterns).

>> These new forms can be optimized internally, as a search for multiple
>> candidate substrings can be more efficient than searching for one at a time.
>
>
> This is what re does with 's1|s2|...|sn' patterns.

Really? re generally uses backtracking, not a DFA.

Timing indicates that it is NOT using an efficient algorithm.

import re

def findsub(haystack, needles, map=map, any=any, plusone=1 .__add__):
    return any(map(plusone, map(haystack.find, needles)))

def findsub_re(haystack, needles, bool=bool, map=map,
               find=re.search, esc=re.escape):
    pattern = '|'.join(map(esc, needles))
    return bool(find(pattern, haystack))

def findsub_re_cached(haystack, regex, bool=bool):
    return bool(regex.search(haystack))


n = 1000
haystack = 'a'*n + 'b'
needles = ['a'*i + 'bb' for i in range(1, n)]
needles = sorted(needles, key=len, reverse=True)
regex = re.compile('|'.join(map(re.escape, needles)))

%timeit findsub(haystack, needles)
#=> 1.81 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit findsub_re(haystack, needles)
#=> 738 ms ± 39.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit findsub_re_cached(haystack, regex)
#=> 680 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here, the naive approach using basic string operations is over 300x
faster than the re approach.

Even a simple pure Python tree approach is faster.

END = ''
# Tree-based multiple string searches.
def make_tree(needles):
    """Creates a tree such that each node maps characters to
    values. The tree is not optimized.
    """
    root = {}
    for needle in needles:
        cur = root
        for c in needle:
            try:
                cur = cur[c]
            except KeyError:
                cur[c] = cur = {}  #curious
        cur[END] = needle
    return root

def findsub_tree(haystack, needles):
    tree = make_tree(needles)
    nodes = [tree]
    for c in haystack:
        nodes = [n[c] for n in nodes if c in n]  #NOT OPTIMAL
        if any(END in n for n in nodes):
            return True
        nodes.append(tree)
    return False

%timeit findsub_tree(haystack, needles)
#=

Re: [Python-ideas] Complicate str methods

2018-02-03 Thread Steven D'Aprano
On Sun, Feb 04, 2018 at 10:54:53AM +1100, Chris Angelico wrote:

> Picking up this one as an example, but this applies to all of them:
> the transformation you're giving here is dangerously flawed. If there
> are any regex special characters in the strings, this will either bomb
> with an exception, or silently do the wrong thing. The correct way to
> do it is (at least, I think it is):
> 
> re.match("|".join(map(re.escape, strings)), testme)
> 
> With that gotcha lurking in the wings, I think this should not be
> cavalierly dismissed with "just 'import re' and be done with it".

Indeed.

This is not Perl and "just use a regex" is not a close fit to the 
culture of Python.

Regexes are a completely separate mini-language, and one which is the 
opposite of Pythonic. Instead of "executable pseudo-code", regexes are 
excessively terse and cryptic once you get past the simple examples. 
Doing anything complicated using regexes is painful.

Even Larry Wall has criticised regex syntax for choosing poor defaults 
and information density. (Rarely used symbols get a single character, 
while frequently needed symbols are coded as multiple characters, so 
Perlish syntax has the worst of both worlds: too terse for casual users, 
too verbose for experts, hard to maintain for everyone.)

Any serious programmer should have at least a passing familiarity with 
regexes. They are ubiquitous, and useful, especially as a common 
mini-language for user-specified searching.

But I consider regexes to be the fall-back for when Python doesn't 
support the kind of string matching operation I need, not the primary 
solution. I would never write:

re.match('start', text)
re.search('spam', text)

when 

text.startswith('start')
text.find('spam') 

will do. I think this proposal to add more power to the string methods 
is worth some serious consideration.



-- 
Steve


Re: [Python-ideas] Complicate str methods

2018-02-03 Thread Chris Angelico
On Sun, Feb 4, 2018 at 10:43 AM, Terry Reedy  wrote:
> On 2/3/2018 5:04 PM, Franklin? Lee wrote:
>> s.startswith, s.endswith:
>>  Allow argument to be a collection of strings.
>
>
> bool(re.match('|'.join(strings), s)) does exactly the proposed s.startswith,
> with the advantage that the actual match is available, and I think that one
> would nearly always want to know that match.
>
> >>> re.match('a|e|i|o|u', 'Franklin? Lee')
> >>> re.match('f|F', 'Franklin? Lee')
> 

Picking up this one as an example, but this applies to all of them:
the transformation you're giving here is dangerously flawed. If there
are any regex special characters in the strings, this will either bomb
with an exception, or silently do the wrong thing. The correct way to
do it is (at least, I think it is):

re.match("|".join(map(re.escape, strings)), testme)

With that gotcha lurking in the wings, I think this should not be
cavalierly dismissed with "just 'import re' and be done with it".
Maybe put some recipes in the docs showing how to do this safely?
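[For example, such a recipe might look like this editorial sketch, with `startswith_any` as a hypothetical name:]

```python
import re

def startswith_any(text, strings):
    # re.escape makes metacharacters like '+' or '?' match literally,
    # avoiding the exception/silent-mismatch trap described above.
    return re.match('|'.join(map(re.escape, strings)), text) is not None
```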

ChrisA


Re: [Python-ideas] Complicate str methods

2018-02-03 Thread Terry Reedy

On 2/3/2018 5:04 PM, Franklin? Lee wrote:
Let s be a str. I propose to allow these existing str methods to take 
params in new forms.


Thanks for the honest title.  As you sort of indicate, these can all be 
done with the re module.  However, you imply that loops are needed 
besides, which is mostly not true.  Your complications mostly translate 
to existing calls and hence are not needed.


Perhaps 'Regular Expression HOWTO' could use more examples, or even a 
section on generalizing string methods.  Perhaps the string method docs 
need a suggestion to use re for multiple string args, and references to 
the re howto.  Please consider taking a look at both.


>>> import re


s.replace(old, new):
     Allow passing in a collection of olds.


>>> re.sub('Franklin|Lee', 'user', 'Franklin? Lee')
'user? user'

Remembering the name change (str.replace vs re.sub) is a nuisance.


     Allow passing in a single argument, a mapping of olds to news.


This needs to be a separate function, say 'dictsub', that joins the keys 
with '|' and calls re.sub with a function that does the lookup as the 
2nd parameter.  This would be a nice example for the howto.


As you noted, this is generalization of str.translate, and might be 
proposed as a new re module function.
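
A minimal sketch of such a 'dictsub' (a hypothetical name, not an
existing re function) might look like this; sorting the keys
longest-first keeps a shorter key from shadowing a longer one in the
alternation:

```python
import re

def dictsub(mapping, string):
    # Sort keys longest-first so e.g. 'Franklin' wins over 'Frank'
    # in the alternation; escape keys so metacharacters match literally.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = '|'.join(map(re.escape, keys))
    return re.sub(pattern, lambda m: mapping[m.group(0)], string)

print(dictsub({'Franklin': 'user1', 'Lee': 'user2'}, 'Franklin? Lee'))
# user1? user2
```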



     Allow the olds in the mapping to be tuples of strings.


A minor addition to dictsub.



s.split(sep), s.rsplit, s.partition:
     Allow sep to be a collection of separators.


re.split is already more flexible than non-whitespace str.split and 
str.partition combined.


>>> re.split('a|e|i|o|u', 'Franklin? Lee')
['Fr', 'nkl', 'n? L', '', '']
>>> re.split('(a|e|i|o|u)', 'Franklin? Lee')  # multiple partition
['Fr', 'a', 'nkl', 'i', 'n? L', 'e', '', 'e', '']
>>> re.split('(a|e|i|o|u)', 'Franklin? Lee', 1) # single partition
['Fr', 'a', 'nklin? Lee']

re.split, and hence str.rsplit(collection), are very sensible.


s.startswith, s.endswith:
     Allow argument to be a collection of strings.


bool(re.match('|'.join(strings), s)) does exactly the proposed 
s.startswith, with the advantage that the actual match is available, and 
I think that one would nearly always want to know that match.


>>> re.match('a|e|i|o|u', 'Franklin? Lee')
>>> re.match('f|F', 'Franklin? Lee')


re.search with '^' at the beginning or '$' at the end covers both 
proposals, with the added flexibility of using MULTILINE mode to match 
at the beginning or end of lines within the string.
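
A sketch of the endswith case under that approach (the names and sample
strings here are illustrative, not from the proposal):

```python
import re

text = 'Franklin? Lee'
suffixes = ['Lee', 'Smith']

# Emulate a hypothetical s.endswith(collection): escape each suffix
# and anchor the alternation at the end of the string.
pat = '(?:%s)$' % '|'.join(map(re.escape, suffixes))
m = re.search(pat, text)
print(bool(m), m.group(0) if m else None)   # True Lee
```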



s.find, s.index, s.count, x in s:
     Similar.
     These methods are also in `list`, which can't distinguish between 
items, subsequences, and subsets. However, `str` is already inconsistent 
with `list` here: list.M looks for an item, while str.M looks for a 
subsequence.


Comments above apply.  re.search tells you which string matched as well 
as where.  bool(re.search(...)) is 'x in s'.  re.findall and re.finditer 
give much more info than merely a count (which itself is 
'sum(1 for _ in re.finditer(...))').
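
For instance, a sketch of multi-substring find and count via re (the
sample data is illustrative):

```python
import re

text = 'Franklin? Lee'
targets = ['an', 'ee']
pat = '|'.join(map(re.escape, targets))

# A multi-substring s.find: re.search reports which target matched,
# and where.
m = re.search(pat, text)
print(m.start(), m.group(0))                  # 2 an

# A multi-substring s.count: iterate the non-overlapping matches.
count = sum(1 for _ in re.finditer(pat, text))
print(count)                                  # 2
```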



s.[r|l]strip:
     Sadly, these functions already interpret their str arguments as 
collections of characters.


To avoid this, use re.sub with ^ or $ anchor and '' replacement.

>>> re.sub('(Frank|Lee)$', '', 'Franklin? Lee')
'Franklin? '

These new forms can be optimized internally, as a search for multiple 
candidate substrings can be more efficient than searching for one at a 
time.


This is what re does with 's1|s2|...|sn' patterns.


https://stackoverflow.com/questions/3260962/algorithm-to-find-multiple-string-matches

The most significant change is on .replace. The others are simple enough 
to simulate with a loop or something.


No loops needed.

It is harder to make multiple 
simultaneous replacements using one .replace at a time, because previous 
replacements can form new things that look like replaceables.


This problem exists for single string replacement also.  The standard 
solution is to not backtrack and not do overlapping replacements.
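
A small illustration of the cascading problem, and of how a one-pass
re.sub with a lookup function avoids it (the replacement table here is
made up):

```python
import re

s = 'ab'

# Chained str.replace cascades: the first replacement creates new
# matches for the second.
chained = s.replace('a', 'b').replace('b', 'c')
print(chained)        # cc

# A single one-pass re.sub replaces simultaneously: each position is
# examined once, and replaced text is never re-matched.
table = {'a': 'b', 'b': 'c'}
simultaneous = re.sub('a|b', lambda m: table[m.group(0)], s)
print(simultaneous)   # bc
```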



The easiest Python solution is to use regex


My claim above is that this is sufficient for all but one case, which 
should be a new function anyway.


or install some package, which 
uses (if you're lucky) regex or (if unlucky) doesn't simulate 
simultaneous replacements. (If possible, just use str.translate.)


I suppose .split on multiple separators is also annoying to simulate. 
The two-argument form of .split may be even more of a burden, though I 
don't know when a limited multiple-separator split is useful. The 
current best solution is, like before, to use regex, or install a 
package and hope for the best.


--
Terry Jan Reedy

