Re: [Python-ideas] Consider generalizing Decimal to support arbitrary radix

2018-02-08 Thread Greg Ewing

Mark Dickinson wrote:


And base-16 floating-point is still used in current IBM hardware, but I
don't know whether that's purely for historical/backwards-compatibility
reasons, or because it's faster for the FPU.


Historically, base 16 was used to get a bigger exponent
range for a given number of exponent bits. That was a
bigger deal back when memory was very expensive. I doubt
there's any advantage in it now.
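A rough worked example of that tradeoff, assuming a 7-bit exponent
field as in IBM S/360 hex floats (exponents up to about 63):

print(16.0 ** 63)   # ~7.2e75: largest magnitude reachable with base 16
print(2.0 ** 63)    # ~9.2e18: same exponent bits, base 2

Hex buys roughly four times the decimal exponent range, at the cost of
up to three wasted leading bits of precision per significand.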

--
Greg

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Consider generalizing Decimal to support arbitrary radix

2018-02-08 Thread Mark Dickinson
On Wed, Feb 7, 2018 at 10:50 PM, Steven D'Aprano wrote:

> Why?
>
> There are clear advantages to floating point arithmetic done in base 2
> (speed, minimum possible rounding error, least amount of wobble), and a
> different advantage to floating point done in base 10 (matches exactly
> the standard decimal notation used by humans with no conversion error).
> Outside of those two bases, arithmetic done in any other base is going
> to combine the worst of both:
>

Well, there are a couple of other bases that are potentially interesting.
There are some compelling mathematical advantages to using ternary
(or even better, balanced ternary). Knuth calls balanced ternary "Perhaps
the prettiest number system of all" in TAOCP, and ternary is in some sense
the most efficient base to use; the article "Third Base" by Brian Hayes
(American Scientist, 2001) gives a nice overview. It's completely inappropriate
for binary hardware, of course.
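For anyone curious, balanced ternary uses the digit set {-1, 0, 1}. A
minimal illustrative sketch (mine, not from any proposed implementation):

def to_balanced_ternary(n):
    """Digits of n in balanced ternary, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n:
        n, r = divmod(n, 3)
        if r == 2:      # fold digit 2 into -1 and carry 1
            r = -1
            n += 1
        digits.append(r)
    return digits[::-1]

# to_balanced_ternary(8) == [1, 0, -1], i.e. 9 - 1. Negating a number
# just flips every digit, which is part of the system's appeal.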

And base-16 floating-point is still used in current IBM hardware, but I
don't know whether that's purely for historical/backwards-compatibility
reasons, or because it's faster for the FPU.

As to making decimal support arbitrary radix, though: I don't see that
happening any time soon. The amount of work would be enormous,
for questionable gain.

-- 
Mark
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Franklin? Lee
On Feb 8, 2018 13:06, "Serhiy Storchaka"  wrote:

08.02.18 12:45, Franklin? Lee wrote:

> Could it be that re uses an optimization that can also be used in str?
> CPython uses a modified Boyer-Moore for str.find:
> https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h
> http://effbot.org/zone/stringlib.htm
> Maybe there's a minimum length after which it's better to precompute a
> table.
>

Yes, there is a special optimization in re here. It isn't free: you need to
spend some time preparing it. You need a special object that keeps an
optimized representation for faster searching. This makes it very unlikely to
be used in str, because you would need either to spend the time on compilation
for every search, or to use some kind of caching, which is not free either,
adds complexity, and increases memory consumption. Note also that in the case
of re the compiler is implemented in Python, which reduces the complexity.


The performance of the one-needle case isn't really relevant, though, is
it? This idea is for the multi-needle case, and my tests showed that re
performs even worse than a loop of `.find`s. How do re and .find scale with
both number and lengths of needles on your machine?
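For concreteness, a minimal harness along those lines (the findsub
helper here is illustrative, not the exact one from my earlier tests):

import re
import timeit

def findsub(haystack, needles):
    # leftmost hit across a loop of str.find; -1 if nothing matches
    hits = [haystack.find(n) for n in needles]
    return min((h for h in hits if h != -1), default=-1)

needles = ["spam%d" % i for i in range(100)]
haystack = "spa" * 10000 + needles[-1]
pattern = re.compile("|".join(map(re.escape, needles)))

print(timeit.timeit(lambda: findsub(haystack, needles), number=100))
print(timeit.timeit(lambda: pattern.search(haystack), number=100))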
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Franklin? Lee
On Thu, Feb 8, 2018 at 5:45 AM, Franklin? Lee wrote:
> On Feb 7, 2018 17:28, "Serhiy Storchaka"  wrote:
> > The name for complicated str methods is regular expressions. To do these
> > operations efficiently, you need to convert the arguments into a special
> > optimized form. This is what re.compile() does. If a compilation were done
> > on every invocation of a str method, it would add too large an overhead
> > and kill performance.
> >
> > Even for a simple string search, a regular expression can be more efficient
> > than a str method.
> >
> > $ ./python -m timeit -s 'import re; p = re.compile("spam"); s =
> > "spa"*100+"m"' -- 'p.search(s)'
> > 500000 loops, best of 5: 680 nsec per loop
> >
> > $ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
> > 200000 loops, best of 5: 1.09 usec per loop

I ran Serhiy's tests (3.5.2) and got different results.

# Setup:
__builtins__.basestring = str  # hack for re2 import in py3
import re, re2, regex

n = 10000
s = "spa"*n+"m"
p = re.compile("spam")
pgex = regex.compile("spam")
p2 = re2.compile("spam")

# Tests:
%timeit s.find("spam")
%timeit p.search(s)
%timeit pgex.search(s)
%timeit p2.search(s)

n = 100
350 ns ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
554 ns ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
633 ns ± 8.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.62 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

n = 1000
2.17 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3.57 µs ± 27.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.46 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
7.8 µs ± 72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

n = 10000
17.3 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
33.5 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
31.7 µs ± 396 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.5 µs ± 400 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Conclusions:
- `.find` is fastest. On 3.6.1 (Windows), it's about the same speed as
re: 638 ns vs 662 ns; 41.3 µs vs 43.8 µs.
- re and regex have similar performance, probably due to a similar backend.
- re2 is slowest. I suspect it's due to the wrapper. It may be copying
the strings to a format suitable for the backend.

P.S.: I also tested `"spam" in s`, which was linearly slower than
`.find`. However, `in` is consistently faster than `.find` in my 3.6,
so the discrepancy has likely been fixed.

More curious is that, on `.find`, my MSVC-compiled 3.6.1 and 3.5.2 are
twice as slow as my 3.5.2 for Ubuntu for Windows, but the re
performance is similar. It's probably a compiler thing.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Serhiy Storchaka

08.02.18 12:45, Franklin? Lee wrote:
On Feb 7, 2018 17:28, "Serhiy Storchaka" wrote:

Even for a simple string search, a regular expression can be more
efficient than a str method.

$ ./python -m timeit -s 'import re; p = re.compile("spam"); s =
"spa"*100+"m"' -- 'p.search(s)'
500000 loops, best of 5: 680 nsec per loop

$ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
200000 loops, best of 5: 1.09 usec per loop


That's an odd result. Python regexes use backtracking, not a DFA. I gave 
a timing test earlier in the thread:

https://mail.python.org/pipermail/python-ideas/2018-February/048879.html
I compared using repeated .find()s against a precompiled regex, then 
against a pure-Python, unoptimized tree-based algorithm.


Could it be that re uses an optimization that can also be used in str? 
CPython uses a modified Boyer-Moore for str.find:

https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h
http://effbot.org/zone/stringlib.htm
Maybe there's a minimum length after which it's better to precompute a 
table.


Yes, there is a special optimization in re here. It isn't free: you need 
to spend some time preparing it. You need a special object that keeps an 
optimized representation for faster searching. This makes it very unlikely 
to be used in str, because you would need either to spend the time on 
compilation for every search, or to use some kind of caching, which is not 
free either, adds complexity, and increases memory consumption. Note also 
that in the case of re the compiler is implemented in Python, which reduces 
the complexity.
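
For reference, re itself takes the caching route at module level. A
much-simplified sketch of the idea (not re's actual code):

import functools
import re

@functools.lru_cache(maxsize=512)
def _compiled(pattern):
    return re.compile(pattern)

def search(pattern, s):
    # compilation cost is paid once per distinct pattern, at the price
    # of keeping compiled objects alive in the cache
    return _compiled(pattern).search(s)

str methods have no comparable place to hang a cache without the costs
described above.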


Patches that add optimizations for other common cases are welcome.

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread INADA Naoki
I think these shouldn't be str methods.

They should be a separate class, to reuse the internal tree.

There are some Aho-Corasick implementations on PyPI.
As far as I know, AC does longest match (see the sketch below).

On the other hand, Go's replacer (it's trie-based too) is:

> Replacements are performed in order, without overlapping matches.
https://golang.org/pkg/strings/#NewReplacer
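
A compact pure-Python sketch of the Aho-Corasick idea (all names mine;
real PyPI implementations are more elaborate and much faster):

from collections import deque

def ac_build(needles):
    # trie as parallel lists: child maps, failure links, matched words
    goto, fail, out = [{}], [0], [[]]
    for word in needles:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(word)
    q = deque(goto[0].values())   # depth-1 states keep fail = root
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] += out[fail[t]]   # inherit matches ending here
    return goto, fail, out

def ac_search(haystack, goto, fail, out):
    # one pass over the haystack, regardless of the number of needles
    s = 0
    for i, ch in enumerate(haystack):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for word in out[s]:
            yield i - len(word) + 1, word

# list(ac_search("ushers", *ac_build(["he", "she", "hers"])))
# -> [(1, 'she'), (2, 'he'), (2, 'hers')]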


On Sun, Feb 4, 2018 at 7:04 AM, Franklin? Lee wrote:
> Let s be a str. I propose to allow these existing str methods to take params
> in new forms.
>
> s.replace(old, new):
> Allow passing in a collection of olds.
> Allow passing in a single argument, a mapping of olds to news.
> Allow the olds in the mapping to be tuples of strings.
>
> s.split(sep), s.rsplit, s.partition:
> Allow sep to be a collection of separators.
>
> s.startswith, s.endswith:
> Allow argument to be a collection of strings.
>
> s.find, s.index, s.count, x in s:
> Similar.
> These methods are also in `list`, which can't distinguish between items,
> subsequences, and subsets. However, `str` is already inconsistent with
> `list` here: list.M looks for an item, while str.M looks for a subsequence.
>
> s.[r|l]strip:
> Sadly, these functions already interpret their str arguments as
> collections of characters.
>
> These new forms can be optimized internally, as a search for multiple
> candidate substrings can be more efficient than searching for one at a time.
> See
> https://stackoverflow.com/questions/3260962/algorithm-to-find-multiple-string-matches
>
> The most significant change is on .replace. The others are simple enough to
> simulate with a loop or something. It is harder to make multiple
> simultaneous replacements using one .replace at a time, because previous
> replacements can form new things that look like replaceables. The easiest
> Python solution is to use regex or install some package, which uses (if
> you're lucky) regex or (if unlucky) doesn't simulate simultaneous
> replacements. (If possible, just use str.translate.)
>
> I suppose .split on multiple separators is also annoying to simulate. The
> two-argument form of .split may be even more of a burden, though I don't
> know when a limited multiple-separator split is useful. The current best
> solution is, like before, to use regex, or install a package and hope for
> the best.
>
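For concreteness, the mapping form of s.replace() quoted above can be
approximated today with re's alternation. A rough sketch (the function
name is mine; keys are sorted longest-first so that at the same position
a longer old string wins):

import re

def multireplace(s, mapping):
    # one simultaneous pass: earlier replacements are never re-scanned
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, keys)))
    return pattern.sub(lambda m: mapping[m.group(0)], s)

# multireplace("spam and eggs", {"spam": "eggs", "eggs": "spam"})
# -> 'eggs and spam'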



-- 
INADA Naoki  
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Franklin? Lee
On Thu, Feb 8, 2018 at 11:24 AM, Steve Dower  wrote:
> Easily fixed by installing one of the alternate regex libraries.

MRAB's regex library, the most prominent alternative, does not use a
linear-time search algorithm. The only libraries I know of that do are
the re2-backed ones, though I haven't looked deeply.

Let it be known that I tried to install pyre2 (re2 on PyPI) on Ubuntu
for Windows for the tests, and after hours of no success, I decided
that the alternative libraries were not the point. I eventually got it
working (https://github.com/axiak/pyre2/issues/51), and here are the
results:

# Same setup as before, with findsub_re2 using `find=re2.search,
esc=re2.escape`.

pattern2 = re2.compile('|'.join(map(re.escape, needles)))

%timeit findsub(haystack, needles)
#=> 1.26 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit findsub_re(haystack, needles)
#=> 745 ms ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit findsub_re_cached(haystack, pattern)
#=> 733 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit findsub_regex(haystack, needles)
#=> 639 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit findsub_re2(haystack, needles)
#=> 34.1 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit findsub_re_cached(haystack, pattern2)
#=> 9.56 µs ± 268 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In any case, installing re2 is even more of an advanced solution for a
basic problem than using re.

> re performance and its edge cases have been discussed endlessly. Please look
> it up before restarting that discussion.

I'm not trying to restart the discussion. I'm trying to say that the
assumptions being made about its superior performance are unfounded.
Two members have suggested that re is the performant option, and
that's just not true.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Sven R. Kunze

Same here.

On 07.02.2018 22:57, Mike Miller wrote:

+1

I have the need for one or two of these in every project (of a certain 
size) and have to come up with solutions each time with the re module 
or a loop.


Not a fan of regexes for trivial tasks, or those that require a real 
parser.


On 2018-02-03 14:04, Franklin? Lee wrote:



___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Complicate str methods

2018-02-08 Thread Franklin? Lee
On Feb 7, 2018 17:28, "Serhiy Storchaka"  wrote:

04.02.18 00:04, Franklin? Lee wrote:

> Let s be a str. I propose to allow these existing str methods to take
> params in new forms.
>
> s.replace(old, new):
>  Allow passing in a collection of olds.
>  Allow passing in a single argument, a mapping of olds to news.
>  Allow the olds in the mapping to be tuples of strings.
>
> s.split(sep), s.rsplit, s.partition:
>  Allow sep to be a collection of separators.
>
> s.startswith, s.endswith:
>  Allow argument to be a collection of strings.
>
> s.find, s.index, s.count, x in s:
>  Similar.
>  These methods are also in `list`, which can't distinguish between
> items, subsequences, and subsets. However, `str` is already inconsistent
> with `list` here: list.M looks for an item, while str.M looks for a
> subsequence.
>
> s.[r|l]strip:
>  Sadly, these functions already interpret their str arguments as
> collections of characters.
>

The name for complicated str methods is regular expressions. To do these
operations efficiently, you need to convert the arguments into a special
optimized form. This is what re.compile() does. If a compilation were done
on every invocation of a str method, it would add too large an overhead and
kill performance.

Even for a simple string search, a regular expression can be more efficient
than a str method.

$ ./python -m timeit -s 'import re; p = re.compile("spam"); s =
"spa"*100+"m"' -- 'p.search(s)'
500000 loops, best of 5: 680 nsec per loop

$ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
200000 loops, best of 5: 1.09 usec per loop


That's an odd result. Python regexes use backtracking, not a DFA. I gave a
timing test earlier in the thread:
https://mail.python.org/pipermail/python-ideas/2018-February/048879.html
I compared using repeated .find()s against a precompiled regex, then
against a pure-Python, unoptimized tree-based algorithm.

Could it be that re uses an optimization that can also be used in str?
CPython uses a modified Boyer-Moore for str.find:
https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h
http://effbot.org/zone/stringlib.htm
Maybe there's a minimum length after which it's better to precompute a
table.

In any case, once you have branches in the regex, which are necessary to
emulate these features, it will start to slow down, because in the worst
case it has to try every branch.
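
One way to see that effect on your own machine, in the same style as
Serhiy's measurements (no timings quoted here, since they will vary):

$ ./python -m timeit -s 'import re; p = re.compile("spam"); s = "spa"*100+"m"' -- 'p.search(s)'
$ ./python -m timeit -s 'import re; p = re.compile("spam|spbm|spcm"); s = "spa"*100+"m"' -- 'p.search(s)'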
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/