[issue46868] Improve performance of math.prod with bignums (and functools.reduce?)

2022-02-28 Thread Tim Peters


Tim Peters  added the comment:

> the total number of trailing 1 bits in the integers from 1
> through N inclusive is N - N.bit_count()

Sorry, that's the total number of trailing 0 bits. The total number of trailing 
1 bits is (N+1) - (N+1).bit_count().

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46868] Improve performance of math.prod with bignums (and functools.reduce?)

2022-02-28 Thread Tim Peters


Tim Peters  added the comment:

About runtime, you're right. I did a ballpark "OK, if there are N incoming 
values, the inner loop has to go around, for each one, looking for a NULL, 
across a vector of at most log2(N) entries. So N * log2(N)". But, in fact, it's 
highly skewed toward getting out early, and 2*N is an upper bound on the total 
number of inner loop iterations. Strongly related: the total number of trailing
1 bits in the integers from 1 through N inclusive is N - N.bit_count().
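The 2*N bound can be checked empirically with an instrumented version of the binary-counter inner loop (a sketch; the counter just tallies how many times the inner loop body runs):

```python
def inner_iterations(n):
    # run the binary-counter reduction over n dummy items, counting
    # every execution of the inner loop body
    NULL = object()
    levels, total = [], 0
    for y in range(n):
        for i, x in enumerate(levels):
            total += 1
            if x is NULL:
                levels[i] = y
                break
            y = x + y          # stand-in for any associative function
            levels[i] = NULL
        else:
            levels.append(y)
    return total

for n in (1, 7, 100, 10_000):
    assert inner_iterations(n) <= 2 * n
```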

For the rest, I'll only repeat that if this goes in, it should be as a new 
function. Special-casing, e.g., math.prod() is a Bad Idea. We can have no idea 
in advance whether the iterable is type-homogeneous, or whether the __mul__
methods of the types involved are even intended to be associative.

functools.reduce() clearly documents strict "left to right" evaluation.

But a new treereduce() can do anything it feels like promising.
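The documented left-to-right behavior is easy to see with a string-building function:

```python
from functools import reduce

# functools.reduce always associates strictly left to right
assert reduce(lambda x, y: f"({x}+{y})", "abcd") == "(((a+b)+c)+d)"
```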

--




[issue46868] Improve performance of math.prod with bignums (and functools.reduce?)

2022-02-28 Thread benrg


benrg  added the comment:

>That memory frugality adds a log2 factor to the runtime.

Your iterative algorithm is exactly the one I had in mind, but it doesn't have 
the run time that you seem to think. Is that the whole reason for our 
disagreement?

It does only O(1) extra work (not even amortized O(1), really O(1)) for each 
call of the binary function, and there are exactly n-1 calls. There's a log(n) 
term (not factor) for expanding the array and skipping NULLs in the final 
cleanup. The constant factor for it is tiny since the array is so small.

I implemented it in C and benchmarked it against reduce with unvarying 
arguments (binary | on identical ints), and it's slightly slower around 75% of 
the time, and slightly faster around 25% of the time, seemingly at random, even 
in the same test, which I suppose is related to where the allocator decides to 
put the temporaries. The reordering only needs to have a sliver of a benefit 
for it to come out on top.

When I said "at the cost of a log factor" in the first message, I meant 
relative to algorithms like ''.join, not left-reduce.


>I suspect the title of this report referenced "math.prod with bignums" because 
>it's the only actual concrete use case you had ;-)

I led with math.prod because its evaluation order isn't documented, so it can 
be changed (and I guess I should have said explicitly that there is no up-front 
penalty to changing it beyond tricky cache locality issues). I said "bignums" 
because I had in mind third-party libraries and the custom classes that I 
mentioned in my last message. I put ? after reduce because its 
left-associativity is documented and useful (e.g. with nonassociative 
functions), so it would have to be extended or a new function added, which is 
always a hard sell. I also wanted the title to be short. I did the best I could.

--




[issue46868] Improve performance of math.prod with bignums (and functools.reduce?)

2022-02-28 Thread Tim Peters


Tim Peters  added the comment:

Too abstract for me to find compelling. I suspect the title of this report 
referenced "math.prod with bignums" because it's the only actual concrete use 
case you had ;-)

Here's another: math.lcm. That can benefit for the same reason as math.prod - 
provoking Karatsuba multiplication. However, applying lcm to a largish 
collection of ints is so rare I can't recall ever seeing it done.

Here's a related anti-example: math.gcd. Tree reduction hurts that. It 
typically falls to 1 quickly, and tree reduction just delays that.
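For example (numbers mine, just to illustrate): once a left-to-right gcd reduction hits 1, every remaining step is the trivial gcd(1, x).

```python
import math
from functools import reduce

xs = [12, 18, 35, 10**6, 77, 1001]
# the running gcd collapses to 1 by the third element ...
assert math.gcd(math.gcd(12, 18), 35) == 1
# ... so the rest of the left fold is cheap gcd(1, x) calls
assert reduce(math.gcd, xs) == 1
```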

So I'm at best -0 on this, and I'll stop now.

For reference, here's a Python implementation that strives to match 
functools.reduce's signature and endcase behaviors. It accepts any iterable, 
and requires temp space at most about log2(number of elements the iterable 
delivers in all).

That memory frugality adds a log2 factor to the runtime. The O() speed penalty 
could be eliminated by using temp memory that grows to about half the number of 
elements in the iterable.

def treereduce(function, iterable, initializer=None):
    levels = []
    if initializer is not None:
        levels.append(initializer)
    NULL = object()
    for y in iterable:
        for i, x in enumerate(levels):
            if x is NULL:
                levels[i] = y
                break
            y = function(x, y)
            levels[i] = NULL
        else:
            levels.append(y)
    y = NULL
    for x in levels:
        if x is not NULL:
            y = x if y is NULL else function(x, y)
    if y is NULL:
        raise TypeError("treereduce() of empty iterable with no initial value")
    return y

Then, for example,

>>> treereduce(lambda x, y: f"({x}+{y})", "abcdefghijk")
'((((a+b)+(c+d))+((e+f)+(g+h)))+((i+j)+k))'

--




[issue46868] Improve performance of math.prod with bignums (and functools.reduce?)

2022-02-28 Thread benrg


benrg  added the comment:

Anything that produces output of O(m+n) size in O(m+n) time. Ordered merging 
operations. Mergesort is a binary ordered merge with log-depth tree reduction, 
and insertion sort is the same binary operation with linear-depth tree 
reduction. Say you're merging sorted lists of intervals, and overlapping 
intervals need special treatment. It's easier to write a manifestly correct 
binary merge than an n-way merge, or a filter after heapq.merge that needs to 
deal with complex interval clusters. I've written that sort of code.

Any situation that resembles a fast path but doesn't qualify for the fast path. 
For example, there's an optimized factorial function in math, but you need 
double factorial. Or math.prod is optimized for ints as you suggested, but you 
have a class that uses ints internally but doesn't pass the CheckExact test. 
Usually when you miss out on a fast path, you just take a (sometimes large) 
constant-factor penalty, but here it pushes you into a different complexity 
class. Or you have a class that uses floats internally and wants to limit 
accumulated roundoff errors, but the structure of the computation doesn't fit 
fsum.
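As a concrete illustration of the ordered-merge case (my own sketch, using a simple pairwise-pass tree reduction rather than the constant-space variant discussed in this thread): mergesort is just a binary merge combined by balanced reduction over singleton lists.

```python
def merge(a, b):
    # standard two-way ordered merge of sorted lists
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:
            out.append(b[j]); j += 1
        else:
            out.append(a[i]); i += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def tree_reduce(f, items):
    # balanced (log-depth) reduction by repeated pairwise passes
    items = list(items)
    while len(items) > 1:
        items = [f(items[k], items[k + 1]) if k + 1 < len(items) else items[k]
                 for k in range(0, len(items), 2)]
    return items[0]

data = [5, 3, 8, 1, 9, 2, 7]
assert tree_reduce(merge, ([x] for x in data)) == sorted(data)
```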

>Tree reduction is very popular in the parallel processing world, for obvious 
>reasons.

It's the same reason in every case: the log depth limits the accumulation of 
some bad thing. In parallel computing it's critical-path length, in factorial 
and mergesort it's size, in fsum it's roundoff error. Log depth helps in a 
range of situations.

>I've searched in vain for other languages that try to address this "in general"

You've got me there.

>As Guido will tell you, the only original idea in Python is adding an "else" 
>clause to loops ;-)

I don't think that's really true, except in the sense that there's nothing new 
under the sun. No one would use Python if it was just like other languages 
except slower and with for-else.

--




[issue46868] Improve performance of math.prod with bignums (and functools.reduce?)

2022-02-27 Thread Tim Peters


Tim Peters  added the comment:

Hi. "It's pretty good for a lot of things" is precisely what I'm questioning. 
Name some, and please be specific ;-)

Tree reduction is very popular in the parallel processing world, for obvious 
reasons. But we're talking about a single thread here, and there aren't all 
that many associative binary operators (or, depending on implementation, they 
may need also to be commutative).

I gave floating * as an example: tree reduction buys nothing for accuracy, and 
would be both slower and consume more memory. Other associative operators for 
which all the same are true include bitwise &, bitwise |, bitwise ^, min, and 
max.

Floating + can benefit for accuracy, but we already have math.fsum() for that, 
which delivers the most accurate possible result.
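A quick illustration of the accuracy difference:

```python
import math

xs = [0.1] * 10
assert math.fsum(xs) == 1.0   # correctly rounded result
assert sum(xs) != 1.0         # left-to-right fold accumulates rounding error
```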

The only unaddressed example we have here so far that could actually benefit is 
bigint *. If that's the real use case, fine, but there are better ways to 
address that case.

I've searched in vain for other languages that try to address this "in 
general", except in the context of parallelization. As Guido will tell you, the 
only original idea in Python is adding an "else" clause to loops ;-)

In any case, I don't believe this belongs hiding in reduce(). As above, it's a 
net loss for most binary operators. Too close to "attractive nuisance" 
territory. Adding a new, e.g., treereduce() would be better - provided that it 
is in fact "pretty good for a lot of things".

--




[issue46868] Improve performance of math.prod with bignums (and functools.reduce?)

2022-02-27 Thread benrg


benrg  added the comment:

My example used ints, but I was being deliberately vague when I said "bignums". 
Balanced-tree reduction certainly isn't optimal for ints, and may not be 
optimal for anything, but it's pretty good for a lot of things. It's the 
comparison-based sorting of reduction algorithms.

* When the inputs are of similar sizes, it tends to produce intermediate 
operands of similar sizes, which helps with Karatsuba multiplication (as you 
said).

* When the inputs are of different sizes, it limits the "damage" any one of 
them can do, since they only participate in log2(n) operations each.

* It doesn't look at the values, so it works with third-party types that are 
unknown to the stdlib.

There's always a fallback case, and balanced reduction is good for that. If 
there's a faster path for ints that looks at their bit lengths, great.

--




[issue46868] Improve performance of math.prod with bignums (and functools.reduce?)

2022-02-26 Thread Tim Peters


Tim Peters  added the comment:

I don't know that there's a good use case for this. For floating addition, 
tree-based summation can greatly reduce total roundoff error, but that's not 
true of floating multiplication.

The advantage for prod(range(2, 50001)) doesn't really stem from turning it 
into a tree reduction, but rather from the fact that, for that specific input 
sequence, "the tree" happens to do a decent job of keeping multiplicands 
more-or-less balanced in size (bit length), which eventually allows Karatsuba 
multiplication to come into play. If CPython didn't implement Karatsuba 
multiplication, tree reduction wouldn't improve speed: if the inputs are in 
sequence xs, then regardless of how school-book multiplication is grouped or 
rearranged, the time needed is proportional to

sum(a * b for a, b in combinations([x.bit_length() for x in xs], 2))
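That grouping-invariance is easy to verify with a toy cost model (a sketch of my own: multiplying a p-bit by a q-bit number costs p*q and yields roughly a (p+q)-bit result):

```python
from itertools import combinations

def schoolbook_cost_left(bits):
    # cost of a strict left-to-right fold under the toy model
    bits = list(bits)
    acc, total = bits[0], 0
    for b in bits[1:]:
        total += acc * b
        acc += b
    return total

def schoolbook_cost_tree(bits):
    # cost of a balanced pairwise-pass reduction under the same model
    bits = list(bits)
    total = 0
    while len(bits) > 1:
        nxt = []
        for k in range(0, len(bits) - 1, 2):
            total += bits[k] * bits[k + 1]
            nxt.append(bits[k] + bits[k + 1])
        if len(bits) % 2:
            nxt.append(bits[-1])
        bits = nxt
    return total

bits = [5, 17, 3, 64, 9, 2, 31]
expected = sum(a * b for a, b in combinations(bits, 2))
assert schoolbook_cost_left(bits) == expected
assert schoolbook_cost_tree(bits) == expected
```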

So if the real point is to speed large products of integers, a different 
approach is wanted: keep track of intermediate products' bit lengths, and at 
each step strive to multiply partial products near the same length. This will 
reliably get Karatsuba into play if possible, and caters to the fact that 
Karatsuba is most valuable on multiplicands of the same bit length. When any of 
that happens from blind tree reduction, it's down to luck.

I've seen decent results from doing that with a fixed, small array A, which 
(very roughly speaking) combines "the next" integer `i` to multiply with the 
array entry A[i.bit_length().bit_length()] (and continues that with the new 
partial product, & so on, until hitting an array slot containing 1).
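A rough Python reconstruction of that scheme (names mine; the real thing would live in C): partial products are bucketed by the bit length of their bit length, and a new value is combined with whatever already occupies its bucket, carrying upward until an empty slot is found.

```python
import math

def prod_balanced(xs):
    # A[k] holds a partial product whose bit_length() has bit length k;
    # slots are initialized to the multiplicative identity 1
    A = [1] * 64
    for x in xs:
        while True:
            k = x.bit_length().bit_length()
            if A[k] == 1:
                A[k] = x
                break
            x, A[k] = x * A[k], 1   # combine with a similarly-sized partial
    result = 1
    for p in A:
        result *= p
    return result

assert prod_balanced(range(2, 2001)) == math.factorial(2000)
```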

--




[issue46868] Improve performance of math.prod with bignums (and functools.reduce?)

2022-02-26 Thread Ned Deily


Change by Ned Deily :


--
nosy: +mark.dickinson, rhettinger, tim.peters




[issue46868] Improve performance of math.prod with bignums (and functools.reduce?)

2022-02-26 Thread benrg


New submission from benrg :

math.prod is slow at multiplying arbitrary-precision numbers. E.g., compare the 
run time of factorial(50000) to prod(range(2, 50001)).

factorial has some special-case optimizations, but the bulk of the difference 
is due to prod evaluating an expression tree of depth n. If you re-parenthesize 
the product so that the tree has depth log n, as factorial does, it's much 
faster. The evaluation order of prod isn't documented, so I think the change 
would be safe.

factorial uses recursion to build the tree, but it can be done iteratively with 
no advance knowledge of the total number of nodes.
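One way to do that iteratively (a sketch of the idea, not CPython's actual factorial code): keep a stack of (partial product, leaf count) pairs and merge equal-sized partials, like carrying in a binary counter.

```python
import math

def prod_tree(iterable):
    stack = []  # (partial product, number of leaves it covers)
    for x in iterable:
        n = 1
        # merge equal-weight partials, like carrying in a binary counter
        while stack and stack[-1][1] == n:
            v, m = stack.pop()
            x *= v
            n += m
        stack.append((x, n))
    result = 1
    for v, _ in stack:
        result *= v
    return result

assert prod_tree(range(2, 101)) == math.factorial(100)
```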

This trick is widely useful for turning a way of combining two things into a 
way of combining many things, so I wouldn't mind seeing a generic version of it 
in the standard library, e.g. reduce(..., order='mid'). For many specific cases 
there are more efficient alternatives (''.join, itertools.chain, set.union, 
heapq.merge), but it's nice to have a recipe that saves you the trouble of 
writing special-case algorithms at the cost of a log factor that's often 
ignorable.

--
components: Library (Lib)
messages: 414126
nosy: benrg
priority: normal
severity: normal
status: open
title: Improve performance of math.prod with bignums (and functools.reduce?)
type: enhancement
versions: Python 3.11
