Re: [Numpy-discussion] Optimizing numpy's einsum expression (again)

2015-01-16 Thread Eelco Hoogendoorn
Thanks for taking the time to think about this; good work.

Personally, I don't think a factor-of-5 memory overhead is much to sweat over.
The most complex einsum I have ever needed in a production environment had
5 or 6 terms, and for what this anecdote is worth, speed was a far
bigger concern to me than memory.

On Fri, Jan 16, 2015 at 12:30 AM, Daniel Smith dgasm...@icloud.com wrote:

 Hello everyone,
 I originally brought an optimized einsum routine forward a few weeks back
 that attempts to contract numpy arrays together in an optimal way. This can
 greatly reduce the scaling and overall cost of the einsum expression at the
 cost of a few intermediate arrays. The current version (and more
 details) can be found here:
 https://github.com/dgasmith/opt_einsum

 I think this routine is close to a finalized version, but there are two
 problems which I would like the community to weigh in on:

 Memory usage-
 Currently the routine only considers the maximum size of the intermediates
 created rather than cumulative memory usage. If we only use np.einsum it is
 straightforward to calculate cumulative memory usage, as einsum does not
 require any copies of the inputs to be made; however, if we attempt to use
 a vendor BLAS, the memory usage can become quite complex. For example, the
 np.tensordot routine always forces a copy for ndarrays because it uses
 ndarray.transpose(…).reshape(…). A more memory-conscious way to do this is
 to first try ndarray.reshape(…).T, which does not copy the data, so that
 numpy can just pass a transpose flag to the vendor BLAS. The caveat here is
 that the summed indices must be in the correct order; if not, a copy is
 required. The maximum array size is usually close to the total overhead of
 the opt_einsum function, but can occasionally be 2-5 times this size. I see
 the following ways forward:

 - Ignore cumulative memory and stick with maximum array size as the
   limiting memory factor.
 - Implement logic to figure out whether the input arrays need to be copied
   to use BLAS, compute the extra memory required, and add an extra dimension
   to the optimization algorithms (use BLAS or do not use BLAS at each step).
   Some of this is already done, but it may get fairly complex.
 - Build an in-place nd-transpose algorithm.
 - Cut out BLAS entirely, keeping in mind that a vendor BLAS can be orders of
   magnitude faster than a pure einsum implementation, especially if BLAS
   threading is used.


 Path algorithms-
 There are two algorithms: “optimal” (a brute-force algorithm that scales
 like N!) and “opportunistic” (a greedy algorithm that scales like N^3). The
 optimal path can take seconds to minutes to calculate for a 7-10 term
 expression, while the opportunistic path takes microseconds even for 20+
 term expressions. The opportunistic algorithm works surprisingly well and
 appears to obtain the correct scaling in all test cases that I can think
 of. Both algorithms use the maximum array size as a sieve, which is
 beneficial in several respects. The problem occurs when a much-needed
 intermediate cannot be created due to this limit; on occasion, not making
 this intermediate can cause slowdowns of orders of magnitude even for
 small systems. This leads to certain (and sometimes unexpected)
 performance walls. Possible solutions:

 - Warn the user if the ratio between an unlimited-memory solution and a
   limited-memory solution becomes large.
 - Do not worry about it.


 Thank you for your time,
 -Daniel Smith



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Optimizing numpy's einsum expression (again)

2015-01-16 Thread Dave Hirschfeld
Daniel Smith dgasmith at icloud.com writes:

 
 Hello everyone, I originally brought an optimized einsum routine
 forward a few weeks back that attempts to contract numpy arrays together
 in an optimal way. This can greatly reduce the scaling and overall cost
 of the einsum expression at the cost of a few intermediate arrays. The
 current version (and more details) can be found here:
 https://github.com/dgasmith/opt_einsum
 
 I think this routine is close to a finalized version, but there are 
two problems which I would like the community to weigh in on:
 
 Thank you for your time,
 
 
 -Daniel Smith
 

I wasn't aware of this work, but it's very interesting to me as a user
of einsum whose principal reason for using it is speed. Even though I
use it on largish arrays, I'm only concerned with performance: I'm on
x64 with plenty of memory, even if it were to require temporaries 5x the
original size.

I don't use einsum that much because I've noticed the performance can be
very problem-dependent, so I've always profiled it to check. Hopefully
this work will make the performance more consistent, allowing it to be
used more generally throughout my code.

Thanks,
Dave


* An anecdotal example from a user only.


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Optimizing numpy's einsum expression (again)

2015-01-15 Thread Daniel Smith
Hello everyone,
I originally brought an optimized einsum routine forward a few weeks back that
attempts to contract numpy arrays together in an optimal way. This can greatly
reduce the scaling and overall cost of the einsum expression at the cost of a
few intermediate arrays. The current version (and more details) can be found
here:
https://github.com/dgasmith/opt_einsum
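
As a quick illustration of why a few intermediates pay off, here is a
minimal sketch in plain numpy (the array sizes are chosen only for
demonstration): contracting 'ij,jk,kl->il' in a single einsum call loops
over all four indices at once (roughly N^4 work), while two pairwise
products do the same job in roughly N^3 at the cost of one intermediate:

    import numpy as np

    N = 60
    A, B, C = (np.random.rand(N, N) for _ in range(3))

    # Single shot: einsum iterates over i, j, k, l together -> ~N^4 work.
    direct = np.einsum('ij,jk,kl->il', A, B, C)

    # Pairwise: two matrix products, ~N^3 work, one N x N intermediate.
    paired = np.einsum('ik,kl->il', np.einsum('ij,jk->ik', A, B), C)

    print(np.allclose(direct, paired))  # True: same result, better scaling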

I think this routine is close to a finalized version, but there are two 
problems which I would like the community to weigh in on:

Memory usage-
Currently the routine only considers the maximum size of the intermediates
created rather than cumulative memory usage. If we only use np.einsum it is
straightforward to calculate cumulative memory usage, as einsum does not
require any copies of the inputs to be made; however, if we attempt to use a
vendor BLAS, the memory usage can become quite complex. For example, the
np.tensordot routine always forces a copy for ndarrays because it uses
ndarray.transpose(…).reshape(…). A more memory-conscious way to do this is to
first try ndarray.reshape(…).T, which does not copy the data, so that numpy
can just pass a transpose flag to the vendor BLAS (a small sketch follows the
list below). The caveat here is that the summed indices must be in the correct
order; if not, a copy is required. The maximum array size is usually close to
the total overhead of the opt_einsum function, but can occasionally be 2-5
times this size. I see the following ways forward:
- Ignore cumulative memory and stick with maximum array size as the limiting
  memory factor.
- Implement logic to figure out whether the input arrays need to be copied to
  use BLAS, compute the extra memory required, and add an extra dimension to
  the optimization algorithms (use BLAS or do not use BLAS at each step). Some
  of this is already done, but it may get fairly complex.
- Build an in-place nd-transpose algorithm.
- Cut out BLAS entirely, keeping in mind that a vendor BLAS can be orders of
  magnitude faster than a pure einsum implementation, especially if BLAS
  threading is used.
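
To make the copy-versus-view distinction concrete, here is a minimal
numpy sketch (the shapes are illustrative only): permuting axes before
reshaping forces a copy, while reshaping first and then transposing
yields a view that a BLAS call could absorb as a transpose flag:

    import numpy as np

    a = np.arange(24.0).reshape(2, 3, 4)

    # tensordot-style: permute axes, then reshape into a matrix.  The
    # permuted view is no longer contiguous, so reshape must copy.
    b = a.transpose(2, 0, 1).reshape(4, 6)
    print(b.flags.owndata)   # True -> a fresh buffer was allocated

    # Memory-conscious order: reshape the contiguous array first, then
    # transpose.  Both steps are views, so no data move, and the BLAS
    # call can be handed a transpose flag instead.
    c = a.reshape(6, 4).T
    print(c.flags.owndata)   # False -> still a view onto a's buffer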

Path algorithms-
There are two algorithms: “optimal” (a brute-force algorithm that scales like
N!) and “opportunistic” (a greedy algorithm that scales like N^3; a small
sketch follows the list below). The optimal path can take seconds to minutes
to calculate for a 7-10 term expression, while the opportunistic path takes
microseconds even for 20+ term expressions. The opportunistic algorithm works
surprisingly well and appears to obtain the correct scaling in all test cases
that I can think of. Both algorithms use the maximum array size as a sieve,
which is beneficial in several respects. The problem occurs when a much-needed
intermediate cannot be created due to this limit; on occasion, not making this
intermediate can cause slowdowns of orders of magnitude even for small
systems. This leads to certain (and sometimes unexpected) performance walls.
Possible solutions:
- Warn the user if the ratio between an unlimited-memory solution and a
  limited-memory solution becomes large.
- Do not worry about it.
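
As a rough sketch of the greedy idea only (not the actual opt_einsum
implementation, and with hypothetical inputs), a toy path chooser could
contract, at every step, the pair of terms whose intermediate array is
smallest:

    import itertools
    import numpy as np

    def greedy_path(terms, output, dim_dict):
        """Toy greedy chooser: terms are index strings like 'ij',
        dim_dict maps each index label to its extent."""
        terms = [frozenset(t) for t in terms]
        output = frozenset(output)
        path = []
        while len(terms) > 1:
            best = None
            for i, j in itertools.combinations(range(len(terms)), 2):
                # Keep only indices still needed by other terms or by
                # the requested output.
                rest = output.union(*(terms[k] for k in range(len(terms))
                                      if k not in (i, j)))
                keep = (terms[i] | terms[j]) & rest
                size = int(np.prod([dim_dict[x] for x in keep]))
                if best is None or size < best[0]:
                    best = (size, i, j, keep)
            size, i, j, keep = best
            path.append((i, j))
            terms = [t for k, t in enumerate(terms) if k not in (i, j)]
            terms.append(keep)
        return path

    # A chained matrix product: contract A with B first, then with C.
    print(greedy_path(['ij', 'jk', 'kl'], 'il',
                      {'i': 100, 'j': 2, 'k': 2, 'l': 100}))
    # -> [(0, 1), (0, 1)]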

Thank you for your time,
-Daniel Smith

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion