[Numpy-discussion] np.diag(np.dot(A, B))

2015-05-22 Thread Mathieu Blondel
Hi,

I often need to compute the equivalent of

np.diag(np.dot(A, B)).

Computing np.dot(A, B) is highly inefficient if you only need the diagonal
entries. Two more efficient ways of computing the same thing are

np.sum(A * B.T, axis=1)

and

np.einsum('ij,ji->i', A, B).

The first can allocate quite a lot of temporary memory.
The second can be quite cryptic for someone not familiar with einsum.
I assume that einsum does not compute np.dot(A, B), but I haven't verified.
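For concreteness, here is a quick sketch checking that the three expressions agree (small random matrices, nothing here is a benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))

# Naive: materializes the full 4x4 product just to read its diagonal.
d_naive = np.diag(np.dot(A, B))

# Elementwise trick: allocates a temporary array the size of A.
d_sum = np.sum(A * B.T, axis=1)

# einsum: contracts straight to the diagonal, no full product formed.
d_einsum = np.einsum('ij,ji->i', A, B)

assert np.allclose(d_naive, d_sum)
assert np.allclose(d_naive, d_einsum)
```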

Since this is quite a recurrent pattern, I was wondering if it would be
worth adding a dedicated function to NumPy and SciPy's sparse module. A
possible name would be diagdot. The best performance would be obtained
when A is C-style and B Fortran-style.

Best,
Mathieu
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.diag(np.dot(A, B))

2015-05-22 Thread Mathieu Blondel
Right now I am using np.sum(A * B.T, axis=1) for dense data and I have
implemented a Cython routine for sparse data.
I haven't benched np.sum(A * B.T, axis=1) vs. np.einsum('ij,ji->i', A, B)
yet since I am mostly interested in the sparse case right now.
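For the sparse case, one analogue of the dense elementwise trick (this is just a sketch using scipy.sparse, not the Cython routine mentioned above) is:

```python
import numpy as np
import scipy.sparse as sp

A = sp.random(5, 4, density=0.5, format='csr', random_state=0)
B = sp.random(4, 5, density=0.5, format='csc', random_state=1)

# diag(A @ B) without forming the full sparse product:
# multiply() is elementwise, so this mirrors np.sum(A * B.T, axis=1).
d = np.asarray(A.multiply(B.T).sum(axis=1)).ravel()

assert np.allclose(d, (A @ B).toarray().diagonal())
```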

When A and B are C-style and Fortran-style, the optimal algorithm should be
computing the inner products along the diagonal using BLAS.
If not, I guess this will need some benchmarking.

Another use for this is to compute the row-wise squared L2 norms: np.diagdot(A,
A.T).
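That row-norm use case already works today with einsum (diagdot here is the hypothetical function proposed above, not an existing API):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Row-wise squared L2 norms: diag(A @ A.T) without the full Gram matrix.
sq_norms = np.einsum('ij,ij->i', A, A)

assert np.allclose(sq_norms, np.diag(A @ A.T))
assert np.allclose(np.sqrt(sq_norms), np.linalg.norm(A, axis=1))
```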

Mathieu

On Fri, May 22, 2015 at 5:58 PM, David Cournapeau courn...@gmail.com
wrote:



 On Fri, May 22, 2015 at 5:39 PM, Mathieu Blondel math...@mblondel.org
 wrote:

 Hi,

 I often need to compute the equivalent of

 np.diag(np.dot(A, B)).

 Computing np.dot(A, B) is highly inefficient if you only need the
 diagonal entries. Two more efficient ways of computing the same thing are

 np.sum(A * B.T, axis=1)

 and

 np.einsum('ij,ji->i', A, B).

 The first can allocate quite a lot of temporary memory.
 The second can be quite cryptic for someone not familiar with einsum.
 I assume that einsum does not compute np.dot(A, B), but I haven't
 verified.

 Since this is quite a recurrent pattern, I was wondering if it would
 be worth adding a dedicated function to NumPy and SciPy's sparse module. A
 possible name would be diagdot. The best performance would be obtained
 when A is C-style and B Fortran-style.


 Does your implementation use BLAS, or is it just a wrapper around einsum?

 David






Re: [Numpy-discussion] Object-oriented SIMD API for Numpy

2009-10-22 Thread Mathieu Blondel
On Thu, Oct 22, 2009 at 5:05 PM, Sturla Molden stu...@molden.no wrote:
 Mathieu Blondel skrev:

 The PEP 3118 buffer syntax in Cython can be used to port NumPy to Py3k,
 replacing the current C source. That might be what Norvig meant if he
 suggested merging NumPy into Cython.

As I wrote earlier in this thread, I confused Cython and CPython. PN
was suggesting to include Numpy in the CPython distribution (not
Cython). The reasons were also given earlier in the thread.

Mathieu


Re: [Numpy-discussion] Object-oriented SIMD API for Numpy

2009-10-21 Thread Mathieu Blondel
On Thu, Oct 22, 2009 at 11:31 AM, Sturla Molden stu...@molden.no wrote:
 Mathieu Blondel skrev:
 Hello,

 About one year ago, a high-level, objected-oriented SIMD API was added
 to Mono. For example, there is a class Vector4f for vectors of 4
 floats and this class implements methods such as basic operators,
 bitwise operators, comparison operators, min, max, sqrt, shuffle
 directly using SIMD operations.
 I think you are confusing SIMD with Intel's MMX/SSE instruction set.

OK, I should have said Object-oriented SIMD API that is implemented
using hardware SIMD instructions.

And when an ISA doesn't provide a specific operation as a single
instruction (say, the absolute value of the differences), the
operation can be composed from other instructions.

 SIMD instructions in hardware for length-4 vectors are mostly useful for
 3D graphics. But they are not used a lot for that purpose, because GPUs
 are getting common. SSE is mostly for rendering 3D graphics without a
 GPU. There is nothing that prevents NumPy from having a Vector4f dtype,
 that internally stores four float32 and is aligned at 16 byte
 boundaries. But it would not be faster than the current float32 dtype.
 Do you know why?

Yes I know because this has already been explained in this very thread
by someone before you!


Mathieu


[Numpy-discussion] Object-oriented SIMD API for Numpy

2009-10-20 Thread Mathieu Blondel
Hello,

About one year ago, a high-level, object-oriented SIMD API was added
to Mono. For example, there is a class Vector4f for vectors of 4
floats and this class implements methods such as basic operators,
bitwise operators, comparison operators, min, max, sqrt, shuffle
directly using SIMD operations. You can have a look at the following
pages for further details:

http://tirania.org/blog/archive/2008/Nov-03.html (blog post)
http://go-mono.com/docs/index.aspx?tlin...@n%3amono.simd (API reference)

It seems to me that such an API would possibly be a great fit in Numpy
too. It would also be possible to add classes that don't directly map
to SIMD types. For example, Vector8f can easily be implemented in
terms of 2 Vector4f. In addition to vectors, additional API may be
added to support operations on matrices of fixed width or height.
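A toy sketch of what such a class might look like in Python (the name Vector4f and its methods follow Mono's API loosely; this is a plain-NumPy mock-up, not a proposal for the real implementation, which would need SIMD intrinsics underneath):

```python
import numpy as np

class Vector4f:
    """Toy Vector4f-style wrapper over a length-4 float32 array."""

    def __init__(self, x, y, z, w):
        self.data = np.array([x, y, z, w], dtype=np.float32)

    def __add__(self, other):
        return Vector4f(*(self.data + other.data))

    def min(self, other):
        return Vector4f(*np.minimum(self.data, other.data))

    def sqrt(self):
        return Vector4f(*np.sqrt(self.data))

v = Vector4f(1, 4, 9, 16).sqrt()
assert np.allclose(v.data, [1, 2, 3, 4])
```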

I searched the archives for similar discussions but I only found a
discussion about memory alignment, so I hope I am not restarting an
existing discussion here. Memory alignment is an important related issue
since unaligned movs can tank performance.

Any thoughts? I don't know the Numpy code base yet but I'm willing to
help if such an effort is started.

Thanks,
Mathieu