Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-24 Thread Neal Becker
Supposedly it can be controlled through environment variables, but I didn't see any effect.
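
Roughly the kind of check I had in mind, for reference. This is only a sketch: it
assumes FlexiBLAS honours the FLEXIBLAS environment variable and reads it when
libflexiblas is first loaded, so it has to be set before numpy is imported; the
backend names are the ones "flexiblas print" reports.

import os
os.environ["FLEXIBLAS"] = "NETLIB"    # any backend name listed by "flexiblas print"
import numpy as np                    # libflexiblas (and the chosen backend) loads here
from timeit import timeit

a = np.ones((1000, 16))
b = np.ones(16) + 1j * np.ones(16)
print(timeit("np.dot(a, b)", globals=globals(), number=1000))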

On Wed, Feb 24, 2021, 10:12 AM Charles R Harris wrote:

>
>
> On Wed, Feb 24, 2021 at 8:02 AM Charles R Harris <
> charlesr.har...@gmail.com> wrote:
>
>>
>>
>> On Wed, Feb 24, 2021 at 5:36 AM Neal Becker  wrote:
>>
>>> See my earlier email - this is fedora 33, python3.9.
>>>
>>> I'm using fedora 33 standard numpy.
>>> ldd says:
>>>
>>> /usr/lib64/python3.9/site-packages/numpy/core/_
>>> multiarray_umath.cpython-39-x86_64-linux-gnu.so:
>>> linux-vdso.so.1 (0x7ffdd1487000)
>>> libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x7f0512787000)
>>>
>>> So whatever flexiblas is doing controls blas.
>>> flexiblas print
>>> FlexiBLAS, version 3.0.4
>>> Copyright (C) 2014, 2015, 2016, 2017, 2018, 2019, 2020 Martin Koehler
>>> and others.
>>> This is free software; see the source code for copying conditions.
>>> There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
>>> FITNESS FOR A PARTICULAR PURPOSE.
>>>
>>>
>>> Configured BLAS libraries:
>>> System-wide (/etc/flexiblasrc):
>>>
>>> System-wide from config directory (/etc/flexiblasrc.d/)
>>>  OPENBLAS-OPENMP
>>>library = libflexiblas_openblas-openmp.so
>>>comment =
>>>  NETLIB
>>>library = libflexiblas_netlib.so
>>>comment =
>>>  ATLAS
>>>library = libflexiblas_atlas.so
>>>comment =
>>>
>>> User config (/home/nbecker/.flexiblasrc):
>>>
>>> Host config (/home/nbecker/.flexiblasrc.nbecker8):
>>>
>>> Available hooks:
>>>
>>> Backend and hook search paths:
>>>   /usr/lib64/flexiblas/
>>>
>>> Default BLAS:
>>> System:   OPENBLAS-OPENMP
>>> User: (none)
>>> Host: (none)
>>> Active Default: OPENBLAS-OPENMP (System)
>>> Runtime properties:
>>>verbose = 0 (System)
>>>
>>> So it looks to me like it is using openblas-openmp.
>>>
>>>
>> ISTR that there have been problems with openmp. There are a ton of
>> OpenBLAS versions available in Fedora 33; these are just the ones available via flexiblas:
>>
>>
>>1. flexiblas-openblas-openmp.x86_64 : FlexiBLAS wrappers for OpenBLAS
>>2. flexiblas-openblas-openmp.i686 : FlexiBLAS wrappers for OpenBLAS
>>3. flexiblas-openblas-openmp64.x86_64 : FlexiBLAS wrappers for
>>OpenBLAS (64-bit)
>>4. flexiblas-openblas-serial.x86_64 : FlexiBLAS wrappers for OpenBLAS
>>5. flexiblas-openblas-serial64.x86_64 : FlexiBLAS wrappers for
>>OpenBLAS (64-bit)
>>6. flexiblas-openblas-threads.x86_64 : FlexiBLAS wrappers for OpenBLAS
>>7. flexiblas-openblas-threads64.x86_64 : FlexiBLAS wrappers for
>>OpenBLAS (64-bit)
>>
>> I am not sure how to make use of flexiblas, but would explore that. We
>> might need to do something with distutils to interoperate with it or maybe
>> you can control it through site.cfg. There are 12 versions available in
>> total. I would suggest trying serial or pthreads.
>>
>>
> Seems to be controlled in the /etc directory:
>
> /etc/flexiblas64rc.d/openblas-openmp64.conf
>
> On my machine it looks like openmp64 is the system default.
>
> Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-24 Thread Charles R Harris
On Wed, Feb 24, 2021 at 8:02 AM Charles R Harris wrote:

>
>
> On Wed, Feb 24, 2021 at 5:36 AM Neal Becker  wrote:
>
>> See my earlier email - this is fedora 33, python3.9.
>>
>> I'm using fedora 33 standard numpy.
>> ldd says:
>>
>> /usr/lib64/python3.9/site-packages/numpy/core/_
>> multiarray_umath.cpython-39-x86_64-linux-gnu.so:
>> linux-vdso.so.1 (0x7ffdd1487000)
>> libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x7f0512787000)
>>
>> So whatever flexiblas is doing controls blas.
>> flexiblas print
>> FlexiBLAS, version 3.0.4
>> Copyright (C) 2014, 2015, 2016, 2017, 2018, 2019, 2020 Martin Koehler
>> and others.
>> This is free software; see the source code for copying conditions.
>> There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
>> FITNESS FOR A PARTICULAR PURPOSE.
>>
>>
>> Configured BLAS libraries:
>> System-wide (/etc/flexiblasrc):
>>
>> System-wide from config directory (/etc/flexiblasrc.d/)
>>  OPENBLAS-OPENMP
>>library = libflexiblas_openblas-openmp.so
>>comment =
>>  NETLIB
>>library = libflexiblas_netlib.so
>>comment =
>>  ATLAS
>>library = libflexiblas_atlas.so
>>comment =
>>
>> User config (/home/nbecker/.flexiblasrc):
>>
>> Host config (/home/nbecker/.flexiblasrc.nbecker8):
>>
>> Available hooks:
>>
>> Backend and hook search paths:
>>   /usr/lib64/flexiblas/
>>
>> Default BLAS:
>> System:   OPENBLAS-OPENMP
>> User: (none)
>> Host: (none)
>> Active Default: OPENBLAS-OPENMP (System)
>> Runtime properties:
>>verbose = 0 (System)
>>
>> So it looks to me like it is using openblas-openmp.
>>
>>
> ISTR that there have been problems with openmp. There are a ton of
> OpenBLAS versions available in Fedora 33; these are just the ones available via flexiblas:
>
>
>1. flexiblas-openblas-openmp.x86_64 : FlexiBLAS wrappers for OpenBLAS
>2. flexiblas-openblas-openmp.i686 : FlexiBLAS wrappers for OpenBLAS
>3. flexiblas-openblas-openmp64.x86_64 : FlexiBLAS wrappers for
>OpenBLAS (64-bit)
>4. flexiblas-openblas-serial.x86_64 : FlexiBLAS wrappers for OpenBLAS
>5. flexiblas-openblas-serial64.x86_64 : FlexiBLAS wrappers for
>OpenBLAS (64-bit)
>6. flexiblas-openblas-threads.x86_64 : FlexiBLAS wrappers for OpenBLAS
>7. flexiblas-openblas-threads64.x86_64 : FlexiBLAS wrappers for
>OpenBLAS (64-bit)
>
> I am not sure how to make use of flexiblas, but would explore that. We
> might need to do something with distutils to interoperate with it or maybe
> you can control it through site.cfg. There are 12 versions available in
> total. I would suggest trying serial or pthreads.
>
>
Seems to be controlled in the /etc directory:

/etc/flexiblas64rc.d/openblas-openmp64.conf

On my machine it looks like openmp64 is the system default.
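
A quick way to confirm which BLAS numpy actually loaded at runtime is
threadpoolctl (a sketch; threadpoolctl is a separate package and may not be
installed):

import numpy as np
from threadpoolctl import threadpool_info

np.dot(np.ones((2, 2)), np.ones(2))   # make sure the BLAS backend is loaded
for info in threadpool_info():
    print(info.get("internal_api"), info.get("filepath"), info.get("num_threads"))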

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-24 Thread Charles R Harris
On Wed, Feb 24, 2021 at 5:36 AM Neal Becker  wrote:

> See my earlier email - this is fedora 33, python3.9.
>
> I'm using fedora 33 standard numpy.
> ldd says:
>
> /usr/lib64/python3.9/site-packages/numpy/core/_
> multiarray_umath.cpython-39-x86_64-linux-gnu.so:
> linux-vdso.so.1 (0x7ffdd1487000)
> libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x7f0512787000)
>
> So whatever flexiblas is doing controls blas.
> flexiblas print
> FlexiBLAS, version 3.0.4
> Copyright (C) 2014, 2015, 2016, 2017, 2018, 2019, 2020 Martin Koehler
> and others.
> This is free software; see the source code for copying conditions.
> There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
> FITNESS FOR A PARTICULAR PURPOSE.
>
>
> Configured BLAS libraries:
> System-wide (/etc/flexiblasrc):
>
> System-wide from config directory (/etc/flexiblasrc.d/)
>  OPENBLAS-OPENMP
>library = libflexiblas_openblas-openmp.so
>comment =
>  NETLIB
>library = libflexiblas_netlib.so
>comment =
>  ATLAS
>library = libflexiblas_atlas.so
>comment =
>
> User config (/home/nbecker/.flexiblasrc):
>
> Host config (/home/nbecker/.flexiblasrc.nbecker8):
>
> Available hooks:
>
> Backend and hook search paths:
>   /usr/lib64/flexiblas/
>
> Default BLAS:
> System:   OPENBLAS-OPENMP
> User: (none)
> Host: (none)
> Active Default: OPENBLAS-OPENMP (System)
> Runtime properties:
>verbose = 0 (System)
>
> So it looks to me like it is using openblas-openmp.
>
>
ISTR that there have been problems with openmp. There are a ton of OpenBLAS
versions available in Fedora 33; these are just the ones available via flexiblas:


   1. flexiblas-openblas-openmp.x86_64 : FlexiBLAS wrappers for OpenBLAS
   2. flexiblas-openblas-openmp.i686 : FlexiBLAS wrappers for OpenBLAS
   3. flexiblas-openblas-openmp64.x86_64 : FlexiBLAS wrappers for OpenBLAS
   (64-bit)
   4. flexiblas-openblas-serial.x86_64 : FlexiBLAS wrappers for OpenBLAS
   5. flexiblas-openblas-serial64.x86_64 : FlexiBLAS wrappers for OpenBLAS
   (64-bit)
   6. flexiblas-openblas-threads.x86_64 : FlexiBLAS wrappers for OpenBLAS
   7. flexiblas-openblas-threads64.x86_64 : FlexiBLAS wrappers for OpenBLAS
   (64-bit)

I am not sure how to make use of flexiblas, but would explore that. We
might need to do something with distutils to interoperate with it or maybe
you can control it through site.cfg. There are 12 versions available in
total. I would suggest trying serial or pthreads.
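
Another way to test the single-threaded case without swapping packages would be
to clamp the BLAS thread pool at runtime (a sketch using threadpoolctl, an extra
dependency; it assumes the OpenBLAS backend honours it):

import numpy as np
from threadpoolctl import threadpool_limits
from timeit import timeit

a = np.ones((1000, 16))
b = np.ones(16) + 1j * np.ones(16)

print(timeit("np.dot(a, b)", globals=globals(), number=1000))    # default thread count
with threadpool_limits(limits=1, user_api="blas"):               # single-threaded BLAS
    print(timeit("np.dot(a, b)", globals=globals(), number=1000))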

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-24 Thread Neal Becker
See my earlier email - this is fedora 33, python3.9.

I'm using fedora 33 standard numpy.
ldd says:

/usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
linux-vdso.so.1 (0x7ffdd1487000)
libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x7f0512787000)

So whatever flexiblas is doing controls blas.
flexiblas print
FlexiBLAS, version 3.0.4
Copyright (C) 2014, 2015, 2016, 2017, 2018, 2019, 2020 Martin Koehler
and others.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.


Configured BLAS libraries:
System-wide (/etc/flexiblasrc):

System-wide from config directory (/etc/flexiblasrc.d/)
 OPENBLAS-OPENMP
   library = libflexiblas_openblas-openmp.so
   comment =
 NETLIB
   library = libflexiblas_netlib.so
   comment =
 ATLAS
   library = libflexiblas_atlas.so
   comment =

User config (/home/nbecker/.flexiblasrc):

Host config (/home/nbecker/.flexiblasrc.nbecker8):

Available hooks:

Backend and hook search paths:
  /usr/lib64/flexiblas/

Default BLAS:
System:   OPENBLAS-OPENMP
User: (none)
Host: (none)
Active Default: OPENBLAS-OPENMP (System)
Runtime properties:
   verbose = 0 (System)

So it looks to me like it is using openblas-openmp.


On Tue, Feb 23, 2021 at 8:00 PM Charles R Harris wrote:
>
>
>
>> On Tue, Feb 23, 2021 at 5:47 PM Charles R Harris wrote:
>>
>>
>>
>> On Tue, Feb 23, 2021 at 11:10 AM Neal Becker  wrote:
>>>
>>> I have code that performs dot product of a 2D matrix of size (on the
>>> order of) [1000,16] with a vector of size [1000].  The matrix is
>>> float64 and the vector is complex128.  I was using numpy.dot but it
>>> turned out to be a bottleneck.
>>>
>>> So I coded dot2x1 in c++ (using xtensor-python just for the
>>> interface).  No fancy simd was used, unless g++ did it on its own.
>>>
>>> On a simple benchmark using timeit I find my hand-coded routine is on
>>> the order of 1000x faster than numpy?  Here is the test code:
>>> My custom c++ code is dot2x1.  I'm not copying it here because it has
>>> some dependencies.  Any idea what is going on?
>>>
>>> import numpy as np
>>>
>>> from dot2x1 import dot2x1
>>>
>>> a = np.ones ((1000,16))
>>> b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
>>>-0.80311816+0.80311816j, -0.80311816-0.80311816j,
>>> 1.09707981+0.29396165j,  1.09707981-0.29396165j,
>>>-1.09707981+0.29396165j, -1.09707981-0.29396165j,
>>> 0.29396165+1.09707981j,  0.29396165-1.09707981j,
>>>-0.29396165+1.09707981j, -0.29396165-1.09707981j,
>>> 0.25495815+0.25495815j,  0.25495815-0.25495815j,
>>>-0.25495815+0.25495815j, -0.25495815-0.25495815j])
>>>
>>> def F1():
>>> d = dot2x1 (a, b)
>>>
>>> def F2():
>>> d = np.dot (a, b)
>>>
>>> from timeit import timeit
>>> print (timeit ('F1()', globals=globals(), number=1000))
>>> print (timeit ('F2()', globals=globals(), number=1000))
>>>
>>> In [13]: 0.013910860987380147 << 1st timeit
>>> 28.608758996007964  << 2nd timeit
>>
>>
>> I'm going to guess threading, although huge pages can also be a problem on a 
>> machine under heavy load running other processes. Call overhead may also 
>> matter for such small matrices.
>>
>
> What BLAS library are you using. I get much better results using an 8 year 
> old i5 and ATLAS.
>
> Chuck



--
Those who don't understand recursion are doomed to repeat it
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Charles R Harris
On Tue, Feb 23, 2021 at 5:47 PM Charles R Harris wrote:

>
>
> On Tue, Feb 23, 2021 at 11:10 AM Neal Becker  wrote:
>
>> I have code that performs dot product of a 2D matrix of size (on the
>> order of) [1000,16] with a vector of size [1000].  The matrix is
>> float64 and the vector is complex128.  I was using numpy.dot but it
>> turned out to be a bottleneck.
>>
>> So I coded dot2x1 in c++ (using xtensor-python just for the
>> interface).  No fancy simd was used, unless g++ did it on its own.
>>
>> On a simple benchmark using timeit I find my hand-coded routine is on
>> the order of 1000x faster than numpy?  Here is the test code:
>> My custom c++ code is dot2x1.  I'm not copying it here because it has
>> some dependencies.  Any idea what is going on?
>>
>> import numpy as np
>>
>> from dot2x1 import dot2x1
>>
>> a = np.ones ((1000,16))
>> b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
>>-0.80311816+0.80311816j, -0.80311816-0.80311816j,
>> 1.09707981+0.29396165j,  1.09707981-0.29396165j,
>>-1.09707981+0.29396165j, -1.09707981-0.29396165j,
>> 0.29396165+1.09707981j,  0.29396165-1.09707981j,
>>-0.29396165+1.09707981j, -0.29396165-1.09707981j,
>> 0.25495815+0.25495815j,  0.25495815-0.25495815j,
>>-0.25495815+0.25495815j, -0.25495815-0.25495815j])
>>
>> def F1():
>> d = dot2x1 (a, b)
>>
>> def F2():
>> d = np.dot (a, b)
>>
>> from timeit import timeit
>> print (timeit ('F1()', globals=globals(), number=1000))
>> print (timeit ('F2()', globals=globals(), number=1000))
>>
>> In [13]: 0.013910860987380147 << 1st timeit
>> 28.608758996007964  << 2nd timeit
>>
>
> I'm going to guess threading, although huge pages can also be a problem on
> a machine under heavy load running other processes. Call overhead may also
> matter for such small matrices.
>
>
What BLAS library are you using? I get much better results using an 8 year
old i5 and ATLAS.
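
The build-time answer can be read straight from numpy, e.g.:

import numpy as np
np.show_config()   # BLAS/LAPACK numpy was built against

Note that with FlexiBLAS in the middle this only tells you what numpy was built
against, not which backend is dispatched to at runtime.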

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Charles R Harris
On Tue, Feb 23, 2021 at 11:10 AM Neal Becker  wrote:

> I have code that performs dot product of a 2D matrix of size (on the
> order of) [1000,16] with a vector of size [1000].  The matrix is
> float64 and the vector is complex128.  I was using numpy.dot but it
> turned out to be a bottleneck.
>
> So I coded dot2x1 in c++ (using xtensor-python just for the
> interface).  No fancy simd was used, unless g++ did it on its own.
>
> On a simple benchmark using timeit I find my hand-coded routine is on
> the order of 1000x faster than numpy?  Here is the test code:
> My custom c++ code is dot2x1.  I'm not copying it here because it has
> some dependencies.  Any idea what is going on?
>
> import numpy as np
>
> from dot2x1 import dot2x1
>
> a = np.ones ((1000,16))
> b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
>-0.80311816+0.80311816j, -0.80311816-0.80311816j,
> 1.09707981+0.29396165j,  1.09707981-0.29396165j,
>-1.09707981+0.29396165j, -1.09707981-0.29396165j,
> 0.29396165+1.09707981j,  0.29396165-1.09707981j,
>-0.29396165+1.09707981j, -0.29396165-1.09707981j,
> 0.25495815+0.25495815j,  0.25495815-0.25495815j,
>-0.25495815+0.25495815j, -0.25495815-0.25495815j])
>
> def F1():
> d = dot2x1 (a, b)
>
> def F2():
> d = np.dot (a, b)
>
> from timeit import timeit
> print (timeit ('F1()', globals=globals(), number=1000))
> print (timeit ('F2()', globals=globals(), number=1000))
>
> In [13]: 0.013910860987380147 << 1st timeit
> 28.608758996007964  << 2nd timeit
>

I'm going to guess threading, although huge pages can also be a problem on
a machine under heavy load running other processes. Call overhead may also
matter for such small matrices.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Carl Kleffner
The stackoverflow link above contains a simple testcase:

>>> from scipy.linalg import get_blas_funcs
>>> gemm = get_blas_funcs("gemm", [X, Y])
>>> np.all(gemm(1, X, Y) == np.dot(X, Y))
True

It would be of interest to benchmark gemm against np.dot. Maybe np.dot
doesn't use BLAS at all for whatever reason?
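
Something along these lines would do it, using the shapes from the original post
(a sketch; get_blas_funcs should resolve to zgemm for this dtype combination, and
the vector is passed as a column so it fits the gemm interface):

import numpy as np
from scipy.linalg import get_blas_funcs
from timeit import timeit

a = np.ones((1000, 16))
b = (np.ones(16) + 1j * np.ones(16)).reshape(16, 1)   # column vector for gemm
gemm = get_blas_funcs("gemm", (a, b))

assert np.allclose(gemm(1.0, a, b), np.dot(a, b))     # same result as np.dot
print(timeit("gemm(1.0, a, b)", globals=globals(), number=1000))
print(timeit("np.dot(a, b)", globals=globals(), number=1000))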


On Tue, 23 Feb 2021 at 20:46, David Menéndez Hurtado <davidmen...@gmail.com> wrote:

> On Tue, 23 Feb 2021, 7:41 pm Roman Yurchak,  wrote:
>
>> For the first benchmark apparently A.dot(B) with A real and B complex is
>> a known issue performance wise
>> https://github.com/numpy/numpy/issues/10468
>
>
> I split B into a vector of size (N, 2) for the real and imaginary parts,
> and that makes the multiplication twice as fast.
>
>
> My configuration (also in Fedora 33) np.show_config()
>
>
>
> blas_mkl_info:
>   NOT AVAILABLE
> blis_info:
>   NOT AVAILABLE
> openblas_info:
> libraries = ['openblas', 'openblas']
> library_dirs = ['/usr/local/lib']
> language = c
> define_macros = [('HAVE_CBLAS', None)]
> blas_opt_info:
> libraries = ['openblas', 'openblas']
> library_dirs = ['/usr/local/lib']
> language = c
> define_macros = [('HAVE_CBLAS', None)]
> lapack_mkl_info:
>   NOT AVAILABLE
> openblas_lapack_info:
> libraries = ['openblas', 'openblas']
> library_dirs = ['/usr/local/lib']
> language = c
> define_macros = [('HAVE_CBLAS', None)]
> lapack_opt_info:
> libraries = ['openblas', 'openblas']
> library_dirs = ['/usr/local/lib']
> language = c
> define_macros = [('HAVE_CBLAS', None)]
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread David Menéndez Hurtado
On Tue, 23 Feb 2021, 7:41 pm Roman Yurchak,  wrote:

> For the first benchmark apparently A.dot(B) with A real and B complex is
> a known issue performance wise https://github.com/numpy/numpy/issues/10468


I split B into a vector of size (N, 2) for the real and imaginary parts,
and that makes the multiplication twice as fast.
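
Roughly like this, on the shapes from the original benchmark (a sketch of the
idea: one real-by-real product, then reassemble the complex result):

import numpy as np

a = np.ones((1000, 16))
b = np.exp(2j * np.pi * np.arange(16) / 16.0)   # any complex vector of length 16
b2 = np.column_stack([b.real, b.imag])          # shape (16, 2): real and imaginary columns
d2 = np.dot(a, b2)                              # single real-by-real product, shape (1000, 2)
d = d2[:, 0] + 1j * d2[:, 1]                    # reassemble the complex result
assert np.allclose(d, np.dot(a, b))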


My configuration (also in Fedora 33) np.show_config()


blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Neal Becker
I'm using fedora 33 standard numpy.
ldd says:

/usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
linux-vdso.so.1 (0x7ffdd1487000)
libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x7f0512787000)

So whatever flexiblas is doing controls blas.

On Tue, Feb 23, 2021 at 1:51 PM Neal Becker  wrote:
>
> One suspect is that it seems the numpy version was multi-threading.
> This isn't useful here, because I'm running parallel monte-carlo
> simulations using all cores.  Perhaps this is perversely slowing
> things down?  I don't know how to account for 1000x  slowdown though.
>
> On Tue, Feb 23, 2021 at 1:40 PM Roman Yurchak  wrote:
> >
> > For the first benchmark apparently A.dot(B) with A real and B complex is
> > a known issue performance wise https://github.com/numpy/numpy/issues/10468
> >
> > In general, it might be worth trying different BLAS backends. For
> > instance, if you install numpy from conda-forge you should be able to
> > switch between OpenBLAS, MKL and BLIS:
> > https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation
> >
> > Roman
> >
> > On 23/02/2021 19:32, Andrea Gavana wrote:
> > > Hi,
> > >
> > > On Tue, 23 Feb 2021 at 19.11, Neal Becker wrote:
> > >
> > > I have code that performs dot product of a 2D matrix of size (on the
> > > order of) [1000,16] with a vector of size [1000].  The matrix is
> > > float64 and the vector is complex128.  I was using numpy.dot but it
> > > turned out to be a bottleneck.
> > >
> > > So I coded dot2x1 in c++ (using xtensor-python just for the
> > > interface).  No fancy simd was used, unless g++ did it on its own.
> > >
> > > On a simple benchmark using timeit I find my hand-coded routine is on
> > > the order of 1000x faster than numpy?  Here is the test code:
> > > My custom c++ code is dot2x1.  I'm not copying it here because it has
> > > some dependencies.  Any idea what is going on?
> > >
> > >
> > >
> > > I had a similar experience - albeit with an older numpy and Python 2.7,
> > > so my comments are easily outdated and irrelevant. This was on Windows
> > > 10 64 bit, way more than plenty RAM.
> > >
> > > It took me forever to find out that numpy.dot was the culprit, and I
> > > ended up using fortran + f2py. Even with the overhead of having to go
> > > through f2py bridge, the fortran dot_product was several times faster.
> > >
> > > Sorry if it doesn’t help much.
> > >
> > > Andrea.
> > >
> > >
> > >
> > >
> > > import numpy as np
> > >
> > > from dot2x1 import dot2x1
> > >
> > > a = np.ones ((1000,16))
> > > b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
> > > -0.80311816+0.80311816j, -0.80311816-0.80311816j,
> > >  1.09707981+0.29396165j,  1.09707981-0.29396165j,
> > > -1.09707981+0.29396165j, -1.09707981-0.29396165j,
> > >  0.29396165+1.09707981j,  0.29396165-1.09707981j,
> > > -0.29396165+1.09707981j, -0.29396165-1.09707981j,
> > >  0.25495815+0.25495815j,  0.25495815-0.25495815j,
> > > -0.25495815+0.25495815j, -0.25495815-0.25495815j])
> > >
> > > def F1():
> > >  d = dot2x1 (a, b)
> > >
> > > def F2():
> > >  d = np.dot (a, b)
> > >
> > > from timeit import timeit
> > > print (timeit ('F1()', globals=globals(), number=1000))
> > > print (timeit ('F2()', globals=globals(), number=1000))
> > >
> > > In [13]: 0.013910860987380147 << 1st timeit
> > > 28.608758996007964  << 2nd timeit
> > > --
> > > Those who don't understand recursion are doomed to repeat it
>
>
>
> --
> Those who don't understand recursion are doomed to repeat it



-- 
Those who don't understand recursion are doomed to repeat it
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Carl Kleffner
https://stackoverflow.com/questions/19839539/how-to-get-faster-code-than-numpy-dot-for-matrix-multiplication

maybe C_CONTIGUOUS vs F_CONTIGUOUS?
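
That is easy to rule in or out (a sketch; a is C-contiguous in the original
benchmark, so this just times the same product against a Fortran-ordered copy):

import numpy as np
from timeit import timeit

a_c = np.ones((1000, 16))       # C-contiguous, as in the benchmark
a_f = np.asfortranarray(a_c)    # same values, column-major layout
b = np.ones(16) + 1j * np.ones(16)

print(a_c.flags["C_CONTIGUOUS"], a_f.flags["F_CONTIGUOUS"])
print(timeit("np.dot(a_c, b)", globals=globals(), number=1000))
print(timeit("np.dot(a_f, b)", globals=globals(), number=1000))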

Carl


On Tue, 23 Feb 2021 at 19:52, Neal Becker wrote:

> One suspect is that it seems the numpy version was multi-threading.
> This isn't useful here, because I'm running parallel monte-carlo
> simulations using all cores.  Perhaps this is perversely slowing
> things down?  I don't know how to account for 1000x  slowdown though.
>
> On Tue, Feb 23, 2021 at 1:40 PM Roman Yurchak wrote:
> >
> > For the first benchmark apparently A.dot(B) with A real and B complex is
> > a known issue performance wise
> https://github.com/numpy/numpy/issues/10468
> >
> > In general, it might be worth trying different BLAS backends. For
> > instance, if you install numpy from conda-forge you should be able to
> > switch between OpenBLAS, MKL and BLIS:
> >
> https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation
> >
> > Roman
> >
> > On 23/02/2021 19:32, Andrea Gavana wrote:
> > > Hi,
> > >
> > > On Tue, 23 Feb 2021 at 19.11, Neal Becker wrote:
> > >
> > > I have code that performs dot product of a 2D matrix of size (on
> the
> > > order of) [1000,16] with a vector of size [1000].  The matrix is
> > > float64 and the vector is complex128.  I was using numpy.dot but it
> > > turned out to be a bottleneck.
> > >
> > > So I coded dot2x1 in c++ (using xtensor-python just for the
> > > interface).  No fancy simd was used, unless g++ did it on its own.
> > >
> > > On a simple benchmark using timeit I find my hand-coded routine is
> on
> > > the order of 1000x faster than numpy?  Here is the test code:
> > > My custom c++ code is dot2x1.  I'm not copying it here because it
> has
> > > some dependencies.  Any idea what is going on?
> > >
> > >
> > >
> > > I had a similar experience - albeit with an older numpy and Python 2.7,
> > > so my comments are easily outdated and irrelevant. This was on Windows
> > > 10 64 bit, way more than plenty RAM.
> > >
> > > It took me forever to find out that numpy.dot was the culprit, and I
> > > ended up using fortran + f2py. Even with the overhead of having to go
> > > through f2py bridge, the fortran dot_product was several times faster.
> > >
> > > Sorry if it doesn’t help much.
> > >
> > > Andrea.
> > >
> > >
> > >
> > >
> > > import numpy as np
> > >
> > > from dot2x1 import dot2x1
> > >
> > > a = np.ones ((1000,16))
> > > b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
> > > -0.80311816+0.80311816j, -0.80311816-0.80311816j,
> > >  1.09707981+0.29396165j,  1.09707981-0.29396165j,
> > > -1.09707981+0.29396165j, -1.09707981-0.29396165j,
> > >  0.29396165+1.09707981j,  0.29396165-1.09707981j,
> > > -0.29396165+1.09707981j, -0.29396165-1.09707981j,
> > >  0.25495815+0.25495815j,  0.25495815-0.25495815j,
> > > -0.25495815+0.25495815j, -0.25495815-0.25495815j])
> > >
> > > def F1():
> > >  d = dot2x1 (a, b)
> > >
> > > def F2():
> > >  d = np.dot (a, b)
> > >
> > > from timeit import timeit
> > > print (timeit ('F1()', globals=globals(), number=1000))
> > > print (timeit ('F2()', globals=globals(), number=1000))
> > >
> > > In [13]: 0.013910860987380147 << 1st timeit
> > > 28.608758996007964  << 2nd timeit
> > > --
> > > Those who don't understand recursion are doomed to repeat it
>
>
>
> --
> Those who don't understand recursion are doomed to repeat it
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Neal Becker
One suspect is that the numpy version appears to be multi-threading.
That isn't useful here, because I'm already running parallel Monte Carlo
simulations using all cores. Perhaps this is perversely slowing
things down? I don't know how to account for a 1000x slowdown, though.
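
If that is the cause, pinning the BLAS thread count per worker process should
show it. A sketch: OPENBLAS_NUM_THREADS and OMP_NUM_THREADS are the usual
OpenBLAS/OpenMP knobs, though I have not checked how they interact with
FlexiBLAS, and they must be set before numpy loads the library.

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # pthreads builds of OpenBLAS
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP builds
import numpy as np                         # must come after the variables are set

# ... run the benchmark / Monte Carlo worker as before ...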

On Tue, Feb 23, 2021 at 1:40 PM Roman Yurchak  wrote:
>
> For the first benchmark apparently A.dot(B) with A real and B complex is
> a known issue performance wise https://github.com/numpy/numpy/issues/10468
>
> In general, it might be worth trying different BLAS backends. For
> instance, if you install numpy from conda-forge you should be able to
> switch between OpenBLAS, MKL and BLIS:
> https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation
>
> Roman
>
> On 23/02/2021 19:32, Andrea Gavana wrote:
> > Hi,
> >
> > On Tue, 23 Feb 2021 at 19.11, Neal Becker wrote:
> >
> > I have code that performs dot product of a 2D matrix of size (on the
> > order of) [1000,16] with a vector of size [1000].  The matrix is
> > float64 and the vector is complex128.  I was using numpy.dot but it
> > turned out to be a bottleneck.
> >
> > So I coded dot2x1 in c++ (using xtensor-python just for the
> > interface).  No fancy simd was used, unless g++ did it on its own.
> >
> > On a simple benchmark using timeit I find my hand-coded routine is on
> > the order of 1000x faster than numpy?  Here is the test code:
> > My custom c++ code is dot2x1.  I'm not copying it here because it has
> > some dependencies.  Any idea what is going on?
> >
> >
> >
> > I had a similar experience - albeit with an older numpy and Python 2.7,
> > so my comments are easily outdated and irrelevant. This was on Windows
> > 10 64 bit, way more than plenty RAM.
> >
> > It took me forever to find out that numpy.dot was the culprit, and I
> > ended up using fortran + f2py. Even with the overhead of having to go
> > through f2py bridge, the fortran dot_product was several times faster.
> >
> > Sorry if it doesn’t help much.
> >
> > Andrea.
> >
> >
> >
> >
> > import numpy as np
> >
> > from dot2x1 import dot2x1
> >
> > a = np.ones ((1000,16))
> > b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
> > -0.80311816+0.80311816j, -0.80311816-0.80311816j,
> >  1.09707981+0.29396165j,  1.09707981-0.29396165j,
> > -1.09707981+0.29396165j, -1.09707981-0.29396165j,
> >  0.29396165+1.09707981j,  0.29396165-1.09707981j,
> > -0.29396165+1.09707981j, -0.29396165-1.09707981j,
> >  0.25495815+0.25495815j,  0.25495815-0.25495815j,
> > -0.25495815+0.25495815j, -0.25495815-0.25495815j])
> >
> > def F1():
> >  d = dot2x1 (a, b)
> >
> > def F2():
> >  d = np.dot (a, b)
> >
> > from timeit import timeit
> > print (timeit ('F1()', globals=globals(), number=1000))
> > print (timeit ('F2()', globals=globals(), number=1000))
> >
> > In [13]: 0.013910860987380147 << 1st timeit
> > 28.608758996007964  << 2nd timeit
> > --
> > Those who don't understand recursion are doomed to repeat it



-- 
Those who don't understand recursion are doomed to repeat it
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Roman Yurchak
For the first benchmark, A.dot(B) with A real and B complex is apparently a
known performance issue: https://github.com/numpy/numpy/issues/10468


In general, it might be worth trying different BLAS backends. For 
instance, if you install numpy from conda-forge you should be able to 
switch between OpenBLAS, MKL and BLIS: 
https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation


Roman

On 23/02/2021 19:32, Andrea Gavana wrote:

Hi,

On Tue, 23 Feb 2021 at 19.11, Neal Becker wrote:


I have code that performs dot product of a 2D matrix of size (on the
order of) [1000,16] with a vector of size [1000].  The matrix is
float64 and the vector is complex128.  I was using numpy.dot but it
turned out to be a bottleneck.

So I coded dot2x1 in c++ (using xtensor-python just for the
interface).  No fancy simd was used, unless g++ did it on its own.

On a simple benchmark using timeit I find my hand-coded routine is on
the order of 1000x faster than numpy?  Here is the test code:
My custom c++ code is dot2x1.  I'm not copying it here because it has
some dependencies.  Any idea what is going on?



I had a similar experience - albeit with an older numpy and Python 2.7, 
so my comments are easily outdated and irrelevant. This was on Windows 
10 64 bit, way more than plenty RAM.


It took me forever to find out that numpy.dot was the culprit, and I 
ended up using fortran + f2py. Even with the overhead of having to go 
through f2py bridge, the fortran dot_product was several times faster.


Sorry if it doesn’t help much.

Andrea.




import numpy as np

from dot2x1 import dot2x1

a = np.ones ((1000,16))
b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
        -0.80311816+0.80311816j, -0.80311816-0.80311816j,
         1.09707981+0.29396165j,  1.09707981-0.29396165j,
        -1.09707981+0.29396165j, -1.09707981-0.29396165j,
         0.29396165+1.09707981j,  0.29396165-1.09707981j,
        -0.29396165+1.09707981j, -0.29396165-1.09707981j,
         0.25495815+0.25495815j,  0.25495815-0.25495815j,
        -0.25495815+0.25495815j, -0.25495815-0.25495815j])

def F1():
     d = dot2x1 (a, b)

def F2():
     d = np.dot (a, b)

from timeit import timeit
print (timeit ('F1()', globals=globals(), number=1000))
print (timeit ('F2()', globals=globals(), number=1000))

In [13]: 0.013910860987380147 << 1st timeit
28.608758996007964  << 2nd timeit
-- 
Those who don't understand recursion are doomed to repeat it



___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Andrea Gavana
Hi,

On Tue, 23 Feb 2021 at 19.11, Neal Becker  wrote:

> I have code that performs dot product of a 2D matrix of size (on the
> order of) [1000,16] with a vector of size [1000].  The matrix is
> float64 and the vector is complex128.  I was using numpy.dot but it
> turned out to be a bottleneck.
>
> So I coded dot2x1 in c++ (using xtensor-python just for the
> interface).  No fancy simd was used, unless g++ did it on its own.
>
> On a simple benchmark using timeit I find my hand-coded routine is on
> the order of 1000x faster than numpy?  Here is the test code:
> My custom c++ code is dot2x1.  I'm not copying it here because it has
> some dependencies.  Any idea what is going on?



I had a similar experience - albeit with an older numpy and Python 2.7, so
my comments are easily outdated and irrelevant. This was on Windows 10 64
bit, with way more than enough RAM.

It took me forever to find out that numpy.dot was the culprit, and I ended
up using fortran + f2py. Even with the overhead of having to go through
the f2py bridge, the fortran dot_product was several times faster.

Sorry if it doesn’t help much.

Andrea.



>
> import numpy as np
>
> from dot2x1 import dot2x1
>
> a = np.ones ((1000,16))
> b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
>-0.80311816+0.80311816j, -0.80311816-0.80311816j,
> 1.09707981+0.29396165j,  1.09707981-0.29396165j,
>-1.09707981+0.29396165j, -1.09707981-0.29396165j,
> 0.29396165+1.09707981j,  0.29396165-1.09707981j,
>-0.29396165+1.09707981j, -0.29396165-1.09707981j,
> 0.25495815+0.25495815j,  0.25495815-0.25495815j,
>-0.25495815+0.25495815j, -0.25495815-0.25495815j])
>
> def F1():
> d = dot2x1 (a, b)
>
> def F2():
> d = np.dot (a, b)
>
> from timeit import timeit
> print (timeit ('F1()', globals=globals(), number=1000))
> print (timeit ('F2()', globals=globals(), number=1000))
>
> In [13]: 0.013910860987380147 << 1st timeit
> 28.608758996007964  << 2nd timeit
> --
> Those who don't understand recursion are doomed to repeat it
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion