Re: [Numpy-discussion] ANN: NumExpr3 Alpha

2017-02-21 Thread Francesc Alted
Yes, Julian is doing an amazing work on getting rid of temporaries inside
NumPy.  However, NumExpr still has the advantage of using multi-threading
right out of the box, as well as integration with Intel VML.  Hopefully
these features will eventually arrive to NumPy, but meanwhile there is
still value in pushing NumExpr.

Francesc

2017-02-19 18:21 GMT+01:00 Marten van Kerkwijk :

> Hi All,
>
> Just a side note that at a smaller scale some of the benefits of
> numexpr are coming to numpy: Julian Taylor has been working on
> identifying temporary arrays in
> https://github.com/numpy/numpy/pull/7997. Julian also commented
> (https://github.com/numpy/numpy/pull/7997#issuecomment-246118772) that
> with PEP 523 in python 3.6, this should indeed become a lot easier.
>
> All the best,
>
> Marten
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>



-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: NumExpr3 Alpha

2017-02-19 Thread Marten van Kerkwijk
Hi All,

Just a side note that at a smaller scale some of the benefits of
numexpr are coming to numpy: Julian Taylor has been working on
identifying temporary arrays in
https://github.com/numpy/numpy/pull/7997. Julian also commented
(https://github.com/numpy/numpy/pull/7997#issuecomment-246118772) that
with PEP 523 in python 3.6, this should indeed become a lot easier.

All the best,

Marten
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: NumExpr3 Alpha

2017-02-19 Thread Robert McLeod
Hi Juan,

A guy on reddit suggested looking at SymPy for just such a thing. I know
that Dask also represents its process as a graph.

https://www.reddit.com/r/Python/comments/5um04m/numexpr3/

I'll think about it some more but it seems a little abstract still. To a
certain extent the NE3 compiler already works this way.  The compiler has a
dictionary in which keys are `ast.Node` types, and each value is a function
pointer, which knows how to handle that particular node. Providing an
external interface to this would be the most natural extension.

There's quite a few things to do before I would think about a functional
interface. The things I mentioned in my original mail; pickling of the
C-object so that it can be using within modules like `multiprocessing`;
having a pre-allocated shared memory region shared among threads for
temporaries and parameters, etc.  If someone else wants to dabble in it
they are welcome to.

Robert

On Sun, Feb 19, 2017 at 4:19 AM, Juan Nunez-Iglesias 
wrote:

> Hi everyone,
>
> Thanks for this. It looks absolutely fantastic. I've been putting off
> using numexpr but it looks like I don't have a choice anymore. ;)
>
> Regarding feature requests, I've always found it off putting that I have
> to wrap my expressions in a string to speed them up. Has anyone explored
> the possibility of using Python 3.6's frame evaluation API to do this? I
> remember a vague discussion on this list a while back but I don't know
> whether anything came of it.
>
> Thanks!
>
> Juan.
>
> On 18 Feb 2017, 3:42 AM +1100, Robert McLeod ,
> wrote:
>
> Hi David,
>
> Thanks for your comments, reply below the fold.
>
> On Fri, Feb 17, 2017 at 4:34 PM, Daπid  wrote:
>
>> This is very nice indeed!
>>
>> On 17 February 2017 at 12:15, Robert McLeod  wrote:
>> > * bytes and unicode support
>> > * reductions (mean, sum, prod, std)
>>
>> I use both a lot, maybe I can help you get them working.
>>
>> Also, regarding "Vectorization hasn't been done yet with cmath
>> functions for real numbers (such as sqrt(), exp(), etc.), only for
>> complex functions". What is the bottleneck? Is it in GCC or just
>> someone has to sit down and adapt it?
>
>
> I just haven't done it yet.  Basically I'm moving from Switzerland to
> Canada in a week so this was the gap to push something out that's usable if
> not perfect. Rather I just import cmath functions, which are inlined but I
> suspect what's needed is to break them down into their components. For
> example, the complex arccos function looks like this:
>
> static void
> nc_acos( npy_intp n, npy_complex64 *x, npy_complex64 *r)
> {
> npy_complex64 a;
> for( npy_intp I = 0; I < n; I++ ) {
> a = x[I];
> _inline_mul( x[I], x[I], r[I] );
> _inline_sub( Z_1, r[I], r[I] );
> _inline_sqrt( r[I], r[I] );
> _inline_muli( r[I], r[I] );
> _inline_add( a, r[I], r[I] );
> _inline_log( r[I] , r[I] );
> _inline_muli( r[I], r[I] );
> _inline_neg( r[I], r[I]);
> }
> }
>
> I haven't sat down and inspected whether the cmath versions get
> vectorized, but there's not a huge speed difference between NE2 and 3 for
> such a function on float (but their is for complex), so my suspicion is
> they aren't.  Another option would be to add a library such as Yeppp! as
> LIB_YEPPP or some other library that's faster than glib.  For example the
> glib function "fma(a,b,c)" is slower than doing "a*b+c" in NE3, and that's
> not how it should be.  Yeppp is also built with Python generating C code,
> so it could either be very easy or very hard.
>
> On bytes and unicode, I haven't seen examples for how people use it, so
> I'm not sure where to start. Since there's practically not a limitation on
> the number of operations now (the library is 1.3 MB now, compared to 1.2 MB
> for NE2 with gcc 5.4) the string functions could grow significantly from
> what we have in NE2.
>
> With regards to reductions, NumExpr never multi-threaded them, and could
> only do outer reductions, so in the end there was no speed advantage to be
> had compared to having NumPy do them on the result.  I suspect the primary
> value there was in PyTables and Pandas where the expression had to do
> everything.  One of the things I've moved away from in NE3 is doing output
> buffering (rather it pre-allocates the output array), so for reductions the
> understanding NumExpr has of broadcasting would have to be deeper.
>
> In any event contributions would certainly be welcome.
>
> Robert
>
> --
> Robert McLeod, Ph.D.
> Center for Cellular Imaging and Nano Analytics (C-CINA)
> Biozentrum der Universität Basel
> Mattenstrasse 26, 4058 Basel
> Work: +41.061.387.3225 <061%20387%2032%2025>
> robert.mcl...@unibas.ch
> robert.mcl...@bsse.ethz.ch 
> robbmcl...@gmail.com
> ___
> NumPy-Discussion mailing list
> 

Re: [Numpy-discussion] ANN: NumExpr3 Alpha

2017-02-18 Thread Juan Nunez-Iglesias
Hi everyone,

Thanks for this. It looks absolutely fantastic. I've been putting off using 
numexpr but it looks like I don't have a choice anymore. ;)

Regarding feature requests, I've always found it off putting that I have to 
wrap my expressions in a string to speed them up. Has anyone explored the 
possibility of using Python 3.6's frame evaluation API to do this? I remember a 
vague discussion on this list a while back but I don't know whether anything 
came of it.

Thanks!

Juan.

On 18 Feb 2017, 3:42 AM +1100, Robert McLeod , wrote:
> Hi David,
>
> Thanks for your comments, reply below the fold.
>
> > On Fri, Feb 17, 2017 at 4:34 PM, Daπid  wrote:
> > > This is very nice indeed!
> > >
> > > On 17 February 2017 at 12:15, Robert McLeod  wrote:
> > > > * bytes and unicode support
> > > > * reductions (mean, sum, prod, std)
> > >
> > > I use both a lot, maybe I can help you get them working.
> > >
> > > Also, regarding "Vectorization hasn't been done yet with cmath
> > > functions for real numbers (such as sqrt(), exp(), etc.), only for
> > > complex functions". What is the bottleneck? Is it in GCC or just
> > > someone has to sit down and adapt it?
> >
> > I just haven't done it yet.  Basically I'm moving from Switzerland to 
> > Canada in a week so this was the gap to push something out that's usable if 
> > not perfect. Rather I just import cmath functions, which are inlined but I 
> > suspect what's needed is to break them down into their components. For 
> > example, the complex arccos function looks like this:
> >
> > static void
> > nc_acos( npy_intp n, npy_complex64 *x, npy_complex64 *r)
> > {
> >     npy_complex64 a;
> >     for( npy_intp I = 0; I < n; I++ ) {
> >         a = x[I];
> >         _inline_mul( x[I], x[I], r[I] );
> >         _inline_sub( Z_1, r[I], r[I] );
> >         _inline_sqrt( r[I], r[I] );
> >         _inline_muli( r[I], r[I] );
> >         _inline_add( a, r[I], r[I] );
> >         _inline_log( r[I] , r[I] );
> >         _inline_muli( r[I], r[I] );
> >         _inline_neg( r[I], r[I]);
> >     }
> > }
> >
> > I haven't sat down and inspected whether the cmath versions get vectorized, 
> > but there's not a huge speed difference between NE2 and 3 for such a 
> > function on float (but their is for complex), so my suspicion is they 
> > aren't.  Another option would be to add a library such as Yeppp! as 
> > LIB_YEPPP or some other library that's faster than glib.  For example the 
> > glib function "fma(a,b,c)" is slower than doing "a*b+c" in NE3, and that's 
> > not how it should be.  Yeppp is also built with Python generating C code, 
> > so it could either be very easy or very hard.
> >
> > On bytes and unicode, I haven't seen examples for how people use it, so I'm 
> > not sure where to start. Since there's practically not a limitation on the 
> > number of operations now (the library is 1.3 MB now, compared to 1.2 MB for 
> > NE2 with gcc 5.4) the string functions could grow significantly from what 
> > we have in NE2.
> >
> > With regards to reductions, NumExpr never multi-threaded them, and could 
> > only do outer reductions, so in the end there was no speed advantage to be 
> > had compared to having NumPy do them on the result.  I suspect the primary 
> > value there was in PyTables and Pandas where the expression had to do 
> > everything.  One of the things I've moved away from in NE3 is doing output 
> > buffering (rather it pre-allocates the output array), so for reductions the 
> > understanding NumExpr has of broadcasting would have to be deeper.
> >
> > In any event contributions would certainly be welcome.
> >
> > Robert
> >
> --
> Robert McLeod, Ph.D.
> Center for Cellular Imaging and Nano Analytics (C-CINA)
> Biozentrum der Universität Basel
> Mattenstrasse 26, 4058 Basel
> Work: +41.061.387.3225
> robert.mcl...@unibas.ch
> robert.mcl...@bsse.ethz.ch
> robbmcl...@gmail.com
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: NumExpr3 Alpha

2017-02-17 Thread Robert McLeod
Hi David,

Thanks for your comments, reply below the fold.

On Fri, Feb 17, 2017 at 4:34 PM, Daπid  wrote:

> This is very nice indeed!
>
> On 17 February 2017 at 12:15, Robert McLeod  wrote:
> > * bytes and unicode support
> > * reductions (mean, sum, prod, std)
>
> I use both a lot, maybe I can help you get them working.
>
> Also, regarding "Vectorization hasn't been done yet with cmath
> functions for real numbers (such as sqrt(), exp(), etc.), only for
> complex functions". What is the bottleneck? Is it in GCC or just
> someone has to sit down and adapt it?


I just haven't done it yet.  Basically I'm moving from Switzerland to
Canada in a week so this was the gap to push something out that's usable if
not perfect. Rather I just import cmath functions, which are inlined but I
suspect what's needed is to break them down into their components. For
example, the complex arccos function looks like this:

static void
nc_acos( npy_intp n, npy_complex64 *x, npy_complex64 *r)
{
npy_complex64 a;
for( npy_intp I = 0; I < n; I++ ) {
a = x[I];
_inline_mul( x[I], x[I], r[I] );
_inline_sub( Z_1, r[I], r[I] );
_inline_sqrt( r[I], r[I] );
_inline_muli( r[I], r[I] );
_inline_add( a, r[I], r[I] );
_inline_log( r[I] , r[I] );
_inline_muli( r[I], r[I] );
_inline_neg( r[I], r[I]);
}
}

I haven't sat down and inspected whether the cmath versions get vectorized,
but there's not a huge speed difference between NE2 and 3 for such a
function on float (but their is for complex), so my suspicion is they
aren't.  Another option would be to add a library such as Yeppp! as
LIB_YEPPP or some other library that's faster than glib.  For example the
glib function "fma(a,b,c)" is slower than doing "a*b+c" in NE3, and that's
not how it should be.  Yeppp is also built with Python generating C code,
so it could either be very easy or very hard.

On bytes and unicode, I haven't seen examples for how people use it, so I'm
not sure where to start. Since there's practically not a limitation on the
number of operations now (the library is 1.3 MB now, compared to 1.2 MB for
NE2 with gcc 5.4) the string functions could grow significantly from what
we have in NE2.

With regards to reductions, NumExpr never multi-threaded them, and could
only do outer reductions, so in the end there was no speed advantage to be
had compared to having NumPy do them on the result.  I suspect the primary
value there was in PyTables and Pandas where the expression had to do
everything.  One of the things I've moved away from in NE3 is doing output
buffering (rather it pre-allocates the output array), so for reductions the
understanding NumExpr has of broadcasting would have to be deeper.

In any event contributions would certainly be welcome.

Robert

-- 
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225 <061%20387%2032%2025>
robert.mcl...@unibas.ch
robert.mcl...@bsse.ethz.ch 
robbmcl...@gmail.com
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: NumExpr3 Alpha

2017-02-17 Thread Daπid
This is very nice indeed!

On 17 February 2017 at 12:15, Robert McLeod  wrote:
> * bytes and unicode support
> * reductions (mean, sum, prod, std)

I use both a lot, maybe I can help you get them working.

Also, regarding "Vectorization hasn't been done yet with cmath
functions for real numbers (such as sqrt(), exp(), etc.), only for
complex functions". What is the bottleneck? Is it in GCC or just
someone has to sit down and adapt it?
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: NumExpr3 Alpha

2017-02-17 Thread Francesc Alted
Yay!  This looks really exciting.  Thanks for all the hard work!

Francesc

2017-02-17 12:15 GMT+01:00 Robert McLeod :

> Hi everyone,
>
> I'm pleased to announce that a new branch of NumExpr has been developed
> that will hopefully lead to a new major version release in the future.  You
> can find the branch on the PyData github repository, and installation is as
> follows:
>
> git clone https://github.com/pydata/numexpr.git
> cd numexpr
> git checkout numexpr-3.0
> python setup.py install
>
> What's new?
> ==
>
> Faster
> -
>
> The operations were re-written in such a way that gcc can auto-vectorize
> the loops to use SIMD instructions. Each operation now has a strided and
> aligned branch, which improves performance on aligned arrays by ~ 40 %. The
> setup time for threads has been reduced, by removing an unnecessary
> abstraction layer, and various other minor re-factorizations, resulting in
> improved thread scaling.
>
> The combination of speed-ups means that NumExpr3 often runs 200-500 %
> faster than NumExpr2.6 on a machine with AVX2 support. The break-even point
> with NumPy is now roughly arrays with 64k-elements, compared to
> 256-512k-elements for NE2.
>
> Plot of comparative performance for NumPy versus NE2 versus NE3 over a
> range of array sizes are available at:
>
> http://entropyproduction.blogspot.ch/2017/02/introduction-
> to-numexpr-3-alpha.html
>
> More NumPy Datatypes
> 
>
> The program was re-factorized from a ascii-encoded byte code to a struct
> array, so that the operation space is now 65535 instead of 128.  As such,
> support for uint8, int8, uint16, int16, uint32, uint64, and complex64 data
> types was added.
>
> NumExpr3 now uses NumPy 'safe' casting rules. If an operation doesn't
> return the same result as NumPy, it's a bug.  In the future other casting
> styles will be added if there is a demand for them.
>
>
> More complete function set
> 
>
> With the enhanced operation space, almost the entire C++11 cmath function
> set is supported (if the compiler library has them; only C99 is expected).
> Also bitwise operations were added for all integer datatypes. There are now
> 436 operations/functions in NE3, with more to come, compared to 190 in NE2.
>
> Also a library-enum has been added to the op keys which allows multiple
> backend libraries to be linked to the interpreter, and changed on a
> per-expression basis, rather than picking between GNU std and Intel VML at
> compile time, for example.
>
>
> More complete Python language support
> --
>
> The Python compiler was re-written from scratch to use the CPython `ast`
> module and a functional programming approach. As such, NE3 now compiles a
> wider subset of the Python language. It supports multi-line evaluation, and
> assignment with named temporaries.  The new compiler spends considerably
> less time in Python to compile expressions, about 200 us for 'a*b' compared
> to 550 us for NE2.
>
> Compare for example:
>
> out_ne2 = ne2.evaluate( 'exp( -sin(2*a**2) - cos(2*b**2) -
> 2*a**2*b**2' )
>
> to:
>
> neObj = NumExp( '''a2 = a*a; b2 = b*b
> out_magic = exp( -sin(2*a2) - cos(2*b2) - 2*a2*b2''' )
>
> This is a contrived example but the multi-line approach will allow for
> cleaner code and more sophisticated algorithms to be encapsulated in a
> single NumExpr call. The convention is that intermediate assignment targets
> are named temporaries if they do not exist in the calling frame, and full
> assignment targets if they do, which provides a method for multiple
> returns. Single-level de-referencing (e.g. `self.data`) is also supported
> for increased convenience and cleaner code. Slicing still needs to be
> performed above the ne3.evaluate() or ne3.NumExpr() call.
>
>
> More maintainable
> -
>
> The code base was generally refactored to increase the prevalence of
> single-point declarations, such that modifications don't require extensive
> knowledge of the code. In NE2 a lot of code was generated by the
> pre-processor using nested #defines.  That has been replaced by a
> object-oriented Python code generator called by setup.py, which generates
> about 15k lines of C code with 1k lines of Python. The use of generated
> code with defined line numbers makes debugging threaded code simpler.
>
> The generator also builds the autotest portion of the test submodule, for
> checking equivalence between NumPy and NumExpr3 operations and functions.
>
>
> What's TODO compared to NE2?
> --
>
> * strided complex functions
> * Intel VML support (less necessary now with gcc auto-vectorization)
> * bytes and unicode support
> * reductions (mean, sum, prod, std)
>
>
> What I'm looking for feedback on
> 
>
> * String arrays: How do you use them?  How would unicode differ 

[Numpy-discussion] ANN: NumExpr3 Alpha

2017-02-17 Thread Robert McLeod
Hi everyone,

I'm pleased to announce that a new branch of NumExpr has been developed
that will hopefully lead to a new major version release in the future.  You
can find the branch on the PyData github repository, and installation is as
follows:

git clone https://github.com/pydata/numexpr.git
cd numexpr
git checkout numexpr-3.0
python setup.py install

What's new?
==

Faster
-

The operations were re-written in such a way that gcc can auto-vectorize
the loops to use SIMD instructions. Each operation now has a strided and
aligned branch, which improves performance on aligned arrays by ~ 40 %. The
setup time for threads has been reduced, by removing an unnecessary
abstraction layer, and various other minor re-factorizations, resulting in
improved thread scaling.

The combination of speed-ups means that NumExpr3 often runs 200-500 %
faster than NumExpr2.6 on a machine with AVX2 support. The break-even point
with NumPy is now roughly arrays with 64k-elements, compared to
256-512k-elements for NE2.

Plot of comparative performance for NumPy versus NE2 versus NE3 over a
range of array sizes are available at:

http://entropyproduction.blogspot.ch/2017/02/introduction-to-numexpr-3-
alpha.html

More NumPy Datatypes


The program was re-factorized from a ascii-encoded byte code to a struct
array, so that the operation space is now 65535 instead of 128.  As such,
support for uint8, int8, uint16, int16, uint32, uint64, and complex64 data
types was added.

NumExpr3 now uses NumPy 'safe' casting rules. If an operation doesn't
return the same result as NumPy, it's a bug.  In the future other casting
styles will be added if there is a demand for them.


More complete function set


With the enhanced operation space, almost the entire C++11 cmath function
set is supported (if the compiler library has them; only C99 is expected).
Also bitwise operations were added for all integer datatypes. There are now
436 operations/functions in NE3, with more to come, compared to 190 in NE2.

Also a library-enum has been added to the op keys which allows multiple
backend libraries to be linked to the interpreter, and changed on a
per-expression basis, rather than picking between GNU std and Intel VML at
compile time, for example.


More complete Python language support
--

The Python compiler was re-written from scratch to use the CPython `ast`
module and a functional programming approach. As such, NE3 now compiles a
wider subset of the Python language. It supports multi-line evaluation, and
assignment with named temporaries.  The new compiler spends considerably
less time in Python to compile expressions, about 200 us for 'a*b' compared
to 550 us for NE2.

Compare for example:

out_ne2 = ne2.evaluate( 'exp( -sin(2*a**2) - cos(2*b**2) - 2*a**2*b**2'
)

to:

neObj = NumExp( '''a2 = a*a; b2 = b*b
out_magic = exp( -sin(2*a2) - cos(2*b2) - 2*a2*b2''' )

This is a contrived example but the multi-line approach will allow for
cleaner code and more sophisticated algorithms to be encapsulated in a
single NumExpr call. The convention is that intermediate assignment targets
are named temporaries if they do not exist in the calling frame, and full
assignment targets if they do, which provides a method for multiple
returns. Single-level de-referencing (e.g. `self.data`) is also supported
for increased convenience and cleaner code. Slicing still needs to be
performed above the ne3.evaluate() or ne3.NumExpr() call.


More maintainable
-

The code base was generally refactored to increase the prevalence of
single-point declarations, such that modifications don't require extensive
knowledge of the code. In NE2 a lot of code was generated by the
pre-processor using nested #defines.  That has been replaced by a
object-oriented Python code generator called by setup.py, which generates
about 15k lines of C code with 1k lines of Python. The use of generated
code with defined line numbers makes debugging threaded code simpler.

The generator also builds the autotest portion of the test submodule, for
checking equivalence between NumPy and NumExpr3 operations and functions.


What's TODO compared to NE2?
--

* strided complex functions
* Intel VML support (less necessary now with gcc auto-vectorization)
* bytes and unicode support
* reductions (mean, sum, prod, std)


What I'm looking for feedback on


* String arrays: How do you use them?  How would unicode differ from bytes
strings?
* Interface: We now have a more object-oriented interface underneath the
familiar
  evaluate() interface. How would you like to use this interface?  Francesc
suggested
  generator support, as currently it's more difficult to use NumExpr within
a loop than
  it should be.


Ideas for the future
-

*