Re: [Numpy-discussion] numexpr with the new iterator

2011-01-11 Thread Francesc Alted
On Tuesday 11 January 2011 06:45:28, Mark Wiebe wrote:
> On Mon, Jan 10, 2011 at 11:35 AM, Mark Wiebe wrote:
> > I'm a bit curious why the jump from 1 to 2 threads is scaling so
> > poorly.  Your timings have improvement factors of 1.85, 1.68, 1.64,
> > and 1.79.  Since the computation is trivial data parallelism, and I
> > believe it's still pretty far off the memory bandwidth limit, I
> > would expect a speedup of 1.95 or higher.
> 
> It looks like it is the memory bandwidth which is limiting the
> scalability.

Indeed, this is an increasingly important problem for modern computers.  
You may want to read:

http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

;-)

> The slower operations scale much better than faster
> ones.  Below are some timings of successively faster operations. 
> When the operation is slow enough, it scales like I was expecting...
[clip]

Yeah, for another example on this with more threads, see:

http://code.google.com/p/numexpr/wiki/MultiThreadVM

OTOH, I was curious about the performance of the new iterator with 
Intel's VML, but it seems to work decently too:

$ python bench/vml_timing.py (original numexpr, *no* VML support)
*** Numexpr vs NumPy speed-ups ***
Contiguous case: 1.72 (mean), 0.92 (min), 3.07 (max)
Strided case:    2.1 (mean), 0.98 (min), 3.52 (max)
Unaligned case:  2.35 (mean), 1.35 (min), 3.31 (max)

$ python bench/vml_timing.py  (original numexpr, VML support)
*** Numexpr vs NumPy speed-ups ***
Contiguous case: 3.83 (mean), 1.1 (min), 10.19 (max)
Strided case:    3.21 (mean), 0.98 (min), 7.45 (max)
Unaligned case:  3.6 (mean), 1.47 (min), 7.87 (max)

$ python bench/vml_timing.py (new iter numexpr, VML support)
*** Numexpr vs NumPy speed-ups ***
Contiguous case: 3.56 (mean), 1.12 (min), 7.38 (max)
Strided case:    2.37 (mean), 0.09 (min), 7.63 (max)
Unaligned case:  3.56 (mean), 2.08 (min), 5.88 (max)

However, there are a couple of quirks here: 1) the original numexpr 
generally performs faster than the iter version, and 2) the strided case 
is quite a bit worse for the iter version.  I've isolated the tests that 
perform worse for the iter version; here are a couple of samples:

*** Expression: exp(f3)
    numpy:             0.0135
    numpy strided:     0.0144
    numpy unaligned:   0.0200
    numexpr:           0.0020  Speed-up of numexpr over numpy: 6.6584
    numexpr strided:   0.1495  Speed-up of numexpr over numpy: 0.0962
    numexpr unaligned: 0.0049  Speed-up of numexpr over numpy: 4.0859


*** Expression: sin(f3)>cos(f4)
    numpy:             0.0291
    numpy strided:     0.0366
    numpy unaligned:   0.0407
    numexpr:           0.0166  Speed-up of numexpr over numpy: 1.7518
    numexpr strided:   0.1551  Speed-up of numexpr over numpy: 0.2361
    numexpr unaligned: 0.0175  Speed-up of numexpr over numpy: 2.3246
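
For anyone wanting to reproduce the three layouts outside of
bench/vml_timing.py, here is a minimal sketch (my own construction, not
the benchmark's actual code) of how contiguous, strided and unaligned
float64 operands can be built and timed against each other:

import numpy as np
import numexpr as ne
from timeit import timeit

N = 1000*1000
x = np.random.rand(N)                    # contiguous
xs = np.random.rand(2*N)[::2]            # strided: every other element
buf = np.empty(N*8 + 1, dtype=np.uint8)  # unaligned: a float64 view
xu = buf[1:].view(np.float64)            # offset one byte into a buffer
xu[:] = x

for name, f3 in [("contiguous", x), ("strided", xs), ("unaligned", xu)]:
    t_np = timeit(lambda: np.exp(f3), number=10)
    t_ne = timeit(lambda: ne.evaluate("exp(f3)",
                                      local_dict={"f3": f3}), number=10)
    print("%-11s speed-up: %.2f" % (name, t_np / t_ne))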

Maybe you can shed some light on what's going on here (shall we discuss 
this off-the-list so as to not bore people too much?).

-- 
Francesc Alted


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-11 Thread Francesc Alted
On Monday 10 January 2011 19:29:33, Mark Wiebe wrote:
> > so, the new code is just < 5% slower.  I suppose that removing the
> > NPY_ITER_ALIGNED flag would give us a bit more performance, but
> > that's great as it is now.  How did you do that?  Your new_iter
> > branch in NumPy already deals with unaligned data, right?
> 
> Take a look at  lowlevel_strided_loops.c.src.  In this case, the
> buffering setup code calls PyArray_GetDTypeTransferFunction, which
> in turn calls PyArray_GetStridedCopyFn, which on an x86 platform
> returns
> _aligned_strided_to_contig_size8.  This function has a simple loop of
> copies using a npy_uint64 data type.

I see.  Brilliant!

> > Well, if you can support reduce operations with your patch that
> > would be extremely good news as I'm afraid that the current reduce
> > code is a bit broken in Numexpr (at least, I vaguely remember
> > seeing it working badly in some cases).
> 
> Cool, I'll take a look at some point.  I imagine with the most
> obvious implementation small reductions would perform poorly.

IMO, reductions like sum() or prod() are mainly limited by memory 
access, so my advice would be to not try to over-optimize here, and just 
make use of the new iterator.  We can refine performance later on.

-- 
Francesc Alted


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Mark Wiebe
On Mon, Jan 10, 2011 at 11:35 AM, Mark Wiebe  wrote:

> I'm a bit curious why the jump from 1 to 2 threads is scaling so poorly.
>  Your timings have improvement factors of 1.85, 1.68, 1.64, and 1.79.  Since
> the computation is trivial data parallelism, and I believe it's still pretty
> far off the memory bandwidth limit, I would expect a speedup of 1.95 or
> higher.


It looks like it is the memory bandwidth which is limiting the scalability.
The slower operations scale much better than faster ones.  Below are some
timings of successively faster operations.  When the operation is slow
enough, it scales like I was expecting...

-Mark

Computing: 'cos(x**1.1) + sin(x**1.3) + tan(x**2.3)' with 2000 points
Using numpy:
*** Time elapsed: 14.47
Using numexpr:
*** Time elapsed for 1 threads: 12.659000
*** Time elapsed for 2 threads: 6.357000
*** Ratio from 1 to 2 threads: 1.991348
Using numexpr_iter:
*** Time elapsed for 1 threads: 12.573000
*** Time elapsed for 2 threads: 6.398000
*** Ratio from 1 to 2 threads: 1.965145

Computing: 'x**2.345' with 2000 points
Using numpy:
*** Time elapsed: 3.506
Using numexpr:
*** Time elapsed for 1 threads: 3.375000
*** Time elapsed for 2 threads: 1.747000
*** Ratio from 1 to 2 threads: 1.931883
Using numexpr_iter:
*** Time elapsed for 1 threads: 3.266000
*** Time elapsed for 2 threads: 1.760000
*** Ratio from 1 to 2 threads: 1.855682

Computing: '1*x+2*x+3*x+4*x+5*x+6*x+7*x+8*x+9*x+10*x+11*x+12*x+13*x+14*x'
with 2000 points
Using numpy:
*** Time elapsed: 9.774
Using numexpr:
*** Time elapsed for 1 threads: 1.314000
*** Time elapsed for 2 threads: 0.703000
*** Ratio from 1 to 2 threads: 1.869132
Using numexpr_iter:
*** Time elapsed for 1 threads: 1.257000
*** Time elapsed for 2 threads: 0.683000
*** Ratio from 1 to 2 threads: 1.840410

Computing: 'x+2.345' with 2000 points
Using numpy:
*** Time elapsed: 0.343
Using numexpr:
*** Time elapsed for 1 threads: 0.348000
*** Time elapsed for 2 threads: 0.300000
*** Ratio from 1 to 2 threads: 1.160000
Using numexpr_iter:
*** Time elapsed for 1 threads: 0.354000
*** Time elapsed for 2 threads: 0.293000
*** Ratio from 1 to 2 threads: 1.208191
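
This pattern is what a simple two-term cost model predicts: the compute
part scales with threads, but memory traffic has a fixed bandwidth
floor.  A rough sketch with illustrative numbers (not measurements from
this thread):

# time ~= max(compute / nthreads, bytes_moved / bandwidth)
def predicted(compute_1thread, bytes_moved, bandwidth, nthreads):
    return max(compute_1thread / nthreads, bytes_moved / bandwidth)

# hypothetical numbers: an expensive kernel (3.2 s of compute) vs a
# cheap one (0.35 s), both moving 0.32 GB at ~1.2 GB/s
for t1 in (3.2, 0.35):
    t2 = predicted(t1, 0.32e9, 1.2e9, 2)
    print("1 -> 2 thread ratio: %.2f" % (t1 / t2))
# prints ~2.00 for the expensive kernel and ~1.31 for the cheap one:
# once the bandwidth floor dominates, extra threads stop helping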


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Mark Wiebe
I'm a bit curious why the jump from 1 to 2 threads is scaling so poorly.
Your timings have improvement factors of 1.85, 1.68, 1.64, and 1.79.  Since
the computation is trivial data parallelism, and I believe it's still pretty
far off the memory bandwidth limit, I would expect a speedup of 1.95 or
higher.

One reason I suggest TBB is that it can produce a pretty good schedule while
still adapting to load produced by other processes and threads.  Numexpr
currently does that well, but simply dividing the data into one piece per
thread doesn't handle that case very well, and makes it possible that one
thread spends a fair bit of time finishing up while the others idle at the
end.  Perhaps using Cilk would be a better option than TBB, since the code
could remain in C.
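
For illustration, here is a minimal Python sketch (my own, not
numexpr's actual C scheduler) of the kind of dynamic block scheduling
this is about: threads pull fixed-size blocks from a shared queue, so a
thread delayed by outside load simply ends up taking fewer blocks
instead of stalling the whole computation:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def evaluate_blocks(x, out, nthreads=2, block_size=4096):
    # stand-in kernel; numpy releases the GIL for the array
    # arithmetic, so blocks genuinely run in parallel
    def do_block(i):
        s = slice(i*block_size, (i + 1)*block_size)
        out[s] = ((.25*x[s] + .75)*x[s] - 1.5)*x[s] - 2
    nblocks = -(-len(x) // block_size)  # ceiling division
    with ThreadPoolExecutor(nthreads) as ex:
        # the executor's shared work queue hands blocks to whichever
        # thread is free, the load-adaptive behaviour TBB would give
        list(ex.map(do_block, range(nblocks)))

x = np.linspace(0, 1, 10**6)
out = np.empty_like(x)
evaluate_blocks(x, out)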

-Mark

On Mon, Jan 10, 2011 at 3:55 AM, Francesc Alted  wrote:

> On Monday 10 January 2011 11:05:27, Francesc Alted wrote:
> > Also, I'd like to try out the new thread scheduling that you
> > suggested to me privately (i.e. T0T1T0T1...  vs T0T0...T1T1...).
>
> [clip]


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Mark Wiebe
On Mon, Jan 10, 2011 at 9:47 AM, Francesc Alted  wrote:

> [clip]
> so, the new code is just < 5% slower.  I suppose that removing the
> NPY_ITER_ALIGNED flag would give us a bit more performance, but that's
> great as it is now.  How did you do that?  Your new_iter branch in NumPy
> already deals with unaligned data, right?
>

Take a look at  lowlevel_strided_loops.c.src.  In this case, the buffering
setup code calls PyArray_GetDTypeTransferFunction, which in turn calls
PyArray_GetStridedCopyFn, which on an x86 platform returns
_aligned_strided_to_contig_size8.  This function has a simple loop of copies
using a npy_uint64 data type.
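
As far as I know, this branch is what later became np.nditer in NumPy
1.6, so the buffering behaviour can be sketched from Python without
touching C.  A small illustration (mine, not code from the branch) of
iterating over an unaligned operand through aligned buffers:

import numpy as np

# build an unaligned float64 array: a view offset one byte into a
# byte buffer
buf = np.empty(100*8 + 1, dtype=np.uint8)
x = buf[1:].view(np.float64)
x[:] = np.arange(100.0)
out = np.empty(100)

# with buffering enabled, the iterator hands the inner loop aligned,
# contiguous chunks copied out of (and back into) the operands
it = np.nditer([x, out],
               flags=['external_loop', 'buffered'],
               op_flags=[['readonly'], ['writeonly']])
with it:
    for a, b in it:
        b[...] = a * 2.0

print(out[:3])   # [0. 2. 4.]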

> > The new code also needs support for the reduce operation.  I didn't
> > look too closely at the code for that, but a nested iteration
> > pattern is probably appropriate.  If the inner loop is just allowed
> > to be one dimension, it could be done without actually creating the
> > inner iterator.
>
> Well, if you can support reduce operations with your patch that would be
> extremely good news as I'm afraid that the current reduce code is a bit
> broken in Numexpr (at least, I vaguely remember seeing it working badly
> in some cases).
>

Cool, I'll take a look at some point.  I imagine with the most obvious
implementation small reductions would perform poorly.

-Mark


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Francesc Alted
On Monday 10 January 2011 17:54:16, Mark Wiebe wrote:
> > Apparently, you forgot to add the new_iterator_pywrap.h file.
> 
> Oops,  that's added now.

Excellent.  It works now.

> The aligned case should just be a matter of conditionally removing
> the NPY_ITER_ALIGNED flag in two places.

Wow, the support for unaligned in current `evaluate_iter()` seems pretty 
nice already:

$ python unaligned-simple.py 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Numexpr version:   1.5.dev
NumPy version: 2.0.0.dev-ebc963d
Python version:2.6.1 (r261:67515, Feb  3 2009, 17:34:37) 
[GCC 4.3.2 [gcc-4_3-branch revision 141291]]
Platform:  linux2-x86_64
AMD/Intel CPU? True
VML available? False
Detected cores:2
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
NumPy aligned:                  0.658 s
NumPy unaligned:                1.597 s
Numexpr aligned:                0.59 s
Numexpr aligned (new iter):     0.59 s
Numexpr unaligned:              0.51 s
Numexpr unaligned (new_iter):   0.528 s

so, the new code is just < 5% slower.  I suppose that removing the 
NPY_ITER_ALIGNED flag would give us a bit more performance, but that's 
great as it is now.  How did you do that?  Your new_iter branch in NumPy 
already deals with unaligned data, right?

> The new code also needs support for the reduce operation.  I didn't
> look too closely at the code for that, but a nested iteration
> pattern is probably appropriate.  If the inner loop is just allowed
> to be one dimension, it could be done without actually creating the
> inner iterator.

Well, if you can support reduce operations with your patch that would be 
extremely good news as I'm afraid that the current reduce code is a bit 
broken in Numexpr (at least, I vaguely remember seeing it working badly 
in some cases).

-- 
Francesc Alted


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Mark Wiebe
On Mon, Jan 10, 2011 at 2:05 AM, Francesc Alted  wrote:

> [clip]
> Your patch looks mostly fine to my eyes; good job!  Unfortunately, I've
> been unable to compile your new_iterator branch of NumPy:
>
> numpy/core/src/multiarray/multiarraymodule.c:45:33: fatal error:
> new_iterator_pywrap.h: No such file or directory
>
> Apparently, you forgot to add the new_iterator_pywrap.h file.
>

Oops,  that's added now.


> My idea would be to merge your patch in numexpr and make the new
> `evaluate_iter()` the default (i.e. make it `evaluate()`).  However, by
> looking into the code, it seems to me that unaligned arrays (this is an
> important use case when operating with columns of structured arrays) may
> need more fine-tuning for Intel platforms.  When I can compile the
> new_iterator branch, I'll give a try at unaligned data benchs.
>

The aligned case should just be a matter of conditionally removing the
NPY_ITER_ALIGNED flag in two places.

The new code also needs support for the reduce operation.  I didn't look too
closely at the code for that, but a nested iteration pattern is probably
appropriate.  If the inner loop is just allowed to be one dimension, it
could be done without actually creating the inner iterator.
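
As a rough illustration of the reduce pattern (using np.nditer, the
eventual Python-level face of this iterator; my sketch, not the patch's
code), a sum over the last axis can be expressed by broadcasting the
output over that axis with reduce_ok:

import numpy as np

a = np.arange(24.).reshape(2, 3, 4)
out = np.zeros((2, 3))

# op_axes maps out's two axes onto the first two iterator axes and
# broadcasts it (-1) over the last one, so each out element is
# visited once for every element reduced into it
it = np.nditer([a, out], flags=['reduce_ok'],
               op_flags=[['readonly'], ['readwrite']],
               op_axes=[None, [0, 1, -1]])
for x, y in it:
    y[...] += x

print(np.allclose(out, a.sum(axis=-1)))   # True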

-Mark


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Francesc Alted
On Monday 10 January 2011 11:05:27, Francesc Alted wrote:
> Also, I'd like to try out the new thread scheduling that you
> suggested to me privately (i.e. T0T1T0T1...  vs T0T0...T1T1...).

I've just implemented the new partition schema in numexpr 
(T0T0...T1T1..., the original being T0T1T0T1...).  I'm attaching the 
patch for this.  The results are a bit confusing.  For example, using 
the attached benchmark (poly.py), I get these results for a common 
dual-core, non-NUMA machine:

With the T0T1...T0T1... (original) schema:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.497
Using numexpr:
*** Time elapsed for 1 threads: 1.279000
*** Time elapsed for 2 threads: 0.688000

With the T0T0...T1T1... (new) schema:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.454
Using numexpr:
*** Time elapsed for 1 threads: 1.268000
*** Time elapsed for 2 threads: 0.754000

which is around 10% slower (2 threads) than the original partition.
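
To make the two schemes concrete, here is a small sketch (mine, not the
patch's actual C code) of how each scheme maps block indices to
threads:

# T0T1T0T1... (original): thread tid takes every nthreads-th block
def interleaved(tid, nblocks, nthreads):
    return range(tid, nblocks, nthreads)

# T0T0...T1T1... (new): thread tid takes one contiguous run of blocks
def contiguous(tid, nblocks, nthreads):
    per = -(-nblocks // nthreads)   # ceiling division
    return range(tid*per, min((tid + 1)*per, nblocks))

for tid in range(2):
    print(tid, list(interleaved(tid, 8, 2)), list(contiguous(tid, 8, 2)))
# 0 [0, 2, 4, 6] [0, 1, 2, 3]
# 1 [1, 3, 5, 7] [4, 5, 6, 7]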

The results are a bit different on a NUMA machine (8 physical cores, 16 
logical cores via hyper-threading):

With the T0T1...T0T1... (original) partition:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.005
Using numexpr:
*** Time elapsed for 1 threads: 1.109000
*** Time elapsed for 2 threads: 0.677000
*** Time elapsed for 3 threads: 0.496000
*** Time elapsed for 4 threads: 0.394000
*** Time elapsed for 5 threads: 0.324000
*** Time elapsed for 6 threads: 0.287000
*** Time elapsed for 7 threads: 0.247000
*** Time elapsed for 8 threads: 0.234000
*** Time elapsed for 9 threads: 0.242000
*** Time elapsed for 10 threads: 0.239000
*** Time elapsed for 11 threads: 0.241000
*** Time elapsed for 12 threads: 0.235000
*** Time elapsed for 13 threads: 0.226000
*** Time elapsed for 14 threads: 0.214000
*** Time elapsed for 15 threads: 0.235000
*** Time elapsed for 16 threads: 0.218000

With the T0T0...T1T1... (new) partition:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.003
Using numexpr:
*** Time elapsed for 1 threads: 1.106000
*** Time elapsed for 2 threads: 0.617000
*** Time elapsed for 3 threads: 0.442000
*** Time elapsed for 4 threads: 0.345000
*** Time elapsed for 5 threads: 0.296000
*** Time elapsed for 6 threads: 0.257000
*** Time elapsed for 7 threads: 0.237000
*** Time elapsed for 8 threads: 0.260000
*** Time elapsed for 9 threads: 0.245000
*** Time elapsed for 10 threads: 0.261000
*** Time elapsed for 11 threads: 0.238000
*** Time elapsed for 12 threads: 0.210000
*** Time elapsed for 13 threads: 0.218000
*** Time elapsed for 14 threads: 0.200000
*** Time elapsed for 15 threads: 0.235000
*** Time elapsed for 16 threads: 0.198000

In this case, the performance is similar, with perhaps a slight 
advantage for the new partition scheme, but I don't know if it is worth 
making it the default (probably not, as this partition performs clearly 
worse on non-NUMA machines).  At any rate, both partitions perform very 
close to the aggregated memory bandwidth of NUMA machines (around 10 
GB/s in the above case).
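
For reference, that bandwidth figure can be sanity-checked with simple
arithmetic.  The point count is garbled in the archive, so the count
below is purely my assumption:

# bandwidth ~= bytes moved / time; the minimum traffic per evaluation
# is one read of x plus one write of the result (float64 = 8 bytes)
def bandwidth_gbs(npoints, seconds):
    return npoints * 8 * 2 / seconds / 1e9

# with a *hypothetical* 1e8 points and the 0.214 s best time above:
print(bandwidth_gbs(1e8, 0.214))   # ~7.5 GB/s, the order of the
                                   # quoted ~10 GB/s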

In general, I don't think there is much point in using Intel's TBB in 
numexpr because the existing implementation already hits memory 
bandwidth limits pretty early (around 10 threads in the latter example).

-- 
Francesc Alted
Index: numexpr/interpreter.c
===================================================================
--- numexpr/interpreter.c	(revision 260)
+++ numexpr/interpreter.c	(working copy)
@@ -59,8 +59,6 @@
 int end_threads = 0; /* should exisiting threads end? */
 pthread_t threads[MAX_THREADS];  /* opaque structure for threads */
 int tids[MAX_THREADS];   /* ID per each thread */
-intp gindex; /* global index for all threads */
-int init_sentinels_done; /* sentinels initialized? */
 int giveup;  /* should parallel code giveup? */
 int force_serial;/* force serial code instead of parallel? */
 int pid = 0; /* the PID for this process */
@@ -1072,7 +1070,7 @@
 return 0;
 }
 
-/* VM engine for each threadi (general) */
+/* VM engine for each thread (general) */
 static inline int
 vm_engine_thread(char **mem, intp index, intp block_size,
   struct vm_params params, int *pc_error)
@@ -1086,11 +1084,11 @@
 /* Do the worker job for a certain thread */
 void *th_worker(void *tids)
 {
-/* int tid = *(int *)tids; */
-intp index; /* private copy of gindex */
+int tid = *(int *)tids;
+intp index;
 /* Parameters for threads */
-intp start;
-intp vlen;
+intp start, stop;
+intp vlen, nblocks, th_nblocks;
 intp block_size;
 struct vm_params params;
 int *pc_error;
@@ -1103,8 +1101,6 @@
 
 while (1) {
 
-init_sentinels_done = 0; /* sentinels have to be initialised yet */
-
  

Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Francesc Alted
On Sunday 09 January 2011 23:45:02, Mark Wiebe wrote:
> As a benchmark of C-based iterator usage and to make it work properly
> in a multi-threaded context, I've updated numexpr to use the new
> iterator.  In addition to some performance improvements, this also
> made it easy to add optional out= and order= parameters to the
> evaluate function.  The numexpr repository with this update is
> available here:
> 
> https://github.com/m-paradox/numexpr
> 
> To use it, you need the new_iterator branch of NumPy from here:
> 
> https://github.com/m-paradox/numpy
> 
> In all cases tested, the iterator version of numexpr's evaluate
> function matches or beats the standard version.  The timing results
> are below, with some explanatory comments placed inline:
[clip]

Your patch looks mostly fine to my eyes; good job!  Unfortunately, I've 
been unable to compile your new_iterator branch of NumPy:

numpy/core/src/multiarray/multiarraymodule.c:45:33: fatal error: 
new_iterator_pywrap.h: No such file or directory

Apparently, you forgot to add the new_iterator_pywrap.h file.

My idea would be to merge your patch in numexpr and make the new 
`evaluate_iter()` the default (i.e. make it `evaluate()`).  However, by 
looking into the code, it seems to me that unaligned arrays (an 
important use case when operating with columns of structured arrays) may 
need more fine-tuning for Intel platforms.  When I can compile the 
new_iterator branch, I'll give the unaligned data benchmarks a try.

Also, I'd like to try out the new thread scheduling that you suggested 
to me privately (i.e. T0T1T0T1...  vs T0T0...T1T1...).

Thanks!

-- 
Francesc Alted


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-09 Thread Mark Wiebe
That's right, essentially all I've done is replace the code that handled
preparing the arrays and producing blocks of values for the inner loops.
There are three new parameters to evaluate_iter as well.  It has an "out="
parameter just like ufuncs do, an "order=" parameter which controls the
layout of the output if it's created by the function, and a "casting="
parameter which controls what kind of data conversions are permitted.
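
For instance, assuming evaluate_iter mirrors the ufunc conventions as
described (hypothetical usage on my part, not taken from the thread):

In [27]: out = np.empty_like(a)

In [28]: ne.evaluate_iter("a**2 + b**2 + 2*a*b", out=out, order='C',
   ....:                  casting='same_kind')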

-Mark

On Sun, Jan 9, 2011 at 3:33 PM, John Salvatier wrote:

> Is evaluate_iter basically numexpr but using your numpy branch or are
> there other changes?
>
> On Sun, Jan 9, 2011 at 2:45 PM, Mark Wiebe  wrote:
>
>> [clip]


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-09 Thread John Salvatier
Is evaluate_iter basically numexpr but using your numpy branch or are there
other changes?

On Sun, Jan 9, 2011 at 2:45 PM, Mark Wiebe  wrote:

> [clip]


[Numpy-discussion] numexpr with the new iterator

2011-01-09 Thread Mark Wiebe
As a benchmark of C-based iterator usage and to make it work properly in a
multi-threaded context, I've updated numexpr to use the new iterator.  In
addition to some performance improvements, this also made it easy to add
optional out= and order= parameters to the evaluate function.  The numexpr
repository with this update is available here:

https://github.com/m-paradox/numexpr

To use it, you need the new_iterator branch of NumPy from here:

https://github.com/m-paradox/numpy

In all cases tested, the iterator version of numexpr's evaluate function
matches or beats the standard version.  The timing results are below, with
some explanatory comments placed inline:

-Mark

In [1]: import numexpr as ne

# numexpr front page example

In [2]: a = np.arange(1e6)
In [3]: b = np.arange(1e6)

In [4]: timeit a**2 + b**2 + 2*a*b
1 loops, best of 3: 121 ms per loop

In [5]: ne.set_num_threads(1)

# iterator version performance matches standard version

In [6]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 24.8 ms per loop
In [7]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 24.3 ms per loop

In [8]: ne.set_num_threads(2)

# iterator version performance matches standard version

In [9]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 21 ms per loop
In [10]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 20.5 ms per loop

# numexpr front page example with a 10x bigger array

In [11]: a = np.arange(1e7)
In [12]: b = np.arange(1e7)

In [13]: ne.set_num_threads(2)

# the iterator version performance improvement is due to
# a small task scheduler tweak

In [14]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 282 ms per loop
In [15]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 255 ms per loop

# numexpr front page example with a Fortran contiguous array

In [16]: a = np.arange(1e7).reshape(10,100,100,100).T
In [17]: b = np.arange(1e7).reshape(10,100,100,100).T

In [18]: timeit a**2 + b**2 + 2*a*b
1 loops, best of 3: 3.22 s per loop

In [19]: ne.set_num_threads(1)

# even with a C-ordered output, the iterator version performs better

In [20]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 3.74 s per loop
In [21]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 379 ms per loop
In [22]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b", order='C')
1 loops, best of 3: 2.03 s per loop

In [23]: ne.set_num_threads(2)

# the standard version just uses 1 thread here, I believe
# the iterator version performs the same as for the flat 1e7-sized array

In [24]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 3.92 s per loop
In [25]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 254 ms per loop
In [26]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b", order='C')
1 loops, best of 3: 1.74 s per loop