Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-21 Thread Andrew Collette
Hi,

I get identical results for both shapes now; I manually removed the
"numexpr-1.1.1.dev-py2.5-linux-i686.egg" folder in site-packages and
reinstalled.  I suppose there must have been a stale set of files
somewhere.

Andrew Collette

On Wed, Jan 21, 2009 at 3:41 AM, Francesc Alted  wrote:
> On Tuesday 20 January 2009, Andrew Collette wrote:
>> Works much, much better with the current svn version. :) Numexpr now
>> outperforms everything except the "simple" technique, and then only
>> for small data sets.
>
> Correct.  This is because of the cost of parsing the expression and
> initializing the virtual machine.  However, as soon as the sizes of the
> operands exceed the cache of your processor, you start to see
> the improvement in performance.
>
>> Along the lines you mentioned I noticed that simply changing from a
>> shape of (100*100*100,) to (100, 100, 100) results in nearly a factor
>> of 2 worse performance, a factor which seems constant when changing
>> the size of the data set.
>
> Sorry, but I cannot reproduce this.  When using the expression:
>
> "63 + (a*b) + (c**2) + b"
>
> I get on my machine (co...@3 GHz, running openSUSE Linux 11.1):
>
> 100 f8 (average of 10 runs)
> Simple:  0.0278068065643
> Numexpr:  0.00839750766754
> Chunked:  0.0266514062881
>
> (100, 100, 100) f8 (average of 10 runs)
> Simple:  0.0277318000793
> Numexpr:  0.00851640701294
> Chunked:  0.0346593856812
>
> and these are the expected results (i.e. no change in performance due to
> multidimensional arrays).  Even for larger arrays, I don't see anything
> unexpected:
>
> 1000 f8 (average of 10 runs)
> Simple:  0.334054994583
> Numexpr:  0.110022115707
> Chunked:  0.29678030014
>
> (100, 100, 100, 10) f8 (average of 10 runs)
> Simple:  0.339299607277
> Numexpr:  0.111632704735
> Chunked:  0.375299096107
>
> Can you tell us which platform you are using?
>
>> Is this related to the way numexpr handles
>> broadcasting rules?  It would seem the memory contents should be
>> identical for these two cases.
>>
>> Andrew
>>
>> On Tue, Jan 20, 2009 at 6:13 AM, Francesc Alted  wrote:
>> > On Tuesday 20 January 2009, Andrew Collette wrote:
>> >> Hi Francesc,
>> >>
>> >> Looks like a cool project!  However, I'm not able to achieve the
>> >> advertised speed-ups.  I wrote a simple script to try three
>> >> approaches to this kind of problem:
>> >>
>> >> 1) Native Python code (i.e. will try to do everything at once
>> >> using temp arrays)
>> >> 2) Straightforward numexpr evaluation
>> >> 3) Simple "chunked" evaluation using array.flat views.  (This
>> >> solves the memory problem and allows the use of arbitrary Python
>> >> expressions).
>> >>
>> >> I've attached the script; here's the output for the expression
>> >> "63 + (a*b) + (c**2) + sin(b)"
>> >> along with a few combinations of shapes/dtypes.  As expected,
>> >> using anything other than "f8" (double) results in a performance
>> >> penalty. Surprisingly, it seems that using chunks via array.flat
>> >> results in similar performance for f8, and even better performance
>> >> for other dtypes.
>> >
>> > [clip]
>> >
>> > Well, there were two issues there.  The first one is that when
>> > transcendental functions are used (like sin() above), the
>> > bottleneck is on the CPU instead of memory bandwidth, so numexpr
>> > speedups are not so high as usual.  The other issue was an actual
>> > bug in the numexpr code that forced a copy of all multidimensional
>> > arrays (I normally only use one-dimensional arrays for doing
>> > benchmarks).  This has been fixed in trunk (r39).
>> >
>> > So, with the fix on, the timings are:
>> >
>> > (100, 100, 100) f4 (average of 10 runs)
>> > Simple:  0.0426136016846
>> > Numexpr:  0.11350851059
>> > Chunked:  0.0635252952576
>> > (100, 100, 100) f8 (average of 10 runs)
>> > Simple:  0.119254398346
>> > Numexpr:  0.10092959404
>> > Chunked:  0.128384995461
>> >
>> > The speed-up is now a mere 20% (for f8), but at least it is not
>> > slower. With the patches that Gregor recently contributed for using
>> > Intel's VML, the acceleration is a bit better:
>> >
>> > (100, 100, 100) f4 (average of 10 runs)
>> > Simple:  0.0417867898941
>> > Numexpr:  0.0944641113281
>> > Chunked:  0.0636183023453
>> > (100, 100, 100) f8 (average of 10 runs)
>> > Simple:  0.120059680939
>> > Numexpr:  0.0832288980484
>> > Chunked:  0.128114104271
>> >
>> > i.e. the speed-up is around 45% (for f8).
>> >
>> > Moreover, if I get rid of the sin() function and use the expression:
>> >
>> > "63 + (a*b) + (c**2) + b"
>> >
>> > I get:
>> >
>> > (100, 100, 100) f4 (average of 10 runs)
>> > Simple:  0.0119329929352
>> > Numexpr:  0.0198570966721
>> > Chunked:  0.0338240146637
>> > (100, 100, 100) f8 (average of 10 runs)
>> > Simple:  0.0255623102188
>> > Numexpr:  0.00832500457764
>> > Chunked:  0.0340095996857
>> >
>> > which has a 3.1x speedup (for f8).
>> >
>> >> FYI, the current tar file (1.1-1) has a glitch related to the
>> >> VERSION file; I added to the bug report at google code.

Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-21 Thread Francesc Alted
On Tuesday 20 January 2009, Andrew Collette wrote:
> Works much, much better with the current svn version. :) Numexpr now
> outperforms everything except the "simple" technique, and then only
> for small data sets.

Correct.  This is because of the cost of parsing the expression and 
initializing the virtual machine.  However, as soon as the sizes of the 
operands exceed the cache of your processor, you start to see
the improvement in performance.
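
A minimal timing sketch of that crossover, in the same style as the benchmark script later in this thread (assuming numexpr is installed; the exact ratios are machine-dependent):

import time
import numpy as np
import numexpr as ne

# For tiny operands the parse/VM-setup overhead dominates; once the
# operands no longer fit in cache, numexpr pulls ahead.
for n in (10**3, 10**7):
    a = np.random.rand(n)
    b = np.random.rand(n)

    t0 = time.time()
    for _ in xrange(20):
        res_np = 2*a + 3*b
    t_np = time.time() - t0

    t0 = time.time()
    for _ in xrange(20):
        res_ne = ne.evaluate("2*a + 3*b", local_dict={'a': a, 'b': b})
    t_ne = time.time() - t0

    print "n=%d  numpy/numexpr time ratio: %.2f" % (n, t_np / t_ne)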

> Along the lines you mentioned I noticed that simply changing from a
> shape of (100*100*100,) to (100, 100, 100) results in nearly a factor
> of 2 worse performance, a factor which seems constant when changing
> the size of the data set.

Sorry, but I cannot reproduce this.  When using the expression:

"63 + (a*b) + (c**2) + b"

I get on my machine (co...@3 GHz, running openSUSE Linux 11.1):

100 f8 (average of 10 runs)
Simple:  0.0278068065643
Numexpr:  0.00839750766754
Chunked:  0.0266514062881

(100, 100, 100) f8 (average of 10 runs)
Simple:  0.0277318000793
Numexpr:  0.00851640701294
Chunked:  0.0346593856812

and these are the expected results (i.e. no change in performance due to 
multidimensional arrays).  Even for larger arrays, I don't see anything
unexpected:

1000 f8 (average of 10 runs)
Simple:  0.334054994583
Numexpr:  0.110022115707
Chunked:  0.29678030014

(100, 100, 100, 10) f8 (average of 10 runs)
Simple:  0.339299607277
Numexpr:  0.111632704735
Chunked:  0.375299096107

Can you tell us which platform you are using?

> Is this related to the way numexpr handles 
> broadcasting rules?  It would seem the memory contents should be
> identical for these two cases.
>
> Andrew
>
> On Tue, Jan 20, 2009 at 6:13 AM, Francesc Alted  wrote:
> > On Tuesday 20 January 2009, Andrew Collette wrote:
> >> Hi Francesc,
> >>
> >> Looks like a cool project!  However, I'm not able to achieve the
> >> advertised speed-ups.  I wrote a simple script to try three
> >> approaches to this kind of problem:
> >>
> >> 1) Native Python code (i.e. will try to do everything at once
> >> using temp arrays)
> >> 2) Straightforward numexpr evaluation
> >> 3) Simple "chunked" evaluation using array.flat views.  (This
> >> solves the memory problem and allows the use of arbitrary Python
> >> expressions).
> >>
> >> I've attached the script; here's the output for the expression
> >> "63 + (a*b) + (c**2) + sin(b)"
> >> along with a few combinations of shapes/dtypes.  As expected,
> >> using anything other than "f8" (double) results in a performance
> >> penalty. Surprisingly, it seems that using chunks via array.flat
> >> results in similar performance for f8, and even better performance
> >> for other dtypes.
> >
> > [clip]
> >
> > Well, there were two issues there.  The first one is that when
> > transcendental functions are used (like sin() above), the
> > bottleneck is on the CPU instead of memory bandwidth, so numexpr
> > speedups are not so high as usual.  The other issue was an actual
> > bug in the numexpr code that forced a copy of all multidimensional
> > arrays (I normally only use one-dimensional arrays for doing
> > benchmarks).  This has been fixed in trunk (r39).
> >
> > So, with the fix on, the timings are:
> >
> > (100, 100, 100) f4 (average of 10 runs)
> > Simple:  0.0426136016846
> > Numexpr:  0.11350851059
> > Chunked:  0.0635252952576
> > (100, 100, 100) f8 (average of 10 runs)
> > Simple:  0.119254398346
> > Numexpr:  0.10092959404
> > Chunked:  0.128384995461
> >
> > The speed-up is now a mere 20% (for f8), but at least it is not
> > slower. With the patches that Gregor recently contributed for using
> > Intel's VML, the acceleration is a bit better:
> >
> > (100, 100, 100) f4 (average of 10 runs)
> > Simple:  0.0417867898941
> > Numexpr:  0.0944641113281
> > Chunked:  0.0636183023453
> > (100, 100, 100) f8 (average of 10 runs)
> > Simple:  0.120059680939
> > Numexpr:  0.0832288980484
> > Chunked:  0.128114104271
> >
> > i.e. the speed-up is around 45% (for f8).
> >
> > Moreover, if I get rid of the sin() function and use the expression:
> >
> > "63 + (a*b) + (c**2) + b"
> >
> > I get:
> >
> > (100, 100, 100) f4 (average of 10 runs)
> > Simple:  0.0119329929352
> > Numexpr:  0.0198570966721
> > Chunked:  0.0338240146637
> > (100, 100, 100) f8 (average of 10 runs)
> > Simple:  0.0255623102188
> > Numexpr:  0.00832500457764
> > Chunked:  0.0340095996857
> >
> > which has a 3.1x speedup (for f8).
> >
> >> FYI, the current tar file (1.1-1) has a glitch related to the
> >> VERSION file; I added to the bug report at google code.
> >
> > Thanks.  I will focus on that asap.  Mmm, seems like there is enough
> > stuff for another release of numexpr.  I'll try to do it soon.
> >
> > Cheers,
> >
> > --
> > Francesc Alted
> > ___
> > Numpy-discussion mailing list
> > Numpy-discussion@scipy.org
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
>

Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-20 Thread Andrew Collette
Works much, much better with the current svn version. :) Numexpr now
outperforms everything except the "simple" technique, and then only
for small data sets.

Along the lines you mentioned I noticed that simply changing from a
shape of (100*100*100,) to (100, 100, 100) results in nearly a factor
of 2 worse performance, a factor which seems constant when changing
the size of the data set.  Is this related to the way numexpr handles
broadcasting rules?  It would seem the memory contents should be
identical for these two cases.

Andrew

On Tue, Jan 20, 2009 at 6:13 AM, Francesc Alted  wrote:
> On Tuesday 20 January 2009, Andrew Collette wrote:
>> Hi Francesc,
>>
>> Looks like a cool project!  However, I'm not able to achieve the
>> advertised speed-ups.  I wrote a simple script to try three
>> approaches to this kind of problem:
>>
>> 1) Native Python code (i.e. will try to do everything at once using
>> temp arrays) 2) Straightforward numexpr evaluation
>> 3) Simple "chunked" evaluation using array.flat views.  (This solves
>> the memory problem and allows the use of arbitrary Python
>> expressions).
>>
>> I've attached the script; here's the output for the expression
>> "63 + (a*b) + (c**2) + sin(b)"
>> along with a few combinations of shapes/dtypes.  As expected, using
>> anything other than "f8" (double) results in a performance penalty.
>> Surprisingly, it seems that using chunks via array.flat results in
>> similar performance for f8, and even better performance for other
>> dtypes.
> [clip]
>
> Well, there were two issues there.  The first one is that when
> transcendental functions are used (like sin() above), the bottleneck is
> on the CPU instead of memory bandwidth, so numexpr speedups are not so
> high as usual.  The other issue was an actual bug in the numexpr code
> that forced a copy of all multidimensional arrays (I normally only use
> one-dimensional arrays for doing benchmarks).  This has been fixed in
> trunk (r39).
>
> So, with the fix on, the timings are:
>
> (100, 100, 100) f4 (average of 10 runs)
> Simple:  0.0426136016846
> Numexpr:  0.11350851059
> Chunked:  0.0635252952576
> (100, 100, 100) f8 (average of 10 runs)
> Simple:  0.119254398346
> Numexpr:  0.10092959404
> Chunked:  0.128384995461
>
> The speed-up is now a mere 20% (for f8), but at least it is not slower.
> With the patches that Gregor recently contributed for using Intel's VML,
> the acceleration is a bit better:
>
> (100, 100, 100) f4 (average of 10 runs)
> Simple:  0.0417867898941
> Numexpr:  0.0944641113281
> Chunked:  0.0636183023453
> (100, 100, 100) f8 (average of 10 runs)
> Simple:  0.120059680939
> Numexpr:  0.0832288980484
> Chunked:  0.128114104271
>
> i.e. the speed-up is around 45% (for f8).
>
> Moreover, if I get rid of the sin() function and use the expression:
>
> "63 + (a*b) + (c**2) + b"
>
> I get:
>
> (100, 100, 100) f4 (average of 10 runs)
> Simple:  0.0119329929352
> Numexpr:  0.0198570966721
> Chunked:  0.0338240146637
> (100, 100, 100) f8 (average of 10 runs)
> Simple:  0.0255623102188
> Numexpr:  0.00832500457764
> Chunked:  0.0340095996857
>
> which has a 3.1x speedup (for f8).
>
>> FYI, the current tar file (1.1-1) has a glitch related to the VERSION
>> file; I added to the bug report at google code.
>
> Thanks.  I will focus on that asap.  Mmm, seems like there is enough stuff
> for another release of numexpr.  I'll try to do it soon.
>
> Cheers,
>
> --
> Francesc Alted
> ___
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-20 Thread Francesc Alted
On Tuesday 20 January 2009, Andrew Collette wrote:
> Hi Francesc,
>
> Looks like a cool project!  However, I'm not able to achieve the
> advertised speed-ups.  I wrote a simple script to try three
> approaches to this kind of problem:
>
> 1) Native Python code (i.e. will try to do everything at once using
> temp arrays) 2) Straightforward numexpr evaluation
> 3) Simple "chunked" evaluation using array.flat views.  (This solves
> the memory problem and allows the use of arbitrary Python
> expressions).
>
> I've attached the script; here's the output for the expression
> "63 + (a*b) + (c**2) + sin(b)"
> along with a few combinations of shapes/dtypes.  As expected, using
> anything other than "f8" (double) results in a performance penalty.
> Surprisingly, it seems that using chunks via array.flat results in
> similar performance for f8, and even better performance for other
> dtypes.
[clip]

Well, there were two issues there.  The first one is that when 
transcendental functions are used (like sin() above), the bottleneck is 
on the CPU instead of memory bandwidth, so numexpr speedups are not so 
high as usual.  The other issue was an actual bug in the numexpr code 
that forced a copy of all multidimensional arrays (I normally only use 
one-dimensional arrays for doing benchmarks).  This has been fixed in
trunk (r39).

So, with the fix on, the timings are:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0426136016846
Numexpr:  0.11350851059
Chunked:  0.0635252952576
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.119254398346
Numexpr:  0.10092959404
Chunked:  0.128384995461

The speed-up is now a mere 20% (for f8), but at least it is not slower.  
With the patches that Gregor recently contributed for using Intel's VML,
the acceleration is a bit better:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0417867898941
Numexpr:  0.0944641113281
Chunked:  0.0636183023453
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.120059680939
Numexpr:  0.0832288980484
Chunked:  0.128114104271

i.e. the speed-up is around 45% (for f8).

Moreover, if I get rid of the sin() function and use the expression:

"63 + (a*b) + (c**2) + b"

I get:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0119329929352
Numexpr:  0.0198570966721
Chunked:  0.0338240146637
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.0255623102188
Numexpr:  0.00832500457764
Chunked:  0.0340095996857

which has a 3.1x speedup (for f8).

> FYI, the current tar file (1.1-1) has a glitch related to the VERSION
> file; I added to the bug report at google code.

Thanks.  I will focus on that asap.  Mmm, seems like there is enough stuff
for another release of numexpr.  I'll try to do it soon.

Cheers,

-- 
Francesc Alted
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-19 Thread Andrew Collette
Hi Francesc,

Looks like a cool project!  However, I'm not able to achieve the
advertised speed-ups.  I wrote a simple script to try three approaches
to this kind of problem:

1) Native Python code (i.e. will try to do everything at once using temp arrays)
2) Straightforward numexpr evaluation
3) Simple "chunked" evaluation using array.flat views.  (This solves
the memory problem and allows the use of arbitrary Python
expressions).

I've attached the script; here's the output for the expression
"63 + (a*b) + (c**2) + sin(b)"
along with a few combinations of shapes/dtypes.  As expected, using
anything other than "f8" (double) results in a performance penalty.
Surprisingly, it seems that using chunks via array.flat results in
similar performance for f8, and even better performance for other
dtypes.

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.155238199234
Numexpr:  0.278440499306
Chunked:  0.166213512421

(100, 100, 100) f8 (average of 10 runs)
Simple:  0.241649699211
Numexpr:  0.192837905884
Chunked:  0.183888602257

(100, 100, 100, 10) f4 (average of 10 runs)
Simple:  1.56741549969
Numexpr:  3.40679829121
Chunked:  1.83729870319

(100, 100, 100) i4 (average of 10 runs)
Simple:  0.206279683113
Numexpr:  0.210431909561
Chunked:  0.182894086838

FYI, the current tar file (1.1-1) has a glitch related to the VERSION
file; I added to the bug report at google code.

Andrew Collette

On Fri, Jan 16, 2009 at 4:00 AM, Francesc Alted  wrote:
> =========================
>  Announcing Numexpr 1.1
> =========================
>
> Numexpr is a fast numerical expression evaluator for NumPy.  With it,
> expressions that operate on arrays (like "3*a+4*b") are accelerated
> and use less memory than doing the same calculation in Python.
>
> The expected speed-ups for Numexpr with respect to NumPy are between 0.95x
> and 15x, with 3x or 4x being typical values.  The strided and unaligned
> case has been optimized too, so if the expression contains such arrays,
> the speed-up can increase significantly.  Of course, you will need to
> operate with large arrays (typically larger than the cache size of your
> CPU) to see these improvements in performance.
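
A minimal sketch of the strided case mentioned above: numexpr accepts non-contiguous views directly, here passed via evaluate()'s local_dict argument (variable names are illustrative only):

import numpy as np
import numexpr as ne

a = np.arange(2000000.0)
b = np.arange(2000000.0)
# strided, non-contiguous views go straight into the virtual machine
res = ne.evaluate("3*a + 4*b", local_dict={'a': a[::2], 'b': b[::2]})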
>
> This release is mainly intended to sync up some of the
> improvements from the Numexpr version integrated in PyTables.
> So, this standalone version of Numexpr will benefit from the well-tested
> PyTables version that has been in production for more than a year now.
>
> In case you want to know more in detail what has changed in this
> version, have a look at ``RELEASE_NOTES.txt`` in the tarball.
>
>
> Where can I find Numexpr?
> =========================
>
> The project is hosted at Google code in:
>
> http://code.google.com/p/numexpr/
>
>
> Share your experience
> =====================
>
> Let us know of any bugs, suggestions, gripes, kudos, etc. you may
> have.
>
>
> Enjoy!
>
> --
> Francesc Alted
> ___
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
>

import numpy as np
import numexpr as nx
import time

test_shape = (100,100,100)   # All 3 arrays have this shape
test_dtype = 'i4'
nruns = 10   # Ensemble for timing

test_size = np.product(test_shape)

def chunkify(chunksize):
    """ Very stupid "chunk vectorizer" which keeps memory use down.
    This version requires all inputs to have the same number of elements,
    although it shouldn't be that hard to implement simple broadcasting.
    """

    def chunkifier(func):

        def wrap(*args):

            assert len(args) > 0
            assert all(len(a.flat) == len(args[0].flat) for a in args)

            nelements = len(args[0].flat)

            out = np.empty(args[0].shape)

            # Evaluate func() over chunksize-sized flat slices of the
            # inputs, writing each result into the preallocated output.
            for start in xrange(0, nelements, chunksize):
                stop = min(start + chunksize, nelements)
                iargs = tuple(a.flat[start:stop] for a in args)
                out.flat[start:stop] = func(*iargs)
            return out

        return wrap

    return chunkifier

test_func_str = "63 + (a*b) + (c**2) + sin(b)"

def test_func(a, b, c):
    return 63 + (a*b) + (c**2) + np.sin(b)

test_func_chunked = chunkify(100*100)(test_func)

# The actual data we'll use
a = np.arange(test_size, dtype=test_dtype).reshape(test_shape)
b = np.arange(test_size, dtype=test_dtype).reshape(test_shape)
c = np.arange(test_size, dtype=test_dtype).reshape(test_shape)


start1 = time.time()
for idx in xrange(nruns):
    result1 = test_func(a, b, c)
stop1 = time.time()

start2 = time.time()
for idx in xrange(nruns):
    result2 = nx.evaluate(test_func_str)
stop2 = time.time()

start3 = time.time()
for idx in xrange(nruns):
    result3 = test_func_chunked(a, b, c)
stop3 = time.time()

# Report average times in the format seen in the thread (the tail of the
# script was truncated in the archive; these prints reconstruct it).
print "%s %s (average of %s runs)" % (test_shape, test_dtype, nruns)
print "Simple: ", (stop1 - start1) / nruns
print "Numexpr: ", (stop2 - start2) / nruns
print "Chunked: ", (stop3 - start3) / nruns

Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-19 Thread jh
Thanks!  I think this will help the package attract a lot of users.

A couple of housekeeping things:

on http://code.google.com/p/numexpr:

  What it is? -> What is it? or What it is (no question mark)

on http://code.google.com/p/numexpr/wiki/Overview:

  The last example got incorporated as straight text somehow.

In Firefox, the first code example runs into the pastel boxes on the
right for modest-width browsers.  This is a common problem with
Firefox, but I think it comes from improper HTML code that IE somehow
deals with, rather than non-standard behavior in Firefox.

One thing I'd add is a benchmark example against numpy.  Make it
simple, so that people can copy and modify the benchmark code to test
their own performance improvements.

I added an entry for it on the Topical Software list.  Please check it
out and modify as you see fit.

--jh--
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-19 Thread Francesc Alted
On Sunday 18 January 2009, j...@physics.ucf.edu wrote:
> Francesc Alted wrote:
> > > > Numexpr is a fast numerical expression evaluator for NumPy. 
> > > > With it, expressions that operate on arrays (like "3*a+4*b")
> > > > are accelerated and use less memory than doing the same
> > > > calculation in Python.
> > >
> > > Please pardon my ignorance as I know this project has been around
> > > for a while.  This looks very exciting, but either it's
> > > cumbersome, or I'm not understanding exactly what's being fixed. 
> > > If you can accelerate evaluation, why not just integrate the
> > > faster math into numpy, rather than having two packages?  Or is
> > > this something that is only an advantage when the expression is
> > > given as a string (and why is that the case)?  It would be
> > > helpful if you could put the answer on your web page and in your
> > > standard release blurb in some compact form. I guess what I'm
> > > really looking for when I read one of those is a quick answer to
> > > the question "should I look into this?".
> >
> > Well, there is a link on the project page to the "Overview" section
> > of the wiki, but perhaps it is a bit hidden.  I've added some blurb as
> > you suggested on the main page and another link to the "Overview"
> > wiki page. Hope that, by reading the new blurb, you can see why it
> > accelerates expression evaluation with regard to NumPy.  If not,
> > tell me and I will try to come up with something more comprehensible.
>
> I did see the overview.  The addition you made is great but it's so
> far down that many won't get to it.  Even in its section, the meat of
> it is below three paragraphs that most users won't care about and
> many won't understand.  I've posted some notes on writing intros in
> Developer_Zone.
>
> In the following, I've reordered the page to address the questions of
> potential users first, edited it a bit, and fixed the example to
> conform to our doc standards (and 128->256; hope that was right). 
> See what you think...
[clip]

That's great!  I've heavily changed the docs on the project site.  I've
followed your advice in most places, but not always (a `Building`
section always has to be near the top of a manual, IMHO).

Thanks a lot for your contribution!

-- 
Francesc Alted
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-18 Thread jh
Francesc Alted wrote:

> > > Numexpr is a fast numerical expression evaluator for NumPy.  With
> > > it, expressions that operate on arrays (like "3*a+4*b") are
> > > accelerated and use less memory than doing the same calculation in
> > > Python.
>
> > Please pardon my ignorance as I know this project has been around for
> > a while.  This looks very exciting, but either it's cumbersome, or
> > I'm not understanding exactly what's being fixed.  If you can
> > accelerate evaluation, why not just integrate the faster math into
> > numpy, rather than having two packages?  Or is this something that is
> > only an advantage when the expression is given as a string (and why
> > is that the case)?  It would be helpful if you could put the answer
> > on your web page and in your standard release blurb in some compact
> > form. I guess what I'm really looking for when I read one of those is
> > a quick answer to the question "should I look into this?".

> Well, there is a link on the project page to the "Overview" section of
> the wiki, but perhaps it is a bit hidden.  I've added some blurb as you
> suggested on the main page and another link to the "Overview" wiki page.
> Hope that, by reading the new blurb, you can see why it accelerates
> expression evaluation with regard to NumPy.  If not, tell me and I will
> try to come up with something more comprehensible.

I did see the overview.  The addition you made is great but it's so
far down that many won't get to it.  Even in its section, the meat of
it is below three paragraphs that most users won't care about and many
won't understand.  I've posted some notes on writing intros in
Developer_Zone.

In the following, I've reordered the page to address the questions of
potential users first, edited it a bit, and fixed the example to
conform to our doc standards (and 128->256; hope that was right).  See
what you think...

** Description:

The numexpr package evaluates multiple-operator array expressions many
times faster than numpy can.  It accepts the expression as a string,
analyzes it, rewrites it more efficiently, and compiles it on the fly
to code for its internal virtual machine.  It's the next best thing to
writing the expression in C and compiling it with an optimizing
compiler (as scipy.weave does), but requires no compiler at runtime.

Using it is simple:

>>> import numpy as np
>>> import numexpr as ne
>>> a = np.arange(10)
>>> b = np.arange(0, 20, 2)
>>> c = ne.evaluate("2*a+3*b")
>>> c
array([ 0,  8, 16, 24, 32, 40, 48, 56, 64, 72])

** Why does it work?

There are two extremes to array expression evaluation.  Each binary
operation can run separately over the array elements and return a
temporary array.  This is what NumPy does: 2*a + 3*b uses three
temporary arrays as large as a or b.  This strategy wastes memory (a
problem if the arrays are large).  It is also not a good use of CPU
cache memory because the results of 2*a and 3*b will not be in cache
for the final addition if the arrays are large.

The other extreme is to loop over each element:

for i in xrange(len(a)):
    c[i] = 2*a[i] + 3*b[i]

This conserves memory and is good for the cache, but on each iteration
Python must check the type of each operand and select the correct
routine for each operation.  All but the first such checks are wasted,
as the input arrays are not changing.

numexpr uses an in-between approach.  Arrays are handled in chunks
(the first pass uses 256 elements).  As Python code, it looks
something like this:

r2 = empty(256)  # scratch "registers", allocated once and reused
r3 = empty(256)
for i in xrange(0, len(a), 256):
    r0 = a[i:i+256]
    r1 = b[i:i+256]
    multiply(r0, 2, r2)
    multiply(r1, 3, r3)
    add(r2, r3, r2)
    c[i:i+256] = r2

The 3-argument form of add() stores the result in the third argument,
instead of allocating a new array.  This achieves a good balance
between cache and branch prediction.  The virtual machine is written
entirely in C, which makes it faster than the Python above.

** Supported Operators (unchanged)

** Supported Functions (unchanged, but capitalize 'F')

** Usage Notes (no need to repeat the example)

Numexpr's principal routine is:

evaluate(ex, local_dict=None, global_dict=None, **kwargs)

ex is a string forming an expression, like "2*a+3*b".  The values for
a and b will by default be taken from the calling function's frame
(through the use of sys._getframe()).  Alternatively, they can be
specified using the local_dict or global_dict arguments, or passed as
keyword arguments.

Expressions are cached, so reuse is fast.  Arrays or scalars are
allowed for the variables, which must be of type 8-bit boolean (bool),
32-bit signed integer (int), 64-bit signed integer (long),
double-precision floating point number (float), 2x64-bit
double-precision complex number (complex), or raw string of bytes
(str).  The arrays must all be the same size.
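
A minimal usage sketch of the explicit-dictionary form (variable names are illustrative only):

import numpy as np
import numexpr as ne

# Pass the operands explicitly via local_dict instead of relying on
# sys._getframe() introspection of the caller's namespace.
a = np.arange(10, dtype='float64')
b = np.arange(10, dtype='float64')
c = ne.evaluate("2*a + 3*b", local_dict={'a': a, 'b': b})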

** Building (unchanged, but move down since it's standard and most
   users will only do this once, if ever)

** Implementation Notes (rest of current How It Works section)

** Credits

--jh--
___

Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-17 Thread David Cournapeau
On Sat, Jan 17, 2009 at 4:35 AM, Gregor Thalhammer
 wrote:
> Francesc Alted wrote:
>> On Friday 16 January 2009, Gregor Thalhammer wrote:
>>
>>> I also gave a try to the vector math library (VML), contained in
>>> Intel's Math Kernel Library. This offers a fast implementation of
>>> mathematical functions, operating on arrays. First I implemented a C
>>> extension, providing new ufuncs. This gave me a big performance gain,
>>> e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x
>>> (6x) for division (no gain for add, sub, mul).
>>>
>>
>> Wow, pretty nice speed-ups indeed!  In fact I was thinking of including
>> support for threading in Numexpr (I don't think it would be too
>> difficult, but let's see).  BTW, do you know how VML is able to achieve
>> a speedup of 6x for a sin() function?  I suppose this is because they
>> are using SSE instructions, but, are these also available for 64-bit
>> double precision items?
>>
> I am not an expert on SSE instructions, but to my knowledge there exists
> (in the Core 2 architecture) no SSE instruction to calculate the sin.
> But it seems to be possible to (approximately) calculate a sin with a
> couple of multiplication/addition instructions (and they exist in SSE
> for 64-bit floats). Intel (and AMD) seem to use a more clever algorithm,
> implemented more efficiently than the standard implementation.

Generally, transcendental functions are not sped up by being
implemented in hardware. There is no special algorithm: you implement
those as you would in C using Taylor expansions or other known
polynomial expansions, except you use SIMD to implement those
polynomial expansions. You can also use table lookup, which can be
pretty fast while still giving full precision for trigonometric
functions. musicdsp.org has some of those (take care: a lot of those
tricks do not give full precision - they are used for music synthesis,
where full precision is rarely needed and speed is of utmost importance):

http://www.musicdsp.org
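
For illustration, a toy table-lookup sine in NumPy along those lines (a sketch only: linear interpolation over a precomputed table, trading accuracy for speed; real libraries are far more careful):

import numpy as np

TABLE_SIZE = 4096
# one extra entry so that table[i + 1] is always a valid index
_table = np.sin(np.linspace(0.0, 2.0 * np.pi, TABLE_SIZE + 1))

def table_sin(x):
    # map x onto table index space, wrapping by 2*pi
    idx = (np.asarray(x) % (2.0 * np.pi)) * (TABLE_SIZE / (2.0 * np.pi))
    i = idx.astype(np.int64)
    frac = idx - i
    # linear interpolation between neighbouring table entries
    return _table[i] * (1.0 - frac) + _table[i + 1] * frac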

There were some examples on freescale.com, in full precision, but I
can't find them anymore.

For some functions, you can get almost one order of magnitude faster
transcendental functions (for full precision), but it is a lot of work
to make sure they work as expected in a cross-platform way (even
limiting to one CPU arch when using asm, there are differences between
compilers which make this rather difficult).
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Olivier Grisel
2009/1/16 Gregor Thalhammer :
> Francesc Alted wrote:
>>
>> Wow, pretty nice speed-ups indeed!  In fact I was thinking of including
>> support for threading in Numexpr (I don't think it would be too
>> difficult, but let's see).  BTW, do you know how VML is able to achieve
>> a speedup of 6x for a sin() function?  I suppose this is because they
>> are using SSE instructions, but, are these also available for 64-bit
>> double precision items?
>>
> I am not an expert on SSE instructions, but to my knowledge there exists
> (in the Core 2 architecture) no SSE instruction to calculate the sin.
> But it seems to be possible to (approximately) calculate a sin with a
> couple of multiplication/addition instructions (and they exist in SSE
> for 64-bit floats). Intel (and AMD) seem to use a more clever algorithm,

Here is the lib I use for SSE implementations of the transcendental functions:
http://gruntthepeon.free.fr/ssemath/

(only single-precision floats, though).

-- 
Olivier
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Ted Horst
Note that Apple has a similar library called vForce.

I think these libraries use several techniques and are not necessarily  
dependent on SSE.  The Apple versions appear to only support float and
double (no complex), and I don't see anything about strided arrays.   
At one point I thought there was talk of adding support for vForce  
into the respective ufuncs.  I don't know if anybody followed up on  
that.

On 2009-01-16, at 10:48, Francesc Alted wrote:

> Wow, pretty nice speed-ups indeed!  In fact I was thinking in  
> including
> support for threading in Numexpr (I don't think it would be too
> difficult, but let's see).  BTW, do you know how VML is able to  
> achieve
> a speedup of 6x for a sin() function?  I suppose this is because they
> are using SSE instructions, but, are these also available for 64-bit
> double precision items?

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Gregor Thalhammer
Francesc Alted wrote:
> On Friday 16 January 2009, Gregor Thalhammer wrote:
>   
>> I also gave a try to the vector math library (VML), contained in
>> Intel's Math Kernel Library. This offers a fast implementation of
>> mathematical functions, operating on arrays. First I implemented a C
>> extension, providing new ufuncs. This gave me a big performance gain,
>> e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x
>> (6x) for division (no gain for add, sub, mul).
>> 
>
> Wow, pretty nice speed-ups indeed!  In fact I was thinking of including
> support for threading in Numexpr (I don't think it would be too 
> difficult, but let's see).  BTW, do you know how VML is able to achieve 
> a speedup of 6x for a sin() function?  I suppose this is because they 
> are using SSE instructions, but, are these also available for 64-bit 
> double precision items?
>   
I am not an expert on SSE instructions, but to my knowledge there exists
(in the Core 2 architecture) no SSE instruction to calculate the sin.
But it seems to be possible to (approximately) calculate a sin with a
couple of multiplication/addition instructions (and they exist in SSE
for 64-bit floats). Intel (and AMD) seem to use a more clever algorithm,
implemented more efficiently than the standard implementation.
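
To make the multiply/add idea concrete, here is a toy polynomial sine in NumPy (a sketch only: a truncated Taylor series in Horner form, reasonable near [-pi/2, pi/2]; real libraries use minimax coefficients plus range reduction):

import numpy as np

def poly_sin(x):
    # sin(x) ~ x - x**3/6 + x**5/120 - x**7/5040, evaluated in Horner
    # form so it reduces to a handful of multiplies and adds (SIMD-friendly)
    x2 = x * x
    return x * (1.0 - x2/6.0 * (1.0 - x2/20.0 * (1.0 - x2/42.0)))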
> Well, if you can provide the code, I'd be glad to include it in numexpr.  
> The only requirement is that the VML must be optional during the build 
> of the package.
>   
Yes, I will try to provide you with a polished version of my changes, 
making them optional.
>   
>> There is one but: VML supports (at the moment) only math 
>> on contiguous arrays. At a first try I didn't understand how to
>> enforce this limitation in numexpr.
>> 
>
> No problem.  At the end of the numexpr/necompiler.py you will see some 
> code like:
>
> # All the opcodes can deal with strided arrays directly as
> # long as they are one-dimensional (strides in other
> # dimensions are dealt within the extension), so we don't
> # need a copy for the strided case.
> if not b.flags.aligned:
>...
>
> which you can replace with something like:
>
> # need a copy for the strided case.
> if VML_available and not b.flags.contiguous: 
>   b = b.copy()
> elif not b.flags.aligned:
>   ...
>
> That would be enough for ensuring that all the arrays are contiguous 
> when they hit numexpr's virtual machine.
>   
Ah, I see, that's not difficult. I thought copying was done in the virtual
machine (I didn't read all the code...).
> Being said this, it is a shame that VML does not have support for 
> strided/unaligned arrays.  They are quite common beasts, especially when
> you work with heterogeneous arrays (aka record arrays).
>   
I have the impression that you can already feel happy if these 
mathematical libraries support a C interface, not only Fortran. At least 
the Intel VML provides functions to pack/unpack strided arrays which 
seem to work on a broader parameter range than specified (also zero or
negative step sizes).
>> I also gave a quick try to the 
>> equivalent vector math library, acml_mv of AMD. I only tried sin and
>> log, which gave me the same performance (on an Intel processor!) as
>> Intel's VML.
>>
>> I was also playing around with the block size in numexpr. What is
>> the rationale that led to the current block size of 128? Especially
>> with VML, a larger block size of 4096 instead of 128 allowed
>> multithreading in VML to be used efficiently.
>> 
>
> Experimentation.  Back in 2006 David found that 128 was optimal for the
> processors available at that time.  With Numexpr 1.1 my experiments
> show that 256 is a better value for current Core2 processors and most
> expressions in our benchmark bed (see benchmarks/ directory); hence,
> 256 is the new value for the chunksize in 1.1.  However, bear in mind
> that 256 has to be multiplied by the itemsize of each array, so the 
> chunksize is currently 2048 bytes for 64-bit items (int64 or float64) 
> and 4096 for double precision complex arrays, which are probably the 
> sizes that have to be compared with VML.
>   
So the optimum block size might depend on the type of expression and
on whether VML functions are used. One question: the block size is set by a
#define; is there significantly poorer performance if you use a
variable instead? That would be more flexible, especially for testing and tuning.
>> I was missing the support for single precision floats.
>> 
>
> Yeah.  This is because nobody has implemented it before, but it is 
> completely doable.
>   

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Dag Sverre Seljebotn
Francesc Alted wrote:
> On Friday 16 January 2009, j...@physics.ucf.edu wrote:
>> Right 
>> now, I'm not quite sure whether the problem you are solving is merely
>> the case of expressions-in-strings, and there is no advantage for
>> expressions-in-code, or whether your expressions-in-strings are
>> faster than numpy's expressions-in-code. In either case, it would 
>> appear this would be a good addition to the numpy core, and it's past
>> 1.0, so why keep it separate?  Even if there is value in having a
>> non-numpy version, is there not also value in accelerating numpy by
>> default?
> 
> Having the expression encapsulated in a string has the advantage that
> you know exactly the part of the code that you want to parse and
> accelerate.  Making NumPy understand parts of the Python code that
> can be accelerated sounds more like a true JIT for Python, and this is
> something that is not trivial at all (although, with the advent of PyPy,
> some efforts in this direction are appearing [1]).

A full compiler/JIT isn't needed, there's another route:

One could use the Numexpr methodology together with a symbolic 
expression framework (like SymPy or the one in Sage). I.e. operator 
overloads and lazy expressions.

Combining NumExpr with a symbolic manipulation engine would be very cool 
IMO. Unfortunately I don't have time myself (and I understand that you 
don't, I'm just mentioning it).

Example using pseudo-Sage-like syntax:

a = np.arange(bignum)
b = np.arange(bignum)
x, y = sage.var("x, y")
expr = sage.integrate(x + y, x)
z = expr(x=a, y=b) # z = a**2/2 + a*b, but Numexpr-enabled
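
As a latter-day sketch of the same idea with SymPy rather than Sage (assuming a SymPy version whose lambdify supports the "numexpr" module, and numexpr installed):

import numpy as np
import sympy

x, y = sympy.symbols("x y")
expr = sympy.integrate(x + y, x)                  # x**2/2 + x*y
f = sympy.lambdify((x, y), expr, modules="numexpr")

a = np.arange(1000000.0)
b = np.arange(1000000.0)
z = f(a, b)                                       # evaluated via numexpr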

-- 
Dag Sverre
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Francesc Alted
On Friday 16 January 2009, Sebastian Haase wrote:
> Hi Francesc,
> this is a wonderful project ! I was just wondering if you would /
> could support single precision float arrays ?

As I said before, it is doable, but I don't know if I will have enough
time to implement this myself.

> In 3+D image analysis we generally don't have enough memory to afford
> double precision; and we could save ourselves lots of extra C
> (or Cython) coding if we could use numexpr ;-)

Well, one of the ideas that I've been toying with for a long time is to
give Numexpr the capability to work with PyTables disk-based objects.
That way, you would be able to evaluate potentially complex expressions
by using data that is completely on disk.  But this might be a completely
different thing from what you are talking about.

Cheers,

-- 
Francesc Alted
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Francesc Alted
On Friday 16 January 2009, Gregor Thalhammer wrote:
> I also gave a try to the vector math library (VML), contained in
> Intel's Math Kernel Library. This offers a fast implementation of
> mathematical functions, operating on arrays. First I implemented a C
> extension, providing new ufuncs. This gave me a big performance gain,
> e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x
> (6x) for division (no gain for add, sub, mul).

Wow, pretty nice speed-ups indeed!  In fact I was thinking of including
support for threading in Numexpr (I don't think it would be too 
difficult, but let's see).  BTW, do you know how VML is able to achieve 
a speedup of 6x for a sin() function?  I suppose this is because they 
are using SSE instructions, but, are these also available for 64-bit 
double precision items?

> The values in 
> parentheses are given if I allow VML to use several threads and to
> employ both cores of my Intel Core2Duo computer. For large arrays
> (100M entries) this performance gain is reduced because of limited
> memory bandwidth. At this point I stumbled across numexpr and
> modified it to use the VML functions. For sufficiently long and
> complex numerical expressions I could get the maximum performance
> also for large arrays.

Cool.

> Together with VML, numexpr seems to be an
> extremely powerful tool for getting optimum performance. I would like to see
> numexpr extended to (optionally) make use of fast vectorized math
> functions.

Well, if you can provide the code, I'd be glad to include it in numexpr.  
The only requirement is that the VML must be optional during the build 
of the package.

> There is one but: VML supports (at the moment) only math 
> on contiguous arrays. At a first try I didn't understand how to
> enforce this limitation in numexpr.

No problem.  At the end of the numexpr/necompiler.py you will see some 
code like:

# All the opcodes can deal with strided arrays directly as
# long as they are one-dimensional (strides in other
# dimensions are dealt within the extension), so we don't
# need a copy for the strided case.
if not b.flags.aligned:
   ...

which you can replace with something like:

# need a copy for the strided case.
if VML_available and not b.flags.contiguous:
    b = b.copy()
elif not b.flags.aligned:
    ...

That would be enough for ensuring that all the arrays are contiguous 
when they hit numexpr's virtual machine.
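
Spelled out as a self-contained sketch (VML_available here is a hypothetical stand-in for whatever build-time flag the real patch would define):

import numpy as np

VML_available = True  # hypothetical build-time flag

def prepare_operand(b):
    """Return an operand that is safe for the virtual machine."""
    if VML_available and not b.flags.contiguous:
        return b.copy()   # VML requires contiguous buffers
    elif not b.flags.aligned:
        return b.copy()   # the plain VM only requires alignment
    return b

# e.g. a strided view gets copied; a plain array passes through untouched
x = np.arange(10)
assert prepare_operand(x[::2]).flags.contiguous
assert prepare_operand(x) is x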

This being said, it is a shame that VML does not have support for
strided/unaligned arrays.  They are quite common beasts, especially when
you work with heterogeneous arrays (aka record arrays).

> I also gave a quick try to the 
> equivalent vector math library, acml_mv of AMD. I only tried sin and
> log, which gave me the same performance (on an Intel processor!) as
> Intel's VML.
>
> I was also playing around with the block size in numexpr. What is
> the rationale that led to the current block size of 128? Especially
> with VML, a larger block size of 4096 instead of 128 allowed
> multithreading in VML to be used efficiently.

Experimentation.  Back in 2006 David found that 128 was optimal for the
processors available at that time.  With Numexpr 1.1 my experiments
show that 256 is a better value for current Core2 processors and most
expressions in our benchmark bed (see benchmarks/ directory); hence,
256 is the new value for the chunksize in 1.1.  However, bear in mind
that 256 has to be multiplied by the itemsize of each array, so the 
chunksize is currently 2048 bytes for 64-bit items (int64 or float64) 
and 4096 for double precision complex arrays, which are probably the 
sizes that have to be compared with VML.
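
In other words, the chunk size in bytes is just 256 times the itemsize, which is easy to check:

import numpy as np

# 256-element chunks, so bytes per chunk = 256 * itemsize
for dt in ('int64', 'float64', 'complex128'):
    print "%-10s  chunk = %d bytes" % (dt, 256 * np.dtype(dt).itemsize)
# int64 / float64 -> 2048 bytes; complex128 -> 4096 bytes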

>
> > Share your experience
> > =====================
> >
> > Let us know of any bugs, suggestions, gripes, kudos, etc. you may
> > have.

> I was missing the support for single precision floats.

Yeah.  This is because nobody has implemented it before, but it is 
completely doable.

> Great work!

You are welcome!  And thanks for the excellent feedback too!  Hope we can
have a VML-aware numexpr anytime soon ;-)

Cheers,

-- 
Francesc Alted
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Sebastian Haase
Hi Francesc,
this is a wonderful project! I was just wondering if you would/
could support single precision float arrays?
In 3+D image analysis we generally don't have enough memory to afford
double precision; and we could save ourselves lots of extra C
(or Cython) coding if we could use numexpr ;-)

Thanks,
Sebastian Haase



On Fri, Jan 16, 2009 at 5:04 PM, Francesc Alted  wrote:
> On Friday 16 January 2009, j...@physics.ucf.edu wrote:
>> Hi Francesc,
>>
>> > Numexpr is a fast numerical expression evaluator for NumPy.  With
>> > it, expressions that operate on arrays (like "3*a+4*b") are
>> > accelerated and use less memory than doing the same calculation in
>> > Python.
>>
>> Please pardon my ignorance as I know this project has been around for
>> a while.  This looks very exciting, but either it's cumbersome, or
>> I'm not understanding exactly what's being fixed.  If you can
>> accelerate evaluation, why not just integrate the faster math into
>> numpy, rather than having two packages?  Or is this something that is
>> only an advantage when the expression is given as a string (and why
>> is that the case)?  It would be helpful if you could put the answer
>> on your web page and in your standard release blurb in some compact
>> form. I guess what I'm really looking for when I read one of those is
>> a quick answer to the question "should I look into this?".
>
> Well, there is a link on the project page to the "Overview" section of
> the wiki, but perhaps it is a bit hidden.  I've added some blurb as you
> suggested on the main page and another link to the "Overview" wiki page.
> Hope that, by reading the new blurb, you can see why it accelerates
> expression evaluation with regard to NumPy.  If not, tell me and I will
> try to come up with something more comprehensible.
>
>> Right
>> now, I'm not quite sure whether the problem you are solving is merely
>> the case of expressions-in-strings, and there is no advantage for
>> expressions-in-code, or whether your expressions-in-strings are
>> faster than numpy's expressions-in-code. In either case, it would
>> appear this would be a good addition to the numpy core, and it's past
>> 1.0, so why keep it separate?  Even if there is value in having a
>> non-numpy version, is there not also value in accelerating numpy by
>> default?
>
> Having the expression encapsulated in a string has the advantage that
> you know exactly the part of the code that you want to parse and
> accelerate.  Making NumPy understand parts of the Python code that
> can be accelerated sounds more like a true JIT for Python, and this is
> something that is not trivial at all (although, with the advent of PyPy,
> some efforts in this direction are appearing [1]).
>
> [1] http://www.enthought.com/~ischnell/paper.html
>
> Cheers,
>
> --
> Francesc Alted
> ___
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Francesc Alted
On Friday 16 January 2009, j...@physics.ucf.edu wrote:
> Hi Francesc,
>
> > Numexpr is a fast numerical expression evaluator for NumPy.  With
> > it, expressions that operate on arrays (like "3*a+4*b") are
> > accelerated and use less memory than doing the same calculation in
> > Python.
>
> Please pardon my ignorance as I know this project has been around for
> a while.  This looks very exciting, but either it's cumbersome, or
> I'm not understanding exactly what's being fixed.  If you can
> accelerate evaluation, why not just integrate the faster math into
> numpy, rather than having two packages?  Or is this something that is
> only an advantage when the expression is given as a string (and why
> is that the case)?  It would be helpful if you could put the answer
> on your web page and in your standard release blurb in some compact
> form. I guess what I'm really looking for when I read one of those is
> a quick answer to the question "should I look into this?".

Well, there is a link on the project page to the "Overview" section of
the wiki, but perhaps it is a bit hidden.  I've added some blurb as you
suggested on the main page and another link to the "Overview" wiki page.
Hope that, by reading the new blurb, you can see why it accelerates
expression evaluation with regard to NumPy.  If not, tell me and I will
try to come up with something more comprehensible.

> Right 
> now, I'm not quite sure whether the problem you are solving is merely
> the case of expressions-in-strings, and there is no advantage for
> expressions-in-code, or whether your expressions-in-strings are
> faster than numpy's expressions-in-code. In either case, it would 
> appear this would be a good addition to the numpy core, and it's past
> 1.0, so why keep it separate?  Even if there is value in having a
> non-numpy version, is there not also value in accelerating numpy by
> default?

Having the expression encapsulated in a string has the advantage that
you know exactly the part of the code that you want to parse and
accelerate.  Making NumPy understand parts of the Python code that
can be accelerated sounds more like a true JIT for Python, and this is
something that is not trivial at all (although, with the advent of PyPy,
some efforts in this direction are appearing [1]).

[1] http://www.enthought.com/~ischnell/paper.html

Cheers,

-- 
Francesc Alted
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Gregor Thalhammer
Francesc Alted wrote:
> Numexpr is a fast numerical expression evaluator for NumPy.  With it,
> expressions that operate on arrays (like "3*a+4*b") are accelerated
> and use less memory than doing the same calculation in Python.
>
> The expected speed-ups for Numexpr with respect to NumPy are between 0.95x
> and 15x, with 3x or 4x being typical values.  The strided and unaligned
> case has been optimized too, so if the expression contains such arrays,
> the speed-up can increase significantly.  Of course, you will need to 
> operate with large arrays (typically larger than the cache size of your 
> CPU) to see these improvements in performance.
>
>   
Just recently I had a more detailed look at numexpr. Clever idea, easy 
to use! I can affirm a typical performance gain of 3x if you work on
large arrays (>100k entries).

I also gave a try to the vector math library (VML) contained in Intel's
Math Kernel Library. This offers a fast implementation of mathematical
functions operating on arrays. First I implemented a C extension
providing new ufuncs. This gave me a big performance gain, e.g., 2.3x
(5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x (6x) for
division (no gain for add, sub, mul). The values in parentheses are
obtained if I allow VML to use several threads and to employ both cores
of my Intel Core2Duo computer. For large arrays (100M entries) this
performance gain is reduced because of limited memory bandwidth. At this
point I stumbled across numexpr and modified it to use the VML
functions. For sufficiently long and complex numerical expressions I
could get the maximum performance also for large arrays.  Together with
VML, numexpr seems to be an extremely powerful tool for getting optimum
performance. I would like to see numexpr extended to (optionally) make
use of fast vectorized math functions. There is one "but": VML supports
(at the moment) only math on contiguous arrays. At a first try I didn't
understand how to enforce this limitation in numexpr. I also gave a
quick try to the equivalent vector math library, acml_mv of AMD. I only
tried sin and log, which gave me the same performance (on an Intel
processor!) as Intel's VML.

I was also playing around with the block size in numexpr. What is the
rationale that led to the current block size of 128? Especially with
VML, a larger block size of 4096 instead of 128 allowed multithreading
in VML to be used efficiently.
> Share your experience
> =====================
>
> Let us know of any bugs, suggestions, gripes, kudos, etc. you may
> have.
>
>   
I was missing the support for single precision floats.

Great work!

Gregor
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion