Re: [Numpy-discussion] Objected-oriented SIMD API for Numpy

2009-10-21 Thread Andrew Friedley
sigh; yet another email dropped by the list.

David Warde-Farley wrote:
 On 21-Oct-09, at 9:14 AM, Pauli Virtanen wrote:
 
 Since these are ufuncs, I suppose the SSE implementations could just be
 put in a separate module, which is always compiled.  Before importing
 the module, we could simply check from the Python side that the CPU
 supports the necessary instructions.  If everything is OK, the
 accelerated implementations would then just replace the Numpy routines.
 
 Am I mistaken or wasn't that sort of the goal of Andrew Friedley's  
 CorePy work this summer?
 
 Looking at his slides again, the speedups are rather impressive. I  
 wonder if these could be usefully integrated into numpy itself?

Yes, my GSoC project is closely related, though I didn't do the CPU 
detection part; that'd be easy to add.  Also, I wrote my code specifically 
for 64-bit x86.
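
A minimal sketch of what that detection could look like on Linux -- the 
/proc/cpuinfo parsing and the module layout here are my assumptions, not 
actual CorePy code:

def cpu_has(feature):
    # Parse the 'flags' line of /proc/cpuinfo (Linux-specific).
    try:
        for line in open('/proc/cpuinfo'):
            if line.startswith('flags'):
                return feature in line.split()
    except IOError:
        pass
    return False

if cpu_has('sse2'):
    pass  # safe to import the accelerated module and swap in its ufuncs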

I didn't focus so much on the transcendental functions, though they 
wouldn't be too hard to implement.  There's also the possibility of 
providing implementations with differing tradeoffs between accuracy and 
performance.

I think the blog link got posted already, but here's relevant info:

http://numcorepy.blogspot.com
http://www.corepy.org/wiki/index.php?title=CoreFunc

I talked about this in my SciPy talk and upcoming paper as well.

Also, people have been talking only about x86 in this thread -- other 
architectures could be supported too, e.g. PPC/AltiVec or even the Cell 
SPUs and other accelerators.  I actually wrote a quick/dirty 
implementation of addition and vector-normalization ufuncs for the Cell 
SPU recently.  The basic result is that overall performance is very 
roughly comparable to a similar-speed x86 chip, but this is a huge win 
over just running on the extremely slow Cell PPC core.

Andrew


Re: [Numpy-discussion] strange sin/cos performance

2009-08-05 Thread Andrew Friedley



Is anyone with this problem *not* running Ubuntu?


Me -- RHEL 5.2 Opteron:

Python 2.6.1 (r261:67515, Jan  5 2009, 10:19:01)
[GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2

Fedora 9 PS3/PPC:

Python 2.5.1 (r251:54863, Jul 17 2008, 13:25:23)
[GCC 4.3.1 20080708 (Red Hat 4.3.1-4)] on linux2


Actually I now have some interesting results that indicate the issue 
isn't in Python or NumPy at all.  I just wrote a C program to try to 
reproduce the error, and was able to do so (actually the difference is 
even larger).


Opteron:

float (32) time in usecs: 179698
double (64) time in usecs: 13795

PS3/PPC:

float (32) time in usecs: 614821
double (64) time in usecs: 37163

I've attached the code for others to review and/or try out.  I guess 
this is worth showing to the libc people?


Andrew
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

#define LEN 159161

float inp32[LEN];
float out32[LEN];

double inp64[LEN];
double out64[LEN];

int main(int argc, char** argv)
{
    struct timeval tv_start;
    struct timeval tv_stop;
    int i;

    for(i = 0; i < LEN; i++) {
        //inp32[i] = ((float)i / (float)LEN) * (2 * M_PI);
        inp32[i] = (float)i;
        out32[i] = 0.0;
    }

    gettimeofday(&tv_start, NULL);
    for(i = 0; i < LEN; i++) {
        out32[i] = (float)cosf((float)inp32[i]);
    }
    gettimeofday(&tv_stop, NULL);

    if(tv_start.tv_sec != tv_stop.tv_sec) {
        puts("seconds changed, re-run the benchmark");
    }

    printf("float (32) time in usecs: %d\n",
           (int)(tv_stop.tv_usec - tv_start.tv_usec));

    for(i = 0; i < LEN; i++) {
        //inp64[i] = ((double)i / (double)LEN) * (2 * M_PI);
        inp64[i] = (double)i;
        out64[i] = 0.0;
    }

    gettimeofday(&tv_start, NULL);
    for(i = 0; i < LEN; i++) {
        out64[i] = (double)cos((double)inp64[i]);
    }
    gettimeofday(&tv_stop, NULL);

    if(tv_start.tv_sec != tv_stop.tv_sec) {
        puts("seconds changed, re-run the benchmark");
    }

    printf("double (64) time in usecs: %d\n",
           (int)(tv_stop.tv_usec - tv_start.tv_usec));

    return 0;
}
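
For anyone who wants to try it, something like this should build and run 
it (filename and flags assumed):

gcc -O2 -o sincos_bench sincos_bench.c -lm
./sincos_bench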



Re: [Numpy-discussion] strange sin/cos performance

2009-08-04 Thread Andrew Friedley
Bruce Southey wrote:
 Hi,
 Can you try these from the command line:
 python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32)"
 python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32); b=np.sin(a)"
 python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32); np.sin(a)"
 python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32)" "np.sin(a)"
 
 The first should be similar for different dtypes because it is just 
 array creation.  The second extends that by storing the sin into another 
 array.  I am not sure how to interpret the third, but at the Python prompt 
 it would print it to screen.  The last causes Python to handle two 
 arguments, which is slow using float32 but not for float64 and float128, 
 suggesting a compiler issue such as not using SSE or similar.

Results:

$ python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32)"
100 loops, best of 3: 0.0811 usec per loop

$ python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32); b=np.sin(a)"
100 loops, best of 3: 0.11 usec per loop

$ python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32); np.sin(a)"
100 loops, best of 3: 0.11 usec per loop

$ python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32)" "np.sin(a)"
100 loops, best of 3: 112 msec per loop

$ python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float64)" "np.sin(a)"
100 loops, best of 3: 13.2 msec per loop

I think the second and third are effectively the same; both create an 
array containing the result.  The second assigns that array to a 
variable, while the third does not, so the result should just get 
garbage collected.

The fourth one is the only one that actually runs the sin inside the 
timing loop.  I don't understand what you mean by causing Python to 
handle two arguments?

The fifth run, which I added, uses float64 for comparison (and reproduces the problem).

Andrew


Re: [Numpy-discussion] strange sin/cos performance

2009-08-04 Thread Andrew Friedley
Charles R Harris wrote:
 On Mon, Aug 3, 2009 at 11:51 AM, Andrew Friedley afrie...@indiana.edu wrote:
 
 Charles R Harris wrote:
 What compiler versions are folks using? In the slow cases, what is the
 timing for converting to double, computing the sin, then casting back to
 single?
 I did this -- is this the right way to do that?

 t = timeit.Timer("numpy.sin(a.astype(numpy.float64)).astype(numpy.float32)",
         "import numpy\n"
         "a = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float64)")
 print "sin converted float 32/64", min(t.repeat(3, 10))

 Timings on my opteron system (2-socket 2-core 2GHz):

 sin float32 1.13407707214
 sin float64 0.133460998535
 sin converted float 32/64 0.18202996254

 Not too surprising I guess.

 gcc --version shows:

 gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)

 My compile flags for my Python 2.6.1/NumPy 1.3.0 builds:

 -Os -fomit-frame-pointer -pipe -s -march=k8 -m64

 
 That looks right. When numpy doesn't find a *f version it basically does
 that conversion. This is beginning to look like a hardware/software
 implementation problem, maybe compiler related. That is, I suspect the fast
 times come from using a hardware implementation. What happens if you use -O2
 instead of -Os?

Do you know where this conversion is in the code?  The impression I got 
from my quick look was that a wrapper sinf is defined that just calls 
sin.  I guess the typecast to float in there does the conversion -- is 
that what you are referring to, or something at a higher level?

I recompiled the same versions of Python/NumPy using the same flags 
except -O2 instead of -Os; the behavior is still the same.

Andrew


Re: [Numpy-discussion] strange sin/cos performance

2009-08-04 Thread Andrew Friedley
David Cournapeau wrote:
 On Wed, Aug 5, 2009 at 12:14 AM, Andrew Friedley afrie...@indiana.edu wrote:
 
 Do you know where this conversion is in the code?  The impression I got
 from my quick look was that a wrapper sinf is defined that just calls
 sin.  I guess the typecast to float in there does the conversion
 
 Exact. Given your CPU, compared to my macbook, it looks like the
 float32 is the problem (i.e. the float64 is not particularly fast). I
 really can't see what could cause such a slowdown: the range over
 which you evaluate sin should not cause denormal numbers - just to be
 sure, could you try the same benchmark but using a simple array of
 constant values (say numpy.ones(1000)) ? Also, you may want to check
 what happens if you force raising errors in case of FPU exceptions
 (numpy.seterr(raise=all)).

OK, I have some interesting results.  First, my array creation was not 
doing what I thought it was.  This (what I've been doing) creates an 
array of 159161 elements:

numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float32)

That isn't what I was after (1000 elements ranging from 0 to 2*pi), so 
the values in that array climb up to 999.999.
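
For reference, either of these would give what I actually wanted -- 
roughly 1000 elements spanning one period (note they aren't identical; 
linspace includes the endpoint by default):

a = numpy.arange(0.0, 2 * numpy.pi, (2 * numpy.pi) / 1000, dtype=numpy.float32)
a = numpy.linspace(0.0, 2 * numpy.pi, 1000).astype(numpy.float32)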

Running with numpy.ones() gives much different timings (I did 
numpy.ones(159161) to keep the array lengths the same):

sin float32 0.078202009201
sin float64 0.0767619609833
cos float32 0.0750858783722
cos float64 0.088515996933

Much better, but still a little strange; float32 should still be 
somewhat faster.  I tried with 1000 elements and got similar results.

So the performance has something to do with the input values.  This is 
believable, but I don't think it explains why float32 would behave that 
way and not float64, unless there's something else I don't understand.

Also, I assume you meant seterr(all='raise').  This didn't seem to do 
anything; no exceptions were thrown and there was no other output.

Andrew


Re: [Numpy-discussion] strange sin/cos performance

2009-08-04 Thread Andrew Friedley
Charles R Harris wrote:
 Depends on the CPU, FPU and the compiler flags. The computations could very
 well be done using double precision internally with conversions on
 load/store.

Sure, but if this is the case, why does performance blow up on larger 
input values for float32 but not float64?  Both should blow up, not just 
one or the other.  In other words, I think they are using different 
implementations :)  Am I missing something?

Andrew


[Numpy-discussion] strange sin/cos performance

2009-08-03 Thread Andrew Friedley
While working on GSoC stuff I came across this weird performance 
behavior for sine and cosine -- using float32 is way slower than 
float64.  On a 2GHz Opteron:


sin float32 1.12447786331
sin float64 0.133481025696
cos float32 1.14155912399
cos float64 0.131420135498

The times are in seconds, and are best of three runs of ten iterations 
of numpy.{sin,cos} over a 1000-element array (script attached).  I've 
produced similar results on a PS3 system also.  The opteron is running 
Python 2.6.1 and NumPy 1.3.0, while the PS3 has Python 2.5.1 and NumPy 
1.1.1.


I haven't jumped into the code yet, but does anyone know why sin/cos are 
~8.5x slower for 32-bit floats compared to 64-bit doubles?


Side question:  I see people in emails writing things like 'timeit 
foo(x)' and having it run some sort of standard benchmark -- how exactly 
do I do that?  Is that some environment other than a normal Python?


Thanks,

Andrew
import timeit

t = timeit.Timer("numpy.sin(a)",
        "import numpy\n"
        "a = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float32)")
print "sin float32", min(t.repeat(3, 10))

t = timeit.Timer("numpy.sin(a)",
        "import numpy\n"
        "a = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float64)")
print "sin float64", min(t.repeat(3, 10))

t = timeit.Timer("numpy.cos(a)",
        "import numpy\n"
        "a = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float32)")
print "cos float32", min(t.repeat(3, 10))

t = timeit.Timer("numpy.cos(a)",
        "import numpy\n"
        "a = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float64)")
print "cos float64", min(t.repeat(3, 10))



Re: [Numpy-discussion] Add/multiply reduction confusion

2009-08-03 Thread Andrew Friedley
Gael Varoquaux wrote:
 On Sun, Jul 05, 2009 at 02:47:18PM -0400, Andrew Friedley wrote:
 Stéfan van der Walt wrote:
 2009/7/5 Andrew Friedley afrie...@indiana.edu:
 I found the check that does the type 'upcasting' in
 umath_ufunc_object.inc around line 3072 (NumPy 1.3.0).  Turns out all I
 need to do is make sure my add and multiply ufuncs are actually named
 'add' and 'multiply' and arrays will be upcasted appropriately.
 
 Would you please be so kind as to add your findings here:
 
 http://docs.scipy.org/numpy/docs/numpy-docs/reference/index.rst/#reference-index
 
 I haven't read through that document recently, so it may be in there 
 already.
 
 I created an account (afriedle), but it looks like I don't have edit 
 permissions.
 
 I have added you to the Editor list.

Thanks, and sorry about the delay; I went and added the comment I proposed.

Andrew


Re: [Numpy-discussion] strange sin/cos performance

2009-08-03 Thread Andrew Friedley
Thanks for the quick responses.

David Cournapeau wrote:
 On Mon, Aug 3, 2009 at 10:32 PM, Andrew Friedley afrie...@indiana.edu wrote:
 While working on GSoC stuff I came across this weird performance behavior
 for sine and cosine -- using float32 is way slower than float64.  On a 2ghz
 opteron:

 sin float32 1.12447786331
 sin float64 0.133481025696
 cos float32 1.14155912399
 cos float64 0.131420135498
 
 Which OS are you on?  FWIW, on Mac OS X, with a recent svn checkout, I
 get expected results (float32 ~ twice faster).

The numbers above are on Linux, RHEL 5.2.  The PS3 is running Fedora 9, 
I think.  I just ran on a PPC OS X 10.5 system:

sin float32 0.111793041229
sin float64 0.0902218818665
cos float32 0.112202882767
cos float64 0.0917768478394

Much more reasonable, but still not what I'd expect or what you seem to 
expect.

 The times are in seconds, and are best of three runs of ten iterations of
 numpy.{sin,cos} over a 1000-element array (script attached).  I've produced
 similar results on a PS3 system also.  The opteron is running Python 2.6.1
 and NumPy 1.3.0, while the PS3 has Python 2.5.1 and NumPy 1.1.1.

 I haven't jumped into the code yet, but does anyone know why sin/cos are
 ~8.5x slower for 32-bit floats compared to 64-bit doubles?
 
 My guess would be that you are on a platform where there is no sinf,
 and our sinf replacement is bad for some reason.

I think Linux has sinf; is there a quick/easy way to check whether numpy 
is using it?

 Side question:  I see people in emails writing things like 'timeit foo(x)'
 and having it run some sort of standard benchmark, how exactly do I do that?
  Is that some environment other than a normal Python?
 
 Yes, that's in ipython.

Thanks for the pointer.

Andrew


Re: [Numpy-discussion] strange sin/cos performance

2009-08-03 Thread Andrew Friedley
Emmanuelle Gouillart wrote:
   Hi Andrew,
 
   %timeit is an Ipython magic command that uses the timeit module,
 see
 http://ipython.scipy.org/doc/stable/html/interactive/reference.html?highlight=timeit
 for more information about how to use it. So you were right to suppose
 that it is not a normal Python.

Thanks for the pointer; I'm not familiar with IPython at all, but I'll 
check it out.


   However, I was not able to reproduce your observations.
 
 import numpy as np
 a = np.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=np.float32)
 b = np.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=np.float64)
 %timeit -n 10 np.sin(a)
 10 loops, best of 3: 8.67 ms per loop
 %timeit -n 10 np.sin(b)
 10 loops, best of 3: 9.29 ms per loop

OK, I'm curious -- what OS/Python/NumPy are you using?

Andrew


Re: [Numpy-discussion] strange sin/cos performance

2009-08-03 Thread Andrew Friedley
David Cournapeau wrote:
 David Cournapeau wrote:
 On Mon, Aug 3, 2009 at 10:32 PM, Andrew Friedley afrie...@indiana.edu wrote:
 While working on GSoC stuff I came across this weird performance behavior
 for sine and cosine -- using float32 is way slower than float64.  On a 2ghz
 opteron:

 sin float32 1.12447786331
 sin float64 0.133481025696
 cos float32 1.14155912399
 cos float64 0.131420135498
 Which OS are you on?  FWIW, on Mac OS X, with a recent svn checkout, I
 get expected results (float32 ~ twice faster).
 The numbers above are on linux, RHEL 5.2.  The PS3 is running Fedora 9 I
 think.
 
 I know next to nothing about the PS3 hardware, but I know that it is
 quite different compared to a conventional x86 CPU. Does it even have
 both 4- and 8-byte native floats?

Yes.  As far as this discussion is concerned, the PS3/Cell is just a 
slow PowerPC.  Quite different from x86, but probably not as different 
as you think :)


 Much more reasonable, but still not what I'd expect or what you seem to
 expect.
 
 On an x86 system with sinf available in the math lib, I would expect
 the float32 to be faster than float64. Other than that, the exact
 ratio depends on too many factors (SSE vs x87 usage, cache size,
 compiler, math library performance). One order of magnitude slower seems
 very strange in any case.

OK.  I'll probably investigate this a bit further, but I don't have 
anything that really depends on this issue.  It does explain a large 
part of why my cos ufunc was so much faster.

Since I'm observing this on both x86 and PPC (PS3), I don't think it's a 
hardware issue -- it's something in the software stack.  And now there 
are two people reporting different results with only different NumPy 
versions.


 The times are in seconds, and are best of three runs of ten iterations of
 numpy.{sin,cos} over a 1000-element array (script attached).  I've produced
 similar results on a PS3 system also.  The opteron is running Python 2.6.1
 and NumPy 1.3.0, while the PS3 has Python 2.5.1 and NumPy 1.1.1.

 I haven't jumped into the code yet, but does anyone know why sin/cos are
 ~8.5x slower for 32-bit floats compared to 64-bit doubles?
 My guess would be that you are on a platform where there is no sinf,
 and our sinf replacement is bad for some reason.
 I think linux has sinf, is there a quick/easy way to check if numpy is
 using it?
 
 You can look at the config.h in numpy/core/include/numpy, and see if
 there is a HAVE_SINF defined (for numpy = 1.2.0 at least).

OK, I see HAVE_SINF (and HAVE_COSF) for my 1.3.0 build on the Opteron 
system.  I'm using the distro-provided packages on the other systems, so 
I guess I can't check those.

I don't think this matters -- numpy/core/src/npy_math.c just defines 
sinf as a function calling sin.  So if HAVE_SINF weren't set, I'd expect 
the performance difference to be very small, with floats still being 
slightly faster (less memory traffic).
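
In other words, the fallback presumably amounts to something like this (a 
paraphrase of the idea, not the literal npy_math.c source):

#ifndef HAVE_SINF
static float sinf(float x)
{
    /* compute in double precision, truncate back to float */
    return (float)sin((double)x);
}
#endif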

Also I just went and wrote a C program to do a similar benchmark, and I 
am unable to reproduce the issue there.  Makes me think the problem is 
in NumPy, but I have no idea where to look.  Suggestions welcome :)

Andrew


Re: [Numpy-discussion] strange sin/cos performance

2009-08-03 Thread Andrew Friedley
Charles R Harris wrote:
 What compiler versions are folks using? In the slow cases, what is the
 timing for converting to double, computing the sin, then casting back to
 single?

I did this -- is this the right way to do that?

t = timeit.Timer("numpy.sin(a.astype(numpy.float64)).astype(numpy.float32)",
        "import numpy\n"
        "a = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float64)")
print "sin converted float 32/64", min(t.repeat(3, 10))

Timings on my opteron system (2-socket 2-core 2GHz):

sin float32 1.13407707214
sin float64 0.133460998535
sin converted float 32/64 0.18202996254

Not too surprising I guess.

gcc --version shows:

gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)

My compile flags for my Python 2.6.1/NumPy 1.3.0 builds:

-Os -fomit-frame-pointer -pipe -s -march=k8 -m64

Andrew


Re: [Numpy-discussion] Add/multiply reduction confusion

2009-07-05 Thread Andrew Friedley
I figured this out, in case anyone is interested.

I found the check that does the type 'upcasting' in 
umath_ufunc_object.inc, around line 3072 (NumPy 1.3.0).  Turns out all I 
need to do is make sure my add and multiply ufuncs are actually named 
'add' and 'multiply', and arrays will be upcast appropriately.

Maybe this is worth documenting somewhere, perhaps in the UFunc C API 
docs?  Or is it documented already and I just missed it?

Andrew

Andrew Friedley wrote:
 Hi,
 
 I'm trying to understand how integer types are upcast for add/multiply 
 operations for my GSoC project (Implementing Ufuncs using CorePy).
 
 The documentation says that for reduction with add/multiply operations, 
 integer types are 'upcast' to the int_ type (int64 on my system).  What 
 exactly does this mean, internally?  Where/when does the upcasting 
 occur? Is it a C-style cast, or a memory copy to a new temporary array?
 
 I'm a bit confused as to which low-level ufunc loop type is used (and 
 why).  This is what I see:
 
  >>> a = numpy.arange(131072, dtype=numpy.int32)
  >>> r = numpy.add.reduce(a)
  >>> print type(r)
 <type 'numpy.int64'>
  >>> print hex(r)
 0x1ffff0000L
 
 Okay, fine.  But I have my own ufunc, which defines only the following 
 types right now (I stripped it down for debugging):
 
  >>> print corefunc.add.types
 ['ii->i', 'll->l']
 
 NumPy has this, for comparison:
 
  >>> print numpy.add.types
 ['??->?', 'bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l', 
 'LL->L', 'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'FF->F', 'DD->D', 
 'GG->G', 'OO->O']
 
 Also just to verify I did this:
 
  >>> print numpy.typeDict['i']
 <type 'numpy.int32'>
  >>> print numpy.typeDict['l']
 <type 'numpy.int64'>
 
 
 Yet when I call my own ufunc, this happens:
 
  >>> a = numpy.arange(131072, dtype=numpy.int32)
  >>> r = corefunc.add.reduce(a)
  >>> print type(r)
 <type 'numpy.int32'>
  >>> print hex(r)
 -0x10000
 
 It looks like no upcasting is occurring here?  My ii->i loop is being 
 used, not the ll->l loop... why?  I'm guessing this is something I am 
 doing wrong; any ideas what it is?
 
 Andrew


Re: [Numpy-discussion] Add/multiply reduction confusion

2009-07-05 Thread Andrew Friedley
Stéfan van der Walt wrote:
 2009/7/5 Andrew Friedley afrie...@indiana.edu:
 I found the check that does the type 'upcasting' in
 umath_ufunc_object.inc around line 3072 (NumPy 1.3.0).  Turns out all I
 need to do is make sure my add and multiply ufuncs are actually named
 'add' and 'multiply' and arrays will be upcasted appropriately.

 Would you please be so kind as to add your findings here:
 
 http://docs.scipy.org/numpy/docs/numpy-docs/reference/index.rst/#reference-index
 
 I haven't read through that document recently, so it may be in there already.

I created an account (afriedle), but it looks like I don't have edit 
permissions.

The user-side upcasting behavior for add/multiply is documented (though 
a little hidden) in the long paragraph right before the 'Available 
ufuncs' section.

http://docs.scipy.org/numpy/docs/numpy-docs/reference/ufuncs.rst/#ufuncs

I'm thinking something should be added on the C API side though, perhaps 
on the 'name' parameter of PyUFunc_FromFuncAndData():

http://docs.scipy.org/numpy/docs/numpy-docs/reference/c-api.ufunc.rst/#c-api-ufunc

I would say something to the effect of the following (I borrowed from 
the user-side blurb):

Specifying a name of 'add' or 'multiply' enables a special behavior for 
integer-typed reductions when no dtype is given.  If the input type is 
an integer (or boolean) data type smaller than the size of the int_ data 
type, it will be internally upcast to the int_ (or uint) data type.
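
A registration call would then look something like this (a hedged sketch: 
the loop functions and variable names are hypothetical, only the 
PyUFunc_FromFuncAndData signature itself is real):

static PyUFuncGenericFunction funcs[] = { int32_add_loop, int64_add_loop };
static void *data[] = { NULL, NULL };
static char types[] = { NPY_INT32, NPY_INT32, NPY_INT32,
                        NPY_INT64, NPY_INT64, NPY_INT64 };

ufunc = PyUFunc_FromFuncAndData(funcs, data, types,
                                2,             /* ntypes: two loops */
                                2, 1,          /* nin, nout */
                                PyUFunc_Zero,  /* identity for add */
                                "add",         /* name enabling the upcast */
                                "docstring", 0);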

Andrew


[Numpy-discussion] Add/multiply reduction confusion

2009-06-29 Thread Andrew Friedley
Hi,

I'm trying to understand how integer types are upcast for add/multiply 
operations for my GSoC project (Implementing Ufuncs using CorePy).

The documentation says that for reduction with add/multiply operations, 
integer types are 'upcast' to the int_ type (int64 on my system).  What 
exactly does this mean, internally?  Where/when does the upcasting 
occur? Is it a C-style cast, or a memory copy to a new temporary array?

I'm a bit confused as to which low-level ufunc loop type is used (and 
why).  This is what I see:

>>> a = numpy.arange(131072, dtype=numpy.int32)
>>> r = numpy.add.reduce(a)
>>> print type(r)
<type 'numpy.int64'>
>>> print hex(r)
0x1ffff0000L

Okay, fine.  But I have my own ufunc, which defines only the following 
types right now (I stripped it down for debugging):

>>> print corefunc.add.types
['ii->i', 'll->l']

NumPy has this, for comparison:

>>> print numpy.add.types
['??->?', 'bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l', 
'LL->L', 'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'FF->F', 'DD->D', 
'GG->G', 'OO->O']

Also just to verify I did this:

>>> print numpy.typeDict['i']
<type 'numpy.int32'>
>>> print numpy.typeDict['l']
<type 'numpy.int64'>


Yet when I call my own ufunc, this happens:

>>> a = numpy.arange(131072, dtype=numpy.int32)
>>> r = corefunc.add.reduce(a)
>>> print type(r)
<type 'numpy.int32'>
>>> print hex(r)
-0x10000

It looks like no upcasting is occurring here?  My ii->i loop is being 
used, not the ll->l loop... why?  I'm guessing this is something I am 
doing wrong; any ideas what it is?

Andrew


Re: [Numpy-discussion] numpy ufuncs and COREPY - any info?

2009-05-26 Thread Andrew Friedley
David Cournapeau wrote:
 Francesc Alted wrote:
 Well, it is Andrew who should demonstrate that his measurement is 
 correct, but in principle, 4 cycles/item *should* be feasible when 
 using 8 cores in parallel.
 
 But the 100x speed increase is for one core only, unless I misread the
 table. And I should have mentioned that 400 cycles/item for cos is on a
 Pentium 4, which has dreadful performance (defective L1). On a much
 better Core 2 Duo Extreme something, I get 100 cycles/item (on a 64-bit
 machine, though, and not the same compiler, although I guess the libm
 version is what matters the most here).
 
 And let's not forget that there is the Python wrapping cost: by doing
 everything in C, I got ~200 cycles/cos on the P4, and ~60 cycles/cos on
 the Core 2 Duo (for double), using the rdtsc performance counter. All
 this for 1024 items in the array, so a very optimistic use case
 (everything in L2 cache if not L1).
 
 This shows that the Python wrapping cost is not so high, making the 100x
 claim a bit doubtful without more details on the way speed was measured.

I appreciate all the discussion this is creating.  I wish I could work 
on this more right now; I have a big paper deadline coming up June 1 
that I need to focus on.

Yes, you're reading the table right.  I should have been clearer about 
what my implementation is doing.  It's using SIMD, performing four 
cosines at a time where a libm cosine does only one.  Also, I don't 
think libm transcendentals are known for being fast; I'm likely gaining 
performance by using a well-optimized but less accurate approximation as 
well.  In fact, a little more inspection shows my accuracy decreases as 
the input values increase; I will probably need to take a performance 
hit to fix this.
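
The usual fix is range reduction: fold each input into one period before 
applying the fast approximation.  A hedged NumPy sketch of the idea (not 
my actual CorePy kernel):

def reduce_range(x):
    # fold x into [-pi, pi]; the approximation then sees small arguments
    two_pi = 2 * numpy.pi
    return x - two_pi * numpy.round(x / two_pi)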

I went and wrote code to use the libm cosf() routine instead of my cos 
code.  Performance is equivalent to numpy, plus some overhead:

inp sizes       1024    10240   102400   1024000   3072000
numpy         0.7282   9.6278 115.5976  993.5738 3017.3680

lmcos 1       0.7594   9.7579 116.7135 1039.5783 3156.8371
lmcos 2       0.5274   5.7885  61.8052  537.8451 1576.2057
lmcos 4       0.5172   5.1240  40.5018  313.2487  791.9730

corepy 1      0.0142   0.0880   0.9566    9.6162   28.4972
corepy 2      0.0342   0.0754   0.6991    6.1647   15.3545
corepy 4      0.0596   0.0963   0.5671    4.9499   13.8784


The times shown are in milliseconds; the system used is a dual-socket 
dual-core 2GHz Opteron.  I'm testing at the ufunc level, like this:

import time

def benchmark(fn, args):
    avgtime = 0
    fn(*args)   # prime once before timing

    for i in xrange(7):
        t1 = time.time()
        fn(*args)
        t2 = time.time()

        tm = t2 - t1
        avgtime += tm

    return avgtime / 7

Here fn is a ufunc, e.g. numpy.cos.  So I prime the execution once, then 
do 7 timings and take the average.  I always appreciate suggestions on 
better ways to benchmark things.
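
A call then looks something like this (using one of the sizes from the 
table above):

a = numpy.ones(1024000, dtype=numpy.float32)
print benchmark(numpy.cos, (a,))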

Andrew


Re: [Numpy-discussion] numpy ufuncs and COREPY - any info?

2009-05-25 Thread Andrew Friedley
For some reason the list seems to occasionally drop my messages...

Francesc Alted wrote:
 On Friday 22 May 2009 13:52:46, Andrew Friedley wrote:
 I'm the student doing the project.  I have a blog here, which contains
 some initial performance numbers for a couple test ufuncs I did:

 http://numcorepy.blogspot.com

 Another alternative we've talked about, and I (more and more likely) may
 look into is composing multiple operations together into a single ufunc.
   Again the main idea being that memory accesses can be reduced/eliminated.
 
 IMHO, composing multiple operations together is the most promising venue for 
 leveraging current multicore systems.

Agreed -- our concern when scoping the project was to keep it reasonable 
so I can complete it in the GSoC timeframe.  If I have time I'll 
definitely be looking into this over the summer; if not, later.

 Another interesting approach is to implement costly operations (from the 
 point of view of CPU resources), namely transcendental functions like 
 sin, cos or tan, but also others like sqrt or pow, in a parallel way.  
 If, besides, you can combine this with vectorized versions of them (by 
 using the well-spread SSE2 instruction set, see [1] for an example), then 
 you would be able to achieve really good results for sure (at least Intel 
 did with its VML library ;)
 
 [1] http://gruntthepeon.free.fr/ssemath/

I've seen that page before.  Using another source [1] I came up with a 
quick/dirty cos ufunc.  Performance is crazy good compared to NumPy 
(100x); see the latest post on my blog for a little more info.  I'll 
look at the source myself when I get time again, but is NumPy using a 
Python-based cos function, a C implementation, or something else?  As I 
wrote in my blog, the performance gain is almost too good to believe.

[1] http://www.devmaster.net/forums/showthread.php?t=5784
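
For anyone curious, the trick in [1] transliterates to NumPy roughly as 
follows (coefficients are from that public thread, valid only on 
[-pi, pi] -- a sketch of the approach, not my actual CorePy code):

def fast_sin(x):
    # parabolic approximation of sin over [-pi, pi]
    y = (4 / numpy.pi) * x - (4 / numpy.pi ** 2) * x * numpy.abs(x)
    # one refinement step trades a few operations for better accuracy
    return 0.225 * (y * numpy.abs(y) - y) + y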

Andrew


Re: [Numpy-discussion] numpy ufuncs and COREPY - any info?

2009-05-22 Thread Andrew Friedley
(sending again)

Hi,

I'm the student doing the project.  I have a blog here, which contains 
some initial performance numbers for a couple test ufuncs I did:

http://numcorepy.blogspot.com

It's really too early to give definitive results, though; GSoC 
officially starts in two days :)  What I'm finding is that the existing 
ufuncs are already pretty fast; it appears right now that the main 
limitation is memory bandwidth.  If that's really the case, the 
performance gains I'll get will come from cache tricks (non-temporal 
loads/stores), reducing memory accesses, and using multiple cores to get 
more bandwidth.
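
As a concrete example of the non-temporal idea, a hedged SSE sketch (not 
my actual CorePy code; assumes 16-byte-aligned arrays with length a 
multiple of 4):

#include <xmmintrin.h>

void add_nt(float *a, float *b, float *out, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        /* stream the result around the cache; out isn't re-read soon */
        _mm_stream_ps(&out[i], _mm_add_ps(va, vb));
    }
    _mm_sfence();  /* drain the write-combining buffers */
}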

Another alternative we've talked about, and that I (more and more 
likely) may look into, is composing multiple operations together into a 
single ufunc -- again, the main idea being that memory accesses can be 
reduced/eliminated; see the sketch below.
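
A trivial plain-NumPy illustration of why composition helps: computing 
a*b + c as two separate ufunc calls makes two full passes over memory, 
where a single fused kernel would make one.

import numpy
a = numpy.ones(1000000, dtype=numpy.float32)
b = numpy.ones(1000000, dtype=numpy.float32)
c = numpy.ones(1000000, dtype=numpy.float32)

tmp = a * b      # pass 1: writes a full-size temporary
out = tmp + c    # pass 2: immediately reads it back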

Andrew

dmitrey wrote:
 hi all,
 has anyone already tried to compare using an ordinary numpy ufunc vs
 the one from corepy?  First of all I mean the project
 http://socghop.appspot.com/student_project/show/google/gsoc2009/python/t124024628235
 
 It would be interesting to know what the speedup is for (e.g.) vec ** 0.5
 or (if it's possible -- it isn't a pure ufunc) numpy.dot(Matrix, vec). Or
 any other example.


Re: [Numpy-discussion] numpy ufuncs and COREPY - any info?

2009-05-22 Thread Andrew Friedley


Francesc Alted wrote:
 On Friday 22 May 2009 11:42:56, Gregor Thalhammer wrote:
 dmitrey wrote:
 3) Improving performance by using multiple cores is much more difficult.
 Only for sufficiently large (> 1e5) arrays is a significant speedup
 possible. Where a speed gain is possible, the MKL uses several cores.
 Some experimentation showed that by adding a few OpenMP constructs you
 could get a similar speedup with numpy.
 4) numpy.dot uses optimized implementations.
 
 Good points Gregor.  However, I wouldn't say that improving performance 
 by using multiple cores is *that* difficult, but rather that multiple 
 cores can only be used efficiently *whenever* the memory bandwidth is not 
 a limitation.  An example of this is the computation of transcendental 
 functions, where, even using vectorized implementations, the computation 
 speed is still CPU-bound in many cases.  And you have yourself seen very 
 good speed-ups for these cases with your implementation of numexpr/MKL :)

Using multiple cores is pretty easy for element-wise ufuncs; no 
communication needs to occur and the work partitioning is trivial.  And 
actually I've found with some initial testing that multiple cores do 
still help when you are memory-bound.  I don't fully understand why yet, 
though I have some ideas.  One reason is multiple memory controllers due 
to multiple sockets (i.e. Opteron).  Another is that each thread pulls 
memory from a different bank, utilizing more bandwidth than a single 
sequential thread could.  However, if that's the case, we could possibly 
come up with code for a single thread that achieves (nearly) the same 
additional throughput; see the sketch below.
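
To illustrate how trivial the partitioning is, a pure-Python sketch 
(illustrative only -- real gains require the per-thread kernel to release 
the GIL, as a C or CorePy loop would):

import threading
import numpy

def parallel_ufunc(ufunc, inp, out, nthreads=4):
    # each thread computes a disjoint slice; no communication needed
    chunk = (len(inp) + nthreads - 1) // nthreads
    def worker(lo, hi):
        ufunc(inp[lo:hi], out[lo:hi])   # ufuncs accept an output argument
    threads = [threading.Thread(target=worker,
                                args=(i * chunk, min((i + 1) * chunk, len(inp))))
               for i in xrange(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()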

Andrew