Re: [Numpy-discussion] Objected-oriented SIMD API for Numpy
sigh; yet another email dropped by the list.

David Warde-Farley wrote:
> On 21-Oct-09, at 9:14 AM, Pauli Virtanen wrote:
>> Since these are ufuncs, I suppose the SSE implementations could just be put in a separate module, which is always compiled. Before importing the module, we could simply check from the Python side that the CPU supports the necessary instructions. If everything is OK, the accelerated implementations would then just replace the Numpy routines.
> Am I mistaken, or wasn't that sort of the goal of Andrew Friedley's CorePy work this summer? Looking at his slides again, the speedups are rather impressive. I wonder if these could be usefully integrated into numpy itself?

Yes, my GSoC project is closely related, though I didn't do the CPU detection part; that'd be easy to do. Also, I wrote my code specifically for 64-bit x86. I didn't focus so much on the transcendental functions, though they wouldn't be too hard to implement. There's also the possibility of providing implementations with differing tradeoffs between accuracy and performance. I think the blog link got posted already, but here's the relevant info:

http://numcorepy.blogspot.com
http://www.corepy.org/wiki/index.php?title=CoreFunc

I talked about this in my SciPy talk and upcoming paper as well. Also, people have only been talking about x86 in this thread -- other architectures could be supported too, e.g. PPC/Altivec or even the Cell SPU and other accelerators. I actually wrote a quick/dirty implementation of addition and vector-normalization ufuncs for the Cell SPU recently. The basic result is that overall performance is very roughly comparable to a similar-speed x86 chip, but this is a huge win over just running on the extremely slow Cell PPC cores.

Andrew

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
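Pauli's suggestion -- check CPU capabilities from the Python side before importing an accelerated module -- can be sketched in a few lines. This is a hedged, Linux-only illustration (it parses /proc/cpuinfo, and the accelerated module named in the comment is hypothetical), not how numpy itself does detection:

```python
import re

def cpu_has_flags(required, cpuinfo_path="/proc/cpuinfo"):
    """Return True if the CPU advertises all required ISA flags.

    Linux-only sketch: parses the 'flags' line of /proc/cpuinfo.
    On any failure, report False so callers fall back to portable code.
    """
    try:
        with open(cpuinfo_path) as f:
            text = f.read()
    except IOError:
        return False
    m = re.search(r"^flags\s*:\s*(.*)$", text, re.MULTILINE)
    if m is None:
        return False
    flags = set(m.group(1).split())
    return set(required) <= flags

# Hypothetical use: only load SSE2-accelerated ufuncs when it is safe.
# if cpu_has_flags(["sse2"]):
#     import numpy_sse2_ufuncs  # hypothetical accelerated module
```

The same idea extends to checking for SSE3/SSSE3 or Altivec flags before swapping in the corresponding implementations.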
Re: [Numpy-discussion] strange sin/cos performance
> Is anyone with this problem *not* running ubuntu?

Me - RHEL 5.2 opteron:
Python 2.6.1 (r261:67515, Jan 5 2009, 10:19:01) [GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2

Fedora 9 PS3/PPC:
Python 2.5.1 (r251:54863, Jul 17 2008, 13:25:23) [GCC 4.3.1 20080708 (Red Hat 4.3.1-4)] on linux2

Actually, I now have some interesting results indicating that the issue isn't in Python or NumPy at all. I just wrote a C program to try to reproduce the error, and was able to do so (actually the difference is even larger).

Opteron:
float (32) time in usecs: 179698
double (64) time in usecs: 13795

PS3/PPC:
float (32) time in usecs: 614821
double (64) time in usecs: 37163

I've attached the code for others to review and/or try out. I guess this is worth showing to the libc people?

Andrew

#include <stdio.h>
#include <math.h>
#include <sys/time.h>

#define LEN 159161

float inp32[LEN];
float out32[LEN];
double inp64[LEN];
double out64[LEN];

int main(int argc, char** argv) {
    struct timeval tv_start;
    struct timeval tv_stop;
    int i;

    for(i = 0; i < LEN; i++) {
        //inp32[i] = ((float)i / (float)LEN) * (2 * M_PI);
        inp32[i] = (float)i;
        out32[i] = 0.0;
    }

    gettimeofday(&tv_start, NULL);
    for(i = 0; i < LEN; i++) {
        out32[i] = cosf(inp32[i]);
    }
    gettimeofday(&tv_stop, NULL);

    if(tv_start.tv_sec != tv_stop.tv_sec) {
        puts("seconds changed, re-run the benchmark");
    }
    printf("float (32) time in usecs: %d\n",
           (int)(tv_stop.tv_usec - tv_start.tv_usec));

    for(i = 0; i < LEN; i++) {
        //inp64[i] = ((double)i / (double)LEN) * (2 * M_PI);
        inp64[i] = (double)i;
        out64[i] = 0.0;
    }

    gettimeofday(&tv_start, NULL);
    for(i = 0; i < LEN; i++) {
        out64[i] = cos(inp64[i]);
    }
    gettimeofday(&tv_stop, NULL);

    if(tv_start.tv_sec != tv_stop.tv_sec) {
        puts("seconds changed, re-run the benchmark");
    }
    printf("double (64) time in usecs: %d\n",
           (int)(tv_stop.tv_usec - tv_start.tv_usec));

    return 0;
}
Re: [Numpy-discussion] strange sin/cos performance
Bruce Southey wrote:
> Hi,
> Can you try these from the command line:
>
> python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32)"
> python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32); b=np.sin(a)"
> python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32); np.sin(a)"
> python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32)" "np.sin(a)"
>
> The first should be similar for different dtypes because it is just array creation. The second extends that by storing the sin into another array. I am not sure how to interpret the third, but at the Python prompt it would print the result to the screen. The last causes Python to handle two arguments, which is slow using float32 but not for float64 and float128, suggesting a compiler issue such as not using SSE or similar.

Results:

$ python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32)"
100 loops, best of 3: 0.0811 usec per loop
$ python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32); b=np.sin(a)"
100 loops, best of 3: 0.11 usec per loop
$ python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32); np.sin(a)"
100 loops, best of 3: 0.11 usec per loop
$ python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float32)" "np.sin(a)"
100 loops, best of 3: 112 msec per loop
$ python -m timeit -n 100 -s "import numpy as np; a = np.arange(0.0, 1000, (2*3.14159) / 1000, dtype=np.float64)" "np.sin(a)"
100 loops, best of 3: 13.2 msec per loop

I think the second and third are effectively the same; both create an array containing the result. The second assigns that array to a name, while the third does not, so it should get garbage collected.
The fourth one is the only one that actually runs the sin in the timing loop. I don't understand what you mean by causing Python to handle two arguments? The fifth run I added uses float64 for comparison (and reproduces the problem).

Andrew
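To see why the first three commands report ~0.1 usec: everything after -s is setup, which runs once per measurement, outside the timed loop, so those commands time an empty statement. Only an expression passed as a separate (non -s) argument runs inside the loop. The same effect is visible with the timeit module directly -- a dependency-free sketch using math.sin in place of np.sin:

```python
import timeit

# Setup runs once per measurement; only the statement runs in the timed loop.
setup = "import math; xs = [i * 0.00628 for i in range(1000)]"

# Statement is just `pass`: this measures an empty loop body, which is why a
# command that puts all the work after -s looks absurdly fast.
empty = min(timeit.repeat("pass", setup=setup, number=100, repeat=3))

# Statement actually computes 1000 sines per iteration.
work = min(timeit.repeat("[math.sin(x) for x in xs]",
                         setup=setup, number=100, repeat=3))

print("empty stmt: %.6f s, real work: %.6f s" % (empty, work))
```

The same structure with "np.sin(a)" as the statement reproduces the fourth and fifth commands above.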
Re: [Numpy-discussion] strange sin/cos performance
Charles R Harris wrote:
> On Mon, Aug 3, 2009 at 11:51 AM, Andrew Friedley afrie...@indiana.edu wrote:
>> Charles R Harris wrote:
>>> What compiler versions are folks using? In the slow cases, what is the timing for converting to double, computing the sin, then casting back to single?
>> I did this, is this the right way to do that?
>>
>> t = timeit.Timer("numpy.sin(a.astype(numpy.float64)).astype(numpy.float32)",
>>         "import numpy\na = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float64)")
>> print "sin converted float 32/64", min(t.repeat(3, 10))
>>
>> Timings on my opteron system (2-socket 2-core, 2GHz):
>> sin float32 1.13407707214
>> sin float64 0.133460998535
>> sin converted float 32/64 0.18202996254
>>
>> Not too surprising, I guess. gcc --version shows:
>> gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)
>> My compile flags for my Python 2.6.1/NumPy 1.3.0 builds:
>> -Os -fomit-frame-pointer -pipe -s -march=k8 -m64
> That looks right. When numpy doesn't find a *f version it basically does that conversion. This is beginning to look like a hardware/software implementation problem, maybe compiler related. That is, I suspect the fast times come from using a hardware implementation. What happens if you use -O2 instead of -Os?

Do you know where this conversion is in the code? The impression I got from my quick look at the code was that a wrapper sinf was defined that just calls sin. I guess the typecast to float in there will do the conversion; is that what you are referring to, or something at a higher level?

I recompiled the same versions of Python/NumPy using the same flags except -O2 instead of -Os; the behavior is still the same.

Andrew
Re: [Numpy-discussion] strange sin/cos performance
David Cournapeau wrote:
> On Wed, Aug 5, 2009 at 12:14 AM, Andrew Friedley afrie...@indiana.edu wrote:
>> Do you know where this conversion is in the code? The impression I got from my quick look at the code was that a wrapper sinf was defined that just calls sin. I guess the typecast to float in there will do the conversion
> Exact. Given your CPU, compared to my macbook, it looks like the float32 is the problem (i.e. the float64 is not particularly fast). I really can't see what could cause such a slowdown: the range over which you evaluate sin should not cause denormal numbers - just to be sure, could you try the same benchmark but using a simple array of constant values (say numpy.ones(1000))? Also, you may want to check what happens if you force raising errors in case of FPU exceptions (numpy.seterr(raise=all)).

OK, I have some interesting results. First, my array creation was not doing what I thought it was. This (what I've been doing) creates an array of 159161 elements:

numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float32)

which isn't what I was after (1000 elements ranging from 0 to 2*PI), so the values in that array climb up to 999.999. Running with numpy.ones() gives a much different timing (I used numpy.ones(159161) to keep the array lengths the same):

sin float32 0.078202009201
sin float64 0.0767619609833
cos float32 0.0750858783722
cos float64 0.088515996933

Much better, but still a little strange; float32 should be relatively faster yet. I tried with 1000 elements and got similar results. So the performance has something to do with the input values. This is believable, but I don't think it explains why float32 would behave that way and not float64, unless there's something else I don't understand. Also, I assume you meant seterr(all='raise'). This didn't seem to do anything; I don't see any exceptions thrown or other output.

Andrew
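The input-value dependence is easy to check directly. A hedged sketch (whether the large-value case is actually slower depends entirely on the platform's libm, so no particular ratio is claimed; numpy is assumed available):

```python
import timeit

# The length that arange(0.0, 1000, (2*3.14159)/1000) really produces.
N = 159161

def best(stmt, setup):
    # Best of 3 runs of 10 iterations, matching the thread's convention.
    return min(timeit.repeat(stmt, setup=setup, number=10, repeat=3))

# Constant inputs near 1 vs. inputs climbing to ~159160.
ones = best("np.sin(a)",
            "import numpy as np; a = np.ones(%d, dtype=np.float32)" % N)
big = best("np.sin(a)",
           "import numpy as np; a = np.arange(%d, dtype=np.float32)" % N)

print("sin float32, values ~1:         %f" % ones)
print("sin float32, values to ~159160: %f" % big)
```

On a libm whose float32 path slows down for large arguments (as reported in this thread), the second number should be dramatically larger; on others the two are comparable.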
Re: [Numpy-discussion] strange sin/cos performance
Charles R Harris wrote:
> Depends on the CPU, FPU and the compiler flags. The computations could very well be done using double precision internally with conversions on load/store.

Sure, but if this is the case, why is the performance blowing up on larger input values for float32 but not float64? Both should blow up, not just one or the other. In other words, I think they are using different implementations :) Am I missing something?

Andrew
[Numpy-discussion] strange sin/cos performance
While working on GSoC stuff I came across this weird performance behavior for sine and cosine -- using float32 is way slower than float64. On a 2GHz opteron:

sin float32 1.12447786331
sin float64 0.133481025696
cos float32 1.14155912399
cos float64 0.131420135498

The times are in seconds, and are the best of three runs of ten iterations of numpy.{sin,cos} over a 1000-element array (script attached). I've produced similar results on a PS3 system also. The opteron is running Python 2.6.1 and NumPy 1.3.0, while the PS3 has Python 2.5.1 and NumPy 1.1.1. I haven't jumped into the code yet, but does anyone know why sin/cos are ~8.5x slower for 32-bit floats compared to 64-bit doubles?

Side question: I see people in emails writing things like 'timeit foo(x)' and having it run some sort of standard benchmark; how exactly do I do that? Is that some environment other than a normal Python?

Thanks,

Andrew

import timeit

t = timeit.Timer("numpy.sin(a)",
        "import numpy\na = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float32)")
print "sin float32", min(t.repeat(3, 10))

t = timeit.Timer("numpy.sin(a)",
        "import numpy\na = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float64)")
print "sin float64", min(t.repeat(3, 10))

t = timeit.Timer("numpy.cos(a)",
        "import numpy\na = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float32)")
print "cos float32", min(t.repeat(3, 10))

t = timeit.Timer("numpy.cos(a)",
        "import numpy\na = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float64)")
print "cos float64", min(t.repeat(3, 10))
Re: [Numpy-discussion] Add/multiply reduction confusion
Gael Varoquaux wrote:
> On Sun, Jul 05, 2009 at 02:47:18PM -0400, Andrew Friedley wrote:
>> Stéfan van der Walt wrote:
>>> 2009/7/5 Andrew Friedley afrie...@indiana.edu:
>>>> I found the check that does the type 'upcasting' in umath_ufunc_object.inc around line 3072 (NumPy 1.3.0). Turns out all I need to do is make sure my add and multiply ufuncs are actually named 'add' and 'multiply' and arrays will be upcasted appropriately.
>>> Would you please be so kind as to add your findings here:
>>> http://docs.scipy.org/numpy/docs/numpy-docs/reference/index.rst/#reference-index
>>> I haven't read through that document recently, so it may be in there already.
>> I created an account (afriedle) but looks like I don't have edit permissions.
> I have added you to the Editor list.

Thanks, and sorry about the delay; I went and added the comment I proposed.

Andrew
Re: [Numpy-discussion] strange sin/cos performance
Thanks for the quick responses.

David Cournapeau wrote:
> On Mon, Aug 3, 2009 at 10:32 PM, Andrew Friedley afrie...@indiana.edu wrote:
>> While working on GSoC stuff I came across this weird performance behavior for sine and cosine -- using float32 is way slower than float64. On a 2GHz opteron:
>>
>> sin float32 1.12447786331
>> sin float64 0.133481025696
>> cos float32 1.14155912399
>> cos float64 0.131420135498
> Which OS are you on? FWIW, on Mac OS X, with a recent svn checkout, I get the expected results (float32 ~ twice as fast).

The numbers above are on linux, RHEL 5.2. The PS3 is running Fedora 9, I think. I just ran on a PPC OSX 10.5 system:

sin float32 0.111793041229
sin float64 0.0902218818665
cos float32 0.112202882767
cos float64 0.0917768478394

Much more reasonable, but still not what I'd expect or what you seem to expect.

>> The times are in seconds, and are the best of three runs of ten iterations of numpy.{sin,cos} over a 1000-element array (script attached). I've produced similar results on a PS3 system also. The opteron is running Python 2.6.1 and NumPy 1.3.0, while the PS3 has Python 2.5.1 and NumPy 1.1.1. I haven't jumped into the code yet, but does anyone know why sin/cos are ~8.5x slower for 32-bit floats compared to 64-bit doubles?
> My guess would be that you are on a platform where there is no sinf, and our sinf replacement is bad for some reason.

I think linux has sinf; is there a quick/easy way to check if numpy is using it?

>> Side question: I see people in emails writing things like 'timeit foo(x)' and having it run some sort of standard benchmark; how exactly do I do that? Is that some environment other than a normal Python?
> Yes, that's in ipython.

Thanks for the pointer.

Andrew
Re: [Numpy-discussion] strange sin/cos performance
Emmanuelle Gouillart wrote:
> Hi Andrew,
> %timeit is an IPython magic command that uses the timeit module; see http://ipython.scipy.org/doc/stable/html/interactive/reference.html?highlight=timeit for more information about how to use it. So you were right to suppose that it is not a normal Python.

Thanks for the pointer; I'm not familiar with IPython at all, will check it out.

> However, I was not able to reproduce your observations.
>
> >>> import numpy as np
> >>> a = np.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=np.float32)
> >>> b = np.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=np.float64)
> >>> %timeit -n 10 np.sin(a)
> 10 loops, best of 3: 8.67 ms per loop
> >>> %timeit -n 10 np.sin(b)
> 10 loops, best of 3: 9.29 ms per loop

OK, I'm curious -- what OS/Python/NumPy are you using?

Andrew
Re: [Numpy-discussion] strange sin/cos performance
David Cournapeau wrote:
>> The numbers above are on linux, RHEL 5.2. The PS3 is running Fedora 9, I think.
> I know next to nothing about the PS3 hardware, but I know that it is quite different compared to a conventional x86 CPU. Does it even have both 4- and 8-byte native floats?

Yes. As far as this discussion is concerned, the PS3/Cell is just a slow PowerPC. Quite different from x86, but probably not as different as you think :)

>> Much more reasonable, but still not what I'd expect or what you seem to expect.
> On an x86 system with sinf available in the math lib, I would expect float32 to be faster than float64. Other than that, the exact ratio depends on too many factors (SSE vs x87 usage, cache size, compiler, math library performance). One order of magnitude slower seems very strange in any case.

OK. I'll probably investigate this a bit further, but I don't have anything that really depends on this issue. It does explain a large part of why my cos ufunc was so much faster. Since I'm observing this on both x86 and PPC (PS3), I don't think it's a hardware issue -- something in the software stack. And now there are two people reporting results differing only in numpy version.

>> I think linux has sinf; is there a quick/easy way to check if numpy is using it?
> You can look at the config.h in numpy/core/include/numpy, and see if there is a HAVE_SINF defined (for numpy >= 1.2.0 at least).

OK, I see HAVE_SINF (and HAVE_COSF) for my 1.3.0 build on the opteron system. I'm using the distro-provided packages on the other systems, so I guess I can't check those. I don't think this matters -- numpy/core/src/npy_math.c just defines sinf as a function calling sin. So if HAVE_SINF wasn't set, I'd expect the performance difference to be very small, with floats still being slightly faster (less memory traffic).

Also, I just went and wrote a C program to do a similar benchmark, and I am unable to reproduce the issue there. Makes me think the problem is in NumPy, but I have no idea where to look. Suggestions welcome :)

Andrew
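One quick/easy way to probe the libc side without recompiling anything is to call libm's sinf and sin directly through ctypes and compare them. This is a hedged, POSIX-only sketch; the per-call ctypes overhead is large, so it can only reveal order-of-magnitude differences like the one in this thread:

```python
import ctypes
import ctypes.util
import timeit

# Load the system math library (falls back to the common Linux soname).
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")
libm.sinf.argtypes = [ctypes.c_float]    # single-precision entry point
libm.sinf.restype = ctypes.c_float
libm.sin.argtypes = [ctypes.c_double]    # double-precision entry point
libm.sin.restype = ctypes.c_double

x = 999.0  # a large argument, like the tail of the arange in the benchmark
t32 = min(timeit.repeat(lambda: libm.sinf(x), number=100000, repeat=3))
t64 = min(timeit.repeat(lambda: libm.sin(x), number=100000, repeat=3))
print("libm sinf: %.4f s   libm sin: %.4f s" % (t32, t64))
```

If sinf itself is an order of magnitude slower than sin here, the problem is in libm; if the two are comparable, numpy's wrapper layer becomes the suspect.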
Re: [Numpy-discussion] strange sin/cos performance
Charles R Harris wrote:
> What compiler versions are folks using? In the slow cases, what is the timing for converting to double, computing the sin, then casting back to single?

I did this, is this the right way to do that?

t = timeit.Timer("numpy.sin(a.astype(numpy.float64)).astype(numpy.float32)",
        "import numpy\na = numpy.arange(0.0, 1000, (2 * 3.14159) / 1000, dtype=numpy.float64)")
print "sin converted float 32/64", min(t.repeat(3, 10))

Timings on my opteron system (2-socket 2-core, 2GHz):

sin float32 1.13407707214
sin float64 0.133460998535
sin converted float 32/64 0.18202996254

Not too surprising, I guess. gcc --version shows:

gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)

My compile flags for my Python 2.6.1/NumPy 1.3.0 builds:

-Os -fomit-frame-pointer -pipe -s -march=k8 -m64

Andrew
Re: [Numpy-discussion] Add/multiply reduction confusion
I figured this out, in case anyone is interested. I found the check that does the type 'upcasting' in umath_ufunc_object.inc around line 3072 (NumPy 1.3.0). Turns out all I need to do is make sure my add and multiply ufuncs are actually named 'add' and 'multiply' and arrays will be upcast appropriately. Maybe this is worth documenting somewhere, maybe in the UFunc C API? Or is it documented already, and I just missed it?

Andrew

Andrew Friedley wrote:
> Hi,
>
> I'm trying to understand how integer types are upcast for add/multiply operations for my GSoC project (Implementing Ufuncs using CorePy). The documentation says that for reduction with add/multiply operations, integer types are 'upcast' to the int_ type (int64 on my system). What exactly does this mean, internally? Where/when does the upcasting occur? Is it a C-style cast, or a memory copy to a new temporary array? I'm a bit confused as to which low-level ufunc loop type is used (and why). This is what I see:
>
> >>> a = numpy.arange(131072, dtype=numpy.int32)
> >>> r = numpy.add.reduce(a)
> >>> print type(r)
> <type 'numpy.int64'>
> >>> print hex(r)
> 0x1ffff0000L
>
> Okay, fine. But I have my own ufunc, which defines only the following types right now (I stripped it down for debugging):
>
> >>> print corefunc.add.types
> ['ii->i', 'll->l']
>
> NumPy has this, for comparison:
>
> >>> print numpy.add.types
> ['??->?', 'bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l', 'LL->L', 'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'FF->F', 'DD->D', 'GG->G', 'OO->O']
>
> Also, just to verify, I did this:
>
> >>> print numpy.typeDict['i']
> <type 'numpy.int32'>
> >>> print numpy.typeDict['l']
> <type 'numpy.int64'>
>
> Yet when I call my own ufunc, this happens:
>
> >>> a = numpy.arange(131072, dtype=numpy.int32)
> >>> r = corefunc.add.reduce(a)
> >>> print type(r)
> <type 'numpy.int32'>
> >>> print hex(r)
> -0x10000
>
> It looks like no upcasting is occurring here? My ii->i loop is being used, not the ll->l loop.. why? I'm guessing this is something I am doing wrong; any ideas what it is?
>
> Andrew
Re: [Numpy-discussion] Add/multiply reduction confusion
Stéfan van der Walt wrote:
> 2009/7/5 Andrew Friedley afrie...@indiana.edu:
>> I found the check that does the type 'upcasting' in umath_ufunc_object.inc around line 3072 (NumPy 1.3.0). Turns out all I need to do is make sure my add and multiply ufuncs are actually named 'add' and 'multiply' and arrays will be upcast appropriately.
> Would you please be so kind as to add your findings here:
> http://docs.scipy.org/numpy/docs/numpy-docs/reference/index.rst/#reference-index
> I haven't read through that document recently, so it may be in there already.

I created an account (afriedle) but it looks like I don't have edit permissions.

The user-side upcasting behavior for add/multiply is documented (though a little hidden) in the long paragraph right before the 'Available ufuncs' section:

http://docs.scipy.org/numpy/docs/numpy-docs/reference/ufuncs.rst/#ufuncs

I'm thinking something should be added on the C API side though, perhaps on the 'name' parameter of PyUFunc_FromFuncAndData():

http://docs.scipy.org/numpy/docs/numpy-docs/reference/c-api.ufunc.rst/#c-api-ufunc

I would say something to the effect of (I borrowed from the user-side blurb :) ):

    Specifying a name of 'add' or 'multiply' enables a special behavior for integer-typed reductions when no dtype is given. If the input type is an integer (or boolean) data type smaller than the size of the int_ data type, it will be internally upcast to the int_ (or uint) data type.

Andrew
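The user-visible half of the behavior the proposed blurb describes is easy to demonstrate: only reductions of ufuncs named 'add' or 'multiply' get the int_ upcast, and passing an explicit dtype suppresses it (the int64 mentioned in the comment assumes a platform where int_ is 64-bit):

```python
import numpy as np

a = np.arange(10, dtype=np.int8)

r = np.add.reduce(a)                  # named 'add': upcast to the int_ type
print(r.dtype)                        # int_ (int64 on most 64-bit platforms)

r2 = np.add.reduce(a, dtype=np.int8)  # explicit dtype suppresses the upcast
print(r2.dtype)                       # int8

r3 = np.maximum.reduce(a)             # not add/multiply: no special upcast
print(r3.dtype)                       # int8
```

This is the same behavior the C-side check keys off the ufunc's 'name' field to enable, which is why renaming a custom ufunc to 'add' was enough to get it.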
[Numpy-discussion] Add/multiply reduction confusion
Hi,

I'm trying to understand how integer types are upcast for add/multiply operations for my GSoC project (Implementing Ufuncs using CorePy). The documentation says that for reduction with add/multiply operations, integer types are 'upcast' to the int_ type (int64 on my system). What exactly does this mean, internally? Where/when does the upcasting occur? Is it a C-style cast, or a memory copy to a new temporary array? I'm a bit confused as to which low-level ufunc loop type is used (and why). This is what I see:

>>> a = numpy.arange(131072, dtype=numpy.int32)
>>> r = numpy.add.reduce(a)
>>> print type(r)
<type 'numpy.int64'>
>>> print hex(r)
0x1ffff0000L

Okay, fine. But I have my own ufunc, which defines only the following types right now (I stripped it down for debugging):

>>> print corefunc.add.types
['ii->i', 'll->l']

NumPy has this, for comparison:

>>> print numpy.add.types
['??->?', 'bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l', 'LL->L', 'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'FF->F', 'DD->D', 'GG->G', 'OO->O']

Also, just to verify, I did this:

>>> print numpy.typeDict['i']
<type 'numpy.int32'>
>>> print numpy.typeDict['l']
<type 'numpy.int64'>

Yet when I call my own ufunc, this happens:

>>> a = numpy.arange(131072, dtype=numpy.int32)
>>> r = corefunc.add.reduce(a)
>>> print type(r)
<type 'numpy.int32'>
>>> print hex(r)
-0x10000

It looks like no upcasting is occurring here? My ii->i loop is being used, not the ll->l loop.. why? I'm guessing this is something I am doing wrong; any ideas what it is?

Andrew
Re: [Numpy-discussion] numpy ufuncs and COREPY - any info?
David Cournapeau wrote:
> Francesc Alted wrote:
>> Well, it is Andrew who should demonstrate that his measurement is correct, but in principle, 4 cycles/item *should* be feasible when using 8 cores in parallel.
> But the 100x speed increase is for one core only, unless I misread the table. And I should have mentioned that 400 cycles/item for cos is on a Pentium 4, which has dreadful performance (defective L1). On a much better Core 2 Duo Extreme something, I get 100 cycles/item (on a 64-bit machine, though, and not the same compiler, although I guess the libm version is what matters most here). And let's not forget that there is the Python wrapping cost: by doing everything in C, I got ~200 cycles/cos on the PIV, and ~60 cycles/cos on the Core 2 Duo (for double), using the rdtsc performance counter. All this for 1024 items in the array, so a very optimistic use case (everything in cache 2 if not 1). This shows that the Python wrapping cost is not so high, making the 100x claim a bit doubtful without more details on the way speed was measured.

I appreciate all the discussion this is creating. I wish I could work on this more right now; I have a big paper deadline coming up June 1 that I need to focus on.

Yes, you're reading the table right. I should have been more clear on what my implementation is doing. It's using SIMD, so it performs four cosines at a time where a libm cosine is only doing one. Also, I don't think libm transcendentals are known for being fast; I'm also likely gaining performance by using a well-optimized but less accurate approximation. In fact, a little more inspection shows my accuracy decreases as the input values increase; I will probably need to take a performance hit to fix this.

I went and wrote code to use the libm fcos() routine instead of my cos code.
Performance is equivalent to numpy, plus an overhead:

inp sizes      1024     10240    102400    1024000    3072000
numpy        0.7282    9.6278  115.5976   993.5738  3017.3680
lmcos 1      0.7594    9.7579  116.7135  1039.5783  3156.8371
lmcos 2      0.5274    5.7885   61.8052   537.8451  1576.2057
lmcos 4      0.5172    5.1240   40.5018   313.2487   791.9730
corepy 1     0.0142    0.0880    0.9566     9.6162    28.4972
corepy 2     0.0342    0.0754    0.6991     6.1647    15.3545
corepy 4     0.0596    0.0963    0.5671     4.9499    13.8784

The times I show are in milliseconds; the system used is a dual-socket dual-core 2GHz opteron. I'm testing at the ufunc level, like this:

def benchmark(fn, args):
    avgtime = 0
    fn(*args)   # prime the execution
    for i in xrange(7):
        t1 = time.time()
        fn(*args)
        t2 = time.time()
        tm = t2 - t1
        avgtime += tm
    return avgtime / 7

where fn is a ufunc, e.g. numpy.cos. So I prime the execution once, then do 7 timings and take the average. I always appreciate suggestions on better ways to benchmark things.

Andrew
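On the benchmarking question, one common refinement is to take the minimum rather than the mean (interference from other processes only ever adds time) and to let timeit run the loop. A sketch of the same harness in that style -- stdlib-only, with a math.cos list comprehension standing in for a real ufunc call:

```python
import math
import timeit

def benchmark(fn, args, repeat=7, number=10):
    """Return the best per-call time for fn(*args).

    The first call primes caches (like the original version); taking
    min() over several repeats is less noisy than averaging them.
    """
    timer = timeit.Timer(lambda: fn(*args))
    timer.timeit(number=1)  # warm-up / priming call
    return min(timer.repeat(repeat=repeat, number=number)) / number

xs = [i * 0.001 for i in range(10000)]
t = benchmark(lambda v: [math.cos(x) for x in v], (xs,))
print("per-call: %.6f s" % t)
```

With fn set to numpy.cos and args to an array this drops into the table-generation code above unchanged.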
Re: [Numpy-discussion] numpy ufuncs and COREPY - any info?
For some reason the list seems to occasionally drop my messages...

Francesc Alted wrote:
> A Friday 22 May 2009 13:52:46 Andrew Friedley escrigué:
>> I'm the student doing the project. I have a blog here, which contains some initial performance numbers for a couple of test ufuncs I did:
>> http://numcorepy.blogspot.com
>> Another alternative we've talked about, and I (more and more likely) may look into, is composing multiple operations together into a single ufunc. Again the main idea being that memory accesses can be reduced/eliminated.
> IMHO, composing multiple operations together is the most promising venue for leveraging current multicore systems.

Agreed -- our concern when scoping the project was to keep it reasonable so I can complete it in the GSoC timeframe. If I have time I'll definitely be looking into this over the summer; if not, later.

> Another interesting approach is to implement costly operations (from the point of view of CPU resources), namely, transcendental functions like sin, cos or tan (but also others like sqrt or pow), in a parallel way. If, besides, you can combine this with vectorized versions of them (by using the well-spread SSE2 instruction set, see [1] for an example), then you would be able to achieve really good results for sure (at least Intel did with its VML library ;)
>
> [1] http://gruntthepeon.free.fr/ssemath/

I've seen that page before. Using another source [1] I came up with a quick/dirty cos ufunc. Performance is crazy good compared to NumPy (100x); see the latest post on my blog for a little more info. I'll look at the source myself when I get time again, but is NumPy using a Python-based cos function, a C implementation, or something else? As I wrote in my blog, the performance gain is almost too good to believe.

[1] http://www.devmaster.net/forums/showthread.php?t=5784

Andrew
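To make the "composing multiple operations" idea concrete, here is a hedged pure-numpy sketch (the function name is made up): evaluating an expression chunk by chunk keeps each intermediate small enough to stay in cache, instead of streaming several full-size temporaries through memory the way the naive expression does. CorePy- or numexpr-style code generation goes further by fusing the loop bodies themselves:

```python
import numpy as np

def chunked_hypot(a, b, chunk=16384):
    """sqrt(a*a + b*b) evaluated in cache-sized chunks.

    The naive np.sqrt(a*a + b*b) streams three full-length temporaries
    through memory; here each temporary is at most `chunk` elements.
    """
    out = np.empty_like(a)
    for i in range(0, len(a), chunk):
        s = slice(i, i + chunk)
        np.multiply(a[s], a[s], out[s])   # out = a*a, written in place
        out[s] += b[s] * b[s]             # only a chunk-sized temporary
        np.sqrt(out[s], out[s])           # in-place sqrt
    return out

a = np.linspace(0.0, 1.0, 100000)
b = np.linspace(1.0, 2.0, 100000)
assert np.allclose(chunked_hypot(a, b), np.sqrt(a * a + b * b))
```

Whether this actually wins depends on array size relative to cache; for arrays that already fit in cache the extra loop overhead can dominate.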
Re: [Numpy-discussion] numpy ufuncs and COREPY - any info?
(sending again)

Hi, I'm the student doing the project. I have a blog here, which contains some initial performance numbers for a couple of test ufuncs I did:

http://numcorepy.blogspot.com

It's really too early yet to give definitive results, though; GSoC officially starts in two days :) What I'm finding is that the existing ufuncs are already pretty fast; it appears right now that the main limitation is memory bandwidth. If that's really the case, the performance gains I'll get will be through cache tricks (non-temporal loads/stores), reducing memory accesses, and using multiple cores to get more bandwidth.

Another alternative we've talked about, and I (more and more likely) may look into, is composing multiple operations together into a single ufunc. Again the main idea being that memory accesses can be reduced/eliminated.

Andrew

dmitrey wrote:
> hi all,
> has anyone already tried to compare using an ordinary numpy ufunc vs that one from corepy, first of all I mean the project http://socghop.appspot.com/student_project/show/google/gsoc2009/python/t124024628235
> It would be interesting to know what is the speedup for (e.g.) vec ** 0.5 or (if it's possible - it isn't a pure ufunc) numpy.dot(Matrix, vec). Or any other example.
Re: [Numpy-discussion] numpy ufuncs and COREPY - any info?
Francesc Alted wrote:
> A Friday 22 May 2009 11:42:56 Gregor Thalhammer escrigué:
>> dmitrey schrieb:
>> 3) Improving performance by using multiple cores is much more difficult. Only for sufficiently large (>1e5) arrays is a significant speedup possible. Where a speed gain is possible, the MKL uses several cores. Some experimentation showed that by adding a few OpenMP constructs you could get a similar speedup with numpy.
>> 4) numpy.dot uses optimized implementations.
> Good points Gregor. However, I wouldn't say that improving performance by using multiple cores is *that* difficult, but rather that multiple cores can only be used efficiently *whenever* the memory bandwidth is not a limitation. An example of this is the computation of transcendental functions, where, even using vectorized implementations, the computation speed is still CPU-bound in many cases. And you have experienced very good speed-ups for these cases yourself with your implementation of numexpr/MKL :)

Using multiple cores is pretty easy for element-wise ufuncs; no communication needs to occur and the work partitioning is trivial. And actually, I've found with some initial testing that multiple cores still help when you are memory-bound. I don't fully understand why yet, though I have some ideas. One reason is multiple memory controllers due to multiple sockets (i.e. opteron). Another is that each thread is pulling memory from a different bank, utilizing more bandwidth than a single sequential thread could. However, if that's the case, we could possibly come up with code for a single thread that achieves (nearly) the same additional throughput.

Andrew
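The trivial partitioning can be sketched in plain Python threads. This is a hedged illustration (function name and thread count are arbitrary); it relies on NumPy releasing the GIL inside its inner loops for non-object dtypes, so the threads can actually overlap:

```python
import threading
import numpy as np

def parallel_ufunc(ufunc, a, nthreads=4):
    """Apply a unary element-wise ufunc using nthreads threads.

    No communication is needed: each thread owns one contiguous slice
    of the input and the corresponding slice of the output.
    """
    out = np.empty_like(a)
    bounds = [len(a) * i // nthreads for i in range(nthreads + 1)]
    def work(lo, hi):
        ufunc(a[lo:hi], out=out[lo:hi])   # writes through the view
    threads = [threading.Thread(target=work, args=(bounds[i], bounds[i + 1]))
               for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

a = np.linspace(0.0, 6.28, 1000003)
assert np.allclose(parallel_ufunc(np.sin, a), np.sin(a))
```

For small arrays the thread start/join overhead swamps any gain, which matches the observation that multicore only pays off for sufficiently large inputs.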