[Numpy-discussion] Help needed with numpy 1.0.5 release blockers
Hi David

On Fri, Mar 14, 2008 at 9:19 AM, David Huard [EMAIL PROTECTED] wrote:
> I added a test for ticket 691. Problem is, there seems to be a new bug. I don't know if it's related to the change or if it was there before. Please check this out.

Fantastic, thanks for jumping in and addressing #691. I filed the new failure as ticket #700: http://scipy.org/scipy/numpy/ticket/700

If we keep going at this pace, we'll be releasing 1.0.5 in no time at all.

Cheers
Stéfan
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Numpy and OpenMP
Hi,

Numpy is great: I can see several IDL/matlab projects switching to numpy :) However, it would be so nice to be able to put some OpenMP into the numpy code. It would be nice to be able to use several CPUs with the numpy syntax, i.e. A=sqrt(B).

OK, we can use some inline C/C++ code, but it is not so easy. OK, we can split the data over several python executables (one per CPU), but A=sqrt(B) is so simple...

numpy + recent gcc with OpenMP -- :) ?

Any comments?

Xavier
Re: [Numpy-discussion] Numpy and OpenMP
On Sat, Mar 15, 2008 at 2:48 PM, Gnata Xavier [EMAIL PROTECTED] wrote:
> Hi,
>
> Numpy is great: I can see several IDL/matlab projects switching to numpy :) However, it would be so nice to be able to put some OpenMP into the numpy code. It would be nice to be able to use several CPUs with the numpy syntax, i.e. A=sqrt(B).
>
> OK, we can use some inline C/C++ code, but it is not so easy. OK, we can split the data over several python executables (one per CPU), but A=sqrt(B) is so simple...
>
> numpy + recent gcc with OpenMP -- :) ?
>
> Any comments?

Eric Jones tried to use multithreading to split the computation of ufuncs across CPUs. Ultimately, the overhead of locking and unlocking made it prohibitive for medium-sized arrays and gave only somewhat disappointing improvements in performance for quite large arrays. I'm not familiar enough with OpenMP to determine if this result would be applicable to it. If you would like to try, we can certainly give you pointers as to where to start.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
  -- Umberto Eco
Re: [Numpy-discussion] Numpy and OpenMP
On 15/03/2008, Damian Eads [EMAIL PROTECTED] wrote:
> Robert Kern wrote:
>> Eric Jones tried to use multithreading to split the computation of ufuncs across CPUs. Ultimately, the overhead of locking and unlocking made it prohibitive for medium-sized arrays and gave only somewhat disappointing improvements in performance for quite large arrays. I'm not familiar enough with OpenMP to determine if this result would be applicable to it. If you would like to try, we can certainly give you pointers as to where to start.
>
> Perhaps I'm missing something. How is locking and synchronization an issue when each thread is writing to a mutually exclusive part of the output buffer?

The trick is to efficiently allocate these output buffers. If you simply give each thread 1/n-th of the job, then a single CPU that is otherwise occupied doubles your computation time. If you break the job into many pieces and let threads grab them, you need to worry about locking to keep two threads from grabbing the same piece of data. Plus, depending on where things are in memory, you can kill performance by abusing the caches (maintaining cache consistency across CPUs can be a challenge). Plus, a certain amount of numpy code depends on order of evaluation:

a[:-1] = 2*a[1:]

Correctly handling all this can take a lot of overhead and require a lot of knowledge about hardware. OpenMP tries to take care of some of this in a way that's easy on the programmer.

To answer the OP's question, there is a relatively small number of C inner loops that could be marked up with OpenMP #pragmas to cover most matrix operations. Matrix linear algebra is a separate question, since numpy/scipy prefers to use optimized third-party libraries - in these cases one would need to use parallel linear algebra libraries (which do exist, I think, and are plug-compatible). So parallelizing numpy is probably feasible, and probably not too difficult, and would be valuable.

The biggest catch, I think, would be compilation issues - is it possible to link an OpenMP-compiled shared library into a normal executable?

Anne
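The order-of-evaluation point in the message above can be illustrated with a small sketch. It is not taken from numpy's source: plain Python loops stand in for a hypothetical in-place C inner loop, and the names `expected`, `forward` and `backward` are made up for the illustration. It suggests why a loop that updates `a` in place only matches the reference result if elements are visited front to back, an ordering that a loop split across threads would not guarantee.

```python
import numpy as np

a = np.arange(1, 7, dtype=np.float64)      # [1, 2, 3, 4, 5, 6]

# Reference semantics: every output element is computed from the ORIGINAL
# values of a (here 2*a[1:] is evaluated into a temporary first).
expected = a.copy()
expected[:-1] = 2 * a[1:]

# Hypothetical in-place loop, front to back: a[i+1] is read before it is
# overwritten, so the result matches the reference.
forward = a.copy()
for i in range(len(forward) - 1):
    forward[i] = 2 * forward[i + 1]

# Same loop, back to front: a[i+1] has already been overwritten when it is
# read, so every element depends on the updated neighbour.
backward = a.copy()
for i in reversed(range(len(backward) - 1)):
    backward[i] = 2 * backward[i + 1]

print(expected)   # reference result
print(forward)    # same as the reference
print(backward)   # differs from the reference
```

The two loops do the same arithmetic and differ only in visiting order, which is the property a parallel schedule would have to preserve or work around (for example by writing into a separate output buffer).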
Re: [Numpy-discussion] Numpy and OpenMP
On Sat, Mar 15, 2008 at 07:33:51PM -0400, Anne Archibald wrote:
> ...
> To answer the OP's question, there is a relatively small number of C inner loops that could be marked up with OpenMP #pragmas to cover most matrix operations. Matrix linear algebra is a separate question, since numpy/scipy prefers to use optimized third-party libraries - in these cases one would need to use parallel linear algebra libraries (which do exist, I think, and are plug-compatible). So parallelizing numpy is probably feasible, and probably not too difficult, and would be valuable.

OTOH, there are reasons to _not_ want numpy to automatically use OpenMP. I personally have a lot of multi-core CPUs and/or multi-processor servers that I use numpy on. The way I use numpy is to run a bunch of (embarrassingly) parallel numpy jobs, one for each CPU core. If OpenMP became standard (and it does work well in gcc 4.2 and 4.3), we definitely want to have control over how it is used...

> The biggest catch, I think, would be compilation issues - is it possible to link an OpenMP-compiled shared library into a normal executable?

I think so. The new gcc compilers use the libgomp libraries to provide the OpenMP functionality. I'm pretty sure those work just like any other libraries.

S

--
Scott M. Ransom            Address:  NRAO
Phone:  (434) 296-0320               520 Edgemont Rd.
email:  [EMAIL PROTECTED]            Charlottesville, VA 22903 USA
GPG Fingerprint: 06A9 9553 78BE 16DB 407B  FFCA 9BFA B6FF FFD3 2989
Re: [Numpy-discussion] Numpy and OpenMP
Scott Ransom wrote:
> On Sat, Mar 15, 2008 at 07:33:51PM -0400, Anne Archibald wrote:
>> ...
>> To answer the OP's question, there is a relatively small number of C inner loops that could be marked up with OpenMP #pragmas to cover most matrix operations. Matrix linear algebra is a separate question, since numpy/scipy prefers to use optimized third-party libraries - in these cases one would need to use parallel linear algebra libraries (which do exist, I think, and are plug-compatible). So parallelizing numpy is probably feasible, and probably not too difficult, and would be valuable.
>
> OTOH, there are reasons to _not_ want numpy to automatically use OpenMP. I personally have a lot of multi-core CPUs and/or multi-processor servers that I use numpy on. The way I use numpy is to run a bunch of (embarrassingly) parallel numpy jobs, one for each CPU core. If OpenMP became standard (and it does work well in gcc 4.2 and 4.3), we definitely want to have control over how it is used...

Embarrassingly parallel splitting is just fine in some cases (KISS), but IMHO there is a point to getting OpenMP into numpy. Look at the g++ people: they have added a parallel version of the C++ STL to gcc 4.3. Of course the non-parallel one is still the standard/default one, but that is the trend. For now we have no easy way to perform A = B + C on more than one CPU in numpy (except the limited embarrassingly parallel paradigm). Yes, we want to be able to tune and to switch off (by default?) the numpy threading capability, but IMHO having this threading capability will always be better than a fully non-parallel numpy.

>> The biggest catch, I think, would be compilation issues - is it possible to link an OpenMP-compiled shared library into a normal executable?
>
> I think so. The new gcc compilers use the libgomp libraries to provide the OpenMP functionality. I'm pretty sure those work just like any other libraries.
>
> S
[Numpy-discussion] What should be the return type of average?
Hi,

I want to fix up the average function. I note that the return dtype is not specified, nor is the precision of the accumulator. Both of these can be specified for the mean method, and I wonder what should be the case for average. Or should we just use double precision? That would seem appropriate to me most of the time, but it wouldn't match what happens with mean and would lose precision in the case of extended precision doubles.

There is also no out keyword; do we want one?

Chuck
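For reference, a minimal sketch of the `mean` behaviour the message above refers to, using the `dtype` and `out` keywords of `np.mean`. The array contents and variable names are only illustrative, and the dtype that `np.average` itself returns is deliberately not asserted here, since it depends on the numpy version in use.

```python
import numpy as np

a32 = np.ones((4, 3), dtype=np.float32)

# Without dtype, a float32 input is reduced and returned as float32.
m_default = np.mean(a32)

# dtype sets both the accumulator precision and the type of the result.
m_double = np.mean(a32, dtype=np.float64)

# out receives the result of a reduction along an axis.
col_means = np.empty(3, dtype=np.float64)
np.mean(a32, axis=0, dtype=np.float64, out=col_means)

# average is called here without weights; inspect its dtype on your version.
avg = np.average(a32)
print(m_default.dtype, m_double.dtype, col_means, np.asarray(avg).dtype)
```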
Re: [Numpy-discussion] Numpy and OpenMP
Anne,

Sure. I've found multi-threaded scientific computation to give mixed results. For some things, it results in very significant performance gains, and for other things, it's not worth the trouble at all. It really does depend on what you're doing. But I don't think it's fair to paint multithreaded programming with the same brush just because there exist pathologies.

Robert: what benchmarks were performed showing less than pleasing performance gains?

Anne Archibald wrote:
> On 15/03/2008, Damian Eads [EMAIL PROTECTED] wrote:
>> Robert Kern wrote:
>>> Eric Jones tried to use multithreading to split the computation of ufuncs across CPUs. Ultimately, the overhead of locking and unlocking made it prohibitive for medium-sized arrays and gave only somewhat disappointing improvements in performance for quite large arrays. I'm not familiar enough with OpenMP to determine if this result would be applicable to it. If you would like to try, we can certainly give you pointers as to where to start.
>>
>> Perhaps I'm missing something. How is locking and synchronization an issue when each thread is writing to a mutually exclusive part of the output buffer?
>
> The trick is to efficiently allocate these output buffers. If you simply give each thread 1/n-th of the job, then a single CPU that is otherwise occupied doubles your computation time. If you break the job into many pieces and let threads grab them, you need to worry about locking to keep two threads from grabbing the same piece of data.

For element-wise unary and binary array operations, there would never be two threads reading from the same memory at the same time. When performing matrix multiplication, more than two threads will access the same memory, but this is fine as long as their accesses are read-only. The moment there is a chance one thread might need to write to the same buffer that one or more threads are reading from, use a read/write lock (pthreads supports this).
As far as coordinating the work for the threads, there are several possible approaches (this is not a complete list):

1. Assign to each of them the part of the buffer to work on beforehand. This assumes each thread will compute at the same rate and will finish the same chunk in roughly the same amount of time. This is not always a valid assumption.

2. Assign smaller chunks, leaving a large amount of unassigned work. As threads complete computation of a chunk, assign them another chunk. This requires some memory to keep track of the chunks assigned and unassigned. Since it is possible for multiple threads to try to access (with at least one modifying thread) this chunk assignment structure at the same time, you need synchronization. In some cases, the overhead for doing this synchronization is minimal.

3. Use approach #2 but assign chunks of random sizes to reduce contention between threads trying to access the chunk assignment structure at the same time.

4. For very large jobs, have a chunk assignment server. Some of my experiments take several weeks and are spread across 64 processors (8 machines, 8 processors per machine). Individual units of computation take anywhere from 30 minutes to 8 hours. The cost of asking the chunk assignment server for a new chunk is minimal relative to the amount of time it takes to compute on the chunk. By not assigning all the computation up front, most processors are working nearly all the time. Only during the last day or two of the experiment are there processors with nothing to do.

> Plus, depending on where things are in memory, you can kill performance by abusing the caches (maintaining cache consistency across CPUs can be a challenge). Plus, a certain amount of numpy code depends on order of evaluation:
>
> a[:-1] = 2*a[1:]

Yes, but there are many, many instances when the order of evaluation in an array is sequential.
I'm not advocating that a numpy tool be devised to handle the parallelization of arbitrary computation, just common kinds of computation where performance gains might be realized.

> Correctly handling all this can take a lot of overhead and require a lot of knowledge about hardware. OpenMP tries to take care of some of this in a way that's easy on the programmer.
>
> To answer the OP's question, there is a relatively small number of C inner loops that could be marked up with OpenMP #pragmas to cover most matrix operations. Matrix linear algebra is a separate question, since numpy/scipy prefers to use optimized third-party libraries - in these cases one would need to use parallel linear algebra libraries (which do exist, I think, and are plug-compatible). So parallelizing numpy is probably feasible, and probably not too difficult, and would be valuable.

Yes, but there is a limit to the parallelization that can be achieved with vanilla numpy. numpy evaluates Python expressions, one at a time; thus, expressions like sqrt(0.5 * B
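Approaches 1 and 2 from the list earlier in this message can be sketched roughly as follows. This is an illustration under stated assumptions, not code from numpy or from the multicore branch: the helper names `sqrt_static` and `sqrt_dynamic`, the thread count and the chunk size are all hypothetical, and whether the threads actually run in parallel depends on the numpy build and version. In the dynamic variant, `queue.Queue` plays the role of the chunk assignment structure and supplies the synchronization internally.

```python
import queue
import threading

import numpy as np


def sqrt_static(b, out, n_threads=4):
    """Approach 1: each thread gets one fixed, contiguous slice up front."""
    bounds = np.linspace(0, b.size, n_threads + 1).astype(int)

    def work(lo, hi):
        # Output slices are disjoint, so no lock is needed here.
        np.sqrt(b[lo:hi], out=out[lo:hi])

    threads = [threading.Thread(target=work, args=(bounds[i], bounds[i + 1]))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def sqrt_dynamic(b, out, n_threads=4, chunk=1024):
    """Approach 2: threads repeatedly grab small chunks from a shared queue."""
    todo = queue.Queue()            # thread-safe chunk assignment structure
    for lo in range(0, b.size, chunk):
        todo.put((lo, min(lo + chunk, b.size)))

    def work():
        while True:
            try:
                lo, hi = todo.get_nowait()
            except queue.Empty:
                return              # no unassigned work left
            np.sqrt(b[lo:hi], out=out[lo:hi])

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


b = np.arange(100_000, dtype=np.float64)
out_static = np.empty_like(b)
out_dynamic = np.empty_like(b)
sqrt_static(b, out_static)
sqrt_dynamic(b, out_dynamic)
```

The chunk size sets the trade-off described in the list: smaller chunks balance the load better when one CPU is busy, at the cost of more trips through the synchronized queue.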
Re: [Numpy-discussion] Numpy and OpenMP
On Sat, Mar 15, 2008 at 8:25 PM, Damian Eads [EMAIL PROTECTED] wrote:
> Robert: what benchmarks were performed showing less than pleasing performance gains?

The implementation is in the multicore branch. This particular file is the main benchmark Eric was using:

http://svn.scipy.org/svn/numpy/branches/multicore/benchmarks/time_thread.py

--
Robert Kern