Re: [Numpy-discussion] IDL vs Python parallel computing
On 08/05/2014 04:00, the NumPy-Discussion digest carried:

Message: 1
Date: Wed, 07 May 2014 20:11:13 +0200
From: Sturla Molden <sturla.mol...@gmail.com>
Subject: Re: [Numpy-discussion] IDL vs Python parallel computing

On 03/05/14 23:56, Siegfried Gonzi wrote:
I noticed IDL uses at least 400% (4 processors or cores) out of the box for simple things like reading and processing files, calculating the mean etc.

The DMA controller works at its own pace, regardless of what the CPU is doing. You cannot get data off the disk faster by burning the CPU. If you are seeing 100% CPU usage while doing file I/O, something very bad is going on; if you did this to an I/O-intensive server it would go up in a ball of smoke. The purpose of high-performance asynchronous I/O systems such as epoll, kqueue, and IOCP is precisely to keep CPU usage to a minimum.

It is probably not so much about reading in files. But I just noticed it (via the top command) for simple things like processing 4-dimensional fields (longitude, latitude, altitude, time), calculating column means or moment statistics over grid boxes, writing the fields out again, and things like that. But it never uses more than 400%. I haven't done any thorough testing of where and why the 400% really kicks in, or whether IDL is cheating here.
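One way to check Sturla's point yourself (a sketch of my own, not from the thread: the helper name and workload are illustrative) is to compare wall-clock time against CPU time. If CPU time is far below wall time, the work is I/O- or memory-bound and extra cores cannot help much:

```python
import time
import numpy as np

def profile_cpu_vs_wall(fn, *args):
    """Run fn(*args) and report (result, wall seconds, CPU seconds).

    A large wall/CPU gap means the workload is I/O- or memory-bound,
    so burning more cores on it would be wasted effort.
    """
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    result = fn(*args)
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return result, wall, cpu

# Example workload: summing a large array is memory-bandwidth bound.
x = np.random.rand(1_000_000)
total, wall, cpu = profile_cpu_vs_wall(np.sum, x)
```

The same harness applied to a file-reading function would show the effect Sturla describes: the CPU mostly waits while the I/O subsystem does the work.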
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] IDL vs Python parallel computing
On 08.05.2014 02:48, Frédéric Bastien wrote:
Just a quick question/possibility. What about parallelizing only ufuncs with a single input that is C- or Fortran-contiguous, like the trigonometric functions? Is there a fast path in the ufunc mechanism when the input is C/Fortran-contiguous? If so, it would be relatively easy to add an OpenMP pragma to parallelize that loop, conditioned on a minimum number of elements.

OpenMP is problematic, as GNU OpenMP deadlocks on fork (multiprocessing). If we do consider adding support, I think using multiprocessing.pool.ThreadPool could be a good option. But it is also not difficult for the user to just write a wrapper function like this:

def parallel_trig(x, func, pool, nthreads):
    # assuming x is 1-d and x.size is divisible by nthreads
    chunks = x.reshape(nthreads, -1)
    # use functools.partial to pass func's out argument
    return np.concatenate(pool.map(func, chunks))

Anyway, I won't do it myself. I'm just outlining what I think is the easiest case to implement (depending on NumPy internals that I don't know well enough), and I think the most frequent one (so possibly a quick fix for someone who knows that code). In Theano, we found on a few CPUs that, for addition, we need a minimum of 200k elements for parallelizing an elementwise operation to be useful. We use that number by default for all operations to keep things simple; it is user-configurable. This guarantees that, on current hardware, threading doesn't slow things down. I think that is the more important point: don't show users a slowdown by default with a new version.

Fred

On Wed, May 7, 2014 at 2:27 PM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:
On 07.05.2014 20:11, Sturla Molden wrote:
On 03/05/14 23:56, Siegfried Gonzi wrote:
A more technical answer is that NumPy's internals do not play very nicely with multithreading. For example, the array iterators used in ufuncs store internal state.
Multithreading would imply excessive contention for this state, as well as induce false sharing of the iterator object. Therefore, a multithreaded NumPy would have performance problems due to synchronization as well as hierarchical memory collisions. Adding multithreading support to the current NumPy core would just degrade performance. NumPy will not be able to use multithreading efficiently unless we redesign the iterators in the NumPy core. That is a massive undertaking which probably means rewriting most of NumPy's core C code. A better strategy would be to monkey-patch some of the more common ufuncs with multithreaded versions.

I wouldn't say that the iterator is a problem: the important iterator functions are threadsafe, and there is support for multithreaded iteration using NpyIter_Copy, so no data is shared between threads. I'd say the main issue is that there simply aren't many functions worth parallelizing in NumPy. Most of the commonly used stuff is already memory-bandwidth bound with only one or two threads. The only things I can think of that would profit are sorting/partitioning and the special functions like sqrt, exp, log, etc. Generic efficient parallelization would require merging operations to improve the FLOPS/loads ratio. E.g. numexpr and Theano are able to do so, and thus also have built-in support for multithreading. That being said, you can use Python threads with NumPy, as (especially in 1.9) most expensive functions release the GIL. But unless you are doing very flop-intensive stuff, you will probably have to manually block your operations to the last-level cache size if you want to scale beyond one or two threads.
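The user-side wrapper discussed above can be made runnable. This sketch (the function name, chunking policy, and thread count are my own choices, not NumPy API) maps a ufunc over contiguous chunks with a thread pool; it only pays off because ufuncs release the GIL while looping over their input:

```python
from multiprocessing.pool import ThreadPool

import numpy as np

def parallel_ufunc(func, x, nthreads=4):
    """Apply a 1-d ufunc to x in chunks using a thread pool.

    This works with threads (not processes) because NumPy ufuncs
    release the GIL while iterating over contiguous memory.
    array_split handles sizes that are not divisible by nthreads.
    """
    chunks = np.array_split(x, nthreads)
    with ThreadPool(nthreads) as pool:
        return np.concatenate(pool.map(func, chunks))

# Deliberately not divisible by 4, to exercise the remainder handling.
x = np.linspace(0, 10, 100_001)
y = parallel_ufunc(np.sin, x)
```

As Julian notes, for memory-bandwidth-bound ufuncs this will not scale much beyond one or two threads; it mainly helps the expensive transcendentals (sin, exp, log, ...).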
Re: [Numpy-discussion] IDL vs Python parallel computing
On 08/05/2014 04:00, the NumPy-Discussion digest carried:

Message: 2
Date: Wed, 7 May 2014 19:25:32 +0100
From: Nathaniel Smith <n...@pobox.com>
Subject: Re: [Numpy-discussion] IDL vs Python parallel computing

On Wed, May 7, 2014 at 7:11 PM, Sturla Molden <sturla.mol...@gmail.com> wrote:

That said, reading data stored in text files is usually a CPU-bound operation, and if someone wrote the code to make numpy's text file readers multithreaded, and did so in a maintainable way, then we'd probably accept the patch. The only reason this hasn't happened is that no-one's done it.

To add to the confusion, here is what IDL offers: http://www.exelisvis.com/Support/HelpArticlesDetail/TabId/219/ArtMID/900/ArticleID/3252/3252.aspx

I am not using IDL any more except for some legacy code (and was never interested in IDL at all, as it is a horrible language). Nowadays I am mostly on Python.
Re: [Numpy-discussion] IDL vs Python parallel computing
On 05/05/14 17:02, Francesc Alted wrote:
Well, this might be because it is the place where using several processes makes more sense. Normally, when you are reading files, the bottleneck is the I/O subsystem (at least if you don't have to convert from text to numbers), and for calculating the mean, normally the bottleneck is memory throughput.

If IDL is burning the CPU while reading a file, I wouldn't call that impressive. It is certainly not something NumPy should aspire to do.

Sturla
Re: [Numpy-discussion] IDL vs Python parallel computing
On 03/05/14 23:56, Siegfried Gonzi wrote:
I noticed IDL uses at least 400% (4 processors or cores) out of the box for simple things like reading and processing files, calculating the mean etc.

The DMA controller works at its own pace, regardless of what the CPU is doing. You cannot get data off the disk faster by burning the CPU. If you are seeing 100% CPU usage while doing file I/O, something very bad is going on; if you did this to an I/O-intensive server it would go up in a ball of smoke. The purpose of high-performance asynchronous I/O systems such as epoll, kqueue, and IOCP is precisely to keep CPU usage to a minimum.

Also, there are computations where using multiple processors does not help. First, there is a certain overhead due to thread synchronization and scheduling the workload, so you want a certain amount of work before you invoke multiple threads. Second, hierarchical memory also makes it mandatory to keep the threads from sharing the same objects in cache; otherwise performance will degrade as more threads are added.

A more technical answer is that NumPy's internals do not play very nicely with multithreading. For example, the array iterators used in ufuncs store internal state. Multithreading would imply excessive contention for this state, as well as induce false sharing of the iterator object. Therefore, a multithreaded NumPy would have performance problems due to synchronization as well as hierarchical memory collisions. Adding multithreading support to the current NumPy core would just degrade performance. NumPy will not be able to use multithreading efficiently unless we redesign the iterators in the NumPy core. That is a massive undertaking which probably means rewriting most of NumPy's core C code. A better strategy would be to monkey-patch some of the more common ufuncs with multithreaded versions.

I have never seen this happening with numpy except for the linear algebra stuff (e.g. LAPACK). Any comments?
The BLAS/LAPACK library can use multithreading internally, depending on which BLAS/LAPACK library you use.

Sturla
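The BLAS backend's thread count can be constrained from outside NumPy. A sketch, assuming one of the common backends (OpenBLAS, MKL, or an OpenMP-based BLAS) — these environment variables are read once when the library loads, so they must be set before NumPy is first imported:

```python
import os

# Must be set *before* numpy is imported: each backend reads its
# variable once at load time. Setting all three covers the common cases.
os.environ.setdefault("OMP_NUM_THREADS", "1")       # OpenMP-based BLAS
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")  # OpenBLAS
os.environ.setdefault("MKL_NUM_THREADS", "1")       # Intel MKL

import numpy as np

# np.show_config() reports which BLAS/LAPACK numpy was built against.
# Matrix products like this are where BLAS threading shows up in top.
a = np.random.rand(200, 200)
b = np.random.rand(200, 200)
c = a @ b
```

With the variables unset, the same `a @ b` on large matrices is typically what makes NumPy light up multiple cores out of the box, which is the "400%" behaviour Siegfried saw from IDL.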
Re: [Numpy-discussion] IDL vs Python parallel computing
On Wed, May 7, 2014 at 7:11 PM, Sturla Molden <sturla.mol...@gmail.com> wrote:
On 03/05/14 23:56, Siegfried Gonzi wrote:
I noticed IDL uses at least 400% (4 processors or cores) out of the box for simple things like reading and processing files, calculating the mean etc.

The DMA controller is working at its own pace, regardless of what the CPU is doing. You cannot get data faster off the disk by burning the CPU. If you are seeing 100% CPU usage while doing file I/O there is something very bad going on. If you did this to an I/O-intensive server it would go up in a ball of smoke... The purpose of high-performance asynchronous I/O systems such as epoll, kqueue, IOCP is actually to keep the CPU usage to a minimum.

That said, reading data stored in text files is usually a CPU-bound operation, and if someone wrote the code to make numpy's text file readers multithreaded, and did so in a maintainable way, then we'd probably accept the patch. The only reason this hasn't happened is that no-one's done it.

-n
-- Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
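To sketch what the chunking side of a parallel text reader might look like (helper names are hypothetical; this is not how NumPy's readers work): split the buffer on line boundaries and parse each chunk independently. Note the caveat in the comments — `np.loadtxt` parses in Python, so the threads here mostly serialize on the GIL; a real patch would parse in C or use processes, but the chunking structure is the same:

```python
import io
from multiprocessing.pool import ThreadPool

import numpy as np

def parse_chunk(chunk):
    """Parse one block of whitespace-separated numeric text."""
    return np.loadtxt(io.StringIO(chunk), ndmin=2)

def parallel_loadtxt(text, nchunks=4):
    """Split text on line boundaries and parse the chunks in a pool.

    A structural sketch only: because np.loadtxt does its parsing in
    Python, the GIL prevents a real speedup here. The point is that
    line-aligned chunks can be parsed independently and stacked.
    """
    lines = text.splitlines(keepends=True)
    step = max(1, len(lines) // nchunks)
    chunks = ["".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    with ThreadPool(nchunks) as pool:
        return np.vstack(pool.map(parse_chunk, chunks))

text = "".join(f"{i} {i * 0.5}\n" for i in range(1000))
data = parallel_loadtxt(text)
```

The chunks must end on line boundaries so that no number is split across workers; that is why the split is done on `splitlines` output rather than on byte offsets.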
Re: [Numpy-discussion] IDL vs Python parallel computing
On 07.05.2014 20:11, Sturla Molden wrote:
On 03/05/14 23:56, Siegfried Gonzi wrote:
A more technical answer is that NumPy's internals do not play very nicely with multithreading. For example, the array iterators used in ufuncs store internal state. Multithreading would imply excessive contention for this state, as well as induce false sharing of the iterator object. Therefore, a multithreaded NumPy would have performance problems due to synchronization as well as hierarchical memory collisions. Adding multithreading support to the current NumPy core would just degrade performance. NumPy will not be able to use multithreading efficiently unless we redesign the iterators in the NumPy core. That is a massive undertaking which probably means rewriting most of NumPy's core C code. A better strategy would be to monkey-patch some of the more common ufuncs with multithreaded versions.

I wouldn't say that the iterator is a problem: the important iterator functions are threadsafe, and there is support for multithreaded iteration using NpyIter_Copy, so no data is shared between threads. I'd say the main issue is that there simply aren't many functions worth parallelizing in NumPy. Most of the commonly used stuff is already memory-bandwidth bound with only one or two threads. The only things I can think of that would profit are sorting/partitioning and the special functions like sqrt, exp, log, etc. Generic efficient parallelization would require merging operations to improve the FLOPS/loads ratio. E.g. numexpr and Theano are able to do so, and thus also have built-in support for multithreading. That being said, you can use Python threads with NumPy, as (especially in 1.9) most expensive functions release the GIL. But unless you are doing very flop-intensive stuff, you will probably have to manually block your operations to the last-level cache size if you want to scale beyond one or two threads.
Re: [Numpy-discussion] IDL vs Python parallel computing
Just a quick question/possibility. What about parallelizing only ufuncs with a single input that is C- or Fortran-contiguous, like the trigonometric functions? Is there a fast path in the ufunc mechanism when the input is C/Fortran-contiguous? If so, it would be relatively easy to add an OpenMP pragma to parallelize that loop, conditioned on a minimum number of elements.

Anyway, I won't do it myself. I'm just outlining what I think is the easiest case to implement (depending on NumPy internals that I don't know well enough), and I think the most frequent one (so possibly a quick fix for someone who knows that code). In Theano, we found on a few CPUs that, for addition, we need a minimum of 200k elements for parallelizing an elementwise operation to be useful. We use that number by default for all operations to keep things simple; it is user-configurable. This guarantees that, on current hardware, threading doesn't slow things down. I think that is the more important point: don't show users a slowdown by default with a new version.

Fred

On Wed, May 7, 2014 at 2:27 PM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:
On 07.05.2014 20:11, Sturla Molden wrote:
On 03/05/14 23:56, Siegfried Gonzi wrote:
A more technical answer is that NumPy's internals do not play very nicely with multithreading. For example, the array iterators used in ufuncs store internal state. Multithreading would imply excessive contention for this state, as well as induce false sharing of the iterator object. Therefore, a multithreaded NumPy would have performance problems due to synchronization as well as hierarchical memory collisions. Adding multithreading support to the current NumPy core would just degrade performance. NumPy will not be able to use multithreading efficiently unless we redesign the iterators in the NumPy core. That is a massive undertaking which probably means rewriting most of NumPy's core C code.
A better strategy would be to monkey-patch some of the more common ufuncs with multithreaded versions.

I wouldn't say that the iterator is a problem: the important iterator functions are threadsafe, and there is support for multithreaded iteration using NpyIter_Copy, so no data is shared between threads. I'd say the main issue is that there simply aren't many functions worth parallelizing in NumPy. Most of the commonly used stuff is already memory-bandwidth bound with only one or two threads. The only things I can think of that would profit are sorting/partitioning and the special functions like sqrt, exp, log, etc. Generic efficient parallelization would require merging operations to improve the FLOPS/loads ratio. E.g. numexpr and Theano are able to do so, and thus also have built-in support for multithreading. That being said, you can use Python threads with NumPy, as (especially in 1.9) most expensive functions release the GIL. But unless you are doing very flop-intensive stuff, you will probably have to manually block your operations to the last-level cache size if you want to scale beyond one or two threads.
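Fred's 200k-element threshold can be expressed as a simple dispatch: run serially below the cutoff and fan out above it. A sketch under stated assumptions — the cutoff comes from Theano's empirical default mentioned above, while the function names and thread count are illustrative, not any library's API:

```python
from multiprocessing.pool import ThreadPool

import numpy as np

MIN_PARALLEL = 200_000  # Theano's empirical default for elementwise ops

def maybe_parallel(func, x, nthreads=4):
    """Serial below MIN_PARALLEL elements, threaded above it.

    Mirrors the Theano policy Fred describes: small arrays must never
    get slower just because threading support was added.
    """
    if x.size < MIN_PARALLEL:
        return func(x)
    chunks = np.array_split(x, nthreads)
    with ThreadPool(nthreads) as pool:
        return np.concatenate(pool.map(func, chunks))

small = np.linspace(0, 1, 1_000)      # takes the serial path
large = np.linspace(0, 1, 400_000)    # takes the threaded path
ys = maybe_parallel(np.exp, small)
yl = maybe_parallel(np.exp, large)
```

The same guard is what an OpenMP pragma's `if` clause would express inside the ufunc loop itself, had the parallelization been done in C.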
Re: [Numpy-discussion] IDL vs Python parallel computing
On 5/3/14, 11:56 PM, Siegfried Gonzi wrote:
Hi all, I noticed IDL uses at least 400% (4 processors or cores) out of the box for simple things like reading and processing files, calculating the mean etc. I have never seen this happening with numpy except for the linear algebra stuff (e.g. LAPACK).

Well, this might be because it is the place where using several processes makes more sense. Normally, when you are reading files, the bottleneck is the I/O subsystem (at least if you don't have to convert from text to numbers), and for calculating the mean, normally the bottleneck is memory throughput.

Having said this, there are several packages that work on top of NumPy that can use multiple cores when performing numpy operations, like numexpr (https://github.com/pydata/numexpr) or Theano (http://deeplearning.net/software/theano/tutorial/multi_cores.html).

-- Francesc Alted
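The gain numexpr and Theano get from merging operations comes from evaluating a whole expression block by block, so intermediates stay in cache instead of becoming full-size temporaries. A manual sketch of that idea in plain NumPy (the expression and block size are illustrative; numexpr compiles this automatically from a string like "0.25*x**3 + 0.75*x"):

```python
import numpy as np

def blocked_eval(x, block=16_384):
    """Evaluate 0.25*x**3 + 0.75*x one cache-sized block at a time.

    Evaluating the whole expression at once would allocate several
    full-size temporary arrays (x**3, 0.25*x**3, ...); working
    block-wise keeps the intermediates in cache, which is the effect
    numexpr's compiled expressions achieve, and which also makes the
    blocks independent units for multithreading.
    """
    out = np.empty_like(x)
    for start in range(0, x.size, block):
        b = x[start:start + block]
        out[start:start + block] = 0.25 * b**3 + 0.75 * b
    return out

x = np.random.rand(100_000)
y = blocked_eval(x)
```

This is also the "manually block your operations to the last-level cache size" advice from earlier in the thread, written out.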
[Numpy-discussion] IDL vs Python parallel computing
Hi all,

I noticed IDL uses at least 400% (4 processors or cores) out of the box for simple things like reading and processing files, calculating the mean etc. I have never seen this happening with numpy except for the linear algebra stuff (e.g. LAPACK).

Any comments?

Thanks, Siegfried