Re: [Numpy-discussion] IDL vs Python parallel computing

2014-05-08 Thread Siegfried Gonzi
On 08/05/2014 04:00, numpy-discussion-requ...@scipy.org wrote:
 Date: Wed, 07 May 2014 20:11:13 +0200
 From: Sturla Molden sturla.mol...@gmail.com
 Subject: Re: [Numpy-discussion] IDL vs Python parallel computing

 On 03/05/14 23:56, Siegfried Gonzi wrote:
I noticed IDL uses at least 400% (4 processors or cores) out of the box
for simple things like reading and processing files, calculating the
mean etc.

 The DMA controller is working at its own pace, regardless of what the
 CPU is doing. You cannot get data faster off the disk by burning the
 CPU. If you are seeing 100 % CPU usage while doing file i/o there is
 something very bad going on. If you did this to an i/o intensive server
 it would go up in a ball of smoke... The purpose of high-performance
 asynchronous i/o systems such as epoll, kqueue, IOCP is actually to keep
 the CPU usage to a minimum.


It is probably not so much about reading in files. But I just noticed it
(via the top command) for simple things like processing, say, 4-dimensional
fields (longitude, latitude, altitude, time), calculating column means or
moment statistics over grid boxes, writing the fields out again, and things
like that.

But it never uses more than 400%.

I haven't done any thorough testing of where and why the 400% really
kicks in, or whether IDL is cheating here.









Re: [Numpy-discussion] IDL vs Python parallel computing

2014-05-08 Thread Julian Taylor
On 08.05.2014 02:48, Frédéric Bastien wrote:
 Just a quick question/possibility.
 
 What about just parallelizing ufuncs with only one input that is C- or
 Fortran-contiguous, like the trigonometric functions? Is there a fast path
 in the ufunc mechanism when the input is Fortran/C-contiguous? If that is
 the case, it would be relatively easy to add an OpenMP pragma to
 parallelize that loop, conditional on a minimum number of elements.

OpenMP is problematic, as GNU OpenMP deadlocks on fork (multiprocessing).

I think if we do consider adding support, using
multiprocessing.pool.ThreadPool could be a good option.

But it is also not difficult for the user to just write a wrapper
function like this:

import numpy as np

def parallel_trig(x, func, pool, nthreads):
    x = x.reshape(nthreads, -1)  # assuming 1d and no remainder
    return np.concatenate(pool.map(func, x))  # use functools.partial to pass the out argument
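
A minimal usage sketch of that wrapper (pool size and the chosen ufunc are
just illustrative):

from multiprocessing.pool import ThreadPool

pool = ThreadPool(4)                    # reuse the pool; creating it is not free
x = np.linspace(0, 10, 1000000)         # 1d, size divisible by nthreads
y = parallel_trig(x, np.sin, pool, nthreads=4)
assert np.allclose(y, np.sin(x))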

 
[...]



Re: [Numpy-discussion] IDL vs Python parallel computing

2014-05-08 Thread Siegfried Gonzi
On 08/05/2014 04:00, numpy-discussion-requ...@scipy.org wrote:
 Date: Wed, 7 May 2014 19:25:32 +0100
 From: Nathaniel Smith n...@pobox.com
 Subject: Re: [Numpy-discussion] IDL vs Python parallel computing

 That said, reading data stored in text files is usually a CPU-bound 
 operation, and if someone wrote the code to make numpy's text file 
 readers multithreaded, and did so in a maintainable way, then we'd 
 probably accept the patch. The only reason this hasn't happened is 
 that no-one's done it.

To add to the confusion, here is what IDL offers:

http://www.exelisvis.com/Support/HelpArticlesDetail/TabId/219/ArtMID/900/ArticleID/3252/3252.aspx

I am not using IDL any more, except for some legacy code (and I was never
interested in IDL at all, as it is a horrible language). Nowadays I am
mostly on Python.








Re: [Numpy-discussion] IDL vs Python parallel computing

2014-05-07 Thread Sturla Molden
On 05/05/14 17:02, Francesc Alted wrote:

 Well, this might be because it is the place where using several
 processes makes more sense.  Normally, when you are reading files, the
 bottleneck is the I/O subsystem (at least if you don't have to convert
 from text to numbers), and for calculating the mean, normally the
 bottleneck is memory throughput.

If IDL is burning the CPU while reading a file, I wouldn't call that 
impressive. It is certainly not something NumPy should aspire to do.

Sturla



Re: [Numpy-discussion] IDL vs Python parallel computing

2014-05-07 Thread Sturla Molden
On 03/05/14 23:56, Siegfried Gonzi wrote:
  I noticed IDL uses at least 400% (4 processors or cores) out of the box
  for simple things like reading and processing files, calculating the
  mean etc.

The DMA controller is working at its own pace, regardless of what the 
CPU is doing. You cannot get data faster off the disk by burning the 
CPU. If you are seeing 100 % CPU usage while doing file i/o there is 
something very bad going on. If you did this to an i/o intensive server 
it would go up in a ball of smoke... The purpose of high-performance 
asynchronous i/o systems such as epoll, kqueue, IOCP is actually to keep 
the CPU usage to a minimum.

Also, there are computations where using multiple processors does not help. 
First, there is a certain overhead due to thread synchronization and 
scheduling the workload. Thus you want to have a certain amount of work 
before you consider invoking multiple threads. Second, hierarchical 
memory also makes it mandatory to avoid having threads share the same 
objects in cache. Otherwise the performance will degrade as more threads 
are added.

A more technical answer is that NumPy's internals do not play very 
nicely with multithreading. For example, the array iterators used in 
ufuncs store an internal state. Multithreading would imply an excessive 
contention for this state, as well as induce false sharing of the 
iterator object. Therefore, a multithreaded NumPy would have performance 
problems due to synchronization as well as hierarchical memory 
collisions. Adding multithreading support to the current NumPy core 
would just degrade the performance. NumPy will not be able to use 
multithreading efficiently unless we redesign the iterators in NumPy 
core. That is a massive undertaking which probably means rewriting most 
of NumPy's core C code. A better strategy would be to monkey-patch some 
of the more common ufuncs with multithreaded versions.
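
A rough sketch of what such a monkey-patched ufunc could look like (the
chunk count, pool handling, and 1d-only restriction are arbitrary choices
for illustration):

import numpy as np
from multiprocessing.pool import ThreadPool

_serial_sin = np.sin
_pool = ThreadPool(4)

def _threaded_sin(x):
    x = np.asarray(x)
    if x.ndim != 1:                      # keep the sketch simple: 1d only
        return _serial_sin(x)
    chunks = np.array_split(x, 4)        # no divisibility requirement
    return np.concatenate(_pool.map(_serial_sin, chunks))

np.sin = _threaded_sin                   # patch the common ufunc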


  I have never seen this happening with numpy except for the linear algebra
  stuff (e.g. LAPACK).
 
  Any comments?

The BLAS/LAPACK library can use multithreading internally, depending on 
which BLAS/LAPACK library you use.
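
For example, with NumPy linked against a threaded BLAS (OpenBLAS, MKL,
ACML, ...), a matrix product will use several cores with no change to the
Python code, and the thread count can usually be capped via the library's
environment variable (e.g. OMP_NUM_THREADS for OpenMP-based builds,
MKL_NUM_THREADS for MKL):

import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = np.dot(a, b)   # dgemm: runs multithreaded if the BLAS library is threaded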


Sturla




Re: [Numpy-discussion] IDL vs Python parallel computing

2014-05-07 Thread Nathaniel Smith
On Wed, May 7, 2014 at 7:11 PM, Sturla Molden sturla.mol...@gmail.com wrote:
 On 03/05/14 23:56, Siegfried Gonzi wrote:
   I noticed IDL uses at least 400% (4 processors or cores) out of the box
   for simple things like reading and processing files, calculating the
   mean etc.

 The DMA controller is working at its own pace, regardless of what the
 CPU is doing. You cannot get data faster off the disk by burning the
 CPU. If you are seeing 100 % CPU usage while doing file i/o there is
 something very bad going on. If you did this to an i/o intensive server
 it would go up in a ball of smoke... The purpose of high-performance
 asynchronous i/o systems such as epoll, kqueue, IOCP is actually to keep
 the CPU usage to a minimum.

That said, reading data stored in text files is usually a CPU-bound
operation, and if someone wrote the code to make numpy's text file
readers multithreaded, and did so in a maintainable way, then we'd
probably accept the patch. The only reason this hasn't happened is
that no-one's done it.
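
One way to see this is to compare reading the same data as text and as
binary; a sketch (file names are arbitrary, timings will vary by machine):

import numpy as np

data = np.random.rand(1000000)
np.savetxt("data.txt", data)   # text: every value must be parsed on read
np.save("data.npy", data)      # binary: reading is close to pure I/O

a = np.loadtxt("data.txt")     # CPU-bound: float parsing dominates
b = np.load("data.npy")        # I/O-bound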

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Numpy-discussion] IDL vs Python parallel computing

2014-05-07 Thread Julian Taylor
On 07.05.2014 20:11, Sturla Molden wrote:
 A more technical answer is that NumPy's internals do not play very 
 nicely with multithreading. For example, the array iterators used in 
 ufuncs store an internal state. Multithreading would imply an excessive 
 contention for this state, as well as induce false sharing of the 
 iterator object. Therefore, a multithreaded NumPy would have performance 
 problems due to synchronization as well as hierarchical memory 
 collisions. Adding multithreading support to the current NumPy core 
 would just degrade the performance. NumPy will not be able to use 
 multithreading efficiently unless we redesign the iterators in NumPy 
 core. That is a massive undertaking which probably means rewriting most 
 of NumPy's core C code. A better strategy would be to monkey-patch some 
 of the more common ufuncs with multithreaded versions.


I wouldn't say that the iterator is a problem: the important iterator
functions are threadsafe, and there is support for multithreaded
iteration using NpyIter_Copy, so no data is shared between threads.

I'd say the main issue is that there simply aren't many functions worth
parallelizing in numpy. Most of the commonly used stuff is already memory
bandwidth bound with only one or two threads.
The only things I can think of that would profit are sorting/partitioning
and the special functions like sqrt, exp, log, etc.

Generic efficient parallelization would require merging operations to
improve the FLOPS/loads ratio. E.g. numexpr and theano are able to do so
and thus also have builtin support for multithreading.

That being said, you can use Python threads with numpy as (especially in
1.9) most expensive functions release the GIL. But unless you are doing
very flop-intensive stuff you will probably have to manually block your
operations to the last-level cache size if you want to scale beyond one
or two threads.
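
A rough sketch of such manual blocking (the cache size, block length, and
fused operation are assumptions; the threads only overlap where the
underlying functions release the GIL):

import numpy as np
from multiprocessing.pool import ThreadPool

LLC_BYTES = 8 * 1024 * 1024              # assumed last-level cache size
BLOCK = LLC_BYTES // 8 // 2              # float64 elements, half the LLC

def fused(block):
    # several ops on a cache-resident block improve the FLOPS/loads ratio
    return np.sqrt(np.exp(block) + 1.0)

def blocked_apply(func, x, pool):
    blocks = [x[i:i + BLOCK] for i in range(0, x.size, BLOCK)]
    return np.concatenate(pool.map(func, blocks))

result = blocked_apply(fused, np.random.rand(10000000), ThreadPool(2))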


Re: [Numpy-discussion] IDL vs Python parallel computing

2014-05-07 Thread Frédéric Bastien
Just a quick question/possibility.

What about just parallelizing ufunc with only 1 inputs that is c or fortran
contiguous like trigonometric function? Is there a fast path in the ufunc
mechanism when the input is fortran/c contig? If that is the case, it would
be relatively easy to add an openmp pragma to parallelize that loop, with a
condition to a minimum number of element.

Anyway, I won't do it. I'm just outlining what I think is the easiest case
to implement (depending on NumPy internals that I don't know well enough),
and I think the most frequent one (so possibly a quick fix for someone with
knowledge of that code).

In Theano, we found on a few CPUs that for addition we need a minimum of
200k elements for the parallelization of elemwise operations to be useful.
We use that number by default for all operations to make it easy. This is
user configurable. This guarantees that on the current generation of CPUs
the threading doesn't slow things down. I think that is the most important
part: don't show users a slowdown by default with a new version.

Fred






Re: [Numpy-discussion] IDL vs Python parallel computing

2014-05-05 Thread Francesc Alted
On 5/3/14, 11:56 PM, Siegfried Gonzi wrote:
 Hi all

 I noticed IDL uses at least 400% (4 processors or cores) out of the box
 for simple things like reading and processing files, calculating the
 mean etc.

 I have never seen this happening with numpy except for the linear algebra
 stuff (e.g. LAPACK).

Well, this might be because it is the place where using several 
processes makes more sense.  Normally, when you are reading files, the 
bottleneck is the I/O subsystem (at least if you don't have to convert 
from text to numbers), and for calculating the mean, normally the 
bottleneck is memory throughput.
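
A quick way to check this on your own machine (a sketch; any speedup here
is limited both by memory bandwidth and, for plain NumPy calls, by the GIL):

import time
import numpy as np
from multiprocessing.pool import ThreadPool

x = np.random.rand(50000000)
pool = ThreadPool(4)

t0 = time.time(); m1 = x.mean(); t1 = time.time()
m2 = np.mean(pool.map(np.mean, np.array_split(x, 4))); t2 = time.time()
print("serial %.3fs  threaded %.3fs" % (t1 - t0, t2 - t1))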

Having said this, there are several packages that work on top of NumPy 
that can use multiple cores when performing numpy operations, like 
numexpr (https://github.com/pydata/numexpr), or Theano 
(http://deeplearning.net/software/theano/tutorial/multi_cores.html)
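
For instance, a minimal numexpr sketch (the thread count is arbitrary);
numexpr compiles the expression and evaluates it block-wise with multiple
threads:

import numpy as np
import numexpr as ne

a = np.random.rand(10000000)
b = np.random.rand(10000000)

ne.set_num_threads(4)                      # use four cores
c = ne.evaluate("sin(a) * b + a**2")       # one blocked, multithreaded pass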

-- 
Francesc Alted



[Numpy-discussion] IDL vs Python parallel computing

2014-05-03 Thread Siegfried Gonzi
Hi all

I noticed IDL uses at least 400% (4 processors or cores) out of the box 
for simple things like reading and processing files, calculating the 
mean etc.

I have never seen this happening with numpy except for the linear algebra 
stuff (e.g. LAPACK).

Any comments?

Thanks,
Siegfried



