Re: [OMPI users] top question

2009-06-03 Thread George Bosilca
Simon, it is a lot more difficult than it appears. You're right:  
select/poll can do it for any file descriptor, and shared mutexes/  
conditions (despite the performance impact) can do it for shared  
memory. However, in the case where you have to support both  
simultaneously, what is the right approach, i.e., the one that doesn't  
impact the current performance? We're open to smart solutions ...
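
For anyone curious, the sketch usually offered for unifying the two is a  
"wakeup" pipe: the shared-memory sender writes one byte after enqueueing,  
so a single select() covers both event sources. The names below are made  
up for illustration; this is not Open MPI code, and the extra write() per  
shared-memory send is exactly the latency cost being weighed here.

    /* Hypothetical sketch, not Open MPI code: block on TCP sockets and a
     * shared-memory queue at the same time by pairing the queue with a
     * wakeup pipe. */
    #include <sys/select.h>
    #include <unistd.h>

    int wakeup_pipe[2];              /* created once with pipe(wakeup_pipe) */

    void shm_send(void)
    {
        /* ... enqueue the message into the shared-memory ring ... */
        char c = 1;
        (void)write(wakeup_pipe[1], &c, 1);   /* extra syscall per send */
    }

    void progress_wait(int tcp_fd)
    {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(tcp_fd, &rfds);
        FD_SET(wakeup_pipe[0], &rfds);
        int maxfd = (tcp_fd > wakeup_pipe[0] ? tcp_fd : wakeup_pipe[0]) + 1;

        if (select(maxfd, &rfds, NULL, NULL, NULL) > 0) {
            if (FD_ISSET(wakeup_pipe[0], &rfds)) {
                char buf[64];
                (void)read(wakeup_pipe[0], buf, sizeof(buf));  /* drain wakeups */
                /* ... drain the shared-memory queue ... */
            }
            if (FD_ISSET(tcp_fd, &rfds)) {
                /* ... run the TCP progress engine ... */
            }
        }
    }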


  george.

On Jun 3, 2009, at 11:49 , Number Cruncher wrote:


Jeff Squyres wrote:
We get this question so much that I really need to add it to the  
FAQ.  :-\
Open MPI currently always spins for completion for exactly the  
reason that Scott cites: lower latency.
Arguably, when using TCP, we could probably get a bit better  
performance by blocking and allowing the kernel to make more  
progress than a single quick pass through the sockets progress  
engine, but that involves some other difficulties such as  
simultaneously allowing shared memory progress.  We have ideas how  
to make this work, but it has unfortunately remained at a lower  
priority: the performance difference isn't that great, and we've  
been focusing on the other, lower latency interconnects (shmem, MX,  
verbs, etc.).


Whilst I understand that you have other priorities, and I am grateful  
for the leverage I get by using Open MPI, I would like to offer an  
alternative use case, which I believe may become more common.


We're developing parallel software which is designed to be used  
*interactively* as well as in batch mode. We want the same SIMD code  
running on a user's quad-core workstation as on a 1,000-node cluster.


For the former case (single workstation), it would be *much* more  
user-friendly and interactive for the back-end MPI code not to spin  
at 100% when it's just waiting for the next front-end command. The  
GUI thread doesn't get a look-in.


I can't imagine the difficulties involved, but if the POSIX calls  
select() and pthread_cond_wait() can do it for TCP and shared-memory  
threads respectively, it can't be impossible!


Just my .2c,
Simon




Re: [OMPI users] top question

2009-06-03 Thread Number Cruncher

Jeff Squyres wrote:

We get this question so much that I really need to add it to the FAQ.  :-\

Open MPI currently always spins for completion for exactly the reason 
that Scott cites: lower latency.


Arguably, when using TCP, we could probably get a bit better performance 
by blocking and allowing the kernel to make more progress than a single 
quick pass through the sockets progress engine, but that involves some 
other difficulties such as simultaneously allowing shared memory 
progress.  We have ideas how to make this work, but it has unfortunately 
remained at a lower priority: the performance difference isn't that 
great, and we've been focusing on the other, lower latency interconnects 
(shmem, MX, verbs, etc.).


Whilst I understand that you have other priorities, and I am grateful for 
the leverage I get by using Open MPI, I would like to offer an 
alternative use case, which I believe may become more common.


We're developing parallel software which is designed to be used 
*interactively* as well as in batch mode. We want the same SIMD code 
running on a user's quad-core workstation as on a 1,000-node cluster.


For the former case (single workstation), it would be *much* more 
user-friendly and interactive for the back-end MPI code not to spin 
at 100% when it's just waiting for the next front-end command. The GUI 
thread doesn't get a look-in.


I can't imagine the difficulties involved, but if the POSIX calls 
select() and pthread_cond_wait() can do it for TCP and shared-memory 
threads respectively, it can't be impossible!
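
As a minimal illustration of the two primitives Simon names (a toy sketch,  
not a proposed Open MPI change), a worker thread can sleep in  
pthread_cond_wait() until the GUI thread posts a command, using no CPU  
while idle:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int have_command = 0;

    /* Back-end worker: sleeps at 0% CPU until a command arrives. */
    void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!have_command)
                pthread_cond_wait(&cond, &lock);
            have_command = 0;
            pthread_mutex_unlock(&lock);
            /* ... run the back-end MPI work for this command ... */
        }
        return NULL;
    }

    /* Called from the GUI thread to wake the worker. */
    void post_command(void)
    {
        pthread_mutex_lock(&lock);
        have_command = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }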


Just my .2c,
Simon


Re: [OMPI users] top question

2009-06-03 Thread Eugene Loh

tsi...@coas.oregonstate.edu wrote:

Thanks for the explanation. I am using GigEth + Open MPI and the  
buffered MPI_Bsend. I had already noticed that top behaved 
differently  on another cluster with InfiniBand + MPICH.


So the only option to find out how much time each process is waiting  
around seems to be to profile the code. Will gprof show me anything  
useful or will I have to use a more sophisticated (any free ones?)  
parallel profiler?


Another frequently asked question!  I can try to add a FAQ 
entry/category.  There are a number of free options, including:


TAU http://www.cs.uoregon.edu/research/tau/home.php
mpiP http://mpip.sourceforge.net/
FPMPI http://www.mcs.anl.gov/research/projects/fpmpi/WWW/index.html
IPM http://ipm-hpc.sourceforge.net/
Sun Studio http://developers.sun.com/sunstudio/

The only one I've really used is Sun Studio.

Jumpshot *might* work with Open MPI, I forget.  Or it might be more of an 
MPICH tool.
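
Most of these tools sit on the MPI standard's PMPI profiling interface.  
A stripped-down sketch of the idea (not any particular tool's code) is to  
wrap a call of interest, time it with MPI_Wtime(), and report the total  
at MPI_Finalize():

    #include <mpi.h>
    #include <stdio.h>

    static double barrier_time = 0.0;

    /* Interposed MPI_Barrier: times the wait, then calls the real one. */
    int MPI_Barrier(MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int rc = PMPI_Barrier(comm);
        barrier_time += MPI_Wtime() - t0;
        return rc;
    }

    /* Report per-rank wait time when the application shuts down. */
    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d spent %.3f s in MPI_Barrier\n", rank, barrier_time);
        return PMPI_Finalize();
    }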


Re: [OMPI users] top question

2009-06-03 Thread tsilva


Thanks for the explanation. I am using GigEth + Open MPI and the  
buffered MPI_Bsend. I had already noticed that top behaved differently  
on another cluster with InfiniBand + MPICH.


So the only option to find out how much time each process is waiting  
around seems to be to profile the code. Will gprof show me anything  
useful or will I have to use a more sophisticated (any free ones?)  
parallel profiler?


Cheers,
Tiago





Re: [OMPI users] top question

2009-06-03 Thread Jeff Squyres
We get this question so much that I really need to add it to the  
FAQ.  :-\


Open MPI currently always spins for completion for exactly the reason  
that Scott cites: lower latency.


Arguably, when using TCP, we could probably get a bit better  
performance by blocking and allowing the kernel to make more progress  
than a single quick pass through the sockets progress engine, but that  
involves some other difficulties such as simultaneously allowing  
shared memory progress.  We have ideas how to make this work, but it  
has unfortunately remained at a lower priority: the performance  
difference isn't that great, and we've been focusing on the other,  
lower latency interconnects (shmem, MX, verbs, etc.).
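
Until that happens, a common application-level workaround (not an Open  
MPI feature; the function name below is made up) is to replace a blocking  
MPI_Wait with a test-and-sleep loop, trading up to a millisecond or so of  
added latency for an idle CPU:

    #include <mpi.h>
    #include <unistd.h>

    /* Poll the request once per pass, then yield the CPU if it is not
     * done yet.  Spins far less than MPI_Wait, at the cost of latency. */
    static void lazy_wait(MPI_Request *req, MPI_Status *status)
    {
        int done = 0;
        while (!done) {
            MPI_Test(req, &done, status);
            if (!done)
                usleep(1000);          /* sleep ~1 ms between checks */
        }
    }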




On Jun 3, 2009, at 8:37 AM, Scott Atchley wrote:


On Jun 3, 2009, at 6:05 AM, tsi...@coas.oregonstate.edu wrote:

> Top always shows all the parallel processes at 100% in the %CPU
> field, although some of the time these must be waiting for a
> communication to complete. How can I see actual processing as
> opposed to waiting at a barrier?
>
> Thanks,
> Tiago

Using what interconnect?

For performance reasons (lower latency), the app and/or OMPI may be
polling on the completion. Are you using blocking or non-blocking
communication?

Scott




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] top question

2009-06-03 Thread Scott Atchley

On Jun 3, 2009, at 6:05 AM, tsi...@coas.oregonstate.edu wrote:

Top always shows all the parallel processes at 100% in the %CPU  
field, although some of the time these must be waiting for a  
communication to complete. How can I see actual processing as  
opposed to waiting at a barrier?


Thanks,
Tiago


Using what interconnect?

For performance reasons (lower latency), the app and/or OMPI may be  
polling on the completion. Are you using blocking or non-blocking  
communication?
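
For readers unsure of the distinction being asked about, here is a  
minimal sketch: a blocking receive returns only once the message has  
arrived (and with Open MPI it polls internally while waiting), whereas a  
non-blocking receive posts the request and lets the application overlap  
work before waiting on it.

    #include <mpi.h>

    void blocking_recv(void *buf, int n, int src, MPI_Comm comm)
    {
        /* Blocking: does not return until the message is here. */
        MPI_Recv(buf, n, MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
    }

    void nonblocking_recv(void *buf, int n, int src, MPI_Comm comm)
    {
        /* Non-blocking: post the receive, overlap work, then wait. */
        MPI_Request req;
        MPI_Irecv(buf, n, MPI_BYTE, src, 0, comm, &req);
        /* ... useful computation here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }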


Scott