Re: [OMPI users] top question
Simon, it is a lot more difficult than it appears. You're right, select/poll can do it for any file descriptor, and shared mutexes/conditions (despite the performance impact) can do it for shared memory. However, in the case where you have to support both simultaneously, what is the right approach, i.e. the one that doesn't impact the current performance? We're open to smart solutions...

george.

On Jun 3, 2009, at 11:49, Number Cruncher wrote:

> Jeff Squyres wrote:
>> We get this question so much that I really need to add it to the FAQ. :-\ Open MPI currently always spins for completion for exactly the reason that Scott cites: lower latency. Arguably, when using TCP, we could probably get a bit better performance by blocking and allowing the kernel to make more progress than a single quick pass through the sockets progress engine, but that involves some other difficulties, such as simultaneously allowing shared-memory progress. We have ideas how to make this work, but it has unfortunately remained at a lower priority: the performance difference isn't that great, and we've been focusing on the other, lower-latency interconnects (shmem, MX, verbs, etc.).
>
> Whilst I understand that you have other priorities, and I'm grateful for the leverage I get by using Open MPI, I would like to offer an alternative use case, which I believe may become more common.
>
> We're developing parallel software which is designed to be used *interactively* as well as in batch mode. We want the same SIMD code running on a user's quad-core workstation as on a 1,000-node cluster. For the former case (single workstation), it would be *much* more user-friendly and interactive for the back-end MPI code not to be spinning at 100% when it's just waiting for the next front-end command. The GUI thread doesn't get a look in.
>
> I can't imagine the difficulties involved, but if the POSIX calls select() and pthread_cond_wait() can do it for TCP and shared-memory threads respectively, it can't be impossible!
> Just my .2c,
> Simon

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] top question
Jeff Squyres wrote:
> We get this question so much that I really need to add it to the FAQ. :-\ Open MPI currently always spins for completion for exactly the reason that Scott cites: lower latency. Arguably, when using TCP, we could probably get a bit better performance by blocking and allowing the kernel to make more progress than a single quick pass through the sockets progress engine, but that involves some other difficulties, such as simultaneously allowing shared-memory progress. We have ideas how to make this work, but it has unfortunately remained at a lower priority: the performance difference isn't that great, and we've been focusing on the other, lower-latency interconnects (shmem, MX, verbs, etc.).

Whilst I understand that you have other priorities, and I'm grateful for the leverage I get by using Open MPI, I would like to offer an alternative use case, which I believe may become more common.

We're developing parallel software which is designed to be used *interactively* as well as in batch mode. We want the same SIMD code running on a user's quad-core workstation as on a 1,000-node cluster. For the former case (single workstation), it would be *much* more user-friendly and interactive for the back-end MPI code not to be spinning at 100% when it's just waiting for the next front-end command. The GUI thread doesn't get a look in.

I can't imagine the difficulties involved, but if the POSIX calls select() and pthread_cond_wait() can do it for TCP and shared-memory threads respectively, it can't be impossible!

Just my .2c,
Simon
Re: [OMPI users] top question
tsi...@coas.oregonstate.edu wrote:
> Thanks for the explanation. I am using GigEth + Open MPI and the buffered MPI_Bsend. I had already noticed that top behaved differently on another cluster with Infiniband + MPICH. So the only option to find out how much time each process is waiting around seems to be to profile the code. Will gprof show me anything useful, or will I have to use a more sophisticated (any free ones?) parallel profiler?

Another frequently asked question! I can try to add a FAQ entry/category. There are a number of free options, including:

- TAU: http://www.cs.uoregon.edu/research/tau/home.php
- mpiP: http://mpip.sourceforge.net/
- FPMPI: http://www.mcs.anl.gov/research/projects/fpmpi/WWW/index.html
- IPM: http://ipm-hpc.sourceforge.net/
- Sun Studio: http://developers.sun.com/sunstudio/

The only one I've really used is Sun Studio. Jumpshot *might* work with Open MPI, I forget. Or, it might be more an MPICH tool.
Re: [OMPI users] top question
Thanks for the explanation. I am using GigEth + Open MPI and the buffered MPI_Bsend. I had already noticed that top behaved differently on another cluster with Infiniband + MPICH.

So the only option to find out how much time each process is waiting around seems to be to profile the code. Will gprof show me anything useful, or will I have to use a more sophisticated (any free ones?) parallel profiler?

Cheers,
Tiago
Re: [OMPI users] top question
We get this question so much that I really need to add it to the FAQ. :-\

Open MPI currently always spins for completion for exactly the reason that Scott cites: lower latency. Arguably, when using TCP, we could probably get a bit better performance by blocking and allowing the kernel to make more progress than a single quick pass through the sockets progress engine, but that involves some other difficulties, such as simultaneously allowing shared-memory progress. We have ideas how to make this work, but it has unfortunately remained at a lower priority: the performance difference isn't that great, and we've been focusing on the other, lower-latency interconnects (shmem, MX, verbs, etc.).

On Jun 3, 2009, at 8:37 AM, Scott Atchley wrote:

> On Jun 3, 2009, at 6:05 AM, tsi...@coas.oregonstate.edu wrote:
>> Top always shows all the parallel processes at 100% in the %CPU field, although some of the time these must be waiting for a communication to complete. How can I see actual processing as opposed to waiting at a barrier?
>>
>> Thanks,
>> Tiago
>
> Using what interconnect? For performance reasons (lower latency), the app and/or OMPI may be polling on the completion. Are you using blocking or non-blocking communication?
>
> Scott

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] top question
On Jun 3, 2009, at 6:05 AM, tsi...@coas.oregonstate.edu wrote:
> Top always shows all the parallel processes at 100% in the %CPU field, although some of the time these must be waiting for a communication to complete. How can I see actual processing as opposed to waiting at a barrier?
>
> Thanks,
> Tiago

Using what interconnect? For performance reasons (lower latency), the app and/or OMPI may be polling on the completion. Are you using blocking or non-blocking communication?

Scott