Re: [O-MPI users] LAM vs OPENMPI performance
On Jan 4, 2006, at 5:05 PM, Tom Rosmond wrote:

> Thanks for the quick reply. I ran my tests with a hostfile with
>
>   cedar.reachone.com slots=4
>
> I clearly misunderstood the role of the 'slots' parameter, because
> when I removed it, OPENMPI slightly outperformed LAM, which I assume
> it should. Thanks for the help.

Not entirely your fault -- I just went back and re-read the FAQ
entries and can easily see how the wording would lead you to that
conclusion. I have touched up the wording to make it clearer, and
added an FAQ item about oversubscription:

  http://www.open-mpi.org/faq/?category=running#oversubscribing

Here's the text (it looks a bit prettier on the web page):

--

Can I oversubscribe nodes (run more processes than processors)?

Yes. However, it is critical that Open MPI knows that you are
oversubscribing the node, or severe performance degradation can
result.

The short explanation is as follows: never specify a number of slots
that is more than the available number of processors. For example, if
you want to run 4 processes on a uniprocessor, then indicate that you
only have 1 slot but want to run 4 processes. For example:

  shell$ cat my-hostfile
  localhost
  shell$ mpirun -np 4 --hostfile my-hostfile a.out

Specifically: do NOT have a hostfile that contains "slots = 4"
(because there is only one available processor).

Here's the full explanation: Open MPI basically runs its message
passing progression engine in two modes: aggressive and degraded.

Degraded: When Open MPI thinks that it is in an oversubscribed mode
(i.e., more processes are running than there are processors
available), MPI processes will automatically run in degraded mode and
frequently yield the processor to their peers, thereby allowing all
processes to make progress.

Aggressive: When Open MPI thinks that it is in an exactly- or
under-subscribed mode (i.e., the number of running processes is equal
to or less than the number of available processors), MPI processes
will automatically run in aggressive mode, meaning that they will
never voluntarily give up the processor to other processes. With some
network transports, this means that Open MPI will spin in tight loops
attempting to make message passing progress, effectively causing
other processes to not get any CPU cycles (and therefore never make
any progress).

For example, on a uniprocessor node:

  shell$ cat my-hostfile
  localhost slots=4
  shell$ mpirun -np 4 --hostfile my-hostfile a.out

This would cause all 4 MPI processes to run in aggressive mode
because Open MPI thinks that there are 4 available processors to use.
This is actually a lie (there is only 1 processor -- not 4), and can
cause extremely bad performance.

--

Hope that clears up the issue. Sorry about that!

--
Jeff Squyres
The Open MPI Project
http://www.open-mpi.org/
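To make the aggressive/degraded distinction above concrete, here is a
minimal sketch of the two progress-loop styles the FAQ entry
describes. This is illustrative only, NOT Open MPI source;
try_progress() is a hypothetical stand-in for the real
message-passing progression engine:

    /* progress_sketch.c -- illustrative sketch only; NOT Open MPI
     * source.  try_progress() is a hypothetical stand-in for the
     * real message-passing progression engine. */
    #include <sched.h>

    static long polls = 0;

    static int try_progress(void)
    {
        /* Pretend a message completes after many empty polls. */
        return ++polls > 1000000;
    }

    /* Aggressive mode: spin in a tight loop and never voluntarily
     * give up the processor.  Fine when every process owns a CPU;
     * disastrous when the node is oversubscribed, because peers on
     * the same CPU get no cycles until the quantum expires. */
    static void wait_aggressive(void)
    {
        while (!try_progress()) {
            /* busy-wait */
        }
    }

    /* Degraded mode: yield after each unsuccessful poll so that
     * oversubscribed peers get scheduled and can make progress. */
    static void wait_degraded(void)
    {
        while (!try_progress()) {
            sched_yield();  /* hand the CPU to another runnable process */
        }
    }

    int main(void)
    {
        wait_aggressive();  /* spins until try_progress() succeeds */
        polls = 0;
        wait_degraded();    /* same result, but shares the CPU */
        return 0;
    }

As a usage note -- and the parameter name here is my assumption about
Open MPI's knobs, not something stated in this thread -- degraded
mode can also be requested explicitly via the mpi_yield_when_idle MCA
parameter:

  shell$ mpirun --mca mpi_yield_when_idle 1 -np 4 --hostfile my-hostfile a.out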
Re: [O-MPI users] LAM vs OPENMPI performance
Thanks for the quick reply. I ran my tests with a hostfile with

  cedar.reachone.com slots=4

I clearly misunderstood the role of the 'slots' parameter, because
when I removed it, OPENMPI slightly outperformed LAM, which I assume
it should. Thanks for the help.

Tom

Brian Barrett wrote:

> On Jan 4, 2006, at 4:24 PM, Tom Rosmond wrote:
>
>> I have been using LAM-MPI for many years on PC/Linux systems and
>> have been quite pleased with its performance. However, at the
>> urging of the LAM-MPI website, I have decided to switch to OPENMPI.
>> For much of my preliminary testing I work on a single processor
>> workstation (see the attached 'config.log' and 'ompi_info.log'
>> files for some of the specifics of my system). I frequently run
>> with more than one virtual MPI processor (i.e., oversubscribe the
>> real processor) to test my code. With LAM the runtime penalty for
>> this is usually insignificant for 2-4 virtual processors, but with
>> OPENMPI it has been prohibitive. Below is a matrix of runtimes for
>> a simple MPI matrix transpose code using mpi_sendrecv (I tried
>> other variations of blocking/non-blocking, synchronous/
>> non-synchronous send/recv with similar results).
>>
>>   message size = 262144 bytes
>>
>>              LAM           OPENMPI
>>   1 proc:  .02575 secs   .02513 secs
>>   2 proc:  .04603 secs   10.069 secs
>>   4 proc:  .04903 secs   35.422 secs
>>
>> I am pretty sure that LAM exploits the fact that the virtual
>> processors are all sharing the same memory, so communication is via
>> memory and/or the PCI bus of the system, while my OPENMPI
>> configuration doesn't exploit this. Is this a reasonable diagnosis
>> of the dramatic difference in performance? More importantly, how do
>> I reconfigure OPENMPI to match the LAM performance?
>
> Based on the output of ompi_info, you should be using shared memory
> with Open MPI (as you are with LAM/MPI). What RPI are you using with
> LAM/MPI (just so we have some idea what you are comparing to)? And
> how are you running Open MPI (what command are you passing to
> mpirun, and if you include a hostfile, what is in that host file)?
>
> If you tell Open MPI via a hostfile that a machine has 2 CPUs when
> it only has 1 and try to run 2 processes on it, you will run into
> severe performance issues. In that case, Open MPI will poll very
> quickly on the CPUs, not giving up the CPU when there is nothing to
> do. If Open MPI is told that there is only 1 CPU and you run 2 procs
> of the same job on that node, then it will be much better about
> giving up the CPU. That would be where I would start looking.
>
> If you have some test code you could share, I'd love to see it - it
> would help in duplicating your results and finding a solution...
>
> Brian
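For readers who want to reproduce the effect, here is a minimal
MPI_Sendrecv timing sketch along the lines Tom describes. This is an
assumed reconstruction for illustration, NOT his actual test code;
only the MPI_Sendrecv pattern and the 262144-byte message size are
taken from the thread, and the pairwise-exchange layout and iteration
count are my choices:

    /* sendrecv_sketch.c -- assumed reconstruction for illustration;
     * NOT Tom's actual test code. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBYTES 262144   /* message size from the table above */
    #define NITER  100      /* arbitrary repetition count */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char *sendbuf = calloc(NBYTES, 1);
        char *recvbuf = calloc(NBYTES, 1);
        int peer = rank ^ 1;              /* pairwise exchange */

        double t0 = MPI_Wtime();
        for (int i = 0; i < NITER; i++) {
            if (peer < size) {            /* idle rank if size is odd */
                MPI_Sendrecv(sendbuf, NBYTES, MPI_BYTE, peer, 0,
                             recvbuf, NBYTES, MPI_BYTE, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d procs: %.5f secs per exchange\n",
                   size, (t1 - t0) / NITER);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

Built and run in the usual way, e.g.:

  shell$ mpicc sendrecv_sketch.c -o a.out
  shell$ mpirun -np 4 --hostfile my-hostfile a.out

Run with and without "slots=4" in the hostfile on a uniprocessor, it
should reproduce the aggressive-mode penalty shown in Tom's table.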
Re: [O-MPI users] LAM vs OPENMPI performance
Hi Tom,

users-requ...@open-mpi.org wrote:

> I am pretty sure that LAM exploits the fact that the virtual
> processors are all sharing the same memory, so communication is via
> memory and/or the PCI bus of the system, while my OPENMPI
> configuration doesn't exploit this. Is this a reasonable diagnosis
> of the dramatic difference in performance?

It would be more likely that OpenMPI is using shared memory and
polling on it, whereas LAM is using sockets, or at least blocking on
something. Polling is a bad thing when oversubscribing a processor.
When you block on a socket (or any OS handle), the process
immediately yields the CPU and is removed from the scheduler. When
you poll waiting for a send or receive to complete, you are burning
cycles on the CPU, and the scheduler will wait for the next quantum
of time before running another process. So, if you send a message
between 2 processes sharing the same processor, the latency will be
on the order of half the scheduler quantum (10 ms on Linux) if they
are both polling. Things are much faster when processes are polling
on different CPUs (1-2 us), but the blocking socket overhead (~20 us)
is way better than the quantum of time when you don't have several
processors.

> More importantly, how do I reconfigure OPENMPI to match the LAM
> performance?

Try disabling the shared memory device in OpenMPI. Unfortunately, I
have no clue how to do it.

Patrick

--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
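For what it's worth, a hedged pointer for Patrick's suggestion -- the
exact knob is my assumption about Open MPI's MCA framework, not
something confirmed in this thread: Open MPI selects its
point-to-point transports through the "btl" MCA parameter, so the
shared memory component (sm) can be excluded on the mpirun command
line:

  shell$ mpirun --mca btl ^sm -np 4 --hostfile my-hostfile a.out

or, equivalently, only the TCP and self (loopback) components can be
requested explicitly:

  shell$ mpirun --mca btl tcp,self -np 4 --hostfile my-hostfile a.out

Note, though, per Jeff's FAQ text earlier in the thread, that fixing
the slots count so Open MPI runs in degraded mode is the more direct
cure for the oversubscription penalty than switching transports.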