Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Gilbert Grosdidier wrote:
> Any other suggestion ?

Can any more information be extracted from profiling? Here is where I think things left off:

Eugene Loh wrote:
> Gilbert Grosdidier wrote:
>> #                    [time]      [calls]    <%mpi>  <%wall>
>> # MPI_Waitall        741683  7.91081e+07     77.96    21.58
>> # MPI_Allreduce      114057  2.53665e+07     11.99     3.32
>> # MPI_Isend         27420.6  6.53513e+08      2.88     0.80
>> # MPI_Irecv         464.616  6.53513e+08      0.05     0.01
>>
>> It seems to my non-expert eye that MPI_Waitall is dominant among MPI calls, but not for the overall application.
>
> Looks like on average each MPI_Waitall call is completing 8+ MPI_Isend calls and 8+ MPI_Irecv calls. I think IPM gives some point-to-point messaging information. Maybe you can tell what the distribution is of message sizes, etc. Or, maybe you already know the characteristic pattern.

Does a stand-alone message-passing test (without the computational portion) capture the performance problem you're looking for? Do you know message lengths and patterns? Can you confirm whether non-MPI time is the same between good and bad runs?
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Unfortunately, I was unable to spot any striking difference in performance when using --bind-to-core. Sorry. Any other suggestion ?

Regards, Gilbert.

On Jan 7, 2011, at 4:32 PM, Jeff Squyres wrote:
> Well, bummer -- there goes my theory. According to the hwloc info you posted earlier, this shows that OMPI is binding to the 1st hyperthread on each core; *not* to both hyperthreads on a single core. :-\ It would still be slightly interesting to see if there's any difference when you run with --bind-to-core instead of paffinity_alone.
>
> On Jan 7, 2011, at 9:56 AM, Gilbert Grosdidier wrote:
>> Yes, here it is :
>>
>> mpirun -np 8 --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
>> 0x0001
>> 0x0002
>> 0x0004
>> 0x0008
>> 0x0010
>> 0x0020
>> 0x0040
>> 0x0080
>>
>> Gilbert.
>>
>> On Jan 7, 2011, at 3:50 PM, Jeff Squyres wrote:
>>> Can you run with np=8?
>>>
>>> On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:
>>>> Hi Jeff,
>>>> Thanks for taking care of this.
>>>> Here is what I got on a worker node:
>>>> mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
>>>> 0x0001
>>>> Is this what is expected, please ? Or should I try yet another command ?
>>>> Thanks, Regards, Gilbert.
>>>>
>>>> On Jan 7, 2011, at 3:35 PM, Jeff Squyres wrote:
>>>>> On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
>>>>>> lstopo
>>>>>> Machine (35GB)
>>>>>>   NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>>>>>>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>>>>>       PU L#0 (P#0)
>>>>>>       PU L#1 (P#8)
>>>>>>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>>>>>       PU L#2 (P#1)
>>>>>>       PU L#3 (P#9)
>>>>>>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>>>>>       PU L#4 (P#2)
>>>>>>       PU L#5 (P#10)
>>>>>>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>>>>>       PU L#6 (P#3)
>>>>>>       PU L#7 (P#11)
>>>>> [snip]
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
*-*
Gilbert Grosdidier               gilbert.grosdid...@in2p3.fr
LAL / IN2P3 / CNRS               Phone : +33 1 6446 8909
Faculté des Sciences, Bat. 200   Fax   : +33 1 6446 8546
B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
I'll very soon give a try to using hyperthreading with our app, and keep you posted about the improvements, if any. Our current cluster is made out of 4-core dual-socket Nehalem nodes.

Cheers, Gilbert.

On Jan 7, 2011, at 4:17 PM, Tim Prince wrote:
> On 1/7/2011 6:49 AM, Jeff Squyres wrote:
>> My understanding is that hyperthreading can only be activated/deactivated at boot time -- once the core resources are allocated to hyperthreads, they can't be changed while running. Whether disabling the hyperthreads or simply telling Linux not to schedule on them makes a difference performance-wise remains to be seen. I've never had the time to do a little benchmarking to quantify the difference. If someone could rustle up a few cycles (get it?) to test out what the real-world performance difference is between disabling hyperthreading in the BIOS vs. telling Linux to ignore the hyperthreads, that would be awesome. I'd love to see such results. My personal guess is that the difference is in the noise. But that's a guess.
>
> Applications which depend on availability of the full-size instruction lookaside buffer would be candidates for better performance with hyperthreads completely disabled. Many HPC applications don't stress the ITLB, but some do. Most of the important resources are allocated dynamically between threads, but the ITLB is an exception. We reported results of an investigation on Intel Nehalem 4-core hyperthreading where geometric mean performance of standard benchmarks for certain commercial applications was 2% better with hyperthreading disabled at boot time, compared with the best 1-rank-per-core scheduling with hyperthreading enabled. Needless to say, the report wasn't popular with marketing. I haven't seen an equivalent investigation for the 6-core CPUs, where various strange performance effects have been noted, so, as Jeff said, the hyperthreading effect could be "in the noise."
-- Tim Prince

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
*-*
Gilbert Grosdidier               gilbert.grosdid...@in2p3.fr
LAL / IN2P3 / CNRS               Phone : +33 1 6446 8909
Faculté des Sciences, Bat. 200   Fax   : +33 1 6446 8546
B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Well, bummer -- there goes my theory. According to the hwloc info you posted earlier, this shows that OMPI is binding to the 1st hyperthread on each core; *not* to both hyperthreads on a single core. :-\ It would still be slightly interesting to see if there's any difference when you run with --bind-to-core instead of paffinity_alone.

On Jan 7, 2011, at 9:56 AM, Gilbert Grosdidier wrote:
> Yes, here it is :
>
> mpirun -np 8 --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
> 0x0001
> 0x0002
> 0x0004
> 0x0008
> 0x0010
> 0x0020
> 0x0040
> 0x0080
>
> Gilbert.
>
> On Jan 7, 2011, at 3:50 PM, Jeff Squyres wrote:
>> Can you run with np=8?
>>
>> On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:
>>> Hi Jeff,
>>> Thanks for taking care of this.
>>> Here is what I got on a worker node:
>>> mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
>>> 0x0001
>>> Is this what is expected, please ? Or should I try yet another command ?
>>> Thanks, Regards, Gilbert.
>>>
>>> On Jan 7, 2011, at 3:35 PM, Jeff Squyres wrote:
>>>> On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
>>>>> lstopo
>>>>> Machine (35GB)
>>>>>   NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>>>>>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>>>>       PU L#0 (P#0)
>>>>>       PU L#1 (P#8)
>>>>>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>>>>       PU L#2 (P#1)
>>>>>       PU L#3 (P#9)
>>>>>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>>>>       PU L#4 (P#2)
>>>>>       PU L#5 (P#10)
>>>>>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>>>>       PU L#6 (P#3)
>>>>>       PU L#7 (P#11)
>>>> [snip]

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
On 1/7/2011 6:49 AM, Jeff Squyres wrote: My understanding is that hyperthreading can only be activated/deactivated at boot time -- once the core resources are allocated to hyperthreads, they can't be changed while running. Whether disabling the hyperthreads or simply telling Linux not to schedule on them makes a difference performance-wise remains to be seen. I've never had the time to do a little benchmarking to quantify the difference. If someone could rustle up a few cycles (get it?) to test out what the real-world performance difference is between disabling hyperthreading in the BIOS vs. telling Linux to ignore the hyperthreads, that would be awesome. I'd love to see such results. My personal guess is that the difference is in the noise. But that's a guess. Applications which depend on availability of full size instruction lookaside buffer would be candidates for better performance with hyperthreads completely disabled. Many HPC applications don't stress ITLB, but some do. Most of the important resources are allocated dynamically between threads, but the ITLB is an exception. We reported results of an investigation on Intel Nehalem 4-core hyperthreading where geometric mean performance of standard benchmarks for certain commercial applications was 2% better with hyperthreading disabled at boot time, compared with best 1 rank per core scheduling with hyperthreading enabled. Needless to say, the report wasn't popular with marketing. I haven't seen an equivalent investigation for the 6-core CPUs, where various strange performance effects have been noted, so, as Jeff said, the hyperthreading effect could be "in the noise." -- Tim Prince
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Yes, here it is :

> mpirun -np 8 --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
0x0001
0x0002
0x0004
0x0008
0x0010
0x0020
0x0040
0x0080

Gilbert.

On Jan 7, 2011, at 3:50 PM, Jeff Squyres wrote:
> Can you run with np=8?
>
> On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:
>> Hi Jeff,
>> Thanks for taking care of this.
>> Here is what I got on a worker node:
>> mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
>> 0x0001
>> Is this what is expected, please ? Or should I try yet another command ?
>> Thanks, Regards, Gilbert.
>>
>> On Jan 7, 2011, at 3:35 PM, Jeff Squyres wrote:
>>> On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
>>>> lstopo
>>>> Machine (35GB)
>>>>   NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>>>>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>>>       PU L#0 (P#0)
>>>>       PU L#1 (P#8)
>>>>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>>>       PU L#2 (P#1)
>>>>       PU L#3 (P#9)
>>>>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>>>       PU L#4 (P#2)
>>>>       PU L#5 (P#10)
>>>>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>>>       PU L#6 (P#3)
>>>>       PU L#7 (P#11)
>>> [snip]
>>>
>>> Well, this might disprove my theory. :-\ The OS indexing is not contiguous on the hyperthreads, so I might be wrong about what happened here. Try this:
>>>
>>> mpirun --mca mpi_paffinity_alone 1 hwloc-bind --get
>>>
>>> You can even run that on just one node; let's see what you get. This will tell us what each process is *actually* bound to. hwloc-bind --get will report a bitmask of the P#'s from above. So if we see 001, 010, 011, ...etc, then my theory of OMPI binding 1 proc per hyperthread (vs. 1 proc per core) is incorrect.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
*-*
Gilbert Grosdidier               gilbert.grosdid...@in2p3.fr
LAL / IN2P3 / CNRS               Phone : +33 1 6446 8909
Faculté des Sciences, Bat. 200   Fax   : +33 1 6446 8546
B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Can you run with np=8?

On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:
> Hi Jeff,
>
> Thanks for taking care of this.
>
> Here is what I got on a worker node:
> mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
> 0x0001
>
> Is this what is expected, please ? Or should I try yet another command ?
>
> Thanks, Regards, Gilbert.
>
> On Jan 7, 2011, at 3:35 PM, Jeff Squyres wrote:
>> On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
>>> lstopo
>>> Machine (35GB)
>>>   NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>>>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>>       PU L#0 (P#0)
>>>       PU L#1 (P#8)
>>>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>>       PU L#2 (P#1)
>>>       PU L#3 (P#9)
>>>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>>       PU L#4 (P#2)
>>>       PU L#5 (P#10)
>>>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>>       PU L#6 (P#3)
>>>       PU L#7 (P#11)
>> [snip]
>>
>> Well, this might disprove my theory. :-\ The OS indexing is not contiguous on the hyperthreads, so I might be wrong about what happened here. Try this:
>>
>> mpirun --mca mpi_paffinity_alone 1 hwloc-bind --get
>>
>> You can even run that on just one node; let's see what you get. This will tell us what each process is *actually* bound to. hwloc-bind --get will report a bitmask of the P#'s from above. So if we see 001, 010, 011, ...etc, then my theory of OMPI binding 1 proc per hyperthread (vs. 1 proc per core) is incorrect.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
On Jan 7, 2011, at 5:27 AM, John Hearns wrote:
> Actually, the topic of hyperthreading is interesting, and we should discuss it please. Hyperthreading is supposedly implemented better and 'properly' on Nehalem - I would be interested to see some genuine performance measurements with hyperthreading on/off on your machine Gilbert.

FWIW, from what I've seen, and from the recommendations I've heard from Intel, using hyperthreading is still a hit-or-miss proposition with HPC apps. It's true that Nehalem (and later) hyperthreading is much better than it was before. But hyperthreading is still designed to support apps that stall frequently (so the other hyperthread(s) can take over and do useful work while one is stalled). Good HPC apps don't stall much, so hyperthreading still isn't a huge win.

Nehalem (and later) hyperthreading has been discussed on this list at least once or twice before; google through the archives to see if you can dig up the conversations. I have dim recollections of people sending at least some performance numbers...? (I could be wrong here, though)

> Also you don't need to reboot and change BIOS settings - there was a rather nifty technique on this list I think, where you disable every second CPU in Linux - which has the same effect as switching off hyperthreading.

Yes, you can disable all but one hyperthread on a processor in Linux by:

# echo 0 > /sys/devices/system/cpu/cpuX/online

where X is an integer from the set listed in hwloc's lstopo output as the P# numbers (i.e., the OS index values, as opposed to the logical index values). Repeat for the 2nd P# value on each core in your machine. You can run lstopo again to verify that they went offline. You can "echo 1" to the same file to bring it back online. Note that you can't offline X=0.

Note that this technique technically doesn't disable each hyperthread; it just causes Linux to avoid scheduling on it.
Disabling hyperthreading in the BIOS is slightly different; there you are actually disabling all but one hardware thread per core. The difference is in how resources in a core are split between hyperthreads. When you disable hyperthreading in the BIOS, all the resources in the core are given to the first hyperthread and the 2nd is deactivated (i.e., the OS doesn't even see it at all). When hyperthreading is enabled in the BIOS, the core resources are split between all hyperthreads. Specifically: causing the OS to simply not schedule on all but the first hyperthread doesn't give those resources back to the first hyperthread; it just effectively ignores all but the first hyperthread.

My understanding is that hyperthreading can only be activated/deactivated at boot time -- once the core resources are allocated to hyperthreads, they can't be changed while running.

Whether disabling the hyperthreads or simply telling Linux not to schedule on them makes a difference performance-wise remains to be seen. I've never had the time to do a little benchmarking to quantify the difference. If someone could rustle up a few cycles (get it?) to test out what the real-world performance difference is between disabling hyperthreading in the BIOS vs. telling Linux to ignore the hyperthreads, that would be awesome. I'd love to see such results. My personal guess is that the difference is in the noise. But that's a guess.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Hi Jeff,

Thanks for taking care of this. Here is what I got on a worker node:

> mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
0x0001

Is this what is expected, please ? Or should I try yet another command ?

Thanks, Regards, Gilbert.

On Jan 7, 2011, at 3:35 PM, Jeff Squyres wrote:
> On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
>> lstopo
>> Machine (35GB)
>>   NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>       PU L#0 (P#0)
>>       PU L#1 (P#8)
>>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>       PU L#2 (P#1)
>>       PU L#3 (P#9)
>>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>       PU L#4 (P#2)
>>       PU L#5 (P#10)
>>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>       PU L#6 (P#3)
>>       PU L#7 (P#11)
> [snip]
>
> Well, this might disprove my theory. :-\ The OS indexing is not contiguous on the hyperthreads, so I might be wrong about what happened here. Try this:
>
> mpirun --mca mpi_paffinity_alone 1 hwloc-bind --get
>
> You can even run that on just one node; let's see what you get. This will tell us what each process is *actually* bound to. hwloc-bind --get will report a bitmask of the P#'s from above. So if we see 001, 010, 011, ...etc, then my theory of OMPI binding 1 proc per hyperthread (vs. 1 proc per core) is incorrect.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
*-*
Gilbert Grosdidier               gilbert.grosdid...@in2p3.fr
LAL / IN2P3 / CNRS               Phone : +33 1 6446 8909
Faculté des Sciences, Bat. 200   Fax   : +33 1 6446 8546
B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
> > lstopo
> Machine (35GB)
>   NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#8)
>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>       PU L#2 (P#1)
>       PU L#3 (P#9)
>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>       PU L#4 (P#2)
>       PU L#5 (P#10)
>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>       PU L#6 (P#3)
>       PU L#7 (P#11)
[snip]

Well, this might disprove my theory. :-\ The OS indexing is not contiguous on the hyperthreads, so I might be wrong about what happened here. Try this:

mpirun --mca mpi_paffinity_alone 1 hwloc-bind --get

You can even run that on just one node; let's see what you get. This will tell us what each process is *actually* bound to. hwloc-bind --get will report a bitmask of the P#'s from above. So if we see 001, 010, 011, ...etc, then my theory of OMPI binding 1 proc per hyperthread (vs. 1 proc per core) is incorrect.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
On 6 January 2011 21:10, Gilbert Grosdidier wrote:
> Hi Jeff,
>
> Where's located lstopo command on SuseLinux, please ?
> And/or hwloc-bind, which seems related to it ?

I was able to get hwloc to install quite easily on SuSE - download/configure/make. Configure it to install to /usr/local/bin.

Actually, the topic of hyperthreading is interesting, and we should discuss it please. Hyperthreading is supposedly implemented better and 'properly' on Nehalem - I would be interested to see some genuine performance measurements with hyperthreading on/off on your machine Gilbert.

Also you don't need to reboot and change BIOS settings - there was a rather nifty technique on this list I think, where you disable every second CPU in Linux - which has the same effect as switching off hyperthreading. Maybe you could try it?
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Hi Jeff,

Here is the output of lstopo on one of the workers (thanks Jean-Christophe) :

> lstopo
Machine (35GB)
  NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#8)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#9)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#10)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#11)
  NUMANode L#1 (P#1 18GB) + Socket L#1 + L3 L#1 (8192KB)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
      PU L#8 (P#4)
      PU L#9 (P#12)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
      PU L#10 (P#5)
      PU L#11 (P#13)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
      PU L#12 (P#6)
      PU L#13 (P#14)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#15)

Tests with --bind-to-core are under way ... What is your conclusion, please ?

Thanks, G.

On Jan 6, 2011, at 11:16 PM, Jeff Squyres wrote:
> On Jan 6, 2011, at 5:07 PM, Gilbert Grosdidier wrote:
>> Yes Jeff, I'm pretty sure indeed that hyperthreading is enabled, since 16 CPUs are visible in the /proc/cpuinfo pseudo-file, while it's an 8-core Nehalem node. However, I always carefully checked that only 8 processes are running on each node. Could it be that they are assigned to 8 hyperthreads but only 4 cores, for example ? Is this actually possible with paffinity set to 1 ?
>
> Yes. I actually had this happen to another user recently; I should add this to the FAQ... (/me adds to to-do list)
>
> Here's what I'm guessing is happening: OMPI's paffinity_alone algorithm is currently pretty stupid. It simply assigns the first MPI process on the node to OS processor ID 0. It then assigns the second MPI process on the node to OS processor ID 1. ...and so on. However, if hyperthreading is enabled, OS processor IDs 0 and 1 might be 2 hyperthreads on the same core. And therefore OMPI has effectively just bound 2 processes to the same core. Ouch!
>
> The output of lstopo can verify if this is happening: look to see if processor IDs 0 through 7 are on the same 4 cores.
>
> Instead of paffinity_alone, use the mpirun --bind-to-core option; that should bind each MPI process to (the first hyperthread in) its own core.
>
> Sidenote: many improvements are coming to our processor affinity system over the next few releases... See my slides from the Open MPI BOF at SC'10 for some discussion of what's coming: http://www.open-mpi.org/papers/sc-2010/
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
On Jan 6, 2011, at 5:07 PM, Gilbert Grosdidier wrote:
> Yes Jeff, I'm pretty sure indeed that hyperthreading is enabled, since 16 CPUs are visible in the /proc/cpuinfo pseudo-file, while it's an 8-core Nehalem node.
>
> However, I always carefully checked that only 8 processes are running on each node. Could it be that they are assigned to 8 hyperthreads but only 4 cores, for example ? Is this actually possible with paffinity set to 1 ?

Yes. I actually had this happen to another user recently; I should add this to the FAQ... (/me adds to to-do list)

Here's what I'm guessing is happening: OMPI's paffinity_alone algorithm is currently pretty stupid. It simply assigns the first MPI process on the node to OS processor ID 0. It then assigns the second MPI process on the node to OS processor ID 1. ...and so on. However, if hyperthreading is enabled, OS processor IDs 0 and 1 might be 2 hyperthreads on the same core. And therefore OMPI has effectively just bound 2 processes to the same core. Ouch!

The output of lstopo can verify if this is happening: look to see if processor IDs 0 through 7 are on the same 4 cores.

Instead of paffinity_alone, use the mpirun --bind-to-core option; that should bind each MPI process to (the first hyperthread in) its own core.

Sidenote: many improvements are coming to our processor affinity system over the next few releases... See my slides from the Open MPI BOF at SC'10 for some discussion of what's coming: http://www.open-mpi.org/papers/sc-2010/

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
On Jan 6, 2011, at 4:10 PM, Gilbert Grosdidier wrote:
> Where's located lstopo command on SuseLinux, please ?

'fraid I don't know anything about Suse... :-( It may be named hwloc-ls...?

> And/or hwloc-bind, which seems related to it ?

hwloc-bind is definitely related, but it's a different utility: http://www.open-mpi.org/projects/hwloc/doc/v1.1/tools.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Hi Jeff,

Where's located lstopo command on SuseLinux, please ? And/or hwloc-bind, which seems related to it ?

Thanks, G.

On Jan 6, 2011, at 9:21 PM, Jeff Squyres wrote:
> (now that we're back from vacation) Actually, this could be an issue. Is hyperthreading enabled on your machine? Can you send the text output from running hwloc's "lstopo" command on your compute nodes?
>
> I ask because if hyperthreading is enabled, OMPI might be assigning one process per *hyperthread* (vs. one process per *core*). And that could be disastrous for performance.
>
> On Dec 22, 2010, at 2:25 PM, Gilbert Grosdidier wrote:
>> Hi David,
>> Yes, I set mpi_affinity_alone to 1. Is that right and sufficient, please ?
>> Thanks for your help, Best, G.
>>
>> On Dec 22, 2010, at 8:18 PM, David Singleton wrote:
>>> Is the same level of processes and memory affinity or binding being used?
>>>
>>> On 12/21/2010 07:45 AM, Gilbert Grosdidier wrote:
>>>> Yes, there is definitely only 1 process per core with both MPI implementations.
>>>> Thanks, G.
>>>>
>>>> On Dec 20, 2010, at 8:39 PM, George Bosilca wrote:
>>>>> Are your processes placed the same way with the two MPI implementations? Per-node vs. per-core ? george.
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
(now that we're back from vacation)

Actually, this could be an issue. Is hyperthreading enabled on your machine? Can you send the text output from running hwloc's "lstopo" command on your compute nodes?

I ask because if hyperthreading is enabled, OMPI might be assigning one process per *hyperthread* (vs. one process per *core*). And that could be disastrous for performance.

On Dec 22, 2010, at 2:25 PM, Gilbert Grosdidier wrote:
> Hi David,
>
> Yes, I set mpi_affinity_alone to 1. Is that right and sufficient, please ?
>
> Thanks for your help, Best, G.
>
> On Dec 22, 2010, at 8:18 PM, David Singleton wrote:
>> Is the same level of processes and memory affinity or binding being used?
>>
>> On 12/21/2010 07:45 AM, Gilbert Grosdidier wrote:
>>> Yes, there is definitely only 1 process per core with both MPI implementations.
>>>
>>> Thanks, G.
>>>
>>> On Dec 20, 2010, at 8:39 PM, George Bosilca wrote:
>>>> Are your processes placed the same way with the two MPI implementations? Per-node vs. per-core ? george.

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Hi David,

Yes, I set mpi_affinity_alone to 1. Is that right and sufficient, please ?

Thanks for your help, Best, G.

On Dec 22, 2010, at 8:18 PM, David Singleton wrote:
> Is the same level of processes and memory affinity or binding being used?
>
> On 12/21/2010 07:45 AM, Gilbert Grosdidier wrote:
>> Yes, there is definitely only 1 process per core with both MPI implementations.
>>
>> Thanks, G.
>>
>> On Dec 20, 2010, at 8:39 PM, George Bosilca wrote:
>>> Are your processes placed the same way with the two MPI implementations? Per-node vs. per-core ? george.
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance
Gilbert Grosdidier wrote:
> Good evening Eugene,

Good morning here.

> Here follows some output for a 1024 core run.

Assuming this corresponds meaningfully with your original e-mail, 1024 cores means performance of 700 vs 900. So, that looks roughly consistent with the 28% MPI time you show here. That seems to imply that the slowdown is due entirely to long MPI times (rather than slow non-MPI times). Just a sanity check.

> Unfortunately, I'm yet unable to have the equivalent MPT chart.

That may be all right. If one run clearly shows a problem (which is perhaps the case here), then a "good profile" is not needed. Here, a "good profile" would perhaps be used only to confirm that near-zero MPI time is possible.

> #IPMv0.983
> # host   : r34i0n0/x86_64_Linux    mpi_tasks : 1024 on 128 nodes
> # start  : 12/21/10/13:18:09       wallclock : 3357.308618 sec
> # stop   : 12/21/10/14:14:06       %comm     : 27.67
> ##
> #                 [total]       <avg>      <min>      <max>
> # wallclock   3.43754e+06     3356.98    3356.83    3357.31
> # user        2.82831e+06     2762.02    2622.04    2923.37
> # system           376230     367.412    174.603    492.919
> # mpi              951328     929.031    633.137    1052.86
> # %comm                       27.6719    18.8601     31.363

No glaring evidence here of load imbalance being the sole explanation, but hard to tell from these numbers. (If min comm time is 0%, then that process is presumably holding everyone else up.)

> #                    [time]      [calls]    <%mpi>  <%wall>
> # MPI_Waitall        741683  7.91081e+07     77.96    21.58
> # MPI_Allreduce      114057  2.53665e+07     11.99     3.32
> # MPI_Isend         27420.6  6.53513e+08      2.88     0.80
> # MPI_Irecv         464.616  6.53513e+08      0.05     0.01
>
> It seems to my non-expert eye that MPI_Waitall is dominant among MPI calls, but not for the overall application,

If at 1024 cores, performance is 700 compared to 900, then whatever the problem is still hasn't dominated the entire application performance. So, it looks like MPI_Waitall is the problem, even if it doesn't dominate overall application time. Looks like on average each MPI_Waitall call is completing 8+ MPI_Isend calls and 8+ MPI_Irecv calls.

I think IPM gives some point-to-point messaging information. Maybe you can tell what the distribution is of message sizes, etc. Or, maybe you already know the characteristic pattern. Does a stand-alone message-passing test (without the computational portion) capture the performance problem you're looking for?

On Dec 22, 2010, at 6:50 PM, Eugene Loh wrote:
> Can you isolate a bit more where the time is being spent? The performance effect you're describing appears to be drastic. Have you profiled the code? Some choices of tools can be found in the FAQ http://www.open-mpi.org/faq/?category=perftools The results may be "uninteresting" (all time spent in your MPI_Waitall calls, for example), but it'd be good to rule out other possibilities (e.g., I've seen cases where it's the non-MPI time that's the culprit).
>
> If all the time is spent in MPI_Waitall, then I wonder if it would be possible for you to reproduce the problem with just some MPI_Isend|Irecv|Waitall calls that mimic your program. E.g., "lots of short messages", or "lots of long messages", etc. It sounds like there is some repeated set of MPI exchanges, so maybe that set can be extracted and run without the complexities of the application.
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance
Bonsoir Eugene, First, thanks for trying to help me. I already gave a try to a profiling tool, namely IPM, which is rather simple to use. Here follows some output for a 1024-core run. Unfortunately, I am as yet unable to produce the equivalent MPT chart.

#IPMv0.983
#
# command   : unknown (completed)
# host      : r34i0n0/x86_64_Linux      mpi_tasks : 1024 on 128 nodes
# start     : 12/21/10/13:18:09         wallclock : 3357.308618 sec
# stop      : 12/21/10/14:14:06         %comm     : 27.67
# gbytes    : 0.0e+00 total             gflop/sec : 0.0e+00 total
#
##
# region    : *    [ntasks] = 1024
#
#                  [total]        <avg>        min          max
# entries          1024           1            1            1
# wallclock        3.43754e+06    3356.98      3356.83      3357.31
# user             2.82831e+06    2762.02      2622.04      2923.37
# system           376230         367.412      174.603      492.919
# mpi              951328         929.031      633.137      1052.86
# %comm                           27.6719      18.8601      31.363
# gflop/sec        0              0            0            0
# gbytes           0              0            0            0
#
#                  [time]         [calls]      <%mpi>       <%wall>
# MPI_Waitall      741683         7.91081e+07  77.96        21.58
# MPI_Allreduce    114057         2.53665e+07  11.99         3.32
# MPI_Recv         40164.7        2048          4.22         1.17
# MPI_Isend        27420.6        6.53513e+08   2.88         0.80
# MPI_Barrier      25113.5        2048          2.64         0.73
# MPI_Sendrecv     2123.62        12992         0.22         0.06
# MPI_Irecv        464.616        6.53513e+08   0.05         0.01
# MPI_Reduce       215.447        171008        0.02         0.01
# MPI_Bcast        85.0198        1024          0.01         0.00
# MPI_Send         0.377043       2048          0.00         0.00
# MPI_Comm_rank    0.000744925    4096          0.00         0.00
# MPI_Comm_size    0.000252183    1024          0.00         0.00

### It seems to my non-expert eye that MPI_Waitall is dominant among the MPI calls, though not for the overall application; however, I will have to compare with MPT before concluding. Thanks again for your suggestions, which I'll address one by one. Best, G.

Le 22/12/2010 18:50, Eugene Loh a écrit : Can you isolate a bit more where the time is being spent? [...] Anyhow, some profiling might help guide one to the problem. Gilbert Grosdidier wrote: There is indeed a high rate of communications. [...]
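As a quick sanity check on the IPM table above, one can recompute Eugene's observation (each MPI_Waitall completing 8+ sends and 8+ receives) directly from the reported counts. A small sketch, with the numbers copied verbatim from the IPM output:

```python
# Derived statistics from the IPM table above (values copied verbatim).
calls = {
    "MPI_Waitall": (741683.0, 7.91081e+07),   # (total seconds, call count)
    "MPI_Isend":   (27420.6,  6.53513e+08),
    "MPI_Irecv":   (464.616,  6.53513e+08),
}

waitall_time, waitall_calls = calls["MPI_Waitall"]

# Average number of MPI_Isend requests completed per MPI_Waitall call.
isends_per_waitall = calls["MPI_Isend"][1] / waitall_calls

# Average time spent inside each MPI_Waitall, in microseconds.
us_per_waitall = waitall_time / waitall_calls * 1e6

print(f"Isends per Waitall: {isends_per_waitall:.2f}")
print(f"Mean Waitall time:  {us_per_waitall:.1f} us")
```

The averages work out to roughly 8.3 sends (and likewise receives) retired per Waitall, at several milliseconds per call, which is consistent with each Waitall waiting on the slowest of the eight neighbour exchanges rather than on per-message overhead.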
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance
There is indeed a high rate of communications. But the buffer size is always the same for a given pair of processes, and I thought that mpi_leave_pinned should avoid freeing the memory in this case. Am I wrong ? Thanks, Best, G.

Le 21/12/2010 18:52, Matthieu Brucher a écrit : Don't forget that MPT has some optimizations OpenMPI may not have, such as "overriding" free(). This way, MPT can have a huge performance boost if you allocate and free memory frequently, and the same happens if you communicate often. Matthieu

2010/12/21 Gilbert Grosdidier: [...]
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance
Hi George, Thanks for your help. The bottom line is that the processes are neatly placed on the nodes/cores, as far as I can tell from the map:
[...]
Process OMPI jobid: [33285,1] Process rank: 4
Process OMPI jobid: [33285,1] Process rank: 5
Process OMPI jobid: [33285,1] Process rank: 6
Process OMPI jobid: [33285,1] Process rank: 7
Data for node: Name: r34i0n1 Num procs: 8
Process OMPI jobid: [33285,1] Process rank: 8
Process OMPI jobid: [33285,1] Process rank: 9
Process OMPI jobid: [33285,1] Process rank: 10
Process OMPI jobid: [33285,1] Process rank: 11
Process OMPI jobid: [33285,1] Process rank: 12
Process OMPI jobid: [33285,1] Process rank: 13
Process OMPI jobid: [33285,1] Process rank: 14
Process OMPI jobid: [33285,1] Process rank: 15
Data for node: Name: r34i0n2 Num procs: 8
Process OMPI jobid: [33285,1] Process rank: 16
Process OMPI jobid: [33285,1] Process rank: 17
Process OMPI jobid: [33285,1] Process rank: 18
Process OMPI jobid: [33285,1] Process rank: 19
Process OMPI jobid: [33285,1] Process rank: 20
[...]
But the perfs are still very low ;-( Best, G.

Le 20 déc. 10 à 22:27, George Bosilca a écrit : That's a first step. [...]
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
That's a first step. My question was more related to the process overlay on the cores. If the MPI implementation places one process per node, then rank k and rank k+1 will always be on separate nodes, and the communications will have to go over IB. Conversely, if the MPI implementation places the processes per core, then ranks k and k+1 will [mostly] be on the same node and the communications will go over shared memory. Depending on how the processes are placed and how you create the neighborhoods, the performance can be drastically impacted. There is a pretty good description of the problem at: http://www.hpccommunity.org/f55/behind-scenes-mpi-process-placement-640/ Some hints at http://www.open-mpi.org/faq/?category=running#mpirun-scheduling. I suggest you play with the --byslot and --bynode options to see how this affects the performance of your application. For the hardcore cases we provide a rankfile feature. More info at: http://www.open-mpi.org/faq/?category=tuning#using-paffinity Enjoy, george.

On Dec 20, 2010, at 15:45 , Gilbert Grosdidier wrote:
> Yes, there is definitely only 1 process per core with both MPI implementations.
> Thanks, G.
> Le 20/12/2010 20:39, George Bosilca a écrit :
>> Are your processes placed the same way with the two MPI implementations? Per-node vs. per-core ?
>> george.
>> On Dec 20, 2010, at 11:14 , Gilbert Grosdidier wrote:
>>> [...]
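George's --byslot-vs---bynode point can be made concrete with a little arithmetic. A hypothetical sketch (the mapping rules are the standard fill-by-slot vs. round-robin semantics; 8 cores per node and 128 nodes are taken from the 1024-core runs above) of how many rank-(k, k+1) pairs end up on the same node, i.e. communicating over shared memory rather than IB:

```python
# Illustrative node-assignment arithmetic for the two mpirun policies.
# Assumes ppn cores per node and ranks 0..nprocs-1. (Hypothetical sketch.)
def node_byslot(rank, ppn):
    """--byslot: fill all slots of a node before moving to the next node."""
    return rank // ppn

def node_bynode(rank, nnodes):
    """--bynode: deal ranks out round-robin across the nodes."""
    return rank % nnodes

ppn, nnodes = 8, 128            # 8 cores/node, 128 nodes = 1024 ranks
nprocs = ppn * nnodes

# Fraction of consecutive rank pairs (k, k+1) sharing a node.
same_byslot = sum(node_byslot(k, ppn) == node_byslot(k + 1, ppn)
                  for k in range(nprocs - 1)) / (nprocs - 1)
same_bynode = sum(node_bynode(k, nnodes) == node_bynode(k + 1, nnodes)
                  for k in range(nprocs - 1)) / (nprocs - 1)

print(f"byslot: {same_byslot:.1%} of (k, k+1) pairs share a node")
print(f"bynode: {same_bynode:.1%} of (k, k+1) pairs share a node")
```

For a nearest-neighbour exchange pattern, --bynode therefore forces essentially all rank-adjacent traffic onto InfiniBand, while --byslot keeps most of it in shared memory.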
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Yes, there is definitely only 1 process per core with both MPI implementations. Thanks, G.

Le 20/12/2010 20:39, George Bosilca a écrit : Are your processes placed the same way with the two MPI implementations? Per-node vs. per-core ? george.

On Dec 20, 2010, at 11:14 , Gilbert Grosdidier wrote:

Bonjour, I am now at a loss with my running of OpenMPI (namely 1.4.3) on an SGI Altix cluster with 2048 or 4096 cores, running over Infiniband. After fixing several rather obvious failures with Ralph's, Jeff's and John's help, I am now facing the bottom of this story since:
- there are no more obvious failures with messages
- compared to running the application with SGI-MPT, the CPU performances I get are very low, decreasing as the number of cores increases (cf. below)
- these performances are highly reproducible
- I tried a very high number of -mca parameters, to no avail

If I take as a reference the MPT CPU speed performance, it is about 900 (in some arbitrary unit), whatever the number of cores I used (up to 8192). But, when running with OMPI, I get:
- 700 with 1024 cores (which is already rather low)
- 300 with 2048 cores
- 60 with 4096 cores.

The computing loop, over which the above CPU performance is evaluated, includes a stack of MPI exchanges [per core: 8 x (MPI_Isend + MPI_Irecv) + MPI_Waitall]. The application is of the 'domain partition' type, and the performances, together with the memory footprint, are nearly identical on all cores. The memory footprint is twice as high in the OMPI case (1.5GB/core) as in the MPT case (0.7GB/core). What could be wrong with all these, please ?

I provided (in attachment) the 'ompi_info -all' output. The config.log is in attachment as well. I compiled OMPI with icc. I checked numa and affinity are OK. I use the following command to run my OMPI app:

"mpiexec -mca btl_openib_rdma_pipeline_send_length 65536 \
  -mca btl_openib_rdma_pipeline_frag_size 65536 \
  -mca btl_openib_min_rdma_pipeline_size 65536 \
  -mca btl_self_rdma_pipeline_send_length 262144 \
  -mca btl_self_rdma_pipeline_frag_size 262144 \
  -mca plm_rsh_num_concurrent 4096 -mca mpi_paffinity_alone 1 \
  -mca mpi_leave_pinned 1 -mca btl_sm_max_send_size 128 \
  -mca coll_tuned_pre_allocate_memory_comm_size_limit 128 \
  -mca btl_openib_cq_size 128 -mca btl_ofud_rd_num 128 \
  -mca mpool_rdma_rcache_size_limit 131072 -mca mpi_preconnect_mpi 0 \
  -mca mpool_sm_min_size 131072 -mca mpi_abort_print_stack 1 \
  -mca btl sm,openib,self -mca btl_openib_want_fork_support 0 \
  -mca opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1 \
  -mca osc_rdma_no_locks 1 \
  $PBS_JOBDIR/phmc_tm_p2.$PBS_JOBID -v -f $Jinput".

OpenIB info:

1) OFED-1.4.1, installed by SGI

2) Linux xx 2.6.16.60-0.42.10-smp #1 SMP Tue Apr 27 05:11:27 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
OS : SGI ProPack 6SP5 for Linux, Build 605r1.sles10-0909302200

3) Running most probably an SGI subnet manager

4) > ibv_devinfo (on a worker node)
hca_id: mlx4_0
  fw_ver: 2.7.000
  node_guid: 0030:48ff:ffcc:4c44
  sys_image_guid: 0030:48ff:ffcc:4c47
  vendor_id: 0x02c9
  vendor_part_id: 26418
  hw_ver: 0xA0
  board_id: SM_207101000
  phys_port_cnt: 2
  port: 1
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 1
    port_lid: 6009
    port_lmc: 0x00
  port: 2
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 1
    port_lid: 6010
    port_lmc: 0x00

5) > ifconfig -a (on a worker node)
eth0  Link encap:Ethernet HWaddr 00:30:48:CE:73:30
      inet adr:192.168.159.10 Bcast:192.168.159.255 Masque:255.255.255.0
      adr inet6: fe80::230:48ff:fece:7330/64 Scope:Lien
      UP BROADCAST NOTRAILERS RUNNING MULTICAST MTU:1500 Metric:1
      RX packets:32337499 errors:0 dropped:0 overruns:0 frame:0
      TX packets:34733462 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000
      RX bytes:11486224753 (10954.1 Mb) TX bytes:16450996864 (15688.8 Mb)
      Mémoire:fbc6-fbc8

eth1  Link encap:Ethernet HWaddr 00:30:48:CE:73:31
      BROADCAST MULTICAST MTU:1500 Metric:1
      RX packets:0 errors:0 dropped:0 overruns:0 frame:0
      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000
      RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
      Mémoire:fbce-fbd0

ib0   Link encap:UNSPEC HWaddr 80-00-00-48-FE-C0-00-00-00-00-00-00-00-00-00-00
      inet adr:10.148.9.198 Bcast:10.148.255.255 Masque:255.255.0.0 adr
Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance
Are your processes placed the same way with the two MPI implementations? Per-node vs. per-core ? george.

On Dec 20, 2010, at 11:14 , Gilbert Grosdidier wrote:
> Bonjour,
> I am now at a loss with my running of OpenMPI (namely 1.4.3) on a SGI Altix cluster with 2048 or 4096 cores, running over Infiniband.
> [...]