I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” setting. So why would you expect different results?
> On Mar 27, 2017, at 3:52 AM, Jordi Guitart <jordi.guit...@bsc.es> wrote:
>
> Hi Ben,
>
> Thanks for your feedback. As described here (https://www.open-mpi.org/faq/?category=running#oversubscribing), OpenMPI detects that I'm oversubscribing and runs in degraded mode (yielding the processor). Anyway, I repeated the experiments setting the yield flag explicitly, and I obtained the same weird results:
>
> $HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
> $HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93
>
> Given these results, it seems that spin-waiting is not causing the issue. I also agree that this should not be caused by HyperThreading, given that 0-27 correspond to single HW threads on distinct cores, as shown in the following output returned by the lstopo command:
>
> Machine (128GB total)
>   NUMANode L#0 (P#0 64GB)
>     Package L#0 + L3 L#0 (35MB)
>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>         PU L#0 (P#0)
>         PU L#1 (P#28)
>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>         PU L#2 (P#1)
>         PU L#3 (P#29)
>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>         PU L#4 (P#2)
>         PU L#5 (P#30)
>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>         PU L#6 (P#3)
>         PU L#7 (P#31)
>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>         PU L#8 (P#4)
>         PU L#9 (P#32)
>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>         PU L#10 (P#5)
>         PU L#11 (P#33)
>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>         PU L#12 (P#6)
>         PU L#13 (P#34)
>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>         PU L#14 (P#7)
>         PU L#15 (P#35)
>       L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>         PU L#16 (P#8)
>         PU L#17 (P#36)
>       L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>         PU L#18 (P#9)
>         PU L#19 (P#37)
>       L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>         PU L#20 (P#10)
>         PU L#21 (P#38)
>       L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>         PU L#22 (P#11)
>         PU L#23 (P#39)
>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>         PU L#24 (P#12)
>         PU L#25 (P#40)
>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>         PU L#26 (P#13)
>         PU L#27 (P#41)
>     HostBridge L#0
>       PCIBridge
>         PCI 8086:24f0
>           Net L#0 "ib0"
>           OpenFabrics L#1 "hfi1_0"
>       PCIBridge
>         PCI 14e4:1665
>           Net L#2 "eno1"
>         PCI 14e4:1665
>           Net L#3 "eno2"
>       PCIBridge
>         PCIBridge
>           PCIBridge
>             PCIBridge
>               PCI 102b:0534
>                 GPU L#4 "card0"
>                 GPU L#5 "controlD64"
>   NUMANode L#1 (P#1 64GB) + Package L#1 + L3 L#1 (35MB)
>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>       PU L#28 (P#14)
>       PU L#29 (P#42)
>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>       PU L#30 (P#15)
>       PU L#31 (P#43)
>     L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
>       PU L#32 (P#16)
>       PU L#33 (P#44)
>     L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
>       PU L#34 (P#17)
>       PU L#35 (P#45)
>     L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
>       PU L#36 (P#18)
>       PU L#37 (P#46)
>     L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
>       PU L#38 (P#19)
>       PU L#39 (P#47)
>     L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
>       PU L#40 (P#20)
>       PU L#41 (P#48)
>     L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
>       PU L#42 (P#21)
>       PU L#43 (P#49)
>     L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
>       PU L#44 (P#22)
>       PU L#45 (P#50)
>     L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
>       PU L#46 (P#23)
>       PU L#47 (P#51)
>     L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
>       PU L#48 (P#24)
>       PU L#49 (P#52)
>     L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
>       PU L#50 (P#25)
>       PU L#51 (P#53)
>     L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
>       PU L#52 (P#26)
>       PU L#53 (P#54)
>     L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
>       PU L#54 (P#27)
>       PU L#55 (P#55)
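[As an aside, the PU-to-core mapping that lstopo prints can also be checked programmatically with hwloc, the library lstopo is built on. Below is a minimal sketch, not part of the original thread, using only standard hwloc calls; build with cc pu_map.c -lhwloc:

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* For each logical processor (PU), print the OS index (the number
     * that taskset -c uses) and the core it belongs to. */
    int npus = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    for (int i = 0; i < npus; i++) {
        hwloc_obj_t pu   = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
        hwloc_obj_t core = hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_CORE, pu);
        printf("PU P#%u -> Core L#%u\n", pu->os_index,
               core ? core->logical_index : pu->os_index);
    }

    hwloc_topology_destroy(topo);
    return 0;
}

On the machine described above, this reports a different core for each of P#0 through P#27, confirming that taskset -c 0-27 covers 28 distinct cores, one hardware thread each.]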
>
> On 26/03/2017 9:37, Ben Menadue wrote:
>> On 26 Mar 2017, at 2:22 am, Jordi Guitart <jordi.guit...@bsc.es> wrote:
>>> However, what is puzzling me is the performance difference between OpenMPI 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later versions) in my experiments with oversubscription, i.e. 82 seconds vs. 111 seconds.
>>
>> You’re oversubscribing while letting the OS migrate individual threads between cores. That taskset will bind each MPI process to the same set of 28 logical CPUs (i.e. hardware threads), so if you’re running 36 ranks there then you must have migration happening. Indeed, even when you only launch 28 MPI ranks you’ll probably still see migration between the cores, but likely a lot less. As soon as you oversubscribe and spin-wait rather than yield, though, you become very sensitive to small changes in behaviour: any minor change in OpenMPI’s behaviour, invisible under normal circumstances, will lead to small changes in how and when the kernel task scheduler runs the tasks, and these can then multiply dramatically when the tasks synchronise with each other via e.g. MPI calls.
>>
>> Just as a purely hypothetical example, the newer versions might spin-wait in a slightly tighter loop, and this might make the Linux task scheduler less likely to switch between waiting threads. This delay in switching tasks could appear as increased latency in any synchronising MPI call. But this is very speculative; it would be very hard to draw any conclusion about what’s happening without a clear causative change in the code.
>>
>> Try adding "--mca mpi_yield_when_idle 1" to your mpirun command. This will make OpenMPI issue a sched_yield when waiting instead of spin-waiting constantly. While it’s a performance hit when exactly- or under-subscribing, I can see it helping a bit when there’s contention for the cores from over-subscribing. In particular, a call to sched_yield relinquishes the rest of that process’s current time slice and allows the task scheduler to run another waiting task (i.e. another of your MPI ranks) in its place.
>>
>> So in fact this has nothing to do with HyperThreading, assuming 0 through 27 each correspond to a single hardware thread on 28 distinct cores. Just keep in mind that this might not always be the case: we have at least one platform where the logical processor number enumerates the hardware threads before the cores, so 0 to (n-1) are the n threads of the first core, n to (2n-1) are those of the second core, and so on.
>>
>> Cheers,
>> Ben