I’m confused: mpi_yield_when_idle=1 is precisely the setting that the “oversubscribed” (degraded) mode enables automatically. So why would you expect different results?

> On Mar 27, 2017, at 3:52 AM, Jordi Guitart <jordi.guit...@bsc.es> wrote:
> 
> Hi Ben,
> 
> Thanks for your feedback. As described here 
> (https://www.open-mpi.org/faq/?category=running#oversubscribing), OpenMPI 
> detects that I'm oversubscribing and runs in degraded mode (yielding the 
> processor). Anyway, I repeated the experiments setting explicitly the 
> yielding flag, and I obtained the same weird results:
> 
> $HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 
> taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
> $HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 
> taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93
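> 
> To double-check that both builds actually recognise the flag, something like 
> the following (using the ompi_info binary that matches each installation) 
> should list the parameter and its default value:
> 
> $HOME/openmpi-bin-1.10.1/bin/ompi_info --all | grep yield_when_idle
> $HOME/openmpi-bin-1.10.2/bin/ompi_info --all | grep yield_when_idle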
> 
> Given these results, it seems that spin-waiting is not causing the issue. I 
> also agree that this should not be caused by HyperThreading, given that 0-27 
> correspond to single HW threads on distinct cores, as shown in the following 
> output returned by the lstopo command:
> 
> Machine (128GB total)
>   NUMANode L#0 (P#0 64GB)
>     Package L#0 + L3 L#0 (35MB)
>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>         PU L#0 (P#0)
>         PU L#1 (P#28)
>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>         PU L#2 (P#1)
>         PU L#3 (P#29)
>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>         PU L#4 (P#2)
>         PU L#5 (P#30)
>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>         PU L#6 (P#3)
>         PU L#7 (P#31)
>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>         PU L#8 (P#4)
>         PU L#9 (P#32)
>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>         PU L#10 (P#5)
>         PU L#11 (P#33)
>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>         PU L#12 (P#6)
>         PU L#13 (P#34)
>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>         PU L#14 (P#7)
>         PU L#15 (P#35)
>       L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>         PU L#16 (P#8)
>         PU L#17 (P#36)
>       L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>         PU L#18 (P#9)
>         PU L#19 (P#37)
>       L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>         PU L#20 (P#10)
>         PU L#21 (P#38)
>       L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>         PU L#22 (P#11)
>         PU L#23 (P#39)
>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>         PU L#24 (P#12)
>         PU L#25 (P#40)
>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>         PU L#26 (P#13)
>         PU L#27 (P#41)
>     HostBridge L#0
>       PCIBridge
>         PCI 8086:24f0
>           Net L#0 "ib0"
>           OpenFabrics L#1 "hfi1_0"
>       PCIBridge
>         PCI 14e4:1665
>           Net L#2 "eno1"
>         PCI 14e4:1665
>           Net L#3 "eno2"
>       PCIBridge
>         PCIBridge
>           PCIBridge
>             PCIBridge
>               PCI 102b:0534
>                 GPU L#4 "card0"
>                 GPU L#5 "controlD64"
>   NUMANode L#1 (P#1 64GB) + Package L#1 + L3 L#1 (35MB)
>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>       PU L#28 (P#14)
>       PU L#29 (P#42)
>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>       PU L#30 (P#15)
>       PU L#31 (P#43)
>     L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
>       PU L#32 (P#16)
>       PU L#33 (P#44)
>     L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
>       PU L#34 (P#17)
>       PU L#35 (P#45)
>     L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
>       PU L#36 (P#18)
>       PU L#37 (P#46)
>     L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
>       PU L#38 (P#19)
>       PU L#39 (P#47)
>     L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
>       PU L#40 (P#20)
>       PU L#41 (P#48)
>     L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
>       PU L#42 (P#21)
>       PU L#43 (P#49)
>     L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
>       PU L#44 (P#22)
>       PU L#45 (P#50)
>     L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
>       PU L#46 (P#23)
>       PU L#47 (P#51)
>     L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
>       PU L#48 (P#24)
>       PU L#49 (P#52)
>     L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
>       PU L#50 (P#25)
>       PU L#51 (P#53)
>     L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
>       PU L#52 (P#26)
>       PU L#53 (P#54)
>     L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
>       PU L#54 (P#27)
>       PU L#55 (P#55)
> 
> On 26/03/2017 9:37, Ben Menadue wrote:
>> On 26 Mar 2017, at 2:22 am, Jordi Guitart <jordi.guit...@bsc.es> wrote:
>>> However, what is puzzling me is the performance difference between OpenMPI 
>>> 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later versions) in my 
>>> experiments with oversubscription, i.e. 82 seconds vs. 111 seconds.
>> 
>> You’re oversubscribing while letting the OS migrate individual threads 
>> between cores. That taskset will bind each MPI process to the same set of 28 
>> logical CPUs (i.e. hardware threads), so if you’re running 36 ranks on it 
>> then you must have migration happening. Indeed, even when you only launch 28 
>> MPI ranks you’ll probably still see some migration between the cores, though 
>> likely a lot less. But as soon as you oversubscribe and spin-wait rather 
>> than yield, you become very sensitive to small changes in behaviour: any 
>> minor change in OpenMPI’s behaviour, while invisible under normal 
>> circumstances, will change how and when the kernel task scheduler runs the 
>> tasks, and that effect can compound dramatically once the tasks synchronise 
>> with each other via e.g. MPI calls.
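>> 
>> A crude way to see that migration in action is to watch which logical CPU 
>> each rank is currently sitting on while the benchmark runs, e.g. something 
>> like (the process name here is just the benchmark binary from your command 
>> line):
>> 
>> # PSR is the processor each process last ran on; repeat to watch it change
>> watch -n 1 "ps -o pid,psr,pcpu,comm -C bt.C.36"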
>> 
>> Just as a purely hypothetical example, the newer versions might spin-wait in 
>> a slightly tighter loop and this might make the Linux task scheduler less 
>> likely to switch between waiting threads. This delay in switching tasks 
>> could appear as increased latency in any synchronising MPI call. But this is 
>> very speculative — it would be very hard to draw any conclusion about what’s 
>> happening if there’s no clear causative change in the code.
>> 
>> Try adding "--mca mpi_yield_when_idle 1" to your mpirun command. This will 
>> make OpenMPI issue a sched_yield when waiting instead of spin-waiting 
>> constantly. While it’s a performance hit when exactly- or under-subscribing, 
>> I can see it helping a bit when there’s contention for the cores from 
>> over-subscribing. In particular, a call to sched_yield relinquishes the rest 
>> of that process's current time slice and allows the task scheduler to run 
>> another waiting task (i.e. another of your MPI ranks) in its place.
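>> 
>> If you want to confirm that the ranks really are yielding once the flag is 
>> set, attaching strace to one rank for a few seconds and counting its 
>> sched_yield calls is a cheap check (sketch; substitute the PID of one of 
>> the bt.C.36 processes, and stop it with Ctrl-C to get the summary):
>> 
>> strace -c -e trace=sched_yield -p <pid-of-one-rank>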
>> 
>> So in fact this has nothing to do with HyperThreading — assuming 0 through 
>> 27 correspond to a single hardware thread on 28 distinct cores. Just keep in 
>> mind that this might not always be the case: we have at least one platform 
>> where the logical processor numbering enumerates the hardware threads before 
>> cores, so 0 to (n-1) are the n threads of the first core, n to (2n-1) are 
>> those of the second, and so on.
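>> 
>> An easy way to check which scheme a given machine uses before building a 
>> taskset mask is something like:
>> 
>> # one line per logical CPU, showing which core and socket it belongs to
>> lscpu -e=CPU,CORE,SOCKET
>> # or ask hwloc directly for the physical PU indices
>> lstopo -p --only pu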
>> 
>> Cheers,
>> Ben
>> 
>> 
