If you get a chance, you might test this patch:

https://github.com/open-mpi/ompi-release/pull/656

I think it will resolve the problem you mentioned, and is small enough to go 
into 1.10.1
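
In case it saves you a step: appending .diff to the PR URL gives the raw patch 
(a standard GitHub feature), so something along these lines should let you try 
it out - the branch name and install prefix below are only examples, adjust 
them to your setup:

git clone -b v1.10 https://github.com/open-mpi/ompi-release.git
cd ompi-release
curl -L https://github.com/open-mpi/ompi-release/pull/656.diff | patch -p1
./autogen.pl && ./configure --prefix=$HOME/ompi-test --enable-debug
make install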

Ralph


> On Oct 8, 2015, at 12:36 PM, marcin.krotkiewski 
> <marcin.krotkiew...@gmail.com> wrote:
> 
> Sorry, I think I confused one thing:
> 
> On 10/08/2015 09:15 PM, marcin.krotkiewski wrote:
>> 
>> For version 1.10.1rc1 and up the situation is a bit different: it seems that 
>> in many cases all cores are present in the cpuset, but the binding simply does 
>> not take place - processes are instead bound to all cores allocated by SLURM. 
>> In other scenarios, as discussed before, some cores are over/under-subscribed. 
>> Again, this happens quietly.
> 
> The problem here was in fact a failure to run, with an error message, not 
> under/over-subscription. Sorry for this - I wanted to cover too much at the 
> same time.
> 
> Marcin
> 
> 
> 
> 
>> 
>> In all cases the --hetero-nodes switch is needed. If I apply the patch that 
>> Gilles has posted, that switch seems to be enough for 1.10.1rc1 and up. The 
>> switch is not enough for earlier versions of OpenMPI, and one needs 
>> --map-by core in addition.
>> 
>> Given all that, I think some sort of fix would be in order soon. I agree with 
>> Ralph that, to address this issue quickly, a simplified fix would be a good 
>> choice. As Ralph has already pointed out (or at least how I understood it :) 
>> this would essentially involve activating --hetero-nodes by default, and 
>> using --map-by core in cases where the architecture is not homogeneous. 
>> Uncovering the warning, so that the failure to bind is not silent, is the last 
>> piece of the puzzle. Maybe adding a sanity check to make sure all allocated 
>> resources are in use would be helpful - if not by default, then maybe behind 
>> some flag.
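>> 
>> For reference, the invocation that gives me correct bindings across the 
>> versions I tested is along the lines of (with --map-by core being needed 
>> only for the pre-1.10.1 versions):
>> 
>> mpirun --hetero-nodes --map-by core --bind-to core --report-bindings -np 32 ./affinity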
>> 
>> Does all this make sense?
>> 
>> Again, thank you all for your help,
>> 
>> Marcin
>> 
>> 
>> 
>> 
>> 
>> On 10/07/2015 04:03 PM, Ralph Castain wrote:
>>> I’m a little nervous about this one, Gilles. It’s doing a lot more than 
>>> just addressing the immediate issue, and I’m concerned about potential 
>>> side-effects that we don’t fully uncover prior to release.
>>> 
>>> I’d suggest a two-pronged approach:
>>> 
>>> 1. use my alternative method for 1.10.1 to solve the immediate issue. It 
>>> only affects this one, rather unusual, corner-case that was reported here. 
>>> So the impact can be easily contained and won’t impact anything else.
>>> 
>>> 2. push your proposed solution to the master where it can soak for a while 
>>> and give us a chance to fully discover the secondary effects. Removing the 
>>> unused and “not-allowed” cpus from the topology means a substantial scrub 
>>> of the code base in a number of places, and your patch doesn’t really get 
>>> them all. It’s going to take time to ensure everything is working correctly 
>>> again.
>>> 
>>> HTH
>>> Ralph
>>> 
>>>> On Oct 7, 2015, at 4:29 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>> 
>>>> Jeff,
>>>> 
>>>> There are quite a lot of changes, and I have not updated master yet (I need 
>>>> extra pairs of eyes to review this...), so unless you want to make rc2 today 
>>>> and rc3 a week later, it is IMHO much safer to wait for v1.10.2.
>>>> 
>>>> Ralph,
>>>> any thoughts ?
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> On Wednesday, October 7, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>> Is this something that needs to go into v1.10.1?
>>>> 
>>>> If so, a PR needs to be filed ASAP. We were supposed to make the next 
>>>> 1.10.1 RC yesterday, but it slipped to today due to some last-second patches.
>>>> 
>>>> 
>>>> > On Oct 7, 2015, at 4:32 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>> >
>>>> > Marcin,
>>>> >
>>>> > here is a patch for master; hopefully it fixes all the issues we discussed.
>>>> > I will make sure it applies cleanly against the latest 1.10 tarball starting 
>>>> > tomorrow.
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Gilles
>>>> >
>>>> >
>>>> > On 10/6/2015 7:22 PM, marcin.krotkiewski wrote:
>>>> >> Gilles,
>>>> >>
>>>> >> Yes, it seemed that all was fine with binding in the patched 1.10.1rc1 
>>>> >> - thank you. I am eagerly waiting for the other patches; let me know and I 
>>>> >> will test them later this week.
>>>> >>
>>>> >> Marcin
>>>> >>
>>>> >>
>>>> >>
>>>> >> On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:
>>>> >>> Marcin,
>>>> >>>
>>>> >>> My understanding is that in this case the patched v1.10.1rc1 is working 
>>>> >>> just fine. Am I right?
>>>> >>>
>>>> >>> I prepared two patches:
>>>> >>> one to remove the warning when binding to one core if only one core is 
>>>> >>> available,
>>>> >>> and another to add a warning if the user asks for a binding policy that 
>>>> >>> makes no sense with the requested mapping policy.
>>>> >>>
>>>> >>> I will hopefully finalize them tomorrow.
>>>> >>>
>>>> >>> Cheers,
>>>> >>>
>>>> >>> Gilles
>>>> >>>
>>>> >>> On Tuesday, October 6, 2015, marcin.krotkiewski 
>>>> >>> <marcin.krotkiew...@gmail.com> wrote:
>>>> >>> Hi, Gilles
>>>> >>>> You mentioned you had one failure with 1.10.1rc1 and -bind-to core.
>>>> >>>> Could you please send the full details (script, allocation and output)?
>>>> >>>> In your slurm script, before invoking mpirun, you can do
>>>> >>>> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep Cpus_allowed_list /proc/self/status
>>>> >>>>
>>>> >>> It was an interactive job allocated with
>>>> >>>
>>>> >>> salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0
>>>> >>>
>>>> >>> The slurm environment is the following
>>>> >>>
>>>> >>> SLURM_JOBID=12714491
>>>> >>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
>>>> >>> SLURM_JOB_ID=12714491
>>>> >>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
>>>> >>> SLURM_JOB_NUM_NODES=7
>>>> >>> SLURM_JOB_PARTITION=normal
>>>> >>> SLURM_MEM_PER_CPU=2048
>>>> >>> SLURM_NNODES=7
>>>> >>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
>>>> >>> SLURM_NODE_ALIASES='(null)'
>>>> >>> SLURM_NPROCS=32
>>>> >>> SLURM_NTASKS=32
>>>> >>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>>> >>> SLURM_SUBMIT_HOST=login-0-1.local
>>>> >>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
>>>> >>>
>>>> >>> The output of the command you asked for is
>>>> >>>
>>>> >>> 0: c1-2.local  Cpus_allowed_list:        1-4,17-20
>>>> >>> 1: c1-4.local  Cpus_allowed_list:        1,15,17,31
>>>> >>> 2: c1-8.local  Cpus_allowed_list:        0,5,9,13-14,16,21,25,29-30
>>>> >>> 3: c1-13.local  Cpus_allowed_list:       3-7,19-23
>>>> >>> 4: c1-16.local  Cpus_allowed_list:       12-15,28-31
>>>> >>> 5: c1-23.local  Cpus_allowed_list:       2-4,8,13-15,18-20,24,29-31
>>>> >>> 6: c1-26.local  Cpus_allowed_list:       1,6,11,13,15,17,22,27,29,31
>>>> >>>
>>>> >>> Running with command
>>>> >>>
>>>> >>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core 
>>>> >>> --report-bindings --map-by socket -np 32 ./affinity
>>>> >>>
>>>> >>> I have attached two output files: one for the original 1.10.1rc1, one 
>>>> >>> for the patched version.
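>>>> >>>
>>>> >>> In case it is useful, the effective binding can also be cross-checked 
>>>> >>> without the test program, e.g. with something along these lines 
>>>> >>> (OMPI_COMM_WORLD_RANK is the rank variable mpirun sets for each process):
>>>> >>>
>>>> >>> mpirun --hetero-nodes --bind-to core --map-by socket -np 32 sh -c 'echo "rank $OMPI_COMM_WORLD_RANK @ $(hostname): $(grep Cpus_allowed_list /proc/self/status)"'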
>>>> >>>
>>>> >>> When I said 'failed in one case' I was not precise. I got an error on 
>>>> >>> node c1-8, which was the first one to have a different number of MPI 
>>>> >>> processes on the two sockets. It would also fail on some later nodes; it 
>>>> >>> is just that, because of the error, we never got there.
>>>> >>>
>>>> >>> Let me know if you need more.
>>>> >>>
>>>> >>> Marcin
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>> Cheers,
>>>> >>>>
>>>> >>>> Gilles
>>>> >>>>
>>>> >>>> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
>>>> >>>>> Hi, all,
>>>> >>>>>
>>>> >>>>> I played a bit more and it seems that the problem results from
>>>> >>>>>
>>>> >>>>> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
>>>> >>>>>
>>>> >>>>> called in rmaps_base_binding.c / bind_downwards returning a wrong 
>>>> >>>>> object. I do not know the reason, but I think I know when the problem 
>>>> >>>>> happens (at least on 1.10.1rc1). It seems that by default openmpi maps by 
>>>> >>>>> socket. The error happens when for a given compute node there is a 
>>>> >>>>> different number of cores used on each socket. Consider previously 
>>>> >>>>> studied case (the debug outputs I sent in last post). c1-8, which 
>>>> >>>>> was source of error, has 5 mpi processes assigned, and the cpuset is 
>>>> >>>>> the following:
>>>> >>>>>
>>>> >>>>> 0, 5, 9, 13, 14, 16, 21, 25, 29, 30
>>>> >>>>>
>>>> >>>>> Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding 
>>>> >>>>> progresses correctly up to and including core 13 (see end of file 
>>>> >>>>> out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 
>>>> >>>>> cores on socket 1. The error is thrown when core 14 should be bound - 
>>>> >>>>> an extra core on socket 1 with no corresponding core on socket 0. At 
>>>> >>>>> that point the returned trg_obj points to the first core on the node 
>>>> >>>>> (os_index 0, socket 0).
>>>> >>>>>
>>>> >>>>> I have submitted a few other jobs and I always had an error in such a 
>>>> >>>>> situation. Moreover, if I now use --map-by core instead of socket, 
>>>> >>>>> the error is gone, and I get my expected binding:
>>>> >>>>>
>>>> >>>>> rank 0 @ compute-1-2.local  1, 17,
>>>> >>>>> rank 1 @ compute-1-2.local  2, 18,
>>>> >>>>> rank 2 @ compute-1-2.local  3, 19,
>>>> >>>>> rank 3 @ compute-1-2.local  4, 20,
>>>> >>>>> rank 4 @ compute-1-4.local  1, 17,
>>>> >>>>> rank 5 @ compute-1-4.local  15, 31,
>>>> >>>>> rank 6 @ compute-1-8.local  0, 16,
>>>> >>>>> rank 7 @ compute-1-8.local  5, 21,
>>>> >>>>> rank 8 @ compute-1-8.local  9, 25,
>>>> >>>>> rank 9 @ compute-1-8.local  13, 29,
>>>> >>>>> rank 10 @ compute-1-8.local  14, 30,
>>>> >>>>> rank 11 @ compute-1-13.local  3, 19,
>>>> >>>>> rank 12 @ compute-1-13.local  4, 20,
>>>> >>>>> rank 13 @ compute-1-13.local  5, 21,
>>>> >>>>> rank 14 @ compute-1-13.local  6, 22,
>>>> >>>>> rank 15 @ compute-1-13.local  7, 23,
>>>> >>>>> rank 16 @ compute-1-16.local  12, 28,
>>>> >>>>> rank 17 @ compute-1-16.local  13, 29,
>>>> >>>>> rank 18 @ compute-1-16.local  14, 30,
>>>> >>>>> rank 19 @ compute-1-16.local  15, 31,
>>>> >>>>> rank 20 @ compute-1-23.local  2, 18,
>>>> >>>>> rank 29 @ compute-1-26.local  11, 27,
>>>> >>>>> rank 21 @ compute-1-23.local  3, 19,
>>>> >>>>> rank 30 @ compute-1-26.local  13, 29,
>>>> >>>>> rank 22 @ compute-1-23.local  4, 20,
>>>> >>>>> rank 31 @ compute-1-26.local  15, 31,
>>>> >>>>> rank 23 @ compute-1-23.local  8, 24,
>>>> >>>>> rank 27 @ compute-1-26.local  1, 17,
>>>> >>>>> rank 24 @ compute-1-23.local  13, 29,
>>>> >>>>> rank 28 @ compute-1-26.local  6, 22,
>>>> >>>>> rank 25 @ compute-1-23.local  14, 30,
>>>> >>>>> rank 26 @ compute-1-23.local  15, 31,
>>>> >>>>>
>>>> >>>>> Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 
>>>> >>>>> 1.10.1rc1. However, there is still a difference in behavior between 
>>>> >>>>> 1.10.1rc1 and earlier versions. In the SLURM job described in the last 
>>>> >>>>> post, 1.10.1rc1 fails to bind in only 1 case, while the earlier 
>>>> >>>>> versions fail in 21 out of 32 cases. You mentioned there was a bug 
>>>> >>>>> in hwloc. Not sure if it can explain the difference in behavior.
>>>> >>>>>
>>>> >>>>> Hope this helps to nail this down.
>>>> >>>>>
>>>> >>>>> Marcin
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
>>>> >>>>>> Ralph,
>>>> >>>>>>
>>>> >>>>>> I suspect ompi tries to bind to threads outside the cpuset.
>>>> >>>>>> This could be pretty similar to a previous issue in which ompi tried to 
>>>> >>>>>> bind to cores outside the cpuset.
>>>> >>>>>> /* when a core has more than one thread, would ompi assume all the 
>>>> >>>>>> threads are available if the core is available? */
>>>> >>>>>> I will investigate this starting tomorrow.
>>>> >>>>>>
>>>> >>>>>> Cheers,
>>>> >>>>>>
>>>> >>>>>> Gilles
>>>> >>>>>>
>>>> >>>>>> On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:
>>>> >>>>>> Thanks - please go ahead and release that allocation as I’m not 
>>>> >>>>>> going to get to this immediately. I’ve got several hot irons in the 
>>>> >>>>>> fire right now, and I’m not sure when I’ll get a chance to track 
>>>> >>>>>> this down.
>>>> >>>>>>
>>>> >>>>>> Gilles or anyone else who might have time - feel free to take a 
>>>> >>>>>> gander and see if something pops out at you.
>>>> >>>>>>
>>>> >>>>>> Ralph
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and 
>>>> >>>>>>> executed
>>>> >>>>>>>
>>>> >>>>>>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes 
>>>> >>>>>>> --report-bindings --bind-to core -np 32 ./affinity
>>>> >>>>>>>
>>>> >>>>>>> In the case of 1.10.1rc1 I have also added :overload-allowed - the 
>>>> >>>>>>> output is in a separate file. This option did not make much difference 
>>>> >>>>>>> for 1.10.0, so I did not attach it here.
>>>> >>>>>>>
>>>> >>>>>>> The first thing I noted for 1.10.0 is lines like
>>>> >>>>>>>
>>>> >>>>>>> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
>>>> >>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
>>>> >>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 
>>>> >>>>>>> IS NOT BOUND
>>>> >>>>>>>
>>>> >>>>>>> with an empty BITMAP.
>>>> >>>>>>>
>>>> >>>>>>> The SLURM environment is
>>>> >>>>>>>
>>>> >>>>>>> set | grep SLURM
>>>> >>>>>>> SLURM_JOBID=12714491
>>>> >>>>>>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
>>>> >>>>>>> SLURM_JOB_ID=12714491
>>>> >>>>>>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
>>>> >>>>>>> SLURM_JOB_NUM_NODES=7
>>>> >>>>>>> SLURM_JOB_PARTITION=normal
>>>> >>>>>>> SLURM_MEM_PER_CPU=2048
>>>> >>>>>>> SLURM_NNODES=7
>>>> >>>>>>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
>>>> >>>>>>> SLURM_NODE_ALIASES='(null)'
>>>> >>>>>>> SLURM_NPROCS=32
>>>> >>>>>>> SLURM_NTASKS=32
>>>> >>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>>> >>>>>>> SLURM_SUBMIT_HOST=login-0-1.local
>>>> >>>>>>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
>>>> >>>>>>>
>>>> >>>>>>> I have submitted an interactive job (under screen) for 120 hours now, 
>>>> >>>>>>> so I can work with one example and not change it for every post :)
>>>> >>>>>>>
>>>> >>>>>>> If you need anything else, let me know. I could apply a patch or add 
>>>> >>>>>>> some printfs and recompile, if needed.
>>>> >>>>>>>
>>>> >>>>>>> Marcin
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On 10/03/2015 07:17 PM, Ralph Castain wrote:
>>>> >>>>>>>> Rats - I just realized I have no way to test this, as none of the 
>>>> >>>>>>>> machines I can access are set up for cgroup-based multi-tenancy. Is 
>>>> >>>>>>>> this a debug version of OMPI? If not, can you rebuild OMPI with 
>>>> >>>>>>>> --enable-debug?
>>>> >>>>>>>>
>>>> >>>>>>>> Then please run it with --mca rmaps_base_verbose 10 and pass along 
>>>> >>>>>>>> the output.
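>>>> >>>>>>>>
>>>> >>>>>>>> Something along these lines should do it (the prefix is just an example):
>>>> >>>>>>>>
>>>> >>>>>>>> ./configure --prefix=$HOME/ompi-debug --enable-debug && make install
>>>> >>>>>>>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity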
>>>> >>>>>>>>
>>>> >>>>>>>> Thanks
>>>> >>>>>>>> Ralph
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>> On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> What version of slurm is this? I might try to debug it here. I’m 
>>>> >>>>>>>>> not sure where the problem lies just yet.
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Here is the output of lstopo. In short, (0,16) are core 0, 
>>>> >>>>>>>>>> (1,17) - core 1 etc.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Machine (64GB)
>>>> >>>>>>>>>>   NUMANode L#0 (P#0 32GB)
>>>> >>>>>>>>>>     Socket L#0 + L3 L#0 (20MB)
>>>> >>>>>>>>>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core 
>>>> >>>>>>>>>> L#0
>>>> >>>>>>>>>>         PU L#0 (P#0)
>>>> >>>>>>>>>>         PU L#1 (P#16)
>>>> >>>>>>>>>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core 
>>>> >>>>>>>>>> L#1
>>>> >>>>>>>>>>         PU L#2 (P#1)
>>>> >>>>>>>>>>         PU L#3 (P#17)
>>>> >>>>>>>>>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core 
>>>> >>>>>>>>>> L#2
>>>> >>>>>>>>>>         PU L#4 (P#2)
>>>> >>>>>>>>>>         PU L#5 (P#18)
>>>> >>>>>>>>>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core 
>>>> >>>>>>>>>> L#3
>>>> >>>>>>>>>>         PU L#6 (P#3)
>>>> >>>>>>>>>>         PU L#7 (P#19)
>>>> >>>>>>>>>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core 
>>>> >>>>>>>>>> L#4
>>>> >>>>>>>>>>         PU L#8 (P#4)
>>>> >>>>>>>>>>         PU L#9 (P#20)
>>>> >>>>>>>>>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core 
>>>> >>>>>>>>>> L#5
>>>> >>>>>>>>>>         PU L#10 (P#5)
>>>> >>>>>>>>>>         PU L#11 (P#21)
>>>> >>>>>>>>>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core 
>>>> >>>>>>>>>> L#6
>>>> >>>>>>>>>>         PU L#12 (P#6)
>>>> >>>>>>>>>>         PU L#13 (P#22)
>>>> >>>>>>>>>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core 
>>>> >>>>>>>>>> L#7
>>>> >>>>>>>>>>         PU L#14 (P#7)
>>>> >>>>>>>>>>         PU L#15 (P#23)
>>>> >>>>>>>>>>     HostBridge L#0
>>>> >>>>>>>>>>       PCIBridge
>>>> >>>>>>>>>>         PCI 8086:1521
>>>> >>>>>>>>>>           Net L#0 "eth0"
>>>> >>>>>>>>>>         PCI 8086:1521
>>>> >>>>>>>>>>           Net L#1 "eth1"
>>>> >>>>>>>>>>       PCIBridge
>>>> >>>>>>>>>>         PCI 15b3:1003
>>>> >>>>>>>>>>           Net L#2 "ib0"
>>>> >>>>>>>>>>           OpenFabrics L#3 "mlx4_0"
>>>> >>>>>>>>>>       PCIBridge
>>>> >>>>>>>>>>         PCI 102b:0532
>>>> >>>>>>>>>>       PCI 8086:1d02
>>>> >>>>>>>>>>         Block L#4 "sda"
>>>> >>>>>>>>>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>>> >>>>>>>>>>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>>> >>>>>>>>>>       PU L#16 (P#8)
>>>> >>>>>>>>>>       PU L#17 (P#24)
>>>> >>>>>>>>>>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>>> >>>>>>>>>>       PU L#18 (P#9)
>>>> >>>>>>>>>>       PU L#19 (P#25)
>>>> >>>>>>>>>>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core 
>>>> >>>>>>>>>> L#10
>>>> >>>>>>>>>>       PU L#20 (P#10)
>>>> >>>>>>>>>>       PU L#21 (P#26)
>>>> >>>>>>>>>>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core 
>>>> >>>>>>>>>> L#11
>>>> >>>>>>>>>>       PU L#22 (P#11)
>>>> >>>>>>>>>>       PU L#23 (P#27)
>>>> >>>>>>>>>>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core 
>>>> >>>>>>>>>> L#12
>>>> >>>>>>>>>>       PU L#24 (P#12)
>>>> >>>>>>>>>>       PU L#25 (P#28)
>>>> >>>>>>>>>>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core 
>>>> >>>>>>>>>> L#13
>>>> >>>>>>>>>>       PU L#26 (P#13)
>>>> >>>>>>>>>>       PU L#27 (P#29)
>>>> >>>>>>>>>>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core 
>>>> >>>>>>>>>> L#14
>>>> >>>>>>>>>>       PU L#28 (P#14)
>>>> >>>>>>>>>>       PU L#29 (P#30)
>>>> >>>>>>>>>>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core 
>>>> >>>>>>>>>> L#15
>>>> >>>>>>>>>>       PU L#30 (P#15)
>>>> >>>>>>>>>>       PU L#31 (P#31)
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>>>> >>>>>>>>>>> Maybe I’m just misreading your HT map - that slurm nodelist 
>>>> >>>>>>>>>>> syntax is a new one to me, but they tend to change things 
>>>> >>>>>>>>>>> around. Could you run lstopo on one of those compute nodes and 
>>>> >>>>>>>>>>> send the output?
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> I’m just suspicious because I’m not seeing a clear pairing of 
>>>> >>>>>>>>>>> HT numbers in your output, but HT numbering is BIOS-specific 
>>>> >>>>>>>>>>> and I may just not be understanding your particular pattern. 
>>>> >>>>>>>>>>> Our error message is clearly indicating that we are seeing 
>>>> >>>>>>>>>>> individual HTs (and not complete cores) assigned, and I don’t 
>>>> >>>>>>>>>>> know the source of that confusion.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>>>> >>>>>>>>>>>>> If mpirun isn’t trying to do any binding, then you will of 
>>>> >>>>>>>>>>>>> course get the right mapping as we’ll just inherit whatever 
>>>> >>>>>>>>>>>>> we received.
>>>> >>>>>>>>>>>> Yes. I meant that whatever you received (what SLURM gives) is 
>>>> >>>>>>>>>>>> a correct cpu map and assigns _whole_ CPUs, not single HTs, to 
>>>> >>>>>>>>>>>> MPI processes. In the case mentioned earlier openmpi should 
>>>> >>>>>>>>>>>> start 6 tasks on c1-30. If HTs were treated as separate and 
>>>> >>>>>>>>>>>> independent cores, sched_getaffinity of an MPI process started 
>>>> >>>>>>>>>>>> on c1-30 would return a map with only 6 entries. In my case it 
>>>> >>>>>>>>>>>> returns a map with 12 entries - 2 for each core. So one process 
>>>> >>>>>>>>>>>> is in fact allocated both HTs, not only one. Is what I'm saying 
>>>> >>>>>>>>>>>> correct?
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>> Looking at your output, it’s pretty clear that you are 
>>>> >>>>>>>>>>>>> getting independent HTs assigned and not full cores.
>>>> >>>>>>>>>>>> How do you mean? Is the above understanding wrong? I would 
>>>> >>>>>>>>>>>> expect that on c1-30 with --bind-to core openmpi should bind 
>>>> >>>>>>>>>>>> to logical cores 0 and 16 (rank 0), 1 and 17 (rank 1), and so 
>>>> >>>>>>>>>>>> on. All those logical cores are available in 
>>>> >>>>>>>>>>>> sched_getaffinity map, and there is twice as many logical 
>>>> >>>>>>>>>>>> cores as there are MPI processes started on the node.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>> My guess is that something in slurm has changed such that it 
>>>> >>>>>>>>>>>>> detects that HT has been enabled, and then begins treating 
>>>> >>>>>>>>>>>>> the HTs as completely independent cpus.
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> Try changing “-bind-to core” to “-bind-to hwthread  
>>>> >>>>>>>>>>>>> -use-hwthread-cpus” and see if that works
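>>>> >>>>>>>>>>>>> i.e., something along the lines of:
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> mpirun --hetero-nodes --bind-to hwthread --use-hwthread-cpus --report-bindings -np 32 ./affinity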
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>> I have, and the binding is wrong. For example, I got this 
>>>> >>>>>>>>>>>> output:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> rank 0 @ compute-1-30.local  0,
>>>> >>>>>>>>>>>> rank 1 @ compute-1-30.local  16,
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> This means that two ranks have been bound to the same 
>>>> >>>>>>>>>>>> physical core (logical cores 0 and 16 are two HTs of the same 
>>>> >>>>>>>>>>>> core). If I use --bind-to core, I get the following correct 
>>>> >>>>>>>>>>>> binding
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> rank 0 @ compute-1-30.local  0, 16,
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> The problem is that many other ranks get a bad binding, with the 
>>>> >>>>>>>>>>>> 'rank XXX is not bound (or bound to all available processors)' 
>>>> >>>>>>>>>>>> warning.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> But I think I was not entirely correct saying that 1.10.1rc1 
>>>> >>>>>>>>>>>> did not fix things. It still might have improved something, 
>>>> >>>>>>>>>>>> but not everything. Consider this job:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
>>>> >>>>>>>>>>>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> If I run 32 tasks as follows (with 1.10.1rc1)
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 
>>>> >>>>>>>>>>>> ./affinity
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> I get the following error:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> --------------------------------------------------------------------------
>>>> >>>>>>>>>>>> A request was made to bind to that would result in binding 
>>>> >>>>>>>>>>>> more
>>>> >>>>>>>>>>>> processes than cpus on a resource:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>    Bind to:     CORE
>>>> >>>>>>>>>>>>    Node:        c9-31
>>>> >>>>>>>>>>>>    #processes:  2
>>>> >>>>>>>>>>>>    #cpus:       1
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> You can override this protection by adding the 
>>>> >>>>>>>>>>>> "overload-allowed"
>>>> >>>>>>>>>>>> option to your binding directive.
>>>> >>>>>>>>>>>> --------------------------------------------------------------------------
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> If I now use --bind-to core:overload-allowed, then openmpi 
>>>> >>>>>>>>>>>> starts and _most_ of the threads are bound correctly (i.e., the 
>>>> >>>>>>>>>>>> map contains two logical cores in ALL cases), except for this 
>>>> >>>>>>>>>>>> case, which required the overload flag:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> rank 15 @ compute-9-31.local   1, 17,
>>>> >>>>>>>>>>>> rank 16 @ compute-9-31.local  11, 27,
>>>> >>>>>>>>>>>> rank 17 @ compute-9-31.local   2, 18,
>>>> >>>>>>>>>>>> rank 18 @ compute-9-31.local  12, 28,
>>>> >>>>>>>>>>>> rank 19 @ compute-9-31.local   1, 17,
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Note that the pair (1,17) is used twice. The original SLURM-delivered 
>>>> >>>>>>>>>>>> map (no binding) on this node is:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> rank 15 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 
>>>> >>>>>>>>>>>> 28, 29,
>>>> >>>>>>>>>>>> rank 16 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 
>>>> >>>>>>>>>>>> 28, 29,
>>>> >>>>>>>>>>>> rank 17 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 
>>>> >>>>>>>>>>>> 28, 29,
>>>> >>>>>>>>>>>> rank 18 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 
>>>> >>>>>>>>>>>> 28, 29,
>>>> >>>>>>>>>>>> rank 19 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 
>>>> >>>>>>>>>>>> 28, 29,
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Why does openmpi use core (1,17) twice instead of using core 
>>>> >>>>>>>>>>>> (13,29)? Clearly, the original SLURM-delivered map has 5 CPUs 
>>>> >>>>>>>>>>>> included, enough for 5 MPI processes.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Cheers,
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Marcin
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote:
>>>> >>>>>>>>>>>>>>> Thanks Marcin. Looking at this, I’m guessing that Slurm 
>>>> >>>>>>>>>>>>>>> may be treating HTs as “cores” - i.e., as independent 
>>>> >>>>>>>>>>>>>>> cpus. Any chance that is true?
>>>> >>>>>>>>>>>>>> Not to the best of my knowledge, and at least not 
>>>> >>>>>>>>>>>>>> intentionally. SLURM starts as many processes as there are 
>>>> >>>>>>>>>>>>>> physical cores, not threads. To verify this, consider this 
>>>> >>>>>>>>>>>>>> test case:
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> >>
>>>> >>
>>>> >
>>>> > <heterogeneous_topologies.patch>
>>>> 
>>>> 
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> 
>>> 
>>> 
> 
