Actually, you just confirmed the problem for me. You are correct in that it 
says 4 slots. However, if you then tell us pe=4, we will consume all 4 of those 
slots with the very first process.

What we needed to see was Slurm assigning us 16 slots to correspond to the 16 
cpus. Instead, it is telling us to launch only 4 procs, but to use 16 cpus as 
if they belong to us. This is where the confusion is coming from - it could be 
that something in the Slurm envar syntax changed, or something else did, as I 
seem to recall we handled this okay before (but I could be wrong).
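
To put rough numbers on it, here is the arithmetic only (a bash sketch; it 
assumes we keep consuming one slot per cpu whenever pe is given):

$ echo $((  4 / 4 ))   # slots as currently parsed (4), pe=4 -> 1 mappable proc
1
$ echo $(( 16 / 4 ))   # slots as they should be (16, one per cpu), pe=4 -> 4 procs
4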

Fixing that will take some time that I honestly won't have for a while.
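
In the meantime, one possible interim workaround (an untested sketch, based on 
the allocation style you reported as working earlier in the thread) is to 
request one Slurm task per cpu so that we see 16 slots, and then let pe=4 
group them into ranks:

# ask Slurm for one task per cpu instead of 4 tasks x 4 cpus
$ salloc --ntasks=16 --ntasks-per-node=16
# pe=4 then leaves room for 4 ranks of 4 cores each
$ mpirun -np 4 --map-by slot:pe=4 --report-bindings ./affinity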


> On Oct 9, 2015, at 6:14 AM, Marcin Krotkiewski <marcin.krotkiew...@gmail.com> 
> wrote:
> 
> Ralph,
> 
> Here is the result running
> 
> mpirun --map-by slot:pe=4 -display-allocation ./affinity
> 
> ======================   ALLOCATED NODES   ======================
>    c12-29: slots=4 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> rank 0 @ compute-12-29.local  1, 2, 3, 4, 17, 18, 19, 20,
> 
> I also attach output with --mca rmaps_base_verbose 10. It says 4 slots all 
> over the place, so it is really weird that it does not work.
> 
> Thanks!
> 
> Marcin
> 
> 
> 
> [login-0-1.local:30710] mca: base: components_register: registering rmaps 
> components
> [login-0-1.local:30710] mca: base: components_register: found loaded 
> component round_robin
> [login-0-1.local:30710] mca: base: components_register: component round_robin 
> register function successful
> [login-0-1.local:30710] mca: base: components_register: found loaded 
> component rank_file
> [login-0-1.local:30710] mca: base: components_register: component rank_file 
> register function successful
> [login-0-1.local:30710] mca: base: components_register: found loaded 
> component seq
> [login-0-1.local:30710] mca: base: components_register: component seq 
> register function successful
> [login-0-1.local:30710] mca: base: components_register: found loaded 
> component resilient
> [login-0-1.local:30710] mca: base: components_register: component resilient 
> register function successful
> [login-0-1.local:30710] mca: base: components_register: found loaded 
> component staged
> [login-0-1.local:30710] mca: base: components_register: component staged has 
> no register or open function
> [login-0-1.local:30710] mca: base: components_register: found loaded 
> component mindist
> [login-0-1.local:30710] mca: base: components_register: component mindist 
> register function successful
> [login-0-1.local:30710] mca: base: components_register: found loaded 
> component ppr
> [login-0-1.local:30710] mca: base: components_register: component ppr 
> register function successful
> [login-0-1.local:30710] [[61064,0],0] rmaps:base set policy with slot:pe=4
> [login-0-1.local:30710] [[61064,0],0] rmaps:base policy slot modifiers pe=4 
> provided
> [login-0-1.local:30710] [[61064,0],0] rmaps:base check modifiers with pe=4
> [login-0-1.local:30710] [[61064,0],0] rmaps:base setting pe/rank to 4
> [login-0-1.local:30710] mca: base: components_open: opening rmaps components
> [login-0-1.local:30710] mca: base: components_open: found loaded component 
> round_robin
> [login-0-1.local:30710] mca: base: components_open: component round_robin 
> open function successful
> [login-0-1.local:30710] mca: base: components_open: found loaded component 
> rank_file
> [login-0-1.local:30710] mca: base: components_open: component rank_file open 
> function successful
> [login-0-1.local:30710] mca: base: components_open: found loaded component seq
> [login-0-1.local:30710] mca: base: components_open: component seq open 
> function successful
> [login-0-1.local:30710] mca: base: components_open: found loaded component 
> resilient
> [login-0-1.local:30710] mca: base: components_open: component resilient open 
> function successful
> [login-0-1.local:30710] mca: base: components_open: found loaded component 
> staged
> [login-0-1.local:30710] mca: base: components_open: component staged open 
> function successful
> [login-0-1.local:30710] mca: base: components_open: found loaded component 
> mindist
> [login-0-1.local:30710] mca: base: components_open: component mindist open 
> function successful
> [login-0-1.local:30710] mca: base: components_open: found loaded component ppr
> [login-0-1.local:30710] mca: base: components_open: component ppr open 
> function successful
> [login-0-1.local:30710] mca:rmaps:select: checking available component 
> round_robin
> [login-0-1.local:30710] mca:rmaps:select: Querying component [round_robin]
> [login-0-1.local:30710] mca:rmaps:select: checking available component 
> rank_file
> [login-0-1.local:30710] mca:rmaps:select: Querying component [rank_file]
> [login-0-1.local:30710] mca:rmaps:select: checking available component seq
> [login-0-1.local:30710] mca:rmaps:select: Querying component [seq]
> [login-0-1.local:30710] mca:rmaps:select: checking available component 
> resilient
> [login-0-1.local:30710] mca:rmaps:select: Querying component [resilient]
> [login-0-1.local:30710] mca:rmaps:select: checking available component staged
> [login-0-1.local:30710] mca:rmaps:select: Querying component [staged]
> [login-0-1.local:30710] mca:rmaps:select: checking available component mindist
> [login-0-1.local:30710] mca:rmaps:select: Querying component [mindist]
> [login-0-1.local:30710] mca:rmaps:select: checking available component ppr
> [login-0-1.local:30710] mca:rmaps:select: Querying component [ppr]
> [login-0-1.local:30710] [[61064,0],0]: Final mapper priorities
> [login-0-1.local:30710]     Mapper: ppr Priority: 90
> [login-0-1.local:30710]     Mapper: seq Priority: 60
> [login-0-1.local:30710]     Mapper: resilient Priority: 40
> [login-0-1.local:30710]     Mapper: mindist Priority: 20
> [login-0-1.local:30710]     Mapper: round_robin Priority: 10
> [login-0-1.local:30710]     Mapper: staged Priority: 5
> [login-0-1.local:30710]     Mapper: rank_file Priority: 0
> 
> ======================   ALLOCATED NODES   ======================
>    c12-29: slots=4 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> [login-0-1.local:30710] mca:rmaps: mapping job [61064,1]
> [login-0-1.local:30710] mca:rmaps: creating new map for job [61064,1]
> [login-0-1.local:30710] AVAILABLE NODES FOR MAPPING:
> [login-0-1.local:30710]     node: c12-29 daemon: 1
> [login-0-1.local:30710] mca:rmaps: nprocs 4
> [login-0-1.local:30710] mca:rmaps mapping given - using default
> [login-0-1.local:30710] mca:rmaps:ppr: job [61064,1] not using ppr mapper
> [login-0-1.local:30710] mca:rmaps:seq: job [61064,1] not using seq mapper
> [login-0-1.local:30710] mca:rmaps:resilient: cannot perform initial map of 
> job [61064,1] - no fault groups
> [login-0-1.local:30710] mca:rmaps:mindist: job [61064,1] not using mindist 
> mapper
> [login-0-1.local:30710] mca:rmaps:rr: mapping job [61064,1]
> [login-0-1.local:30710] AVAILABLE NODES FOR MAPPING:
> [login-0-1.local:30710]     node: c12-29 daemon: 1
> [login-0-1.local:30710] mca:rmaps:rr: mapping by slot for job [61064,1] slots 
> 4 num_procs 1
> [login-0-1.local:30710] mca:rmaps:rr:slot working node c12-29
> [login-0-1.local:30710] mca:rmaps:rr:slot assigning 1 procs to node c12-29
> [login-0-1.local:30710] mca:rmaps:base: computing vpids by slot for job 
> [61064,1]
> [login-0-1.local:30710] mca:rmaps:base: assigning rank 0 to node c12-29
> [login-0-1.local:30710] mca:rmaps: compute bindings for job [61064,1] with 
> policy CORE:IF-SUPPORTED[5008]
> [login-0-1.local:30710] [[61064,0],0] reset_usage: node c12-29 has 1 procs on 
> it
> [login-0-1.local:30710] [[61064,0],0] reset_usage: ignoring proc [[61064,1],0]
> [login-0-1.local:30710] [[61064,0],0] bind_depth: 6 map_depth 0
> [login-0-1.local:30710] mca:rmaps: bind downward for job [61064,1] with 
> bindings CORE:IF-SUPPORTED
> [login-0-1.local:30710] [[61064,0],0] GOT 1 CPUS
> [login-0-1.local:30710] [[61064,0],0] GOT 1 CPUS
> [login-0-1.local:30710] [[61064,0],0] GOT 1 CPUS
> [login-0-1.local:30710] [[61064,0],0] GOT 1 CPUS
> [login-0-1.local:30710] [[61064,0],0] PROC [[61064,1],0] BITMAP 0-3,16-19
> [login-0-1.local:30710] [[61064,0],0] BOUND PROC [[61064,1],0][c12-29] TO 
> socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 
> 0-1]], socket 0[core 3[hwt 0-1]]: 
> [BB/BB/BB/BB/../../../..][../../../../../../../..]
> rank 0 @ compute-12-29.local  1, 2, 3, 4, 17, 18, 19, 20,
> [login-0-1.local:30710] mca: base: close: component round_robin closed
> [login-0-1.local:30710] mca: base: close: unloading component round_robin
> [login-0-1.local:30710] mca: base: close: component rank_file closed
> [login-0-1.local:30710] mca: base: close: unloading component rank_file
> [login-0-1.local:30710] mca: base: close: component seq closed
> [login-0-1.local:30710] mca: base: close: unloading component seq
> [login-0-1.local:30710] mca: base: close: component resilient closed
> [login-0-1.local:30710] mca: base: close: unloading component resilient
> [login-0-1.local:30710] mca: base: close: component staged closed
> [login-0-1.local:30710] mca: base: close: unloading component staged
> [login-0-1.local:30710] mca: base: close: component mindist closed
> [login-0-1.local:30710] mca: base: close: unloading component mindist
> [login-0-1.local:30710] mca: base: close: component ppr closed
> [login-0-1.local:30710] mca: base: close: unloading component ppr
> 
> 
> 
> 
> 
> On 10/09/2015 02:07 AM, Ralph Castain wrote:
>> Hi Marcin
>> 
>> Looking again at this: could you get a similar reservation again and rerun 
>> mpirun with “-display-allocation” added to the command line? I’d like to see 
>> if we are correctly parsing the number of slots assigned in the allocation
>> 
>> Ralph
>> 
>>> On Oct 6, 2015, at 11:52 AM, marcin.krotkiewski 
>>> <marcin.krotkiew...@gmail.com> wrote:
>>> 
>>> Thank you both for your suggestion. I still cannot make this work though, 
>>> and I think - as Ralph predicted - most problems are likely related to 
>>> non-homogeneous mapping of cpus to jobs. But there are problems even before 
>>> that part...
>>> 
>>> If I reserve one entire compute node with SLURM:
>>> 
>>> salloc --ntasks=16 --tasks-per-node=16
>>> 
>>> I can run my code as you suggested with _any_ N (including odd numbers!). 
>>> OpenMPI will figure out the maximum number of tasks that fit and launch 
>>> them. This also works for multiple complete nodes, but that is the only case 
>>> in which I managed to get it to work.
>>> 
>>> If I specify cpus per task, also allocating one full node
>>> 
>>> salloc --ntasks=4 --cpus-per-task=4 --tasks-per-node=4
>>> 
>>> things go astray:
>>> 
>>> mpirun --map-by slot:pe=4 ./affinity
>>> rank 0 @ compute-1-6.local  0, 1, 2, 3, 16, 17, 18, 19,
>>> 
>>> Yes, only one MPI process was started. Running what Gilles previously 
>>> suggested:
>>> 
>>> $ srun grep Cpus_allowed_list /proc/self/status
>>> Cpus_allowed_list:    0-31
>>> Cpus_allowed_list:    0-31
>>> Cpus_allowed_list:    0-31
>>> Cpus_allowed_list:    0-31
>>> 
>>> So the allocation seems fine. The SLURM environment is also correct, as far 
>>> as I can tell:
>>> 
>>> SLURM_CPUS_PER_TASK=4
>>> SLURM_JOB_CPUS_PER_NODE=16
>>> SLURM_JOB_NODELIST=c1-6
>>> SLURM_JOB_NUM_NODES=1
>>> SLURM_NNODES=1
>>> SLURM_NODELIST=c1-6
>>> SLURM_NPROCS=4
>>> SLURM_NTASKS=4
>>> SLURM_NTASKS_PER_NODE=4
>>> SLURM_TASKS_PER_NODE=4
>>> 
>>> I do not understand why OpenMPI does not want to start more than 1 process. 
>>> If I try to force it (-n 4), I of course get an error:
>>> 
>>> mpirun --map-by slot:pe=4 -n 4 ./affinity
>>> 
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 4 slots
>>> that were requested by the application:
>>>  ./affinity
>>> 
>>> Either request fewer slots for your application, or make more slots 
>>> available
>>> for use.
>>> --------------------------------------------------------------------------
>>> 
>>> 
>>> For clarity, I will not describe other cases / non-contiguous cpu sets / 
>>> heterogeneous nodes. Clearly something is wrong already with the simple 
>>> ones..
>>> 
>>> Does anyone have any ideas? Should I record some logs to see what's going 
>>> on?
>>> 
>>> Thanks a lot!
>>> 
>>> Marcin
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 10/06/2015 01:04 AM, tmish...@jcity.maeda.co.jp wrote:
>>>> Hi Ralph, it's been a long time.
>>>> 
>>>> The option "map-by core" does not work when pe=N > 1 is specified.
>>>> So, you should use "map-by slot:pe=N" as far as I remember.
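>>>> 
>>>> For example (just a command sketch; the process count and binary name are 
>>>> placeholders):
>>>> 
>>>>   mpirun -np 2 --map-by slot:pe=4 --report-bindings ./a.out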
>>>> 
>>>> Regards,
>>>> Tetsuya Mishima
>>>> 
>>>> On 2015/10/06 5:40:33, "users" wrote in "Re: [OMPI users] Hybrid OpenMPI+OpenMP
>>>> tasks using SLURM":
>>>>> Hmmm…okay, try -map-by socket:pe=4
>>>>> 
>>>>> We’ll still hit the asymmetric topology issue, but otherwise this should
>>>> work
>>>>>> On Oct 5, 2015, at 1:25 PM, marcin.krotkiewski
>>>> <marcin.krotkiew...@gmail.com> wrote:
>>>>>> Ralph,
>>>>>> 
>>>>>> Thank you for a fast response! Sounds very good, unfortunately I get an
>>>> error:
>>>>>> $ mpirun --map-by core:pe=4 ./affinity
>>>>>> 
>>>> --------------------------------------------------------------------------
>>>>>> A request for multiple cpus-per-proc was given, but a directive
>>>>>> was also give to map to an object level that cannot support that
>>>>>> directive.
>>>>>> 
>>>>>> Please specify a mapping level that has more than one cpu, or
>>>>>> else let us define a default mapping that will allow multiple
>>>>>> cpus-per-proc.
>>>>>> 
>>>> --------------------------------------------------------------------------
>>>>>> I have allocated my slurm job as
>>>>>> 
>>>>>> salloc --ntasks=2 --cpus-per-task=4
>>>>>> 
>>>>>> I have checked in 1.10.0 and 1.10.1rc1.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 10/05/2015 09:58 PM, Ralph Castain wrote:
>>>>>>> You would presently do:
>>>>>>> 
>>>>>>> mpirun --map-by core:pe=4
>>>>>>> 
>>>>>>> to get what you are seeking. If we don’t already set that qualifier
>>>> when we see “cpus_per_task”, then we probably should do so as there isn’t
>>>> any reason to make you set it twice (well, other than
>>>>> trying to track which envar slurm is using now).
>>>>>>>> On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski
>>>> <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>> Yet another question about cpu binding under SLURM environment..
>>>>>>>> 
>>>>>>>> Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the
>>>> purpose of cpu binding?
>>>>>>>> Full version: When you allocate a job like, e.g., this
>>>>>>>> 
>>>>>>>> salloc --ntasks=2 --cpus-per-task=4
>>>>>>>> 
>>>>>>>> SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI task.
>>>> This is useful for hybrid jobs, where each MPI process spawns some internal
>>>> worker threads (e.g., OpenMP). The intention is
>>>>> that there are 2 MPI procs started, each of them 'bound' to 4 cores.
>>>> SLURM will also set an environment variable
>>>>>>>> SLURM_CPUS_PER_TASK=4
>>>>>>>> 
>>>>>>>> which should (probably?) be taken into account by the method that
>>>> launches the MPI processes to figure out the cpuset. In the case of OpenMPI +
>>>> mpirun I think something should happen in
>>>>> orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ actually
>>>> parsed. Unfortunately, it is never really used...
>>>>>>>> As a result, the cpuset of all tasks started on a given compute node
>>>> includes all CPU cores of all MPI tasks on that node, just as provided by
>>>> SLURM (in the above example - 8). In general, there is
>>>>> no simple way for the user code in the MPI procs to 'split' the cores
>>>> between themselves. I imagine the original intention to support this in
>>>> OpenMPI was something like
>>>>>>>> mpirun --bind-to subtask_cpuset
>>>>>>>> 
>>>>>>>> with an artificial bind target that would cause OpenMPI to divide the
>>>> allocated cores between the mpi tasks. Is this right? If so, it seems that
>>>>> at this point this is not implemented. Are there
>>>>> plans to do this? If not, does anyone know another way to achieve that?
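>>>>>>>> 
>>>>>>>> For illustration, I imagine a wrapper along these lines (just an untested
>>>>>>>> sketch: it assumes the allocated cpus are numbered contiguously from 0 and
>>>>>>>> relies on the OMPI_COMM_WORLD_LOCAL_RANK variable that OpenMPI exports to
>>>>>>>> each launched process):
>>>>>>>> 
>>>>>>>> #!/bin/bash
>>>>>>>> # wrap.sh - give each local rank its own SLURM_CPUS_PER_TASK-wide cpu slice
>>>>>>>> CPT=${SLURM_CPUS_PER_TASK:-4}
>>>>>>>> FIRST=$(( OMPI_COMM_WORLD_LOCAL_RANK * CPT ))
>>>>>>>> LAST=$(( FIRST + CPT - 1 ))
>>>>>>>> exec numactl --physcpubind=${FIRST}-${LAST} "$@"
>>>>>>>> 
>>>>>>>> # launched with binding disabled so the wrapper can set its own affinity:
>>>>>>>> mpirun -np 4 --bind-to none ./wrap.sh ./affinity
>>>>>>>> 
>>>>>>>> but of course this breaks down for non-contiguous or heterogeneous cpu sets.
>>>>>>>> 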
>>>>>>>> Thanks a lot!
>>>>>>>> 
>>>>>>>> Marcin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 