Matt, I'm pretty confident in saying this is entirely in Intel MPI land:
aknister@borgj157:~> I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=enable mpiexec.hydra -np 48 -ppn 24 -print-rank-map /bin/true
(borgj157:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj164:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)

aknister@borgj157:~> I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=disable mpiexec.hydra -np 48 -ppn 24 -print-rank-map /bin/true
(borgj157:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23)
(borgj164:24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)

However, if a machinefile argument is passed to mpiexec.hydra (which mpirun does by default), the I_MPI_JOB_RESPECT_PROCESS_PLACEMENT variable isn't respected (see below). Maybe we need an I_MPI_JOB_RESPECT_I_MPI_JOB_RESPECT_PROCESS_PLACEMENT_VARIABLE variable.

aknister@borgj157:~> I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=enable mpiexec.hydra -machinefile $PBS_NODEFILE -np 48 -ppn 24 --print-rank-map true
(borgj157:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj164:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)

aknister@borgj157:~> I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=disable mpiexec.hydra -machinefile $PBS_NODEFILE -np 48 -ppn 24 --print-rank-map true
(borgj157:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj164:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)

Feel free to open an in-house (Footprints) ticket if you'd like to dig into this a little more and find a workable solution on discover.

-Aaron
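[The helloWorld.exe exercised throughout the quoted runs below was never posted to the thread. A minimal MPI program along these lines, a sketch rather than the actual source, would produce the same "Process N of M is on host" output:]

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    /* Produces the "Process N of M is on host" lines quoted below. */
    printf("Process %d of %d is on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}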
On Thu, Apr 30, 2015 at 1:32 PM, Thompson, Matt [SCIENCE SYSTEMS AND APPLICATIONS INC] <[email protected]> wrote:

> Aaron, et al,
>
> No. I tried setting various flags, but nothing seemed to change.
>
> Well, that's not true. Per SLURM's website:
>
> http://slurm.schedmd.com/mpi_guide.html#intel_mpi
>
> I did try a more extreme example. This time, I had 12 nodes. If I run as below, I get the same answer (28, then 20 with mpirun). So I thought, well, let's try srun:
>
>> (1128) $ setenv I_MPI_PMI_LIBRARY /usr/slurm/lib64/libpmi.so
>> (1129) $ srun -n 48 ./helloWorld.exe | sort -k2 -g
>> srun.slurm: cluster configuration lacks support for cpu binding
>> Process 0 of 48 is on borgj102
>> Process 1 of 48 is on borgj102
>> Process 2 of 48 is on borgj102
>> Process 3 of 48 is on borgj102
>> Process 4 of 48 is on borgj105
>> Process 5 of 48 is on borgj105
>> Process 6 of 48 is on borgj105
>> Process 7 of 48 is on borgj105
>> Process 8 of 48 is on borgj106
>> Process 9 of 48 is on borgj106
>> Process 10 of 48 is on borgj106
>> Process 11 of 48 is on borgj106
>> Process 12 of 48 is on borgj108
>> Process 13 of 48 is on borgj108
>> Process 14 of 48 is on borgj108
>> Process 15 of 48 is on borgj108
>> Process 16 of 48 is on borgj111
>> Process 17 of 48 is on borgj111
>> Process 18 of 48 is on borgj111
>> Process 19 of 48 is on borgj111
>> Process 20 of 48 is on borgj112
>> Process 21 of 48 is on borgj112
>> Process 22 of 48 is on borgj112
>> Process 23 of 48 is on borgj112
>> Process 24 of 48 is on borgj130
>> Process 25 of 48 is on borgj130
>> Process 26 of 48 is on borgj130
>> Process 27 of 48 is on borgj130
>> Process 28 of 48 is on borgj133
>> Process 29 of 48 is on borgj133
>> Process 30 of 48 is on borgj133
>> Process 31 of 48 is on borgj133
>> Process 32 of 48 is on borgj134
>> Process 33 of 48 is on borgj134
>> Process 34 of 48 is on borgj134
>> Process 35 of 48 is on borgj134
>> Process 36 of 48 is on borgj140
>> Process 37 of 48 is on borgj140
>> Process 38 of 48 is on borgj140
>> Process 39 of 48 is on borgj140
>> Process 40 of 48 is on borgj143
>> Process 41 of 48 is on borgj143
>> Process 42 of 48 is on borgj143
>> Process 43 of 48 is on borgj143
>> Process 44 of 48 is on borgj145
>> Process 45 of 48 is on borgj145
>> Process 46 of 48 is on borgj145
>> Process 47 of 48 is on borgj145
>
> That looks like a very SLURM-y output. Load balancing everywhere! This seems to support the "mpirun did it" theory.
>
> (Note: Do *not* have I_MPI_PMI_LIBRARY=/usr/slurm/lib64/libpmi.so set when you mpirun. You get fun errors!)
>
> On 04/30/2015 11:09 AM, Aaron Knister wrote:
>
>> Hi Matt,
>>
>> I happen to know the admins of that cluster ;-) I'll take a look and get back to you. Also, are you setting any additional I_MPI variables?
>>
>> -Aaron
>>
>> Sent from my iPhone
>>
>> On Apr 30, 2015, at 10:53 AM, Thompson, Matt [SCIENCE SYSTEMS AND APPLICATIONS INC] (GSFC-610.1) <[email protected]> wrote:
>>
>>> All,
>>>
>>> (Note: I'm also asking this on Intel's forums)
>>>
>>> I'm hoping you can help me with a question. Namely, I'm on a cluster that uses SLURM, and let's say I ask for two 28-core Haswell nodes to run interactively and I get them.
>>> Great, so my environment now has things like:
>>>
>>> SLURM_NTASKS_PER_NODE=28
>>> SLURM_TASKS_PER_NODE=28(x2)
>>> SLURM_JOB_CPUS_PER_NODE=28(x2)
>>> SLURM_CPUS_ON_NODE=28
>>>
>>> Now, let's run a simple HelloWorld (using Intel MPI 5.0.3.048) on, say, 48 processes (and pipe through sort to see things a bit better):
>>>
>>> (1047) $ mpirun -np 48 -print-rank-map ./helloWorld.exe | sort -k2 -g
>>> srun.slurm: cluster configuration lacks support for cpu binding
>>> (borgj102:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
>>> (borgj105:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
>>> Process 0 of 48 is on borgj102
>>> Process 1 of 48 is on borgj102
>>> Process 2 of 48 is on borgj102
>>> Process 3 of 48 is on borgj102
>>> Process 4 of 48 is on borgj102
>>> Process 5 of 48 is on borgj102
>>> Process 6 of 48 is on borgj102
>>> Process 7 of 48 is on borgj102
>>> Process 8 of 48 is on borgj102
>>> Process 9 of 48 is on borgj102
>>> Process 10 of 48 is on borgj102
>>> Process 11 of 48 is on borgj102
>>> Process 12 of 48 is on borgj102
>>> Process 13 of 48 is on borgj102
>>> Process 14 of 48 is on borgj102
>>> Process 15 of 48 is on borgj102
>>> Process 16 of 48 is on borgj102
>>> Process 17 of 48 is on borgj102
>>> Process 18 of 48 is on borgj102
>>> Process 19 of 48 is on borgj102
>>> Process 20 of 48 is on borgj102
>>> Process 21 of 48 is on borgj102
>>> Process 22 of 48 is on borgj102
>>> Process 23 of 48 is on borgj102
>>> Process 24 of 48 is on borgj102
>>> Process 25 of 48 is on borgj102
>>> Process 26 of 48 is on borgj102
>>> Process 27 of 48 is on borgj102
>>> Process 28 of 48 is on borgj105
>>> Process 29 of 48 is on borgj105
>>> Process 30 of 48 is on borgj105
>>> Process 31 of 48 is on borgj105
>>> Process 32 of 48 is on borgj105
>>> Process 33 of 48 is on borgj105
>>> Process 34 of 48 is on borgj105
>>> Process 35 of 48 is on borgj105
>>> Process 36 of 48 is on borgj105
>>> Process 37 of 48 is on borgj105
>>> Process 38 of 48 is on borgj105
>>> Process 39 of 48 is on borgj105
>>> Process 40 of 48 is on borgj105
>>> Process 41 of 48 is on borgj105
>>> Process 42 of 48 is on borgj105
>>> Process 43 of 48 is on borgj105
>>> Process 44 of 48 is on borgj105
>>> Process 45 of 48 is on borgj105
>>> Process 46 of 48 is on borgj105
>>> Process 47 of 48 is on borgj105
>>>
>>> As you can see, the first 28 processes are on node 1 and the last 20 are on node 2. Okay. Now I want to do some load balancing, so I want 24 on each.
>>> In the past, I always used -perhost and it worked, but now:
>>>
>>> (1048) $ mpirun -np 48 -perhost 24 -print-rank-map ./helloWorld.exe | sort -k2 -g
>>> srun.slurm: cluster configuration lacks support for cpu binding
>>> (borgj102:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
>>> (borgj105:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
>>> Process 0 of 48 is on borgj102
>>> Process 1 of 48 is on borgj102
>>> Process 2 of 48 is on borgj102
>>> Process 3 of 48 is on borgj102
>>> Process 4 of 48 is on borgj102
>>> Process 5 of 48 is on borgj102
>>> Process 6 of 48 is on borgj102
>>> Process 7 of 48 is on borgj102
>>> Process 8 of 48 is on borgj102
>>> Process 9 of 48 is on borgj102
>>> Process 10 of 48 is on borgj102
>>> Process 11 of 48 is on borgj102
>>> Process 12 of 48 is on borgj102
>>> Process 13 of 48 is on borgj102
>>> Process 14 of 48 is on borgj102
>>> Process 15 of 48 is on borgj102
>>> Process 16 of 48 is on borgj102
>>> Process 17 of 48 is on borgj102
>>> Process 18 of 48 is on borgj102
>>> Process 19 of 48 is on borgj102
>>> Process 20 of 48 is on borgj102
>>> Process 21 of 48 is on borgj102
>>> Process 22 of 48 is on borgj102
>>> Process 23 of 48 is on borgj102
>>> Process 24 of 48 is on borgj102
>>> Process 25 of 48 is on borgj102
>>> Process 26 of 48 is on borgj102
>>> Process 27 of 48 is on borgj102
>>> Process 28 of 48 is on borgj105
>>> Process 29 of 48 is on borgj105
>>> Process 30 of 48 is on borgj105
>>> Process 31 of 48 is on borgj105
>>> Process 32 of 48 is on borgj105
>>> Process 33 of 48 is on borgj105
>>> Process 34 of 48 is on borgj105
>>> Process 35 of 48 is on borgj105
>>> Process 36 of 48 is on borgj105
>>> Process 37 of 48 is on borgj105
>>> Process 38 of 48 is on borgj105
>>> Process 39 of 48 is on borgj105
>>> Process 40 of 48 is on borgj105
>>> Process 41 of 48 is on borgj105
>>> Process 42 of 48 is on borgj105
>>> Process 43 of 48 is on borgj105
>>> Process 44 of 48 is on borgj105
>>> Process 45 of 48 is on borgj105
>>> Process 46 of 48 is on borgj105
>>> Process 47 of 48 is on borgj105
>>>
>>> Huh. No change and still 28,20. Do you know if there is a way to "override" what appears to be SLURM beating the -perhost flag? I suppose there is that srun.slurm warning being thrown, but that usually is a warning for more "tasks-per-core" sorts of manipulations.
>>>
>>> Thanks,
>>> Matt
>>> --
>>> Matt Thompson SSAI, Sr Software Test Engr
>>> NASA GSFC, Global Modeling and Assimilation Office
>>> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
>>> Phone: 301-614-6712 Fax: 301-614-6246
>
> --
> Matt Thompson SSAI, Sr Software Test Engr
> NASA GSFC, Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
> Phone: 301-614-6712 Fax: 301-614-6246
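[For anyone reproducing this: placement can also be checked from inside the program rather than with -print-rank-map. The following is a sketch, not from the thread, that gathers each rank's hostname on rank 0 and prints a per-node tally, so a 28/20 versus 24/24 split is obvious at a glance. It assumes ranks on the same node are contiguous in rank order, which holds for every run quoted above:]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    char *all = NULL;
    if (rank == 0)
        all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);

    /* Every rank sends its fixed-width hostname buffer to rank 0. */
    MPI_Gather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               all, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Count consecutive runs of identical hostnames (assumes each
         * node's ranks are contiguous, as in the outputs above). */
        int count = 1;
        for (int i = 1; i <= size; i++) {
            if (i < size && strcmp(all + i * MPI_MAX_PROCESSOR_NAME,
                                   all + (i - 1) * MPI_MAX_PROCESSOR_NAME) == 0) {
                count++;
            } else {
                printf("%s: %d ranks\n",
                       all + (i - 1) * MPI_MAX_PROCESSOR_NAME, count);
                count = 1;
            }
        }
        free(all);
    }

    MPI_Finalize();
    return 0;
}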
