All,
(Note: I'm also asking this on Intel's forums)
I'm hoping you can help me with a question. I'm on a cluster that uses
SLURM, and let's say I ask for two 28-core Haswell nodes to run
interactively and I get them. Great, so my environment now has things like:
SLURM_NTASKS_PER_NODE=28
SLURM_TASKS_PER_NODE=28(x2)
SLURM_JOB_CPUS_PER_NODE=28(x2)
SLURM_CPUS_ON_NODE=28
Now, let's run a simple HelloWorld (using Intel MPI 5.0.3.048) with, say,
48 processes (piped through sort to see things a bit better):
(1047) $ mpirun -np 48 -print-rank-map ./helloWorld.exe | sort -k2 -g
srun.slurm: cluster configuration lacks support for cpu binding
(borgj102:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj105:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
Process 0 of 48 is on borgj102
Process 1 of 48 is on borgj102
Process 2 of 48 is on borgj102
Process 3 of 48 is on borgj102
Process 4 of 48 is on borgj102
Process 5 of 48 is on borgj102
Process 6 of 48 is on borgj102
Process 7 of 48 is on borgj102
Process 8 of 48 is on borgj102
Process 9 of 48 is on borgj102
Process 10 of 48 is on borgj102
Process 11 of 48 is on borgj102
Process 12 of 48 is on borgj102
Process 13 of 48 is on borgj102
Process 14 of 48 is on borgj102
Process 15 of 48 is on borgj102
Process 16 of 48 is on borgj102
Process 17 of 48 is on borgj102
Process 18 of 48 is on borgj102
Process 19 of 48 is on borgj102
Process 20 of 48 is on borgj102
Process 21 of 48 is on borgj102
Process 22 of 48 is on borgj102
Process 23 of 48 is on borgj102
Process 24 of 48 is on borgj102
Process 25 of 48 is on borgj102
Process 26 of 48 is on borgj102
Process 27 of 48 is on borgj102
Process 28 of 48 is on borgj105
Process 29 of 48 is on borgj105
Process 30 of 48 is on borgj105
Process 31 of 48 is on borgj105
Process 32 of 48 is on borgj105
Process 33 of 48 is on borgj105
Process 34 of 48 is on borgj105
Process 35 of 48 is on borgj105
Process 36 of 48 is on borgj105
Process 37 of 48 is on borgj105
Process 38 of 48 is on borgj105
Process 39 of 48 is on borgj105
Process 40 of 48 is on borgj105
Process 41 of 48 is on borgj105
Process 42 of 48 is on borgj105
Process 43 of 48 is on borgj105
Process 44 of 48 is on borgj105
Process 45 of 48 is on borgj105
Process 46 of 48 is on borgj105
Process 47 of 48 is on borgj105
As you can see, the first 28 processes are on node 1, and the last 20
are on node 2. Okay. Now, I want to do some load balancing, so I want 24
on each. In the past, I always used -perhost and it worked, but now:
(1048) $ mpirun -np 48 -perhost 24 -print-rank-map ./helloWorld.exe | sort -k2 -g
srun.slurm: cluster configuration lacks support for cpu binding
(borgj102:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj105:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
Process 0 of 48 is on borgj102
Process 1 of 48 is on borgj102
Process 2 of 48 is on borgj102
Process 3 of 48 is on borgj102
Process 4 of 48 is on borgj102
Process 5 of 48 is on borgj102
Process 6 of 48 is on borgj102
Process 7 of 48 is on borgj102
Process 8 of 48 is on borgj102
Process 9 of 48 is on borgj102
Process 10 of 48 is on borgj102
Process 11 of 48 is on borgj102
Process 12 of 48 is on borgj102
Process 13 of 48 is on borgj102
Process 14 of 48 is on borgj102
Process 15 of 48 is on borgj102
Process 16 of 48 is on borgj102
Process 17 of 48 is on borgj102
Process 18 of 48 is on borgj102
Process 19 of 48 is on borgj102
Process 20 of 48 is on borgj102
Process 21 of 48 is on borgj102
Process 22 of 48 is on borgj102
Process 23 of 48 is on borgj102
Process 24 of 48 is on borgj102
Process 25 of 48 is on borgj102
Process 26 of 48 is on borgj102
Process 27 of 48 is on borgj102
Process 28 of 48 is on borgj105
Process 29 of 48 is on borgj105
Process 30 of 48 is on borgj105
Process 31 of 48 is on borgj105
Process 32 of 48 is on borgj105
Process 33 of 48 is on borgj105
Process 34 of 48 is on borgj105
Process 35 of 48 is on borgj105
Process 36 of 48 is on borgj105
Process 37 of 48 is on borgj105
Process 38 of 48 is on borgj105
Process 39 of 48 is on borgj105
Process 40 of 48 is on borgj105
Process 41 of 48 is on borgj105
Process 42 of 48 is on borgj105
Process 43 of 48 is on borgj105
Process 44 of 48 is on borgj105
Process 45 of 48 is on borgj105
Process 46 of 48 is on borgj105
Process 47 of 48 is on borgj105
Huh. No change: the split is still 28 and 20. Do you know if there is a
way to "override" what appears to be SLURM beating the -perhost flag? I
suppose the srun.slurm warning could be related, but that one usually
shows up for more "tasks-per-core" sorts of manipulations.
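In case anyone wants to reproduce this, a per-node tally is easier to
compare than the full listings above. Here is a minimal sketch, assuming
the "Process N of M is on HOST" output format of my helloWorld.exe;
count_ranks_per_host is just a name I made up, not anything from Intel MPI
or SLURM:

```shell
# Tally MPI ranks per host from lines like "Process 3 of 48 is on borgj102".
# count_ranks_per_host is a hypothetical helper, not part of Intel MPI or SLURM.
count_ranks_per_host() {
    # Keep only the "Process ..." lines (this skips the srun.slurm warning and
    # the rank map), use the last field as the hostname, and count each one.
    awk '/^Process [0-9]+ of [0-9]+ is on / { n[$NF]++ }
         END { for (h in n) print h, n[h] }' | sort
}
```

It reads from stdin, so it can replace the sort at the end of the pipeline,
e.g. "mpirun -np 48 -perhost 24 ./helloWorld.exe | count_ranks_per_host".
For the two runs above it would report borgj102 28 and borgj105 20 both
times, where I was hoping for 24 and 24 on the second run.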
Thanks,
Matt
--
Matt Thompson SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246