My institute recently purchased a Linux cluster with 20 nodes; 2 sockets per 
node; 6 cores per socket. OpenMPI v1.8.1 is installed. I want to run 15 jobs, 
each of which requires 16 MPI processes. For each job, I want to use two cores 
on each node, mapping by socket. If I use these options:

#PBS -l nodes=8:ppn=2
mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16 <executable file name>
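
As a sanity check on the allocation itself, I verify what Torque hands to mpirun 
with something along these lines (just a sketch, assuming the usual $PBS_NODEFILE 
variable set by the batch system):

# in the job script, before the mpirun line
echo "Slots per node in this allocation:"
sort $PBS_NODEFILE | uniq -c    # I expect 8 nodes with 2 slots each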

The reported bindings are:

[burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[burn001:09186] MCW rank 1 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[burn004:07113] MCW rank 6 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
and so on...
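
Besides --report-bindings, I also check where the ranks actually run on a node. 
A rough sketch of what I do, with "my_executable" standing in for my program's 
name:

# on one of the compute nodes, while a job is running
for pid in $(pgrep my_executable); do
    taskset -cp "$pid"       # allowed core list for each MPI rank
done
ps -eLo pid,tid,psr,pcpu,comm | grep my_executable   # psr = core each thread last ran on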

These bindings appear to be OK, but when I do a "top -H" on each node, I see 
that all 15 jobs use core 0 and core 6 on each node. This means, I believe, 
that I am only using 1/6 of my resources. I want to use 100%. So I try this:

#PBS -l nodes=8:ppn=2
mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 <executable file name>

Now it appears that I am getting 100% usage of all cores on all nodes. The 
bindings are:

[burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
and so on...
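
The 100% figure comes from watching per-core load while the jobs run; roughly, 
on a compute node (mpstat is from the sysstat package):

mpstat -P ALL 5     # per-core utilization, updated every 5 seconds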

The problem now is that some of my jobs are hanging. They all start running 
fine and produce output, but at some point I lose about 4 of the 15 jobs to 
hanging. I suspect that an MPI message is being sent but never received. The 
number of jobs that hang, and the point at which they hang, vary from test to 
test. We have run these cases successfully on our old cluster dozens of times; 
they are part of our benchmark suite.
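
When a job hangs, I can attach a debugger to one of its ranks and dump stack 
traces to see where it is stuck. A rough sketch of that (gdb on the compute 
nodes; "my_executable" is again a placeholder for my program's name):

# on a node that hosts a hung rank
pid=$(pgrep my_executable | head -1)
gdb -batch -ex "thread apply all bt" -p "$pid"   # backtrace of every thread in that rank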

When I run these jobs using a map-by-core strategy (that is, the MPI processes 
are simply mapped by core, and each job uses only 16 cores on two nodes), I do 
not see as much hanging. It still occurs, but less often. This leads me to 
suspect that the increased network traffic caused by the map-by-socket approach 
is behind the problem, but I do not know what to do about it. I think the 
map-by-socket approach is the right one, but I do not know whether I have my 
OpenMPI options just right.
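
For reference, the map-by-core runs use options roughly like these (I am 
reconstructing the exact line from memory, so treat it as a sketch):

#PBS -l nodes=2:ppn=8
mpirun --report-bindings --bind-to core --map-by core -np 16 <executable file name>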

Can you tell me which OpenMPI options to use, and how I might debug the 
hanging issue?



Kevin McGrattan
National Institute of Standards and Technology
100 Bureau Drive, Mail Stop 8664
Gaithersburg, Maryland 20899

301 975 2712
