My institute recently purchased a Linux cluster with 20 nodes, 2 sockets per node, and 6 cores per socket. OpenMPI v1.8.1 is installed. I want to run 15 jobs, each requiring 16 MPI processes. For each job, I want to use two cores on each node, mapping by socket. If I use these options:
#PBS -l nodes=8:ppn=2

mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16 <executable file name>

the reported bindings are:

[burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[burn001:09186] MCW rank 1 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[burn004:07113] MCW rank 6 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]

and so on...

These bindings appear to be OK, but when I do a "top -H" on each node, I see that all 15 jobs use only core 0 and core 6 on each node. This means, I believe, that I am only using 1/6 of my resources. I want to use 100%. So I try this:

#PBS -l nodes=8:ppn=2

mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 <executable file name>

Now it appears that I am getting 100% usage of all cores on all nodes. The bindings are:

[burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]

and so on...

The problem now is that some of my jobs are hanging. They all start running fine and produce output, but at some point I lose about 4 of the 15 jobs to hanging. I suspect that an MPI message is sent but never received. The number of jobs that hang, and the time at which they hang, varies from test to test. We have run these cases successfully on our old cluster dozens of times; they are part of our benchmark suite.

When I run these jobs using a map-by-core strategy (that is, the MPI processes are simply mapped by core, so each job uses only 16 cores on two nodes), I do not see as much hanging. It still occurs, but less often. This leads me to suspect that the increased network traffic from the map-by-socket approach is the cause of the problem, but I do not know what to do about it. I think that the map-by-socket approach is the right one, but I do not know whether I have my OpenMPI options just right. Can you tell me what OpenMPI options to use, and can you tell me how I might debug the hanging issue?

Kevin McGrattan
National Institute of Standards and Technology
100 Bureau Drive, Mail Stop 8664
Gaithersburg, Maryland 20899
301 975 2712
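P.S. In case the full context helps, here is a skeleton of the job script I submit for each of the 15 jobs. The job name is just a placeholder, and the executable name is elided as above; the PBS directive and the two mpirun variants are exactly as described in the message:

#!/bin/bash
#PBS -l nodes=8:ppn=2
#PBS -N benchmark_case            # placeholder job name
cd $PBS_O_WORKDIR

# Variant 1: bind each rank to a single core, mapping by socket.
# The reported bindings look right, but all 15 jobs pile onto
# cores 0 and 6 of every node.
mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16 <executable file name>

# Variant 2: bind each rank to a whole socket instead.
# All cores get used, but some jobs hang intermittently.
#mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 <executable file name>

Only one of the two mpirun lines is active in any given run; I comment out the other.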