Dear all,

Recently I tried to switch from Open MPI 2.1.x to Open MPI 3.1.x.
I am running a hybrid OpenMP/MPI program, and prior to Open MPI 3.1 I used
  --bind-to core --map-by slot:PE=4
and requested full nodes via PBS or Slurm (:ppn=16; --cpus-per-task=1 --ntasks-per-node=16).
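
For reference, my job scripts looked roughly like this with 2.1.x (the executable name, node count, and the OMP_NUM_THREADS value are only placeholders for illustration):

  # PBS: request full nodes, e.g.
  #PBS -l nodes=4:ppn=16
  # Slurm equivalent, e.g.
  #SBATCH --ntasks-per-node=16
  #SBATCH --cpus-per-task=1

  export OMP_NUM_THREADS=4
  mpirun --bind-to core --map-by slot:PE=4 ./my_hybrid_prog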

With Open MPI 3.1, however, this approach fails. With Slurm I am still able to bind the MPI processes to cores if I request --cpus-per-task=4 --ntasks-per-node=4. With PBS I cannot request less than :ppn=16, and the only way it works is with -npernode 4, which binds to a socket rather than to cores.
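
To be concrete, the Slurm job that still works with 3.1 looks roughly like this (again, executable name, node count, and OMP_NUM_THREADS are placeholders):

  #SBATCH --nodes=4
  #SBATCH --ntasks-per-node=4
  #SBATCH --cpus-per-task=4

  export OMP_NUM_THREADS=4
  mpirun --bind-to core --map-by slot:PE=4 ./my_hybrid_prog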

If I run Open MPI without any scheduler, --bind-to core --map-by slot:PE=4 works as long as I use a hostfile that specifies only 4 slots per host. However, and this is the very strange part, if I manipulate the hostfile provided by PBS inside the job script to obtain a hostfile with only 4 slots per host (sketched further below), the binding looks OK, but MPI_Init fails:

[node035:16619] MCW rank 16 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../../../..][../../../../../../../..]
[node037:20607] MCW rank 8 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../../../..][../../../../../../../..]
[node035:16619] MCW rank 17 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [../../../../BB/BB/BB/BB][../../../../../../../..]
[node037:20607] MCW rank 9 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [../../../../BB/BB/BB/BB][../../../../../../../..]
[node035:16619] MCW rank 18 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/../../../..]
[node037:20607] MCW rank 10 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/../../../..]
[node035:16619] MCW rank 19 bound to socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../BB/BB/BB/BB]
[node037:20607] MCW rank 11 bound to socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../BB/BB/BB/BB]
[node039:13653] [[58738,0],0] ORTE_ERROR_LOG: Not found in file ../../../../orte/mca/plm/base/plm_base_receive.c at line 342
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[58738,0],0] FORCE-TERMINATE AT (null):1 - error ../../../../orte/mca/plm/base/plm_base_receive.c(343)

This is something that should be reported to the developers.
--------------------------------------------------------------------------
[node038:04619] MCW rank 4 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../../../..][../../../../../../../..]
[node038:04619] MCW rank 5 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [../../../../BB/BB/BB/BB][../../../../../../../..]
[node038:04619] MCW rank 6 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/../../../..]
[node038:04619] MCW rank 7 bound to socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../BB/BB/BB/BB]
--------------------------------------------------------------------------

The exact same configuration runs without problems if I run it outside of PBS.
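
For completeness, the hostfile manipulation I do inside the PBS job script is essentially the following (the exact commands are only an illustration of the idea; $PBS_NODEFILE contains one line per allocated core):

  # collapse the PBS node file to one line per node, allowing 4 slots each
  sort -u $PBS_NODEFILE | awk '{print $0" slots=4"}' > hostfile.4slots

  export OMP_NUM_THREADS=4
  mpirun --hostfile hostfile.4slots --bind-to core --map-by slot:PE=4 ./my_hybrid_prog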

Is this a known problem, or is there a better way to handle hybrid programs?

Kind regards,
Tobias

Open MPI repo revision: v3.1.0rc2-17-g091ee94
  Configure command line: '--with-tm' '--without-cuda'
                          '--with-knem=/opt/knem-1.1.2.90mlnx1/'
                          '--with-verbs'
                          '--with-hcoll=/home/kloeffel/hpcx-v2.0.1-gcc-MLNX_OFED_LINUX-3.4-1.0.0.0-redhat7.2-x86_64/hcoll/'
                          '--with-mxm=/home/kloeffel/hpcx-v2.0.1-gcc-MLNX_OFED_LINUX-3.4-1.0.0.0-redhat7.2-x86_64/mxm'
                          '--with-ucx=/home/kloeffel/hpcx-v2.0.1-gcc-MLNX_OFED_LINUX-3.4-1.0.0.0-redhat7.2-x86_64/ucx'
                          '--with-cma' 'CC=icc' 'CXX=icpc' 'FC=ifort'
                          '--prefix=/local/openmpi-mlnx/openmpi-3.1.0-I18.1-MLNX-TM'
                          'CFLAGS=-m64' 'CXXFLAGS=-m64' 'FCFLAGS=-m64'

Intel compilers: 2018.1.163

--
M.Sc. Tobias Klöffel
=======================================================
Interdisciplinary Center for Molecular Materials (ICMM)
and Computer-Chemistry-Center (CCC)
Department Chemie und Pharmazie
Friedrich-Alexander-Universität Erlangen-Nürnberg
Nägelsbachstr. 25
D-91052 Erlangen, Germany

Room: 2.305
Phone: +49 (0) 9131 / 85 - 20423
Fax: +49 (0) 9131 / 85 - 26565

=======================================================

E-mail: tobias.kloef...@fau.de
