Hello,

My attempt to run and troubleshoot an OMPI job under a Slurm allocation does not
work as I would expect.
The result below has led me to believe that, under the hood in this setup
(Slurm with OMPI), the correct srun options are not being used when I call
mpirun directly.

Specifically, the "--cpu_bind=none" option that gets generated is what breaks,
but it also looks like the nodelist is incorrect.
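
As a first sanity check (just a diagnostic sketch, not part of the job script; it only assumes srun is on PATH), I can confirm which spelling of the CPU-binding option the installed srun actually advertises:

  # Which Slurm release is this, and which cpu-bind spelling does srun accept?
  srun --version
  srun --help | grep -i 'cpu.bind'

If only the hyphenated --cpu-bind form shows up in the help text, that would be consistent with the "unrecognized option" error below.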

The job script:

#!/bin/bash

#SBATCH --job-name=mpi-hostname
#SBATCH --partition=dev
#SBATCH --account=Account1
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --begin=now+10
#SBATCH --output="%x-%u-%N-%j.txt"      # jobName-userId-hostName-jobId.txt


# ---------------------------------------------------------------------- #
#module load DefApps
#module purge >/dev/null 2>&1
##module load staging/slurm >/dev/null 2>&1
#module load  gcc/4.8.5 openmpi >/dev/null 2>&1
#module --ignore_cache spider openmpi/3.1.3 >/dev/null 2>&1
#
# ---------------------------------------------------------------------- #
#
# Sanity check only: make sure an ORTE launcher is on PATH
# (the MCMD line below uses an explicit install path anyway).
MPI_RUN=$(which orterun)
if [[ -z "${MPI_RUN}" ]]; then
  echo "ERROR: Cannot find 'orterun' (mpirun) executable... exiting"
  exit 1
fi

echo
#CMD="orterun  -npernode 1 -np 2  /bin/hostname"
#CMD="srun /bin/hostname"
#CMD="srun -N2 -n2 --mpi=pmi2  /bin/hostname"
#MCMD="/sw/dev/openmpi401/bin/mpirun --bind-to-core  --report-bindings  -mca btl openib,self -mca plm_base_verbose 10  /bin/hostname"
MCMD="/sw/dev/openmpi401/bin/mpirun  --report-bindings  -mca btl openib,self -mca plm_base_verbose 10  /bin/hostname"
echo "INFO: Executing the command: $MCMD"
$MCMD
sync

Here is the output:

user1@node-login7g:~/git/slurm-jobs$ more mpi-hostname-user1-node513-835.txt

INFO: Executing the command: /sw/dev/openmpi401/bin/mpirun  --report-bindings  -mca btl openib,self -mca plm_base_verbose 10  /bin/hostname
[node513:32514] mca: base: components_register: registering framework plm components
[node513:32514] mca: base: components_register: found loaded component isolated
[node513:32514] mca: base: components_register: component isolated has no register or open function
[node513:32514] mca: base: components_register: found loaded component rsh
[node513:32514] mca: base: components_register: component rsh register function successful
[node513:32514] mca: base: components_register: found loaded component slurm
[node513:32514] mca: base: components_register: component slurm register function successful
[node513:32514] mca: base: components_register: found loaded component tm
[node513:32514] mca: base: components_register: component tm register function successful
[node513:32514] mca: base: components_open: opening plm components
[node513:32514] mca: base: components_open: found loaded component isolated
[node513:32514] mca: base: components_open: component isolated open function successful
[node513:32514] mca: base: components_open: found loaded component rsh
[node513:32514] mca: base: components_open: component rsh open function successful
[node513:32514] mca: base: components_open: found loaded component slurm
[node513:32514] mca: base: components_open: component slurm open function successful
[node513:32514] mca: base: components_open: found loaded component tm
[node513:32514] mca: base: components_open: component tm open function successful
[node513:32514] mca:base:select: Auto-selecting plm components
[node513:32514] mca:base:select:(  plm) Querying component [isolated]
[node513:32514] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[node513:32514] mca:base:select:(  plm) Querying component [rsh]
[node513:32514] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[node513:32514] mca:base:select:(  plm) Querying component [slurm]
[node513:32514] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[node513:32514] mca:base:select:(  plm) Querying component [tm]
[node513:32514] mca:base:select:(  plm) Selected component [slurm]
[node513:32514] mca: base: close: component isolated closed
[node513:32514] mca: base: close: unloading component isolated
[node513:32514] mca: base: close: component rsh closed
[node513:32514] mca: base: close: unloading component rsh
[node513:32514] mca: base: close: component tm closed
[node513:32514] mca: base: close: unloading component tm
[node513:32514] [[4367,0],0] plm:slurm: final top-level argv:
        srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=node514 --ntasks=1 orted -mca orte_report_bindings "1" -mca ess "slurm" -mca ess_base_jobid "286195712" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[3:513-514]@0(2)" -mca orte_hnp_uri "286195712.0;tcp://172.30.146
10.38.146.45:43031" -mca btl "openib,self" -mca plm_base_verbose "10"
srun: unrecognized option '--cpu_bind=none'
srun: unrecognized option '--cpu_bind=none'
Try "srun --help" for more information
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[node513:32514] mca: base: close: component slurm closed
[node513:32514] mca: base: close: unloading component slurm
user1@node-login7g:~/git/slurm-jobs$
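
To take mpirun/ORTE out of the picture, the rejected option can be reproduced by hand from inside the same allocation. This is only a sketch based on the argv in the log above: I substitute /bin/hostname for orted and reuse the node name Slurm handed me.

  # Underscore form, exactly as the slurm plm generates it -- expect the
  # same "unrecognized option" failure:
  srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=node514 --ntasks=1 /bin/hostname

  # Hyphenated form -- expect this to run if the installed srun only
  # knows --cpu-bind:
  srun --ntasks-per-node=1 --kill-on-bad-exit --cpu-bind=none --nodes=1 --nodelist=node514 --ntasks=1 /bin/hostname

If the second command works and the first fails the same way, that points at the option string the slurm plm builds rather than at the networking problem the ORTE help message suggests.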

Here is some ompi_info output:

                 Package: Open MPI user1@node-login7g Distribution
                Open MPI: 4.0.1
  Open MPI repo revision: v4.0.1
   Open MPI release date: Mar 26, 2019
                Open RTE: 4.0.1
  Open RTE repo revision: v4.0.1
   Open RTE release date: Mar 26, 2019
                    OPAL: 4.0.1
      OPAL repo revision: v4.0.1
       OPAL release date: Mar 26, 2019
                 MPI API: 3.1.0
            Ident string: 4.0.1
                  Prefix: /sw/dev/openmpi401
Configured architecture: x86_64-unknown-linux-gnu
          Configure host: node-login7g
           Configured by: user1
           Configured on: Wed Jul  3 12:13:45 EDT 2019
          Configure host: node-login7g
  Configure command line: '--prefix=/sw/dev/openmpi401' '--enable-shared' 
'--enable-static' '--enable-mpi-cxx' '--with-zlib=/usr' '--without-psm' 
'--without-libfabric' '--without-mxm' '--with-verbs' '--without-psm2' 
'--without-alps' '--without-lsf' '--without-sge' '--with-slurm' '--with-tm' 
'--without-load-leveler' '--disable-memchecker'
'--disable-java' '--disable-mpi-java' '--without-cuda' '--enable-cxx-exceptions'
                Built by: user1
                Built on: Wed Jul  3 12:24:11 EDT 2019
              Built host: node-login7g
              C bindings: yes
            C++ bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (limited: overloading)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: 4.8.5
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /usr/bin/gfortran

…..
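
In case it helps, this is how I am inspecting the slurm plm component on this build (standard ompi_info usage; the exact parameter list shown may of course differ):

  # Dump the MCA parameters the slurm plm component exposes, at full
  # verbosity, plus anything slurm-related in the complete parameter dump.
  /sw/dev/openmpi401/bin/ompi_info --param plm slurm --level 9
  /sw/dev/openmpi401/bin/ompi_info --all | grep -i slurm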
