Hello,

My attempt to run and troubleshoot an Open MPI job under a Slurm allocation does not work as I would expect. The output below has led me to believe that, under the hood, in this setup (Slurm with Open MPI) the correct srun options are not being used when I call mpirun directly.
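For reference, here is a quick way to confirm which spelling of the binding option the installed srun accepts (a minimal sketch, not specific to this cluster; run it inside an allocation):

    # Check the Slurm version and which cpu-bind spelling srun advertises;
    # newer Slurm releases document only the dashed form.
    srun --version
    srun --help 2>&1 | grep -i 'cpu.bind'

    # Trivial probe: if the dashed spelling succeeds while the underscore
    # spelling is rejected, this srun no longer accepts --cpu_bind.
    srun --cpu-bind=none -N1 -n1 /bin/true && echo "dashed form accepted"
    srun --cpu_bind=none -N1 -n1 /bin/true || echo "underscore form rejected"

On this system srun rejects the underscore form, which matches the failure shown below.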
Specifically, the generated "--cpu_bind=none" option is breaking srun, but it also looks like the nodelist is incorrect.

The job script:

    #!/bin/bash

    #SBATCH --job-name=mpi-hostname
    #SBATCH --partition=dev
    #SBATCH --account=Account1
    #SBATCH --time=01:00:00
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --begin=now+10
    #SBATCH --output="%x-%u-%N-%j.txt"   # jobName-userId-hostName-jobId.txt

    # ---------------------------------------------------------------------- #
    #module load DefApps
    #module purge >/dev/null 2>&1
    ##module load staging/slurm >/dev/null 2>&1
    #module load gcc/4.8.5 openmpi >/dev/null 2>&1
    #module --ignore_cache spider openmpi/3.1.3 >/dev/null 2>&1
    #
    # ---------------------------------------------------------------------- #
    #
    # Sanity check: make sure an ORTE launcher is on the PATH before running.
    MPI_RUN=$(which orterun)
    if [[ -z "${MPI_RUN:+x}" ]]; then
        echo "ERROR: Cannot find 'orterun' executable ... exiting"
        exit 1
    fi

    echo
    #CMD="orterun -npernode 1 -np 2 /bin/hostname"
    #CMD="srun /bin/hostname"
    #CMD="srun -N2 -n2 --mpi=pmi2 /bin/hostname"
    #MCMD="/sw/dev/openmpi401/bin/mpirun --bind-to-core --report-bindings -mca btl openib,self -mca plm_base_verbose 10 /bin/hostname"
    MCMD="/sw/dev/openmpi401/bin/mpirun --report-bindings -mca btl openib,self -mca plm_base_verbose 10 /bin/hostname"
    echo "INFO: Executing the command: $MCMD"
    $MCMD
    sync

Here is the output:

user1@node-login7g:~/git/slurm-jobs$ more mpi-hostname-user1-node513-835.txt

INFO: Executing the command: /sw/dev/openmpi401/bin/mpirun --report-bindings -mca btl openib,self -mca plm_base_verbose 10 /bin/hostname
[node513:32514] mca: base: components_register: registering framework plm components
[node513:32514] mca: base: components_register: found loaded component isolated
[node513:32514] mca: base: components_register: component isolated has no register or open function
[node513:32514] mca: base: components_register: found loaded component rsh
[node513:32514] mca: base: components_register: component rsh register function successful
[node513:32514] mca: base: components_register: found loaded component slurm
[node513:32514] mca: base: components_register: component slurm register function successful
[node513:32514] mca: base: components_register: found loaded component tm
[node513:32514] mca: base: components_register: component tm register function successful
[node513:32514] mca: base: components_open: opening plm components
[node513:32514] mca: base: components_open: found loaded component isolated
[node513:32514] mca: base: components_open: component isolated open function successful
[node513:32514] mca: base: components_open: found loaded component rsh
[node513:32514] mca: base: components_open: component rsh open function successful
[node513:32514] mca: base: components_open: found loaded component slurm
[node513:32514] mca: base: components_open: component slurm open function successful
[node513:32514] mca: base: components_open: found loaded component tm
[node513:32514] mca: base: components_open: component tm open function successful
[node513:32514] mca:base:select: Auto-selecting plm components
[node513:32514] mca:base:select:( plm) Querying component [isolated]
[node513:32514] mca:base:select:( plm) Query of component [isolated] set priority to 0
[node513:32514] mca:base:select:( plm) Querying component [rsh]
[node513:32514] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node513:32514] mca:base:select:( plm) Querying component [slurm]
[node513:32514] mca:base:select:( plm) Query of component [slurm] set priority to 75
[node513:32514] mca:base:select:( plm) Querying component [tm]
[node513:32514] mca:base:select:( plm) Selected component [slurm]
[node513:32514] mca: base: close: component isolated closed
[node513:32514] mca: base: close: unloading component isolated
[node513:32514] mca: base: close: component rsh closed
[node513:32514] mca: base: close: unloading component rsh
[node513:32514] mca: base: close: component tm closed
[node513:32514] mca: base: close: unloading component tm
[node513:32514] [[4367,0],0] plm:slurm: final top-level argv:
srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=node514 --ntasks=1 orted -mca orte_report_bindings "1" -mca ess "slurm" -mca ess_base_jobid "286195712" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[3:513-514]@0(2)" -mca orte_hnp_uri "286195712.0;tcp://172.30.146
10.38.146.45:43031" -mca btl "openib,self" -mca plm_base_verbose "10"
srun: unrecognized option '--cpu_bind=none'
srun: unrecognized option '--cpu_bind=none'
Try "srun --help" for more information
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[node513:32514] mca: base: close: component slurm closed
[node513:32514] mca: base: close: unloading component slurm
user1@node-login7g:~/git/slurm-jobs$

Here is some ompi_info output:

Package: Open MPI user1@node-login7g Distribution
Open MPI: 4.0.1
Open MPI repo revision: v4.0.1
Open MPI release date: Mar 26, 2019
Open RTE: 4.0.1
Open RTE repo revision: v4.0.1
Open RTE release date: Mar 26, 2019
OPAL: 4.0.1
OPAL repo revision: v4.0.1
OPAL release date: Mar 26, 2019
MPI API: 3.1.0
Ident string: 4.0.1
Prefix: /sw/dev/openmpi401
Configured architecture: x86_64-unknown-linux-gnu
Configure host: node-login7g
Configured by: user1
Configured on: Wed Jul 3 12:13:45 EDT 2019
Configure host: node-login7g
Configure command line: '--prefix=/sw/dev/openmpi401' '--enable-shared' '--enable-static' '--enable-mpi-cxx' '--with-zlib=/usr' '--without-psm' '--without-libfabric' '--without-mxm' '--with-verbs' '--without-psm2' '--without-alps' '--without-lsf' '--without-sge' '--with-slurm' '--with-tm' '--without-load-leveler' '--disable-memchecker' '--disable-java' '--disable-mpi-java' '--without-cuda' '--enable-cxx-exceptions'
Built by: user1
Built on: Wed Jul 3 12:24:11 EDT 2019
Built host: node-login7g
C bindings: yes
C++ bindings: yes
Fort mpif.h: yes (all)
Fort use mpi: yes (limited: overloading)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: 4.8.5
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fort compiler: gfortran
Fort compiler abs: /usr/bin/gfortran
...
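In case it helps with reproducing or triaging: a temporary workaround sketch (untested here, and assuming passwordless ssh to the allocated compute nodes is permitted) is to bypass the slurm plm entirely, since that is the component constructing the failing srun line:

    # Force the rsh/ssh launcher instead of the slurm plm, so mpirun never
    # builds the srun command containing the underscore --cpu_bind option.
    # Diagnostic sketch only; requires ssh access to the compute nodes.
    /sw/dev/openmpi401/bin/mpirun -mca plm rsh --report-bindings /bin/hostname

Launching directly with srun --mpi=pmi2 is probably not an alternative on this build, since the configure command line above does not include PMI support.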