Hi Howard,
is there anything you can see in the logfile?
https://download.deltacomputer.com/slurm-job-parallel.30.out
------
Could this message be a problem: srun: cluster configuration lacks support for cpu binding
This is the batch file I am submitting:
#!/bin/bash
# 2 nodes, 8 processes (MPI ranks) per node
# request exclusive nodes (not sharing nodes with other jobs)
#SBATCH --nodes=2-2
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
#SBATCH -o slurm-job-parallel.%j.out
echo -n "this script is running on: "
hostname -f
date
env | grep ^SLURM | sort
for OPENMPI in 3.0.6 3.1.6 4.0.3 4.0.4
do
echo "### running ./OWnetbench/OWnetbench.openmpi-${OPENMPI} with /opt/openmpi/${OPENMPI}/gcc/bin/mpirun ###"
# process bindings are used for repeatable benchmark results
# use with care when sharing node(s) with other jobs!
# we've requested exclusive nodes so we don't have to care about
# other jobs!
case "${OPENMPI}" in
1.6.5)
BIND_OPT="--bind-to-core --bycore --report-bindings"
;;
*)
BIND_OPT="--bind-to core --map-by core --report-bindings"
;;
esac
# because openmpi is compiled with slurm support there is no need to
# specify the number of processes or a hostfile to mpirun.
/opt/openmpi/${OPENMPI}/gcc/bin/mpirun ${BIND_OPT} \
--mca pmix_base_verbose 100 --debug-daemons \
./OWnetbench/OWnetbench.openmpi-${OPENMPI}
done
On 08/08/2020 18:46, Howard Pritchard wrote:
Hello Michael,
I'm not sure what difference between v4.0.3 and v4.0.4 could be
causing this.
Two things to try:
- add --debug-daemons and --mca pmix_base_verbose 100 to the mpirun line
and compare output from the v4.0.3 and v4.0.4 installs
- perhaps try using the --enable-mpirun-prefix-by-default configure
option and reinstall v4.0.4
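One way to do the comparison in the first point, as a rough sketch: capture the output of each run in a file and diff the two after masking the PIDs and version numbers, which differ between runs anyway. The helper names normalize_log and compare_logs and the log file names are made up for this example; only sed, diff and mktemp are assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: mask details that differ between any two runs
# (the PID in the "[host:pid]" prefix, the version in openmpi paths
# and binary names) so that diff shows only meaningful differences.
normalize_log() {
    sed -E \
        -e 's/^\[([^]:]+):[0-9]+\]/[\1:PID]/' \
        -e 's#(openmpi[-/])[0-9]+\.[0-9]+\.[0-9]+#\1VERSION#g' \
        "$1"
}

# Hypothetical helper: diff two captured logs after normalizing them.
# Returns diff's exit status (0 = identical, 1 = differences found).
compare_logs() {
    a=$(mktemp)
    b=$(mktemp)
    normalize_log "$1" > "$a"
    normalize_log "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}
```

With the output of each mpirun call redirected to a file (for example "> debug-4.0.3.log 2>&1", an assumed name), running "compare_logs debug-4.0.3.log debug-4.0.4.log" prints only the lines that differ between the two installs.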
Howard
On Thu, 6 Aug 2020 at 04:48, Michael Fuckner via users
<users@lists.open-mpi.org> wrote:
Hi,
I have a small setup with one headnode and two compute nodes connected
via IB-QDR running CentOS 8.2 and Mellanox OFED 4.9 LTS. I installed
openmpi 3.0.6, 3.1.6, 4.0.3 and 4.0.4 with identical configuration
(configure, compile, nothing configured in openmpi-mca-params.conf);
the output from ompi_info and orte-info looks identical.
I have a small benchmark that basically just does MPI_Send() and
MPI_Recv(). I can invoke it directly like this (with both 4.0.3 and 4.0.4):
/opt/openmpi/4.0.3/gcc/bin/mpirun -np 16 -hostfile HOSTFILE_2x8 \
-nolocal ./OWnetbench.openmpi-4.0.3
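The contents of HOSTFILE_2x8 are not shown here. As a sketch only, assuming it lists the two compute nodes with eight slots each in Open MPI's "hostname slots=N" hostfile format, such a file could be generated as below. The helper name make_hostfile is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print an Open MPI hostfile that gives every
# node the same number of slots.
# Usage: make_hostfile SLOTS NODE [NODE...]
make_hostfile() {
    slots=$1
    shift
    for node in "$@"; do
        printf '%s slots=%s\n' "$node" "$slots"
    done
}
```

For example, "make_hostfile 8 node001.cluster node002.cluster > HOSTFILE_2x8" would write two lines, one per node. The name node001.cluster is an assumption; only node002.cluster appears in the output in this thread.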
When I run this job from Slurm, it works with 4.0.3, but there is an
error with 4.0.4. Any hints on what to check?
### running ./OWnetbench/OWnetbench.openmpi-4.0.4 with /opt/openmpi/4.0.4/gcc/bin/mpirun ###
[node002.cluster:04960] MCW rank 0 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB]
[node002.cluster:04963] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 231
[node002.cluster:04963] OPAL ERROR: Error in file pmix3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node002.cluster:04963] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15424,1],0]
Exit code: 1
--------------------------------------------------------------------------
Any hint why 4.0.4 behaves differently from the other versions?
--
DELTA Computer Products GmbH
Röntgenstr. 4
D-21465 Reinbek bei Hamburg
T: +49 40 300672-30
F: +49 40 300672-11
E: michael.fuck...@delta.de
Internet: https://www.delta.de
Commercial register: Lübeck HRB 3678-RE, VAT ID: DE135110550
Managing director: Hans-Peter Hellmann