Hi Howard,

Is there anything you can see in the logfile?

https://download.deltacomputer.com/slurm-job-parallel.30.out

------

Is this a problem: "srun: cluster configuration lacks support for cpu binding"?
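As far as I understand it, that srun message only means Slurm itself has no task-affinity plugin configured, so srun cannot do CPU binding; mpirun's own binding (--bind-to core) should be unaffected. If Slurm-side binding is wanted as well, the relevant setting would be something like this (a sketch; whether task/affinity or task/cgroup fits depends on the cluster setup):

```
# slurm.conf (same on all nodes; slurmctld/slurmd need a restart afterwards)
TaskPlugin=task/affinity
```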


This is the batch file I am submitting:

#!/bin/bash

# 2 nodes, 8 processes (MPI ranks) per node
# request exclusive nodes (not sharing nodes with other jobs)

#SBATCH --nodes=2-2
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
#SBATCH -o slurm-job-parallel.%j.out


echo -n "this script is running on: "
hostname -f
date

env | grep ^SLURM | sort

for OPENMPI in 3.0.6 3.1.6 4.0.3 4.0.4
do
  echo "### running ./OWnetbench/OWnetbench.openmpi-${OPENMPI} with /opt/openmpi/${OPENMPI}/gcc/bin/mpirun ###"

  # process bindings are used for repeatable benchmark results
  # use with care when sharing node(s) with other jobs!
  # we've requested exclusive nodes so we don't have to care about other jobs!
  case "${OPENMPI}" in
    1.6.5)
      BIND_OPT="--bind-to-core --bycore --report-bindings"
      ;;
    *)
      BIND_OPT="--bind-to core --map-by core --report-bindings"
      ;;
  esac

  # because openmpi is compiled with slurm support there is no need to
  # specify the number of processes or a hostfile to mpirun.

  /opt/openmpi/${OPENMPI}/gcc/bin/mpirun ${BIND_OPT} --mca pmix_base_verbose 100 --debug-daemons ./OWnetbench/OWnetbench.openmpi-${OPENMPI}

done


On 08/08/2020 18:46, Howard Pritchard wrote:
Hello Michael,

Not sure what could be causing this in terms of delta between v4.0.3 and v4.0.4.
Two things to try:

- add --debug-daemons and --mca pmix_base_verbose 100 to the mpirun line and compare output from the v4.0.3 and v4.0.4 installs
- perhaps try using the --enable-mpirun-prefix-by-default configure option and reinstall v4.0.4
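For reference, the reinstall with that option might look roughly like this (the prefix is a guess based on the install paths in your mail; adjust as needed):

```shell
# Rebuild v4.0.4 so the install prefix is baked into mpirun's launch
# environment (avoids PATH/LD_LIBRARY_PATH mismatches on remote nodes).
cd openmpi-4.0.4
./configure --prefix=/opt/openmpi/4.0.4/gcc \
            --enable-mpirun-prefix-by-default
make -j 8 && make install
```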

Howard


On Thu., Aug. 6, 2020 at 04:48, Michael Fuckner via users <users@lists.open-mpi.org> wrote:

    Hi,

    I have a small setup with one headnode and two compute nodes connected
    via IB-QDR running CentOS 8.2 and Mellanox OFED 4.9 LTS. I installed
    openmpi 3.0.6, 3.1.6, 4.0.3 and 4.0.4 with identical configuration
    (configure, compile, nothing configured in openmpi-mca-params.conf),
    the
    output from ompi-info and orte-info looks identical.

    There is a small benchmark basically just doing MPI_Send() and
    MPI_Recv(). I can invoke it directly like this (as 4.0.3 and 4.0.4)

    /opt/openmpi/4.0.3/gcc/bin/mpirun -np 16 -hostfile HOSTFILE_2x8
    -nolocal
    ./OWnetbench.openmpi-4.0.3

    When running this job from Slurm, it works with 4.0.3, but there is an
    error with 4.0.4. Any hint what to check?


    ### running ./OWnetbench/OWnetbench.openmpi-4.0.4 with
    /opt/openmpi/4.0.4/gcc/bin/mpirun ###
    [node002.cluster:04960] MCW rank 0 bound to socket 0[core 7[hwt 0-1]]:
    [../../../../../../../BB]
    [node002.cluster:04963] PMIX ERROR: OUT-OF-RESOURCE in file
    client/pmix_client.c at line 231
    [node002.cluster:04963] OPAL ERROR: Error in file pmix3x_client.c at
    line 112
    *** An error occurred in MPI_Init
    *** on a NULL communicator
    *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    ***    and potentially your MPI job)
    [node002.cluster:04963] Local abort before MPI_INIT completed completed
    successfully, but am not able to aggregate error messages, and not able
    to guarantee that all other processes were killed!
    --------------------------------------------------------------------------
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun detected that one or more processes exited with non-zero status,
    thus causing
    the job to be terminated. The first process to do so was:

        Process name: [[15424,1],0]
        Exit code:    1
    --------------------------------------------------------------------------

    Any hint why 4.0.4 does not behave like the other versions?

    --
    DELTA Computer Products GmbH
    Röntgenstr. 4
    D-21465 Reinbek bei Hamburg
    T: +49 40 300672-30
    F: +49 40 300672-11
    E: michael.fuck...@delta.de

    Internet: https://www.delta.de
    Handelsregister Lübeck HRB 3678-RE, Ust.-IdNr.: DE135110550
    Geschäftsführer: Hans-Peter Hellmann



