Hello Michael,

I'm not sure what changed between v4.0.3 and v4.0.4 that could be causing this. Two things to try:

- add --debug-daemons and --mca pmix_base_verbose 100 to the mpirun line and compare the output from the v4.0.3 and v4.0.4 installs
- perhaps try the --enable-mpirun-prefix-by-default configure option and reinstall v4.0.4

Howard

On Thu, Aug 6, 2020 at 04:48, Michael Fuckner via users <users@lists.open-mpi.org> wrote:

> Hi,
>
> I have a small setup with one headnode and two compute nodes connected
> via IB-QDR running CentOS 8.2 and Mellanox OFED 4.9 LTS. I installed
> openmpi 3.0.6, 3.1.6, 4.0.3 and 4.0.4 with identical configuration
> (configure, compile, nothing configured in openmpi-mca-params.conf), the
> output from ompi-info and orte-info looks identical.
>
> There is a small benchmark basically just doing MPI_Send() and
> MPI_Recv(). I can invoke it directly like this (as 4.0.3 and 4.0.4)
>
> /opt/openmpi/4.0.3/gcc/bin/mpirun -np 16 -hostfile HOSTFILE_2x8 -nolocal ./OWnetbench.openmpi-4.0.3
>
> when running this job from slurm, it works with 4.0.3, but there is an
> error with 4.0.4. Any hint what to check?
>
>
> ### running ./OWnetbench/OWnetbench.openmpi-4.0.4 with /opt/openmpi/4.0.4/gcc/bin/mpirun ###
> [node002.cluster:04960] MCW rank 0 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB]
> [node002.cluster:04963] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 231
> [node002.cluster:04963] OPAL ERROR: Error in file pmix3x_client.c at line 112
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node002.cluster:04963] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able
> to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[15424,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
>
> Any hint why 4.0.4 behaves not like the other versions?
>
> --
> DELTA Computer Products GmbH
> Röntgenstr. 4
> D-21465 Reinbek bei Hamburg
> T: +49 40 300672-30
> F: +49 40 300672-11
> E: michael.fuck...@delta.de
>
> Internet: https://www.delta.de
> Handelsregister Lübeck HRB 3678-RE, Ust.-IdNr.: DE135110550
> Geschäftsführer: Hans-Peter Hellmann
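
For reference, the first suggestion in the reply (adding --debug-daemons and --mca pmix_base_verbose 100 and comparing the two installs) could be set up roughly as sketched below. The install paths, hostfile name, and benchmark binary names are taken from the directly invoked command in the original report; the helper name debug_cmd and the log file names are made up for illustration. The script only prints the two command lines, so nothing is launched until the commented lines are used inside the Slurm job.

```shell
#!/bin/sh
# Sketch only: build the mpirun command line with the two debug options
# suggested in the reply, once per Open MPI version.
# Assumptions: installs live under /opt/openmpi/<version>/gcc and the
# benchmark binaries are named OWnetbench.openmpi-<version>, as in the report.

debug_cmd() {
    ver="$1"
    printf '%s\n' "/opt/openmpi/${ver}/gcc/bin/mpirun --debug-daemons --mca pmix_base_verbose 100 -np 16 -hostfile HOSTFILE_2x8 -nolocal ./OWnetbench.openmpi-${ver}"
}

# Dry run: print the command for each version.
# To actually run one inside the Slurm job and keep its output:
#   eval "$(debug_cmd "$ver")" > "debug-${ver}.log" 2>&1
# and afterwards compare the logs:
#   diff debug-4.0.3.log debug-4.0.4.log
for ver in 4.0.3 4.0.4; do
    debug_cmd "$ver"
done
```

The two logs will differ in process IDs and job identifiers in any case, so the diff is only a starting point; the lines around the PMIX ERROR in the 4.0.4 log are the ones worth reading closely.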