Hello Michael,

Not sure what could be causing this in terms of the delta between v4.0.3
and v4.0.4.
Two things to try:

- add --debug-daemons and --mca pmix_base_verbose 100 to the mpirun line
and compare output from the v4.0.3 and v4.0.4 installs
- perhaps try using the --enable-mpirun-prefix-by-default configure option
and reinstall v4.0.4
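
For example, with the paths and hostfile from your report (adjust for your
installs), the comparison in the first suggestion could look something like
this:

```shell
# Re-run the failing case under both installs with daemon debugging and
# verbose PMIx client/server output, capturing each run to a log file.
/opt/openmpi/4.0.3/gcc/bin/mpirun --debug-daemons \
    --mca pmix_base_verbose 100 \
    -np 16 -hostfile HOSTFILE_2x8 -nolocal \
    ./OWnetbench.openmpi-4.0.3 2>&1 | tee ompi-4.0.3.log

/opt/openmpi/4.0.4/gcc/bin/mpirun --debug-daemons \
    --mca pmix_base_verbose 100 \
    -np 16 -hostfile HOSTFILE_2x8 -nolocal \
    ./OWnetbench.openmpi-4.0.4 2>&1 | tee ompi-4.0.4.log

# Compare where the two versions diverge during startup.
diff ompi-4.0.3.log ompi-4.0.4.log | less
```

And for the second suggestion, assuming your existing install prefix:

```shell
cd openmpi-4.0.4
./configure --prefix=/opt/openmpi/4.0.4/gcc \
    --enable-mpirun-prefix-by-default
make -j install
```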

Howard


On Thu, Aug 6, 2020 at 04:48 Michael Fuckner via users <
users@lists.open-mpi.org> wrote:

> Hi,
>
> I have a small setup with one headnode and two compute nodes connected
> via IB-QDR running CentOS 8.2 and Mellanox OFED 4.9 LTS. I installed
> openmpi 3.0.6, 3.1.6, 4.0.3 and 4.0.4 with identical configuration
> (configure, compile, nothing configured in openmpi-mca-params.conf), the
> output from ompi-info and orte-info looks identical.
>
> There is a small benchmark basically just doing MPI_Send() and
> MPI_Recv(). I can invoke it directly like this (with both 4.0.3 and 4.0.4):
>
> /opt/openmpi/4.0.3/gcc/bin/mpirun -np 16 -hostfile HOSTFILE_2x8 -nolocal
> ./OWnetbench.openmpi-4.0.3
>
> When running this job from Slurm, it works with 4.0.3, but there is an
> error with 4.0.4. Any hint what to check?
>
>
> ### running ./OWnetbench/OWnetbench.openmpi-4.0.4 with
> /opt/openmpi/4.0.4/gcc/bin/mpirun ###
> [node002.cluster:04960] MCW rank 0 bound to socket 0[core 7[hwt 0-1]]:
> [../../../../../../../BB]
> [node002.cluster:04963] PMIX ERROR: OUT-OF-RESOURCE in file
> client/pmix_client.c at line 231
> [node002.cluster:04963] OPAL ERROR: Error in file pmix3x_client.c at
> line 112
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node002.cluster:04963] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able
> to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>    Process name: [[15424,1],0]
>    Exit code:    1
> --------------------------------------------------------------------------
>
> Any hint why 4.0.4 doesn't behave like the other versions?
>
> --
> DELTA Computer Products GmbH
> Röntgenstr. 4
> D-21465 Reinbek bei Hamburg
> T: +49 40 300672-30
> F: +49 40 300672-11
> E: michael.fuck...@delta.de
>
> Internet: https://www.delta.de
> Handelsregister Lübeck HRB 3678-RE, Ust.-IdNr.: DE135110550
> Geschäftsführer: Hans-Peter Hellmann
>