Kurt,

I think Joachim was also asking for the command line used to launch your
application.

Since you are using Slurm and MPI_Comm_spawn(), it is important to
understand whether you are launching your application with mpirun or srun.

FWIW, --mpi=pmix is an srun option. You can run srun --mpi=list to see the
available options.
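
As a rough sketch of the two launch styles inside a batch script (./my_app
and the launch_cmd helper are placeholders made up for illustration, not
part of Slurm or Open MPI, and whether pmix is offered depends on how your
Slurm was built):

```shell
#!/bin/bash
# Sketch only: ./my_app is a placeholder for the real application, and
# launch_cmd is an illustrative helper, not part of Slurm or Open MPI.

# Print the command line for the chosen launcher without running it.
launch_cmd() {
    case "$1" in
        srun)   printf 'srun --mpi=pmix %s\n' "$2" ;;   # direct launch by Slurm
        mpirun) printf 'mpirun %s\n' "$2" ;;            # launch by Open MPI's mpirun
        *)      printf 'unknown launcher: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Show the MPI plugin types this Slurm installation offers, if srun is on PATH.
if command -v srun >/dev/null 2>&1; then
    srun --mpi=list
fi

launch_cmd srun ./my_app
launch_cmd mpirun ./my_app
```

Whichever of the two lines your script actually runs is the piece of
information we need.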


Cheers,

Gilles

On Sat, Jun 17, 2023 at 2:53 AM Mccall, Kurt E. (MSFC-EV41) via users <
users@lists.open-mpi.org> wrote:

> Joachim,
>
>
>
> Sorry to make you resort to divination.   My sbatch command is as follows:
>
>
>
> sbatch --ntasks-per-node=24 --nodes=16 --ntasks=384  --job-name $job_name
> --exclusive --no-kill --verbose $release_dir/script.bash &
>
>
>
> --mpi=pmix isn’t an option recognized by sbatch. Is there an
> alternative? The Slurm doc you mentioned has the following paragraph. Is
> it still true with Open MPI 4.1.5?
>
>
>
> “*NOTE*: OpenMPI has a limitation that does not support calls to
> *MPI_Comm_spawn()* from within a Slurm allocation. If you need to use the
> *MPI_Comm_spawn()* function you will need to use another MPI
> implementation combined with PMI-2 since PMIx doesn't support it either.”
>
>
>
> I use MPI_Comm_spawn extensively in my application.
>
>
>
> Thanks,
>
> Kurt
>
>
>
>
>
> *From:* Jenke, Joachim <je...@itc.rwth-aachen.de>
> *Sent:* Thursday, June 15, 2023 5:33 PM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Cc:* Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
> *Subject:* [EXTERNAL] Re: OpenMPI crashes with TCP connection error
>
>
>
> Hi Kurt,
>
>
>
> Without knowing your exact MPI launch command, my crystal orb thinks you
> might want to try the --mpi=pmix flag for srun, as documented for
> Slurm + Open MPI:
>
> https://slurm.schedmd.com/mpi_guide.html#open_mpi
>
>
>
> -Joachim
> ------------------------------
>
> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Mccall,
> Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org>
> *Sent:* Thursday, June 15, 2023 11:56:28 PM
> *To:* users@lists.open-mpi.org <users@lists.open-mpi.org>
> *Cc:* Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
> *Subject:* [OMPI users] OpenMPI crashes with TCP connection error
>
>
>
> My job immediately crashes with the error message below. I don’t know
> where to begin looking for the cause of the error, or what information to
> provide to help you understand it. Maybe you could clue me in 😊.
>
>
>
> I am on RedHat 4.18.0, using Slurm 20.11.8 and OpenMPI 4.1.5 compiled with
> gcc 8.5.0.
>
> I built OpenMPI with the following  “configure” command:
>
>
>
> ./configure --prefix=/opt/openmpi/4.1.5_gnu --with-slurm --enable-debug
>
>
>
>
>
>
>
> WARNING: Open MPI accepted a TCP connection from what appears to be a
> another Open MPI process but cannot find a corresponding process
> entry for that peer.
>
> This attempted connection will be ignored; your MPI job may or may not
> continue properly.
>
>   Local host: n001
>   PID:        985481
>
>
>
>
>
