Kurt,

I think Joachim was also asking for the command line used to launch your application.

Since you are using Slurm and MPI_Comm_spawn(), it is important to understand whether you are using mpirun or srun.

FWIW, --mpi=pmix is an srun option; you can run srun --mpi=list to see the available options.

Cheers,

Gilles

On Sat, Jun 17, 2023 at 2:53 AM Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:

> Joachim,
>
> Sorry to make you resort to divination. My sbatch command is as follows:
>
> sbatch --ntasks-per-node=24 --nodes=16 --ntasks=384 --job-name $job_name --exclusive --no-kill --verbose $release_dir/script.bash &
>
> --mpi=pmix isn’t an option recognized by sbatch. Is there an alternative? The Slurm doc you mentioned has the following paragraph. Is it still true with Open MPI 4.1.5?
>
> “NOTE: OpenMPI has a limitation that does not support calls to MPI_Comm_spawn() from within a Slurm allocation. If you need to use the MPI_Comm_spawn() function you will need to use another MPI implementation combined with PMI-2 since PMIx doesn't support it either.”
>
> I use MPI_Comm_spawn extensively in my application.
>
> Thanks,
>
> Kurt
>
> From: Jenke, Joachim <je...@itc.rwth-aachen.de>
> Sent: Thursday, June 15, 2023 5:33 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
> Subject: [EXTERNAL] Re: OpenMPI crashes with TCP connection error
>
> Hi Kurt,
>
> Without knowing your exact MPI launch command, my crystal orb thinks you might want to try the --mpi=pmix flag for srun, as documented for Slurm + Open MPI:
>
> https://slurm.schedmd.com/mpi_guide.html#open_mpi
>
> -Joachim
> ------------------------------
>
> From: users <users-boun...@lists.open-mpi.org> on behalf of Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org>
> Sent: Thursday, June 15, 2023 11:56:28 PM
> To: users@lists.open-mpi.org <users@lists.open-mpi.org>
> Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
> Subject: [OMPI users] OpenMPI crashes with TCP connection error
>
> My job immediately crashes with the error message below. I don’t know where to begin looking for the cause of the error, or what information to provide to help you understand it. Maybe you could clue me in 😊.
>
> I am on Red Hat (kernel 4.18.0), using Slurm 20.11.8 and Open MPI 4.1.5 compiled with gcc 8.5.0. I built Open MPI with the following “configure” command:
>
> ./configure --prefix=/opt/openmpi/4.1.5_gnu --with-slurm --enable-debug
>
> WARNING: Open MPI accepted a TCP connection from what appears to be a
> another Open MPI process but cannot find a corresponding process
> entry for that peer.
>
> This attempted connection will be ignored; your MPI job may or may not
> continue properly.
>
>   Local host: n001
>   PID: 985481
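
For reference, the srun-versus-mpirun distinction Gilles and Joachim raise can be sketched as a batch script. This is a hypothetical stand-in for $release_dir/script.bash, which was not posted; LAUNCHER, APP, ./my_app and build_launch are placeholder names. Because --mpi=pmix belongs to srun rather than sbatch, it goes on the srun line inside the script, while Kurt's sbatch options can also be written as #SBATCH directives. The script echoes its launch command, so the exact command line is on record; per Gilles, which of the two launchers is used is what matters when MPI_Comm_spawn() is involved.

```shell
#!/bin/bash
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=24
#SBATCH --ntasks=384
#SBATCH --exclusive
#SBATCH --no-kill
# Hypothetical stand-in for $release_dir/script.bash (the real script was not
# posted). LAUNCHER, APP, ./my_app and build_launch are placeholder names.
set -euo pipefail

# Fill the global LAUNCH array with the command for the chosen launcher.
build_launch() {
  local launcher="$1"; shift
  case "$launcher" in
    srun)   LAUNCH=(srun --mpi=pmix "$@") ;;  # Slurm starts the ranks via its PMIx plugin
    mpirun) LAUNCH=(mpirun "$@") ;;           # Open MPI's own launcher, inside the allocation
    *)      echo "unknown launcher: $launcher" >&2; return 1 ;;
  esac
}

# Only launch inside a real Slurm job; sbatch sets SLURM_JOB_ID.
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
  srun --mpi=list                       # PMI plugin types this Slurm supports
  build_launch "${LAUNCHER:-srun}" "${APP:-./my_app}"
  echo "launch command: ${LAUNCH[*]}"   # log the exact command line used
  exec "${LAUNCH[@]}"
fi
```

Submitted with the original sbatch line, the #SBATCH directives simply repeat options already given on the command line, and command-line options take precedence over them. Running LAUNCHER=mpirun sbatch ... switches launchers without editing the script, since sbatch exports the submitting environment by default.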