Re: [OMPI users] OpenMPI crashes with TCP connection error
Kurt,

I think Joachim was also asking for the command line used to launch your application. Since you are using Slurm and MPI_Comm_spawn(), it is important to understand whether you are using mpirun or srun.

FWIW, --mpi=pmix is an srun option. You can run srun --mpi=list to find the available options.

Cheers,

Gilles

In reply to: Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org>, Sat, Jun 17, 2023 at 2:53 AM
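A minimal sketch of how a batch script could act on the distinction above: ask srun which MPI plugin types it offers and pick a launcher accordingly. The helper names lists_pmix and choose_launch and the application path ./my_app are hypothetical, and the contents of the real script.bash are not known from this thread. The check only greps for the string "pmix" and makes no assumption about the exact layout of the srun --mpi=list output.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the text on stdin (the output of
# `srun --mpi=list`) mentions a PMIx plugin anywhere.
lists_pmix() {
    grep -qi 'pmix'
}

# Hypothetical helper: print the launch command the batch script would use.
#   $1 = captured text of `srun --mpi=list`
#   $2 = application path (./my_app is an assumption, not from the thread)
choose_launch() {
    local mpi_list="$1" app="$2"
    if printf '%s\n' "$mpi_list" | lists_pmix; then
        printf 'srun --mpi=pmix %s\n' "$app"
    else
        printf 'mpirun %s\n' "$app"
    fi
}

# Only query Slurm when srun is actually installed on this machine.
# stderr is merged in because the plugin list may be printed there.
if command -v srun >/dev/null 2>&1; then
    choose_launch "$(srun --mpi=list 2>&1)" ./my_app
fi
```

The script only prints the command it would run, so it is safe to try on a login node before editing the real batch script. Whether MPI_Comm_spawn() works under either launcher is a separate question that this check does not answer.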
Re: [OMPI users] OpenMPI crashes with TCP connection error
Joachim,

Sorry to make you resort to divination. My sbatch command is as follows:

    sbatch --ntasks-per-node=24 --nodes=16 --ntasks=384 --job-name $job_name --exclusive --no-kill --verbose $release_dir/script.bash &

--mpi=pmix isn’t an option recognized by sbatch. Is there an alternative? The Slurm doc you mentioned has the following paragraph. Is it still true with OpenMPI 4.1.5?

“NOTE: OpenMPI has a limitation that does not support calls to MPI_Comm_spawn() from within a Slurm allocation. If you need to use the MPI_Comm_spawn() function you will need to use another MPI implementation combined with PMI-2 since PMIx doesn't support it either.”

I use MPI_Comm_spawn extensively in my application.

Thanks,

Kurt

In reply to: Jenke, Joachim, Thursday, June 15, 2023 5:33 PM
Re: [OMPI users] OpenMPI crashes with TCP connection error
Hi Kurt,

Without knowing your exact MPI launch command, my crystal orb thinks you might want to try the --mpi=pmix flag for srun as documented for slurm+openmpi:

https://slurm.schedmd.com/mpi_guide.html#open_mpi

-Joachim

In reply to: Mccall, Kurt E. (MSFC-EV41) via users, Thursday, June 15, 2023 11:56:28 PM
[OMPI users] OpenMPI crashes with TCP connection error
My job immediately crashes with the error message below. I don’t know where to begin looking for the cause of the error, or what information to provide to help you understand it. Maybe you could clue me in.

I am on RedHat 4.18.0, using Slurm 20.11.8 and OpenMPI 4.1.5 compiled with gcc 8.5.0. I built OpenMPI with the following “configure” command:

    ./configure --prefix=/opt/openmpi/4.1.5_gnu --with-slurm --enable-debug

The error message:

    WARNING: Open MPI accepted a TCP connection from what appears to be a
    another Open MPI process but cannot find a corresponding process
    entry for that peer.

    This attempted connection will be ignored; your MPI job may or may not
    continue properly.

    Local host: n001
    PID:985481
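The configure line above asks for Slurm support, so one low-cost check is whether the installed build actually lists Slurm and PMIx components. The sketch below filters the output of ompi_info for those names. The helper name slurm_pmix_lines is hypothetical, and the sketch assumes the ompi_info found on PATH belongs to the /opt/openmpi/4.1.5_gnu install. It will not by itself explain the TCP warning; it only confirms what the build contains.

```shell
#!/bin/bash
# Hypothetical helper: keep only the lines on stdin that mention
# Slurm or PMIx (case-insensitive).
slurm_pmix_lines() {
    grep -i -E 'slurm|pmix'
}

# Run the check only if ompi_info is installed and on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | slurm_pmix_lines || echo "no Slurm or PMIx components listed"
fi
```

If several Open MPI installs are present, calling ompi_info by its full path under the configured prefix avoids inspecting the wrong build.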