Hi Gilles,

Yes, it was just a typo in the last email; it is spelled correctly in the job script.

I just tried 1 node * 2 tasks/node and got the same error I posted before, one copy per process. Here it is again:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cn603-20-l:169109] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cn603-20-l:169108] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: cn603-20-l: tasks 0-1: Exited with exit code 1

I suspect Slurm, but in any case, how can I troubleshoot this? The program is a simple MPI Hello World code.
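For reference, it is essentially the textbook example. A minimal sketch of what the test program looks like is below (the exact source may differ slightly, but it does nothing beyond MPI_Init, a rank/size query, a print, and MPI_Finalize):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* The job aborts here, before MPI_Init completes */
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

So the failure is in MPI_Init itself, before the program produces any output, which is why I suspect the Slurm/PMIx side rather than the code.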
All the best,

--
Passant A. Hafez | HPC Applications Specialist
KAUST Supercomputing Core Laboratory (KSL)
King Abdullah University of Science and Technology
Building 1, Al-Khawarizmi, Room 0123
Mobile : +966 (0) 55-247-9568
Mobile : +20 (0) 106-146-9644
Office : +966 (0) 12-808-0367

________________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Gilles Gouaillardet <gil...@rist.or.jp>
Sent: Tuesday, March 12, 2019 8:22 AM
To: users@lists.open-mpi.org
Subject: Re: [OMPI users] Building PMIx and Slurm support

Passant,

Except the typo (it should be srun --mpi=pmix_v3), there is nothing wrong with that, and it is working just fine for me (same SLURM version, same PMIx version, same Open MPI version and same Open MPI configure command line). That is why I asked you for some more information/logs in order to investigate your issue.

You might want to try a single node job first in order to rule out potential interconnect related issues.

Cheers,

Gilles

On 3/12/2019 1:54 PM, Passant A. Hafez wrote:
> Hello Gilles,
>
> Yes, I do use srun --mpi=pmix_3 to run the app, what's the problem with that?
> Before that, when we tried to launch MPI apps directly with srun, we got the error message saying Slurm missed the PMIx support, that's why we proceeded with the installation.
>
> All the best,
>
> --
> Passant
>
> On Mar 12, 2019 6:53 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> Passant,
>
> I built a similar environment, and had no issue running a simple MPI program.
>
> Can you please post your slurm script (I assume it uses srun to start the MPI app),
>
> the output of
>
> scontrol show config | grep Mpi
>
> and the full output of your job?
>
> Cheers,
>
> Gilles
>
> On 3/12/2019 7:59 AM, Passant A. Hafez wrote:
> > Hello,
> >
> > So we now have Slurm 18.08.6-2 compiled with PMIx 3.1.2
> >
> > then I installed openmpi 4.0.0 with:
> >
> > --with-slurm --with-pmix=internal --with-libevent=internal --enable-shared --enable-static --with-x
> >
> > (Following the thread, it was mentioned that building OMPI 4.0.0 with PMIx 3.1.2 will fail with PMIX_MODEX and PMIX_INFO_ARRAY errors, so I used internal PMIx)
> >
> > The MPI program fails with:
> >
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [cn603-13-r:387088] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> >
> > for each process, please advise! what's going wrong here?
> >
> > All the best,
> > --
> > Passant A. Hafez | HPC Applications Specialist
> > KAUST Supercomputing Core Laboratory (KSL)
> > King Abdullah University of Science and Technology
> > Building 1, Al-Khawarizmi, Room 0123
> > Mobile : +966 (0) 55-247-9568
> > Mobile : +20 (0) 106-146-9644
> > Office : +966 (0) 12-808-0367
> > ------------------------------------------------------------------------
> > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Ralph H Castain <r...@open-mpi.org>
> > *Sent:* Monday, March 4, 2019 5:29 PM
> > *To:* Open MPI Users
> > *Subject:* Re: [OMPI users] Building PMIx and Slurm support
> >
> >> On Mar 4, 2019, at 5:34 AM, Daniel Letai <d...@letai.org.il <mailto:d...@letai.org.il>> wrote:
> >>
> >> Gilles,
> >>
> >> On 3/4/19 8:28 AM, Gilles Gouaillardet wrote:
> >>> Daniel,
> >>>
> >>> On 3/4/2019 3:18 PM, Daniel Letai wrote:
> >>>>
> >>>>> So unless you have a specific reason not to mix both, you might also give the internal PMIx a try.
> >>>> Does this hold true for libevent too? Configure complains if libevent for openmpi is different than the one used for the other tools.
> >>>>
> >>> I am not exactly sure of which scenario you are running.
> >>>
> >>> Long story short,
> >>>
> >>> - If you use an external PMIx, then you have to use an external libevent (otherwise configure will fail). It must be the same one used by PMIx, but I am not sure configure checks that.
> >>>
> >>> - If you use the internal PMIx, then it is up to you. you can either use the internal libevent, or an external one.
> >>>
> >> Thanks, that clarifies the issues I've experienced. Since PMIx doesn't have to be the same for server and nodes, I can compile slurm with external PMIx with system libevent, and compile openmpi with internal PMIx and libevent, and that should work. Is that correct?
> >
> > Yes - that is indeed correct!
> >
> >> BTW, building 4.0.1rc1 completed successfully using external for all, will start testing in near future.
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >> Thanks,
> >> Dani_L.