Passant,

you can first try a PMIx-only program


for example, in the test directory of PMIx

srun --mpi=pmix -N 2 -n 4 .libs/pmix_client -n 4

should work just fine (otherwise, this is an issue unrelated to Open MPI)


If it works, then you can build another Open MPI and pass --enable-debug to the configure command line.

Hopefully, it will provide more information (or at least, you will have the option of collecting some very verbose logs)
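
For example, something along these lines (the install prefix, binary name, and verbosity level below are just placeholders, adjust them to your site):

# rebuild Open MPI with debug support enabled
./configure --prefix=/opt/openmpi-4.0.0-debug --with-slurm \
            --with-pmix=internal --with-libevent=internal --enable-debug
make -j 8 && make install

# then run with extra PMIx verbosity to collect more details
export OMPI_MCA_pmix_base_verbose=100
srun --mpi=pmix_v3 -N 1 -n 2 ./mpi_hello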


Cheers,


Gilles

On 3/12/2019 5:46 PM, Passant A. Hafez wrote:
Hi Gilles,

Yes, it was just a typo in the last email; it was spelled correctly in the job
script.

So I just tried 1 node * 2 tasks per node, and I got the same error I posted
before (one copy per process); here it is again:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[cn603-20-l:169109] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[cn603-20-l:169108] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
srun: error: cn603-20-l: tasks 0-1: Exited with exit code 1


I suspect Slurm, but anyway, how can I troubleshoot this?
The program is a simple MPI Hello World code.
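
For completeness, it is essentially the textbook example:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from rank %d of %d\n", rank, size);
    MPI_Finalize();

    return 0;
}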




All the best,
--
Passant A. Hafez | HPC Applications Specialist
KAUST Supercomputing Core Laboratory (KSL)
King Abdullah University of Science and Technology
Building 1, Al-Khawarizmi, Room 0123
Mobile : +966 (0) 55-247-9568
Mobile : +20 (0) 106-146-9644
Office  : +966 (0) 12-808-0367

________________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Gilles Gouaillardet 
<gil...@rist.or.jp>
Sent: Tuesday, March 12, 2019 8:22 AM
To: users@lists.open-mpi.org
Subject: Re: [OMPI users] Building PMIx and Slurm support

Passant,


Except for the typo (it should be srun --mpi=pmix_v3), there is nothing
wrong with that, and it works just fine for me

(same Slurm version, same PMIx version, same Open MPI version, and same
Open MPI configure command line).

That is why I asked you for some more information/logs in order to
investigate your issue.


You might want to try a single-node job first in order to rule out
potential interconnect-related issues.
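
For example, something like (the binary name is just a placeholder):

srun --mpi=pmix_v3 -N 1 -n 2 ./mpi_hello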


Cheers,


Gilles


On 3/12/2019 1:54 PM, Passant A. Hafez wrote:
Hello Gilles,

Yes, I do use srun --mpi=pmix_3 to run the app; what's the problem with
that?
Before that, when we tried to launch MPI apps directly with srun, we
got an error message saying Slurm was missing PMIx support, which is why
we proceeded with this installation.



All the best,

--

Passant

On Mar 12, 2019 6:53 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
Passant,


I built a similar environment, and had no issue running a simple MPI
program.


Can you please post your slurm script (I assume it uses srun to start
the MPI app),

the output of

scontrol show config | grep Mpi

and the full output of your job?
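
For reference, I would expect something along the lines of this minimal script (the binary name is a placeholder, and you may need site-specific #SBATCH directives such as account or partition):

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --time=00:05:00

srun --mpi=pmix_v3 ./mpi_hello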


Cheers,


Gilles


On 3/12/2019 7:59 AM, Passant A. Hafez wrote:
Hello,


So we now have Slurm 18.08.6-2 compiled with PMIx 3.1.2

then I installed openmpi 4.0.0 with:

--with-slurm --with-pmix=internal --with-libevent=internal --enable-shared --enable-static --with-x


(Following the thread, it was mentioned that building OMPI 4.0.0 with
PMIx 3.1.2 will fail with PMIX_MODEX and PMIX_INFO_ARRAY errors, so I
used internal PMIx)
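
For reference, that is, the complete configure invocation (any site-specific --prefix aside):

./configure --with-slurm --with-pmix=internal --with-libevent=internal \
            --enable-shared --enable-static --with-x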



The MPI program fails with:


*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[cn603-13-r:387088] Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not
able to guarantee that all other processes were killed!


for each process. Please advise, what is going wrong here?







All the best,
--
Passant A. Hafez | HPC Applications Specialist
KAUST Supercomputing Core Laboratory (KSL)
King Abdullah University of Science and Technology
Building 1, Al-Khawarizmi, Room 0123
Mobile : +966 (0) 55-247-9568
Mobile : +20 (0) 106-146-9644
Office : +966 (0) 12-808-0367
------------------------------------------------------------------------
*From:* users <users-boun...@lists.open-mpi.org> on behalf of Ralph H
Castain <r...@open-mpi.org>
*Sent:* Monday, March 4, 2019 5:29 PM
*To:* Open MPI Users
*Subject:* Re: [OMPI users] Building PMIx and Slurm support


On Mar 4, 2019, at 5:34 AM, Daniel Letai <d...@letai.org.il> wrote:

Gilles,
On 3/4/19 8:28 AM, Gilles Gouaillardet wrote:
Daniel,


On 3/4/2019 3:18 PM, Daniel Letai wrote:
So unless you have a specific reason not to mix both, you might
also give the internal PMIx a try.
Does this hold true for libevent too? Configure complains if the
libevent for Open MPI is different from the one used for the other
tools.

I am not exactly sure which scenario you are running.

Long story short:

- If you use an external PMIx, then you have to use an external
libevent (otherwise configure will fail). It must be the same one used
by PMIx, but I am not sure configure checks that.

- If you use the internal PMIx, then it is up to you: you can use either
the internal libevent or an external one.

Thanks, that clarifies the issues I've experienced. Since PMIx
doesn't have to be the same for the server and the nodes, I can compile Slurm
with an external PMIx against the system libevent, and compile Open MPI with
the internal PMIx and libevent, and that should work. Is that correct?
Yes - that is indeed correct!
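
In other words, something along these lines (paths are purely illustrative):

# Slurm built against the external PMIx (which itself uses the system libevent)
./configure --with-pmix=/usr/local/pmix-3.1.2

# Open MPI built with its bundled PMIx and libevent
./configure --with-slurm --with-pmix=internal --with-libevent=internal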

BTW, building 4.0.1rc1 completed successfully using external libraries for
everything; I will start testing in the near future.
Cheers,


Gilles

Thanks,
Dani_L.