mpirun should definitely still work alongside srun - they aren’t mutually
exclusive. OMPI 3.1.2 ships with an internal copy of PMIx v2.1.3.
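
If you want to confirm what each side is actually using, something along
these lines should show it (assuming ompi_info and srun are on your PATH;
the exact component names vary by version):

    # PMIx components Open MPI was built with (internal vs. external)
    ompi_info | grep -i pmix

    # MPI/PMIx plugin types this Slurm installation supports
    srun --mpi=list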

The problem here is that you built Slurm against PMIx v2.0.2, which is not 
cross-version capable. You can see the cross-version situation here: 
https://pmix.org/support/faq/how-does-pmix-work-with-containers/

Your options would be to build OMPI against the same PMIx 2.0.2 you used for 
Slurm, or update the PMIx version you used for Slurm to something that can 
support cross-version operations.
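
Roughly, the two paths look like this (a sketch only - the 2.1.x prefix is a
placeholder, and for option 1 the flags simply mirror the configure line you
already posted below):

    # Option 1: point OMPI at the same external PMIx that Slurm was built
    # against (an external PMIx also wants the matching external libevent/hwloc)
    ./configure --prefix=${PREFIX} \
        --with-pmix=/opt/pmix/2.0.2 \
        --with-libevent=external \
        --with-hwloc=external \
        --with-slurm

    # Option 2: install a cross-version-capable PMIx (2.1 or later), then
    # reconfigure and rebuild Slurm against it so its mpi/pmix plugin is rebuilt
    ./configure --with-pmix=/opt/pmix/2.1.x ...   # hypothetical newer PMIx prefix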

Ralph


> On Nov 11, 2018, at 5:21 PM, Bennet Fauber <ben...@umich.edu> wrote:
> 
> I have been having some difficulty getting the right combination of SLURM, 
> PMIx, and OMPI 3.1.x (specifically 3.1.2) to compile such that both the srun 
> method of starting jobs and mpirun/mpiexec work.
> 
> If someone has a slurm 18.08 or newer, PMIx, and OMPI 3.x that works with 
> both srun and mpirun and wouldn't mind sending me the version numbers and any 
> tips for getting this to work, I would appreciate it.
> 
> Should mpirun still work?  If that is just off the table and I missed the 
> memo, please let me know.
> 
> I'm asking for both because of programs like OpenFOAM and others where mpirun 
> is built into the application.  I have OMPI 1.10.7 built with similar flags, 
> and it seems to work.
> 
> [bennet@beta-build mpi_example]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  0.000458
> 
> [bennet@beta-build mpi_example]$ mpirun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  0.000295
> 
> The SLURM documentation doesn't seem to list a recommended PMIx version, that 
> I can find, and I can't find where the version of PMIx bundled with OMPI is 
> specified.
> 
> I have slurm 18.08.0, which is built against pmix-2.0.2.  We settled on that 
> version with SLURM 17.something prior to SLURM supporting PMIx 2.1.  Is OMPI 
> 3.1.2 balking at too old a PMIx?
> 
> Sorry to be so at sea.
> 
> I built OMPI with
> 
> ./configure \
>     --prefix=${PREFIX} \
>     --mandir=${PREFIX}/share/man \
>     --with-pmix=/opt/pmix/2.0.2 \
>     --with-libevent=external \
>     --with-hwloc=external \
>     --with-slurm \
>     --with-verbs \
>     --disable-dlopen --enable-shared \
>     CC=gcc CXX=g++ FC=gfortran
> 
> I have a simple test program, and it runs with
> 
> [bennet@beta-build mpi_example]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  0.000573
> 
> but on a login node, where I just want a few processors on the local node 
> rather than the compute nodes of the cluster, mpirun fails with
> 
> [bennet@beta-build mpi_example]$ mpirun -np 2 ./test_mpi
> [beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG: Not found in file base/ess_base_std_app.c at line 219
> [beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG: Not found in file base/ess_base_std_app.c at line 219
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   store DAEMON URI failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG: Not found in file ess_pmi_module.c at line 401
> [beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG: Not found in file ess_pmi_module.c at line 401
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_ess_init failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [beta-build.stage.arc-ts.umich.edu:102541] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [beta-build.stage.arc-ts.umich.edu:102542] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[13610,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [beta-build.stage.arc-ts.umich.edu:102536] 3 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
> [beta-build.stage.arc-ts.umich.edu:102536] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [beta-build.stage.arc-ts.umich.edu:102536] 1 more process has sent help message help-orte-runtime / orte_init:startup:internal-failure
> [beta-build.stage.arc-ts.umich.edu:102536] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
> 
> 
> 

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
