Thanks, Ralph,

I did try to build OMPI against PMIx 2.0.2 using the configure option
--with-pmix=/opt/pmix/2.0.2, but it sounds like the better route would be
to upgrade to PMIx 2.1.
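
If it helps anyone else following the thread, this is roughly what I plan
to re-run once PMIx 2.1 is in place (the 2.1 install prefix below is just
a placeholder, not a path from this thread):

    # rebuild OMPI against an external PMIx 2.1.x install (prefix is illustrative)
    ./configure \
        --prefix=${PREFIX} \
        --with-pmix=/opt/pmix/2.1.x \
        --with-libevent=external \
        --with-hwloc=external \
        --with-slurm \
        --with-verbs
    make -j 8 && make install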

Thanks, and I'll give it a try!

-- bennet


On Mon, Nov 12, 2018 at 12:42 PM Ralph H Castain <r...@open-mpi.org> wrote:

> mpirun should definitely still work in parallel with srun - they aren’t
> mutually exclusive. OMPI 3.1.2 contains PMIx v2.1.3.
>
> The problem here is that you built Slurm against PMIx v2.0.2, which is not
> cross-version capable. You can see the cross-version situation here:
> https://pmix.org/support/faq/how-does-pmix-work-with-containers/
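>
> As a quick check on the Slurm side, you can see which MPI plugin types
> srun offers and which libpmix the pmix plugin actually links against
> (the plugin path below is a guess, adjust it for your install):
>
>     # list the MPI plugin types Slurm knows about
>     srun --mpi=list
>     # see which libpmix the Slurm pmix plugin is linked against (path is a guess)
>     ldd /usr/lib64/slurm/mpi_pmix.so | grep -i pmix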
>
> Your options would be to build OMPI against the same PMIx 2.0.2 you used
> for Slurm, or update the PMIx version you used for Slurm to something that
> can support cross-version operations.
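>
> A minimal sketch of the second option (install prefixes are illustrative,
> and you will likely need your usual site-specific configure flags; the
> first option is essentially the configure line you already posted below):
>
>     # in the PMIx 2.1.x source tree: install a cross-version-capable PMIx
>     ./configure --prefix=/opt/pmix/2.1.x
>     make -j 8 && make install
>     # in the Slurm source tree: rebuild Slurm (at least its mpi/pmix plugin)
>     # against that PMIx, then reinstall
>     ./configure --with-pmix=/opt/pmix/2.1.x
>     make -j 8 && make install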
>
> Ralph
>
>
> On Nov 11, 2018, at 5:21 PM, Bennet Fauber <ben...@umich.edu> wrote:
>
> I have been having some difficulty getting the right combination of
> Slurm, PMIx, and OMPI 3.1.x (specifically 3.1.2) to compile in such a way
> that both the srun method of starting jobs and mpirun/mpiexec work.
>
> If someone has a working combination of Slurm 18.08 or newer, PMIx, and
> OMPI 3.x that supports both srun and mpirun, and wouldn't mind sending me
> the version numbers and any tips for getting it to work, I would
> appreciate it.
>
> Should mpirun still work?  If that is just off the table and I missed the
> memo, please let me know.
>
> I'm asking for both because of programs like OpenFOAM and others where
> mpirun is built into the application.  I have OMPI 1.10.7 built with
> similar flags, and it seems to work.
>
> [bennet@beta-build mpi_example]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  0.000458
>
> [bennet@beta-build mpi_example]$ mpirun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  0.000295
>
> The Slurm documentation doesn't seem to list a recommended PMIx version,
> as far as I can find, and I can't find where the version of PMIx that is
> bundled with OMPI is specified.
>
> I have Slurm 18.08.0, which is built against PMIx 2.0.2.  We settled on
> that PMIx version back with Slurm 17.something, before Slurm supported
> PMIx 2.1.  Is OMPI 3.1.2 balking at too old a PMIx?
>
> Sorry to be so at sea.
>
> I built OMPI with
>
> ./configure \
>     --prefix=${PREFIX} \
>     --mandir=${PREFIX}/share/man \
>     --with-pmix=/opt/pmix/2.0.2 \
>     --with-libevent=external \
>     --with-hwloc=external \
>     --with-slurm \
>     --with-verbs \
>     --disable-dlopen --enable-shared \
>     CC=gcc CXX=g++ FC=gfortran
>
> I have a simple test program, and it runs with
>
> [bennet@beta-build mpi_example]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  0.000573
>
> but on a login node, where I just want a few processors on the local
> node rather than the compute nodes of the cluster, mpirun fails with
>
> [bennet@beta-build mpi_example]$ mpirun -np 2 ./test_mpi
> [beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG:
> Not found in file base/ess_base_std_app.c at line 219
> [beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG:
> Not found in file base/ess_base_std_app.c at line 219
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   store DAEMON URI failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG:
> Not found in file ess_pmi_module.c at line 401
> [beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG:
> Not found in file ess_pmi_module.c at line 401
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_ess_init failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [beta-build.stage.arc-ts.umich.edu:102541] Local abort before MPI_INIT
> completed completed successfully, but am not able to aggregate error
> messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [beta-build.stage.arc-ts.umich.edu:102542] Local abort before MPI_INIT
> completed completed successfully, but am not able to aggregate error
> messages, and not able to guarantee that all other processes were killed!
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[13610,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [beta-build.stage.arc-ts.umich.edu:102536] 3 more processes have sent
> help message help-orte-runtime.txt / orte_init:startup:internal-failure
> [beta-build.stage.arc-ts.umich.edu:102536] Set MCA parameter
> "orte_base_help_aggregate" to 0 to see all help / error messages
> [beta-build.stage.arc-ts.umich.edu:102536] 1 more process has sent help
> message help-orte-runtime / orte_init:startup:internal-failure
> [beta-build.stage.arc-ts.umich.edu:102536] 1 more process has sent help
> message help-mpi-runtime.txt / mpi_init:startup:internal-failure
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
