Thanks, Ralph. I did try building OMPI against PMIx 2.0.2 (using the configure option --with-pmix=/opt/pmix/2.0.2), but it sounds like the better route would be to upgrade to PMIx 2.1.
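For my own notes, here is roughly the rebuild sequence I have in mind. The exact 2.1.x release, the /opt paths, and the make flags are placeholders, and the Slurm step assumes its configure still takes --with-pmix the way it did when we built against 2.0.2, so treat this as a sketch rather than a recipe:

    # Build a newer PMIx (exact 2.1.x tarball and paths are placeholders)
    cd pmix-2.1.x
    ./configure --prefix=/opt/pmix/2.1.x --with-libevent=/usr
    make -j8 && make install

    # Rebuild Slurm's PMIx plugin against that install
    cd ../slurm-18.08.x
    ./configure --prefix=/opt/slurm --with-pmix=/opt/pmix/2.1.x
    make -j8 && make install

    # Rebuild OMPI 3.1.2 pointing --with-pmix at the same tree
    cd ../openmpi-3.1.2
    ./configure --prefix=${PREFIX} --with-pmix=/opt/pmix/2.1.x \
        --with-libevent=external --with-hwloc=external \
        --with-slurm --with-verbs
    make -j8 && make install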
Thanks, and I'll give it a try!

-- bennet

On Mon, Nov 12, 2018 at 12:42 PM Ralph H Castain <r...@open-mpi.org> wrote:
>
> mpirun should definitely still work in parallel with srun - they aren't
> mutually exclusive. OMPI 3.1.2 contains PMIx v2.1.3.
>
> The problem here is that you built Slurm against PMIx v2.0.2, which is not
> cross-version capable. You can see the cross-version situation here:
> https://pmix.org/support/faq/how-does-pmix-work-with-containers/
>
> Your options would be to build OMPI against the same PMIx 2.0.2 you used
> for Slurm, or update the PMIx version you used for Slurm to something that
> can support cross-version operations.
>
> Ralph
>
>
> On Nov 11, 2018, at 5:21 PM, Bennet Fauber <ben...@umich.edu> wrote:
>
> I have been having some difficulties getting the right combination of
> SLURM, PMIx, and OMPI 3.1.x (specifically 3.1.2) to compile in such a way
> that both the srun method of starting jobs and mpirun/mpiexec will also run.
>
> If someone has a slurm 18.08 or newer, PMIx, and OMPI 3.x that works with
> both srun and mpirun and wouldn't mind sending me the version numbers and
> any tips for getting this to work, I would appreciate it.
>
> Should mpirun still work? If that is just off the table and I missed the
> memo, please let me know.
>
> I'm asking for both because of programs like OpenFOAM and others where
> mpirun is built into the application. I have OMPI 1.10.7 built with
> similar flags, and it seems to work.
>
> [bennet@beta-build mpi_example]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is: 0.000458
>
> [bennet@beta-build mpi_example]$ mpirun ./test_mpi
> The sum = 0.866386
> Elapsed time is: 0.000295
>
> SLURM documentation doesn't seem to list a recommended PMIx, that I can
> find. I can't find where the version of PMIx that is bundled with OMPI is
> specified.
>
> I have slurm 18.08.0, which is built against pmix-2.0.2. We settled on
> that version with SLURM 17.something prior to SLURM supporting PMIx 2.1.
> Is OMPI 3.1.2 balking at too old a PMIx?
>
> Sorry to be so at sea.
>
> I built OMPI with
>
> ./configure \
>     --prefix=${PREFIX} \
>     --mandir=${PREFIX}/share/man \
>     --with-pmix=/opt/pmix/2.0.2 \
>     --with-libevent=external \
>     --with-hwloc=external \
>     --with-slurm \
>     --with-verbs \
>     --disable-dlopen --enable-shared \
>     CC=gcc CXX=g++ FC=gfortran
>
> I have a simple test program, and it runs with
>
> [bennet@beta-build mpi_example]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is: 0.000573
>
> but, on a login node, where I just want a few processors on the local
> node, not to run on the compute nodes of the cluster, mpirun fails with
>
> [bennet@beta-build mpi_example]$ mpirun -np 2 ./test_mpi
> [beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG: Not found in file base/ess_base_std_app.c at line 219
> [beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG: Not found in file base/ess_base_std_app.c at line 219
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   store DAEMON URI failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG: Not found in file ess_pmi_module.c at line 401
> [beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG: Not found in file ess_pmi_module.c at line 401
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_ess_init failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [beta-build.stage.arc-ts.umich.edu:102541] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [beta-build.stage.arc-ts.umich.edu:102542] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[13610,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [beta-build.stage.arc-ts.umich.edu:102536] 3 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
> [beta-build.stage.arc-ts.umich.edu:102536] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [beta-build.stage.arc-ts.umich.edu:102536] 1 more process has sent help message help-orte-runtime / orte_init:startup:internal-failure
> [beta-build.stage.arc-ts.umich.edu:102536] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
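P.S. Once things are rebuilt, I plan to sanity-check which PMIx each layer actually sees with something like the commands below. These are from memory, so treat them as a sketch; in particular, pmix_info is only there if the standalone PMIx install provides it.

    # Ask Slurm which MPI/PMIx plugin types it was built with
    srun --mpi=list

    # Ask Open MPI which pmix component(s) it will use
    ompi_info | grep -i pmix

    # Check the standalone PMIx install, if its tool is present
    pmix_info

    # Then re-run the same test both ways
    srun --mpi=pmix ./test_mpi
    mpirun -np 2 ./test_mpi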