Re: [OMPI users] OMPI 3.1.x, PMIx, SLURM, and mpiexec/mpirun

2018-11-12 Thread Bennet Fauber
Thanks, Ralph,

I did try building OMPI against PMIx 2.0.2, using the configure option
--with-pmix=/opt/pmix/2.0.2, but it sounds like the better route would be to
upgrade to PMIx 2.1.
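
If I have followed the suggestion correctly, the rebuild would look roughly
like the sketch below. The 2.1.4 version number and the /opt paths are just
placeholders for whichever 2.1.x release and prefixes we actually end up using:

# 1. Build a PMIx 2.1.x release (PMIx needs an external libevent;
#    this assumes the system libevent headers are under /usr)
cd pmix-2.1.4
./configure --prefix=/opt/pmix/2.1.4 --with-libevent=/usr
make -j8 && make install

# 2. Rebuild Slurm (or at least its mpi/pmix plugin) against that PMIx
cd slurm-18.08.0
./configure --prefix=/opt/slurm/18.08.0 --with-pmix=/opt/pmix/2.1.4
make -j8 && make install

# 3. Reconfigure OMPI 3.1.2 against the same PMIx install
./configure \
    --prefix=${PREFIX} \
    --mandir=${PREFIX}/share/man \
    --with-pmix=/opt/pmix/2.1.4 \
    --with-libevent=external \
    --with-hwloc=external \
    --with-slurm \
    --with-verbs \
    --disable-dlopen --enable-shared \
    CC=gcc CXX=g++ FC=gfortran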

Thanks, and I'll give it a try!

-- bennet


On Mon, Nov 12, 2018 at 12:42 PM Ralph H Castain wrote:

> mpirun should definitely still work in parallel with srun - they aren’t
> mutually exclusive. OMPI 3.1.2 contains PMIx v2.1.3.
>
> The problem here is that you built Slurm against PMIx v2.0.2, which is not
> cross-version capable. You can see the cross-version situation here:
> https://pmix.org/support/faq/how-does-pmix-work-with-containers/
>
> Your options would be to build OMPI against the same PMIx 2.0.2 you used
> for Slurm, or update the PMIx version you used for Slurm to something that
> can support cross-version operations.
>
> Ralph

Re: [OMPI users] OMPI 3.1.x, PMIx, SLURM, and mpiexec/mpirun

2018-11-12 Thread Ralph H Castain
mpirun should definitely still work in parallel with srun - they aren’t 
mutually exclusive. OMPI 3.1.2 contains PMIx v2.1.3.

The problem here is that you built Slurm against PMIx v2.0.2, which is not 
cross-version capable. You can see the cross-version situation here: 
https://pmix.org/support/faq/how-does-pmix-work-with-containers/

Your options would be to build OMPI against the same PMIx 2.0.2 you used for 
Slurm, or update the PMIx version you used for Slurm to something that can 
support cross-version operations.
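
In configure terms, that is one of these two (the paths are just examples -
use whatever prefixes you actually have installed):

# Option 1: point OMPI at the same PMIx that Slurm was built against
./configure --with-pmix=/opt/pmix/2.0.2 --with-libevent=external --with-hwloc=external

# Option 2: install PMIx 2.1.x (cross-version capable), rebuild Slurm's
# mpi/pmix plugin against it, then configure OMPI the same way
./configure --with-pmix=/opt/pmix/2.1.4 --with-libevent=external --with-hwloc=external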

Ralph


[OMPI users] OMPI 3.1.x, PMIx, SLURM, and mpiexec/mpirun

2018-11-11 Thread Bennet Fauber
I have been having some difficulty getting the right combination of
SLURM, PMIx, and OMPI 3.1.x (specifically 3.1.2) to compile in such a way
that both the srun method of starting jobs and mpirun/mpiexec work.

If someone has a SLURM 18.08 or newer, PMIx, and OMPI 3.x combination that
works with both srun and mpirun, and wouldn't mind sending me the version
numbers and any tips for getting it to work, I would appreciate it.

Should mpirun still work?  If that is just off the table and I missed the
memo, please let me know.

I'm asking for both because of programs like OpenFOAM and others where
mpirun is built into the application.  I have OMPI 1.10.7 built with
similar flags, and it seems to work.

[bennet@beta-build mpi_example]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is:  0.000458

[bennet@beta-build mpi_example]$ mpirun ./test_mpi
The sum = 0.866386
Elapsed time is:  0.000295

The SLURM documentation doesn't seem to list a recommended PMIx version, as
far as I can find, and I can't find where the version of PMIx that is bundled
with OMPI is specified.
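
The closest checks I can think of are something like the following, though I
am not sure either one actually reports the embedded PMIx release:

# lists the pmix MCA components that were built into this OMPI
ompi_info | grep -i pmix
# embedded PMIx source in the 3.1.2 tarball, if I have the path right
cat opal/mca/pmix/pmix2x/pmix/VERSION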

I have SLURM 18.08.0, which is built against PMIx 2.0.2.  We settled on
that version back with SLURM 17.something, prior to SLURM supporting PMIx 2.1.
Is OMPI 3.1.2 balking at too old a PMIx?
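
On the Slurm side, the only checks I know of are along these lines (the
plugin path below is a guess for our install):

# should list pmix / pmix_v2 among the available MPI plugins
srun --mpi=list
# shows which libpmix the plugin actually links against
ldd /usr/lib64/slurm/mpi_pmix_v2.so | grep pmix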

Sorry to be so at sea.

I built OMPI with

./configure \
--prefix=${PREFIX} \
--mandir=${PREFIX}/share/man \
--with-pmix=/opt/pmix/2.0.2 \
--with-libevent=external \
--with-hwloc=external \
--with-slurm \
--with-verbs \
--disable-dlopen --enable-shared \
CC=gcc CXX=g++ FC=gfortran

I have a simple test program, and it runs with

[bennet@beta-build mpi_example]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is:  0.000573

but on a login node, where I just want to use a few processors on the local
node rather than run on the cluster's compute nodes, mpirun fails with

[bennet@beta-build mpi_example]$ mpirun -np 2 ./test_mpi
[beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG:
Not found in file base/ess_base_std_app.c at line 219
[beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG:
Not found in file base/ess_base_std_app.c at line 219
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  store DAEMON URI failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG:
Not found in file ess_pmi_module.c at line 401
[beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG:
Not found in file ess_pmi_module.c at line 401
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[beta-build.stage.arc-ts.umich.edu:102541] Local abort before MPI_INIT
completed completed successfully, but am not able to aggregate error
messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[beta-build.stage.arc-ts.umich.edu:102542] Local abort before MPI_INIT
completed completed successfully, but am not able to aggregate error
messages, and not able to guarantee that all other processes were killed!
--