Re: [OMPI users] Can't start jobs with srun.

2020-04-24 Thread Riebs, Andy via users
Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
out non-OMPI problems?
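For example, something along these lines (the node count is just a placeholder):

$ salloc -N 3                  # grab three nodes
$ srun -N 3 hostname           # should print three hostnames almost immediately

If that hangs or errors out, the problem is in the Slurm setup itself rather
than in Open MPI.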

Andy

-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain ; Open MPI Users 
Cc: Prentice Bisbal 
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to start but then 
does nothing - it just hangs in the running state. Any ideas what could be 
wrong, or how to debug this?
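A few generic things that might narrow it down (a sketch only; the executable
name and PID are placeholders, and the MCA verbosity parameter is an assumption
about this particular build):

$ srun --mpi=pmix_v3 -n 2 ./mpi_hello
    # name the PMI plugin explicitly instead of relying on the default
$ OMPI_MCA_pmix_base_verbose=10 srun --mpi=pmix_v3 -n 2 ./mpi_hello
    # extra PMIx chatter, assuming that verbosity parameter exists in this build
$ gdb --batch -p <pid-of-a-hung-rank> -ex 'thread apply all bt'
    # run on the compute node to see where a rank is stuck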

I'm also asking about this on the Slurm mailing list.

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:
> You can trust the --mpi=list output. The problem is likely that OMPI wasn't 
> configured --with-pmi2
>
>
>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
>>  wrote:
>>
>> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
>> either of them, my job still fails. Why is that? Can I not trust the output 
>> of --mpi=list?
>>
>> Prentice
>>
>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>>> No, but you do have to explicitly build OMPI with non-PMIx support if that 
>>> is what you are going to use. In this case, you need to configure OMPI 
>>> --with-pmi2=
>>>
>>> You can leave off the path (i.e., just "--with-pmi2") if Slurm was installed 
>>> in a standard location, as we should find it there.
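In other words, a rebuild roughly like this (install prefix and Slurm path are
placeholders; the flag is the one named above):

$ ./configure --prefix=<install-dir> --with-slurm --with-pmi2=<slurm-prefix>
$ make -j 8 all && make install
$ ompi_info | grep -i pmi          # check which PMI/PMIx components were built
$ srun --mpi=pmi2 -n 2 ./mpi_hello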
>>>
>>>
 On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
  wrote:

 It looks like it was built with PMI2, but not PMIx:

 $ srun --mpi=list
 srun: MPI types are...
 srun: none
 srun: pmi2
 srun: openmpi

 I did launch the job with srun --mpi=pmi2 

 Does OpenMPI 4 need PMIx specifically?


 On 4/23/20 10:23 AM, Ralph Castain via users wrote:
> Is Slurm built with PMIx support? Did you tell srun to use it?
>
>
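The usual way to answer both questions from the command line is roughly (a
sketch):

$ srun --mpi=list                            # a pmix entry should appear if Slurm has the plugin
$ scontrol show config | grep -i MpiDefault  # or pass --mpi=pmix explicitly on every srun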
>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>>  wrote:
>>
>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the software 
>> with a very simple hello, world MPI program that I've used reliably for 
>> years. When I submit the job through slurm and use srun to launch the 
>> job, I get these errors:
>>
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed 
>> completed successfully, but am not able to aggregate error messages, and 
>> not able to guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed 
>> completed successfully, but am not able to aggregate error messages, and 
>> not able to guarantee that all other processes were killed!
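For reference, the sort of trivial test program being described is essentially
the following (a generic sketch, not the actual source):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* the call that aborts under srun */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello, world from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}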
>>
>> If I run the same job, but use mpiexec or mpirun instead of srun, the 
>> jobs run just fine. I checked ompi_info to make sure OpenMPI was 
>> compiled with Slurm support:
>>
>> $ ompi_info | grep slurm
>>   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
>>                            '--disable-silent-rules' '--enable-shared' 
>>                            '--with-pmix=internal' '--with-slurm' '--with-psm'
>>                   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>>                   MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>                   MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>                MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
>>
>> Any ideas what could be wrong? Do you need any additional information?
>>
>> Prentice
>>
>


[OMPI users] RMA in openmpi

2020-04-24 Thread Claire Cashmore via users
Hello

I was wondering if someone could help me with a question.

When using RMA, is there a requirement to use some type of synchronization? 
When using one-sided communication such as MPI_Get, the code will only run when 
I combine it with MPI_Win_fence or MPI_Win_lock/unlock. I do not want to use 
MPI_Win_fence, as I’m using the one-sided communication to allow communication 
while processes are not synchronised, so a collective fence defeats the point. 
I could use MPI_Win_lock/unlock; however, someone I’ve spoken to has said that 
I should be able to use RMA without any synchronization calls. If that is true, 
I would prefer to do so, to avoid the overhead that calling MPI_Win_lock around 
every one-sided transfer might add.

Is it possible to use the one-sided communication without combining it with 
synchronization calls?

(It doesn’t seem to matter what version of openmpi I use).
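For what it's worth, the MPI standard does require every RMA access to complete
inside some synchronization epoch, but that epoch does not have to be a
collective fence: passive-target synchronization (MPI_Win_lock/unlock, or a
single MPI_Win_lock_all plus MPI_Win_flush per transfer) lets each process
proceed independently of the others. A minimal sketch of that pattern
(illustrative only, not Claire's code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, local, remote = -1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local = rank;                            /* value each rank exposes */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* One collective lock_all up front; afterwards ranks proceed independently. */
    MPI_Win_lock_all(0, win);

    int target = (rank + 1) % nprocs;
    MPI_Get(&remote, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_flush(target, win);              /* completes this Get; no fence involved */
    printf("rank %d read %d from rank %d\n", rank, remote, target);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Without the flush (or an unlock), there is no point at which the MPI_Get is
guaranteed to have completed, which is why some synchronization call is always
needed; lock_all/flush is usually the cheapest choice when processes must not
wait for one another.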

Thank you

Claire