Re: [OMPI users] Can't start jobs with srun.

2020-05-07 Thread Patrick Bégou via users
On 07/05/2020 at 11:42, John Hearns via users wrote:
> Patrick, I am sure that you have asked Dell for support on this issue?

No, I didn't :-(. I only had access to these servers for a short time to
run a benchmark, and the workaround was enough. I'm not using Slurm but a
local scheduler (OAR), so the problem was not critical for my future work.


Patrick

>
> On Sun, 26 Apr 2020 at 18:09, Patrick Bégou via users
> <users@lists.open-mpi.org> wrote:
>
> I also have this problem on servers I'm benchmarking at Dell's lab with
> OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
> change.
> Finally, my workaround in the Slurm script was to launch my code with
> mpirun. As mpirun was only finding one slot per node, I used
> "--oversubscribe --bind-to core" and checked that every process was
> bound to a separate core. It worked, but do not ask me why :-)
>
> Patrick
>
> > On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
> > Prentice, have you tried something trivial, like "srun -N3
> hostname", to rule out non-OMPI problems?
> >
> > Andy
> >
> > -----Original Message-----
> > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice
> > Bisbal via users
> > Sent: Friday, April 24, 2020 2:19 PM
> > To: Ralph Castain <r...@open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
> > Cc: Prentice Bisbal <pbis...@pppl.gov>
> > Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
> >
> > Okay. I've got Slurm built with pmix support:
> >
> > $ srun --mpi=list
> > srun: MPI types are...
> > srun: none
> > srun: pmix_v3
> > srun: pmi2
> > srun: openmpi
> > srun: pmix
> >
> > But now when I try to launch a job with srun, the job appears to be
> > running, but doesn't do anything - it just hangs in the running
> state
> > but doesn't do anything. Any ideas what could be wrong, or how
> to debug
> > this?
> >
> > I'm also asking around on the Slurm mailing list, too
> >
> > Prentice
> >
> > On 4/23/20 3:03 PM, Ralph Castain wrote:
> >> You can trust the --mpi=list. The problem is likely that OMPI
> wasn't configured --with-pmi2
> >>
> >>
> >>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users
> <users@lists.open-mpi.org> wrote:
> >>>
> >>> --mpi=list shows pmi2 and openmpi as valid values, but if I
> set --mpi= to either of them, my job still fails. Why is that? Can
> I not trust the output of --mpi=list?
> >>>
> >>> Prentice
> >>>
> >>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>  No, but you do have to explicitly build OMPI with non-PMIx
> support if that is what you are going to use. In this case, you
> need to configure OMPI --with-pmi2=
> 
>  You can leave off the path (i.e., just
> "--with-pmi2") if Slurm was installed in a standard location, as we
> should find it there.
> 
> 
> > On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users
> <users@lists.open-mpi.org> wrote:
> >
> > It looks like it was built with PMI2, but not PMIx:
> >
> > $ srun --mpi=list
> > srun: MPI types are...
> > srun: none
> > srun: pmi2
> > srun: openmpi
> >
> > I did launch the job with srun --mpi=pmi2 
> >
> > Does OpenMPI 4 need PMIx specifically?
> >
> >
> > On 4/23/20 10:23 AM, Ralph Castain via users wrote:
> >> Is Slurm built with PMIx support? Did you tell srun to use it?
> >>
> >>
> >>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users
> <users@lists.open-mpi.org> wrote:
> >>>
> >>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing
> the software with a very simple hello, world MPI program that I've
> used reliably for years. When I submit the job through slurm and
> use srun to launch the job, I get these errors:
> >>>
> >>> *** An error occurred in MPI_Init
> >>> *** on a NULL communicator
> >>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> >>> ***    and potentially your MPI job)
> >>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT
> completed completed successfully, but am not able to aggregate
> error messages, and not able to guarantee that all other processes
> were killed!
> >>> *** An error occurred in MPI_Init
> >>> *** on a NULL communicator
> >>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> >>> ***    and potentially your MPI job)
> >>> 

Re: [OMPI users] Can't start jobs with srun.

2020-05-07 Thread John Hearns via users
Patrick, I am sure that you have asked Dell for support on this issue?

On Sun, 26 Apr 2020 at 18:09, Patrick Bégou via users <
users@lists.open-mpi.org> wrote:

> I also have this problem on servers I'm benchmarking at Dell's lab with
> OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
> change.
> Finally, my workaround in the Slurm script was to launch my code with
> mpirun. As mpirun was only finding one slot per node, I used
> "--oversubscribe --bind-to core" and checked that every process was
> bound to a separate core. It worked, but do not ask me why :-)
>
> Patrick
>
> On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
> > Prentice, have you tried something trivial, like "srun -N3 hostname", to
> rule out non-OMPI problems?
> >
> > Andy
> >
> > -----Original Message-----
> > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> Prentice Bisbal via users
> > Sent: Friday, April 24, 2020 2:19 PM
> > To: Ralph Castain ; Open MPI Users <
> users@lists.open-mpi.org>
> > Cc: Prentice Bisbal 
> > Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
> >
> > Okay. I've got Slurm built with pmix support:
> >
> > $ srun --mpi=list
> > srun: MPI types are...
> > srun: none
> > srun: pmix_v3
> > srun: pmi2
> > srun: openmpi
> > srun: pmix
> >
> > But now when I try to launch a job with srun, the job appears to be
> > running, but doesn't do anything - it just hangs in the running state
> > but doesn't do anything. Any ideas what could be wrong, or how to debug
> > this?
> >
> > I'm also asking around on the Slurm mailing list, too
> >
> > Prentice
> >
> > On 4/23/20 3:03 PM, Ralph Castain wrote:
> >> You can trust the --mpi=list. The problem is likely that OMPI wasn't
> configured --with-pmi2
> >>
> >>
> >>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users <
> users@lists.open-mpi.org> wrote:
> >>>
> >>> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi=
> to either of them, my job still fails. Why is that? Can I not trust the
> output of --mpi=list?
> >>>
> >>> Prentice
> >>>
> >>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>  No, but you do have to explicitly build OMPI with non-PMIx support if
> that is what you are going to use. In this case, you need to configure OMPI
> --with-pmi2=
> 
>  You can leave off the path if Slurm (i.e., just "--with-pmi2") was
> installed in a standard location as we should find it there.
> 
> 
> > On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users <
> users@lists.open-mpi.org> wrote:
> >
> > It looks like it was built with PMI2, but not PMIx:
> >
> > $ srun --mpi=list
> > srun: MPI types are...
> > srun: none
> > srun: pmi2
> > srun: openmpi
> >
> > I did launch the job with srun --mpi=pmi2 
> >
> > Does OpenMPI 4 need PMIx specifically?
> >
> >
> > On 4/23/20 10:23 AM, Ralph Castain via users wrote:
> >> Is Slurm built with PMIx support? Did you tell srun to use it?
> >>
> >>
> >>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users <
> users@lists.open-mpi.org> wrote:
> >>>
> >>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the
> software with a very simple hello, world MPI program that I've used
> reliably for years. When I submit the job through slurm and use srun to
> launch the job, I get these errors:
> >>>
> >>> *** An error occurred in MPI_Init
> >>> *** on a NULL communicator
> >>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
> abort,
> >>> ***    and potentially your MPI job)
> >>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed
> completed successfully, but am not able to aggregate error messages, and
> not able to guarantee that all other processes were killed!
> >>> *** An error occurred in MPI_Init
> >>> *** on a NULL communicator
> >>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
> abort,
> >>> ***    and potentially your MPI job)
> >>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed
> completed successfully, but am not able to aggregate error messages, and
> not able to guarantee that all other processes were killed!
> >>>
> >>> If I run the same job, but use mpiexec or mpirun instead of srun,
> the jobs run just fine. I checked ompi_info to make sure OpenMPI was
> compiled with  Slurm support:
> >>>
> >>> $ ompi_info | grep slurm
> >>>Configure command line:
> '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' '--disable-silent-rules'
> '--enable-shared' '--with-pmix=internal' '--with-slurm' '--with-psm'
> >>>   MCA ess: slurm (MCA v2.1.0, API v3.0.0,
> Component v4.0.3)
> >>>   MCA plm: slurm (MCA v2.1.0, API v2.0.0,
> Component v4.0.3)
> >>>   MCA ras: slurm (MCA v2.1.0, API v2.0.0,
> Component v4.0.3)
> >>>    MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

Re: [OMPI users] Can't start jobs with srun.

2020-04-28 Thread Daniel Letai via users

  
  
I know it's not supposed to matter, but have you tried building
  both ompi and slurm against the same pmix? That is - first build pmix,
  then build slurm with-pmix, and then ompi with both slurm and
  pmix=external?
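
A rough sketch of that build chain (the install prefixes below are only
examples, not the poster's actual paths):

  # 1. build PMIx first
  $ cd openpmix-3.1.5 && ./configure --prefix=/opt/pmix && make install
  # 2. build Slurm against that same PMIx
  $ cd slurm-19.05.5 && ./configure --prefix=/opt/slurm --with-pmix=/opt/pmix && make install
  # 3. build Open MPI against the same, external PMIx (and Slurm)
  $ cd openmpi-4.0.3 && ./configure --prefix=/opt/ompi --with-slurm --with-pmix=/opt/pmix && make install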





On 23/04/2020 17:00, Prentice Bisbal via users wrote:

  $ ompi_info | grep slurm
    Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3'
  '--disable-silent-rules' '--enable-shared' '--with-pmix=internal'
  '--with-slurm' '--with-psm'
   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
   MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
   MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
    MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

  Any ideas what could be wrong? Do you need any additional information?

  Prentice



Re: [OMPI users] Can't start jobs with srun.

2020-04-27 Thread Riebs, Andy via users
Lost a line…

Also helpful to check

$ srun -N3 which ompi_info

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Riebs, Andy 
via users
Sent: Monday, April 27, 2020 10:59 AM
To: Open MPI Users 
Cc: Riebs, Andy 
Subject: Re: [OMPI users] Can't start jobs with srun.

Y’know, a quick check on versions and PATHs might be a good idea here. I 
suggest something like

$ srun  -N3  ompi_info  |&  grep  "MPI repo"

to confirm that all nodes are running the same version of OMPI.

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Monday, April 27, 2020 10:25 AM
To: users@lists.open-mpi.org
Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.


Ralph,

PMI2 support works just fine. It's just PMIx that seems to be the problem.

We rebuilt Slurm with PMIx 3.1.5, but the problem persists. I've opened a 
ticket with Slurm support to see if it's a problem on Slurm's end.

Prentice
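
One way to confirm which PMIx library the Slurm PMIx plugin actually linked
against (the plugin path below is a typical location, not necessarily this
cluster's):

$ ldd /usr/lib64/slurm/mpi_pmix_v3.so | grep -i pmix
$ ompi_info | grep -i pmix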
On 4/26/20 2:12 PM, Ralph Castain via users wrote:
It is entirely possible that the PMI2 support in OMPI v4 is broken - I doubt it 
is used or tested very much as pretty much everyone has moved to PMIx. In fact, 
we completely dropped PMI-1 and PMI-2 from OMPI v5 for that reason.

I would suggest building Slurm with PMIx v3.1.5 
(https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that is what OMPI 
v4 is using, and launching with "srun --mpi=pmix_v3"


On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users 
<users@lists.open-mpi.org> wrote:

I also have this problem on servers I'm benchmarking at Dell's lab with
OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally, my workaround in the Slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but do not ask me why :-)

Patrick

On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
out non-OMPI problems?

Andy

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain <r...@open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be
running, but doesn't do anything - it just hangs in the running state
but doesn't do anything. Any ideas what could be wrong, or how to debug
this?

I'm also asking around on the Slurm mailing list, too

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:
You can trust the --mpi=list. The problem is likely that OMPI wasn't configured 
--with-pmi2


On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
either of them, my job still fails. Why is that? Can I not trust the output of 
--mpi=list?

Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:
No, but you do have to explicitly build OMPI with non-PMIx support if that is 
what you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path (i.e., just "--with-pmi2") if Slurm was installed in 
a standard location, as we should find it there.


On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:
Is Slurm built with PMIx support? Did you tell srun to use it?


On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before 
MPI_INIT completed completed successfully, but am not able to aggregate error 
messages, and not able to guarantee that all other processes were killed!

Re: [OMPI users] Can't start jobs with srun.

2020-04-27 Thread Riebs, Andy via users
Y’know, a quick check on versions and PATHs might be a good idea here. I 
suggest something like

$ srun  -N3  ompi_info  |&  grep  "MPI repo"

to confirm that all nodes are running the same version of OMPI.

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Monday, April 27, 2020 10:25 AM
To: users@lists.open-mpi.org
Cc: Prentice Bisbal 
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.


Ralph,

PMI2 support works just fine. It's just PMIx that seems to be the problem.

We rebuilt Slurm with PMIx 3.1.5, but the problem persists. I've opened a 
ticket with Slurm support to see if it's a problem on Slurm's end.

Prentice
On 4/26/20 2:12 PM, Ralph Castain via users wrote:
It is entirely possible that the PMI2 support in OMPI v4 is broken - I doubt it 
is used or tested very much as pretty much everyone has moved to PMIx. In fact, 
we completely dropped PMI-1 and PMI-2 from OMPI v5 for that reason.

I would suggest building Slurm with PMIx v3.1.5 
(https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that is what OMPI 
v4 is using, and launching with "srun --mpi=pmix_v3"



On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users 
<users@lists.open-mpi.org> wrote:

I also have this problem on servers I'm benchmarking at Dell's lab with
OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally, my workaround in the Slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but do not ask me why :-)

Patrick

On 24/04/2020 at 20:28, Riebs, Andy via users wrote:

Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
out non-OMPI problems?

Andy

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain <r...@open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be
running, but doesn't do anything - it just hangs in the running state
but doesn't do anything. Any ideas what could be wrong, or how to debug
this?

I'm also asking around on the Slurm mailing list, too

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:

You can trust the --mpi=list. The problem is likely that OMPI wasn't configured 
--with-pmi2



On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
either of them, my job still fails. Why is that? Can I not trust the output of 
--mpi=list?

Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:

No, but you do have to explicitly build OMPI with non-PMIx support if that is 
what you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path (i.e., just "--with-pmi2") if Slurm was installed in 
a standard location, as we should find it there.



On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:

Is Slurm built with PMIx support? Did you tell srun to use it?



On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before 
MPI_INIT completed completed successfully, but am not able to aggregate error 
messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort before 
MPI_INIT completed completed successfully, but am not able to aggregate error 
messages, and not able to guarantee that all other processes were killed!

Re: [OMPI users] Can't start jobs with srun.

2020-04-26 Thread Ralph Castain via users
It is entirely possible that the PMI2 support in OMPI v4 is broken - I doubt it 
is used or tested very much as pretty much everyone has moved to PMIx. In fact, 
we completely dropped PMI-1 and PMI-2 from OMPI v5 for that reason.

I would suggest building Slurm with PMIx v3.1.5 
(https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that is what OMPI 
v4 is using, and launching with "srun --mpi=pmix_v3"
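
A launch along those lines might look like this (node/task counts and the
binary name ./hello_mpi are placeholders):

$ srun --mpi=list                     # pmix_v3 should now show up here
$ srun --mpi=pmix_v3 -N 3 -n 96 ./hello_mpi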


On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users <users@lists.open-mpi.org> wrote:

I also have this problem on servers I'm benchmarking at Dell's lab with
OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally, my workaround in the Slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but do not ask me why :-)

Patrick

On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
out non-OMPI problems?

Andy

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice Bisbal via 
users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain <r...@open-mpi.org>; Open MPI 
Users <users@lists.open-mpi.org>
Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be 
running, but doesn't do anything - it just hangs in the running state 
but doesn't do anything. Any ideas what could be wrong, or how to debug 
this?

I'm also asking around on the Slurm mailing list, too

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:
You can trust the --mpi=list. The problem is likely that OMPI wasn't configured 
--with-pmi2


On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
either of them, my job still fails. Why is that? Can I not trust the output of 
--mpi=list?

Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:
No, but you do have to explicitly build OMPI with non-PMIx support if that is 
what you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path (i.e., just "--with-pmi2") if Slurm was installed in 
a standard location, as we should find it there.


On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:
Is Slurm built with PMIx support? Did you tell srun to use it?


On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort 
before MPI_INIT completed completed successfully, but am not able to aggregate 
error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort 
before MPI_INIT completed completed successfully, but am not able to aggregate 
error messages, and not able to guarantee that all other processes were killed!

If I run the same job, but use mpiexec or mpirun instead of srun, the jobs run 
just fine. I checked ompi_info to make sure OpenMPI was compiled with  Slurm 
support:

$ ompi_info | grep slurm
   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm'
  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)


Any ideas what could be wrong? Do you need any additional information?

Re: [OMPI users] Can't start jobs with srun.

2020-04-26 Thread Patrick Bégou via users
I also have this problem on servers I'm benchmarking at Dell's lab with
OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally, my workaround in the Slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but do not ask me why :-)

Patrick
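
For reference, a minimal sketch of such a batch script (the resource numbers
and the executable name ./my_bench are hypothetical):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32

# mpirun saw only one slot per node here, hence oversubscribe + explicit core binding
mpirun --oversubscribe --bind-to core -np $SLURM_NTASKS ./my_bench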

On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
> Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
> out non-OMPI problems?
>
> Andy
>
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
> Bisbal via users
> Sent: Friday, April 24, 2020 2:19 PM
> To: Ralph Castain ; Open MPI Users 
> 
> Cc: Prentice Bisbal 
> Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
>
> Okay. I've got Slurm built with pmix support:
>
> $ srun --mpi=list
> srun: MPI types are...
> srun: none
> srun: pmix_v3
> srun: pmi2
> srun: openmpi
> srun: pmix
>
> But now when I try to launch a job with srun, the job appears to be 
> running, but doesn't do anything - it just hangs in the running state 
> but doesn't do anything. Any ideas what could be wrong, or how to debug 
> this?
>
> I'm also asking around on the Slurm mailing list, too
>
> Prentice
>
> On 4/23/20 3:03 PM, Ralph Castain wrote:
>> You can trust the --mpi=list. The problem is likely that OMPI wasn't 
>> configured --with-pmi2
>>
>>
>>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
>>>  wrote:
>>>
>>> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
>>> either of them, my job still fails. Why is that? Can I not trust the output 
>>> of --mpi=list?
>>>
>>> Prentice
>>>
>>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
 No, but you do have to explicitly build OMPI with non-PMIx support if that 
 is what you are going to use. In this case, you need to configure OMPI 
 --with-pmi2=

 You can leave off the path (i.e., just "--with-pmi2") if Slurm was 
 installed in a standard location, as we should find it there.


> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
>  wrote:
>
> It looks like it was built with PMI2, but not PMIx:
>
> $ srun --mpi=list
> srun: MPI types are...
> srun: none
> srun: pmi2
> srun: openmpi
>
> I did launch the job with srun --mpi=pmi2 
>
> Does OpenMPI 4 need PMIx specifically?
>
>
> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
>> Is Slurm built with PMIx support? Did you tell srun to use it?
>>
>>
>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>>>  wrote:
>>>
>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software 
>>> with a very simple hello, world MPI program that I've used reliably for 
>>> years. When I submit the job through slurm and use srun to launch the 
>>> job, I get these errors:
>>>
>>> *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***    and potentially your MPI job)
>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed 
>>> completed successfully, but am not able to aggregate error messages, 
>>> and not able to guarantee that all other processes were killed!
>>> *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***    and potentially your MPI job)
>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed 
>>> completed successfully, but am not able to aggregate error messages, 
>>> and not able to guarantee that all other processes were killed!
>>>
>>> If I run the same job, but use mpiexec or mpirun instead of srun, the 
>>> jobs run just fine. I checked ompi_info to make sure OpenMPI was 
>>> compiled with  Slurm support:
>>>
>>> $ ompi_info | grep slurm
>>>Configure command line: 
>>> '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
>>> '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
>>> '--with-slurm' '--with-psm'
>>>   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component 
>>> v4.0.3)
>>>   MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component 
>>> v4.0.3)
>>>   MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component 
>>> v4.0.3)
>>>MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component 
>>> v4.0.3)
>>>
>>> Any ideas what could be wrong? Do you need any additional information?
>>>
>>> Prentice
>>>



Re: [OMPI users] Can't start jobs with srun.

2020-04-24 Thread Riebs, Andy via users
Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
out non-OMPI problems?

Andy

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain ; Open MPI Users 
Cc: Prentice Bisbal 
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be 
running, but doesn't do anything - it just hangs in the running state 
but doesn't do anything. Any ideas what could be wrong, or how to debug 
this?

I'm also asking around on the Slurm mailing list, too

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:
> You can trust the --mpi=list. The problem is likely that OMPI wasn't 
> configured --with-pmi2
>
>
>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
>>  wrote:
>>
>> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
>> either of them, my job still fails. Why is that? Can I not trust the output 
>> of --mpi=list?
>>
>> Prentice
>>
>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>>> No, but you do have to explicitly build OMPI with non-PMIx support if that 
>>> is what you are going to use. In this case, you need to configure OMPI 
>>> --with-pmi2=
>>>
>>> You can leave off the path (i.e., just "--with-pmi2") if Slurm was 
>>> installed in a standard location, as we should find it there.
>>>
>>>
 On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
  wrote:

 It looks like it was built with PMI2, but not PMIx:

 $ srun --mpi=list
 srun: MPI types are...
 srun: none
 srun: pmi2
 srun: openmpi

 I did launch the job with srun --mpi=pmi2 

 Does OpenMPI 4 need PMIx specifically?


 On 4/23/20 10:23 AM, Ralph Castain via users wrote:
> Is Slurm built with PMIx support? Did you tell srun to use it?
>
>
>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>>  wrote:
>>
>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software 
>> with a very simple hello, world MPI program that I've used reliably for 
>> years. When I submit the job through slurm and use srun to launch the 
>> job, I get these errors:
>>
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed 
>> completed successfully, but am not able to aggregate error messages, and 
>> not able to guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed 
>> completed successfully, but am not able to aggregate error messages, and 
>> not able to guarantee that all other processes were killed!
>>
>> If I run the same job, but use mpiexec or mpirun instead of srun, the 
>> jobs run just fine. I checked ompi_info to make sure OpenMPI was 
>> compiled with  Slurm support:
>>
>> $ ompi_info | grep slurm
>>Configure command line: 
>> '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
>> '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
>> '--with-slurm' '--with-psm'
>>   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component 
>> v4.0.3)
>>   MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component 
>> v4.0.3)
>>   MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component 
>> v4.0.3)
>>MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component 
>> v4.0.3)
>>
>> Any ideas what could be wrong? Do you need any additional information?
>>
>> Prentice
>>
>


Re: [OMPI users] Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
Is Slurm built with PMIx support? Did you tell srun to use it?
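
A couple of quick checks along those lines (the plugin directory and the
program name are typical examples, not necessarily this site's):

$ srun --mpi=list                  # PMI plugins this Slurm build provides
$ ls /usr/lib64/slurm/mpi_pmix*    # PMIx plugin files, present only if Slurm was built with PMIx
$ srun --mpi=pmix -n 4 ./hello_mpi # explicitly select PMIx at launch time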


> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>  wrote:
> 
> I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
> very simple hello, world MPI program that I've used reliably for years. When 
> I submit the job through slurm and use srun to launch the job, I get these 
> errors:
> 
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed completed 
> successfully, but am not able to aggregate error messages, and not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed completed 
> successfully, but am not able to aggregate error messages, and not able to 
> guarantee that all other processes were killed!
> 
> If I run the same job, but use mpiexec or mpirun instead of srun, the jobs 
> run just fine. I checked ompi_info to make sure OpenMPI was compiled with  
> Slurm support:
> 
> $ ompi_info | grep slurm
>   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
> '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
> '--with-slurm' '--with-psm'
>  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
> 
> Any ideas what could be wrong? Do you need any additional information?
> 
> Prentice
> 




[OMPI users] Can't start jobs with srun.

2020-04-23 Thread Prentice Bisbal via users
I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the software 
with a very simple hello, world MPI program that I've used reliably for 
years. When I submit the job through slurm and use srun to launch the 
job, I get these errors:


*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before MPI_INIT completed 
completed successfully, but am not able to aggregate error messages, and 
not able to guarantee that all other processes were killed!

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort before MPI_INIT completed 
completed successfully, but am not able to aggregate error messages, and 
not able to guarantee that all other processes were killed!


If I run the same job, but use mpiexec or mpirun instead of srun, the 
jobs run just fine. I checked ompi_info to make sure OpenMPI was 
compiled with  Slurm support:


$ ompi_info | grep slurm
  Configure command line: 
'--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm'

 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
  MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

Any ideas what could be wrong? Do you need any additional information?

Prentice
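
For context, a minimal hello-world MPI program of the kind described above (a
sketch, not the author's exact code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* MPI_Init is where the failures above occur when the launcher does not
       provide a PMI/PMIx environment that Open MPI recognizes. */
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}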