Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Ralph H Castain
Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm plugin 
folks seem to be off somewhere for a while and haven’t been testing it. Sigh.

I’ll patch the branch and let you know - we’d appreciate the feedback.
Ralph


> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico wrote:
> 
> here are the branches i'm using.  i did a git clone of the repos and
> then a git checkout
> 
> [ec2-user@labhead bin]$ cd /hpc/src/pmix/
> [ec2-user@labhead pmix]$ git branch
>  master
> * v2.2
> [ec2-user@labhead pmix]$ cd ../slurm/
> [ec2-user@labhead slurm]$ git branch
> * (detached from origin/slurm-18.08)
>  master
> [ec2-user@labhead slurm]$ cd ../ompi/
> [ec2-user@labhead ompi]$ git branch
> * (detached from origin/v3.1.x)
>  master
> 
> 
> attached is the debug out from the run with the debugging turned on
> 

Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Michael Di Domenico
here are the branches i'm using.  i did a git clone of the repos and
then a git checkout

[ec2-user@labhead bin]$ cd /hpc/src/pmix/
[ec2-user@labhead pmix]$ git branch
  master
* v2.2
[ec2-user@labhead pmix]$ cd ../slurm/
[ec2-user@labhead slurm]$ git branch
* (detached from origin/slurm-18.08)
  master
[ec2-user@labhead slurm]$ cd ../ompi/
[ec2-user@labhead ompi]$ git branch
* (detached from origin/v3.1.x)
  master


attached is the debug out from the run with the debugging turned on
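
For anyone reproducing this, here is a rough sketch of how the same checkout information could be gathered in one pass, including the nearest tag or commit, which may help tell a v2.2.0 tree from a v2.2.1 one. The source paths are the ones from the listing above; the show_checkout helper name is made up for illustration and is not part of any of these projects.

```shell
#!/bin/sh
# Hypothetical helper (not part of PMIx, Slurm, or Open MPI):
# print the current branch and nearest tag/commit of a source tree.
show_checkout() {
    dir=$1
    if [ ! -e "$dir/.git" ]; then
        echo "$dir: not a git checkout"
        return 1
    fi
    # On a detached HEAD, rev-parse --abbrev-ref prints "HEAD".
    branch=$(cd "$dir" && git rev-parse --abbrev-ref HEAD)
    # --always falls back to the abbreviated commit hash if no tag is reachable.
    version=$(cd "$dir" && git describe --tags --always)
    echo "$dir: branch=$branch version=$version"
}

# Source locations taken from the listing above.
for tree in /hpc/src/pmix /hpc/src/slurm /hpc/src/ompi; do
    show_checkout "$tree" || true
done
```

The commit or tag is probably more useful than the branch name alone, since a branch such as v2.2 keeps moving between point releases.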

On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
>
> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
> notification system in the Slurm plugin, but you should only be trying to 
> call it if OMPI is registering a system-level event code - which OMPI 3.1 
> definitely doesn’t do.
>
> If you are using PMIx v2.2.0, then please note that there is a bug in it that 
> slipped through our automated testing. I replaced it today with v2.2.1 - you 
> probably should update if that’s the case. However, that wouldn’t necessarily 
> explain this behavior. I’m not that familiar with the Slurm plugin, but you 
> might try adding
>
> PMIX_MCA_pmix_client_event_verbose=5
> PMIX_MCA_pmix_server_event_verbose=5
> OMPI_MCA_pmix_base_verbose=10
>
> to your environment and see if that provides anything useful.
>

Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Ralph H Castain
Looks strange. I’m pretty sure Mellanox didn’t implement the event notification 
system in the Slurm plugin, but you should only be trying to call it if OMPI is 
registering a system-level event code - which OMPI 3.1 definitely doesn’t do.

If you are using PMIx v2.2.0, then please note that there is a bug in it that 
slipped through our automated testing. I replaced it today with v2.2.1 - you 
probably should update if that’s the case. However, that wouldn’t necessarily 
explain this behavior. I’m not that familiar with the Slurm plugin, but you 
might try adding

PMIX_MCA_pmix_client_event_verbose=5
PMIX_MCA_pmix_server_event_verbose=5
OMPI_MCA_pmix_base_verbose=10

to your environment and see if that provides anything useful.
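
A minimal sketch of one way to apply these settings, assuming a POSIX shell on the submit host and the same launch line as in the original report. The log file name is an arbitrary choice. srun passes the caller's environment to the launched tasks by default, but whether the server-side setting reaches the PMIx server inside Slurm has not been verified here.

```shell
#!/bin/sh
# Debug settings from the message above, exported so that child
# processes (including tasks launched by srun) inherit them.
export PMIX_MCA_pmix_client_event_verbose=5
export PMIX_MCA_pmix_server_event_verbose=5
export OMPI_MCA_pmix_base_verbose=10

# Same launch line as in the original report; pmix_debug.log is an
# arbitrary file name chosen for this sketch.
if command -v srun >/dev/null 2>&1; then
    srun --mpi=pmix_v2 -n 16 xhpl > pmix_debug.log 2>&1
else
    echo "srun not found; variables are set but nothing was launched" >&2
fi
```

Capturing stdout and stderr together keeps the PMIx and Open MPI messages in the order they were emitted, which makes the output easier to attach to a reply.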

> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico wrote:
> 
> i compiled pmix, slurm, and openmpi
> 
> ---pmix
> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 \
>     --disable-debug
> ---slurm
> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 \
>     --with-pmix=/hpc/pmix/2.2
> ---openmpi
> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external \
>     --with-libevent=external --with-slurm=/hpc/slurm/18.08 \
>     --with-pmix=/hpc/pmix/2.2
> 
> everything seemed to compile fine, but when i do an srun i get the
> errors below.  however, if i salloc and then mpirun, it seems to work
> fine.  i'm not quite sure where the breakdown is or how to debug it.
> 
> ---
> 
> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  ompi_interlib_declare
>  --> Returned "Would block" (-10) instead of "Success" (0)
> ...snipped...
> [labcmp6:18355] *** An error occurred in MPI_Init
> [labcmp6:18355] *** reported by process [140726281390153,15]
> [labcmp6:18355] *** on a NULL communicator
> [labcmp6:18355] *** Unknown error
> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [labcmp6:18355] *** and potentially your MPI job)
> [labcmp6:18352] *** An error occurred in MPI_Init
> [labcmp6:18352] *** reported by process [1677936713,12]
> [labcmp6:18352] *** on a NULL communicator
> [labcmp6:18352] *** Unknown error
> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [labcmp6:18352] *** and potentially your MPI job)
> [labcmp6:18354] *** An error occurred in MPI_Init
> [labcmp6:18354] *** reported by process [140726281390153,14]
> [labcmp6:18354] *** on a NULL communicator
> [labcmp6:18354] *** Unknown error
> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [labcmp6:18354] *** and potentially your MPI job)
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 ***
> [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  ompi_interlib_declare
>  --> Returned "Would block" (-10) instead of "Success" (0)
> --
> [labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
> [labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
> srun: error: labcmp6: tasks 12-15: Exited with exit code 1
> srun: error: labcmp3: tasks 0-3: Killed
> srun: error: labcmp4: tasks 4-7: Killed
> srun: error: labcmp5: tasks 8-11: Killed
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
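
Since the failure in the report above only shows up with a direct srun launch, one quick sanity check is to confirm that the Slurm build exposes the pmix_v2 plugin and that Open MPI was built with PMIx support. This is a rough sketch, assuming srun and ompi_info from the installs described above are on PATH; the have helper is a made-up convenience, not part of either tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named tool is on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# List the MPI plugin types this Slurm build offers;
# pmix_v2 should appear if Slurm was configured with --with-pmix.
if have srun; then
    srun --mpi=list
else
    echo "srun: not found on PATH"
fi

# Show the PMIx-related lines from the Open MPI build summary.
if have ompi_info; then
    ompi_info | grep -i pmix || echo "ompi_info: no pmix entries found"
else
    echo "ompi_info: not found on PATH"
fi
```

If pmix_v2 is missing from the srun list, the problem is likely on the Slurm build side rather than in Open MPI; if both checks look fine, the verbose PMIx output is the next thing to look at.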
