Re: [OMPI users] pmix and srun

2019-01-18 Thread Ralph H Castain
Good - thanks!

> On Jan 18, 2019, at 3:25 PM, Michael Di Domenico  
> wrote:
> 
> seems to be better now.  jobs are running
> 
> On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain  wrote:
>> 
>> I have pushed a fix to the v2.2 branch - could you please confirm it?
>> 
>> 
>>> On Jan 18, 2019, at 2:23 PM, Ralph H Castain  wrote:
>>> 
>>> Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm 
>>> plugin folks seem to be off somewhere for a while and haven’t been testing 
>>> it. Sigh.
>>> 
>>> I’ll patch the branch and let you know - we’d appreciate the feedback.
>>> Ralph
>>> 
>>> 
 On Jan 18, 2019, at 2:09 PM, Michael Di Domenico  
 wrote:
 
 here are the branches i'm using.  i did a git clone on the repos and
 then a git checkout
 
 [ec2-user@labhead bin]$ cd /hpc/src/pmix/
 [ec2-user@labhead pmix]$ git branch
 master
 * v2.2
 [ec2-user@labhead pmix]$ cd ../slurm/
 [ec2-user@labhead slurm]$ git branch
 * (detached from origin/slurm-18.08)
 master
 [ec2-user@labhead slurm]$ cd ../ompi/
 [ec2-user@labhead ompi]$ git branch
 * (detached from origin/v3.1.x)
 master
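
 For reference, a sketch of the clone/checkout steps described above; the GitHub URLs and the /hpc/src layout are assumptions for illustration, not taken from the message:

 # sketch only: repo URLs and local layout are assumed
 cd /hpc/src
 git clone https://github.com/pmix/pmix.git     && (cd pmix  && git checkout v2.2)
 git clone https://github.com/SchedMD/slurm.git && (cd slurm && git checkout origin/slurm-18.08)
 git clone https://github.com/open-mpi/ompi.git && (cd ompi  && git checkout origin/v3.1.x)

 Checking out origin/slurm-18.08 and origin/v3.1.x directly produces the detached-HEAD state shown in the git branch output above.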
 
 
 attached is the debug output from the run with debugging turned on
 
 On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
> 
> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
> notification system in the Slurm plugin, but you should only be trying to 
> call it if OMPI is registering a system-level event code - which OMPI 3.1 
> definitely doesn’t do.
> 
> If you are using PMIx v2.2.0, then please note that there is a bug in it 
> that slipped through our automated testing. I replaced it today with 
> v2.2.1 - you probably should update if that’s the case. However, that 
> wouldn’t necessarily explain this behavior. I’m not that familiar with 
> the Slurm plugin, but you might try adding
> 
> PMIX_MCA_pmix_client_event_verbose=5
> PMIX_MCA_pmix_server_event_verbose=5
> OMPI_MCA_pmix_base_verbose=10
> 
> to your environment and see if that provides anything useful.
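
A minimal sketch of how that suggestion might be applied, assuming a bash shell and the same srun invocation used later in this thread:

# export the verbosity settings suggested above, then re-run the failing step
export PMIX_MCA_pmix_client_event_verbose=5
export PMIX_MCA_pmix_server_event_verbose=5
export OMPI_MCA_pmix_base_verbose=10
srun --mpi=pmix_v2 -n 16 xhpl 2>&1 | tee pmix-debug.log   # pmix-debug.log is just a convenience for attaching the output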
> 
>> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico 
>>  wrote:
>> 
>> i compiled pmix, slurm, and openmpi
>> 
>> ---pmix
>> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
>> --disable-debug
>> ---slurm
>> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
>> --with-pmix=/hpc/pmix/2.2
>> ---openmpi
>> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
>> --with-libevent=external --with-slurm=/hpc/slurm/18.08
>> --with-pmix=/hpc/pmix/2.2
>> 
>> everything seemed to compile fine, but when i do an srun i get the
>> below errors.  however, if i salloc and then mpirun, it seems to work
>> fine.  i'm not quite sure where the breakdown is or how to debug it
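
One sanity check that may help localize the breakdown (a hedged suggestion; option names assume Slurm 18.08): confirm that this srun actually picked up the PMIx plugin built against /hpc/pmix/2.2.

# list the MPI plugin types srun knows about; pmix_v2 should appear if --with-pmix was picked up
srun --mpi=list
# show the cluster-wide default MPI plugin and any MPI parameters
scontrol show config | grep -i -E 'MpiDefault|MpiParams'

If pmix_v2 is missing from the list, the Slurm build did not pick up the PMIx install.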
>> 
>> ---
>> 
>> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
>> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
>> event/pmix_event_registration.c at line 101
>> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
>> event/pmix_event_registration.c at line 101
>> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
>> event/pmix_event_registration.c at line 101
>> --
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or 
>> environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>> 
>> ompi_interlib_declare
>> --> Returned "Would block" (-10) instead of "Success" (0)
>> ...snipped...
>> [labcmp6:18355] *** An error occurred in MPI_Init
>> [labcmp6:18355] *** reported by process [140726281390153,15]
>> [labcmp6:18355] *** on a NULL communicator
>> [labcmp6:18355] *** Unknown error
>> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
>> communicator will now abort,
>> [labcmp6:18355] ***and potentially your MPI job)
>> [labcmp6:18352] *** An error occurred in MPI_Init
>> [labcmp6:18352] *** reported by process [1677936713,12]
>> [labcmp6:18352] *** on a NULL communicator
>> [labcmp6:18352] *** Unknown error
>> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
>> communicator will now abort,
>> [labcmp6:18352] ***and potentially your MPI job)
>> [labcmp6:18354] *** An error occurred in MPI_Init
>> [labcmp6:18354] *** reported by process [140726281390153,14]
>> [labcmp6:18354] *** on a NULL communicator
>> [labcmp6:18354] *** Unknown error
>> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL 

Re: [OMPI users] pmix and srun

2019-01-18 Thread Michael Di Domenico
seems to be better now.  jobs are running

On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain  wrote:
>
> I have pushed a fix to the v2.2 branch - could you please confirm it?
>
>
> > On Jan 18, 2019, at 2:23 PM, Ralph H Castain  wrote:
> >
> > Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm 
> > plugin folks seem to be off somewhere for a while and haven’t been testing 
> > it. Sigh.
> >
> > I’ll patch the branch and let you know - we’d appreciate the feedback.
> > Ralph
> >
> >
> >> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico  
> >> wrote:
> >>
> >> here are the branches i'm using.  i did a git clone on the repos and
> >> then a git checkout
> >>
> >> [ec2-user@labhead bin]$ cd /hpc/src/pmix/
> >> [ec2-user@labhead pmix]$ git branch
> >> master
> >> * v2.2
> >> [ec2-user@labhead pmix]$ cd ../slurm/
> >> [ec2-user@labhead slurm]$ git branch
> >> * (detached from origin/slurm-18.08)
> >> master
> >> [ec2-user@labhead slurm]$ cd ../ompi/
> >> [ec2-user@labhead ompi]$ git branch
> >> * (detached from origin/v3.1.x)
> >> master
> >>
> >>
> >> attached is the debug output from the run with debugging turned on
> >>
> >> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
> >>>
> >>> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
> >>> notification system in the Slurm plugin, but you should only be trying to 
> >>> call it if OMPI is registering a system-level event code - which OMPI 3.1 
> >>> definitely doesn’t do.
> >>>
> >>> If you are using PMIx v2.2.0, then please note that there is a bug in it 
> >>> that slipped through our automated testing. I replaced it today with 
> >>> v2.2.1 - you probably should update if that’s the case. However, that 
> >>> wouldn’t necessarily explain this behavior. I’m not that familiar with 
> >>> the Slurm plugin, but you might try adding
> >>>
> >>> PMIX_MCA_pmix_client_event_verbose=5
> >>> PMIX_MCA_pmix_server_event_verbose=5
> >>> OMPI_MCA_pmix_base_verbose=10
> >>>
> >>> to your environment and see if that provides anything useful.
> >>>
>  On Jan 18, 2019, at 12:09 PM, Michael Di Domenico 
>   wrote:
> 
>  i compiled pmix, slurm, and openmpi
> 
>  ---pmix
>  ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
>  --disable-debug
>  ---slurm
>  ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
>  --with-pmix=/hpc/pmix/2.2
>  ---openmpi
>  ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
>  --with-libevent=external --with-slurm=/hpc/slurm/18.08
>  --with-pmix=/hpc/pmix/2.2
> 
>  everything seemed to compile fine, but when i do an srun i get the
>  below errors.  however, if i salloc and then mpirun, it seems to work
>  fine.  i'm not quite sure where the breakdown is or how to debug it
> 
>  ---
> 
>  [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
>  [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
>  event/pmix_event_registration.c at line 101
>  [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
>  event/pmix_event_registration.c at line 101
>  [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
>  event/pmix_event_registration.c at line 101
>  --
>  It looks like MPI_INIT failed for some reason; your parallel process is
>  likely to abort.  There are many reasons that a parallel process can
>  fail during MPI_INIT; some of which are due to configuration or 
>  environment
>  problems.  This failure appears to be an internal failure; here's some
>  additional information (which may only be relevant to an Open MPI
>  developer):
> 
>  ompi_interlib_declare
>  --> Returned "Would block" (-10) instead of "Success" (0)
>  ...snipped...
>  [labcmp6:18355] *** An error occurred in MPI_Init
>  [labcmp6:18355] *** reported by process [140726281390153,15]
>  [labcmp6:18355] *** on a NULL communicator
>  [labcmp6:18355] *** Unknown error
>  [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
>  communicator will now abort,
>  [labcmp6:18355] ***and potentially your MPI job)
>  [labcmp6:18352] *** An error occurred in MPI_Init
>  [labcmp6:18352] *** reported by process [1677936713,12]
>  [labcmp6:18352] *** on a NULL communicator
>  [labcmp6:18352] *** Unknown error
>  [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
>  communicator will now abort,
>  [labcmp6:18352] ***and potentially your MPI job)
>  [labcmp6:18354] *** An error occurred in MPI_Init
>  [labcmp6:18354] *** reported by process [140726281390153,14]
>  [labcmp6:18354] *** on a NULL communicator
>  [labcmp6:18354] *** Unknown error
>  [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
>  communicator will now abort,
>  [labcmp6:18354] ***and potentially your MPI job)
>  

Re: [OMPI users] pmix and srun

2019-01-18 Thread Ralph H Castain
I have pushed a fix to the v2.2 branch - could you please confirm it?
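
A sketch of one way to confirm the fix, assuming the /hpc/src/pmix checkout and configure options from earlier in the thread, and that the git checkout is rebuilt via autogen.pl:

cd /hpc/src/pmix
git pull        # pick up the fix just pushed to the v2.2 branch
./autogen.pl && ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 --disable-debug
make -j install
# then relaunch the job step (srun --mpi=pmix_v2 -n 16 xhpl) to see whether the errors are gone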


> On Jan 18, 2019, at 2:23 PM, Ralph H Castain  wrote:
> 
> Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm plugin 
> folks seem to be off somewhere for a while and haven’t been testing it. Sigh.
> 
> I’ll patch the branch and let you know - we’d appreciate the feedback.
> Ralph
> 
> 
>> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico  
>> wrote:
>> 
>> here are the branches i'm using.  i did a git clone on the repos and
>> then a git checkout
>> 
>> [ec2-user@labhead bin]$ cd /hpc/src/pmix/
>> [ec2-user@labhead pmix]$ git branch
>> master
>> * v2.2
>> [ec2-user@labhead pmix]$ cd ../slurm/
>> [ec2-user@labhead slurm]$ git branch
>> * (detached from origin/slurm-18.08)
>> master
>> [ec2-user@labhead slurm]$ cd ../ompi/
>> [ec2-user@labhead ompi]$ git branch
>> * (detached from origin/v3.1.x)
>> master
>> 
>> 
>> attached is the debug output from the run with debugging turned on
>> 
>> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
>>> 
>>> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
>>> notification system in the Slurm plugin, but you should only be trying to 
>>> call it if OMPI is registering a system-level event code - which OMPI 3.1 
>>> definitely doesn’t do.
>>> 
>>> If you are using PMIx v2.2.0, then please note that there is a bug in it 
>>> that slipped through our automated testing. I replaced it today with v2.2.1 
>>> - you probably should update if that’s the case. However, that wouldn’t 
>>> necessarily explain this behavior. I’m not that familiar with the Slurm 
>>> plugin, but you might try adding
>>> 
>>> PMIX_MCA_pmix_client_event_verbose=5
>>> PMIX_MCA_pmix_server_event_verbose=5
>>> OMPI_MCA_pmix_base_verbose=10
>>> 
>>> to your environment and see if that provides anything useful.
>>> 
 On Jan 18, 2019, at 12:09 PM, Michael Di Domenico  
 wrote:
 
 i compiled pmix, slurm, and openmpi
 
 ---pmix
 ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
 --disable-debug
 ---slurm
 ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
 --with-pmix=/hpc/pmix/2.2
 ---openmpi
 ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
 --with-libevent=external --with-slurm=/hpc/slurm/18.08
 --with-pmix=/hpc/pmix/2.2
 
 everything seemed to compile fine, but when i do an srun i get the
 below errors.  however, if i salloc and then mpirun, it seems to work
 fine.  i'm not quite sure where the breakdown is or how to debug it
 
 ---
 
 [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
 [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
 event/pmix_event_registration.c at line 101
 [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
 event/pmix_event_registration.c at line 101
 [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
 event/pmix_event_registration.c at line 101
 --
 It looks like MPI_INIT failed for some reason; your parallel process is
 likely to abort.  There are many reasons that a parallel process can
 fail during MPI_INIT; some of which are due to configuration or environment
 problems.  This failure appears to be an internal failure; here's some
 additional information (which may only be relevant to an Open MPI
 developer):
 
 ompi_interlib_declare
 --> Returned "Would block" (-10) instead of "Success" (0)
 ...snipped...
 [labcmp6:18355] *** An error occurred in MPI_Init
 [labcmp6:18355] *** reported by process [140726281390153,15]
 [labcmp6:18355] *** on a NULL communicator
 [labcmp6:18355] *** Unknown error
 [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
 communicator will now abort,
 [labcmp6:18355] ***and potentially your MPI job)
 [labcmp6:18352] *** An error occurred in MPI_Init
 [labcmp6:18352] *** reported by process [1677936713,12]
 [labcmp6:18352] *** on a NULL communicator
 [labcmp6:18352] *** Unknown error
 [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
 communicator will now abort,
 [labcmp6:18352] ***and potentially your MPI job)
 [labcmp6:18354] *** An error occurred in MPI_Init
 [labcmp6:18354] *** reported by process [140726281390153,14]
 [labcmp6:18354] *** on a NULL communicator
 [labcmp6:18354] *** Unknown error
 [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
 communicator will now abort,
 [labcmp6:18354] ***and potentially your MPI job)
 srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
 slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 
 2019-01-18T20:03:33 ***
 [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
 event/pmix_event_registration.c at line 101