Ralph --

What's the state of PMI integration with SLURM in the v1.10.x series?  (I
haven't kept up with SLURM's recent releases, so I don't know whether something
broke between the existing Open MPI releases and the newer SLURM versions.)
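
Tommi: in the meantime, two quick diagnostics that might help narrow it down
(suggestions only; I haven't tried to reproduce this):

1. "srun --mpi=list" shows which PMI plugins that SLURM build supports and
   which one is the default.

2. Re-running the failing case with more verbosity, e.g.:

     mpirun --mca plm_base_verbose 5 --mca oob_base_verbose 10 -debug-daemons ./1103aompitest

   should show more about why the orteds can't complete their callback to
   mpirun; the "Broken pipe" on send() means the far end closed the
   connection.

Also, since your configure line uses plain --with-pmi: if SLURM 15.08 moved
its PMI headers/libraries, it might be worth giving configure the SLURM
install prefix explicitly (e.g., --with-pmi=/path/to/your/slurm; placeholder
path) and checking config.log to see which pmi.h/pmi2.h it picked up.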



> On Mar 31, 2016, at 4:24 AM, Tommi T <tommi_...@yahoo.com> wrote:
> 
> Hi,
> 
> stack:
> el6.7, mlnx ofed 3.1 (IB FDR) and slurm 15.08.9 (without *.la libs).
> 
> problem:
> Open MPI 1.10.x built with PMI support does not work when using the
> sbatch/salloc + mpirun combination; srun ompi_mpi_app works fine.
> 
> The older 1.8.x version works fine in the same salloc session.
> 
> ./configure --with-slurm --with-verbs --with-hwloc=internal --with-pmi 
> --with-cuda=/appl/opt/cuda/7.5/ --with-pic --enable-shared 
> --enable-mpi-thread-multiple --enable-contrib-no-build=vt
> 
> 
> I also tried 1.10.3a from git; the output below is from that build.
> 
> 
> mpirun  -debug-daemons ./1103aompitest 
> Daemon [[44437,0],1] checking in as pid 40979 on host g59
> Daemon [[44437,0],2] checking in as pid 23566 on host g60
> [g59:40979] [[44437,0],1] orted: up and running - waiting for commands!
> [g60:23566] [[44437,0],2] orted: up and running - waiting for commands!
> [g59:40979] [[44437,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [g59:40979] [[44437,0],1]:errmgr_default_orted.c(260) updating exit status to 1
> [g60:23566] [[44437,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [g60:23566] [[44437,0],2]:errmgr_default_orted.c(260) updating exit status to 1
> srun: error: g59: task 0: Exited with exit code 1
> srun: Terminating job step 8922923.1
> srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
> srun: error: g60: task 1: Exited with exit code 1
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> [login2:48425] [[44437,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_HALT_VM_CMD
> [login2:48425] [[44437,0],0] orted_cmd: received halt_vm cmd
> 
> 
> [GPU-Env mpi]$ srun ./1103aompitest 
> g59: Before MPI_INIT 
> g59: After MPI_INIT 
> Hello world! I'm 0 of 2 on g59
> g60: Before MPI_INIT 
> g60: After MPI_INIT 
> Hello world! I'm 1 of 2 on g60
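> 
> For reference, the test program is just a trivial MPI hello world, more or
> less like this:
> 
> #include <stdio.h>
> #include <unistd.h>   /* gethostname() */
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>     char host[256];
>     int rank, size;
> 
>     gethostname(host, sizeof(host));        /* node name for the log lines */
>     printf("%s: Before MPI_INIT\n", host);
> 
>     MPI_Init(&argc, &argv);
>     printf("%s: After MPI_INIT\n", host);
> 
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my rank */
>     MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total ranks */
>     printf("Hello world! I'm %d of %d on %s\n", rank, size, host);
> 
>     MPI_Finalize();
>     return 0;
> }
> 
> compiled with the matching mpicc from each Open MPI install.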
> 
> ompi_info  --parsable |grep pmi
> 
> mca:db:pmi:version:mca:2.0.0
> mca:db:pmi:version:api:1.0.0
> mca:db:pmi:version:component:1.10.3
> mca:ess:pmi:version:mca:2.0.0
> mca:ess:pmi:version:api:3.0.0
> mca:ess:pmi:version:component:1.10.3
> mca:grpcomm:pmi:version:mca:2.0.0
> mca:grpcomm:pmi:version:api:2.0.0
> mca:grpcomm:pmi:version:component:1.10.3
> mca:pubsub:pmi:version:mca:2.0.0
> mca:pubsub:pmi:version:api:2.0.0
> mca:pubsub:pmi:version:component:1.10.3
> 
> 
> module swap openmpi openmpi/1.8.6
> 
> 
> [GPU-Env mpi]$ mpirun -debug-daemons ./ompigcc184 
> Daemon [[810,0],2] checking in as pid 55443 on host g60
> Daemon [[810,0],1] checking in as pid 73091 on host g59
> [g60:55443] [[810,0],2] orted: up and running - waiting for commands!
> [g59:73091] [[810,0],1] orted: up and running - waiting for commands!
> [login2:05014] [[810,0],0] orted_cmd: received add_local_procs
> [g59:73091] [[810,0],1] orted_cmd: received add_local_procs
> [g60:55443] [[810,0],2] orted_cmd: received add_local_procs
> g60: Before MPI_INIT 
> g59: Before MPI_INIT 
> [g60:55443] [[810,0],2] orted_recv: received sync+nidmap from local proc [[810,1],1]
> [g59:73091] [[810,0],1] orted_recv: received sync+nidmap from local proc [[810,1],0]
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_forward_output = 0
> MPIR_proctable_size = 2
> MPIR_proctable:
> (i, host, exe, pid) = (0, g59, ompigcc184, 73096)
> (i, host, exe, pid) = (1, g60, ompigcc184, 55448)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [login2:05014] [[810,0],0] orted_cmd: received message_local_procs
> [g59:73091] [[810,0],1] orted_cmd: received message_local_procs
> [g60:55443] [[810,0],2] orted_cmd: received message_local_procs
> [taito-login2.csc.fi:05014] [[810,0],0] orted_cmd: received message_local_procs
> [g59:73091] [[810,0],1] orted_cmd: received message_local_procs
> [g60:55443] [[810,0],2] orted_cmd: received message_local_procs
> g59: After MPI_INIT 
> Hello world! I'm 0 of 2 on g59
> g60: After MPI_INIT 
> Hello world! I'm 1 of 2 on g60
> [login2:5014] [[810,0],0] orted_cmd: received message_local_procs
> [g60:55443] [[810,0],2] orted_cmd: received message_local_procs
> [g59:73091] [[810,0],1] orted_cmd: received message_local_procs
> [g59:73091] [[810,0],1] orted_recv: received sync from local proc [[810,1],0]
> [g60:55443] [[810,0],2] orted_recv: received sync from local proc [[810,1],1]
> [login2:05014] [[810,0],0] orted_cmd: received exit cmd
> [g60:55443] [[810,0],2] orted_cmd: received exit cmd
> [g59:73091] [[810,0],1] orted_cmd: received exit cmd
> [g60:55443] [[810,0],2] orted_cmd: all routes and children gone - exiting
> [g59:73091] [[810,0],1] orted_cmd: all routes and children gone - exiting
> 
> 
> [GPU-Env mpi]$ ompi_info -parsable |grep pmi
> mca:db:pmi:version:mca:2.0
> mca:db:pmi:version:api:1.0
> mca:db:pmi:version:component:1.8.6
> mca:ess:pmi:version:mca:2.0
> mca:ess:pmi:version:api:3.0
> mca:ess:pmi:version:component:1.8.6
> mca:grpcomm:pmi:version:mca:2.0
> mca:grpcomm:pmi:version:api:2.0
> mca:grpcomm:pmi:version:component:1.8.6
> mca:pubsub:pmi:version:mca:2.0
> mca:pubsub:pmi:version:api:2.0
> mca:pubsub:pmi:version:component:1.8.6


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
