Marcin,

You can also try excluding the public subnet(s) (e.g. 1.2.3.0/24) and the
loopback interface, instead of including em4, which does not exist on the
compute nodes. Alternatively, you can include only the private subnet(s) that
are common to the frontend and the compute nodes.
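
For example, with hypothetical placeholders for the public subnet, the private
subnet, and the application name:

    # 1.2.3.0/24 = your public subnet (placeholder); ./my_app = your program
    mpirun --mca oob_tcp_if_exclude 1.2.3.0/24,lo \
           --mca btl_tcp_if_exclude 1.2.3.0/24,lo ./my_app

or, including only the private subnet shared by the frontend and compute nodes:

    # 10.0.0.0/24 = your private subnet (placeholder)
    mpirun --mca oob_tcp_if_include 10.0.0.0/24 \
           --mca btl_tcp_if_include 10.0.0.0/24 ./my_app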

Cheers,

Gilles

On Saturday, September 24, 2016, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

> Thanks for a quick answer, Ralph!
>
> This does not work, because em4 is only defined on the frontend node. Now
> I get errors from the compute nodes:
>
> [compute-1-4.local:12206] found interface lo
> [compute-1-4.local:12206] found interface em1
> [compute-1-4.local:12206] mca: base: components_open: component posix_ipv4
> open function successful
> [compute-1-4.local:12206] mca: base: components_open: found loaded
> component linux_ipv6
> [compute-1-4.local:12206] mca: base: components_open: component linux_ipv6
> open function successful
> --------------------------------------------------------------------------
> None of the TCP networks specified to be included for out-of-band
> communications
> could be found:
>
>   Value given: em4
>
> Please revise the specification and try again.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> No network interfaces were found for out-of-band communications. We require
> at least one available network for out-of-band messaging.
> --------------------------------------------------------------------------
>
> But since only the frontend node has a different network configuration, the
> problem only appears when I run interactive sessions using salloc. If I use
> sbatch to submit the jobs, they are executed correctly. Phew.
>
> Thanks for your help, now I can make my way through it!
>
> Marcin
>
>
>
> On 09/23/2016 04:45 PM, r...@open-mpi.org wrote:
>
>> This isn’t an issue with the SLURM integration - it is a problem with our
>> OOB not correctly picking the right subnet for connecting back to
>> mpirun. In this specific case, you probably want
>>
>> -mca btl_tcp_if_include em4 -mca oob_tcp_if_include em4
>>
>> since it is the em4 network that ties the compute nodes together, and the
>> compute nodes to the frontend
>>
>> We are working on the subnet selection logic, but the 1.10 series does not
>> seem to have been updated with those changes.
>>
>> On Sep 23, 2016, at 6:00 AM, Marcin Krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I have stumbled upon a similar issue, so I wonder if they might be
>>> related. On one of our systems I get the following error message with both
>>> Open MPI 1.8.8 and 1.10.4:
>>>
>>> $ mpirun -debug-daemons --mca btl tcp,self --mca mca_base_verbose 100
>>> --mca btl_base_verbose 100 ls
>>>
>>> [...]
>>> [compute-1-1.local:07302] mca: base: close: unloading component direct
>>> [compute-1-1.local:07302] mca: base: close: unloading component radix
>>> [compute-1-1.local:07302] mca: base: close: unloading component debruijn
>>> [compute-1-1.local:07302] orte_routed_base_select: initializing selected
>>> component binomial
>>> [compute-1-2.local:13744] [[63041,0],2]: parent 0 num_children 0
>>> Daemon [[63041,0],2] checking in as pid 13744 on host c1-2
>>> [compute-1-2.local:13744] [[63041,0],2] orted: up and running - waiting
>>> for commands!
>>> [compute-1-2.local:13744] [[63041,0],2] tcp_peer_send_blocking: send()
>>> to socket 9 failed: Broken pipe (32)
>>> [compute-1-2.local:13744] mca: base: close: unloading component binomial
>>> [compute-1-1.local:07302] [[63041,0],1]: parent 0 num_children 0
>>> Daemon [[63041,0],1] checking in as pid 7302 on host c1-1
>>> [compute-1-1.local:07302] [[63041,0],1] orted: up and running - waiting
>>> for commands!
>>> [compute-1-1.local:07302] [[63041,0],1] tcp_peer_send_blocking: send()
>>> to socket 9 failed: Broken pipe (32)
>>> [compute-1-1.local:07302] mca: base: close: unloading component binomial
>>> srun: error: c1-1: task 0: Exited with exit code 1
>>> srun: Terminating job step 4538.1
>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>> srun: error: c1-2: task 1: Exited with exit code 1
>>>
>>>
>>> I have also tested version 2.0.1 - this one works without problems.
>>>
>>> In my case the problem appears on one system with SLURM versions 15.08.8
>>> and 15.08.12. On another system running 15.08.8 everything works fine, so I
>>> guess it is not about the SLURM version, but maybe the system / network
>>> configuration?
>>>
>>> Following that thought I have also noticed this thread:
>>>
>>> http://users.open-mpi.narkive.com/PwJpWXLm/ompi-users-tcp-peer-send-blocking-send-to-socket-9-failed-broken-pipe-32-on-openvz-containers
>>>
>>> As Jeff suggested there, I tried to run with --mca btl_tcp_if_include
>>> em1 --mca oob_tcp_if_include em1, but got the same error.
>>>
>>> Could these problems be related to interface naming / the lack of
>>> InfiniBand? Or to the fact that the frontend node, from which I execute
>>> mpirun, has a different network configuration? The system on which things
>>> don't work only has TCP network interfaces:
>>>
>>> em1, lo (frontend has em1, em4 - local compute network, lo)
>>>
>>> while the cluster on which Open MPI does work uses InfiniBand and has the
>>> following TCP interfaces:
>>>
>>> eth0, eth1, ib0, lo
>>>
>>> I would appreciate any hints.
>>>
>>> Thanks!
>>>
>>> Marcin
>>>
>>>
>>> On 04/01/2016 04:16 PM, Jeff Squyres (jsquyres) wrote:
>>>
>>>> Ralph --
>>>>
>>>> What's the state of PMI integration with SLURM in the v1.10.x series?
>>>> (I haven't kept up with SLURM's recent releases to know if something broke
>>>> between existing Open MPI releases and their new releases...?)
>>>>
>>>>
>>>>
>>>> On Mar 31, 2016, at 4:24 AM, Tommi T <tommi_...@yahoo.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> stack:
>>>>> el6.7, mlnx ofed 3.1 (IB FDR) and slurm 15.08.9 (without *.la libs).
>>>>>
>>>>> problem:
>>>>> Open MPI 1.10.x built with PMI support does not work when using the
>>>>> sbatch/salloc + mpirun combination. srun ompi_mpi_app works fine.
>>>>>
>>>>> The older 1.8.x version works fine under the same salloc session.
>>>>>
>>>>> ./configure --with-slurm --with-verbs --with-hwloc=internal --with-pmi
>>>>> --with-cuda=/appl/opt/cuda/7.5/ --with-pic --enable-shared
>>>>> --enable-mpi-thread-multiple --enable-contrib-no-build=vt
>>>>>
>>>>>
>>>>> I tried 1.10.3a from git also.
>>>>>
>>>>>
>>>>> mpirun  -debug-daemons ./1103aompitest
>>>>> Daemon [[44437,0],1] checking in as pid 40979 on host g59
>>>>> Daemon [[44437,0],2] checking in as pid 23566 on host g60
>>>>> [g59:40979] [[44437,0],1] orted: up and running - waiting for commands!
>>>>> [g60:23566] [[44437,0],2] orted: up and running - waiting for commands!
>>>>> [g59:40979] [[44437,0],1] tcp_peer_send_blocking: send() to socket 9
>>>>> failed: Broken pipe (32)
>>>>> [g59:40979] [[44437,0],1]:errmgr_default_orted.c(260) updating exit
>>>>> status to 1
>>>>> [g60:23566] [[44437,0],2] tcp_peer_send_blocking: send() to socket 9
>>>>> failed: Broken pipe (32)
>>>>> [g60:23566] [[44437,0],2]:errmgr_default_orted.c(260) updating exit
>>>>> status to 1
>>>>> srun: error: g59: task 0: Exited with exit code 1
>>>>> srun: Terminating job step 8922923.1
>>>>> srun: Job step aborted: Waiting up to 12 seconds for job step to
>>>>> finish.
>>>>> srun: error: g60: task 1: Exited with exit code 1
>>>>> --------------------------------------------------------------------------
>>>>> An ORTE daemon has unexpectedly failed after launch and before
>>>>> communicating back to mpirun. This could be caused by a number
>>>>> of factors, including an inability to create a connection back
>>>>> to mpirun due to a lack of common network interfaces and/or no
>>>>> route found between them. Please check network connectivity
>>>>> (including firewalls and network routing requirements).
>>>>> --------------------------------------------------------------------------
>>>>> [login2:48425] [[44437,0],0] orted:comm:process_commands() Processing
>>>>> Command: ORTE_DAEMON_HALT_VM_CMD
>>>>> [login2:48425] [[44437,0],0] orted_cmd: received halt_vm cmd
>>>>>
>>>>>
>>>>> [GPU-Env mpi]$ srun ./1103aompitest
>>>>> g59: Before MPI_INIT
>>>>> g59: After MPI_INIT
>>>>> Hello world! I'm 0 of 2 on g59
>>>>> g60: Before MPI_INIT
>>>>> g60: After MPI_INIT
>>>>> Hello world! I'm 1 of 2 on g60
>>>>>
>>>>> ompi_info  --parsable |grep pmi
>>>>>
>>>>> mca:db:pmi:version:mca:2.0.0
>>>>> mca:db:pmi:version:api:1.0.0
>>>>> mca:db:pmi:version:component:1.10.3
>>>>> mca:ess:pmi:version:mca:2.0.0
>>>>> mca:ess:pmi:version:api:3.0.0
>>>>> mca:ess:pmi:version:component:1.10.3
>>>>> mca:grpcomm:pmi:version:mca:2.0.0
>>>>> mca:grpcomm:pmi:version:api:2.0.0
>>>>> mca:grpcomm:pmi:version:component:1.10.3
>>>>> mca:pubsub:pmi:version:mca:2.0.0
>>>>> mca:pubsub:pmi:version:api:2.0.0
>>>>> mca:pubsub:pmi:version:component:1.10.3
>>>>>
>>>>>
>>>>> module swap openmpi openmpi/1.8.6
>>>>>
>>>>>
>>>>> [GPU-Env mpi]$ mpirun -debug-daemons ./ompigcc184
>>>>> Daemon [[810,0],2] checking in as pid 55443 on host g60
>>>>> Daemon [[810,0],1] checking in as pid 73091 on host g59
>>>>> [g60:55443] [[810,0],2] orted: up and running - waiting for commands!
>>>>> [g59:73091] [[810,0],1] orted: up and running - waiting for commands!
>>>>> [login2:05014] [[810,0],0] orted_cmd: received add_local_procs
>>>>> [g59:73091] [[810,0],1] orted_cmd: received add_local_procs
>>>>> [g60:55443] [[810,0],2] orted_cmd: received add_local_procs
>>>>> g60: Before MPI_INIT
>>>>> g59: Before MPI_INIT
>>>>> [g60:55443] [[810,0],2] orted_recv: received sync+nidmap from local
>>>>> proc [[810,1],1]
>>>>> [g59:73091] [[810,0],1] orted_recv: received sync+nidmap from local
>>>>> proc [[810,1],0]
>>>>> MPIR_being_debugged = 0
>>>>> MPIR_debug_state = 1
>>>>> MPIR_partial_attach_ok = 1
>>>>> MPIR_i_am_starter = 0
>>>>> MPIR_forward_output = 0
>>>>> MPIR_proctable_size = 2
>>>>> MPIR_proctable:
>>>>> (i, host, exe, pid) = (0, g59, ompigcc184, 73096)
>>>>> (i, host, exe, pid) = (1, g60, ompigcc184, 55448)
>>>>> MPIR_executable_path: NULL
>>>>> MPIR_server_arguments: NULL
>>>>> [login2:05014] [[810,0],0] orted_cmd: received message_local_procs
>>>>> [g59:73091] [[810,0],1] orted_cmd: received message_local_procs
>>>>> [g60:55443] [[810,0],2] orted_cmd: received message_local_procs
>>>>> [taito-login2.csc.fi:05014] [[810,0],0] orted_cmd: received
>>>>> message_local_procs
>>>>> [g59:73091] [[810,0],1] orted_cmd: received message_local_procs
>>>>> [g60:55443] [[810,0],2] orted_cmd: received message_local_procs
>>>>> g59: After MPI_INIT
>>>>> Hello world! I'm 0 of 2 on g59
>>>>> g60: After MPI_INIT
>>>>> Hello world! I'm 1 of 2 on g60
>>>>> [login2:5014] [[810,0],0] orted_cmd: received message_local_procs
>>>>> [g60:55443] [[810,0],2] orted_cmd: received message_local_procs
>>>>> [g59:73091] [[810,0],1] orted_cmd: received message_local_procs
>>>>> [g59:73091] [[810,0],1] orted_recv: received sync from local proc
>>>>> [[810,1],0]
>>>>> [g60:55443] [[810,0],2] orted_recv: received sync from local proc
>>>>> [[810,1],1]
>>>>> [login2:05014] [[810,0],0] orted_cmd: received exit cmd
>>>>> [g60:55443] [[810,0],2] orted_cmd: received exit cmd
>>>>> [g59:73091] [[810,0],1] orted_cmd: received exit cmd
>>>>> [g60:55443] [[810,0],2] orted_cmd: all routes and children gone -
>>>>> exiting
>>>>> [g59:73091] [[810,0],1] orted_cmd: all routes and children gone -
>>>>> exiting
>>>>>
>>>>>
>>>>> [GPU-Env mpi]$ ompi_info -parsable |grep pmi
>>>>> mca:db:pmi:version:mca:2.0
>>>>> mca:db:pmi:version:api:1.0
>>>>> mca:db:pmi:version:component:1.8.6
>>>>> mca:ess:pmi:version:mca:2.0
>>>>> mca:ess:pmi:version:api:3.0
>>>>> mca:ess:pmi:version:component:1.8.6
>>>>> mca:grpcomm:pmi:version:mca:2.0
>>>>> mca:grpcomm:pmi:version:api:2.0
>>>>> mca:grpcomm:pmi:version:component:1.8.6
>>>>> mca:pubsub:pmi:version:mca:2.0
>>>>> mca:pubsub:pmi:version:api:2.0
>>>>> mca:pubsub:pmi:version:component:1.8.6