--mca btl ^openib
disables the openib BTL, which handles native InfiniBand only.

ib0 is treated like any other TCP interface and is then handled by the tcp BTL.

Another option is to use
--mca btl_tcp_if_exclude ib0
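To see why an include list such as 10.0.0.0/8 (used later in this thread) selects eth0 but not ib0, here is a minimal Python sketch of the CIDR matching involved, using the addresses from the ifconfig output quoted below. This is only an illustration of the subnet test; Open MPI's actual interface selection is done in C inside the tcp BTL.

```python
# Illustration only: shows why "--mca btl_tcp_if_include 10.0.0.0/8"
# matches eth0 (10.0.0.3) and skips ib0 (192.168.108.14).
import ipaddress

def interface_selected(addr: str, include_cidr: str) -> bool:
    """Return True if the interface address falls inside the include subnet."""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(include_cidr)

include = "10.0.0.0/8"
print(interface_selected("10.0.0.3", include))        # eth0 -> True
print(interface_selected("192.168.108.14", include))  # ib0  -> False
```

With btl_tcp_if_exclude the logic is simply inverted: interfaces named (or matching a subnet) in the exclude list are dropped and everything else is kept.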

On 2014/11/13 16:43, Syed Ahsan Ali wrote:
> You are right, it is running on the 10.0.0.0 interface.
>
> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
> btl_tcp_if_include 10.0.0.0/8 ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 1 exiting
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> [pmdtest@pmd ~]$
>
> While the IP addresses 192.168.108.* are for the IB interface.
>
>  [root@compute-01-01 ~]# ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>           inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>           inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>           Interrupt:169 Memory:dc000000-dc012100
> ib0       Link encap:InfiniBand  HWaddr
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>           inet addr:192.168.108.14  Bcast:192.168.108.255  Mask:255.255.255.0
>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:256
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>
>
>
> So the point is: why is mpirun following the IB path while it has
> been disabled? Possible solutions?
>
> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>> mpirun complains about the 192.168.108.10 ip address, but ping reports a
>> 10.0.0.8 address
>>
>> is the 192.168.* network a point to point network (for example between a
>> host and a mic) so two nodes
>> cannot ping each other via this address ?
>> /* e.g. from compute-01-01 can you ping the 192.168.108.* ip address of
>> compute-01-06 ? */
>>
>> could you also run
>>
>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>
>> and see whether it helps ?
>>
>>
>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>> Same result in both cases
>>>
>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
>>> compute-01-01,compute-01-06 ring_c
>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>> Process 0 sent to 1
>>> Process 0 decremented value: 9
>>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>
>>>
>>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host
>>> compute-01-01,compute-01-06 ring_c
>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>> Process 0 sent to 1
>>> Process 0 decremented value: 9
>>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>
>>>
>>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>> Hi,
>>>>
>>>> it seems you messed up the command line
>>>>
>>>> could you try
>>>>
>>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>
>>>>
>>>> can you also try to run mpirun from a compute node instead of the head
>>>> node ?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>>> Here is what I see when disabling openib support.
>>>>>
>>>>>
>>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib
>>>>> compute-01-01,compute-01-06 ring_c
>>>>> ssh:  orted: Temporary failure in name resolution
>>>>> ssh:  orted: Temporary failure in name resolution
>>>>> --------------------------------------------------------------------------
>>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>>> to launch so we are aborting.
>>>>>
>>>>> While nodes can still ssh to each other
>>>>>
>>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>>>>> [pmdtest@compute-01-06 ~]$
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali <ahsansha...@gmail.com> 
>>>>> wrote:
>>>>>> Hi Jeff
>>>>>>
>>>>>> No firewall is enabled. Running the diagnostics, I found that a
>>>>>> non-communicating MPI job runs fine, while ring_c remains stuck. There
>>>>>> are of course warnings about OpenFabrics, but in my case I am running
>>>>>> the application with openib disabled. Please see below
>>>>>>
>>>>>>  [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>>>>> --------------------------------------------------------------------------
>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>> no active ports detected (or Open MPI was unable to use them).  This
>>>>>> is most certainly not what you wanted.  Check your cables, subnet
>>>>>> manager configuration, etc.  The openib BTL will be ignored for this
>>>>>> job.
>>>>>>   Local host: compute-01-01.private.dns.zone
>>>>>> --------------------------------------------------------------------------
>>>>>> Hello, world, I am 0 of 2
>>>>>> Hello, world, I am 1 of 2
>>>>>> [pmd.pakmet.com:06386] 1 more process has sent help message
>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to
>>>>>> 0 to see all help / error messages
>>>>>>
>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>>>> --------------------------------------------------------------------------
>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>> no active ports detected (or Open MPI was unable to use them).  This
>>>>>> is most certainly not what you wanted.  Check your cables, subnet
>>>>>> manager configuration, etc.  The openib BTL will be ignored for this
>>>>>> job.
>>>>>>   Local host: compute-01-01.private.dns.zone
>>>>>> --------------------------------------------------------------------------
>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>> Process 0 sent to 1
>>>>>> Process 0 decremented value: 9
>>>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>> [pmd.pakmet.com:15965] 1 more process has sent help message
>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to
>>>>>> 0 to see all help / error messages
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>> Do you have firewalling enabled on either server?
>>>>>>>
>>>>>>> See this FAQ item:
>>>>>>>
>>>>>>>     
>>>>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Dear All
>>>>>>>>
>>>>>>>> I need your advice. While trying to run an mpirun job across nodes I get the
>>>>>>>> following error. It seems that the two nodes, i.e. compute-01-01 and
>>>>>>>> compute-01-06, are not able to communicate with each other, while the
>>>>>>>> nodes can see each other via ping.
>>>>>>>>
>>>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
>>>>>>>> ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>>>
>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>
>>>>>>>> mpirun: killing job...
>>>>>>>>
>>>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1
>>>>>>>> ttl=64 time=0.108 ms
>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2
>>>>>>>> ttl=64 time=0.088 ms
>>>>>>>>
>>>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>>>
>>>>>>>> Thanks in advance.
>>>>>>>>
>>>>>>>> Ahsan
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2014/11/25788.php
