The FAQ entry at
https://www.open-mpi.org/faq/?category=tcp#ip-virtual-ip-interfaces refers to
interfaces like "eth0:0", where the Linux kernel reports the same index for
both "eth0" and "eth0:0".  This confuses Open MPI, because it identifies
Ethernet interfaces by their kernel indexes.
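
To check whether a host has the index collision described above, a small sketch like the following can help. It is only an illustration, not how Open MPI itself enumerates interfaces. The helper name names_sharing_an_index is hypothetical, and it assumes that Python's socket.if_nametoindex() returns the parent's index when given an alias name such as "eth0:0" on Linux, which you should verify on your own system. The sample run uses a made-up lookup table instead of a real host.

```python
import socket
from collections import defaultdict


def names_sharing_an_index(names, lookup=socket.if_nametoindex):
    """Group interface names by kernel index and keep only the collisions.

    `lookup` maps an interface name to its kernel index. It defaults to
    socket.if_nametoindex, but can be swapped for a plain dict lookup.
    """
    by_index = defaultdict(list)
    for name in names:
        by_index[lookup(name)].append(name)
    # Only indexes that are claimed by more than one name are interesting.
    return {idx: group for idx, group in by_index.items() if len(group) > 1}


# Hypothetical table standing in for a host that has an "eth0:0" alias.
fake_indexes = {"lo": 1, "eth0": 2, "eth0:0": 2}
print(names_sharing_an_index(fake_indexes, lookup=fake_indexes.__getitem__))
```

An empty result means no two of the given names share an index; bridge interfaces such as virbr0 normally have an index of their own.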

If you have non-physical Ethernet interfaces (like virbr0, etc.), those should 
work just fine with btl_tcp_if_include and btl_tcp_if_exclude.

What version of Open MPI are you using?

You might want to use "--mca btl_tcp_if_include CIDR", where CIDR is the 
CIDR notation for the subnet you want to use.  This allows your app to 
work even if that network is on different Ethernet interfaces on different 
hosts.  For example:

    mpirun --mca btl_tcp_if_include 192.168.10.0/24 ...
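
As a rough illustration of why a CIDR value is independent of interface names, the sketch below checks which interfaces on Host1 and Host2 have an address inside a given subnet. The addresses are copied from the ifconfig output quoted later in this thread. The subnet 175.148.218.0/24 is an assumption derived from the 255.255.255.0 netmask shown there, and the helper interfaces_in_subnet is hypothetical. This mimics the idea of address-based selection only; it is not Open MPI's implementation.

```python
import ipaddress

# IPv4 addresses copied from the ifconfig output quoted later in this thread.
HOST_INTERFACES = {
    "Host1": {"eno1": "175.148.218.46", "eno2": "192.168.1.2",
              "virbr2": "192.168.122.1", "virbr3": "192.168.123.1"},
    "Host2": {"eth0": "175.148.218.210", "eth1": "192.168.1.2",
              "virbr0": "192.168.123.1", "virbr1": "192.168.122.1"},
}


def interfaces_in_subnet(interfaces, cidr):
    """Return the sorted names of interfaces whose address lies in `cidr`."""
    network = ipaddress.ip_network(cidr)
    return sorted(name for name, addr in interfaces.items()
                  if ipaddress.ip_address(addr) in network)


# Assumed subnet: 175.148.218.0/24 (from the netmask in the quoted output).
for host, ifaces in HOST_INTERFACES.items():
    print(host, interfaces_in_subnet(ifaces, "175.148.218.0/24"))
```

Here the subnet matches eno1 on Host1 and eth0 on Host2, so a single CIDR value covers both hosts even though the interface names differ.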

If you're still getting a hang, try setting btl_base_verbose to 100.



On Jun 18, 2020, at 7:39 PM, Kulshrestha, Vipul via users 
<users@lists.open-mpi.org> wrote:

Hi,

I have read conflicting statements about OMPI support for virtual interfaces.

The Open MPI FAQ mentions that virtual IP interfaces are not supported and this 
will not be solved by using either btl_tcp_if_include or btl_tcp_if_exclude.  
(https://www.open-mpi.org/faq/?category=tcp#ip-virtual-ip-interfaces)

However, somewhere else, I read that you can exclude the virtual interfaces by 
specifying --mca btl_tcp_if_exclude virbr0,lo 
(https://github.com/open-mpi/ompi/issues/6377)

I am trying this out on different machines and find that it (specifying 
btl_tcp_if_exclude virbr0,lo) works on one pair of machines but does not work 
on another pair. I am hoping to get an explanation of why one works and the 
other does not.

I tried to generate some verbose output (on the pair of machines where it does 
not work) by specifying --mca btl_base_verbose 30, but it just hangs and does 
not generate any messages.

$ mpirun -np 4 --mca btl_base_verbose 30 --mca btl_tcp_if_exclude virbr0,virbr1,virbr2,virbr3,lo --hostfile host.txt /home/vipulk/mpitest2 100
…..
….
<no output and remains stuck forever>

The ifconfig output for the 2 machines in the host list is listed below.

Thanks,
Vipul


Host1:

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 175.148.218.46  netmask 255.255.255.0  broadcast 175.148.218.255
        inet6 fe80::9af2:b3ff:fe2a:3e84  prefixlen 64  scopeid 0x20<link>
        ether 98:f2:b3:2a:3e:84  txqueuelen 1000  (Ethernet)
        RX packets 5938671220  bytes 6033195902625 (5.4 TiB)
        RX errors 0  dropped 534674  overruns 0  frame 0
        TX packets 3933921252  bytes 3077919856788 (2.7 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16

eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.2  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::be68:2aa2:8b42:d6d  prefixlen 64  scopeid 0x20<link>
        ether 98:f2:b3:2a:3e:85  txqueuelen 1000  (Ethernet)
        RX packets 2355308  bytes 279699254 (266.7 MiB)
        RX errors 0  dropped 350  overruns 0  frame 0
        TX packets 60  bytes 8732 (8.5 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 17

eno3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 98:f2:b3:2a:3e:86  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16

eno4: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 98:f2:b3:2a:3e:87  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 17

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 3161146200  bytes 225991248912 (210.4 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3161146200  bytes 225991248912 (210.4 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:0a:cd:21  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.123.1  netmask 255.255.255.0  broadcast 192.168.123.255
        ether 52:54:00:0a:cd:22  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Host2:
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 175.148.218.210  netmask 255.255.255.0  broadcast 175.148.218.255
        inet6 fe80::9af2:b3ff:fe2a:3e78  prefixlen 64  scopeid 0x20<link>
        ether 98:f2:b3:2a:3e:78  txqueuelen 1000  (Ethernet)
        RX packets 8632800  bytes 3938419917 (3.6 GiB)
        RX errors 0  dropped 350  overruns 0  frame 0
        TX packets 5504444  bytes 1791707074 (1.6 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.2  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::9af2:b3ff:fe2a:3e79  prefixlen 64  scopeid 0x20<link>
        ether 98:f2:b3:2a:3e:79  txqueuelen 1000  (Ethernet)
        RX packets 2317163  bytes 275220791 (262.4 MiB)
        RX errors 0  dropped 350  overruns 0  frame 0
        TX packets 336  bytes 26726 (26.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 17

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 32539  bytes 2540603 (2.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 32539  bytes 2540603 (2.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.123.1  netmask 255.255.255.0  broadcast 192.168.123.255
        ether 52:54:00:0a:cd:22  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:0a:cd:21  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


--mca btl_tcp_if_exclude virbr0,lo works on machines with the configuration below:

Host 3:
eno1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 80:30:e0:3b:c8:40  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16

eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 80:30:e0:3b:c8:41  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 17

eno3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 80:30:e0:3b:c8:42  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16

eno4: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 80:30:e0:3b:c8:43  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 17

eno5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 65.10.19.30  netmask 255.255.255.192  broadcast 65.10.19.63
        inet6 fe80::8230:e0ff:fe20:96a8  prefixlen 64  scopeid 0x20<link>
        ether 80:30:e0:20:96:a8  txqueuelen 1000  (Ethernet)
        RX packets 1618138239  bytes 1552281705604 (1.4 TiB)
        RX errors 184  dropped 0  overruns 184  frame 0
        TX packets 1500861577  bytes 1593767198059 (1.4 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 34  memory 0xe8000000-e87fffff

eno6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 80:30:e0:20:96:ac  txqueuelen 1000  (Ethernet)
        RX packets 1299786  bytes 150289059 (143.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 77  memory 0xe7000000-e77fffff

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 20936389  bytes 2632538104 (2.4 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 20936389  bytes 2632538104 (2.4 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:05:7c:dd  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


Host 4:

eno1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 80:30:e0:3b:b8:5c  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16

eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 80:30:e0:3b:b8:5d  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 17

eno3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 80:30:e0:3b:b8:5e  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16

eno4: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 80:30:e0:3b:b8:5f  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 17

eno5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 65.10.19.29  netmask 255.255.255.192  broadcast 65.10.19.63
        inet6 fe80::8230:e0ff:fe20:96c0  prefixlen 64  scopeid 0x20<link>
        ether 80:30:e0:20:96:c0  txqueuelen 1000  (Ethernet)
        RX packets 2904054722  bytes 2656941056010 (2.4 TiB)
        RX errors 11  dropped 0  overruns 11  frame 0
        TX packets 5801141892  bytes 7474409123677 (6.7 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 34  memory 0xe8000000-e87fffff

eno6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 80:30:e0:20:96:c4  txqueuelen 1000  (Ethernet)
        RX packets 1299694  bytes 150265217 (143.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 77  memory 0xe7000000-e77fffff


lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 19850956  bytes 5578561316 (5.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 19850956  bytes 5578561316 (5.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:79:33:89  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


--
Jeff Squyres
jsquy...@cisco.com
