Hi Jed,

You could try to select only the Ethernet interface that matches your nodes'
IP addresses in the hostfile,
which seems to be en2.

The en0 interface seems to carry an external IP.
Not sure about en1 and en3, but it is odd that they have
different IPs than en2 while sitting in the same subnet.
I wonder if this may be the reason the program hangs:
by default Open MPI's TCP layer tries to use every interface it finds,
and with several NICs on the same 10.x network it may pick one that
cannot actually reach the peer node.

You may need to check ifconfig on all nodes for a consistent set of
interfaces/IP addresses,
and tailor your mpiexec command line and your hostfile accordingly.

Say, something like this:

mpiexec -mca btl_tcp_if_include en2 -hostfile your_hostfile -np 43 ./ring_c
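If that solves it and you want to make the setting permanent, you could
put it in the per-user MCA parameter file instead of on the command line.
A minimal sketch, assuming en2 is the right interface on every node (the
oob_tcp_if_include line also restricts Open MPI's out-of-band setup
traffic to that interface, if your version supports that parameter):

# on each node, append the interface restriction to the MCA parameter file
mkdir -p $HOME/.openmpi
cat >> $HOME/.openmpi/mca-params.conf << 'EOF'
btl_tcp_if_include = en2
oob_tcp_if_include = en2
EOF

With that in place, a plain "mpiexec -hostfile your_hostfile -np 43 ./ring_c"
should pick up the interface restriction automatically.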

See this FAQ (actually, all of them are very informative):
http://www.open-mpi.org/faq/?category=tcp#tcp-selection
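To check whether the interface naming and addressing are consistent, a
quick loop over the hosts in your hostfile might look like the sketch
below (untested, and it assumes passwordless ssh between the nodes):

# print the interface names and IPv4 addresses on every node
for h in 10.0.0.21 10.0.0.31 10.0.0.41 10.0.0.51 10.0.0.61 10.0.0.71; do
  echo "== $h =="
  ssh $h "/sbin/ifconfig | grep -E '^[a-z]|inet '"
done

If some node reports its private 10.x address on a different interface
name (say en1 instead of en2), the btl_tcp_if_include value has to be
adjusted accordingly.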

I hope this helps,
Gus Correa



On Jul 4, 2013, at 6:37 PM, Jed O. Kaplan wrote:

> Dear openmpi gurus,
> 
> I am running openmpi 1.7.2 on a homogeneous cluster of Apple XServes
> running OS X 10.6.8. My hardware nodes are connected through four
> gigabit ethernet connections; I have no infiniband or other high-speed
> interconnect. The problem I describe below is the same if I use openmpi
> 1.6.5. My openmpi installation is compiled with Intel icc and ifort. See
> the attached result of ompi_info --all for more details on my
> installation and runtime parameters, and other diagnostic information
> below.
> 
> My problem is that I noticed that communication between hardware nodes
> hangs in one of my own programs; I thought this was the fault of my own
> bad programming, so I tried some of the example programs that are
> distributed with the openmpi source code. With the "ring_*" programs,
> whichever API I use (C, C++, Fortran, etc.), I see the same faulty
> behavior that I noticed in my own program: if I run the program on a
> single hardware node (with multiple processes) it works fine. As soon as
> I run the program across hardware nodes, it hangs. Below you will find
> an example of the program output and other diagnostic information.
> 
> This problem has really frustrated me. Unfortunately I am not
> experienced enough with openmpi to take the debugging any further.
> 
> Thank you in advance for any help you can give me!
> 
> Jed Kaplan
> 
> --- DETAILS OF MY PROBLEM ---
> 
> -- this run works because it is only on one hardware node --
> 
> jkaplan@grkapsrv2:~/openmpi_examples >  mpirun --prefix /usr/local
> --hostfile arvehosts.txt -np 3 ring_c
> Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> Process 1 exiting
> Process 2 exiting
> 
> -- this run hangs when running over two hardware nodes --
> 
> jkaplan@grkapsrv2:~/openmpi_examples >  mpirun --prefix /usr/local
> --hostfile arvehosts.txt -np 4 ring_c
> Process 0 sending 10 to 1, tag 201 (4 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> ... hangs forever ...
> ^CKilled by signal 2.
> 
> -- here is what my hostfile looks like --
> 
> jkaplan@grkapsrv2:~/openmpi_examples > cat arvehosts.txt 
> #host file for ARVE group mac servers
> 
> 10.0.0.21 slots=3
> 10.0.0.31 slots=8
> 10.0.0.41 slots=8
> 10.0.0.51 slots=8
> 10.0.0.61 slots=8 
> 10.0.0.71 slots=8
> 
> -- results of ifconfig - this looks pretty much the same on all of my
> servers, with different ip addresses of course --
> 
> jkaplan@grkapsrv2:~/openmpi_examples > ifconfig
> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
>       inet6 ::1 prefixlen 128 
>       inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 
>       inet 127.0.0.1 netmask 0xff000000 
> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
> stf0: flags=0<> mtu 1280
> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>       ether 00:24:36:f3:dc:fc 
>       inet6 fe80::224:36ff:fef3:dcfc%en0 prefixlen 64 scopeid 0x4 
>       inet 128.178.107.85 netmask 0xffffff00 broadcast 128.178.107.255
>       media: autoselect (1000baseT <full-duplex>)
>       status: active
> en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>       ether 00:24:36:f3:dc:fa 
>       inet6 fe80::224:36ff:fef3:dcfa%en1 prefixlen 64 scopeid 0x5 
>       inet 10.0.0.2 netmask 0xff000000 broadcast 10.255.255.255
>       media: autoselect (1000baseT <full-duplex,flow-control>)
>       status: active
> en2: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>       ether 00:24:36:f5:ba:4e 
>       inet6 fe80::224:36ff:fef5:ba4e%en2 prefixlen 64 scopeid 0x6 
>       inet 10.0.0.21 netmask 0xff000000 broadcast 10.255.255.255
>       media: autoselect (1000baseT <full-duplex,flow-control>)
>       status: active
> en3: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>       ether 00:24:36:f5:ba:4f 
>       inet6 fe80::224:36ff:fef5:ba4f%en3 prefixlen 64 scopeid 0x7 
>       inet 10.0.0.22 netmask 0xff000000 broadcast 10.255.255.255
>       media: autoselect (1000baseT <full-duplex,flow-control>)
>       status: active
> fw0: flags=8822<BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 4078
>       lladdr 04:1e:64:ff:fe:f8:aa:d2 
>       media: autoselect <full-duplex>
>       status: inactive
> <ompi_info.txt>

