First and foremost: is it possible to upgrade your version of Open
MPI? The version you are using (1.2.2) is rather ancient -- many bug
fixes have occurred since then (including fixes for TCP wireup issues). Note
that oob_tcp_include|exclude were renamed to oob_tcp_if_include|exclude in
1.2.3, to be symmetric with the other <foo>_if_include|exclude params in other
components.
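For example, to pin the OOB to a single interface, the parameter name differs
by version (a sketch; ib0 / the 10.148 network is assumed to be the interface
you want):

  # Open MPI 1.2.2 (what you are running now)
  mpiexec -mca oob_tcp_include ib0 ...
  # Open MPI 1.2.3 and later
  mpiexec -mca oob_tcp_if_include ib0 ...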
More below.
On Jun 3, 2008, at 1:07 PM, Scott Shaw wrote:
Hi, I hope this is the right forum for my questions. I am running into a
problem when scaling >512 cores on an InfiniBand cluster which has 14,336
cores. I am new to Open MPI and trying to figure out the right -mca options
to pass to avoid the "mca_oob_tcp_peer_complete_connect: connection failed:"
error on a cluster which has InfiniBand HCAs and the OFED v1.3 GA release.
Other MPI implementations like Intel MPI and MVAPICH work fine using the
uDAPL or verbs IB layers for MPI communication.
The OMPI v1.2 series is a bit inefficient in its TCP wireup for
control messages -- it creates TCP sockets between all MPI processes.
Do you allow enough file descriptors (fds) per process for this to occur?
(This situation is considerably better in the upcoming v1.3 series.)
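At 1024 processes per job, each process can need well over 1024 fds just for
those sockets. A quick sanity check on a compute node (the exact limit, and
how you raise it persistently, depends on your site setup):

  # show the per-process fd limit in the current shell
  ulimit -n
  # raise it for this shell; persistent settings usually go in
  # /etc/security/limits.conf or your resource manager's prolog
  ulimit -n 8192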
I find it difficult to understand which network interface or IB layer is
being used. When I explicitly state not to use the eth0, lo, ib1, or ib1:0
interfaces with the command-line option "-mca oob_tcp_exclude", Open MPI will
continue to probe these interfaces. For all MPI traffic Open MPI should use
ib0, which is the 10.148 network, but with debugging enabled I see references
to connection attempts on the 10.149 network, which is ib1. Below is the
ifconfig network device output for a compute node.
Just curious: does the oob_tcp_include parameter not work?
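I.e., something along these lines (assuming ib0 / the 10.148 network is the
interface you want the OOB to use):

  mpiexec -mca oob_tcp_include ib0 ...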
Questions:
1. Is there a way to determine which network device is being used, and to
keep Open MPI from falling back to another device? With Intel MPI or HP MPI
you can state not to use a fallback device. I thought "-mca oob_tcp_exclude"
would be the correct option to pass, but I may be wrong.
oob_tcp_include|exclude should be suitable for this purpose. If they're
not working, I'd be surprised (but it could have been a bug that was
fixed in a later version...?). Keep in mind that the "oob" traffic is
just control messages -- it's not the actual MPI communication. That
will go over the verbs interfaces.
2. How can I determine whether the InfiniBand openib device is actually being
used? When running an MPI app I continue to see the in/out packet counters at
the TCP level increasing, when it should be using the IB RDMA device for all
MPI communication over the ib0 or mthca0 device. Open MPI was bundled with
OFED v1.3, so I am assuming the openib interface should work. Running
ompi_info shows btl_openib_* references.
/usr/mpi/openmpi-1.2-2/intel/bin/mpiexec \
    -mca btl_openib_warn_default_gid_prefix 0 \
    -mca oob_tcp_exclude eth0,lo,ib1,ib1:0 \
    -mca btl openib,sm,self \
    -machinefile mpd.hosts.$$ -np 1024 ~/bin/test_ompi < input1
The "btl" is the component that controls point-to-point communication
in Open MPI. so if you specify "openib,sm,self", then Open MPI is
definitely using the verbs stack for MPI communication (not a TCP
stack).
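A quick way to sanity-check that the openib BTL is present in your
installation (it does not prove it was selected at run time, but it rules out
a missing component):

  # lists the openib BTL's parameters if the component was built
  ompi_info --param btl openib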
3. When trying to avoid the "mca_oob_tcp_peer_complete_connect:
connection failed:" message, I tried using "-mca btl openib,sm,self" and
"-mca btl ^tcp", but I still get these error messages.
Unfortunately, these are two different issues -- OMPI always uses TCP
for wireup and out-of-band control messages. That's where you're
getting the errors from. Specifically: giving values for the btl MCA
parameter won't affect these messages / errors.
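So to keep the OOB off the 10.149 network, you have to constrain it
separately from the BTLs; something like the following on your 1.2.2 install
(assuming ib0 is the 10.148 interface; in 1.2.3+ the param is
oob_tcp_if_include):

  /usr/mpi/openmpi-1.2-2/intel/bin/mpiexec \
      -mca btl openib,sm,self \
      -mca oob_tcp_include ib0 \
      -machinefile mpd.hosts.$$ -np 1024 ~/bin/test_ompi < input1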
Even when using "-mca btl openib,sm,self", Open MPI will retry using the IB1
(10.149 network) fabric to establish a connection with a node. What are my
options to avoid these connection-failed messages? I suspect Open MPI is
overflowing the TCP buffers on the clients because of the large core count of
this job, since I see lots of TCP buffer errors in the netstat -s output. I
reviewed all of the online FAQs and I am not sure what options to pass to get
around this issue.
I think we made this much better in 1.2.5 -- I see notes about this
issue in the NEWS file under the 1.2.5 release.
--
Jeff Squyres
Cisco Systems