On Dec 17, 2007, at 8:35 AM, Marco Sbrighi wrote:
I'm using Open MPI 1.2.2 over OFED 1.2 on a 256-node, dual-Opteron,
dual-core Linux cluster with, of course, an InfiniBand 4x interconnect.
Each cluster node is equipped with 4 (or more) network interfaces,
namely 2 gigabit Ethernet ones plus 2 IPoIB. The two gigabit interfaces
are named eth0 and eth1, while the two IPoIB ones are named ib0 and ib1.
eth0 is a management network with poor performance, and we also don't
want the ib* interfaces to carry MPI's traffic (neither OOB nor TCP),
so we would like eth1 to be used for Open MPI OOB and TCP.
In order to drive the OOB over only eth1, I've tried various
combinations of oob_tcp_[ex|in]clude MCA statements, starting from the
obvious
oob_tcp_exclude = lo,eth0,ib0,ib1
then trying the other obvious one:
oob_tcp_include = eth1
This one statement (_include) should be sufficient.
Presumably this statement is (or these statements are) in a config file
that is being read by Open MPI, such as $HOME/.openmpi/mca-params.conf?
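As a minimal sketch, assuming the pre-1.2.3 parameter name and the eth1
interface from the report above, such a file would contain just:

```
# $HOME/.openmpi/mca-params.conf
# Restrict Open MPI's out-of-band (OOB) TCP traffic to eth1.
# Parameter name as used in 1.2.2; it was renamed in 1.2.3.
oob_tcp_include = eth1
```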
and both at the same time.
Next I've tried the following:
oob_tcp_exclude = eth0
but after the job starts, I still have a lot of tcp connections
established using eth0 or ib0 or ib1.
Furthermore, the following error occurs:
[node191:03976] [0,1,14]-[0,1,12] mca_oob_tcp_peer_complete_connect:
connection failed: Connection timed out (110) - retrying
This is quite odd. :-(
I've found only one way to have TCP connections bound only to the eth1
interface: using both of the following MCA directives on the command
line:
mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_include lo,eth0,ib0,ib1 .....
This sounds like a bug to me.
Yes, it does. Specifying the same MCA param twice on the command line
results in undefined behavior -- it will only take one of them, and I
assume it'll take the first (but I'd have to check the code to be sure).
Is there someone able to reproduce this behaviour?
If this is a bug, are there fixes?
I'm unfortunately unable to reproduce this behavior. I have a test
cluster with 2 IP interfaces: ib0, eth0. I have tried several
combinations of MCA params with 1.2.2:
--mca oob_tcp_include ib0
--mca oob_tcp_include ib0,bogus
--mca oob_tcp_include eth0
--mca oob_tcp_include eth0,bogus
--mca oob_tcp_exclude ib0
--mca oob_tcp_exclude ib0,bogus
--mca oob_tcp_exclude eth0
--mca oob_tcp_exclude eth0,bogus
All do as they are supposed to -- including or excluding ib0 or eth0.
I do note, however, that the handling of these parameters changed in
1.2.3 -- as well as their names. The names changed to
"oob_tcp_if_include" and "oob_tcp_if_exclude" to match other MCA
parameter name conventions from other components.
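For illustration, here is a sketch of a config file using the renamed
parameter, assuming 1.2.3 or later and the eth1 interface from the
original report. The btl_tcp_if_include line is an addition that was not
tested in this thread; it is the corresponding parameter for the TCP BTL
and is included only because the report also wants MPI TCP traffic on
eth1:

```
# $HOME/.openmpi/mca-params.conf (Open MPI 1.2.3 and later)
# Out-of-band (OOB) TCP traffic: use only eth1.
oob_tcp_if_include = eth1
# MPI point-to-point traffic over the TCP BTL: use only eth1.
# (Assumption: not part of the tests described above.)
btl_tcp_if_include = eth1
```

Running "ompi_info --param oob tcp" should list the parameter names that
a given installation actually recognizes.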
Could you try with 1.2.3 or 1.2.4 (1.2.4 is the most recent; 1.2.5 is
due out "soon" -- it *may* get out before the holiday break, but no
promises...)?
If you can't upgrade, let me know and I can provide a debugging patch
that will give us a little more insight into what is happening on your
machines. Thanks.
--
Jeff Squyres
Cisco Systems