On Dec 17, 2007, at 8:35 AM, Marco Sbrighi wrote:

I'm using Open MPI 1.2.2 over OFED 1.2 on a 256-node, dual-Opteron,
dual-core Linux cluster with, of course, an InfiniBand 4x interconnect.

Each cluster node is equipped with 4 (or more) network interfaces,
namely 2 gigabit Ethernet ones plus 2 IPoIB. The two gigabit interfaces
are named eth0 and eth1, while the two IPoIB ones are named ib0 and ib1.

eth0 is a management network with poor performance, and furthermore we
don't want to use the ib* interfaces to carry MPI's IP traffic (neither
OOB nor TCP), so we would like eth1 to be used for Open MPI OOB and TCP.

In order to drive the OOB over only eth1 I've tried various combinations
of oob_tcp_[ex|in]clude MCA statements, starting from the obvious

oob_tcp_exclude = lo,eth0,ib0,ib1

then trying the other obvious one:

oob_tcp_include = eth1

This one statement (_include) should be sufficient.

Presumably these statements are in a config file that Open MPI reads, such as $HOME/.openmpi/mca-params.conf?
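
For reference, a minimal sketch of what such a file could contain for the setup described above. The oob_tcp_include name is the 1.2.2 spelling used in this thread; the btl_tcp_if_include line is an assumption added here to cover the MPI TCP traffic as well, and is worth confirming against the installed version before relying on it.

```
# $HOME/.openmpi/mca-params.conf (sketch; one "name = value" pair per line)
# Restrict the out-of-band (OOB) TCP channel to eth1 (1.2.2 parameter name)
oob_tcp_include = eth1
# Assumption: also restrict the TCP BTL, which carries MPI traffic, to eth1
btl_tcp_if_include = eth1
```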

and both at the same time.

Next I've tried the following:

oob_tcp_exclude = eth0

but after the job starts, I still have a lot of TCP connections
established using eth0 or ib0 or ib1.
Furthermore, the following error occurs:

  [node191:03976] [0,1,14]-[0,1,12] mca_oob_tcp_peer_complete_connect:
connection failed: Connection timed out (110) - retrying

This is quite odd.  :-(

I've found only one way to have TCP connections bound only to
the eth1 interface: using both of the following MCA directives on the
command line:

mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_include lo,eth0,ib0,ib1 .....

This sounds like a bug to me.

Yes, it does. Specifying the same MCA param twice on the command line results in undefined behavior -- it will only take one of them, and I assume it'll take the first (but I'd have to check the code to be sure).
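
Since a repeated parameter is easy to miss in a long command line, here is a rough sketch of a check that can be run on the arguments before handing them to mpirun. The function name dup_mca_params is hypothetical and not part of Open MPI; it only looks at the word following each "--mca" (or "-mca") and does not know anything about how mpirun itself resolves duplicates.

```shell
# Hypothetical helper, not part of Open MPI: print every MCA parameter name
# that appears after "--mca" (or "-mca") more than once in the arguments.
dup_mca_params() {
    prev=""
    for arg in "$@"; do
        case "$prev" in
            # The word right after --mca/-mca is the parameter name
            --mca|-mca) printf '%s\n' "$arg" ;;
        esac
        prev="$arg"
    done | sort | uniq -d   # uniq -d keeps only names seen at least twice
}
```

For example, dup_mca_params --mca oob_tcp_include eth1 --mca oob_tcp_include lo,eth0,ib0,ib1 reports oob_tcp_include, while a command line with distinct parameter names produces no output.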

Is there someone able to reproduce this behaviour?
If this is a bug, are there fixes?


I'm unfortunately unable to reproduce this behavior. I have a test cluster with 2 IP interfaces: ib0, eth0. I have tried several combinations of MCA params with 1.2.2:

   --mca oob_tcp_include ib0
   --mca oob_tcp_include ib0,bogus
   --mca oob_tcp_include eth0
   --mca oob_tcp_include eth0,bogus
   --mca oob_tcp_exclude ib0
   --mca oob_tcp_exclude ib0,bogus
   --mca oob_tcp_exclude eth0
   --mca oob_tcp_exclude eth0,bogus

All do as they are supposed to -- including or excluding ib0 or eth0.

I do note, however, that the handling of these parameters changed in 1.2.3 -- as well as their names. The names changed to "oob_tcp_if_include" and "oob_tcp_if_exclude" to match other MCA parameter name conventions from other components.
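
For scripts that have to work across both naming schemes, the rename can be captured in a small helper. This is only a sketch: the function name oob_include_param is made up for illustration, it assumes a purely numeric "X.Y.Z" version string, and the only boundary it encodes is the one stated above (old name before 1.2.3, new name from 1.2.3 on).

```shell
# Hypothetical helper: map an Open MPI version "X.Y.Z" to the name of the
# OOB TCP include parameter, assuming the rename happened exactly at 1.2.3.
oob_include_param() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the version string on dots
    IFS=$old_ifs
    major=${1:-0} minor=${2:-0} patch=${3:-0}
    if [ "$major" -gt 1 ] ||
       { [ "$major" -eq 1 ] && [ "$minor" -gt 2 ]; } ||
       { [ "$major" -eq 1 ] && [ "$minor" -eq 2 ] && [ "$patch" -ge 3 ]; }; then
        echo oob_tcp_if_include   # 1.2.3 and later
    else
        echo oob_tcp_include      # 1.2.2 and earlier
    fi
}
```

A call such as mpirun --mca "$(oob_include_param 1.2.4)" eth1 ... would then pick the new spelling. Running ompi_info --param oob tcp on the installed version lists the parameter names it actually accepts, which is the more reliable check.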

Could you try with 1.2.3 or 1.2.4 (1.2.4 is the most recent; 1.2.5 is due out "soon" -- it *may* get out before the holiday break, but no promises...)?

If you can't upgrade, let me know and I can provide a debugging patch that will give us a little more insight into what is happening on your machines. Thanks.

--
Jeff Squyres
Cisco Systems
