Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to
assign different hostnames to their interfaces - I’ve seen it in the
Hadoop world, but not in HPC. Still, no law against it.

No, not so unusual.
I have clusters from respectable vendors that come with
/etc/hosts for name resolution of the various interfaces.
If I remember right, Rocks clusters also do that (or rather, they
allow the sysadmin to set up additional networks, and at that point
Rocks appends the additional names to /etc/hosts, or perhaps puts those
names in DHCP).
I am not so familiar with xCAT, but I think it has similar DHCP functionality, or maybe DNS on the head node.

Having said that, I don't think this is an obstacle to setting up the right "if_include/if_exclude" choices (along with the btl, oob, etc.)
for each particular cluster in the MCA parameter configuration file.
That is what my parallel conversation with Reuti was about.
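
To make the per-cluster configuration concrete, here is a minimal sketch that renders an MCA parameter file fragment restricting both the TCP BTL and the OOB to one network. The parameter names `btl_tcp_if_include` and `oob_tcp_if_include` are the usual Open MPI ones; the interface name `eth0` and the helper `render_mca_params` are assumptions made up for this illustration.

```python
def render_mca_params(params):
    """Render a dict of MCA parameters as 'name = value' lines.

    List values are joined with commas, which is how the if_include and
    if_exclude parameters take several interfaces.
    """
    lines = []
    for name, value in params.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        lines.append(f"{name} = {value}")
    return "\n".join(lines) + "\n"


# Hypothetical cluster where eth0 is the private (internal) network.
cluster_params = {
    "btl_tcp_if_include": ["eth0"],
    "oob_tcp_if_include": ["eth0"],
}

print(render_mca_params(cluster_params), end="")
```

The resulting lines would go into a per-user file such as `$HOME/.openmpi/mca-params.conf`, or into the system-wide parameter file of the installation.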

I believe the current approach w.r.t. interfaces:
"use everything, let the sysadmin/user restrict as
(s)he sees fit" is both a wise and flexible way to do it.
Guessing the "right interface to use" sounds risky to me (wrong choices may happen), and a bit of a cast.

This will take a little thought to figure out a solution. One problem
that immediately occurs is if someone includes a hostfile that has lines
which refer to the same physical server, but using different interface
names. We’ll think those are completely distinct servers, and so the
process placement will be totally messed up.
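
The aliasing problem can be sketched in a few lines. This is only an illustration, not how Open MPI handles hostfiles: it assumes we somehow have a table of which addresses belong to which machine (here the hypothetical `addrs_by_node`, which could for example be reported by each node), plus a name-to-address lookup (`name_to_addr`, standing in for /etc/hosts or DNS). All host names and addresses are made up.

```python
def group_by_machine(names, name_to_addr, addrs_by_node):
    """Group hostfile names that point at the same physical machine.

    names         -- host names in hostfile order
    name_to_addr  -- hypothetical lookup table: name -> IP address
    addrs_by_node -- hypothetical table: machine -> set of its addresses
    """
    # Invert the table so each address maps to the machine that owns it.
    owner = {addr: node
             for node, addrs in addrs_by_node.items()
             for addr in addrs}
    groups = {}
    for name in names:
        addr = name_to_addr[name]
        # An address nobody claims is treated as a machine of its own.
        node = owner.get(addr, name)
        groups.setdefault(node, []).append(name)
    return groups


hostfile_names = ["node01", "node01-eth1", "node02"]
name_to_addr = {
    "node01": "192.168.1.1",
    "node01-eth1": "10.0.0.1",
    "node02": "192.168.1.2",
}
addrs_by_node = {
    "node01": {"192.168.1.1", "10.0.0.1"},
    "node02": {"192.168.1.2"},
}

print(group_by_machine(hostfile_names, name_to_addr, addrs_by_node))
```

Without a grouping step of this kind, `node01` and `node01-eth1` would be counted as two servers and their slots would be double-booked.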

Sure, and besides this, there will be machines with
inconsistent/wrong/conflicting name resolution schemes
that the current OMPI approach simply (and wisely) ignores.

We’ll also encounter issues with the daemon when it reports back, as the
hostname it gets will almost certainly differ from the hostname we were
expecting. Not as critical, but we need to check where that will
impact the code base.

I'm sure that will happen.
Torque uses hostname by default for several things, and it can be a configuration nightmare to work around that when what hostname reports is not what you want.

IMHO, you may face a daunting guesswork task to get this right,
to pick the
interfaces that are best for a particular computer or cluster.
It is so much easier to let the sysadmin/user, who presumably knows his/her machine, write an MCA parameter config file,
as it is now in OMPI.

We can look at the hostfile changes at that time - no real objection to
them, but would need to figure out how to pass that info to the
appropriate subsystems. I assume you want this to apply to both the oob
and tcp/btl?

Obviously, this won’t make it for 1.8 as it is going to be fairly
intrusive, but we can probably do something for 1.9

The status quo is good.
Long life to the OMPI status quo.
(You don't know how reluctant I am to support the status quo, any status quo. :) ) My vote (... well, I don't have voting rights on that, but I'll vote anyway ...) is to keep the current approach. It is wise and flexible, and easy to adjust and configure to specific machines with their own oddities, via MCA parameters, as I tried to explain in previous postings.

My two cents,
Gus Correa

Another thing you can do is (a) ensure you built with
--enable-debug, and then (b) run it with -mca oob_base_verbose 100
 (without the tcp_if_include option) so we can watch the
connection handshake and see what it is doing. The --hetero-nodes
option will have no effect here and can be ignored.

Done. It really tries to connect to the outside interface of the
headnode. But whether there is a firewall or not, the nodes have no clue
how to reach it: they have no gateway to this network at all.

I have to revert this. They think that there is a gateway although
there isn't one. When I remove the gateway entry from the routing
table by hand, it starts up instantly too.

While I can do this on my own cluster, I still have the 30-second
delay on a cluster where I'm not root, although this may be because of
the firewall there. The gateway on this cluster is indeed going to
the outside world.

Personally I find this behavior of using all interfaces a little bit
too aggressive. If you don't check this carefully beforehand and
start a long-running application, you might not even notice the delay
during startup.

Agreed - do you have any suggestions on how we should choose the
order in which to try them? I haven’t been able to come up with
anything yet. Jeff has some fancy algo in his usnic BTL that we are
going to discuss after SC that I’m hoping will help, but I’d be open
to doing something better in the interim for 1.8.4
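
One possible ordering heuristic, offered only as a sketch and not as what the usnic BTL or the OOB actually does: try peer addresses that sit on a directly attached subnet first, and leave addresses that would need a gateway for last. The function name `order_candidates` and all addresses below are hypothetical.

```python
import ipaddress


def order_candidates(local_ifaces, peer_addrs):
    """Put peer addresses on a directly attached subnet first.

    local_ifaces -- local interface addresses in CIDR form, e.g. '10.0.0.5/24'
    peer_addrs   -- candidate addresses advertised by the peer
    """
    nets = [ipaddress.ip_interface(iface).network for iface in local_ifaces]

    def is_direct(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in nets)

    # sorted() is stable, so the original order is kept within each class.
    return sorted(peer_addrs, key=lambda addr: not is_direct(addr))


# A compute node with only a private interface; the headnode advertises
# both its outside and its inside address.
print(order_candidates(["192.168.1.10/24"], ["203.0.113.5", "192.168.1.1"]))
```

With this ordering the compute node would try the headnode's private address before the outside one, so a bogus or firewalled gateway would only matter if the direct attempt failed.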

The plain `mpiexec` should just use the interface specified in
the hostfile, whether it is handcrafted or prepared by a queuing system.

Option: could a single entry for a machine in the hostfile contain a
list of interfaces? I mean something like:

node01,node01-extra-eth1,node01-extra-eth2 slots=4

or

node01* slots=4

Means: use exactly these interfaces or even try to find all available
interfaces on/between the machines.

In case all interfaces have the same name, then it's up to the admin
to correct this.
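
A rough sketch of how such entries could be parsed, purely to illustrate the proposal: neither the comma-separated list nor the trailing `*` is existing Open MPI hostfile syntax, and the function name, the returned fields and the default of one slot are assumptions of this example.

```python
def parse_host_entry(line):
    """Parse one line of the proposed (hypothetical) hostfile syntax.

    'a,b,c slots=N' -> use exactly the listed interface names
    'a* slots=N'    -> start from 'a' and discover further interfaces
    """
    fields = line.split()
    names_field, options = fields[0], fields[1:]

    slots = 1  # assumed default when no slots= option is given
    for opt in options:
        key, _, value = opt.partition("=")
        if key == "slots":
            slots = int(value)

    if names_field.endswith("*"):
        # Wildcard form: strip the '*' and flag the entry for discovery.
        return {"names": [names_field[:-1]], "discover": True, "slots": slots}
    return {"names": names_field.split(","), "discover": False, "slots": slots}


print(parse_host_entry("node01,node01-extra-eth1,node01-extra-eth2 slots=4"))
print(parse_host_entry("node01* slots=4"))
```

Since all names on one line belong to a single entry, they would be treated as one machine with one slot count, which avoids the double counting that separate lines would cause.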

-- Reuti

It tries this regardless of whether the internal or external name of the
headnode is given in the machinefile - I hit ^C then. I attached the
output of Open MPI 1.8.1 for this setup too.

-- Reuti

