Hi folks,
I'm sorta stymied by the magic of effortless OpenMPI tight integration
with SGE and am wondering how best to proceed.
Here is my situation:
- Cluster has nodes named "node1 ... nodeN"
- Cluster also has IB NICs in each node
- Cluster hosts file declares the IB interfaces as "inode1 ... inodeN"
So my basic situation is that the hostname of the compute node is
different if I want to explicitly invoke the infiniband interface and
network. I need to use "inodeN" instead of "nodeN" for my MPI hosts.
In the bad old days of loose MPI integration I'd intercept the
temporary hostfile generated by the pe_starter method and run a quick
regex on it to change all mentions of "node" to "inode", and I'd be
done: the mpirun command would be force-fed a machines file that
explicitly names the infiniband-associated hostnames.
However, with the magic/automatic support that SGE has for OpenMPI there
is no written MPI hosts file that I can find ($TMPDIR/hosts does not
exist in the job context). The SGE scheduler just sends the selected
host set directly to the OpenMPI starter process, and in my case it
seems clear that SGE is sending the "ethernet" hostnames instead of the
IB hostnames. As a result my shiny IB fabric is being ignored in favor
of running MPI over the ethernet links.
So my basic question is: how do I force tightly integrated OpenMPI to
use a (slightly) different set of hostnames so that the IB fabric is
actually used?
Right now I'm thinking of mirroring part of the loose integration method
and writing a simple pe_starter method that will take $pe_hostfile and
translate it into a hostfile that has the 'nodeN' to 'inodeN' regex
applied. Then I can modify my job scripts to force mpirun to accept a
machinesfile or hostfile argument.
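A rough sketch of that translation step, not tested on a real cluster.
It assumes each line of the pe_hostfile that SGE writes for the job
looks like "node3 4 all.q@node3 UNDEFINED" (host, slots, queue,
processor range), that short hostnames are in use, and that the IB name
is just the ethernet name with an "i" in front. The function name
make_ib_hostfile and the output file name ib_hostfile are made up for
illustration.

```shell
#!/bin/sh
# make_ib_hostfile is a hypothetical helper, not part of SGE or Open MPI.
# It reads an SGE pe_hostfile ($1) and prints an Open MPI style hostfile
# ("host slots=N") with nodeN rewritten to inodeN.
make_ib_hostfile() {
    awk '{
        host = $1
        sub(/^node/, "inode", host)   # assumed naming: nodeN -> inodeN
        print host " slots=" $2       # column 2 is the slot count
    }' "$1"
}

# Inside a parallel job SGE exports PE_HOSTFILE; elsewhere this is a no-op.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    make_ib_hostfile "$PE_HOSTFILE" > "${TMPDIR:-/tmp}/ib_hostfile"
fi
```

The job script could then try something like
"$MPIRUN --hostfile $TMPDIR/ib_hostfile -np $NSLOTS ..." where NSLOTS is
the slot count SGE sets for the job. One caveat I am not sure about:
under tight integration Open MPI may treat a hostfile as a filter on the
SGE allocation, in which case names that are not in the allocation could
be rejected.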
Is there a better way?
Also, is there a better way to "prove" what network/interface endpoints
openmpi is using? So far for debugging I've been using the following
options to sorta prove to myself that the non-IB network is being used:
    $MPIRUN --display-devel-allocation --display-allocation --verbose \
        --show-progress
and by running that command through SGE and then outside of SGE with a
manual hostfile using the IB interface I see enough difference in output
to be convinced that SGE is routing jobs through the ethernet network.
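A possibly more direct check, sketched under assumptions: restrict Open
MPI to the IB transport and turn up the BTL selection output. This
assumes an Open MPI build that still ships the openib BTL (newer
releases may use UCX instead), and ./my_mpi_app is a placeholder for a
real MPI program. The helper name ib_check_cmd is made up; it only
prints the command line rather than running anything.

```shell
#!/bin/sh
# ib_check_cmd is a hypothetical helper: it prints an mpirun command that
# excludes the TCP BTL and asks for verbose BTL selection output.
# $1 = slot count, $2 = MPI program (placeholder)
ib_check_cmd() {
    echo "${MPIRUN:-mpirun} -np $1 --mca btl openib,sm,self --mca btl_base_verbose 30 $2"
}

# NSLOTS is set by SGE inside a parallel job; fall back to 2 elsewhere.
ib_check_cmd "${NSLOTS:-2}" ./my_mpi_app
```

If the printed command runs to completion inside an SGE job, traffic
between nodes is not going over plain TCP, since the tcp BTL is left out
of the list. If it aborts complaining that peers are unreachable, the IB
path is not usable from that job context.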
Thoughts, clues and tips appreciated!
-Chris