Hi folks,

I'm sorta stymied by the magic of effortless openmpi tight integration with SGE and am wondering how best to proceed...

Here is my situation:

- Cluster has nodes named "node1 ... nodeN"
- Cluster also has IB NICs in each node
- Cluster hosts file declares the IB interfaces as "inode1 ... inodeN"

So my basic situation is that each compute node has a different hostname when I want to explicitly use the InfiniBand interface and network: I need to give MPI "inodeN" instead of "nodeN" for my hosts.

In the bad old days of loose MPI integration I'd just intercept the temporary hostfile generated by the pe_starter method and run a quick regex on it to change all mentions of "node" to "inode", and I'd be done: the mpirun command would be force-fed a machines file that explicitly names the InfiniBand-associated hostnames.

However, with the magic/automatic support that SGE has for OpenMPI there is no written MPI hosts file that I can find ($TMPDIR/hosts does not exist in the job context). The SGE scheduler just sends the selected host set directly to the OpenMPI starter process. In my case it seems clear that SGE is sending the "ethernet" hostnames instead of the IB hostnames, and thus my shiny IB fabric is being ignored in favor of running MPI over the ethernet links.


So my basic question is "how to force tightly integrated openmpi to use a (slightly) different set of hostnames so that the IB fabric is actually used ..."

Right now I'm thinking of mirroring part of the loose integration method and writing a simple pe_starter method that will take $pe_hosts and translate it into a hostfile that has the 'nodeN' to 'inodeN' regex applied. Then I can modify my job scripts to force mpirun to accept a machinesfile or hostfile argument.
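
A minimal sketch of that translation step, in Python. It assumes the allocation can be read from the file named by SGE's PE_HOSTFILE environment variable (one line per host: hostname, slot count, queue, processor range) and that mpirun will take a hostfile of "hostname slots=N" lines. The function names, the "node" to "inode" rule and the "ib_hosts" file name are illustrative only, not anything SGE or Open MPI provides.

```python
import os
import re


def to_ib_hostfile(pe_hostfile_text, pattern=r"^node", replacement="inode"):
    """Convert SGE PE_HOSTFILE text into Open MPI hostfile text with IB names."""
    lines = []
    for raw in pe_hostfile_text.splitlines():
        fields = raw.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        # Assumption: the IB hostname is the ethernet hostname with a
        # leading "node" rewritten to "inode".
        ib_host = re.sub(pattern, replacement, host)
        lines.append(f"{ib_host} slots={slots}")
    return "".join(line + "\n" for line in lines)


def write_ib_hostfile(env=os.environ):
    """Write the translated hostfile under $TMPDIR and return its path."""
    with open(env["PE_HOSTFILE"]) as src:
        text = to_ib_hostfile(src.read())
    out_path = os.path.join(env.get("TMPDIR", "."), "ib_hosts")  # hypothetical name
    with open(out_path, "w") as dst:
        dst.write(text)
    return out_path
```

A job script could call write_ib_hostfile() and hand the resulting path to mpirun via --hostfile. Whether a tightly integrated mpirun accepts hostnames that differ from the ones SGE allocated is something that would need checking on the actual cluster.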

Is there a better way?

Also, is there a better way to "prove" what network/interface endpoints openmpi is using? So far for debugging I've been using the following options to sorta prove to myself that the non-IB network is being used:

$MPIRUN --display-devel-allocation --display-allocation --verbose --show-progress

and by running that command through SGE and then outside of SGE with a manual hostfile using the IB interface I see enough difference in output to be convinced that SGE is routing jobs through the ethernet network.
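
One possible way to make the transport choice more visible, sketched in Python as a small command builder. It assumes an Open MPI build that still ships the openib BTL (it was dropped in the 5.x series) and relies on the MCA parameters btl and btl_base_verbose. The helper name, binary name and rank count are hypothetical.

```python
import shlex


def mpirun_ib_check(program, nprocs, mpirun="mpirun", verbose=30):
    """Build an mpirun command line that insists on the InfiniBand BTL."""
    return [
        mpirun,
        "-np", str(nprocs),
        # Restrict transports to InfiniBand plus self (loopback), so the job
        # aborts instead of silently falling back to TCP over ethernet.
        "--mca", "btl", "openib,self",
        # Ask the BTL framework to log which components it opens and selects.
        "--mca", "btl_base_verbose", str(verbose),
        program,
    ]


cmd = mpirun_ib_check("./my_mpi_app", 16)  # hypothetical binary and rank count
print(shlex.join(cmd))
```

If a job launched this way runs to completion, the MPI traffic did not go over TCP; if it aborts with a complaint about unreachable peers, the IB path was not usable from inside the SGE job.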


Thoughts, clues and tips appreciated!

-Chris





_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
