Hello,
I am new to this list, where I hope to find a solution to a problem
that I have been having for quite a long time.
I run various versions of Open MPI (from 1.1.2 to 1.2.8) on a cluster
with InfiniBand interconnects that I both use and administer. The
OpenFabrics stack is OFED-1.2.5, the compilers are gcc 4.2 and Intel,
and the queue manager is SGE 6.0u8.
The trouble is with an MPI code that runs fine with an Open MPI 1.1.2
library compiled without InfiniBand support (I have tested the
scalability of the code up to 64 cores, on nodes with 4 or 8 cores, and
the results are exactly what I expect). However, if I use a version
compiled with InfiniBand support, only a subset of the communications
(those connecting cores within the same node) are established, and as a
result the program fails: in particular, it hangs indefinitely in a
wait. This happens with every combination of compilers and library
releases (1.1.2, 1.2.7, 1.2.8) I have tried. Other codes, in particular
benchmarks downloaded from the net, run correctly with Open MPI over
InfiniBand (I compared the latency against the tcp btl, so I am fairly
sure that InfiniBand itself works). The two components I have kept
fixed are SGE and the OFED stack; I would prefer not to touch them, if
possible, because the cluster seems to run fine for other purposes.
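In case it helps to see how I switch between the two transports, the
comparison was done with invocations roughly like the following (a
sketch rather than my exact command lines, which also carry SGE-related
options; "my_code" stands in for the actual executable):

  # check that the openib btl was actually built into the library
  ompi_info | grep openib

  # run over InfiniBand (openib), plus shared memory and self
  mpirun --mca btl openib,sm,self -np 64 ./my_code

  # the same run over TCP only, for the latency comparison
  mpirun --mca btl tcp,sm,self -np 64 ./my_code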
My question is: does anyone have a suggestion on what I could try
next? I am pretty sure that to get an answer I will need to provide
more details, which I am happy to do, but in more than two months of
testing/trying/hoping/praying I have accumulated so much material and
information that if I posted everything in this e-mail I would be more
likely to confuse a potential helper than to help them understand the
problem.
Thank you in advance,
Biagio Lucini
--
=========================================================
Dr. Biagio Lucini
Department of Physics, Swansea University
Singleton Park, SA2 8PP Swansea (UK)
Tel. +44 (0)1792 602284
=========================================================