Pavel Shamis (Pasha) wrote:
Another thing to try is a change that we made late in the Open MPI
v1.2 series with regards to IB:
http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
Thanks, this is something worth investigating. What would be the
exact syntax to use to turn off pml_ob1_use_early_completion?
Your problem may well be related to the known issue with early
completions. The exact syntax is:
--mca pml_ob1_use_early_completion 0
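To double-check that the parameter is actually present in your build, ompi_info
can list the ob1 parameters; something along these lines should show it:
ompi_info --param pml ob1 | grep early_completion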
Unfortunately this did not help: still the same problem. Here is the
script I run: the last line is the tcp test, the commented-out line above
it is the openib test.
------------------------------------------------------------------------------------------------------------------------------
#!/bin/bash
#$ -S /bin/bash
#Set out, error and job name
#$ -o run2.out
#$ -e run2.err
#$ -N su3_01Jan
#Number of slots for MPI (38 in this case)
#$ -pe make 38
# The batchsystem should use the current directory as working directory.
#$ -cwd
export LD_LIBRARY_PATH=/opt/numactl-0.6.4/:/opt/sge-6.0u8/lib/lx24-amd64:/opt/ompi128-intel/lib
echo LD_LIBRARY_PATH $LD_LIBRARY_PATH
ldd ./k-string
ulimit -l 8388608
ulimit -a
export PATH=$PATH:/opt/ompi128-intel/bin
which mpirun
#The actual mpirun command
#mpirun -np $NSLOTS -mca btl openib,sm,self --mca pml_ob1_use_early_completion 0 ./k-string
mpirun -np $NSLOTS -mca btl tcp,sm,self ./k-string
-------------------------------------------------------------------------------------------------------------------------------------------
The script also prints some extra diagnostics: the PATH, the library path,
the locked-memory limit, etc. All of them look fine, and as before the tcp
run completes correctly while the openib run has a communication problem
(it looks as if no communication channel can be opened or recognised). I
will try Open MPI 1.3 rc2, as has been suggested; failing that, I will try
to isolate a test case to see whether the problem can be reproduced on other
systems (a first step in that direction is sketched below). Meanwhile, I'm
happy to hear any suggestions you might have.
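As a first isolation step, I intend to rerun the same job over the openib
BTL alone with verbose BTL output, roughly like this (I'm not sure yet what
the most useful verbosity level is):
mpirun -np $NSLOTS -mca btl openib,self -mca btl_base_verbose 30 ./k-string
and to confirm on the compute nodes with ibv_devinfo (from the OFED/libibverbs
utilities) that the HCA ports are ACTIVE.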
Thanks,
Biagio