Pavel Shamis (Pasha) wrote:

Another thing to try is a change that we made late in the Open MPI v1.2 series with regards to IB:

http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion

Thanks, this is something worth investigating. What would be the exact syntax to use to turn off pml_ob1_use_early_completion?
Your problem may indeed be related to the known issue with early completions. The exact syntax is:
--mca pml_ob1_use_early_completion 0
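The same parameter can also be set through Open MPI's standard OMPI_MCA_ environment-variable prefix, which may be easier to pass through a queueing system; a minimal equivalent sketch (reusing the $NSLOTS and ./k-string from the script below):

export OMPI_MCA_pml_ob1_use_early_completion=0
mpirun -np $NSLOTS -mca btl openib,sm,self ./k-string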

Unfortunately this did not help: the problem is still the same. Here is the script I run; the last line runs the tcp test, and the commented-out line above it runs the openib test.
------------------------------------------------------------------------------------------------------------------------------
#!/bin/bash
#$ -S /bin/bash

#Set output file, error file and job name
#$ -o run2.out
#$ -e run2.err
#$ -N su3_01Jan

#Number of slots for mpi (38 in this case)
#$ -pe make 38

# The batch system should use the current directory as the working directory.
#$ -cwd


export LD_LIBRARY_PATH=/opt/numactl-0.6.4/:/opt/sge-6.0u8/lib/lx24-amd64:/opt/ompi128-intel/lib

echo LD_LIBRARY_PATH  $LD_LIBRARY_PATH
ldd ./k-string

ulimit -l 8388608
ulimit -a

export PATH=$PATH:/opt/ompi128-intel/bin
which mpirun

#The actual mpirun command
#mpirun -np $NSLOTS -mca btl openib,sm,self --mca pml_ob1_use_early_completion 0 ./k-string
mpirun -np $NSLOTS -mca btl tcp,sm,self ./k-string

-------------------------------------------------------------------------------------------------------------------------------------------
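Since openib needs to register (lock) memory, I also want to make sure the ulimit actually propagates to the MPI processes on the compute nodes, and not just to the submission script; a minimal sketch of that check (assuming the same $NSLOTS allocation):

# Print the locked-memory limit as seen by each launched process
mpirun -np $NSLOTS bash -c 'echo "$(hostname): locked memory = $(ulimit -l)"'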

This also contains extra diagnostics for the path, library path, locked memory, etc. All seems OK, and as before the tcp run goes well, while the openib run has a communication problem (it looks like no communication channel can be opened or recognised). I will try OMPI 1.3rc2 (as has been suggested); failing that, I will try to isolate a test case to see if the problem can be reproduced on other systems. Meanwhile, I'm happy to hear any suggestions you might have.
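One more thing I plan to try in the meantime is raising the BTL verbosity so the openib setup failure is logged in more detail; a sketch using Open MPI's standard btl_base_verbose parameter (the verbosity level here is illustrative):

mpirun -np $NSLOTS -mca btl openib,sm,self \
       -mca btl_base_verbose 30 ./k-string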

Thanks,
Biagio
