I am using Open MPI to run a job on 4 nodes with 2 processors per node. It seems that 5 of the 8 processes ran the application successfully and 3 did not. The last thing my code does is call MPI_Barrier, and that call never returns (probably because 3 of the processes never started properly?). Here is the error message I got:

[0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] from hplcnla160 to: hplcnla162 error polling HP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 6158264 opcode 0

And here is the script I used:

#!/bin/bash -debug
#PBS -N mytest
#PBS -l nodes=4:ppn=2,walltime=00:05:00,tpn=2
#PBS -j oe
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
/opt/openmpi-1.2.4/gnu/bin/mpirun -np $NP My_Executable

Has anybody seen this kind of error before?

Thanks.
CJ