I am using Open MPI to run a job on 4 nodes with 2 processors per node. It
seems that 5 of the 8 processes executed the application successfully and 3
did not. The error message I got is below. The last thing the code does is
call MPI_Barrier, and that call never returns (probably because 3 of the
processes never ran properly?).

[0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] from
hplcnla160 to: hplcnla162 error polling HP CQ with status LOCAL LENGTH
ERROR status number 1 for wr_id 6158264 opcode 0

And here is the script I used:

#!/bin/bash -debug
#PBS -N mytest
#PBS -l nodes=4:ppn=2,walltime=00:05:00,tpn=2
#PBS -j oe

NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
/opt/openmpi-1.2.4/gnu/bin/mpirun -np $NP My_Executable
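
To rule out a bad allocation, a check along the following lines could be added before the mpirun line. This is only a sketch: the helper names count_slots and slots_per_host are hypothetical and not part of the script above, and it assumes the node file lists one hostname per slot, which is what the wc -l count above already relies on.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original job script.

# Total number of slots: one line per slot in the node file.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Slots per host, printed as "hostname count", sorted by hostname.
slots_per_host() {
    sort "$1" | uniq -c | awk '{print $2, $1}'
}

# Only report when running inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "total slots: $(count_slots "$PBS_NODEFILE")"
    slots_per_host "$PBS_NODEFILE"
fi
```

For the nodes=4:ppn=2 request above, this should report 8 slots in total and 2 per host if the allocation was granted as requested.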

Has anybody seen this kind of error before? Thanks.

CJ
