When I submit a simple job (described below) using PBS, I always get one
of the following two errors:
1) [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
recv() failed with errno=104

2) [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
connection failed (errno=111) - retrying (pid=3770)
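
For reference, the two errno values above can be decoded with a small sketch like the one below. It assumes Linux errno numbering (104 and 111 differ on other systems), and errno_name is a hypothetical helper written for this message, not part of Open MPI or PBS.

```shell
# Hypothetical helper: map the two errno values seen above to their
# symbolic names. Assumes Linux numbering from <errno.h>.
errno_name() {
    case "$1" in
        104) echo "ECONNRESET (connection reset by peer)" ;;
        111) echo "ECONNREFUSED (connection refused)" ;;
        *)   echo "unknown (not in this two-entry table)" ;;
    esac
}

errno_name 104   # error 1: recv() failed
errno_name 111   # error 2: connect failed, retrying
```

Both codes would be consistent with the [0,0,0] process having gone away before the other processes finished talking to it, although the messages alone do not prove that.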

The program calls uname and prints the results to standard output. The
only MPI calls it makes are MPI_Init, MPI_Comm_size, MPI_Comm_rank, and
MPI_Finalize. I have tried it with both Open MPI v1.1.2 and v1.1.4, built
with the Intel C compiler 9.1.045, and get the same results. But if I build
the same versions of Open MPI using gcc, the test program always works
fine. The app itself is built with mpicc.

It runs successfully if run from the command line with "mpiexec -n X
<test-program-name>", where X is 1 to 8, but if I wrap it in the
following qsub command file:
---------------------------------------------------
#PBS -l pmem=512mb,nodes=1:ppn=1,walltime=0:10:00
#PBS -m abe
# #PBS -o /home0/dmcr/my_mpi/curt/uname_test.gcc.stdout
# #PBS -e /home0/dmcr/my_mpi/curt/uname_test.gcc.stderr

cd /home/dmcr/my_mpi/openmpi
echo "About to call mpiexec"
module list
mpiexec -n 1 uname_test.intel
echo "After call to mpiexec"
----------------------------------------------------

it fails on any number of processors from 1 to 8, and the application
segfaults.
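
Since the job works from the command line but not under PBS, one thing that might be worth comparing is the environment in the two cases. The following is only a sketch of lines that could be added to the command file before the mpiexec call; report_mpi_env is a hypothetical helper, and the choice of what to print (mpiexec location, LD_LIBRARY_PATH, PBS_NODEFILE contents, ldd output) is an assumption about what may differ, not a known cause.

```shell
# Hypothetical helper: report which mpiexec and shared libraries the job sees.
# Usage: report_mpi_env ./uname_test.intel
report_mpi_env() {
    binary="$1"   # path to the test program
    echo "host: $(uname -n)"
    echo "mpiexec: $(command -v mpiexec || echo 'not found')"
    echo "LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-<unset>}"
    # PBS_NODEFILE is set by PBS inside a job; it is absent on the command line.
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        echo "nodes: $(sort -u "$PBS_NODEFILE" | tr '\n' ' ')"
    else
        echo "nodes: <PBS_NODEFILE not set>"
    fi
    # Show which Open MPI libraries the dynamic linker would pick for the binary.
    if [ -x "$binary" ] && command -v ldd >/dev/null 2>&1; then
        ldd "$binary" | grep -E 'libopal|liborte|libmpi' \
            || echo "libs: no Open MPI libraries listed by ldd"
    else
        echo "libs: cannot inspect $binary"
    fi
}
```

Running the same function once interactively and once inside the job would show whether both resolve to the same Open MPI installation.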

The complete standard error of an 8-processor job follows (note that
mpiexec ran on adroit-31, but usually there is no info about adroit-31
in standard error):
-------------------------
Currently Loaded Modulefiles:
  1) intel/9.1/32/C/9.1.045         4) intel/9.1/32/default
  2) intel/9.1/32/Fortran/9.1.040   5) openmpi/intel/1.1.2/32
  3) intel/9.1/32/Iidb/9.1.045
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x5
[0] func:/usr/local/openmpi/1.1.4/intel/i386/lib/libopal.so.0 [0xb72c5b]
*** End of error message ***
^@[adroit-29:03934] [0,0,2]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
recv() failed with errno=104
[adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv()
failed with errno=104
[adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
connection failed (errno=111) - retrying (pid=3770)
--------------------------

The complete standard error of a 1-processor job follows:
--------------------------
Currently Loaded Modulefiles:
  1) intel/9.1/32/C/9.1.045         4) intel/9.1/32/default
  2) intel/9.1/32/Fortran/9.1.040   5) openmpi/intel/1.1.2/32
  3) intel/9.1/32/Iidb/9.1.045
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2
[0] func:/usr/local/openmpi/1.1.2/intel/i386/lib/libopal.so.0 [0x27d847]
*** End of error message ***
^@[adroit-31:08840] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
connection failed (errno=111) - retrying (pid=8840)
---------------------------

Any thoughts as to why this might be failing?

Thanks,
       Dennis

Dennis McRitchie
Computational Science and Engineering Support (CSES)
Academic Services Department
Office of Information Technology
Princeton University
