Has anyone had a similar problem to this?  Note that each node has 16
slots, so a 17-process job has to use the interconnect.  A simple Open
MPI hello world works as expected:
$ mpirun --machinefile /etc/machines.list -np 17 ~/MPI/test
Hello World from Node 16
Hello World from Node 9
Hello World from Node 3
Hello World from Node 2
Hello World from Node 5
Hello World from Node 12
Hello World from Node 11
Hello World from Node 8
Hello World from Node 15
Hello World from Node 7
Hello World from Node 1
Hello World from Node 4
Hello World from Node 0
Hello World from Node 10
Hello World from Node 13
Hello World from Node 14
Hello World from Node 6


But with Grid Engine I get these errors:
$ qrsh -verbose -V -q all.q -pe ompi 17 mpirun -np 17 ~/MPI/test
Your job 23 ("mpirun") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 23 has been successfully scheduled.
Establishing builtin session to host node17 ...
node17:16.0.ErrPkt: Received packet for context 31 on context 16.
Receive Header Queue offset: 0x0. Exiting.


test:29432 terminated with signal 6 at PC=3a67e30265 SP=7fff250173a8.
Backtrace:
/lib64/libc.so.6(gsignal+0x35)[0x3a67e30265]
/lib64/libc.so.6(abort+0x110)[0x3a67e31d10]
/usr/lib64/libpsm_infinipath.so.1[0x2b15b8b35940]
/usr/lib64/libpsm_infinipath.so.1(psmi_handle_error+0x237)[0x2b15b8b35b87]
/usr/lib64/libpsm_infinipath.so.1[0x2b15b8b4ba3d]
/usr/lib64/libpsm_infinipath.so.1(ips_ptl_poll+0x9b)[0x2b15b8b49c5b]
/usr/lib64/libpsm_infinipath.so.1(psmi_poll_internal+0x50)[0x2b15b8b49b30]
/usr/lib64/libpsm_infinipath.so.1[0x2b15b8b2fc51]
/usr/lib64/libpsm_infinipath.so.1[0x2b15b8b30594]
/usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x32e)[0x2b15b8b34ece]
/usr/lib64/openmpi/mca_mtl_psm.so[0x2b15b890ec45]
/usr/lib64/openmpi/mca_pml_cm.so[0x2b15b80d96d4]
/usr/lib64/libmpi.so.0[0x3c2ca36179]
/usr/lib64/libmpi.so.0(MPI_Init+0xf0)[0x3c2ca531c0]
/data0/home/bug/MPI/test(main+0x1c)[0x400844]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3a67e1d994]
/data0/home/bug/MPI/test[0x400779]
--------------------------------------------------------------------------
mpirun has exited due to process rank 12 with PID 29432 on
node node17 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Simple hostname commands work either way.  It is the combination of Grid
Engine and Open MPI that seems to be failing.  Any pointers are much
appreciated.

Cheers,
-- 
Gavin W. Burris
Senior Systems Programmer
Information Security and Unix Systems
School of Arts and Sciences
University of Pennsylvania
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
