On 10.03.2011 at 19:38, Gavin W. Burris wrote:

> Has anyone had a similar problem to this?  Note that each node has 16
> slots, so 17 is utilizing interconnect.  A simple Open MPI hello world
> works as expected:
> $ mpirun --machinefile /etc/machines.list -np 17 ~/MPI/test

Yep, Open MPI needs to be compiled with "--with-sge --with-openib=<dir>":
http://icl.cs.utk.edu/open-mpi/faq/?category=building#build-p2p
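
For example (the prefix and the OFED location below are only placeholders, adjust them to your installation):

$ ./configure --prefix=/opt/openmpi --with-sge --with-openib=/usr
$ make -j4
$ make install

Afterwards "ompi_info | grep gridengine" should list the gridengine components, so you can check that the SGE support was really built in.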

Then a plain "mpirun ~/MPI/test" will automatically start the processes on the
slots that SGE granted to the job.
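
As a sketch (the PE name "ompi" is taken from your qrsh line, the other requests are only examples), a batch job could then be as simple as:

#!/bin/sh
#$ -N mpi_test
#$ -pe ompi 17
#$ -cwd
mpirun ~/MPI/test

submitted with "qsub job.sh" - no -np and no machinefile needed, Open MPI picks up the granted slots from SGE by itself.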

To be sure that IB is used, you can disable the tcp BTL: "mpirun --mca btl
^tcp ~/MPI/test".
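
If you don't want to type this for every run, the same MCA setting can also go into the environment or a per-user parameter file (both are standard Open MPI mechanisms, the file location is $HOME/.openmpi/mca-params.conf):

# in the job script or your shell startup:
export OMPI_MCA_btl="^tcp"

# or in ~/.openmpi/mca-params.conf:
btl = ^tcp

Then the tcp BTL is skipped without any extra mpirun options.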

-- Reuti


> Hello World from Node 16
> Hello World from Node 9
> Hello World from Node 3
> Hello World from Node 2
> Hello World from Node 5
> Hello World from Node 12
> Hello World from Node 11
> Hello World from Node 8
> Hello World from Node 15
> Hello World from Node 7
> Hello World from Node 1
> Hello World from Node 4
> Hello World from Node 0
> Hello World from Node 10
> Hello World from Node 13
> Hello World from Node 14
> Hello World from Node 6
> 
> 
> But with grid engine I get these errors:
> $ qrsh -verbose -V -q all.q -pe ompi 17 mpirun -np 17 ~/MPI/test
> Your job 23 ("mpirun") has been submitted
> waiting for interactive job to be scheduled ...
> Your interactive job 23 has been successfully scheduled.
> Establishing builtin session to host node17 ...
> node17:16.0.ErrPkt: Received packet for context 31 on context 16.
> Receive Header Queue offset: 0x0. Exiting.
> 
> 
> test:29432 terminated with signal 6 at PC=3a67e30265 SP=7fff250173a8.
> Backtrace:
> /lib64/libc.so.6(gsignal+0x35)[0x3a67e30265]
> /lib64/libc.so.6(abort+0x110)[0x3a67e31d10]
> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b35940]
> /usr/lib64/libpsm_infinipath.so.1(psmi_handle_error+0x237)[0x2b15b8b35b87]
> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b4ba3d]
> /usr/lib64/libpsm_infinipath.so.1(ips_ptl_poll+0x9b)[0x2b15b8b49c5b]
> /usr/lib64/libpsm_infinipath.so.1(psmi_poll_internal+0x50)[0x2b15b8b49b30]
> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b2fc51]
> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b30594]
> /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x32e)[0x2b15b8b34ece]
> /usr/lib64/openmpi/mca_mtl_psm.so[0x2b15b890ec45]
> /usr/lib64/openmpi/mca_pml_cm.so[0x2b15b80d96d4]
> /usr/lib64/libmpi.so.0[0x3c2ca36179]
> /usr/lib64/libmpi.so.0(MPI_Init+0xf0)[0x3c2ca531c0]
> /data0/home/bug/MPI/test(main+0x1c)[0x400844]
> /lib64/libc.so.6(__libc_start_main+0xf4)[0x3a67e1d994]
> /data0/home/bug/MPI/test[0x400779]
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 12 with PID 29432 on
> node node17 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> 
> Simple hostname commands work either way.  The combination of grid
> engine and Open MPI seems to be failing.  Any pointers are much appreciated.
> 
> Cheers,
> -- 
> Gavin W. Burris
> Senior Systems Programmer
> Information Security and Unix Systems
> School of Arts and Sciences
> University of Pennsylvania


