On 10.03.2011 at 22:43, Gavin W. Burris wrote:

> I had already built ompi 1.4.1 with those options. Originally, I
> suspected two compute nodes that I had to kickstart / re-image because
> of bad drives, thinking they were slightly different from the other
> nodes, maybe a library mismatch.
Ok.

> I have now installed the latest Grid Engine and Open MPI 1.4.2. I was
> still getting the same error, though. After returning to it a few hours
> later, things are looking OK. Weird...

You are just using a plain "mpirun ~/MPI/test"? Then we have to check the
setting for the start of slave tasks where e.g. ROCKS fills in something
stupid by default. Can you please post:

$ qconf -sconf

-- Reuti
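For comparison, the remote-startup entries of an unmodified SGE global
configuration (cf. sge_conf(5)) would appear in that output roughly as
follows; the values below are illustrative defaults, not taken from the
cluster in question:

$ qconf -sconf | grep -E 'qlogin|rlogin|rsh'
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

If an installer such as ROCKS has replaced these with ssh wrappers, the
slave tasks are no longer started through SGE's builtin mechanism and the
environment the MPI processes inherit can differ from the interactive case.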
> Thanks again!
>
> On 03/10/2011 01:51 PM, Reuti wrote:
>> On 10.03.2011 at 19:38, Gavin W. Burris wrote:
>>
>>> Has anyone had a similar problem to this? Note that each node has 16
>>> slots, so 17 is utilizing the interconnect. A simple Open MPI hello
>>> world works as expected:
>>> $ mpirun --machinefile /etc/machines.list -np 17 ~/MPI/test
>>
>> Yep, Open MPI needs to be compiled "--with-sge --with-openib=<dir>":
>> http://icl.cs.utk.edu/open-mpi/faq/?category=building#build-p2p
>>
>> Then a plain "mpirun ~/MPI/test" will route the job to the slots granted
>> by SGE for the job automatically.
>>
>> To be sure that IB is used you can disable the tcp interface: "mpirun
>> --mca btl ^tcp ~/MPI/test".
>>
>> -- Reuti
>>
>>
>>> Hello World from Node 16
>>> Hello World from Node 9
>>> Hello World from Node 3
>>> Hello World from Node 2
>>> Hello World from Node 5
>>> Hello World from Node 12
>>> Hello World from Node 11
>>> Hello World from Node 8
>>> Hello World from Node 15
>>> Hello World from Node 7
>>> Hello World from Node 1
>>> Hello World from Node 4
>>> Hello World from Node 0
>>> Hello World from Node 10
>>> Hello World from Node 13
>>> Hello World from Node 14
>>> Hello World from Node 6
>>>
>>>
>>> But with Grid Engine I get these errors:
>>> $ qrsh -verbose -V -q all.q -pe ompi 17 mpirun -np 17 ~/MPI/test
>>> Your job 23 ("mpirun") has been submitted
>>> waiting for interactive job to be scheduled ...
>>> Your interactive job 23 has been successfully scheduled.
>>> Establishing builtin session to host node17 ...
>>> node17:16.0.ErrPkt: Received packet for context 31 on context 16.
>>> Receive Header Queue offset: 0x0. Exiting.
>>>
>>>
>>> test:29432 terminated with signal 6 at PC=3a67e30265 SP=7fff250173a8.
>>> Backtrace:
>>> /lib64/libc.so.6(gsignal+0x35)[0x3a67e30265]
>>> /lib64/libc.so.6(abort+0x110)[0x3a67e31d10]
>>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b35940]
>>> /usr/lib64/libpsm_infinipath.so.1(psmi_handle_error+0x237)[0x2b15b8b35b87]
>>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b4ba3d]
>>> /usr/lib64/libpsm_infinipath.so.1(ips_ptl_poll+0x9b)[0x2b15b8b49c5b]
>>> /usr/lib64/libpsm_infinipath.so.1(psmi_poll_internal+0x50)[0x2b15b8b49b30]
>>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b2fc51]
>>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b30594]
>>> /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x32e)[0x2b15b8b34ece]
>>> /usr/lib64/openmpi/mca_mtl_psm.so[0x2b15b890ec45]
>>> /usr/lib64/openmpi/mca_pml_cm.so[0x2b15b80d96d4]
>>> /usr/lib64/libmpi.so.0[0x3c2ca36179]
>>> /usr/lib64/libmpi.so.0(MPI_Init+0xf0)[0x3c2ca531c0]
>>> /data0/home/bug/MPI/test(main+0x1c)[0x400844]
>>> /lib64/libc.so.6(__libc_start_main+0xf4)[0x3a67e1d994]
>>> /data0/home/bug/MPI/test[0x400779]
>>> --------------------------------------------------------------------------
>>> mpirun has exited due to process rank 12 with PID 29432 on
>>> node node17 exiting without calling "finalize". This may
>>> have caused other processes in the application to be
>>> terminated by signals sent by mpirun (as reported here).
>>> --------------------------------------------------------------------------
>>>
>>> Simple hostname commands work either way. The combination of Grid
>>> Engine and Open MPI seems to be failing. Any pointers are much
>>> appreciated.
>>>
>>> Cheers,
>>> --
>>> Gavin W. Burris
>>> Senior Systems Programmer
>>> Information Security and Unix Systems
>>> School of Arts and Sciences
>>> University of Pennsylvania
>
> --
> Gavin W. Burris
> Senior Systems Programmer
> Information Security and Unix Systems
> School of Arts and Sciences
> University of Pennsylvania
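To try the tight integration described in the quoted advice, a minimal
batch job for the same test program could look like the sketch below. The
queue all.q, the parallel environment ompi, the slot count 17 and the path
~/MPI/test are taken from the thread; the script name hello.sh and the
remaining directives are assumptions, not a tested configuration:

$ cat hello.sh
#!/bin/sh
# Sketch only: -N and -cwd are assumed conveniences, not from the thread.
#$ -N mpi_hello
#$ -cwd
#$ -q all.q
#$ -pe ompi 17
# With Open MPI built --with-sge, mpirun needs no -np and no machinefile:
# it picks up the slot count and host list granted by SGE for this job.
# "--mca btl ^tcp" disables the tcp BTL so traffic must use InfiniBand.
mpirun --mca btl ^tcp ~/MPI/test

$ qsub hello.sh

Whether the SGE support was actually compiled in can be checked with
"ompi_info | grep gridengine", which should list a gridengine ras component.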
