Reuti,

I had already built Open MPI 1.4.1 with those options.  Originally, I
suspected two compute nodes that I had to kickstart / re-image because
of bad drives, thinking they were slightly different from the other
nodes, maybe a library mismatch.

I have now installed the latest Grid Engine and Open MPI 1.4.2.  At
first I was still getting the same error, but after returning to it a
few hours later, things are looking OK.  Weird...
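
For the record, the rebuild was configured with the options from the
FAQ, along these lines (the prefix and openib paths here are
placeholders, not our exact values):

$ ./configure --prefix=/opt/openmpi-1.4.2 --with-sge --with-openib=/usr
$ make all install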

Thanks again!

On 03/10/2011 01:51 PM, Reuti wrote:
> On 10.03.2011 at 19:38, Gavin W. Burris wrote:
> 
>> Has anyone had a similar problem to this?  Note that each node has 16
>> slots, so a 17-process run has to use the interconnect.  A simple Open
>> MPI hello world works as expected:
>> $ mpirun --machinefile /etc/machines.list -np 17 ~/MPI/test
> 
> Yep, Open MPI needs to be compiled with "--with-sge --with-openib=<dir>":
> http://icl.cs.utk.edu/open-mpi/faq/?category=building#build-p2p
> 
> Then a plain "mpirun ~/MPI/test" will automatically route the job to the
> slots granted by SGE.
> 
> To be sure that IB is used, you can disable the tcp interface: "mpirun
> --mca btl ^tcp ~/MPI/test".
> 
> -- Reuti
> 
> 
>> Hello World from Node 16
>> Hello World from Node 9
>> Hello World from Node 3
>> Hello World from Node 2
>> Hello World from Node 5
>> Hello World from Node 12
>> Hello World from Node 11
>> Hello World from Node 8
>> Hello World from Node 15
>> Hello World from Node 7
>> Hello World from Node 1
>> Hello World from Node 4
>> Hello World from Node 0
>> Hello World from Node 10
>> Hello World from Node 13
>> Hello World from Node 14
>> Hello World from Node 6
>>
>>
>> But with Grid Engine I get these errors:
>> $ qrsh -verbose -V -q all.q -pe ompi 17 mpirun -np 17 ~/MPI/test
>> Your job 23 ("mpirun") has been submitted
>> waiting for interactive job to be scheduled ...
>> Your interactive job 23 has been successfully scheduled.
>> Establishing builtin session to host node17 ...
>> node17:16.0.ErrPkt: Received packet for context 31 on context 16.
>> Receive Header Queue offset: 0x0. Exiting.
>>
>>
>> test:29432 terminated with signal 6 at PC=3a67e30265 SP=7fff250173a8.
>> Backtrace:
>> /lib64/libc.so.6(gsignal+0x35)[0x3a67e30265]
>> /lib64/libc.so.6(abort+0x110)[0x3a67e31d10]
>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b35940]
>> /usr/lib64/libpsm_infinipath.so.1(psmi_handle_error+0x237)[0x2b15b8b35b87]
>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b4ba3d]
>> /usr/lib64/libpsm_infinipath.so.1(ips_ptl_poll+0x9b)[0x2b15b8b49c5b]
>> /usr/lib64/libpsm_infinipath.so.1(psmi_poll_internal+0x50)[0x2b15b8b49b30]
>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b2fc51]
>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b30594]
>> /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x32e)[0x2b15b8b34ece]
>> /usr/lib64/openmpi/mca_mtl_psm.so[0x2b15b890ec45]
>> /usr/lib64/openmpi/mca_pml_cm.so[0x2b15b80d96d4]
>> /usr/lib64/libmpi.so.0[0x3c2ca36179]
>> /usr/lib64/libmpi.so.0(MPI_Init+0xf0)[0x3c2ca531c0]
>> /data0/home/bug/MPI/test(main+0x1c)[0x400844]
>> /lib64/libc.so.6(__libc_start_main+0xf4)[0x3a67e1d994]
>> /data0/home/bug/MPI/test[0x400779]
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 12 with PID 29432 on
>> node node17 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>>
>> Simple hostname commands work either way.  The combination of Grid
>> Engine and Open MPI seems to be failing.  Any pointers are much
>> appreciated.
>>
>> Cheers,
>> -- 
>> Gavin W. Burris
>> Senior Systems Programmer
>> Information Security and Unix Systems
>> School of Arts and Sciences
>> University of Pennsylvania
> 
> 
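
For anyone who finds this thread later: the test program is just a
minimal MPI hello world, along these lines (a sketch, not the exact
source):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    /* The backtrace above shows the abort happening inside MPI_Init,
     * during the PSM endpoint connect. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello World from Node %d\n", rank);
    MPI_Finalize();
    return 0;
}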

-- 
Gavin W. Burris
Senior Systems Programmer
Information Security and Unix Systems
School of Arts and Sciences
University of Pennsylvania
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
