On 10.12.2012, at 19:37, Luca Clementi wrote:

> On Mon, Dec 10, 2012 at 9:27 AM, Forster, Robert
> <robert.fors...@agr.gc.ca> wrote:
>> Hello all:
>>
>> I'm running a small Rocks cluster (Rocks 5.4, 7 nodes, 56 cores). I
>> need to run many iterations of a program that takes 13 hrs to finish on
>> 53 cores. I can successfully run the program via the command line,

Do you provide a -machinefile in this case?
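I.e. something like the following (just a sketch: the hostfile ./hosts and
its contents are made up here, assuming 8 slots per node on your
7-node/56-core cluster):

$ cat ./hosts
compute-0-0.local slots=8
compute-0-1.local slots=8
$ /share/apps/mpi/gcc460/openmpi-1.4.3/bin/mpirun -np 53 \
      -machinefile ./hosts /share/apps/test/ring_c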
>> however when I tried an SGE script it failed. I then tested mpi-ring_c
>> and hello_c, and they also both failed. I really need to queue this
>> program up so I'm not just running it once per day.
>>
>> When submitted with qsub -pe mpi 56 mpi-ring.qsub

What are the settings of the PEs "mpi" here and "orte" below? Which
startup method did you set in SGE's configuration: rsh_command/rsh_daemon?

>> mpi-ring.qsub
>>
>> #!/bin/bash
>> #
>> #$ -cwd
>> #$ -j y
>> #$ -S /bin/bash
>> #
>>
>> /share/apps/mpi/gcc460/openmpi-1.4.3/bin/mpirun /share/apps/test/ring_c
>>
>> [mono-addon] -bash-3.2$ cat mpi-ring.qsub.o869
>> error: executing task of job 869 failed: execution daemon on host
>> "compute-0-0.local" didn't accept task
>> error: executing task of job 869 failed: execution daemon on host
>> "compute-0-10.local" didn't accept task
>> ...

Such a message is also output if the addressed node is not in the list of
granted nodes, or if too many invocations of `qrsh -inherit ...` are made
to a node under tight integration, i.e. more slots are used on it than
were granted. Did you define any default hostlist in Open MPI?
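All of this can be checked like below (a sketch only; the Open MPI prefix
is taken from your job script, and the exact values on your cluster may
differ):

# PE definitions: for a job spanning nodes you need control_slaves TRUE
# and an allocation_rule like $fill_up or $round_robin
$ qconf -sp mpi
$ qconf -sp orte

# startup method in SGE's configuration
$ qconf -sconf | grep -E "rsh_command|rsh_daemon"

# an Open MPI default hostfile would live here; under a tight SGE
# integration it should stay empty (comments only)
$ cat /share/apps/mpi/gcc460/openmpi-1.4.3/etc/openmpi-default-hostfile

Inside a running job you can also `cat $PE_HOSTFILE` to see which nodes
and slot counts were actually granted, and compare this with the hosts on
which the tasks are started.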
-- Reuti

> It seems that you have a problem with the SGE installation on two nodes.
> First you should try to fix this. Go to the nodes and check whether
> there is any problem with them (disks? Restart SGE with
> /etc/init.d/sgeexecXXXX restart).
>
> Then you can try an SGE script with this extra line:
>
> #$ -l hostname=compute-0-0.local
>
> and have it simply execute `hostname` on that node, to debug.
>
>> When submitted with qsub -pe orte 56 mpi-ring.qsub
>>
>> [compute-0-2:13313] *** Process received signal ***
>> [compute-0-2:13313] Signal: Segmentation fault (11)
>> [compute-0-2:13313] Signal code: Address not mapped (1)
>> [compute-0-2:13313] Failing at address: 0x206
>> [compute-0-2:13313] [ 0] /lib64/libpthread.so.0 [0x3a0c40eb10]
>> [compute-0-2:13313] [ 1] /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_btl_sm.so [0x2ac3f3ba6188]
>> [compute-0-2:13313] [ 2] /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_bml_r2.so [0x2ac3f2f467f2]
>> [compute-0-2:13313] [ 3] /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2ac3f2b302ee]
>> [compute-0-2:13313] [ 4] /share/apps/mpi/gcc460/openmpi-1.4.3/lib/libmpi.so.0 [0x2ac3f019b6e9]
>> [compute-0-2:13313] [ 5] /share/apps/mpi/gcc460/openmpi-1.4.3/lib/libmpi.so.0(MPI_Init+0x16b) [0x2ac3f01ba38b]
>> [compute-0-2:13313] [ 6] /share/apps/test/ring_c(main+0x29) [0x4009dd]
>> [compute-0-2:13313] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3a0bc1d994]
>> [compute-0-2:13313] [ 8] /share/apps/test/ring_c [0x4008f9]
>> [compute-0-2:13313] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 14 with PID 13313 on node
>> compute-0-2.local exited on signal 11 (Segmentation fault).
>>
>> When I change the allocation rule to $pe_slots and only run 8 processes,
>> it works. However this doesn't help.
>>
>> Since this will be the major work for this computer over the next month
>> or two, I'm thinking of starting over and installing Rocks 6.1,
>> especially if InfiniBand is built in. Unless there is a simple fix. Is
>> there something I need to do to set up SGE to run across multiple nodes?
>
> SGE works out of the box on a Rocks cluster.
> Upgrading is always a good idea.
>
> Luca