On Mon, Dec 10, 2012 at 9:27 AM, Forster, Robert
<[email protected]> wrote:
> Hello all:
>
> I'm running a small Rocks cluster (Rocks 5.4, 7 nodes, 56 cores). I
> need to run many iterations of a program that takes 13 hrs to finish
> on 53 cores. I can successfully run the program from the command
> line; however, when I tried an SGE script, it failed. I then tested
> mpi-ring_c and hello_c, and they also both failed. I really need to
> queue this program up so I'm not just running it once per day.
>
> When submitted with qsub -pe mpi 56 mpi-ring.qsub
> mpi-ring.qsub
>
> #!/bin/bash
> #
> #$ -cwd
> #$ -j y
> #$ -S /bin/bash
> #
>
>
> /share/apps/mpi/gcc460/openmpi-1.4.3/bin/mpirun /share/apps/test/ring_c
>
>
> [mono-addon] -bash-3.2$ cat mpi-ring.qsub.o869
> error: executing task of job 869 failed: execution daemon on host
> "compute-0-0.local" didn't accept task
> error: executing task of job 869 failed: execution daemon on host
> "compute-0-10.local" didn't accept task
> ...

It seems that you have a problem with the SGE installation on (at
least) two nodes. First, you should try to fix that: log in to the
failing nodes and check for problems (full disks? Restart SGE with
/etc/init.d/sgeexecXXXX restart).
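
For example, a minimal sanity check (a sketch; the exact init script
name under /etc/init.d depends on your SGE version):

# From the frontend: are all execution daemons reporting in?
qhost
qstat -f                  # queue instances in state 'au' or 'E' are suspect

# On a failing node:
ssh compute-0-0
df -h                     # a full disk (often /tmp or the spool) can
                          # stop the execd from accepting tasks
ps -ef | grep sge_execd   # is the execution daemon running at all?
/etc/init.d/sgeexec* restart   # adjust to your init script's actual name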

Then you can try an SGE script with this extra line:
#$ -l hostname=compute-0-0.local

which pins the job to that node, so you can run something as simple
as hostname there to debug.
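
For instance, a minimal debug script along the lines of the one above
(a sketch; substitute whichever node is refusing tasks):

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -l hostname=compute-0-0.local
hostname

Submit it with a plain "qsub debug.qsub"; if even this fails on that
node, the problem is the execd there, not your MPI setup.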

>
> When submitted with qsub -pe orte 56 mpi-ring.qsub
>
> [compute-0-2:13313] *** Process received signal ***
> [compute-0-2:13313] Signal: Segmentation fault (11)
> [compute-0-2:13313] Signal code: Address not mapped (1)
> [compute-0-2:13313] Failing at address: 0x206
> [compute-0-2:13313] [ 0] /lib64/libpthread.so.0 [0x3a0c40eb10]
> [compute-0-2:13313] [ 1]
> /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_btl_sm.so
> [0x2ac3f3ba6188]
> [compute-0-2:13313] [ 2]
> /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_bml_r2.so
> [0x2ac3f2f467f2]
> [compute-0-2:13313] [ 3]
> /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so
> [0x2ac3f2b302ee]
> [compute-0-2:13313] [ 4]
> /share/apps/mpi/gcc460/openmpi-1.4.3/lib/libmpi.so.0 [0x2ac3f019b6e9]
> [compute-0-2:13313] [ 5]
> /share/apps/mpi/gcc460/openmpi-1.4.3/lib/libmpi.so.0(MPI_Init+0x16b)
> [0x2ac3f01ba38b]
> [compute-0-2:13313] [ 6] /share/apps/test/ring_c(main+0x29) [0x4009dd]
> [compute-0-2:13313] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x3a0bc1d994]
> [compute-0-2:13313] [ 8] /share/apps/test/ring_c [0x4008f9]
> [compute-0-2:13313] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 14 with PID 13313 on node
> compute-0-2.local exited on signal 11 (Segmentation fault).
>
> When I change the allocation rule to $pe_slots and only run 8
> processes, it works. However, this doesn't help, since I need the
> job to span multiple nodes.
>
> Since this will be the major workload for this computer over the
> next month or two, I'm thinking of starting over and installing
> Rocks 6.1, especially if InfiniBand support is built in, unless
> there is a simple fix. Is there something I need to do to set up
> SGE to run across multiple nodes?

SGE works out of the box on a Rocks cluster, so no extra setup should
be needed to run across multiple nodes. Upgrading is always a good
idea, though.
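
If you want to double-check the parallel environment setup, something
like this (a sketch; "mpi" and "orte" are the PE names from your qsub
commands):

qconf -spl        # list the parallel environments SGE knows about
qconf -sp mpi     # show the mpi PE; check slots and allocation_rule
qconf -sp orte    # same for the orte PE

With allocation_rule set to $pe_slots, all of a job's slots must fit
on a single node, which matches what you saw with 8 processes;
$fill_up or $round_robin lets a job span nodes.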

Luca
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
