On 10.12.2012, at 19:37, Luca Clementi wrote:

> On Mon, Dec 10, 2012 at 9:27 AM, Forster, Robert
> <robert.fors...@agr.gc.ca> wrote:
>> Hello all:
>>
>> I'm running a small Rocks cluster (Rocks 5.4, 7 nodes, 56 cores). I
>> need to run many iterations of a program that takes 13 hrs to finish on
>> 53 cores. I can successfully run the program via the command line,

Do you provide a -machinefile in this case?
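I.e. something like the following (just a sketch: the hostfile ./hosts and
its contents are made up here, assuming 8 slots per node on your
7-node/56-core cluster):

$ cat ./hosts
compute-0-0.local slots=8
compute-0-1.local slots=8
$ /share/apps/mpi/gcc460/openmpi-1.4.3/bin/mpirun -np 53 \
      -machinefile ./hosts /share/apps/test/ring_c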
>> however when I tried an SGE script it failed. I then tested mpi-ring_c
>> and hello_c, and they also both failed. I really need to queue this
>> program up so I'm not just running it once per day.
>>
>> When submitted with qsub -pe mpi 56 mpi-ring.qsub

What are the settings of the PEs "mpi" here and "orte" below? Which
startup method did you set in SGE's configuration: rsh_command/rsh_daemon?

>> mpi-ring.qsub
>>
>> #!/bin/bash
>> #
>> #$ -cwd
>> #$ -j y
>> #$ -S /bin/bash
>> #
>>
>> /share/apps/mpi/gcc460/openmpi-1.4.3/bin/mpirun /share/apps/test/ring_c
>>
>> [mono-addon] -bash-3.2$ cat mpi-ring.qsub.o869
>> error: executing task of job 869 failed: execution daemon on host
>> "compute-0-0.local" didn't accept task
>> error: executing task of job 869 failed: execution daemon on host
>> "compute-0-10.local" didn't accept task
>> ...

Such a message is also output if the addressed node is not in the list of
granted nodes, or if too many invocations of `qrsh -inherit ...` are made
to a node under tight integration, i.e. more slots are used on it than
were granted. Did you define any default hostlist in Open MPI?
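All of this can be checked like below (a sketch only; the Open MPI prefix
is taken from your job script, and the exact values on your cluster may
differ):

# PE definitions: for a job spanning nodes you need control_slaves TRUE
# and an allocation_rule like $fill_up or $round_robin
$ qconf -sp mpi
$ qconf -sp orte

# startup method in SGE's configuration
$ qconf -sconf | grep -E "rsh_command|rsh_daemon"

# an Open MPI default hostfile would live here; under a tight SGE
# integration it should stay empty (comments only)
$ cat /share/apps/mpi/gcc460/openmpi-1.4.3/etc/openmpi-default-hostfile

Inside a running job you can also `cat $PE_HOSTFILE` to see which nodes
and slot counts were actually granted, and compare this with the hosts on
which the tasks are started.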
-- Reuti

> It seems that you have a problem with the SGE installation on two nodes.
> First you should try to fix this. Go to the nodes and check whether
> there is any problem with them (disks? Restart SGE with
> /etc/init.d/sgeexecXXXX restart).
>
> Then you can try an SGE script with this extra line:
>
> #$ -l hostname=compute-0-0.local
>
> and have it simply execute `hostname` on that node, to debug.
>
>> When submitted with qsub -pe orte 56 mpi-ring.qsub
>>
>> [compute-0-2:13313] *** Process received signal ***
>> [compute-0-2:13313] Signal: Segmentation fault (11)
>> [compute-0-2:13313] Signal code: Address not mapped (1)
>> [compute-0-2:13313] Failing at address: 0x206
>> [compute-0-2:13313] [ 0] /lib64/libpthread.so.0 [0x3a0c40eb10]
>> [compute-0-2:13313] [ 1] /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_btl_sm.so [0x2ac3f3ba6188]
>> [compute-0-2:13313] [ 2] /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_bml_r2.so [0x2ac3f2f467f2]
>> [compute-0-2:13313] [ 3] /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2ac3f2b302ee]
>> [compute-0-2:13313] [ 4] /share/apps/mpi/gcc460/openmpi-1.4.3/lib/libmpi.so.0 [0x2ac3f019b6e9]
>> [compute-0-2:13313] [ 5] /share/apps/mpi/gcc460/openmpi-1.4.3/lib/libmpi.so.0(MPI_Init+0x16b) [0x2ac3f01ba38b]
>> [compute-0-2:13313] [ 6] /share/apps/test/ring_c(main+0x29) [0x4009dd]
>> [compute-0-2:13313] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3a0bc1d994]
>> [compute-0-2:13313] [ 8] /share/apps/test/ring_c [0x4008f9]
>> [compute-0-2:13313] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 14 with PID 13313 on node
>> compute-0-2.local exited on signal 11 (Segmentation fault).
>>
>> When I change the allocation rule to $pe_slots and only run 8 processes,
>> it works. However this doesn't help.
>>
>> Since this will be the major work for this computer over the next month
>> or two, I'm thinking of starting over and installing Rocks 6.1,
>> especially if InfiniBand is built in. Unless there is a simple fix. Is
>> there something I need to do to set up SGE to run across multiple nodes?
>
> SGE works out of the box on a Rocks cluster.
> Upgrading is always a good idea.
>
> Luca