Aha, it's a parallel job and the slaves are not stopped. This is normal: for parallel jobs the suspend signal isn't broadcast to all slaves (there was a patch available by Rayson to enable it).
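To see what a stopped process looks like in `ps`, here is a minimal sketch you can run on any node. It uses a throwaway `sleep` process as a stand-in for a job process; the helper name `state_of` is made up for this illustration and is not part of SGE or Open MPI.

```shell
#!/bin/sh
# Throwaway background process standing in for a job process (assumption:
# any long-running command would do here).
sleep 30 &
pid=$!

# Hypothetical helper: print the first letter of the process state from ps.
state_of() {
    ps -o stat= -p "$1" | tr -d ' ' | cut -c1
}

# Suspend the process and give the kernel a moment to update its state.
kill -STOP "$pid"
sleep 1
stopped_state=$(state_of "$pid")

# Resume it again and read the state once more.
kill -CONT "$pid"
sleep 1
resumed_state=$(state_of "$pid")

echo "after SIGSTOP: $stopped_state"
echo "after SIGCONT: $resumed_state"

# Clean up the stand-in process.
kill "$pid"
wait "$pid" 2>/dev/null || true
```

After SIGSTOP the state letter should be `T`; after SIGCONT it should be something else again. Processes on the slave nodes that never receive the signal will simply stay in `R` or `S`.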
Alternatively, you can use a feature in Open MPI itself:

http://www.open-mpi.org/faq/?category=running#suspend-resume

which looks easier.

BTW: any reason to use `ssh` as job starter and not 'builtin'?

-- Reuti


On 11.10.2012 at 12:50, Xavier wrote:

>> According to S the job is suspended. Does `qstat -f` show state C for the
>> queue (calendar suspended)?
> Yes, compute-0-2 is in C state and the 3 others are in aC state.
>
> `ps -e f` on compute-0-2 gives:
>
> 12395 ?   Sl    25:50 /opt/gridengine/bin/lx26-amd64/sge_execd
>  6299 ?   S      0:00  \_ sge_shepherd-28865 -bg
>  6301 ?   TNs    0:00  |   \_ bash /opt/gridengine/default/spool/compute-0-2/job_scripts/28865
>  6302 ?   TN     0:00  |       \_ /bin/csh ./run_roms.csh
> 25564 ?   TN     0:03  |           \_ mpirun -np 32 ./roms roms.in
> 25568 ?   TN     0:00  |               \_ /opt/gridengine/bin/lx26-amd64/qrsh -inherit -nostdin -V compute-0-3.loca
> 25572 ?   TN     0:00  |               |   \_ /usr/bin/ssh -n -p 38836 compute-0-3.local exec '/opt/gridengine/util
> 25569 ?   TN     0:00  |               \_ /opt/gridengine/bin/lx26-amd64/qrsh -inherit -nostdin -V compute-0-1.loca
> 25571 ?   TN     0:00  |               |   \_ /usr/bin/ssh -n -p 44411 compute-0-1.local exec '/opt/gridengine/util
> 25570 ?   TN     0:00  |               \_ /opt/gridengine/bin/lx26-amd64/qrsh -inherit -nostdin -V compute-0-14.loc
> 25573 ?   TN     0:00  |               |   \_ /usr/bin/ssh -n -p 52152 compute-0-14.local exec '/opt/gridengine/uti
> 25574 ?   TN   253:19  |               \_ ./roms roms.in
> 25575 ?   TN   253:28  |               \_ ./roms roms.in
> 25576 ?   TN   253:26  |               \_ ./roms roms.in
> 25577 ?   TN   253:30  |               \_ ./roms roms.in
> 25578 ?   TN   253:20  |               \_ ./roms roms.in
> 25579 ?   TN   253:27  |               \_ ./roms roms.in
> 25580 ?   TN   253:25  |               \_ ./roms roms.in
> 25581 ?   TN   253:20  |               \_ ./roms roms.in
>  4666 ?   S      0:00  \_ sge_shepherd-28899 -bg
>  4667 ?   Ss     0:00      \_ sshd: forecast [priv]
>  4672 ?   S      0:00          \_ sshd: forecast@notty
>  4673 ?   Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool
>  4764 ?   S      0:00                  \_ orted -mca ess env -mca orte_ess_jobid 858914816 -mca orte_ess_vpid 3 -mc
>  4765 ?   R    188:03                      \_ ./roms roms_forecast.in
>  4766 ?   R    190:14                      \_ ./roms roms_forecast.in
>  4767 ?   R    189:45                      \_ ./roms roms_forecast.in
>  4768 ?   R    190:30                      \_ ./roms roms_forecast.in
>  4769 ?   R    190:48                      \_ ./roms roms_forecast.in
>  4770 ?   R    190:39                      \_ ./roms roms_forecast.in
>  4771 ?   R    190:56                      \_ ./roms roms_forecast.in
>  4772 ?   R    189:54                      \_ ./roms roms_forecast.in
>
> and top:
>
>  4765 forecast  25   0  250m  80m 5008 R 100.1  0.5 190:54.74 roms
>  4766 forecast  25   0  250m  80m 5020 R 100.1  0.5 193:05.62 roms
>  4769 forecast  25   0  250m  80m 5232 R 100.1  0.5 193:39.34 roms
>  4771 forecast  25   0  250m  69m 5012 R 100.1  0.4 193:47.49 roms
>  4767 forecast  25   0  250m  80m 5236 R  99.8  0.5 192:36.86 roms
>  4768 forecast  25   0  250m  80m 5240 R  99.8  0.5 193:21.09 roms
>  4770 forecast  25   0  250m  80m 5240 R  99.8  0.5 193:30.51 roms
>  4772 forecast  25   0  250m  69m 5028 R  99.8  0.4 192:46.14 roms
>
> while on compute-0-1 & co.:
>
> `ps -e f` gives:
>
> 11973 ?   Sl     7:13 /opt/gridengine/bin/lx26-amd64/sge_execd
> 25352 ?   S      0:00  \_ sge_shepherd-28865 -bg
> 25353 ?   SNs    0:00  |   \_ sshd: xavier [priv]
> 25358 ?   SN     0:00  |       \_ sshd: xavier@notty
> 25359 ?   SNs    0:00  |           \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool
> 25450 ?   SN     0:00  |               \_ orted -mca ess env -mca orte_ess_jobid 1701052416 -mca orte_ess_vpid 2 -m
> 25451 ?   RN   421:43  |                   \_ ./roms roms.in
> 25452 ?   RN   421:32  |                   \_ ./roms roms.in
> 25453 ?   RN   422:02  |                   \_ ./roms roms.in
> 25454 ?   RN   421:53  |                   \_ ./roms roms.in
> 25455 ?   RN   422:05  |                   \_ ./roms roms.in
> 25456 ?   RN   421:55  |                   \_ ./roms roms.in
> 25457 ?   RN   421:48  |                   \_ ./roms roms.in
> 25458 ?   RN   422:01  |                   \_ ./roms roms.in
>  4544 ?   S      0:00  \_ sge_shepherd-28899 -bg
>  4545 ?   Ss     0:00      \_ sshd: forecast [priv]
>  4550 ?   S      0:00          \_ sshd: forecast@notty
>  4551 ?   Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool
>  4642 ?   S      0:00                  \_ orted -mca ess env -mca orte_ess_jobid 858914816 -mca orte_ess_vpid 2 -mc
>  4643 ?   S    187:37                      \_ ./roms roms_forecast.in
>  4644 ?   S    189:38                      \_ ./roms roms_forecast.in
>  4645 ?   S    190:40                      \_ ./roms roms_forecast.in
>  4646 ?   R    188:53                      \_ ./roms roms_forecast.in
>  4647 ?   R    161:50                      \_ ./roms roms_forecast.in
>  4648 ?   S    189:12                      \_ ./roms roms_forecast.in
>  4649 ?   R    158:36                      \_ ./roms roms_forecast.in
>  4650 ?   R    184:55                      \_ ./roms roms_forecast.in
>
> and top:
>
>  4643 forecast  25   0  250m  80m 5028 R 100.6  0.5 188:46.90 roms
>  4644 forecast  25   0  250m  80m 5028 R 100.6  0.5 190:50.58 roms
>  4646 forecast  25   0  250m  80m 5240 R 100.6  0.5 190:05.82 roms
>  4645 forecast  25   0  250m  80m 5228 R  98.6  0.5 191:52.46 roms
>  4647 forecast  25   0  250m  80m 5248 R  98.6  0.5 162:55.69 roms
>  4649 forecast  25   0  250m  80m 5020 R  98.6  0.5 159:38.28 roms
>  4648 forecast  25   0  250m  80m 5236 R  82.8  0.5 190:23.34 roms
>  4650 forecast  25   0  250m  80m 5016 R  82.8  0.5 186:03.83 roms
> 25451 xavier    39  19  219m  52m 4896 R   5.9  0.3 421:48.02 roms
> 25453 xavier    39  19  219m  53m 5052 R   5.9  0.3 422:07.73 roms
> 25456 xavier    39  19  219m  53m 5060 R   5.9  0.3 422:00.43 roms
> 25454 xavier    39  19  219m  52m 4924 R   3.9  0.3 421:58.77 roms
> 25455 xavier    39  19  219m  52m 4948 R   3.9  0.3 422:10.83 roms
> 25457 xavier    39  19  219m  53m 5056 R   3.9  0.3 421:53.80 roms
> 25458 xavier    39  19  219m  52m 4840 R   3.9  0.3 422:07.16 roms
> 25452 xavier    39  19  219m  53m 5060 R   2.0  0.3 421:37.02 roms
>
>>
>> Did you check with:
>>
>> $ ps -e f
>>
>> on the node that all processes are kids of the sge_shepherd? They should
>> have gotten state "T" then in `ps`.
>>
>> -- Reuti
>>
>>
>>> but again, while compute-0-2 has a load of 8 (8 CPUs per node), compute-0-1
>>> and the others are overloaded at 16...
>>>
>>> using SGE 6.2u4 on a ROCKS 5.3 cluster
>>>
>>> On 11/10/2012 11:09, Reuti wrote:
>>>> On 11.10.2012 at 11:56, Xavier wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have created a calendar queue only available during the day (6am to
>>>>> 1am), keeping nodes free for the night jobs through another queue.
>>>>> This queue is composed of 4 nodes (32 CPUs). All jobs used the 32 CPUs
>>>> Good - but what calendar definition did you create in detail?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> What I don't get is that one of the nodes, AND ONLY ONE, drops its load to 0 at
>>>>> 1am. This node is the one the job is attributed to, i.e.
>>>>>
>>>>> from qstat
>>>>> JOB1 xavier r 10/03/2012 09:44:18 [email protected] 32
>>>>>
>>>>> while the other 3 nodes keep their load and are therefore overloaded at night.
>>>>>
>>>>> Example of the last day's load
>>>>> for compute-0-2
>>>>> http://nautilus.ciimar.up.pt/ganglia/graph.php?g=load_report&z=large&c=nautilus&h=compute-0-2.local&m=load_one&r=day&s=descending&hc=4&mc=2&st=1349865411
>>>>> and for compute-0-1
>>>>> http://nautilus.ciimar.up.pt/ganglia/graph.php?g=load_report&z=large&c=nautilus&h=compute-0-1.local&m=load_one&r=day&s=descending&hc=4&mc=2&st=1349865467
>>>>>
>>>>> Why do all nodes not behave the same?
>>>>>
>>>>> Xavier
>>>>>
>>>>> --
>>>>> Universidade da Madeira
>>>>> CCM - Centro de Ciencias Matematicas
>>>>> Campus Universitario da Penteada
>>>>> 9000-390 Funchal, Madeira Island
>>>>> Portugal
>>>>>
>>>>> (+351) 291 705 186
>>>>> http://wakes.uma.pt
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
