Working!!!
following http://www.open-mpi.org/faq/?category=running#suspend-resume

I added -mca orte_forward_job_control 1 to the mpirun command:

mpirun -mca orte_forward_job_control 1 -np 32 ./$CODFILE ${MODEL}.in > ${MODEL}_${TIME}.out

and changed the queue's suspend method as explained there:

shell$ qconf -sq short.q
qname                 all.q
[...snip...]
starter_method        NONE
suspend_method        SIGTSTP
resume_method         NONE



and now all processes get suspended on all nodes.
Thanks a lot for your help!

Regards

Xavier
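
A quick way to confirm that the queue really carries the new setting is to filter the `qconf -sq` output for the relevant attribute. The following is only a minimal sketch: the function name `queue_attr` is made up for illustration, and it assumes the two-column "name value" layout shown in the listing above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): print the value of one queue
# attribute from `qconf -sq <queue>` output supplied on stdin.
# Assumes each attribute sits on a single line as "name value".
queue_attr() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Usage on a submit or admin host (queue name taken from the thread):
#   qconf -sq short.q | queue_attr suspend_method
#   qconf -sq short.q | queue_attr resume_method
```

Reading from stdin keeps the helper independent of the cluster, so it can be tried on a saved copy of the queue configuration first.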

On 11/10/2012 12:45, Reuti wrote:
Aha, it's a parallel job and the slaves are not stopped. This is normal, as for 
parallel jobs the signal isn't broadcast to all slaves (there was a patch 
available by Rayson to enable it).

Or you use a feature in Open MPI itself: 
http://www.open-mpi.org/faq/?category=running#suspend-resume which looks easier.

BTW: any reason to use `ssh` as job starter and not 'builtin'?

-- Reuti


Am 11.10.2012 um 12:50 schrieb Xavier:

According to S the job is suspended. Does `qstat -f` show state C for the queue 
(calendar suspended)?
Yes, compute-0-2 is in C state and the 3 others are in aC state.

ps -e f on compute-0-2 gives:

12395 ?        Sl    25:50 /opt/gridengine/bin/lx26-amd64/sge_execd
6299 ?        S      0:00  \_ sge_shepherd-28865 -bg
6301 ?        TNs    0:00  |   \_ bash /opt/gridengine/default/spool/compute-0-2/job_scripts/28865
6302 ?        TN     0:00  |       \_ /bin/csh ./run_roms.csh
25564 ?        TN     0:03  |           \_ mpirun -np 32 ./roms roms.in
25568 ?        TN     0:00  |               \_ /opt/gridengine/bin/lx26-amd64/qrsh -inherit -nostdin -V compute-0-3.loca
25572 ?        TN     0:00  |               |   \_ /usr/bin/ssh -n -p 38836 compute-0-3.local exec '/opt/gridengine/util
25569 ?        TN     0:00  |               \_ /opt/gridengine/bin/lx26-amd64/qrsh -inherit -nostdin -V compute-0-1.loca
25571 ?        TN     0:00  |               |   \_ /usr/bin/ssh -n -p 44411 compute-0-1.local exec '/opt/gridengine/util
25570 ?        TN     0:00  |               \_ /opt/gridengine/bin/lx26-amd64/qrsh -inherit -nostdin -V compute-0-14.loc
25573 ?        TN     0:00  |               |   \_ /usr/bin/ssh -n -p 52152 compute-0-14.local exec '/opt/gridengine/uti
25574 ?        TN   253:19  |               \_ ./roms roms.in
25575 ?        TN   253:28  |               \_ ./roms roms.in
25576 ?        TN   253:26  |               \_ ./roms roms.in
25577 ?        TN   253:30  |               \_ ./roms roms.in
25578 ?        TN   253:20  |               \_ ./roms roms.in
25579 ?        TN   253:27  |               \_ ./roms roms.in
25580 ?        TN   253:25  |               \_ ./roms roms.in
25581 ?        TN   253:20  |               \_ ./roms roms.in
4666 ?        S      0:00  \_ sge_shepherd-28899 -bg
4667 ?        Ss     0:00      \_ sshd: forecast [priv]
4672 ?        S      0:00          \_ sshd: forecast@notty
4673 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool
4764 ?        S      0:00                  \_ orted -mca ess env -mca orte_ess_jobid 858914816 -mca orte_ess_vpid 3 -mc
4765 ?        R    188:03                      \_ ./roms roms_forecast.in
4766 ?        R    190:14                      \_ ./roms roms_forecast.in
4767 ?        R    189:45                      \_ ./roms roms_forecast.in
4768 ?        R    190:30                      \_ ./roms roms_forecast.in
4769 ?        R    190:48                      \_ ./roms roms_forecast.in
4770 ?        R    190:39                      \_ ./roms roms_forecast.in
4771 ?        R    190:56                      \_ ./roms roms_forecast.in
4772 ?        R    189:54                      \_ ./roms roms_forecast.in

and top:

4765 forecast  25   0  250m  80m 5008 R 100.1  0.5 190:54.74 roms
4766 forecast  25   0  250m  80m 5020 R 100.1  0.5 193:05.62 roms
4769 forecast  25   0  250m  80m 5232 R 100.1  0.5 193:39.34 roms
4771 forecast  25   0  250m  69m 5012 R 100.1  0.4 193:47.49 roms
4767 forecast  25   0  250m  80m 5236 R 99.8  0.5 192:36.86 roms
4768 forecast  25   0  250m  80m 5240 R 99.8  0.5 193:21.09 roms
4770 forecast  25   0  250m  80m 5240 R 99.8  0.5 193:30.51 roms
4772 forecast  25   0  250m  69m 5028 R 99.8  0.4 192:46.14 roms

while on compute-0-1 and the other nodes:

ps -e f gives:

11973 ?        Sl     7:13 /opt/gridengine/bin/lx26-amd64/sge_execd
25352 ?        S      0:00  \_ sge_shepherd-28865 -bg
25353 ?        SNs    0:00  |   \_ sshd: xavier [priv]
25358 ?        SN     0:00  |       \_ sshd: xavier@notty
25359 ?        SNs    0:00  |           \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool
25450 ?        SN     0:00  |               \_ orted -mca ess env -mca orte_ess_jobid 1701052416 -mca orte_ess_vpid 2 -m
25451 ?        RN   421:43  |                   \_ ./roms roms.in
25452 ?        RN   421:32  |                   \_ ./roms roms.in
25453 ?        RN   422:02  |                   \_ ./roms roms.in
25454 ?        RN   421:53  |                   \_ ./roms roms.in
25455 ?        RN   422:05  |                   \_ ./roms roms.in
25456 ?        RN   421:55  |                   \_ ./roms roms.in
25457 ?        RN   421:48  |                   \_ ./roms roms.in
25458 ?        RN   422:01  |                   \_ ./roms roms.in
4544 ?        S      0:00  \_ sge_shepherd-28899 -bg
4545 ?        Ss     0:00      \_ sshd: forecast [priv]
4550 ?        S      0:00          \_ sshd: forecast@notty
4551 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool
4642 ?        S      0:00                  \_ orted -mca ess env -mca orte_ess_jobid 858914816 -mca orte_ess_vpid 2 -mc
4643 ?        S    187:37                      \_ ./roms roms_forecast.in
4644 ?        S    189:38                      \_ ./roms roms_forecast.in
4645 ?        S    190:40                      \_ ./roms roms_forecast.in
4646 ?        R    188:53                      \_ ./roms roms_forecast.in
4647 ?        R    161:50                      \_ ./roms roms_forecast.in
4648 ?        S    189:12                      \_ ./roms roms_forecast.in
4649 ?        R    158:36                      \_ ./roms roms_forecast.in
4650 ?        R    184:55                      \_ ./roms roms_forecast.in

and top:

4643 forecast  25   0  250m  80m 5028 R 100.6  0.5 188:46.90 roms
4644 forecast  25   0  250m  80m 5028 R 100.6  0.5 190:50.58 roms
4646 forecast  25   0  250m  80m 5240 R 100.6  0.5 190:05.82 roms
4645 forecast  25   0  250m  80m 5228 R 98.6  0.5 191:52.46 roms
4647 forecast  25   0  250m  80m 5248 R 98.6  0.5 162:55.69 roms
4649 forecast  25   0  250m  80m 5020 R 98.6  0.5 159:38.28 roms
4648 forecast  25   0  250m  80m 5236 R 82.8  0.5 190:23.34 roms
4650 forecast  25   0  250m  80m 5016 R 82.8  0.5 186:03.83 roms
25451 xavier    39  19  219m  52m 4896 R  5.9  0.3 421:48.02 roms
25453 xavier    39  19  219m  53m 5052 R  5.9  0.3 422:07.73 roms
25456 xavier    39  19  219m  53m 5060 R  5.9  0.3 422:00.43 roms
25454 xavier    39  19  219m  52m 4924 R  3.9  0.3 421:58.77 roms
25455 xavier    39  19  219m  52m 4948 R  3.9  0.3 422:10.83 roms
25457 xavier    39  19  219m  53m 5056 R  3.9  0.3 421:53.80 roms
25458 xavier    39  19  219m  52m 4840 R  3.9  0.3 422:07.16 roms
25452 xavier    39  19  219m  53m 5060 R  2.0  0.3 421:37.02 roms

Did you check with:

$ ps -e f

on the node that all processes are children of the sge_shepherd? They should then 
show state "T" in `ps`.

-- Reuti
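
The check suggested above can be scripted so that it is easy to repeat on every node of the job. This is a rough sketch, not an SGE or Open MPI tool: `list_not_stopped` is a hypothetical name, and it assumes input in the "pid stat comm" format that `ps -e -o pid=,stat=,comm=` produces on Linux.

```shell
#!/bin/sh
# Hypothetical helper: read "pid stat comm" lines on stdin and print the
# PID and state of every process named $1 whose state does not start
# with "T" (stopped). Empty output means all matching processes are stopped.
list_not_stopped() {
    awk -v cmd="$1" '$3 == cmd && $2 !~ /^T/ { print $1, $2 }'
}

# Usage on a compute node (binary name taken from the thread):
#   ps -e -o pid=,stat=,comm= | list_not_stopped roms
# To look at one user's processes only, replace -e with -u <user>.
```

Run on each node while the queue is suspended, any line it prints points at a process that did not receive the stop signal.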


But again, while compute-0-2 has a load of 8 (8 CPUs per node), compute-0-1 and 
the others are overloaded at 16...

I am using SGE 6.2u4 on a ROCKS 5.3 cluster.

On 11/10/2012 11:09, Reuti wrote:
Am 11.10.2012 um 11:56 schrieb Xavier:

Hi all,

I have created a calendar queue that is only available during the day (6am to 1am), 
keeping the nodes free for the night jobs through another queue.
This queue is composed of 4 nodes (32 CPUs). All jobs use the 32 CPUs.
Good - but what calendar definition did you create in detail?

-- Reuti


What I don't get is that one of the nodes, AND ONLY ONE, drops its load to 0 at 1am. 
This node is the one the job is attributed to, i.e.

from qstat
JOB1 xavier       r     10/03/2012 09:44:18 [email protected]     32

while the other 3 nodes keep their load and are therefore overloaded at night.

Example of the last day's load
for compute-0-2:
http://nautilus.ciimar.up.pt/ganglia/graph.php?g=load_report&z=large&c=nautilus&h=compute-0-2.local&m=load_one&r=day&s=descending&hc=4&mc=2&st=1349865411
and for compute-0-1
http://nautilus.ciimar.up.pt/ganglia/graph.php?g=load_report&z=large&c=nautilus&h=compute-0-1.local&m=load_one&r=day&s=descending&hc=4&mc=2&st=1349865467

Why do all nodes not behave the same?

Xavier

--
Universidade da Madeira
CCM - Centro de Ciencias Matematicas
Campus Universitario da Penteada
9000-390 Funchal, Madeira Island
Portugal

(+351) 291 705 186
http://wakes.uma.pt

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users