Aha, it's a parallel job and the slave tasks are not stopped. This is normal: 
for parallel jobs the suspend signal isn't broadcast to all slaves (there was a 
patch available by Rayson to enable it).
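
To see at a glance which ranks actually received the stop signal, the STAT column of `ps -e f` can be filtered for anything that does not start with "T". The following is only a rough sketch: the function name `not_stopped` and the sample rows (copied from the listings quoted below) are illustrative, and it assumes the usual PID / TTY / STAT / TIME / COMMAND column order.

```python
import re

# One `ps -e f` row: PID, TTY, STAT, TIME, then the command (with tree art).
PS_ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)$")


def not_stopped(ps_text, command_substring):
    """Return (pid, stat) for matching processes whose STAT does not start with 'T'."""
    result = []
    for line in ps_text.splitlines():
        match = PS_ROW.match(line)
        if not match:
            continue  # header and wrapped continuation lines are skipped
        pid, _tty, stat, _time, command = match.groups()
        if command_substring in command and not stat.startswith("T"):
            result.append((int(pid), stat))
    return result


# Sample rows taken from the thread: one stopped rank, two still running.
sample = "\n".join([
    r"25574 ?        TN   253:19  |               \_ ./roms roms.in",
    r"25451 ?        RN   421:43  |                   \_ ./roms roms.in",
    r"25452 ?        RN   421:32  |                   \_ ./roms roms.in",
])

print(not_stopped(sample, "roms.in"))
```

Run on each node of the job, an empty list would mean every matching process is stopped; any entries are ranks that kept running.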

Or you can use the suspend/resume feature in Open MPI itself, which looks easier: 
http://www.open-mpi.org/faq/?category=running#suspend-resume
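
As I read that FAQ entry, `mpirun` can forward SIGTSTP and SIGCONT to all ranks once job-control forwarding is switched on at startup (I recall the MCA parameter being `orte_forward_job_control`, but please verify the name against the FAQ). Under that assumption, suspending and resuming comes down to signalling the `mpirun` process itself. A minimal sketch, with hypothetical helper names `suspend` and `resume`:

```python
import os
import signal


def suspend(pid, send=os.kill):
    """Send SIGTSTP to the given process (intended target: mpirun)."""
    send(pid, signal.SIGTSTP)


def resume(pid, send=os.kill):
    """Send SIGCONT to the given process to let it continue."""
    send(pid, signal.SIGCONT)


# Example (not executed here): 25564 is the mpirun PID from the listing below.
# suspend(25564)
# resume(25564)
```

The `send` argument exists only so the helpers can be tried out without signalling a real process. Whether the remote ranks really stop depends on the Open MPI version and on forwarding being enabled.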

BTW: is there any reason to use `ssh` as the job starter and not `builtin`?

-- Reuti


On 11.10.2012 at 12:50, Xavier wrote:

>> According to S the job is suspended. Does `qstat -f` show state C for the 
>> queue (calendar suspended)?
> Yes, compute-0-2 is in C state and the 3 others are in aC state.
> 
> ps -e f on compute-0-2 gives:
> 
> 12395 ?        Sl    25:50 /opt/gridengine/bin/lx26-amd64/sge_execd
> 6299 ?        S      0:00  \_ sge_shepherd-28865 -bg
> 6301 ?        TNs    0:00  |   \_ bash /opt/gridengine/default/spool/compute-0-2/job_scripts/28865
> 6302 ?        TN     0:00  |       \_ /bin/csh ./run_roms.csh
> 25564 ?        TN     0:03  |           \_ mpirun -np 32 ./roms roms.in
> 25568 ?        TN     0:00  |               \_ /opt/gridengine/bin/lx26-amd64/qrsh -inherit -nostdin -V compute-0-3.loca
> 25572 ?        TN     0:00  |               |   \_ /usr/bin/ssh -n -p 38836 compute-0-3.local exec '/opt/gridengine/util
> 25569 ?        TN     0:00  |               \_ /opt/gridengine/bin/lx26-amd64/qrsh -inherit -nostdin -V compute-0-1.loca
> 25571 ?        TN     0:00  |               |   \_ /usr/bin/ssh -n -p 44411 compute-0-1.local exec '/opt/gridengine/util
> 25570 ?        TN     0:00  |               \_ /opt/gridengine/bin/lx26-amd64/qrsh -inherit -nostdin -V compute-0-14.loc
> 25573 ?        TN     0:00  |               |   \_ /usr/bin/ssh -n -p 52152 compute-0-14.local exec '/opt/gridengine/uti
> 25574 ?        TN   253:19  |               \_ ./roms roms.in
> 25575 ?        TN   253:28  |               \_ ./roms roms.in
> 25576 ?        TN   253:26  |               \_ ./roms roms.in
> 25577 ?        TN   253:30  |               \_ ./roms roms.in
> 25578 ?        TN   253:20  |               \_ ./roms roms.in
> 25579 ?        TN   253:27  |               \_ ./roms roms.in
> 25580 ?        TN   253:25  |               \_ ./roms roms.in
> 25581 ?        TN   253:20  |               \_ ./roms roms.in
> 4666 ?        S      0:00  \_ sge_shepherd-28899 -bg
> 4667 ?        Ss     0:00      \_ sshd: forecast [priv]
> 4672 ?        S      0:00          \_ sshd: forecast@notty
> 4673 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool
> 4764 ?        S      0:00                  \_ orted -mca ess env -mca orte_ess_jobid 858914816 -mca orte_ess_vpid 3 -mc
> 4765 ?        R    188:03                      \_ ./roms roms_forecast.in
> 4766 ?        R    190:14                      \_ ./roms roms_forecast.in
> 4767 ?        R    189:45                      \_ ./roms roms_forecast.in
> 4768 ?        R    190:30                      \_ ./roms roms_forecast.in
> 4769 ?        R    190:48                      \_ ./roms roms_forecast.in
> 4770 ?        R    190:39                      \_ ./roms roms_forecast.in
> 4771 ?        R    190:56                      \_ ./roms roms_forecast.in
> 4772 ?        R    189:54                      \_ ./roms roms_forecast.in
> 
> and top:
> 
> 4765 forecast  25   0  250m  80m 5008 R 100.1  0.5 190:54.74 roms
> 4766 forecast  25   0  250m  80m 5020 R 100.1  0.5 193:05.62 roms
> 4769 forecast  25   0  250m  80m 5232 R 100.1  0.5 193:39.34 roms
> 4771 forecast  25   0  250m  69m 5012 R 100.1  0.4 193:47.49 roms
> 4767 forecast  25   0  250m  80m 5236 R 99.8  0.5 192:36.86 roms
> 4768 forecast  25   0  250m  80m 5240 R 99.8  0.5 193:21.09 roms
> 4770 forecast  25   0  250m  80m 5240 R 99.8  0.5 193:30.51 roms
> 4772 forecast  25   0  250m  69m 5028 R 99.8  0.4 192:46.14 roms
> 
> while on compute-0-1 and the others:
> 
> ps -e f gives:
> 
> 11973 ?        Sl     7:13 /opt/gridengine/bin/lx26-amd64/sge_execd
> 25352 ?        S      0:00  \_ sge_shepherd-28865 -bg
> 25353 ?        SNs    0:00  |   \_ sshd: xavier [priv]
> 25358 ?        SN     0:00  |       \_ sshd: xavier@notty
> 25359 ?        SNs    0:00  |           \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool
> 25450 ?        SN     0:00  |               \_ orted -mca ess env -mca orte_ess_jobid 1701052416 -mca orte_ess_vpid 2 -m
> 25451 ?        RN   421:43  |                   \_ ./roms roms.in
> 25452 ?        RN   421:32  |                   \_ ./roms roms.in
> 25453 ?        RN   422:02  |                   \_ ./roms roms.in
> 25454 ?        RN   421:53  |                   \_ ./roms roms.in
> 25455 ?        RN   422:05  |                   \_ ./roms roms.in
> 25456 ?        RN   421:55  |                   \_ ./roms roms.in
> 25457 ?        RN   421:48  |                   \_ ./roms roms.in
> 25458 ?        RN   422:01  |                   \_ ./roms roms.in
> 4544 ?        S      0:00  \_ sge_shepherd-28899 -bg
> 4545 ?        Ss     0:00      \_ sshd: forecast [priv]
> 4550 ?        S      0:00          \_ sshd: forecast@notty
> 4551 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool
> 4642 ?        S      0:00                  \_ orted -mca ess env -mca orte_ess_jobid 858914816 -mca orte_ess_vpid 2 -mc
> 4643 ?        S    187:37                      \_ ./roms roms_forecast.in
> 4644 ?        S    189:38                      \_ ./roms roms_forecast.in
> 4645 ?        S    190:40                      \_ ./roms roms_forecast.in
> 4646 ?        R    188:53                      \_ ./roms roms_forecast.in
> 4647 ?        R    161:50                      \_ ./roms roms_forecast.in
> 4648 ?        S    189:12                      \_ ./roms roms_forecast.in
> 4649 ?        R    158:36                      \_ ./roms roms_forecast.in
> 4650 ?        R    184:55                      \_ ./roms roms_forecast.in
> 
> and top:
> 
> 4643 forecast  25   0  250m  80m 5028 R 100.6  0.5 188:46.90 roms
> 4644 forecast  25   0  250m  80m 5028 R 100.6  0.5 190:50.58 roms
> 4646 forecast  25   0  250m  80m 5240 R 100.6  0.5 190:05.82 roms
> 4645 forecast  25   0  250m  80m 5228 R 98.6  0.5 191:52.46 roms
> 4647 forecast  25   0  250m  80m 5248 R 98.6  0.5 162:55.69 roms
> 4649 forecast  25   0  250m  80m 5020 R 98.6  0.5 159:38.28 roms
> 4648 forecast  25   0  250m  80m 5236 R 82.8  0.5 190:23.34 roms
> 4650 forecast  25   0  250m  80m 5016 R 82.8  0.5 186:03.83 roms
> 25451 xavier    39  19  219m  52m 4896 R  5.9  0.3 421:48.02 roms
> 25453 xavier    39  19  219m  53m 5052 R  5.9  0.3 422:07.73 roms
> 25456 xavier    39  19  219m  53m 5060 R  5.9  0.3 422:00.43 roms
> 25454 xavier    39  19  219m  52m 4924 R  3.9  0.3 421:58.77 roms
> 25455 xavier    39  19  219m  52m 4948 R  3.9  0.3 422:10.83 roms
> 25457 xavier    39  19  219m  53m 5056 R  3.9  0.3 421:53.80 roms
> 25458 xavier    39  19  219m  52m 4840 R  3.9  0.3 422:07.16 roms
> 25452 xavier    39  19  219m  53m 5060 R  2.0  0.3 421:37.02 roms
> 
>> 
>> Did you check with:
>> 
>> $ ps -e f
>> 
>> on the node that all processes are children of the sge_shepherd? They should 
>> then show state "T" in `ps`.
>> 
>> -- Reuti
>> 
>> 
>>> But again, while compute-0-2 has a load of 8 (8 CPUs per node), compute-0-1 
>>> and the others are overloaded at 16...
>>> 
>>> Using SGE 6.2u4 on a ROCKS 5.3 cluster.
>>> 
>>> On 11/10/2012 11:09, Reuti wrote:
>>>> On 11.10.2012 at 11:56, Xavier wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I have created a calendar queue that is only available during the day (6am to 
>>>>> 1am), keeping nodes free for the night jobs through another queue.
>>>>> This queue is composed of 4 nodes (32 CPUs). All jobs use the 32 CPUs.
>>>> Good - but what calendar definition did you create in detail?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> What I don't get is that one of the nodes, AND ONLY ONE, drops its load to 0 at 
>>>>> 1am. This node is the one the job is attributed to, i.e.
>>>>> 
>>>>> from qstat
>>>>> JOB1 xavier       r     10/03/2012 09:44:18 [email protected]     32
>>>>> 
>>>>> while the other 3 nodes keep their load and are therefore overloaded at night.
>>>>> 
>>>>> Example of the load over the last day,
>>>>> for compute-0-2
>>>>> http://nautilus.ciimar.up.pt/ganglia/graph.php?g=load_report&z=large&c=nautilus&h=compute-0-2.local&m=load_one&r=day&s=descending&hc=4&mc=2&st=1349865411
>>>>> and for compute-0-1
>>>>> http://nautilus.ciimar.up.pt/ganglia/graph.php?g=load_report&z=large&c=nautilus&h=compute-0-1.local&m=load_one&r=day&s=descending&hc=4&mc=2&st=1349865467
>>>>> 
>>>>> Why do all nodes not behave the same?
>>>>> 
>>>>> Xavier
>>>>> 
>>>>> -- 
>>>>> Universidade da Madeira
>>>>> CCM - Centro de Ciencias Matematicas
>>>>> Campus Universitario da Penteada
>>>>> 9000-390 Funchal, Madeira Island
>>>>> Portugal
>>>>> 
>>>>> (+351) 291 705 186
>>>>> http://wakes.uma.pt
>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
>>> 
> 
> 

