Hi,

On 14.11.2011 at 10:43, Bartosz Dobrzelecki wrote:

> I have a cluster with 168 slots (14 nodes). I use a submission script that
> does nothing:
> 
> #!/bin/bash --login
> #$ -S /bin/bash
> #$ -j y
> #$ -cwd
> #$ -pe make 16-
> #$ -q all.q
> #$ -N test
> 
> Make PE definition:
> 
> pe_name            make
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    NONE
> stop_proc_args     NONE
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> 
> This job, although it does nothing, blocks the queue for more than a minute.

This is normal behavior. There have been some improvements to shorten the
time, but some delay remains. It's a safety measure in case it needs to delete
the $TMPDIR on all tightly integrated nodes (although in your case there
aren't any) and kill all processes on all involved nodes by the additional
attached group ID.

Whether this matters depends on how long your jobs run. In our case jobs run
for days or even weeks, so we are not concerned about losing a minute.
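
To put a number on the delay, here is a minimal sketch that computes it from
the timestamps in the quoted transcript below. It assumes GNU date (for the
-d option), and elapsed_seconds is a hypothetical helper name, not part of
Grid Engine.

```shell
#!/bin/bash
# Elapsed seconds between two timestamps.
# Assumption: GNU date is available (the -d option is not portable).
# elapsed_seconds is a hypothetical helper, not a Grid Engine command.
elapsed_seconds() {
    local start end
    # Parse both timestamps as UTC so only their difference matters.
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( end - start ))
}

# Job start time vs. the last qstat call that still listed the job.
elapsed_seconds "2011-11-14 10:33:44" "2011-11-14 10:35:10"   # prints 86

# Job start time vs. the first qstat call that returned nothing.
elapsed_seconds "2011-11-14 10:33:44" "2011-11-14 10:35:18"   # prints 94
```

So in that transcript the job left the queue somewhere between 86 and 94
seconds after it started.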

-- Reuti



> plbadob@k:~/test> qsub ./test.sge ; date ; qstat
> Your job 127 ("test") has been submitted
> Mon Nov 14 10:33:44 CET 2011
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>     127 0.00000 test       plbadob      qw    11/14/2011 10:33:44                                   16
> plbadob@k:~/test> date ; qstat
> Mon Nov 14 10:33:46 CET 2011
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>     127 0.55500 test       plbadob      r     11/14/2011 10:33:44 [email protected]                168
> plbadob@k:~/test> date ; qstat
> Mon Nov 14 10:35:10 CET 2011
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>     127 0.55500 test       plbadob      r     11/14/2011 10:33:44 [email protected]                168
> plbadob@k:~/test> date ; qstat
> Mon Nov 14 10:35:18 CET 2011
> 
> qacct reports that the job finished at the moment it started, as it should:
> 
> plbadob@k:~/test> qacct -j 127
> ==============================================================
> qname        all.q
> qsub_time    Mon Nov 14 10:33:44 2011
> start_time   Mon Nov 14 10:33:44 2011
> end_time     Mon Nov 14 10:33:44 2011
> 
> When I switch control_slaves to FALSE everything works as expected: the job
> is removed from the queue immediately.
> 
> Any ideas what could be wrong? How can I fix this behavior so that the
> cluster is not blocked doing nothing?
> 
> Cheers,
> Bartek
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

