Hi, I was surprised to see it. Our old cluster uses GE 6.1 and the behavior is different - tightly integrated jobs get removed from the queue nearly immediately.
In our setup we have a lot of short jobs, and a delay of over a minute is just wasteful.

Cheers,
Bartek

On 14 November 2011 11:58, Reuti <[email protected]> wrote:
> Hi,
>
> On 14.11.2011 at 10:43, Bartosz Dobrzelecki wrote:
>
> > I have a cluster with 168 slots (14 nodes). I use a submission script
> > that does nothing
> >
> > #!/bin/bash --login
> > #$ -S /bin/bash
> > #$ -j y
> > #$ -cwd
> > #$ -pe make 16-
> > #$ -q all.q
> > #$ -N test
> >
> > Make PE definition:
> >
> > pe_name make
> > slots 999
> > user_lists NONE
> > xuser_lists NONE
> > start_proc_args NONE
> > stop_proc_args NONE
> > allocation_rule $round_robin
> > control_slaves TRUE
> > job_is_first_task FALSE
> > urgency_slots min
> > accounting_summary FALSE
> >
> > This job, although it does nothing, blocks the queue for more than a
> > minute
>
> This is the normal behavior. There were some improvements to shorten the
> time, but some delay is still there. It's a safety measure in case it
> needs to delete the $TMPDIR on all tightly integrated nodes (although in
> your case there aren't any) and kill all processes on all involved nodes
> by the additional attached group id.
>
> Whether it matters depends on how long this is compared to your jobs. In
> our case jobs run for days or even weeks, so we are not concerned about a
> loss of one minute.
>
> -- Reuti
>
> > plbadob@k:~/test> qsub ./test.sge ; date ; qstat
> > Your job 127 ("test") has been submitted
> > Mon Nov 14 10:33:44 CET 2011
> > job-ID prior name user state submit/start at queue slots ja-task-ID
> > -----------------------------------------------------------------------------------------------------------------
> > 127 0.00000 test plbadob qw 11/14/2011 10:33:44 16
> > plbadob@k:~/test> date ; qstat
> > Mon Nov 14 10:33:46 CET 2011
> > job-ID prior name user state submit/start at queue slots ja-task-ID
> > -----------------------------------------------------------------------------------------------------------------
> > 127 0.55500 test plbadob r 11/14/2011 10:33:44 [email protected] 168
> > plbadob@k:~/test> date ; qstat
> > Mon Nov 14 10:35:10 CET 2011
> > job-ID prior name user state submit/start at queue slots ja-task-ID
> > -----------------------------------------------------------------------------------------------------------------
> > 127 0.55500 test plbadob r 11/14/2011 10:33:44 [email protected] 168
> > plbadob@k:~/test> date ; qstat
> > Mon Nov 14 10:35:18 CET 2011
> >
> > qacct reports that the job finished when it started, as it should:
> >
> > plbadob@k:~/test> qacct -j 127
> > ==============================================================
> > qname all.q
> > qsub_time Mon Nov 14 10:33:44 2011
> > start_time Mon Nov 14 10:33:44 2011
> > end_time Mon Nov 14 10:33:44 2011
> >
> > When I switch control_slaves to FALSE everything works as expected - the
> > job is removed from the queue immediately.
> >
> > Any ideas what could be wrong? How can I fix this behavior, so that the
> > cluster is not blocked doing nothing?
> >
> > Cheers,
> > Bartek
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
