Hi,
On 14.11.2011 at 10:43, Bartosz Dobrzelecki wrote:
> I have a cluster with 168 slots (14 nodes). I use a submission script that
> does nothing
>
> #!/bin/bash --login
> #$ -S /bin/bash
> #$ -j y
> #$ -cwd
> #$ -pe make 16-
> #$ -q all.q
> #$ -N test
>
> Make PE definition:
>
> pe_name make
> slots 999
> user_lists NONE
> xuser_lists NONE
> start_proc_args NONE
> stop_proc_args NONE
> allocation_rule $round_robin
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary FALSE
>
> This job, although it does nothing, blocks the queue for more than a minute
This is the normal behavior. There have been some improvements to shorten the
time, but some delay still remains. It's a safety measure in case SGE needs to
delete the $TMPDIR on all tightly integrated nodes (although in your case there
aren't any) and kill all processes on all involved nodes by the additional
attached group ID.
Whether this matters depends on how long the delay is compared to the runtime
of your jobs. In our case jobs run for days or even weeks, so we are not
concerned about losing one minute.
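To check how long the slots really stay occupied on your cluster, something
like the following could be used. It is only a minimal sketch: the function
name wait_job_gone is made up for illustration, and it assumes that
`qstat -j <job_id>` returns a non-zero exit status once the job is no longer
known to the qmaster, which should be verified on your installation.

```shell
#!/bin/bash
# Sketch: measure how long a job stays visible to qstat.
# Assumption: `qstat -j <job_id>` exits non-zero once the job is gone.

# wait_job_gone JOB_ID [POLL_SECONDS]
# Polls qstat until the job is no longer known, then prints the number
# of seconds spent waiting (whole seconds, from bash's SECONDS counter).
wait_job_gone() {
    local job_id=$1
    local poll=${2:-1}          # polling interval, default 1 second
    local start=$SECONDS
    while qstat -j "$job_id" >/dev/null 2>&1; do
        sleep "$poll"
    done
    echo $(( SECONDS - start ))
}

# Usage (job id 127 as in the transcript, adjust as needed):
#   wait_job_gone 127
```

Comparing the printed value with the runtime from qacct shows how much of the
time was spent on cleanup after the job script itself had already finished.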
-- Reuti
> plbadob@k:~/test> qsub ./test.sge ; date ; qstat
> Your job 127 ("test") has been submitted
> Mon Nov 14 10:33:44 CET 2011
> job-ID  prior    name  user     state  submit/start at      queue              slots  ja-task-ID
> ------------------------------------------------------------------------------------------------
> 127     0.00000  test  plbadob  qw     11/14/2011 10:33:44                     16
> plbadob@k:~/test> date ; qstat
> Mon Nov 14 10:33:46 CET 2011
> job-ID  prior    name  user     state  submit/start at      queue              slots  ja-task-ID
> ------------------------------------------------------------------------------------------------
> 127     0.55500  test  plbadob  r      11/14/2011 10:33:44  [email protected]  168
> plbadob@k:~/test> date ; qstat
> Mon Nov 14 10:35:10 CET 2011
> job-ID  prior    name  user     state  submit/start at      queue              slots  ja-task-ID
> ------------------------------------------------------------------------------------------------
> 127     0.55500  test  plbadob  r      11/14/2011 10:33:44  [email protected]  168
> plbadob@k:~/test> date ; qstat
> Mon Nov 14 10:35:18 CET 2011
>
> The qacct reports that the job finished when it started - as it should be
>
> plbadob@k:~/test> qacct -j 127
> ==============================================================
> qname all.q
> qsub_time Mon Nov 14 10:33:44 2011
> start_time Mon Nov 14 10:33:44 2011
> end_time Mon Nov 14 10:33:44 2011
>
> When I switch control_slaves to FALSE everything works as expected - the job
> is removed from the queue immediately.
>
> Any ideas what could be wrong? How can I fix this behavior, so that the
> cluster is not blocked doing nothing?
>
> Cheers,
> Bartek
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users