Hi,

I was surprised to see it. Our old cluster uses GE 6.1 and the behavior is
different - tightly integrated jobs get removed from the queue nearly
immediately.

In our setup we run a lot of short jobs, so a delay of over a minute per job
is just wasteful.
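
For anyone who wants to put a number on the delay, here is a minimal sketch of
how it could be timed. It assumes that "qstat -j <job_id>" exits non-zero once
the job has left the queue and that "qsub -terse" prints only the job ID; both
are assumptions worth checking on your Grid Engine version. The helper name
linger_seconds is made up for illustration.

```shell
#!/bin/bash
# Sketch: measure how long a job stays visible to a polling command.
# Assumption: the polled command (e.g. "qstat -j <job_id>") exits
# non-zero as soon as the job is no longer known to the scheduler.

# Poll a command until it fails, then print the elapsed whole seconds.
# Usage: linger_seconds <poll_interval_seconds> <command> [args...]
linger_seconds() {
    local interval=$1
    shift
    local start=$SECONDS
    while "$@" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo $(( SECONDS - start ))
}

# Hypothetical usage on the cluster (not run here):
#   job_id=$(qsub -terse ./test.sge)
#   linger_seconds 1 qstat -j "$job_id"
```

The figure is counted from submission, so it also includes the pending and
run time. For an empty job like the one below that should only be a couple of
seconds, and the rest is the cleanup delay.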

Cheers,
Bartek

On 14 November 2011 11:58, Reuti <[email protected]> wrote:

> Hi,
>
> On 14.11.2011 at 10:43, Bartosz Dobrzelecki wrote:
>
> > I have a cluster with 168 slots (14 nodes). I use a submission script
> > that does nothing:
> >
> > #!/bin/bash --login
> > #$ -S /bin/bash
> > #$ -j y
> > #$ -cwd
> > #$ -pe make 16-
> > #$ -q all.q
> > #$ -N test
> >
> > Make PE definition:
> >
> > pe_name            make
> > slots              999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    NONE
> > stop_proc_args     NONE
> > allocation_rule    $round_robin
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> > This job, although it does nothing, blocks the queue for more than a
> > minute
>
> This is the normal behavior. There were some improvements to shorten the
> time, but some delay is still there. It's a safety measure in case it
> needs to delete the $TMPDIR on all tightly integrated nodes (although in
> your case there aren't any) and kill all processes on all involved nodes
> by the additional group ID attached to the job.
>
> Whether this matters depends on how long your jobs run. In our case jobs
> run for days or even weeks, so we are not concerned about a loss of one
> minute.
>
> -- Reuti
>
>
>
> > plbadob@k:~/test> qsub ./test.sge ; date ; qstat
> > Your job 127 ("test") has been submitted
> > Mon Nov 14 10:33:44 CET 2011
> > job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> > -----------------------------------------------------------------------------------------------------------------
> >     127 0.00000 test       plbadob      qw    11/14/2011 10:33:44                                   16
> > plbadob@k:~/test> date ; qstat
> > Mon Nov 14 10:33:46 CET 2011
> > job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> > -----------------------------------------------------------------------------------------------------------------
> >     127 0.55500 test       plbadob      r     11/14/2011 10:33:44 [email protected]                168
> > plbadob@k:~/test> date ; qstat
> > Mon Nov 14 10:35:10 CET 2011
> > job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> > -----------------------------------------------------------------------------------------------------------------
> >     127 0.55500 test       plbadob      r     11/14/2011 10:33:44 [email protected]                168
> > plbadob@k:~/test> date ; qstat
> > Mon Nov 14 10:35:18 CET 2011
> >
> > qacct reports that the job finished in the same second it started, as it
> > should:
> >
> > plbadob@k:~/test> qacct -j 127
> > ==============================================================
> > qname        all.q
> > qsub_time    Mon Nov 14 10:33:44 2011
> > start_time   Mon Nov 14 10:33:44 2011
> > end_time     Mon Nov 14 10:33:44 2011
> >
> > When I switch control_slaves to FALSE everything works as expected - the
> > job is removed from the queue immediately.
> >
> > Any ideas what could be wrong? How can I fix this behavior, so that the
> > cluster is not blocked doing nothing?
> >
> > Cheers,
> > Bartek
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
>
>