Hi,

On 15.01.2014 at 11:16, Joe Borġ wrote:

> Using h_rt kills the job after the allotted time.

Yes.


>  Can't this be disabled?

There is no feature in SGE to extend the granted runtime of a job (I have 
heard such a feature is available in Torque).


>  We only want to use it as a rough guide.

If you only need to do this once in a while for a particular job:

In this case you can just kill (or soft-stop) the `sgeexecd` on the node. From 
SGE's point of view you will lose control of the node and of the jobs on it 
(`qhost` shows "-" for the node's load), so you have to check from time to 
time whether the job in question has already finished, and then restart the 
`sgeexecd`. Also, no new jobs will be scheduled to the node in the meantime.

Only when the `sgeexecd` is restarted will it discover that the job finished 
(and send an email, if applicable). The other jobs still running on the node 
will then be under runtime supervision again.
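As a dry-run sketch of the procedure above (the hostname, the job script name, 
and the `sgeexecd` start-script path are assumptions for illustration):

```shell
#!/bin/sh
# Dry-run sketch of the manual soft-stop procedure. Hostname "node01",
# the job script name, and the init-script path are assumptions.
NODE=node01
STOP="qconf -ke $NODE"                       # shut down the execd; running jobs survive
CHECK="ssh $NODE pgrep -f small.bash"        # poll by hand until the job is gone
START="ssh $NODE /etc/init.d/sgeexecd start" # restart; finished jobs get reaped

# Echo instead of executing, so the sequence can be reviewed first.
echo "$STOP"
echo "$CHECK"
echo "$START"
```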

-- Reuti


> Thanks
> 
> 
> 
> Regards,
> Joseph David Borġ 
> josephb.org
> 
> 
> On 13 January 2014 17:43, Reuti <[email protected]> wrote:
> On 13.01.2014 at 18:33, Joe Borġ wrote:
> 
> > Thanks.  Can you please tell me what I'm doing wrong?
> >
> > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > qsub -q test.q -R y -l h_rt=120 -pe test.pe 2 big.bash
> > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> 
> Only the parallel job needs "-R y".
> 
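The submission sequence could then be reduced to the following (same queue, PE, 
and script names as above; echoed as a dry run):

```shell
#!/bin/sh
# Only the parallel job carries "-R y"; the serial jobs only need their
# h_rt estimate so the scheduler can plan the reservation around them.
SMALL="qsub -q test.q -l h_rt=60 -pe test.pe 1 small.bash"
BIG="qsub -q test.q -R y -l h_rt=120 -pe test.pe 2 big.bash"

echo "$SMALL"   # job 1
echo "$BIG"     # job 2 (the only one requesting a reservation)
echo "$SMALL"   # job 3
echo "$SMALL"   # job 4
```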
> 
> >
> > job-ID  prior   name       user         state submit/start at     queue        slots ja-task-ID
> > -----------------------------------------------------------------------------------------------
> >  156757 0.50000 small.bash joe.borg     qw    01/13/2014 16:45:18                  1
> >  156761 0.50000 big.bash   joe.borg     qw    01/13/2014 16:55:31                  2
> >  156762 0.50000 small.bash joe.borg     qw    01/13/2014 16:55:33                  1
> >  156763 0.50000 small.bash joe.borg     qw    01/13/2014 16:55:34                  1
> >
> > ...But when I release...
> 
> Is max_reservation set?
> 
> But the reservation feature also has to be judged in a running cluster. If 
> all four jobs are on hold and released at once, I wouldn't be surprised if 
> the order isn't strictly FIFO.
> 
> 
> > job-ID  prior   name       user         state submit/start at     queue        slots ja-task-ID
> > -----------------------------------------------------------------------------------------------
> >  156757 0.50000 small.bash joe.borg     r     01/13/2014 16:56:06 test.q@test      1
> >  156762 0.50000 small.bash joe.borg     r     01/13/2014 16:56:06 test.q@test      1
> >  156761 0.50000 big.bash   joe.borg     qw    01/13/2014 16:55:31                  2
> >  156763 0.50000 small.bash joe.borg     qw    01/13/2014 16:55:34                  1
> 
> As job 156762 requests the same runtime as 156757, backfilling will occur to 
> use the otherwise idle core. Whether or not job 156762 is started, the 
> parallel job 156761 will start at the same time. Only 156763 shouldn't start.
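The backfilling rule can be illustrated with a toy check (the numbers come from 
the jobs above; the scheduler's real logic is of course more involved):

```shell
#!/bin/sh
# Toy model of backfilling: a waiting serial job may jump ahead of a
# reserved parallel job if its requested h_rt fits into the time the
# reserved slot would otherwise sit idle.
IDLE_WINDOW=60   # 156757 (h_rt=60) runs, so 156761's reserved second
                 # slot would idle for 60 s before the reservation starts

may_backfill() {
    # $1 = requested h_rt in seconds; true if the job is guaranteed
    # to finish before the reservation is due to start
    [ "$1" -le "$IDLE_WINDOW" ]
}

may_backfill 60 && echo "a 60 s job (like 156762) may backfill"
may_backfill 90 || echo "a 90 s job would delay the reservation"
```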
> 
> -- Reuti
> 
> 
> >
> >
> > Thanks
> >
> >
> >
> > Regards,
> > Joseph David Borġ
> > josephb.org
> >
> >
> > On 13 January 2014 17:26, Reuti <[email protected]> wrote:
> > On 13.01.2014 at 17:24, Joe Borġ wrote:
> >
> > > Hi Reuti,
> > >
> > > I am using a PE, so that's fine.
> > >
> > > I've not set either of the other 3.  Will the job be killed if 
> > > default_duration is exceeded?
> >
> > No. It can be set to any value you like (e.g. a few weeks), but it 
> > shouldn't be set to "INFINITY", as SGE then judges infinity to be smaller 
> > than infinity and backfilling will always occur.
> >
> > -- Reuti
> >
> >
> > > Thanks
> > >
> > >
> > >
> > > Regards,
> > > Joseph David Borġ
> > > josephb.org
> > >
> > >
> > > On 13 January 2014 16:16, Reuti <[email protected]> wrote:
> > > Hi,
> > >
> > > On 13.01.2014 at 16:58, Joe Borġ wrote:
> > >
> > > > I'm trying to set up an SGE queue and am having a problem getting the 
> > > > jobs to start in the right order.  Here is my example - test.q with 2 
> > > > possible slots and the following jobs queued:
> > > >
> > > > job-ID  prior   name       user         state submit/start at     queue  slots ja-task-ID
> > > > -----------------------------------------------------------------------------------------
> > > >  1      0.50000 small.bash joe.borg     qw    01/13/2014 15:43:16            1
> > > >  2      0.50000 big.bash   joe.borg     qw    01/13/2014 15:43:24            2
> > > >  3      0.50000 small.bash joe.borg     qw    01/13/2014 15:43:27            1
> > > >  4      0.50000 small.bash joe.borg     qw    01/13/2014 15:43:28            1
> > > >
> > > > I want the jobs to run in that order, but (obviously) when I enable 
> > > > the queue, the small jobs fill the available slots and the big job has 
> > > > to wait for them to complete. I'd like it set up so that only job 1 
> > > > runs and finishes, then job 2 (with both slots), then the final two 
> > > > jobs, 3 & 4, together.
> > > >
> > > > I've looked at -R y on submission, but it doesn't seem to work.
> > >
> > > For the reservation to work (and it's only necessary to request it for 
> > > the parallel job), all jobs need a suitable "h_rt" request.
> > >
> > > - Do you request an "h_rt" for all jobs?
> > > - Otherwise, do you have "default_duration" set to a proper value in the 
> > > scheduler configuration?
> > > - Is "max_reservation" set to a value like 16?
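The relevant entries in the scheduler configuration (edited via `qconf -msconf`) 
would then contain something along these lines; the concrete values here are 
only examples, not defaults:

```
default_duration      168:00:00   # fallback runtime for jobs without an h_rt
                                  # request; a finite value, not INFINITY
max_reservation       16          # number of jobs that may hold a reservation
```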
> > >
> > > -- Reuti
> > >
> > >
> > > > Regards,
> > > > Joseph David Borġ
> > > > josephb.org
> > > > _______________________________________________
> > > > users mailing list
> > > > [email protected]
> > > > https://gridengine.org/mailman/listinfo/users
> > >
> > >
> >
> >
> 
> 

