On 26.04.2012, at 18:20, Stuart Barkley wrote:

> Can someone give a quick overview of how job reservations are supposed
> to work?
> 
> I have a large cluster where one user has ~1500 jobs executing which
> take several days each to run.  Every day several of the jobs finish
> and then new ones from the array job start.  These jobs have a minimal
> memory footprint (h_vmem=2G), so the new jobs fit exactly in the old
> footprint.
> 
> I also have another user array job, but these have h_vmem=20G.  These
> jobs have been starved for several days and none have started since
> 20G never becomes free on a single host.

Yes, some real-time features in SGE would be nice: assuming all jobs run up to 
their specified h_rt, when will job X start? The problem is that someone can 
submit a new job at a later point in time, and due to its higher priority the 
forecast is no longer valid.


> h_vmem is consumable and generally works well preventing memory over
> allocation.
> 
> In this case the blocking job turnover is pretty slow and I suspect
> fragmentation between nodes means that a single node is unlikely to
> actually become empty within any short time period.
> 
> I have manually adjusted the job priority so the starved job is at the
> top of the waiting list (qalter -p 500).

It's not easy to investigate the overall priority, as it depends on several 
weights. You can check "qstat -pri -ext -urg".

You can also set "params monitor=true" in SGE's scheduler configuration.
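A minimal sketch of how this could be done with the standard qconf tools (the exact "params" syntax accepted may vary between SGE versions, so treat the value shown as an assumption to verify against your sched_conf man page):

```shell
# Show the current scheduler configuration:
qconf -ssconf

# Edit it interactively and set, in the "params" line:
#   params  monitor=true
qconf -msconf

# Afterwards the scheduler writes its decisions to the "schedule" file
# in the cell's common directory (default cell assumed here):
tail -f $SGE_ROOT/default/common/schedule
```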

In the "schedule" file ($SGE_ROOT/default/common/schedule) you should then see reservation entries like:

172034:1:RESERVING:1336276605:31536060:G:global:jobs:1.000000
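For reference, a small sketch of how such a line could be picked apart. The field layout (job:task:state:start:duration:level:object:resource:amount) is inferred from the sample entry above, so treat the field names as an assumption rather than the authoritative file format:

```python
def parse_schedule_line(line):
    """Parse one line of SGE's "schedule" file into a dict.

    Field layout is inferred from a sample entry like
    172034:1:RESERVING:1336276605:31536060:G:global:jobs:1.000000
    """
    job, task, state, start, duration, level, obj, resource, amount = \
        line.strip().split(":")
    return {
        "job": int(job),
        "task": int(task),
        "state": state,              # e.g. RESERVING, STARTING, RUNNING
        "start": int(start),         # start time, epoch seconds
        "duration": int(duration),   # assumed duration in seconds
        "level": level,              # G = global level (assumed meaning)
        "object": obj,               # e.g. "global" or a host/queue name
        "resource": resource,        # e.g. "jobs", "slots", "h_vmem"
        "amount": float(amount),     # reserved utilization
    }

entry = parse_schedule_line(
    "172034:1:RESERVING:1336276605:31536060:G:global:jobs:1.000000")
print(entry["job"], entry["state"])  # → 172034 RESERVING
```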


> I've manually set the qalter "-R y" option, so the job should be
> considered for a reservation.
> 
> I have "max_reservation 8" in sched_conf, so I believe an internal
> job reservation should be done.
> 
> Previous experimentation with job reservations on smaller turn-around
> jobs appeared to have an effect; at some point it appeared that
> grid engine was clearing off nodes for larger jobs, but I didn't find
> any way to actually confirm that was what was happening.
> 
> Is there any way to tell that grid engine has even noticed this and
> created a reservation?  Is there a way to see what future resources
> have been "reserved"?
> 
> Will the reservation adjust itself over time?  The running jobs all
> have a huge h_rt, but the actual run time varies a lot.

Then this will prevent suitable backfilling.

Do you have jobs without an h_rt request? In some configurations the 
default_duration is INFINITY, and SGE judges one INFINITY to be smaller than 
another INFINITY, so backfilling occurs all the time because such jobs appear 
to fit in.
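A quick way to check this could look like the following sketch (the finite value shown is just an illustrative assumption, not a recommendation for your site):

```shell
# See what the scheduler assumes for jobs submitted without -l h_rt=...
qconf -ssconf | grep default_duration

# If it prints "default_duration INFINITY", consider setting a finite
# value via "qconf -msconf", e.g.:
#   default_duration  8760:00:00
# so unbounded jobs no longer slip past the reservation as backfill.
```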

-- Reuti


>  It would be
> good if grid engine would reallocate reservations as jobs end and
> better possibilities emerge.
> 
> Current manual workaround:
> 
> For this specific case I have created a temporary rqs to limit this
> specific user to only 1000 slots, but I need to be sure I reset this
> once the current blocked jobs get started.
> 
> (still using sge6.2u5, CentOS 5)
> 
> Thanks,
> Stuart Barkley
> -- 
> I've never been lost; I was once bewildered for three days, but never lost!
>                                        --  Daniel Boone
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

