On 26.04.2012 at 18:20, Stuart Barkley wrote:

> Can someone give a quick overview of how job reservations are supposed
> to work?
>
> I have a large cluster where one user has ~1500 jobs executing, each of
> which takes several days to run. Every day several of the jobs finish
> and new ones from the array job start. These jobs have a minimal
> memory footprint (h_vmem=2G), so the new jobs fit exactly into the old
> footprint.
>
> I also have an array job from another user, but its tasks request
> h_vmem=20G. These jobs have been starved for several days and none
> have started, since 20G never becomes free on a single host.
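For reference, a reservation for such a memory-heavy job is requested at submission time with `-R y`; a sketch follows, in which the script name, task range, job ID, and h_rt value are illustrative placeholders, not values from the thread:

```shell
# Ask the scheduler to build a resource reservation for this array job
# instead of letting it be starved by backfilled small jobs.
# "bigmem.sh", the task range, and the h_rt value are placeholders.
qsub -R y -t 1-100 -l h_vmem=20G -l h_rt=96:00:00 bigmem.sh

# An already-queued job can be switched to reservation afterwards
# (the job ID is illustrative):
qalter -R y 4711
```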
Yes, some real-time features in SGE would be nice: assuming all jobs run up to their specified h_rt, when will job X start? The problem is that someone can submit a new job at a later point in time, and due to its higher priority the forecast is no longer valid.

> h_vmem is consumable and generally works well at preventing memory
> over-allocation.
>
> In this case the blocking jobs' turnover is pretty slow, and I suspect
> fragmentation between nodes means that a single node is unlikely to
> actually become empty within any short time period.
>
> I have manually adjusted the job priority so the starved job is at the
> top of the waiting list (qalter -p 500).

It's not easy to investigate the overall priority, as it depends on several weights. You can check "qstat -pri -ext -urg". You can also set "params monitor=true" in SGE's scheduler configuration. Do you then see reservations in the "schedule" file ($SGE_ROOT/default/common/schedule)? Lines like:

172034:1:RESERVING:1336276605:31536060:G:global:jobs:1.000000

> I've manually set the qalter "-R y" option, so the job should be
> considered for a reservation.
>
> I have "max_reservation 8" in sched_conf, so I believe an internal
> job reservation should be made.
>
> Previous experimentation with job reservations on jobs with a smaller
> turnaround appeared to have an effect; at some point it appeared that
> grid engine was clearing off nodes for larger jobs, but I didn't find
> any way to actually confirm that is what was happening.
>
> Is there any way to tell that grid engine has even noticed this and
> created a reservation? Is there a way to see what future resources
> have been "reserved"?
>
> Will the reservation adjust itself over time? The running jobs all
> have a huge h_rt, but the actual run time varies a lot.

This will prevent suitable backfilling then. Do you have jobs without an h_rt request?
For some configurations the default_duration is INFINITY, and SGE judges INFINITY to be smaller than INFINITY, so backfilling occurs all the time as long as the jobs fit in.

-- Reuti

> It would be good if grid engine would reallocate reservations as jobs
> end and better possibilities emerge.
>
> Current manual workaround:
>
> For this specific case I have created a temporary RQS to limit this
> specific user to only 1000 slots, but I need to be sure I reset this
> once the currently blocked jobs get started.
>
> (still using SGE 6.2u5, CentOS 5)
>
> Thanks,
> Stuart Barkley
> --
> I've never been lost; I was once bewildered for three days, but never lost!
> -- Daniel Boone
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
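The two remedies discussed in the thread can be sketched as configuration fragments; the finite default_duration value, the RQS rule name, and the user name below are illustrative placeholders:

```shell
# Scheduler configuration fragment (edit with: qconf -msconf).
# Cap the number of reservations, give jobs without an h_rt request a
# finite assumed duration so they cannot backfill forever, and log
# scheduler decisions to $SGE_ROOT/default/common/schedule:
#
#   max_reservation    8
#   default_duration   96:00:00
#   params             monitor=true

# Temporary resource quota set (create with: qconf -arqs) limiting the
# heavy user's slots; "limit_heavy_user" and "someuser" are placeholders:
#
#   {
#      name         limit_heavy_user
#      enabled      TRUE
#      limit        users someuser to slots=1000
#   }
```

Remember, as noted above, to remove the RQS again once the blocked jobs have started.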
