Can someone give a quick overview of how job reservations are supposed
to work?
I have a large cluster where one user has ~1500 jobs executing which
take several days each to run. Every day several of the jobs finish
and then new ones from the array job start. These jobs have a minimal
memory footprint (h_vmem=2G), so the new jobs fit exactly in the old
footprint.
I also have another user's array job, but its tasks have h_vmem=20G.
These jobs have been starved for several days and none have started,
since 20G never becomes free on a single host.
h_vmem is consumable and generally works well at preventing memory
overallocation.
In this case the blocking job turnover is pretty slow and I suspect
fragmentation between nodes means that a single node is unlikely to
actually become empty within any short time period.
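
A toy sketch of the fragmentation suspected above. All numbers here
(8 hosts, 64G per host, the job counts per host) are made up for
illustration and are not taken from the real cluster; only the 2G and
20G h_vmem values come from this post.

```python
# Toy model of memory fragmentation across hosts.
# HOST_MEM_G and the per-host job counts are hypothetical values.
HOST_MEM_G = 64      # assumed consumable h_vmem per host
SMALL_JOB_G = 2      # h_vmem of the small jobs
BIG_JOB_G = 20       # h_vmem requested by the starved jobs


def free_per_host(small_jobs_per_host):
    """Return free memory (in G) on each host given its count of 2G jobs."""
    return [HOST_MEM_G - n * SMALL_JOB_G for n in small_jobs_per_host]


def can_start_big_job(free):
    """A 20G job needs 20G free on ONE host; free memory elsewhere does not help."""
    return any(f >= BIG_JOB_G for f in free)


# Hypothetical snapshot: a few 2G jobs have finished on each of 8 hosts.
small_jobs_per_host = [28, 27, 29, 26, 28, 27, 29, 28]
free = free_per_host(small_jobs_per_host)

print("free per host (G):", free)
print("total free (G):", sum(free))
print("big job can start:", can_start_big_job(free))
```

In this made-up snapshot 68G is free in total, but the largest gap on
any one host is 12G, so a 20G job still cannot start unless something
holds freed memory back on a single host.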
I have manually adjusted the job priority so the starved job is at the
top of the waiting list (qalter -p 500).
I've manually set the qalter "-R y" option, so the job should be
considered for a reservation.
I have "max_reservation 8" in sched_conf, so I believe that an internal
job reservation should be made.
Previous experimentation with job reservations on jobs with shorter
turnaround appeared to have an effect: at some point it appeared that
Grid Engine was clearing off nodes for larger jobs, but I didn't find
any way to actually confirm that this was what was happening.
Is there any way to tell that grid engine has even noticed this and
created a reservation? Is there a way to see what future resources
have been "reserved"?
Will the reservation adjust itself over time? The running jobs all
have a huge h_rt, but the actual run time varies a lot. It would be
good if grid engine would reallocate reservations as jobs end and
better possibilities emerge.
Current manual workaround:
For this specific case I have created a temporary rqs to limit this
specific user to only 1000 slots, but I need to be sure I reset this
once the current blocked jobs get started.
(still using sge6.2u5, CentOS 5)
Thanks,
Stuart Barkley
--
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone