We have just started using QoS here and I was curious about a few
features which would make our lives easier.
1. Spillover/overflow: Essentially, if you use up one QoS, you would
spill over into your next lower-priority QoS. For instance, if you had
used up your group's QoS but still had pending jobs and there were idle
cycles, the jobs pending under your high-priority QoS would run under
the low-priority normal QoS instead.
2. Gres: Adding a limit to the QoS on the number of GPUs or other Gres
quantities that can be used.
3. Requeue/No Requeue: There are some partitions where we want to allow
QoS to requeue jobs, and others where we don't. For instance, we have a
general queue where we don't want requeue, but we also have a backfill
queue where we do permit it. If the QoS could kill the backfill jobs
first to find space, and just wait on the general queue, that would be
great. We haven't experimented with QoS Requeue yet, but we may in the
future, so this is just looking forward.
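
To make the spillover idea in item 1 a bit more concrete, here is a
rough Python sketch of the behavior we have in mind. This is not Slurm
code: the function name assign_qos, the per-QoS CPU limits, and the
usage table are all made up for illustration.

```python
def assign_qos(job_cpus, qos_chain, usage, limits):
    """Pick the first QoS, in priority order, that still has room.

    qos_chain: QoS names ordered from highest to lowest priority
               (hypothetical, e.g. ["group", "normal"]).
    usage:     dict of CPUs currently in use per QoS (updated in place).
    limits:    dict of the CPU limit per QoS.
    Returns the chosen QoS name, or None if the job must stay pending.
    """
    for qos in qos_chain:
        used = usage.get(qos, 0)
        if used + job_cpus <= limits[qos]:
            # Room under this QoS: charge the job here and stop.
            usage[qos] = used + job_cpus
            return qos
        # Otherwise spill over to the next lower-priority QoS.
    return None


usage = {"group": 90}
limits = {"group": 100, "normal": 1000}
first = assign_qos(8, ["group", "normal"], usage, limits)
second = assign_qos(8, ["group", "normal"], usage, limits)
```

Here the first job still fits under the group QoS, while the second one
would exceed the group limit and so falls through to the normal QoS
rather than sitting in the queue.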
We were also wondering if jobs asking for Gres could get higher priority
on those nodes, so that they can grab the GPUs and leave the CPUs
for everyone else. After all, the Gres resources are usually scarcer
than the CPU resources, and we would hate for a Gres resource to sit
idle just because all the CPU jobs took up the slots.
-Paul Edmon-