[slurm-dev] Re: QoS Feature Requests

2014-05-21 Thread jette


Quoting Paul Edmon ped...@cfa.harvard.edu:

We have just started using QoS here and I am curious about a few  
features that would make our lives easier.


1. Spillover/overflow: Essentially, if you use up one QoS you would  
spill over into your next-lower-priority QoS.  For instance, if you  
used up your group's QoS but still had jobs and there were idle  
cycles, the jobs pending under your high-priority QoS would go to  
the low-priority normal QoS.


There isn't a great way to do this today. Each job is associated with  
a single QOS.


One possibility would be to submit one job to each QOS and then have  
whichever job starts first kill the others. A job_submit plugin  
could probably handle the multiple submissions (e.g. if the --qos  
option has multiple comma-separated names, then submit one job for  
each QOS). Offhand I'm not sure of a good way to identify and purge  
the extra jobs; some variation of the --depend=singleton logic would  
probably do the trick.
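
Roughly, the idea might look something like this; the job name, QoS  
names, and the scancel-based cleanup below are only placeholders for  
whatever a job_submit plugin would actually generate:

  # Submit the same script once per QOS, all under one job name:
  sbatch --job-name=spill --qos=high   job.sh
  sbatch --job-name=spill --qos=normal job.sh

  # The first copy to start would then purge its still-pending
  # siblings, e.g. near the top of job.sh:
  scancel --user=$USER --name=spill --state=PENDING

The scancel step is racy if two copies start at nearly the same time,  
which is where some variation of the singleton logic would come in.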



2. Gres: Adding limits on the number of GPUs or other GRES that can  
be used under a QoS.


This has been discussed, but not implemented yet.


3. Requeue/No requeue: There are some partitions where we want to  
allow QoS-based requeue and others where we don't.  For instance, we  
have a general queue on which we don't want requeueing, but we also  
have a backfill queue on which we do permit it.  If the QoS could  
kill the backfill jobs first to make space, and just wait on the  
general queue, that would be great.  We haven't experimented with  
QoS requeue yet, but we may in the future, so this is just looking  
forward.


You can configure different preemption mechanisms and preempt by  
either QoS or partition. Take a look at:

http://slurm.schedmd.com/preempt.html

For example, you might enable QoS high to requeue jobs in QoS low,  
but wait for jobs in QoS medium.

There is no mechanism to configure QoS high to preempt jobs based  
upon their partition.
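
To illustrate the QoS-based case (the QoS names are just examples):

  # slurm.conf
  PreemptType=preempt/qos
  PreemptMode=REQUEUE

  # Allow QoS high to preempt (requeue) jobs in QoS low only:
  sacctmgr modify qos high set Preempt=low

Since medium does not appear in high's Preempt list, its jobs would  
be left alone and the high-priority jobs would simply wait for them.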

We were also wondering if jobs asking for GRES could get higher  
priority on those nodes, such that they can grab the GPUs and leave  
the CPUs for everyone else.  After all, the GRES resources are  
usually scarcer than the CPU resources, and we would hate for a GRES  
resource to sit idle just because CPU-only jobs took up all the slots.


-Paul Edmon-


This has also been discussed, but not implemented yet. One option  
might be to use a job_submit plugin to adjust a job's nice value  
based upon its GRES request.

There is also a partition parameter, MaxCPUsPerNode, that might be  
useful to limit the number of CPUs consumed on each node by each  
partition. You would probably need a separate partition/queue for  
GPU jobs for that to work well, so it may not fit your setup.
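
For reference, MaxCPUsPerNode is set on the partition definition in  
slurm.conf. A minimal sketch, with made-up node names and counts  
(GresTypes=gpu and a matching gres.conf are assumed):

  NodeName=gpu[01-04] CPUs=16 Gres=gpu:2
  # CPU-only jobs may use at most 14 of the 16 CPUs on each node,
  # leaving 2 CPUs for jobs submitted to the gpu partition:
  PartitionName=cpu Nodes=gpu[01-04] MaxCPUsPerNode=14 Default=YES
  PartitionName=gpu Nodes=gpu[01-04]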


Let me know if you need help pursuing these options.

Moe Jette


[slurm-dev] Re: QoS Feature Requests

2014-05-21 Thread Paul Edmon


Thanks for the info.  The spillover feature would be handy, but I can 
definitely see the difficulties in coding it.  A similar mechanism 
already exists for partitions, though: you can list multiple 
partitions and the job will execute in whichever one will start it 
first.  Could this be carried over to QoS?
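
For context, the partition version is just a comma-separated list at 
submit time, along the lines of (using our general and backfill 
queues as the example names):

  sbatch --partition=general,backfill job.sh

and the job runs in whichever listed partition can start it first.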


-Paul Edmon-
