[slurm-dev] Re: preemption based on partition priority

Marcin Sliwowski Fri, 27 Sep 2013 13:34:00 -0700

Hello All,

I just realized that my original message to the mailing list was a little hard to read, so I would like to ask my question again in a clearer fashion.

Pre-emption is working, but it appears that too many jobs in the low priority "runatrisk" partition are being killed for a single high priority job in the "ncfs" partition to run.

It almost appears as if the new high priority job is impatient. First, a single low priority job goes into "completing" state as a result of being pre-empted. The completion of this one low priority job leaves enough resources for the high priority job to run. But while the first low priority job is completing, pre-emption decides to target yet another low priority job, which isn't necessary. Below is a very clear example of this, illustrated with squeue.


I am running slurm 2.6.0
PreemptType=preempt/partition_prio

and the following partitions are defined:

PartitionName=DEFAULT Shared=NO State=UP Default=NO Priority=5000 MaxNodes=32 MaxTime=5760

PartitionName=ncfs Nodes=compute-[0-3]-[0-31] PreemptMode=REQUEUE Priority=10000 AllowGroups=ncfs,itgroup

PartitionName=batch Nodes=compute-[0-3]-[0-31] Default=YES PreemptMode=REQUEUE

PartitionName=runatrisk Nodes=compute-[0-3]-[0-31] MaxTime=1440 PreemptMode=CANCEL Priority=1000
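
To make the priority relationships in these partition definitions explicit, here is a minimal Python sketch. It is not Slurm code, and the helper names parse_partitions and can_preempt are hypothetical. It assumes that values on the PartitionName=DEFAULT line are inherited by the partitions defined after it unless they override them, and that under preempt/partition_prio only a partition with a strictly higher Priority can preempt.

```python
# Illustrative only: a toy parser for the partition lines above, not Slurm code.
CONF = """\
PartitionName=DEFAULT Shared=NO State=UP Default=NO Priority=5000 MaxNodes=32 MaxTime=5760
PartitionName=ncfs Nodes=compute-[0-3]-[0-31] PreemptMode=REQUEUE Priority=10000 AllowGroups=ncfs,itgroup
PartitionName=batch Nodes=compute-[0-3]-[0-31] Default=YES PreemptMode=REQUEUE
PartitionName=runatrisk Nodes=compute-[0-3]-[0-31] MaxTime=1440 PreemptMode=CANCEL Priority=1000
"""

def parse_partitions(text):
    """Return {partition name: settings}, applying DEFAULT values (assumed inheritance)."""
    defaults, partitions = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Every token on a partition line has the form Key=Value.
        fields = dict(tok.split("=", 1) for tok in line.split())
        name = fields.pop("PartitionName")
        if name == "DEFAULT":
            defaults.update(fields)
        else:
            partitions[name] = {**defaults, **fields}
    return partitions

def can_preempt(partitions, preemptor, victim):
    """Assumption: only a strictly higher partition Priority may preempt."""
    return int(partitions[preemptor]["Priority"]) > int(partitions[victim]["Priority"])

parts = parse_partitions(CONF)
# List partitions from highest to lowest priority.
for name in sorted(parts, key=lambda n: -int(parts[n]["Priority"])):
    print(name, parts[name]["Priority"], parts[name].get("PreemptMode"))
```

Under these assumptions, batch inherits Priority=5000 from the DEFAULT line, so ncfs jobs can preempt both batch and runatrisk jobs, and a preempted runatrisk job is cancelled rather than requeued.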




Here is the actual example that I am referring to, illustrated with squeue:

The job I submit to the runatrisk low priority partition is a 32 node, 8 tasks per node MPI job that sleeps for an hour.

The job I submit to the ncfs high priority partition is a 32 node, 8 tasks per node MPI job that just prints “Hello” and finishes immediately.

I submit 4 jobs to runatrisk to fill up all 16 cores on each of the 32 nodes in compute-0-[0-31] and compute-1-[0-31], the only places that a 32 node MPI job can run. To clarify, compute-0-[0-31] and compute-1-[0-31] are separate infiniband fabrics, so a 32 node MPI job is the largest we can do.
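
The arithmetic behind this setup can be sketched in Python. The figures are taken from the description above; the variable names are illustrative, and the sketch assumes each task occupies exactly one core.

```python
# Figures taken from the description above; names are illustrative.
CORES_PER_NODE = 16
NODES_PER_FABRIC = 32
TASKS_PER_NODE = 8          # each test job runs 8 tasks on each of its 32 nodes
FABRICS = ["compute-0", "compute-1"]

# Assumption: one task uses one core.
jobs_per_fabric = CORES_PER_NODE // TASKS_PER_NODE   # jobs that fit side by side on one fabric
jobs_to_fill = jobs_per_fabric * len(FABRICS)        # jobs needed to fill both fabrics

# Cores freed on every node of one fabric when a single job there is preempted.
freed_per_node = TASKS_PER_NODE
# The ncfs job has the same shape as the runatrisk jobs.
needed_per_node = TASKS_PER_NODE

print("jobs per fabric:", jobs_per_fabric)
print("jobs to fill both fabrics:", jobs_to_fill)
print("one preemption is enough:", freed_per_node >= needed_per_node)
```

Under the one-core-per-task assumption, two jobs fill a fabric, four jobs fill both, and pre-empting a single runatrisk job frees exactly the cores one ncfs job needs.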

Below you can see the high priority job #20817 is pending, and one of the low priority jobs #20813 is being pre-empted and going into completing state.



[marcin@ht0 mpi-test]$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
20817  ncfs       ncfs_hel  marcin  PD  0:00  32     (Resources)
20813  runatrisk  runatris  marcin  CG  2:06  32     compute-0-[0-31]
20816  runatrisk  runatris  marcin  R   0:28  32     compute-1-[0-31]
20815  runatrisk  runatris  marcin  R   0:29  32     compute-1-[0-31]
20814  runatrisk  runatris  marcin  R   0:30  32     compute-0-[0-31]

Now it decides to pre-empt another low priority job, #20815. The two low priority jobs each consume the same amount of resources, so killing one should be enough for the single high priority job to run. Also notice that the two low priority jobs it is killing are on two completely separate sets of nodes, one on compute-1* and the other on compute-0*, so killing both doesn't actually free up a contiguous block of resources: an MPI job can't span those two sets of nodes.

JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
20817  ncfs       ncfs_hel  marcin  PD  0:00  32     (Resources)
20815  runatrisk  runatris  marcin  CG  1:09  32     compute-1-[0-31]
20813  runatrisk  runatris  marcin  CG  2:06  5      compute-0-[0-1,9,17,25]
20816  runatrisk  runatris  marcin  R   1:10  32     compute-1-[0-31]
20814  runatrisk  runatris  marcin  R   1:12  32     compute-0-[0-31]

Below you can see that actually killing just one of the low priority jobs, #20813, would have been enough, because the high priority job is now running within the compute-0* set of nodes. So it did not have to kill job #20815.


[marcin@ht0 mpi-test]$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
20817  ncfs       ncfs_hel  marcin  R   0:01  32     compute-0-[0-31]
20815  runatrisk  runatris  marcin  CG  1:09  5      compute-1-[0-1,9,17,25]
20816  runatrisk  runatris  marcin  R   1:33  32     compute-1-[0-31]
20814  runatrisk  runatris  marcin  R   1:35  32     compute-0-[0-31]
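
The expected behavior described above can be sketched in Python. This is not how Slurm selects jobs to preempt; it is a toy model with hypothetical names (Job, pick_victims) that assumes one core per task, 16 cores per node, and that the pending job must fit entirely inside one fabric.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    fabric: str            # "compute-0" or "compute-1"
    cores_per_node: int    # cores used on every node of the fabric
    preemptable: bool

CORES_PER_NODE = 16        # assumption taken from the description above

def pick_victims(running, needed_per_node):
    """Return the smallest list of preemptable jobs, all on one fabric,
    whose removal frees at least needed_per_node cores on each node.
    Returns None if no single fabric can be freed up."""
    best = None
    for fabric in sorted({j.fabric for j in running}):
        on_fabric = [j for j in running if j.fabric == fabric]
        free = CORES_PER_NODE - sum(j.cores_per_node for j in on_fabric)
        victims = []
        # Preempt the largest jobs first so that fewer jobs are killed.
        for job in sorted((j for j in on_fabric if j.preemptable),
                          key=lambda j: -j.cores_per_node):
            if free >= needed_per_node:
                break
            victims.append(job)
            free += job.cores_per_node
        if free >= needed_per_node and (best is None or len(victims) < len(best)):
            best = victims
    return best

# The four runatrisk jobs from the squeue output, two per fabric.
running = [
    Job(20813, "compute-0", 8, True),
    Job(20814, "compute-0", 8, True),
    Job(20815, "compute-1", 8, True),
    Job(20816, "compute-1", 8, True),
]
victims = pick_victims(running, needed_per_node=8)
print([j.job_id for j in victims])
```

In this model only one runatrisk job is selected, which is consistent with job #20817 ending up on compute-0-[0-31] once #20813 alone had finished completing.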

And finally, here are the two low priority jobs that remain after pre-emption did its thing.


[marcin@ht0 mpi-test]$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
20816  runatrisk  runatris  marcin  R   3:29  32     compute-1-[0-31]
20814  runatrisk  runatris  marcin  R   3:31  32     compute-0-[0-31]

I am trying to prevent this from happening; I don't want more jobs killed than necessary. Is this a bug or a misconfiguration on my part?



Thanks,
Marcin



Marcin Sliwowski | SysAdmin@RENCI | (919) 445-0479

On 09/25/2013 04:04 PM, Marcin Sliwowski wrote:

I am running slurm 2.6.0 with PreemptType=preempt/partition_prio and the following partition setup:
PartitionName=DEFAULT Shared=NO State=UP Default=NO Priority=5000 MaxNodes=32 MaxTime=5760
PartitionName=ncfs Nodes=compute-[0-3]-[0-31] PreemptMode=REQUEUE Priority=10000 AllowGroups=ncfs,itgroup
PartitionName=batch Nodes=compute-[0-3]-[0-31] Default=YES PreemptMode=REQUEUE
PartitionName=runatrisk Nodes=compute-[0-3]-[0-31] MaxTime=1440 PreemptMode=CANCEL Priority=1000
Preemption is functioning but something is not quite right. I have an actual example below that illustrates the odd behavior and would appreciate any help from the community. If you need additional information regarding my config please let me know.

The job I submit to the runatrisk partition is a 32 node, 8 tasks per node MPI job that sleeps for an hour.
The job I submit to the ncfs partition is a 32 node, 8 tasks per node MPI job that just prints “Hello”.

The ncfs partition has a higher priority than runatrisk, which triggers the preemption.

I submit 4 jobs to runatrisk to fill up the nodes on compute-0-[0-31] and compute-1-[0-31], the only place that a 32 node job can run.
[marcin@ht0 mpi-test]$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
20817  ncfs       ncfs_hel  marcin  PD  0:00  32     (Resources)
20813  runatrisk  runatris  marcin  CG  2:06  32     compute-0-[0-31]
20816  runatrisk  runatris  marcin  R   0:28  32     compute-1-[0-31]
20815  runatrisk  runatris  marcin  R   0:29  32     compute-1-[0-31]
20814  runatrisk  runatris  marcin  R   0:30  32     compute-0-[0-31]

>> Below is the strange behavior: slurm decided to kill 2 of the runatrisk jobs, one on compute-0* and the other on compute-1*. In the end the ncfs job ends up running on compute-0*, so it shouldn’t have killed the runatrisk job on compute-1*. Does anyone know why this happens?

JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
20817  ncfs       ncfs_hel  marcin  PD  0:00  32     (Resources)
20815  runatrisk  runatris  marcin  CG  1:09  32     compute-1-[0-31]
20813  runatrisk  runatris  marcin  CG  2:06  5      compute-0-[0-1,9,17,25]
20816  runatrisk  runatris  marcin  R   1:10  32     compute-1-[0-31]
20814  runatrisk  runatris  marcin  R   1:12  32     compute-0-[0-31]
[marcin@ht0 mpi-test]$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
20817  ncfs       ncfs_hel  marcin  R   0:01  32     compute-0-[0-31]
20815  runatrisk  runatris  marcin  CG  1:09  5      compute-1-[0-1,9,17,25]
20816  runatrisk  runatris  marcin  R   1:33  32     compute-1-[0-31]
20814  runatrisk  runatris  marcin  R   1:35  32     compute-0-[0-31]
[marcin@ht0 mpi-test]$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
20816  runatrisk  runatris  marcin  R   3:29  32     compute-1-[0-31]
20814  runatrisk  runatris  marcin  R   3:31  32     compute-0-[0-31]
Thanks,

Marcin
