[slurm-dev] Re: preemption based on partition priority

Marcin Sliwowski Tue, 08 Oct 2013 10:17:09 -0700

Dear Moe Jette,

I would really appreciate you looking into this and confirming or denying, based on your own experiments, the behaviour that I am consistently seeing. I think I may be running into a bug, but no one has responded to or commented on my previous posts.

I believe that this strange pre-emption behaviour may have something to do with our TopologyPlugin=topology/tree plugin.


The contents of topology.conf are

SwitchName=s0 Nodes=compute-0-[0-31]
SwitchName=s1 Nodes=compute-1-[0-31]
SwitchName=s2 Nodes=compute-2-[0-31]
SwitchName=s3 Nodes=compute-3-[0-31]

We have 4 Dell blade centers that make up this cluster, but our InfiniBand fabric does not span all 4 of them; instead, each blade center, composed of 32 nodes, has its own fabric.

If I down all the nodes except for compute-0-[0-31], I am left with just one fully functional blade center. I then fill those nodes up with low priority jobs by submitting them to the runatrisk low priority partition. When I then submit a single high priority job to the ncfs high priority partition, the correct number of low priority jobs are pre-empted to make room for the single high priority job.

However, in the past, with two fully functional blade centers, the single high priority job would kill twice as many jobs as necessary to free up enough resources. It would kill exactly the right number of jobs on blade center 0, represented by SwitchName=s0 Nodes=compute-0-[0-31], and then go and kill the right number of jobs again on blade center 1, represented in topology.conf by SwitchName=s1 Nodes=compute-1-[0-31].

It is almost as if pre-emption and the topology plugin are not understanding each other. They are certainly talking to each other, since jobs are being pre-empted across topological divisions even though it is not necessary.
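
To make the expectation concrete, here is a minimal Python sketch of the victim selection I would expect. It is not Slurm's actual algorithm. It assumes one core per task, 16 cores per node, jobs that each use 8 cores on every node of a switch, and a high priority job that must fit inside a single switch. The function name pick_victims and the dictionary layout are made up for illustration; the job IDs come from the squeue output quoted further down.

```python
CORES_PER_NODE = 16  # from the thread: 16 cores on each node
TASKS_PER_NODE = 8   # every test job runs 8 tasks per node (1 core per task assumed)

def pick_victims(jobs_by_switch, cores_needed=TASKS_PER_NODE):
    """Return (switch, victims) for the single switch that needs the fewest
    pre-emptions, or None if no switch can host the job.

    jobs_by_switch maps a switch name to the low priority job IDs running
    on all nodes of that switch (hypothetical layout, for illustration only).
    """
    best = None
    for switch, jobs in sorted(jobs_by_switch.items()):
        free = CORES_PER_NODE - TASKS_PER_NODE * len(jobs)
        victims = []
        for job in jobs:
            if free >= cores_needed:
                break  # enough cores are already free on this switch
            victims.append(job)
            free += TASKS_PER_NODE
        if free >= cores_needed and (best is None or len(victims) < len(best[1])):
            best = (switch, victims)
    return best

running = {"s0": [20950, 20951], "s1": [20952, 20953]}
print(pick_victims(running))  # ('s0', [20950])
```

Under these assumptions only one job on one switch is pre-empted, and nothing on the other switch is touched.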


Thank You,
Marcin


Marcin Sliwowski | SysAdmin@RENCI | (919) 445-0479

On 10/02/2013 09:50 AM, Marcin Sliwowski wrote:

I can reproduce this issue every single time. Here is an excerpt from the slurmctld.log that illustrates the two jobs 20952 and 20950 from the lower priority runatrisk partition being killed due to pre-emption by job 20954.
Snippet from squeue showing the two jobs being brought down and the high priority job awaiting resources:
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
20954      ncfs ncfs_hel   marcin  PD       0:00     32 (Resources)
20952 runatrisk runatris   marcin  CG       1:29      5 compute-1-[0-1,9,17,25]
20950 runatrisk runatris   marcin  CG       0:50      5 compute-0-[0-1,9,17,25]
20953 runatrisk runatris   marcin   R       1:27     32 compute-1-[0-31]
20951 runatrisk runatris   marcin   R       1:44     32 compute-0-[0-31]


Excerpt from slurmctld.log:
[2013-10-02T09:36:58.392] _slurm_rpc_submit_batch_job JobId=20954 usec=2720
[2013-10-02T09:36:58.394] preempted job 20950 has been killed
[2013-10-02T09:36:58.397] Signal 9 of StepId=20950.0 by UID=1175: Job/step already completing or completed
[2013-10-02T09:36:59.422] completing job 20950
[2013-10-02T09:36:59.422] _slurm_rpc_complete_batch_script JobId=20950: Job/step already completing or completed
[2013-10-02T09:37:55.405] Resending TERMINATE_JOB request JobId=20950 Nodelist=compute-0-[0-1,9,17,25]
[2013-10-02T09:37:55.407] preempted job 20952 has been killed
[2013-10-02T09:37:55.412] Signal 9 of StepId=20952.0 by UID=1175: Job/step already completing or completed
[2013-10-02T09:37:56.561] completing job 20952
[2013-10-02T09:37:56.561] _slurm_rpc_complete_batch_script JobId=20952: Job/step already completing or completed
[2013-10-02T09:38:05.061] sched: _slurm_rpc_step_complete StepId=20950.0 usec=123
[2013-10-02T09:38:05.075] Job 20950 completion process took 67 seconds
[2013-10-02T09:38:05.076] sched: Allocate JobId=20954 NodeList=compute-0-[0-31] #CPUs=256
[2013-10-02T09:38:05.458] sched: _slurm_rpc_job_step_create: StepId=20954.0 compute-0-[0-31] usec=446
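
For anyone who wants to check their own slurmctld.log for the same pattern, a small Python sketch along these lines could pull out the pre-empted jobs and the resulting allocation. The regular expressions are based only on the log lines quoted above, so treat the message wording as an assumption for other Slurm versions. The names summarize and SAMPLE are illustrative.

```python
import re

# Patterns follow the slurmctld.log lines quoted in this message; other
# Slurm versions may word these messages differently (assumption).
PREEMPTED = re.compile(r"^\[([^\]]+)\] preempted job (\d+) has been killed")
ALLOCATED = re.compile(r"^\[([^\]]+)\] sched: Allocate JobId=(\d+)\s*NodeList=(\S+)")

def summarize(lines):
    """Return (pre-empted job IDs, [(job ID, node list)]) in log order."""
    preempted, allocated = [], []
    for line in lines:
        m = PREEMPTED.match(line)
        if m:
            preempted.append(int(m.group(2)))
            continue
        m = ALLOCATED.match(line)
        if m:
            allocated.append((int(m.group(2)), m.group(3)))
    return preempted, allocated

SAMPLE = [
    "[2013-10-02T09:36:58.394] preempted job 20950 has been killed",
    "[2013-10-02T09:36:59.422] completing job 20950",
    "[2013-10-02T09:37:55.407] preempted job 20952 has been killed",
    "[2013-10-02T09:38:05.076] sched: Allocate JobId=20954 NodeList=compute-0-[0-31] #CPUs=256",
]

print(summarize(SAMPLE))  # ([20950, 20952], [(20954, 'compute-0-[0-31]')])
```

Two jobs are reported as pre-empted, while the high priority job is allocated only compute-0-[0-31].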
Anyone have an explanation for this behaviour?

Thanks

Marcin Sliwowski | SysAdmin@RENCI | (919) 445-0479
On 09/27/2013 04:34 PM, Marcin Sliwowski wrote:
Hello All,
I just realized that my original message to the mailing list was a little hard to read, so I would like to ask my question again in a clearer fashion.
Pre-emption is working, but it appears that too many jobs in the low priority "runatrisk" partition are being killed for a single high priority job in the "ncfs" partition to run.
It almost appears as if the new high priority job is impatient: first a single low priority job goes into the "completing" state as a result of being pre-empted. The completion of this one low priority job leaves enough resources for the high priority job to run. But while the first low priority job is completing, pre-emption decides to target yet another low priority job, which isn't necessary. Below is a very clear example of this, illustrated with squeue.
I am running slurm 2.6.0 with
PreemptType=preempt/partition_prio

and the following partitions are defined:
PartitionName=DEFAULT Shared=NO State=UP Default=NO Priority=5000 MaxNodes=32 MaxTime=5760
PartitionName=ncfs Nodes=compute-[0-3]-[0-31] PreemptMode=REQUEUE Priority=10000 AllowGroups=ncfs,itgroup
PartitionName=batch Nodes=compute-[0-3]-[0-31] Default=YES PreemptMode=REQUEUE
PartitionName=runatrisk Nodes=compute-[0-3]-[0-31] MaxTime=1440 PreemptMode=CANCEL Priority=1000
Here is the actual example that I am referring to, illustrated with squeue:
The job I submit to the runatrisk low priority partition is a 32 node, 8 tasks per node MPI job that sleeps for an hour.
The job I submit to the ncfs high priority partition is a 32 node, 8 tasks per node MPI job that just prints “Hello” and finishes immediately.
I submit 4 jobs to runatrisk to fill up all 16 cores on each of the 32 nodes in compute-0-[0-31] and compute-1-[0-31], the only places that a 32 node MPI job can run. To clarify, compute-0-[0-31] and compute-1-[0-31] are separate InfiniBand fabrics, so a 32 node MPI job is the largest we can do.
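
The arithmetic behind "4 jobs fill both fabrics" can be spelled out in a few lines of Python. This is only a back-of-the-envelope check and assumes one core per task; the variable names are illustrative.

```python
nodes_per_fabric = 32  # compute-0-[0-31] or compute-1-[0-31]
cores_per_node = 16
tasks_per_node = 8     # assumption: each task uses exactly one core
fabrics = 2            # the two sets of nodes used in this test

cpus_per_job = nodes_per_fabric * tasks_per_node     # CPUs used by one 32 node job
jobs_per_fabric = cores_per_node // tasks_per_node   # jobs that fit side by side on one fabric
jobs_to_fill = fabrics * jobs_per_fabric             # jobs needed to fill both fabrics

print(cpus_per_job, jobs_per_fabric, jobs_to_fill)   # 256 2 4
```

Since the high priority job has the same 32 node, 8 tasks per node shape, pre-empting a single low priority job frees exactly the 256 CPUs it needs on one fabric.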
Below you can see the high priority job #20817 is pending, and one of the low priority jobs, #20813, is being pre-empted and going into completing state.
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
20817      ncfs ncfs_hel   marcin  PD       0:00     32 (Resources)
20813 runatrisk runatris   marcin  CG       2:06     32 compute-0-[0-31]
20816 runatrisk runatris   marcin   R       0:28     32 compute-1-[0-31]
20815 runatrisk runatris   marcin   R       0:29     32 compute-1-[0-31]
20814 runatrisk runatris   marcin   R       0:30     32 compute-0-[0-31]
Now it decides to pre-empt another low priority job, #20815. The two low priority jobs each consume the same amount of resources, so killing one should be enough for the single high priority job to run. Also notice that the two low priority jobs it is killing are on two completely separate sets of nodes, one on compute-1* and the other on compute-0*, so killing both doesn't actually free up a contiguous block of resources; an MPI job can't span those two sets of nodes.
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
20817      ncfs ncfs_hel   marcin  PD       0:00     32 (Resources)
20815 runatrisk runatris   marcin  CG       1:09     32 compute-1-[0-31]
20813 runatrisk runatris   marcin  CG       2:06      5 compute-0-[0-1,9,17,25]
20816 runatrisk runatris   marcin   R       1:10     32 compute-1-[0-31]
20814 runatrisk runatris   marcin   R       1:12     32 compute-0-[0-31]
Below you can see that killing just one of the low priority jobs, #20813, would have been enough, because the high priority job is now running within the compute-0* set of nodes. So it did not have to kill job #20815.
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
20817      ncfs ncfs_hel   marcin   R       0:01     32 compute-0-[0-31]
20815 runatrisk runatris   marcin  CG       1:09      5 compute-1-[0-1,9,17,25]
20816 runatrisk runatris   marcin   R       1:33     32 compute-1-[0-31]
20814 runatrisk runatris   marcin   R       1:35     32 compute-0-[0-31]
And a final example, showing just the two low priority jobs remaining after pre-emption did its thing.
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
20816 runatrisk runatris   marcin   R       3:29     32 compute-1-[0-31]
20814 runatrisk runatris   marcin   R       3:31     32 compute-0-[0-31]
I am trying to prevent this from happening; I don't want jobs being killed unnecessarily. Is this a bug or a misconfiguration on my part?
Thanks,
Marcin



Marcin Sliwowski | SysAdmin@RENCI | (919) 445-0479
On 09/25/2013 04:04 PM, Marcin Sliwowski wrote:
I am running slurm 2.6.0 with PreemptType=preempt/partition_prio and the following partition setup:
PartitionName=DEFAULT Shared=NO State=UP Default=NO Priority=5000 MaxNodes=32 MaxTime=5760
PartitionName=ncfs Nodes=compute-[0-3]-[0-31] PreemptMode=REQUEUE Priority=10000 AllowGroups=ncfs,itgroup
PartitionName=batch Nodes=compute-[0-3]-[0-31] Default=YES PreemptMode=REQUEUE
PartitionName=runatrisk Nodes=compute-[0-3]-[0-31] MaxTime=1440 PreemptMode=CANCEL Priority=1000
Preemption is functioning, but something is not quite right. I have an actual example below that illustrates the odd behavior and would appreciate any help from the community. If you need additional information regarding my config, please let me know.
The job I submit to the runatrisk partition is a 32 node, 8 tasks per node MPI job that sleeps for an hour.
The job I submit to the ncfs partition is a 32 node, 8 tasks per node MPI job that just prints “Hello”.
The ncfs partition has a higher priority than runatrisk, which triggers the preemption.
I submit 4 jobs to runatrisk to fill up the nodes on compute-0-[0-31] and compute-1-[0-31], the only places that a 32 node job can run.
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
20817      ncfs ncfs_hel   marcin  PD       0:00     32 (Resources)
20813 runatrisk runatris   marcin  CG       2:06     32 compute-0-[0-31]
20816 runatrisk runatris   marcin   R       0:28     32 compute-1-[0-31]
20815 runatrisk runatris   marcin   R       0:29     32 compute-1-[0-31]
20814 runatrisk runatris   marcin   R       0:30     32 compute-0-[0-31]
>> Below is the strange behavior: slurm decided to kill 2 of the runatrisk jobs, one on compute-0* and the other on compute-1*. In the end the ncfs job ends up running on compute-0*, so it shouldn’t have killed the runatrisk job on compute-1*. Does anyone know why this happens?
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
20817      ncfs ncfs_hel   marcin  PD       0:00     32 (Resources)
20815 runatrisk runatris   marcin  CG       1:09     32 compute-1-[0-31]
20813 runatrisk runatris   marcin  CG       2:06      5 compute-0-[0-1,9,17,25]
20816 runatrisk runatris   marcin   R       1:10     32 compute-1-[0-31]
20814 runatrisk runatris   marcin   R       1:12     32 compute-0-[0-31]
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
20817      ncfs ncfs_hel   marcin   R       0:01     32 compute-0-[0-31]
20815 runatrisk runatris   marcin  CG       1:09      5 compute-1-[0-1,9,17,25]
20816 runatrisk runatris   marcin   R       1:33     32 compute-1-[0-31]
20814 runatrisk runatris   marcin   R       1:35     32 compute-0-[0-31]
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
20816 runatrisk runatris   marcin   R       3:29     32 compute-1-[0-31]
20814 runatrisk runatris   marcin   R       3:31     32 compute-0-[0-31]
Thanks,

Marcin
