I can reproduce this issue every single time. Here is an excerpt from
slurmctld.log showing two jobs, 20952 and 20950, from the lower-priority
runatrisk partition being killed due to pre-emption by job 20954.
Snippet from squeue showing the two jobs being brought down and the
high-priority job awaiting resources:
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
20954 ncfs      ncfs_hel marcin PD 0:00 32    (Resources)
20952 runatrisk runatris marcin CG 1:29 5     compute-1-[0-1,9,17,25]
20950 runatrisk runatris marcin CG 0:50 5     compute-0-[0-1,9,17,25]
20953 runatrisk runatris marcin R  1:27 32    compute-1-[0-31]
20951 runatrisk runatris marcin R  1:44 32    compute-0-[0-31]
Excerpt from slurmctld.log:
[2013-10-02T09:36:58.392] _slurm_rpc_submit_batch_job JobId=20954 usec=2720
[2013-10-02T09:36:58.394] preempted job 20950 has been killed
[2013-10-02T09:36:58.397] Signal 9 of StepId=20950.0 by UID=1175: Job/step already completing or completed
[2013-10-02T09:36:59.422] completing job 20950
[2013-10-02T09:36:59.422] _slurm_rpc_complete_batch_script JobId=20950: Job/step already completing or completed
[2013-10-02T09:37:55.405] Resending TERMINATE_JOB request JobId=20950 Nodelist=compute-0-[0-1,9,17,25]
[2013-10-02T09:37:55.407] preempted job 20952 has been killed
[2013-10-02T09:37:55.412] Signal 9 of StepId=20952.0 by UID=1175: Job/step already completing or completed
[2013-10-02T09:37:56.561] completing job 20952
[2013-10-02T09:37:56.561] _slurm_rpc_complete_batch_script JobId=20952: Job/step already completing or completed
[2013-10-02T09:38:05.061] sched: _slurm_rpc_step_complete StepId=20950.0 usec=123
[2013-10-02T09:38:05.075] Job 20950 completion process took 67 seconds
[2013-10-02T09:38:05.076] sched: Allocate JobId=20954 NodeList=compute-0-[0-31] #CPUs=256
[2013-10-02T09:38:05.458] sched: _slurm_rpc_job_step_create: StepId=20954.0 compute-0-[0-31] usec=446
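To make the timing in this excerpt easier to see, here is a minimal Python sketch that pulls the "preempted job ... has been killed" and "sched: Allocate" events out of the log and prints how far apart they are. The log lines are copied from the excerpt above with wrapped lines rejoined; the helper name parse_events and the regular expressions are my own and only cover the message formats shown here.

```python
import re
from datetime import datetime

# Lines copied from the slurmctld.log excerpt (wrapped lines rejoined).
LOG = """\
[2013-10-02T09:36:58.392] _slurm_rpc_submit_batch_job JobId=20954 usec=2720
[2013-10-02T09:36:58.394] preempted job 20950 has been killed
[2013-10-02T09:37:55.405] Resending TERMINATE_JOB request JobId=20950 Nodelist=compute-0-[0-1,9,17,25]
[2013-10-02T09:37:55.407] preempted job 20952 has been killed
[2013-10-02T09:38:05.075] Job 20950 completion process took 67 seconds
[2013-10-02T09:38:05.076] sched: Allocate JobId=20954 NodeList=compute-0-[0-31] #CPUs=256
"""

LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] (?P<msg>.*)$")
KILL_RE = re.compile(r"preempted job (\d+) has been killed")
ALLOC_RE = re.compile(r"sched: Allocate JobId=(\d+)")


def parse_events(text):
    """Return (kills, allocs): two dicts mapping job id -> event time."""
    kills, allocs = {}, {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not a timestamped log line
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        msg = m.group("msg")
        kill = KILL_RE.search(msg)
        if kill:
            kills[int(kill.group(1))] = ts
        alloc = ALLOC_RE.search(msg)
        if alloc:
            allocs[int(alloc.group(1))] = ts
    return kills, allocs


kills, allocs = parse_events(LOG)
started = allocs[20954]
for job, ts in sorted(kills.items(), key=lambda kv: kv[1]):
    lead = (started - ts).total_seconds()
    print(f"job {job} killed {lead:.3f}s before job 20954 was allocated")
```

On this excerpt the second kill (20952) is issued about 57 seconds after the first (20950), while 20950 is still in its 67-second completion process, and job 20954 is then allocated on compute-0-[0-31], the nodes that 20950 released.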
Anyone have an explanation for this behaviour?
Thanks
Marcin Sliwowski | SysAdmin@RENCI | (919) 445-0479
On 09/27/2013 04:34 PM, Marcin Sliwowski wrote:
Hello All,
I just realized that my original message to the mailing list was a
little hard to read so I would like to ask my question again in a
clearer fashion.
Pre-emption is working but it appears that too many jobs in the low
priority "runatrisk" partition are being killed for a single high
priority job in the "ncfs" partition to run.
It almost appears as if the new high-priority job is impatient. First,
a single low-priority job goes into the "completing" state as a result
of being pre-empted. Once that one job has completed, there are enough
resources for the high-priority job to run. But while the first
low-priority job is still completing, pre-emption targets yet another
low-priority job, which isn't necessary. Below is a clear example of
this, illustrated with squeue.
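The "impatient" hypothesis above can be written down as a toy model. This is not Slurm's scheduler code, and whether 2.6.0 really works this way is exactly the open question; the function victims_for_pass, the "block" unit and the count_completing switch are all made up for illustration.

```python
def victims_for_pass(needed, idle, completing, running, count_completing):
    """Job ids that one scheduling pass would pre-empt to free `needed` blocks.

    A "block" is a 32-node x 8-task slice, the size of every job in the
    example. `count_completing` decides whether blocks held by jobs that
    are already in the completing (CG) state count as soon-to-be-free.
    """
    expected_free = idle + (completing if count_completing else 0)
    victims = []
    for job_id in running:
        if expected_free >= needed:
            break
        victims.append(job_id)
        expected_free += 1  # every running job here holds exactly one block
    return victims


# Pass 1: nothing idle, nothing completing, so one victim is required.
first = victims_for_pass(1, idle=0, completing=0,
                         running=[20813, 20815, 20816, 20814],
                         count_completing=False)

# Pass 2 runs while 20813 is still completing.
still_running = [20815, 20816, 20814]
impatient = victims_for_pass(1, 0, 1, still_running, count_completing=False)
patient = victims_for_pass(1, 0, 1, still_running, count_completing=True)

print("pass 1 victims:          ", first)
print("pass 2, ignoring CG jobs:", impatient)
print("pass 2, counting CG jobs:", patient)
```

In this model a pass that ignores completing jobs picks a second victim, while a pass that counts them picks none. That matches the symptom described, but it is only a sketch of the suspicion, not a diagnosis.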
I am running slurm 2.6.0
PreemptType=preempt/partition_prio
and the following partitions are defined:
PartitionName=DEFAULT Shared=NO State=UP Default=NO Priority=5000 MaxNodes=32 MaxTime=5760
PartitionName=ncfs Nodes=compute-[0-3]-[0-31] PreemptMode=REQUEUE Priority=10000 AllowGroups=ncfs,itgroup
PartitionName=batch Nodes=compute-[0-3]-[0-31] Default=YES PreemptMode=REQUEUE
PartitionName=runatrisk Nodes=compute-[0-3]-[0-31] MaxTime=1440 PreemptMode=CANCEL Priority=1000
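Since the partition lines lean on PartitionName=DEFAULT, here is a small Python sketch that resolves the effective settings per partition, on the understanding that values on a DEFAULT line apply to the partition lines that follow it. The parser (effective_partitions) is a hypothetical helper written for this message; it only handles simple Key=Value tokens and is not how slurmctld reads slurm.conf.

```python
CONF = """\
PartitionName=DEFAULT Shared=NO State=UP Default=NO Priority=5000 MaxNodes=32 MaxTime=5760
PartitionName=ncfs Nodes=compute-[0-3]-[0-31] PreemptMode=REQUEUE Priority=10000 AllowGroups=ncfs,itgroup
PartitionName=batch Nodes=compute-[0-3]-[0-31] Default=YES PreemptMode=REQUEUE
PartitionName=runatrisk Nodes=compute-[0-3]-[0-31] MaxTime=1440 PreemptMode=CANCEL Priority=1000
"""


def effective_partitions(text):
    """Map partition name -> settings, with DEFAULT values filled in."""
    defaults, partitions = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        fields = dict(token.split("=", 1) for token in line.split())
        name = fields.pop("PartitionName")
        if name == "DEFAULT":
            defaults.update(fields)  # applies to the lines that follow
        else:
            partitions[name] = {**defaults, **fields}
    return partitions


parts = effective_partitions(CONF)
by_priority = sorted(parts, key=lambda n: int(parts[n]["Priority"]), reverse=True)
for name in by_priority:
    p = parts[name]
    print(f"{name:10s} Priority={p['Priority']:>5s} "
          f"PreemptMode={p['PreemptMode']} MaxTime={p['MaxTime']}")
```

With these lines the ordering comes out as ncfs (10000), batch (5000, inherited from DEFAULT), then runatrisk (1000), so runatrisk is the lowest-priority partition of the three.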
Here is the actual example that I am referring to, illustrated with
squeue:
The job I submit to the runatrisk low priority partition is a 32 node
8 tasks per node MPI job that sleeps for an hour.
The job I submit to the ncfs high priority partition is a 32 node 8
tasks per node MPI job that just prints “Hello” and finishes immediately.
I submit 4 jobs to runatrisk to fill up all 16 cores on each of the 32
nodes in compute-0-[0-31] and compute-1-[0-31], the only places that a
32-node MPI job can run. To clarify, compute-0-[0-31] and
compute-1-[0-31] are on separate InfiniBand fabrics, so a 32-node MPI
job is the largest we can run.
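The arithmetic behind "4 jobs fill everything" and "killing one job is enough" can be checked in a few lines of Python. All figures come from the description above; the one assumption I am adding is that each MPI task occupies one core.

```python
FABRICS = 2            # compute-0-[0-31] and compute-1-[0-31]
NODES_PER_FABRIC = 32
CORES_PER_NODE = 16

JOB_NODES = 32         # every job in the example asks for 32 nodes
TASKS_PER_NODE = 8     # assumption: one core per task
LOW_PRIO_JOBS = 4

cores_total = FABRICS * NODES_PER_FABRIC * CORES_PER_NODE
cores_per_job = JOB_NODES * TASKS_PER_NODE
jobs_per_fabric = CORES_PER_NODE // TASKS_PER_NODE

# The four runatrisk jobs use every core on both fabrics.
assert LOW_PRIO_JOBS * cores_per_job == cores_total

# Killing one runatrisk job frees exactly what one ncfs job needs,
# and frees it inside a single fabric.
cores_freed_by_one_kill = cores_per_job
assert cores_freed_by_one_kill == JOB_NODES * TASKS_PER_NODE

print(f"cores on both fabrics: {cores_total}")
print(f"cores per job:         {cores_per_job}")
print(f"jobs per fabric:       {jobs_per_fabric}")
```

So each fabric holds two of these jobs side by side, and a single pre-emption already frees a full 32-node, 8-tasks-per-node slot on one fabric.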
Below you can see the high priority job #20817 is pending, and one of
the low priority jobs #20813 is being pre-empted and going into
completing state.
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
20817 ncfs      ncfs_hel marcin PD 0:00 32    (Resources)
20813 runatrisk runatris marcin CG 2:06 32    compute-0-[0-31]
20816 runatrisk runatris marcin R  0:28 32    compute-1-[0-31]
20815 runatrisk runatris marcin R  0:29 32    compute-1-[0-31]
20814 runatrisk runatris marcin R  0:30 32    compute-0-[0-31]
Now it decides to pre-empt another low-priority job, #20815. The two
low-priority jobs each consume the same amount of resources, so killing
one should be enough for the single high-priority job to run. Also
notice that the two jobs it is killing are on two completely separate
sets of nodes, one on compute-1* and the other on compute-0*. Killing
both doesn't free up a usable contiguous block of resources, because
an MPI job can't span those two sets of nodes.
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
20817 ncfs      ncfs_hel marcin PD 0:00 32    (Resources)
20815 runatrisk runatris marcin CG 1:09 32    compute-1-[0-31]
20813 runatrisk runatris marcin CG 2:06 5     compute-0-[0-1,9,17,25]
20816 runatrisk runatris marcin R  1:10 32    compute-1-[0-31]
20814 runatrisk runatris marcin R  1:12 32    compute-0-[0-31]
Below you can see that actually killing just one of the low priority
jobs #20813 would have been enough, because the high priority job is
now running within the compute-0* set of nodes. So it did not have to
kill job #20815.
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
20817 ncfs      ncfs_hel marcin R  0:01 32    compute-0-[0-31]
20815 runatrisk runatris marcin CG 1:09 5     compute-1-[0-1,9,17,25]
20816 runatrisk runatris marcin R  1:33 32    compute-1-[0-31]
20814 runatrisk runatris marcin R  1:35 32    compute-0-[0-31]
And a final example of just two low priority jobs remaining after
pre-emption did its thing.
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
20816 runatrisk runatris marcin R  3:29 32    compute-1-[0-31]
20814 runatrisk runatris marcin R  3:31 32    compute-0-[0-31]
I am trying to prevent this from happening; I don't want jobs being
killed unnecessarily. Is this a bug or a misconfiguration on my part?
Thanks,
Marcin
Marcin Sliwowski | SysAdmin@RENCI | (919) 445-0479
On 09/25/2013 04:04 PM, Marcin Sliwowski wrote:
I am running slurm 2.6.0 with PreemptType=preempt/partition_prio and
the following partition setup
PartitionName=DEFAULT Shared=NO State=UP Default=NO Priority=5000 MaxNodes=32 MaxTime=5760
PartitionName=ncfs Nodes=compute-[0-3]-[0-31] PreemptMode=REQUEUE Priority=10000 AllowGroups=ncfs,itgroup
PartitionName=batch Nodes=compute-[0-3]-[0-31] Default=YES PreemptMode=REQUEUE
PartitionName=runatrisk Nodes=compute-[0-3]-[0-31] MaxTime=1440 PreemptMode=CANCEL Priority=1000
Preemption is functioning but something is not quite right. I have an
actual example below that illustrates the odd behavior and would
appreciate any help from the community. If you need additional
information regarding my config please let me know.
The job I submit to the runatrisk partition is a 32 node 8 tasks per
node MPI job that sleeps for an hour.
The job I submit to the ncfs partition is a 32 node 8 tasks per node
MPI job that just prints “Hello”.
The ncfs partition has a higher priority than runatrisk which
triggers the preemption.
I submit 4 jobs to runatrisk to fill up the nodes on compute-0-[0-31]
and compute-1-[0-31], the only place that a 32 node job can run.
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
20817 ncfs      ncfs_hel marcin PD 0:00 32    (Resources)
20813 runatrisk runatris marcin CG 2:06 32    compute-0-[0-31]
20816 runatrisk runatris marcin R  0:28 32    compute-1-[0-31]
20815 runatrisk runatris marcin R  0:29 32    compute-1-[0-31]
20814 runatrisk runatris marcin R  0:30 32    compute-0-[0-31]
Below is the strange behavior: Slurm decided to kill two of the
runatrisk jobs, one on compute-0* and the other on compute-1*. In the
end the ncfs job ran on compute-0*, so it shouldn't have killed the
runatrisk job on compute-1*. Does anyone know why this happens?
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
20817 ncfs      ncfs_hel marcin PD 0:00 32    (Resources)
20815 runatrisk runatris marcin CG 1:09 32    compute-1-[0-31]
20813 runatrisk runatris marcin CG 2:06 5     compute-0-[0-1,9,17,25]
20816 runatrisk runatris marcin R  1:10 32    compute-1-[0-31]
20814 runatrisk runatris marcin R  1:12 32    compute-0-[0-31]
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
20817 ncfs      ncfs_hel marcin R  0:01 32    compute-0-[0-31]
20815 runatrisk runatris marcin CG 1:09 5     compute-1-[0-1,9,17,25]
20816 runatrisk runatris marcin R  1:33 32    compute-1-[0-31]
20814 runatrisk runatris marcin R  1:35 32    compute-0-[0-31]
[marcin@ht0 mpi-test]$ squeue
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
20816 runatrisk runatris marcin R  3:29 32    compute-1-[0-31]
20814 runatrisk runatris marcin R  3:31 32    compute-0-[0-31]
Thanks,
Marcin