We are running Slurm 2.4.3 on Rocks based on CentOS 6.2. We have configured scheduling such that jobs in the "lowpri" QoS will be preempted by jobs in other QoSes if those jobs need the resources. (See the extract from slurm.conf below.)
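For completeness, the QoS side is set up roughly along these lines with sacctmgr (a from-memory sketch; "normal" here stands in for our higher-priority QoSes):

  # sacctmgr add qos lowpri
  # sacctmgr modify qos normal set preempt=lowpri
  # sacctmgr show qos format=name,priority,preempt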
As I understand it, a job should not be preempted by another job if there are resources available for the new job to start without preempting the lowpri job. We are seeing lowpri jobs get preempted and requeued far more frequently than we would expect, especially at times when there are many (several thousand) free CPUs on the cluster. See the example below.

I have two theories about what might cause this:

1) We use TopologyPlugin=topology/tree. Perhaps the scheduler prefers preempting a job over placing the new job across multiple switches? (There is a sketch of the topology layout at the end of this mail.)

2) We have given the nodes in different racks different weights, in order to try and "pack" jobs into the smallest number of nodes (to keep whole nodes free for jobs that need whole nodes). Perhaps the scheduler will preempt a job rather than select nodes with higher weight for the new job?

Is one of these correct?

The (hopefully) relevant part of slurm.conf:

SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_user=10
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
PreemptMode=requeue
PreemptType=preempt/qos
AccountingStorageEnforce=limits,qos
TopologyPlugin=topology/tree

Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:100 State=unknown
PartitionName=DEFAULT State=up Shared=NO

Nodename=c1-[1-36] NodeHostname=compute-1-[1-36] Weight=4173 Feature=rack1,intel,ib
Nodename=c2-[1-36] NodeHostname=compute-2-[1-36] Weight=4173 Feature=rack2,intel,ib
Nodename=c3-[1-36] NodeHostname=compute-3-[1-36] Weight=4174 Feature=rack3,intel,ib
Nodename=c4-[1-36] NodeHostname=compute-4-[1-36] Weight=4174 Feature=rack4,intel,ib
Nodename=c5-[1-36] NodeHostname=compute-5-[1-36] Weight=4175 Feature=rack5,intel,ib
Nodename=c6-[1-36] NodeHostname=compute-6-[1-36] Weight=4175 Feature=rack6,intel,ib
[etc.]

Example: the lowpri job 100965:

# sacct --duplicates -X -o start,state,end,exitcode,alloccpus,nodelist -j 100965
              Start      State                 End ExitCode  AllocCPUS        NodeList
------------------- ---------- ------------------- -------- ---------- ---------------
2012-10-10T12:22:55  CANCELLED 2012-10-10T12:56:38      0:0          8            c1-1
2012-10-10T12:56:50  CANCELLED 2012-10-10T13:27:19      0:0          8           c6-32
2012-10-10T13:27:31  CANCELLED 2012-10-10T13:42:33      0:0          8            c1-3
2012-10-10T13:42:46  CANCELLED 2012-10-10T13:46:50      0:0          8            c1-3
2012-10-10T13:47:03  CANCELLED 2012-10-10T15:43:33      0:0          8            c7-6
2012-10-10T15:43:48  CANCELLED 2012-10-10T17:30:14      0:0          8           c7-23
2012-10-10T17:30:28  CANCELLED 2012-10-10T18:52:40      0:0          8          c10-21
2012-10-10T18:52:56  CANCELLED 2012-10-10T21:03:21      0:0          8          c11-23
2012-10-10T21:32:59  CANCELLED 2012-10-10T21:52:08      0:0          8          c11-16
2012-10-10T21:52:56  NODE_FAIL 2012-10-11T00:02:43      0:0          8          c11-36
2012-10-11T00:05:30  CANCELLED 2012-10-11T03:25:02      0:0          8          c10-25
2012-10-11T03:25:32  CANCELLED 2012-10-11T06:08:02      0:0          8           c4-16
2012-10-11T06:08:55  CANCELLED 2012-10-11T08:21:17      0:0          8           c6-32
2012-10-11T08:21:32  CANCELLED 2012-10-11T08:25:31      0:0          8            c1-5
2012-10-11T08:25:47  CANCELLED 2012-10-11T08:30:24      0:0          8           c4-13
2012-10-11T08:30:44  CANCELLED 2012-10-11T08:32:50      0:0          8           c5-36
2012-10-11T08:33:02  CANCELLED 2012-10-11T09:35:25      0:0          8           c8-32
2012-10-11T09:36:02  CANCELLED 2012-10-11T10:31:11      0:0          8            c8-9
2012-10-11T10:31:30  CANCELLED 2012-10-11T10:32:35      0:0          8            c8-7
2012-10-11T10:33:36  CANCELLED 2012-10-11T10:37:18      0:0          8          c10-18
2012-10-11T10:37:43     FAILED 2012-10-11T14:59:10      1:0          8          c11-13

(There is a node failure in there, and the job failed once it finally got to run long enough.) Apart from a short period around 21:00 on the 10th, fewer than 7,000 of the ~10,000 cores were in use (see below for how we read that off).
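The free-CPU figures come from sinfo; its %C field prints the cluster-wide CPU counts as Allocated/Idle/Other/Total:

  # sinfo -o "%C"

so "free" above means the Idle count.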
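And in case the topology theory matters: topology/tree is driven by topology.conf, and ours is the usual tree of leaf switches under a top switch. A simplified sketch of that shape, cut down to two racks with illustrative switch names (not our exact file):

  SwitchName=sw1  Nodes=c1-[1-36]
  SwitchName=sw2  Nodes=c2-[1-36]
  SwitchName=top  Switches=sw[1-2]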
--
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo