I believe that theory #2 is correct. Slurm will attempt to use nodes with lower weight before nodes with higher weight, even if this requires preemption. I can understand the desire to change this logic and avoid preemption unless necessary. The relevant code is in src/slurmctld/node_scheduler.c, function _pick_best_nodes(). We probably want to pass the preemptee_candidates to select_g_job_test() only after all nodes have been included (the last time through the main loop, around line 984).
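To make the suggested change concrete, here is a minimal Python sketch of the idea. It is a toy model, not the Slurm source: Node, job_test() and pick_nodes() are hypothetical names, job_test() only counts CPUs as a stand-in for select_g_job_test(), and the loop assumes that candidate nodes are added one weight tier at a time, lowest weight first.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    weight: int
    free_cpus: int
    preemptable_cpus: int = 0  # CPUs held by jobs that could be preempted


def job_test(nodes, cpus_needed, allow_preemption):
    """Hypothetical stand-in for select_g_job_test(): does the job fit?"""
    available = sum(
        n.free_cpus + (n.preemptable_cpus if allow_preemption else 0)
        for n in nodes
    )
    return available >= cpus_needed


def pick_nodes(nodes, cpus_needed, preempt_last_only):
    """Grow the candidate set one weight tier at a time, lowest weight first.

    Returns (candidate_nodes, used_preemption), or (None, False) if the
    job does not fit even with preemption on all nodes.
    """
    weights = sorted({n.weight for n in nodes})
    candidates = []
    for i, weight in enumerate(weights):
        candidates += [n for n in nodes if n.weight == weight]
        last_pass = i == len(weights) - 1
        # preempt_last_only=False models preemption being allowed on every
        # pass; True models the suggested change (final pass only).
        allow_preemption = last_pass or not preempt_last_only
        if job_test(candidates, cpus_needed, allow_preemption=False):
            return candidates, False
        if allow_preemption and job_test(candidates, cpus_needed, True):
            return candidates, True
    return None, False


if __name__ == "__main__":
    # One low-weight node that is full of preemptable work, and one
    # higher-weight node that is idle.
    nodes = [Node("c1-1", 4173, 0, 8), Node("c3-1", 4174, 16)]
    print(pick_nodes(nodes, 8, preempt_last_only=False)[1])  # True
    print(pick_nodes(nodes, 8, preempt_last_only=True)[1])   # False
```

In this toy model the first call preempts work on the low-weight node even though the higher-weight node is idle, while the second call reaches the idle node first and needs no preemption.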

Moe

Quoting Bjørn-Helge Mevik <[email protected]>:

> We are running slurm 2.4.3 on Rocks based on CentOS 6.2.
>
> We have configured scheduling such that jobs in the "lowpri" QoS will be
> preempted by jobs in other QoSes if they need the resources. (See
> extract from slurm.conf below.)
>
> As I understand it, a job should not be preempted by another job if
> there are resources available for the new job to start (without
> preempting the lowpri job).
>
> We are experiencing that lowpri jobs get rescheduled far more frequently
> than we would expect, especially at times when there are many (several
> thousand) free CPUs on the cluster. See an example below.
>
> I have two theories about what might cause this:
>
> 1) We use TopologyPlugin=topology/tree. Perhaps the scheduler prefers to
>    preempt jobs rather than place a job on multiple switches?
>
> 2) We have given the nodes in racks different weight, in order to try
>    and "pack" jobs into the smallest number of nodes (to keep whole
>    nodes free for jobs that need whole nodes). Perhaps the scheduler
>    will preempt a job rather than select nodes with higher weight for
>    the new jobs?
>
> Is one of these correct?
>
> The (hopefully) relevant part of the slurm.conf:
>
> SchedulerType=sched/backfill
> SchedulerParameters=bf_max_job_user=10
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
> PreemptMode=requeue
> PreemptType=preempt/qos
> AccountingStorageEnforce=limits,qos
> TopologyPlugin=topology/tree
> Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:100 State=unknown
> PartitionName=DEFAULT State=up Shared=NO
> Nodename=c1-[1-36] NodeHostname=compute-1-[1-36] Weight=4173 Feature=rack1,intel,ib
> Nodename=c2-[1-36] NodeHostname=compute-2-[1-36] Weight=4173 Feature=rack2,intel,ib
> Nodename=c3-[1-36] NodeHostname=compute-3-[1-36] Weight=4174 Feature=rack3,intel,ib
> Nodename=c4-[1-36] NodeHostname=compute-4-[1-36] Weight=4174 Feature=rack4,intel,ib
> Nodename=c5-[1-36] NodeHostname=compute-5-[1-36] Weight=4175 Feature=rack5,intel,ib
> Nodename=c6-[1-36] NodeHostname=compute-6-[1-36] Weight=4175 Feature=rack6,intel,ib
> [etc.]
>
> Example: the lowpri job 100965:
>
> # sacct --duplicates -X -o start,state,end,exitcode,alloccpus,nodelist -j 100965
> Start               State      End                 ExitCode AllocCPUS  NodeList
> ------------------- ---------- ------------------- -------- ---------- ---------------
> 2012-10-10T12:22:55 CANCELLED  2012-10-10T12:56:38 0:0      8          c1-1
> 2012-10-10T12:56:50 CANCELLED  2012-10-10T13:27:19 0:0      8          c6-32
> 2012-10-10T13:27:31 CANCELLED  2012-10-10T13:42:33 0:0      8          c1-3
> 2012-10-10T13:42:46 CANCELLED  2012-10-10T13:46:50 0:0      8          c1-3
> 2012-10-10T13:47:03 CANCELLED  2012-10-10T15:43:33 0:0      8          c7-6
> 2012-10-10T15:43:48 CANCELLED  2012-10-10T17:30:14 0:0      8          c7-23
> 2012-10-10T17:30:28 CANCELLED  2012-10-10T18:52:40 0:0      8          c10-21
> 2012-10-10T18:52:56 CANCELLED  2012-10-10T21:03:21 0:0      8          c11-23
> 2012-10-10T21:32:59 CANCELLED  2012-10-10T21:52:08 0:0      8          c11-16
> 2012-10-10T21:52:56 NODE_FAIL  2012-10-11T00:02:43 0:0      8          c11-36
> 2012-10-11T00:05:30 CANCELLED  2012-10-11T03:25:02 0:0      8          c10-25
> 2012-10-11T03:25:32 CANCELLED  2012-10-11T06:08:02 0:0      8          c4-16
> 2012-10-11T06:08:55 CANCELLED  2012-10-11T08:21:17 0:0      8          c6-32
> 2012-10-11T08:21:32 CANCELLED  2012-10-11T08:25:31 0:0      8          c1-5
> 2012-10-11T08:25:47 CANCELLED  2012-10-11T08:30:24 0:0      8          c4-13
> 2012-10-11T08:30:44 CANCELLED  2012-10-11T08:32:50 0:0      8          c5-36
> 2012-10-11T08:33:02 CANCELLED  2012-10-11T09:35:25 0:0      8          c8-32
> 2012-10-11T09:36:02 CANCELLED  2012-10-11T10:31:11 0:0      8          c8-9
> 2012-10-11T10:31:30 CANCELLED  2012-10-11T10:32:35 0:0      8          c8-7
> 2012-10-11T10:33:36 CANCELLED  2012-10-11T10:37:18 0:0      8          c10-18
> 2012-10-11T10:37:43 FAILED     2012-10-11T14:59:10 1:0      8          c11-13
>
> (There is a node failure in there, and the job failed when it finally
> got to run long enough.) Apart from a short period around 21:00 on the
> 10th, fewer than 7,000 of the ~10,000 cores were in use.
>
> --
> Bjørn-Helge Mevik, dr. scient,
> Research Computing Services, University of Oslo
