I believe that theory #2 is correct. Slurm will attempt to use
lower-weight nodes before higher-weight nodes, even if this
requires preemption. I can understand the desire to change this
logic and avoid preemption unless necessary. The relevant code is in
src/slurmctld/node_scheduler.c, function _pick_best_nodes(). We
probably want to pass the preemptee_candidates to select_g_job_test()
only after all nodes have been included (the last time through the
main loop, around line 984).
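
As a rough illustration of the difference, here is a toy Python model.
It is not Slurm code: Node, pick_nodes() and _try_fit() are hypothetical
names invented for this sketch, and the real logic in _pick_best_nodes()
is considerably more involved. The sketch only assumes that candidate
nodes are added one weight class at a time, lowest weight first, and
compares allowing preemption on every pass with allowing it only on the
last pass.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    weight: int
    free_cpus: int
    preemptible_cpus: int = 0  # CPUs held by jobs that may be preempted


def _try_fit(candidates, cpus_needed, allow_preempt):
    """Try to place the job on the candidate nodes, idle CPUs first."""
    used, remaining, preempted = [], cpus_needed, 0
    for node in candidates:
        take = min(node.free_cpus, remaining)
        if take:
            used.append(node.name)
            remaining -= take
    if remaining and allow_preempt:
        for node in candidates:
            take = min(node.preemptible_cpus, remaining)
            if take:
                if node.name not in used:
                    used.append(node.name)
                remaining -= take
                preempted += take
    return (used, preempted) if remaining == 0 else None


def pick_nodes(nodes, cpus_needed, preempt_only_on_last_pass):
    """Return (node names, CPUs preempted), or None if the job cannot start."""
    weights = sorted({n.weight for n in nodes})
    candidates = []
    for i, weight in enumerate(weights):
        # Grow the candidate set one weight class at a time, lowest first.
        candidates += [n for n in nodes if n.weight == weight]
        last_pass = i == len(weights) - 1
        allow_preempt = last_pass or not preempt_only_on_last_pass
        result = _try_fit(candidates, cpus_needed, allow_preempt)
        if result is not None:
            return result
    return None


# Assumed example: c1-1 is half occupied by a lowpri job, c3-1 is idle.
nodes = [
    Node("c1-1", weight=4173, free_cpus=8, preemptible_cpus=8),
    Node("c3-1", weight=4174, free_cpus=16),
]

# Preemption allowed on every pass: the lowpri job on c1-1 is preempted.
print(pick_nodes(nodes, 16, preempt_only_on_last_pass=False))
# Preemption only on the last pass: the idle CPUs on c3-1 are used instead.
print(pick_nodes(nodes, 16, preempt_only_on_last_pass=True))
```

In this toy case the first call preempts 8 CPUs on c1-1 while c3-1 sits
idle, and the second call starts the job without preempting anything, at
the cost of spreading it over two nodes.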

Moe

Quoting Bjørn-Helge Mevik <[email protected]>:

>
> We are running Slurm 2.4.3 on Rocks based on CentOS 6.2.
>
> We have configured scheduling such that jobs in the "lowpri" QoS will be
> preempted by jobs in other QoSes if they need the resources.  (See
> extract from slurm.conf below.)
>
> As I understand it, a job should not be preempted by another job if
> there are resources available for the new job to start (without
> preempting the lowpri job).
>
> We are finding that lowpri jobs get requeued far more frequently
> than we would expect, especially at times when there are many (several
> thousand) free CPUs on the cluster.  See an example below.
>
> I have two theories about what might cause this:
>
> 1) We use TopologyPlugin=topology/tree.  Perhaps the scheduler prefers to
>    preempt jobs rather than place a job on multiple switches?
>
> 2) We have given the nodes in racks different weight, in order to try
>    and "pack" jobs into the smallest number of nodes (to keep whole
>    nodes free for jobs that need whole nodes).  Perhaps the scheduler
>    will preempt a job rather than select nodes with higher weight for
>    the new jobs?
>
> Is one of these correct?
>
>
>
> The (hopefully) relevant part of the slurm.conf:
>
> SchedulerType=sched/backfill
> SchedulerParameters=bf_max_job_user=10
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
> PreemptMode=requeue
> PreemptType=preempt/qos
> AccountingStorageEnforce=limits,qos
> TopologyPlugin=topology/tree
> Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:100 State=unknown
> PartitionName=DEFAULT State=up Shared=NO
> Nodename=c1-[1-36] NodeHostname=compute-1-[1-36] Weight=4173 Feature=rack1,intel,ib
> Nodename=c2-[1-36] NodeHostname=compute-2-[1-36] Weight=4173 Feature=rack2,intel,ib
> Nodename=c3-[1-36] NodeHostname=compute-3-[1-36] Weight=4174 Feature=rack3,intel,ib
> Nodename=c4-[1-36] NodeHostname=compute-4-[1-36] Weight=4174 Feature=rack4,intel,ib
> Nodename=c5-[1-36] NodeHostname=compute-5-[1-36] Weight=4175 Feature=rack5,intel,ib
> Nodename=c6-[1-36] NodeHostname=compute-6-[1-36] Weight=4175 Feature=rack6,intel,ib
> [etc.]
>
>
> Example: the lowpri job 100965:
>
> # sacct --duplicates -X -o start,state,end,exitcode,alloccpus,nodelist -j 100965
>               Start      State                 End ExitCode  AllocCPUS        NodeList
> ------------------- ---------- ------------------- -------- ---------- ---------------
> 2012-10-10T12:22:55  CANCELLED 2012-10-10T12:56:38      0:0          8            c1-1
> 2012-10-10T12:56:50  CANCELLED 2012-10-10T13:27:19      0:0          8           c6-32
> 2012-10-10T13:27:31  CANCELLED 2012-10-10T13:42:33      0:0          8            c1-3
> 2012-10-10T13:42:46  CANCELLED 2012-10-10T13:46:50      0:0          8            c1-3
> 2012-10-10T13:47:03  CANCELLED 2012-10-10T15:43:33      0:0          8            c7-6
> 2012-10-10T15:43:48  CANCELLED 2012-10-10T17:30:14      0:0          8           c7-23
> 2012-10-10T17:30:28  CANCELLED 2012-10-10T18:52:40      0:0          8          c10-21
> 2012-10-10T18:52:56  CANCELLED 2012-10-10T21:03:21      0:0          8          c11-23
> 2012-10-10T21:32:59  CANCELLED 2012-10-10T21:52:08      0:0          8          c11-16
> 2012-10-10T21:52:56  NODE_FAIL 2012-10-11T00:02:43      0:0          8          c11-36
> 2012-10-11T00:05:30  CANCELLED 2012-10-11T03:25:02      0:0          8          c10-25
> 2012-10-11T03:25:32  CANCELLED 2012-10-11T06:08:02      0:0          8           c4-16
> 2012-10-11T06:08:55  CANCELLED 2012-10-11T08:21:17      0:0          8           c6-32
> 2012-10-11T08:21:32  CANCELLED 2012-10-11T08:25:31      0:0          8            c1-5
> 2012-10-11T08:25:47  CANCELLED 2012-10-11T08:30:24      0:0          8           c4-13
> 2012-10-11T08:30:44  CANCELLED 2012-10-11T08:32:50      0:0          8           c5-36
> 2012-10-11T08:33:02  CANCELLED 2012-10-11T09:35:25      0:0          8           c8-32
> 2012-10-11T09:36:02  CANCELLED 2012-10-11T10:31:11      0:0          8            c8-9
> 2012-10-11T10:31:30  CANCELLED 2012-10-11T10:32:35      0:0          8            c8-7
> 2012-10-11T10:33:36  CANCELLED 2012-10-11T10:37:18      0:0          8          c10-18
> 2012-10-11T10:37:43     FAILED 2012-10-11T14:59:10      1:0          8          c11-13
>
> (There is a node failure in there, and the job failed when it finally
> got to run long enough.)  Apart from a short period around 21:00 on the
> 10th, fewer than 7,000 of the ~10,000 cores were in use.
>
>
> --
> Bjørn-Helge Mevik, dr. scient,
> Research Computing Services, University of Oslo
