We are running Slurm 2.4.3 on Rocks, based on CentOS 6.2.

We have configured scheduling such that jobs in the "lowpri" QoS will be
preempted by jobs in other QoSes if they need the resources.  (See
extract from slurm.conf below.)
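
For completeness: the QoS relationship itself lives in the accounting
database and is managed with sacctmgr, roughly along the following lines
("normal" here just stands in for our higher-priority QoSes; this is a
sketch, not a verbatim dump of our setup):

  sacctmgr add qos lowpri
  sacctmgr modify qos normal set Preempt=lowpri
  sacctmgr show qos format=Name,Preempt,PreemptMode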

As I understand it, a job should not be preempted by another job if
there are resources available for the new job to start (without
preempting the lowpri job).

We are finding that lowpri jobs get requeued far more frequently than we
would expect, especially at times when there are many (several thousand)
free CPUs on the cluster.  See an example below.

I have two theories about what might cause this:

1) We use TopologyPlugin=topology/tree.  Perhaps the scheduler prefers
   preempting a job to placing the new job across multiple switches?

2) We have given the nodes in different racks different weights, in order
   to try to "pack" jobs onto the smallest number of nodes (to keep whole
   nodes free for jobs that need whole nodes).  Perhaps the scheduler
   will preempt a job rather than select nodes with a higher weight for
   the new job?  (A sketch of a test for this is below.)

Is one of these correct?
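
For concreteness, one way to test these would be to temporarily give all
nodes the same weight (and/or switch to TopologyPlugin=topology/none) and
watch whether the requeue rate drops.  A sketch only; the values are
illustrative:

  TopologyPlugin=topology/none
  Nodename=c1-[1-36] NodeHostname=compute-1-[1-36] Weight=1 Feature=rack1,intel,ib
  Nodename=c2-[1-36] NodeHostname=compute-2-[1-36] Weight=1 Feature=rack2,intel,ib
  [etc.]

followed by "scontrol reconfigure" (or a restart of slurmctld if the plugin
change is not picked up by a reconfigure).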



The (hopefully) relevant part of the slurm.conf:

SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_user=10
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
PreemptMode=requeue
PreemptType=preempt/qos
AccountingStorageEnforce=limits,qos
TopologyPlugin=topology/tree
Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:100 State=unknown
PartitionName=DEFAULT State=up Shared=NO
Nodename=c1-[1-36] NodeHostname=compute-1-[1-36] Weight=4173 Feature=rack1,intel,ib
Nodename=c2-[1-36] NodeHostname=compute-2-[1-36] Weight=4173 Feature=rack2,intel,ib
Nodename=c3-[1-36] NodeHostname=compute-3-[1-36] Weight=4174 Feature=rack3,intel,ib
Nodename=c4-[1-36] NodeHostname=compute-4-[1-36] Weight=4174 Feature=rack4,intel,ib
Nodename=c5-[1-36] NodeHostname=compute-5-[1-36] Weight=4175 Feature=rack5,intel,ib
Nodename=c6-[1-36] NodeHostname=compute-6-[1-36] Weight=4175 Feature=rack6,intel,ib
[etc.]
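
For reference (theory 1), a topology.conf of the common one-leaf-switch-
per-rack form looks schematically like this; the switch names and the
top-level count below are illustrative, not copied from our actual file:

  SwitchName=sw1 Nodes=c1-[1-36]
  SwitchName=sw2 Nodes=c2-[1-36]
  [etc.]
  SwitchName=top Switches=sw[1-17]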


Example: the lowpri job 100965:

# sacct --duplicates -X -o start,state,end,exitcode,alloccpus,nodelist -j 100965
              Start      State                 End ExitCode  AllocCPUS        NodeList
------------------- ---------- ------------------- -------- ---------- ---------------
2012-10-10T12:22:55  CANCELLED 2012-10-10T12:56:38      0:0          8            c1-1
2012-10-10T12:56:50  CANCELLED 2012-10-10T13:27:19      0:0          8           c6-32
2012-10-10T13:27:31  CANCELLED 2012-10-10T13:42:33      0:0          8            c1-3
2012-10-10T13:42:46  CANCELLED 2012-10-10T13:46:50      0:0          8            c1-3
2012-10-10T13:47:03  CANCELLED 2012-10-10T15:43:33      0:0          8            c7-6
2012-10-10T15:43:48  CANCELLED 2012-10-10T17:30:14      0:0          8           c7-23
2012-10-10T17:30:28  CANCELLED 2012-10-10T18:52:40      0:0          8          c10-21
2012-10-10T18:52:56  CANCELLED 2012-10-10T21:03:21      0:0          8          c11-23
2012-10-10T21:32:59  CANCELLED 2012-10-10T21:52:08      0:0          8          c11-16
2012-10-10T21:52:56  NODE_FAIL 2012-10-11T00:02:43      0:0          8          c11-36
2012-10-11T00:05:30  CANCELLED 2012-10-11T03:25:02      0:0          8          c10-25
2012-10-11T03:25:32  CANCELLED 2012-10-11T06:08:02      0:0          8           c4-16
2012-10-11T06:08:55  CANCELLED 2012-10-11T08:21:17      0:0          8           c6-32
2012-10-11T08:21:32  CANCELLED 2012-10-11T08:25:31      0:0          8            c1-5
2012-10-11T08:25:47  CANCELLED 2012-10-11T08:30:24      0:0          8           c4-13
2012-10-11T08:30:44  CANCELLED 2012-10-11T08:32:50      0:0          8           c5-36
2012-10-11T08:33:02  CANCELLED 2012-10-11T09:35:25      0:0          8           c8-32
2012-10-11T09:36:02  CANCELLED 2012-10-11T10:31:11      0:0          8            c8-9
2012-10-11T10:31:30  CANCELLED 2012-10-11T10:32:35      0:0          8            c8-7
2012-10-11T10:33:36  CANCELLED 2012-10-11T10:37:18      0:0          8          c10-18
2012-10-11T10:37:43     FAILED 2012-10-11T14:59:10      1:0          8          c11-13

(There is one node failure in there, and the job itself failed when it
finally got to run long enough.)  Apart from a short period around 21:00
on the 10th, fewer than 7,000 of the ~10,000 cores were in use.
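
For reference, the cluster-wide CPU usage at a given time can be checked
with sinfo; its "%C" format prints the counts as allocated/idle/other/total:

  sinfo -h -o "%C"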


-- 
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
