Doh! Sorry, please ignore: we have a reservation in place for a user starting today, so those nodes are being left idle because backfill can't start the longer jobs before the reservation begins.
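For anyone who finds this thread later: the quickest way to rule a reservation in or out is to list the reservations and compare them against the nodes that sinfo reports as idle. A minimal sketch, using only standard Slurm commands (nothing cluster-specific):

    # list all current and upcoming reservations and the nodes they hold
    scontrol show reservation

    # reservation-oriented summary (name, state, start/end times, node list)
    sinfo --reservation

If the idle nodes fall inside an active or upcoming reservation, backfill will (correctly) refuse to start anything on them that can't finish before the reservation starts.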
Paddy

On Fri, May 20, 2016 at 09:12:29AM +0100, Paddy Doyle wrote:
> Forgot to mention, it's slurm version slurm-15.08.7
>
> On Fri, May 20, 2016 at 09:03:06AM +0100, Paddy Doyle wrote:
> >
> > Hi all,
> >
> > We're seeing a really strange scheduling issue on one of our clusters, whereby
> > jobs are not being scheduled, even though there are many idle nodes.
> >
> > In fact there were 19 idle nodes, with the first priority job only needing 1;
> > the next few needed 6 nodes etc.
> >
> > Turning up Debug logging showed lots of "best_fit topology failure: no switch
> > currently has sufficient resource to satisfy the request" messages in the logs,
> > but even so the 'max_switch_wait' (which we haven't set and so should default
> > to 300 seconds) doesn't seem to be honoured.
> >
> > This morning there are 25 idle nodes, with the top priority job needing 6.
> >
> > I'll copy in the slurm.conf, topology.conf, and some relevant logs and queue
> > snapshots.
> >
> > Any help would be appreciated.
> >
> > Thanks,
> > Paddy
> >
> > #############################################################
> > # the cluster and queue state yesterday evening:
> > #############################################################
> >
> > root@kelvin01:/etc/slurm # sinfo
> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> > compute      up 3-00:00:00      1 drain* kelvin-n027
> > compute      up 3-00:00:00     76  alloc kelvin-n[001-026,028-045,049-054,072-094,096-098]
> > compute      up 3-00:00:00     19   idle kelvin-n[046-048,055-067,069-071]
> > debug*       up      30:00      2   idle kelvin-n[099-100]
> >
> > root@kelvin01:/etc/slurm # squeue --start | head
> >  JOBID PARTITION     NAME    USER ST          START_TIME NODES SCHEDNODES  NODELIST(REASON)
> >  86801   compute debo_ben aaaaaaa PD 2016-05-20T12:20:35     1 kelvin-n039 (Priority)
> >  86677   compute GdDC_25_  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Resources)
> >  86678   compute GdDC_25_  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Priority)
> >  86679   compute        B ccccccc PD 2016-05-20T21:24:29     8 (null)      (Priority)
> >  86680   compute       BA ccccccc PD 2016-05-20T21:24:29     8 (null)      (Priority)
> >  86682   compute GdDC_5_9  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Priority)
> >  86683   compute GdDC_5_9  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Priority)
> >  86684   compute GdDC_5_9  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Priority)
> >  86685   compute GdDC_10_  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Priority)
> >
> > root@kelvin01:/etc/slurm # squeue -tr
> >  JOBID PARTITION     NAME    USER ST       TIME NODES NODELIST(REASON)
> >  86699   compute antiferr aaaaaaa  R 2-05:41:33     2 kelvin-n[019,021]
> >  86700   compute   ferro3 aaaaaaa  R 2-05:40:03     2 kelvin-n[030-031]
> >  86745   compute lco_int_ ddddddd  R   19:41:24     6 kelvin-n[040,075,084,096-098]
> >  86681   compute lco_int_ ddddddd  R    9:33:37     6 kelvin-n[013,017-018,022-024]
> >  86729   compute lco_int_ ddddddd  R    9:21:36     6 kelvin-n[049-054]
> >  86765   compute debo_ace aaaaaaa  R    9:04:06     1 kelvin-n072
> >  86766   compute   acetic aaaaaaa  R    6:40:31     1 kelvin-n038
> >  86793   compute lco_int_ ddddddd  R    6:37:31     6 kelvin-n[039,041-045]
> >  86662   compute        B eeeeeee  R    4:37:31    16 kelvin-n[032-035,077,079,085-094]
> >  86810   compute    C_opt fffffff  R    3:47:40     6 kelvin-n[007-012]
> >  86808   compute lco_int_ ddddddd  R    3:01:55     6 kelvin-n[016,020,036-037,082-083]
> >  86674   compute GdDC_20_  bbbbbb  R    2:52:57     6 kelvin-n[073-074,076,078,080-081]
> >  86675   compute GdDC_20_  bbbbbb  R    2:45:57     6 kelvin-n[001-006]
> >  86676   compute GdDC_25_  bbbbbb  R    1:09:55     6 kelvin-n[014-015,025-026,028-029]
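(Side note on the 'max_switch_wait' mentioned above: since we never set it, the simplest confirmation of what the running slurmctld is actually using is to dump the live config. Only a sketch; the grep is just for convenience:)

    # show any explicitly-set scheduler parameters; max_switch_wait would appear here
    scontrol show config | grep -i SchedulerParameters

    # setting it explicitly would be a slurm.conf line like the following
    # (value in seconds; 300 is the documented default):
    #   SchedulerParameters=max_switch_wait=300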
> >
> > #############################################################
> > # snips from logs at that time:
> > #############################################################
> >
> > [2016-05-19T18:58:24.658] backfill: beginning
> > [2016-05-19T18:58:24.658] debug: backfill: 45 jobs to backfill
> > [2016-05-19T18:58:24.658] backfill test for JobID=86677 Prio=12823347 Partition=compute
> > [2016-05-19T18:58:24.658] debug: job 86677: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] debug: job 86677: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] debug: job 86677: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] Job 86677 to start at 2016-05-20T21:24:29, end at 2016-05-22T21:24:00 on kelvin-n[013,017-019,021-022]
> > [2016-05-19T18:58:24.658] backfill test for JobID=86678 Prio=12823322 Partition=compute
> > [2016-05-19T18:58:24.658] debug: job 86678: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] debug: job 86678: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] debug: job 86678: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] Job 86678 to start at 2016-05-20T21:24:29, end at 2016-05-22T21:24:00 on kelvin-n[013,017-019,021-022]
> > [2016-05-19T18:58:24.658] backfill test for JobID=86679 Prio=12662420 Partition=compute
> > [2016-05-19T18:58:24.659] debug: job 86679: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86679: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86679: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] Job 86679 to start at 2016-05-20T21:24:29, end at 2016-05-23T21:24:00 on kelvin-n[013,017-019,021-024]
> > [2016-05-19T18:58:24.659] backfill test for JobID=86680 Prio=12662172 Partition=compute
> > [2016-05-19T18:58:24.659] debug: job 86680: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86680: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86680: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] Job 86680 to start at 2016-05-20T21:24:29, end at 2016-05-23T21:24:00 on kelvin-n[013,017-019,021-024]
> > [2016-05-19T18:58:24.659] backfill test for JobID=86731 Prio=12564061 Partition=compute
> > [2016-05-19T18:58:24.659] Job 86731 to start at 2016-05-21T16:05:09, end at 2016-05-24T16:05:00 on kelvin-n[013,017-019,021-024,030-031,073-074,076,078,080-081]
> > [2016-05-19T18:58:24.659] backfill test for JobID=86682 Prio=12286095 Partition=compute
> > [2016-05-19T18:58:24.659] debug: job 86682: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86682: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86682: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] Job 86682 to start at 2016-05-20T21:24:29, end at 2016-05-22T21:24:00 on kelvin-n[013,017-019,021-022]
> > etc
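(The 'best_fit topology failure' lines above came from turning up the debug level. For similar detail without restarting slurmctld, something like the following should work; picking the Backfill debug flag as the most useful one here is my assumption rather than anything from the docs about this particular message:)

    # temporarily raise slurmctld logging and enable backfill-specific debugging
    scontrol setdebug debug
    scontrol setdebugflags +backfill

    # ...watch /var/log/slurm.log while a backfill cycle runs, then revert
    scontrol setdebugflags -backfill
    scontrol setdebug info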
> >
> > #############################################################
> > # it's even worse this morning: 25 idle nodes!
> > #############################################################
> >
> > root@kelvin01:/etc/slurm # sinfo
> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> > compute      up 3-00:00:00      1 drain* kelvin-n027
> > compute      up 3-00:00:00     70  alloc kelvin-n[001-026,028-038,040,049-054,072-094,096-098]
> > compute      up 3-00:00:00     25   idle kelvin-n[039,041-048,055-067,069-071]
> > debug*       up      30:00      2   idle kelvin-n[099-100]
> >
> > root@kelvin01:/etc/slurm # squeue --start | head
> >  JOBID PARTITION     NAME    USER ST          START_TIME NODES SCHEDNODES           NODELIST(REASON)
> >  86678   compute GdDC_25_  lucida PD 2016-05-20T21:24:00     6 kelvin-n[039,041-045 (Priority)
> >  86679   compute        B shuklag PD 2016-05-20T21:24:00     8 kelvin-n[055-062]    (Priority)
> >  86680   compute       BA shuklag PD 2016-05-20T21:24:00     8 kelvin-n[063-067,069 (Priority)
> >  86682   compute GdDC_5_9  lucida PD 2016-05-20T21:24:00     6 kelvin-n[049-054]    (Priority)
> >  86683   compute GdDC_5_9  lucida PD 2016-05-20T21:24:29     6 kelvin-n[023,030-031 (Priority)
> >  86801   compute debo_ben tandons PD 2016-05-20T21:24:29     1 kelvin-n024          (Priority)
> >  86818   compute    C_opt watsong PD 2016-05-20T21:24:29     6 kelvin-n[013,017-019 (Resources)
> >  86684   compute GdDC_5_9  lucida PD 2016-05-21T16:05:09     6 (null)               (Priority)
> >  86685   compute GdDC_10_  lucida PD 2016-05-21T16:05:09     6 (null)               (Priority)
> >
> > root@kelvin01:/etc/slurm # squeue -tr -l
> > Fri May 20 08:59:35 2016
> >  JOBID PARTITION     NAME    USER   STATE       TIME  TIME_LIMI NODES NODELIST(REASON)
> >  86699   compute antiferr tandons RUNNING 2-19:43:02 3-00:00:00     2 kelvin-n[019,021]
> >  86700   compute   ferro3 tandons RUNNING 2-19:41:32 3-00:00:00     2 kelvin-n[030-031]
> >  86745   compute lco_int_ gavinai RUNNING 1-09:42:53 3-00:00:00     6 kelvin-n[040,075,084,096-098]
> >  86681   compute lco_int_ gavinai RUNNING   23:35:06 1-12:00:00     6 kelvin-n[013,017-018,022-024]
> >  86729   compute lco_int_ gavinai RUNNING   23:23:05 1-00:00:00     6 kelvin-n[049-054]
> >  86765   compute debo_ace tandons RUNNING   23:05:35 3-00:00:00     1 kelvin-n072
> >  86766   compute   acetic tandons RUNNING   20:42:00 3-00:00:00     1 kelvin-n038
> >  86662   compute        B montese RUNNING   18:39:00 3-00:00:00    16 kelvin-n[032-035,077,079,085-094]
> >  86808   compute lco_int_ gavinai RUNNING   17:03:24 3-00:00:00     6 kelvin-n[016,020,036-037,082-083]
> >  86674   compute GdDC_20_  lucida RUNNING   16:54:26 2-00:00:00     6 kelvin-n[073-074,076,078,080-081]
> >  86675   compute GdDC_20_  lucida RUNNING   16:47:26 2-00:00:00     6 kelvin-n[001-006]
> >  86676   compute GdDC_25_  lucida RUNNING   15:11:24 2-00:00:00     6 kelvin-n[014-015,025-026,028-029]
> >  86677   compute GdDC_25_  lucida RUNNING   13:16:34 2-00:00:00     6 kelvin-n[007-012]
> >
> > --
> > Paddy Doyle
> > Trinity Centre for High Performance Computing,
> > Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> > Phone: +353-1-896-3725
> > http://www.tchpc.tcd.ie/
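(One more quick sanity check, cross-referencing the idle list above with the topology.conf quoted below: count how many idle nodes sit under each leaf switch. A rough sketch only; the nodelists are copied by hand from topology.conf, the loop assumes the compute partition, and a switch with no idle nodes simply prints an empty count:)

    # count idle nodes per leaf switch (nodelists copied from topology.conf below)
    for sw_nodes in "kelvin-n[025-048]" "kelvin-n[073-094,096-100]" \
                    "kelvin-n[049-067,069-072]" "kelvin-n[001-024]"; do
        printf '%-32s idle: ' "$sw_nodes"
        sinfo --noheader --partition=compute --states=idle \
              --nodes="$sw_nodes" --format="%D"
    done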
> > #
> > # Example slurm.conf file.  Please run configurator.html
> > # (in doc/html) to build a configuration file customized
> > # for your environment.
> > #
> > #
> > # slurm.conf file generated by configurator.html.
> > #
> > # See the slurm.conf man page for more information.
> > #
> > ClusterName=kelvin
> > ControlMachine=kelvin01
> > ControlAddr=192.168.19.254
> > BackupController=kelvin-n001
> > BackupAddr=192.168.16.1
> > #
> > SlurmUser=root
> > #SlurmdUser=root
> > SlurmctldPort=6817
> > SlurmdPort=6818
> > AuthType=auth/munge
> > EnforcePartLimits=YES
> > JobRequeue=1
> > #JobCredentialPrivateKey=
> > #JobCredentialPublicCertificate=
> > StateSaveLocation=/var/slurm_state/kelvin
> > #SlurmdSpoolDir=/tmp/slurmd
> > SwitchType=switch/none
> > MpiDefault=none
> > SlurmctldPidFile=/var/run/slurmctld.pid
> > SlurmdPidFile=/var/run/slurmd.pid
> > #ProctrackType=proctrack/pgid
> > ProctrackType=proctrack/cgroup
> > #PluginDir=
> > CacheGroups=0
> > #FirstJobId=
> > ReturnToService=1
> > #MaxJobCount=
> > #PlugStackConfig=
> > #PropagatePrioProcess=
> > #PropagateResourceLimits=
> > #PropagateResourceLimitsExcept=
> > #PropagateResourceLimits=NONE
> > PropagateResourceLimitsExcept=CPU,RSS,DATA,AS
> > Prolog=/etc/slurm/prolog
> > PrologFlags=Alloc
> > Epilog=/etc/slurm/slurm.epilog.clean
> > EpilogSlurmctld=/etc/slurm/slurm.epilogslurmctld
> > #SrunProlog=
> > #SrunEpilog=
> > #TaskProlog=
> > #TaskEpilog=
> > #TaskPlugin=
> > TaskPlugin=task/cgroup
> > #TrackWCKey=no
> > #TreeWidth=50
> > #TmpFs=
> > #UsePAM=
> > #
> > # TIMERS
> > SlurmctldTimeout=300
> > SlurmdTimeout=300
> > HealthCheckInterval=3600
> > HealthCheckProgram=/etc/slurm/slurm.healthcheck
> > InactiveLimit=0
> > MinJobAge=300
> > KillWait=30
> > Waittime=0
> > RebootProgram=/sbin/reboot
> > #
> > # SCHEDULING
> > SchedulerType=sched/backfill
> > #SchedulerType=sched/wiki
> > SchedulerPort=7321
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_Core_Memory
> > #SchedulerAuth=
> > #SchedulerPort=
> > #SchedulerRootFilter=
> > FastSchedule=0
> > #PriorityType=priority/multifactor
> > #PriorityDecayHalfLife=14-0
> > #PriorityUsageResetPeriod=14-0
> > #PriorityWeightFairshare=100000
> > #PriorityWeightAge=1000
> > #PriorityWeightPartition=10000
> > #PriorityWeightJobSize=1000
> > #PriorityMaxAge=1-0
> > #
> > # LOGGING
> > SlurmctldDebug=3
> > SlurmctldLogFile=/var/log/slurm.log
> > SlurmdDebug=3
> > SlurmdLogFile=/var/log/slurm.log
> > JobCompType=jobcomp/none
> > #JobCompLoc=
> > #
> > # ACCOUNTING
> > #JobAcctGatherType=jobacct_gather/linux
> > #JobAcctGatherFrequency=30
> > #
> > # LOGGING AND ACCOUNTING
> > #AccountingStorageEnforce=0
> > #AccountingStorageEnforce=limits
> > AccountingStorageEnforce=safe    # don't start a job unless there's enough balance
> > AccountingStorageHost=service01
> > #AccountingStorageLoc=
> > #AccountingStoragePass=
> > #AccountingStoragePort=
> > AccountingStorageType=accounting_storage/slurmdbd
> > #AccountingStorageUser=
> > #JobCompHost=
> > #JobCompLoc=
> > #JobCompPass=
> > #JobCompPort=
> > #JobCompUser=
> > JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/cgroup
> > #AccountingStorageType=accounting_storage/slurmdbd
> > #AccountingStorageHost=
> > #AccountingStorageLoc=
> > #AccountingStoragePass=
> > #AccountingStorageUser=
> >
> > # Activate the Multi-factor Job Priority Plugin with decay
> > PriorityType=priority/multifactor
> >
> > # apply decay of 2 weeks
> > #PriorityDecayHalfLife=14-0
> > # for slurm-bank 1.2
> > #PriorityDecayHalfLife=0
> > # for slurm-bank 1.3
> > PriorityDecayHalfLife=14-0
> >
> > # reset usage after 28 days
> > #PriorityUsageResetPeriod=MONTHLY
> > PriorityUsageResetPeriod=NONE
> >
> > # The larger the job, the greater its job size priority.
> > #PriorityFavorSmall=YES
> >
> > # The job's age factor reaches 1.0 after waiting in the
> > # queue for 2 weeks.
> > PriorityMaxAge=14-0
> >
> > # re-calc priority
> > PriorityCalcPeriod=00:01:00
> >
> > # This next group determines the weighting of each of the
> > # components of the Multi-factor Job Priority Plugin.
> > # The default value for each of the following is 1.
> > PriorityWeightAge=10000000
> > PriorityWeightFairshare=10000000
> > PriorityWeightJobSize=10000000
> > PriorityWeightPartition=10000000
> > PriorityWeightQOS=0    # don't use the qos factor
> >
> > # describe the node's memory (only one of the two following options is allowed)
> > #DefMemPerCPU=1900
> > DefMemPerNode=23000
> >
> > MaxMemPerNode=24000
> >
> > # turn on the topology/tree plugin
> > TopologyPlugin=topology/tree
> >
> > # COMPUTE NODES
> > #NodeName=DEFAULT State=UNKNOWN Feature=debug Sockets=2 CoresPerSocket=6 ThreadsPerCore=1
> > NodeName=kelvin-n[001-067,069-094,096-100] RealMemory=24020 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> > #NodeName=kelvin-n[001-067,069,070,072-094,096-100] RealMemory=24020 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> > #NodeName=kelvin-n[001-070,072-100] RealMemory=24020 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> > #NodeName=kelvin-n099 RealMemory=19980 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> >
> > #PartitionName=compute Nodes=kelvin-n[001-096] Default=NO MaxTime=72:00:00 State=UP
> > #PartitionName=debug Nodes=kelvin-n[097-100] Default=YES MaxTime=3:00:00 State=UP
> > #PartitionName=compute Nodes=kelvin-n[001-067,069,070,072-094,096-098] Default=NO DefaultTime=01:00:00 MaxTime=72:00:00 State=UP Shared=Exclusive
> > PartitionName=compute Nodes=kelvin-n[001-067,069-094,096-098] Default=NO DefaultTime=01:00:00 MaxTime=72:00:00 State=UP Shared=Exclusive
> > PartitionName=debug Nodes=kelvin-n[099-100] Default=YES DefaultTime=00:30:00 MaxTime=00:30:00 State=UP Shared=Exclusive
> >
> > # Rack C-04[42]
> > SwitchName=kelvinibsw03 Nodes=kelvin-n[025-048]
> >
> > # Rack C-02[42]
> > SwitchName=kelvinibsw04 Nodes=kelvin-n[073-094,096-100]
> >
> > # Rack C-02[17]
> > SwitchName=kelvinibsw05 Nodes=kelvin-n[049-067,069-072]
> >
> > # Rack C-04[17]
> > SwitchName=kelvinibsw06 Nodes=kelvin-n[001-024]
> >
> > # Rack C-03[6] (top-level switch)
> > SwitchName=kelvinibsw01 Switches=kelvinibsw03,kelvinibsw04,kelvinibsw05,kelvinibsw06
> > # (and kelvin01,io03,io04,io06)
> >
> > # Rack C-03[7] (top-level switch)
> > SwitchName=kelvinibsw02 Switches=kelvinibsw03,kelvinibsw04,kelvinibsw05,kelvinibsw06
> > # (and io01,io02,io05)
>
> --
> Paddy Doyle
> Trinity Centre for High Performance Computing,
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> Phone: +353-1-896-3725
> http://www.tchpc.tcd.ie/

--
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
