Doh! Sorry, please ignore: we have a reservation in place for a user starting today, so those nodes are being left idle because backfill can't start the longer jobs before the reservation begins.
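For anyone who finds this thread later: the quickest way to rule a reservation in or out is to list the reservations and compare them against the nodes that sinfo reports as idle. A minimal sketch, using only standard Slurm commands (nothing cluster-specific):

    # list all current and upcoming reservations and the nodes they hold
    scontrol show reservation

    # reservation-oriented summary (name, state, start/end times, node list)
    sinfo --reservation

If the idle nodes fall inside an active or upcoming reservation, backfill will (correctly) refuse to start anything on them that can't finish before the reservation starts.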
Paddy

On Fri, May 20, 2016 at 09:12:29AM +0100, Paddy Doyle wrote:
> Forgot to mention, it's slurm version slurm-15.08.7
>
> On Fri, May 20, 2016 at 09:03:06AM +0100, Paddy Doyle wrote:
> >
> > Hi all,
> >
> > We're seeing a really strange scheduling issue on one of our clusters, whereby
> > jobs are not being scheduled, even though there are many idle nodes.
> >
> > In fact there were 19 idle nodes, with the first priority job only needing 1;
> > the next few needed 6 nodes etc.
> >
> > Turning up Debug logging showed lots of "best_fit topology failure: no switch
> > currently has sufficient resource to satisfy the request" messages in the logs,
> > but even so the 'max_switch_wait' (which we haven't set and so should default
> > to 300 seconds) doesn't seem to be honoured.
> >
> > This morning there are 25 idle nodes, with the top priority job needing 6.
> >
> > I'll copy in the slurm.conf, topology.conf, and some relevant logs and queue
> > snapshots.
> >
> > Any help would be appreciated.
> >
> > Thanks,
> > Paddy
> >
> > #############################################################
> > # the cluster and queue state yesterday evening:
> > #############################################################
> >
> > root@kelvin01:/etc/slurm # sinfo
> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> > compute      up 3-00:00:00      1 drain* kelvin-n027
> > compute      up 3-00:00:00     76  alloc kelvin-n[001-026,028-045,049-054,072-094,096-098]
> > compute      up 3-00:00:00     19   idle kelvin-n[046-048,055-067,069-071]
> > debug*       up      30:00      2   idle kelvin-n[099-100]
> >
> > root@kelvin01:/etc/slurm # squeue --start | head
> >  JOBID PARTITION     NAME    USER ST          START_TIME NODES SCHEDNODES  NODELIST(REASON)
> >  86801   compute debo_ben aaaaaaa PD 2016-05-20T12:20:35     1 kelvin-n039 (Priority)
> >  86677   compute GdDC_25_  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Resources)
> >  86678   compute GdDC_25_  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Priority)
> >  86679   compute        B ccccccc PD 2016-05-20T21:24:29     8 (null)      (Priority)
> >  86680   compute       BA ccccccc PD 2016-05-20T21:24:29     8 (null)      (Priority)
> >  86682   compute GdDC_5_9  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Priority)
> >  86683   compute GdDC_5_9  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Priority)
> >  86684   compute GdDC_5_9  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Priority)
> >  86685   compute GdDC_10_  bbbbbb PD 2016-05-20T21:24:29     6 (null)      (Priority)
> >
> > root@kelvin01:/etc/slurm # squeue -tr
> >  JOBID PARTITION     NAME    USER ST       TIME NODES NODELIST(REASON)
> >  86699   compute antiferr aaaaaaa  R 2-05:41:33     2 kelvin-n[019,021]
> >  86700   compute   ferro3 aaaaaaa  R 2-05:40:03     2 kelvin-n[030-031]
> >  86745   compute lco_int_ ddddddd  R   19:41:24     6 kelvin-n[040,075,084,096-098]
> >  86681   compute lco_int_ ddddddd  R    9:33:37     6 kelvin-n[013,017-018,022-024]
> >  86729   compute lco_int_ ddddddd  R    9:21:36     6 kelvin-n[049-054]
> >  86765   compute debo_ace aaaaaaa  R    9:04:06     1 kelvin-n072
> >  86766   compute   acetic aaaaaaa  R    6:40:31     1 kelvin-n038
> >  86793   compute lco_int_ ddddddd  R    6:37:31     6 kelvin-n[039,041-045]
> >  86662   compute        B eeeeeee  R    4:37:31    16 kelvin-n[032-035,077,079,085-094]
> >  86810   compute    C_opt fffffff  R    3:47:40     6 kelvin-n[007-012]
> >  86808   compute lco_int_ ddddddd  R    3:01:55     6 kelvin-n[016,020,036-037,082-083]
> >  86674   compute GdDC_20_  bbbbbb  R    2:52:57     6 kelvin-n[073-074,076,078,080-081]
> >  86675   compute GdDC_20_  bbbbbb  R    2:45:57     6 kelvin-n[001-006]
> >  86676   compute GdDC_25_  bbbbbb  R    1:09:55     6 kelvin-n[014-015,025-026,028-029]
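(Side note on the 'max_switch_wait' mentioned above: since we never set it, the simplest confirmation of what the running slurmctld is actually using is to dump the live config. Only a sketch; the grep is just for convenience:)

    # show any explicitly-set scheduler parameters; max_switch_wait would appear here
    scontrol show config | grep -i SchedulerParameters

    # setting it explicitly would be a slurm.conf line like the following
    # (value in seconds; 300 is the documented default):
    #   SchedulerParameters=max_switch_wait=300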
> >
> > #############################################################
> > # snips from logs at that time:
> > #############################################################
> >
> > [2016-05-19T18:58:24.658] backfill: beginning
> > [2016-05-19T18:58:24.658] debug: backfill: 45 jobs to backfill
> > [2016-05-19T18:58:24.658] backfill test for JobID=86677 Prio=12823347 Partition=compute
> > [2016-05-19T18:58:24.658] debug: job 86677: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] debug: job 86677: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] debug: job 86677: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] Job 86677 to start at 2016-05-20T21:24:29, end at 2016-05-22T21:24:00 on kelvin-n[013,017-019,021-022]
> > [2016-05-19T18:58:24.658] backfill test for JobID=86678 Prio=12823322 Partition=compute
> > [2016-05-19T18:58:24.658] debug: job 86678: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] debug: job 86678: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] debug: job 86678: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.658] Job 86678 to start at 2016-05-20T21:24:29, end at 2016-05-22T21:24:00 on kelvin-n[013,017-019,021-022]
> > [2016-05-19T18:58:24.658] backfill test for JobID=86679 Prio=12662420 Partition=compute
> > [2016-05-19T18:58:24.659] debug: job 86679: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86679: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86679: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] Job 86679 to start at 2016-05-20T21:24:29, end at 2016-05-23T21:24:00 on kelvin-n[013,017-019,021-024]
> > [2016-05-19T18:58:24.659] backfill test for JobID=86680 Prio=12662172 Partition=compute
> > [2016-05-19T18:58:24.659] debug: job 86680: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86680: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86680: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] Job 86680 to start at 2016-05-20T21:24:29, end at 2016-05-23T21:24:00 on kelvin-n[013,017-019,021-024]
> > [2016-05-19T18:58:24.659] backfill test for JobID=86731 Prio=12564061 Partition=compute
> > [2016-05-19T18:58:24.659] Job 86731 to start at 2016-05-21T16:05:09, end at 2016-05-24T16:05:00 on kelvin-n[013,017-019,021-024,030-031,073-074,076,078,080-081]
> > [2016-05-19T18:58:24.659] backfill test for JobID=86682 Prio=12286095 Partition=compute
> > [2016-05-19T18:58:24.659] debug: job 86682: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86682: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] debug: job 86682: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
> > [2016-05-19T18:58:24.659] Job 86682 to start at 2016-05-20T21:24:29, end at 2016-05-22T21:24:00 on kelvin-n[013,017-019,021-022]
> > etc
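(The 'best_fit topology failure' lines above came from turning up the debug level. For similar detail without restarting slurmctld, something like the following should work; picking the Backfill debug flag as the most useful one here is my assumption rather than anything from the docs about this particular message:)

    # temporarily raise slurmctld logging and enable backfill-specific debugging
    scontrol setdebug debug
    scontrol setdebugflags +backfill

    # ...watch /var/log/slurm.log while a backfill cycle runs, then revert
    scontrol setdebugflags -backfill
    scontrol setdebug info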
> >
> > #############################################################
> > # it's even worse this morning: 25 idle nodes!
> > #############################################################
> >
> > root@kelvin01:/etc/slurm # sinfo
> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> > compute      up 3-00:00:00      1 drain* kelvin-n027
> > compute      up 3-00:00:00     70  alloc kelvin-n[001-026,028-038,040,049-054,072-094,096-098]
> > compute      up 3-00:00:00     25   idle kelvin-n[039,041-048,055-067,069-071]
> > debug*       up      30:00      2   idle kelvin-n[099-100]
> >
> > root@kelvin01:/etc/slurm # squeue --start | head
> >  JOBID PARTITION     NAME    USER ST          START_TIME NODES SCHEDNODES           NODELIST(REASON)
> >  86678   compute GdDC_25_  lucida PD 2016-05-20T21:24:00     6 kelvin-n[039,041-045 (Priority)
> >  86679   compute        B shuklag PD 2016-05-20T21:24:00     8 kelvin-n[055-062]    (Priority)
> >  86680   compute       BA shuklag PD 2016-05-20T21:24:00     8 kelvin-n[063-067,069 (Priority)
> >  86682   compute GdDC_5_9  lucida PD 2016-05-20T21:24:00     6 kelvin-n[049-054]    (Priority)
> >  86683   compute GdDC_5_9  lucida PD 2016-05-20T21:24:29     6 kelvin-n[023,030-031 (Priority)
> >  86801   compute debo_ben tandons PD 2016-05-20T21:24:29     1 kelvin-n024          (Priority)
> >  86818   compute    C_opt watsong PD 2016-05-20T21:24:29     6 kelvin-n[013,017-019 (Resources)
> >  86684   compute GdDC_5_9  lucida PD 2016-05-21T16:05:09     6 (null)               (Priority)
> >  86685   compute GdDC_10_  lucida PD 2016-05-21T16:05:09     6 (null)               (Priority)
> >
> > root@kelvin01:/etc/slurm # squeue -tr -l
> > Fri May 20 08:59:35 2016
> >  JOBID PARTITION     NAME    USER   STATE       TIME  TIME_LIMI NODES NODELIST(REASON)
> >  86699   compute antiferr tandons RUNNING 2-19:43:02 3-00:00:00     2 kelvin-n[019,021]
> >  86700   compute   ferro3 tandons RUNNING 2-19:41:32 3-00:00:00     2 kelvin-n[030-031]
> >  86745   compute lco_int_ gavinai RUNNING 1-09:42:53 3-00:00:00     6 kelvin-n[040,075,084,096-098]
> >  86681   compute lco_int_ gavinai RUNNING   23:35:06 1-12:00:00     6 kelvin-n[013,017-018,022-024]
> >  86729   compute lco_int_ gavinai RUNNING   23:23:05 1-00:00:00     6 kelvin-n[049-054]
> >  86765   compute debo_ace tandons RUNNING   23:05:35 3-00:00:00     1 kelvin-n072
> >  86766   compute   acetic tandons RUNNING   20:42:00 3-00:00:00     1 kelvin-n038
> >  86662   compute        B montese RUNNING   18:39:00 3-00:00:00    16 kelvin-n[032-035,077,079,085-094]
> >  86808   compute lco_int_ gavinai RUNNING   17:03:24 3-00:00:00     6 kelvin-n[016,020,036-037,082-083]
> >  86674   compute GdDC_20_  lucida RUNNING   16:54:26 2-00:00:00     6 kelvin-n[073-074,076,078,080-081]
> >  86675   compute GdDC_20_  lucida RUNNING   16:47:26 2-00:00:00     6 kelvin-n[001-006]
> >  86676   compute GdDC_25_  lucida RUNNING   15:11:24 2-00:00:00     6 kelvin-n[014-015,025-026,028-029]
> >  86677   compute GdDC_25_  lucida RUNNING   13:16:34 2-00:00:00     6 kelvin-n[007-012]
> >
> > --
> > Paddy Doyle
> > Trinity Centre for High Performance Computing,
> > Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> > Phone: +353-1-896-3725
> > http://www.tchpc.tcd.ie/
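(One more quick sanity check, cross-referencing the idle list above with the topology.conf quoted below: count how many idle nodes sit under each leaf switch. A rough sketch only; the nodelists are copied by hand from topology.conf, the loop assumes the compute partition, and a switch with no idle nodes simply prints an empty count:)

    # count idle nodes per leaf switch (nodelists copied from topology.conf below)
    for sw_nodes in "kelvin-n[025-048]" "kelvin-n[073-094,096-100]" \
                    "kelvin-n[049-067,069-072]" "kelvin-n[001-024]"; do
        printf '%-32s idle: ' "$sw_nodes"
        sinfo --noheader --partition=compute --states=idle \
              --nodes="$sw_nodes" --format="%D"
    done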
> > #
> > # Example slurm.conf file.  Please run configurator.html
> > # (in doc/html) to build a configuration file customized
> > # for your environment.
> > #
> > #
> > # slurm.conf file generated by configurator.html.
> > #
> > # See the slurm.conf man page for more information.
> > #
> > ClusterName=kelvin
> > ControlMachine=kelvin01
> > ControlAddr=192.168.19.254
> > BackupController=kelvin-n001
> > BackupAddr=192.168.16.1
> > #
> > SlurmUser=root
> > #SlurmdUser=root
> > SlurmctldPort=6817
> > SlurmdPort=6818
> > AuthType=auth/munge
> > EnforcePartLimits=YES
> > JobRequeue=1
> > #JobCredentialPrivateKey=
> > #JobCredentialPublicCertificate=
> > StateSaveLocation=/var/slurm_state/kelvin
> > #SlurmdSpoolDir=/tmp/slurmd
> > SwitchType=switch/none
> > MpiDefault=none
> > SlurmctldPidFile=/var/run/slurmctld.pid
> > SlurmdPidFile=/var/run/slurmd.pid
> > #ProctrackType=proctrack/pgid
> > ProctrackType=proctrack/cgroup
> > #PluginDir=
> > CacheGroups=0
> > #FirstJobId=
> > ReturnToService=1
> > #MaxJobCount=
> > #PlugStackConfig=
> > #PropagatePrioProcess=
> > #PropagateResourceLimits=
> > #PropagateResourceLimitsExcept=
> > #PropagateResourceLimits=NONE
> > PropagateResourceLimitsExcept=CPU,RSS,DATA,AS
> > Prolog=/etc/slurm/prolog
> > PrologFlags=Alloc
> > Epilog=/etc/slurm/slurm.epilog.clean
> > EpilogSlurmctld=/etc/slurm/slurm.epilogslurmctld
> > #SrunProlog=
> > #SrunEpilog=
> > #TaskProlog=
> > #TaskEpilog=
> > #TaskPlugin=
> > TaskPlugin=task/cgroup
> > #TrackWCKey=no
> > #TreeWidth=50
> > #TmpFs=
> > #UsePAM=
> > #
> > # TIMERS
> > SlurmctldTimeout=300
> > SlurmdTimeout=300
> > HealthCheckInterval=3600
> > HealthCheckProgram=/etc/slurm/slurm.healthcheck
> > InactiveLimit=0
> > MinJobAge=300
> > KillWait=30
> > Waittime=0
> > RebootProgram=/sbin/reboot
> > #
> > # SCHEDULING
> > SchedulerType=sched/backfill
> > #SchedulerType=sched/wiki
> > SchedulerPort=7321
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_Core_Memory
> > #SchedulerAuth=
> > #SchedulerPort=
> > #SchedulerRootFilter=
> > FastSchedule=0
> > #PriorityType=priority/multifactor
> > #PriorityDecayHalfLife=14-0
> > #PriorityUsageResetPeriod=14-0
> > #PriorityWeightFairshare=100000
> > #PriorityWeightAge=1000
> > #PriorityWeightPartition=10000
> > #PriorityWeightJobSize=1000
> > #PriorityMaxAge=1-0
> > #
> > # LOGGING
> > SlurmctldDebug=3
> > SlurmctldLogFile=/var/log/slurm.log
> > SlurmdDebug=3
> > SlurmdLogFile=/var/log/slurm.log
> > JobCompType=jobcomp/none
> > #JobCompLoc=
> > #
> > # ACCOUNTING
> > #JobAcctGatherType=jobacct_gather/linux
> > #JobAcctGatherFrequency=30
> > #
> > # LOGGING AND ACCOUNTING
> > #AccountingStorageEnforce=0
> > #AccountingStorageEnforce=limits
> > AccountingStorageEnforce=safe    # don't start a job unless there's enough balance
> > AccountingStorageHost=service01
> > #AccountingStorageLoc=
> > #AccountingStoragePass=
> > #AccountingStoragePort=
> > AccountingStorageType=accounting_storage/slurmdbd
> > #AccountingStorageUser=
> > #JobCompHost=
> > #JobCompLoc=
> > #JobCompPass=
> > #JobCompPort=
> > #JobCompUser=
> > JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/cgroup
> > #AccountingStorageType=accounting_storage/slurmdbd
> > #AccountingStorageHost=
> > #AccountingStorageLoc=
> > #AccountingStoragePass=
> > #AccountingStorageUser=
> >
> > # Activate the Multi-factor Job Priority Plugin with decay
> > PriorityType=priority/multifactor
> >
> > # apply decay of 2 weeks
> > #PriorityDecayHalfLife=14-0
> > # for slurm-bank 1.2
> > #PriorityDecayHalfLife=0
> > # for slurm-bank 1.3
> > PriorityDecayHalfLife=14-0
> >
> > # reset usage after 28 days
> > #PriorityUsageResetPeriod=MONTHLY
> > PriorityUsageResetPeriod=NONE
> >
> > # The larger the job, the greater its job size priority.
> > #PriorityFavorSmall=YES
> >
> > # The job's age factor reaches 1.0 after waiting in the
> > # queue for 2 weeks.
> > PriorityMaxAge=14-0
> >
> > # re-calc priority
> > PriorityCalcPeriod=00:01:00
> >
> > # This next group determines the weighting of each of the
> > # components of the Multi-factor Job Priority Plugin.
> > # The default value for each of the following is 1.
> > PriorityWeightAge=10000000
> > PriorityWeightFairshare=10000000
> > PriorityWeightJobSize=10000000
> > PriorityWeightPartition=10000000
> > PriorityWeightQOS=0    # don't use the qos factor
> >
> > # describe the node's memory (only one of the two following options is allowed)
> > #DefMemPerCPU=1900
> > DefMemPerNode=23000
> >
> > MaxMemPerNode=24000
> >
> > # turn on the topology/tree plugin
> > TopologyPlugin=topology/tree
> >
> > # COMPUTE NODES
> > #NodeName=DEFAULT State=UNKNOWN Feature=debug Sockets=2 CoresPerSocket=6 ThreadsPerCore=1
> > NodeName=kelvin-n[001-067,069-094,096-100] RealMemory=24020 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> > #NodeName=kelvin-n[001-067,069,070,072-094,096-100] RealMemory=24020 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> > #NodeName=kelvin-n[001-070,072-100] RealMemory=24020 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> > #NodeName=kelvin-n099 RealMemory=19980 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> >
> > #PartitionName=compute Nodes=kelvin-n[001-096] Default=NO MaxTime=72:00:00 State=UP
> > #PartitionName=debug Nodes=kelvin-n[097-100] Default=YES MaxTime=3:00:00 State=UP
> > #PartitionName=compute Nodes=kelvin-n[001-067,069,070,072-094,096-098] Default=NO DefaultTime=01:00:00 MaxTime=72:00:00 State=UP Shared=Exclusive
> > PartitionName=compute Nodes=kelvin-n[001-067,069-094,096-098] Default=NO DefaultTime=01:00:00 MaxTime=72:00:00 State=UP Shared=Exclusive
> > PartitionName=debug Nodes=kelvin-n[099-100] Default=YES DefaultTime=00:30:00 MaxTime=00:30:00 State=UP Shared=Exclusive
> >
> > # Rack C-04[42]
> > SwitchName=kelvinibsw03 Nodes=kelvin-n[025-048]
> >
> > # Rack C-02[42]
> > SwitchName=kelvinibsw04 Nodes=kelvin-n[073-094,096-100]
> >
> > # Rack C-02[17]
> > SwitchName=kelvinibsw05 Nodes=kelvin-n[049-067,069-072]
> >
> > # Rack C-04[17]
> > SwitchName=kelvinibsw06 Nodes=kelvin-n[001-024]
> >
> > # Rack C-03[6] (top-level switch)
> > SwitchName=kelvinibsw01 Switches=kelvinibsw03,kelvinibsw04,kelvinibsw05,kelvinibsw06
> > # (and kelvin01,io03,io04,io06)
> >
> > # Rack C-03[7] (top-level switch)
> > SwitchName=kelvinibsw02 Switches=kelvinibsw03,kelvinibsw04,kelvinibsw05,kelvinibsw06
> > # (and io01,io02,io05)
>
> --
> Paddy Doyle
> Trinity Centre for High Performance Computing,
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> Phone: +353-1-896-3725
> http://www.tchpc.tcd.ie/

--
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
