My tests of this show _try_sched() completing in a few milliseconds, and I don't see how the existence of a constraint would measurably impact performance. About the only thing that could take much time is considering job preemption.
What version of SLURM are you using? What is your configuration? Do you have job preemption configured and if so, how? How many active and queued jobs are there? You can respond off of the mailing list if desired.

________________________________________
From: [email protected] [[email protected]] On Behalf Of Bjørn-Helge Mevik [[email protected]]
Sent: Tuesday, February 15, 2011 7:15 AM
To: [email protected]
Subject: [slurm-dev] Slow backfill testing of some jobs.

We have discovered that some jobs take a very long time to try and backfill. More precisely, each call to _try_sched can take 4-5 seconds.

While investigating this to try and find out why, we discovered that there appears to be a difference between jobs specifying --constraint=something and jobs specifying --constraint=something*1. The latter can be over twice as fast.

An example:

# scontrol show job 5757574
JobId=5757574 Name=codwsol
   UserId=ash022(51801) GroupId=users(100)
   Priority=10277 Account=nn4684k QOS=notur
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=-13--23:-37:-58 TimeLimit=10-00:00:00 TimeMin=N/A
   SubmitTime=2011-02-15T10:40:57 EligibleTime=2011-02-15T10:40:57
   StartTime=2011-03-01T14:58:48 EndTime=Unknown
   SuspendTime=None SecsPreSuspend=0
   Partition=hugemem AllocNode:Sid=login-0-0:11737
   ReqNodeList=compute-0-0 ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=120000M MinTmpDiskNode=1000G
   Features=hugemem Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/projects/codgenome/solexa/asm/wgs61codsol.slurm
   WorkDir=/projects/codgenome/solexa/asm

It asks for one specific node, compute-0-0, and the feature hugemem. It takes 5 seconds in _try_sched(). From the slurmctld.log:

[2011-02-15T15:20:17] backfill test for job 5757574
[2011-02-15T15:20:17] debug: backfill: entering _try_sched for job 5757574.
[2011-02-15T15:20:22] debug: backfill: finished _try_sched for job 5757574.
[2011-02-15T15:21:22] backfill test for job 5757574
[2011-02-15T15:21:22] debug: backfill: entering _try_sched for job 5757574.
[2011-02-15T15:21:27] debug: backfill: finished _try_sched for job 5757574.

(We've added a debug output right before and after the call to _try_sched().)

Then we changed the feature request to hugemem*1:

# scontrol update jobid=5757574 features='hugemem*1'
# scontrol show job 5757574
JobId=5757574 Name=codwsol
   UserId=ash022(51801) GroupId=users(100)
   Priority=10277 Account=nn4684k QOS=notur
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=-13--23:-37:-19 TimeLimit=10-00:00:00 TimeMin=N/A
   SubmitTime=2011-02-15T10:40:57 EligibleTime=2011-02-15T10:40:57
   StartTime=2011-03-01T14:58:48 EndTime=Unknown
   SuspendTime=None SecsPreSuspend=0
   Partition=hugemem AllocNode:Sid=login-0-0:11737
   ReqNodeList=compute-0-0 ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=120000M MinTmpDiskNode=1000G
   Features=hugemem*1 Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/projects/codgenome/solexa/asm/wgs61codsol.slurm
   WorkDir=/projects/codgenome/solexa/asm

It now takes 2 seconds in _try_sched():

[2011-02-15T15:21:57] backfill test for job 5757574
[2011-02-15T15:21:57] debug: backfill: entering _try_sched for job 5757574.
[2011-02-15T15:21:59] debug: backfill: finished _try_sched for job 5757574.
[2011-02-15T15:22:33] backfill test for job 5757574
[2011-02-15T15:22:33] debug: backfill: entering _try_sched for job 5757574.
[2011-02-15T15:22:35] debug: backfill: finished _try_sched for job 5757574.

Even then, though, 2 seconds seems quite long for finding out when/if a job can run when it asks for one specific node.

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
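
For anyone wanting to compare timings across more jobs, the "entering"/"finished" lines quoted above could be paired up with a small script along these lines. This is only a sketch: it assumes the log contains the custom debug lines in exactly the format shown in this message (they come from a local patch, not stock slurmctld output), and the function name try_sched_durations is made up for illustration. Resolution is limited to whole seconds because that is all the log timestamps carry.

```python
import re
from datetime import datetime

# Matches the locally added debug lines quoted in this thread.
# Assumption: this format is specific to the poster's patched slurmctld.
LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\] debug: backfill: "
    r"(?P<event>entering|finished) _try_sched for job (?P<job>\d+)"
)


def try_sched_durations(lines):
    """Return (job_id, seconds) for each entering/finished pair, in log order."""
    started = {}    # job id -> timestamp of the most recent "entering" line
    durations = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # unrelated log line
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S")
        job = int(m.group("job"))
        if m.group("event") == "entering":
            started[job] = ts
        elif job in started:
            elapsed = ts - started.pop(job)
            durations.append((job, int(elapsed.total_seconds())))
    return durations


# Two of the entering/finished pairs from the log excerpts above.
sample = [
    "[2011-02-15T15:20:17] backfill test for job 5757574",
    "[2011-02-15T15:20:17] debug: backfill: entering _try_sched for job 5757574.",
    "[2011-02-15T15:20:22] debug: backfill: finished _try_sched for job 5757574.",
    "[2011-02-15T15:21:57] backfill test for job 5757574",
    "[2011-02-15T15:21:57] debug: backfill: entering _try_sched for job 5757574.",
    "[2011-02-15T15:21:59] debug: backfill: finished _try_sched for job 5757574.",
]

for job, seconds in try_sched_durations(sample):
    print(f"job {job}: {seconds} s in _try_sched")
```

To run it against a real log, pass the open file object instead of the sample list, e.g. try_sched_durations(open("slurmctld.log")); the path is site-specific.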
