I have a job that Slurm says does not have the resources needed to execute:

scontrol show jobid 49

JobId=49 Name=20141001_17h23_C4_ArXe13nm_dt002.0000as_nbions00309_nbions00252_SyesIyesEyesTnoRyes_potSymmetric_b01.20_I01.5000e+14_WL13.7nm_FW10.0fs_Lrelaxed_A0.0_ALmd_oclAmd_deepthought
   UserId=zhart(1001) GroupId=zhart(1001)
   Priority=4294901712 Account=(null) QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2014-10-01T17:23:40 EligibleTime=2014-10-01T17:23:40
   StartTime=2015-09-30T08:10:54 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mem AllocNode:Sid=deepthought:658
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=16 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/raid/zhart/code/md/output/20141001_17h23_C4_ArXe13nm_dt002.0000as_nbions00309_nbions00252_SyesIyesEyesTnoRyes_potSymmetric_b01.20_I01.5000e+14_WL13.7nm_FW10.0fs_Lrelaxed_A0.0_ALmd_oclAmd_deepthought/slurm_20141001_17h23.sh
   WorkDir=/raid/zhart/code/md/output/20141001_17h23_C4_ArXe13nm_dt002.0000as_nbions00309_nbions00252_SyesIyesEyesTnoRyes_potSymmetric_b01.20_I01.5000e+14_WL13.7nm_FW10.0fs_Lrelaxed_A0.0_ALmd_oclAmd_deepthought

However, I have four nodes with this configuration:

NodeName=node8 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=16 CPUErr=0 CPUTot=32 CPULoad=15.98 Features=(null)
   Gres=gpu:4
   NodeAddr=node8 NodeHostName=node8 OS=Linux
   RealMemory=257951 AllocMem=0 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-09-29T08:54:37 SlurmdStartTime=2014-09-29T08:55:08
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

This shows only 16 of 32 CPUs allocated, so job 49 should have the resources it needs to run. As you can see, the node is in the MIXED state because it is already running a gres=4, cpu=16 job.
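In case it helps to reproduce, the same information can be pulled with squeue and sinfo (using job 49 and node8 from above; the format strings are just ones I picked):

    # %T = job state, %R = the scheduler's reason the job is pending
    squeue -j 49 -o "%.8i %.10T %.30R"

    # node-oriented view; %C prints CPUs as allocated/idle/other/total
    sinfo -N -n node8 -o "%.10N %.14C %.10G %.10T"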
My slurm.conf is:

ControlMachine=deepthought
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#MaxJobId=999999
GresTypes=gpu
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=2
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurm/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp/slurm
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
#SelectTypeParameters=
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=deepthought
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
# COMPUTE NODES
NodeName=deepthought RealMemory=16048 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:2
NodeName=node2 RealMemory=15500 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:3
NodeName=node3 RealMemory=15500 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:3
NodeName=node4 RealMemory=15500 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:2
NodeName=node5 RealMemory=15500 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:2
NodeName=node6 RealMemory=128640 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:4
NodeName=node7 RealMemory=128927 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:4
NodeName=node8 RealMemory=257951 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:4
NodeName=node9 RealMemory=64415 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:4
PartitionName=cpu Nodes=node[2-5],deepthought Default=YES MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=node[2-5],deepthought MaxTime=INFINITE State=UP
PartitionName=gpx Nodes=node[6-9] MaxTime=INFINITE State=UP
PartitionName=mem Nodes=node[6-9] MaxTime=INFINITE State=UP

Any ideas why it won't use the remainder of the node(s)? Thanks in advance!

-- Eddie
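PS: One thing I notice on re-reading the conf is that SelectTypeParameters is commented out even though SelectType=select/cons_res is set, and the job shows Shared=0. I don't know whether that's the actual problem, but if it is, I'd guess the change looks something like this (untested guess on my part; CR_Core is just one of the documented consumable-resource options):

    # slurm.conf -- guess, not verified: tell cons_res which resource
    # is consumable so multiple jobs can share a node's cores
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core

followed by restarting the daemons (I believe changing the select plugin needs a full restart rather than just scontrol reconfigure). The live value can be checked with:

    scontrol show config | grep -i select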
