Hi everyone. This is the second time I have noticed this problem. If one user submits several jobs requesting a specific node which is currently unavailable because its resources are in use, then all subsequent jobs stay in the pending state with Reason=Priority, even though other nodes are sitting idle.
Example:

  JOBID  PARTI  NAME                  USER     ST  TIME        PRIOR  NODELIST  COMMENT
  70148  batch  1G_6_test             dtaliun  PD  0:00        364              (null)
  70128  batch  1G_8                  dtaliun  PD  0:00        365              (null)
  70127  batch  1G_6                  dtaliun  PD  0:00        365              (null)
  70126  batch  1G_5                  dtaliun  PD  0:00        365              (null)
  70125  batch  1G_4                  dtaliun  PD  0:00        365              (null)
  70124  batch  1G_3                  dtaliun  PD  0:00        365              (null)
  70123  batch  1G_2                  dtaliun  PD  0:00        365              (null)
  70122  batch  1G_1                  dtaliun  PD  0:00        365              (null)
  70096  batch  bayesian_ci_G30haplo  dtaliun  PD  0:00        386              (null)
  70095  batch  bayesian_ci_G30haplo  dtaliun  PD  0:00        386              (null)
  69643  batch  zapata_ci_G5haplo     dtaliun  R   9-23:24:40  333    calc06    (null)

Job 69643 is running on calc06. Jobs 70095-70128 have ReqNodeList=calc06 and are all in state PD, Reason=Resources (correct). Job 70148, though, could start on any other node, but it doesn't:

  JobId=70148 Name=1G_6_test
     UserId=dtaliun(1026) GroupId=dtaliun(1026)
     Priority=364 Account=stats QOS=normal
     JobState=PENDING Reason=Priority Dependency=(null)
     Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
     RunTime=00:00:00 TimeLimit=14-00:00:00 TimeMin=N/A
     SubmitTime=2013-02-15T09:19:20 EligibleTime=2013-02-15T09:19:20
     StartTime=2013-02-17T14:31:51 EndTime=Unknown
     PreemptTime=None SuspendTime=None SecsPreSuspend=0
     Partition=batch AllocNode:Sid=calc05:14709
     ReqNodeList=(null) ExcNodeList=(null)
     NodeList=(null)
     NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
     MinCPUsNode=1 MinMemoryCPU=5000M MinTmpDiskNode=0
     Features=(null) Gres=(null) Reservation=(null)
     Shared=OK Contiguous=0 Licenses=(null) Network=(null)
     Command=(null)
     WorkDir=/test

If I raise the priority of job 70148 manually, the job starts, but the logic looks broken. With priority/multifactor in play, a single user can block the whole cluster just by scheduling some jobs that are waiting on a specific resource.

I'm running the builtin scheduler with priority/multifactor:

  SchedulerType = sched/builtin
  SelectType    = select/cons_res
  PriorityType  = priority/multifactor

under SLURM 2.4.4.
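To make what I think is happening concrete, here is a toy Python model of strict priority-order scheduling. It is only a sketch of my guess, not SLURM code: the `Job` class, the `strict_priority_schedule` function and the idle node name `calc01` are all made up for illustration, and the rule "nothing starts behind a higher-priority job that cannot start" is my assumption about the builtin scheduler, not something I have confirmed in the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: int
    priority: int
    req_node: Optional[str] = None  # None means any node is acceptable


def strict_priority_schedule(pending, idle_nodes):
    """Toy model (assumption, not SLURM code): walk jobs in descending
    priority; once one job cannot start, no lower-priority job may start."""
    idle = set(idle_nodes)
    started = {}   # job_id -> node the job was placed on
    reasons = {}   # job_id -> pending reason
    blocked = False
    for job in sorted(pending, key=lambda j: -j.priority):
        candidates = ({job.req_node} & idle) if job.req_node else set(idle)
        if not candidates:
            # The job's own request cannot be satisfied right now.
            reasons[job.job_id] = "Resources"
            blocked = True
        elif blocked:
            # Resources exist, but a higher-priority job is still waiting.
            reasons[job.job_id] = "Priority"
        else:
            node = sorted(candidates)[0]
            idle.remove(node)
            started[job.job_id] = node
    return started, reasons


# Priorities and job IDs from the queue listing; calc06 is busy with 69643,
# and "calc01" is a hypothetical idle node.
pending = [Job(70095, 386, "calc06"), Job(70096, 386, "calc06")]
pending += [Job(i, 365, "calc06") for i in range(70122, 70129)]
pending.append(Job(70148, 364))

started, reasons = strict_priority_schedule(pending, ["calc01"])

# Same queue, but with the priority of 70148 raised by hand (400 is arbitrary).
raised = [j for j in pending if j.job_id != 70148] + [Job(70148, 400)]
started_raised, reasons_raised = strict_priority_schedule(raised, ["calc01"])
```

In this model nothing starts in the first run and 70148 ends up with Reason=Priority, while in the second run 70148 is placed on the idle node, which matches what I see on the cluster.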
I've been looking at the changelog, but it doesn't look like anything has changed for the builtin scheduler in later versions. Can anybody confirm the problem, or does anyone know whether it has been fixed recently? Thanks.
