Hi everyone, Slurm newbie here with newbie problems.
I just migrated our small cluster (~300 cores) from Torque/Maui to Slurm and it is already working much better for us. Well, for me at least. The problem I am having is that I can run any job successfully as SlurmUser, but jobs submitted by other users fail and get requeued. Here is the output of scontrol show job:

JobId=50 JobName=TestJob
   UserId=kyle(1008) GroupId=student(1003)
   Priority=0 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2015-12-13T21:36:20 EligibleTime=2015-12-13T21:38:21
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=big AllocNode:Sid=clusty:13290
   ReqNodeList=node1 ExcNodeList=(null)
   NodeList=(null)
   BatchHost=node1
   NumNodes=1 NumCPUs=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/users/kyle/bpl116/testmpi.slurm
   WorkDir=/home/users/kyle/bpl116
   StdErr=/home/users/kyle/bpl116/slurm-50.out
   StdIn=/dev/null
   StdOut=/home/users/kyle/bpl116/slurm-50.out
   Power= SICP=0

The exact same job works like a charm if I run it as SlurmUser. I tried googling for launch_failed_requeued_held, but there is really not much to go on.

Any help would be appreciated. I am running Slurm 15.08.4 on Ubuntu 14.04.

Thanks,
Andrej