Hi everyone,

Slurm newbie here with newbie problems.

I just migrated our small cluster (~300 cores) from Torque/Maui to
Slurm and it is already working much better for us. Well, for me at
least. The problem I am having is that I can run any job successfully
as SlurmUser, but jobs submitted by other users fail and get requeued.
Here is the output of scontrol show job for one such job:

JobId=50 JobName=TestJob
   UserId=kyle(1008) GroupId=student(1003)
   Priority=0 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2015-12-13T21:36:20 EligibleTime=2015-12-13T21:38:21
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=big AllocNode:Sid=clusty:13290
   ReqNodeList=node1 ExcNodeList=(null)
   NodeList=(null)
   BatchHost=node1
   NumNodes=1 NumCPUs=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/users/kyle/bpl116/testmpi.slurm
   WorkDir=/home/users/kyle/bpl116
   StdErr=/home/users/kyle/bpl116/slurm-50.out
   StdIn=/dev/null
   StdOut=/home/users/kyle/bpl116/slurm-50.out
   Power= SICP=0

The exact same job works like a charm if I run it as SlurmUser. I tried
googling for launch_failed_requeued_held but found very little to go
on. Any help would be appreciated. I am running Slurm 15.08.4 on
Ubuntu 14.04.

Thanks,
Andrej
