Hi all again. The initial problems are fixed, but there are more - thanks in advance for reading this!
This is Ubuntu 14.04 with Slurm 2.6.5-1. The nodes and the controller seem to be talking to each other: scontrol ping returns

    Slurmctld(primary/backup) at slurm-master/(NULL) are UP/DOWN

and I get the exact same message from each slave. (I haven't configured a backup controller, which I assume explains the (NULL)/DOWN half.)

On one of the slaves, node0, I start the daemon with

    sudo slurmd -Dvvv

which returns

    slurmd: topology NONE plugin loaded
    slurmd: CPU frequency setting not configured for this node
    slurmd: task NONE plugin loaded
    slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
    slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
    slurmd: Munge cryptographic signature plugin loaded
    slurmd: Warning: Core limit is only 0 KB
    slurmd: slurmd version 2.6.5 started
    slurmd: Job accounting gather NOT_INVOKED plugin loaded
    slurmd: switch NONE plugin loaded
    slurmd: slurmd started on Wed, 31 Aug 2016 02:48:03 +0000
    slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=64431 TmpDisk=10042 Uptime=77963
    slurmd: AcctGatherEnergy NONE plugin loaded
    slurmd: AcctGatherProfile NONE plugin loaded
    slurmd: AcctGatherInfiniband NONE plugin loaded
    slurmd: AcctGatherFilesystem NONE plugin loaded
    slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)

All of which seems fine? sinfo on any machine returns

    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    standard*    up   infinite      1  down*  node2
    standard*    up   infinite      2  idle   node[0-1]

node2 isn't connected yet, so I'm not concerned about that.
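For completeness, the relevant parts of my slurm.conf are along these lines (retyped as a sketch rather than pasted verbatim, so treat the exact values as approximate; the node line matches what slurmd detects above, and the ports are the stock defaults):

    # /etc/slurm-llnl/slurm.conf (sketch, not a verbatim copy)
    ControlMachine=slurm-master
    AuthType=auth/munge
    SlurmctldPort=6817     # default controller port
    SlurmdPort=6818        # default slurmd port
    NodeName=node[0-2] CPUs=16 RealMemory=64431 State=UNKNOWN
    PartitionName=standard Nodes=node[0-2] Default=YES MaxTime=INFINITE State=UP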
When I run a job,

    srun -N1 /bin/hostname

the master's debug output says

    slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1000
    slurmctld: debug: job_submit_plugin_submit: usec=1
    slurmctld: debug2: found 3 usable nodes from config containing node[0-2]
    slurmctld: debug2: sched: JobId=7 allocated resources: NodeList=node0
    slurmctld: sched: _slurm_rpc_allocate_resources JobId=7 NodeList=node0 usec=3578
    slurmctld: debug2: _slurm_rpc_job_ready(7)=3 usec=14
    slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1000
    slurmctld: debug: Configuration for job 7 complete
    slurmctld: debug: laying out the 1 tasks on 1 hosts node0 dist 1
    slurmctld: sched: _slurm_rpc_job_step_create: StepId=7.0 node0 usec=6717

and node0's says

    slurmd: debug2: got this type of message 1008
    slurmd: debug2: got this type of message 6001
    slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
    slurmd: debug: task_slurmd_launch_request: 7.0 0
    slurmd: launch task 7.0 request from 1000.1000@144.6.230.71 (port 12449)
    slurmd: debug: Checking credential with 256 bytes of sig data
    slurmd: debug: Calling /usr/sbin/slurmstepd spank prolog
    spank-prolog: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
    spank-prolog: Running spank/prolog for jobid [7] uid [1000]
    spank-prolog: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
    slurmd: debug: task_slurmd_reserve_resources: 7 0

The sinfo changes to allocated:

    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    standard*    up   infinite      1  down*  node2
    standard*    up   infinite      1  alloc  node0
    standard*    up   infinite      1  idle   node1

And scontrol show job 7 returns:

    JobId=7 Name=hostname
    UserId=ubuntu(1000) GroupId=ubuntu(1000)
    Priority=4294901754 Account=(null) QOS=(null)
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
    RunTime=00:01:29 TimeLimit=UNLIMITED TimeMin=N/A
    SubmitTime=2016-08-31T02:51:06 EligibleTime=2016-08-31T02:51:06
    StartTime=2016-08-31T02:51:06 EndTime=Unknown
    PreemptTime=None SuspendTime=None SecsPreSuspend=0
    Partition=standard AllocNode:Sid=slurm-master:3410
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=node0
    BatchHost=node0
    NumNodes=1 NumCPUs=16 CPUs/Task=1 ReqS:C:T=*:*:*
    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
    Features=(null) Gres=(null) Reservation=(null)
    Shared=0 Contiguous=0 Licenses=(null) Network=(null)
    Command=/bin/hostname
    WorkDir=/home/ubuntu

But the job doesn't actually run - the hostname never comes back, and the job just sits in RUNNING (note the RunTime of 00:01:29 on a job that should finish instantly). Any help is appreciated greatly.

James Venning
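P.S. In case munge is a suspect: I assume it's OK, since the credential check in the node0 log above passes, but for reference the cross-node sanity test I'd use is something like this (node0 standing in for any slave):

    # on the master: encode a credential and decode it locally
    munge -n | unmunge
    # encode on the master, decode on a compute node (keys and clocks must agree)
    munge -n | ssh node0 unmunge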