On 31/03/15 07:31, Jeff Layton wrote:
Good afternoon!
Hiya Jeff,
[...]
But it doesn't seem to run. Here is the output of sinfo
and squeue:
[...]
Actually it does appear to get started (at least), but..
[ec2-user@ip-10-0-1-72 ec2-user]$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    2     debug slurmtes ec2-user CG  0:00     1 ip-10-0-2-101
...the CG state you see there is the completing state, i.e. the state
when a job is finishing up.
The system logs on the master node (controller node) don't show too
much:
Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]: _slurm_rpc_submit_batch_job JobId=2 usec=239
Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2 NodeList=ip-10-0-2-101 #CPUs=1
OK, node allocated.
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Job finishes.
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue JobID=2 State=0x8000 NodeCnt=1 per user/system request
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x8000 NodeCnt=1 done
Not sure of the implication of that "requeue" there, unless it's the
transition to the CG state?
Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-[101-102] not responding
Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102 not responding, setting DOWN
Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101 not responding, setting DOWN
Now the nodes stop responding (not before).
From these logs, it looks like the compute nodes are not
responding to the control node (master node).
Not sure how to debug this - any tips?
I would suggest looking at the slurmd logs on the compute nodes to see
if they report any problems, and checking what state the slurmd
processes are in - especially whether they're stuck in a 'D' state
waiting on some form of device I/O.
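For example, something along these lines run on a compute node (the log
path is an assumption - check SlurmdLogFile in your slurm.conf if it
differs):

```shell
# Assumed default log location; adjust to match SlurmdLogFile in slurm.conf
LOG=/var/log/slurm/slurmd.log
if [ -f "$LOG" ]; then tail -n 50 "$LOG"; else echo "slurmd log not found at $LOG"; fi

# List any processes stuck in uninterruptible sleep ('D' state),
# which usually means they're blocked on device or network I/O
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```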
I know some people have reported strange interactions when Slurm is on
an NFSv4 mount (NFSv3 is fine).
Good luck!
Chris