Do you have all the ports open between all the compute nodes as well? Since Slurm builds a tree to communicate, every node needs to be able to talk to every other node on those ports, and without a huge amount of latency. You might also want to try raising your timeouts.
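
For reference, the timeouts in question live in slurm.conf. A minimal sketch, with purely illustrative values (tune them for your own setup):

    SlurmdTimeout=300    # seconds slurmctld waits for a slurmd response before marking the node DOWN
    MessageTimeout=30    # time allowed for a round-trip RPC between the daemons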

-Paul Edmon-

On 03/31/2015 10:28 AM, Jeff Layton wrote:

Chris and David,

Thanks for the help! I'm still trying to find out why the
compute nodes are down or not responding. Any tips
on where to start?

How about open ports? Right now I have 6817 and
6818 open as per my slurm.conf. I also have 22 and 80
open as well as 111, 2049, and 32806. I'm using NFSv4
but don't know if that is causing the problem or not
(I REALLY want to stick to NFSv4).
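
A quick way to sanity-check those ports is to probe them from each node with something like nc (assuming it's installed on the instances) - keeping in mind the compute nodes need to reach each other on 6818, not just the controller:

    nc -zv ip-10-0-1-72  6817    # slurmctld port on the controller
    nc -zv ip-10-0-2-101 6818    # slurmd port on a compute node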

Thanks!

Jeff

On 31/03/15 07:31, Jeff Layton wrote:

Good afternoon!
Hiya Jeff,

[...]
But it doesn't seem to run. Here is the output of sinfo
and squeue:
[...]

Actually it does appear to get started (at least), but...

[ec2-user@ip-10-0-1-72 ec2-user]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2     debug slurmtes ec2-user CG       0:00      1 ip-10-0-2-101
...the CG state you see there is the completing state, i.e. the state
when a job is finishing up.
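
If a job sits in CG for a while, scontrol will usually tell you which node it is still waiting on, e.g. (job and node names taken from your output above):

    scontrol show job 2
    scontrol show node ip-10-0-2-101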

The system logs on the master node (controller node) don't show too much:

Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]: _slurm_rpc_submit_batch_job JobId=2 usec=239
Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2 NodeList=ip-10-0-2-101 #CPUs=1
OK, node allocated.

Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Job finishes.

Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue JobID=2 State=0x8000 NodeCnt=1 per user/system request
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x8000 NodeCnt=1 done
Not sure of the implication of that "requeue" there, unless it's the
transition to the CG state?

Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-[101-102] not responding
Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102 not responding, setting DOWN
Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101 not responding, setting DOWN
Now the nodes stop responding (not before).
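
Once they have been marked that way, sinfo on the controller will list the reason slurmctld recorded for each down node, which is a quick way to confirm it really is the "not responding" case rather than something else:

    sinfo -R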

From these logs, it looks like the compute nodes are not responding to the control node (master node).

Not sure how to debug this - any tips?
I would suggest looking at the slurmd logs on the compute nodes to see
if they report any problems, and checking what state the slurmd processes
are in - especially whether they're stuck in a 'D' state waiting on some form
of device I/O.
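
A few concrete things to try on a compute node (the log path below is just a common default - SlurmdLogFile in slurm.conf tells you where it really is):

    # is slurmd running, and is it stuck in a D (uninterruptible sleep) state?
    ps -eo pid,stat,wchan:20,cmd | grep '[s]lurmd'

    # watch the slurmd log while you submit a test job
    tail -f /var/log/slurm/slurmd.log

    # or stop the service and run slurmd in the foreground with verbose output
    slurmd -D -vvv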

I know some people have reported strange interactions when Slurm is run
from an NFSv4 mount (NFSv3 is fine).

Good luck!
Chris
