The problem mentioned with NFSv4 arises when your SLURM installation itself lives on NFS, i.e. it is exported to your nodes rather than physically residing on them.
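One way to check whether that applies is to look up which filesystem the Slurm binaries sit on. Below is a minimal Python sketch, assuming the Linux /proc/mounts format (device, mount point, filesystem type, options). The function names and the /opt/slurm path are hypothetical and do not come from this thread.

```python
import os


def fstype_for_path(path, mounts_text):
    """Return (mount_point, fstype) for the longest mount point containing path.

    mounts_text is the content of /proc/mounts (assumed format:
    device, mount point, filesystem type, options, ...).
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # /proc/mounts escapes spaces in mount points as \040
        mount_point = fields[1].replace("\\040", " ")
        fstype = fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same mount point win
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fstype)
    return best


def is_on_nfs(path, mounts_text):
    """True if path sits on a mount whose type starts with 'nfs' (nfs, nfs4)."""
    found = fstype_for_path(path, mounts_text)
    return found is not None and found[1].startswith("nfs")
```

On a compute node you would read /proc/mounts with open("/proc/mounts").read() and pass it in together with the path to your slurmd binary. A result of nfs4 would mean the installation is on an NFSv4 mount.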
--
 ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
 || \\UTGERS      |---------------------*O*---------------------
 ||_// Biomedical | Ryan Novosielski - Senior Technologist
 || \\ and Health | [email protected] - 973/972.0922 (2x0922)
 ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
      `'

________________________________________
From: Jeff Layton [[email protected]]
Sent: Tuesday, March 31, 2015 10:28 AM
To: slurm-dev
Subject: [slurm-dev] Re: Problems running job

Chris and David,

Thanks for the help! I'm still trying to find out why the
compute nodes are down or not responding. Any tips on
where to start?

How about open ports? Right now I have 6817 and 6818
open as per my slurm.conf. I also have 22 and 80 open as
well as 111, 2049, and 32806. I'm using NFSv4 but don't
know if that is causing the problem or not (I REALLY want
to stick to NFSv4).

Thanks!

Jeff

> On 31/03/15 07:31, Jeff Layton wrote:
>
>> Good afternoon!
> Hiya Jeff,
>
> [...]
>> But it doesn't seem to run. Here is the output of sinfo
>> and squeue:
> [...]
>
> Actually it does appear to get started (at least), but..
>
>> [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
>> JOBID PARTITION NAME USER ST TIME NODES
>> NODELIST(REASON)
>> 2 debug slurmtes ec2-user CG 0:00 1 ip-10-0-2-101
> ...the CG state you see there is the completing state, i.e. the state
> when a job is finishing up.
>
>> The system logs on the master node (controller node) don't show too much:
>>
>> Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
>> _slurm_rpc_submit_batch_job JobId=2 usec=239
>> Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
>> NodeList=ip-10-0-2-101 #CPUs=1
> OK, node allocated.
>
>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>> State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
> Job finishes.
>
>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
>> JobID=2 State=0x8000 NodeCnt=1 per user/system request
>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>> State=0x8000 NodeCnt=1 done
> Not sure of the implication of that "requeue" there, unless it's the
> transition to the CG state?
>
>> Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
>> ip-10-0-2-[101-102] not responding
>> Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
>> not responding, setting DOWN
>> Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
>> not responding, setting DOW
> Now the nodes stop responding (not before).
>
>> From these logs, it looks like the compute nodes are not
>> responding to the control node (master node).
>>
>> Not sure how to debug this - any tips?
> I would suggest looking at the slurmd logs on the compute nodes to see
> if they report any problems, and check to see what state the processes
> are in - especially if they're stuck in a 'D' state waiting on some form
> of device I/O.
>
> I know some people have reported strange interactions between Slurm
> being on an NFSv4 mount (NFSv3 is fine).
>
> Good luck!
> Chris
