Put the slurmd and slurmctld in debug mode and retry the submission. Then provide the logs.
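
For example, one way to raise the log level is through slurm.conf. This is only a sketch: SlurmctldDebug, SlurmdDebug, SlurmctldLogFile and SlurmdLogFile are standard slurm.conf parameters, but the log file paths below are assumptions, so keep whatever paths your configuration already uses.

```
# slurm.conf (same file on the controller and the compute nodes)
# Raise both daemons to debug verbosity
SlurmctldDebug=debug
SlurmdDebug=debug
# Assumed log locations; adjust to your existing settings
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
```

After restarting slurmctld on the controller and slurmd on each compute node, resubmit the job and collect both log files.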

On 31/03/2015 16:28, Jeff Layton wrote:
>
> Chris and David,
>
> Thanks for the help! I'm still trying to find out why the
> compute nodes are down or not responding. Any tips
> on where to start?
>
> How about open ports? Right now I have 6817 and
> 6818 open as per my slurm.conf. I also have 22 and 80
> open as well as 111, 2049, and 32806. I'm using NFSv4
> but don't know if that is causing the problem or not
> (I REALLY want to stick to NFSv4).
>
> Thanks!
>
> Jeff
>
>> On 31/03/15 07:31, Jeff Layton wrote:
>>
>>> Good afternoon!
>> Hiya Jeff,
>>
>> [...]
>>> But it doesn't seem to run. Here is the output of sinfo
>>> and squeue:
>> [...]
>>
>> Actually it does appear to get started (at least), but..
>>
>>> [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
>>> JOBID PARTITION NAME USER ST TIME NODES
>>> NODELIST(REASON)
>>> 2 debug slurmtes ec2-user CG 0:00 1
>>> ip-10-0-2-101
>> ...the CG state you see there is the completing state, i.e. the state
>> when a job is finishing up.
>>
>>> The system logs on the master node (controller node) don't show too
>>> much:
>>>
>>> Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
>>> _slurm_rpc_submit_batch_job JobId=2 usec=239
>>> Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
>>> NodeList=ip-10-0-2-101 #CPUs=1
>> OK, node allocated.
>>
>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>>> State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
>> Job finishes.
>>
>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
>>> JobID=2 State=0x8000 NodeCnt=1 per user/system request
>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>>> State=0x8000 NodeCnt=1 done
>> Not sure of the implication of that "requeue" there, unless it's the
>> transition to the CG state?
>>
>>> Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
>>> ip-10-0-2-[101-102] not responding
>>> Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes
>>> ip-10-0-2-102
>>> not responding, setting DOWN
>>> Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes
>>> ip-10-0-2-101
>>> not responding, setting DOW
>> Now the nodes stop responding (not before).
>>
>>> From these logs, it looks like the compute nodes are not
>>> responding to the control node (master node).
>>>
>>> Not sure how to debug this - any tips?
>> I would suggest looking at the slurmd logs on the compute nodes to see
>> if they report any problems, and check to see what state the processes
>> are in - especially if they're stuck in a 'D' state waiting on some form
>> of device I/O.
>>
>> I know some people have reported strange interactions between Slurm
>> being on an NFSv4 mount (NFSv3 is fine).
>>
>> Good luck!
>> Chris

--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
