Yes! There are problems if the clean-up scripts for cgroups reside on NFSv4. Nodes will lock up when they try to remove a job's cgroup.
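
To see whether a given script actually lives on an NFSv4 mount, one option is to match its path against the mount table. The following is only a minimal sketch: the sample mount table and the script path are made up for illustration, and the only thing relied on is the usual /proc/mounts layout (device, mount point, filesystem type, ...), where NFSv4 mounts show up with type "nfs4".

```python
import posixpath


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the deepest mount containing path.

    mounts_text is expected to be in /proc/mounts format. The path must be
    absolute. Mount points with escaped characters (e.g. \\040 for a space)
    are not decoded in this sketch.
    """
    path = posixpath.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties the later entry wins,
            # since it is mounted on top of the earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


# Hypothetical mount table and script path, for illustration only.
SAMPLE_MOUNTS = """\
/dev/xvda1 / ext4 rw,relatime 0 0
10.0.1.72:/opt/slurm /opt/slurm nfs4 rw,relatime,vers=4.0 0 0
"""

script = "/opt/slurm/etc/cgroup/release_freezer"
mount_point, fs_type = fs_type_for(script, SAMPLE_MOUNTS)
print(script, "is on", mount_point, "type", fs_type)
```

On a real node you would pass the contents of /proc/mounts instead of the sample string and check the directory that holds your own cgroup clean-up scripts.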

On 31.03.2015 at 17:06, Jeff Layton wrote:
>
> That's what I've done. Everything is in NFSv4 except for a
> few bits:
>
> /etc/slurm.conf
> /etc/init.d/slurm
> /var/log/slurm
> /var/run/slurm
> /var/spool/slurm
>
> These bits are local to the node.
>
> Will slurm have trouble in this case?
>
> Thanks!
>
> Jeff
>
>> The problem mentioned with NFSv4 is keeping your SLURM installation on NFS,
>> e.g. exported to but not physically residing on your
>> nodes.
>>
>> --
>> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
>> || \\UTGERS |---------------------*O*---------------------
>> ||_// Biomedical | Ryan Novosielski - Senior Technologist
>> || \\ and Health | [email protected] - 973/972.0922 (2x0922)
>> || \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>> `'
>> ________________________________________
>> From: Jeff Layton [[email protected]]
>> Sent: Tuesday, March 31, 2015 10:28 AM
>> To: slurm-dev
>> Subject: [slurm-dev] Re: Problems running job
>>
>> Chris and David,
>>
>> Thanks for the help! I'm still trying to find out why the
>> compute nodes are down or not responding. Any tips
>> on where to start?
>>
>> How about open ports? Right now I have 6817 and
>> 6818 open as per my slurm.conf. I also have 22 and 80
>> open as well as 111, 2049, and 32806. I'm using NFSv4
>> but don't know if that is causing the problem or not
>> (I REALLY want to stick to NFSv4).
>>
>> Thanks!
>>
>> Jeff
>>
>>> On 31/03/15 07:31, Jeff Layton wrote:
>>>
>>>> Good afternoon!
>>> Hiya Jeff,
>>>
>>> [...]
>>>> But it doesn't seem to run. Here is the output of sinfo
>>>> and squeue:
>>> [...]
>>>
>>> Actually it does appear to get started (at least), but..
>>>
>>>> [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
>>>> JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>>>>     2     debug slurmtes ec2-user CG  0:00     1 ip-10-0-2-101
>>> ...the CG state you see there is the completing state, i.e. the state
>>> when a job is finishing up.
>>>
>>>> The system logs on the master node (controller node) don't show too much:
>>>>
>>>> Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
>>>> _slurm_rpc_submit_batch_job JobId=2 usec=239
>>>> Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
>>>> NodeList=ip-10-0-2-101 #CPUs=1
>>> OK, node allocated.
>>>
>>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>>>> State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
>>> Job finishes.
>>>
>>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
>>>> JobID=2 State=0x8000 NodeCnt=1 per user/system request
>>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>>>> State=0x8000 NodeCnt=1 done
>>> Not sure of the implication of that "requeue" there, unless it's the
>>> transition to the CG state?
>>>
>>>> Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
>>>> ip-10-0-2-[101-102] not responding
>>>> Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
>>>> not responding, setting DOWN
>>>> Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
>>>> not responding, setting DOW
>>> Now the nodes stop responding (not before).
>>>
>>>> From these logs, it looks like the compute nodes are not
>>>> responding to the control node (master node).
>>>>
>>>> Not sure how to debug this - any tips?
>>> I would suggest looking at the slurmd logs on the compute nodes to see
>>> if they report any problems, and check to see what state the processes
>>> are in - especially if they're stuck in a 'D' state waiting on some form
>>> of device I/O.
>>>
>>> I know some people have reported strange interactions when Slurm
>>> is on an NFSv4 mount (NFSv3 is fine).
>>>
>>> Good luck!
>>> Chris
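
Chris's suggestion to look for processes stuck in the 'D' (uninterruptible sleep) state can be sketched roughly as follows. This is only an illustration, assuming a Linux /proc filesystem where each /proc/<pid>/stat starts with "pid (comm) state"; the function names and the process name in the example are made up here and are not part of Slurm.

```python
import os


def parse_stat(stat_text):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name sits between the first '(' and the last ')', and may
    itself contain spaces or parentheses, so it is not split on whitespace.
    """
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for processes currently in state 'D'."""
    stuck = []
    if not os.path.isdir(proc_root):
        return stuck
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            # The process may have exited between listdir() and open().
            continue
        if state == "D":
            stuck.append((pid, comm))
    return stuck


# Hypothetical stat line, for illustration only.
example = "1234 (slurmstepd: [2]) D 1 1234 1234 0 -1 4194560"
print(parse_stat(example))

for pid, comm in uninterruptible_processes():
    print("PID", pid, "(" + comm + ") is in uninterruptible sleep")
```

A process that shows up here briefly is normal; one that stays in 'D' across repeated runs on a compute node is the kind of hang worth correlating with the slurmd log.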
