[slurm-dev] A way of abusing the priority option in Slurm?
Hi! I just discovered a possible way for a user to abuse priority in Slurm. This is the scenario:

1. A user has not run any jobs in a long time and therefore has a high fairshare priority. Let's say: 1.
2. The user submits 1000 jobs into the queue, which is far above his fairshare target.
3. The user changes the priority of his jobs (it's OK for a user to lower the priority of jobs as long as the user is the owner) to, let's say, a value that is still a high priority (+-1 is in practice nothing): scontrol update jobid=1 priority=
4. The user's jobs start and the fairshare priority drops. But here is the big _BUT_: the priority of the jobs with a changed priority does not seem to change, leaving the user's jobs with maximum priority until all of the jobs are completed.

Have I missed something in this scenario? If this is true, what do we do about it? Should users be able to change the priority at all? The user can use the 'nice' option to alter the priority of a job within a small limit that does not alter the priority as defined above.

Please let me be wrong :-)

/Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
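[Editor's note: to make the sequence concrete, a minimal sketch of the steps described above; the job script, loop count and priority value are placeholders, not taken from the original report.]

    # submit a large batch of jobs while the fairshare priority is still high
    for i in $(seq 1 1000); do
        sbatch job.sh                        # job.sh is a hypothetical job script
    done

    # as the owning user, explicitly set (lower) the priority of one job;
    # users may only decrease the priority of jobs they own
    scontrol update jobid=1 priority=99999   # placeholder value, slightly below the current priority

    # watch the priority factors; in the scenario above, the explicitly set
    # value reportedly stays fixed even as fairshare usage accumulates
    sprio -l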
[slurm-dev] error code in slurm_checkpoint_complete
Hi all, just a quick question regarding the Slurm API. In the following call (checkpoint.c):

    /*
     * slurm_checkpoint_complete - note the completion of a job step's checkpoint operation.
     * IN job_id     - job on which to perform operation
     * IN step_id    - job step on which to perform operation
     * IN begin_time - time at which checkpoint began
     * IN error_code - error code, highest value for all complete calls is preserved
     * IN error_msg  - error message, preserved for highest error_code
     * RET 0 or a slurm error code
     */
    extern int slurm_checkpoint_complete (uint32_t job_id, uint32_t step_id,
                                          time_t begin_time, uint32_t error_code,
                                          char *error_msg);

what is error_code used for? The man page says "Error code for checkpoint operation. Only the highest value is preserved.", but given that it is an input parameter I don't really see the point, especially since another error code is returned by the function itself. Can anyone enlighten me?

Thanks for your attention. Best regards,

Manuel

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108
CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
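[Editor's note: one unverified reading is that error_code is the caller's report of how the checkpoint itself went, so slurmctld can keep the worst result seen across completion calls, while the function's own return value only says whether this RPC succeeded. A sketch of a call, with made-up job/step IDs:]

    #include <time.h>
    #include <slurm/slurm.h>
    #include <slurm/slurm_errno.h>

    /* Hypothetical helper: report the outcome of a checkpoint of step 0 of job 1234.
     * ckpt_rc is the result of the checkpoint operation itself (the error_code
     * argument); per the header comment, the controller preserves only the
     * highest value reported across all completion calls. The return value of
     * slurm_checkpoint_complete() just indicates whether this call reached slurmctld. */
    static int report_checkpoint(time_t began, uint32_t ckpt_rc, char *msg)
    {
            int rc = slurm_checkpoint_complete(1234, 0, began, ckpt_rc, msg);
            if (rc != SLURM_SUCCESS)
                    slurm_perror("slurm_checkpoint_complete");
            return rc;
    }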
[slurm-dev] Re: Problems running job
Chris and David,

Thanks for the help! I'm still trying to find out why the compute nodes are down or not responding. Any tips on where to start? How about open ports? Right now I have 6817 and 6818 open as per my slurm.conf. I also have 22 and 80 open, as well as 111, 2049, and 32806. I'm using NFSv4 but don't know if that is causing the problem or not (I REALLY want to stick to NFSv4).

Thanks!

Jeff

On 31/03/15 07:31, Jeff Layton wrote:

Good afternoon!

Hiya Jeff,

[...] But it doesn't seem to run. Here is the output of sinfo and squeue: [...]

Actually it does appear to get started (at least), but...

[ec2-user@ip-10-0-1-72 ec2-user]$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
      2     debug slurmtes ec2-user CG  0:00     1 ip-10-0-2-101

...the CG state you see there is the completing state, i.e. the state when a job is finishing up.

The system logs on the master node (controller node) don't show too much:

Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]: _slurm_rpc_submit_batch_job JobId=2 usec=239
Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2 NodeList=ip-10-0-2-101 #CPUs=1

OK, node allocated.

Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0

Job finishes.

Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue JobID=2 State=0x8000 NodeCnt=1 per user/system request
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x8000 NodeCnt=1 done

Not sure of the implication of that requeue there, unless it's the transition to the CG state?

Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-[101-102] not responding
Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102 not responding, setting DOWN
Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101 not responding, setting DOWN

Now the nodes stop responding (not before).

From these logs, it looks like the compute nodes are not responding to the control node (master node). Not sure how to debug this - any tips?

I would suggest looking at the slurmd logs on the compute nodes to see if they report any problems, and check to see what state the processes are in - especially if they're stuck in a 'D' state waiting on some form of device I/O. I know some people have reported strange interactions with Slurm being on an NFSv4 mount (NFSv3 is fine).

Good luck!
Chris
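[Editor's note: a quick sketch of how one might act on Chris's suggestion on one of the compute nodes; the log path below is only a common default and actually depends on SlurmdLogFile in slurm.conf.]

    # on a compute node: check the slurmd log for errors around the job's lifetime
    tail -n 100 /var/log/slurm/slurmd.log    # adjust to your SlurmdLogFile setting

    # check the daemon/process states; a 'D' in the STAT column means the
    # process is stuck in uninterruptible sleep, typically waiting on I/O
    ps -eo pid,stat,wchan:20,cmd | egrep 'slurmd|slurmstepd'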
[slurm-dev] Re: Problems running job
Actually I don't have all the ports open :( I can do that though (I thought that might be a problem).

Thanks!

Jeff

Do you have all the ports open between all the compute nodes as well? Since slurm builds a tree to communicate all the nodes need to talk to every other node on those ports and do so with out a huge amount of latency. You might want to try to up your timeouts.

-Paul Edmon-

On 03/31/2015 10:28 AM, Jeff Layton wrote:

Chris and David, Thanks for the help! I'm still trying to find out why the compute nodes are down or not responding. [...]
[slurm-dev] Re: Problems running job
Do you have all the ports open between all the compute nodes as well? Since Slurm builds a tree to communicate, all the nodes need to talk to every other node on those ports, and do so without a huge amount of latency. You might want to try to up your timeouts.

-Paul Edmon-

On 03/31/2015 10:28 AM, Jeff Layton wrote:

Chris and David, Thanks for the help! I'm still trying to find out why the compute nodes are down or not responding. [...]
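[Editor's note: on the "up your timeouts" point, these are the slurm.conf parameters usually meant; the values below are purely illustrative, not recommendations.]

    # slurm.conf - run "scontrol reconfigure" (or restart the daemons) after editing
    SlurmdTimeout=300     # seconds slurmctld waits for slurmd to respond before marking the node DOWN
    MessageTimeout=30     # per-RPC message timeout; the default is 10 seconds
    TCPTimeout=5          # seconds allowed to establish a TCP connection (newer Slurm versions; default 2)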
[slurm-dev] Re: Problems running job
That's what I've done. Everything is in NFSv4 except for a few bits:

/etc/slurm.conf
/etc/init.d/slurm
/var/log/slurm
/var/run/slurm
/var/spool/slurm

These bits are local to the node. Will slurm have trouble in this case?

Thanks!

Jeff

The problem mentioned with NFSv4 is keeping your SLURM installation on NFS, eg. exported to but not physically residing on your nodes.

[...]
[slurm-dev] Re: Problems running job
Yes! There are problems if the clean-up scripts for cgroups reside on NFSv4. Nodes will lock up when they try to remove a job's cgroup.

On 31.03.2015 at 17:06, Jeff Layton wrote:

That's what I've done. Everything is in NFSv4 except for a few bits: /etc/slurm.conf /etc/init.d/slurm /var/log/slurm /var/run/slurm /var/spool/slurm. These bits are local to the node. Will slurm have trouble in this case? [...]
[slurm-dev] A way of abusing the priority option in Slurm?
Hi Magnus,

Unfortunately you found a bug. Here is a patch that will prevent users from making persistent job priority changes. We should probably return an error for this condition, but I would like to defer that change to the next major release, v15.08.

https://github.com/SchedMD/slurm/commit/4454316ef527b8700743d94c958811a39609e7d5.patch

--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
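[Editor's note: for anyone who wants to pick up the fix before that release, a sketch of applying it to a local source tree; the checkout path is just an example.]

    # download the commit as a patch and apply it to a Slurm source checkout
    cd ~/src/slurm        # example path to your Slurm git checkout
    curl -LO https://github.com/SchedMD/slurm/commit/4454316ef527b8700743d94c958811a39609e7d5.patch
    git apply 4454316ef527b8700743d94c958811a39609e7d5.patch    # or: patch -p1 < the .patch file
    # then rebuild and reinstall as usual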
[slurm-dev] Re: Error connecting slurm stream socket at IP:6817: Connection refused
Thanks for the reply. But I think it is not a network problem, because I am starting this only on the head controller. Can you have a look at my config? I don't have iptables installed or anything like that.

Config:

ControlMachine=JGSLURMHC
ControlAddr=172.16.40.42
#BackupController=
#BackupAddr=
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=root
StateSaveLocation=/tmp
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmdLogFile=/var/log/slurmd
SlurmctldLogFile=/var/log/slurm
SlurmdDebug=3

# COMPUTE NODES
NodeName=JGSLURMHC CPUs=1 State=UNKNOWN
#PartitionName=debug Nodes=JGNODE[1-1] Default=YES MaxTime=INFINITE State=UP

Bests,

JG
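[Editor's note: not an answer, but a few things worth checking on the head node when connections to port 6817 are refused; these are standard commands, adjust paths to the config above.]

    # is slurmctld actually running and listening on 6817?
    ps -ef | grep slurmctld
    ss -tlnp | grep 6817        # or: netstat -tlnp | grep 6817

    # can the client tools reach the controller?
    scontrol ping

    # munge must be running with matching keys, since AuthType=auth/munge
    munge -n | unmunge

    # the controller log, per SlurmctldLogFile above
    tail -f /var/log/slurm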
[slurm-dev] Re: Problems running job
Put the slurmd and slurmctld in debug mode and retry the submission. Then provide the logs.

On 31/03/2015 16:28, Jeff Layton wrote:

Chris and David, Thanks for the help! I'm still trying to find out why the compute nodes are down or not responding. [...]

--
Mehdi Denou
International HPC support
+336 45 57 66 56
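[Editor's note: for reference, one way to do that; a sketch only, exact flags and log locations depend on your installation and Slurm version.]

    # raise slurmctld logging verbosity on the fly, from the controller
    scontrol setdebug debug5

    # or run the daemons in the foreground with verbose output
    # on the controller:
    slurmctld -D -vvvv
    # on a compute node:
    slurmd -D -vvvv

    # then retry the submission and collect the slurmctld and slurmd logs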
[slurm-dev] Re: Problems running job
The problem mentioned with NFSv4 is keeping your SLURM installation on NFS, e.g. exported to but not physically residing on your nodes.

--
*Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |-*O*-
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922)
||  \\ Sciences  | OIRT/High Perf Res Comp - MSB C630, Newark
     `'

From: Jeff Layton [layto...@att.net]
Sent: Tuesday, March 31, 2015 10:28 AM
To: slurm-dev
Subject: [slurm-dev] Re: Problems running job

Chris and David, Thanks for the help! I'm still trying to find out why the compute nodes are down or not responding. [...]