Re: [slurm-users] unable to start slurmd process.
Navin, thanks for the update, and congrats on finding the problem!

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of navin srivastava
Sent: Saturday, June 13, 2020 1:21 AM
To: Slurm User Community List
Subject: Re: [slurm-users] unable to start slurmd process.

Hi Team,

After my analysis, I found that a user had run the qdel command (one of Slurm's Torque-compatibility wrappers); the job was not killed properly, which left its slurmstepd processes in a hung state. That is why slurmd would not start. After killing those processes, slurmd started without any issues.

Regards,
Navin
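The fix Navin describes (clearing the leftover slurmstepd processes before starting slurmd) might be sketched roughly as follows. This is a hedged sketch, not from the thread: it assumes root on the compute node, that the stuck job is 1252284 (the JobId from the slurmctld log), and that slurmstepd sets its process title to something like "slurmstepd: [1252284.0]".

```shell
# Sketch: list any leftover slurmstepd processes, kill the stuck job's
# step daemons, then start slurmd. The job id is an assumption taken
# from the slurmctld log in this thread; adjust it for your case.
ps -ef | grep '[s]lurmstepd'            # the [s] trick keeps grep itself out of the listing
pkill -9 -f 'slurmstepd: \[1252284'     # kill only that job's step daemons
systemctl start slurmd
```

Killing only the stuck job's steps (rather than `pkill -9 slurmstepd`) avoids taking down step daemons that belong to healthy jobs on the same node.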
Re: [slurm-users] unable to start slurmd process.
Short of getting on the system and kicking the tires myself, I’m fresh out of ideas. Does “sinfo -R” offer any hints?
Re: [slurm-users] unable to start slurmd process.
I am able to get the output of “scontrol show node oled3”, and oled3 pings fine. “scontrol ping” shows:

Slurmctld(primary/backup) at deda1x1466/(NULL) are UP/DOWN

so all looks OK to me.

Regards,
Navin

On Thu, Jun 11, 2020 at 8:38 PM Riebs, Andy wrote:
> So there seems to be a failure to communicate between slurmctld and the oled3 slurmd.
>
> From oled3, try “scontrol ping” to confirm that it can see the slurmctld daemon.
>
> From the head node, try “scontrol show node oled3”, and then ping the address that is shown for “NodeAddr=”.
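The connectivity checks suggested in this thread (“scontrol show node”, then ping the NodeAddr) can be strung together. A hedged sketch, to be run on the head node; “oled3” is the node from this thread, and the parsing relies on scontrol's usual space-separated key=value output format:

```shell
# Sketch: extract NodeAddr from "scontrol show node" output and ping it.
node=oled3
addr=$(scontrol show node "$node" | tr ' ' '\n' | awk -F= '/^NodeAddr=/{print $2}')
echo "NodeAddr for $node is $addr"
ping -c 3 "$addr"
```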
Re: [slurm-users] unable to start slurmd process.
I collected the log from slurmctld and it says the below:

[2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[... the same TERMINATE_JOB resend repeats, typically every four minutes, until 2020-06-11T07:14 ...]
[2020-06-11T07:14:39.188] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:14:49.204] agent/is_node_resp: node:oled3 RPC:REQUEST_TERMINATE_JOB : Communication connection failure
[2020-06-11T07:14:50.210] error: Nodes oled3 not responding
[2020-06-11T07:15:54.313] error: Nodes oled3 not responding
[2020-06-11T07:17:34.407] error: Nodes oled3 not responding
[2020-06-11T07:19:14.637] error: Nodes oled3 not responding
[2020-06-11T07:19:54.313] update_node: node oled3 reason set to: reboot-required
[2020-06-11T07:19:54.313] update_node: node oled3 state set to DRAINING*
[2020-06-11T07:20:43.788] requeue job 1316970 due to failure of node oled3
[2020-06-11T07:20:43.788] requeue job 1349322 due to failure of node oled3
[2020-06-11T07:20:43.789] error: Nodes oled3 not responding, setting DOWN

sinfo says:

OLED* up infinite 1 drain* oled3

but when I check the node itself, it looks healthy.

Regards,
Navin
Re: [slurm-users] unable to start slurmd process.
Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to interpret it not reporting anything but the “log file” and “munge” messages. When you have it running attached to your window, is there any chance that sinfo or scontrol suggest that the node is actually all right? Perhaps something in /etc/sysconfig/slurm or the like is messed up?

If that’s not the case, I think my next step would be to follow up on someone else’s suggestion, and scan the slurmctld.log file for the problem node name.

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of navin srivastava
Sent: Thursday, June 11, 2020 9:26 AM
To: Slurm User Community List
Subject: Re: [slurm-users] unable to start slurmd process.

Sorry Andy, I missed adding this. First I tried “slurmd -Dvvv”, and it wrote nothing beyond:

slurmd: debug: Log file re-opened
slurmd: debug: Munge authentication plugin loaded

After that I waited 10-20 minutes with no further output, and finally pressed Ctrl-C.

My doubt is about the slurm.conf file:

ControlMachine=deda1x1466
ControlAddr=192.168.150.253

deda1x1466 also has a different interface with a different IP which the compute node is unable to ping, but this ControlAddr IP is pingable. Could that be one of the reasons? But other nodes have the same config and there I am able to start slurmd, so I am a bit confused.

Regards,
Navin
Re: [slurm-users] unable to start slurmd process.
Is the time on that node too far out-of-sync w.r.t. the slurmctld server?

> On Jun 11, 2020, at 09:01, navin srivastava wrote:
>
> I tried by executing the debug mode but there also it is not writing anything. [...]
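The clock question above can be checked directly. Munge credentials are time-stamped, so significant skew between a node and the controller can cause authentication failures. A hedged sketch, not from the thread: it assumes ssh access to the controller (deda1x1466 in this thread):

```shell
# Sketch: measure clock skew between this node and the slurmctld host.
ctl_time=$(ssh deda1x1466 date +%s) || ctl_time=0   # fall back to 0 if ssh fails
node_time=$(date +%s)
echo "skew vs controller: $(( node_time - ctl_time )) seconds"
```

Anything beyond a few seconds is worth fixing with NTP/chrony regardless of whether it caused this particular failure.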
Re: [slurm-users] unable to start slurmd process.
If you omitted the “-D” that I suggested, then the daemon would have detached and logged nothing on the screen. In this case, you can still go to the slurmd log (use “scontrol show config | grep -I log” if you’re not sure where the logs are stored). From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of navin srivastava Sent: Thursday, June 11, 2020 9:01 AM To: Slurm User Community List Subject: Re: [slurm-users] unable to start slurmd process. I tried by executing the debug mode but there also it is not writing anything. i waited for about 5-10 minutes deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v No output on terminal. The OS is SLES12-SP4 . All firewall services are disabled. The recent change is the local hostname earlier it was with local hostname node1,node2,etc but we have moved to dns based hostname which is deda NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10 State=UNKNOWN other than this it is fine but after that i have done several time slurmd process started on the node and it works fine but now i am seeing this issue today. Regards Navin. On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy mailto:andy.ri...@hpe.com>> wrote: Navin, As you can see, systemd provides very little service-specific information. For slurm, you really need to go to the slurm logs to find out what happened. Hint: A quick way to identify problems like this with slurmd and slurmctld is to run them with the “-Dvvv” option, causing them to log to your window, and usually causing the problem to become immediately obvious. For example, # /usr/local/slurm/sbin/slurmd -D Just it ^C when you’re done, if necessary. Of course, if it doesn’t fail when you run it this way, it’s time to look elsewhere. 
Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of navin srivastava
Sent: Thursday, June 11, 2020 8:25 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] unable to start slurmd process.

Hi Team,

When I am trying to start the slurmd process, I am getting the below error:

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)

Slurm version is 17.11.8.

The server and Slurm have been running for a long time and we have not made any changes, but today when I am starting it, it gives this error message.

Any idea what could be wrong here?

Regards
Navin.
Re: [slurm-users] unable to start slurmd process.
I tried executing it in debug mode, but there too it does not write anything. I waited for about 5-10 minutes.

deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v

No output on the terminal. The OS is SLES12-SP4. All firewall services are disabled.

The recent change is the hostnames: earlier the nodes used local hostnames (node1, node2, etc.), but we have moved to DNS-based hostnames with the deda prefix:

NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10 State=UNKNOWN

Other than this it is fine. After that change I started the slurmd process on the node several times and it worked fine, but now I am seeing this issue today.

Regards
Navin.

On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy wrote:

> Navin,
>
> As you can see, systemd provides very little service-specific information. For Slurm, you really need to go to the Slurm logs to find out what happened.
>
> Hint: A quick way to identify problems like this with slurmd and slurmctld is to run them with the "-Dvvv" option, causing them to log to your window, and usually causing the problem to become immediately obvious.
>
> For example:
>
> # /usr/local/slurm/sbin/slurmd -D
>
> Just hit ^C when you're done, if necessary. Of course, if it doesn't fail when you run it this way, it's time to look elsewhere.
>
> Andy
>
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of navin srivastava
> Sent: Thursday, June 11, 2020 8:25 AM
> To: Slurm User Community List
> Subject: [slurm-users] unable to start slurmd process.
>
> Hi Team,
>
> When I am trying to start the slurmd process, I am getting the below error:
>
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)
>
> Slurm version is 17.11.8.
>
> The server and Slurm have been running for a long time and we have not made any changes, but today when I am starting it, it gives this error message.
>
> Any idea what could be wrong here?
>
> Regards
> Navin.
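Since Navin mentions a recent switch from local to DNS-based hostnames, a mismatch between the NodeHostname in slurm.conf and what the node actually reports is worth ruling out: slurmd matches the local hostname against NodeHostname to find its own node entry. A minimal sketch; the `deda1x1452` value is taken from his shell prompt, and the helper name is invented for illustration:

```shell
# Compare the NodeHostname slurm.conf expects with what the node reports.
# A mismatch after a hostname change can keep slurmd from starting.
check_nodehostname() {
  expected="$1"   # NodeHostname from slurm.conf
  actual="$2"     # usually the output of `hostname -s`
  if [ "$actual" = "$expected" ]; then
    echo "match"
  else
    echo "mismatch: slurm.conf expects $expected, node reports $actual"
  fi
}

check_nodehostname deda1x1452 "$(hostname -s)"
```

Running this on each renamed node quickly shows whether the slurm.conf node list and the actual hostnames still agree.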
Re: [slurm-users] unable to start slurmd process.
Hi:

Please share the output of:

cat /etc/redhat-release
OR
cat /etc/lsb-release

Also, please send the detailed log reports that are probably available at /var/log/slurm/slurmctld.log, and the status of:

ps -ef | grep slurmctld

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 11/06/20 5:54 pm, navin srivastava wrote:

> Hi Team,
>
> When I am trying to start the slurmd process, I am getting the below error:
>
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)
>
> Slurm version is 17.11.8.
>
> The server and Slurm have been running for a long time and we have not made any changes, but today when I am starting it, it gives this error message.
>
> Any idea what could be wrong here?
>
> Regards
> Navin.
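Sudeep's `ps -ef | grep slurmctld` check works, but the grep usually matches its own process too; `pgrep` avoids that. A small sketch, with a helper name invented for illustration:

```shell
# Report whether a daemon is running, without `grep` matching itself.
report_daemon() {
  pgrep -a "$1" || echo "$1 is not running"
}

report_daemon slurmctld
```

If slurmctld is up, this prints its PID and command line; otherwise it prints the fallback message instead of an empty result.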
Re: [slurm-users] unable to start slurmd process.
Hi Navin,

Try running slurmd in the foreground with increased verbosity:

slurmd -D -v

(add as many -v as you deem necessary)

Hopefully it'll tell you more about why it times out.

Best,
Marcus

On 6/11/20 2:24 PM, navin srivastava wrote:
> Hi Team,
>
> When I am trying to start the slurmd process, I am getting the below error:
>
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)
>
> Slurm version is 17.11.8.
>
> The server and Slurm have been running for a long time and we have not made any changes, but today when I am starting it, it gives this error message.
> Any idea what could be wrong here?
>
> Regards
> Navin.

--
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience
Tel.: +49 (0)551 201-2191
E-Mail: mbo...@gwdg.de
---
Gesellschaft fuer wissenschaftliche Datenverarbeitung mbH Goettingen (GWDG)
Am Fassberg 11, 37077 Goettingen
URL: http://www.gwdg.de
E-Mail: g...@gwdg.de
Tel.: +49 (0)551 201-1510
Fax: +49 (0)551 201-2150
Geschaeftsfuehrer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Christian Griesinger
Sitz der Gesellschaft: Goettingen
Registergericht: Goettingen
Handelsregister-Nr. B 598
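Marcus's foreground run can also be bounded with `timeout` so a hang like the 90-second one in the journal is caught explicitly. A sketch assuming the stock `/usr/sbin/slurmd` path; the exit-code helper is made up for illustration, and `timeout` returns 124 when it has to kill the command:

```shell
# Classify the exit status of a time-bounded foreground slurmd run.
# 124 means timeout(1) killed it, i.e. slurmd hung during startup,
# matching the "Start operation timed out" message from systemd.
classify_start() {
  case "$1" in
    124) echo "hung: startup exceeded the timeout" ;;
    0)   echo "exited cleanly" ;;
    *)   echo "failed with status $1" ;;
  esac
}

timeout 90 /usr/sbin/slurmd -D -v -v -v
classify_start "$?"
```

This distinguishes a genuine hang (as later turned out to be the case here, caused by stuck slurmstepd processes) from an immediate startup error.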
Re: [slurm-users] unable to start slurmd process.
On 11-06-2020 14:24, navin srivastava wrote:
> Hi Team,
>
> When I am trying to start the slurmd process, I am getting the below error:
>
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)
>
> Slurm version is 17.11.8.
>
> The server and Slurm have been running for a long time and we have not made any changes, but today when I am starting it, it gives this error message.
>
> Any idea what could be wrong here?

Which OS do you run this ancient Slurm version on?

There could be many reasons why slurmd refuses to start, such as network, DNS, firewall, etc. You should check the log file in /var/log/slurm/.

You could also start slurmd from the command line, adding one or more -v for verbose logging:

$ slurmd -v -v

/Ole
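Of the causes Ole lists, DNS is quick to rule out with `getent`, which goes through the same NSS resolver path the Slurm daemons use for NodeAddr lookups. A sketch; `oled3` is the node name from this thread, and the helper name is invented:

```shell
# Check that a node name resolves via the system resolver (NSS),
# the same path slurmd/slurmctld use to look up node addresses.
check_dns() {
  if getent hosts "$1" >/dev/null; then
    echo "$1 resolves"
  else
    echo "$1 does not resolve"
  fi
}

check_dns oled3
```

Running this on both the head node and the compute node catches asymmetric resolution problems, where one side can resolve the other but not vice versa.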
Re: [slurm-users] unable to start slurmd process.
Navin,

As you can see, systemd provides very little service-specific information. For Slurm, you really need to go to the Slurm logs to find out what happened.

Hint: A quick way to identify problems like this with slurmd and slurmctld is to run them with the "-Dvvv" option, causing them to log to your window, and usually causing the problem to become immediately obvious.

For example:

# /usr/local/slurm/sbin/slurmd -D

Just hit ^C when you're done, if necessary. Of course, if it doesn't fail when you run it this way, it's time to look elsewhere.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of navin srivastava
Sent: Thursday, June 11, 2020 8:25 AM
To: Slurm User Community List
Subject: [slurm-users] unable to start slurmd process.

Hi Team,

When I am trying to start the slurmd process, I am getting the below error:

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)

Slurm version is 17.11.8.

The server and Slurm have been running for a long time and we have not made any changes, but today when I am starting it, it gives this error message.

Any idea what could be wrong here?

Regards
Navin.