I collected the log from slurmctld; it says:

[2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:18:38.255] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:22:38.624] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:26:38.902] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:30:38.230] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:34:38.594] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:38:38.986] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:42:38.402] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:46:38.764] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:50:38.094] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:26:38.839] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:30:38.225] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:34:38.582] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:38:38.914] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:42:38.292] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:46:38.542] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:50:38.869] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:54:38.227] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:58:38.628] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T06:54:39.012] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T06:58:39.411] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:02:39.106] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:06:39.495] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:10:39.814] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:14:39.188] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:14:49.204] agent/is_node_resp: node:oled3 RPC:REQUEST_TERMINATE_JOB : Communication connection failure
[2020-06-11T07:14:50.210] error: Nodes oled3 not responding
[2020-06-11T07:15:54.313] error: Nodes oled3 not responding
[2020-06-11T07:17:34.407] error: Nodes oled3 not responding
[2020-06-11T07:19:14.637] error: Nodes oled3 not responding
[2020-06-11T07:19:54.313] update_node: node oled3 reason set to: reboot-required
[2020-06-11T07:19:54.313] update_node: node oled3 state set to DRAINING*
[2020-06-11T07:20:43.788] requeue job 1316970 due to failure of node oled3
[2020-06-11T07:20:43.788] requeue job 1349322 due to failure of node oled3
[2020-06-11T07:20:43.789] error: Nodes oled3 not responding, setting DOWN
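The "Communication connection failure" and "not responding" lines mean slurmctld could not open a TCP connection to the slurmd on oled3. A quick probe, from the controller toward the compute node and from the compute node toward ControlAddr, can separate a network problem from a dead slurmd. This is only a sketch: 6818 and 6817 are Slurm's default SlurmdPort and SlurmctldPort, so verify both against your slurm.conf before trusting the result.

```shell
# Probe a TCP port; prints "reachable" or "unreachable".
# Uses bash's /dev/tcp pseudo-device; "timeout" is from GNU coreutils.
check_port() {
    if timeout 3 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null; then
        echo reachable
    else
        echo unreachable
    fi
}

# From the controller: can we reach slurmd on oled3? (6818 = default SlurmdPort)
check_port oled3 6818

# From the compute node: can we reach slurmctld? (ControlAddr taken from this
# thread; 6817 = default SlurmctldPort)
check_port 192.168.150.253 6817
```

If both probes succeed but sinfo still shows drain*, the drain can be cleared with "scontrol update NodeName=oled3 State=RESUME" once slurmd is confirmed healthy.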
sinfo says:

OLED*    up   infinite    1  drain*  oled3

When I check the node itself, it looks healthy.

Regards
Navin

On Thu, Jun 11, 2020 at 7:21 PM Riebs, Andy <andy.ri...@hpe.com> wrote:

> Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess
> how to interpret it not reporting anything but the “log file” and “munge”
> messages. When you have it running attached to your window, is there any
> chance that sinfo or scontrol suggest that the node is actually all right?
> Perhaps something in /etc/sysconfig/slurm or the like is messed up?
>
> If that’s not the case, I think my next step would be to follow up on
> someone else’s suggestion, and scan the slurmctld.log file for the problem
> node name.
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 9:26 AM
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* Re: [slurm-users] unable to start slurmd process.
>
> Sorry Andy, I forgot to add the details.
>
> First I tried "slurmd -Dvvv", and it printed nothing beyond:
>
> slurmd: debug: Log file re-opened
> slurmd: debug: Munge authentication plugin loaded
>
> I waited 10-20 minutes, but there was no further output, so I finally
> pressed Ctrl-C.
>
> My suspicion is about these lines in slurm.conf:
>
> ControlMachine=deda1x1466
> ControlAddr=192.168.150.253
>
> deda1x1466 has a different interface with a different IP; the compute
> node is unable to ping the hostname, but the ControlAddr IP is pingable.
> Could that be one of the reasons?
>
> But the other nodes have the same config, and on those I am able to
> start slurmd, so I am a bit confused.
>
> Regards
> Navin.
>
> On Thu, Jun 11, 2020 at 6:44 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
>
> If you omitted the “-D” that I suggested, then the daemon would have
> detached and logged nothing on the screen.
> In this case, you can still go to the slurmd log (use “scontrol show
> config | grep -i log” if you’re not sure where the logs are stored).
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 9:01 AM
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* Re: [slurm-users] unable to start slurmd process.
>
> I tried executing it in debug mode, but it is not writing anything
> there either. I waited for about 5-10 minutes:
>
> deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v
>
> No output on the terminal.
>
> The OS is SLES12-SP4. All firewall services are disabled.
>
> The recent change is to the hostnames: earlier the nodes used local
> hostnames (node1, node2, etc.), but we have moved to DNS-based
> hostnames (deda*):
>
> NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12]
> Sockets=2 CoresPerSocket=10 State=UNKNOWN
>
> Other than that everything is fine; since the change I have started the
> slurmd process on this node several times and it worked, but today I am
> seeing this issue.
>
> Regards
> Navin.
>
> On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
>
> Navin,
>
> As you can see, systemd provides very little service-specific information.
> For slurm, you really need to go to the slurm logs to find out what
> happened.
>
> Hint: A quick way to identify problems like this with slurmd and slurmctld
> is to run them with the “-Dvvv” option, causing them to log to your window,
> and usually causing the problem to become immediately obvious.
>
> For example,
>
> # /usr/local/slurm/sbin/slurmd -Dvvvv
>
> Just ^C it when you’re done, if necessary. Of course, if it doesn’t fail
> when you run it this way, it’s time to look elsewhere.
> Andy
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 8:25 AM
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* [slurm-users] unable to start slurmd process.
>
> Hi Team,
>
> When I try to start the slurmd process, I get the error below:
>
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)
>
> The Slurm version is 17.11.8.
>
> The server and Slurm have been running for a long time and we have not
> made any changes, but today when I start it, it gives this error message.
> Any idea what could be wrong here?
>
> Regards
> Navin.
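The systemd messages above only say that startup timed out: the shipped slurmd unit in this era is typically Type=forking, so systemd waits for the daemon to finish forking, and a slurmd that hangs during startup (as the silent "-Dvvv" run suggests) never signals readiness. A sketch of follow-up checks on the failing node, assuming root access; the guards let the snippet degrade gracefully where a tool is missing:

```shell
# 1. systemd's own record of the failed slurmd unit:
if command -v journalctl >/dev/null 2>&1; then
    journalctl -u slurmd --no-pager | tail -n 50
fi

# 2. Locate the slurmd log file (the "scontrol show config" hint from
#    earlier in this thread), then read its tail:
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | grep -i log
fi

# 3. Run slurmd in the foreground with maximum verbosity; ^C to stop:
#    slurmd -Dvvvv
```

If the foreground run hangs right after the munge message, that points at something slurmd does early in startup, such as resolving ControlMachine/ControlAddr or contacting slurmctld, rather than at systemd itself.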