Is the time on that node too far out-of-sync w.r.t. the slurmctld server?
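
A rough way to check, assuming you can ssh from the compute node to the controller. This is only a sketch: `ctld-host` is a placeholder for your slurmctld server, and `clock_skew` is a helper name made up for illustration, not a Slurm tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print the absolute difference,
# in seconds, between two epoch timestamps.
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Example usage on the compute node ("ctld-host" is a placeholder):
#   node_time=$(date -u +%s)
#   ctld_time=$(ssh ctld-host date -u +%s)
#   echo "skew: $(clock_skew "$node_time" "$ctld_time") seconds"
```

The ssh round trip adds a second or so of noise, so this is only useful for spotting a skew of many seconds or minutes.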
> On Jun 11, 2020, at 09:01, navin srivastava <navin.alt...@gmail.com> wrote:
>
> I tried by executing the debug mode but there also it is not writing anything.
>
> i waited for about 5-10 minutes
>
> deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v
>
> No output on terminal.
>
> The OS is SLES12-SP4 . All firewall services are disabled.
>
> The recent change is the local hostname earlier it was with local hostname node1,node2,etc but we have moved to dns based hostname which is deda
>
> NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10 State=UNKNOWN
>
> other than this it is fine but after that i have done several time slurmd process started on the node and it works fine but now i am seeing this issue today.
>
> Regards
> Navin.
>
> On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
> Navin,
>
> As you can see, systemd provides very little service-specific information. For slurm, you really need to go to the slurm logs to find out what happened.
>
> Hint: A quick way to identify problems like this with slurmd and slurmctld is to run them with the “-Dvvv” option, causing them to log to your window, and usually causing the problem to become immediately obvious.
>
> For example,
>
> # /usr/local/slurm/sbin/slurmd -Dvvvv
>
> Just it ^C when you’re done, if necessary. Of course, if it doesn’t fail when you run it this way, it’s time to look elsewhere.
>
> Andy
>
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of navin srivastava
> Sent: Thursday, June 11, 2020 8:25 AM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: [slurm-users] unable to start slurmd process.
>
> Hi Team,
>
> when i am trying to start the slurmd process i am getting the below error.
>
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)
>
> Slurm version is 17.11.8
>
> The server and slurm is running from long time and we have not made any changes but today when i am starting it is giving this error message.
>
> Any idea what could be wrong here.
>
> Regards
>
> Navin.