I am working on an OpenHPC/Warewulf cluster. When I start the slurmd service on the compute nodes the systemctl sits there for a long time, then reports:
Starting slurm (via systemctl): Job for slurm.service failed because a timeout was exceeded. See "systemctl status slurm.service" and "journalctl -xe" for details. [FAILED] However a slurmd process is started on the compute node. On the head node sinfo says the compute node is down and Not Responding I don;t expect my problem to be solved by the list, but would appreciate some diagnostics. debug level set to 3 in slurm.conf ________________________________ Scanned by MailMarshal - M86 Security's comprehensive email content security solution. ________________________________ Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Employees of XMA Ltd are expressly required not to make defamatory statements and not to infringe or authorise any infringement of copyright or any other legal right by email communications. Any such communication is contrary to company policy and outside the scope of the employment of the individual concerned. The company will not accept any liability in respect of such communication, and the employee responsible will be personally liable for any damages or other liability arising. XMA Limited is registered in England and Wales (registered no. 2051703). Registered Office: Wilford Industrial Estate, Ruddington Lane, Wilford, Nottingham, NG11 7EP