About an hour ago I found that some of the nodes in my environment were
still running an old, outdated slurmd. By that I mean that I had tried
to stop all Slurm daemons using the script
/etc/slurm/stop_all.sh, but this had not killed the slurmd daemon on some nodes.

So I explicitly logged into these nodes and did a "kill -9 <pid>" which
killed the slurmd process.
I then deleted the file /var/run/slurmd.pid from each of these nodes.

I then tried to start up Slurm using the /etc/slurm/start_all.sh script.
This time the slurmd daemon won't start on the same nodes where I had
explicitly killed it using the kill command.

I ran netstat on some of these nodes to make sure that the SlurmdPort was
not still in use, and it was not.
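
For reference, the port check was roughly along the lines of the sketch
below. The helper name port_in_listing is made up for illustration, and
6818 is only Slurm's default SlurmdPort; the real value is whatever
slurm.conf sets on your cluster. It assumes Linux-style "netstat -ltn"
output, where the local address is the fourth column and the state is
the sixth.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads `netstat -ltn` style
# output on stdin and exits 0 if a listening socket is bound to the
# given port, 1 otherwise.
port_in_listing() {
    port="$1"
    awk -v p="$port" '
        $6 == "LISTEN" {
            # Local address looks like 0.0.0.0:6818 or :::6818;
            # the port is whatever follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == p) found = 1
        }
        END { exit !found }
    '
}

# Example use (6818 is an assumption, see above):
#   netstat -ltn | port_in_listing 6818 && echo "port in use" || echo "port free"
```

On the affected nodes this reports the port as free, which is why I
suspect something other than a leftover socket.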

Is there something I am missing, perhaps a lock file that I should have
deleted and did not? Any help is appreciated.
