About an hour ago I found that some of the nodes in my environment were still running an outdated slurmd. That is, I had tried to stop all Slurm daemons using the script /etc/slurm/stop_all.sh, but it had not killed the Slurm daemon on some nodes.
So I logged into each of these nodes and ran "kill -9 <pid>", which killed the slurmd process. I then deleted the file /var/run/slurmd.pid on each of these nodes and tried to start Slurm using the /etc/slurm/start_all.sh script. Now the Slurm daemon won't start on the same nodes where I had explicitly killed it with the kill command. I ran netstat on some of these nodes to make sure that the SlurmdPort was not still in use, and it was not. Is there something I am missing? Perhaps a lock file that I should have deleted and did not? Any help is appreciated.
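
For reference, the two checks described above (is the pid file stale, and is the port still held) can be sketched in Python roughly as follows. This is only a minimal sketch, not part of Slurm: the function names are hypothetical, and the port to test is assumed to be whatever SlurmdPort is set to in slurm.conf.

```python
import errno
import os
import socket


def pid_is_running(pid):
    """Return True if a process with this pid exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def pidfile_state(path):
    """Classify a pid file as 'missing', 'stale', or 'live' (hypothetical helper)."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "stale"  # contents are not a pid, treat as stale
    return "live" if pid_is_running(pid) else "stale"


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to the port, i.e. nothing holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # SO_REUSEADDR so that leftover TIME_WAIT connections do not count as "in use"
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
        except OSError as e:
            if e.errno == errno.EADDRINUSE:
                return False
            raise
    return True
```

On an affected node this could be called as pidfile_state("/var/run/slurmd.pid") and port_is_free(<SlurmdPort>), with the port number taken from slurm.conf.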
