So this morning a node went down while a bunch of jobs were running, the slurm controller tried to reconfigure, and this was the only error message we got in the log file:
slurmctld: error: Unable to resolve "thenodename": Unknown host
slurmctld: fatal: slurm_set_addr failure on thenodename

The controller won't even restart if I set State=DOWN for the node in
/etc/slurm.conf. I have to remove the node from the configuration file
entirely before the controller will restart. This is a huge problem for
us! We expected graceful failover when a node goes down. Any idea what's
going on?

mike

On Tue, Aug 2, 2011 at 4:53 PM, <[email protected]> wrote:
> Your log should say what is happening. If not, try logging in as root and
> starting the daemon by hand with lots of debugging (-v's):
> "slurmctld -Dvvvvv"
>
> Quoting Mike Schachter <[email protected]>:
>
>> Hi there,
>>
>> If we have a node down, and then restart the slurm controller,
>> for some reason slurm won't start up! Is there some way to
>> ameliorate this issue?
>>
>> mike
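The "Unknown host" in the log means slurmctld could not turn the node's
name into an address at startup. A quick sanity check, sketched below
("thenodename" is just the placeholder from the log above, and this
assumes getent is available, as it is on typical Linux hosts), is to run
the same NSS lookup (/etc/hosts, then DNS) on the controller host:

```shell
# Ask the controller host to resolve the node name the same way
# slurmctld does (via the system resolver / NSS).
if getent hosts thenodename > /dev/null; then
    echo "resolves"
else
    echo "does not resolve"
fi
```

If the name genuinely cannot be resolved, one workaround is to pin the
address in slurm.conf with the NodeAddr parameter on the node's
NodeName line, so the controller does not depend on DNS for that node
(see slurm.conf(5) for details).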
