Many clusters run Slurm with nodes going down daily, and you are the first person to report this problem. My best guess is that your slurm.conf file is bad. What do the node configuration line(s) in your slurm.conf look like?
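For comparison, a minimal working node definition usually looks something like this (the node name, address, and CPU count below are placeholders, not values from your site):

  NodeName=node01 NodeAddr=10.0.0.1 CPUs=8 State=UNKNOWN
  PartitionName=debug Nodes=node01 Default=YES State=UP

The "Unable to resolve" error below means slurmctld could not turn the node's name into an IP address. If the name isn't in DNS, an explicit NodeAddr like the one above, or an entry for the name in /etc/hosts on the controller, should get you past it.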

Quoting Mike Schachter <[email protected]>:

So this morning a node went down while a bunch of jobs were running, the Slurm controller tried to reconfigure, and this was the only error message we got in the log file:

slurmctld: error: Unable to resolve "thenodename": Unknown host
slurmctld: fatal: slurm_set_addr failure on thenodename

The controller won't even restart if I set State=DOWN for the node
in /etc/slurm.conf. I have to manually remove the node from the
configuration file to get the controller to restart.
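For concreteness, the entry in question looks roughly like this (the name is the same placeholder as in the log above; the CPU count is made up):

  NodeName=thenodename CPUs=8 State=DOWN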

This is a huge problem for us! We expected graceful failover when
nodes go down. Any idea what's going on?

  mike


On Tue, Aug 2, 2011 at 4:53 PM,  <[email protected]> wrote:
Your log should say what is happening. If not, try logging on as root and
starting the daemon by hand with lots of debugging (-v's):
"slurmctld -Dvvvvv"

Quoting Mike Schachter <[email protected]>:

Hi there,

If a node is down and we then restart the Slurm controller,
for some reason it won't start up! Is there some way to
ameliorate this issue?

 mike

