See the configuration parameter SlurmctldTimeout as described here:
http://slurm.schedmd.com/slurm.conf.html
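
SlurmctldTimeout is the number of seconds the backup controller waits for
the primary to respond before it takes over. As a minimal sketch, the line
in slurm.conf would look like the fragment below. The value 30 is only an
illustration and an assumption on my part, not a recommendation; choose
something that suits your site.

```
# slurm.conf (fragment)
# Seconds the backup controller waits for the primary slurmctld
# to respond before assuming control.
# 30 is an illustrative value only, not a recommendation.
SlurmctldTimeout=30
```

In the backup log quoted below, the takeover happens exactly 300 seconds
after the failed ping, which would be consistent with SlurmctldTimeout=300
in your slurm.conf, although the excerpt you posted does not show that line
(the documented default is 120 seconds). Keep slurm.conf identical on both
controllers, and have the daemons re-read it after the change, for example
with "scontrol reconfigure" or by restarting slurmctld. A very small value
may cause the backup to take over during a brief network hiccup.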

Quoting Marc Vecsys:

Hi,
It takes 5 minutes for the backup controller to start after the master has
failed. Is there any setting that makes the switchover faster?
Thanks
Marc


slurm.conf file

ControlMachine=frontal1
ControlAddr=10.229.190.20
BackupController=frontal2
BackupAddr=10.229.190.21



scontrol ping
Slurmctld(primary/backup) at frontal1/frontal2 are DOWN/UP
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************



frontal2 (backup) log
[2014-03-05T08:55:29.874] debug3: pinging slurmctld at 10.229.190.20
[2014-03-05T08:55:29.875] debug2: _slurm_connect failed: Connection refused
[2014-03-05T08:55:29.875] debug2: Error connecting slurm stream socket at
10.229.190.20:6817: Connection refused
[2014-03-05T08:55:29.875] error: _ping_controller/slurm_send_node_msg
error: Connection refused

[2014-03-05T09:00:29.914] error: ControlMachine frontal1 not responding,
BackupController frontal2 taking over

[2014-03-05T09:00:29.914] Terminate signal (SIGINT or SIGTERM) received
[2014-03-05T09:00:29.914] debug:  sched: slurmctld terminating

