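See the configuration parameter SlurmctldTimeout, which sets how many seconds the backup controller waits for the primary controller to respond before taking over. It is described here: http://slurm.schedmd.com/slurm.conf.html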
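
As a rough sketch, a shorter timeout could be set in slurm.conf along these lines. The value 30 is only an illustrative assumption, not a recommendation. The 5-minute gap in the log quoted below is consistent with a current setting of 300, though the quoted configuration excerpt does not show it.

```
# Illustrative value only (assumption): seconds the backup controller
# waits for the primary to respond before assuming control.
SlurmctldTimeout=30
```

The same slurm.conf should be present on both controllers, and the daemons will most likely need a restart or reconfigure before the new value applies. A very short timeout may cause the backup to take over during brief network interruptions, so the value is a trade-off.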
Quoting Marc Vecsys <[email protected]>:
Hi,

It takes 5 minutes for the backup controller to start after the master fails. Is there any setting to make the switchover faster?

Thanks,
Marc

slurm.conf file

ControlMachine=frontal1
ControlAddr=10.229.190.20
BackupController=frontal2
BackupAddr=10.229.190.21

scontrol ping

Slurmctld(primary/backup) at frontal1/frontal2 are DOWN/UP
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************

frontal2 (backup) log

[2014-03-05T08:55:29.874] debug3: pinging slurmctld at 10.229.190.20
[2014-03-05T08:55:29.875] debug2: _slurm_connect failed: Connection refused
[2014-03-05T08:55:29.875] debug2: Error connecting slurm stream socket at 10.229.190.20:6817: Connection refused
[2014-03-05T08:55:29.875] error: _ping_controller/slurm_send_node_msg error: Connection refused
[2014-03-05T09:00:29.914] error: ControlMachine frontal1 not responding, BackupController frontal2 taking over
[2014-03-05T09:00:29.914] Terminate signal (SIGINT or SIGTERM) received
[2014-03-05T09:00:29.914] debug: sched: slurmctld terminating
