Hello Slurm folks, I've used Slurm for several years, but I had not previously used Slurm's BackupController redundancy/failover functionality. I tried it recently and it raised a couple of questions. This is on Slurm 2.2.1 under Debian 6.0 (Squeeze).
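For context, the relevant part of our slurm.conf looks roughly like this (a minimal sketch; the hostnames, path, and timeout are placeholders rather than our exact values):

    ControlMachine=ctl1         # primary controller
    BackupController=ctl2       # takes over if ctl1 is unreachable
    # Both controllers need access to the same state directory so the
    # backup can recover job state when it assumes control.
    StateSaveLocation=/shared/slurm/state
    # Seconds without contact before the backup assumes control.
    SlurmctldTimeout=120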
1. What happens, or is supposed to happen, if a job was submitted via the primary controller and the connection to the submitting salloc or srun breaks when the primary controller dies or becomes unreachable? In our case, we either run srun in the background with its I/O redirected to files, or wrap salloc in a process that daemonizes and writes stdout and stderr to files in the background. Should jobs continue to execute normally if there is no longer a sink for stdout/stderr, or if the part of the MPI session executing on the controller node becomes unavailable?

2. We tested the BackupController functionality using VMs for both the primary and backup controllers, by virtually unplugging the network connection on the primary. We noticed that while the BackupController assumed control, the primary appeared to mark compute nodes as down when it could not reach them. Meanwhile, if any jobs had been submitted via the backup controller, then when we restored the network connection to the primary, the primary saw the new jobs but also saw that they were assigned to nodes it had marked down, so it terminated those jobs. Is this to be expected?

In general, what should be expected of the primary/backup relationship? What is the backup's role while the primary is down, and can jobs be freely submitted via the backup during that time such that they continue running if the primary comes back up while they are still in flight?

Thank you.

V. Ram
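P.S. For reference, the two submission styles mentioned in question 1 look roughly like this (an illustrative sketch; the node counts, program name, and file names are made up, not our actual wrappers):

    # Style 1: srun in the background with its I/O redirected to files.
    srun --nodes=2 --output=job.out --error=job.err ./mpi_app &

    # Style 2: salloc wrapped so it detaches from the terminal and
    # writes stdout/stderr to files in the background.
    setsid sh -c 'salloc --nodes=2 srun ./mpi_app >job.out 2>job.err' \
        </dev/null &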