Hello Slurm folks,

I've used Slurm for several years, but until recently I had never used
Slurm's BackupController redundancy/failover functionality.  Trying it
out raised a couple of questions.

This is on Slurm 2.2.1 under Debian 6.0 (Squeeze).

1. What happens, or is supposed to happen, if a job was submitted to
Slurm via the primary controller and the connections to the submitting
salloc or srun get broken when the primary controller dies or becomes
unavailable?  In our case, we use either srun's background I/O
redirection, or we wrap salloc in a process that daemonizes and writes
stdout and stderr to files in the background.
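
For concreteness, the two submission patterns look roughly like the
sketch below (the application name, node count, and file names are
placeholders, and nohup stands in for our actual daemonizing wrapper):

    # Pattern 1: srun redirects its own stdout/stderr to files and
    # runs in the background
    srun -N 4 --output=job.out --error=job.err ./mpi_app &

    # Pattern 2: salloc wrapped so it detaches from the terminal and
    # logs stdout/stderr to a file in the background
    nohup salloc -N 4 srun ./mpi_app > job.out 2>&1 &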

Should jobs continue to execute normally if there is no more sink for
stdout/stderr, or if the part of the MPI session executing on the
controller node becomes unavailable?

2. We tested the BackupController functionality using VMs for both the
primary and backup controllers.  Tests were done by disconnecting the
primary's virtual network interface.
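
For reference, the failover-related part of our slurm.conf looks
roughly like this (the hostnames, path, and timeout are placeholders
from our test setup, not a recommendation):

    ControlMachine=ctl-primary
    BackupController=ctl-backup
    # both controllers share the same state directory
    StateSaveLocation=/var/spool/slurm/state
    # seconds the backup waits for the primary before taking over
    SlurmctldTimeout=120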

We noticed that while the BackupController assumed control, the primary
seemed to mark compute nodes as down when it couldn't reach them.
Meanwhile, if any jobs had been submitted via the backup controller,
then when we restored the network connection to the primary, it saw
those new jobs but also saw that they were assigned to nodes it had
marked down, so it terminated them.  Is this to be expected?
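
(In case anyone wants to reproduce this: we watched node and job state
from whichever controller was answering with something like the
commands below; the node name is a placeholder.)

    # check what state a given compute node is in
    scontrol show node compute01 | grep State
    # list jobs and the nodes they were assigned to
    squeue -l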

In general, what should be expected of the primary/backup relationship?
What is the backup's role supposed to be while the primary is down, and
can jobs be freely submitted via the backup during that time, such that
they keep running if the primary comes back up?

Thank you.

V. Ram

