Hi all! I would like to start a discussion with developers, and especially with Slurm users/admins, about Slurm high availability.
First, I would like to ask you to share the HA solutions you use on your clusters. Second, I would like your advice and suggestions on the best HA approach for a specific setup.

Let's say that we have two management nodes, admin1 and admin2, and that both local and shared filesystems are available to these nodes. We can provide NFS exports, DRBD and whatever other services are needed for HA. admin1 will be the primary controller and admin2 will be the backup controller. We want to provide high availability for the slurmctld, slurmdbd and mysqld daemons. The database files also need an HA approach, most probably on a shared filesystem.

So, which would be the best approach to provide HA for Slurm? We want a solution in which accounting is highly available as well.

I look forward to your answers. Thanks in advance!

Best Regards,
Chrysovalantis Paschoulas
Juelich Supercomputing Centre
Forschungszentrum Juelich
