There hasn't been as much effort to make slurmdbd as resilient as you are hinting at because there has been no need.

The database itself can be made resilient to keep the data safe. Data that cannot be written to the database is cached until a slurmdbd becomes available again, even if that means failing over to the AccountingStorageBackupHost. So the only potential 'loss' is immediate access to data that may sit in a cache until a slurmdbd server is accessible again.
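
As a rough illustration of the failover setting mentioned above, a slurm.conf fragment might look like the following. The hostnames dbd1.example.com and dbd2.example.com are placeholders, not taken from any real site, and the rest of the accounting configuration is left out.

```
# slurm.conf (fragment); hostnames are hypothetical placeholders
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd1.example.com
AccountingStorageBackupHost=dbd2.example.com
```

With something like this in place, the controller would try the backup host when the primary slurmdbd cannot be reached.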

You can have multiple slurmdbd servers running and point any system at whichever one you like. In that respect, a simple way to do it would be to put round-robin DNS or a load balancer in front of the slurmdbd servers and have clients connect through that.
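
If you go the load balancer route, it will need some way to decide which slurmdbd to send traffic to. Below is a minimal sketch of a TCP reachability probe in Python, assuming slurmdbd listens on its default DbdPort of 6819. The function names and host names are made up for illustration, and an open port only shows that something is listening, not that slurmdbd is healthy.

```python
import socket

DEFAULT_DBD_PORT = 6819  # slurmdbd's default DbdPort; adjust if your site changed it


def is_reachable(host, port=DEFAULT_DBD_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def first_reachable(hosts, port=DEFAULT_DBD_PORT, timeout=2.0):
    """Return the first host in hosts that accepts a connection, or None."""
    for host in hosts:
        if is_reachable(host, port, timeout):
            return host
    return None


if __name__ == "__main__":
    # Hypothetical host names; replace with your own slurmdbd servers.
    print(first_reachable(["dbd1.example.com", "dbd2.example.com"]))
```

Most load balancers and Keepalived have an equivalent TCP check built in, so a script like this is mainly useful for testing or for a custom check hook.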

Brian Andrus

On 2/15/2022 7:46 AM, Xand Meaden wrote:
Hello,

I'm wondering what others are doing to make their slurmdbd service resilient? We have the following setup right now:

- two VMs running slurmctld (and also slurmdbd)
- shared storage for StateSaveLocation using CephFS
- three-way mysql cluster using Percona XtraDB

However I can see no "Slurm native" way to make slurmdbd resilient - there is no option for a backup server in slurm.conf. I naively tried setting the AccountingStorageHost to "localhost" but this only worked on the primary control node.

Can we use something like Keepalived to present slurmdbd running on both control nodes via a floating IP, or will this cause complications with Slurm's use of it?

Thanks for any advice,
Xand


