[slurm-dev] Re: Slurm with High Availability/Automatic failover
Hello,

On 25.07.2017 at 16:19, J. Smith wrote:
> Does anyone have any suggestions for setting up high availability and
> automatic failover between two servers that run a Controller daemon,
> Database daemon and MySQL database (i.e. replication vs. Galera cluster)?
>
> Any input would be appreciated.

We use Ganeti instances for most services, in our case KVM (configurable on a per-cluster basis) plus DRBD (instance storage). On Debian they are rock solid. While HA is experimentally possible, the default is intentionally to go without automatic failover:
http://docs.ganeti.org/ganeti/2.15/html/design-linuxha.html#risks

From my point of view, a failing Slurm controller is such a rare event that I prefer to have a look first and only then do a manually triggered fast failover.

On the other hand, the (unwritten) expected SLA for most services here is 90% per week and month, and 95% per year. Sure, that is relaxed; not knowing your needs, it might just be an HPC kindergarten from your perspective.

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323
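To put the SLA figures above into perspective, here is a minimal sketch that converts an availability target into a downtime budget. The period lengths (a 30-day month, a 365-day year) and the function name are assumptions made for illustration; they are not from the original message.

```python
# Convert an availability target into the downtime it permits.
# Assumed period lengths: 7-day week, 30-day month, 365-day year.
HOURS = {"week": 7 * 24, "month": 30 * 24, "year": 365 * 24}


def downtime_budget_hours(availability, period):
    """Return the hours of downtime allowed by `availability` (0..1) over `period`."""
    return (1.0 - availability) * HOURS[period]


# The targets quoted in the message: 90% per week and month, 95% per year.
for availability, period in [(0.90, "week"), (0.90, "month"), (0.95, "year")]:
    hours = downtime_budget_hours(availability, period)
    print(f"{availability:.0%} per {period}: {hours:.1f} h")
```

Under these assumptions the budget is about 16.8 hours per week, 72 hours per month, and 438 hours (roughly 18 days) per year, which leaves room for a manually triggered failover.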
[slurm-dev] RE: Slurm with High Availability/Automatic failover
Hi,

Thank you both for sharing; I would definitely like to hear more about it.

Davide, what type of issues did you run into with the parallel filesystem? Are you using keepalived for both the controller and the database daemon? And what type of setup do you use for the database itself: master/slave?

On my side, Slurm is installed on our GPFS parallel filesystem. We use two servers, each running the controller and database daemons and a MariaDB database. For the time being, we just replicate the master database to a slave on the second server, and we want to change this configuration to get better, automated failover. Failover works fine as master/slave for slurmctld, but we are having issues failing over slurmdbd.
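When comparing how slurmctld and slurmdbd behave during a failover test on a two-server setup like the one described above, a quick reachability probe can help. The following is a minimal sketch, not a health check: it only tests whether something accepts TCP connections on the default Slurm ports (6817 for slurmctld, 6819 for slurmdbd). The host names are hypothetical, and the ports must be adjusted if SlurmctldPort or DbdPort are set differently on your cluster.

```python
import socket

# Default Slurm ports; change these if SlurmctldPort/DbdPort are overridden.
SERVICES = {"slurmctld": 6817, "slurmdbd": 6819}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe(hosts, services=SERVICES):
    """Map each host to a {service name: reachable?} dictionary."""
    return {
        host: {name: port_open(host, port) for name, port in services.items()}
        for host in hosts
    }


# Example call (hypothetical host names):
#   probe(["ctl1.example.org", "ctl2.example.org"])
```

An open port only shows that a daemon is listening; it does not say whether that daemon is currently acting as primary or backup, so treat the result as a first hint before looking at the daemon logs.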
[slurm-dev] RE: Slurm with High Availability/Automatic failover
Gary,

Would it be possible to get some additional details on your experience with DRBD?

Thank you.

--
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu
[slurm-dev] RE: Slurm with High Availability/Automatic failover
We use an NFS appliance for storing state files. The NFS has been VERY stable. We tried the DRBD shared volume but found that our problems were more likely to come from DRBD than from slurmctld.

- Gary Skouson
[slurm-dev] RE: Slurm with High Availability/Automatic failover
We are currently experimenting with keepalived+DRBD to build an HA cluster with two nodes, where both the controller and the database are hosted on the same node. We are pursuing this route because we experienced significant performance and stability issues with having the state files on the cluster parallel filesystem.

We are still in the early stages of testing, but I will be happy to share our experience if you are interested.

--
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu

On 2017-07-25 09:20:55-05:00 J. Smith wrote:
> Does anyone have any suggestions for setting up high availability and
> automatic failover between two servers that run a Controller daemon,
> Database daemon and MySQL database (i.e. replication vs. Galera cluster)?
>
> Any input would be appreciated.
>
> Thanks!