[slurm-dev] Re: Slurm with High Availability/Automatic failover

2017-07-26 Thread Benjamin Redling
Hello,

On 25.07.2017 at 16:19, J. Smith wrote:
> Does anyone have any suggestions for setting up high availability and
> automatic failover between two servers that run a controller daemon,
> database daemon and MySQL database (e.g. replication vs. Galera cluster)?
> 
> Any input would be appreciated.

We use Ganeti instances for most services, in our case with KVM as the
hypervisor (configurable on a per-cluster basis) and DRBD for instance
storage. On Debian they are rock solid.
While HA is experimentally possible, Ganeti by default intentionally goes
without automatic fail-over:
http://docs.ganeti.org/ganeti/2.15/html/design-linuxha.html#risks

From my point of view a failing Slurm controller is such a rare event
that I prefer to have a look first and only then trigger a fast
fail-over manually.
  On the other hand, the (unwritten) expected SLA for most services here
is 90% per week and per month, and 95% per year.
Sure, that is relaxed; not knowing your needs, it might just be an
HPC kindergarten from your perspective.
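
To put those figures in perspective, here is a small sketch of how much
downtime each target allows. The period lengths (7-day week, 30-day month,
365-day year) and the function name are assumptions for illustration only;
they are not part of any written SLA.

```python
# Allowed downtime for an availability target over a given period.
# Period lengths are assumptions: 7-day week, 30-day month, 365-day year.
HOURS = {"week": 7 * 24, "month": 30 * 24, "year": 365 * 24}


def allowed_downtime_hours(availability, period):
    """Return the hours of downtime permitted by `availability` (0..1)."""
    return (1.0 - availability) * HOURS[period]


for target, period in [(0.90, "week"), (0.90, "month"), (0.95, "year")]:
    hours = allowed_downtime_hours(target, period)
    print(f"{target:.0%} per {period}: {hours:.1f} h")
```

Under these assumptions that is roughly 16.8 hours per week, 72 hours per
month and 438 hours (about 18 days) per year, which leaves plenty of room
for a manually triggered fail-over.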


Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323





[slurm-dev] RE: Slurm with High Availability/Automatic failover

2017-07-25 Thread J. Smith
Hi,

Thank you both for sharing; I would definitely like to hear more about
it.

Davide, what type of issues did you run into with the parallel filesystem?
Are you using keepalived for both the controller and the database daemon?
How about the database itself: what type of setup is it, master/slave?

On my side, Slurm is installed on our GPFS parallel filesystem. We use two
servers, each running the controller and database daemons and a MariaDB
database. For the time being, we are just replicating the master database
to a slave on the second server, and we want to change this configuration
to one with better, automated failover. Failover works fine as
master/slave for slurmctld, but we are having issues failing over slurmdbd.
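
As a rough way to see which daemons are reachable on each server while
testing failover, a minimal sketch follows. It assumes the default Slurm
ports (6817 for slurmctld, 6819 for slurmdbd); the function names are
hypothetical helpers, not part of Slurm. An open port only shows that a
daemon accepts connections, not which controller is currently active.

```python
import socket

# Default Slurm ports; adjust if SlurmctldPort / DbdPort are set differently.
SLURMCTLD_PORT = 6817
SLURMDBD_PORT = 6819


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


def daemon_status(hosts, probe=port_open):
    """Map each host to whether its slurmctld and slurmdbd ports answer."""
    return {
        host: {
            "slurmctld": probe(host, SLURMCTLD_PORT),
            "slurmdbd": probe(host, SLURMDBD_PORT),
        }
        for host in hosts
    }
```

Calling daemon_status() with the names of the two servers returns one
entry per host, so a slurmdbd that did not come up on the second server
after a failover test shows up as False there.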

On Tue, Jul 25, 2017 at 12:31 PM, Vanzo, Davide <davide.va...@vanderbilt.edu> wrote:

> Gary,
>
> Would it be possible to get some additional details on your experience
> with DRBD?
> Thank you.
>
>
> --
> *Davide Vanzo, PhD*
> Application Developer
> Adjunct Assistant Professor of Chemical and Biomolecular Engineering
> Advanced Computing Center for Research and Education (ACCRE)
> Vanderbilt University - Hill Center 201
> (615)-875-9137
> www.accre.vanderbilt.edu
>
>
> On 2017-07-25 11:13:30-05:00 Skouson, Gary B wrote:
>
> We use an NFS appliance for storing state files.  The NFS has been VERY
> stable.  We tried a DRBD shared volume but found that our problems were
> more likely to come from DRBD than from slurmctld.
>
> -
> Gary Skouson
>
> *From:* Vanzo, Davide [mailto:davide.va...@vanderbilt.edu]
> *Sent:* Tuesday, July 25, 2017 8:39 AM
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Cc:* slurm-dev@schedmd.com
> *Subject:* [slurm-dev] RE: Slurm with High Availability/Automatic failover
>
>
>
> We are currently experimenting with keepalived+DRBD to have an HA cluster
> with two nodes where both the controller and the database are hosted on the
> same node. We are pursuing this route because we experienced significant
> performance and stability issues with having the state files on the
> cluster parallel filesystem.
>
> We are still in the early stages of testing but I will be happy to share
> our experience if you are interested.
>
> --
> *Davide Vanzo, PhD*
> Application Developer
> Adjunct Assistant Professor of Chemical and Biomolecular Engineering
> Advanced Computing Center for Research and Education (ACCRE)
> Vanderbilt University - Hill Center 201
> (615)-875-9137
> www.accre.vanderbilt.edu
>
>
> On 2017-07-25 09:20:55-05:00 J. Smith wrote:
>
> Does anyone have any suggestions for setting up high availability and
> automatic failover between two servers that run a controller daemon,
> database daemon and MySQL database (e.g. replication vs. Galera cluster)?
>
> Any input would be appreciated.
>
> Thanks!
>
>


[slurm-dev] RE: Slurm with High Availability/Automatic failover

2017-07-25 Thread Vanzo, Davide
Gary,

Would it be possible to get some additional details on your experience with 
DRBD?
Thank you.


--
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu


On 2017-07-25 11:13:30-05:00 Skouson, Gary B wrote:
We use an NFS appliance for storing state files.  The NFS has been VERY stable.
We tried a DRBD shared volume but found that our problems were more likely
to come from DRBD than from slurmctld.

-
Gary Skouson


From: Vanzo, Davide [mailto:davide.va...@vanderbilt.edu]
Sent: Tuesday, July 25, 2017 8:39 AM
To: slurm-dev <slurm-dev@schedmd.com>
Cc: slurm-dev@schedmd.com
Subject: [slurm-dev] RE: Slurm with High Availability/Automatic failover

We are currently experimenting with keepalived+DRBD to have an HA cluster with
two nodes where both the controller and the database are hosted on the same
node. We are pursuing this route because we experienced significant
performance and stability issues with having the state files on the cluster
parallel filesystem.
We are still in the early stages of testing but I will be happy to share our 
experience if you are interested.
--
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu


On 2017-07-25 09:20:55-05:00 J. Smith wrote:
Does anyone have any suggestions for setting up high availability and automatic
failover between two servers that run a controller daemon, database daemon and
MySQL database (e.g. replication vs. Galera cluster)?
Any input would be appreciated.
Thanks!


[slurm-dev] RE: Slurm with High Availability/Automatic failover

2017-07-25 Thread Skouson, Gary B
We use an NFS appliance for storing state files.  The NFS has been VERY stable.
We tried a DRBD shared volume but found that our problems were more likely
to come from DRBD than from slurmctld.

-
Gary Skouson


From: Vanzo, Davide [mailto:davide.va...@vanderbilt.edu]
Sent: Tuesday, July 25, 2017 8:39 AM
To: slurm-dev <slurm-dev@schedmd.com>
Cc: slurm-dev@schedmd.com
Subject: [slurm-dev] RE: Slurm with High Availability/Automatic failover

We are currently experimenting with keepalived+DRBD to have an HA cluster with
two nodes where both the controller and the database are hosted on the same
node. We are pursuing this route because we experienced significant
performance and stability issues with having the state files on the cluster
parallel filesystem.
We are still in the early stages of testing but I will be happy to share our 
experience if you are interested.
--
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu


On 2017-07-25 09:20:55-05:00 J. Smith wrote:
Does anyone have any suggestions for setting up high availability and automatic
failover between two servers that run a controller daemon, database daemon and
MySQL database (e.g. replication vs. Galera cluster)?
Any input would be appreciated.
Thanks!


[slurm-dev] RE: Slurm with High Availability/Automatic failover

2017-07-25 Thread Vanzo, Davide
We are currently experimenting with keepalived+DRBD to have an HA cluster with
two nodes where both the controller and the database are hosted on the same
node. We are pursuing this route because we experienced significant
performance and stability issues with having the state files on the cluster
parallel filesystem.
We are still in the early stages of testing but I will be happy to share our 
experience if you are interested.
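
For anyone unfamiliar with how keepalived picks the active node, here is a
deliberately simplified model of its priority-based election: the node with
the highest effective priority holds the virtual IP, and a failing health
check lowers a node's priority by a configured weight. The class, the node
names and the numbers are hypothetical, and the sketch ignores preemption
settings, advertisement timers, tie-breaking and the DRBD promotion step.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    priority: int      # base VRRP priority (higher wins)
    healthy: bool      # result of the local health check
    weight: int = -20  # penalty applied while the check fails (assumed value)

    def effective_priority(self):
        """Base priority, reduced by the penalty when the check fails."""
        if self.healthy:
            return self.priority
        return self.priority + self.weight


def elect_master(nodes):
    """Return the node with the highest effective priority."""
    return max(nodes, key=lambda n: n.effective_priority())


# Two-node example: "node-a" is preferred while its check passes.
nodes = [Node("node-a", 100, healthy=True), Node("node-b", 90, healthy=True)]
print(elect_master(nodes).name)

# When the check on node-a fails, its priority drops to 80 and node-b wins.
nodes[0].healthy = False
print(elect_master(nodes).name)
```

The penalty has to be larger than the gap between the two base priorities,
otherwise a failing node keeps the virtual IP.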
--
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu


On 2017-07-25 09:20:55-05:00 J. Smith wrote:

Does anyone have any suggestions for setting up high availability and automatic
failover between two servers that run a controller daemon, database daemon and
MySQL database (e.g. replication vs. Galera cluster)?
Any input would be appreciated.
Thanks!