On Tue, 2019-02-19 at 20:06 +0000, Eric Robinson wrote:
> > -----Original Message-----
> > From: Users <users-boun...@clusterlabs.org> On Behalf Of Ken Gaillot
> > Sent: Tuesday, February 19, 2019 10:31 AM
> > To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
> > 
> > On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
> > > > -----Original Message-----
> > > > From: Users <users-boun...@clusterlabs.org> On Behalf Of Andrei Borzenkov
> > > > Sent: Sunday, February 17, 2019 11:56 AM
> > > > To: users@clusterlabs.org
> > > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
> > > > 
> > > > 17.02.2019 0:44, Eric Robinson wrote:
> > > > > Thanks for the feedback, Andrei.
> > > > > 
> > > > > I only want cluster failover to occur if the filesystem or drbd resources fail, or if the cluster messaging layer detects a complete node failure. Is there a way to tell Pacemaker not to trigger a cluster failover if any of the p_mysql resources fail?
> > > > > 
> > > > Let's look at this differently. If all these applications depend on each other, you should not be able to stop an individual resource in the first place - you need to group them or define dependencies so that stopping any resource would stop everything.
> > > > 
> > > > If these applications are independent, they should not share resources. Each MySQL application should have its own IP, its own FS, and its own block device for that FS, so that they can be moved between cluster nodes independently.
> > > > 
> > > > Anything else will lead to trouble, as you have already observed.
> > > 
> > > FYI, the MySQL services do not depend on each other. All of them depend on the floating IP, which depends on the filesystem, which depends on DRBD, but they do not depend on each other. Ideally, the failure of p_mysql_002 should not cause failure of other mysql resources, but now I understand why it happened. Pacemaker wanted to start it on the other node, so it needed to move the floating IP, filesystem, and DRBD primary, which had the cascade effect of stopping the other MySQL resources.
> > > 
> > > I think I also understand why the p_vip_clust01 resource blocked.
> > > 
> > > FWIW, we've been using Linux HA since 2006, originally Heartbeat, but then Corosync+Pacemaker. The past 12 years have been relatively problem-free. This symptom is new for us, only within the past year. Our cluster nodes have many separate instances of MySQL running, so it is not practical to have that many filesystems, IPs, etc. We are content with the way things are, except for this new troubling behavior.
> > > 
> > > If I understand the thread correctly, on-fail=stop will not work because the cluster will still try to stop the resources that are implied dependencies.
> > > 
> > > Bottom line is, how do we configure the cluster in such a way that there are no cascading effects when a MySQL resource fails? Basically, if a MySQL resource fails, it fails. We'll deal with that on an ad-hoc basis. I don't want the whole cluster to barf.
> > > What about on-fail=ignore? Earlier, you suggested symmetrical=false might also do the trick, but you said it comes with its own can of worms. What are the downsides of on-fail=ignore or symmetrical=false?
> > > 
> > > --Eric
> > 
> > Even adding on-fail=ignore to the recurring monitors may not do what you want, because I suspect that even an ignored failure will make the node less preferable for all the other resources. But it's worth testing.
> > 
> > Otherwise, your best option is to remove all the recurring monitors from the mysql resources and rely on external monitoring (e.g. nagios, icinga, monit, ...) to detect problems.
> 
> This is probably a dumb question, but can we remove just the monitor operation but leave the resource configured in the cluster? If a node fails over, we do want the resources to start automatically on the new primary node.
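If you do want to test the on-fail=ignore idea first, the syntax would be something along these lines (a sketch only - the interval here is made up and has to match the monitor you actually have configured):

    # Hypothetical example: tell Pacemaker to ignore failures of the
    # recurring monitor on this one resource
    pcs resource update p_mysql_002 op monitor interval=60s on-fail=ignore

Then induce a failure and watch whether the node still becomes less preferable for the other resources, per the caveat above.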
To answer your question: yes, operations can be added/removed without affecting the configuration of the resource itself.
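As a sketch of how that looks with pcs (untested here, and the interval and timeout are placeholders that must match what is actually in your configuration):

    # Show the resource, including its currently configured operations
    pcs resource show p_mysql_002

    # Remove just the recurring monitor; the resource definition,
    # start/stop operations, and constraints are left untouched
    pcs resource op remove p_mysql_002 monitor interval=60s

    # To add the monitor back later:
    pcs resource op add p_mysql_002 monitor interval=60s timeout=30s

-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org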