> -----Original Message-----
> From: Users <[email protected]> On Behalf Of Andrei Borzenkov
> Sent: Friday, February 26, 2021 11:27 AM
> To: [email protected]
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?
>
> 26.02.2021 19:19, Eric Robinson wrote:
> > At 5:16 am Pacific time Monday, one of our cluster nodes failed and its mysql services went down. The cluster did not automatically recover.
> >
> > We're trying to figure out:
> >
> >
> >   1.  Why did it fail?
>
> Pacemaker only registered the loss of connection between the two nodes. You
> need to investigate why that happened.
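>
> As a sketch (assuming corosync runs under systemd; log locations vary by
> distribution), you could check the ring status and the membership messages
> around the time of the event:
>
> # corosync-cfgtool -s
> # journalctl -u corosync --since "2021-02-22 05:15" --until "2021-02-22 05:20"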
>
> >   2.  Why did it not automatically recover?
> >
> > The cluster did not recover until we manually executed...
> >
>
> The *cluster* never failed in the first place. A specific resource may have.
> Do not confuse things more than necessary.
>
> > # pcs resource cleanup p_mysql_622
> >
>
> Because this resource failed to stop, and a stop failure is fatal.
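>
> Before cleaning it up, the recorded failure can be inspected with the
> standard pacemaker tooling, for example:
>
> # pcs resource failcount show p_mysql_622
> # crm_mon --one-shot --inactive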
>
> > Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:        * Stop       p_mysql_622      (                 001db01a )   due to no quorum
>
> The remaining node lost quorum and decided to stop its resources.
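>
> What a partition does on quorum loss is governed by the no-quorum-policy
> cluster property (default "stop"). As a sketch (pcs syntax varies slightly
> by version), the current setting can be checked with:
>
> # pcs property show no-quorum-policy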
>

I consider this a cluster failure, exacerbated by a resource failure. We can 
investigate why resource p_mysql_622 failed to stop, but it seems the 
underlying problem is the loss of quorum. That should not have happened with 
the qdevice in the mix, should it?

I'm confused about what is supposed to happen here. If the root cause is that 
node 001db01a briefly lost all communication with the network (just guessing), 
then it should have taken no action, including STONITH, since there would be no 
quorum. (There is no physical STONITH device anyway, as both nodes are in 
Azure.) Meanwhile, node 001db01b would still have had quorum (itself plus the 
qdevice), and should have assumed ownership of the resources and started them, 
or no?
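
For reference, what each node believes about quorum and the qdevice
connection can be checked with the standard tools, e.g.:

# corosync-quorumtool -s
# corosync-qdevice-tool -s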

> > Feb 22 05:16:30 [91683] 001db01a       crmd:   notice: te_rsc_command:  Initiating stop operation p_mysql_622_stop_0 locally on 001db01a | action 76
> ...
> > Feb 22 05:16:30 [91680] 001db01a       lrmd:     info: log_execute:     executing - rsc:p_mysql_622 action:stop call_id:308
> ...
> > Feb 22 05:16:45 [91680] 001db01a       lrmd:  warning: child_timeout_callback:  p_mysql_622_stop_0 process (PID 19225) timed out
> > Feb 22 05:16:45 [91680] 001db01a       lrmd:  warning: operation_finished:      p_mysql_622_stop_0:19225 - timed out after 15000ms
> > Feb 22 05:16:45 [91680] 001db01a       lrmd:     info: log_finished:    finished - rsc:p_mysql_622 action:stop call_id:308 pid:19225 exit-code:1 exec-time:15002ms queue-time:0ms
> > Feb 22 05:16:45 [91683] 001db01a       crmd:    error: process_lrm_event:       Result of stop operation for p_mysql_622 on 001db01a: Timed Out | call=308 key=p_mysql_622_stop_0 timeout=15000ms
> ...
> > Feb 22 05:16:38 [112948] 001db01b    pengine:     info: LogActions:      Leave   p_mysql_622     (Started unmanaged)
>
> At this point pacemaker stops managing this resource because its status is
> unknown. The normal reaction to a stop failure is to fence the node and fail
> the resource over, but apparently you also do not have working stonith.
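>
> On Azure the usual agent is fence_azure_arm. A sketch, with placeholder
> credentials and VM names (parameter names vary slightly by agent version):
>
> # pcs stonith create st_azure fence_azure_arm \
>     username="<app-id>" password="<app-secret>" \
>     resourceGroup="<resource-group>" tenantId="<tenant-id>" \
>     subscriptionId="<subscription-id>" \
>     pcmk_host_map="001db01a:<vm-name-a>;001db01b:<vm-name-b>"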
>
> The loss of quorum may be related to a network issue in which the nodes lost
> their connections both to each other and to the quorum device.
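>
> If that is what happened, both the node-to-node and the node-to-qdevice
> drops should appear at the same timestamp in the corosync log (often
> /var/log/cluster/corosync.log on RHEL/CentOS 7 style installs), e.g.:
>
> # grep -Ei "quorum|qdevice|membership" /var/log/cluster/corosync.log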