26.02.2021 20:23, Eric Robinson wrote:
>> -----Original Message-----
>> From: Digimer <[email protected]>
>> Sent: Friday, February 26, 2021 10:35 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <[email protected]>; Eric Robinson <[email protected]>
>> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
>> Down Anyway?
>>
>> On 2021-02-26 11:19 a.m., Eric Robinson wrote:
>>> At 5:16 am Pacific time Monday, one of our cluster nodes failed and
>>> its mysql services went down. The cluster did not automatically recover.
>>>
>>> We're trying to figure out:
>>>
>>> 1. Why did it fail?
>>> 2. Why did it not automatically recover?
>>>
>>> The cluster did not recover until we manually executed...
>>>
>>> # pcs resource cleanup p_mysql_622
>>>
>>> OS: CentOS Linux release 7.5.1804 (Core)
>>>
>>> Cluster version:
>>>
>>> corosync.x86_64          2.4.5-4.el7   @base
>>> corosync-qdevice.x86_64  2.4.5-4.el7   @base
>>> pacemaker.x86_64         1.1.21-4.el7  @base
>>>
>>> Two nodes: 001db01a, 001db01b
>>>
>>> The following log snippet is from node 001db01a:
>>>
>>> [root@001db01a cluster]# grep "Feb 22 05:1[67]" corosync.log-20210223
>>
>> <snip>
>>
>>> Feb 22 05:16:30 [91682] 001db01a pengine: warning: cluster_status:
>> Fencing and resource management disabled due to lack of quorum
>>
>> Seems like there was no quorum from this node's perspective, so it won't do
>> anything. What does the other node's logs say?
>>
>
> The logs from the other node are at the bottom of the original email.
>
>> What is the cluster configuration? Do you have stonith (fencing) configured?
>
> 2-node with a separate qdevice. No fencing.
>
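(For what it's worth, adding fencing to a pcs-managed CentOS 7 cluster like
this one would look roughly like the sketch below. fence_ipmilan is only an
example agent, and the addresses and credentials are placeholders; use
whatever fence agent matches your hardware.)

# pcs stonith create fence_001db01a fence_ipmilan \
    pcmk_host_list="001db01a" ipaddr="192.0.2.11" login="admin" passwd="secret" \
    op monitor interval=60s
# pcs stonith create fence_001db01b fence_ipmilan \
    pcmk_host_list="001db01b" ipaddr="192.0.2.12" login="admin" passwd="secret" \
    op monitor interval=60s
# pcs property set stonith-enabled=true

(Optionally, keep each fence device off the node it is meant to kill:)

# pcs constraint location fence_001db01a avoids 001db01a
# pcs constraint location fence_001db01b avoids 001db01b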
I wonder what the expected behavior is in this case; the pacemaker
documentation is rather silent. It explains what happens on nodes that are
out of quorum, but it is unclear whether (and when) quorate nodes will take
over resources from nodes out of quorum. In this case 001db01b does not seem
to do anything at all for 15 seconds (while 001db01a begins stopping
resources) until 001db01a reappears:

Feb 22 05:15:56 [112947] 001db01b attrd: info: pcmk_cpg_membership: Group attrd event 15: 001db01a (node 1 pid 91681) left via cluster exit
...
Feb 22 05:15:56 [112943] 001db01b pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=21): Try again (6)
...
Feb 22 05:16:11 [112947] 001db01b attrd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=424856): Try again (6)
Feb 22 05:16:11 [112945] 001db01b stonith-ng: info: pcmk_cpg_membership: Group stonith-ng event 16: node 1 pid 91679 joined via cluster join

BTW, the time on the nodes seems to be 30 seconds off.

>> Quorum is a useful tool when things are working properly, but it doesn't help
>> when things enter an undefined / unexpected state.
>> When that happens, stonith saves you. So said another way, you must have
>> stonith for a stable cluster, quorum is optional.
>>
>
> In this case, if fencing was enabled, which node would have fenced the other?
> Would they have gotten into a STONITH war?
>

Looks like 001db01b retained quorum, so it would have fenced 001db01a.

> More importantly, why did the failure of resource p_mysql_622 keep the whole
> cluster from recovering?

The resources on 001db01b continued to be up as far as I can tell, so "the
whole cluster" is an exaggeration. 001db01a tried to stop its resources due
to the quorum loss:

Feb 22 05:16:30 [91682] 001db01a pengine: notice: LogAction: * Stop p_fs_clust01 ( 001db01a ) due to no quorum
  * Stop p_mysql_001 ( 001db01a ) due to no quorum
Feb 22 05:16:30 [91682] 001db01a pengine: notice: LogAction: * Stop p_mysql_000 ( 001db01a ) due to no quorum
Feb 22 05:16:30 [91682] 001db01a pengine: notice: LogAction: * Stop p_mysql_002 ( 001db01a ) due to no quorum
Feb 22 05:16:30 [91682] 001db01a pengine: notice: LogAction: * Stop p_mysql_003 ( 001db01a ) due to no quorum
Feb 22 05:16:30 [91682] 001db01a pengine: notice: LogAction: * Stop p_mysql_004 ( 001db01a ) due to no quorum
Feb 22 05:16:30 [91682] 001db01a pengine: notice: LogAction: * Stop p_mysql_005 ( 001db01a ) due to no quorum
Feb 22 05:16:30 [91682] 001db01a pengine: notice: LogAction: * Stop p_mysql_622 ( 001db01a ) due to no quorum
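(As far as I understand it, what a node does when it loses quorum is
governed by the no-quorum-policy cluster property; the default is "stop",
which matches what 001db01a did here, and the other accepted values are
"ignore", "freeze" and "suicide". To see how each node and the qdevice
currently view quorum, something like this should work:)

# pcs property list --all | grep no-quorum-policy
# pcs quorum status
# corosync-quorumtool -s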
> As soon as I did 'pcs resource cleanup p_mysql_622' all the other resources
> recovered, but none of them are dependent on that resource.
>

The logs do not contain entries for this, but my guess is that p_mysql_622
depends on p_fs_clust01, and the failure to stop p_mysql_622 blocked further
actions on p_fs_clust01, so the resources remained stopped:

Feb 22 05:16:38 [112948] 001db01b pengine: notice: LogAction: * Stop p_fs_clust01 ( 001db01a ) blocked
Feb 22 05:16:38 [112948] 001db01b pengine: notice: LogAction: * Start p_mysql_001 ( 001db01b ) due to colocation with p_fs_clust01 (blocked)
Feb 22 05:16:38 [112948] 001db01b pengine: notice: LogAction: * Start p_mysql_000 ( 001db01b ) due to colocation with p_fs_clust01 (blocked)
Feb 22 05:16:38 [112948] 001db01b pengine: notice: LogAction: * Start p_mysql_002 ( 001db01b ) due to colocation with p_fs_clust01 (blocked)
Feb 22 05:16:38 [112948] 001db01b pengine: notice: LogAction: * Start p_mysql_003 ( 001db01b ) due to colocation with p_fs_clust01 (blocked)
Feb 22 05:16:38 [112948] 001db01b pengine: notice: LogAction: * Start p_mysql_004 ( 001db01b ) due to colocation with p_fs_clust01 (blocked)
Feb 22 05:16:38 [112948] 001db01b pengine: notice: LogAction: * Start p_mysql_005 ( 001db01b ) due to colocation with p_fs_clust01 (blocked)
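If that guess is right, the dependency would normally show up as colocation
and ordering constraints on the cluster; "pcs constraint show --full" will
list the real ones. Purely as an illustration, constraints of roughly this
hypothetical shape would produce the behavior above:

# pcs constraint colocation add p_mysql_622 with p_fs_clust01 INFINITY
# pcs constraint order start p_fs_clust01 then start p_mysql_622

Also, as far as I know, with stonith disabled a failed stop cannot be
escalated to fencing, so pacemaker just leaves the resource blocked (along
with everything that depends on it) until the failure is cleared, which is
what 'pcs resource cleanup p_mysql_622' did. You can check the recorded
failures with:

# pcs resource failcount show p_mysql_622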
