26.02.2021 19:19, Eric Robinson wrote:
> At 5:16 am Pacific time Monday, one of our cluster nodes failed and its mysql
> services went down. The cluster did not automatically recover.
>
> We're trying to figure out:
>
>
> 1. Why did it fail?

Pacemaker only registered the loss of connection between the two nodes. You need to investigate why that happened.

> 2. Why did it not automatically recover?
>
> The cluster did not recover until we manually executed...
>

The *cluster* never failed in the first place. A specific resource may have. Do not confuse things more than is necessary.

> # pcs resource cleanup p_mysql_622
>

Because this resource failed to stop, and a failed stop is fatal.

> Feb 22 05:16:30 [91682] 001db01a pengine: notice: LogAction: *
> Stop p_mysql_622 ( 001db01a ) due to no quorum

The remaining node lost quorum and decided to stop its resources.

> Feb 22 05:16:30 [91683] 001db01a crmd: notice: te_rsc_command:
> Initiating stop operation p_mysql_622_stop_0 locally on 001db01a | action 76
...
> Feb 22 05:16:30 [91680] 001db01a lrmd: info: log_execute:
> executing - rsc:p_mysql_622 action:stop call_id:308
...
> Feb 22 05:16:45 [91680] 001db01a lrmd: warning:
> child_timeout_callback: p_mysql_622_stop_0 process (PID 19225) timed out
> Feb 22 05:16:45 [91680] 001db01a lrmd: warning: operation_finished:
> p_mysql_622_stop_0:19225 - timed out after 15000ms
> Feb 22 05:16:45 [91680] 001db01a lrmd: info: log_finished:
> finished - rsc:p_mysql_622 action:stop call_id:308 pid:19225 exit-code:1
> exec-time:15002ms queue-time:0ms
> Feb 22 05:16:45 [91683] 001db01a crmd: error: process_lrm_event:
> Result of stop operation for p_mysql_622 on 001db01a: Timed Out | call=308
> key=p_mysql_622_stop_0 timeout=15000ms
...
> Feb 22 05:16:38 [112948] 001db01b pengine: info: LogActions: Leave
> p_mysql_622 (Started unmanaged)

At this point Pacemaker stops managing this resource because its status is unknown. The normal reaction to a stop failure is to fence the node and fail the resource over, but apparently you also do not have working stonith.

The loss of quorum may be related to a network issue, so that the nodes lost their connection both to each other and to the quorum device.
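
If you want to check whether other resources hit the same stop timeout, the lrmd "timed out after" messages can be pulled out of the log mechanically. What follows is only a rough sketch, not a Pacemaker tool: the function name `find_timeouts` and the regular expression are my own illustration, and it assumes the log lines are unwrapped (one event per line, unlike the mail-wrapped excerpt above) and follow the lrmd `operation_finished` format quoted there.

```python
import re

# Hypothetical helper, not part of Pacemaker. Matches lrmd lines of the form
#   "... lrmd: warning: operation_finished: <rsc>_<action>_<interval>:<pid> - timed out after <N>ms"
TIMEOUT_RE = re.compile(
    r"operation_finished:\s+(?P<rsc>\S+)_(?P<action>[a-z]+)_(?P<interval>\d+):(?P<pid>\d+)"
    r"\s+-\s+timed out after (?P<ms>\d+)ms"
)


def find_timeouts(lines):
    """Return (resource, action, timeout_ms) for every timed-out operation found."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((match.group("rsc"), match.group("action"), int(match.group("ms"))))
    return hits


# Two lines taken from the excerpt above, joined back into single lines.
sample = [
    "Feb 22 05:16:45 [91680] 001db01a lrmd: warning: operation_finished: "
    "p_mysql_622_stop_0:19225 - timed out after 15000ms",
    "Feb 22 05:16:45 [91680] 001db01a lrmd: info: log_finished: finished - "
    "rsc:p_mysql_622 action:stop call_id:308 pid:19225 exit-code:1 "
    "exec-time:15002ms queue-time:0ms",
]

if __name__ == "__main__":
    for rsc, action, ms in find_timeouts(sample):
        print(f"{rsc}: {action} timed out after {ms} ms")
```

Only the first sample line matches; the `log_finished` line reports the same event but carries no "timed out after" text, so it is skipped. A stop action showing up here is the case that matters, since a failed stop without working stonith leaves the resource unmanaged.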
