Re: [ClusterLabs] Spurious node loss in corosync cluster
Hi Ken - Thanks for your response. In other cases we have seen messages like:

    corosync [MAIN ] Corosync main process was not scheduled for 17314.4746 ms (threshold is 8000.0000 ms). Consider token timeout increase.
    corosync [TOTEM ] A processor failed, forming new configuration.

Is this an indication of a failure due to CPU load issues, and will it get resolved if I upgrade to the Corosync 2.x series? In any case, for the current scenario, we did not see any scheduling-related messages.

Thanks for your help.
Prasad

On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot wrote:
> [...]
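As an aside on that pair of messages: the threshold in the MAIN warning is derived from the token timeout -- in the corosync sources I have checked, the pause detector fires once the main process has not been scheduled for 80% of the token timeout, which lines up with the 10-second token Ken mentions below:

    0.8 * token = 0.8 * 10000 ms = 8000 ms (the reported threshold)

The 17314 ms pause exceeds the full 10000 ms token, so the other nodes would legitimately have declared this node lost. In other words, those messages point at host-side CPU starvation (e.g. an oversubscribed hypervisor) rather than at the network.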
Re: [ClusterLabs] Spurious node loss in corosync cluster
On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> Hi:
>
> One of these days, I saw a spurious node loss on my 3-node corosync
> cluster with following logged in the corosync.log of one of the
> nodes.
>
> Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 32: memb=2, new=0, lost=1
> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vm02d780875f 67114156
> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vmfa2757171f 151000236
> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: lost: vm728316982d 201331884
> Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 32: memb=2, new=0, lost=0
> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm02d780875f 67114156
> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmfa2757171f 151000236
> Aug 18 12:40:25 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node vm728316982d was not seen in the previous transition
> Aug 18 12:40:25 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: lost
> Aug 18 12:40:25 corosync [pcmk ] info: send_member_notification: Sending membership update 32 to 3 children
> Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info: plugin_handle_membership: Membership 32: quorum retained
> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: plugin_handle_membership: Membership 32: quorum retained
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: peer_update_callback: vm728316982d is now lost (was member)
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: warning: match_down_event: No match for shutdown action on vm728316982d
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice: peer_update_callback: Stonith/shutdown of vm728316982d not matched
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: crm_update_peer_join: peer_update_callback: Node vm728316982d[201331884] - join-6 phase 4 -> 0
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)
> Aug 18 12:40:25 [4543] vmfa2757171f cib: info: plugin_handle_membership: Membership 32: quorum retained
> Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
> Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
> Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache
> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache
>
> However, within seconds, the node was able to join back.
>
> Aug 18 12:40:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 36: memb=3, new=1, lost=0
> Aug 18 12:40:34 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: member
> Aug 18 12:40:34 corosync [pcmk ] info: pcmk_peer_update: NEW: vm728316982d 201331884
>
> But this was enough time for the cluster to get into a split-brain
> kind of situation, with a resource on the node vm728316982d being
> stopped because of this node-loss detection.
>
> Could anyone help whether this could happen due to any transient
> network distortion or so?
> Are there any configuration settings that can be applied in
> corosync.conf so that the cluster is more resilient to such temporary
> distortions?

Your corosync timing of a 10-second token timeout and 10 retransmissions is already very lenient -- likely the node was already unresponsive for more than 10 seconds before the first message above, so it was more than 18 seconds before it rejoined. It's rarely a good idea to change token_retransmits_before_loss_const; changing token is generally enough to deal with transient network unreliability. However, 18 seconds is a really long time to raise the token to, and it's uncertain from the information here whether the root cause was networking or something on the node itself.
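To make that advice concrete, here is a minimal sketch of the relevant totem stanza in /etc/corosync/corosync.conf. The option names token and token_retransmits_before_loss_const are real corosync parameters, but the 15000 ms value is purely illustrative, not a recommendation -- the right number is a trade-off between riding out transient stalls and detecting real failures quickly:

    totem {
        version: 2

        # Time (in ms) to wait for the token before declaring it lost and
        # forming a new membership. Raising this tolerates longer network
        # or scheduling stalls, at the cost of slower failure detection.
        token: 15000

        # Retransmit attempts before the token is considered lost.
        # Per the advice above, leave this alone and tune token instead.
        token_retransmits_before_loss_const: 10

        # If consensus is set explicitly, raise it along with token;
        # many versions otherwise derive it as 1.2 * token.
        # consensus: 18000
    }

The change takes effect after corosync is restarted on each node; on the 2.x series the effective value can be read back at runtime with corosync-cmapctl (the runtime.config.totem.token key).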