Re: [ClusterLabs] Spurious node loss in corosync cluster
Hi Ken - Thanks for your response. In other cases we have seen messages like:

    corosync [MAIN ] Corosync main process was not scheduled for 17314.4746 ms (threshold is 8000.0000 ms). Consider token timeout increase.
    corosync [TOTEM ] A processor failed, forming new configuration.

Is this an indication of a failure due to CPU load issues, and will it get resolved if I upgrade to the Corosync 2.x series? In any case, for the current scenario, we did not see any scheduling-related messages.

Thanks for your help.
Prasad

On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot wrote:
> [...]
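As an aside on that pair of messages: the threshold in the MAIN warning is derived from the token timeout -- in the corosync sources I have checked, the pause detector fires once the main process has not been scheduled for 80% of the token timeout, which lines up with the 10-second token Ken mentions below:

    0.8 * token = 0.8 * 10000 ms = 8000 ms (the reported threshold)

The 17314 ms pause exceeds the full 10000 ms token, so the other nodes would legitimately have declared this node lost. In other words, those messages point at host-side CPU starvation (e.g. an oversubscribed hypervisor) rather than at the network.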
Re: [ClusterLabs] Spurious node loss in corosync cluster
On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> Hi:
>
> One of these days, I saw a spurious node loss on my 3-node corosync
> cluster with following logged in the corosync.log of one of the
> nodes.
>
> Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 32: memb=2, new=0, lost=1
> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vm02d780875f 67114156
> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vmfa2757171f 151000236
> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: lost: vm728316982d 201331884
> Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 32: memb=2, new=0, lost=0
> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm02d780875f 67114156
> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmfa2757171f 151000236
> Aug 18 12:40:25 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node vm728316982d was not seen in the previous transition
> Aug 18 12:40:25 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: lost
> Aug 18 12:40:25 corosync [pcmk ] info: send_member_notification: Sending membership update 32 to 3 children
> Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info: plugin_handle_membership: Membership 32: quorum retained
> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: plugin_handle_membership: Membership 32: quorum retained
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: peer_update_callback: vm728316982d is now lost (was member)
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: warning: match_down_event: No match for shutdown action on vm728316982d
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice: peer_update_callback: Stonith/shutdown of vm728316982d not matched
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: crm_update_peer_join: peer_update_callback: Node vm728316982d[201331884] - join-6 phase 4 -> 0
> Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)
> Aug 18 12:40:25 [4543] vmfa2757171f cib: info: plugin_handle_membership: Membership 32: quorum retained
> Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
> Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
> Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache
> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache
>
> However, within seconds, the node was able to join back.
>
> Aug 18 12:40:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 36: memb=3, new=1, lost=0
> Aug 18 12:40:34 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: member
> Aug 18 12:40:34 corosync [pcmk ] info: pcmk_peer_update: NEW: vm728316982d 201331884
>
> But this was enough time for the cluster to get into a split-brain
> kind of situation, with a resource on the node vm728316982d being
> stopped because of this node-loss detection.
>
> Could anyone help whether this could happen due to any transient
> network distortion or so?
> Are there any configuration settings that can be applied in
> corosync.conf so that the cluster is more resilient to such temporary
> distortions?

Your corosync timing of a 10-second token timeout and 10 retransmissions is already very lenient -- likely the node was already unresponsive for more than 10 seconds before the first message above, so it was more than 18 seconds before it rejoined. It's rarely a good idea to change token_retransmits_before_loss_const; changing token is generally enough to deal with transient network unreliability. However, 18 seconds is a really long time to raise the token to, and it's uncertain from the information here whether the root cause was networking or something on the node itself.
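To make that advice concrete, here is a minimal sketch of the relevant totem stanza in /etc/corosync/corosync.conf. The option names token and token_retransmits_before_loss_const are real corosync parameters, but the 15000 ms value is purely illustrative, not a recommendation -- the right number is a trade-off between riding out transient stalls and detecting real failures quickly:

    totem {
        version: 2

        # Time (in ms) to wait for the token before declaring it lost and
        # forming a new membership. Raising this tolerates longer network
        # or scheduling stalls, at the cost of slower failure detection.
        token: 15000

        # Retransmit attempts before the token is considered lost.
        # Per the advice above, leave this alone and tune token instead.
        token_retransmits_before_loss_const: 10

        # If consensus is set explicitly, raise it along with token;
        # many versions otherwise derive it as 1.2 * token.
        # consensus: 18000
    }

The change takes effect after corosync is restarted on each node; on the 2.x series the effective value can be read back at runtime with corosync-cmapctl (the runtime.config.totem.token key).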