Re: [ClusterLabs] Spurious node loss in corosync cluster

Prasad Nagaraj Tue, 21 Aug 2018 02:43:34 -0700

Hi Ken - Thanks for your response.

We have seen messages like the following in other cases:
corosync [MAIN  ] Corosync main process was not scheduled for 17314.4746 ms
(threshold is 8000.0000 ms). Consider token timeout increase.
corosync [TOTEM ] A processor failed, forming new configuration.


Is this an indication of a failure due to CPU load, and would it be
resolved by upgrading to the Corosync 2.x series?

In any case, for the current scenario, we did not see any
scheduling-related messages.
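
For anyone who wants to check their own corosync.log for these scheduling
messages, here is a minimal Python sketch. The function name
find_scheduling_pauses and the regular expression are my own assumptions,
derived only from the message text quoted above; they are not part of
corosync or pacemaker, and the wording may differ in other corosync versions.

```python
import re

# Pattern for the corosync scheduling-pause warning quoted above.
# \s+ tolerates the line wrap that mail clients introduce in the message.
PAUSE_RE = re.compile(
    r"Corosync main process was not scheduled for\s+([\d.]+)\s*ms"
    r"\s+\(threshold is\s+([\d.]+)\s*ms\)"
)


def find_scheduling_pauses(log_text):
    """Return a list of (pause_ms, threshold_ms) tuples found in log_text."""
    return [(float(p), float(t)) for p, t in PAUSE_RE.findall(log_text)]


# Sample text copied from the messages quoted in this mail.
sample = (
    "corosync [MAIN  ] Corosync main process was not scheduled for 17314.4746 ms\n"
    "(threshold is 8000.0000 ms). Consider token timeout increase.\n"
    "corosync [TOTEM ] A processor failed, forming new configuration.\n"
)

for pause_ms, threshold_ms in find_scheduling_pauses(sample):
    print(f"paused {pause_ms / 1000:.1f} s (threshold {threshold_ms / 1000:.1f} s)")
```

To use it on a real log, read the file contents (for example
/var/log/cluster/corosync.log) into a string and pass that in place of
sample. An empty result would match what we saw in the current scenario.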

Thanks for your help.
Prasad

On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <[email protected]> wrote:

> On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> > Hi:
> >
> > Recently, I saw a spurious node loss on my 3-node corosync
> > cluster, with the following logged in the corosync.log of one of the
> > nodes.
> >
> > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > Transitional membership event on ring 32: memb=2, new=0, lost=1
> > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > vm02d780875f 67114156
> > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > vmfa2757171f 151000236
> > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
> > vm728316982d 201331884
> > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 32: memb=2, new=0, lost=0
> > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > vm02d780875f 67114156
> > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > vmfa2757171f 151000236
> > Aug 18 12:40:25 corosync [pcmk  ] info: ais_mark_unseen_peer_dead:
> > Node vm728316982d was not seen in the previous transition
> > Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
> > 201331884/vm728316982d is now: lost
> > Aug 18 12:40:25 corosync [pcmk  ] info: send_member_notification:
> > Sending membership update 32 to 3 children
> > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left the
> > membership and a new membership was formed.
> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:     info:
> > plugin_handle_membership:     Membership 32: quorum retained
> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > vm728316982d[201331884] - state is now lost (was member)
> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > plugin_handle_membership:     Membership 32: quorum retained
> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
> > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > vm728316982d[201331884] - state is now lost (was member)
> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > peer_update_callback: vm728316982d is now lost (was member)
> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:  warning:
> > match_down_event:     No match for shutdown action on vm728316982d
> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
> > peer_update_callback: Stonith/shutdown of vm728316982d not matched
> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > crm_update_peer_join: peer_update_callback: Node
> > vm728316982d[201331884] - join-6 phase 4 -> 0
> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > abort_transition_graph:       Transition aborted: Node failure
> > (source=peer_update_callback:240, 1)
> > Aug 18 12:40:25 [4543] vmfa2757171f        cib:     info:
> > plugin_handle_membership:     Membership 32: quorum retained
> > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
> > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > vm728316982d[201331884] - state is now lost (was member)
> > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
> > crm_reap_dead_member: Removing vm728316982d/201331884 from the
> > membership list
> > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
> > reap_crm_member:      Purged 1 peers with id=201331884 and/or
> > uname=vm728316982d from the membership cache
> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > crm_reap_dead_member: Removing vm728316982d/201331884 from the
> > membership list
> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > reap_crm_member:      Purged 1 peers with id=201331884 and/or
> > uname=vm728316982d from the membership cache
> >
> > However, within seconds, the node was able to join back.
> >
> > Aug 18 12:40:34 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 36: memb=3, new=1, lost=0
> > Aug 18 12:40:34 corosync [pcmk  ] info: update_member: Node
> > 201331884/vm728316982d is now: member
> > Aug 18 12:40:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:
> > vm728316982d 201331884
> >
> >
> > But this was enough time for the cluster to get into a split-brain
> > kind of situation, with a resource on the node vm728316982d being
> > stopped because of this node loss detection.
> >
> > Could anyone help me understand whether this could happen due to a
> > transient network disruption or something similar?
> > Are there any configuration settings that can be applied in
> > corosync.conf so that the cluster is more resilient to such temporary
> > disruptions?
>
> Your corosync sensitivity of 10-second token timeout and 10
> retransmissions is already very lengthy -- likely the node was already
> unresponsive for more than 10 seconds before the first message above,
> so it was more than 18 seconds before it rejoined.
>
> It's rarely a good idea to change token_retransmits_before_loss_const;
> changing token is generally enough to deal with transient network
> unreliability. However, 18 seconds is a really long time to raise the
> token to, and it's uncertain from the information here whether the root
> cause was networking or something on the host.
>
> I notice your configuration is corosync 1 with the pacemaker plugin;
> that is a long-deprecated setup, and corosync 3 is about to come out,
> so you may want to consider upgrading to at least corosync 2 and a
> reasonably recent pacemaker. That would give you some reliability
> improvements, including real-time priority scheduling of corosync,
> which could have been the issue here if CPU load rather than networking
> was the root cause.
>
> >
> > Currently, my corosync.conf looks like this:
> >
> > compatibility: whitetank
> > totem {
> >     version: 2
> >     secauth: on
> >     threads: 0
> >     interface {
> >         member {
> >             memberaddr: 172.20.0.4
> >         }
> >         member {
> >             memberaddr: 172.20.0.9
> >         }
> >         member {
> >             memberaddr: 172.20.0.12
> >         }
> >         bindnetaddr: 172.20.0.12
> >         ringnumber: 0
> >         mcastport: 5405
> >         ttl: 1
> >     }
> >     transport: udpu
> >     token: 10000
> >     token_retransmits_before_loss_const: 10
> > }
> >
> > logging {
> >     fileline: off
> >     to_stderr: yes
> >     to_logfile: yes
> >     to_syslog: no
> >     logfile: /var/log/cluster/corosync.log
> >     timestamp: on
> >     logger_subsys {
> >         subsys: AMF
> >         debug: off
> >     }
> > }
> > service {
> >     name: pacemaker
> >     ver: 1
> > }
> > amf {
> >     mode: disabled
> > }
> >
> > Thanks in advance for the help.
> > Prasad
> >
> > _______________________________________________
> > Users mailing list: [email protected]
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> --
> Ken Gaillot <[email protected]>
