Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-23 Thread Jan Friesse

Prasad,



Hi - My systems are single-core CPU VMs running on the Azure platform. I am


Ok, now it makes sense. I don't think you get many guarantees in a 
cloud environment, so quite a large scheduling pause can simply happen. 
Also, a single-core CPU is kind of "unsupported" today.



running MySQL on the nodes, which does generate high IO load. And my bad, I
meant to say 'High CPU load detected', which is logged by crmd and not corosync.
Corosync logs messages like 'Corosync main process was not scheduled
for...', which in turn sometimes makes pacemaker monitor actions fail.
Is increasing the token timeout a solution for this, or are there other ways?


Yes. Actually, increasing the token timeout to quite a big value seems to be 
the only "solution". Please keep in mind that it's not free - when a real 
problem happens, corosync does not detect it before the token timeout 
expires, so with a large token timeout detection takes longer.
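
For illustration only (the value below is just an example, not a tuned
recommendation), the token timeout is the "token" directive in the totem
section of /etc/corosync/corosync.conf, given in milliseconds:

  totem {
      # time in ms to wait for the token before a membership change is started
      token: 15000
  }

All nodes should use the same value, and the change only takes effect once
corosync has re-read its configuration (typically a restart) on every node.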


Regards,
  Honza



Thanks for the help
Prasad

On Wed, 22 Aug 2018, 11:55 am Jan Friesse,  wrote:


Prasad,


Thanks Ken and Ulrich. There is definitely high IO on the system, with
IOWAITs sometimes as high as 90%.
I have come across some previous posts saying that IOWAIT is also counted
as CPU load by Corosync. Is this true? Can high IO lead corosync to
complain as in "Corosync main process was not scheduled for..." or "High
CPU load detected.."?


Yes it can.

Corosync never logs "High CPU load detected...".



I will surely monitor the system more.


Is that system a VM or a physical machine? Because "Corosync main process
was not scheduled for..." usually happens on VMs whose hosts are
highly overloaded.

Honza



Thanks for your help.
Prasad



On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot wrote:



On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:

Prasad Nagaraj wrote on 21.08.2018 at 11:42 in message:

Hi Ken - Thanks for your response.

We have seen messages in other cases like
corosync [MAIN  ] Corosync main process was not scheduled for
17314.4746 ms
(threshold is 8000. ms). Consider token timeout increase.
corosync [TOTEM ] A processor failed, forming new configuration.

Is this an indication of a failure due to CPU load issues, and will it
get resolved if I upgrade to the Corosync 2.x series?


Yes, most definitely this is a CPU issue. It means corosync isn't
getting enough CPU cycles to handle the cluster token before the
timeout is reached.

Upgrading may indeed help, as recent versions ensure that corosync runs
with real-time priority in the kernel, and thus are more likely to get
CPU time when something of lower priority is consuming all the CPU.

But of course, there is some underlying problem that should be
identified and addressed. Figure out what's maxing out the CPU or I/O.
Ulrich's monitoring suggestion is a good start.


Hi!

I'd strongly recommend starting monitoring on your nodes, at least
until you know what's going on. The good old UNIX sa (sysstat
package) could be a starting point. I'd monitor CPU idle
specifically. Then go for 100% device utilization, then look for
network bottlenecks...

A new corosync release cannot fix those, most likely.

Regards,
Ulrich



In any case, for the current scenario, we did not see any
scheduling
related messages.

Thanks for your help.
Prasad

On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
wrote:


On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:

Hi:

One of these days, I saw a spurious node loss on my 3-node
corosync
cluster with following logged in the corosync.log of one of the
nodes.

Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
Transitional membership event on ring 32: memb=2, new=0, lost=1
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
vm728316982d 201331884
Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
Stable
membership event on ring 32: memb=2, new=0, lost=0
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk  ] info:
ais_mark_unseen_peer_dead:
Node vm728316982d was not seen in the previous transition
Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
201331884/vm728316982d is now: lost
Aug 18 12:40:25 corosync [pcmk  ] info:
send_member_notification:
Sending membership update 32 to 3 children
Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
the
membership and a new membership was formed.
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info:
plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
crm_update_peer_state_iter:   plugin_handle_membership: Node
vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] 

Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-22 Thread Prasad Nagaraj
Hi - My systems are single-core CPU VMs running on the Azure platform. I am
running MySQL on the nodes, which does generate high IO load. And my bad, I
meant to say 'High CPU load detected', which is logged by crmd and not corosync.
Corosync logs messages like 'Corosync main process was not scheduled
for...', which in turn sometimes makes pacemaker monitor actions fail.
Is increasing the token timeout a solution for this, or are there other ways?

Thanks for the help
Prasad

On Wed, 22 Aug 2018, 11:55 am Jan Friesse,  wrote:

> Prasad,
>
> > Thanks Ken and Ulrich. There is definitely high IO on the system, with
> > IOWAITs sometimes as high as 90%.
> > I have come across some previous posts saying that IOWAIT is also counted
> > as CPU load by Corosync. Is this true? Can high IO lead corosync to
> > complain as in "Corosync main process was not scheduled for..." or "High
> > CPU load detected.."?
>
> Yes it can.
>
> Corosync never logs "High CPU load detected...".
>
> >
> > I will surely monitor the system more.
>
> Is that system a VM or a physical machine? Because "Corosync main process
> was not scheduled for..." usually happens on VMs whose hosts are
> highly overloaded.
>
> Honza
>
> >
> > Thanks for your help.
> > Prasad
> >
> >
> >
> > On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot 
> wrote:
> >
> >> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
> >> Prasad Nagaraj  schrieb am
> >> 21.08.2018 um 11:42 in
> >>>
> >>> Nachricht
> >>> :
>  Hi Ken - Thanks for you response.
> 
>  We do have seen messages in other cases like
>  corosync [MAIN  ] Corosync main process was not scheduled for
>  17314.4746 ms
>  (threshold is 8000. ms). Consider token timeout increase.
>  corosync [TOTEM ] A processor failed, forming new configuration.
> 
>  Is this the indication of a failure due to CPU load issues and will
>  this
>  get resolved if I upgrade to Corosync 2.x series ?
> >>
> >> Yes, most definitely this is a CPU issue. It means corosync isn't
> >> getting enough CPU cycles to handle the cluster token before the
> >> timeout is reached.
> >>
> >> Upgrading may indeed help, as recent versions ensure that corosync runs
> >> with real-time priority in the kernel, and thus are more likely to get
> >> CPU time when something of lower priority is consuming all the CPU.
> >>
> >> But of course, there is some underlying problem that should be
> >> identified and addressed. Figure out what's maxing out the CPU or I/O.
> >> Ulrich's monitoring suggestion is a good start.
> >>
> >>> Hi!
> >>>
> >>> I'd strongly recommend starting monitoring on your nodes, at least
> >>> until you know what's going on. The good old UNIX sa (sysstat
> >>> package) could be a starting point. I'd monitor CPU idle
> >>> specifically. Then go for 100% device utilization, then look for
> >>> network bottlenecks...
> >>>
> >>> A new corosync release cannot fix those, most likely.
> >>>
> >>> Regards,
> >>> Ulrich
> >>>
> 
>  In any case, for the current scenario, we did not see any
>  scheduling
>  related messages.
> 
>  Thanks for your help.
>  Prasad
> 
>  On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
>  wrote:
> 
> > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> >> Hi:
> >>
> >> One of these days, I saw a spurious node loss on my 3-node
> >> corosync
> >> cluster with following logged in the corosync.log of one of the
> >> nodes.
> >>
> >> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> >> Transitional membership event on ring 32: memb=2, new=0, lost=1
> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> >> vm02d780875f 67114156
> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> >> vmfa2757171f 151000236
> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
> >> vm728316982d 201331884
> >> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> >> Stable
> >> membership event on ring 32: memb=2, new=0, lost=0
> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> >> vm02d780875f 67114156
> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> >> vmfa2757171f 151000236
> >> Aug 18 12:40:25 corosync [pcmk  ] info:
> >> ais_mark_unseen_peer_dead:
> >> Node vm728316982d was not seen in the previous transition
> >> Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
> >> 201331884/vm728316982d is now: lost
> >> Aug 18 12:40:25 corosync [pcmk  ] info:
> >> send_member_notification:
> >> Sending membership update 32 to 3 children
> >> Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
> >> the
> >> membership and a new membership was formed.
> >> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info:
> >> plugin_handle_membership: Membership 32: quorum retained
> 

Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-22 Thread Ferenc Wágner
Jan Friesse  writes:

> Is that system a VM or a physical machine? Because "Corosync main process
> was not scheduled for..." usually happens on VMs whose hosts are
> highly overloaded.

Or when physical hosts use BMC watchdogs.  But Prasad didn't encounter
such logs in the setup at hand, as far as I understand.
-- 
Regards,
Feri
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-22 Thread Jan Friesse

Prasad,


Thanks Ken and Ulrich. There is definitely high IO on the system, with
IOWAITs sometimes as high as 90%.
I have come across some previous posts saying that IOWAIT is also counted
as CPU load by Corosync. Is this true? Can high IO lead corosync to
complain as in "Corosync main process was not scheduled for..." or "High
CPU load detected.."?


Yes it can.

Corosync never logs "High CPU load detected...".



I will surely monitor the system more.


Is that system a VM or a physical machine? Because "Corosync main process 
was not scheduled for..." usually happens on VMs whose hosts are 
highly overloaded.


Honza



Thanks for your help.
Prasad



On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot  wrote:


On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:

Prasad Nagaraj wrote on 21.08.2018 at 11:42 in message:

Hi Ken - Thanks for your response.

We have seen messages in other cases like
corosync [MAIN  ] Corosync main process was not scheduled for
17314.4746 ms
(threshold is 8000. ms). Consider token timeout increase.
corosync [TOTEM ] A processor failed, forming new configuration.

Is this an indication of a failure due to CPU load issues, and will it
get resolved if I upgrade to the Corosync 2.x series?


Yes, most definitely this is a CPU issue. It means corosync isn't
getting enough CPU cycles to handle the cluster token before the
timeout is reached.

Upgrading may indeed help, as recent versions ensure that corosync runs
with real-time priority in the kernel, and thus are more likely to get
CPU time when something of lower priority is consuming all the CPU.

But of course, there is some underlying problem that should be
identified and addressed. Figure out what's maxing out the CPU or I/O.
Ulrich's monitoring suggestion is a good start.


Hi!

I'd strongly recommend starting monitoring on your nodes, at least
until you know what's going on. The good old UNIX sa (sysstat
package) could be a starting point. I'd monitor CPU idle
specifically. Then go for 100% device utilization, then look for
network bottlenecks...

A new corosync release cannot fix those, most likely.

Regards,
Ulrich



In any case, for the current scenario, we did not see any
scheduling
related messages.

Thanks for your help.
Prasad

On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
wrote:


On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:

Hi:

One of these days, I saw a spurious node loss on my 3-node
corosync
cluster with following logged in the corosync.log of one of the
nodes.

Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
Transitional membership event on ring 32: memb=2, new=0, lost=1
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
vm728316982d 201331884
Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
Stable
membership event on ring 32: memb=2, new=0, lost=0
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk  ] info:
ais_mark_unseen_peer_dead:
Node vm728316982d was not seen in the previous transition
Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
201331884/vm728316982d is now: lost
Aug 18 12:40:25 corosync [pcmk  ] info:
send_member_notification:
Sending membership update 32 to 3 children
Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
the
membership and a new membership was formed.
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info:
plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
crm_update_peer_state_iter:   plugin_handle_membership: Node
vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4548] vmfa2757171f   crmd:   notice:
crm_update_peer_state_iter:   plugin_handle_membership: Node
vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
peer_update_callback: vm728316982d is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f   crmd:  warning:
match_down_event: No match for shutdown action on
vm728316982d
Aug 18 12:40:25 [4548] vmfa2757171f   crmd:   notice:
peer_update_callback: Stonith/shutdown of vm728316982d not
matched
Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
crm_update_peer_join: peer_update_callback: Node
vm728316982d[201331884] - join-6 phase 4 -> 0
Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
abort_transition_graph:   Transition aborted: Node failure
(source=peer_update_callback:240, 1)
Aug 18 12:40:25 [4543] vmfa2757171fcib: info:

Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-21 Thread Prasad Nagaraj
Thanks Ken and Ulrich. There is definitely high IO on the system, with
IOWAITs sometimes as high as 90%.
I have come across some previous posts saying that IOWAIT is also counted
as CPU load by Corosync. Is this true? Can high IO lead corosync to
complain as in "Corosync main process was not scheduled for..." or "High
CPU load detected.."?

I will surely monitor the system more.
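
For reference, a minimal way to do that kind of monitoring with the sysstat
tools (the intervals below are arbitrary) would be something like:

  # CPU utilization every 2 s; watch %iowait, %steal and %idle
  sar -u 2

  # extended per-device statistics; %util near 100 means the disk is saturated
  iostat -x 2

A consistently high %steal (where the hypervisor reports it) would also point
at the host not giving the guest enough CPU time.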

Thanks for your help.
Prasad



On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot  wrote:

> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
> > > > > Prasad Nagaraj  schrieb am
> > > > > 21.08.2018 um 11:42 in
> >
> > Nachricht
> > :
> > > Hi Ken - Thanks for you response.
> > >
> > > We do have seen messages in other cases like
> > > corosync [MAIN  ] Corosync main process was not scheduled for
> > > 17314.4746 ms
> > > (threshold is 8000. ms). Consider token timeout increase.
> > > corosync [TOTEM ] A processor failed, forming new configuration.
> > >
> > > Is this the indication of a failure due to CPU load issues and will
> > > this
> > > get resolved if I upgrade to Corosync 2.x series ?
>
> Yes, most definitely this is a CPU issue. It means corosync isn't
> getting enough CPU cycles to handle the cluster token before the
> timeout is reached.
>
> Upgrading may indeed help, as recent versions ensure that corosync runs
> with real-time priority in the kernel, and thus are more likely to get
> CPU time when something of lower priority is consuming all the CPU.
>
> But of course, there is some underlying problem that should be
> identified and addressed. Figure out what's maxing out the CPU or I/O.
> Ulrich's monitoring suggestion is a good start.
>
> > Hi!
> >
> > I'd strongly recommend starting monitoring on your nodes, at least
> > until you know what's going on. The good old UNIX sa (sysstat
> > package) could be a starting point. I'd monitor CPU idle
> > specifically. Then go for 100% device utilization, then look for
> > network bottlenecks...
> >
> > A new corosync release cannot fix those, most likely.
> >
> > Regards,
> > Ulrich
> >
> > >
> > > In any case, for the current scenario, we did not see any
> > > scheduling
> > > related messages.
> > >
> > > Thanks for your help.
> > > Prasad
> > >
> > > On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
> > > wrote:
> > >
> > > > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> > > > > Hi:
> > > > >
> > > > > One of these days, I saw a spurious node loss on my 3-node
> > > > > corosync
> > > > > cluster with following logged in the corosync.log of one of the
> > > > > nodes.
> > > > >
> > > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > > Transitional membership event on ring 32: memb=2, new=0, lost=1
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > > > > vm02d780875f 67114156
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > > > > vmfa2757171f 151000236
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
> > > > > vm728316982d 201331884
> > > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > > Stable
> > > > > membership event on ring 32: memb=2, new=0, lost=0
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > > > > vm02d780875f 67114156
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > > > > vmfa2757171f 151000236
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info:
> > > > > ais_mark_unseen_peer_dead:
> > > > > Node vm728316982d was not seen in the previous transition
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
> > > > > 201331884/vm728316982d is now: lost
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info:
> > > > > send_member_notification:
> > > > > Sending membership update 32 to 3 children
> > > > > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
> > > > > the
> > > > > membership and a new membership was formed.
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info:
> > > > > plugin_handle_membership: Membership 32: quorum retained
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > > > > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > > > > vm728316982d[201331884] - state is now lost (was member)
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
> > > > > plugin_handle_membership: Membership 32: quorum retained
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd:   notice:
> > > > > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > > > > vm728316982d[201331884] - state is now lost (was member)
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
> > > > > peer_update_callback: vm728316982d is now lost (was member)
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd:  warning:
> > > > > match_down_event: No match for shutdown action on
> > > > > vm728316982d
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd:   

Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-21 Thread Ken Gaillot
On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
> > > > Prasad Nagaraj  schrieb am
> > > > 21.08.2018 um 11:42 in
> 
> Nachricht
> :
> > Hi Ken - Thanks for you response.
> > 
> > We do have seen messages in other cases like
> > corosync [MAIN  ] Corosync main process was not scheduled for
> > 17314.4746 ms
> > (threshold is 8000. ms). Consider token timeout increase.
> > corosync [TOTEM ] A processor failed, forming new configuration.
> > 
> > Is this the indication of a failure due to CPU load issues and will
> > this
> > get resolved if I upgrade to Corosync 2.x series ?

Yes, most definitely this is a CPU issue. It means corosync isn't
getting enough CPU cycles to handle the cluster token before the
timeout is reached.

Upgrading may indeed help, as recent versions ensure that corosync runs
with real-time priority in the kernel, and thus are more likely to get
CPU time when something of lower priority is consuming all the CPU.

But of course, there is some underlying problem that should be
identified and addressed. Figure out what's maxing out the CPU or I/O.
Ulrich's monitoring suggestion is a good start.
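
As an illustrative way to check whether the corosync process actually ended
up with a real-time scheduling class after such an upgrade (output details
vary by distribution):

  # scheduling class and real-time priority of corosync
  ps -o pid,cls,rtprio,ni,comm -C corosync

  # or equivalently
  chrt -p $(pidof corosync)

A class of RR with a non-zero rtprio means corosync is scheduled ahead of
normal tasks; TS with "-" for rtprio means it competes like any ordinary
process.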

> Hi!
> 
> I'd strongly recommend starting monitoring on your nodes, at least
> until you know what's going on. The good old UNIX sa (sysstat
> package) could be a starting point. I'd monitor CPU idle
> specifically. Then go for 100% device utilization, then look for
> network bottlenecks...
> 
> A new corosync release cannot fix those, most likely.
> 
> Regards,
> Ulrich
> 
> > 
> > In any case, for the current scenario, we did not see any
> > scheduling
> > related messages.
> > 
> > Thanks for your help.
> > Prasad
> > 
> > On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
> > wrote:
> > 
> > > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> > > > Hi:
> > > > 
> > > > One of these days, I saw a spurious node loss on my 3-node
> > > > corosync
> > > > cluster with following logged in the corosync.log of one of the
> > > > nodes.
> > > > 
> > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > Transitional membership event on ring 32: memb=2, new=0, lost=1
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > > > vm02d780875f 67114156
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > > > vmfa2757171f 151000236
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
> > > > vm728316982d 201331884
> > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > Stable
> > > > membership event on ring 32: memb=2, new=0, lost=0
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > > > vm02d780875f 67114156
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > > > vmfa2757171f 151000236
> > > > Aug 18 12:40:25 corosync [pcmk  ] info:
> > > > ais_mark_unseen_peer_dead:
> > > > Node vm728316982d was not seen in the previous transition
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
> > > > 201331884/vm728316982d is now: lost
> > > > Aug 18 12:40:25 corosync [pcmk  ] info:
> > > > send_member_notification:
> > > > Sending membership update 32 to 3 children
> > > > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
> > > > the
> > > > membership and a new membership was formed.
> > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info:
> > > > plugin_handle_membership: Membership 32: quorum retained
> > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > > > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > > > vm728316982d[201331884] - state is now lost (was member)
> > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
> > > > plugin_handle_membership: Membership 32: quorum retained
> > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd:   notice:
> > > > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > > > vm728316982d[201331884] - state is now lost (was member)
> > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
> > > > peer_update_callback: vm728316982d is now lost (was member)
> > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd:  warning:
> > > > match_down_event: No match for shutdown action on
> > > > vm728316982d
> > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd:   notice:
> > > > peer_update_callback: Stonith/shutdown of vm728316982d not
> > > > matched
> > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
> > > > crm_update_peer_join: peer_update_callback: Node
> > > > vm728316982d[201331884] - join-6 phase 4 -> 0
> > > > Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
> > > > abort_transition_graph:   Transition aborted: Node failure
> > > > (source=peer_update_callback:240, 1)
> > > > Aug 18 12:40:25 [4543] vmfa2757171fcib: info:
> > > > plugin_handle_membership: Membership 32: quorum retained
> > > > Aug 18 12:40:25 [4543] vmfa2757171fcib:   notice:
> > > >