On Fri, Aug 9, 2019 at 9:25 AM Jan Friesse <jfrie...@redhat.com> wrote:
>
> Олег Самойлов napsal(a):
> > Hello all.
> >
> > I have a test bed with several virtual machines for testing pacemaker. I
> > simulate a random failure on one of the nodes. The cluster will span several
> > data centres, so there is no stonith device; instead I use qnetd on the
> > third data centre and a watchdog (softdog). And sometimes (not always), on
> > a failure of one node, the second node is also reset by its watchdog due to
> > loss of quorum. I doubled the quorum timeouts:
> >
> > for qnetd: COROSYNC_QNETD_OPTIONS="-S dpd_interval=20000 -d"
>
> Please do not set dpd_interval that high. dpd_interval on the qnetd side is
> not about how often the ping is sent. Could you please retry your
> test with dpd_interval=1000? I'm pretty sure it will work then.
>
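(For anyone finding this thread later: the suggested change amounts to editing the qnetd daemon options and restarting the service. The sysconfig path below is the usual RHEL/CentOS location and is an assumption here; Debian-based systems keep it under /etc/default/ instead.)

```shell
# /etc/sysconfig/corosync-qnetd  (path assumed; check your distribution)
# dpd_interval controls how often qnetd's dead-peer-detection timer runs,
# not how often clients send heartbeats, so keep it small:
COROSYNC_QNETD_OPTIONS="-S dpd_interval=1000 -d"
```

then restart the daemon on the witness host: `systemctl restart corosync-qnetd`.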
Well, I observed inexplicable resets of a node when its partner was rebooted in a two-node cluster with qdevice. No timers were changed. I have not investigated it in more detail: at first I could not work out what conditions triggered it, and later I decided to use SBD with a shared block device instead, which seems to mostly work. According to the manual page, the default for dpd_interval is 10 seconds (10000 ms). So shouldn't the defaults be changed then?

> Honza
>
> > for pacemaker: pcs quorum device add sync_timeout=60000 timeout=20000 model net host='witness' algorithm=ffsplit
> >
> > Also I set the -I 60 option (net timeout) for sbd.
> >
> > But the effect still exists:
> >
> > Logs, after one of the nodes "tuchanka1a" was powered off at 17:13:53.
> >
> > From the server 'witness' with qnetd:
> >
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug Client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) sent membership node list.
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug msg seq num = 7
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug ring id = (2.4c)
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug heuristics = Undefined
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug node list:
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug node_id = 2, data_center_id = 0, node_state = not set
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug ffsplit: Membership for cluster krogan1 is not yet stable
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug Algorithm result vote is Wait for reply
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug Client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) sent quorum node list.
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug msg seq num = 8
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug quorate = 0
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node list:
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node_id = 1, data_center_id = 0, node_state = dead
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node_id = 2, data_center_id = 0, node_state = member
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug Algorithm result vote is No change
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 warning Client ::ffff:192.168.89.11:47456 doesn't sent any message during 40000ms. Disconnecting
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Client ::ffff:192.168.89.11:47456 (init_received 1, cluster krogan1, node_id 1) disconnect
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: Membership for cluster krogan1 is now stable
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: Quorate partition selected
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug node list:
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug node_id = 2, data_center_id = 0, node_state = not set
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: No client gets NACK
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Sending vote info to client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2)
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug msg seq num = 2
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug vote = ACK
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) replied back to vote info message
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug msg seq num = 2
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: All ACK votes sent for cluster krogan1
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 warning Client ::ffff:192.168.89.12:39144 doesn't sent any message during 40000ms. Disconnecting
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug Client ::ffff:192.168.89.12:39144 (init_received 1, cluster krogan1, node_id 2) disconnect
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: Membership for cluster krogan1 is now stable
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No quorate partition was selected
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No client gets NACK
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No client gets ACK
> >
> > From the other node, 'tuchanka1b'. I deleted all the repeated sbd rows "info: notify_parent: Notifying parent: healthy" and "info: notify_parent: Not notifying parent: state transient (2)".
> >
> > Aug 8 17:13:54 tuchanka1b corosync[1185]: [TOTEM ] A processor failed, forming new configuration.
> > Aug 8 17:13:55 tuchanka1b corosync[1185]: [TOTEM ] A new membership (192.168.89.12:76) was formed. Members left: 1
> > Aug 8 17:13:55 tuchanka1b corosync[1185]: [TOTEM ] Failed to receive the leave message. failed: 1
> > Aug 8 17:13:55 tuchanka1b corosync[1185]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 60000 ms)
> > Aug 8 17:13:55 tuchanka1b stonith-ng[1210]: notice: Node tuchanka1a state is now lost
> > Aug 8 17:13:55 tuchanka1b stonith-ng[1210]: notice: Purged 1 peer with id=1 and/or uname=tuchanka1a from the membership cache
> > Aug 8 17:13:55 tuchanka1b crmd[1214]: notice: Our peer on the DC (tuchanka1a) is dead
> > Aug 8 17:13:55 tuchanka1b crmd[1214]: notice: State transition S_NOT_DC -> S_ELECTION
> > Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Node tuchanka1a state is now lost
> > Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Removing all tuchanka1a attributes for peer loss
> > Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Lost attribute writer tuchanka1a
> > Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Purged 1 peer with id=1 and/or uname=tuchanka1a from the membership cache
> > Aug 8 17:13:55 tuchanka1b cib[1209]: notice: Node tuchanka1a state is now lost
> > Aug 8 17:13:55 tuchanka1b cib[1209]: notice: Purged 1 peer with id=1 and/or uname=tuchanka1a from the membership cache
> > Aug 8 17:14:25 tuchanka1b crmd[1214]: notice: Deletion of "//node_state[@uname='tuchanka1a']/transient_attributes": Timer expired (rc=-62)
> > Aug 8 17:14:55 tuchanka1b corosync[1185]: [VOTEQ ] lost contact with quorum device Qdevice
> > Aug 8 17:14:55 tuchanka1b corosync[1185]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 3
> > Aug 8 17:14:55 tuchanka1b corosync[1185]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> > Aug 8 17:14:55 tuchanka1b corosync[1185]: [QUORUM] Members[1]: 2
> > Aug 8 17:14:55 tuchanka1b corosync[1185]: [MAIN ] Completed service synchronization, ready to provide service.
> > Aug 8 17:14:55 tuchanka1b pacemakerd[1208]: warning: Quorum lost
> > Aug 8 17:14:55 tuchanka1b pacemakerd[1208]: notice: Node tuchanka1a state is now lost
> > Aug 8 17:14:55 tuchanka1b crmd[1214]: warning: Quorum lost
> > Aug 8 17:14:55 tuchanka1b crmd[1214]: notice: Node tuchanka1a state is now lost
> > Aug 8 17:14:55 tuchanka1b sbd[1182]: cluster: info: notify_parent: Notifying parent: healthy
> > Aug 8 17:14:55 tuchanka1b crmd[1214]: notice: State transition S_ELECTION -> S_INTEGRATION
> > Aug 8 17:14:55 tuchanka1b crmd[1214]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Fencing and resource management disabled due to lack of quorum
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Node tuchanka1a is unclean
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1IP_stop_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Node tuchanka1a is unclean!
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1DB:0 ( Slave tuchanka1b ) due to no quorum
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1DB:1 ( Master tuchanka1a ) due to node availability (blocked)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1IP ( tuchanka1a ) due to no quorum (blocked)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1sIP ( tuchanka1b ) due to no quorum
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-4.bz2
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating cancel operation krogan1DB_monitor_15000 locally on tuchanka1b
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating cancel operation krogan1DB_monitor_17000 locally on tuchanka1b
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating notify operation krogan1DB_pre_notify_stop_0 locally on tuchanka1b
> > Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating stop operation krogan1sIP_stop_0 locally on tuchanka1b
> > Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b IPaddr2(krogan1sIP)[5245]: INFO: IP status = ok, IP_CIP=
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Result of stop operation for krogan1sIP on tuchanka1b: 0 (ok)
> > Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Result of notify operation for krogan1DB on tuchanka1b: 0 (ok)
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating stop operation krogan1DB_stop_0 locally on tuchanka1b
> > Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: cluster_status: Fencing and resource management disabled due to lack of quorum
> > Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: pe_fence_node: Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
> > Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: determine_online_status: Node tuchanka1a is unclean
> > Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: info: set_servant_health: Quorum lost: Stop ALL resources
> > Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> > Aug 8 17:14:57 tuchanka1b pgsqlms(krogan1DB)[5312]: INFO: Instance "krogan1DB" stopped
> > Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: Result of stop operation for krogan1DB on tuchanka1b: 0 (ok)
> > Aug 8 17:14:57 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: Transition 0 (Complete=11, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-4.bz2): Complete
> > Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> > Aug 8 17:14:57 tuchanka1b sbd[1181]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> > Aug 8 17:14:58 tuchanka1b ntpd[557]: Deleting interface #8 eth0, 192.168.89.104#123, interface stats: received=0, sent=0, dropped=0, active_time=260 secs
> > (from now on, repeating) Aug 8 17:14:58 tuchanka1b sbd[1181]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> > (from now on, repeating) Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk: warning: cluster_status: Fencing and resource management disabled due to lack of quorum
> > (from now on, repeating) Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk: warning: pe_fence_node: Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
> > (from now on, repeating) Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk: warning: determine_online_status: Node tuchanka1a is unclean
> > Aug 8 17:15:00 tuchanka1b corosync[1185]: [VOTEQ ] Waiting for all cluster members. Current votes: 2 expected_votes: 3
> > Aug 8 17:15:56 tuchanka1b sbd[1180]: warning: inquisitor_child: Servant pcmk is outdated (age: 61)
> > Aug 8 17:15:59 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
> > Aug 8 17:15:59 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
> > Aug 8 17:16:00 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No liveness for 5 s exceeds threshold of 3 s (healthy servants: 0)
> > Rebooted
> >
> > What is strange:
> >
> > 1. The node expected the quorum vote within 60 s, but it got the vote from qnetd 5 s after that. So at 17:14:55 the node lost quorum.
> > 2. At 17:15:00 it got the vote, but not quorum. Is that due to the "wait for all" option?
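The timing mismatch in question 1 can be checked directly from the timestamps quoted above. A small sketch (plain Python; the three timestamps are copied from the survivor's and the witness's logs) shows the ACK arrived 5 s after the survivor's 60 s sync_timeout had already expired:

```python
from datetime import datetime

# Parse bare clock times from the logs (all on the same day, so this suffices).
t = lambda s: datetime.strptime(s, "%H:%M:%S")

membership_change = t("17:13:55")  # survivor: "A new membership ... was formed"
quorum_lost       = t("17:14:55")  # survivor: "lost contact with quorum device"
ack_from_qnetd    = t("17:15:00")  # witness: "All ACK votes sent for cluster krogan1"

wait = (quorum_lost - membership_change).total_seconds()
late = (ack_from_qnetd - quorum_lost).total_seconds()
print(f"waited {wait:.0f}s (= sync_timeout), ACK arrived {late:.0f}s after quorum was lost")
```

The witness log shows why the ACK was late: ffsplit waited for a stable membership, which it only got once qnetd disconnected the dead node's client at 17:15:00 ("doesn't sent any message during 40000ms") — dead-peer detection driven by the large dpd_interval=20000. Presumably this is the point of Jan's suggestion: with dpd_interval=1000, detection completes within a couple of seconds, comfortably inside the 60 s window.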
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/