On Fri, Aug 9, 2019 at 9:25 AM Jan Friesse <jfrie...@redhat.com> wrote:
>
> Олег Самойлов napsal(a):
> > Hello all.
> >
> > I have a test bed with several virtual machines for testing pacemaker. I
> > simulate a random failure on one of the nodes. The cluster will span several
> > data centres, so there is no stonith device; instead I use qnetd on the
> > third data centre and a watchdog (softdog). And sometimes (not always), on
> > a failure of one node, the second node is also reset by its watchdog due to
> > loss of quorum. I doubled the quorum timeouts:
> >
> > for qnetd: COROSYNC_QNETD_OPTIONS="-S dpd_interval=20000 -d"
>
> Please do not set dpd_interval that high. dpd_interval on the qnetd side is
> not about how often the ping is sent. Could you please retry your
> test with dpd_interval=1000? I'm pretty sure it will work then.
>
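(For anyone finding this thread later: the suggested change amounts to editing the qnetd daemon options and restarting the service. The sysconfig path below is the usual RHEL/CentOS location and is an assumption here; Debian-based systems keep it under /etc/default/ instead.)

```shell
# /etc/sysconfig/corosync-qnetd  (path assumed; check your distribution)
# dpd_interval controls how often qnetd's dead-peer-detection timer runs,
# not how often clients send heartbeats, so keep it small:
COROSYNC_QNETD_OPTIONS="-S dpd_interval=1000 -d"
```

then restart the daemon on the witness host: `systemctl restart corosync-qnetd`.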
Well, I observed inexplicable resets of a node when its partner was rebooted in a two-node cluster with qdevice. No timers were changed. I have not investigated it in more detail: at first I could not work out what conditions triggered it, and later I decided to use SBD with a shared block device instead, which seems to mostly work. According to the manual page, the default for dpd_interval is 10 seconds (10000 ms). So shouldn't the defaults be changed then?

> Honza
>
> > for pacemaker: pcs quorum device add sync_timeout=60000 timeout=20000 model net host='witness' algorithm=ffsplit
> >
> > Also I set the -I 60 option (net timeout) for sbd.
> >
> > But the effect still exists:
> >
> > Logs, after one of the nodes "tuchanka1a" was powered off at 17:13:53.
> >
> > From the server 'witness' with qnetd:
> >
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug Client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) sent membership node list.
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug msg seq num = 7
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug ring id = (2.4c)
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug heuristics = Undefined
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug node list:
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug node_id = 2, data_center_id = 0, node_state = not set
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug ffsplit: Membership for cluster krogan1 is not yet stable
> > Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug Algorithm result vote is Wait for reply
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug Client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) sent quorum node list.
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug msg seq num = 8
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug quorate = 0
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node list:
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node_id = 1, data_center_id = 0, node_state = dead
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node_id = 2, data_center_id = 0, node_state = member
> > Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug Algorithm result vote is No change
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 warning Client ::ffff:192.168.89.11:47456 doesn't sent any message during 40000ms. Disconnecting
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Client ::ffff:192.168.89.11:47456 (init_received 1, cluster krogan1, node_id 1) disconnect
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: Membership for cluster krogan1 is now stable
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: Quorate partition selected
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug node list:
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug node_id = 2, data_center_id = 0, node_state = not set
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: No client gets NACK
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Sending vote info to client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2)
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug msg seq num = 2
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug vote = ACK
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) replied back to vote info message
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug msg seq num = 2
> > Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: All ACK votes sent for cluster krogan1
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 warning Client ::ffff:192.168.89.12:39144 doesn't sent any message during 40000ms. Disconnecting
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug Client ::ffff:192.168.89.12:39144 (init_received 1, cluster krogan1, node_id 2) disconnect
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: Membership for cluster krogan1 is now stable
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No quorate partition was selected
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No client gets NACK
> > Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No client gets ACK
> >
> > From the other node, 'tuchanka1b'. I deleted all the repeated sbd rows "info: notify_parent: Notifying parent: healthy" and "info: notify_parent: Not notifying parent: state transient (2)".
> >
> > Aug 8 17:13:54 tuchanka1b corosync[1185]: [TOTEM ] A processor failed, forming new configuration.
> > Aug 8 17:13:55 tuchanka1b corosync[1185]: [TOTEM ] A new membership (192.168.89.12:76) was formed. Members left: 1
> > Aug 8 17:13:55 tuchanka1b corosync[1185]: [TOTEM ] Failed to receive the leave message. failed: 1
> > Aug 8 17:13:55 tuchanka1b corosync[1185]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 60000 ms)
> > Aug 8 17:13:55 tuchanka1b stonith-ng[1210]: notice: Node tuchanka1a state is now lost
> > Aug 8 17:13:55 tuchanka1b stonith-ng[1210]: notice: Purged 1 peer with id=1 and/or uname=tuchanka1a from the membership cache
> > Aug 8 17:13:55 tuchanka1b crmd[1214]: notice: Our peer on the DC (tuchanka1a) is dead
> > Aug 8 17:13:55 tuchanka1b crmd[1214]: notice: State transition S_NOT_DC -> S_ELECTION
> > Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Node tuchanka1a state is now lost
> > Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Removing all tuchanka1a attributes for peer loss
> > Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Lost attribute writer tuchanka1a
> > Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Purged 1 peer with id=1 and/or uname=tuchanka1a from the membership cache
> > Aug 8 17:13:55 tuchanka1b cib[1209]: notice: Node tuchanka1a state is now lost
> > Aug 8 17:13:55 tuchanka1b cib[1209]: notice: Purged 1 peer with id=1 and/or uname=tuchanka1a from the membership cache
> > Aug 8 17:14:25 tuchanka1b crmd[1214]: notice: Deletion of "//node_state[@uname='tuchanka1a']/transient_attributes": Timer expired (rc=-62)
> > Aug 8 17:14:55 tuchanka1b corosync[1185]: [VOTEQ ] lost contact with quorum device Qdevice
> > Aug 8 17:14:55 tuchanka1b corosync[1185]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 3
> > Aug 8 17:14:55 tuchanka1b corosync[1185]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> > Aug 8 17:14:55 tuchanka1b corosync[1185]: [QUORUM] Members[1]: 2
> > Aug 8 17:14:55 tuchanka1b corosync[1185]: [MAIN ] Completed service synchronization, ready to provide service.
> > Aug 8 17:14:55 tuchanka1b pacemakerd[1208]: warning: Quorum lost
> > Aug 8 17:14:55 tuchanka1b pacemakerd[1208]: notice: Node tuchanka1a state is now lost
> > Aug 8 17:14:55 tuchanka1b crmd[1214]: warning: Quorum lost
> > Aug 8 17:14:55 tuchanka1b crmd[1214]: notice: Node tuchanka1a state is now lost
> > Aug 8 17:14:55 tuchanka1b sbd[1182]: cluster: info: notify_parent: Notifying parent: healthy
> > Aug 8 17:14:55 tuchanka1b crmd[1214]: notice: State transition S_ELECTION -> S_INTEGRATION
> > Aug 8 17:14:55 tuchanka1b crmd[1214]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Fencing and resource management disabled due to lack of quorum
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Node tuchanka1a is unclean
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1IP_stop_0 on tuchanka1a is unrunnable (offline)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Node tuchanka1a is unclean!
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1DB:0 ( Slave tuchanka1b ) due to no quorum
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1DB:1 ( Master tuchanka1a ) due to node availability (blocked)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1IP ( tuchanka1a ) due to no quorum (blocked)
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1sIP ( tuchanka1b ) due to no quorum
> > Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-4.bz2
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating cancel operation krogan1DB_monitor_15000 locally on tuchanka1b
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating cancel operation krogan1DB_monitor_17000 locally on tuchanka1b
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating notify operation krogan1DB_pre_notify_stop_0 locally on tuchanka1b
> > Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating stop operation krogan1sIP_stop_0 locally on tuchanka1b
> > Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b IPaddr2(krogan1sIP)[5245]: INFO: IP status = ok, IP_CIP=
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Result of stop operation for krogan1sIP on tuchanka1b: 0 (ok)
> > Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Result of notify operation for krogan1DB on tuchanka1b: 0 (ok)
> > Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating stop operation krogan1DB_stop_0 locally on tuchanka1b
> > Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: cluster_status: Fencing and resource management disabled due to lack of quorum
> > Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: pe_fence_node: Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
> > Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: determine_online_status: Node tuchanka1a is unclean
> > Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: info: set_servant_health: Quorum lost: Stop ALL resources
> > Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> > Aug 8 17:14:57 tuchanka1b pgsqlms(krogan1DB)[5312]: INFO: Instance "krogan1DB" stopped
> > Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: Result of stop operation for krogan1DB on tuchanka1b: 0 (ok)
> > Aug 8 17:14:57 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> > Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: Transition 0 (Complete=11, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-4.bz2): Complete
> > Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> > Aug 8 17:14:57 tuchanka1b sbd[1181]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> > Aug 8 17:14:58 tuchanka1b ntpd[557]: Deleting interface #8 eth0, 192.168.89.104#123, interface stats: received=0, sent=0, dropped=0, active_time=260 secs
> > (from now on, repeating) Aug 8 17:14:58 tuchanka1b sbd[1181]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> > (from now on, repeating) Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk: warning: cluster_status: Fencing and resource management disabled due to lack of quorum
> > (from now on, repeating) Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk: warning: pe_fence_node: Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
> > (from now on, repeating) Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk: warning: determine_online_status: Node tuchanka1a is unclean
> > Aug 8 17:15:00 tuchanka1b corosync[1185]: [VOTEQ ] Waiting for all cluster members. Current votes: 2 expected_votes: 3
> > Aug 8 17:15:56 tuchanka1b sbd[1180]: warning: inquisitor_child: Servant pcmk is outdated (age: 61)
> > Aug 8 17:15:59 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
> > Aug 8 17:15:59 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
> > Aug 8 17:16:00 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No liveness for 5 s exceeds threshold of 3 s (healthy servants: 0)
> > Rebooted
> >
> > What is strange:
> >
> > 1. The node expected the quorum vote within 60 s, but it got the vote from qnetd 5 s after that. So at 17:14:55 the node lost quorum.
> > 2. At 17:15:00 it got the vote, but not quorum. Is that due to the "wait for all" option?
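The timing mismatch in question 1 can be checked directly from the timestamps quoted above. A small sketch (plain Python; the three timestamps are copied from the survivor's and the witness's logs) shows the ACK arrived 5 s after the survivor's 60 s sync_timeout had already expired:

```python
from datetime import datetime

# Parse bare clock times from the logs (all on the same day, so this suffices).
t = lambda s: datetime.strptime(s, "%H:%M:%S")

membership_change = t("17:13:55")  # survivor: "A new membership ... was formed"
quorum_lost       = t("17:14:55")  # survivor: "lost contact with quorum device"
ack_from_qnetd    = t("17:15:00")  # witness: "All ACK votes sent for cluster krogan1"

wait = (quorum_lost - membership_change).total_seconds()
late = (ack_from_qnetd - quorum_lost).total_seconds()
print(f"waited {wait:.0f}s (= sync_timeout), ACK arrived {late:.0f}s after quorum was lost")
```

The witness log shows why the ACK was late: ffsplit waited for a stable membership, which it only got once qnetd disconnected the dead node's client at 17:15:00 ("doesn't sent any message during 40000ms") — dead-peer detection driven by the large dpd_interval=20000. Presumably this is the point of Jan's suggestion: with dpd_interval=1000, detection completes within a couple of seconds, comfortably inside the 60 s window.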
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/