06.04.2020 17:05, Sherrard Burton wrote:
> ...or at least that's what i think is happening :-)
>
> two-node cluster, plus quorum-only node. testing the behavior when
> active node is gracefully rebooted. all seems well initially. resources
> are migrated, come up and function as expected.
>
> but, when the rebooted node starts to come back up, the other node seems
> to lose quorum temporarily, even though it still has communication with
> the quorum node. this causes the resources to stop until quorum is
> reestablished.
>
> summary:
> active node: xen-nfs01 192.168.250.50
> standby node: xen-nfs02 192.168.250.51
> quorum node: xen-quorum 192.168.250.52
>
> issue reboot on xen-nfs01
> xen-nfs02 becomes active node
>
> xen-nfs01 starts to come back online
> xen-nfs02 detects loss of quorum and stops resources
> xen-nfs01 finishes booting
> quorum is reestablished
>
>
> instead of overinundating you with all of the debugging output from
> corosync, pacemaker and corosync-qnetd on all nodes, i'll start with the
> basics, and provide whatever else is needed on request.
>

Well, to sensibly interpret the logs, the IP address of each node and the
corosync configuration are needed at the very least.

> TIA
>
>
> from the node that was not rebooted:
> Apr 5 23:10:15 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP error from 192.168.250.51: No route to host
> Apr 5 23:10:15 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP error from 192.168.250.51: No route to host
> Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP error from 192.168.250.50: Connection refused
> Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP error from 192.168.250.50: Connection refused
> Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] rx: host: 1 link: 0 received pong: 1
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Received vote info
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: seq = 6
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: vote = NACK
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: ring id = (2.814)
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm result vote is NACK
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Cast vote timer remains scheduled every 500ms voting NACK.
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] got nodeinfo message from cluster node 2
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 3 flags: 49
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] total_votes=2, expected_votes=3
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] node 1 state=2, votes=1, expected=3
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] node 2 state=1, votes=1, expected=3
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] quorum lost, blocking activity

The qdevice decided not to cast its vote for the nfs02 node.

> Apr 05 23:10:17 [19099] xen-nfs02 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Apr 05 23:10:17 [19099] xen-nfs02 corosync notice [QUORUM] Members[1]: 2
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [QUORUM] sending quorum notification to (nil), length = 52
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] Sending quorum callback, quorate = 0
> ...
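
To make the quorum loss logged above concrete, here is a minimal sketch of the simple-majority rule, not corosync's actual code. The function name `is_quorate` is hypothetical, and the qdevice weight of 1 vote is an assumption inferred from `expected: 3` with two one-vote nodes. The real votequorum logic has more inputs (wait_for_all, last_man_standing and so on) that are ignored here.

```python
def is_quorate(node_votes: int, expected_votes: int,
               qdevice_votes: int, qdevice_casts_vote: bool) -> bool:
    """Simplified majority rule: quorate once counted votes reach floor(expected / 2) + 1."""
    quorum = expected_votes // 2 + 1
    # The qdevice vote only counts for a partition that received ACK from qnetd.
    counted = node_votes + (qdevice_votes if qdevice_casts_vote else 0)
    return counted >= quorum


# Values taken from the log: xen-nfs02 is alone (votes: 1), expected: 3.
# Assumption: the qdevice is configured with 1 vote.
print(is_quorate(1, 3, 1, qdevice_casts_vote=False))  # qnetd answered NACK
print(is_quorate(1, 3, 1, qdevice_casts_vote=True))   # qnetd answered ACK
```

Under these assumptions a lone node needs the qdevice vote to reach 2 of 3, so a NACK from qnetd is enough on its own to make xen-nfs02 inquorate.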
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Votequorum quorum notify callback:
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Quorate = 0
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Node list (size = 3):
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 0 nodeid = 1, state = 2
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 1 nodeid = 2, state = 1
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 2 nodeid = 0, state = 0
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm decided to send list and result vote is No change
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Sending quorum node list seq = 13, quorate = 0
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Node list:
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 0 node_id = 1, data_center_id = 0, node_state = dead
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 1 node_id = 2, data_center_id = 0, node_state = member
>
>
>
> from the quorum node:
> Apr 05 23:10:17 debug New client connected
> Apr 05 23:10:17 debug cluster name = xen-nfs01_xen-nfs02
> Apr 05 23:10:17 debug tls started = 1
> Apr 05 23:10:17 debug tls peer certificate verified = 1
> Apr 05 23:10:17 debug node_id = 1
> Apr 05 23:10:17 debug pointer = 0x55b37c2d74f0
> Apr 05 23:10:17 debug addr_str = ::ffff:192.168.250.50:54462
> Apr 05 23:10:17 debug ring id = (1.814)
> Apr 05 23:10:17 debug cluster dump:
> Apr 05 23:10:17 debug client = ::ffff:192.168.250.51:54876, node_id = 2
> Apr 05 23:10:17 debug client = ::ffff:192.168.250.50:54462, node_id = 1
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1) sent initial node list.
> Apr 05 23:10:17 debug msg seq num = 4
> Apr 05 23:10:17 debug node list:
> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state = not set
> Apr 05 23:10:17 debug node_id = 2, data_center_id = 0, node_state = not set
> Apr 05 23:10:17 debug Algorithm result vote is Ask later
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1) sent membership node list.
> Apr 05 23:10:17 debug msg seq num = 5
> Apr 05 23:10:17 debug ring id = (1.814)
> Apr 05 23:10:17 debug heuristics = Undefined
> Apr 05 23:10:17 debug node list:
> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state = not set
> Apr 05 23:10:17 debug ffsplit: Membership for cluster xen-nfs01_xen-nfs02 is now stable
> Apr 05 23:10:17 debug ffsplit: Quorate partition selected
> Apr 05 23:10:17 debug node list:
> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state = not set
> Apr 05 23:10:17 debug Sending vote info to client ::ffff:192.168.250.51:54876 (cluster xen-nfs01_xen-nfs02, node_id 2)
> Apr 05 23:10:17 debug msg seq num = 6
> Apr 05 23:10:17 debug vote = NACK
> Apr 05 23:10:17 debug Algorithm result vote is No change
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1) sent quorum node list.
> Apr 05 23:10:17 debug msg seq num = 6
> Apr 05 23:10:17 debug quorate = 0
> Apr 05 23:10:17 debug node list:
> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state = member

Oops. How come the node that was rebooted formed a cluster all by itself,
without seeing the second node? Do you have two_node and/or wait_for_all
configured?

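
In case it helps answer the question about the quorum options, here is a rough sketch that pulls the top-level settings out of the `quorum { }` section of a corosync.conf. The helper name `quorum_options` and the sample configuration are hypothetical, not the poster's real file. The parser only handles the simple `key: value` and `name { ... }` layout. Per votequorum(5), the option is spelled `two_node`, and enabling it also enables `wait_for_all` by default.

```python
def quorum_options(conf_text: str) -> dict:
    """Return the top-level key/value pairs found in the quorum { } section."""
    options = {}
    depth = 0          # current brace nesting level
    in_quorum = False  # True while inside the top-level quorum section
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if line.endswith("{"):
            if depth == 0 and line[:-1].strip() == "quorum":
                in_quorum = True
            depth += 1
        elif line == "}":
            depth -= 1
            if depth == 0:
                in_quorum = False
        elif in_quorum and depth == 1 and ":" in line:
            key, value = line.split(":", 1)
            options[key.strip()] = value.strip()
    return options


# Hypothetical sample; only the cluster name is taken from the logs.
SAMPLE = """
totem {
    version: 2
    cluster_name: xen-nfs01_xen-nfs02
}
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
    }
}
"""

opts = quorum_options(SAMPLE)
for key in ("two_node", "wait_for_all"):
    print(key, "=", opts.get(key, "not set"))
```

Running it against the real /etc/corosync/corosync.conf from both cluster nodes (read the file and pass its text in) would show whether either option is set explicitly.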
> Apr 05 23:10:17 debug Algorithm result vote is No change
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.51:54876 (cluster xen-nfs01_xen-nfs02, node_id 2) replied back to vote info message
> Apr 05 23:10:17 debug msg seq num = 6
> Apr 05 23:10:17 debug ffsplit: All NACK votes sent for cluster xen-nfs01_xen-nfs02
> Apr 05 23:10:17 debug Sending vote info to client ::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1)
> Apr 05 23:10:17 debug msg seq num = 1
> Apr 05 23:10:17 debug vote = ACK
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1) replied back to vote info message
> Apr 05 23:10:17 debug msg seq num = 1
> Apr 05 23:10:17 debug ffsplit: All ACK votes sent for cluster xen-nfs01_xen-nfs02
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.51:54876 (cluster xen-nfs01_xen-nfs02, node_id 2) sent quorum node list.
> Apr 05 23:10:17 debug msg seq num = 13
> Apr 05 23:10:17 debug quorate = 0
> Apr 05 23:10:17 debug node list:
> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state = dead
> Apr 05 23:10:17 debug node_id = 2, data_center_id = 0, node_state = member
> Apr 05 23:10:17 debug Algorithm result vote is No change
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/