On 01.03.2021 12:26, Jan Friesse wrote:
> Thanks for digging into logs. I believe Eric is hitting
> https://github.com/corosync/corosync-qdevice/issues/10 (already fixed,
> but may take some time to get into distributions) - it also contains
> workaround.
I tested corosync-qnetd at df3c672, which should include these fixes. It changed the behavior, but I still cannot explain it. Again: ha1+ha2+qnetd, ha2 is the current DC, I disconnect ha1 (blocking everything with the ha1 source MAC; one way to do this is sketched at the end of this message), stonith disabled. corosync and corosync-qdevice on the nodes are still 2.4.5, if it matters.

ha2:

Mar 01 13:23:27 ha2 corosync[1576]: [TOTEM ] A processor failed, forming new configuration.
Mar 01 13:23:28 ha2 corosync[1576]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Mar 01 13:23:28 ha2 corosync[1576]: [TOTEM ] A new membership (192.168.1.2:3632) was formed. Members left: 1
Mar 01 13:23:28 ha2 corosync[1576]: [TOTEM ] Failed to receive the leave message. failed: 1
Mar 01 13:23:28 ha2 corosync[1576]: [CPG ] downlist left_list: 1 received
Mar 01 13:23:28 ha2 pacemaker-based[2032]: notice: Node ha1 state is now lost
Mar 01 13:23:28 ha2 pacemaker-based[2032]: notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
Mar 01 13:23:28 ha2 pacemaker-attrd[2035]: notice: Lost attribute writer ha1
Mar 01 13:23:28 ha2 pacemaker-attrd[2035]: notice: Node ha1 state is now lost
Mar 01 13:23:28 ha2 pacemaker-attrd[2035]: notice: Removing all ha1 attributes for peer loss
Mar 01 13:23:28 ha2 pacemaker-attrd[2035]: notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
Mar 01 13:23:28 ha2 pacemaker-fenced[2033]: notice: Node ha1 state is now lost
Mar 01 13:23:28 ha2 pacemaker-fenced[2033]: notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
Mar 01 13:23:28 ha2 pacemaker-controld[2037]: warning: Stonith/shutdown of node ha1 was not expected
Mar 01 13:23:28 ha2 pacemaker-controld[2037]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Mar 01 13:23:33 ha2 pacemaker-controld[2037]: notice: High CPU load detected: 1.200000
Mar 01 13:23:35 ha2 corosync[1576]: [QUORUM] Members[1]: 2
Mar 01 13:23:35 ha2 corosync[1576]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 01 13:23:35 ha2 pacemaker-attrd[2035]: notice: Recorded local node as attribute writer (was unset)
Mar 01 13:23:35 ha2 pacemaker-controld[2037]: notice: Node ha1 state is now lost
Mar 01 13:23:35 ha2 pacemaker-controld[2037]: warning: Stonith/shutdown of node ha1 was not expected
Mar 01 13:23:36 ha2 pacemaker-schedulerd[2036]: notice: * Promote p_drbd0:0 ( Slave -> Master ha2 )
Mar 01 13:23:36 ha2 pacemaker-schedulerd[2036]: notice: * Start p_fs_clust01 ( ha2 )
Mar 01 13:23:36 ha2 pacemaker-schedulerd[2036]: notice: * Start p_mysql_001 ( ha2 )

So it is pretty fast to react (8 seconds).

ha1:

Mar 01 13:23:27 ha1 corosync[1552]: [TOTEM ] A processor failed, forming new configuration.
Mar 01 13:23:30 ha1 corosync[1552]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Mar 01 13:23:30 ha1 corosync[1552]: [TOTEM ] A new membership (192.168.1.1:3640) was formed. Members left: 2
Mar 01 13:23:30 ha1 corosync[1552]: [TOTEM ] Failed to receive the leave message. failed: 2
Mar 01 13:23:30 ha1 corosync[1552]: [CPG ] downlist left_list: 1 received
Mar 01 13:23:30 ha1 pacemaker-attrd[1738]: notice: Node ha2 state is now lost
Mar 01 13:23:30 ha1 pacemaker-attrd[1738]: notice: Removing all ha2 attributes for peer loss
Mar 01 13:23:30 ha1 pacemaker-attrd[1738]: notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
Mar 01 13:23:30 ha1 pacemaker-based[1735]: notice: Node ha2 state is now lost
Mar 01 13:23:30 ha1 pacemaker-based[1735]: notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
Mar 01 13:23:30 ha1 pacemaker-controld[1740]: notice: Our peer on the DC (ha2) is dead
Mar 01 13:23:30 ha1 pacemaker-controld[1740]: notice: State transition S_NOT_DC -> S_ELECTION
Mar 01 13:23:30 ha1 pacemaker-fenced[1736]: notice: Node ha2 state is now lost
Mar 01 13:23:30 ha1 pacemaker-fenced[1736]: notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
Mar 01 13:23:32 ha1 corosync[1552]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Mar 01 13:23:32 ha1 corosync[1552]: [TOTEM ] A new membership (192.168.1.1:3644) was formed. Members
Mar 01 13:23:32 ha1 corosync[1552]: [CPG ] downlist left_list: 0 received
Mar 01 13:23:33 ha1 corosync[1552]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Mar 01 13:23:33 ha1 corosync[1552]: [TOTEM ] A new membership (192.168.1.1:3648) was formed. Members
Mar 01 13:23:33 ha1 corosync[1552]: [CPG ] downlist left_list: 0 received
Mar 01 13:23:35 ha1 corosync[1552]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
...
Mar 01 13:24:05 ha1 corosync-qdevice[1563]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Mar 01 13:24:05 ha1 corosync-qdevice[1563]: Mar 01 13:24:05 error Can't connect to qnetd host. (-5986): Network address not available (in use?)
Mar 01 13:24:05 ha1 corosync[1552]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Mar 01 13:24:05 ha1 corosync[1552]: [TOTEM ] A new membership (192.168.1.1:3736) was formed. Members
Mar 01 13:24:05 ha1 corosync[1552]: [CPG ] downlist left_list: 0 received
Mar 01 13:24:05 ha1 corosync[1552]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 01 13:24:05 ha1 corosync[1552]: [QUORUM] Members[1]: 1
Mar 01 13:24:05 ha1 corosync[1552]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 01 13:24:05 ha1 pacemaker-controld[1740]: warning: Quorum lost
Mar 01 13:24:05 ha1 pacemaker-controld[1740]: notice: Node ha2 state is now lost
Mar 01 13:24:05 ha1 pacemaker-controld[1740]: notice: State transition S_ELECTION -> S_INTEGRATION
Mar 01 13:24:05 ha1 pacemaker-controld[1740]: notice: Updating quorum status to false (call=56)
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: warning: Blind faith: not fencing unseen nodes
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: warning: Fencing and resource management disabled due to lack of quorum
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Stop p_drbd0:0 ( Master ha1 ) due to no quorum
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Stop p_drbd1:0 ( Slave ha1 ) due to no quorum
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Stop p_fs_clust01 ( ha1 ) due to no quorum
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Start p_fs_clust02 ( ha1 ) due to no quorum (blocked)
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Stop p_mysql_001 ( ha1 ) due to no quorum
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Start p_mysql_006 ( ha1 ) due to no quorum (blocked)
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Start p_mysql_666 ( ha1 ) due to no quorum (blocked)
Mar 01 13:24:05 ha1 pacemaker-controld[1740]: notice: Processing graph 0 (ref=pe_calc-dc-16145

So it took almost 40 seconds to make a decision. Somehow this is exactly the opposite of what I observed before: the disconnected node was fast and the connected node was slow. While I can understand why the behavior changed for the connected node, I still do not understand why the disconnected node now needs so much time.

What is worse, this is not reliable: the next time I tested, both nodes reacted almost immediately (the disconnected node needed just 3 seconds to decide it was out of quorum). That is the most irritating part, as one expects consistent behavior here. And that is something inside corosync/corosync-qdevice.

At least the new qnetd seems to improve the situation with response timing to the surviving node.
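For reference, the "maximum for 30000 ms" in the [VOTEQ ] messages above matches what I believe is the default quorum.device.sync_timeout (quorum.device.timeout defaults to 10000 ms). A minimal corosync.conf fragment for anyone who wants to experiment with these knobs; the qnetd address is a placeholder and I have not verified that tuning them changes the inconsistent delays described above:

quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        # documented defaults (in ms), shown explicitly only for illustration
        timeout: 10000
        sync_timeout: 30000
        net {
            host: 192.168.1.3    # placeholder qnetd host
            algorithm: ffsplit
            tie_breaker: lowest
        }
    }
}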
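And since I only described the isolation method in words above: this kind of blocking by source MAC can be done, for example, with the iptables mac match on the peer node and on the qnetd host (the MAC address below is a placeholder):

# drop everything arriving with ha1's source MAC (placeholder address)
iptables -I INPUT -m mac --mac-source 52:54:00:aa:bb:cc -j DROP

# remove the rule after the test
iptables -D INPUT -m mac --mac-source 52:54:00:aa:bb:cc -j DROP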
