Hi All,

In clusters that do not use STONITH, the nodes ended up erasing each other's transient attributes. The problem occurs when CPU load rises and the corosync token does not stabilize.
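For context on why losing these attributes matters: in real deployments the attribute is typically referenced from a location constraint, so erasing it stops resources. A minimal sketch in crmsh syntax (the resource name prmDummy and constraint id loc-ping are illustrative, not part of the reproduction below):

----
# Hypothetical example: prmDummy may only run on nodes where the
# default_ping_set attribute is defined and has not dropped below 100.
location loc-ping prmDummy \
    rule -inf: not_defined default_ping_set or default_ping_set lt 100
----

If transient_attributes are deleted for all nodes, a rule like this evaluates to -INFINITY everywhere and the resource is stopped.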
I confirmed that the problem can be reproduced with a simple configuration.

Step 1) Configure the cluster.

----
[root@rh74-01 ~]# crm_mon -1 -Af
(snip)
Online: [ rh74-01 rh74-02 ]

Active resources:

 Clone Set: clnPing [prmPing]
     Started: [ rh74-01 rh74-02 ]

Node Attributes:
* Node rh74-01:
    + default_ping_set : 100
* Node rh74-02:
    + default_ping_set : 100

Migration Summary:
* Node rh74-01:
* Node rh74-02:
----

Step 2) Put a heavy CPU load on node 2 so that the token becomes unstable.

----
[root@rh74-02 ~]# stress -c 2 --timeout 2s
----

Step 3) Each node deletes the attributes of the other node; when the cluster recovers partway through, the transient_attributes of all nodes end up deleted.

----
ha-log.extract
(snip)
Dec 18 14:05:47 rh74-01 cib[21140]: info: Completed cib_delete operation for section //node_state[@uname='rh74-01']/transient_attributes: OK (rc=0, origin=rh74-02/crmd/16, version=0.5.35)
(snip)
Dec 18 14:05:49 rh74-01 pengine[21144]: notice: On loss of CCM Quorum: Ignore
Dec 18 14:05:49 rh74-01 pengine[21144]: info: Node rh74-01 is online
Dec 18 14:05:49 rh74-01 pengine[21144]: info: Node rh74-02 is online
Dec 18 14:05:49 rh74-01 pengine[21144]: info: Node 1 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]: info: Node 2 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]: info: Node 1 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]: info: Node 2 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]: info: Clone Set: clnPing [prmPing]
Dec 18 14:05:49 rh74-01 pengine[21144]: info: Started: [ rh74-01 rh74-02 ]
Dec 18 14:05:49 rh74-01 pengine[21144]: info: Leave prmPing:0#011(Started rh74-01)
Dec 18 14:05:49 rh74-01 pengine[21144]: info: Leave prmPing:1#011(Started rh74-02)
Dec 18 14:05:49 rh74-01 pengine[21144]: notice: Calculated transition 8, saving inputs in /var/lib/pacemaker/pengine/pe-input-1.bz2
Dec 18 14:05:49 rh74-01 cib[21387]: warning: Could not verify cluster configuration file /var/lib/pacemaker/cib/cib.xml: No such file or directory (2)
Dec 18 14:05:49 rh74-01 crmd[21145]: info: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE
Dec 18 14:05:49 rh74-01 crmd[21145]: info: Processing graph 8 (ref=pe_calc-dc-1545109549-63) derived from /var/lib/pacemaker/pengine/pe-input-1.bz2
Dec 18 14:05:49 rh74-01 crmd[21145]: notice: Transition 8 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1.bz2): Complete
Dec 18 14:05:49 rh74-01 crmd[21145]: info: Input I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
Dec 18 14:05:49 rh74-01 crmd[21145]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
----

Note that in pe-input-1 below, neither node_state contains a transient_attributes section any more:

----
pe-input-1
(snip)
<status>
  <node_state id="1" uname="rh74-01" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
    <lrm id="1">
      <lrm_resources>
        <lrm_resource id="prmPing" type="ping" class="ocf" provider="pacemaker">
          <lrm_rsc_op id="prmPing_last_0" operation_key="prmPing_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.14" transition-key="3:1:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" transition-magic="0:0;3:1:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" exit-reason="" on_node="rh74-01" call-id="7" rc-code="0" op-status="0" interval="0" last-run="1545109490" last-rc-change="1545109490" exec-time="1076" queue-time="0" op-digest="4866a0300f41b301bf6e8d35bad4029a"/>
          <lrm_rsc_op id="prmPing_monitor_10000" operation_key="prmPing_monitor_10000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.14" transition-key="4:1:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" transition-magic="0:0;4:1:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" exit-reason="" on_node="rh74-01" call-id="8" rc-code="0" op-status="0" interval="10000" last-rc-change="1545109491" exec-time="1076" queue-time="1" op-digest="1c819bc4abd52897263e40446447d063"/>
        </lrm_resource>
      </lrm_resources>
    </lrm>
  </node_state>
  <node_state id="2" uname="rh74-02" crmd="online" crm-debug-origin="do_state_transition" in_ccm="true" join="member" expected="member">
    <lrm id="2">
      <lrm_resources>
        <lrm_resource id="prmPing" type="ping" class="ocf" provider="pacemaker">
          <lrm_rsc_op id="prmPing_last_0" operation_key="prmPing_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.14" transition-key="6:4:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" transition-magic="0:0;6:4:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" exit-reason="" on_node="rh74-02" call-id="7" rc-code="0" op-status="0" interval="0" last-run="1545109506" last-rc-change="1545109506" exec-time="2075" queue-time="1" op-digest="4866a0300f41b301bf6e8d35bad4029a"/>
          <lrm_rsc_op id="prmPing_monitor_10000" operation_key="prmPing_monitor_10000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.14" transition-key="7:4:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" transition-magic="0:0;7:4:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" exit-reason="" on_node="rh74-02" call-id="8" rc-code="0" op-status="0" interval="10000" last-rc-change="1545109508" exec-time="1090" queue-time="0" op-digest="1c819bc4abd52897263e40446447d063"/>
        </lrm_resource>
      </lrm_resources>
    </lrm>
  </node_state>
</status>
(snip)
----

With this simple configuration no visible failure occurs, but for a resource whose location constraint actually references the attribute, the resource is stopped at the time of pe-input-1.

To avoid the problem, the process by which each node deletes the attributes of its peer node needs to be examined.

I confirmed this problem with Pacemaker 1.1.19. The same problem has also been reported by users running Pacemaker 1.1.17.

* The crm_report file is attached.
* The attached log was acquired after applying the high load.
* https://bugs.clusterlabs.org/show_bug.cgi?id=5375

Best Regards,
Hideo Yamauchi.
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org