Hi All,

In clusters that do not use STONITH, we have seen the nodes erase each other's 
transient attributes.
The problem occurs when CPU load rises and the corosync token becomes unstable.

I confirmed that the problem can be reproduced with a simple configuration.

Step1) Configure the cluster.
----
[root@rh74-01 ~]# crm_mon -1 -Af                                                
                                                                           
(snip)
Online: [ rh74-01 rh74-02 ]

Active resources:

 Clone Set: clnPing [prmPing]
     Started: [ rh74-01 rh74-02 ]

Node Attributes:
* Node rh74-01:
    + default_ping_set                  : 100
* Node rh74-02:
    + default_ping_set                  : 100

Migration Summary:
* Node rh74-01:
* Node rh74-02:
----
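
For reference, the configuration used for this reproduction is along the lines 
of the crm shell sketch below. This is a sketch reconstructed from the crm_mon 
output and pe-input-1 (STONITH disabled, no-quorum-policy=ignore, a 10-second 
ping monitor); the ping target and the multiplier are assumptions, not values 
taken from the test environment.

---- configuration sketch (crm shell; <ping-target> and multiplier=100 are assumed)
property stonith-enabled=false \
    no-quorum-policy=ignore
primitive prmPing ocf:pacemaker:ping \
    params name=default_ping_set host_list=<ping-target> multiplier=100 \
    op monitor interval=10s
clone clnPing prmPing
----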

Step2) Put a heavy load on the CPU of node 2, making the corosync token unstable.
----
[root@rh74-02 ~]# stress -c 2 --timeout 2s
----
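
For context, the token referred to in Step2 is the corosync totem token. Nothing 
was tuned for it in this reproduction; the corosync.conf excerpt below is only 
an assumption about what was in effect (1000 ms is the corosync 2.x default), 
shown to make clear which timer becomes unreliable under load.

---- corosync.conf totem excerpt (assumed defaults, not a recommendation)
totem {
    version: 2
    # token timeout in milliseconds; when CPU load keeps corosync from
    # handling the token within this time, membership becomes unstable
    token: 1000
}
----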

Step3) Each node deletes the attributes of the other node; when the cluster 
recovers in the middle of this, the transient_attributes of all nodes end up 
deleted.


---- ha-log.extract
(snip)
Dec 18 14:05:47 rh74-01 cib[21140]:    info: Completed cib_delete operation for 
section //node_state[@uname='rh74-01']/transient_attributes: OK (rc=0, 
origin=rh74-02/crmd/16, version=0.5.35)
(snip)
Dec 18 14:05:49 rh74-01 pengine[21144]:  notice: On loss of CCM Quorum: Ignore
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node rh74-01 is online
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node rh74-02 is online
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node 1 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node 2 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node 1 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node 2 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]:    info:  Clone Set: clnPing [prmPing]
Dec 18 14:05:49 rh74-01 pengine[21144]:    info:      Started: [ rh74-01 
rh74-02 ]
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Leave   prmPing:0#011(Started 
rh74-01)
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Leave   prmPing:1#011(Started 
rh74-02)
Dec 18 14:05:49 rh74-01 pengine[21144]:  notice: Calculated transition 8, 
saving inputs in /var/lib/pacemaker/pengine/pe-input-1.bz2
Dec 18 14:05:49 rh74-01 cib[21387]: warning: Could not verify cluster 
configuration file /var/lib/pacemaker/cib/cib.xml: No such file or directory (2)
Dec 18 14:05:49 rh74-01 crmd[21145]:    info: State transition S_POLICY_ENGINE 
-> S_TRANSITION_ENGINE
Dec 18 14:05:49 rh74-01 crmd[21145]:    info: Processing graph 8 
(ref=pe_calc-dc-1545109549-63) derived from 
/var/lib/pacemaker/pengine/pe-input-1.bz2
Dec 18 14:05:49 rh74-01 crmd[21145]:  notice: Transition 8 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-1.bz2): Complete
Dec 18 14:05:49 rh74-01 crmd[21145]:    info: Input I_TE_SUCCESS received in 
state S_TRANSITION_ENGINE from notify_crmd
Dec 18 14:05:49 rh74-01 crmd[21145]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE

---- pe-input-1
(snip)
  <status>
    <node_state id="1" uname="rh74-01" in_ccm="true" crmd="online" 
crm-debug-origin="do_state_transition" join="member" expected="member">
      <lrm id="1">
        <lrm_resources>
          <lrm_resource id="prmPing" type="ping" class="ocf" 
provider="pacemaker">
            <lrm_rsc_op id="prmPing_last_0" operation_key="prmPing_start_0" 
operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.14" 
transition-key="3:1:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" 
transition-magic="0:0;3:1:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" 
exit-reason="" on_node="rh74-01" call-id="7" rc-code="0" op-status="0" 
interval="0" last-run="1545109490" last-rc-change="1545109490" exec-time="1076" 
queue-time="0" op-digest="4866a0300f41b301bf6e8d35bad4029a"/>
            <lrm_rsc_op id="prmPing_monitor_10000" 
operation_key="prmPing_monitor_10000" operation="monitor" 
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.14" 
transition-key="4:1:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" 
transition-magic="0:0;4:1:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" 
exit-reason="" on_node="rh74-01" call-id="8" rc-code="0" op-status="0" 
interval="10000" last-rc-change="1545109491" exec-time="1076" queue-time="1" 
op-digest="1c819bc4abd52897263e40446447d063"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
    </node_state>
    <node_state id="2" uname="rh74-02" crmd="online" 
crm-debug-origin="do_state_transition" in_ccm="true" join="member" 
expected="member">
      <lrm id="2">
        <lrm_resources>
          <lrm_resource id="prmPing" type="ping" class="ocf" 
provider="pacemaker">
            <lrm_rsc_op id="prmPing_last_0" operation_key="prmPing_start_0" 
operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.14" 
transition-key="6:4:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" 
transition-magic="0:0;6:4:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" 
exit-reason="" on_node="rh74-02" call-id="7" rc-code="0" op-status="0" 
interval="0" last-run="1545109506" last-rc-change="1545109506" exec-time="2075" 
queue-time="1" op-digest="4866a0300f41b301bf6e8d35bad4029a"/>
            <lrm_rsc_op id="prmPing_monitor_10000" 
operation_key="prmPing_monitor_10000" operation="monitor" 
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.14" 
transition-key="7:4:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" 
transition-magic="0:0;7:4:0:2677a223-7aef-4a3a-abcf-a4fe262dcb85" 
exit-reason="" on_node="rh74-02" call-id="8" rc-code="0" op-status="0" 
interval="10000" last-rc-change="1545109508" exec-time="1090" queue-time="0" 
op-digest="1c819bc4abd52897263e40446447d063"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
    </node_state>
  </status>
(snip)
----
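
The erasure can also be checked directly against the CIB. A minimal check, 
reusing the xpath from the cib_delete log line above; after the deletion this 
query returns nothing for either node, and crm_mon shows an empty Node 
Attributes section:

----
[root@rh74-01 ~]# cibadmin --query --xpath "//node_state[@uname='rh74-01']/transient_attributes"
[root@rh74-01 ~]# crm_mon -1 -A
----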

With this simple configuration no real harm is done, but for a resource whose 
placement actually depends on the attribute through a constraint, the resource 
is stopped at the point of pe-input-1, because the attribute has been erased.
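
As an illustration only (prmDummy, the constraint id, and the score threshold 
are hypothetical, not the reporting users' actual configuration), a constraint 
like the following is the typical pattern: once default_ping_set is erased, the 
not_defined clause matches on every node, the rule scores -INFINITY everywhere, 
and the resource is stopped.

---- hypothetical constraint sketch (crm shell)
primitive prmDummy ocf:heartbeat:Dummy
location loc-prmDummy prmDummy \
    rule -inf: not_defined default_ping_set or default_ping_set lt 100
----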

To avoid this problem, the process by which a node's attributes are deleted at 
the request of its peer node needs to be reviewed.

I confirmed this problem with Pacemaker 1.1.19.
The same problem has also been reported by users running Pacemaker 1.1.17.

 * The crm_report file is attached.
 * The attached log was captured after applying the high load.
 * https://bugs.clusterlabs.org/show_bug.cgi?id=5375

Best Regards,
Hideo Yamauchi.