[ClusterLabs] pacemaker reports monitor timeout while CPU is high

范国腾 Wed, 10 Jan 2018 01:46:52 -0800

Hello,

This issue only appears when we run a performance test and CPU load is high. 
The cluster configuration and log are below. Pacemaker restarts the slave-side 
pgsql-ha resource about every two minutes.


Take the following scenario as an example. (When the pgsqlms RA is called, we 
print the log "execute the command start (command)". When the command 
returns, we print the log "execute the command stop (command) (result)".)

1.     We can see that Pacemaker calls "pgsqlms monitor" about every 15 
seconds, and it returns $OCF_SUCCESS.

2.     It calls the monitor command again at 13:56:16, and then reports a 
timeout error at 13:56:18. Only 2 seconds have passed, but it reports 
"timeout=10000ms".

3.     In other logs, sometimes after 15 minutes, no "execute the command 
start monitor" line is printed and the timeout error is reported directly.

Could you please tell us how to debug or resolve this issue?

The log:

Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU load detected: 426.779999
Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms
Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
Jan 10 13:56:18 sds2 crmd[26096]:  notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
Jan 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
Jan 10 13:56:19 sds2 pengine[26095]:  notice: Recover pgsqld:0#011(Slave db2)
Jan 10 13:56:19 sds2 pengine[26095]:  notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
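
The gap described in item 2 can be checked mechanically. The following is a 
minimal Python sketch, not part of our setup: the helper names (parse_stamp, 
monitor_gaps) are made up for illustration, the year 2018 is taken from the 
date of this message because syslog timestamps carry none, and the regular 
expressions only match the exact log format shown above. It pairs the RA's 
"start monitor" line with the lrmd timeout line for the same PID and reports 
how many seconds separate them next to the timeout lrmd reports.

```python
import re
from datetime import datetime

# Three lines copied from the log above (wrapped lines joined).
LOG = """\
Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms
"""

START_RE = re.compile(
    r"pgsqlms\(pgsqld\)\[(\d+)\]: INFO: execute the command start monitor")
TIMEOUT_RE = re.compile(r"pgsqld_monitor_\d+:(\d+) - timed out after (\d+)ms")


def parse_stamp(line):
    # Assumption: the year is 2018 (syslog lines do not include it).
    return datetime.strptime("2018 " + line[:15], "%Y %b %d %H:%M:%S")


def monitor_gaps(log):
    """Map PID -> (seconds from RA start line to lrmd timeout, reported timeout in ms)."""
    started = {}
    gaps = {}
    for line in log.splitlines():
        m = START_RE.search(line)
        if m:
            started[m.group(1)] = parse_stamp(line)
            continue
        m = TIMEOUT_RE.search(line)
        if m and m.group(1) in started:
            elapsed = (parse_stamp(line) - started[m.group(1)]).total_seconds()
            gaps[m.group(1)] = (elapsed, int(m.group(2)))
    return gaps


print(monitor_gaps(LOG))  # {'5606': (2.0, 10000)}
```

For PID 5606 this gives 2 seconds against a reported 10000 ms. The previous 
monitor finished at 13:55:52, and 13:55:52 plus the 16 s interval plus the 
10 s timeout is 13:56:18, so the numbers would be consistent with the timeout 
being counted from when the operation was launched rather than from the RA's 
first log line. That reading is an assumption; the log does not confirm it.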


The Cluster Configuration:
2 nodes and 13 resources configured

Online: [ db1 db2 ]

Full list of resources:

Clone Set: dlm-clone [dlm]
     Started: [ db1 db2 ]
Clone Set: clvmd-clone [clvmd]
     Started: [ db1 db2 ]
ipmi_node1     (stonith:fence_ipmilan):        Started db2
ipmi_node2     (stonith:fence_ipmilan):        Started db1
Clone Set: clusterfs-clone [clusterfs]
     Started: [ db1 db2 ]
Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ db1 ]
     Slaves: [ db2 ]
Resource Group: mastergroup
     db1-vip    (ocf::heartbeat:IPaddr2):       Started
     rep-vip    (ocf::heartbeat:IPaddr2):       Started
Resource Group: slavegroup
     db2-vip    (ocf::heartbeat:IPaddr2):       Started


pcs resource show pgsql-ha
Master: pgsql-ha
  Meta Attrs: interleave=true notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
   Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
               promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
               demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)

_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
