Hello,
This issue only appears when we run performance test and the CPU is high. The
cluster and log is as below. The Pacemaker will restart the Slave Side pgsql-ha
resource about every two minutes.
Take the following scenario for example:(when the pgsqlms RA is called, we
print the log “execute the command start (command)”. When the command is
returned, we print the log “execute the command stop (Command) (result)”)
1. We could see that pacemaker call “pgsqlms monitor” about every 15
seconds. And it return $OCF_SUCCESS
2. In calls monitor command again at 13:56:16, and then it reports timeout
error error 13:56:18. It is only 2 seconds but it reports “timeout=10000ms”
3. In other logs, sometimes after 15 minutes, there is no “execute the
command start monitor” printed and it reports timeout error directly.
Could you please tell how to debug or resolve such issue?
The log:
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start
monitor
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop
monitor 0
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start
monitor
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop
monitor 0
Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.779999
Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start
monitor
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID
5606) timed out
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed
out after 10000ms
Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for
pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [
/tmp:5432 - accepting connections\n ]
Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE ->
S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for
pgsqld:0 on db2: unknown error (1)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for
pgsqld:1 on db1: unknown error (1)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1
after 1000000 failures (max=1000000)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1
after 1000000 failures (max=1000000)
Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)
Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving
inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
The Cluster Configuration:
2 nodes and 13 resources configured
Online: [ db1 db2 ]
Full list of resources:
Clone Set: dlm-clone [dlm]
Started: [ db1 db2 ]
Clone Set: clvmd-clone [clvmd]
Started: [ db1 db2 ]
ipmi_node1 (stonith:fence_ipmilan): Started db2
ipmi_node2 (stonith:fence_ipmilan): Started db1
Clone Set: clusterfs-clone [clusterfs]
Started: [ db1 db2 ]
Master/Slave Set: pgsql-ha [pgsqld]>
Masters: [ db1 ]
Slaves: [ db2 ]
Resource Group: mastergroup
db1-vip (ocf::heartbeat:IPaddr2): Started
rep-vip (ocf::heartbeat:IPaddr2): Started
Resource Group: slavegroup
db2-vip (ocf::heartbeat:IPaddr2): Started
pcs resource show pgsql-ha
Master: pgsql-ha
Meta Attrs: interleave=true notify=true
Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
monitor interval=15s role=Master timeout=10s
(pgsqld-monitor-interval-15s)
monitor interval=16s role=Slave timeout=10s
(pgsqld-monitor-interval-16s)
notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org