Hi!

A few years ago I was playing with cgroups and got quite interesting (useful) results, but applying the cgroups to existing and newly started processes was quite hard to integrate into the OS, so I did not pursue it further. I think cgroups are even more powerful today, but I haven't followed how easy they are to use on systemd-based systems (systemd uses cgroups heavily, AFAIK).
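A minimal sketch of the idea (not tested here; it assumes cgroup v1 mounted under /sys/fs/cgroup and uses the pgdata path shown further down in this thread, so adjust names and limits to your setup):

    # create a CPU cgroup for the PostgreSQL server processes and cap it at ~2 CPUs
    mkdir /sys/fs/cgroup/cpu/pgsql
    echo 100000 > /sys/fs/cgroup/cpu/pgsql/cpu.cfs_period_us
    echo 200000 > /sys/fs/cgroup/cpu/pgsql/cpu.cfs_quota_us    # quota/period = 2 CPUs
    # move the postmaster into the group; the backends it forks inherit the cgroup
    echo $(head -1 /home/postgres/data/postmaster.pid) > /sys/fs/cgroup/cpu/pgsql/cgroup.procs

A cpuset cgroup could be used the same way to keep other tasks away from the CPUs the cluster daemons run on.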
In short: you may be unable to control the client processes, but you could control the server processes the clients start.

Regards,
Ulrich

>>> ??? <[email protected]> wrote on 11.01.2018 at 05:01 in message <[email protected]>:
> Ulrich,
>
> Thank you very much for the help. When we do the performance test, our
> application (pgsql-ha) starts more than 500 processes to handle the client
> requests. Could this be the cause of the issue?
>
> Is there any workaround or method to keep Pacemaker from restarting the
> resource in such a situation? Right now the system cannot work when the
> clients send a high call load, and we cannot control the clients' behavior.
>
> Thanks
>
>
> -----Original Message-----
> From: Ulrich Windl [mailto:[email protected]]
> Sent: 10 January 2018 18:20
> To: [email protected]
> Subject: [ClusterLabs] Antw: pacemaker reports monitor timeout while CPU is high
>
> Hi!
>
> I can only speak for myself: in former times with HP-UX we had severe
> performance problems when the load was in the range of 8 to 14 (I/O waits not
> included, averaged over all logical CPUs), while on Linux we only get
> problems with a load above 40 or so (I/O included, summed over all 24
> logical CPUs). Also, I/O waits cause cluster timeouts before the CPU load
> actually matters (for us).
> So with a load above 400 (not knowing your number of CPUs), this should not
> be that unusual. How many threads are running on your system at that time?
> It might be worth the effort to bind the cluster processes to specific CPUs
> and keep other tasks away from them, but I don't have experience with that.
> I guess the "High CPU load detected" message triggers some internal throttling
> in the cluster engine (assuming the cluster engine caused the high load). Of
> course for "external" load that measure won't help...
>
> Regards,
> Ulrich
>
>>>> ??? <[email protected]> wrote on 10.01.2018 at 10:40 in message <[email protected]>:
>> Hello,
>>
>> This issue only appears when we run a performance test and the CPU load is high.
>> The cluster configuration and log are below. Pacemaker restarts the slave-side
>> pgsql-ha resource about every two minutes.
>>
>> Take the following scenario as an example (when the pgsqlms RA is called, we
>> log "execute the command start (command)"; when the command returns, we log
>> "execute the command stop (command) (result)"):
>>
>> 1. We can see that Pacemaker calls "pgsqlms monitor" about every 15 seconds,
>> and it returns $OCF_SUCCESS.
>>
>> 2. It calls the monitor command again at 13:56:16 and then reports a timeout
>> error at 13:56:18. That is only 2 seconds, yet it reports "timeout=10000ms".
>>
>> 3. In other logs, sometimes after 15 minutes there is no "execute the command
>> start monitor" printed and the timeout error is reported directly.
>>
>> Could you please tell us how to debug or resolve this issue?
>>
>> The log:
>>
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
>> Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.779999
>> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms
>> Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
>> Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
>> Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
>> Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)
>> Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
>>
>>
>> The cluster configuration:
>> 2 nodes and 13 resources configured
>>
>> Online: [ db1 db2 ]
>>
>> Full list of resources:
>>
>>  Clone Set: dlm-clone [dlm]
>>      Started: [ db1 db2 ]
>>  Clone Set: clvmd-clone [clvmd]
>>      Started: [ db1 db2 ]
>>  ipmi_node1   (stonith:fence_ipmilan):   Started db2
>>  ipmi_node2   (stonith:fence_ipmilan):   Started db1
>>  Clone Set: clusterfs-clone [clusterfs]
>>      Started: [ db1 db2 ]
>>  Master/Slave Set: pgsql-ha [pgsqld]
>>      Masters: [ db1 ]
>>      Slaves: [ db2 ]
>>  Resource Group: mastergroup
>>      db1-vip   (ocf::heartbeat:IPaddr2):   Started
>>      rep-vip   (ocf::heartbeat:IPaddr2):   Started
>>  Resource Group: slavegroup
>>      db2-vip   (ocf::heartbeat:IPaddr2):   Started
>>
>>
>> pcs resource show pgsql-ha
>>  Master: pgsql-ha
>>   Meta Attrs: interleave=true notify=true
>>   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>>    Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
>>    Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
>>                stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>>                promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
>>                demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>>                monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
>>                monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
>>                notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
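
One observation on the configuration above (a hedged suggestion, not something proposed in the thread itself): both monitor operations use timeout=10s, which is the 10000ms limit the lrmd enforces in the log. If load spikes like this are expected, that timeout could be raised, for example with something like:

    pcs resource update pgsqld op monitor interval=16s role=Slave timeout=60s

How pcs handles updating one of several monitor operations differs between versions, so verify the result with "pcs resource show pgsql-ha" afterwards.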
