On Mon, 2017-09-11 at 16:02 +0530, Anu Pillai wrote:
> Hi,
>
> We are using a 3 node cluster (2 active and 1 standby).
> When failover happens, CPU utilization goes high on the newly active
> node as well as the other active node. It remains in the high CPU state
> for nearly 20 seconds.
>
> We have 122 resource attributes under the resource (res1) which is
> failing over. Failover triggered at 14:49:05.
>
> Cluster information:
> Pacemaker 1.1.14
> Corosync Cluster Engine, version '2.3.5'
> pcs version 0.9.150
> dc-version: 1.1.14-5a6cdd1
> no-quorum-policy: ignore
> notification-agent: /etc/sysconfig/notify.sh
> notification-recipient: /var/log/notify.log
> placement-strategy: balanced
> startup-fencing: true
> stonith-enabled: false
>
> Our device has 8 cores. Pacemaker and related applications are
> running on core 6.
>
> top command output:
> CPU0:  4.4% usr 17.3% sys  0.0% nic 75.7% idle  0.0% io  1.9% irq  0.4% sirq
> CPU1:  9.5% usr  2.5% sys  0.0% nic 88.0% idle  0.0% io  0.0% irq  0.0% sirq
> CPU2:  1.4% usr  1.4% sys  0.0% nic 96.5% idle  0.4% io  0.0% irq  0.0% sirq
> CPU3:  3.4% usr  0.4% sys  0.0% nic 95.5% idle  0.4% io  0.0% irq  0.0% sirq
> CPU4:  7.9% usr  2.4% sys  0.0% nic 88.5% idle  0.9% io  0.0% irq  0.0% sirq
> CPU5:  0.5% usr  0.5% sys  0.0% nic 98.5% idle  0.5% io  0.0% irq  0.0% sirq
> CPU6: 60.3% usr 38.6% sys  0.0% nic  0.0% idle  0.0% io  0.4% irq  0.4% sirq
> CPU7:  2.9% usr 10.3% sys  0.0% nic 83.6% idle  2.9% io  0.0% irq  0.0% sirq
> Load average: 3.47 1.82 1.63 7/314 11444
>
>   PID   PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
>  4921   4839 hacluste R <  78492  2.8   6  2.0 /usr/libexec/pacemaker/cib
> 11240  11239 root     RW<      0  0.0   6  1.9 [python]
>  4925   4839 hacluste R <  52804  1.9   6  1.1 /usr/libexec/pacemaker/pengine
>  4637      1 root     R <  97620  3.5   6  0.4 corosync -p -f
>  4926   4839 hacluste S <   131m  4.8   6  0.3 /usr/libexec/pacemaker/crmd
>  4839      1 root     S <  33448  1.2   6  0.1 pacemakerd
>
> I am attaching the log for your reference.
>
> Regards,
> Aswathi
Is there a reason all the cluster services are pegged to one core? Pacemaker
can take advantage of multiple cores, both by spreading the daemons across
cores and by running multiple resource actions at once.

I see you're using the original "notifications" implementation. It has been
superseded by "alerts" in Pacemaker 1.1.15 and later. I recommend upgrading
if you can, which will also get you bug fixes in Pacemaker and Corosync that
could help. In any case, your notify script /etc/sysconfig/notify.sh is
generating errors. If you don't really need the notify logging, I'd disable
it and see if that helps.

It looks to me like, after failover, the resource agent is setting a lot of
node attributes and possibly its own resource attributes. Each of those
changes requires the cluster to recalculate resource placement, and that is
probably where most of the CPU usage is coming from. (BTW, setting node
attributes is fine, but a resource agent generally shouldn't change its own
configuration.)

You should be able to reduce the CPU usage by setting "dampening" on the
node attributes. This makes the cluster wait a short time before writing
node attribute changes to the CIB, so the recalculation doesn't have to
occur immediately after each change. See the "--delay" option to
attrd_updater (which can be used when creating the attribute initially).

-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
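P.S. For concreteness, the two suggestions above might look roughly like the
following. This is an untested sketch: it assumes Pacemaker 1.1.15+ and a pcs
release with alert support, the attribute name "my_attr" and its value are
made up, and the commands need a live cluster to actually run.

```shell
# Replace the legacy notification-agent/-recipient properties with an
# alert (script path and recipient taken from the original configuration):
pcs alert create path=/etc/sysconfig/notify.sh id=notify_alert
pcs alert recipient add notify_alert value=/var/log/notify.log

# Set a node attribute with dampening: attrd waits 5s and batches any
# further changes before writing to the CIB, so placement is not
# recalculated after every single update ("my_attr" is hypothetical):
attrd_updater --name my_attr --update 1 --delay 5s
```

The same --delay can be given on the first update that creates the attribute,
so the dampening is in place before the agent starts flooding changes.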