Re: [ClusterLabs] Corosync: 100% cpu (corosync 2.3.5, libqb 0.17.1, pacemaker 1.1.13)

2015-08-06 Thread Pallai Roland
Thanks, resolved.

I ran into the following libqb issue:
 https://github.com/ClusterLabs/libqb/issues/139
 https://github.com/ClusterLabs/libqb/pull/141

Applying commit 7f56f58 to libqb 0.17.1 fixed my problem.

https://github.com/davidvossel/libqb/commit/7f56f583d891859c94b24db0ec38a301c3f3466a.patch




___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Corosync: 100% cpu (corosync 2.3.5, libqb 0.17.1, pacemaker 1.1.13)

2015-08-05 Thread Pallai Roland
hi,

I've built a recent cluster stack from source on Debian Jessie and I can't
get rid of CPU spikes. Corosync stalls the entire system for seconds on
every simple transition, including corosync itself:

 drbdtest1 corosync[4734]:   [MAIN  ] Corosync main process was not scheduled for 2590.4512 ms (threshold is 2400. ms). Consider token timeout increase.

and DRBD as well:
 drbdtest1 kernel: drbd p1: PingAck did not arrive in time.

My previous build (corosync 1.4.6, libqb 0.17.0, pacemaker 1.1.12) works
fine on these nodes with the same corosync/pacemaker setup.

What should I try? It's a test environment, the issue is 100% reproducible
in seconds. Network traffic is minimal all the time and there is no I/O
load.
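
To quantify how often and by how much corosync overruns its threshold, the warning above can be pulled out of the logs with a small script. This is only a sketch: the function name `parse_pause` and the regular expression are my own, written against the single log line quoted above, and other corosync versions may word the message differently.

```python
import re

# Matches the pause duration and threshold in corosync's scheduling warning.
# The fractional part is optional because the quoted line shows "2400." with
# no digits after the decimal point.
PAUSE_RE = re.compile(
    r"not scheduled for (?P<pause>\d+(?:\.\d*)?) ms "
    r"\(threshold is (?P<threshold>\d+(?:\.\d*)?) ms\)"
)


def parse_pause(line):
    """Return (pause_ms, threshold_ms), or None if the line is not a pause warning."""
    # Collapse whitespace first so messages wrapped by a mail client still match.
    match = PAUSE_RE.search(" ".join(line.split()))
    if match is None:
        return None
    return float(match.group("pause")), float(match.group("threshold"))


sample = (
    "drbdtest1 corosync[4734]:   [MAIN  ] Corosync main process was not\n"
    "scheduled for 2590.4512 ms (threshold is 2400. ms). Consider token\n"
    "timeout increase."
)

pause_ms, threshold_ms = parse_pause(sample)
print(f"paused {pause_ms:.1f} ms, {pause_ms - threshold_ms:.1f} ms over threshold")
```

Feeding every line of /var/log/corosync/corosync.log through `parse_pause` and keeping the non-None results gives a quick picture of whether the stalls line up with cluster transitions.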


*Pacemaker config:*

node 167969573: drbdtest1
node 167969574: drbdtest2
primitive drbd_p1 ocf:linbit:drbd \
    params drbd_resource=p1 \
    op monitor interval=30
primitive drbd_p2 ocf:linbit:drbd \
    params drbd_resource=p2 \
    op monitor interval=30
primitive dummy_test ocf:pacemaker:Dummy \
    meta allow-migrate=true \
    params state=/var/run/activenode
primitive fence_libvirt stonith:external/libvirt \
    params hostlist=drbdtest1,drbdtest2 hypervisor_uri=qemu+ssh://libvirt-fencing@mgx4/system \
    op monitor interval=30
primitive fs_boot Filesystem \
    params device=/dev/null directory=/boot fstype=* \
    meta is-managed=false \
    op monitor interval=20 timeout=40 on-fail=block OCF_CHECK_LEVEL=20
primitive fs_f1 Filesystem \
    params device=/dev/drbd/by-res/p1 directory=/mnt/p1 fstype=ext4 options=commit=60,barrier=0,data=writeback \
    op monitor interval=20 timeout=40 \
    op start timeout=300 interval=0 \
    op stop timeout=180 interval=0
primitive ip_10.3.3.138 IPaddr2 \
    params ip=10.3.3.138 cidr_netmask=32 \
    op monitor interval=10s timeout=20s
primitive sysinfo ocf:pacemaker:SysInfo \
    op start timeout=20s interval=0 \
    op stop timeout=20s interval=0 \
    op monitor interval=60s
group dummy-group dummy_test
ms ms_drbd_p1 drbd_p1 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_drbd_p2 drbd_p2 \
    meta master-max=2 master-node-max=1 clone-max=2 notify=true
clone fencing_by_libvirt fence_libvirt \
    meta globally-unique=false
clone fs_boot_clone fs_boot
clone sysinfos sysinfo \
    meta globally-unique=false
location fs1_on_high_load fs_f1 \
    rule -inf: cpu_load gte 4
colocation dummy_coloc inf: dummy-group ms_drbd_p2:Master
colocation f1a-coloc inf: fs_f1 ms_drbd_p1:Master
colocation f1b-coloc inf: fs_f1 fs_boot_clone:Started
order dummy_order inf: ms_drbd_p2:promote dummy-group:start
order orderA inf: ms_drbd_p1:promote fs_f1:start
property cib-bootstrap-options: \
    dc-version=1.1.13-6052cd1 \
    cluster-infrastructure=corosync \
    expected-quorum-votes=2 \
    no-quorum-policy=ignore \
    symmetric-cluster=true \
    placement-strategy=default \
    last-lrm-refresh=1438735742 \
    have-watchdog=false
property cib-bootstrap-options-stonith: \
    stonith-enabled=true \
    stonith-action=reboot
rsc_defaults rsc-options: \
    resource-stickiness=100
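
A side note on the two node IDs at the top of this config: they are not arbitrary. Read as 32-bit integers they decode to IPv4 addresses on the cluster network, which is what I would expect when no nodeid is set explicitly and corosync derives one from the ring 0 address. The sketch below only does the arithmetic; the helper names are made up, and the exact derivation inside corosync (byte order, the effect of clear_node_high_bit) is my assumption, not something I checked in the source.

```python
import ipaddress


def nodeid_from_ipv4(addr, clear_high_bit=True):
    """Turn a dotted-quad IPv4 address into a 32-bit node ID.

    Assumption: the ID is the address in network byte order, with the top
    bit masked off when clear_node_high_bit is enabled.
    """
    value = int(ipaddress.IPv4Address(addr))
    if clear_high_bit:
        value &= 0x7FFFFFFF
    return value


def ipv4_from_nodeid(nodeid):
    """Inverse mapping, valid as long as no high bit had to be cleared."""
    return str(ipaddress.IPv4Address(nodeid))


for nodeid in (167969573, 167969574):
    print(nodeid, "->", ipv4_from_nodeid(nodeid))
```

This is handy when matching the numeric IDs in Pacemaker output against the hosts on the wire.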


*corosync.conf:*

totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    interface {
        ringnumber: 0
        bindnetaddr: 10.3.3.37
        mcastaddr: 225.0.0.37
        mcastport: 5403
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 2
}
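
For reference, the 2400 ms threshold in the log message is consistent with 80% of the 3000 ms token timeout configured above. The sketch below reads the token value out of a corosync.conf-style text and computes that figure. The parser is a deliberately minimal one of my own (it ignores repeated sections and anything beyond `key: value` pairs and braces), and the 0.8 factor is an assumption inferred from these two numbers, not taken from the corosync documentation.

```python
def parse_corosync_conf(text):
    """Parse 'key: value' lines and 'name { ... }' sections into nested dicts.

    Limitation: a repeated section name (e.g. two interface blocks)
    overwrites the earlier one.
    """
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("{"):
            section = {}
            stack[-1][line[:-1].strip()] = section
            stack.append(section)
        elif line == "}":
            stack.pop()
        else:
            key, _, value = line.partition(":")
            stack[-1][key.strip()] = value.strip()
    return root


conf = parse_corosync_conf("""
totem {
    version: 2
    token: 3000
    interface {
        ringnumber: 0
        mcastport: 5403
    }
}
""")

# Assumption: the scheduling-pause threshold is 80% of the token timeout.
PAUSE_FRACTION = 0.8
token_ms = int(conf["totem"]["token"])
threshold_ms = token_ms * PAUSE_FRACTION
print(f"token={token_ms} ms, expected pause threshold={threshold_ms:.0f} ms")
```

If that relationship holds, raising the token timeout only moves the warning threshold; it does not address whatever is keeping the main process off the CPU.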