Hello,
I ran the verification for the Trusty version.
root@juju-niedbalski-sec-machine-15:/home/ubuntu# dpkg -l|grep corosync
ii corosync 2.3.3-1ubuntu3
amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync-common4 2.3.3-1ubuntu3
amd64 Standards-based cluster framework, common library
I configured a 3-node nova-cloud-controller environment related to
hacluster.
ubuntu@niedbalski-sec-bastion:~/openstack-charm-testing/bundles/dev$ juju run --service nova-cloud-controller "sudo corosync-quorumtool -s|grep votes"
- MachineId: "15"
Stdout: |
Expected votes: 3
Total votes: 3
UnitId: nova-cloud-controller/0
- MachineId: "28"
Stdout: |
Expected votes: 3
Total votes: 3
UnitId: nova-cloud-controller/1
- MachineId: "29"
Stdout: |
Expected votes: 3
Total votes: 3
UnitId: nova-cloud-controller/2
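The per-unit comparison above can also be scripted. What follows is a minimal sketch, not part of the original verification: votes_match is a hypothetical helper name, and it assumes the "Expected votes:" and "Total votes:" line format shown in the output above.

```shell
#!/bin/sh
# votes_match: read `corosync-quorumtool -s` output on stdin and succeed
# only when "Expected votes" equals "Total votes".
# Hypothetical helper; assumes the line format shown above.
votes_match() {
    awk -F': *' '
        /Expected votes/ { expected = $2 + 0; have_e = 1 }
        /Total votes/    { total = $2 + 0; have_t = 1 }
        END { exit !(have_e && have_t && expected == total) }
    '
}

# Example use on a cluster node (requires corosync to be running):
#   sudo corosync-quorumtool -s | votes_match && echo "all votes present"
```

The helper reads stdin rather than calling corosync-quorumtool itself, so it can be checked against saved output as well as a live node.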
I changed the transport mode to UDP unicast (udpu) by setting:
$ juju set hacluster-ncc corosync_transport=udpu
After this, I moved to the primary node (the one that holds the virtual IP
address) and applied the tc rules several times while monitoring the memory
usage of the corosync process.
root@juju-niedbalski-sec-machine-15:/home/ubuntu# tc qdisc add dev eth0 root netem delay 550ms
root@juju-niedbalski-sec-machine-15:/home/ubuntu# tc qdisc del dev eth0 root netem
Apr 6 17:57:37 juju-niedbalski-sec-machine-15 cib[14387]: warning:
cib_process_request: Completed cib_apply_diff operation for section 'all':
Application of an update diff failed (rc=-206, origin=local/cibadmin/2,
version=0.27.1)
Apr 6 18:04:12 juju-niedbalski-sec-machine-15 corosync[14376]: [MAIN ]
Completed service synchronization, ready to provide service.
Apr 6 18:04:13 juju-niedbalski-sec-machine-15 corosync[18645]: [MAIN ]
Completed service synchronization, ready to provide service.
Apr 6 18:06:27 juju-niedbalski-sec-machine-15 corosync[18645]: [MAIN ]
Completed service synchronization, ready to provide service.
Apr 6 18:06:28 juju-niedbalski-sec-machine-15 corosync[19528]: [MAIN ]
Completed service synchronization, ready to provide service.
Apr 6 18:07:48 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN ]
Completed service synchronization, ready to provide service.
Apr 6 18:07:49 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN ]
Completed service synchronization, ready to provide service.
Apr 6 18:08:16 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN ]
Completed service synchronization, ready to provide service.
Apr 6 18:08:59 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN ]
Completed service synchronization, ready to provide service.
Apr 6 18:09:38 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN ]
Completed service synchronization, ready to provide service.
After 5 minutes of observing the corosync process with:
$ while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done
I don't see any substantial memory usage increase.
root@juju-niedbalski-sec-machine-15:/home/ubuntu# more memory-usage.log
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
135584 3928
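To put a number on "no substantial increase", the log can be summarized instead of read by eye. This is a small sketch under stated assumptions: rss_growth is a hypothetical helper name, and it assumes the two-column "VSZ RSS" format (in KiB, as printed by ps) that the loop above writes to memory-usage.log.

```shell
#!/bin/sh
# rss_growth: print first, last, and delta of the RSS column (KiB) of a
# log made of "VSZ RSS" lines, as written by the ps loop above.
# Hypothetical helper; lines that are not two integers are skipped.
rss_growth() {
    awk '
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            if (!seen) { first = $2; seen = 1 }
            last = $2
        }
        END {
            if (!seen) { print "no samples"; exit 1 }
            printf "first=%d last=%d delta=%d\n", first, last, last - first
        }
    ' "$1"
}

# Example: rss_growth memory-usage.log
```

For the samples listed above this reports a delta of 0. For the two RSS values quoted in the bug description below (4036 and 10352) it would report a delta of 6316.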
--
https://bugs.launchpad.net/bugs/1563089
Title:
Memory Leak when new cluster configuration is formed.
Status in corosync package in Ubuntu:
Fix Released
Status in corosync source package in Trusty:
Fix Committed
Status in corosync source package in Wily:
Fix Committed
Bug description:
[Environment]
Trusty 14.04.3
Packages:
ii corosync 2.3.3-1ubuntu1
amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync-common4 2.3.3-1ubuntu1
amd64 Standards-based cluster framework, common library
[Reproducer]
1) I deployed an HA environment using this bundle
(http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml)
with a 3-node installation of cinder related to an HACluster subordinate
unit.
$ juju-deployer -c next-ha.yaml -w 600 trusty-kilo
2) I changed the default corosync transport mode to unicast.
$ juju set cinder-hacluster corosync_transport=udpu
3) I verified that the 3 units were quorate
cinder/0# corosync-quorumtool
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
1002 1 10.5.1.57 (local)
1001 1 10.5.1.58
1000 1 10.5.1.59
The primary unit was holding the VIP resource 10.5.105.1/16
root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP
group default qlen 1000
link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff
inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0
valid_lft forever preferred_lft forever
4) I manually added a TC queue for the eth0 interface on the node
holding the VIP resource, introducing a 350 ms delay.
$ sudo tc qdisc add dev eth0 root netem delay 350ms
5) Right after adding the 350 ms delay on the cinder/0 unit, the corosync
process reports that one of the processors failed and that it is forming a new
cluster configuration.
Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A
processor failed, forming new configuration.
Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new
membership (10.5.1.57:11628) was formed. Members
Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM]
Members[3]: 1002 1001 1000
Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ]
Completed service synchronization, ready to provide service.
This happens on all of the units.
6) After receiving this message, I remove the queue from eth0:
$ sudo tc qdisc del dev eth0 root netem
Then the following messages are logged on the master node:
Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new
membership (10.5.1.57:11628) was formed. Members
Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM]
Members[3]: 1002 1001 1000
Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ]
Completed service synchronization, ready to provide service.
7) While executing steps 5 and 6 repeatedly, I ran the following command to
track the VSZ and RSS memory usage of the corosync process:
root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms
root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem
$ while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done
The results show that both VSZ and RSS increase over time at a
high rate.
25476 4036
... (after 5 minutes).
135644 10352
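Steps 5 and 6 above can be repeated in a loop rather than by hand. The following is a rough sketch, not the procedure actually used in this report: toggle_delay, the TC override variable, and the cycle count and hold time in the example are all assumptions made for illustration, and the hold time in particular would need tuning so that corosync has time to log the new membership.

```shell
#!/bin/sh
# toggle_delay: add and then remove a netem delay on an interface,
# repeating the cycle a given number of times (steps 5 and 6 above).
# Hypothetical helper. Arguments: interface, delay, cycles, hold seconds.
# TC can be overridden (for example TC=echo) for a dry run without root.
toggle_delay() {
    dev=$1 delay=$2 cycles=$3 hold=$4
    i=0
    while [ "$i" -lt "$cycles" ]; do
        ${TC:-tc} qdisc add dev "$dev" root netem delay "$delay"
        sleep "$hold"   # wait while the delay is in effect
        ${TC:-tc} qdisc del dev "$dev" root netem
        sleep "$hold"   # wait before starting the next cycle
        i=$((i + 1))
    done
}

# Example (as root, on the node holding the VIP):
#   toggle_delay eth0 350ms 10 30
```

Running it with TC=echo prints the tc commands instead of executing them, which is a cheap way to check the sequence before touching a live interface.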
[Fix]
Based on this reproducer, my preliminary assessment is that this commit
(https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9)
is a good candidate for backporting to Ubuntu Trusty.
[Test Case]
* See reproducer
[Backport Impact]
* Not identified