[Ubuntu-ha] [Bug 1563089] Re: Memory Leak when new cluster configuration is formed.

Jorge Niedbalski Wed, 06 Apr 2016 11:31:38 -0700

Hello,

I ran the verification for the Trusty version.


root@juju-niedbalski-sec-machine-15:/home/ubuntu# dpkg -l|grep corosync
ii  corosync             2.3.3-1ubuntu3  amd64  Standards-based cluster framework (daemon and modules)
ii  libcorosync-common4  2.3.3-1ubuntu3  amd64  Standards-based cluster framework, common library

I configured a 3-node nova-cloud-controller environment related to
hacluster.

ubuntu@niedbalski-sec-bastion:~/openstack-charm-testing/bundles/dev$ juju run --service nova-cloud-controller "sudo corosync-quorumtool -s|grep votes"
- MachineId: "15"
  Stdout: |
    Expected votes:   3
    Total votes:      3
  UnitId: nova-cloud-controller/0
- MachineId: "28"
  Stdout: |
    Expected votes:   3
    Total votes:      3
  UnitId: nova-cloud-controller/1
- MachineId: "29"
  Stdout: |
    Expected votes:   3
    Total votes:      3
  UnitId: nova-cloud-controller/2
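
The same check can be scripted so that it fails when a unit is missing
votes. This is a minimal sketch, assuming the "Expected votes:" and
"Total votes:" lines shown above; votes_match is a hypothetical helper
name, not part of corosync.

```shell
# votes_match (hypothetical helper): read `corosync-quorumtool -s` output
# on stdin and succeed only if "Expected votes" equals "Total votes".
votes_match() {
    awk -F: '
        /Expected votes:/ { expected = $2 + 0 }
        /Total votes:/    { total = $2 + 0 }
        END {
            printf "expected=%d total=%d\n", expected, total
            exit !(expected > 0 && expected == total)
        }'
}

# Example, run on a cluster node:
#   sudo corosync-quorumtool -s | votes_match
```

The exit status makes it usable in a loop over the units, for instance
as the command passed to "juju run".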

I changed the transport mode to UDP unicast (udpu) by setting:

$ juju set hacluster-ncc corosync_transport=udpu

After this, on the primary node (the one that holds the virtual IP
address), I added and removed the tc netem rule multiple times while
monitoring the memory usage of the corosync process:

root@juju-niedbalski-sec-machine-15:/home/ubuntu# tc qdisc add dev eth0 root netem delay 550ms
root@juju-niedbalski-sec-machine-15:/home/ubuntu# tc qdisc del dev eth0 root netem
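
The add/delete cycle can be looped instead of typed by hand. This is a
sketch, assuming the interface and delay used above; netem_cycle and the
TC override variable are hypothetical names introduced here, and the real
run needs root.

```shell
# netem_cycle IFACE DELAY ROUNDS HOLD_SECONDS   (hypothetical helper)
# Adds a netem delay, holds it, removes it, waits, and repeats.
# Set TC="echo tc" for a dry run that only prints the commands.
netem_cycle() {
    iface=$1 delay=$2 rounds=$3 hold=$4
    i=0
    while [ "$i" -lt "$rounds" ]; do
        ${TC:-tc} qdisc add dev "$iface" root netem delay "$delay" || return 1
        sleep "$hold"
        ${TC:-tc} qdisc del dev "$iface" root netem || return 1
        sleep "$hold"
        i=$((i + 1))
    done
}

# Example (as root): 10 rounds, holding the 550ms delay for 30 seconds each.
#   netem_cycle eth0 550ms 10 30
```

The hold time is a guess; it should be long enough for corosync to log
that a new membership was formed before the delay is removed.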

Apr  6 17:57:37 juju-niedbalski-sec-machine-15 cib[14387]:  warning: cib_process_request: Completed cib_apply_diff operation for section 'all': Application of an update diff failed (rc=-206, origin=local/cibadmin/2, version=0.27.1)
Apr  6 18:04:12 juju-niedbalski-sec-machine-15 corosync[14376]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:04:13 juju-niedbalski-sec-machine-15 corosync[18645]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:06:27 juju-niedbalski-sec-machine-15 corosync[18645]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:06:28 juju-niedbalski-sec-machine-15 corosync[19528]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:07:48 juju-niedbalski-sec-machine-15 corosync[19985]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:07:49 juju-niedbalski-sec-machine-15 corosync[19985]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:08:16 juju-niedbalski-sec-machine-15 corosync[19985]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:08:59 juju-niedbalski-sec-machine-15 corosync[19985]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:09:38 juju-niedbalski-sec-machine-15 corosync[19985]:  [MAIN  ] Completed service synchronization, ready to provide service.
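
To see how many synchronizations each corosync PID completed, the log can
be tallied. This is a sketch, assuming syslog lines in the format shown
above; sync_events is a hypothetical helper name, and /var/log/syslog is
an assumed log location.

```shell
# sync_events [FILE...]   (hypothetical helper)
# Counts "Completed service synchronization" messages per corosync PID.
# Reads stdin when no file is given. Output: "PID COUNT", sorted by PID.
sync_events() {
    awk '/corosync\[[0-9]+\]:.*Completed service synchronization/ {
            # Extract the digits between "corosync[" and "]".
            match($0, /corosync\[[0-9]+\]/)
            pid = substr($0, RSTART + 9, RLENGTH - 10)
            count[pid]++
        }
        END { for (pid in count) print pid, count[pid] }' "$@" | sort -n
}

# Example (log path is an assumption):
#   sync_events /var/log/syslog
```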

After observing the corosync process for 5 minutes with:

 $ while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done

I don't see any substantial memory usage increase; VSZ and RSS (both in
KiB) stay flat:

root@juju-niedbalski-sec-machine-15:/home/ubuntu# more memory-usage.log 
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
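
Rather than reading the log by eye, the growth between the first and last
sample can be computed. This is a sketch, assuming the two-column
"VSZ RSS" format written by the loop above (ps reports both in KiB);
mem_summary is a hypothetical helper name.

```shell
# mem_summary [FILE]   (hypothetical helper)
# Reports the sample count and the VSZ/RSS growth (last minus first, KiB).
# Lines that are not two plain integers are ignored.
mem_summary() {
    awk 'NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            if (n == 0) { vsz0 = $1; rss0 = $2 }
            vsz = $1; rss = $2; n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d vsz_growth_kib=%d rss_growth_kib=%d\n",
                   n, vsz - vsz0, rss - rss0
        }' "$@"
}

# Example:
#   mem_summary memory-usage.log
```

Fed the two samples quoted in the bug description (25476 4036, then
135644 10352), it reports a VSZ growth of 110168 KiB and an RSS growth of
6316 KiB; for the flat log above both values are 0.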

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to corosync in Ubuntu.
https://bugs.launchpad.net/bugs/1563089

Title:
  Memory Leak when new cluster configuration is formed.

Status in corosync package in Ubuntu:
  Fix Released
Status in corosync source package in Trusty:
  Fix Committed
Status in corosync source package in Wily:
  Fix Committed

Bug description:
  [Environment]

  Trusty 14.04.3

  Packages:

  ii  corosync             2.3.3-1ubuntu1  amd64  Standards-based cluster framework (daemon and modules)
  ii  libcorosync-common4  2.3.3-1ubuntu1  amd64  Standards-based cluster framework, common library

  [Reproducer]

  1) I deployed an HA environment using this bundle
  (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml)
  with a 3-node installation of cinder related to an HACluster subordinate unit.

  $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo

  2) I changed the default corosync transport mode to unicast.

  $ juju set cinder-hacluster corosync_transport=udpu

  3) I verified that the 3 units were quorate:

  cinder/0# corosync-quorumtool
  Votequorum information
  ----------------------
  Expected votes:   3
  Highest expected: 3
  Total votes:      3
  Quorum:           2
  Flags:            Quorate

  Membership information
  ----------------------
      Nodeid      Votes Name
        1002          1 10.5.1.57 (local)
        1001          1 10.5.1.58
        1000          1 10.5.1.59

  The primary unit was holding the VIP resource 10.5.105.1/16

  root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr
  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP group default qlen 1000
      link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff
      inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0
         valid_lft forever preferred_lft forever
      inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0
         valid_lft forever preferred_lft forever

  4) I manually added a TC queue for the eth0 interface on the node
  holding the VIP resource, introducing a 350 ms delay.

  $ sudo tc qdisc add dev eth0 root netem delay 350ms

  5) Right after adding the 350 ms delay on the cinder/0 unit, corosync
  reports that a processor failed and that it is forming a new cluster
  configuration.

  Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]:  [TOTEM ] A processor failed, forming new configuration.
  Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]:  [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
  Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]:  [QUORUM] Members[3]: 1002 1001 1000
  Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]:  [MAIN  ] Completed service synchronization, ready to provide service.

  This happens on all of the units.

  6) After receiving this message, I remove the queue from eth0:

  $ sudo tc qdisc del dev eth0 root netem

  Then the following messages are logged on the master node:

  Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]:  [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
  Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]:  [QUORUM] Members[3]: 1002 1001 1000
  Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]:  [MAIN  ] Completed service synchronization, ready to provide service.

  7) While repeating steps 5 and 6:

  root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms
  root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem

  I ran the following command to track the VSZ and RSS memory usage of
  the corosync process:

  $ while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done

  The results show that both VSZ and RSS increase over time at a
  high rate.

  25476 4036

  ... (after 5 minutes).

  135644 10352
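
The one-second sampling from step 7 can also record a timestamp with each
sample, which makes it easier to line the memory growth up with the
corosync log entries. This is a sketch, not the command used for the
numbers above; sample_mem is a hypothetical helper name.

```shell
# sample_mem PID COUNT INTERVAL   (hypothetical helper)
# Prints "epoch_seconds vsz_kib rss_kib" COUNT times, INTERVAL seconds apart.
sample_mem() {
    pid=$1 count=$2 interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # "vsz=" and "rss=" suppress the ps header, so no grep is needed.
        mem=$(ps -o vsz= -o rss= -p "$pid") || return 1
        # Word-split the ps output to drop its column padding.
        set -- $mem
        echo "$(date +%s) $1 $2"
        sleep "$interval"
        i=$((i + 1))
    done
}

# Example: 300 samples, one per second, for the oldest corosync process.
#   sample_mem "$(pgrep -o corosync)" 300 1 | tee -a memory-usage.log
```

The function stops with a non-zero status as soon as ps can no longer
find the PID, for example if corosync is restarted during the run.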

  [Fix]

  Based on this reproducer, my preliminary assessment is that this commit
  (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9)
  is a good candidate for backporting to Ubuntu Trusty.

  [Test Case]

  * See reproducer

  [Backport Impact]

  * Not identified

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1563089/+subscriptions

_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp
