[Ubuntu-ha] [Bug 1890491] Re: A pacemaker node fails monitor (probe) and stop/start operations on a resource because it returns "rc=189"
** Changed in: pacemaker (Ubuntu Bionic)
       Status: New => In Progress

** Changed in: pacemaker (Ubuntu Bionic)
     Assignee: (unassigned) => Jorge Niedbalski (niedbalski)

https://bugs.launchpad.net/bugs/1890491

Title:
  A pacemaker node fails monitor (probe) and stop/start operations on a
  resource because it returns "rc=189"

Status in pacemaker package in Ubuntu: Fix Released
Status in pacemaker source package in Bionic: In Progress
Status in pacemaker source package in Focal: Fix Released
Status in pacemaker source package in Groovy: Fix Released

Bug description:

  Cause: Pacemaker implicitly ordered all stops needed on a Pacemaker
  Remote node before the stop of the node's Pacemaker Remote connection,
  including stops that were implied by fencing of the node. Also, Pacemaker
  scheduled actions on Pacemaker Remote nodes with a failed connection so
  that the actions could be done once the connection was recovered, even if
  the connection wasn't being recovered (for example, if the node was
  shutting down when the failure occurred).

  Consequence: If a Pacemaker Remote node needed to be fenced while it was
  in the process of shutting down, Pacemaker scheduled probes on the node
  once the fencing completed. The probes fail because the connection is not
  actually active. Due to the failed probe, a stop is scheduled, which also
  fails, leading to fencing of the node again, and the situation repeats
  itself indefinitely.

  Fix: Pacemaker Remote connection stops are no longer ordered after
  implied stops, and actions are not scheduled on Pacemaker Remote nodes
  when the connection is failed and not being started again.

  Result: A Pacemaker Remote node that needs to be fenced while it is in
  the process of shutting down is fenced once, without repeating
  indefinitely.

  The fix appears to be included in pacemaker-1.1.21-1.el7.

  Related to https://bugzilla.redhat.com/show_bug.cgi?id=1704870

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1890491/+subscriptions
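A minimal sketch of how the symptom described above could be observed on an
affected cluster node, assuming shell access and the tools shipped with
pacemaker (output detail of the fencing history varies by pacemaker version):

  # Repeated fencing of the same remote node points at the loop described above
  sudo stonith_admin --history '*'
  # One-shot cluster status with fail counts from the failed probe/stop operations
  sudo crm_mon -1 --failcounts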
[Ubuntu-ha] [Bug 1890491] Re: A pacemaker node fails monitor (probe) and stop/start operations on a resource because it returns "rc=189"
Hello,

I am testing a couple of patches (both imported from master) through this
PPA: https://launchpad.net/~niedbalski/+archive/ubuntu/fix-1890491

c20f8920 - don't order implied stops relative to a remote connection
938e99f2 - remote state is failed if node is shutting down with connection failure

I'll report back here if these patches fix the behavior described in my
previous comment.
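For anyone who wants to try the same test packages, a hedged sketch of
pulling them from that PPA on an affected Bionic node (the PPA name comes
from the comment above; the exact set of packages to upgrade is an
assumption):

  # Add the PPA carrying the candidate fixes and upgrade pacemaker from it
  sudo add-apt-repository ppa:niedbalski/fix-1890491
  sudo apt-get update
  sudo apt-get install --only-upgrade pacemaker pacemaker-remote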
[Ubuntu-ha] [Bug 1890491] Re: A pacemaker node fails monitor (probe) and stop/start operations on a resource because it returns "rc=189"
I am able to reproduce a similar issue with the following bundle:
https://paste.ubuntu.com/p/VJ3m7nMN79/

Resource created with:

  sudo pcs resource create test2 ocf:pacemaker:Dummy op_sleep=10 op monitor interval=30s timeout=30s op start timeout=30s op stop timeout=30s

  juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-10.cloud.sts"
  juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-11.cloud.sts"
  juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-12.cloud.sts"

Online: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
RemoteOnline: [ juju-acda3d-pacemaker-remote-10.cloud.sts juju-acda3d-pacemaker-remote-11.cloud.sts juju-acda3d-pacemaker-remote-12.cloud.sts ]

Full list of resources:

 Resource Group: grp_nova_vips
     res_nova_bf9661e_vip (ocf::heartbeat:IPaddr2): Started juju-acda3d-pacemaker-remote-7
 Clone Set: cl_nova_haproxy [res_nova_haproxy]
     Started: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
 juju-acda3d-pacemaker-remote-10.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-8
 juju-acda3d-pacemaker-remote-12.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-8
 juju-acda3d-pacemaker-remote-11.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-7
 test2 (ocf::pacemaker:Dummy): Started juju-acda3d-pacemaker-remote-10.cloud.sts

## After running the following commands on juju-acda3d-pacemaker-remote-10.cloud.sts

1) sudo systemctl stop pacemaker_remote
2) Forcefully shut the instance down (openstack server stop) less than 10
   seconds after the pacemaker_remote stop is executed.

The remote is shut down:

RemoteOFFLINE: [ juju-acda3d-pacemaker-remote-10.cloud.sts ]

The resource status remains Stopped across the 3 machines and doesn't
recover:

$ juju run --application nova-cloud-controller "sudo pcs resource show | grep -i test2"
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
  UnitId: nova-cloud-controller/0
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
  UnitId: nova-cloud-controller/1
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
  UnitId: nova-cloud-controller/2

However, if I do a clean shutdown (without interrupting the pacemaker_remote
stop), the resource is migrated correctly to another node:

6 nodes configured
9 resources configured

Online: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
RemoteOnline: [ juju-acda3d-pacemaker-remote-11.cloud.sts juju-acda3d-pacemaker-remote-12.cloud.sts ]
RemoteOFFLINE: [ juju-acda3d-pacemaker-remote-10.cloud.sts ]

Full list of resources:
[...]
 test2 (ocf::pacemaker:Dummy): Started juju-acda3d-pacemaker-remote-12.cloud.sts

I will keep investigating this behavior and determine whether it is linked
to the bug reported here.
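A hedged sketch of the reproduction timing described in the comment above;
the unit and instance names are placeholders, and each command runs on the
host named in its comment:

  # On the Pacemaker Remote node (step 1): begin the pacemaker_remote shutdown
  sudo systemctl stop pacemaker_remote
  # From the OpenStack client (step 2), within ~10 seconds: force the backing instance off
  openstack server stop <instance-backing-the-remote-node>
  # From a cluster node: watch whether test2 recovers or stays Stopped
  sudo crm_mon -1 | grep -i test2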
[Ubuntu-ha] [Bug 1890491] Re: A pacemaker node fails monitor (probe) and stop/start operations on a resource because it returns "rc=189"
** Also affects: pacemaker (Ubuntu Groovy)
   Importance: Undecided
       Status: New

** Also affects: pacemaker (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: pacemaker (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Changed in: pacemaker (Ubuntu Groovy)
       Status: New => Fix Released

** Changed in: pacemaker (Ubuntu Focal)
       Status: New => Fix Released
[Ubuntu-ha] [Bug 1644152] Re: Pacemaker hang during upgrade to 9.2
** Also affects: pacemaker (Ubuntu)
   Importance: Undecided
       Status: New

** No longer affects: pacemaker (Ubuntu)

https://bugs.launchpad.net/bugs/1644152

Title:
  Pacemaker hang during upgrade to 9.2

Status in Fuel for OpenStack: Fix Released

Bug description:

  During upgrade from pacemaker version 1.1.14-2~u14.04+mos1 to version
  1.1.14-2~u14.04+mos2, the lrmd process hangs and does not allow pacemaker
  to recover from a corosync outage.

  Long way to reproduce:
  1. Install 9.1 with one controller node in HA mode.
  2. Try to upgrade to 9.2.

  Expected result: the upgrade finishes without problems.

  Result: the upgrade fails on some random component outage. There are
  errors in the pacemaker log:

    error: mainloop_add_ipc_server: Could not start pengine IPC server: Address already in use (-98)
    error: main: Failed to create IPC server: shutting down and inhibiting respawn

  The pacemaker process restarts every 2-3 minutes. For an example, see
  https://bugs.launchpad.net/fuel/+bug/1641947

  Fast way to reproduce:
  1. Install 9.0 or 9.1 with one controller node in HA mode.
  2. Log in to the controller over ssh.
  3. service corosync stop
  4. Update the packages pacemaker-cli-utils, pacemaker-common,
     pacemaker-resource-agents and pacemaker to 1.1.14-2~u14.04+mos2.
  5. service corosync start
  6. Wait 60 seconds for pacemaker to respawn.
  7. service pacemaker restart

  Expected result: pacemaker recovers from the corosync outage.

  Result: pacemaker fails to communicate with the zombie lrmd and
  constantly restarts.

To manage notifications about this bug go to:
https://bugs.launchpad.net/fuel/+bug/1644152/+subscriptions
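A hedged shell sketch of the "fast way to reproduce" above, assuming the MOS
9.x repositories already provide the +mos2 packages and that the node uses
the sysvinit/upstart service names found on Ubuntu 14.04:

  sudo service corosync stop
  sudo apt-get install --only-upgrade pacemaker-cli-utils pacemaker-common pacemaker-resource-agents pacemaker
  sudo service corosync start
  sleep 60                        # give pacemaker time to respawn
  sudo service pacemaker restart  # reported to leave pacemaker looping against the zombie lrmd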
[Ubuntu-ha] [Bug 1677684] Re: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
** Patch added: "lp1677684-trusty.debdiff" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4867853/+files/lp1677684-trusty.debdiff -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1677684 Title: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb- blackbox: not found Status in corosync package in Ubuntu: In Progress Status in corosync source package in Trusty: In Progress Status in corosync source package in Xenial: In Progress Status in corosync source package in Yakkety: In Progress Bug description: [Environment] Ubuntu Xenial 16.04 Amd64 [Test Case] 1) sudo apt-get install corosync 2) sudo corosync-blackbox. root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L corosync |grep black /usr/bin/corosync-blackbox Expected results: corosync-blackbox runs OK. Current results: $ sudo corosync-blackbox /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found [Impact] * Cannot run corosync-blackbox [Regression Potential] * None identified. [Fix] Make the package dependant of libqb-dev root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L libqb-dev | grep qb-bl /usr/sbin/qb-blackbox To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1677684] Re: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
** Patch removed: "lp1677684-zesty.debdiff" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4850712/+files/lp1677684-zesty.debdiff ** Patch removed: "lp1677684-xenial.debdiff" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4850713/+files/lp1677684-xenial.debdiff ** Patch removed: "lp1677684-yakkety.debdiff" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4851293/+files/lp1677684-yakkety.debdiff ** Patch removed: "lp1677684-trusty.debdiff" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4851294/+files/lp1677684-trusty.debdiff ** Patch added: "lp1677684-zesty.debdiff" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4867850/+files/lp1677684-zesty.debdiff -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1677684 Title: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb- blackbox: not found Status in corosync package in Ubuntu: In Progress Status in corosync source package in Trusty: In Progress Status in corosync source package in Xenial: In Progress Status in corosync source package in Yakkety: In Progress Bug description: [Environment] Ubuntu Xenial 16.04 Amd64 [Test Case] 1) sudo apt-get install corosync 2) sudo corosync-blackbox. root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L corosync |grep black /usr/bin/corosync-blackbox Expected results: corosync-blackbox runs OK. Current results: $ sudo corosync-blackbox /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found [Impact] * Cannot run corosync-blackbox [Regression Potential] * None identified. [Fix] Make the package dependant of libqb-dev root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L libqb-dev | grep qb-bl /usr/sbin/qb-blackbox To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1677684] Re: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
Hello Christian,

Thanks for looking into this. I just followed what the build dependency
suggested (>= 0.12); there is no strict dependency on it. Do you want me to
leave it as libqb-dev, or is this something you can fix when merging?

Let me know how to proceed. Thanks.
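A hedged one-liner for checking which libqb version a given release actually
provides, which is what the (>= 0.12) constraint discussed above hinges on
(assumes apt is configured for the target release):

  apt-cache policy libqb-dev libqb0   # compare the candidate version against the >= 0.12 constraint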
[Ubuntu-ha] [Bug 1677684] Re: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
Hello Christian,

I've attached the updated debdiff for Trusty (checking for libqb-dev >= 0.12)
as well as the requested Yakkety debdiff. Thanks for looking into this.
[Ubuntu-ha] [Bug 1677684] Re: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
** Patch removed: "lp1677684-trusty.debdiff" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4850762/+files/lp1677684-trusty.debdiff ** Patch added: "lp1677684-yakkety.debdiff" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4851293/+files/lp1677684-yakkety.debdiff -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1677684 Title: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb- blackbox: not found Status in corosync package in Ubuntu: In Progress Status in corosync source package in Trusty: In Progress Status in corosync source package in Xenial: In Progress Status in corosync source package in Yakkety: In Progress Bug description: [Environment] Ubuntu Xenial 16.04 Amd64 [Test Case] 1) sudo apt-get install corosync 2) sudo corosync-blackbox. root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L corosync |grep black /usr/bin/corosync-blackbox Expected results: corosync-blackbox runs OK. Current results: $ sudo corosync-blackbox /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found [Impact] * Cannot run corosync-blackbox [Regression Potential] * None identified. [Fix] Make the package dependant of libqb-dev root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L libqb-dev | grep qb-bl /usr/sbin/qb-blackbox To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1677684] Re: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
** Changed in: corosync (Ubuntu Yakkety)
       Status: Confirmed => In Progress

** Changed in: corosync (Ubuntu Yakkety)
   Importance: Undecided => Medium
[Ubuntu-ha] [Bug 1677684] Re: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
** Patch added: "lp1677684-trusty.debdiff" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4850762/+files/lp1677684-trusty.debdiff ** Description changed: [Environment] Ubuntu Xenial 16.04 Amd64 - [Reproduction] + [Test Case] - - Install corosync - - Run the corosync-blackbox executable. + 1) sudo apt-get install corosync + 2) sudo corosync-blackbox. root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L corosync |grep black /usr/bin/corosync-blackbox Expected results: corosync-blackbox runs OK. + Current results: $ sudo corosync-blackbox /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found - Fix: + [Impact] + * Cannot run corosync-blackbox + + [Regression Potential] + + * None identified. + + [Fix] Make the package dependant of libqb-dev root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L libqb-dev | grep qb-bl /usr/sbin/qb-blackbox -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1677684 Title: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb- blackbox: not found Status in corosync package in Ubuntu: In Progress Status in corosync source package in Trusty: In Progress Status in corosync source package in Xenial: In Progress Bug description: [Environment] Ubuntu Xenial 16.04 Amd64 [Test Case] 1) sudo apt-get install corosync 2) sudo corosync-blackbox. root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L corosync |grep black /usr/bin/corosync-blackbox Expected results: corosync-blackbox runs OK. Current results: $ sudo corosync-blackbox /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found [Impact] * Cannot run corosync-blackbox [Regression Potential] * None identified. [Fix] Make the package dependant of libqb-dev root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L libqb-dev | grep qb-bl /usr/sbin/qb-blackbox To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1677684] Re: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
** Patch added: "lp1677684-xenial.debdiff" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4850713/+files/lp1677684-xenial.debdiff -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1677684 Title: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb- blackbox: not found Status in corosync package in Ubuntu: In Progress Status in corosync source package in Trusty: In Progress Status in corosync source package in Xenial: In Progress Bug description: [Environment] Ubuntu Xenial 16.04 Amd64 [Reproduction] - Install corosync - Run the corosync-blackbox executable. root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L corosync |grep black /usr/bin/corosync-blackbox Expected results: corosync-blackbox runs OK. Current results: $ sudo corosync-blackbox /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found Fix: Make the package dependant of libqb-dev root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L libqb-dev | grep qb-bl /usr/sbin/qb-blackbox To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1677684] Re: /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
** Tags removed: sts
** Tags added: sts-sponsor

** Patch added: "lp1677684-zesty.debdiff"
   https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1677684/+attachment/4850712/+files/lp1677684-zesty.debdiff

** Changed in: corosync (Ubuntu)
       Status: New => In Progress

** Changed in: corosync (Ubuntu)
   Importance: Undecided => Medium

** Changed in: corosync (Ubuntu)
     Assignee: (unassigned) => Jorge Niedbalski (niedbalski)

** Changed in: corosync (Ubuntu Trusty)
       Status: New => In Progress

** Changed in: corosync (Ubuntu Trusty)
   Importance: Undecided => Medium

** Changed in: corosync (Ubuntu Trusty)
     Assignee: (unassigned) => Jorge Niedbalski (niedbalski)

** Changed in: corosync (Ubuntu Xenial)
       Status: New => In Progress

** Changed in: corosync (Ubuntu Xenial)
   Importance: Undecided => Medium

** Changed in: corosync (Ubuntu Xenial)
     Assignee: (unassigned) => Jorge Niedbalski (niedbalski)
[Ubuntu-ha] [Bug 1677684] [NEW] /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
Public bug reported:

[Environment]
Ubuntu Xenial 16.04 amd64

[Reproduction]
- Install corosync
- Run the corosync-blackbox executable.

root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L corosync | grep black
/usr/bin/corosync-blackbox

Expected results: corosync-blackbox runs OK.

Current results:
$ sudo corosync-blackbox
/usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found

Fix:
Make the corosync package depend on the package that ships qb-blackbox (libqb-dev):

root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L libqb-dev | grep qb-bl
/usr/sbin/qb-blackbox

** Affects: corosync (Ubuntu)
   Importance: Undecided
       Status: New

** Tags: sts

** Tags added: sts

** Description changed:

    [Environment]
    Ubuntu Xenial 16.04 Amd64

    [Reproduction]
    - Install corosync
    - Run the corosync-blackbox executable.

    root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L corosync |grep black
    /usr/bin/corosync-blackbox

    Expected results: corosync-blackbox runs OK.

    Current results:
    $ sudo corosync-blackbox
    /usr/bin/corosync-blackbox: 34: /usr/bin/corosync-blackbox: qb-blackbox: not found
  +
  + Fix:
  +
  + Make the package dependant of libqb-dev
  +
  + root@juju-niedbalski-xenial-machine-5:/home/ubuntu# dpkg -L libqb-dev | grep qb-bl
  + /usr/sbin/qb-blackbox
[Ubuntu-ha] [Bug 1563089] Re: Memory Leak when new cluster configuration is formed.
Hello,

I ran the verification for the Trusty version.

root@juju-niedbalski-sec-machine-15:/home/ubuntu# dpkg -l | grep corosync
ii  corosync             2.3.3-1ubuntu3  amd64  Standards-based cluster framework (daemon and modules)
ii  libcorosync-common4  2.3.3-1ubuntu3  amd64  Standards-based cluster framework, common library

I configured a 3-node nova-cloud-controller environment related with
hacluster.

ubuntu@niedbalski-sec-bastion:~/openstack-charm-testing/bundles/dev$ juju run --service nova-cloud-controller "sudo corosync-quorumtool -s | grep votes"
- MachineId: "15"
  Stdout: |
    Expected votes:   3
    Total votes:      3
  UnitId: nova-cloud-controller/0
- MachineId: "28"
  Stdout: |
    Expected votes:   3
    Total votes:      3
  UnitId: nova-cloud-controller/1
- MachineId: "29"
  Stdout: |
    Expected votes:   3
    Total votes:      3
  UnitId: nova-cloud-controller/2

I changed the transport mode to UDP unicast by setting:

$ juju set hacluster-ncc corosync_transport=udpu

After this, I moved to the primary node (the one that holds the virtual IP
address) and applied the TC rules while monitoring the memory usage of the
corosync process (multiple times):

root@juju-niedbalski-sec-machine-15:/home/ubuntu# tc qdisc add dev eth0 root netem delay 550ms
root@juju-niedbalski-sec-machine-15:/home/ubuntu# tc qdisc del dev eth0 root netem

Apr  6 17:57:37 juju-niedbalski-sec-machine-15 cib[14387]: warning: cib_process_request: Completed cib_apply_diff operation for section 'all': Application of an update diff failed (rc=-206, origin=local/cibadmin/2, version=0.27.1)
Apr  6 18:04:12 juju-niedbalski-sec-machine-15 corosync[14376]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:04:13 juju-niedbalski-sec-machine-15 corosync[18645]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:06:27 juju-niedbalski-sec-machine-15 corosync[18645]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:06:28 juju-niedbalski-sec-machine-15 corosync[19528]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:07:48 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:07:49 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:08:16 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:08:59 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:09:38 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN  ] Completed service synchronization, ready to provide service.

After 5 minutes of observation of the corosync process using:

$ sudo while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done

I don't see any substantial memory usage increase:

root@juju-niedbalski-sec-machine-15:/home/ubuntu# more memory-usage.log
135584 3928
135584 3928
135584 3928
135584 3928
(remaining samples are identical)
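As an aside, the monitoring one-liner quoted above will not run exactly as
written, because sudo cannot execute the shell keyword "while"; a working
equivalent, assuming the same log file name, would be something like:

  while true; do
      # -o vsz=,rss= suppresses the header; pgrep -d, joins multiple PIDs with commas
      sudo ps -o vsz=,rss= -p "$(pgrep -d, corosync)" | tee -a memory-usage.log
      sleep 1
  done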
[Ubuntu-ha] [Bug 1563089] Re: Memory Leak when new cluster configuration is formed.
Based on my latest comment, I am marking the Trusty version as
verification-done-trusty.

** Tags removed: verification-needed
** Tags added: verification-done-trusty verification-needed-wily

https://bugs.launchpad.net/bugs/1563089

Title:
  Memory Leak when new cluster configuration is formed.

Status in corosync package in Ubuntu: Fix Released
Status in corosync source package in Trusty: Fix Committed
Status in corosync source package in Wily: Fix Committed

Bug description:

  [Environment]
  Trusty 14.04.3

  Packages:
  ii  corosync             2.3.3-1ubuntu1  amd64  Standards-based cluster framework (daemon and modules)
  ii  libcorosync-common4  2.3.3-1ubuntu1  amd64  Standards-based cluster framework, common library

  [Reproducer]

  1) I deployed an HA environment using this bundle
     (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml)
     with a 3-node installation of cinder related to an HACluster
     subordinate unit.

     $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo

  2) I changed the default corosync transport mode to unicast.

     $ juju set cinder-hacluster corosync_transport=udpu

  3) I made sure that the 3 units were quorate.

     cinder/0# corosync-quorumtool
     Votequorum information
     ----------------------
     Expected votes:   3
     Highest expected: 3
     Total votes:      3
     Quorum:           2
     Flags:            Quorate

     Membership information
     ----------------------
         Nodeid      Votes Name
           1002          1 10.5.1.57 (local)
           1001          1 10.5.1.58
           1000          1 10.5.1.59

     The primary unit was holding the VIP resource 10.5.105.1/16:

     root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr
     2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP group default qlen 1000
         link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff
         inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0
            valid_lft forever preferred_lft forever
         inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0
            valid_lft forever preferred_lft forever

  4) I manually added a TC queue for the eth0 interface on the node holding
     the VIP resource, introducing a 350 ms delay.

     $ sudo tc qdisc add dev eth0 root netem delay 350ms

  5) Right after adding the 350 ms delay on the cinder/0 unit, the corosync
     process reports that one of the processors failed and a new cluster
     configuration is being formed.

     Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration.
     Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
     Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000
     Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN  ] Completed service synchronization, ready to provide service.

     This happens on all of the units.

  6) After receiving this message, I remove the queue from eth0:

     $ sudo tc qdisc del dev eth0 root netem

     Then, the following statement is written on the master node:

     Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
     Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000
     Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN  ] Completed service synchronization, ready to provide service.

  7) While executing 5 and 6 repeatedly, I ran the following command to
     track the VSZ and RSS memory usage of the corosync process:

     root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms
     root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem

     $ sudo while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done

     The results show that both VSZ and RSS increase over time at a high rate.

     25476 4036
     ... (after 5 minutes)
     135644 10352

  [Fix]
  Based on this reproducer, I think that this commit
  (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9)
  is a good candidate to be backported to Ubuntu Trusty.

  [Test Case]
  * See reproducer.

  [Backport Impact]
  * Not identified.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1563089/+subscriptions
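A hedged sketch of driving steps 4-6 in a loop so that new membership rounds
keep being forced while the memory log fills up (interface name and delay are
taken from the reproducer; the iteration count and sleep intervals are
assumptions):

  # Toggle the netem delay repeatedly on the node holding the VIP
  for i in $(seq 1 10); do
      sudo tc qdisc add dev eth0 root netem delay 350ms
      sleep 60    # long enough for "A processor failed, forming new configuration"
      sudo tc qdisc del dev eth0 root netem
      sleep 60    # let the new membership form and services re-synchronize
  done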
[Ubuntu-ha] [Bug 1563089] Re: Memory Leak when new cluster configuration is formed.
** Patch added: "Wily Pathc" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1563089/+attachment/4619421/+files/fix-lp-1563089-wily.debdiff -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1563089 Title: Memory Leak when new cluster configuration is formed. Status in corosync package in Ubuntu: In Progress Status in corosync source package in Trusty: In Progress Status in corosync source package in Wily: In Progress Bug description: [Environment] Trusty 14.04.3 Packages: ii corosync 2.3.3-1ubuntu1 amd64Standards-based cluster framework (daemon and modules) ii libcorosync-common4 2.3.3-1ubuntu1 amd64Standards-based cluster framework, common library [Reproducer] 1) I deployed an HA environment using this bundle (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml) with a 3 nodes installation of cinder related to an HACluster subordinate unit. $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo 2) I changed the default corosync transport mode to unicast. $ juju set cinder-hacluster corosync_transport=udpu 3) I assured that the 3 units were quorated cinder/0# corosync-quorumtool Votequorum information -- Expected votes: 3 Highest expected: 3 Total votes: 3 Quorum: 2 Flags:Quorate Membership information -- Nodeid Votes Name 1002 1 10.5.1.57 (local) 1001 1 10.5.1.58 1000 1 10.5.1.59 The primary unit was holding the VIP resource 10.5.105.1/16 root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr 2: eth0:mtu 1500 qdisc netem state UP group default qlen 1000 link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0 valid_lft forever preferred_lft forever inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0 valid_lft forever preferred_lft forever 4) I manually added a TC queue for the eth0 interface on the node holding the VIP resource, introducing a 350 ms delay. $ sudo tc qdisc add dev eth0 root netem delay 350ms 5) Right after adding the 350ms on the cinder/0 unit, the corosync process informs that one of the processors failed, and is forming a new cluster configuration. Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration. Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ] Completed service synchronization, ready to provide service. This happens on all of the units. 6) After receiving this message, I remove the queue from eth0: $ sudo tc qdisk del dev eth0 root netem Then, the following statement is written in the master node: Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ] Completed service synchronization, ready to provide service. 
7) While executing 5 and 6 repeatedly, I ran the following command to track the VSZ and RSS memory usage of the corosync process: root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem $ sudo while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done The results shows that both vsz and rss are increased over time at a high ratio. 25476 4036 ... (after 5 minutes). 135644 10352 [Fix] So preliminary based on this reproducer, I think that this commit (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9) is a good candidate to be backported in Ubuntu Trusty. [Test Case] * See reproducer [Backport Impact] * Not identified To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1563089/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1564250] Re: Corosync upgrade to 2.3.3-1ubuntu2 leaves pacemaker in a stopped state
** Changed in: pacemaker (Ubuntu)
   Importance: Undecided => High

https://bugs.launchpad.net/bugs/1564250

Title:
  Corosync upgrade to 2.3.3-1ubuntu2 leaves pacemaker in a stopped state

Status in pacemaker package in Ubuntu: Confirmed

Bug description:

  Using pacemaker version 1.1.10+git20130802-1ubuntu2.3 on Ubuntu 14.04.4
  LTS, upgrading corosync to 2.3.3-1ubuntu2 (specifically from version
  2.3.3-1ubuntu1) leaves pacemaker in a stopped state.

  I have attached logs from such an upgrade.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1564250/+subscriptions
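A hedged sketch of checking for (and working around) the stopped pacemaker
after that corosync upgrade, assuming the sysvinit/upstart service names used
on Ubuntu 14.04:

  sudo apt-get install --only-upgrade corosync   # 2.3.3-1ubuntu1 -> 2.3.3-1ubuntu2
  sudo service pacemaker status                  # reported to show pacemaker stopped after the upgrade
  sudo service pacemaker start                   # manual restart as a workaround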
[Ubuntu-ha] [Bug 1563089] Re: Memory Leak when new cluster configuration is formed.
** Patch removed: "Xenial Patch" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1563089/+attachment/4617458/+files/fix-lp-1563089-xenial.debdiff ** Patch added: "Xenial Patch" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1563089/+attachment/4618461/+files/fix-lp-1563089-xenial.debdiff -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1563089 Title: Memory Leak when new cluster configuration is formed. Status in corosync package in Ubuntu: In Progress Status in corosync source package in Trusty: In Progress Status in corosync source package in Wily: In Progress Bug description: [Environment] Trusty 14.04.3 Packages: ii corosync 2.3.3-1ubuntu1 amd64Standards-based cluster framework (daemon and modules) ii libcorosync-common4 2.3.3-1ubuntu1 amd64Standards-based cluster framework, common library [Reproducer] 1) I deployed an HA environment using this bundle (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml) with a 3 nodes installation of cinder related to an HACluster subordinate unit. $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo 2) I changed the default corosync transport mode to unicast. $ juju set cinder-hacluster corosync_transport=udpu 3) I assured that the 3 units were quorated cinder/0# corosync-quorumtool Votequorum information -- Expected votes: 3 Highest expected: 3 Total votes: 3 Quorum: 2 Flags:Quorate Membership information -- Nodeid Votes Name 1002 1 10.5.1.57 (local) 1001 1 10.5.1.58 1000 1 10.5.1.59 The primary unit was holding the VIP resource 10.5.105.1/16 root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr 2: eth0:mtu 1500 qdisc netem state UP group default qlen 1000 link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0 valid_lft forever preferred_lft forever inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0 valid_lft forever preferred_lft forever 4) I manually added a TC queue for the eth0 interface on the node holding the VIP resource, introducing a 350 ms delay. $ sudo tc qdisc add dev eth0 root netem delay 350ms 5) Right after adding the 350ms on the cinder/0 unit, the corosync process informs that one of the processors failed, and is forming a new cluster configuration. Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration. Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ] Completed service synchronization, ready to provide service. This happens on all of the units. 6) After receiving this message, I remove the queue from eth0: $ sudo tc qdisk del dev eth0 root netem Then, the following statement is written in the master node: Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ] Completed service synchronization, ready to provide service. 
7) While executing 5 and 6 repeatedly, I ran the following command to track the VSZ and RSS memory usage of the corosync process: root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem $ sudo while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done The results shows that both vsz and rss are increased over time at a high ratio. 25476 4036 ... (after 5 minutes). 135644 10352 [Fix] So preliminary based on this reproducer, I think that this commit (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9) is a good candidate to be backported in Ubuntu Trusty. [Test Case] * See reproducer [Backport Impact] * Not identified To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1563089/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha
[Ubuntu-ha] [Bug 1563089] Re: Memory Leak when new cluster configuration is formed.
** Patch added: "Xenial Patch" https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1563089/+attachment/4617458/+files/fix-lp-1563089-xenial.debdiff ** Changed in: corosync (Ubuntu Wily) Status: New => In Progress ** Changed in: corosync (Ubuntu Wily) Importance: Undecided => High ** Changed in: corosync (Ubuntu Wily) Assignee: (unassigned) => Jorge Niedbalski (niedbalski) -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1563089 Title: Memory Leak when new cluster configuration is formed. Status in corosync package in Ubuntu: In Progress Status in corosync source package in Trusty: In Progress Status in corosync source package in Wily: In Progress Bug description: [Environment] Trusty 14.04.3 Packages: ii corosync 2.3.3-1ubuntu1 amd64Standards-based cluster framework (daemon and modules) ii libcorosync-common4 2.3.3-1ubuntu1 amd64Standards-based cluster framework, common library [Reproducer] 1) I deployed an HA environment using this bundle (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml) with a 3 nodes installation of cinder related to an HACluster subordinate unit. $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo 2) I changed the default corosync transport mode to unicast. $ juju set cinder-hacluster corosync_transport=udpu 3) I assured that the 3 units were quorated cinder/0# corosync-quorumtool Votequorum information -- Expected votes: 3 Highest expected: 3 Total votes: 3 Quorum: 2 Flags:Quorate Membership information -- Nodeid Votes Name 1002 1 10.5.1.57 (local) 1001 1 10.5.1.58 1000 1 10.5.1.59 The primary unit was holding the VIP resource 10.5.105.1/16 root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP group default qlen 1000 link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0 valid_lft forever preferred_lft forever inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0 valid_lft forever preferred_lft forever 4) I manually added a TC queue for the eth0 interface on the node holding the VIP resource, introducing a 350 ms delay. $ sudo tc qdisc add dev eth0 root netem delay 350ms 5) Right after adding the 350ms on the cinder/0 unit, the corosync process informs that one of the processors failed, and is forming a new cluster configuration. Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration. Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ] Completed service synchronization, ready to provide service. This happens on all of the units. 6) After receiving this message, I remove the queue from eth0: $ sudo tc qdisk del dev eth0 root netem Then, the following statement is written in the master node: Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ] Completed service synchronization, ready to provide service. 
7) While executing 5 and 6 repeatedly, I ran the following command to track the VSZ and RSS memory usage of the corosync process: root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem $ sudo while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done The results shows that both vsz and rss are increased over time at a high ratio. 25476 4036 ... (after 5 minutes). 135644 10352 [Fix] So preliminary based on this reproducer, I think that this commit (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9) is a good candidate to be backported in Ubuntu Trusty. [Test Case] * See reproducer [Backport Impact] * Not identified To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+sou
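For reference, a minimal sketch of the monitoring loop from step 7. The original command prefixes a shell while-loop with sudo, which does not work as written; running the sampler from a root shell (as the tc commands already are) avoids that. The log file name and sampling interval are illustrative:

# toggle the netem delay to force a new membership (run as root)
tc qdisc add dev eth0 root netem delay 350ms
sleep 60
tc qdisc del dev eth0 root netem

# sample corosync VSZ/RSS once per second (run as root)
while true; do
  ps -o vsz=,rss= -p "$(pgrep -x corosync)" | tee -a memory-usage.log
  sleep 1
done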
[Ubuntu-ha] [Bug 1563089] Re: Memory Leak when new cluster configuration is formed.
** Changed in: corosync (Ubuntu) Status: New => In Progress ** Changed in: corosync (Ubuntu Trusty) Status: New => In Progress ** Changed in: corosync (Ubuntu) Importance: Undecided => High ** Changed in: corosync (Ubuntu Trusty) Importance: Undecided => High ** Changed in: corosync (Ubuntu) Assignee: (unassigned) => Jorge Niedbalski (niedbalski) ** Changed in: corosync (Ubuntu Trusty) Assignee: (unassigned) => Jorge Niedbalski (niedbalski) -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1563089 Title: Memory Leak when new cluster configuration is formed. Status in corosync package in Ubuntu: In Progress Status in corosync source package in Trusty: In Progress Status in corosync source package in Wily: New Bug description: [Environment] Trusty 14.04.3 Packages: ii corosync 2.3.3-1ubuntu1 amd64Standards-based cluster framework (daemon and modules) ii libcorosync-common4 2.3.3-1ubuntu1 amd64Standards-based cluster framework, common library [Reproducer] 1) I deployed an HA environment using this bundle (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml) with a 3 nodes installation of cinder related to an HACluster subordinate unit. $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo 2) I changed the default corosync transport mode to unicast. $ juju set cinder-hacluster corosync_transport=udpu 3) I assured that the 3 units were quorated cinder/0# corosync-quorumtool Votequorum information -- Expected votes: 3 Highest expected: 3 Total votes: 3 Quorum: 2 Flags:Quorate Membership information -- Nodeid Votes Name 1002 1 10.5.1.57 (local) 1001 1 10.5.1.58 1000 1 10.5.1.59 The primary unit was holding the VIP resource 10.5.105.1/16 root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP group default qlen 1000 link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0 valid_lft forever preferred_lft forever inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0 valid_lft forever preferred_lft forever 4) I manually added a TC queue for the eth0 interface on the node holding the VIP resource, introducing a 350 ms delay. $ sudo tc qdisc add dev eth0 root netem delay 350ms 5) Right after adding the 350ms on the cinder/0 unit, the corosync process informs that one of the processors failed, and is forming a new cluster configuration. Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration. Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ] Completed service synchronization, ready to provide service. This happens on all of the units. 6) After receiving this message, I remove the queue from eth0: $ sudo tc qdisk del dev eth0 root netem Then, the following statement is written in the master node: Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. 
Members Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ] Completed service synchronization, ready to provide service. 7) While executing 5 and 6 repeatedly, I ran the following command to track the VSZ and RSS memory usage of the corosync process: root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem $ sudo while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done The results shows that both vsz and rss are increased over time at a high ratio. 25476 4036 ... (after 5 minutes). 135644 10352 [Fix] So preliminary based on this reproducer, I think that this commit (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9) is a good candidate to be backported in Ubuntu Trusty. [Test Case] * See reproducer [Backport Impact] * Not identified To manage no
[Ubuntu-ha] [Bug 1563089] Re: Memory Leak when new cluster configuration is formed.
** Description changed: [Environment] Trusty 14.04.3 Packages: ii corosync 2.3.3-1ubuntu1 amd64Standards-based cluster framework (daemon and modules) ii libcorosync-common4 2.3.3-1ubuntu1 amd64Standards-based cluster framework, common library [Reproducer] - 1) I deployed an HA environment using this bundle (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml) with a 3 nodes installation of cinder related to an HACluster subordinate unit. $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo 2) I changed the default corosync transport mode to unicast. $ juju set cinder-hacluster corosync_transport=udpu 3) I assured that the 3 units were quorated - cinder/0# corosync-quorumtool + cinder/0# corosync-quorumtool Votequorum information -- Expected votes: 3 Highest expected: 3 Total votes: 3 - Quorum: 2 - Flags:Quorate + Quorum: 2 + Flags:Quorate Membership information -- - Nodeid Votes Name - 1002 1 10.5.1.57 (local) - 1001 1 10.5.1.58 - 1000 1 10.5.1.59 + Nodeid Votes Name + 1002 1 10.5.1.57 (local) + 1001 1 10.5.1.58 + 1000 1 10.5.1.59 The primary unit was holding the VIP resource 10.5.105.1/16 - root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr + root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr 2: eth0:mtu 1500 qdisc netem state UP group default qlen 1000 - link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff - inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0 -valid_lft forever preferred_lft forever - inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0 -valid_lft forever preferred_lft forever + link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff + inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0 + valid_lft forever preferred_lft forever + inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0 + valid_lft forever preferred_lft forever 4) I manually added a TC queue for the eth0 interface on the node holding the VIP resource, introducing a 350 ms delay. $ sudo tc qdisc add dev eth0 root netem delay 350ms - 5) Right after adding the 350ms on the cinder/0 unit, the corosync process informs that one of the processors failed, and is forming a new + 5) Right after adding the 350ms on the cinder/0 unit, the corosync process informs that one of the processors failed, and is forming a new cluster configuration. - + Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration. Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ] Completed service synchronization, ready to provide service. This happens on all of the units. 6) After receiving this message, I remove the queue from eth0: $ sudo tc qdisk del dev eth0 root netem Then, the following statement is written in the master node: Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ] Completed service synchronization, ready to provide service. 
- - 7) While executing 5 and 6 repeatedly, I ran the following command to track the SZ and RSS memory usage of the + 7) While executing 5 and 6 repeatedly, I ran the following command to track the VSZ and RSS memory usage of the corosync process: root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem - $ sudo while true; do ps -o sz,rss -p $(pgrep corosync) 2>&1 | grep -E + $ sudo while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done - The results shows that both sz and rss are increased over time at a high - ratio. + The results shows that both vsz and rss are increased over time at a + high ratio. 25476 4036 ... (after 5 minutes). 135644 10352 [Fix] - So preliminary based on this reproducer, I think that this commit (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9) + So preliminary
[Ubuntu-ha] [Bug 1530837] Re: Logsys file leaks in /dev/shm after sigabrt, sigsegv and when running corosync -v
I just ran a verification on this package. root@juju-niedbalski-sec-machine-27:/home/ubuntu# for file in $(strace -e open -i corosync -v 2>&1 | grep -E '.*shm.*' |grep -Po '".*?"'| sed -e s/\"//g); do du -sh $file; done 12K /dev/shm/qb-corosync-blackbox-header 8.1M/dev/shm/qb-corosync-blackbox-data After enabling proposed root@juju-niedbalski-sec-machine-27:/home/ubuntu# dpkg -l | grep corosync ii corosync 2.3.3-1ubuntu2 amd64 Standards-based cluster framework (daemon and modules) ii libcorosync-common4 2.3.3-1ubuntu2 amd64 Standards-based cluster framework, common library root@juju-niedbalski-sec-machine-27:/home/ubuntu# for file in $(strace -e open -i corosync -v 2>&1 | grep -E '.*shm.*' |grep -Po '".*?"'| sed -e s/\"//g); do du -sh $file; done du: cannot access ‘/dev/shm/qb-corosync-blackbox-header’: No such file or directory du: cannot access ‘/dev/shm/qb-corosync-blackbox-data’: No such file or directory So, seems to be fixed. ** Tags removed: verification-needed ** Tags added: verification-done -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1530837 Title: Logsys file leaks in /dev/shm after sigabrt, sigsegv and when running corosync -v Status in corosync package in Ubuntu: Fix Released Status in corosync source package in Trusty: Fix Committed Bug description: [Impact] * corosync has a memory leak problem with multiple calls to corosync -v * corosync has a memory leak problem by not properly handling signals [Test Case] * run "corosync -v" multiple times * some cloud tools do that [Regression Potential] * minor code changes on not-core code * based on upstream changes * based on a redhat fix [Other Info] # Original BUG Description It was brought to my attention that Ubuntu also suffers from: https://bugzilla.redhat.com/show_bug.cgi?id=1117911 And corosync should include the following fixes: commit dfaca4b10a005681230a81e229384b6cd239b4f6 Author: Jan FriesseDate: Wed Jul 9 15:52:14 2014 +0200 Fix compiler warning introduced by previous patch QB loop signal handler prototype differs from signal(2) prototype. Solution is to create wrapper functions. Signed-off-by: Jan Friesse commit 384760cb670836dc37e243f594612c6e68f44351 Author: zouyu Date: Thu Jul 3 10:56:02 2014 +0800 Handle SIGSEGV and SIGABRT signals SIGSEGV and SIGABRT signals are now correctly handled (blackbox is dumped and logsys is finalized). Signed-off-by: zouyu Reviewed-by: Jan Friesse commit cc80c8567d6eec1d136f9e85d2f8dfb957337eef Author: zouyu Date: Wed Jul 2 10:00:53 2014 +0800 fix memory leak produced by 'corosync -v' Signed-off-by: zouyu Reviewed-by: Jan Friesse Description from Red Hat bug: """ Description of problem: When corosync receives sigabrt or sigsegv it doesn't delete libqb blackbox file (/dev/shm one). Same happens when corosync is executed with -v parameter (this shows only version, so it shouldn't cause leak in /dev/shm). Version-Release number of selected component (if applicable): 7.0 How reproducible: 100% Steps to Reproduce 1: 1. Start corosync 2. Send sigabrt to corosync Steps to Reproduce 1: 1. Execute corosync -v Actual results: File like qb-corosync-*-blackbox-data|header exists results in leak of /dev/shm space. 
Expected results: No leak Additional info: """ To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1530837/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
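As a quick cross-check of the same leak, the blackbox files can also be listed before and after running corosync -v, without tracing opens. This is a rough sketch, not the verification procedure used above; the file names under /tmp are illustrative:

ls /dev/shm > /tmp/shm-before
corosync -v > /dev/null
ls /dev/shm > /tmp/shm-after
diff /tmp/shm-before /tmp/shm-after   # leftover qb-corosync-*-blackbox-* entries indicate the leak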
[Ubuntu-ha] [Bug 1563089] Re: Memory Leak when new cluster configuration is formed.
** Summary changed: - Memory Leak when new configuration is formed. + Memory Leak when new cluster configuration is formed. ** Tags added: sts-needs-review -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1563089 Title: Memory Leak when new cluster configuration is formed. Status in corosync package in Ubuntu: New Status in corosync source package in Trusty: New Bug description: [Environment] Trusty 14.04.3 Packages: ii corosync 2.3.3-1ubuntu1 amd64Standards-based cluster framework (daemon and modules) ii libcorosync-common4 2.3.3-1ubuntu1 amd64Standards-based cluster framework, common library [Reproducer] 1) I deployed an HA environment using this bundle (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml) with a 3 nodes installation of cinder related to an HACluster subordinate unit. $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo 2) I changed the default corosync transport mode to unicast. $ juju set cinder-hacluster corosync_transport=udpu 3) I assured that the 3 units were quorated cinder/0# corosync-quorumtool Votequorum information -- Expected votes: 3 Highest expected: 3 Total votes: 3 Quorum: 2 Flags:Quorate Membership information -- Nodeid Votes Name 1002 1 10.5.1.57 (local) 1001 1 10.5.1.58 1000 1 10.5.1.59 The primary unit was holding the VIP resource 10.5.105.1/16 root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr 2: eth0:mtu 1500 qdisc netem state UP group default qlen 1000 link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0 valid_lft forever preferred_lft forever inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0 valid_lft forever preferred_lft forever 4) I manually added a TC queue for the eth0 interface on the node holding the VIP resource, introducing a 350 ms delay. $ sudo tc qdisc add dev eth0 root netem delay 350ms 5) Right after adding the 350ms on the cinder/0 unit, the corosync process informs that one of the processors failed, and is forming a new cluster configuration. Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration. Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ] Completed service synchronization, ready to provide service. This happens on all of the units. 6) After receiving this message, I remove the queue from eth0: $ sudo tc qdisk del dev eth0 root netem Then, the following statement is written in the master node: Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ] Completed service synchronization, ready to provide service. 
7) While executing 5 and 6 repeatedly, I ran the following command to track the SZ and RSS memory usage of the corosync process: root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem $ sudo while true; do ps -o sz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done The results shows that both sz and rss are increased over time at a high ratio. 25476 4036 ... (after 5 minutes). 135644 10352 [Fix] So preliminary based on this reproducer, I think that this commit (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9) is a good candidate to be backported in Ubuntu Trusty. [Test Case] * See reproducer [Backport Impact] * Not identified To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1563089/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1563089] [NEW] Memory Leak when new configuration is formed.
Public bug reported: [Environment] Trusty 14.04.3 Packages: ii corosync 2.3.3-1ubuntu1 amd64Standards-based cluster framework (daemon and modules) ii libcorosync-common4 2.3.3-1ubuntu1 amd64Standards-based cluster framework, common library [Reproducer] 1) I deployed an HA environment using this bundle (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml) with a 3 nodes installation of cinder related to an HACluster subordinate unit. $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo 2) I changed the default corosync transport mode to unicast. $ juju set cinder-hacluster corosync_transport=udpu 3) I assured that the 3 units were quorated cinder/0# corosync-quorumtool Votequorum information -- Expected votes: 3 Highest expected: 3 Total votes: 3 Quorum: 2 Flags:Quorate Membership information -- Nodeid Votes Name 1002 1 10.5.1.57 (local) 1001 1 10.5.1.58 1000 1 10.5.1.59 The primary unit was holding the VIP resource 10.5.105.1/16 root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr 2: eth0:mtu 1500 qdisc netem state UP group default qlen 1000 link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0 valid_lft forever preferred_lft forever inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0 valid_lft forever preferred_lft forever 4) I manually added a TC queue for the eth0 interface on the node holding the VIP resource, introducing a 350 ms delay. $ sudo tc qdisc add dev eth0 root netem delay 350ms 5) Right after adding the 350ms on the cinder/0 unit, the corosync process informs that one of the processors failed, and is forming a new cluster configuration. Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration. Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ] Completed service synchronization, ready to provide service. This happens on all of the units. 6) After receiving this message, I remove the queue from eth0: $ sudo tc qdisk del dev eth0 root netem Then, the following statement is written in the master node: Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000 Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ] Completed service synchronization, ready to provide service. 7) While executing 5 and 6 repeatedly, I ran the following command to track the SZ and RSS memory usage of the corosync process: root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem $ sudo while true; do ps -o sz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done The results shows that both sz and rss are increased over time at a high ratio. 25476 4036 ... (after 5 minutes). 135644 10352 [Fix] So preliminary based on this reproducer, I think that this commit (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9) is a good candidate to be backported in Ubuntu Trusty. 
[Test Case] * See reproducer [Backport Impact] * Not identified ** Affects: corosync (Ubuntu) Importance: Undecided Status: New -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to corosync in Ubuntu. https://bugs.launchpad.net/bugs/1563089 Title: Memory Leak when new configuration is formed. Status in corosync package in Ubuntu: New Bug description: [Environment] Trusty 14.04.3 Packages: ii corosync 2.3.3-1ubuntu1 amd64Standards-based cluster framework (daemon and modules) ii libcorosync-common4 2.3.3-1ubuntu1 amd64Standards-based cluster framework, common library [Reproducer] 1) I deployed an HA environment using this bundle (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml) with a 3 nodes installation of cinder related to an HACluster subordinate unit. $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo 2) I changed the default corosync
[Ubuntu-ha] [Bug 1477198] Re: Stop doesn't works on Trusty
I have verified that the -proposed package fixes the issue. Thanks. root@juju-testing-machine-18:/home/ubuntu# service haproxy restart * Restarting haproxy haproxy [ OK ] root@juju-testing-machine-18:/home/ubuntu# ps aux|grep haproxy haproxy 8530 0.0 0.0 20300 636 ?Ss 19:47 0:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root@juju-testing-machine-18:/home/ubuntu# service haproxy stop * Stopping haproxy haproxy [ OK ] root@juju-testing-machine-18:/home/ubuntu# ps aux|grep haproxy root@juju-testing-machine-18:/home/ubuntu# service haproxy start * Starting haproxy haproxy [ OK ] root@juju-testing-machine-18:/home/ubuntu# ps aux|grep haproxy haproxy 8567 0.0 0.0 20300 632 ?Ss 19:47 0:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root@juju-testing-machine-18:/home/ubuntu# service haproxy restart * Restarting haproxy haproxy [ OK ] root@juju-testing-machine-18:/home/ubuntu# service haproxy restart * Restarting haproxy haproxy [ OK ] root@juju-testing-machine-18:/home/ubuntu# ps aux|grep haproxy haproxy 8607 0.0 0.0 20300 636 ?Ss 19:47 0:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root 8611 0.0 0.0 10432 624 pts/0S+ 19:47 0:00 grep --color=auto haproxy root@juju-testing-machine-18:/home/ubuntu# ** Tags removed: verification-needed ** Tags added: verification-done -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to haproxy in Ubuntu. https://bugs.launchpad.net/bugs/1477198 Title: Stop doesn't works on Trusty Status in haproxy package in Ubuntu: Fix Released Status in haproxy source package in Trusty: Fix Committed Bug description: [Description] The stop method is not working properly. I removed the --oknodo --quiet and is returning (No /usr/sbin/haproxy found running; none killed) I think this is a regression caused by the incorporation of this lines on the stop method: + for pid in $(cat $PIDFILE); do + start-stop-daemon --quiet --oknodo --stop \ + --retry 5 --pid $pid --exec $HAPROXY || ret=$? root@juju-machine-1-lxc-0:~# service haproxy status haproxy is running. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root1513 906 0 14:33 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy restart * Restarting haproxy haproxy ...done. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2277 906 0 14:33 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy restart * Restarting haproxy haproxy ...done. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2505 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2523 906 0 14:33 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy stop * Stopping haproxy haproxy ...done. 
root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2505 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2584 906 0 14:34 pts/600:00:00 grep
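A condensed form of the verification above, usable as a quick regression check (paths are those from the report; the expected counts assume a single-process haproxy as configured here):

service haproxy start
pgrep -c -f '/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg'   # expect 1
service haproxy restart
pgrep -c -f '/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg'   # still 1: restart must not leave the old process behind
service haproxy stop
pgrep -f /usr/sbin/haproxy || echo stopped cleanly            # expect no haproxy processes left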
[Ubuntu-ha] [Bug 1477198] Re: Stop doesn't works on Trusty
** Patch added: Trusty Patch https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1477198/+attachment/4432608/+files/fix-lp-1477198-trusty.patch ** Tags added: sts -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to haproxy in Ubuntu. https://bugs.launchpad.net/bugs/1477198 Title: Stop doesn't works on Trusty Status in haproxy package in Ubuntu: Fix Released Status in haproxy source package in Trusty: In Progress Bug description: [Description] The stop method is not working properly. I removed the --oknodo --quiet and is returning (No /usr/sbin/haproxy found running; none killed) I think this is a regression caused by the incorporation of this lines on the stop method: + for pid in $(cat $PIDFILE); do + start-stop-daemon --quiet --oknodo --stop \ + --retry 5 --pid $pid --exec $HAPROXY || ret=$? root@juju-machine-1-lxc-0:~# service haproxy status haproxy is running. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root1513 906 0 14:33 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy restart * Restarting haproxy haproxy ...done. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2277 906 0 14:33 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy restart * Restarting haproxy haproxy ...done. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2505 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2523 906 0 14:33 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy stop * Stopping haproxy haproxy ...done. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2505 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2584 906 0 14:34 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy start * Starting haproxy haproxy ...done. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2505 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2591 1 0 14:34 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2610 906 0 14:34 pts/600:00:00 grep --color=auto haproxy To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1477198/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1477198] Re: Stop doesn't works on Trusty
** Changed in: haproxy (Ubuntu Trusty) Status: New = In Progress ** Changed in: haproxy (Ubuntu Trusty) Assignee: (unassigned) = Jorge Niedbalski (niedbalski) ** Changed in: haproxy (Ubuntu Trusty) Importance: Undecided = Critical -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to haproxy in Ubuntu. https://bugs.launchpad.net/bugs/1477198 Title: Stop doesn't works on Trusty Status in haproxy package in Ubuntu: Fix Released Status in haproxy source package in Trusty: In Progress Bug description: [Description] The stop method is not working properly. I removed the --oknodo --quiet and is returning (No /usr/sbin/haproxy found running; none killed) I think this is a regression caused by the incorporation of this lines on the stop method: + for pid in $(cat $PIDFILE); do + start-stop-daemon --quiet --oknodo --stop \ + --retry 5 --pid $pid --exec $HAPROXY || ret=$? root@juju-machine-1-lxc-0:~# service haproxy status haproxy is running. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root1513 906 0 14:33 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy restart * Restarting haproxy haproxy ...done. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2277 906 0 14:33 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy restart * Restarting haproxy haproxy ...done. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2505 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2523 906 0 14:33 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy stop * Stopping haproxy haproxy ...done. root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2505 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2584 906 0 14:34 pts/600:00:00 grep --color=auto haproxy root@juju-machine-1-lxc-0:~# service haproxy start * Starting haproxy haproxy ...done. 
root@juju-machine-1-lxc-0:~# ps -ef| grep haproxy haproxy 1269 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2169 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2505 1 0 14:33 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid haproxy 2591 1 0 14:34 ?00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid root2610 906 0 14:34 pts/600:00:00 grep --color=auto haproxy To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1477198/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1468879] Re: Haproxy doesn't checks for configuration on start/reload
After enabling -proposed and install haproxy. Edit /etc/default/haproxy, set ENABLED=1 I modified an entry in the configuration /etc/haproxy/haproxy.cfg global log /dev/loglocal0 log /dev/loglocal1 notice chroot /var/lib/haproxy user haproxy group haproxy daemon defaults log globalss service haproxy restart * Restarting haproxy haproxy [ALERT] 195/191904 (5640) : parsing [/etc/haproxy/haproxy.cfg:10] : 'log' expects either address[:port] and facility or 'global' as arguments. [ALERT] 195/191904 (5640) : Error(s) found in configuration file : /etc/haproxy/haproxy.cfg [WARNING] 195/191904 (5640) : config : log format ignored for proxy 'http-in' since it has no log address. [ALERT] 195/191904 (5640) : Fatal errors found in configuration. [fail] ** Tags removed: verification-needed ** Tags added: verification-done -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to haproxy in Ubuntu. https://bugs.launchpad.net/bugs/1468879 Title: Haproxy doesn't checks for configuration on start/reload Status in haproxy package in Ubuntu: Fix Released Status in haproxy source package in Trusty: Fix Released Bug description: [Environment] Trusty 14.04.2 [Description] Configuration is not tested before service start or reload [Suggested Fix] Backport current check_haproxy_config function from utopic+. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1468879/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
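The guard being verified boils down to running haproxy's own configuration check before the service is started or reloaded. A minimal sketch of such a check follows; the function name and paths are illustrative and not necessarily identical to the backported check_haproxy_config:

check_haproxy_config() {
    # haproxy -c only parses the configuration and reports errors; it does not start the daemon
    if ! /usr/sbin/haproxy -c -f /etc/haproxy/haproxy.cfg; then
        echo "haproxy configuration check failed, refusing to start/reload" >&2
        return 1
    fi
    return 0
}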
[Ubuntu-ha] [Bug 1468879] Re: Haproxy doesn't checks for configuration on start/reload
** Changed in: haproxy (Ubuntu Trusty) Status: New => In Progress ** Changed in: haproxy (Ubuntu Trusty) Assignee: (unassigned) => Jorge Niedbalski (niedbalski) -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to haproxy in Ubuntu. https://bugs.launchpad.net/bugs/1468879 Title: Haproxy doesn't checks for configuration on start/reload Status in haproxy package in Ubuntu: Fix Released Status in haproxy source package in Trusty: In Progress Bug description: [Environment] Trusty 14.04.2 [Description] Configuration is not tested before service start or reload [Suggested Fix] Backport current check_haproxy_config function from utopic+. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1468879/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1468879] [NEW] Haproxy doesn't checks for configuration on start/reload
Public bug reported: [Environment] Trusty 14.04.2 [Description] Configuration is not tested before service start or reload [Suggested Fix] Backport current check_haproxy_config function from utopic+. ** Affects: haproxy (Ubuntu) Importance: Undecided Status: Fix Released ** Changed in: haproxy (Ubuntu) Status: New => Fix Released -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to haproxy in Ubuntu. https://bugs.launchpad.net/bugs/1468879 Title: Haproxy doesn't checks for configuration on start/reload Status in haproxy package in Ubuntu: Fix Released Bug description: [Environment] Trusty 14.04.2 [Description] Configuration is not tested before service start or reload [Suggested Fix] Backport current check_haproxy_config function from utopic+. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1468879/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1462495] Re: Init file does not respect the LSB spec.
** Description changed: [Environment] - Trusty 14.02 + Trusty 14.04.2 [Description] Looking in the /etc/init.d/haproxy script, particularly the stop method, is returning 4 in case of the pidfile doesn't exists. - /bin/kill $pid || return 4. + /bin/kill $pid || return 4. According to the spec that means 'insufficient privileges' which is not correct. This is causing pacemaker and other system monitoring tools to fail because it doesn't complains with LSB. An example: Jun 2 12:52:13 glance02 crmd[2518]: notice: process_lrm_event: glance02-res_glance_haproxy_monitor_5000:22 [ haproxy dead, but /var/run/haproxy.pid exists.\n ] Jun 2 12:52:13 glance02 crmd[2518]: notice: process_lrm_event: LRM operation res_glance_haproxy_stop_0 (call=33, rc=4, cib-update=19, confirmed=true) insufficient privileges Reference: haproxy_stop() { - if [ ! -f $PIDFILE ] ; then - # This is a success according to LSB - return 0 - fi - for pid in $(cat $PIDFILE) ; do - /bin/kill $pid || return 4 - done - rm -f $PIDFILE - return 0 + if [ ! -f $PIDFILE ] ; then + # This is a success according to LSB + return 0 + fi + for pid in $(cat $PIDFILE) ; do + /bin/kill $pid || return 4 + done + rm -f $PIDFILE + return 0 } [Proposed solution] Backport the current devel (wily) init. -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to haproxy in Ubuntu. https://bugs.launchpad.net/bugs/1462495 Title: Init file does not respect the LSB spec. Status in haproxy package in Ubuntu: In Progress Status in haproxy source package in Trusty: New Status in haproxy source package in Utopic: New Bug description: [Environment] Trusty 14.04.2 [Description] Looking in the /etc/init.d/haproxy script, particularly the stop method, is returning 4 in case of the pidfile doesn't exists. /bin/kill $pid || return 4. According to the spec that means 'insufficient privileges' which is not correct. This is causing pacemaker and other system monitoring tools to fail because it doesn't complains with LSB. An example: Jun 2 12:52:13 glance02 crmd[2518]: notice: process_lrm_event: glance02-res_glance_haproxy_monitor_5000:22 [ haproxy dead, but /var/run/haproxy.pid exists.\n ] Jun 2 12:52:13 glance02 crmd[2518]: notice: process_lrm_event: LRM operation res_glance_haproxy_stop_0 (call=33, rc=4, cib-update=19, confirmed=true) insufficient privileges Reference: haproxy_stop() { if [ ! -f $PIDFILE ] ; then # This is a success according to LSB return 0 fi for pid in $(cat $PIDFILE) ; do /bin/kill $pid || return 4 done rm -f $PIDFILE return 0 } [Proposed solution] Backport the current devel (wily) init. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1462495/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
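For reference, the LSB convention the report points at: for stop, 0 means success (stopping an already-stopped service counts as success), 1 is a generic or unspecified error, and 4 specifically means the caller had insufficient privileges. A sketch of a stop path that maps a failed kill to the generic error instead (illustrative only, not the exact wily init script that was backported):

haproxy_stop() {
    [ -f "$PIDFILE" ] || return 0          # already stopped: success per LSB
    ret=0
    for pid in $(cat "$PIDFILE"); do
        kill "$pid" 2>/dev/null || ret=1   # generic failure (1), not 4 (insufficient privileges)
    done
    [ "$ret" -eq 0 ] && rm -f "$PIDFILE"
    return "$ret"
}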
[Ubuntu-ha] [Bug 1462495] [NEW] Init file does not respect the LSB spec.
Public bug reported: [Environment] Trusty 14.02 [Description] Looking in the /etc/init.d/haproxy script, particularly the stop method, is returning 4 in case of the pidfile doesn't exists. /bin/kill $pid || return 4. According to the spec that means 'insufficient privileges' which is not correct. This is causing pacemaker and other system monitoring tools to fail because it doesn't complains with LSB. An example: Jun 2 12:52:13 glance02 crmd[2518]: notice: process_lrm_event: glance02-res_glance_haproxy_monitor_5000:22 [ haproxy dead, but /var/run/haproxy.pid exists.\n ] Jun 2 12:52:13 glance02 crmd[2518]: notice: process_lrm_event: LRM operation res_glance_haproxy_stop_0 (call=33, rc=4, cib-update=19, confirmed=true) insufficient privileges Reference: haproxy_stop() { if [ ! -f $PIDFILE ] ; then # This is a success according to LSB return 0 fi for pid in $(cat $PIDFILE) ; do /bin/kill $pid || return 4 done rm -f $PIDFILE return 0 } [Proposed solution] Backport the current devel (wily) init. ** Affects: haproxy (Ubuntu) Importance: High Assignee: Jorge Niedbalski (niedbalski) Status: In Progress ** Changed in: haproxy (Ubuntu) Status: New = In Progress ** Changed in: haproxy (Ubuntu) Importance: Undecided = High ** Changed in: haproxy (Ubuntu) Assignee: (unassigned) = Jorge Niedbalski (niedbalski) -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to haproxy in Ubuntu. https://bugs.launchpad.net/bugs/1462495 Title: Init file does not respect the LSB spec. Status in haproxy package in Ubuntu: In Progress Bug description: [Environment] Trusty 14.02 [Description] Looking in the /etc/init.d/haproxy script, particularly the stop method, is returning 4 in case of the pidfile doesn't exists. /bin/kill $pid || return 4. According to the spec that means 'insufficient privileges' which is not correct. This is causing pacemaker and other system monitoring tools to fail because it doesn't complains with LSB. An example: Jun 2 12:52:13 glance02 crmd[2518]: notice: process_lrm_event: glance02-res_glance_haproxy_monitor_5000:22 [ haproxy dead, but /var/run/haproxy.pid exists.\n ] Jun 2 12:52:13 glance02 crmd[2518]: notice: process_lrm_event: LRM operation res_glance_haproxy_stop_0 (call=33, rc=4, cib-update=19, confirmed=true) insufficient privileges Reference: haproxy_stop() { if [ ! -f $PIDFILE ] ; then # This is a success according to LSB return 0 fi for pid in $(cat $PIDFILE) ; do /bin/kill $pid || return 4 done rm -f $PIDFILE return 0 } [Proposed solution] Backport the current devel (wily) init. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1462495/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
[Ubuntu-ha] [Bug 1353473] Re: Pacemaker crm node standby stops resource successfully, but lrmd still monitors it and causes Failed actions
** Tags added: cts -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to pacemaker in Ubuntu. https://bugs.launchpad.net/bugs/1353473 Title: Pacemaker crm node standby stops resource successfully, but lrmd still monitors it and causes Failed actions Status in “pacemaker” package in Ubuntu: Fix Released Status in “pacemaker” source package in Trusty: Fix Released Status in “pacemaker” package in Debian: New Bug description: [Impact] * Whenever a user uses crm node standby the code can make lrmd still try to monitor resource put into stand-by and cause error messages. [Test Case] * To use crm node standby and check lrmd does not stop monitoring not set to stand-by. [Regression Potential] * users already tested and are using in production. * based on upstream fixes for lrmd monitoring. * potential race conditions (based on upstream history). [Other Info] * Original bug description: It was brought to me (~inaddy) the following situation: * Environment Ubuntu 14.04 LTS Pacemaker 1.1.10+git20130802-1ubuntu2 * Priority High * Issue I used crm node standby and the resource(haproxy) was stopped successfully. But lrmd still monitors it and causes Failed actions. --- Node A1LB101 (167969461): standby Online: [ A1LB102 ] Resource Group: grpHaproxy vip-internal (ocf::heartbeat:IPaddr2): Started A1LB102 vip-external (ocf::heartbeat:IPaddr2): Started A1LB102 vip-nfs (ocf::heartbeat:IPaddr2): Started A1LB102 vip-iscsi (ocf::heartbeat:IPaddr2): Started A1LB102 Resource Group: grpStonith1 prmStonith1-1 (stonith:external/stonith-helper): Started A1LB102 Clone Set: clnHaproxy [haproxy] Started: [ A1LB102 ] Stopped: [ A1LB101 ] Clone Set: clnPing [ping] Started: [ A1LB102 ] Stopped: [ A1LB101 ] Node Attributes: * Node A1LB101: * Node A1LB102: + default_ping_set : 400 Migration summary: * Node A1LB101: haproxy: migration-threshold=1 fail-count=18 last-failure='Mon Jul 7 20:28:58 2014' * Node A1LB102: Failed actions: haproxy_monitor_1 (node=A1LB101, call=2332, rc=7, status=complete, last-rc-change=Mon Jul 7 20:28:58 2014 , queued=0ms, exec=0ms ): not running --- Abstract from log (ha-log.node1) Jul 7 20:28:50 A1LB101 crmd[6364]: notice: te_rsc_command: Initiating action 42: stop haproxy_stop_0 on A1LB101 (local) Jul 7 20:28:50 A1LB101 crmd[6364]: info: match_graph_event: Action haproxy_stop_0 (42) confirmed on A1LB101 (rc=0) Jul 7 20:28:58 A1LB101 crmd[6364]: notice: process_lrm_event: A1LB101-haproxy_monitor_1:1372 [ haproxy not running.\n ] I wasn't able to reproduce this error so far but the fix seems a straightforward cherry-picking from upstream patch set fix: 48f90f6 Fix: services: Do not allow duplicate recurring op entries c29ab27 High: lrmd: Merge duplicate recurring monitor operations 348bb51 Fix: lrmd: Cancel recurring operations before stop action is executed So I'm assuming (and testing right now) this will fix the issue... Opening the public bug for the fix I'll provide after tests, and to ask others to test the fix also. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1353473/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
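A condensed form of the test case, assuming crmsh is available and using the node name from the report:

crm node standby A1LB101    # resources on the node stop
crm_mon -1                  # with the fix, no recurring haproxy_monitor failures should accumulate for A1LB101
crm node online A1LB101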
[Ubuntu-ha] [Bug 1368737] Re: Pacemaker can seg fault on crm node online/standy
** Tags added: cts -- You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to pacemaker in Ubuntu. https://bugs.launchpad.net/bugs/1368737 Title: Pacemaker can seg fault on crm node online/standy Status in “pacemaker” package in Ubuntu: Confirmed Bug description: It was brought to my attention the following situation: [Issue] lrmd process crashed when repeating crm node standby and crm node online # grep pacemakerd ha-log.k1pm101 | grep core Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49275 (lrmd) dumped core Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49275, core=1) Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 1471 (lrmd) dumped core Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=1471, core=1) Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35771 (lrmd) dumped core Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35771, core=1) Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 60709 (lrmd) dumped core Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=60709, core=1) Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35838 (lrmd) dumped core Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35838, core=1) Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49249 (lrmd) dumped core Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49249, core=1) Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 65358 (lrmd) dumped core Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=65358, core=1) Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 22693 (lrmd) dumped core Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=22693, core=1) # grep pacemakerd ha-log.k1pm102 | grep core Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 5812 (lrmd) dumped core Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=5812, core=1) Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 35781 (lrmd) dumped core Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35781, core=1) Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 51984 (lrmd) dumped core Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=51984, core=1) Analyzing core file with dbgsyms I could see that: #0 0x7f7184a45983 in services_action_sync (op=0x7f7185b605d0) at services.c:434 434 crm_trace( stdout: %s, op-stdout_data); Is responsible for the core. 
I've checked upstream code and there might be 2 important commits that could be cherry-picked to fix this behavior: commit f2a637cc553cb7aec59bdcf05c5e1d077173419f Author: Andrew Beekhof and...@beekhof.net Date: Fri Sep 20 12:20:36 2013 +1000 Fix: services: Prevent use-of-NULL when executing service actions commit 11473a5a8c88eb17d5e8d6cd1d99dc497e817aac Author: Gao,Yan y...@suse.com Date: Sun Sep 29 12:40:18 2013 +0800 Fix: services: Fix the executing of synchronous actions The core can be caused by things such as this missing check at the beginning of the services_action_sync(svc_action_t *op) function: if (op == NULL) { crm_trace("No operation to execute"); return FALSE; } This is then improved further by commit 11473a5. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/+subscriptions ___ Mailing list: https://launchpad.net/~ubuntu-ha Post to : ubuntu-ha@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-ha More help : https://help.launchpad.net/ListHelp
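The crash was hit while repeating standby/online transitions; a minimal loop to exercise that path might look like the following (node name and interval are illustrative):

while true; do
    crm node standby k1pm101; sleep 30
    crm node online k1pm101; sleep 30
done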