OK, I have now completed testing the bionic-proposed keepalived package with OpenStack Queens, and I am happy that it resolves the problem: keepalived now tears down routes, vips, evips etc. when it comes back up and transitions from master to backup.

My test consisted of deploying Queens with 3 gateways, creating 100 users/projects each with 1 router, creating some instances with floating IPs, then forcibly killing both the keepalived and neutron-keepalived-state-change processes associated with a particular router for which I have an instance with a fip. I then observed that the qrouter ns interfaces for that router were definitely unconfigured and that the vrrp transition happened as expected.

This is in contrast to e.g. keepalived 1:1.2.19-1ubuntu0.2, available with all Xenial releases of OpenStack, for which I consistently see the qrouter interfaces remain configured on > 1 gateway.
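For anyone wanting to repeat the "configured on > 1 gateway" check, here is a minimal sketch of how it could be automated. It is not the procedure I used, just an illustration. It assumes you have already collected the text output of `ip -o addr show` from inside the qrouter namespace on each gateway; the function names are hypothetical, and skipping the `lo` interface is my own assumption about what is not interesting.

```python
from collections import defaultdict


def parse_addresses(ip_oneline_output):
    """Return the set of (interface, address) pairs found in `ip -o addr show` text."""
    pairs = set()
    for line in ip_oneline_output.splitlines():
        fields = line.split()
        # Expected line shape: "<idx>: <ifname> inet|inet6 <addr>/<prefix> ..."
        if len(fields) >= 4 and fields[2] in ("inet", "inet6"):
            pairs.add((fields[1], fields[3]))
    return pairs


def addresses_on_multiple_hosts(per_host_output, ignored_interfaces=("lo",)):
    """Map each (interface, address) pair seen on more than one host to those hosts.

    per_host_output: dict of hostname -> `ip -o addr show` text captured in the
    router namespace on that host. Ignoring loopback is an assumption.
    """
    seen = defaultdict(set)
    for host, output in per_host_output.items():
        for ifname, addr in parse_addresses(output):
            if ifname not in ignored_interfaces:
                seen[(ifname, addr)].add(host)
    # Only pairs present on two or more hosts indicate a leftover configuration.
    return {pair: sorted(hosts) for pair, hosts in seen.items() if len(hosts) > 1}
```

An empty result means no address is configured in the namespace on more than one gateway, which is what I would expect with the fixed package. A non-empty result lists each duplicated address together with the gateways that still carry it.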
For completeness (although it has no bearing on the keepalived fix), I also still see the other issue remain for bionic, whereby in neutron the router is listed as being active on > 1 host, e.g. (truncating so that it will display properly):

+-//---------------------------+---------+----------------+-------+----------+
| // id                        | host    | admin_state_up | alive | ha_state |
+-//---------------------------+---------+----------------+-------+----------+
| //901-4edd-86fb-8dbfe7373255 | crustle | True           | :-)   | active   |
| //961-4318-9743-775ebc9b0067 | chespin | True           | :-)   | active   |
| //628-4c2e-8e91-c309e4477c75 | orgen   | True           | :-)   | standby  |
+-//---------------------------+---------+----------------+-------+----------+

The reason for this is simple, and the good news is that with the fixed keepalived it is also benign. Neutron detects state changes by running ip monitor on the qrouter interfaces, and since my test involved killing both neutron-keepalived-state-change (which runs ip monitor) and keepalived, the vrrp transition appears to have happened before neutron had ip monitor running again. Looking at the l3-agent logs I see:

2018-07-25 10:19:33.636 14018 WARNING neutron.agent.linux.external_process [-] Respawning keepalived for uuid 75d24bfb-9807-4216-af4a-3aac37cf2417
2018-07-25 10:19:33.638 14018 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-75d24bfb-9807-4216-af4a-3aac37cf2417', 'keepalived', '-P', '-f', '/var/lib/neutron/ha_confs/
2018-07-25 10:19:33.886 14018 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-75d24bfb-9807-4216-af4a-3aac37cf2417', 'neutron-keepalived-state-change', '--router_id=75d24

i.e. neutron starts keepalived BEFORE keepalived-state-change, so if the transition and teardown happen before the latter comes up and launches ip monitor, it never sees the changes and has nothing to report to neutron.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to keepalived in Ubuntu.
https://bugs.launchpad.net/bugs/1744062

Title:
  [SRU] L3 HA: multiple agents are active at the same time

Status in Ubuntu Cloud Archive:
  Triaged
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in Ubuntu Cloud Archive ocata series:
  Triaged
Status in Ubuntu Cloud Archive pike series:
  Triaged
Status in Ubuntu Cloud Archive queens series:
  Fix Committed
Status in neutron:
  New
Status in keepalived package in Ubuntu:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in keepalived source package in Xenial:
  Triaged
Status in neutron source package in Xenial:
  New
Status in keepalived source package in Bionic:
  Fix Committed
Status in neutron source package in Bionic:
  New

Bug description:
  [Impact]

  This is the same issue reported in https://bugs.launchpad.net/neutron/+bug/1731595; however, that bug is marked as 'Fix Released', the issue is still occurring, and I can't change it back to 'New', so it seems best to just open a new bug.

  It seems as if this bug surfaces due to load issues. While the fix provided by Venkata in https://bugs.launchpad.net/neutron/+bug/1731595 (https://review.openstack.org/#/c/522641/) should help clean things up at the time of l3 agent restart, issues seem to come back later down the line in some circumstances. xavpaice mentioned he saw multiple routers active at the same time when they had 464 routers configured on 3 neutron gateway hosts using L3HA, and each router was scheduled to all 3 hosts.
  However, jhebden mentions that things seem stable at the 400 L3HA router mark, and it's worth noting this is the same deployment that xavpaice was referring to.

  keepalived has a patch upstream in 1.4.0 that provides a fix for removing left-over addresses if keepalived aborts. That patch will be cherry-picked to Ubuntu keepalived packages.

  [Test Case]

  The following SRU process will be followed:
  https://wiki.ubuntu.com/OpenStackUpdates

  In order to avoid regression of existing consumers, the OpenStack team will run their continuous integration test against the packages that are in -proposed. A successful run of all available tests will be required before the proposed packages can be let into -updates.

  The OpenStack team will be in charge of attaching the output summary of the executed tests. The OpenStack team members will not mark ‘verification-done’ until this has happened.

  [Regression Potential]

  The regression potential is lowered as the fix is cherry-picked without change from upstream. In order to mitigate the regression potential, the results of the aforementioned tests are attached to this bug.

  [Discussion]

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1744062/+subscriptions

