Reviewed:  https://review.openstack.org/265685
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=472d84d25cee0694500e583845718a4f377cc75c
Submitter: Jenkins
Branch:    master
commit 472d84d25cee0694500e583845718a4f377cc75c
Author: LIU Yulong <[email protected]>
Date:   Mon Jan 11 12:02:55 2016 +0800

    Catch PortNotFound after HA router race condition

    When the neutron server deletes all the resources of an HA router,
    the L3 agents cannot be aware of that, so a race can happen in a
    procedure like this:
    1. Neutron server deletes all resources of an HA router.
    2. The RPC fans out to L3 agent 1, where the HA router was in the
       master state.
    3. On L3 agent 2, the 'backup' router sets itself to master and
       sends the neutron server an HA router state change notification.
    4. PortNotFound is raised while updating the router HA port status.

    How do steps 2 and 3 happen? Consider that L3 agent 2 has many more
    HA routers than L3 agent 1, or any other reason that causes L3
    agent 2 to receive/process the deletion RPC later than L3 agent 1.
    When L3 agent 1 removes the HA router's keepalived process, the
    backup router on L3 agent 2 soon detects this via the VRRP
    protocol. At that point the router deletion RPC is still in the
    RouterUpdate queue, or the deletion is partway through the HA
    router deletion procedure, so router_info still holds that router's
    info. L3 agent 2 therefore runs the state change procedure, i.e. it
    notifies the neutron server to update the router state.

    This patch deals with the race by catching the PortNotFound
    exception on the neutron-server side.

    Change-Id: I34d7347595bfceb8a70685672a6287e1a44ede6b
    Closes-Bug: #1533454
    Related-Bug: #1523780

** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1533454

Title:
  L3 agent unable to update HA router state after race between HA
  router creating and deleting

Status in neutron:
  Fix Released

Bug description:
  The router L3 HA binding process does not take into account the fact
  that the port it is binding to the agent can be concurrently deleted.
  Details:
  When the neutron server deletes all the resources of an HA router,
  the L3 agents cannot be aware of that, so a race can happen in a
  procedure like this:
  1. Neutron server deletes all resources of an HA router.
  2. The RPC fans out to L3 agent 1, where the HA router was in the
     master state.
  3. On L3 agent 2, the 'backup' router sets itself to master and
     sends the neutron server an HA router state change notification.
  4. PortNotFound is raised in the function that updates HA router
     states. (The DB error seems to no longer occur.)

  How do steps 2 and 3 happen? Consider that L3 agent 2 has many more
  HA routers than L3 agent 1, or any other reason that causes L3
  agent 2 to receive/process the deletion RPC later than L3 agent 1.
  When L3 agent 1 removes the HA router's keepalived process, the
  backup router on L3 agent 2 soon detects this via the VRRP protocol.
  At that point the router deletion RPC is still in the RouterUpdate
  queue, or the deletion is partway through the HA router deletion
  procedure, so router_info still holds that router's info. L3 agent 2
  therefore runs the state change procedure, i.e. it notifies the
  neutron server to update the router state.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1533454/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
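For illustration, the shape of the server-side fix can be sketched as below. This is a minimal, self-contained Python sketch, not neutron's actual code: the PortNotFound class, FakeCorePlugin, and update_router_ha_port_status are all hypothetical stand-ins for the real neutron-server classes touched by the patch. The point is only the pattern: when the state-change notification races with router deletion, the port update raises PortNotFound, and the server treats that as a benign outcome instead of failing the RPC.

```python
class PortNotFound(Exception):
    """Stand-in for the neutron PortNotFound exception."""


class FakeCorePlugin:
    """Toy core plugin whose HA port was already deleted concurrently."""

    def update_port_status(self, port_id, status):
        # Simulates the race: the router (and its HA port) is gone by the
        # time the agent's state-change notification is processed.
        raise PortNotFound("port %s could not be found" % port_id)


def update_router_ha_port_status(core_plugin, port_id, status):
    """Update an HA port's status, tolerating a concurrent router delete.

    Returns True if the port was updated, False if it was already gone.
    """
    try:
        core_plugin.update_port_status(port_id, status)
        return True
    except PortNotFound:
        # The HA router was deleted between the agent's VRRP-triggered
        # state-change notification and this update; the race is benign,
        # so swallow the error instead of propagating it to the agent.
        return False
```

With this pattern, the stale notification from L3 agent 2 is simply ignored rather than producing a server-side traceback.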

