Reviewed:  https://review.openstack.org/534456
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fea188acd173fe09e2e6f98534c4b9cb1523ebc6
Submitter: Zuul
Branch:    master
commit fea188acd173fe09e2e6f98534c4b9cb1523ebc6
Author: Ihar Hrachyshka <[email protected]>
Date:   Tue Jan 16 13:59:39 2018 -0800

    l3_ha: only pass host into update_port when updating router port bindings

    There is a race condition in update_routers_states that may result in
    some fixed ips incorrectly deallocated from router ports. This may
    happen if update_routers_states fetches ports' state before another
    thread updates the list; then update_routers_states passes port
    payloads with old fixed ips into update_port, which results in ip
    address deallocation.

    Among other things, l3 agent will detect the change and remove the
    affected subnet prefix from radvd configuration file, since it doesn't
    configure extra_subnets for RA.

    There is no need to pass full port payload into update_port just to
    set host. This patch replaces the payload with a dict of one key -
    host. This allows core plugin to handle just this host field change,
    leaving existing allocations (and other port attributes) intact.

    Change-Id: Ib2c661d6e2cb8e34676fd83e19b6cf65c232545d
    Closes-Bug: #1743658

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1743658

Title:
  SLAAC address incorrectly deallocated from HA router port due to race
  condition

Status in neutron:
  Fix Released

Bug description:

This was originally reported in Red Hat Bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=1486324

The issue is triggered when executing
tempest.scenario.test_network_v6.TestGettingAddress tests in a loop with
L3 HA enabled.
The failure looks as follows:

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
        return f(*func_args, **func_kwargs)
      File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 258, in test_dualnet_multi_prefix_slaac
        dualnet=True)
      File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 196, in _prepare_and_test
        (ip, srv['id'], ssh.exec_command("ip address")))
      File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
        raise self.failureException(msg)
    AssertionError: Address 2003::1:f816:3eff:fee0:fbf0 not configured for instance d89f1b14-20ef-47f7-80a4-9d3173446dbc, ip address output is
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast qlen 1000
        link/ether fa:16:3e:12:f5:ea brd ff:ff:ff:ff:ff:ff
        inet 10.100.0.8/28 brd 10.100.0.15 scope global eth0
        inet6 fe80::f816:3eff:fe12:f5ea/64 scope link
           valid_lft forever preferred_lft forever
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
        link/ether fa:16:3e:e0:fb:f0 brd ff:ff:ff:ff:ff:ff
        inet6 2003::f816:3eff:fee0:fbf0/64 scope global dynamic
           valid_lft 86335sec preferred_lft 14335sec
        inet6 fe80::f816:3eff:fee0:fbf0/64 scope link
           valid_lft forever preferred_lft forever

The test case creates a network with two IPv6 SLAAC subnets, then starts
an instance on the network and checks that the instance OS configured
addresses from both prefixes. It fails because an address from the second
prefix is not configured.

When we check the l3 agent log, we see that radvd is first configured
correctly for both prefixes, but then something triggers a second radvd
reconfiguration, this time without one of the prefixes.
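
For reference, the address named in the assertion above is consistent with the usual SLAAC EUI-64 derivation from the MAC address of eth1. The sketch below is only an illustration: the helper name slaac_eui64 is hypothetical, and the two /64 prefixes are inferred from the addresses quoted in this report rather than stated in it.

```python
import ipaddress


def slaac_eui64(prefix, mac):
    """Derive the SLAAC address for `mac` within `prefix` using modified EUI-64.

    Hypothetical helper for illustration; it is not part of tempest or neutron.
    """
    net = ipaddress.IPv6Network(prefix)
    octets = [int(part, 16) for part in mac.split(":")]
    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe between the two halves of the MAC to form 64 bits.
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
    interface_id = int.from_bytes(bytes(eui64), "big")
    return net.network_address + interface_id


MAC_ETH1 = "fa:16:3e:e0:fb:f0"  # link/ether of eth1 in the output above

# Prefixes below are assumptions inferred from the quoted addresses.
configured = slaac_eui64("2003::/64", MAC_ETH1)
missing = slaac_eui64("2003:0:0:1::/64", MAC_ETH1)
print(configured)
print(missing)
```

Under those assumptions, the first value matches the global address present on eth1, and the second matches the address the test reports as not configured.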
If we trace the event that triggered the second radvd reconfiguration
back to the server, we see that the router update happened as a result of
update_routers_states execution (which is itself remotely triggered by
the l3 agent). The update_routers_states call started before the second
subnet was added to the router in question, but it completed AFTER the
subnet was added.

In the server log, we see both allocation and deallocation events for the
router gateway address:

    2018-01-16 19:50:31.459 886987 DEBUG neutron.db.db_base_plugin_common [req-13cbe23b-ae7f-472e-9049-601e75e04b6a 269a2421f89742b09cfd722dc28aca5c 97e8b9e11107489aad70e7e6d172ddce - default default] Allocated IP 2003:0:0:1::1 (d7bd8a16-6bf6-4a40-b508-c27d963b3912/fc6a62f5-d1a1-4272-aa67-a70ee8b1ee01/6f4b11b3-4a31-4264-8c3b-439c8bd903bd) _store_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:122

    2018-01-16 19:50:35.144 883810 DEBUG neutron.db.db_base_plugin_common [req-445e4d27-16e4-463f-bfce-b7603a739b01 - - - - -] Delete allocated IP 2003:0:0:1::1 (d7bd8a16-6bf6-4a40-b508-c27d963b3912/fc6a62f5-d1a1-4272-aa67-a70ee8b1ee01) _delete_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:108

The allocation event belongs to add_router_interface, while the deletion
comes from update_routers_states.

Code inspection suggests that deallocation happens because
update_routers_states does the following:

1. fetch all router ports;
2. then for each port payload, set host, and pass the payload into
   update_port.

If add_router_interface happens between those two steps, we risk calling
update_port with a port payload that DOESN'T contain a fixed_ip that was
added during the add_router_interface call.

I think we should avoid passing the whole port payload into update_port
and instead pass a dict with a single key, host.
This is semantically correct, fixes the race condition, and in theory may
be slightly quicker, since the core plugin won't need to process fields
that were not changed.
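
The race described in this report can be reproduced in miniature. The following is a sketch under stated assumptions, not neutron code: FakeCorePlugin and run_scenario are hypothetical names, the in-memory update_port only mimics the "addresses missing from the payload get deallocated" behaviour described above, the first address 2003::1 is illustrative, and the key name binding:host_id is assumed to stand for the "host" key mentioned in the commit message.

```python
import copy

HOST_KEY = "binding:host_id"  # assumed name of the port's host binding key


class FakeCorePlugin:
    """Hypothetical in-memory stand-in for the core plugin (not neutron code)."""

    def __init__(self):
        self.ports = {}

    def get_port(self, port_id):
        # Return a snapshot, the way a DB/API read would.
        return copy.deepcopy(self.ports[port_id])

    def update_port(self, port_id, payload):
        port = self.ports[port_id]
        if "fixed_ips" in payload:
            # Any address missing from the payload is treated as removed.
            port["fixed_ips"] = list(payload["fixed_ips"])
        if HOST_KEY in payload:
            port[HOST_KEY] = payload[HOST_KEY]


def run_scenario(send_full_payload):
    plugin = FakeCorePlugin()
    plugin.ports["p1"] = {"id": "p1", "fixed_ips": ["2003::1"], HOST_KEY: ""}

    # Step 1 of update_routers_states: fetch the router port.
    stale = plugin.get_port("p1")

    # Meanwhile, add_router_interface allocates an address on the second subnet.
    plugin.ports["p1"]["fixed_ips"].append("2003:0:0:1::1")

    # Step 2: set host and call update_port.
    if send_full_payload:
        stale[HOST_KEY] = "node-1"
        plugin.update_port("p1", stale)  # stale fixed_ips overwrite the new ones
    else:
        plugin.update_port("p1", {HOST_KEY: "node-1"})  # only host is touched
    return plugin.ports["p1"]["fixed_ips"]


print(run_scenario(send_full_payload=True))
print(run_scenario(send_full_payload=False))
```

In this toy model the full-payload call drops the address added in between, while the host-only call leaves both addresses in place, which is the behaviour the proposed change relies on.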

