This bug was fixed in the package neutron - 2:12.1.1-0ubuntu4

---------------
neutron (2:12.1.1-0ubuntu4) bionic; urgency=medium
  * Fix interrupt of VLAN traffic on reboot of neutron-ovs-agent:
    - d/p/0001-ovs-agent-signal-to-plugin-if-tunnel-refresh-needed.patch
      (LP: #1853613)
    - d/p/0002-Do-not-block-connection-between-br-int-and-br-phys-o.patch
      (LP: #1869808)
    - d/p/0003-Ensure-that-stale-flows-are-cleaned-from-phys_bridge.patch
      (LP: #1864822)
    - d/p/0004-DVR-Reconfigure-re-created-physical-bridges-for-dvr-.patch
      (LP: #1864822)
    - d/p/0005-Ensure-drop-flows-on-br-int-at-agent-startup-for-DVR.patch
      (LP: #1887148)
    - d/p/0006-Don-t-check-if-any-bridges-were-recrected-when-OVS-w.patch
      (LP: #1864822)
    - d/p/0007-Not-remove-the-running-router-when-MQ-is-unreachable.patch
      (LP: #1871850)

 -- Edward Hope-Morley <[email protected]>  Mon, 22 Feb 2021 16:55:40 +0000

** Changed in: neutron (Ubuntu Bionic)
       Status: Fix Committed => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1869808

Title:
  reboot neutron-ovs-agent introduces a short interrupt of vlan traffic

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  Fix Committed
Status in Ubuntu Cloud Archive rocky series:
  Fix Committed
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Bionic:
  Fix Released
Status in neutron source package in Focal:
  Fix Released
Status in neutron source package in Groovy:
  Fix Released
Status in neutron source package in Hirsute:
  Fix Released

Bug description:
  (SRU template copied from comment 42)

  [Impact]

  - When there is a RabbitMQ or neutron-api outage, the
    neutron-openvswitch-agent undergoes a "resync" process and
    temporarily blocks all VM traffic.
  This always happens for a short period (maybe <1 second), but in some
  high-scale environments it lasts for minutes. If RabbitMQ is down
  again during the resync, traffic will also be blocked until the agent
  can reconnect, which may be a long time. This also affects situations
  where neutron-openvswitch-agent is intentionally restarted while
  RabbitMQ is down. Bug #1869808 addresses this issue, and Bug #1887148
  is a fix for that fix to prevent network loops during DVR startup.

  - In the same situation, the neutron-l3-agent can delete the L3 router
    (Bug #1871850), may need to refresh the tunnel (Bug #1853613), or
    may need to update flows or reconfigure bridges (Bug #1864822).

  [Test Plan]

  (1) Deploy OpenStack Bionic-Queens with DVR and a *VLAN* tenant
      network (VXLAN or FLAT will not reproduce the issue). With a
      standard deployment, simply enabling DHCP on the ext_net subnet
      will allow VMs to be booted directly on the ext_net provider
      network: run "openstack subnet set --dhcp ext_net" and then deploy
      the VM directly to ext_net.
  (2) Deploy a VM to the VLAN network.
  (3) Start pinging the VM from an external network.
  (4) Stop all RabbitMQ servers.
  (5) Restart neutron-openvswitch-agent.
  (6) Ping traffic should NOT see any interruption.
  (7) Start all RabbitMQ servers.
  (8) Ping traffic should still be fine.

  [Where problems could occur]

  These patches are all cherry-picked from the upstream stable branches
  and have existed upstream, including on the stable/queens branch, for
  many months. In Ubuntu, all subsequent supported releases (Stein
  onwards) have also carried these patches for many months, with the
  exception of Queens. There is a chance that not installing these drop
  flows during startup could let traffic go somewhere unexpected while
  the network is only partially set up; this was the case for DVR, and
  in setups with more than one DVR external network port a network loop
  could temporarily be created. This was already addressed with the
  included patch for Bug #1869808.
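  The "no interruption" check in step (6) can be automated by scanning
  the ping output for gaps in the ICMP sequence numbers. A minimal
  sketch (the helper name and the sample text below are illustrative,
  not part of the SRU test plan):

```python
import re

def missing_seqs(ping_output: str) -> list:
    """Return ICMP sequence numbers missing from a contiguous ping run."""
    seqs = [int(m) for m in re.findall(r"seq=(\d+)", ping_output)]
    if not seqs:
        return []
    # Any number in [first, last] that never appeared is a lost packet.
    return sorted(set(range(seqs[0], seqs[-1] + 1)) - set(seqs))

sample = """\
64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms
64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
"""
print(missing_seqs(sample))  # [62] -> an interruption was observed
```

  An empty result from a ping run spanning steps (4) through (8) would
  indicate the fix is working as intended.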
  Checked and could not locate any other merged changes to this
  drop_port logic that also need to be backported.

  [Other Info]

  [original description]

  We are using OpenStack Neutron 13.0.6, deployed using OpenStack-Helm.
  I ping servers in the same VLAN while restarting neutron-ovs-agent.
  The result shows:

  root@mgt01:~# openstack server list
  +--------------------------------------+-----------------+--------+------------------------------------------+------------------------------+-----------+
  | ID                                   | Name            | Status | Networks                                 | Image                        | Flavor    |
  +--------------------------------------+-----------------+--------+------------------------------------------+------------------------------+-----------+
  | 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1        | ACTIVE | vlan105=172.31.10.4                      | Cirros 0.4.0 64-bit          | m1.tiny   |
  | 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2        | ACTIVE | vlan105=172.31.10.18                     | Cirros 0.4.0 64-bit          | m1.tiny   |
  +--------------------------------------+-----------------+--------+------------------------------------------+------------------------------+-----------+

  $ ping 172.31.10.4
  PING 172.31.10.4 (172.31.10.4): 56 data bytes
  ......
  64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
  64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms
  64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms   <--------
  64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
  64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
  64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
  64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
  64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms

  As one can see, packet seq=62 is lost, I believe while the OVS agent
  was restarting.

  Right now I suspect this code is refreshing flow table rules even
  though it is not necessary:
  https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229
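  For context on why a wholesale flow refresh interrupts traffic: if the
  agent deletes every flow and then reinstalls the desired set, there is
  a window during which the table is empty and packets are dropped. The
  upstream fixes instead clean only stale entries. The idea can be
  illustrated with a toy model (the class and function names here are
  hypothetical, not Neutron's actual API): flows are stamped with a
  per-session cookie; on resync the desired flows are (re)installed
  under the new cookie first, and only entries still carrying the old
  cookie are removed afterwards, so matching traffic is forwarded
  throughout.

```python
# Toy model of cookie-based stale-flow cleanup. Names are illustrative,
# not Neutron's real classes or methods.

class Bridge:
    def __init__(self):
        self.flows = {}  # match -> (cookie, actions)

    def add_flow(self, cookie, match, actions):
        # Adding a flow for an existing match overwrites it in place.
        self.flows[match] = (cookie, actions)

    def delete_flows_with_cookie(self, cookie):
        self.flows = {m: (c, a) for m, (c, a) in self.flows.items()
                      if c != cookie}

def resync(bridge, old_cookie, new_cookie, desired):
    # 1. Install/overwrite desired flows under the new cookie; traffic
    #    matching them keeps flowing, with no empty-table window.
    for match, actions in desired.items():
        bridge.add_flow(new_cookie, match, actions)
    # 2. Only now remove whatever still carries the stale cookie.
    bridge.delete_flows_with_cookie(old_cookie)

br = Bridge()
br.add_flow(0xAAAA, "dl_vlan=2", "mod_vlan_vid:105,NORMAL")
br.add_flow(0xAAAA, "dl_vlan=3", "drop")  # stale: not in desired state
resync(br, 0xAAAA, 0xBBBB, {"dl_vlan=2": "mod_vlan_vid:105,NORMAL"})
print(br.flows)  # only dl_vlan=2 remains, now under cookie 0xBBBB
```

  The ordering is the whole point: install-then-clean keeps at least one
  matching entry present for live traffic at every moment, whereas
  clean-then-install drops packets in between.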
  When I dump flows on the phys bridge, I can see the duration rewinding
  to 0, which suggests the flow has been deleted and created again:

  """
  duration=secs
      The time, in seconds, that the entry has been in the table. secs
      includes as much precision as the switch provides, possibly to
      nanosecond resolution.
  """

  root@compute01:~# ovs-ofctl dump-flows br-floating
  ...
  cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, n_bytes=103409, priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
                             ^------ this value resets
  ...

  IMO, restarting the ovs-agent should not affect the data plane.

  To manage notifications about this bug go to:
  https://bugs.launchpad.net/cloud-archive/+bug/1869808/+subscriptions
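  The duration reset described above can also be detected automatically
  by comparing two dump-flows snapshots taken before and after the agent
  restart. A small sketch (the helpers are illustrative; it keys flows
  by cookie, which works for the dump shown, though in general one might
  key on the match fields instead since an agent may change its cookie
  across restarts):

```python
import re

def flow_durations(dump: str) -> dict:
    """Map cookie -> duration (s) parsed from `ovs-ofctl dump-flows` output."""
    pat = re.compile(r"cookie=(0x[0-9a-f]+), duration=([\d.]+)s")
    return {cookie: float(dur) for cookie, dur in pat.findall(dump)}

def recreated_flows(before: str, after: str) -> list:
    """Cookies whose duration went backwards, i.e. the flow was re-added."""
    b, a = flow_durations(before), flow_durations(after)
    return [c for c in b if c in a and a[c] < b[c]]

before = "cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100"
after = "cookie=0x673522f560f5ca4f, duration=1.204s, table=2, n_packets=3"
print(recreated_flows(before, after))  # ['0x673522f560f5ca4f']
```

  A non-empty result across an agent restart is exactly the symptom the
  reporter observed: the flow was torn down and reinstalled rather than
  left in place.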

