Marking this 'invalid' since, as you suggest, Neutron 9.4.1 (Newton)
reached end of life on 10/25/2017 and is no longer supported upstream. If
you believe this is still an issue in master, please comment again
and I will change the status appropriately.
** Changed in: neutron
Status: New => Invalid
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1794569
Title:
DVR with static routes may cause routed traffic to be dropped
Status in neutron:
Invalid
Bug description:
Neutron version: 9.4.1 (EOL, but bug may still be present)
Network scenario: Openvswitch with DVR
Openvswitch version: 2.6.1
OpenStack installation version: Newton
Operating system: Ubuntu 16.04.5 LTS
Kernel: 4.4.0-135 x86_64
Symptoms:
Instances whose default gateway is a DVR interface (10.10.255.1 in our case)
occasionally lose connectivity to non-local networks: any packet that
has to pass through the local virtual router is dropped. Sometimes this
behavior lasts for a few milliseconds, sometimes for tens of seconds. Since
floating-ip traffic is a subset of those cases, north-south connectivity breaks
too.
Steps to reproduce:
- Use DVR routing mode
- Configure at least one static route in the virtual router, whose next hop
is NOT an address managed by Neutron (e.g. a physical interface on a VPN
gateway; in our case 10.2.0.0/24 with next-hop 10.10.0.254)
- Have an instance plugged into a Flat or VLAN network and use the virtual
router as its default gateway
- Try to reach a host inside the statically-routed network from within the
instance
Possible explanation:
Distributed routers get their ARP caches populated by neutron-l3-agent at its
startup. The agent takes all the ports in a given subnet and fills in their
IP-to-MAC mappings inside the qrouter- namespace, as permanent entries (meaning
they won't expire from the cache). However, if Neutron doesn't manage an IP (as
is the case with our static route's next-hop 10.10.0.254), a permanent record
isn't created, naturally.
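For illustration, a permanent entry of this kind can be expressed as an
iproute2 command run inside the router namespace. The Python sketch below only
builds that command line; it is not taken from the agent's code. The router ID,
the qr- device name and the MAC address are hypothetical placeholders; only
10.10.0.254 comes from our setup.

```python
def permanent_neigh_cmd(router_id, device, ip, mac):
    """Build the argv that pins ip -> mac as a permanent neighbour entry
    inside the qrouter- namespace of the given router."""
    return [
        "ip", "netns", "exec", f"qrouter-{router_id}",
        "ip", "neigh", "replace", ip,
        "lladdr", mac,
        "dev", device,
        "nud", "permanent",  # permanent entries never expire from the cache
    ]


# Hypothetical values: ROUTER_ID, qr-EXAMPLE and the MAC are placeholders.
cmd = permanent_neigh_cmd(
    "ROUTER_ID", "qr-EXAMPLE", "10.10.0.254", "00:00:5e:00:53:01"
)
print(" ".join(cmd))
```

Running the printed command by hand (as root, with real values substituted)
would be a manual way to check whether a permanent entry for the next hop
avoids the ARP request described below.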
So when we try to reach a host in the statically-routed network (e.g.
10.2.0.10) from inside the instance, the packet goes to the default
gateway (10.10.255.1). When it arrives in the qrouter- namespace, it
matches the static route pointing to 10.10.0.254 as the next hop.
However, qrouter- doesn't have the next hop's MAC address, so it
sends out an ARP request with the source MAC of the distributed
router's qr- interface.
And that's the problem. Since ARP requests are usually broadcasts,
they land on pretty much every hypervisor in the network within the
same VLAN. Combined with the fact that qr- interfaces in a given
qrouter- namespace have the same MAC address on every host, this leads
to a disaster: every integration bridge receives that ARP request
on the port that connects it to the Flat/VLAN network and learns that
the qr- interface's MAC address is reachable there, not on the qr-
port that is also attached to br-int. From this moment on, packets from
instances that need to pass via qrouter- are forwarded to the
Flat/VLAN network interface, circumventing the qrouter- namespace.
This is especially problematic with traffic that needs to be SNAT-ed
on its way out.
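The suspected learning behavior can be reproduced with a toy model. The Python
sketch below is a plain MAC-learning switch, not OVS itself, and it assumes
br-int forwards by simple source-MAC learning. The port names and MAC addresses
are hypothetical.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"
QR_MAC = "fa:16:3e:00:00:01"  # hypothetical MAC shared by the qr- interfaces
VM_MAC = "fa:16:3e:00:00:02"  # hypothetical instance MAC


class LearningBridge:
    """Toy MAC-learning switch standing in for one hypervisor's br-int."""

    def __init__(self):
        self.fdb = {}  # learned MAC address -> port

    def receive(self, in_port, src_mac, dst_mac):
        # Learn (or overwrite) where the source MAC was last seen.
        self.fdb[src_mac] = in_port
        if dst_mac == BROADCAST:
            return "flood"
        # Unknown unicast destinations are flooded as well.
        return self.fdb.get(dst_mac, "flood")


br_int = LearningBridge()

# The local qrouter- namespace sends a frame, so QR_MAC is learned on qr-port.
br_int.receive("qr-port", QR_MAC, BROADCAST)
before = br_int.receive("vm-port", VM_MAC, QR_MAC)

# An ARP request from another hypervisor's qrouter- arrives via the physical
# network with the same source MAC and overwrites the learned entry.
br_int.receive("phys-port", QR_MAC, BROADCAST)
after = br_int.receive("vm-port", VM_MAC, QR_MAC)

print(before, after)  # qr-port phys-port
```

In this model the instance's traffic to its gateway switches from qr-port to
phys-port as soon as the foreign ARP request is seen, which matches the
symptom we observe.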
Workarounds:
- The workaround that we used is creating stub Neutron ports for the next-hop
addresses, with the correct MACs. After restarting neutron-l3-agents, they got
populated into the qrouter- ARP cache as permanent entries.
- Another option is to set the static route in the instances' routing tables
instead of in the virtual router. This way it's the instance that performs ARP
resolution and not the qrouter- namespace.
- Another workaround might consist of using ebtables/arptables on hypervisors
to block incoming ARP requests from qrouters.
Possible long-term solution:
Maybe it would help if ancillary bridges (those connecting Flat/VLAN network
interfaces to br-int) contained an OVS flow that drops ARP requests with source
MAC addresses of qr- interfaces originating from the physical interface. Since
their IPs and MACs are well defined (their device_owner is
"network:router_interface_distributed"), it shouldn't be a problem setting
these flows up. However, I'm not sure what the drawbacks of this approach would be.
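A rough sketch of what such a flow could look like, written as a Python helper
that renders a flow specification in ovs-ofctl syntax. This is an illustration
of the idea only, not an existing Neutron feature. The priority value, the
OpenFlow port number and the MAC address are assumptions.

```python
def drop_qr_arp_flow(phys_ofport, qr_mac, priority=100):
    """Render a flow that drops ARP requests (arp_op=1) entering on the
    physical port with a qr- interface's MAC as the Ethernet source."""
    return (
        f"priority={priority},arp,in_port={phys_ofport},"
        f"dl_src={qr_mac},arp_op=1,actions=drop"
    )


# Hypothetical values: OpenFlow port 1 and a made-up qr- MAC address.
flow = drop_qr_arp_flow(1, "fa:16:3e:00:00:01")
print(flow)
```

The resulting string could be passed to "ovs-ofctl add-flow <bridge>" on the
bridge attached to the physical interface, with one flow per distributed
router interface MAC.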
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1794569/+subscriptions