[openstack-dev] [neutron] VXLAN with single-NIC compute nodes: Avoiding the MTU pitfalls
Hello world,

I recently created a VXLAN test setup with single-NIC compute nodes (using OpenStack Juno on Fedora 20), consciously ignoring the OpenStack advice to use nodes with at least 2 NICs ;-) . The fact that both native and encapsulated traffic need to pass through the same NIC does create some interesting challenges, but I finally got it working cleanly, staying clear of MTU pitfalls. I documented my findings here:

[1] http://blog.systemathic.ch/2015/03/06/openstack-vxlan-with-single-nic-compute-nodes/
[2] http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/

For those interested in single-NIC setups, I'm curious what you think about [1] (a small patch is needed to add "VLAN awareness" to the qg-XXX Neutron gateway ports).

While catching up with Neutron changes for OpenStack Kilo, I came across the in-progress work on "MTU selection and advertisement":

[3] Spec: https://github.com/openstack/neutron-specs/blob/master/specs/kilo/mtu-selection-and-advertisement.rst
[4] Patch review: https://review.openstack.org/#/c/153733/
[5] Spec update: https://review.openstack.org/#/c/159146/

It seems that [1] eliminates some additional MTU pitfalls that are not addressed by [3-5]. But I think it would be nice if we could achieve [1] while coordinating with the "MTU selection and advertisement" work [3-5]. Thoughts?

Cheers,
- Fredy

Fredy ("Freddie") Neeser
http://blog.systeMathic.ch

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [neutron] VXLAN with single-NIC compute nodes: Avoiding the MTU pitfalls
OK, I looked at the devstack patch [6] "Configure mtu for ovs with the common protocols", but no -- it doesn't do the job for the VLAN-based separation of native and encapsulated traffic, which I'm using in [1] for a clean (correct MTUs ...) VXLAN setup with single-NIC compute nodes.

As shown in Figure 2 of [1], I'm using VLANs 1 and 12 for native and encapsulated traffic, respectively. I needed to manually create br-ex ports br-ex.1 (VLAN 1) and br-ex.12 (VLAN 12) and configure their MTUs. Moreover, I needed a small "VLAN awareness" patch to ensure that the Neutron router gateway port qg-XXX uses VLAN 1.

Consider the example below:

    # ip a
    ...
    2: eth0: mtu 1554 qdisc pfifo_fast master ovs-system state UP group default qlen 1000
        link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
    ...
    6: br-ex: mtu 1554 qdisc noqueue state UNKNOWN group default
        link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
    7: br-ex.1: mtu 1500 qdisc noqueue state UNKNOWN group default
        link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
        inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.1
           valid_lft forever preferred_lft forever
    8: br-ex.12: mtu 1554 qdisc noqueue state UNKNOWN group default
        link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
        inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.12
           valid_lft forever preferred_lft forever

    # ovs-vsctl show
    c0618b20-1eeb-486c-88bd-fb96988dbf96
        Bridge br-tun
            Port patch-int
                Interface patch-int
                    type: patch
                    options: {peer=patch-tun}
            Port br-tun
                Interface br-tun
                    type: internal
            Port "vxlan-c0a80115"
                Interface "vxlan-c0a80115"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="192.168.1.14", out_key=flow, remote_ip="192.168.1.21"}
        Bridge br-ex
            Port phy-br-ex
                Interface phy-br-ex
                    type: patch
                    options: {peer=int-br-ex}
            Port "br-ex.12"
                tag: 12
                Interface "br-ex.12"
                    type: internal
            Port "br-ex.1"
                tag: 1
                Interface "br-ex.1"
                    type: internal
            Port "eth0"
                tag: 1
                trunks: [1, 12]
                Interface "eth0"
            Port "qg-e046ec4e-e3"
                tag: 1
                Interface "qg-e046ec4e-e3"
                    type: internal
            Port br-ex
                Interface br-ex
                    type: internal

My home LAN ("external network") is enabled for jumbo frames, as can be seen from eth0's MTU of 1554, so my path MTU for VXLAN is 1554 bytes. VLAN 12 carries encapsulated traffic with an MTU of 1554 bytes. This allows my VMs to use the standard MTU of 1500 regardless of whether they are on different compute nodes (so they communicate via VXLAN) or on the same compute node, i.e., the effective L2 segment MTU is 1500 bytes. Because this is the default, I don't need to change guest MTUs at all.

For bridge br-ex, I configured two internal ports br-ex.{1,12} as shown in the table below:

    br-ex port        VLAN   MTU    Remarks
    --------------    ----   ----   --------------------------------
    br-ex.1           1      1500
    br-ex.12          12     1554
    br-ex                           Unused
    qg-e046ec4e-e3    1             "VLAN awareness" patch (cf. [1])

All native traffic (including routed traffic to/from a Neutron router and traffic generated by the Network Node itself) uses VLAN 1 on my LAN, with an MTU of 1500 bytes. For my small VXLAN test setup, I didn't need to assign different IPs to br-ex.1 and br-ex.12; both are assigned 192.168.1.14/24.

So why doesn't [6] do "the right thing"? -- Well, obviously [6] does not add the "VLAN awareness" that I need for the Neutron qg-XXX gateway ports. Moreover, [6] tries to auto-configure the L2 segment MTU by determining the MTU of an interface associated with $TUNNEL_ENDPOINT_IP, which is 192.168.1.14 in my case. It does this essentially by querying

    # ip -o address | awk "/192.168.1.14/ {print \$2}"

However, in my case, this would return *two* interfaces:

    br-ex.1
    br-ex.12

so the patch [6] wouldn't know which interface's MTU it should take.

I'm currently checking if the "MTU selection and advertisement" patches [3-5] are compatible with the VLAN-based traffic separation [1].
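As a rough cross-check of the 1554/1500 numbers above, the following Python sketch does the VXLAN overhead arithmetic. It assumes VXLAN over an IPv4 underlay without IP options: the outer IPv4 header (20 bytes), UDP header (8), VXLAN header (8) and inner Ethernet header (14) add 50 bytes on top of the guest MTU, or 54 bytes if the inner frame carries an 802.1Q tag. The function names are made up for illustration.

```python
# Sketch: VXLAN-over-IPv4 MTU arithmetic (assumes no IP options, IPv4 underlay).
OUTER_IPV4_HDR = 20  # outer IPv4 header
OUTER_UDP_HDR = 8    # outer UDP header (dest. port 4789)
VXLAN_HDR = 8        # VXLAN header
INNER_ETH_HDR = 14   # inner Ethernet header: dst MAC, src MAC, EtherType
INNER_DOT1Q_TAG = 4  # optional 802.1Q tag inside the inner frame


def vxlan_overhead(inner_vlan_tagged: bool = False) -> int:
    """Bytes that VXLAN encapsulation adds on top of the guest (L2 segment) MTU."""
    overhead = OUTER_IPV4_HDR + OUTER_UDP_HDR + VXLAN_HDR + INNER_ETH_HDR
    if inner_vlan_tagged:
        overhead += INNER_DOT1Q_TAG
    return overhead


def required_tunnel_mtu(segment_mtu: int, inner_vlan_tagged: bool = False) -> int:
    """Minimum MTU of the interface that sends VXLAN packets (br-ex.12 here)."""
    return segment_mtu + vxlan_overhead(inner_vlan_tagged)


def max_segment_mtu(path_mtu: int, inner_vlan_tagged: bool = False) -> int:
    """Largest guest MTU that fits through a given underlay path MTU."""
    return path_mtu - vxlan_overhead(inner_vlan_tagged)


if __name__ == "__main__":
    print(required_tunnel_mtu(1500))        # 1550
    print(required_tunnel_mtu(1500, True))  # 1554
    print(max_segment_mtu(1554))            # 1504
```

With br-ex.12 at 1554 bytes, guests at the default MTU of 1500 fit with 4 bytes to spare, which is exactly the room an inner 802.1Q tag would need.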
Regards
Fredy Neeser
http://blog.systeMathic.ch

On 06.03.2015 18:37, Attila Fazekas wrote:
> Can you check if this patch does the right thing [6]:
>
> [6] https://review.openstack.org/#/c/112523/6
>
> - Original Message -
> From: "Fredy Neeser"
> To: openstack-dev@lists.openstack.org
> Sent: Frida
Re: [openstack-dev] [neutron] VXLAN with single-NIC compute nodes: Avoiding the MTU pitfalls
[resent with a clarification of what [6] is doing towards EoM]

OK, I looked at the devstack patch [6] "Configure mtu for ovs with the common protocols", but no -- it doesn't do the job for the VLAN-based separation of native and encapsulated traffic, which I'm using in [1] for a clean (correct MTUs ...) VXLAN setup with single-NIC compute nodes.

As shown in Figure 2 of [1], I'm using VLANs 1 and 12 for native and encapsulated traffic, respectively. I needed to manually create br-ex ports br-ex.1 (VLAN 1) and br-ex.12 (VLAN 12) and configure their MTUs. Moreover, I needed a small "VLAN awareness" patch to ensure that the Neutron router gateway port qg-XXX uses VLAN 1.

Consider the example below:

    # ip a
    ...
    2: eth0: mtu 1554 qdisc pfifo_fast master ovs-system state UP group default qlen 1000
        link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
    ...
    6: br-ex: mtu 1554 qdisc noqueue state UNKNOWN group default
        link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
    7: br-ex.1: mtu 1500 qdisc noqueue state UNKNOWN group default
        link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
        inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.1
           valid_lft forever preferred_lft forever
    8: br-ex.12: mtu 1554 qdisc noqueue state UNKNOWN group default
        link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
        inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.12
           valid_lft forever preferred_lft forever

    # ovs-vsctl show
    c0618b20-1eeb-486c-88bd-fb96988dbf96
        Bridge br-tun
            Port patch-int
                Interface patch-int
                    type: patch
                    options: {peer=patch-tun}
            Port br-tun
                Interface br-tun
                    type: internal
            Port "vxlan-c0a80115"
                Interface "vxlan-c0a80115"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="192.168.1.14", out_key=flow, remote_ip="192.168.1.21"}
        Bridge br-ex
            Port phy-br-ex
                Interface phy-br-ex
                    type: patch
                    options: {peer=int-br-ex}
            Port "br-ex.12"
                tag: 12
                Interface "br-ex.12"
                    type: internal
            Port "br-ex.1"
                tag: 1
                Interface "br-ex.1"
                    type: internal
            Port "eth0"
                tag: 1
                trunks: [1, 12]
                Interface "eth0"
            Port "qg-e046ec4e-e3"
                tag: 1
                Interface "qg-e046ec4e-e3"
                    type: internal
            Port br-ex
                Interface br-ex
                    type: internal

My home LAN ("external network") is enabled for jumbo frames, as can be seen from eth0's MTU of 1554, so my path MTU for VXLAN is 1554 bytes. VLAN 12 carries encapsulated traffic with an MTU of 1554 bytes. This allows my VMs to use the standard MTU of 1500 regardless of whether they are on different compute nodes (so they communicate via VXLAN) or on the same compute node, i.e., the effective L2 segment MTU is 1500 bytes. Because this is the default, I don't need to change guest MTUs at all.

For bridge br-ex, I configured two internal ports br-ex.{1,12} as shown in the table below:

    br-ex port        VLAN   MTU    Remarks
    --------------    ----   ----   --------------------------------
    br-ex.1           1      1500
    br-ex.12          12     1554
    br-ex                           Unused
    qg-e046ec4e-e3    1             "VLAN awareness" patch (cf. [1])

All native traffic (including routed traffic to/from a Neutron router and traffic generated by the Network Node itself) uses VLAN 1 on my LAN, with an MTU of 1500 bytes. For my small VXLAN test setup, I didn't need to assign different IPs to br-ex.1 and br-ex.12; both are assigned 192.168.1.14/24.

So why doesn't [6] do "the right thing"? -- Well, obviously [6] does not add the "VLAN awareness" that I need for the Neutron qg-XXX gateway ports. Moreover, [6] tries to auto-configure the L2 segment MTU by guessing the path MTU, i.e., by determining the MTU of an interface associated with $TUNNEL_ENDPOINT_IP, which is 192.168.1.14 in my case. It does this essentially by querying

    # ip -o address | awk "/192.168.1.14/ {print \$2}"

then reading the MTU of that interface and subtracting the overhead for VXLAN encapsulation. However, in my case, the above lookup would return *two* interfaces:

    br-ex.1
    br-ex.12

so the patch [6] wouldn't know which interface's MTU it should take.

Also, when I'm doing "VLAN-based traffic separation" for an overlay setup with single-NIC nodes, I already know both the "L3 path MTU" and the desired "L2 segment MTU", so there is nothing to guess.
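The ambiguity can be illustrated with a small Python sketch of the lookup described above: find the interface(s) carrying the tunnel endpoint IP, then subtract the VXLAN overhead from that interface's MTU. The sample data mirrors the "ip a" output above; the helper names and the 50-byte overhead (VXLAN over IPv4, untagged inner frame) are assumptions for illustration, not code or values taken from [6].

```python
# Sketch of a "tunnel IP -> interface -> MTU - overhead" lookup (hypothetical helpers).

# Abbreviated "ip -o address" lines; field 2 is the interface name.
IP_O_ADDRESS = [
    "7: br-ex.1    inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.1",
    "8: br-ex.12    inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.12",
]

# Interface MTUs as shown by "ip a" above.
MTUS = {"eth0": 1554, "br-ex": 1554, "br-ex.1": 1500, "br-ex.12": 1554}

VXLAN_IPV4_OVERHEAD = 50  # assumed: outer IPv4 + UDP + VXLAN + inner Ethernet header


def interfaces_with_ip(lines, ip):
    """Return names of interfaces whose IPv4 address equals ip exactly.

    Exact comparison on purpose: the awk pattern /192.168.1.14/ is an
    unanchored regex in which '.' matches any character, so it would
    also match an address such as 192.168.1.140.
    """
    found = []
    for line in lines:
        fields = line.split()  # e.g. ["7:", "br-ex.1", "inet", "192.168.1.14/24", ...]
        if len(fields) >= 4 and fields[2] == "inet" and fields[3].split("/")[0] == ip:
            found.append(fields[1])
    return found


def guess_segment_mtu(lines, mtus, tunnel_ip):
    """Guess the L2 segment MTU; refuse to guess if the interface is ambiguous."""
    candidates = interfaces_with_ip(lines, tunnel_ip)
    if len(candidates) != 1:
        raise ValueError(f"ambiguous tunnel interface for {tunnel_ip}: {candidates}")
    return mtus[candidates[0]] - VXLAN_IPV4_OVERHEAD


if __name__ == "__main__":
    candidates = interfaces_with_ip(IP_O_ADDRESS, "192.168.1.14")
    print(candidates)  # ['br-ex.1', 'br-ex.12']
    # Each candidate leads to a different guess.
    print([MTUS[name] - VXLAN_IPV4_OVERHEAD for name in candidates])  # [1450, 1504]
```

Picking br-ex.1 would yield 1450 and picking br-ex.12 would yield 1504, so with VLAN-based separation an explicitly configured segment MTU is more robust than any guess.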
I'm currently checking if the "MTU s
Re: [openstack-dev] [neutron] VXLAN with single-NIC compute nodes: Avoiding the MTU pitfalls
On 11.03.2015 19:31, Ian Wells wrote:
> On 11 March 2015 at 04:27, Fredy Neeser <fredy.nee...@solnet.ch> wrote:
>
>>     7: br-ex.1: mtu 1500 qdisc noqueue state UNKNOWN group default
>>         link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
>>         inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.1
>>            valid_lft forever preferred_lft forever
>>     8: br-ex.12: mtu 1554 qdisc noqueue state UNKNOWN group default
>>         link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
>>         inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.12
>>            valid_lft forever preferred_lft forever
>
> I find it hard to believe that you want the same address configured on *both* of these interfaces - which one do you think will be sending packets?

Ian, thanks for your feedback!

I did choose the same address for the two interfaces, for three reasons:

1. Within my home single-LAN (underlay) environment, traffic is switched, and VXLAN traffic is confined to VLAN 12, so there is never a conflict between IP 192.168.1.14 on VLAN 1 and the same IP on VLAN 12.

   OTOH, for a more scalable VXLAN setup (with multiple underlays and L3 routing in between), I would like to use different IPs for br-ex.1 and br-ex.12 -- for example, by using separate subnets

       192.168.1.0/26    for VLAN 1
       192.168.12.0/26   for VLAN 12

   However, I'm not quite there yet (see 3.).

2. I'm using policy routing on my hosts to steer VXLAN traffic (UDP dest. port 4789) to interface br-ex.12 -- all other traffic from 192.168.1.14 is source routed from br-ex.1, presumably because br-ex.1 is a lower-numbered interface than br-ex.12 (?). It's an interesting question whether I'm relying here on the order in which I created these two interfaces.

       [root@langrain ~]# ip a
       ...
       7: br-ex.1: mtu 1500 qdisc noqueue state UNKNOWN group default
           link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
           inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.1
              valid_lft forever preferred_lft forever
       8: br-ex.12: mtu 1554 qdisc noqueue state UNKNOWN group default
           link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
           inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.12
              valid_lft forever preferred_lft forever

3. It's not clear to me how to set up multiple nodes with packstack if a node's tunnel IP does not equal its admin IP (or the OpenStack API IP in the case of a controller node). With packstack, I can only specify the compute node IPs through CONFIG_COMPUTE_HOSTS. Presumably, these IPs are used both for packstack deployment (admin IP) and for configuring the VXLAN tunnel IPs (local_ip and remote_ip parameters). How would I specify different IPs for these purposes? (Recall that my hosts have a single NIC.)

In any case, native traffic on bridge br-ex is sent via br-ex.1 (VLAN 1), which is also why the Neutron gateway port qg-XXX needs to be an access port for VLAN 1 (tag: 1). VXLAN traffic is sent from br-ex.12 on all compute nodes. See the two cases below:

Case 1: Max-size ping from compute node 'langrain' (192.168.1.14) to another host on the same LAN
=> Native traffic sent from br-ex.1; no traffic sent from br-ex.12

    [fn@langrain ~]$ ping -M do -s 1472 -c 1 192.168.1.54
    PING 192.168.1.54 (192.168.1.54) 1472(1500) bytes of data.
    1480 bytes from 192.168.1.54: icmp_seq=1 ttl=64 time=0.766 ms

    [root@langrain ~]# tcpdump -n -i br-ex.1 dst 192.168.1.54
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on br-ex.1, link-type EN10MB (Ethernet), capture size 65535 bytes
    10:32:37.666572 IP 192.168.1.14 > 192.168.1.54: ICMP echo request, id 10432, seq 1, length 1480
    10:32:42.673665 ARP, Request who-has 192.168.1.54 tell 192.168.1.14, length 28

Case 2: Max-size ping from guest1 (10.0.0.1) on compute node 'langrain' (192.168.1.14) to guest2 (10.0.0.3) on another compute node (192.168.1.21) via the VXLAN tunnel. The guests are on the same virtual network 10.0.0.0/24
=> Encapsulated traffic sent from br-ex.12; no traffic sent from br-ex.1

    $ ping -M do -s 1472 -c 1 10.0.0.3
    PING 10.0.0.3 (10.0.0.3) 1472(1500) bytes of data.
    1480 bytes from 10.0.0.3: icmp_seq=1 ttl=64 time=2.22 ms

    [root@langrain ~]# tcpdump -n -i br-ex.12
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on br-ex.12, link-type EN10MB (Ethernet), capture size 65535 bytes
    11:02:56.916265 IP 192.168.1.14.47872 > 192.168.1.21.4789: VXLAN, flags [I] (0x08), vni 10
    ARP, Request who-has 10.0.0.3 tell 10.0.0.1, length 28
    11:02:56.916991 IP 192.168.1.21.51408 > 192.168.1.14.4789: VXLAN, flags [I] (0x08), vni 10
    ARP, Reply 10.0.0.3 is-at fa:16:3e:e6:e1:c8, length 28
    11:02:56.917282 IP 192.168.1.14.57836 > 192.168.1.21.4789: VXLAN, flags [I] (0x