Re: [openstack-dev] [Neutron] MTU configuration pain
On Mon, Jan 25, 2016 at 08:16:03PM EST, Fox, Kevin M wrote:
> Another place to look...
> I've had to use network_device_mtu=9000 in nova's config as well to get
> mtu's working smoothly.

I'll have to read the code on the Nova side and familiarize myself, but
this sounds like a case of DRY that needs to be done. We should just set
it once *somewhere* and then communicate it to related OpenStack
components.
--
Sean M. Collins

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Neutron] MTU configuration pain
big +1 from me. :)

Kevin

From: Sean M. Collins [s...@coreitpro.com]
Sent: Tuesday, January 26, 2016 9:59 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Neutron] MTU configuration pain
Re: [openstack-dev] [Neutron] MTU configuration pain
As I recall, network_device_mtu sets up the MTU on a bunch of structures
independently of whatever the correct value is. It was a bit of a
workaround back in the day and is still a bit of a workaround now. I'd
sooner we actually fix up the new mechanism (which is kind of hard to do
when the closest I have to information is 'it probably doesn't work').

On 26 January 2016 at 09:59, Sean M. Collins wrote:
> We should just set it once *somewhere* and then communicate it to
> related OpenStack components.
Re: [openstack-dev] [Neutron] MTU configuration pain
On Mon, Jan 25, 2016 at 01:37:55AM EST, Kevin Benton wrote:
> At a minimum I think we should pick a default in devstack and dump a
> warning in neutron if operators don't specify it.

Here's the DevStack change that implements this.

https://review.openstack.org/#/c/267604/

Again this just fixes it for DevStack. Deployers still need to set the
MTUs by hand in their deployment tool of choice. I would hope that we can
still move forward with some sort of automatic discovery - and also
figure out a way to take it from 3 different config knobs down to like
one master knob, for the sake of sanity.
--
Sean M. Collins
Re: [openstack-dev] [Neutron] MTU configuration pain
BTW, regarding devstack: see https://bugs.launchpad.net/devstack/+bug/1532924.

I have been trying to get the current code to work, following the ideas in
https://specs.openstack.org/openstack/fuel-specs/specs/7.0/jumbo-frames-between-instances.html#proposed-change .
It fails only at the last step: the MTU on the network interface inside
the VM is still 1500.

Regards,
Mike
Re: [openstack-dev] [Neutron] MTU configuration pain
Ian,

Overthinking and corner cases led to the existing implementation, which
doesn't solve the MTU problem and arguably makes the situation worse
because options in the configuration files give operators the impression
they can control it. For example, segment_mtu does nothing in the in-tree
drivers, network_device_mtu only impacts parts of some in-tree drivers,
and path_mtu only provides a way to change the MTU for VMs for all
in-tree drivers. I ran my experiments without any of these options to
provide a clean slate for empirically analyzing the problem and finding a
solution for the majority of operators.

Matt

On Mon, Jan 25, 2016 at 6:31 AM, Sean M. Collins wrote:
> Again this just fixes it for DevStack. Deployers still need to set the
> MTUs by hand in their deployment tool of choice.
Re: [openstack-dev] [Neutron] MTU configuration pain
You need to set path_mtu.

https://review.openstack.org/#/c/267604/ sets it now and defaults to 1500
- then Neutron calculates the overhead for your tunnel protocol down to
the appropriate value.
--
Sean M. Collins
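The overhead arithmetic described here can be sketched as follows. This is a minimal illustration, not Neutron's actual code; the 50-byte VXLAN figure comes from this thread, and the 42-byte GRE figure is a common reckoning (outer Ethernet + IP + GRE with key):

```shell
# Sketch: take the operator-provided path_mtu and subtract per-protocol
# encapsulation overhead to get the MTU that instances on an overlay
# network should use.
path_mtu=1500

# VXLAN: outer Ethernet (14) + IPv4 (20) + UDP (8) + VXLAN (8) = 50 bytes
vxlan_overhead=50
# GRE: outer Ethernet (14) + IPv4 (20) + GRE with key (8) = 42 bytes
gre_overhead=42

echo "VXLAN instance MTU: $((path_mtu - vxlan_overhead))"  # 1450
echo "GRE instance MTU:   $((path_mtu - gre_overhead))"    # 1458
```

This matches the 1450 value quoted elsewhere in the thread for ports on VXLAN networks with a 1500-byte underlay.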
Re: [openstack-dev] [Neutron] MTU configuration pain
On 01/24/2016 07:43 PM, Ian Wells wrote:
> Also, I say 9000, but why is 9000 even the right number?

While that may have been a rhetorical question... Because that is the
value Alteon picked in the late 1990s when they created the de facto
standard for "Jumbo Frames" by including it in their Gigabit Ethernet kit
as a way to enable the systems of the day to have a hope of getting
link-rate :) Perhaps they picked 9000 because it was twice the 4500 of
FDDI, which itself was selected to allow space for 4096 bytes of data and
then a good bit of headers.

rick jones
Re: [openstack-dev] [Neutron] MTU configuration pain
On 25 January 2016 at 07:06, Matt Kassawara wrote:
> Overthinking and corner cases led to the existing implementation which
> doesn't solve the MTU problem and arguably makes the situation worse
> because options in the configuration files give operators the impression
> they can control it.

We are giving the impression we solved the problem because we tried to
comprehensively solve the problem (documentation aside, apparently). It's
complex when you want to do complex things, but the right answer for basic
end users is adding these two lines to neutron.conf, which I don't think
is asking too much:

path_mtu = 1500     # for VXLAN and GRE; MTU is 1450 on ports on VXLAN networks
segment_mtu = 1500  # for VLAN; MTU is 1500 on ports on VLAN networks

(while leaving the floor open for the other 1% of cases, where the options
cover pretty much everything you'd want to do). So: I don't know what
path_mtu and segment_mtu settings you used that disappointed you; could
you recap? Can you tell me whether the two options above help?

> For example, the segment_mtu does nothing in the in-tree drivers, the
> network_device_mtu option only impacts parts of some in-tree drivers,
> and path_mtu only provides a way to change the MTU for VMs for all
> in-tree drivers.

I was reading what documentation I could find (I may have written the
spec, but I didn't write the code, so I have to check the docs like
everyone else) and it says it should work - so anything else is a bug,
which we should go out and fix. What test cases did you try?

network_device_mtu is an old hack, this much I know, and path_mtu and
segment_mtu are intended to be the correct modern way of doing things.
path_mtu should not apply to all in-tree drivers; specifically, it should
only apply to L3 overlays (as segment_mtu should only apply to VLANs).
(And by the wording of your statement I have to ask - are you seeing
VM MTU = path MTU? Because you shouldn't be.)

I see there are plausible-looking unit tests for segment_mtu, so if it's
not working, then in what specific configuration is it not working?

> I ran my experiments without any of these options to provide a clean
> slate for empirically analyzing the problem and finding a solution for
> the majority of operators.

I'm afraid you've not been clear about what setups you've tested where
path_mtu and segment_mtu *are* set - you dismissed them so I presume you
tried. When you say they don't do what you want, what do they do wrong?
Re: [openstack-dev] [Neutron] MTU configuration pain
Another place to look... I've had to use network_device_mtu=9000 in nova's
config as well to get mtu's working smoothly.

Thanks,
Kevin

From: Matt Kassawara [mkassaw...@gmail.com]
Sent: Monday, January 25, 2016 5:00 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Neutron] MTU configuration pain
Re: [openstack-dev] [Neutron] MTU configuration pain
Results from the Open vSwitch agent... I highly recommend reading further,
but here's the TL;DR: Using physical network interfaces with MTUs larger
than 1500 reveals problems in several places, but only involving Linux
components rather than Open vSwitch components (such as br-int) on both
the controller and compute nodes. Most of the problems involve MTU
disparities in security group bridge components on the compute node.

First, review the OpenStack bits and resulting network components in the
environment [1] and see that a typical 'ping' works using IPv4 and IPv6 [2].

[1] https://gist.github.com/ionosphere80/23655bedd24730d22c89
[2] https://gist.github.com/ionosphere80/5f309e7021a830246b66

Note: The tcpdump output in each case references up to seven points:
neutron router gateway on the public network (qg), namespace end of the
neutron router interface on the private network (qr), controller node end
of the VXLAN network (underlying interface), compute node end of the VXLAN
network (underlying interface), Open vSwitch end of the veth pair for the
security group bridge (qvo), Linux bridge end of the veth pair for the
security group bridge (qvb), and the bridge end of the tap for the VM (tap).

I can use SSH to access the VM because every component between my host and
the VM supports at least a 1500 MTU. So, let's configure the VM network
interface to use the proper MTU of 9000 minus the VXLAN protocol overhead
of 50 bytes... 8950... and try SSH again.

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast qlen 1000
    link/ether fa:16:3e:ea:22:3a brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
    inet6 fd00:100:52:1:f816:3eff:feea:223a/64 scope global dynamic
       valid_lft 86396sec preferred_lft 14396sec
    inet6 fe80::f816:3eff:feea:223a/64 scope link
       valid_lft forever preferred_lft forever

Contrary to the Linux bridge experiment, I can still use SSH to access the
VM. Why?
Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the
maximum for a VXLAN segment with 8950 MTU.

# ping -c 1 -s 8922 -M do 10.100.52.102
PING 10.100.52.102 (10.100.52.102) 8922(8950) bytes of data.
From 10.100.52.102 icmp_seq=1 Frag needed and DF set (mtu = 1500)

--- 10.100.52.102 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping6 -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:feea:223a
PING fd00:100:52:1:f816:3eff:feea:223a(fd00:100:52:1:f816:3eff:feea:223a) 8902 data bytes
From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=1500

--- fd00:100:52:1:f816:3eff:feea:223a ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

Look at the tcpdump output [3]. The router namespace, operating at
layer-3, sees the MTU discrepancy between the inbound packet and the
neutron router gateway on the public network and returns an ICMP
"fragmentation needed" or "packet too big" message to the sender. The
sender uses the MTU value in the ICMP packet to recalculate the length of
the first packet and caches it for future packets.

[3] https://gist.github.com/ionosphere80/4e1389a34fd3a628b294

Although PMTUD enables communication between my host and the VM, it limits
MTU to 1500 regardless of the MTU between the router namespace and VM and
therefore could impact performance on 10 Gbps or faster networks. Also, it
does not address the MTU disparity between a VM and network components on
the compute node. If a VM uses a 1500 or smaller MTU, it cannot send
packets that exceed the MTU of the tap interface, veth pairs, and bridge
on the compute node. In this situation, which seems fairly typical for
operators trying to work around MTU problems, communication between a host
(outside of OpenStack) and a VM always works. However, what if a VM uses
an MTU larger than 1500 and attempts to send a large packet? The bridge or
veth pairs would drop it because of the MTU disparity.
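The payload sizes in the pings above follow directly from header arithmetic; a sketch of the derivation (standard header sizes: IPv4 20 bytes, IPv6 40 bytes, ICMP/ICMPv6 echo 8 bytes):

```shell
# ping -s takes the ICMP payload size, so the largest payload that fits
# a link MTU is the MTU minus the IP header and the ICMP echo header.
link_mtu=8950   # VM interface MTU on a VXLAN network over a 9000 underlay

icmp4_payload=$((link_mtu - 20 - 8))   # IPv4: 8922, as used with ping -M do
icmp6_payload=$((link_mtu - 40 - 8))   # IPv6: 8902, as used with ping6
echo "IPv4 payload: ${icmp4_payload}"
echo "IPv6 payload: ${icmp6_payload}"
```

The same arithmetic gives the 1472/1452 payloads one would use to probe a standard 1500-byte path.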
Using observations from the Linux bridge experiment, let's configure the
MTU of the interfaces in the router namespace to match the interfaces
outside of the namespace. The public network (gateway) interface MTU
becomes 9000 and the private network router interfaces (IPv4 and IPv6)
become 8950.

31: qr-d744191c-9d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:34:67:40 brd ff:ff:ff:ff:ff:ff
32: qr-ae54b450-b4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:d4:f1:63 brd ff:ff:ff:ff:ff:ff
33: qg-e3303f07-e7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:70:09:54 brd ff:ff:ff:ff:ff:ff

Let's ping again with a payload size of 8922 for IPv4, the maximum for a
VXLAN segment with 8950 MTU, and look at the tcpdump output [4]. For
brevity, I'm only showing IPv4 because IPv6 provides similar results.

# ping -c 1
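The namespace adjustments described above can be made with commands along these lines. This is a sketch, not Neutron's own code path: the qrouter namespace name is a placeholder, the interface names are taken from the listing above purely for illustration, and the commands require root on the network node:

```shell
# Match the router's interfaces to the MTUs outside the namespace:
# 9000 on the public (gateway) side, 9000 - 50 = 8950 on the VXLAN side.
ROUTER=qrouter-<uuid>   # substitute the actual qrouter namespace name

ip netns exec "$ROUTER" ip link set dev qg-e3303f07-e7 mtu 9000
ip netns exec "$ROUTER" ip link set dev qr-d744191c-9d mtu 8950
ip netns exec "$ROUTER" ip link set dev qr-ae54b450-b4 mtu 8950
```

`ip netns list` shows the available namespaces if the router UUID is not at hand.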
Re: [openstack-dev] [Neutron] MTU configuration pain
On 23 January 2016 at 11:27, Adam Lawson wrote:
> For the sake of over-simplification, is there ever a reason to NOT enable
> jumbo frames in a cloud/SDN context where most of the traffic is between
> virtual elements that all support it? I understand that some switches do
> not support it and traffic from the web doesn't support it either but
> besides that, seems like a default "jumboframes = 1" concept would work
> just fine to me.

Offhand:

1. you don't want the latency increase that comes with 9000 byte packets,
even if it's tiny (bearing in mind that in a link shared between tenants
it affects everyone when one packet holds the line for 6 times longer)
2. not every switch in the world is going to (a) be configurable or (b)
pass 9000 byte packets
3. not every VM has a configurable MTU that you can set on boot, or
supports jumbo frames, and someone somewhere will try and run one of
those VMs
4. when you're using provider networks, not every device attached to the
cloud has a 9000 MTU (and this one's interesting, in fact, because it
points to the other element the MTU spec was addressing: that *not all
networks, even in Neutron, will have the same MTU*).
5. similarly, if you have an external network in OpenStack, and you're
using VXLAN, the MTU of the external network is almost certainly 50 bytes
bigger than that of the inside of the VXLAN overlays, so no one number
can ever be right for every network in Neutron.

Also, I say 9000, but why is 9000 even the right number? We need a
number... and 'jumbo' is not a number. I know devices that will let you
transmit 9200 byte packets. Conversely, if the native L2 is 9000 bytes,
then the MTU in a Neutron virtual network is less than 9000 - so what MTU
do you want to offer your applications? If your apps don't care, why not
tell them what MTU they're getting (e.g. 1450) and be done with it?
(Memory says that the old problem with that was that github had problems
with PMTUD in that circumstance, but I don't know if that's still true,
and even if it is, it's not technically our problem.)

Per the spec, I would like to see us do the remaining fixes to make that
work as intended - largely 'tell the VMs what they're getting' - and
then, as others have said, lay out simple options for deployments, be
they jumbo frame or otherwise.

If you're seeing MTU related problems at this point, can you file bugs on
them and/or report back the bugs here, so that we can see what we're
actually facing?
--
Ian.
Re: [openstack-dev] [Neutron] MTU configuration pain
I believe the issue is that the default is unspecified, which leads to
nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500,
which leads to a catastrophe when running on an overlay on a 1500
underlay.

On Jan 24, 2016 20:48, "Ian Wells" wrote:
> If you're seeing MTU related problems at this point, can you file bugs
> on them and/or report back the bugs here, so that we can see what we're
> actually facing?
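The "catastrophe" of an overlay on a 1500 underlay is simple arithmetic; a sketch using the 50-byte VXLAN overhead cited in this thread:

```shell
# A VM that believes its MTU is 1500 emits a full-size frame; VXLAN
# encapsulation adds 50 bytes, so the resulting outer packet no longer
# fits a 1500-byte underlay and is dropped (or fragmented, with the
# pathologies discussed elsewhere in this thread).
vm_mtu=1500
vxlan_overhead=50
underlay_mtu=1500

outer=$((vm_mtu + vxlan_overhead))
if [ "$outer" -gt "$underlay_mtu" ]; then
    echo "outer packet ${outer} > underlay ${underlay_mtu}: does not fit"
fi

# Advertising (underlay - overhead) via DHCP/RA avoids the problem.
echo "safe instance MTU: $((underlay_mtu - vxlan_overhead))"  # 1450
```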
Re: [openstack-dev] [Neutron] MTU configuration pain
I wrote the spec for the MTU work that's in the Neutron API today. It
haunts my nightmares. I learned so many nasty corner cases for MTU, and
you're treading that same dark path.

I'd first like to point out a few things that change the implications of
what you're reporting in strange ways. [1] points out even more strange
ways, but these are the notable ones from what I've been reading here...

RFC 7348: "VTEPs MUST NOT fragment VXLAN packets. ... The destination
VTEP MAY silently discard such VXLAN fragments." The VXLAN VTEP
implementations we use today may fragment, but it's not according to the
RFC, and I wouldn't rely that every implementation you come across knows
to do it. So, the largest L2 packet you can send over VXLAN is a function
of path MTU.

Even if VXLAN is fragmenting, you actually actively want to avoid it
fragmenting, because - in the typical case of bulk TCP transfers using
max-MTU packets - you're *invisibly* fragmenting the packets into two and
adding about 80 bytes of overhead in the process and then reassembling
them at the far end. You've just explicitly guaranteed that, just as you
send the most data, your connection will slow down. And the MTU problem
will be undetectable to the VMs (which can't find out that a
VXLAN-encapped packet has been fragmented; the packet *they* sent didn't
fragment, but the one it's carried in did, not to mention the
fragmentation didn't even happen at an L3 node in the virtual network so
DF and therefore PMTUD wouldn't work).

Path MTU is not fixed, because your path can vary according to network
weather (failures, congestion, whatever). It's an oddity, and perhaps a
rarity, but you can get many weirdnesses: you fail over from one link to
a link with a smaller MTU and the path MTU shrinks; some switches are
jumbo frame and some aren't, so the path MTU might vary from host to
host; and so on. Granted, these are weird cases, but the point here is
that OpenStack cannot *discover* this number. An installer might attempt
something, knowing how to read switch config; or it might attempt to
validate a number it's been given, as best it can; but even then it's
best effort, it's not a guarantee. For all these reasons, the only way to
really get the minimum path MTU is from the operator themselves, which is
why this is a configuration parameter to Neutron (path_mtu).

The aims of the changes in the spec [1] were threefold:

1. To ensure that an app that absolutely required a certain minimum MTU
to operate could guarantee it would receive it
2. To allow the network to say what the MTU was, so that the VM could be
programmed accordingly
3. To ensure that the MTU for the network would - by default - settle on
the optimal value, per all the stuff above.

So what could we do in this environment to improve matters?

1. We should advertise MTU in the RA and DHCP messages that OpenStack
sends. I thought we'd already done this work, but this thread suggests
not. [Note, though, that you can't reliably set an MTU higher than 1500
on IPv6 using an RA, thanks to RFC 4861 referencing RFC 2464, which goes
with the standard, but not the practice, that the biggest Ethernet packet
is 1500 bytes. You've been violating the standard all these years, you
bad people. Unfortunately, Linux enforces this RA rule, albeit slightly
strangely.]
2. We should also put the MTU in any config-drive settings for VMs that
don't respect such things in DHCP and RAs, or don't do DHCP. This is
Nova-side, reacting to the MTU property of the network.
3. Installers should determine the appropriate MTU settings on interfaces
and ensure they're set. OpenStack can't do this in some cases (VXLAN - no
interfaces) - and probably shouldn't in others (VLAN - the interface MTU
is input to the MTU selection algorithm above, and the installer should
set the interface MTU to match what the operator says the fabric MTU is).
4. We need to check the Neutron network drivers to see which ones are
accepting, but not properly respecting, the MTU setting on the network. I
suspect we're short of testing to make sure that veths, bridges, switches
and so on are all correctly configured.
--
Ian.

[1] https://review.openstack.org/#/c/105989/ and
https://github.com/openstack/neutron-specs/blob/master/specs/kilo/mtu-selection-and-advertisement.rst

On 22 January 2016 at 19:13, Matt Kassawara wrote:
> The fun continues, now using an OpenStack deployment on physical
> hardware that supports jumbo frames with 9000 MTU and IPv4/IPv6. This
> experiment still uses Linux bridge for consistency. I'm planning to run
> similar experiments with Open vSwitch and Open Virtual Network (OVN) in
> the next week.
>
> I highly recommend reading further, but here's the TL;DR: Using
> physical network interfaces with MTUs larger than 1500 reveals an
> additional problem with veth pair for the neutron router interface on
> the public network. Additionally, IP protocol version does not impact
> MTU calculation for
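On point 1 of the "what could we do" list, the DHCP half of "advertise MTU" comes down to the DHCP server handing out option 26 (interface MTU). Neutron's DHCP agent drives dnsmasq; a minimal standalone sketch of the same idea, not Neutron's actual invocation (the interface name and address range are illustrative):

```shell
# dnsmasq can advertise an MTU to DHCP clients via option 26, which
# dnsmasq exposes as the named option "mtu". Requires root and a real
# interface; shown here only to illustrate the mechanism.
dnsmasq --no-daemon \
        --interface=tap-example \
        --dhcp-range=10.0.0.10,10.0.0.100 \
        --dhcp-option=option:mtu,1450
```

A client that honors option 26 will then set its interface MTU to 1450 on lease acquisition, which is exactly the behavior the thread wants by default on VXLAN networks.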
Re: [openstack-dev] [Neutron] MTU configuration pain
At a minimum I think we should pick a default in devstack and dump a warning in neutron if operators don't specify it. It would still be preferable to change the default even though it's a behavior change, considering the current behavior is annoying. :) On Jan 24, 2016 23:31, "Ian Wells" wrote: > On 24 January 2016 at 22:12, Kevin Benton wrote: > >> >The reason for that was in the other half of the thread - it's not >> possible to magically discover these things from within Openstack's own >> code because the relevant settings span more than just one server >> >> IMO it's better to have a default of 1500 rather than let VMs >> automatically default to 1500 because at least we will deduct the encap >> header length when necessary in the dhcp/ra advertised value so overlays >> work on standard 1500 MTU networks. >> >> In other words, our current empty default is realistically a terrible >> default of 1500 that doesn't account for network segmentation overhead. >> > It's pretty clear that, while the current setup is precisely the old > behaviour (backward compatibility, y'know?), it's not very useful. Problem > is, anyone using the 1550+hacks and other methods of today will find that their > system changes behaviour if we started setting that specific default. > > Regardless, we need to take that documentation and update it. It was a > nasty hack back in the day and not remotely a good idea now. > > > >> On Jan 24, 2016 23:00, "Ian Wells" wrote: >> >>> On 24 January 2016 at 20:18, Kevin Benton wrote: I believe the issue is that the default is unspecified, which leads to nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500, which leads to a catastrophe when running on an overlay on a 1500 underlay. 
>>> That's not quite the point I was making here, but to answer that: looks >>> to me like (for the LB or OVS drivers to appropriately set the network MTU >>> for the virtual network, at which point it will be advertised because >>> advertise_mtu defaults to True in the code) you *must* set one or more of >>> path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu >>> (for L2 overlays with differing MTUs on different physical networks). >>> That's a statement of faith - I suspect if we try it we'll find a few >>> niggling problems - but I can find the code, at least. >>> >>> The reason for that was in the other half of the thread - it's not >>> possible to magically discover these things from within Openstack's own >>> code because the relevant settings span more than just one server. They >>> have to line up with both your MTU settings for the interfaces in use, and >>> the MTU settings for the other equipment within and neighbouring the cloud >>> - switches, routers, nexthops. So they have to be provided by the operator >>> - then everything you want should kick in. >>> >>> If all of that is true, it really is just a documentation problem - we >>> have the idea in place, we're just not telling people how to make use of >>> it. We can also include a checklist or a check script with that >>> documentation - you might not be able to deduce the MTU values, but you can >>> certainly run some checks to see if the values you have been given are >>> obviously wrong. >>> >>> In the meantime, Matt K, you said you hadn't set path_mtu in your tests, >>> but [1] says you have to ([1] is far from end-user consumable >>> documentation, which again illustrates our problem). >>> >>> Can you set both path_mtu and segment_mtu to whatever value your switch >>> MTU is (1500 or 9000), confirm your outbound interface MTU is the same >>> (1500 or 9000), and see if that changes things? 
At this point, you should >>> find that your networks get appropriate 1500/9000 MTUs on VLAN based >>> networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to >>> your VMs via DHCP and RA, and that your routers even know that different >>> interfaces have different MTUs in a mixed environment, at least if >>> everything is working as intended. >>> -- >>> Ian. >>> >>> [1] >>> https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5 >>> >>> On Jan 24, 2016 20:48, "Ian Wells" wrote: > On 23 January 2016 at 11:27, Adam Lawson wrote: > >> For the sake of over-simplification, is there ever a reason to NOT >> enable jumbo frames in a cloud/SDN context where most of the traffic is >> between virtual elements that all support it? I understand that some >> switches do not support it and traffic from the web doesn't support it >> either but besides that, seems like a default "jumboframes = 1" concept >> would work just fine to me. >> > > Offhand: > > 1. you don't want the latency increase that comes
Re: [openstack-dev] [Neutron] MTU configuration pain
On 24 January 2016 at 20:18, Kevin Benton wrote: > I believe the issue is that the default is unspecified, which leads to > nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500, > which leads to a catastrophe when running on an overlay on a 1500 underlay. > That's not quite the point I was making here, but to answer that: looks to me like (for the LB or OVS drivers to appropriately set the network MTU for the virtual network, at which point it will be advertised because advertise_mtu defaults to True in the code) you *must* set one or more of path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu (for L2 overlays with differing MTUs on different physical networks). That's a statement of faith - I suspect if we try it we'll find a few niggling problems - but I can find the code, at least. The reason for that was in the other half of the thread - it's not possible to magically discover these things from within Openstack's own code because the relevant settings span more than just one server. They have to line up with both your MTU settings for the interfaces in use, and the MTU settings for the other equipment within and neighbouring the cloud - switches, routers, nexthops. So they have to be provided by the operator - then everything you want should kick in. If all of that is true, it really is just a documentation problem - we have the idea in place, we're just not telling people how to make use of it. We can also include a checklist or a check script with that documentation - you might not be able to deduce the MTU values, but you can certainly run some checks to see if the values you have been given are obviously wrong. In the meantime, Matt K, you said you hadn't set path_mtu in your tests, but [1] says you have to ([1] is far from end-user consumable documentation, which again illustrates our problem). 
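The arithmetic behind the operator-supplied settings can be sketched as follows. This is an illustration only, not Neutron's actual driver code - the helper name is made up, and the only overhead figure assumed is VXLAN's 50 bytes:

```python
# Illustrative sketch, not Neutron's implementation: the MTU advertised
# to VMs is the operator-supplied fabric MTU, minus the encapsulation
# overhead when the tenant network rides an overlay.
VXLAN_OVERHEAD = 50  # outer Ethernet 14 + IPv4 20 + UDP 8 + VXLAN 8 bytes

def advertised_mtu(fabric_mtu, network_type):
    """MTU to hand out via DHCP option 26 and the IPv6 RA MTU option."""
    if network_type == 'vxlan':
        return fabric_mtu - VXLAN_OVERHEAD
    # VLAN and flat networks carry tenant frames natively: no deduction.
    return fabric_mtu

print(advertised_mtu(1500, 'vxlan'))  # 1450
print(advertised_mtu(9000, 'vxlan'))  # 8950
print(advertised_mtu(9000, 'vlan'))   # 9000
```

The operator supplies the fabric-side number (path_mtu/segment_mtu/physnet_mtu); the deduction per network type is the part Neutron can do for itself.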
Can you set both path_mtu and segment_mtu to whatever value your switch MTU is (1500 or 9000), confirm your outbound interface MTU is the same (1500 or 9000), and see if that changes things? At this point, you should find that your networks get appropriate 1500/9000 MTUs on VLAN based networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to your VMs via DHCP and RA, and that your routers even know that different interfaces have different MTUs in a mixed environment, at least if everything is working as intended. -- Ian. [1] https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5 > On Jan 24, 2016 20:48, "Ian Wells" wrote: > >> On 23 January 2016 at 11:27, Adam Lawson wrote: >> >>> For the sake of over-simplification, is there ever a reason to NOT >>> enable jumbo frames in a cloud/SDN context where most of the traffic is >>> between virtual elements that all support it? I understand that some >>> switches do not support it and traffic from the web doesn't support it >>> either but besides that, seems like a default "jumboframes = 1" concept >>> would work just fine to me. >>> >> >> Offhand: >> >> 1. you don't want the latency increase that comes with 9000 byte packets, >> even if it's tiny (bearing in mind that in a link shared between tenants it >> affects everyone when one packet holds the line for 6 times longer) >> 2. not every switch in the world is going to (a) be configurable or (b) >> pass 9000 byte packets >> 3. not every VM has a configurable MTU that you can set on boot, or >> supports jumbo frames, and someone somewhere will try and run one of those >> VMs >> 4. when you're using provider networks, not every device attached to the >> cloud has a 9000 MTU (and this one's interesting, in fact, because it >> points to the other element the MTU spec was addressing, that *not all >> networks, even in Neutron, will have the same MTU*). >> 5. 
similarly, if you have an external network in Openstack, and you're >> using VXLAN, the MTU of the external network is almost certainly 50 bytes >> bigger than that of the inside of the VXLAN overlays, so no one number can >> ever be right for every network in Neutron. >> >> Also, I say 9000, but why is 9000 even the right number? We need a >> number... and 'jumbo' is not a number. I know devices that will let you >> transmit 9200 byte packets. Conversely, if the native L2 is 9000 bytes, >> then the MTU in a Neutron virtual network is less than 9000 - so what MTU >> do you want to offer your applications? If your apps don't care, why not >> tell them what MTU they're getting (e.g. 1450) and be done with it? >> (Memory says that the old problem with that was that github had problems >> with PMTUD in that circumstance, but I don't know if that's still true, and >> even if it is it's not technically our problem.) >> >> Per the spec, I would like to see us do the remaining fixes to make that >> work as intended - largely 'tell the VMs
Re: [openstack-dev] [Neutron] MTU configuration pain
>The reason for that was in the other half of the thread - it's not possible to magically discover these things from within Openstack's own code because the relevant settings span more than just one server IMO it's better to have a default of 1500 rather than let VMs automatically default to 1500 because at least we will deduct the encap header length when necessary in the dhcp/ra advertised value so overlays work on standard 1500 MTU networks. In other words, our current empty default is realistically a terrible default of 1500 that doesn't account for network segmentation overhead. On Jan 24, 2016 23:00, "Ian Wells" wrote: > On 24 January 2016 at 20:18, Kevin Benton wrote: > >> I believe the issue is that the default is unspecified, which leads to >> nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500, >> which leads to a catastrophe when running on an overlay on a 1500 underlay. >> > That's not quite the point I was making here, but to answer that: looks to > me like (for the LB or OVS drivers to appropriately set the network MTU for > the virtual network, at which point it will be advertised because > advertise_mtu defaults to True in the code) you *must* set one or more of > path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu > (for L2 overlays with differing MTUs on different physical networks). > That's a statement of faith - I suspect if we try it we'll find a few > niggling problems - but I can find the code, at least. > > The reason for that was in the other half of the thread - it's not > possible to magically discover these things from within Openstack's own > code because the relevant settings span more than just one server. They > have to line up with both your MTU settings for the interfaces in use, and > the MTU settings for the other equipment within and neighbouring the cloud > - switches, routers, nexthops. So they have to be provided by the operator > - then everything you want should kick in. 
> > If all of that is true, it really is just a documentation problem - we > have the idea in place, we're just not telling people how to make use of > it. We can also include a checklist or a check script with that > documentation - you might not be able to deduce the MTU values, but you can > certainly run some checks to see if the values you have been given are > obviously wrong. > > In the meantime, Matt K, you said you hadn't set path_mtu in your tests, > but [1] says you have to ([1] is far from end-user consumable > documentation, which again illustrates our problem). > > Can you set both path_mtu and segment_mtu to whatever value your switch > MTU is (1500 or 9000), confirm your outbound interface MTU is the same > (1500 or 9000), and see if that changes things? At this point, you should > find that your networks get appropriate 1500/9000 MTUs on VLAN based > networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to > your VMs via DHCP and RA, and that your routers even know that different > interfaces have different MTUs in a mixed environment, at least if > everything is working as intended. > -- > Ian. > > [1] > https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5 > > >> On Jan 24, 2016 20:48, "Ian Wells" wrote: >> >>> On 23 January 2016 at 11:27, Adam Lawson wrote: For the sake of over-simplification, is there ever a reason to NOT enable jumbo frames in a cloud/SDN context where most of the traffic is between virtual elements that all support it? I understand that some switches do not support it and traffic from the web doesn't support it either but besides that, seems like a default "jumboframes = 1" concept would work just fine to me. >>> >>> Offhand: >>> >>> 1. 
you don't want the latency increase that comes with 9000 byte >>> packets, even if it's tiny (bearing in mind that in a link shared between >>> tenants it affects everyone when one packet holds the line for 6 times >>> longer) >>> 2. not every switch in the world is going to (a) be configurable or (b) >>> pass 9000 byte packets >>> 3. not every VM has a configurable MTU that you can set on boot, or >>> supports jumbo frames, and someone somewhere will try and run one of those >>> VMs >>> 4. when you're using provider networks, not every device attached to the >>> cloud has a 9000 MTU (and this one's interesting, in fact, because it >>> points to the other element the MTU spec was addressing, that *not all >>> networks, even in Neutron, will have the same MTU*). >>> 5. similarly, if you have an external network in Openstack, and you're >>> using VXLAN, the MTU of the external network is almost certainly 50 bytes >>> bigger than that of the inside of the VXLAN overlays, so no one number can >>> ever be right for every network in Neutron. >>>
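The latency point at the top of that list can be quantified with a little serialization-delay arithmetic. A rough back-of-envelope sketch - the 10 GbE link speed is just an assumed example figure, not anything from the thread's deployments:

```python
# Rough serialization-delay arithmetic behind the latency concern above:
# a 9000-byte frame occupies a shared link six times longer than a
# 1500-byte one, regardless of link speed.
def serialization_us(frame_bytes, link_gbps):
    """Microseconds a frame occupies the wire at the given link speed."""
    return frame_bytes * 8 / (link_gbps * 1000.0)

standard = serialization_us(1500, 10)  # 1.2 us on 10 GbE
jumbo = serialization_us(9000, 10)     # 7.2 us on 10 GbE
print(jumbo / standard)  # ~6: one jumbo frame "holds the line" 6x longer
```

Tiny in absolute terms, but on a link shared between tenants every other flow waits behind each jumbo frame.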
Re: [openstack-dev] [Neutron] MTU configuration pain
I like both of those ideas. On 24 January 2016 at 22:37, Kevin Benton wrote: > At a minimum I think we should pick a default in devstack and dump a > warning in neutron if operators don't specify it. > > It would still be preferable to change the default even though it's a > behavior change, considering the current behavior is annoying. :) > On Jan 24, 2016 23:31, "Ian Wells" wrote: > >> On 24 January 2016 at 22:12, Kevin Benton wrote: >> >>> >The reason for that was in the other half of the thread - it's not >>> possible to magically discover these things from within Openstack's own >>> code because the relevant settings span more than just one server >>> >>> IMO it's better to have a default of 1500 rather than let VMs >>> automatically default to 1500 because at least we will deduct the encap >>> header length when necessary in the dhcp/ra advertised value so overlays >>> work on standard 1500 MTU networks. >>> >>> In other words, our current empty default is realistically a terrible >>> default of 1500 that doesn't account for network segmentation overhead. >>> >> It's pretty clear that, while the current setup is precisely the old >> behaviour (backward compatibility, y'know?), it's not very useful. Problem >> is, anyone using the 1550+hacks and other methods of today will find that their >> system changes behaviour if we started setting that specific default. >> >> Regardless, we need to take that documentation and update it. It was a >> nasty hack back in the day and not remotely a good idea now. >> >> >> >>> On Jan 24, 2016 23:00, "Ian Wells" wrote: >>> On 24 January 2016 at 20:18, Kevin Benton wrote: > I believe the issue is that the default is unspecified, which leads to > nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500, > which leads to a catastrophe when running on an overlay on a 1500 > underlay. 
> That's not quite the point I was making here, but to answer that: looks to me like (for the LB or OVS drivers to appropriately set the network MTU for the virtual network, at which point it will be advertised because advertise_mtu defaults to True in the code) you *must* set one or more of path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu (for L2 overlays with differing MTUs on different physical networks). That's a statement of faith - I suspect if we try it we'll find a few niggling problems - but I can find the code, at least. The reason for that was in the other half of the thread - it's not possible to magically discover these things from within Openstack's own code because the relevant settings span more than just one server. They have to line up with both your MTU settings for the interfaces in use, and the MTU settings for the other equipment within and neighbouring the cloud - switches, routers, nexthops. So they have to be provided by the operator - then everything you want should kick in. If all of that is true, it really is just a documentation problem - we have the idea in place, we're just not telling people how to make use of it. We can also include a checklist or a check script with that documentation - you might not be able to deduce the MTU values, but you can certainly run some checks to see if the values you have been given are obviously wrong. In the meantime, Matt K, you said you hadn't set path_mtu in your tests, but [1] says you have to ([1] is far from end-user consumable documentation, which again illustrates our problem). Can you set both path_mtu and segment_mtu to whatever value your switch MTU is (1500 or 9000), confirm your outbound interface MTU is the same (1500 or 9000), and see if that changes things? 
At this point, you should find that your networks get appropriate 1500/9000 MTUs on VLAN based networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to your VMs via DHCP and RA, and that your routers even know that different interfaces have different MTUs in a mixed environment, at least if everything is working as intended. -- Ian. [1] https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5 > On Jan 24, 2016 20:48, "Ian Wells" wrote: > >> On 23 January 2016 at 11:27, Adam Lawson wrote: >> >>> For the sake of over-simplification, is there ever a reason to NOT >>> enable jumbo frames in a cloud/SDN context where most of the traffic is >>> between virtual elements that all support it? I understand that some >>> switches do not support it and traffic from the web doesn't support it
Re: [openstack-dev] [Neutron] MTU configuration pain
On 24 January 2016 at 22:12, Kevin Benton wrote: > >The reason for that was in the other half of the thread - it's not > possible to magically discover these things from within Openstack's own > code because the relevant settings span more than just one server > > IMO it's better to have a default of 1500 rather than let VMs > automatically default to 1500 because at least we will deduct the encap > header length when necessary in the dhcp/ra advertised value so overlays > work on standard 1500 MTU networks. > > In other words, our current empty default is realistically a terrible > default of 1500 that doesn't account for network segmentation overhead. > It's pretty clear that, while the current setup is precisely the old behaviour (backward compatibility, y'know?), it's not very useful. Problem is, anyone using the 1550+hacks and other methods of today will find that their system changes behaviour if we started setting that specific default. Regardless, we need to take that documentation and update it. It was a nasty hack back in the day and not remotely a good idea now. > On Jan 24, 2016 23:00, "Ian Wells" wrote: > >> On 24 January 2016 at 20:18, Kevin Benton wrote: >> >>> I believe the issue is that the default is unspecified, which leads to >>> nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500, >>> which leads to a catastrophe when running on an overlay on a 1500 underlay. >>> >> That's not quite the point I was making here, but to answer that: looks >> to me like (for the LB or OVS drivers to appropriately set the network MTU >> for the virtual network, at which point it will be advertised because >> advertise_mtu defaults to True in the code) you *must* set one or more of >> path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu >> (for L2 overlays with differing MTUs on different physical networks). 
>> That's a statement of faith - I suspect if we try it we'll find a few >> niggling problems - but I can find the code, at least. >> >> The reason for that was in the other half of the thread - it's not >> possible to magically discover these things from within Openstack's own >> code because the relevant settings span more than just one server. They >> have to line up with both your MTU settings for the interfaces in use, and >> the MTU settings for the other equipment within and neighbouring the cloud >> - switches, routers, nexthops. So they have to be provided by the operator >> - then everything you want should kick in. >> >> If all of that is true, it really is just a documentation problem - we >> have the idea in place, we're just not telling people how to make use of >> it. We can also include a checklist or a check script with that >> documentation - you might not be able to deduce the MTU values, but you can >> certainly run some checks to see if the values you have been given are >> obviously wrong. >> >> In the meantime, Matt K, you said you hadn't set path_mtu in your tests, >> but [1] says you have to ([1] is far from end-user consumable >> documentation, which again illustrates our problem). >> >> Can you set both path_mtu and segment_mtu to whatever value your switch >> MTU is (1500 or 9000), confirm your outbound interface MTU is the same >> (1500 or 9000), and see if that changes things? At this point, you should >> find that your networks get appropriate 1500/9000 MTUs on VLAN based >> networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to >> your VMs via DHCP and RA, and that your routers even know that different >> interfaces have different MTUs in a mixed environment, at least if >> everything is working as intended. >> -- >> Ian. 
>> >> [1] >> https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5 >> >> >>> On Jan 24, 2016 20:48, "Ian Wells" wrote: >>> On 23 January 2016 at 11:27, Adam Lawson wrote: > For the sake of over-simplification, is there ever a reason to NOT > enable jumbo frames in a cloud/SDN context where most of the traffic is > between virtual elements that all support it? I understand that some > switches do not support it and traffic from the web doesn't support it > either but besides that, seems like a default "jumboframes = 1" concept > would work just fine to me. > Offhand: 1. you don't want the latency increase that comes with 9000 byte packets, even if it's tiny (bearing in mind that in a link shared between tenants it affects everyone when one packet holds the line for 6 times longer) 2. not every switch in the world is going to (a) be configurable or (b) pass 9000 byte packets 3. not every VM has a configurable MTU that you can set on boot, or supports jumbo frames, and someone somewhere will try and
Re: [openstack-dev] [Neutron] MTU configuration pain
Actually, I note that that document is Juno and there doesn't seem to be anything at all in the Liberty guide now, so the answer is probably to add settings for path_mtu and segment_mtu in the recommended Neutron configuration. On 24 January 2016 at 22:26, Ian Wells wrote: > On 24 January 2016 at 22:12, Kevin Benton wrote: > >> >The reason for that was in the other half of the thread - it's not >> possible to magically discover these things from within Openstack's own >> code because the relevant settings span more than just one server >> >> IMO it's better to have a default of 1500 rather than let VMs >> automatically default to 1500 because at least we will deduct the encap >> header length when necessary in the dhcp/ra advertised value so overlays >> work on standard 1500 MTU networks. >> >> In other words, our current empty default is realistically a terrible >> default of 1500 that doesn't account for network segmentation overhead. >> > It's pretty clear that, while the current setup is precisely the old > behaviour (backward compatibility, y'know?), it's not very useful. Problem > is, anyone using the 1550+hacks and other methods of today will find that their > system changes behaviour if we started setting that specific default. > > Regardless, we need to take that documentation and update it. It was a > nasty hack back in the day and not remotely a good idea now. > > > >> On Jan 24, 2016 23:00, "Ian Wells" wrote: >> >>> On 24 January 2016 at 20:18, Kevin Benton wrote: I believe the issue is that the default is unspecified, which leads to nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500, which leads to a catastrophe when running on an overlay on a 1500 underlay. 
>>> That's not quite the point I was making here, but to answer that: looks >>> to me like (for the LB or OVS drivers to appropriately set the network MTU >>> for the virtual network, at which point it will be advertised because >>> advertise_mtu defaults to True in the code) you *must* set one or more of >>> path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu >>> (for L2 overlays with differing MTUs on different physical networks). >>> That's a statement of faith - I suspect if we try it we'll find a few >>> niggling problems - but I can find the code, at least. >>> >>> The reason for that was in the other half of the thread - it's not >>> possible to magically discover these things from within Openstack's own >>> code because the relevant settings span more than just one server. They >>> have to line up with both your MTU settings for the interfaces in use, and >>> the MTU settings for the other equipment within and neighbouring the cloud >>> - switches, routers, nexthops. So they have to be provided by the operator >>> - then everything you want should kick in. >>> >>> If all of that is true, it really is just a documentation problem - we >>> have the idea in place, we're just not telling people how to make use of >>> it. We can also include a checklist or a check script with that >>> documentation - you might not be able to deduce the MTU values, but you can >>> certainly run some checks to see if the values you have been given are >>> obviously wrong. >>> >>> In the meantime, Matt K, you said you hadn't set path_mtu in your tests, >>> but [1] says you have to ([1] is far from end-user consumable >>> documentation, which again illustrates our problem). >>> >>> Can you set both path_mtu and segment_mtu to whatever value your switch >>> MTU is (1500 or 9000), confirm your outbound interface MTU is the same >>> (1500 or 9000), and see if that changes things? 
At this point, you should >>> find that your networks get appropriate 1500/9000 MTUs on VLAN based >>> networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to >>> your VMs via DHCP and RA, and that your routers even know that different >>> interfaces have different MTUs in a mixed environment, at least if >>> everything is working as intended. >>> -- >>> Ian. >>> >>> [1] >>> https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5 >>> >>> On Jan 24, 2016 20:48, "Ian Wells" wrote: > On 23 January 2016 at 11:27, Adam Lawson wrote: > >> For the sake of over-simplification, is there ever a reason to NOT >> enable jumbo frames in a cloud/SDN context where most of the traffic is >> between virtual elements that all support it? I understand that some >> switches do not support it and traffic from the web doesn't support it >> either but besides that, seems like a default "jumboframes = 1" concept >> would work just fine to me. >> > > Offhand: > > 1. you don't want the latency increase that comes with 9000 byte >
Re: [openstack-dev] [Neutron] MTU configuration pain
For the sake of over-simplification, is there ever a reason to NOT enable jumbo frames in a cloud/SDN context where most of the traffic is between virtual elements that all support it? I understand that some switches do not support it and traffic from the web doesn't support it either but besides that, seems like a default "jumboframes = 1" concept would work just fine to me. Then again I'm all about making OpenStack easier to consume so my ideas tend to gloss over special use cases with special requirements. *Adam Lawson* AQORN, Inc. 427 North Tatnall Street Ste. 58461 Wilmington, Delaware 19801-2230 Toll-free: (844) 4-AQORN-NOW ext. 101 International: +1 302-387-4660 Direct: +1 916-246-2072 On Fri, Jan 22, 2016 at 7:13 PM, Matt Kassawara wrote: > The fun continues, now using an OpenStack deployment on physical hardware > that supports jumbo frames with 9000 MTU and IPv4/IPv6. This experiment > still uses Linux bridge for consistency. I'm planning to run similar > experiments with Open vSwitch and Open Virtual Network (OVN) in the next > week. > > I highly recommend reading further, but here's the TL;DR: Using physical > network interfaces with MTUs larger than 1500 reveals an additional problem > with the veth pair for the neutron router interface on the public network. > Additionally, IP protocol version does not impact MTU calculation for > Linux bridge. > > First, review the OpenStack bits and resulting network components in the > environment [1]. In the first experiment, public cloud network limitations > prevented truly seeing how Linux bridge (actually the kernel) handles > physical network interfaces with MTUs larger than 1500. In this experiment, > we see that it automatically calculates the proper MTU for bridges and > VXLAN interfaces using the MTU of parent devices. Also, see that a regular > 'ping' works between the host outside of the deployment and the VM [2]. 
> > [1] https://gist.github.com/ionosphere80/a3725066386d8ca4c6d7 > [2] https://gist.github.com/ionosphere80/a8d601a356ac6c6274cb > > Note: The tcpdump output in each case references up to six points: neutron > router gateway on the public network (qg), namespace end of the veth pair > for the neutron router interface on the private network (qr), bridge end of > the veth pair for router interface on the private network (tap), controller > node end of the VXLAN network (underlying interface), compute node end of > the VXLAN network (underlying interface), and the bridge end of the tap for > the VM (tap). > > In the first experiment, SSH "stuck" because of an MTU mismatch on the veth > pair between the router namespace and private network bridge. In this > experiment, SSH works because the VM network interface uses a 1500 MTU and > all devices along the path between the host and VM use a 1500 or larger > MTU. So, let's configure the VM network interface to use the proper MTU of > 9000 minus the VXLAN protocol overhead of 50 bytes... 8950... and try SSH > again. > > 2: eth0: mtu 8950 qdisc pfifo_fast qlen > 1000 > link/ether fa:16:3e:46:ac:d3 brd ff:ff:ff:ff:ff:ff > inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0 > inet6 fd00:100:52:1:f816:3eff:fe46:acd3/64 scope global dynamic >valid_lft 86395sec preferred_lft 14395sec > inet6 fe80::f816:3eff:fe46:acd3/64 scope link >valid_lft forever preferred_lft forever > > SSH doesn't work with IPv4 or IPv6. Adding a slight twist to the first > experiment, I don't even see the large packet traversing the neutron > router gateway on the public network. So, I began a tcpdump closer to the > source on the bridge end of the veth pair for the neutron router > interface on the public network. > > Looking at [3], the veth pair between the router namespace and private > network bridge drops the packet. The MTU changes over a layer-2 connection > without a router, similar to connecting two switches with different MTUs. 
> Even if it could participate in PMTUD, the veth pair lacks an IP address > and therefore cannot originate ICMP messages. > > [3] https://gist.github.com/ionosphere80/ec83d0955c79b05ea381 > > Using observations from the first experiment, let's configure the MTU of > the interfaces in the qrouter namespace to match the other end of their > respective veth pairs. The public network (gateway) interface MTU becomes > 9000 and the private network router interfaces (IPv4 and IPv6) become 8950. > > 2: qr-49b27408-04: mtu 8950 qdisc > pfifo_fast state UP mode DEFAULT group default qlen 1000 > link/ether fa:16:3e:e5:43:1c brd ff:ff:ff:ff:ff:ff > 3: qr-b7e0ef22-32: mtu 8950 qdisc > pfifo_fast state UP mode DEFAULT group default qlen 1000 > link/ether fa:16:3e:16:01:92 brd ff:ff:ff:ff:ff:ff > 4: qg-7bbe8e38-cc: mtu 9000 qdisc > pfifo_fast state UP mode DEFAULT group default qlen 1000 > link/ether
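The arithmetic Matt relies on here is mechanical enough to sketch. This is a toy model for illustration, not neutron code; the 50-byte figure assumes VXLAN over IPv4:

```python
# Toy model of the kernel behavior described above: a VXLAN interface's MTU
# is the parent interface's MTU minus the encapsulation overhead, and a
# Linux bridge adopts the lowest MTU among its ports.
# 50 bytes = outer Ethernet (14) + IPv4 (20) + UDP (8) + VXLAN (8).
VXLAN_OVERHEAD = 50

def vxlan_mtu(parent_mtu):
    # MTU the kernel derives for a VXLAN interface on this parent device
    return parent_mtu - VXLAN_OVERHEAD

def bridge_mtu(port_mtus):
    # A Linux bridge drops to the lowest MTU of all attached ports
    return min(port_mtus)

print(vxlan_mtu(9000))           # the value set on the VM's eth0 above
print(bridge_mtu([9000, 8950]))  # bridge MTU after the VXLAN port joins
```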
Re: [openstack-dev] [Neutron] MTU configuration pain
Adam Lawson wrote on 01/23/2016 02:27:46 PM: > For the sake of over-simplification, is there ever a reason to NOT > enable jumbo frames in a cloud/SDN context where most of the traffic > is between virtual elements that all support it? I understand that > some switches do not support it and traffic from the web doesn't > support it either but besides that, seems like a default > "jumboframes = 1" concept would work just fine to me. > > Then again I'm all about making OpenStack easier to consume so my > ideas tend to gloss over special use cases with special requirements. Regardless of the default, there needs to be clear documentation on what to do for those of us who can not use jumbo frames, and it needs to work. That goes for production deployers and also for developers using DevStack. Thanks, Mike __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Neutron] MTU configuration pain
Adam, Any modern datacenter network, especially those with 10 Gbps or faster connectivity, should support jumbo frames for performance reasons. However, depending on the network infrastructure, jumbo frames do not always mean a 9000 MTU, so neutron should support a configurable value rather than a boolean. I envision one configuration option containing the physical network MTU that neutron uses to calculate the MTU of all virtual network components. Mike... this mechanism should work for any physical network MTU, large or small. Matt On Sat, Jan 23, 2016 at 3:28 PM, Mike Spreitzer wrote: > Adam Lawson wrote on 01/23/2016 02:27:46 PM: > > > For the sake of over-simplification, is there ever a reason to NOT > > enable jumbo frames in a cloud/SDN context where most of the traffic > > is between virtual elements that all support it? I understand that > > some switches do not support it and traffic from the web doesn't > > support it either but besides that, seems like a default > > "jumboframes = 1" concept would work just fine to me. > > > > Then again I'm all about making OpenStack easier to consume so my > > ideas tend to gloss over special use cases with special requirements. > > Regardless of the default, there needs to be clear documentation on what > to do for those of us who can not use jumbo frames, and it needs to work. > That goes for production deployers and also for developers using DevStack. > > Thanks, > Mike > > > > __ > OpenStack Development Mailing List (not for usage questions) > Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev > > __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
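Matt's "one configuration option" idea above can be sketched as a lookup of per-network-type encapsulation overhead. The overhead values are the commonly cited ones for IPv4 transport (VXLAN 50 bytes, GRE 42 bytes); the function name and table are hypothetical, not neutron's actual implementation:

```python
# Sketch of deriving virtual network MTUs from a single physical MTU option.
# Flat and VLAN networks are assumed to add no overhead visible to the
# instance; GRE/VXLAN values assume IPv4 transport.
ENCAP_OVERHEAD = {"flat": 0, "vlan": 0, "gre": 42, "vxlan": 50}

def network_mtu(physical_mtu, network_type):
    # MTU an instance on this network type could safely use
    return physical_mtu - ENCAP_OVERHEAD[network_type]

for net_type in ("flat", "gre", "vxlan"):
    print(net_type, network_mtu(9000, net_type))
```

The same calculation works for any physical MTU, which is Matt's point: `network_mtu(1500, "vxlan")` yields the familiar 1450.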
Re: [openstack-dev] [Neutron] MTU configuration pain
The fun continues, now using an OpenStack deployment on physical hardware that supports jumbo frames with 9000 MTU and IPv4/IPv6. This experiment still uses Linux bridge for consistency. I'm planning to run similar experiments with Open vSwitch and Open Virtual Network (OVN) in the next week. I highly recommend reading further, but here's the TL;DR: Using physical network interfaces with MTUs larger than 1500 reveals an additional problem with the veth pair for the neutron router interface on the public network. Additionally, IP protocol version does not impact MTU calculation for Linux bridge. First, review the OpenStack bits and resulting network components in the environment [1]. In the first experiment, public cloud network limitations prevented truly seeing how Linux bridge (actually the kernel) handles physical network interfaces with MTUs larger than 1500. In this experiment, we see that it automatically calculates the proper MTU for bridges and VXLAN interfaces using the MTU of parent devices. Also, see that a regular 'ping' works between the host outside of the deployment and the VM [2]. [1] https://gist.github.com/ionosphere80/a3725066386d8ca4c6d7 [2] https://gist.github.com/ionosphere80/a8d601a356ac6c6274cb Note: The tcpdump output in each case references up to six points: neutron router gateway on the public network (qg), namespace end of the veth pair for the neutron router interface on the private network (qr), bridge end of the veth pair for router interface on the private network (tap), controller node end of the VXLAN network (underlying interface), compute node end of the VXLAN network (underlying interface), and the bridge end of the tap for the VM (tap). In the first experiment, SSH "stuck" because of an MTU mismatch on the veth pair between the router namespace and private network bridge. In this experiment, SSH works because the VM network interface uses a 1500 MTU and all devices along the path between the host and VM use a 1500 or larger MTU.
So, let's configure the VM network interface to use the proper MTU of 9000 minus the VXLAN protocol overhead of 50 bytes... 8950... and try SSH again. 2: eth0: mtu 8950 qdisc pfifo_fast qlen 1000 link/ether fa:16:3e:46:ac:d3 brd ff:ff:ff:ff:ff:ff inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0 inet6 fd00:100:52:1:f816:3eff:fe46:acd3/64 scope global dynamic valid_lft 86395sec preferred_lft 14395sec inet6 fe80::f816:3eff:fe46:acd3/64 scope link valid_lft forever preferred_lft forever SSH doesn't work with IPv4 or IPv6. Adding a slight twist to the first experiment, I don't even see the large packet traversing the neutron router gateway on the public network. So, I began a tcpdump closer to the source on the bridge end of the veth pair for the neutron router interface on the public network. Looking at [3], the veth pair between the router namespace and private network bridge drops the packet. The MTU changes over a layer-2 connection without a router, similar to connecting two switches with different MTUs. Even if it could participate in PMTUD, the veth pair lacks an IP address and therefore cannot originate ICMP messages. [3] https://gist.github.com/ionosphere80/ec83d0955c79b05ea381 Using observations from the first experiment, let's configure the MTU of the interfaces in the qrouter namespace to match the other end of their respective veth pairs. The public network (gateway) interface MTU becomes 9000 and the private network router interfaces (IPv4 and IPv6) become 8950.
2: qr-49b27408-04: mtu 8950 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 link/ether fa:16:3e:e5:43:1c brd ff:ff:ff:ff:ff:ff 3: qr-b7e0ef22-32: mtu 8950 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 link/ether fa:16:3e:16:01:92 brd ff:ff:ff:ff:ff:ff 4: qg-7bbe8e38-cc: mtu 9000 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 link/ether fa:16:3e:2b:c1:fd brd ff:ff:ff:ff:ff:ff Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the maximum for a VXLAN segment with 8950 MTU, and look at the tcpdump output [4]. For brevity, I'm only showing tcpdump output from the VM tap interface. Ping operates normally. # ping -c 1 -s 8922 -M do 10.100.52.104 # ping -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:fe46:acd3 [4] https://gist.github.com/ionosphere80/85339b587bb9b2693b07 Let's ping with a payload size of 8923 for IPv4 and 8903 for IPv6, one byte larger than the maximum for a VXLAN segment with 8950 MTU. The router namespace, operating at layer-3, sees the MTU discrepancy between the two interfaces in the namespace and returns an ICMP "fragmentation needed" or "packet too big" message to the sender. The sender uses the MTU value in the ICMP packet to recalculate the length of the first packet and caches it for future packets. # ping -c
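The ping payload sizes Matt uses (8922 for IPv4, 8902 for IPv6) follow directly from the header sizes; a quick sketch of the arithmetic:

```python
# Why the maximum ping payloads above are 8922 (IPv4) and 8902 (IPv6):
# the ICMP echo payload is the link MTU minus the IP header and the
# 8-byte ICMP/ICMPv6 echo header. One byte more exceeds the 8950 MTU
# of the VXLAN segment, triggering "fragmentation needed" (IPv4) or
# "packet too big" (IPv6) from the router namespace.
IPV4_HEADER = 20
IPV6_HEADER = 40
ICMP_ECHO_HEADER = 8

def max_ping_payload(mtu, ip_version):
    ip_header = IPV4_HEADER if ip_version == 4 else IPV6_HEADER
    return mtu - ip_header - ICMP_ECHO_HEADER

print(max_ping_payload(8950, 4))  # 8922
print(max_ping_payload(8950, 6))  # 8902
```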
Re: [openstack-dev] [Neutron] MTU configuration pain
On Tue, Jan 19, 2016 at 08:15:18AM EST, Matt Kassawara wrote: > No. However, we ought to determine what happens when both DHCP and RA > advertise it. We'd have to look at the RFCs for how hosts are supposed to behave since IPv6 has a minimum MTU of 1280 bytes while IPv4's minimum mtu is 576 (what is this, an MTU for ants?). -- Sean M. Collins __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Neutron] MTU configuration pain
On 01/20/2016 08:56 AM, Sean M. Collins wrote: On Tue, Jan 19, 2016 at 08:15:18AM EST, Matt Kassawara wrote: No. However, we ought to determine what happens when both DHCP and RA advertise it. We'd have to look at the RFCs for how hosts are supposed to behave since IPv6 has a minimum MTU of 1280 bytes while IPv4's minimum mtu is 576 (what is this, an MTU for ants?). Quibble - 576 is the IPv4 minimum maximum MTU. That is to say a compliant IPv4 implementation must be able to reassemble datagrams of at least 576 bytes. If memory serves, the actual minimum MTU for IPv4 is 68 bytes. rick jones __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Neutron] MTU configuration pain
Maybe it will do the obvious thing and add them together. ;) On Jan 20, 2016 12:03, "Sean M. Collins" wrote: > On Tue, Jan 19, 2016 at 08:15:18AM EST, Matt Kassawara wrote: > > No. However, we ought to determine what happens when both DHCP and RA > > advertise it. > > We'd have to look at the RFCs for how hosts are supposed to behave since > IPv6 has a minimum MTU of 1280 bytes while IPv4's minimum mtu is 576 > (what is this, an MTU for ants?). > > -- > Sean M. Collins > > __ > OpenStack Development Mailing List (not for usage questions) > Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev > __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Neutron] MTU configuration pain
No. However, we ought to determine what happens when both DHCP and RA advertise it. On Tue, Jan 19, 2016 at 12:36 AM, Kevin Bentonwrote: > >Yup. We mostly attempt to do that now. > > Right, but not by default. Can you think of a scenario where advertising > it would be harmful? > On Jan 18, 2016 23:57, "Matt Kassawara" wrote: > >> >> >> On Mon, Jan 18, 2016 at 4:14 PM, Kevin Benton wrote: >> >>> Thanks for the awesome writeup. >>> >>> >5) A bridge or veth pair with an IP address can participate in path >>> MTU discovery (PMTUD). However, these devices do not appear to understand >>> namespaces and originate the ICMP message from the host instead of a >>> namespace. Therefore, the message never reaches the destination... >>> typically a host outside of the deployment. >>> >>> I suspect this is because we don't put the bridges into namespaces. Even >>> if we did do this, we would need to allocate IP addresses for every compute >>> node to use to chat on the network... >>> >> >> Yup. Moving the MTU disparity to the first layer-3 device a packet >> traverses inbound to a VM saves us from burning IPs too. >> >> >>> >>> >>> >>> >At least for the Linux bridge agent, I think we can address ingress >>> MTU disparity (to the VM) by moving it to the first device in the chain >>> capable of layer-3 operations, particularly the neutron router namespace. >>> We can address the egress MTU disparity (from the VM) by advertising the >>> MTU of the overlay network to the VM via DHCP/RA or using manual interface >>> configuration. >>> >>> So when setting up DHCP for the subnet, would telling the DHCP agent to >>> use an MTU we calculate based on (global MTU value - network encap >>> overhead) achieve what you are suggesting here? >>> >> >> Yup. We mostly attempt to do that now. >> >> On Fri, Jan 15, 2016 at 10:41 AM, Sean M. Collins wrote: > MTU has been an ongoing issue in Neutron for _years_. 
> > It's such a hassle, that most people just throw up their hands and set > their physical infrastructure to jumbo frames. We even document it. > > > http://docs.openstack.org/juno/install-guide/install/apt-debian/content/neutron-network-node.html > > > Ideally, you can prevent these problems by enabling jumbo frames on > > the physical network that contains your tenant virtual networks. > Jumbo > > frames support MTUs up to approximately 9000 bytes which negates the > > impact of GRE overhead on virtual networks. > > We've pushed this onto operators and deployers. There's a lot of > code in provisioning projects to handle MTUs. > > http://codesearch.openstack.org/?q=MTU=nope== > > We have mentions of it in our architecture design guide > > > http://git.openstack.org/cgit/openstack/openstack-manuals/tree/doc/arch-design/source/network-focus-architecture.rst#n150 > > I want to get Neutron to the point where it starts discovering this > information and automatically configuring, in the optimistic cases. I > understand that it can be complex and have corner cases, but the issue > we have today is that it is broken in some multinode jobs, even Neutron > developers aren't configuring it correctly. > > I also had this discussion on the DevStack side in > https://review.openstack.org/#/c/112523/ > where basically, sure we can fix it in DevStack and at the gate, but it > doesn't fix the problem for anyone who isn't using DevStack to deploy > their cloud. > > Today we have a ton of MTU configuration options sprinkled throughout > the > L3 agent, dhcp agent, l2 agents, and at least one API extension to the > REST API for handling MTUs. > > So yeah, a lot of knobs and not a lot of documentation on how to make > this thing work correctly. I'd like to try and simplify.
> > > Further reading: > > > http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html > > http://lists.openstack.org/pipermail/openstack/2013-October/001778.html > > > https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/ > > > https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/ > > > http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/ > > https://twitter.com/search?q=openstack%20neutron%20MTU > > -- > Sean M. Collins > > > __ > OpenStack Development Mailing List (not for usage questions) > Unsubscribe: > openstack-dev-requ...@lists.openstack.org?subject:unsubscribe > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev >
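Kevin's suggestion quoted above (advertise "global MTU value - network encap overhead" via DHCP) maps to DHCP option 26, the interface-MTU option, which dnsmasq expresses as `dhcp-option-force`. A sketch of the calculation; the function name is mine, and this is not neutron's actual code path:

```python
# Sketch: what the DHCP agent would hand to dnsmasq if it advertised
# (global MTU - encapsulation overhead) to instances. DHCP option 26
# carries the interface MTU; dnsmasq's syntax for forcing it is
# "dhcp-option-force=26,<mtu>".
VXLAN_OVERHEAD = 50  # bytes, VXLAN over IPv4

def dnsmasq_mtu_option(global_mtu, overhead=VXLAN_OVERHEAD):
    return "dhcp-option-force=26,{}".format(global_mtu - overhead)

print(dnsmasq_mtu_option(1500))  # a 1500-byte physical network
print(dnsmasq_mtu_option(9000))  # a jumbo-frame physical network
```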
Re: [openstack-dev] [Neutron] MTU configuration pain
On Mon, Jan 18, 2016 at 4:14 PM, Kevin Bentonwrote: > Thanks for the awesome writeup. > > >5) A bridge or veth pair with an IP address can participate in path MTU > discovery (PMTUD). However, these devices do not appear to understand > namespaces and originate the ICMP message from the host instead of a > namespace. Therefore, the message never reaches the destination... > typically a host outside of the deployment. > > I suspect this is because we don't put the bridges into namespaces. Even > if we did do this, we would need to allocate IP addresses for every compute > node to use to chat on the network... > Yup. Moving the MTU disparity to the first layer-3 device a packet traverses inbound to a VM saves us from burning IPs too. > > > > >At least for the Linux bridge agent, I think we can address ingress MTU > disparity (to the VM) by moving it to the first device in the chain capable > of layer-3 operations, particularly the neutron router namespace. We can > address the egress MTU disparity (from the VM) by advertising the MTU of > the overlay network to the VM via DHCP/RA or using manual interface > configuration. > > So when setting up DHCP for the subnet, would telling the DHCP agent to > use an MTU we calculate based on (global MTU value - network encap > overhead) achieve what you are suggesting here? > Yup. We mostly attempt to do that now. On Fri, Jan 15, 2016 at 10:41 AM, Sean M. Collins >> wrote: >> >>> MTU has been an ongoing issue in Neutron for _years_. >>> >>> It's such a hassle, that most people just throw up their hands and set >>> their physical infrastructure to jumbo frames. We even document it. >>> >>> >>> http://docs.openstack.org/juno/install-guide/install/apt-debian/content/neutron-network-node.html >>> >>> > Ideally, you can prevent these problems by enabling jumbo frames on >>> > the physical network that contains your tenant virtual networks. 
Jumbo >>> > frames support MTUs up to approximately 9000 bytes which negates the >>> > impact of GRE overhead on virtual networks. >>> >>> We've pushed this onto operators and deployers. There's a lot of >>> code in provisioning projects to handle MTUs. >>> >>> http://codesearch.openstack.org/?q=MTU=nope== >>> >>> We have mentions to it in our architecture design guide >>> >>> >>> http://git.openstack.org/cgit/openstack/openstack-manuals/tree/doc/arch-design/source/network-focus-architecture.rst#n150 >>> >>> I want to get Neutron to the point where it starts discovering this >>> information and automatically configuring, in the optimistic cases. I >>> understand that it can be complex and have corner cases, but the issue >>> we have today is that it is broken in some multinode jobs, even Neutron >>> developers are configuring it correctly. >>> >>> I also had this discussion on the DevStack side in >>> https://review.openstack.org/#/c/112523/ >>> where basically, sure we can fix it in DevStack and at the gate, but it >>> doesn't fix the problem for anyone who isn't using DevStack to deploy >>> their cloud. >>> >>> Today we have a ton of MTU configuration options sprinkled throghout the >>> L3 agent, dhcp agent, l2 agents, and at least one API extension to the >>> REST API for handling MTUs. >>> >>> So yeah, a lot of knobs and not a lot of documentation on how to make >>> this thing work correctly. I'd like to try and simplify. 
>>> >>> >>> Further reading: >>> >>> >>> http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html >>> >>> http://lists.openstack.org/pipermail/openstack/2013-October/001778.html >>> >>> >>> https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/ >>> >>> >>> https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/ >>> >>> >>> http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/ >>> >>> https://twitter.com/search?q=openstack%20neutron%20MTU >>> >>> -- >>> Sean M. Collins >>> >>> >>> __ >>> OpenStack Development Mailing List (not for usage questions) >>> Unsubscribe: >>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe >>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev >>> >> >> >> __ >> OpenStack Development Mailing List (not for usage questions) >> Unsubscribe: >> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev >> >> > > > -- > Kevin Benton > > __ > OpenStack Development Mailing List (not for usage questions) > Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev > >
Re: [openstack-dev] [Neutron] MTU configuration pain
>Yup. We mostly attempt to do that now. Right, but not by default. Can you think of a scenario where advertising it would be harmful? On Jan 18, 2016 23:57, "Matt Kassawara"wrote: > > > On Mon, Jan 18, 2016 at 4:14 PM, Kevin Benton wrote: > >> Thanks for the awesome writeup. >> >> >5) A bridge or veth pair with an IP address can participate in path MTU >> discovery (PMTUD). However, these devices do not appear to understand >> namespaces and originate the ICMP message from the host instead of a >> namespace. Therefore, the message never reaches the destination... >> typically a host outside of the deployment. >> >> I suspect this is because we don't put the bridges into namespaces. Even >> if we did do this, we would need to allocate IP addresses for every compute >> node to use to chat on the network... >> > > Yup. Moving the MTU disparity to the first layer-3 device a packet > traverses inbound to a VM saves us from burning IPs too. > > >> >> >> >> >At least for the Linux bridge agent, I think we can address ingress MTU >> disparity (to the VM) by moving it to the first device in the chain capable >> of layer-3 operations, particularly the neutron router namespace. We can >> address the egress MTU disparity (from the VM) by advertising the MTU of >> the overlay network to the VM via DHCP/RA or using manual interface >> configuration. >> >> So when setting up DHCP for the subnet, would telling the DHCP agent to >> use an MTU we calculate based on (global MTU value - network encap >> overhead) achieve what you are suggesting here? >> > > Yup. We mostly attempt to do that now. > > On Fri, Jan 15, 2016 at 10:41 AM, Sean M. Collins >>> wrote: >>> MTU has been an ongoing issue in Neutron for _years_. It's such a hassle, that most people just throw up their hands and set their physical infrastructure to jumbo frames. We even document it. 
http://docs.openstack.org/juno/install-guide/install/apt-debian/content/neutron-network-node.html > Ideally, you can prevent these problems by enabling jumbo frames on > the physical network that contains your tenant virtual networks. Jumbo > frames support MTUs up to approximately 9000 bytes which negates the > impact of GRE overhead on virtual networks. We've pushed this onto operators and deployers. There's a lot of code in provisioning projects to handle MTUs. http://codesearch.openstack.org/?q=MTU=nope== We have mentions to it in our architecture design guide http://git.openstack.org/cgit/openstack/openstack-manuals/tree/doc/arch-design/source/network-focus-architecture.rst#n150 I want to get Neutron to the point where it starts discovering this information and automatically configuring, in the optimistic cases. I understand that it can be complex and have corner cases, but the issue we have today is that it is broken in some multinode jobs, even Neutron developers are configuring it correctly. I also had this discussion on the DevStack side in https://review.openstack.org/#/c/112523/ where basically, sure we can fix it in DevStack and at the gate, but it doesn't fix the problem for anyone who isn't using DevStack to deploy their cloud. Today we have a ton of MTU configuration options sprinkled throghout the L3 agent, dhcp agent, l2 agents, and at least one API extension to the REST API for handling MTUs. So yeah, a lot of knobs and not a lot of documentation on how to make this thing work correctly. I'd like to try and simplify. 
Further reading: http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html http://lists.openstack.org/pipermail/openstack/2013-October/001778.html https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/ https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/ http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/ https://twitter.com/search?q=openstack%20neutron%20MTU -- Sean M. Collins __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev >>> >>> >>> >>> __ >>> OpenStack Development Mailing List (not for usage questions) >>> Unsubscribe: >>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe >>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev >>> >>> >> >> >> -- >> Kevin Benton >> >>
Re: [openstack-dev] [Neutron] MTU configuration pain
On Sun, Jan 17, 2016 at 8:30 PM, Matt Kassawara wrote: > Prior attempts to solve the MTU problem in neutron simply band-aid it or > become too complex from feature creep or edge cases that mask the primary > goal of a simple implementation that works for most deployments. So, I ran > some experiments to empirically determine the root cause of MTU problems in > common neutron deployments using the Linux bridge agent. I plan to perform > these experiments again using the Open vSwitch agent... after sufficient > mental recovery. > > I highly recommend reading further, but here's the TL;DR: > > Observations... > > 1) During creation of a VXLAN interface, Linux automatically subtracts the > VXLAN protocol overhead from the MTU of the parent interface. > 2) A veth pair or tap with a different MTU on each end drops packets > larger than the smaller MTU. > 3) Linux automatically adjusts the MTU of a bridge to the lowest MTU of > all the ports. Therefore, Linux reduces the typical bridge MTU from 1500 to > 1450 when neutron adds a VXLAN interface to it. > 4) A bridge with different MTUs on each port drops packets larger than the > MTU of the bridge. > 5) A bridge or veth pair with an IP address can participate in path MTU > discovery (PMTUD). However, these devices do not appear to understand > namespaces and originate the ICMP message from the host instead of a > namespace. Therefore, the message never reaches the destination... > typically a host outside of the deployment. > > Conclusion... > > The MTU disparity between native and overlay networks must reside in a > device capable of layer-3 operations that can participate in PMTUD, such as > the neutron router between a private/project overlay network and a > public/external native network. > > Some background... > > In a typical datacenter network, MTU must remain consistent within a > layer-2 network because fragmentation and the mechanism indicating the need > for it occurs at layer-3.
In other words, all host interfaces and switch > ports on the same layer-2 network must use the same MTU. If the layer-2 > network connects to a router, the router port must also use the same MTU. A > router can contain ports on multiple layer-2 networks with different MTUs > because it operates on those networks at layer-3. If the MTU changes > between ports on a router and devices on those layer-2 networks attempt to > communicate at layer-3, the router can perform a couple of actions. For > IPv4, the router can fragment the packet. However, if the packet contains > the "don't fragment" (DF) flag, the router can either silently drop the > packet or return an ICMP "fragmentation needed" message to the sender. This > ICMP message contains the MTU of the next layer-2 network in the route > between the sender and receiver. Each router in the path can return these > ICMP messages to the sender until it learns the maximum MTU for the entire > path, also known as path MTU discovery (PMTUD). IPv6 routers do not support > fragmentation. > > The cloud provides a virtual extension of a physical network. In the > simplest sense, patch cables become veth pairs, switches become bridges, > and routers become namespaces. Therefore, MTU implementation for virtual > networks should mimic physical networks where MTU changes must occur within > a router at layer-3. > > For these experiments, my deployment contains one controller and one > compute node. Neutron uses the ML2 plug-in and Linux bridge agent. The > configuration does not contain any MTU options (e.g., path_mtu). One VM with > a floating IP address attaches to a VXLAN private network that routes to a > flat public network. The DHCP agent does not advertise MTU to the VM. My > lab resides on public cloud infrastructure with networks that filter > unknown MAC addresses such as those that neutron generates for virtual > network components. Let's talk about the implications and workarounds.
> > The VXLAN protocol contains 50 bytes of overhead. Linux automatically > calculates the MTU of VXLAN devices by subtracting 50 bytes from the parent > device, in this case a standard Ethernet interface with a 1500 MTU. > However, due to the limitations of public cloud networks, I must create a > VXLAN tunnel between the controller node and a host outside of the > deployment to simulate traffic from a datacenter network. This tunnel > effectively reduces the "native" MTU from 1500 to 1450. Therefore, I need > to subtract an additional 50 bytes from neutron VXLAN network components, > essentially emulating the 50-byte difference between conventional neutron > VXLAN networks and native networks. The host outside of the deployment > assumes it can send packets using a 1450 MTU. The VM also assumes it can > send packets using a 1450 MTU because the DHCP agent does not advertise a > 1400 MTU to it. > > Let's get to it! > > Note: The commands in these experiments often generate lengthy output, so > please refer to the gists when necessary. >
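The PMTUD behavior Matt describes in the background section can be walked through as a toy simulation: each router that cannot forward a too-large IPv4 DF packet reports the next hop's MTU in its ICMP "fragmentation needed" message, and the sender shrinks and retries until the packet fits the whole path. This is an illustration of the mechanism, not a protocol implementation:

```python
# Toy walk-through of path MTU discovery (PMTUD) for an IPv4 packet with
# the DF flag set: whenever the current packet size exceeds the next
# link's MTU, the router returns ICMP "fragmentation needed" carrying
# that link's MTU, and the sender caches the smaller value and retries.
def discover_path_mtu(initial_mtu, link_mtus):
    mtu = initial_mtu
    for link in link_mtus:
        if mtu > link:  # router can't forward; ICMP reports the link MTU
            mtu = link  # sender lowers its cached path MTU and resends
    return mtu

# Matt's lab: sender assumes 1450, but the neutron VXLAN overlay is 1400.
print(discover_path_mtu(1450, [1450, 1400]))
```

In Matt's experiments the mechanism breaks precisely because the device where the MTU drops (a veth pair or bridge) operates at layer-2 and cannot send that ICMP message, which is why he argues the disparity must live in the router namespace.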
Re: [openstack-dev] [Neutron] MTU configuration pain
Thanks for the awesome writeup. >5) A bridge or veth pair with an IP address can participate in path MTU discovery (PMTUD). However, these devices do not appear to understand namespaces and originate the ICMP message from the host instead of a namespace. Therefore, the message never reaches the destination... typically a host outside of the deployment. I suspect this is because we don't put the bridges into namespaces. Even if we did do this, we would need to allocate IP addresses for every compute node to use to chat on the network... >At least for the Linux bridge agent, I think we can address ingress MTU disparity (to the VM) by moving it to the first device in the chain capable of layer-3 operations, particularly the neutron router namespace. We can address the egress MTU disparity (from the VM) by advertising the MTU of the overlay network to the VM via DHCP/RA or using manual interface configuration. So when setting up DHCP for the subnet, would telling the DHCP agent to use an MTU we calculate based on (global MTU value - network encap overhead) achieve what you are suggesting here? On Sun, Jan 17, 2016 at 10:30 PM, Matt Kassawarawrote: > Prior attempts to solve the MTU problem in neutron simply band-aid it or > become too complex from feature creep or edge cases that mask the primary > goal of a simple implementation that works for most deployments. So, I ran > some experiments to empirically determine the root cause of MTU problems in > common neutron deployments using the Linux bridge agent. I plan to perform > these experiments again using the Open vSwitch agent... after sufficient > mental recovery. > > I highly recommend reading further, but here's the TL;DR: > > Observations... > > 1) During creation of a VXLAN interface, Linux automatically subtracts the > VXLAN protocol overhead from the MTU of the parent interface. > 2) A veth pair or tap with a different MTU on each end drops packets > larger than the smaller MTU. 
> 3) Linux automatically adjusts the MTU of a bridge to the lowest MTU of > all the ports. Therefore, Linux reduces the typical bridge MTU from 1500 to > 1450 when neutron adds a VXLAN interface to it. > 4) A bridge with different MTUs on each port drops packets larger than the > MTU of the bridge. > 5) A bridge or veth pair with an IP address can participate in path MTU > discovery (PMTUD). However, these devices do not appear to understand > namespaces and originate the ICMP message from the host instead of a > namespace. Therefore, the message never reaches the destination... > typically a host outside of the deployment. > > Conclusion... > > The MTU disparity between native and overlay networks must reside in a > device capable of layer-3 operations that can participate in PMTUD, such as > the neutron router between a private/project overlay network and a > public/external native network. > > Some background... > > In a typical datacenter network, MTU must remain consistent within a > layer-2 network because fragmentation and the mechanism indicating the need > for it occurs at layer-3. In other words, all host interfaces and switch > ports on the same layer-2 network must use the same MTU. If the layer-2 > network connects to a router, the router port must also use the same MTU. A > router can contain ports on multiple layer-2 networks with different MTUs > because it operates on those networks at layer-3. If the MTU changes > between ports on a router and devices on those layer-2 networks attempt to > communicate at layer-3, the router can perform a couple of actions. For > IPv4, the router can fragment the packet. However, if the packet contains > the "don't fragment" (DF) flag, the router can either silently drop the > packet or return an ICMP "fragmentation needed" message to the sender. This > ICMP message contains the MTU of the next layer-2 network in the route > between the sender and receiver. 
> Each router in the path can return these ICMP messages to the sender
> until it learns the maximum MTU for the entire path, also known as path
> MTU discovery (PMTUD). IPv6 routers do not fragment packets.
>
> The cloud provides a virtual extension of a physical network. In the
> simplest sense, patch cables become veth pairs, switches become bridges,
> and routers become namespaces. Therefore, MTU implementation for virtual
> networks should mimic physical networks where MTU changes must occur within
> a router at layer-3.
>
> For these experiments, my deployment contains one controller and one
> compute node. Neutron uses the ML2 plug-in and Linux bridge agent. The
> configuration does not contain any MTU options (e.g., path_mtu). One VM with
> a floating IP address attaches to a VXLAN private network that routes to a
> flat public network. The DHCP agent does not advertise MTU to the VM. My
> lab resides on public cloud infrastructure with networks that filter
> unknown MAC addresses such as those that neutron generates for virtual
> network components.
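The "global MTU value - network encap overhead" calculation proposed above is simple arithmetic; here is a minimal Python sketch of it, where the byte counts are the standard VXLAN-over-IPv4 header sizes and the function name is illustrative, not a neutron API:

```python
# Sketch of the DHCP-advertised MTU calculation discussed above.
# The constants are standard header sizes for VXLAN over IPv4;
# the function name is illustrative, not a neutron API.

# VXLAN overhead relative to the parent interface: outer IPv4 header (20)
# + outer UDP header (8) + VXLAN header (8) + encapsulated Ethernet
# header (14) = 50 bytes.
VXLAN_OVERHEAD = 20 + 8 + 8 + 14

def advertised_mtu(parent_mtu: int, overhead: int = VXLAN_OVERHEAD) -> int:
    """MTU the DHCP agent would advertise to instances on an overlay network."""
    return parent_mtu - overhead

# A 1500-byte parent interface yields the 1450-byte VXLAN MTU that
# observations 1 and 3 describe; a 9000-byte jumbo interface yields 8950.
print(advertised_mtu(1500))  # 1450
print(advertised_mtu(9000))  # 8950
```

The same subtraction works for other encapsulations by swapping in their overhead, which is why a single "underlying physical MTU" value plus per-network-type overhead would be enough for neutron to compute the rest.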
Re: [openstack-dev] [Neutron] MTU configuration pain
The MTU setting is an issue because it involves knowledge of the network
outside of OpenStack. That's why it was just a config value that was
expected to be set by an operator. This thread is working to see if we can
figure that out, or maybe at least come up with a different sub-optimal
default.

For the floating IP thing, do you need floating IPs? If not, using the
'provider networking' workflow is much simpler if you don't want tenant
virtual routers and whatnot:
http://docs.openstack.org/liberty/networking-guide/scenario_provider_lb.html

On Mon, Jan 18, 2016 at 4:06 PM, John Griffith wrote:
>
> On Sun, Jan 17, 2016 at 8:30 PM, Matt Kassawara wrote:
>
>> Prior attempts to solve the MTU problem in neutron simply band-aid it or
>> become too complex from feature creep or edge cases that mask the primary
>> goal of a simple implementation that works for most deployments. So, I ran
>> some experiments to empirically determine the root cause of MTU problems in
>> common neutron deployments using the Linux bridge agent. I plan to perform
>> these experiments again using the Open vSwitch agent... after sufficient
>> mental recovery.
>>
>> I highly recommend reading further, but here's the TL;DR:
>>
>> Observations...
>>
>> 1) During creation of a VXLAN interface, Linux automatically subtracts
>> the VXLAN protocol overhead from the MTU of the parent interface.
>> 2) A veth pair or tap with a different MTU on each end drops packets
>> larger than the smaller MTU.
>> 3) Linux automatically adjusts the MTU of a bridge to the lowest MTU of
>> all the ports. Therefore, Linux reduces the typical bridge MTU from 1500 to
>> 1450 when neutron adds a VXLAN interface to it.
>> 4) A bridge with different MTUs on each port drops packets larger than
>> the MTU of the bridge.
>> 5) A bridge or veth pair with an IP address can participate in path MTU
>> discovery (PMTUD).
>> However, these devices do not appear to understand namespaces and
>> originate the ICMP message from the host instead of a namespace.
>> Therefore, the message never reaches the destination... typically a host
>> outside of the deployment.
>>
>> Conclusion...
>>
>> The MTU disparity between native and overlay networks must reside in a
>> device capable of layer-3 operations that can participate in PMTUD, such as
>> the neutron router between a private/project overlay network and a
>> public/external native network.
>>
>> Some background...
>>
>> In a typical datacenter network, MTU must remain consistent within a
>> layer-2 network because fragmentation and the mechanism indicating the need
>> for it occur at layer-3. In other words, all host interfaces and switch
>> ports on the same layer-2 network must use the same MTU. If the layer-2
>> network connects to a router, the router port must also use the same MTU. A
>> router can contain ports on multiple layer-2 networks with different MTUs
>> because it operates on those networks at layer-3. If the MTU changes
>> between ports on a router and devices on those layer-2 networks attempt to
>> communicate at layer-3, the router can perform a couple of actions. For
>> IPv4, the router can fragment the packet. However, if the packet contains
>> the "don't fragment" (DF) flag, the router can either silently drop the
>> packet or return an ICMP "fragmentation needed" message to the sender. This
>> ICMP message contains the MTU of the next layer-2 network in the route
>> between the sender and receiver. Each router in the path can return these
>> ICMP messages to the sender until it learns the maximum MTU for the entire
>> path, also known as path MTU discovery (PMTUD). IPv6 routers do not
>> fragment packets.
>>
>> The cloud provides a virtual extension of a physical network. In the
>> simplest sense, patch cables become veth pairs, switches become bridges,
>> and routers become namespaces.
>> Therefore, MTU implementation for virtual networks should mimic physical
>> networks where MTU changes must occur within a router at layer-3.
>>
>> For these experiments, my deployment contains one controller and one
>> compute node. Neutron uses the ML2 plug-in and Linux bridge agent. The
>> configuration does not contain any MTU options (e.g., path_mtu). One VM with
>> a floating IP address attaches to a VXLAN private network that routes to a
>> flat public network. The DHCP agent does not advertise MTU to the VM. My
>> lab resides on public cloud infrastructure with networks that filter
>> unknown MAC addresses such as those that neutron generates for virtual
>> network components. Let's talk about the implications and workarounds.
>>
>> The VXLAN protocol contains 50 bytes of overhead. Linux automatically
>> calculates the MTU of VXLAN devices by subtracting 50 bytes from the parent
>> device, in this case a standard Ethernet interface with a 1500 MTU.
>> However, due to the limitations of public cloud networks, I must create a
>> VXLAN tunnel between the controller
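The lab setup quoted above compounds the problem: neutron's VXLAN overlay runs on a public cloud whose connectivity is itself encapsulated, so each tunnel layer subtracts its own overhead. A hypothetical sketch of that arithmetic, assuming every layer costs the standard 50-byte VXLAN-over-IPv4 overhead:

```python
from functools import reduce

# Hypothetical sketch of MTU shrinkage through nested encapsulation, as in
# the lab above where neutron's VXLAN overlay rides on another tunnel.
# Assumes each layer costs the standard VXLAN-over-IPv4 overhead of 50 bytes.
VXLAN_OVERHEAD = 50

def effective_mtu(physical_mtu: int, overheads: list) -> int:
    """Subtract each encapsulation layer's overhead, outermost first."""
    return reduce(lambda mtu, oh: mtu - oh, overheads, physical_mtu)

# One layer leaves the familiar 1450 bytes; nesting a second VXLAN tunnel
# inside it leaves only 1400 bytes for the instance's packets, so the MTU
# advertised to the VM must shrink accordingly.
print(effective_mtu(1500, [VXLAN_OVERHEAD]))                  # 1450
print(effective_mtu(1500, [VXLAN_OVERHEAD, VXLAN_OVERHEAD]))  # 1400
```

This is why any fix that hard-codes "1450" is fragile: the correct value depends on the physical MTU and however many encapsulation layers sit between the instance and the wire.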