Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-26 Thread Sean M. Collins
On Mon, Jan 25, 2016 at 08:16:03PM EST, Fox, Kevin M wrote:
> Another place to look...
> I've had to use network_device_mtu=9000 in nova's config as well to get MTUs
> working smoothly.
> 

I'll have to read the code on the Nova side and familiarize myself with it,
but this sounds like a case where DRY needs to be applied. We should just set
it once *somewhere* and then communicate it to the related OpenStack
components.
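
For reference, here is a hedged sketch of the separate knobs this duplication
involves -- exact file and section placement varies by release and driver, so
treat it as illustrative rather than authoritative:

# nova.conf (what Kevin describes)
network_device_mtu = 9000

# neutron / ML2 configuration (the newer per-network mechanism)
segment_mtu = 9000   # MTU of physical (VLAN/flat) networks
path_mtu = 9000      # underlay path MTU used for L3 overlays (VXLAN/GRE)

The same number has to be repeated in several places, which is exactly the
duplication being complained about.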
-- 
Sean M. Collins



Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-26 Thread Fox, Kevin M
big +1 from me. :)

Kevin

From: Sean M. Collins [s...@coreitpro.com]
Sent: Tuesday, January 26, 2016 9:59 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Neutron] MTU configuration pain

On Mon, Jan 25, 2016 at 08:16:03PM EST, Fox, Kevin M wrote:
> Another place to look...
> I've had to use network_device_mtu=9000 in nova's config as well to get MTUs
> working smoothly.
>

I'll have to read the code on the Nova side and familiarize myself with it,
but this sounds like a case where DRY needs to be applied. We should just set
it once *somewhere* and then communicate it to the related OpenStack
components.
--
Sean M. Collins



Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-26 Thread Ian Wells
As I recall, network_device_mtu sets up the MTU on a bunch of structures
independently of whatever the correct value is.  It was a bit of a
workaround back in the day and is still a bit of a workaround now.  I'd
sooner we actually fix up the new mechanism (which is kind of hard to do
when the closest I have to information is 'it probably doesn't work').

On 26 January 2016 at 09:59, Sean M. Collins  wrote:

> On Mon, Jan 25, 2016 at 08:16:03PM EST, Fox, Kevin M wrote:
> > Another place to look...
> > I've had to use network_device_mtu=9000 in nova's config as well to get
> MTUs working smoothly.
> >
>
> I'll have to read the code on the Nova side and familiarize myself with it,
> but this sounds like a case where DRY needs to be applied. We should just set
> it once *somewhere* and then communicate it to the related OpenStack
> components.
> --
> Sean M. Collins
>


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-25 Thread Sean M. Collins
On Mon, Jan 25, 2016 at 01:37:55AM EST, Kevin Benton wrote:
> At a minimum I think we should pick a default in devstack and dump a
> warning in neutron if operators don't specify it.

Here's the DevStack change that implements this.

https://review.openstack.org/#/c/267604/

Again this just fixes it for DevStack. Deployers still need to set the
MTUs by hand in their deployment tool of choice. I would hope that we
can still move forward with some sort of automatic discovery - and also
figure out a way to take it from 3 different config knobs down to like
one master knob, for the sake of sanity.

-- 
Sean M. Collins



Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-25 Thread Mike Spreitzer
BTW, regarding devstack: See 
https://bugs.launchpad.net/devstack/+bug/1532924.  I have been trying to 
get the current code to work, following the ideas in 
https://specs.openstack.org/openstack/fuel-specs/specs/7.0/jumbo-frames-between-instances.html#proposed-change
.  It fails only at the last step: the MTU on the network interface inside 
the VM is still 1500.

Regards,
Mike



Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-25 Thread Matt Kassawara
Ian,

Overthinking and corner cases led to the existing implementation, which
doesn't solve the MTU problem and arguably makes the situation worse
because options in the configuration files give operators the impression
they can control it. For example, the segment_mtu option does nothing in the
in-tree drivers, the network_device_mtu option only impacts parts of some
in-tree drivers, and path_mtu only provides a way to change the MTU for VMs
for all in-tree drivers. I ran my experiments without any of these options
to provide a clean slate for empirically analyzing the problem and finding
a solution for the majority of operators.

Matt

On Mon, Jan 25, 2016 at 6:31 AM, Sean M. Collins  wrote:

> On Mon, Jan 25, 2016 at 01:37:55AM EST, Kevin Benton wrote:
> > At a minimum I think we should pick a default in devstack and dump a
> > warning in neutron if operators don't specify it.
>
> Here's the DevStack change that implements this.
>
> https://review.openstack.org/#/c/267604/
>
> Again this just fixes it for DevStack. Deployers still need to set the
> MTUs by hand in their deployment tool of choice. I would hope that we
> can still move forward with some sort of automatic discovery - and also
> figure out a way to take it from 3 different config knobs down to like
> one master knob, for the sake of sanity.
>
> --
> Sean M. Collins
>


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-25 Thread Sean M. Collins
You need to set path_mtu. https://review.openstack.org/#/c/267604/ sets
it now and defaults to 1500 - then Neutron subtracts the overhead for
your tunnel protocol to arrive at the appropriate value.
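
Roughly, the calculation for a VXLAN network looks like this (assuming the
usual 50-byte VXLAN-over-IPv4 overhead):

path_mtu                  1500
- outer IPv4 header        -20
- outer UDP header          -8
- VXLAN header              -8
- inner Ethernet header    -14
                          ----
MTU advertised to the VM  1450

GRE has a different (smaller) encapsulation overhead, so the advertised value
differs per tunnel type.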

-- 
Sean M. Collins



Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-25 Thread Rick Jones

On 01/24/2016 07:43 PM, Ian Wells wrote:

> Also, I say 9000, but why is 9000 even the right number?


While that may have been a rhetorical question...

Because that is the value Alteon picked in the late 1990s when they 
created the de facto standard for "Jumbo Frames" by including it in 
their Gigabit Ethernet kit as a way to enable the systems of the day to 
have a hope of getting link-rate :)


Perhaps they picked 9000 because it was twice the 4500 of FDDI, which 
itself was selected to allow space for 4096 bytes of data and then a 
good bit of headers.



rick jones



Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-25 Thread Ian Wells
On 25 January 2016 at 07:06, Matt Kassawara  wrote:
> Overthinking and corner cases led to the existing implementation, which
> doesn't solve the MTU problem and arguably makes the situation worse
> because options in the configuration files give operators the impression
> they can control it.

We are giving the impression we solved the problem because we tried to
comprehensively solve the problem (documentation aside, apparently).  It's
complex when you want to do complex things, but the right answer for basic
end users is adding these two lines to neutron.conf, which I don't think is
asking too much:

path_mtu = 1500 # for VXLAN and GRE; MTU is 1450 on ports on VXLAN networks
segment_mtu = 1500 # for VLAN; MTU is 1500 on ports on VLAN networks

(while leaving the floor open for the other 1% of cases, where the options
cover pretty much everything you'd want to do).
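
For a jumbo-frame fabric the same two lines would carry the physical value
instead -- a sketch, assuming a 9000-byte underlay and the in-tree drivers:

path_mtu = 9000 # for VXLAN and GRE; MTU is 8950 on ports on VXLAN networks
segment_mtu = 9000 # for VLAN; MTU is 9000 on ports on VLAN networks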

So.  I don't know what path_mtu and segment_mtu settings you used that
disappointed you; could you recap?  Can you tell me whether the two options
above help?

> For example, the segment_mtu option does nothing in the in-tree drivers, the
> network_device_mtu option only impacts parts of some in-tree drivers, and
> path_mtu only provides a way to change the MTU for VMs for all in-tree
> drivers.

I was reading what documentation I could find (I may have written the spec,
but I didn't write the code, so I have to check the docs like everyone
else) and it says it should work - so anything else is a bug, which we
should go out and fix.  What test cases did you try?

network_device_mtu is an old hack, this much I know, and path_mtu and
segment_mtu are intended to be the correct modern way of doing things.

path_mtu should not apply to all in-tree drivers; specifically, it should
only apply to L3 overlays (as segment_mtu should only apply to VLANs) (and
by the wording of your statement I have to ask - are you seeing VM MTU =
path MTU? - because you shouldn't be).

I see there are plausible looking unit tests for segment_mtu, so if it's
not working then in what specific configuration is it not working?

>
> I ran my experiments without any of these options to provide a clean
> slate for empirically analyzing the problem and finding a solution for the
> majority of operators.

I'm afraid you've not been clear about what setups you've tested where
path_mtu and segment_mtu *are* set - you dismissed them so I presume you
tried.  When you say they don't do what you want, what do they do wrong?

>
>
> Matt
>
> On Mon, Jan 25, 2016 at 6:31 AM, Sean M. Collins 
wrote:
>>
>> On Mon, Jan 25, 2016 at 01:37:55AM EST, Kevin Benton wrote:
>> > At a minimum I think we should pick a default in devstack and dump a
>> > warning in neutron if operators don't specify it.
>>
>> Here's the DevStack change that implements this.
>>
>> https://review.openstack.org/#/c/267604/
>>
>> Again this just fixes it for DevStack. Deployers still need to set the
>> MTUs by hand in their deployment tool of choice. I would hope that we
>> can still move forward with some sort of automatic discovery - and also
>> figure out a way to take it from 3 different config knobs down to like
>> one master knob, for the sake of sanity.
>>
>> --
>> Sean M. Collins
>>
>>


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-25 Thread Fox, Kevin M
Another place to look...
I've had to use network_device_mtu=9000 in nova's config as well to get MTUs
working smoothly.

Thanks,
Kevin


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-25 Thread Matt Kassawara
Results from the Open vSwitch agent...

I highly recommend reading further, but here's the TL;DR: Using physical
network interfaces with MTUs larger than 1500 reveals problems in several
places, but only involving Linux components rather than Open vSwitch
components (such as br-int) on both the controller and compute nodes. Most
of the problems involve MTU disparities in security group bridge components
on the compute node.

First, review the OpenStack bits and resulting network components in the
environment [1] and see that a typical 'ping' works using IPv4 and IPv6 [2].

[1] https://gist.github.com/ionosphere80/23655bedd24730d22c89
[2] https://gist.github.com/ionosphere80/5f309e7021a830246b66

Note: The tcpdump output in each case references up to seven points:
neutron router gateway on the public network (qg), namespace end of the
neutron router interface on the private network (qr), controller node end
of the VXLAN network (underlying interface), compute node end of the VXLAN
network (underlying interface), Open vSwitch end of the veth pair for the
security group bridge (qvo), Linux bridge end of the veth pair for the
security group bridge (qvb), and the bridge end of the tap for the VM (tap).

I can use SSH to access the VM because every component between my host and
the VM supports at least a 1500 MTU. So, let's configure the VM network
interface to use the proper MTU of 9000 minus the VXLAN protocol overhead
of 50 bytes... 8950... and try SSH again.
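
Inside the VM that is a single iproute2 command -- roughly the following,
assuming the guest interface is eth0:

# ip link set dev eth0 mtu 8950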

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast qlen 1000
link/ether fa:16:3e:ea:22:3a brd ff:ff:ff:ff:ff:ff
inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
inet6 fd00:100:52:1:f816:3eff:feea:223a/64 scope global dynamic
   valid_lft 86396sec preferred_lft 14396sec
inet6 fe80::f816:3eff:feea:223a/64 scope link
   valid_lft forever preferred_lft forever

Contrary to the Linux bridge experiment, I can still use SSH to access the
VM. Why?

Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the
maximum for a VXLAN segment with 8950 MTU.
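
Those payload sizes are just header arithmetic (assuming no IP options or
extension headers):

IPv4: 8950 (interface MTU) - 20 (IPv4 header) - 8 (ICMP header)   = 8922
IPv6: 8950 (interface MTU) - 40 (IPv6 header) - 8 (ICMPv6 header) = 8902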

# ping -c 1 -s 8922 -M do 10.100.52.102
PING 10.100.52.102 (10.100.52.102) 8922(8950) bytes of data.
From 10.100.52.102 icmp_seq=1 Frag needed and DF set (mtu = 1500)

--- 10.100.52.102 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping6 -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:feea:223a
PING fd00:100:52:1:f816:3eff:feea:223a(fd00:100:52:1:f816:3eff:feea:223a)
8902 data bytes
From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=1500

--- fd00:100:52:1:f816:3eff:feea:223a ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

Look at the tcpdump output [3]. The router namespace, operating at layer-3,
sees the MTU discrepancy between the inbound packet and the neutron router
gateway on the public network and returns an ICMP "fragmentation needed" or
"packet too big" message to the sender. The sender uses the MTU value in
the ICMP packet to recalculate the length of the first packet and caches it
for future packets.

[3] https://gist.github.com/ionosphere80/4e1389a34fd3a628b294

Although PMTUD enables communication between my host and the VM, it limits
MTU to 1500 regardless of the MTU between the router namespace and VM and
therefore could impact performance on 10 Gbps or faster networks. Also, it
does not address the MTU disparity between a VM and network components on
the compute node. If a VM uses a 1500 or smaller MTU, it cannot send
packets that exceed the MTU of the tap interface, veth pairs, and bridge on
the compute node. In this situation, which seems fairly typical for
operators trying to work around MTU problems, communication between a host
(outside of OpenStack) and a VM always works. However, what if a VM uses an
MTU larger than 1500 and attempts to send a large packet? The bridge or
veth pairs would drop it because of the MTU disparity.

Using observations from the Linux bridge experiment, let's configure the
MTU of the interfaces in the router namespace to match the interfaces
outside of the namespace. The public network (gateway) interface MTU
becomes 9000 and the private network router interfaces (IPv4 and IPv6)
become 8950.
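
One way to apply that from the network node is with iproute2 -- a sketch, with
the router UUID left as a placeholder:

# ip netns exec qrouter-<router-uuid> ip link set dev qg-e3303f07-e7 mtu 9000
# ip netns exec qrouter-<router-uuid> ip link set dev qr-d744191c-9d mtu 8950
# ip netns exec qrouter-<router-uuid> ip link set dev qr-ae54b450-b4 mtu 8950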

31: qr-d744191c-9d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc
noqueue state UNKNOWN mode DEFAULT group default
link/ether fa:16:3e:34:67:40 brd ff:ff:ff:ff:ff:ff
32: qr-ae54b450-b4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc
noqueue state UNKNOWN mode DEFAULT group default
link/ether fa:16:3e:d4:f1:63 brd ff:ff:ff:ff:ff:ff
33: qg-e3303f07-e7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc
noqueue state UNKNOWN mode DEFAULT group default
link/ether fa:16:3e:70:09:54 brd ff:ff:ff:ff:ff:ff

Let's ping again with a payload size of 8922 for IPv4, the maximum for a
VXLAN segment with 8950 MTU, and look at the tcpdump output [4]. For
brevity, I'm only showing IPv4 because IPv6 provides similar results.

# ping -c 1 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-24 Thread Ian Wells
On 23 January 2016 at 11:27, Adam Lawson  wrote:

> For the sake of over-simplification, is there ever a reason to NOT enable
> jumbo frames in a cloud/SDN context where most of the traffic is between
> virtual elements that all support it? I understand that some switches do
> not support it and traffic from the web doesn't support it either but
> besides that, seems like a default "jumboframes = 1" concept would work
> just fine to me.
>

Offhand:

1. you don't want the latency increase that comes with 9000 byte packets,
even if it's tiny (bearing in mind that in a link shared between tenants it
affects everyone when one packet holds the line for 6 times longer)
2. not every switch in the world is going to (a) be configurable or (b)
pass 9000 byte packets
3. not every VM has a configurable MTU that you can set on boot, or
supports jumbo frames, and someone somewhere will try and run one of those
VMs
4. when you're using provider networks, not every device attached to the
cloud has a 9000 MTU (and this one's interesting, in fact, because it
points to the other element the MTU spec was addressing, that *not all
networks, even in Neutron, will have the same MTU*).
5. similarly, if you have an external network in Openstack, and you're
using VXLAN, the MTU of the external network is almost certainly 50 bytes
bigger than that of the inside of the VXLAN overlays, so no one number can
ever be right for every network in Neutron.

Also, I say 9000, but why is 9000 even the right number?  We need a
number... and 'jumbo' is not a number.  I know devices that will let you
transmit 9200 byte packets.  Conversely, if the native L2 is 9000 bytes,
then the MTU in a Neutron virtual network is less than 9000 - so what MTU
do you want to offer your applications?  If your apps don't care, why not
tell them what MTU they're getting (e.g. 1450) and be done with it?
(Memory says that the old problem with that was that github had problems
with PMTUD in that circumstance, but I don't know if that's still true, and
even if it is it's not technically our problem.)

Per the spec, I would like to see us do the remaining fixes to make that
work as intended - largely 'tell the VMs what they're getting' - and then,
as others have said, lay out simple options for deployments, be they jumbo
frame or otherwise.

If you're seeing MTU related problems at this point, can you file bugs on
them and/or report back the bugs here, so that we can see what we're
actually facing?
-- 
Ian.


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-24 Thread Kevin Benton
I believe the issue is that the default is unspecified, which leads to
nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500,
which leads to a catastrophe when running on an overlay on a 1500 underlay.
On Jan 24, 2016 20:48, "Ian Wells"  wrote:

> On 23 January 2016 at 11:27, Adam Lawson  wrote:
>
>> For the sake of over-simplification, is there ever a reason to NOT enable
>> jumbo frames in a cloud/SDN context where most of the traffic is between
>> virtual elements that all support it? I understand that some switches do
>> not support it and traffic from the web doesn't support it either but
>> besides that, seems like a default "jumboframes = 1" concept would work
>> just fine to me.
>>
>
> Offhand:
>
> 1. you don't want the latency increase that comes with 9000 byte packets,
> even if it's tiny (bearing in mind that in a link shared between tenants it
> affects everyone when one packet holds the line for 6 times longer)
> 2. not every switch in the world is going to (a) be configurable or (b)
> pass 9000 byte packets
> 3. not every VM has a configurable MTU that you can set on boot, or
> supports jumbo frames, and someone somewhere will try and run one of those
> VMs
> 4. when you're using provider networks, not every device attached to the
> cloud has a 9000 MTU (and this one's interesting, in fact, because it
> points to the other element the MTU spec was addressing, that *not all
> networks, even in Neutron, will have the same MTU*).
> 5. similarly, if you have an external network in Openstack, and you're
> using VXLAN, the MTU of the external network is almost certainly 50 bytes
> bigger than that of the inside of the VXLAN overlays, so no one number can
> ever be right for every network in Neutron.
>
> Also, I say 9000, but why is 9000 even the right number?  We need a
> number... and 'jumbo' is not a number.  I know devices that will let you
> transmit 9200 byte packets.  Conversely, if the native L2 is 9000 bytes,
> then the MTU in a Neutron virtual network is less than 9000 - so what MTU
> do you want to offer your applications?  If your apps don't care, why not
> tell them what MTU they're getting (e.g. 1450) and be done with it?
> (Memory says that the old problem with that was that github had problems
> with PMTUD in that circumstance, but I don't know if that's still true, and
> even if it is it's not technically our problem.)
>
> Per the spec, I would like to see us do the remaining fixes to make that
> work as intended - largely 'tell the VMs what they're getting' - and then,
> as others have said, lay out simple options for deployments, be they jumbo
> frame or otherwise.
>
> If you're seeing MTU related problems at this point, can you file bugs on
> them and/or report back the bugs here, so that we can see what we're
> actually facing?
> --
> Ian.
>


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-24 Thread Ian Wells
I wrote the spec for the MTU work that's in the Neutron API today.  It
haunts my nightmares.  I learned so many nasty corner cases for MTU, and
you're treading that same dark path.

I'd first like to point out a few things that change the implications of
what you're reporting in strange ways. [1] points out even more strange
ways, but these are the notable ones from what I've been reading here...

RFC7348: "VTEPs MUST NOT fragment VXLAN packets. ... The destination VTEP
MAY silently discard such VXLAN fragments."  The VXLAN VTEP implementations
we use today may fragment, but it's not according to the RFC, and I
wouldn't rely that every implementation you come across knows to do it.
So, the largest L2 packet you can send over VXLAN is a function of path MTU.

Even if VXLAN is fragmenting, you actively want to avoid it
fragmenting, because - in the typical case of bulk TCP transfers using
max-MTU packets - you're *invisibly* fragmenting each packet into two and
adding about 80 bytes of overhead in the process and then reassembling them
at the far end.  You've just explicitly guaranteed that, just as you send
the most data, your connection will slow down. And the MTU problem will be
undetectable to the VMs (which can't find out that a VXLAN-encapped packet
has been fragmented; the packet *they* sent didn't fragment, but the one
it's carried in did, not to mention the fragmentation didn't even happen at
an L3 node in the virtual network, so DF and therefore PMTUD wouldn't work).

Path MTU is not fixed, because your path can vary according to network
weather (failures, congestion, whatever).  It's an oddity, and perhaps a
rarity, but you can get many weirdnesses: you fail over from one link to a
link with a smaller MTU and the path MTU shrinks; some switches are jumbo
frame and some aren't, so the path MTU might vary from host to host; and so
on.  Granted, these are weird cases, but the point here is that Openstack
cannot *discover* this number.  An installer might attempt something,
knowing how to read switch config; or it might attempt to validate a number
it's been given, as best it can; but even then it's best effort, it's not a
guarantee.  For all these reasons, the only way to really get the minimum
path MTU is from the operator themselves, which is why this is a
configuration parameter to Neutron (path_mtu).

The aim of the changes in the spec [1] was threefold:

1. To ensure that an app that absolutely required a certain minimum MTU to
operate could guarantee it would receive it
2. To allow the network to say what the MTU was, so that the VM could be
programmed accordingly
3. To ensure that the MTU for the network would - by default - settle on
the optimal value, per all the stuff above.

So what could we do in this environment to improve matters?

1. We should advertise MTU in the RA and DHCP messages that Openstack
sends (see the sketch after this list).  I thought we'd already done this
work, but this thread suggests not.

[Note, though, that you can't reliably set an MTU higher than 1500 on IPv6
using an RA, thanks to RFC4861 referencing RFC2464 which goes with the
standard, but not the practice, that the biggest ethernet packet is 1500
bytes.  You've been violating the standard all these years, you bad
people.  Unfortunately, Linux enforces this RA rule, albeit slightly
strangely.]

2. We should also put the MTU in any config-drive settings for VMs that
don't respect such things in DHCP and RAs, or don't do DHCP.  This is
Nova-side, reacting to the MTU property of the network.

3. Installers should determine the appropriate MTU settings on interfaces
and ensure they're set.  Openstack can't do this in some cases (VXLAN - no
interfaces) - and probably shouldn't in others (VLAN - the interface MTU is
input to the MTU selection algorithm above, and the installer should set
the interface MTU to match what the operator says the fabric MTU is).

4. We need to check the Neutron network drivers to see which ones are
accepting, but not properly respecting, the MTU setting on the network.  I
suspect we're short of testing to make sure that veths, bridges, switches
and so on are all correctly configured.
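
To make point 1 above concrete, the values handed out would correspond to
something like the following dnsmasq and radvd settings for a 1450-byte VXLAN
network -- a hedged sketch of what should end up on the wire, not the exact
configuration Neutron generates:

# dnsmasq: DHCP option 26 (interface MTU)
dhcp-option-force=26,1450

# radvd: advertise the link MTU in router advertisements
interface qr-xxxxxxxx-xx {
    AdvSendAdvert on;
    AdvLinkMTU 1450;
};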

-- 
Ian.

[1] https://review.openstack.org/#/c/105989/ and
https://github.com/openstack/neutron-specs/blob/master/specs/kilo/mtu-selection-and-advertisement.rst


On 22 January 2016 at 19:13, Matt Kassawara  wrote:

> The fun continues, now using an OpenStack deployment on physical hardware
> that supports jumbo frames with 9000 MTU and IPv4/IPv6. This experiment
> still uses Linux bridge for consistency. I'm planning to run similar
> experiments with Open vSwitch and Open Virtual Network (OVN) in the next
> week.
>
> I highly recommend reading further, but here's the TL;DR: Using physical
> network interfaces with MTUs larger than 1500 reveals an additional problem
> with veth pair for the neutron router interface on the public network.
> Additionally, IP protocol version does not impact MTU calculation for
> 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-24 Thread Kevin Benton
At a minimum I think we should pick a default in devstack and dump a
warning in neutron if operators don't specify it.

I would still prefer changing the default, even though it's a behavior
change, considering the current behavior is annoying. :)
On Jan 24, 2016 23:31, "Ian Wells"  wrote:

> On 24 January 2016 at 22:12, Kevin Benton  wrote:
>
>> >The reason for that was in the other half of the thread - it's not
>> possible to magically discover these things from within Openstack's own
>> code because the relevant settings span more than just one server
>>
>> IMO it's better to have a default of 1500 rather than let VMs
>> automatically default to 1500 because at least we will deduct the encap
>> header length when necessary in the dhcp/ra advertised value so overlays
>> work on standard 1500 MTU networks.
>>
>> In other words, our current empty default is realistically a terrible
>> default of 1500 that doesn't account for network segmentation overhead.
>>
> It's pretty clear that, while the current setup is precisely the old
> behaviour (backward compatibility, y'know?), it's not very useful.  Problem
> is, anyone using the 1550+hacks and other methods of today will find their
> system changes behaviour if we started setting that specific default.
>
> Regardless, we need to take that documentation and update it.  It was a
> nasty hack back in the day and not remotely a good idea now.
>
>
>
>> On Jan 24, 2016 23:00, "Ian Wells"  wrote:
>>
>>> On 24 January 2016 at 20:18, Kevin Benton  wrote:
>>>
 I believe the issue is that the default is unspecified, which leads to
 nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500,
 which leads to a catastrophe when running on an overlay on a 1500 underlay.

>>> That's not quite the point I was making here, but to answer that: looks
>>> to me like (for the LB or OVS drivers to appropriately set the network MTU
>>> for the virtual network, at which point it will be advertised because
>>> advertise_mtu defaults to True in the code) you *must* set one or more of
>>> path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu
>>> (for L2 overlays with differing MTUs on different physical networks).
>>> That's a statement of faith - I suspect if we try it we'll find a few
>>> niggling problems - but I can find the code, at least.
>>>
>>> The reason for that was in the other half of the thread - it's not
>>> possible to magically discover these things from within Openstack's own
>>> code because the relevant settings span more than just one server.  They
>>> have to line up with both your MTU settings for the interfaces in use, and
>>> the MTU settings for the other equipment within and neighbouring the cloud
>>> - switches, routers, nexthops.  So they have to be provided by the operator
>>> - then everything you want should kick in.
>>>
>>> If all of that is true, it really is just a documentation problem - we
>>> have the idea in place, we're just not telling people how to make use of
>>> it.  We can also include a checklist or a check script with that
>>> documentation - you might not be able to deduce the MTU values, but you can
>>> certainly run some checks to see if the values you have been given are
>>> obviously wrong.
>>>
>>> In the meantime, Matt K, you said you hadn't set path_mtu in your tests,
>>> but [1] says you have to ([1] is far from end-user consumable
>>> documentation, which again illustrates our problem).
>>>
>>> Can you set both path_mtu and segment_mtu to whatever value your switch
>>> MTU is (1500 or 9000), confirm your outbound interface MTU is the same
>>> (1500 or 9000), and see if that changes things?  At this point, you should
>>> find that your networks get appropriate 1500/9000 MTUs on VLAN based
>>> networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to
>>> your VMs via DHCP and RA, and that your routers even know that different
>>> interfaces have different MTUs in a mixed environment, at least if
>>> everything is working as intended.
>>> --
>>> Ian.
>>>
>>> [1]
>>> https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5
>>>
>>>
 On Jan 24, 2016 20:48, "Ian Wells"  wrote:

> On 23 January 2016 at 11:27, Adam Lawson  wrote:
>
>> For the sake of over-simplification, is there ever a reason to NOT
>> enable jumbo frames in a cloud/SDN context where most of the traffic is
>> between virtual elements that all support it? I understand that some
>> switches do not support it and traffic from the web doesn't support it
>> either but besides that, seems like a default "jumboframes = 1" concept
>> would work just fine to me.
>>
>
> Offhand:
>
> 1. you don't want the latency increase that comes 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-24 Thread Ian Wells
On 24 January 2016 at 20:18, Kevin Benton  wrote:

> I believe the issue is that the default is unspecified, which leads to
> nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500,
> which leads to a catastrophe when running on an overlay on a 1500 underlay.
>
That's not quite the point I was making here, but to answer that: looks to
me like (for the LB or OVS drivers to appropriately set the network MTU for
the virtual network, at which point it will be advertised because
advertise_mtu defaults to True in the code) you *must* set one or more of
path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu
(for L2 overlays with differing MTUs on different physical networks).
That's a statement of faith - I suspect if we try it we'll find a few
niggling problems - but I can find the code, at least.

The reason for that was in the other half of the thread - it's not possible
to magically discover these things from within Openstack's own code because
the relevant settings span more than just one server.  They have to line up
with both your MTU settings for the interfaces in use, and the MTU settings
for the other equipment within and neighbouring the cloud - switches,
routers, nexthops.  So they have to be provided by the operator - then
everything you want should kick in.

If all of that is true, it really is just a documentation problem - we have
the idea in place, we're just not telling people how to make use of it.  We
can also include a checklist or a check script with that documentation -
you might not be able to deduce the MTU values, but you can certainly run
some checks to see if the values you have been given are obviously wrong.
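
A first cut at such a check could simply compare the configured value against
the interface MTUs and probe the underlay with don't-fragment pings -- a
sketch only, with the peer address and value as placeholders:

#!/bin/sh
# Hypothetical sanity check: does the fabric really carry the MTU we were told?
FABRIC_MTU=9000           # the value given to segment_mtu/path_mtu
PEER=192.0.2.10           # another hypervisor on the same segment
ip -o link show | awk '{print $2, $5}'   # eyeball the interface MTUs
ping -c 3 -M do -s $((FABRIC_MTU - 28)) "$PEER" \
  && echo "underlay carries ${FABRIC_MTU}" \
  || echo "underlay does NOT carry ${FABRIC_MTU}"

(The -28 accounts for the 20-byte IPv4 and 8-byte ICMP headers, as in Matt's
tests.)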

In the meantime, Matt K, you said you hadn't set path_mtu in your tests,
but [1] says you have to ([1] is far from end-user consumable
documentation, which again illustrates our problem).

Can you set both path_mtu and segment_mtu to whatever value your switch MTU
is (1500 or 9000), confirm your outbound interface MTU is the same (1500 or
9000), and see if that changes things?  At this point, you should find that
your networks get appropriate 1500/9000 MTUs on VLAN based networks and
1450/8950 MTUs on VXLAN networks, that they're advertised to your VMs via
DHCP and RA, and that your routers even know that different interfaces have
different MTUs in a mixed environment, at least if everything is working as
intended.
-- 
Ian.

[1]
https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5


> On Jan 24, 2016 20:48, "Ian Wells"  wrote:
>
>> On 23 January 2016 at 11:27, Adam Lawson  wrote:
>>
>>> For the sake of over-simplification, is there ever a reason to NOT
>>> enable jumbo frames in a cloud/SDN context where most of the traffic is
>>> between virtual elements that all support it? I understand that some
>>> switches do not support it and traffic from the web doesn't support it
>>> either but besides that, seems like a default "jumboframes = 1" concept
>>> would work just fine to me.
>>>
>>
>> Offhand:
>>
>> 1. you don't want the latency increase that comes with 9000 byte packets,
>> even if it's tiny (bearing in mind that in a link shared between tenants it
>> affects everyone when one packet holds the line for 6 times longer)
>> 2. not every switch in the world is going to (a) be configurable or (b)
>> pass 9000 byte packets
>> 3. not every VM has a configurable MTU that you can set on boot, or
>> supports jumbo frames, and someone somewhere will try and run one of those
>> VMs
>> 4. when you're using provider networks, not every device attached to the
>> cloud has a 9000 MTU (and this one's interesting, in fact, because it
>> points to the other element the MTU spec was addressing, that *not all
>> networks, even in Neutron, will have the same MTU*).
>> 5. similarly, if you have an external network in Openstack, and you're
>> using VXLAN, the MTU of the external network is almost certainly 50 bytes
>> bigger than that of the inside of the VXLAN overlays, so no one number can
>> ever be right for every network in Neutron.
>>
>> Also, I say 9000, but why is 9000 even the right number?  We need a
>> number... and 'jumbo' is not a number.  I know devices that will let you
>> transmit 9200 byte packets.  Conversely, if the native L2 is 9000 bytes,
>> then the MTU in a Neutron virtual network is less than 9000 - so what MTU
>> do you want to offer your applications?  If your apps don't care, why not
>> tell them what MTU they're getting (e.g. 1450) and be done with it?
>> (Memory says that the old problem with that was that github had problems
>> with PMTUD in that circumstance, but I don't know if that's still true, and
>> even if it is it's not technically our problem.)
>>
>> Per the spec, I would like to see us do the remaining fixes to make that
>> work as intended - largely 'tell the VMs 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-24 Thread Kevin Benton
>The reason for that was in the other half of the thread - it's not
possible to magically discover these things from within Openstack's own
code because the relevant settings span more than just one server

IMO it's better to have a default of 1500 rather than let VMs automatically
default to 1500 because at least we will deduct the encap header length
when necessary in the dhcp/ra advertised value so overlays work on standard
1500 MTU networks.

In other words, our current empty default is realistically a terrible
default of 1500 that doesn't account for network segmentation overhead.
On Jan 24, 2016 23:00, "Ian Wells"  wrote:

> On 24 January 2016 at 20:18, Kevin Benton  wrote:
>
>> I believe the issue is that the default is unspecified, which leads to
>> nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500,
>> which leads to a catastrophe when running on an overlay on a 1500 underlay.
>>
> That's not quite the point I was making here, but to answer that: looks to
> me like (for the LB or OVS drivers to appropriately set the network MTU for
> the virtual network, at which point it will be advertised because
> advertise_mtu defaults to True in the code) you *must* set one or more of
> path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu
> (for L2 overlays with differing MTUs on different physical networks).
> That's a statement of faith - I suspect if we try it we'll find a few
> niggling problems - but I can find the code, at least.
>
> The reason for that was in the other half of the thread - it's not
> possible to magically discover these things from within Openstack's own
> code because the relevant settings span more than just one server.  They
> have to line up with both your MTU settings for the interfaces in use, and
> the MTU settings for the other equipment within and neighbouring the cloud
> - switches, routers, nexthops.  So they have to be provided by the operator
> - then everything you want should kick in.
>
> If all of that is true, it really is just a documentation problem - we
> have the idea in place, we're just not telling people how to make use of
> it.  We can also include a checklist or a check script with that
> documentation - you might not be able to deduce the MTU values, but you can
> certainly run some checks to see if the values you have been given are
> obviously wrong.
>
> In the meantime, Matt K, you said you hadn't set path_mtu in your tests,
> but [1] says you have to ([1] is far from end-user consumable
> documentation, which again illustrates our problem).
>
> Can you set both path_mtu and segment_mtu to whatever value your switch
> MTU is (1500 or 9000), confirm your outbound interface MTU is the same
> (1500 or 9000), and see if that changes things?  At this point, you should
> find that your networks get appropriate 1500/9000 MTUs on VLAN based
> networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to
> your VMs via DHCP and RA, and that your routers even know that different
> interfaces have different MTUs in a mixed environment, at least if
> everything is working as intended.
> --
> Ian.
>
> [1]
> https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5
>
>
>> On Jan 24, 2016 20:48, "Ian Wells"  wrote:
>>
>>> On 23 January 2016 at 11:27, Adam Lawson  wrote:
>>>
 For the sake of over-simplification, is there ever a reason to NOT
 enable jumbo frames in a cloud/SDN context where most of the traffic is
 between virtual elements that all support it? I understand that some
 switches do not support it and traffic from the web doesn't support it
 either but besides that, seems like a default "jumboframes = 1" concept
 would work just fine to me.

>>>
>>> Offhand:
>>>
>>> 1. you don't want the latency increase that comes with 9000 byte
>>> packets, even if it's tiny (bearing in mind that in a link shared between
>>> tenants it affects everyone when one packet holds the line for 6 times
>>> longer)
>>> 2. not every switch in the world is going to (a) be configurable or (b)
>>> pass 9000 byte packets
>>> 3. not every VM has a configurable MTU that you can set on boot, or
>>> supports jumbo frames, and someone somewhere will try and run one of those
>>> VMs
>>> 4. when you're using provider networks, not every device attached to the
>>> cloud has a 9000 MTU (and this one's interesting, in fact, because it
>>> points to the other element the MTU spec was addressing, that *not all
>>> networks, even in Neutron, will have the same MTU*).
>>> 5. similarly, if you have an external network in Openstack, and you're
>>> using VXLAN, the MTU of the external network is almost certainly 50 bytes
>>> bigger than that of the inside of the VXLAN overlays, so no one number can
>>> ever be right for every network in Neutron.
>>>

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-24 Thread Ian Wells
I like both of those ideas.

On 24 January 2016 at 22:37, Kevin Benton  wrote:

> At a minimum I think we should pick a default in devstack and dump a
> warning in neutron if operators don't specify it.
>
> I would still prefer changing the default, even though it's a behavior
> change, considering the current behavior is annoying. :)
> On Jan 24, 2016 23:31, "Ian Wells"  wrote:
>
>> On 24 January 2016 at 22:12, Kevin Benton  wrote:
>>
>>> >The reason for that was in the other half of the thread - it's not
>>> possible to magically discover these things from within Openstack's own
>>> code because the relevant settings span more than just one server
>>>
>>> IMO it's better to have a default of 1500 rather than let VMs
>>> automatically default to 1500 because at least we will deduct the encap
>>> header length when necessary in the dhcp/ra advertised value so overlays
>>> work on standard 1500 MTU networks.
>>>
>>> In other words, our current empty default is realistically a terrible
>>> default of 1500 that doesn't account for network segmentation overhead.
>>>
>> It's pretty clear that, while the current setup is precisely the old
>> behaviour (backward compatibility, y'know?), it's not very useful.  Problem
>> is, anyone using the 1550+hacks and other methods of today will find their
>> system changes behaviour if we started setting that specific default.
>>
>> Regardless, we need to take that documentation and update it.  It was a
>> nasty hack back in the day and not remotely a good idea now.
>>
>>
>>
>>> On Jan 24, 2016 23:00, "Ian Wells"  wrote:
>>>
 On 24 January 2016 at 20:18, Kevin Benton  wrote:

> I believe the issue is that the default is unspecified, which leads to
> nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500,
> which leads to a catastrophe when running on an overlay on a 1500 
> underlay.
>
 That's not quite the point I was making here, but to answer that: looks
 to me like (for the LB or OVS drivers to appropriately set the network MTU
 for the virtual network, at which point it will be advertised because
 advertise_mtu defaults to True in the code) you *must* set one or more of
 path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu
 (for L2 overlays with differing MTUs on different physical networks).
 That's a statement of faith - I suspect if we try it we'll find a few
 niggling problems - but I can find the code, at least.

 The reason for that was in the other half of the thread - it's not
 possible to magically discover these things from within Openstack's own
 code because the relevant settings span more than just one server.  They
 have to line up with both your MTU settings for the interfaces in use, and
 the MTU settings for the other equipment within and neighbouring the cloud
 - switches, routers, nexthops.  So they have to be provided by the operator
 - then everything you want should kick in.

 If all of that is true, it really is just a documentation problem - we
 have the idea in place, we're just not telling people how to make use of
 it.  We can also include a checklist or a check script with that
 documentation - you might not be able to deduce the MTU values, but you can
 certainly run some checks to see if the values you have been given are
 obviously wrong.

 In the meantime, Matt K, you said you hadn't set path_mtu in your
 tests, but [1] says you have to ([1] is far from end-user consumable
 documentation, which again illustrates our problem).

 Can you set both path_mtu and segment_mtu to whatever value your switch
 MTU is (1500 or 9000), confirm your outbound interface MTU is the same
 (1500 or 9000), and see if that changes things?  At this point, you should
 find that your networks get appropriate 1500/9000 MTUs on VLAN based
 networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to
 your VMs via DHCP and RA, and that your routers even know that different
 interfaces have different MTUs in a mixed environment, at least if
 everything is working as intended.
 --
 Ian.

 [1]
 https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5


> On Jan 24, 2016 20:48, "Ian Wells"  wrote:
>
>> On 23 January 2016 at 11:27, Adam Lawson  wrote:
>>
>>> For the sake of over-simplification, is there ever a reason to NOT
>>> enable jumbo frames in a cloud/SDN context where most of the traffic is
>>> between virtual elements that all support it? I understand that some
>>> switches do not support it and traffic from the web doesn't support it

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-24 Thread Ian Wells
On 24 January 2016 at 22:12, Kevin Benton  wrote:

> >The reason for that was in the other half of the thread - it's not
> possible to magically discover these things from within Openstack's own
> code because the relevant settings span more than just one server
>
> IMO it's better to have a default of 1500 rather than let VMs
> automatically default to 1500 because at least we will deduct the encap
> header length when necessary in the dhcp/ra advertised value so overlays
> work on standard 1500 MTU networks.
>
> In other words, our current empty default is realistically a terrible
> default of 1500 that doesn't account for network segmentation overhead.
>
It's pretty clear that, while the current setup is precisely the old
behaviour (backward compatibility, y'know?), it's not very useful.  Problem
is, anyone using the 1550+hacks and other methods of today will find their
system changes behaviour if we started setting that specific default.

Regardless, we need to take that documentation and update it.  It was a
nasty hack back in the day and not remotely a good idea now.



> On Jan 24, 2016 23:00, "Ian Wells"  wrote:
>
>> On 24 January 2016 at 20:18, Kevin Benton  wrote:
>>
>>> I believe the issue is that the default is unspecified, which leads to
>>> nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500,
>>> which leads to a catastrophe when running on an overlay on a 1500 underlay.
>>>
>> That's not quite the point I was making here, but to answer that: looks
>> to me like (for the LB or OVS drivers to appropriately set the network MTU
>> for the virtual network, at which point it will be advertised because
>> advertise_mtu defaults to True in the code) you *must* set one or more of
>> path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu
>> (for L2 overlays with differing MTUs on different physical networks).
>> That's a statement of faith - I suspect if we try it we'll find a few
>> niggling problems - but I can find the code, at least.
>>
>> The reason for that was in the other half of the thread - it's not
>> possible to magically discover these things from within Openstack's own
>> code because the relevant settings span more than just one server.  They
>> have to line up with both your MTU settings for the interfaces in use, and
>> the MTU settings for the other equipment within and neighbouring the cloud
>> - switches, routers, nexthops.  So they have to be provided by the operator
>> - then everything you want should kick in.
>>
>> If all of that is true, it really is just a documentation problem - we
>> have the idea in place, we're just not telling people how to make use of
>> it.  We can also include a checklist or a check script with that
>> documentation - you might not be able to deduce the MTU values, but you can
>> certainly run some checks to see if the values you have been given are
>> obviously wrong.
>>
>> In the meantime, Matt K, you said you hadn't set path_mtu in your tests,
>> but [1] says you have to ([1] is far from end-user consumable
>> documentation, which again illustrates our problem).
>>
>> Can you set both path_mtu and segment_mtu to whatever value your switch
>> MTU is (1500 or 9000), confirm your outbound interface MTU is the same
>> (1500 or 9000), and see if that changes things?  At this point, you should
>> find that your networks get appropriate 1500/9000 MTUs on VLAN based
>> networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to
>> your VMs via DHCP and RA, and that your routers even know that different
>> interfaces have different MTUs in a mixed environment, at least if
>> everything is working as intended.
>> --
>> Ian.
>>
>> [1]
>> https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5
>>
>>
>>> On Jan 24, 2016 20:48, "Ian Wells"  wrote:
>>>
 On 23 January 2016 at 11:27, Adam Lawson  wrote:

> For the sake of over-simplification, is there ever a reason to NOT
> enable jumbo frames in a cloud/SDN context where most of the traffic is
> between virtual elements that all support it? I understand that some
> switches do not support it and traffic from the web doesn't support it
> either but besides that, seems like a default "jumboframes = 1" concept
> would work just fine to me.
>

 Offhand:

 1. you don't want the latency increase that comes with 9000 byte
 packets, even if it's tiny (bearing in mind that in a link shared between
 tenants it affects everyone when one packet holds the line for 6 times
 longer)
 2. not every switch in the world is going to (a) be configurable or (b)
 pass 9000 byte packets
 3. not every VM has a configurable MTU that you can set on boot, or
 supports jumbo frames, and someone somewhere will try and 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-24 Thread Ian Wells
Actually, I note that that document is Juno and there doesn't seem to be
anything at all in the Liberty guide now, so the answer is probably to add
settings for path_mtu and segment_mtu in the recommended Neutron
configuration.
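
For anyone wanting to try this before the guide catches up, a minimal sketch
of what that recommended configuration might look like for a 1500-byte
physical network - the section placement and the physical_network_mtus
spelling are my reading of the current options, so check the sample configs
for your release:

# /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2]
# MTU of the underlying physical network; neutron subtracts the overlay
# overhead (e.g. 50 bytes for VXLAN) when it calculates virtual network MTUs.
path_mtu = 1500
segment_mtu = 1500
# Optional per-physnet overrides when physical networks differ:
# physical_network_mtus = physnet1:1500,physnet2:9000

# /etc/neutron/neutron.conf
[DEFAULT]
# Defaults to True per the release note referenced below; shown for clarity.
advertise_mtu = True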

On 24 January 2016 at 22:26, Ian Wells  wrote:

> On 24 January 2016 at 22:12, Kevin Benton  wrote:
>
>> >The reason for that was in the other half of the thread - it's not
>> possible to magically discover these things from within Openstack's own
>> code because the relevant settings span more than just one server
>>
>> IMO it's better to have a default of 1500 rather than let VMs
>> automatically default to 1500 because at least we will deduct the encap
>> header length when necessary in the dhcp/ra advertised value so overlays
>> work on standard 1500 MTU networks.
>>
>> In other words, our current empty default is realistically a terrible
>> default of 1500 that doesn't account for network segmentation overhead.
>>
> It's pretty clear that, while the current setup is precisely the old
> behaviour (backward compatibility, y'know?), it's not very useful.  Problem
> is, anyone using the 1550+hacks and other methods of today will find that
> their system's behaviour changes if we start setting that specific default.
>
> Regardless, we need to take that documentation and update it.  It was a
> nasty hack back in the day and not remotely a good idea now.
>
>
>
>> On Jan 24, 2016 23:00, "Ian Wells"  wrote:
>>
>>> On 24 January 2016 at 20:18, Kevin Benton  wrote:
>>>
 I believe the issue is that the default is unspecified, which leads to
 nothing being advertised to VMs via dhcp/ra. So VMs end up using 1500,
 which leads to a catastrophe when running on an overlay on a 1500 underlay.

>>> That's not quite the point I was making here, but to answer that: looks
>>> to me like (for the LB or OVS drivers to appropriately set the network MTU
>>> for the virtual network, at which point it will be advertised because
>>> advertise_mtu defaults to True in the code) you *must* set one or more of
>>> path_mtu (for L3 overlays), segment_mtu (for L2 overlays) or physnet_mtu
>>> (for L2 overlays with differing MTUs on different physical networks).
>>> That's a statement of faith - I suspect if we try it we'll find a few
>>> niggling problems - but I can find the code, at least.
>>>
>>> The reason for that was in the other half of the thread - it's not
>>> possible to magically discover these things from within Openstack's own
>>> code because the relevant settings span more than just one server.  They
>>> have to line up with both your MTU settings for the interfaces in use, and
>>> the MTU settings for the other equipment within and neighbouring the cloud
>>> - switches, routers, nexthops.  So they have to be provided by the operator
>>> - then everything you want should kick in.
>>>
>>> If all of that is true, it really is just a documentation problem - we
>>> have the idea in place, we're just not telling people how to make use of
>>> it.  We can also include a checklist or a check script with that
>>> documentation - you might not be able to deduce the MTU values, but you can
>>> certainly run some checks to see if the values you have been given are
>>> obviously wrong.
>>>
>>> In the meantime, Matt K, you said you hadn't set path_mtu in your tests,
>>> but [1] says you have to ([1] is far from end-user consumable
>>> documentation, which again illustrates our problem).
>>>
>>> Can you set both path_mtu and segment_mtu to whatever value your switch
>>> MTU is (1500 or 9000), confirm your outbound interface MTU is the same
>>> (1500 or 9000), and see if that changes things?  At this point, you should
>>> find that your networks get appropriate 1500/9000 MTUs on VLAN based
>>> networks and 1450/8950 MTUs on VXLAN networks, that they're advertised to
>>> your VMs via DHCP and RA, and that your routers even know that different
>>> interfaces have different MTUs in a mixed environment, at least if
>>> everything is working as intended.
>>> --
>>> Ian.
>>>
>>> [1]
>>> https://github.com/openstack/neutron/blob/544ff57bcac00720f54a75eb34916218cb248213/releasenotes/notes/advertise_mtu_by_default-d8b0b056a74517b8.yaml#L5
>>>
>>>
 On Jan 24, 2016 20:48, "Ian Wells"  wrote:

> On 23 January 2016 at 11:27, Adam Lawson  wrote:
>
>> For the sake of over-simplification, is there ever a reason to NOT
>> enable jumbo frames in a cloud/SDN context where most of the traffic is
>> between virtual elements that all support it? I understand that some
>> switches do not support it and traffic from the web doesn't support it
>> either but besides that, seems like a default "jumboframes = 1" concept
>> would work just fine to me.
>>
>
> Offhand:
>
> 1. you don't want the latency increase that comes with 9000 byte
> 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-23 Thread Adam Lawson
For the sake of over-simplification, is there ever a reason to NOT enable
jumbo frames in a cloud/SDN context where most of the traffic is between
virtual elements that all support it? I understand that some switches do
not support it and traffic from the web doesn't support it either but
besides that, seems like a default "jumboframes = 1" concept would work
just fine to me.

Then again I'm all about making OpenStack easier to consume so my ideas
tend to gloss over special use cases with special requirements.


*Adam Lawson*

AQORN, Inc.
427 North Tatnall Street
Ste. 58461
Wilmington, Delaware 19801-2230
Toll-free: (844) 4-AQORN-NOW ext. 101
International: +1 302-387-4660
Direct: +1 916-246-2072

On Fri, Jan 22, 2016 at 7:13 PM, Matt Kassawara 
wrote:

> The fun continues, now using an OpenStack deployment on physical hardware
> that supports jumbo frames with 9000 MTU and IPv4/IPv6. This experiment
> still uses Linux bridge for consistency. I'm planning to run similar
> experiments with Open vSwitch and Open Virtual Network (OVN) in the next
> week.
>
> I highly recommend reading further, but here's the TL;DR: Using physical
> network interfaces with MTUs larger than 1500 reveals an additional problem
> with the veth pair for the neutron router interface on the public network.
> Additionally, IP protocol version does not impact MTU calculation for
> Linux bridge.
>
> First, review the OpenStack bits and resulting network components in the
> environment [1]. In the first experiment, public cloud network limitations
> prevented truly seeing how Linux bridge (actually the kernel) handles
> physical network interfaces with MTUs larger than 1500. In this experiment,
> we see that it automatically calculates the proper MTU for bridges and
> VXLAN interfaces using the MTU of parent devices. Also, see that a regular
> 'ping' works between the host outside of the deployment and the VM [2].
>
> [1] https://gist.github.com/ionosphere80/a3725066386d8ca4c6d7
> [2] https://gist.github.com/ionosphere80/a8d601a356ac6c6274cb
>
> Note: The tcpdump output in each case references up to six points: neutron
> router gateway on the public network (qg), namespace end of the veth pair
> for the neutron router interface on the private network (qr), bridge end of
> the veth pair for router interface on the private network (tap), controller
> node end of the VXLAN network (underlying interface), compute node end of
> the VXLAN network (underlying interface), and the bridge end of the tap for
> the VM (tap).
>
> In the first experiment, SSH "stuck" because of a MTU mismatch on the veth
> pair between the router namespace and private network bridge. In this
> experiment, SSH works because the VM network interface uses a 1500 MTU and
> all devices along the path between the host and VM use a 1500 or larger
> MTU. So, let's configure the VM network interface to use the proper MTU of
> 9000 minus the VXLAN protocol overhead of 50 bytes... 8950... and try SSH
> again.
>
> 2: eth0:  mtu 8950 qdisc pfifo_fast qlen
> 1000
> link/ether fa:16:3e:46:ac:d3 brd ff:ff:ff:ff:ff:ff
> inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
> inet6 fd00:100:52:1:f816:3eff:fe46:acd3/64 scope global dynamic
>valid_lft 86395sec preferred_lft 14395sec
> inet6 fe80::f816:3eff:fe46:acd3/64 scope link
>valid_lft forever preferred_lft forever
>
> SSH doesn't work with IPv4 or IPv6. Adding a slight twist to the first
> experiment, I don't even see the large packet traversing the neutron
> router gateway on the public network. So, I began a tcpdump closer to the
> source on the bridge end of the veth pair for the neutron router
> interface on the public network.
>
> Looking at [3], the veth pair between the router namespace and private
> network bridge drops the packet. The MTU changes over a layer-2 connection
> without a router, similar to connecting two switches with different MTUs.
> Even if it could participate in PMTUD, the veth pair lacks an IP address
> and therefore cannot originate ICMP messages.
>
> [3] https://gist.github.com/ionosphere80/ec83d0955c79b05ea381
>
> Using observations from the first experiment, let's configure the MTU of
> the interfaces in the qrouter namespace to match the other end of their
> respective veth pairs. The public network (gateway) interface MTU becomes
> 9000 and the private network router interfaces (IPv4 and IPv6) become 8950.
>
> 2: qr-49b27408-04:  mtu 8950 qdisc
> pfifo_fast state UP mode DEFAULT group default qlen 1000
> link/ether fa:16:3e:e5:43:1c brd ff:ff:ff:ff:ff:ff
> 3: qr-b7e0ef22-32:  mtu 8950 qdisc
> pfifo_fast state UP mode DEFAULT group default qlen 1000
> link/ether fa:16:3e:16:01:92 brd ff:ff:ff:ff:ff:ff
> 4: qg-7bbe8e38-cc:  mtu 9000 qdisc
> pfifo_fast state UP mode DEFAULT group default qlen 1000
> link/ether 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-23 Thread Mike Spreitzer
Adam Lawson  wrote on 01/23/2016 02:27:46 PM:

> For the sake of over-simplification, is there ever a reason to NOT 
> enable jumbo frames in a cloud/SDN context where most of the traffic
> is between virtual elements that all support it? I understand that 
> some switches do not support it and traffic from the web doesn't 
> support it either but besides that, seems like a default 
> "jumboframes = 1" concept would work just fine to me.
> 
> Then again I'm all about making OpenStack easier to consume so my 
> ideas tend to gloss over special use cases with special requirements.

Regardless of the default, there needs to be clear documentation on what 
to do for those of us who can not use jumbo frames, and it needs to work. 
That goes for production deployers and also for developers using DevStack.

Thanks,
Mike



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-23 Thread Matt Kassawara
Adam,

Any modern datacenter network, especially those with 10 Gbps or faster
connectivity, should support jumbo frames for performance reasons. However,
depending on the network infrastructure, jumbo frames does not always mean
a 9000 MTU, so neutron should support a configurable value rather than a
boolean. I envision one configuration option containing the physical
network MTU that neutron uses to calculate the MTU of all virtual network
components. Mike... this mechanism should work for any physical network
MTU, large or small.

Matt
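
As a back-of-the-envelope illustration of that calculation for a 9000-byte
physical network - the 50-byte VXLAN overhead is the figure used elsewhere in
this thread, and the GRE figure is approximate since it depends on header
options:

PHYS_MTU=9000
echo "flat/VLAN network MTU:    $PHYS_MTU"            # no extra overhead
echo "VXLAN network MTU:        $((PHYS_MTU - 50))"   # 8950
echo "GRE network MTU (approx): $((PHYS_MTU - 42))"   # 8958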

On Sat, Jan 23, 2016 at 3:28 PM, Mike Spreitzer  wrote:

> Adam Lawson  wrote on 01/23/2016 02:27:46 PM:
>
> > For the sake of over-simplification, is there ever a reason to NOT
> > enable jumbo frames in a cloud/SDN context where most of the traffic
> > is between virtual elements that all support it? I understand that
> > some switches do not support it and traffic from the web doesn't
> > support it either but besides that, seems like a default
> > "jumboframes = 1" concept would work just fine to me.
> >
> > Then again I'm all about making OpenStack easier to consume so my
> > ideas tend to gloss over special use cases with special requirements.
>
> Regardless of the default, there needs to be clear documentation on what
> to do for those of us who can not use jumbo frames, and it needs to work.
> That goes for production deployers and also for developers using DevStack.
>
> Thanks,
> Mike
>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-22 Thread Matt Kassawara
The fun continues, now using an OpenStack deployment on physical hardware
that supports jumbo frames with 9000 MTU and IPv4/IPv6. This experiment
still uses Linux bridge for consistency. I'm planning to run similar
experiments with Open vSwitch and Open Virtual Network (OVN) in the next
week.

I highly recommend reading further, but here's the TL;DR: Using physical
network interfaces with MTUs larger than 1500 reveals an additional problem
with the veth pair for the neutron router interface on the public network.
Additionally, IP protocol version does not impact MTU calculation for Linux
bridge.

First, review the OpenStack bits and resulting network components in the
environment [1]. In the first experiment, public cloud network limitations
prevented truly seeing how Linux bridge (actually the kernel) handles
physical network interfaces with MTUs larger than 1500. In this experiment,
we see that it automatically calculates the proper MTU for bridges and
VXLAN interfaces using the MTU of parent devices. Also, see that a regular
'ping' works between the host outside of the deployment and the VM [2].

[1] https://gist.github.com/ionosphere80/a3725066386d8ca4c6d7
[2] https://gist.github.com/ionosphere80/a8d601a356ac6c6274cb

Note: The tcpdump output in each case references up to six points: neutron
router gateway on the public network (qg), namespace end of the veth pair
for the neutron router interface on the private network (qr), bridge end of
the veth pair for router interface on the private network (tap), controller
node end of the VXLAN network (underlying interface), compute node end of
the VXLAN network (underlying interface), and the bridge end of the tap for
the VM (tap).

In the first experiment, SSH "stuck" because of a MTU mismatch on the veth
pair between the router namespace and private network bridge. In this
experiment, SSH works because the VM network interface uses a 1500 MTU and
all devices along the path between the host and VM use a 1500 or larger
MTU. So, let's configure the VM network interface to use the proper MTU of
9000 minus the VXLAN protocol overhead of 50 bytes... 8950... and try SSH
again.

2: eth0:  mtu 8950 qdisc pfifo_fast qlen
1000
link/ether fa:16:3e:46:ac:d3 brd ff:ff:ff:ff:ff:ff
inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
inet6 fd00:100:52:1:f816:3eff:fe46:acd3/64 scope global dynamic
   valid_lft 86395sec preferred_lft 14395sec
inet6 fe80::f816:3eff:fe46:acd3/64 scope link
   valid_lft forever preferred_lft forever
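
For reference, a sketch of how one would set that by hand inside the guest
when DHCP does not advertise it - eth0 being the usual single-NIC cloud image
interface name:

# ip link set dev eth0 mtu 8950

or, for a persistent setting on a Debian/Ubuntu-style image, a
"post-up ip link set dev eth0 mtu 8950" line in the eth0 stanza of
/etc/network/interfaces.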

SSH doesn't work with IPv4 or IPv6. Adding a slight twist to the first
experiment, I don't even see the large packet traversing the neutron router
gateway on the public network. So, I began a tcpdump closer to the source
on the bridge end of the veth pair for the neutron router interface on the
public network.

Looking at [3], the veth pair between the router namespace and private
network bridge drops the packet. The MTU changes over a layer-2 connection
without a router, similar to connecting two switches with different MTUs.
Even if it could participate in PMTUD, the veth pair lacks an IP address
and therefore cannot originate ICMP messages.

[3] https://gist.github.com/ionosphere80/ec83d0955c79b05ea381

Using observations from the first experiment, let's configure the MTU of
the interfaces in the qrouter namespace to match the other end of their
respective veth pairs. The public network (gateway) interface MTU becomes
9000 and the private network router interfaces (IPv4 and IPv6) become 8950.
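
The commands that produce this state look roughly like the following (router
UUID elided; port names are the ones shown below):

# ip netns exec qrouter-<uuid> ip link set dev qg-7bbe8e38-cc mtu 9000
# ip netns exec qrouter-<uuid> ip link set dev qr-49b27408-04 mtu 8950
# ip netns exec qrouter-<uuid> ip link set dev qr-b7e0ef22-32 mtu 8950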

2: qr-49b27408-04:  mtu 8950 qdisc
pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether fa:16:3e:e5:43:1c brd ff:ff:ff:ff:ff:ff
3: qr-b7e0ef22-32:  mtu 8950 qdisc
pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether fa:16:3e:16:01:92 brd ff:ff:ff:ff:ff:ff
4: qg-7bbe8e38-cc:  mtu 9000 qdisc
pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether fa:16:3e:2b:c1:fd brd ff:ff:ff:ff:ff:ff

Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the
maximum for a VXLAN segment with 8950 MTU, and look at the tcpdump output
[4]. For brevity, I'm only showing tcpdump output from the VM tap
interface. Ping operates normally.

# ping -c 1 -s 8922 -M do 10.100.52.104

# ping -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:fe46:acd3

[4] https://gist.github.com/ionosphere80/85339b587bb9b2693b07
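
For anyone reproducing this, those payload sizes fall straight out of the
header arithmetic:

  8950 - 20 (IPv4 header) - 8 (ICMP header)   = 8922
  8950 - 40 (IPv6 header) - 8 (ICMPv6 header) = 8902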

Let's ping with a payload size of 8923 for IPv4 and 8903 for IPv6, one byte
larger than the maximum for a VXLAN segment with 8950 MTU. The router
namespace, operating at layer-3, sees the MTU discrepancy between the two
interfaces in the namespace and returns an ICMP "fragmentation needed" or
"packet too big" message to the sender. The sender uses the MTU value in
the ICMP packet to recalculate the length of the first packet and caches it
for future packets.

# ping -c 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-20 Thread Sean M. Collins
On Tue, Jan 19, 2016 at 08:15:18AM EST, Matt Kassawara wrote:
> No. However, we ought to determine what happens when both DHCP and RA
> advertise it.

We'd have to look at the RFCs for how hosts are supposed to behave since
IPv6 has a minimum MTU of 1280 bytes while IPv4's minimum MTU is 576
(what is this, an MTU for ants?).

-- 
Sean M. Collins

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-20 Thread Rick Jones

On 01/20/2016 08:56 AM, Sean M. Collins wrote:

On Tue, Jan 19, 2016 at 08:15:18AM EST, Matt Kassawara wrote:

No. However, we ought to determine what happens when both DHCP and RA
advertise it.


We'd have to look at the RFCs for how hosts are supposed to behave since
IPv6 has a minimum MTU of 1280 bytes while IPv4's minimum mtu is 576
(what is this, an MTU for ants?).


Quibble - 576 is the IPv4 minimum *maximum datagram size*, not the minimum
MTU.  That is to say a compliant IPv4 implementation must be able to
reassemble datagrams of at least 576 bytes.


If memory serves, the actual minimum MTU for IPv4 is 68 bytes.

rick jones

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-20 Thread Kevin Benton
Maybe it will do the obvious thing and add them together. ;)
On Jan 20, 2016 12:03, "Sean M. Collins"  wrote:

> On Tue, Jan 19, 2016 at 08:15:18AM EST, Matt Kassawara wrote:
> > No. However, we ought to determine what happens when both DHCP and RA
> > advertise it.
>
> We'd have to look at the RFCs for how hosts are supposed to behave since
> IPv6 has a minimum MTU of 1280 bytes while IPv4's minimum mtu is 576
> (what is this, an MTU for ants?).
>
> --
> Sean M. Collins
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-19 Thread Matt Kassawara
No. However, we ought to determine what happens when both DHCP and RA
advertise it.
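
One way to see what an instance is actually offered would be to capture both
protocols from inside the guest while it boots - a sketch, with eth0 assumed
as the guest interface:

# tcpdump -vvv -n -i eth0 '(udp port 67 or udp port 68) or icmp6'

With -vvv, tcpdump decodes the DHCP interface-MTU option (option 26) in the
offer/ack and the MTU option carried in IPv6 router advertisements, so the
two values can be compared directly.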

On Tue, Jan 19, 2016 at 12:36 AM, Kevin Benton  wrote:

> >Yup. We mostly attempt to do that now.
>
> Right, but not by default. Can you think of a scenario where advertising
> it would be harmful?
> On Jan 18, 2016 23:57, "Matt Kassawara"  wrote:
>
>>
>>
>> On Mon, Jan 18, 2016 at 4:14 PM, Kevin Benton  wrote:
>>
>>> Thanks for the awesome writeup.
>>>
>>> >5) A bridge or veth pair with an IP address can participate in path
>>> MTU discovery (PMTUD). However, these devices do not appear to understand
>>> namespaces and originate the ICMP message from the host instead of a
>>> namespace. Therefore, the message never reaches the destination...
>>> typically a host outside of the deployment.
>>>
>>> I suspect this is because we don't put the bridges into namespaces. Even
>>> if we did do this, we would need to allocate IP addresses for every compute
>>> node to use to chat on the network...
>>>
>>
>> Yup. Moving the MTU disparity to the first layer-3 device a packet
>> traverses inbound to a VM saves us from burning IPs too.
>>
>>
>>>
>>>
>>>
>>> >At least for the Linux bridge agent, I think we can address ingress
>>> MTU disparity (to the VM) by moving it to the first device in the chain
>>> capable of layer-3 operations, particularly the neutron router namespace.
>>> We can address the egress MTU disparity (from the VM) by advertising the
>>> MTU of the overlay network to the VM via DHCP/RA or using manual interface
>>> configuration.
>>>
>>> So when setting up DHCP for the subnet, would telling the DHCP agent to
>>> use an MTU we calculate based on (global MTU value - network encap
>>> overhead) achieve what you are suggesting here?
>>>
>>
>> Yup. We mostly attempt to do that now.
>>
>> On Fri, Jan 15, 2016 at 10:41 AM, Sean M. Collins 
 wrote:

> MTU has been an ongoing issue in Neutron for _years_.
>
> It's such a hassle, that most people just throw up their hands and set
> their physical infrastructure to jumbo frames. We even document it.
>
>
> http://docs.openstack.org/juno/install-guide/install/apt-debian/content/neutron-network-node.html
>
> > Ideally, you can prevent these problems by enabling jumbo frames on
> > the physical network that contains your tenant virtual networks.
> Jumbo
> > frames support MTUs up to approximately 9000 bytes which negates the
> > impact of GRE overhead on virtual networks.
>
> We've pushed this onto operators and deployers. There's a lot of
> code in provisioning projects to handle MTUs.
>
> http://codesearch.openstack.org/?q=MTU=nope==
>
> We have mentions to it in our architecture design guide
>
>
> http://git.openstack.org/cgit/openstack/openstack-manuals/tree/doc/arch-design/source/network-focus-architecture.rst#n150
>
> I want to get Neutron to the point where it starts discovering this
> information and automatically configuring, in the optimistic cases. I
> understand that it can be complex and have corner cases, but the issue
> we have today is that it is broken in some multinode jobs, and even Neutron
> developers aren't configuring it correctly.
>
> I also had this discussion on the DevStack side in
> https://review.openstack.org/#/c/112523/
> where basically, sure we can fix it in DevStack and at the gate, but it
> doesn't fix the problem for anyone who isn't using DevStack to deploy
> their cloud.
>
> Today we have a ton of MTU configuration options sprinkled throughout
> the
> L3 agent, dhcp agent, l2 agents, and at least one API extension to the
> REST API for handling MTUs.
>
> So yeah, a lot of knobs and not a lot of documentation on how to make
> this thing work correctly. I'd like to try and simplify.
>
>
> Further reading:
>
>
> http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html
>
> http://lists.openstack.org/pipermail/openstack/2013-October/001778.html
>
>
> https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/
>
>
> https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/
>
>
> http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/
>
> https://twitter.com/search?q=openstack%20neutron%20MTU
>
> --
> Sean M. Collins
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe:
> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-18 Thread Matt Kassawara
On Mon, Jan 18, 2016 at 4:14 PM, Kevin Benton  wrote:

> Thanks for the awesome writeup.
>
> >5) A bridge or veth pair with an IP address can participate in path MTU
> discovery (PMTUD). However, these devices do not appear to understand
> namespaces and originate the ICMP message from the host instead of a
> namespace. Therefore, the message never reaches the destination...
> typically a host outside of the deployment.
>
> I suspect this is because we don't put the bridges into namespaces. Even
> if we did do this, we would need to allocate IP addresses for every compute
> node to use to chat on the network...
>

Yup. Moving the MTU disparity to the first layer-3 device a packet
traverses inbound to a VM saves us from burning IPs too.


>
>
>
> >At least for the Linux bridge agent, I think we can address ingress MTU
> disparity (to the VM) by moving it to the first device in the chain capable
> of layer-3 operations, particularly the neutron router namespace. We can
> address the egress MTU disparity (from the VM) by advertising the MTU of
> the overlay network to the VM via DHCP/RA or using manual interface
> configuration.
>
> So when setting up DHCP for the subnet, would telling the DHCP agent to
> use an MTU we calculate based on (global MTU value - network encap
> overhead) achieve what you are suggesting here?
>

Yup. We mostly attempt to do that now.
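
For the curious, the advertisement ultimately becomes DHCP option 26 (the
interface-MTU option) in the dnsmasq instance the agent spawns. A deployer
who wants to force it today without relying on the agent can use a dnsmasq
extra-config file - a sketch, with the value as a placeholder:

# contents of a file referenced by dnsmasq_config_file in dhcp_agent.ini
dhcp-option-force=26,1450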

On Fri, Jan 15, 2016 at 10:41 AM, Sean M. Collins 
>> wrote:
>>
>>> MTU has been an ongoing issue in Neutron for _years_.
>>>
>>> It's such a hassle, that most people just throw up their hands and set
>>> their physical infrastructure to jumbo frames. We even document it.
>>>
>>>
>>> http://docs.openstack.org/juno/install-guide/install/apt-debian/content/neutron-network-node.html
>>>
>>> > Ideally, you can prevent these problems by enabling jumbo frames on
>>> > the physical network that contains your tenant virtual networks. Jumbo
>>> > frames support MTUs up to approximately 9000 bytes which negates the
>>> > impact of GRE overhead on virtual networks.
>>>
>>> We've pushed this onto operators and deployers. There's a lot of
>>> code in provisioning projects to handle MTUs.
>>>
>>> http://codesearch.openstack.org/?q=MTU=nope==
>>>
>>> We have mentions to it in our architecture design guide
>>>
>>>
>>> http://git.openstack.org/cgit/openstack/openstack-manuals/tree/doc/arch-design/source/network-focus-architecture.rst#n150
>>>
>>> I want to get Neutron to the point where it starts discovering this
>>> information and automatically configuring, in the optimistic cases. I
>>> understand that it can be complex and have corner cases, but the issue
>>> we have today is that it is broken in some multinode jobs, and even Neutron
>>> developers aren't configuring it correctly.
>>>
>>> I also had this discussion on the DevStack side in
>>> https://review.openstack.org/#/c/112523/
>>> where basically, sure we can fix it in DevStack and at the gate, but it
>>> doesn't fix the problem for anyone who isn't using DevStack to deploy
>>> their cloud.
>>>
>>> Today we have a ton of MTU configuration options sprinkled throughout the
>>> L3 agent, dhcp agent, l2 agents, and at least one API extension to the
>>> REST API for handling MTUs.
>>>
>>> So yeah, a lot of knobs and not a lot of documentation on how to make
>>> this thing work correctly. I'd like to try and simplify.
>>>
>>>
>>> Further reading:
>>>
>>>
>>> http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html
>>>
>>> http://lists.openstack.org/pipermail/openstack/2013-October/001778.html
>>>
>>>
>>> https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/
>>>
>>>
>>> https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/
>>>
>>>
>>> http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/
>>>
>>> https://twitter.com/search?q=openstack%20neutron%20MTU
>>>
>>> --
>>> Sean M. Collins
>>>
>>>
>>> __
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe:
>>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
>
>
> --
> Kevin Benton
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-18 Thread Kevin Benton
>Yup. We mostly attempt to do that now.

Right, but not by default. Can you think of a scenario where advertising it
would be harmful?
On Jan 18, 2016 23:57, "Matt Kassawara"  wrote:

>
>
> On Mon, Jan 18, 2016 at 4:14 PM, Kevin Benton  wrote:
>
>> Thanks for the awesome writeup.
>>
>> >5) A bridge or veth pair with an IP address can participate in path MTU
>> discovery (PMTUD). However, these devices do not appear to understand
>> namespaces and originate the ICMP message from the host instead of a
>> namespace. Therefore, the message never reaches the destination...
>> typically a host outside of the deployment.
>>
>> I suspect this is because we don't put the bridges into namespaces. Even
>> if we did do this, we would need to allocate IP addresses for every compute
>> node to use to chat on the network...
>>
>
> Yup. Moving the MTU disparity to the first layer-3 device a packet
> traverses inbound to a VM saves us from burning IPs too.
>
>
>>
>>
>>
>> >At least for the Linux bridge agent, I think we can address ingress MTU
>> disparity (to the VM) by moving it to the first device in the chain capable
>> of layer-3 operations, particularly the neutron router namespace. We can
>> address the egress MTU disparity (from the VM) by advertising the MTU of
>> the overlay network to the VM via DHCP/RA or using manual interface
>> configuration.
>>
>> So when setting up DHCP for the subnet, would telling the DHCP agent to
>> use an MTU we calculate based on (global MTU value - network encap
>> overhead) achieve what you are suggesting here?
>>
>
> Yup. We mostly attempt to do that now.
>
> On Fri, Jan 15, 2016 at 10:41 AM, Sean M. Collins 
>>> wrote:
>>>
 MTU has been an ongoing issue in Neutron for _years_.

 It's such a hassle, that most people just throw up their hands and set
 their physical infrastructure to jumbo frames. We even document it.


 http://docs.openstack.org/juno/install-guide/install/apt-debian/content/neutron-network-node.html

 > Ideally, you can prevent these problems by enabling jumbo frames on
 > the physical network that contains your tenant virtual networks. Jumbo
 > frames support MTUs up to approximately 9000 bytes which negates the
 > impact of GRE overhead on virtual networks.

 We've pushed this onto operators and deployers. There's a lot of
 code in provisioning projects to handle MTUs.

 http://codesearch.openstack.org/?q=MTU=nope==

 We have mentions to it in our architecture design guide


 http://git.openstack.org/cgit/openstack/openstack-manuals/tree/doc/arch-design/source/network-focus-architecture.rst#n150

 I want to get Neutron to the point where it starts discovering this
 information and automatically configuring, in the optimistic cases. I
 understand that it can be complex and have corner cases, but the issue
 we have today is that it is broken in some multinode jobs, and even Neutron
 developers aren't configuring it correctly.

 I also had this discussion on the DevStack side in
 https://review.openstack.org/#/c/112523/
 where basically, sure we can fix it in DevStack and at the gate, but it
 doesn't fix the problem for anyone who isn't using DevStack to deploy
 their cloud.

 Today we have a ton of MTU configuration options sprinkled throughout the
 L3 agent, dhcp agent, l2 agents, and at least one API extension to the
 REST API for handling MTUs.

 So yeah, a lot of knobs and not a lot of documentation on how to make
 this thing work correctly. I'd like to try and simplify.


 Further reading:


 http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html

 http://lists.openstack.org/pipermail/openstack/2013-October/001778.html


 https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/


 https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/


 http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/

 https://twitter.com/search?q=openstack%20neutron%20MTU

 --
 Sean M. Collins


 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe:
 openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

>>>
>>>
>>>
>>> __
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe:
>>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>>
>>
>>
>> --
>> Kevin Benton
>>
>> 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-18 Thread John Griffith
On Sun, Jan 17, 2016 at 8:30 PM, Matt Kassawara 
wrote:

> Prior attempts to solve the MTU problem in neutron simply band-aid it or
> become too complex from feature creep or edge cases that mask the primary
> goal of a simple implementation that works for most deployments. So, I ran
> some experiments to empirically determine the root cause of MTU problems in
> common neutron deployments using the Linux bridge agent. I plan to perform
> these experiments again using the Open vSwitch agent... after sufficient
> mental recovery.
>
> I highly recommend reading further, but here's the TL;DR:
>
> Observations...
>
> 1) During creation of a VXLAN interface, Linux automatically subtracts the
> VXLAN protocol overhead from the MTU of the parent interface.
> 2) A veth pair or tap with a different MTU on each end drops packets
> larger than the smaller MTU.
> 3) Linux automatically adjusts the MTU of a bridge to the lowest MTU of
> all the ports. Therefore, Linux reduces the typical bridge MTU from 1500 to
> 1450 when neutron adds a VXLAN interface to it.
> 4) A bridge with different MTUs on each port drops packets larger than the
> MTU of the bridge.
> 5) A bridge or veth pair with an IP address can participate in path MTU
> discovery (PMTUD). However, these devices do not appear to understand
> namespaces and originate the ICMP message from the host instead of a
> namespace. Therefore, the message never reaches the destination...
> typically a host outside of the deployment.
>
> Conclusion...
>
> The MTU disparity between native and overlay networks must reside in a
> device capable of layer-3 operations that can participate in PMTUD, such as
> the neutron router between a private/project overlay network and a
> public/external native network.
>
> Some background...
>
> In a typical datacenter network, MTU must remain consistent within a
> layer-2 network because fragmentation and the mechanism indicating the need
> for it occurs at layer-3. In other words, all host interfaces and switch
> ports on the same layer-2 network must use the same MTU. If the layer-2
> network connects to a router, the router port must also use the same MTU. A
> router can contain ports on multiple layer-2 networks with different MTUs
> because it operates on those networks at layer-3. If the MTU changes
> between ports on a router and devices on those layer-2 networks attempt to
> communicate at layer-3, the router can perform a couple of actions. For
> IPv4, the router can fragment the packet. However, if the packet contains
> the "don't fragment" (DF) flag, the router can either silently drop the
> packet or return an ICMP "fragmentation needed" message to the sender. This
> ICMP message contains the MTU of the next layer-2 network in the route
> between the sender and receiver. Each router in the path can return these
> ICMP messages to the sender until it learns the maximum MTU for the entire
> path, also known as path MTU discovery (PMTUD). IPv6 routers do not
> fragment packets; only the sending host can.
>
> The cloud provides a virtual extension of a physical network. In the
> simplest sense, patch cables become veth pairs, switches become bridges,
> and routers become namespaces. Therefore, MTU implementation for virtual
> networks should mimic physical networks where MTU changes must occur within
> a router at layer-3.
>
> For these experiments, my deployment contains one controller and one
> compute node. Neutron uses the ML2 plug-in and Linux bridge agent. The
> configuration does not contain any MTU options (e.g., path_mtu). One VM with
> a floating IP address attaches to a VXLAN private network that routes to a
> flat public network. The DHCP agent does not advertise MTU to the VM. My
> lab resides on public cloud infrastructure with networks that filter
> unknown MAC addresses such as those that neutron generates for virtual
> network components. Let's talk about the implications and workarounds.
>
> The VXLAN protocol contains 50 bytes of overhead. Linux automatically
> calculates the MTU of VXLAN devices by subtracting 50 bytes from the parent
> device, in this case a standard Ethernet interface with a 1500 MTU.
> However, due the limitations of public cloud networks, I must create a
> VXLAN tunnel between the controller node and a host outside of the
> deployment to simulate traffic from a datacenter network. This tunnel
> effectively reduces the "native" MTU from 1500 to 1450. Therefore, I need
> to subtract an additional 50 bytes from neutron VXLAN network components,
> essentially emulating the 50-byte difference between conventional neutron
> VXLAN networks and native networks. The host outside of the deployment
> assumes it can send packets using a 1450 MTU. The VM also assumes it can
> send packets using a 1450 MTU because the DHCP agent does not advertise a
> 1400 MTU to it.
>
> Let's get to it!
>
> Note: The commands in these experiments often generate lengthy output, so
> please refer to the gists when necessary.
>

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-18 Thread Kevin Benton
Thanks for the awesome writeup.

>5) A bridge or veth pair with an IP address can participate in path MTU
discovery (PMTUD). However, these devices do not appear to understand
namespaces and originate the ICMP message from the host instead of a
namespace. Therefore, the message never reaches the destination...
typically a host outside of the deployment.

I suspect this is because we don't put the bridges into namespaces. Even if
we did do this, we would need to allocate IP addresses for every compute
node to use to chat on the network...


>At least for the Linux bridge agent, I think we can address ingress MTU
disparity (to the VM) by moving it to the first device in the chain capable
of layer-3 operations, particularly the neutron router namespace. We can
address the egress MTU disparity (from the VM) by advertising the MTU of
the overlay network to the VM via DHCP/RA or using manual interface
configuration.

So when setting up DHCP for the subnet, would telling the DHCP agent to use
an MTU we calculate based on (global MTU value - network encap overhead)
achieve what you are suggesting here?
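
Since that calculation leans on the kernel doing the right thing underneath,
the behaviour described in observations 1 and 3 of the write-up quoted below
is easy to reproduce on any Linux box with a 1500-MTU interface - device
names here are placeholders:

# ip link add vxlan-demo type vxlan id 100 dev eth0 dstport 4789
# ip -o link show vxlan-demo | grep -o 'mtu [0-9]*'
mtu 1450
# ip link add br-demo type bridge
# ip link set vxlan-demo master br-demo
# ip -o link show br-demo | grep -o 'mtu [0-9]*'
mtu 1450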

On Sun, Jan 17, 2016 at 10:30 PM, Matt Kassawara 
wrote:

> Prior attempts to solve the MTU problem in neutron simply band-aid it or
> become too complex from feature creep or edge cases that mask the primary
> goal of a simple implementation that works for most deployments. So, I ran
> some experiments to empirically determine the root cause of MTU problems in
> common neutron deployments using the Linux bridge agent. I plan to perform
> these experiments again using the Open vSwitch agent... after sufficient
> mental recovery.
>
> I highly recommend reading further, but here's the TL;DR:
>
> Observations...
>
> 1) During creation of a VXLAN interface, Linux automatically subtracts the
> VXLAN protocol overhead from the MTU of the parent interface.
> 2) A veth pair or tap with a different MTU on each end drops packets
> larger than the smaller MTU.
> 3) Linux automatically adjusts the MTU of a bridge to the lowest MTU of
> all the ports. Therefore, Linux reduces the typical bridge MTU from 1500 to
> 1450 when neutron adds a VXLAN interface to it.
> 4) A bridge with different MTUs on each port drops packets larger than the
> MTU of the bridge.
> 5) A bridge or veth pair with an IP address can participate in path MTU
> discovery (PMTUD). However, these devices do not appear to understand
> namespaces and originate the ICMP message from the host instead of a
> namespace. Therefore, the message never reaches the destination...
> typically a host outside of the deployment.
>
> Conclusion...
>
> The MTU disparity between native and overlay networks must reside in a
> device capable of layer-3 operations that can participate in PMTUD, such as
> the neutron router between a private/project overlay network and a
> public/external native network.
>
> Some background...
>
> In a typical datacenter network, MTU must remain consistent within a
> layer-2 network because fragmentation and the mechanism indicating the need
> for it occurs at layer-3. In other words, all host interfaces and switch
> ports on the same layer-2 network must use the same MTU. If the layer-2
> network connects to a router, the router port must also use the same MTU. A
> router can contain ports on multiple layer-2 networks with different MTUs
> because it operates on those networks at layer-3. If the MTU changes
> between ports on a router and devices on those layer-2 networks attempt to
> communicate at layer-3, the router can perform a couple of actions. For
> IPv4, the router can fragment the packet. However, if the packet contains
> the "don't fragment" (DF) flag, the router can either silently drop the
> packet or return an ICMP "fragmentation needed" message to the sender. This
> ICMP message contains the MTU of the next layer-2 network in the route
> between the sender and receiver. Each router in the path can return these
> ICMP messages to the sender until it learns the maximum MTU for the entire
> path, also known as path MTU discovery (PMTUD). IPv6 routers do not
> fragment packets; only the sending host can.
>
> The cloud provides a virtual extension of a physical network. In the
> simplest sense, patch cables become veth pairs, switches become bridges,
> and routers become namespaces. Therefore, MTU implementation for virtual
> networks should mimic physical networks where MTU changes must occur within
> a router at layer-3.
>
> For these experiments, my deployment contains one controller and one
> compute node. Neutron uses the ML2 plug-in and Linux bridge agent. The
> configuration does not contain any MTU options (e.g., path_mtu). One VM with
> a floating IP address attaches to a VXLAN private network that routes to a
> flat public network. The DHCP agent does not advertise MTU to the VM. My
> lab resides on public cloud infrastructure with networks that filter
> unknown MAC addresses such as those that neutron generates for virtual
> network 

Re: [openstack-dev] [Neutron] MTU configuration pain

2016-01-18 Thread Kevin Benton
The MTU setting is an issue because it involves knowledge of the network
outside of openstack. That's why it was just a config value that was
expected to be set by an operator. This thread is working to see if we can
figure that out, or maybe at least come up with a different sub-optimal
default.

For the floating IP thing, do you need floating IPs? If not, using the
'provider networking' workflow is much simpler if you don't want tenant
virtual routers and whatnot:
http://docs.openstack.org/liberty/networking-guide/scenario_provider_lb.html

On Mon, Jan 18, 2016 at 4:06 PM, John Griffith 
wrote:

>
>
> On Sun, Jan 17, 2016 at 8:30 PM, Matt Kassawara 
> wrote:
>
>> Prior attempts to solve the MTU problem in neutron simply band-aid it or
>> become too complex from feature creep or edge cases that mask the primary
>> goal of a simple implementation that works for most deployments. So, I ran
>> some experiments to empirically determine the root cause of MTU problems in
>> common neutron deployments using the Linux bridge agent. I plan to perform
>> these experiments again using the Open vSwitch agent... after sufficient
>> mental recovery.
>>
>> I highly recommend reading further, but here's the TL;DR:
>>
>> Observations...
>>
>> 1) During creation of a VXLAN interface, Linux automatically subtracts
>> the VXLAN protocol overhead from the MTU of the parent interface.
>> 2) A veth pair or tap with a different MTU on each end drops packets
>> larger than the smaller MTU.
>> 3) Linux automatically adjusts the MTU of a bridge to the lowest MTU of
>> all the ports. Therefore, Linux reduces the typical bridge MTU from 1500 to
>> 1450 when neutron adds a VXLAN interface to it.
>> 4) A bridge with different MTUs on each port drops packets larger than
>> the MTU of the bridge.
>> 5) A bridge or veth pair with an IP address can participate in path MTU
>> discovery (PMTUD). However, these devices do not appear to understand
>> namespaces and originate the ICMP message from the host instead of a
>> namespace. Therefore, the message never reaches the destination...
>> typically a host outside of the deployment.
>>
>> Conclusion...
>>
>> The MTU disparity between native and overlay networks must reside in a
>> device capable of layer-3 operations that can participate in PMTUD, such as
>> the neutron router between a private/project overlay network and a
>> public/external native network.
>>
>> Some background...
>>
>> In a typical datacenter network, MTU must remain consistent within a
>> layer-2 network because fragmentation and the mechanism indicating the need
>> for it occurs at layer-3. In other words, all host interfaces and switch
>> ports on the same layer-2 network must use the same MTU. If the layer-2
>> network connects to a router, the router port must also use the same MTU. A
>> router can contain ports on multiple layer-2 networks with different MTUs
>> because it operates on those networks at layer-3. If the MTU changes
>> between ports on a router and devices on those layer-2 networks attempt to
>> communicate at layer-3, the router can perform a couple of actions. For
>> IPv4, the router can fragment the packet. However, if the packet contains
>> the "don't fragment" (DF) flag, the router can either silently drop the
>> packet or return an ICMP "fragmentation needed" message to the sender. This
>> ICMP message contains the MTU of the next layer-2 network in the route
>> between the sender and receiver. Each router in the path can return these
>> ICMP messages to the sender until it learns the maximum MTU for the entire
>> path, also known as path MTU discovery (PMTUD). IPv6 routers do not
>> fragment packets; only the sending host can.
>>
>> The cloud provides a virtual extension of a physical network. In the
>> simplest sense, patch cables become veth pairs, switches become bridges,
>> and routers become namespaces. Therefore, MTU implementation for virtual
>> networks should mimic physical networks where MTU changes must occur within
>> a router at layer-3.
>>
>> For these experiments, my deployment contains one controller and one
>> compute node. Neutron uses the ML2 plug-in and Linux bridge agent. The
>> configuration does not contain any MTU options (e.g., path_mtu). One VM with
>> a floating IP address attaches to a VXLAN private network that routes to a
>> flat public network. The DHCP agent does not advertise MTU to the VM. My
>> lab resides on public cloud infrastructure with networks that filter
>> unknown MAC addresses such as those that neutron generates for virtual
>> network components. Let's talk about the implications and workarounds.
>>
>> The VXLAN protocol contains 50 bytes of overhead. Linux automatically
>> calculates the MTU of VXLAN devices by subtracting 50 bytes from the parent
>> device, in this case a standard Ethernet interface with a 1500 MTU.
>> However, due to the limitations of public cloud networks, I must create a
>> VXLAN tunnel between the controller