Re: [ovs-dev] [RFC ovn] Multi-chassis port + MTU behavior

2022-10-17 Thread Ihar Hrachyshka
An initial (incomplete) approach to the idea can be seen here:

https://patchwork.ozlabs.org/project/ovn/patch/20221017210546.120517-1-ihrac...@redhat.com/

There are still lots of missing parts (support for only one of the two
directions; an issue with the local port size check that needs
investigating; Geneve info leaking into the embedded ICMP error packet;
missing IPv6 support, to name a few). It is posted to gather thoughts
on the local delivery issue / controller action definition.

Please take a look if you have spare cycles.

Thanks,
Ihar


Re: [ovs-dev] [RFC ovn] Multi-chassis port + MTU behavior

2022-07-21 Thread Ihar Hrachyshka
On Wed, Jul 20, 2022 at 3:45 PM Numan Siddique  wrote:
> I lean towards (b) too. I suppose we only need this special handling
> until the migration is complete and the multi-chassis option is
> cleared by the CMS.

Yes, in the case of short-term use of the feature (e.g. for live
migration). If we are talking about persistent multi-chassis ports
(e.g. to clone traffic), this may be more of a problem. I'd expect that
in that case MTUs would be carefully aligned between the tunneling
interfaces and the physical networks, so that any packet directed
towards a physical network also fits through the tunneling interface.
(This implication is already documented in ovn-nb.xml, so what is
discussed here is more of an optimization for environments that don't
abide by the expected MTU configuration.)
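
As a rough sketch of what "carefully aligned" means in practice (the
interface name is hypothetical, and ~58 bytes of Geneve-over-IPv4
overhead is assumed):

  # Give the NIC carrying tunnel traffic enough headroom to fit a
  # full-size localnet frame (1500) plus the Geneve encapsulation.
  ip link set eth1 mtu 1600
  # Or, going the other way, cap the MTU advertised to overlay
  # workloads at (tunnel NIC MTU - 58), e.g. 1442 for a 1500-byte NIC.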


Re: [ovs-dev] [RFC ovn] Multi-chassis port + MTU behavior

2022-07-20 Thread Numan Siddique
On Mon, Jul 18, 2022 at 7:44 PM Ihar Hrachyshka  wrote:
> I currently lean towards (b), though it's not a universal fix since
> it requires cooperation from the underlying network stack. But (a)
> leaves cloning broken, and (c) is even more invasive, acting on a
> port's packets (fragmenting them) without the port owner's knowledge.

I lean towards (b) too. I suppose we only need this special handling
until the migration is complete and the multi-chassis option is cleared
by the CMS.

For (a), how would OVN decide if the packet is oversized? Using the
check_pkt_larger OVS action?
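
For reference, this is roughly how the existing router-port gateway_mtu
handling detects oversized packets (the stage, register, and size below
are illustrative, reproduced from memory rather than from this RFC):

  # Set a register bit when the L2 frame exceeds the configured size
  # (the gateway MTU plus room for the L2 header, e.g. 1500 + 18):
  match=(outport == "lrp0"),
      action=(reg9[1] = check_pkt_larger(1518); next;)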

I've no idea how (c) can be done.

Thanks
Numan


[ovs-dev] [RFC ovn] Multi-chassis port + MTU behavior

2022-07-18 Thread Ihar Hrachyshka
Hi folks,

Looking for some advice on MTU / IP behavior for multi-chassis ports.

OVN 22.06 introduced the multichassis port concept, where the same port
may be present on multiple chassis. Among other things, this can be
used as a performance optimization for VM live migration: migration
with sub-0.1 ms network downtime is achieved by cloning all traffic
directed towards a multichassis port to all of its locations, so that
every chassis hosting the port receives the relevant traffic at any
given moment until the migration is complete.
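
For context, a port becomes multichassis roughly like this (hypothetical
port and chassis names; the exact option syntax is reproduced from
memory rather than taken from this thread):

  # Request the port on both the source and the destination hypervisor
  # for the duration of the migration.
  ovn-nbctl lsp-set-options migrator requested-chassis=hv1,hv2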

For logical switches with a localnet port, where traffic usually goes
through the localnet port and not through tunnels, this means enforcing
tunneling of egress and ingress traffic for a multichassis port. (The
rest of the traffic, between other, non-multichassis ports, keeps going
through the localnet port.) Tunneling is enforced because traffic sent
through the localnet port won't generally be cloned to all port binding
locations by the upstream switch.

A problem arises when ingress or egress traffic of a multichassis port,
redirected through a tunnel, gets lost because of the Geneve header
overhead, or because the interface used for tunneling has an MTU too
small relative to the physical bridge backing the localnet port.

This happens when:
  - tunnel_iface_mtu < localnet_mtu + geneve_overhead
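
To put rough numbers on it (assuming Geneve over IPv4 as used by OVN,
which adds about 58 bytes: outer IP 20 + UDP 8 + Geneve header and OVN
option 16 + inner Ethernet 14):

  # Illustrative values only:
  #   localnet_mtu     = 1500
  #   geneve_overhead  = 58
  #   tunnel_iface_mtu = 1500
  # 1500 < 1500 + 58, so a full-size frame arriving from the localnet
  # side cannot be tunneled without loss; the tunnel interface would
  # need an MTU of at least 1558.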

This makes for an unfortunate situation where, for a multichassis port,
SOME traffic (e.g. regular ICMP requests) passes through without any
problems, while OTHER traffic (e.g. packets produced with 'ping -s')
doesn't. (A test scenario demonstrating this is included below.)

Here are ideas I came up with on how this could be resolved or at least
mitigated:

a) pass "oversized" traffic through localnet, and the rest through
tunnels. Apart from confusing pattern on the wire where packets that
belong to the same TCP session may go through two paths (arguably not a
big problem and should be expected by other network hardware), this gets
us back to the problem of upstream switch not delivering packets to all
binding chassis.

b) fragmentation needed. We could send ICMP Fragmentation Needed errors
on attempts to send oversized packets to or from a multichassis port.
TCP sessions could then adjust their path MTU to reflect the new
recommended value. We already have a similar mechanism for router ports
(a rough sketch of what such a flow could look like follows the caveats
below). There are several caveats and limitations with (b):

- AFAIU this works for some protocols but not others. It also depends
  on the client network stack honoring the change in the network path.
  TCP should probably work.

- It may be confusing for users that some L2 paths between ports of the
  same switch have a reduced MTU while others have the regular one,
  depending on the type of port (single- or multi-chassis).
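
As a rough sketch (not part of this RFC; the match, register, and
frag_mtu value are illustrative), the flows could mirror what
ovn-northd already generates for router ports with gateway_mtu set,
pairing check_pkt_larger with icmp4_error:

  match=(ip4 && reg9[1] == 1), action=(
      icmp4_error {
          icmp4.type = 3;         /* Destination Unreachable. */
          icmp4.code = 4;         /* Frag Needed and DF was Set. */
          icmp4.frag_mtu = 1442;  /* e.g. tunnel MTU minus overhead. */
          /* The real router-port flows also swap the Ethernet and IP
             addresses and reset the TTL before re-entering the
             pipeline. */
          next(pipeline=ingress, table=0);
      };
  )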

c) implement fragmentation inside the L2 domain. I actually don't know
whether that's even compliant with the RFCs. Usually packet
fragmentation is implemented at the L2 domain boundary, by a router. In
this scenario, a peer port on the same switch would receive fragments
of a packet that was sent as a single piece.

I currently lean towards (b), though it's not a universal fix since it
requires cooperation from the underlying network stack. But (a) leaves
cloning broken, and (c) is even more invasive, acting on a port's
packets (fragmenting them) without the port owner's knowledge.

Perhaps this is all unnecessary and there's a way to make OVN
transparently split and reassemble packets as needed, though I doubt
it, since OVN doesn't encapsulate tunneled traffic into another
application layer. But if there's a way to achieve transparent
reassembly, or if there are other alternatives beyond (a)-(c), let me
know. Please let me know what you think of (a)-(c) regardless.

Thanks,
Ihar
---
 tests/ovn.at | 92 
 1 file changed, 92 insertions(+)

diff --git a/tests/ovn.at b/tests/ovn.at
index c346975e6..4ec1d37c3 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -14966,6 +14966,98 @@ OVN_CLEANUP([hv1],[hv2],[hv3])
 AT_CLEANUP
 ])
 
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([localnet connectivity with multiple requested-chassis, max mtu])
+AT_KEYWORDS([multi-chassis])
+ovn_start
+
+net_add n1
+for i in 1 2; do
+sim_add hv$i
+as hv$i
+check ovs-vsctl add-br br-phys
+ovn_attach n1 br-phys 192.168.0.$i
+check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
+done
+
+check ovn-nbctl ls-add ls0
+check ovn-nbctl lsp-add ls0 first
+check ovn-nbctl lsp-add ls0 second
+check ovn-nbctl lsp-add ls0 migrator
+check ovn-nbctl lsp-set-addresses first "00:00:00:00:00:01 10.0.0.1"
+check ovn-nbctl lsp-set-addresses second "00:00:00:00:00:02 10.0.0.2"
+check ovn-nbctl lsp-set-addresses migrator "00:00:00:00:00:ff 10.0.0.100"
+
+check ovn-nbctl lsp-add ls0 public
+check ovn-nbctl lsp-set-type public localnet
+check ovn-nbctl lsp-set-addresses public unknown
+check ovn-nbctl lsp-set-options public network_name=phys
+
+check ovn-nbctl lsp-set-options