Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Han Zhou via discuss
On Sat, Sep 30, 2023 at 9:56 AM Vladislav Odintsov 
wrote:
>
>
>
> regards,
> Vladislav Odintsov
>
> > On 30 Sep 2023, at 16:50, Robin Jarry  wrote:
> >
> > Hi Vladislav, Frode,
> >
> > Thanks for your replies.
> >
> > Frode Nordahl, Sep 30, 2023 at 10:55:
> >> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
> >>  wrote:
>  On 29 Sep 2023, at 18:14, Robin Jarry via discuss <ovs-discuss@openvswitch.org> wrote:
> 
>  Felix Huettner, Sep 29, 2023 at 15:23:
> >> Distributed mac learning
> >> 
>  [snip]
> >>
> >> Cons:
> >>
> >> - How to manage seamless upgrades?
> >> - Requires ovn-controller to move/plug ports in the correct bridge.
> >> - Multiple openflow connections (one per managed bridge).
> >> - Requires ovn-trace to be reimplemented differently (maybe other tools
> >> as well).
> >
> > - No central information anymore on mac bindings. All nodes need to
> > update their data individually
> > - Each bridge generates also a linux network interface. I do not know if
> > there is some kind of limit to the linux interfaces or the ovs bridges
> > somewhere.
> 
>  That's a good point. However, only the bridges related to one
>  implemented logical network would need to be created on a single
>  chassis. Even with the largest OVN deployments, I doubt this would be
>  a limitation.
> 
> > Would you still preprovision static mac addresses on the bridge for all
> > port_bindings we know the mac address from, or would you rather leave
> > that up for learning as well?
> 
>  I would leave everything dynamic.
> 
> > I do not know if there is some kind of performance/optimization penalty
> > for moving packets between different bridges.
> 
>  As far as I know, once the openflow pipeline has been resolved into
>  a datapath flow, there is no penalty.
> 
> > You also cannot use only the logical switches that have a local port
> > bound. Assume the following topology:
> > +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> > +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > vm1 and vm2 are both running on the same hypervisor. Creating only local
> > logical switches would mean only ls1 and ls3 are available on that
> > hypervisor. This would break the connection between the two vms which
> > would in the current implementation just traverse the two logical
> > routers.
> > I guess we would need to create bridges for each locally reachable
> > logical switch. I am concerned about the potentially significant
> > increase in bridges and openflow connections this brings.
> 
>  That is one of the concerns I raised in the last point. In my opinion
>  this is a trade off. You remove centralization and require more local
>  processing. But overall, the processing cost should remain equivalent.
> >>>
> >>> Just want to clarify.
> >>>
> >>> For topology described by Felix above, you propose to create 2 OVS
> >>> bridges, right? How will the packet traverse from vm1 to vm2?
> >
> > In this particular case, there would be 3 OVS bridges, one for each
> > logical switch.
>
> Yeah, agree, this is a typo. Below I named three bridges :).
>
> >
> >>> Currently when the packet enters OVS all the logical switching and
> >>> routing openflow calculation is done with no packet re-entering OVS,
> >>> and this results in one DP flow match to deliver this packet from
> >>> vm1 to vm2 (if no conntrack used, which could introduce
> >>> recirculations).
> >>>
> >>> Do I understand correctly, that in this proposal OVS needs to
> >>> receive packet from “ls1” bridge, next run through lrouter “lr1”
> >>> OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
> >>> learning between logical routers (should we have here OF flow with
> >>> learn action?), then send packet again to OVS, calculate “lr2”
> >>> OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
> >>> send packet to a vm2?
> >
> > What I am proposing is to implement the northbound L2 network intent
> > with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
> > constructs and ACLs would require patch ports and specific OF pipelines.
> >
> > We could even think of adding more advanced L3 capabilities (RIB) into
> > OVS to simplify the OF pipelines.
>
> But this will make OVS<->kernel interaction more complex. Even if we forget
> about dpdk environments…
>
> >
> >>> Also, will such behavior be compatible with HW-offload-capable
> >>> smartnics/DPUs?
> >>
> >> I am also a bit concerned about this, what would be the typical number
> >> of bridges supported by hardware?
> >
> > As far as I understand, only the datapath flows are offloaded to
> > hardware. The OF pipeline is only parsed when there is an upcall for the
> > first packet. Once resolved, the datapath 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Han Zhou via discuss
On Thu, Sep 28, 2023 at 9:28 AM Robin Jarry  wrote:
>
> Hello OVN community,
>
> This is a follow up on the message I have sent today [1]. That second
> part focuses on some ideas I have to remove the limitations that were
> mentioned in the previous email.
>
> [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052695.html
>
> If you didn't read it, my goal is to start a discussion about how we
> could improve OVN on the following topics:
>
> - Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
> - Support scaling of L2 connectivity across larger clusters.
> - Simplify CMS interoperability.
> - Allow support for alternative datapath implementations.
>
> Disclaimer:
>
> This message does not mention anything about L3/L4 features of OVN.
> I didn't have time to work on these, yet. I hope we can discuss how
> these fit with my ideas.
>

Hi Robin and folks, thanks for the great discussions!
I read the replies in the two other threads of this email, but I am replying
directly here to comment on some of the original statements. I will reply in
the other threads on specific points.

> Distributed mac learning
> 
>
> Use one OVS bridge per logical switch with mac learning enabled. Only
> create the bridge if the logical switch has a port bound to the local
> chassis.
>
> Pros:
>
> - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> - No central mac binding table required.

Firstly to clarify the terminology of "mac binding" to avoid confusion, the
mac_binding table currently in SB DB has nothing to do with L2 MAC
learning. It is actually the ARP/Neighbor table of distributed logical
routers. We should probably call it IP_MAC_binding table, or just Neighbor
table.
Here what you mean is actually L2 MAC learning, which today is implemented
by the FDB table in the SB DB, and is used only for the uncommon case where
the NB doesn't know the MAC address of a VIF.
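As a side note for readers, the difference between the two SB tables can be
seen by simply dumping them; a minimal sketch, assuming ovn-sbctl is installed
and can reach the southbound DB:

```
# Dump the two southbound tables discussed above to see the difference between
# router neighbor entries (MAC_Binding) and learned L2 entries (FDB).
import subprocess

def dump(table: str) -> str:
    # 'ovn-sbctl list <table>' prints every row of the given SB table.
    return subprocess.run(["ovn-sbctl", "list", table],
                          capture_output=True, text=True, check=True).stdout

# ARP/ND neighbor cache of distributed logical routers (ip <-> mac per port).
print(dump("MAC_Binding"))
# L2 FDB, used only when the NB DB does not know a VIF's MAC ("unknown" ports).
print(dump("FDB"))
```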

> - Mac table aging comes for free.
> - Zero access to southbound DB for learned addresses nor for aging.
>
> Cons:
>
> - How to manage seamless upgrades?
> - Requires ovn-controller to move/plug ports in the correct bridge.
> - Multiple openflow connections (one per managed bridge).
> - Requires ovn-trace to be reimplemented differently (maybe other tools
>   as well).
>
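As a rough illustration of the proposal quoted above, a per-logical-switch
bridge relying on OVS's builtin MAC learning (the default NORMAL flow of a new
bridge) could be set up like this; bridge and port names are made up, and
ovs-vsctl is assumed to be available:

```
# Sketch of the "one OVS bridge per logical switch" idea. A freshly created
# bridge has a single NORMAL flow, so builtin MAC learning, flooding and FDB
# aging apply without any OpenFlow rules.
import subprocess

def vsctl(*args: str) -> None:
    subprocess.run(["ovs-vsctl", *args], check=True)

LS_BRIDGE = "ls-0a1b2c"   # hypothetical per-logical-switch bridge
VIF = "tap-vm1"           # hypothetical VIF already created by the CMS

vsctl("--may-exist", "add-br", LS_BRIDGE)
vsctl("--may-exist", "add-port", LS_BRIDGE, VIF)
# FDB aging and sizing come from the bridge itself ("mac table aging for free").
vsctl("set", "bridge", LS_BRIDGE,
      "other-config:mac-aging-time=300",
      "other-config:mac-table-size=8192")
```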

The purpose of this proposal is clear - avoid a central DB table for L2
information and instead use L2 MAC learning to populate that information on
each chassis - which is a reasonable alternative with pros and cons.
However, I don't think it is necessary to use separate OVS bridges for this
purpose. L2 MAC learning can easily be implemented in the br-int bridge
with OVS flows, which is much simpler than managing a dynamic number of OVS
bridges just for the sake of using the builtin OVS mac-learning.
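A minimal sketch of that alternative, using the OpenFlow learn action on a
single bridge; the bridge name, table numbers and timeout are illustrative
choices against a standalone test bridge (not the actual OVN br-int pipeline),
and ovs-ofctl is assumed to be available:

```
# MAC learning with OpenFlow "learn" actions on one bridge instead of one
# bridge per logical switch.
import subprocess

BR = "br-test"  # hypothetical standalone test bridge

FLOWS = [
    # table 0: learn "eth_dst == this packet's eth_src -> output to this
    # packet's in_port" into table 1, then resubmit for the destination lookup.
    "table=0 actions=learn(table=1, hard_timeout=300,"
    " NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[], output:NXM_OF_IN_PORT[]),"
    " resubmit(,1)",
    # table 1: destinations with no learned flow yet are flooded.
    "table=1 priority=0 actions=flood",
]

for flow in FLOWS:
    subprocess.run(["ovs-ofctl", "add-flow", BR, flow], check=True)
```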

Now back to the distributed MAC learning idea itself. Essentially, for two
VMs/pods to communicate on L2, say VM1@Chassis1 sending a packet to
VM2@Chassis2 (assuming VM1 already has VM2's MAC address; we will discuss
this later), Chassis1 needs to know that VM2's MAC is located on Chassis2.
In OVN today this information is conveyed through:
- MAC and LSP mapping (NB -> northd -> SB -> Chassis)
- LSP and Chassis mapping (Chassis -> SB -> Chassis)

In your proposal:
- MAC and Chassis mapping (can be learned through initial L2
broadcast/flood)

This would indeed avoid the control plane cost through the centralized
components (for this L2 binding part). Given that today's SB OVSDB is a
bottleneck, this idea may sound attractive. But please also take into
consideration the improvements below, which could mitigate the OVN central
scaling issue:
- For MAC and LSP mapping, northd is now capable of incrementally
processing VIF-related L2/L3 changes, so the cost of NB -> northd -> SB is
very small. For SB -> Chassis, a more scalable DB deployment, such as
OVSDB relays, may largely help.
- For LSP and Chassis mapping, the round trip through a central DB
obviously costs more than a direct L2 broadcast (the targets are the
same). But this can be optimized if the MAC-to-Chassis mapping is known by
the CMS (which I believe is true for most openstack/k8s environments).
Instead of updating the binding from each Chassis, the CMS can convey this
information through the same NB -> northd -> SB -> Chassis path, and the
Chassis can just read the SB without updating it.
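A hedged sketch of the OVSDB relay mitigation mentioned above; hostnames,
ports and the number of relays are placeholders:

```
# Scale the SB -> Chassis read path with ovsdb relays.
import subprocess

SB_CLUSTER = "tcp:sb0:6642"   # central SB DB (could be a raft cluster)

# On a relay node: ovsdb-server serves a relay copy of OVN_Southbound that
# nearby chassis connect to for their (mostly read-only) traffic.
relay_cmd = ["ovsdb-server", "--remote=ptcp:6642",
             f"relay:OVN_Southbound:{SB_CLUSTER}"]
print(" ".join(relay_cmd))    # printed rather than executed in this sketch

# On each chassis: point ovn-controller at a nearby relay instead of the
# central cluster (multiple comma-separated remotes can also be given).
subprocess.run(["ovs-vsctl", "set", "open_vswitch", ".",
                "external_ids:ovn-remote=tcp:relay-a:6642"], check=True)
```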

On the other hand, the dynamic MAC learning approach has its own drawbacks.
- It is simple when considering L2 only, but for richer SDN features a
central DB is more flexible to extend with new features than a
network-protocol-based approach.
- It is more predictable and easier to debug with information pre-populated
through the CMS than with state learned dynamically in the data plane.
- With the DB approach we can suppress most of L2 broadcast/flood, while
with the distributed MAC learning broadcast/flood can't be avoided.
Although it may 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Vladislav Odintsov via discuss


regards,
Vladislav Odintsov

> On 30 Sep 2023, at 16:50, Robin Jarry  wrote:
> 
> Hi Vladislav, Frode,
> 
> Thanks for your replies.
> 
> Frode Nordahl, Sep 30, 2023 at 10:55:
>> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
>>  wrote:
 On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
  wrote:
 
 Felix Huettner, Sep 29, 2023 at 15:23:
>> Distributed mac learning
>> 
 [snip]
>> 
>> Cons:
>> 
>> - How to manage seamless upgrades?
>> - Requires ovn-controller to move/plug ports in the correct bridge.
>> - Multiple openflow connections (one per managed bridge).
>> - Requires ovn-trace to be reimplemented differently (maybe other tools
>> as well).
> 
> - No central information anymore on mac bindings. All nodes need to
> update their data individually
> - Each bridge generates also a linux network interface. I do not know if
> there is some kind of limit to the linux interfaces or the ovs bridges
> somewhere.
 
 That's a good point. However, only the bridges related to one
 implemented logical network would need to be created on a single
 chassis. Even with the largest OVN deployments, I doubt this would be
 a limitation.
 
> Would you still preprovision static mac addresses on the bridge for all
> port_bindings we know the mac address from, or would you rather leave
> that up for learning as well?
 
 I would leave everything dynamic.
 
> I do not know if there is some kind of performance/optimization penalty
> for moving packets between different bridges.
 
 As far as I know, once the openflow pipeline has been resolved into
 a datapath flow, there is no penalty.
 
> You also cannot use only the logical switches that have a local port
> bound. Assume the following topology:
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> vm1 and vm2 are both running on the same hypervisor. Creating only local
> logical switches would mean only ls1 and ls3 are available on that
> hypervisor. This would break the connection between the two vms which
> would in the current implementation just traverse the two logical
> routers.
> I guess we would need to create bridges for each locally reachable
> logical switch. I am concerned about the potentially significant
> increase in bridges and openflow connections this brings.
 
 That is one of the concerns I raised in the last point. In my opinion
 this is a trade off. You remove centralization and require more local
 processing. But overall, the processing cost should remain equivalent.
>>> 
>>> Just want to clarify.
>>> 
>>> For topology described by Felix above, you propose to create 2 OVS
>>> bridges, right? How will the packet traverse from vm1 to vm2?
> 
> In this particular case, there would be 3 OVS bridges, one for each
> logical switch.

Yeah, agree, this is a typo. Below I named three bridges :).

> 
>>> Currently when the packet enters OVS all the logical switching and
>>> routing openflow calculation is done with no packet re-entering OVS,
>>> and this results in one DP flow match to deliver this packet from
>>> vm1 to vm2 (if no conntrack used, which could introduce
>>> recirculations).
>>> 
>>> Do I understand correctly, that in this proposal OVS needs to
>>> receive packet from “ls1” bridge, next run through lrouter “lr1”
>>> OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
>>> learning between logical routers (should we have here OF flow with
>>> learn action?), then send packet again to OVS, calculate “lr2”
>>> OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
>>> send packet to a vm2?
> 
> What I am proposing is to implement the northbound L2 network intent
> with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
> constructs and ACLs would require patch ports and specific OF pipelines.
> 
> We could even think of adding more advanced L3 capabilities (RIB) into
> OVS to simplify the OF pipelines.

But this will make OVS<->kernel interaction more complex. Even if we forget 
about dpdk environments…

> 
> >>> Also, will such behavior be compatible with HW-offload-capable
>>> smartnics/DPUs?
>> 
>> I am also a bit concerned about this, what would be the typical number
>> of bridges supported by hardware?
> 
> As far as I understand, only the datapath flows are offloaded to
> hardware. The OF pipeline is only parsed when there is an upcall for the
> first packet. Once resolved, the datapath flow is reused. OVS bridges
> are only logical constructs, they are neither reflected in the datapath
> nor in hardware.

As far as I remember from my tests against ConnectX-5/6 SmartNICs in ASAP^2
mode, HW offload is not capable of offloading OVS patch ports. At least it
was so 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Daniel Alvarez via discuss
Hi all,


> On 30 Sep 2023, at 15:51, Robin Jarry via discuss 
>  wrote:
> 
> Hi Vladislav, Frode,
> 
> Thanks for your replies.
> 
> Frode Nordahl, Sep 30, 2023 at 10:55:
>> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
>>  wrote:
 On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
  wrote:
 
 Felix Huettner, Sep 29, 2023 at 15:23:
>> Distributed mac learning
>> 
 [snip]
>> 
>> Cons:
>> 
>> - How to manage seamless upgrades?
>> - Requires ovn-controller to move/plug ports in the correct bridge.
>> - Multiple openflow connections (one per managed bridge).
>> - Requires ovn-trace to be reimplemented differently (maybe other tools
>> as well).
> 
> - No central information anymore on mac bindings. All nodes need to
> update their data individually
> - Each bridge generates also a linux network interface. I do not know if
> there is some kind of limit to the linux interfaces or the ovs bridges
> somewhere.
 
 That's a good point. However, only the bridges related to one
 implemented logical network would need to be created on a single
 chassis. Even with the largest OVN deployments, I doubt this would be
 a limitation.
 
> Would you still preprovision static mac addresses on the bridge for all
> port_bindings we know the mac address from, or would you rather leave
> that up for learning as well?
 
 I would leave everything dynamic.
 
> I do not know if there is some kind of performance/optimization penalty
> for moving packets between different bridges.
 
 As far as I know, once the openflow pipeline has been resolved into
 a datapath flow, there is no penalty.
 
> You also cannot use only the logical switches that have a local port
> bound. Assume the following topology:
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> vm1 and vm2 are both running on the same hypervisor. Creating only local
> logical switches would mean only ls1 and ls3 are available on that
> hypervisor. This would break the connection between the two vms which
> would in the current implementation just traverse the two logical
> routers.
> I guess we would need to create bridges for each locally reachable
> logical switch. I am concerned about the potentially significant
> increase in bridges and openflow connections this brings.
 
 That is one of the concerns I raised in the last point. In my opinion
 this is a trade off. You remove centralization and require more local
 processing. But overall, the processing cost should remain equivalent.
>>> 
>>> Just want to clarify.
>>> 
>>> For topology described by Felix above, you propose to create 2 OVS
>>> bridges, right? How will the packet traverse from vm1 to vm2?
> 
> In this particular case, there would be 3 OVS bridges, one for each
> logical switch.
> 
>>> Currently when the packet enters OVS all the logical switching and
>>> routing openflow calculation is done with no packet re-entering OVS,
>>> and this results in one DP flow match to deliver this packet from
>>> vm1 to vm2 (if no conntrack used, which could introduce
>>> recirculations).
>>> 
>>> Do I understand correctly, that in this proposal OVS needs to
>>> receive packet from “ls1” bridge, next run through lrouter “lr1”
>>> OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
>>> learning between logical routers (should we have here OF flow with
>>> learn action?), then send packet again to OVS, calculate “lr2”
>>> OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
>>> send packet to a vm2?
> 
> What I am proposing is to implement the northbound L2 network intent
> with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
> constructs and ACLs would require patch ports and specific OF pipelines.
> 
> We could even think of adding more advanced L3 capabilities (RIB) into
> OVS to simplify the OF pipelines.
> 
> >>> Also, will such behavior be compatible with HW-offload-capable
>>> smartnics/DPUs?
>> 
>> I am also a bit concerned about this, what would be the typical number
>> of bridges supported by hardware?
> 
> As far as I understand, only the datapath flows are offloaded to
> hardware. The OF pipeline is only parsed when there is an upcall for the
> first packet. Once resolved, the datapath flow is reused. OVS bridges
> are only logical constructs, they are neither reflected in the datapath
> nor in hardware.
> 
>> Use multicast for overlay networks
>> ==
 [snip]
>> - 24bit VNI allows for more than 16 million logical switches. No need
>> for extended GENEVE tunnel options.
> Note that using vxlan at the moment significantly reduces the ovn
> featureset. This is because the 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Robin Jarry via discuss
Hi Vladislav, Frode,

Thanks for your replies.

Frode Nordahl, Sep 30, 2023 at 10:55:
> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
>  wrote:
> > > On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
> > >  wrote:
> > >
> > > Felix Huettner, Sep 29, 2023 at 15:23:
> > >>> Distributed mac learning
> > >>> 
> > > [snip]
> > >>>
> > >>> Cons:
> > >>>
> > >>> - How to manage seamless upgrades?
> > >>> - Requires ovn-controller to move/plug ports in the correct bridge.
> > >>> - Multiple openflow connections (one per managed bridge).
> > >>> - Requires ovn-trace to be reimplemented differently (maybe other tools
> > >>>  as well).
> > >>
> > >> - No central information anymore on mac bindings. All nodes need to
> > >>  update their data individually
> > >> - Each bridge generates also a linux network interface. I do not know if
> > >>  there is some kind of limit to the linux interfaces or the ovs bridges
> > >>  somewhere.
> > >
> > > That's a good point. However, only the bridges related to one
> > > implemented logical network would need to be created on a single
> > > chassis. Even with the largest OVN deployments, I doubt this would be
> > > a limitation.
> > >
> > >> Would you still preprovision static mac addresses on the bridge for all
> > >> port_bindings we know the mac address from, or would you rather leave
> > >> that up for learning as well?
> > >
> > > I would leave everything dynamic.
> > >
> > >> I do not know if there is some kind of performance/optimization penalty
> > >> for moving packets between different bridges.
> > >
> > > As far as I know, once the openflow pipeline has been resolved into
> > > a datapath flow, there is no penalty.
> > >
> > >> You also cannot use only the logical switches that have a local port
> > >> bound. Assume the following topology:
> > >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > >> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> > >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > >> vm1 and vm2 are both running on the same hypervisor. Creating only local
> > >> logical switches would mean only ls1 and ls3 are available on that
> > >> hypervisor. This would break the connection between the two vms which
> > >> would in the current implementation just traverse the two logical
> > >> routers.
> > >> I guess we would need to create bridges for each locally reachable
> > >> logical switch. I am concerned about the potentially significant
> > >> increase in bridges and openflow connections this brings.
> > >
> > > That is one of the concerns I raised in the last point. In my opinion
> > > this is a trade off. You remove centralization and require more local
> > > processing. But overall, the processing cost should remain equivalent.
> >
> > Just want to clarify.
> >
> > For topology described by Felix above, you propose to create 2 OVS
> > bridges, right? How will the packet traverse from vm1 to vm2?

In this particular case, there would be 3 OVS bridges, one for each
logical switch.

> > Currently when the packet enters OVS all the logical switching and
> > routing openflow calculation is done with no packet re-entering OVS,
> > and this results in one DP flow match to deliver this packet from
> > vm1 to vm2 (if no conntrack used, which could introduce
> > recirculations).
> >
> > Do I understand correctly, that in this proposal OVS needs to
> > receive packet from “ls1” bridge, next run through lrouter “lr1”
> > OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
> > learning between logical routers (should we have here OF flow with
> > learn action?), then send packet again to OVS, calculate “lr2”
> > OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
> > send packet to a vm2?

What I am proposing is to implement the northbound L2 network intent
with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
constructs and ACLs would require patch ports and specific OF pipelines.

We could even think of adding more advanced L3 capabilities (RIB) into
OVS to simplify the OF pipelines.
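To make the patch-port part concrete for the vm1/ls1/lr1/ls2/lr2/ls3/vm2
example discussed earlier, a minimal sketch follows; bridge and port names are
illustrative, and the "router" behaviour itself would still have to be OF
pipelines applied on these patch ports. ovs-vsctl is assumed to be available:

```
# Stitch per-logical-switch bridges together with patch-port pairs.
import subprocess

def vsctl(*args: str) -> None:
    subprocess.run(["ovs-vsctl", *args], check=True)

def patch_pair(br_a: str, br_b: str) -> None:
    """Create a patch-port pair joining two OVS bridges."""
    a2b, b2a = f"{br_a}-to-{br_b}", f"{br_b}-to-{br_a}"
    vsctl("--may-exist", "add-port", br_a, a2b, "--",
          "set", "interface", a2b, "type=patch", f"options:peer={b2a}")
    vsctl("--may-exist", "add-port", br_b, b2a, "--",
          "set", "interface", b2a, "type=patch", f"options:peer={a2b}")

for br in ("ls1", "ls2", "ls3"):
    vsctl("--may-exist", "add-br", br)

# lr1 connects ls1<->ls2 and lr2 connects ls2<->ls3; here they are modelled as
# patch links whose routing/ACL behaviour would live in OpenFlow pipelines.
patch_pair("ls1", "ls2")
patch_pair("ls2", "ls3")
```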

> > Also, will such behavior be compatible with HW-offload-capable
> > smartnics/DPUs?
>
> I am also a bit concerned about this, what would be the typical number
> of bridges supported by hardware?

As far as I understand, only the datapath flows are offloaded to
hardware. The OF pipeline is only parsed when there is an upcall for the
first packet. Once resolved, the datapath flow is reused. OVS bridges
are only logical constructs, they are neither reflected in the datapath
nor in hardware.
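One way to see this distinction on a chassis, assuming ovs-appctl is available
and other_config:hw-offload has been enabled, is to compare offloaded and
software datapath flows:

```
# Offload applies to datapath megaflows, not to bridges or OF tables.
import subprocess

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout

# Datapath flows currently offloaded to the NIC, if any.
print(run(["ovs-appctl", "dpctl/dump-flows", "type=offloaded"]))
# Datapath flows handled in software, for comparison.
print(run(["ovs-appctl", "dpctl/dump-flows", "type=ovs"]))
```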

> > >>> Use multicast for overlay networks
> > >>> ==
> > > [snip]
> > >>> - 24bit VNI allows for more than 16 million logical switches. No need
> > >>>  for extended GENEVE tunnel options.
> > >> Note that using vxlan at the moment significantly reduces the ovn
> > >> featureset. This is because the geneve header options are currently used
> > >> for data that would not fit into 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Frode Nordahl via discuss
Thanks a lot for starting this discussion.

On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
 wrote:
>
> Hi Robin,
>
> Please, see inline.
>
> regards,
> Vladislav Odintsov
>
> > On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
> >  wrote:
> >
> > Felix Huettner, Sep 29, 2023 at 15:23:
> >>> Distributed mac learning
> >>> 
> > [snip]
> >>>
> >>> Cons:
> >>>
> >>> - How to manage seamless upgrades?
> >>> - Requires ovn-controller to move/plug ports in the correct bridge.
> >>> - Multiple openflow connections (one per managed bridge).
> >>> - Requires ovn-trace to be reimplemented differently (maybe other tools
> >>>  as well).
> >>
> >> - No central information anymore on mac bindings. All nodes need to
> >>  update their data individually
> >> - Each bridge generates also a linux network interface. I do not know if
> >>  there is some kind of limit to the linux interfaces or the ovs bridges
> >>  somewhere.
> >
> > That's a good point. However, only the bridges related to one
> > implemented logical network would need to be created on a single
> > chassis. Even with the largest OVN deployments, I doubt this would be
> > a limitation.
> >
> >> Would you still preprovision static mac addresses on the bridge for all
> >> port_bindings we know the mac address from, or would you rather leave
> >> that up for learning as well?
> >
> > I would leave everything dynamic.
> >
> >> I do not know if there is some kind of performance/optimization penalty
> >> for moving packets between different bridges.
> >
> > As far as I know, once the openflow pipeline has been resolved into
> > a datapath flow, there is no penalty.
> >
> >> You also cannot use only the logical switches that have a local port
> >> bound. Assume the following topology:
> >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> >> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> >> vm1 and vm2 are both running on the same hypervisor. Creating only local
> >> logical switches would mean only ls1 and ls3 are available on that
> >> hypervisor. This would break the connection between the two vms which
> >> would in the current implementation just traverse the two logical
> >> routers.
> >> I guess we would need to create bridges for each locally reachable
> >> logical switch. I am concerned about the potentially significant
> >> increase in bridges and openflow connections this brings.
> >
> > That is one of the concerns I raised in the last point. In my opinion
> > this is a trade off. You remove centralization and require more local
> > processing. But overall, the processing cost should remain equivalent.
>
> Just want to clarify.
> For topology described by Felix above, you propose to create 2 OVS bridges, 
> right? How will the packet traverse from vm1 to vm2?
>
> Currently when the packet enters OVS all the logical switching and routing 
> openflow calculation is done with no packet re-entering OVS, and this results 
> in one DP flow match to deliver this packet from vm1 to vm2 (if no conntrack 
> used, which could introduce recirculations).
> Do I understand correctly, that in this proposal OVS needs to receive packet 
> from “ls1” bridge, next run through lrouter “lr1” OpenFlow pipelines, then 
> output packet to “ls2” OVS bridge for mac learning between logical routers 
> (should we have here OF flow with learn action?), then send packet again to 
> OVS, calculate “lr2” OpenFlow pipeline and finally reach destination OVS 
> bridge “ls3” to send packet to a vm2?
>
> Also, will such behavior be compatible with HW-offload-capable
> smartnics/DPUs?

I am also a bit concerned about this, what would be the typical number
of bridges supported by hardware?

> >
> >>> Use multicast for overlay networks
> >>> ==
> > [snip]
> >>> - 24bit VNI allows for more than 16 million logical switches. No need
> >>>  for extended GENEVE tunnel options.
> >> Note that using vxlan at the moment significantly reduces the ovn
> >> featureset. This is because the geneve header options are currently used
> >> for data that would not fit into the vxlan vni.
> >>
> >> From ovn-architecture.7.xml:
> >> ```
> >> The maximum number of networks is reduced to 4096.
> >> The maximum number of ports per network is reduced to 2048.
> >> ACLs matching against logical ingress port identifiers are not supported.
> >> OVN interconnection feature is not supported.
> >> ```
> >
> > In my understanding, the main reason why GENEVE replaced VXLAN is
> > because Openstack uses full mesh point to point tunnels and that the
> > sender needs to know behind which chassis any mac address is to send it
> > into the correct tunnel. GENEVE allowed to reduce the lookup time both
> > on the sender and receiver thanks to ingress/egress port metadata.
> >
> > https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> > 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Vladislav Odintsov via discuss
Hi Robin,

Please, see inline.

regards,
Vladislav Odintsov

> On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
>  wrote:
> 
> Felix Huettner, Sep 29, 2023 at 15:23:
>>> Distributed mac learning
>>> 
> [snip]
>>> 
>>> Cons:
>>> 
>>> - How to manage seamless upgrades?
>>> - Requires ovn-controller to move/plug ports in the correct bridge.
>>> - Multiple openflow connections (one per managed bridge).
>>> - Requires ovn-trace to be reimplemented differently (maybe other tools
>>>  as well).
>> 
>> - No central information anymore on mac bindings. All nodes need to
>>  update their data individually
>> - Each bridge generates also a linux network interface. I do not know if
>>  there is some kind of limit to the linux interfaces or the ovs bridges
>>  somewhere.
> 
> That's a good point. However, only the bridges related to one
> implemented logical network would need to be created on a single
> chassis. Even with the largest OVN deployments, I doubt this would be
> a limitation.
> 
>> Would you still preprovision static mac addresses on the bridge for all
>> port_bindings we know the mac address from, or would you rather leave
>> that up for learning as well?
> 
> I would leave everything dynamic.
> 
>> I do not know if there is some kind of performance/optimization penalty
>> for moving packets between different bridges.
> 
> As far as I know, once the openflow pipeline has been resolved into
> a datapath flow, there is no penalty.
> 
>> You also cannot use only the logical switches that have a local port
>> bound. Assume the following topology:
>> +---+ +---+ +---+ +---+ +---+ +---+ +---+
>> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
>> +---+ +---+ +---+ +---+ +---+ +---+ +---+
>> vm1 and vm2 are both running on the same hypervisor. Creating only local
>> logical switches would mean only ls1 and ls3 are available on that
>> hypervisor. This would break the connection between the two vms which
>> would in the current implementation just traverse the two logical
>> routers.
>> I guess we would need to create bridges for each locally reachable
>> logical switch. I am concerned about the potentially significant
>> increase in bridges and openflow connections this brings.
> 
> That is one of the concerns I raised in the last point. In my opinion
> this is a trade off. You remove centralization and require more local
> processing. But overall, the processing cost should remain equivalent.

Just want to clarify.
For the topology described by Felix above, you propose to create 2 OVS bridges,
right? How will the packet traverse from vm1 to vm2?

Currently, when the packet enters OVS, all the logical switching and routing
openflow calculation is done with no packet re-entering OVS, and this results
in one DP flow match to deliver the packet from vm1 to vm2 (if no conntrack is
used, which could introduce recirculations).
Do I understand correctly that in this proposal OVS needs to receive the packet
from the “ls1” bridge, next run it through the lrouter “lr1” OpenFlow pipelines,
then output the packet to the “ls2” OVS bridge for mac learning between logical
routers (should we have an OF flow with a learn action here?), then send the
packet again to OVS, calculate the “lr2” OpenFlow pipeline and finally reach the
destination OVS bridge “ls3” to send the packet to vm2?

Also, will such behavior be compatible with HW-offload-capable
smartnics/DPUs?

> 
>>> Use multicast for overlay networks
>>> ==
> [snip]
>>> - 24bit VNI allows for more than 16 million logical switches. No need
>>>  for extended GENEVE tunnel options.
>> Note that using vxlan at the moment significantly reduces the ovn
>> featureset. This is because the geneve header options are currently used
>> for data that would not fit into the vxlan vni.
>> 
>> From ovn-architecture.7.xml:
>> ```
>> The maximum number of networks is reduced to 4096.
>> The maximum number of ports per network is reduced to 2048.
>> ACLs matching against logical ingress port identifiers are not supported.
>> OVN interconnection feature is not supported.
>> ```
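As context for the 4096/2048 numbers quoted above: with plain VXLAN the 24-bit
VNI has to carry what GENEVE option TLVs normally carry, i.e. both a logical
datapath (network) identifier and a port identifier. A 12 + 11 bit split (the
exact layout here is an assumption for illustration) yields those limits:

```
# Worked example of the VNI bit budget behind the quoted limits.
DATAPATH_BITS, PORT_BITS = 12, 11

def encode_vni(datapath_key: int, port_key: int) -> int:
    assert datapath_key < (1 << DATAPATH_BITS)   # at most 4096 networks
    assert port_key < (1 << PORT_BITS)           # at most 2048 ports/network
    return (datapath_key << PORT_BITS) | port_key

def decode_vni(vni: int) -> tuple[int, int]:
    return vni >> PORT_BITS, vni & ((1 << PORT_BITS) - 1)

print(f"{1 << DATAPATH_BITS} networks, {1 << PORT_BITS} ports per network")
print(decode_vni(encode_vni(42, 7)))             # -> (42, 7)
```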
> 
> In my understanding, the main reason why GENEVE replaced VXLAN is
> that Openstack uses full mesh point-to-point tunnels and the
> sender needs to know behind which chassis any mac address is in order to
> send it into the correct tunnel. GENEVE allowed reducing the lookup time
> both on the sender and the receiver thanks to ingress/egress port metadata.
> 
> https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
> 
> If VXLAN + multicast and address learning was used, the "correct" tunnel
> would be established ad-hoc and both sender and receiver lookups would
> only be a simple mac forwarding with learning. The ingress pipeline
> would probably cost a little more.
> 
> Maybe multicast + address learning could be implemented for GENEVE as
> well. But it would not be interoperable with other VTEPs.
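For context, the closest existing implementation of this model is probably the
Linux kernel's native vxlan device with a multicast group and source-address
learning; as far as I know, OVS tunnel ports themselves only take unicast or
flow-based remote_ips today, so the sketch below (with placeholder VNI, group,
uplink and port) only illustrates the mechanism:

```
# Multicast VXLAN with source learning via the kernel's native vxlan device.
import subprocess

def ip(*args: str) -> None:
    subprocess.run(["ip", *args], check=True)

# BUM traffic goes to the multicast group; unicast VTEP destinations are then
# learned per remote MAC from the outer source IP of returning traffic.
ip("link", "add", "vxlan100", "type", "vxlan",
   "id", "100", "group", "239.1.1.100", "dev", "eth0",
   "dstport", "4789", "learning")
ip("link", "set", "vxlan100", "up")
# Learned MAC -> VTEP entries can be inspected with: bridge fdb show dev vxlan100
```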
> 
>>> - Limited