Re: [ovs-discuss] NorthD inc-engine Handlers; OVN 24.03

2024-05-15 Thread Han Zhou via discuss
On Tue, May 14, 2024 at 12:31 AM Dumitru Ceara  wrote:
>
> On 5/8/24 18:01, Numan Siddique wrote:
> > On Wed, May 8, 2024 at 8:42 AM Шагов Георгий via discuss <
> > ovs-discuss@openvswitch.org> wrote:
> >
> >> Hello everyone
> >>
> >>
> >>
> >> In some aspect it might be considered as a continuation of this thread:
> >> (link1), yet it is different
> >>
> >> After we upgraded from OVN 22.03 to OVN 24.03, we indeed found a 3-4x
> >> increase in performance.
> >>
> >> And yet we still observe high CPU load for the NorthD process; digging
> >> deeper into the logs we found:
> >>
> >>
> >>
> >
> > Thanks for reporting this issue.
> >
> >
> > 2024-05-07T08:36:46.505Z|18503|poll_loop|INFO|wakeup due to [POLLIN] on fd
> >> 15 (10.34.22.66:60716<->10.34.22.66:6642) at lib/stream-fd.c:157 (94% CPU
> >> usage)
> >>
> >> *2024-05-07T08:37:38.857Z|18504|inc_proc_eng|INFO|node: northd, recompute
> >> (missing handler for input SB_datapath_binding) took 52313ms*
> >>
> >> *2024-05-07T08:37:48.335Z|18505|inc_proc_eng|INFO|node: lflow, recompute
> >> (failed handler for input northd) took 7759ms*
> >>
> >> *2024-05-07T08:37:48.718Z|18506|timeval|WARN|Unreasonably long 62213ms
> >> poll interval (56201ms user, 2900ms system)*
> >>
> >>
> >>
> >> As you can see there is a significant delay of 52 seconds
> >>
>
> This is huge indeed!
>
> >> Please correct me if I am wrong, but in my understanding ‘*missing handler
> >> for*’ practically means the absence of an inc-engine handler for some input
> >> node (in this sample: *SB_datapath_binding*).
> >>
> >
> > That's correct.
> >
> > Before plunging into development it would be great to clarify/align with
> >> the community’s position:
> >>
> >>- Why is there no handler for this node?
> >>
> >>
> > Our approach has been to add a handler for an input change only if it is
> > frequent or if it can be easily handled.
> > We have also skipped adding handlers where doing so increases the code
> > complexity. Having said that, I think we are open
> > to adding more handlers if it makes sense or if it results in scale
> > improvements.
> >
> > Right now we fall back to a full recompute of the northd engine for any
> > changes to a logical switch or logical router.
> > Does your deployment create/delete logical switches/routers frequently?
> > Is it possible to enable ovn debug logs
> > and share them?  I'm curious to know what the changes to SB datapath
> > binding are.
> >
> > Feel free to share your OVN NB and SB DBs if you're ok with it.  I can
> > deploy those DBs and see why the recompute is so expensive.
> >
> >
> >
> >>- Is there any particular reason for this, or has a peculiarity of our
> >>installation just highlighted this issue?
> >>
> >>
> > My guess is that your installation is frequently creating, deleting or
> > modifying logical switches or routers.
> >
> >
> >>- Do you think there is a reason to implement that handler
> >>(*SB_datapath_binding*)?
> >>
> >>
> > I'm fine with adding a handler if it helps with scale.  In our use cases,
> > we don't frequently create/delete logical switches and routers,
> > and hence it is ok to fall back to full recomputes for such changes.
> >
> >
> >>
> >>
> >>
> >>
> >> Any ideas are highly appreciated.
> >>
> >
> > You're welcome to work on it and submit patches to add a handler for
> > SB_datapath_binding.
> >
> > @Dumitru Ceara  @Han Zhou  if you've any
> > reservations about adding more handlers please do comment here.
> >
>
> In general, especially if it fixes a scalability issue like this one,
> it's probably fine.  In practice it depends a bit on how much complexity
> this would add to the code.
>
I agree with the general statement.

> But the best way to tell is to have a way to reproduce this, e.g., NB/SB
> databases and the NB/SB jsonrpc update that caused the recompute.
>

Yes, it is better to understand why the recompute took so long (52s) in this
deployment. Is it simply very large scale, or is it because of some uncommon
configuration that we don't handle efficiently and could optimize to improve
recompute performance?
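A minimal way to narrow that down (a sketch, not a prescription: the vlog
module name below is taken from the "inc_proc_eng" tag in the log excerpt
above, and the log file path depends on packaging) is to raise the I-P
engine log level on the running northd and watch which nodes recompute:

# Enable debug logging for the incremental-processing engine in ovn-northd.
# vlog/set is the generic unixctl logging command of OVS/OVN daemons.
ovn-appctl -t ovn-northd vlog/set inc_proc_eng:file:dbg
# Follow which engine nodes fall back to recompute and how long each takes.
tail -f /var/log/ovn/ovn-northd.log | grep inc_proc_eng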

Otherwise, even if we can implement datapath I-P, there can be just another
input change that triggers recompute and causes the same latency. It is
just not sustainable to maintain more and more I-P in northd.

> Regards,
> Dumitru
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [ovs-dev] [ANN] Primary OVS development branch renamed as main.

2024-04-10 Thread Han Zhou via discuss
On Wed, Apr 10, 2024 at 6:52 AM Simon Horman  wrote:
>
> Hi,
>
> I would like to announce that the primary development branch for OvS
> has been renamed main.
>
> The rename occurred a little earlier today.
>
> OVS is currently hosted on GitHub. We can expect the following behaviour
> after the rename:
>
> * GitHub pull requests against master should have been automatically
>   re-homed on main.
> * GitHub Issues should not be affected - the test issue I
>   created had no association with a branch
> * URLs accessed via the GitHub web UI are automatically renamed
> * Clones may also rename their primary branch - you may
>   get a notification about this in the Web UI
>
> As a result of this change it may be necessary to update your local git
> configuration for checked out branches.
>
> For example:
> # Fetch origin: new remote main branch; remote master branch is deleted
> git fetch -tp origin
> # Rename local branch
> git branch -m master main
> # Update local main branch to use remote main branch as its upstream
> git branch --set-upstream-to=origin/main main
>
> If you have an automation that fetches the master branch then please
> update the automation to fetch main. If your automation is fetching
> main and falling back to master, then it should now be safe to
> remove the fallback.
>
> This change is in keeping with OVS's recently adopted policy of using
> the inclusive naming word list v1 [1, 2].
>
> [1] df5e5cf4318a ("Documentation: Add section on inclusive language.")
> [2] https://inclusivenaming.org/word-lists/
>
> Kind regards,
> Simon
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Thanks Simon. Shall this be announced to ovs-announce as well?

Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] OVN SB DB from RAFT cluster to Relay DB

2024-03-19 Thread Han Zhou via discuss
On Thu, Mar 7, 2024 at 12:29 PM Sri kor via discuss <
ovs-discuss@openvswitch.org> wrote:
>
> Is there a way to configure ovn-controller to subscribe to only specific
SB DB updates?
> I

Hi Sri,

As mentioned by Felix, ovn-controller by default subscribes only to the SB DB
updates that it considers relevant to the local hypervisor.
What is your setting of external_ids:ovn-monitor-all? (ovs-vsctl
--if-exists get open . external_ids:ovn-monitor-all)
Is there anything you want to tune beyond this?
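For reference, a minimal sketch of checking and changing that knob on a
chassis (assuming the standard Open_vSwitch table layout):

# Show the current value; prints nothing if the key is unset.
ovs-vsctl --if-exists get open . external_ids:ovn-monitor-all
# Make ovn-controller monitor the full SB DB instead of conditional monitoring.
ovs-vsctl set open . external_ids:ovn-monitor-all=true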

Thanks,
Han

>
> On Wed, Mar 6, 2024 at 1:45 AM Felix Huettner 
wrote:
>>
>> On Wed, Mar 06, 2024 at 10:29:29AM +0300, Vladislav Odintsov wrote:
>> > Hi Felix,
>> >
>> > > On 6 Mar 2024, at 10:16, Felix Huettner via discuss <
ovs-discuss@openvswitch.org> wrote:
>> > >
>> > > Hi Srini,
>> > >
>> > > i can share what works for us for ~1k hypervisors:
>> > >
>> > > On Tue, Mar 05, 2024 at 09:51:43PM -0800, Sri kor via discuss wrote:
>> > >> Hi Team,
>> > >>
>> > >>
>> > >> Currently, we are using OVN in RAFT cluster mode. We have a 3-node NB
>> > >> and a 3-node SB ovsdb-server cluster operating in RAFT mode. Currently
>> > >> we have 500 hypervisors connected to this RAFT cluster.
>> > >>
>> > >> For our next deployment, our scale would increase to 3000 hypervisors.
>> > >> To accommodate the additional hypervisors, we are migrating to a DB
>> > >> relay multigroup deployment model. This helps with OVN SB DB read
>> > >> transactions. But for write transactions, only the leader in the RAFT
>> > >> cluster can update the DB. This creates load on the RAFT leader. Is
>> > >> there a way to address the load on the RAFT cluster leader?
>> > >
>> > > We do the following:
>> > > * If you need TLS on the ovsdb path, separate it out to some
>> > >  reverse proxy that can do just L4 TLS termination (e.g. traefik or so)
>> >
>> > Do I understand correctly that with such TLS "offload" you can’t use
RBAC for hypervisors?
>> >
>>
>> yes, that is the unfortunate side effect
>>
>> > > * Have nobody besides northd connect to the SB DB directly, everyone
>> > >  else needs to use a relay
>> > > * Do not run backups on the cluster leader, but on one of the current
>> > >  followers
>> > > * Increase the raft election timeout significantly (we have 120s in
>> > >  there). However there is a patch afaik in 3.3 that makes that better
>> > > * If you create metrics or so from database content generate these on
>> > >  the relays instead of the raft cluster
>> > >
>> > > Overall, when our southbound db had issues, most of the time it was some
>> > > client constantly reconnecting to it and thereby always pulling a full
>> > > DB dump.
>> > >
>> > >>
>> > >>
>> > >> As the scale increases, the number of updates coming to ovn-controller
>> > >> from the OVN SB increases. That creates pressure on ovn-controller. Is
>> > >> there a way to minimize the load on ovn-controller?
>> > >
>> > > Did not see any kind of issue there yet.
>> > > However if you are using some python tooling outside of OVN (e.g.
>> > > Openstack) ensure that you have JSON parsing using a C library available
>> > > in the ovs lib. This brings significant performance benefits if you have
>> > > a lot of updates.
>> > > You can check with `python3 -c "import ovs.json; print(ovs.json.PARSER)"`
>> > > which should return "C".
>> > >
>> > >>
>> > >> I wish there were a way for ovn-controller to subscribe to updates
>> > >> specific to this hypervisor. Are there any known ovn-controller
>> > >> subscription methods available and used by the OVS community?
>> > >
>> > > Yes, they do that by default. However, in our case we saw that this
>> > > creates increased load on the relays due to the additional filtering and
>> > > json serializing needed per target node. So we turned it off and thereby
>> > > trade less ovsdb load for more network bandwidth.
>> > > The relevant setting is `external_ids:ovn-monitor-all`.
>> > >
>> > > Thanks
>> > > Felix
>> > >
>> > >>
>> > >>
>> > >> How can I optimize the load on the leader node in an OVN RAFT cluster
>> > >> to handle increased write transactions?
>> > >>
>> > >>
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Srini
>> > >
>> > >> ___
>> > >> discuss mailing list
>> > >> disc...@openvswitch.org
>> > >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>> > >
>> > > ___
>> > > discuss mailing list
>> > > disc...@openvswitch.org 
>> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>> >
>> >
>> > Regards,
>> > Vladislav Odintsov
>> >
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-01 Thread Han Zhou via discuss
On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
>
> Hi Han,
>
> Please see my comments/questions inline.
>
> Han Zhou, Sep 30, 2023 at 21:59:
> > > Distributed mac learning
> > > 
> > >
> > > Use one OVS bridge per logical switch with mac learning enabled. Only
> > > create the bridge if the logical switch has a port bound to the local
> > > chassis.
> > >
> > > Pros:
> > >
> > > - Minimal openflow rules required in each bridge (ACLs and NAT
mostly).
> > > - No central mac binding table required.
> >
> > Firstly, to clarify the terminology of "mac binding" to avoid confusion:
> > the mac_binding table currently in the SB DB has nothing to do with L2 MAC
> > learning. It is actually the ARP/Neighbor table of distributed logical
> > routers. We should probably call it the IP_MAC_binding table, or just the
> > Neighbor table.
>
> Yes sorry about the confusion. I actually meant the FDB table.
>
> > What you mean here is actually L2 MAC learning, which today is implemented
> > by the FDB table in the SB DB, and it is only for uncommon use cases when
> > the NB doesn't have the knowledge of a MAC address of a VIF.
>
> This is not that uncommon in telco use cases where VNFs can send packets
> from mac addresses unknown to OVN.
>
Understood, but VNFs contribute a very small portion of the workloads,
right? Maybe I should rephrase that: it is uncommon to have "unknown"
addresses for the majority of ports in a large-scale cloud. Is this
understanding correct?

> > The purpose of this proposal is clear - to avoid using a central table in
> > the DB for L2 information and instead use L2 MAC learning to populate such
> > information on chassis, which is a reasonable alternative with pros and
> > cons.
> > However, I don't think it is necessary to use separate OVS bridges for
> > this purpose. L2 MAC learning can be easily implemented in the br-int
> > bridge with OVS flows, which is much simpler than managing a dynamic
> > number of OVS bridges just for the purpose of using the builtin OVS
> > mac-learning.
>
> I agree that this could also be implemented with VLAN tags on the
> appropriate ports. But since OVS does not support trunk ports, it may
> require complicated OF pipelines. My intent with this idea was twofold:
>
> 1) Avoid a central point of failure for mac learning/aging.
> 2) Simplify the OF pipeline by making all FDB operations dynamic.

IMHO, the L2 pipeline is not really complex. It is probably the simplest
part (compared with other features for L3, NAT, ACL, LB, etc.).
Adding dynamic learning to this part probably makes it *a little* more
complex, but it should still be straightforward. We don't need any VLAN tag
because the incoming packet has the geneve VNI in its metadata. We just need
a flow that resubmits to look up a MAC-to-tunnel-source mapping table, and
injects a new flow (with the related tunnel endpoint information), with the
help of the "learn" action, if the src MAC is not found. The entries are
per-logical_switch (VNI). This would serve your purpose of avoiding a
central DB for L2. At least this looks much simpler to me than managing a
dynamic number of OVS bridges and the patch-port pairs between them.
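To make that concrete, here is a rough sketch (table numbers are arbitrary
and this is not OVN's actual pipeline) of such a learn-based flow on br-int,
keyed by the geneve VNI (tun_id):

# Sketch only: learn a MAC -> tunnel-source mapping per logical switch (tun_id)
# and use it to set the tunnel destination for known unicast traffic.
ovs-ofctl add-flow br-int "table=10,priority=100,actions=learn(table=11,idle_timeout=300,NXM_NX_TUN_ID[]=NXM_NX_TUN_ID[],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[]),resubmit(,11)"
# Table 11 then holds the dynamically learned per-VNI FDB; a lower-priority
# flow there would flood to all tunnels for unknown destinations.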

>
> > Now back to the distributed MAC learning idea itself. Essentially, for two
> > VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a packet to
> > VM2@chassis2; assuming VM1 already has VM2's MAC address (we will discuss
> > this later), Chassis1 needs to know that VM2's MAC is located on Chassis2.
> >
> > In OVN today this information is conveyed through:
> >
> > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> >
> > In your proposal:
> >
> > - MAC and Chassis mapping (can be learned through initial L2
> >   broadcast/flood)
> >
> > This indeed would avoid the control plane cost through the centralized
> > components (for this L2 binding part). Given that today's SB OVSDB is a
> > bottleneck, this idea may sound attractive. But please also take into
> > consideration the below improvement that could mitigate the OVN central
> > scale issue:
> >
> > - For MAC and LSP mapping, northd is now capable of incrementally
> >   processing VIF related L2/L3 changes, so the cost of NB -> northd ->
> >   SB is very small. For SB -> Chassis, a more scalable DB deployment,
> >   such as the OVSDB relays, may largely help.
>
> But using relays will only help with read-only operations (SB ->
> chassis). Write operations (from dynamically learned mac addresses) will
> be equivalent.
>
OVSDB relay supports write operations, too. It scales better because each
ovsdb-server process handles a smaller number of clients/connections. It may
still perform worse when there are too many write operations from many
clients, but I think it should scale better than without relays. This is
only based on my knowledge of the ovsdb-server relay; I haven't tested
it at scale yet. People who have actually deployed it may comment more.
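For completeness, a relay instance is just an ovsdb-server pointed at the
cluster; a sketch with placeholder addresses (see ovsdb(7) for the exact
service-model syntax):

# Start a Southbound relay backed by the 3-node RAFT cluster. Writes received
# by the relay are forwarded to the cluster.
ovsdb-server --remote=ptcp:6642 \
    relay:OVN_Southbound:tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642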

> > - For LSP and Chassis mapping, the 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-01 Thread Han Zhou via discuss
On Sun, Oct 1, 2023 at 9:06 AM Robin Jarry  wrote:
>
> Hi Han,
>
> thanks a lot for your detailed answer.
>
> Han Zhou, Sep 30, 2023 at 01:03:
> > > I think ovn-controller only consumes the logical flows. The chassis and
> > > port bindings tables are used by northd to update these logical flows.
> >
> > Felix was right. For example, port-binding is firstly a configuration from
> > the north-bound, but the states such as its physical location (the chassis
> > column) are populated by the ovn-controller of the owning chassis and
> > consumed by other ovn-controllers that are interested in that port-binding.
>
> I was not aware of this. Thanks.
>
> > > Exactly, but was the signaling between the nodes ever an issue?
> >
> > I am not an expert on BGP, but at least as far as I am aware, there are
> > scaling issues in things like BGP full-mesh signaling, and there are
> > solutions such as route reflectors (which are again centralized) to solve
> > such issues.
>
> I am not familiar with BGP full mesh signaling. But from what I can tell,
> it looks like the same concept as the full mesh of GENEVE tunnels, except
> that the tunnels are only used when the same logical switch is
> implemented between two nodes.
>
Please note that tunnels are needed not only between nodes related to the same
logical switches, but also when they are related to different logical
switches connected by logical routers (even multiple LR+LS hops away).

> > > So you have enabled monitor_all=true as well? Or did you test at scale
> > > with monitor_all=false.
> > >
> > We do use monitor_all=false, primarily to reduce the memory footprint (and
> > also the CPU cost of IDL processing) on each chassis. There are trade-offs
> > for the SB DB server performance:
> >
> > - On one hand it increases the cost of conditional monitoring, which
> >   is expensive for sure
> > - On the other hand, it reduces the total amount of data for the
> >   server to propagate to clients
> >
> > It really depends on your topology for making the choice. If most of the
> > nodes would anyway monitor most of the DB data (something similar to a
> > full mesh), it is more reasonable to use monitor_all=true. Otherwise, in
> > a topology like ovn-kubernetes where each node has its dedicated part of
> > the data, or in topologies where you have lots of small "islands" such as
> > a cloud with many small tenants that never talk to each other, using
> > monitor_all=false could make sense (but it still needs to be carefully
> > evaluated and tested for your own use cases).
>
> I didn't see recent scale testing for openstack, but in past testing we
> had to set monitor_all=true because the CPU usage of the SB ovsdb was
> a bottleneck.
>
To clarify a little more, openstack deployments can have different logical
topologies. So to evaluate the impact of the monitor_all setting there should
be different test cases to capture different types of deployment, e.g. a
full-mesh topology (monitor_all=true is better) vs. a "small islands"
topology (monitor_all=false is reasonable).

> > > The memory usage would be reduced but I don't know to which point. One
> > > of the main consumers is the logical flows table which is required
> > > everywhere. Unless there is a way to only sync a portion of this table
> > > depending on the chassis, disabling monitor_all would save syncing the
> > > unneeded tables for ovn-controller: chassis, port bindings, etc.
> >
> > Probably it wasn't what you meant, but I'd like to clarify that it is not
> > about unneeded tables, but unneeded rows in those tables (mainly
> > logical_flow and port_binding).
> > It indeed syncs only a portion of the tables. It does not depend directly
> > on the chassis, but on what port-bindings are on the chassis and what
> > logical connectivity those port-bindings have. So, again, the choice
> > really depends on your use cases.
>
> What about the FDB (mac-port) and MAC binding (ip-mac) tables? I thought
> ovn-controller does not need them. If that is the case, I thought that
> by default, the whole tables (not only some of their rows) were excluded
> from the synchronized data.
>
FDB and MAC_binding tables are used by ovn-controllers. They are
essentially the central storage for the MAC tables of the distributed logical
switches (FDB) and the ARP/Neighbour tables of the distributed logical routers
(MAC_binding). A record can be populated by one chassis and consumed by many
other chassis.

monitor_all should work the same way for these tables: if monitor_all =
false, only rows related to "local datapaths" should be downloaded to the
chassis. However, for the FDB table, the condition is not set for now (which
may have been a miss in the initial implementation). Perhaps this was not
noticed because MAC learning is not a very widely used feature and no scale
impact has been observed, but I just proposed a patch to enable the
conditional monitoring:
https://patchwork.ozlabs.org/project/ovn/patch/20231001192658.1012806-1-hz...@ovn.org/

Thanks,
Han

> Thanks!
>

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Han Zhou via discuss
On Sat, Sep 30, 2023 at 9:56 AM Vladislav Odintsov 
wrote:
>
>
>
> regards,
> Vladislav Odintsov
>
> > On 30 Sep 2023, at 16:50, Robin Jarry  wrote:
> >
> > Hi Vladislav, Frode,
> >
> > Thanks for your replies.
> >
> > Frode Nordahl, Sep 30, 2023 at 10:55:
> >> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
> >>  wrote:
>  On 29 Sep 2023, at 18:14, Robin Jarry via discuss <
ovs-discuss@openvswitch.org> wrote:
> 
>  Felix Huettner, Sep 29, 2023 at 15:23:
> >> Distributed mac learning
> >> 
>  [snip]
> >>
> >> Cons:
> >>
> >> - How to manage seamless upgrades?
> >> - Requires ovn-controller to move/plug ports in the correct bridge.
> >> - Multiple openflow connections (one per managed bridge).
> >> - Requires ovn-trace to be reimplemented differently (maybe other
tools
> >> as well).
> >
> > - No central information anymore on mac bindings. All nodes need to
> > update their data individually
> > - Each bridge also generates a linux network interface. I do not know if
> > there is some kind of limit on the number of linux interfaces or ovs
> > bridges somewhere.
> 
>  That's a good point. However, only the bridges related to one
>  implemented logical network would need to be created on a single
>  chassis. Even with the largest OVN deployments, I doubt this would be
>  a limitation.
> 
> > Would you still preprovision static mac addresses on the bridge for all
> > port_bindings we know the mac address of, or would you rather leave
> > that up to learning as well?
> 
>  I would leave everything dynamic.
> 
> > I do not know if there is some kind of performance/optimization penalty
> > for moving packets between different bridges.
> 
>  As far as I know, once the openflow pipeline has been resolved into
>  a datapath flow, there is no penalty.
> 
> > You can also not only use the logical switches that have a local port
> > bound. Assume the following topology:
> > +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> > +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > vm1 and vm2 are both running on the same hypervisor. Creating only local
> > logical switches would mean only ls1 and ls3 are available on that
> > hypervisor. This would break the connection between the two vms, which
> > would in the current implementation just traverse the two logical
> > routers.
> > I guess we would need to create bridges for each locally reachable
> > logical switch. I am concerned about the potentially significant
> > increase in bridges and openflow connections this brings.
> 
>  That is one of the concerns I raised in the last point. In my opinion
>  this is a trade-off. You remove centralization and require more local
>  processing. But overall, the processing cost should remain equivalent.
> >>>
> >>> Just want to clarify.
> >>>
> >>> For the topology described by Felix above, you propose to create 2 OVS
> >>> bridges, right? How will the packet traverse from vm1 to vm2?
> >
> > In this particular case, there would be 3 OVS bridges, one for each
> > logical switch.
>
> Yeah, agree, this is typo. Below I named three bridges :).
>
> >
> >>> Currently when the packet enters OVS all the logical switching and
> >>> routing openflow calculation is done with no packet re-entering OVS,
> >>> and this results in one DP flow match to deliver this packet from
> >>> vm1 to vm2 (if no conntrack used, which could introduce
> >>> recirculations).
> >>>
> >>> Do I understand correctly that in this proposal OVS needs to
> >>> receive the packet from the “ls1” bridge, next run it through the “lr1”
> >>> lrouter OpenFlow pipeline, then output the packet to the “ls2” OVS bridge
> >>> for mac learning between logical routers (should we have an OF flow with
> >>> a learn action here?), then send the packet again to OVS, calculate the
> >>> “lr2” OpenFlow pipeline and finally reach the destination OVS bridge
> >>> “ls3” to send the packet to vm2?
> >
> > What I am proposing is to implement the northbound L2 network intent
> > with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
> > constructs and ACLs would require patch ports and specific OF pipelines.
> >
> > We could even think of adding more advanced L3 capabilities (RIB) into
> > OVS to simplify the OF pipelines.
>
> But this will make OVS<->kernel interaction more complex. Even if we forget
> about dpdk environments…
>
> >
> >>> Also, will such behavior be compatible with HW offload to
> >>> smartnics/DPUs?
> >>
> >> I am also a bit concerned about this, what would be the typical number
> >> of bridges supported by hardware?
> >
> > As far as I understand, only the datapath flows are offloaded to
> > hardware. The OF pipeline is only parsed when there is an upcall for the
> > first packet. Once resolved, the datapath 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Han Zhou via discuss
On Thu, Sep 28, 2023 at 9:28 AM Robin Jarry  wrote:
>
> Hello OVN community,
>
> This is a follow up on the message I have sent today [1]. That second
> part focuses on some ideas I have to remove the limitations that were
> mentioned in the previous email.
>
> [1]
https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052695.html
>
> If you didn't read it, my goal is to start a discussion about how we
> could improve OVN on the following topics:
>
> - Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
> - Support scaling of L2 connectivity across larger clusters.
> - Simplify CMS interoperability.
> - Allow support for alternative datapath implementations.
>
> Disclaimer:
>
> This message does not mention anything about L3/L4 features of OVN.
> I didn't have time to work on these, yet. I hope we can discuss how
> these fit with my ideas.
>

Hi Robin and folks, thanks for the great discussions!
I read the replies in the two other threads of this email, but I am replying
directly here to comment on some of the original statements in this email.
I will reply to the other threads on some specific points.

> Distributed mac learning
> 
>
> Use one OVS bridge per logical switch with mac learning enabled. Only
> create the bridge if the logical switch has a port bound to the local
> chassis.
>
> Pros:
>
> - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> - No central mac binding table required.

Firstly to clarify the terminology of "mac binding" to avoid confusion, the
mac_binding table currently in SB DB has nothing to do with L2 MAC
learning. It is actually the ARP/Neighbor table of distributed logical
routers. We should probably call it IP_MAC_binding table, or just Neighbor
table.
Here what you mean is actually L2 MAC learning, which today is implemented
by the FDB table in SB DB, and it is only for uncommon use cases when the
NB doesn't have the knowledge of a MAC address of a VIF.

> - Mac table aging comes for free.
> - Zero access to southbound DB for learned addresses nor for aging.
>
> Cons:
>
> - How to manage seamless upgrades?
> - Requires ovn-controller to move/plug ports in the correct bridge.
> - Multiple openflow connections (one per managed bridge).
> - Requires ovn-trace to be reimplemented differently (maybe other tools
>   as well).
>

The purpose of this proposal is clear - to avoid using a central table in
DB for L2 information but instead using L2 MAC learning to populate such
information on chassis, which is a reasonable alternative with pros and
cons.
However, I don't think it is necessary to use separate OVS bridges for this
purpose. L2 MAC learning can be easily implemented in the br-int bridge
with OVS flows, which is much simpler than managing dynamic number of OVS
bridges just for the purpose of using the builtin OVS mac-learning.

Now back to the distributed MAC learning idea itself. Essentially for two
VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a packet to
VM2@chassis2, assuming VM1 already has VM2's MAC address (we will discuss
this later), Chassis1 needs to know that VM2's MAC is located on Chassis2.
In OVN today this information is conveyed through:
- MAC and LSP mapping (NB -> northd -> SB -> Chassis)
- LSP and Chassis mapping (Chassis -> SB -> Chassis)

In your proposal:
- MAC and Chassis mapping (can be learned through initial L2
broadcast/flood)

This indeed would avoid the control plane cost through the centralized
components (for this L2 binding part). Given that today's SB OVSDB is a
bottleneck, this idea may sound attractive. But please also take into
consideration the below improvement that could mitigate the OVN central
scale issue:
- For MAC and LSP mapping, northd is now capable of incrementally
processing VIF related L2/L3 changes, so the cost of NB -> northd -> SB is
very small. For SB -> Chassis, a more scalable DB deployment, such as the
OVSDB relays, may largely help.
- For LSP and Chassis mapping, the round trip through a central DB
obviously costs more than a direct L2 broadcast (the targets are the
same). But this can be optimized if the MAC and Chassis mapping is known by
the CMS (which is true for most openstack/k8s environments, I believe).
Instead of updating the binding from each Chassis, the CMS can convey this
information through the same NB -> northd -> SB -> Chassis path, and the
Chassis can just read the SB without updating it.

On the other hand, the dynamic MAC learning approach has its own drawbacks:
- It is simple when considering L2 only, but when considering more SDN
features, a central DB is more flexible to extend and implement new features
than a network-protocol-based approach.
- It is more predictable and easier to debug with information pre-populated
through the CMS than with states learned dynamically in the data plane.
- With the DB approach we can suppress most of the L2 broadcast/flood, while
with distributed MAC learning broadcast/flood can't be avoided.
Although it may 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-09-29 Thread Han Zhou via discuss
On Fri, Sep 29, 2023 at 7:26 AM Robin Jarry  wrote:
>
> Hi Felix,
>
> Thanks a lot for your message.
>
> Felix Huettner, Sep 29, 2023 at 14:35:
> > I can get that when running 10k ovn-controllers the benefits of
> > optimizing cpu and memory load are quite significant. However I am
> > unsure about reducing the footprint of ovn-northd.
> > When running so many nodes I would have assumed that having an
> > additional (or maybe two) dedicated machines for ovn-northd would
> > be completely acceptable, as long as it can still actually do what
> > it should in a reasonable timeframe.
> > Would the goal for ovn-northd be more like "Reduce the full/incremental
> > recompute time" then?

+1

>
> The main goal of this thread is to get a consensus on the actual issues
> that prevent scaling at the moment. We can discuss solutions in the
> other thread.
>

Thanks for the good discussions!

> > > * Allow support for alternative datapath implementations.
> >
> > Does this mean ovs datapths (e.g. dpdk) or something different?
>
> See the other thread.
>
> > > Southbound Design
> > > =
> ...
> > Note that ovn-controller also consumes the "state" of other chassis to,
> > e.g., build the tunnels to other chassis. To visualize my understanding:
> >
> > +----------------+---------------+------------+
> > |                | configuration |   state    |
> > +----------------+---------------+------------+
> > |   ovn-northd   |  write-only   | read-only  |
> > +----------------+---------------+------------+
> > | ovn-controller |   read-only   | read-write |
> > +----------------+---------------+------------+
> > |    some cms    |  no access?   | read-only  |
> > +----------------+---------------+------------+
>
> I think ovn-controller only consumes the logical flows. The chassis and
> port bindings tables are used by northd to update these logical flows.
>

Felix was right. For example, port-binding is firstly a configuration from
north-bound, but the states such as its physical location (the chassis
column) are populated by ovn-controller of the owning chassis and consumed
by other ovn-controllers that are interested in that port-binding.

> > > Centralized decisions
> > > =
> > >
> > > Every chassis needs to be "aware" of all other chassis in the cluster.
> >
> > I think we need to accept this as a fundamental truth, independent of
> > whether you look at centralized designs like ovn or the neutron-l2
> > implementation, or at decentralized designs like bgp or spanning tree. In
> > all cases, if we need some kind of organized communication we need to
> > know all relevant peers.
> > Designs might diverge in whether you need to be "aware" of all peers or
> > just some of them, but that is just a tradeoff between data size and the
> > options you have to forward data.
> >
> > > This requirement mainly comes from overlay networks that are implemented
> > > over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
> > > limitations). It is not a scaling issue by itself, but it implies
> > > a centralized decision which in turn puts pressure on the central node
> > > at scale.
> >
> > +1. On the other hand it removes signaling needs between the nodes (like
> > you would have with bgp).
>
> Exactly, but was the signaling between the nodes ever an issue?

I am not an expert on BGP, but at least as far as I am aware, there are
scaling issues in things like BGP full-mesh signaling, and there are
solutions such as route reflectors (which are again centralized) to solve
such issues.

>
> > > Due to ovsdb monitoring and caching, any change in the southbound DB
> > > (either by northd or by any of the chassis controllers) is replicated on
> > > every chassis. The monitor_all option is often enabled on large clusters
> > > to avoid the conditional monitoring CPU cost on the central node.
> >
> > This is, I guess, something that should be possible to fix. We have also
> > enabled this setting as it gave us stability improvements and we do not
> > yet see performance issues with it.
>
> So you have enabled monitor_all=true as well? Or did you test at scale
> with monitor_all=false.
>
We do use monitor_all=false, primarily to reduce the memory footprint (and
also the CPU cost of IDL processing) on each chassis. There are trade-offs
for the SB DB server performance:
- On one hand it increases the cost of conditional monitoring, which is
expensive for sure
- On the other hand, it reduces the total amount of data for the server to
propagate to clients

It really depends on your topology for making the choice. If most of the
nodes would anyway monitor most of the DB data (something similar to a
full mesh), it is more reasonable to use monitor_all=true. Otherwise, in a
topology like ovn-kubernetes where each node has its dedicated part of the
data, or in topologies where you have lots of small "islands" such as a
cloud with many small tenants that never talk to each other, using
monitor_all=false could make sense (but 

Re: [ovs-discuss] Scaling OVN/Southbound

2023-07-07 Thread Han Zhou via discuss
On Fri, Jul 7, 2023 at 1:21 PM Han Zhou  wrote:
>
>
>
> On Thu, Jul 6, 2023 at 1:28 AM Terry Wilson  wrote:
> >
> > On Wed, Jul 5, 2023 at 9:59 AM Terry Wilson  wrote:
> > >
> > > On Fri, Jun 30, 2023 at 7:09 PM Han Zhou via discuss
> > >  wrote:
> > > >
> > > >
> > > >
> > > > On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <
ovs-discuss@openvswitch.org> wrote:
> > > > >
> > > > > Hi Ilya,
> > > > >
> > > > > thank you for the detailed reply
> > > > >
> > > > > On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
> > > > > > On 5/23/23 15:59, Felix Hüttner via discuss wrote:
> > > > > > > Hi everyone,
> > > > > >
> > > > > > Hi, Felix.
> > > > > >
> > > > > > >
> > > > > > > we are currently running an OVN Deployment with 450 Nodes. We
run a 3 node cluster for the northbound database and a 3 nodes cluster for
the southbound database.
> > > > > > > Between the southbound cluster and the ovn-controllers we
have a layer of 24 ovsdb relays.
> > > > > > > The setup is using TLS for all connections, however the TLS
Server is handled by a traefik reverseproxy to offload this from the ovsdb
> > > > > >
> > > > > > The very important part of the system description is what
versions
> > > > > > of OVS and OVN are you using in this setup?  If it's not latest
> > > > > > 3.1 and 23.03, then it's hard to talk about what/if performance
> > > > > > improvements are actually needed.
> > > > > >
> > > > >
> > > > > We are currently running ovs 3.1 and ovn 22.12 (in the process of
> > > > > upgrading to 23.03). `monitor-all` is currently disabled, but we
want to
> > > > > try that as well.
> > > > >
> > > > Hi Felix, did you try upgrading and enabling "monitor-all"? How
does it look now?
> > > >
> > > > > > > Northd and Neutron is connecting directly to north- and
southbound databases without the relays.
> > > > > >
> > > > > > One of the big things that is annoying is that Neutron connects
to
> > > > > > Southbound database at all.  There are some reasons to do that,
> > > > > > but ideally that should be avoided.  I know that in the past
limiting
> > > > > > the number of metadata agents was one of the mitigation
strategies
> > > > > > for scaling issues.  Also, why can't it connect to relays?
There
> > > > > > shouldn't be too many transactions flowing towards Southbound DB
> > > > > > from the Neutron.
> > > > > >
> > > > >
> > > > > Thanks for that suggestion, that definately makes sense.
> > > > >
> > > > Does this make a big difference? How many Neutron - SB connections
are there?
> > > > What rings a bell is that Neutron is using the python OVSDB library
which hasn't implemented the fast-resync feature (if I remember correctly).
> > >
> > > python-ovs has supported monitor_cond_since since v2.17.0 (though
> > > there may have been a bug that was fixed in 2.17.1). If fast resync
> > > isn't happening, then it should be considered a bug. With that said, I
> > > remember when I looked it a year or two ago, ovsdb-server didn't
> > > really use fast resync/monitor_cond_since unless it was running in
> > > raft cluster mode (it would reply, but with the last-txn-id as 0
> > > IIRC?). Does the ovsdb-relay code actually return the last-txn-id? I
> > > can set up an environment and run some tests, but maybe someone else
> > > already knows.
> >
> > Looks like ovsdb-relay does support last-txn-id now:
> >
https://github.com/openvswitch/ovs/commit/a3e97b1af1bdcaa802c6caa9e73087df7077d2b1
> > but only in v3.0+.
> >
>
> Hi Terry, thanks for correcting me, and sorry for my bad memory! And you
are right that fast resync is supported only in cluster mode.
>
> Han
>
> > > > At the same time, there is the feature
leader-transfer-for-snapshot, which automatically transfer leader whenever
a snapshot is to be written, which would happen frequently if your
environment is very active.
> > >
> > > I believe snapshot should only be happening "no less frequently than
> > > 24 hours, with snapshots if there are more 

Re: [ovs-discuss] Scaling OVN/Southbound

2023-07-07 Thread Han Zhou via discuss
On Thu, Jul 6, 2023 at 12:00 AM Felix Huettner 
wrote:
>
> Hi Han,
>
> On Fri, Jun 30, 2023 at 05:08:36PM -0700, Han Zhou wrote:
> > On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <
> > ovs-discuss@openvswitch.org> wrote:
> > >
> > > Hi Ilya,
> > >
> > > thank you for the detailed reply
> > >
> > > On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
> > > > On 5/23/23 15:59, Felix Hüttner via discuss wrote:
> > > > > Hi everyone,
> > > >
> > > > Hi, Felix.
> > > >
> > > > >
> > > > > we are currently running an OVN Deployment with 450 Nodes. We run
a 3
> > node cluster for the northbound database and a 3 nodes cluster for the
> > southbound database.
> > > > > Between the southbound cluster and the ovn-controllers we have a
> > layer of 24 ovsdb relays.
> > > > > The setup is using TLS for all connections, however the TLS
Server is
> > handled by a traefik reverseproxy to offload this from the ovsdb
> > > >
> > > > The very important part of the system description is what versions
> > > > of OVS and OVN are you using in this setup?  If it's not latest
> > > > 3.1 and 23.03, then it's hard to talk about what/if performance
> > > > improvements are actually needed.
> > > >
> > >
> > > We are currently running ovs 3.1 and ovn 22.12 (in the process of
> > > upgrading to 23.03). `monitor-all` is currently disabled, but we want
to
> > > try that as well.
> > >
> > Hi Felix, did you try upgrading and enabling "monitor-all"? How does it
> > look now?
>
> we did not yet upgrade, but we tried monitor-all and that provided a big
> benefit in terms of stability.
>
It is great to know that monitor-all helped for your use case.

> >
> > > > > Northd and Neutron is connecting directly to north- and southbound
> > databases without the relays.
> > > >
> > > > One of the big things that is annoying is that Neutron connects to
> > > > Southbound database at all.  There are some reasons to do that,
> > > > but ideally that should be avoided.  I know that in the past
limiting
> > > > the number of metadata agents was one of the mitigation strategies
> > > > for scaling issues.  Also, why can't it connect to relays?  There
> > > > shouldn't be too many transactions flowing towards Southbound DB
> > > > from the Neutron.
> > > >
> > >
> > > Thanks for that suggestion, that definately makes sense.
> > >
> > Does this make a big difference? How many Neutron - SB connections are
> > there?
> > What rings a bell is that Neutron is using the python OVSDB library
which
> > hasn't implemented the fast-resync feature (if I remember correctly).
> > At the same time, there is the feature leader-transfer-for-snapshot,
which
> > automatically transfer leader whenever a snapshot is to be written,
which
> > would happen frequently if your environment is very active.
> > When a leader transfer happens, if Neutron set the option "leader-only"
> > (only connects to leader) to SB DB (could someone confirm?), then when
the
> > leader transfer happens, all Neutron workers would reconnect to the new
> > leader. With fast-resync, like what's implemented in C IDL and Go, the
> > client that has cached the data would only request the delta when
> > reconnecting. But since the python lib doesn't have this, the Neutron
> > server would re-download full data when reconnecting ...
> > This is a speculation based on the information I have, and the
assumptions
> > need to be confirmed.
>
> We are currently working with upstream neutron to get the leader-only
> flag removed wherever we can. I guess in total the amount of connections
> depends on the process count which would be ~150 connections in total in
> our case.
>
As Terry pointed out, the python OVSDB lib does support fast-resync, so
this shouldn't be the problem, and I think it is better to keep the
leader-only flag for neutron because it is more efficient to update
directly through the leader, especially when the client writes heavily.
Without leader-only, the updates sent to different followers will anyway
have to go through the leader, but parallel updates will result in sequence
conflicts and will have to be retried, which creates more waste and load on
the servers. But of course, it does no harm to try. I didn't think that you
had so many (~150) connections just from Neutron (I thought it might be
10+), which seems big enough to create significant load on the server,
especially when many of them restart at the same time, such as during an
upgrade.
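As an illustration (a sketch only, not Neutron's actual code; the remote
address and schema path are placeholders, and it assumes a python-ovs version
that supports the leader_only option), this is roughly where that choice is
made in a python-ovs client:

# Sketch: a python-ovs IDL client choosing whether to stick to the RAFT leader.
import ovs.db.idl

remote = "tcp:10.0.0.1:6642"   # placeholder SB endpoint (or a relay address)
helper = ovs.db.idl.SchemaHelper("/usr/share/ovn/ovn-sb.ovsschema")
helper.register_all()

# leader_only=True makes the IDL reconnect whenever the cluster leader moves;
# with leader_only=False a follower or an OVSDB relay is acceptable.
idl = ovs.db.idl.Idl(remote, helper, leader_only=True)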

> >
> > > > >
> > > > > We needed to increase various timeouts on the ovsdb-server and
client
> > side to get this to a mostly stable state:
> > > > > * inactivity probes of 60 seconds (for all connections between
> > ovsdb-server, relay and clients)
> > > > > * cluster election time of 50 seconds
> > > > >
> > > > > As long as none of the relays restarts the environment is quite
> > stable.
> > > > > However we see quite regularly the "Unreasonably long xxx ms poll
> > interval" messages ranging from 1000ms up to 4ms.
> > > >
> > 

Re: [ovs-discuss] Scaling OVN/Southbound

2023-07-06 Thread Han Zhou via discuss
On Thu, Jul 6, 2023 at 1:28 AM Terry Wilson  wrote:
>
> On Wed, Jul 5, 2023 at 9:59 AM Terry Wilson  wrote:
> >
> > On Fri, Jun 30, 2023 at 7:09 PM Han Zhou via discuss
> >  wrote:
> > >
> > >
> > >
> > > On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <
ovs-discuss@openvswitch.org> wrote:
> > > >
> > > > Hi Ilya,
> > > >
> > > > thank you for the detailed reply
> > > >
> > > > On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
> > > > > On 5/23/23 15:59, Felix Hüttner via discuss wrote:
> > > > > > Hi everyone,
> > > > >
> > > > > Hi, Felix.
> > > > >
> > > > > >
> > > > > > we are currently running an OVN Deployment with 450 Nodes. We
run a 3 node cluster for the northbound database and a 3 nodes cluster for
the southbound database.
> > > > > > Between the southbound cluster and the ovn-controllers we have
a layer of 24 ovsdb relays.
> > > > > > The setup is using TLS for all connections, however the TLS
Server is handled by a traefik reverseproxy to offload this from the ovsdb
> > > > >
> > > > > The very important part of the system description is what versions
> > > > > of OVS and OVN are you using in this setup?  If it's not latest
> > > > > 3.1 and 23.03, then it's hard to talk about what/if performance
> > > > > improvements are actually needed.
> > > > >
> > > >
> > > > We are currently running ovs 3.1 and ovn 22.12 (in the process of
> > > > upgrading to 23.03). `monitor-all` is currently disabled, but we
want to
> > > > try that as well.
> > > >
> > > Hi Felix, did you try upgrading and enabling "monitor-all"? How does
it look now?
> > >
> > > > > > Northd and Neutron is connecting directly to north- and
southbound databases without the relays.
> > > > >
> > > > > One of the big things that is annoying is that Neutron connects to
> > > > > Southbound database at all.  There are some reasons to do that,
> > > > > but ideally that should be avoided.  I know that in the past
limiting
> > > > > the number of metadata agents was one of the mitigation strategies
> > > > > for scaling issues.  Also, why can't it connect to relays?  There
> > > > > shouldn't be too many transactions flowing towards Southbound DB
> > > > > from the Neutron.
> > > > >
> > > >
> > > > Thanks for that suggestion, that definately makes sense.
> > > >
> > > Does this make a big difference? How many Neutron - SB connections
are there?
> > > What rings a bell is that Neutron is using the python OVSDB library
which hasn't implemented the fast-resync feature (if I remember correctly).
> >
> > python-ovs has supported monitor_cond_since since v2.17.0 (though
> > there may have been a bug that was fixed in 2.17.1). If fast resync
> > isn't happening, then it should be considered a bug. With that said, I
> > remember when I looked it a year or two ago, ovsdb-server didn't
> > really use fast resync/monitor_cond_since unless it was running in
> > raft cluster mode (it would reply, but with the last-txn-id as 0
> > IIRC?). Does the ovsdb-relay code actually return the last-txn-id? I
> > can set up an environment and run some tests, but maybe someone else
> > already knows.
>
> Looks like ovsdb-relay does support last-txn-id now:
>
https://github.com/openvswitch/ovs/commit/a3e97b1af1bdcaa802c6caa9e73087df7077d2b1
> but only in v3.0+.
>

Hi Terry, thanks for correcting me, and sorry for my bad memory! And you
are right that fast resync is supported only in cluster mode.

Han

> > > At the same time, there is the feature leader-transfer-for-snapshot,
which automatically transfer leader whenever a snapshot is to be written,
which would happen frequently if your environment is very active.
> >
> > I believe snapshot should only be happening "no less frequently than
> > 24 hours, with snapshots if there are more than 100 log entries and
> > the log size has doubled, but no more frequently than every 10 mins"
> > or something pretty close to that. So it seems like once the system
> > got up to its expected size, you would just see updates every 24 hours
> > since you obviously can't double in size forever. But it's possible
> > I'm reading that wrong.
> >
> > > When a leader transfer happens, if Ne

Re: [ovs-discuss] Scaling OVN/Southbound

2023-06-30 Thread Han Zhou via discuss
On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <
ovs-discuss@openvswitch.org> wrote:
>
> Hi Ilya,
>
> thank you for the detailed reply
>
> On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
> > On 5/23/23 15:59, Felix Hüttner via discuss wrote:
> > > Hi everyone,
> >
> > Hi, Felix.
> >
> > >
> > > we are currently running an OVN Deployment with 450 Nodes. We run a 3
node cluster for the northbound database and a 3 nodes cluster for the
southbound database.
> > > Between the southbound cluster and the ovn-controllers we have a
layer of 24 ovsdb relays.
> > > The setup is using TLS for all connections, however the TLS Server is
handled by a traefik reverseproxy to offload this from the ovsdb
> >
> > The very important part of the system description is what versions
> > of OVS and OVN are you using in this setup?  If it's not latest
> > 3.1 and 23.03, then it's hard to talk about what/if performance
> > improvements are actually needed.
> >
>
> We are currently running ovs 3.1 and ovn 22.12 (in the process of
> upgrading to 23.03). `monitor-all` is currently disabled, but we want to
> try that as well.
>
Hi Felix, did you try upgrading and enabling "monitor-all"? How does it
look now?

> > > Northd and Neutron is connecting directly to north- and southbound
databases without the relays.
> >
> > One of the big things that is annoying is that Neutron connects to
> > Southbound database at all.  There are some reasons to do that,
> > but ideally that should be avoided.  I know that in the past limiting
> > the number of metadata agents was one of the mitigation strategies
> > for scaling issues.  Also, why can't it connect to relays?  There
> > shouldn't be too many transactions flowing towards Southbound DB
> > from the Neutron.
> >
>
> Thanks for that suggestion, that definately makes sense.
>
Does this make a big difference? How many Neutron - SB connections are
there?
What rings a bell is that Neutron is using the python OVSDB library which
hasn't implemented the fast-resync feature (if I remember correctly).
At the same time, there is the feature leader-transfer-for-snapshot, which
automatically transfers the leader whenever a snapshot is to be written, which
would happen frequently if your environment is very active.
When a leader transfer happens, if Neutron sets the option "leader-only"
(only connect to the leader) for the SB DB (could someone confirm?), then all
Neutron workers would reconnect to the new leader. With fast-resync, as
implemented in the C and Go IDLs, a client that has cached the data would
only request the delta when reconnecting. But since the python lib doesn't
have this, the Neutron server would re-download the full data when
reconnecting ...
This is speculation based on the information I have, and the assumptions
need to be confirmed.

> > >
> > > We needed to increase various timeouts on the ovsdb-server and client
side to get this to a mostly stable state:
> > > * inactivity probes of 60 seconds (for all connections between
ovsdb-server, relay and clients)
> > > * cluster election time of 50 seconds
> > >
> > > As long as none of the relays restarts the environment is quite
stable.
> > > However we see quite regularly the "Unreasonably long xxx ms poll
interval" messages ranging from 1000ms up to 4ms.
> >
> > With latest versions of OVS/OVN the CPU usage on Southbound DB
> > servers without relays in our weekly 500-node ovn-heater runs
> > stays below 10% during the test phase.  No large poll intervals
> > are getting registered.
> >
> > Do you have more details on under which circumstances these
> > large poll intervals occur?
> >
>
> It seems to mostly happen on the initial connection of some client to
> the ovsdb. From the few times we ran perf there it looks like the time
> is spent in creating a monitor and during that sending out the updates
> to the client side.
>
It is one of the worst-case scenarios for OVSDB when many clients initialize
connections to it at the same time and the size of the data downloaded by
each client is big.
OVSDB relay, from what I understand, should greatly help with this. You have
24 relay nodes, which are supposed to share the burden. Are the SB DB and
the relay instances running with sufficient CPU resources?
Is it clear which clients' initial connections (ovn-controller or
Neutron) are causing this? If it is Neutron, the above speculation about
the lack of fast-resync from Neutron workers may be worth checking.

> If it is of interest I can try and get a perf report once this occurs
> again.
>
> > >
> > > If a large number of relays restart simultaneously they can also
cause the ovsdb cluster to fail, as the poll interval exceeds the cluster
election time.
> > > This happens even with the relays already syncing the data from all 3
ovsdb servers.
> >
> > There was a performance issue with upgrades and simultaneous
> > reconnections, but it should be mostly fixed on the current master
> > branch, 

Re: [ovs-discuss] MAC binding aging refresh mechanism

2023-05-25 Thread Han Zhou via discuss
On Thu, May 25, 2023 at 9:19 AM Ilya Maximets  wrote:
>
> On 5/25/23 14:08, Ales Musil via discuss wrote:
> > Hi,
> >
> > to improve the MAC binding aging mechanism we need a way to ensure that
rows which are still in use are preserved. This doesn't happen with the current
implementation.
> >
> > I propose the following solution, which should solve the issue; any
questions or comments are welcome. If there isn't anything major that would
block this approach I would start to implement it so it can be available in
23.09.
> >
> > For the approach itself:
> >
> > Add "mac_cache_use" action into "lr_in_learn_neighbor" table (only the
flow that continues on known MAC binding):
> > match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(next;)  ->
match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(mac_cache_use; next;)
> >
> > The "mac_cache_use" would translate to resubmit into separate table
with flows per MAC binding as follows:
> > match=(ip.src=, eth.src=, datapath=),
action=(drop;)

It is possible that some workload has heavy traffic in the ingress direction
only, such as some UDP streams, while not sending anything out for a long
interval. So I am not sure if using "src" only would be sufficient.

>
> One concern here would be that it will likely cause a packet clone
> in the datapath just to immediately drop it.  So, might have a
> noticeable performance impact.
>
+1. We need to be careful to avoid any dataplane performance impact, which
doesn't sound justified for the value.

> >
> > This should bump the statistics every time for the correct MAC binding.
In ovn-controller we could periodically dump the flows from this table. The
period would be set to MIN(mac_binding_age_threshold/2) over all local
datapaths. The dump would happen from a different thread with its own rconn
to prevent backlogging issues. The thread would receive mapped data from an
I-P node that would keep track of the datapath -> cookies -> mac bindings
mapping. This allows us to avoid constant lookups, but at the cost of
keeping track of all local MAC bindings. To save some computation time this
I-P could be relevant only for datapaths that actually have the threshold
set.
> >
> > If the "idle_age" of the particular flow is smaller than the datapath
"mac_binding_age_threshold" it means that it is still in use. To prevent a
lot of updates, if the traffic is still relevant on multiple controllers,
we would check if the timestamp is older than the "dump period"; if not we
don't have to update it, because someone else did.
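
As an illustration only, a minimal sketch of that per-flow decision, using
hypothetical types and helper names (not actual ovn-controller code):

/* Illustration only: hypothetical types and names, not actual ovn-controller
 * code.  Decides whether this controller should refresh a MAC binding's SB
 * timestamp based on the idle_age reported for its per-binding flow. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct mac_binding_stats {
    uint64_t idle_age_s;      /* idle_age from the dumped OpenFlow flow. */
    uint64_t sb_timestamp_ms; /* Timestamp currently stored in the SB record. */
};

static bool
mac_binding_needs_refresh(const struct mac_binding_stats *stats,
                          uint64_t age_threshold_s,
                          uint64_t dump_period_ms,
                          uint64_t now_ms)
{
    /* No recent traffic: let the record age out normally. */
    if (stats->idle_age_s >= age_threshold_s) {
        return false;
    }
    /* Someone (possibly another controller) refreshed the timestamp within
     * the last dump period, so skip the update to limit SB churn. */
    if (now_ms - stats->sb_timestamp_ms < dump_period_ms) {
        return false;
    }
    return true;
}

int main(void)
{
    struct mac_binding_stats s = { .idle_age_s = 5, .sb_timestamp_ms = 0 };

    /* Threshold 30 s, dump period 15 s, "now" at 20 s: traffic is recent and
     * nobody refreshed the timestamp lately, so an update is needed. */
    printf("refresh? %s\n",
           mac_binding_needs_refresh(&s, 30, 15000, 20000) ? "yes" : "no");
    return 0;
}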

Thanks for trying to reduce the number of updates to the SB DB, but I still
have some concerns about this. In theory, to prevent a record from being
deleted while it is still in use, at least one timestamp update per threshold
is required for each record. Even if we bundle the updates from each node,
assume that the workloads owning the IP/MAC of the mac_binding records are
distributed across 1000 nodes and the aging threshold is 30s: there will
still be ~30 updates/s (if we can evenly distribute the updates from
different nodes). That's still a lot, which may keep the SB server and all
the ovn-controllers busy just with these messages. If the aging threshold is
set to 300s (5min), it may look better: ~3 updates/s, but this could still
make up the major part of the SB <-> ovn-controller messages. For example, in
an ovn-k8s deployment the cluster LR is distributed on all nodes, so all
nodes would need to monitor all mac-binding timestamp updates related to the
cluster LR, which means all mac-binding updates from all nodes. In reality
the number of messages may be doubled if we use the proposed dump-and-check
interval of mac_binding_age_threshold/2.

So, I'd evaluate both the dataplane and control plane costs before going for
a formal implementation.
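
For reference, the estimate above works out roughly as follows (same
assumptions as stated: bindings spread across 1000 nodes, thresholds of 30 s
and 300 s, and a dump interval of threshold/2); purely illustrative:

/* Back-of-envelope check of the SB update-rate estimate discussed above.
 * The numbers are the assumptions from this thread, nothing more. */
#include <stdio.h>

int main(void)
{
    const double records = 1000;                   /* roughly one per node */
    const double thresholds_s[] = { 30.0, 300.0 };

    for (int i = 0; i < 2; i++) {
        double base_rate = records / thresholds_s[i];
        /* Dumping and checking every threshold/2 can roughly double this. */
        printf("threshold %.0f s: ~%.0f updates/s, up to ~%.0f updates/s\n",
               thresholds_s[i], base_rate, 2 * base_rate);
    }
    return 0;
}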

Thanks,
Han

>
> >
> > Also to "desync" the controllers there would be a random delay added to
the "dump period".
> >
> > All of this would be applicable to FDB aging as well.
> >
> > Does that sound reasonable?
> > Please let me know if you have any comments/suggestions.
> >
> > Thanks,
> > Ales
>


Re: [ovs-discuss] OVN interconnection and NAT

2023-03-15 Thread Han Zhou via discuss
On Wed, Mar 15, 2023 at 1:00 PM Tiago Pires 
wrote:
>
> Hi Vladislav,
>
> It seems the gateway_port option was added in 22.09, according to this
commit:
https://github.com/ovn-org/ovn/commit/4f93381d7d38aa21f56fb3ff4ec00490fca12614
> It is what I need in order to make my use case to work, let me try it.

Thanks for reporting this issue. It would be good to try the gateway_port
option, but there also seems to be a bug somewhere if it behaves as you
described in the example, because:

"When  a  logical  router  has multiple distributed gateway ports and this
column is not set for a NAT
  rule, then the rule will be applied at the distributed
gateway port which is in the same  network  as
  the  external_ip  of  the NAT rule "

We need to check more on this.

Regards,
Han

>
> Thank you
>
> Tiago Pires
>
>
>
> On Wed, Mar 15, 2023 at 2:10 PM Vladislav Odintsov 
wrote:
>>
>> I’m sorry, of course I meant gateway_port instead of logical_port:
>>
>>gateway_port: optional weak reference to Logical_Router_Port
>>   A distributed gateway port in the Logical_Router_Port table where the
>>   NAT rule needs to be applied.
>>
>>   When multiple distributed gateway ports are configured on a
>>   Logical_Router, applying a NAT rule at each of the distributed gateway
>>   ports might not be desired. Consider the case where a logical router
>>   has 2 distributed gateway ports, one with networks 50.0.0.10/24 and the
>>   other with networks 60.0.0.10/24. If the logical router has a NAT rule
>>   of type snat, logical_ip 10.1.1.0/24 and external_ip 50.1.1.20/24, the
>>   rule needs to be selectively applied on matching packets
>>   entering/leaving through the distributed gateway port with networks
>>   50.0.0.10/24.
>>
>>   When a logical router has multiple distributed gateway ports and this
>>   column is not set for a NAT rule, then the rule will be applied at the
>>   distributed gateway port which is in the same network as the
>>   external_ip of the NAT rule, if such a router port exists. If a logical
>>   router has a single distributed gateway port and this column is not set
>>   for a NAT rule, the rule will be applied at the distributed gateway
>>   port even if the router port is not in the same network as the
>>   external_ip of the NAT rule.
>>
>> On 15 Mar 2023, at 20:05, Vladislav Odintsov via discuss <
ovs-discuss@openvswitch.org> wrote:
>>
>> Hi,
>>
>> since you've configured multiple LRPs with GW chassis, you must supply
logical_port for the NAT rule. Did you configure it?
>> You should see an appropriate message in the ovn-northd logfile.
>>
>>logical_port: optional string
>>   The name of the logical port where the logical_ip resides.
>>
>>   This is only used on distributed routers. This must be specified in
>>   order for the NAT rule to be processed in a distributed manner on all
>>   chassis. If this is not specified for a NAT rule on a distributed
>>   router, then this NAT rule will be processed in a centralized manner on
>>   the gateway port instance on the gateway chassis.
>>
>> On 15 Mar 2023, at 19:22, Tiago Pires via discuss <
ovs-discuss@openvswitch.org> wrote:
>>
>> Hi,
>>
>> In an OVN Interconnection environment (OVN 22.03) with a few AZs, I
noticed that when the OVN router has SNAT or DNAT_AND_SNAT enabled,
>> the traffic between the AZs is NATed.
>> When checking the OVN router's logical flows, it is possible to see the
LSP that is connected to the transit switch with NAT enabled:
>>
>> Scenario:
>>
>> OVN Global database:
>> # ovn-ic-sbctl show
>> availability-zone az1
>> gateway ovn-central-1
>> hostname: ovn-central-1
>> type: geneve
>> ip: 192.168.40.50
>> port ts1-r1-az1
>> transit switch: ts1
>> address: ["aa:aa:aa:aa:aa:10 169.254.100.10/24"]
>> availability-zone az2
>> gateway ovn-central-2
>> hostname: ovn-central-2
>> type: geneve
>> ip: 192.168.40.221
>> port ts1-r1-az2
>> transit switch: ts1
>> address: ["aa:aa:aa:aa:aa:20 169.254.100.20/24"]
>> availability-zone az3
>> gateway ovn-central-3
>> hostname: ovn-central-3
>> type: geneve
>> ip: 192.168.40.247
>> port ts1-r1-az3
>> transit switch: ts1
>> address: ["aa:aa:aa:aa:aa:30 169.254.100.30/24"]
>>
>> OVN Central (az1)
>>
>> # ovn-nbctl show r1
>> router 3e80e81a-58b5-41b1-9600-5bfc917c4ace (r1)
>> port r1-ts1-az1
>> mac: "aa:aa:aa:aa:aa:10"
>> networks: ["169.254.100.10/24"]
>> gateway chassis: [ovn-central-1]
>> port r1_s1
>> mac: "00:de:ad:fe:0:1"
>> networks: ["10.0.1.1/24"]
>> port r1_public
>> mac: 

Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-01-10 Thread Han Zhou via discuss
On Mon, Jan 9, 2023 at 3:34 AM Ilya Maximets  wrote:
>
> On 1/8/23 04:51, Han Zhou wrote:
> >
> >
> > On Tue, Jan 3, 2023 at 6:07 AM Ilya Maximets via discuss <
ovs-discuss@openvswitch.org > wrote:
> >>
> >> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> >> > Hello,
> >> >
> >> > When performing an online schema conversion for a clustered DB the
> >> > `ovsdb-client` connects to the current leader of the cluster and
> >> > requests it to convert the DB to a new schema.
> >> >
> >> > The main thread of the leader ovsdb-server will then parse the new
> >> > schema and copy the entire database into a new in-memory copy using
> >> > the new schema. For a moderately sized database, let's say 650MB
> >> > on-disk, this process can take north of 24 seconds on a modern
> >> > adequately performant system.
> >> >
> >> > While this is happening the ovsdb-server process will not process any
> >> > raft election events or inactivity probes, so by the time the
> >> > conversion is done and the now past leader wants to write the
> >> > converted database to the cluster, its connection to the cluster is
> >> > dead.
> >> >
> >> > The past leader will keep repeating this process indefinitely, until
> >> > the client requesting the conversion disconnects. No message is
passed
> >> > to the client.
> >> >
> >> > Meanwhile the other nodes in the cluster have moved on with a new
leader.
> >> >
> >> > A workaround for this scenario would be to increase the election
timer
> >> > to a value great enough so that the conversion can succeed within an
> >> > election window.
> >> >
> >> > I don't view this as a permanent solution though, as it would be
> >> > unfair to leave the end user guessing the correct election timer
> >> > in order for their upgrades to succeed.
> >> >
> >> > Maybe we need to hand off conversion to a thread and make the main
> >> > loop only process raft requests until it is done, similar to the
> >> > recent addition of preparing snapshot JSON in a separate thread [0].
> >> >
> >> > Any other thoughts or ideas?
> >> >
> >> > 0:
https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >> >
> >>
> >> Hi, Frode.  Thanks for starting this conversation.
> >>
> >> First of all I'd still respectfully disagree that 650 MB is a
> >> moderately sized database. :)  ovsdb-server on its own doesn't limit
> >> users on how much data they can put in, but that doesn't mean there
> >> is no limit at which it will be difficult for it or even impossible
> >> to handle the database.  From my experience 650 MB is far beyond the
> >> threshold for smooth operation.
> >>
> >> Allowing a database to grow to such a size might be considered a user
> >> error, or a CMS error.  In any case, setups should be tested at the
> >> desired [simulated, at least] scale, including upgrades, before
> >> deploying in a production environment, so as not to run into such
> >> issues unexpectedly.
> >>
> >> Another way out of the situation, besides bumping the election
> >> timer, might be to pin ovn-controllers, destroy the database (maybe
> >> keep port bindings, etc.) and let northd re-create it after
> >> conversion.  Not sure if that will actually work though, as I
> >> didn't try.
> >>
> >>
> >> For the threads, I'll re-iterate my thought that throwing more
> >> cores at the problem is absolutely the last thing we should do, and only
> >> if there is no other choice.  Simply because many parts of
> >> ovsdb-server were never optimized for performance, and there are
> >> likely many things we can do to improve without blindly using more
> >> resources and increasing the code complexity by adding threads.
> >>
> >>
> >> Speaking of non-optimal code, the conversion process seems very
> >> inefficient.  Let's deconstruct it.  (I'll skip the standalone
> >> case, focusing on the clustered mode.)
> >>
> >> There are a few main steps:
> >>
> >> 1. ovsdb_convert() - Creates a copy of a database converting
> >>each column along the way and checks the constraints.
> >>
> >> 2. ovsdb_to_txn_json() - Converts the new database into a
> >>transaction JSON object.
> >>
> >> 3. ovsdb_txn_propose_schema_change() - Writes the new schema
> >>and the transaction JSON to the storage (RAFT).
> >>
> >> 4. ovsdb_destroy() - Copy of a database is destroyed.
> >>
> >>-
> >>
> >> 5. read_db()/parse_txn() - Reads the new schema and the
> >>transaction JSON from the storage, replaces the current
> >>database with an empty one and replays the transaction
> >>that creates a new converted database.
> >>
> >> There is always a storage run between steps 4 and 5, so we generally
> >> only care that steps 1-4 and step 5 are each below the election timer
> >> threshold.
> >>
> >>
> >> Now looking closer at step 1, which is the most time-consuming
> >> step.  It has two stages - data conversion and 

Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-01-07 Thread Han Zhou via discuss
On Tue, Jan 3, 2023 at 6:07 AM Ilya Maximets via discuss <
ovs-discuss@openvswitch.org> wrote:
>
> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> > Hello,
> >
> > When performing an online schema conversion for a clustered DB the
> > `ovsdb-client` connects to the current leader of the cluster and
> > requests it to convert the DB to a new schema.
> >
> > The main thread of the leader ovsdb-server will then parse the new
> > schema and copy the entire database into a new in-memory copy using
> > the new schema. For a moderately sized database, let's say 650MB
> > on-disk, this process can take north of 24 seconds on a modern
> > adequately performant system.
> >
> > While this is happening the ovsdb-server process will not process any
> > raft election events or inactivity probes, so by the time the
> > conversion is done and the now past leader wants to write the
> > converted database to the cluster, its connection to the cluster is
> > dead.
> >
> > The past leader will keep repeating this process indefinitely, until
> > the client requesting the conversion disconnects. No message is passed
> > to the client.
> >
> > Meanwhile the other nodes in the cluster have moved on with a new
leader.
> >
> > A workaround for this scenario would be to increase the election timer
> > to a value great enough so that the conversion can succeed within an
> > election window.
> >
> > I don't view this as a permanent solution though, as it would be
> > unfair to leave the end user guessing the correct election timer
> > in order for their upgrades to succeed.
> >
> > Maybe we need to hand off conversion to a thread and make the main
> > loop only process raft requests until it is done, similar to the
> > recent addition of preparing snapshot JSON in a separate thread [0].
> >
> > Any other thoughts or ideas?
> >
> > 0:
https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >
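As an illustration of the "hand conversion off to a thread" idea quoted
above, a rough sketch in plain pthreads; do_schema_conversion() and
raft_run_once() are hypothetical placeholders (OVS has its own thread and
latch utilities), so this is a sketch of the idea, not a proposed patch:

/* Illustration only: run the schema conversion in a worker thread so the
 * main loop keeps servicing RAFT in the meantime.  do_schema_conversion()
 * and raft_run_once() are hypothetical placeholders, not real ovsdb-server
 * functions. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

struct conversion_job {
    atomic_bool done;
    /* In a real implementation: the database, the new schema, and the
     * resulting conversion transaction JSON would live here. */
};

static void *
conversion_thread(void *arg)
{
    struct conversion_job *job = arg;

    /* do_schema_conversion(job);  hypothetical: steps 1-2, the expensive part */
    atomic_store(&job->done, true);
    return NULL;
}

int main(void)
{
    struct conversion_job job;
    pthread_t tid;

    atomic_init(&job.done, false);
    pthread_create(&tid, NULL, conversion_thread, &job);

    /* Main loop: keep processing RAFT elections and heartbeats instead of
     * blocking for tens of seconds while the conversion runs. */
    while (!atomic_load(&job.done)) {
        /* raft_run_once();  hypothetical placeholder */
        usleep(10 * 1000);
    }
    pthread_join(tid, NULL);

    /* Step 3 (writing the new schema and transaction to storage) would then
     * happen back on the main thread. */
    return 0;
}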
>
> Hi, Frode.  Thanks for starting this conversation.
>
> First of all I'd still respectfully disagree that 650 MB is a
> moderately sized database. :)  ovsdb-server on its own doesn't limit
> users on how much data they can put in, but that doesn't mean there
> is no limit at which it will be difficult for it or even impossible
> to handle the database.  From my experience 650 MB is far beyond the
> threshold for smooth operation.
>
> Allowing a database to grow to such a size might be considered a user
> error, or a CMS error.  In any case, setups should be tested at the
> desired [simulated, at least] scale, including upgrades, before
> deploying in a production environment, so as not to run into such
> issues unexpectedly.
>
> Another way out of the situation, besides bumping the election
> timer, might be to pin ovn-controllers, destroy the database (maybe
> keep port bindings, etc.) and let northd re-create it after
> conversion.  Not sure if that will actually work though, as I
> didn't try.
>
>
> For the threads, I'll re-iterate my thought that throwing more
> cores at the problem is absolutely the last thing we should do, and only
> if there is no other choice.  Simply because many parts of
> ovsdb-server were never optimized for performance, and there are
> likely many things we can do to improve without blindly using more
> resources and increasing the code complexity by adding threads.
>
>
> Speaking of non-optimal code, the conversion process seems very
> inefficient.  Let's deconstruct it.  (I'll skip the standalone
> case, focusing on the clustered mode.)
>
> There are a few main steps:
>
> 1. ovsdb_convert() - Creates a copy of a database converting
>each column along the way and checks the constraints.
>
> 2. ovsdb_to_txn_json() - Converts the new database into a
>transaction JSON object.
>
> 3. ovsdb_txn_propose_schema_change() - Writes the new schema
>and the transaction JSON to the storage (RAFT).
>
> 4. ovsdb_destroy() - Copy of a database is destroyed.
>
>-
>
> 5. read_db()/parse_txn() - Reads the new schema and the
>transaction JSON from the storage, replaces the current
>database with an empty one and replays the transaction
>that creates a new converted database.
>
> There is always a storage run between steps 4 and 5, so we generally
> only care that steps 1-4 and step 5 are each below the election timer
> threshold.
>
>
> Now looking closer at step 1, which is the most time-consuming
> step.  It has two stages - data conversion and the transaction
> check.  The data conversion part makes sure that we're creating all
> the rows in the new database with all the new columns and without
> removed columns.  It also makes sure that all the datum objects
> are converted from the old column type to the new column type by
> calling ovsdb_datum_convert() for every one of them.
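
Conceptually, that data-conversion stage is a walk over every row and every
column, converting each datum along the way; a toy sketch, with made-up
types (only ovsdb_datum_convert(), named above, is a real function):

/* Illustration only: toy sketch of the per-row, per-column conversion loop.
 * The types and helper below are made up; in the real code the per-datum
 * work is done by ovsdb_datum_convert(). */
#include <stdio.h>

struct toy_column {
    const char *name;
};

/* Stand-in for the expensive per-datum conversion. */
static void
convert_datum(int row, const char *column_name)
{
    printf("row %d: converting datum in column '%s'\n", row, column_name);
}

int main(void)
{
    struct toy_column columns[] = { { "mac" }, { "up" }, { "priority" } };
    const int n_rows = 3;   /* a 650 MB database has a huge number of rows */

    /* Every datum of every row of every table goes through this, so the
     * cost scales with the total size of the database. */
    for (int r = 0; r < n_rows; r++) {
        for (size_t c = 0; c < sizeof columns / sizeof columns[0]; c++) {
            convert_datum(r, columns[c].name);
        }
    }
    return 0;
}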
>
> Datum conversion is a very heavy operation, because it involves
> converting the datum to JSON and back.  However, in the vast majority of
> cases column types do not change at all, and even if they do, it only