Re: [ovs-discuss] NorthD inc-engine Handlers; OVN 24.03

2024-05-15 Thread Han Zhou via discuss
On Tue, May 14, 2024 at 12:31 AM Dumitru Ceara  wrote:
>
> On 5/8/24 18:01, Numan Siddique wrote:
> > On Wed, May 8, 2024 at 8:42 AM Шагов Георгий via discuss <
> > ovs-discuss@openvswitch.org> wrote:
> >
> >> Hello everyone
> >>
> >>
> >>
> >> In some aspects it might be considered a continuation of this thread:
> >> (link1), yet it is different.
> >>
> >> After we upgraded from OVN 22.03 to OVN 24.03, we did indeed see a 3-4x
> >> increase in performance.
> >>
> >> And yet we still observe high CPU load for the NorthD process; digging
> >> deeper into the logs we found:
> >>
> >>
> >>
> >
> > Thanks for reporting this issue.
> >
> >
> >> 2024-05-07T08:36:46.505Z|18503|poll_loop|INFO|wakeup due to [POLLIN] on fd
> >> 15 (10.34.22.66:60716<->10.34.22.66:6642) at lib/stream-fd.c:157 (94% CPU
> >> usage)
> >>
> >> *2024-05-07T08:37:38.857Z|18504|inc_proc_eng|INFO|node: northd, recompute
> >> (missing handler for input SB_datapath_binding) took 52313ms*
> >>
> >> *2024-05-07T08:37:48.335Z|18505|inc_proc_eng|INFO|node: lflow, recompute
> >> (failed handler for input northd) took 7759ms*
> >>
> >> *2024-05-07T08:37:48.718Z|18506|timeval|WARN|Unreasonably long 62213ms
> >> poll interval (56201ms user, 2900ms system)*
> >>
> >>
> >>
> >> As you can see there is a significant delay of 52 seconds
> >>
>
> This is huge indeed!
>
> >> Correct me if I am wrong, but in my understanding '*missing handler for*'
> >> practically means the absence of an inc-engine handler for some input
> >> node (in this sample: *SB_datapath_binding*)
> >>
> >
> > That's correct.
> >
> >> Before plunging into development, it would be great to clarify/align with
> >> the community's position:
> >>
> >>    - Why is there no handler for this node?
> >>
> >>
> > Our approach has been to add a handler for any input change only if it is
> > frequent or if it can be easily handled. We have also skipped adding
> > handlers when doing so increases the code complexity. Having said that, I
> > think we are open to adding more handlers if it makes sense or if it
> > results in scale improvements.
> >
> > Right now we fall back to a full recompute of the northd engine for any
> > changes to a logical switch or logical router.
> > Does your deployment create/delete logical switches/routers frequently?
> > Is it possible to enable ovn debug logs and share them? I'm curious to
> > know what the changes to SB datapath binding are.
> >
> > Feel free to share your OVN NB and SB DBs if you're ok with it.  I can
> > deploy those DBs and see why recompute is so expensive.
> >
> >
> >
> >>    - Any particular reason for this, or did a peculiarity of our
> >>      installation just highlight this issue?
> >>
> >>
> > My guess is that your installation is frequently creating, deleting or
> > modifying logical switches or routers.
> >
> >
> >>    - Do you think there is a reason to implement that handler
> >>      (*SB_datapath_binding*)?
> >>
> >>
> > I'm fine with adding a handler if it helps at scale.  In our use cases, we
> > don't frequently create/delete logical switches and routers, and hence it
> > is ok to fall back to full recomputes for such changes.
> >
> >
> >>-
> >>
> >>
> >>
> >> Any ideas are highly appreciated.
> >>
> >
> > You're welcome to work on it and submit patches to add a handler for
> > SB_datapath_binding.
> >
> > @Dumitru Ceara  @Han Zhou  if you have any reservations about adding more
> > handlers please do comment here.
> >
>
> In general, especially if it fixes a scalability issue like this one,
> it's probably fine.  In practice it depends a bit on how much complexity
> this would add to the code.
>
I agree with the general statement.

> But the best way to tell is to have a way to reproduce this, e.g., NB/SB
> databases and the NB/SB jsonrpc update that caused the recompute.
>

Yes, it is better to understand why the recompute took so long (52s) in this
deployment. Is it simply a very large scale, or is it some uncommon
configuration that we don't handle efficiently and could optimize to improve
recompute performance?

Otherwise, even if we can implement datapath I-P, there can be just another
input change that triggers recompute and causes the same latency. It is
just not sustainable to maintain more and more I-P in northd.
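As a first step, something along these lines could help narrow it down (a
sketch only, assuming a recent OVN build that provides the inc-engine unixctl
commands; `ovn-appctl -t ovn-northd list-commands` will show what your
version actually supports):

# raise northd logging so the triggering NB/SB changes show up in the log
ovn-appctl -t ovn-northd vlog/set dbg
# show per-engine-node stats: how often each input was handled incrementally
# vs. forced a recompute
ovn-appctl -t ovn-northd inc-engine/show-stats
# force a full recompute to measure its cost in isolation
ovn-appctl -t ovn-northd inc-engine/recompute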

> Regards,
> Dumitru
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [ovs-dev] [ANN] Primary OVS development branch renamed as main

2024-04-10 Thread Han Zhou via discuss
On Wed, Apr 10, 2024 at 6:52 AM Simon Horman  wrote:
>
> Hi,
>
> I would like to announce that the primary development branch for OvS
> has been renamed main.
>
> The rename occurred a little earlier today.
>
> OVS is currently hosted on GitHub. We can expect the following behaviour
> after the rename:
>
> * GitHub pull requests against master should have been automatically
>   re-homed on main.
> * GitHub Issues should not be affected - the test issue I
>   created had no association with a branch
> * URLs accessed via the GitHub web UI are automatically renamed
> * Clones may also rename their primary branch - you may
>   get a notification about this in the Web UI
>
> As a result of this change it may be necessary to update your local git
> configuration for checked out branches.
>
> For example:
> # Fetch origin: new remote main branch; remote master branch is deleted
> git fetch -tp origin
> # Rename local branch
> git branch -m master main
> # Update local main branch to use remote main branch as its upstream
> git branch --set-upstream-to=origin/main main
>
> If you have an automation that fetches the master branch then please
> update the automation to fetch main. If your automation is fetching
> main and falling back to master, then it should now be safe to
> remove the fallback.
>
> This change is in keeping with OVS's recently adopted policy of using
> the inclusive naming word list v1 [1, 2].
>
> [1] df5e5cf4318a ("Documentation: Add section on inclusive language.")
> [2] https://inclusivenaming.org/word-lists/
>
> Kind regards,
> Simon
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Thanks Simon. Shall this be announced to ovs-announce as well?

Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] OVN SB DB from RAFT cluster to Relay DB

2024-03-19 Thread Han Zhou via discuss
On Thu, Mar 7, 2024 at 12:29 PM Sri kor via discuss <
ovs-discuss@openvswitch.org> wrote:
>
> Is there a way to configure ovn-controller to subscribe to only specific
> SB DB updates?
> I

Hi Sri,

As mentioned by Felix, ovn-controller by default subscribes to only SB DB
updates that it considers relevant to the hypervisor.
What is your setting of external_ids:ovn-monitor-all? (ovs-vsctl
--if-exists get open . external_ids:ovn-monitor-all)
Is there anything you want to tune beyond this?
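For reference, a rough sketch of checking and changing that setting on a
chassis (adjust to your deployment; the trade-offs are as discussed below):

# show the current value (empty output means the default, false)
ovs-vsctl --if-exists get open . external_ids:ovn-monitor-all
# enable it: replicate the whole SB DB instead of conditional monitoring,
# trading chassis memory/bandwidth for less filtering work on the server
ovs-vsctl set open . external_ids:ovn-monitor-all=true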

Thanks,
Han

>
> On Wed, Mar 6, 2024 at 1:45 AM Felix Huettner 
wrote:
>>
>> On Wed, Mar 06, 2024 at 10:29:29AM +0300, Vladislav Odintsov wrote:
>> > Hi Felix,
>> >
>> > > On 6 Mar 2024, at 10:16, Felix Huettner via discuss <
ovs-discuss@openvswitch.org> wrote:
>> > >
>> > > Hi Srini,
>> > >
>> > > i can share what works for us for ~1k hypervisors:
>> > >
>> > > On Tue, Mar 05, 2024 at 09:51:43PM -0800, Sri kor via discuss wrote:
>> > >> Hi Team,
>> > >>
>> > >>
>> > >> Currently, we are using OVN in RAFT cluster mode. We have 3 NB and SB
>> > >> ovsdb-servers operating in RAFT cluster mode. Currently we have 500
>> > >> hypervisors connected to this RAFT cluster.
>> > >>
>> > >> For our next deployment, our scale would increase to 3000 hypervisors.
>> > >> To accommodate these scaled hypervisors, we are migrating to a DB relay
>> > >> with a multigroup deployment model. This helps with OVN SB DB read
>> > >> transactions. But for write transactions, only the leader in the RAFT
>> > >> cluster can update the DB. This creates load on the leader of the RAFT
>> > >> cluster. Is there a way to address the load on the RAFT cluster leader?
>> > >
>> > > We do the following:
>> > > * If you need TLS on the ovsdb path, separate it out to some
>> > >   reverse proxy that can do just L4 TLS termination (e.g. traefik)
>> >
>> > Do I understand correctly that with such TLS "offload" you can’t use
>> > RBAC for hypervisors?
>> >
>>
>> yes, that is the unfortunate side effect
>>
>> > > * Have nobody besides northd connect to the SB DB directly, everyone
>> > >  else needs to use a relay
>> > > * Do not run backups on the cluster leader, but on one of the current
>> > >  followers
>> > > * Increase the raft election timeout significantly (we have 120s in
>> > >  there). However there is a patch afaik in 3.3 that makes that better
>> > > * If you create metrics or similar from database content, generate
>> > >   these on the relays instead of the raft cluster
>> > >
>> > > Overall, when our southbound db had issues, most of the time it was
>> > > some client constantly reconnecting to it and thereby always pulling a
>> > > full DB dump.
>> > >
>> > >>
>> > >>
>> > >> As the scale increases, the number of updates coming to ovn-controller
>> > >> from the OVN SB increases. That creates pressure on ovn-controller. Is
>> > >> there a way to minimize the load on ovn-controller?
>> > >
>> > > We did not see any kind of issue there yet.
>> > > However, if you are using some python tooling outside of OVN (e.g.
>> > > OpenStack), ensure that you have JSON parsing using a C library
>> > > available in the ovs lib. This brings significant performance benefits
>> > > if you have a lot of updates.
>> > > You can check with `python3 -c "import ovs.json; print(ovs.json.PARSER)"`
>> > > which should return "C".
>> > >
>> > >>
>> > >> I wish there were a way for ovn-controller to subscribe to updates
>> > >> specific to this hypervisor. Are there any known ovn-controller
>> > >> subscription methods available and being used in the OVS community?
>> > >
>> > > Yes, they do that by default. However, for us we saw that this creates
>> > > increased load on the relays due to the additional filtering and json
>> > > serializing needed per target node. So we turned it off and thereby
>> > > trade less ovsdb load for more network bandwidth.
>> > > The relevant setting is `external_ids:ovn-monitor-all`.
>> > >
>> > > Thanks
>> > > Felix
>> > >
>> > >>
>> > >>
>> > >> How can I optimize the load on the leader node in an OVN RAFT cluster
>> > >> to handle increased write transactions?
>> > >>
>> > >>
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Srini
>> > >
>> > >> ___
>> > >> discuss mailing list
>> > >> disc...@openvswitch.org
>> > >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>> > >
>> > > ___
>> > > discuss mailing list
>> > > disc...@openvswitch.org 
>> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>> >
>> >
>> > Regards,
>> > Vladislav Odintsov
>> >
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-01 Thread Han Zhou via discuss
On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
>
> Hi Han,
>
> Please see my comments/questions inline.
>
> Han Zhou, Sep 30, 2023 at 21:59:
> > > Distributed mac learning
> > > 
> > >
> > > Use one OVS bridge per logical switch with mac learning enabled. Only
> > > create the bridge if the logical switch has a port bound to the local
> > > chassis.
> > >
> > > Pros:
> > >
> > > - Minimal openflow rules required in each bridge (ACLs and NAT
mostly).
> > > - No central mac binding table required.
> >
> > Firstly, to clarify the terminology of "mac binding" to avoid confusion:
> > the mac_binding table currently in the SB DB has nothing to do with L2 MAC
> > learning. It is actually the ARP/Neighbor table of distributed logical
> > routers. We should probably call it the IP_MAC_binding table, or just the
> > Neighbor table.
>
> Yes sorry about the confusion. I actually meant the FDB table.
>
> > What you mean here is actually L2 MAC learning, which today is implemented
> > by the FDB table in the SB DB, and it is only for uncommon use cases when
> > the NB doesn't have the knowledge of a MAC address of a VIF.
>
> This is not that uncommon in telco use cases where VNFs can send packets
> from mac addresses unknown to OVN.
>
Understood, but VNFs contribute a very small portion of the workloads,
right? Maybe I should rephrase: it is uncommon to have "unknown" addresses
for the majority of ports in a large-scale cloud. Is this understanding
correct?

> > The purpose of this proposal is clear - to avoid using a central table in
> > the DB for L2 information and instead use L2 MAC learning to populate such
> > information on the chassis, which is a reasonable alternative with pros
> > and cons.
> > However, I don't think it is necessary to use separate OVS bridges for
> > this purpose. L2 MAC learning can be easily implemented in the br-int
> > bridge with OVS flows, which is much simpler than managing a dynamic
> > number of OVS bridges just for the purpose of using the builtin OVS
> > mac-learning.
>
> I agree that this could also be implemented with VLAN tags on the
> appropriate ports. But since OVS does not support trunk ports, it may
> require complicated OF pipelines. My intent with this idea was two fold:
>
> 1) Avoid a central point of failure for mac learning/aging.
> 2) Simplify the OF pipeline by making all FDB operations dynamic.

IMHO, the L2 pipeline is not really complex. It is probably the simplest
part (compared with other features for L3, NAT, ACL, LB, etc.).
Adding dynamic learning to this part probably makes it *a little* more
complex, but it should still be straightforward. We don't need any VLAN tag
because the incoming packet has the geneve VNI in its metadata. We just need
a flow that resubmits to look up a MAC-tunnelSrc mapping table and injects a
new flow (with the related tunnel endpoint information) if the src MAC is
not found, with the help of the "learn" action. The entries are
per-logical_switch (VNI). This would serve your purpose of avoiding a
central DB for L2. At least this looks much simpler to me than managing a
dynamic number of OVS bridges and the patch pairs between them.
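To make this concrete, here is a rough, hypothetical sketch of such a flow
against plain OVS, outside the real ovn-controller pipeline (the table
numbers, timeout and priority are invented for illustration only):

# table 60: as a side effect of every packet, learn "dst MAC == this packet's
# src MAC, same VNI -> set tunnel destination to this packet's tunnel source"
# into table 61, then continue the pipeline in table 61
ovs-ofctl add-flow br-int \
  "table=60, priority=100, \
   actions=learn(table=61, hard_timeout=300, priority=100, \
                 NXM_NX_TUN_ID[], \
                 NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[], \
                 load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[]), \
   resubmit(,61)"

Unknown destinations would still need a flood flow in table 61, and packets
arriving from local VIFs (no tunnel source) would need their own variant, so
this is only the skeleton of the idea.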

>
> > Now back to the distributed MAC learning idea itself. Essentially, for two
> > VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a packet to
> > VM2@Chassis2; assuming VM1 already has VM2's MAC address (we will discuss
> > this later), Chassis1 needs to know that VM2's MAC is located on Chassis2.
> >
> > In OVN today this information is conveyed through:
> >
> > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> >
> > In your proposal:
> >
> > - MAC and Chassis mapping (can be learned through initial L2
> >   broadcast/flood)
> >
> > This indeed would avoid the control plane cost through the centralized
> > components (for this L2 binding part). Given that today's SB OVSDB is a
> > bottleneck, this idea may sound attractive. But please also take into
> > consideration the below improvement that could mitigate the OVN central
> > scale issue:
> >
> > - For MAC and LSP mapping, northd is now capable of incrementally
> >   processing VIF related L2/L3 changes, so the cost of NB -> northd ->
> >   SB is very small. For SB -> Chassis, a more scalable DB deployment,
> >   such as the OVSDB relays, may largely help.
>
> But using relays will only help with read-only operations (SB ->
> chassis). Write operations (from dynamically learned mac addresses) will
> be 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-01 Thread Han Zhou via discuss
On Sun, Oct 1, 2023 at 9:06 AM Robin Jarry  wrote:
>
> Hi Han,
>
> thanks a lot for your detailed answer.
>
> Han Zhou, Sep 30, 2023 at 01:03:
> > > I think ovn-controller only consumes the logical flows. The chassis and
> > > port bindings tables are used by northd to update these logical flows.
> >
> > Felix was right. For example, port-binding is firstly a configuration from
> > the north-bound side, but the states such as its physical location (the
> > chassis column) are populated by the ovn-controller of the owning chassis
> > and consumed by other ovn-controllers that are interested in that
> > port-binding.
>
> I was not aware of this. Thanks.
>
> > > Exactly, but was the signaling between the nodes ever an issue?
> >
> > I am not an expert in BGP, but as far as I am aware, there are scaling
> > issues in things like BGP full-mesh signaling, and there are solutions
> > such as route reflectors (which are again centralized) to solve such
> > issues.
>
> I am not familiar with BGP full mesh signaling. But from what I can tell,
> it looks like the same concept as the full mesh of GENEVE tunnels, except
> that the tunnels are only used when the same logical switch is implemented
> between two nodes.
>
Please note that tunnels are needed not only between nodes related to same
logical switches, but also when they are related to different logical
switches connected by logical routers (even multiple LR+LS hops away).

> > > So you have enabled monitor_all=true as well? Or did you test at scale
> > > with monitor_all=false.
> > >
> > We do use monitor_all=false, primarily to reduce the memory footprint (and
> > also the CPU cost of IDL processing) on each chassis. There are trade-offs
> > for the SB DB server performance:
> >
> > - On one hand it increases the cost of conditional monitoring, which
> >   is expensive for sure
> > - On the other hand, it reduces the total amount of data for the
> >   server to propagate to clients
> >
> > It really depends on your topology for making the choice. If most of the
> > nodes would anyway monitor most of the DB data (something similar to a
> > full mesh), it is more reasonable to use monitor_all=true. Otherwise, in a
> > topology like ovn-kubernetes where each node has its dedicated part of the
> > data, or in topologies where you have lots of small "islands" such as a
> > cloud with many small tenants that never talk to each other, using
> > monitor_all=false could make sense (but it still needs to be carefully
> > evaluated and tested for your own use cases).
>
> I didn't see recent scale testing for openstack, but in past testing we
> had to set monitor_all=true because the CPU usage of the SB ovsdb was
> a bottleneck.
>
To clarify a little more, openstack deployments can have different logical
topologies. So to evaluate the impact of the monitor_all setting there should
be different test cases capturing different types of deployment, e.g. a
full-mesh topology (monitor_all=true is better) vs. a "small islands"
topology (monitor_all=false is reasonable).

> > > The memory usage would be reduced but I don't know to which point. One
> > > of the main consumers is the logical flows table which is required
> > > everywhere. Unless there is a way to only sync a portion of this table
> > > depending on the chassis, disabling monitor_all would save syncing the
> > > unneeded tables for ovn-controller: chassis, port bindings, etc.
> >
> > Probably it wasn't what you meant, but I'd like to clarify that it is not
> > about unneeded tables, but unneeded rows in those tables (mainly
> > logical_flow and port_binding).
> > It indeed syncs only a portion of the tables. It does not depend directly
> > on the chassis, but on what port-bindings are on the chassis and what
> > logical connectivity those port-bindings have. So, again, the choice
> > really depends on your use cases.
>
> What about the FDB (mac-port) and MAC binding (ip-mac) tables? I thought
> ovn-controller does not need them. If that is the case, I thought that
> by default, the whole tables (not only some of their rows) were excluded
> from the synchronized data.
>
The FDB and MAC_binding tables are used by ovn-controllers. They are
essentially the central storage for the MAC tables of the distributed
logical switches (FDB) and the ARP/Neighbour tables of the distributed
logical routers (MAC_binding). A record can be populated by one chassis and
consumed by many other chassis.
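For reference, both tables can be inspected directly from any node that can
reach the SB DB, e.g.:

# L2 entries learned for logical switches
ovn-sbctl list FDB
# ARP/neighbor entries learned for distributed logical routers
ovn-sbctl list MAC_Binding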

monitor_all should work the same way for these tables: if monitor_all =
false, only rows related to "local datapaths" should be downloaded to 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Han Zhou via discuss
 “ls1” bridge, next run through lrouter “lr1”
> >>> OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
> >>> learning between logical routers (should we have here OF flow with
> >>> learn action?), then send packet again to OVS, calculate “lr2”
> >>> OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
> >>> send packet to a vm2?
> >
> > What I am proposing is to implement the northbound L2 network intent
> > with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
> > constructs and ACLs would require patch ports and specific OF pipelines.
> >
> > We could even think of adding more advanced L3 capabilities (RIB) into
> > OVS to simplify the OF pipelines.
>
> But this will make OVS<->kernel interaction more complex. Even if we
> forget about dpdk environments…
>
> >
> >>> Also, will such behavior be compatible with HW-offload-capable to
> >>> smartnics/DPUs?
> >>
> >> I am also a bit concerned about this, what would be the typical number
> >> of bridges supported by hardware?
> >
> > As far as I understand, only the datapath flows are offloaded to
> > hardware. The OF pipeline is only parsed when there is an upcall for the
> > first packet. Once resolved, the datapath flow is reused. OVS bridges
> > are only logical constructs, they are neither reflected in the datapath
> > nor in hardware.
>
> As far as I remember from my tests against ConnectX-5/6 SmartNICs in
> ASAP^2 mode, HW offload is not capable of offloading OVS patch ports. At
> least it was so 2 years ago.
> If that is still true, this will degrade the ability to offload the datapath.
> Maybe @Han Zhou can comment on this in more detail.
>
I am not aware of any problem with HW offloading for patch ports in general,
and I am backing what Robin said about the OVS HW offloading mechanism. With
the current smartNIC used in our environment there is no offload problem
with OVS patch ports. Probably there were problems that I wasn't aware of in
other versions of HW or with particular settings, which may need to be
checked with the dedicated teams.
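As a quick sanity check on a given chassis (assuming hw-offload is enabled in
OVS), one can verify whether the traffic of interest is actually offloaded:

# confirm offload is enabled in ovs-vswitchd
ovs-vsctl get Open_vSwitch . other_config:hw-offload
# dump only the datapath flows that were offloaded to the NIC
ovs-appctl dpctl/dump-flows type=offloaded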

However, IMHO, as mentioned in my other response, separating OVS bridges
doesn't sound necessary for the idea of distributed MAC learning.

Thanks,
Han

> >
> >>>>>> Use multicast for overlay networks
> >>>>>> ==
> >>>> [snip]
> >>>>>> - 24bit VNI allows for more than 16 million logical switches. No need
> >>>>>> for extended GENEVE tunnel options.
> >>>>> Note that using vxlan at the moment significantly reduces the ovn
> >>>>> featureset. This is because the geneve header options are currently
> >>>>> used for data that would not fit into the vxlan vni.
> >>>>>
> >>>>> From ovn-architecture.7.xml:
> >>>>> ```
> >>>>> The maximum number of networks is reduced to 4096.
> >>>>> The maximum number of ports per network is reduced to 2048.
> >>>>> ACLs matching against logical ingress port identifiers are not
> >>>>> supported.
> >>>>> OVN interconnection feature is not supported.
> >>>>> ```
> >>>>
> >>>> In my understanding, the main reason why GENEVE replaced VXLAN is
> >>>> because Openstack uses full-mesh point-to-point tunnels and the sender
> >>>> needs to know behind which chassis any mac address is, to send it into
> >>>> the correct tunnel. GENEVE allowed reducing the lookup time both on the
> >>>> sender and receiver thanks to ingress/egress port metadata.
> >>>>
> >>>>
https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> >>>> https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
> >>>>
> >>>> If VXLAN + multicast and address learning were used, the "correct"
> >>>> tunnel would be established ad-hoc and both sender and receiver lookups
> >>>> would only be simple mac forwarding with learning. The ingress pipeline
> >>>> would probably cost a little more.
> >>>>
> >>>> Maybe multicast + address learning could be implemented for GENEVE as
> >>>> well. But it would not be interoperable with other VTEPs.
> >>
> >> While it is true that it takes time before switch hardware picks up
> >> support for emerging protocols, I do not think it is a valid argument
> >> for limiting the development o

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Han Zhou via discuss
On Thu, Sep 28, 2023 at 9:28 AM Robin Jarry  wrote:
>
> Hello OVN community,
>
> This is a follow up on the message I have sent today [1]. That second
> part focuses on some ideas I have to remove the limitations that were
> mentioned in the previous email.
>
> [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052695.html
>
> If you didn't read it, my goal is to start a discussion about how we
> could improve OVN on the following topics:
>
> - Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
> - Support scaling of L2 connectivity across larger clusters.
> - Simplify CMS interoperability.
> - Allow support for alternative datapath implementations.
>
> Disclaimer:
>
> This message does not mention anything about L3/L4 features of OVN.
> I didn't have time to work on these, yet. I hope we can discuss how
> these fit with my ideas.
>

Hi Robin and folks, thanks for the great discussions!
I read the replies of two other threads of this email, but I am replying
directly here to comment on some of the original statements in this email.
I will reply to the other threads for some specific points.

> Distributed mac learning
> 
>
> Use one OVS bridge per logical switch with mac learning enabled. Only
> create the bridge if the logical switch has a port bound to the local
> chassis.
>
> Pros:
>
> - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> - No central mac binding table required.

Firstly, to clarify the terminology of "mac binding" to avoid confusion: the
mac_binding table currently in the SB DB has nothing to do with L2 MAC
learning. It is actually the ARP/Neighbor table of distributed logical
routers. We should probably call it the IP_MAC_binding table, or just the
Neighbor table.
What you mean here is actually L2 MAC learning, which today is implemented
by the FDB table in the SB DB, and it is only for uncommon use cases when
the NB doesn't have the knowledge of a MAC address of a VIF.
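As an illustration of that distinction (the port name and addresses below are
made up): a CMS that knows the VIF's MAC/IP declares them on the LSP, and
only ports declared "unknown" fall back to FDB learning:

# CMS knows the addresses: no dataplane learning needed for this port
ovn-nbctl lsp-set-addresses vm1-port "0a:00:00:00:00:01 10.0.0.11"
# CMS does not know them: traffic from unlisted source MACs is accepted and
# those MACs are learned into the SB FDB table
ovn-nbctl lsp-set-addresses vm1-port unknown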

> - Mac table aging comes for free.
> - Zero access to southbound DB for learned addresses nor for aging.
>
> Cons:
>
> - How to manage seamless upgrades?
> - Requires ovn-controller to move/plug ports in the correct bridge.
> - Multiple openflow connections (one per managed bridge).
> - Requires ovn-trace to be reimplemented differently (maybe other tools
>   as well).
>

The purpose of this proposal is clear - to avoid using a central table in
the DB for L2 information and instead use L2 MAC learning to populate such
information on the chassis, which is a reasonable alternative with pros and
cons.
However, I don't think it is necessary to use separate OVS bridges for this
purpose. L2 MAC learning can be easily implemented in the br-int bridge with
OVS flows, which is much simpler than managing a dynamic number of OVS
bridges just for the purpose of using the builtin OVS mac-learning.

Now back to the distributed MAC learning idea itself. Essentially for two
VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a packet to
VM2@chassis2, assuming VM1 already has VM2's MAC address (we will discuss
this later), Chassis1 needs to know that VM2's MAC is located on Chassis2.
In OVN today this information is conveyed through:
- MAC and LSP mapping (NB -> northd -> SB -> Chassis)
- LSP and Chassis mapping (Chassis -> SB -> Chassis)

In your proposal:
- MAC and Chassis mapping (can be learned through initial L2
broadcast/flood)

This indeed would avoid the control plane cost through the centralized
components (for this L2 binding part). Given that today's SB OVSDB is a
bottleneck, this idea may sound attractive. But please also take into
consideration the below improvement that could mitigate the OVN central
scale issue:
- For MAC and LSP mapping, northd is now capable of incrementally
processing VIF related L2/L3 changes, so the cost of NB -> northd -> SB is
very small. For SB -> Chassis, a more scalable DB deployment, such as the
OVSDB relays, may largely help.
- For LSP and Chassis mapping, the round trip through a central DB obviously
costs more than a direct L2 broadcast (the targets are the same). But this
can be optimized if the MAC-to-Chassis mapping is known by the CMS (which is
true for most openstack/k8s environments, I believe). Instead of updating
the binding from each Chassis, the CMS can convey this information through
the same NB -> northd -> SB -> Chassis path, and the Chassis can just read
the SB without updating it.

On the other hand, the dynamic MAC learning approach has its own drawbacks.
- It is simple if considering L2 only, but when considering more SDN
features, a central DB is more flexible to extend and implement new features
than a network-protocol-based approach.
- It is more predictable and easier to debug with information pre-populated
through the CMS than with states learned dynamically in the data plane.
- With the DB approach we can suppress most of the L2 broadcast/flood, while
with distributed MAC learning the broadcast/flood can't be avoided.
Although it may 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-09-29 Thread Han Zhou via discuss
On Fri, Sep 29, 2023 at 7:26 AM Robin Jarry  wrote:
>
> Hi Felix,
>
> Thanks a lot for your message.
>
> Felix Huettner, Sep 29, 2023 at 14:35:
> > I can get that when running 10k ovn-controllers the benefits of
> > optimizing cpu and memory load are quite significant. However i am
> > unsure about reducing the footprint of ovn-northd.
> > When running so many nodes i would have assumed that having an
> > additional (or maybe two) dedicated machines for ovn-northd would
> > be completely acceptable, as long as it can still actually do what
> > it should in a reasonable timeframe.
> > Would the goal for ovn-northd be more like "Reduce the full/incremental
> > recompute time" then?

+1

>
> The main goal of this thread is to get a consensus on the actual issues
> that prevent scaling at the moment. We can discuss solutions in the
> other thread.
>

Thanks for the good discussions!

> > > * Allow support for alternative datapath implementations.
> >
> > Does this mean ovs datapths (e.g. dpdk) or something different?
>
> See the other thread.
>
> > > Southbound Design
> > > =
> ...
> > Note that also ovn-controller consumes the "state" of other chassis to
> > e.g build the tunnels to other chassis. To visualize my understanding
> >
> > ++---++
> > || configuration |   state|
> > ++---++
> > |   ovn-northd   |  write-only   | read-only  |
> > ++---++
> > | ovn-controller |   read-only   | read-write |
> > ++---++
> > |some cms|  no access?   | read-only  |
> > ++---++
>
> I think ovn-controller only consumes the logical flows. The chassis and
> port bindings tables are used by northd to update these logical flows.
>

Felix was right. For example, port-binding is firstly a configuration from
north-bound, but the states such as its physical location (the chassis
column) are populated by ovn-controller of the owning chassis and consumed
by other ovn-controllers that are interested in that port-binding.

> > > Centralized decisions
> > > =
> > >
> > > Every chassis needs to be "aware" of all other chassis in the cluster.
> >
> > I think we need to accept this as a fundamental truth, independent of
> > whether you look at centralized designs like ovn or the neutron-l2
> > implementation, or at decentralized designs like bgp or spanning tree. In
> > all cases, if we need some kind of organized communication we need to
> > know all relevant peers.
> > Designs might diverge if you need to be "aware" of all peers or just
> > some of them, but that is just a tradeoff between data size and options
> > you have to forward data.
> >
> > > This requirement mainly comes from overlay networks that are implemented
> > > over a full mesh of point-to-point GENEVE tunnels (or VXLAN with some
> > > limitations). It is not a scaling issue by itself, but it implies
> > > a centralized decision which in turn puts pressure on the central node
> > > at scale.
> >
> > +1. On the other hand it removes signaling needs between the nodes (like
> > you would have with bgp).
>
> Exactly, but was the signaling between the nodes ever an issue?

I am not an expert in BGP, but as far as I am aware, there are scaling
issues in things like BGP full-mesh signaling, and there are solutions
such as route reflectors (which are again centralized) to solve such issues.

>
> > > Due to ovsdb monitoring and caching, any change in the southbound DB
> > > (either by northd or by any of the chassis controllers) is replicated on
> > > every chassis. The monitor_all option is often enabled on large clusters
> > > to avoid the conditional monitoring CPU cost on the central node.
> >
> > This is, i guess, something that should be possible to fix. We have also
> > enabled this setting as it gave us stability improvements and we do not
> > yet see performance issues with it
>
> So you have enabled monitor_all=true as well? Or did you test at scale
> with monitor_all=false.
>
We do use monitor_all=false, primarily to reduce the memory footprint (and
also the CPU cost of IDL processing) on each chassis. There are trade-offs
for the SB DB server performance:
- On one hand it increases the cost of conditional monitoring, which is
expensive for sure
- On the other hand, it reduces the total amount of data for the server to
propagate to clients

It really depends on your topology for making the choice. If most of the
nodes would anyway monitor most of the DB data (something similar to a
full mesh), it is more reasonable to use monitor_all=true. Otherwise, in a
topology like ovn-kubernetes where each node has its dedicated part of the
data, or in topologies where you have lots of small "islands" such as a
cloud with many small tenants that never talk to each other, using
monitor_all=false could make sense (but 

Re: [ovs-discuss] Scaling OVN/Southbound

2023-07-07 Thread Han Zhou via discuss
On Fri, Jul 7, 2023 at 1:21 PM Han Zhou  wrote:
>
>
>
> On Thu, Jul 6, 2023 at 1:28 AM Terry Wilson  wrote:
> >
> > On Wed, Jul 5, 2023 at 9:59 AM Terry Wilson  wrote:
> > >
> > > On Fri, Jun 30, 2023 at 7:09 PM Han Zhou via discuss
> > >  wrote:
> > > >
> > > >
> > > >
> > > > On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <
ovs-discuss@openvswitch.org> wrote:
> > > > >
> > > > > Hi Ilya,
> > > > >
> > > > > thank you for the detailed reply
> > > > >
> > > > > On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
> > > > > > On 5/23/23 15:59, Felix Hüttner via discuss wrote:
> > > > > > > Hi everyone,
> > > > > >
> > > > > > Hi, Felix.
> > > > > >
> > > > > > >
> > > > > > > we are currently running an OVN Deployment with 450 Nodes. We
> > > > > > > run a 3-node cluster for the northbound database and a 3-node
> > > > > > > cluster for the southbound database.
> > > > > > > Between the southbound cluster and the ovn-controllers we have a
> > > > > > > layer of 24 ovsdb relays.
> > > > > > > The setup is using TLS for all connections; however, the TLS
> > > > > > > server is handled by a traefik reverse proxy to offload this from
> > > > > > > the ovsdb
> > > > > >
> > > > > > The very important part of the system description is what versions
> > > > > > of OVS and OVN you are using in this setup.  If it's not the latest
> > > > > > 3.1 and 23.03, then it's hard to talk about what/if performance
> > > > > > improvements are actually needed.
> > > > > >
> > > > >
> > > > > We are currently running ovs 3.1 and ovn 22.12 (in the process of
> > > > > upgrading to 23.03). `monitor-all` is currently disabled, but we
> > > > > want to try that as well.
> > > > >
> > > > Hi Felix, did you try upgrading and enabling "monitor-all"? How does
> > > > it look now?
> > > >
> > > > > > > Northd and Neutron is connecting directly to north- and
southbound databases without the relays.
> > > > > >
> > > > > > One of the big things that is annoying is that Neutron connects to
> > > > > > the Southbound database at all.  There are some reasons to do that,
> > > > > > but ideally that should be avoided.  I know that in the past
> > > > > > limiting the number of metadata agents was one of the mitigation
> > > > > > strategies for scaling issues.  Also, why can't it connect to
> > > > > > relays?  There shouldn't be too many transactions flowing towards
> > > > > > the Southbound DB from Neutron.
> > > > > >
> > > > >
> > > > > Thanks for that suggestion, that definitely makes sense.
> > > > >
> > > > Does this make a big difference? How many Neutron - SB connections
> > > > are there?
> > > > What rings a bell is that Neutron is using the python OVSDB library,
> > > > which hasn't implemented the fast-resync feature (if I remember
> > > > correctly).
> > >
> > > python-ovs has supported monitor_cond_since since v2.17.0 (though
> > > there may have been a bug that was fixed in 2.17.1). If fast resync
> > > isn't happening, then it should be considered a bug. With that said, I
> > > remember when I looked it a year or two ago, ovsdb-server didn't
> > > really use fast resync/monitor_cond_since unless it was running in
> > > raft cluster mode (it would reply, but with the last-txn-id as 0
> > > IIRC?). Does the ovsdb-relay code actually return the last-txn-id? I
> > > can set up an environment and run some tests, but maybe someone else
> > > already knows.
> >
> > Looks like ovsdb-relay does support last-txn-id now:
> > https://github.com/openvswitch/ovs/commit/a3e97b1af1bdcaa802c6caa9e73087df7077d2b1,
> > but only in v3.0+.
> >
>
> Hi Terry, thanks for correcting me, and sorry for my bad memory! And you
are right that fast resync is supported only in cluster mode.
>
> Han
>
> > > > At the same time, there is the leader-transfer-for-snapshot feature,
> > > > which automatically transfers leadership whenever a snapshot is to be
> > > > written, which would happen frequently if your environment is very
> > > > active.
> > >
> > > I believe snapshot should only be happening "no less frequently than
> > > 24 hours, with snapshots if there are more 

Re: [ovs-discuss] Scaling OVN/Southbound

2023-07-07 Thread Han Zhou via discuss
On Thu, Jul 6, 2023 at 12:00 AM Felix Huettner 
wrote:
>
> Hi Han,
>
> On Fri, Jun 30, 2023 at 05:08:36PM -0700, Han Zhou wrote:
> > On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <
> > ovs-discuss@openvswitch.org> wrote:
> > >
> > > Hi Ilya,
> > >
> > > thank you for the detailed reply
> > >
> > > On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
> > > > On 5/23/23 15:59, Felix Hüttner via discuss wrote:
> > > > > Hi everyone,
> > > >
> > > > Hi, Felix.
> > > >
> > > > >
> > > > > we are currently running an OVN Deployment with 450 Nodes. We run
a 3
> > node cluster for the northbound database and a 3 nodes cluster for the
> > southbound database.
> > > > > Between the southbound cluster and the ovn-controllers we have a
> > layer of 24 ovsdb relays.
> > > > > The setup is using TLS for all connections, however the TLS
Server is
> > handled by a traefik reverseproxy to offload this from the ovsdb
> > > >
> > > > The very important part of the system description is what versions
> > > > of OVS and OVN are you using in this setup?  If it's not latest
> > > > 3.1 and 23.03, then it's hard to talk about what/if performance
> > > > improvements are actually needed.
> > > >
> > >
> > > We are currently running ovs 3.1 and ovn 22.12 (in the process of
> > > upgrading to 23.03). `monitor-all` is currently disabled, but we want
> > > to try that as well.
> > >
> > Hi Felix, did you try upgrading and enabling "monitor-all"? How does it
> > look now?
>
> we did not yet upgrade, but we tried monitor-all and that provided a big
> benefit in terms of stability.
>
It is great to know that monitor-all helped for your use case.

> >
> > > > > Northd and Neutron is connecting directly to north- and southbound
> > databases without the relays.
> > > >
> > > > One of the big things that is annoying is that Neutron connects to
> > > > Southbound database at all.  There are some reasons to do that,
> > > > but ideally that should be avoided.  I know that in the past limiting
> > > > the number of metadata agents was one of the mitigation strategies
> > > > for scaling issues.  Also, why can't it connect to relays?  There
> > > > shouldn't be too many transactions flowing towards Southbound DB
> > > > from the Neutron.
> > > >
> > >
> > > Thanks for that suggestion, that definitely makes sense.
> > >
> > Does this make a big difference? How many Neutron - SB connections are
> > there?
> > What rings a bell is that Neutron is using the python OVSDB library,
> > which hasn't implemented the fast-resync feature (if I remember correctly).
> > At the same time, there is the leader-transfer-for-snapshot feature,
> > which automatically transfers leadership whenever a snapshot is to be
> > written, which would happen frequently if your environment is very active.
> > When a leader transfer happens, if Neutron sets the option "leader-only"
> > (only connect to the leader) for the SB DB (could someone confirm?), then
> > all Neutron workers would reconnect to the new leader. With fast-resync,
> > like what's implemented in the C IDL and Go, a client that has cached the
> > data would only request the delta when reconnecting. But since the python
> > lib doesn't have this, the Neutron server would re-download the full data
> > when reconnecting ...
> > This is speculation based on the information I have, and the assumptions
> > need to be confirmed.
>
> We are currently working with upstream neutron to get the leader-only
> flag removed wherever we can. I guess in total the amount of connections
> depends on the process count which would be ~150 connections in total in
> our case.
>
As Terry pointed out, the python OVSDB lib does support fast-resync, so this
shouldn't be the problem, and I think it is better to keep the leader-only
flag for neutron because it is more efficient to update directly through the
leader, especially when the client writes heavily. Without leader-only, the
updates sent to different followers will anyway have to go through the
leader, but parallel updates will result in sequence conflicts and retries,
which creates more waste and load on the servers. But of course, it does no
harm to try. I didn't think that you have so many (~150) connections just
from Neutron (I thought it might be 10+), which s

Re: [ovs-discuss] Scaling OVN/Southbound

2023-07-06 Thread Han Zhou via discuss
On Thu, Jul 6, 2023 at 1:28 AM Terry Wilson  wrote:
>
> On Wed, Jul 5, 2023 at 9:59 AM Terry Wilson  wrote:
> >
> > On Fri, Jun 30, 2023 at 7:09 PM Han Zhou via discuss
> >  wrote:
> > >
> > >
> > >
> > > On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <
ovs-discuss@openvswitch.org> wrote:
> > > >
> > > > Hi Ilya,
> > > >
> > > > thank you for the detailed reply
> > > >
> > > > On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
> > > > > On 5/23/23 15:59, Felix Hüttner via discuss wrote:
> > > > > > Hi everyone,
> > > > >
> > > > > Hi, Felix.
> > > > >
> > > > > >
> > > > > > we are currently running an OVN Deployment with 450 Nodes. We run
> > > > > > a 3-node cluster for the northbound database and a 3-node cluster
> > > > > > for the southbound database.
> > > > > > Between the southbound cluster and the ovn-controllers we have a
> > > > > > layer of 24 ovsdb relays.
> > > > > > The setup is using TLS for all connections; however, the TLS server
> > > > > > is handled by a traefik reverse proxy to offload this from the ovsdb
> > > > >
> > > > > The very important part of the system description is what versions
> > > > > of OVS and OVN are you using in this setup?  If it's not latest
> > > > > 3.1 and 23.03, then it's hard to talk about what/if performance
> > > > > improvements are actually needed.
> > > > >
> > > >
> > > > We are currently running ovs 3.1 and ovn 22.12 (in the process of
> > > > upgrading to 23.03). `monitor-all` is currently disabled, but we want
> > > > to try that as well.
> > > >
> > > Hi Felix, did you try upgrading and enabling "monitor-all"? How does it
> > > look now?
> > >
> > > > > > Northd and Neutron is connecting directly to north- and
southbound databases without the relays.
> > > > >
> > > > > One of the big things that is annoying is that Neutron connects to
> > > > > Southbound database at all.  There are some reasons to do that,
> > > > > but ideally that should be avoided.  I know that in the past limiting
> > > > > the number of metadata agents was one of the mitigation strategies
> > > > > for scaling issues.  Also, why can't it connect to relays?  There
> > > > > shouldn't be too many transactions flowing towards Southbound DB
> > > > > from the Neutron.
> > > > >
> > > >
> > > > Thanks for that suggestion, that definitely makes sense.
> > > >
> > > Does this make a big difference? How many Neutron - SB connections are
> > > there?
> > > What rings a bell is that Neutron is using the python OVSDB library,
> > > which hasn't implemented the fast-resync feature (if I remember correctly).
> >
> > python-ovs has supported monitor_cond_since since v2.17.0 (though
> > there may have been a bug that was fixed in 2.17.1). If fast resync
> > isn't happening, then it should be considered a bug. With that said, I
> > remember when I looked it a year or two ago, ovsdb-server didn't
> > really use fast resync/monitor_cond_since unless it was running in
> > raft cluster mode (it would reply, but with the last-txn-id as 0
> > IIRC?). Does the ovsdb-relay code actually return the last-txn-id? I
> > can set up an environment and run some tests, but maybe someone else
> > already knows.
>
> Looks like ovsdb-relay does support last-txn-id now:
> https://github.com/openvswitch/ovs/commit/a3e97b1af1bdcaa802c6caa9e73087df7077d2b1,
> but only in v3.0+.
>

Hi Terry, thanks for correcting me, and sorry for my bad memory! And you
are right that fast resync is supported only in cluster mode.

Han

> > > At the same time, there is the leader-transfer-for-snapshot feature,
> > > which automatically transfers leadership whenever a snapshot is to be
> > > written, which would happen frequently if your environment is very active.
> >
> > I believe snapshot should only be happening "no less frequently than
> > 24 hours, with snapshots if there are more than 100 log entries and
> > the log size has doubled, but no more frequently than every 10 mins"
> > or something pretty close to that. So it seems like once the system
> > got up to its expected size, you would just see updates every 24 hours
> > since you obviously can't double in size forever. But it's possible
> > I'm reading that wrong.
> >
> > > When a leader transfer happens, if Ne

Re: [ovs-discuss] Scaling OVN/Southbound

2023-06-30 Thread Han Zhou via discuss
On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <
ovs-discuss@openvswitch.org> wrote:
>
> Hi Ilya,
>
> thank you for the detailed reply
>
> On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
> > On 5/23/23 15:59, Felix Hüttner via discuss wrote:
> > > Hi everyone,
> >
> > Hi, Felix.
> >
> > >
> > > we are currently running an OVN Deployment with 450 Nodes. We run a
> > > 3-node cluster for the northbound database and a 3-node cluster for the
> > > southbound database.
> > > Between the southbound cluster and the ovn-controllers we have a layer
> > > of 24 ovsdb relays.
> > > The setup is using TLS for all connections; however, the TLS server is
> > > handled by a traefik reverse proxy to offload this from the ovsdb
> >
> > The very important part of the system description is what versions
> > of OVS and OVN are you using in this setup?  If it's not latest
> > 3.1 and 23.03, then it's hard to talk about what/if performance
> > improvements are actually needed.
> >
>
> We are currently running ovs 3.1 and ovn 22.12 (in the process of
> upgrading to 23.03). `monitor-all` is currently disabled, but we want to
> try that as well.
>
Hi Felix, did you try upgrading and enabling "monitor-all"? How does it
look now?

> > > Northd and Neutron are connecting directly to the north- and southbound
> > > databases without the relays.
> >
> > One of the big things that is annoying is that Neutron connects to
> > Southbound database at all.  There are some reasons to do that,
> > but ideally that should be avoided.  I know that in the past limiting
> > the number of metadata agents was one of the mitigation strategies
> > for scaling issues.  Also, why can't it connect to relays?  There
> > shouldn't be too many transactions flowing towards Southbound DB
> > from the Neutron.
> >
>
> Thanks for that suggestion, that definitely makes sense.
>
Does this make a big difference? How many Neutron - SB connections are
there?
What rings a bell is that Neutron is using the python OVSDB library, which
hasn't implemented the fast-resync feature (if I remember correctly).
At the same time, there is the leader-transfer-for-snapshot feature, which
automatically transfers leadership whenever a snapshot is to be written,
which would happen frequently if your environment is very active.
When a leader transfer happens, if Neutron sets the option "leader-only"
(only connect to the leader) for the SB DB (could someone confirm?), then
all Neutron workers would reconnect to the new leader. With fast-resync,
like what's implemented in the C IDL and Go, a client that has cached the
data would only request the delta when reconnecting. But since the python
lib doesn't have this, the Neutron server would re-download the full data
when reconnecting ...
This is speculation based on the information I have, and the assumptions
need to be confirmed.

> > >
> > > We needed to increase various timeouts on the ovsdb-server and client
> > > side to get this to a mostly stable state:
> > > * inactivity probes of 60 seconds (for all connections between
> > >   ovsdb-server, relay and clients)
> > > * cluster election time of 50 seconds
> > >
> > > As long as none of the relays restarts, the environment is quite stable.
> > > However we see quite regularly the "Unreasonably long xxx ms poll
> > > interval" messages, ranging from 1000ms up to 4ms.
> >
> > With latest versions of OVS/OVN the CPU usage on Southbound DB
> > servers without relays in our weekly 500-node ovn-heater runs
> > stays below 10% during the test phase.  No large poll intervals
> > are getting registered.
> >
> > Do you have more details on under which circumstances these
> > large poll intervals occur?
> >
>
> It seems to mostly happen on the initial connection of some client to
> the ovsdb. From the few times we ran perf there, it looks like the time
> is spent creating a monitor and, during that, sending out the updates
> to the client side.
>
It is one of the worst-case scenarios for OVSDB when many clients initialize
connections to it at the same time and the size of the data downloaded by
each client is big.
An OVSDB relay, from what I understand, should greatly help with this. You
have 24 relay nodes, which are supposed to share the burden. Are the SB DB
and the relay instances running with sufficient CPU resources?
Is it clear which clients' initial connections (ovn-controller or Neutron)
are causing this? If it is Neutron, the above speculation about the lack of
fast-resync in the Neutron workers may be worth checking.
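If you do capture it again, even a simple profile of the SB ovsdb-server
during such a reconnect storm would help - a sketch (the pidfile path is an
assumption; adjust to your packaging):

# profile the SB ovsdb-server for 30s while the reconnect storm happens
perf record -g -p "$(cat /var/run/ovn/ovnsb_db.pid)" -- sleep 30
perf report --stdio | head -n 50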

> If it is of interest i can try and get a perf report once this occurs
> again.
>
> > >
> > > If a large number of relays restart simultaneously, they can also bring
> > > the ovsdb cluster to failure, as the poll interval exceeds the cluster
> > > election time.
> > > This happens even with the relays already syncing the data from all 3
> > > ovsdb servers.
> >
> > There was a performance issue with upgrades and simultaneous
> > reconnections, but it should be mostly fixed on the current master
> > branch, 

Re: [ovs-discuss] MAC binding aging refresh mechanism

2023-05-25 Thread Han Zhou via discuss
On Thu, May 25, 2023 at 9:19 AM Ilya Maximets  wrote:
>
> On 5/25/23 14:08, Ales Musil via discuss wrote:
> > Hi,
> >
> > to improve the MAC binding aging mechanism we need a way to ensure that
> > rows which are still in use are preserved. This doesn't happen with the
> > current implementation.
> >
> > I propose the following solution, which should solve the issue; any
> > questions or comments are welcome. If there isn't anything major that
> > would block this approach, I would start to implement it so it can be
> > available in 23.09.
> >
> > For the approach itself:
> >
> > Add a "mac_cache_use" action into the "lr_in_learn_neighbor" table (only
> > the flow that continues on a known MAC binding):
> > match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
> > REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(next;)  ->
> > match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
> > REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(mac_cache_use; next;)
> >
> > The "mac_cache_use" would translate to a resubmit into a separate table
> > with flows per MAC binding as follows:
> > match=(ip.src=, eth.src=, datapath=),
> > action=(drop;)

It is possible that some workload has heavy traffic for ingress direction
only, such as some UDP streams, but not sending anything out for a long
interval. So I am not sure if using "src" only would be sufficient.

>
> One concern here would be that it will likely cause a packet clone
> in the datapath just to immediately drop it.  So, might have a
> noticeable performance impact.
>
+1. We need to be careful to avoid any dataplane performance impact, which
doesn't sound justified for the value.

> >
> > This should bump the statistics every time for the correct MAC binding.
> > In ovn-controller we could periodically dump the flows from this table.
> > The period would be set to MIN(mac_binding_age_threshold/2) over all local
> > datapaths. The dump would happen from a different thread with its own
> > rconn to prevent backlogging issues. The thread would receive mapped data
> > from an I-P node that would keep track of the datapath -> cookies -> mac
> > bindings mapping. This allows us to avoid constant lookups, but at the
> > cost of keeping track of all local MAC bindings. To save some computation
> > time this I-P could be relevant only for datapaths that actually have the
> > threshold set.
> >
> > If the "idle_age" of the particular flow is smaller than the datapath's
> > "mac_binding_age_threshold", it means that it is still in use. To prevent
> > a lot of updates if the traffic is still relevant on multiple controllers,
> > we would check if the timestamp is older than the "dump period"; if not,
> > we don't have to update it, because someone else did.

Thanks for trying to reduce the number of updates to the SB DB, but I still
have some concerns about this. In theory, to prevent the records from being
deleted while they are still used, at least one timestamp update is required
per threshold for each record. Even if we bundle the updates from each node,
assume that the workloads that own the IP/MAC of the mac_binding records are
distributed across 1000 nodes and the aging threshold is 30s: there will
still be ~30 updates/s (if we can evenly distribute the updates from the
different nodes). That's still a lot, which may keep the SB server and all
the ovn-controllers busy just for these messages. If the aging threshold is
set to 300s (5min), it may look better: ~3 updates/s, but this could still
make up a major part of the SB <-> ovn-controller messages. E.g., in an
ovn-k8s deployment the cluster LR is distributed on all nodes, so all nodes
would need to monitor all mac-binding timestamp updates related to the
cluster LR, which means all mac-binding updates from all nodes. In reality
the amount of messages may be doubled if we use the proposed dump-and-check
interval of mac_binding_age_threshold/2.
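Just to spell the estimate out (a back-of-the-envelope sketch: one bundled
timestamp update per node per aging threshold):

nodes=1000
for threshold in 30 300; do
  echo "threshold=${threshold}s -> ~$((nodes / threshold)) SB updates/s"
done
# prints ~33 and ~3 updates/s, roughly the figures used above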

So, I'd evaluate both the dataplane and control plane cost before going for
formal implementation.

Thanks,
Han

>
> >
> > Also, to "desync" the controllers, a random delay would be added to the
> > "dump period".
> >
> > All of this would be applicable to FDB aging as well.
> >
> > Does that sound reasonable?
> > Please let me know if you have any comments/suggestions.
> >
> > Thanks,
> > Ales
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] OVN interconnection and NAT

2023-03-15 Thread Han Zhou via discuss
On Wed, Mar 15, 2023 at 1:00 PM Tiago Pires 
wrote:
>
> Hi Vladislav,
>
> It seems the gateway_port option was added on 22.09 according with this
commit:
https://github.com/ovn-org/ovn/commit/4f93381d7d38aa21f56fb3ff4ec00490fca12614
.
> It is what I need in order to make my use case to work, let me try it.

Thanks for reporting this issue. It would be good to try the gateway_port
option, but there also seems to be a bug somewhere if it behaves as you
described in the example, because:

"When  a  logical  router  has multiple distributed gateway ports and this
column is not set for a NAT
  rule, then the rule will be applied at the distributed
gateway port which is in the same  network  as
  the  external_ip  of  the NAT rule "

We need to check more on this.
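(For reference, a rough sketch of pinning an existing NAT rule to a specific
distributed gateway port, assuming a schema that has the NAT gateway_port
column (22.09+); the placeholders are to be filled in from your own setup and
the UUID lookup shown is just one way to do it:

  nat=$(ovn-nbctl --bare --columns=_uuid find NAT external_ip=<EXTERNAL_IP>)
  lrp=$(ovn-nbctl --bare --columns=_uuid find Logical_Router_Port name=<DGP_NAME>)
  ovn-nbctl set NAT "$nat" gateway_port="$lrp"
)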

Regards,
Han

>
> Thank you
>
> Tiago Pires
>
>
>
> On Wed, Mar 15, 2023 at 2:10 PM Vladislav Odintsov 
wrote:
>>
>> I’m sorry, of course I meant gateway_port instead of logical_port:
>>
>>gateway_port: optional weak reference to Logical_Router_Port
>>   A distributed gateway port in the Logical_Router_Port
table where the NAT rule needs to be applied.
>>
>>   When multiple distributed gateway ports are configured on
a Logical_Router, applying a  NAT  rule  at
>>   each  of the distributed gateway ports might not be
desired. Consider the case where a logical router
>>   has 2 distributed  gateway  port,  one  with  networks
50.0.0.10/24  and  the  other  with  networks
>>   60.0.0.10/24.  If  the  logical router has a NAT rule of
type snat, logical_ip 10.1.1.0/24 and exter‐
>>   nal_ip 50.1.1.20/24, the rule needs to be selectively
applied on  matching  packets  entering/leaving
>>   through the distributed gateway port with networks
50.0.0.10/24.
>>
>>   When  a  logical  router  has multiple distributed gateway
ports and this column is not set for a NAT
>>   rule, then the rule will be applied at the distributed
gateway port which is in the same  network  as
>>   the  external_ip  of  the NAT rule, if such a router port
exists. If logical router has a single dis‐
>>   tributed gateway port and this column is not set for a NAT
rule, the rule will be applied at the dis‐
>>   tributed  gateway  port  even if the router port is not in
the same network as the external_ip of the
>>   NAT rule.
>>
>> On 15 Mar 2023, at 20:05, Vladislav Odintsov via discuss <
ovs-discuss@openvswitch.org> wrote:
>>
>> Hi,
>>
>> since you’ve configured multiple LRPs with GW chassis, you must supply
logical_port for NAT rule. Did you configure it?
>> You should see appropriate message in ovn-northd logfile.
>>
>>logical_port: optional string
>>   The name of the logical port where the logical_ip resides.
>>
>>   This is only used on distributed routers. This must be
specified in order for the NAT rule to be pro‐
>>   cessed  in a distributed manner on all chassis. If this is
not specified for a NAT rule on a distrib‐
>>   uted router, then this NAT rule will be processed  in  a
 centralized  manner  on  the  gateway  port
>>   instance on the gateway chassis.
>>
>> On 15 Mar 2023, at 19:22, Tiago Pires via discuss <
ovs-discuss@openvswitch.org> wrote:
>>
>> Hi,
>>
>> In an OVN Interconnection environment (OVN 22.03) with a few AZs, I
noticed that when the OVN router has a SNAT enabled or DNAT_AND_SNAT,
>> the traffic between the AZs is nated.
>> When checking the OVN router's logical flows, it is possible to see the
LSP that is connected into the transit switch with NAT enabled:
>>
>> Scenario:
>>
>> OVN Global database:
>> # ovn-ic-sbctl show
>> availability-zone az1
>> gateway ovn-central-1
>> hostname: ovn-central-1
>> type: geneve
>> ip: 192.168.40.50
>> port ts1-r1-az1
>> transit switch: ts1
>> address: ["aa:aa:aa:aa:aa:10 169.254.100.10/24"]
>> availability-zone az2
>> gateway ovn-central-2
>> hostname: ovn-central-2
>> type: geneve
>> ip: 192.168.40.221
>> port ts1-r1-az2
>> transit switch: ts1
>> address: ["aa:aa:aa:aa:aa:20 169.254.100.20/24"]
>> availability-zone az3
>> gateway ovn-central-3
>> hostname: ovn-central-3
>> type: geneve
>> ip: 192.168.40.247
>> port ts1-r1-az3
>> transit switch: ts1
>> address: ["aa:aa:aa:aa:aa:30 169.254.100.30/24"]
>>
>> OVN Central (az1)
>>
>> # ovn-nbctl show r1
>> router 3e80e81a-58b5-41b1-9600-5bfc917c4ace (r1)
>> port r1-ts1-az1
>> mac: "aa:aa:aa:aa:aa:10"
>> networks: ["169.254.100.10/24"]
>> gateway chassis: [ovn-central-1]
>> port r1_s1
>> mac: "00:de:ad:fe:0:1"
>> networks: ["10.0.1.1/24"]
>> port r1_public
>> mac: 

Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-01-10 Thread Han Zhou via discuss
On Mon, Jan 9, 2023 at 3:34 AM Ilya Maximets  wrote:
>
> On 1/8/23 04:51, Han Zhou wrote:
> >
> >
> > On Tue, Jan 3, 2023 at 6:07 AM Ilya Maximets via discuss <
ovs-discuss@openvswitch.org <mailto:ovs-discuss@openvswitch.org>> wrote:
> >>
> >> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> >> > Hello,
> >> >
> >> > When performing an online schema conversion for a clustered DB the
> >> > `ovsdb-client` connects to the current leader of the cluster and
> >> > requests it to convert the DB to a new schema.
> >> >
> >> > The main thread of the leader ovsdb-server will then parse the new
> >> > schema and copy the entire database into a new in-memory copy using
> >> > the new schema. For a moderately sized database, let's say 650MB
> >> > on-disk, this process can take north of 24 seconds on a modern
> >> > adequately performant system.
> >> >
> >> > While this is happening the ovsdb-server process will not process any
> >> > raft election events or inactivity probes, so by the time the
> >> > conversion is done and the now past leader wants to write the
> >> > converted database to the cluster, its connection to the cluster is
> >> > dead.
> >> >
> >> > The past leader will keep repeating this process indefinitely, until
> >> > the client requesting the conversion disconnects. No message is
passed
> >> > to the client.
> >> >
> >> > Meanwhile the other nodes in the cluster have moved on with a new
leader.
> >> >
> >> > A workaround for this scenario would be to increase the election
timer
> >> > to a value great enough so that the conversion can succeed within an
> >> > election window.
> >> >
> >> > I don't view this as a permanent solution though, as it would be
> >> > unfair to leave the end user with guessing the correct election timer
> >> > in order for their upgrades to succeed.
> >> >
> >> > Maybe we need to hand off conversion to a thread and make the main
> >> > loop only process raft requests until it is done, similar to the
> >> > recent addition of preparing snapshot JSON in a separate thread [0].
> >> >
> >> > Any other thoughts or ideas?
> >> >
> >> > 0:
https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
<
https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
>
> >> >
> >>
> >> Hi, Frode.  Thanks for starting this conversation.
> >>
> >> First of all I'd still respectfully disagree that 650 MB is a
> >> moderately sized database. :)  ovsdb-server on its own doesn't limit
> >> users on how much data they can put in, but that doesn't mean there
> >> is no limit at which it will be difficult for it or even impossible
> >> to handle the database.  From my experience 650 MB is far beyond the
> >> threshold for a smooth work.
> >>
> >> Allowing database to grow to such size might be considered a user
> >> error, or a CMS error.  In any case, setups should be tested at the
> >> desired [simulated at least] scale including upgrades before
> >> deploying in production environment to not run into such issues
> >> unexpectedly.
> >>
> >> Another way out from the situation, beside bumping the election
> >> timer, might be to pin ovn-controllers, destroy the database (maybe
> >> keep port bindings, etc.) and let northd to re-create it after
> >> conversion.  Not sure if that will actually work though, as I
> >> didn't try.
> >>
> >>
> >> For the threads, I'll re-iterate my thought that throwing more
> >> cores on the problem is absolutely last thing we should do.  Only
> >> if there is no other choice.  Simply because many parts of
> >> ovsdb-server was never optimized for performance and there are
> >> likely many things we can do to improve without blindly using more
> >> resources and increasing the code complexity by adding threads.
> >>
> >>
> >> Speaking of not optimal code, the conversion process seems very
> >> inefficient.  Let's deconstruct it.  (I'll skip the standalone
> >> case, focusing on the clustered mode.)
> >>
> >> There are few main steps:
> >>
> >> 1. ovsdb_convert() - Creates a copy of a database converting
> >>each column along the w

Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-01-07 Thread Han Zhou via discuss
On Tue, Jan 3, 2023 at 6:07 AM Ilya Maximets via discuss <
ovs-discuss@openvswitch.org> wrote:
>
> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> > Hello,
> >
> > When performing an online schema conversion for a clustered DB the
> > `ovsdb-client` connects to the current leader of the cluster and
> > requests it to convert the DB to a new schema.
> >
> > The main thread of the leader ovsdb-server will then parse the new
> > schema and copy the entire database into a new in-memory copy using
> > the new schema. For a moderately sized database, let's say 650MB
> > on-disk, this process can take north of 24 seconds on a modern
> > adequately performant system.
> >
> > While this is happening the ovsdb-server process will not process any
> > raft election events or inactivity probes, so by the time the
> > conversion is done and the now past leader wants to write the
> > converted database to the cluster, its connection to the cluster is
> > dead.
> >
> > The past leader will keep repeating this process indefinitely, until
> > the client requesting the conversion disconnects. No message is passed
> > to the client.
> >
> > Meanwhile the other nodes in the cluster have moved on with a new
leader.
> >
> > A workaround for this scenario would be to increase the election timer
> > to a value great enough so that the conversion can succeed within an
> > election window.
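(For reference, the election timer of a clustered ovsdb-server can also be
raised at runtime; a sketch using the usual OVN SB socket path, adjust for
your installation. As far as I recall the server only accepts roughly a
doubling per call, so a large value has to be reached in steps:

  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 2000
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 4000
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 8000
)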
> >
> > I don't view this as a permanent solution though, as it would be
> > unfair to leave the end user with guessing the correct election timer
> > in order for their upgrades to succeed.
> >
> > Maybe we need to hand off conversion to a thread and make the main
> > loop only process raft requests until it is done, similar to the
> > recent addition of preparing snapshot JSON in a separate thread [0].
> >
> > Any other thoughts or ideas?
> >
> > 0:
https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >
>
> Hi, Frode.  Thanks for starting this conversation.
>
> First of all I'd still respectfully disagree that 650 MB is a
> moderately sized database. :)  ovsdb-server on its own doesn't limit
> users on how much data they can put in, but that doesn't mean there
> is no limit at which it will be difficult for it or even impossible
> to handle the database.  From my experience 650 MB is far beyond the
> threshold for a smooth work.
>
> Allowing database to grow to such size might be considered a user
> error, or a CMS error.  In any case, setups should be tested at the
> desired [simulated at least] scale including upgrades before
> deploying in production environment to not run into such issues
> unexpectedly.
>
> Another way out from the situation, beside bumping the election
> timer, might be to pin ovn-controllers, destroy the database (maybe
> keep port bindings, etc.) and let northd to re-create it after
> conversion.  Not sure if that will actually work though, as I
> didn't try.
>
>
> For the threads, I'll re-iterate my thought that throwing more
> cores on the problem is absolutely last thing we should do.  Only
> if there is no other choice.  Simply because many parts of
> ovsdb-server was never optimized for performance and there are
> likely many things we can do to improve without blindly using more
> resources and increasing the code complexity by adding threads.
>
>
> Speaking of not optimal code, the conversion process seems very
> inefficient.  Let's deconstruct it.  (I'll skip the standalone
> case, focusing on the clustered mode.)
>
> There are few main steps:
>
> 1. ovsdb_convert() - Creates a copy of a database converting
>each column along the way and checks the constraints.
>
> 2. ovsdb_to_txn_json() - Converts the new database into a
>transaction JSON object.
>
> 3. ovsdb_txn_propose_schema_change() - Writes the new schema
>and the transaction JSON to the storage (RAFT).
>
> 4. ovsdb_destroy() - Copy of a database is destroyed.
>
>-
>
> 5. read_db()/parse_txn() - Reads the new schema and the
>transaction JSON from the storage, replaces the current
>database with an empty one and replays the transaction
>that creates a new converted database.
>
> There is always a storage run between steps 4 and 5, so we generally
> care only that steps 1-4 and the step 5 are below the election timer
> threshold separately.
>
>
> Now looking closer to the step 1, which is the most time consuming
> step.  It has two stages - data conversion and the transaction
> check.  Data conversion part makes sure that we're creating all
> the rows in a new database with all the new columns and without
> removed columns.  It also makes sure that all the datum objects
> are converted from the old column type to the new column type by
> calling ovsdb_datum_convert() for every one of them.
>
> Datum conversion is a very heavy operation, because it involves
> converting it to JSON and back.  However, in vast majority of cases
> column types do not change at all, and even if they do, it only

Re: [ovs-discuss] [OVN] ovn-interconnect multiple routers in same AZ and transit switch

2022-09-23 Thread Han Zhou
On Fri, Sep 23, 2022 at 11:52 AM Baranin, Alexander 
wrote:
>
> > The ovn-ic route learning is for remote AZs only. It is not supposed to
learn routes from its own AZ. Did you try adding routes by yourself, and
see if it works?
>
>
> Thank you! Manually-added static routes were added by us, as well as a
missing MAC_BINDING:
>
> for some reason only one of the two router ports, plugged into the
transit switch, had a binding in SB.
>
> After the above actions, ovn-trace was printing a healthy ping packet
traversal from Port1 to Port2 and back.
>
> Actual ping, however, does not go through.
>
> Any tips for debugging the ping issue?
>
> And is it an IC-controller that is responsible for MAC_BINDINGs of
transit switch's port ips inside a router datapath?
>
It shouldn't require MAC_Binding to be involved in your scenario, unless
you set options:dynamic_neigh_routers to true for the LRs. OVN-IC
controller doesn't populate MAC_Binding either. The entry is populated
only when ARP packets are received.
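(A quick way to check or flip that option, with a placeholder router name:

  ovn-nbctl get Logical_Router <LR_NAME> options
  ovn-nbctl set Logical_Router <LR_NAME> options:dynamic_neigh_routers=true
)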

AFAIK, this scenario is not officially tested, since transit-switch was
supposed to connect to remote LRs. Although your topology looks valid, it
is not the typical one we tested, so there could be something missing in
the implementation to make it work. Before debugging further, did you
consider other options of the topology?
Option 1. Since both LRs connect to the same TS and no isolation is needed
between them, should they just be merged into the same LR?
Option 2. If for any reason the LRs need to be separate, would it be better to
create an "edge" LR that connects both through an internal LS and then
connects to the remote AZs through the TS?
Option 3. Both LRs connect to the TS directly, but use a different internal
LS to communicate within the AZ.

For the actual problem you are facing, I may need to reproduce it locally
and debug.
BTW, what's your OVN version? Did you try branch-22.09? I remembered some
minor fixes for admission control flows related to GW chassis.

Thanks,
Han

>
> 
> From: Han Zhou 
> Sent: 23 September 2022, 21:22
> To: Baranin, Alexander
> Cc: ovs-discuss@openvswitch.org; Bravo, Oleg
> Subject: Re: [ovs-discuss] [OVN] ovn-interconnect multiple routers in same
AZ and transit switch
>
>
>
> On Fri, Sep 23, 2022 at 8:10 AM Baranin, Alexander via discuss <
ovs-discuss@openvswitch.org> wrote:
> >
> > Hello!
> >
> > Is the following configuration supported?
> >
> >
> >
> > AZ1:
> >
> >
> > Port1 - LSwitch1 - LRouter1 - TransitSwitch1 - LRouter2 - LSwitch2 -
Port2
> >
> >
> > Gateway ports of LRouter1 and LRouter2 are assigned to the same chassis.
> >
> >
> > We are currently having troubles configuring interconnect.
> >
> > - router port bindings are correctly created in IC-SB-DB.
> >
> > - no routes are inter-learned between the routers.
>
> The ovn-ic route learning is for remote AZs only. It is not supposed to
learn routes from its own AZ. Did you try adding routes by yourself, and
see if it works?
>
> Han
>
> >
> > - we can't ping Port2 from Port1 and vice-versa. We are using "veth in
network namespace" method. Ping between two ports in the same LSwitch works.
> >
> > - ovn-trace for a ICMP echo packet looks correct and puts the packet to
expected destination port.
> >
> >
> >
> > Is this supposed to work, or must the routers be in separate AZs?
> >
> > ___
> > discuss mailing list
> > disc...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [OVN] ovn-interconnect multiple routers in same AZ and transit switch

2022-09-23 Thread Han Zhou
On Fri, Sep 23, 2022 at 8:10 AM Baranin, Alexander via discuss <
ovs-discuss@openvswitch.org> wrote:
>
> Hello!
>
> Is the following configuration supported?
>
>
>
> AZ1:
>
>
> Port1 - LSwitch1 - LRouter1 - TransitSwitch1 - LRouter2 - LSwitch2 - Port2
>
>
> Gateway ports of LRouter1 and LRouter2 are assigned to the same chassis.
>
>
> We are currently having troubles configuring interconnect.
>
> - router port bindings are correctly created in IC-SB-DB.
>
> - no routes are inter-learned between the routers.

The ovn-ic route learning is for remote AZs only. It is not supposed to
learn routes from its own AZ. Did you try adding routes by yourself, and
see if it works?

Han

>
> - we can't ping Port2 from Port1 and vice-versa. We are using "veth in
network namespace" method. Ping between two ports in the same LSwitch works.
>
> - ovn-trace for a ICMP echo packet looks correct and puts the packet to
expected destination port.
>
>
>
> Is this supposed to work, or must the routers be in separate AZs?
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] How to count the upcall times on every datapath interface ?

2022-08-24 Thread Han Zhou
On Sun, Aug 21, 2022 at 6:50 PM wangchuanlei 
wrote:
>
> Hi,
>   In the datapath, we can see the total count of missed packets via the cmd
> "ovs-dpctl show", but we cannot see the miss count on every
> interface. In my environment, I always encounter too many upcall
> packets, which leads to very high CPU usage of vswitchd. In that
> circumstance, I do not know which kind of packet caused this. Is there
> a plan to develop a command to count this?
>
> Thanks!
I am not aware of any such plans but I think you can enable debug logs to
see some of the packets (before the debug log rate limit is hit).
If you are using OVN and have LB VIPs defined, there is a chance that you
hit a performance bug that causes most short-lived flows' packets to go to
the slow path. This is pure speculation without any details or evidence,
but if you would like to test, here is a fix (under review) for that OVN problem:
https://patchwork.ozlabs.org/project/ovn/patch/20220824061730.2523979-1-hz...@ovn.org/
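(A sketch of commands that may help narrow this down; the debug log can be
very verbose even with rate limiting, so enable it only briefly:

  ovs-appctl upcall/show               # upcall handler statistics
  ovs-appctl vlog/set dpif:dbg         # should log upcalled ("miss upcall") packets
  ovs-dpctl dump-flows | wc -l         # how many datapath flows are installed
  ovs-appctl vlog/set dpif:info        # turn the debug log back off
)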

Thanks,
Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] UDP stateful ACL not working when pkt passing through two switches ?

2022-07-29 Thread Han Zhou
On Thu, Jul 28, 2022 at 11:53 AM Brendan Doyle 
wrote:
>
> UDP stateful ACL not working? The logical representation of my network
is shown below
> ('ovn-nbctl show' shown towards the end). I have a Port Group
(pg_vcn3_net1_sl3) that has
> two ports in it: the VM port on switch ls_vcn3_net1, and the
lsb_vcn4_stgw-lr_vcn3_stgw port on switch
> (ls_vcn3_backbone) as shown below ((o)).
>
> When I do a 'showmount -e 192.16.1.106' in the VM, I see the packet go out from
the VM and get to the NFS
> server on the underlay, see the reply on the underlay, and then I see my
PG ACL drop the packet.
>
> The ACLs are:
>
> Egress From VM - Ingress to switch
> ---
> from-lport 32767 (inport == @pg_vcn3_net1_sl3 && (arp || udp.dst == 67 ||
udp.dst == 68)) allow-related
> from-lport 27000 (inport == @pg_vcn3_net1_sl3 && ip4.dst == 192.16.1.0/24
&& udp.dst == 111) allow-related
> from-lport 0 (inport == @pg_vcn3_net1_sl3) drop
log(name=fss-8,severity=debug) <--- Drops
the return pkt

According to your description, the ACL here not only applies to the VM port
but also the router port (lsb_vcn4_stgw-lr_vcn3_stgw) on the
ls_vcn3_backbone switch. So the return packet is in fact dropped at the
backbone switch, which is expected because we don't support conntrack for
router ports, so the "to-lport" ACL below wouldn't create the conntrack
entry. OVN ACL is primarily to apply rules for VIFs (VMs/containers).
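(Listing the port group is a quick way to confirm that the router-attached
port is indeed a member and therefore subject to these ACLs:

  ovn-nbctl list Port_Group pg_vcn3_net1_sl3
)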

I remember @Numan Siddique  worked on some patches related
to ACL on router port recently, so maybe he could provide more details or
correct me if I am wrong.

Thanks,
Han

>
> Ingress TO VM - Egress from switch
> 
>   to-lport 32767 (outport == @pg_vcn3_net1_sl3 && (arp || udp.dst == 67
|| udp.dst == 68)) allow-related
>   to-lport 27000 (outport == @pg_vcn3_net1_sl3 && ip4.src == 192.16.1.0/24
&& tcp.dst == 111) allow-related
>   to-lport 27000 (outport == @pg_vcn3_net1_sl3 && ip4.src == 192.16.1.0/24
&& tcp.dst == 20048) allow-related
>   to-lport 27000 (outport == @pg_vcn3_net1_sl3 && ip4.src == 192.16.1.0/24
&& udp.dst == 111) allow-related  <--- But this should
>   to-lport 0 (outport == @pg_vcn3_net1_sl3) drop
log(name=fss-17,severity=debug)
  have allowed the
>

  return pkt
>
>
> ++
> |   VM   |
> | 192.16.1.6 |
> +-((O))--+
> | 284195d2-9280-4334-900e-571ecd00327a in PG
pg_vcn3_net1_sl3
>   +-+
>   |ls_vcn3_net1 |
>   +-+
> | ls_vcn3_net1-lr_vcn3_net1 (proxy ARP for 192.16.1.106)
>   |
> |
> | lr_vcn3_net1-ls_vcn3_net1 (192.16.1.1/24)
>   /\
>  ( lr_vcn3_net1 )
>   \/
> | lr_vcn3_net1-lsb_vcn3_net1 (253.255.25.1/25)
> |
> |
> | lsb_vcn3_net1-lr_vcn3_net1
>  ++
>  |   ls_vcn3_backbone |
>  +((O))---+
> | lsb_vcn4_stgw-lr_vcn3_stgw in PG pg_vcn3_net1_sl3
> |
> |
> | lr_vcn3_stgw-lsb_vcn3_stgw (253.255.25.10/25)
>  /\
> ( lr_vcn3_stgw ) SNAT 192.16.1.6 to 253.255.80.8
>  \/
> | lr_vcn3_stgw-ls_vcn3_external_stgw (253.255.80.20/16)
> |
> |
> | ls_vcn3_external_stgw-lr_vcn3_stgw
>   +---+
>   | ls_vcn3_external_stgw |
>   +---+
> | ln-ls_vcn3_external_stgw
> |   (localnet)
> |
>+-+
>| br-ext  | Physical OVS on chassis
>+-+
> |  Egress : Change dst 192.16.1.106 to dst 253.255.0.2
> |  Ingress: Change src 253.255.0.2 to 192.16.1.106
> 253.255.0.0/16  |
> |
>  +---+
>  |  NFS server   |
>  | 253.255.0.2   |
>  +---+
>
> When I do a trace of the outgoing packet, it looks to me like there is
no conntrack
> established in the ls_vcn3_backbone, so it does not recognize the return
packet as a return,
> but the 'allow-related' should have established that. See below
>
>
> ovn-trace --detailed ls_vcn3_net1 'inport ==
"284195d2-9280-4334-900e-571ecd00327a" && eth.dst == 40:44:00:00:00:90 &&
eth.src == 52:54:00:02:55:96 && ip4.src == 192.16.1.6 && ip4.dst ==
192.16.1.106 && ip.ttl == 64 && udp.dst == 111'
> #
udp,reg14=0x1,vlan_tci=0x,dl_src=52:54:00:02:55:96,dl_dst=40:44:00:00:00:90,nw_src=192.16.1.6,nw_dst=192.16.1.106,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=111
>
> ingress(dp="ls_vcn3_net1", inport="284195")
> ---
>  0. 

Re: [ovs-discuss] ovn-controller stranger behaviour

2022-06-24 Thread Han Zhou
Hi Tiago,

Thanks for reporting the problem. It seems you can easily reproduce the
problem, right? If so, could you enable debug log for ovn-controller before
triggering the recompute, and then we can see what flows are added during
recompute from the logs of the ofctrl module?
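(Roughly, on the affected chassis, something like:

  ovn-appctl -t ovn-controller vlog/set ofctrl:file:dbg
  ovn-appctl -t ovn-controller recompute
  ovn-appctl -t ovn-controller vlog/set ofctrl:file:info   # restore afterwards

and then look at the flow additions logged by the ofctrl module in
ovn-controller.log around the recompute.)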

Thanks,
Han

On Thu, Jun 23, 2022 at 1:24 PM Tiago Pires  wrote:

> Hi all,
>
> I did some troubleshooting and I always see this error (ovs-vswitchd)
> when a VM is created on a Chassis:
> 2022-06-23T11:47:08.385Z|07907|bridge|WARN|could not open network device
> tap8a43df0c-fd (No such device)
> 2022-06-23T11:47:09.282Z|07908|bridge|INFO|bridge br-int: added interface
> tap8a43df0c-fd on port 51
> 2022-06-23T11:47:09.645Z|07909|bridge|INFO|bridge br-int: added interface
> tap3200bf1c-20 on port 52
> 2022-06-23T11:47:19.329Z|07911|connmgr|INFO|br-int<->unix#1468: 430
> flow_mods in the 7 s starting 10 s ago (410 adds, 20 deletes)
>
> On this commit
> http://patchwork.ozlabs.org/project/ovn/patch/1608197000-637-1-git-send-email-dce...@redhat.com/
> it solved something similar to my issue. It seems the ovs-vswitchd is
> missing some flows and when I run the recompute it fixes it.
> So, to avoid this issue I'm testing at this moment to run the recompute
> through libvirt hook when a VM gets "started" status.
>
> Regards,
>
> Tiago Pires
>
>
> Em qua., 22 de jun. de 2022 às 19:43, Tiago Pires 
> escreveu:
>
>> Hi all,
>>
>> I'm trying to understand a strange behaviour of
>> ovn-controller.
>> In my setup I have OVN 21.09 / OVS 2.16 and Xena, and sometimes when a new
>> VM is created, this VM can reach other VMs in east-west traffic (even on
>> different Chassis) but it can't reach an external network (e.g. the Internet)
>> through the Chassis Gateway.
>> I ran the following trace:
>> # ovs-appctl ofproto/trace br-int
>> in_port="93",icmp,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.140,nw_dst=8.8.8.8,nw_ttl=64
>>
>> And I got this output:
>> Final flow:
>> recirc_id=0xc157b1,eth,icmp,reg0=0x300,reg11=0xd,reg12=0x10,reg13=0xf,reg14=0x3,reg15=0x2,metadata=0x29,in_port=93,vlan_tci=0x,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.140,nw_dst=8.8.8.8,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0
>> Megaflow:
>> recirc_id=0xc157b1,ct_state=+new-est-rel-rpl-inv+trk,ct_label=0/0x1,eth,icmp,in_port=93,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=
>> 192.168.40.128/26,nw_dst=8.0.0.0/7,nw_ttl=64,nw_frag=no
>> Datapath actions:
>> ct(commit,zone=15,label=0/0x1,nat(src)),set(eth(src=fa:16:3e:ec:7f:dd,dst=00:00:00:00:00:00)),set(ipv4(ttl=63)),userspace(pid=3451843211,controller(reason=1,dont_send=1,continuation=0,recirc_id=12670898,rule_cookie=0x3e26215e,controller_id=0,max_len=65535))
>> It seems the Datapath is querying the controller and I did not understand
>> the reason.
>>
>> So, I did an ovn-controller recompute (ovn-appctl -t ovn-controller
>> recompute) on the Chassi where the VM is placed to check if it could change
>> the behaviour and I could trace the packet with success and the VM started
>> to communicate with the Internet normally:
>> Final flow:
>> recirc_id=0x2,eth,icmp,reg0=0x300,reg11=0xd,reg12=0x10,reg13=0xf,reg14=0x3,reg15=0x2,metadata=0x29,in_port=93,vlan_tci=0x,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.140,nw_dst=8.8.8.8,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0
>> Megaflow:
>> recirc_id=0x2,ct_state=+new-est-rel-rpl-inv+trk,ct_label=0/0x1,eth,icmp,tun_id=0/0xff,tun_metadata0=NP,in_port=93,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=
>> 192.168.40.128/26,nw_dst=8.0.0.0/7,nw_ecn=0,nw_ttl=64,nw_frag=no
>> Datapath actions:
>> ct(commit,zone=15,label=0/0x1,nat(src)),set(tunnel(tun_id=0x2a,dst=10.X6.X3.133,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002}),flags(df|csum|key))),set(eth(src=fa:16:3e:ec:7f:dd,dst=00:00:5e:00:04:00)),set(ipv4(ttl=63)),2
>> The Datapath action is using the tunnel with the Chassi Gateway.
>>
>> It happens always with new VMs but sometimes. After running the recompute
>> on the Chassi, I created additional VMs and this issue did not happen.
>>
>> In my Chassi I have enable these parameters also:
>> ovn-monitor-all="true"
>> ovn-openflow-probe-interval="0"
>> ovn-remote-probe-interval="18"
>>
>> Do you know this behaviour could be bug related?
>>
>> Tiago Pires
>>
>>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] reside-on-redirect-chassis and redirect-type

2022-02-20 Thread Han Zhou
On Thu, Feb 17, 2022 at 3:23 AM Brendan Doyle 
wrote:
>
> Hi,
>
> So I have a Distributed Gateway Port (DGP) on a Gateway through which
> VMs in the overlay can access
> underlay networks. If the VM is not on the chassis where the DGP is
> scheduled then the traffic takes
> the extra tunneled hop to the chassis where the DGP is and is then sent
> to the underlay via the localnet
> switch there.  It would be great if I could avoid that extra hop, whilst
> still having the Gateway do NAT
> and routing.  So I'm wondering if 'reside-on-redirect-chassis'  or '
> redirect-type' can be used to
> do this? and if so which one? Also will normal traffic between to VMs in
> the overlay on different chassis
> still be tunneled through the underlay?
>
> I'll do some experimentation later, but a nod in the right direction
> would be appreciated.
>
Hi Brendan,

Not sure if I understood your question well. If you want to avoid the
gateway hop, you can just configure NAT rules with "dnat_and_snat" to do
distributed NAT.
'reside-on-redirect-chassis' and 'redirect-type' don't seem to help for
your use case because you mentioned your VMs are in overlay, while those
options are for VLAN based logical networks.
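(For completeness, a distributed dnat_and_snat rule needs the logical port
and its MAC so the NAT can be processed on the VM's chassis; a sketch with
placeholder names/addresses:

  ovn-nbctl lr-nat-add <ROUTER> dnat_and_snat <EXTERNAL_IP> <VM_IP> <VM_LSP> <VM_MAC>
)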

Thanks,
Han

> Thanks
>
>
> Brendan
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] Remodel OVN Logical_Switch_Port addresses

2022-02-16 Thread Han Zhou
On Wed, Feb 16, 2022 at 3:27 AM Frode Nordahl 
wrote:
>
> Hello all,
>
> Having just spent several days chasing an unpredictable flow output
> issue caused by duplicate IP addresses on LSPs in the same LS (see [0]
> for details), I wanted to gauge the interest for remodeling this part
> of the OVN NB DB data structure.
>
> As mentioned in [0] we currently do have a duplicate check in the
> ovn-nbctl tool, but since OVN has a database driven CMS facing API,
> this does not help when the CMS is the source of the duplicate IPs due
> to losing track of LSP creation and deletion.
>
> There is no doubt that the problem owner here is the CMS, but I can't
> help thinking that this problem would have had less severe
> consequences and would be easier to detect if we could prevent the
> duplicates from ever entering the database.
>
> I'm thinking along the lines of creating a
> Logical_Switch_Port_Addresses table with columns Logical_Switch_Port
> (reference), mac_address, ip_address and a table uniqueness constraint
> across that set of columns.

Hi Frode,

Thanks for bringing this up. I agree that it would be better if we could
prevent wrong data from entering the DB, but it is not always easy.
We have to consider that it is valid to have the same IPs in different
logical switches, so a new table with the columns you mentioned is not
sufficient. We would need the LS in that table, too, but that adds redundant
information to the DB, and I'm not sure it is worth the benefit.

There are other situations where the CMS can put wrong data into the NB DB,
which would end up with conflicting desired flows in ovn-controller, e.g.
overlapping/duplicated "match" expressions in ACLs. I'd still leave the
responsibility to the CMS.

Thanks,
Han

>
> I know it is difficult to change behavior like this so far into the
> lifetime of the feature, but I'm sure we'll be able to provide a
> migration path if there is appetite for it.
>
> Any thoughts?
>
> 0: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1961046
>
> --
> Frode Nordahl
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [External] : allow-related and allow-stateless not working when used together - should have the union of the rules.

2022-01-25 Thread Han Zhou
On Mon, Jan 24, 2022 at 4:31 AM Brendan Doyle 
wrote:

>
>
> On 24/01/2022 12:05, Odintsov Vladislav wrote:
>
> Hi,
>
> allow-stateless rules are fully implemented with ovs, while allow-related
> involve conntrack. Conntrack
> adds some additional logic, which most likely lead to drops.
>
> I couldn’t understand the reason for such flows (pg1 - sender - has
> allow-stateless rule for dst tcp/22 for egress
> and pg2 - receiver has allow-related rule for dst tcp/22;
> in reverse direction pg1 has allow-related rule for src tcp/22 and pg2 has
> allow-stateless for src tcp/22).
>
> I know it is an odd combination of stateful and stateless; it is a test
> case. We are trying to implement
> a feature that does allow a mix of stateless and stateful, and does work
> with the combination of
> stateful/stateless defined here.
>
> IIUC when tcp session is established from VM1 in pg1 to VM2 in pg2, tcp
> syn packet doesn’t get to VM1 allocated
> by ovn-controller conntrack zone and NEW connection is NOT commited there
> (as it is allow-stateless behaviour).
>
> Then, tcp syn reaches VM2 ingress pipeline, it goes to conntrack and a NEW
> connection is committed to VM2’s network interface
> conntrack zone (allow-related).
> The replying tcp syn+ack is sent from VM2 in an allow-stateless manner,
> packet is processed by OVS without conntrack
> (note, that at this point the connection in VM2’s conntrack zone is left
> `unreplied` and will remain there until the
> timeout expires), packet is sent out.
> When tcp syn+ack with src 22 reaches VM1 ingress pipeline, it is sent to
> conntrack (VM1’s network interface conntrack zone),
> which first time sees packet from this connection and moreover, it’s in an
> expected state (it’s a reply). Conntrack sets +inv
> on this packet and then it got dropped.
>
>
> OK thanks this makes sense. So then stateless and stateful can co-exist in
> an ACL, but their rules
> have to be "symmetrical" i.e If an Egress rule  uses stateless, then the
> ingress should be stateless.
>
>
Yes.

In addition, we do support mixing stateful and stateless ACLs, but there
are restrictions as described in
https://github.com/ovn-org/ovn/blob/main/ovn-nb.xml

allow-stateless flows always take precedence before
stateful ACLs, regardless of their priority. (Both
allow and allow-related ACLs can be
stateful.)

allow-stateless: Always forward the packet in stateless
manner, omitting connection tracking mechanism, regardless of other
rules defined for the switch. May require defining additional rules
for inbound replies. For example, if you define a rule to allow
outgoing TCP traffic directed to an IP address, then you probably
also want to define another rule to allow incoming TCP traffic coming
from this same IP address.


In practice, it is not a good idea to have *overlapping* ACLs between
stateless and stateful.
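(As an illustration with a hypothetical port group pg1 and ssh traffic: since
allow-stateless creates no conntrack entry, the reply direction has to be
allowed explicitly and symmetrically, e.g.:

  ovn-nbctl acl-add pg1 from-lport 27000 'inport == @pg1 && tcp.dst == 22' allow-stateless
  ovn-nbctl acl-add pg1 to-lport 27000 'outport == @pg1 && tcp.src == 22' allow-stateless
)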

Thanks,
Han


> Brendan
>
>
> You should rethink your ACLs to either be stateless-only or correctly go
> to conntrack and commit new and replying connections
> to work properly.
> Also, take into account that allow-stateless rules always have a higher
> priority than allow-related. Even if you specify otherwise [0].
>
> 0: https://github.com/ovn-org/ovn/blob/v21.12.0/northd/northd.c#L5724
> 
>
> Regards,
> Vladislav Odintsov
>
> On 24 Jan 2022, at 14:18, Brendan Doyle  wrote:
>
> On 21/01/2022 23:29, Numan Siddique wrote:
>
> On Fri, Jan 21, 2022 at 1:06 PM Brendan Doyle 
> wrote:
>
> Just wondering if there was any update on this?
>
> I've not tested this scenario.  And I'm not too well versed with the
> allow-stateless.
> My guess here is that since you're mixing both stateful and stateless,
> the packet is getting
> dropped due to high priority flows which ovn-northd generates to drop
> packets with ct.inv.
>
> Can you verify if that is the case ?
>
> When there are stateful ACLs,  you'd see logical flows like
>
> -
>   table=9 (ls_in_acl  ), priority=65532, match=(ct.inv ||
> (ct.est && ct.rpl && ct_label.blocked == 1)), action=(drop;)
>  ...
> ...
>   table=4 (ls_out_acl ), priority=65532, match=(ct.inv ||
> (ct.est && ct.rpl && ct_label.blocked == 1)), action=(drop;)
>
> ---
>
>
> Yes we do see these SB flow entries see attached.
>
>
> I'd suggest checking if these flows are hit in openflow table 17 or 44
> on the compute node where the traffic is dropped.
>
> So As indicated in the original email with tcpdump we can see
> the ssh request from the sender being received and replied to by the
> receiver but the response is not seen by the sender.
>
> This seems to tally with what is seen in the OVS flows. In the receiver
> OVS flows
> (attached) we see no hits in table 17/44 that drop packets.
>
> But in the sender OVS flows (attached) we do see
>
>  cookie=0x74b1af0, duration=30.294s, 

Re: [ovs-discuss] OVN at scale in production

2021-10-14 Thread Han Zhou
On Thu, Oct 14, 2021 at 7:25 AM Seena Fallah  wrote:

> It's mostly on nb.
>
I am surprised, since we usually don't see any scale problem for the NB DB
servers: usually the SB data size is much bigger, and the number of
clients is much bigger than for the NB DB. So if there are scale problems
they would always show up on the SB before the NB hits any limit.
You are probably seeing an NB scale problem but not an SB one because
ovn-northd couldn't even translate the NB data to the SB yet, due to the NB
problem you hit. I'd suggest starting with a smaller scale, making sure it
works end to end, and then enlarging it gradually; then you would see the
real limit.
Somehow 100k ACLs sounds scary to me. Usually the number of ACLs is not so
big, but each ACL could reference big address-sets and port-groups. You
could probably give more details about your topology and what your typical
ACLs look like.
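(When sharing those details, a rough way to characterize the NB contents is
simply counting rows per table, e.g.:

  for t in ACL Address_Set Port_Group Logical_Switch Logical_Switch_Port; do
      echo -n "$t: "; ovn-nbctl --bare --columns=_uuid list $t | grep -c .
  done
)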


> Yes, I set that value before to 6 but it didn't help!
>
> On Sun, Oct 10, 2021 at 10:34 PM Han Zhou  wrote:
>
>>
>>
>> On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah 
>> wrote:
>> >
>> > Also I get many logs like this in ovn:
>> >
>> > 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in
>> last 8 seconds (most recently, 3 seconds ago) due to excessive rate
>> > 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454:
>> receive error: Connection reset by peer
>> > 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454:
>> connection dropped (Connection reset by peer)
>> > 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224:
>> connection dropped (Connection reset by peer)
>> > 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514:
>> connection dropped (Connection reset by peer)
>> > 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544:
>> connection dropped (Connection reset by peer)
>> > 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846:
>> connection dropped (Connection reset by peer)
>> > 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796:
>> connection dropped (Connection reset by peer)
>> >
>> > What does it mean about excessive rate? How many req/s is going to be
>> an excessive rate?
>>
>> Don't worry about "excessive rate", which is talking about the log rate
>> limit itself.
>> The "connection reset by peer" indicates client side inactivity probe is
>> enabled and it disconnects when the server hasn't responded for a while.
>> What server is this? NB or SB? Usually SB DB would have this problem if
>> there are lots of nodes and if the inactivity probe is not adjusted on the
>> nodes (ovn-controllers). Try: ovs-vsctl set open .
>> external_ids:ovn-remote-probe-interval=10 on each node.
>>
>> >
>> > On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah 
>> wrote:
>> >>
>> >> Seems the most leader failure is for NB and the command you said is
>> for SB.
>> >>
>> >> Do you have any benchmarks of how many ACLs can OVN perform normally?
>> >> I see many failures after 100k ACLs.
>> >>
>> >> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique  wrote:
>> >>>
>> >>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah 
>> wrote:
>> >>> >
>> >>> > I'm using these versions on a centos container:
>> >>> > ovsdb-server (Open vSwitch) 2.15.2
>> >>> > ovn-nbctl 21.06.0
>> >>> > Open vSwitch Library 2.15.90
>> >>> > DB Schema 5.32.0
>> >>> >
>> >>> > Today I see the election timed out too and I should increase ovsdb
>> election timeout too. I saw the commits but I didn't find any related
>> change to my problem.
>> >>> > If I use ovn 21.09 with ovsdb 2.16 Is there still any need to
>> increase election timeout and disable the inactivity probe?
>> >>>
>> >>> Not sure on that.  It's worth a try if you have a test environment.
>> >>>
>> >>> > Also is there any limitation on the number of ACLs that can OVN
>> handle?
>> >>>
>> >>> I don't think there is any limitation on the number of ACLs.  In
>> >>> general as the size of the SB DB increases, we have seen issues.
>> >>>
>> >>> Can you run the below command on each of your nodes where
>> >>> ovn-controller runs and see if that helps ?
>> >>>
>> >>> ---
>> >>> ovs-vsctl set open . external_ids:ovn-moni

Re: [ovs-discuss] OVN at scale in production

2021-10-10 Thread Han Zhou
On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah  wrote:
>
> Also I get many logs like this in ovn:
>
> 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in
last 8 seconds (most recently, 3 seconds ago) due to excessive rate
> 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454: receive
error: Connection reset by peer
> 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454:
connection dropped (Connection reset by peer)
> 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224:
connection dropped (Connection reset by peer)
> 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514:
connection dropped (Connection reset by peer)
> 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544:
connection dropped (Connection reset by peer)
> 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846:
connection dropped (Connection reset by peer)
> 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796:
connection dropped (Connection reset by peer)
>
> What does it mean about excessive rate? How many req/s is going to be an
excessive rate?

Don't worry about "excessive rate", which is talking about the log rate
limit itself.
The "connection reset by peer" indicates client side inactivity probe is
enabled and it disconnects when the server hasn't responded for a while.
What server is this? NB or SB? Usually SB DB would have this problem if
there are lots of nodes and if the inactivity probe is not adjusted on the
nodes (ovn-controllers). Try: ovs-vsctl set open .
external_ids:ovn-remote-probe-interval=10 on each node.

>
> On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah 
wrote:
>>
>> Seems the most leader failure is for NB and the command you said is for
SB.
>>
>> Do you have any benchmarks of how many ACLs can OVN perform normally?
>> I see many failures after 100k ACLs.
>>
>> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique  wrote:
>>>
>>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah 
wrote:
>>> >
>>> > I'm using these versions on a centos container:
>>> > ovsdb-server (Open vSwitch) 2.15.2
>>> > ovn-nbctl 21.06.0
>>> > Open vSwitch Library 2.15.90
>>> > DB Schema 5.32.0
>>> >
>>> > Today I see the election timed out too and I should increase ovsdb
election timeout too. I saw the commits but I didn't find any related
change to my problem.
>>> > If I use ovn 21.09 with ovsdb 2.16 Is there still any need to
increase election timeout and disable the inactivity probe?
>>>
>>> Not sure on that.  It's worth a try if you have a test environment.
>>>
>>> > Also is there any limitation on the number of ACLs that can OVN
handle?
>>>
>>> I don't think there is any limitation on the number of ACLs.  In
>>> general as the size of the SB DB increases, we have seen issues.
>>>
>>> Can you run the below command on each of your nodes where
>>> ovn-controller runs and see if that helps ?
>>>
>>> ---
>>> ovs-vsctl set open . external_ids:ovn-monitor-all=true
>>> ---
>>>
>>> Thanks
>>> Numan
>>>
>>>
>>> >
>>> > Thanks.
>>> >
>>> > On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique  wrote:
>>> >>
>>> >> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah 
wrote:
>>> >> >
>>> >> > Hi,
>>> >> >
>>> >> > I use ovn for OpenStack neutron plugin for my production. After
days I see issues about losing a leader in ovsdb. It seems it was because
of the failing inactivity probe and because I had 17k acls. After I disable
the inactivity probe it works fine but when I did a scale test on it (about
40k ACLS) again it fails the leader.
>>> >> > I saw many docs about ovn at scale issues that were raised by both
RedHat and eBay and seems the solution is to rewrite ovn with ddlog. I
checked it with northd-ddlog but nothing changes.
>>> >> >
>>> >> > My question is should I wait more for ovn to be stable for high
scale or is there any tuning I miss in my deployment?
>>> >> > Also, will the ovn-nb/sb rewrite with ddlog and can help the
issues at a high scale? if yes is there any due time?
>>> >>
>>> >> What is the ovsdb-server version you're using ?  There are many
>>> >> improvements in the ovsdb-server in 2.16.
>>> >> Maybe that would help in your deployment.  And also there were many
>>> >> improvements which went into OVN 21.09
>>> >> if you want to test it out.
>>> >>
>>> >> Thanks
>>> >> Numan
>>> >>
>>> >> >
>>> >> > Thanks.
>>> >> > ___
>>> >> > discuss mailing list
>>> >> > disc...@openvswitch.org
>>> >> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>>> >
>>> > ___
>>> > discuss mailing list
>>> > disc...@openvswitch.org
>>> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
___
discuss mailing list
disc...@openvswitch.org

Re: [ovs-discuss] [ovs-dev] [OVN] branch name renamed from 'master' to 'main'

2021-10-07 Thread Han Zhou
On Tue, Oct 5, 2021 at 10:40 AM Numan Siddique  wrote:
>
> Hello everyone,
>
> The default branch of OVN has been renamed from 'master' to 'main'.  I
> had brought this up
>  for discussion in our weekly upstream OVN meeting a couple of weeks
> ago and the attendees were supportive of it.
>
> I followed the instructions from here -
> https://github.com/github/renaming  for the renaming.
>
> Request all the developers and users of OVN code base to update the
> default branch locally.
>
> Please let me know if this caused any issues so that this can be resolved
soon.
>
> Thanks
> Numan
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Thanks Numan. It seems the committer's git hook script needs a change, too.
I noticed that the hook is not mentioned in OVN's documentation, so I copied
it from the OVS documentation and renamed the branch.
PTAL:
https://patchwork.ozlabs.org/project/ovn/patch/20211007165546.1495488-1-hz...@ovn.org/

Thanks,
Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [ovn] OVN asymmetric routing with conntrack

2021-08-24 Thread Han Zhou
On Tue, Aug 24, 2021 at 3:02 PM Vladislav Odintsov 
wrote:

>
>
> Regards,
> Vladislav Odintsov
>
> On 25 Aug 2021, at 00:47, Numan Siddique  wrote:
>
> On Mon, Aug 23, 2021 at 11:22 AM Vladislav Odintsov 
> wrote:
>
>
> Hi,
>
> we’ve faced an issue where asymmetric-routed traffic is used. Please help
> understand what options do we have to allow such traffic.
>
> Topology is next:
>
> client lsp (10.0.0.1/24)
> |
>ls-external
>/ \
> lsp router vm1 eth0: 10.0.0.2/24 lsp router vm2 eth0: 10.0.0.3/24
> lsp router vm1 eth1: 192.168.0.1/24  lsp router vm2 eth1: 192.168.0.2/24
>\ /
>ls-internal
> |
>server lsp (192.168.0.10/24)
>
>
> All LSPs have port_security configured with " 0.0.0.0/0 ::/0" and
> belong to port group pg1.
>
> There are two ACLs within this PG:
> from-lport 0.0.0.0/0 allow-related
> to-lport 0.0.0.0/0 allow-related
>
> The problem is when traffic from client to server goes through router vm1
> and returns through router vm2, there is no connectivity. I see reply
> traffic on the server interface, which is going to router vm2 mac address,
> but I don't see it on the router vm2 interface.
> I guess the reason for this is that conntrack first time sees packet for
> the connection and ACK+SYN flags are set and treats this packet as invalid,
> right?
>
>
> I think so.
>
>
> If yes, is there any option how to use asymmetric-routed topologies inside
> OVN with stateful ACLs?
> I found there is an ability to replace ct.inv field check:
> https://github.com/ovn-org/ovn/commit/3bb91366a6b0d60df5ce8f9c7f6427f7d37dfdd4
> Is it good idea to use this option to solve the issue or this is intended
> specifically to use with smart NICs without invalid state support and can
> be removed in future?
>
>
I am afraid disabling ct.inv will not help here. In your use case the
connections won't become established in conntrack, which means stateful ACLs
wouldn't work. This would be the case even with physical stateful FWs. I'd
suggest either disabling ACLs or using stateless ACLs (i.e.
allow/allow-stateless instead of allow-related).

>
> I do not understand your use case completely.  I'm not quite clear
> from the diagram which all resources are external
> and which all are part of OVN.  Have you tried using the ECMP routes
> feature ?
>
> Logical Routers are not used in this topology. Only two logical switches -
> ls-external and ls-internal.
>
> In OVN we have:
> 1. LS "ls-external" with 3 LSPs: "lsp-router-vm1-eth0" (10.0.0.2/24),
> "lsp-router-vm2-eth0" (10.0.0.3/24) and "lsp-client" (10.0.0.1/24)
> 2. LS "ls-internal" with 2 LSPs: "lsp-router-vm1-eth1" (192.168.0.1/24),
> "lsp-router-vm2-eth1" (192.168.0.2/24) and "lsp-server" (192.168.0.10/24)
> 3. Port group pg1 with mentioned above LSPs
> 4. Ingress and egress ACLs 0.0.0.0/0 with allow-related action.
>
> External:
> Linux client VM (10.0.0.1), which has static ECMP route:
> 192.168.0.0/24 via nexthop 10.0.0.2 via nexthop 10.0.0.3
>
> sends tcp syn from ip 10.0.0.1 to 192.168.0.10 (server).
> Traffic to server can go either through router VM1 (10.0.0.2) or through
> router VM2 (10.0.0.3).
>
> Server also has static ECMP route to reverse direction:
> 10.0.0.0/24 via nexthop 192.168.0.1 via nexthop 192.168.0.2
>
> So, when traffic in both directions goes through same router VM, it passes
> normally.
> When traffic goes from client to server through one router VM and returns
> back through another - SYN-ACK is blocked somewhere in OVS/conntrack. On
> router VM’s interface we don’t see SYN+ACK.
>
> So, OVN-based ECMP feature is not relevant for this case since it doesn’t
> invoke logical routers.
>
> Regarding the ct.inv flag, does it work when you disable the usage of
> ct.inv ?
>
> Not yet. First wanted to understand options.
>
> Thanks
> Numan
>
>
> If these routes are configured in the logical router, then
>
> Thanks.
>
> Regards,
> Vladislav Odintsov
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [question] need help to understand raft leadership transfer reason

2021-08-16 Thread Han Zhou
On Sun, Aug 15, 2021 at 5:37 AM Vladislav Odintsov 
wrote:
>
> Han, thanks for the answer.
> My comments inline.
>
> Regards,
> Vladislav Odintsov
>
> On 12 Aug 2021, at 19:15, Han Zhou  wrote:
>
>
>
> On Thu, Aug 12, 2021 at 4:54 AM Vladislav Odintsov 
wrote:
> >
> > Hi,
> >
> > I’ve got a 3-node RAFT cluster, serving ovn northbound DB and trying to
understand what triggered ovsdb-server leadership change.
> > Can somebody help explain that?
> >
> > This cluster runs ovs 2.13.4 and has ~25 active clients: 3 x
ovn-northd, 3 x ovn-ic, 11 python-ovsdbapp clients (CMS).
> >
> > Server logs (per-node) listed above.
> >
> > Node 1:
> >
> > 2021-08-12T07:45:55.189Z|03131|raft|INFO|server be61 is leader for term
357
> > 2021-08-12T07:46:04.187Z|03132|raft|INFO|term 358: 1454 ms timeout
expired, starting election
> > 2021-08-12T07:46:04.191Z|03133|raft|INFO|term 358: elected leader by 2+
of 3 servers
> > 2021-08-12T07:46:05.558Z|03134|raft|INFO|rejecting term 357 < current
term 358 received in vote_reply message from server be61
> > 2021-08-12T07:46:29.181Z|03135|timeval|WARN|Unreasonably long 2122ms
poll interval (2120ms user, 0ms system)
> > 2021-08-12T07:46:29.181Z|03136|timeval|WARN|context switches: 0
voluntary, 7 involuntary
> > 2021-08-12T07:46:29.181Z|03137|coverage|INFO|Event coverage, avg rate
over last: 5 seconds, last minute, last hour,  hash=3b99bba2:
> > 2021-08-12T07:46:29.181Z|03138|coverage|INFO|hmap_pathological
94.0/sec 9.917/sec0.1819/sec   total: 85714
> > 2021-08-12T07:46:29.181Z|03139|coverage|INFO|hmap_expand
 39281.4/sec  4318.283/sec  112.9786/sec   total: 193444648
> > 2021-08-12T07:46:29.181Z|03140|coverage|INFO|lockfile_lock
 0.0/sec 0.000/sec0./sec   total: 1
> > 2021-08-12T07:46:29.181Z|03141|coverage|INFO|poll_create_node
201.6/sec   109.617/sec   49.4042/sec   total: 425254534
> > 2021-08-12T07:46:29.181Z|03142|coverage|INFO|poll_zero_timeout
 2.0/sec 1.567/sec0.9350/sec   total: 2938208
> > 2021-08-12T07:46:29.181Z|03143|coverage|INFO|seq_change
 13.0/sec10.433/sec5.5928/sec   total: 22320268
> > 2021-08-12T07:46:29.181Z|03144|coverage|INFO|pstream_open
0.0/sec 0.000/sec0./sec   total: 4
> > 2021-08-12T07:46:29.181Z|03145|coverage|INFO|stream_open
 0.0/sec 0.000/sec0./sec   total: 20
> > 2021-08-12T07:46:29.181Z|03146|coverage|INFO|unixctl_received
0.0/sec 0.017/sec0.0164/sec   total: 54512
> > 2021-08-12T07:46:29.181Z|03147|coverage|INFO|unixctl_replied
 0.0/sec 0.017/sec0.0164/sec   total: 54512
> > 2021-08-12T07:46:29.181Z|03148|coverage|INFO|util_xalloc
 1693863.2/sec 181037.517/sec 3963.9467/sec   total: 4830747405
> > 2021-08-12T07:46:29.181Z|03149|coverage|INFO|97 events never hit
> > 2021-08-12T07:46:29.186Z|03150|raft|WARN|ignoring vote request received
as leader
> > 2021-08-12T07:46:29.186Z|03151|raft|INFO|server be61 is leader for term
359
> > 2021-08-12T07:46:29.187Z|03152|raft|INFO|1076 truncating 1 entries from
end of log
> > 2021-08-12T07:46:29.187Z|03153|raft|INFO|rejected append_reply (not
leader)
> > 2021-08-12T07:46:29.187Z|03154|raft|INFO|rejected append_reply (not
leader)
> > 2021-08-12T07:46:29.187Z|03155|raft|INFO|rejected append_reply (not
leader)
> > 2021-08-12T07:46:29.191Z|03156|raft|INFO|rejected append_reply (not
leader)
> > 2021-08-12T07:46:29.221Z|03157|jsonrpc|WARN|Dropped 4 log messages in
last 14866 seconds (most recently, 14865 seconds ago) due to excessive rate
> > 2021-08-12T07:46:29.221Z|03158|jsonrpc|WARN|tcp:client-1:42402: receive
error: Connection reset by peer
> > 2021-08-12T07:46:29.222Z|03159|reconnect|WARN|tcp:client-1:42402:
connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:29.222Z|03160|jsonrpc|WARN|tcp:client-2:45746: receive
error: Connection reset by peer
> > 2021-08-12T07:46:29.222Z|03161|reconnect|WARN|tcp:client-2:45746:
connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:29.225Z|03162|jsonrpc|WARN|tcp:client-3:54218: receive
error: Connection reset by peer
> > 2021-08-12T07:46:29.225Z|03163|reconnect|WARN|tcp:client-3:54218:
connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:29.232Z|03164|jsonrpc|WARN|tcp:client-4:48064: receive
error: Connection reset by peer
> > 2021-08-12T07:46:29.232Z|03165|reconnect|WARN|tcp:client-4:48064:
connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:29.232Z|03166|jsonrpc|WARN|tcp:client-5:49022: receive
error: Connection reset by peer
> > 2021-08-12T07:46:29.232Z|03167|reconnect|WARN|tcp:client-5:49022:
connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:29.566Z|03168|

Re: [ovs-discuss] OVN: Any objections for making logical router and logical switches tables indexed by name?

2021-08-12 Thread Han Zhou
On Tue, Aug 10, 2021 at 9:32 AM Flavio Fernandes  wrote:
>
> [cc: Ben + Dmitry]
>
> Hi folks,
>
> I'm looking at some conversion code in ovn-org/ovn-kubernetes where we
replace the ovn-nbctl wrapper with the libovsdb library ( ovn-org/libovsdb
). Since we are mostly doing this to make it faster (besides reducing the
memory footprint), using the "Where" syntax [1] will greatly benefit
operations on logical-router [2] and logical-switch [3] tables if they were
indexed by name. Similar to what we already have for the logical-router
port and logical-switch port tables.
>
> After listening to episode 72 of OVS Orbit [4], I would like to ask: does
anyone have objections to adding "indexes": [["name"]],  to the
logical-router [2] and logical-switch [3] tables? I understand Ben's point
on making the implementation of the locally-cached tables have these types
of optimizations, but at the same time, I see these 2 tables as low-hanging
fruits when scaling deployments with lots of lr's and ls's. Unless there is
an implementation that use nameless rows for these tables, I cannot think
of a usage case where duplicate names are useful. Do you?
>
> Depending on your answers, I can propose a tweak to the schema to have
these changes... or not. ;)
>
> Thanks,
>
> -- flaviof
>
> [1]:  https://github.com/ovn-org/libovsdb/pull/209
> [2]:
https://github.com/ovn-org/ovn/blob/d08f89e219e1fa45583757bd2804783cf0630179/ovn-nb.ovsschema#L306
> [3]:
https://github.com/ovn-org/ovn/blob/d08f89e219e1fa45583757bd2804783cf0630179/ovn-nb.ovsschema#L41
> [4]: https://ovsorbit.org/ ==> Episode 72: The OVSDB Query Optimizer and
Key-Value Interface, with Dmitry Yusupov from NVIDIA (Feb 27, 2021)
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

I have no objection to the schema change, as long as it is declared as an
incompatible change in NEWS and some guidance is added to the upgrade topics
in the documentation on how to resolve the problem when upgrading.
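For example, the upgrade note could suggest a quick duplicate-name check
before loading a schema with the new index (a rough sketch, assuming a
reachable NB DB; any duplicates found would have to be renamed first):

# names must be unique once "indexes": [["name"]] is added to these tables
ovn-nbctl --bare --columns=name list Logical_Switch | sort | uniq -d | grep .
ovn-nbctl --bare --columns=name list Logical_Router | sort | uniq -d | grep .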

However, I don't see a link between this and [4]. The talk was about server
side indexes, which optimize "select" operations, while what libovsdb does
is on the local cache. The PR [1] seems like it could utilize the index in
the schema, but it is not related to the server side indexes.

In addition, I would propose that libovsdb add support for customized
indexes through some new APIs (of course not tied to any schema), not only
on primary keys but on other columns as needed by clients; then the schema
wouldn't matter that much for the cache indexes.

Thanks,
Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [question] need help to understand raft leadership transfer reason

2021-08-12 Thread Han Zhou
On Thu, Aug 12, 2021 at 4:54 AM Vladislav Odintsov 
wrote:
>
> Hi,
>
> I’ve got a 3-node RAFT cluster, serving ovn northbound DB and trying to
understand what triggered ovsdb-server leadership change.
> Can somebody help explain that?
>
> This cluster runs ovs 2.13.4 and has ~25 active clients: 3 x ovn-northd,
3 x ovn-ic, 11 python-ovsdbapp clients (CMS).
>
> Server logs (per-node) listed above.
>
> Node 1:
>
> 2021-08-12T07:45:55.189Z|03131|raft|INFO|server be61 is leader for term
357
> 2021-08-12T07:46:04.187Z|03132|raft|INFO|term 358: 1454 ms timeout
expired, starting election
> 2021-08-12T07:46:04.191Z|03133|raft|INFO|term 358: elected leader by 2+
of 3 servers
> 2021-08-12T07:46:05.558Z|03134|raft|INFO|rejecting term 357 < current
term 358 received in vote_reply message from server be61
> 2021-08-12T07:46:29.181Z|03135|timeval|WARN|Unreasonably long 2122ms poll
interval (2120ms user, 0ms system)
> 2021-08-12T07:46:29.181Z|03136|timeval|WARN|context switches: 0
voluntary, 7 involuntary
> 2021-08-12T07:46:29.181Z|03137|coverage|INFO|Event coverage, avg rate
over last: 5 seconds, last minute, last hour,  hash=3b99bba2:
> 2021-08-12T07:46:29.181Z|03138|coverage|INFO|hmap_pathological
94.0/sec 9.917/sec0.1819/sec   total: 85714
> 2021-08-12T07:46:29.181Z|03139|coverage|INFO|hmap_expand
 39281.4/sec  4318.283/sec  112.9786/sec   total: 193444648
> 2021-08-12T07:46:29.181Z|03140|coverage|INFO|lockfile_lock
 0.0/sec 0.000/sec0./sec   total: 1
> 2021-08-12T07:46:29.181Z|03141|coverage|INFO|poll_create_node
201.6/sec   109.617/sec   49.4042/sec   total: 425254534
> 2021-08-12T07:46:29.181Z|03142|coverage|INFO|poll_zero_timeout
 2.0/sec 1.567/sec0.9350/sec   total: 2938208
> 2021-08-12T07:46:29.181Z|03143|coverage|INFO|seq_change
 13.0/sec10.433/sec5.5928/sec   total: 22320268
> 2021-08-12T07:46:29.181Z|03144|coverage|INFO|pstream_open
0.0/sec 0.000/sec0./sec   total: 4
> 2021-08-12T07:46:29.181Z|03145|coverage|INFO|stream_open
 0.0/sec 0.000/sec0./sec   total: 20
> 2021-08-12T07:46:29.181Z|03146|coverage|INFO|unixctl_received
0.0/sec 0.017/sec0.0164/sec   total: 54512
> 2021-08-12T07:46:29.181Z|03147|coverage|INFO|unixctl_replied
 0.0/sec 0.017/sec0.0164/sec   total: 54512
> 2021-08-12T07:46:29.181Z|03148|coverage|INFO|util_xalloc
 1693863.2/sec 181037.517/sec 3963.9467/sec   total: 4830747405
> 2021-08-12T07:46:29.181Z|03149|coverage|INFO|97 events never hit
> 2021-08-12T07:46:29.186Z|03150|raft|WARN|ignoring vote request received
as leader
> 2021-08-12T07:46:29.186Z|03151|raft|INFO|server be61 is leader for term
359
> 2021-08-12T07:46:29.187Z|03152|raft|INFO|1076 truncating 1 entries from
end of log
> 2021-08-12T07:46:29.187Z|03153|raft|INFO|rejected append_reply (not
leader)
> 2021-08-12T07:46:29.187Z|03154|raft|INFO|rejected append_reply (not
leader)
> 2021-08-12T07:46:29.187Z|03155|raft|INFO|rejected append_reply (not
leader)
> 2021-08-12T07:46:29.191Z|03156|raft|INFO|rejected append_reply (not
leader)
> 2021-08-12T07:46:29.221Z|03157|jsonrpc|WARN|Dropped 4 log messages in
last 14866 seconds (most recently, 14865 seconds ago) due to excessive rate
> 2021-08-12T07:46:29.221Z|03158|jsonrpc|WARN|tcp:client-1:42402: receive
error: Connection reset by peer
> 2021-08-12T07:46:29.222Z|03159|reconnect|WARN|tcp:client-1:42402:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:29.222Z|03160|jsonrpc|WARN|tcp:client-2:45746: receive
error: Connection reset by peer
> 2021-08-12T07:46:29.222Z|03161|reconnect|WARN|tcp:client-2:45746:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:29.225Z|03162|jsonrpc|WARN|tcp:client-3:54218: receive
error: Connection reset by peer
> 2021-08-12T07:46:29.225Z|03163|reconnect|WARN|tcp:client-3:54218:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:29.232Z|03164|jsonrpc|WARN|tcp:client-4:48064: receive
error: Connection reset by peer
> 2021-08-12T07:46:29.232Z|03165|reconnect|WARN|tcp:client-4:48064:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:29.232Z|03166|jsonrpc|WARN|tcp:client-5:49022: receive
error: Connection reset by peer
> 2021-08-12T07:46:29.232Z|03167|reconnect|WARN|tcp:client-5:49022:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:29.566Z|03168|reconnect|WARN|tcp:node-3:40902:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:30.047Z|03169|poll_loop|INFO|Dropped 64 log messages in
last 14879 seconds (most recently, 14876 seconds ago) due to excessive rate
> 2021-08-12T07:46:30.047Z|03170|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 49 (node-1:6641<->ip7:48658) at lib/stream-fd.c:153 (72% CPU usage)
> 2021-08-12T07:46:30.195Z|03171|poll_loop|INFO|wakeup due to [POLLIN] on
fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
> 2021-08-12T07:46:30.351Z|03172|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 43 (node-1:6641<->ip8:48656) at 

Re: [ovs-discuss] OVN /OVS openvswitch: ovs-system: deferred action limit reached, drop recirc action

2021-08-04 Thread Han Zhou
On Wed, Aug 4, 2021 at 6:41 AM Numan Siddique  wrote:
>
> On Wed, Aug 4, 2021 at 4:17 AM Krzysztof Klimonda
>  wrote:
> >
> > Hi Ammad,
> >
> > (Re-adding ovs-discuss@openvswitch.org to CC to keep track of the
discussion)
> >
> > Thanks for testing it with SNAT enabled/disabled and verifying that it
seems to be related.
> >
> > As for the impact of this bug I have to say I'm unsure. I have
theorized that this could the cause for (or at least connected to) BFD
sessions being dropped between gateway chassises, but I couldn't really
validate it.
> >
> > My linked patch is pretty old and no longer applies cleanly on master,
but I'd be interested in getting some feedback from developers on whether
I'm even fixing the right thing.
>
> Hi Krzysztof,
>
> Your patch is in the "change requested" stage.  I see from the comment
> that the ddlog part of the code is missing.
>
> Seems like a valid case to me.  The issue is seen when the packet is
> destined to the router port IP right ?
>
> In the case of ovn-kubernetes, the router port IP is also used as a
> load balancer backend IP.
>
> Will your patch have any impact if the logical router has this load
> balancer configured ? (for the system test case you've added )
>
> ovn-nbctl lb-add lb1 172.16.1.254:90 192.168.1.100:90
> ovn-nbctl lr-lb-add R1 lb1
>
> Can you please repost the patch for further review.  It would be great
> if you can add ddlog code.  Or you can repost the patch
> and the ddlog part can be added if the reviewers are fine with the patch.
>
> Thanks
> Numan
>

Thanks Krzysztof, this is interesting. Could you share more on the root
cause, since you debugged it - how did the loop happen? When a packet
destined to the SNAT IP hits the router ingress pipeline, what's the next
hop? How is the L2 dst populated for the dst IP, and how is the packet
forwarded back to the router pipeline? How did a /32 IP (instead of a
subnet) in the SNAT config make a difference?

> >
> > Regards,
> > Krzysztof
> >
> > On Wed, Aug 4, 2021, at 09:02, Ammad Syed wrote:
> > > I am able to reproduce this issue with snat enabled network and
> > > accessing the snat IP from external network can reproduce this issue .
> > > If I keep snat disable, then I didn't see these logs in syslog.
> > >
> > > Ammad
> > >
> > > On Tue, Aug 3, 2021 at 6:39 PM Ammad Syed 
wrote:
> > > > Thanks. Let me try to reproduce it with this way.
> > > >
> > > > Can you please advise if this will cause any trouble if we have
this bug in production? Any workaround to avoid this issue?
> > > >
> > > > Ammad
> > > >
> > > > On Tue, Aug 3, 2021 at 5:56 PM Krzysztof Klimonda <
kklimo...@syntaxhighlighted.com> wrote:
> > > >> Hi,
> > > >>
> > > >> To reproduce it (on openstack. although the issue does not seem to
be openstack-specific) I've created a network with SNAT enabled (which is
default) and set its external gateway to my external network. Next, I've
tried establishing TCP session from the outside to IP address assigned to
the router and checked dmesg on the chassis that the port is assigned to
for "ovs-system: deferred action limit reached, drop recirc action"
messages.
> > > >>
> > > >> Best Regards,
> > > >> Krzysztof
> > > >>
> > > >> On Tue, Aug 3, 2021, at 09:05, Ammad Syed wrote:
> > > >> > Hi Krzysztof,
> > > >> >
> > > >> > Yes I might be stuck in this issue. How can I check if there is
any
> > > >> > loop in lflow-list ?
> > > >> >
> > > >> > Ammad
> > > >> >
> > > >> > On Tue, Aug 3, 2021 at 2:14 AM Krzysztof Klimonda
> > > >> >  wrote:
> > > >> > > Hi,
> > > >> > >
> > > >> > > Not sure if it's related, but I've seen this bug in ovn 20.12
release, where routing loop was related to flows created to handle SNAT,
I've sent an RFC patch few months back but didn't really have time to
follow up on it since then to get some feedback:
https://www.mail-archive.com/ovs-dev@openvswitch.org/msg53195.html
> > > >> > > I was planning on re-testing it with 21.06 release and follow
up on the patch.
> > > >> > >
> > > >> > > On Mon, Aug 2, 2021, at 21:31, Han Zhou wrote:
> > > >> > > >
> > > >> > > >
> > > >> > > > On Mon, Aug 2, 2021 at 5:07 AM Ammad Syed <
syedamma...@gmail.com> wrot

Re: [ovs-discuss] OVN /OVS openvswitch: ovs-system: deferred action limit reached, drop recirc action

2021-08-02 Thread Han Zhou
On Mon, Aug 2, 2021 at 5:07 AM Ammad Syed  wrote:
>
> Hello,
>
> I am using openstack with OVN 20.12 and OVS 2.15.0 on ubuntu 20.04. I am
using geneve tenant network and vlan provider network.
>
> I am continuously getting below messages in my dmesg logs continuously on
compute node 1 only the other two compute nodes have no such messages.
>
> [275612.826698] openvswitch: ovs-system: deferred action limit reached,
drop recirc action
> [275683.750343] openvswitch: ovs-system: deferred action limit reached,
drop recirc action
> [276102.200772] openvswitch: ovs-system: deferred action limit reached,
drop recirc action
> [276161.575494] openvswitch: ovs-system: deferred action limit reached,
drop recirc action
> [276210.262524] openvswitch: ovs-system: deferred action limit reached,
drop recirc action
>
> I have tried by reinstalling (OS everything) compute node 1 but still
having same errors.
>
> Need your advise.
>
> --
> Regards,
>
>
> Syed Ammad Ali
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

Hi Syed,

Could you check if you have routing loops (i.e. a packet being routed back
and forth between logical routers infinitely) in your logical topology?
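A rough way to check is to trace one of the affected packets through br-int
on the chassis that logs the drops and see whether it keeps re-entering the
router datapaths; everything in angle brackets below is a placeholder:

ovs-appctl ofproto/trace br-int 'in_port=<vm-port>,tcp,dl_src=<vm-mac>,dl_dst=<router-mac>,nw_src=<vm-ip>,nw_dst=<snat-ip>,tcp_dst=80'

A loop shows up as a long chain of recirculations in the trace output.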

Thanks,
Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [External] : Re: Almost half OVN unit tests are skipped (ovn.at:xxx) - OVN_FOR_EACH_NORTHD

2021-05-26 Thread Han Zhou
On Wed, May 26, 2021 at 2:31 AM Brendan Doyle 
wrote:
>
>
>
> On 25/05/2021 18:36, Han Zhou wrote:
>
>
>
> >
> > ## - ##
> > ## Test results. ##
> > ## - ##
> >
> > 2 tests were successful.
> > 2 tests were skipped.
> > make[2]: Leaving directory `/root/ovn'
> > make[1]: Leaving directory `/root/ovn'
> >
> > I'm not sure how I'm supposed to interpret that - am I missing
something?
> > Where are the two tests coming from?
>
> Hi Brendan,
>
> The skipped tests are for ovn-northd-ddlog, a different implementation of
ovn-northd from the default C implementation. To enable them, you need to
install ddlog related tools and also configure it "--with-ddlog". For more
details, please search for keyword "ddlog" (case insensitive) in
https://github.com/ovn-org/ovn/blob/master/Documentation/intro/install/general.rst.
If you are not using ddlog, skipping those tests is ok.
>
>
> OK thanks, still curious as to why this reports two tests and not one is
run:
>
> OVN_FOR_EACH_NORTHD([
> AT_SETUP([ovn -- my-test: 1 HVs, 1 LSs, 1 lport/LS])
> AT_KEYWORDS([my-test])
> AT_CLEANUP
> ])
>

There are 4 tests in total: for each of ovn-northd and ovn-northd-ddlog, the
test is run with datapath groups enabled and with them disabled. So in your
case the 2 tests that ran are: ovn-northd with dp-groups=yes and ovn-northd
without it.
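If you just want to see which variants a test expands to without running
them, the autotest driver should be able to list them, e.g.:

make check TESTSUITEFLAGS='--list -k my-test'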

>
> Thanks,
> Han
>
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] Almost half OVN unit tests are skipped (ovn.at:xxx) - OVN_FOR_EACH_NORTHD

2021-05-25 Thread Han Zhou
On Tue, May 25, 2021 at 10:25 AM Brendan Doyle 
wrote:
>
> Folks,
>
> Perhaps I'm missing something, but in a recent pull of the OVN src,
> having boot
> strapped and built the code and executed unit tests as per instructions:
> Documentation/topics/testing.rst
>
> I see that almost half of the unit tests are skipped, seems to be any
> that begin
> with:
> OVN_FOR_EACH_NORTHD([
>
> Not only that If I create an empty test:
> OVN_FOR_EACH_NORTHD([
> AT_SETUP([ovn -- my-test: 1 HVs, 1 LSs, 1 lport/LS, 1 LR])
> AT_KEYWORDS([my-test])
> AT_CLEANUP
> ])
>
> And run " make check TESTSUITEFLAGS='-k my-test'" I'll get a report of:
> OVN end-to-end tests
>
> 623: ovn -- my-test: 1 HVs, 1 LSs, 1 lport/LS, 1 LR -- ovn-northd --
> dp-groups=yes ok
> 624: ovn -- my-test: 1 HVs, 1 LSs, 1 lport/LS, 1 LR -- ovn-northd ok
> 625: ovn -- my-test: 1 HVs, 1 LSs, 1 lport/LS, 1 LR -- ovn-northd-ddlog
> -- dp-groups=yes skipped (ovn.at:26531)
> 626: ovn -- my-test: 1 HVs, 1 LSs, 1 lport/LS, 1 LR -- ovn-northd-ddlog
> skipped (ovn.at:26531)
>
> ## - ##
> ## Test results. ##
> ## - ##
>
> 2 tests were successful.
> 2 tests were skipped.
> make[2]: Leaving directory `/root/ovn'
> make[1]: Leaving directory `/root/ovn'
>
> I'm not sure how I'm supposed to interpret that - am I missing something?
> Where are the two tests coming from?

Hi Brendan,

The skipped tests are for ovn-northd-ddlog, a different implementation of
ovn-northd from the default C implementation. To enable them, you need to
install ddlog related tools and also configure it "--with-ddlog". For more
details, please search for keyword "ddlog" (case insensitive) in
https://github.com/ovn-org/ovn/blob/master/Documentation/intro/install/general.rst.
If you are not using ddlog, skipping those tests is ok.
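Roughly, enabling them looks like this (paths and flags are illustrative;
the install guide above is the authoritative reference):

# assuming the ddlog toolchain is already installed
export DDLOG_HOME=/path/to/ddlog
./configure --with-ddlog=$DDLOG_HOME/lib
make && make check TESTSUITEFLAGS='-k my-test'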

Thanks,
Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] Question on distributing snat traffic with OVN

2021-05-04 Thread Han Zhou
On Tue, May 4, 2021 at 11:40 AM Francois  wrote:
>
> On Tue, 4 May 2021 at 19:24, Han Zhou  wrote:
> >
> >
> > This is interesting. One question here. Not sure if I understand the
proposal correctly, but if you want to use the chassis's IP for snat, then
how would the return traffic hit the br-int? The return traffic (to the VM,
with destination IP being the chassis IP and destination mac address being
the chassis MAC) would go directly to the host interface (e.g. br0) without
going to the virtual network pipelines, right?
>
> I saw some user documentation.. where the ovn-encap-ip was handled by
> a br-tun bridge, and was hoping there would be hope...
>
> >
> > The original problem was to avoid using floating IPs per VM, but if it
is ok to have an extra IP per chassis, then I think current OVN
implementation already supports it by creating a per chassis Gateway Router
and configuring SNAT using the Gateway Router's uplink IP. Each chassis is
now both a HV and a gateway, and you will need one extra IP per chassis as
the Gateway Router IP. I think this is similar to the ovn-kubernetes
topology. Of course there may be other drawbacks such as you may need a
logical switch per chassis so that you can route the traffic to the
chassis's own Gateway Router. Not sure if it is something that could help
in your use cases.
> >
>
> What I wish to have is  VMs without floating IP, from different
> tenants, having their traffic transparently (as in, no fIP or any
> particular thing to add from a user perspective),  sent out from the
> chassis directly.
>
> Just to rephrase the suggestion, I can create a logical switch per
> chassis, then give to all the VMs on the chassis a new interface that
> connects them to the second network, and the routing rules on the VM
> must be configured so that the traffic to external is sent through
> that interface.
>
> If I understand correctly, I think it describes well my usecase (SNAT
> is well distributed), but you need some work from the tenant point of
> view ("if you want efficient snat you need to route traffic through
> that second interface") even if we patch OpenStack to magically add
> new ports to VMs.
>
> I am going to check ovn-kubernetes doc then.

In ovn-kubernetes it is done through a single interface per workload. Each
workload is connected to the chassis-level LS, which has a default route to
a join router, but the join router can make decisions based on source-route
policies to redirect external traffic back to the chassis-level gateway
router. Physically this routing decision is made locally by OVS on the
chassis. Of course the logical topology and routing are a little complex,
while with dual interfaces per VM you could make it more flexible (but that
may introduce some other operational challenges).
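As a rough illustration of that kind of source-based redirect (the router
name, subnet and addresses below are made up, not the actual ovn-kubernetes
configuration):

# on the join router: send one node's pod subnet to that node's gateway
# router, leaving everything else on the default route
ovn-nbctl lr-policy-add join-router 100 "ip4.src == 10.244.2.0/24" reroute 100.64.0.3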

>
> >
> > There is a plan to remove the C version once the DDlog is stable
enough. The timeline is not clear (at least to me).
> > There is also a plan to rewrite ovn-controller in ddlog, but it is more
complex than northd and there are different options moving forward, and the
timeline is even less clear.
> >
> > Thanks,
> > Han
>
> Thanks!
> Francois
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] Question on distributing snat traffic with OVN

2021-05-04 Thread Han Zhou
On Tue, May 4, 2021 at 9:25 AM Francois  wrote:
>
> On Tue, 4 May 2021 at 17:03, Numan Siddique  wrote:
> >
> > On Sat, May 1, 2021 at 6:32 AM Francois 
wrote:
> > >
> > > Hi Open vSwitch
> > > I am running an OVN stack with a dozen chassis, all of them able to
> > > act as gateways.
> > > I have many VMs without floating IPs on the same logical switch, doing
> > > a lot of external traffic. Today, this traffic has to go through the
> > > tunnel towards the unique chassis claiming the gateway to perform the
> > > snat natting and send the traffic outside the stack.
> > >
> > > With this current design, I see a lot of BFD traffic, and a clear
> > > bottleneck and spof with that single chassis doing the snat. A
> > > workaround is to add floating IPs on each VM, but this means the end
> > > user has to put the floating IP themself, it also means if a single
> > > chassis runs 10 VMs, we need one floating IP per VM just for the snat,
> > > while we could instead use a single IP per chassis for that.
> > >
> > > I was thinking of adding a "br-snat" bridge on each ovs, adding to it
> > > one interface with a fixed IP, and (with some minimal development in
> > > ovn northd) have the snat traffic of all its ports going out of that
> > > interface instead of going through the tunnel towards the gateway.
> > > Ideally the IP used today for the tunnel could be used too for the
> > > snat traffic, but this seems less trivial to achieve.
> > >
> > > Before looking at the details of ddlog and the syntax of flows, I
> > > would love to get some feedback on the idea, maybe there is something
> > > fundamentally broken with my design, or maybe there is a smarter way
> > > to achieve this?
> >
> > This is an interesting idea.  In order to do snat, OVN should know what
IP
> > to use.  But this IP should belong to the provider network subnet pool
right ?
> >
> > If you think this can be done, you can probably attempt a quick PoC
with just
> > changes to the C version and post the patches as RFC. The ddlog part
> > can be done later if the approach seems to be fine for the reviewers.
>
>
> Ok! I have no idea if this can be done, but I will attempt something
> nevertheless. You need one IP per chassis so if it is set on a different
> interface, and statically (similar to the external-ids:ovn-encap-ip)
> it should be fine too.
>

This is interesting. One question here. Not sure if I understand the
proposal correctly, but if you want to use the chassis's IP for snat, then
how would the return traffic hit the br-int? The return traffic (to the VM,
with destination IP being the chassis IP and destination mac address being
the chassis MAC) would go directly to the host interface (e.g. br0) without
going to the virtual network pipelines, right?

The original problem was to avoid using floating IPs per VM, but if it is
ok to have an extra IP per chassis, then I think current OVN implementation
already supports it by creating a per chassis Gateway Router and
configuring SNAT using the Gateway Router's uplink IP. Each chassis is now
both a HV and a gateway, and you will need one extra IP per chassis as the
Gateway Router IP. I think this is similar to the ovn-kubernetes topology.
Of course there may be other drawbacks such as you may need a logical
switch per chassis so that you can route the traffic to the chassis's own
Gateway Router. Not sure if it is something that could help in your use
cases.
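A minimal sketch of such a per-chassis Gateway Router (names, MAC and IPs
are illustrative, and the plumbing to a join/external switch is omitted):

ovn-nbctl lr-add GR-chassis1
ovn-nbctl set logical_router GR-chassis1 options:chassis=chassis1
ovn-nbctl lrp-add GR-chassis1 GR-chassis1-uplink 00:00:00:00:01:01 172.16.0.11/24
# SNAT the local VMs' subnet to the one extra IP owned by this chassis
ovn-nbctl lr-nat-add GR-chassis1 snat 172.16.0.11 192.168.100.0/24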

> What will happen is that traffic going out will be seen from the outside,
> as an IP of the chassis (and the compute running a VM is chosen at random
> usually).  If there are firewall rules to open (on a firewall seating on
> the external network), they will need to be opened for all hypervisors,
for
> all VMs (so firewall rules become less relevant in a sense).  It basically
> "works" from a security standpoint, when all VMs belong to the same
tenant.
> Still since we are going to open whole ranges in the firewall, the IPs
> should be limited to the IPs used for the SNAT, and not include any fIP,
so it
> should probably be a different subnet.
>
> I think (I am still reading the doc!) a somewhat similar work was done
when
> addressing the MTU issue with the redirect-type=bridged, where packets
> are sent through a different port, using statically set mac-mappings.
>
> ddlog looked funnier! Do you have a plan for the removal of the C
> version? Also is there still a plan to have ovn-controller rewritten in
> ddlog?

There is a plan to remove the C version once the DDlog is stable enough.
The timeline is not clear (at least to me).
There is also a plan to rewrite ovn-controller in ddlog, but it is more
complex than northd and there are different options moving forward, and the
timeline is even less clear.

Thanks,
Han

>
> Thanks a lot!
> Francois
>
> >
> > Thanks
> > Numan
> >
> > >
> > > Thanks
> > > Francois
> > > ___
> > > discuss mailing list
> > > disc...@openvswitch.org
> > > 

Re: [ovs-discuss] ovn cpu high load problem

2020-12-04 Thread Han Zhou
On Thu, Dec 3, 2020 at 11:47 AM jre l  wrote:
>
> Dear everyone:
>
> Please help me。
>
> My ovn network running into trouble,this is a 250 client openstack+ovn
network,every time I make any change on network ,such as creating a
instance,the ovs-dbserver cpu went to 100%,and traffic grows to 1Gb/s, did
I meet a BUG?
>
>
>
> Here is client INFO:
>
> [root@client deploy]# rpm -qa|grep ovn
>
> openstack-neutron-ovn-metadata-agent-17.0.0-1.el8.noarch
>
> ovn2.13-20.09.0-2.el8.x86_64
>
> ovn2.13-host-20.09.0-2.el8.x86_64
>
>
>
> on client and server side CONFIG:
>
> ovs-vsctl set open . external-ids:ovn-encap-ip=10.0.1.10
>
> ovs-vsctl set open . external-ids:ovn-remote=tcp:10.0.1.103:6642,tcp:
10.0.1.104:6642,tcp:10.0.1.105:6642
>
> ovs-vsctl set open . external-ids:ovn-encap-type=geneve
>
> ovs-vsctl set open . external-ids:ovn-remote-probe-interval="6"
>
> ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
>
>
>
> nb show error:
>
> [root@ controller tools]# netstat -pln|grep 6641
>
> tcp0  0 172.16.16.103:6641  0.0.0.0:*
LISTEN  1112883/ovsdb-serve
>
>
>
> [root@controller tools]# ovn-nbctl --db=tcp:172.16.16.103:6641 show
>
> ovn-nbctl: tcp:172.16.16.103:6641: database connection failed ()
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

It shouldn't trigger that much traffic for just one VM creation. Are you
sure the 1G traffic is triggered during just one VM creation?
Usually if you are bootstrapping the whole environment, i.e. when all HVs
start to connect to the SB at the same time, you would see the SB
overloaded. After the system calms down, it shouldn't generate that much
load.

How many logical routers and logical switches do you have? If there is a
big number of LRs/LSes and they are all connected together (e.g. through an
external provider logical switch), it is possible to trigger a big payload
for a monitor condition change when a VM is created, but it won't be 1G.
However, in such a scenario it is possible to create unnecessary load on the
SB DB, and setting external-ids:ovn-monitor-all=true on all the HVs may
help.
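For reference, that is a per-hypervisor setting in the same place as the
other external-ids shown above:

ovs-vsctl set open . external-ids:ovn-monitor-all=true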

Thanks,
Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [ovn] does not match prerequisite

2020-10-20 Thread Han Zhou
On Tue, Oct 20, 2020 at 11:55 AM Tony Liu  wrote:
>
> Hi,
>
> From ovnsb log, I see many of the following messages.
> What does it mean? Is that a concern?
>
> 2020-10-20T18:52:50.483Z|00093|raft|INFO|current entry eid
2ab3eff8-87e1-4e19-9a1f-d359ad56a9ad does not match prerequisite
c6ffd854-6f6e-4533-a6d8-b297acb542e0 in execute_command_request
>
>
> Thanks!
> Tony

This is expected in cluster mode when there are concurrent transactions
initiated from different cluster nodes. It is recommended that write-heavy
clients connect to the leader. Occasional conflicts are OK; the transaction
will be resubmitted.
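To see which server is currently the leader, something like the following
should work (the control socket path depends on your packaging):

ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound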

Thanks,
Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


[ovs-discuss] ACL tcp reject action problem when stateful ACL exists

2020-09-27 Thread Han Zhou
In test case acl-reject, there are no stateful ACLs and the test case works
well. However, adding a stateful ACL even with a low priority (which
shouldn't change the expected behavior of the test case) resulted in the
test case failing. Below is the change for the test case.

- 8><  ><8 -
diff --git a/tests/ovn.at b/tests/ovn.at
index b6c8622ba..85601c0f5 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -12885,6 +12885,7 @@ done
 ovn-nbctl --log acl-add sw0 to-lport 1000 "outport == \"sw0-p12\"" reject
 ovn-nbctl --log acl-add sw0 from-lport 1000 "inport == \"sw0-p11\"" reject
 ovn-nbctl --log acl-add sw0 from-lport 1000 "inport == \"sw0-p21\"" reject
+ovn-nbctl --log acl-add sw0 from-lport 100 "inport == \"sw0-p21\"" allow-related

 # Allow some time for ovn-northd and ovn-controller to catch up.
 ovn-nbctl --timeout=3 --wait=hv sync
- 8><  ><8 -

I haven't checked the root cause yet, but it seems to be a bug that has
existed for a long time - it fails even on branch 20.03. I haven't tried
older branches yet.

Thanks,
Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [OVN]: IP address representing external destinations

2020-09-23 Thread Han Zhou
On Fri, Sep 18, 2020 at 12:58 AM Alexander Constantinescu <
acons...@redhat.com> wrote:

> Hi Han
>
> Sorry for the late reply.
>
> Is this the current situation?
>>
>
> Yes, it is.
>
> When you say there are too many default routes, what do you mean in the
>> above example? How would the SOUTH_TO_NORTH_IP solve the problem?
>>
>
> Each  corresponds to a node in our cluster, like this:
>
>   ip4.src ==  && ip4.dst ==
> , allow
>   ip6.src ==  && ip6.dst ==
> , allow
>   ip4.src ==  && ip4.dst ==
> , allow
> ...
>   ip6.src ==  && ip6.dst ==
> , allow
>
> so on large clusters (say 1000 nodes) with IPv6 and IPv4 enabled we can
> reach ~2000 logical router policies. By having the SOUTH_TO_NORTH_IP we can
> completely remove all of them and have the "default route" logical router
> policy specify:
>
> default route (lowest priority): ip4.src == 
> ip4.dst == SOUTH_TO_NORTH_IP, nexthop = 
>
> In addition, if SOUTH_TO_NORTH_IP is a user defined IP,
>>
>
> I didn't think it should be user defined, more so "system defined", like
> 0.0.0.0/0
>
>  I am not sure how would it work,  because ip4.dst is the dst IP from
>> packet header
>>
>
> I didn't intend for such an IP to be used solely as a destination IP, but
> as source too, if the user requires it.
>
> Comparing it with SOUTH_TO_NORTH_IP would just result in mismatch, unless
>> all south-to-north traffic really has this IP as destination (I guess
>> that's not the case).
>>
>
> Sure, I just wanted to assess the feasibility of such an IP from OVN's
> point of view. Obviously the real destination IP would be different, but I
> (without knowing the underlying works of OVN) thought there might be a
> programmable way of saying: "this IP is unknown to my network topology, so
> I could use identifier/alias grouping all such IPs under an umbrella
> identifier such as X.X.X.X/X"
>
>
I see. You don't really want the packet's dst IP to be replaced, so what
you need is some value in metadata, and then you could define a policy to
match that metadata and specify the nexthop. This is supported by the
policy route table.

However, what really matters here is how to define "unknown IPs". We can't
say an IP is unknown just because there is no port with that IP assigned.
There can still be east-west traffic for such an IP, e.g. for nested
workloads in a VM, etc. From OVN's point of view this is completely user
defined, and the best way for the user to tell OVN this information is
through the route policies (which means you don't really need that extra
metadata to achieve this). I think it is the way ovn-kubernetes uses OVN
that makes this many routes necessary. If the IP allocation can be managed
so that one big range of IPs is east-west and everything else is
north-south, then you would probably end up with a much smaller number of
policies.
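As a rough sketch of that with route policies (the CIDRs, priorities and
router name are made up for illustration):

# keep the aggregated east-west range on the normal path ...
ovn-nbctl lr-policy-add cluster-router 200 "ip4.dst == 10.244.0.0/16" allow
# ... and reroute everything else from the pod range via the egress node
ovn-nbctl lr-policy-add cluster-router 100 "ip4.src == 10.244.0.0/16" reroute 100.64.0.3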


>
> On Wed, Sep 16, 2020 at 11:09 PM Han Zhou  wrote:
>
>>
>>
>> On Wed, Sep 16, 2020 at 10:07 AM Alexander Constantinescu <
>> acons...@redhat.com> wrote:
>>
>>> In this example it is equivalent to just "ip4.src == 10.244.2.5/32"'.
>>>>
>>>
>>> Yes, I was just using it as an example (though, granted, noop example)
>>>
>>> Some background to help steer the discussion:
>>>
>>> Essentially the functionality here is to have south -> north traffic
>>> from certain logical switch ports exit the cluster through a dedicated node
>>> (an egress node if you will). To do this we currently have a set of default
>>> logical router policies, intended to leave east <-> west traffic untouched,
>>> and then logical router policies with a lower priority, which specify
>>> reroute actions for this functionality to happen. However, on large
>>> clusters, there's this concern that the default logical router policies
>>> will become too many. Hence why the idea here would be to drop them
>>> completely and have this "special IP" that we can use to filter on the
>>> destination, south -> north, traffic .
>>>
>>> If you have a default route, anything "unknown" would just hit the
>>>> default route, right? Why would you need another IP for this purpose?
>>>>
>>>
>>> As to remove the default logical router policies, which can become a
>>> lot, on big clusters - as described above. With only reroute policies of
>>> type: "ip4.src == 10.244.2.5/32 && ip4.dst == SOUTH_TO_NORTH_IP" things
>>> would become lighter.
>>>
>>
>> Thanks for the 

Re: [ovs-discuss] [OVN]: IP address representing external destinations

2020-09-16 Thread Han Zhou
On Wed, Sep 16, 2020 at 10:07 AM Alexander Constantinescu <
acons...@redhat.com> wrote:

> In this example it is equivalent to just "ip4.src == 10.244.2.5/32"'.
>>
>
> Yes, I was just using it as an example (though, granted, noop example)
>
> Some background to help steer the discussion:
>
> Essentially the functionality here is to have south -> north traffic from
> certain logical switch ports exit the cluster through a dedicated node (an
> egress node if you will). To do this we currently have a set of default
> logical router policies, intended to leave east <-> west traffic untouched,
> and then logical router policies with a lower priority, which specify
> reroute actions for this functionality to happen. However, on large
> clusters, there's this concern that the default logical router policies
> will become too many. Hence why the idea here would be to drop them
> completely and have this "special IP" that we can use to filter on the
> destination, south -> north, traffic .
>
> If you have a default route, anything "unknown" would just hit the default
>> route, right? Why would you need another IP for this purpose?
>>
>
> As to remove the default logical router policies, which can become a
> lot, on big clusters - as described above. With only reroute policies of
> type: "ip4.src == 10.244.2.5/32 && ip4.dst == SOUTH_TO_NORTH_IP" things
> would become lighter.
>

Thanks for the background. So you have:


...
default route (lowest priority): ip4.src == ,
nexthop = 
default route (lowest priority): ip4.src == ,
nexthop = 

Is this the current situation?
When you say there are too many default routes, what do you mean in the
above example? How would the SOUTH_TO_NORTH_IP solve the problem?

In addition, if SOUTH_TO_NORTH_IP is a user defined IP, I am not sure how
would it work, because ip4.dst is the dst IP from packet header. Comparing
it with SOUTH_TO_NORTH_IP would just result in mismatch, unless all
south-to-north traffic really has this IP as destination (I guess that's
not the case).


>  In policies/ACL you will need to make sure the priorities are set
>> properly to achieve the default-route behavior.
>>
>
> Yes, so this is currently done, as described above.
>
> On Wed, Sep 16, 2020 at 6:35 PM Han Zhou  wrote:
>
>>
>>
>> On Wed, Sep 16, 2020 at 5:42 AM Alexander Constantinescu <
>> acons...@redhat.com> wrote:
>> >
>> > Hi
>> >
>> > I was wondering if anybody is aware of an IP address signifying
>> "external IP destinations"?
>> >
>> > Currently in OVN we can use the IP address 0.0.0.0/0 for match
>> expressions in logical routing policies / ACLs when we want to specify a
>> source or destination IP equating to the pseudo term: "all IP
>> addresses",ex: 'match="ip4.src == 10.244.2.5/32 && ip4.dst ==0.0.0.0/0"'
>> >
>> In this example it is equivalent to just "ip4.src == 10.244.2.5/32"'.
>>
>> > Essentially what I would need to do for an OVN-Kubernetes feature is
>> specify such a match condition for south -> north traffic, i.e when the
>> destination IP address is external to the cluster, and most likely
>> "unknown" to OVN. Thus, when OVN does not know how to route it within the
>> OVN network topology and has no choice except sending it out the default
>> route.
>> >
>> > Do we have such an IP address in OVN/OVS? Would it be feasible to
>> introduce, in case there is none?
>> >
>> We don't have such a special IP except 0.0.0.0/0. If you have a default
>> route, anything "unknown" would just hit the default route, right? Why
>> would you need another IP for this purpose? In logical_router_static_route
>> the priority is based on prefix length. In policies/ACL you will need to
>> make sure the priorities are set properly to achieve the default-route
>> behavior.
>>
>> Thanks,
>> Han
>>
>> > Thanks in advance!
>> >
>> > --
>> >
>> > Best regards,
>> >
>> >
>> > Alexander Constantinescu
>> >
>> > Software Engineer, Openshift SDN
>> >
>> > Red Hat
>> >
>> > acons...@redhat.com
>> >
>> > ___
>> > discuss mailing list
>> > disc...@openvswitch.org
>> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>>
>
>
> --
>
> Best regards,
>
>
> Alexander Constantinescu
>
> Software Engineer, Openshift SDN
>
> Red Hat <https://www.redhat.com/>
>
> acons...@redhat.com
> <https://www.redhat.com/>
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [OVN]: IP address representing external destinations

2020-09-16 Thread Han Zhou
On Wed, Sep 16, 2020 at 5:42 AM Alexander Constantinescu <
acons...@redhat.com> wrote:
>
> Hi
>
> I was wondering if anybody is aware of an IP address signifying "external
IP destinations"?
>
> Currently in OVN we can use the IP address 0.0.0.0/0 for match
expressions in logical routing policies / ACLs when we want to specify a
source or destination IP equating to the pseudo term: "all IP
addresses",ex: 'match="ip4.src == 10.244.2.5/32 && ip4.dst ==0.0.0.0/0"'
>
In this example it is equivalent to just "ip4.src == 10.244.2.5/32"'.

> Essentially what I would need to do for an OVN-Kubernetes feature is
specify such a match condition for south -> north traffic, i.e when the
destination IP address is external to the cluster, and most likely
"unknown" to OVN. Thus, when OVN does not know how to route it within the
OVN network topology and has no choice except sending it out the default
route.
>
> Do we have such an IP address in OVN/OVS? Would it be feasible to
introduce, in case there is none?
>
We don't have such a special IP except 0.0.0.0/0. If you have a default
route, anything "unknown" would just hit the default route, right? Why
would you need another IP for this purpose? In logical_router_static_route
the priority is based on prefix length. In policies/ACL you will need to
make sure the priorities are set properly to achieve the default-route
behavior.
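For reference, the default route itself is just the shortest-prefix static
route (the router name and nexthop below are placeholders):

ovn-nbctl lr-route-add lr0 0.0.0.0/0 172.16.0.1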

Thanks,
Han

> Thanks in advance!
>
> --
>
> Best regards,
>
>
> Alexander Constantinescu
>
> Software Engineer, Openshift SDN
>
> Red Hat
>
> acons...@redhat.com
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] How to restart raft cluster after a complete shutdown?

2020-08-25 Thread Han Zhou
On Tue, Aug 25, 2020 at 7:08 AM Matthew Booth  wrote:
>
> I'm deploying ovsdb-server (and only ovsdb-server) in K8S as a
StatefulSet:
>
>
https://github.com/openstack-k8s-operators/dev-tools/blob/master/ansible/files/ocp/ovn/ovsdb.yaml
>
> I'm going to replace this with an operator in due course, which may
> make the following simpler. I'm not necessarily constrained to only
> things which are easy to do in a StatefulSet.
>
> I've noticed an issue when I kill all 3 pods simultaneously: it is no
> longer possible to start the cluster. The issue is presumably one of
> quorum: when a node comes up it can't contact any other node to make
> quorum, and therefore can't come up. All nodes are similarly affected,
> so the cluster stays down. Ignoring kubernetes, how is this situation
> intended to be handled? Do I have to it to a single-node deployment,
> convert that to a new cluster and re-bootstrap it? This wouldn't be
> ideal. Is there any way, for example, I can bring up the first node
> while asserting to that node that the other 2 are definitely down?
>
In general you should be able to restart the whole cluster without
re-bootstrapping it. The cluster should get back to work as long as 2 of the
3 nodes are back online.
In your case, I am not sure whether you are using the k8s pods' IPs as the
server addresses. If so, the pods' IPs probably changed after the restart,
which would mean the server addresses stored in the raft log can never be
reached again. Is that the problem?
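One way to check which server addresses the cluster has recorded (the paths
below are illustrative and depend on how the image lays out the files):

# from a running member
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
# or offline, from the clustered database file
ovsdb-tool db-local-address /etc/ovn/ovnnb_db.db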
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


[ovs-discuss] Inquiry for DDlog status for ovn-northd

2020-08-24 Thread Han Zhou
Hi Ben and Leonid,

As I remember, you were working on the new ovn-northd that utilizes DDlog
for incremental processing. Could you share the current status?

Now that some more improvements have been made in ovn-controller and OVSDB,
ovn-northd has become the more obvious bottleneck for OVN use in large-scale
environments. Since you were not in the OVN meetings for the last couple of
weeks, could you share here the status and the plan moving forward?

Thanks,
Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] the raft_is_connected state of a raft server stays as false and cannot recover

2020-08-16 Thread Han Zhou
On Thu, Aug 13, 2020 at 5:26 PM Yun Zhou  wrote:

> Hi,
>
> Need expert's view to address a problem we are seeing now and then:  A
> ovsdb-server node in a 3-nodes raft cluster keeps printing out the
> "raft_is_connected: false" message, and its "connected" state in its
> _Server DB stays as false.
>
> According to the ovsdb-server(5) manpage, it means this server is not
> contacting with a majority of its cluster.
>
> Except its "connected" state, from what we can see, this server is in the
> follower state and works fine, and connection between it and the other two
> servers appear healthy as well.
>
> Below is its raft structure snapshot at the time of the problem. Note that
> its candidate_retrying field stays as true.
>
> Hopefully the provide information can help to figure out what goes wrong
> here. Unfortunately we don't have a solid case to reproduce it:
>

Thanks for reporting the issue. This looks really strange. In the state
below, leader_sid is non-zero, but candidate_retrying is true.
According to the latest code, whenever leader_sid is set to non-zero (in
raft_set_leader()), candidate_retrying is set to false; whenever
candidate_retrying is set to true (in raft_start_election()), leader_sid is
set to UUID_ZERO. And the data structure is initialized with xzalloc, making
sure candidate_retrying is false in the beginning. So, sorry, I can't
explain how it ends up in this contradictory state. It would be helpful if
there were a way to reproduce it. How often does it happen?
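For what it's worth, the "connected"/leader state mentioned above can also
be polled from the _Server database while the problem is happening (the
socket path is illustrative):

ovsdb-client dump unix:/var/run/ovn/ovnnb_db.sock _Server Database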

Thanks,
Han


> (gdb) print *(struct raft *)0xa872c0
> $19 = {
>   hmap_node = {
> hash = 2911123117,
> next = 0x0
>   },
>   log = 0xa83690,
>   cid = {
> parts = {2699238234, 2258650653, 3035282424, 813064186}
>   },
>   sid = {
> parts = {1071328836, 400573240, 2626104521, 1746414343}
>   },
>   local_address = 0xa874e0 "tcp:10.8.51.55:6643",
>   local_nickname = 0xa876d0 "3fdb",
>   name = 0xa876b0 "OVN_Northbound",
>   servers = {
> buckets = 0xad4bc0,
> one = 0x0,
> mask = 3,
> n = 3
>   },
>   election_timer = 1000,
>   election_timer_new = 0,
>   term = 3,
>   vote = {
> parts = {1071328836, 400573240, 2626104521, 1746414343}
>   },
>   synced_term = 3,
>   synced_vote = {
> parts = {1071328836, 400573240, 2626104521, 1746414343}
>   },
>   entries = 0xbf0fe0,
>   log_start = 2,
>   log_end = 312,
>   log_synced = 311,
>   allocated_log = 512,
>   snap = {
> term = 1,
> data = 0xaafb10,
> eid = {
>   parts = {1838862864, 1569866528, 2969429118, 3021055395}
> },
> servers = 0xaafa70,
> election_timer = 1000
>   },
>   role = RAFT_FOLLOWER,
>   commit_index = 311,
>   last_applied = 311,
>   leader_sid = {
> parts = {642765114, 43797788, 2533161504, 3088745929}
>   },
>   election_base = 6043283367,
>   election_timeout = 6043284593,
>   joining = false,
>   remote_addresses = {
> map = {
>   buckets = 0xa87410,
>   one = 0xa879c0,
>   mask = 0,
>   n = 1
> }
>   },
>   join_timeout = 6037634820,
>   leaving = false,
>   left = false,
>   leave_timeout = 0,
>   failed = false,
>   waiters = {
> prev = 0xa87448,
> next = 0xa87448
>   },
>   listener = 0xaafad0,
>   listen_backoff = -9223372036854775808,
>   conns = {
> prev = 0xbcd660,
> next = 0xaafc20
>   },
>   add_servers = {
> buckets = 0xa87480,
> one = 0x0,
> mask = 0,
> n = 0
>   },
>   remove_server = 0x0,
>   commands = {
> buckets = 0xa874a8,
> one = 0x0,
> mask = 0,
> n = 0
>   },
>   ping_timeout = 6043283700,
>   n_votes = 1,
>   candidate_retrying = true,
>   had_leader = false,
>   ever_had_leader = true
> }
>
> Thanks
> - Yun
>
> --
> You received this message because you are subscribed to the Google Groups
> "ovn-kubernetes" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to ovn-kubernetes+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/ovn-kubernetes/BY5PR12MB4132F190E4BFE9F381BC5A82B0400%40BY5PR12MB4132.namprd12.prod.outlook.com
> .
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] ovn-k8s scale: how to make new ovn-controller process keep the previous Open Flow in br-int

2020-08-07 Thread Han Zhou
On Fri, Aug 7, 2020 at 1:56 PM Venugopal Iyer  wrote:

> Hi, Han:
>
>
>
> An additional comment;
>
>
>
> *From:* ovn-kuberne...@googlegroups.com  *On
> Behalf Of *Venugopal Iyer
> *Sent:* Friday, August 7, 2020 1:51 PM
> *To:* Han Zhou ; Numan Siddique 
> *Cc:* Winson Wang ; ovs-discuss@openvswitch.org;
> ovn-kuberne...@googlegroups.com; Dumitru Ceara ; Han
> Zhou 
> *Subject:* RE: ovn-k8s scale: how to make new ovn-controller process keep
> the previous Open Flow in br-int
>
>
>
>
>
>
> Hi, Han:
>
>
>
> *From:* ovn-kuberne...@googlegroups.com  *On
> Behalf Of *Han Zhou
> *Sent:* Friday, August 7, 2020 1:04 PM
> *To:* Numan Siddique 
> *Cc:* Venugopal Iyer ; Winson Wang <
> windson.w...@gmail.com>; ovs-discuss@openvswitch.org;
> ovn-kuberne...@googlegroups.com; Dumitru Ceara ; Han
> Zhou 
> *Subject:* Re: ovn-k8s scale: how to make new ovn-controller process keep
> the previous Open Flow in br-int
>
>
>
>
>
>
>
>
>
>
> On Fri, Aug 7, 2020 at 12:35 PM Numan Siddique  wrote:
>
>
>
>
>
> On Sat, Aug 8, 2020 at 12:16 AM Han Zhou  wrote:
>
>
>
>
>
> On Thu, Aug 6, 2020 at 10:22 AM Han Zhou  wrote:
>
>
>
>
>
> On Thu, Aug 6, 2020 at 9:15 AM Numan Siddique  wrote:
>
>
>
>
>
> On Thu, Aug 6, 2020 at 9:25 PM Venugopal Iyer 
> wrote:
>
> Hi, Han:
>
>
>
> A comment inline:
>
>
>
> *From:* ovn-kuberne...@googlegroups.com  *On
> Behalf Of *Han Zhou
> *Sent:* Wednesday, August 5, 2020 3:36 PM
> *To:* Winson Wang 
> *Cc:* ovs-discuss@openvswitch.org; ovn-kuberne...@googlegroups.com;
> Dumitru Ceara ; Han Zhou 
> *Subject:* Re: ovn-k8s scale: how to make new ovn-controller process keep
> the previous Open Flow in br-int
>
>
>
>
>
>
>
>
>
>
> On Wed, Aug 5, 2020 at 12:58 PM Winson Wang 
> wrote:
>
> Hello OVN Experts,
>
>
> With ovn-k8s,  we need to keep the flows always on br-int which needed by
> running pods on the k8s node.
>
> Is there an ongoing project to address this problem?
>
> If not,  I have one proposal not sure if it is doable.
>
> Please share your thoughts.
> The issue:
>
> In large scale ovn-k8s cluster there are 200K+ Open Flows on br-int on
> every K8s node.  When we restart ovn-controller for upgrade using
> `ovs-appctl -t ovn-controller exit --restart`,  the remaining traffic still
> works fine since br-int with flows still be Installed.
>
>
>
> However, when a new ovn-controller starts it will connect OVS IDL and do
> an engine init run,  clearing all OpenFlow flows and install flows based on
> SB DB.
>
> With open flows count above 200K+,  it took more than 15 seconds to get
> all the flows installed br-int bridge again.
>
>
> Proposal solution for the issue:
>
> When the ovn-controller gets “exit --start”,  it will write a
> “ovs-cond-seqno” to OVS IDL and store the value to Open vSwitch table in
> external-ids column. When new ovn-controller starts, it will check if the
> “ovs-cond-seqno” exists in the Open_vSwitch table,  and get the seqno from
> OVS IDL to decide if it will force a recomputing process?
>
>
>
>
>
> Hi Winson,
>
>
>
> Thanks for the proposal. Yes, the connection break during upgrading is a
> real issue in a large scale environment. However, the proposal doesn't
> work. The "ovs-cond-seqno" is for the OVSDB IDL for the local conf DB,
> which is a completely different connection from the ovs-vswitchd open-flow
> connection.
>
> To avoid clearing the open-flow table during ovn-controller startup, we
> can find a way to postpone clearing the OVS flows after the recomputing in
> ovn-controller is completed, right before ovn-controller replacing with the
> new flows.
>
> *[vi> ] *
>
> *[vi> ] Seems like we force recompute today if the OVS IDL is reconnected.
> Would it be possible to defer *
>
> *decision to  recompute the flows based on  the  SB’s nb_cfg we have
>  sync’d with? i.e.  If  our nb_cfg is *
>
> *in sync with the SB’s global nb_cfg, we can skip the recompute?  At least
> if nothing has changed since*
>
> *the restart, we won’t need to do anything.. We could stash nb_cfg in OVS
> (once ovn-controller receives*
>
> *conformation from OVS that the physical flows for an nb_cfg update are in
> place), which should be cleared if *
>
> *OVS itself is restarted.. (I mean currently, nb_cfg is used to check if
> NB, SB and Chassis are in sync, w

Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Han Zhou
On Fri, Aug 7, 2020 at 12:57 PM Tony Liu  wrote:

> Enabled debug logging, there are tons of messages.
> Note there are 4353 datapath bindings and 13078 port bindings in SB.
> 4097 LS, 8470 LSP, 256 LR and 4352 LRP in NB. Every 16 LS connect to
> a router. All routers connect to the external network.
>
> ovn-controller on compute node is good. The ovn-controller on gateway
> node is taking 100% cpu. It's probably related to the ports on the
> external network? Any specific messages I need to check?
>
> Any hint to look into it is appreciated!
>
>
If it happens only on the gateway, it may be related to ARP handling. Could
you share the output of ovn-appctl -t ovn-controller coverage/show, with 2-3
runs at a 5-second interval?
For the debug log, I'd first check whether there is any OVSDB notification
from the SB DB, and if so, what the changes are.
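For reference, the exact commands would be along these lines (jsonrpc is a
reasonable module to start with for seeing the SB notifications; adjust the
log target to your setup):

ovn-appctl -t ovn-controller coverage/show
ovn-appctl -t ovn-controller vlog/set jsonrpc:file:dbg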

>
> Thanks!
>
> Tony
> > -Original Message-
> > From: Han Zhou 
> > Sent: Friday, August 7, 2020 12:39 PM
> > To: Tony Liu 
> > Cc: ovs-discuss ; ovs-dev  > d...@openvswitch.org>
> > Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no
> > changes in sb-db
> >
> >
> >
> > On Fri, Aug 7, 2020 at 12:35 PM Tony Liu  > <mailto:tonyliu0...@hotmail.com> > wrote:
> >
> >
> >   Inline...
> >
> >   Thanks!
> >
> >   Tony
> >   > -Original Message-
> >   > From: Han Zhou mailto:zhou...@gmail.com> >
> >   > Sent: Friday, August 7, 2020 12:29 PM
> >   > To: Tony Liu  > <mailto:tonyliu0...@hotmail.com> >
> >   > Cc: ovs-discuss mailto:ovs-
> > disc...@openvswitch.org> >; ovs-dev  >   > d...@openvswitch.org <mailto:d...@openvswitch.org> >
> >   > Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu
> > while no
> >   > changes in sb-db
> >   >
> >   >
> >   >
> >   > On Fri, Aug 7, 2020 at 12:19 PM Tony Liu <
> tonyliu0...@hotmail.com
> > <mailto:tonyliu0...@hotmail.com>
> >   > <mailto:tonyliu0...@hotmail.com
> > <mailto:tonyliu0...@hotmail.com> > > wrote:
> >   >
> >   >
> >   >   ovn-controller is using UNIX socket connecting to local
> > ovsdb-
> >   > server.
> >   >
> >   > From the log you were showing, you were using tcp:127.0.0.1:6640
> > <http://127.0.0.1:6640>
> >
> >   Sorry, what I meant was, given your advice, I just made the change
> > for
> >   ovn-controller to use UNIX socket.
> >
> >
> >
> > Oh, I see, no worries.
> >
> >
> >   > <http://127.0.0.1:6640>  to connect the local ovsdb.
> >   > >   2020-08-
> > 07T16:38:04.022Z|29253|reconnect|WARN|tcp:127.0.0.1:6640
> > <http://127.0.0.1:6640>
> >   > > <http://127.0.0.1:6640> <http://127.0.0.1:6640> : connection
> > dropped
> >   > > (Broken pipe)
> >   >
> >   >
> >   >   Inactivity probe doesn't seem to be the cause of high cpu
> > usage.
> >   >
> >   >   The wakeup on connection to sb-db is always followed by a
> >   > "unreasonably
> >   >   long" warning. I guess the pollin event loop is stuck for
> > too long,
> >   > like
> >   >   10s as below.
> >   >   
> >   >   2020-08-07T18:46:49.301Z|00296|poll_loop|INFO|wakeup due to
> > [POLLIN]
> >   > on fd 19 (10.6.20.91:60712 <http://10.6.20.91:60712>
> > <http://10.6.20.91:60712> <->10.6.20.86:6642 <http://10.6.20.86:6642>
> >   > <http://10.6.20.86:6642> ) at lib/stream-fd.c:157 (99% CPU
> usage)
> >   >   2020-08-07T18:46:59.460Z|00297|timeval|WARN|Unreasonably
> > long
> >   > 10153ms poll interval (10075ms user, 1ms system)
> >   >   
> >   >
> >   >   Could that stuck loop be the cause of high cpu usage?
> >   >   What is it polling in?
> >   >   Why is it stuck, waiting for message from sb-db?
> >   >   Isn't it supposed to release the cpu while waiting?
> >   >
> >   >
> >   >
> >   > This log means there are messages received from 10.6.20.86:6642
> > <http://10.6.20.86:6642>
> >   > <http://10.6.20.86:6642>  (the SB DB). Is there SB

Re: [ovs-discuss] ovn-k8s scale: how to make new ovn-controller process keep the previous Open Flow in br-int

2020-08-07 Thread Han Zhou
On Fri, Aug 7, 2020 at 12:35 PM Numan Siddique  wrote:

>
>
> On Sat, Aug 8, 2020 at 12:16 AM Han Zhou  wrote:
>
>>
>>
>> On Thu, Aug 6, 2020 at 10:22 AM Han Zhou  wrote:
>>
>>>
>>>
>>> On Thu, Aug 6, 2020 at 9:15 AM Numan Siddique  wrote:
>>>
>>>>
>>>>
>>>> On Thu, Aug 6, 2020 at 9:25 PM Venugopal Iyer 
>>>> wrote:
>>>>
>>>>> Hi, Han:
>>>>>
>>>>>
>>>>>
>>>>> A comment inline:
>>>>>
>>>>>
>>>>>
>>>>> *From:* ovn-kuberne...@googlegroups.com <
>>>>> ovn-kuberne...@googlegroups.com> *On Behalf Of *Han Zhou
>>>>> *Sent:* Wednesday, August 5, 2020 3:36 PM
>>>>> *To:* Winson Wang 
>>>>> *Cc:* ovs-discuss@openvswitch.org; ovn-kuberne...@googlegroups.com;
>>>>> Dumitru Ceara ; Han Zhou 
>>>>> *Subject:* Re: ovn-k8s scale: how to make new ovn-controller process
>>>>> keep the previous Open Flow in br-int
>>>>>
>>>>>
>>>>>
>>>>> *External email: Use caution opening links or attachments*
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 5, 2020 at 12:58 PM Winson Wang 
>>>>> wrote:
>>>>>
>>>>> Hello OVN Experts,
>>>>>
>>>>>
>>>>> With ovn-k8s,  we need to keep the flows always on br-int which needed
>>>>> by running pods on the k8s node.
>>>>>
>>>>> Is there an ongoing project to address this problem?
>>>>>
>>>>> If not,  I have one proposal not sure if it is doable.
>>>>>
>>>>> Please share your thoughts.
>>>>> The issue:
>>>>>
>>>>> In large scale ovn-k8s cluster there are 200K+ Open Flows on br-int on
>>>>> every K8s node.  When we restart ovn-controller for upgrade using
>>>>> `ovs-appctl -t ovn-controller exit --restart`,  the remaining traffic 
>>>>> still
>>>>> works fine since br-int with flows still be Installed.
>>>>>
>>>>>
>>>>>
>>>>> However, when a new ovn-controller starts it will connect OVS IDL and
>>>>> do an engine init run,  clearing all OpenFlow flows and install flows 
>>>>> based
>>>>> on SB DB.
>>>>>
>>>>> With open flows count above 200K+,  it took more than 15 seconds to
>>>>> get all the flows installed br-int bridge again.
>>>>>
>>>>>
>>>>> Proposal solution for the issue:
>>>>>
>>>>> When the ovn-controller gets “exit --start”,  it will write a
>>>>> “ovs-cond-seqno” to OVS IDL and store the value to Open vSwitch table in
>>>>> external-ids column. When new ovn-controller starts, it will check if the
>>>>> “ovs-cond-seqno” exists in the Open_vSwitch table,  and get the seqno from
>>>>> OVS IDL to decide if it will force a recomputing process?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Hi Winson,
>>>>>
>>>>>
>>>>>
>>>>> Thanks for the proposal. Yes, the connection break during upgrading is
>>>>> a real issue in a large scale environment. However, the proposal doesn't
>>>>> work. The "ovs-cond-seqno" is for the OVSDB IDL for the local conf DB,
>>>>> which is a completely different connection from the ovs-vswitchd open-flow
>>>>> connection.
>>>>>
>>>>> To avoid clearing the open-flow table during ovn-controller startup,
>>>>> we can find a way to postpone clearing the OVS flows after the recomputing
>>>>> in ovn-controller is completed, right before ovn-controller replacing with
>>>>> the new flows.
>>>>>
>>>>> *[vi> ] *
>>>>>
>>>>> *[vi> ] Seems like we force recompute today if the OVS IDL is
>>>>> reconnected. Would it be possible to defer *
>>>>>
>>>>> *decision to  recompute the flows based on  the  SB’s nb_cfg we have
>>>>>  sync’d with? i.e.  If  our nb_cfg is *
>>>>>
>>>>> *in sync with the SB’s global nb_cfg, we can

Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Han Zhou
On Fri, Aug 7, 2020 at 12:35 PM Tony Liu  wrote:

> Inline...
>
> Thanks!
>
> Tony
> > -----Original Message-----
> > From: Han Zhou 
> > Sent: Friday, August 7, 2020 12:29 PM
> > To: Tony Liu 
> > Cc: ovs-discuss ; ovs-dev 
> > Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no
> > changes in sb-db
> >
> >
> >
> > On Fri, Aug 7, 2020 at 12:19 PM Tony Liu  wrote:
> >
> >
> >   ovn-controller is using UNIX socket connecting to local ovsdb-
> > server.
> >
> > From the log you were showing, you were using tcp:127.0.0.1:6640
>
> Sorry, what I meant was, given your advice, I just made the change for
> ovn-controller to use UNIX socket.
>
> Oh, I see, no worries.

> to connect the local ovsdb.
> > >   2020-08-07T16:38:04.022Z|29253|reconnect|WARN|tcp:127.0.0.1:6640:
> > > connection dropped (Broken pipe)
> >
> >
> >   Inactivity probe doesn't seem to be the cause of high cpu usage.
> >
> >   The wakeup on connection to sb-db is always followed by a
> > "unreasonably
> >   long" warning. I guess the pollin event loop is stuck for too long,
> > like
> >   10s as below.
> >   
> >   2020-08-07T18:46:49.301Z|00296|poll_loop|INFO|wakeup due to [POLLIN]
> > on fd 19 (10.6.20.91:60712<->10.6.20.86:6642) at lib/stream-fd.c:157 (99% CPU usage)
> >   2020-08-07T18:46:59.460Z|00297|timeval|WARN|Unreasonably long
> > 10153ms poll interval (10075ms user, 1ms system)
> >   
> >
> >   Could that stuck loop be the cause of high cpu usage?
> >   What is it polling in?
> >   Why is it stuck, waiting for message from sb-db?
> >   Isn't it supposed to release the cpu while waiting?
> >
> >
> >
> > This log means there are messages received from 10.6.20.86:6642
> > (the SB DB). Is there SB change? The CPU is
> > spent on handling the SB change. Some type of SB changes are not handled
> > incrementally.
>
> SB update is driven by ovn-northd in case anything changed in NB,
> and ovn-controller in case anything changed on chassis. No, there
> is nothing changed in NB, neither chassis.
>
> Should I bump logging level up to dbg? Is that going to show me
> what messages ovn-controller is handling?
>
> Yes, debug log should show the details.
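
For reference, debug logging can be toggled at runtime with ovs-appctl; a
minimal sketch, assuming the default ovn-controller control socket (module-
and destination-specific settings are described in ovs-appctl(8)):

  # very verbose at scale -- enable only long enough to catch the event
  ovs-appctl -t ovn-controller vlog/set dbg
  # ... reproduce the issue, then turn it back down
  ovs-appctl -t ovn-controller vlog/set info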


> >
> >   Thanks!
> >
> >   Tony
> >
> >   > -----Original Message-----
> >   > From: Han Zhou 
> >   > Sent: Friday, August 7, 2020 10:32 AM
> >   > To: Tony Liu 
> >   > Cc: ovs-discuss ; ovs-dev 
> >   > Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no
> >   > changes in sb-db
> >   >
> >   >
> >   > On Fri, Aug 7, 2020 at 10:05 AM Tony Liu  wrote:
> >   >
> >   >
> >   >   Hi,
> >   >
> >   >   Here are some logging snippets from ovn-controller.
> >   >   
> >   >   2020-08-07T16:38:04.020Z|29250|timeval|WARN|Unreasonably
> > long
> >   > 8954ms poll interval (8895ms user, 0ms system)
> >   >   
> >   >   What's that mean? Is it harmless?
> >   >
> >   >   
> >   >   2020-08-07T16:38:04.021Z|29251|timeval|WARN|context
> > switches: 0
> >   > voluntary, 6 involuntary
> >   >   2020-08-07T16:38:04.022Z|29252|poll_loop|INFO|wakeup due to [POLLIN]
> >   > on fd 19 (10.6.20.91:60398<->10.6.20.86:6642) at lib/stream-fd.c:157 (99% CPU usage)
> >   >   
> > 

Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Han Zhou
On Fri, Aug 7, 2020 at 12:19 PM Tony Liu  wrote:

> ovn-controller is using UNIX socket connecting to local ovsdb-server.
>

>From the log you were showing, you were using tcp:127.0.0.1:6640 to connect
the local ovsdb.
>   2020-08-07T16:38:04.022Z|29253|reconnect|WARN|tcp:127.0.0.1:6640
> <http://127.0.0.1:6640> : connection dropped (Broken pipe)


> Inactivity probe doesn't seem to be the cause of high cpu usage.
>
> The wakeup on connection to sb-db is always followed by a "unreasonably
> long" warning. I guess the pollin event loop is stuck for too long, like
> 10s as below.
> 
> 2020-08-07T18:46:49.301Z|00296|poll_loop|INFO|wakeup due to [POLLIN] on fd
> 19 (10.6.20.91:60712<->10.6.20.86:6642) at lib/stream-fd.c:157 (99% CPU
> usage)
> 2020-08-07T18:46:59.460Z|00297|timeval|WARN|Unreasonably long 10153ms poll
> interval (10075ms user, 1ms system)
> 
>
> Could that stuck loop be the cause of high cpu usage?
> What is it polling in?
> Why is it stuck, waiting for message from sb-db?
> Isn't it supposed to release the cpu while waiting?
>
> This log means there are messages received from 10.6.20.86:6642 (the SB
DB). Is there SB change? The CPU is spent on handling the SB change. Some
type of SB changes are not handled incrementally.


> Thanks!
>
> Tony
>
> > -Original Message-
> > From: Han Zhou 
> > Sent: Friday, August 7, 2020 10:32 AM
> > To: Tony Liu 
> > Cc: ovs-discuss ; ovs-dev  > d...@openvswitch.org>
> > Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no
> > changes in sb-db
> >
> >
> >
> > On Fri, Aug 7, 2020 at 10:05 AM Tony Liu  wrote:
> >
> >
> >   Hi,
> >
> >   Here are some logging snippets from ovn-controller.
> >   
> >   2020-08-07T16:38:04.020Z|29250|timeval|WARN|Unreasonably long
> > 8954ms poll interval (8895ms user, 0ms system)
> >   
> >   What's that mean? Is it harmless?
> >
> >   
> >   2020-08-07T16:38:04.021Z|29251|timeval|WARN|context switches: 0
> > voluntary, 6 involuntary
> >   2020-08-07T16:38:04.022Z|29252|poll_loop|INFO|wakeup due to [POLLIN]
> > on fd 19 (10.6.20.91:60398<->10.6.20.86:6642) at lib/stream-fd.c:157 (99% CPU usage)
> >   
> >   Is this wakeup caused by changes in sb-db?
> >   Why is ovn-controller so busy?
> >
> >   
> >   2020-08-07T16:38:04.022Z|29253|reconnect|WARN|tcp:127.0.0.1:6640:
> > connection dropped (Broken pipe)
> >   
> >   Connection to local ovsdb-server is dropped.
> >   Is this caused by the timeout of inactivity probe?
> >
> >   
> >   2020-08-07T16:38:04.035Z|29254|poll_loop|INFO|wakeup due to
> [POLLIN]
> > on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157
> > (99% CPU usage)
> >   
> >   What causes this wakeup?
> >
> >   
> >   2020-08-07T16:38:04.048Z|29255|poll_loop|INFO|wakeup due to 0-ms
> > timeout at lib/ovsdb-idl.c:5391 (99% CPU usage)
> >   
> >   What's this 0-ms wakeup mean?
> >
> >   
> >   2020-08-07T16:38:05.022Z|29256|poll_loop|INFO|wakeup due to 962-ms
> > timeout at lib/reconnect.c:643 (99% CPU usage)
> >   2020-08-07T16:38:05.023Z|29257|reconnect|INFO|tcp:127.0.0.1:6640: connecting...
> >   2020-08-07T16:38:05.041Z|29258|poll_loop|INFO|wakeup due to [POLLOUT]
> > on fd 14 (127.0.0.1:51478<->127.0.0.1:6640) at lib/stream-fd.c:153 (99% CPU usage)
> >   2020-08-07T16:38:05.041Z|29259|reconnect|INFO|tcp:127.0.0.1:6640: connected
> >   
> >   Retry to connect to local ovsdb-server. A pollout event is
> > triggered
> >   right after connection is established. What's poolout?
> >
> >   ovn-controller is taking 100% CPU now, and there is no changes in
> >   sb-db (not busy). It seems that it's busy with local ovsdb-server
> >   or vswitchd. I'd like to understand why ovn-controller is so busy?
> >   All inactivity probe intervals are set to 30s.
> >
> >
> >
> >
> > Is there change from the local ovsdb? You can enable dbg log to see what
> > is happening.
> > For the local ovsdb probe, I have mentioned in the other thread that
> > UNIX socket is recommended (instead of tcp 127.0.0.1). Using UNIX socket
> > disables probe by default.
> >
> > Thanks,
> > Han
>
>


Re: [ovs-discuss] ovn-k8s scale: how to make new ovn-controller process keep the previous Open Flow in br-int

2020-08-07 Thread Han Zhou
On Thu, Aug 6, 2020 at 10:22 AM Han Zhou  wrote:

>
>
> On Thu, Aug 6, 2020 at 9:15 AM Numan Siddique  wrote:
>
>>
>>
>> On Thu, Aug 6, 2020 at 9:25 PM Venugopal Iyer 
>> wrote:
>>
>>> Hi, Han:
>>>
>>>
>>>
>>> A comment inline:
>>>
>>>
>>>
>>> *From:* ovn-kuberne...@googlegroups.com 
>>> *On Behalf Of *Han Zhou
>>> *Sent:* Wednesday, August 5, 2020 3:36 PM
>>> *To:* Winson Wang 
>>> *Cc:* ovs-discuss@openvswitch.org; ovn-kuberne...@googlegroups.com;
>>> Dumitru Ceara ; Han Zhou 
>>> *Subject:* Re: ovn-k8s scale: how to make new ovn-controller process
>>> keep the previous Open Flow in br-int
>>>
>>>
>>>
>>> *External email: Use caution opening links or attachments*
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Aug 5, 2020 at 12:58 PM Winson Wang 
>>> wrote:
>>>
>>> Hello OVN Experts,
>>>
>>>
>>> With ovn-k8s,  we need to keep the flows always on br-int which needed
>>> by running pods on the k8s node.
>>>
>>> Is there an ongoing project to address this problem?
>>>
>>> If not,  I have one proposal not sure if it is doable.
>>>
>>> Please share your thoughts.
>>> The issue:
>>>
>>> In large scale ovn-k8s cluster there are 200K+ Open Flows on br-int on
>>> every K8s node.  When we restart ovn-controller for upgrade using
>>> `ovs-appctl -t ovn-controller exit --restart`,  the remaining traffic still
>>> works fine since br-int with flows still be Installed.
>>>
>>>
>>>
>>> However, when a new ovn-controller starts it will connect OVS IDL and do
>>> an engine init run,  clearing all OpenFlow flows and install flows based on
>>> SB DB.
>>>
>>> With open flows count above 200K+,  it took more than 15 seconds to get
>>> all the flows installed br-int bridge again.
>>>
>>>
>>> Proposal solution for the issue:
>>>
>>> When the ovn-controller gets “exit --start”,  it will write a
>>> “ovs-cond-seqno” to OVS IDL and store the value to Open vSwitch table in
>>> external-ids column. When new ovn-controller starts, it will check if the
>>> “ovs-cond-seqno” exists in the Open_vSwitch table,  and get the seqno from
>>> OVS IDL to decide if it will force a recomputing process?
>>>
>>>
>>>
>>>
>>>
>>> Hi Winson,
>>>
>>>
>>>
>>> Thanks for the proposal. Yes, the connection break during upgrading is a
>>> real issue in a large scale environment. However, the proposal doesn't
>>> work. The "ovs-cond-seqno" is for the OVSDB IDL for the local conf DB,
>>> which is a completely different connection from the ovs-vswitchd open-flow
>>> connection.
>>>
>>> To avoid clearing the open-flow table during ovn-controller startup, we
>>> can find a way to postpone clearing the OVS flows after the recomputing in
>>> ovn-controller is completed, right before ovn-controller replacing with the
>>> new flows.
>>>
>>> *[vi> ] *
>>>
>>> *[vi> ] Seems like we force recompute today if the OVS IDL is
>>> reconnected. Would it be possible to defer *
>>>
>>> *decision to  recompute the flows based on  the  SB’s nb_cfg we have
>>>  sync’d with? i.e.  If  our nb_cfg is *
>>>
>>> *in sync with the SB’s global nb_cfg, we can skip the recompute?  At
>>> least if nothing has changed since*
>>>
>>> *the restart, we won’t need to do anything.. We could stash nb_cfg in
>>> OVS (once ovn-controller receives*
>>>
>>> *conformation from OVS that the physical flows for an nb_cfg update are
>>> in place), which should be cleared if *
>>>
>>> *OVS itself is restarted.. (I mean currently, nb_cfg is used to check if
>>> NB, SB and Chassis are in sync, we *
>>>
>>> *could extend this to OVS/physical flows?)*
>>>
>>>
>>>
>>> *Have not thought through this though .. so maybe I am missing
>>> something…*
>>>
>>>
>>>
>>> *Thanks,*
>>>
>>>
>>>
>>> *-venu*
>>>
>>> This should largely reduce the time of connection broken during
>>> upgrading. Some changes in the ofctrl module's state machine are required,
>>> but I am not 100% sure if

Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Han Zhou
On Fri, Aug 7, 2020 at 10:05 AM Tony Liu  wrote:

> Hi,
>
> Here are some logging snippets from ovn-controller.
> 
> 2020-08-07T16:38:04.020Z|29250|timeval|WARN|Unreasonably long 8954ms poll
> interval (8895ms user, 0ms system)
> 
> What's that mean? Is it harmless?
>
> 
> 2020-08-07T16:38:04.021Z|29251|timeval|WARN|context switches: 0 voluntary,
> 6 involuntary
> 2020-08-07T16:38:04.022Z|29252|poll_loop|INFO|wakeup due to [POLLIN] on fd
> 19 (10.6.20.91:60398<->10.6.20.86:6642) at lib/stream-fd.c:157 (99% CPU
> usage)
> 
> Is this wakeup caused by changes in sb-db?
> Why is ovn-controller so busy?
>
> 
> 2020-08-07T16:38:04.022Z|29253|reconnect|WARN|tcp:127.0.0.1:6640:
> connection dropped (Broken pipe)
> 
> Connection to local ovsdb-server is dropped.
> Is this caused by the timeout of inactivity probe?
>
> 
> 2020-08-07T16:38:04.035Z|29254|poll_loop|INFO|wakeup due to [POLLIN] on fd
> 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU
> usage)
> 
> What causes this wakeup?
>
> 
> 2020-08-07T16:38:04.048Z|29255|poll_loop|INFO|wakeup due to 0-ms timeout
> at lib/ovsdb-idl.c:5391 (99% CPU usage)
> 
> What's this 0-ms wakeup mean?
>
> 
> 2020-08-07T16:38:05.022Z|29256|poll_loop|INFO|wakeup due to 962-ms timeout
> at lib/reconnect.c:643 (99% CPU usage)
> 2020-08-07T16:38:05.023Z|29257|reconnect|INFO|tcp:127.0.0.1:6640:
> connecting...
> 2020-08-07T16:38:05.041Z|29258|poll_loop|INFO|wakeup due to [POLLOUT] on
> fd 14 (127.0.0.1:51478<->127.0.0.1:6640) at lib/stream-fd.c:153 (99% CPU
> usage)
> 2020-08-07T16:38:05.041Z|29259|reconnect|INFO|tcp:127.0.0.1:6640:
> connected
> 
> Retry to connect to local ovsdb-server. A pollout event is triggered
> right after connection is established. What's poolout?
>
> ovn-controller is taking 100% CPU now, and there is no changes in
> sb-db (not busy). It seems that it's busy with local ovsdb-server
> or vswitchd. I'd like to understand why ovn-controller is so busy?
> All inactivity probe intervals are set to 30s.
>
>
Is there change from the local ovsdb? You can enable dbg log to see what is
happening.
For the local ovsdb probe, I have mentioned in the other thread that UNIX
socket is recommended (instead of tcp 127.0.0.1). Using UNIX socket
disables probe by default.

Thanks,
Han
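
A related sketch, for checking whether the local conf DB is actually changing
while ovn-controller is busy (the socket path is the one shown elsewhere in
this thread; the table and columns are only illustrative):

  # print a line for every change to the Interface table of the local DB
  ovsdb-client monitor unix:/run/openvswitch/db.sock Open_vSwitch \
      Interface name,ofport,external_ids

If this stays quiet, the churn is more likely coming from the SB side than
from the local ovsdb-server.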


Re: [ovs-discuss] ovn-k8s scale: how to make new ovn-controller process keep the previous Open Flow in br-int

2020-08-06 Thread Han Zhou
On Thu, Aug 6, 2020 at 10:13 AM Han Zhou  wrote:

>
>
> On Thu, Aug 6, 2020 at 8:54 AM Venugopal Iyer 
> wrote:
>
>> Hi, Han:
>>
>>
>>
>> A comment inline:
>>
>>
>>
>> *From:* ovn-kuberne...@googlegroups.com 
>> *On Behalf Of *Han Zhou
>> *Sent:* Wednesday, August 5, 2020 3:36 PM
>> *To:* Winson Wang 
>> *Cc:* ovs-discuss@openvswitch.org; ovn-kuberne...@googlegroups.com;
>> Dumitru Ceara ; Han Zhou 
>> *Subject:* Re: ovn-k8s scale: how to make new ovn-controller process
>> keep the previous Open Flow in br-int
>>
>>
>>
>> *External email: Use caution opening links or attachments*
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Aug 5, 2020 at 12:58 PM Winson Wang 
>> wrote:
>>
>> Hello OVN Experts,
>>
>>
>> With ovn-k8s,  we need to keep the flows always on br-int which needed by
>> running pods on the k8s node.
>>
>> Is there an ongoing project to address this problem?
>>
>> If not,  I have one proposal not sure if it is doable.
>>
>> Please share your thoughts.
>> The issue:
>>
>> In large scale ovn-k8s cluster there are 200K+ Open Flows on br-int on
>> every K8s node.  When we restart ovn-controller for upgrade using
>> `ovs-appctl -t ovn-controller exit --restart`,  the remaining traffic still
>> works fine since br-int with flows still be Installed.
>>
>>
>>
>> However, when a new ovn-controller starts it will connect OVS IDL and do
>> an engine init run,  clearing all OpenFlow flows and install flows based on
>> SB DB.
>>
>> With open flows count above 200K+,  it took more than 15 seconds to get
>> all the flows installed br-int bridge again.
>>
>>
>> Proposal solution for the issue:
>>
>> When the ovn-controller gets “exit --start”,  it will write a
>> “ovs-cond-seqno” to OVS IDL and store the value to Open vSwitch table in
>> external-ids column. When new ovn-controller starts, it will check if the
>> “ovs-cond-seqno” exists in the Open_vSwitch table,  and get the seqno from
>> OVS IDL to decide if it will force a recomputing process?
>>
>>
>>
>>
>>
>> Hi Winson,
>>
>>
>>
>> Thanks for the proposal. Yes, the connection break during upgrading is a
>> real issue in a large scale environment. However, the proposal doesn't
>> work. The "ovs-cond-seqno" is for the OVSDB IDL for the local conf DB,
>> which is a completely different connection from the ovs-vswitchd open-flow
>> connection.
>>
>> To avoid clearing the open-flow table during ovn-controller startup, we
>> can find a way to postpone clearing the OVS flows after the recomputing in
>> ovn-controller is completed, right before ovn-controller replacing with the
>> new flows.
>>
>> *[vi> ] *
>>
>> *[vi> ] Seems like we force recompute today if the OVS IDL is
>> reconnected. Would it be possible to defer *
>>
>> *decision to  recompute the flows based on  the  SB’s nb_cfg we have
>>  sync’d with? i.e.  If  our nb_cfg is *
>>
>> *in sync with the SB’s global nb_cfg, we can skip the recompute?  At
>> least if nothing has changed since*
>>
>> *the restart, we won’t need to do anything.. We could stash nb_cfg in OVS
>> (once ovn-controller receives*
>>
>> *conformation from OVS that the physical flows for an nb_cfg update are
>> in place), which should be cleared if *
>>
>> *OVS itself is restarted.. (I mean currently, nb_cfg is used to check if
>> NB, SB and Chassis are in sync, we *
>>
>> *could extend this to OVS/physical flows?)*
>>
>
> nb_cfg is already used by ovn-controller to do that, with the help of
> "barrier" of OpenFlow, but I am not sure if it 100% working as expected.
>
> This basic idea should work, but in practice we need to take care of
> generating the "installed" flow table and "desired" flow table in
> ovn-controller.
> I'd start with "postpone clearing OVS flows" which seems a lower hanging
> fruit, and then see if any further improvement is needed.
>
>
(resend using my gmail so that it can reach the ovn-kubernetes group.)

I thought about it again and it seems the idea of remembering nb_cfg
doesn't work for the upgrading scenario. Even if nb_cfg is the same and we
are sure about the flow that's installed in OVS reflects the certain nb_cfg
version, we cannot say the OVS flows doesn't need any change, because the
new version of ovn-controller implementation may translate same SB data
into different

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-06 Thread Han Zhou
On Thu, Aug 6, 2020 at 12:07 PM Tony Liu  wrote:
>
> Inline...
>
> Thanks!
>
> Tony
> > -Original Message-
> > From: Han Zhou 
> > Sent: Thursday, August 6, 2020 11:37 AM
> > To: Tony Liu 
> > Cc: Han Zhou ; Numan Siddique ; ovs-dev
> > ; ovs-discuss 
> > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> >
> >
> >
> > On Thu, Aug 6, 2020 at 11:11 AM Tony Liu  wrote:
> > >
> > > Inline... (please read with monospaced font:))
> > >
> > > Thanks!
> > >
> > > Tony
> > > > -Original Message-
> > > > From: Han Zhou 
> > > > Sent: Wednesday, August 5, 2020 11:48 PM
> > > > To: Tony Liu 
> > > > Cc: Han Zhou ; Numan Siddique ; ovs-dev
> > > > ; ovs-discuss 
> > > > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> > > >
> > > >
> > > >
> > > > On Wed, Aug 5, 2020 at 9:14 PM Tony Liu  wrote:
> > > >
> > > >
> > > >   I set the connection target="ptcp:6641:10.6.20.84" for ovn-nb-db
> > > >   and "ptcp:6642:10.6.20.84" for ovn-sb-db. .84 is the first node
> > > >   of cluster. Also ovn-openflow-probe-interval=30 on compute node.
> > > >   It seems helping. Not that many connect/drop/reconnect in logging.
> > > >   That "commit failure" is also gone.
> > > >   The issue I reported in another thread "packet drop" seems gone.
> > > >   And launching VM starts working.
> > > >
> > > >   How should I set connection table for all ovn-nb-db and ovn-sb-db
> > > >   nodes in the cluster to set inactivity_probe?
> > > >   One row with address 0.0.0.0 seems not working.
> > > >
> > > > You can simply use 0.0.0.0 in the connection table, but don't
> > > > specify the same connection method on the command line when starting
> > > > ovsdb-server for NB/SB DB. Otherwise, these are conflicting and
> > > > that's why you saw "Address already in use" error.
> > >
> > > Could you share a bit details how it works?
> > > I thought the row in connection table only tells nbdb and sbdb the
> > > probe interval. Isn't that right? Does nbdb and sbdb also create
> > > socket based on target column?
> >
> > >
> >
> > In --remote option of ovsdb-server, you can specify either a connection
> > method directly, or specify the db,table,column which contains the
> > connection information.
> > Please see manpage ovsdb-server(1).
>
> Here is how one of those 3 nbdb nodes invoked.
> 
> ovsdb-server -vconsole:off -vfile:info
--log-file=/var/log/kolla/openvswitch/ovn-sb-db.log
--remote=punix:/var/run/ovn/ovnsb_db.sock --pidfile=/run/ovn/ovnsb_db.pid
--unixctl=/var/run/ovn/ovnsb_db.ctl
--remote=db:OVN_Southbound,SB_Global,connections
--private-key=db:OVN_Southbound,SSL,private_key
--certificate=db:OVN_Southbound,SSL,certificate
--ca-cert=db:OVN_Southbound,SSL,ca_cert
--ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols
--ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers
--remote=ptcp:6642:10.6.20.84 /var/lib/openvswitch/ovn-sb/ov sb.db
> 
> It creates UNIX and TCP sockets, and takes configuration from DB.
> Does that look ok?
> Given that, what the target column should be for all nodes of the cluster?
> And whatever target is set, ovsdb-server will create socket, right?
> Oh... Should I do "--remote=ptcp:6642:0.0.0.0"? Then I can set the same
> in connection table, and it won't cause conflict?
> If --remote and connection target are the same, whoever comes in later
> will be ignored, right?
> In coding, does ovsdb-server create a connection object for each of
> --remote and connection target, or it's one single connection object
> for both of them because method:port:address is the same? I'd expect
> the single object.
>

--remote=ptcp:6642:10.6.20.84 should be removed from the command.
You already specify --remote=db:OVN_Southbound,SB_
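
In other words, something along these lines (a sketch that keeps the other
options of the quoted command unchanged; the probe value is only an example,
in milliseconds):

  ovsdb-server ... \
      --remote=punix:/var/run/ovn/ovnsb_db.sock \
      --remote=db:OVN_Southbound,SB_Global,connections \
      ...                      # no --remote=ptcp:6642:... on the command line

  # then configure the TCP listener (and its probe) via the connection table:
  ovn-sbctl set-connection ptcp:6642:0.0.0.0
  ovn-sbctl set connection . inactivity_probe=60000

The second ovn-sbctl command assumes a single row in the Connection table.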

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-06 Thread Han Zhou
On Thu, Aug 6, 2020 at 11:11 AM Tony Liu  wrote:
>
> Inline... (please read with monospaced font:))
>
> Thanks!
>
> Tony
> > -Original Message-
> > From: Han Zhou 
> > Sent: Wednesday, August 5, 2020 11:48 PM
> > To: Tony Liu 
> > Cc: Han Zhou ; Numan Siddique ; ovs-dev
> > ; ovs-discuss 
> > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> >
> >
> >
> > On Wed, Aug 5, 2020 at 9:14 PM Tony Liu  wrote:
> >
> >
> >   I set the connection target="ptcp:6641:10.6.20.84" for ovn-nb-db
> >   and "ptcp:6642:10.6.20.84" for ovn-sb-db. .84 is the first node
> >   of cluster. Also ovn-openflow-probe-interval=30 on compute node.
> >   It seems helping. Not that many connect/drop/reconnect in logging.
> >   That "commit failure" is also gone.
> >   The issue I reported in another thread "packet drop" seems gone.
> >   And launching VM starts working.
> >
> >   How should I set connection table for all ovn-nb-db and ovn-sb-db
> >   nodes in the cluster to set inactivity_probe?
> >   One row with address 0.0.0.0 seems not working.
> >
> > You can simply use 0.0.0.0 in the connection table, but don't specify
> > the same connection method on the command line when starting ovsdb-
> > server for NB/SB DB. Otherwise, these are conflicting and that's why you
> > saw "Address already in use" error.
>
> Could you share a bit details how it works?
> I thought the row in connection table only tells nbdb and sbdb the
> probe interval. Isn't that right? Does nbdb and sbdb also create
> socket based on target column?
>

In --remote option of ovsdb-server, you can specify either a connection
method directly, or specify the db,table,column which contains the
connection information.
Please see manpage ovsdb-server(1).

> >
> >   Is "external_ids:ovn-remote-probe-interval" in ovsdb-server on
> >   compute node for ovn-controller to probe ovn-sb-db?
> >
> > OVSDB probe is bidirectional, so you need to set this value, too, if you
> > don't want too many probes handled by the SB server. (setting the
> > connection table for SB only changes the server side).
>
> In that case, how do I set probe interval for ovn-controller?
> My understanding is that, ovn-controller reads configuration from
> ovsdb-server on the local compute node. Isn't that right?
>

The configuration you mentioned "external_ids:ovn-remote-probe-interval" is
exactly the way to set the ovn-controller -> SB probe interval.
(SB -> ovn-controller probe is set in the connection table of SB)

You are right that ovn-controller reads configuration from the local
ovsdb-server. This setting is in local ovsdb-server.
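
For reference, a typical way to set it from the chassis (a sketch; the value
is in milliseconds, and 0 disables this probe entirely):

  ovs-vsctl set open_vswitch . external_ids:ovn-remote-probe-interval=60000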

> >   Is "external_ids:ovn-openflow-probe-interval" in ovsdb-server on
> >   compute node for ovn-controller to probe ovsdb-server?
> >
> > It is for the OpenFlow connection between ovn-controller and ovs-
> > vswitchd, which is part of the OpenFlow protocol.
> >
> >   What's probe interval for ovsdb-server to probe ovn-controller?
> >
> > The local ovsdb connection uses unix socket, which doesn't send probe by
> > default (if I remember correctly).
>
> Here is how ovsdb-server and ovn-controller is invoked on compute node.
> 
> root 41129  0.0  0.0 157556 20532 ?SJul30   1:51
/usr/sbin/ovsdb-server /var/lib/openvswitch/conf.db -vconsole:emer
-vsyslog:err -vfile:info --remote=punix:/run/openvswitch/db.sock
--remote=ptcp:6640:127.0.0.1
--remote=db:Open_vSwitch,Open_vSwitch,manager_options
--log-file=/var/log/kolla/openvswitch/ovsdb-server.log --pidfile
>
> root 63775 55.9  0.4 1477796 1224324 ? Sl   Aug04 1360:55
/usr/bin/ovn-controller --pidfile=/run/ovn/ovn-controller.pid
--log-file=/var/log/kolla/openvswitch/ovn-controller.log tcp:127.0.0.1:6640
> 
> Is that OK? Or UNIX socket method is recommended for ovn-controller
> to connect to ovsdb-server?

If using TCP, by default it is 5s probe interval. I think it is better to
use unix socket. (but maybe it doesn't matter that much)
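
For completeness, switching the last argument from tcp:127.0.0.1:6640 to the
unix socket the local ovsdb-server already listens on would look roughly like
this (a sketch reusing the paths quoted above; adjust to your packaging):

  /usr/bin/ovn-controller --pidfile=/run/ovn/ovn-controller.pid \
      --log-file=/var/log/kolla/openvswitch/ovn-controller.log \
      unix:/run/openvswitch/db.sock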

>
> Here is the configuration in open_vswitch table in ovsdb-server.
> 
> external_ids: {ovn-encap-ip="10.6.30.22", ovn-encap-type=geneve,
ovn-openflow-probe-interval="30", ovn-remote="tcp:10.6.20.84:6642,tcp:
10.6.20.85:6642,tcp:10.6.20.86:6642", ovn-remote-probe-interval="6",
system-id="compute-3"}
> 
> ovn-controller connects to ovsdb-server and reads this configuration,
> so it knows how to

Re: [ovs-discuss] ovn-k8s scale: how to make new ovn-controller process keep the previous Open Flow in br-int

2020-08-06 Thread Han Zhou
On Thu, Aug 6, 2020 at 9:15 AM Numan Siddique  wrote:

>
>
> On Thu, Aug 6, 2020 at 9:25 PM Venugopal Iyer 
> wrote:
>
>> Hi, Han:
>>
>>
>>
>> A comment inline:
>>
>>
>>
>> *From:* ovn-kuberne...@googlegroups.com 
>> *On Behalf Of *Han Zhou
>> *Sent:* Wednesday, August 5, 2020 3:36 PM
>> *To:* Winson Wang 
>> *Cc:* ovs-discuss@openvswitch.org; ovn-kuberne...@googlegroups.com;
>> Dumitru Ceara ; Han Zhou 
>> *Subject:* Re: ovn-k8s scale: how to make new ovn-controller process
>> keep the previous Open Flow in br-int
>>
>>
>>
>> *External email: Use caution opening links or attachments*
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Aug 5, 2020 at 12:58 PM Winson Wang 
>> wrote:
>>
>> Hello OVN Experts,
>>
>>
>> With ovn-k8s,  we need to keep the flows always on br-int which needed by
>> running pods on the k8s node.
>>
>> Is there an ongoing project to address this problem?
>>
>> If not,  I have one proposal not sure if it is doable.
>>
>> Please share your thoughts.
>> The issue:
>>
>> In large scale ovn-k8s cluster there are 200K+ Open Flows on br-int on
>> every K8s node.  When we restart ovn-controller for upgrade using
>> `ovs-appctl -t ovn-controller exit --restart`,  the remaining traffic still
>> works fine since br-int with flows still be Installed.
>>
>>
>>
>> However, when a new ovn-controller starts it will connect OVS IDL and do
>> an engine init run,  clearing all OpenFlow flows and install flows based on
>> SB DB.
>>
>> With open flows count above 200K+,  it took more than 15 seconds to get
>> all the flows installed br-int bridge again.
>>
>>
>> Proposal solution for the issue:
>>
>> When the ovn-controller gets “exit --start”,  it will write a
>> “ovs-cond-seqno” to OVS IDL and store the value to Open vSwitch table in
>> external-ids column. When new ovn-controller starts, it will check if the
>> “ovs-cond-seqno” exists in the Open_vSwitch table,  and get the seqno from
>> OVS IDL to decide if it will force a recomputing process?
>>
>>
>>
>>
>>
>> Hi Winson,
>>
>>
>>
>> Thanks for the proposal. Yes, the connection break during upgrading is a
>> real issue in a large scale environment. However, the proposal doesn't
>> work. The "ovs-cond-seqno" is for the OVSDB IDL for the local conf DB,
>> which is a completely different connection from the ovs-vswitchd open-flow
>> connection.
>>
>> To avoid clearing the open-flow table during ovn-controller startup, we
>> can find a way to postpone clearing the OVS flows after the recomputing in
>> ovn-controller is completed, right before ovn-controller replacing with the
>> new flows.
>>
>> *[vi> ] *
>>
>> *[vi> ] Seems like we force recompute today if the OVS IDL is
>> reconnected. Would it be possible to defer *
>>
>> *decision to  recompute the flows based on  the  SB’s nb_cfg we have
>>  sync’d with? i.e.  If  our nb_cfg is *
>>
>> *in sync with the SB’s global nb_cfg, we can skip the recompute?  At
>> least if nothing has changed since*
>>
>> *the restart, we won’t need to do anything.. We could stash nb_cfg in OVS
>> (once ovn-controller receives*
>>
>> *conformation from OVS that the physical flows for an nb_cfg update are
>> in place), which should be cleared if *
>>
>> *OVS itself is restarted.. (I mean currently, nb_cfg is used to check if
>> NB, SB and Chassis are in sync, we *
>>
>> *could extend this to OVS/physical flows?)*
>>
>>
>>
>> *Have not thought through this though .. so maybe I am missing something…*
>>
>>
>>
>> *Thanks,*
>>
>>
>>
>> *-venu*
>>
>> This should largely reduce the time of connection broken during
>> upgrading. Some changes in the ofctrl module's state machine are required,
>> but I am not 100% sure if this approach is applicable. Need to check more
>> details.
>>
>
>
> We can also think if its possible to do the below way
>- When ovn-controller starts, it will not clear the flows, but instead
> will get the dump of flows  from the br-int and populate these flows in its
> installed flows
> - And then when it connects to the SB DB and computes the desired
> flows, it will anyway sync up with the installed flows with the desired
> flows
> - And if there is no difference between desired flows and installed
> flows, there will

Re: [ovs-discuss] ovn-k8s scale: how to make new ovn-controller process keep the previous Open Flow in br-int

2020-08-06 Thread Han Zhou
On Thu, Aug 6, 2020 at 8:54 AM Venugopal Iyer  wrote:

> Hi, Han:
>
>
>
> A comment inline:
>
>
>
> *From:* ovn-kuberne...@googlegroups.com  *On
> Behalf Of *Han Zhou
> *Sent:* Wednesday, August 5, 2020 3:36 PM
> *To:* Winson Wang 
> *Cc:* ovs-discuss@openvswitch.org; ovn-kuberne...@googlegroups.com;
> Dumitru Ceara ; Han Zhou 
> *Subject:* Re: ovn-k8s scale: how to make new ovn-controller process keep
> the previous Open Flow in br-int
>
>
>
> *External email: Use caution opening links or attachments*
>
>
>
>
>
>
>
> On Wed, Aug 5, 2020 at 12:58 PM Winson Wang 
> wrote:
>
> Hello OVN Experts,
>
>
> With ovn-k8s,  we need to keep the flows always on br-int which needed by
> running pods on the k8s node.
>
> Is there an ongoing project to address this problem?
>
> If not,  I have one proposal not sure if it is doable.
>
> Please share your thoughts.
> The issue:
>
> In large scale ovn-k8s cluster there are 200K+ Open Flows on br-int on
> every K8s node.  When we restart ovn-controller for upgrade using
> `ovs-appctl -t ovn-controller exit --restart`,  the remaining traffic still
> works fine since br-int with flows still be Installed.
>
>
>
> However, when a new ovn-controller starts it will connect OVS IDL and do
> an engine init run,  clearing all OpenFlow flows and install flows based on
> SB DB.
>
> With open flows count above 200K+,  it took more than 15 seconds to get
> all the flows installed br-int bridge again.
>
>
> Proposal solution for the issue:
>
> When the ovn-controller gets “exit --start”,  it will write a
> “ovs-cond-seqno” to OVS IDL and store the value to Open vSwitch table in
> external-ids column. When new ovn-controller starts, it will check if the
> “ovs-cond-seqno” exists in the Open_vSwitch table,  and get the seqno from
> OVS IDL to decide if it will force a recomputing process?
>
>
>
>
>
> Hi Winson,
>
>
>
> Thanks for the proposal. Yes, the connection break during upgrading is a
> real issue in a large scale environment. However, the proposal doesn't
> work. The "ovs-cond-seqno" is for the OVSDB IDL for the local conf DB,
> which is a completely different connection from the ovs-vswitchd open-flow
> connection.
>
> To avoid clearing the open-flow table during ovn-controller startup, we
> can find a way to postpone clearing the OVS flows after the recomputing in
> ovn-controller is completed, right before ovn-controller replacing with the
> new flows.
>
> *[vi> ] *
>
> *[vi> ] Seems like we force recompute today if the OVS IDL is reconnected.
> Would it be possible to defer *
>
> *decision to  recompute the flows based on  the  SB’s nb_cfg we have
>  sync’d with? i.e.  If  our nb_cfg is *
>
> *in sync with the SB’s global nb_cfg, we can skip the recompute?  At least
> if nothing has changed since*
>
> *the restart, we won’t need to do anything.. We could stash nb_cfg in OVS
> (once ovn-controller receives*
>
> *conformation from OVS that the physical flows for an nb_cfg update are in
> place), which should be cleared if *
>
> *OVS itself is restarted.. (I mean currently, nb_cfg is used to check if
> NB, SB and Chassis are in sync, we *
>
> *could extend this to OVS/physical flows?)*
>

nb_cfg is already used by ovn-controller to do that, with the help of the
OpenFlow "barrier", but I am not sure if it is 100% working as expected.

This basic idea should work, but in practice we need to take care of
generating the "installed" flow table and "desired" flow table in
ovn-controller.
I'd start with "postpone clearing OVS flows" which seems a lower hanging
fruit, and then see if any further improvement is needed.
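
(For readers unfamiliar with the nb_cfg mechanism mentioned above: it is the
same counter that ovn-nbctl's --wait option relies on, e.g.

  ovn-nbctl --wait=hv sync

which bumps nb_cfg in the NB DB and returns once the chassis report, via the
SB DB, that they have caught up. This is background only, not part of the
proposal itself.)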


>
> *Have not thought through this though .. so maybe I am missing something…*
>
>
>
> *Thanks,*
>
>
>
> *-venu*
>
> This should largely reduce the time of connection broken during upgrading.
> Some changes in the ofctrl module's state machine are required, but I am
> not 100% sure if this approach is applicable. Need to check more details.
>
>
>
> Thanks,
>
> Han
>
> Test log:
>
> Check flow cnt on br-int every second:
>
>
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=10322
>
> packet_count=0 byte_count=0 flow_count=34220
>
> packet_count=0 byte_count=0 flow_count=60425
>
> packet_count=0 byte_count=0 
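
(For reference, counters like the above can be sampled from the aggregate flow
statistics, e.g. with a simple loop such as the following; this is only one
possible way the numbers may have been collected:

  while sleep 1; do ovs-ofctl dump-aggregate br-int; done
)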

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-06 Thread Han Zhou
On Wed, Aug 5, 2020 at 9:14 PM Tony Liu  wrote:

> I set the connection target="ptcp:6641:10.6.20.84" for ovn-nb-db
> and "ptcp:6642:10.6.20.84" for ovn-sb-db. .84 is the first node
> of cluster. Also ovn-openflow-probe-interval=30 on compute node.
> It seems helping. Not that many connect/drop/reconnect in logging.
> That "commit failure" is also gone.
> The issue I reported in another thread "packet drop" seems gone.
> And launching VM starts working.
>
> How should I set connection table for all ovn-nb-db and ovn-sb-db
> nodes in the cluster to set inactivity_probe?
> One row with address 0.0.0.0 seems not working.
>

You can simply use 0.0.0.0 in the connection table, but don't specify the
same connection method on the command line when starting ovsdb-server for
NB/SB DB. Otherwise, these are conflicting and that's why you saw "Address
already in use" error.


> Is "external_ids:ovn-remote-probe-interval" in ovsdb-server on
> compute node for ovn-controller to probe ovn-sb-db?
>
> OVSDB probe is bidirectional, so you need to set this value, too, if you
don't want too many probes handled by the SB server. (setting the
connection table for SB only changes the server side).
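
So, roughly, the two directions are configured in different places; a sketch,
with illustrative values in milliseconds and assuming a single row in the SB
Connection table:

  # SB server -> ovn-controller probe, on the SB DB's connection row
  ovn-sbctl set connection . inactivity_probe=60000
  # ovn-controller -> SB server probe, per chassis in the local Open_vSwitch table
  ovs-vsctl set open_vswitch . external_ids:ovn-remote-probe-interval=60000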



> Is "external_ids:ovn-openflow-probe-interval" in ovsdb-server on
> compute node for ovn-controller to probe ovsdb-server?
>
> It is for the OpenFlow connection between ovn-controller and ovs-vswitchd,
which is part of the OpenFlow protocol.


> What's probe interval for ovsdb-server to probe ovn-controller?
>
> The local ovsdb connection uses unix socket, which doesn't send probe by
default (if I remember correctly).

For ovn-controller, since it is implemented with incremental processing,
it doesn't matter even if there are probes from OpenFlow or the local ovsdb.
If there is no configuration change, ovn-controller simply replies to the
probe and there is no extra cost.


> Thanks!
>
> Tony
> > -Original Message-
> > From: discuss  On Behalf Of Tony
> > Liu
> > Sent: Wednesday, August 5, 2020 4:29 PM
> > To: Han Zhou 
> > Cc: ovs-dev ; ovs-discuss 
> > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> >
> > Hi Han,
> >
> > After setting connection target="ptcp:6642:0.0.0.0" for ovn-sb-db, I see
> > this error.
> > 
> > 2020-08-05T23:01:26.819Z|06799|ovsdb_jsonrpc_server|ERR|ptcp:6642:0.0.0.0:
> > listen failed: Address already in use
> > 
> > Anything I am missing here?
> >
> >
> > Thanks!
> >
> > Tony
> > > -Original Message-
> > > From: Han Zhou 
> > > Sent: Tuesday, August 4, 2020 4:44 PM
> > > To: Tony Liu 
> > > Cc: Numan Siddique ; Han Zhou ; ovs-
> > > discuss ; ovs-dev
> > > 
> > > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> > >
> > >
> > >
> > > On Tue, Aug 4, 2020 at 2:50 PM Tony Liu  wrote:
> > >
> > >
> > > Hi,
> > >
> > > Since I have 3 OVN DB nodes, should I add 3 rows in connection table
> > > for the inactivity_probe? Or put 3 addresses into one row?
> > >
> > > "set-connection" set one row only, and there is no "add-connection".
> > > How should I add 3 rows into the table connection?
> > >
> > >
> > >
> > >
> > > You only need to set one row. Try this command:
> > >
> > > ovn-nbctl -- --id=@conn_uuid create Connection
> > > target="ptcp\:6641\:0.0.0.0" inactivity_probe=0 -- set NB_Global .
> > > connections=@conn_uuid
> > >
> > >
> > >
> > > Thanks!
> > >
> > > Tony
> > >
> > > > -Original Message-
> > > > From: Numan Siddique 
> > > > Sent: Tuesday, August 4, 2020 12:36 AM
> > > > To: Tony Liu 
> > > > Cc: ovs-discuss ; ovs-dev 
> > > > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> > > >
> > > >
> > > >
> > > > On Tue, Aug 4, 2020 at 9:12 AM Tony Liu 

Re: [ovs-discuss] ovn-k8s scale: how to make new ovn-controller process keep the previous Open Flow in br-int

2020-08-05 Thread Han Zhou
On Wed, Aug 5, 2020 at 4:21 PM Girish Moodalbail 
wrote:

>
>
> On Wed, Aug 5, 2020 at 3:35 PM Han Zhou  wrote:
>
>>
>>
>> On Wed, Aug 5, 2020 at 12:58 PM Winson Wang 
>> wrote:
>>
>>> Hello OVN Experts,
>>>
>>> With ovn-k8s,  we need to keep the flows always on br-int which needed
>>> by running pods on the k8s node.
>>> Is there an ongoing project to address this problem?
>>> If not,  I have one proposal not sure if it is doable.
>>> Please share your thoughts.
>>> The issue:
>>>
>>> In large scale ovn-k8s cluster there are 200K+ Open Flows on br-int on
>>> every K8s node.  When we restart ovn-controller for upgrade using
>>> `ovs-appctl -t ovn-controller exit --restart`,  the remaining traffic still
>>> works fine since br-int with flows still be Installed.
>>>
>>> However, when a new ovn-controller starts it will connect OVS IDL and do
>>> an engine init run,  clearing all OpenFlow flows and install flows based on
>>> SB DB.
>>>
>>> With open flows count above 200K+,  it took more than 15 seconds to get
>>> all the flows installed br-int bridge again.
>>>
>>> Proposal solution for the issue:
>>>
>>> When the ovn-controller gets “exit --start”,  it will write a
>>> “ovs-cond-seqno” to OVS IDL and store the value to Open vSwitch table in
>>> external-ids column. When new ovn-controller starts, it will check if the
>>> “ovs-cond-seqno” exists in the Open_vSwitch table,  and get the seqno from
>>> OVS IDL to decide if it will force a recomputing process?
>>>
>>>
>> Hi Winson,
>>
>> Thanks for the proposal. Yes, the connection break during upgrading is a
>> real issue in a large scale environment. However, the proposal doesn't
>> work. The "ovs-cond-seqno" is for the OVSDB IDL for the local conf DB,
>> which is a completely different connection from the ovs-vswitchd open-flow
>> connection.
>> To avoid clearing the open-flow table during ovn-controller startup, we
>> can find a way to postpone clearing the OVS flows after the recomputing in
>> ovn-controller is completed, right before ovn-controller replacing with the
>> new flows. This should largely reduce the time of connection broken during
>> upgrading. Some changes in the ofctrl module's state machine are required,
>> but I am not 100% sure if this approach is applicable. Need to check more
>> details.
>>
>>
> Thanks Han. Yes, postponing clearing of OpenFlow flows until all of the
> logical flows have been translated to OpenFlows will reduce the connection
> downtime. The question though is that can we use 'replace-flows' or
> 'mod-flows equivalent where-in the non-modified flows remain intact and all
> the sessions related to those flows will not face any downtime?
>
> I am not sure about the "replace-flows". However, I think these are
independent optimizations. I think postponing the clearing would solve the
major part of the problem. I believe currently > 90% of the time is spent
on waiting for the computation to finish while the OVS flows are already
cleared, rather than on the one-time flow installation. But yes, that could
be a further optimization.
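
(As a point of reference for the "replace-flows" idea: ovs-ofctl already has a
replace-flows command that, if I recall the semantics correctly, diffs a
desired flow table against what is installed and only adds, deletes, or
modifies the entries that differ; the file name below is just a placeholder:

  ovs-ofctl replace-flows br-int desired-flows.txt

Whether ofctrl in ovn-controller could be made to behave equivalently across a
restart is exactly the open question above.)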


> Regards,
> ~Girish
>


Re: [ovs-discuss] OVN Scale with RAFT: how to make raft cluster clients to balanced state again

2020-08-05 Thread Han Zhou
On Wed, Aug 5, 2020 at 3:59 PM Tony Liu  wrote:

> Sorry for hijacking this thread, I'd like to get some clarifications.
>
> How is the initial balanced state established, say 100 ovn-controllers
> connecting to 3 ovn-sb-db?
>
> The ovn-controller by default randomly connects to any servers specified
in the connection method, e.g. tcp:<ip1>:6642,tcp:<ip2>:6642,tcp:<ip3>:6643.
(Please see ovsdb(7) for details on "Connection Method".)

So initially it is balanced.


> The ovn-controller doesn't have to connect to the leader of ovn-sb-db,
> does it? In case it connects to the follower, the write request still
> needs to be forwarded to the leader, right?
>
> These logs keep showing up.
> 
> 2020-08-05T22:48:33.141Z|103607|reconnect|INFO|tcp:10.6.20.84:6642:
> connecting...
> 2020-08-05T22:48:33.151Z|103608|reconnect|INFO|tcp:127.0.0.1:6640:
> connected
> 2020-08-05T22:48:33.151Z|103609|reconnect|INFO|tcp:10.6.20.84:6642:
> connected
> 2020-08-05T22:48:33.159Z|103610|main|INFO|OVNSB commit failed, force
> recompute next time.
> 2020-08-05T22:48:33.161Z|103611|ovsdb_idl|INFO|tcp:10.6.20.84:6642:
> clustered database server is disconnected from cluster; trying another
> server
> 2020-08-05T22:48:33.161Z|103612|reconnect|INFO|tcp:10.6.20.84:6642:
> connection attempt timed out
> 2020-08-05T22:48:33.161Z|103613|reconnect|INFO|tcp:10.6.20.84:6642:
> waiting 2 seconds before reconnect
> 
> What's that "clustered database server is disconnected from cluster" mean?
>
> It means the server is part of a cluster, but it is disconnected from the
cluster, e.g. due to network partitioning, or overloaded and lost
heartbeat, or the cluster lost quorum and there is no leader elected.
If you use a clustered DB, it's better to set the connection method to all
servers (or you can use an LB VIP that points to all servers), instead of
only specifying a single server, which doesn't provide the desired HA.
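
On the chassis this is just the ovn-remote setting listing every cluster
member, as shown earlier in these threads (same addresses reused here as a
sketch; an LB VIP in front of the three servers works as well):

  ovs-vsctl set open_vswitch . \
      external_ids:ovn-remote="tcp:10.6.20.84:6642,tcp:10.6.20.85:6642,tcp:10.6.20.86:6642"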


>
> Thanks!
>
> Tony
>
>
> > -Original Message-
> > From: discuss  On Behalf Of Han
> > Zhou
> > Sent: Wednesday, August 5, 2020 3:05 PM
> > To: Winson Wang 
> > Cc: winson wang ; ovn-kuberne...@googlegroups.com;
> > ovs-discuss@openvswitch.org
> > Subject: Re: [ovs-discuss] OVN Scale with RAFT: how to make raft cluster
> > clients to balanced state again
> >
> >
> >
> > On Wed, Aug 5, 2020 at 12:51 PM Winson Wang  wrote:
> >
> >
> >   Hello OVN Experts:
> >
> >   With large scale ovn-k8s cluster,  there are several conditions
> > that would make ovn-controller clients connect SB central from a
> > balanced state to an unbalanced state.
> >
> >   Is there an ongoing project to address this problem?
> >   If not,  I have one proposal not sure if it is doable.
> >   Please share your thoughts.
> >
> >   The issue:
> >
> >   OVN SB RAFT 3 node cluster,  at first all the ovn-controller
> > clients will connect all the 3 nodes in a balanced state.
> >
> >   The following conditions will make the connections become
> > unbalanced.
> >
> >   *   One RAFT node restart,  all the ovn-controller clients to
> > reconnect to the two remaining cluster nodes.
> >
> >   *   Ovn-k8s,  after SB raft pods rolling upgrade, the last raft
> > pod has no client connections.
> >
> >
> >   RAFT clients in an unbalanced state would trigger more stress to
> > the raft cluster,  which makes the raft unstable under stress compared
> > to a balanced state.
> >
> >
> >   The proposal solution:
> >
> >
> >
> >   Ovn-controller adds next unix commands “reconnect” with argument of
> > preferred SB node IP.
> >
> >   When unbalanced state happens,  the UNIX command can trigger ovn-
> > controller reconnect
> >
> >   To new SB raft node with fast sync which doesn’t trigger the whole
> > DB downloading process.
> >
> >
> >
> > Thanks Winson. The proposal sounds good to me. Will you implement it?
> >
> > Han
> >
> >
> >
> >
> >
> >   --
> >
> >   Winson
> >
> >
> >
> >   --
> >   You received this message because you are subscribed to the Google
> > Groups "ovn-kubernetes" group.
> >   To unsubscribe from this group and stop receiving emails from it,
> > send an email to ovn-kubernetes+unsubscr...@googlegroups.com.
> >   To view this discussion on the web visit
> > https://groups.google.com/d/msgid/ovn-kubernetes/CAMu6iS--iOW0LxxtkOhJpRT49E-9bJVy0iXraC1LMDUWeu6kLA%40mail.gmail.com .
> >
>
>


Re: [ovs-discuss] OVN Scale with RAFT: how to make raft cluster clients to balanced state again

2020-08-05 Thread Han Zhou
On Wed, Aug 5, 2020 at 4:35 PM Girish Moodalbail 
wrote:

>
>
> On Wed, Aug 5, 2020 at 3:05 PM Han Zhou  wrote:
>
>>
>>
>> On Wed, Aug 5, 2020 at 12:51 PM Winson Wang 
>> wrote:
>>
>>> Hello OVN Experts:
>>>
>>> With large scale ovn-k8s cluster,  there are several conditions that
>>> would make ovn-controller clients connect SB central from a balanced state
>>> to an unbalanced state.
>>> Is there an ongoing project to address this problem?
>>> If not,  I have one proposal not sure if it is doable.
>>> Please share your thoughts.
>>>
>>> The issue:
>>>
>>> OVN SB RAFT 3 node cluster,  at first all the ovn-controller clients
>>> will connect all the 3 nodes in a balanced state.
>>>
>>> The following conditions will make the connections become unbalanced.
>>>
>>>-
>>>
>>>One RAFT node restart,  all the ovn-controller clients to reconnect
>>>to the two remaining cluster nodes.
>>>
>>>
>>>-
>>>
>>>Ovn-k8s,  after SB raft pods rolling upgrade, the last raft pod has
>>>no client connections.
>>>
>>>
>>> RAFT clients in an unbalanced state would trigger more stress to the
>>> raft cluster,  which makes the raft unstable under stress compared to a
>>> balanced state.
>>> The proposal solution:
>>>
>>> Ovn-controller adds next unix commands “reconnect” with argument of
>>> preferred SB node IP.
>>>
>>> When unbalanced state happens,  the UNIX command can trigger
>>> ovn-controller reconnect
>>>
>>> To new SB raft node with fast sync which doesn’t trigger the whole DB
>>> downloading process.
>>>
>>>
>> Thanks Winson. The proposal sounds good to me. Will you implement it?
>>
>
> Han/Winson,
>
> The fast re-sync is for ovsdb-server restart and it will not apply for
> ovn-controller restart, right?
>
>
Right, but the proposal is to provide a command just to reconnect, without
restarting. In that case fast-resync should work.


> If the ovsdb-client (ovn-controller) restarts, then it would have lost all
> its state and when it starts again it will still need to download
> logical_flows, port_bindings , and other tables it cares about. So, fast
> re-sync may not apply to this case.
>
> Also, the ovn-controller should stash the IP address of the SB server to
> which it is connected to in Open_vSwitch table's external_id column. It
> updates this field whenever it re-connects to a different SB server
> (because that ovsdb-server instance failed or restarted). When
> ovn-controller itself restarts it could check for the value in this field
> and try to connect to it first and on failure fallback to connect to
> default connection approach.
>

The imbalance is usually caused by failover on server side. When one server
is down, all clients are expected to connect to the rest of the servers,
and when the server is back, there is no motivation for the clients to
reconnect again (unless you purposely restart the clients, which would
bring 1/3 of the restarted clients back to the old server). So I don't
understand how "stash the IP address" would work in this scenario.

The proposal above by Winson is to purposely trigger a reconnection towards
the desired server without restarting the clients, which I think solves
this problem directly.

Thanks,
Han
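
(Purely as an illustration of the proposal, not an existing command: the idea
is an operator- or orchestrator-driven call along the lines of

  ovs-appctl -t ovn-controller reconnect tcp:10.6.20.85:6642

pointing the chassis at the under-loaded server, with fast-resync avoiding a
full DB download.)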


>
> Regards,
> ~Girish
>
>
>
>
>>
>> Han
>>
>>
>>
>>>
>>> --
>>> Winson
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "ovn-kubernetes" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to ovn-kubernetes+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/ovn-kubernetes/CAMu6iS--iOW0LxxtkOhJpRT49E-9bJVy0iXraC1LMDUWeu6kLA%40mail.gmail.com
>>> <https://groups.google.com/d/msgid/ovn-kubernetes/CAMu6iS--iOW0LxxtkOhJpRT49E-9bJVy0iXraC1LMDUWeu6kLA%40mail.gmail.com?utm_medium=email_source=footer>
>>> .
>>>
>> ___
>> discuss mailing list
>> disc...@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>>
> --
> You received this message because you are subscribed to the Google Groups
> "ovn-kubernetes" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to ovn-kubernetes+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STTrZb%2BNo8%2B3%3DOJcMqd6T_1sS5bm-xnF6v_P4%2B2uqKtZAQ%40mail.gmail.com
> <https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STTrZb%2BNo8%2B3%3DOJcMqd6T_1sS5bm-xnF6v_P4%2B2uqKtZAQ%40mail.gmail.com?utm_medium=email_source=footer>
> .
>


Re: [ovs-discuss] ovn-k8s scale: how to make new ovn-controller process keep the previous Open Flow in br-int

2020-08-05 Thread Han Zhou
On Wed, Aug 5, 2020 at 12:58 PM Winson Wang  wrote:

> Hello OVN Experts,
>
> With ovn-k8s,  we need to keep the flows always on br-int which needed by
> running pods on the k8s node.
> Is there an ongoing project to address this problem?
> If not,  I have one proposal not sure if it is doable.
> Please share your thoughts.
> The issue:
>
> In a large-scale ovn-k8s cluster there are 200K+ OpenFlow flows on br-int
> on every k8s node.  When we restart ovn-controller for an upgrade using
> `ovs-appctl -t ovn-controller exit --restart`, existing traffic still
> works fine since br-int still has the flows installed.
>
> However, when the new ovn-controller starts it connects to the OVS IDL and
> does an engine init run, clearing all OpenFlow flows and installing flows
> based on the SB DB.
>
> With a flow count above 200K, it took more than 15 seconds to get all the
> flows installed on the br-int bridge again.
>
> Proposal solution for the issue:
>
> When ovn-controller gets “exit --restart”, it writes an “ovs-cond-seqno”
> to the OVS IDL, storing the value in the Open_vSwitch table's external-ids
> column. When the new ovn-controller starts, it checks whether
> “ovs-cond-seqno” exists in the Open_vSwitch table and gets the seqno from
> the OVS IDL to decide whether it should force a recompute.
>
>
Hi Winson,

Thanks for the proposal. Yes, the connection break during upgrading is a
real issue in a large scale environment. However, the proposal doesn't
work. The "ovs-cond-seqno" is for the OVSDB IDL for the local conf DB,
which is a completely different connection from the ovs-vswitchd open-flow
connection.
To avoid clearing the OpenFlow table during ovn-controller startup, we can
find a way to postpone clearing the OVS flows until the recompute in
ovn-controller is completed, right before ovn-controller replaces them with
the new flows. This should largely reduce the window of broken connectivity
during upgrading. Some changes in the ofctrl module's state machine are
required, but I am not 100% sure this approach is applicable. Need to check
more details.

Thanks,
Han
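
A minimal way to collect per-second flow-count samples like the ones in the
test log below (just a sketch; it assumes ovs-ofctl and the br-int bridge
are available on the node):

  while true; do
      # the NXST_AGGREGATE reply carries packet_count, byte_count, flow_count
      ovs-ofctl dump-aggregate br-int | grep flow_count
      sleep 1
  done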
Test log:

> Check flow cnt on br-int every second:
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=0
>
> packet_count=0 byte_count=0 flow_count=10322
>
> packet_count=0 byte_count=0 flow_count=34220
>
> packet_count=0 byte_count=0 flow_count=60425
>
> packet_count=0 byte_count=0 flow_count=82506
>
> packet_count=0 byte_count=0 flow_count=106771
>
> packet_count=0 byte_count=0 flow_count=131648
>
> packet_count=2 byte_count=120 flow_count=158303
>
> packet_count=29 byte_count=1693 flow_count=185999
>
> packet_count=188 byte_count=12455 flow_count=212764
>
>
>
> --
> Winson
>


Re: [ovs-discuss] OVN Scale with RAFT: how to make raft cluster clients to balanced state again

2020-08-05 Thread Han Zhou
On Wed, Aug 5, 2020 at 12:51 PM Winson Wang  wrote:

> Hello OVN Experts:
>
> In a large-scale ovn-k8s cluster, there are several conditions that can
> move the ovn-controller clients' connections to the SB central from a
> balanced state to an unbalanced state.
> Is there an ongoing project to address this problem?
> If not,  I have one proposal not sure if it is doable.
> Please share your thoughts.
>
> The issue:
>
> With an OVN SB RAFT 3-node cluster, at first all the ovn-controller
> clients connect to the 3 nodes in a balanced state.
>
> The following conditions will make the connections become unbalanced.
>
>- One RAFT node restarts; all the ovn-controller clients reconnect to
>  the two remaining cluster nodes.
>
>- In ovn-k8s, after a rolling upgrade of the SB raft pods, the last raft
>  pod has no client connections.
>
>
> RAFT clients in an unbalanced state put more stress on the raft cluster,
> which makes the cluster less stable under load than it is in a balanced
> state.
> The proposal solution:
>
> Add a new ovn-controller unix command, “reconnect”, that takes the
> preferred SB node IP as an argument.
>
> When an unbalanced state happens, this command can trigger ovn-controller
> to reconnect to a new SB raft node with fast resync, which does not
> trigger the whole DB download process.
>
>
Thanks Winson. The proposal sounds good to me. Will you implement it?

Han



>
> --
> Winson
>


Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-04 Thread Han Zhou
On Tue, Aug 4, 2020 at 2:50 PM Tony Liu  wrote:

> Hi,
>
> Since I have 3 OVN DB nodes, should I add 3 rows in connection table
> for the inactivity_probe? Or put 3 addresses into one row?
>
> "set-connection" set one row only, and there is no "add-connection".
> How should I add 3 rows into the table connection?
>
>
You only need to set one row. Try this command:

ovn-nbctl -- --id=@conn_uuid create Connection target="ptcp\:6641\:0.0.0.0"
inactivity_probe=0 -- set NB_Global . connections=@conn_uuid


> Thanks!
>
> Tony
>
> > -Original Message-
> > From: Numan Siddique 
> > Sent: Tuesday, August 4, 2020 12:36 AM
> > To: Tony Liu 
> > Cc: ovs-discuss ; ovs-dev  > d...@openvswitch.org>
> > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> >
> >
> >
> > On Tue, Aug 4, 2020 at 9:12 AM Tony Liu  >  > wrote:
> >
> >
> >   In my deployment, on each Neutron server, there are 13 Neutron
> > server processes.
> >   I see 12 of them (monitor, maintenance, RPC, API) connect to both
> > ovn-nb-db
> >   and ovn-sb-db. With 3 Neutron server nodes, that's 36 OVSDB
> clients.
> >   Is so many clients OK?
> >
> >   Any suggestions how to figure out which side doesn't respond the
> > probe,
> >   if it's bi-directional? I don't see any activities from logging,
> > other than
> >   connect/drop and reconnect...
> >
> >   BTW, please let me know if this is not the right place to discuss
> > Neutron OVN
> >   ML2 driver.
> >
> >
> >   Thanks!
> >
> >   Tony
> >
> >   > -Original Message-
> >   > From: dev mailto:ovs-dev-
> > boun...@openvswitch.org> > On Behalf Of Tony Liu
> >   > Sent: Monday, August 3, 2020 7:45 PM
> >   > To: ovs-discuss mailto:ovs-
> > disc...@openvswitch.org> >; ovs-dev  >   > d...@openvswitch.org  >
> >   > Subject: [ovs-dev] [OVN] no response to inactivity probe
> >   >
> >   > Hi,
> >   >
> >   > Neutron OVN ML2 driver was disconnected by ovn-nb-db. There are
> > many
> >   > error messages from ovn-nb-db leader.
> >   > 
> >   > 2020-08-04T02:31:39.751Z|03138|reconnect|ERR|tcp:
> 10.6.20.81:58620
> >  : no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:42.484Z|03139|reconnect|ERR|tcp:
> 10.6.20.81:58300
> >  : no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:49.858Z|03140|reconnect|ERR|tcp:
> 10.6.20.81:59582
> >  : no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:53.057Z|03141|reconnect|ERR|tcp:
> 10.6.20.83:42626
> >  : no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:53.058Z|03142|reconnect|ERR|tcp:
> 10.6.20.82:45412
> >  : no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:54.067Z|03143|reconnect|ERR|tcp:
> 10.6.20.81:59416
> >  : no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:54.809Z|03144|reconnect|ERR|tcp:
> 10.6.20.81:60004
> >  : no
> >   > response to inactivity probe after 5 seconds, disconnecting
> > 
> >   >
> >   > Could anyone share a bit details how this inactivity probe works?
> >
> >
> >
> > The inactivity probe is sent by both the server and clients
> > independently.
> > Meaning ovsdb-server will send an inactivity probe every 'x' configured
> > seconds to all its connected clients and if it doesn't get a reply from
> > the client within some time, it disconnects the connection.
> >
> > The inactivity probe from the server side can be configured. Run "ovn-
> > nbctl list connection"
> > and you will see the inactivity_probe column. You can set this column to
> > the desired value, e.g. - ovn-nbctl set connection . inactivity_probe=30000
> > (for 30 seconds; the value is in milliseconds)
> >
> > The same thing for SB ovsdb-server.
> >
> > Similarly each client (ovn-northd, ovn-controller, neutron server) sends
> > inactivity probe every 'y' seconds and if the client doesn't get any
> > reply from ovsdb-server it will disconnect the connection and reconnect
> > again.
> >
> > For ovn-northd you can configure this as - ovn-nbctl set NB_Global .
> > options:northd_probe_interval=30000
> >
> > For ovn-controllers - ovs-vsctl set open . external_ids:ovn-remote-
> > probe-interval=30000
> >
> > There is also a probe interval for openflow connection from ovn-
> > controller to ovs-vswitchd which you can configure as ovs-vsctl set
> > open . external_ids:ovn-openflow-probe-interval=30 (this is in seconds)
> >
> >
> > Regarding the neutron server I think it is set to 60 seconds. Please see
> > this -
> > 

Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Han Zhou
On Tue, Aug 4, 2020 at 11:40 AM Tony Liu  wrote:

> Inline...
>
> Thanks!
>
> Tony
> > -Original Message-----
> > From: Han Zhou 
> > Sent: Tuesday, August 4, 2020 11:01 AM
> > To: Numan Siddique ; Ben Pfaff ; Leonid
> > Ryzhyk 
> > Cc: Tony Liu ; Han Zhou ; ovs-
> > dev ; ovs-discuss 
> > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> > configuration update
> >
> >
> >
> > On Tue, Aug 4, 2020 at 12:38 AM Numan Siddique  > <mailto:num...@ovn.org> > wrote:
> >
> >
> >
> >
> >   On Tue, Aug 4, 2020 at 9:02 AM Tony Liu  > <mailto:tonyliu0...@hotmail.com> > wrote:
> >
> >
> >   The probe awakes recomputing?
> >   There is probe every 5 seconds. Without any connection
> > up/down or failover,
> >   ovn-northd will recompute everything every 5 seconds, no
> > matter what?
> >   Really?
> >
> >   Anyways, I will increase the probe interval for now, see if
> > that helps.
> >
> >
> >
> >   I think we should optimise this case. I am planning to look into
> > this.
> >
> >   Thanks
> >   Numan
> >
> >
> > Thanks Numan.
> > I'd like to discuss more on this before we move forward to change
> > anything.
> >
> > 1) Regarding the problem itself, the CPU cost triggered by OVSDB IDLE
> > probe when there is no configuration change to compute, I don't think it
> > matters that much in real production. It simply wastes CPU cycles when
> > there is nothing to do, so what harm would it do here? For ovn-northd,
> > since it is the centralized component, we would always ensure there is
> > enough CPU available for ovn-north when computing is needed, and this
> > reservation will be wasted anyway when there is no change to compute. So,
> > I'd avoid making any change specifically only to address this issue. I
> > could be wrong, though. I'd like to hear what would be the real concern
> > if this is not addressed.
>
> Are more vCPUs going to help here? Is ovn-northd multi-threaded?
>
>
ovn-northd is single threaded. It can be changed to have a separate thread
for the probe handling, but I don't see any obvious benefit.


> I am probably still missing something here. The probe is there at all
> times, every 5s.


The probe is sent only if there is no activity on the OVSDB connection
during the interval, that's why it is called "IDLE" probe. If there is
already interaction during the past interval, no probe will be sent.


> If ovn-northd is in the middle of a computing, is a probe going
> to make ovn-northd restart the computing?


No, it won't. Firstly, it is unlikely that a probe is received during
computing, unless the probe interval is set too short. Secondly, even when
it happens, the current computing will complete and all needed changes will
be applied to the SB DB regardless of the probe received during the
computing. The probe will be handled in the next round of the main loop, and
it will trigger another round of computing, which is useless but not harmful
either. There is one case I can think of that causes a little latency: when
another NB DB change (say, change2) comes in during the computing triggered
by the probe, the handling of change2 will be delayed a little until the
computing triggered by the probe completes. But the chance is rather low,
especially if the probe interval is enlarged, and even in that unlucky case
the impact is just a small delay in the change handling.


> Or the probe only triggers
> computing when ovn-northd is idle? Even with the latter case, what's the
> intention to trigger computing by probe?
>
>
It is not triggered intentionally for the probe. It is just that
ovn-northd doesn't distinguish whether it was woken up by a probe only or
whether there are changes that need to be processed. Many events can wake up
ovn-northd, and once it is woken up it will compute everything. I agree it
can be optimized (we already optimized this for ovn-controller). I am just
wondering whether it is worth optimizing specifically, or whether we just
get it for free as a byproduct when implementing incremental processing,
which is already on the roadmap.

Does this clarify a little?

Thanks,
Han


> >
> > 2) ovn-northd incremental processing would avoid this CPU problem
> > naturally. So let's discuss how to move forward for incremental
> > processing, which is much more important because it also improves CPU
> > efficiency when handling changes, and the IDLE probe problem is just
> > a byproduct. I believe the DDlog branch would have solved this problem.
> > However, it

Re: [ovs-discuss] [ovs-dev] [OVN] stale data complained by ovn-controller after db restore

2020-08-04 Thread Han Zhou
On Tue, Aug 4, 2020 at 11:31 AM Tony Liu  wrote:

> Is there any difference to restore DB on existing cluster vs. fresh
> cluster,
> in terms of performance?
>
> If I don't have to restore on fresh cluster, which is recommended?
>
I would suggest restoring directly on top of the existing cluster instead
of creating a fresh cluster.


> For now, since ovn-northd always recomputes the whole DB, I guess not much
> difference?
>
> With incremental-process, would restoring to a fresh cluster be better?
>
> No.


> Is it necessary to stop or restart ovn-northd during DB restore?
>
> No.


>
> Thanks!
>
> Tony
>
> > -Original Message-
> > From: Han Zhou 
> > Sent: Tuesday, August 4, 2020 11:13 AM
> > To: Tony Liu 
> > Cc: ovs-discuss ; ovs-dev  > d...@openvswitch.org>
> > Subject: Re: [ovs-dev] [OVN] stale data complained by ovn-controller
> > after db restore
> >
> >
> >
> > On Tue, Aug 4, 2020 at 10:30 AM Tony Liu  > <mailto:tonyliu0...@hotmail.com> > wrote:
> >
> >
> >   Hi,
> >
> >   Here is how I restore OVN DB.
> >   * Stop all ovn-nb-db, ovn-sb-db and ovn-northd services.
> >   * Clean up all DB files.
> >   * Start all DB services. Fresh ovn-nb-db and ovn-sb-db clusters are
> > up and
> > running.
> >   * Set DB election timer to 10s.
> >   * Restore DB to ovn-nb-db by ovsdb-client.
> >   * Start all ovn-northd services.
> >
> >   A few minutes after, ovn-sb-db is fully synced with ovn-nb-db.
> >
> >   Now, the client of ovn-sb-db, ovn-controller and nova-compute
> > complaint about
> >   "stale data". The chassis node is not getting updated.
> >   
> >   2020-08-04 09:07:45.892 26 INFO ovsdbapp.backend.ovs_idl.vlog [-]
> > tcp:10.6.20.84:6642 <http://10.6.20.84:6642> : connected
> >   2020-08-04 09:07:45.895 26 WARNING ovsdbapp.backend.ovs_idl.vlog
> [-]
> > tcp:10.6.20.84:6642 <http://10.6.20.84:6642> : clustered database server
> > has stale data; trying another server
> >   
> >
> >   Restarting ovn-controller and nova-compute resolve the issue.
> >
> >   Is this expected? As part of the DB restore process, should I
> > restart
> >   ovn-controller and nova-compute on all chassis node?
> >
> >
> >
> >
> > Yes, this is expected if you freshly start a new cluster. (It wouldn't
> > happen if you simply restore the old data on the existing cluster.
> > However, I understand that the scenario of restoring data on a freshly
> > created cluster is a valid use case).
> > For this case, you could either restart ovn-controller, or trigger a
> > client side raft index reset by:
> > ovn-appctl -t ovn-controller sb-cluster-state-reset
> >
> > Similarly for ovn-northd:
> > ovn-appctl -t ovn-northd nb-cluster-state-reset
> > ovn-appctl -t ovn-northd sb-cluster-state-reset
> >
> > To use this command, you will need at least 20.06 of OVN and OVS master.
> >
> >
> > Thanks,
> > Han
> >
> >
> >
> >
> >   Thanks!
> >
> >   Tony
> >
>
>


Re: [ovs-discuss] [ovs-dev] [OVN] stale data complained by ovn-controller after db restore

2020-08-04 Thread Han Zhou
On Tue, Aug 4, 2020 at 10:30 AM Tony Liu  wrote:

> Hi,
>
> Here is how I restore OVN DB.
> * Stop all ovn-nb-db, ovn-sb-db and ovn-northd services.
> * Clean up all DB files.
> * Start all DB services. Fresh ovn-nb-db and ovn-sb-db clusters are up and
>   running.
> * Set DB election timer to 10s.
> * Restore DB to ovn-nb-db by ovsdb-client.
> * Start all ovn-northd services.
>
> A few minutes after, ovn-sb-db is fully synced with ovn-nb-db.
>
> Now, the client of ovn-sb-db, ovn-controller and nova-compute complaint
> about
> "stale data". The chassis node is not getting updated.
> 
> 2020-08-04 09:07:45.892 26 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:
> 10.6.20.84:6642: connected
> 2020-08-04 09:07:45.895 26 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:
> 10.6.20.84:6642: clustered database server has stale data; trying another
> server
> 
>
> Restarting ovn-controller and nova-compute resolve the issue.
>
> Is this expected? As part of the DB restore process, should I restart
> ovn-controller and nova-compute on all chassis node?
>
>
Yes, this is expected if you freshly start a new cluster. (It wouldn't
happen if you simply restore the old data on the existing cluster. However,
I understand that the scenario of restoring data on a freshly created
cluster is a valid use case).
For this case, you could either restart ovn-controller, or trigger a client
side raft index reset by:
ovn-appctl -t ovn-controller sb-cluster-state-reset

Similarly for ovn-northd:
ovn-appctl -t ovn-northd nb-cluster-state-reset
ovn-appctl -t ovn-northd sb-cluster-state-reset

To use this command, you will need at least 20.06 of OVN and OVS master.

Thanks,
Han



> Thanks!
>
> Tony
>
>


Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-03 Thread Han Zhou
Sorry that I didn't make it clear enough. The OVSDB probe itself doesn't
take much CPU, but the probe wakes up the ovn-northd main loop, which
recomputes everything; that is why you see the CPU spike.
It will be solved by incremental processing, where only the delta is
processed, and in the case of probe handling there is no change in the
configuration, so the delta is zero.
For now, please follow the steps to adjust the probe interval if the CPU
usage of ovn-northd (when there is no configuration change) is a concern for
you. But please remember that this has no impact on the real CPU usage for
handling configuration changes.

Thanks,
Han

On Mon, Aug 3, 2020 at 8:11 PM Tony Liu  wrote:

> Health check (5 sec interval) taking 30%-100% CPU is definitely not
> acceptable, if that's really the case. There must be some blocking (and
> not yielding CPU) in the code, which is not supposed to be there.
>
> Could you point me to the coding for such health check?
> Is it single thread? Does it use any event library?
>
>
> Thanks!
>
> Tony
>
> > -Original Message-
> > From: Han Zhou 
> > Sent: Saturday, August 1, 2020 9:11 PM
> > To: Tony Liu 
> > Cc: ovs-discuss ; ovs-dev  > d...@openvswitch.org>
> > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> > configuration update
> >
> >
> >
> > On Fri, Jul 31, 2020 at 4:14 PM Tony Liu  > <mailto:tonyliu0...@hotmail.com> > wrote:
> >
> >
> >   Hi,
> >
> >   I see the active ovn-northd takes much CPU (30% - 100%) when there
> > is no
> >   configuration from OpenStack, nothing happening on all chassis
> > nodes either.
> >
> >   Is this expected? What is it busy with?
> >
> >
> >
> >
> > Yes, this is expected. It is due to the OVSDB probe between ovn-northd
> > and NB/SB OVSDB servers, which is used to detect the OVSDB connection
> > failure.
> > Usually this is not a concern (unlike the probe with a large number of
> > ovn-controller clients), because ovn-northd is a centralized component
> > and the CPU cost when there is no configuration change doesn't matter
> > that much. However, if it is a concern, the probe interval (default 5
> > sec) can be changed.
> > If you change, remember to change on both server side and client side.
> > For client side (ovn-northd), it is configured in the NB DB's NB_Global
> > table's options:northd_probe_interval. See man page of ovn-nb(5).
> > For server side (NB and SB), it is configured in the NB and SB DB's
> > Connection table's inactivity_probe column.
> >
> > Thanks,
> > Han
> >
> >
> >
> >   
> >   2020-07-31T23:08:09.511Z|04267|poll_loop|DBG|wakeup due to [POLLIN]
> > on fd 8 (10.6.20.84:44358 <http://10.6.20.84:44358> <->10.6.20.84:6641
> > <http://10.6.20.84:6641> ) at lib/stream-fd.c:157 (68% CPU usage)
> >   2020-07-31T23:08:09.512Z|04268|jsonrpc|DBG|tcp:10.6.20.84:6641
> > <http://10.6.20.84:6641> : received request, method="echo", params=[],
> > id="echo"
> >   2020-07-31T23:08:09.512Z|04269|jsonrpc|DBG|tcp:10.6.20.84:6641
> > <http://10.6.20.84:6641> : send reply, result=[], id="echo"
> >   2020-07-31T23:08:12.777Z|04270|poll_loop|DBG|wakeup due to [POLLIN]
> > on fd 9 (10.6.20.84:49158 <http://10.6.20.84:49158> <->10.6.20.85:6642
> > <http://10.6.20.85:6642> ) at lib/stream-fd.c:157 (34% CPU usage)
> >   2020-07-31T23:08:12.777Z|04271|reconnect|DBG|tcp:10.6.20.85:6642
> > <http://10.6.20.85:6642> : idle 5002 ms, sending inactivity probe
> >   2020-07-31T23:08:12.777Z|04272|reconnect|DBG|tcp:10.6.20.85:6642
> > <http://10.6.20.85:6642> : entering IDLE
> >   2020-07-31T23:08:12.777Z|04273|jsonrpc|DBG|tcp:10.6.20.85:6642
> > <http://10.6.20.85:6642> : send request, method="echo", params=[],
> > id="echo"
> >   2020-07-31T23:08:12.777Z|04274|jsonrpc|DBG|tcp:10.6.20.85:6642
> > <http://10.6.20.85:6642> : received request, method="echo", params=[],
> > id="echo"
> >   2020-07-31T23:08:12.777Z|04275|reconnect|DBG|tcp:10.6.20.85:6642
> > <http://10.6.20.85:6642> : entering ACTIVE
> >   2020-07-31T23:08:12.777Z|04276|jsonrpc|DBG|tcp:10.6.20.85:6642
> > <http://10.6.20.85:6642> : send reply, result=[], id="echo"
> >   2020-07-31T23:08:13.635Z|04277|poll_loop|DBG|wakeup due to [POLLIN]
> > on fd 9 (10.6.20.84:49158 <http://10.6.20.84:49158> <->10.6.20.85:6642
> > <http://10.6.20.85:6642> ) at

Re: [ovs-discuss] [ovs-dev] [OVN] ovn-northd HA

2020-08-01 Thread Han Zhou
Hi Tony,

Please find my answers inlined.

On Sat, Aug 1, 2020 at 5:55 PM Tony Liu  wrote:

> When I restore 4096 LS, 4354 LSP, 256 LR and 256 LRP, (I clean up
> all DBs before restore.) it takes a few seconds to restore the nb-db.
> But onv-northd takes forever to update sb-db.
>
> I changed sb-db election timer from 1s to 10s. Then it takes just a
> few minutes for sb-db to get fully synced.
>
> How does that sb-db leader switch affect such sync?
>
Most likely it is because the SB DB was busy, which resulted in a timeout
for the RAFT election, and it kept doing leader elections (this can be
confirmed by checking the "term" number), thus it never got synced. When you
changed the timer to 10s, it could complete the work without leader
flapping.
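
For example, the election state, including the current term, can be checked
with something like the following (the ctl socket path is an assumption and
depends on the installation):

  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound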


>
> Thanks!
>
> Tony
>
> > -Original Message-
> > From: dev  On Behalf Of Tony Liu
> > Sent: Saturday, August 1, 2020 5:26 PM
> > To: ovs-discuss ; ovs-dev  > d...@openvswitch.org>
> > Subject: [ovs-dev] [OVN] ovn-northd HA
> >
> > Hi,
> >
> > I have a few questions about ovn-northd HA.
> >
> > Does the lock for active ovn-northd have to be acquired from the leader
> > of sb-db?
>

Yes, because ovn-northd sets "leader_only" to true for the connection. I
remember it is also required that all lock participants must connect to the
leader for OVSDB lock to work properly.


> >
> > If ovn-northd didn't acquire the lock, it becomes standby. Does it keep
> > trying to acquire the lock, or wait for notification, or monitor the
> > active ovn-northd?
>

It is based on OVSDB notification.

>
> > If it keeps trying, what's the period?
> >
> > Say the active ovn-northd is down, the connection to sb-db is down, sb-
> > db releases the lock, so another ovn-northd can acquire it.
> > Is that correct?
> >
>
Yes

> When sb-db is busy, the connection from ovn-northd is dropped. Not sure
> > from which side it's dropped. And that triggers active ovn-northd switch.
> > Is that right?
> >
>
It is possible, but the same northd may get the lock again, if it is lucky.


> > In case that sb-db leader switchs, is that going to cause active ovn-
> > northd switch as well?
> >
>
It is possible, but the same northd may get the lock again, if it is lucky.


> > For whatever reason, in case active ovn-northd switches, is the new
> > active ovn-northd going to continue the work left by the previous leader,
> > or start all over again?
> >
>

Even for the same ovn-northd, it always recomputes everything in response
to any change. So during a switchover, the new active ovn-northd doesn't
need to "continue" - it just recomputes everything as usual.
When incremental processing is implemented, the new active ovn-northd may
need to do a recompute first and then handle the further changes
incrementally.
In either case, there is no need to "continue the work left by the
previous leader".

Thanks,
Han

>
> > Thanks!
> >
> > Tony
> >
>


Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-01 Thread Han Zhou
On Fri, Jul 31, 2020 at 4:14 PM Tony Liu  wrote:

> Hi,
>
> I see the active ovn-northd takes much CPU (30% - 100%) when there is no
> configuration from OpenStack, nothing happening on all chassis nodes
> either.
>
> Is this expected? What is it busy with?
>
>
Yes, this is expected. It is due to the OVSDB probe between ovn-northd and
NB/SB OVSDB servers, which is used to detect the OVSDB connection failure.
Usually this is not a concern (unlike the probe with a large number of
ovn-controller clients), because ovn-northd is a centralized component and
the CPU cost when there is no configuration change doesn't matter that
much. However, if it is a concern, the probe interval (default 5 sec) can
be changed.
If you change, remember to change on both server side and client side.
For client side (ovn-northd), it is configured in the NB DB's NB_Global
table's options:northd_probe_interval. See man page of ovn-nb(5).
For server side (NB and SB), it is configured in the NB and SB DB's
Connection table's inactivity_probe column.
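
For example (values are in milliseconds; this assumes a single row in each
Connection table and is only meant to show where the knobs live):

  # client side: ovn-northd probe towards the NB/SB servers
  ovn-nbctl set NB_Global . options:northd_probe_interval=30000
  # server side: NB and SB ovsdb-server probes towards their clients
  ovn-nbctl set connection . inactivity_probe=30000
  ovn-sbctl set connection . inactivity_probe=30000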

Thanks,
Han


> 
> 2020-07-31T23:08:09.511Z|04267|poll_loop|DBG|wakeup due to [POLLIN] on fd
> 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (68% CPU
> usage)
> 2020-07-31T23:08:09.512Z|04268|jsonrpc|DBG|tcp:10.6.20.84:6641: received
> request, method="echo", params=[], id="echo"
> 2020-07-31T23:08:09.512Z|04269|jsonrpc|DBG|tcp:10.6.20.84:6641: send
> reply, result=[], id="echo"
> 2020-07-31T23:08:12.777Z|04270|poll_loop|DBG|wakeup due to [POLLIN] on fd
> 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (34% CPU
> usage)
> 2020-07-31T23:08:12.777Z|04271|reconnect|DBG|tcp:10.6.20.85:6642: idle
> 5002 ms, sending inactivity probe
> 2020-07-31T23:08:12.777Z|04272|reconnect|DBG|tcp:10.6.20.85:6642:
> entering IDLE
> 2020-07-31T23:08:12.777Z|04273|jsonrpc|DBG|tcp:10.6.20.85:6642: send
> request, method="echo", params=[], id="echo"
> 2020-07-31T23:08:12.777Z|04274|jsonrpc|DBG|tcp:10.6.20.85:6642: received
> request, method="echo", params=[], id="echo"
> 2020-07-31T23:08:12.777Z|04275|reconnect|DBG|tcp:10.6.20.85:6642:
> entering ACTIVE
> 2020-07-31T23:08:12.777Z|04276|jsonrpc|DBG|tcp:10.6.20.85:6642: send
> reply, result=[], id="echo"
> 2020-07-31T23:08:13.635Z|04277|poll_loop|DBG|wakeup due to [POLLIN] on fd
> 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (34% CPU
> usage)
> 2020-07-31T23:08:13.635Z|04278|jsonrpc|DBG|tcp:10.6.20.85:6642: received
> reply, result=[], id="echo"
> 2020-07-31T23:08:14.480Z|04279|hmap|DBG|Dropped 129 log messages in last 5
> seconds (most recently, 0 seconds ago) due to excessive rate
> 2020-07-31T23:08:14.480Z|04280|hmap|DBG|lib/shash.c:112: 2 buckets with 6+
> nodes, including 2 buckets with 6 nodes (32 nodes total across 32 buckets)
> 2020-07-31T23:08:14.513Z|04281|poll_loop|DBG|wakeup due to 27-ms timeout
> at lib/reconnect.c:643 (34% CPU usage)
> 2020-07-31T23:08:14.513Z|04282|reconnect|DBG|tcp:10.6.20.84:6641: idle
> 5001 ms, sending inactivity probe
> 2020-07-31T23:08:14.513Z|04283|reconnect|DBG|tcp:10.6.20.84:6641:
> entering IDLE
> 2020-07-31T23:08:14.513Z|04284|jsonrpc|DBG|tcp:10.6.20.84:6641: send
> request, method="echo", params=[], id="echo"
> 2020-07-31T23:08:15.370Z|04285|poll_loop|DBG|wakeup due to [POLLIN] on fd
> 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (34% CPU
> usage)
> 2020-07-31T23:08:15.370Z|04286|jsonrpc|DBG|tcp:10.6.20.84:6641: received
> request, method="echo", params=[], id="echo"
> 2020-07-31T23:08:15.370Z|04287|reconnect|DBG|tcp:10.6.20.84:6641:
> entering ACTIVE
> 2020-07-31T23:08:15.370Z|04288|jsonrpc|DBG|tcp:10.6.20.84:6641: send
> reply, result=[], id="echo"
> 2020-07-31T23:08:16.236Z|04289|poll_loop|DBG|wakeup due to 0-ms timeout at
> tcp:10.6.20.84:6641 (100% CPU usage)
> 2020-07-31T23:08:16.236Z|04290|jsonrpc|DBG|tcp:10.6.20.84:6641: received
> reply, result=[], id="echo"
> 2020-07-31T23:08:17.778Z|04291|poll_loop|DBG|wakeup due to [POLLIN] on fd
> 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (100% CPU
> usage)
> 2020-07-31T23:08:17.778Z|04292|jsonrpc|DBG|tcp:10.6.20.85:6642: received
> request, method="echo", params=[], id="echo"
> 2020-07-31T23:08:17.778Z|04293|jsonrpc|DBG|tcp:10.6.20.85:6642: send
> reply, result=[], id="echo"
> 2020-07-31T23:08:20.372Z|04294|poll_loop|DBG|wakeup due to [POLLIN] on fd
> 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (41% CPU
> usage)
> 2020-07-31T23:08:20.372Z|04295|reconnect|DBG|tcp:10.6.20.84:6641: idle
> 5002 ms, sending inactivity probe
> 2020-07-31T23:08:20.372Z|04296|reconnect|DBG|tcp:10.6.20.84:6641:
> entering IDLE
> 2020-07-31T23:08:20.372Z|04297|jsonrpc|DBG|tcp:10.6.20.84:6641: send
> request, method="echo", params=[], id="echo"
> 2020-07-31T23:08:20.372Z|04298|jsonrpc|DBG|tcp:10.6.20.84:6641: received
> request, method="echo", params=[], id="echo"
> 2020-07-31T23:08:20.372Z|04299|reconnect|DBG|tcp:10.6.20.84:6641:
> entering ACTIVE
> 

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-31 Thread Han Zhou
On Thu, Jul 30, 2020 at 11:07 PM Tony Liu  wrote:

> Hi Han,
>
> ovsdb-client backup and restore work as expected. Sorry for the false
> alarm. I messed up with the container. When restoring the snapshot for
> nb-db, sb-db is updated accordingly by ovn-northd.
>
> Great!


> I think this man page should be updated saying RAFT cluster is also
> supported
> by backup and restore.
> http://www.openvswitch.org/support/dist-docs/ovsdb-client.1.txt
>
I didn't see it saying that RAFT clusters are not supported in the above
document. Probably you misunderstood this statement:

"Reads  snapshot,  which  must  be  a OVSDB standalone or
active-backup  database"

The backup file you generated with the ovsdb-client backup command is in the
OVSDB standalone format, which is what the "backup" documentation refers to.

In addition, the document ovsdb(7) also makes it clear that this is the
right way to back up and restore a clustered DB.
Did this clarify?
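
For reference, the ovsdb(7) flow is roughly as follows (the socket paths are
an assumption and depend on the installation):

  # take a standalone-format snapshot of the clustered NB DB
  ovsdb-client backup unix:/var/run/ovn/ovnnb_db.sock OVN_Northbound > nb_backup.db

  # later, restore it into the running cluster
  ovsdb-client restore unix:/var/run/ovn/ovnnb_db.sock OVN_Northbound < nb_backup.db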


> Thanks!
>
> Tony
> > -Original Message-
> > From: Han Zhou 
> > Sent: Thursday, July 30, 2020 7:19 PM
> > To: Tony Liu 
> > Cc: Numan Siddique ; Han Zhou ; ovs-
> > dev ; ovs-discuss 
> > Subject: Re: [ovs-discuss] [OVN] DB backup and restore
> >
> >
> >
> > On Thu, Jul 30, 2020 at 7:04 PM Tony Liu  > <mailto:tonyliu0...@hotmail.com> > wrote:
> >
> >
> >   Hi,
> >
> >
> >
> >   Just update, finally make this snapshot/rollback work for me.
> >
> >   The rollback is not live though. Here is what I did.
> >
> >
> >
> >   1. Make a snapshot by ovsdb-client. Assuming no ongoing
> >
> >  Transactions, and data is consistent on all nodes. The
> >
> >  Snapshot can be done on any node. It doesn't include any
> >
> >  cluster info. That's probably why the man page says this is
> >
> >  for standalone and A/B only. But that cluster info seems
> >
> >  not required to restore.
> >
> >
> >
> >   2. To rollback/restore, stop services on all nodes, starting
> >
> >  from followers to the leader.
> >
> >
> >
> >   3. Pick a node as the new leader, copy snapshot to be the DB
> >
> >  file. Then start the service. A cluster with new cluster ID
> >
> >  will be created. The node will be allocated a new server ID
> >
> >  as well.
> >
> >
> >
> >   4. On the rest two nodes, remove the DB file, restart service
> >
> >  with remote-address pointing to the leader.
> >
> >
> >
> >   Now, the new cluster starts working with the rollback data.
> >
> >
> > The steps you gave may work, but it is weird. It is better to just
> > follow the steps mentioned in this section:
> >
> > https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7
> > .rst#backing-up-and-restoring-a-database
> >
> >
> >
> >
> >
> >
> >   "ovs-client restore" doesn't work for me, not sure why.
> >
> >   
> >
> >   ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type
> >
> >   
> >
> >   I tried to restore the snapshot created by backup, also the
> >
> >   Directly copied DB file, neither of them works. Wondering anyone
> >
> >   experienced such issue?
> >
> >
> >
> > Maybe your command was wrong. Could you share your command line, and the
> > version used?
> >
> >
> >
> >
> >
> >   To Numan, it would be great if you could share the details to use
> >
> >   Neutron-ovn-sync-util.
> >
> >
> >
> >
> >
> >   Thanks!
> >
> >
> >
> >   Tony
> >
> >
> >
> >   From: Tony Liu <mailto:tonyliu0...@hotmail.com>
> >   Sent: Thursday, July 30, 2020 4:51 PM
> >   To: Numan Siddique <mailto:nusid...@redhat.com> ; Han Zhou
> > <mailto:hz...@ovn.org>
> >   Cc: Han Zhou <mailto:hz...@ovn.org> ; ovs-dev <mailto:ovs-
> > d...@openvswitch.org> ; ovs-discuss <mailto:ovs-discuss@openvswitch.org>
> >   Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore
> >
> >
> >
> >   Hi Numan,
> >
> >   I found this comment you made a few years back.
> >
> >   - At neutron-server startup, OVN ML2 driver syncs the neutron
> >   DB and OVN DB if 

Re: [ovs-discuss] [ovs-dev] OVN: Two datapath-bindings are created for the same logical-switch

2020-07-30 Thread Han Zhou
resend as plain text, since I got "The message's content type was not
explicitly allowed" reply from ovs-dev-owner.

On Thu, Jul 30, 2020 at 7:30 PM Han Zhou  wrote:
>
>
>
> On Thu, Jul 30, 2020 at 7:24 PM Tony Liu  wrote:
>>
>> Hi Han,
>>
>>
>>
>> Continue with this thread. Regarding to your comment in another thread.
>>
>> ===
>>
>> 2) OVSDB clients usually monitors and syncs all (interested) data from
server to local, so when they do declarative processing, they could correct
problems by themselves. In fact, ovn-northd does the check and deletes
duplicated datapaths. I did a simple test and it did cleanup by itself:
>>
>> 2020-07-30T18:55:53.057Z|6|ovn_northd|INFO|ovn-northd lock acquired.
This ovn-northd instance is now active.
>> 2020-07-30T19:02:10.465Z|7|ovn_northd|INFO|deleting Datapath_Binding
abef9503-445e-4a52-ae88-4c826cbad9d6 with duplicate
external-ids:logical-switch/router ee80c38b-2016-4cbc-9437-f73e3a59369e
>>
>>
>>
>> I am not sure why in your case north was stuck, but I agree there must
be something wrong. Please collect northd logs if you encounter this again
so we can dig further.
>>
>> ===
>>
>>
>>
>> You are right that ovn-northd will try to clean up the duplication, but,
>>
>> there are ports in port-binding referencing to this datapath-binding, so
>>
>> ovn-northd fails to delete the datapath-binding. I have to manually
delete
>>
>> those ports to be able to delete the datapath-binding. I believe it’s not
>>
>> supported for ovn-northd to delete a configuration that is being
>>
>> referenced. Is that right? If yes, should we fix it or it's the
intention?
>>
>>
>
>
> Yes, good point!
> It is definitely a bug and we should fix it. I think the best fix is to
change the schema and add "logical_datapath" as an index, but we'll need to
make it backward compatible to avoid upgrade issues.
>


Re: [ovs-discuss] [ovs-dev] OVN: Two datapath-bindings are created for the same logical-switch

2020-07-30 Thread Han Zhou
On Thu, Jul 30, 2020 at 7:24 PM Tony Liu  wrote:

> Hi Han,
>
>
>
> Continue with this thread. Regarding to your comment in another thread.
>
> ===
>
> 2) OVSDB clients usually monitors and syncs all (interested) data from
> server to local, so when they do declarative processing, they could correct
> problems by themselves. In fact, ovn-northd does the check and deletes
> duplicated datapaths. I did a simple test and it did cleanup by itself:
>
> 2020-07-30T18:55:53.057Z|6|ovn_northd|INFO|ovn-northd lock acquired.
> This ovn-northd instance is now active.
> 2020-07-30T19:02:10.465Z|7|ovn_northd|INFO|deleting Datapath_Binding
> abef9503-445e-4a52-ae88-4c826cbad9d6 with duplicate
> external-ids:logical-switch/router ee80c38b-2016-4cbc-9437-f73e3a59369e
>
>
>
> I am not sure why in your case north was stuck, but I agree there must be
> something wrong. Please collect northd logs if you encounter this again so
> we can dig further.
>
> ===
>
>
>
> You are right that ovn-northd will try to clean up the duplication, but,
>
> there are ports in port-binding referencing this datapath-binding, so
>
> ovn-northd fails to delete the datapath-binding. I have to manually delete
>
> those ports to be able to delete the datapath-binding. I believe it’s not
>
> supported for ovn-northd to delete a configuration that is being
>
> referenced. Is that right? If yes, should we fix it, or is it the intention?
>
>
>

Yes, good point!
It is definitely a bug and we should fix it. I think the best fix is to
change the schema and add "logical_datapath" as an index, but we'll need to
make it backward compatible to avoid upgrade issues.
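
In the meantime, existing duplicates can be spotted directly from the SB DB,
e.g. with something like the following (a sketch, assuming ovn-sbctl is
available and the logical switch/router reference still lives in
external_ids):

  ovn-sbctl --columns=external_ids list Datapath_Binding \
      | grep logical | sort | uniq -d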


>
>
> Thanks!
>
>
>
> Tony
>
>
>
> *From: *Tony Liu 
> *Sent: *Thursday, July 23, 2020 7:51 PM
> *To: *Han Zhou ; Ben Pfaff 
> *Cc: *ovs-dev ; ovs-discuss@openvswitch.org
> *Subject: *Re: [ovs-discuss] [ovs-dev] OVN: Two datapath-bindings are
> created for the same logical-switch
>
>
>
> Hi Han,
>
>
>
> Thanks for taking the time to look into this. This problem is not
> consistently reproduced.
>
> Developers normally ignore it:) I think we collected enough context and we
> can let it go for now.
>
> I will rebuild setup, tune that RAFT heartbeat timer and rerun the test.
> Will keep you posted.
>
>
>
>
>
> Thanks again!
>
>
>
> Tony
>
>
>
> *From:* Han Zhou 
> *Sent:* July 23, 2020 06:53 PM
> *To:* Tony Liu ; Ben Pfaff 
> *Cc:* Numan Siddique ; ovs-dev ;
> ovs-discuss@openvswitch.org 
> *Subject:* Re: [ovs-dev] OVN: Two datapath-bindings are created for the
> same logical-switch
>
>
>
>
> On Thu, Jul 23, 2020 at 10:33 AM Tony Liu  wrote:
> >
> > Changed the title for this specific problem.
> > I looked into logs and have more findings.
> > The problem was happening when sb-db leader switched.
>
>
>
> Hi Tony,
>
>
>
> Thanks for this detailed information. Could you confirm which version of
> OVS is used (to understand OVSDB behavior).
>
>
>
> >
> > For the ovsdb cluster, what may trigger a leader switch? Given the log,
> > 2020-07-21T01:08:38.119Z|00074|raft|INFO|term 2: 1135 ms timeout
> > expired, starting election
> > the election is requested by a follower node. Is that because the
> > connection from follower to leader timed out, so the follower assumes
> > the leader is dead and starts an election?
>
>
>
> You are right, the RAFT heartbeat would time out when the server is too
> busy and the election timer is too small (default 1s). For a large scale
> test, please increase the election timer by:
>
> ovn-appctl -t <ctl> cluster/change-election-timer OVN_Southbound <value>
>
> I suggest to set <value> to at least 10000 or more in your case. (You need
> to increase the value gradually - 2000, 4000, 8000, 16000 - so it will
> take you 4 commands to reach this from the initial default value 1000, not
> very convenient, I know :)
>
> <ctl> here is the path to the socket ctl file of ovn-sb, usually under
> /var/run/ovn.
>
>
>
> >
>
> For ovn-northd (3 instances), they all connect to the sb-db leader;
> whoever has the lock is the master.
> When the sb-db leader switches, all ovn-northd instances look for the new
> leader. In this case, there is no guarantee that the old ovn-northd master
> keeps the role; another ovn-northd instance may find the leader and
> acquire the lock first. So, an sb-db leader switch may also cause an
> ovn-northd master switch.
> Such a switch may happen in the middle of an ovn-northd transaction, in
> that case, is there any guaran

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-30 Thread Han Zhou
On Thu, Jul 30, 2020 at 7:04 PM Tony Liu  wrote:

> Hi,
>
>
>
> Just an update: I finally made this snapshot/rollback work for me.
>
> The rollback is not live though. Here is what I did.
>
>
>
> 1. Make a snapshot by ovsdb-client. Assuming no ongoing transactions,
>    and data is consistent on all nodes. The snapshot can be done on any
>    node. It doesn't include any cluster info. That's probably why the man
>    page says this is for standalone and A/B only. But that cluster info
>    seems not required to restore.
>
> 2. To rollback/restore, stop services on all nodes, starting from the
>    followers to the leader.
>
> 3. Pick a node as the new leader, copy the snapshot to be the DB file.
>    Then start the service. A cluster with a new cluster ID will be
>    created. The node will be allocated a new server ID as well.
>
> 4. On the other two nodes, remove the DB file and restart the service
>    with remote-address pointing to the leader.
>
>
>
> Now, the new cluster starts working with the rollback data.
>

The steps you gave may work, but it is weird. It is better to just follow
the steps mentioned in this section:

https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst#backing-up-and-restoring-a-database


>
> "ovs-client restore" doesn't work for me, not sure why.
>
> 
>
> ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type
>
> 
>
> I tried to restore the snapshot created by backup, also the
>
> Directly copied DB file, neither of them works. Wondering anyone
>
> experienced such issue?
>
>
>
Maybe your command was wrong. Could you share your command line, and the
version used?


> To Numan, it would be great if you could share the details to use
>
> Neutron-ovn-sync-util.
>
>
>
>
>
> Thanks!
>
>
>
> Tony
>
>
>
> *From: *Tony Liu 
> *Sent: *Thursday, July 30, 2020 4:51 PM
> *To: *Numan Siddique ; Han Zhou 
> *Cc: *Han Zhou ; ovs-dev ;
> ovs-discuss 
> *Subject: *Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore
>
>
>
> Hi Numan,
>
> I found this comment you made a few years back.
>
> - At neutron-server startup, OVN ML2 driver syncs the neutron
> DB and OVN DB if sync mode is set to repair.
> - Admin can run the "neutron-ovn-db-sync-util" to sync the DBs.
>
> Could you share the details to try those two options?
>
>
> Thanks!
>
> Tony
>
> From: Tony Liu<mailto:tonyliu0...@hotmail.com >
> Sent: Thursday, July 30, 2020 4:38 PM
> To: Han Zhou<mailto:hz...@ovn.org >
> Cc: Han Zhou<mailto:hz...@ovn.org >; ovs-dev<
> mailto:ovs-...@openvswitch.org >; ovs-discuss<
> mailto:ovs-discuss@openvswitch.org >
> Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore
>
> Hi,
>
> I have another thought after some diggings. Since I am with
> OpenStack, all networking configurations are from OpenStack.
> I could snapshot OpenStack MariaDB, restore and run
> neutron-ovn-db-sync to update OVN DB. Would that be a cleaner
> solution?
>
> BTW, I got this error when restore the OVN DB.
> ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type
>
> The file was created by "backup" command.
>
>
> Thanks!
>
> Tony
>
> From: Tony Liu<mailto:tonyliu0...@hotmail.com >
> Sent: Thursday, July 30, 2020 3:41 PM
> To: Han Zhou<mailto:hz...@ovn.org >
> Cc: Han Zhou<mailto:hz...@ovn.org >; ovs-dev<
> mailto:ovs-...@openvswitch.org >; ovs-discuss<
> mailto:ovs-discuss@openvswitch.org >
> Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore
>
> Hi,
>
> A quick question here. Given this man page.
> http://www.openvswitch.org/support/dist-docs/ovsdb-client.1.txt
>
> It says backup and restore commands are for OVSDB standalone and
>
> active-backup databases.
>
>
>
> Can they be used for RAFT cluster? If not, what would be the concern,
>
> like inconsistency?
>
>
>
> If I restore to a follower, is the request going to be forwarded to the
>
> leader to restore DB for the whole cluster? But I believe it's recommended
>
> to restore to the leader directly for performance sake.
>
>
>
> I am going to give it a try anyways, see how it works. Will make sure
>
> there is no configuration update from OpenStack side while running such
>
> snapshot and restore process.
>
>
>
>
>
> Thanks!
>
>
>
> Tony
>
> From: Han Zhou<mailto:hz...@ovn.org >
> Sent: Thursday, July 30, 2020 12:23 PM
> To: Tony Liu<mailto:tonyliu

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-30 Thread Han Zhou
On Thu, Jul 30, 2020 at 10:56 AM Tony Liu  wrote:

> Hi Han,
>
>
>
> That doc helps. I will run some tests and update here. The use case I want
>
> to cover is snapshot/rollback and backup/restore.
>
>
>
> 
>
> Actually, "at-least-once" consistency, because OVSDB does not have a
> session
>
> mechanism to drop duplicate transactions if a connection drops after the
> server
>
> commits it but before the client receives the result.
>
> 
>
> I saw duplicated datapath bindings for the same logical switch once, if you
>
> recall. This may explain that. The ovn-northd connection to sb-db is
> dropped
>
> before receiving the result. So ovn-northd initiates another transaction to
>
> create datapath binding for the same logical switch.
>
>
>
Yes, this is a possibility.
However, in reality, this is usually not a problem:

1) If the DB schema has table keys properly defined, the redundant
transaction from a client would be rejected by the DB server because of the
key constraint check. In the datapath binding case this doesn't work because
of the poor definition of the datapath_binding table. It should have had a
"logical_switch_router" column defined and set as a key (in addition to the
"tunnel_key") instead of storing it in external_ids. The duplicated entries
would then have been avoided. Other tables such as port_binding would never
have such a problem.

2) OVSDB clients usually monitor and sync all (interested) data from the
server locally, so when they do declarative processing they can correct
problems by themselves. In fact, ovn-northd does this check and deletes
duplicated datapaths. I did a simple test and it cleaned up by itself:
2020-07-30T18:55:53.057Z|6|ovn_northd|INFO|ovn-northd lock acquired.
This ovn-northd instance is now active.
2020-07-30T19:02:10.465Z|7|ovn_northd|INFO|deleting Datapath_Binding
abef9503-445e-4a52-ae88-4c826cbad9d6 with duplicate
external-ids:logical-switch/router ee80c38b-2016-4cbc-9437-f73e3a59369e

I am not sure why in your case northd was stuck, but I agree there must be
something wrong. Please collect northd logs if you encounter this again so
we can dig further.

I see two ways to improve it.
>
> 1) On client side, if the connection is broken while waiting for the result
>
>of a transaction, the client checks the transaction state, committed or
> not,
>
>when it reconnects to the leader (maybe a different node).
>
>Do we have such check today?
>

Clients do check. In this case, when the transaction was actually
successful but appears to have failed from the client's point of view, the
check doesn't help.


> 2) I see client connection is dropped by the leader when it's busy. I don't
>
>think this is a good way to control the traffic. The server can cache
> and
>
>hold the request when it's busy, or even push back. Dropping connection
>
>is not a good option. Any thoughts here?
>
>
>
The server doesn't make this kind of decision. It could simply be
overloaded and disconnected from the cluster, or even worse, a node could
crash after committing the transaction.

Thanks,
Han


>
> Thanks!
>
>
>
> Tony
>
>
>
> *From: *Han Zhou 
> *Sent: *Wednesday, July 29, 2020 11:38 PM
> *To: *Tony Liu 
> *Cc: *ovs-discuss ; ovs-dev
> 
> *Subject: *Re: [ovs-discuss] [OVN] DB backup and restore
>
>
>
>
>
> On Wed, Jul 29, 2020 at 10:58 PM Tony Liu  wrote:
> >
> > Hi,
> >
> >
> >
> > There is any guidance to backup and restore OVN nb-db and sb-db?
> >
> >
> >
> > Is /var/lib/openvswitch/ovn-[ns]b/ovn[ns]b.db the only database file?
> >
> >
> >
> > For 3-node DB cluster, is replication 3 (the data is replicated onto
> >
> > All 3 nodes)?
> >
> >
> >
> > Are DB files on 3 nodes identical?
> >
> >
> >
> > If I stop a DB follower and empty the DB file on the follower node,
> >
> > when I start it back, is the whole DB going to be replicated to it?
> >
> >
> >
> > To backup the DB, is it OK to copy the DB file from any node, assuming
> >
> > no transaction ongoing?
> >
> >
> >
> > Is the following going to work to restore the DB?
> >
> > * Stop all 3 DBs.
> >
> > * Copy backup DB file to one node, empty DB file on the rest two nodes.
> >
> > * Bootstrap the node with DB file.
> >
> > * Start the rest two nodes to join the cluster.
>
> >
>
>
>
> For ovsdb operations, please refer to "man 7 ovsdb", or here:
> https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst
>
>
>
> >
> >
> > Do 

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-30 Thread Han Zhou
On Wed, Jul 29, 2020 at 10:58 PM Tony Liu  wrote:
>
> Hi,
>
>
>
> There is any guidance to backup and restore OVN nb-db and sb-db?
>
>
>
> Is /var/lib/openvswitch/ovn-[ns]b/ovn[ns]b.db the only database file?
>
>
>
> For 3-node DB cluster, is replication 3 (the data is replicated onto
>
> All 3 nodes)?
>
>
>
> Are DB files on 3 nodes identical?
>
>
>
> If I stop a DB follower and empty the DB file on the follower node,
>
> when I start it back, is the whole DB going to be replicated to it?
>
>
>
> To backup the DB, is it OK to copy the DB file from any node, assuming
>
> no transaction ongoing?
>
>
>
> Is the following going to work to restore the DB?
>
> * Stop all 3 DBs.
>
> * Copy backup DB file to one node, empty DB file on the rest two nodes.
>
> * Bootstrap the node with DB file.
>
> * Start the rest two nodes to join the cluster.
>

For ovsdb operations, please refer to "man 7 ovsdb", or here:
https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst

>
>
> Do I need to restore sb-db as well? Or restore nb-db only and let
>
> ovn-northd to sync data from nb-db to sb-db. Chassis data should be
>
> updated by onv-controller?
>

You don't have to restore sb-db. ovn-northd and ovn-controllers will sync
the data in SB DB.
However, it may take quite some time to sync if the scale is large.
Also, remember that the mac_binding table in SB will not be restored by
ovn-controller because it is populated as a result of ARP packet handling
by ovn-controller. The entries will be generated again only if new ARP
packets are observed by ovn-controller.

>
>
> I am running a scaling test. It takes quite a lot of time to build
> configurations. Wondering if I can back up and restore the DB to roll back
> to some checkpoint to avoid restarting all over.
>
>
>
>
>
> Thanks!
>
>
>
> Tony
>
>
>


Re: [ovs-discuss] [ovs-dev] OVN: resync nb-db to sb-db

2020-07-28 Thread Han Zhou
On Tue, Jul 28, 2020 at 2:09 PM Tony Liu  wrote:
>
> Hi,
>
> When I run a script to create bunch of networks and routers from
OpenStack, for whatever reason,
> nb-db is fully updated, but sb-db is only partially updated. For example,
there are 500 logical routers
> in nb-db, but only 218 datapath bindings in sb-db. In this case, is there
any way to resync nb-db to
> sb-db? Any advices to avoid such failure?
>
>
> Thanks!
>
> Tony
>
This looks like a bug. Please share the version, and logs of ovn-northd.
You can also enable debug logs with: ovn-appctl -t ovn-northd vlog/set
file:dbg, if it can be reproduced.

Normally, SB and NB will always be in sync if ovn-northd is running.

Thanks,
Han


Re: [ovs-discuss] [ovs-dev] OVN nb-db and sb-db election timer

2020-07-28 Thread Han Zhou
On Mon, Jul 27, 2020 at 1:40 PM Tony Liu  wrote:
>
> Hi,
>
> During the scaling test, when sb-db is busy, followers believe the leader
> is dead and start an election. Some inconsistency happened during such a
> leader switch: two datapath bindings were created for the same logical
> switch. To avoid such cases, I was recommended to increase the election
> timer x10.
> 4K networks are created successfully with that setting.
>
> Is it necessary to set a big election timer for nb-db as well? The nb-db
> doesn't seem very busy during the test; sb-db is always busy, taking 90+%
> CPU.
>
NB-DB is usually not busy, but you can still adjust the timer according to
your use case.

> With that big election timer, in case a real problem happens, like the
> leader node going down, is it going to take a while for the new leader to
> be elected?
>

Yes, it will take longer to detect real failures when the election timer is
bigger. During the gap, control plane operations may be impacted/delayed,
but dataplane is not impacted - what's running will continue working, but
changes during this time will not be reflected until the failover is done.

>
> Thanks!
>
> Tony
>


Re: [ovs-discuss] OVN scale

2020-07-28 Thread Han Zhou
On Mon, Jul 27, 2020 at 10:16 AM Tony Liu  wrote:

> Hi Han,
>
> Just some updates here.
>
> I tried with 4K networks on single router. Configuration was done without
> any issues. I checked both
> nb-db and sb-db, they all look good. It's just that router configuration
> is huge (in Neutron DB, nb-db
> and flow table in sb-db), because it contains all 4K ports. Also, the
> pipeline of router datapath in sb-db
> is quite big.
>
> I see ovn-northd master and sb-db leader are busy, taking 90+% CPU. There
> are only 3 compute nodes
> and 2 gateway nodes. Does that monitor setting "ovn-monitor-all" matters
> in such case? Any idea what
> they are busy with, without any configuration updates from OpenStack? The
> nb-db is not busy though.
>

Did you create logical switch ports in your test? Did you do port-binding
on compute nodes? If yes, then "ovn-monitor-all" would matter, since all
networks are connected to the same router. With "ovn-monitor-all"=true,
the huge monitor condition change messages are avoided.

Normally, if there is no NB-DB change, all components should be idle.
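
For reference, ovn-monitor-all is a per-chassis setting that ovn-controller
reads from the local Open vSwitch database; a minimal sketch of enabling and
checking it on one hypervisor (the value here is just an example) would be:

  # on each hypervisor / chassis
  ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true
  # verify what ovn-controller will pick up
  ovs-vsctl get open_vswitch . external_ids:ovn-monitor-all

With it set to true, the chassis downloads the whole SB contents instead of
exchanging the (potentially huge) monitor condition updates.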


> Probably because nb-db is busy, ovn-controller can't connect to it
> consistently. It keeps being
disconnected and reconnecting. Restarting ovn-controller seems to help. I am
> able to launch a few VMs
> on different networks and they are connected via the router.
>
If you are seeing ovn-controller disconnected from sb-db due to probe
timeout, you can disable/adjust the probe interval. See this slide:
https://www.slideshare.net/hanzhou1978/large-scale-overlay-networks-with-ovn-problems-and-solutions/16
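
A minimal sketch of doing that on a chassis (the values below are only
examples; external_ids:ovn-remote-probe-interval is in milliseconds and 0
disables the probe entirely):

  # disable the ovn-controller -> SB DB inactivity probe
  ovs-vsctl set open_vswitch . external_ids:ovn-remote-probe-interval=0
  # or, instead of disabling it, raise it to e.g. 60 seconds
  ovs-vsctl set open_vswitch . external_ids:ovn-remote-probe-interval=60000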



> Now, I have problem on external access. The router is set as gateway to a
> provider/underlay network
> on an interface on the gateway node. The router is allocated an underlay
> address from that provider
> network. My understanding is that, the br-ex on gateway node holding the
> active router will broadcast
> ARP to announce that router underlay address in case of failover. Also, it
> will respond ARP request for
> that router underlay address. But when I run tcpdump on that underlay
> interface on gateway node,
> I see ARP request coming in, but no ARP response going out. I checked the
> flow table in sb-db, it seems
> ok. I also checked flow on br-ex by "ovs-ofctl dump-flows br-ex", I don't
> see anything about ARP there.
> How should I look into it?
>

"br-ex" is not managed by OVN, so you won't see any flows there. Did you
use OpenStack commands to setup the gateway? Did you see port-binding of
the gateway port in SB DB?


> Again, the case is to support 4K networks with external access (security
> group is disabled),
> 4K routers (one for each network), 50 routers (one for 80 networks), 1
> router (for all 4K networks)...
> All networks are isolated by ACL on the logical router. Which option
> should work better?
> Any comment is appreciated.
>
>
If the 4K networks don't need to communicate with each other, then what
would scale the best (in theory) is: 4K routers (one for each network) with
ovn-monitor-all=false. This way, each HV only needs to process a small
proportion of the data. (The monitor condition change messages should also
be small because each HV only monitors the networks that have related VMs on
that HV.)

Thanks,
Han


> Thanks!
>
> Tony
>
>
> --
> *From:* discuss  on behalf of Tony
> Liu 
> *Sent:* July 21, 2020 09:09 PM
> *To:* Daniel Alvarez 
> *Cc:* ovs-discuss@openvswitch.org 
> *Subject:* Re: [ovs-discuss] OVN scale
>
> [root@ovn-db-2 ~]# ovn-nbctl list nb_global
> _uuid   : b7b3aa05-f7ed-4dbc-979f-10445ac325b8
> connections : []
> external_ids: {"neutron:liveness_check_at"="2020-07-22
> 04:03:17.726917+00:00"}
> hv_cfg  : 312
> ipsec   : false
> name: ""
> nb_cfg  : 2636
> options : {mac_prefix="ca:e8:07",
> svc_monitor_mac="4e:d0:3a:80:d4:b7"}
> sb_cfg  : 2005
> ssl : []
>
> [root@ovn-db-2 ~]# ovn-sbctl list sb_global
> _uuid   : 3720bc1d-b0da-47ce-85ca-96fa8d398489
> connections : []
> external_ids: {}
> ipsec   : false
> nb_cfg  : 312
> options : {mac_prefix="ca:e8:07",
> svc_monitor_mac="4e:d0:3a:80:d4:b7"}
> ssl : []
>
> The NBDB and SBDB are definitely out of sync. Is there any way to force
> ovn-northd sync them?
>
> Thanks!
>
> Tony
>
> --
> *From:* Tony Liu 
> *Sent:* July 21, 2020 08:39 PM
> *To:* Daniel Alvarez 
> *Cc:* Cory Hawkless ; ovs-discuss@openvswitch.org <
> ovs-discuss@openvswitch.org>; Dumitru Ceara 
> *Subject:* Re: [ovs-discuss] OVN scale
>
> When creating a network (and subnet) on OpenStack, a GW port and a service
> port (for DHCP and metadata)
> are also created. They are created in Neutron and onv-nb-db by ML2 driver.
> Then ovn-northd will translate
> such update from NBDB to SBDB. My question here is that, with 20.03, is

Re: [ovs-discuss] [ovs-dev] OVN: Two datapath-bindings are created for the same logical-switch

2020-07-23 Thread Han Zhou
On Thu, Jul 23, 2020 at 10:33 AM Tony Liu  wrote:
>
> Changed the title for this specific problem.
> I looked into logs and have more findings.
> The problem was happening when sb-db leader switched.

Hi Tony,

Thanks for this detailed information. Could you confirm which version of
OVS is used (to understand the OVSDB behavior)?

>
> For ovsdb cluster, what may trigger the leader switch? Given the log,
> 2020-07-21T01:08:38.119Z|00074|raft|INFO|term 2: 1135 ms timeout expired,
starting election
> The election is asked by a follower node. Is that because the connection
from follower to leader timeout,
> then follower assumes the leader is dead and starts an election?

You are right, the RAFT heartbeat would time out when the server is too busy
and the election timer is too small (default 1s). For a large-scale test,
please increase the election timer with:
ovn-appctl -t <path> cluster/change-election-timer OVN_Southbound <ms>

I suggest setting <ms> to at least 16000 or more in your case. (You need to
increase the value gradually - 2000, 4000, 8000, 16000 - so it will take you
4 commands to reach this from the initial default value of 1000; not very
convenient, I know :)

<path> here is the path to the socket ctl file of ovn-sb, usually under
/var/run/ovn.
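
Assuming the default socket location for the clustered SB DB (the ctl file
name can differ per packaging), the full sequence would look roughly like:

  # run these against the current leader node;
  # each call may at most double the current value, hence the four steps
  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 2000
  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 4000
  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 8000
  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 16000
  # check the resulting timer
  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound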

>
> For ovn-northd (3 instances), they all connect to the sb-db leader,
whoever holds the lock is the master.
> When sb-db leader switches, all ovn-northd instances look for the new
leader. In this case, there is no
> guarantee that the old ovn-northd master remains the role, other
ovn-northd instance may find the
> leader and acquire the lock first. So, the sb-db leader switch may also
cause ovn-northd master switch.
> Such switch may happen in the middle of ovn-northd transaction, in that
case, is there any guarantee to
> the transaction completeness? My guess is that the old master created a
datapath-binding for a logical-switch,
> the switch happened before this transaction completed, and then the new
master/leader created another
> datapath-binding for the same logical-switch. Does that make sense?

I agree with you it could be related to the failover and the lock behavior
during the failover. It could be a lock problem causing 2 northds to become
active at the same time for a short moment. However, I still can't imagine
how the duplicated entries are created with different tunnel keys. If both
northd create the datapath binding for the same LS at the same time, they
should allocate the same tunnel key, and then one of them should fail
during the transaction commit because of index conflict in DB. But here
they have different keys so both were inserted in DB.

(OVSDB transaction is atomic even during failover and no client should see
partial data of a transaction.)

(cc Ben to comment more on the possibility of both clients acquiring the
lock during failover)

>
> From the log, when sb-db switched, ovn-northd master connected to the new
leader and lost the master,
> but immediately, it acquired the lock and become master again. Not sure
how this happened.

From the ovn-northd logs, the ovn-northd on .86 first connected to the SB DB
on .85, which suggests that it regarded .85 as the leader (otherwise it
would disconnect and retry another server), and then immediately after
connecting .85 and acquiring the lock, it disconnected because it somehow
noticed that .85 is not the leader, and then retried and connected to .86
(the new leader) and found out that the lock is already acquired by .85
northd, so it switched to standby. The .85 northd luckily connected to .86
in the first try so it was able to acquire the lock on the leader node
first. Maybe the key thing is to figure out why the .86 northd initially
connected to the .85 DB, which is not the leader, and acquired the lock.

Thanks,
Han

>
> Here are some loggings.
>  .84 sb-db leader =
> 2020-07-21T01:08:20.221Z|01408|raft|INFO|current entry eid
639238ba-bc00-4efe-bb66-6ac766bb5f4b does not match prerequisite
78e8e167-8b4c-4292-8e25-d9975631b010 in execute_command_request
>
> 2020-07-21T01:08:38.450Z|01409|timeval|WARN|Unreasonably long 1435ms poll
interval (1135ms user, 43ms system)
> 2020-07-21T01:08:38.451Z|01410|timeval|WARN|faults: 5942 minor, 0 major
> 2020-07-21T01:08:38.451Z|01411|timeval|WARN|disk: 0 reads, 50216 writes
> 2020-07-21T01:08:38.452Z|01412|timeval|WARN|context switches: 60
voluntary, 25 involuntary
> 2020-07-21T01:08:38.453Z|01413|coverage|INFO|Skipping details of
duplicate event coverage for hash=45329980
>
> 2020-07-21T01:08:38.455Z|01414|raft|WARN|ignoring vote request received
as leader
> 2020-07-21T01:08:38.456Z|01415|raft|INFO|server 1f9e is leader for term 2
> 2020-07-21T01:08:38.457Z|01416|raft|INFO|rejected append_reply (not
leader)
> 2020-07-21T01:08:38.471Z|01417|raft|INFO|rejected append_reply (not
leader)
>
> 2020-07-21T01:23:00.890Z|01418|timeval|WARN|Unreasonably long 1336ms poll
interval (1102ms user, 20ms system)
>
>  .85 sb-db ==
> 

Re: [ovs-discuss] [ovs-dev] OVN Controller Incremental Processing

2020-07-23 Thread Han Zhou
On Thu, Jul 23, 2020 at 5:40 PM Tony Liu  wrote:
>
> Hi Han,
>
> Now, I have 4000 records in logical-switch table in nb-db, only 1567
records in datapath-binding
> table in sb-db. The translation was broken by a duplication (2 datapath
bindings point to the same
> logical-switch). Not sure how that happened. Anyways, I manually removed
this duplication.
> How can I trigger ovn-northd to finish the translation for all the rest
logical switches?
>
One thing you need to be aware of with many networks is that you probably
want to enable the option "ovn-monitor-all=true" on each HV node, to avoid
overloading the SB DB with the large monitor condition updates that
conditional monitoring would otherwise generate.

If you trigger any change to NB, ovn-northd will recompute everything. If
it doesn't complete the processing, there must be something wrong, and
there should be error logs.

>
> Thanks!
>
> Tony
>
> 
> From: Han Zhou 
> Sent: July 23, 2020 04:19 PM
> To: Tony Liu 
> Cc: Han Zhou ; ovs-dev ;
ovs-discuss 
> Subject: Re: [ovs-dev] OVN Controller Incremental Processing
>
>
>
> On Thu, Jul 23, 2020 at 4:12 PM  wrote:
> >
> > Thanks Han for the quick confirmation!
> > That says, when changes was made into nb-db, ovn-northd doesn't
recompile the whole db, instead, it only updates the increment into sb-db.
I am currently running some scaling test and seeing 100% CPU usage, hence
asking.
> >
> Oh, no. The talk was about "OVN-controller", which is the component
running on hypervisors, to translate SB data into OVS flows, and this has
been implemented (although not all scenarios are incrementally processed).
For ovn-northd, which runs on the central node to convert data from NB to SB
DB, incremental processing is not implemented yet, so it is
expected to see 100% CPU. There is currently work ongoing for ovn-northd
incremental processing, with DDlog, by Ben and Leonid.
>
> > Tony
> >
> > On Jul 23, 2020 4:02 PM, Han Zhou  wrote:
> >
> >
> >
> > On Thu, Jul 23, 2020 at 11:17 AM Tony Liu 
wrote:
> > >
> > > Hi,
> > >
> > > Is this implemented and released?
> > >
https://www.slideshare.net/hanzhou1978/ovn-controller-incremental-processing
> > > Could anyone share an update on this?
> > >
> > >
> > > Thanks!
> > >
> > > Tony
> > >
> >
> > Yes, it was released initially in OVS/OVN 2.12 (if I remember
correctly), and there have been more improvements added gradually since
then.
> > (The "future" part which talks about DDlog is not implemented yet.)
> >
> > Thanks,
> > Han
> >
> >


Re: [ovs-discuss] OVN: ovn-sbctl backoff

2020-07-23 Thread Han Zhou
On Thu, Jul 23, 2020 at 4:34 PM  wrote:
>
> Good to know! I recall I read somewhere saying only the leader takes
write requests. I will double check!
>
> Well, in that case, I have another question, why is such leader role
required? In a quorum based cluster, all nodes are equal. And why does
ovn-northd have to connect to the leader?
> Guess I will need to read more about RAFT:)
>
Quick answer: only the leader node does the actual write, so all writes are
redirected to the leader, but they can be initiated from followers. ovn-northd
connects to the leader for better performance because it writes heavily.
The manpage provides more details, and yes, the RAFT paper has even more.

>
> Thanks!
>
> Tony
>
> On Jul 23, 2020 4:26 PM, Han Zhou  wrote:
>
>
>
> On Thu, Jul 23, 2020 at 4:07 PM  wrote:
> >
> > Thanks Han for the prompt responses!
> > That option is ok for reading. If I want to write, I have to connect to
the leader, right? Then my question remains, how does ovn-sbctl find out
how to connect to the leader?
> >
>
> RAFT doesn't require you to connect to leader for writing. You can
connect to any node and write.
> However, if for any reason you want to connect to the leader, you need to
specify the DB connection method as: <server1>,<server2>,...,<serverN>. For
example: tcp:10.0.0.2:6641,tcp:10.0.0.3:6641,tcp:10.0.0.4:6641.
> You can read more details about OVSDB clustering in manpage ovsdb(7).
>
> Thanks,
> Han
>
> > Thanks again!
> >
> > Tony
> >
> > On Jul 23, 2020 3:57 PM, Han Zhou  wrote:
> >
> >
> >
> > On Thu, Jul 23, 2020 at 3:43 PM Tony Liu 
wrote:
> > >
> > > Hi,
> > >
> > > In case of ovsdb cluster, when I run ovn-sbctl, it connects to the
unix socket of local sb-db.
> > > If local sb-db is not the leader, ovn-sbctl tries another server to
look for the leader.
> > > How does ovn-sbctl connect to another server? By which connection?
> > > How does ovn-sbctl know the connection?
> > > Or the local sb-db asks to be the leader to respond ovn-sbctl?
> > > I can't figure out how it works from verbose messages.
> > > Any help to clarify is appreciated!
> > >
> > >
> > > Thanks!
> > >
> > > Tony
> >
> > If you don't intentionally try to connect to the leader, you can use
ovn-sbctl --no-leader-only ... to avoid the retry.
> > If you want to avoid typing this option every time, you can export
OVN_SBCTL_OPTIONS"="--no-leader-only", and then just run ovn-sbctl ...
> > (does this answer your question?)
> >
> > Thanks,
> > Han
> >
> >
>
>


Re: [ovs-discuss] OVN: ovn-sbctl backoff

2020-07-23 Thread Han Zhou
On Thu, Jul 23, 2020 at 4:07 PM  wrote:
>
> Thanks Han for the prompt responses!
> That option is ok for reading. If I want to write, I have to connect to
the leader, right? Then my question remains, how does ovn-sbctl find out
how to connect to the leader?
>

RAFT doesn't require you to connect to leader for writing. You can connect
to any node and write.
However, if for any reason you want to connect to the leader, you need to
specify the DB connection method as: <server1>,<server2>,...,<serverN>. For
example: tcp:10.0.0.2:6641,tcp:10.0.0.3:6641,tcp:10.0.0.4:6641.
You can read more details about OVSDB clustering in manpage ovsdb(7).
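
For example (addresses made up; 6642 is the conventional SB DB port, but use
whatever remotes you have configured), a client can be pointed at the whole
cluster, and the default --leader-only behavior will make it keep retrying
until it reaches the leader:

  ovn-sbctl --db="tcp:10.0.0.2:6642,tcp:10.0.0.3:6642,tcp:10.0.0.4:6642" show
  # or let any member answer
  ovn-sbctl --no-leader-only \
      --db="tcp:10.0.0.2:6642,tcp:10.0.0.3:6642,tcp:10.0.0.4:6642" list SB_Global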

Thanks,
Han

> Thanks again!
>
> Tony
>
> On Jul 23, 2020 3:57 PM, Han Zhou  wrote:
>
>
>
> On Thu, Jul 23, 2020 at 3:43 PM Tony Liu  wrote:
> >
> > Hi,
> >
> > In case of ovsdb cluster, when I run ovn-sbctl, it connects to the unix
socket of local sb-db.
> > If local sb-db is not the leader, ovn-sbctl tries another server to
look for the leader.
> > How does ovn-sbctl connect to another server? By which connection?
> > How does ovn-sbctl know the connection?
> > Or the local sb-db asks to be the leader to respond ovn-sbctl?
> > I can't figure out how it works from verbose messages.
> > Any help to clarify is appreciated!
> >
> >
> > Thanks!
> >
> > Tony
>
> If you don't intentionally try to connect to the leader, you can use
ovn-sbctl --no-leader-only ... to avoid the retry.
> If you want to avoid typing this option every time, you can export
OVN_SBCTL_OPTIONS"="--no-leader-only", and then just run ovn-sbctl ...
> (does this answer your question?)
>
> Thanks,
> Han
>
>


Re: [ovs-discuss] [ovs-dev] OVN Controller Incremental Processing

2020-07-23 Thread Han Zhou
On Thu, Jul 23, 2020 at 4:12 PM  wrote:
>
> Thanks Han for the quick confirmation!
> That says, when changes was made into nb-db, ovn-northd doesn't recompile
the whole db, instead, it only updates the increment into sb-db. I am
currently running some scaling test and seeing 100% CPU usage, hence asking.
>
Oh, no. The talk was about "OVN-controller", which is the component running
on hypervisors, to translate SB data into OVS flows, and this has been
implemented (although not all scenarios are incrementally processed). For
ovn-northd, which runs on the central node to convert data from NB to SB DB,
incremental processing is not implemented yet, so it is expected to
see 100% CPU. There is currently work ongoing for ovn-northd incremental
processing, with DDlog, by Ben and Leonid.

> Tony
>
> On Jul 23, 2020 4:02 PM, Han Zhou  wrote:
>
>
>
> On Thu, Jul 23, 2020 at 11:17 AM Tony Liu  wrote:
> >
> > Hi,
> >
> > Is this implemented and released?
> >
https://www.slideshare.net/hanzhou1978/ovn-controller-incremental-processing
> > Could anyone share an update on this?
> >
> >
> > Thanks!
> >
> > Tony
> >
>
> Yes, it was released initially in OVS/OVN 2.12 (if I remember correctly),
and there have been more improvements added gradually since then.
> (The "future" part which talks about DDlog is not implemented yet.)
>
> Thanks,
> Han
>
>


Re: [ovs-discuss] [ovs-dev] OVN Controller Incremental Processing

2020-07-23 Thread Han Zhou
On Thu, Jul 23, 2020 at 11:17 AM Tony Liu  wrote:
>
> Hi,
>
> Is this implemented and released?
>
https://www.slideshare.net/hanzhou1978/ovn-controller-incremental-processing
> Could anyone share an update on this?
>
>
> Thanks!
>
> Tony
>

Yes, it was released initially in OVS/OVN 2.12 (if I remember correctly),
and there have been more improvements added gradually since then.
(The "future" part which talks about DDlog is not implemented yet.)

Thanks,
Han


Re: [ovs-discuss] OVN: ovn-sbctl backoff

2020-07-23 Thread Han Zhou
On Thu, Jul 23, 2020 at 3:43 PM Tony Liu  wrote:
>
> Hi,
>
> In case of ovsdb cluster, when I run ovn-sbctl, it connects to the unix
socket of local sb-db.
> If local sb-db is not the leader, ovn-sbctl tries another server to look
for the leader.
> How does ovn-sbctl connect to another server? By which connection?
> How does ovn-sbctl know the connection?
> Or the local sb-db asks to be the leader to respond ovn-sbctl?
> I can't figure out how it works from verbose messages.
> Any help to clarify is appreciated!
>
>
> Thanks!
>
> Tony

If you don't intentionally try to connect to the leader, you can use
ovn-sbctl --no-leader-only ... to avoid the retry.
If you want to avoid typing this option every time, you can export
OVN_SBCTL_OPTIONS"="--no-leader-only", and then just run ovn-sbctl ...
(does this answer your question?)
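
In shell form, that would be, for example:

  # make --no-leader-only the default for subsequent ovn-sbctl runs in this shell
  export OVN_SBCTL_OPTIONS="--no-leader-only"
  ovn-sbctl show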

Thanks,
Han


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-07-14 Thread Han Zhou
Thanks Girish for the update! I will submit formal patches for ovn.

Regards,
Han

On Tue, Jul 14, 2020 at 8:32 AM Girish Moodalbail 
wrote:

> Yes, we are going to submit the patch to enable those options on L3
> Gateway Routers to ovn-k8s repo.  I am going to wait until these changes
> make it to OVN repo and then submit since I don't know if these options
> will be renamed and such.
>
> Regards,
> ~Girish
>
> On Tue, Jul 14, 2020 at 7:33 AM Tim Rozet  wrote:
>
>> Thanks for the update Girish. Are you planning on submitting an
>> ovn-k8s patch to enable these?
>>
>> Tim Rozet
>> Red Hat CTO Networking Team
>>
>>
>> On Mon, Jul 13, 2020 at 9:37 PM Girish Moodalbail 
>> wrote:
>>
>>> Hello Han,
>>>
>>> On the #openvswitch IRC channel I had provided an update on your patch
>>> working great on our test setup. That update was for the L3 Gateway Router
> option called *learn_from_arp_request="true|false"*. With that option
>>> in place, the number of entries in the MAC binding table has significantly
>>> reduced.
>>>
>>> However, I had not provided an update on the single join switch tests.
>>> Sincere apologies for the delay. We just got that code to work last week,
>>> and we have an update. This is for the option called
>>> *dynamic_neigh_routers="true|false"* on the L3 Gateway Router. It works
>>> as expected.  With that option in place, for all of the L3 Gateway Routers
>>> I see just 3 entries as expected:
>>>
>>>   table=12(lr_in_arp_resolve  ), priority=500  , match=(ip4.mcast ||
>>> ip6.mcast), action=(next;)
>>>   table=12(lr_in_arp_resolve  ), priority=0, match=(ip4),
>>> action=(get_arp(outport, reg0); next;)
>>>   table=12(lr_in_arp_resolve  ), priority=0, match=(ip6),
>>> action=(get_nd(outport, xxreg0); next;)
>>>
>>> Before, on a 1000 node cluster with 1000 Gateway Routers we would see
>>> 1000 entries per Gateway Router and therefore a total of 1M entries in the
>>> cluster. Now, that is not the case.
>>>
>>> Thank you!
>>>
>>> Regards,
>>> ~Girish
>>>
>>>
>>> On Wed, Jun 10, 2020 at 12:04 PM Han Zhou  wrote:
>>>
>>>>
>>>>
>>>> On Wed, Jun 10, 2020 at 12:03 PM Han Zhou  wrote:
>>>>
>>>>> Hi Girish, Venu,
>>>>>
>>>>> I sent a RFC patch series for the solution discussed. Could you give
>>>>> it a try when you get the chance?
>>>>>
>>>>
>>>> Oops, I forgot the link:
>>>> https://patchwork.ozlabs.org/project/openvswitch/list/?series=182602
>>>>
>>>>>
>>>>> Thanks,
>>>>> Han
>>>>>
>>>>> On Tue, Jun 9, 2020 at 10:04 AM Han Zhou  wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer 
>>>>>> wrote:
>>>>>>
>>>>>>> Sorry for the delay, Han, a quick question below:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* ovn-kuberne...@googlegroups.com <
>>>>>>> ovn-kuberne...@googlegroups.com> *On Behalf Of *Han Zhou
>>>>>>> *Sent:* Wednesday, June 3, 2020 4:27 PM
>>>>>>> *To:* Girish Moodalbail 
>>>>>>> *Cc:* Tim Rozet ; Dumitru Ceara <
>>>>>>> dce...@redhat.com>; Daniel Alvarez Sanchez ;
>>>>>>> Dan Winship ; ovn-kuberne...@googlegroups.com;
>>>>>>> ovs-discuss ; Michael Cambria <
>>>>>>> mcamb...@redhat.com>; Venugopal Iyer 
>>>>>>> *Subject:* Re: [ovs-discuss] [OVN] flow explosion in
>>>>>>> lr_in_arp_resolve table
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *External email: Use caution opening links or attachments*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Girish, yes, that's what we concluded in last OVN meeting, but
>>>>>>> sorry that I forgot to update here.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <
>>>>>>> gmoodalb...@gmail.com> wrote:
>>>>>>> >
>>>>>>> &

Re: [ovs-discuss] [OVN] running bfd on ecmp routes?

2020-06-17 Thread Han Zhou
On Wed, Jun 17, 2020 at 3:43 AM Numan Siddique  wrote:

>
>
> On Wed, Jun 17, 2020 at 12:50 AM Han Zhou  wrote:
>
>>
>>
>> On Tue, Jun 16, 2020 at 11:32 AM Tim Rozet  wrote:
>>
>>> Thanks Han. See inline.
>>> Tim Rozet
>>> Red Hat CTO Networking Team
>>>
>>>
>>> On Tue, Jun 16, 2020 at 1:45 PM Han Zhou  wrote:
>>>
>>>>
>>>>
>>>> On Mon, Jun 15, 2020 at 7:22 AM Tim Rozet  wrote:
>>>>
>>>>> Hi All,
>>>>> While looking into using ecmp routes for an OVN router I noticed there
>>>>> is no support for BFD on these routes. Would it be possible to add this
>>>>> capability? I would like the next hop to be removed from the openflow 
>>>>> group
>>>>> if BFD detection for that next hop goes down. My routes in this case would
>>>>> be on a GR for N/S external next hop and not going across a tunnel as it
>>>>> egresses.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tim Rozet
>>>>> Red Hat CTO Networking Team
>>>>>
>>>>> Hi Tim,
>>>>
>>>> Thanks for bringing this up. Yes, it is desirable to have BFD support
>>>> for OVN routers. Here are my thoughts.
>>>>
>>>> In general, OVN routers are distributed. It is not easy to tell which
>>>> node should be responsible for the BFD session, especially, to handle the
>>>> response packets. Even if we managed to implement this, the node detects
>>>> the failure needs to populate the information to central SB DB, so that the
>>>> information is distributed to all nodes, to make the distributed route
>>>> updated.
>>>>
>>>
>>> Right in a distributed case it would mean the BFD endpoint would be
>>> under the network managed by OVN, and therefore reside on the same node
>>> where the port for that endpoint resides. In the ovn-kubernetes context, it
>>> is a pod running on a node connected to the DR.
>>>
>>
>> Yes, this may be the typical case. However, there can be more scenarios,
>> since there is no limit for what the nexthop can be in OVN routes. It can
>> be an IP of a OVN port which is straightforward. It can also be an IP of a
>> nested workload which is under the OVN managed network but not directly
>> known by OVN (maybe learned through ARP). The nexthop can also be on
>> external networks reachable through distributed gateway ports (instead of
>> GR), in which case the routes are distributed and it requires resolving the
>> output port to figure out that the BFD session should be running through
>> the gateway node. But I agree that all these should be doable, although it
>> may introduce some complexity. In addition, for distributed routers, BFD is
>> not necessarily faster than an external monitoring mechanism, because the
>> updates to the route would anyway need to go through the central DB (so
>> that it can be enforced on all nodes in the distributed manner).
>>
>
> Maybe we can extend the current service monitor implementation to also
> include BFD ? And detect any failures.
>

Agree, I was thinking about something similar.

>
> Whether the nexthop is known to OVN or is outside of OVN subsystem,
> ovn-controller can inject the BFD packet to the router pipeline and this
> packet would be routed and get delivered
> to the endpoint handling the nexthop. If we take this approach, OVN
> doesn't need to know what is the OVS interface to use to enable BFD if the
> interface connected to the nexthop endpoint.
>

Sorry, this is unclear to me. If the nexthop is unknown to OVN, how do we
know which chassis should take care of the BFD session for that nexthop? For
"if the interface connected to the nexthop endpoint" - which interface do
you mean here?


>
> Right now, ovn-controller creates the OVS tunnel interfaces and it can
> easily configure BFD on these. But generally, OVS interfaces for VMs/PODs
> etc are created by external entities like - OpenStack Nova, ovn-kubernetes
> etc
> and probably its not a good idea for ovn-controller to enable the BFD by
> running equivalent of - "ovs-vsctl set interface 
> bfd:enable=true".
>
> I wonder how we can directly utilize the OVS BFD configuration on the OVS
interfaces for this purpose. For example, if the nexthop is a VM, how can
we use OVS BFD to peer a session within a VM?

There are so many scenarios and details to be figured out. I think it would
justify a design doc probably :)


> Any thoughts ?
>
> Thanks
> N

Re: [ovs-discuss] [OVN] running bfd on ecmp routes?

2020-06-16 Thread Han Zhou
On Tue, Jun 16, 2020 at 11:32 AM Tim Rozet  wrote:

> Thanks Han. See inline.
> Tim Rozet
> Red Hat CTO Networking Team
>
>
> On Tue, Jun 16, 2020 at 1:45 PM Han Zhou  wrote:
>
>>
>>
>> On Mon, Jun 15, 2020 at 7:22 AM Tim Rozet  wrote:
>>
>>> Hi All,
>>> While looking into using ecmp routes for an OVN router I noticed there
>>> is no support for BFD on these routes. Would it be possible to add this
>>> capability? I would like the next hop to be removed from the openflow group
>>> if BFD detection for that next hop goes down. My routes in this case would
>>> be on a GR for N/S external next hop and not going across a tunnel as it
>>> egresses.
>>>
>>> Thanks,
>>>
>>> Tim Rozet
>>> Red Hat CTO Networking Team
>>>
>>> Hi Tim,
>>
>> Thanks for bringing this up. Yes, it is desirable to have BFD support for
>> OVN routers. Here are my thoughts.
>>
>> In general, OVN routers are distributed. It is not easy to tell which
>> node should be responsible for the BFD session, especially, to handle the
>> response packets. Even if we managed to implement this, the node detects
>> the failure needs to populate the information to central SB DB, so that the
>> information is distributed to all nodes, to make the distributed route
>> updated.
>>
>
> Right in a distributed case it would mean the BFD endpoint would be under
> the network managed by OVN, and therefore reside on the same node where the
> port for that endpoint resides. In the ovn-kubernetes context, it is a pod
> running on a node connected to the DR.
>

Yes, this may be the typical case. However, there can be more scenarios,
since there is no limit for what the nexthop can be in OVN routes. It can
be an IP of a OVN port which is straightforward. It can also be an IP of a
nested workload which is under the OVN managed network but not directly
known by OVN (maybe learned through ARP). The nexthop can also be on
external networks reachable through distributed gateway ports (instead of
GR), in which case the routes are distributed and it requires resolving the
output port to figure out that the BFD session should be running through
the gateway node. But I agree that all these should be doable, although it
may introduce some complexity. In addition, for distributed routers, BFD is
not necessarily faster than an external monitoring mechanism, because the
updates to the route would anyway need to go through the central DB (so
that it can be enforced on all nodes in the distributed manner).


>> In your particular case, it may be easier, since the gateway router is
>> physically located on a single node. ovn-controller on the GR node can
>> maintain BFD session with the nexthops. If a session is down,
>> ovn-controller may take action locally to enforce the change locally.
>>
>
> Yeah for the external network case this makes sense. I went ahead and
> filed a BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=1847570
>
>>
>> For both cases, more details may need to be sorted out.
>>
>> Alternatively, it shouldn't be hard to have an external monitoring
>> service/agent that talks BFD with the nexthops, and react on the session
>> status changes by updating ECMP routes in OVN NB.
>>
> Yeah I have a workaround plan to do this for now, using a networking
> health check and signaling from K8S. The problem is this is much slower
> than using real BFD, but it is better than nothing.
>

Great. Is there any design doc or POC? (Or is there a plan to share one when
ready?) Thanks!

>
>
>>
>> Thanks,
>> Han
>>
>


Re: [ovs-discuss] [OVN] running bfd on ecmp routes?

2020-06-16 Thread Han Zhou
On Mon, Jun 15, 2020 at 7:22 AM Tim Rozet  wrote:

> Hi All,
> While looking into using ecmp routes for an OVN router I noticed there is
> no support for BFD on these routes. Would it be possible to add this
> capability? I would like the next hop to be removed from the openflow group
> if BFD detection for that next hop goes down. My routes in this case would
> be on a GR for N/S external next hop and not going across a tunnel as it
> egresses.
>
> Thanks,
>
> Tim Rozet
> Red Hat CTO Networking Team
>
> Hi Tim,

Thanks for bringing this up. Yes, it is desirable to have BFD support for
OVN routers. Here are my thoughts.

In general, OVN routers are distributed. It is not easy to tell which node
should be responsible for the BFD session, especially, to handle the
response packets. Even if we managed to implement this, the node that detects
the failure needs to populate the information to the central SB DB, so that the
information is distributed to all nodes and the distributed route gets
updated.

In your particular case, it may be easier, since the gateway router is
physically located on a single node. ovn-controller on the GR node can
maintain BFD sessions with the nexthops. If a session is down,
ovn-controller may take action locally to enforce the change.

For both cases, more details may need to be sorted out.

Alternatively, it shouldn't be hard to have an external monitoring
service/agent that talks BFD with the nexthops and reacts to session
status changes by updating the ECMP routes in OVN NB.
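
Purely as an illustration of that idea (nothing OVN ships): a tiny agent could
withdraw and restore one ECMP nexthop with ovn-nbctl based on a health check.
The router/prefix/nexthop names below are made up, ping stands in for a real
BFD speaker, and lr-route-del is assumed to accept a nexthop argument (as
recent ovn-nbctl does):

  #!/bin/sh
  # Hypothetical watcher for one ECMP member of the default route on GR0.
  ROUTER=GR0
  PREFIX=0.0.0.0/0
  NEXTHOP=172.16.0.1
  while true; do
      if ping -c 1 -W 1 "$NEXTHOP" >/dev/null 2>&1; then
          # nexthop reachable: make sure this ECMP member is installed
          ovn-nbctl --may-exist --ecmp lr-route-add "$ROUTER" "$PREFIX" "$NEXTHOP"
      else
          # nexthop unreachable: withdraw only this member
          ovn-nbctl --if-exists lr-route-del "$ROUTER" "$PREFIX" "$NEXTHOP"
      fi
      sleep 1
  done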

Thanks,
Han


Re: [ovs-discuss] [OVN] logical flow explosion in lr_in_ip_input table for dnat_and_snat IPs

2020-06-15 Thread Han Zhou
Sorry Girish, I can't promise anything for now. I will see if I have time in
the next couple of weeks, but anyone is welcome to volunteer for this if it is
urgent.

On Mon, Jun 15, 2020 at 10:56 AM Girish Moodalbail 
wrote:

> Hello Han,
>
> On Wed, Jun 3, 2020 at 9:39 PM Han Zhou  wrote:
>
>>
>>
>> On Wed, Jun 3, 2020 at 7:16 PM Girish Moodalbail 
>> wrote:
>>
>>> Hello all,
>>>
>>> While working on an extension, see the diagram below, to the existing
>>> OVN logical topology for the ovn-kubernetes project, I am seeing an
>>> explosion of the "Reply to ARP requests" logical flows in the
>>> `lr_in_ip_input` table for the distributed router (ovn_cluster_router)
>>> configured with gateway port (rtol-LS)
>>>
>>> internet
>>>-+-->
>>> |
>>> |
>>>   +--localnet-port-+
>>>   |LS  |
>>>   +-ltor-LS+
>>>|
>>>|
>>>  +-rtol-LS+
>>>  |   ovn_cluster_router   |
>>>  |  (Distributed Router)  |
>>>  +-rtos-ls0--rtos-ls1rtos-ls2-+
>>>   |  |  |
>>>   |  |  |
>>> +-+-+   ++--+ +-+-+
>>> |  LS0  |   |  LS1  | |  LS2  |
>>> +-+-+   +-+-+ +-+-+
>>>   |   | |
>>>   p0  p1p2
>>>  IA0 IA1   IA2
>>>  EA0 EA1   EA2
>>> (Node0)  (Node1)   (Node2)
>>>
>>> In the topology above, each of the three logical switch port has an
>>> internal address of IAx and an external address of EAx (dnat_and_snat IP).
>>> They are all bound to their respective nodes (Nodex). A packet from `p0`
>>> heading towards the internet will be SNAT'ed to EA0 on the local hypervisor
>>> and then sent out through the LS's localnet-port on that hypervisor.
>>> Basically, they are configured for distributed NATing.
>>>
>>> I am seeing interesting "Reply to ARP requests" flows for arp.tpa set to
>>> "EAX". Flows are like this:
>>>
>>> For EA0
>>> priority=90, match=(inport == "rtos-ls0" && arp.tpa == EA0 && arp.op ==
>>> 1), action=(/* ARP reply */)
>>> priority=90, match=(inport == "rtos-ls1" && arp.tpa == EA0 && arp.op ==
>>> 1), action=(/* ARP reply */)
>>> priority=90, match=(inport == "rtos-ls2" && arp.tpa == EA0 && arp.op ==
>>> 1), action=(/* ARP reply */)
>>>
>>> For EA1
>>> priority=90, match=(inport == "rtos-ls0" && arp.tpa == EA1 && arp.op ==
>>> 1), action=(/* ARP reply */)
>>> priority=90, match=(inport == "rtos-ls1" && arp.tpa == EA0 && arp.op ==
>>> 1), action=(/* ARP reply */)
>>> priority=90, match=(inport == "rtos-ls2" && arp.tpa == EA1 && arp.op ==
>>> 1), action=(/* ARP reply */)
>>>
>>> Similarly, for EA2.
>>>
>>> So, we have N * N "Reply to ARP requests" flows for N nodes each with 1
>>> dnat_and_snat ip.
>>> This is causing scale issues.
>>>
>>> If you look at the flows for `EA0`, i am confused as to why is it needed?
>>>
>>>1. When will one see an ARP request for the EA0 from any of the
>>>LS{0,1,2}'s logical switch port.
>>>2. If it is needed at all, can't we just remove the `inport` thing
>>>altogether since the flow is configured for every port of logical router
>>>port except for the distributed gateway port rtol-LS. For this port, we
>>>could add an higher priority rule with action set to `next`.
>>>3. Say, we don't need east-west NAT connectivity. Is there a way to
>>>make these ARPs be learnt dynamically, like we are doing for join and
>>>external logical switch (the other thread [1]).
>>>
>>> Regards,
>>> ~Girish
>>>
>>> [1]
>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2020-May/049994.html
>>>
>>
>> In general, these flows should be per router instead of per router port,
>> since the nat addresses are

Re: [ovs-discuss] options for OVN mailing lists, with a survey

2020-06-10 Thread Han Zhou
On Wed, Jun 10, 2020 at 10:01 AM Ben Pfaff  wrote:
>
> Hi!
>
> TL;DR: please answer the simple question at the end of the email.
>
> Since Open vSwitch and OVN are now separate projects, it would be a good
> idea to separate their mailing lists as well.  This is easier said than
> done for at least two reasons:
>
> * Mailing lists are troublesome to manage these days.  There are
>   few service providers that are willing to do the job.  Even
>   Linux Foundation just subcontracts them.  The current OVS
>   lists are grandparented in but LF is not willing to add more
>   lists under the same terms.
>
> * OVN and Open vSwitch have given out email addresses to
>   individual contributors at ovn.org and Linux Foundation has
>   only given half-hearted assurances that they can do this.
>
> We are now considering our options.  One factor is the different ways
> that a mailing list can be run.  There are three major categories:
>
> * "Open": anyone may post to the list regardless of whether they
>   subscribe (although spam gets dropped to some degree of
>   accuracy) and it will be delivered immediately.
>
>   This is how ovs-dev is configured.  We have to configure it
>   that way because ovs-dev gets CCed from linux-kernel and
>   netdev on a regular basis and that community expects mailing
>   lists to be open and will yell at us if they are not.
>
> * "Moderated": anyone may post to the list, but if they are not
>   subscribed to it then it is diverted to a moderation queue.  A
>   moderator then glances at it to make sure it isn't spam, and
>   if it isn't, directs it to the list.  ALSO: the moderator
>   generally adds the sender email address to the list of allowed
>   sender addresses.  Thus, the FIRST email from a given sender
>   is delayed, but not subsequent ones.
>
>   This is how the rest of the OVS mailing lists are configured.
>   Justin and I are the current moderators.  I probably spend
>   only a minute or two a day on it, it's easy work.
>
>   I do not know of any commercial service providers that support
>   this configuration.
>
> - "Closed": only subscribers may post.  Others' emails are
>   rejected.  OVS doesn't configure lists this way.
>
>   Commercial service providers, including LF's, do support this
>   configuration.
>
> If it is acceptable for OVN mailing lists to be closed, then we could do
> something relatively easy:
>
> * Continue to manage ovn.org email addresses for individuals in
>   our current way, which works.
>
> * Create a new lists.ovn.org MX host and turn it over to LF's
>   service provider to implement closed OVN mailing lists.
>
> Question: would it be acceptable for OVN mailing lists to be closed, so
> that emails from non-subscribers are rejected?  (See the definition of
> "closed" above.)

I think it is acceptable, if anyone can subscribe.

>
> Thanks,
>
> Ben.


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-10 Thread Han Zhou
On Wed, Jun 10, 2020 at 12:03 PM Han Zhou  wrote:

> Hi Girish, Venu,
>
> I sent a RFC patch series for the solution discussed. Could you give it a
> try when you get the chance?
>

Oops, I forgot the link:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=182602

>
> Thanks,
> Han
>
> On Tue, Jun 9, 2020 at 10:04 AM Han Zhou  wrote:
>
>>
>>
>> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer 
>> wrote:
>>
>>> Sorry for the delay, Han, a quick question below:
>>>
>>>
>>>
>>> *From:* ovn-kuberne...@googlegroups.com 
>>> *On Behalf Of *Han Zhou
>>> *Sent:* Wednesday, June 3, 2020 4:27 PM
>>> *To:* Girish Moodalbail 
>>> *Cc:* Tim Rozet ; Dumitru Ceara ;
>>> Daniel Alvarez Sanchez ; Dan Winship <
>>> danwins...@redhat.com>; ovn-kuberne...@googlegroups.com; ovs-discuss <
>>> ovs-discuss@openvswitch.org>; Michael Cambria ;
>>> Venugopal Iyer 
>>> *Subject:* Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve
>>> table
>>>
>>>
>>>
>>> *External email: Use caution opening links or attachments*
>>>
>>>
>>>
>>> Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry
>>> that I forgot to update here.
>>>
>>>
>>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
>>> wrote:
>>> >
>>> > Hello all,
>>> >
>>> > To kind of proceed with the proposed fixes, with minimal impact, is
>>> the following a reasonable approach?
>>> >
>>> > Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
>>> router. With this option enabled, the nextHop IP's MAC will be learned
>>> through a ARP request on the physical network. The ARP request will be
>>> flooded on the L2 broadcast domain (for both join switch and external
>>> switch).
>>>
>>> >
>>>
>>>
>>>
>>> The RFC patch fulfils this purpose:
>>> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>>>
>>> I am working on the formal patch.
>>>
>>>
>>>
>>> > Add an option, namely learn_from_arp_request={true|false}, for a
>>> gateway router. The option is interpreted as below:\
>>> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
>>> (default behavior)
>>> > "false" - if there is a MAC_binding for that IP and the MAC is
>>> different, then update that MAC/IP binding. The external entity might be
>>> trying to advertise the new MAC for that IP. (If we don't do this, then we
>>> will never learn External VIP to MAC changes)
>>> >
>>> > (Irrespective of, learn_from_arp_request is true or false, always do
>>> this -- if the TPA is on the router, add a new entry (it means the remote
>>> wants to communicate with this node, so it makes sense to learn the remote
>>> as well))
>>>
>>> >
>>>
>>>
>>>
>>> I am working on this as well, but delayed a little. I hope to have
>>> something this week.
>>>
>>> *[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp
>>> (unsolicited ARP request or reply) instead of learn_from_arp_request? This
>>> is just to protect from potential rogue usage of  GARP reply flooding the
>>> MAC bindings.?*
>>>
>>>
>>>
>>
>> Hi Venu, as discussed earlier in this thread it is hard to check if it is
>> GARP in OVN from the router ingress pipeline. The proposal here cares about
>> ARP request only. It seems the best option so far.
>>
>>
>>> *Thanks,*
>>>
>>>
>>>
>>> *-venu*
>>>
>>>
>>>
>>> >
>>> > For now, I think it is fine for ARP packets to be broadcasted on the
>>> tunnel for the `join` switch case. If it becomes a problem, then we can
>>> start looking around changing the logical flows.
>>> >
>>> > Thanks everyone for the lively discussion.
>>> >
>>> > Regards,
>>> > ~Girish
>>> >
>>> > On Thu, May 28, 2020 at 7:33 AM Tim Rozet  wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara 
>>> wrote:
>>> >>>
>>> >>> On 5/28/20 12:48 PM, Daniel Alvarez Sa

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-10 Thread Han Zhou
Hi Girish, Venu,

I sent a RFC patch series for the solution discussed. Could you give it a
try when you get the chance?

Thanks,
Han

On Tue, Jun 9, 2020 at 10:04 AM Han Zhou  wrote:

>
>
> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer 
> wrote:
>
>> Sorry for the delay, Han, a quick question below:
>>
>>
>>
>> *From:* ovn-kuberne...@googlegroups.com 
>> *On Behalf Of *Han Zhou
>> *Sent:* Wednesday, June 3, 2020 4:27 PM
>> *To:* Girish Moodalbail 
>> *Cc:* Tim Rozet ; Dumitru Ceara ;
>> Daniel Alvarez Sanchez ; Dan Winship <
>> danwins...@redhat.com>; ovn-kuberne...@googlegroups.com; ovs-discuss <
>> ovs-discuss@openvswitch.org>; Michael Cambria ;
>> Venugopal Iyer 
>> *Subject:* Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve
>> table
>>
>>
>>
>> *External email: Use caution opening links or attachments*
>>
>>
>>
>> Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry
>> that I forgot to update here.
>>
>>
>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
>> wrote:
>> >
>> > Hello all,
>> >
>> > To kind of proceed with the proposed fixes, with minimal impact, is the
>> following a reasonable approach?
>> >
>> > Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
>> router. With this option enabled, the nextHop IP's MAC will be learned
>> through a ARP request on the physical network. The ARP request will be
>> flooded on the L2 broadcast domain (for both join switch and external
>> switch).
>>
>> >
>>
>>
>>
>> The RFC patch fulfils this purpose:
>> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>>
>> I am working on the formal patch.
>>
>>
>>
>> > Add an option, namely learn_from_arp_request={true|false}, for a
>> gateway router. The option is interpreted as below:\
>> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
>> (default behavior)
>> > "false" - if there is a MAC_binding for that IP and the MAC is
>> different, then update that MAC/IP binding. The external entity might be
>> trying to advertise the new MAC for that IP. (If we don't do this, then we
>> will never learn External VIP to MAC changes)
>> >
>> > (Irrespective of, learn_from_arp_request is true or false, always do
>> this -- if the TPA is on the router, add a new entry (it means the remote
>> wants to communicate with this node, so it makes sense to learn the remote
>> as well))
>>
>> >
>>
>>
>>
>> I am working on this as well, but delayed a little. I hope to have
>> something this week.
>>
>> *[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp
>> (unsolicited ARP request or reply) instead of learn_from_arp_request? This
>> is just to protect from potential rogue usage of  GARP reply flooding the
>> MAC bindings.?*
>>
>>
>>
>
> Hi Venu, as discussed earlier in this thread it is hard to check if it is
> GARP in OVN from the router ingress pipeline. The proposal here cares about
> ARP request only. It seems the best option so far.
>
>
>> *Thanks,*
>>
>>
>>
>> *-venu*
>>
>>
>>
>> >
>> > For now, I think it is fine for ARP packets to be broadcasted on the
>> tunnel for the `join` switch case. If it becomes a problem, then we can
>> start looking around changing the logical flows.
>> >
>> > Thanks everyone for the lively discussion.
>> >
>> > Regards,
>> > ~Girish
>> >
>> > On Thu, May 28, 2020 at 7:33 AM Tim Rozet  wrote:
>> >>
>> >>
>> >>
>> >> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara 
>> wrote:
>> >>>
>> >>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
>> >>> > Hi all
>> >>> >
>> >>> > Sorry for top posting. I want to thank you all for the discussion
>> and
>> >>> > give also some feedback from OpenStack perspective which is affected
>> >>> > by the problem described here.
>> >>> >
>> >>> > In OpenStack, it's kind of common to have a shared external network
>> >>> > (logical switch with a localnet port) across many tenants. Each
>> tenant
>> >>> > user may create their own router where their instances will be
>

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-09 Thread Han Zhou
On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer  wrote:

> Sorry for the delay, Han, a quick question below:
>
>
>
> *From:* ovn-kuberne...@googlegroups.com  *On
> Behalf Of *Han Zhou
> *Sent:* Wednesday, June 3, 2020 4:27 PM
> *To:* Girish Moodalbail 
> *Cc:* Tim Rozet ; Dumitru Ceara ;
> Daniel Alvarez Sanchez ; Dan Winship <
> danwins...@redhat.com>; ovn-kuberne...@googlegroups.com; ovs-discuss <
> ovs-discuss@openvswitch.org>; Michael Cambria ;
> Venugopal Iyer 
> *Subject:* Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve
> table
>
>
>
> *External email: Use caution opening links or attachments*
>
>
>
> Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry
> that I forgot to update here.
>
>
> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
> wrote:
> >
> > Hello all,
> >
> > To kind of proceed with the proposed fixes, with minimal impact, is the
> following a reasonable approach?
> >
> > Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
> router. With this option enabled, the nextHop IP's MAC will be learned
> through a ARP request on the physical network. The ARP request will be
> flooded on the L2 broadcast domain (for both join switch and external
> switch).
>
> >
>
>
>
> The RFC patch fulfils this purpose:
> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>
> I am working on the formal patch.
>
>
>
> > Add an option, namely learn_from_arp_request={true|false}, for a gateway
> router. The option is interpreted as below:\
> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
> (default behavior)
> > "false" - if there is a MAC_binding for that IP and the MAC is
> different, then update that MAC/IP binding. The external entity might be
> trying to advertise the new MAC for that IP. (If we don't do this, then we
> will never learn External VIP to MAC changes)
> >
> > (Irrespective of, learn_from_arp_request is true or false, always do
> this -- if the TPA is on the router, add a new entry (it means the remote
> wants to communicate with this node, so it makes sense to learn the remote
> as well))
>
> >
>
>
>
> I am working on this as well, but delayed a little. I hope to have
> something this week.
>
> *[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp
> (unsolicited ARP request or reply) instead of learn_from_arp_request? This
> is just to protect from potential rogue usage of  GARP reply flooding the
> MAC bindings.?*
>
>
>

Hi Venu, as discussed earlier in this thread, it is hard to check in the
router ingress pipeline whether an ARP is a GARP. The proposal here covers ARP
requests only. It seems the best option so far.


> *Thanks,*
>
>
>
> *-venu*
>
>
>
> >
> > For now, I think it is fine for ARP packets to be broadcasted on the
> tunnel for the `join` switch case. If it becomes a problem, then we can
> start looking around changing the logical flows.
> >
> > Thanks everyone for the lively discussion.
> >
> > Regards,
> > ~Girish
> >
> > On Thu, May 28, 2020 at 7:33 AM Tim Rozet  wrote:
> >>
> >>
> >>
> >> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara 
> wrote:
> >>>
> >>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
> >>> > Hi all
> >>> >
> >>> > Sorry for top posting. I want to thank you all for the discussion and
> >>> > give also some feedback from OpenStack perspective which is affected
> >>> > by the problem described here.
> >>> >
> >>> > In OpenStack, it's kind of common to have a shared external network
> >>> > (logical switch with a localnet port) across many tenants. Each
> tenant
> >>> > user may create their own router where their instances will be
> >>> > connected to access the external network.
> >>> >
> >>> > In such scenario, we are hitting the issue described here. In
> >>> > particular in our tests we exercise 3K VIFs (with 1 FIP) each
> spanning
> >>> > 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
> >>> > connected to the public LS. This is creating a huge problem in terms
> >>> > of performance and tons of events due to the MAC_Binding entries
> >>> > generated as a consequence of the GARPs sent for the floating IPs.
> >>> >
> >>>
> >>> Just as an addition to thi

Re: [ovs-discuss] [OVN] logical flow explosion in lr_in_ip_input table for dnat_and_snat IPs

2020-06-03 Thread Han Zhou
On Wed, Jun 3, 2020 at 7:16 PM Girish Moodalbail 
wrote:

> Hello all,
>
> While working on an extension, see the diagram below, to the existing OVN
> logical topology for the ovn-kubernetes project, I am seeing an explosion
> of the "Reply to ARP requests" logical flows in the `lr_in_ip_input` table
> for the distributed router (ovn_cluster_router) configured with gateway
> port (rtol-LS)
>
> internet
>-+-->
> |
> |
>   +--localnet-port-+
>   |LS  |
>   +-ltor-LS+
>|
>|
>  +-rtol-LS+
>  |   ovn_cluster_router   |
>  |  (Distributed Router)  |
>  +-rtos-ls0--rtos-ls1rtos-ls2-+
>   |  |  |
>   |  |  |
> +-+-+   ++--+ +-+-+
> |  LS0  |   |  LS1  | |  LS2  |
> +-+-+   +-+-+ +-+-+
>   |   | |
>   p0  p1p2
>  IA0 IA1   IA2
>  EA0 EA1   EA2
> (Node0)  (Node1)   (Node2)
>
> In the topology above, each of the three logical switch port has an
> internal address of IAx and an external address of EAx (dnat_and_snat IP).
> They are all bound to their respective nodes (Nodex). A packet from `p0`
> heading towards the internet will be SNAT'ed to EA0 on the local hypervisor
> and then sent out through the LS's localnet-port on that hypervisor.
> Basically, they are configured for distributed NATing.
>
> I am seeing interesting "Reply to ARP requests" flows for arp.tpa set to
> "EAX". Flows are like this:
>
> For EA0
> priority=90, match=(inport == "rtos-ls0" && arp.tpa == EA0 && arp.op ==
> 1), action=(/* ARP reply */)
> priority=90, match=(inport == "rtos-ls1" && arp.tpa == EA0 && arp.op ==
> 1), action=(/* ARP reply */)
> priority=90, match=(inport == "rtos-ls2" && arp.tpa == EA0 && arp.op ==
> 1), action=(/* ARP reply */)
>
> For EA1
> priority=90, match=(inport == "rtos-ls0" && arp.tpa == EA1 && arp.op ==
> 1), action=(/* ARP reply */)
> priority=90, match=(inport == "rtos-ls1" && arp.tpa == EA0 && arp.op ==
> 1), action=(/* ARP reply */)
> priority=90, match=(inport == "rtos-ls2" && arp.tpa == EA1 && arp.op ==
> 1), action=(/* ARP reply */)
>
> Similarly, for EA2.
>
> So, we have N * N "Reply to ARP requests" flows for N nodes each with 1
> dnat_and_snat ip.
> This is causing scale issues.
>
> If you look at the flows for `EA0`, i am confused as to why is it needed?
>
>1. When will one see an ARP request for the EA0 from any of the
>LS{0,1,2}'s logical switch port.
>2. If it is needed at all, can't we just remove the `inport` thing
>altogether since the flow is configured for every port of logical router
>port except for the distributed gateway port rtol-LS. For this port, we
>could add an higher priority rule with action set to `next`.
>3. Say, we don't need east-west NAT connectivity. Is there a way to
>make these ARPs be learnt dynamically, like we are doing for join and
>external logical switch (the other thread [1]).
>
> Regards,
> ~Girish
>
> [1]
> https://mail.openvswitch.org/pipermail/ovs-discuss/2020-May/049994.html
>

In general, these flows should be per router instead of per router port,
since the NAT addresses are not attached to any particular router port. For
distributed gateway ports, per-port flows are still needed to match
is_chassis_resident(gateway-chassis). I think this can be handled by:
- priority X + 20 flows for each distributed gateway port with
is_chassis_resident(), reply ARP
- priority X + 10 flows for each distributed gateway port without
is_chassis_resident(), drop
- priority X flows for each router (no need to match inport), reply ARP

This way, there are N * (2D + 1) flows per router. N = number of NAT IPs, D
= number of distributed gateway ports. This would optimize the above
scenario where there is only 1 distributed gateway port but many regular
router ports. Thoughts?
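
To make the shape of that concrete, a purely illustrative rendering for one
NAT IP EA0 and one distributed gateway port rtol-LS (table number, priorities
and the chassisredirect name are placeholders, not taken from a real
deployment):

  table=3 (lr_in_ip_input), priority=110, match=(inport == "rtol-LS" && arp.tpa == EA0 && arp.op == 1 && is_chassis_resident("cr-rtol-LS")), action=(/* ARP reply */)
  table=3 (lr_in_ip_input), priority=100, match=(inport == "rtol-LS" && arp.tpa == EA0 && arp.op == 1), action=(drop;)
  table=3 (lr_in_ip_input), priority=90,  match=(arp.tpa == EA0 && arp.op == 1), action=(/* ARP reply */)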

Thanks,
Han


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-03 Thread Han Zhou
Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry
that I forgot to update here.

On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
wrote:
>
> Hello all,
>
> To kind of proceed with the proposed fixes, with minimal impact, is the
following a reasonable approach?
>
> Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
router. With this option enabled, the nextHop IP's MAC will be learned
through a ARP request on the physical network. The ARP request will be
flooded on the L2 broadcast domain (for both join switch and external
switch).
>

The RFC patch fulfils this purpose:
https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
I am working on the formal patch.

> Add an option, namely learn_from_arp_request={true|false}, for a gateway
router. The option is interpreted as below:
> "true" - learn the MAC/IP binding and add a new MAC_Binding entry
(default behavior)
> "false" - if there is a MAC_Binding for that IP and the MAC is different,
then update that MAC/IP binding. The external entity might be trying to
advertise the new MAC for that IP. (If we don't do this, then we will never
learn external VIP-to-MAC changes.)
>
> (Irrespective of whether learn_from_arp_request is true or false, always
do this: if the TPA is on the router, add a new entry. It means the remote
wants to communicate with this node, so it makes sense to learn the remote
as well.)
>

I am working on this as well, but delayed a little. I hope to have
something this week.
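
For clarity, here is a minimal Python sketch of the learning rule described
above, using a simplified in-memory view of the MAC_Binding table for one
gateway router (all names are illustrative only, not the actual
ovn-controller code):

def maybe_learn(mac_bindings, spa, sha, tpa, router_ips,
                learn_from_arp_request=True):
    """Decide whether an ARP request (sender spa/sha, target tpa) should
    add or update an entry in mac_bindings (an IP -> MAC dict)."""
    if learn_from_arp_request:
        # "true" (default): always learn the sender's binding
        mac_bindings[spa] = sha
        return
    # "false": only update an existing binding whose MAC changed, e.g. an
    # external VIP advertising a new MAC ...
    if spa in mac_bindings and mac_bindings[spa] != sha:
        mac_bindings[spa] = sha
    # ... and, irrespective of the option, learn the sender when the
    # target IP is owned by this router (the remote wants to talk to us)
    elif tpa in router_ips:
        mac_bindings[spa] = sha

bindings = {}
maybe_learn(bindings, "203.0.113.5", "aa:bb:cc:dd:ee:05", "203.0.113.1",
            router_ips={"203.0.113.1"}, learn_from_arp_request=False)
print(bindings)  # {'203.0.113.5': 'aa:bb:cc:dd:ee:05'} -- TPA is ours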

>
> For now, I think it is fine for ARP packets to be broadcast on the
tunnel for the `join` switch case. If it becomes a problem, then we can
start looking at changing the logical flows.
>
> Thanks everyone for the lively discussion.
>
> Regards,
> ~Girish
>
> On Thu, May 28, 2020 at 7:33 AM Tim Rozet  wrote:
>>
>>
>>
>> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara  wrote:
>>>
>>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
>>> > Hi all
>>> >
>>> > Sorry for top posting. I want to thank you all for the discussion and
>>> > also give some feedback from the OpenStack perspective, which is
>>> > affected by the problem described here.
>>> >
>>> > In OpenStack, it's kind of common to have a shared external network
>>> > (logical switch with a localnet port) across many tenants. Each tenant
>>> > user may create their own router, to which their instances are
>>> > connected to access the external network.
>>> >
>>> > In such a scenario, we are hitting the issue described here. In
>>> > particular, in our tests we exercise 3K VIFs (with 1 FIP each) spread
>>> > across 300 LSes; each LS is connected to an LR (i.e. 300 LRs) and each
>>> > of those routers is connected to the public LS. This is creating a huge
>>> > problem in terms of performance and tons of events due to the
>>> > MAC_Binding entries generated as a consequence of the GARPs sent for
>>> > the floating IPs.
>>> >
>>>
>>> Just as an addition to this, GARPs wouldn't be the only reason why all
>>> routers would learn the MAC_Binding. Even if we weren't sending GARPs
>>> for the FIPs, when a VM that's behind a FIP sends traffic to the
>>> outside, the router will generate an ARP request for the next hop using
>>> the FIP IP and FIP MAC. This will be broadcast to all routers connected
>>> to the public LS and will trigger them to learn the FIP-IP:FIP-MAC
>>> binding.
>>
>>
>> Yeah we shouldn't be learning on regular ARP requests.
>>
>>>
>>>
>>> > Thanks,
>>> > Daniel
>>> >
>>> >
>>> > On Thu, May 28, 2020 at 10:51 AM Dumitru Ceara 
wrote:
>>> >>
>>> >> On 5/28/20 8:34 AM, Han Zhou wrote:
>>> >>>
>>> >>>
>>> >>> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara  wrote:
>>> >>>>
>>> >>>> Hi Girish, Han,
>>> >>>>
>>> >>>> On 5/26/20 11:51 PM, Han Zhou wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail  wrote:
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-28 Thread Han Zhou
On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara  wrote:
>
> Hi Girish, Han,
>
> On 5/26/20 11:51 PM, Han Zhou wrote:
> >
> >
> > On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail  wrote:
> >>
> >>
> >>
> >> On Tue, May 26, 2020 at 12:42 PM Han Zhou  wrote:
> >>>
> >>> Hi Girish,
> >>>
> >>> Thanks for the summary. I agree with you that GARP request vs. reply
> > is irrelevant to the problem here.
>
> Well, actually I think GARP request vs. reply is relevant (at least for
> case 1 below) because if OVN were generating GARP replies we wouldn't
> need the priority-80 flow to determine whether an ARP request packet is
> actually an OVN self-originated GARP that needs to be flooded in the L2
> broadcast domain.
>
> On the other hand, router3 would be learning the mac_binding IP2,M2 from
> the GARP reply originated by router2 and vice versa, so we'd have to
> restrict flooding of GARP replies to non-patch ports.
>

Hi Dumitru, the point was that, on the external LS, the GRs will have to
send ARP requests to resolve unknown IPs (at least for the external GW),
and those have to be broadcast, which will cause all the GRs to learn the
MACs of all the other GRs. This is regardless of the GARP behavior. You are
right that if we only consider the join switch then GARP request vs. reply
does make a difference. However, GARP request/reply may really be needed
only on the external LS.
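
As a rough, back-of-the-envelope illustration of the scale impact, using
the OpenStack numbers mentioned elsewhere in this thread (300 routers on a
shared external LS, 3K FIPs); the worst case assumes every router ends up
learning every FIP and every peer GR MAC:

routers_on_public_ls = 300   # one router per tenant on the public LS
fips = 3000                  # 3K VIFs, one floating IP each

fip_bindings = routers_on_public_ls * fips                       # 900,000
gr_bindings = routers_on_public_ls * (routers_on_public_ls - 1)  # 89,700
print(fip_bindings + gr_bindings)  # ~1M rows in the SB MAC_Binding table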

> >>> Please see my comment inline below.
> >>>
> >>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail  wrote:
> >>> >
> >>> > Hello Dumitru,
> >>> >
> >>> > There are several things that are being discussed on this thread.
> > Let me see if I can tease them out for clarity.
> >>> >
> >>> > 1. All the router IPs are known to OVN (the join switch case)
> >>> > 2. Some IPs are known and some are not known (the external logical
> > switch that connects to the physical network case).
> >>> >
> >>> > Let us look at each of the case above:
> >>> >
> >>> > 1. Join Switch Case
> >>> >
> >>> > +----------------+      +----------------+
> >>> > |   l3gateway    |      |   l3gateway    |
> >>> > |    router2     |      |    router3     |
> >>> > +-------+--------+      +-------+--------+
> >>> >      IP2,M2                  IP3,M3
> >>> >         |                       |
> >>> >      +--+-----------------------+--+
> >>> >      |         join switch         |
> >>> >      +--------------+--------------+
> >>> >                     |
> >>> >                  IP1,M1
> >>> >           +--------+-------+
> >>> >           |  distributed   |
> >>> >           |     router     |
> >>> >           +----------------+
> >>> >
> >>> >
> >>> > Say GR router2 wants to send a packet out to the DR, and we don't
> > have static MAC-to-IP mappings in the lr_in_arp_resolve table on GR
> > router2 (with Han's patch setting dynamic_neigh_routes=true for all the
> > gateway routers). With this in mind, when an ARP request is sent out by
> > router2's hypervisor, the packet should be sent directly to the
> > distributed router alone. Your commit 32f5ebb0622 (ovn-northd: Limit
> > ARP/ND broadcast domain whenever possible) should have allowed only
> > unicast. However, in the ls_in_l2_lkup table we have
> >>> >
> >>> >   table=19(ls_in_l2_lkup  ), priority=80   , match=(eth.src ==
> > { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood";
output;)
> >>> >   table=19(ls_in_l2_lkup  ), priority=75   , match=(flags[1] ==
> > 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport =
> > "jtor-router2"; output;)
> >>> >
> >>> > As you can see, the `priority=80` rule will always be hit and the
> > packet sent out to all the GRs. The `priority=75` rule is never hit, so
> > we will see ARP packets on the GENEVE tunnel. So, we need to change
> > `priority=80` to match only GARP request packets. That way, for the
> > known OVN IPs case we don't broadcast.
> >>>
> >>> Since the solution to case 2) below (i.e.
> > learn_from_arp_request=false) solves the problem of case 1), too, I

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-26 Thread Han Zhou
On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail 
wrote:
>
>
>
> On Tue, May 26, 2020 at 12:42 PM Han Zhou  wrote:
>>
>> Hi Girish,
>>
>> Thanks for the summary. I agree with you that GARP request vs. reply is
irrelevant to the problem here.
>> Please see my comment inline below.
>>
>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail 
wrote:
>> >
>> > Hello Dumitru,
>> >
>> > There are several things that are being discussed on this thread. Let
me see if I can tease them out for clarity.
>> >
>> > 1. All the router IPs are known to OVN (the join switch case)
>> > 2. Some IPs are known and some are not known (the external logical
switch that connects to the physical network case).
>> >
>> > Let us look at each of the case above:
>> >
>> > 1. Join Switch Case
>> >
>> > +----------------+      +----------------+
>> > |   l3gateway    |      |   l3gateway    |
>> > |    router2     |      |    router3     |
>> > +-------+--------+      +-------+--------+
>> >      IP2,M2                  IP3,M3
>> >         |                       |
>> >      +--+-----------------------+--+
>> >      |         join switch         |
>> >      +--------------+--------------+
>> >                     |
>> >                  IP1,M1
>> >           +--------+-------+
>> >           |  distributed   |
>> >           |     router     |
>> >           +----------------+
>> >
>> >
>> > Say GR router2 wants to send a packet out to the DR, and we don't
have static MAC-to-IP mappings in the lr_in_arp_resolve table on GR router2
(with Han's patch setting dynamic_neigh_routes=true for all the gateway
routers). With this in mind, when an ARP request is sent out by router2's
hypervisor, the packet should be sent directly to the distributed router
alone. Your commit 32f5ebb0622 (ovn-northd: Limit ARP/ND broadcast domain
whenever possible) should have allowed only unicast. However, in the
ls_in_l2_lkup table we have
>> >
>> >   table=19(ls_in_l2_lkup  ), priority=80   , match=(eth.src == {
M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
>> >   table=19(ls_in_l2_lkup  ), priority=75   , match=(flags[1] == 0
&& arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2";
output;)
>> >
>> > As you can see, the `priority=80` rule will always be hit and the packet
sent out to all the GRs. The `priority=75` rule is never hit, so we will see
ARP packets on the GENEVE tunnel. So, we need to change `priority=80` to
match only GARP request packets. That way, for the known OVN IPs case we
don't broadcast.
>>
>> Since the solution to case 2) below (i.e. learn_from_arp_request=false)
solves the problem of case 1), too, I think we don't need this change just
for case 1). As @Dumitru Ceara  mentioned, there is some cost because it
adds extra flows. It would be a significant number of flows if there are a
lot of snat_and_dnat IPs. What do you think?
>
>
> Han, yes it will work. However, my only concern is that we would send all
these ARP requests via the tunnel to each of the 1000 hypervisors, and these
hypervisors will just drop them on the floor when they see
learn_from_arp_request=false.

I think maybe it is not a problem since it happens only once on the join
switch. Once the MAC is learned, it won't be broadcast again. It may be more
of a problem on the external LS if periodic GARP is required there.
However, I'd suggest running some tests to see if it is really a problem
before trying to solve it.

>
> Han, Dumitru,
>
> Why can't we swap the priorities of the above two flows so that an ARP
request for a next-hop IP known to OVN will always be sent via `unicast`?

If swapped, even a GARP won't get broadcast. Maybe that's not the desired
behavior.
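
To illustrate the lookup being discussed, here is a tiny Python sketch of a
highest-priority-wins match over the two flows quoted above (matches
abbreviated; "self_originated" stands in for the flags[1] bit, and this is
purely illustrative, not the actual pipeline):

def lookup(flows, pkt):
    # highest-priority matching flow wins
    for prio, match, action in sorted(flows, key=lambda f: f[0], reverse=True):
        if match(pkt):
            return action
    return "drop"

# ARP request from router2 (eth.src == M2) asking for the DR's IP1
pkt = {"eth.src": "M2", "arp.op": 1, "arp.tpa": "IP1", "self_originated": False}

flows = [
    (80, lambda p: p["eth.src"] == "M2" and p["arp.op"] == 1,
     'outport = "_MC_flood"'),           # always matches first today
    (75, lambda p: not p["self_originated"] and p["arp.op"] == 1
                   and p["arp.tpa"] == "IP1",
     'outport = "jtor-router2"'),        # never reached at priority 75
]

print(lookup(flows, pkt))  # -> outport = "_MC_flood", i.e. flooded to all GRs
# Swapping the two priorities would make the unicast flow win for known
# IPs, but then, as noted above, even a GARP would no longer be flooded.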

>
> Regards,
> ~Girish
>
>>
>> >
>> > 2. External Logical Switch Case
>> >
>> >               10.10.10.0/24
>> >           --------+---------
>> >                   |
>> >                localnet
>> >              +----+-----+
>> >              | external |
>> >         +----+   LS1    +----+
>> >         |    +----+-----+    |
>> >         |         |          |
>> >    10.10.10.2 10.10.10.3 10.10.10.4
>> >       SNAT       SNAT       SNAT
>> >   +-----+-----+ +-----+-----+ +---
>> >   | l3gateway |  | l
