Re: [ovs-dev] [PATCH v2 2/2] conntrack: Key connections by zone.

2024-05-21 Thread Felix Huettner via dev
Hi everyone, 

On Mon, May 13, 2024 at 03:05:47PM +0200, Paolo Valerio wrote:
> Hi Peng,
> 
> Peng He  writes:
> 
> > To separate into N cmaps, why not use the hash value divided by N?
> >
> 
> FWIW, I think it makes sense to discuss the potential benefits of other
> approaches as well.

I just settled for this approach since it seemed to be the easiest one.
And as the memory increase was quite small at around 10 MB, I decided
not to investigate alternative approaches further.

> They may even end up not being as performant as this one, but also some
> points to consider here are:
> 
> - the number of zones used in the common case (or even for the specific
>   use case as the expectation is that the fewer the zones involved, the
>   smaller the benefit)
> - given flush per zone is where most of the gain is, the flush per zone
>   for the use case
> 
> As a last remark, partitioning per zone also implies a substantial
> design change that may potentially conflict with other
> approaches targeting the overall performance (e.g., [0] is a quick
> example that comes to mind with good scalability improvements in cps,
> and probably, but this is just a guess, measurable improvements in the
> same ct execute test).
> 
> [0] 
> https://patchwork.ozlabs.org/project/openvswitch/patch/165668250987.1967719.7371616138630033269.stgit%40fed.void/

I would assume these approaches could even be combined.
Implementing bucketing on top of individual zones should give us the
same performance improvement as we now have when working on a whole
zone. In addition, we would gain performance when accessing individual
connections within a zone, which is something this patch did not touch
at all.
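
To illustrate (purely a sketch on top of OVS's lib/cmap.h; the bucket
count, struct and function names are made up):

  #include <stdint.h>
  #include "cmap.h"

  #define N_BUCKETS 8

  struct zone_conns {
      struct cmap buckets[N_BUCKETS];  /* hash-partitioned within a zone */
  };

  struct conntrack_sketch {
      struct zone_conns conns[UINT16_MAX + 1];   /* one slot per zone */
  };

  /* Select the cmap for a connection: by zone first, then by key hash. */
  static struct cmap *
  conns_for(struct conntrack_sketch *ct, uint16_t zone, uint32_t hash)
  {
      return &ct->conns[zone].buckets[hash % N_BUCKETS];
  }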



Re: [ovs-dev] [PATCH ovn v2] northd: Support routing over other address families.

2024-05-07 Thread Felix Huettner via dev
> 
> Let's say we add this static route -  "prefix = 172.168.0.0/24,
> nexthop = aef0::4"
> and let's assume OVN doesn't know the next hop "aef0::4" MAC address.
> So with your patch, the original packet will be dropped because
> pinctrl.c does not handle it properly.
> 
> I see 2 approaches here :
> 1.  Fix it in ovn-northd/pinctrl so that for an IPv6 next hop a neighbor
> solicitation packet is generated and for an IPv4 next hop an ARP request
> packet is generated.
> 2.  CMS should add a static mac binding entry in the OVN Northbound db
> for the next hop for this scenario to work properly.
>  Eg.  ovn-nbctl static-mac-binding-add lr0-public aef0::5 
> 10:14:00:00:00:05
>  In this case, ovn-controller(s) will add an OpenFlow flow like the one below
> 
> ---
>  ovs-ofctl dump-flows br-int table=66
>  cookie=0xa4249f73, duration=470.024s, table=66, n_packets=0,
> n_bytes=0, 
> priority=50,reg0=0xaef00000,reg1=0,reg2=0,reg3=0x5,reg15=0x3,metadata=0x3
> actions=mod_dl_dst:10:14:00:00:00:05,load:0x1->NXM_NX_REG10[6]
> 
> 
> ovn-northd adds the below logical flows in the router ingress pipeline
> stage - "lr_in_arp_resolve".  These logical flows are hit if OVN can't
> resolve the next hop mac address:
> 
>   table=20(lr_in_arp_resolve  ), priority=1, match=(ip4),
> action=(get_arp(outport, reg0); next;)
>   table=20(lr_in_arp_resolve  ), priority=1, match=(ip6),
> action=(get_nd(outport, xxreg0); next;)
> 
> The OVN actions get_arp and get_nd gets translated to OVS actions -
> resubmit(,66).
> The problem is that for IPv4 packets, reg0 is used and for IPv6
> packets, xxreg0 is used.  But with your patch to support mixed static
> routes, it will not work.
> 
> For both approaches (whichever we choose to take), we need to
> solve this problem.  Maybe we can use a register bit to indicate whether
> the nexthop is an IPv4 or an IPv6 addr
> in the "lr_in_ip_routing" stage and depending on that we can have
> below logical flows,
> 
>  table=20(lr_in_arp_resolve  ), priority=1, match=(reg10[1] == 0),
> action=(get_arp(outport, reg0); next;)
>   table=20(lr_in_arp_resolve  ), priority=1, match=(reg10[1] == 1),
> action=(get_nd(outport, xxreg0); next;)
> 
> I think your next version should address this, otherwise the feature
> will be incomplete.
> 
> If we go with approach (1), we can make pinctrl generate an ARP request
> or an NS depending on a register bit.
> 
> Thanks
> Numan
> 

Sounds good, will do that in the coming weeks.

Thank you
Felix



Re: [ovs-dev] [PATCH ovn v2] northd: Support routing over other address families.

2024-05-07 Thread Felix Huettner via dev
On Tue, May 07, 2024 at 10:06:07AM -0400, Numan Siddique wrote:
> On Tue, May 7, 2024 at 5:24 AM Felix Huettner via dev
>  wrote:
> >
> > On Mon, May 06, 2024 at 11:03:42PM -0400, Numan Siddique wrote:
> > > On Mon, Apr 22, 2024 at 9:02 AM Felix Huettner via dev
> > >  wrote:
> > > >
> > > > In most cases IPv4 packets are routed only over other IPv4 networks and
> > > > IPv6 packets are routed only over IPv6 networks. However there is no
> > > > inherent reason for this limitation. Routing IPv4 packets over IPv6
> > > > networks just requires the router to contain a route for an IPv4 network
> > > > with an IPv6 nexthop.
> > > >
> > > > This was previously prevented in OVN in ovn-nbctl and northd. By
> > > > removing these filters the forwarding will work if the mac addresses are
> > > > prepopulated.
> > > >
> > > > If the mac addresses are not prepopulated we will attempt to
> > > > resolve them using the original address family of the packet and
> > > > not the address family of the nexthop. This will fail and we will
> > > > not forward the packet.
> > > >
> > >
> > > Thanks for adding this feature.
> > >
> > > If I understand correctly from the above commit message about
> > > prepopulation,  is it expected that
> > > CMS/admin needs to add a static mac binding entry in the OVN
> > > Northbound 'Static_MAC_Binding' table
> > > for this feature to work?  Please correct me if I'm wrong.
> > > If this is the case, then it needs to be documented properly (perhaps
> > > in ovn-nb.xml)
> >
> > No, the limitation is only that the CMS must not set
> > dynamic_neigh_routers to true on the Logical_Routers.
> >
> > If there are other reasons that cause OVN to send an ARP request to
> > look up the nexthop of a route then these will have that issue as
> > well. I am just not aware of any such reason.
> 
> Thanks.  I'm still not clear on what happens if OVN doesn't know the
> nexthop mac address.  What happens in that case?  Normally, OVN would
> generate an ARP request for the next hop IP and learn the mac.  What
> happens in the mixed route case?
> 
> Numan

So let's assume you have an IPv4 prefix and route it over an IPv6 network.
Once such a packet reaches the Logical_Router it will hit the arp action
in S_ROUTER_IN_ARP_REQUEST (build_arp_request_flows_for_lrouter).

The node executing that action will handle it as a packet-in to
pinctrl_handle_arp, which is actually inappropriate here. We would need
to reach pinctrl_handle_nd_na instead, as we need to send a neighbor
discovery packet.

This could be changed in northd.c by using an additional register bit
to store the nexthop address family and then using that bit to decide
which address family to look up, instead of relying on the packet's
address family; see the sketch below.

However, if we end up in pinctrl_handle_nd_na we seem to reuse the
existing IPv4 packet and patch the Neighbor Discovery packet over it.
I thought this caused the lookup to fail, as the address families
mismatch.
However, I am no longer sure of that explanation after writing this :)

Thanks
Felix



Re: [ovs-dev] [PATCH ovn v2] northd: Support routing over other address families.

2024-05-07 Thread Felix Huettner via dev
On Mon, May 06, 2024 at 11:03:42PM -0400, Numan Siddique wrote:
> On Mon, Apr 22, 2024 at 9:02 AM Felix Huettner via dev
>  wrote:
> >
> > In most cases IPv4 packets are routed only over other IPv4 networks and
> > IPv6 packets are routed only over IPv6 networks. However there is no
> > inherent reason for this limitation. Routing IPv4 packets over IPv6
> > networks just requires the router to contain a route for an IPv4 network
> > with an IPv6 nexthop.
> >
> > This was previously prevented in OVN in ovn-nbctl and northd. By
> > removing these filters the forwarding will work if the mac addresses are
> > prepopulated.
> >
> > If the mac addresses are not prepopulated we will attempt to resolve
> > them using the original address family of the packet and not the
> > address family of the nexthop. This will fail and we will not forward
> > the packet.
> >
> 
> Thanks for adding this feature.
> 
> If I understand correctly from the above commit message about
> prepopulation,  is it expected that
> CMS/admin needs to add a static mac binding entry in the OVN
> Northbound 'Static_MAC_Binding' table
> for this feature to work?  Please correct me if I'm wrong.
> If this is the case, then it needs to be documented properly (perhaps
> in ovn-nb.xml)

No, the limitation is only that the CMS must not set
dynamic_neigh_routers to true on the Logical_Routers.

If there are other reasons that cause OVN to send an ARP request to look
up the nexthop of a route then these will have that issue as well. I am
just not aware of any such reason.

> 
> The patch overall looks good to me.  Can you please add a small test
> case in ovn-northd.at to check
> the expected logical flows in the "lr_in_ip_routing" stage when you
> have these mixed mode static routes ?
> You can also check for correctness of flows in other tables too which
> are generated for the static routes.

Yes, I'll do that and send a new version.

Thanks
Felix

> 
> Thanks
> Numan
> 
> 
> 


[ovs-dev] [PATCH v2 2/2] conntrack: Key connections by zone.

2024-04-24 Thread Felix Huettner via dev
Currently conntrack uses a single large cmap for all connections stored.
This cmap contains all connections for all conntrack zones which are
completely separate from each other. By separating each zone to its own
cmap we can significantly optimize the performance when using multiple
zones.

The change fixes a similar issue as [1], where slow conntrack zone flush
operations significantly slow down OVN router failover. The difference is
just that this fix applies when using DPDK, while [1] covered the case of
the ovs kernel module.

As we now need to store more cmaps, the memory usage of struct conntrack
increases by 524280 bytes. Additionally we need 65535 cmaps with 128
bytes each (another 8388480 bytes), for a total memory increase of
around 10 MB.
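
The heart of the change can be sketched as follows (simplified; the
helper name is made up and the actual patch adjusts all lookup, insert
and flush paths accordingly):

  /* Instead of one shared cmap, every access first selects the
   * per-zone table. */
  static inline struct cmap *
  conns_for_zone(struct conntrack *ct, uint16_t zone)
  {
      return &ct->conns[zone];
  }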

Running "./ovstest test-conntrack benchmark 4 33554432 32 1" shows no
real difference in the multithreading behaviour against a single zone.

Running the new "./ovstest test-conntrack benchmark-zones" shows
significant speedups as shown below. The values for "ct execute" are for
acting on the complete zone with all its entries in total (so in the
first case adding 10,000 new conntrack entries). All tests are run 1000
times.

When running with 1,000 zones with 10,000 entries each we see the
following results (all in microseconds):
"./ovstest test-conntrack benchmark-zones 1 1000 1000"

+------------------------+------+--------+---------+---------+
|                        |  Min |   Max  |  95%ile |   Avg   |
+------------------------+------+--------+---------+---------+
| ct execute (commit)    |      |        |         |         |
|     with commit        | 2266 |   3505 | 2707.06 | 2592.06 |
|     without commit     | 2411 |  12730 | 4432.50 | 2736.78 |
+------------------------+------+--------+---------+---------+
| ct execute (no commit) |      |        |         |         |
|     with commit        |  699 |   1238 |  886.15 |  722.67 |
|     without commit     |  700 |   3377 | 1934.42 |  803.53 |
+------------------------+------+--------+---------+---------+
| flush full zone        |      |        |         |         |
|     with commit        |  619 |   1122 |  901.36 |  679.15 |
|     without commit     |  618 | 105078 |   64591 | 2886.46 |
+------------------------+------+--------+---------+---------+
| flush empty zone       |      |        |         |         |
|     with commit        |    0 |      5 |    1.00 |    0.64 |
|     without commit     |   54 |  87469 |   64520 | 2172.25 |
+------------------------+------+--------+---------+---------+

When running with 10,000 zones with 1,000 entries each we see the
following results (all in microseconds):
"./ovstest test-conntrack benchmark-zones 1000 1 1000"

+------------------------+------+--------+---------+---------+
|                        |  Min |   Max  |  95%ile |   Avg   |
+------------------------+------+--------+---------+---------+
| ct execute (commit)    |      |        |         |         |
|     with commit        |  215 |    287 |  231.88 |  222.30 |
|     without commit     |  214 |   1692 |  569.18 |  285.83 |
+------------------------+------+--------+---------+---------+
| ct execute (no commit) |      |        |         |         |
|     with commit        |   68 |     97 |   74.69 |   70.09 |
|     without commit     |   68 |    300 |  158.40 |   82.06 |
+------------------------+------+--------+---------+---------+
| flush full zone        |      |        |         |         |
|     with commit        |   47 |    211 |   56.34 |   50.34 |
|     without commit     |   48 |  96330 |   63392 |   63923 |
+------------------------+------+--------+---------+---------+
| flush empty zone       |      |        |         |         |
|     with commit        |    0 |      1 |    1.00 |    0.44 |
|     without commit     |    3 | 109728 |   63923 | 3629.44 |
+------------------------+------+--------+---------+---------+

Comparing the averages we see:
* a moderate performance improvement for conntrack_execute, with or
  without committing, of around 6% to 23%
* a significant performance improvement for flushing a full zone of
  around 75% to 99%
* an even more significant improvement for flushing empty zones, since we
  no longer need to check any unrelated connections

[1] 9ec849e8aa869b646c372fac552ae2609a4b5f66

Signed-off-by: Felix Huettner 
---
v1->v2: fix formatting

 lib/conntrack-private.h |  2 +-
 lib/conntrack.c | 81 +++--
 lib/conntrack.h |  1 +
 3 files changed, 56 insertions(+), 28 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 3fd5fccd3..71367f211 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -200,7 +200,7 @@ enum ct_ephemeral_range {
 
 struct conntrack {
 struct ovs_mutex ct_lock; /* Protects 2 following fields. */
-struct cmap conns OVS_GUARDED;
+struct cmap conns[UINT16_MAX + 1] OVS_GUARDED;
 struct rculist exp_lists[N_EXP_LISTS];
 struct cmap zone_limits 

[ovs-dev] [PATCH v2 1/2] test-conntrack: Add per zone benchmark tool.

2024-04-24 Thread Felix Huettner via dev
The current test-conntrack benchmark command runs with multiple threads
against a single conntrack zone. We now add a new benchmark-zones
command that allows us to check the performance between multiple zones.

In it we test the following scenarios for one zone while other zones
also contain entries (an example invocation is shown after the list):
1. Flushing a single full zone
2. Flushing a single empty zone
3. Committing new conntrack entries against a single zone
4. Running conntrack_execute without commit against the entries of a
   single zone
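
For reference, the new command takes the number of connections per zone,
the number of zones, and the number of iterations as positional
arguments, e.g.:

  ./ovstest test-conntrack benchmark-zones 10000 1000 1000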

Signed-off-by: Felix Huettner 
---
v1->v2: fix formatting

 tests/test-conntrack.c | 181 +
 1 file changed, 166 insertions(+), 15 deletions(-)

diff --git a/tests/test-conntrack.c b/tests/test-conntrack.c
index 292b6c048..dc8d6cff9 100644
--- a/tests/test-conntrack.c
+++ b/tests/test-conntrack.c
@@ -25,36 +25,48 @@
 #include "ovstest.h"
 #include "pcap-file.h"
 #include "timeval.h"
+#include "stopwatch.h"
+
+#define STOPWATCH_CT_EXECUTE_COMMIT "ct-execute-commit"
+#define STOPWATCH_CT_EXECUTE_NO_COMMIT "ct-execute-no-commit"
+#define STOPWATCH_FLUSH_FULL_ZONE "full-zone"
+#define STOPWATCH_FLUSH_EMPTY_ZONE "empty-zone"
 
 static const char payload[] = "5054000a505400090800451c00"
   "11a4cd0a0101010a010102000100020008";
 
+static struct dp_packet *
+build_packet(uint16_t udp_src, uint16_t udp_dst, ovs_be16 *dl_type)
+{
+struct udp_header *udp;
+struct flow flow;
+struct dp_packet *pkt = dp_packet_new(sizeof payload / 2);
+
+dp_packet_put_hex(pkt, payload, NULL);
+flow_extract(pkt, &flow);
+
+udp = dp_packet_l4(pkt);
+udp->udp_src = htons(udp_src);
+udp->udp_dst = htons(udp_dst);
+
+*dl_type = flow.dl_type;
+
+return pkt;
+}
+
 static struct dp_packet_batch *
 prepare_packets(size_t n, bool change, unsigned tid, ovs_be16 *dl_type)
 {
 struct dp_packet_batch *pkt_batch = xzalloc(sizeof *pkt_batch);
-struct flow flow;
 size_t i;
 
 ovs_assert(n <= ARRAY_SIZE(pkt_batch->packets));
 
 dp_packet_batch_init(pkt_batch);
 for (i = 0; i < n; i++) {
-struct udp_header *udp;
-struct dp_packet *pkt = dp_packet_new(sizeof payload/2);
-
-dp_packet_put_hex(pkt, payload, NULL);
-flow_extract(pkt, &flow);
-
-udp = dp_packet_l4(pkt);
-udp->udp_src = htons(ntohs(udp->udp_src) + tid);
-
-if (change) {
-udp->udp_dst = htons(ntohs(udp->udp_dst) + i);
-}
-
+uint16_t udp_dst = change ? 2 + i : 2;
+struct dp_packet *pkt = build_packet(1 + tid, udp_dst, dl_type);
 dp_packet_batch_add(pkt_batch, pkt);
-*dl_type = flow.dl_type;
 }
 
 return pkt_batch;
@@ -153,6 +165,140 @@ test_benchmark(struct ovs_cmdl_context *ctx)
 free(threads);
 }
 
+static void
+test_benchmark_zones(struct ovs_cmdl_context *ctx)
+{
+unsigned long n_conns, n_zones, iterations;
+long long start;
+unsigned i, j;
+ovs_be16 dl_type;
+long long now = time_msec();
+
+fatal_signal_init();
+
+/* Parse arguments */
+n_conns = strtoul(ctx->argv[1], NULL, 0);
+if (n_conns == 0 || n_conns >= UINT32_MAX) {
+ovs_fatal(0, "n_conns must be between 1 and 2^32");
+}
+n_zones = strtoul(ctx->argv[2], NULL, 0);
+if (n_zones == 0 || n_zones >= UINT16_MAX) {
+ovs_fatal(0, "n_zones must be between 1 and 2^16");
+}
+iterations = strtoul(ctx->argv[3], NULL, 0);
+if (iterations == 0) {
+ovs_fatal(0, "iterations must be greater than 0");
+}
+
+ct = conntrack_init();
+
+/* Create initial connection entries */
+start = time_msec();
+struct dp_packet_batch **pkt_batch = xzalloc(n_conns * sizeof *pkt_batch);
+for (i = 0; i < n_conns; i++) {
+pkt_batch[i] = xzalloc(sizeof(struct dp_packet_batch));
+dp_packet_batch_init(pkt_batch[i]);
+uint16_t udp_src = (i & 0xffff0000) >> 16;
+if (udp_src == 0) {
+udp_src = UINT16_MAX;
+}
+uint16_t udp_dst = i & 0xffff;
+if (udp_dst == 0) {
+udp_dst = UINT16_MAX;
+}
+struct dp_packet *pkt = build_packet(udp_src, udp_dst, &dl_type);
+dp_packet_batch_add(pkt_batch[i], pkt);
+}
+printf("initial packet generation time: %lld ms\n", time_msec() - start);
+
+/* Put initial entries to each zone */
+start = time_msec();
+for (i = 0; i < n_zones; i++) {
+for (j = 0; j < n_conns; j++) {
+conntrack_execute(ct, pkt_batch[j], dl_type, false, true, i,
+  NULL, NULL, NULL, NULL, now, 0);
+pkt_metadata_init_conn(&pkt_batch[j]->packets[0]->md);
+}
+}
+printf("initial insert time: %lld ms\n", time_msec() - start);
+
+/* Actually run the tests */
+stopwatch_create(STOPWATCH_CT_EXECUTE_COMMIT, SW_US);
+stopwatch_create(STOPWATCH_CT_EXECUTE_NO_COMMIT, SW_US);
+stopwatch_create(STOPWATCH_FLUSH_FULL_ZONE, SW_US);

[ovs-dev] [PATCH 2/2] conntrack: Key connections by zone.

2024-04-24 Thread Felix Huettner via dev
Currently conntrack uses a single large cmap for all connections stored.
This cmap contains all connections for all conntrack zones which are
completely separate from each other. By separating each zone to its own
cmap we can significantly optimize the performance when using multiple
zones.

The change fixes a similar issue as [1], where slow conntrack zone flush
operations significantly slow down OVN router failover. The difference is
just that this fix applies when using DPDK, while [1] covered the case of
the ovs kernel module.

As we now need to store more cmaps, the memory usage of struct conntrack
increases by 524280 bytes. Additionally we need 65535 cmaps with 128
bytes each (another 8388480 bytes), for a total memory increase of
around 10 MB.

Running "./ovstest test-conntrack benchmark 4 33554432 32 1" shows no
real difference in the multithreading behaviour against a single zone.

Running the new "./ovstest test-conntrack benchmark-zones" shows
significant speedups as shown below. The values for "ct execute" are for
acting on the complete zone with all its entries in total (so in the
first case adding 10,000 new conntrack entries). All tests are run 1000
times.

When running with 1,000 zones with 10,000 entries each we see the
following results (all in microseconds):
"./ovstest test-conntrack benchmark-zones 1 1000 1000"

+------------------------+------+--------+---------+---------+
|                        |  Min |   Max  |  95%ile |   Avg   |
+------------------------+------+--------+---------+---------+
| ct execute (commit)    |      |        |         |         |
|     with commit        | 2266 |   3505 | 2707.06 | 2592.06 |
|     without commit     | 2411 |  12730 | 4432.50 | 2736.78 |
+------------------------+------+--------+---------+---------+
| ct execute (no commit) |      |        |         |         |
|     with commit        |  699 |   1238 |  886.15 |  722.67 |
|     without commit     |  700 |   3377 | 1934.42 |  803.53 |
+------------------------+------+--------+---------+---------+
| flush full zone        |      |        |         |         |
|     with commit        |  619 |   1122 |  901.36 |  679.15 |
|     without commit     |  618 | 105078 |   64591 | 2886.46 |
+------------------------+------+--------+---------+---------+
| flush empty zone       |      |        |         |         |
|     with commit        |    0 |      5 |    1.00 |    0.64 |
|     without commit     |   54 |  87469 |   64520 | 2172.25 |
+------------------------+------+--------+---------+---------+

When running with 10,000 zones with 1,000 entries each we see the
following results (all in microseconds):
"./ovstest test-conntrack benchmark-zones 1000 1 1000"

+------------------------+------+--------+---------+---------+
|                        |  Min |   Max  |  95%ile |   Avg   |
+------------------------+------+--------+---------+---------+
| ct execute (commit)    |      |        |         |         |
|     with commit        |  215 |    287 |  231.88 |  222.30 |
|     without commit     |  214 |   1692 |  569.18 |  285.83 |
+------------------------+------+--------+---------+---------+
| ct execute (no commit) |      |        |         |         |
|     with commit        |   68 |     97 |   74.69 |   70.09 |
|     without commit     |   68 |    300 |  158.40 |   82.06 |
+------------------------+------+--------+---------+---------+
| flush full zone        |      |        |         |         |
|     with commit        |   47 |    211 |   56.34 |   50.34 |
|     without commit     |   48 |  96330 |   63392 |   63923 |
+------------------------+------+--------+---------+---------+
| flush empty zone       |      |        |         |         |
|     with commit        |    0 |      1 |    1.00 |    0.44 |
|     without commit     |    3 | 109728 |   63923 | 3629.44 |
+------------------------+------+--------+---------+---------+

Comparing the averages we see:
* a moderate performance improvement for conntrack_execute, with or
  without committing, of around 6% to 23%
* a significant performance improvement for flushing a full zone of
  around 75% to 99%
* an even more significant improvement for flushing empty zones, since we
  no longer need to check any unrelated connections

[1] 9ec849e8aa869b646c372fac552ae2609a4b5f66

Signed-off-by: Felix Huettner 
---
 lib/conntrack-private.h |  2 +-
 lib/conntrack.c | 81 +++--
 lib/conntrack.h |  1 +
 3 files changed, 56 insertions(+), 28 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 3fd5fccd3..19f1f42a0 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -200,7 +200,7 @@ enum ct_ephemeral_range {
 
 struct conntrack {
 struct ovs_mutex ct_lock; /* Protects 2 following fields. */
-struct cmap conns OVS_GUARDED;
+struct cmap conns[UINT16_MAX+1] OVS_GUARDED;
 struct rculist exp_lists[N_EXP_LISTS];
 struct cmap zone_limits OVS_GUARDED;
 struct cmap 

[ovs-dev] [PATCH 1/2] test-conntrack: add per zone benchmark tool

2024-04-24 Thread Felix Huettner via dev
The current test-conntrack benchmark command runs with multiple threads
against a single conntrack zone. We now add a new benchmark-zones
command that allows us to check the performance between multiple zones.

In it we test the following scenarios for one zone while other zones
also contain entries:
1. Flushing a single full zone
2. Flushing a single empty zone
3. Committing new conntrack entries against a single zone
4. Running conntrack_execute without commit against the entries of a
   single zone

Signed-off-by: Felix Huettner 
---
 tests/test-conntrack.c | 180 +
 1 file changed, 165 insertions(+), 15 deletions(-)

diff --git a/tests/test-conntrack.c b/tests/test-conntrack.c
index 292b6c048..9ec3d0a61 100644
--- a/tests/test-conntrack.c
+++ b/tests/test-conntrack.c
@@ -25,36 +25,48 @@
 #include "ovstest.h"
 #include "pcap-file.h"
 #include "timeval.h"
+#include "stopwatch.h"
+
+#define STOPWATCH_CT_EXECUTE_COMMIT "ct-execute-commit"
+#define STOPWATCH_CT_EXECUTE_NO_COMMIT "ct-execute-no-commit"
+#define STOPWATCH_FLUSH_FULL_ZONE "full-zone"
+#define STOPWATCH_FLUSH_EMPTY_ZONE "empty-zone"
 
 static const char payload[] = "5054000a505400090800451c00"
   "11a4cd0a0101010a010102000100020008";
 
+static struct dp_packet *
+build_packet(uint16_t udp_src, uint16_t udp_dst, ovs_be16 *dl_type)
+{
+struct udp_header *udp;
+struct flow flow;
+struct dp_packet *pkt = dp_packet_new(sizeof payload/2);
+
+dp_packet_put_hex(pkt, payload, NULL);
+flow_extract(pkt, &flow);
+
+udp = dp_packet_l4(pkt);
+udp->udp_src = htons(udp_src);
+udp->udp_dst = htons(udp_dst);
+
+*dl_type = flow.dl_type;
+
+return pkt;
+}
+
 static struct dp_packet_batch *
 prepare_packets(size_t n, bool change, unsigned tid, ovs_be16 *dl_type)
 {
 struct dp_packet_batch *pkt_batch = xzalloc(sizeof *pkt_batch);
-struct flow flow;
 size_t i;
 
 ovs_assert(n <= ARRAY_SIZE(pkt_batch->packets));
 
 dp_packet_batch_init(pkt_batch);
 for (i = 0; i < n; i++) {
-struct udp_header *udp;
-struct dp_packet *pkt = dp_packet_new(sizeof payload/2);
-
-dp_packet_put_hex(pkt, payload, NULL);
-flow_extract(pkt, &flow);
-
-udp = dp_packet_l4(pkt);
-udp->udp_src = htons(ntohs(udp->udp_src) + tid);
-
-if (change) {
-udp->udp_dst = htons(ntohs(udp->udp_dst) + i);
-}
-
+uint16_t udp_dst = change ? 2 + i : 2;
+struct dp_packet *pkt = build_packet(1 + tid, udp_dst, dl_type);
 dp_packet_batch_add(pkt_batch, pkt);
-*dl_type = flow.dl_type;
 }
 
 return pkt_batch;
@@ -153,6 +165,139 @@ test_benchmark(struct ovs_cmdl_context *ctx)
 free(threads);
 }
 
+static void
+test_benchmark_zones(struct ovs_cmdl_context *ctx)
+{
+unsigned long n_conns, n_zones, iterations;
+long long start;
+unsigned i, j;
+ovs_be16 dl_type;
+long long now = time_msec();
+
+fatal_signal_init();
+
+/* Parse arguments */
+n_conns = strtoul(ctx->argv[1], NULL, 0);
+if (n_conns == 0 || n_conns >= UINT32_MAX) {
+ovs_fatal(0, "n_conns must be between 1 and 2^32");
+}
+n_zones = strtoul(ctx->argv[2], NULL, 0);
+if (n_zones == 0 || n_zones >= UINT16_MAX) {
+ovs_fatal(0, "n_zones must be between 1 and 2^16");
+}
+iterations = strtoul(ctx->argv[3], NULL, 0);
+if (iterations == 0) {
+ovs_fatal(0, "iterations must be greater than 0");
+}
+
+ct = conntrack_init();
+
+/* Create initial connection entries */
+start = time_msec();
+struct dp_packet_batch **pkt_batch = xzalloc(n_conns * sizeof *pkt_batch);
+for (i = 0; i < n_conns; i++) {
+pkt_batch[i] = xzalloc(sizeof(struct dp_packet_batch));
+dp_packet_batch_init(pkt_batch[i]);
+uint16_t udp_src = (i & 0xffff0000) >> 16;
+if (udp_src == 0) {
+udp_src = UINT16_MAX;
+}
+uint16_t udp_dst = i & 0xffff;
+if (udp_dst == 0) {
+udp_dst = UINT16_MAX;
+}
+struct dp_packet *pkt = build_packet(udp_src, udp_dst, &dl_type);
+dp_packet_batch_add(pkt_batch[i], pkt);
+}
+printf("initial packet generation time: %lld ms\n", time_msec() - start);
+
+/* Put initial entries to each zone */
+start = time_msec();
+for (i = 0; i < n_zones; i++) {
+for (j = 0; j < n_conns; j++) {
+conntrack_execute(ct, pkt_batch[j], dl_type, false, true, i,
+  NULL, NULL, NULL, NULL, now, 0);
+pkt_metadata_init_conn(&pkt_batch[j]->packets[0]->md);
+}
+}
+printf("initial insert time: %lld ms\n", time_msec() - start);
+
+/* Actually run the tests */
+stopwatch_create(STOPWATCH_CT_EXECUTE_COMMIT, SW_US);
+stopwatch_create(STOPWATCH_CT_EXECUTE_NO_COMMIT, SW_US);
+stopwatch_create(STOPWATCH_FLUSH_FULL_ZONE, SW_US);
+

[ovs-dev] [PATCH ovn v3] northd: Support routing over other address families.

2024-04-22 Thread Felix Huettner via dev
In most cases IPv4 packets are routed only over other IPv4 networks and
IPv6 packets are routed only over IPv6 networks. However there is no
inherent reason for this limitation. Routing IPv4 packets over IPv6
networks just requires the router to contain a route for an IPv4 network
with an IPv6 nexthop.

This was previously prevented in OVN in ovn-nbctl and northd. By
removing these filters the forwarding will work if the mac addresses are
prepopulated.

If the mac addresses are not prepopulated we will attempt to resolve them using
the original address family of the packet and not the address family of the
nexthop. This will fail and we will not forward the packet.

This feature can for example be used by service providers to
interconnect multiple IPv4 networks of a customer without needing to
negotiate free IPv4 addresses by just using any IPv6 address.
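
For illustration, such a mixed-family route could be set up as follows
(the router and port names are made up; the MAC prepopulation mirrors
the Static_MAC_Binding example from the review discussion, since
dynamic resolution of the nexthop is not supported yet):

  # IPv4 prefix routed via an IPv6 nexthop.
  ovn-nbctl lr-route-add lr0 172.168.0.0/24 aef0::4
  # Prepopulate the nexthop MAC so no dynamic resolution is needed.
  ovn-nbctl static-mac-binding-add lr0-public aef0::4 10:14:00:00:00:05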

Signed-off-by: Felix Huettner 
---
v2->v3: fix uninitialized variable
v1->v2:
  - move ipv4 info to parsed_route
  - add tests for lr-route-add
  - switch tests to use fmt_pkt
  - some minor test cleanups
 NEWS  |   4 +
 northd/northd.c   |  57 ++---
 tests/ovn-nbctl.at|  26 ++-
 tests/ovn.at  | 511 ++
 utilities/ovn-nbctl.c |  12 +-
 5 files changed, 571 insertions(+), 39 deletions(-)

diff --git a/NEWS b/NEWS
index 141f1831c..14a935c86 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,10 @@ Post v24.03.0
 "lflow-stage-to-oftable STAGE_NAME" that converts stage name into OpenFlow
 table id.
   - Rename the ovs-sandbox script to ovn-sandbox.
+  - Allow Static Routes where the address families of ip_prefix and nexthop
+diverge (e.g. IPv4 packets over IPv6 links). This is currently limited to
+nexthops that have their mac addresses prepopulated (so
+dynamic_neigh_routers must be false).
 
 OVN v24.03.0 - 01 Mar 2024
 --
diff --git a/northd/northd.c b/northd/northd.c
index 331d9c267..f2357af15 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -10194,6 +10194,8 @@ struct parsed_route {
 const struct nbrec_logical_router_static_route *route;
 bool ecmp_symmetric_reply;
 bool is_discard_route;
+bool is_ipv4_prefix;
+bool is_ipv4_nexthop;
 };
 
 static uint32_t
@@ -10219,6 +10221,8 @@ parsed_routes_add(struct ovn_datapath *od, const struct 
hmap *lr_ports,
 /* Verify that the next hop is an IP address with an all-ones mask. */
 struct in6_addr nexthop;
 unsigned int plen;
+bool is_ipv4_nexthop = true;
+bool is_ipv4_prefix;
 bool is_discard_route = !strcmp(route->nexthop, "discard");
 bool valid_nexthop = route->nexthop[0] && !is_discard_route;
 if (valid_nexthop) {
@@ -10237,6 +10241,7 @@ parsed_routes_add(struct ovn_datapath *od, const struct 
hmap *lr_ports,
 UUID_ARGS(&route->header_.uuid));
 return NULL;
 }
+is_ipv4_nexthop = IN6_IS_ADDR_V4MAPPED(&nexthop);
 }
 
 /* Parse ip_prefix */
@@ -10248,18 +10253,7 @@ parsed_routes_add(struct ovn_datapath *od, const 
struct hmap *lr_ports,
 UUID_ARGS(&route->header_.uuid));
 return NULL;
 }
-
-/* Verify that ip_prefix and nexthop have same address familiy. */
-if (valid_nexthop) {
-if (IN6_IS_ADDR_V4MAPPED(&nexthop) != IN6_IS_ADDR_V4MAPPED(&prefix)) {
-static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
-VLOG_WARN_RL(&rl, "Address family doesn't match between 'ip_prefix'"
- " %s and 'nexthop' %s in static route "UUID_FMT,
- route->ip_prefix, route->nexthop,
- UUID_ARGS(&route->header_.uuid));
-return NULL;
-}
-}
+is_ipv4_prefix = IN6_IS_ADDR_V4MAPPED(&prefix);
 
 /* Verify that ip_prefix and nexthop are on the same network. */
 if (!is_discard_route &&
@@ -10302,6 +10296,8 @@ parsed_routes_add(struct ovn_datapath *od, const struct 
hmap *lr_ports,
 pr->ecmp_symmetric_reply = smap_get_bool(&route->options,
  "ecmp_symmetric_reply", false);
 pr->is_discard_route = is_discard_route;
+pr->is_ipv4_prefix = is_ipv4_prefix;
+pr->is_ipv4_nexthop = is_ipv4_nexthop;
 ovs_list_insert(routes, &pr->list_node);
 return pr;
 }
@@ -10677,7 +10673,7 @@ build_ecmp_route_flow(struct lflow_table *lflows, 
struct ovn_datapath *od,
   struct lflow_ref *lflow_ref)
 
 {
-bool is_ipv4 = IN6_IS_ADDR_V4MAPPED(&eg->prefix);
+bool is_ipv4_prefix = IN6_IS_ADDR_V4MAPPED(&eg->prefix);
 uint16_t priority;
 struct ecmp_route_list_node *er;
 struct ds route_match = DS_EMPTY_INITIALIZER;
@@ -10686,7 +10682,8 @@ build_ecmp_route_flow(struct lflow_table *lflows, 
struct ovn_datapath *od,
 int ofs = !strcmp(eg->origin, ROUTE_ORIGIN_CONNECTED) ?
 ROUTE_PRIO_OFFSET_CONNECTED: ROUTE_PRIO_OFFSET_STATIC;
 build_route_match(NULL, eg->route_table_id, prefix_s, eg->plen,
-  

[ovs-dev] [PATCH ovn v2] northd: Support routing over other address families.

2024-04-22 Thread Felix Huettner via dev
In most cases IPv4 packets are routed only over other IPv4 networks and
IPv6 packets are routed only over IPv6 networks. However there is no
inherent reason for this limitation. Routing IPv4 packets over IPv6
networks just requires the router to contain a route for an IPv4 network
with an IPv6 nexthop.

This was previously prevented in OVN in ovn-nbctl and northd. By
removing these filters the forwarding will work if the mac addresses are
prepopulated.

If the mac addresses are not prepopulated we will attempt to resolve them using
the original address family of the packet and not the address family of the
nexthop. This will fail and we will not forward the packet.

This feature can for example be used by service providers to
interconnect multiple IPv4 networks of a customer without needing to
negotiate free IPv4 addresses by just using any IPv6 address.

Signed-off-by: Felix Huettner 
---
v1->v2:
  - move ipv4 info to parsed_route
  - add tests for lr-route-add
  - switch tests to use fmt_pkt
  - some minor test cleanups

 NEWS  |   4 +
 northd/northd.c   |  56 ++---
 tests/ovn-nbctl.at|  26 ++-
 tests/ovn.at  | 511 ++
 utilities/ovn-nbctl.c |  12 +-
 5 files changed, 570 insertions(+), 39 deletions(-)

diff --git a/NEWS b/NEWS
index 141f1831c..14a935c86 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,10 @@ Post v24.03.0
 "lflow-stage-to-oftable STAGE_NAME" that converts stage name into OpenFlow
 table id.
   - Rename the ovs-sandbox script to ovn-sandbox.
+  - Allow Static Routes where the address families of ip_prefix and nexthop
+diverge (e.g. IPv4 packets over IPv6 links). This is currently limited to
+nexthops that have their mac addresses prepopulated (so
+dynamic_neigh_routers must be false).
 
 OVN v24.03.0 - 01 Mar 2024
 --
diff --git a/northd/northd.c b/northd/northd.c
index 331d9c267..e377d3e18 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -10194,6 +10194,8 @@ struct parsed_route {
 const struct nbrec_logical_router_static_route *route;
 bool ecmp_symmetric_reply;
 bool is_discard_route;
+bool is_ipv4_prefix;
+bool is_ipv4_nexthop;
 };
 
 static uint32_t
@@ -10219,6 +10221,7 @@ parsed_routes_add(struct ovn_datapath *od, const struct 
hmap *lr_ports,
 /* Verify that the next hop is an IP address with an all-ones mask. */
 struct in6_addr nexthop;
 unsigned int plen;
+bool is_ipv4_nexthop, is_ipv4_prefix;
 bool is_discard_route = !strcmp(route->nexthop, "discard");
 bool valid_nexthop = route->nexthop[0] && !is_discard_route;
 if (valid_nexthop) {
@@ -10237,6 +10240,7 @@ parsed_routes_add(struct ovn_datapath *od, const struct 
hmap *lr_ports,
 UUID_ARGS(&route->header_.uuid));
 return NULL;
 }
+is_ipv4_nexthop = IN6_IS_ADDR_V4MAPPED(&nexthop);
 }
 
 /* Parse ip_prefix */
@@ -10248,18 +10252,7 @@ parsed_routes_add(struct ovn_datapath *od, const 
struct hmap *lr_ports,
 UUID_ARGS(&route->header_.uuid));
 return NULL;
 }
-
-/* Verify that ip_prefix and nexthop have same address familiy. */
-if (valid_nexthop) {
-if (IN6_IS_ADDR_V4MAPPED(&nexthop) != IN6_IS_ADDR_V4MAPPED(&prefix)) {
-static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
-VLOG_WARN_RL(&rl, "Address family doesn't match between 'ip_prefix'"
- " %s and 'nexthop' %s in static route "UUID_FMT,
- route->ip_prefix, route->nexthop,
- UUID_ARGS(&route->header_.uuid));
-return NULL;
-}
-}
+is_ipv4_prefix = IN6_IS_ADDR_V4MAPPED(&prefix);
 
 /* Verify that ip_prefix and nexthop are on the same network. */
 if (!is_discard_route &&
@@ -10302,6 +10295,8 @@ parsed_routes_add(struct ovn_datapath *od, const struct 
hmap *lr_ports,
 pr->ecmp_symmetric_reply = smap_get_bool(&route->options,
  "ecmp_symmetric_reply", false);
 pr->is_discard_route = is_discard_route;
+pr->is_ipv4_prefix = is_ipv4_prefix;
+pr->is_ipv4_nexthop = is_ipv4_nexthop;
 ovs_list_insert(routes, &pr->list_node);
 return pr;
 }
@@ -10677,7 +10672,7 @@ build_ecmp_route_flow(struct lflow_table *lflows, 
struct ovn_datapath *od,
   struct lflow_ref *lflow_ref)
 
 {
-bool is_ipv4 = IN6_IS_ADDR_V4MAPPED(&eg->prefix);
+bool is_ipv4_prefix = IN6_IS_ADDR_V4MAPPED(&eg->prefix);
 uint16_t priority;
 struct ecmp_route_list_node *er;
 struct ds route_match = DS_EMPTY_INITIALIZER;
@@ -10686,7 +10681,8 @@ build_ecmp_route_flow(struct lflow_table *lflows, 
struct ovn_datapath *od,
 int ofs = !strcmp(eg->origin, ROUTE_ORIGIN_CONNECTED) ?
 ROUTE_PRIO_OFFSET_CONNECTED: ROUTE_PRIO_OFFSET_STATIC;
 build_route_match(NULL, eg->route_table_id, prefix_s, eg->plen,
-  eg->is_src_route, is_ipv4, &route_match, &priority, ofs);
+

Re: [ovs-dev] [PATCH ovn] northd: Support routing over other address families.

2024-04-22 Thread Felix Huettner via dev
On Fri, Apr 19, 2024 at 09:53:13AM +0200, Dumitru Ceara wrote:
> On 3/27/24 09:43, Felix Huettner via dev wrote:
> > In most cases IPv4 packets are routed only over other IPv4 networks and
> > IPv6 packets are routed only over IPv6 networks. However there is no
> > inherent reason for this limitation. Routing IPv4 packets over IPv6
> > networks just requires the router to contain a route for an IPv4 network
> > with an IPv6 nexthop.
> > 
> > This was previously prevented in OVN in ovn-nbctl and northd. By
> > removing these filters the forwarding will work if the mac addresses are
> > prepopulated.
> > 
> > If the mac addresses are not prepopulated we will attempt to resolve
> > them using the original address family of the packet and not the
> > address family of the nexthop. This will fail and we will not forward
> > the packet.
> > 
> > This feature can for example be used by service providers to
> > interconnect multiple IPv4 networks of a customer without needing to
> > negotiate free IPv4 addresses by just using any IPv6 address.
> > 
> > Signed-off-by: Felix Huettner 
> > ---
> 
> Hi Felix,
> 
> Thanks for the patch!  It's a very useful addition to the OVN feature
> set.  The code looks mostly OK to me, I only had some minor comments,
> please see below.
> 

Hi Dumitru,

Thanks for the review.
The comments will be addressed in v2.

Thanks



[ovs-dev] [PATCH ovn] northd: Support routing over other address families.

2024-03-27 Thread Felix Huettner via dev
In most cases IPv4 packets are routed only over other IPv4 networks and
IPv6 packets are routed only over IPv6 networks. However there is no
inherent reason for this limitation. Routing IPv4 packets over IPv6
networks just requires the router to contain a route for an IPv4 network
with an IPv6 nexthop.

This was previously prevented in OVN in ovn-nbctl and northd. By
removing these filters the forwarding will work if the mac addresses are
prepopulated.

If the mac addresses are not prepopulated we will attempt to resolve them using
the original address family of the packet and not the address family of the
nexthop. This will fail and we will not forward the packet.

This feature can for example be used by service providers to
interconnect multiple IPv4 networks of a customer without needing to
negotiate free IPv4 addresses by just using any IPv6 address.

Signed-off-by: Felix Huettner 
---
 NEWS  |   4 +
 northd/northd.c   |  45 ++--
 tests/ovn-nbctl.at|   8 +-
 tests/ovn.at  | 615 ++
 utilities/ovn-nbctl.c |  12 +-
 5 files changed, 650 insertions(+), 34 deletions(-)

diff --git a/NEWS b/NEWS
index 4d6ebea89..b419b2628 100644
--- a/NEWS
+++ b/NEWS
@@ -12,6 +12,10 @@ Post v24.03.0
 flow table id.
 "lflow-stage-to-oftable STAGE_NAME" that converts stage name into OpenFlow
 table id.
+  - Allow Static Routes where the address families of ip_prefix and nexthop
+diverge (e.g. IPv4 packets over IPv6 links). This is currently limited to
+nexthops that have their mac addresses prepopulated (so
+dynamic_neigh_routers must be false).
 
 OVN v24.03.0 - 01 Mar 2024
 --
diff --git a/northd/northd.c b/northd/northd.c
index 1839b7d8b..0359cde89 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -10238,18 +10238,6 @@ parsed_routes_add(struct ovn_datapath *od, const 
struct hmap *lr_ports,
 return NULL;
 }
 
-/* Verify that ip_prefix and nexthop have same address familiy. */
-if (valid_nexthop) {
-if (IN6_IS_ADDR_V4MAPPED(&nexthop) != IN6_IS_ADDR_V4MAPPED(&prefix)) {
-static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
-VLOG_WARN_RL(&rl, "Address family doesn't match between 'ip_prefix'"
- " %s and 'nexthop' %s in static route "UUID_FMT,
- route->ip_prefix, route->nexthop,
- UUID_ARGS(&route->header_.uuid));
-return NULL;
-}
-}
-
 /* Verify that ip_prefix and nexthop are on the same network. */
 if (!is_discard_route &&
 !find_static_route_outport(od, lr_ports, route,
@@ -10666,7 +10654,7 @@ build_ecmp_route_flow(struct lflow_table *lflows, 
struct ovn_datapath *od,
   struct lflow_ref *lflow_ref)
 
 {
-bool is_ipv4 = IN6_IS_ADDR_V4MAPPED(&eg->prefix);
+bool is_ipv4_network = IN6_IS_ADDR_V4MAPPED(&eg->prefix);
 uint16_t priority;
 struct ecmp_route_list_node *er;
 struct ds route_match = DS_EMPTY_INITIALIZER;
@@ -10675,7 +10663,8 @@ build_ecmp_route_flow(struct lflow_table *lflows, 
struct ovn_datapath *od,
 int ofs = !strcmp(eg->origin, ROUTE_ORIGIN_CONNECTED) ?
 ROUTE_PRIO_OFFSET_CONNECTED: ROUTE_PRIO_OFFSET_STATIC;
 build_route_match(NULL, eg->route_table_id, prefix_s, eg->plen,
-  eg->is_src_route, is_ipv4, &route_match, &priority, ofs);
+  eg->is_src_route, is_ipv4_network, &route_match,
+  &priority, ofs);
 free(prefix_s);
 
 struct ds actions = DS_EMPTY_INITIALIZER;
@@ -10708,7 +10697,11 @@ build_ecmp_route_flow(struct lflow_table *lflows, 
struct ovn_datapath *od,
 /* Find the outgoing port. */
 const char *lrp_addr_s = NULL;
 struct ovn_port *out_port = NULL;
-if (!find_static_route_outport(od, lr_ports, route, is_ipv4,
+bool is_ipv4_gateway = is_ipv4_network;
+if (route->nexthop && route->nexthop[0]) {
+  is_ipv4_gateway = strchr(route->nexthop, '.') ? true : false;
+}
+if (!find_static_route_outport(od, lr_ports, route, is_ipv4_gateway,
&lrp_addr_s, &out_port)) {
 continue;
 }
@@ -10733,9 +10726,9 @@ build_ecmp_route_flow(struct lflow_table *lflows, 
struct ovn_datapath *od,
   "eth.src = %s; "
   "outport = %s; "
   "next;",
-  is_ipv4 ? REG_NEXT_HOP_IPV4 : REG_NEXT_HOP_IPV6,
+  is_ipv4_gateway ? REG_NEXT_HOP_IPV4 : REG_NEXT_HOP_IPV6,
   route->nexthop,
-  is_ipv4 ? REG_SRC_IPV4 : REG_SRC_IPV6,
+  is_ipv4_gateway ? REG_SRC_IPV4 : REG_SRC_IPV6,
   lrp_addr_s,
   out_port->lrp_networks.ea_s,
   out_port->json_key);
@@ -10757,13 +10750,18 @@ add_route(struct lflow_table *lflows, struct 
ovn_datapath *od,
   const 

Re: [ovs-dev] [PATCH v2 1/5] ovsdb: raft: Avoid transferring leadership to unavailable servers.

2024-03-27 Thread Felix Huettner via dev
On Tue, Mar 26, 2024 at 06:27:09PM +0100, Ilya Maximets wrote:
> Current implementation of the leadership transfer just shoots the
> leadership in the general direction of the first stable server in the
> configuration.  It doesn't check if the server was active recently or
> even that the connection is established.  This may result in sending
> leadership to a disconnected or otherwise unavailable server.
> 
> Such behavior should not cause log truncation or any other correctness
> issues because the destination server would have all the append
> requests queued up or the connection will be dropped by the leader.
> In a worst case we will have a leader-less cluster until the next
> election timer fires up.  Other servers will notice the absence of the
> leader and will trigger a new leader election normally.
> However, the potential wait for the election timer is not good as
> real-world setups may have high values configured.
> 
> Fix that by trying to transfer to servers that we know have applied
> the most changes, i.e., have the highest 'match_index'.  Such servers
> replied to the most recent append requests, so they have highest
> chances to be healthy.  Choosing the random starting point in the
> list of such servers so we don't transfer to the same server every
> single time.  This slightly improves load distribution, but, most
> importantly, increases robustness of our test suite, making it cover
> more cases.  Also checking that the message was actually sent without
> immediate failure.
> 
> If we fail to transfer to any server with the highest index, try to
> just transfer to any other server that is not behind majority and then
> just any other server that is connected.  We did actually send them
> all the updates (if the connection is open), they just didn't reply
> yet for one reason or another.  It should be better than leaving the
> cluster without a leader.
> 
> Note that there is always a chance that transfer will fail, since
> we're not waiting for it to be acknowledged (and must not wait).
> In this case, normal election will be triggered after the election
> timer fires up.
> 
> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered 
> databases.")
> Signed-off-by: Ilya Maximets 
> ---
> 
> CC: Felix Huettner 
> 
>  ovsdb/raft.c | 48 
>  1 file changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index f463afcb3..b171da345 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -1261,10 +1261,30 @@ raft_transfer_leadership(struct raft *raft, const 
> char *reason)
>  return;
>  }
>  
> -struct raft_server *s;
> +struct raft_server **servers, *s;
> +uint64_t threshold = 0;
> +size_t n = 0, start, i;
> +
> +servers = xmalloc(hmap_count(&raft->servers) * sizeof *servers);
> +
>  HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> -if (!uuid_equals(&raft->sid, &s->sid)
> -&& s->phase == RAFT_PHASE_STABLE) {
> +if (uuid_equals(&raft->sid, &s->sid)
> +|| s->phase != RAFT_PHASE_STABLE) {
> +continue;
> +}
> +if (s->match_index > threshold) {
> +threshold = s->match_index;
> +}
> +servers[n++] = s;
> +}
> +
> +start = n ? random_range(n) : 0;
> +
> +retry:
> +for (i = 0; i < n; i++) {
> +s = servers[(start + i) % n];
> +
> +if (s->match_index >= threshold) {
>  struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
>  if (!conn) {
>  continue;
> @@ -1280,7 +1300,10 @@ raft_transfer_leadership(struct raft *raft, const char 
> *reason)
>  .term = raft->term,
>  }
>  };
> -raft_send_to_conn(raft, &rpc, conn);
> +
> +if (!raft_send_to_conn(raft, &rpc, conn)) {
> +continue;
> +}
>  
>  raft_record_note(raft, "transfer leadership",
>   "transferring leadership to %s because %s",
> @@ -1288,6 +1311,23 @@ raft_transfer_leadership(struct raft *raft, const char 
> *reason)
>  break;
>  }
>  }
> +
> +if (n && i == n && threshold) {
> +if (threshold > raft->commit_index) {
> +/* Failed to transfer to servers with the highest 'match_index'.
> + * Try other servers that are not behind the majority. */
> +threshold = raft->commit_index;
> +} else {
> +/* Try any other server.  It is safe, because they either have 
> all
> + * the append requests queued up for them before the leadership
> + * transfer message or their connection is broken and we will not
> + * transfer anyway. */
> +threshold = 0;
> +}
> +goto retry;
> +}
> +
> +free(servers);
>  }
>  
>  /* Send a RemoveServerRequest to the rest of the servers in the cluster.
> -- 
> 2.44.0
> 


Re: [ovs-dev] [PATCH 1/5] ovsdb: raft: Randomize leadership transfer.

2024-03-19 Thread Felix Huettner via dev
On Mon, Mar 18, 2024 at 05:52:12PM +0100, Ilya Maximets wrote:
> On 3/18/24 17:15, Felix Huettner wrote:
> > On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
> >> Each cluster member typically always transfers leadership to the same
> >> other member, which is the first in their list of servers.  This may
> >> result in two servers in a 3-node cluster to transfer leadership to
> >> each other and never to the third one.
> >>
> >> Randomizing the selection to make the load more evenly distributed.
> > 
> > Hi Ilya,
> 
> Hi, Felix.  Thanks for the comments!
> 
> > 
> > just out of curiosity: since basically only one of the 3 members is
> > active at any point in time, is balancing the load even relevant. It
> > will always only be on one of the 3 members anyway.
> It is not very important, I agree.  What I observed in practice is
> that sometimes if, for example, compactions happen in approximately
> similar time, the server we transfer the leadership to may send it
> right back, while the first server is busy compacting.  This is
> less of a problem today as well, since we have parallel compaction,
> but it may still be annoying if that happens every time.
> 
> I'm mostly making this patch for the purpose of better testing below.
> 
> > 
> >>
> >> This also makes cluster failure tests cover more scenarios as servers
> >> will transfer leadership to servers they didn't before.  This is
> >> important especially for cluster joining tests.
> >>
> >> Ideally, we would transfer to a random server with a highest apply
> >> index, but not trying to implement this for now.
> >>
> >> Signed-off-by: Ilya Maximets 
> >> ---
> >>  ovsdb/raft.c | 6 +-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> >> index f463afcb3..25f462431 100644
> >> --- a/ovsdb/raft.c
> >> +++ b/ovsdb/raft.c
> >> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const 
> >> char *reason)
> >>  return;
> >>  }
> >>  
> >> +size_t n = hmap_count(&raft->servers) * 3;
> >>  struct raft_server *s;
> >> -HMAP_FOR_EACH (s, hmap_node, >servers) {
> >> +
> >> +while (n--) {
> >> +s = CONTAINER_OF(hmap_random_node(&raft->servers),
> >> + struct raft_server, hmap_node);
> >>  if (!uuid_equals(&raft->sid, &s->sid)
> >>  && s->phase == RAFT_PHASE_STABLE) {
> >>  struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
> > 
> > I think this has the risk of never selecting one server out of the list
> > of cluster members. Suppose you have a 3-node cluster where one of the
> > members is down. In this case there is a single member the leadership
> > can be transferred to.
> > This means a single iteration of the while loop has a chance of 2/3 of
> > selecting a member that cannot be used. Over the 9 iterations the loop
> > would do, this gives a chance of (2/3)^9 = 512/19683 ~ 0.026 of always
> > choosing an inappropriate member, i.e. a 2.6% chance of never selecting
> > the single appropriate target member.
> > 
> > Could we instead start iterating the hmap at some random offset?
> > That should give a similar result while guaranteeing that we visit
> > each member once.
> 
> I don't think it's very important, because we're transferring leadership
> without verifying if the other side is alive and we're not checking if we
> actually transferred it or not.  So, retries are basically just for the
> server not hitting itself or servers that didn't join yet.  We will transfer
> to the first other server that already joined regardless of the iteration
> method.
> 
> The way to mostly fix the issue with dead servers is, as I mentioned, to
> transfer only to the servers with the highest apply index, i.e. the servers
> that acknowledged the most amount of changes.  This will ensure that we
> at least heard something from the server in the recent past.  But that's
> a separate feature to implement.
> 
> Also, the leadership transfer is just an optimization to speed up elections,
> so it's not necessary for correctness for this operation to be successful.
> If we fail to transfer or transfer to the dead server, the rest of the
> cluster will notice the absence of the leader and initiate election by
> the timeout.
> 
> Best regards, Ilya Maximets.

Hi Ilya,

thanks for the clarifications.
Then I guess the 2.6% chance is not too relevant.

Thanks
Felix


Re: [ovs-dev] [PATCH 1/5] ovsdb: raft: Randomize leadership transfer.

2024-03-18 Thread Felix Huettner via dev
On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
> Each cluster member typically always transfers leadership to the same
> other member, which is the first in their list of servers.  This may
> result in two servers in a 3-node cluster to transfer leadership to
> each other and never to the third one.
> 
> Randomizing the selection to make the load more evenly distributed.

Hi Ilya,

just out of curiosity: since basically only one of the 3 members is
active at any point in time, is balancing the load even relevant. It
will always only be on one of the 3 members anyway.

> 
> This also makes cluster failure tests cover more scenarios as servers
> will transfer leadership to servers they didn't before.  This is
> important especially for cluster joining tests.
> 
> Ideally, we would transfer to a random server with a highest apply
> index, but not trying to implement this for now.
> 
> Signed-off-by: Ilya Maximets 
> ---
>  ovsdb/raft.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index f463afcb3..25f462431 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char 
> *reason)
>  return;
>  }
>  
> +size_t n = hmap_count(&raft->servers) * 3;
>  struct raft_server *s;
> -HMAP_FOR_EACH (s, hmap_node, >servers) {
> +
> +while (n--) {
> +s = CONTAINER_OF(hmap_random_node(&raft->servers),
> + struct raft_server, hmap_node);
>  if (!uuid_equals(&raft->sid, &s->sid)
>  && s->phase == RAFT_PHASE_STABLE) {
>  struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);

I think this has the risk of never selecting one server out of the list
of cluster members. Suppose you have a 3-node cluster where one of the
members is down. In this case there is a single member the leadership
can be transferred to.
This means a single iteration of the while loop has a chance of 2/3 of
selecting a member that cannot be used. Over the 9 iterations the loop
would do, this gives a chance of (2/3)^9 = 512/19683 ~ 0.026 of always
choosing an inappropriate member, i.e. a 2.6% chance of never selecting
the single appropriate target member.

Could we instead start iterating the hmap at some random offset?
That should give a similar result while guaranteeing that we visit
each member once.
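
For example (an untested sketch reusing the existing helpers; close to
what the later v2 patch ended up doing):

  struct raft_server **servers, *s;
  size_t n = 0, i, start;

  /* Collect the eligible servers once... */
  servers = xmalloc(hmap_count(&raft->servers) * sizeof *servers);
  HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
      if (!uuid_equals(&raft->sid, &s->sid)
          && s->phase == RAFT_PHASE_STABLE) {
          servers[n++] = s;
      }
  }

  /* ...then walk them from a random offset so that every candidate is
   * visited exactly once. */
  start = n ? random_range(n) : 0;
  for (i = 0; i < n; i++) {
      s = servers[(start + i) % n];
      /* Try to transfer leadership to 's'; stop on success. */
  }
  free(servers);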

Thanks
Felix


[ovs-dev] [PATCH v7 2/2] netlink-conntrack: Optimize flushing ct zone.

2024-03-11 Thread Felix Huettner via dev
Previously the kernel did not provide a netlink interface to flush/list
only conntrack entries matching a specific zone. With [1] and [2] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels not yet supporting this feature will ignore the filter.
For the list request that means just returning all entries (which we can
then filter in userspace as before).
For the flush request that means deleting all conntrack entries.

The implementation is now identical to the Windows one, so we combine
them.

This significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an OpenFlow command, it blocks the main ovs thread
and thereby also blocks new flows from being applied. Using this new
feature we can reduce the flushing time for zones by around 93%.

In combination with OVN the creation of a Logical_Router (which causes
the flushing of a ct zone) could block other operations, e.g. the
failover of Logical_Routers (as they cause new flows to be created).
This is visible from a user perspective as a ovn-controller that is idle
(as it waits for vswitchd) and vswitchd reporting:
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests were run in a QEMU VM with 500,000
conntrack entries distributed evenly over 500 ct zones using `ovstest
test-netlink-conntrack flush zone=`.

          |  flush zone with 1000 entries  |    flush zone with no entry    |
          +---------------------+----------+---------------------+----------+
          |   with the patch    | without  |   with the patch    | without  |
          +----------+----------+----------+----------+----------+----------+
          | v6.8-rc4 |  v6.7.1  | v6.8-rc4 | v6.8-rc4 |  v6.7.1  | v6.8-rc4 |
+---------+----------+----------+----------+----------+----------+----------+
| Min     |  0.260   |  3.946   |  3.497   |  0.228   |  3.462   |  3.212   |
| Median  |  0.319   |  4.237   |  4.349   |  0.298   |  4.460   |  4.010   |
| 90%ile  |  0.335   |  4.367   |  4.522   |  0.325   |  4.662   |  4.572   |
| 99%ile  |  0.348   |  4.495   |  4.773   |  0.340   |  4.931   |  6.003   |
| Max     |  0.362   |  4.543   |  5.054   |  0.348   |  5.390   |  6.396   |
| Mean    |  0.320   |  4.236   |  4.331   |  0.296   |  4.430   |  4.071   |
| Total   |  80.02   |   1058   |   1082   |  73.93   |   1107   |   1017   |
+---------+----------+----------+----------+----------+----------+----------+

[1]: https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
[2]: https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a

Acked-by: Mike Pattrick 
Co-Authored-By: Luca Czesla 
Signed-off-by: Luca Czesla 
Co-Authored-By: Max Lamprecht 
Signed-off-by: Max Lamprecht 
Signed-off-by: Felix Huettner 
---
v6->v7:
- fixed some nits
- move testcase to the "ct flush" test
v5->v6: none
v4->v5: none
v3->v4:
- combine the flush logic with windows implementation
v2->v3:
- update description to include upstream fix (Thanks to Ilya for finding
  that issue)
v1->v2:
- fixed wrong signed-off-by

 lib/netlink-conntrack.c | 52 -
 tests/system-traffic.at | 28 ++
 2 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..263496b17 100644
--- a/lib/netlink-conntrack.c
+++ b/lib/netlink-conntrack.c
@@ -141,6 +141,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const 
uint16_t *zone,
 
 nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
 IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
+if (zone) {
+nl_msg_put_be16(>buf, CTA_ZONE, htons(*zone));
+}
 nl_dump_start(>dump, NETLINK_NETFILTER, >buf);
 ofpbuf_clear(>buf);
 
@@ -263,11 +266,9 @@ out:
     return err;
 }
 
-#ifdef _WIN32
-int
-nl_ct_flush_zone(uint16_t flush_zone)
+static int
+nl_ct_flush_zone_with_cta_zone(uint16_t flush_zone)
 {
-    /* Windows can flush a specific zone */
     struct ofpbuf buf;
     int err;
 
@@ -282,24 +283,63 @@ nl_ct_flush_zone(uint16_t flush_zone)
 
     return err;
 }
+
+#ifdef _WIN32
+int
+nl_ct_flush_zone(uint16_t flush_zone)
+{
+    return nl_ct_flush_zone_with_cta_zone(flush_zone);
+}
 #else
+
+static bool
+netlink_flush_supports_zone(void)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    static bool supported = false;
+
+    if (ovsthread_once_start(&once)) {
+        if (ovs_kernel_is_version_or_newer(6, 8)) {
+            supported = true;
+        } else {
+            VLOG_INFO("disabling conntrack flush by zone. "
+                      "Not supported in Linux kernel");
+        }
+        ovsthread_once_done(&once);
+    }
+    return supported;
+}
+
 int
 nl_ct_flush_zone(uint16_t flush_zone)
 {
-    /* Apparently, there's no netlink interface to flush a specific zone.
+    /* In older kernels, there was no netlink interface to flush a specific
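
The patch itself ends here in the archive. For context, the dump-and-filter
fallback that the rewritten comment goes on to describe looks roughly like
this (a sketch only; entry parsing and the per-entry deletion are elided,
the real loop lives in nl_ct_flush_zone() in lib/netlink-conntrack.c):

/* Same includes as lib/netlink-conntrack.c are assumed. */
static void
flush_zone_fallback_sketch(uint16_t flush_zone)
{
    struct nl_dump dump;
    struct ofpbuf buf, reply;

    ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
    nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
                        IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
    nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone)); /* hint; ignored
                                                         * by old kernels */
    nl_dump_start(&dump, NETLINK_NETFILTER, &buf);
    ofpbuf_clear(&buf);

    while (nl_dump_next(&dump, &reply, &buf)) {
        /* Parse the entry's zone out of 'reply'; if it equals
         * 'flush_zone', send an IPCTNL_MSG_CT_DELETE for that entry
         * (details elided; see lib/netlink-conntrack.c). */
    }
    nl_dump_done(&dump);
    ofpbuf_uninit(&buf);
}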

[ovs-dev] [PATCH v7 1/2] util: Support checking for kernel versions.

2024-03-11 Thread Felix Huettner via dev
Extract checking for a given kernel version to a separate function.
It will also be used in the next patch.
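
A hypothetical caller would gate a kernel-dependent feature on it like
this (sketch only; the 5.10 threshold is made up for illustration):

#include "util.h"

static bool
example_feature_supported(void)
{
#ifdef __linux__
    /* The helper reports false when uname() fails or the release string
     * cannot be parsed, so callers can treat "false" as "feature
     * unavailable". */
    return ovs_kernel_is_version_or_newer(5, 10);
#else
    return false;
#endif
}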

Acked-by: Mike Pattrick 
Acked-by: Eelco Chaudron 
Signed-off-by: Felix Huettner 
---
v6->v7: none
v5->v6:
- fix ovs_kernel_is_version_or_newer returning false if major and minor
  are equal (thanks Mike)
v4->v5:
- fix wrong ifdef that broke on macos
- fix ovs_kernel_is_version_or_newer working in reverse than desired
- ovs_kernel_is_version_or_newer now always returns false if uname
  errors (Thanks Eelco)
v4:
- extract function to check kernel version

 lib/netdev-linux.c | 15 +++
 lib/util.c | 27 +++
 lib/util.h |  4 
 3 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 1e904d8e6..25349c605 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -40,7 +40,6 @@
 #include 
 #include 
 #include 
-#include <sys/utsname.h>
 #include 
 #include 
 #include 
@@ -6428,18 +6427,10 @@ getqdisc_is_safe(void)
     static bool safe = false;
 
     if (ovsthread_once_start(&once)) {
-        struct utsname utsname;
-        int major, minor;
-
-        if (uname(&utsname) == -1) {
-            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
-        } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) {
-            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
-        } else if (major < 2 || (major == 2 && minor < 35)) {
-            VLOG_INFO("disabling unsafe RTM_GETQDISC in Linux kernel %s",
-                      utsname.release);
-        } else {
+        if (ovs_kernel_is_version_or_newer(2, 35)) {
             safe = true;
+        } else {
+            VLOG_INFO("disabling unsafe RTM_GETQDISC in Linux kernel");
         }
         ovsthread_once_done(&once);
     }
diff --git a/lib/util.c b/lib/util.c
index 3fb3a4b40..5c31d983a 100644
--- a/lib/util.c
+++ b/lib/util.c
@@ -27,6 +27,7 @@
 #include 
 #ifdef __linux__
 #include 
+#include <sys/utsname.h>
 #endif
 #include 
 #include 
@@ -2500,3 +2501,29 @@ OVS_CONSTRUCTOR(winsock_start) {
}
 }
 #endif
+
+#ifdef __linux__
+bool
+ovs_kernel_is_version_or_newer(int target_major, int target_minor)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    static int current_major, current_minor = -1;
+
+    if (ovsthread_once_start(&once)) {
+        struct utsname utsname;
+
+        if (uname(&utsname) == -1) {
+            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
+        } else if (!ovs_scan(utsname.release, "%d.%d",
+                             &current_major, &current_minor)) {
+            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
+        }
+        ovsthread_once_done(&once);
+    }
+    if (current_major == -1 || current_minor == -1) {
+        return false;
+    }
+    return current_major > target_major || (
+        current_major == target_major && current_minor >= target_minor);
+}
+#endif
diff --git a/lib/util.h b/lib/util.h
index f2d45bcac..55718fd87 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -611,4 +611,8 @@ int ftruncate(int fd, off_t length);
 }
 #endif
 
+#ifdef __linux__
+bool ovs_kernel_is_version_or_newer(int target_major, int target_minor);
+#endif
+
 #endif /* util.h */

base-commit: 33f45ded67a2d524ccf54cf4bb79a38d8140f14b
-- 
2.44.0



Re: [ovs-dev] [PATCH v6 2/2] netlink-conntrack: Optimize flushing ct zone.

2024-03-11 Thread Felix Huettner via dev
> > diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> > index 07d09b912..d87d6914d 100644
> > --- a/tests/system-traffic.at
> > +++ b/tests/system-traffic.at
> > @@ -3376,6 +3376,14 @@ AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.4)], [0], [dnl
> >  tcp,orig=(src=10.1.1.3,dst=10.1.1.4,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.4,dst=10.1.1.3,sport=<cleared>,dport=<cleared>),zone=2,protoinfo=(state=<cleared>)
> >  ])
> >  
> > +dnl flushing one zone should leave the others intact
> 
> Nit: Capital letter and a period at the end.
> 
> > +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=2])
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=1 | FORMAT_CT(10.1.1.2)], [0], [dnl
> > +tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.2,dst=10.1.1.1,sport=<cleared>,dport=<cleared>),zone=1,protoinfo=(state=<cleared>)
> > +])
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=2 | FORMAT_CT(10.1.1.4)], [0], [dnl
> > +])
> 
> Also this test seems out of place here.  I think, it should be part
> of the 'conntrack - ct flush' test and use FLUSH_CMD instead.  We
> should not mix the general zone testing with the flush testing.
> 
> We may add another port and an OpenFlow rule in order to have more
> zones in 'conntrack - ct flush' test if needed.
> 
> Best regards, Ilya Maximets.

Thanks, I'll move it there and post a v7.



[ovs-dev] [PATCH v6 2/2] netlink-conntrack: Optimize flushing ct zone.

2024-03-04 Thread Felix Huettner via dev
Previously the kernel did not provide a netlink interface to flush/list
only conntrack entries matching a specific zone. With [1] and [2] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels not yet supporting this feature will ignore the filter.
For the list request that means just returning all entries (which we can
then filter in userspace as before).
For the flush request that means deleting all conntrack entries.

The implementation is now identical to the Windows one, so we combine
them.

This significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an OpenFlow command, it blocks the main ovs thread
and thereby also blocks new flows from being applied. Using this new
feature we can reduce the flushing time for zones by around 93%.

In combination with OVN, the creation of a Logical_Router (which causes
the flushing of a ct zone) could block other operations, e.g. the
failover of Logical_Routers (as they cause new flows to be created).
This is visible from a user perspective as an ovn-controller that is idle
(as it waits for vswitchd) and vswitchd reporting:
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests were run in a qemu vm with 500,000
conntrack entries distributed evenly over 500 ct zones, using `ovstest
test-netlink-conntrack flush zone=`.

         |  flush zone with 1000 entries  |    flush zone with no entry    |
         +----------+----------+----------+----------+----------+----------+
         |   with the patch    | without  |   with the patch    | without  |
         +----------+----------+----------+----------+----------+----------+
         | v6.8-rc4 |  v6.7.1  | v6.8-rc4 | v6.8-rc4 |  v6.7.1  | v6.8-rc4 |
+--------+----------+----------+----------+----------+----------+----------+
| Min    |  0.260   |  3.946   |  3.497   |  0.228   |  3.462   |  3.212   |
| Median |  0.319   |  4.237   |  4.349   |  0.298   |  4.460   |  4.010   |
| 90%ile |  0.335   |  4.367   |  4.522   |  0.325   |  4.662   |  4.572   |
| 99%ile |  0.348   |  4.495   |  4.773   |  0.340   |  4.931   |  6.003   |
| Max    |  0.362   |  4.543   |  5.054   |  0.348   |  5.390   |  6.396   |
| Mean   |  0.320   |  4.236   |  4.331   |  0.296   |  4.430   |  4.071   |
| Total  |  80.02   |  1058    |  1082    |  73.93   |  1107    |  1017    |
+--------+----------+----------+----------+----------+----------+----------+
(times in seconds)

[1]: 
https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
[2]: 
https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a

Co-Authored-By: Luca Czesla 
Signed-off-by: Luca Czesla 
Co-Authored-By: Max Lamprecht 
Signed-off-by: Max Lamprecht 
Signed-off-by: Felix Huettner 
---
v5->v6: none
v4->v5: none
v3->v4:
- combine the flush logic with windows implementation
v2->v3:
- update description to include upstream fix (Thanks to Ilya for finding
  that issue)
v1->v2:
- fixed wrong signed-off-by

 lib/netlink-conntrack.c | 52 -
 tests/system-traffic.at |  8 +++
 2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..d9015d9f0 100644
--- a/lib/netlink-conntrack.c
+++ b/lib/netlink-conntrack.c
@@ -141,6 +141,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
 
     nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
                         IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
+    if (zone) {
+        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
+    }
     nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
     ofpbuf_clear(&state->buf);
 
@@ -263,11 +266,9 @@ out:
     return err;
 }
 
-#ifdef _WIN32
-int
-nl_ct_flush_zone(uint16_t flush_zone)
+static int
+nl_ct_flush_zone_with_cta_zone(uint16_t flush_zone)
 {
-    /* Windows can flush a specific zone */
    struct ofpbuf buf;
     int err;
 
@@ -282,24 +283,63 @@ nl_ct_flush_zone(uint16_t flush_zone)
 
     return err;
 }
+
+#ifdef _WIN32
+int
+nl_ct_flush_zone(uint16_t flush_zone)
+{
+    return nl_ct_flush_zone_with_cta_zone(flush_zone);
+}
 #else
+
+static bool
+netlink_flush_supports_zone(void)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    static bool supported = false;
+
+    if (ovsthread_once_start(&once)) {
+        if (ovs_kernel_is_version_or_newer(6, 8)) {
+            supported = true;
+        } else {
+            VLOG_INFO("disabling conntrack flush by zone. "
+                      "Not supported in Linux kernel");
+        }
+        ovsthread_once_done(&once);
+    }
+    return supported;
+}
+
 int
 nl_ct_flush_zone(uint16_t flush_zone)
 {
-    /* Apparently, there's no netlink interface to flush a specific zone.
+    /* In older kernels, there was no netlink interface to flush a specific
+     * conntrack zone.
      * This code dumps every connection, checks the zone and eventually

[ovs-dev] [PATCH v6 1/2] util: Support checking for kernel versions.

2024-03-04 Thread Felix Huettner via dev
Extract checking for a given kernel version to a separate function.
It will also be used in the next patch.

Signed-off-by: Felix Huettner 
---
v5->v6:
- fix ovs_kernel_is_version_or_newer returning false if major and minor
  are equal (thanks Mike)
v4->v5:
- fix wrong ifdef that broke on macos
- fix ovs_kernel_is_version_or_newer working in reverse than desired
- ovs_kernel_is_version_or_newer now always returns false if uname
  errors (Thanks Eelco)
v4:
- extract function to check kernel version

 lib/netdev-linux.c | 15 +++
 lib/util.c | 27 +++
 lib/util.h |  4 
 3 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index bf91ef462..a8424ed82 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -40,7 +40,6 @@
 #include 
 #include 
 #include 
-#include <sys/utsname.h>
 #include 
 #include 
 #include 
@@ -6427,18 +6426,10 @@ getqdisc_is_safe(void)
     static bool safe = false;
 
     if (ovsthread_once_start(&once)) {
-        struct utsname utsname;
-        int major, minor;
-
-        if (uname(&utsname) == -1) {
-            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
-        } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) {
-            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
-        } else if (major < 2 || (major == 2 && minor < 35)) {
-            VLOG_INFO("disabling unsafe RTM_GETQDISC in Linux kernel %s",
-                      utsname.release);
-        } else {
+        if (ovs_kernel_is_version_or_newer(2, 35)) {
             safe = true;
+        } else {
+            VLOG_INFO("disabling unsafe RTM_GETQDISC in Linux kernel");
         }
         ovsthread_once_done(&once);
     }
diff --git a/lib/util.c b/lib/util.c
index 3fb3a4b40..5c31d983a 100644
--- a/lib/util.c
+++ b/lib/util.c
@@ -27,6 +27,7 @@
 #include 
 #ifdef __linux__
 #include 
+#include <sys/utsname.h>
 #endif
 #include 
 #include 
@@ -2500,3 +2501,29 @@ OVS_CONSTRUCTOR(winsock_start) {
}
 }
 #endif
+
+#ifdef __linux__
+bool
+ovs_kernel_is_version_or_newer(int target_major, int target_minor)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    static int current_major, current_minor = -1;
+
+    if (ovsthread_once_start(&once)) {
+        struct utsname utsname;
+
+        if (uname(&utsname) == -1) {
+            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
+        } else if (!ovs_scan(utsname.release, "%d.%d",
+                             &current_major, &current_minor)) {
+            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
+        }
+        ovsthread_once_done(&once);
+    }
+    if (current_major == -1 || current_minor == -1) {
+        return false;
+    }
+    return current_major > target_major || (
+        current_major == target_major && current_minor >= target_minor);
+}
+#endif
diff --git a/lib/util.h b/lib/util.h
index f2d45bcac..55718fd87 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -611,4 +611,8 @@ int ftruncate(int fd, off_t length);
 }
 #endif
 
+#ifdef __linux__
+bool ovs_kernel_is_version_or_newer(int target_major, int target_minor);
+#endif
+
 #endif /* util.h */

base-commit: 436aba68d52891fb5775ec7651282ccf9d04176b
-- 
2.43.2



[ovs-dev] [PATCH v5 1/2] util: Support checking for kernel versions.

2024-02-26 Thread Felix Huettner via dev
Extract checking for a given kernel version to a separate function.
It will also be used in the next patch.

Signed-off-by: Felix Huettner 
---
v4->v5:
- fix wrong ifdef that broke on macos
- fix ovs_kernel_is_version_or_newer working in reverse than desired
- ovs_kernel_is_version_or_newer now always returns false if uname
  errors (Thanks Eelco)
v4:
- extract function to check kernel version
 lib/netdev-linux.c | 14 +++---
 lib/util.c | 27 +++
 lib/util.h |  4 
 3 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index bf91ef462..51bd71ae3 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -6427,18 +6427,10 @@ getqdisc_is_safe(void)
     static bool safe = false;
 
     if (ovsthread_once_start(&once)) {
-        struct utsname utsname;
-        int major, minor;
-
-        if (uname(&utsname) == -1) {
-            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
-        } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) {
-            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
-        } else if (major < 2 || (major == 2 && minor < 35)) {
-            VLOG_INFO("disabling unsafe RTM_GETQDISC in Linux kernel %s",
-                      utsname.release);
-        } else {
+        if (ovs_kernel_is_version_or_newer(2, 35)) {
             safe = true;
+        } else {
+            VLOG_INFO("disabling unsafe RTM_GETQDISC in Linux kernel");
         }
         ovsthread_once_done(&once);
     }
diff --git a/lib/util.c b/lib/util.c
index 3fb3a4b40..f5b2da095 100644
--- a/lib/util.c
+++ b/lib/util.c
@@ -27,6 +27,7 @@
 #include 
 #ifdef __linux__
 #include 
+#include <sys/utsname.h>
 #endif
 #include 
 #include 
@@ -2500,3 +2501,29 @@ OVS_CONSTRUCTOR(winsock_start) {
}
 }
 #endif
+
+#ifdef __linux__
+bool
+ovs_kernel_is_version_or_newer(int target_major, int target_minor)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    static int current_major, current_minor = -1;
+
+    if (ovsthread_once_start(&once)) {
+        struct utsname utsname;
+
+        if (uname(&utsname) == -1) {
+            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
+        } else if (!ovs_scan(utsname.release, "%d.%d",
+                             &current_major, &current_minor)) {
+            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
+        }
+        ovsthread_once_done(&once);
+    }
+    if (current_major == -1 || current_minor == -1) {
+        return false;
+    }
+    return current_major > target_major || (
+        current_major == target_major && current_minor > target_minor);
+}
+#endif
diff --git a/lib/util.h b/lib/util.h
index f2d45bcac..55718fd87 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -611,4 +611,8 @@ int ftruncate(int fd, off_t length);
 }
 #endif
 
+#ifdef __linux__
+bool ovs_kernel_is_version_or_newer(int target_major, int target_minor);
+#endif
+
 #endif /* util.h */

base-commit: 166ee41d282c506d100bc2185d60af277121b55b
-- 
2.43.2



[ovs-dev] [PATCH v5 2/2] netlink-conntrack: Optimize flushing ct zone.

2024-02-26 Thread Felix Huettner via dev
Previously the kernel did not provide a netlink interface to flush/list
only conntrack entries matching a specific zone. With [1] and [2] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels not yet supporting this feature will ignore the filter.
For the list request that means just returning all entries (which we can
then filter in userspace as before).
For the flush request that means deleting all conntrack entries.

The implementation is now identical to the Windows one, so we combine
them.

This significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an OpenFlow command, it blocks the main ovs thread
and thereby also blocks new flows from being applied. Using this new
feature we can reduce the flushing time for zones by around 93%.

In combination with OVN, the creation of a Logical_Router (which causes
the flushing of a ct zone) could block other operations, e.g. the
failover of Logical_Routers (as they cause new flows to be created).
This is visible from a user perspective as an ovn-controller that is idle
(as it waits for vswitchd) and vswitchd reporting:
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests were run in a qemu vm with 500,000
conntrack entries distributed evenly over 500 ct zones, using `ovstest
test-netlink-conntrack flush zone=`.

         |  flush zone with 1000 entries  |    flush zone with no entry    |
         +----------+----------+----------+----------+----------+----------+
         |   with the patch    | without  |   with the patch    | without  |
         +----------+----------+----------+----------+----------+----------+
         | v6.8-rc4 |  v6.7.1  | v6.8-rc4 | v6.8-rc4 |  v6.7.1  | v6.8-rc4 |
+--------+----------+----------+----------+----------+----------+----------+
| Min    |  0.260   |  3.946   |  3.497   |  0.228   |  3.462   |  3.212   |
| Median |  0.319   |  4.237   |  4.349   |  0.298   |  4.460   |  4.010   |
| 90%ile |  0.335   |  4.367   |  4.522   |  0.325   |  4.662   |  4.572   |
| 99%ile |  0.348   |  4.495   |  4.773   |  0.340   |  4.931   |  6.003   |
| Max    |  0.362   |  4.543   |  5.054   |  0.348   |  5.390   |  6.396   |
| Mean   |  0.320   |  4.236   |  4.331   |  0.296   |  4.430   |  4.071   |
| Total  |  80.02   |  1058    |  1082    |  73.93   |  1107    |  1017    |
+--------+----------+----------+----------+----------+----------+----------+
(times in seconds)

[1]: 
https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
[2]: 
https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a

Co-Authored-By: Luca Czesla 
Signed-off-by: Luca Czesla 
Co-Authored-By: Max Lamprecht 
Signed-off-by: Max Lamprecht 
Signed-off-by: Felix Huettner 
---
v4->v5: none
v3->v4:
- combine the flush logic with windows implementation
v2->v3:
- update description to include upstream fix (Thanks to Ilya for finding
  that issue)
v1->v2:
- fixed wrong signed-off-by
 lib/netlink-conntrack.c | 52 -
 tests/system-traffic.at |  8 +++
 2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..d9015d9f0 100644
--- a/lib/netlink-conntrack.c
+++ b/lib/netlink-conntrack.c
@@ -141,6 +141,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
 
     nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
                         IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
+    if (zone) {
+        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
+    }
     nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
     ofpbuf_clear(&state->buf);
 
@@ -263,11 +266,9 @@ out:
     return err;
 }
 
-#ifdef _WIN32
-int
-nl_ct_flush_zone(uint16_t flush_zone)
+static int
+nl_ct_flush_zone_with_cta_zone(uint16_t flush_zone)
 {
-    /* Windows can flush a specific zone */
     struct ofpbuf buf;
     int err;
 
@@ -282,24 +283,63 @@ nl_ct_flush_zone(uint16_t flush_zone)
 
     return err;
 }
+
+#ifdef _WIN32
+int
+nl_ct_flush_zone(uint16_t flush_zone)
+{
+    return nl_ct_flush_zone_with_cta_zone(flush_zone);
+}
 #else
+
+static bool
+netlink_flush_supports_zone(void)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    static bool supported = false;
+
+    if (ovsthread_once_start(&once)) {
+        if (ovs_kernel_is_version_or_newer(6, 8)) {
+            supported = true;
+        } else {
+            VLOG_INFO("disabling conntrack flush by zone. "
+                      "Not supported in Linux kernel");
+        }
+        ovsthread_once_done(&once);
+    }
+    return supported;
+}
+
 int
 nl_ct_flush_zone(uint16_t flush_zone)
 {
-    /* Apparently, there's no netlink interface to flush a specific zone.
+    /* In older kernels, there was no netlink interface to flush a specific
+     * conntrack zone.
      * This code dumps every connection, checks the zone and eventually
 

[ovs-dev] [PATCH v4 2/2] netlink-conntrack: Optimize flushing ct zone.

2024-02-23 Thread Felix Huettner via dev
Previously the kernel did not provide a netlink interface to flush/list
only conntrack entries matching a specific zone. With [1] and [2] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels not yet supporting this feature will ignore the filter.
For the list request that means just returning all entries (which we can
then filter in userspace as before).
For the flush request that means deleting all conntrack entries.

The implementation is now identical to the Windows one, so we combine
them.

This significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an OpenFlow command, it blocks the main ovs thread
and thereby also blocks new flows from being applied. Using this new
feature we can reduce the flushing time for zones by around 93%.

In combination with OVN, the creation of a Logical_Router (which causes
the flushing of a ct zone) could block other operations, e.g. the
failover of Logical_Routers (as they cause new flows to be created).
This is visible from a user perspective as an ovn-controller that is idle
(as it waits for vswitchd) and vswitchd reporting:
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests were run in a qemu vm with 500,000
conntrack entries distributed evenly over 500 ct zones, using `ovstest
test-netlink-conntrack flush zone=`.

         |  flush zone with 1000 entries  |    flush zone with no entry    |
         +----------+----------+----------+----------+----------+----------+
         |   with the patch    | without  |   with the patch    | without  |
         +----------+----------+----------+----------+----------+----------+
         | v6.8-rc4 |  v6.7.1  | v6.8-rc4 | v6.8-rc4 |  v6.7.1  | v6.8-rc4 |
+--------+----------+----------+----------+----------+----------+----------+
| Min    |  0.260   |  3.946   |  3.497   |  0.228   |  3.462   |  3.212   |
| Median |  0.319   |  4.237   |  4.349   |  0.298   |  4.460   |  4.010   |
| 90%ile |  0.335   |  4.367   |  4.522   |  0.325   |  4.662   |  4.572   |
| 99%ile |  0.348   |  4.495   |  4.773   |  0.340   |  4.931   |  6.003   |
| Max    |  0.362   |  4.543   |  5.054   |  0.348   |  5.390   |  6.396   |
| Mean   |  0.320   |  4.236   |  4.331   |  0.296   |  4.430   |  4.071   |
| Total  |  80.02   |  1058    |  1082    |  73.93   |  1107    |  1017    |
+--------+----------+----------+----------+----------+----------+----------+
(times in seconds)

[1]: 
https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
[2]: 
https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a

Co-Authored-By: Luca Czesla 
Signed-off-by: Luca Czesla 
Co-Authored-By: Max Lamprecht 
Signed-off-by: Max Lamprecht 
Signed-off-by: Felix Huettner 
---
v3->v4:
- combine the flush logic with windows implementation
v2->v3:
- update description to include upstream fix (Thanks to Ilya for finding
  that issue)
v1->v2:
- fixed wrong signed-off-by

 lib/netlink-conntrack.c | 52 -
 tests/system-traffic.at |  8 +++
 2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..d9015d9f0 100644
--- a/lib/netlink-conntrack.c
+++ b/lib/netlink-conntrack.c
@@ -141,6 +141,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
 
     nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
                         IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
+    if (zone) {
+        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
+    }
     nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
     ofpbuf_clear(&state->buf);
 
@@ -263,11 +266,9 @@ out:
     return err;
 }
 
-#ifdef _WIN32
-int
-nl_ct_flush_zone(uint16_t flush_zone)
+static int
+nl_ct_flush_zone_with_cta_zone(uint16_t flush_zone)
 {
-    /* Windows can flush a specific zone */
     struct ofpbuf buf;
     int err;
 
@@ -282,24 +283,63 @@ nl_ct_flush_zone(uint16_t flush_zone)
 
     return err;
 }
+
+#ifdef _WIN32
+int
+nl_ct_flush_zone(uint16_t flush_zone)
+{
+    return nl_ct_flush_zone_with_cta_zone(flush_zone);
+}
 #else
+
+static bool
+netlink_flush_supports_zone(void)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    static bool supported = false;
+
+    if (ovsthread_once_start(&once)) {
+        if (ovs_kernel_is_version_or_newer(6, 8)) {
+            supported = true;
+        } else {
+            VLOG_INFO("disabling conntrack flush by zone. "
+                      "Not supported in Linux kernel");
+        }
+        ovsthread_once_done(&once);
+    }
+    return supported;
+}
+
 int
 nl_ct_flush_zone(uint16_t flush_zone)
 {
-    /* Apparently, there's no netlink interface to flush a specific zone.
+    /* In older kernels, there was no netlink interface to flush a specific
+     * conntrack zone.
      * This code dumps every connection, checks the zone and eventually
      * delete the entry.

[ovs-dev] [PATCH v4 1/2] util: Support checking for kernel versions.

2024-02-23 Thread Felix Huettner via dev
Extract checking for a given kernel version to a separate function.
It will also be used in the next patch.

Signed-off-by: Felix Huettner 
---
v4:
- extract function to check kernel version

 lib/netdev-linux.c | 14 +++---
 lib/util.c | 24 
 lib/util.h |  4 
 3 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index bf91ef462..51bd71ae3 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -6427,18 +6427,10 @@ getqdisc_is_safe(void)
     static bool safe = false;
 
     if (ovsthread_once_start(&once)) {
-        struct utsname utsname;
-        int major, minor;
-
-        if (uname(&utsname) == -1) {
-            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
-        } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) {
-            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
-        } else if (major < 2 || (major == 2 && minor < 35)) {
-            VLOG_INFO("disabling unsafe RTM_GETQDISC in Linux kernel %s",
-                      utsname.release);
-        } else {
+        if (ovs_kernel_is_version_or_newer(2, 35)) {
             safe = true;
+        } else {
+            VLOG_INFO("disabling unsafe RTM_GETQDISC in Linux kernel");
         }
         ovsthread_once_done(&once);
     }
diff --git a/lib/util.c b/lib/util.c
index 3fb3a4b40..95c605af8 100644
--- a/lib/util.c
+++ b/lib/util.c
@@ -27,6 +27,7 @@
 #include 
 #ifdef __linux__
 #include 
+#include <sys/utsname.h>
 #endif
 #include 
 #include 
@@ -2500,3 +2501,26 @@ OVS_CONSTRUCTOR(winsock_start) {
}
 }
 #endif
+
+#ifndef _WIN32
+bool
+ovs_kernel_is_version_or_newer(int target_major, int target_minor)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    static int current_major, current_minor = -1;
+
+    if (ovsthread_once_start(&once)) {
+        struct utsname utsname;
+
+        if (uname(&utsname) == -1) {
+            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
+        } else if (!ovs_scan(utsname.release, "%d.%d",
+                             &current_major, &current_minor)) {
+            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
+        }
+        ovsthread_once_done(&once);
+    }
+    return current_major < target_major || (
+        current_major == target_major && current_minor < target_minor);
+}
+#endif
diff --git a/lib/util.h b/lib/util.h
index f2d45bcac..57d8a2072 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -611,4 +611,8 @@ int ftruncate(int fd, off_t length);
 }
 #endif
 
+#ifndef _WIN32
+bool ovs_kernel_is_version_or_newer(int target_major, int target_minor);
+#endif
+
 #endif /* util.h */

base-commit: 5f2af0b7a30e7de84de97556223f892ef63ec14b
-- 
2.43.0



Re: [ovs-dev] [PATCH v3] netlink-conntrack: Optimize flushing ct zone.

2024-02-15 Thread Felix Huettner via dev
> > diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
> > index 492bfcffb..1b050894d 100644
> > --- a/lib/netlink-conntrack.c
> > +++ b/lib/netlink-conntrack.c
> > @@ -25,6 +25,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <sys/utsname.h>
> >  
> >  #include "byte-order.h"
> >  #include "compiler.h"
> > @@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
> >  
> >      nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
> >                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
> > +    if (zone) {
> > +        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
> > +    }
> >      nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
> >      ofpbuf_clear(&state->buf);
> >  
> > @@ -283,23 +287,72 @@ nl_ct_flush_zone(uint16_t flush_zone)
> >      return err;
> >  }
> >  #else
> > +
> > +static bool
> > +netlink_flush_supports_zone(void)
> > +{
> > +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> > +    static bool supported = false;
> > +
> > +    if (ovsthread_once_start(&once)) {
> > +        struct utsname utsname;
> > +        int major, minor;
> > +
> > +        if (uname(&utsname) == -1) {
> > +            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
> > +        } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) {
> > +            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
> 
> Do these WARN calls need to be severe like this?  There isn't much for
> the user to do about this, and it won't impact the functioning of the
> system in such a major way.  Maybe they can be INFO instead?  Otherwise,
> we may need to change most of the selftests to ignore the "uname failed"
> warnings.
> 
> More of a nit, because it shouldn't generally fail on any systems.

That was also my thought. I actually copied most of the logic from
`getqdisc_is_safe` in `netdev-linux.c`.

I am completely fine with changing it to something else if that makes
everyones lives easier, since hitting it is so unrealistic.

Let me know what you would prefer.

Thanks
Felix

> 
> > +        } else if (major < 6 || (major == 6 && minor < 8)) {
> > +            VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s",
> > +                      utsname.release);
> > +        } else {
> > +            supported = true;
> > +        }
> > +        ovsthread_once_done(&once);
> > +    }
> > +    return supported;
> > +}
> > +
> >  int
> >  nl_ct_flush_zone(uint16_t flush_zone)
> >  {
> > -    /* Apparently, there's no netlink interface to flush a specific zone.
> > +    /* In older kernels, there was no netlink interface to flush a specific
> > +     * conntrack zone.
> >       * This code dumps every connection, checks the zone and eventually
> >       * delete the entry.
> > +     * In newer kernels there is the option to specify a zone for filtering
> > +     * during dumps. Older kernels ignore this option. We set it here in the
> > +     * hope we only get relevant entries back, but fall back to filtering here
> > +     * to keep compatibility.
> >       *
> > -     * This is race-prone, but it is better than using shell scripts. */
> > +     * This is race-prone, but it is better than using shell scripts.
> > +     *
> > +     * Additionally newer kernels also support flushing a zone without listing
> > +     * it first. */
> >  
> >      struct nl_dump dump;
> >      struct ofpbuf buf, reply, delete;
> > +    int err;
> > +
> > +    if (netlink_flush_supports_zone()) {
> > +        ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
> > +
> > +        nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
> > +                            IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST);
> > +        nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
> > +
> > +        err = nl_transact(NETLINK_NETFILTER, &buf, NULL);
> > +        ofpbuf_uninit(&buf);
> > +
> > +        return err;
> > +    }
> >  
> >      ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
> >      ofpbuf_init(&delete, NL_DUMP_BUFSIZE);
> >  
> >      nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
> >                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
> > +    nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
> >      nl_dump_start(&dump, NETLINK_NETFILTER, &buf);
> >      ofpbuf_clear(&buf);
> >  


[ovs-dev] [PATCH v3] netlink-conntrack: Optimize flushing ct zone.

2024-02-12 Thread Felix Huettner via dev
Previously the kernel did not provide a netlink interface to flush/list
only conntrack entries matching a specific zone. With [1] and [2] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels not yet supporting this feature will ignore the filter.
For the list request that means just returning all entries (which we can
then filter in userspace as before).
For the flush request that means deleting all conntrack entries.

This significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an OpenFlow command, it blocks the main ovs thread
and thereby also blocks new flows from being applied. Using this new
feature we can reduce the flushing time for zones by around 93%.

In combination with OVN, the creation of a Logical_Router (which causes
the flushing of a ct zone) could block other operations, e.g. the
failover of Logical_Routers (as they cause new flows to be created).
This is visible from a user perspective as an ovn-controller that is idle
(as it waits for vswitchd) and vswitchd reporting:
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests were run in a qemu vm with 500,000
conntrack entries distributed evenly over 500 ct zones, using `ovstest
test-netlink-conntrack flush zone=`.

With this patch and kernel v6.8-rc4 (all times in seconds):

                                Min     Median  90%ile  99%ile  Max     Mean    Total     Count
  flush zone with 1000 entries  0.260   0.319   0.335   0.348   0.362   0.320   80.02     250
  flush zone with no entry      0.228   0.298   0.325   0.340   0.348   0.296   73.93     250

With this patch and kernel v6.7.1 (all times in seconds):

                                Min     Median  90%ile  99%ile  Max     Mean    Total     Count
  flush zone with 1000 entries  3.946   4.237   4.367   4.495   4.543   4.236   1058.992  250
  flush zone with no entry      3.462   4.460   4.662   4.931   5.390   4.430   1107.479  250

Without this patch and kernel v6.8-rc4 (all times in seconds):

                                Min     Median  90%ile  99%ile  Max     Mean    Total     Count
  flush zone with 1000 entries  3.497   4.349   4.522   4.773   5.054   4.331   1082.802  250
  flush zone with no entry      3.212   4.010   4.572   6.003   6.396   4.071   1017.838  250

[1]: 
https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
[2]: 
https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a

Co-Authored-By: Luca Czesla 
Signed-off-by: Luca Czesla 
Co-Authored-By: Max Lamprecht 
Signed-off-by: Max Lamprecht 
Signed-off-by: Felix Huettner 
---

v2->v3:
- update description to include upstream fix (Thanks to Ilya for finding
  that issue)
v1->v2:
- fixed wrong signed-off-by

 lib/netlink-conntrack.c | 57 +++--
 tests/system-traffic.at |  8 ++
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..1b050894d 100644
--- a/lib/netlink-conntrack.c
+++ b/lib/netlink-conntrack.c
@@ -25,6 

[ovs-dev] [PATCH v2] netlink-conntrack: Optimize flushing ct zone.

2024-02-02 Thread Felix Huettner via dev
Previously the kernel did not provide a netlink interface to flush/list
only conntrack entries matching a specific zone. With [1] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels not yet supporting this feature will ignore the filter.
For the list request that means just returning all entries (which we can
then filter in userspace as before).
For the flush request that means deleting all conntrack entries.

This significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an OpenFlow command, it blocks the main ovs thread
and thereby also blocks new flows from being applied. Using this new
feature we can reduce the flushing time for zones by around 93%.

In combination with OVN, the creation of a Logical_Router (which causes
the flushing of a ct zone) could block other operations, e.g. the
failover of Logical_Routers (as they cause new flows to be created).
This is visible from a user perspective as an ovn-controller that is idle
(as it waits for vswitchd) and vswitchd reporting:
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests were run in a qemu vm with 500,000
conntrack entries distributed evenly over 500 ct zones, using `ovstest
test-netlink-conntrack flush zone=`.

With this patch and kernel v6.8-rc2 (all times in seconds):

                                Min     Median  90%ile  99%ile  Max     Mean    Total     Count
  flush zone with 1000 entries  0.260   0.319   0.335   0.348   0.362   0.320   80.02     250
  flush zone with no entry      0.228   0.298   0.325   0.340   0.348   0.296   73.93     250

With this patch and kernel v6.7.1 (all times in seconds):

                                Min     Median  90%ile  99%ile  Max     Mean    Total     Count
  flush zone with 1000 entries  3.946   4.237   4.367   4.495   4.543   4.236   1058.992  250
  flush zone with no entry      3.462   4.460   4.662   4.931   5.390   4.430   1107.479  250

Without this patch and kernel v6.8-rc2 (all times in seconds):

                                Min     Median  90%ile  99%ile  Max     Mean    Total     Count
  flush zone with 1000 entries  3.497   4.349   4.522   4.773   5.054   4.331   1082.802  250
  flush zone with no entry      3.212   4.010   4.572   6.003   6.396   4.071   1017.838  250

[1]: 
https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba

Co-Authored-By: Luca Czesla 
Signed-off-by: Luca Czesla 
Co-Authored-By: Max Lamprecht 
Signed-off-by: Max Lamprecht 
Signed-off-by: Felix Huettner 
---

v1->v2:
- fixed wrong signed-off-by

 lib/netlink-conntrack.c | 57 +++--
 tests/system-traffic.at |  8 ++
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..1b050894d 100644
--- a/lib/netlink-conntrack.c
+++ b/lib/netlink-conntrack.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include <sys/utsname.h>
 
 #include "byte-order.h"
 #include "compiler.h"
@@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t 

[ovs-dev] [PATCH] netlink-conntrack: Optimize flushing ct zone.

2024-02-01 Thread Felix Huettner via dev
Previously the kernel did not provide a netlink interface to flush/list
only conntrack entries matching a specific zone. With [1] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels not yet supporting this feature will ignore the filter.
For the list request that means just returning all entries (which we can
then filter in userspace as before).
For the flush request that means deleting all conntrack entries.

This significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an OpenFlow command, it blocks the main ovs thread
and thereby also blocks new flows from being applied. Using this new
feature we can reduce the flushing time for zones by around 93%.

In combination with OVN, the creation of a Logical_Router (which causes
the flushing of a ct zone) could block other operations, e.g. the
failover of Logical_Routers (as they cause new flows to be created).
This is visible from a user perspective as an ovn-controller that is idle
(as it waits for vswitchd) and vswitchd reporting:
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests were run in a qemu vm with 500,000
conntrack entries distributed evenly over 500 ct zones, using `ovstest
test-netlink-conntrack flush zone=`.

With this patch and kernel v6.8-rc2 (all times in seconds):

                                Min     Median  90%ile  99%ile  Max     Mean    Total     Count
  flush zone with 1000 entries  0.260   0.319   0.335   0.348   0.362   0.320   80.02     250
  flush zone with no entry      0.228   0.298   0.325   0.340   0.348   0.296   73.93     250

With this patch and kernel v6.7.1 (all times in seconds):

                                Min     Median  90%ile  99%ile  Max     Mean    Total     Count
  flush zone with 1000 entries  3.946   4.237   4.367   4.495   4.543   4.236   1058.992  250
  flush zone with no entry      3.462   4.460   4.662   4.931   5.390   4.430   1107.479  250

Without this patch and kernel v6.8-rc2 (all times in seconds):

                                Min     Median  90%ile  99%ile  Max     Mean    Total     Count
  flush zone with 1000 entries  3.497   4.349   4.522   4.773   5.054   4.331   1082.802  250
  flush zone with no entry      3.212   4.010   4.572   6.003   6.396   4.071   1017.838  250

[1]: 
https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba

Co-Authored-By: Luca Czesla 
Signed-off-by: Luca Czesla 
Co-Authored-By: Max Lamprecht 
Signed-off-by: Max Lamprecht 
---
 lib/netlink-conntrack.c | 57 +++--
 tests/system-traffic.at |  8 ++
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..1b050894d 100644
--- a/lib/netlink-conntrack.c
+++ b/lib/netlink-conntrack.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include <sys/utsname.h>
 
 #include "byte-order.h"
 #include "compiler.h"
@@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
 
     nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, 

Re: [ovs-dev] [PATCH ovn 1/4] rbac: MAC_Bindings can only be updated by the inserting chassis.

2024-01-22 Thread Felix Huettner via dev
On Fri, Jan 19, 2024 at 04:33:28PM -0500, Mark Michelson wrote:
> With this change, a chassis may only update MAC Binding records that it
> has created. We achieve this by adding a "chassis_name" column to the
> MAC_Binding table, and having the chassis insert its name into this
> column when creating a new MAC_Binding. The "chassis_name" is now part
> of the rbac_auth structure for the MAC_Binding table.

Hi Mark,

I am concerned that this will negatively impact MAC_Bindings for LRPs
with multiple gateway chassis.

Suppose a MAC_Binding is first learned by an LRP currently residing on
chassis1. The LRP then fails over to chassis2 and chassis1 is potentially even
removed completely. In this case the ovn-controller on chassis2 would no
longer be allowed to update the timestamp column. This would break the
ARP refresh mechanism.

In this case the MAC_Binding would need to expire first, causing northd
to remove it. Afterwards chassis2 would be allowed to insert a new
record with its own chassis name.

I honestly did not try out this case, so I am not fully sure if this
issue really exists or if I have a misunderstanding somewhere.

Thanks
Felix

> ---
>  controller/pinctrl.c | 51 
>  northd/ovn-northd.c  |  2 +-
>  ovn-sb.ovsschema |  7 +++---
>  ovn-sb.xml   |  3 +++
>  4 files changed, 45 insertions(+), 18 deletions(-)
> 
> diff --git a/controller/pinctrl.c b/controller/pinctrl.c
> index 4992eab08..a00cdceea 100644
> --- a/controller/pinctrl.c
> +++ b/controller/pinctrl.c
> @@ -180,6 +180,7 @@ struct pinctrl {
>  bool mac_binding_can_timestamp;
>  bool fdb_can_timestamp;
>  bool dns_supports_ovn_owned;
> +bool mac_binding_has_chassis_name;
>  };
>  
>  static struct pinctrl pinctrl;
> @@ -204,7 +205,8 @@ static void run_put_mac_bindings(
>  struct ovsdb_idl_txn *ovnsb_idl_txn,
>  struct ovsdb_idl_index *sbrec_datapath_binding_by_key,
>  struct ovsdb_idl_index *sbrec_port_binding_by_key,
> -struct ovsdb_idl_index *sbrec_mac_binding_by_lport_ip)
> +struct ovsdb_idl_index *sbrec_mac_binding_by_lport_ip,
> +const struct sbrec_chassis *chassis)
>  OVS_REQUIRES(pinctrl_mutex);
>  static void wait_put_mac_bindings(struct ovsdb_idl_txn *ovnsb_idl_txn);
>  static void send_mac_binding_buffered_pkts(struct rconn *swconn)
> @@ -3591,6 +3593,13 @@ pinctrl_update(const struct ovsdb_idl *idl, const char 
> *br_int_name)
>  notify_pinctrl_handler();
>  }
>  
> +bool mac_binding_has_chassis_name =
> +sbrec_server_has_mac_binding_table_col_chassis_name(idl);
> +if (mac_binding_has_chassis_name != 
> pinctrl.mac_binding_has_chassis_name) {
> +pinctrl.mac_binding_has_chassis_name = mac_binding_has_chassis_name;
> +notify_pinctrl_handler();
> +}
> +
> >      ovs_mutex_unlock(&pinctrl_mutex);
>  }
>  
> @@ -3621,7 +3630,8 @@ pinctrl_run(struct ovsdb_idl_txn *ovnsb_idl_txn,
> >      ovs_mutex_lock(&pinctrl_mutex);
>  run_put_mac_bindings(ovnsb_idl_txn, sbrec_datapath_binding_by_key,
>   sbrec_port_binding_by_key,
> - sbrec_mac_binding_by_lport_ip);
> + sbrec_mac_binding_by_lport_ip,
> + chassis);
>  run_put_vport_bindings(ovnsb_idl_txn, sbrec_datapath_binding_by_key,
> sbrec_port_binding_by_key, chassis);
>  send_garp_rarp_prepare(ovnsb_idl_txn, sbrec_port_binding_by_datapath,
> @@ -4285,7 +4295,8 @@ mac_binding_add_to_sb(struct ovsdb_idl_txn 
> *ovnsb_idl_txn,
>const char *logical_port,
>const struct sbrec_datapath_binding *dp,
>struct eth_addr ea, const char *ip,
> -  bool update_only)
> +  bool update_only,
> +  const struct sbrec_chassis *chassis)
>  {
>  /* Convert ethernet argument to string form for database. */
>  char mac_string[ETH_ADDR_STRLEN + 1];
> @@ -4302,6 +4313,9 @@ mac_binding_add_to_sb(struct ovsdb_idl_txn 
> *ovnsb_idl_txn,
>  sbrec_mac_binding_set_logical_port(b, logical_port);
>  sbrec_mac_binding_set_ip(b, ip);
>  sbrec_mac_binding_set_datapath(b, dp);
> +if (pinctrl.mac_binding_has_chassis_name) {
> +sbrec_mac_binding_set_chassis_name(b, chassis->name);
> +}
>  }
>  
>  if (strcmp(b->mac, mac_string)) {
> @@ -4323,7 +4337,8 @@ send_garp_locally(struct ovsdb_idl_txn *ovnsb_idl_txn,
>struct ovsdb_idl_index *sbrec_mac_binding_by_lport_ip,
>const struct hmap *local_datapaths,
>const struct sbrec_port_binding *in_pb,
> -  struct eth_addr ea, ovs_be32 ip)
> +  struct eth_addr ea, ovs_be32 ip,
> +  const struct sbrec_chassis *chassis)
>  {
>  if (!ovnsb_idl_txn) {
>  return;
> @@ -4351,7 +4366,7 @@ send_garp_locally(struct ovsdb_idl_txn 

Re: [ovs-dev] [PATCH v4] reconnect: Set defaults from environment variables.

2023-12-17 Thread Felix Huettner via dev
On Sat, Dec 16, 2023 at 12:42:01AM +0100, Ilya Maximets wrote:
> On 12/4/23 12:23, Felix Huettner via dev wrote:
> > This exposes the old constants regarding min backoff, max backoff and
> > probe interval using environment variables. In case previously users
> > wanted to tune the probe interval for all connections this required
> > setting this setting in multiple locations. E.g. to configure the probe
> > interval from a relay to the ovsdb server you need to call an appctl
> > command after the relay has started up.
> > The new environment variables make it easy to set them for all new
> > connections. The existing configuration options for these settings stay
> > in place and take precedence over the environment variables.
> > In case the environment variables are not specified/invalid formatted we
> > default to the previous defaults.
> >
> > Signed-off-by: Felix Huettner 
> > ---
> > v3->v4: fix conflict in NEWS file
> > v2->v3: add minimal values and defaults as defines
> > v1->v2: fixed wrong function definitions
> >
> >  NEWS|  7 +
> >  lib/jsonrpc.c   |  4 +--
> >  lib/reconnect.c | 70 +
> >  lib/reconnect.h |  7 ++---
> >  ovsdb/jsonrpc-server.c  |  4 +--
> >  ovsdb/ovsdb-server.c|  2 +-
> >  ovsdb/relay.h   |  2 --
> >  python/ovs/reconnect.py | 45 --
> >  8 files changed, 121 insertions(+), 20 deletions(-)
> >
> > diff --git a/NEWS b/NEWS
> > index 490e275da..6b580edf3 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -19,6 +19,13 @@ Post-v3.2.0
> >   * Added support for Generic Segmentation Offloading for the cases 
> > where
> > TSO is enabled but not supported by an egress interface (except for
> > tunnel interfaces).
> > +   - The environment variable OVS_RECONNECT_MIN_BACKOFF,
> > + OVS_RECONNECT_MAX_BACKOFF and OVS_RECONNECT_PROBE_INTERVAL, if set, 
> > allow
> > + users to tune the default reconnect configuration. E.g. adapting
> > + OVS_RECONNECT_PROBE_INTERVAL modifies the default probe interval, for 
> > all
> > + servers and clients. These variables are valid for all ovs code as 
> > well as
> > + the python client. Values set for individual connections (e.g. using 
> > the
> > + `inactivity_probe` column in the `Manager` tables) will take 
> > precedence.
>
> Hi, Felix.  Thanks for the patch!  Though I'm not sure if introduction
> of yet another way of setting the same thing is a good approach for
> this problem.  If anything, we should probably try to reduce the number
> of configuration methods.  I tried to come up with a way to solve the
> configuration for ovsdb-server for a long time now, and the config file
> approach that I've been working for a past few months seems promising:
>   
> https://patchwork.ozlabs.org/project/openvswitch/patch/20231214010431.1664005-21-i.maximets%40ovn.org/
> At least it should reduce the configuration method variety while the
> config file is in use.  And it should solve the problem of configuring
> relays that you mentioned in the commit message.
>

Hi Ilya,

thanks for the feedback. I honestly like your approach significantly
more since it fixes a bunch of other issues as well.

> Having a configurable global default may also be confusing for the end
> users.  Primarily because it doesn't actually apply to everything.
> For example, active-backup replication has its own default probe interval
> that will not be affected, and RAFT is calculating probe interval based
> on election timer.  So, it may be not clear why the variable is set,
> nothing is explicitly configured by the user, but some connections are
> not using the value.
> Another potential source of confusion is that OpenFlow connections use
> the same terminology (probe_interval), but a different implementation
> that is not going to be affected by this change as well.
>

We noticed that as well. I guess the config file allows for more
expansions later on more easily (e.g. to specify a target RAFT election
timer and then let ovsdb change it in multiple steps as needed).

> Is there a scenario that we still need to cover that will not be covered
> by the configuration file?  If there are some, we could try to come
> up with solutions for these.  I know of a few, but these are just bugs
> that should be fixed and this patch will not help with them anyway.
>

The only downside I see there is that it will not allow us to configure
the northd and ovn-controller settings as well (which the environment
variables would allow). Maybe for their f

[ovs-dev] [PATCH v4] reconnect: Set defaults from environment variables.

2023-12-04 Thread Felix Huettner via dev
This exposes the old constants regarding min backoff, max backoff and
probe interval using environment variables. In case previously users
wanted to tune the probe interval for all connections this required
setting this setting in multiple locations. E.g. to configure the probe
interval from a relay to the ovsdb server you need to call an appctl
command after the relay has started up.
The new environment variables make it easy to set them for all new
connections. The existing configuration options for these settings stay
in place and take precedence over the environment variables.
In case the environment variables are not specified or are invalidly
formatted, we default to the previous defaults.
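
To make the precedence concrete: the environment variables only change
what a freshly created reconnect FSM starts out with, while an explicit
per-connection setting still wins. Roughly (a sketch against the public
reconnect API):

#include "reconnect.h"
#include "timeval.h"

void
example(void)
{
    /* With OVS_RECONNECT_PROBE_INTERVAL=10000 in the environment, the
     * FSM starts out with a 10 s probe interval... */
    struct reconnect *fsm = reconnect_create(time_msec());

    /* ...but an explicit setting (e.g. driven by the Manager table's
     * inactivity_probe column) still takes precedence. */
    reconnect_set_probe_interval(fsm, 30000);

    reconnect_destroy(fsm);
}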

Signed-off-by: Felix Huettner 
---
v3->v4: fix conflict in NEWS file
v2->v3: add minimal values and defaults as defines
v1->v2: fixed wrong function definitions

 NEWS|  7 +
 lib/jsonrpc.c   |  4 +--
 lib/reconnect.c | 70 +
 lib/reconnect.h |  7 ++---
 ovsdb/jsonrpc-server.c  |  4 +--
 ovsdb/ovsdb-server.c|  2 +-
 ovsdb/relay.h   |  2 --
 python/ovs/reconnect.py | 45 --
 8 files changed, 121 insertions(+), 20 deletions(-)

diff --git a/NEWS b/NEWS
index 490e275da..6b580edf3 100644
--- a/NEWS
+++ b/NEWS
@@ -19,6 +19,13 @@ Post-v3.2.0
  * Added support for Generic Segmentation Offloading for the cases where
TSO is enabled but not supported by an egress interface (except for
tunnel interfaces).
+   - The environment variables OVS_RECONNECT_MIN_BACKOFF,
+     OVS_RECONNECT_MAX_BACKOFF and OVS_RECONNECT_PROBE_INTERVAL, if set, allow
+     users to tune the default reconnect configuration. E.g. adapting
+     OVS_RECONNECT_PROBE_INTERVAL modifies the default probe interval for all
+     servers and clients. These variables are valid for all ovs code as well as
+     the python client. Values set for individual connections (e.g. using the
+     `inactivity_probe` column in the `Manager` tables) will take precedence.


 v3.2.0 - 17 Aug 2023
diff --git a/lib/jsonrpc.c b/lib/jsonrpc.c
index c8ce5362e..7bf49fc81 100644
--- a/lib/jsonrpc.c
+++ b/lib/jsonrpc.c
@@ -1269,8 +1269,8 @@ void
 jsonrpc_session_enable_reconnect(struct jsonrpc_session *s)
 {
 reconnect_set_max_tries(s->reconnect, UINT_MAX);
-reconnect_set_backoff(s->reconnect, RECONNECT_DEFAULT_MIN_BACKOFF,
-  RECONNECT_DEFAULT_MAX_BACKOFF);
+reconnect_set_backoff(s->reconnect, reconnect_default_min_backoff(),
+  reconnect_default_max_backoff());
 }

 /* Forces 's' to drop its connection (if any) and reconnect. */
diff --git a/lib/reconnect.c b/lib/reconnect.c
index 89a0bcaf9..8e18db1c6 100644
--- a/lib/reconnect.c
+++ b/lib/reconnect.c
@@ -25,6 +25,13 @@

 VLOG_DEFINE_THIS_MODULE(reconnect);

+#define RECONNECT_MIN_MIN_BACKOFF 1000
+#define RECONNECT_DEFAULT_MIN_BACKOFF 1000
+#define RECONNECT_MIN_MAX_BACKOFF 2000
+#define RECONNECT_DEFAULT_MAX_BACKOFF 8000
+#define RECONNECT_MIN_PROBE_INTERVAL 1000
+#define RECONNECT_DEFAULT_PROBE_INTERVAL 5000
+
 #define STATES  \
 STATE(VOID, 1 << 0) \
 STATE(BACKOFF, 1 << 1)  \
@@ -99,9 +106,9 @@ reconnect_create(long long int now)
 struct reconnect *fsm = xzalloc(sizeof *fsm);

 fsm->name = xstrdup("void");
-fsm->min_backoff = RECONNECT_DEFAULT_MIN_BACKOFF;
-fsm->max_backoff = RECONNECT_DEFAULT_MAX_BACKOFF;
-fsm->probe_interval = RECONNECT_DEFAULT_PROBE_INTERVAL;
+fsm->min_backoff = reconnect_default_min_backoff();
+fsm->max_backoff = reconnect_default_max_backoff();
+fsm->probe_interval = reconnect_default_probe_interval();
 fsm->passive = false;
 fsm->info = VLL_INFO;

@@ -163,7 +170,7 @@ reconnect_set_name(struct reconnect *fsm, const char *name)
 }

 /* Return the minimum number of milliseconds to back off between consecutive
- * connection attempts.  The default is RECONNECT_DEFAULT_MIN_BACKOFF. */
+ * connection attempts.  The default is reconnect_default_min_backoff(). */
 int
 reconnect_get_min_backoff(const struct reconnect *fsm)
 {
@@ -171,7 +178,7 @@ reconnect_get_min_backoff(const struct reconnect *fsm)
 }

 /* Return the maximum number of milliseconds to back off between consecutive
- * connection attempts.  The default is RECONNECT_DEFAULT_MAX_BACKOFF. */
+ * connection attempts.  The default is reconnect_default_max_backoff(). */
 int
 reconnect_get_max_backoff(const struct reconnect *fsm)
 {
@@ -190,6 +197,57 @@ reconnect_get_probe_interval(const struct reconnect *fsm)
 return fsm->probe_interval;
 }

+/* Returns the default min_backoff value for reconnect. It uses the environment
+ * variable OVS_RECONNECT_MIN_BACKOFF if set and valid or otherwise defaults
+ * to 1 second. The return value is in ms. */
+unsigned int reconnect_default_min_backoff(void) {
+static unsigned int 

Re: [ovs-dev] [PATCH v2] reconnect: Set defaults from environment variables.

2023-12-04 Thread Felix Huettner via dev
Hi Eelco,

thanks for the feedback. I updated it and send a v3.

On Fri, Dec 01, 2023 at 10:38:16AM +0100, Eelco Chaudron wrote:
>
>
> On 8 Nov 2023, at 11:07, Felix Huettner via dev wrote:
>
> > this exposes the old constants regarding min backoff, max backoff and
>
> This
>

Sorry, I missed that one, so it's still wrong. But if there is a v4 it
will be included.

> >
> > +/* Returns the default min_backoff value for reconnect. It uses the environment
> > + * variable OVS_RECONNECT_MIN_BACKOFF if set and valid or otherwise defaults
> > + * to 1 second. The return value is in ms. */
> > +unsigned int reconnect_default_min_backoff(void) {
> > +    static unsigned int default_min_backoff = 0;
> > +    if (default_min_backoff == 0) {
> > +        char *env = getenv("OVS_RECONNECT_MIN_BACKOFF");
> > +        if (env && env[0]) {
> > +            str_to_uint(env, 10, &default_min_backoff);
> > +        }
> > +        if (default_min_backoff == 0) {
>
> I think we should define some sane minimum value. Or are we ok with 1ms? Some 
> for all the others.

I now defined 1000ms as minimum for the min_backoff and probe_interval
and used 2000ms for max_backoff. Let me know if you find other values
more reasonable.
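
As a sketch of how that floor could be combined with the environment
parsing (illustrative only, not the literal v3 code; it assumes the
RECONNECT_MIN_*/RECONNECT_DEFAULT_* defines added in v3):

    unsigned int
    reconnect_default_min_backoff(void)
    {
        static unsigned int default_min_backoff = 0;

        if (!default_min_backoff) {
            char *env = getenv("OVS_RECONNECT_MIN_BACKOFF");

            if (env && env[0]) {
                str_to_uint(env, 10, &default_min_backoff);
            }
            /* Unset, unparsable, or too-small values fall back to the
             * compile-time default. */
            if (default_min_backoff < RECONNECT_MIN_MIN_BACKOFF) {
                default_min_backoff = RECONNECT_DEFAULT_MIN_BACKOFF;
            }
        }
        return default_min_backoff;
    }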

>
> > +default_min_backoff = 1000;
>
> I think we should keep the default definitions, and use them here rather than 
> the hard-coded numbers.
> So this will become:
>
>   default_min_backoff = RECONNECT_DEFAULT_MIN_BACKOFF
>

Yes, I re-added them as defines at the top of the file (so they migrate
from reconnect.h to reconnect.c).

Thanks
Felix


[ovs-dev] [PATCH v3] reconnect: Set defaults from environment variables.

2023-12-04 Thread Felix Huettner via dev
This exposes the old constants for min backoff, max backoff and probe
interval via environment variables. Previously, users who wanted to
tune the probe interval for all connections had to change the setting
in multiple locations; e.g., to configure the probe interval from a
relay to the ovsdb server you need to call an appctl command after the
relay has started up.
The new environment variables make it easy to set them for all new
connections. The existing configuration options for these settings stay
in place and take precedence over the environment variables.
If the environment variables are unset or invalidly formatted, we fall
back to the previous defaults.

Signed-off-by: Felix Huettner 
---
v2->v3: add minimal values and defaults as defines
v1->v2: fixed wrong function definitions

 NEWS|  7 +
 lib/jsonrpc.c   |  4 +--
 lib/reconnect.c | 70 +
 lib/reconnect.h |  7 ++---
 ovsdb/jsonrpc-server.c  |  4 +--
 ovsdb/ovsdb-server.c|  2 +-
 ovsdb/relay.h   |  2 --
 python/ovs/reconnect.py | 45 --
 8 files changed, 121 insertions(+), 20 deletions(-)
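
As a quick illustration of the precedence described in the commit
message (a hypothetical snippet, not part of the patch): the
environment variable only seeds the default, while an explicit
per-connection setting still wins.

    /* The env var must be set before the first reconnect is created,
     * since the default is cached on first use. */
    setenv("OVS_RECONNECT_PROBE_INTERVAL", "10000", 1);

    struct reconnect *fsm = reconnect_create(time_msec());
    /* fsm now starts with a 10000 ms probe interval... */

    reconnect_set_probe_interval(fsm, 30000);
    /* ...but an explicit setting takes precedence: 30000 ms. */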

diff --git a/NEWS b/NEWS
index 6b45492f1..58266ae52 100644
--- a/NEWS
+++ b/NEWS
@@ -6,6 +6,13 @@ Post-v3.2.0
from older version is supported but it may trigger more leader elections
during the process, and error logs complaining unrecognized fields may
be observed on old nodes.
+   - The environment variables OVS_RECONNECT_MIN_BACKOFF,
+ OVS_RECONNECT_MAX_BACKOFF and OVS_RECONNECT_PROBE_INTERVAL, if set, allow
+ users to tune the default reconnect configuration. E.g. adapting
+ OVS_RECONNECT_PROBE_INTERVAL modifies the default probe interval for all
+ servers and clients. These variables are valid for all ovs code as well as
+ the python client. Values set for individual connections (e.g. using the
+ `inactivity_probe` column in the `Manager` tables) will take precedence.


 v3.2.0 - 17 Aug 2023
diff --git a/lib/jsonrpc.c b/lib/jsonrpc.c
index c8ce5362e..7bf49fc81 100644
--- a/lib/jsonrpc.c
+++ b/lib/jsonrpc.c
@@ -1269,8 +1269,8 @@ void
 jsonrpc_session_enable_reconnect(struct jsonrpc_session *s)
 {
 reconnect_set_max_tries(s->reconnect, UINT_MAX);
-reconnect_set_backoff(s->reconnect, RECONNECT_DEFAULT_MIN_BACKOFF,
-  RECONNECT_DEFAULT_MAX_BACKOFF);
+reconnect_set_backoff(s->reconnect, reconnect_default_min_backoff(),
+  reconnect_default_max_backoff());
 }

 /* Forces 's' to drop its connection (if any) and reconnect. */
diff --git a/lib/reconnect.c b/lib/reconnect.c
index 89a0bcaf9..8e18db1c6 100644
--- a/lib/reconnect.c
+++ b/lib/reconnect.c
@@ -25,6 +25,13 @@

 VLOG_DEFINE_THIS_MODULE(reconnect);

+#define RECONNECT_MIN_MIN_BACKOFF 1000
+#define RECONNECT_DEFAULT_MIN_BACKOFF 1000
+#define RECONNECT_MIN_MAX_BACKOFF 2000
+#define RECONNECT_DEFAULT_MAX_BACKOFF 8000
+#define RECONNECT_MIN_PROBE_INTERVAL 1000
+#define RECONNECT_DEFAULT_PROBE_INTERVAL 5000
+
 #define STATES  \
 STATE(VOID, 1 << 0) \
 STATE(BACKOFF, 1 << 1)  \
@@ -99,9 +106,9 @@ reconnect_create(long long int now)
 struct reconnect *fsm = xzalloc(sizeof *fsm);

 fsm->name = xstrdup("void");
-fsm->min_backoff = RECONNECT_DEFAULT_MIN_BACKOFF;
-fsm->max_backoff = RECONNECT_DEFAULT_MAX_BACKOFF;
-fsm->probe_interval = RECONNECT_DEFAULT_PROBE_INTERVAL;
+fsm->min_backoff = reconnect_default_min_backoff();
+fsm->max_backoff = reconnect_default_max_backoff();
+fsm->probe_interval = reconnect_default_probe_interval();
 fsm->passive = false;
 fsm->info = VLL_INFO;

@@ -163,7 +170,7 @@ reconnect_set_name(struct reconnect *fsm, const char *name)
 }

 /* Return the minimum number of milliseconds to back off between consecutive
- * connection attempts.  The default is RECONNECT_DEFAULT_MIN_BACKOFF. */
+ * connection attempts.  The default is reconnect_default_min_backoff(). */
 int
 reconnect_get_min_backoff(const struct reconnect *fsm)
 {
@@ -171,7 +178,7 @@ reconnect_get_min_backoff(const struct reconnect *fsm)
 }

 /* Return the maximum number of milliseconds to back off between consecutive
- * connection attempts.  The default is RECONNECT_DEFAULT_MAX_BACKOFF. */
+ * connection attempts.  The default is reconnect_default_max_backoff(). */
 int
 reconnect_get_max_backoff(const struct reconnect *fsm)
 {
@@ -190,6 +197,57 @@ reconnect_get_probe_interval(const struct reconnect *fsm)
 return fsm->probe_interval;
 }

+/* Returns the default min_backoff value for reconnect. It uses the environment
+ * variable OVS_RECONNECT_MIN_BACKOFF if set and valid or otherwise defaults
+ * to 1 second. The return value is in ms. */
+unsigned int reconnect_default_min_backoff(void) {
+static unsigned int default_min_backoff = 0;
+if 

[ovs-dev] [RFC] netlink-conntrack: optimize flushing ct zone

2023-11-17 Thread Felix Huettner via dev
NOTE: the improvements of this change depend on a kernel change that is
currently on the netdev mailing list (see [1]).

Previously the kernel did not provide a netlink interface to flush/list
only conntrack entries matching a specific zone. With [1] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels not yet supporting this feature will ignore the filter.
For the list request that means just returning all entries (which we can
then filter in userspace as before).
For the flush request that means deleting all conntrack entries.
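
A rough sketch of the userspace side (hypothetical helper name; CTA_ZONE
is the standard attribute from nfnetlink_conntrack):

    /* Attach a zone filter to a conntrack dump/flush request.  Kernels
     * without the filtering patch ignore CTA_ZONE, so the request
     * degrades to "all zones" there. */
    static void
    nl_ct_put_zone_filter(struct ofpbuf *request, uint16_t zone)
    {
        nl_msg_put_be16(request, CTA_ZONE, htons(zone));
    }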

This significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an openflow command, it blocks the main ovs
thread and thereby also blocks new flows from being applied. The main
benefit can already be achieved by using the existing logic with the
additional filter based on the zone (90-95% speedup). Using the logic
to flush directly by zone brings an additional 10-15% on top of that
(more numbers below).

In combination with OVN the creation of a Logical_Router (which causes
the flushing of a ct zone) could block other operations, e.g. the
failover of Logical_Routers (as they cause new flows to be created).
This is visible from a user perspective as an ovn-controller that is idle
(as it waits for vswitchd) and vswitchd reporting:
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests were run in a QEMU VM with 500,000
conntrack entries distributed evenly over 500 ct zones using `ovstest
test-netlink-conntrack flush zone=`.

With this patch and the respective kernel patch applied, but
OVS_NETLINK_CONNTRAK_FLUSH_ZONE_SUPPORTED unset:

------------------------------------------------------------------------------------------------------------
                              Min (s)   Median (s)  90%ile (s)  99%ile (s)  Max (s)   Mean (s)  Total (s)   Count
------------------------------------------------------------------------------------------------------------
flush zone with 1000 entries  0.309     0.372       0.393       0.467       0.516     0.374     93.597      250
flush zone with no entry      0.265     0.305       0.333       0.352       0.393     0.307     76.770      250
------------------------------------------------------------------------------------------------------------

With this patch and the respective kernel patch applied, and
OVS_NETLINK_CONNTRAK_FLUSH_ZONE_SUPPORTED set:

------------------------------------------------------------------------------------------------------------
                              Min (s)   Median (s)  90%ile (s)  99%ile (s)  Max (s)   Mean (s)  Total (s)   Count
------------------------------------------------------------------------------------------------------------
flush zone with 1000 entries  0.256     0.323       0.341       0.367       0.389     0.322     80.729      250
flush zone with no entry      0.225     0.265       0.317       0.336       0.351     0.274     68.659      250
------------------------------------------------------------------------------------------------------------

Before this patch and/or without the respective kernel patch

------------------------------------------------------------------------------------------------------------
                              Min (s)   Median (s)  90%ile (s)  99%ile (s)  Max (s)   Mean (s)  Total (s)   Count
------------------------------------------------------------------------------------------------------------
flush zone with 1000 entries  2.499     4.990       5.209       6.435       7.150     5.008     1252.158    250
flush zone with no entry      4.120     4.572       4.783       5.156       5.364     4.559     1139.786    250
------------------------------------------------------------------------------------------------------------

[1]: 
https://lore.kernel.org/netdev/zvegfp2x-wx6d...@sit-sdelap4051.int.lidl.net/T/#u
---
 lib/netlink-conntrack.c | 49 +++--
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..32be0d122 100644
--- a/lib/netlink-conntrack.c
+++ 

[ovs-dev] [PATCH v2] reconnect: Set defaults from environment variables.

2023-11-08 Thread Felix Huettner via dev
This exposes the old constants for min backoff, max backoff and probe
interval via environment variables. Previously, users who wanted to
tune the probe interval for all connections had to change the setting
in multiple locations; e.g., to configure the probe interval from a
relay to the ovsdb server you need to call an appctl command after the
relay has started up.
The new environment variables make it easy to set them for all new
connections. The existing configuration options for these settings stay
in place and take precedence over the environment variables.
If the environment variables are unset or invalidly formatted, we fall
back to the previous defaults.

Signed-off-by: Felix Huettner 
---
v1->v2: fixed wrong function definitions
---
 NEWS|  7 +
 lib/jsonrpc.c   |  4 +--
 lib/reconnect.c | 63 +
 lib/reconnect.h |  7 ++---
 ovsdb/jsonrpc-server.c  |  4 +--
 ovsdb/ovsdb-server.c|  2 +-
 ovsdb/relay.h   |  2 --
 python/ovs/reconnect.py | 45 +++--
 8 files changed, 114 insertions(+), 20 deletions(-)

diff --git a/NEWS b/NEWS
index 6b45492f1..58266ae52 100644
--- a/NEWS
+++ b/NEWS
@@ -6,6 +6,13 @@ Post-v3.2.0
from older version is supported but it may trigger more leader elections
during the process, and error logs complaining unrecognized fields may
be observed on old nodes.
+   - The environment variables OVS_RECONNECT_MIN_BACKOFF,
+ OVS_RECONNECT_MAX_BACKOFF and OVS_RECONNECT_PROBE_INTERVAL, if set, allow
+ users to tune the default reconnect configuration. E.g. adapting
+ OVS_RECONNECT_PROBE_INTERVAL modifies the default probe interval for all
+ servers and clients. These variables are valid for all ovs code as well as
+ the python client. Values set for individual connections (e.g. using the
+ `inactivity_probe` column in the `Manager` tables) will take precedence.


 v3.2.0 - 17 Aug 2023
diff --git a/lib/jsonrpc.c b/lib/jsonrpc.c
index c8ce5362e..7bf49fc81 100644
--- a/lib/jsonrpc.c
+++ b/lib/jsonrpc.c
@@ -1269,8 +1269,8 @@ void
 jsonrpc_session_enable_reconnect(struct jsonrpc_session *s)
 {
 reconnect_set_max_tries(s->reconnect, UINT_MAX);
-reconnect_set_backoff(s->reconnect, RECONNECT_DEFAULT_MIN_BACKOFF,
-  RECONNECT_DEFAULT_MAX_BACKOFF);
+reconnect_set_backoff(s->reconnect, reconnect_default_min_backoff(),
+  reconnect_default_max_backoff());
 }

 /* Forces 's' to drop its connection (if any) and reconnect. */
diff --git a/lib/reconnect.c b/lib/reconnect.c
index 89a0bcaf9..3e73d7cca 100644
--- a/lib/reconnect.c
+++ b/lib/reconnect.c
@@ -99,9 +99,9 @@ reconnect_create(long long int now)
 struct reconnect *fsm = xzalloc(sizeof *fsm);

 fsm->name = xstrdup("void");
-fsm->min_backoff = RECONNECT_DEFAULT_MIN_BACKOFF;
-fsm->max_backoff = RECONNECT_DEFAULT_MAX_BACKOFF;
-fsm->probe_interval = RECONNECT_DEFAULT_PROBE_INTERVAL;
+fsm->min_backoff = reconnect_default_min_backoff();
+fsm->max_backoff = reconnect_default_max_backoff();
+fsm->probe_interval = reconnect_default_probe_interval();
 fsm->passive = false;
 fsm->info = VLL_INFO;

@@ -163,7 +163,7 @@ reconnect_set_name(struct reconnect *fsm, const char *name)
 }

 /* Return the minimum number of milliseconds to back off between consecutive
- * connection attempts.  The default is RECONNECT_DEFAULT_MIN_BACKOFF. */
+ * connection attempts.  The default is reconnect_default_min_backoff(). */
 int
 reconnect_get_min_backoff(const struct reconnect *fsm)
 {
@@ -171,7 +171,7 @@ reconnect_get_min_backoff(const struct reconnect *fsm)
 }

 /* Return the maximum number of milliseconds to back off between consecutive
- * connection attempts.  The default is RECONNECT_DEFAULT_MAX_BACKOFF. */
+ * connection attempts.  The default is reconnect_default_max_backoff(). */
 int
 reconnect_get_max_backoff(const struct reconnect *fsm)
 {
@@ -190,6 +190,57 @@ reconnect_get_probe_interval(const struct reconnect *fsm)
 return fsm->probe_interval;
 }

+/* Returns the default min_backoff value for reconnect. It uses the environment
+ * variable OVS_RECONNECT_MIN_BACKOFF if set and valid or otherwise defaults
+ * to 1 second. The return value is in ms. */
+unsigned int reconnect_default_min_backoff(void) {
+static unsigned int default_min_backoff = 0;
+if (default_min_backoff == 0) {
+char *env = getenv("OVS_RECONNECT_MIN_BACKOFF");
+if (env && env[0]) {
+str_to_uint(env, 10, &default_min_backoff);
+}
+if (default_min_backoff == 0) {
+default_min_backoff = 1000;
+}
+}
+return default_min_backoff;
+}
+
+/* Returns the default max_backoff value for reconnect. It uses the environment
+ * variable OVS_RECONNECT_MAX_BACKOFF if set and valid or otherwise defaults
+ * to 8 seconds. The return value is in ms. */
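
(The archive truncates the message here.  By analogy with the
min_backoff helper above, the max_backoff helper presumably continues
along these lines -- a reconstruction, not the literal patch text:)

    +unsigned int reconnect_default_max_backoff(void) {
    +    static unsigned int default_max_backoff = 0;
    +    if (default_max_backoff == 0) {
    +        char *env = getenv("OVS_RECONNECT_MAX_BACKOFF");
    +        if (env && env[0]) {
    +            str_to_uint(env, 10, &default_max_backoff);
    +        }
    +        if (default_max_backoff == 0) {
    +            default_max_backoff = 8000;
    +        }
    +    }
    +    return default_max_backoff;
    +}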

[ovs-dev] [PATCH] reconnect: set defaults from environment variables

2023-11-07 Thread Felix Huettner via dev
This exposes the old constants for min backoff, max backoff and probe
interval via environment variables. Previously, users who wanted to
tune the probe interval for all connections had to change the setting
in multiple locations; e.g., to configure the probe interval from a
relay to the ovsdb server you need to call an appctl command after the
relay has started up.
The new environment variables make it easy to set them for all new
connections. The existing configuration options for these settings stay
in place and take precedence over the environment variables.
If the environment variables are unset or invalidly formatted, we fall
back to the previous defaults.

Signed-off-by: Felix Huettner 
---
 NEWS|  7 +
 lib/jsonrpc.c   |  4 +--
 lib/reconnect.c | 63 +
 lib/reconnect.h |  7 ++---
 ovsdb/jsonrpc-server.c  |  4 +--
 ovsdb/ovsdb-server.c|  2 +-
 ovsdb/relay.h   |  2 --
 python/ovs/reconnect.py | 45 +++--
 8 files changed, 114 insertions(+), 20 deletions(-)

diff --git a/NEWS b/NEWS
index 6b45492f1..58266ae52 100644
--- a/NEWS
+++ b/NEWS
@@ -6,6 +6,13 @@ Post-v3.2.0
from older version is supported but it may trigger more leader elections
during the process, and error logs complaining unrecognized fields may
be observed on old nodes.
+   - The environment variables OVS_RECONNECT_MIN_BACKOFF,
+ OVS_RECONNECT_MAX_BACKOFF and OVS_RECONNECT_PROBE_INTERVAL, if set, allow
+ users to tune the default reconnect configuration. E.g. adapting
+ OVS_RECONNECT_PROBE_INTERVAL modifies the default probe interval for all
+ servers and clients. These variables are valid for all ovs code as well as
+ the python client. Values set for individual connections (e.g. using the
+ `inactivity_probe` column in the `Manager` tables) will take precedence.


 v3.2.0 - 17 Aug 2023
diff --git a/lib/jsonrpc.c b/lib/jsonrpc.c
index c8ce5362e..7bf49fc81 100644
--- a/lib/jsonrpc.c
+++ b/lib/jsonrpc.c
@@ -1269,8 +1269,8 @@ void
 jsonrpc_session_enable_reconnect(struct jsonrpc_session *s)
 {
 reconnect_set_max_tries(s->reconnect, UINT_MAX);
-reconnect_set_backoff(s->reconnect, RECONNECT_DEFAULT_MIN_BACKOFF,
-  RECONNECT_DEFAULT_MAX_BACKOFF);
+reconnect_set_backoff(s->reconnect, reconnect_default_min_backoff(),
+  reconnect_default_max_backoff());
 }

 /* Forces 's' to drop its connection (if any) and reconnect. */
diff --git a/lib/reconnect.c b/lib/reconnect.c
index 89a0bcaf9..b45516589 100644
--- a/lib/reconnect.c
+++ b/lib/reconnect.c
@@ -99,9 +99,9 @@ reconnect_create(long long int now)
 struct reconnect *fsm = xzalloc(sizeof *fsm);

 fsm->name = xstrdup("void");
-fsm->min_backoff = RECONNECT_DEFAULT_MIN_BACKOFF;
-fsm->max_backoff = RECONNECT_DEFAULT_MAX_BACKOFF;
-fsm->probe_interval = RECONNECT_DEFAULT_PROBE_INTERVAL;
+fsm->min_backoff = reconnect_default_min_backoff();
+fsm->max_backoff = reconnect_default_max_backoff();
+fsm->probe_interval = reconnect_default_probe_interval();
 fsm->passive = false;
 fsm->info = VLL_INFO;

@@ -163,7 +163,7 @@ reconnect_set_name(struct reconnect *fsm, const char *name)
 }

 /* Return the minimum number of milliseconds to back off between consecutive
- * connection attempts.  The default is RECONNECT_DEFAULT_MIN_BACKOFF. */
+ * connection attempts.  The default is reconnect_default_min_backoff(). */
 int
 reconnect_get_min_backoff(const struct reconnect *fsm)
 {
@@ -171,7 +171,7 @@ reconnect_get_min_backoff(const struct reconnect *fsm)
 }

 /* Return the maximum number of milliseconds to back off between consecutive
- * connection attempts.  The default is RECONNECT_DEFAULT_MAX_BACKOFF. */
+ * connection attempts.  The default is reconnect_default_max_backoff(). */
 int
 reconnect_get_max_backoff(const struct reconnect *fsm)
 {
@@ -190,6 +190,57 @@ reconnect_get_probe_interval(const struct reconnect *fsm)
 return fsm->probe_interval;
 }

+/* Returns the default min_backoff value for reconnect. It uses the environment
+ * variable OVS_RECONNECT_MIN_BACKOFF if set and valid or otherwise defaults
+ * to 1 second. The return value is in ms. */
+int reconnect_default_min_backoff() {
+static int default_min_backoff = 0;
+if (default_min_backoff == 0) {
+char *env = getenv("OVS_RECONNECT_MIN_BACKOFF");
+if (env && env[0]) {
+str_to_uint(env, 10, &default_min_backoff);
+}
+if (default_min_backoff == 0) {
+default_min_backoff = 1000;
+}
+}
+return default_min_backoff;
+}
+
+/* Returns the default max_backoff value for reconnect. It uses the environment
+ * variable OVS_RECONNECT_MAX_BACKOFF if set and valid or otherwise defaults
+ * to 8 seconds. The return value is in ms. */
+int reconnect_default_max_backoff() {
+static int 

[ovs-dev] [PATCH ovn v3 2/2] ovn-ic: support learning routes in same AZ

2023-09-13 Thread Felix Huettner via dev
When connecting multiple logical routers to a transit switch per AZ,
previously the routers in the same AZ would not learn each other's
routes, while the routers in the other AZs would learn all of them.

This is confusing and would require each user to add additional logic
that configures static routing within each AZ.

Acked-by: Mark Michelson 
Acked-by: Ales Musil 
Co-Authored-By: Maxim Korezkij 
Signed-off-by: Maxim Korezkij 
Signed-off-by: Felix Huettner 
---
 ic/ovn-ic.c | 48 
 tests/ovn-ic.at |  2 ++
 2 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/ic/ovn-ic.c b/ic/ovn-ic.c
index 2df1235dc..e2023c2ba 100644
--- a/ic/ovn-ic.c
+++ b/ic/ovn-ic.c
@@ -871,6 +871,8 @@ struct ic_route_info {
 const char *origin;
 const char *route_table;

+const struct nbrec_logical_router *nb_lr;
+
 /* Either nb_route or nb_lrp is set and the other one must be NULL.
  * - For a route that is learned from IC-SB, or a static route that is
  *   generated from a route that is configured in NB, the "nb_route"
@@ -947,7 +949,8 @@ parse_route(const char *s_prefix, const char *s_nexthop,
 /* Return false if can't be added due to bad format. */
 static bool
 add_to_routes_learned(struct hmap *routes_learned,
-  const struct nbrec_logical_router_static_route *nb_route)
+  const struct nbrec_logical_router_static_route *nb_route,
+  const struct nbrec_logical_router *nb_lr)
 {
 struct in6_addr prefix, nexthop;
 unsigned int plen;
@@ -969,6 +972,7 @@ add_to_routes_learned(struct hmap *routes_learned,
 ic_route->nb_route = nb_route;
 ic_route->origin = origin;
 ic_route->route_table = nb_route->route_table;
+ic_route->nb_lr = nb_lr;
 hmap_insert(routes_learned, &ic_route->node,
 ic_route_hash(&prefix, plen, &nexthop, origin,
   nb_route->route_table));
@@ -1109,7 +1113,8 @@ add_to_routes_ad(struct hmap *routes_ad, const struct 
in6_addr prefix,
  unsigned int plen, const struct in6_addr nexthop,
  const char *origin, const char *route_table,
  const struct nbrec_logical_router_port *nb_lrp,
- const struct nbrec_logical_router_static_route *nb_route)
+ const struct nbrec_logical_router_static_route *nb_route,
+ const struct nbrec_logical_router *nb_lr)
 {
 if (route_table == NULL) {
 route_table = "";
@@ -1127,6 +1132,7 @@ add_to_routes_ad(struct hmap *routes_ad, const struct 
in6_addr prefix,
 ic_route->origin = origin;
 ic_route->route_table = route_table;
 ic_route->nb_lrp = nb_lrp;
+ic_route->nb_lr = nb_lr;
 hmap_insert(routes_ad, &ic_route->node, hash);
 } else {
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
@@ -1140,6 +1146,7 @@ static void
 add_static_to_routes_ad(
 struct hmap *routes_ad,
 const struct nbrec_logical_router_static_route *nb_route,
+const struct nbrec_logical_router *nb_lr,
 const struct lport_addresses *nexthop_addresses,
 const struct smap *nb_options)
 {
@@ -1182,14 +1189,15 @@ add_static_to_routes_ad(
 }

 add_to_routes_ad(routes_ad, prefix, plen, nexthop, ROUTE_ORIGIN_STATIC,
- nb_route->route_table, NULL, nb_route);
+ nb_route->route_table, NULL, nb_route, nb_lr);
 }

 static void
 add_network_to_routes_ad(struct hmap *routes_ad, const char *network,
  const struct nbrec_logical_router_port *nb_lrp,
  const struct lport_addresses *nexthop_addresses,
- const struct smap *nb_options)
+ const struct smap *nb_options,
+ const struct nbrec_logical_router *nb_lr)
 {
 struct in6_addr prefix, nexthop;
 unsigned int plen;
@@ -1228,7 +1236,7 @@ add_network_to_routes_ad(struct hmap *routes_ad, const 
char *network,

 /* directly-connected routes go to <main> route table */
 add_to_routes_ad(routes_ad, prefix, plen, nexthop, ROUTE_ORIGIN_CONNECTED,
- NULL, nb_lrp, NULL);
+ NULL, nb_lrp, NULL, nb_lr);
 }

 static bool
@@ -1332,7 +1340,6 @@ lrp_is_ts_port(struct ic_context *ctx, struct 
ic_router_info *ic_lr,

 static void
 sync_learned_routes(struct ic_context *ctx,
-const struct icsbrec_availability_zone *az,
 struct ic_router_info *ic_lr)
 {
 ovs_assert(ctx->ovnnb_txn);
@@ -1355,7 +1362,15 @@ sync_learned_routes(struct ic_context *ctx,

 ICSBREC_ROUTE_FOR_EACH_EQUAL (isb_route, isb_route_key,
   ctx->icsbrec_route_by_ts) {
-if (isb_route->availability_zone == az) {
+const char *lr_id = smap_get(&isb_route->external_ids, "lr-id");
+if (lr_id == NULL) {
+continue;
+  
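
(The message is truncated here.  Judging from the new external_ids
"lr-id" key, the check that replaces the per-AZ comparison plausibly
continues like this -- a reconstruction with assumed field names:)

    struct uuid lr_uuid;
    if (!uuid_from_string(&lr_uuid, lr_id)) {
        continue;
    }
    /* Skip only routes this router advertised itself, instead of
     * skipping everything from the local availability zone. */
    if (uuid_equals(&ic_lr->lr->header_.uuid, &lr_uuid)) {
        continue;
    }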

[ovs-dev] [PATCH ovn v3 1/2] ovn-ic fix multiple routers in an az

2023-09-13 Thread Felix Huettner via dev
Previously, if multiple routers in the same AZ are connected to the same
transit switch, ovn-ic would only propagate the routes of one of
these routers to the ic-sb.
This commit fixes this behaviour and allows multiple routers in a single
AZ to use route advertisements.

Co-authored-by: Maxim Korezkij 
Signed-off-by: Maxim Korezkij 
Signed-off-by: Felix Huettner 
---
 ic/ovn-ic.c | 27 +++-
 tests/ovn-ic.at | 86 +
 2 files changed, 105 insertions(+), 8 deletions(-)

diff --git a/ic/ovn-ic.c b/ic/ovn-ic.c
index 11b533981..2df1235dc 100644
--- a/ic/ovn-ic.c
+++ b/ic/ovn-ic.c
@@ -1592,9 +1592,9 @@ build_ts_routes_to_adv(struct ic_context *ctx,
 }

 static void
-advertise_lr_routes(struct ic_context *ctx,
-const struct icsbrec_availability_zone *az,
-struct ic_router_info *ic_lr)
+collect_lr_routes(struct ic_context *ctx,
+  struct ic_router_info *ic_lr,
+  struct shash *routes_ad_by_ts)
 {
 const struct nbrec_nb_global *nb_global =
 nbrec_nb_global_first(ctx->ovnnb_idl);
@@ -1605,7 +1605,7 @@ advertise_lr_routes(struct ic_context *ctx,
 struct lport_addresses ts_port_addrs;
 const struct icnbrec_transit_switch *key;

-struct hmap routes_ad = HMAP_INITIALIZER(&routes_ad);
+struct hmap *routes_ad;
 for (int i = 0; i < ic_lr->n_isb_pbs; i++) {
 isb_pb = ic_lr->isb_pbs[i];
 key = icnbrec_transit_switch_index_init_row(
@@ -1614,6 +1614,12 @@ advertise_lr_routes(struct ic_context *ctx,
 ts_name = icnbrec_transit_switch_index_find(
 ctx->icnbrec_transit_switch_by_name, key)->name;
 icnbrec_transit_switch_index_destroy_row(key);
+routes_ad = shash_find_data(routes_ad_by_ts, ts_name);
+if (!routes_ad) {
+routes_ad = xzalloc(sizeof *routes_ad);
+hmap_init(routes_ad);
+shash_add(routes_ad_by_ts, ts_name, routes_ad);
+}

 if (!extract_lsp_addresses(isb_pb->address, &ts_port_addrs)) {
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
@@ -1625,12 +1631,10 @@ advertise_lr_routes(struct ic_context *ctx,
 }
 lrp_name = get_lrp_name_by_ts_port_name(ctx, isb_pb->logical_port);
 route_table = get_route_table_by_lrp_name(ctx, lrp_name);
-build_ts_routes_to_adv(ctx, ic_lr, &routes_ad, &ts_port_addrs,
+build_ts_routes_to_adv(ctx, ic_lr, routes_ad, &ts_port_addrs,
 nb_global, route_table);
-advertise_routes(ctx, az, ts_name, &routes_ad);
 destroy_lport_addresses(&ts_port_addrs);
 }
-hmap_destroy(&routes_ad);
 }

 static void
@@ -1731,14 +1735,21 @@ route_run(struct ic_context *ctx,
 icsbrec_port_binding_index_destroy_row(isb_pb_key);

 struct ic_router_info *ic_lr;
+struct shash routes_ad_by_ts = SHASH_INITIALIZER(&routes_ad_by_ts);
 HMAP_FOR_EACH_SAFE (ic_lr, node, &ic_lrs) {
-advertise_lr_routes(ctx, az, ic_lr);
+collect_lr_routes(ctx, ic_lr, &routes_ad_by_ts);
 sync_learned_routes(ctx, az, ic_lr);
 free(ic_lr->isb_pbs);
 hmap_destroy(&ic_lr->routes_learned);
 hmap_remove(&ic_lrs, &ic_lr->node);
 free(ic_lr);
 }
+struct shash_node *node;
+SHASH_FOR_EACH (node, &routes_ad_by_ts) {
+advertise_routes(ctx, az, node->name, node->data);
+hmap_destroy(node->data);
+}
+shash_destroy_free_data(&routes_ad_by_ts);
 hmap_destroy(&ic_lrs);
 }

diff --git a/tests/ovn-ic.at b/tests/ovn-ic.at
index 9a5f3e312..4b1c33c99 100644
--- a/tests/ovn-ic.at
+++ b/tests/ovn-ic.at
@@ -1166,3 +1166,89 @@ AT_CHECK([ovn_as az2 ovn-nbctl lr-route-list lr12 | grep 
dst-ip | sort], [0], [d

 AT_CLEANUP
 ])
+
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([ovn-ic -- route sync -- multiple logical routers])
+
+ovn_init_ic_db
+ovn-ic-nbctl ts-add ts1
+
+for i in 1 2; do
+ovn_start az$i
+ovn_as az$i
+
+# Enable route learning at AZ level
+ovn-nbctl set nb_global . options:ic-route-learn=true
+# Enable route advertising at AZ level
+ovn-nbctl set nb_global . options:ic-route-adv=true
+done
+
+# Create new transit switches and LRs. Test topology is next:
+#
+# logical router (lr11) - transit switch (ts1) - logical router (lr21)
+#  \- logical router (lr22)
+#
+# each LR has one connected subnet except TS port
+
+
+# create lr11, lr21, lr22, ts1 and connect them
+ovn-ic-nbctl ts-add ts1
+
+ovn_as az1
+
+lr=lr11
+ovn-nbctl lr-add $lr
+
+lrp=lrp-$lr-ts1
+lsp=lsp-ts1-$lr
+# Create LRP and connect to TS
+ovn-nbctl lrp-add $lr $lrp aa:aa:aa:aa:a1:01 169.254.10.11/24
+ovn-nbctl lsp-add ts1 $lsp \
+-- lsp-set-addresses $lsp router \
+-- lsp-set-type $lsp router \
+-- lsp-set-options $lsp router-port=$lrp
+
+ovn_as az2
+for i in 1 2; do
+lr=lr2$i
+ovn-nbctl lr-add $lr
+
+lrp=lrp-$lr-ts1
+lsp=lsp-ts1-$lr
+# Create LRP and connect to TS
+ovn-nbctl lrp-add 

Re: [ovs-dev] [PATCH ovn v2 1/2] ovn-ic fix multiple routers in an az

2023-09-13 Thread Felix Huettner via dev
Hi Ales,

Thanks for the feedback. That will all be addressed in v3.

Regards
Felix
On Tue, Sep 12, 2023 at 08:18:18AM +0200, Ales Musil wrote:
> On Wed, Aug 30, 2023 at 8:20 AM Felix Huettner via dev <
> ovs-dev@openvswitch.org> wrote:
>
> > previously if multiple routers in the same az are connected to the same
> > transit switch then ovn-ic would only propagate the routes of one of
> > these routers to the ic-sb.
> > This commit fixes this behaviour and allows multiple routers in a single
> > az to use route advertisements.
> >
> > Co-authored-by: Maxim Korezkij 
> > Signed-off-by: Maxim Korezkij 
> > Signed-off-by: Felix Huettner 
> >
>
> Hi Felix,
>
> I have a few minor comments, see below.
>
> ---
> >  ic/ovn-ic.c | 27 +++-
> >  tests/ovn-ic.at | 86 +
> >  2 files changed, 105 insertions(+), 8 deletions(-)
> >
> > diff --git a/ic/ovn-ic.c b/ic/ovn-ic.c
> > index db7e86bc1..ec749e25f 100644
> > --- a/ic/ovn-ic.c
> > +++ b/ic/ovn-ic.c
> > @@ -1587,9 +1587,9 @@ build_ts_routes_to_adv(struct ic_context *ctx,
> >  }
> >
> >  static void
> > -advertise_lr_routes(struct ic_context *ctx,
> > -const struct icsbrec_availability_zone *az,
> > -struct ic_router_info *ic_lr)
> > +collect_lr_routes(struct ic_context *ctx,
> > +struct ic_router_info *ic_lr,
> > +struct shash *routes_ad_by_ts)
> >
>
> The arguments are not properly aligned.
>
>
> >  {
> >  const struct nbrec_nb_global *nb_global =
> >  nbrec_nb_global_first(ctx->ovnnb_idl);
> > @@ -1600,7 +1600,7 @@ advertise_lr_routes(struct ic_context *ctx,
> >  struct lport_addresses ts_port_addrs;
> >  const struct icnbrec_transit_switch *key;
> >
> > -struct hmap routes_ad = HMAP_INITIALIZER(&routes_ad);
> > +struct hmap *routes_ad;
> >  for (int i = 0; i < ic_lr->n_isb_pbs; i++) {
> >  isb_pb = ic_lr->isb_pbs[i];
> >  key = icnbrec_transit_switch_index_init_row(
> > @@ -1609,6 +1609,12 @@ advertise_lr_routes(struct ic_context *ctx,
> >  ts_name = icnbrec_transit_switch_index_find(
> >  ctx->icnbrec_transit_switch_by_name, key)->name;
> >  icnbrec_transit_switch_index_destroy_row(key);
> > +routes_ad = shash_find_data(routes_ad_by_ts, ts_name);
> > +if (!routes_ad) {
> > +routes_ad = xzalloc(sizeof *routes_ad);
> > +hmap_init(routes_ad);
> > +shash_add(routes_ad_by_ts, ts_name, routes_ad);
> > +}
> >
> >  if (!extract_lsp_addresses(isb_pb->address, &ts_port_addrs)) {
> >  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
> > @@ -1620,12 +1626,10 @@ advertise_lr_routes(struct ic_context *ctx,
> >  }
> >  lrp_name = get_lrp_name_by_ts_port_name(ctx,
> > isb_pb->logical_port);
> >  route_table = get_route_table_by_lrp_name(ctx, lrp_name);
> > -build_ts_routes_to_adv(ctx, ic_lr, &routes_ad, &ts_port_addrs,
> > +build_ts_routes_to_adv(ctx, ic_lr, routes_ad, &ts_port_addrs,
> > nb_global, route_table);
> > -advertise_routes(ctx, az, ts_name, &routes_ad);
> >  destroy_lport_addresses(&ts_port_addrs);
> >  }
> > -hmap_destroy(&routes_ad);
> >  }
> >
> >  static void
> > @@ -1726,14 +1730,21 @@ route_run(struct ic_context *ctx,
> >  icsbrec_port_binding_index_destroy_row(isb_pb_key);
> >
> >  struct ic_router_info *ic_lr;
> > +struct shash routes_ad_by_ts = SHASH_INITIALIZER(&routes_ad_by_ts);
> >  HMAP_FOR_EACH_SAFE (ic_lr, node, &ic_lrs) {
> > -advertise_lr_routes(ctx, az, ic_lr);
> > +collect_lr_routes(ctx, ic_lr, &routes_ad_by_ts);
> >  sync_learned_routes(ctx, az, ic_lr);
> >  free(ic_lr->isb_pbs);
> >  hmap_destroy(&ic_lr->routes_learned);
> >  hmap_remove(&ic_lrs, &ic_lr->node);
> >  free(ic_lr);
> >  }
> > +struct shash_node *node;
> > +SHASH_FOR_EACH_SAFE (node, &routes_ad_by_ts) {
> >
>
> The SHASH iteration doesn't have to be SAFE, we are not removing anything
> from it during the iteration.
>
>
> > +advertise_routes(ctx, az, node->name, node->data);
> > +hmap_destroy(node->data);
> > +}
> > +shash_destroy_free_data(&routes_ad_by_ts);
> >  hmap_destroy(&ic_lrs);
> >  }
> >
>

[ovs-dev] [PATCH ovn v2 2/2] ovn-ic: support learning routes in same AZ

2023-08-30 Thread Felix Huettner via dev
When connecting multiple logical routers to a transit switch per AZ,
previously the routers in the same AZ would not learn each other's
routes, while the routers in the other AZs would learn all of them.

This is confusing and would require each user to add additional logic
that configures static routing within each AZ.

Acked-by: Mark Michelson 
Co-Authored-By: Maxim Korezkij 
Signed-off-by: Maxim Korezkij 
Signed-off-by: Felix Huettner 
---
 ic/ovn-ic.c | 48 
 tests/ovn-ic.at |  2 ++
 2 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/ic/ovn-ic.c b/ic/ovn-ic.c
index ec749e25f..0109afe40 100644
--- a/ic/ovn-ic.c
+++ b/ic/ovn-ic.c
@@ -866,6 +866,8 @@ struct ic_route_info {
 const char *origin;
 const char *route_table;

+const struct nbrec_logical_router *nb_lr;
+
 /* Either nb_route or nb_lrp is set and the other one must be NULL.
  * - For a route that is learned from IC-SB, or a static route that is
  *   generated from a route that is configured in NB, the "nb_route"
@@ -942,7 +944,8 @@ parse_route(const char *s_prefix, const char *s_nexthop,
 /* Return false if can't be added due to bad format. */
 static bool
 add_to_routes_learned(struct hmap *routes_learned,
-  const struct nbrec_logical_router_static_route *nb_route)
+  const struct nbrec_logical_router_static_route *nb_route,
+  const struct nbrec_logical_router *nb_lr)
 {
 struct in6_addr prefix, nexthop;
 unsigned int plen;
@@ -964,6 +967,7 @@ add_to_routes_learned(struct hmap *routes_learned,
 ic_route->nb_route = nb_route;
 ic_route->origin = origin;
 ic_route->route_table = nb_route->route_table;
+ic_route->nb_lr = nb_lr;
 hmap_insert(routes_learned, &ic_route->node,
 ic_route_hash(&prefix, plen, &nexthop, origin,
   nb_route->route_table));
@@ -1104,7 +1108,8 @@ add_to_routes_ad(struct hmap *routes_ad, const struct 
in6_addr prefix,
  unsigned int plen, const struct in6_addr nexthop,
  const char *origin, const char *route_table,
  const struct nbrec_logical_router_port *nb_lrp,
- const struct nbrec_logical_router_static_route *nb_route)
+ const struct nbrec_logical_router_static_route *nb_route,
+ const struct nbrec_logical_router *nb_lr)
 {
 if (route_table == NULL) {
 route_table = "";
@@ -1122,6 +1127,7 @@ add_to_routes_ad(struct hmap *routes_ad, const struct 
in6_addr prefix,
 ic_route->origin = origin;
 ic_route->route_table = route_table;
 ic_route->nb_lrp = nb_lrp;
+ic_route->nb_lr = nb_lr;
 hmap_insert(routes_ad, &ic_route->node, hash);
 } else {
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
@@ -1135,6 +1141,7 @@ static void
 add_static_to_routes_ad(
 struct hmap *routes_ad,
 const struct nbrec_logical_router_static_route *nb_route,
+const struct nbrec_logical_router *nb_lr,
 const struct lport_addresses *nexthop_addresses,
 const struct smap *nb_options)
 {
@@ -1177,14 +1184,15 @@ add_static_to_routes_ad(
 }

 add_to_routes_ad(routes_ad, prefix, plen, nexthop, ROUTE_ORIGIN_STATIC,
- nb_route->route_table, NULL, nb_route);
+ nb_route->route_table, NULL, nb_route, nb_lr);
 }

 static void
 add_network_to_routes_ad(struct hmap *routes_ad, const char *network,
  const struct nbrec_logical_router_port *nb_lrp,
  const struct lport_addresses *nexthop_addresses,
- const struct smap *nb_options)
+ const struct smap *nb_options,
+ const struct nbrec_logical_router *nb_lr)
 {
 struct in6_addr prefix, nexthop;
 unsigned int plen;
@@ -1223,7 +1231,7 @@ add_network_to_routes_ad(struct hmap *routes_ad, const 
char *network,

 /* directly-connected routes go to <main> route table */
 add_to_routes_ad(routes_ad, prefix, plen, nexthop, ROUTE_ORIGIN_CONNECTED,
- NULL, nb_lrp, NULL);
+ NULL, nb_lrp, NULL, nb_lr);
 }

 static bool
@@ -1327,7 +1335,6 @@ lrp_is_ts_port(struct ic_context *ctx, struct 
ic_router_info *ic_lr,

 static void
 sync_learned_routes(struct ic_context *ctx,
-const struct icsbrec_availability_zone *az,
 struct ic_router_info *ic_lr)
 {
 ovs_assert(ctx->ovnnb_txn);
@@ -1350,7 +1357,15 @@ sync_learned_routes(struct ic_context *ctx,

 ICSBREC_ROUTE_FOR_EACH_EQUAL (isb_route, isb_route_key,
   ctx->icsbrec_route_by_ts) {
-if (isb_route->availability_zone == az) {
+const char *lr_id = smap_get(&isb_route->external_ids, "lr-id");
+if (lr_id == NULL) {
+continue;
+}
+   

[ovs-dev] [PATCH ovn v2 1/2] ovn-ic fix multiple routers in an az

2023-08-30 Thread Felix Huettner via dev
Previously, if multiple routers in the same AZ are connected to the same
transit switch, ovn-ic would only propagate the routes of one of
these routers to the ic-sb.
This commit fixes this behaviour and allows multiple routers in a single
AZ to use route advertisements.

Co-authored-by: Maxim Korezkij 
Signed-off-by: Maxim Korezkij 
Signed-off-by: Felix Huettner 
---
 ic/ovn-ic.c | 27 +++-
 tests/ovn-ic.at | 86 +
 2 files changed, 105 insertions(+), 8 deletions(-)

diff --git a/ic/ovn-ic.c b/ic/ovn-ic.c
index db7e86bc1..ec749e25f 100644
--- a/ic/ovn-ic.c
+++ b/ic/ovn-ic.c
@@ -1587,9 +1587,9 @@ build_ts_routes_to_adv(struct ic_context *ctx,
 }

 static void
-advertise_lr_routes(struct ic_context *ctx,
-const struct icsbrec_availability_zone *az,
-struct ic_router_info *ic_lr)
+collect_lr_routes(struct ic_context *ctx,
+struct ic_router_info *ic_lr,
+struct shash *routes_ad_by_ts)
 {
 const struct nbrec_nb_global *nb_global =
 nbrec_nb_global_first(ctx->ovnnb_idl);
@@ -1600,7 +1600,7 @@ advertise_lr_routes(struct ic_context *ctx,
 struct lport_addresses ts_port_addrs;
 const struct icnbrec_transit_switch *key;

-struct hmap routes_ad = HMAP_INITIALIZER(&routes_ad);
+struct hmap *routes_ad;
 for (int i = 0; i < ic_lr->n_isb_pbs; i++) {
 isb_pb = ic_lr->isb_pbs[i];
 key = icnbrec_transit_switch_index_init_row(
@@ -1609,6 +1609,12 @@ advertise_lr_routes(struct ic_context *ctx,
 ts_name = icnbrec_transit_switch_index_find(
 ctx->icnbrec_transit_switch_by_name, key)->name;
 icnbrec_transit_switch_index_destroy_row(key);
+routes_ad = shash_find_data(routes_ad_by_ts, ts_name);
+if (!routes_ad) {
+routes_ad = xzalloc(sizeof *routes_ad);
+hmap_init(routes_ad);
+shash_add(routes_ad_by_ts, ts_name, routes_ad);
+}

 if (!extract_lsp_addresses(isb_pb->address, &ts_port_addrs)) {
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
@@ -1620,12 +1626,10 @@ advertise_lr_routes(struct ic_context *ctx,
 }
 lrp_name = get_lrp_name_by_ts_port_name(ctx, isb_pb->logical_port);
 route_table = get_route_table_by_lrp_name(ctx, lrp_name);
-build_ts_routes_to_adv(ctx, ic_lr, &routes_ad, &ts_port_addrs,
+build_ts_routes_to_adv(ctx, ic_lr, routes_ad, &ts_port_addrs,
 nb_global, route_table);
-advertise_routes(ctx, az, ts_name, &routes_ad);
 destroy_lport_addresses(&ts_port_addrs);
 }
-hmap_destroy(&routes_ad);
 }

 static void
@@ -1726,14 +1730,21 @@ route_run(struct ic_context *ctx,
 icsbrec_port_binding_index_destroy_row(isb_pb_key);

 struct ic_router_info *ic_lr;
+struct shash routes_ad_by_ts = SHASH_INITIALIZER(&routes_ad_by_ts);
 HMAP_FOR_EACH_SAFE (ic_lr, node, &ic_lrs) {
-advertise_lr_routes(ctx, az, ic_lr);
+collect_lr_routes(ctx, ic_lr, &routes_ad_by_ts);
 sync_learned_routes(ctx, az, ic_lr);
 free(ic_lr->isb_pbs);
 hmap_destroy(&ic_lr->routes_learned);
 hmap_remove(&ic_lrs, &ic_lr->node);
 free(ic_lr);
 }
+struct shash_node *node;
+SHASH_FOR_EACH_SAFE (node, &routes_ad_by_ts) {
+advertise_routes(ctx, az, node->name, node->data);
+hmap_destroy(node->data);
+}
+shash_destroy_free_data(&routes_ad_by_ts);
 hmap_destroy(&ic_lrs);
 }

diff --git a/tests/ovn-ic.at b/tests/ovn-ic.at
index a654e59fe..8ef2362c4 100644
--- a/tests/ovn-ic.at
+++ b/tests/ovn-ic.at
@@ -1164,3 +1164,89 @@ AT_CHECK([ovn_as az2 ovn-nbctl lr-route-list lr12 | grep 
dst-ip | sort], [0], [d

 AT_CLEANUP
 ])
+
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([ovn-ic -- route sync -- multiple logical routers])
+
+ovn_init_ic_db
+ovn-ic-nbctl ts-add ts1
+
+for i in 1 2; do
+ovn_start az$i
+ovn_as az$i
+
+# Enable route learning at AZ level
+ovn-nbctl set nb_global . options:ic-route-learn=true
+# Enable route advertising at AZ level
+ovn-nbctl set nb_global . options:ic-route-adv=true
+done
+
+# Create new transit switches and LRs. Test topology is next:
+#
+# logical router (lr11) - transit switch (ts1) - logical router (lr21)
+#  \- logical router (lr22)
+#
+# each LR has one connected subnet except TS port
+
+
+# create lr11, lr21, lr22, ts1 and connect them
+ovn-ic-nbctl ts-add ts1
+
+ovn_as az1
+
+lr=lr11
+ovn-nbctl lr-add $lr
+
+lrp=lrp-$lr-ts1
+lsp=lsp-ts1-$lr
+# Create LRP and connect to TS
+ovn-nbctl lrp-add $lr $lrp aa:aa:aa:aa:a1:01 169.254.10.11/24
+ovn-nbctl lsp-add ts1 $lsp \
+-- lsp-set-addresses $lsp router \
+-- lsp-set-type $lsp router \
+-- lsp-set-options $lsp router-port=$lrp
+
+ovn_as az2
+for i in 1 2; do
+lr=lr2$i
+ovn-nbctl lr-add $lr
+
+lrp=lrp-$lr-ts1
+lsp=lsp-ts1-$lr
+# Create LRP and connect to TS
+ovn-nbctl 

Re: [ovs-dev] [PATCH ovn 2/2] ovn-ic: support learning routes in same AZ

2023-08-29 Thread Felix Huettner via dev
Thank you for the feedback,

The route learning feature needs to be explicitly enabled [1]. This is
currently a global setting, so if we want to limit this to individual
routers, I would rather make this a general ovn-ic feature than
include it here.

Thanks
Felix

[1] 
https://docs.ovn.org/en/latest/tutorials/ovn-interconnection.html#route-advertisement
On Wed, Aug 02, 2023 at 03:13:01PM -0400, Mark Michelson wrote:
> Looks good to me.
>
> Acked-by: Mark Michelson 
>
> Having said that, I'd like for some OVN devs that are more familiar with the
> ins and outs of ovn-ic to make the determination if this is behavior that we
> want or not. For non-IC OVN, routers learn routes to their neighbor routers
> automatically, unless the "dynamic_neigh_routers" option is enabled. For
> ovn-ic, do we have a way to ensure that routers do not learn neighbor routes
> if they do not want to?
>
> On 8/1/23 03:22, maximkorezkij via dev wrote:
> > when connecting multiple logical routers to a transit switch per az then
> > previously the routers in the same az would not learn each others
> > routes while the routers in the others az would learn all of them.
> >
> > As this is confusing and would require each user to have additional
> > logic that configures static routing within each az.
> >
> > Signed-off-by: Maxim Korezkij 
> > Signed-off-by: Felix Huettner 
> > ---
> >   ic/ovn-ic.c | 52 -
> >   tests/ovn-ic.at |  2 ++
> >   2 files changed, 40 insertions(+), 14 deletions(-)
> >
> > diff --git a/ic/ovn-ic.c b/ic/ovn-ic.c
> > index 49850954f..cea14d008 100644
> > --- a/ic/ovn-ic.c
> > +++ b/ic/ovn-ic.c
> > @@ -861,6 +861,8 @@ struct ic_route_info {
> >   const char *origin;
> >   const char *route_table;
> >
> > +const struct nbrec_logical_router *nb_lr;
> > +
> >   /* Either nb_route or nb_lrp is set and the other one must be NULL.
> >* - For a route that is learned from IC-SB, or a static route that is
> >*   generated from a route that is configured in NB, the "nb_route"
> > @@ -937,7 +939,8 @@ parse_route(const char *s_prefix, const char *s_nexthop,
> >   /* Return false if can't be added due to bad format. */
> >   static bool
> >   add_to_routes_learned(struct hmap *routes_learned,
> > -  const struct nbrec_logical_router_static_route 
> > *nb_route)
> > +  const struct nbrec_logical_router_static_route 
> > *nb_route,
> > +  const struct nbrec_logical_router *nb_lr)
> >   {
> >   struct in6_addr prefix, nexthop;
> >   unsigned int plen;
> > @@ -959,6 +962,7 @@ add_to_routes_learned(struct hmap *routes_learned,
> >   ic_route->nb_route = nb_route;
> >   ic_route->origin = origin;
> >   ic_route->route_table = nb_route->route_table;
> > +ic_route->nb_lr = nb_lr;
> >   hmap_insert(routes_learned, &ic_route->node,
> >   ic_route_hash(&prefix, plen, &nexthop, origin,
> > nb_route->route_table));
> > @@ -1099,7 +1103,8 @@ add_to_routes_ad(struct hmap *routes_ad, const struct 
> > in6_addr prefix,
> >unsigned int plen, const struct in6_addr nexthop,
> >const char *origin, const char *route_table,
> >const struct nbrec_logical_router_port *nb_lrp,
> > - const struct nbrec_logical_router_static_route *nb_route)
> > + const struct nbrec_logical_router_static_route *nb_route,
> > + const struct nbrec_logical_router *nb_lr)
> >   {
> >   if (route_table == NULL) {
> >   route_table = "";
> > @@ -1117,6 +1122,7 @@ add_to_routes_ad(struct hmap *routes_ad, const struct 
> > in6_addr prefix,
> >   ic_route->origin = origin;
> >   ic_route->route_table = route_table;
> >   ic_route->nb_lrp = nb_lrp;
> > +ic_route->nb_lr = nb_lr;
> >   hmap_insert(routes_ad, &ic_route->node, hash);
> >   } else {
> >   static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
> > @@ -1130,6 +1136,7 @@ static void
> >   add_static_to_routes_ad(
> >   struct hmap *routes_ad,
> >   const struct nbrec_logical_router_static_route *nb_route,
> > +const struct nbrec_logical_router *nb_lr,
> >   const struct lport_addresses *nexthop_addresses,
> >   const struct smap *nb_options)
> >   {
> > @@ -1172,14 +1179,15 @@ add_static_to_routes_ad(
> >   }
> >
> >   add_to_routes_ad(routes_ad, prefix, plen, nexthop, 
> > ROUTE_ORIGIN_STATIC,
> > - nb_route->route_table, NULL, nb_route);
> > + nb_route->route_table, NULL, nb_route, nb_lr);
> >   }
> >
> >   static void
> >   add_network_to_routes_ad(struct hmap *routes_ad, const char *network,
> >const struct nbrec_logical_router_port *nb_lrp,
> >const struct lport_addresses *nexthop_addresses,
> > -  

[ovs-dev] [PATCH ovn] fix missing documentation of ovn-ic arguments

2023-07-24 Thread Felix Huettner via dev
The arguments for the interconnect databases were missing.

Signed-off-by: Felix Huettner 
---
 ic/ovn-ic.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/ic/ovn-ic.c b/ic/ovn-ic.c
index 6f31037ec..db7e86bc1 100644
--- a/ic/ovn-ic.c
+++ b/ic/ovn-ic.c
@@ -106,11 +106,16 @@ Options:\n\
 (default: %s)\n\
   --ovnsb-db=DATABASE   connect to ovn-sb database at DATABASE\n\
 (default: %s)\n\
+  --ic-nb-db=DATABASE   connect to ovn-ic-nb database at DATABASE\n\
+(default: %s)\n\
+  --ic-sb-db=DATABASE   connect to ovn-ic-sb database at DATABASE\n\
+(default: %s)\n\
   --unixctl=SOCKET  override default control socket name\n\
   -h, --helpdisplay this help message\n\
   -o, --options list available options\n\
   -V, --version display version information\n\
-", program_name, program_name, default_nb_db(), default_sb_db());
+", program_name, program_name, default_nb_db(), default_sb_db(),
+default_ic_nb_db(), default_ic_sb_db());
 daemon_usage();
 vlog_usage();
 stream_usage("database", true, true, false);
--
2.41.0



[ovs-dev] [PATCH v3] relay: allow setting probe interval

2023-07-17 Thread Felix Huettner via dev
Previously it was not possible to set the probe interval for the
connection from a relay to the backing ovsdb-server. With this change it
is now possible via the
`ovsdb-server/set-relay-source-probe-interval` command.

Reviewed-by: Simon Horman 
Signed-off-by: Felix Huettner 
---
v3:
  - use a separate command
  - stay with the default of 5sec

 NEWS |  3 +++
 ovsdb/ovsdb-server.c | 30 +-
 ovsdb/relay.c| 15 ++-
 ovsdb/relay.h|  6 +-
 4 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/NEWS b/NEWS
index bda41ad4c..3e90c879b 100644
--- a/NEWS
+++ b/NEWS
@@ -10,6 +10,9 @@ Post-v3.1.0
conversion operation is present.  For the cluster service model follow
upgrade instructions in 'Upgrading from version 3.1 and earlier to 3.2
and later' section of ovsdb(7).
+ * When ovsdb-server is running in relay mode, the probe interval is
+   configurable via
+   `ovs-appctl ovsdb-server/set-relay-source-probe-interval`.
- IPFIX template and statistics intervals can now be configured through two
  new options in the IPFIX table: 'template_interval' and 'stats_interval'.
- Linux kernel datapath:
diff --git a/ovsdb/ovsdb-server.c b/ovsdb/ovsdb-server.c
index 9bad0c8dd..cef12ca38 100644
--- a/ovsdb/ovsdb-server.c
+++ b/ovsdb/ovsdb-server.c
@@ -94,6 +94,7 @@ static unixctl_cb_func ovsdb_server_get_active_ovsdb_server;
 static unixctl_cb_func ovsdb_server_connect_active_ovsdb_server;
 static unixctl_cb_func ovsdb_server_disconnect_active_ovsdb_server;
 static unixctl_cb_func ovsdb_server_set_active_ovsdb_server_probe_interval;
+static unixctl_cb_func ovsdb_server_set_relay_source_interval;
 static unixctl_cb_func ovsdb_server_set_sync_exclude_tables;
 static unixctl_cb_func ovsdb_server_get_sync_exclude_tables;
 static unixctl_cb_func ovsdb_server_get_sync_status;
@@ -107,6 +108,7 @@ struct server_config {
 char **sync_exclude;
 bool *is_backup;
 int *replication_probe_interval;
+int *relay_source_probe_interval;
 struct ovsdb_jsonrpc_server *jsonrpc;
 };
 static unixctl_cb_func ovsdb_server_add_remote;
@@ -328,6 +330,7 @@ main(int argc, char *argv[])
 struct shash all_dbs;
 struct shash_node *node;
 int replication_probe_interval = REPLICATION_DEFAULT_PROBE_INTERVAL;
+int relay_source_probe_interval = RELAY_SOURCE_DEFAULT_PROBE_INTERVAL;

 ovs_cmdl_proctitle_init(argc, argv);
 set_program_name(argv[0]);
@@ -377,6 +380,7 @@ main(int argc, char *argv[])
 server_config.sync_exclude = &sync_exclude;
 server_config.is_backup = &is_backup;
 server_config.replication_probe_interval = &replication_probe_interval;
+server_config.relay_source_probe_interval = &relay_source_probe_interval;

 perf_counters_init();

@@ -472,6 +476,9 @@ main(int argc, char *argv[])
 unixctl_command_register(
 "ovsdb-server/set-active-ovsdb-server-probe-interval", "", 1, 1,
 ovsdb_server_set_active_ovsdb_server_probe_interval, &server_config);
+unixctl_command_register(
+"ovsdb-server/set-relay-source-probe-interval", "", 1, 1,
+ovsdb_server_set_relay_source_interval, &server_config);
 unixctl_command_register("ovsdb-server/set-sync-exclude-tables", "",
  0, 1, ovsdb_server_set_sync_exclude_tables,
  &server_config);
@@ -797,7 +804,8 @@ open_db(struct server_config *config, const char *filename)
 add_db(config, db);

 if (is_relay) {
-ovsdb_relay_add_db(db->db, relay_remotes, update_schema, config);
+ovsdb_relay_add_db(db->db, relay_remotes, update_schema, config,
+   *config->relay_source_probe_interval);
 }
 return NULL;
 }
@@ -1480,6 +1488,26 @@ 
ovsdb_server_set_active_ovsdb_server_probe_interval(struct unixctl_conn *conn,
 }
 }

+static void
+ovsdb_server_set_relay_source_interval(struct unixctl_conn *conn,
+   int argc OVS_UNUSED,
+   const char *argv[],
+   void *config_)
+{
+struct server_config *config = config_;
+
+int probe_interval;
+if (str_to_int(argv[1], 10, &probe_interval)) {
+*config->relay_source_probe_interval = probe_interval;
+save_config(config);
+ovsdb_relay_set_probe_interval(probe_interval);
+unixctl_command_reply(conn, NULL);
+} else {
+unixctl_command_reply(
+conn, "Invalid probe interval, integer value expected");
+}
+}
+
 static void
 ovsdb_server_set_sync_exclude_tables(struct unixctl_conn *conn,
  int argc OVS_UNUSED,
diff --git a/ovsdb/relay.c b/ovsdb/relay.c
index 377f3285f..21720f62f 100644
--- a/ovsdb/relay.c
+++ b/ovsdb/relay.c
@@ -127,7 +127,7 @@ static struct ovsdb_cs_ops relay_cs_ops = {
 void
 ovsdb_relay_add_db(struct ovsdb *db, const char *remote,
schema_change_callback 
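
(The diff is cut off here.  The remaining plumbing presumably threads
the probe interval into each relay's client session and lets the new
appctl command update it at runtime -- a sketch with assumed names such
as ovsdb_cs_set_probe_interval:)

    /* Apply a new probe interval to every active relay source session. */
    void
    ovsdb_relay_set_probe_interval(int probe_interval)
    {
        struct shash_node *node;

        SHASH_FOR_EACH (node, &relay_dbs) {
            struct relay_ctx *ctx = node->data;
            ovsdb_cs_set_probe_interval(ctx->cs, probe_interval);
        }
    }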

Re: [ovs-dev] [PATCH v2] relay: allow setting probe interval

2023-07-17 Thread Felix Huettner via dev

On Wed, Jul 12, 2023 at 01:15:57PM +0200, Ilya Maximets wrote:
> On 7/10/23 08:19, Felix Huettner via dev wrote:
> > previously it was not possible to set the probe interval for the
> > connection from a relay to the backing ovsdb-server. With this change it
> > is now possible using the
> > `ovsdb-server/set-active-ovsdb-server-probe-interval` command.
> >
> > The command `ovsdb-server/set-active-ovsdb-server-probe-interval` is
> > already used to set the probe interval for active-backup replication.
> > However this is mutually exclusive with being a relay and the case for
> > using the command is the same. Therefore we decided to reuse it instead
> > of adding a new one.
>
> Hi, Felix.  Thanks for the patch!
>
> It is important to be able to set the probe interval for this direction,
> I agree.  And in the absence of the unified configuration method, the
> appctl seems to be as an acceptable solution.
>
> Though I don't think we should mix configuration for active-backup with
> relays.  The 'active-ovsdb-server' part in the name is a direct reference
> to an active-backup replication.  Terminology is a bit different:
>
>   AB service modelrelay service model
>   ---
>   backup  relay
>   active  relay source
>
> So, maybe something like 'set-relay-source-probe-interval'?
>
> And I'm not sure it's mutually exclusive to have active-backup and a
> relay in the same process.  What prevents it?
>

Hi Ilya,

Thanks for the feedback. Then I'll include a new appctl command and stay
with the existing 5s default.

Thanks
Felix
>
> For the other part of the change, it's not mentioned in the commit message,
> but mentioned in the NEWS file that the default values is changed to 60s.
> I'm not sure that's a right thing to do.  In the active-backup model, the
> usual use-case is to back up a standalone database.  And it's important to
> keep the connection alive for long since there is no other source to
> replicate anyway.  On the other hand, relays are most likely used in pair
> with a clustered service model, and supposed to be faster in reaction to
> cluster changes.  There are likely other servers that we can re-connect to.
>
> I'd keep the default inactivity probe value for relays as is, as it might
> be important for smaller clusters to fail-over faster.  Users who frequently
> see disconnections due to inactivity probes on relay connection may evaluate
> reasons and adjust the value for themselves, if necessary.
>
> What do you think?
>
> Best regards, Ilya Maximets.


Re: [ovs-dev] [RFC ovn-heater] OVN scale testing with OpenStack workloads - questionnaire

2023-07-14 Thread Felix Huettner via dev
On Wed, Jul 12, 2023 at 09:38:23PM +0200, Dumitru Ceara wrote:
> On 7/12/23 14:42, Frode Nordahl wrote:
> > On Wed, Jul 12, 2023 at 2:31 PM Felix Huettner
> >  wrote:
> >>
> >> On Wed, Jul 12, 2023 at 01:57:18PM +0200, Dumitru Ceara wrote:
> >>> On 7/12/23 13:00, Felix Huettner wrote:
>  Hi everyone,
> 
>  On Wed, Jul 12, 2023 at 12:29:58PM +0200, Dumitru Ceara wrote:
> > On 7/12/23 12:04, Frode Nordahl wrote:
> >> On Wed, Jul 12, 2023 at 10:51 AM Dumitru Ceara  
> >> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> During the ovn-heater community meeting organized by Frode yesterday
> >>> (thanks again for that!) we agreed to follow up on some points.  Two 
> >>> of
> >>> these are related to gathering more information that allow us to build
> >>> realistic test scenarios (1) and define targets for them (2).
> >>>
> >>> On that note we have agreed to prepare a _short_ questionnaire to be
> >>> shared with OpenStack + OVN users (companies and other operators).
> >>>
> >>> I started a draft document here:
> >>>
> >>> https://docs.google.com/document/d/151saO5a5PmCt7cIZQ7DkgvU755od76BlN2bKjXc1n08
> >>>
> >>> It's mostly based on the previous discussion and information we 
> >>> received
> >>> from Felix Hüttner [0].  I gave access to everyone with the link to
> >>> suggest changes/comment on the document.  Also, if people prefer a
> >>> different way of building this questionnaire (e.g., a PR on
> >>> github/ovn-org/ovn-heater) please let me know.
> >>>
> >>> An additional note: I don't think the goal is to get an 100%
> >>> comprehensive list of all possible workloads and scale targets.  AFAIU
> >>> the intention is to quickly (in a few weeks?) identify a "few" 
> >>> relevant
> >>> types of workloads and start with writing test scenarios for those.  
> >>> We
> >>> could then, incrementally, build on top of that.
> >>
> >> Thanks alot for putting this together, I think it is a great start.
> >>
> >> There is currently no mention of what type of topology is being used
> >> though, I do not want to complicate things, but I do think we need to
> >> at least gauge what is used behind the numbers being presented, as it
> >> would have consequences for how we lay out the tests, and consequently
> >> what we scale for.
> >>
> >
> > I was hoping for targets that make sense for all topologies.
> >
> >> I think these should cover the most normal cases:
> >>
> >> Gateway topologies:
> >> * Distributed gateways with distributed FIPs
> >> * Distributed gateways with centralized NAT
> >> * Centralized gateways
> >>
> >> IP topologies:
> >> * Project networks mainly use IPv4 RFC1918 and SNAT/DNAT
> >> * Project networks mainly use routed IPv4 (no NAT)
> >> * Project networks mainly use routed IPv6 (no NAT)
> >>
> >
> > Do we really need to differentiate between IP topologies?  I think the
> > only significant difference above is NAT vs noNAT.  From ovn-heater
> > perspective there should be no difference between private and routed
> > IPv4 or v6.
> >
> > If we test and optimize OVN for the worst case and always configure NAT
> > we should be fine, right?
> 
>  i'm not sure if that works. With centralized gateways it makes a big
>  difference if you have NAT or no NAT. The reason (iirc) is that without
>  NAT you are not actually using these gateways for most of the traffic in
>  your environment. But with NAT you need to use them.
> 
> >>>
> >>> Well, from a control plane perspective, I thought the "worst case" is
> >>> when we also configure NAT.
> >>
> >> that's probably right. Maybe we just stick to it then for now (until we
> >> see specific scaling issues just for the non-NAT path)
> >
> > I brought up the NAT / no NAT cases because with the current way
> > OpenStack lays out logical routers and logical router ports the
> > traffic is in practice centralized when not using NAT. But as you
> > point out, this would be a data plane scaling issue and more a
> > question to feed into possible future topology changes for OpenStack
> > Neutron rather than a question informing the control plane scale
> > testing.
> >
> > So I agree, let's drop them.
> >
> >> Any suggestions for how to tie these into the questionnaire without
> >> causing a matrix explosion?
> >>
> >
> > Not really, but if we only have "Gateway topologies" as a variable then
> > we end up with a way smaller matrix. :)
> >
> 
>  i would go for a matrix with gateways (distributed or central) and nat
>  (yes or no).
> >
> > +1
> >
> >>> Should we also give up on the "cluster type" differentiation and just
> >>> focus on "large clusters"?  In the end that's what we want to be able to
> >>> run.
> >>
> >> I think that makes sense. If someone 

Re: [ovs-dev] [RFC ovn-heater] OVN scale testing with OpenStack workloads - questionnaire

2023-07-12 Thread Felix Huettner via dev
On Wed, Jul 12, 2023 at 01:57:18PM +0200, Dumitru Ceara wrote:
> On 7/12/23 13:00, Felix Huettner wrote:
> > Hi everyone,
> >
> > On Wed, Jul 12, 2023 at 12:29:58PM +0200, Dumitru Ceara wrote:
> >> On 7/12/23 12:04, Frode Nordahl wrote:
> >>> On Wed, Jul 12, 2023 at 10:51 AM Dumitru Ceara  wrote:
> 
>  Hi all,
> 
>  During the ovn-heater community meeting organized by Frode yesterday
>  (thanks again for that!) we agreed to follow up on some points.  Two of
>  these are related to gathering more information that allow us to build
>  realistic test scenarios (1) and define targets for them (2).
> 
>  On that note we have agreed to prepare a _short_ questionnaire to be
>  shared with OpenStack + OVN users (companies and other operators).
> 
>  I started a draft document here:
> 
>  https://docs.google.com/document/d/151saO5a5PmCt7cIZQ7DkgvU755od76BlN2bKjXc1n08
> 
>  It's mostly based on the previous discussion and information we received
>  from Felix Hüttner [0].  I gave access to everyone with the link to
>  suggest changes/comment on the document.  Also, if people prefer a
>  different way of building this questionnaire (e.g., a PR on
>  github/ovn-org/ovn-heater) please let me know.
> 
>  An additional note: I don't think the goal is to get an 100%
>  comprehensive list of all possible workloads and scale targets.  AFAIU
>  the intention is to quickly (in a few weeks?) identify a "few" relevant
>  types of workloads and start with writing test scenarios for those.  We
>  could then, incrementally, build on top of that.
> >>>
> >>> Thanks a lot for putting this together, I think it is a great start.
> >>>
> >>> There is currently no mention of what type of topology is being used
> >>> though, I do not want to complicate things, but I do think we need to
> >>> at least gauge what is used behind the numbers being presented, as it
> >>> would have consequences for how we lay out the tests, and consequently
> >>> what we scale for.
> >>>
> >>
> >> I was hoping for targets that make sense for all topologies.
> >>
> >>> I think these should cover the most normal cases:
> >>>
> >>> Gateway topologies:
> >>> * Distributed gateways with distributed FIPs
> >>> * Distributed gateways with centralized NAT
> >>> * Centralized gateways
> >>>
> >>> IP topologies:
> >>> * Project networks mainly use IPv4 RFC1918 and SNAT/DNAT
> >>> * Project networks mainly use routed IPv4 (no NAT)
> >>> * Project networks mainly use routed IPv6 (no NAT)
> >>>
> >>
> >> Do we really need to differentiate between IP topologies?  I think the
> >> only significant difference above is NAT vs noNAT.  From ovn-heater
> >> perspective there should be no difference between private and routed
> >> IPv4 or v6.
> >>
> >> If we test and optimize OVN for the worst case and always configure NAT
> >> we should be fine, right?
> >
> > i'm not sure if that works. With centralized gateways it makes a big
> > difference if you have NAT or no NAT. The reason (iirc) is that without
> > NAT you are not actually using these gateways for most of the traffic in
> > your environment. But with NAT you need to use them.
> >
>
> Well, from a control plane perspective, I thought the "worst case" is
> when we also configure NAT.

that's probably right. Maybe we just stick to it then for now (until we
see specific scaling issues just for the non-NAT path)
>
> >>
> >>> Any suggestions for how to tie these into the questionnaire without
> >>> causing a matrix explosion?
> >>>
> >>
> >> Not really, but if we only have "Gateway topologies" as a variable then
> >> we end up with a way smaller matrix. :)
> >>
> >
> > i would go for a matrix with gateways (distributed or central) and nat
> > (yes or no).
> >
>
> Should we also give up on the "cluster type" differentiation and just
> focus on "large clusters"?  In the end that's what we want to be able to
> run.

I think that makes sense. If someone finds a small cluster that has
issues, that's probably also interesting, but they will hopefully write
that in there as well.
>
> > Thanks
> > Felix

Re: [ovs-dev] [RFC ovn-heater] OVN scale testing with OpenStack workloads - questionnaire

2023-07-12 Thread Felix Huettner via dev
Hi everyone,

On Wed, Jul 12, 2023 at 12:29:58PM +0200, Dumitru Ceara wrote:
> On 7/12/23 12:04, Frode Nordahl wrote:
> > On Wed, Jul 12, 2023 at 10:51 AM Dumitru Ceara  wrote:
> >>
> >> Hi all,
> >>
> >> During the ovn-heater community meeting organized by Frode yesterday
> >> (thanks again for that!) we agreed to follow up on some points.  Two of
> >> these are related to gathering more information that allow us to build
> >> realistic test scenarios (1) and define targets for them (2).
> >>
> >> On that note we have agreed to prepare a _short_ questionnaire to be
> >> shared with OpenStack + OVN users (companies and other operators).
> >>
> >> I started a draft document here:
> >>
> >> https://docs.google.com/document/d/151saO5a5PmCt7cIZQ7DkgvU755od76BlN2bKjXc1n08
> >>
> >> It's mostly based on the previous discussion and information we received
> >> from Felix Hüttner [0].  I gave access to everyone with the link to
> >> suggest changes/comment on the document.  Also, if people prefer a
> >> different way of building this questionnaire (e.g., a PR on
> >> github/ovn-org/ovn-heater) please let me know.
> >>
> >> An additional note: I don't think the goal is to get an 100%
> >> comprehensive list of all possible workloads and scale targets.  AFAIU
> >> the intention is to quickly (in a few weeks?) identify a "few" relevant
> >> types of workloads and start with writing test scenarios for those.  We
> >> could then, incrementally, build on top of that.
> >
> > Thanks a lot for putting this together, I think it is a great start.
> >
> > There is currently no mention of what type of topology is being used
> > though, I do not want to complicate things, but I do think we need to
> > at least gauge what is used behind the numbers being presented, as it
> > would have consequences for how we lay out the tests, and consequently
> > what we scale for.
> >
>
> I was hoping for targets that make sense for all topologies.
>
> > I think these should cover the most normal cases:
> >
> > Gateway topologies:
> > * Distributed gateways with distributed FIPs
> > * Distributed gateways with centralized NAT
> > * Centralized gateways
> >
> > IP topologies:
> > * Project networks mainly use IPv4 RFC1918 and SNAT/DNAT
> > * Project networks mainly use routed IPv4 (no NAT)
> > * Project networks mainly use routed IPv6 (no NAT)
> >
>
> Do we really need to differentiate between IP topologies?  I think the
> only significant difference above is NAT vs noNAT.  From ovn-heater
> perspective there should be no difference between private and routed
> IPv4 or v6.
>
> If we test and optimize OVN for the worst case and always configure NAT
> we should be fine, right?

i'm not sure if that works. With centralized gateways it makes a big
difference if you have NAT or no NAT. The reason (iirc) is that without
NAT you are not actually using these gateways for most of the traffic in
your environment. But with NAT you need to use them.

>
> > Any suggestions for how to tie these into the questionnaire without
> > causing a matrix explosion?
> >
>
> Not really, but if we only have "Gateway topologies" as a variable then
> we end up with a way smaller matrix. :)
>

i would go for a matrix with gateways (distributed or central) and nat
(yes or no).

Thanks
Felix
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2] relay: allow setting probe interval

2023-07-10 Thread Felix Huettner via dev
Previously it was not possible to set the probe interval for the
connection from a relay to the backing ovsdb-server. With this change it
is now possible using the
`ovsdb-server/set-active-ovsdb-server-probe-interval` command.

The command `ovsdb-server/set-active-ovsdb-server-probe-interval` is
already used to set the probe interval for active-backup replication.
However, this is mutually exclusive with being a relay, and the use case
for the command is the same. Therefore we decided to reuse it instead of
adding a new one.
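
As an illustration (not part of this patch): with the change applied, the
probe interval on a relay can be adjusted at runtime roughly like this,
assuming the default run directory so that ovs-appctl can locate the
ovsdb-server control socket; the interval argument is in milliseconds:

    ovs-appctl -t ovsdb-server \
        ovsdb-server/set-active-ovsdb-server-probe-interval 30000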

Reviewed-by: Simon Horman 
Signed-off-by: Felix Huettner 
---
 NEWS                 |  4 ++++
 ovsdb/ovsdb-server.c |  4 +++-
 ovsdb/relay.c        | 15 ++++++++++++++-
 ovsdb/relay.h        |  4 +++-
 4 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/NEWS b/NEWS
index 6a990c921..1920830bd 100644
--- a/NEWS
+++ b/NEWS
@@ -10,6 +10,10 @@ Post-v3.1.0
        conversion operation is present.  For the cluster service model follow
        upgrade instructions in 'Upgrading from version 3.1 and earlier to 3.2
        and later' section of ovsdb(7).
+     * When ovsdb-server is running in relay mode, the default value of probe
+       interval is increased to 60 seconds for the connection to the
+       backing server.  This value is configurable with the unixctl
+       command - ovsdb-server/set-active-ovsdb-server-probe-interval.
    - IPFIX template and statistics intervals can now be configured through two
      new options in the IPFIX table: 'template_interval' and 'stats_interval'.
    - Linux kernel datapath:
diff --git a/ovsdb/ovsdb-server.c b/ovsdb/ovsdb-server.c
index 9bad0c8dd..d29dde417 100644
--- a/ovsdb/ovsdb-server.c
+++ b/ovsdb/ovsdb-server.c
@@ -797,7 +797,8 @@ open_db(struct server_config *config, const char *filename)
     add_db(config, db);
 
     if (is_relay) {
-        ovsdb_relay_add_db(db->db, relay_remotes, update_schema, config);
+        ovsdb_relay_add_db(db->db, relay_remotes, update_schema, config,
+                           *config->replication_probe_interval);
     }
     return NULL;
 }
@@ -1473,6 +1474,7 @@ ovsdb_server_set_active_ovsdb_server_probe_interval(struct unixctl_conn *conn,
         if (*config->is_backup) {
             replication_set_probe_interval(probe_interval);
         }
+        ovsdb_relay_set_probe_interval(probe_interval);
         unixctl_command_reply(conn, NULL);
     } else {
         unixctl_command_reply(
diff --git a/ovsdb/relay.c b/ovsdb/relay.c
index 377f3285f..21720f62f 100644
--- a/ovsdb/relay.c
+++ b/ovsdb/relay.c
@@ -127,7 +127,7 @@ static struct ovsdb_cs_ops relay_cs_ops = {
 void
 ovsdb_relay_add_db(struct ovsdb *db, const char *remote,
                    schema_change_callback schema_change_cb,
-                   void *schema_change_aux)
+                   void *schema_change_aux, int probe_interval)
 {
     struct relay_ctx *ctx;
 
@@ -152,10 +152,23 @@ ovsdb_relay_add_db(struct ovsdb *db, const char *remote,
     shash_add(&relay_dbs, db->name, ctx);
     ovsdb_cs_set_leader_only(ctx->cs, false);
     ovsdb_cs_set_remote(ctx->cs, remote, true);
+    ovsdb_cs_set_probe_interval(ctx->cs, probe_interval);
 
     VLOG_DBG("added database: %s, %s", db->name, remote);
 }
 
+/* Updates the probe interval for all relay connections to the specified
+ * value. */
+void
+ovsdb_relay_set_probe_interval(int probe_interval)
+{
+    struct shash_node *node;
+    SHASH_FOR_EACH (node, &relay_dbs) {
+        struct relay_ctx *ctx = node->data;
+        ovsdb_cs_set_probe_interval(ctx->cs, probe_interval);
+    }
+}
+
 void
 ovsdb_relay_del_db(struct ovsdb *db)
 {
diff --git a/ovsdb/relay.h b/ovsdb/relay.h
index f841554ca..ccfaf3c93 100644
--- a/ovsdb/relay.h
+++ b/ovsdb/relay.h
@@ -33,11 +33,13 @@ typedef struct ovsdb_error *(*schema_change_callback)(
 
 void ovsdb_relay_add_db(struct ovsdb *, const char *remote,
                         schema_change_callback schema_change_cb,
-                        void *schema_change_aux);
+                        void *schema_change_aux, int probe_interval);
 void ovsdb_relay_del_db(struct ovsdb *);
 void ovsdb_relay_run(void);
 void ovsdb_relay_wait(void);
 
+void ovsdb_relay_set_probe_interval(int);
+
 bool ovsdb_relay_is_connected(struct ovsdb *);
 
 #endif /* OVSDB_RELAY_H */
--
2.40.0

___
dev mailing list
d...@openvswitch.org

Re: [ovs-dev] Scale testing OVN with ovn-heater for OpenStack use cases

2023-06-30 Thread Felix Huettner via dev
On Fri, Jun 30, 2023 at 09:25:47AM +0200, Frode Nordahl wrote:
> Hello all,
>
> On Tue, May 30, 2023 at 5:16 PM Felix Huettner
>  wrote:
> >
> > Hi Dumitru,
> >
> > On Fri, May 26, 2023 at 01:30:54PM +0200, Dumitru Ceara wrote:
> > > On 5/24/23 09:37, Felix Huettner wrote:
> > > > Hi everyone,
> > >
> > > Hi Felix,
> > >
> > > >
> > > > Ilya mentioned to me that you will want to bring openstack examples to
> > > > ovn-heater.
> > > >
> > >
> > > Yes, we're discussing that.
> > >
> > > > I wanted to ask how to best join this effort. It would be great for us
> > >
> > > Everyone is welcome to contribute! :)
> > >
> >
> > Great, we will try that out.
>
> Thank you for your interest in this effort, as promised I'm in the
> process of organizing a community meeting to get this going.
>
> Below are some proposals for dates and times, please respond with a
> prioritized list of what works best for you and we'll try to land on a
> single date/time for a first meeting:
> Wednesday July 5th 13:00 UTC
> Tuesday July 11th 13:30 UTC
> Wednesday July 12th 8:30 UTC

Hi Frode,

both the slots on the 11th and the 12th work for us without any issues.
The 5th is also doable, but we would prefer the other two.

Thanks
Felix


>
> > > > to have a tool to scale test ovn. Also we have an ovn deployment with
> > > > currently 450 compute nodes where we can try to extract the typically
> > > > used northbound entries from.
> > > >
> > >
> > > This is great!  I'm not an OpenStack user so I'm probably not using the
> > > right terminology and I might be missing things but I think we're
> > > interested in:
> > > - how many tenant networks do you have in the cluster
> > > - for each tenant network:
> > >   - how many VMs (on average but maybe also min/max/etc)
> > >   - how many floating IPs (dnat-and-snat)
> > >   - how many compute nodes run VMs from this network
> > > - for each compute:
> > >   - how many VMs run on it (on average but also min/max/etc)
> > >
> >
> > Ok, so i'll try to give an overview below.
> > If you need any more detail just ask.
> >
> > General amount of nodes:
> > * ~500 Hypervisors
> > * 9 nodes that just serve as gateway chassis (connected to 4
> >   different external networks)
> >
> > Amount of tenant Networks (== logical switches): 2300 (only 1400 are
> > actually hosting VMs).
> >
> > Amount of logical Routers: 2010 (2000 of these have a chassisredirect
> > port that is hosted on one of the 9 nodes).
> >
> > * Amount of VMs:
> >* min: 1
> >* max: 300
> >* avg: 4.5
> >* 630 networks only have a single VM, 230 have two VMs and 190 have 3
> >  VMs.
> > * Amount of Hypervisors per network:
> >* min: 1
> >* max: 90
> >* avg: 3.6
> >    * 650 networks only span a single hypervisor, 250 on two
> >      hypervisors, 180 on three hypervisors
> >
> > Regarding floating ips we have currently 1530. I'll give you the amount
> > per Logical router:
> > * min: 1
> > * max: 230
> > * avg: 2.7
> > * 430 routers have a single floating ip, 140 routers have 2 floating
> >   ips.
> >
> > Regarding VMs per Hypervisor we have:
> > * min: 1
> > * max: 80
> > * avg: 15
> >
> >
> > Other potentially interesting values:
> >
> > * Amount of Load_Balancer: 0
> > * Amount of Logical_Switch_Port type=virtual: 800
> > * Amount of Port_Group: 6000
> > * Amount of ACL: 35500
> >    * All of these reference portgroups in `match` using `@` or `$`
> >    * Referencing other portgroups using `$`: 8300
>
> Thanks a lot for providing this information and gauging the size of
> your deployment; your inputs would be very valuable in this effort.
> One of the main goals of ovn-heater is the ability to simulate
> workloads, because as developers, we seldom have access to actual 500-
> or 1000- node clusters. Having a participant in the project with
> actual hardware installation of such a size would give us the
> opportunity to help confirm the accuracy of our simulations.
>
> --
> Frode Nordahl
>
> > Regards
> > Felix
> >
> Maybe Frode and Daniel can expand on this.
> > >
> > > Regards,
> > > Dumitru
> > >
> > > > Thanks
> > > > Felix
> > > >
> > > > On Tue, May 16, 2023 at 05:38:25PM +0200, Daniel Alvarez Sanchez wrote:
> > > >> Hey Frode, Dumitru, thanks a lot for initiating this
> > > >>
> > > >> On Thu, May 11, 2023 at 6:05 PM Frode Nordahl 
> > > >> 
> > > >> wrote:
> > > >>
> > > >>> Hello, Dumitru, Daniel and team,
> > > >>>
> > > >>> On Thu, May 11, 2023 at 5:23 PM Dumitru Ceara  
> > > >>> wrote:
> > > 
> > >  Hi Frode,
> > > 
> > >  During an internal discussion with RedHat OpenStack folks (Daniel in 
> > >  cc)
> > >  about scalability of OVN in OpenStack deployments I remembered that 
> > >  you
> > >  mentioned that you're thinking of enhancing ovn-heater in order to 
> > >  test
> > >  your types of deployments [0].  I assumed that those are also 
> > >  OpenStack
> > >  based deployments so I was wondering if there's an opportunity for
> > >  

Re: [ovs-dev] Scale testing OVN with ovn-heater for OpenStack use cases

2023-05-30 Thread Felix Huettner via dev
Hi Dumitru,

On Fri, May 26, 2023 at 01:30:54PM +0200, Dumitru Ceara wrote:
> On 5/24/23 09:37, Felix Huettner wrote:
> > Hi everyone,
>
> Hi Felix,
>
> >
> > Ilya mentioned to me that you will want to bring openstack examples to
> > ovn-heater.
> >
>
> Yes, we're discussing that.
>
> > I wanted to ask how to best join this effort. It would be great for us
>
> Everyone is welcome to contribute! :)
>

Great, we will try that out.

> > to have a tool to scale test ovn. Also we have an ovn deployment with
> > currently 450 compute nodes where we can try to extract the typically
> > used northbound entries from.
> >
>
> This is great!  I'm not an OpenStack user so I'm probably not using the
> right terminology and I might be missing things but I think we're
> interested in:
> - how many tenant networks do you have in the cluster
> - for each tenant network:
>   - how many VMs (on average but maybe also min/max/etc)
>   - how many floating IPs (dnat-and-snat)
>   - how many compute nodes run VMs from this network
> - for each compute:
>   - how many VMs run on it (on average but also min/max/etc)
>

Ok, so i'll try to give an overview below.
If you need any more detail just ask.

General amount of nodes:
* ~500 Hypervisors
* 9 nodes that just serve as gateway chassis (connected to 4
  different external networks)

Amount of tenant Networks (== logical switches): 2300 (only 1400 are
actually hosting VMs).

Amount of logical Routers: 2010 (2000 of these have a chassisredirect
port that is hosted on one of the 9 nodes).

* Amount of VMs:
   * min: 1
   * max: 300
   * avg: 4.5
   * 630 networks only have a single VM, 230 have two VMs and 190 have 3
 VMs.
* Amount of Hypervisors per network:
   * min: 1
   * max: 90
   * avg: 3.6
   * 650 networks only span a single hypervisor, 250 on two
     hypervisors, 180 on three hypervisors

Regarding floating ips we have currently 1530. I'll give you the amount
per Logical router:
* min: 1
* max: 230
* avg: 2.7
* 430 routers have a single floating ip, 140 routers have 2 floating
  ips.

Regarding VMs per Hypervisor we have:
* min: 1
* max: 80
* avg: 15


Other potentially interesting values:

* Amount of Load_Balancer: 0
* Amount of Logical_Switch_Port type=virtual: 800
* Amount of Port_Group: 6000
* Amount of ACL: 35500
   * All of these reference portgroups in `match` using `@` or `$`
   * Referencing other portgroups using `$`: 8300

Regards
Felix

> Maybe Frode and Daniel can expand on this.
>
> Regards,
> Dumitru
>
> > Thanks
> > Felix
> >
> > On Tue, May 16, 2023 at 05:38:25PM +0200, Daniel Alvarez Sanchez wrote:
> >> Hey Frode, Dumitru, thanks a lot for initiating this
> >>
> >> On Thu, May 11, 2023 at 6:05 PM Frode Nordahl 
> >> wrote:
> >>
> >>> Hello, Dumitru, Daniel and team,
> >>>
> >>> On Thu, May 11, 2023 at 5:23 PM Dumitru Ceara  wrote:
> 
>  Hi Frode,
> 
>  During an internal discussion with RedHat OpenStack folks (Daniel in cc)
>  about scalability of OVN in OpenStack deployments I remembered that you
>  mentioned that you're thinking of enhancing ovn-heater in order to test
>  your types of deployments [0].  I assumed that those are also OpenStack
>  based deployments so I was wondering if there's an opportunity for
>  collaboration here.
> >>>
> >>> Thank you for reaching out to me on this topic. We do indeed have scale
> >>> testing in the context of OpenStack on our roadmap and welcome the
> >>> invitation to collaborate on this.
> >>>
> >>> As you know we did some preparation work together already as you refer
> >>> to in [0], and we thank you for your help in upstreaming ovn-heater.
> >>>
> >>> We have yet to schedule exactly when in the cycle to work on this, I
> >>> guess a next step could be to set up a meet to share some ideas and
> >>> see how it aligns?
> >>
> >>
> >>> If that resonates with you, I'd be happy to organize.
> >>>
> >>
> >> Sounds good, please do :)
> >>
> >>
> >>>
>  Having OpenStack-like scenarios in ovn-heater would really allow OVN
>  developers to profile/troubleshoot/optimize OVN for those use cases too.
> 
>  Well, in general, the above is true for any other CMS too.
> >>>
> >>> Yes, this is what we have been thinking as well. We are supporting other
> >>> products in their journey with OVN as well, so similar projects may come
> >>> from those in the future.
> >>>
> >>
> >> Yes, we can try to model OpenStack-like scenarios for ovn-heater where the
> >> resources are generally more distributed across the hypervisors than in k8s
> >> for example, perhaps larger L2 domain sizes, etcetera.
> >> Happy to discuss further!
> >>
> >> daniel
> >>
> >>
> >>> --
> >>> Frode Nordahl
> >>>
>  Thanks,
>  Dumitru
> 
>  [0]
> >>> https://mail.openvswitch.org/pipermail/ovs-dev/2023-March/402680.html
> 
> >>>
> >>>
> >> ___
> >> dev mailing list
> >> d...@openvswitch.org
> >> 

Re: [ovs-dev] Scale testing OVN with ovn-heater for OpenStack use cases

2023-05-24 Thread Felix Huettner via dev
Hi everyone,

Ilya mentioned to me that you will want to bring openstack examples to
ovn-heater.

I wanted to ask how to best join this effort. It would be great for us
to have a tool to scale test ovn. Also we have an ovn deployment with
currently 450 compute nodes where we can try to extract the typically
used northbound entries from.

Thanks
Felix

On Tue, May 16, 2023 at 05:38:25PM +0200, Daniel Alvarez Sanchez wrote:
> Hey Frode, Dumitru, thanks a lot for initiating this
>
> On Thu, May 11, 2023 at 6:05 PM Frode Nordahl 
> wrote:
>
> > Hello, Dumitru, Daniel and team,
> >
> > On Thu, May 11, 2023 at 5:23 PM Dumitru Ceara  wrote:
> > >
> > > Hi Frode,
> > >
> > > During an internal discussion with RedHat OpenStack folks (Daniel in cc)
> > > about scalability of OVN in OpenStack deployments I remembered that you
> > > mentioned that you're thinking of enhancing ovn-heater in order to test
> > > your types of deployments [0].  I assumed that those are also OpenStack
> > > based deployments so I was wondering if there's an opportunity for
> > > collaboration here.
> >
> > Thank you for reaching out to me on this topic. We do indeed have scale
> > testing in the context of OpenStack on our roadmap and welcome the
> > invitation to collaborate on this.
> >
> > As you know we did some preparation work together already as you refer
> > to in [0], and we thank you for your help in upstreaming ovn-heater.
> >
> > We have yet to schedule exactly when in the cycle to work on this, I
> > guess a next step could be to set up a meet to share some ideas and
> > see how it aligns?
>
>
> > If that resonates with you, I'd be happy to organize.
> >
>
> Sounds good, please do :)
>
>
> >
> > > Having OpenStack-like scenarios in ovn-heater would really allow OVN
> > > developers to profile/troubleshoot/optimize OVN for those use cases too.
> > >
> > > Well, in general, the above is true for any other CMS too.
> >
> > Yes, this is what we have been thinking as well. We are supporting other
> > products in their journey with OVN as well, so similar projects may come
> > from those in the future.
> >
>
> Yes, we can try to model OpenStack-like scenarios for ovn-heater where the
> resources are generally more distributed across the hypervisors than in k8s
> for example, perhaps larger L2 domain sizes, etcetera.
> Happy to discuss further!
>
> daniel
>
>
> > --
> > Frode Nordahl
> >
> > > Thanks,
> > > Dumitru
> > >
> > > [0]
> > https://mail.openvswitch.org/pipermail/ovs-dev/2023-March/402680.html
> > >
> >
> >
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] relay: allow setting probe interval

2023-05-19 Thread Felix Huettner via dev
Previously it was not possible to set the probe interval for the
connection from a relay to the backing ovsdb-server. With this change it
is now possible using the
`ovsdb-server/set-active-ovsdb-server-probe-interval` command.

The command `ovsdb-server/set-active-ovsdb-server-probe-interval` is
already used to set the probe interval for active-backup replication.
However, this is mutually exclusive with being a relay, and the use case
for the command is the same. Therefore we decided to reuse it instead of
adding a new one.

Signed-off-by: Felix Huettner 
---
 NEWS                 |  4 ++++
 ovsdb/ovsdb-server.c |  4 +++-
 ovsdb/relay.c        | 15 ++++++++++++++-
 ovsdb/relay.h        |  4 +++-
 4 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/NEWS b/NEWS
index cfd43..719517c67 100644
--- a/NEWS
+++ b/NEWS
@@ -10,6 +10,10 @@ Post-v3.1.0
        conversion operation is present.  For the cluster service model follow
        upgrade instructions in 'Upgrading from version 3.1 and earlier to 3.2
        and later' section of ovsdb(7).
+     * When ovsdb-server is running in relay mode, the default value of probe
+       interval is increased to 60 seconds for the connection to the
+       backing server.  This value is configurable with the unixctl
+       command - ovsdb-server/set-active-ovsdb-server-probe-interval.
    - IPFIX template and statistics intervals can now be configured through two
      new options in the IPFIX table: 'template_interval' and 'stats_interval'.
    - Linux kernel datapath:
diff --git a/ovsdb/ovsdb-server.c b/ovsdb/ovsdb-server.c
index 9bad0c8dd..d29dde417 100644
--- a/ovsdb/ovsdb-server.c
+++ b/ovsdb/ovsdb-server.c
@@ -797,7 +797,8 @@ open_db(struct server_config *config, const char *filename)
     add_db(config, db);
 
     if (is_relay) {
-        ovsdb_relay_add_db(db->db, relay_remotes, update_schema, config);
+        ovsdb_relay_add_db(db->db, relay_remotes, update_schema, config,
+                           *config->replication_probe_interval);
     }
     return NULL;
 }
@@ -1473,6 +1474,7 @@ ovsdb_server_set_active_ovsdb_server_probe_interval(struct unixctl_conn *conn,
         if (*config->is_backup) {
             replication_set_probe_interval(probe_interval);
         }
+        ovsdb_relay_set_probe_interval(probe_interval);
         unixctl_command_reply(conn, NULL);
     } else {
         unixctl_command_reply(
diff --git a/ovsdb/relay.c b/ovsdb/relay.c
index 377f3285f..21720f62f 100644
--- a/ovsdb/relay.c
+++ b/ovsdb/relay.c
@@ -127,7 +127,7 @@ static struct ovsdb_cs_ops relay_cs_ops = {
 void
 ovsdb_relay_add_db(struct ovsdb *db, const char *remote,
                    schema_change_callback schema_change_cb,
-                   void *schema_change_aux)
+                   void *schema_change_aux, int probe_interval)
 {
     struct relay_ctx *ctx;
 
@@ -152,10 +152,23 @@ ovsdb_relay_add_db(struct ovsdb *db, const char *remote,
     shash_add(&relay_dbs, db->name, ctx);
     ovsdb_cs_set_leader_only(ctx->cs, false);
     ovsdb_cs_set_remote(ctx->cs, remote, true);
+    ovsdb_cs_set_probe_interval(ctx->cs, probe_interval);
 
     VLOG_DBG("added database: %s, %s", db->name, remote);
 }
 
+/* Updates the probe interval for all relay connections to the specified
+ * value. */
+void
+ovsdb_relay_set_probe_interval(int probe_interval)
+{
+    struct shash_node *node;
+    SHASH_FOR_EACH (node, &relay_dbs) {
+        struct relay_ctx *ctx = node->data;
+        ovsdb_cs_set_probe_interval(ctx->cs, probe_interval);
+    }
+}
+
 void
 ovsdb_relay_del_db(struct ovsdb *db)
 {
diff --git a/ovsdb/relay.h b/ovsdb/relay.h
index f841554ca..ccfaf3c93 100644
--- a/ovsdb/relay.h
+++ b/ovsdb/relay.h
@@ -33,11 +33,13 @@ typedef struct ovsdb_error *(*schema_change_callback)(
 
 void ovsdb_relay_add_db(struct ovsdb *, const char *remote,
                         schema_change_callback schema_change_cb,
-                        void *schema_change_aux);
+                        void *schema_change_aux, int probe_interval);
 void ovsdb_relay_del_db(struct ovsdb *);
 void ovsdb_relay_run(void);
 void ovsdb_relay_wait(void);
 
+void ovsdb_relay_set_probe_interval(int);
+
 bool ovsdb_relay_is_connected(struct ovsdb *);
 
 #endif /* OVSDB_RELAY_H */
--
2.40.0

___
dev mailing list
d...@openvswitch.org

[ovs-dev] [PATCH net v3] net: openvswitch: fix race on port output

2023-04-05 Thread Felix Huettner via dev
assume the following setup on a single machine:
1. An openvswitch instance with one bridge and default flows
2. two network namespaces "server" and "client"
3. two ovs interfaces "server" and "client" on the bridge
4. for each ovs interface a veth pair with a matching name and 32 rx and
   tx queues
5. move the ends of the veth pairs to the respective network namespaces
6. assign ip addresses to each of the veth ends in the namespaces (needs
   to be the same subnet)
7. start some http server on the server network namespace
8. test if a client in the client namespace can reach the http server

when following the actions below the host has a chance of getting a cpu
   stuck in an infinite loop:
1. send a large amount of parallel requests to the http server (around
   3000 curls should work)
2. in parallel delete the network namespace (do not delete interfaces or
   stop the server, just kill the namespace)

there is a low chance that this will cause the below kernel cpu stuck
message. If this does not happen just retry.
Below there is also the output of bpftrace for the functions mentioned
in the output.

The series of events happening here is:
1. the network namespace is deleted calling
   `unregister_netdevice_many_notify` somewhere in the process
2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and
   then runs `synchronize_net`
3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER`
4. this is then handled by `dp_device_event` which calls
   `ovs_netdev_detach_dev` (if a vport is found, which is the case for
   the veth interface attached to ovs)
5. this removes the rx_handlers of the device but does not prevent
   packets from being sent to the device
6. `dp_device_event` then queues the vport deletion to work in
   background as a ovs_lock is needed that we do not hold in the
   unregistration path
7. `unregister_netdevice_many_notify` continues to call
   `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0
8. port deletion continues (but details are not relevant for this issue)
9. at some future point the background task deletes the vport

If a packet is sent to the ovs vport after 7. but before 9. (the vport
is not deleted at this point in time), it is forwarded to the
`dev_queue_xmit` flow even though the device is unregistering.
In the `skb_tx_hash` path (which is called from `dev_queue_xmit`) there
is a while loop (entered if the packet has a rx_queue recorded) that is
infinite if `dev->real_num_tx_queues` is zero.
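
For illustration, a simplified sketch of the relevant branch (an
approximation of skb_tx_hash() in net/core/dev.c, not the exact kernel
code):

    /* qcount is derived from dev->real_num_tx_queues.  hash is unsigned,
     * so once qcount is 0 the loop condition stays true forever and the
     * loop never terminates. */
    u16 qcount = dev->real_num_tx_queues;

    if (skb_rx_queue_recorded(skb)) {
        u16 hash = skb_get_rx_queue(skb);

        while (unlikely(hash >= qcount)) {
            hash -= qcount;    /* subtracts 0 on every pass if qcount == 0 */
        }
        return hash;
    }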

To prevent this from happening we update `do_output` to handle devices
without carrier the same as if the device is not found (which would
be the code path after 9. is done).
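
The shape of that check could look roughly like the sketch below
(simplified, with error handling omitted; an illustration of the idea,
not the exact patch):

    static void do_output(struct datapath *dp, struct sk_buff *skb,
                          int out_port, struct sw_flow_key *key)
    {
        struct vport *vport = ovs_vport_rcu(dp, out_port);

        /* A vport whose underlying device has no carrier (e.g. because it
         * is unregistering) is treated like a missing vport: the packet
         * is dropped instead of being handed to dev_queue_xmit. */
        if (likely(vport && netif_carrier_ok(vport->dev))) {
            ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
        } else {
            kfree_skb(skb);
        }
    }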

Additionally we now produce a warning in `skb_tx_hash` if we would hit
the infinite loop.

bpftrace (first word is function name):

__dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, 
skb_addr: 0x9edb6f207000, reg_state: 1
netdev_core_pick_tx server: addr: 0x9f0a46d4a000 real_num_tx_queues: 1, 
cpu: 2, pid: 28024, tid: 28024, skb_addr: 0x9edb6f207000, reg_state: 1
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 2, reg_state: 1
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 6, reg_state: 2
ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 
21024, reg_state: 2
netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, 
tid: 21024, reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 
21024, tid: 21024, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 27, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 22, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 18, reg_state: 2
netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, 
skb_addr: 0x9edb6f207000, reg_state: 2
__dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, 
skb_addr: 0x9edb6f207000, reg_state: 2
netdev_core_pick_tx server: addr: 0x9f0a46d4a000 real_num_tx_queues: 0, 
cpu: 2, pid: 28024, tid: 28024, skb_addr: 0x9edb6f207000, reg_state: 2
broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024
ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, 
reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604

stuck message:

watchdog: BUG: soft lockup - CPU#5 stuck 

[ovs-dev] [PATCH net v2] net: openvswitch: fix race on port output

2023-04-04 Thread Felix Huettner via dev
assume the following setup on a single machine:
1. An openvswitch instance with one bridge and default flows
2. two network namespaces "server" and "client"
3. two ovs interfaces "server" and "client" on the bridge
4. for each ovs interface a veth pair with a matching name and 32 rx and
   tx queues
5. move the ends of the veth pairs to the respective network namespaces
6. assign ip addresses to each of the veth ends in the namespaces (needs
   to be the same subnet)
7. start some http server on the server network namespace
8. test if a client in the client namespace can reach the http server

when following the actions below the host has a chance of getting a cpu
   stuck in an infinite loop:
1. send a large amount of parallel requests to the http server (around
   3000 curls should work)
2. in parallel delete the network namespace (do not delete interfaces or
   stop the server, just kill the namespace)

there is a low chance that this will cause the below kernel cpu stuck
message. If this does not happen just retry.
Below there is also the output of bpftrace for the functions mentioned
in the output.

The series of events happening here is:
1. the network namespace is deleted calling
   `unregister_netdevice_many_notify` somewhere in the process
2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and
   then runs `synchronize_net`
3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER`
4. this is then handled by `dp_device_event` which calls
   `ovs_netdev_detach_dev` (if a vport is found, which is the case for
   the veth interface attached to ovs)
5. this removes the rx_handlers of the device but does not prevent
   packets from being sent to the device
6. `dp_device_event` then queues the vport deletion to work in
   background as a ovs_lock is needed that we do not hold in the
   unregistration path
7. `unregister_netdevice_many_notify` continues to call
   `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0
8. port deletion continues (but details are not relevant for this issue)
9. at some future point the background task deletes the vport

If a packet is sent to the ovs vport after 7. but before 9. (the vport
is not deleted at this point in time), it is forwarded to the
`dev_queue_xmit` flow even though the device is unregistering.
In the `skb_tx_hash` path (which is called from `dev_queue_xmit`) there
is a while loop (entered if the packet has a rx_queue recorded) that is
infinite if `dev->real_num_tx_queues` is zero.

To prevent this from happening we update `do_output` to handle devices
without carrier the same as if the device is not found (which would
be the code path after 9. is done).

Additionally we now produce a warning in `skb_tx_hash` if we would hit
the infinite loop.

bpftrace (first word is function name):

__dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, 
skb_addr: 0x9edb6f207000, reg_state: 1
netdev_core_pick_tx server: addr: 0x9f0a46d4a000 real_num_tx_queues: 1, 
cpu: 2, pid: 28024, tid: 28024, skb_addr: 0x9edb6f207000, reg_state: 1
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 2, reg_state: 1
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 6, reg_state: 2
ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 
21024, reg_state: 2
netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, 
tid: 21024, reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 
21024, tid: 21024, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 27, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 22, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 18, reg_state: 2
netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, 
skb_addr: 0x9edb6f207000, reg_state: 2
__dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, 
skb_addr: 0x9edb6f207000, reg_state: 2
netdev_core_pick_tx server: addr: 0x9f0a46d4a000 real_num_tx_queues: 0, 
cpu: 2, pid: 28024, tid: 28024, skb_addr: 0x9edb6f207000, reg_state: 2
broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024
ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, 
reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604

stuck message:

watchdog: BUG: soft lockup - CPU#5 stuck 

[ovs-dev] [PATCH] net: openvswitch: fix race on port output

2023-03-31 Thread Felix Huettner via dev
assume the following setup on a single machine:
1. An openvswitch instance with one bridge and default flows
2. two network namespaces "server" and "client"
3. two ovs interfaces "server" and "client" on the bridge
4. for each ovs interface a veth pair with a matching name and 32 rx and
   tx queues
5. move the ends of the veth pairs to the respective network namespaces
6. assign ip addresses to each of the veth ends in the namespaces (needs
   to be the same subnet)
7. start some http server on the server network namespace
8. test if a client in the client namespace can reach the http server

when following the actions below the host has a chance of getting a cpu
   stuck in an infinite loop:
1. send a large amount of parallel requests to the http server (around
   3000 curls should work)
2. in parallel delete the network namespace (do not delete interfaces or
   stop the server, just kill the namespace)

there is a low chance that this will cause the below kernel cpu stuck
message. If this does not happen just retry.
Below there is also the output of bpftrace for the functions mentioned
in the output.

The series of events happening here is:
1. the network namespace is deleted calling
   `unregister_netdevice_many_notify` somewhere in the process
2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and
   then runs `synchronize_net`
3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER`
4. this is then handled by `dp_device_event` which calls
   `ovs_netdev_detach_dev` (if a vport is found, which is the case for
   the veth interface attached to ovs)
5. this removes the rx_handlers of the device but does not prevent
   packets from being sent to the device
6. `dp_device_event` then queues the vport deletion to work in
   background as a ovs_lock is needed that we do not hold in the
   unregistration path
7. `unregister_netdevice_many_notify` continues to call
   `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0
8. port deletion continues (but details are not relevant for this issue)
9. at some future point the background task deletes the vport

If a packet is sent to the ovs vport after 7. but before 9. (the vport
is not deleted at this point in time), it is forwarded to the
`dev_queue_xmit` flow even though the device is unregistering.
In the `skb_tx_hash` path (which is called from `dev_queue_xmit`) there
is a while loop (entered if the packet has a rx_queue recorded) that is
infinite if `dev->real_num_tx_queues` is zero.

To prevent this from happening we update `do_output` to handle devices
that are not registered (e.g. because they are currently unregistering)
the same as if the device is not found (which would be the code path
after 9. is done).

Additionally we introduce a `BUG_ON` in `skb_tx_hash` to rather crash
than produce this infinite loop that cannot be exited anyway.
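
A sketch of what that guard amounts to (illustrative only; as seen in the
v2 and v3 postings above, later revisions replace the crash with a
warning):

    /* Crash right away instead of looping forever on a device whose
     * tx queues are already gone. */
    BUG_ON(dev->real_num_tx_queues == 0);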

bpftrace (first word is function name):

__dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, 
skb_addr: 0x9edb6f207000, reg_state: 1
netdev_core_pick_tx server: addr: 0x9f0a46d4a000 real_num_tx_queues: 1, 
cpu: 2, pid: 28024, tid: 28024, skb_addr: 0x9edb6f207000, reg_state: 1
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 2, reg_state: 1
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 6, reg_state: 2
ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 
21024, reg_state: 2
netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, 
tid: 21024, reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 
21024, tid: 21024, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 27, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 22, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, 
event 18, reg_state: 2
netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, 
skb_addr: 0x9edb6f207000, reg_state: 2
__dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, 
skb_addr: 0x9edb6f207000, reg_state: 2
netdev_core_pick_tx server: addr: 0x9f0a46d4a000 real_num_tx_queues: 0, 
cpu: 2, pid: 28024, tid: 28024, skb_addr: 0x9edb6f207000, reg_state: 2
broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024
ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, 
reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 33604, tid: