Re: [ovs-dev] [PATCH 5/5] ovsdb: raft: Fix inability to join after leadership change round trip.

2024-03-25 Thread Han Zhou
On Fri, Mar 15, 2024 at 1:15 PM Ilya Maximets  wrote:
>
> Consider the following sequence of events:
>
>  1. Cluster with 2 nodes - A and B.  A is a leader.
>  2. C connects to A and sends a join request.
>  3. A sends an append request to C.  C is in CATCHUP phase for A.
>  4. A looses leadership to B.  Sends join failure notification to C.
>  5. C sends append reply to A.
>  6. A discards append reply (not leader).
>  7. B looses leadership back to A.
>  8. C sends a new join request to A.
>  9. A replies with failure (already in progress).
>  10. GoTo step 8.
>
> At this point A is waiting for an append reply that it already
> discarded at step 6 and fails all the new attempts of C to join with
> 'already in progress' verdict.  C stays forever in a joining state
> and in a CATCHUP phase from A's perspective.
>
> This is a similar case to a sudden disconnect from a leader fixed in
> commit 999ba294fb4f ("ovsdb: raft: Fix inability to join the cluster
> after interrupted attempt."), but since we're not disconnecting, the
> servers are not getting destroyed.
>
> Fix that by destroying all the servers that are not yet part of the
> configuration after leadership is lost.  This way, server C will be
> able to simply re-start the joining process from scratch.
>
> New failure test command is added in order to simulate leadership
> change before we receive the append reply, so it gets discarded.
> New cluster test is added to exercise this scenario.
>
> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered
databases.")
> Reported-at: https://github.com/ovn-org/ovn/issues/235
> Signed-off-by: Ilya Maximets 
> ---
>  ovsdb/raft.c   | 16 -
>  tests/ovsdb-cluster.at | 53 ++
>  2 files changed, 68 insertions(+), 1 deletion(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index c41419052..f9e760a08 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -81,6 +81,7 @@ enum raft_failure_test {
>  FT_STOP_RAFT_RPC,
>  FT_TRANSFER_LEADERSHIP,
>  FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ,
> +FT_TRANSFER_LEADERSHIP_AFTER_STARTING_TO_ADD,
>  };
>  static enum raft_failure_test failure_test;
>
> @@ -2702,15 +2703,22 @@ raft_become_follower(struct raft *raft)
>   * new configuration.  Our AppendEntries processing will properly
update
>   * the server configuration later, if necessary.
>   *
> + * However, since we're sending replies about a failure to add,
those new
> + * servers has to be cleaned up.  Otherwise, they will stuck in a
'CATCHUP'
> + * phase in case this server regains leadership before they join
through
> + * the current new leader.  They are not yet in 'raft->servers', so
not
> + * part of the shared configuration.
> + *
>   * Also we do not complete commands here, as they can still be
completed
>   * if their log entries have already been replicated to other
servers.
>   * If the entries were actually committed according to the new
leader, our
>   * AppendEntries processing will complete the corresponding commands.
>   */
>  struct raft_server *s;
> -HMAP_FOR_EACH (s, hmap_node, >add_servers) {
> +HMAP_FOR_EACH_POP (s, hmap_node, >add_servers) {
>  raft_send_add_server_reply__(raft, >sid, s->address, false,
>   RAFT_SERVER_LOST_LEADERSHIP);
> +raft_server_destroy(s);
>  }
>  if (raft->remove_server) {
>  raft_send_remove_server_reply__(raft, >remove_server->sid,
> @@ -3985,6 +3993,10 @@ raft_handle_add_server_request(struct raft *raft,
>   "to cluster "CID_FMT, s->nickname, SID_ARGS(>sid),
>   rq->address, CID_ARGS(>cid));
>  raft_send_append_request(raft, s, 0, "initialize new server");
> +
> +if (failure_test == FT_TRANSFER_LEADERSHIP_AFTER_STARTING_TO_ADD) {
> +failure_test = FT_TRANSFER_LEADERSHIP;
> +}
>  }
>
>  static void
> @@ -5110,6 +5122,8 @@ raft_unixctl_failure_test(struct unixctl_conn *conn
OVS_UNUSED,
>  } else if (!strcmp(test,
>
"transfer-leadership-after-sending-append-request")) {
>  failure_test = FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ;
> +} else if (!strcmp(test,
"transfer-leadership-after-starting-to-add")) {
> +failure_test = FT_TRANSFER_LEADERSHIP_AFTER_STARTING_TO_ADD;
>  } else if (!strcmp(test, "transfer-leadership")) {
>  failure_test = FT_TRANSFER_LEADERSHIP;
>  } else if (!strcmp(test, "clear")) {
> diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
> index 482e4e02d..9d8b4d06a 100644
> --- a/tests/ovsdb-cluster.at
> +++ b/tests/ovsdb-cluster.at
> @@ -525,6 +525,59 @@ for i in $(seq $n); do
>  OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s$i], [s$i.pid])
>  done
>
> +AT_CLEANUP
> +
> +AT_SETUP([OVSDB cluster - leadership change before replication while
joining])
> +AT_KEYWORDS([ovsdb server negative unix cluster join])
> +
> +n=5
> 

Re: [ovs-dev] [PATCH 4/5] ovsdb: raft: Fix assertion when 1-node cluster looses leadership.

2024-03-25 Thread Han Zhou
On Fri, Mar 15, 2024 at 1:15 PM Ilya Maximets  wrote:
>
> Some of the failure tests can make a single-node cluster to
> loose leadership.  In this case the next raft_run() will
> trigger election with a pre-vore enabled.  This is causing

s/pre-vore/pre-vote

> an assertion when this server attempts to vote for itself.
>
> Fix that by not using pre-voting if the is only one server.

s/the/there

>
> A new failure test introduced in later commit triggers this
> assertion every time.
>
> Fixes: 85634fd58004 ("ovsdb: raft: Support pre-vote mechanism to deal
with disruptive server.")
> Signed-off-by: Ilya Maximets 

Thanks for the fix.
Acked-by: Han Zhou 

> ---
>  ovsdb/raft.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index 237d7ebf5..c41419052 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -2083,7 +2083,7 @@ raft_run(struct raft *raft)
>  raft_start_election(raft, true, false);
>  }
>  } else {
> -raft_start_election(raft, true, false);
> +raft_start_election(raft, hmap_count(>servers) > 1,
false);
>  }
>
>  }
> --
> 2.43.0
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 3/5] ovsdb: raft: Fix permanent joining state on a cluster member.

2024-03-25 Thread Han Zhou
On Fri, Mar 15, 2024 at 1:15 PM Ilya Maximets  wrote:
>
> Consider the following chain of events:
>
>  1. Have a cluster with 2 members - A and B.  A is a leader.
>  2. C connects to A, sends a request to join the cluster.
>  3. A catches up C, creates an update for the 'servers' list and sends
> it to B and C to apply.  This entry is not committed yet.
>  4. Before B or C can reply, A looses leadership for some reason.
>  5. A sends a joining failure message to C, C remains in joining state.
>  5. Both B and C have the new version of 'servers', so they recognize
> each other as valid cluster members.
>  6. B initiates a vote, C (or A) replies and B becomes a new leader.
>  7. B has a new list of servers.  B commits it.  C becomes a committed
> cluster member.
>  8. A and C receive heartbeats with a new commit index and C is now
> a committed cluster member for all A, B and C.
>
> However, at the end of this process, C is still in joining state
> as it never received a successful reply for a join request, and C is
> still in a COMMITTING phase for A.  So, C skips some parts of the RAFT
> life cycle and A will refuse to transfer leadership to C if something
> happens in the future.
>
> More interestingly, B can actually transfer leadership to C and vote
> for it.  A will vote for it just fine as well.  After that, C becomes
> a new cluster leader while still in joining state.  In this state C
> will not commit any changes.  So, we have seemingly stable cluster
> that doesn't commit any changes!  E.g.:
>
>   s3
>   Address: unix:s3.raft
>   Status: joining cluster
>   Remotes for joining: unix:s3.raft unix:s2.raft unix:s1.raft
>   Role: leader
>   Term: 4
>   Leader: self
>   Vote: self
>
>   Last Election started 30095 ms ago, reason: leadership_transfer
>   Last Election won: 30093 ms ago
>   Election timer: 1000
>   Log: [2, 7]
>   Entries not yet committed: 2
>   Entries not yet applied: 6
>   Connections: ->s1 ->s2 <-s1 <-s2
>   Disconnections: 0
>   Servers:
> s3 (60cf at unix:s3.raft) (self) next_index=7 match_index=6
> s2 (46aa at unix:s2.raft) next_index=7 match_index=6 last msg 58 ms
ago
> s1 (28f7 at unix:s1.raft) next_index=7 match_index=6 last msg 59 ms
ago
>
> Fix the first scenario by examining server changes in committed log
> entries.  This way server A can transition C to a STABLE phase and
> server C can find itself in the committed list of servers and move out
> from a joining state.  This is similar to completing commands without
> receiving an explicit reply or after the role change from leader to
> follower.
>
> The second scenario with a leader in a joining state can be fixed
> when the joining server becomes leader.  New leader's log is getting
> committed automatically and all servers transition into STABLE phase
> for it, but it should also move on from a joining state, since it
> leads the cluster now.
> It is also possible that B transfers leadership to C before the list
> of servers is marked as committed on other servers.  In this case
> C will commit it's own addition to the cluster configuration.
>
> The added test usually triggers both scenarios, but it will trigger at
> least one of them.
>
> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered
databases.")
> Signed-off-by: Ilya Maximets 
> ---
>  ovsdb/raft.c   | 44 ++-
>  tests/ovsdb-cluster.at | 53 ++
>  2 files changed, 96 insertions(+), 1 deletion(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index 57e27bf73..237d7ebf5 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -385,6 +385,7 @@ static void raft_get_servers_from_log(struct raft *,
enum vlog_level);
>  static void raft_get_election_timer_from_log(struct raft *);
>
>  static bool raft_handle_write_error(struct raft *, struct ovsdb_error *);
> +static bool raft_has_uncommitted_configuration(const struct raft *);
>
>  static void raft_run_reconfigure(struct raft *);
>
> @@ -2810,6 +2811,18 @@ raft_become_leader(struct raft *raft)
>  raft_reset_election_timer(raft);
>  raft_reset_ping_timer(raft);
>
> +if (raft->joining) {
> +/* It is possible that the server committing this one to the
list of
> + * servers lost leadership before the entry is committed but
after
> + * it was already replicated to majority of servers.  In this
case
> + * other servers will recognize this one as a valid cluster
member
> + * and may transfer leadership to it and vote for it.  This way
> + * we're becoming a cluster leader without receiving reply for a
> + * join request and will commit addition of this server
ourselves. */
> +VLOG_INFO_RL(, "elected as leader while joining");
> +raft->joining = false;
> +}
> +
>  struct raft_server *s;
>  HMAP_FOR_EACH (s, hmap_node, >servers) {
>  raft_server_init_leader(raft, s);
> @@ -2968,12 +2981,12 @@ 

Re: [ovs-dev] [PATCH 2/5] ovsdb: raft: Fix time intervals for multitasking while joining.

2024-03-25 Thread Han Zhou
On Fri, Mar 15, 2024 at 1:15 PM Ilya Maximets  wrote:
>
> While joining, ovsdb-server may not wake up for a duration of a join
> timer, which is 1 second and is by default 3x larger than a heartbeat
> timer.  This is causing unnecessary warnings from the cooperative
> multitasking module that thinks that we missed the heartbeat time by
> a lot.
>
> Use join timer (1000) instead while joining.
>
> Fixes: d4a15647b917 ("ovsdb: raft: Enable cooperative multitasking.")
> Signed-off-by: Ilya Maximets 
> ---
>
> CC: Frode Nordahl 
>
>  ovsdb/raft.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index 25f462431..57e27bf73 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -2126,10 +2126,11 @@ raft_run(struct raft *raft)
>  raft_reset_ping_timer(raft);
>  }
>
> +uint64_t interval = raft->joining
> +? 1000 :
RAFT_TIMER_THRESHOLD(raft->election_timer);

nit: the hardcoded joining timer value 1000 is used at least 3 places, so
probably better to define a macro for it.

Acked-by: Han Zhou 

>  cooperative_multitasking_set(
>  _run_cb, (void *) raft, time_msec(),
> -RAFT_TIMER_THRESHOLD(raft->election_timer)
> -+ RAFT_TIMER_THRESHOLD(raft->election_timer) / 10, "raft_run");
> +interval + interval / 10, "raft_run");
>
>  /* Do this only at the end; if we did it as soon as we set
raft->left or
>   * raft->failed in handling the RemoveServerReply, then it could
easily
> --
> 2.43.0
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v29 8/8] system-offloads-traffic.at: Add sFlow offload test cases.

2024-03-25 Thread 0-day Robot
Bleep bloop.  Greetings Chris Mi, I am a robot and I have tried out your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 111 characters long (recommended limit is 79)
#39 FILE: Documentation/howto/tc-offload.rst:101:

recirc_id(0),tunnel(tun_id=0x4,src=192.168.1.1,dst=192.168.1.2,flags(-df+csum+key)),in_port(vxlan_sys_4789)

WARNING: Line is 115 characters long (recommended limit is 79)
#43 FILE: Documentation/howto/tc-offload.rst:105:

recirc_id(0),tunnel(tun_id=0x4,src=192.168.1.1,dst=192.168.1.2,tp_dst=4789,flags(+key)),in_port(vxlan_sys_4789)

WARNING: Line is 99 characters long (recommended limit is 79)
#47 FILE: Documentation/howto/tc-offload.rst:109:

actions:set(tunnel(tun_id=0x4,dst=192.168.1.1,ttl=64,tp_dst=4789,flags(df|key))),vxlan_sys_4789

WARNING: Line is 97 characters long (recommended limit is 79)
#51 FILE: Documentation/howto/tc-offload.rst:113:

actions:set(tunnel(tun_id=0x4,dst=192.168.1.64,ttl=64,tp_dst=4789,flags(key))),vxlan_sys_4789

Lines checked: 268, Warnings: 4, Errors: 0


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v29 8/8] system-offloads-traffic.at: Add sFlow offload test cases.

2024-03-25 Thread Chris Mi via dev
Add three sFlow offload test cases:

  3: offloads - sflow with sampling=1 - offloads enabled ok
  4: offloads - sflow with sampling=2 - offloads enabled ok
  5: offloads - ping over vxlan tunnel with sflow - offloads enabled ok

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
Acked-by: Eelco Chaudron 
---
 Documentation/howto/tc-offload.rst |  24 
 tests/system-common-macros.at  |  13 +++
 tests/system-offloads-traffic.at   | 177 +
 3 files changed, 214 insertions(+)

diff --git a/Documentation/howto/tc-offload.rst 
b/Documentation/howto/tc-offload.rst
index ee7f73f8a..ad85ffaa0 100644
--- a/Documentation/howto/tc-offload.rst
+++ b/Documentation/howto/tc-offload.rst
@@ -88,6 +88,30 @@ As you can see above the none-offload case reports 140 bytes 
more, which is 14
 bytes per packet. This represents the L2 header, in this case, 2 * *Ethernet
 address* + *Ethertype*.
 
+Tunnel offload
+++
+
+Current tunnel offload ignores DF and CSUM flags configuration requested by
+the user. TC for now has no way to pass these flags in a flower key and their
+masks are set by default. To make tunnel offload work, DF and CSUM flags
+are cleared. So please be aware of the following differences.
+
+Dumping vxlan decap match without offload, it shows::
+
+
recirc_id(0),tunnel(tun_id=0x4,src=192.168.1.1,dst=192.168.1.2,flags(-df+csum+key)),in_port(vxlan_sys_4789)
+
+Dumping vxlan decap match with offload, it shows::
+
+
recirc_id(0),tunnel(tun_id=0x4,src=192.168.1.1,dst=192.168.1.2,tp_dst=4789,flags(+key)),in_port(vxlan_sys_4789)
+
+Dumping vxlan encap action without offload, it shows::
+
+
actions:set(tunnel(tun_id=0x4,dst=192.168.1.1,ttl=64,tp_dst=4789,flags(df|key))),vxlan_sys_4789
+
+Dumping vxlan encap action with offload, it shows::
+
+
actions:set(tunnel(tun_id=0x4,dst=192.168.1.64,ttl=64,tp_dst=4789,flags(key))),vxlan_sys_4789
+
 TC Meter Offload
 
 
diff --git a/tests/system-common-macros.at b/tests/system-common-macros.at
index 01ebe364e..bb4bb09c6 100644
--- a/tests/system-common-macros.at
+++ b/tests/system-common-macros.at
@@ -374,3 +374,16 @@ m4_define([OVS_CHECK_CT_CLEAR],
 # OVS_CHECK_GITHUB_ACTION
 m4_define([OVS_CHECK_GITHUB_ACTION],
 [AT_SKIP_IF([test "$GITHUB_ACTIONS" = "true"])])
+
+# LOAD_MODULE([name])
+#
+# Tries to load specified kernel module and removes it after
+# if it wasn't loaded before this call.
+#
+m4_define([LOAD_MODULE],
+[if ! lsmod | grep -q $1; then
+ on_exit 'modprobe -q -r $1'
+ fi
+ AT_CHECK([modprobe $1])
+]
+)
diff --git a/tests/system-offloads-traffic.at b/tests/system-offloads-traffic.at
index 6bd49a3ee..334689bf9 100644
--- a/tests/system-offloads-traffic.at
+++ b/tests/system-offloads-traffic.at
@@ -93,6 +93,183 @@ AT_CHECK([ovs-appctl upcall/show | grep -E "offloaded flows 
: [[1-9]]"], [0], [i
 OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
 
+
+AT_SETUP([offloads - sflow with sampling=1 - offloads enabled])
+LOAD_MODULE([psample])
+OVS_TRAFFIC_VSWITCHD_START([], [], [-- set Open_vSwitch . 
other_config:hw-offload=true])
+
+on_exit 'kill `cat test-sflow.pid`'
+AT_CHECK([ovstest test-sflow --log-file --detach --no-chdir --pidfile 
0:127.0.0.1 > sflow.log],
+  [0], [], [ignore])
+AT_CAPTURE_FILE([sflow.log])
+PARSE_LISTENING_PORT([test-sflow.log], [SFLOW_PORT])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+
+AT_CHECK([ovs-vsctl -- --id=@sflow create sflow agent=lo 
target=\"127.0.0.1:$SFLOW_PORT\" \
+  header=128 sampling=1 polling=100 -- set bridge br0 sflow=@sflow], [0], 
[ignore])
+NS_CHECK_EXEC([at_ns0], [ping -q -c 100 -i 0.1 -w 12 10.1.1.2 | FORMAT_PING], 
[0], [dnl
+100 packets transmitted, 100 received, 0% packet loss, time 0ms
+])
+
+p1_ifindex=$(ip link show ovs-p1 | head -1 | cut -d ':' -f 1)
+m4_define([DUMP_SFLOW], [sed -e "
+  s/used:[[0-9]].[[0-9]]*s/used:0.001s/;
+  s/eth(src=[[a-z0-9:]]*,dst=[[a-z0-9:]]*)/eth(macs)/;
+  s/pid=[[0-9]]*/pid=1/;
+  s/output=$1/output=1/"])
+AT_CHECK([ovs-appctl dpctl/dump-flows type=tc,offloaded | grep "in_port(2)" | \
+  grep "eth_type(0x0800)" | grep "actions:userspace" | grep "sFlow" | \
+  DUMP_SFLOW([$p1_ifindex])], [0], [dnl
+recirc_id(0),in_port(2),eth(macs),eth_type(0x0800),ipv4(frag=no), packets:99, 
bytes:8316, used:0.001s, 
actions:userspace(pid=1,sFlow(vid=0,pcp=0,output=1),actions),3
+])
+
+p0_ifindex=$(ip link show ovs-p0 | head -1 | cut -d ':' -f 1)
+AT_CHECK([ovs-appctl dpctl/dump-flows type=tc,offloaded | grep "in_port(3)" | \
+  grep "eth_type(0x0800)" | grep "actions:userspace" | grep "sFlow" | \
+  DUMP_SFLOW([$p0_ifindex])], [0], [dnl
+recirc_id(0),in_port(3),eth(macs),eth_type(0x0800),ipv4(frag=no), packets:99, 
bytes:8316, used:0.001s, 
actions:userspace(pid=1,sFlow(vid=0,pcp=0,output=1),actions),2
+])
+
+hdr="in_ifindex=$p0_ifindex in_format=0 out_ifindex=$p1_ifindex out_format=0 

[ovs-dev] [PATCH v29 7/8] netdev-offload-tc: Add offload support for sFlow.

2024-03-25 Thread Chris Mi via dev
Create a unique group ID to map the sFlow info when offloading sample
action to TC. When showing the offloaded datapath flows, translate the
group ID from TC sample action to sFlow info using the mapping.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
 NEWS|   2 +
 lib/netdev-offload-tc.c | 301 
 lib/tc.c|  63 -
 lib/tc.h|   6 +
 4 files changed, 346 insertions(+), 26 deletions(-)

diff --git a/NEWS b/NEWS
index c9e4064e6..d4bc8b1e1 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,8 @@ Post-v3.3.0
  * Conntrack now supports 'random' flag for selecting ports in a range
while natting and 'persistent' flag for selection of the IP address
from a range.
+   - Linux TC offload:
+ * Add support for offloading sflow via tc sample.
 
 
 v3.3.0 - 16 Feb 2024
diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index 12a0556b2..98a1e04ca 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -97,6 +97,7 @@ static int police_idx_lookup(uint32_t police_idx, uint32_t 
*meter_id);
 static int netdev_tc_parse_nl_actions(struct netdev *netdev,
   struct tc_flower *flower,
   struct offload_info *info,
+  const struct flow_tnl *tnl,
   const struct nlattr *actions,
   size_t actions_len,
   bool *recirc_act, bool more_actions,
@@ -138,6 +139,12 @@ static struct ovs_mutex sgid_lock = OVS_MUTEX_INITIALIZER;
 static struct cmap sgid_map = CMAP_INITIALIZER;
 static struct id_pool *sample_group_ids OVS_GUARDED_BY(sgid_lock);
 
+static bool
+psample_supported(void)
+{
+return psample_sock != NULL;
+}
+
 static void
 sgid_node_free(struct sgid_node *node)
 {
@@ -238,6 +245,30 @@ sgid_free(uint32_t id)
 }
 }
 
+static void free_all_flower_sgids(struct tc_flower *flower)
+{
+const struct tc_action *action = flower->actions;
+
+for (int i = 0; i < flower->action_count; i++, action++) {
+if (action->type == TC_ACT_SAMPLE) {
+sgid_free(action->sample.group_id);
+}
+}
+}
+
+static unsigned int get_flower_sgid_count(struct tc_flower *flower)
+{
+const struct tc_action *action = flower->actions;
+unsigned int count = 0;
+
+for (int i = 0; i < flower->action_count; i++, action++) {
+if (action->type == TC_ACT_SAMPLE && action->sample.group_id) {
+count++;
+}
+}
+return count;
+}
+
 static bool
 is_internal_port(const char *type)
 {
@@ -372,7 +403,12 @@ del_ufid_tc_mapping_unlocked(const ovs_u128 *ufid)
 hmap_remove(_to_tc, >ufid_to_tc_node);
 hmap_remove(_to_ufid, >tc_to_ufid_node);
 netdev_close(data->netdev);
-sgid_free(data->sample_group_id[0]);
+for (int i = 0; i < MAX_TC_SAMPLES_PER_FLOW; i++) {
+if (!data->sample_group_id[i]) {
+break;
+}
+sgid_free(data->sample_group_id[i]);
+}
 free(data);
 }
 
@@ -428,10 +464,12 @@ del_filter_and_ufid_mapping(struct tcf_id *id, const 
ovs_u128 *ufid,
 /* Add ufid entry to ufid_to_tc hashmap. */
 static void
 add_ufid_tc_mapping(struct netdev *netdev, const ovs_u128 *ufid,
-struct tcf_id *id, struct dpif_flow_stats *stats)
+struct tcf_id *id, struct dpif_flow_stats *stats,
+struct tc_flower *flower)
 {
 struct ufid_tc_data *new_data = xzalloc(sizeof *new_data);
 size_t ufid_hash = hash_bytes(ufid, sizeof *ufid, 0);
+const struct tc_action *action = flower->actions;
 size_t tc_hash;
 
 tc_hash = hash_int(hash_int(id->prio, id->handle), id->ifindex);
@@ -444,6 +482,15 @@ add_ufid_tc_mapping(struct netdev *netdev, const ovs_u128 
*ufid,
 new_data->adjust_stats = *stats;
 }
 
+for (int i = 0, si = 0; i < flower->action_count; i++, action++) {
+if (action->type == TC_ACT_SAMPLE && action->sample.group_id) {
+new_data->sample_group_id[si++] = action->sample.group_id;
+if (si >= MAX_TC_SAMPLES_PER_FLOW) {
+break;
+}
+}
+}
+
 ovs_mutex_lock(_lock);
 hmap_insert(_to_tc, _data->ufid_to_tc_node, ufid_hash);
 hmap_insert(_to_ufid, _data->tc_to_ufid_node, tc_hash);
@@ -931,6 +978,19 @@ parse_tc_flower_to_actions__(struct tc_flower *flower, 
struct ofpbuf *buf,
 action = >actions[i];
 
 switch (action->type) {
+case TC_ACT_SAMPLE: {
+const struct sgid_node *node;
+
+node = sgid_find(action->sample.group_id);
+if (!node) {
+VLOG_WARN("Can't find sample group ID data for ID: %u",
+  action->sample.group_id);
+return ENOENT;
+}
+nl_msg_put(buf, node->sample.action,
+  

[ovs-dev] [PATCH v29 6/8] dpif-netlink: Add netdev offload recv in normal recv upcalls.

2024-03-25 Thread Chris Mi via dev
In thread handler 0, add netdev offload recv in normal recv upcalls.
To avoid starvation, introduce a flag to alternate the order of
receiving normal upcalls and offload upcalls based on that flag.

Add similar change for recv_wait.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
 lib/dpif-netlink.c| 41 ++-
 ofproto/ofproto-dpif-upcall.c | 16 ++
 2 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 84e2bd8ea..3e2455e56 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -201,6 +201,11 @@ struct dpif_handler {
 struct nl_sock *sock; /* Each handler thread holds one netlink
  socket. */
 
+/* The receive handler thread deals with both normal and offload receive
+ * upcalls. To avoid starvation, the below flag is used to alternate the
+ * processing order. */
+bool recv_offload_first;
+
 #ifdef _WIN32
 /* Pool of sockets. */
 struct dpif_windows_vport_sock *vport_sock_pool;
@@ -3010,7 +3015,6 @@ dpif_netlink_recv_windows(struct dpif_netlink *dpif, 
uint32_t handler_id,
 static int
 dpif_netlink_recv_cpu_dispatch(struct dpif_netlink *dpif, uint32_t handler_id,
struct dpif_upcall *upcall, struct ofpbuf *buf)
-OVS_REQ_RDLOCK(dpif->upcall_lock)
 {
 struct dpif_handler *handler;
 int read_tries = 0;
@@ -3061,7 +3065,6 @@ dpif_netlink_recv_vport_dispatch(struct dpif_netlink 
*dpif,
  uint32_t handler_id,
  struct dpif_upcall *upcall,
  struct ofpbuf *buf)
-OVS_REQ_RDLOCK(dpif->upcall_lock)
 {
 struct dpif_handler *handler;
 int read_tries = 0;
@@ -3135,13 +3138,12 @@ dpif_netlink_recv_vport_dispatch(struct dpif_netlink 
*dpif,
 #endif
 
 static int
-dpif_netlink_recv(struct dpif *dpif_, uint32_t handler_id,
-  struct dpif_upcall *upcall, struct ofpbuf *buf)
+dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
+struct dpif_upcall *upcall, struct ofpbuf *buf)
+OVS_REQ_RDLOCK(dpif->upcall_lock)
 {
-struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
 int error;
 
-fat_rwlock_rdlock(>upcall_lock);
 #ifdef _WIN32
 error = dpif_netlink_recv_windows(dpif, handler_id, upcall, buf);
 #else
@@ -3152,6 +3154,32 @@ dpif_netlink_recv(struct dpif *dpif_, uint32_t 
handler_id,
  handler_id, upcall, buf);
 }
 #endif
+
+return error;
+}
+
+static int
+dpif_netlink_recv(struct dpif *dpif_, uint32_t handler_id,
+  struct dpif_upcall *upcall, struct ofpbuf *buf)
+{
+struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
+struct dpif_handler *handler;
+int error;
+
+fat_rwlock_rdlock(>upcall_lock);
+handler = >handlers[handler_id];
+if (handler->recv_offload_first) {
+error = netdev_offload_recv(upcall, buf, handler_id);
+if (error == EAGAIN) {
+error = dpif_netlink_recv__(dpif, handler_id, upcall, buf);
+}
+} else {
+error = dpif_netlink_recv__(dpif, handler_id, upcall, buf);
+if (error == EAGAIN) {
+error = netdev_offload_recv(upcall, buf, handler_id);
+}
+}
+handler->recv_offload_first = !handler->recv_offload_first;
 fat_rwlock_unlock(>upcall_lock);
 
 return error;
@@ -3217,6 +3245,7 @@ dpif_netlink_recv_wait(struct dpif *dpif_, uint32_t 
handler_id)
 dpif_netlink_recv_wait_vport_dispatch(dpif, handler_id);
 }
 #endif
+netdev_offload_recv_wait(handler_id);
 fat_rwlock_unlock(>upcall_lock);
 }
 
diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index d88195636..b60d15210 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -881,10 +881,18 @@ recv_upcalls(struct handler *handler)
 break;
 }
 
-upcall->fitness = odp_flow_key_to_flow(dupcall->key, dupcall->key_len,
-   flow, NULL);
-if (upcall->fitness == ODP_FIT_ERROR) {
-goto free_dupcall;
+/* If key and key_len are available, use them to construct flow.
+ * Otherwise, use upcall->flow. */
+if (dupcall->key && dupcall->key_len) {
+upcall->fitness = odp_flow_key_to_flow(dupcall->key,
+   dupcall->key_len,
+   flow, NULL);
+if (upcall->fitness == ODP_FIT_ERROR) {
+goto free_dupcall;
+}
+} else {
+upcall->fitness = ODP_FIT_PERFECT;
+flow = >flow;
 }
 
 if (dupcall->mru) {
-- 
2.37.1

___
dev mailing list
d...@openvswitch.org

[ovs-dev] [PATCH v29 4/8] netdev-offload-tc: Add sample offload API for TC.

2024-03-25 Thread Chris Mi via dev
Initialize psample socket. Add sample recv API to receive sampled
packets from psample socket. Add sample recv wait API to add psample
socket fd to poll list.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
 lib/dpif.h|   6 +-
 lib/flow.h|   2 +-
 lib/netdev-offload-provider.h |  30 ++
 lib/netdev-offload-tc.c   | 173 ++
 lib/netdev-offload.c  |   3 +-
 lib/packets.h |   2 +-
 6 files changed, 211 insertions(+), 5 deletions(-)

diff --git a/lib/dpif.h b/lib/dpif.h
index 0f2dc2ef3..7a0ea19e9 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -836,8 +836,10 @@ struct dpif_upcall {
 
 /* DPIF_UC_ACTION only. */
 struct nlattr *userdata;/* Argument to OVS_ACTION_ATTR_USERSPACE. */
-struct nlattr *out_tun_key;/* Output tunnel key. */
-struct nlattr *actions;/* Argument to OVS_ACTION_ATTR_USERSPACE. */
+struct nlattr *out_tun_key; /* Output tunnel key. */
+struct nlattr *actions; /* Argument to OVS_ACTION_ATTR_USERSPACE. */
+struct flow flow;   /* Caller provided 'flow' if the 'key' is not
+   available. */
 };
 
 /* A callback to notify higher layer of dpif about to be purged, so that
diff --git a/lib/flow.h b/lib/flow.h
index 75a9be3c1..626986182 100644
--- a/lib/flow.h
+++ b/lib/flow.h
@@ -970,7 +970,7 @@ pkt_metadata_from_flow(struct pkt_metadata *md, const 
struct flow *flow)
 
 md->recirc_id = flow->recirc_id;
 md->dp_hash = flow->dp_hash;
-flow_tnl_copy__(>tunnel, >tunnel);
+flow_tnl_copy(>tunnel, >tunnel);
 md->skb_priority = flow->skb_priority;
 md->pkt_mark = flow->pkt_mark;
 md->in_port = flow->in_port;
diff --git a/lib/netdev-offload-provider.h b/lib/netdev-offload-provider.h
index 9108856d1..53c272d07 100644
--- a/lib/netdev-offload-provider.h
+++ b/lib/netdev-offload-provider.h
@@ -28,6 +28,8 @@
 extern "C" {
 #endif
 
+struct dpif_upcall;
+
 struct netdev_flow_api {
 char *type;
 /* Flush all offloaded flows from a netdev.
@@ -121,6 +123,34 @@ struct netdev_flow_api {
 int (*meter_del)(ofproto_meter_id meter_id,
  struct ofputil_meter_stats *stats);
 
+/* Polls for upcall offload packets for an upcall handler. If successful,
+ * stores the upcall into '*upcall', using 'buf' for storage.
+ *
+ * The implementation should point '>flow' and 'upcall->userdata'
+ * (if any) into data in the caller-provided 'buf'.  The implementation may
+ * also use 'buf' for storing the data of 'upcall->packet'.  If necessary
+ * to make room, the implementation may reallocate the data in 'buf'.
+ *
+ * The caller owns the data of 'upcall->packet' and may modify it.  If
+ * packet's headroom is exhausted as it is manipulated, 'upcall->packet'
+ * will be reallocated.  This requires the data of 'upcall->packet' to be
+ * released with ofpbuf_uninit() before 'upcall' is destroyed.  However,
+ * when an error is returned, the 'upcall->packet' may be uninitialized
+ * and should not be released.
+ *
+ * This function must not block. If no upcall is pending when it is
+ * called, it should return EAGAIN without blocking.
+ *
+ * Return 0 if successful, otherwise returns a positive errno value.
+ */
+int (*recv)(struct dpif_upcall *upcall, struct ofpbuf *buf,
+uint32_t handler_id);
+
+/* Arranges for the poll loop for an upcall handler to wake up when
+ * offload provider has a message queued to be received with the recv
+ * member functions. */
+void (*recv_wait)(uint32_t handler_id);
+
 /* Initializies the netdev flow api.
  * Return 0 if successful, otherwise returns a positive errno value. */
 int (*init_flow_api)(struct netdev *);
diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index bdf111942..12a0556b2 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -18,6 +18,8 @@
 
 #include 
 #include 
+#include 
+#include 
 
 #include "cmap.h"
 #include "dpif-provider.h"
@@ -35,6 +37,7 @@
 #include "openvswitch/hmap.h"
 #include "openvswitch/match.h"
 #include "openvswitch/ofpbuf.h"
+#include "openvswitch/poll-loop.h"
 #include "openvswitch/thread.h"
 #include "openvswitch/types.h"
 #include "openvswitch/util.h"
@@ -126,6 +129,9 @@ struct sgid_node {
 struct offload_sample sample;
 };
 
+static struct nl_sock *psample_sock;
+static int psample_family;
+
 /* The sgid_map mutex protects the sample_group_ids and the sgid_map for
  * cmap_insert(), cmap_remove(), or cmap_replace() operations. */
 static struct ovs_mutex sgid_lock = OVS_MUTEX_INITIALIZER;
@@ -157,6 +163,14 @@ sgid_find(uint32_t id)
 return node ? CONTAINER_OF(node, struct sgid_node, id_node) : NULL;
 }
 
+static struct offload_sample *
+sample_find(uint32_t id)
+{
+struct sgid_node *node = sgid_find(id);
+
+return node ? >sample: NULL;
+}
+
 static 

[ovs-dev] [PATCH v29 5/8] netdev-offload: Add netdev offload recv and recv_wait APIs.

2024-03-25 Thread Chris Mi via dev
Iterate each registered offload API. It's not a problem for today
since we only have one implementation.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
 lib/netdev-offload.c | 38 ++
 lib/netdev-offload.h |  5 +
 2 files changed, 43 insertions(+)

diff --git a/lib/netdev-offload.c b/lib/netdev-offload.c
index 5aad804c1..185142484 100644
--- a/lib/netdev-offload.c
+++ b/lib/netdev-offload.c
@@ -257,6 +257,44 @@ meter_offload_del(ofproto_meter_id meter_id, struct 
ofputil_meter_stats *stats)
 return 0;
 }
 
+int
+netdev_offload_recv(struct dpif_upcall *upcall, struct ofpbuf *buf,
+uint32_t handler_id)
+{
+struct netdev_registered_flow_api *rfa;
+int ret = EAGAIN;
+
+CMAP_FOR_EACH (rfa, cmap_node, _flow_apis) {
+if (rfa->flow_api->recv) {
+ret = rfa->flow_api->recv(upcall, buf, handler_id);
+if (!ret) {
+return 0;
+}
+
+if (ret != EAGAIN) {
+VLOG_DBG_RL(, "Failed to receive offload packet, %s, "
+"type: %s", ovs_strerror(ret),
+rfa->flow_api->type);
+}
+} else {
+ret = EAGAIN;
+   }
+}
+return ret;
+}
+
+void
+netdev_offload_recv_wait(uint32_t handler_id)
+{
+struct netdev_registered_flow_api *rfa;
+
+CMAP_FOR_EACH (rfa, cmap_node, _flow_apis) {
+if (rfa->flow_api->recv_wait) {
+rfa->flow_api->recv_wait(handler_id);
+}
+}
+}
+
 int
 netdev_flow_flush(struct netdev *netdev)
 {
diff --git a/lib/netdev-offload.h b/lib/netdev-offload.h
index 47f8e6f48..c20e2c26e 100644
--- a/lib/netdev-offload.h
+++ b/lib/netdev-offload.h
@@ -33,6 +33,7 @@ extern "C" {
 
 struct dp_packet_batch;
 struct dp_packet;
+struct dpif_upcall;
 struct netdev_class;
 struct netdev_rxq;
 struct netdev_saved_flags;
@@ -162,6 +163,10 @@ void meter_offload_set(ofproto_meter_id, struct 
ofputil_meter_config *);
 int meter_offload_get(ofproto_meter_id, struct ofputil_meter_stats *);
 int meter_offload_del(ofproto_meter_id, struct ofputil_meter_stats *);
 
+int netdev_offload_recv(struct dpif_upcall *, struct ofpbuf *,
+uint32_t handler_id);
+void netdev_offload_recv_wait(uint32_t handler_id);
+
 #ifdef  __cplusplus
 }
 #endif
-- 
2.37.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v29 3/8] netdev-offload-tc: Introduce group ID management API.

2024-03-25 Thread Chris Mi via dev
When offloading sample action to TC, userspace creates a unique ID
to map sample action and tunnel info and passes this ID to kernel
instead of the sample info. Kernel will send this ID and sampled
packet to userspace. Using the ID, userspace can recover the sample
info and send sampled packet to the right sample monitoring host.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
 lib/netdev-offload-tc.c | 156 +---
 lib/util.c  |   6 ++
 lib/util.h  |   1 +
 3 files changed, 154 insertions(+), 9 deletions(-)

diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index 921d52317..bdf111942 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -19,28 +19,29 @@
 #include 
 #include 
 
+#include "cmap.h"
+#include "dpif-provider.h"
 #include "dpif.h"
 #include "hash.h"
 #include "id-pool.h"
-#include "openvswitch/hmap.h"
-#include "openvswitch/match.h"
-#include "openvswitch/ofpbuf.h"
-#include "openvswitch/thread.h"
-#include "openvswitch/types.h"
-#include "openvswitch/util.h"
-#include "openvswitch/vlog.h"
 #include "netdev-linux.h"
 #include "netdev-offload-provider.h"
 #include "netdev-provider.h"
 #include "netdev-vport.h"
-#include "netlink.h"
 #include "netlink-socket.h"
+#include "netlink.h"
 #include "odp-netlink.h"
 #include "odp-util.h"
+#include "openvswitch/hmap.h"
+#include "openvswitch/match.h"
+#include "openvswitch/ofpbuf.h"
+#include "openvswitch/thread.h"
+#include "openvswitch/types.h"
+#include "openvswitch/util.h"
+#include "openvswitch/vlog.h"
 #include "tc.h"
 #include "unaligned.h"
 #include "util.h"
-#include "dpif-provider.h"
 
 VLOG_DEFINE_THIS_MODULE(netdev_offload_tc);
 
@@ -104,6 +105,125 @@ static void parse_tc_flower_to_stats(struct tc_flower 
*flower,
 static int get_ufid_adjust_stats(const ovs_u128 *ufid,
  struct dpif_flow_stats *stats);
 
+/* When offloading sample action to TC, userspace creates a unique ID
+ * to map sample action and tunnel info and passes this ID to kernel
+ * instead of the sample info. Kernel will send this ID and sampled
+ * packet to userspace. Using the ID, userspace can recover the sample
+ * info and send sampled packet to the right sample monitoring host. */
+struct offload_sample {
+uint16_t type; /* enum user_action_cookie_type. */
+struct nlattr *action; /* Sample action. Used in flow_get. */
+struct nlattr *userdata;   /* Struct user_action_cookie. */
+struct nlattr *userspace_actions;  /* All actions to get output tunnel. */
+struct flow_tnl *tunnel;   /* Input tunnel. */
+uint16_t ifindex;  /* Input ifindex. */
+};
+
+/* This maps a sample group ID to struct offload_sample. */
+struct sgid_node {
+struct cmap_node id_node;
+uint32_t id;
+struct offload_sample sample;
+};
+
+/* The sgid_map mutex protects the sample_group_ids and the sgid_map for
+ * cmap_insert(), cmap_remove(), or cmap_replace() operations. */
+static struct ovs_mutex sgid_lock = OVS_MUTEX_INITIALIZER;
+static struct cmap sgid_map = CMAP_INITIALIZER;
+static struct id_pool *sample_group_ids OVS_GUARDED_BY(sgid_lock);
+
+static void
+sgid_node_free(struct sgid_node *node)
+{
+if (node) {
+if (node->id) {
+ovs_mutex_lock(_lock);
+id_pool_free_id(sample_group_ids, node->id);
+ovs_mutex_unlock(_lock);
+}
+free(node->sample.tunnel);
+free(node->sample.action);
+free(node->sample.userspace_actions);
+free(node->sample.userdata);
+free(node);
+}
+}
+
+static struct sgid_node *
+sgid_find(uint32_t id)
+{
+const struct cmap_node *node = cmap_find(_map, id);
+
+return node ? CONTAINER_OF(node, struct sgid_node, id_node) : NULL;
+}
+
+static void
+offload_sample_clone(struct offload_sample *dst,
+ const struct offload_sample *src,
+ bool steal_userspace_actions)
+{
+dst->type = src->type;
+dst->action = nullable_xmemdup(src->action, src->action ?
+src->action->nla_len : 0);
+if (steal_userspace_actions) {
+dst->userspace_actions = src->userspace_actions;
+} else {
+dst->userspace_actions = src->userspace_actions
+? xmemdup(src->userspace_actions,
+  src->userspace_actions->nla_len)
+: NULL;
+}
+dst->userdata = nullable_xmemdup(src->userdata, src->userdata ?
+src->userdata->nla_len :
+0);
+dst->tunnel = nullable_xmemdup(src->tunnel, sizeof *src->tunnel);
+dst->ifindex = src->ifindex;
+}
+
+/* Allocate a unique group id for the given set of flow metadata. The id
+ * space is 2**32 - 1. 0 is reserved. */
+static uint32_t
+sgid_alloc(const struct offload_sample *sample)
+{
+struct 

[ovs-dev] [PATCH v29 1/8] compat: Add psample and tc sample action defines for older kernels.

2024-03-25 Thread Chris Mi via dev
Update kernel UAPI to support psample and the tc sample action.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
Acked-by: Eelco Chaudron 
---
 include/linux/automake.mk| 10 +++---
 include/linux/psample.h  | 62 
 include/linux/tc_act/tc_sample.h | 25 +
 3 files changed, 93 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/psample.h
 create mode 100644 include/linux/tc_act/tc_sample.h

diff --git a/include/linux/automake.mk b/include/linux/automake.mk
index cdae5eedc..a397562c0 100644
--- a/include/linux/automake.mk
+++ b/include/linux/automake.mk
@@ -1,12 +1,14 @@
 noinst_HEADERS += \
-   include/linux/netlink.h \
+   include/linux/gen_stats.h \
include/linux/netfilter/nf_conntrack_sctp.h \
+   include/linux/netlink.h \
include/linux/openvswitch.h \
include/linux/pkt_cls.h \
-   include/linux/gen_stats.h \
+   include/linux/psample.h \
+   include/linux/tc_act/tc_ct.h \
include/linux/tc_act/tc_mpls.h \
include/linux/tc_act/tc_pedit.h \
+   include/linux/tc_act/tc_sample.h \
include/linux/tc_act/tc_skbedit.h \
include/linux/tc_act/tc_tunnel_key.h \
-   include/linux/tc_act/tc_vlan.h \
-   include/linux/tc_act/tc_ct.h
+   include/linux/tc_act/tc_vlan.h
diff --git a/include/linux/psample.h b/include/linux/psample.h
new file mode 100644
index 0..e585db5bf
--- /dev/null
+++ b/include/linux/psample.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __UAPI_PSAMPLE_H
+#define __UAPI_PSAMPLE_H
+
+enum {
+   PSAMPLE_ATTR_IIFINDEX,
+   PSAMPLE_ATTR_OIFINDEX,
+   PSAMPLE_ATTR_ORIGSIZE,
+   PSAMPLE_ATTR_SAMPLE_GROUP,
+   PSAMPLE_ATTR_GROUP_SEQ,
+   PSAMPLE_ATTR_SAMPLE_RATE,
+   PSAMPLE_ATTR_DATA,
+   PSAMPLE_ATTR_GROUP_REFCOUNT,
+   PSAMPLE_ATTR_TUNNEL,
+
+   PSAMPLE_ATTR_PAD,
+   PSAMPLE_ATTR_OUT_TC,/* u16 */
+   PSAMPLE_ATTR_OUT_TC_OCC,/* u64, bytes */
+   PSAMPLE_ATTR_LATENCY,   /* u64, nanoseconds */
+   PSAMPLE_ATTR_TIMESTAMP, /* u64, nanoseconds */
+   PSAMPLE_ATTR_PROTO, /* u16 */
+
+   __PSAMPLE_ATTR_MAX
+};
+
+enum psample_command {
+   PSAMPLE_CMD_SAMPLE,
+   PSAMPLE_CMD_GET_GROUP,
+   PSAMPLE_CMD_NEW_GROUP,
+   PSAMPLE_CMD_DEL_GROUP,
+};
+
+enum psample_tunnel_key_attr {
+   PSAMPLE_TUNNEL_KEY_ATTR_ID, /* be64 Tunnel ID */
+   PSAMPLE_TUNNEL_KEY_ATTR_IPV4_SRC,   /* be32 src IP address. */
+   PSAMPLE_TUNNEL_KEY_ATTR_IPV4_DST,   /* be32 dst IP address. */
+   PSAMPLE_TUNNEL_KEY_ATTR_TOS,/* u8 Tunnel IP ToS. */
+   PSAMPLE_TUNNEL_KEY_ATTR_TTL,/* u8 Tunnel IP TTL. */
+   PSAMPLE_TUNNEL_KEY_ATTR_DONT_FRAGMENT,  /* No argument, set DF. */
+   PSAMPLE_TUNNEL_KEY_ATTR_CSUM,   /* No argument. CSUM 
packet. */
+   PSAMPLE_TUNNEL_KEY_ATTR_OAM,/* No argument. OAM frame.  
*/
+   PSAMPLE_TUNNEL_KEY_ATTR_GENEVE_OPTS,/* Array of Geneve options. 
*/
+   PSAMPLE_TUNNEL_KEY_ATTR_TP_SRC, /* be16 src Transport Port. 
*/
+   PSAMPLE_TUNNEL_KEY_ATTR_TP_DST, /* be16 dst Transport Port. 
*/
+   PSAMPLE_TUNNEL_KEY_ATTR_VXLAN_OPTS, /* Nested VXLAN opts* */
+   PSAMPLE_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
address. */
+   PSAMPLE_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
address. */
+   PSAMPLE_TUNNEL_KEY_ATTR_PAD,
+   PSAMPLE_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* struct erspan_metadata */
+   PSAMPLE_TUNNEL_KEY_ATTR_IPV4_INFO_BRIDGE,   /* No argument. 
IPV4_INFO_BRIDGE mode.*/
+   __PSAMPLE_TUNNEL_KEY_ATTR_MAX
+};
+
+/* Can be overridden at runtime by module option */
+#define PSAMPLE_ATTR_MAX (__PSAMPLE_ATTR_MAX - 1)
+
+#define PSAMPLE_NL_MCGRP_CONFIG_NAME "config"
+#define PSAMPLE_NL_MCGRP_SAMPLE_NAME "packets"
+#define PSAMPLE_GENL_NAME "psample"
+#define PSAMPLE_GENL_VERSION 1
+
+#endif
diff --git a/include/linux/tc_act/tc_sample.h b/include/linux/tc_act/tc_sample.h
new file mode 100644
index 0..fee1bcc20
--- /dev/null
+++ b/include/linux/tc_act/tc_sample.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __LINUX_TC_SAMPLE_H
+#define __LINUX_TC_SAMPLE_H
+
+#include 
+#include 
+#include 
+
+struct tc_sample {
+   tc_gen;
+};
+
+enum {
+   TCA_SAMPLE_UNSPEC,
+   TCA_SAMPLE_TM,
+   TCA_SAMPLE_PARMS,
+   TCA_SAMPLE_RATE,
+   TCA_SAMPLE_TRUNC_SIZE,
+   TCA_SAMPLE_PSAMPLE_GROUP,
+   TCA_SAMPLE_PAD,
+   __TCA_SAMPLE_MAX
+};
+#define TCA_SAMPLE_MAX (__TCA_SAMPLE_MAX - 1)
+
+#endif
-- 
2.37.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v29 2/8] ovs-kmod-ctl: Load kernel module psample.

2024-03-25 Thread Chris Mi via dev
Load kernel module psample to receive sampled packets from TC.
Before removing kernel module psample, remove act_sample first.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
Acked-by: Eelco Chaudron 
---
 utilities/ovs-kmod-ctl.in | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/utilities/ovs-kmod-ctl.in b/utilities/ovs-kmod-ctl.in
index 19f100964..6fa945a83 100644
--- a/utilities/ovs-kmod-ctl.in
+++ b/utilities/ovs-kmod-ctl.in
@@ -28,6 +28,14 @@ for dir in "$sbindir" "$bindir" /sbin /bin /usr/sbin 
/usr/bin; do
 done
 
 insert_mods () {
+# Try loading psample kernel module.
+modinfo psample > /dev/null 2>&1
+if test $? = 0; then
+action "Inserting psample module" modprobe psample
+else
+log_warning_msg "No psample module, can't offload sFlow action"
+fi
+
 # Try loading openvswitch kernel module.
 action "Inserting openvswitch module" modprobe openvswitch
 }
@@ -95,6 +103,12 @@ remove_kmods() {
 if test -e /sys/module/vxlan; then
 action "Forcing removal of vxlan module" rmmod vxlan
 fi
+if test -e /sys/module/act_sample; then
+action "Forcing removal of act_sample module" rmmod act_sample
+fi
+if test -e /sys/module/psample; then
+action "Forcing removal of psample module" rmmod psample
+fi
 }
 
 usage () {
-- 
2.37.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v29 0/8] Add offload support for sFlow

2024-03-25 Thread Chris Mi via dev
This patch set adds offload support for sFlow.

Psample is a genetlink channel for packet sampling. TC action act_sample
uses psample to send sampled packets to userspace.

When offloading sample action to TC, userspace creates a unique ID to
map sFlow action and tunnel info and passes this ID to kernel instead
of the sFlow info. psample will send this ID and sampled packet to
userspace. Using the ID, userspace can recover the sFlow info and send
sampled packet to the right sFlow monitoring host.

v2-v1:
- Fix robot errors.
v3-v2:
- Remove Gerrit Change-Id.
- Add patch #9 to fix older kernels build issue.
- Add travis test result.
v4-v3:
- Fix offload issue when sampling rate is 1.
v5-v4:
- Move polling thread from ofproto to netdev-offload-tc.
v6-v5:
- Rebase.
- Add GitHub Actions test result.
v7-v6:
- Remove Gerrit Change-Id.
- Fix "ERROR: Inappropriate spacing around cast"
v8-v7
- Address Eelco Chaudron's comment for patch #11.
v9-v8
- Remove sflow_len from struct dpif_sflow_attr.
- Log a debug message for other userspace actions.
v10-v9
- Address Eelco Chaudron's comments on v9.
v11-v10
- Fix a bracing error.
v12-v11
- Add duplicate sample group id check.
v13-v12
- Remove the psample poll thread from netdev-offload-tc and reuse
  ofproto handler thread according to Ilya's new desgin.
- Add dpif-offload-provider layer according to Eli's suggestion.
v14-v13
- Fix a robot error.
v15-v14
- Address Eelco Chaudron's comments on v14.
v16-v15
- Address Eelco Chaudron's comments on v15.
- Add two test cases.
v17-v16
- Address Eelco Chaudron's comments on v16.
- Move struct dpif_offload_api from struct dpif_class to struct dpif.
v18-v17
- Rename dpif_offload_api to dpif_offload_class.
- Add init and destroy callbacks in dpif_offload_class. They are called
  when registering dpif_offload_class.
v19-18
- Fix a bug that psample_sock is destroyed when last bridge is deleted.
v20-19
- Move buf_stub to struct dpif_offload_sflow avoid garbage values when
  ofproto proceses the sampled packet.
v21-20
- Remove netdev dummy for dpif-offload according to Eelco's comment.
v22-21
- Address Ilya's comments:
 - Remove dpif-offload-provider layer.
 - Remove process_offload_sflow and reuse upcall_receive.
 - Introduce sample id pool.
 - Introduce netdev_offload_recv.
v23-22
- Address Ilya's comments:
 - Add struct flow in struct dpif_upcall.
 - Add handler id in recv() and recv_wait().
 - misc changes.
v24-23
- Fix checkpath and actions errors.
v25-24
- Address Eelco's comments:
 - Add tunnel test.
 - Change sample group id and sample info to 1:1 mapping.
 - Move sample group id from tcf_id to ufid_to_tc_data.
 - misc changes.
v26-25
- Address Eelco's comments:
 - Rename actions to userspace_actions.
 - Add in tunnel test.
v27-26
- Address Eelco's comments:
 - Introduce nullable_xmemdup() and flow_tnl_copy().
 - Improve sgid node locking.
 - Improve comments and documents.
 - Add support for multiple sgids.
 - Improve test cases.
 - misc changes.
v28-27
- Apply Eelco's three diff files directly.
v29-28
- Address Ilya's comments.


Chris Mi (8):
  compat: Add psample and tc sample action defines for older kernels
  ovs-kmod-ctl: Load kernel module psample
  netdev-offload-tc: Introduce group ID management API
  netdev-offload-tc: Add sample offload API for TC
  netdev-offload: Add netdev offload recv and recv_wait APIs
  dpif-netlink: Add netdev offload recv in normal recv upcalls
  netdev-offload-tc: Add offload support for sFlow
  system-offloads-traffic.at: Add sFlow offload test cases

 Documentation/howto/tc-offload.rst |  24 ++
 NEWS   |   2 +
 include/linux/automake.mk  |  10 +-
 include/linux/psample.h|  62 +++
 include/linux/tc_act/tc_sample.h   |  25 ++
 lib/dpif-netlink.c |  41 +-
 lib/dpif.h |   6 +-
 lib/flow.h |   2 +-
 lib/netdev-offload-provider.h  |  30 ++
 lib/netdev-offload-tc.c| 612 +++--
 lib/netdev-offload.c   |  41 +-
 lib/netdev-offload.h   |   5 +
 lib/packets.h  |   2 +-
 lib/tc.c   |  63 ++-
 lib/tc.h   |   6 +
 lib/util.c |   6 +
 lib/util.h |   1 +
 ofproto/ofproto-dpif-upcall.c  |  16 +-
 tests/system-common-macros.at  |  13 +
 tests/system-offloads-traffic.at   | 177 +
 utilities/ovs-kmod-ctl.in  |  14 +
 21 files changed, 1113 insertions(+), 45 deletions(-)
 create mode 100644 include/linux/psample.h
 create mode 100644 include/linux/tc_act/tc_sample.h

-- 
2.37.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v28 0/8] Add offload support for sFlow

2024-03-25 Thread Chris Mi via dev

On 6/24/2023 4:18 AM, Ilya Maximets wrote:

On 6/19/23 07:05, Chris Mi wrote:

This patch set adds offload support for sFlow.

Psample is a genetlink channel for packet sampling. TC action act_sample
uses psample to send sampled packets to userspace.

When offloading sample action to TC, userspace creates a unique ID to
map sFlow action and tunnel info and passes this ID to kernel instead
of the sFlow info. psample will send this ID and sampled packet to
userspace. Using the ID, userspace can recover the sFlow info and send
sampled packet to the right sFlow monitoring host.

v2-v1:
- Fix robot errors.
v3-v2:
- Remove Gerrit Change-Id.
- Add patch #9 to fix older kernels build issue.
- Add travis test result.
v4-v3:
- Fix offload issue when sampling rate is 1.
v5-v4:
- Move polling thread from ofproto to netdev-offload-tc.
v6-v5:
- Rebase.
- Add GitHub Actions test result.
v7-v6:
- Remove Gerrit Change-Id.
- Fix "ERROR: Inappropriate spacing around cast"
v8-v7
- Address Eelco Chaudron's comment for patch #11.
v9-v8
- Remove sflow_len from struct dpif_sflow_attr.
- Log a debug message for other userspace actions.
v10-v9
- Address Eelco Chaudron's comments on v9.
v11-v10
- Fix a bracing error.
v12-v11
- Add duplicate sample group id check.
v13-v12
- Remove the psample poll thread from netdev-offload-tc and reuse
   ofproto handler thread according to Ilya's new desgin.
- Add dpif-offload-provider layer according to Eli's suggestion.
v14-v13
- Fix a robot error.
v15-v14
- Address Eelco Chaudron's comments on v14.
v16-v15
- Address Eelco Chaudron's comments on v15.
- Add two test cases.
v17-v16
- Address Eelco Chaudron's comments on v16.
- Move struct dpif_offload_api from struct dpif_class to struct dpif.
v18-v17
- Rename dpif_offload_api to dpif_offload_class.
- Add init and destroy callbacks in dpif_offload_class. They are called
   when registering dpif_offload_class.
v19-18
- Fix a bug that psample_sock is destroyed when last bridge is deleted.
v20-19
- Move buf_stub to struct dpif_offload_sflow avoid garbage values when
   ofproto proceses the sampled packet.
v21-20
- Remove netdev dummy for dpif-offload according to Eelco's comment.
v22-21
- Address Ilya's comments:
  - Remove dpif-offload-provider layer.
  - Remove process_offload_sflow and reuse upcall_receive.
  - Introduce sample id pool.
  - Introduce netdev_offload_recv.
v23-22
- Address Ilya's comments:
  - Add struct flow in struct dpif_upcall.
  - Add handler id in recv() and recv_wait().
  - misc changes.
v24-23
- Fix checkpath and actions errors.
v25-24
- Address Eelco's comments:
  - Add tunnel test.
  - Change sample group id and sample info to 1:1 mapping.
  - Move sample group id from tcf_id to ufid_to_tc_data.
  - misc changes.
v26-25
- Address Eelco's comments:
  - Rename actions to userspace_actions.
  - Add in tunnel test.
v27-26
- Address Eelco's comments:
  - Introduce nullable_xmemdup() and flow_tnl_copy().
  - Improve sgid node locking.
  - Improve comments and documents.
  - Add support for multiple sgids.
  - Improve test cases.
  - misc changes.
v28-27
- Apply Eelco's three diff files directly.

Chris Mi (8):
   compat: Add psample and tc sample action defines for older kernels
   ovs-kmod-ctl: Load kernel module psample
   netdev-offload-tc: Introduce group ID management API
   netdev-offload-tc: Add sample offload API for TC
   netdev-offload: Add netdev offload recv and recv_wait APIs
   dpif-netlink: Add netdev offload recv in normal recv upcalls
   netdev-offload-tc: Add offload support for sFlow
   system-offloads-traffic.at: Add sFlow offload test cases

  Documentation/howto/tc-offload.rst |  24 ++
  include/linux/automake.mk  |  10 +-
  include/linux/psample.h|  62 +++
  include/linux/tc_act/tc_sample.h   |  25 ++
  lib/dpif-netlink.c |  41 +-
  lib/dpif.h |   6 +-
  lib/flow.h |   2 +-
  lib/netdev-offload-provider.h  |  30 ++
  lib/netdev-offload-tc.c| 598 +++--
  lib/netdev-offload.c   |  41 +-
  lib/netdev-offload.h   |   5 +
  lib/packets.h  |   2 +-
  lib/tc.c   |  63 ++-
  lib/tc.h   |   6 +
  lib/util.c |   6 +
  lib/util.h |   1 +
  ofproto/ofproto-dpif-upcall.c  |  15 +-
  tests/system-common-macros.at  |  13 +
  tests/system-offloads-traffic.at   | 160 
  utilities/ovs-kmod-ctl.in  |  14 +
  20 files changed, 1079 insertions(+), 45 deletions(-)
  create mode 100644 include/linux/psample.h
  create mode 100644 include/linux/tc_act/tc_sample.h




Hi, Chris, others.  I see a significant memory leak reported by ASAN
while running check_pkt_len offload tests:

13: offloads - check_pkt_len action - offloads enabled FAILED 
(ovs-macros.at:222)
=
==1258509==ERROR: 

Re: [ovs-dev] [PATCH v28 8/8] system-offloads-traffic.at: Add sFlow offload test cases

2024-03-25 Thread Chris Mi via dev

On 6/24/2023 4:18 AM, Ilya Maximets wrote:

On 6/19/23 07:05, Chris Mi wrote:

Add three sFlow offload test cases:

   3: offloads - sflow with sampling=1 - offloads enabled ok
   4: offloads - sflow with sampling=2 - offloads enabled ok
   5: offloads - ping over vxlan tunnel with sflow - offloads enabled ok

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
Acked-by: Eelco Chaudron 
---
  Documentation/howto/tc-offload.rst |  24 +
  tests/system-common-macros.at  |  13 +++
  tests/system-offloads-traffic.at   | 160 +
  3 files changed, 197 insertions(+)

diff --git a/Documentation/howto/tc-offload.rst 
b/Documentation/howto/tc-offload.rst
index 681dff13e..a50dc3c58 100644
--- a/Documentation/howto/tc-offload.rst
+++ b/Documentation/howto/tc-offload.rst
@@ -88,6 +88,30 @@ As you can see above the none-offload case reports 140 bytes 
more, which is 14
  bytes per packet. This represents the L2 header, in this case, 2 * *Ethernet
  address* + *Ethertype*.
  
+Tunnel offload

+++
+
+Current tunnel offload ignores DF and CSUM flags configuration requested by
+the user. TC for now has no way to pass these flags in a flower key and their
+masks are set by default. To make tunnel offload work, DF and CSUM flags
+are cleared. So please be aware of the following differences.
+
+Dumping vxlan decap match without offload, it shows::
+
+
recirc_id(0),tunnel(tun_id=0x4,src=192.168.1.1,dst=192.168.1.2,flags(-df+csum+key)),in_port(vxlan_sys_4789)
+
+Dumping vxlan decap match with offload, it shows::
+
+
recirc_id(0),tunnel(tun_id=0x4,src=192.168.1.1,dst=192.168.1.2,tp_dst=4789,flags(+key)),in_port(vxlan_sys_4789)
+
+Dumping vxlan encap action without offload, it shows::
+
+
actions:set(tunnel(tun_id=0x4,dst=192.168.1.1,ttl=64,tp_dst=4789,flags(df|key))),vxlan_sys_4789
+
+Dumping vxlan encap action with offload, it shows::
+
+
actions:set(tunnel(tun_id=0x4,dst=192.168.1.64,ttl=64,tp_dst=4789,flags(key))),vxlan_sys_4789
+
  TC Meter Offload
  
  
diff --git a/tests/system-common-macros.at b/tests/system-common-macros.at

index 0077a8609..55ec0bf1e 100644
--- a/tests/system-common-macros.at
+++ b/tests/system-common-macros.at
@@ -359,3 +359,16 @@ m4_define([OVS_CHECK_IPROUTE_ENCAP],
  # OVS_CHECK_CT_CLEAR()
  m4_define([OVS_CHECK_CT_CLEAR],
  [AT_SKIP_IF([! grep -q "Datapath supports ct_clear action" 
ovs-vswitchd.log])])
+
+# LOAD_MODULE([name])
+#
+# Tries to load specified kernel module and removes it after
+# if it wasn't loaded before this call.
+#
+m4_define([LOAD_MODULE],
+[if ! lsmod | grep -q $1; then
+ on_exit 'modprobe -q -r $1'
+ fi
+ AT_CHECK([modprobe $1])
+]
+)
\ No newline at end of file
diff --git a/tests/system-offloads-traffic.at b/tests/system-offloads-traffic.at
index ae302a294..db287a86d 100644
--- a/tests/system-offloads-traffic.at
+++ b/tests/system-offloads-traffic.at
@@ -93,6 +93,166 @@ AT_CHECK([ovs-appctl upcall/show | grep -E "offloaded flows : 
[[1-9]]"], [0], [i
  OVS_TRAFFIC_VSWITCHD_STOP
  AT_CLEANUP
  
+

+AT_SETUP([offloads - sflow with sampling=1 - offloads enabled])
+LOAD_MODULE([psample])
+OVS_TRAFFIC_VSWITCHD_START([], [], [-- set Open_vSwitch . 
other_config:hw-offload=true])
+
+on_exit 'kill `cat test-sflow.pid`'
+AT_CHECK([ovstest test-sflow --log-file --detach --no-chdir --pidfile 0:127.0.0.1 
> sflow.log], [0], [], [ignore])


Please, wrap long lines.  At least move the error codes and output streams
to a new line.  Indented to the first argument, of course.

It's also possible to break the argument with either 'dnl' or a line
continuation '\' (new versions of atotest seems to handle them better).


Done.




+AT_CAPTURE_FILE([sflow.log])
+PARSE_LISTENING_PORT([test-sflow.log], [SFLOW_PORT])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+
+AT_CHECK([ovs-vsctl -- --id=@sflow create sflow agent=lo 
target=\"127.0.0.1:$SFLOW_PORT\" header=128 sampling=1 polling=100 -- set 
bridge br0 sflow=@sflow], [0], [ignore])


Wrap.


Done.




+NS_CHECK_EXEC([at_ns0], [ping -q -c 1000 -i 0.01 -w 12 10.1.1.2 | 
FORMAT_PING], [0], [dnl
+1000 packets transmitted, 1000 received, 0% packet loss, time 0ms


This check fails when I run tests with ASAN.  It's unable to finish in 12 
seconds:

./system-offloads-traffic.at:113: ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
ping -q -c 1000 -i 0.01 -w 12 10.1.1.2 | grep "transmitted" | sed 
's/time.*ms$/time 0ms/'
NS_EXEC_HEREDOC
--- -   2023-06-23 18:53:18.127071711 +
+++ /root/ovs/tests/system-offloads-testsuite.dir/at-groups/3/stdout
2023-06-23 18:53:18.12000 +
@@ -1,2 +1,2 @@
-1000 packets transmitted, 1000 received, 0% packet loss, time 0ms
+750 packets transmitted, 750 received, 0% packet loss, time 0ms

Same for the other ping in these tests.


Done.





+])
+
+p1_ifindex=$(cat 

Re: [ovs-dev] [PATCH v28 6/8] dpif-netlink: Add netdev offload recv in normal recv upcalls

2024-03-25 Thread Chris Mi via dev

On 6/24/2023 4:18 AM, Ilya Maximets wrote:

On 6/19/23 07:05, Chris Mi wrote:

In thread handler 0, add netdev offload recv in normal recv upcalls.
To avoid starvation, introduce a flag to alternate the order of
receiving normal upcalls and offload upcalls based on that flag.

Add similar change for recv_wait.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
  lib/dpif-netlink.c| 41 ++-
  ofproto/ofproto-dpif-upcall.c | 15 +
  2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 60bd39643..6e7b644e8 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -201,6 +201,11 @@ struct dpif_handler {
  struct nl_sock *sock; /* Each handler thread holds one netlink
   socket. */
  
+/* The receive handler thread deals with both normal and offload receive

+ * upcalls. To avoid starvation, the below flag is used to alternate the
+ * processing order. */
+bool recv_offload_first;
+
  #ifdef _WIN32
  /* Pool of sockets. */
  struct dpif_windows_vport_sock *vport_sock_pool;
@@ -3010,7 +3015,6 @@ dpif_netlink_recv_windows(struct dpif_netlink *dpif, 
uint32_t handler_id,
  static int
  dpif_netlink_recv_cpu_dispatch(struct dpif_netlink *dpif, uint32_t handler_id,
 struct dpif_upcall *upcall, struct ofpbuf *buf)
-OVS_REQ_RDLOCK(dpif->upcall_lock)
  {
  struct dpif_handler *handler;
  int read_tries = 0;
@@ -3061,7 +3065,6 @@ dpif_netlink_recv_vport_dispatch(struct dpif_netlink 
*dpif,
   uint32_t handler_id,
   struct dpif_upcall *upcall,
   struct ofpbuf *buf)
-OVS_REQ_RDLOCK(dpif->upcall_lock)
  {
  struct dpif_handler *handler;
  int read_tries = 0;
@@ -3135,13 +3138,12 @@ dpif_netlink_recv_vport_dispatch(struct dpif_netlink 
*dpif,
  #endif
  
  static int

-dpif_netlink_recv(struct dpif *dpif_, uint32_t handler_id,
-  struct dpif_upcall *upcall, struct ofpbuf *buf)
+dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
+struct dpif_upcall *upcall, struct ofpbuf *buf)
+OVS_REQ_RDLOCK(dpif->upcall_lock)
  {
-struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
  int error;
  
-fat_rwlock_rdlock(>upcall_lock);

  #ifdef _WIN32
  error = dpif_netlink_recv_windows(dpif, handler_id, upcall, buf);
  #else
@@ -3152,6 +3154,32 @@ dpif_netlink_recv(struct dpif *dpif_, uint32_t 
handler_id,
   handler_id, upcall, buf);
  }
  #endif
+
+return error;
+}
+
+static int
+dpif_netlink_recv(struct dpif *dpif_, uint32_t handler_id,
+  struct dpif_upcall *upcall, struct ofpbuf *buf)
+{
+struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
+struct dpif_handler *handler;
+int error;
+
+fat_rwlock_rdlock(>upcall_lock);
+handler = >handlers[handler_id];
+if (handler->recv_offload_first) {
+error = netdev_offload_recv(upcall, buf, handler_id);
+if (error == EAGAIN) {
+error = dpif_netlink_recv__(dpif, handler_id, upcall, buf);
+}
+} else {
+error = dpif_netlink_recv__(dpif, handler_id, upcall, buf);
+if (error == EAGAIN) {
+error = netdev_offload_recv(upcall, buf, handler_id);
+}
+}
+handler->recv_offload_first = !handler->recv_offload_first;
  fat_rwlock_unlock(>upcall_lock);
  
  return error;

@@ -3217,6 +3245,7 @@ dpif_netlink_recv_wait(struct dpif *dpif_, uint32_t 
handler_id)
  dpif_netlink_recv_wait_vport_dispatch(dpif, handler_id);
  }
  #endif
+netdev_offload_recv_wait(handler_id);
  fat_rwlock_unlock(>upcall_lock);
  }
  
diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c

index 04b583f81..c1fad9a8f 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -855,10 +855,17 @@ recv_upcalls(struct handler *handler)
  break;
  }
  
-upcall->fitness = odp_flow_key_to_flow(dupcall->key, dupcall->key_len,

-   flow, NULL);
-if (upcall->fitness == ODP_FIT_ERROR) {
-goto free_dupcall;
+/* If key and key_len are available, use them to construct flow.
+ * Otherwise, use upcall->flow. */
+if (dupcall->key && dupcall->key_len) {
+upcall->fitness = odp_flow_key_to_flow(dupcall->key,
+   dupcall->key_len,
+   flow, NULL);
+if (upcall->fitness == ODP_FIT_ERROR) {
+goto free_dupcall;
+}
+} else {
+flow = >flow;


Might make sense to set upcall->fitness to ODP_FIT_PERFECT here.
It might not be used today, 

Re: [ovs-dev] [PATCH v28 7/8] netdev-offload-tc: Add offload support for sFlow

2024-03-25 Thread Chris Mi via dev

On 6/24/2023 4:18 AM, Ilya Maximets wrote:

On 6/19/23 07:05, Chris Mi wrote:

Create a unique group ID to map the sFlow info when offloading sample
action to TC. When showing the offloaded datapath flows, translate the
group ID from TC sample action to sFlow info using the mapping.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
  lib/netdev-offload-tc.c | 288 
  lib/tc.c|  63 -
  lib/tc.h|   6 +
  3 files changed, 331 insertions(+), 26 deletions(-)


This patch needs a NEWS entry.


Done.



  
  enum tc_action_type type;

@@ -380,6 +385,7 @@ struct tc_flower {
  enum tc_offloaded_state offloaded_state;
  /* Used to force skip_hw when probing tc features. */
  enum tc_offload_policy tc_policy;
+uint16_t ifindex;
  };
  
  int tc_replace_flower(struct tcf_id *id, struct tc_flower *flower);


No need for variable names.



It's not part of this patch. Didn't change it.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v28 5/8] netdev-offload: Add netdev offload recv and recv_wait APIs

2024-03-25 Thread Chris Mi via dev

On 6/24/2023 4:18 AM, Ilya Maximets wrote:

On 6/19/23 07:05, Chris Mi wrote:

Iterate each registered offload API. It's not a problem for today
since we only have one implementation.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
  lib/netdev-offload.c | 38 ++
  lib/netdev-offload.h |  5 +
  2 files changed, 43 insertions(+)

diff --git a/lib/netdev-offload.c b/lib/netdev-offload.c
index 403315deb..1374aa8ac 100644
--- a/lib/netdev-offload.c
+++ b/lib/netdev-offload.c
@@ -257,6 +257,44 @@ meter_offload_del(ofproto_meter_id meter_id, struct 
ofputil_meter_stats *stats)
  return 0;
  }
  
+int

+netdev_offload_recv(struct dpif_upcall *upcall, struct ofpbuf *buf,
+uint32_t handler_id)
+{
+struct netdev_registered_flow_api *rfa;
+int ret = EAGAIN;
+
+CMAP_FOR_EACH (rfa, cmap_node, _flow_apis) {
+if (rfa->flow_api->recv) {
+ret = rfa->flow_api->recv(upcall, buf, handler_id);
+if (!ret) {
+return 0;
+}
+
+if (ret == EAGAIN) {


Should there be '!=' instead?


Done.




+VLOG_DBG_RL(, "Failed to receive offload packet, %s, "
+"type: %s", ovs_strerror(ret),
+rfa->flow_api->type);
+}
+} else {
+ret = EAGAIN;
+   }
+}
+return ret;
+}
+
+void
+netdev_offload_recv_wait(uint32_t handler_id)
+{
+struct netdev_registered_flow_api *rfa;
+
+CMAP_FOR_EACH (rfa, cmap_node, _flow_apis) {
+if (rfa->flow_api->recv_wait) {
+rfa->flow_api->recv_wait(handler_id);
+}
+}
+}
+
  int
  netdev_flow_flush(struct netdev *netdev)
  {
diff --git a/lib/netdev-offload.h b/lib/netdev-offload.h
index 47f8e6f48..d71e98418 100644
--- a/lib/netdev-offload.h
+++ b/lib/netdev-offload.h
@@ -33,6 +33,7 @@ extern "C" {
  
  struct dp_packet_batch;

  struct dp_packet;
+struct dpif_upcall;
  struct netdev_class;
  struct netdev_rxq;
  struct netdev_saved_flags;
@@ -162,6 +163,10 @@ void meter_offload_set(ofproto_meter_id, struct 
ofputil_meter_config *);
  int meter_offload_get(ofproto_meter_id, struct ofputil_meter_stats *);
  int meter_offload_del(ofproto_meter_id, struct ofputil_meter_stats *);
  
+int netdev_offload_recv(struct dpif_upcall *upcall, struct ofpbuf *buf,


'upcall' and 'buf'  names are not needed.  The structure is self-explanatory.


Done.




+uint32_t handler_id);
+void netdev_offload_recv_wait(uint32_t handler_id);
+
  #ifdef  __cplusplus
  }
  #endif



___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v28 4/8] netdev-offload-tc: Add sample offload API for TC

2024-03-25 Thread Chris Mi via dev

On 6/24/2023 4:17 AM, Ilya Maximets wrote:

On 6/19/23 07:05, Chris Mi wrote:

Initialize psample socket. Add sample recv API to receive sampled
packets from psample socket. Add sample recv wait API to add psample
socket fd to poll list.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
  lib/dpif.h|   6 +-
  lib/flow.h|   2 +-
  lib/netdev-offload-provider.h |  30 ++
  lib/netdev-offload-tc.c   | 172 ++
  lib/netdev-offload.c  |   3 +-
  lib/packets.h |   2 +-
  6 files changed, 210 insertions(+), 5 deletions(-)

diff --git a/lib/dpif.h b/lib/dpif.h
index 129cbf6a1..f91295862 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -834,8 +834,10 @@ struct dpif_upcall {
  
  /* DPIF_UC_ACTION only. */

  struct nlattr *userdata;/* Argument to OVS_ACTION_ATTR_USERSPACE. */
-struct nlattr *out_tun_key;/* Output tunnel key. */
-struct nlattr *actions;/* Argument to OVS_ACTION_ATTR_USERSPACE. */
+struct nlattr *out_tun_key; /* Output tunnel key. */
+struct nlattr *actions; /* Argument to OVS_ACTION_ATTR_USERSPACE. */
+struct flow flow;   /* Caller provided 'flow' if the 'key' is not
+   available. */
  };
  
  /* A callback to notify higher layer of dpif about to be purged, so that

diff --git a/lib/flow.h b/lib/flow.h
index a9d026e1c..0974bfd42 100644
--- a/lib/flow.h
+++ b/lib/flow.h
@@ -970,7 +970,7 @@ pkt_metadata_from_flow(struct pkt_metadata *md, const 
struct flow *flow)
  
  md->recirc_id = flow->recirc_id;

  md->dp_hash = flow->dp_hash;
-flow_tnl_copy__(>tunnel, >tunnel);
+flow_tnl_copy(>tunnel, >tunnel);
  md->skb_priority = flow->skb_priority;
  md->pkt_mark = flow->pkt_mark;
  md->in_port = flow->in_port;
diff --git a/lib/netdev-offload-provider.h b/lib/netdev-offload-provider.h
index 9108856d1..a457556e5 100644
--- a/lib/netdev-offload-provider.h
+++ b/lib/netdev-offload-provider.h
@@ -28,6 +28,8 @@
  extern "C" {
  #endif
  
+struct dpif_upcall;

+
  struct netdev_flow_api {
  char *type;
  /* Flush all offloaded flows from a netdev.
@@ -121,6 +123,34 @@ struct netdev_flow_api {
  int (*meter_del)(ofproto_meter_id meter_id,
   struct ofputil_meter_stats *stats);
  
+/* Polls for upcall offload packets for an upcall handler. If successful,

+ * stores the upcall into '*upcall', using 'buf' for storage.
+ *
+ * The implementation should point 'upcall->flow' and 'upcall->userdata'
+ * (if any) into data in the caller-provided 'buf'.  The implementation may


This is not really correct.  The 'flow' is not a pointer in this patch set,
AFAICT, so we can't point it to buf.


I changed to '>flow'.




+ * also use 'buf' for storing the data of 'upcall->packet'.  If necessary
+ * to make room, the implementation may reallocate the data in 'buf'.
+ *
+ * The caller owns the data of 'upcall->packet' and may modify it.  If
+ * packet's headroom is exhausted as it is manipulated, 'upcall->packet'
+ * will be reallocated.  This requires the data of 'upcall->packet' to be
+ * released with ofpbuf_uninit() before 'upcall' is destroyed.  However,
+ * when an error is returned, the 'upcall->packet' may be uninitialized
+ * and should not be released.
+ *
+ * This function must not block. If no upcall is pending when it is
+ * called, it should return EAGAIN without blocking.
+ *
+ * Return 0 if successful, otherwise returns a positive errno value.
+ */
+int (*recv)(struct dpif_upcall *upcall, struct ofpbuf *buf,
+uint32_t handler_id);
+
+/* Arranges for the poll loop for an upcall handler to wake up when
+ * sample socket has a message queued to be received with the recv


s/sample socket/offload provider/


Done.




+ * member functions. */
+void (*recv_wait)(uint32_t handler_id);
+
  /* Initializies the netdev flow api.
   * Return 0 if successful, otherwise returns a positive errno value. */
  int (*init_flow_api)(struct netdev *);
diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index 56726acf8..8d571aca8 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -18,6 +18,8 @@
  
  #include 

  #include 
+#include 
+#include 
  
  #include "cmap.h"

  #include "dpif-provider.h"
@@ -125,6 +127,9 @@ struct sgid_node {
  struct offload_sample sample;
  };
  
+static struct nl_sock *psample_sock;

+static int psample_family;
+
  /* The sgid_map mutex protects the sample_group_ids and the sgid_map for
   * cmap_insert(), cmap_remove(), or cmap_replace() operations. */
  static struct ovs_mutex sgid_lock = OVS_MUTEX_INITIALIZER;
@@ -156,6 +161,14 @@ sgid_find(uint32_t id)
  return node ? CONTAINER_OF(node, struct sgid_node, id_node) : NULL;
  }
  
+static struct offload_sample *

+sample_find(uint32_t id)
+{
+

Re: [ovs-dev] [PATCH v28 3/8] netdev-offload-tc: Introduce group ID management API

2024-03-25 Thread Chris Mi via dev

On 6/24/2023 4:17 AM, Ilya Maximets wrote:

On 6/19/23 07:05, Chris Mi wrote:

When offloading sample action to TC, userspace creates a unique ID
to map sample action and tunnel info and passes this ID to kernel
instead of the sample info. Kernel will send this ID and sampled
packet to userspace. Using the ID, userspace can recover the sample
info and send sampled packet to the right sample monitoring host.

Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
  lib/netdev-offload-tc.c | 156 +---
  lib/util.c  |   6 ++
  lib/util.h  |   1 +
  3 files changed, 154 insertions(+), 9 deletions(-)

diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index 4f26dd8cc..56726acf8 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -19,28 +19,29 @@
  #include 
  #include 
  
+#include "cmap.h"

+#include "dpif-provider.h"
  #include "dpif.h"
  #include "hash.h"
  #include "id-pool.h"
-#include "openvswitch/hmap.h"
-#include "openvswitch/match.h"
-#include "openvswitch/ofpbuf.h"
-#include "openvswitch/thread.h"
-#include "openvswitch/types.h"
-#include "openvswitch/util.h"
-#include "openvswitch/vlog.h"
  #include "netdev-linux.h"
  #include "netdev-offload-provider.h"
  #include "netdev-provider.h"
  #include "netdev-vport.h"
-#include "netlink.h"
  #include "netlink-socket.h"
+#include "netlink.h"
  #include "odp-netlink.h"
  #include "odp-util.h"
+#include "openvswitch/hmap.h"
+#include "openvswitch/match.h"
+#include "openvswitch/ofpbuf.h"
+#include "openvswitch/thread.h"
+#include "openvswitch/types.h"
+#include "openvswitch/util.h"
+#include "openvswitch/vlog.h"
  #include "tc.h"
  #include "unaligned.h"
  #include "util.h"
-#include "dpif-provider.h"
  
  VLOG_DEFINE_THIS_MODULE(netdev_offload_tc);
  
@@ -103,6 +104,125 @@ static void parse_tc_flower_to_stats(struct tc_flower *flower,

  static int get_ufid_adjust_stats(const ovs_u128 *ufid,
   struct dpif_flow_stats *stats);
  
+/* When offloading sample action to TC, userspace creates a unique ID

+ * to map sample action and tunnel info and passes this ID to kernel
+ * instead of the sample info. Kernel will send this ID and sampled
+ * packet to userspace. Using the ID, userspace can recover the sample
+ * info and send sampled packet to the right sample monitoring host. */
+struct offload_sample {
+uint16_t type; /* enum user_action_cookie_type. */
+struct nlattr *action; /* Sample action. Used in flow_get. */
+struct nlattr *userdata;   /* Struct user_action_cookie. */
+struct nlattr *userspace_actions;  /* All actions to get output tunnel. */
+struct flow_tnl *tunnel;   /* Input tunnel. */
+uint16_t ifindex;  /* Input ifindex. */
+};
+
+/* This maps a sample group ID to struct offload_sample. */
+struct sgid_node {
+struct cmap_node id_node;
+uint32_t id;
+struct offload_sample sample;
+};
+
+/* The sgid_map mutex protects the sample_group_ids and the sgid_map for
+ * cmap_insert(), cmap_remove(), or cmap_replace() operations. */
+static struct ovs_mutex sgid_lock = OVS_MUTEX_INITIALIZER;
+static struct cmap sgid_map = CMAP_INITIALIZER;
+static struct id_pool *sample_group_ids OVS_GUARDED_BY(sgid_lock);
+
+static void
+sgid_node_free(struct sgid_node *node)
+{
+if (node) {
+if (node->id) {
+ovs_mutex_lock(_lock);
+id_pool_free_id(sample_group_ids, node->id);
+ovs_mutex_unlock(_lock);
+}
+free(node->sample.tunnel);
+free(node->sample.action);
+free(node->sample.userspace_actions);
+free(node->sample.userdata);
+free(node);
+}
+}
+
+static struct sgid_node *
+sgid_find(uint32_t id)
+{
+const struct cmap_node *node = cmap_find(_map, id);
+
+return node ? CONTAINER_OF(node, struct sgid_node, id_node) : NULL;
+}
+
+static void
+offload_sample_clone(struct offload_sample *dst,
+ const struct offload_sample *src,
+ bool steal_userspace_actions)
+{
+dst->type = src->type;
+dst->action = nullable_xmemdup(src->action, src->action ?
+src->action->nla_len : 0);
+if (steal_userspace_actions) {
+dst->userspace_actions = src->userspace_actions;
+} else {
+dst->userspace_actions = src->userspace_actions
+? xmemdup(src->userspace_actions,
+  src->userspace_actions->nla_len)
+: NULL;
+}
+dst->userdata = nullable_xmemdup(src->userdata, src->userdata ?
+src->userdata->nla_len :
+0);
+dst->tunnel = nullable_xmemdup(src->tunnel, sizeof *src->tunnel);
+dst->ifindex = src->ifindex;
+}
+
+/* Allocate a unique group id for the given set of flow metadata. The id
+ * space is 

Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Don't mirror packets that aren't modified.

2024-03-25 Thread Zhangweiwei
The ethernet addresses of two ICMP request packets are indeed different . One 
is original packet and the other is modified. It is an expected behavior 
according to the code.
Actually, when a packet sent by port A is changed by flow table and then is 
sent to itself, we expect to capture this packet. However, when this packet is 
changed and is sent to another port, should we still capture the packet on port 
A?

[root@localhost infiniband]# ovs-tcpdump -i tapVm71 -nnvve
11.11.70.1 > 1.1.70.2: ICMP echo request, id 15498, seq 17, length 64
09:36:52.822232 68:05:ca:21:d6:e5 > 52:54:00:67:d5:61, ethertype IPv4 (0x0800), 
length 98: (tos 0x0, ttl 63, id 22101, offset 0, flags [none], proto ICMP (1), 
length 84)
1.1.70.2 > 11.11.70.1: ICMP echo reply, id 15498, seq 17, length 64
09:36:53.862137 52:54:00:67:d5:61 > 68:05:ca:21:d6:e5, ethertype IPv4 (0x0800), 
length 98: (tos 0x0, ttl 64, id 26518, offset 0, flags [DF], proto ICMP (1), 
length 84)
11.11.70.1 > 1.1.70.2: ICMP echo request, id 15498, seq 18, length 64
09:36:53.862139 68:05:ca:21:d6:e5 > 52:54:00:9a:bf:ed, ethertype IPv4 (0x0800), 
length 98: (tos 0x0, ttl 63, id 26518, offset 0, flags [DF], proto ICMP (1), 
length 84)
11.11.70.1 > 1.1.70.2: ICMP echo request, id 15498, seq 18, length 64
09:36:53.862230 68:05:ca:21:d6:e5 > 52:54:00:67:d5:61, ethertype IPv4 (0x0800), 
length 98: (tos 0x0, ttl 63, id 22176, offset 0, flags [none], proto ICMP (1), 
length 84)
1.1.70.2 > 11.11.70.1: ICMP echo reply, id 15498, seq 18, length 64

-邮件原件-
发件人: Mike Pattrick [mailto:m...@redhat.com] 
发送时间: 2024年3月25日 22:26
收件人: zhangweiwei (RD) 
抄送: d...@openvswitch.org
主题: Re: [PATCH] ofproto-dpif-upcall: Don't mirror packets that aren't modified.

On Mon, Mar 25, 2024 at 3:48 AM Zhangweiwei  wrote:
>
> Hi,
> I have tried this patch, however, there are still some issues when the 
> packets contents are changed across recirculation. On the follow example, 
> packets are modified in recirc_id(0) after mirror, the mirror context reset. 
> Therefore, there are two ICMP request packets are mirrored on port mitapVm71.
>
> In the following example, ICMP packets ared sent from port(11) to 
> port(14), [root@localhost ~]# ovs-appctl dpif/dump-flows vds1-br 
> ct_state(-new-est-rel-rpl-inv-trk),recirc_id(0),in_port(11),packet_typ
> e(ns=0,id=0),eth(src=52:54:00:67:d5:61,dst=68:05:ca:21:d6:e5),eth_type
> (0x0800),ipv4(src=11.11.70.1,dst=1.1.70.2,proto=1,ttl=64,frag=no), 
> packets:431, bytes:42238, used:0.574s, 
> actions:10,set(eth(src=68:05:ca:21:d6:e5,dst=52:54:00:9a:bf:ed)),set(i
> pv4(ttl=63)),ct(zone=6),recirc(0x3e8)
> ct_state(+est-rel-rpl),recirc_id(0x3e8),in_port(11),packet_type(ns=0,i
> d=0),eth_type(0x0800),ipv4(frag=no), packets:430, bytes:42140, 
> used:0.574s, actions:10,14
>
> ct_state(-new+est-rel+rpl-inv+trk),recirc_id(0x3e9),in_port(14),packet
> _type(ns=0,id=0),eth(src=52:54:00:9a:bf:ed,dst=68:05:ca:21:d6:e5),eth_
> type(0x0800),ipv4(dst=11.11.70.1,proto=1,ttl=64,frag=no), packets:431, 
> bytes:42238, used:0.574s, 
> actions:set(eth(src=68:05:ca:21:d6:e5,dst=52:54:00:67:d5:61)),set(ipv4
> (ttl=63)),11,10 
> ct_state(-trk),recirc_id(0),in_port(14),packet_type(ns=0,id=0),eth(src
> =52:54:00:9a:bf:ed),eth_type(0x0800),ipv4(src=1.1.70.2,proto=1,frag=no
> ), packets:431, bytes:42238, used:0.574s, 
> actions:ct(zone=6),recirc(0x3e9)
>
> [root@localhost ~]# ovs-appctl dpif/show
> netdev@ovs-netdev: hit:2552 missed:3019
>   vds1-br:
> mitapVm71 14/10: (system)
> tapVm71 5/11: (dpdkvhostuserclient: configured_rx_queues=1, 
> configured_tx_queues=1, mtu=1500, requested_rx_queues=1, 
> requested_tx_queues=1)
> tapVm72 6/14: (dpdkvhostuserclient: configured_rx_queues=1, 
> configured_tx_queues=1, mtu=1500, requested_rx_queues=1, 
> requested_tx_queues=1)
>
> [root@localhost ~]# ovs-tcpdump -i tapVm71
> 14:38:53.702142 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, 
> seq 2014, length 64
> 14:38:53.702143 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, 
> seq 2014, length 64
> 14:38:53.702185 IP 1.1.70.2 > 11.11.70.1: ICMP echo reply, id 13483, 
> seq 2014, length 64
> 14:38:54.742141 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, 
> seq 2015, length 64
> 14:38:54.742143 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, 
> seq 2015, length 64
> 14:38:54.742183 IP 1.1.70.2 > 11.11.70.1: ICMP echo reply, id 13483, 
> seq 2015, length 64
> 14:38:55.782142 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, 
> seq 2016, length 64
> 14:38:55.782144 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, 
> seq 2016, length 64
> 14:38:55.782186 IP 1.1.70.2 > 11.11.70.1: ICMP echo reply, id 13483, 
> seq 2016, length 64


Hello, thanks for the report. Is it possible to run the command "ovs-tcpdump -i 
tapVm71 -ennvv" ? I ask because I see your actions reset the ethernet address. 
If the ethernet address is different then this would be the expected behavior, 
the collection of the packet as it enters, and then as 

Re: [ovs-dev] [PATCH net-next v2 3/3] genetlink: remove linux/genetlink.h

2024-03-25 Thread Jakub Kicinski
On Mon, 25 Mar 2024 19:45:05 +0200 Andy Shevchenko wrote:
> > +/* Non-parallel generic netlink requests are serialized by a global lock.  
> > */  
> 
> While at it, maybe drop extra space? (I noticed it was originally like this.

Fair, I'll post v3.

> > +#define MODULE_ALIAS_GENL_FAMILY(family) \
> > + MODULE_ALIAS_NET_PF_PROTO_NAME(PF_NETLINK, NETLINK_GENERIC, "-family-" 
> > family)  
> 
> I would still make this TAB indented.

That one I'd prefer to keep. Coin toss between going over 80 chars
and having someone atypical indent.
-- 
pw-bot: cr
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next v2 3/3] genetlink: remove linux/genetlink.h

2024-03-25 Thread Andy Shevchenko
On Mon, Mar 25, 2024 at 10:37:16AM -0700, Jakub Kicinski wrote:
> genetlink.h is a shell of what used to be a combined uAPI
> and kernel header over a decade ago. It has fewer than
> 10 lines of code. Merge it into net/genetlink.h.
> In some ways it'd be better to keep the combined header
> under linux/ but it would make looking through git history
> harder.

Some last moment minor comments (no need to send a new version, JFYI,
maybe you tide this up when applying).

...

> +/* Non-parallel generic netlink requests are serialized by a global lock.  */

While at it, maybe drop extra space? (I noticed it was originally like this.

...

> +#define MODULE_ALIAS_GENL_FAMILY(family) \
> + MODULE_ALIAS_NET_PF_PROTO_NAME(PF_NETLINK, NETLINK_GENERIC, "-family-" 
> family)

I would still make this TAB indented.

-- 
With Best Regards,
Andy Shevchenko


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v2] acl-log: Properly log the "pass" verdict.

2024-03-25 Thread Mark Michelson
The "pass" verdict was not explicitly defined in the list of verdicts
for ACL logging. This resulted in logs saying "Syntax error at `pass'
unknown verdict."

This change adds the "pass" verdict explicitly so that it shows up as a
proper log in ovn-controller.

Reported-at: https://issues.redhat.com/browse/FDP-442
Signed-off-by: Mark Michelson 
---
 lib/acl-log.c | 4 +++-
 lib/acl-log.h | 1 +
 lib/actions.c | 2 ++
 tests/ovn.at  | 3 +++
 4 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/acl-log.c b/lib/acl-log.c
index 9530dd763..b3eb4bbd0 100644
--- a/lib/acl-log.c
+++ b/lib/acl-log.c
@@ -34,7 +34,9 @@ log_verdict_to_string(uint8_t verdict)
 return "drop";
 } else if (verdict == LOG_VERDICT_REJECT) {
 return "reject";
-} else {
+} else if (verdict == LOG_VERDICT_PASS) {
+return "pass";
+} else  {
 return "";
 }
 }
diff --git a/lib/acl-log.h b/lib/acl-log.h
index da7fa2f02..3973a8e0b 100644
--- a/lib/acl-log.h
+++ b/lib/acl-log.h
@@ -33,6 +33,7 @@ enum log_verdict {
 LOG_VERDICT_ALLOW,
 LOG_VERDICT_DROP,
 LOG_VERDICT_REJECT,
+LOG_VERDICT_PASS,
 LOG_VERDICT_UNKNOWN = UINT8_MAX
 };
 
diff --git a/lib/actions.c b/lib/actions.c
index 71615fc53..29584a189 100644
--- a/lib/actions.c
+++ b/lib/actions.c
@@ -3596,6 +3596,8 @@ parse_log_arg(struct action_context *ctx, struct 
ovnact_log *log)
 log->verdict = LOG_VERDICT_REJECT;
 } else if (lexer_match_id(ctx->lexer, "allow")) {
 log->verdict = LOG_VERDICT_ALLOW;
+} else if (lexer_match_id(ctx->lexer, "pass")) {
+log->verdict = LOG_VERDICT_PASS;
 } else {
 lexer_syntax_error(ctx->lexer, "unknown verdict");
 return;
diff --git a/tests/ovn.at b/tests/ovn.at
index 4d0c7ad53..f272749aa 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -1847,6 +1847,9 @@ log(name="test1", verdict=drop, severity=info, 
meter="meter1");
 log(verdict=drop);
 formats as log(verdict=drop, severity=info);
 encodes as controller(userdata=00.00.00.07.00.00.00.00.01.06)
+log(verdict=pass);
+formats as log(verdict=pass, severity=info);
+encodes as controller(userdata=00.00.00.07.00.00.00.00.03.06)
 log(verdict=bad_verdict, severity=info);
 Syntax error at `bad_verdict' unknown verdict.
 log(verdict=drop, severity=bad_severity);
-- 
2.44.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net-next v2 2/3] net: openvswitch: remove unnecessary linux/genetlink.h include

2024-03-25 Thread Jakub Kicinski
The only legit reason I could think of for net/genetlink.h
and linux/genetlink.h to be separate would be if one was
included by other headers and we wanted to keep it lightweight.
That is not the case, net/openvswitch/meter.h includes
linux/genetlink.h but for no apparent reason (for struct genl_family
perhaps? it's not necessary, types of externs do not need
to be known).

Signed-off-by: Jakub Kicinski 
---
CC: pshe...@ovn.org
CC: d...@openvswitch.org
---
 net/openvswitch/meter.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/openvswitch/meter.h b/net/openvswitch/meter.h
index ed11cd12b512..8bbf983cd244 100644
--- a/net/openvswitch/meter.h
+++ b/net/openvswitch/meter.h
@@ -11,7 +11,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
-- 
2.44.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net-next v2 3/3] genetlink: remove linux/genetlink.h

2024-03-25 Thread Jakub Kicinski
genetlink.h is a shell of what used to be a combined uAPI
and kernel header over a decade ago. It has fewer than
10 lines of code. Merge it into net/genetlink.h.
In some ways it'd be better to keep the combined header
under linux/ but it would make looking through git history
harder.

Acked-by: Sven Eckelmann 
Signed-off-by: Jakub Kicinski 
---
v2:
 - remove extern
 - include linux/net.h
 - improve the comment, not "all" requests are serialized

CC: ja...@zx2c4.com
CC: mareklind...@neomailbox.ch
CC: s...@simonwunderlich.de
CC: a...@unstable.cc
CC: pshe...@ovn.org
CC: andriy.shevche...@linux.intel.com
CC: wiregu...@lists.zx2c4.com
CC: d...@openvswitch.org
---
 drivers/net/wireguard/main.c  |  2 +-
 include/linux/genetlink.h | 14 --
 include/linux/genl_magic_struct.h |  2 +-
 include/net/genetlink.h   | 10 +-
 net/batman-adv/main.c |  2 +-
 net/batman-adv/netlink.c  |  1 -
 net/openvswitch/datapath.c|  1 -
 7 files changed, 12 insertions(+), 20 deletions(-)
 delete mode 100644 include/linux/genetlink.h

diff --git a/drivers/net/wireguard/main.c b/drivers/net/wireguard/main.c
index ee4da9ab8013..a00671b58701 100644
--- a/drivers/net/wireguard/main.c
+++ b/drivers/net/wireguard/main.c
@@ -14,7 +14,7 @@
 
 #include 
 #include 
-#include 
+#include 
 #include 
 
 static int __init wg_mod_init(void)
diff --git a/include/linux/genetlink.h b/include/linux/genetlink.h
deleted file mode 100644
index 9dbd7ba9b858..
--- a/include/linux/genetlink.h
+++ /dev/null
@@ -1,14 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __LINUX_GENERIC_NETLINK_H
-#define __LINUX_GENERIC_NETLINK_H
-
-#include 
-
-/* All generic netlink requests are serialized by a global lock.  */
-extern void genl_lock(void);
-extern void genl_unlock(void);
-
-#define MODULE_ALIAS_GENL_FAMILY(family)\
- MODULE_ALIAS_NET_PF_PROTO_NAME(PF_NETLINK, NETLINK_GENERIC, "-family-" family)
-
-#endif /* __LINUX_GENERIC_NETLINK_H */
diff --git a/include/linux/genl_magic_struct.h 
b/include/linux/genl_magic_struct.h
index a419d93789ff..621b87a87d74 100644
--- a/include/linux/genl_magic_struct.h
+++ b/include/linux/genl_magic_struct.h
@@ -15,8 +15,8 @@
 #endif
 
 #include 
-#include 
 #include 
+#include 
 
 extern int CONCATENATE(GENL_MAGIC_FAMILY, _genl_register)(void);
 extern void CONCATENATE(GENL_MAGIC_FAMILY, _genl_unregister)(void);
diff --git a/include/net/genetlink.h b/include/net/genetlink.h
index 9ece6e5a3ea8..7648dd6b8754 100644
--- a/include/net/genetlink.h
+++ b/include/net/genetlink.h
@@ -2,12 +2,20 @@
 #ifndef __NET_GENERIC_NETLINK_H
 #define __NET_GENERIC_NETLINK_H
 
-#include 
+#include 
 #include 
 #include 
+#include 
 
 #define GENLMSG_DEFAULT_SIZE (NLMSG_DEFAULT_SIZE - GENL_HDRLEN)
 
+/* Non-parallel generic netlink requests are serialized by a global lock.  */
+void genl_lock(void);
+void genl_unlock(void);
+
+#define MODULE_ALIAS_GENL_FAMILY(family) \
+ MODULE_ALIAS_NET_PF_PROTO_NAME(PF_NETLINK, NETLINK_GENERIC, "-family-" family)
+
 /* Binding to multicast group requires %CAP_NET_ADMIN */
 #define GENL_MCAST_CAP_NET_ADMIN   BIT(0)
 /* Binding to multicast group requires %CAP_SYS_ADMIN */
diff --git a/net/batman-adv/main.c b/net/batman-adv/main.c
index 75119f1ffccc..8e0f44c71696 100644
--- a/net/batman-adv/main.c
+++ b/net/batman-adv/main.c
@@ -14,7 +14,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -38,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/net/batman-adv/netlink.c b/net/batman-adv/netlink.c
index 0954757f0b8b..9362cd9d6f3d 100644
--- a/net/batman-adv/netlink.c
+++ b/net/batman-adv/netlink.c
@@ -15,7 +15,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 11c69415c605..99d72543abd3 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -15,7 +15,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
2.44.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 2/2] missed during test simplification

2024-03-25 Thread 0-day Robot
Bleep bloop.  Greetings Jacob Tanenbaum, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: The subject summary should start with a capital.
WARNING: The subject summary should end with a dot.
Subject: missed during test simplification
ERROR: Committer 1807499 changed the testing to get the table numbers 
themselves. needs to sign off.
WARNING: Unexpected sign-offs from developers who are not authors or co-authors 
or committers: 0-day Robot 
Lines checked: 31, Warnings: 3, Errors: 1


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 1/2] Merge QoS logical pipelines

2024-03-25 Thread 0-day Robot
Bleep bloop.  Greetings Jacob Tanenbaum, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: The subject summary should end with a dot.
Subject: Merge QoS logical pipelines
Lines checked: 287, Warnings: 1, Errors: 0


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn 1/2] Merge QoS logical pipelines

2024-03-25 Thread Jacob Tanenbaum
currently there are 2 QoS pipelines for ingress (ls_in_qos_mark,
ls_in_qos_meter) and egress (ls_out_qos_mark, ls_out_qos_meter). This is
not necessary as there are no actions across the two pipelines that
depend on each other. The two pipelines can be merged.

https://issues.redhat.com/browse/FDP-397

Signed-off-by: Jacob Tanenbaum 
---
 northd/northd.c | 65 +
 northd/northd.h | 46 +++-
 tests/ovn-northd.at | 24 -
 tests/ovn.at|  9 +++
 4 files changed, 62 insertions(+), 82 deletions(-)

diff --git a/northd/northd.c b/northd/northd.c
index 1839b7d8b..d5aab756f 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -3484,7 +3484,7 @@ build_lb_vip_actions(const struct ovn_northd_lb *lb,
 }
 
 if (reject) {
-int stage = ls_dp ? ovn_stage_get_table(S_SWITCH_OUT_QOS_MARK)
+int stage = ls_dp ? ovn_stage_get_table(S_SWITCH_OUT_QOS)
   : ovn_stage_get_table(S_ROUTER_OUT_SNAT);
 ds_clear(action);
 ds_put_format(action, "reg0 = 0; reject { outport <-> inport; "
@@ -6705,7 +6705,7 @@ build_acl_action_lflows(const struct ls_stateful_record 
*ls_stateful_rec,
 "/* eth.dst <-> eth.src; ip.dst <-> ip.src; is implicit. */ "
 "outport <-> inport; next(pipeline=%s,table=%d); };",
 ingress ? "egress" : "ingress",
-ingress ? ovn_stage_get_table(S_SWITCH_OUT_QOS_MARK)
+ingress ? ovn_stage_get_table(S_SWITCH_OUT_QOS)
 : ovn_stage_get_table(S_SWITCH_IN_L2_LKUP));
 
 ovn_lflow_metered(lflows, od, stage, 1000,
@@ -7081,24 +7081,39 @@ build_qos(struct ovn_datapath *od, struct lflow_table 
*lflows,
   struct lflow_ref *lflow_ref) {
 struct ds action = DS_EMPTY_INITIALIZER;
 
-ovn_lflow_add(lflows, od, S_SWITCH_IN_QOS_MARK, 0, "1", "next;",
+ovn_lflow_add(lflows, od, S_SWITCH_IN_QOS, 0, "1", "next;",
   lflow_ref);
-ovn_lflow_add(lflows, od, S_SWITCH_OUT_QOS_MARK, 0, "1", "next;",
-  lflow_ref);
-ovn_lflow_add(lflows, od, S_SWITCH_IN_QOS_METER, 0, "1", "next;",
-  lflow_ref);
-ovn_lflow_add(lflows, od, S_SWITCH_OUT_QOS_METER, 0, "1", "next;",
+ovn_lflow_add(lflows, od, S_SWITCH_OUT_QOS, 0, "1", "next;",
   lflow_ref);
 
 for (size_t i = 0; i < od->nbs->n_qos_rules; i++) {
 struct nbrec_qos *qos = od->nbs->qos_rules[i];
 bool ingress = !strcmp(qos->direction, "from-lport") ? true :false;
-enum ovn_stage stage = ingress ? S_SWITCH_IN_QOS_MARK : 
S_SWITCH_OUT_QOS_MARK;
+enum ovn_stage stage = ingress ? S_SWITCH_IN_QOS : S_SWITCH_OUT_QOS;
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
 int64_t rate = 0;
 int64_t burst = 0;
 
 ds_clear();
+for (size_t n = 0; n < qos->n_bandwidth; n++) {
+if (!strcmp(qos->key_bandwidth[n], "rate")) {
+rate = qos->value_bandwidth[n];
+} else if (!strcmp(qos->key_bandwidth[n], "burst")) {
+burst = qos->value_bandwidth[n];
+}
+}
+if (rate) {
+stage = ingress ? S_SWITCH_IN_QOS : S_SWITCH_OUT_QOS;
+if (burst) {
+ds_put_format(,
+  "set_meter(%"PRId64", %"PRId64"); ",
+  rate, burst);
+} else {
+ds_put_format(,
+  "set_meter(%"PRId64"); ",
+  rate);
+}
+}
 for (size_t j = 0; j < qos->n_action; j++) {
 if (!strcmp(qos->key_action[j], "dscp")) {
 if (qos->value_action[j] > QOS_MAX_DSCP) {
@@ -7115,43 +7130,11 @@ build_qos(struct ovn_datapath *od, struct lflow_table 
*lflows,
   qos->value_action[j]);
 }
 }
-
-if (action.length) {
 ds_put_cstr(, "next;");
-ovn_lflow_add_with_hint(lflows, od, stage, qos->priority,
-qos->match, ds_cstr(),
->header_, lflow_ref);
-}
-
-for (size_t n = 0; n < qos->n_bandwidth; n++) {
-if (!strcmp(qos->key_bandwidth[n], "rate")) {
-rate = qos->value_bandwidth[n];
-} else if (!strcmp(qos->key_bandwidth[n], "burst")) {
-burst = qos->value_bandwidth[n];
-}
-}
-if (rate) {
-stage = ingress ? S_SWITCH_IN_QOS_METER : S_SWITCH_OUT_QOS_METER;
-ds_clear();
-if (burst) {
-ds_put_format(,
-  "set_meter(%"PRId64", %"PRId64"); next;",
-  rate, burst);
-} else {
-ds_put_format(,
-  "set_meter(%"PRId64"); next;",
-  

[ovs-dev] [PATCH ovn 2/2] missed during test simplification

2024-03-25 Thread Jacob Tanenbaum
commit: 1807499 changed the testing to get the table numbers themselves.
The table number was left in here. This patch corrects that and removes
the table number.

Signed-off-by: Jacob Tanenbaum 
---
 tests/ovn.at | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/ovn.at b/tests/ovn.at
index f04b74210..762b12a7b 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -37684,7 +37684,7 @@ tpa=$(ip_to_hex 10 0 0 10)
 send_garp 1 1 $eth_src $eth_dst $spa $tpa
 
 OVS_WAIT_UNTIL([test 1 = `cat hv1/ovn-controller.log | grep "pinctrl received  
packet-in" | \
-grep opcode=BIND_VPORT | grep OF_Table_ID=28 | wc -l`])
+grep opcode=BIND_VPORT | grep OF_Table_ID=$(ovn-debug lflow-stage-to-oftable 
ls_in_arp_rsp) | wc -l`])
 
 sleep_controller hv1
 
-- 
2.42.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Don't mirror packets that aren't modified.

2024-03-25 Thread Mike Pattrick
On Mon, Mar 25, 2024 at 3:48 AM Zhangweiwei  wrote:
>
> Hi,
> I have tried this patch, however, there are still some issues when the 
> packets contents are changed across recirculation. On the follow example, 
> packets are modified in recirc_id(0) after mirror, the mirror context reset. 
> Therefore, there are two ICMP request packets are mirrored on port mitapVm71.
>
> In the following example, ICMP packets ared sent from port(11) to port(14),
> [root@localhost ~]# ovs-appctl dpif/dump-flows vds1-br
> ct_state(-new-est-rel-rpl-inv-trk),recirc_id(0),in_port(11),packet_type(ns=0,id=0),eth(src=52:54:00:67:d5:61,dst=68:05:ca:21:d6:e5),eth_type(0x0800),ipv4(src=11.11.70.1,dst=1.1.70.2,proto=1,ttl=64,frag=no),
>  packets:431, bytes:42238, used:0.574s, 
> actions:10,set(eth(src=68:05:ca:21:d6:e5,dst=52:54:00:9a:bf:ed)),set(ipv4(ttl=63)),ct(zone=6),recirc(0x3e8)
> ct_state(+est-rel-rpl),recirc_id(0x3e8),in_port(11),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>  packets:430, bytes:42140, used:0.574s, actions:10,14
>
> ct_state(-new+est-rel+rpl-inv+trk),recirc_id(0x3e9),in_port(14),packet_type(ns=0,id=0),eth(src=52:54:00:9a:bf:ed,dst=68:05:ca:21:d6:e5),eth_type(0x0800),ipv4(dst=11.11.70.1,proto=1,ttl=64,frag=no),
>  packets:431, bytes:42238, used:0.574s, 
> actions:set(eth(src=68:05:ca:21:d6:e5,dst=52:54:00:67:d5:61)),set(ipv4(ttl=63)),11,10
> ct_state(-trk),recirc_id(0),in_port(14),packet_type(ns=0,id=0),eth(src=52:54:00:9a:bf:ed),eth_type(0x0800),ipv4(src=1.1.70.2,proto=1,frag=no),
>  packets:431, bytes:42238, used:0.574s, actions:ct(zone=6),recirc(0x3e9)
>
> [root@localhost ~]# ovs-appctl dpif/show
> netdev@ovs-netdev: hit:2552 missed:3019
>   vds1-br:
> mitapVm71 14/10: (system)
> tapVm71 5/11: (dpdkvhostuserclient: configured_rx_queues=1, 
> configured_tx_queues=1, mtu=1500, requested_rx_queues=1, 
> requested_tx_queues=1)
> tapVm72 6/14: (dpdkvhostuserclient: configured_rx_queues=1, 
> configured_tx_queues=1, mtu=1500, requested_rx_queues=1, 
> requested_tx_queues=1)
>
> [root@localhost ~]# ovs-tcpdump -i tapVm71
> 14:38:53.702142 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
> 2014, length 64
> 14:38:53.702143 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
> 2014, length 64
> 14:38:53.702185 IP 1.1.70.2 > 11.11.70.1: ICMP echo reply, id 13483, seq 
> 2014, length 64
> 14:38:54.742141 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
> 2015, length 64
> 14:38:54.742143 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
> 2015, length 64
> 14:38:54.742183 IP 1.1.70.2 > 11.11.70.1: ICMP echo reply, id 13483, seq 
> 2015, length 64
> 14:38:55.782142 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
> 2016, length 64
> 14:38:55.782144 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
> 2016, length 64
> 14:38:55.782186 IP 1.1.70.2 > 11.11.70.1: ICMP echo reply, id 13483, seq 
> 2016, length 64


Hello, thanks for the report. Is it possible to run the command
"ovs-tcpdump -i tapVm71 -ennvv" ? I ask because I see your actions
reset the ethernet address. If the ethernet address is different then
this would be the expected behavior, the collection of the packet as
it enters, and then as it exists modified.


Thank you,

Mike


>
> -邮件原件-
> 发件人: Mike Pattrick [mailto:m...@redhat.com]
> 发送时间: 2024年3月13日 1:37
> 收件人: d...@openvswitch.org
> 抄送: Mike Pattrick ; zhangweiwei (RD) 
> 主题: [PATCH] ofproto-dpif-upcall: Don't mirror packets that aren't modified.
>
> Previously OVS reset the mirror contents when a packet is modified in such a 
> way that the packets contents changes. However, this change incorrectly reset 
> that mirror context when only metadata changes as well.
>
> Now we check for all metadata fields, instead of just tunnel metadata, before 
> resetting the mirror context.
>
> Fixes: feed7f677505 ("ofproto-dpif-upcall: Mirror packets that are modified.")
> Reported-by: Zhangweiwei 
> Signed-off-by: Mike Pattrick 
> ---
>  include/openvswitch/meta-flow.h |   1 +
>  lib/meta-flow.c | 109 
>  ofproto/ofproto-dpif-xlate.c|   2 +-
>  tests/ofproto-dpif.at   |   5 +-
>  4 files changed, 114 insertions(+), 3 deletions(-)
>
> diff --git a/include/openvswitch/meta-flow.h 
> b/include/openvswitch/meta-flow.h index 3b0220aaa..96aad3933 100644
> --- a/include/openvswitch/meta-flow.h
> +++ b/include/openvswitch/meta-flow.h
> @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct mf_field *,
>const union mf_value *mask,
>struct flow *);  bool mf_is_tun_metadata(const 
> struct mf_field *);
> +bool mf_is_metadata(const struct mf_field *);
>  bool mf_is_frozen_metadata(const struct mf_field *);  bool 
> mf_is_pipeline_field(const struct mf_field *);  bool mf_is_set(const struct 
> mf_field *, const struct flow *); diff --git a/lib/meta-flow.c 
> b/lib/meta-flow.c index aa7cf1fcb..7ecec334e 100644

Re: [ovs-dev] [PATCH net] openvswitch: Set the skbuff pkt_type for proper pmtud support.

2024-03-25 Thread Eelco Chaudron


On 25 Mar 2024, at 13:37, Ilya Maximets wrote:

> On 3/25/24 13:22, Aaron Conole wrote:
>> Eelco Chaudron  writes:
>>
>>> On 22 Mar 2024, at 20:06, Aaron Conole wrote:
>>>
 Open vSwitch is originally intended to switch at layer 2, only dealing with
 Ethernet frames.  With the introduction of l3 tunnels support, it crossed
 into the realm of needing to care a bit about some routing details when
 making forwarding decisions.  If an oversized packet would need to be
 fragmented during this forwarding decision, there is a chance for pmtu
 to get involved and generate a routing exception.  This is gated by the
 skbuff->pkt_type field.

 When a flow is already loaded into the openvswitch module this field is
 set up and transitioned properly as a packet moves from one port to
 another.  In the case that a packet execute is invoked after a flow is
 newly installed this field is not properly initialized.  This causes the
 pmtud mechanism to omit sending the required exception messages across
 the tunnel boundary and a second attempt needs to be made to make sure
 that the routing exception is properly setup.  To fix this, we set the
 outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
 to the openvswitch module via a port device or packet command.
>>>
>>> Is this not a problem when the packet comes from the bridge port in the 
>>> kernel?
>>
>> It very well may be an issue there as well, but the recommendation is to
>> operate with the bridge port down as far as I know, so I don't know if
>> this issue has been observed happening from the bridge port.
>
> FWIW, bridge ports are typically used as an entry point for tunneled
> traffic so it can egress from a physical port attached to OVS.  It means
> they are pretty much always UP in most common setups like OpenStack or
> ovn-kubernetes and handle a decent amount of traffic.  They are also used
> to direct some other types of traffic to the host kernel.

+1 here, I’m talking about the same port. I think we only advise having this 
down for userspace bridges, but not in the case the bridge is the tunnel 
endpoint.

> Unless I misunderstood which ports we're talking about here.
>
> Best regards, Ilya Maximets.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 1/5] ovsdb: raft: Randomize leadership transfer.

2024-03-25 Thread Ilya Maximets
On 3/25/24 08:19, Han Zhou wrote:
> 
> 
> On Tue, Mar 19, 2024 at 12:05 AM Felix Huettner via dev 
> mailto:ovs-dev@openvswitch.org>> wrote:
>>
>> On Mon, Mar 18, 2024 at 05:52:12PM +0100, Ilya Maximets wrote:
>> > On 3/18/24 17:15, Felix Huettner wrote:
>> > > On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
>> > >> Each cluster member typically always transfers leadership to the same
>> > >> other member, which is the first in their list of servers.  This may
>> > >> result in two servers in a 3-node cluster to transfer leadership to
>> > >> each other and never to the third one.
>> > >>
>> > >> Randomizing the selection to make the load more evenly distributed.
>> > >
>> > > Hi Ilya,
>> >
>> > Hi, Felix.  Thanks for the comments!
>> >
>> > >
>> > > just out of curiosity: since basically only one of the 3 members is
>> > > active at any point in time, is balancing the load even relevant. It
>> > > will always only be on one of the 3 members anyway.
>> > It is not very important, I agree.  What I observed in practice is
>> > that sometimes if, for example, compactions happen in approximately
>> > similar time, the server we transfer the leadership to may send it
>> > right back, while the first server is busy compacting.  This is
>> > less of a problem today as well, since we have parallel compaction,
>> > but it may still be annoying if that happens every time.
>> >
>> > I'm mostly making this patch for the purpose of better testing below.
>> >
>> > >
>> > >>
>> > >> This also makes cluster failure tests cover more scenarios as servers
>> > >> will transfer leadership to servers they didn't before.  This is
>> > >> important especially for cluster joining tests.
>> > >>
>> > >> Ideally, we would transfer to a random server with a highest apply
>> > >> index, but not trying to implement this for now.
>> > >>
>> > >> Signed-off-by: Ilya Maximets > > >> >
>> > >> ---
>> > >>  ovsdb/raft.c | 6 +-
>> > >>  1 file changed, 5 insertions(+), 1 deletion(-)
>> > >>
>> > >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> > >> index f463afcb3..25f462431 100644
>> > >> --- a/ovsdb/raft.c
>> > >> +++ b/ovsdb/raft.c
>> > >> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, 
>> > >> const char *reason)
>> > >>          return;
>> > >>      }
>> > >>
>> > >> +    size_t n = hmap_count(>servers) * 3;
>> > >>      struct raft_server *s;
>> > >> -    HMAP_FOR_EACH (s, hmap_node, >servers) {
>> > >> +
>> > >> +    while (n--) {
>> > >> +        s = CONTAINER_OF(hmap_random_node(>servers),
>> > >> +                         struct raft_server, hmap_node);
>> > >>          if (!uuid_equals(>sid, >sid)
>> > >>              && s->phase == RAFT_PHASE_STABLE) {
>> > >>              struct raft_conn *conn = raft_find_conn_by_sid(raft, 
>> > >> >sid);
>> > >
>> > > i think this has the risk of never selecting one server out of the list 
>> > > of
>> > > cluster members. Suppose you have a 3 node cluster where one of them
>> > > members is down. In this case there is a single member the leadership
>> > > can be transfered to.
>> > > This means for a single iteration of the while loop has a chance of 2/3
>> > > to select a member that can not be used. Over the 9 iterations this
>> > > would do this would give a chance of (2/3)^9 to always choose an
>> > > inappropriate member. This equals to a chance of 0.026% of never
>> > > selecting the single appropriate target member.
>> > >
>> > > Could we instead rather start iterating the hmap at some random offset?
>> > > That should give a similar result while giving the guarantee that we
>> > > visit each member once.
>> >
>> > I don't think it's very important, because we're transferring leadership
>> > without verifying if the other side is alive and we're not checking if we
>> > actually transferred it or not.  So, retries are basically just for the
>> > server not hitting itself or servers that didn't join yet.  We will 
>> > transfer
>> > to the first other server that already joined regardless of the iteration
>> > method.
>> >
>> > The way to mostly fix the issue with dead servers is, as I mentioned, to
>> > transfer only to the servers with the highest apply index, i.e. the servers
>> > that acknowledged the most amount of changes.  This will ensure that we
>> > at least heard something from the server in the recent past.  But that's
>> > a separate feature to implement.
>> >
>> > Also, the leadership transfer is just an optimization to speed up 
>> > elections,
>> > so it's not necessary for correctness for this operation to be successful.
>> > If we fail to transfer or transfer to the dead server, the rest of the
>> > cluster will notice the absence of the leader and initiate election by
>> > the timeout.
>> >
>> > Best regards, Ilya Maximets.
>>
>> Hi Ilya,
>>
>> thanks for the clarifications.
>> Then i guess the 0.026% chance is not too relevant.
>>
> 
> Thanks for the patch and the discussion.
> I still have a concern 

Re: [ovs-dev] [PATCH net] openvswitch: Set the skbuff pkt_type for proper pmtud support.

2024-03-25 Thread Ilya Maximets
On 3/25/24 13:22, Aaron Conole wrote:
> Eelco Chaudron  writes:
> 
>> On 22 Mar 2024, at 20:06, Aaron Conole wrote:
>>
>>> Open vSwitch is originally intended to switch at layer 2, only dealing with
>>> Ethernet frames.  With the introduction of l3 tunnels support, it crossed
>>> into the realm of needing to care a bit about some routing details when
>>> making forwarding decisions.  If an oversized packet would need to be
>>> fragmented during this forwarding decision, there is a chance for pmtu
>>> to get involved and generate a routing exception.  This is gated by the
>>> skbuff->pkt_type field.
>>>
>>> When a flow is already loaded into the openvswitch module this field is
>>> set up and transitioned properly as a packet moves from one port to
>>> another.  In the case that a packet execute is invoked after a flow is
>>> newly installed this field is not properly initialized.  This causes the
>>> pmtud mechanism to omit sending the required exception messages across
>>> the tunnel boundary and a second attempt needs to be made to make sure
>>> that the routing exception is properly setup.  To fix this, we set the
>>> outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
>>> to the openvswitch module via a port device or packet command.
>>
>> Is this not a problem when the packet comes from the bridge port in the 
>> kernel?
> 
> It very well may be an issue there as well, but the recommendation is to
> operate with the bridge port down as far as I know, so I don't know if
> this issue has been observed happening from the bridge port.

FWIW, bridge ports are typically used as an entry point for tunneled
traffic so it can egress from a physical port attached to OVS.  It means
they are pretty much always UP in most common setups like OpenStack or
ovn-kubernetes and handle a decent amount of traffic.  They are also used
to direct some other types of traffic to the host kernel.

Unless I misunderstood which ports we're talking about here.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] openvswitch: Set the skbuff pkt_type for proper pmtud support.

2024-03-25 Thread Aaron Conole
Eelco Chaudron  writes:

> On 22 Mar 2024, at 20:06, Aaron Conole wrote:
>
>> Open vSwitch is originally intended to switch at layer 2, only dealing with
>> Ethernet frames.  With the introduction of l3 tunnels support, it crossed
>> into the realm of needing to care a bit about some routing details when
>> making forwarding decisions.  If an oversized packet would need to be
>> fragmented during this forwarding decision, there is a chance for pmtu
>> to get involved and generate a routing exception.  This is gated by the
>> skbuff->pkt_type field.
>>
>> When a flow is already loaded into the openvswitch module this field is
>> set up and transitioned properly as a packet moves from one port to
>> another.  In the case that a packet execute is invoked after a flow is
>> newly installed this field is not properly initialized.  This causes the
>> pmtud mechanism to omit sending the required exception messages across
>> the tunnel boundary and a second attempt needs to be made to make sure
>> that the routing exception is properly setup.  To fix this, we set the
>> outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
>> to the openvswitch module via a port device or packet command.
>
> Is this not a problem when the packet comes from the bridge port in the 
> kernel?

It very well may be an issue there as well, but the recommendation is to
operate with the bridge port down as far as I know, so I don't know if
this issue has been observed happening from the bridge port.

Since I will spin a v2 with a comment, do you want me to mention
something about the bridge port?

>> This issue is periodically encountered in complex setups, such as large
>> openshift deployments, where multiple sets of tunnel traversal occurs.
>> A way to recreate this is with the ovn-heater project that can setup
>> a networking environment which mimics such large deployments.  In that
>> environment, without this patch, we can see:
>>
>>   ./ovn_cluster.sh start
>>   podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200
>>   podman exec ovn-chassis-1 ip netns exec sw01p1  ip r flush cache
>>   podman exec ovn-chassis-1 ip netns exec sw01p1 ping 21.0.0.3 -M do -s 1300 
>> -c2
>>   PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
>>   From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142)
>>
>>   --- 21.0.0.3 ping statistics ---
>>   2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1017ms
>>
>> Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not
>> sent into the server.
>>
>> With this patch, setting the pkt_type, we see the following:
>>
>>   podman exec ovn-chassis-1 ip netns exec sw01p1 ping 21.0.0.3 -M do -s 1300 
>> -c2
>>   PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
>>   From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222)
>>   ping: local error: message too long, mtu=1222
>>
>>   --- 21.0.0.3 ping statistics ---
>>   2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1061ms
>>
>> In this case, the first ping request receives the FRAG_NEEDED message and
>> a local routing exception is created.
>>
>> Reported-at: https://issues.redhat.com/browse/FDP-164
>> Fixes: 58264848a5a7 ("openvswitch: Add vxlan tunneling support.")
>> Signed-off-by: Aaron Conole 
>> ---
>> NOTE: An alternate approach would be to add a netlink attribute to preserve
>>   pkt_type across the kernel->user boundary, but that does require some
>>   userspace cooperation.
>
> I prefer the method in this patch, as it requires no userspace change,
> i.e. it will work even with older versions of OVS without the need for
> backports.

Yes - that was my thinking as well.

>>  net/openvswitch/actions.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> index 6fcd7e2ca81fe..952c6292100d0 100644
>> --- a/net/openvswitch/actions.c
>> +++ b/net/openvswitch/actions.c
>> @@ -936,6 +936,8 @@ static void do_output(struct datapath *dp, struct 
>> sk_buff *skb, int out_port,
>>  pskb_trim(skb, ovs_mac_header_len(key));
>>  }
>>
>> +skb->pkt_type = PACKET_OUTGOING;
>> +
>
> Maybe add a comment based on the large explanation above?

Okay - I can add one.

>>  if (likely(!mru ||
>> (skb->len <= mru + vport->dev->hard_header_len))) {
>>  ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
>> -- 
>> 2.41.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2] ofproto-dpif-xlate: Fix continuations with associated metering.

2024-03-25 Thread Aaron Conole
Ilya Maximets  writes:

> On 3/22/24 14:40, Aaron Conole wrote:
>> Open vSwitch supports the ability to invoke a controller action by way
>> of a sample action with a specified meter.  In the normal case, this
>> sample action is transparently generated during xlate processing.  However,
>> when executing via a continuation, the logic to generate the sample
>> action when finishing the context freeze was missing.  The result is that
>> the behavior when action is 'controller(pause,meter_id=1)' does not match
>> the behavior when action is 'controller(meter_id=1)'.
>> 
>> OVN and other controller solutions may rely on this metering to protect
>> the control path, so it is critical to preserve metering, whether we are
>> doing a plain old send to controller, or a continuation.
>> 
>> Fixes: 77ab5fd2a95b ("Implement serializing the state of packet
>> traversal in "continuations".")
>> Reported-at: https://issues.redhat.com/browse/FDP-455
>> Tested-by: Alex Musil 
>> Signed-off-by: Aaron Conole 
>> ---
>> v1 -> v2:
>>   Clean up unrelated whitespace change
>>   Fix style issues around test comments
>>   Fix a line length issue truncating action in a comment
>>   Change from `` to $() in the test
>
> Nit: May also change `` to $() in on_exit hook.

I'll fix on apply.

> But anyway:
>
> Acked-by: Ilya Maximets 

Thanks for the review!

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn] ovn-ctl: Use the current user for default file permissions.

2024-03-25 Thread Ales Musil
The ovn-ctl utility was assuming that the user/group is always root,
when not specified otherwise by the --ovn-user/--ovn-group options.
This has the consequence of trying to change permissions of OVN
directories to root:root even though the script might be run as
completely different user.

Take the current user and group instead of the hardcoded root.
At the same time remove the ovs-user option as it was not used for
anything and might be confusing.

Reported-at: https://issues.redhat.com/browse/FDP-245
Signed-off-by: Ales Musil 
---
 utilities/ovn-ctl   | 5 ++---
 utilities/ovn-ctl.8.xml | 1 -
 utilities/ovn-lib.in| 4 ++--
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/utilities/ovn-ctl b/utilities/ovn-ctl
index 700efe35a..dae5e22f4 100755
--- a/utilities/ovn-ctl
+++ b/utilities/ovn-ctl
@@ -269,8 +269,8 @@ $cluster_remote_port
 # Set the owner of the ovn_dbdir (with -R option) to OVN_USER if set.
 # This is required because the ovndbs are created with root permission
 # if not present when create_cluster/upgrade_db is called.
-INSTALL_USER="root"
-INSTALL_GROUP="root"
+INSTALL_USER="$(id -un)"
+INSTALL_GROUP="$(id -gn)"
 [ "$OVN_USER" != "" ] && INSTALL_USER="${OVN_USER%:*}"
 [ "${OVN_USER##*:}" != "" ] && INSTALL_GROUP="${OVN_USER##*:}"
 
@@ -1088,7 +1088,6 @@ Options:
   --ovn-ic-sb-db-ssl-protocols=PROTOCOLS OVN IC Southbound DB SSL protocols
   --ovn-ic-sb-db-ssl-ciphers=CIPHERS OVN IC Southbound DB SSL cipher list
   --ovn-user="user[:group]"  pass the --user flag to the ovn daemons
-  --ovs-user="user[:group]"  pass the --user flag to ovs daemons
   --ovsdb-nb-wrapper=WRAPPER run with a wrapper like valgrind for debugging
   --ovsdb-sb-wrapper=WRAPPER run with a wrapper like valgrind for debugging
   --ovsdb-disable-file-column-diff=no|yes
diff --git a/utilities/ovn-ctl.8.xml b/utilities/ovn-ctl.8.xml
index 57712bfdc..c0fbb0792 100644
--- a/utilities/ovn-ctl.8.xml
+++ b/utilities/ovn-ctl.8.xml
@@ -70,7 +70,6 @@
 --ovsdb-nb-wrapper=WRAPPER
 --ovsdb-sb-wrapper=WRAPPER
 --ovn-user=USER:GROUP
---ovs-user=USER:GROUP
 -h | --help
 
 File location options
diff --git a/utilities/ovn-lib.in b/utilities/ovn-lib.in
index 1e48ef28c..65cbfbcdc 100644
--- a/utilities/ovn-lib.in
+++ b/utilities/ovn-lib.in
@@ -48,8 +48,8 @@ LC_ALL=C; export LC_ALL
 ovn_install_dir () {
 DIR="$1"
 INSTALL_MODE="${2:-755}"
-INSTALL_USER="root"
-INSTALL_GROUP="root"
+INSTALL_USER="$(id -un)"
+INSTALL_GROUP="$(id -gn)"
 [ "$OVN_USER" != "" ] && INSTALL_USER="${OVN_USER%:*}"
 [ "${OVN_USER##*:}" != "" ] && INSTALL_GROUP="${OVN_USER##*:}"
 
-- 
2.44.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Don't mirror packets that aren't modified.

2024-03-25 Thread Zhangweiwei
Hi,
I have tried this patch, however, there are still some issues when the packets 
contents are changed across recirculation. On the follow example, packets are 
modified in recirc_id(0) after mirror, the mirror context reset. Therefore, 
there are two ICMP request packets are mirrored on port mitapVm71.

In the following example, ICMP packets ared sent from port(11) to port(14),
[root@localhost ~]# ovs-appctl dpif/dump-flows vds1-br
ct_state(-new-est-rel-rpl-inv-trk),recirc_id(0),in_port(11),packet_type(ns=0,id=0),eth(src=52:54:00:67:d5:61,dst=68:05:ca:21:d6:e5),eth_type(0x0800),ipv4(src=11.11.70.1,dst=1.1.70.2,proto=1,ttl=64,frag=no),
 packets:431, bytes:42238, used:0.574s, 
actions:10,set(eth(src=68:05:ca:21:d6:e5,dst=52:54:00:9a:bf:ed)),set(ipv4(ttl=63)),ct(zone=6),recirc(0x3e8)
ct_state(+est-rel-rpl),recirc_id(0x3e8),in_port(11),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
 packets:430, bytes:42140, used:0.574s, actions:10,14

ct_state(-new+est-rel+rpl-inv+trk),recirc_id(0x3e9),in_port(14),packet_type(ns=0,id=0),eth(src=52:54:00:9a:bf:ed,dst=68:05:ca:21:d6:e5),eth_type(0x0800),ipv4(dst=11.11.70.1,proto=1,ttl=64,frag=no),
 packets:431, bytes:42238, used:0.574s, 
actions:set(eth(src=68:05:ca:21:d6:e5,dst=52:54:00:67:d5:61)),set(ipv4(ttl=63)),11,10
ct_state(-trk),recirc_id(0),in_port(14),packet_type(ns=0,id=0),eth(src=52:54:00:9a:bf:ed),eth_type(0x0800),ipv4(src=1.1.70.2,proto=1,frag=no),
 packets:431, bytes:42238, used:0.574s, actions:ct(zone=6),recirc(0x3e9)

[root@localhost ~]# ovs-appctl dpif/show
netdev@ovs-netdev: hit:2552 missed:3019
  vds1-br:
mitapVm71 14/10: (system)
tapVm71 5/11: (dpdkvhostuserclient: configured_rx_queues=1, 
configured_tx_queues=1, mtu=1500, requested_rx_queues=1, requested_tx_queues=1)
tapVm72 6/14: (dpdkvhostuserclient: configured_rx_queues=1, 
configured_tx_queues=1, mtu=1500, requested_rx_queues=1, requested_tx_queues=1)

[root@localhost ~]# ovs-tcpdump -i tapVm71
14:38:53.702142 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
2014, length 64
14:38:53.702143 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
2014, length 64
14:38:53.702185 IP 1.1.70.2 > 11.11.70.1: ICMP echo reply, id 13483, seq 2014, 
length 64
14:38:54.742141 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
2015, length 64
14:38:54.742143 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
2015, length 64
14:38:54.742183 IP 1.1.70.2 > 11.11.70.1: ICMP echo reply, id 13483, seq 2015, 
length 64
14:38:55.782142 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
2016, length 64
14:38:55.782144 IP 11.11.70.1 > 1.1.70.2: ICMP echo request, id 13483, seq 
2016, length 64
14:38:55.782186 IP 1.1.70.2 > 11.11.70.1: ICMP echo reply, id 13483, seq 2016, 
length 64

-邮件原件-
发件人: Mike Pattrick [mailto:m...@redhat.com]
发送时间: 2024年3月13日 1:37
收件人: d...@openvswitch.org
抄送: Mike Pattrick ; zhangweiwei (RD) 
主题: [PATCH] ofproto-dpif-upcall: Don't mirror packets that aren't modified.

Previously OVS reset the mirror contents when a packet is modified in such a 
way that the packets contents changes. However, this change incorrectly reset 
that mirror context when only metadata changes as well.

Now we check for all metadata fields, instead of just tunnel metadata, before 
resetting the mirror context.

Fixes: feed7f677505 ("ofproto-dpif-upcall: Mirror packets that are modified.")
Reported-by: Zhangweiwei 
Signed-off-by: Mike Pattrick 
---
 include/openvswitch/meta-flow.h |   1 +
 lib/meta-flow.c | 109 
 ofproto/ofproto-dpif-xlate.c|   2 +-
 tests/ofproto-dpif.at   |   5 +-
 4 files changed, 114 insertions(+), 3 deletions(-)

diff --git a/include/openvswitch/meta-flow.h b/include/openvswitch/meta-flow.h 
index 3b0220aaa..96aad3933 100644
--- a/include/openvswitch/meta-flow.h
+++ b/include/openvswitch/meta-flow.h
@@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct mf_field *,
   const union mf_value *mask,
   struct flow *);  bool mf_is_tun_metadata(const 
struct mf_field *);
+bool mf_is_metadata(const struct mf_field *);
 bool mf_is_frozen_metadata(const struct mf_field *);  bool 
mf_is_pipeline_field(const struct mf_field *);  bool mf_is_set(const struct 
mf_field *, const struct flow *); diff --git a/lib/meta-flow.c 
b/lib/meta-flow.c index aa7cf1fcb..7ecec334e 100644
--- a/lib/meta-flow.c
+++ b/lib/meta-flow.c
@@ -1788,6 +1788,115 @@ mf_is_tun_metadata(const struct mf_field *mf)
mf->id < MFF_TUN_METADATA0 + TUN_METADATA_NUM_OPTS;  }

+bool
+mf_is_metadata(const struct mf_field *mf) {
+switch (mf->id) {
+CASE_MFF_TUN_METADATA:
+case MFF_METADATA:
+case MFF_IN_PORT:
+case MFF_IN_PORT_OXM:
+CASE_MFF_REGS:
+CASE_MFF_XREGS:
+CASE_MFF_XXREGS:
+case MFF_PACKET_TYPE:
+case MFF_DP_HASH:
+case MFF_RECIRC_ID:
+case MFF_CONJ_ID:
+case MFF_ACTSET_OUTPUT:
+case 

Re: [ovs-dev] [PATCH ovn v3 6/8] utilities/docker: Fix up container build.

2024-03-25 Thread Ales Musil
On Mon, Mar 25, 2024 at 9:47 AM Dumitru Ceara  wrote:

> On 3/22/24 19:34, Mark Michelson wrote:
> > On 3/21/24 19:15, Dumitru Ceara wrote:
> >> Most of the steps were inaccurate.  Instead, use latest Ubuntu, use
> >> OVS from the submodule inside the OVN repo.
> >>
> >> Signed-off-by: Dumitru Ceara 
> >> ---
> >>   utilities/docker/Makefile  |  4 ++--
> >>   utilities/docker/debian/Dockerfile |  2 +-
> >>   utilities/docker/debian/build.sh   |  2 ++
> >>   utilities/docker/install_ovn.sh| 31 ++
> >>   4 files changed, 19 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/utilities/docker/Makefile b/utilities/docker/Makefile
> >> index 57e95651cb..aad9c3482c 100644
> >> --- a/utilities/docker/Makefile
> >> +++ b/utilities/docker/Makefile
> >> @@ -1,5 +1,5 @@
> >> -#export OVN_BRANCH=master
> >> -#export OVN_VERSION=2.12
> >> +#export OVN_BRANCH=main
> >> +#export OVN_VERSION=24.03.90
> >
> > Is this something that we should update with each major release of OVN?
> > If so, I could probably alter my release script to include updating
> > utilities/docker/Makefile as part of the release patches.
> >
>
> It's just a comment and I guess the original intention was to show users
> how to set this up.  But, back to your question, if it's not a lot of
> work, automatically changing this on release would be nice.
>
> Also, it might make sense to set up a CI job that actually runs this in
> CI, maybe periodically.  I can try to find some time to do that at some
> point in the future.  Wdyt?
>


Hi Mark, Dumitru,

from a different perspective, I would be interested to see if this is still
used by anyone. We have other methods of running OVN in containers that are
up to date and maintained. It might be the case of working perfectly, still
used and doesn't need any attention. But it would probably still be useful
to hear from the community if that's the case. IMO we shouldn't run CI on
something just for the sake of running CI.

Does that make sense?


>
> >>   #export DISTRO=debian
> >>   #export GITHUB_SRC=https://github.com/ovn-org/ovn.git
> >>   #export DOCKER_REPO=ovn-org/ovn
> >> diff --git a/utilities/docker/debian/Dockerfile
> >> b/utilities/docker/debian/Dockerfile
> >> index 366ad6d4f3..a89ef46c9f 100644
> >> --- a/utilities/docker/debian/Dockerfile
> >> +++ b/utilities/docker/debian/Dockerfile
> >> @@ -1,4 +1,4 @@
> >> -FROM ubuntu:16.04
> >> +FROM ubuntu:22.04
> >>   MAINTAINER "Aliasgar Ginwala" 
> >> ARG OVN_BRANCH
> >> diff --git a/utilities/docker/debian/build.sh
> >> b/utilities/docker/debian/build.sh
> >> index 57ace5f505..6edb5b85e4 100755
> >> --- a/utilities/docker/debian/build.sh
> >> +++ b/utilities/docker/debian/build.sh
> >> @@ -12,6 +12,8 @@
> >>   # See the License for the specific language governing permissions and
> >>   # limitations under the License.
> >>   +set -e
> >> +
> >>   OVN_BRANCH=$1
> >>   GITHUB_SRC=$2
> >>   diff --git a/utilities/docker/install_ovn.sh
> >> b/utilities/docker/install_ovn.sh
> >> index 55c189aaee..5157da1497 100755
> >> --- a/utilities/docker/install_ovn.sh
> >> +++ b/utilities/docker/install_ovn.sh
> >> @@ -12,29 +12,26 @@
> >>   # See the License for the specific language governing permissions and
> >>   # limitations under the License.
> >>   +set -e
> >> +
> >>   OVN_BRANCH=$1
> >>   GITHUB_SRC=$2
> >>   -# get ovs source always from master as its needed as dependency
> >> -mkdir /build; cd /build
> >> -git clone --depth 1 -b master https://github.com/openvswitch/ovs.git
> >> -cd ovs;
> >> -mkdir _gcc;
> >> +# Get ovn source.
> >> +git clone --depth 1 -b $OVN_BRANCH $GITHUB_SRC
> >> +cd ovn
> >>   -# build and install
> >> +# Get OVS submodule, build and install OVS.
> >> +git submodule update --init
> >> +cd ovs
> >>   ./boot.sh
> >> -cd _gcc
> >> -../configure --localstatedir="/var" --sysconfdir="/etc"
> >> --prefix="/usr" \
> >> +./configure --localstatedir="/var" --sysconfdir="/etc" --prefix="/usr"
> \
> >>   --enable-ssl
> >> -cd ..; make -C _gcc install; cd ..
> >> -
> >> +make -j8 install
> >> +cd ..
> >>   -# get ovn source
> >> -git clone --depth 1 -b $OVN_BRANCH $GITHUB_SRC
> >> -cd ovn
> >> -
> >> -# build and install
> >> +# Build and install OVN.
> >>   ./boot.sh
> >>   ./configure --localstatedir="/var" --sysconfdir="/etc"
> >> --prefix="/usr" \
> >> ---enable-ssl --with-ovs-source=/build/ovs/
> >> --with-ovs-build=/build/ovs/_gcc
> >> -make -j8; make install
> >> +--enable-ssl
> >> +make -j8 install
> >
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>

Thanks,
Ales
-- 

Ales Musil

Senior Software Engineer - OVN Core

Red Hat EMEA 

amu...@redhat.com

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v3 6/8] utilities/docker: Fix up container build.

2024-03-25 Thread Dumitru Ceara
On 3/22/24 19:34, Mark Michelson wrote:
> On 3/21/24 19:15, Dumitru Ceara wrote:
>> Most of the steps were inaccurate.  Instead, use latest Ubuntu, use
>> OVS from the submodule inside the OVN repo.
>>
>> Signed-off-by: Dumitru Ceara 
>> ---
>>   utilities/docker/Makefile  |  4 ++--
>>   utilities/docker/debian/Dockerfile |  2 +-
>>   utilities/docker/debian/build.sh   |  2 ++
>>   utilities/docker/install_ovn.sh    | 31 ++
>>   4 files changed, 19 insertions(+), 20 deletions(-)
>>
>> diff --git a/utilities/docker/Makefile b/utilities/docker/Makefile
>> index 57e95651cb..aad9c3482c 100644
>> --- a/utilities/docker/Makefile
>> +++ b/utilities/docker/Makefile
>> @@ -1,5 +1,5 @@
>> -#export OVN_BRANCH=master
>> -#export OVN_VERSION=2.12
>> +#export OVN_BRANCH=main
>> +#export OVN_VERSION=24.03.90
> 
> Is this something that we should update with each major release of OVN?
> If so, I could probably alter my release script to include updating
> utilities/docker/Makefile as part of the release patches.
> 

It's just a comment and I guess the original intention was to show users
how to set this up.  But, back to your question, if it's not a lot of
work, automatically changing this on release would be nice.

Also, it might make sense to set up a CI job that actually runs this in
CI, maybe periodically.  I can try to find some time to do that at some
point in the future.  Wdyt?

>>   #export DISTRO=debian
>>   #export GITHUB_SRC=https://github.com/ovn-org/ovn.git
>>   #export DOCKER_REPO=ovn-org/ovn
>> diff --git a/utilities/docker/debian/Dockerfile
>> b/utilities/docker/debian/Dockerfile
>> index 366ad6d4f3..a89ef46c9f 100644
>> --- a/utilities/docker/debian/Dockerfile
>> +++ b/utilities/docker/debian/Dockerfile
>> @@ -1,4 +1,4 @@
>> -FROM ubuntu:16.04
>> +FROM ubuntu:22.04
>>   MAINTAINER "Aliasgar Ginwala" 
>>     ARG OVN_BRANCH
>> diff --git a/utilities/docker/debian/build.sh
>> b/utilities/docker/debian/build.sh
>> index 57ace5f505..6edb5b85e4 100755
>> --- a/utilities/docker/debian/build.sh
>> +++ b/utilities/docker/debian/build.sh
>> @@ -12,6 +12,8 @@
>>   # See the License for the specific language governing permissions and
>>   # limitations under the License.
>>   +set -e
>> +
>>   OVN_BRANCH=$1
>>   GITHUB_SRC=$2
>>   diff --git a/utilities/docker/install_ovn.sh
>> b/utilities/docker/install_ovn.sh
>> index 55c189aaee..5157da1497 100755
>> --- a/utilities/docker/install_ovn.sh
>> +++ b/utilities/docker/install_ovn.sh
>> @@ -12,29 +12,26 @@
>>   # See the License for the specific language governing permissions and
>>   # limitations under the License.
>>   +set -e
>> +
>>   OVN_BRANCH=$1
>>   GITHUB_SRC=$2
>>   -# get ovs source always from master as its needed as dependency
>> -mkdir /build; cd /build
>> -git clone --depth 1 -b master https://github.com/openvswitch/ovs.git
>> -cd ovs;
>> -mkdir _gcc;
>> +# Get ovn source.
>> +git clone --depth 1 -b $OVN_BRANCH $GITHUB_SRC
>> +cd ovn
>>   -# build and install
>> +# Get OVS submodule, build and install OVS.
>> +git submodule update --init
>> +cd ovs
>>   ./boot.sh
>> -cd _gcc
>> -../configure --localstatedir="/var" --sysconfdir="/etc"
>> --prefix="/usr" \
>> +./configure --localstatedir="/var" --sysconfdir="/etc" --prefix="/usr" \
>>   --enable-ssl
>> -cd ..; make -C _gcc install; cd ..
>> -
>> +make -j8 install
>> +cd ..
>>   -# get ovn source
>> -git clone --depth 1 -b $OVN_BRANCH $GITHUB_SRC
>> -cd ovn
>> -
>> -# build and install
>> +# Build and install OVN.
>>   ./boot.sh
>>   ./configure --localstatedir="/var" --sysconfdir="/etc"
>> --prefix="/usr" \
>> ---enable-ssl --with-ovs-source=/build/ovs/
>> --with-ovs-build=/build/ovs/_gcc
>> -make -j8; make install
>> +--enable-ssl
>> +make -j8 install
> 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] openvswitch: Set the skbuff pkt_type for proper pmtud support.

2024-03-25 Thread Eelco Chaudron



On 22 Mar 2024, at 20:06, Aaron Conole wrote:

> Open vSwitch is originally intended to switch at layer 2, only dealing with
> Ethernet frames.  With the introduction of l3 tunnels support, it crossed
> into the realm of needing to care a bit about some routing details when
> making forwarding decisions.  If an oversized packet would need to be
> fragmented during this forwarding decision, there is a chance for pmtu
> to get involved and generate a routing exception.  This is gated by the
> skbuff->pkt_type field.
>
> When a flow is already loaded into the openvswitch module this field is
> set up and transitioned properly as a packet moves from one port to
> another.  In the case that a packet execute is invoked after a flow is
> newly installed this field is not properly initialized.  This causes the
> pmtud mechanism to omit sending the required exception messages across
> the tunnel boundary and a second attempt needs to be made to make sure
> that the routing exception is properly setup.  To fix this, we set the
> outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
> to the openvswitch module via a port device or packet command.

Is this not a problem when the packet comes from the bridge port in the kernel?

> This issue is periodically encountered in complex setups, such as large
> openshift deployments, where multiple sets of tunnel traversal occurs.
> A way to recreate this is with the ovn-heater project that can setup
> a networking environment which mimics such large deployments.  In that
> environment, without this patch, we can see:
>
>   ./ovn_cluster.sh start
>   podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200
>   podman exec ovn-chassis-1 ip netns exec sw01p1  ip r flush cache
>   podman exec ovn-chassis-1 ip netns exec sw01p1 ping 21.0.0.3 -M do -s 1300 
> -c2
>   PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
>   From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142)
>
>   --- 21.0.0.3 ping statistics ---
>   2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1017ms
>
> Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not
> sent into the server.
>
> With this patch, setting the pkt_type, we see the following:
>
>   podman exec ovn-chassis-1 ip netns exec sw01p1 ping 21.0.0.3 -M do -s 1300 
> -c2
>   PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
>   From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222)
>   ping: local error: message too long, mtu=1222
>
>   --- 21.0.0.3 ping statistics ---
>   2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1061ms
>
> In this case, the first ping request receives the FRAG_NEEDED message and
> a local routing exception is created.
>
> Reported-at: https://issues.redhat.com/browse/FDP-164
> Fixes: 58264848a5a7 ("openvswitch: Add vxlan tunneling support.")
> Signed-off-by: Aaron Conole 
> ---
> NOTE: An alternate approach would be to add a netlink attribute to preserve
>   pkt_type across the kernel->user boundary, but that does require some
>   userspace cooperation.

I prefer the method in this patch, as it requires no userspace change, i.e. it 
will work even with older versions of OVS without the need for backports.

>  net/openvswitch/actions.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 6fcd7e2ca81fe..952c6292100d0 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -936,6 +936,8 @@ static void do_output(struct datapath *dp, struct sk_buff 
> *skb, int out_port,
>   pskb_trim(skb, ovs_mac_header_len(key));
>   }
>
> + skb->pkt_type = PACKET_OUTGOING;
> +

Maybe add a comment based on the large explanation above?

>   if (likely(!mru ||
>  (skb->len <= mru + vport->dev->hard_header_len))) {
>   ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
> -- 
> 2.41.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 1/5] ovsdb: raft: Randomize leadership transfer.

2024-03-25 Thread Han Zhou
On Tue, Mar 19, 2024 at 12:05 AM Felix Huettner via dev <
ovs-dev@openvswitch.org> wrote:
>
> On Mon, Mar 18, 2024 at 05:52:12PM +0100, Ilya Maximets wrote:
> > On 3/18/24 17:15, Felix Huettner wrote:
> > > On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
> > >> Each cluster member typically always transfers leadership to the same
> > >> other member, which is the first in their list of servers.  This may
> > >> result in two servers in a 3-node cluster to transfer leadership to
> > >> each other and never to the third one.
> > >>
> > >> Randomizing the selection to make the load more evenly distributed.
> > >
> > > Hi Ilya,
> >
> > Hi, Felix.  Thanks for the comments!
> >
> > >
> > > just out of curiosity: since basically only one of the 3 members is
> > > active at any point in time, is balancing the load even relevant. It
> > > will always only be on one of the 3 members anyway.
> > It is not very important, I agree.  What I observed in practice is
> > that sometimes if, for example, compactions happen in approximately
> > similar time, the server we transfer the leadership to may send it
> > right back, while the first server is busy compacting.  This is
> > less of a problem today as well, since we have parallel compaction,
> > but it may still be annoying if that happens every time.
> >
> > I'm mostly making this patch for the purpose of better testing below.
> >
> > >
> > >>
> > >> This also makes cluster failure tests cover more scenarios as servers
> > >> will transfer leadership to servers they didn't before.  This is
> > >> important especially for cluster joining tests.
> > >>
> > >> Ideally, we would transfer to a random server with a highest apply
> > >> index, but not trying to implement this for now.
> > >>
> > >> Signed-off-by: Ilya Maximets 
> > >> ---
> > >>  ovsdb/raft.c | 6 +-
> > >>  1 file changed, 5 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> > >> index f463afcb3..25f462431 100644
> > >> --- a/ovsdb/raft.c
> > >> +++ b/ovsdb/raft.c
> > >> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft,
const char *reason)
> > >>  return;
> > >>  }
> > >>
> > >> +size_t n = hmap_count(>servers) * 3;
> > >>  struct raft_server *s;
> > >> -HMAP_FOR_EACH (s, hmap_node, >servers) {
> > >> +
> > >> +while (n--) {
> > >> +s = CONTAINER_OF(hmap_random_node(>servers),
> > >> + struct raft_server, hmap_node);
> > >>  if (!uuid_equals(>sid, >sid)
> > >>  && s->phase == RAFT_PHASE_STABLE) {
> > >>  struct raft_conn *conn = raft_find_conn_by_sid(raft,
>sid);
> > >
> > > i think this has the risk of never selecting one server out of the
list of
> > > cluster members. Suppose you have a 3 node cluster where one of them
> > > members is down. In this case there is a single member the leadership
> > > can be transfered to.
> > > This means for a single iteration of the while loop has a chance of
2/3
> > > to select a member that can not be used. Over the 9 iterations this
> > > would do this would give a chance of (2/3)^9 to always choose an
> > > inappropriate member. This equals to a chance of 0.026% of never
> > > selecting the single appropriate target member.
> > >
> > > Could we instead rather start iterating the hmap at some random
offset?
> > > That should give a similar result while giving the guarantee that we
> > > visit each member once.
> >
> > I don't think it's very important, because we're transferring leadership
> > without verifying if the other side is alive and we're not checking if
we
> > actually transferred it or not.  So, retries are basically just for the
> > server not hitting itself or servers that didn't join yet.  We will
transfer
> > to the first other server that already joined regardless of the
iteration
> > method.
> >
> > The way to mostly fix the issue with dead servers is, as I mentioned, to
> > transfer only to the servers with the highest apply index, i.e. the
servers
> > that acknowledged the most amount of changes.  This will ensure that we
> > at least heard something from the server in the recent past.  But that's
> > a separate feature to implement.
> >
> > Also, the leadership transfer is just an optimization to speed up
elections,
> > so it's not necessary for correctness for this operation to be
successful.
> > If we fail to transfer or transfer to the dead server, the rest of the
> > cluster will notice the absence of the leader and initiate election by
> > the timeout.
> >
> > Best regards, Ilya Maximets.
>
> Hi Ilya,
>
> thanks for the clarifications.
> Then i guess the 0.026% chance is not too relevant.
>

Thanks for the patch and the discussion.
I still have a concern regarding the retry logic. Maybe it is not a
critical problem if the function failed to select any server to transfer,
but I'd still prefer this function to be more predictable. For example, we
can store the servers of the hmap to an