Re: [ovs-dev] [PATCH 2/4] ovsdb-idl.c: Allows retry even when using a single remote.

2019-08-14 Thread aginwala
Sure. Thanks for re-validating different corner cases with me. Yup, we can
update more details in the leader-only section about using a single LB VIP
while accessing a clustered DB via ovn-nbctl/ovn-sbctl, to avoid confusion.

On Wed, Aug 14, 2019 at 3:21 PM Han Zhou  wrote:

> Hi Aginwala, thanks for the testing. The change of this patch will cause
> the IDL to avoid connecting to a follower if "leader_only" is set to
> "true". It is the same behavior as before when multiple remotes are used,
> but now it does the same even when a single remote is used, because
> the single remote could be a VIP, so it is the desired behavior. For
> ovn-nbctl, "leader_only" defaults to true, so it is expected that it
> refuses to connect if the remote is a follower. However, if you are using
> daemon mode ovn-nbctl and you didn't set "--no-leader-only", it should
> keep retrying until it connects to a leader (depending on the LB redirecting
> to different servers). I agree this may cause some confusion when a user of
> ovn-nbctl connects to only a single remote of a cluster: he/she could get a
> failure if --no-leader-only is not specified, but that is better than letting
> users believe they are connected to a leader while in reality connected to a
> follower. Maybe I can improve the ovn-nbctl/sbctl documentation to emphasize
> this to avoid confusion.
>
> On Tue, Aug 13, 2019 at 7:13 PM aginwala  wrote:
>
>>
>>
>> On Tue, Aug 13, 2019 at 7:07 PM aginwala  wrote:
>>
>>> Thanks for the fixes. Found a bug in the current code which breaks both
>>> nbctl with a local socket and daemon mode on follower nodes. Also, nbctl
>>> daemon mode always tries to connect to the leader node by default, which
>>> also fails to connect.
>>> e.g. export OVN_NB_DB=tcp::6641; ovn-nbctl  --pidfile --detach.
>>>  
>>> 2019-08-14T01:07:06Z|00036|ovsdb_idl|INFO|unix:/var/run/openvswitch/ovnnb_db.sock:
>>> clustered database server is not cluster leader; trying another server
>>> 2019-08-14T01:07:06Z|00037|reconnect|DBG|unix:/var/run/openvswitch/ovnnb_db.sock:
>>> entering RECONNECT
>>> 2019-08-14T01:07:06Z|00038|ovsdb_idl|DBG|unix:/var/run/openvswitch/ovnnb_db.sock:
>>> SERVER_MONITOR_COND_REQUESTED -> RETRY at lib/ovsdb-idl.c:1917
>>> 2019-08-14T01:07:06Z|00039|poll_loop|DBG|wakeup due to 0-ms timeout at
>>> lib/reconnect.c:643
>>> 2019-08-14T01:07:06Z|00040|reconnect|INFO|unix:/var/run/openvswitch/ovnnb_db.sock:
>>> connection attempt timed out
>>> 2019-08-14T01:07:06Z|00041|reconnect|INFO|unix:/var/run/openvswitch/ovnnb_db.sock:
>>> waiting 2 seconds before reconnect
>>> 2019-08-14T01:07:06Z|00042|reconnect|DBG|unix:/var/run/openvswitch/ovnnb_db.sock:
>>> entering BACKOFF
>>>
>>> Need to explicitly specify --no-leader-only to work around it:  ovn-nbctl
>>> --pidfile --detach --no-leader-only.
>>> For the LB VIP, since the LB sees all nodes as active, connections are
>>> established to all cluster nodes equally. I am using a round-robin setting
>>> on the LB VIP for a 3-node cluster using raft.
>>>
>>> Hence, are we always going to keep this behavior, given that it forces
>>> clients to always connect to the leader and never to followers? With this
>>> patch, the idl will show "ovn-nbctl: tcp::6641: database connection
>>> failed ()" if a request reaches a follower, and it only succeeds if the
>>> request reaches the leader node.
>>>
>>>
>>> On Tue, Aug 13, 2019 at 9:23 AM Han Zhou  wrote:
>>>
 From: Han Zhou 

 When clustered mode is used, the client needs to retry connecting
 to new servers when certain failures happen. Today it is allowed to
 retry new connection only if multiple remotes are used, which prevents
 using LB VIP with clustered nodes. This patch makes sure the retry
 logic works when using LB VIP: although same IP is used for retrying,
 the LB can actually redirect the connection to a new node.

 Signed-off-by: Han Zhou 
 ---
 lib/ovsdb-idl.c| 11 +++---
 tests/ovsdb-cluster.at | 57 ++
 tests/test-ovsdb.c |  1 +
 3 files changed, 61 insertions(+), 8 deletions(-)

 diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
 index 1a6a4af..190143f 100644
 --- a/lib/ovsdb-idl.c
 +++ b/lib/ovsdb-idl.c
 @@ -657,12 +657,8 @@ ovsdb_idl_state_to_string(enum ovsdb_idl_state state)
  static void
  ovsdb_idl_retry_at(struct ovsdb_idl *idl, const char *where)
  {
 -    if (idl->session && jsonrpc_session_get_n_remotes(idl->session) > 1) {
 -        ovsdb_idl_force_reconnect(idl);
 -        ovsdb_idl_transition_at(idl, IDL_S_RETRY, where);
 -    } else {
 -        ovsdb_idl_transition_at(idl, IDL_S_ERROR, where);
 -    }
 +    ovsdb_idl_force_reconnect(idl);
 +    ovsdb_idl_transition_at(idl, IDL_S_RETRY, where);
  }

  static void
 @@ -1895,8 +1891,7 @@ ovsdb_idl_check_server_db(struct ovsdb_idl *idl)
      if (!database) {
          VLOG_INFO_RL(&rl, 

Re: [ovs-dev] [PATCH v3 9/9] ofproto-dpif-xlate: Translate timeout policy in ct action

2019-08-14 Thread Yi-Hung Wei
On Tue, Aug 13, 2019 at 8:03 PM Darrell Ball  wrote:
>> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
>> index e988626ea05b..d12b5a91c2eb 100644
>> --- a/lib/dpif-provider.h
>> +++ b/lib/dpif-provider.h
>> @@ -542,6 +542,16 @@ struct dpif_class {
>> struct ct_dpif_timeout_policy *tp);
>>  int (*ct_timeout_policy_dump_done)(struct dpif *, void *state);
>>
>> +/* Gets timeout policy name based on 'tp_id', 'dl_type' and 'nw_proto'.
>> + * On success, returns 0, stores the timeout policy name in 'tp_name',
>> + * and sets 'unwildcard'.  'unwildcard' is true if the timeout
>> + * policy in 'dpif' is 'dl_type' and 'nw_proto' specific, i.e. in the
>> + * kernel datapath.  Sets 'unwildcard' to false if the timeout policy
>> + * is generic to all supported 'dl_type' and 'nw_proto'. */
>> +int (*ct_get_timeout_policy_name)(struct dpif *, uint32_t tp_id,
>> +  uint16_t dl_type, uint8_t nw_proto,
>> +  struct ds *tp_name, bool *unwildcard);
>
> I think adding the 'unwildcard' parameter to this particular API is not 
> intuitive;
> I would create a simple dedicated API for it.

As per our offline discussion, we will keep this interface as is but
update the comment to make the API easier to understand.  I also added some
comments in the caller (in ofproto-dpif.c) to make the 'unwildcard'
idea clearer.
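
For illustration, a caller might consume 'unwildcard' roughly like this (a
sketch only, not code from the patch; 'wc' and the exact flow-wildcard
handling are assumptions):

    struct ds tp_name = DS_EMPTY_INITIALIZER;
    bool unwildcard;

    if (ofproto_dpif_ct_zone_timeout_policy_get_name(backer, zone, dl_type,
                                                     nw_proto, &tp_name,
                                                     &unwildcard)) {
        if (unwildcard) {
            /* The datapath keeps one policy per dl_type/nw_proto pair (the
             * kernel case), so the flow must match those fields exactly for
             * 'tp_name' to stay valid. */
            memset(&wc->masks.dl_type, 0xff, sizeof wc->masks.dl_type);
            memset(&wc->masks.nw_proto, 0xff, sizeof wc->masks.nw_proto);
        }
        /* ... use ds_cstr(&tp_name) as the ct() action's timeout value. */
    }
    ds_destroy(&tp_name);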

>> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
>> index 3013d83e96a0..8bbc596e2ce0 100644
>> --- a/ofproto/ofproto-dpif.c
>> +++ b/ofproto/ofproto-dpif.c
>> +bool
>> +ofproto_dpif_ct_zone_timeout_policy_get_name(
>> +    const struct dpif_backer *backer, uint16_t zone, uint16_t dl_type,
>> +    uint8_t nw_proto, struct ds *tp_name, bool *unwildcard)
>> +{
>> +    struct ct_zone *ct_zone;
>> +
>> +    if (!ct_dpif_timeout_policy_support_ipproto(nw_proto)) {
>> +        return false;
>> +    }
>> +
>> +    ct_zone = ct_zone_lookup(&backer->ct_zones, zone);
>
>
> struct ct_zone *ct_zone = ct_zone_lookup(&backer->ct_zones, zone);

Done in v4.


>> --- a/tests/system-traffic.at
>> +++ b/tests/system-traffic.at
>> +dnl Wait until the timeout expires.
>> +dnl We intend to wait a bit longer, because conntrack does not recycle the
>> +dnl entry right after it expires.
>> +sleep 4
>
> Once the issue with sending additional packets after the first timeout expiry 
> is fixed, we should enhance the test
> to resend and re-timeout to make sure it works.

Sure, will modify the test case once the kernel fix is upstream.


>> diff --git a/tests/system-userspace-macros.at 
>> b/tests/system-userspace-macros.at
>> index 9d5f3bf419d3..8950a4de7287 100644
>> --- a/tests/system-userspace-macros.at
>> +++ b/tests/system-userspace-macros.at
>> +# VSCTL_ADD_ZONE_TIMEOUT_POLICY([parameters])
>> +#
>> +# Add zone based timeout policy to userspace datapath
>> +m4_define([VSCTL_ADD_ZONE_TIMEOUT_POLICY],
>
>
> TBH, this does not seem useful; it just hides what is happening

Thanks for the diff in the other e-mail.  I will fold in the proposed
diff in v4.

Thanks,

-Yi-Hung


Re: [ovs-dev] [PATCH v3 4/9] ct-dpif, dpif-netlink: Add conntrack timeout policy support

2019-08-14 Thread Yi-Hung Wei
On Wed, Aug 14, 2019 at 2:35 PM Darrell Ball  wrote:
> On Mon, Aug 12, 2019 at 5:54 PM Yi-Hung Wei  wrote:
>> --- a/lib/dpif-netlink.c
>> +++ b/lib/dpif-netlink.c
>> +static int
>> +dpif_netlink_ct_set_timeout_policy(struct dpif *dpif OVS_UNUSED,
>> +                                   const struct ct_dpif_timeout_policy *tp)
>> +{
>> +    struct nl_ct_timeout_policy nl_tp;
>> +    struct ds nl_tp_name = DS_EMPTY_INITIALIZER;
>> +    int i, err = 0;
>> +
>> +    for (i = 0; i < ARRAY_SIZE(tp_protos); ++i) {
>
>
> for (int i = 0; i < ARRAY_SIZE(tp_protos); ++i) {
>
> there are several other cases as well;
> I'm not going to mention them all

Thanks for the review. I made all the similar code changes in
ct-dpif.c, dpif-netlink.c and netlink-conntrack.c.


>> +static int
>> +dpif_netlink_ct_timeout_policy_dump_next(struct dpif *dpif OVS_UNUSED,
>> + void *state,
>> + struct ct_dpif_timeout_policy *tp)
>
> I think it would be super helpful to add some comments to this function.
>

Sure, I added some comments to dpif_netlink_ct_timeout_policy_dump_next() in v4.

>> +    tp_dump_node = get_dpif_netlink_tp_dump_node_by_tp_id(
>> +                       tp_id, &dump_state->tp_dump_map);
>> +    if (!tp_dump_node) {
>> +        tp_dump_node = xzalloc(sizeof *tp_dump_node);
>> +        tp_dump_node->tp = xzalloc(sizeof *tp_dump_node->tp);
>> +        tp_dump_node->tp->id = tp_id;
>> +        hmap_insert(&dump_state->tp_dump_map, &tp_dump_node->hmap_node,
>> +                    hash_int(tp_id, 0));
>> +    }
>> +
>> +    update_dpif_netlink_tp_dump_node(&nl_tp, tp_dump_node);
>> +    if (tp_dump_node->l3_l4_present == DPIF_NL_ALL_TP) {
>
>>
>> +        hmap_remove(&dump_state->tp_dump_map, &tp_dump_node->hmap_node);
>> +        *tp = *tp_dump_node->tp;
>> +        free(tp_dump_node->tp);
>> +        free(tp_dump_node);
>
> common block; write once

Sure, I move the common block to a new function in v4.

static void
get_and_cleanup_tp_dump_node(struct hmap *hmap,
                             struct dpif_netlink_tp_dump_node *tp_dump_node,
                             struct ct_dpif_timeout_policy *tp)
{
    hmap_remove(hmap, &tp_dump_node->hmap_node);
    *tp = *tp_dump_node->tp;
    free(tp_dump_node->tp);
    free(tp_dump_node);
}

Thanks,

-Yi-Hung


Re: [ovs-dev] [PATCH 3/4] raft.c: Set candidate_retrying if no leader elected since last election.

2019-08-14 Thread Han Zhou
On Tue, Aug 13, 2019 at 9:23 AM Han Zhou  wrote:
>
> From: Han Zhou 
>
> candidate_retrying is used to determine if the current node is disconnected
> from the cluster when the node is in the candidate role. However, a node
> can flap between the candidate and follower roles before a leader is elected
> when the majority of the cluster is down, so is_connected() will flap, too,
> which confuses clients.
>
> This patch avoids the flapping with the help of a new member, had_leader,
> so that if no leader was elected since the last election, we know we are
> still retrying, and stay disconnected from the cluster.
>
> Signed-off-by: Han Zhou 
> ---
>  ovsdb/raft.c | 25 -
>  1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index 1c38b3b..63c3081 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -286,8 +286,11 @@ struct raft {
>
>  /* Candidates only.  Reinitialized at start of election. */
>  int n_votes;/* Number of votes for me. */
> -bool candidate_retrying;/* The first round of election timed-out and it
> -   is now retrying. */
> +bool candidate_retrying;/* The earlier election timed-out and we are
> +   now retrying. */
> +bool had_leader;/* There has been leader elected since last
> +   election initiated. This is to help setting
> +   candidate_retrying. */
>  };
>
>  /* All Raft structures. */
> @@ -345,6 +348,7 @@ static bool raft_handle_write_error(struct raft *, struct ovsdb_error *);
>
>  static void raft_run_reconfigure(struct raft *);
>
> +static void raft_set_leader(struct raft *, const struct uuid *sid);
>  static struct raft_server *
>  raft_find_server(const struct raft *raft, const struct uuid *sid)
>  {
> @@ -1616,8 +1620,11 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>  }
>
>  ovs_assert(raft->role != RAFT_LEADER);
> -raft->candidate_retrying = (raft->role == RAFT_CANDIDATE);
>  raft->role = RAFT_CANDIDATE;
> +/* If there was no leader elected since last election, we know we are
> + * retrying now. */
> +raft->candidate_retrying = !raft->had_leader;
> +raft->had_leader = false;
>
>  raft->n_votes = 0;
>
> @@ -2450,6 +2457,14 @@ raft_server_init_leader(struct raft *raft, struct raft_server *s)
>  }
>
>  static void
> +raft_set_leader(struct raft *raft, const struct uuid *sid)
> +{
> +raft->leader_sid = *sid;
> +raft->had_leader = true;
> +raft->candidate_retrying = false;
> +}
> +
> +static void
>  raft_become_leader(struct raft *raft)
>  {
>  log_all_commands(raft);
> @@ -2461,7 +2476,7 @@ raft_become_leader(struct raft *raft)
>
>  ovs_assert(raft->role != RAFT_LEADER);
>  raft->role = RAFT_LEADER;
> -raft->leader_sid = raft->sid;
> +raft_set_leader(raft, &raft->sid);
>  raft->election_timeout = LLONG_MAX;
>  raft_reset_ping_timer(raft);
>
> @@ -2855,7 +2870,7 @@ raft_update_leader(struct raft *raft, const struct uuid *sid)
>raft_get_nickname(raft, sid, buf, sizeof buf),
>raft->term);
>  }
> -raft->leader_sid = *sid;
> +raft_set_leader(raft, sid);
>
>  /* Record the leader to the log.  This is not used by the algorithm
>   * (although it could be, for quick restart), but it is used for
> --
> 2.1.0
>

Sorry that I forgot to update the condition in raft_is_connected(). I fixed
it and added a test case in v2:
https://patchwork.ozlabs.org/patch/1147277/


[ovs-dev] [PATCH v2 4/4] raft.c: Set candidate_retrying if no leader elected since last election.

2019-08-14 Thread Han Zhou
From: Han Zhou 

candidate_retrying is used to determine if the current node is disconnected
from the cluster when the node is in the candidate role. However, a node
can flap between the candidate and follower roles before a leader is elected
when the majority of the cluster is down, so is_connected() will flap, too,
which confuses clients.

This patch avoids the flapping with the help of a new member, had_leader,
so that if no leader was elected since the last election, we know we are
still retrying, and stay disconnected from the cluster.

Signed-off-by: Han Zhou 
---
v1 -> v2: Fixed the condition in raft_is_connected(), and added a test case.
  Moved this patch from 3/4 to 4/4 in the series.

 ovsdb/raft.c   | 29 ++-
 tests/ovsdb-cluster.at | 64 --
 2 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index 64b8150..c2bd0ed 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -286,8 +286,11 @@ struct raft {
 
 /* Candidates only.  Reinitialized at start of election. */
 int n_votes;/* Number of votes for me. */
-bool candidate_retrying;/* The first round of election timed-out and it
-   is now retrying. */
+bool candidate_retrying;/* The earlier election timed-out and we are
+   now retrying. */
+bool had_leader;/* There has been leader elected since last
+   election initiated. This is to help setting
+   candidate_retrying. */
 };
 
 /* All Raft structures. */
@@ -345,6 +348,7 @@ static bool raft_handle_write_error(struct raft *, struct ovsdb_error *);
 
 static void raft_run_reconfigure(struct raft *);
 
+static void raft_set_leader(struct raft *, const struct uuid *sid);
 static struct raft_server *
 raft_find_server(const struct raft *raft, const struct uuid *sid)
 {
@@ -997,11 +1001,13 @@ raft_get_sid(const struct raft *raft)
 bool
 raft_is_connected(const struct raft *raft)
 {
-return (!(raft->role == RAFT_CANDIDATE && raft->candidate_retrying)
+bool ret = (!raft->candidate_retrying
 && !raft->joining
 && !raft->leaving
 && !raft->left
 && !raft->failed);
+VLOG_DBG("raft_is_connected: %s\n", ret? "true": "false");
+return ret;
 }
 
 /* Returns true if 'raft' is the cluster leader. */
@@ -1616,8 +1622,11 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
 }
 
 ovs_assert(raft->role != RAFT_LEADER);
-raft->candidate_retrying = (raft->role == RAFT_CANDIDATE);
 raft->role = RAFT_CANDIDATE;
+/* If there was no leader elected since last election, we know we are
+ * retrying now. */
+raft->candidate_retrying = !raft->had_leader;
+raft->had_leader = false;
 
 raft->n_votes = 0;
 
@@ -2487,6 +2496,14 @@ raft_server_init_leader(struct raft *raft, struct raft_server *s)
 }
 
 static void
+raft_set_leader(struct raft *raft, const struct uuid *sid)
+{
+raft->leader_sid = *sid;
+raft->had_leader = true;
+raft->candidate_retrying = false;
+}
+
+static void
 raft_become_leader(struct raft *raft)
 {
 log_all_commands(raft);
@@ -2498,7 +2515,7 @@ raft_become_leader(struct raft *raft)
 
 ovs_assert(raft->role != RAFT_LEADER);
 raft->role = RAFT_LEADER;
-raft->leader_sid = raft->sid;
+raft_set_leader(raft, &raft->sid);
 raft_reset_election_timer(raft);
 raft_reset_ping_timer(raft);
 
@@ -2892,7 +2909,7 @@ raft_update_leader(struct raft *raft, const struct uuid *sid)
   raft_get_nickname(raft, sid, buf, sizeof buf),
   raft->term);
 }
-raft->leader_sid = *sid;
+raft_set_leader(raft, sid);
 
 /* Record the leader to the log.  This is not used by the algorithm
  * (although it could be, for quick restart), but it is used for
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index 7146fe6..f124244 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -66,23 +66,30 @@ EXECUTION_EXAMPLES
 AT_BANNER([OVSDB - disconnect from cluster])
 
 OVS_START_SHELL_HELPERS
-# ovsdb_test_cluster_disconnect LEADER_OR_FOLLOWER
+# ovsdb_test_cluster_disconnect N_SERVERS LEADER_OR_FOLLOWER [CHECK_FLAPPING]
+# Test server disconnected from the cluster.
+# N_SERVERS: Number of servers in the cluster.
+# LEADER_OR_FOLLOWER: The role of the server that is disconnected from the
+# cluster: "leader" or "follower".
+# CHECK_FLAPPING: Whether to check if is_disconnected flapped. "yes", "no".
 ovsdb_test_cluster_disconnect () {
-leader_or_follower=$1
+n=$1
+leader_or_follower=$2
+check_flapping=$3
 schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
 ordinal_schema > schema
 AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db 

[ovs-dev] [PATCH v2 3/4] raft.c: Stale leader should disconnect from cluster.

2019-08-14 Thread Han Zhou
From: Han Zhou 

As mentioned in RAFT paper, section 6.2:

Leaders: A server might be in the leader state, but if it isn’t the current
leader, it could be needlessly delaying client requests. For example, suppose a
leader is partitioned from the rest of the cluster, but it can still
communicate with a particular client. Without additional mechanism, it could
delay a request from that client forever, being unable to replicate a log entry
to any other servers. Meanwhile, there might be another leader of a newer term
that is able to communicate with a majority of the cluster and would be able to
commit the client’s request. Thus, a leader in Raft steps down if an election
timeout elapses without a successful round of heartbeats to a majority of its
cluster; this allows clients to retry their requests with another server.

Reported-by: Aliasgar Ginwala 
Tested-by: Aliasgar Ginwala 
Signed-off-by: Han Zhou 
---
 ovsdb/raft-private.h   |   3 ++
 ovsdb/raft.c   |  42 -
 tests/ovsdb-cluster.at | 123 +
 3 files changed, 116 insertions(+), 52 deletions(-)

diff --git a/ovsdb/raft-private.h b/ovsdb/raft-private.h
index 35a3dd7..fb7e6e3 100644
--- a/ovsdb/raft-private.h
+++ b/ovsdb/raft-private.h
@@ -80,6 +80,9 @@ struct raft_server {
 uint64_t next_index; /* Index of next log entry to send this server. */
 uint64_t match_index;/* Index of max log entry server known to have. */
 enum raft_server_phase phase;
+bool replied;/* Reply to append_request was received from this
+node during current election_timeout interval.
+*/
 /* For use in adding and removing servers: */
 struct uuid requester_sid;  /* Nonzero if requested via RPC. */
 struct unixctl_conn *requester_conn; /* Only if requested via unixctl. */
diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index 1c38b3b..64b8150 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -1785,7 +1785,43 @@ raft_run(struct raft *raft)
 }
 
 if (!raft->joining && time_msec() >= raft->election_timeout) {
-raft_start_election(raft, false);
+if (raft->role == RAFT_LEADER) {
+/* Check if majority of followers replied, then reset
+ * election_timeout and reset s->replied. Otherwise, become
+ * follower.
+ *
+ * Raft paper section 6.2: Leaders: A server might be in the leader
+ * state, but if it isn’t the current leader, it could be
+ * needlessly delaying client requests. For example, suppose a
+ * leader is partitioned from the rest of the cluster, but it can
+ * still communicate with a particular client. Without additional
+ * mechanism, it could delay a request from that client forever,
+ * being unable to replicate a log entry to any other servers.
+ * Meanwhile, there might be another leader of a newer term that is
+ * able to communicate with a majority of the cluster and would be
+ * able to commit the client’s request. Thus, a leader in Raft
+ * steps down if an election timeout elapses without a successful
+ * round of heartbeats to a majority of its cluster; this allows
+ * clients to retry their requests with another server.  */
+int count = 0;
+HMAP_FOR_EACH (server, hmap_node, &raft->servers) {
+if (server->replied) {
+count ++;
+}
+}
+if (count >= hmap_count(&raft->servers) / 2) {
+HMAP_FOR_EACH (server, hmap_node, &raft->servers) {
+server->replied = false;
+}
+raft_reset_election_timer(raft);
+} else {
+raft_become_follower(raft);
+raft_start_election(raft, false);
+}
+} else {
+raft_start_election(raft, false);
+}
+
 }
 
 if (raft->leaving && time_msec() >= raft->leave_timeout) {
@@ -2447,6 +2483,7 @@ raft_server_init_leader(struct raft *raft, struct raft_server *s)
 s->next_index = raft->log_end;
 s->match_index = 0;
 s->phase = RAFT_PHASE_STABLE;
+s->replied = false;
 }
 
 static void
@@ -2462,7 +2499,7 @@ raft_become_leader(struct raft *raft)
 ovs_assert(raft->role != RAFT_LEADER);
 raft->role = RAFT_LEADER;
 raft->leader_sid = raft->sid;
-raft->election_timeout = LLONG_MAX;
+raft_reset_election_timer(raft);
 raft_reset_ping_timer(raft);
 
 struct raft_server *s;
@@ -3192,6 +3229,7 @@ raft_handle_append_reply(struct raft *raft,
 }
 }
 
+s->replied = true;
 if (rpy->result == RAFT_APPEND_OK) {
 /* Figure 3.1: "If successful, update nextIndex and matchIndex for
  * follower (section 3.5)." */
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at

[ovs-dev] [PATCH v2 2/4] ovsdb-idl.c: Allows retry even when using a single remote.

2019-08-14 Thread Han Zhou
From: Han Zhou 

When clustered mode is used, the client needs to retry connecting
to new servers when certain failures happen. Today it is allowed to
retry new connection only if multiple remotes are used, which prevents
using LB VIP with clustered nodes. This patch makes sure the retry
logic works when using LB VIP: although same IP is used for retrying,
the LB can actually redirect the connection to a new node.

Signed-off-by: Han Zhou 
---
 lib/ovsdb-idl.c| 11 +++---
 tests/ovsdb-cluster.at | 57 ++
 tests/test-ovsdb.c |  1 +
 3 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
index 1a6a4af..190143f 100644
--- a/lib/ovsdb-idl.c
+++ b/lib/ovsdb-idl.c
@@ -657,12 +657,8 @@ ovsdb_idl_state_to_string(enum ovsdb_idl_state state)
 static void
 ovsdb_idl_retry_at(struct ovsdb_idl *idl, const char *where)
 {
-    if (idl->session && jsonrpc_session_get_n_remotes(idl->session) > 1) {
-        ovsdb_idl_force_reconnect(idl);
-        ovsdb_idl_transition_at(idl, IDL_S_RETRY, where);
-    } else {
-        ovsdb_idl_transition_at(idl, IDL_S_ERROR, where);
-    }
+    ovsdb_idl_force_reconnect(idl);
+    ovsdb_idl_transition_at(idl, IDL_S_RETRY, where);
 }
 
 static void
@@ -1895,8 +1891,7 @@ ovsdb_idl_check_server_db(struct ovsdb_idl *idl)
     if (!database) {
         VLOG_INFO_RL(&rl, "%s: server does not have %s database",
                      server_name, idl->data.class_->database);
-    } else if (!strcmp(database->model, "clustered")
-               && jsonrpc_session_get_n_remotes(idl->session) > 1) {
+    } else if (!strcmp(database->model, "clustered")) {
         uint64_t index = database->n_index ? *database->index : 0;
 
         if (!database->schema) {
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index 4701272..6a13843 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -63,6 +63,63 @@ m4_define([OVSDB_CHECK_EXECUTION],
 EXECUTION_EXAMPLES
 
 
+AT_BANNER([OVSDB - disconnect from cluster])
+
+# When a node is disconnected from the cluster, the IDL should disconnect
+# and retry even if it uses a single remote, because the remote IP can be
+# a VIP on a load balancer.
+AT_SETUP([OVSDB cluster - disconnect from cluster, single remote])
+AT_KEYWORDS([ovsdb server negative unix cluster disconnect])
+
+schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
+ordinal_schema > schema
+AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
+cid=`ovsdb-tool db-cid s1.db`
+schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
+for i in `seq 2 3`; do
+AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
+done
+
+on_exit 'kill `cat *.pid`'
+for i in `seq 3`; do
+AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
+done
+for i in `seq 3`; do
+AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
+done
+
+AT_CHECK([ovsdb-client transact unix:s1.ovsdb '[["idltest",
+  {"op": "insert",
+   "table": "simple",
+   "row": {"i": 1}}]]'], [0], [ignore], [ignore])
+
+# Connect to a single remote s3.  Use "wait" to trigger a non-op transaction so
+# that test-ovsdb will not quit.
+
+on_exit 'pkill test-ovsdb'
test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -v -t10 idl unix:s3.ovsdb '[["idltest",
+  {"op": "wait",
+   "table": "simple",
+   "where": [["i", "==", 1]],
+   "columns": ["i"],
+   "until": "==",
+   "rows": [{"i": 1}]}]]' > test-ovsdb.log 2>&1 &
+
+OVS_WAIT_UNTIL([grep "000: i=1" test-ovsdb.log])
+
+# Shutdown the other 2 servers so that s3 is disconnected from the cluster.
+for i in 2 1; do
+OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
+done
+
+# The test-ovsdb should detect the disconnect and retry.
+OVS_WAIT_UNTIL([grep disconnect test-ovsdb.log])
+
+OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s3], [s3.pid])
+
+AT_CLEANUP
+
+
 OVS_START_SHELL_HELPERS
 # ovsdb_cluster_failure_test SCHEMA_FUNC OUTPUT TRANSACTION...
 ovsdb_cluster_failure_test () {
diff --git a/tests/test-ovsdb.c b/tests/test-ovsdb.c
index 187eb28..b1a4be3 100644
--- a/tests/test-ovsdb.c
+++ b/tests/test-ovsdb.c
@@ -2412,6 +2412,7 @@ do_idl(struct ovs_cmdl_context *ctx)
 track = ((struct test_ovsdb_pvt_context *)(ctx->pvt))->track;
 
     idl = ovsdb_idl_create(ctx->argv[1], &idltest_idl_class, true, true);
+ovsdb_idl_set_leader_only(idl, false);
 if (ctx->argc > 2) {
 struct stream *stream;
 
-- 
2.1.0



[ovs-dev] [PATCH v2 1/4] raft.c: Move raft_reset_ping_timer() out of the loop.

2019-08-14 Thread Han Zhou
From: Han Zhou 

Fixes: commit 5a9b53a5 ("ovsdb raft: Fix duplicated transaction execution when leader failover.")
Signed-off-by: Han Zhou 
---
 ovsdb/raft.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index c60ef41..1c38b3b 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -1816,8 +1816,8 @@ raft_run(struct raft *raft)
 && now - cmd->timestamp > ELECTION_BASE_MSEC * 2) {
 raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
 }
-raft_reset_ping_timer(raft);
 }
+raft_reset_ping_timer(raft);
 }
 
 /* Do this only at the end; if we did it as soon as we set raft->left or
-- 
2.1.0



Re: [ovs-dev] [PATCH 2/4] ovsdb-idl.c: Allows retry even when using a single remote.

2019-08-14 Thread Han Zhou
Hi Aginwala, thanks for the testing. The change of this patch will cause
the IDL to avoid connecting to a follower if "leader_only" is set to
"true". It is the same behavior as before when multiple remotes are used,
but now it does the same even when a single remote is used, because
the single remote could be a VIP, so it is the desired behavior. For
ovn-nbctl, "leader_only" defaults to true, so it is expected that it
refuses to connect if the remote is a follower. However, if you are using
daemon mode ovn-nbctl and you didn't set "--no-leader-only", it should
keep retrying until it connects to a leader (depending on the LB redirecting
to different servers). I agree this may cause some confusion when a user of
ovn-nbctl connects to only a single remote of a cluster: he/she could get a
failure if --no-leader-only is not specified, but that is better than letting
users believe they are connected to a leader while in reality connected to a
follower. Maybe I can improve the ovn-nbctl/sbctl documentation to emphasize
this to avoid confusion.
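
For a client that intentionally talks to a single LB VIP and can tolerate
followers, the opt-out is one call on the IDL. A minimal sketch (the remote
address and IDL class name are made-up placeholders; both functions appear
in the patch below):

    /* Create the IDL against the VIP and accept follower servers. */
    struct ovsdb_idl *idl = ovsdb_idl_create("tcp:10.0.0.10:6641",
                                             &my_idl_class, true, true);
    ovsdb_idl_set_leader_only(idl, false);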

On Tue, Aug 13, 2019 at 7:13 PM aginwala  wrote:

>
>
> On Tue, Aug 13, 2019 at 7:07 PM aginwala  wrote:
>
>> Thanks for the fixes. Found a bug in the current code which breaks both nbctl
>> with a local socket and daemon mode on follower nodes. Also, nbctl daemon mode
>> always tries to connect to the leader node by default, which also fails to
>> connect.
>> e.g. export OVN_NB_DB=tcp::6641; ovn-nbctl  --pidfile --detach.
>>  
>> 2019-08-14T01:07:06Z|00036|ovsdb_idl|INFO|unix:/var/run/openvswitch/ovnnb_db.sock:
>> clustered database server is not cluster leader; trying another server
>> 2019-08-14T01:07:06Z|00037|reconnect|DBG|unix:/var/run/openvswitch/ovnnb_db.sock:
>> entering RECONNECT
>> 2019-08-14T01:07:06Z|00038|ovsdb_idl|DBG|unix:/var/run/openvswitch/ovnnb_db.sock:
>> SERVER_MONITOR_COND_REQUESTED -> RETRY at lib/ovsdb-idl.c:1917
>> 2019-08-14T01:07:06Z|00039|poll_loop|DBG|wakeup due to 0-ms timeout at
>> lib/reconnect.c:643
>> 2019-08-14T01:07:06Z|00040|reconnect|INFO|unix:/var/run/openvswitch/ovnnb_db.sock:
>> connection attempt timed out
>> 2019-08-14T01:07:06Z|00041|reconnect|INFO|unix:/var/run/openvswitch/ovnnb_db.sock:
>> waiting 2 seconds before reconnect
>> 2019-08-14T01:07:06Z|00042|reconnect|DBG|unix:/var/run/openvswitch/ovnnb_db.sock:
>> entering BACKOFF
>>
>> Need to explicitly specify --no-leader-only to work around it:  ovn-nbctl
>> --pidfile --detach --no-leader-only.
>> For the LB VIP, since the LB sees all nodes as active, connections are
>> established to all cluster nodes equally. I am using a round-robin setting
>> on the LB VIP for a 3-node cluster using raft.
>>
>> Hence, are we always going to keep this behavior, given that it forces
>> clients to always connect to the leader and never to followers? With this
>> patch, the idl will show "ovn-nbctl: tcp::6641: database connection
>> failed ()" if a request reaches a follower, and it only succeeds if the
>> request reaches the leader node.
>>
>>
>> On Tue, Aug 13, 2019 at 9:23 AM Han Zhou  wrote:
>>
>>> From: Han Zhou 
>>>
>>> When clustered mode is used, the client needs to retry connecting
>>> to new servers when certain failures happen. Today it is allowed to
>>> retry new connection only if multiple remotes are used, which prevents
>>> using LB VIP with clustered nodes. This patch makes sure the retry
>>> logic works when using LB VIP: although same IP is used for retrying,
>>> the LB can actually redirect the connection to a new node.
>>>
>>> Signed-off-by: Han Zhou 
>>> ---
>>>  lib/ovsdb-idl.c| 11 +++---
>>>  tests/ovsdb-cluster.at | 57 ++
>>>  tests/test-ovsdb.c |  1 +
>>>  3 files changed, 61 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
>>> index 1a6a4af..190143f 100644
>>> --- a/lib/ovsdb-idl.c
>>> +++ b/lib/ovsdb-idl.c
>>> @@ -657,12 +657,8 @@ ovsdb_idl_state_to_string(enum ovsdb_idl_state state)
>>>  static void
>>>  ovsdb_idl_retry_at(struct ovsdb_idl *idl, const char *where)
>>>  {
>>> -    if (idl->session && jsonrpc_session_get_n_remotes(idl->session) > 1) {
>>> -        ovsdb_idl_force_reconnect(idl);
>>> -        ovsdb_idl_transition_at(idl, IDL_S_RETRY, where);
>>> -    } else {
>>> -        ovsdb_idl_transition_at(idl, IDL_S_ERROR, where);
>>> -    }
>>> +    ovsdb_idl_force_reconnect(idl);
>>> +    ovsdb_idl_transition_at(idl, IDL_S_RETRY, where);
>>>  }
>>>
>>>  static void
>>> @@ -1895,8 +1891,7 @@ ovsdb_idl_check_server_db(struct ovsdb_idl *idl)
>>>      if (!database) {
>>>          VLOG_INFO_RL(&rl, "%s: server does not have %s database",
>>>                       server_name, idl->data.class_->database);
>>> -    } else if (!strcmp(database->model, "clustered")
>>> -               && jsonrpc_session_get_n_remotes(idl->session) > 1) {
>>> +    } else if (!strcmp(database->model, "clustered")) {
>>>
>> > I think this check jsonrpc_session_get_n_remotes(idl->session) > 1 is still needed 

Re: [ovs-dev] [PATCH v3 4/9] ct-dpif, dpif-netlink: Add conntrack timeout policy support

2019-08-14 Thread Darrell Ball
Thanks for the patch

mostly minor comments

On Mon, Aug 12, 2019 at 5:54 PM Yi-Hung Wei  wrote:

> This patch first defines the dpif interface for a datapath to support
> adding, deleting, getting and dumping conntrack timeout policies.
> A timeout policy is identified by a 4-byte unsigned integer in the
> datapath, and it currently supports timeouts for the TCP, UDP, and ICMP
> protocols.
>
> Moreover, this patch provides the implementation for Linux kernel
> datapath in dpif-netlink.
>
> In the Linux kernel, the timeout policy is maintained per L3/L4 protocol,
> and it is identified by a 32-byte null-terminated string.  On the other
> hand, in vswitchd, the timeout policy is a generic one that consists of
> all the supported L4 protocols.  Therefore, one of the main tasks in
> dpif-netlink is to break down the generic timeout policy into 6
> sub-policies (ipv4 tcp, udp, icmp, and ipv6 tcp, udp, icmp)
> and push down the configuration using the netlink API in
> netlink-conntrack.c.
>
> This patch also adds missing symbols in the windows datapath so
> that the build on windows can pass.
>
> Appveyor CI:
> * https://ci.appveyor.com/project/YiHungWei/ovs/builds/26387754
>
> Signed-off-by: Yi-Hung Wei 
> ---
>  Documentation/faq/releases.rst |   3 +-
>  datapath-windows/include/OvsDpInterfaceCtExt.h | 114 +
>  datapath-windows/ovsext/Netlink/NetlinkProto.h |   8 +-
>  include/windows/automake.mk|   1 +
>  .../windows/linux/netfilter/nfnetlink_cttimeout.h  |   0
>  lib/ct-dpif.c  | 104 +
>  lib/ct-dpif.h  |  56 +++
>  lib/dpif-netdev.c  |   6 +
>  lib/dpif-netlink.c | 469 +
>  lib/dpif-netlink.h |   1 -
>  lib/dpif-provider.h|  44 ++
>  lib/netlink-conntrack.c| 308 ++
>  lib/netlink-conntrack.h|  27 +-
>  lib/netlink-protocol.h |   8 +-
>  14 files changed, 1142 insertions(+), 7 deletions(-)
>  create mode 100644 include/windows/linux/netfilter/nfnetlink_cttimeout.h
>
> diff --git a/Documentation/faq/releases.rst b/Documentation/faq/releases.rst
> index 8daa23bb2d0c..0b7eaab1b143 100644
> --- a/Documentation/faq/releases.rst
> +++ b/Documentation/faq/releases.rst
> @@ -110,8 +110,9 @@ Q: Are all features available with all datapaths?
>  ==========================  ======  =========  =========  ========
>  Connection tracking         4.3     YES        YES        YES
>  Conntrack Fragment Reass.   4.3     YES        YES        YES
> +Conntrack Timeout Policies  5.2     YES        NO         NO
> +Conntrack Zone Limit        4.18    YES        NO         YES
>  NAT                         4.6     YES        YES        YES
> -Conntrack zone limit        4.18    YES        NO         YES
>  Tunnel - LISP               NO      YES        NO         NO
>  Tunnel - STT                NO      YES        NO         YES
>  Tunnel - GRE                3.11    YES        YES        YES
> diff --git a/datapath-windows/include/OvsDpInterfaceCtExt.h b/datapath-windows/include/OvsDpInterfaceCtExt.h
> index 3b947782e90c..4379855bb8dd 100644
> --- a/datapath-windows/include/OvsDpInterfaceCtExt.h
> +++ b/datapath-windows/include/OvsDpInterfaceCtExt.h
> @@ -421,4 +421,118 @@ struct nf_ct_tcp_flags {
>  UINT8 mask;
>  };
>
> +/* File: nfnetlink_cttimeout.h */
> +enum ctnl_timeout_msg_types {
> +IPCTNL_MSG_TIMEOUT_NEW,
> +IPCTNL_MSG_TIMEOUT_GET,
> +IPCTNL_MSG_TIMEOUT_DELETE,
> +IPCTNL_MSG_TIMEOUT_DEFAULT_SET,
> +IPCTNL_MSG_TIMEOUT_DEFAULT_GET,
> +
> +IPCTNL_MSG_TIMEOUT_MAX
> +};
> +
> +enum ctattr_timeout {
> +CTA_TIMEOUT_UNSPEC,
> +CTA_TIMEOUT_NAME,
> +CTA_TIMEOUT_L3PROTO,
> +CTA_TIMEOUT_L4PROTO,
> +CTA_TIMEOUT_DATA,
> +CTA_TIMEOUT_USE,
> +__CTA_TIMEOUT_MAX
> +};
> +#define CTA_TIMEOUT_MAX (__CTA_TIMEOUT_MAX - 1)
> +
> +enum ctattr_timeout_generic {
> +CTA_TIMEOUT_GENERIC_UNSPEC,
> +CTA_TIMEOUT_GENERIC_TIMEOUT,
> +__CTA_TIMEOUT_GENERIC_MAX
> +};
> +#define CTA_TIMEOUT_GENERIC_MAX (__CTA_TIMEOUT_GENERIC_MAX - 1)
> +
> +enum ctattr_timeout_tcp {
> +CTA_TIMEOUT_TCP_UNSPEC,
> +CTA_TIMEOUT_TCP_SYN_SENT,
> +CTA_TIMEOUT_TCP_SYN_RECV,
> +CTA_TIMEOUT_TCP_ESTABLISHED,
> +CTA_TIMEOUT_TCP_FIN_WAIT,
> +CTA_TIMEOUT_TCP_CLOSE_WAIT,
> +CTA_TIMEOUT_TCP_LAST_ACK,
> +CTA_TIMEOUT_TCP_TIME_WAIT,
> +CTA_TIMEOUT_TCP_CLOSE,
> +CTA_TIMEOUT_TCP_SYN_SENT2,
> +CTA_TIMEOUT_TCP_RETRANS,
> +CTA_TIMEOUT_TCP_UNACK,
> +__CTA_TIMEOUT_TCP_MAX
> +};
> +#define CTA_TIMEOUT_TCP_MAX (__CTA_TIMEOUT_TCP_MAX - 1)
> +
> +enum ctattr_timeout_udp {
> +

Re: [ovs-dev] [PATCH v3 6/9] ofproto-dpif: Consume CT_Zone, and CT_Timeout_Policy tables

2019-08-14 Thread Darrell Ball
On Wed, Aug 14, 2019 at 1:28 PM Yi-Hung Wei  wrote:

> On Tue, Aug 13, 2019 at 7:46 PM Darrell Ball  wrote:
> >
> > Thanks for the patch
> >
> > Some high level comments:
> >
> > 1/ The ct_tp_kill_list code is still in common code
> > I think we discussed moving that to the dpif backer code
> > ct_timeout_policy_unref() is adding to this deferred kill list which
> is not needed for userspace
> > datapath.
> > 2/ clear_existing_ct_timeout_policies() is in common code, but only does
> > something if ct_dpif_timeout_policy_dump_start/next/done are realized in
> > the datapath-type-specific code (which is only for the kernel code, which
> > is correct). I think it would be cleaner and less confusing just to make
> > the API clear_existing_ct_timeout_policies() kernel specific, i.e. in
> > dpif-netlink.
>
>
> Thanks for the review. I addressed most of the code changes in the
> detailed inline code review.
>
> For the two high-level concerns, it is mainly because ct_tp_kill_list is
> currently maintained in ofproto-dpif.c.  I thought about moving the
> ct_tp_kill_list implementation from dpif_backer (in ofproto-dpif.c) to
> dpif-netlink.c, and here is why I still keep it in ofproto-dpif.c in the
> dpif_backer layer in this version.
>
> AFAIK, currently, we do not have a proper place to store ct_tp_kill_list
> in dpif-netlink.c in userspace.  dpif-netlink is for the kernel datapath
> implementation; all the information that we configure through dpif-netlink
> is directly passed down into the kernel.  In the userspace datapath, we can
> store userspace-specific information in "struct dp_netdev", but there is no
> such place in dpif-netlink for now.  In this case, it is natural to maintain
> the ct_tp_kill_list one level up, in the dpif_backer layer.
>
> Anyhow, we can always make proper changes to the way we maintain timeout
> policies in the ofproto-dpif layer when the dpif-netdev implementation is
> introduced.
>

As discussed, let's defer.


>
>
> > On Mon, Aug 12, 2019 at 5:55 PM Yi-Hung Wei 
> wrote:
> >> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> >> index 751535249e21..3013d83e96a0 100644
> >> --- a/ofproto/ofproto-dpif.c
> >> +++ b/ofproto/ofproto-dpif.c
> >> @@ -694,6 +718,8 @@ struct odp_garbage {
> >>
> >>  static void check_support(struct dpif_backer *backer);
> >>
> >> +#define MAX_TIMEOUT_POLICY_ID UINT32_MAX
> >
> >
> > seems like random placement; could be moved where it is used.
> >
>
> ok, I will move it right before ct_zone_config_init().
>
>
> >> +static struct ct_timeout_policy *
> >> +ct_timeout_policy_alloc__(void)
> >> +{
> >> +    struct ct_timeout_policy *ct_tp = xzalloc(sizeof *ct_tp);
> >> +    simap_init(&ct_tp->tp);
> >> +    return ct_tp;
> >> +}
> >
> >
> > by using the above API, you are not saving any code, and it may be more
> > error prone
> >
>
> This function is used in ct_timeout_policy_alloc() and
> clear_existing_ct_timeout_policies(). So are you suggesting to expand
> it in these two functions?
>

ohh. I missed the other usage; I don't feel strongly either way.


>
> >>
> >> +
> >> +static struct ct_timeout_policy *
> >> +ct_timeout_policy_alloc(struct simap *tp, struct id_pool *tp_ids)
> >> +{
> >> +    struct simap_node *node;
> >> +
> >> +    struct ct_timeout_policy *ct_tp = ct_timeout_policy_alloc__();
> >> +    SIMAP_FOR_EACH (node, tp) {
> >> +        simap_put(&ct_tp->tp, node->name, node->data);
> >> +    }
> >> +
> >> +    if (!id_pool_alloc_id(tp_ids, &ct_tp->tp_id)) {
> >> +        VLOG_ERR_RL(&rl, "failed to allocate timeout policy id.");
> >> +        simap_destroy(&ct_tp->tp);
> >> +        free(tp);
> >
> >
> > I think you rather need to free 'ct_tp'; i.e. free(ct_tp)
>
> Yes, thanks for spotting this bug.
>
>
> >> +static void
> >> +clear_existing_ct_timeout_policies(struct dpif_backer *backer)
> >> +{
> >> +/* In kernel datapath, when OVS starts, there may be some
> pre-existing
> >> + * timeout policies in the kernel.  To avoid reassign the same
> timeout
> >> + * policy ids, we dump all the pre-existing timeout policies and
> keep
> >> + * the ids in the pool.  Since OVS will not use those timeout
> policies
> >> + * for new datapath flow, we add them to the kill list and remove
> >> + * them later on. */
> >> +void *state;
> >> +
> >> +int err = ct_dpif_timeout_policy_dump_start(backer->dpif, &state);
> >> +if (err) {
> >> +return;
> >> +}
> >
> >
> > can be
> >
> > + if (ct_dpif_timeout_policy_dump_start(backer->dpif, &state)) {
> > +return;
> > +}
> >
> > similarly below
>
> Sure, I will get rid of 'int err' in v4.
>
> >> +
> >> +    struct ct_dpif_timeout_policy cdtp;
> >> +    while (!(err = ct_dpif_timeout_policy_dump_next(backer->dpif, state,
> >> +                                                    &cdtp))) {
> >> +        struct ct_timeout_policy *ct_tp = ct_timeout_policy_alloc__();
> >> +        ct_tp->tp_id = cdtp.id;
> >> +        id_pool_add(backer->tp_ids, cdtp.id);
> >> +

Re: [ovs-dev] [PATCH v3 6/9] ofproto-dpif: Consume CT_Zone, and CT_Timeout_Policy tables

2019-08-14 Thread Yi-Hung Wei
On Tue, Aug 13, 2019 at 7:46 PM Darrell Ball  wrote:
>
> Thanks for the patch
>
> Some high level comments:
>
> 1/ The ct_tp_kill_list code is still in common code
> I think we discussed moving that to the dpif backer code
> ct_timeout_policy_unref() is adding to this deferred kill list which is 
> not needed for userspace
> datapath.
> 2/ clear_existing_ct_timeout_policies() is in common code, but only does 
> something if
> ct_dpif_timeout_policy_dump_start/next/done are realized in the
> datapath-type-specific code
> (which is only for the kernel code, which is correct). I think it would be 
> cleaner and less confusing
> just to make the API clear_existing_ct_timeout_policies() kernel specific; 
> i.e. in dpif-netlink.


Thanks for the review. I addressed most of the code changes in the
detailed inline code review.

For the two high-level concerns, it is mainly because ct_tp_kill_list is
currently maintained in ofproto-dpif.c.  I thought about moving the
ct_tp_kill_list implementation from dpif_backer (in ofproto-dpif.c) to
dpif-netlink.c, and here is why I still keep it in ofproto-dpif.c in the
dpif_backer layer in this version.

AFAIK, currently, we do not have a proper place to store ct_tp_kill_list
in dpif-netlink.c in userspace.  dpif-netlink is for the kernel datapath
implementation; all the information that we configure through dpif-netlink
is directly passed down into the kernel.  In the userspace datapath, we can
store userspace-specific information in "struct dp_netdev", but there is no
such place in dpif-netlink for now.  In this case, it is natural to maintain
the ct_tp_kill_list one level up, in the dpif_backer layer.

Anyhow, we can always make proper changes to the way we maintain timeout
policies in the ofproto-dpif layer when the dpif-netdev implementation is
introduced.
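
For reference, the per-backer state discussed above looks roughly like this
(a sketch only; just the fields mentioned in this thread are shown, and the
container types are assumptions):

    struct dpif_backer {
        /* ... existing members ... */
        struct id_pool *tp_ids;           /* Datapath timeout-policy ids. */
        struct cmap ct_zones;             /* "struct ct_zone"s by zone id. */
        struct ovs_list ct_tp_kill_list;  /* Policies to destroy later. */
    };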


> On Mon, Aug 12, 2019 at 5:55 PM Yi-Hung Wei  wrote:
>> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
>> index 751535249e21..3013d83e96a0 100644
>> --- a/ofproto/ofproto-dpif.c
>> +++ b/ofproto/ofproto-dpif.c
>> @@ -694,6 +718,8 @@ struct odp_garbage {
>>
>>  static void check_support(struct dpif_backer *backer);
>>
>> +#define MAX_TIMEOUT_POLICY_ID UINT32_MAX
>
>
> seems like random placement; could be moved where it is used.
>

ok, I will move it right before ct_zone_config_init().


>> +static struct ct_timeout_policy *
>> +ct_timeout_policy_alloc__(void)
>> +{
>> +    struct ct_timeout_policy *ct_tp = xzalloc(sizeof *ct_tp);
>> +    simap_init(&ct_tp->tp);
>> +    return ct_tp;
>> +}
>
>
> by using the above API, you are not saving any code, and it may be more error prone
>

This function is used in ct_timeout_policy_alloc() and
clear_existing_ct_timeout_policies(). So are you suggesting to expand
it in these two functions?


>>
>> +
>> +static struct ct_timeout_policy *
>> +ct_timeout_policy_alloc(struct simap *tp, struct id_pool *tp_ids)
>> +{
>> +    struct simap_node *node;
>> +
>> +    struct ct_timeout_policy *ct_tp = ct_timeout_policy_alloc__();
>> +    SIMAP_FOR_EACH (node, tp) {
>> +        simap_put(&ct_tp->tp, node->name, node->data);
>> +    }
>> +
>> +    if (!id_pool_alloc_id(tp_ids, &ct_tp->tp_id)) {
>> +        VLOG_ERR_RL(&rl, "failed to allocate timeout policy id.");
>> +        simap_destroy(&ct_tp->tp);
>> +        free(tp);
>
>
> I think you rather need to free 'ct_tp'; i.e. free(ct_tp)

Yes, thanks for spotting this bug.


>> +static void
>> +clear_existing_ct_timeout_policies(struct dpif_backer *backer)
>> +{
>> +/* In kernel datapath, when OVS starts, there may be some pre-existing
>> + * timeout policies in the kernel.  To avoid reassign the same timeout
>> + * policy ids, we dump all the pre-existing timeout policies and keep
>> + * the ids in the pool.  Since OVS will not use those timeout policies
>> + * for new datapath flow, we add them to the kill list and remove
>> + * them later on. */
>> +void *state;
>> +
>> +int err = ct_dpif_timeout_policy_dump_start(backer->dpif, &state);
>> +if (err) {
>> +return;
>> +}
>
>
> can be
>
> + if (ct_dpif_timeout_policy_dump_start(backer->dpif, &state)) {
> +return;
> +}
>
> similarly below

Sure, I will get rid of 'int err' in v4.

>> +
>> +    struct ct_dpif_timeout_policy cdtp;
>> +    while (!(err = ct_dpif_timeout_policy_dump_next(backer->dpif, state,
>> +                                                    &cdtp))) {
>> +        struct ct_timeout_policy *ct_tp = ct_timeout_policy_alloc__();
>> +        ct_tp->tp_id = cdtp.id;
>> +        id_pool_add(backer->tp_ids, cdtp.id);
>> +        ovs_list_insert(&backer->ct_tp_kill_list, &ct_tp->list_node);
>
>
> not sure why you add to beginning here rather than end; was there some deeper 
> meaning at play ?

Not really; I am fine with adding it on either side, though.

>> +static void
>> +ct_del_zone_timeout_policy(const char *datapath_type, uint16_t zone)
>> +{
>> +struct dpif_backer *backer;
>> +struct ct_zone *ct_zone;
>> +
>> +backer = 

Re: [ovs-dev] [PATCH v3 1/9] ovs-vswitchd: Add Datapath, CT_Zone, and CT_Zone_Policy tables.

2019-08-14 Thread Darrell Ball
On Wed, Aug 14, 2019 at 9:47 AM Yi-Hung Wei  wrote:

> On Mon, Aug 12, 2019 at 7:46 PM Darrell Ball  wrote:
> >> diff --git a/vswitchd/vswitch.ovsschema b/vswitchd/vswitch.ovsschema
> >> index f7c6eb8983cd..c0a2242ad345 100644
> >> --- a/vswitchd/vswitch.ovsschema
> >> +++ b/vswitchd/vswitch.ovsschema
> >> @@ -1,9 +1,14 @@
> >>  {"name": "Open_vSwitch",
> >> - "version": "8.0.0",
> >> - "cksum": "3962141869 23978",
> >> + "version": "8.1.0",
> >> + "cksum": "1635647160 26090",
> >>   "tables": {
> >> "Open_vSwitch": {
> >>   "columns": {
> >> +   "datapaths": {
> >> + "type": {"key": {"type": "string"},
> >
> >
> > I had a minor comment in V2 about using an enum here for key - 'system'
> or 'netdev'
> > Does it work or is there worry that other datapath types will likely
> develop
> > and we will need to update the enum ?
>
> Thanks for the review.
>
> I discussed this with Justin.  We currently do not limit the datapath
> type in the Bridge table in the ovsdb schema, so we might just keep it
> as is to be consistent with the Bridge table.
>

Let's defer for now.
It can be treated as a separate issue and fixed later in one shot.


>
> >> +  "value": {"type": "uuid",
> >> +"refTable": "Datapath"},
> >> +  "min": 0, "max": "unlimited"}},
> >>
> >> "bridges": {
> >>   "type": {"key": {"type": "uuid",
> >>"refTable": "Bridge"},
> >> @@ -629,6 +634,48 @@
> >>"min": 0, "max": "unlimited"},
> >>   "ephemeral": true}},
> >>   "indexes": [["target"]]},
> >> +   "Datapath": {
> >> + "columns": {
> >> +   "datapath_version": {
> >> + "type": "string"},
> >> +   "ct_zones": {
> >> + "type": {"key": {"type": "integer",
> >> +  "minInteger": 0,
> >> +  "maxInteger": 65535},
> >> +  "value": {"type": "uuid",
> >> +"refTable": "CT_Zone"},
> >> +  "min": 0, "max": "unlimited"}},
> >
> >
> > minor comment from V2; I think
> > +  "min": 0, "max": "unlimited"}},
> > should be
> > +  "min": 0, "max": "65536"}},
>
> Since ct_zones is a map, the maximum size is already limited by the key
> range.
> + "type": {"key": {"type": "integer",
> +  "minInteger": 0,
> +  "maxInteger": 65535},
>
> Keeping "max" as "unlimited" also has the benefit that we do not need to
> update it when the range of values changes.  There are other cases
> in the ovsdb schema that have similar behavior, for example:


>"queues": {
>  "type": {"key": {"type": "integer",
>   "minInteger": 0,
>   "maxInteger": 4294967295},
>   "value": {"type": "uuid",
> "refTable": "Queue"},
>   "min": 0, "max": "unlimited"}},
>
>"mappings": {
>  "type": {"key": {"type": "integer",
>   "minInteger": 0,
>   "maxInteger": 16777215},
>   "value": {"type": "integer",
>   "minInteger": 0,
>   "maxInteger": 4095},
>   "min": 0, "max": "unlimited"}}
>

Let's defer.


>
> >> +   "external_ids": {
> >> + "type": {"key": "string", "value": "string",
> >> +  "min": 0, "max": "unlimited",
> >> +   "CT_Zone": {
> >> + "columns": {
> >> +   "timeout_policy": {
> >> + "type": {"key": {"type": "uuid",
> >> +  "refTable": "CT_Timeout_Policy"},
> >> +  "min": 0, "max": 1}},
> >> +   "external_ids": {
> >> + "type": {"key": "string", "value": "string",
> >> +  "min": 0, "max": "unlimited",
> >> +   "CT_Timeout_Policy": {
> >> + "columns": {
> >> +   "timeouts": {
> >> + "type": {"key": {"type" : "string",
> >> +  "enum": ["set", ["tcp_syn_sent",
> "tcp_syn_recv",
> >> +   "tcp_established",
> "tcp_fin_wait",
> >> +   "tcp_close_wait",
> "tcp_last_ack",
> >> +   "tcp_time_wait",
> "tcp_close",
> >> +   "tcp_syn_sent2",
> "tcp_retransmit",
> >> +   "tcp_unack", "udp_first",
> >> +   "udp_single",
> "udp_multiple",
> >> +   "icmp_first",
> "icmp_reply"]]},
> >> +  "value": {"type" : "integer",
> >> +"minInteger" : 0,
> >> +"maxInteger" : 4294967295},
> >> +  "min": 0, "max": "unlimited"}},
> >
> >
> > Should it be ?
> > +  

Re: [ovs-dev] [PATCH v3 1/9] ovs-vswitchd: Add Datapath, CT_Zone, and CT_Zone_Policy tables.

2019-08-14 Thread Yi-Hung Wei
On Mon, Aug 12, 2019 at 7:46 PM Darrell Ball  wrote:
>> diff --git a/vswitchd/vswitch.ovsschema b/vswitchd/vswitch.ovsschema
>> index f7c6eb8983cd..c0a2242ad345 100644
>> --- a/vswitchd/vswitch.ovsschema
>> +++ b/vswitchd/vswitch.ovsschema
>> @@ -1,9 +1,14 @@
>>  {"name": "Open_vSwitch",
>> - "version": "8.0.0",
>> - "cksum": "3962141869 23978",
>> + "version": "8.1.0",
>> + "cksum": "1635647160 26090",
>>   "tables": {
>> "Open_vSwitch": {
>>   "columns": {
>> +   "datapaths": {
>> + "type": {"key": {"type": "string"},
>
>
> I had a minor comment in V2 about using an enum here for key - 'system' or 
> 'netdev'
> Does it work or is there worry that other datapath types will likely develop
> and we will need to update the enum ?

Thanks for the review.

I discussed this with Justin.  We currently do not limit the datapath
type in the Bridge table in the ovsdb schema, so we might just keep it
as is to be consistent with the Bridge table.


>> +  "value": {"type": "uuid",
>> +"refTable": "Datapath"},
>> +  "min": 0, "max": "unlimited"}},
>>
>> "bridges": {
>>   "type": {"key": {"type": "uuid",
>>"refTable": "Bridge"},
>> @@ -629,6 +634,48 @@
>>"min": 0, "max": "unlimited"},
>>   "ephemeral": true}},
>>   "indexes": [["target"]]},
>> +   "Datapath": {
>> + "columns": {
>> +   "datapath_version": {
>> + "type": "string"},
>> +   "ct_zones": {
>> + "type": {"key": {"type": "integer",
>> +  "minInteger": 0,
>> +  "maxInteger": 65535},
>> +  "value": {"type": "uuid",
>> +"refTable": "CT_Zone"},
>> +  "min": 0, "max": "unlimited"}},
>
>
> minor comment from V2; I think
> +  "min": 0, "max": "unlimited"}},
> should be
> +  "min": 0, "max": "65536"}},

Since ct_zones is a map, the maximum size is already limited by the key range.
+ "type": {"key": {"type": "integer",
+  "minInteger": 0,
+  "maxInteger": 65535},

Keep "max" as "unlimited" also has the benefit that we do not need to
update it when the range of value is changed.  There are other cases
in ovsdb schema that has similar behavior, for  example:

   "queues": {
 "type": {"key": {"type": "integer",
  "minInteger": 0,
  "maxInteger": 4294967295},
  "value": {"type": "uuid",
"refTable": "Queue"},
  "min": 0, "max": "unlimited"}},

   "mappings": {
 "type": {"key": {"type": "integer",
  "minInteger": 0,
  "maxInteger": 16777215},
  "value": {"type": "integer",
  "minInteger": 0,
  "maxInteger": 4095},
  "min": 0, "max": "unlimited"}}


>> +   "external_ids": {
>> + "type": {"key": "string", "value": "string",
>> +  "min": 0, "max": "unlimited",
>> +   "CT_Zone": {
>> + "columns": {
>> +   "timeout_policy": {
>> + "type": {"key": {"type": "uuid",
>> +  "refTable": "CT_Timeout_Policy"},
>> +  "min": 0, "max": 1}},
>> +   "external_ids": {
>> + "type": {"key": "string", "value": "string",
>> +  "min": 0, "max": "unlimited",
>> +   "CT_Timeout_Policy": {
>> + "columns": {
>> +   "timeouts": {
>> + "type": {"key": {"type" : "string",
>> +  "enum": ["set", ["tcp_syn_sent", "tcp_syn_recv",
>> +   "tcp_established", 
>> "tcp_fin_wait",
>> +   "tcp_close_wait", "tcp_last_ack",
>> +   "tcp_time_wait", "tcp_close",
>> +   "tcp_syn_sent2", 
>> "tcp_retransmit",
>> +   "tcp_unack", "udp_first",
>> +   "udp_single", "udp_multiple",
>> +   "icmp_first", "icmp_reply"]]},
>> +  "value": {"type" : "integer",
>> +"minInteger" : 0,
>> +"maxInteger" : 4294967295},
>> +  "min": 0, "max": "unlimited"}},
>
>
> Should it be ?
> +  "min": 0, "max": "16"}},
>
> or will this create upgrade issues ?

Similarly, I would keep "max" as "unlimited" here, or we will need to
update "max" when more L4 protocol timeouts are supported.


>> +  
>> +
>> +  Configuration for a datapath within .
>> +
>> +
>> +  A datapath is responsible for providing the packet handling in Open
>> +  vSwitch.  There 

Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-14 Thread William Tu
On Wed, Aug 14, 2019 at 7:58 AM William Tu  wrote:
>
> On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron  wrote:
> >
> >
> >
> > On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
> >
> > 
> >
> >  I see a rather high number of afxdp_cq_skip, which should to my
> >  knowledge never happen?
> > >>>
> > >>> I tried to investigate this previously, but didn't find anything
> > >>> suspicious.
> > >>> So, for my knowledge, this should never happen too.
> > >>> However, I only looked at the code without actually running, because
> > >>> I had no
> > >>> HW available for testing.
> > >>>
> > >>> While investigation and stress-testing virtual ports I found few
> > >>> issues with
> > >>> missing locking inside the kernel, so there is no trust for kernel
> > >>> part of XDP
> > >>> implementation from my side. I'm suspecting that there are some
> > >>> other bugs in
> > >>> kernel/libbpf that only could be reproduced with driver mode.
> > >>>
> > >>> This never happens for virtual ports with SKB mode, so I never saw
> > >>> this coverage
> > >>> counter being non-zero.
> > >>
> > >> Did some quick debugging, as something else has come up that needs my
> > >> attention :)
> > >>
> > >> But once I’m in a faulty state and sent a single packet, causing
> > >> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
> > >> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
> > >> there might be some ring management bug. Maybe consumer and receiver
> > >> are equal meaning 0 buffers, but it returns max? I did not look at
> > >> the kernel code, so this is just a wild guess :)
> > >>
> > >> (gdb) p tx_done
> > >> $3 = 2048
> > >>
> > >> (gdb) p umem->cq
> > >> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
> > >> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
> > >> 0x7f08486b8040, ring = 0x7f08486b8080}
> > >
> > > Thanks for debugging!
> > >
> > > xsk_ring_cons__peek() just returns the difference between cached_prod
> > > and cached_cons, but these values are too different:
> > >
> > > 3830466864 - 3578066899 = 252399965
> > >
> > > Since this value > requested, it returns requested number (2048).
> > >
> > > So, the ring is broken. At least broken its 'cached' part. It'll be
> > > good
> > > to look at *consumer and *producer values to verify the state of the
> > > actual ring.
> > >
> >
> > I’ll try to find some more time next week to debug further.
> >
> > William I noticed your email in xdp-newbies where you mention this
> > problem of getting the wrong pointers. Did you ever follow up, or did
> > further trouble shooting on the above?
>
> Yes, I posted here
> https://www.spinics.net/lists/xdp-newbies/msg00956.html
> "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"
>
> At that time I was thinking about reproducing the problem using the
> xdpsock sample code from kernel. But turned out that my reproduction
> code is not correct, so not able to show the case we hit here in OVS.
>
> Then I put more similar code logic from OVS to xdpsock, but the problem
> does not show up. As a result, I worked around it by marking addr as
> "*addr == UINT64_MAX".
>
> I will debug again this week once I get my testbed back.
>
Just to refresh my memory. The original issue is that
when calling:
tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
xsk_ring_cons__release(&umem->cq, tx_done);

I expect there to be 'tx_done' elems on the CQ to recycle back to the memory
pool. However, when I inspect these elems, I find some that have 'already'
been reported complete the last time I called xsk_ring_cons__peek. In other
words, some elems show up on the CQ twice, and this causes overflow of the
mempool.

Thus, I mark the elems on the CQ as UINT64_MAX to indicate that we have
already seen them.
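
A minimal sketch of that workaround (the ring helpers are the real libbpf
xsk APIs; CONS_NUM_DESCS, the umem struct and umem_elem_push() are
illustrative stand-ins for the OVS internals, not the exact code):

#include <stdint.h>
#include <bpf/xsk.h>

#define CONS_NUM_DESCS 2048            /* assumption: matches the CQ size */

struct umem_info {                     /* assumed container for the UMEM */
    struct xsk_ring_cons cq;           /* completion queue */
    char *buffer;                      /* start of the UMEM area */
};

void umem_elem_push(struct umem_info *umem, void *elem);  /* hypothetical */

static void
complete_tx(struct umem_info *umem)
{
    uint32_t idx_cq = 0;
    unsigned int tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS,
                                               &idx_cq);

    for (unsigned int i = 0; i < tx_done; i++) {
        uint64_t *addr = (uint64_t *) xsk_ring_cons__comp_addr(&umem->cq,
                                                               idx_cq++);
        if (*addr != UINT64_MAX) {
            umem_elem_push(umem, umem->buffer + *addr);  /* recycle elem */
            *addr = UINT64_MAX;        /* mark the slot as already seen */
        }
        /* else: the same completion showed up twice; skipping it here is
         * what the afxdp_cq_skip coverage counter records. */
    }
    xsk_ring_cons__release(&umem->cq, tx_done);
}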

William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 3/3] dpdk: Use ovs-numa provided functions to manage thread affinity.

2019-08-14 Thread William Tu
On Tue, Aug 13, 2019 at 8:30 AM Ilya Maximets  wrote:
>
> This allows us to decrease code duplication and avoid using Linux-specific
> functions (which might be useful in the future if we try to allow
> running OvS+DPDK on FreeBSD).
>
> Signed-off-by: Ilya Maximets 

LGTM
Acked-by: William Tu 


> ---
>  lib/dpdk.c | 27 ---
>  1 file changed, 12 insertions(+), 15 deletions(-)
>
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index fc58de55a..6f297d918 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -275,7 +275,7 @@ dpdk_init__(const struct smap *ovs_other_config)
>  int result;
>  bool auto_determine = true;
>  int err = 0;
> -cpu_set_t cpuset;
> +struct ovs_numa_dump *affinity = NULL;
>  struct svec args = SVEC_EMPTY_INITIALIZER;
>
>  log_stream = fopencookie(NULL, "w+", dpdk_log_func);
> @@ -357,22 +357,22 @@ dpdk_init__(const struct smap *ovs_other_config)
>   * lcore for the DPDK Master.
>   */
>  if (auto_determine) {
> +const struct ovs_numa_info_core *core;
>  int cpu = 0;
>
>  /* Get the main thread affinity */
> -CPU_ZERO(&cpuset);
> -err = pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t),
> - &cpuset);
> -if (!err) {
> -for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
> -if (CPU_ISSET(cpu, &cpuset)) {
> -break;
> +affinity = ovs_numa_thread_getaffinity_dump();
> +if (affinity) {
> +cpu = INT_MAX;
> +FOR_EACH_CORE_ON_DUMP (core, affinity) {
> +if (cpu > core->core_id) {
> +cpu = core->core_id;
>  }
>  }
>  } else {
>  /* User did not set dpdk-lcore-mask and unable to get current
>   * thread affintity - default to core #0 */
> -VLOG_ERR("Thread getaffinity error %d. Using core #0", err);
> +VLOG_ERR("Thread getaffinity failed. Using core #0");
>  }
>  svec_add(&args, "-l");
>  svec_add_nocopy(&args, xasprintf("%d", cpu));
> @@ -403,12 +403,9 @@ dpdk_init__(const struct smap *ovs_other_config)
>  svec_destroy(&args);
>
>  /* Set the main thread affinity back to pre rte_eal_init() value */
> -if (auto_determine && !err) {
> -err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t),
> - &cpuset);
> -if (err) {
> -VLOG_ERR("Thread setaffinity error %d", err);
> -}
> +if (affinity) {
> +ovs_numa_thread_setaffinity_dump(affinity);
> +ovs_numa_dump_destroy(affinity);
>  }
>
>  if (result < 0) {
> --
> 2.17.1
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Make pidfile_is_running more robust against empty pidfiles

2019-08-14 Thread Michele Baldessari
On Wed, Aug 14, 2019 at 02:28:13PM +0300, Ilya Maximets wrote:
> On 14.08.2019 11:39, Michele Baldessari wrote:
> > In some of our destructive testing of ovn-dbs inside containers managed
> > by pacemaker we reached a situation where /var/run/openvswitch had
> > empty .pid files. The current code does not deal well with them
> > and pidfile_is_running() returns true in such a case and this confuses
> > the OCF resource agent.
> > 
> > - Before this change:
> > Inside a container run:
> >   killall ovsdb-server;
> >   echo -n '' > /var/run/openvswitch/ovnnb_db.pid; echo -n '' > /var/run/openvswitch/ovnsb_db.pid
> > 
> > We will observe that the cluster is unable to ever recover because
> > it believes the ovn processes to be running when they really aren't and
> > eventually just fails:
> >  podman container set: ovn-dbs-bundle 
> > [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
> >ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers):   Master controller-0
> >ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers):   Stopped controller-1
> >ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers):   Slave controller-2
> > 
> > - After this change the cluster is able to recover from this state and
> > correctly start the resource:
> >  podman container set: ovn-dbs-bundle 
> > [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
> >ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers):   Master controller-0
> >ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers):   Slave controller-1
> >ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers):   Slave controller-2
> > 
> > Signed-off-by: Michele Baldessari 
> > ---
> >  ovn/utilities/ovn-ctl | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/ovn/utilities/ovn-ctl b/ovn/utilities/ovn-ctl
> > index 7e5cd469c83c..65f03e28ddba 100755
> > --- a/ovn/utilities/ovn-ctl
> > +++ b/ovn/utilities/ovn-ctl
> > @@ -35,7 +35,7 @@ 
> > ovn_northd_db_conf_file="$etcdir/ovn-northd-db-params.conf"
> >  
> >  pidfile_is_running () {
> >  pidfile=$1
> > -test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
> > +test -e "$pidfile" && [ -s "$pidfile" ] && pid=`cat "$pidfile"` && pid_exists "$pid"
> 
> Hi. Thanks for the fix!
> 
> Maybe it's better to add an additional check for an empty argument to
> the 'pid_exists' function instead? That would cover more cases, like the
> invocations from utilities/ovs-lib.in.
> 
> I think you may also add the following tag to the commit message in this case:
> Fixes: 3028ce2595c8 ("ovs-lib: Allow "status" command to work as non-root.")
> 
> This patch will also be needed in the ovn-org/ovn repository.
> (Use the 'PATCH ovn' subject prefix when sending patches targeted at the ovn repo.)
> 
> Best regards, Ilya Maximets.

Thanks for the feedback Ilya, I have amended things (hopefully correctly) in
http://patchwork.ozlabs.org/patch/1147111/ (I could not figure out how
to update an existing patch in patchwork, I hope this is okay)

Kind regards,
Michele
-- 
Michele Baldessari
C2A5 9DA3 9961 4FFB E01B  D0BC DDD4 DCCB 7515 5C6D
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2, ovn] Make pid_exists() more robust against empty pid argument

2019-08-14 Thread Michele Baldessari
In some of our destructive testing of ovn-dbs inside containers managed
by pacemaker we reached a situation where /var/run/openvswitch had
empty .pid files. The current code does not deal well with them
and pidfile_is_running() returns true in such a case and this confuses
the OCF resource agent.

- Before this change:
Inside a container run:
  killall ovsdb-server;
  echo -n '' > /var/run/openvswitch/ovnnb_db.pid; echo -n '' > /var/run/openvswitch/ovnsb_db.pid

We will observe that the cluster is unable to ever recover because
it believes the ovn processes to be running when they really aren't and
eventually just fails:
 podman container set: ovn-dbs-bundle 
[192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers):   Master controller-0
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers):   Stopped controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers):   Slave controller-2

Let's make sure pid_exists() returns false when the pid is an empty
string.

- After this change the cluster is able to recover from this state and
correctly start the resource:
 podman container set: ovn-dbs-bundle 
[192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers):   Master controller-0
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers):   Slave controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers):   Slave controller-2

Fixes: 3028ce2595c8 ("ovs-lib: Allow "status" command to work as non-root.")

Signed-off-by: Michele Baldessari 
---
v1 -> v2

- Implemented Ilya's suggestion and moved the check from
  pidfile_is_running() to pid_exists() and re-run my tests
---
 utilities/ovs-lib.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/utilities/ovs-lib.in b/utilities/ovs-lib.in
index fa840ec637f5..dc485413ef0c 100644
--- a/utilities/ovs-lib.in
+++ b/utilities/ovs-lib.in
@@ -127,7 +127,7 @@ fi
 pid_exists () {
 # This is better than "kill -0" because it doesn't require permission to
 # send a signal (so daemon_status in particular works as non-root).
-test -d /proc/"$1"
+test -n "$1" && test -d /proc/"$1"
 }
 
 pid_comm_check () {
-- 
2.21.0
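
For the record, the failure mode is easy to demonstrate in a standalone
shell check (not part of the patch): with "$1" empty, the old test
degenerates to `test -d /proc/`, which always succeeds.

    pid=""
    test -d /proc/"$pid" && echo "old check: running"     # bug: /proc/ exists
    test -n "$pid" && test -d /proc/"$pid" \
        || echo "new check: not running"                  # guarded: correct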

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2] Make pid_exists() more robust against empty pid argument

2019-08-14 Thread Michele Baldessari
In some of our destructive testing of ovn-dbs inside containers managed
by pacemaker we reached a situation where /var/run/openvswitch had
empty .pid files. The current code does not deal well with them
and pidfile_is_running() returns true in such a case and this confuses
the OCF resource agent.

- Before this change:
Inside a container run:
  killall ovsdb-server;
  echo -n '' > /var/run/openvswitch/ovnnb_db.pid; echo -n '' > /var/run/openvswitch/ovnsb_db.pid

We will observe that the cluster is unable to ever recover because
it believes the ovn processes to be running when they really aren't and
eventually just fails:
 podman container set: ovn-dbs-bundle 
[192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers):   Master controller-0
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers):   Stopped controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers):   Slave controller-2

Let's make sure pid_exists() returns false when the pid is an empty
string.

- After this change the cluster is able to recover from this state and
correctly start the resource:
 podman container set: ovn-dbs-bundle 
[192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers):   Master controller-0
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers):   Slave controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers):   Slave controller-2

Fixes: 3028ce2595c8 ("ovs-lib: Allow "status" command to work as non-root.")

Signed-off-by: Michele Baldessari 
---
v1 -> v2

- Implemented Ilya's suggestion and moved the check from
  pidfile_is_running() to pid_exists() and re-run my tests
---
 utilities/ovs-lib.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/utilities/ovs-lib.in b/utilities/ovs-lib.in
index fa840ec637f5..dc485413ef0c 100644
--- a/utilities/ovs-lib.in
+++ b/utilities/ovs-lib.in
@@ -127,7 +127,7 @@ fi
 pid_exists () {
 # This is better than "kill -0" because it doesn't require permission to
 # send a signal (so daemon_status in particular works as non-root).
-test -d /proc/"$1"
+test -n "$1" && test -d /proc/"$1"
 }
 
 pid_comm_check () {
-- 
2.21.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] tnl-neigh: Use outgoing ofproto version.

2019-08-14 Thread Vasu Dasari
Looks good Flavio.

Acked-By: Vasu Dasari 

Thanks
-Vasu

*Vasu Dasari*


On Tue, Aug 13, 2019 at 12:45 PM Flavio Leitner via dev <
ovs-dev@openvswitch.org> wrote:

> On Tue, Aug 13, 2019 at 01:34:04PM -0300, Flavio Leitner via dev wrote:
> > When a packet needs to be encapsulated in userspace, the endpoint
> > address needs to be resolved to fill in the headers. If it is not,
> > then currently OvS sends either a Neighbor Solicitation (IPv6)
> > or an ARP Query (IPv4) to resolve it.
> >
> > The problem is that the NS/ARP packet will go through the flow
> > rules in the new bridge, but inheriting the ofproto table version
> > from the original packet to be encapsulated. When those versions
> > don't match, the result is unexpected because no flow rules might
> > be visible, which would cause the default table rule to be used
> > to drop the packet. Or only part of the flow rules would be visible
> > and so on.
> >
> > Since the NS/ARP packet is created by OvS and will be injected in
> > the outgoing bridge, use the corresponding ofproto version instead.
>
> Hi,
>
> Please backport this up to 2.9 at least.
> Thanks,
> fbl
>
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 2/3] dpif-netdev-perf: Fix TSC frequency for non-DPDK case.

2019-08-14 Thread William Tu
On Tue, Aug 13, 2019 at 8:30 AM Ilya Maximets  wrote:
>
> Unlike 'rte_get_tsc_cycles()' which doesn't need any specific
> initialization, 'rte_get_tsc_hz()' could be used only after a successful
> call to 'rte_eal_init()'. 'rte_eal_init()' estimates the TSC frequency
> for later use by 'rte_get_tsc_hz()'.  Strictly speaking, we're not allowed
> to use 'rte_get_tsc_cycles()' before initializing DPDK either, but it
> works this way for now and provides correct results.
>
> This patch provides TSC frequency estimation code that will be used
> in two cases:
>   * DPDK is not compiled in, i.e. DPDK_NETDEV not defined.
>   * DPDK compiled in but not initialized,
> i.e. other_config:dpdk-init=false
>
> This change is mostly useful for AF_XDP netdev support, i.e. allows
> to use dpif-netdev/pmd-perf-show command and various PMD perf metrics.
>
> Signed-off-by: Ilya Maximets 

Looks good to me.
Acked-by: William Tu 
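
For illustration, the estimation described in the commit message can be done
with a plain cycle count across a fixed sleep; a standalone sketch (x86-only,
using the compiler's __rdtsc() intrinsic, not the actual OVS patch):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

int
main(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
    uint64_t start = __rdtsc();

    nanosleep(&ts, NULL);                      /* sleep 100 ms */

    uint64_t hz = (__rdtsc() - start) * 10;    /* scale to cycles/second */
    printf("Estimated TSC frequency: %" PRIu64 " Hz\n", hz);
    return 0;
}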


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 1/3] ovs-numa: Add dump based thread affinity functions.

2019-08-14 Thread William Tu
On Tue, Aug 13, 2019 at 8:30 AM Ilya Maximets  wrote:
>
> New functions to get and set CPU affinity using CPU dumps.
> This will abstract OS specific implementation details from the
> cross-platform code.
>
> Signed-off-by: Ilya Maximets 

Acked-by: William Tu 

One comment inline below.

> ---
>  lib/ovs-numa.c | 74 +-
>  lib/ovs-numa.h |  2 ++
>  2 files changed, 75 insertions(+), 1 deletion(-)
>
> diff --git a/lib/ovs-numa.c b/lib/ovs-numa.c
> index 24edeab2a..a417730b8 100644
> --- a/lib/ovs-numa.c
> +++ b/lib/ovs-numa.c
> @@ -532,6 +532,78 @@ ovs_numa_dump_destroy(struct ovs_numa_dump *dump)
>  free(dump);
>  }
>
> +struct ovs_numa_dump *
> +ovs_numa_thread_getaffinity_dump(void)
> +{
> +if (dummy_numa) {
> +/* Nothing to do. */
> +return NULL;
> +}
> +
> +#ifndef __linux__
> +return NULL;
> +#else
> +struct ovs_numa_dump *dump;
> +const struct numa_node *n;
> +cpu_set_t cpuset;
> +int err;
> +
> +CPU_ZERO(&cpuset);
> +err = pthread_getaffinity_np(pthread_self(), sizeof cpuset, &cpuset);
> +if (err) {
> +VLOG_ERR("Thread getaffinity error: %s", ovs_strerror(err));
> +return NULL;
> +}
> +
> +dump = ovs_numa_dump_create();
> +
> +HMAP_FOR_EACH (n, hmap_node, &all_numa_nodes) {
> +const struct cpu_core *core;
> +
> +LIST_FOR_EACH (core, list_node, &n->cores) {
> +if (CPU_ISSET(core->core_id, &cpuset)) {
> +ovs_numa_dump_add(dump, core->numa->numa_id, core->core_id);
> +}
> +}
> +}
> +
> +if (!ovs_numa_dump_count(dump)) {
> +ovs_numa_dump_destroy(dump);
> +return NULL;
> +}
> +return dump;
> +#endif /* __linux__ */
> +}
> +
> +int
> +ovs_numa_thread_setaffinity_dump(struct ovs_numa_dump *dump OVS_UNUSED)

'dump' is used.

> +{
> +if (!dump || dummy_numa) {
> +/* Nothing to do. */
> +return 0;
> +}
> +
> +#ifdef __linux__
> +struct ovs_numa_info_core *core;
> +cpu_set_t cpuset;
> +int err;
> +
> +CPU_ZERO(&cpuset);
> +FOR_EACH_CORE_ON_DUMP (core, dump) {
> +CPU_SET(core->core_id, &cpuset);
> +}
> +err = pthread_setaffinity_np(pthread_self(), sizeof cpuset, &cpuset);
> +if (err) {
> +VLOG_ERR("Thread setaffinity error: %s", ovs_strerror(err));
> +return err;
> +}
> +
> +return 0;
> +#else /* !__linux__ */
> +return EOPNOTSUPP;
> +#endif /* __linux__ */
> +}
> +
>  int ovs_numa_thread_setaffinity_core(unsigned core_id OVS_UNUSED)
>  {
>  if (dummy_numa) {
> @@ -547,7 +619,7 @@ int ovs_numa_thread_setaffinity_core(unsigned core_id 
> OVS_UNUSED)
>  CPU_SET(core_id, &cpuset);
>  err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
>  if (err) {
> -VLOG_ERR("Thread affinity error %d",err);
> +VLOG_ERR("Thread setaffinity error: %s", ovs_strerror(err));
>  return err;
>  }
>
> diff --git a/lib/ovs-numa.h b/lib/ovs-numa.h
> index 088fcb8c3..88352a93e 100644
> --- a/lib/ovs-numa.h
> +++ b/lib/ovs-numa.h
> @@ -60,6 +60,8 @@ struct ovs_numa_dump *ovs_numa_dump_n_cores_per_numa(int n);
>  bool ovs_numa_dump_contains_core(const struct ovs_numa_dump *,
>   int numa_id, unsigned core_id);
>  size_t ovs_numa_dump_count(const struct ovs_numa_dump *);
> +struct ovs_numa_dump *ovs_numa_thread_getaffinity_dump(void);
> +int ovs_numa_thread_setaffinity_dump(struct ovs_numa_dump *);
>  void ovs_numa_dump_destroy(struct ovs_numa_dump *);
>  int ovs_numa_thread_setaffinity_core(unsigned core_id);
>
> --
> 2.17.1
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-14 Thread William Tu
On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron  wrote:
>
>
>
> On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
>
> 
>
> >>>> I see a rather high number of afxdp_cq_skip, which should to my
> >>>> knowledge never happen?
> >>>
> >>> I tried to investigate this previously, but didn't find anything
> >>> suspicious.
> >>> So, for my knowledge, this should never happen too.
> >>> However, I only looked at the code without actually running, because
> >>> I had no
> >>> HW available for testing.
> >>>
> >>> While investigation and stress-testing virtual ports I found few
> >>> issues with
> >>> missing locking inside the kernel, so there is no trust for kernel
> >>> part of XDP
> >>> implementation from my side. I'm suspecting that there are some
> >>> other bugs in
> >>> kernel/libbpf that only could be reproduced with driver mode.
> >>>
> >>> This never happens for virtual ports with SKB mode, so I never saw
> >>> this coverage
> >>> counter being non-zero.
> >>
> >> Did some quick debugging, as something else has come up that needs my
> >> attention :)
> >>
> >> But once I’m in a faulty state and sent a single packet, causing
> >> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
> >> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
> >> there might be some ring management bug. Maybe consumer and receiver
> >> are equal meaning 0 buffers, but it returns max? I did not look at
> >> the kernel code, so this is just a wild guess :)
> >>
> >> (gdb) p tx_done
> >> $3 = 2048
> >>
> >> (gdb) p umem->cq
> >> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
> >> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
> >> 0x7f08486b8040, ring = 0x7f08486b8080}
> >
> > Thanks for debugging!
> >
> > xsk_ring_cons__peek() just returns the difference between cached_prod
> > and cached_cons, but these values are too different:
> >
> > 3830466864 - 3578066899 = 252399965
> >
> > Since this value > requested, it returns requested number (2048).
> >
> > So, the ring is broken. At least broken its 'cached' part. It'll be
> > good
> > to look at *consumer and *producer values to verify the state of the
> > actual ring.
> >
>
> I’ll try to find some more time next week to debug further.
>
> William I noticed your email in xdp-newbies where you mention this
> problem of getting the wrong pointers. Did you ever follow up, or did
> further trouble shooting on the above?

Yes, I posted here
https://www.spinics.net/lists/xdp-newbies/msg00956.html
"Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"

At that time I was thinking about reproducing the problem using the
xdpsock sample code from kernel. But turned out that my reproduction
code is not correct, so not able to show the case we hit here in OVS.

Then I put more similar code logic from OVS to xdpsock, but the problem
does not show up. As a result, I worked around it by marking addr as
"*addr == UINT64_MAX".

I will debug again this week once I get my testbed back.

William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 4/4] rhel: Run ovn services with the 'openvswitch' user

2019-08-14 Thread Numan Siddique
On Wed, Aug 14, 2019 at 6:08 PM Jaime Caamaño Ruiz  wrote:

> Hello
>
> Some comments, which are probably due to me being confused while checking
> the new repo:
>
>

Hi Jaime,

This is patch 4 of the series. Patch 1 takes care of adding ovn specific
run dirs (run, log, db dirs).
Can you please take a look at the series ?

Patch 0 -
https://mail.openvswitch.org/pipermail/ovs-dev/2019-August/361628.html
https://patchwork.ozlabs.org/project/openvswitch/list/?submitter=67480




> 1)
>
> > sed -i 's:\(.*su\).*:\1 ovn ovn:' %{_sysconfdir}/logrotate.d/ovn
>
> I don't see the file %{_sysconfdir}/logrotate.d/ovn included anywhere.
> Anyway, since OVN_USER_ID defaults to openvswitch:openvswitch,
> shouldn't this be:
>


It's added in patch 3 of the series -
https://patchwork.ozlabs.org/patch/1146492/



> sed -i 's:\(.*su\).*:\1 openvswitch openvswitch:'
> %{_sysconfdir}/logrotate.d/ovn
>
> What about when the group is hugetlbfs?
>
>
hugetlbfs is used when ovs is configured with dpdk.

ovs-vswitchd/ovsdb-server will be running as user:group
"openvswitch:hugetlbfs".

If ovn-controller runs as user:group "openvswitch:openvswitch", it can still
access the file /var/run/openvswitch/db.sock and the other br-*.mgmt socket
files, right?

That's what I thought. Please correct me if I am wrong.


> 2)
>
> #OVN_USER_ID="openvswitch:openvswitch"
>
> Again, what about when the group should be hugetlbfs?
>
> 2)
>
> > chown -R $INSTALL_USER:$INSTALL_GROUP $ovn_dbdir
> > chown -R $INSTALL_USER:$INSTALL_GROUP $OVN_RUNDIR
> > chown -R $INSTALL_USER:$INSTALL_GROUP $ovn_logdir
>
> Since OVN and OVS still share log and run directories (right? anyway I
> don't see where ovn_dbdir and ovn_logdir are set), changing OVN_USER_ID
> to a different value than OVS_USER_ID without changing a bunch of other
> stuff will break things. Doesn't it make sense to use OVS_USER_ID
> instead of introducing OVN_USER_ID until there is a more meaningful
> split?
>
>
Please see patch 1 of the series which sets ovn_logdir, ovn_dbdir etc.

The idea is to totally separate out OVN from OVS, including user ids. But
since that is a bit tricky, as ovn-controller needs to access
/var/run/openvswitch/db.sock, I thought of using the openvswitch user for
now. I added OVN_USER_ID so that it would be easier later to switch to a
new 'ovn' user.

Is there a way to solve this issue? If we run ovn services as
"ovn:openvswitch" (or "ovn:hugetlbfs" in the case of dpdk), can
ovn-controller access /var/run/openvswitch/db.sock?
I tested it by running as "ovn:openvswitch" and it was not working for me.
Maybe I am doing something wrong?

Thanks for the comments.

Numan



> BR
> Jaime.
>
> -Original Message-
> From: nusid...@redhat.com
> To: d...@openvswitch.org
> Cc: Jaime Caamano 
> Subject: [PATCH ovn 4/4] rhel: Run ovn services with the 'openvswitch'
> user
> Date: Tue, 13 Aug 2019 21:58:22 +0530
>
> From: Numan Siddique 
>
> This patch could have created a new user 'ovn' for ovn services instead
> of using 'openvswitch' user. But this would require some amount of work
> and
> proper testing since the new user 'ovn' should be part of 'openvswitch'
> group (to access /var/run/openvswitch/db.sock.). If ovs is compiled
> with dpdk,
> then it may get tricky (as ovs-vswitchd is run as user -
> openvswitch:hugetlbfs).
> We can support a new user for 'ovn' services in the future.
>
> Recently the commit [1] in ovs repo added support to run ovn services
> with the
> 'openvswitch' user, but this commit was not applied to ovn repo as we
> had
> already created a new OVN repo. During the OVS/OVN formal split, we
> missed
> out on applying the patch [1]. This patch takes some code from [1].
>
> [1] - 94e1e8be3187 ("rhel: run ovn with the same user as ovs").
>
> CC: Jaime Caamaño Ruiz 
> Signed-off-by: Numan Siddique 
> ---
>  rhel/automake.mk|  3 ++-
>  rhel/ovn-fedora.spec.in | 13 +
>  ...r_lib_systemd_system_ovn-controller-vtep.service |  2 ++
>  rhel/usr_lib_systemd_system_ovn-controller.service  |  2 ++
>  rhel/usr_lib_systemd_system_ovn-northd.service  |  5 -
>  ...usr_share_ovn_scripts_systemd_sysconfig.template | 13 +
>  utilities/ovn-ctl   | 12 
>  7 files changed, 48 insertions(+), 2 deletions(-)
>  create mode 100644
> rhel/usr_share_ovn_scripts_systemd_sysconfig.template
>
> diff --git a/rhel/automake.mk b/rhel/automake.mk
> index 39e216b01..a46e6579b 100644
> --- a/rhel/automake.mk
> +++ b/rhel/automake.mk
> @@ -15,7 +15,8 @@ EXTRA_DIST += \
> rhel/usr_lib_systemd_system_ovn-controller-vtep.service \
> rhel/usr_lib_systemd_system_ovn-northd.service \
> rhel/usr_lib_firewalld_services_ovn-central-firewall-
> service.xml \
> -   rhel/usr_lib_firewalld_services_ovn-host-firewall-service.xml
> +   rhel/usr_lib_firewalld_services_ovn-host-firewall-service.xml
> \
> +   

Re: [ovs-dev] [PATCH ovn 4/4] rhel: Run ovn services with the 'openvswitch' user

2019-08-14 Thread Jaime Caamaño Ruiz
Hello

Some comments, which are probably due to me being confused while checking
the new repo:

1)

> sed -i 's:\(.*su\).*:\1 ovn ovn:' %{_sysconfdir}/logrotate.d/ovn

I don't see the file %{_sysconfdir}/logrotate.d/ovn included anywhere.
Anyway, since OVN_USER_ID defaults to openvswitch:openvswitch,
shouldn't this be:

sed -i 's:\(.*su\).*:\1 openvswitch openvswitch:'
%{_sysconfdir}/logrotate.d/ovn

What about when the group is hugetlbfs?

2)

#OVN_USER_ID="openvswitch:openvswitch"

Again, what about when the group should be hugetlbfs?

2)

> chown -R $INSTALL_USER:$INSTALL_GROUP $ovn_dbdir
> chown -R $INSTALL_USER:$INSTALL_GROUP $OVN_RUNDIR
> chown -R $INSTALL_USER:$INSTALL_GROUP $ovn_logdir

Since OVN and OVS still share log and run directories (right? anyway I
don't see where ovn_dbdir and ovn_logdir are set), changing OVN_USER_ID
to a different value than OVS_USER_ID without changing a bunch of other
stuff will break things. Doesn't it make sense to use OVS_USER_ID
instead of introducing OVN_USER_ID until there is a more meaningful
split?

BR
Jaime.

-Original Message-
From: nusid...@redhat.com
To: d...@openvswitch.org
Cc: Jaime Caamano 
Subject: [PATCH ovn 4/4] rhel: Run ovn services with the 'openvswitch'
user
Date: Tue, 13 Aug 2019 21:58:22 +0530

From: Numan Siddique 

This patch could have created a new user 'ovn' for ovn services instead
of using 'openvswitch' user. But this would require some amount of work
and
proper testing since the new user 'ovn' should be part of 'openvswitch'
group (to access /var/run/openvswitch/db.sock.). If ovs is compiled
with dpdk,
then it may get tricky (as ovs-vswitchd is run as user -
openvswitch:hugetlbfs).
We can support a new user for 'ovn' services in the future.

Recently the commit [1] in ovs repo added support to run ovn services
with the
'openvswitch' user, but this commit was not applied to ovn repo as we
had
already created a new OVN repo. During the OVS/OVN formal split, we
missed
out on applying the patch [1]. This patch takes some code from [1].

[1] - 94e1e8be3187 ("rhel: run ovn with the same user as ovs").

CC: Jaime Caamaño Ruiz 
Signed-off-by: Numan Siddique 
---
 rhel/automake.mk|  3 ++-
 rhel/ovn-fedora.spec.in | 13 +
 ...r_lib_systemd_system_ovn-controller-vtep.service |  2 ++
 rhel/usr_lib_systemd_system_ovn-controller.service  |  2 ++
 rhel/usr_lib_systemd_system_ovn-northd.service  |  5 -
 ...usr_share_ovn_scripts_systemd_sysconfig.template | 13 +
 utilities/ovn-ctl   | 12 
 7 files changed, 48 insertions(+), 2 deletions(-)
 create mode 100644
rhel/usr_share_ovn_scripts_systemd_sysconfig.template

diff --git a/rhel/automake.mk b/rhel/automake.mk
index 39e216b01..a46e6579b 100644
--- a/rhel/automake.mk
+++ b/rhel/automake.mk
@@ -15,7 +15,8 @@ EXTRA_DIST += \
rhel/usr_lib_systemd_system_ovn-controller-vtep.service \
rhel/usr_lib_systemd_system_ovn-northd.service \
rhel/usr_lib_firewalld_services_ovn-central-firewall-
service.xml \
-   rhel/usr_lib_firewalld_services_ovn-host-firewall-service.xml
+   rhel/usr_lib_firewalld_services_ovn-host-firewall-service.xml
\
+   rhel/usr_share_ovn_scripts_systemd_sysconfig.template
 
 update_rhel_spec = \
   $(AM_V_GEN)($(ro_shell) && sed -e 's,[@]VERSION[@],$(VERSION),g') \
diff --git a/rhel/ovn-fedora.spec.in b/rhel/ovn-fedora.spec.in
index cbca87511..14035de9a 100644
--- a/rhel/ovn-fedora.spec.in
+++ b/rhel/ovn-fedora.spec.in
@@ -186,6 +186,10 @@ make %{?_smp_mflags}
 rm -rf $RPM_BUILD_ROOT
 make install DESTDIR=$RPM_BUILD_ROOT
 
+install -p -D -m 0644 \
+rhel/usr_share_ovn_scripts_systemd_sysconfig.template \
+$RPM_BUILD_ROOT/%{_sysconfdir}/sysconfig/ovn
+
 for service in ovn-controller ovn-controller-vtep ovn-northd; do
 install -p -D -m 0644 \
 rhel/usr_lib_systemd_system_${service}.service
\
@@ -319,6 +323,14 @@ fi
 fi
 %endif
 
+%post
+%if %{with libcapng}
+if [ $1 -eq 1 ]; then
+sed -i 's:^#OVN_USER_ID=:OVN_USER_ID=:'
%{_sysconfdir}/sysconfig/ovn
+sed -i 's:\(.*su\).*:\1 ovn ovn:' %{_sysconfdir}/logrotate.d/ovn
+fi
+%endif
+
 %post central
 %if 0%{?systemd_post:1}
 %systemd_post ovn-northd.service
@@ -413,6 +425,7 @@ if [ $1 -eq 1 ]; then
 fi
 
 %files
+%config(noreplace) %{_sysconfdir}/sysconfig/ovn
 %{_bindir}/ovn-nbctl
 %{_bindir}/ovn-sbctl
 %{_bindir}/ovn-trace
diff --git a/rhel/usr_lib_systemd_system_ovn-controller-vtep.service
b/rhel/usr_lib_systemd_system_ovn-controller-vtep.service
index 832849488..09ad0612c 100644
--- a/rhel/usr_lib_systemd_system_ovn-controller-vtep.service
+++ b/rhel/usr_lib_systemd_system_ovn-controller-vtep.service
@@ -38,10 +38,12 @@ Restart=on-failure
 Environment=OVS_RUNDIR=%t/openvswitch
 Environment=OVN_RUNDIR=%t/ovn
 Environment=OVN_DB=unix:%t/ovn/ovnsb_db.sock
+EnvironmentFile=-/etc/sysconfig/ovn
 

Re: [ovs-dev] [PATCH] Improve MPLSoGRE performance by reducing EMC hash collisions.

2019-08-14 Thread Ilya Maximets
On 14.08.2019 11:33, Nitin Katiyar wrote:
> 
> 
>> -Original Message-
>> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>> Sent: Wednesday, August 14, 2019 1:42 PM
>> To: Nitin Katiyar ; ovs-dev@openvswitch.org
>> Cc: Stokes, Ian 
>> Subject: Re: [ovs-dev] [PATCH] Improve MPLSoGRE performance by reducing
>> EMC hash collisions.
>>
>> On 13.08.2019 20:49, Nitin Katiyar wrote:
>>> When a packet is received, the RSS hash is calculated if it is not
>>> already available. The Exact Match Cache (EMC) entry is then looked up
>>> using this RSS hash.
>>>
>>> When a MPLSoGRE encapsulated packet is received, the GRE header is
>>> popped, the RSS hash is invalidated and the packet is recirculated for
>>> further processing. When the recirculated packet is processed, the
>>> MPLS header is popped and the packet is recirculated again. Since the
>>> RSS hash has not been invalidated here, the EMC lookup will hit the
>>> same entry as that after the first recirculation. This degrades
>>> performance severely.
>>
> Hi Ilya,
>> Hmm. Function 'dpif_netdev_packet_get_rss_hash()' accounts recirculation
>> depth into the hash to avoid such cases. Why this doesn't work for you?
> This will not vary for multiple inner flows. If multiple flows are
> sent across the same tunnel (with the same MPLS label) then they will all
> lead to the same hash, as the external tuple and recirculation id remain
> the same for them. Let me know if I am wrong.

OK. I see. Collision is not with the previous pass of this packet but with
other packets from the same MPLS tunnel. This needs to be made clear in the
commit message.

And, IIUC, this issue is not related to MPLSoGRE specifically, it's just an
issue with any MPLS. I think, you need to re-write the commit message to be
more general and, probably, rename the patch to something like:
"packets: Invalidate offloads after MPLS decapsulation."
or
"packets: Fix using outdated RSS hash after MPLS decapsulation."
etc.

Another thing:
It looks like we need to do the same for NSH decap. What do you think?

Note:
This change will force hash re-calculation in case of output to balanced 
bonding.

Also, It's better to not mention EMC in a common 'lib/packets.c' file.
I think that it's enough to just say that we need to invalidate offloads
since they are not valid anymore after decapsulation.
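
To make the collision concrete, here is a simplified model of the hash used
for the EMC lookup (illustrative only; hash_finish() stands in for the
lib/hash.h finisher used by dpif_netdev_packet_get_rss_hash()):

uint32_t hash_finish(uint32_t hash, uint32_t final);   /* assumed stand-in */

static uint32_t
emc_lookup_hash(uint32_t rss_hash, uint32_t recirc_depth)
{
    /* Only the recirculation depth is mixed into the cached outer-header
     * hash.  Every inner flow of one MPLS tunnel shares the same rss_hash
     * and the same depth, so all of them land on the same EMC entry unless
     * the stale hash is invalidated at decapsulation time and recomputed
     * from the inner headers. */
    return recirc_depth ? hash_finish(rss_hash, recirc_depth) : rss_hash;
}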

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-14 Thread Eelco Chaudron



On 8 Aug 2019, at 17:38, Ilya Maximets wrote:

>>>> I see a rather high number of afxdp_cq_skip, which should to my
>>>> knowledge never happen?
>>>
>>> I tried to investigate this previously, but didn't find anything
>>> suspicious. So, for my knowledge, this should never happen too.
>>> However, I only looked at the code without actually running, because
>>> I had no HW available for testing.
>>>
>>> While investigation and stress-testing virtual ports I found few
>>> issues with missing locking inside the kernel, so there is no trust
>>> for kernel part of XDP implementation from my side. I'm suspecting
>>> that there are some other bugs in kernel/libbpf that only could be
>>> reproduced with driver mode.
>>>
>>> This never happens for virtual ports with SKB mode, so I never saw
>>> this coverage counter being non-zero.
>>
>> Did some quick debugging, as something else has come up that needs my
>> attention :)
>>
>> But once I’m in a faulty state and sent a single packet, causing
>> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
>> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
>> there might be some ring management bug. Maybe consumer and receiver
>> are equal meaning 0 buffers, but it returns max? I did not look at
>> the kernel code, so this is just a wild guess :)
>>
>> (gdb) p tx_done
>> $3 = 2048
>>
>> (gdb) p umem->cq
>> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
>> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
>> 0x7f08486b8040, ring = 0x7f08486b8080}
>
> Thanks for debugging!
>
> xsk_ring_cons__peek() just returns the difference between cached_prod
> and cached_cons, but these values are too different:
>
> 3830466864 - 3578066899 = 252399965
>
> Since this value > requested, it returns requested number (2048).
>
> So, the ring is broken. At least broken its 'cached' part. It'll be
> good to look at *consumer and *producer values to verify the state of
> the actual ring.

I’ll try to find some more time next week to debug further.

William I noticed your email in xdp-newbies where you mention this
problem of getting the wrong pointers. Did you ever follow up, or did
further trouble shooting on the above?

>>>> $ ovs-appctl coverage/show | grep xdp
>>>> afxdp_cq_empty     0.0/sec       339.600/sec         5.6606/sec   total: 20378
>>>> afxdp_tx_full      0.0/sec        29.967/sec         0.4994/sec   total: 1798
>>>> afxdp_cq_skip      0.0/sec  61884770.167/sec   1174238.3644/sec   total: 4227258112
>>>>
>>>> You mentioned you saw this high number in your v15 change notes,
>>>> did you do any research on why?

Cheers,

Eelco




___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Make pidfile_is_running more robust against empty pidfiles

2019-08-14 Thread Ilya Maximets
On 14.08.2019 11:39, Michele Baldessari wrote:
> In some of our destructive testing of ovn-dbs inside containers managed
> by pacemaker we reached a situation where /var/run/openvswitch had
> empty .pid files. The current code does not deal well with them
> and pidfile_is_running() returns true in such a case and this confuses
> the OCF resource agent.
> 
> - Before this change:
> Inside a container run:
>   killall ovsdb-server;
> >   echo -n '' > /var/run/openvswitch/ovnnb_db.pid; echo -n '' > /var/run/openvswitch/ovnsb_db.pid
> 
> We will observe that the cluster is unable to ever recover because
> it believes the ovn processes to be running when they really aren't and
> eventually just fails:
>  podman container set: ovn-dbs-bundle 
> [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
>ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers):   Master controller-0
>ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers):   Stopped controller-1
>ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers):   Slave controller-2
> 
> - After this change the cluster is able to recover from this state and
> correctly start the resource:
>  podman container set: ovn-dbs-bundle 
> [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
>ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers):   Master controller-0
>ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers):   Slave controller-1
>ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers):   Slave controller-2
> 
> Signed-off-by: Michele Baldessari 
> ---
>  ovn/utilities/ovn-ctl | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/ovn/utilities/ovn-ctl b/ovn/utilities/ovn-ctl
> index 7e5cd469c83c..65f03e28ddba 100755
> --- a/ovn/utilities/ovn-ctl
> +++ b/ovn/utilities/ovn-ctl
> @@ -35,7 +35,7 @@ ovn_northd_db_conf_file="$etcdir/ovn-northd-db-params.conf"
>  
>  pidfile_is_running () {
>  pidfile=$1
> -test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
> +test -e "$pidfile" && [ -s "$pidfile" ] && pid=`cat "$pidfile"` && pid_exists "$pid"

Hi. Thanks for the fix!

Maybe it's better to add an additional check for an empty argument to
the 'pid_exists' function instead? That would cover more cases, like the
invocations from utilities/ovs-lib.in.

I think you may also add the following tag to the commit message in this case:
Fixes: 3028ce2595c8 ("ovs-lib: Allow "status" command to work as non-root.")

This patch will also be needed in the ovn-org/ovn repository.
(Use the 'PATCH ovn' subject prefix when sending patches targeted at the ovn repo.)

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Make pidfile_is_running more robust against empty pidfiles

2019-08-14 Thread 0-day Robot
Bleep bloop.  Greetings Michele Baldessari, I am a robot and I have tried out 
your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 88 characters long (recommended limit is 79)
#46 FILE: ovn/utilities/ovn-ctl:38:
test -e "$pidfile" && [ -s "$pidfile" ] && pid=`cat "$pidfile"` && 
pid_exists "$pid"

Lines checked: 52, Warnings: 1, Errors: 0


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] Make pidfile_is_running more robust against empty pidfiles

2019-08-14 Thread Michele Baldessari
In some of our destructive testing of ovn-dbs inside containers managed
by pacemaker we reached a situation where /var/run/openvswitch had
empty .pid files. The current code does not deal well with them
and pidfile_is_running() returns true in such a case and this confuses
the OCF resource agent.

- Before this change:
Inside a container run:
  killall ovsdb-server;
  echo -n '' > /var/run/openvswitch/ovnnb_db.pid; echo -n '' > /var/run/openvswitch/ovnsb_db.pid

We will observe that the cluster is unable to ever recover because
it believes the ovn processes to be running when they really aren't and
eventually just fails:
 podman container set: ovn-dbs-bundle 
[192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers):   Master controller-0
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers):   Stopped controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers):   Slave controller-2

- After this change the cluster is able to recover from this state and
correctly start the resource:
 podman container set: ovn-dbs-bundle 
[192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers):   Master controller-0
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers):   Slave controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers):   Slave controller-2

Signed-off-by: Michele Baldessari 
---
 ovn/utilities/ovn-ctl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ovn/utilities/ovn-ctl b/ovn/utilities/ovn-ctl
index 7e5cd469c83c..65f03e28ddba 100755
--- a/ovn/utilities/ovn-ctl
+++ b/ovn/utilities/ovn-ctl
@@ -35,7 +35,7 @@ ovn_northd_db_conf_file="$etcdir/ovn-northd-db-params.conf"
 
 pidfile_is_running () {
 pidfile=$1
-test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
+test -e "$pidfile" && [ -s "$pidfile" ] && pid=`cat "$pidfile"` && 
pid_exists "$pid"
 } >/dev/null 2>&1
 
 stop_nb_ovsdb() {
-- 
2.21.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Do RCU synchronization at fixed interval in PMD main loop.

2019-08-14 Thread Nitin Katiyar



> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Tuesday, August 13, 2019 8:49 PM
> To: Nitin Katiyar ; ovs-dev@openvswitch.org
> Cc: Anju Thomas ; Stokes, Ian
> 
> Subject: Re: [ovs-dev] [PATCH] Do RCU synchronization at fixed interval in
> PMD main loop.
> 
> On 07.08.2019 17:21, Nitin Katiyar wrote:
> > Each PMD updates the global sequence number for RCU synchronization
> > purpose with other OVS threads. This is done at every 1025th iteration
> > in PMD main loop.
> >
> > If the PMD thread is responsible for polling large number of queues
> > that are carrying traffic, it spends a lot of time processing packets
> > and this results in significant delay in performing the housekeeping
> > activities.
> >
> > If the OVS main thread is waiting to synchronize with the PMD threads
> > and if those threads delay performing housekeeping activities for more
> > than 3 sec then LACP processing will be impacted and it will lead to
> > LACP flaps. Similarly, other controls protocols run by OVS main thread
> > are impacted.
> >
> > For e.g. a PMD thread polling 200 ports/queues with average of 1600
> > processing cycles per packet with batch size of 32 may take 10,240,000
> > (200 * 1600 * 32) cycles per iteration. In a system with a 2.0 GHz CPU it
> > means more than 5 ms per iteration. So, for 1024 iterations to
> > complete it would be more than 5 seconds.
> >
> > This gets worse when there are PMD threads which are less loaded.
> > It reduces possibility of getting mutex lock in ovsrcu_try_quiesce()
> > by heavily loaded PMD and next attempt to quiesce would be after 1024
> > iterations.
> >
> > With this patch, PMD RCU synchronization will be performed at a fixed
> > interval instead of after a fixed number of iterations. This will ensure
> > that even if the packet processing load is high the RCU
> > synchronization will not be delayed long.
> >
> > Co-authored-by: Anju Thomas 
> >
> > Signed-off-by: Nitin Katiyar 
> > Signed-off-by: Anju Thomas 
> > ---
> 
> 
> Hi. Thanks for working on this.
> 
> Your setup seems a bit strange to me (thread will be able to handle only 6Kpps
> per port), but I'm tend to agree that RCU synchronization needs to be fixed.
> Not sure about implementation. At least, I'd suggest using time instead of TSC
> for measurements.
> 
> Each PMD thread has 'ctx.now' time context measured in microseconds, so
> you may schedule next synchronization to exactly ctx.now + 10ms.  See
> 'next_optimization'
> and 'rxq_next_cycle_store' as a reference.  This will save few instructions 
> and
> also will help to avoid issues in non-DPDK (e.g. netdev-afxdp) case.
Thanks for reviewing it. I will look into it and send updated patch.
> 
> BTW, It's better to add 'dpif-netdev:' prefix to the patch name.
I will add this too.

Regards,
Nitin
> 
> Best regards, Ilya Maximets.
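
A sketch of the time-based variant Ilya suggests above (the 10 ms interval,
the 'next_rcu_quiesce' field and the constant name are assumptions, not the
final patch; ovsrcu_try_quiesce() returns 0 on success):

#define PMD_RCU_QUIESCE_INTERVAL 10000LL    /* 10 ms, in microseconds */

/* Inside pmd_thread_main()'s polling loop, replacing the 1025-iteration
 * counter: */
if (pmd->ctx.now > pmd->next_rcu_quiesce) {
    /* On success schedule the next synchronization point; if the mutex
     * was busy, retry on the very next iteration instead of waiting
     * another 1024 iterations. */
    if (!ovsrcu_try_quiesce()) {
        pmd->next_rcu_quiesce = pmd->ctx.now + PMD_RCU_QUIESCE_INTERVAL;
    }
}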
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Improve MPLSoGRE performance by reducing EMC hash collisions.

2019-08-14 Thread Nitin Katiyar



> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Wednesday, August 14, 2019 1:42 PM
> To: Nitin Katiyar ; ovs-dev@openvswitch.org
> Cc: Stokes, Ian 
> Subject: Re: [ovs-dev] [PATCH] Improve MPLSoGRE performance by reducing
> EMC hash collisions.
> 
> On 13.08.2019 20:49, Nitin Katiyar wrote:
> > When a packet is received, the RSS hash is calculated if it is not
> > already available. The Exact Match Cache (EMC) entry is then looked up
> > using this RSS hash.
> >
> > When a MPLSoGRE encapsulated packet is received, the GRE header is
> > popped, the RSS hash is invalidated and the packet is recirculated for
> > further processing. When the recirculated packet is processed, the
> > MPLS header is popped and the packet is recirculated again. Since the
> > RSS hash has not been invalidated here, the EMC lookup will hit the
> > same entry as that after the first recirculation. This degrades
> > performance severely.
> 
Hi Ilya,
> Hmm. Function 'dpif_netdev_packet_get_rss_hash()' accounts recirculation
> depth into the hash to avoid such cases. Why this doesn't work for you?
This will not vary for multiple inner flows. If multiple flows are sent
across the same tunnel (with the same MPLS label) then they will all lead to
the same hash, as the external tuple and recirculation id remain the same for
them. Let me know if I am wrong.

Regards,
Nitin
> 
> Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Improve MPLSoGRE performance by reducing EMC hash collisions.

2019-08-14 Thread Ilya Maximets
On 13.08.2019 20:49, Nitin Katiyar wrote:
> When a packet is received, the RSS hash is calculated if it is not
> already available. The Exact Match Cache (EMC) entry is then looked up
> using this RSS hash.
> 
> When a MPLSoGRE encapsulated packet is received, the GRE header is
> popped, the RSS hash is invalidated and the packet is recirculated for
> further processing. When the recirculated packet is processed, the MPLS
> header is popped and the packet is recirculated again. Since the RSS
> hash has not been invalidated here, the EMC lookup will hit the same
> entry as that after the first recirculation. This degrades performance
> severely.

Hmm. Function 'dpif_netdev_packet_get_rss_hash()' accounts recirculation
depth into the hash to avoid such cases. Why this doesn't work for you?

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] dpif-netdev: Add core id in the PMD thread name.

2019-08-14 Thread Eelco Chaudron



On 14 Aug 2019, at 9:45, Ilya Maximets wrote:

> On 13.08.2019 19:46, Eelco Chaudron wrote:
>>
>> On 13 Aug 2019, at 18:37, Ilya Maximets wrote:
>>
>>> This is highly useful to see on which core PMD is running by
>>> only looking at the thread name. Thread Id still allows to
>>> distinguish different threads running on the same core over the time:
>>>
>>>    |dpif_netdev(pmd-c10/id:53)|DBG|Creating 2. subtable <...>
>>>    |dpif_netdev(pmd-c10/id:53)|DBG|flow_add: <...>, actions:2
>>>    |dpif_netdev(pmd-c09/id:70)|DBG|Core 9 processing port <..>
>>>
>>> In gdb, top or any other utility it's useful to quickly catch up
>>> needed thread without parsing logs, memory or matching threads by port
>>> names they're handling.
>>>
>>> Signed-off-by: Ilya Maximets 
>>> ---
>>>  lib/dpif-netdev.c | 9 -
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index d0a1c58ad..34ba03836 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -4735,9 +4735,16 @@ reconfigure_pmd_threads(struct dp_netdev *dp)
>>>  FOR_EACH_CORE_ON_DUMP(core, pmd_cores) {
>>>  pmd = dp_netdev_get_pmd(dp, core->core_id);
>>>  if (!pmd) {
>>> +    struct ds name = DS_EMPTY_INITIALIZER;
>>> +
>>>  pmd = xzalloc(sizeof *pmd);
>>>  dp_netdev_configure_pmd(pmd, dp, core->core_id, core->numa_id);
>>> -    pmd->thread = ovs_thread_create("pmd", pmd_thread_main, pmd);
>>> +
>>> +    ds_put_format(&name, "pmd-c%02d/id:", core->core_id);
>>
>> This is a really good idea :) One remark should we make it %03d?
>
> There is a hard limit for the thread name. It's 15 meaningful chars
> excluding the terminating null byte. 'pmd-c02/id:' is 11 bytes wide
> keeping 4 bytes for the thread id. 'pmd-c002/id:' is 12 bytes wide with
> only 3 bytes remaining for id. Thread ids could easily become big
> ( > 1000) for a long running process, that is why %02d was chosen, to
> save some space.
>
> BTW, even on systems with 100+ CPUs I'm usually placing OVS threads on
> lower cores. Anyway, this is only a matter of a bit more visual beauty
> in the logs.
>
> What do you think?

Makes sense

Acked-by: Eelco Chaudron 

>>> +    pmd->thread = ovs_thread_create(ds_cstr(&name),
>>> +    pmd_thread_main, pmd);
>>> +    ds_destroy(&name);
>>> +
>>>  VLOG_INFO("PMD thread on numa_id: %d, core id: %2d created.",
>>>    pmd->numa_id, pmd->core_id);
>>>  changed = true;
>>> -- 
>>> 2.17.1




___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] dpif-netdev: Add core id in the PMD thread name.

2019-08-14 Thread Ilya Maximets
On 13.08.2019 19:46, Eelco Chaudron wrote:
> 
> 
> On 13 Aug 2019, at 18:37, Ilya Maximets wrote:
> 
>> This is highly useful to see on which core PMD is running by
>> only looking at the thread name. Thread Id still allows to
>> distinguish different threads running on the same core over the time:
>>
>>    |dpif_netdev(pmd-c10/id:53)|DBG|Creating 2. subtable <...>
>>    |dpif_netdev(pmd-c10/id:53)|DBG|flow_add: <...>, actions:2
>>    |dpif_netdev(pmd-c09/id:70)|DBG|Core 9 processing port <..>
>>
>> In gdb, top or any other utility it's useful to quickly catch up
>> needed thread without parsing logs, memory or matching threads by port
>> names they're handling.
>>
>> Signed-off-by: Ilya Maximets 
>> ---
>>  lib/dpif-netdev.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index d0a1c58ad..34ba03836 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -4735,9 +4735,16 @@ reconfigure_pmd_threads(struct dp_netdev *dp)
>>  FOR_EACH_CORE_ON_DUMP(core, pmd_cores) {
>>  pmd = dp_netdev_get_pmd(dp, core->core_id);
>>  if (!pmd) {
>> +    struct ds name = DS_EMPTY_INITIALIZER;
>> +
>>  pmd = xzalloc(sizeof *pmd);
>>  dp_netdev_configure_pmd(pmd, dp, core->core_id, core->numa_id);
>> -    pmd->thread = ovs_thread_create("pmd", pmd_thread_main, pmd);
>> +
>> +    ds_put_format(&name, "pmd-c%02d/id:", core->core_id);
> 
> This is a really good idea :) One remark should we make it %03d?

There is a hard limit for the thread name. It's 15 meaningful chars excluding 
the
terminating null byte. 'pmd-c02/id:' is 11 bytes wide keeping 4 bytes for the
thread id. 'pmd-c002/id:' is 12 bytes wide with only 3 bytes remaining for id.
Thread ids could easily become big ( > 1000) for a long running process, that is
why %02d was chosen, to save some space.
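
That 15-character cap is easy to verify in isolation (standalone demo, not
part of the patch; on Linux, pthread_setname_np() returns ERANGE past the
limit):

#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

int
main(void)
{
    int ok  = pthread_setname_np(pthread_self(), "pmd-c02/id:1234");  /* 15 */
    int err = pthread_setname_np(pthread_self(), "pmd-c002/id:1234"); /* 16 */

    printf("15 chars: %d, 16 chars: %d (ERANGE is %d)\n", ok, err, ERANGE);
    return 0;
}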

BTW, even on systems with 100+ CPUs I'm usually placing OVS threads on lower
cores. Anyway, this is only a matter of a bit more visual beauty in the logs.

What do you think?

> 
>> +    pmd->thread = ovs_thread_create(ds_cstr(&name),
>> +    pmd_thread_main, pmd);
>> +    ds_destroy(&name);
>> +
>>  VLOG_INFO("PMD thread on numa_id: %d, core id: %2d created.",
>>    pmd->numa_id, pmd->core_id);
>>  changed = true;
>> -- 
>> 2.17.1
> 
> 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev