[ovs-dev] [PATCH v4 2/2] ovsdb-idl: Exclude missing tables and columns in the transaction.

2021-08-24 Thread numans
From: Numan Siddique 

In cases where the C IDL client is compiled with a newer schema
and the ovsdb-server is running with an older schema, the IDL client
can include tables or columns in the transaction which are missing
in the server schema.  ovsdb-server will reject the transaction,
but the IDL client keeps retrying it, resulting in a continuous
loop.  This patch fixes the issue by excluding the missing tables
and columns from the jsonrpc transaction message.

This patch chose to exclude the missing tables/columns from the
transaction instead of failing the transaction (TXN_ERROR) for
the following reasons:

  1.  IDL clients will not know the exact reason for the
  transaction failure.

  2.  IDL client may not know all the transaction objects added in
  its loop before calling ovsdb_idl_loop_commit_and_wait().

  3.  If the client has to figure out the reason by checking the
  presence of tables/columns using the APIs added in the
  previous commit, it can very well exclude such tables/columns
  from the transaction.

Relevant test cases are added to cover this scenario.
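
As an illustration, a client built against the newer schema can also
guard its own writes with the APIs from the previous commit.  A rough
sketch against the idltest schema used by the tests (the generated
idltest bindings are assumed here):

    #include "ovsdb-idl.h"
    #include "tests/idltest.h"   /* Generated bindings for the test schema. */

    /* Sketch only: insert a simple7 row, but skip the table/column that
     * the connected ovsdb-server may not know about. */
    static void
    add_simple7_row(struct ovsdb_idl *idl, struct ovsdb_idl_txn *txn)
    {
        if (!ovsdb_idl_has_table(idl, "simple7")) {
            return;                 /* Table missing in the server schema. */
        }

        struct idltest_simple7 *row = idltest_simple7_insert(txn);
        idltest_simple7_set_name(row, "row-1");

        if (ovsdb_idl_has_column_in_table(idl, "simple7", "id")) {
            /* 'id' is absent from idltest2.ovsschema (the "old" server). */
            idltest_simple7_set_id(row, "id-1");
        }
    }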

Related issue: https://bugzilla.redhat.com/show_bug.cgi?id=1992705
Signed-off-by: Numan Siddique 
---
 lib/ovsdb-idl.c  | 16 
 tests/idltest.ovsschema  |  9 +++
 tests/idltest2.ovsschema |  7 ++
 tests/ovsdb-idl.at   | 37 
 tests/test-ovsdb.c   | 53 
 5 files changed, 122 insertions(+)

diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
index b2dfff46c..404cfc75a 100644
--- a/lib/ovsdb-idl.c
+++ b/lib/ovsdb-idl.c
@@ -3084,6 +3084,16 @@ ovsdb_idl_txn_commit(struct ovsdb_idl_txn *txn)
 HMAP_FOR_EACH (row, txn_node, &txn->txn_rows) {
 const struct ovsdb_idl_table_class *class = row->table->class_;
 
+if (!row->table->in_server_schema) {
+/* The table is not present in the server schema.  Do not
+ * include it in the transaction. */
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
+VLOG_WARN_RL(&rl, "%s database lacks %s table, excluding from "
+ "the txn.", row->table->idl->class_->database,
+ row->table->class_->name);
+continue;
+}
+
 if (!row->new_datum) {
 if (class->is_root) {
 struct json *op = json_object_create();
@@ -3382,6 +3392,12 @@ ovsdb_idl_txn_write__(const struct ovsdb_idl_row *row_,
 }
 
 class = row->table->class_;
+
+if (!ovsdb_idl_has_column_in_table(row->table->idl, class->name,
+   column->name)) {
+goto discard_datum;
+}
+
 column_idx = column - class->columns;
 write_only = row->table->modes[column_idx] == OVSDB_IDL_MONITOR;
 
diff --git a/tests/idltest.ovsschema b/tests/idltest.ovsschema
index 3ddb612b0..65eadb961 100644
--- a/tests/idltest.ovsschema
+++ b/tests/idltest.ovsschema
@@ -210,6 +210,15 @@
   },
   "isRoot": true
 },
+"simple7" : {
+  "columns" : {
+"name" : {
+  "type": "string"
+},
+"id": {"type": "string"}
+  },
+  "isRoot" : true
+},
 "singleton" : {
   "columns" : {
 "name" : {
diff --git a/tests/idltest2.ovsschema b/tests/idltest2.ovsschema
index 210e4c389..a8199e56c 100644
--- a/tests/idltest2.ovsschema
+++ b/tests/idltest2.ovsschema
@@ -139,6 +139,13 @@
   "type": "string"
 }
   }
+},
+"simple7" : {
+  "columns" : {
+"name" : {
+  "type": "string"
+}
+  }
 }
   }
 }
diff --git a/tests/ovsdb-idl.at b/tests/ovsdb-idl.at
index b11fabe70..26212f265 100644
--- a/tests/ovsdb-idl.at
+++ b/tests/ovsdb-idl.at
@@ -999,6 +999,7 @@ test-ovsdb|ovsdb_idl|idltest database lacks simple5 table (database needs upgrade?)
 test-ovsdb|ovsdb_idl|idltest database lacks simple6 table (database needs upgrade?)
 test-ovsdb|ovsdb_idl|idltest database lacks singleton table (database needs upgrade?)
 test-ovsdb|ovsdb_idl|link1 table in idltest database lacks l2 column (database needs upgrade?)
+test-ovsdb|ovsdb_idl|simple7 table in idltest database lacks id column (database needs upgrade?)
 ])
 
 # Check that ovsdb-idl sent on "monitor" request and that it didn't
@@ -2410,3 +2411,39 @@ column l2 in table link1 is present
 
 OVSDB_SERVER_SHUTDOWN
 AT_CLEANUP
+
+AT_SETUP([idl transaction handling of missing tables and columns - C])
+AT_KEYWORDS([ovsdb client idl txn])
+
+# idltest2.ovsschema is the same as idltest.ovsschema, except that a
+# few tables and columns are missing.  This test checks that the idl
+# doesn't include the missing tables and columns in the transactions.
+# idl-missing-table-column-txn inserts
+#   - a row for table - 'simple'
+#   - a row for table - 'simple5' which is missing.  This should not be
+#included in the transaction.
+#   - a row for table - 'simple7 with the missing column 

[ovs-dev] [PATCH v4 1/2] ovsdb-idl : Add APIs to query if a table and a column is present or not.

2021-08-24 Thread numans
From: Numan Siddique 

This patch adds two new APIs to the ovsdb-idl client library,
ovsdb_idl_has_table() and ovsdb_idl_has_column_in_table(), to query
whether a table or a column is present in the IDL.  This is required
for scenarios where the server schema is old and is missing a table
or column while the client (built with a newer schema version) puts
that table or column into a transaction, which results in a
continuous loop of transaction failures.

OVN would use ovsdb_idl_has_table() to address this issue when an old
ovsdb-server (OVS 2.11) is used, which lacks the 'datapath' table.  A
recent commit in OVN creates a 'datapath' row in the local
ovsdb-server.  ovsdb_idl_has_column_in_table() would be useful to
have as well.

Related issue: https://bugzilla.redhat.com/show_bug.cgi?id=1992705
Signed-off-by: Numan Siddique 
---
 lib/ovsdb-idl-provider.h |  4 +++
 lib/ovsdb-idl.c  | 36 
 lib/ovsdb-idl.h  |  3 ++
 tests/ovsdb-idl.at   | 38 +
 tests/test-ovsdb.c   | 73 
 5 files changed, 154 insertions(+)

diff --git a/lib/ovsdb-idl-provider.h b/lib/ovsdb-idl-provider.h
index 0f38f9b34..0f122e23c 100644
--- a/lib/ovsdb-idl-provider.h
+++ b/lib/ovsdb-idl-provider.h
@@ -24,6 +24,7 @@
 #include "ovsdb-set-op.h"
 #include "ovsdb-types.h"
 #include "openvswitch/shash.h"
+#include "sset.h"
 #include "uuid.h"
 
 #ifdef __cplusplus
@@ -117,9 +118,12 @@ struct ovsdb_idl_table {
 bool need_table; /* Monitor table even if no columns are selected
   * for replication. */
 struct shash columns;/* Contains "const struct ovsdb_idl_column *"s. */
+struct sset schema_columns; /* Column names from schema. */
 struct hmap rows;/* Contains "struct ovsdb_idl_row"s. */
 struct ovsdb_idl *idl;   /* Containing IDL instance. */
 unsigned int change_seqno[OVSDB_IDL_CHANGE_MAX];
+bool in_server_schema;   /* Indicates if this table is in the server schema
+  * or not. */
 struct ovs_list indexes;/* Contains "struct ovsdb_idl_index"s */
 struct ovs_list track_list; /* Tracked rows (ovsdb_idl_row.track_node). */
 };
diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
index 2198c69c6..b2dfff46c 100644
--- a/lib/ovsdb-idl.c
+++ b/lib/ovsdb-idl.c
@@ -287,6 +287,8 @@ ovsdb_idl_create_unconnected(const struct ovsdb_idl_class *class,
 = table->change_seqno[OVSDB_IDL_CHANGE_MODIFY]
 = table->change_seqno[OVSDB_IDL_CHANGE_DELETE] = 0;
 table->idl = idl;
+table->in_server_schema = true;  /* Assume it's in server schema. */
+sset_init(&table->schema_columns);
 }
 
 return idl;
@@ -337,6 +339,7 @@ ovsdb_idl_destroy(struct ovsdb_idl *idl)
 struct ovsdb_idl_table *table = &idl->tables[i];
 ovsdb_idl_destroy_indexes(table);
 shash_destroy(&table->columns);
+sset_destroy(&table->schema_columns);
 hmap_destroy(&table->rows);
 free(table->modes);
 }
@@ -718,6 +721,7 @@ ovsdb_idl_compose_monitor_request(const struct json *schema_json, void *idl_)
 
 struct json *columns
 = table->need_table ? json_array_create_empty() : NULL;
+sset_clear(&table->schema_columns);
 for (size_t j = 0; j < tc->n_columns; j++) {
 const struct ovsdb_idl_column *column = &tc->columns[j];
 bool idl_has_column = (table_schema &&
@@ -741,6 +745,7 @@ ovsdb_idl_compose_monitor_request(const struct json *schema_json, void *idl_)
 }
 json_array_add(columns, json_string_create(column->name));
 }
+sset_add(&table->schema_columns, column->name);
 }
 
 if (columns) {
@@ -749,7 +754,12 @@ ovsdb_idl_compose_monitor_request(const struct json *schema_json, void *idl_)
   "(database needs upgrade?)",
   idl->class_->database, table->class_->name);
 json_destroy(columns);
+/* Set 'table->in_server_schema' to false so that this can be
+ * excluded from transactions. */
+table->in_server_schema = false;
 continue;
+} else if (schema && table_schema) {
+table->in_server_schema = true;
 }
 
 monitor_request = json_object_create();
@@ -4256,3 +4266,29 @@ ovsdb_idl_loop_commit_and_wait(struct ovsdb_idl_loop *loop)
 
 return retval;
 }
+
+static struct ovsdb_idl_table*
+ovsdb_idl_get_table(struct ovsdb_idl *idl, const char *table_name)
+{
+struct ovsdb_idl_table *table = shash_find_data(&idl->table_by_name,
+table_name);
+return table && table->in_server_schema ? table : NULL;
+}
+
+bool
+ovsdb_idl_has_table(struct ovsdb_idl *idl, const char *table_name)
+{
+return ovsdb_idl_get_table(idl, table_name) ? true: false;
+}
+

[ovs-dev] [PATCH v4 0/2] ovsdb-idl: Address missing table and column issues.

2021-08-24 Thread numans
From: Numan Siddique 

This patch series addresses the transaction issues seen when the
C IDL client is compiled with a newer schema while ovsdb-server is
still running with an older schema.

Patch 1: Adds the APIs to query whether a table or column is present.
Patch 2: Addresses the transaction issues.

v3 -> v4
--
  * Addressed the review comment from Ilya for patch 2.
ovsdb_idl_txn_insert() will not return NULL for missing tables.
 
v2 -> v3
--
  * Patch 2 is added in v3.
  * Patch 1 implementation in v2 was wrong.  This patch fixes it and
enhances the test cases.

v1 -> v2
--
  * Added the test cases.



Numan Siddique (2):
  ovsdb-idl : Add APIs to query if a table and a column is present or
not.
  ovsdb-idl: Exclude missing tables and columns in the transaction.

 lib/ovsdb-idl-provider.h |   4 ++
 lib/ovsdb-idl.c  |  52 
 lib/ovsdb-idl.h  |   3 +
 tests/idltest.ovsschema  |   9 +++
 tests/idltest2.ovsschema |   7 +++
 tests/ovsdb-idl.at   |  75 +++
 tests/test-ovsdb.c   | 126 +++
 7 files changed, 276 insertions(+)

-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 2/2] northd: improve unreachable_ips flows processing for dp_groups

2021-08-24 Thread Mark Michelson

I think you should quote the results from the cover letter in this commit.

Other than that,
Acked-by: Mark Michelson 

On 8/24/21 2:02 PM, Lorenzo Bianconi wrote:

Improve the efficiency of load-balancer unreachable_ips flow
processing when datapath_groups are enabled.  We can avoid
performing a flow lookup for each ovn_port in
build_lrouter_flows_for_unreachable_ips() since the lflows only
depend on the vips (match and action filters).

Signed-off-by: Lorenzo Bianconi 
---
  northd/ovn-northd.c | 51 ++---
  1 file changed, 43 insertions(+), 8 deletions(-)

diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index a8e69b3cb..1a48574f2 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -4357,7 +4357,7 @@ static struct hashrow_locks lflow_locks;
  /* Adds a row with the specified contents to the Logical_Flow table.
   * Version to use when locking is required.
   */
-static void
+static struct ovn_lflow *
  do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
   uint32_t hash, enum ovn_stage stage, uint16_t priority,
   const char *match, const char *actions, const char *io_port,
@@ -4373,7 +4373,7 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
 actions, ctrl_meter, hash);
  if (old_lflow) {
   hmapx_add(&old_lflow->od_group, od);
-return;
+return old_lflow;
  }
  }
  
@@ -4388,10 +4388,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,

 ovn_lflow_hint(stage_hint), where);
   hmapx_add(&lflow->od_group, od);
   hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
+return lflow;
  }
  
  /* Adds a row with the specified contents to the Logical_Flow table. */

-static void
+static struct ovn_lflow *
  ovn_lflow_add_at_with_hash(struct hmap *lflow_map, struct ovn_datapath *od,
 enum ovn_stage stage, uint16_t priority,
 const char *match, const char *actions,
@@ -4399,16 +4400,21 @@ ovn_lflow_add_at_with_hash(struct hmap *lflow_map, struct ovn_datapath *od,
 const struct ovsdb_idl_row *stage_hint,
 const char *where, uint32_t hash)
  {
+struct ovn_lflow *lflow;
+
   ovs_assert(ovn_stage_to_datapath_type(stage) == ovn_datapath_get_type(od));
  if (use_logical_dp_groups && use_parallel_build) {
   lock_hash_row(&lflow_locks, hash);
-do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
- actions, io_port, stage_hint, where, ctrl_meter);
+lflow = do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
+ actions, io_port, stage_hint, where,
+ ctrl_meter);
   unlock_hash_row(&lflow_locks, hash);
  } else {
-do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
- actions, io_port, stage_hint, where, ctrl_meter);
+lflow = do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
+ actions, io_port, stage_hint, where,
+ ctrl_meter);
  }
+return lflow;
  }
  
  static void

@@ -9422,6 +9428,26 @@ build_lrouter_defrag_flows_for_lb(struct ovn_northd_lb *lb,
   ds_destroy(&defrag_actions);
  }
  
+static bool

+ovn_dp_group_update_with_reference(struct ovn_lflow *lflow_ref,
+   struct ovn_datapath *od,
+   uint32_t hash)
+{
+if (!use_logical_dp_groups || !lflow_ref) {
+return false;
+}
+
+if (use_parallel_build) {
+lock_hash_row(&lflow_locks, hash);
+hmapx_add(&lflow_ref->od_group, od);
+unlock_hash_row(&lflow_locks, hash);
+} else {
+hmapx_add(&lflow_ref->od_group, od);
+}
+
+return true;
+}
+
  static void
  build_lrouter_flows_for_unreachable_ips(struct ovn_northd_lb *lb,
  struct ovn_lb_vip *lb_vip,
@@ -9431,6 +9457,7 @@ build_lrouter_flows_for_unreachable_ips(struct ovn_northd_lb *lb,
  struct ds *action)
  {
   bool ipv4 = IN6_IS_ADDR_V4MAPPED(&lb_vip->vip);
+struct ovn_lflow *lflow_ref = NULL;
  ovs_be32 ipv4_addr;
  
  ds_clear(match);

@@ -9474,7 +9501,15 @@ build_lrouter_flows_for_unreachable_ips(struct ovn_northd_lb *lb,
   (!ipv4 && lrouter_port_ipv6_reachable(op, &lb_vip->vip))) {
  continue;
  }
-ovn_lflow_add_at_with_hash(lflows, peer->od,
+
+if (ovn_dp_group_update_with_reference(lflow_ref, od, hash)) {
+/* if we are running dp_groups, we do not need to run full
+ * lookup since lflow just depends on the vip and not on
+ * the ovn_port.
+ */
+   

Re: [ovs-dev] [PATCH ovn 1/2] northd: refactor unreachable IPs lb flows

2021-08-24 Thread Mark Michelson

Hi Lorenzo,

I have some minor findings below.

On 8/24/21 2:02 PM, Lorenzo Bianconi wrote:

Refactor unreachable-IP flows for vip load-balancers, inverting the
logic used during lb flow creation in order to visit the lb first and
then the related datapath/ovn_ports.  This is a preliminary patch to
avoid recomputing the flow hash and the lflow lookup when not
necessary.

Signed-off-by: Lorenzo Bianconi 
---
  northd/ovn-northd.c | 137 
  1 file changed, 101 insertions(+), 36 deletions(-)

diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 3d8e21a4f..a8e69b3cb 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -4391,6 +4391,26 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
  }
  
  /* Adds a row with the specified contents to the Logical_Flow table. */

+static void
+ovn_lflow_add_at_with_hash(struct hmap *lflow_map, struct ovn_datapath *od,
+   enum ovn_stage stage, uint16_t priority,
+   const char *match, const char *actions,
+   const char *io_port, const char *ctrl_meter,
+   const struct ovsdb_idl_row *stage_hint,
+   const char *where, uint32_t hash)
+{
+ovs_assert(ovn_stage_to_datapath_type(stage) == ovn_datapath_get_type(od));
+if (use_logical_dp_groups && use_parallel_build) {
+lock_hash_row(&lflow_locks, hash);
+do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
+ actions, io_port, stage_hint, where, ctrl_meter);
+unlock_hash_row(&lflow_locks, hash);
+} else {
+do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
+ actions, io_port, stage_hint, where, ctrl_meter);
+}
+}
+
  static void
  ovn_lflow_add_at(struct hmap *lflow_map, struct ovn_datapath *od,
   enum ovn_stage stage, uint16_t priority,
@@ -4398,24 +4418,14 @@ ovn_lflow_add_at(struct hmap *lflow_map, struct ovn_datapath *od,
   const char *ctrl_meter,
   const struct ovsdb_idl_row *stage_hint, const char *where)
  {
-ovs_assert(ovn_stage_to_datapath_type(stage) == ovn_datapath_get_type(od));
-
  uint32_t hash;
  
  hash = ovn_logical_flow_hash(ovn_stage_get_table(stage),

   ovn_stage_get_pipeline_name(stage),
   priority, match,
   actions);
-
-if (use_logical_dp_groups && use_parallel_build) {
-lock_hash_row(&lflow_locks, hash);
-do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
- actions, io_port, stage_hint, where, ctrl_meter);
-unlock_hash_row(&lflow_locks, hash);
-} else {
-do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
- actions, io_port, stage_hint, where, ctrl_meter);
-}
+ovn_lflow_add_at_with_hash(lflow_map,od, stage, priority, match, actions,


s/,od/, od/


+   io_port, ctrl_meter, stage_hint, where, hash);
  }
  
  /* Adds a row with the specified contents to the Logical_Flow table. */

@@ -6870,16 +6880,11 @@ build_lswitch_rport_arp_req_flows(struct ovn_port *op,
  /* Check if the ovn port has a network configured on which we could
   * expect ARP requests for the LB VIP.
   */
-if (ip_parse(ip_addr, &ipv4_addr)) {
-if (lrouter_port_ipv4_reachable(op, ipv4_addr)) {
-build_lswitch_rport_arp_req_flow_for_reachable_ip(
-ip_addr, AF_INET, sw_op, sw_od, 80, lflows,
-stage_hint);
-} else {
-build_lswitch_rport_arp_req_flow_for_unreachable_ip(
-ip_addr, AF_INET, sw_od, 90, lflows,
-stage_hint);
-}
+if (ip_parse(ip_addr, &ipv4_addr) &&
+lrouter_port_ipv4_reachable(op, ipv4_addr)) {
+build_lswitch_rport_arp_req_flow_for_reachable_ip(
+ip_addr, AF_INET, sw_op, sw_od, 80, lflows,
+stage_hint);
  }
  }
   SSET_FOR_EACH (ip_addr, &op->od->lb_ips_v6) {
@@ -6888,16 +6893,11 @@ build_lswitch_rport_arp_req_flows(struct ovn_port *op,
  /* Check if the ovn port has a network configured on which we could
   * expect NS requests for the LB VIP.
   */
-if (ipv6_parse(ip_addr, &ipv6_addr)) {
-if (lrouter_port_ipv6_reachable(op, &ipv6_addr)) {
-build_lswitch_rport_arp_req_flow_for_reachable_ip(
-ip_addr, AF_INET6, sw_op, sw_od, 80, lflows,
-stage_hint);
-} else {
-build_lswitch_rport_arp_req_flow_for_unreachable_ip(
-ip_addr, AF_INET6, sw_od, 90, lflows,
-stage_hint);
-}
+if (ipv6_parse(ip_addr, &ipv6_addr) &&
+

Re: [ovs-dev] [PATCH ovn v2] ovn-nbctl: Monitor nb_cfg to detect and handle potential overflows

2021-08-24 Thread Numan Siddique
On Sat, Aug 21, 2021 at 1:15 PM  wrote:
>
> From: Mohammad Heib 
>
> ovn-nbctl sync command will increase nb_cfg value each time it is executed
> with a specific wait_type, this increment will be handled without testing
> the current nb_cfg value because it is not monitored and that could lead
> to an overflow issue if nb_cfg == LLONG_MAX.
>
> To avoid such potential overflow cases we must monitor the real value
> of nb_cfg each time sync is executed and if there is any overflow issue
> it will be handled by function nbctl_pre_execute later on the execution.
>
> Fixes: be3a60f8e6a3 ("ovn-nbctl: Deal with nb_cfg overflows.")
> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1979774
> Signed-off-by: Mohammad Heib 

Thanks for adding the test cases.

I applied this patch to the main branch with the below changes in the test code.
I also added your name in the AUTHORS file.

---
diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
index 2fe5ac561..8c80b00c9 100644
--- a/tests/ovn-controller.at
+++ b/tests/ovn-controller.at
@@ -403,11 +403,10 @@ as hv
 ovs-vsctl add-br br-phys
 ovn_attach n1 br-phys 192.168.0.1

-OVS_WAIT_UNTIL([test xhv = x`ovn-sbctl --columns name --bare find chassis`])
+check ovn-nbctl --wait=hv sync

 # overflow the NB_Global nb_cfg value
-nb_global_id=$(ovn-nbctl --columns _uuid --bare find nb_global)
-ovn-nbctl set NB_Global ${nb_global_id} nb_cfg=9223372036854775806
+check ovn-nbctl set NB_Global . nb_cfg=9223372036854775806

 # nb_cfg must be set to zero if it exceed the value of LLONG_MAX
 # the command below will try incress the value of nb_cfg to be greater than LLONG_MAX and
@@ -415,12 +414,10 @@ ovn-nbctl set NB_Global ${nb_global_id} nb_cfg=9223372036854775806
-OVS_WAIT_UNTIL([test x0 = x`ovn-nbctl --wait=hv sync && ovn-nbctl --wait=hv sync && echo $?`])
+check ovn-nbctl --wait=hv sync
+check ovn-nbctl --wait=hv sync

 # nb_cfg should be set to 1 in the chassis_private/nb_global/sb_global table
-OVS_WAIT_UNTIL([test x1 = x`ovn-sbctl --columns nb_cfg --bare find chassis_private`])
-OVS_WAIT_UNTIL([test x1 = x`ovn-sbctl --columns nb_cfg --bare find sb_global`])
-OVS_WAIT_UNTIL([test x1 = x`ovn-nbctl --columns nb_cfg --bare find nb_global`])
-
-# Assert that the the nb_cfg from the Chassis table was not incremented
-OVS_WAIT_UNTIL([test x0 = x`ovn-sbctl --columns nb_cfg --bare find chassis`])
+check_column 1 chassis_private nb_cfg
+check_column 1 sb_global nb_cfg
+check_column 1 nb:nb_global nb_cfg
+check_column 0 chassis nb_cfg

 OVN_CLEANUP([hv])
 AT_CLEANUP


Thanks
Numan



> ---
>  tests/ovn-controller.at | 33 +
>  utilities/ovn-nbctl.c   |  2 ++
>  2 files changed, 35 insertions(+)
>
> diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
> index e8550a5dc39a..2fe5ac56169c 100644
> --- a/tests/ovn-controller.at
> +++ b/tests/ovn-controller.at
> @@ -392,6 +392,39 @@ OVN_CLEANUP([hv])
>  AT_CLEANUP
>  ])
>
> +# check that nb_cfg overflow cases handled properly
> +AT_SETUP([ovn-controller - overflow the nb_cfg value across the tables])
> +AT_KEYWORDS([ovn])
> +ovn_start
> +
> +net_add n1
> +sim_add hv
> +as hv
> +ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.1
> +
> +OVS_WAIT_UNTIL([test xhv = x`ovn-sbctl --columns name --bare find chassis`])
> +
> +# overflow the NB_Global nb_cfg value
> +nb_global_id=$(ovn-nbctl --columns _uuid --bare find nb_global)
> +ovn-nbctl set NB_Global ${nb_global_id} nb_cfg=9223372036854775806
> +
> +# nb_cfg must be set to zero if it exceed the value of LLONG_MAX
> +# the command below will try incress the value of nb_cfg to be greater than 
> LLONG_MAX and
> +# expect zero as a return value
> +OVS_WAIT_UNTIL([test x0 = x`ovn-nbctl --wait=hv sync && ovn-nbctl --wait=hv 
> sync && echo $?`])
> +
> +# nb_cfg should be set to 1 in the chassis_private/nb_global/sb_global table
> +OVS_WAIT_UNTIL([test x1 = x`ovn-sbctl --columns nb_cfg --bare find 
> chassis_private`])
> +OVS_WAIT_UNTIL([test x1 = x`ovn-sbctl --columns nb_cfg --bare find 
> sb_global`])
> +OVS_WAIT_UNTIL([test x1 = x`ovn-nbctl --columns nb_cfg --bare find 
> nb_global`])
> +
> +# Assert that the the nb_cfg from the Chassis table was not incremented
> +OVS_WAIT_UNTIL([test x0 = x`ovn-sbctl --columns nb_cfg --bare find chassis`])
> +
> +OVN_CLEANUP([hv])
> +AT_CLEANUP
> +
>  # Test unix command: debug/delay-nb-cfg-report
>  OVN_FOR_EACH_NORTHD([
>  AT_SETUP([ovn-controller - debug/delay-nb-cfg-report])
> diff --git a/utilities/ovn-nbctl.c b/utilities/ovn-nbctl.c
> index ada53b662d3c..69be775bc567 100644
> --- a/utilities/ovn-nbctl.c
> +++ b/utilities/ovn-nbctl.c
> @@ -830,6 +830,8 @@ static void
>  nbctl_pre_sync(struct ctl_context *base OVS_UNUSED)
>  {
>  force_wait = true;
> +/* Monitor nb_cfg to detect and handle potential overflows. */
> +

[ovs-dev] [PATCH v2 ovn] Suppress LOCAL_ONLY traffic for localnet ports

2021-08-24 Thread Ihar Hrachyshka
When a router port is attached to a localnet switch, sending periodic
RAs through the localnet port will confuse the upstream router by
leaking conflicting router advertisements into the datacenter network.

This patch blocks all LOCAL_ONLY marked traffic from leaving through a
localnet port.  In addition to RAs, this also covers BFD periodic
messages and IPv6 prefix delegation.

Signed-off-by: Ihar Hrachyshka 
---
v1: initial version
v2: adjusted new flow priority (153 -> 150)
v2: updated ovn-architecture.5
---
 controller/physical.c  |  11 +++
 ovn-architecture.7.xml |   3 +-
 tests/ovn.at   | 160 +
 3 files changed, 173 insertions(+), 1 deletion(-)

diff --git a/controller/physical.c b/controller/physical.c
index 2b8189246..6f2c1cea0 100644
--- a/controller/physical.c
+++ b/controller/physical.c
@@ -1252,6 +1252,17 @@ consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
 binding->header_.uuid.parts[0], &match,
 ofpacts_p, &binding->header_.uuid);
 
+/* Drop LOCAL_ONLY traffic leaking through localnet ports. */
+match_init_catchall(&match);
+ofpbuf_clear(ofpacts_p);
+match_set_metadata(&match, htonll(dp_key));
+match_set_reg(&match, MFF_LOG_OUTPORT - MFF_REG0, port_key);
+match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0,
+ MLF_LOCAL_ONLY, MLF_LOCAL_ONLY);
+ofctrl_add_flow(flow_table, OFTABLE_CHECK_LOOPBACK, 160,
+binding->header_.uuid.parts[0], &match,
+ofpacts_p, &binding->header_.uuid);
+
 /* localport traffic directed to external is *not* local */
 struct shash_node *node;
 SHASH_FOR_EACH (node, &ld->external_ports) {
diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
index 2d2d74b06..3d2910358 100644
--- a/ovn-architecture.7.xml
+++ b/ovn-architecture.7.xml
@@ -1519,7 +1519,8 @@
   
 Table 34 matches and drops packets for which the logical input and
 output ports are the same and the MLF_ALLOW_LOOPBACK flag is not
-set.  It resubmits other packets to table 40.
+set. It also drops MLF_LOCAL_ONLY packets directed to a localnet port.
+It resubmits other packets to table 40.
   
 
 
diff --git a/tests/ovn.at b/tests/ovn.at
index 9329dd828..6c5e1dfff 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -13675,6 +13675,166 @@ OVN_CLEANUP([hv1],[hv2])
 AT_CLEANUP
 ])
 
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([IPv6 periodic RA disabled for localnet adjacent switch ports])
+ovn_start
+
+net_add n1
+sim_add hv1
+sim_add hv2
+as hv1
+check ovs-vsctl add-br br-phys
+check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
+ovn_attach n1 br-phys 192.168.0.2
+as hv2
+check ovs-vsctl add-br br-phys
+check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
+ovn_attach n1 br-phys 192.168.0.3
+
+check ovn-nbctl lr-add ro
+check ovn-nbctl lrp-add ro ro-sw 00:00:00:00:00:01 aef0:0:0:0:0:0:0:1/64
+
+check ovn-nbctl ls-add sw
+check ovn-nbctl lsp-add sw ln
+check ovn-nbctl lsp-set-addresses ln unknown
+check ovn-nbctl lsp-set-type ln localnet
+check ovn-nbctl lsp-set-options ln network_name=phys
+
+check ovn-nbctl lsp-add sw sw-ro
+check ovn-nbctl lsp-set-type sw-ro router
+check ovn-nbctl lsp-set-options sw-ro router-port=ro-sw
+check ovn-nbctl lsp-set-addresses sw-ro 00:00:00:00:00:01
+check ovn-nbctl lsp-add sw sw-p1
+check ovn-nbctl lsp-set-addresses sw-p1 "00:00:00:00:00:02 aef0::200:ff:fe00:2"
+check ovn-nbctl lsp-add sw sw-p2
+check ovn-nbctl lsp-set-addresses sw-p2 "00:00:00:00:00:03 aef0::200:ff:fe00:3"
+
+check ovn-nbctl set Logical_Router_Port ro-sw ipv6_ra_configs:send_periodic=true
+check ovn-nbctl set Logical_Router_Port ro-sw ipv6_ra_configs:address_mode=slaac
+check ovn-nbctl set Logical_Router_Port ro-sw ipv6_ra_configs:max_interval=1
+check ovn-nbctl set Logical_Router_Port ro-sw ipv6_ra_configs:min_interval=1
+
+for i in 1 2 ; do
+as hv$i
+check ovs-vsctl -- add-port br-int hv$i-vif1 -- \
+set interface hv$i-vif1 external-ids:iface-id=sw-p$i \
+options:tx_pcap=hv$i/vif1-tx.pcap \
+options:rxq_pcap=hv$i/vif1-rx.pcap \
+ofport-request=1
+done
+
+OVS_WAIT_UNTIL([test x`ovn-nbctl lsp-get-up sw-p1` = xup])
+OVS_WAIT_UNTIL([test x`ovn-nbctl lsp-get-up sw-p2` = xup])
+
+reset_pcap_file() {
+local iface=$1
+local pcap_file=$2
+ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap \
+options:rxq_pcap=dummy-rx.pcap
+rm -f ${pcap_file}*.pcap
+ovs-vsctl -- set Interface $iface options:tx_pcap=${pcap_file}-tx.pcap \
+options:rxq_pcap=${pcap_file}-rx.pcap
+
+}
+
+construct_expected_ra() {
+local src_mac=0001
+local dst_mac=0001
+local src_addr=fe8002fffe01
+local dst_addr=ff020001
+
+local mtu=$1
+local ra_mo=$2
+local 

Re: [ovs-dev] [PATCH ovn] Suppress periodic RAs for switches attached to localnet

2021-08-24 Thread Ihar Hrachyshka
On Tue, Aug 17, 2021 at 2:20 PM Numan Siddique  wrote:
>
> On Fri, Aug 13, 2021 at 10:33 AM Frode Nordahl
>  wrote:
> >
> > On Sat, Aug 7, 2021 at 12:20 AM Ihar Hrachyshka  wrote:
> > >
> > > When a router port is attached to a localnet switch, sending periodic
> > > RAs through localnet port will confuse upstream router by leaking
> > > conflicting router advertisements into datacenter network.
> >
> > Do I understand you correctly that you want to suppress RA's on an
> > entire LS if it has one localnet port? If so, I'm a bit concerned
> > about this approach as it is a perfectly valid configuration to attach
> > an instance directly to a LS with a localnet port.
> >
> > In such a scenario the instance would no longer receive periodic RAs.
> > While the instance would still be able to solicit its IPv6 prefix,
> > router and DNS servers on startup, it would not receive any
> > information should the router address, prefix or DNS server
> > information change somewhere down the line.
> >
> > Could we filter this in some other way that does not affect the entire
> > switch? Or could it be managing and suppressing the RAs is the
> > responsibility of the physical DC switch/router administrator?
>
>
> Agree with Frode.  VIFs associated with logical switches with localnet port
> should be served periodic RAs if configured.
>
> I think the right fix would be to make sure that ovn-controller doesn't inject
> the periodic RAs to the patch ports connecting to the provider bridges.
>
> If I recall, ovn-controller would inject the periodic RAs only to the local 
> VIFs
> which need the periodic RAs.  So ideally it should not leak out of the br-int.
>

You are right we shouldn't suppress other RAs. I am sending a new
patch in a sec.

To the point of the mechanism to suppress RAs on localnets: AFAIU the
packet is injected into the LS peer port that is plugged into the
router.  From the controller() action perspective, we don't know where
the RA will propagate.

There are two options here:
a) block specifically RAs using a corresponding drop rule (dst=ff02::1,
icmp6, type=134...)
b) block all LOCAL_ONLY marked traffic from leaving through a localnet
port

The latter will also block prefix delegation and BFD messages, which I
believe we should block anyway.

I will send (b) in a second but let me know if the more fine-grained (a)
option is preferable (perhaps we should have both?)

Cheers,
Ihar


> Thanks
> Numan
>
>
> >
> > --
> > Frode Nordahl
> >
> > > Signed-off-by: Ihar Hrachyshka 
> > > ---
> > >  northd/ovn-northd.c |   5 +-
> > >  tests/ovn.at| 156 
> > >  2 files changed, 160 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > > index a0eaa1247..6cd686d12 100644
> > > --- a/northd/ovn-northd.c
> > > +++ b/northd/ovn-northd.c
> > > @@ -10207,7 +10207,10 @@ build_ND_RA_flows_for_lrouter_port(
> > >
> > >  if (smap_get_bool(>nbrp->ipv6_ra_configs, "send_periodic",
> > >false)) {
> > > -copy_ra_to_sb(op, address_mode);
> > > +/* Don't leak RAs into datacenter networks. */
> > > +if (!op->peer->od->n_localnet_ports) {
> > > +copy_ra_to_sb(op, address_mode);
> > > +}
> > >  }
> > >
> > >  ds_clear(match);
> > > diff --git a/tests/ovn.at b/tests/ovn.at
> > > index 7ae136ad9..22c5ed07c 100644
> > > --- a/tests/ovn.at
> > > +++ b/tests/ovn.at
> > > @@ -13674,6 +13674,162 @@ OVN_CLEANUP([hv1],[hv2])
> > >  AT_CLEANUP
> > >  ])
> > >
> > > +OVN_FOR_EACH_NORTHD([
> > > +AT_SETUP([IPv6 periodic RA disabled for localnet adjacent switch ports])
> > > +ovn_start
> > > +
> > > +net_add n1
> > > +sim_add hv1
> > > +sim_add hv2
> > > +as hv1
> > > +check ovs-vsctl add-br br-phys
> > > +check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
> > > +ovn_attach n1 br-phys 192.168.0.2
> > > +as hv2
> > > +check ovs-vsctl add-br br-phys
> > > +check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
> > > +ovn_attach n1 br-phys 192.168.0.3
> > > +
> > > +check ovn-nbctl lr-add ro
> > > +check ovn-nbctl lrp-add ro ro-sw 00:00:00:00:00:01 aef0:0:0:0:0:0:0:1/64
> > > +
> > > +check ovn-nbctl ls-add sw
> > > +check ovn-nbctl lsp-add sw ln
> > > +check ovn-nbctl lsp-set-addresses ln unknown
> > > +check ovn-nbctl lsp-set-type ln localnet
> > > +check ovn-nbctl lsp-set-options ln network_name=phys
> > > +
> > > +check ovn-nbctl lsp-add sw sw-ro
> > > +check ovn-nbctl lsp-set-type sw-ro router
> > > +check ovn-nbctl lsp-set-options sw-ro router-port=ro-sw
> > > +check ovn-nbctl lsp-set-addresses sw-ro 00:00:00:00:00:01
> > > +check ovn-nbctl lsp-add sw sw-p1
> > > +check ovn-nbctl lsp-set-addresses sw-p1 "00:00:00:00:00:02 
> > > aef0::200:ff:fe00:2"
> > > +check ovn-nbctl lsp-add sw sw-p2
> > > +check ovn-nbctl lsp-set-addresses sw-p2 "00:00:00:00:00:03 
> > > aef0::200:ff:fe00:3"
> > > +
> > > +check ovn-nbctl set Logical_Router_Port ro-sw 
> > > 

[ovs-dev] [PATCH ovn] Suppress LOCAL_ONLY traffic for localnet ports

2021-08-24 Thread Ihar Hrachyshka
When a router port is attached to a localnet switch, sending periodic
RAs through the localnet port will confuse the upstream router by
leaking conflicting router advertisements into the datacenter network.

This patch blocks all LOCAL_ONLY marked traffic from leaving through a
localnet port.  In addition to RAs, this also covers BFD periodic
messages and IPv6 prefix delegation.

Signed-off-by: Ihar Hrachyshka 
---
 controller/physical.c |  11 +++
 tests/ovn.at  | 160 ++
 2 files changed, 171 insertions(+)

diff --git a/controller/physical.c b/controller/physical.c
index 2b8189246..cae1bdad4 100644
--- a/controller/physical.c
+++ b/controller/physical.c
@@ -1252,6 +1252,17 @@ consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
 binding->header_.uuid.parts[0], &match,
 ofpacts_p, &binding->header_.uuid);
 
+/* Drop LOCAL_ONLY traffic leaking through localnet ports. */
+match_init_catchall(&match);
+ofpbuf_clear(ofpacts_p);
+match_set_metadata(&match, htonll(dp_key));
+match_set_reg(&match, MFF_LOG_OUTPORT - MFF_REG0, port_key);
+match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0,
+ MLF_LOCAL_ONLY, MLF_LOCAL_ONLY);
+ofctrl_add_flow(flow_table, OFTABLE_CHECK_LOOPBACK, 153,
+binding->header_.uuid.parts[0], &match,
+ofpacts_p, &binding->header_.uuid);
+
 /* localport traffic directed to external is *not* local */
 struct shash_node *node;
 SHASH_FOR_EACH (node, &ld->external_ports) {
diff --git a/tests/ovn.at b/tests/ovn.at
index 9329dd828..6c5e1dfff 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -13675,6 +13675,166 @@ OVN_CLEANUP([hv1],[hv2])
 AT_CLEANUP
 ])
 
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([IPv6 periodic RA disabled for localnet adjacent switch ports])
+ovn_start
+
+net_add n1
+sim_add hv1
+sim_add hv2
+as hv1
+check ovs-vsctl add-br br-phys
+check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
+ovn_attach n1 br-phys 192.168.0.2
+as hv2
+check ovs-vsctl add-br br-phys
+check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
+ovn_attach n1 br-phys 192.168.0.3
+
+check ovn-nbctl lr-add ro
+check ovn-nbctl lrp-add ro ro-sw 00:00:00:00:00:01 aef0:0:0:0:0:0:0:1/64
+
+check ovn-nbctl ls-add sw
+check ovn-nbctl lsp-add sw ln
+check ovn-nbctl lsp-set-addresses ln unknown
+check ovn-nbctl lsp-set-type ln localnet
+check ovn-nbctl lsp-set-options ln network_name=phys
+
+check ovn-nbctl lsp-add sw sw-ro
+check ovn-nbctl lsp-set-type sw-ro router
+check ovn-nbctl lsp-set-options sw-ro router-port=ro-sw
+check ovn-nbctl lsp-set-addresses sw-ro 00:00:00:00:00:01
+check ovn-nbctl lsp-add sw sw-p1
+check ovn-nbctl lsp-set-addresses sw-p1 "00:00:00:00:00:02 aef0::200:ff:fe00:2"
+check ovn-nbctl lsp-add sw sw-p2
+check ovn-nbctl lsp-set-addresses sw-p2 "00:00:00:00:00:03 aef0::200:ff:fe00:3"
+
+check ovn-nbctl set Logical_Router_Port ro-sw ipv6_ra_configs:send_periodic=true
+check ovn-nbctl set Logical_Router_Port ro-sw ipv6_ra_configs:address_mode=slaac
+check ovn-nbctl set Logical_Router_Port ro-sw ipv6_ra_configs:max_interval=1
+check ovn-nbctl set Logical_Router_Port ro-sw ipv6_ra_configs:min_interval=1
+
+for i in 1 2 ; do
+as hv$i
+check ovs-vsctl -- add-port br-int hv$i-vif1 -- \
+set interface hv$i-vif1 external-ids:iface-id=sw-p$i \
+options:tx_pcap=hv$i/vif1-tx.pcap \
+options:rxq_pcap=hv$i/vif1-rx.pcap \
+ofport-request=1
+done
+
+OVS_WAIT_UNTIL([test x`ovn-nbctl lsp-get-up sw-p1` = xup])
+OVS_WAIT_UNTIL([test x`ovn-nbctl lsp-get-up sw-p2` = xup])
+
+reset_pcap_file() {
+local iface=$1
+local pcap_file=$2
+ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap \
+options:rxq_pcap=dummy-rx.pcap
+rm -f ${pcap_file}*.pcap
+ovs-vsctl -- set Interface $iface options:tx_pcap=${pcap_file}-tx.pcap \
+options:rxq_pcap=${pcap_file}-rx.pcap
+
+}
+
+construct_expected_ra() {
+local src_mac=0001
+local dst_mac=0001
+local src_addr=fe8002fffe01
+local dst_addr=ff020001
+
+local mtu=$1
+local ra_mo=$2
+local rdnss=$3
+local dnssl=$4
+local route_info=$5
+local ra_prefix_la=$6
+
+local slla=0101${src_mac}
+local mtu_opt=""
+if test $mtu != 0; then
+mtu_opt=0501${mtu}
+fi
+shift 6
+
+local prefix=""
+while [[ $# -gt 0 ]] ; do
+local size=$1
+local net=$2
+prefix=${prefix}0304${size}${ra_prefix_la}${net}
+shift 2
+done
+
+local rdnss_opt=""
+if test $rdnss != 0; then
+rdnss_opt=1903${rdnss}
+fi
+local dnssl_opt=""
+if test $dnssl != 0; then
+dnssl_opt=1f03${dnssl}
+fi
+local route_info_opt=""
+if test 

Re: [ovs-dev] [PATCH] json: Optimize string serialization.

2021-08-24 Thread 0-day Robot
Bleep bloop.  Greetings Ilya Maximets, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
ERROR: Inappropriate bracing around statement
#154 FILE: tests/test-json.c:212:
for (int j = 0; j < iter; j++ ) {

ERROR: Inappropriate bracing around statement
#160 FILE: tests/test-json.c:218:
for (int j = 0; j < iter; j++ ) {

Lines checked: 172, Warnings: 0, Errors: 2


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Ilya Maximets
On 8/24/21 10:07 PM, Anton Ivanov wrote:
> On 23/08/2021 22:36, Ilya Maximets wrote:
>> On 8/23/21 10:37 PM, Anton Ivanov wrote:
>>> On 23/08/2021 21:26, Ilya Maximets wrote:
 On 8/23/21 10:20 PM, Anton Ivanov wrote:
> Should not be the case.
>
> The map is pre-sized for the size from the previous iterations.
>
> Line 12861 in my tree which is probably a few commits out of date:
>
>   fast_hmap_size_for(&lflows, max_seen_lflow_size);
>
> And immediately after building the lflows:
>
>   if (hmap_count(&lflows) > max_seen_lflow_size) {
>   max_seen_lflow_size = hmap_count(&lflows);
>   }
>
> So the map SHOULD be sized correctly - to the most recent seen lflow 
> count.
 Please, re-read the commit message.  It's a first run of the loop
 and the 'max_seen_lflow_size' is default 128 at this point.
>>> Ack,
>>>
>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing 
>>> it.
>>>
>>>  From that perspective the patch is a straight +1 from me.
>>>
>>>  From the perspective of the use case stated in the commit message- I am 
>>> not sure it addresses it.
>>>
>>> If northd is in single-threaded mode and is tackling a GIGANTIC
>>> database, it may never complete the first iteration before the
>>> expiration of the timers and everyone deciding that northd is AWOL.
>> Well, how do you suggest to fix that?  Obviously, we can always create
>> a database that northd will never be able to process in a reasonable
>> amount of time.  And it doesn't matter if it's single- or multi-threaded.
> 
> I will send the additional "baseline" fixes for parallel - resizing and 
> initial sizing tomorrow, they are fairly trivial and have tested out OK.
> 
> However, they do not solve the fact that the overall "heatmap" with dp_groups 
> have moved. A lot of processing once again happens out of the "parallel" 
> portion and in the single threaded part.
> 
> Unless I am mistaken, this can be improved.
> 
> Namely, at present, after computing lflows with dp_groups they are walked in 
> full, single dp flows separated into a hmap and reprocessed. That is 
> suboptimal for parallel (and possibly suboptimal for single threaded).
> 
> Unless I am mistaken, when dp_groups are enabled, all lflows can be initially 
> inserted into a separate "single datapath" hmap. If the dp_group for an lflow 
> grows to more than one, the flow is then moved to the main lflow hmap. This 
> way, the computation will generate both types of flows straight away 
> (optionally in parallel) and there will be no need to do a full single 
> threaded walk of lflows after they have been generated.
> 
> One question (so I can get some idea on which optimizations are worth it and 
> which aren't). What is the percentage and overall numbers of single datapath 
> lflows?

From the DB that I have I extracted the following information:

Total lflows generated : 9.916.227
Ended up in SbDB: 540.795 (462.196 has no dp_group, 78.599 has a dp_group)
On disk size of this DB with dp groups enabled is 270 MB.

So, the lflows hashmap contains ~540K flows, 460K of them single
flows.  That is still 20 times less than the number of lflows that
northd generated, so the performance improvement from parallelizing
this part might not be significant if dp-groups are enabled.  If they
are disabled, it will be very hard for both northd and SbDB to handle
a database of this size even from the memory consumption standpoint.
The database will take around 5 GB on disk; in memory, as a parsed
json object, it will be huge.  I'd not advise running a setup of that
size without dp groups.  The node will run out of memory very fast.

> 
> Brgds,
> 
> A.
> 
>>
>> In this case NbDB is only 9MB in size, which is very reasonable, and
>> northd on my laptop takes more than 15 minutes to process it (I killed
>> it at this point).  With the patch applied it took only 11 seconds.
>> So, for me, this patch pretty much fixes the issue.  11 seconds is not
>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
>> It would be great to reduce, but we're not there yet.
>>
>>> In that case, if it is multi-threaded from the start, it should probably
>>> grab the sizing from the lflow table hash in south db. That would be a
>>> good approximation for the first run.
>> This will not work for a case where SbDB is empty for any reason while
>> NbDB is not.  And there is still a case where northd initially connects
>> to semi-empty databases and after few iterations NbDB receives a big
>> transaction and generates a big update for northd.
>>
>>> A.
>>>
> A.
>
> On 23/08/2021 21:02, Ilya Maximets wrote:
>> 'lflow_map' is never expanded because northd always uses fast
>> insertion.  This leads to the case where we have a hash map
>> with only 128 initial buckets and every ovn_lflow_find() ends
>> up iterating over n_lflows / 128 entries.  It's thousands of
>> logical flows or even more.  For example, it 

Re: [ovs-dev] [PATCH ovn 0/2] allow to configure routes with no nexthop

2021-08-24 Thread Numan Siddique
On Fri, Aug 6, 2021 at 8:59 AM Lorenzo Bianconi
 wrote:
>
> Allow ovn-nbctl to add a valid route without a nexthop. E.g:
>
> ovn-nbctl lr-route-add lr0 192.168.0.0/24 lr0-sw
>
> Lorenzo Bianconi (2):
>   nbctl: validate outport in nbctl_lr_route_add
>   northd: allow to configure routes with no nexthop

Thanks for the patches.  I applied both the patches.   I added an
entry in the NEWS item before
merging patch 2.

Numan

>
>  northd/lrouter.dl | 60 +++
>  northd/ovn-northd.c   |  7 +++--
>  northd/ovn_northd.dl  |  5 
>  tests/ovn-nbctl.at| 11 
>  tests/ovn-northd.at   | 21 --
>  utilities/ovn-nbctl.c | 65 ---
>  6 files changed, 147 insertions(+), 22 deletions(-)
>
> --
> 2.31.1
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn] ovn-northd: Hash pipeline as a digit.

2021-08-24 Thread Ilya Maximets
Currently the pipeline is hashed as a string ("egress" or "ingress"),
which seems wasteful.  Hash it as a digit (P_IN or P_OUT) instead to
eliminate one call to hash_string() for every generated logical flow.

In my testing this saves 1.5 - 3 % of the processing time.

Signed-off-by: Ilya Maximets 
---
 lib/ovn-util.c  | 14 ++
 lib/ovn-util.h  |  8 +++-
 northd/ovn-northd.c |  8 +---
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/ovn-util.c b/lib/ovn-util.c
index 3805923c8..683ca37d9 100644
--- a/lib/ovn-util.c
+++ b/lib/ovn-util.c
@@ -523,11 +523,18 @@ ovn_is_known_nb_lsp_type(const char *type)
 return false;
 }
 
+static enum ovn_pipeline
+ovn_pipeline_from_name(const char *pipeline)
+{
+return pipeline[0] == 'i' ? P_IN : P_OUT;
+}
+
 uint32_t
 sbrec_logical_flow_hash(const struct sbrec_logical_flow *lf)
 {
 const struct sbrec_datapath_binding *ld = lf->logical_datapath;
-uint32_t hash = ovn_logical_flow_hash(lf->table_id, lf->pipeline,
+uint32_t hash = ovn_logical_flow_hash(lf->table_id,
+  ovn_pipeline_from_name(lf->pipeline),
   lf->priority, lf->match,
   lf->actions);
 
@@ -535,12 +542,11 @@ sbrec_logical_flow_hash(const struct sbrec_logical_flow 
*lf)
 }
 
 uint32_t
-ovn_logical_flow_hash(uint8_t table_id, const char *pipeline,
+ovn_logical_flow_hash(uint8_t table_id, enum ovn_pipeline pipeline,
   uint16_t priority,
   const char *match, const char *actions)
 {
-size_t hash = hash_2words((table_id << 16) | priority, 0);
-hash = hash_string(pipeline, hash);
+size_t hash = hash_2words((table_id << 16) | priority, pipeline);
 hash = hash_string(match, hash);
 return hash_string(actions, hash);
 }
diff --git a/lib/ovn-util.h b/lib/ovn-util.h
index 9935cad34..b0bc70a16 100644
--- a/lib/ovn-util.h
+++ b/lib/ovn-util.h
@@ -102,8 +102,14 @@ const char *db_table_usage(struct ds *tables,
 
 bool ovn_is_known_nb_lsp_type(const char *type);
 
+/* The two pipelines in an OVN logical flow table. */
+enum ovn_pipeline {
+P_IN,   /* Ingress pipeline. */
+P_OUT   /* Egress pipeline. */
+};
+
 uint32_t sbrec_logical_flow_hash(const struct sbrec_logical_flow *);
-uint32_t ovn_logical_flow_hash(uint8_t table_id, const char *pipeline,
+uint32_t ovn_logical_flow_hash(uint8_t table_id, enum ovn_pipeline pipeline,
uint16_t priority,
const char *match, const char *actions);
 uint32_t ovn_logical_flow_hash_datapath(const struct uuid *logical_datapath,
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index a9a3987b8..b22cf5388 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -120,12 +120,6 @@ static const char *ssl_ca_cert_file;
 
 /* Pipeline stages. */
 
-/* The two pipelines in an OVN logical flow table. */
-enum ovn_pipeline {
-P_IN,   /* Ingress pipeline. */
-P_OUT   /* Egress pipeline. */
-};
-
 /* The two purposes for which ovn-northd uses OVN logical datapaths. */
 enum ovn_datapath_type {
 DP_SWITCH,  /* OVN logical switch. */
@@ -4412,7 +4406,7 @@ ovn_lflow_add_at(struct hmap *lflow_map, struct 
ovn_datapath *od,
 uint32_t hash;
 
 hash = ovn_logical_flow_hash(ovn_stage_get_table(stage),
- ovn_stage_get_pipeline_name(stage),
+ ovn_stage_get_pipeline(stage),
  priority, match,
  actions);
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] json: Optimize string serialization.

2021-08-24 Thread Ilya Maximets
The current string serialization code puts all characters one by one.
This is slow because the dynamic string needs to perform length checks
on every ds_put_char(), and it also doesn't allow the compiler to use
better memory copy operations, i.e. doesn't allow copying a few bytes
at once.

Special symbols are rare in a typical database.  Quotes are frequent,
but not too frequent.  In databases created by ovn-kubernetes, for
example, there are usually at least 10 to 50 chars between quotes.
So, it's better to count the characters that don't require escaping
and use a fast data copy for the whole sequential block.
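
The same idea in a tiny standalone form (an illustration of the
run-copying approach only, not the OVS code, which is in the diff
below; plain stdio is used instead of the dynamic string API):

    #include <stdio.h>

    /* Append runs of characters that need no escaping in one bulk write
     * instead of one write per character. */
    static void
    serialize_json_string(const char *s, FILE *out)
    {
        const char *start = s;
        size_t run = 0;

        fputc('"', out);
        for (unsigned char c; (c = (unsigned char) *s) != '\0'; s++) {
            if (c >= ' ' && c != '"' && c != '\\') {
                run++;                          /* Extend the cheap run. */
            } else {
                fwrite(start, 1, run, out);     /* Flush the pending run. */
                if (c == '"' || c == '\\') {
                    fprintf(out, "\\%c", c);
                } else {
                    fprintf(out, "\\u%04x", c); /* Control character. */
                }
                start = s + 1;
                run = 0;
            }
        }
        fwrite(start, 1, run, out);             /* Trailing run, if any. */
        fputc('"', out);
    }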

Testing with a synthetic benchmark (included) on my laptop shows
following performance improvement:

   Size  Q  S   Before   After   Diff
 -
 10  0  0 :0.227 ms 0.142 ms   -37.4 %
 10  2  1 :0.277 ms 0.186 ms   -32.8 %
 10  10 1 :0.361 ms 0.309 ms   -14.4 %
 10000  0 :   22.720 ms12.160 ms   -46.4 %
 10002  1 :   27.470 ms19.300 ms   -29.7 %
 100010 1 :   37.950 ms31.250 ms   -17.6 %
 1   0  0 :  239.600 ms   126.700 ms   -47.1 %
 1   2  1 :  292.400 ms   188.600 ms   -35.4 %
 1   10 1 :  387.700 ms   321.200 ms   -17.1 %

Here Q - probability (%) for a character to be a '\"' and
S - probability (%) to be a special character ( < 32).

Testing with a closer-to-real-world scenario shows an overall decrease
of the time needed for database compaction by ~5-10 %.  This change
also decreases CPU consumption in general, because string
serialization is used in many different places including ovsdb
monitors and raft.

Signed-off-by: Ilya Maximets 
---
 lib/json.c| 23 +---
 tests/test-json.c | 68 +++
 2 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/lib/json.c b/lib/json.c
index 32d25003b..b2c845171 100644
--- a/lib/json.c
+++ b/lib/json.c
@@ -1696,14 +1696,31 @@ json_serialize_string(const char *string, struct ds *ds)
 {
 uint8_t c;
 uint8_t c2;
+int count;
 const char *escape;
+const char *start;
 
 ds_put_char(ds, '"');
+count = 0;
+start = string;
 while ((c = *string++) != '\0') {
-escape = chars_escaping[c];
-while ((c2 = *escape++) != '\0') {
-ds_put_char(ds, c2);
+if (c >= ' ' && c != '\"' && c != '\\') {
+count++;
+continue;
+} else {
+if (count) {
+ds_put_buffer(ds, start, count);
+count = 0;
+}
+start = string;
+escape = chars_escaping[c];
+while ((c2 = *escape++) != '\0') {
+ds_put_char(ds, c2);
+}
 }
 }
+if (count) {
+ds_put_buffer(ds, start, count);
+}
 ds_put_char(ds, '"');
 }
diff --git a/tests/test-json.c b/tests/test-json.c
index a7ee595e0..60be82bc8 100644
--- a/tests/test-json.c
+++ b/tests/test-json.c
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include "ovstest.h"
+#include "random.h"
+#include "timeval.h"
 #include "util.h"
 
 /* --pretty: If set, the JSON output is pretty-printed, instead of printed as
@@ -157,3 +159,69 @@ test_json_main(int argc, char *argv[])
 }
 
 OVSTEST_REGISTER("test-json", test_json_main);
+
+static void
+json_string_benchmark_main(int argc OVS_UNUSED, char *argv[] OVS_UNUSED)
+{
+struct {
+int n;
+int quote_probablility;
+int special_probability;
+int iter;
+} configs[] = {
+{ 10, 0, 0, 1000, },
+{ 10, 2, 1, 1000, },
+{ 10,10, 1, 1000, },
+{ 1000,   0, 0, 100,  },
+{ 1000,   2, 1, 100,  },
+{ 1000,  10, 1, 100,  },
+{ 1,  0, 0, 10.   },
+{ 1,  2, 1, 10,   },
+{ 1, 10, 1, 10,   },
+};
+
+printf("  SIZE  Q  STIME\n");
+printf("--\n");
+
+for (int i = 0; i < ARRAY_SIZE(configs); i++) {
+int iter = configs[i].iter;
+int n = configs[i].n;
+char *str = xzalloc(n);
+
+for (int j = 0; j < n - 1; j++) {
+int r = random_range(100);
+
+if (r < configs[i].special_probability) {
+str[j] = random_range(' ' - 1) + 1;
+} else if (r < (configs[i].special_probability
++ configs[i].quote_probablility)) {
+str[j] = '\"';
+} else {
+str[j] = random_range(256 - ' ') + ' ';
+}
+}
+
+printf("%-11d %-2d %-2d: ", n, configs[i].quote_probablility,
+   configs[i].special_probability);
+fflush(stdout);
+
+struct json *json = json_string_create_nocopy(str);
+uint64_t start = time_msec();
+
+char **res = xzalloc(iter * 

[ovs-dev] [PATCH] sha1: Use implementation from openssl if available.

2021-08-24 Thread Ilya Maximets
The implementation of SHA1 in the OpenSSL library is much faster and
optimized for all available CPU architectures and instruction sets.
OVS should use it instead of the internal implementation if possible.

Depending on compiler options, OpenSSL's version finishes our sha1
unit tests 3 to 12 times faster.  The performance of OpenSSL's version
is constant, but OVS's implementation highly depends on the compiler.
Interestingly, the default build with '-g -O2' works faster than an
optimized '-march=native -Ofast' build.

Tests with ovsdb-server on big databases show a ~5-10% improvement in
the time needed for database compaction (sha1 is only a part of this
operation), depending on compiler options.

We still need the internal implementation, because OpenSSL may not be
available on some platforms.  Tests are enhanced to check both
versions of the API.

Signed-off-by: Ilya Maximets 
---
 lib/sha1.c| 143 ++
 lib/sha1.h|  27 +++--
 tests/library.at  |   2 +-
 tests/test-sha1.c |  60 +--
 4 files changed, 172 insertions(+), 60 deletions(-)

diff --git a/lib/sha1.c b/lib/sha1.c
index 87360d9cd..d90780a6d 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -31,12 +31,114 @@
 
 #include 
 #include "sha1.h"
+
+#ifdef HAVE_OPENSSL
+#include <openssl/sha.h>
+#endif
+
 #include 
 #include 
 #include "compiler.h"
+#include "openvswitch/vlog.h"
 #include "util.h"
 
-/* a bit faster & bigger, if defined */
+VLOG_DEFINE_THIS_MODULE(sha1);
+
+/*
+ * Initialize the SHA digest.
+ * context: The SHA context to initialize
+ */
+void
+sha1_init(struct sha1_ctx *sha_info)
+{
+#ifdef HAVE_OPENSSL
+if (!SHA1_Init(&sha_info->ctx)) {
+VLOG_FATAL("SHA1_Init failed.");
+}
+#else
+ovs_sha1_init(sha_info);
+#endif
+}
+
+/*
+ * Update the SHA digest.
+ * context: The SHA1 context to update.
+ * input: The buffer to add to the SHA digest.
+ * inputLen: The length of the input buffer.
+ */
+void
+sha1_update(struct sha1_ctx *ctx, const void *buffer_, uint32_t count)
+{
+#ifdef HAVE_OPENSSL
+if (!SHA1_Update(&ctx->ctx, buffer_, count)) {
+VLOG_FATAL("SHA1_Update failed.");
+}
+#else
+ovs_sha1_update(ctx, buffer_, count);
+#endif
+}
+
+/*
+ * Finish computing the SHA digest.
+ * digest: the output buffer in which to store the digest.
+ * context: The context to finalize.
+ */
+void
+sha1_final(struct sha1_ctx *ctx, uint8_t digest[SHA1_DIGEST_SIZE])
+{
+#ifdef HAVE_OPENSSL
+if (!SHA1_Final(digest, &ctx->ctx)) {
+VLOG_FATAL("SHA1_Final failed.");
+}
+#else
+ovs_sha1_final(ctx, digest);
+#endif
+}
+
+/* Computes the hash of 'n' bytes in 'data' into 'digest'. */
+void
+sha1_bytes(const void *data, uint32_t n, uint8_t digest[SHA1_DIGEST_SIZE])
+{
+#ifdef HAVE_OPENSSL
+SHA1(data, n, digest);
+#else
+ovs_sha1_bytes(data, n, digest);
+#endif
+}
+
+void
+sha1_to_hex(const uint8_t digest[SHA1_DIGEST_SIZE],
+char hex[SHA1_HEX_DIGEST_LEN + 1])
+{
+int i;
+
+for (i = 0; i < SHA1_DIGEST_SIZE; i++) {
+*hex++ = "0123456789abcdef"[digest[i] >> 4];
+*hex++ = "0123456789abcdef"[digest[i] & 15];
+}
+*hex = '\0';
+}
+
+bool
+sha1_from_hex(uint8_t digest[SHA1_DIGEST_SIZE], const char *hex)
+{
+int i;
+
+for (i = 0; i < SHA1_DIGEST_SIZE; i++) {
+bool ok;
+
+digest[i] = hexits_value(hex, 2, &ok);
+if (!ok) {
+return false;
+}
+hex += 2;
+}
+return true;
+}
+
+/* Generic implementation for a case where OpenSSL is not available. */
+
+/* A bit faster & bigger, if defined */
 #define UNROLL_LOOPS
 
 /* SHA f()-functions */
@@ -178,7 +280,7 @@ maybe_byte_reverse(uint32_t *buffer OVS_UNUSED, int count 
OVS_UNUSED)
  * context: The SHA context to initialize
  */
 void
-sha1_init(struct sha1_ctx *sha_info)
+ovs_sha1_init(struct sha1_ctx *sha_info)
 {
 sha_info->digest[0] = 0x67452301L;
 sha_info->digest[1] = 0xefcdab89L;
@@ -197,7 +299,7 @@ sha1_init(struct sha1_ctx *sha_info)
  * inputLen: The length of the input buffer.
  */
 void
-sha1_update(struct sha1_ctx *ctx, const void *buffer_, uint32_t count)
+ovs_sha1_update(struct sha1_ctx *ctx, const void *buffer_, uint32_t count)
 {
 const uint8_t *buffer = buffer_;
 unsigned int i;
@@ -240,7 +342,7 @@ sha1_update(struct sha1_ctx *ctx, const void *buffer_, 
uint32_t count)
  * context: The context to finalize.
  */
 void
-sha1_final(struct sha1_ctx *ctx, uint8_t digest[SHA1_DIGEST_SIZE])
+ovs_sha1_final(struct sha1_ctx *ctx, uint8_t digest[SHA1_DIGEST_SIZE])
 {
 int count, i, j;
 uint32_t lo_bit_count, hi_bit_count, k;
@@ -274,7 +376,7 @@ sha1_final(struct sha1_ctx *ctx, uint8_t 
digest[SHA1_DIGEST_SIZE])
 
 /* Computes the hash of 'n' bytes in 'data' into 'digest'. */
 void
-sha1_bytes(const void *data, uint32_t n, uint8_t digest[SHA1_DIGEST_SIZE])
+ovs_sha1_bytes(const void *data, uint32_t n, uint8_t digest[SHA1_DIGEST_SIZE])
 {
 struct sha1_ctx ctx;
 
@@ -282,34 +384,3 @@ sha1_bytes(const void *data, uint32_t n, 

Re: [ovs-dev] [PATCH v1 2/2] ofp-monitor: Support flow monitoring for OpenFlow 1.3, 1.4+

2021-08-24 Thread Vasu Dasari
Great, thanks!

*Vasu Dasari*


On Tue, Aug 24, 2021 at 4:11 PM Ashish Varma 
wrote:

> I will take a look.
>
> Thanks,
> Ashish
>
> On Tue, Aug 24, 2021 at 9:29 AM Vasu Dasari  wrote:
>
>> Hi Ben/Ashish,
>>
>> When you get a chance, Can you please take a look at my code?
>>
>> -Vasu
>>
>> *Vasu Dasari*
>>
>>
>> On Thu, Jul 29, 2021 at 10:36 PM Vasu Dasari  wrote:
>>
>>> Extended OpenFlow monitoring support
>>> * OpenFlow 1.3 with ONF extensions
>>> * OpenFlow 1.4+ as defined in OpenFlow specification 1.4+.
>>>
>>> ONF extensions are similar to Nicira extensions except for
>>> onf_flow_monitor_request{}, where out_port is defined as a 32-bit
>>> OF(1.1) port number and oxm match formats are used in update and
>>> request messages.
>>>
>>> Flow monitoring support in 1.4+ is slightly different from Nicira and ONF
>>> extensions.
>>>  * More flow monitoring flags are defined.
>>>  * Monitor add/modify/delete command is introduced in flow_monitor
>>>request message.
>>>  * Addition of out_group as part of flow_monitor request message
>>>
>>> Description of changes:
>>> 1. Generate ofp-msgs.inc to be able to support 1.3, 1.4+ flow Monitoring
>>> messages.
>>> include/openvswitch/ofp-msgs.h
>>>
>>> 2. Modify openflow header files with protocol specific headers.
>>> include/openflow/openflow-1.3.h
>>> include/openflow/openflow-1.4.h
>>>
>>> 3. Modify OvS abstraction of openflow headers. ofp-monitor.h leverages
>>> enums
>>>from on nicira extensions for creating protocol abstraction headers.
>>> OF(1.4+)
>>>enums are superset of nicira extensions.
>>> include/openvswitch/ofp-monitor.h
>>>
>>> 4. Changes to these files reflect encoding and decoding of new protocol
>>> messages.
>>> lib/ofp-monitor.c
>>>
>>> 5. Changes to modules using ofp-monitor APIs. Most of the changes here
>>> are to
>>>migrate enums from nicira to OF 1.4+ versions.
>>> ofproto/connmgr.c
>>> ofproto/connmgr.h
>>> ofproto/ofproto-provider.h
>>> ofproto/ofproto.c
>>>
>>> 6. Extended protocol decoding tests to verify all protocol versions
>>> FLOW_MONITOR_CANCEL
>>> FLOW_MONITOR_PAUSED
>>> FLOW_MONITOR_RESUMED
>>> FLOW_MONITOR request
>>> FLOW_MONITOR reply
>>> tests/ofp-print.at
>>>
>>> 7. Modify flow monitoring tests to be executed by all protocol
>>> versions.
>>> tests/ofproto.at
>>>
>>> 8. Modified documentation highlighting the change
>>> utilities/ovs-ofctl.8.in
>>> NEWS
>>>
>>> Signed-off-by: Vasu Dasari 
>>> Reported-at:
>>> https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/383915.html
>>> ---
>>> v1:
>>>  - Fixed 0-day Robot errors
>>>
>>> ---
>>>  NEWS  |   6 +-
>>>  include/openflow/openflow-1.3.h   |  89 
>>>  include/openflow/openflow-1.4.h   |  93 +++-
>>>  include/openvswitch/ofp-monitor.h |   9 +-
>>>  include/openvswitch/ofp-msgs.h|  39 +-
>>>  lib/ofp-monitor.c | 844 --
>>>  lib/ofp-print.c   |  24 +-
>>>  ofproto/connmgr.c |  47 +-
>>>  ofproto/connmgr.h |   6 +-
>>>  ofproto/ofproto-provider.h|   4 +-
>>>  ofproto/ofproto.c |  89 +++-
>>>  tests/ofp-print.at| 122 -
>>>  tests/ofproto.at  | 176 +--
>>>  utilities/ovs-ofctl.8.in  |   3 +
>>>  utilities/ovs-ofctl.c |   6 +
>>>  15 files changed, 1265 insertions(+), 292 deletions(-)
>>>
>>> diff --git a/NEWS b/NEWS
>>> index 02884b774..47ad9de2a 100644
>>> --- a/NEWS
>>> +++ b/NEWS
>>> @@ -25,8 +25,10 @@ v2.16.0 - xx xxx 
>>> - In ovs-vsctl and vtep-ctl, the "find" command now accept new
>>>   operators {in} and {not-in}.
>>> - OpenFlow:
>>> - * Extend Flow Monitoring support for OpenFlow 1.0-1.2 with Nicira
>>> -   Extensions
>>> + * Extended Flow Monitoring support for all supported OpenFlow
>>> versions
>>> + OpenFlow versions 1.0-1.2 with Nicira Extensions
>>> + OpenFlow versions 1.3 with Open Network Foundation extension
>>> + OpenFlow versions 1.4+, as defined in the OpenFlow
>>> specification
>>> - Userspace datapath:
>>>   * Auto load balancing of PMDs now partially supports cross-NUMA
>>> polling
>>> cases, e.g if all PMD threads are running on the same NUMA node.
>>> diff --git a/include/openflow/openflow-1.3.h
>>> b/include/openflow/openflow-1.3.h
>>> index c48a8ea7f..1a818dbb4 100644
>>> --- a/include/openflow/openflow-1.3.h
>>> +++ b/include/openflow/openflow-1.3.h
>>> @@ -374,4 +374,93 @@ struct ofp13_async_config {
>>>  };
>>>  OFP_ASSERT(sizeof(struct ofp13_async_config) == 24);
>>>
>>> +struct onf_flow_monitor_request {
>>> +ovs_be32   id;/* Controller-assigned ID for this
>>> monitor. */
>>> +ovs_be16   flags; /* ONFFMF_*. */
>>> +ovs_be16   match_len; /* Length of oxm_fields. */
>>> +ovs_be32   out_port;  /* Required output 

Re: [ovs-dev] [PATCH v1 2/2] ofp-monitor: Support flow monitoring for OpenFlow 1.3, 1.4+

2021-08-24 Thread Ashish Varma
I will take a look.

Thanks,
Ashish

On Tue, Aug 24, 2021 at 9:29 AM Vasu Dasari  wrote:

> Hi Ben/Ashish,
>
> When you get a chance, Can you please take a look at my code?
>
> -Vasu
>
> *Vasu Dasari*
>
>
> On Thu, Jul 29, 2021 at 10:36 PM Vasu Dasari  wrote:
>
>> Extended OpenFlow monitoring support
>> * OpenFlow 1.3 with ONF extensions
>> * OpenFlow 1.4+ as defined in OpenFlow specification 1.4+.
>>
>> ONF extensions are similar to Nicira extensions except for
>> onf_flow_monitor_request{}, where out_port is defined as a 32-bit
>> OF(1.1) port number and oxm match formats are used in update and
>> request messages.
>>
>> Flow monitoring support in 1.4+ is slightly different from Nicira and ONF
>> extensions.
>>  * More flow monitoring flags are defined.
>>  * Monitor add/modify/delete command is introduced in flow_monitor
>>request message.
>>  * Addition of out_group as part of flow_monitor request message
>>
>> Description of changes:
>> 1. Generate ofp-msgs.inc to be able to support 1.3, 1.4+ flow Monitoring
>> messages.
>> include/openvswitch/ofp-msgs.h
>>
>> 2. Modify openflow header files with protocol specific headers.
>> include/openflow/openflow-1.3.h
>> include/openflow/openflow-1.4.h
>>
>> 3. Modify OvS abstraction of openflow headers. ofp-monitor.h leverages
>> enums
>>from on nicira extensions for creating protocol abstraction headers.
>> OF(1.4+)
>>enums are superset of nicira extensions.
>> include/openvswitch/ofp-monitor.h
>>
>> 4. Changes to these files reflect encoding and decoding of new protocol
>> messages.
>> lib/ofp-monitor.c
>>
>> 5. Changes to modules using ofp-monitor APIs. Most of the changes here
>> are to
>>migrate enums from nicira to OF 1.4+ versions.
>> ofproto/connmgr.c
>> ofproto/connmgr.h
>> ofproto/ofproto-provider.h
>> ofproto/ofproto.c
>>
>> 6. Extended protocol decoding tests to verify all protocol versions
>> FLOW_MONITOR_CANCEL
>> FLOW_MONITOR_PAUSED
>> FLOW_MONITOR_RESUMED
>> FLOW_MONITOR request
>> FLOW_MONITOR reply
>> tests/ofp-print.at
>>
>> 7. Modify flow monitoring tests to be executed by all protocol
>> versions.
>> tests/ofproto.at
>>
>> 8. Modified documentation highlighting the change
>> utilities/ovs-ofctl.8.in
>> NEWS
>>
>> Signed-off-by: Vasu Dasari 
>> Reported-at:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/383915.html
>> ---
>> v1:
>>  - Fixed 0-day Robot errors
>>
>> ---
>>  NEWS  |   6 +-
>>  include/openflow/openflow-1.3.h   |  89 
>>  include/openflow/openflow-1.4.h   |  93 +++-
>>  include/openvswitch/ofp-monitor.h |   9 +-
>>  include/openvswitch/ofp-msgs.h|  39 +-
>>  lib/ofp-monitor.c | 844 --
>>  lib/ofp-print.c   |  24 +-
>>  ofproto/connmgr.c |  47 +-
>>  ofproto/connmgr.h |   6 +-
>>  ofproto/ofproto-provider.h|   4 +-
>>  ofproto/ofproto.c |  89 +++-
>>  tests/ofp-print.at| 122 -
>>  tests/ofproto.at  | 176 +--
>>  utilities/ovs-ofctl.8.in  |   3 +
>>  utilities/ovs-ofctl.c |   6 +
>>  15 files changed, 1265 insertions(+), 292 deletions(-)
>>
>> diff --git a/NEWS b/NEWS
>> index 02884b774..47ad9de2a 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -25,8 +25,10 @@ v2.16.0 - xx xxx 
>> - In ovs-vsctl and vtep-ctl, the "find" command now accept new
>>   operators {in} and {not-in}.
>> - OpenFlow:
>> - * Extend Flow Monitoring support for OpenFlow 1.0-1.2 with Nicira
>> -   Extensions
>> + * Extended Flow Monitoring support for all supported OpenFlow
>> versions
>> + OpenFlow versions 1.0-1.2 with Nicira Extensions
>> + OpenFlow versions 1.3 with Open Network Foundation extension
>> + OpenFlow versions 1.4+, as defined in the OpenFlow specification
>> - Userspace datapath:
>>   * Auto load balancing of PMDs now partially supports cross-NUMA
>> polling
>> cases, e.g if all PMD threads are running on the same NUMA node.
>> diff --git a/include/openflow/openflow-1.3.h
>> b/include/openflow/openflow-1.3.h
>> index c48a8ea7f..1a818dbb4 100644
>> --- a/include/openflow/openflow-1.3.h
>> +++ b/include/openflow/openflow-1.3.h
>> @@ -374,4 +374,93 @@ struct ofp13_async_config {
>>  };
>>  OFP_ASSERT(sizeof(struct ofp13_async_config) == 24);
>>
>> +struct onf_flow_monitor_request {
>> +ovs_be32   id;/* Controller-assigned ID for this
>> monitor. */
>> +ovs_be16   flags; /* ONFFMF_*. */
>> +ovs_be16   match_len; /* Length of oxm_fields. */
>> +ovs_be32   out_port;  /* Required output port, if not OFPP_NONE.
>> */
>> +uint8_ttable_id;  /* One table’s ID or 0xff for all tables.
>> */
>> +uint8_tzeros[3];  /* Align to 64 bits (must be zero). */
>> +/* Followed by an ofp11_match 

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Anton Ivanov

On 23/08/2021 22:36, Ilya Maximets wrote:

On 8/23/21 10:37 PM, Anton Ivanov wrote:

On 23/08/2021 21:26, Ilya Maximets wrote:

On 8/23/21 10:20 PM, Anton Ivanov wrote:

Should not be the case.

The map is pre-sized for the size from the previous iterations.

Line 12861 in my tree which is probably a few commits out of date:

  fast_hmap_size_for(, max_seen_lflow_size);

And immediately after building the lflows:

  if (hmap_count() > max_seen_lflow_size) {
  max_seen_lflow_size = hmap_count();
  }

So the map SHOULD be sized correctly - to the most recent seen lflow count.

Please, re-read the commit message.  It's a first run of the loop
and the 'max_seen_lflow_size' is default 128 at this point.

Ack,

Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it.

 From that perspective the patch is a straight +1 from me.

 From the perspective of the use case stated in the commit message, I am not
sure it addresses it.

If northd is in single-threaded mode and is tackling a GIGANTIC database, it 
may never complete the first iteration before the
expiration of the timers and everyone deciding that northd is AWOL.

Well, how do you suggest to fix that?  Obviously, we can always create
a database that northd will never be able to process in a reasonable
amount of time.  And it doesn't matter if it's single- or multi-threaded.


I will send the additional "baseline" fixes for parallel - resizing and 
initial sizing tomorrow, they are fairly trivial and have tested out OK.


However, they do not solve the fact that the overall "heatmap" with 
dp_groups has moved. A lot of processing once again happens outside of the 
"parallel" portion and in the single-threaded part.


Unless I am mistaken, this can be improved.

Namely, at present, after computing lflows with dp_groups they are 
walked in full, single dp flows separated into a hmap and reprocessed. 
That is suboptimal for parallel (and possibly suboptimal for single 
threaded).


Unless I am mistaken, when dp_groups are enabled, all lflows can be 
initially inserted into a separate "single datapath" hmap. If the 
dp_group for an lflow grows to more than one, the flow is then moved to 
the main lflow hmap. This way, the computation will generate both types 
of flows straight away (optionally in parallel) and there will be no 
need to do a full single threaded walk of lflows after they have been 
generated.


One question (so I can get some idea on which optimizations are worth it 
and which aren't). What is the percentage and overall numbers of single 
datapath lflows?


Brgds,

A.



In this case NbDB is only 9MB in size, which is very reasonable, and
northd on my laptop takes more than 15 minutes to process it (I killed
it at this point).  With the patch applied it took only 11 seconds.
So, for me, this patch pretty much fixes the issue.  11 seconds is not
that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
It would be great to reduce, but we're not there yet.


In that case, if it is multi-threaded from the start, it should probably
grab the sizing from the lflow table hash in south db. That would be a
good approximation for the first run.

This will not work for a case where SbDB is empty for any reason while
NbDB is not.  And there is still a case where northd initially connects
to semi-empty databases and after few iterations NbDB receives a big
transaction and generates a big update for northd.


A.


A.

On 23/08/2021 21:02, Ilya Maximets wrote:

'lflow_map' is never expanded because northd always uses fast
insertion.  This leads to the case where we have a hash map
with only 128 initial buckets and every ovn_lflow_find() ends
up iterating over n_lflows / 128 entries.  It's thousands of
logical flows or even more.  For example, it takes forever for
ovn-northd to start up with the Northbound Db from the 120 node
density-heavy test from ovn-heater, because every lookup is slower
than previous one.  I aborted the process after 15 minutes of
waiting, because there was no sign that it will converge.  With
this change applied the loop completes in only 11 seconds.

Hash map will be pre-allocated to the maximum seen number of
logical flows on a second iteration, but this doesn't help for
the first iteration when northd first time connects to a big
Northbound database, which is a common case during failover or
cluster upgrade.  And there is an even trickier case where big
NbDB transaction that explodes the number of logical flows received
on not the first run.

We can't expand the hash map in case of parallel build, so this
should be fixed separately.

CC: Anton Ivanov 
Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build")
Signed-off-by: Ilya Maximets 
---
    northd/ovn-northd.c | 6 +-
    1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 3d8e21a4f..40cf957c0 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ 
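To make the trade-off concrete, here is a rough sketch of what the fix
amounts to (not the actual northd code; it only assumes the hmap_insert() /
hmap_insert_fast() pair from lib/openvswitch/hmap.h and an ovn_lflow with
an hmap_node member):

static void
lflow_hash_insert(struct hmap *lflow_map, struct ovn_lflow *lflow,
                  uint32_t hash, bool parallel)
{
    if (parallel) {
        /* Threads share 'lflow_map', so its bucket array must not be
         * reallocated under them: no automatic resize, the map has to
         * be pre-sized with fast_hmap_size_for(). */
        hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
    } else {
        /* Single-threaded: let the map grow as needed, so that later
         * ovn_lflow_find() calls do not scan n_lflows / n_buckets
         * entries per lookup. */
        hmap_insert(lflow_map, &lflow->hmap_node, hash);
    }
}

With the default 128 buckets and hundreds of thousands of logical flows,
the resizing branch is what keeps the first iteration from degrading into
the quadratic behaviour described in the commit message.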

Re: [ovs-dev] [PATCH 0/2] raft: Reduce memory consumption by freeing unnecessary json objects.

2021-08-24 Thread Ilya Maximets
On 8/20/21 6:51 PM, Ilya Maximets wrote:
> In short, database snapshot stored inside the raft module is
> a huge json object.  E.g. in ovn-heater scale tests with 270 MB
> on-disk Southbound DB, json object of a snapshot takes 1.6 GB of
> RAM out of total 3.8 GB of the RSS of an ovsdb-server process.
> 
> This change is targeted to free that json object as soon as it
> no longer needed and keep its serialized version instead.
> 
> Testing on a bit smaller cluster with 97 MB of on-disk database
> showed *58%* of the memory consumption decrease with this change.
> 
> More details on testing and implementation in commit messages.
> 
> Ilya Maximets (2):
>   json: Add support for partially serialized json objects.
>   raft: Don't keep full json objects in memory if no longer needed.
> 
>  include/openvswitch/json.h |   9 ++-
>  lib/json.c |  34 
>  ovsdb/ovsdb-tool.c |  10 ++--
>  ovsdb/raft-private.c   | 111 +
>  ovsdb/raft-private.h   |  12 +++-
>  ovsdb/raft.c   |  94 +--
>  ovsdb/raft.h   |   3 +-
>  ovsdb/storage.c|   4 +-
>  8 files changed, 211 insertions(+), 66 deletions(-)
> 

I sent a v2 with one more patch that allows to reduce CPU
usage of ovsdb-server by more than 50% using same technique:
  https://patchwork.ozlabs.org/project/openvswitch/list/?series=259477=*

Best regards, Ilya Maximets.


[ovs-dev] [PATCH v2 3/3] ovsdb: monitor: Store serialized json in a json cache.

2021-08-24 Thread Ilya Maximets
The same json from the json cache is typically sent to all the clients,
e.g., in case of an OVN deployment with ovn-monitor-all=true.

There could be hundreds or thousands of connected clients, and ovsdb
will serialize the same json object for each of them before sending.

Serialize it once before storing it into the json cache to speed up
processing.

This change saves a lot of CPU cycles and a bit of memory, since we
only need to store a string in memory and not the full json object.

Testing with ovn-heater on 120 nodes using the density-heavy scenario
shows a reduction of the total CPU time used by Southbound DB processes
from 256 minutes to 147 minutes.  The duration of unreasonably long
poll intervals also dropped dramatically from 7 to 2 seconds:

            Count   Min    Max    Median   Mean     95 percentile
   ----------------------------------------------------------------
   Before   1934    1012   7480   4302.5   4875.3   7034.3
   After    1909    1004   2730   1453.0   1532.5   2053.6

Signed-off-by: Ilya Maximets 
---
 ovsdb/monitor.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/ovsdb/monitor.c b/ovsdb/monitor.c
index 532dedcb6..ab814cf20 100644
--- a/ovsdb/monitor.c
+++ b/ovsdb/monitor.c
@@ -1231,6 +1231,15 @@ ovsdb_monitor_get_update(
 condition,
 ovsdb_monitor_compose_row_update2);
 if (!condition || !condition->conditional) {
+if (json) {
+struct json *json_serialized;
+
+/* Pre-serializing the object to avoid doing this
+ * for every client. */
+json_serialized = json_serialized_object_create(json);
+json_destroy(json);
+json = json_serialized;
+}
 ovsdb_monitor_json_cache_insert(dbmon, version, mcs,
 json);
 }
-- 
2.31.1



[ovs-dev] [PATCH v2 2/3] raft: Don't keep full json objects in memory if no longer needed.

2021-08-24 Thread Ilya Maximets
Raft log entries (and the raft database snapshot) contain json objects
of the data.  A follower receives append requests with data that gets
parsed and added to the raft log.  The leader receives execution requests,
parses data out of them and adds it to the log.  In both cases,
ovsdb-server later reads the log with ovsdb_storage_read(), constructs
a transaction and updates the database.  On followers these json objects
are, in the common case, never used again.  The leader may use them to
send append requests or snapshot installation requests to followers.
However, all these operations (except for ovsdb_storage_read()) just
serialize the json in order to send it over the network.

Json objects are significantly larger than their serialized string
representation.  For example, the snapshot of the database from one of
the ovn-heater scale tests takes 270 MB as a string, but 1.6 GB as
a json object from the total 3.8 GB consumed by ovsdb-server process.

ovsdb_storage_read() for a given raft entry happens only once in its
lifetime, so after this call we can serialize the json object, store
the string representation and free the actual json object that ovsdb
will never need again.  This can save a lot of memory and can also
save serialization time, because each raft entry for append requests
and snapshot installation requests is serialized only once instead of
every time such a request needs to be sent.

JSON_SERIALIZED_OBJECT can be used in order to seamlessly integrate
pre-serialized data into raft_header and similar json objects.

One major special case is creation of a database snapshot.
A snapshot installation request received over the network will be parsed
and read by ovsdb-server just like any other raft log entry.  However,
snapshots created locally with raft_store_snapshot() will never be
read back, because they reflect the current state of the database and
are hence already applied.  For this case we can free the json object
right after writing the snapshot to disk.

Tests performed with ovn-heater on 60 node density-light scenario,
where on-disk database goes up to 97 MB, shows average memory
consumption of ovsdb-server Southbound DB processes decreased by 58%
(from 602 MB to 256 MB per process) and peak memory consumption
decreased by 40% (from 1288 MB to 771 MB).

Test with 120 nodes on density-heavy scenario with 270 MB on-disk
database shows 1.5 GB memory consumption decrease as expected.
Also, total CPU time consumed by the Southbound DB process reduced
from 296 to 256 minutes.  Number of unreasonably long poll intervals
reduced from 2896 down to 1934.

Deserialization is also implemented just in case.  I didn't see this
function being invoked in practice.

Signed-off-by: Ilya Maximets 
---
 ovsdb/ovsdb-tool.c   |  10 ++--
 ovsdb/raft-private.c | 111 ++-
 ovsdb/raft-private.h |  12 -
 ovsdb/raft.c |  94 
 ovsdb/raft.h |   3 +-
 ovsdb/storage.c  |   4 +-
 6 files changed, 170 insertions(+), 64 deletions(-)

diff --git a/ovsdb/ovsdb-tool.c b/ovsdb/ovsdb-tool.c
index 05a0223e7..903b2ebc9 100644
--- a/ovsdb/ovsdb-tool.c
+++ b/ovsdb/ovsdb-tool.c
@@ -919,7 +919,7 @@ print_raft_header(const struct raft_header *h,
if (!uuid_is_zero(&h->snap.eid)) {
printf(" prev_eid: %04x\n", uuid_prefix(&h->snap.eid, 4));
 }
-print_data("prev_", h->snap.data, schemap, names);
+print_data("prev_", raft_entry_get_data(>snap), schemap, names);
 }
 }
 
@@ -973,11 +973,13 @@ raft_header_to_standalone_log(const struct raft_header *h,
   struct ovsdb_log *db_log_data)
 {
 if (h->snap_index) {
-if (!h->snap.data || json_array(h->snap.data)->n != 2) {
+const struct json *data = raft_entry_get_data(&h->snap);
+
+if (!data || json_array(data)->n != 2) {
 ovs_fatal(0, "Incorrect raft header data array length");
 }
 
-struct json_array *pa = json_array(h->snap.data);
+struct json_array *pa = json_array(data);
 struct json *schema_json = pa->elems[0];
 struct ovsdb_error *error = NULL;
 
@@ -1373,7 +1375,7 @@ do_check_cluster(struct ovs_cmdl_context *ctx)
 }
 struct raft_entry *e = >entries[log_idx];
 e->term = r->term;
-e->data = r->entry.data;
+raft_entry_set_data_nocopy(e, r->entry.data);
 e->eid = r->entry.eid;
 e->servers = r->entry.servers;
 break;
diff --git a/ovsdb/raft-private.c b/ovsdb/raft-private.c
index 26d39a087..33c20e71b 100644
--- a/ovsdb/raft-private.c
+++ b/ovsdb/raft-private.c
@@ -18,11 +18,16 @@
 
 #include "raft-private.h"
 
+#include "coverage.h"
 #include "openvswitch/dynamic-string.h"
 #include "ovsdb-error.h"
 #include "ovsdb-parser.h"
 #include "socket-util.h"
 #include "sset.h"
+
+COVERAGE_DEFINE(raft_entry_deserialize);
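The raft_entry_get_data()/raft_entry_set_data_nocopy() accessors used above
are introduced by this patch; the hunks with their definitions are cut off
in this archive.  The general shape of the idea, independent of the actual
struct raft_entry layout, is roughly the following sketch (the names here
are illustrative, not the patch's):

/* Illustrative only: keep the wire/disk form as a string and drop the
 * parsed json once it is no longer needed. */
struct entry_data {
    struct json *json;   /* Parsed form; freed once consumed. */
    char *serialized;    /* json_to_string() result, reused afterwards. */
};

static const char *
entry_data_as_string(struct entry_data *e)
{
    if (!e->serialized) {
        e->serialized = json_to_string(e->json, JSSF_SORT);
        json_destroy(e->json);   /* Parsed form is not needed anymore. */
        e->json = NULL;
    }
    /* Reused for every append request or snapshot installation request. */
    return e->serialized;
}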

[ovs-dev] [PATCH v2 1/3] json: Add support for partially serialized json objects.

2021-08-24 Thread Ilya Maximets
This patch introduces a new json type, JSON_SERIALIZED_OBJECT.  It's not
an actual type that can be seen in a json message on a wire, but an
internal type that is intended to hold a serialized version of some
other json object.  For this reason it's defined after JSON_N_TYPES, to
not confuse parsers and other parts of the code that rely on compliance
with RFC 4627.

With this JSON type internal users may construct large JSON objects,
parts of which are already serialized.  This way, while serializing
the larger object, data from JSON_SERIALIZED_OBJECT can be added
directly to the result, without additional processing.

This will be used by the next commits to add pre-serialized JSON data
to the raft_header structure, which can be converted to JSON before
writing the file transaction to disk or sending it to other servers.
The same technique can also be used to pre-serialize the json_cache
for ovsdb monitors; this should make it possible to avoid serializing
for every client and will save some more memory.

Since serialized JSON is just a string, the 'json->string' pointer is
reused for it.

Signed-off-by: Ilya Maximets 
---
 include/openvswitch/json.h |  9 +++--
 lib/json.c | 34 ++
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/openvswitch/json.h b/include/openvswitch/json.h
index 73b562e03..b8cf49ab6 100644
--- a/include/openvswitch/json.h
+++ b/include/openvswitch/json.h
@@ -50,7 +50,9 @@ enum json_type {
 JSON_INTEGER,   /* 123. */
 JSON_REAL,  /* 123.456. */
 JSON_STRING,/* "..." */
-JSON_N_TYPES
+JSON_N_TYPES,
+JSON_SERIALIZED_OBJECT, /* Internal type to hold serialized version of
+ * data of other types. */
 };
 
 const char *json_type_to_string(enum json_type);
@@ -70,7 +72,7 @@ struct json {
 struct json_array array;
 long long int integer;
 double real;
-char *string;
+char *string;  /* JSON_STRING or JSON_SERIALIZED_OBJECT.  */
 };
 };
 
@@ -78,6 +80,7 @@ struct json *json_null_create(void);
 struct json *json_boolean_create(bool);
 struct json *json_string_create(const char *);
 struct json *json_string_create_nocopy(char *);
+struct json *json_serialized_object_create(const struct json *);
 struct json *json_integer_create(long long int);
 struct json *json_real_create(double);
 
@@ -99,6 +102,7 @@ void json_object_put_format(struct json *,
 OVS_PRINTF_FORMAT(3, 4);
 
 const char *json_string(const struct json *);
+const char *json_serialized_object(const struct json *);
 struct json_array *json_array(const struct json *);
 struct shash *json_object(const struct json *);
 bool json_boolean(const struct json *);
@@ -125,6 +129,7 @@ struct json *json_parser_finish(struct json_parser *);
 void json_parser_abort(struct json_parser *);
 
 struct json *json_from_string(const char *string);
+struct json *json_from_serialized_object(const struct json *);
 struct json *json_from_file(const char *file_name);
 struct json *json_from_stream(FILE *stream);
 
diff --git a/lib/json.c b/lib/json.c
index 32d25003b..6bb620724 100644
--- a/lib/json.c
+++ b/lib/json.c
@@ -146,6 +146,7 @@ json_type_to_string(enum json_type type)
 case JSON_STRING:
 return "string";
 
+case JSON_SERIALIZED_OBJECT:
 case JSON_N_TYPES:
 default:
 return "";
@@ -180,6 +181,14 @@ json_string_create(const char *s)
 return json_string_create_nocopy(xstrdup(s));
 }
 
+struct json *
+json_serialized_object_create(const struct json *src)
+{
+struct json *json = json_create(JSON_SERIALIZED_OBJECT);
+json->string = json_to_string(src, JSSF_SORT);
+return json;
+}
+
 struct json *
 json_array_create_empty(void)
 {
@@ -309,6 +318,13 @@ json_string(const struct json *json)
 return json->string;
 }
 
+const char *
+json_serialized_object(const struct json *json)
+{
+ovs_assert(json->type == JSON_SERIALIZED_OBJECT);
+return json->string;
+}
+
 struct json_array *
 json_array(const struct json *json)
 {
@@ -362,6 +378,7 @@ json_destroy(struct json *json)
 break;
 
 case JSON_STRING:
+case JSON_SERIALIZED_OBJECT:
 free(json->string);
 break;
 
@@ -422,6 +439,9 @@ json_deep_clone(const struct json *json)
 case JSON_STRING:
 return json_string_create(json->string);
 
+case JSON_SERIALIZED_OBJECT:
+return json_serialized_object_create(json);
+
 case JSON_NULL:
 case JSON_FALSE:
 case JSON_TRUE:
@@ -521,6 +541,7 @@ json_hash(const struct json *json, size_t basis)
return json_hash_array(&json->array, basis);
 
 case JSON_STRING:
+case JSON_SERIALIZED_OBJECT:
 return hash_string(json->string, basis);
 
 case JSON_NULL:
@@ -596,6 +617,7 @@ json_equal(const struct json *a, const struct json *b)
return json_equal_array(&a->array, &b->array);
 
 case JSON_STRING:
+case 
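The rest of the diff is cut off above; a usage sketch of the new API,
built only from the functions added in this patch plus the existing
json_object_create(), json_object_put_string(), json_object_put(),
json_to_string() and json_destroy() helpers (error handling omitted):

static void
example(void)
{
    /* Build a json object once... */
    struct json *data = json_object_create();
    json_object_put_string(data, "name", "sw0");

    /* ...serialize it a single time, keeping only the string form... */
    struct json *cached = json_serialized_object_create(data);
    json_destroy(data);

    /* ...then embed it into a larger message.  As the commit message
     * explains, serializing 'msg' appends the stored string directly
     * instead of walking the original object again. */
    struct json *msg = json_object_create();
    json_object_put(msg, "row", cached);    /* 'msg' takes ownership. */
    char *wire = json_to_string(msg, 0);
    /* ... send 'wire' ... */
    free(wire);
    json_destroy(msg);
}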

[ovs-dev] [PATCH v2 0/3] ovsdb: Reduce memory and CPU consumption by serializing json objects.

2021-08-24 Thread Ilya Maximets
In short, database snapshot stored inside the raft module is
a huge json object.  E.g. in ovn-heater scale tests with 270 MB
on-disk Southbound DB, json object of a snapshot takes 1.6 GB of
RAM out of total 3.8 GB of the RSS of an ovsdb-server process.

The second patch of the set is targeted at freeing that json object as
soon as it is no longer needed and keeping its serialized version instead.
The first patch provides the required json infrastructure.

Testing on a somewhat smaller cluster with a 97 MB on-disk database
showed a *58%* decrease in memory consumption with this change.
Testing with the 270 MB database showed an RSS decrease of 1.5 GB.

The third patch re-uses the same json infrastructure to significantly
reduce CPU time by pre-serializing the json cache.

With all changes applied, CPU time is reduced by 50% and memory
consumption by 1.5 GB on the 120 node density-heavy test with ovn-heater.
The duration of long poll intervals on the Southbound database dropped
from 7.5 to 2 seconds.

More details on testing and implementation in commit messages.

Version 2:
 - Added patch #3.
 - Updated test data in commit messages.

Ilya Maximets (3):
  json: Add support for partially serialized json objects.
  raft: Don't keep full json objects in memory if no longer needed.
  ovsdb: monitor: Store serialized json in a json cache.

 include/openvswitch/json.h |   9 ++-
 lib/json.c |  34 
 ovsdb/monitor.c|   9 +++
 ovsdb/ovsdb-tool.c |  10 ++--
 ovsdb/raft-private.c   | 111 +
 ovsdb/raft-private.h   |  12 +++-
 ovsdb/raft.c   |  94 +--
 ovsdb/raft.h   |   3 +-
 ovsdb/storage.c|   4 +-
 9 files changed, 220 insertions(+), 66 deletions(-)

-- 
2.31.1



Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Numan Siddique
On Tue, Aug 24, 2021 at 2:20 PM Anton Ivanov
 wrote:
>
> On 24/08/2021 19:16, Ilya Maximets wrote:
> > On 8/24/21 7:48 PM, Numan Siddique wrote:
> >> On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov
> >>  wrote:
> >>>
> >>> On 24/08/2021 17:35, Ilya Maximets wrote:
>  On 8/24/21 6:25 PM, Numan Siddique wrote:
> > On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov
> >  wrote:
> >> On 24/08/2021 12:46, Ilya Maximets wrote:
> >>> On 8/24/21 1:18 PM, Anton Ivanov wrote:
>  On 24/08/2021 12:05, Ilya Maximets wrote:
> > On 8/24/21 7:36 AM, Anton Ivanov wrote:
> >> On 23/08/2021 22:36, Ilya Maximets wrote:
> >>> On 8/23/21 10:37 PM, Anton Ivanov wrote:
>  On 23/08/2021 21:26, Ilya Maximets wrote:
> > On 8/23/21 10:20 PM, Anton Ivanov wrote:
> >> Should not be the case.
> >>
> >> The map is pre-sized for the size from the previous iterations.
> >>
> >> Line 12861 in my tree which is probably a few commits out of 
> >> date:
> >>
> >>   fast_hmap_size_for(, max_seen_lflow_size);
> >>
> >> And immediately after building the lflows:
> >>
> >>   if (hmap_count() > max_seen_lflow_size) {
> >>   max_seen_lflow_size = hmap_count();
> >>   }
> >>
> >> So the map SHOULD be sized correctly - to the most recent seen 
> >> lflow count.
> > Please, re-read the commit message.  It's a first run of the 
> > loop
> > and the 'max_seen_lflow_size' is default 128 at this point.
>  Ack,
> 
>  Not using auto-resizing in single threaded mode is a bug. Thanks 
>  for fixing it.
> 
>   From that perspective the patch is a straight +1 from me.
> 
>   From the perspective of the use case stated in the commit 
>  message- I am not sure it addresses it.
> 
>  If northd is in single-threaded mode and is tackling a GIGANTIC> 
>  database, it may never complete the first iteration before the
>  expiration of the timers and everyone deciding that northd is 
>  AWOL.
> >>> Well, how do you suggest to fix that?  Obviously, we can always 
> >>> create
> >>> a database that northd will never be able to process in a 
> >>> reasonable
> >>> amount of time.  And it doesn't matter if it's single- or 
> >>> multi-threaded.
> >>>
> >>> In this case NbDB is only 9MB in size, which is very reasonable, 
> >>> and
> >>> northd on my laptop takes more than 15 minutes to process it (I 
> >>> killed
> >>> it at this point).  With the patch applied it took only 11 
> >>> seconds.
> >>> So, for me, this patch pretty much fixes the issue.  11 seconds 
> >>> is not
> >>> that bad, e.g. ovn-k8s configures inactivity probes for clients 
> >>> to 180.
> >>> It would be great to reduce, but we're not there yet.
> >>>
>  In that case, if it is multi-threaded from the start, it should 
>  probably
>  grab the sizing from the lflow table hash in south db. That 
>  would be a
>  good approximation for the first run.
> >>> This will not work for a case where SbDB is empty for any reason 
> >>> while
> >>> NbDB is not.  And there is still a case where northd initially 
> >>> connects
> >>> to semi-empty databases and after few iterations NbDB receives a 
> >>> big
> >>> transaction and generates a big update for northd.
> >> A partial fix is to resize to optimum size the hash after lflow 
> >> processing.
> > I'm not sure I understand what you mean here, because resizing after
> > lflow processing will not help.  The lflow processing itself is the
> > very slow part that we're trying to make faster here.
>  That can be the case only with dpgroups. Otherwise lflows are just a 
>  destination to dump data and the bucket sizing is irrelevant because 
>  there is never any lookup inside lflows during processing. The lflow 
>  map is just used to store data. So if it is suboptimal at the exit 
>  of build_lflows() the resize will fix it before the first lookup 
>  shortly thereafter.
> 
>  Are you running it with dpgroups enabled? In that case there are 
>  lookups inside lflows during build which happen under a per-bucket 
>  lock. So in addition to suboptimal size when searching the 
>  contention depends on the number of buckets. If they are too few, 
>  the system 

[ovs-dev] [PATCH ovn] ic: remove port_binding on ts deletion

2021-08-24 Thread Vladislav Odintsov
When an IC port_binding exists and the transit switch is deleted,
an orphan port_binding is left in the IC_SB_DB.

This patch fixes this situation and adds a test for this case.

Signed-off-by: Vladislav Odintsov 
---
 ic/ovn-ic.c | 35 +++--
 tests/ovn-ic.at | 52 +
 2 files changed, 85 insertions(+), 2 deletions(-)

diff --git a/ic/ovn-ic.c b/ic/ovn-ic.c
index 75e798cd1..e80cd34ca 100644
--- a/ic/ovn-ic.c
+++ b/ic/ovn-ic.c
@@ -66,6 +66,7 @@ struct ic_context {
 struct ovsdb_idl_index *nbrec_port_by_name;
 struct ovsdb_idl_index *sbrec_chassis_by_name;
 struct ovsdb_idl_index *sbrec_port_binding_by_name;
+struct ovsdb_idl_index *icsbrec_port_binding_by_az;
 struct ovsdb_idl_index *icsbrec_port_binding_by_ts;
 struct ovsdb_idl_index *icsbrec_route_by_ts;
 };
@@ -174,7 +175,7 @@ allocate_ts_dp_key(struct hmap *dp_tnlids)
 }
 
 static void
-ts_run(struct ic_context *ctx)
+ts_run(struct ic_context *ctx, const struct icsbrec_availability_zone *az)
 {
 const struct icnbrec_transit_switch *ts;
 
@@ -239,8 +240,25 @@ ts_run(struct ic_context *ctx)
  * SB, to avoid uncommitted ISB datapath tunnel key to be synced back to
  * AZ. */
 if (ctx->ovnisb_txn) {
+struct shash isb_pbs = SHASH_INITIALIZER(&isb_pbs);
+const struct icsbrec_port_binding *isb_pb;
+const struct icsbrec_port_binding *isb_pb_key =
+icsbrec_port_binding_index_init_row(
+ctx->icsbrec_port_binding_by_az);
+
+icsbrec_port_binding_index_set_availability_zone(isb_pb_key, az);
+
+ICSBREC_PORT_BINDING_FOR_EACH (isb_pb, ctx->ovnisb_idl) {
+shash_add(&isb_pbs, isb_pb->transit_switch, isb_pb);
+}
+
 /* Create ISB Datapath_Binding */
 ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
+while (shash_find_and_delete(&isb_pbs, ts->name)) {
+/* There may be multiple Port_Bindings */
+continue;
+}
+
isb_dp = shash_find_and_delete(&isb_dps, ts->name);
 if (!isb_dp) {
 /* Allocate tunnel key */
@@ -260,6 +278,13 @@ ts_run(struct ic_context *ctx)
SHASH_FOR_EACH (node, &isb_dps) {
 icsbrec_datapath_binding_delete(node->data);
 }
+
+SHASH_FOR_EACH (node, &isb_pbs) {
+icsbrec_port_binding_delete(node->data);
+}
+
+icsbrec_port_binding_index_destroy_row(isb_pb_key);
+shash_destroy(&isb_pbs);
 }
ovn_destroy_tnlids(&dp_tnlids);
shash_destroy(&isb_dps);
@@ -1493,7 +1518,7 @@ ovn_db_run(struct ic_context *ctx)
 return;
 }
 
-ts_run(ctx);
+ts_run(ctx, az);
 gateway_run(ctx, az);
 port_binding_run(ctx, az);
 route_run(ctx, az);
@@ -1684,6 +1709,11 @@ main(int argc, char *argv[])
 struct ovsdb_idl_index *sbrec_chassis_by_name
 = ovsdb_idl_index_create1(ovnsb_idl_loop.idl,
  &sbrec_chassis_col_name);
+
+struct ovsdb_idl_index *icsbrec_port_binding_by_az
+= ovsdb_idl_index_create1(ovnisb_idl_loop.idl,
+  &icsbrec_port_binding_col_availability_zone);
+
 struct ovsdb_idl_index *icsbrec_port_binding_by_ts
 = ovsdb_idl_index_create1(ovnisb_idl_loop.idl,
  &icsbrec_port_binding_col_transit_switch);
@@ -1731,6 +1761,7 @@ main(int argc, char *argv[])
 .nbrec_port_by_name = nbrec_port_by_name,
 .sbrec_port_binding_by_name = sbrec_port_binding_by_name,
 .sbrec_chassis_by_name = sbrec_chassis_by_name,
+.icsbrec_port_binding_by_az = icsbrec_port_binding_by_az,
 .icsbrec_port_binding_by_ts = icsbrec_port_binding_by_ts,
 .icsbrec_route_by_ts = icsbrec_route_by_ts,
 };
diff --git a/tests/ovn-ic.at b/tests/ovn-ic.at
index ee78f4794..b6a8edb68 100644
--- a/tests/ovn-ic.at
+++ b/tests/ovn-ic.at
@@ -64,6 +64,58 @@ OVN_CLEANUP_IC([az1])
 AT_CLEANUP
 ])
 
+
+AT_SETUP([ovn-ic -- port bindings])
+
+ovn_init_ic_db
+net_add n1
+
+# 1 GW per AZ
+for i in 1 2; do
+az=az$i
+ovn_start $az
+sim_add gw-$az
+as gw-$az
+check ovs-vsctl add-br br-phys
+ovn_az_attach $az n1 br-phys 192.168.1.$i
+check ovs-vsctl set open . external-ids:ovn-is-interconn=true
+done
+
+ovn_as az1
+
+# create transit switch and connect to LR
+check ovn-ic-nbctl ts-add ts1
+check ovn-nbctl lr-add lr1
+check ovn-nbctl lrp-add lr1 lrp1 00:00:00:00:00:01 10.0.0.1/24
+check ovn-nbctl lrp-set-gateway-chassis lrp1 gw-az1
+
+check ovn-nbctl lsp-add ts1 lsp1 -- \
+lsp-set-addresses lsp1 router -- \
+lsp-set-type lsp1 router -- \
+lsp-set-options lsp1 router-port=lrp1
+
+OVS_WAIT_UNTIL([ovn-sbctl list datapath_binding | grep interconn-ts | grep 
ts1])
+
+# check port binding appeared
+AT_CHECK([ovn-ic-sbctl show | grep -A2 lsp1], [0], [dnl
+port lsp1
+transit switch: ts1
+

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Anton Ivanov

On 24/08/2021 19:16, Ilya Maximets wrote:

On 8/24/21 7:48 PM, Numan Siddique wrote:

On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov
 wrote:


On 24/08/2021 17:35, Ilya Maximets wrote:

On 8/24/21 6:25 PM, Numan Siddique wrote:

On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov
 wrote:

On 24/08/2021 12:46, Ilya Maximets wrote:

On 8/24/21 1:18 PM, Anton Ivanov wrote:

On 24/08/2021 12:05, Ilya Maximets wrote:

On 8/24/21 7:36 AM, Anton Ivanov wrote:

On 23/08/2021 22:36, Ilya Maximets wrote:

On 8/23/21 10:37 PM, Anton Ivanov wrote:

On 23/08/2021 21:26, Ilya Maximets wrote:

On 8/23/21 10:20 PM, Anton Ivanov wrote:

Should not be the case.

The map is pre-sized for the size from the previous iterations.

Line 12861 in my tree which is probably a few commits out of date:

  fast_hmap_size_for(, max_seen_lflow_size);

And immediately after building the lflows:

  if (hmap_count() > max_seen_lflow_size) {
  max_seen_lflow_size = hmap_count();
  }

So the map SHOULD be sized correctly - to the most recent seen lflow count.

Please, re-read the commit message.  It's a first run of the loop
and the 'max_seen_lflow_size' is default 128 at this point.

Ack,

Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it.

 From that perspective the patch is a straight +1 from me.

 From the perspective of the use case stated in the commit message- I am 
not sure it addresses it.

If northd is in single-threaded mode and is tackling a GIGANTIC database, it 
may never complete the first iteration before the
expiration of the timers and everyone deciding that northd is AWOL.

Well, how do you suggest to fix that?  Obviously, we can always create
a database that northd will never be able to process in a reasonable
amount of time.  And it doesn't matter if it's single- or multi-threaded.

In this case NbDB is only 9MB in size, which is very reasonable, and
northd on my laptop takes more than 15 minutes to process it (I killed
it at this point).  With the patch applied it took only 11 seconds.
So, for me, this patch pretty much fixes the issue.  11 seconds is not
that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
It would be great to reduce, but we're not there yet.


In that case, if it is multi-threaded from the start, it should probably
grab the sizing from the lflow table hash in south db. That would be a
good approximation for the first run.

This will not work for a case where SbDB is empty for any reason while
NbDB is not.  And there is still a case where northd initially connects
to semi-empty databases and after few iterations NbDB receives a big
transaction and generates a big update for northd.

A partial fix is to resize to optimum size the hash after lflow processing.

I'm not sure I understand what you mean here, because resizing after
lflow processing will not help.  The lflow processing itself is the
very slow part that we're trying to make faster here.

That can be the case only with dpgroups. Otherwise lflows are just a 
destination to dump data and the bucket sizing is irrelevant because there is 
never any lookup inside lflows during processing. The lflow map is just used to 
store data. So if it is suboptimal at the exit of build_lflows() the resize 
will fix it before the first lookup shortly thereafter.

Are you running it with dpgroups enabled? In that case there are lookups inside 
lflows during build which happen under a per-bucket lock. So in addition to 
suboptimal size when searching the contention depends on the number of buckets. 
If they are too few, the system becomes heavily contended resulting in 
ridiculous computation sizes.

Oh, I see.  Indeed, without dp-groups there is no lookup during
lflow build.  I missed that.  So, yes, I agree that for a case
without dp-groups, re-sizing after lflow processing should work.
We need that for parallel case.

Current patch (use hmap_insert() that resizes if needed) helps
for all non-parallel cases.

Indeed. It should go in.


Why can't we have hmap_insert() for both parallel and non parallel configs
to start with and switch over to hmap_insert_fast() when ovn-northd
has successfully connected to SB DB and has approximated on the
more accurate hmap size ?

We can't use hmap_insert() for parallel, because resize of a hash map
will crash threads that are working on it at the moment, IIUC.

We actually can for non-dp-groups case, but the merge will become much slower. 
Example - the last version of the snapshot and monitor parallelization for OVS. 
It no longer uses pre-sized fixed hmaps because it is nearly impossible to 
presize an RFC7047 structure correctly.

For the dp-groups case we can't. Locking of the lflows which are used for 
lookups is per hash bucket. You cannot change the number of buckets in the 
middle of the run and other locking mechanisms will be more coarse. So the 
marginal benefit with dp-groups at present will become none (or even slower).


We could 

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Ilya Maximets
On 8/24/21 7:48 PM, Numan Siddique wrote:
> On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov
>  wrote:
>>
>>
>> On 24/08/2021 17:35, Ilya Maximets wrote:
>>> On 8/24/21 6:25 PM, Numan Siddique wrote:
 On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov
  wrote:
>
> On 24/08/2021 12:46, Ilya Maximets wrote:
>> On 8/24/21 1:18 PM, Anton Ivanov wrote:
>>> On 24/08/2021 12:05, Ilya Maximets wrote:
 On 8/24/21 7:36 AM, Anton Ivanov wrote:
> On 23/08/2021 22:36, Ilya Maximets wrote:
>> On 8/23/21 10:37 PM, Anton Ivanov wrote:
>>> On 23/08/2021 21:26, Ilya Maximets wrote:
 On 8/23/21 10:20 PM, Anton Ivanov wrote:
> Should not be the case.
>
> The map is pre-sized for the size from the previous iterations.
>
> Line 12861 in my tree which is probably a few commits out of date:
>
>  fast_hmap_size_for(, max_seen_lflow_size);
>
> And immediately after building the lflows:
>
>  if (hmap_count() > max_seen_lflow_size) {
>  max_seen_lflow_size = hmap_count();
>  }
>
> So the map SHOULD be sized correctly - to the most recent seen 
> lflow count.
 Please, re-read the commit message.  It's a first run of the loop
 and the 'max_seen_lflow_size' is default 128 at this point.
>>> Ack,
>>>
>>> Not using auto-resizing in single threaded mode is a bug. Thanks 
>>> for fixing it.
>>>
>>> From that perspective the patch is a straight +1 from me.
>>>
>>> From the perspective of the use case stated in the commit 
>>> message- I am not sure it addresses it.
>>>
>>> If northd is in single-threaded mode and is tackling a GIGANTIC> 
>>> database, it may never complete the first iteration before the
>>> expiration of the timers and everyone deciding that northd is AWOL.
>> Well, how do you suggest to fix that?  Obviously, we can always 
>> create
>> a database that northd will never be able to process in a reasonable
>> amount of time.  And it doesn't matter if it's single- or 
>> multi-threaded.
>>
>> In this case NbDB is only 9MB in size, which is very reasonable, and
>> northd on my laptop takes more than 15 minutes to process it (I 
>> killed
>> it at this point).  With the patch applied it took only 11 seconds.
>> So, for me, this patch pretty much fixes the issue.  11 seconds is 
>> not
>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 
>> 180.
>> It would be great to reduce, but we're not there yet.
>>
>>> In that case, if it is multi-threaded from the start, it should 
>>> probably
>>> grab the sizing from the lflow table hash in south db. That would 
>>> be a
>>> good approximation for the first run.
>> This will not work for a case where SbDB is empty for any reason 
>> while
>> NbDB is not.  And there is still a case where northd initially 
>> connects
>> to semi-empty databases and after few iterations NbDB receives a big
>> transaction and generates a big update for northd.
> A partial fix is to resize to optimum size the hash after lflow 
> processing.
 I'm not sure I understand what you mean here, because resizing after
 lflow processing will not help.  The lflow processing itself is the
 very slow part that we're trying to make faster here.
>>> That can be the case only with dpgroups. Otherwise lflows are just a 
>>> destination to dump data and the bucket sizing is irrelevant because 
>>> there is never any lookup inside lflows during processing. The lflow 
>>> map is just used to store data. So if it is suboptimal at the exit of 
>>> build_lflows() the resize will fix it before the first lookup shortly 
>>> thereafter.
>>>
>>> Are you running it with dpgroups enabled? In that case there are 
>>> lookups inside lflows during build which happen under a per-bucket 
>>> lock. So in addition to suboptimal size when searching the contention 
>>> depends on the number of buckets. If they are too few, the system 
>>> becomes heavily contended resulting in ridiculous computation sizes.
>> Oh, I see.  Indeed, without dp-groups there is no lookup during
>> lflow build.  I missed that.  So, yes, I agree that for a case
>> without dp-groups, re-sizing after lflow processing should work.
>> We need that for parallel case.
>>
>> Current patch (use hmap_insert() that resizes if needed) helps
>> for all non-parallel cases.
> Indeed. It should go in.
>
 Why can't we 

[ovs-dev] [PATCH ovn 1/2] northd: refactor unreachable IPs lb flows

2021-08-24 Thread Lorenzo Bianconi
Refactor unreachable IPs for vip load-balancers, inverting the logic used
during lb flow creation in order to visit the lb first and then the related
datapath/ovn_ports.  This is a preliminary patch to avoid recomputing the
flow hash and repeating the lflow lookup when not necessary.

Signed-off-by: Lorenzo Bianconi 
---
 northd/ovn-northd.c | 137 
 1 file changed, 101 insertions(+), 36 deletions(-)

diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 3d8e21a4f..a8e69b3cb 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -4391,6 +4391,26 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct 
ovn_datapath *od,
 }
 
 /* Adds a row with the specified contents to the Logical_Flow table. */
+static void
+ovn_lflow_add_at_with_hash(struct hmap *lflow_map, struct ovn_datapath *od,
+   enum ovn_stage stage, uint16_t priority,
+   const char *match, const char *actions,
+   const char *io_port, const char *ctrl_meter,
+   const struct ovsdb_idl_row *stage_hint,
+   const char *where, uint32_t hash)
+{
+ovs_assert(ovn_stage_to_datapath_type(stage) == ovn_datapath_get_type(od));
+if (use_logical_dp_groups && use_parallel_build) {
+lock_hash_row(&lflow_locks, hash);
+do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
+ actions, io_port, stage_hint, where, ctrl_meter);
+unlock_hash_row(&lflow_locks, hash);
+} else {
+do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
+ actions, io_port, stage_hint, where, ctrl_meter);
+}
+}
+
 static void
 ovn_lflow_add_at(struct hmap *lflow_map, struct ovn_datapath *od,
  enum ovn_stage stage, uint16_t priority,
@@ -4398,24 +4418,14 @@ ovn_lflow_add_at(struct hmap *lflow_map, struct 
ovn_datapath *od,
  const char *ctrl_meter,
  const struct ovsdb_idl_row *stage_hint, const char *where)
 {
-ovs_assert(ovn_stage_to_datapath_type(stage) == ovn_datapath_get_type(od));
-
 uint32_t hash;
 
 hash = ovn_logical_flow_hash(ovn_stage_get_table(stage),
  ovn_stage_get_pipeline_name(stage),
  priority, match,
  actions);
-
-if (use_logical_dp_groups && use_parallel_build) {
-lock_hash_row(&lflow_locks, hash);
-do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
- actions, io_port, stage_hint, where, ctrl_meter);
-unlock_hash_row(&lflow_locks, hash);
-} else {
-do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
- actions, io_port, stage_hint, where, ctrl_meter);
-}
+ovn_lflow_add_at_with_hash(lflow_map,od, stage, priority, match, actions,
+   io_port, ctrl_meter, stage_hint, where, hash);
 }
 
 /* Adds a row with the specified contents to the Logical_Flow table. */
@@ -6870,16 +6880,11 @@ build_lswitch_rport_arp_req_flows(struct ovn_port *op,
 /* Check if the ovn port has a network configured on which we could
  * expect ARP requests for the LB VIP.
  */
-if (ip_parse(ip_addr, &ipv4_addr)) {
-if (lrouter_port_ipv4_reachable(op, ipv4_addr)) {
-build_lswitch_rport_arp_req_flow_for_reachable_ip(
-ip_addr, AF_INET, sw_op, sw_od, 80, lflows,
-stage_hint);
-} else {
-build_lswitch_rport_arp_req_flow_for_unreachable_ip(
-ip_addr, AF_INET, sw_od, 90, lflows,
-stage_hint);
-}
+if (ip_parse(ip_addr, &ipv4_addr) &&
+lrouter_port_ipv4_reachable(op, ipv4_addr)) {
+build_lswitch_rport_arp_req_flow_for_reachable_ip(
+ip_addr, AF_INET, sw_op, sw_od, 80, lflows,
+stage_hint);
 }
 }
 SSET_FOR_EACH (ip_addr, >od->lb_ips_v6) {
@@ -6888,16 +6893,11 @@ build_lswitch_rport_arp_req_flows(struct ovn_port *op,
 /* Check if the ovn port has a network configured on which we could
  * expect NS requests for the LB VIP.
  */
-if (ipv6_parse(ip_addr, &ipv6_addr)) {
-if (lrouter_port_ipv6_reachable(op, &ipv6_addr)) {
-build_lswitch_rport_arp_req_flow_for_reachable_ip(
-ip_addr, AF_INET6, sw_op, sw_od, 80, lflows,
-stage_hint);
-} else {
-build_lswitch_rport_arp_req_flow_for_unreachable_ip(
-ip_addr, AF_INET6, sw_od, 90, lflows,
-stage_hint);
-}
+if (ipv6_parse(ip_addr, &ipv6_addr) &&
+lrouter_port_ipv6_reachable(op, &ipv6_addr)) {
+build_lswitch_rport_arp_req_flow_for_reachable_ip(
+ip_addr, AF_INET6, sw_op, sw_od, 80, lflows,
+ 

[ovs-dev] [PATCH ovn 0/2] improve code efficiency for ovn-northd lb-unreachable ips

2021-08-24 Thread Lorenzo Bianconi
ovn-master:
--
Statistics for 'ovnnb_db_run'
  Total samples: 97
  Maximum: 12596 msec
  Minimum: 12324 msec
  95th percentile: 12549.921348 msec
  Short term average: 12438.001085 msec
  Long term average: 12407.350781 msec
Statistics for 'ovn-northd-loop'
  Total samples: 185
  Maximum: 12745 msec
  Minimum: 0 msec
  95th percentile: 12673.369586 msec
  Short term average: 8399.345856 msec
  Long term average: 5521.962478 msec
Statistics for 'ovnsb_db_run'
  Total samples: 113
  Maximum: 21 msec
  Minimum: 9 msec
  95th percentile: 20.999602 msec
  Short term average: 19.140572 msec
  Long term average: 19.076484 msec

ovn-master + flow rework:
-
Statistics for 'ovnnb_db_run'
  Total samples: 97
  Maximum: 8853 msec
  Minimum: 8533 msec
  95th percentile: 8697.718376 msec
  Short term average: 8608.222669 msec
  Long term average: 8623.947235 msec
Statistics for 'ovn-northd-loop'
  Total samples: 170
  Maximum: 9005 msec
  Minimum: 0 msec
  95th percentile: 8843.785574 msec
  Short term average: 5840.511105 msec
  Long term average: 5556.930338 msec
Statistics for 'ovnsb_db_run'
  Total samples: 113
  Maximum: 21 msec
  Minimum: 9 msec
  95th percentile: 21.00 msec
  Short term average: 18.668853 msec
  Long term average: 19.720689 msec

ovn-nbctl lr-list | wc -l
121
ovn-nbctl ls-list | wc -l
241
ovn-nbctl lb-list | wc -l
47077
ovn-sbctl dump-flows |wc -l
9852935

Lorenzo Bianconi (2):
  northd: refactor unreachable IPs lb flows
  northd: improve unreachable_ips flows processing for dp_groups

 northd/ovn-northd.c | 176 ++--
 1 file changed, 138 insertions(+), 38 deletions(-)

-- 
2.31.1



[ovs-dev] [PATCH ovn 2/2] northd: improve unreachable_ips flows processing for dp_groups

2021-08-24 Thread Lorenzo Bianconi
Improve the efficiency of load-balancer unreachable_ips flow processing
when datapath_groups are enabled.  We can avoid performing a flow lookup
for each ovn_port in build_lrouter_flows_for_unreachable_ips(), since the
lflows only depend on the vips (match and action filters).

Signed-off-by: Lorenzo Bianconi 
---
 northd/ovn-northd.c | 51 ++---
 1 file changed, 43 insertions(+), 8 deletions(-)

diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index a8e69b3cb..1a48574f2 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -4357,7 +4357,7 @@ static struct hashrow_locks lflow_locks;
 /* Adds a row with the specified contents to the Logical_Flow table.
  * Version to use when locking is required.
  */
-static void
+static struct ovn_lflow *
 do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
  uint32_t hash, enum ovn_stage stage, uint16_t priority,
  const char *match, const char *actions, const char *io_port,
@@ -4373,7 +4373,7 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct 
ovn_datapath *od,
actions, ctrl_meter, hash);
 if (old_lflow) {
hmapx_add(&old_lflow->od_group, od);
-return;
+return old_lflow;
 }
 }
 
@@ -4388,10 +4388,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct 
ovn_datapath *od,
ovn_lflow_hint(stage_hint), where);
hmapx_add(&lflow->od_group, od);
hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
+return lflow;
 }
 
 /* Adds a row with the specified contents to the Logical_Flow table. */
-static void
+static struct ovn_lflow *
 ovn_lflow_add_at_with_hash(struct hmap *lflow_map, struct ovn_datapath *od,
enum ovn_stage stage, uint16_t priority,
const char *match, const char *actions,
@@ -4399,16 +4400,21 @@ ovn_lflow_add_at_with_hash(struct hmap *lflow_map, 
struct ovn_datapath *od,
const struct ovsdb_idl_row *stage_hint,
const char *where, uint32_t hash)
 {
+struct ovn_lflow *lflow;
+
 ovs_assert(ovn_stage_to_datapath_type(stage) == ovn_datapath_get_type(od));
 if (use_logical_dp_groups && use_parallel_build) {
lock_hash_row(&lflow_locks, hash);
-do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
- actions, io_port, stage_hint, where, ctrl_meter);
+lflow = do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
+ actions, io_port, stage_hint, where,
+ ctrl_meter);
unlock_hash_row(&lflow_locks, hash);
 } else {
-do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
- actions, io_port, stage_hint, where, ctrl_meter);
+lflow = do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
+ actions, io_port, stage_hint, where,
+ ctrl_meter);
 }
+return lflow;
 }
 
 static void
@@ -9422,6 +9428,26 @@ build_lrouter_defrag_flows_for_lb(struct ovn_northd_lb 
*lb,
 ds_destroy(_actions);
 }
 
+static bool
+ovn_dp_group_update_with_reference(struct ovn_lflow *lflow_ref,
+   struct ovn_datapath *od,
+   uint32_t hash)
+{
+if (!use_logical_dp_groups || !lflow_ref) {
+return false;
+}
+
+if (use_parallel_build) {
+lock_hash_row(&lflow_locks, hash);
+hmapx_add(&lflow_ref->od_group, od);
+unlock_hash_row(&lflow_locks, hash);
+} else {
+hmapx_add(&lflow_ref->od_group, od);
+}
+
+return true;
+}
+
 static void
 build_lrouter_flows_for_unreachable_ips(struct ovn_northd_lb *lb,
 struct ovn_lb_vip *lb_vip,
@@ -9431,6 +9457,7 @@ build_lrouter_flows_for_unreachable_ips(struct 
ovn_northd_lb *lb,
 struct ds *action)
 {
bool ipv4 = IN6_IS_ADDR_V4MAPPED(&lb_vip->vip);
+struct ovn_lflow *lflow_ref = NULL;
 ovs_be32 ipv4_addr;
 
 ds_clear(match);
@@ -9474,7 +9501,15 @@ build_lrouter_flows_for_unreachable_ips(struct 
ovn_northd_lb *lb,
(!ipv4 && lrouter_port_ipv6_reachable(op, &lb_vip->vip))) {
 continue;
 }
-ovn_lflow_add_at_with_hash(lflows, peer->od,
+
+if (ovn_dp_group_update_with_reference(lflow_ref, od, hash)) {
+/* if we are running dp_groups, we do not need to run full
+ * lookup since lflow just depends on the vip and not on
+ * the ovn_port.
+ */
+continue;
+}
+lflow_ref = ovn_lflow_add_at_with_hash(lflows, peer->od,
S_SWITCH_IN_L2_LKUP, 90,
 

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Anton Ivanov

On 24/08/2021 18:48, Numan Siddique wrote:

On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov
 wrote:


On 24/08/2021 17:35, Ilya Maximets wrote:

On 8/24/21 6:25 PM, Numan Siddique wrote:

On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov
 wrote:

On 24/08/2021 12:46, Ilya Maximets wrote:

On 8/24/21 1:18 PM, Anton Ivanov wrote:

On 24/08/2021 12:05, Ilya Maximets wrote:

On 8/24/21 7:36 AM, Anton Ivanov wrote:

On 23/08/2021 22:36, Ilya Maximets wrote:

On 8/23/21 10:37 PM, Anton Ivanov wrote:

On 23/08/2021 21:26, Ilya Maximets wrote:

On 8/23/21 10:20 PM, Anton Ivanov wrote:

Should not be the case.

The map is pre-sized for the size from the previous iterations.

Line 12861 in my tree which is probably a few commits out of date:

  fast_hmap_size_for(&lflows, max_seen_lflow_size);

And immediately after building the lflows:

  if (hmap_count(&lflows) > max_seen_lflow_size) {
  max_seen_lflow_size = hmap_count(&lflows);
  }

So the map SHOULD be sized correctly - to the most recent seen lflow count.

Please, re-read the commit message.  It's a first run of the loop
and the 'max_seen_lflow_size' is default 128 at this point.

Ack,

Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it.

 From that perspective the patch is a straight +1 from me.

 From the perspective of the use case stated in the commit message- I am 
not sure it addresses it.

If northd is in single-threaded mode and is tackling a GIGANTIC database, it 
may never complete the first iteration before the
expiration of the timers and everyone deciding that northd is AWOL.

Well, how do you suggest to fix that?  Obviously, we can always create
a database that northd will never be able to process in a reasonable
amount of time.  And it doesn't matter if it's single- or multi-threaded.

In this case NbDB is only 9MB in size, which is very reasonable, and
northd on my laptop takes more than 15 minutes to process it (I killed
it at this point).  With the patch applied it took only 11 seconds.
So, for me, this patch pretty much fixes the issue.  11 seconds is not
that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
It would be great to reduce, but we're not there yet.


In that case, if it is multi-threaded from the start, it should probably
grab the sizing from the lflow table hash in south db. That would be a
good approximation for the first run.

This will not work for a case where SbDB is empty for any reason while
NbDB is not.  And there is still a case where northd initially connects
to semi-empty databases and after few iterations NbDB receives a big
transaction and generates a big update for northd.

A partial fix is to resize to optimum size the hash after lflow processing.

I'm not sure I understand what you mean here, because resizing after
lflow processing will not help.  The lflow processing itself is the
very slow part that we're trying to make faster here.

That can be the case only with dpgroups. Otherwise lflows are just a 
destination to dump data and the bucket sizing is irrelevant because there is 
never any lookup inside lflows during processing. The lflow map is just used to 
store data. So if it is suboptimal at the exit of build_lflows() the resize 
will fix it before the first lookup shortly thereafter.

Are you running it with dpgroups enabled? In that case there are lookups inside 
lflows during build which happen under a per-bucket lock. So in addition to 
suboptimal size when searching the contention depends on the number of buckets. 
If they are too few, the system becomes heavily contended resulting in 
ridiculous computation sizes.

Oh, I see.  Indeed, without dp-groups there is no lookup during
lflow build.  I missed that.  So, yes, I agree that for a case
without dp-groups, re-sizing after lflow processing should work.
We need that for parallel case.

Current patch (use hmap_insert() that resizes if needed) helps
for all non-parallel cases.

Indeed. It should go in.


Why can't we have hmap_insert() for both parallel and non parallel configs
to start with and switch over to hmap_insert_fast() when ovn-northd
has successfully connected to SB DB and has approximated on the
more accurate hmap size ?

We can't use hmap_insert() for parallel, because resize of a hash map
will crash threads that are working on it at the moment, IIUC.

We actually can for non-dp-groups case, but the merge will become much slower. 
Example - the last version of the snapshot and monitor parallelization for OVS. 
It no longer uses pre-sized fixed hmaps because it is nearly impossible to 
presize an RFC7047 structure correctly.

For the dp-groups case we can't. Locking of the lflows which are used for 
lookups is per hash bucket. You cannot change the number of buckets in the 
middle of the run and other locking mechanisms will be more coarse. So the 
marginal benefit with dp-groups at present will become none (or even slower).


We could disable parallel for first several 

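To make the sizing discussion above concrete, here is a small self-contained
C program (not OVS code; it only roughly mimics hmap behaviour, and the
grow-at-load-factor-2 threshold is an assumption).  It inserts half a million
entries into a 128-bucket chained table, once without resizing -- roughly the
hmap_insert_fast() case with a table pre-sized from the default of 128 -- and
once with doubling, as hmap_insert() would do.  The resulting chain lengths
show why every lookup (and, with dp-groups, every insert that first searches
for an existing flow) becomes so expensive on the first iteration.

#include <stdio.h>
#include <stdlib.h>

struct node { unsigned key; struct node *next; };

struct table {
    struct node **buckets;
    size_t n_buckets, n_entries;
};

static void
insert(struct table *t, unsigned key, int allow_resize)
{
    if (allow_resize && t->n_entries >= 2 * t->n_buckets) {
        /* Double the bucket array and rehash, as a growing hash map does. */
        size_t nb = t->n_buckets * 2;
        struct node **nbuckets = calloc(nb, sizeof *nbuckets);
        for (size_t i = 0; i < t->n_buckets; i++) {
            for (struct node *n = t->buckets[i], *next; n; n = next) {
                next = n->next;
                n->next = nbuckets[n->key % nb];
                nbuckets[n->key % nb] = n;
            }
        }
        free(t->buckets);
        t->buckets = nbuckets;
        t->n_buckets = nb;
    }

    struct node *n = malloc(sizeof *n);
    n->key = key;
    n->next = t->buckets[key % t->n_buckets];
    t->buckets[key % t->n_buckets] = n;
    t->n_entries++;
}

int
main(void)
{
    const size_t N = 500000;    /* nodes are deliberately leaked for brevity */

    for (int resize = 0; resize <= 1; resize++) {
        struct table t = { calloc(128, sizeof(struct node *)), 128, 0 };
        for (size_t i = 0; i < N; i++) {
            insert(&t, (unsigned) (i * 2654435761u), resize);
        }
        printf("resize=%d: %zu entries, %zu buckets, ~%zu entries per chain\n",
               resize, t.n_entries, t.n_buckets, t.n_entries / t.n_buckets);
    }
    return 0;
}
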
Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Numan Siddique
On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov
 wrote:
>
>
> On 24/08/2021 17:35, Ilya Maximets wrote:
> > On 8/24/21 6:25 PM, Numan Siddique wrote:
> >> On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov
> >>  wrote:
> >>>
> >>> On 24/08/2021 12:46, Ilya Maximets wrote:
>  On 8/24/21 1:18 PM, Anton Ivanov wrote:
> > On 24/08/2021 12:05, Ilya Maximets wrote:
> >> On 8/24/21 7:36 AM, Anton Ivanov wrote:
> >>> On 23/08/2021 22:36, Ilya Maximets wrote:
>  On 8/23/21 10:37 PM, Anton Ivanov wrote:
> > On 23/08/2021 21:26, Ilya Maximets wrote:
> >> On 8/23/21 10:20 PM, Anton Ivanov wrote:
> >>> Should not be the case.
> >>>
> >>> The map is pre-sized for the size from the previous iterations.
> >>>
> >>> Line 12861 in my tree which is probably a few commits out of date:
> >>>
> >>>  fast_hmap_size_for(, max_seen_lflow_size);
> >>>
> >>> And immediately after building the lflows:
> >>>
> >>>  if (hmap_count() > max_seen_lflow_size) {
> >>>  max_seen_lflow_size = hmap_count();
> >>>  }
> >>>
> >>> So the map SHOULD be sized correctly - to the most recent seen 
> >>> lflow count.
> >> Please, re-read the commit message.  It's a first run of the loop
> >> and the 'max_seen_lflow_size' is default 128 at this point.
> > Ack,
> >
> > Not using auto-resizing in single threaded mode is a bug. Thanks 
> > for fixing it.
> >
> > From that perspective the patch is a straight +1 from me.
> >
> > From the perspective of the use case stated in the commit 
> > message- I am not sure it addresses it.
> >
> > If northd is in single-threaded mode and is tackling a GIGANTIC> 
> > database, it may never complete the first iteration before the
> > expiration of the timers and everyone deciding that northd is AWOL.
>  Well, how do you suggest to fix that?  Obviously, we can always 
>  create
>  a database that northd will never be able to process in a reasonable
>  amount of time.  And it doesn't matter if it's single- or 
>  multi-threaded.
> 
>  In this case NbDB is only 9MB in size, which is very reasonable, and
>  northd on my laptop takes more than 15 minutes to process it (I 
>  killed
>  it at this point).  With the patch applied it took only 11 seconds.
>  So, for me, this patch pretty much fixes the issue.  11 seconds is 
>  not
>  that bad, e.g. ovn-k8s configures inactivity probes for clients to 
>  180.
>  It would be great to reduce, but we're not there yet.
> 
> > In that case, if it is multi-threaded from the start, it should 
> > probably
> > grab the sizing from the lflow table hash in south db. That would 
> > be a
> > good approximation for the first run.
>  This will not work for a case where SbDB is empty for any reason 
>  while
>  NbDB is not.  And there is still a case where northd initially 
>  connects
>  to semi-empty databases and after few iterations NbDB receives a big
>  transaction and generates a big update for northd.
> >>> A partial fix is to resize to optimum size the hash after lflow 
> >>> processing.
> >> I'm not sure I understand what you mean here, because resizing after
> >> lflow processing will not help.  The lflow processing itself is the
> >> very slow part that we're trying to make faster here.
> > That can be the case only with dpgroups. Otherwise lflows are just a 
> > destination to dump data and the bucket sizing is irrelevant because 
> > there is never any lookup inside lflows during processing. The lflow 
> > map is just used to store data. So if it is suboptimal at the exit of 
> > build_lflows() the resize will fix it before the first lookup shortly 
> > thereafter.
> >
> > Are you running it with dpgroups enabled? In that case there are 
> > lookups inside lflows during build which happen under a per-bucket 
> > lock. So in addition to suboptimal size when searching the contention 
> > depends on the number of buckets. If they are too few, the system 
> > becomes heavily contended resulting in ridiculous computation sizes.
>  Oh, I see.  Indeed, without dp-groups there is no lookup during
>  lflow build.  I missed that.  So, yes, I agree that for a case
>  without dp-groups, re-sizing after lflow processing should work.
>  We need that for parallel case.
> 
>  Current patch (use hmap_insert() that resizes if needed) helps
>  for all non-parallel cases.
> >>> Indeed. It should go in.
> >>>
> >> Why can't we have hmap_insert() for both parallel and non 

Re: [ovs-dev] [PATCH v2] ipf: fix only nat the first fragment in the reass process

2021-08-24 Thread Aaron Conole
Aaron Conole  writes:

> Ilya Maximets  writes:
>
>> On 8/12/21 6:17 PM, Aaron Conole wrote:
>>> we...@ucloud.cn writes:
>>> 
 From: wenxu 

 The ipf code collects the original fragment packets and reassembles a new
 packet on which to run the conntrack logic. After the conntrack processing
 finishes, it copies the ct metadata to each original packet but modifies the
 L4 header only in the first fragment. It should also modify the IP src/dst
 info for all the fragments.

 Signed-off-by: wenxu 
 Co-authored-by: luke.li 
 Signed-off-by: luke.li 
 ---
>>> 
>>> Acked-by: Aaron Conole 
>>> 
>>> Thanks for the fix.  I see it can work for any l3 protocol.
>>> 
>>> Based on the comments you supplied, I wrote the following test case.  It
>>> can either be folded in by you (or Ilya on apply), or I can submit as a
>>> separate patch (in case you are worried about having my sign-off /
>>> coauthor on this patch).
>>> 
>>> When testing 'make check-system-userspace' before this patch, I see a
>>> failure and get the following tcpdump logged:
>>> 
>>>   12:15:31 aconole@RHTPC1VM0NT {master} ~/git/ovs$ sudo tcpdump -r
>>> tests/system-userspace-testsuite.dir/078/p1.pcap
>>>   reading from file
>>> tests/system-userspace-testsuite.dir/078/p1.pcap, link-type EN10MB
>>> (Ethernet), snapshot length 262144
>>>   dropped privs to tcpdump
>>>   12:07:21.364925 ARP, Request who-has 10.2.1.2 tell 10.2.1.1, length 28
>>>   12:07:21.364928 ARP, Reply 10.2.1.2 is-at e6:45:4a:80:7c:61 (oui 
>>> Unknown), length 28
>>>   12:07:21.365095 IP 10.1.1.1 > 10.1.1.2: ICMP echo request, id 40165, seq 
>>> 1, length 1480
>>>   12:07:21.365099 IP 10.2.1.1 > 10.1.1.2: icmp
>>>   12:07:21.365101 IP 10.2.1.1 > 10.1.1.2: icmp
>>>   12:07:21.365102 IP 10.2.1.1 > 10.1.1.2: icmp
>>> 
>>> We see the first frag correct, but subsequent frags are broken.
>>> 
>>> This test worked both for userspace and kernel datapath on my local
>>> system.
>>
>> Hmm.  This test fails for me for both kernel and userspace:
>
> Okay, I'll try it again on my system.  For reference, I was on F34,
> kernel 5.12.12-300.fc34.x86_64
>
>> tcpdump -r tests/system-userspace-testsuite.dir/078/p0.pcap
>> 15:17:12.832383 ARP, Request who-has 10.2.1.2 tell 10.2.1.1, length 28
>> 15:17:12.834317 ARP, Reply 10.2.1.2 is-at 46:0c:83:aa:6e:b0 (oui Unknown), 
>> length 28
>> 15:17:12.834327 IP 10.2.1.1 > 10.1.1.2: ICMP echo request, id 27759, seq 1, 
>> length 1480
>> 15:17:12.834329 IP 10.2.1.1 > 10.1.1.2: icmp
>> 15:17:12.834330 IP 10.2.1.1 > 10.1.1.2: icmp
>> 15:17:12.834332 IP 10.2.1.1 > 10.1.1.2: icmp
>>
>> tcpdump -r tests/system-userspace-testsuite.dir/078/p1.pcap
>> 15:17:12.833542 ARP, Request who-has 10.2.1.2 tell 10.2.1.1, length 28
>> 15:17:12.834994 IP 10.1.1.1 > 10.1.1.2: ICMP echo request, id 27759, seq 1, 
>> length 1480
>> 15:17:12.834999 IP 10.1.1.1 > 10.1.1.2: icmp
>> 15:17:12.835002 IP 10.1.1.1 > 10.1.1.2: icmp
>> 15:17:12.835004 IP 10.1.1.1 > 10.1.1.2: icmp
>>
>> ping -c 1 10.1.1.2 -M dont -s 4500 | grep "transmitted" | sed 
>> 's/time.*ms$/time 0ms/'
>> NS_EXEC_HEREDOC
>> --- -   2021-08-16 15:17:22.844535052 -0400
>> +++ /root/ovs/tests/system-userspace-testsuite.dir/at-groups/78/stdout
>> @@ -1,2 +1,2 @@
>> -1 packets transmitted, 1 received, 0% packet loss, time 0ms
>> +1 packets transmitted, 0 received, 100% packet loss, time 0ms
>>
>> # uname -a
>> Linux rhel8 4.18.0-305.3.1.el8_4.x86_64
>>
>> I'm not sure what is going on.  Could you, please, re-check?
>
> I'll boot an rhel8.4 instance and try it out.
>
>> I will not apply this patch for now until we figure out how to test it.
>
> Okay.
>

I hope this change works.  I altered the test environment to use a
single port so instead of having a dummy attached to the bridge, I now
use lo to be the receiver.

  [ 'client' ] < --- > [ bridge ] < --- > [ 'server' ][ lo ]

Before it was

  [ 'client' ] < --- > [ bridge ] < --- > [ 'server' ]
^ - > [  'dummy' ]

And 'dummy' provided the L3 routing path.  It seems that doesn't work in all
environments, so I use 'server' to provide the routing path and 'lo' as
the actual ping endpoint.

I tested this on RHEL8 (with the latest z-stream kernel), and Fedora
34.

---
 tests/system-traffic.at | 41 +
 1 file changed, 41 insertions(+)

diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index f400cfabc9..de9108ac20 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -3305,6 +3305,47 @@ NS_CHECK_EXEC([at_ns0], [ping6 -s 3200 -q -c 3 -i 0.3 -w 
2 fc00::2 | FORMAT_PING
 OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
 
+AT_SETUP([conntrack - IPv4 Fragmentation + NAT])
+AT_SKIP_IF([test $HAVE_TCPDUMP = no])
+CHECK_CONNTRACK()
+
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 secure -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "10.2.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.2.1.2/24")
+
+# create a dummy route for NAT
+NS_CHECK_EXEC([at_ns1], [ip addr add 10.1.1.2/32 dev lo])

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Anton Ivanov



On 24/08/2021 17:35, Ilya Maximets wrote:

On 8/24/21 6:25 PM, Numan Siddique wrote:

On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov
 wrote:


On 24/08/2021 12:46, Ilya Maximets wrote:

On 8/24/21 1:18 PM, Anton Ivanov wrote:

On 24/08/2021 12:05, Ilya Maximets wrote:

On 8/24/21 7:36 AM, Anton Ivanov wrote:

On 23/08/2021 22:36, Ilya Maximets wrote:

On 8/23/21 10:37 PM, Anton Ivanov wrote:

On 23/08/2021 21:26, Ilya Maximets wrote:

On 8/23/21 10:20 PM, Anton Ivanov wrote:

Should not be the case.

The map is pre-sized for the size from the previous iterations.

Line 12861 in my tree which is probably a few commits out of date:

 fast_hmap_size_for(&lflows, max_seen_lflow_size);

And immediately after building the lflows:

 if (hmap_count(&lflows) > max_seen_lflow_size) {
 max_seen_lflow_size = hmap_count(&lflows);
 }

So the map SHOULD be sized correctly - to the most recent seen lflow count.

Please, re-read the commit message.  It's a first run of the loop
and the 'max_seen_lflow_size' is default 128 at this point.

Ack,

Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it.

From that perspective the patch is a straight +1 from me.

From the perspective of the use case stated in the commit message- I am not 
sure it addresses it.

If northd is in single-threaded mode and is tackling a GIGANTIC database, it 
may never complete the first iteration before the
expiration of the timers and everyone deciding that northd is AWOL.

Well, how do you suggest to fix that?  Obviously, we can always create
a database that northd will never be able to process in a reasonable
amount of time.  And it doesn't matter if it's single- or multi-threaded.

In this case NbDB is only 9MB in size, which is very reasonable, and
northd on my laptop takes more than 15 minutes to process it (I killed
it at this point).  With the patch applied it took only 11 seconds.
So, for me, this patch pretty much fixes the issue.  11 seconds is not
that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
It would be great to reduce, but we're not there yet.


In that case, if it is multi-threaded from the start, it should probably
grab the sizing from the lflow table hash in south db. That would be a
good approximation for the first run.

This will not work for a case where SbDB is empty for any reason while
NbDB is not.  And there is still a case where northd initially connects
to semi-empty databases and after few iterations NbDB receives a big
transaction and generates a big update for northd.

A partial fix is to resize to optimum size the hash after lflow processing.

I'm not sure I understand what you mean here, because resizing after
lflow processing will not help.  The lflow processing itself is the
very slow part that we're trying to make faster here.

That can be the case only with dpgroups. Otherwise lflows are just a 
destination to dump data and the bucket sizing is irrelevant because there is 
never any lookup inside lflows during processing. The lflow map is just used to 
store data. So if it is suboptimal at the exit of build_lflows() the resize 
will fix it before the first lookup shortly thereafter.

Are you running it with dpgroups enabled? In that case there are lookups inside 
lflows during build which happen under a per-bucket lock. So in addition to 
suboptimal size when searching the contention depends on the number of buckets. 
If they are too few, the system becomes heavily contended resulting in 
ridiculous computation sizes.

Oh, I see.  Indeed, without dp-groups there is no lookup during
lflow build.  I missed that.  So, yes, I agree that for a case
without dp-groups, re-sizing after lflow processing should work.
We need that for parallel case.

Current patch (use hmap_insert() that resizes if needed) helps
for all non-parallel cases.

Indeed. It should go in.


Why can't we have hmap_insert() for both parallel and non parallel configs
to start with and switch over to hmap_insert_fast() when ovn-northd
has successfully connected to SB DB and has approximated on the
more accurate hmap size ?

We can't use hmap_insert() for parallel, because resize of a hash map
will crash threads that are working on it at the moment, IIUC.


We actually can for non-dp-groups case, but the merge will become much slower. 
Example - the last version of the snapshot and monitor parallelization for OVS. 
It no longer uses pre-sized fixed hmaps because it is nearly impossible to 
presize an RFC7047 structure correctly.

For the dp-groups case we can't. Locking of the lflows which are used for 
lookups is per hash bucket. You cannot change the number of buckets in the 
middle of the run and other locking mechanisms will be more coarse. So the 
marginal benefit with dp-groups at present will become none (or even slower).



We could disable parallel for first several iterations, but still,
this doesn't cover all the cases.  i.e. the one where we have a big
update from the 

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Anton Ivanov



On 24/08/2021 17:25, Numan Siddique wrote:

On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov
 wrote:


On 24/08/2021 12:46, Ilya Maximets wrote:

On 8/24/21 1:18 PM, Anton Ivanov wrote:

On 24/08/2021 12:05, Ilya Maximets wrote:

On 8/24/21 7:36 AM, Anton Ivanov wrote:

On 23/08/2021 22:36, Ilya Maximets wrote:

On 8/23/21 10:37 PM, Anton Ivanov wrote:

On 23/08/2021 21:26, Ilya Maximets wrote:

On 8/23/21 10:20 PM, Anton Ivanov wrote:

Should not be the case.

The map is pre-sized for the size from the previous iterations.

Line 12861 in my tree which is probably a few commits out of date:

 fast_hmap_size_for(&lflows, max_seen_lflow_size);

And immediately after building the lflows:

 if (hmap_count(&lflows) > max_seen_lflow_size) {
 max_seen_lflow_size = hmap_count(&lflows);
 }

So the map SHOULD be sized correctly - to the most recent seen lflow count.

Please, re-read the commit message.  It's a first run of the loop
and the 'max_seen_lflow_size' is default 128 at this point.

Ack,

Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it.

From that perspective the patch is a straight +1 from me.

From the perspective of the use case stated in the commit message- I am not 
sure it addresses it.

If northd is in single-threaded mode and is tackling a GIGANTIC database, it 
may never complete the first iteration before the
expiration of the timers and everyone deciding that northd is AWOL.

Well, how do you suggest to fix that?  Obviously, we can always create
a database that northd will never be able to process in a reasonable
amount of time.  And it doesn't matter if it's single- or multi-threaded.

In this case NbDB is only 9MB in size, which is very reasonable, and
northd on my laptop takes more than 15 minutes to process it (I killed
it at this point).  With the patch applied it took only 11 seconds.
So, for me, this patch pretty much fixes the issue.  11 seconds is not
that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
It would be great to reduce, but we're not there yet.


In that case, if it is multi-threaded from the start, it should probably
grab the sizing from the lflow table hash in south db. That would be a
good approximation for the first run.

This will not work for a case where SbDB is empty for any reason while
NbDB is not.  And there is still a case where northd initially connects
to semi-empty databases and after few iterations NbDB receives a big
transaction and generates a big update for northd.

A partial fix is to resize to optimum size the hash after lflow processing.

I'm not sure I understand what you mean here, because resizing after
lflow processing will not help.  The lflow processing itself is the
very slow part that we're trying to make faster here.

That can be the case only with dpgroups. Otherwise lflows are just a 
destination to dump data and the bucket sizing is irrelevant because there is 
never any lookup inside lflows during processing. The lflow map is just used to 
store data. So if it is suboptimal at the exit of build_lflows() the resize 
will fix it before the first lookup shortly thereafter.

Are you running it with dpgroups enabled? In that case there are lookups inside 
lflows during build which happen under a per-bucket lock. So in addition to 
suboptimal size when searching the contention depends on the number of buckets. 
If they are too few, the system becomes heavily contended resulting in 
ridiculous computation sizes.

Oh, I see.  Indeed, without dp-groups there is no lookup during
lflow build.  I missed that.  So, yes, I agree that for a case
without dp-groups, re-sizing after lflow processing should work.
We need that for parallel case.

Current patch (use hmap_insert() that resizes if needed) helps
for all non-parallel cases.

Indeed. It should go in.


Why can't we have hmap_insert() for both parallel and non parallel configs
to start with and switch over to hmap_insert_fast() when ovn-northd
has successfully connected to SB DB and has approximated on the
more accurate hmap size ?


That is possible, but not for dp_groups.

If it is just the lflow compute, you can use hmap_insert, but that does not 
actually have any benefit. In fact, you will consume much more CPU than merging 
into a suboptimal hmap and then resizing it at the end.

For dp_groups, the locking is per hash bucket. If you change the number of 
buckets (as upon resize) your locks are no longer valid and you end up 
corrupting the data.
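
For readers following along, below is a minimal standalone sketch of
bucket-striped locking of the kind described here, and of why a resize breaks
it: the hash-to-bucket mapping changes, so a lock taken for the "old" bucket
no longer protects the chain the entry now belongs to.  It uses plain pthread
mutexes and a fixed bucket count; the real northd code uses a hashrow_locks
array with lock_hash_row()/unlock_hash_row(), so the details differ.

/* Build with: cc striped_lock.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define N_BUCKETS 128  /* must stay constant for the per-bucket locks to hold */

static pthread_mutex_t bucket_locks[N_BUCKETS];

static void
lock_bucket(unsigned hash)
{
    pthread_mutex_lock(&bucket_locks[hash % N_BUCKETS]);
}

static void
unlock_bucket(unsigned hash)
{
    pthread_mutex_unlock(&bucket_locks[hash % N_BUCKETS]);
}

int
main(void)
{
    for (int i = 0; i < N_BUCKETS; i++) {
        pthread_mutex_init(&bucket_locks[i], NULL);
    }

    unsigned hash = 0xdeadbeef;

    lock_bucket(hash);
    /* ...search the bucket's chain and insert under the lock...
     * If another thread doubled the bucket count here, hash % N_BUCKETS
     * would point somewhere else and this lock would protect nothing. */
    unlock_bucket(hash);

    printf("locked and unlocked bucket %u\n", hash % N_BUCKETS);
    return 0;
}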

I am running tests on dp_groups and I am starting to think that we should 
abandon the parallelization of lflow compute altogether for the dp_groups case.

I get at best the same results and sometimes worse results. Looking at the the 
picture on ovn-central node of ovn-heater the threads never ramp up to more 
than single digit percents - they are waiting on locks. Compared to that the 
brute-force lflow compute has threads ramping up to 100% and clear benefit from 

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Ilya Maximets
On 8/24/21 6:25 PM, Numan Siddique wrote:
> On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov
>  wrote:
>>
>>
>> On 24/08/2021 12:46, Ilya Maximets wrote:
>>> On 8/24/21 1:18 PM, Anton Ivanov wrote:
 On 24/08/2021 12:05, Ilya Maximets wrote:
> On 8/24/21 7:36 AM, Anton Ivanov wrote:
>> On 23/08/2021 22:36, Ilya Maximets wrote:
>>> On 8/23/21 10:37 PM, Anton Ivanov wrote:
 On 23/08/2021 21:26, Ilya Maximets wrote:
> On 8/23/21 10:20 PM, Anton Ivanov wrote:
>> Should not be the case.
>>
>> The map is pre-sized for the size from the previous iterations.
>>
>> Line 12861 in my tree which is probably a few commits out of date:
>>
>> fast_hmap_size_for(, max_seen_lflow_size);
>>
>> And immediately after building the lflows:
>>
>> if (hmap_count() > max_seen_lflow_size) {
>> max_seen_lflow_size = hmap_count();
>> }
>>
>> So the map SHOULD be sized correctly - to the most recent seen lflow 
>> count.
> Please, re-read the commit message.  It's a first run of the loop
> and the 'max_seen_lflow_size' is default 128 at this point.
 Ack,

 Not using auto-resizing in single threaded mode is a bug. Thanks for 
 fixing it.

From that perspective the patch is a straight +1 from me.

From the perspective of the use case stated in the commit message- 
 I am not sure it addresses it.

 If northd is in single-threaded mode and is tackling a GIGANTIC> 
 database, it may never complete the first iteration before the
 expiration of the timers and everyone deciding that northd is AWOL.
>>> Well, how do you suggest to fix that?  Obviously, we can always create
>>> a database that northd will never be able to process in a reasonable
>>> amount of time.  And it doesn't matter if it's single- or 
>>> multi-threaded.
>>>
>>> In this case NbDB is only 9MB in size, which is very reasonable, and
>>> northd on my laptop takes more than 15 minutes to process it (I killed
>>> it at this point).  With the patch applied it took only 11 seconds.
>>> So, for me, this patch pretty much fixes the issue.  11 seconds is not
>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
>>> It would be great to reduce, but we're not there yet.
>>>
 In that case, if it is multi-threaded from the start, it should 
 probably
 grab the sizing from the lflow table hash in south db. That would be a
 good approximation for the first run.
>>> This will not work for a case where SbDB is empty for any reason while
>>> NbDB is not.  And there is still a case where northd initially connects
>>> to semi-empty databases and after few iterations NbDB receives a big
>>> transaction and generates a big update for northd.
>> A partial fix is to resize to optimum size the hash after lflow 
>> processing.
> I'm not sure I understand what you mean here, because resizing after
> lflow processing will not help.  The lflow processing itself is the
> very slow part that we're trying to make faster here.
 That can be the case only with dpgroups. Otherwise lflows are just a 
 destination to dump data and the bucket sizing is irrelevant because there 
 is never any lookup inside lflows during processing. The lflow map is just 
 used to store data. So if it is suboptimal at the exit of build_lflows() 
 the resize will fix it before the first lookup shortly thereafter.

 Are you running it with dpgroups enabled? In that case there are lookups 
 inside lflows during build which happen under a per-bucket lock. So in 
 addition to suboptimal size when searching the contention depends on the 
 number of buckets. If they are too few, the system becomes heavily 
 contended resulting in ridiculous computation sizes.
>>> Oh, I see.  Indeed, without dp-groups there is no lookup during
>>> lflow build.  I missed that.  So, yes, I agree that for a case
>>> without dp-groups, re-sizing after lflow processing should work.
>>> We need that for parallel case.
>>>
>>> Current patch (use hmap_insert() that resizes if needed) helps
>>> for all non-parallel cases.
>>
>> Indeed. It should go in.
>>
> 
> Why can't we have hmap_insert() for both parallel and non parallel configs
> to start with and switch over to hmap_insert_fast() when ovn-northd
> has successfully connected to SB DB and has approximated on the
> more accurate hmap size ?

We can't use hmap_insert() for parallel, because resize of a hash map
will crash threads that are working on it at the moment, IIUC.

We could disable parallel for first several iterations, but still,
this doesn't cover all the cases.  i.e. the one where we have a big
update from 

Re: [ovs-dev] [PATCH v1 2/2] ofp-monitor: Support flow monitoring for OpenFlow 1.3, 1.4+

2021-08-24 Thread Vasu Dasari
Hi Ben/Ashish,

When you get a chance, Can you please take a look at my code?

-Vasu

*Vasu Dasari*


On Thu, Jul 29, 2021 at 10:36 PM Vasu Dasari  wrote:

> Extended OpenFlow monitoring support
> * OpenFlow 1.3 with ONF extensions
> * OpenFlow 1.4+ as defined in OpenFlow specification 1.4+.
>
> ONF extensions are similar to Nicira extensions, except that in
> onf_flow_monitor_request{}
> out_port is defined as a 32-bit OF(1.1) port number, and OXM match
> formats are
> used in update and request messages.
>
> Flow monitoring support in 1.4+ is slightly different from Nicira and ONF
> extensions.
>  * More flow monitoring flags are defined.
>  * A monitor add/modify/delete command is introduced in the flow_monitor
>request message.
>  * Addition of out_group as part of flow_monitor request message
>
> Description of changes:
> 1. Generate ofp-msgs.inc to be able to support 1.3, 1.4+ flow Monitoring
> messages.
> include/openvswitch/ofp-msgs.h
>
> 2. Modify openflow header files with protocol specific headers.
> include/openflow/openflow-1.3.h
> include/openflow/openflow-1.4.h
>
> 3. Modify OvS abstraction of openflow headers. ofp-monitor.h leverages
> enums
>from on nicira extensions for creating protocol abstraction headers.
> OF(1.4+)
>enums are superset of nicira extensions.
> include/openvswitch/ofp-monitor.h
>
> 4. Changes to these files reflect encoding and decoding of new protocol
> messages.
> lib/ofp-monitor.c
>
> 5. Changes to modules using ofp-monitor APIs. Most of the changes here
> are to
>migrate enums from Nicira to OF 1.4+ versions.
> ofproto/connmgr.c
> ofproto/connmgr.h
> ofproto/ofproto-provider.h
> ofproto/ofproto.c
>
> 6. Extended protocol decoding tests to verify all protocol versions
> FLOW_MONITOR_CANCEL
> FLOW_MONITOR_PAUSED
> FLOW_MONITOR_RESUMED
> FLOW_MONITOR request
> FLOW_MONITOR reply
> tests/ofp-print.at
>
> 7. Modify flow monitoring tests so they can be executed with all protocol
> versions.
> tests/ofproto.at
>
> 8. Modified documentation highlighting the change
> utilities/ovs-ofctl.8.in
> NEWS
>
> Signed-off-by: Vasu Dasari 
> Reported-at:
> https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/383915.html
> ---
> v1:
>  - Fixed 0-day Robot errors
>
> ---
>  NEWS  |   6 +-
>  include/openflow/openflow-1.3.h   |  89 
>  include/openflow/openflow-1.4.h   |  93 +++-
>  include/openvswitch/ofp-monitor.h |   9 +-
>  include/openvswitch/ofp-msgs.h|  39 +-
>  lib/ofp-monitor.c | 844 --
>  lib/ofp-print.c   |  24 +-
>  ofproto/connmgr.c |  47 +-
>  ofproto/connmgr.h |   6 +-
>  ofproto/ofproto-provider.h|   4 +-
>  ofproto/ofproto.c |  89 +++-
>  tests/ofp-print.at| 122 -
>  tests/ofproto.at  | 176 +--
>  utilities/ovs-ofctl.8.in  |   3 +
>  utilities/ovs-ofctl.c |   6 +
>  15 files changed, 1265 insertions(+), 292 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index 02884b774..47ad9de2a 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -25,8 +25,10 @@ v2.16.0 - xx xxx 
> - In ovs-vsctl and vtep-ctl, the "find" command now accept new
>   operators {in} and {not-in}.
> - OpenFlow:
> - * Extend Flow Monitoring support for OpenFlow 1.0-1.2 with Nicira
> -   Extensions
> + * Extended Flow Monitoring support for all supported OpenFlow
> versions
> + OpenFlow versions 1.0-1.2 with Nicira Extensions
> + OpenFlow versions 1.3 with Open Network Foundation extension
> + OpenFlow versions 1.4+, as defined in the OpenFlow specification
> - Userspace datapath:
>   * Auto load balancing of PMDs now partially supports cross-NUMA
> polling
> cases, e.g if all PMD threads are running on the same NUMA node.
> diff --git a/include/openflow/openflow-1.3.h
> b/include/openflow/openflow-1.3.h
> index c48a8ea7f..1a818dbb4 100644
> --- a/include/openflow/openflow-1.3.h
> +++ b/include/openflow/openflow-1.3.h
> @@ -374,4 +374,93 @@ struct ofp13_async_config {
>  };
>  OFP_ASSERT(sizeof(struct ofp13_async_config) == 24);
>
> +struct onf_flow_monitor_request {
> +ovs_be32   id;/* Controller-assigned ID for this monitor.
> */
> +ovs_be16   flags; /* ONFFMF_*. */
> +ovs_be16   match_len; /* Length of oxm_fields. */
> +ovs_be32   out_port;  /* Required output port, if not OFPP_NONE.
> */
> +uint8_t table_id;  /* One table's ID or 0xff for all tables. */
> +uint8_t zeros[3];  /* Align to 64 bits (must be zero). */
> +/* Followed by an ofp11_match structure. */
> +};
> +OFP_ASSERT(sizeof(struct onf_flow_monitor_request) == 16);
> +
> +/* Header for experimenter requests and replies. */
> +struct onf_experimenter_header {
> +struct ofp_header header;
> +ovs_be32   vendor;

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Numan Siddique
On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov
 wrote:
>
>
> On 24/08/2021 12:46, Ilya Maximets wrote:
> > On 8/24/21 1:18 PM, Anton Ivanov wrote:
> >> On 24/08/2021 12:05, Ilya Maximets wrote:
> >>> On 8/24/21 7:36 AM, Anton Ivanov wrote:
>  On 23/08/2021 22:36, Ilya Maximets wrote:
> > On 8/23/21 10:37 PM, Anton Ivanov wrote:
> >> On 23/08/2021 21:26, Ilya Maximets wrote:
> >>> On 8/23/21 10:20 PM, Anton Ivanov wrote:
>  Should not be the case.
> 
>  The map is pre-sized for the size from the previous iterations.
> 
>  Line 12861 in my tree which is probably a few commits out of date:
> 
>  fast_hmap_size_for(, max_seen_lflow_size);
> 
>  And immediately after building the lflows:
> 
>  if (hmap_count() > max_seen_lflow_size) {
>  max_seen_lflow_size = hmap_count();
>  }
> 
>  So the map SHOULD be sized correctly - to the most recent seen lflow 
>  count.
> >>> Please, re-read the commit message.  It's a first run of the loop
> >>> and the 'max_seen_lflow_size' is default 128 at this point.
> >> Ack,
> >>
> >> Not using auto-resizing in single threaded mode is a bug. Thanks for 
> >> fixing it.
> >>
> >>From that perspective the patch is a straight +1 from me.
> >>
> >>From the perspective of the use case stated in the commit message- 
> >> I am not sure it addresses it.
> >>
> >> If northd is in single-threaded mode and is tackling a GIGANTIC> 
> >> database, it may never complete the first iteration before the
> >> expiration of the timers and everyone deciding that northd is AWOL.
> > Well, how do you suggest to fix that?  Obviously, we can always create
> > a database that northd will never be able to process in a reasonable
> > amount of time.  And it doesn't matter if it's single- or 
> > multi-threaded.
> >
> > In this case NbDB is only 9MB in size, which is very reasonable, and
> > northd on my laptop takes more than 15 minutes to process it (I killed
> > it at this point).  With the patch applied it took only 11 seconds.
> > So, for me, this patch pretty much fixes the issue.  11 seconds is not
> > that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
> > It would be great to reduce, but we're not there yet.
> >
> >> In that case, if it is multi-threaded from the start, it should 
> >> probably
> >> grab the sizing from the lflow table hash in south db. That would be a
> >> good approximation for the first run.
> > This will not work for a case where SbDB is empty for any reason while
> > NbDB is not.  And there is still a case where northd initially connects
> > to semi-empty databases and after few iterations NbDB receives a big
> > transaction and generates a big update for northd.
>  A partial fix is to resize to optimum size the hash after lflow 
>  processing.
> >>> I'm not sure I understand what you mean here, because resizing after
> >>> lflow processing will not help.  The lflow processing itself is the
> >>> very slow part that we're trying to make faster here.
> >> That can be the case only with dpgroups. Otherwise lflows are just a 
> >> destination to dump data and the bucket sizing is irrelevant because there 
> >> is never any lookup inside lflows during processing. The lflow map is just 
> >> used to store data. So if it is suboptimal at the exit of build_lflows() 
> >> the resize will fix it before the first lookup shortly thereafter.
> >>
> >> Are you running it with dpgroups enabled? In that case there are lookups 
> >> inside lflows during build which happen under a per-bucket lock. So in 
> >> addition to suboptimal size when searching the contention depends on the 
> >> number of buckets. If they are too few, the system becomes heavily 
> >> contended resulting in ridiculous computation sizes.
> > Oh, I see.  Indeed, without dp-groups there is no lookup during
> > lflow build.  I missed that.  So, yes, I agree that for a case
> > without dp-groups, re-sizing after lflow processing should work.
> > We need that for parallel case.
> >
> > Current patch (use hmap_insert() that resizes if needed) helps
> > for all non-parallel cases.
>
> Indeed. It should go in.
>

Why can't we have hmap_insert() for both parallel and non parallel configs
to start with and switch over to hmap_insert_fast() when ovn-northd
has successfully connected to SB DB and has approximated on the
more accurate hmap size ?

Thanks
Numan

> I will sort out the other cases to the extent possible.


>
> Brgds,
>
> A.
>
> >
> > I'm mostly running dp-groups + non-parallel which is a default
> > case for ovn-heater/ovn-k8s.
> >
> >> For the case of "dpgroups + parallel + first iteration + pre-existing 
> >> large database" there is no cure short of pre-allocating the 

Re: [ovs-dev] [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute

2021-08-24 Thread Ilya Maximets
On 8/24/21 6:04 PM, Eli Britstein wrote:
> 
> On 8/24/2021 6:47 PM, Ilya Maximets wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 8/24/21 5:25 PM, Eli Britstein wrote:
>>> On 8/24/2021 6:08 PM, Finn, Emma wrote:
 External email: Use caution opening links or attachments


 -Original Message-
 From: Eli Britstein 
 Sent: Monday 16 August 2021 14:55
 To: d...@openvswitch.org; Ilya Maximets 
 Cc: Finn, Emma ; Stokes, Ian ; 
 Sriharsha Basavapatna ; Gaetan Rivet 
 ; Majd Dibbiny ; Eli Britstein 
 ; Salem Sol 
 Subject: [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute

 DPDK 20.11 introduced an ability to specify existence/non-existence of 
 VLAN tag by [1].
 Use this attribute.

 [1]: 09315fc83861 ("ethdev: add VLAN attributes to ethernet and VLAN 
 items")

 Signed-off-by: Eli Britstein 
 Reviewed-by: Salem Sol 

 Hi Eli,

 I tested this but currently we don't have support in the i40e pmd for the 
 has_vlan match attribute and with these patches it is breaking offload for 
 VLAN packets on Intel devices.
>>> Hi Emma,
>>>
>>> Thanks for testing.
>>>
>>> Is adding such support in your plans?
>>>
>>> How do you suggest to proceed? It is needed in order to fix OVS bug.
>>>
>>> Thanks,
>>>
>>> Eli
>> The "Table 1.2 rte_flow items availability in networking drivers"
>> here: https://doc.dpdk.org/guides/nics/overview.html
>> says that both ixgbe and i40e has a full support for 'vlan' and
>> 'eth' items.  Is it a bug?  Should it be 'partial' instead?
>>
>> In general, this sounds like a big limitation of rte_flow API.
>> I mean the fact that there is no way to get what is implemented by
>> a particular driver and what is not implemented in runtime.
>> Someone should, probably, work on adding this kind of API to DPDK.
>> Otherwise, we will stuck with inability to use certain actions/matches
>> unless all the drivers supports them (which is also hard to check
>> taking documentation issues into account).  If I missed it and the
>> API actually exists, we should definitely start using it.
>>
>> CC: dpdk-dev and rte_flow maintainers.
>>
>> Thoughts?
> 
> There is such an API - rte_flow_validate().
> 
> However, in OVS, as each flow is independent and can have different matches 
> and actions, we just call rte_flow_create(). The PMD (at least mlx5) first 
> internally validates it (as if rte_flow_validate() is called), and bail out 
> with a failure in case validate fails.
> 
> Can you suggest an effective way to utilize it in OVS?

This one is hard to use.  And I guess it's hard to determine
what exactly is not supported, as some things may only be
supported in certain combinations and not in others.
So, rte_flow_validate() doesn't seem to be suitable for checking
particular features and altering the flow construction based on
that.  If we had clear yes/no flags for all features that could
be easily probed/verified or otherwise retrieved from the driver,
that would be better.

But it seems to be a problem for current rte_flow implementations
in actual drivers.
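
For reference, a probe along these lines could be sketched against the public
rte_flow API as below.  The helper name and the idea of probing with a single
minimal rule are hypothetical (this is not existing OVS code), and, as noted
above, one probe may not be representative because support can depend on the
full pattern/action combination.

#include <stdbool.h>
#include <stdint.h>
#include <rte_flow.h>

/* Ask the PMD to validate a minimal ingress rule that matches on
 * eth.has_vlan; if it rejects the rule, the caller could avoid emitting
 * the has_vlan field for this port. */
static bool
port_probably_supports_has_vlan(uint16_t port_id)
{
    const struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_item_eth eth_spec = { .has_vlan = 1 };
    struct rte_flow_item_eth eth_mask = { .has_vlan = 1 };
    const struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH,
          .spec = &eth_spec, .mask = &eth_mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    const struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_DROP },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };
    struct rte_flow_error error;

    /* 0 means the PMD accepted the rule; a negative value means it was
     * rejected or could not be validated. */
    return rte_flow_validate(port_id, &attr, pattern, actions, &error) == 0;
}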

> 
> In theory, if the API exists in rte_flow, OVS should not care if all PMDs 
> support it or not.
> 
> In practice, the "has_vlan" field was introduced only in 20.11, and 
> apparently Intel has not adapted i40e PMD, so it breaks their offloads. I 
> suspected this so I've added Emma and Ian to review it.
> 
> I don't know i40e HW capabilities, but at least from PMD point of view, it 
> can be silently ignored until a proper support is added.
> 
>>
>> Best regards, Ilya Maximets.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute

2021-08-24 Thread Eli Britstein via dev



On 8/24/2021 6:47 PM, Ilya Maximets wrote:

External email: Use caution opening links or attachments


On 8/24/21 5:25 PM, Eli Britstein wrote:

On 8/24/2021 6:08 PM, Finn, Emma wrote:

External email: Use caution opening links or attachments


-Original Message-
From: Eli Britstein 
Sent: Monday 16 August 2021 14:55
To: d...@openvswitch.org; Ilya Maximets 
Cc: Finn, Emma ; Stokes, Ian ; Sriharsha Basavapatna 
; Gaetan Rivet ; Majd Dibbiny 
; Eli Britstein ; Salem Sol 
Subject: [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute

DPDK 20.11 introduced an ability to specify existence/non-existence of VLAN tag 
by [1].
Use this attribute.

[1]: 09315fc83861 ("ethdev: add VLAN attributes to ethernet and VLAN items")

Signed-off-by: Eli Britstein 
Reviewed-by: Salem Sol 

Hi Eli,

I tested this but currently we don't have support in the i40e pmd for the 
has_vlan match attribute and with these patches it is breaking offload for VLAN 
packets on Intel devices.

Hi Emma,

Thanks for testing.

Is adding such support in your plans?

How do you suggest to proceed? It is needed in order to fix OVS bug.

Thanks,

Eli

The "Table 1.2 rte_flow items availability in networking drivers"
here: https://doc.dpdk.org/guides/nics/overview.html
says that both ixgbe and i40e has a full support for 'vlan' and
'eth' items.  Is it a bug?  Should it be 'partial' instead?

In general, this sounds like a big limitation of rte_flow API.
I mean the fact that there is no way to get what is implemented by
a particular driver and what is not implemented in runtime.
Someone should, probably, work on adding this kind of API to DPDK.
Otherwise, we will stuck with inability to use certain actions/matches
unless all the drivers supports them (which is also hard to check
taking documentation issues into account).  If I missed it and the
API actually exists, we should definitely start using it.

CC: dpdk-dev and rte_flow maintainers.

Thoughts?


There is such an API - rte_flow_validate().

However, in OVS, as each flow is independent and can have different 
matches and actions, we just call rte_flow_create(). The PMD (at least 
mlx5) first internally validates it (as if rte_flow_validate() were 
called), and bails out with a failure if the validation fails.


Can you suggest an effective way to utilize it in OVS?

In theory, if the API exists in rte_flow, OVS should not care if all 
PMDs support it or not.


In practice, the "has_vlan" field was introduced only in 20.11, and 
apparently Intel has not adapted i40e PMD, so it breaks their offloads. 
I suspected this so I've added Emma and Ian to review it.


I don't know i40e HW capabilities, but at least from PMD point of view, 
it can be silently ignored until a proper support is added.




Best regards, Ilya Maximets.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2] northd: Add VIP port to established flows in DNAT table for Load Balancers

2021-08-24 Thread Numan Siddique
On Tue, Aug 24, 2021 at 11:21 AM Mark Gray  wrote:
>
> On 24/08/2021 16:15, Mark Michelson wrote:
> > Excellent, thanks for the updated patch!
> >
> > Acked-by: Mark Michelson 
>
> Thanks, and thanks for the review!

Thanks.  I applied this patch to the main branch.
With the ovsrobot run for this patch,  I see the ovn-k8s control plane
test has passed.  Great !!

Thanks
Numan

>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute

2021-08-24 Thread Ilya Maximets
On 8/24/21 5:25 PM, Eli Britstein wrote:
> 
> On 8/24/2021 6:08 PM, Finn, Emma wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> -Original Message-
>> From: Eli Britstein 
>> Sent: Monday 16 August 2021 14:55
>> To: d...@openvswitch.org; Ilya Maximets 
>> Cc: Finn, Emma ; Stokes, Ian ; 
>> Sriharsha Basavapatna ; Gaetan Rivet 
>> ; Majd Dibbiny ; Eli Britstein 
>> ; Salem Sol 
>> Subject: [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute
>>
>> DPDK 20.11 introduced an ability to specify existence/non-existence of VLAN 
>> tag by [1].
>> Use this attribute.
>>
>> [1]: 09315fc83861 ("ethdev: add VLAN attributes to ethernet and VLAN items")
>>
>> Signed-off-by: Eli Britstein 
>> Reviewed-by: Salem Sol 
>>
>> Hi Eli,
>>
>> I tested this but currently we don't have support in the i40e pmd for the 
>> has_vlan match attribute and with these patches it is breaking offload for 
>> VLAN packets on Intel devices.
> 
> Hi Emma,
> 
> Thanks for testing.
> 
> Is adding such support in your plans?
> 
> How do you suggest to proceed? It is needed in order to fix OVS bug.
> 
> Thanks,
> 
> Eli

The "Table 1.2 rte_flow items availability in networking drivers"
here: https://doc.dpdk.org/guides/nics/overview.html
says that both ixgbe and i40e has a full support for 'vlan' and
'eth' items.  Is it a bug?  Should it be 'partial' instead?

In general, this sounds like a big limitation of rte_flow API.
I mean the fact that there is no way to get what is implemented by
a particular driver and what is not implemented in runtime.
Someone should, probably, work on adding this kind of API to DPDK.
Otherwise, we will stuck with inability to use certain actions/matches
unless all the drivers supports them (which is also hard to check
taking documentation issues into account).  If I missed it and the
API actually exists, we should definitely start using it.

CC: dpdk-dev and rte_flow maintainers.

Thoughts?

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 0/3] Add multiple routing tables support to Logical Routers

2021-08-24 Thread Vladislav Odintsov
Thanks Numan for the information.

Regards,
Vladislav Odintsov

> On 24 Aug 2021, at 18:43, Numan Siddique  wrote:
> 
> On Mon, Aug 23, 2021 at 1:38 PM Odintsov Vladislav  wrote:
>> 
>> Hi,
>> 
>> I wonder whether this patch series is interesting for the project.
>> Should I wait for review or what is the process to add new functionality?
> 
> Hi Vladislav,
> 
> This is the present review queue -
> https://patchwork.ozlabs.org/project/ovn/list/
> and your patches are present.  Please expect some delay for the
> reviews from my side
> atleast.  Since they are submitted well within the cutoff time, these
> patches should be
> considered for the next release.
> 
> Thanks
> Numan
> 
>> 
>> Thanks.
>> 
>> Regards,
>> Vladislav Odintsov
>> 
>> On 17 Aug 2021, at 00:15, Vladislav Odintsov 
>> mailto:odiv...@gmail.com>> wrote:
>> 
>> This patch series extends Logical Router's routing functionality.
>> Now user may create multiple routing tables within a Logical Router
>> and assign them to Logical Router Ports.
>> 
>> Traffic coming from Logical Router Port with assigned route_table
>> is checked against global routes if any (Logical_Router_Static_Routes
> with empty route_table field), next against directly connected routes
>> and then Logical_Router_Static_Routes with same route_table value as
>> in Logical_Router_Port options:route_table field.
>> 
>> This series doesn't have ddlog support yet. It will eventually be added
>> once this series get reviewed.
>> 
>> Vladislav Odintsov (3):
>> tests: remove strict check for table=N for N > 9
>> northd: support for RouteTables in LRs
>> utilities: update ovn-nbctl with RouteTables support
>> 
>> northd/ovn-northd.8.xml |  63 --
>> northd/ovn-northd.c | 198 ++---
>> ovn-nb.ovsschema|   5 +-
>> ovn-nb.xml  |  30 +++
>> tests/ovn-ic.at |   4 +
>> tests/ovn-nbctl.at  | 165 +-
>> tests/ovn-northd.at | 268 ++-
>> tests/ovn.at| 464 
>> ++--
>> utilities/ovn-nbctl.c   | 124 ++-
>> 9 files changed, 1146 insertions(+), 175 deletions(-)
>> 
>> --
>> 2.30.0
>> 
>> 
>> ___
>> dev mailing list
>> d...@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute

2021-08-24 Thread Finn, Emma



-Original Message-
From: Eli Britstein  
Sent: Tuesday 24 August 2021 16:26
To: Finn, Emma ; d...@openvswitch.org; Ilya Maximets 

Cc: Stokes, Ian ; Sriharsha Basavapatna 
; Gaetan Rivet ; Majd 
Dibbiny ; Salem Sol 
Subject: Re: [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute


On 8/24/2021 6:08 PM, Finn, Emma wrote:
> External email: Use caution opening links or attachments
>
>
> -Original Message-
> From: Eli Britstein 
> Sent: Monday 16 August 2021 14:55
> To: d...@openvswitch.org; Ilya Maximets 
> Cc: Finn, Emma ; Stokes, Ian ; 
> Sriharsha Basavapatna ; Gaetan Rivet 
> ; Majd Dibbiny ; Eli Britstein 
> ; Salem Sol 
> Subject: [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute
>
> DPDK 20.11 introduced an ability to specify existence/non-existence of VLAN 
> tag by [1].
> Use this attribute.
>
> [1]: 09315fc83861 ("ethdev: add VLAN attributes to ethernet and VLAN items")
>
> Signed-off-by: Eli Britstein 
> Reviewed-by: Salem Sol 
>
> Hi Eli,
>
> I tested this but currently we don't have support in the i40e pmd for the 
> has_vlan match attribute and with these patches it is breaking offload for 
> VLAN packets on Intel devices.

Hi Emma,

Thanks for testing.

Is adding such support in your plans?

How do you suggest to proceed? It is needed in order to fix OVS bug.

Thanks,

Eli

Hi, 

Let me look into putting a feature request in DPDK for the i40e pmd. 

Thanks, 
Emma 

>
> Thanks,
> Emma
> ---
>   lib/netdev-offload-dpdk.c | 16 
>   1 file changed, 16 insertions(+)
>
> diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c index 
> f6706ee0c..28c4ba276 100644
> --- a/lib/netdev-offload-dpdk.c
> +++ b/lib/netdev-offload-dpdk.c
> @@ -210,6 +210,8 @@ dump_flow_pattern(struct ds *s,
>
>   ds_put_cstr(s, "eth ");
>   if (eth_spec) {
> +uint32_t has_vlan_mask;
> +
>   if (!eth_mask) {
>   eth_mask = _flow_item_eth_mask;
>   }
> @@ -222,6 +224,9 @@ dump_flow_pattern(struct ds *s,
>   DUMP_PATTERN_ITEM(eth_mask->type, "type", "0x%04"PRIx16,
> ntohs(eth_spec->type),
> ntohs(eth_mask->type));
> +has_vlan_mask = eth_mask->has_vlan ? UINT32_MAX : 0;
> +DUMP_PATTERN_ITEM(has_vlan_mask, "has_vlan", "%d",
> +  eth_spec->has_vlan, eth_mask->has_vlan);
>   }
>   ds_put_cstr(s, "/ ");
>   } else if (item->type == RTE_FLOW_ITEM_TYPE_VLAN) { @@ -1047,6 +1052,7 
> @@ parse_flow_match(struct netdev *netdev,
>struct flow_patterns *patterns,
>struct match *match)
>   {
> +struct rte_flow_item_eth *eth_spec = NULL, *eth_mask = NULL;
>   struct flow *consumed_masks;
>   uint8_t proto = 0;
>
> @@ -1092,6 +1098,11 @@ parse_flow_match(struct netdev *netdev,
>   memset(_masks->dl_src, 0, sizeof consumed_masks->dl_src);
>   consumed_masks->dl_type = 0;
>
> +spec->has_vlan = 0;
> +mask->has_vlan = 1;
> +eth_spec = spec;
> +eth_mask = mask;
> +
>   add_flow_pattern(patterns, RTE_FLOW_ITEM_TYPE_ETH, spec, mask);
>   }
>
> @@ -1108,6 +1119,11 @@ parse_flow_match(struct netdev *netdev,
>   /* Match any protocols. */
>   mask->inner_type = 0;
>
> +if (eth_spec && eth_mask) {
> +eth_spec->has_vlan = 1;
> +eth_mask->has_vlan = 1;
> +}
> +
>   add_flow_pattern(patterns, RTE_FLOW_ITEM_TYPE_VLAN, spec, mask);
>   }
>   /* For untagged matching match->wc.masks.vlans[0].tci is 0x and
> --
> 2.28.0.2311.g225365fb51
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 0/3] Add multiple routing tables support to Logical Routers

2021-08-24 Thread Numan Siddique
On Mon, Aug 23, 2021 at 1:38 PM Odintsov Vladislav  wrote:
>
> Hi,
>
> I wonder whether this patch series is interesting for the project.
> Should I wait for review or what is the process to add new functionality?

Hi Vladislav,

This is the present review queue -
https://patchwork.ozlabs.org/project/ovn/list/
and your patches are present.  Please expect some delay for the
reviews from my side
at least.  Since they are submitted well within the cutoff time, these
patches should be
considered for the next release.

Thanks
Numan

>
> Thanks.
>
> Regards,
> Vladislav Odintsov
>
> On 17 Aug 2021, at 00:15, Vladislav Odintsov 
> mailto:odiv...@gmail.com>> wrote:
>
> This patch series extends Logical Router's routing functionality.
> Now user may create multiple routing tables within a Logical Router
> and assign them to Logical Router Ports.
>
> Traffic coming from Logical Router Port with assigned route_table
> is checked against global routes if any (Logical_Router_Static_Routes
> with empty route_table field), next against directly connected routes
> and then Logical_Router_Static_Routes with same route_table value as
> in Logical_Router_Port options:route_table field.
>
> This series doesn't have ddlog support yet. It will eventually be added
> once this series get reviewed.
>
> Vladislav Odintsov (3):
>  tests: remove strict check for table=N for N > 9
>  northd: support for RouteTables in LRs
>  utilities: update ovn-nbctl with RouteTables support
>
> northd/ovn-northd.8.xml |  63 --
> northd/ovn-northd.c | 198 ++---
> ovn-nb.ovsschema|   5 +-
> ovn-nb.xml  |  30 +++
> tests/ovn-ic.at |   4 +
> tests/ovn-nbctl.at  | 165 +-
> tests/ovn-northd.at | 268 ++-
> tests/ovn.at| 464 
> ++--
> utilities/ovn-nbctl.c   | 124 ++-
> 9 files changed, 1146 insertions(+), 175 deletions(-)
>
> --
> 2.30.0
>
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute

2021-08-24 Thread Eli Britstein via dev



On 8/24/2021 6:08 PM, Finn, Emma wrote:

External email: Use caution opening links or attachments


-Original Message-
From: Eli Britstein 
Sent: Monday 16 August 2021 14:55
To: d...@openvswitch.org; Ilya Maximets 
Cc: Finn, Emma ; Stokes, Ian ; Sriharsha Basavapatna 
; Gaetan Rivet ; Majd Dibbiny 
; Eli Britstein ; Salem Sol 
Subject: [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute

DPDK 20.11 introduced an ability to specify existence/non-existence of VLAN tag 
by [1].
Use this attribute.

[1]: 09315fc83861 ("ethdev: add VLAN attributes to ethernet and VLAN items")

Signed-off-by: Eli Britstein 
Reviewed-by: Salem Sol 

Hi Eli,

I tested this, but currently we don't have support in the i40e PMD for the
has_vlan match attribute, and with these patches it breaks offload for VLAN
packets on Intel devices.


Hi Emma,

Thanks for testing.

Is adding such support in your plans?

How do you suggest we proceed? It is needed in order to fix an OVS bug.

Thanks,

Eli



Thanks,
Emma
---
  lib/netdev-offload-dpdk.c | 16 
  1 file changed, 16 insertions(+)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c index 
f6706ee0c..28c4ba276 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -210,6 +210,8 @@ dump_flow_pattern(struct ds *s,

  ds_put_cstr(s, "eth ");
  if (eth_spec) {
+uint32_t has_vlan_mask;
+
  if (!eth_mask) {
  eth_mask = _flow_item_eth_mask;
  }
@@ -222,6 +224,9 @@ dump_flow_pattern(struct ds *s,
  DUMP_PATTERN_ITEM(eth_mask->type, "type", "0x%04"PRIx16,
ntohs(eth_spec->type),
ntohs(eth_mask->type));
+has_vlan_mask = eth_mask->has_vlan ? UINT32_MAX : 0;
+DUMP_PATTERN_ITEM(has_vlan_mask, "has_vlan", "%d",
+  eth_spec->has_vlan, eth_mask->has_vlan);
  }
  ds_put_cstr(s, "/ ");
  } else if (item->type == RTE_FLOW_ITEM_TYPE_VLAN) { @@ -1047,6 +1052,7 @@ 
parse_flow_match(struct netdev *netdev,
   struct flow_patterns *patterns,
   struct match *match)
  {
+struct rte_flow_item_eth *eth_spec = NULL, *eth_mask = NULL;
  struct flow *consumed_masks;
  uint8_t proto = 0;

@@ -1092,6 +1098,11 @@ parse_flow_match(struct netdev *netdev,
  memset(_masks->dl_src, 0, sizeof consumed_masks->dl_src);
  consumed_masks->dl_type = 0;

+spec->has_vlan = 0;
+mask->has_vlan = 1;
+eth_spec = spec;
+eth_mask = mask;
+
  add_flow_pattern(patterns, RTE_FLOW_ITEM_TYPE_ETH, spec, mask);
  }

@@ -1108,6 +1119,11 @@ parse_flow_match(struct netdev *netdev,
  /* Match any protocols. */
  mask->inner_type = 0;

+if (eth_spec && eth_mask) {
+eth_spec->has_vlan = 1;
+eth_mask->has_vlan = 1;
+}
+
  add_flow_pattern(patterns, RTE_FLOW_ITEM_TYPE_VLAN, spec, mask);
  }
  /* For untagged matching match->wc.masks.vlans[0].tci is 0x and
--
2.28.0.2311.g225365fb51


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
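
As a concrete illustration of the attribute under discussion, below is a
minimal sketch (not taken from the patch) of how an application can fill an
rte_flow ETH item so that it matches only untagged packets.  It assumes
DPDK >= 20.11 headers; whether a given PMD accepts the has_vlan bit is
driver dependent, which is exactly the i40e gap raised above.

#include <string.h>
#include <rte_ether.h>
#include <rte_flow.h>

/* Fill an ETH pattern item that matches only untagged packets.  Setting
 * spec->has_vlan = 1 with the same mask would instead require a VLAN tag,
 * which is what the VLAN-tagged path of the patch does. */
static void
eth_item_untagged(struct rte_flow_item *item,
                  struct rte_flow_item_eth *spec,
                  struct rte_flow_item_eth *mask)
{
    memset(spec, 0, sizeof *spec);
    memset(mask, 0, sizeof *mask);

    spec->has_vlan = 0;   /* No VLAN tag present... */
    mask->has_vlan = 1;   /* ...and the bit takes part in the match. */

    item->type = RTE_FLOW_ITEM_TYPE_ETH;
    item->spec = spec;
    item->mask = mask;
    item->last = NULL;
}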


Re: [ovs-dev] [PATCH ovn v2] northd: Add VIP port to established flows in DNAT table for Load Balancers

2021-08-24 Thread Mark Gray
On 24/08/2021 16:15, Mark Michelson wrote:
> Excellent, thanks for the updated patch!
> 
> Acked-by: Mark Michelson 

Thanks, and thanks for the review!

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2] northd: Add VIP port to established flows in DNAT table for Load Balancers

2021-08-24 Thread Mark Michelson

Excellent, thanks for the updated patch!

Acked-by: Mark Michelson 

On 8/23/21 4:37 PM, Mark Gray wrote:

When adding a load balancer to a logical router, two flows are added to
the ingress DNAT table. One flow is for established connections and one is
for new connections. They have the following form:

ct.est && ip4 && reg0 == 10.0.0.10 && ct_label.natted == 1 && tcp

As the established flow does not specify the VIP port, if two load
balancers are added with the same VIP but different VIP ports, then
two conflicting flows will be added. For example,

ct.est && ip4 && reg0 == 10.0.0.10 && ct_label.natted == 1 && tcp
ct.est && ip4 && reg0 == 10.0.0.10 && ct_label.natted == 1 && tcp

This normally does not cause an issue, as both flows will have the same
action: next.

However, if the logical router specifies "force_snat_for_lb" and one
load balancer specifies "skip_snat" then both flows will have the
same match but different, conflicting actions: "flags.force_snat_for_lb = 1; 
next;"
and "flags.skip_snat_for_lb = 1; next;". This can cause unintended
consequences.

This commit adds the VIP port to the DNAT flow. It also updates
the defrag table to save that port in a register (before it gets
DNATted).

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1995326
Signed-off-by: Mark Gray 
---

Notes:
 v2: Addressed Mark's comments
 - Change registers

  northd/ovn-northd.8.xml |  52 +++---
  northd/ovn-northd.c |  32 ++---
  northd/ovn_northd.dl|  38 ++
  tests/ovn-northd.at | 150 
  tests/ovn.at|   6 +-
  5 files changed, 154 insertions(+), 124 deletions(-)

diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index 9b69e4e5750e..eebf0d717999 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -2855,15 +2855,18 @@ icmp6 {
If load balancing rules with virtual IP addresses and ports are
configured in OVN_Northbound database for a Gateway router,
a priority-110 flow is added for each configured virtual IP address
-  VIP and protocol PROTO. For IPv4 VIPs
-  the flow matches ip  ip4.dst == VIP 
-  PROTO. For IPv6 VIPs, the flow matches
+  VIP, protocol PROTO and port PORT.
+  For IPv4 VIPs the flow matches
+  ip  ip4.dst == VIP 
+  PROTO  PROTO.dst ==
+  PORT. For IPv6 VIPs, the flow matches
ip  ip6.dst == VIP 
-  PROTO. The flow applies the action reg0 =
-  VIP; ct_dnat; (or xxreg0 for IPv6) to send
-  IP packets to the connection tracker for packet de-fragmentation and to
-  dnat the destination IP for the committed connection before sending it to
-  the next table.
+  PROTO  PROTO.dst ==
+  PORT. The flow applies the action reg0 =
+  VIP; reg9[16..31] = PROTO.dst; ct_dnat;
+  (or xxreg0 for IPv6) to send IP packets to the connection
+  tracker for packet de-fragmentation and to dnat the destination IP for
+  the committed connection before sending it to the next table.
  
  
  

@@ -2913,14 +2916,14 @@ icmp6 {
includes a L4 port PORT of protocol P and IPv4
or IPv6 address VIP, a priority-120 flow that matches on
ct.new  ip  reg0 == VIP
-   P  P.dst == PORT
-   (xxreg0 == VIP in the IPv6
-  case) with an action of ct_lb(args),
-  where args contains comma separated IPv4 or IPv6 addresses
-  (and optional port numbers) to load balance to.  If the router is
-  configured to force SNAT any load-balanced packets, the above action
-  will be replaced by flags.force_snat_for_lb = 1;
-  ct_lb(args);.
+   P  reg9[16..31] == 
+  PORT (xxreg0 == VIP
+  in the IPv6 case) with an action of
+  ct_lb(args), where args contains
+  comma separated IPv4 or IPv6 addresses (and optional port numbers) to
+  load balance to.  If the router is configured to force SNAT any
+  load-balanced packets, the above action will be replaced by
+  flags.force_snat_for_lb = 1; ct_lb(args);.
If the load balancing rule is configured with skip_snat
set to true, the above action will be replaced by
flags.skip_snat_for_lb = 1; ct_lb(args);.
@@ -2945,14 +2948,15 @@ icmp6 {
PORT of protocol P and IPv4 or IPv6 address
VIP, a priority-120 flow that matches on
ct.est  ip4  reg0 == VIP
-   P  P.dst == PORT
-   (ip6 and xxreg0 == VIP
-   in the IPv6 case) with an action of next;. If
-  the router is configured to force SNAT any load-balanced packets, the
-  above action will be replaced by flags.force_snat_for_lb = 1;
-  next;. If the load balancing rule is configured with
-  skip_snat set to true, the above action will be replaced
-  by flags.skip_snat_for_lb = 1; next;.
+   P  reg9[16..31] == 
+  PORT 

Re: [ovs-dev] [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute

2021-08-24 Thread Finn, Emma


-Original Message-
From: Eli Britstein  
Sent: Monday 16 August 2021 14:55
To: d...@openvswitch.org; Ilya Maximets 
Cc: Finn, Emma ; Stokes, Ian ; 
Sriharsha Basavapatna ; Gaetan Rivet 
; Majd Dibbiny ; Eli Britstein 
; Salem Sol 
Subject: [PATCH V3 1/2] netdev-offload-dpdk: Use has_vlan match attribute

DPDK 20.11 introduced the ability to specify the existence/non-existence of a
VLAN tag by [1].
Use this attribute.

[1]: 09315fc83861 ("ethdev: add VLAN attributes to ethernet and VLAN items")

Signed-off-by: Eli Britstein 
Reviewed-by: Salem Sol 

Hi Eli,

I tested this, but currently we don't have support in the i40e PMD for the
has_vlan match attribute, and with these patches it breaks offload for VLAN
packets on Intel devices.

Thanks, 
Emma 
---
 lib/netdev-offload-dpdk.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c index 
f6706ee0c..28c4ba276 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -210,6 +210,8 @@ dump_flow_pattern(struct ds *s,
 
 ds_put_cstr(s, "eth ");
 if (eth_spec) {
+uint32_t has_vlan_mask;
+
 if (!eth_mask) {
 eth_mask = _flow_item_eth_mask;
 }
@@ -222,6 +224,9 @@ dump_flow_pattern(struct ds *s,
 DUMP_PATTERN_ITEM(eth_mask->type, "type", "0x%04"PRIx16,
   ntohs(eth_spec->type),
   ntohs(eth_mask->type));
+has_vlan_mask = eth_mask->has_vlan ? UINT32_MAX : 0;
+DUMP_PATTERN_ITEM(has_vlan_mask, "has_vlan", "%d",
+  eth_spec->has_vlan, eth_mask->has_vlan);
 }
 ds_put_cstr(s, "/ ");
 } else if (item->type == RTE_FLOW_ITEM_TYPE_VLAN) { @@ -1047,6 +1052,7 @@ 
parse_flow_match(struct netdev *netdev,
  struct flow_patterns *patterns,
  struct match *match)
 {
+struct rte_flow_item_eth *eth_spec = NULL, *eth_mask = NULL;
 struct flow *consumed_masks;
 uint8_t proto = 0;
 
@@ -1092,6 +1098,11 @@ parse_flow_match(struct netdev *netdev,
 memset(_masks->dl_src, 0, sizeof consumed_masks->dl_src);
 consumed_masks->dl_type = 0;
 
+spec->has_vlan = 0;
+mask->has_vlan = 1;
+eth_spec = spec;
+eth_mask = mask;
+
 add_flow_pattern(patterns, RTE_FLOW_ITEM_TYPE_ETH, spec, mask);
 }
 
@@ -1108,6 +1119,11 @@ parse_flow_match(struct netdev *netdev,
 /* Match any protocols. */
 mask->inner_type = 0;
 
+if (eth_spec && eth_mask) {
+eth_spec->has_vlan = 1;
+eth_mask->has_vlan = 1;
+}
+
 add_flow_pattern(patterns, RTE_FLOW_ITEM_TYPE_VLAN, spec, mask);
 }
 /* For untagged matching match->wc.masks.vlans[0].tci is 0x and
--
2.28.0.2311.g225365fb51

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 6/6] dpif-netdev/mfex: Avoid hashing when opt mfex called

2021-08-24 Thread 0-day Robot
Bleep bloop.  Greetings Kumar Amber, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
ERROR: Author kumar Amber  needs to sign off.
WARNING: Unexpected sign-offs from developers who are not authors or co-authors 
or committers: Kumar Amber 
Lines checked: 41, Warnings: 1, Errors: 1


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 4/6] dpif-netdev/mfex: Add ipv4 profile based hashing

2021-08-24 Thread 0-day Robot
Bleep bloop.  Greetings Kumar Amber, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
ERROR: Author kumar Amber  needs to sign off.
WARNING: Unexpected sign-offs from developers who are not authors or co-authors 
or committers: Kumar Amber 
Lines checked: 172, Warnings: 1, Errors: 1


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 3/6] dpif-netdev/mfex: Add packet hash check to autovalidator

2021-08-24 Thread 0-day Robot
Bleep bloop.  Greetings Kumar Amber, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
ERROR: Author kumar Amber  needs to sign off.
WARNING: Unexpected sign-offs from developers who are not authors or co-authors 
or committers: Kumar Amber 
Lines checked: 46, Warnings: 1, Errors: 1


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3 2/2] ovsdb-idl: Exclude missing tables and columns in the transaction.

2021-08-24 Thread Numan Siddique
On Tue, Aug 24, 2021 at 7:31 AM Ilya Maximets  wrote:
>
> On 8/24/21 12:59 AM, num...@ovn.org wrote:
> > From: Numan Siddique 
> >
> > In the cases where the C idl client is compiled with a newer schema
> > and the ovsdb-server is running with older schema, the IDL clients
> > can included tables or columns in the transaction which are missing
> > in the server schema.  ovsdb-server will reject the transaction,
> > but the IDL client keeps on trying the transaction resulting
> > in a continuous loop.  This patch fixes this issue.
> >
> > For the missing tables, the function ovsdb_idl_txn_insert() will now
> > return NULL.  The client should check for NULL before accessing the
> > row object.  For the missing columns, the columns are excluded from
> > the transaction object.
> This doesn't look safe, because none of existing clients are checking
> the return value of IDL wrapper functions and are using it right away.
>

Thanks for the review.  I had this concern too.

> Can we just fail the transaction before sending it to the server?
> Clients are required to check the transaction status.  Clients will
> need to check for existence of the certain columns before adding
> them to transactions though with the API from patch #1.

Either we can fail the transaction at the beginning or just exclude that
table/column when sending the transaction.  Can you please let me know
which you would prefer?

I'd prefer the latter, because if the clients can check the existence of
tables/columns, then probably those won't even be included in the
transaction in the first place.  Also, by checking the status, can the
clients know which table/column caused the failure?  So it seems better
to me to just ignore such tables/columns in the transaction, just like
the monitor request ignores such tables/columns.

Is it possible to consider patch #1 (if it looks fine), please, as we need
it urgently to unblock an upgrade issue we have?  With that, ovn-controller
can check for the existence of a table and exclude it from the transaction.

I'll work on patch #2 meanwhile.

Thanks
Numan



>
> Best regards, Ilya Maximets.
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
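
A rough sketch of the client-side pattern described above (probe the server
schema first and skip the optional table), under the assumption that patch #1
of this series provides a presence-check helper.  The helper name, table name
and error handling below are placeholders for illustration only, not the
exact API.

#include <stdbool.h>
#include "ovsdb-idl.h"

/* Placeholder for the presence-check API added in patch #1 of this series;
 * the real name and signature are defined there. */
bool server_has_table(const struct ovsdb_idl *idl, const char *table_name);

static void
maybe_update_optional_table(struct ovsdb_idl *idl)
{
    struct ovsdb_idl_txn *txn = ovsdb_idl_txn_create(idl);

    if (server_has_table(idl, "Optional_Table")) {
        /* The connected ovsdb-server knows about the table, so it is safe
         * to insert or modify its rows here via the generated IDL helpers. */
    } else {
        /* Older server schema: skip the optional rows entirely, so the
         * transaction cannot be rejected for an unknown table. */
    }

    ovsdb_idl_txn_commit(txn);      /* Or the commit-and-wait loop helpers. */
    ovsdb_idl_txn_destroy(txn);
}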


[ovs-dev] [PATCH v2 0/6] MFEX Optimizations IPv6 + Hashing

2021-08-24 Thread Kumar Amber
---
v2:
- fix the CI build.
- fix check-patch for co-author
---

The patch set introduces AVX512 optimizations for IPv6
traffic profiles and hashing improvements for all AVX512-supported
traffic profiles for IPv4 and IPv6.

Kumar Amber (3):
  dpif-netdev/mfex: Add AVX512 basic ipv6 traffic profiles
  dpif-netdev/mfex: Add AVX512 vlan ipv6 traffic profiles
  dpif-netdev/mfex: Add ipv6 profile based hashing

kumar Amber (3):
  dpif-netdev/mfex: Add packet hash check to autovalidator
  dpif-netdev/mfex: Add ipv4 profile based hashing
  dpif-netdev/mfex: Avoid hashing when opt mfex called

 NEWS  |   7 +
 lib/automake.mk   |   1 +
 lib/dpif-netdev-avx512.c  |   6 +-
 lib/dpif-netdev-extract-avx512.c  | 348 +-
 lib/dpif-netdev-private-extract.c |  63 +-
 lib/dpif-netdev-private-extract.h |  12 ++
 tests/pcap/mfex_test.pcap | Bin 416 -> 632 bytes
 7 files changed, 432 insertions(+), 5 deletions(-)

-- 
2.25.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 1/6] dpif-netdev/mfex: Add AVX512 basic ipv6 traffic profiles

2021-08-24 Thread Kumar Amber
Add AVX512 IPv6 optimized profile for IPv6/UDP and
IPv6/TCP.

The MFEX autovalidation test-case already has IPv6 support for
validating against the scalar mfex.

Signed-off-by: Kumar Amber 
Signed-off-by: Harry van Haaren 
Co-authored-by: Harry van Haaren 

---
v2:
- Fix CI build error
- Fix check-patch sign-offs
---
 NEWS  |   3 +
 lib/automake.mk   |   1 +
 lib/dpif-netdev-extract-avx512.c  | 140 +-
 lib/dpif-netdev-private-extract.c |  28 +-
 lib/dpif-netdev-private-extract.h |   6 ++
 tests/pcap/mfex_test.pcap | Bin 416 -> 632 bytes
 6 files changed, 176 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index 1f2adf718..f18e2c572 100644
--- a/NEWS
+++ b/NEWS
@@ -8,6 +8,9 @@ Post-v2.16.0
by default.  'other_config:dpdk-socket-limit' can be set equal to
the 'other_config:dpdk-socket-mem' to preserve the legacy memory
limiting behavior.
+   - Userspace datapath:
+ * Add AVX512 optimized profiles to miniflow extract for IPv6/UDP and
+   IPv6/TCP.
 
 
 v2.16.0 - 16 Aug 2021
diff --git a/lib/automake.mk b/lib/automake.mk
index 8ac138f71..245c0886c 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -33,6 +33,7 @@ lib_libopenvswitchavx512_la_CFLAGS = \
-mavx512f \
-mavx512bw \
-mavx512dq \
+   -mavx512vl \
-mbmi \
-mbmi2 \
-fPIC \
diff --git a/lib/dpif-netdev-extract-avx512.c b/lib/dpif-netdev-extract-avx512.c
index ec64419e3..3384a8dba 100644
--- a/lib/dpif-netdev-extract-avx512.c
+++ b/lib/dpif-netdev-extract-avx512.c
@@ -49,6 +49,8 @@
 #include "dpif-netdev-private-extract.h"
 #include "dpif-netdev-private-flow.h"
 
+#define plen ip6_ctlun.ip6_un1.ip6_un1_plen
+
 /* AVX512-BW level permutex2var_epi8 emulation. */
 static inline __m512i
 __attribute__((target("avx512bw")))
@@ -137,6 +139,7 @@ _mm512_maskz_permutexvar_epi8_wrap(__mmask64 kmask, __m512i 
idx, __m512i a)
 #define PATTERN_ETHERTYPE_MASK PATTERN_ETHERTYPE_GEN(0xFF, 0xFF)
 #define PATTERN_ETHERTYPE_IPV4 PATTERN_ETHERTYPE_GEN(0x08, 0x00)
 #define PATTERN_ETHERTYPE_DT1Q PATTERN_ETHERTYPE_GEN(0x81, 0x00)
+#define PATTERN_ETHERTYPE_IPV6 PATTERN_ETHERTYPE_GEN(0x86, 0xDD)
 
 /* VLAN (Dot1Q) patterns and masks. */
 #define PATTERN_DT1Q_MASK   \
@@ -192,6 +195,25 @@ _mm512_maskz_permutexvar_epi8_wrap(__mmask64 kmask, 
__m512i idx, __m512i a)
   NU, NU, NU, NU, NU, NU, NU, NU, 38, 39, 40, 41, NU, NU, NU, NU, /* TCP */   \
   NU, NU, NU, NU, NU, NU, NU, NU, /* Unused. */
 
+/* Generator for checking IPv6 ver. */
+#define PATTERN_IPV6_GEN(VER_TRC, PROTO)  \
+  VER_TRC, /* Version: 4bits and Traffic class: 4bits. */ \
+  0, 0, 0, /* Traffic class: 4bits and Flow Label: 24bits. */ \
+  0, 0,/* Payload length 16bits. */   \
+  PROTO, 0,/* Next Header 8bits and Hop limit 8bits. */   \
+  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* Src IP: 128bits. */  \
+  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* Dst IP: 128bits. */
+
+#define PATTERN_IPV6_MASK PATTERN_IPV6_GEN(0xF0, 0xFF)
+#define PATTERN_IPV6_UDP PATTERN_IPV6_GEN(0x60, 0x11)
+#define PATTERN_IPV6_TCP PATTERN_IPV6_GEN(0x60, 0x06)
+
+#define PATTERN_IPV6_SHUFFLE  \
+   0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, NU, NU, /* Ether */ \
+  22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, /* IPv6 */  \
+  38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, /* IPv6 */  \
+  NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, /* Unused */
+
 /* Generation of K-mask bitmask values, to zero out data in result. Note that
  * these correspond 1:1 to the above "*_SHUFFLE" values, and bit used must be
  * set in this K-mask, and "NU" values must be zero in the k-mask. Each mask
@@ -204,6 +226,8 @@ _mm512_maskz_permutexvar_epi8_wrap(__mmask64 kmask, __m512i 
idx, __m512i a)
 #define KMASK_IPV4  0xF0FFULL
 #define KMASK_UDP   0x000FULL
 #define KMASK_TCP   0x0F00ULL
+#define KMASK_IPV6  0xULL
+#define KMASK_ETHER_IPV6 0x3FFFULL
 
 #define PATTERN_IPV4_UDP_KMASK \
 (KMASK_ETHER | (KMASK_IPV4 << 16) | (KMASK_UDP << 32))
@@ -217,6 +241,9 @@ _mm512_maskz_permutexvar_epi8_wrap(__mmask64 kmask, __m512i 
idx, __m512i a)
 #define PATTERN_DT1Q_IPV4_TCP_KMASK \
 (KMASK_ETHER | (KMASK_DT1Q << 16) | (KMASK_IPV4 << 24) | (KMASK_TCP << 40))
 
+#define PATTERN_IPV6_KMASK \
+(KMASK_ETHER_IPV6 | (KMASK_IPV6 << 16) | (KMASK_IPV6 << 32))
+
 /* This union allows initializing static data as u8, but easily loading it
  * into AVX512 registers too. The union ensures proper alignment for the zmm.
  */
@@ -295,6 +322,8 @@ enum MFEX_PROFILES {
 PROFILE_ETH_IPV4_TCP,
 PROFILE_ETH_VLAN_IPV4_UDP,
 PROFILE_ETH_VLAN_IPV4_TCP,
+
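
For readers new to the MFEX profiles: each profile above carries a
probe_mask/probe_data pair, and a packet is only handled by a profile when
its leading bytes, ANDed with the mask, equal the data.  A scalar sketch of
that probe step is below (the AVX512 code does the equivalent for a whole
64-byte block at once); the function is illustrative, not part of the patch.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar equivalent of the profile probe: a packet is a candidate for a
 * profile only if every byte, ANDed with the profile's probe mask, equals
 * the profile's probe data, e.g. EtherType 0x86DD and next header 0x06
 * for the Ether/IPv6/TCP profile. */
static bool
profile_probe_matches(const uint8_t *pkt, const uint8_t *probe_mask,
                      const uint8_t *probe_data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((pkt[i] & probe_mask[i]) != probe_data[i]) {
            return false;
        }
    }
    return true;
}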

[ovs-dev] [PATCH v2 4/6] dpif-netdev/mfex: Add ipv4 profile based hashing

2021-08-24 Thread Kumar Amber
From: kumar Amber 

This commit adds IPv4 profile-specific hashing, which
uses fixed offsets into the packet to improve hashing
performance.

Hash value is autovalidated by MFEX autovalidator.

Signed-off-by: Kumar Amber 
Signed-off-by: Harry van Haaren 
Co-authored-by: Harry van Haaren 

---
v2:
- Fix check-patch sign-offs
---
 NEWS |  1 +
 lib/dpif-netdev-extract-avx512.c | 57 
 2 files changed, 58 insertions(+)

diff --git a/NEWS b/NEWS
index 959df3add..6d4e271fd 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,7 @@ Post-v2.16.0
IPv6/TCP.
  * Add AVX512 optimized profiles to miniflow extract for VLAN/IPv6/UDP
and VLAN/IPv6/TCP.
+ * Add IPv4 profile based 5tuple hashing optimizations.
 
 
 v2.16.0 - 16 Aug 2021
diff --git a/lib/dpif-netdev-extract-avx512.c b/lib/dpif-netdev-extract-avx512.c
index 11bca0144..196ec1625 100644
--- a/lib/dpif-netdev-extract-avx512.c
+++ b/lib/dpif-netdev-extract-avx512.c
@@ -297,6 +297,9 @@ struct mfex_profile {
 uint64_t mf_bits[FLOWMAP_UNITS];
 uint16_t dp_pkt_offs[4];
 uint16_t dp_pkt_min_size;
+
+/* Constant data offsets for Hashing. */
+uint8_t hash_pkt_offs[6];
 };
 
 /* Ensure dp_pkt_offs[4] is the correct size as in struct dp_packet. */
@@ -350,6 +353,13 @@ enum MFEX_PROFILES {
 PROFILE_COUNT,
 };
 
+/* Packet offsets for 5 tuple Hash function. */
+#define HASH_IPV4 \
+26, 30, 23, 34, 0, 0
+
+#define HASH_DT1Q_IPV4 \
+30, 34, 27, 38, 0, 0
+
 /* Static const instances of profiles. These are compile-time constants,
  * and are specialized into individual miniflow-extract functions.
  * NOTE: Order of the fields is significant, any change in the order must be
@@ -369,6 +379,8 @@ static const struct mfex_profile 
mfex_profiles[PROFILE_COUNT] =
 0, UINT16_MAX, 14, 34,
 },
 .dp_pkt_min_size = 42,
+
+.hash_pkt_offs = { HASH_IPV4 },
 },
 
 [PROFILE_ETH_IPV4_TCP] = {
@@ -383,6 +395,8 @@ static const struct mfex_profile 
mfex_profiles[PROFILE_COUNT] =
 0, UINT16_MAX, 14, 34,
 },
 .dp_pkt_min_size = 54,
+
+.hash_pkt_offs = { HASH_IPV4 },
 },
 
 [PROFILE_ETH_VLAN_IPV4_UDP] = {
@@ -401,6 +415,8 @@ static const struct mfex_profile 
mfex_profiles[PROFILE_COUNT] =
 14, UINT16_MAX, 18, 38,
 },
 .dp_pkt_min_size = 46,
+
+.hash_pkt_offs = { HASH_DT1Q_IPV4 },
 },
 
 [PROFILE_ETH_VLAN_IPV4_TCP] = {
@@ -419,6 +435,8 @@ static const struct mfex_profile 
mfex_profiles[PROFILE_COUNT] =
 14, UINT16_MAX, 18, 38,
 },
 .dp_pkt_min_size = 46,
+
+.hash_pkt_offs = { HASH_DT1Q_IPV4 },
 },
 
 [PROFILE_ETH_IPV6_UDP] = {
@@ -530,6 +548,34 @@ mfex_ipv6_set_l2_pad_size(struct dp_packet *pkt,
 dp_packet_set_l2_pad_size(pkt, payload_size_ipv6 - p_len);
 }
 
+static inline void
+mfex_5tuple_hash_ipv4(struct dp_packet *packet, const uint8_t *pkt,
+  struct netdev_flow_key *key,
+  const uint8_t *pkt_offsets)
+{
+if (!dp_packet_rss_valid(packet)) {
+uint32_t hash = 0;
+void *ipv4_src = (void *) [pkt_offsets[0]];
+void *ipv4_dst = (void *) [pkt_offsets[1]];
+void *ports_l4 = (void *) [pkt_offsets[3]];
+
+/* IPv4 Src and Dst. */
+hash = hash_add(hash, *(uint32_t *) ipv4_src);
+hash = hash_add(hash, *(uint32_t *) ipv4_dst);
+/* IPv4 proto. */
+hash = hash_add(hash, pkt[pkt_offsets[2]]);
+/* L4 ports. */
+hash = hash_add(hash, *(uint32_t *) ports_l4);
+hash = hash_finish(hash, 42);
+
+dp_packet_set_rss_hash(packet, hash);
+key->hash = hash;
+} else {
+key->hash = dp_packet_get_rss_hash(packet);
+}
+key->len = netdev_flow_key_size(miniflow_n_values(>mf));
+}
+
 /* Protocol specific helper functions, for calculating offsets/lenghts. */
 static int32_t
 mfex_ipv4_set_l2_pad_size(struct dp_packet *pkt, struct ip_header *nh,
@@ -664,6 +710,9 @@ mfex_avx512_process(struct dp_packet_batch *packets,
 /* Process TCP flags, and store to blocks. */
 const struct tcp_header *tcp = (void *)[38];
 mfex_handle_tcp_flags(tcp, [7]);
+
+mfex_5tuple_hash_ipv4(packet, pkt, [i],
+  profile->hash_pkt_offs);
 } break;
 
 case PROFILE_ETH_VLAN_IPV4_UDP: {
@@ -674,6 +723,9 @@ mfex_avx512_process(struct dp_packet_batch *packets,
 if (mfex_ipv4_set_l2_pad_size(packet, nh, size_from_ipv4)) {
 continue;
 }
+
+mfex_5tuple_hash_ipv4(packet, pkt, [i],
+  profile->hash_pkt_offs);
 } break;
 
 case PROFILE_ETH_IPV4_TCP: {
@@ -687,6 +739,9 @@ mfex_avx512_process(struct dp_packet_batch *packets,
 if (mfex_ipv4_set_l2_pad_size(packet, nh, 

[ovs-dev] [PATCH v2 6/6] dpif-netdev/mfex: Avoid hashing when opt mfex called

2021-08-24 Thread Kumar Amber
From: kumar Amber 

This patch avoids calculating the software hash of the packet again
if the optimized miniflow-extract hit and has already calculated the
packet hash. In cases of scalar miniflow extract, the normal hashing
calculation is performed.

Signed-off-by: Kumar Amber 
---
 lib/dpif-netdev-avx512.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c
index 544d36903..2188abfd9 100644
--- a/lib/dpif-netdev-avx512.c
+++ b/lib/dpif-netdev-avx512.c
@@ -210,15 +210,15 @@ dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread 
*pmd,
 if (!mfex_hit) {
 /* Do a scalar miniflow extract into keys. */
 miniflow_extract(packet, >mf);
+key->len = netdev_flow_key_size(miniflow_n_values(>mf));
+key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
+ >mf);
 }
 
 /* Cache TCP and byte values for all packets. */
 pkt_meta[i].bytes = dp_packet_size(packet);
 pkt_meta[i].tcp_flags = miniflow_get_tcp_flags(>mf);
 
-key->len = netdev_flow_key_size(miniflow_n_values(>mf));
-key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet, >mf);
-
 if (emc_enabled) {
 f = emc_lookup(>emc_cache, key);
 
-- 
2.25.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 5/6] dpif-netdev/mfex: Add ipv6 profile based hashing

2021-08-24 Thread Kumar Amber
This commit adds IPv6 profile-specific hashing, which
uses fixed offsets into the packet to improve hashing
performance.

Hash value is autovalidated by MFEX autovalidator.

Signed-off-by: Kumar Amber 
Signed-off-by: Harry van Haaren 
Co-authored-by: Harry van Haaren 

---
v2:
- Fix check-patch sign-offs
---
 NEWS |  1 +
 lib/dpif-netdev-extract-avx512.c | 57 
 2 files changed, 58 insertions(+)

diff --git a/NEWS b/NEWS
index 6d4e271fd..9e9f60825 100644
--- a/NEWS
+++ b/NEWS
@@ -14,6 +14,7 @@ Post-v2.16.0
  * Add AVX512 optimized profiles to miniflow extract for VLAN/IPv6/UDP
and VLAN/IPv6/TCP.
  * Add IPv4 profile based 5tuple hashing optimizations.
+ * Add IPv6 profile based 5tuple hashing optimizations.
 
 
 v2.16.0 - 16 Aug 2021
diff --git a/lib/dpif-netdev-extract-avx512.c b/lib/dpif-netdev-extract-avx512.c
index 196ec1625..a16ba06b7 100644
--- a/lib/dpif-netdev-extract-avx512.c
+++ b/lib/dpif-netdev-extract-avx512.c
@@ -360,6 +360,12 @@ enum MFEX_PROFILES {
 #define HASH_DT1Q_IPV4 \
 30, 34, 27, 38, 0, 0
 
+#define HASH_IPV6 \
+22, 30, 38, 46, 20, 54
+
+#define HASH_DT1Q_IPV6 \
+26, 34, 42, 50, 24, 58
+
 /* Static const instances of profiles. These are compile-time constants,
  * and are specialized into individual miniflow-extract functions.
  * NOTE: Order of the fields is significant, any change in the order must be
@@ -451,6 +457,8 @@ static const struct mfex_profile 
mfex_profiles[PROFILE_COUNT] =
 0, UINT16_MAX, 14, 54,
 },
 .dp_pkt_min_size = 54,
+
+.hash_pkt_offs = { HASH_IPV6 },
 },
 
 [PROFILE_ETH_IPV6_TCP] = {
@@ -465,6 +473,8 @@ static const struct mfex_profile 
mfex_profiles[PROFILE_COUNT] =
 0, UINT16_MAX, 14, 54,
 },
 .dp_pkt_min_size = 54,
+
+.hash_pkt_offs = { HASH_IPV6 },
 },
 
 [PROFILE_ETH_VLAN_IPV6_TCP] = {
@@ -481,6 +491,8 @@ static const struct mfex_profile 
mfex_profiles[PROFILE_COUNT] =
 14, UINT16_MAX, 18, 58,
 },
 .dp_pkt_min_size = 66,
+
+.hash_pkt_offs = { HASH_DT1Q_IPV6 },
 },
 
 [PROFILE_ETH_VLAN_IPV6_UDP] = {
@@ -497,6 +509,8 @@ static const struct mfex_profile 
mfex_profiles[PROFILE_COUNT] =
 14, UINT16_MAX, 18, 58,
 },
 .dp_pkt_min_size = 66,
+
+.hash_pkt_offs = { HASH_DT1Q_IPV6 },
 },
 };
 
@@ -576,6 +590,38 @@ mfex_5tuple_hash_ipv4(struct dp_packet *packet, const 
uint8_t *pkt,
 key->len = netdev_flow_key_size(miniflow_n_values(>mf));
 }
 
+static inline void
+mfex_5tuple_hash_ipv6(struct dp_packet *packet, const uint8_t *pkt,
+  struct netdev_flow_key *key,
+  const uint8_t *pkt_offsets)
+{
+if (!dp_packet_rss_valid(packet)) {
+uint32_t hash = 0;
+void *ipv6_src_lo = (void *) [pkt_offsets[0]];
+void *ipv6_src_hi = (void *) [pkt_offsets[1]];
+void *ipv6_dst_lo = (void *) [pkt_offsets[2]];
+void *ipv6_dst_hi = (void *) [pkt_offsets[3]];
+void *ports_l4 = (void *) [pkt_offsets[5]];
+
+/* IPv6 Src and Dst. */
+hash = hash_add64(hash, *(uint64_t *) ipv6_src_lo);
+hash = hash_add64(hash, *(uint64_t *) ipv6_src_hi);
+hash = hash_add64(hash, *(uint64_t *) ipv6_dst_lo);
+hash = hash_add64(hash, *(uint64_t *) ipv6_dst_hi);
+/* IPv6 proto. */
+hash = hash_add(hash, pkt[pkt_offsets[4]]);
+/* L4 ports. */
+hash = hash_add(hash, *(uint32_t *) ports_l4);
+hash = hash_finish(hash, 42);
+
+dp_packet_set_rss_hash(packet, hash);
+key->hash = hash;
+} else {
+key->hash = dp_packet_get_rss_hash(packet);
+}
+key->len = netdev_flow_key_size(miniflow_n_values(>mf));
+}
+
 /* Protocol specific helper functions, for calculating offsets/lenghts. */
 static int32_t
 mfex_ipv4_set_l2_pad_size(struct dp_packet *pkt, struct ip_header *nh,
@@ -769,6 +815,9 @@ mfex_avx512_process(struct dp_packet_batch *packets,
 /* Process UDP header. */
 mfex_handle_ipv6_l4((void *)[54], [9]);
 
+mfex_5tuple_hash_ipv6(packet, pkt, [i],
+  profile->hash_pkt_offs);
+
 } break;
 
 case PROFILE_ETH_IPV6_TCP: {
@@ -786,6 +835,9 @@ mfex_avx512_process(struct dp_packet_batch *packets,
 const struct tcp_header *tcp = (void *)[54];
 mfex_handle_tcp_flags(tcp, [9]);
 
+mfex_5tuple_hash_ipv6(packet, pkt, [i],
+  profile->hash_pkt_offs);
+
 } break;
 
 case PROFILE_ETH_VLAN_IPV6_TCP: {
@@ -806,6 +858,9 @@ mfex_avx512_process(struct dp_packet_batch *packets,
 const struct tcp_header *tcp = (void *)[58];
 mfex_handle_tcp_flags(tcp, [10]);
 
+mfex_5tuple_hash_ipv6(packet, pkt, [i],
+  

[ovs-dev] [PATCH v2 3/6] dpif-netdev/mfex: Add packet hash check to autovalidator

2021-08-24 Thread Kumar Amber
From: kumar Amber 

This patch adds the per-profile AVX512 optimized hashing to the
autovalidator for validating the hash values against the scalar hash.

Signed-off-by: Kumar Amber 
---
 lib/dpif-netdev-private-extract.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/lib/dpif-netdev-private-extract.c 
b/lib/dpif-netdev-private-extract.c
index b3d96075c..263629903 100644
--- a/lib/dpif-netdev-private-extract.c
+++ b/lib/dpif-netdev-private-extract.c
@@ -303,6 +303,9 @@ dpif_miniflow_extract_autovalidator(struct dp_packet_batch 
*packets,
 DP_PACKET_BATCH_FOR_EACH (i, packet, packets) {
 pkt_metadata_init(>md, in_port);
 miniflow_extract(packet, [i].mf);
+keys[i].len = netdev_flow_key_size(miniflow_n_values([i].mf));
+keys[i].hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
+[i].mf);
 
 /* Store known good metadata to compare with optimized metadata. */
 good_l2_5_ofs[i] = packet->l2_5_ofs;
@@ -352,6 +355,15 @@ dpif_miniflow_extract_autovalidator(struct dp_packet_batch 
*packets,
 failed = 1;
 }
 
+/* Check hashes are equal. */
+if ((keys[i].hash != test_keys[i].hash) ||
+(keys[i].len != test_keys[i].len)) {
+ds_put_format(_msg, "Good hash: %d len: %d\tTest hash:%d"
+  " len:%d\n", keys[i].hash, keys[i].len,
+  test_keys[i].hash, test_keys[i].len);
+failed = 1;
+}
+
 if (!miniflow_equal([i].mf, _keys[i].mf)) {
 uint32_t block_cnt = miniflow_n_values([i].mf);
 ds_put_format(_msg, "Autovalidation blocks failed\n"
-- 
2.25.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 2/6] dpif-netdev/mfex: Add AVX512 vlan ipv6 traffic profiles

2021-08-24 Thread Kumar Amber
Add AVX512 IPv6 optimized profiles for VLAN/IPv6/UDP and
VLAN/IPv6/TCP.

The MFEX autovalidation test-case already has IPv6 support for
validating against the scalar mfex.

Signed-off-by: Kumar Amber 
Signed-off-by: Harry van Haaren 
Co-authored-by: Harry van Haaren 

---
v2:
- Fix check-patch sign-offs
---
 NEWS  |  2 +
 lib/dpif-netdev-extract-avx512.c  | 94 +++
 lib/dpif-netdev-private-extract.c | 23 
 lib/dpif-netdev-private-extract.h |  6 ++
 4 files changed, 125 insertions(+)

diff --git a/NEWS b/NEWS
index f18e2c572..959df3add 100644
--- a/NEWS
+++ b/NEWS
@@ -11,6 +11,8 @@ Post-v2.16.0
- Userspace datapath:
  * Add AVX512 optimized profiles to miniflow extract for IPv6/UDP and
IPv6/TCP.
+ * Add AVX512 optimized profiles to miniflow extract for VLAN/IPv6/UDP
+   and VLAN/IPv6/TCP.
 
 
 v2.16.0 - 16 Aug 2021
diff --git a/lib/dpif-netdev-extract-avx512.c b/lib/dpif-netdev-extract-avx512.c
index 3384a8dba..11bca0144 100644
--- a/lib/dpif-netdev-extract-avx512.c
+++ b/lib/dpif-netdev-extract-avx512.c
@@ -214,6 +214,21 @@ _mm512_maskz_permutexvar_epi8_wrap(__mmask64 kmask, 
__m512i idx, __m512i a)
   38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, /* IPv6 */  \
   NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, NU, /* Unused */
 
+/* VLAN (Dot1Q) patterns and masks. */
+#define PATTERN_DT1Q_MASK \
+  0x00, 0x00, 0xFF, 0xFF,
+#define PATTERN_DT1Q_IPV6 \
+  0x00, 0x00, 0x86, 0xDD,
+
+#define PATTERN_DT1Q_IPV6_SHUFFLE \
+  /* Ether (2 blocks): Note that *VLAN* type is written here. */  \
+  0,  1,  2,  3,  4,  5,  6,  7, 8,  9, 10, 11, 16, 17,  0,  0,   \
+  /* VLAN (1 block): Note that the *EtherHdr->Type* is written here. */   \
+  12, 13, 14, 15, 0, 0, 0, 0, \
+  26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, /* IPv6 */  \
+  42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, /* IPv6 */  \
+  NU, NU, NU, NU, NU, NU, NU, NU, /* Unused */
+
 /* Generation of K-mask bitmask values, to zero out data in result. Note that
  * these correspond 1:1 to the above "*_SHUFFLE" values, and bit used must be
  * set in this K-mask, and "NU" values must be zero in the k-mask. Each mask
@@ -228,6 +243,8 @@ _mm512_maskz_permutexvar_epi8_wrap(__mmask64 kmask, __m512i 
idx, __m512i a)
 #define KMASK_TCP   0x0F00ULL
 #define KMASK_IPV6  0xULL
 #define KMASK_ETHER_IPV6 0x3FFFULL
+#define KMASK_DT1Q_IPV6  0xFF0FULL
+#define KMASK_IPV6_NOHDR 0x00FFULL
 
 #define PATTERN_IPV4_UDP_KMASK \
 (KMASK_ETHER | (KMASK_IPV4 << 16) | (KMASK_UDP << 32))
@@ -244,6 +261,10 @@ _mm512_maskz_permutexvar_epi8_wrap(__mmask64 kmask, 
__m512i idx, __m512i a)
 #define PATTERN_IPV6_KMASK \
 (KMASK_ETHER_IPV6 | (KMASK_IPV6 << 16) | (KMASK_IPV6 << 32))
 
+#define PATTERN_DT1Q_IPV6_KMASK \
+(KMASK_ETHER_IPV6 | (KMASK_DT1Q_IPV6 << 16) | (KMASK_IPV6 << 32) | \
+(KMASK_IPV6_NOHDR << 48))
+
 /* This union allows initializing static data as u8, but easily loading it
  * into AVX512 registers too. The union ensures proper alignment for the zmm.
  */
@@ -324,6 +345,8 @@ enum MFEX_PROFILES {
 PROFILE_ETH_VLAN_IPV4_TCP,
 PROFILE_ETH_IPV6_UDP,
 PROFILE_ETH_IPV6_TCP,
+PROFILE_ETH_VLAN_IPV6_TCP,
+PROFILE_ETH_VLAN_IPV6_UDP,
 PROFILE_COUNT,
 };
 
@@ -426,6 +449,37 @@ static const struct mfex_profile 
mfex_profiles[PROFILE_COUNT] =
 .dp_pkt_min_size = 54,
 },
 
+[PROFILE_ETH_VLAN_IPV6_TCP] = {
+.probe_mask.u8_data = {
+PATTERN_ETHERTYPE_MASK PATTERN_DT1Q_MASK PATTERN_IPV6_MASK },
+.probe_data.u8_data = {
+PATTERN_ETHERTYPE_DT1Q PATTERN_DT1Q_IPV6 PATTERN_IPV6_TCP },
+
+.store_shuf.u8_data = { PATTERN_DT1Q_IPV6_SHUFFLE },
+.store_kmsk = PATTERN_DT1Q_IPV6_KMASK,
+
+.mf_bits = { 0x38a0, 0x0004443c},
+.dp_pkt_offs = {
+14, UINT16_MAX, 18, 58,
+},
+.dp_pkt_min_size = 66,
+},
+
+[PROFILE_ETH_VLAN_IPV6_UDP] = {
+.probe_mask.u8_data = {
+PATTERN_ETHERTYPE_MASK PATTERN_DT1Q_MASK PATTERN_IPV6_MASK },
+.probe_data.u8_data = {
+PATTERN_ETHERTYPE_DT1Q PATTERN_DT1Q_IPV6 PATTERN_IPV6_UDP },
+
+.store_shuf.u8_data = { PATTERN_DT1Q_IPV6_SHUFFLE },
+.store_kmsk = PATTERN_DT1Q_IPV6_KMASK,
+
+.mf_bits = { 0x38a0, 0x0004043c},
+.dp_pkt_offs = {
+14, UINT16_MAX, 18, 58,
+},
+.dp_pkt_min_size = 66,
+},
 };
 
 /* IPv6 header helper function to fix TC, flow label and next header. */
@@ -676,6 +730,44 @@ mfex_avx512_process(struct dp_packet_batch *packets,
 

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Anton Ivanov


On 24/08/2021 12:46, Ilya Maximets wrote:

On 8/24/21 1:18 PM, Anton Ivanov wrote:

On 24/08/2021 12:05, Ilya Maximets wrote:

On 8/24/21 7:36 AM, Anton Ivanov wrote:

On 23/08/2021 22:36, Ilya Maximets wrote:

On 8/23/21 10:37 PM, Anton Ivanov wrote:

On 23/08/2021 21:26, Ilya Maximets wrote:

On 8/23/21 10:20 PM, Anton Ivanov wrote:

Should not be the case.

The map is pre-sized for the size from the previous iterations.

Line 12861 in my tree which is probably a few commits out of date:

    fast_hmap_size_for(, max_seen_lflow_size);

And immediately after building the lflows:

    if (hmap_count() > max_seen_lflow_size) {
    max_seen_lflow_size = hmap_count();
    }

So the map SHOULD be sized correctly - to the most recent seen lflow count.

Please, re-read the commit message.  It's a first run of the loop
and the 'max_seen_lflow_size' is default 128 at this point.

Ack,

Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it.

   From that perspective the patch is a straight +1 from me.

   From the perspective of the use case stated in the commit message- I am not 
sure it addresses it.

If northd is in single-threaded mode and is tackling a GIGANTIC database, it
may never complete the first iteration before the
expiration of the timers and everyone deciding that northd is AWOL.

Well, how do you suggest to fix that?  Obviously, we can always create
a database that northd will never be able to process in a reasonable
amount of time.  And it doesn't matter if it's single- or multi-threaded.

In this case NbDB is only 9MB in size, which is very reasonable, and
northd on my laptop takes more than 15 minutes to process it (I killed
it at this point).  With the patch applied it took only 11 seconds.
So, for me, this patch pretty much fixes the issue.  11 seconds is not
that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
It would be great to reduce, but we're not there yet.


In that case, if it is multi-threaded from the start, it should probably
grab the sizing from the lflow table hash in south db. That would be a
good approximation for the first run.

This will not work for a case where SbDB is empty for any reason while
NbDB is not.  And there is still a case where northd initially connects
to semi-empty databases and after few iterations NbDB receives a big
transaction and generates a big update for northd.

A partial fix is to resize to optimum size the hash after lflow processing.

I'm not sure I understand what you mean here, because resizing after
lflow processing will not help.  The lflow processing itself is the
very slow part that we're trying to make faster here.

That can be the case only with dpgroups. Otherwise lflows are just a 
destination to dump data and the bucket sizing is irrelevant because there is 
never any lookup inside lflows during processing. The lflow map is just used to 
store data. So if it is suboptimal at the exit of build_lflows() the resize 
will fix it before the first lookup shortly thereafter.

Are you running it with dpgroups enabled? In that case there are lookups inside 
lflows during build which happen under a per-bucket lock. So in addition to 
suboptimal size when searching the contention depends on the number of buckets. 
If they are too few, the system becomes heavily contended resulting in 
ridiculous computation sizes.

Oh, I see.  Indeed, without dp-groups there is no lookup during
lflow build.  I missed that.  So, yes, I agree that for a case
without dp-groups, re-sizing after lflow processing should work.
We need that for parallel case.

Current patch (use hmap_insert() that resizes if needed) helps
for all non-parallel cases.


Indeed. It should go in.

I will sort out the other cases to the extent possible.

Brgds,

A.



I'm mostly running dp-groups + non-parallel which is a default
case for ovn-heater/ovn-k8s.


For the case of "dpgroups + parallel + first iteration + pre-existing large 
database" there is no cure short of pre-allocating the hash to maximum size.

Yeah, dp-groups + parallel is a hard case.


I am scale testing that as well as resize (for non-dp-groups cases) at present.

Brgds,

A.


If the sizing was correct - 99.9% of the case this will be a noop.

If the sizing was incorrect, it will be resized so that the DP searches and all 
other ops which were recently added for flow reduction will work optimally.

This still does not work for lflow compute with dpgroups + parallel upon 
initial connect and without a SB database to use for size guidance. It will 
work for all other cases.

I will send two separate patches to address the cases which can be easily 
addressed and see what can be done with the dp+parallel upon initial connect to 
an empty sb database.

Brgds,

A


A.


A.

On 23/08/2021 21:02, Ilya Maximets wrote:

'lflow_map' is never expanded because northd always uses fast
insertion.  This leads to the case where we have a hash map
with only 128 initial 
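
A minimal, self-contained sketch of the two insertion paths being compared
in this thread, using OVS's public hmap API.  The toy structure and hash
function are invented for illustration; only the hmap calls are real.

#include <stdbool.h>
#include "openvswitch/hmap.h"
#include "openvswitch/util.h"

struct toy_lflow {
    struct hmap_node node;      /* Links the entry into the hmap. */
    int key;
};

/* Trivial stand-in hash, just for the sketch. */
static size_t
toy_hash(int key)
{
    return (size_t) key * 0x9E3779B97F4A7C15ULL;
}

static void
populate(struct hmap *map, int n, bool fast)
{
    for (int i = 0; i < n; i++) {
        struct toy_lflow *f = xmalloc(sizeof *f);
        f->key = i;
        if (fast) {
            /* Never grows the bucket array: if the map was pre-sized too
             * small, every later lookup walks a long bucket chain. */
            hmap_insert_fast(map, &f->node, toy_hash(i));
        } else {
            /* Grows the bucket array as the element count increases, which
             * is what the patch under discussion switches northd to. */
            hmap_insert(map, &f->node, toy_hash(i));
        }
    }
    if (fast) {
        /* After a bulk fast-insert, a single hmap_expand() restores the
         * expected short chains before the map is searched. */
        hmap_expand(map);
    }
}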

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Ilya Maximets
On 8/24/21 1:18 PM, Anton Ivanov wrote:
> 
> On 24/08/2021 12:05, Ilya Maximets wrote:
>> On 8/24/21 7:36 AM, Anton Ivanov wrote:
>>> On 23/08/2021 22:36, Ilya Maximets wrote:
 On 8/23/21 10:37 PM, Anton Ivanov wrote:
> On 23/08/2021 21:26, Ilya Maximets wrote:
>> On 8/23/21 10:20 PM, Anton Ivanov wrote:
>>> Should not be the case.
>>>
>>> The map is pre-sized for the size from the previous iterations.
>>>
>>> Line 12861 in my tree which is probably a few commits out of date:
>>>
>>>    fast_hmap_size_for(, max_seen_lflow_size);
>>>
>>> And immediately after building the lflows:
>>>
>>>    if (hmap_count() > max_seen_lflow_size) {
>>>    max_seen_lflow_size = hmap_count();
>>>    }
>>>
>>> So the map SHOULD be sized correctly - to the most recent seen lflow 
>>> count.
>> Please, re-read the commit message.  It's a first run of the loop
>> and the 'max_seen_lflow_size' is default 128 at this point.
> Ack,
>
> Not using auto-resizing in single threaded mode is a bug. Thanks for 
> fixing it.
>
>   From that perspective the patch is a straight +1 from me.
>
>   From the perspective of the use case stated in the commit message- I am 
> not sure it addresses it.
>
> If northd is in single-threaded mode and is tackling a GIGANTIC
> database, it may never complete the first iteration before the
> expiration of the timers and everyone deciding that northd is AWOL.
 Well, how do you suggest to fix that?  Obviously, we can always create
 a database that northd will never be able to process in a reasonable
 amount of time.  And it doesn't matter if it's single- or multi-threaded.

 In this case NbDB is only 9MB in size, which is very reasonable, and
 northd on my laptop takes more than 15 minutes to process it (I killed
 it at this point).  With the patch applied it took only 11 seconds.
 So, for me, this patch pretty much fixes the issue.  11 seconds is not
 that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
 It would be great to reduce, but we're not there yet.

> In that case, if it is multi-threaded from the start, it should probably
> grab the sizing from the lflow table hash in south db. That would be a
> good approximation for the first run.
 This will not work for a case where SbDB is empty for any reason while
 NbDB is not.  And there is still a case where northd initially connects
 to semi-empty databases and after few iterations NbDB receives a big
 transaction and generates a big update for northd.
>>> A partial fix is to resize to optimum size the hash after lflow processing.
>> I'm not sure I understand what you mean here, because resizing after
>> lflow processing will not help.  The lflow processing itself is the
>> very slow part that we're trying to make faster here.
> 
> That can be the case only with dpgroups. Otherwise lflows are just a 
> destination to dump data and the bucket sizing is irrelevant because there is 
> never any lookup inside lflows during processing. The lflow map is just used 
> to store data. So if it is suboptimal at the exit of build_lflows() the 
> resize will fix it before the first lookup shortly thereafter.
> 
> Are you running it with dpgroups enabled? In that case there are lookups 
> inside lflows during build which happen under a per-bucket lock. So in 
> addition to suboptimal size when searching the contention depends on the 
> number of buckets. If they are too few, the system becomes heavily contended 
> resulting in ridiculous computation sizes.

Oh, I see.  Indeed, without dp-groups there is no lookup during
lflow build.  I missed that.  So, yes, I agree that for a case
without dp-groups, re-sizing after lflow processing should work.
We need that for parallel case.

Current patch (use hmap_insert() that resizes if needed) helps
for all non-parallel cases.

I'm mostly running dp-groups + non-parallel which is a default
case for ovn-heater/ovn-k8s.

> 
> For the case of "dpgroups + parallel + first iteration + pre-existing large 
> database" there is no cure short of pre-allocating the hash to maximum size.

Yeah, dp-groups + parallel is a hard case.

> 
> I am scale testing that as well as resize (for non-dp-groups cases) at 
> present.
> 
> Brgds,
> 
> A.
> 
>>
>>> If the sizing was correct - 99.9% of the case this will be a noop.
>>>
>>> If the sizing was incorrect, it will be resized so that the DP searches and 
>>> all other ops which were recently added for flow reduction will work 
>>> optimally.
>>>
>>> This still does not work for lflow compute with dpgroups + parallel upon 
>>> initial connect and without a SB database to use for size guidance. It will 
>>> work for all other cases.
>>>
>>> I will send two separate patches to address the cases which can be easily 
>>> addressed and see what can be done 

Re: [ovs-dev] [PATCH v2] Documentation: Cleanup PMD information.

2021-08-24 Thread David Marchand
On Tue, Aug 10, 2021 at 2:10 PM Kevin Traynor  wrote:
>
> The 'Port/Rx Queue Assigment to PMD Threads' section has
> expanded over time and now includes info about stats/commands,
> manual pinning and different options for OVS assigning Rxqs to
> PMDs.
>
> Split them into different sections with sub-headings and move
> the two similar paragraphs about stats together.
>
> Rename 'Automatic assignment of Port/Rx Queue to PMD Threads'
> section to 'PMD Automatic Load Balance'.
>
> A few other minor cleanups as I was reading.
>
> Signed-off-by: Kevin Traynor 
> Acked-by: Adrian Moreno 
> ---
>
> v2:
> - a couple of small fixes as per Adrian's comments
> - remove duplicate PMD ALB conditions paragraph
> ---
>  Documentation/topics/dpdk/pmd.rst | 104 --
>  1 file changed, 54 insertions(+), 50 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst 
> b/Documentation/topics/dpdk/pmd.rst
> index 95fa7af12..b0e2419c2 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -75,8 +75,47 @@ for enabling things like multiqueue for :ref:`physical 
> `
>  and :ref:`vhost-user ` interfaces.
>
> -To show port/Rx queue assignment::
> +Rx queues will be assigned to PMD threads by OVS, or they can be manually
> +pinned to PMD threads by the user.
> +
> +To see the port/Rx queue assignment and current measured usage history of PMD
> +core cycles for each Rx queue::
>
>  $ ovs-appctl dpif-netdev/pmd-rxq-show
>
> +.. note::
> +
> +   A history of one minute is recorded and shown for each Rx queue to allow 
> for
> +   traffic pattern spikes. Any changes in the Rx queue's PMD core cycles 
> usage,
> +   due to traffic pattern or reconfig changes, will take one minute to be 
> fully
> +   reflected in the stats.
> +

XXX see below

> +.. versionchanged:: 2.16.0
> +
> +   A ``overhead`` statistics is shown per PMD: it represents the number of
> +   cycles inherently consumed by the OVS PMD processing loop.
> +
> +Rx queue to PMD assignment takes place whenever there are configuration 
> changes
> +or can be triggered by using::
> +
> +$ ovs-appctl dpif-netdev/pmd-rxq-rebalance
> +
> +.. versionchanged:: 2.6.0
> +
> +   The ``pmd-rxq-show`` command was added in OVS 2.6.0.

It seems unrelated to the pmd-rxq-rebalance command itself.
I would either move this comment next to the first reference to the
pmd-rxq-show command (i.e. at XXX, before the overhead stat comment),
or drop it.


> +
> +.. versionchanged:: 2.9.0
> +
> +   Utilization-based allocation of Rx queues to PMDs and the
> +   ``pmd-rxq-rebalance`` command were added in OVS 2.9.0. Prior to this,
> +   allocation was round-robin and processing cycles were not taken into
> +   consideration.
> +
> +   In addition, the output of ``pmd-rxq-show`` was modified to include
> +   Rx queue utilization of the PMD as a percentage. Prior to this, tracking 
> of
> +   stats was not available.
> +
> +

nit: this double empty line is not consistent with the rest of the
doc, is there a need for it?


> +Port/Rx Queue assignment to PMD threads by manual pinning
> +~
>  Rx queues may be manually pinned to cores. This will change the default Rx
>  queue assignment to PMD threads::
> @@ -117,4 +156,6 @@ If using ``pmd-rxq-assign=group`` PMD threads with 
> *pinned* Rxqs can be
> a *non-isolated* PMD, that will remain *non-isolated*.
>
> +Automatic Port/Rx Queue assignment to PMD threads
> +~
>  If ``pmd-rxq-affinity`` is not set for Rx queues, they will be assigned to 
> PMDs
>  (cores) automatically.
> @@ -126,7 +167,9 @@ The algorithm used to automatically assign Rxqs to PMDs 
> can be set by::
>  By default, ``cycles`` assignment is used where the Rxqs will be ordered by
>  their measured processing cycles, and then be evenly assigned in descending
> -order to PMDs based on an up/down walk of the PMDs. For example, where there
> -are five Rx queues and three cores - 3, 7, and 8 - available and the measured
> -usage of core cycles per Rx queue over the last interval is seen to be:
> +order to PMDs. The PMD that will be selected for a given Rxq will be the next
> +one in alternating ascending/descending order based on core id. For example,
> +where there are five Rx queues and three cores - 3, 7, and 8 - available and
> +the measured usage of core cycles per Rx queue over the last interval is seen
> +to be:
>
>  - Queue #0: 30%
> @@ -184,48 +227,11 @@ The Rx queues may be assigned to the cores in the 
> following order::
>  Core 8: P1Q0 |
>
> -To see the current measured usage history of PMD core cycles for each Rx
> -queue::
> -
> -$ ovs-appctl dpif-netdev/pmd-rxq-show
> -
> -.. note::
> -
> -   A history of one minute is recorded and shown for each Rx queue to allow 
> for
> -   traffic pattern spikes. Any changes in the Rx queue's PMD core cycles 
> usage,
> -   due to traffic pattern or 
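
For what it is worth, the "cycles" assignment the reworded paragraph
describes boils down to: sort the Rx queues by measured cycles in descending
order, then hand them out over the PMD core list in an alternating up/down
walk.  A small illustrative sketch follows; the arrays and names are
invented, this is not OVS code.

#include <stdio.h>

/* Toy illustration of the up/down ("zigzag") walk.  rxq_load[] is already
 * sorted in descending order of measured cycles and pmd_core[] is sorted
 * by core id; with five queues and cores {3, 7, 8} the walk visits the
 * cores in the order 3, 7, 8, 8, 7. */
static void
assign_rxqs(const int *rxq_load, int n_rxqs, const int *pmd_core, int n_pmds)
{
    for (int i = 0; i < n_rxqs; i++) {
        int round = i / n_pmds;
        int pos = i % n_pmds;
        int p = (round % 2 == 0) ? pos : n_pmds - 1 - pos;

        printf("rxq #%d (%d%% busy) -> core %d\n", i, rxq_load[i], pmd_core[p]);
    }
}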

Re: [ovs-dev] [PATCH v3 2/2] ovsdb-idl: Exclude missing tables and columns in the transaction.

2021-08-24 Thread Ilya Maximets
On 8/24/21 12:59 AM, num...@ovn.org wrote:
> From: Numan Siddique 
> 
> In the cases where the C idl client is compiled with a newer schema
> and the ovsdb-server is running with older schema, the IDL clients
> can included tables or columns in the transaction which are missing
> in the server schema.  ovsdb-server will reject the transaction,
> but the IDL client keeps on trying the transaction resulting
> in a continuous loop.  This patch fixes this issue.
> 
> For the missing tables, the function ovsdb_idl_txn_insert() will now
> return NULL.  The client should check for NULL before accessing the
> row object.  For the missing columns, the columns are excluded from
> the transaction object.
This doesn't look safe, because none of existing clients are checking
the return value of IDL wrapper functions and are using it right away.

Can we just fail the transaction before sending it to the server?
Clients are required to check the transaction status.  Clients will
need to check for existence of the certain columns before adding
them to transactions though with the API from patch #1.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Anton Ivanov


On 24/08/2021 12:05, Ilya Maximets wrote:

On 8/24/21 7:36 AM, Anton Ivanov wrote:

On 23/08/2021 22:36, Ilya Maximets wrote:

On 8/23/21 10:37 PM, Anton Ivanov wrote:

On 23/08/2021 21:26, Ilya Maximets wrote:

On 8/23/21 10:20 PM, Anton Ivanov wrote:

Should not be the case.

The map is pre-sized for the size from the previous iterations.

Line 12861 in my tree which is probably a few commits out of date:

   fast_hmap_size_for(, max_seen_lflow_size);

And immediately after building the lflows:

   if (hmap_count() > max_seen_lflow_size) {
   max_seen_lflow_size = hmap_count();
   }

So the map SHOULD be sized correctly - to the most recent seen lflow count.

Please, re-read the commit message.  It's a first run of the loop
and the 'max_seen_lflow_size' is default 128 at this point.

Ack,

Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it.

  From that perspective the patch is a straight +1 from me.

  From the perspective of the use case stated in the commit message- I am not 
sure it addresses it.

If northd is in single-threaded mode and is tackling a GIGANTIC database, it
may never complete the first iteration before the
expiration of the timers and everyone deciding that northd is AWOL.

Well, how do you suggest to fix that?  Obviously, we can always create
a database that northd will never be able to process in a reasonable
amount of time.  And it doesn't matter if it's single- or multi-threaded.

In this case NbDB is only 9MB in size, which is very reasonable, and
northd on my laptop takes more than 15 minutes to process it (I killed
it at this point).  With the patch applied it took only 11 seconds.
So, for me, this patch pretty much fixes the issue.  11 seconds is not
that bad, e.g. ovn-k8s configures inactivity probes for clients to 180.
It would be great to reduce, but we're not there yet.


In that case, if it is multi-threaded from the start, it should probably
grab the sizing from the lflow table hash in south db. That would be a
good approximation for the first run.

This will not work for a case where SbDB is empty for any reason while
NbDB is not.  And there is still a case where northd initially connects
to semi-empty databases and after few iterations NbDB receives a big
transaction and generates a big update for northd.

A partial fix is to resize to optimum size the hash after lflow processing.

I'm not sure I understand what you mean here, because resizing after
lflow processing will not help.  The lflow processing itself is the
very slow part that we're trying to make faster here.


That can be the case only with dpgroups. Otherwise lflows are just a 
destination to dump data and the bucket sizing is irrelevant because there is 
never any lookup inside lflows during processing. The lflow map is just used to 
store data. So if it is suboptimal at the exit of build_lflows() the resize 
will fix it before the first lookup shortly thereafter.

Are you running it with dpgroups enabled? In that case there are lookups inside 
lflows during build which happen under a per-bucket lock. So in addition to 
suboptimal size when searching the contention depends on the number of buckets. 
If they are too few, the system becomes heavily contended resulting in 
ridiculous computation sizes.

For the case of "dpgroups + parallel + first iteration + pre-existing large 
database" there is no cure short of pre-allocating the hash to maximum size.

I am scale testing that as well as resize (for non-dp-groups cases) at present.

Brgds,

A.




If the sizing was correct - 99.9% of the case this will be a noop.

If the sizing was incorrect, it will be resized so that the DP searches and all 
other ops which were recently added for flow reduction will work optimally.

This still does not work for lflow compute with dpgroups + parallel upon 
initial connect and without a SB database to use for size guidance. It will 
work for all other cases.

I will send two separate patches to address the cases which can be easily 
addressed and see what can be done with the dp+parallel upon initial connect to 
an empty sb database.

Brgds,

A


A.


A.

On 23/08/2021 21:02, Ilya Maximets wrote:

'lflow_map' is never expanded because northd always uses fast
insertion.  This leads to the case where we have a hash map
with only 128 initial buckets and every ovn_lflow_find() ends
up iterating over n_lflows / 128 entries.  It's thousands of
logical flows or even more.  For example, it takes forever for
ovn-northd to start up with the Northbound Db from the 120 node
density-heavy test from ovn-heater, because every lookup is slower
than the previous one.  I aborted the process after 15 minutes of
waiting, because there was no sign that it would converge.  With
this change applied the loop completes in only 11 seconds.

Hash map will be pre-allocated to the maximum seen number of
logical flows on a second iteration, but this doesn't help for
the first iteration when northd connects for the first time to a big
Northbound database, which is a common case during failover or
cluster upgrade.
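
For readers less familiar with the OVS hmap API: hmap_insert() grows the
bucket array as the map fills, while hmap_insert_fast() deliberately skips
that step and relies on the caller having pre-sized the map.  A small
self-contained illustration of the failure mode the commit message describes;
the harness is mine, only the two hmap calls are the real API:

    #include <stdbool.h>
    #include <stdint.h>
    #include "hash.h"              /* hash_int() (OVS lib). */
    #include "openvswitch/hmap.h"
    #include "util.h"              /* xmalloc() (OVS lib). */

    struct toy_lflow {
        struct hmap_node node;
        uint32_t id;
    };

    /* Fill 'map' with 'n' dummy entries.  With 'fast' set, the bucket array
     * never grows past its pre-sized capacity (128 buckets in the scenario
     * above), so the average chain length becomes n / 128 and every
     * subsequent *_find() degrades accordingly.  Without 'fast',
     * hmap_insert() expands the map as it fills and chains stay short. */
    static void
    fill_map(struct hmap *map, size_t n, bool fast)
    {
        for (size_t i = 0; i < n; i++) {
            struct toy_lflow *f = xmalloc(sizeof *f);

            f->id = i;
            if (fast) {
                hmap_insert_fast(map, &f->node, hash_int(f->id, 0));
            } else {
                hmap_insert(map, &f->node, hash_int(f->id, 0));
            }
        }
    }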

Re: [ovs-dev] [PATCH ovn] ovn-northd: Fix extremely inefficient usage of lflow hash map.

2021-08-24 Thread Ilya Maximets
On 8/24/21 7:36 AM, Anton Ivanov wrote:
> On 23/08/2021 22:36, Ilya Maximets wrote:
>> On 8/23/21 10:37 PM, Anton Ivanov wrote:
>>> On 23/08/2021 21:26, Ilya Maximets wrote:
 On 8/23/21 10:20 PM, Anton Ivanov wrote:
> Should not be the case.
>
> The map is pre-sized for the size from the previous iterations.
>
> Line 12861 in my tree which is probably a few commits out of date:
>
>   fast_hmap_size_for(&lflows, max_seen_lflow_size);
>
> And immediately after building the lflows:
>
>   if (hmap_count(&lflows) > max_seen_lflow_size) {
>       max_seen_lflow_size = hmap_count(&lflows);
>   }
>
> So the map SHOULD be sized correctly - to the most recently seen lflow
> count.
 Please, re-read the commit message.  It's a first run of the loop
 and the 'max_seen_lflow_size' is default 128 at this point.
>>> Ack,
>>>
>>> Not using auto-resizing in single-threaded mode is a bug. Thanks for fixing
>>> it.
>>>
>>>  From that perspective the patch is a straight +1 from me.
>>>
>>>  From the perspective of the use case stated in the commit message, I am
>>> not sure it addresses it.
>>>
>>> If northd is in single-threaded mode and is tackling a GIGANTIC database,
>>> it may never complete the first iteration before the
>>> it may never complete the first iteration before the
>>> expiration of the timers and everyone deciding that northd is AWOL.
>> Well, how do you suggest we fix that?  Obviously, we can always create
>> a database that northd will never be able to process in a reasonable
>> amount of time.  And it doesn't matter if it's single- or multi-threaded.
>>
>> In this case NbDB is only 9MB in size, which is very reasonable, and
>> northd on my laptop takes more than 15 minutes to process it (I killed
>> it at this point).  With the patch applied it took only 11 seconds.
>> So, for me, this patch pretty much fixes the issue.  11 seconds is not
>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180 seconds.
>> It would be great to reduce it further, but we're not there yet.
>>
>>> In that case, if it is multi-threaded from the start, it should probably
>>> grab the sizing from the lflow table hash in south db. That would be a
>>> good approximation for the first run.
>> This will not work for a case where SbDB is empty for any reason while
>> NbDB is not.  And there is still a case where northd initially connects
>> to semi-empty databases and after a few iterations NbDB receives a big
>> transaction and generates a big update for northd.
> 
> A partial fix is to resize the hash to its optimum size after lflow processing.

I'm not sure I understand what you mean here, because resizing after
lflow processing will not help.  The lflow processing itself is the
very slow part that we're trying to make faster here.

> 
> If the sizing was correct, in 99.9% of cases this will be a no-op.
> 
> If the sizing was incorrect, it will be resized so that the DP searches and 
> all other ops which were recently added for flow reduction will work 
> optimally.
> 
> This still does not work for lflow compute with dpgroups + parallel upon
> initial connect and without an SB database to use for size guidance. It will
> work for all other cases.
> 
> I will send two separate patches to address the cases which can be easily 
> addressed and see what can be done with the dp+parallel upon initial connect 
> to an empty sb database.
> 
> Brgds,
> 
> A
> 
>>
>>> A.
>>>
> A.
>
> On 23/08/2021 21:02, Ilya Maximets wrote:
>> 'lflow_map' is never expanded because northd always uses fast
>> insertion.  This leads to the case where we have a hash map
>> with only 128 initial buckets and every ovn_lflow_find() ends
>> up iterating over n_lflows / 128 entries.  It's thousands of
>> logical flows or even more.  For example, it takes forever for
>> ovn-northd to start up with the Northbound Db from the 120 node
>> density-heavy test from ovn-heater, because every lookup is slower
>> than the previous one.  I aborted the process after 15 minutes of
>> waiting, because there was no sign that it would converge.  With
>> this change applied the loop completes in only 11 seconds.
>>
>> Hash map will be pre-allocated to the maximum seen number of
>> logical flows on a second iteration, but this doesn't help for
>> the first iteration when northd connects for the first time to a big
>> Northbound database, which is a common case during failover or
>> cluster upgrade.  And there is an even trickier case where big
>> NbDB transaction that explodes the number of logical flows received
>> on not the first run.
>>
>> We can't expand the hash map in case of parallel build, so this
>> should be fixed separately.
>>
>> CC: Anton Ivanov 
>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build")
>> Signed-off-by: Ilya Maximets 
>> ---
>>     northd/ovn-northd.c | 6 +-
>>     1 file changed, 5 insertions(+), 1 deletion(-)
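
Judging by the commit message and the small diffstat, the change most likely
just makes the insertion path conditional on whether the parallel build is in
use.  A sketch of that shape only, not the literal patch; the helper and the
'use_parallel_build' flag name are my assumptions:

    #include <stdbool.h>
    #include <stdint.h>
    #include "openvswitch/hmap.h"

    /* Insert a logical flow node into the lflow map. */
    static void
    lflow_map_insert(struct hmap *lflow_map, struct hmap_node *node,
                     uint32_t hash, bool use_parallel_build)
    {
        if (use_parallel_build) {
            /* Workers insert under per-row locks and the map cannot be
             * expanded concurrently, so keep the non-expanding insert and
             * rely on pre-sizing via max_seen_lflow_size. */
            hmap_insert_fast(lflow_map, node, hash);
        } else {
            /* Single-threaded: let the map expand as logical flows are
             * added, which is what this fix enables. */
            hmap_insert(lflow_map, node, hash);
        }
    }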