date:20180306

[PATCH net] cxgb4: do not set needs_free_netdev for mgmt dev's

2018-03-06 Thread Ganesh Goudar

Do not set 'needs_free_netdev' as we do call free_netdev
for mgmt net devices, doing both hits BUG_ON.

Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 33bc8418..61022b5 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -4970,7 +4970,6 @@ static void cxgb4_mgmt_setup(struct net_device *dev)
/* Initialize the device structure. */
dev->netdev_ops = &cxgb4_mgmt_netdev_ops;
dev->ethtool_ops = &cxgb4_mgmt_ethtool_ops;
-   dev->needs_free_netdev = true;
 }
 
 static int cxgb4_iov_configure(struct pci_dev *pdev, int num_vfs)
-- 
2.1.0

[PATCH net] cxgb4: copy adap index to PF0-3 adapter instances

2018-03-06 Thread Ganesh Goudar

instantiation of VF's on different adapters fails, copy
adapter index and chip type to PF0-3 adapter instances
to fix the issue.

Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 7b452e8..33bc8418 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5181,6 +5181,8 @@ static int init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
adapter->name = pci_name(pdev);
adapter->mbox = func;
adapter->pf = func;
+   adapter->params.chip = chip;
+   adapter->adap_idx = adap_idx;
adapter->msg_enable = DFLT_MSG_ENABLE;
adapter->mbox_log = kzalloc(sizeof(*adapter->mbox_log) +
(sizeof(struct mbox_cmd) *
-- 
2.1.0

Re: Use of Indirect function calls

2018-03-06 Thread Rao Shoaib




On 03/06/2018 10:32 PM, Eric Dumazet wrote:

On Tue, 2018-03-06 at 21:53 -0800, Rao Shoaib wrote:

David,

Thanks a lot for your prompt response. Do you have a specific
solution
in mind or will the calls be replaced with simple checks ?

There is upcoming work for that, but not specific to TCP stack.


Also while I have your attention can I ask your opinion about
breaking
up some TCP functions, mostly control functions into smaller units
so
that if a little different behavior is desired it can be achieved
and
common code could still be shared. Of course you can not say much
without looking at the code but will you even entertain such a change
?

I am sorry, but I would prefer no code refactoring unless you fix a
serious bug, or prepare for something really new (and having noticeable
impact)

We have to maintain stable trees, and such code churns are adding
maintenance hassles.

Of course, you can submit patches, but be warned that you can not
expect us spending hours reviewing patches that might bring serious
regressions.

I suggest you start with small patches first.

Thanks a lot Eric, I absolutely understand your concerns. I asked the 
question as I did not want to spend the time if you guys were not 
willing to even entertain the idea. Thanks a lot for your flexibility.
The changes I am making are very simple and very unlikely to cause any 
issues. You can always reject the patch as too big and I will work to 
make it smaller.


Regards,

Shoaib

Re: [pci PATCH v3 0/3] Add support for unmanaged SR-IOV

2018-03-06 Thread Christoph Hellwig

On Tue, Mar 06, 2018 at 11:29:08AM -0800, Alexander Duyck wrote:
> This series is meant to add support for SR-IOV on devices when the VFs are
> not managed by the kernel. Examples of recent patches attempting to do this
> include:
> virto - https://patchwork.kernel.org/patch/10241225/
> pci-stub - https://patchwork.kernel.org/patch/10109935/
> vfio - https://patchwork.kernel.org/patch/10103353/
> uio - https://patchwork.kernel.org/patch/9974031/

nvme and ema seems to be existing examples.  Care to throw in
conversions while you're at it?

[for-next V2 12/13] {net,IB}/mlx5: Add flow steering helpers

2018-03-06 Thread Saeed Mahameed

From: Boris Pismenny 

Add helper functions that check if a protocol is
part of a flow steering match criteria.

Signed-off-by: Boris Pismenny 
Signed-off-by: Matan Barak 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/main.c |   7 +-
 include/linux/mlx5/fs_helpers.h   | 134 ++
 include/linux/mlx5/mlx5_ifc.h |   8 ++-
 3 files changed, 143 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/mlx5/fs_helpers.h

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index d50ace805995..d9474b95d8e5 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -59,6 +59,7 @@
 #include "mlx5_ib.h"
 #include "ib_rep.h"
 #include "cmd.h"
+#include 
 
 #define DRIVER_NAME "mlx5_ib"
 #define DRIVER_VERSION "5.0-0"
@@ -2312,8 +2313,6 @@ static void set_tos(void *outer_c, void *outer_v, u8 
mask, u8 val)
   offsetof(typeof(filter), field) -\
   sizeof(filter.field))
 
-#define IPV4_VERSION 4
-#define IPV6_VERSION 6
 static int parse_flow_attr(struct mlx5_core_dev *mdev, u32 *match_c,
   u32 *match_v, const union ib_flow_spec *ib_spec,
   struct mlx5_flow_act *action)
@@ -2399,7 +2398,7 @@ static int parse_flow_attr(struct mlx5_core_dev *mdev, 
u32 *match_c,
MLX5_SET(fte_match_set_lyr_2_4, headers_c,
 ip_version, 0xf);
MLX5_SET(fte_match_set_lyr_2_4, headers_v,
-ip_version, IPV4_VERSION);
+ip_version, MLX5_FS_IPV4_VERSION);
} else {
MLX5_SET(fte_match_set_lyr_2_4, headers_c,
 ethertype, 0x);
@@ -2438,7 +2437,7 @@ static int parse_flow_attr(struct mlx5_core_dev *mdev, 
u32 *match_c,
MLX5_SET(fte_match_set_lyr_2_4, headers_c,
 ip_version, 0xf);
MLX5_SET(fte_match_set_lyr_2_4, headers_v,
-ip_version, IPV6_VERSION);
+ip_version, MLX5_FS_IPV6_VERSION);
} else {
MLX5_SET(fte_match_set_lyr_2_4, headers_c,
 ethertype, 0x);
diff --git a/include/linux/mlx5/fs_helpers.h b/include/linux/mlx5/fs_helpers.h
new file mode 100644
index ..7b476bbae731
--- /dev/null
+++ b/include/linux/mlx5/fs_helpers.h
@@ -0,0 +1,134 @@
+/*
+ * Copyright (c) 2018, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _MLX5_FS_HELPERS_
+#define _MLX5_FS_HELPERS_
+
+#include 
+
+#define MLX5_FS_IPV4_VERSION 4
+#define MLX5_FS_IPV6_VERSION 6
+
+static inline bool _mlx5_fs_is_outer_ipproto_flow(const u32 *match_c,
+ const u32 *match_v, u8 match)
+{
+   const void *headers_c = MLX5_ADDR_OF(fte_match_param, match_c,
+outer_headers);
+   const void *headers_v = MLX5_ADDR_OF(fte_match_param, match_v,
+outer_headers);
+
+   return MLX5_GET(fte_match_set_lyr_2_4, headers_c, ip_protocol) == 0xff 
&&
+   MLX5_GET(fte_match_set_lyr_2_4, headers_v, ip_protocol) == 
match;
+}
+
+static inline bool mlx5_fs_is_outer_tcp_flow(const u32 *match_c,
+const u32 *match_v)
+{
+   return _mlx5_fs_is_outer_ipproto_flow(match_c, match

[for-next V2 13/13] net/mlx5: Flow steering cmd interface should get the fte when deleting

2018-03-06 Thread Saeed Mahameed

From: Aviad Yehezkel 

Previously, deleting a flow steering entry only got the index.
Since the FPGA implementation of FTE's deletion might need to dig
inside the FTE itself, we would like to get the FTE's context.
Changing the interface to pass the FTE context.

Signed-off-by: Aviad Yehezkel 
Signed-off-by: Matan Barak 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c  | 6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h  | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index d9d7dd439bb9..645f83cac34d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -106,7 +106,7 @@ static int mlx5_cmd_stub_update_fte(struct mlx5_core_dev 
*dev,
 
 static int mlx5_cmd_stub_delete_fte(struct mlx5_core_dev *dev,
struct mlx5_flow_table *ft,
-   unsigned int index)
+   struct fs_fte *fte)
 {
return 0;
 }
@@ -436,7 +436,7 @@ static int mlx5_cmd_update_fte(struct mlx5_core_dev *dev,
 
 static int mlx5_cmd_delete_fte(struct mlx5_core_dev *dev,
   struct mlx5_flow_table *ft,
-  unsigned int index)
+  struct fs_fte *fte)
 {
u32 out[MLX5_ST_SZ_DW(delete_fte_out)] = {0};
u32 in[MLX5_ST_SZ_DW(delete_fte_in)]   = {0};
@@ -444,7 +444,7 @@ static int mlx5_cmd_delete_fte(struct mlx5_core_dev *dev,
MLX5_SET(delete_fte_in, in, opcode, 
MLX5_CMD_OP_DELETE_FLOW_TABLE_ENTRY);
MLX5_SET(delete_fte_in, in, table_type, ft->type);
MLX5_SET(delete_fte_in, in, table_id, ft->id);
-   MLX5_SET(delete_fte_in, in, flow_index, index);
+   MLX5_SET(delete_fte_in, in, flow_index, fte->index);
if (ft->vport) {
MLX5_SET(delete_fte_in, in, vport_number, ft->vport);
MLX5_SET(delete_fte_in, in, other_vport, 1);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
index 81c82f48d93e..6228ba7bfa1a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
@@ -72,7 +72,7 @@ struct mlx5_flow_cmds {
 
int (*delete_fte)(struct mlx5_core_dev *dev,
  struct mlx5_flow_table *ft,
- unsigned int index);
+ struct fs_fte *fte);
 
int (*update_root_ft)(struct mlx5_core_dev *dev,
  struct mlx5_flow_table *ft,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 2e4a1d4e0cea..4e456c292ce4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -520,7 +520,7 @@ static void del_hw_fte(struct fs_node *node)
dev = get_dev(&ft->node);
root = find_root(&ft->node);
if (node->active) {
-   err = root->cmds->delete_fte(dev, ft, fte->index);
+   err = root->cmds->delete_fte(dev, ft, fte);
if (err)
mlx5_core_warn(dev,
   "flow steering can't delete fte in index 
%d of flow group id %d\n",
-- 
2.14.3

[for-next V2 11/13] net/mlx5: Embed mlx5_flow_act into fs_fte

2018-03-06 Thread Saeed Mahameed

From: Matan Barak 

fte objects contain the match value and action. Currently, extending
the actions require in adding them both to the API and fs_fte.

Signed-off-by: Matan Barak 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 .../mellanox/mlx5/core/diag/fs_tracepoint.h|  4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   | 13 ++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  | 24 ++
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  |  5 +
 4 files changed, 21 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
index 80eef4163f52..a6ba57fbb414 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
@@ -163,9 +163,9 @@ TRACE_EVENT(mlx5_fs_set_fte,
   fs_get_obj(__entry->fg, fte->node.parent);
   __entry->group_index = __entry->fg->id;
   __entry->index = fte->index;
-  __entry->action = fte->action;
+  __entry->action = fte->action.action;
   __entry->mask_enable = 
__entry->fg->mask.match_criteria_enable;
-  __entry->flow_tag = fte->flow_tag;
+  __entry->flow_tag = fte->action.flow_tag;
   memcpy(__entry->mask_outer,
  MLX5_ADDR_OF(fte_match_param,
   
&__entry->fg->mask.match_criteria,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index ed3ea80a24be..d9d7dd439bb9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -340,16 +340,17 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
 
in_flow_context = MLX5_ADDR_OF(set_fte_in, in, flow_context);
MLX5_SET(flow_context, in_flow_context, group_id, group_id);
-   MLX5_SET(flow_context, in_flow_context, flow_tag, fte->flow_tag);
-   MLX5_SET(flow_context, in_flow_context, action, fte->action);
-   MLX5_SET(flow_context, in_flow_context, encap_id, fte->encap_id);
-   MLX5_SET(flow_context, in_flow_context, modify_header_id, 
fte->modify_id);
+   MLX5_SET(flow_context, in_flow_context, flow_tag, fte->action.flow_tag);
+   MLX5_SET(flow_context, in_flow_context, action, fte->action.action);
+   MLX5_SET(flow_context, in_flow_context, encap_id, fte->action.encap_id);
+   MLX5_SET(flow_context, in_flow_context, modify_header_id,
+fte->action.modify_id);
in_match_value = MLX5_ADDR_OF(flow_context, in_flow_context,
  match_value);
memcpy(in_match_value, &fte->val, sizeof(fte->val));
 
in_dests = MLX5_ADDR_OF(flow_context, in_flow_context, destination);
-   if (fte->action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) {
+   if (fte->action.action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) {
int list_size = 0;
 
list_for_each_entry(dst, &fte->node.children, node.list) {
@@ -375,7 +376,7 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
 list_size);
}
 
-   if (fte->action & MLX5_FLOW_CONTEXT_ACTION_COUNT) {
+   if (fte->action.action & MLX5_FLOW_CONTEXT_ACTION_COUNT) {
int max_list_size = BIT(MLX5_CAP_FLOWTABLE_TYPE(dev,
log_max_flow_counter,
ft->type));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 5c86d103..2e4a1d4e0cea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -481,12 +481,12 @@ static void del_sw_hw_rule(struct fs_node *node)
if (rule->dest_attr.type == MLX5_FLOW_DESTINATION_TYPE_COUNTER  &&
--fte->dests_size) {
modify_mask = BIT(MLX5_SET_FTE_MODIFY_ENABLE_MASK_ACTION);
-   fte->action &= ~MLX5_FLOW_CONTEXT_ACTION_COUNT;
+   fte->action.action &= ~MLX5_FLOW_CONTEXT_ACTION_COUNT;
update_fte = true;
goto out;
}
 
-   if ((fte->action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) &&
+   if ((fte->action.action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) &&
--fte->dests_size) {
modify_mask = 
BIT(MLX5_SET_FTE_MODIFY_ENABLE_MASK_DESTINATION_LIST),
update_fte = true;
@@ -623,10 +623,7 @@ static struct fs_fte *alloc_fte(struct mlx5_flow_table *ft,
 
memcpy(fte->val, match_value, sizeof(fte->val));
fte->node.type =  FS_TYPE_FLOW_ENTRY;
-   fte->flow_tag = flow_act->flow_tag;
-

[for-next V2 09/13] net/mlx5: Add shim layer between fs and cmd

2018-03-06 Thread Saeed Mahameed

From: Matan Barak 

The shim layer allows each namespace to define possibly different
functionality for add/delete/update commands. The shim layer
introduced here, will be used to support flow steering with the FPGA.

Signed-off-by: Matan Barak 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c  | 191 ++
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h  |  72 
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |  84 ++
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |   1 +
 4 files changed, 248 insertions(+), 100 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 881e2e55840c..c3eaddb43e57 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -39,9 +39,81 @@
 #include "mlx5_core.h"
 #include "eswitch.h"
 
-int mlx5_cmd_update_root_ft(struct mlx5_core_dev *dev,
-   struct mlx5_flow_table *ft, u32 underlay_qpn,
-   bool disconnect)
+static int mlx5_cmd_stub_update_root_ft(struct mlx5_core_dev *dev,
+   struct mlx5_flow_table *ft,
+   u32 underlay_qpn,
+   bool disconnect)
+{
+   return 0;
+}
+
+static int mlx5_cmd_stub_create_flow_table(struct mlx5_core_dev *dev,
+  u16 vport,
+  enum fs_flow_table_op_mod op_mod,
+  enum fs_flow_table_type type,
+  unsigned int level,
+  unsigned int log_size,
+  struct mlx5_flow_table *next_ft,
+  unsigned int *table_id, u32 flags)
+{
+   return 0;
+}
+
+static int mlx5_cmd_stub_destroy_flow_table(struct mlx5_core_dev *dev,
+   struct mlx5_flow_table *ft)
+{
+   return 0;
+}
+
+static int mlx5_cmd_stub_modify_flow_table(struct mlx5_core_dev *dev,
+  struct mlx5_flow_table *ft,
+  struct mlx5_flow_table *next_ft)
+{
+   return 0;
+}
+
+static int mlx5_cmd_stub_create_flow_group(struct mlx5_core_dev *dev,
+  struct mlx5_flow_table *ft,
+  u32 *in,
+  unsigned int *group_id)
+{
+   return 0;
+}
+
+static int mlx5_cmd_stub_destroy_flow_group(struct mlx5_core_dev *dev,
+   struct mlx5_flow_table *ft,
+   unsigned int group_id)
+{
+   return 0;
+}
+
+static int mlx5_cmd_stub_create_fte(struct mlx5_core_dev *dev,
+   struct mlx5_flow_table *ft,
+   struct mlx5_flow_group *group,
+   struct fs_fte *fte)
+{
+   return 0;
+}
+
+static int mlx5_cmd_stub_update_fte(struct mlx5_core_dev *dev,
+   struct mlx5_flow_table *ft,
+   unsigned int group_id,
+   int modify_mask,
+   struct fs_fte *fte)
+{
+   return -EOPNOTSUPP;
+}
+
+static int mlx5_cmd_stub_delete_fte(struct mlx5_core_dev *dev,
+   struct mlx5_flow_table *ft,
+   unsigned int index)
+{
+   return 0;
+}
+
+static int mlx5_cmd_update_root_ft(struct mlx5_core_dev *dev,
+  struct mlx5_flow_table *ft, u32 underlay_qpn,
+  bool disconnect)
 {
u32 in[MLX5_ST_SZ_DW(set_flow_table_root_in)]   = {0};
u32 out[MLX5_ST_SZ_DW(set_flow_table_root_out)] = {0};
@@ -71,12 +143,14 @@ int mlx5_cmd_update_root_ft(struct mlx5_core_dev *dev,
return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
 }
 
-int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev,
-  u16 vport,
-  enum fs_flow_table_op_mod op_mod,
-  enum fs_flow_table_type type, unsigned int level,
-  unsigned int log_size, struct mlx5_flow_table
-  *next_ft, unsigned int *table_id, u32 flags)
+static int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev,
+ u16 vport,
+ enum fs_flow_table_op_mod op_mod,
+ enum fs_flow_table_type type,
+ unsigned int level,
+

[for-next V2 07/13] IB/mlx5: Pass mlx5_flow_act struct instead of multiple arguments

2018-03-06 Thread Saeed Mahameed

From: Boris Pismenny 

Group and pass all function arguments of parse_flow_attr call in one
common struct mlx5_flow_act.

This patch passes all the action arguments of parse_flow_attr in one common
struct mlx5_flow_act. It allows us to scale the number of actions without adding
new arguments to the function.

Signed-off-by: Matan Barak 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
Acked-by: Jason Gunthorpe 
---
 drivers/infiniband/hw/mlx5/main.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 23511638fbbc..1b305367a817 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2316,7 +2316,7 @@ static void set_tos(void *outer_c, void *outer_v, u8 
mask, u8 val)
 #define IPV6_VERSION 6
 static int parse_flow_attr(struct mlx5_core_dev *mdev, u32 *match_c,
   u32 *match_v, const union ib_flow_spec *ib_spec,
-  u32 *tag_id, bool *is_drop)
+  struct mlx5_flow_act *action)
 {
void *misc_params_c = MLX5_ADDR_OF(fte_match_param, match_c,
   misc_parameters);
@@ -2534,13 +2534,13 @@ static int parse_flow_attr(struct mlx5_core_dev *mdev, 
u32 *match_c,
if (ib_spec->flow_tag.tag_id >= BIT(24))
return -EINVAL;
 
-   *tag_id = ib_spec->flow_tag.tag_id;
+   action->flow_tag = ib_spec->flow_tag.tag_id;
break;
case IB_FLOW_SPEC_ACTION_DROP:
if (FIELDS_NOT_SUPPORTED(ib_spec->drop,
 LAST_DROP_FIELD))
return -EOPNOTSUPP;
-   *is_drop = true;
+   action->action |= MLX5_FLOW_CONTEXT_ACTION_DROP;
break;
default:
return -EINVAL;
@@ -2793,13 +2793,11 @@ static struct mlx5_ib_flow_handler 
*_create_flow_rule(struct mlx5_ib_dev *dev,
 {
struct mlx5_flow_table  *ft = ft_prio->flow_table;
struct mlx5_ib_flow_handler *handler;
-   struct mlx5_flow_act flow_act = {0};
+   struct mlx5_flow_act flow_act = {.flow_tag = MLX5_FS_DEFAULT_FLOW_TAG};
struct mlx5_flow_spec *spec;
struct mlx5_flow_destination *rule_dst = dst;
const void *ib_flow = (const void *)flow_attr + sizeof(*flow_attr);
unsigned int spec_index;
-   u32 flow_tag = MLX5_FS_DEFAULT_FLOW_TAG;
-   bool is_drop = false;
int err = 0;
int dest_num = 1;
 
@@ -2818,7 +2816,7 @@ static struct mlx5_ib_flow_handler 
*_create_flow_rule(struct mlx5_ib_dev *dev,
for (spec_index = 0; spec_index < flow_attr->num_of_specs; 
spec_index++) {
err = parse_flow_attr(dev->mdev, spec->match_criteria,
  spec->match_value,
- ib_flow, &flow_tag, &is_drop);
+ ib_flow, &flow_act);
if (err < 0)
goto free;
 
@@ -2841,8 +2839,7 @@ static struct mlx5_ib_flow_handler 
*_create_flow_rule(struct mlx5_ib_dev *dev,
}
 
spec->match_criteria_enable = 
get_match_criteria_enable(spec->match_criteria);
-   if (is_drop) {
-   flow_act.action = MLX5_FLOW_CONTEXT_ACTION_DROP;
+   if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_DROP) {
rule_dst = NULL;
dest_num = 0;
} else {
@@ -2850,15 +2847,14 @@ static struct mlx5_ib_flow_handler 
*_create_flow_rule(struct mlx5_ib_dev *dev,
MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO;
}
 
-   if (flow_tag != MLX5_FS_DEFAULT_FLOW_TAG &&
+   if (flow_act.flow_tag != MLX5_FS_DEFAULT_FLOW_TAG &&
(flow_attr->type == IB_FLOW_ATTR_ALL_DEFAULT ||
 flow_attr->type == IB_FLOW_ATTR_MC_DEFAULT)) {
mlx5_ib_warn(dev, "Flow tag %u and attribute type %x isn't 
allowed in leftovers\n",
-flow_tag, flow_attr->type);
+flow_act.flow_tag, flow_attr->type);
err = -EINVAL;
goto free;
}
-   flow_act.flow_tag = flow_tag;
handler->rule = mlx5_add_flow_rules(ft, spec,
&flow_act,
rule_dst, dest_num);
-- 
2.14.3

[for-next V2 04/13] net/mlx5e: Fixed sleeping inside atomic context

2018-03-06 Thread Saeed Mahameed

From: Aviad Yehezkel 

We can't allocate with GFP_KERNEL inside spinlock.
Actually ida_simple doesn't require spinlock so remove it.

Fixes: 547eede070eb ("net/mlx5e: IPSec, Innova IPSec offload infrastructure")
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
index bac5103efad3..710521181143 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
@@ -74,18 +74,16 @@ static int mlx5e_ipsec_sadb_rx_add(struct 
mlx5e_ipsec_sa_entry *sa_entry)
unsigned long flags;
int ret;
 
-   spin_lock_irqsave(&ipsec->sadb_rx_lock, flags);
ret = ida_simple_get(&ipsec->halloc, 1, 0, GFP_KERNEL);
if (ret < 0)
-   goto out;
+   return ret;
 
+   spin_lock_irqsave(&ipsec->sadb_rx_lock, flags);
sa_entry->handle = ret;
hash_add_rcu(ipsec->sadb_rx, &sa_entry->hlist, sa_entry->handle);
-   ret = 0;
-
-out:
spin_unlock_irqrestore(&ipsec->sadb_rx_lock, flags);
-   return ret;
+
+   return 0;
 }
 
 static void mlx5e_ipsec_sadb_rx_del(struct mlx5e_ipsec_sa_entry *sa_entry)
@@ -101,13 +99,10 @@ static void mlx5e_ipsec_sadb_rx_del(struct 
mlx5e_ipsec_sa_entry *sa_entry)
 static void mlx5e_ipsec_sadb_rx_free(struct mlx5e_ipsec_sa_entry *sa_entry)
 {
struct mlx5e_ipsec *ipsec = sa_entry->ipsec;
-   unsigned long flags;
 
/* Wait for the hash_del_rcu call in sadb_rx_del to affect data path */
synchronize_rcu();
-   spin_lock_irqsave(&ipsec->sadb_rx_lock, flags);
ida_simple_remove(&ipsec->halloc, sa_entry->handle);
-   spin_unlock_irqrestore(&ipsec->sadb_rx_lock, flags);
 }
 
 static enum mlx5_accel_ipsec_enc_mode mlx5e_ipsec_enc_mode(struct xfrm_state 
*x)
-- 
2.14.3

[for-next V2 10/13] net/mlx5: Add empty egress namespace to flow steering core

2018-03-06 Thread Saeed Mahameed

From: Aviad Yehezkel 

Currently, we don't support egress flow steering namespace in mlx5
flow steering core implementation. However, when we want to encrypt
a packet, we model it as a flow steering rule in the egress path.
To overcome this, we add an empty egress namespace to flow steering.
This namespace is initialized only when ipsec support exists.
In the future, this will grow to a full blown full steering
implementation, resembling the ingress path.

Signed-off-by: Matan Barak 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 .../mellanox/mlx5/core/diag/fs_tracepoint.c|  3 +++
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   |  1 +
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  | 28 ++
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  |  2 ++
 include/linux/mlx5/fs.h|  1 +
 include/linux/mlx5/mlx5_ifc.h  |  1 +
 6 files changed, 36 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
index 0be4575b58a2..3816b4506561 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
@@ -246,6 +246,9 @@ const char *parse_fs_dst(struct trace_seq *p,
case MLX5_FLOW_DESTINATION_TYPE_COUNTER:
trace_seq_printf(p, "counter_id=%u\n", counter_id);
break;
+   case MLX5_FLOW_DESTINATION_TYPE_PORT:
+   trace_seq_printf(p, "port\n");
+   break;
}
 
trace_seq_putc(p, 0);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index c3eaddb43e57..ed3ea80a24be 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -731,6 +731,7 @@ const struct mlx5_flow_cmds *mlx5_fs_cmd_get_default(enum 
fs_flow_table_type typ
case FS_FT_SNIFFER_RX:
case FS_FT_SNIFFER_TX:
return mlx5_fs_cmd_get_fw_cmds();
+   case FS_FT_NIC_TX:
default:
return mlx5_fs_cmd_get_stub_cmds();
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index f3a654b96b98..5c86d103 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -37,6 +37,7 @@
 #include "fs_core.h"
 #include "fs_cmd.h"
 #include "diag/fs_tracepoint.h"
+#include "accel/ipsec.h"
 
 #define INIT_TREE_NODE_ARRAY_SIZE(...) (sizeof((struct 
init_tree_node[]){__VA_ARGS__}) /\
 sizeof(struct init_tree_node))
@@ -2049,6 +2050,11 @@ struct mlx5_flow_namespace 
*mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
return &steering->sniffer_tx_root_ns->ns;
else
return NULL;
+   case MLX5_FLOW_NAMESPACE_EGRESS:
+   if (steering->egress_root_ns)
+   return &steering->egress_root_ns->ns;
+   else
+   return NULL;
default:
return NULL;
}
@@ -2413,6 +2419,7 @@ void mlx5_cleanup_fs(struct mlx5_core_dev *dev)
cleanup_root_ns(steering->fdb_root_ns);
cleanup_root_ns(steering->sniffer_rx_root_ns);
cleanup_root_ns(steering->sniffer_tx_root_ns);
+   cleanup_root_ns(steering->egress_root_ns);
mlx5_cleanup_fc_stats(dev);
kmem_cache_destroy(steering->ftes_cache);
kmem_cache_destroy(steering->fgs_cache);
@@ -2558,6 +2565,20 @@ static int init_ingress_acls_root_ns(struct 
mlx5_core_dev *dev)
return err;
 }
 
+static int init_egress_root_ns(struct mlx5_flow_steering *steering)
+{
+   struct fs_prio *prio;
+
+   steering->egress_root_ns = create_root_ns(steering,
+ FS_FT_NIC_TX);
+   if (!steering->egress_root_ns)
+   return -ENOMEM;
+
+   /* create 1 prio*/
+   prio = fs_create_prio(&steering->egress_root_ns->ns, 0, 1);
+   return PTR_ERR_OR_ZERO(prio);
+}
+
 int mlx5_init_fs(struct mlx5_core_dev *dev)
 {
struct mlx5_flow_steering *steering;
@@ -2623,6 +2644,13 @@ int mlx5_init_fs(struct mlx5_core_dev *dev)
goto err;
}
 
+   if (mlx5_accel_ipsec_device_caps(steering->dev) &
+   MLX5_ACCEL_IPSEC_DEVICE) {
+   err = init_egress_root_ns(steering);
+   if (err)
+   goto err;
+   }
+
return 0;
 err:
mlx5_cleanup_fs(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index 45791c792296..8586af9ce514 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -48,6 +

[for-next V2 08/13] {net,IB}/mlx5: Add has_tag to mlx5_flow_act

2018-03-06 Thread Saeed Mahameed

From: Matan Barak 

The has_tag member will indicate whether a tag action was specified
in flow specification.

A flow tag 0 = MLX5_FS_DEFAULT_FLOW_TAG is assumed a valid flow tag
that is currently used by mlx5 RDMA driver, whereas in HW flow_tag = 0
means that the user doesn't care about flow_tag.  HW always provide
a flow_tag = 0 if all flow tags requested on a specific flow are 0.

So we need a way (in the driver) to differentiate between a user really
requesting flow_tag = 0 and a user who does not care, in order to be
able to report conflicting flow tags on a specific flow.

Signed-off-by: Matan Barak 
Reviewed-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/main.c | 3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 2 +-
 include/linux/mlx5/fs.h   | 1 +
 4 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 1b305367a817..d50ace805995 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2535,6 +2535,7 @@ static int parse_flow_attr(struct mlx5_core_dev *mdev, 
u32 *match_c,
return -EINVAL;
 
action->flow_tag = ib_spec->flow_tag.tag_id;
+   action->has_flow_tag = true;
break;
case IB_FLOW_SPEC_ACTION_DROP:
if (FIELDS_NOT_SUPPORTED(ib_spec->drop,
@@ -2847,7 +2848,7 @@ static struct mlx5_ib_flow_handler 
*_create_flow_rule(struct mlx5_ib_dev *dev,
MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO;
}
 
-   if (flow_act.flow_tag != MLX5_FS_DEFAULT_FLOW_TAG &&
+   if (flow_act.has_flow_tag &&
(flow_attr->type == IB_FLOW_ATTR_ALL_DEFAULT ||
 flow_attr->type == IB_FLOW_ATTR_MC_DEFAULT)) {
mlx5_ib_warn(dev, "Flow tag %u and attribute type %x isn't 
allowed in leftovers\n",
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index fd98b0dc610f..eeff1fac77ef 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -675,6 +675,7 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
struct mlx5_flow_destination dest[2] = {};
struct mlx5_flow_act flow_act = {
.action = attr->action,
+   .has_flow_tag = true,
.flow_tag = attr->flow_tag,
.encap_id = 0,
};
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index c025c98700e4..d81da6920be8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1443,7 +1443,7 @@ static int check_conflicting_ftes(struct fs_fte *fte, 
const struct mlx5_flow_act
return -EEXIST;
}
 
-   if (fte->flow_tag != flow_act->flow_tag) {
+   if (flow_act->has_flow_tag && fte->flow_tag != flow_act->flow_tag) {
mlx5_core_warn(get_dev(&fte->node),
   "FTE flow tag %u already exists with different 
flow tag %u\n",
   fte->flow_tag,
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index a0b48afcb422..f580bc4c2443 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -141,6 +141,7 @@ void mlx5_destroy_flow_group(struct mlx5_flow_group *fg);
 
 struct mlx5_flow_act {
u32 action;
+   bool has_flow_tag;
u32 flow_tag;
u32 encap_id;
u32 modify_id;
-- 
2.14.3

[for-next V2 06/13] net/mlx5: FPGA and IPSec initialization to be before flow steering

2018-03-06 Thread Saeed Mahameed

From: Matan Barak 

Some flow steering namespace initialization (i.e. egress namespace)
might depend on FPGA capabilities. Changing the initialization order
such that the FPGA will be initialized before flow steering.

Flow steering fs cmds initialization might depend on
IPSec capabilities. Changing the initialization order such
that the IPSec will be initialized before flow steering as well.

Signed-off-by: Aviad Yehezkel 
Signed-off-by: Matan Barak 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 39 +-
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8cc22bf80c87..03972eed02cd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1173,6 +1173,18 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
goto err_affinity_hints;
}
 
+   err = mlx5_fpga_device_start(dev);
+   if (err) {
+   dev_err(&pdev->dev, "fpga device start failed %d\n", err);
+   goto err_fpga_start;
+   }
+
+   err = mlx5_accel_ipsec_init(dev);
+   if (err) {
+   dev_err(&pdev->dev, "IPSec device start failed %d\n", err);
+   goto err_ipsec_start;
+   }
+
err = mlx5_init_fs(dev);
if (err) {
dev_err(&pdev->dev, "Failed to init flow steering\n");
@@ -1191,17 +1203,6 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
goto err_sriov;
}
 
-   err = mlx5_fpga_device_start(dev);
-   if (err) {
-   dev_err(&pdev->dev, "fpga device start failed %d\n", err);
-   goto err_fpga_start;
-   }
-   err = mlx5_accel_ipsec_init(dev);
-   if (err) {
-   dev_err(&pdev->dev, "IPSec device start failed %d\n", err);
-   goto err_ipsec_start;
-   }
-
if (mlx5_device_registered(dev)) {
mlx5_attach_device(dev);
} else {
@@ -1219,17 +1220,18 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
return 0;
 
 err_reg_dev:
-   mlx5_accel_ipsec_cleanup(dev);
-err_ipsec_start:
-   mlx5_fpga_device_stop(dev);
-
-err_fpga_start:
mlx5_sriov_detach(dev);
 
 err_sriov:
mlx5_cleanup_fs(dev);
 
 err_fs:
+   mlx5_accel_ipsec_cleanup(dev);
+
+err_ipsec_start:
+   mlx5_fpga_device_stop(dev);
+
+err_fpga_start:
mlx5_irq_clear_affinity_hints(dev);
 
 err_affinity_hints:
@@ -1296,11 +1298,10 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
if (mlx5_device_registered(dev))
mlx5_detach_device(dev);
 
-   mlx5_accel_ipsec_cleanup(dev);
-   mlx5_fpga_device_stop(dev);
-
mlx5_sriov_detach(dev);
mlx5_cleanup_fs(dev);
+   mlx5_accel_ipsec_cleanup(dev);
+   mlx5_fpga_device_stop(dev);
mlx5_irq_clear_affinity_hints(dev);
free_comp_eqs(dev);
mlx5_stop_eqs(dev);
-- 
2.14.3

[for-next V2 05/13] net/mlx5e: Removed not need synchronize_rcu

2018-03-06 Thread Saeed Mahameed

From: Aviad Yehezkel 

This is already done by xfrm layer between state_dev_del callback
to state_dev_free callback.

Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
index 710521181143..1b49afca65c0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
@@ -100,8 +100,8 @@ static void mlx5e_ipsec_sadb_rx_free(struct 
mlx5e_ipsec_sa_entry *sa_entry)
 {
struct mlx5e_ipsec *ipsec = sa_entry->ipsec;
 
-   /* Wait for the hash_del_rcu call in sadb_rx_del to affect data path */
-   synchronize_rcu();
+   /* xfrm already doing sync rcu between del and free callbacks */
+
ida_simple_remove(&ipsec->halloc, sa_entry->handle);
 }
 
-- 
2.14.3

[for-next V2 03/13] net/mlx5e: Wait for FPGA command responses with a timeout

2018-03-06 Thread Saeed Mahameed

From: Aviad Yehezkel 

Generally, FPGA IPSec commands must always complete.
We want to wait for one minute for them to complete gracefully also
when killing a process.

Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
index 35d0e33381ca..95f9c5a8619b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
@@ -39,6 +39,7 @@
 #include "fpga/core.h"
 
 #define SBU_QP_QUEUE_SIZE 8
+#define MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC   (60 * 1000)
 
 enum mlx5_ipsec_response_syndrome {
MLX5_IPSEC_RESPONSE_SUCCESS = 0,
@@ -217,12 +218,14 @@ void *mlx5_fpga_ipsec_sa_cmd_exec(struct mlx5_core_dev 
*mdev,
 int mlx5_fpga_ipsec_sa_cmd_wait(void *ctx)
 {
struct mlx5_ipsec_command_context *context = ctx;
+   unsigned long timeout =
+   msecs_to_jiffies(MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC);
int res;
 
-   res = wait_for_completion_killable(&context->complete);
-   if (res) {
+   res = wait_for_completion_timeout(&context->complete, timeout);
+   if (!res) {
mlx5_fpga_warn(context->dev, "Failure waiting for IPSec command 
response\n");
-   return -EINTR;
+   return -ETIMEDOUT;
}
 
if (context->status == MLX5_FPGA_IPSEC_SACMD_COMPLETE)
-- 
2.14.3

[for-next V2 02/13] net/mlx5: Fixed compilation issue when CONFIG_MLX5_ACCEL is disabled

2018-03-06 Thread Saeed Mahameed

From: Aviad Yehezkel 

IPSec init and cleanup functions also depends on linux/mlx5/driver.h.

Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/accel/ipsec.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/ipsec.h 
b/drivers/net/ethernet/mellanox/mlx5/core/accel/ipsec.h
index d6e20fea9554..67cda8871f5a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/accel/ipsec.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/ipsec.h
@@ -34,10 +34,10 @@
 #ifndef __MLX5_ACCEL_IPSEC_H__
 #define __MLX5_ACCEL_IPSEC_H__
 
-#ifdef CONFIG_MLX5_ACCEL
-
 #include 
 
+#ifdef CONFIG_MLX5_ACCEL
+
 enum {
MLX5_ACCEL_IPSEC_DEVICE = BIT(1),
MLX5_ACCEL_IPSEC_IPV6 = BIT(2),
-- 
2.14.3

[for-next V2 01/13] IB/mlx5: Removed not used parameters

2018-03-06 Thread Saeed Mahameed

From: Aviad Yehezkel 

Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
Acked-by: Jason Gunthorpe 
---
 drivers/infiniband/hw/mlx5/main.c | 2 --
 drivers/infiniband/hw/mlx5/qp.c   | 3 ---
 2 files changed, 5 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index ee55d7d64554..23511638fbbc 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -4585,8 +4585,6 @@ int mlx5_ib_stage_init_init(struct mlx5_ib_dev *dev)
goto err_free_port;
 
if (!mlx5_core_mp_enabled(mdev)) {
-   int i;
-
for (i = 1; i <= dev->num_ports; i++) {
err = get_port_caps(dev, i);
if (err)
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 5663530ea5fd..0e67e3682bca 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -2153,7 +2153,6 @@ static struct ib_qp *mlx5_ib_create_dct(struct ib_pd *pd,
struct ib_qp_init_attr *attr,
struct mlx5_ib_create_qp *ucmd)
 {
-   struct mlx5_ib_dev *dev;
struct mlx5_ib_qp *qp;
int err = 0;
u32 uidx = MLX5_IB_DEFAULT_UIDX;
@@ -2162,8 +2161,6 @@ static struct ib_qp *mlx5_ib_create_dct(struct ib_pd *pd,
if (!attr->srq || !attr->recv_cq)
return ERR_PTR(-EINVAL);
 
-   dev = to_mdev(pd->device);
-
err = get_qp_user_index(to_mucontext(pd->uobject->context),
ucmd, sizeof(*ucmd), &uidx);
if (err)
-- 
2.14.3

[pull request][for-next V2 00/13] Mellanox, mlx5 IPSec updates 2018-02-28-1

2018-03-06 Thread Saeed Mahameed

Hi Dave and Doug,

This series includes shared code updates for mlx5 core driver for both
netdev and rdma subsystems.  This series should be pulled to both
trees so we can continue netdev and rdma specific submissions separately.

For more information please see tag log below.

The series doesn't cause any conflict with the latest mlx5 rc fixes.

v1->v2:
  - Drop sparse fixes patch
  - Updated commit message of "net/mlx5: Add has_tag to mlx5_flow_act"
  - Add const to  static mlx5_flow_cmd structs where needed.

Thanks,
Saeed.

--- 

The following changes since commit ec9c2fb8ceb5b514c4820f732537cb2982de0620:

  IB/mlx5: Disable self loopback check when in switchdev mode (2018-02-23 
12:36:39 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git 
tags/mlx5-updates-2018-02-28-1

for you to fetch changes up to e810bf5e96e327500cc6334f9d56c8047aaabcff:

  net/mlx5: Flow steering cmd interface should get the fte when deleting 
(2018-03-06 22:20:15 -0800)


mlx5-updates-2018-02-28-1 (IPSec-1)

This series consists of some fixes and refactors for the mlx5 drivers,
especially around the FPGA and flow steering. Most of them are trivial
fixes and are the foundation of allowing IPSec acceleration from user-space.

We use flow steering abstraction in order to accelerate IPSec packets.
When a user creates a steering rule, [s]he states that we'll carry an
encrypt/decrypt flow action (using a specific configuration) for every
packet which conforms to a certain match. Since currently offloading these
packets is done via FPGA, we'll add another set of flow steering ops.
These ops will execute the required FPGA commands and then call the
standard steering ops.

In order to achieve this, we need that the commands will get all the
required information. Therefore, we pass the fte object and embed the
flow_action struct inside the fte. In addition, we add the shim layer
that will later be used for alternating between the standard and the
FPGA steering commands.

Some fixes, like " net/mlx5e: Wait for FPGA command responses with a timeout"
are very relevant for user-space applications, as these applications could
be killed, but we still want to wait for the FPGA and update the kernel's
database.

Regards,
Aviad and Matan


Aviad Yehezkel (7):
  IB/mlx5: Removed not used parameters
  net/mlx5: Fixed compilation issue when CONFIG_MLX5_ACCEL is disabled
  net/mlx5e: Wait for FPGA command responses with a timeout
  net/mlx5e: Fixed sleeping inside atomic context
  net/mlx5e: Removed not need synchronize_rcu
  net/mlx5: Add empty egress namespace to flow steering core
  net/mlx5: Flow steering cmd interface should get the fte when deleting

Boris Pismenny (2):
  IB/mlx5: Pass mlx5_flow_act struct instead of multiple arguments
  {net,IB}/mlx5: Add flow steering helpers

Matan Barak (4):
  net/mlx5: FPGA and IPSec initialization to be before flow steering
  {net,IB}/mlx5: Add has_tag to mlx5_flow_act
  net/mlx5: Add shim layer between fs and cmd
  net/mlx5: Embed mlx5_flow_act into fs_fte

 drivers/infiniband/hw/mlx5/main.c  |  30 ++-
 drivers/infiniband/hw/mlx5/qp.c|   3 -
 .../net/ethernet/mellanox/mlx5/core/accel/ipsec.h  |   4 +-
 .../mellanox/mlx5/core/diag/fs_tracepoint.c|   3 +
 .../mellanox/mlx5/core/diag/fs_tracepoint.h|   4 +-
 .../ethernet/mellanox/mlx5/core/en_accel/ipsec.c   |  17 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|   1 +
 .../net/ethernet/mellanox/mlx5/core/fpga/ipsec.c   |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   | 207 +
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h   |  72 +++
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  | 136 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  39 ++--
 include/linux/mlx5/fs.h|   2 +
 include/linux/mlx5/fs_helpers.h| 134 +
 include/linux/mlx5/mlx5_ifc.h  |   9 +-
 16 files changed, 494 insertions(+), 184 deletions(-)
 create mode 100644 include/linux/mlx5/fs_helpers.h

Re: Use of Indirect function calls

2018-03-06 Thread Eric Dumazet

On Tue, 2018-03-06 at 21:53 -0800, Rao Shoaib wrote:
> David,
> 
> Thanks a lot for your prompt response. Do you have a specific
> solution 
> in mind or will the calls be replaced with simple checks ?

There is upcoming work for that, but not specific to TCP stack.

> 
> Also while I have your attention can I ask your opinion about
> breaking 
> up some TCP functions, mostly control functions into smaller units
> so 
> that if a little different behavior is desired it can be achieved
> and 
> common code could still be shared. Of course you can not say much 
> without looking at the code but will you even entertain such a change
> ?

I am sorry, but I would prefer no code refactoring unless you fix a
serious bug, or prepare for something really new (and having noticeable
impact)

We have to maintain stable trees, and such code churns are adding
maintenance hassles.

Of course, you can submit patches, but be warned that you can not
expect us spending hours reviewing patches that might bring serious
regressions.

I suggest you start with small patches first.

Re: [for-next 01/14] net/mlx5: Fixed sparse issues

2018-03-06 Thread Saeed Mahameed

On Mon, 2018-03-05 at 22:53 +0200, Or Gerlitz wrote:
> On Mon, Mar 5, 2018 at 10:46 PM, Saeed Mahameed 
> wrote:
> > From: Aviad Yehezkel 
> > 
> > 1. Local fucntions should be static.
> 
> s/fucntions/functions/
> 
> > 2. Missing declarations warnings.
> > 
> > Signed-off-by: Aviad Yehezkel 
> > Signed-off-by: Saeed Mahameed 
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_main.c   | 4 ++--
> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 1 +
> >  drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c | 1 +
> >  drivers/net/ethernet/mellanox/mlx5/core/lib/clock.h | 2 ++
> >  4 files changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index 47bab842c5ee..a64b9226d281 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -2994,8 +2994,8 @@ static int mlx5e_setup_tc_block(struct
> > net_device *dev,
> >  }
> >  #endif
> > 
> > -int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type
> > type,
> > -  void *type_data)
> > +static int mlx5e_setup_tc(struct net_device *dev, enum
> > tc_setup_type type,
> > + void *type_data)
> >  {
> > switch (type) {
> >  #ifdef CONFIG_MLX5_ESWITCH
> 
> Saeed, this (and also the below change) seems like re-doing net
> commit
>  9afe9a5353778994d4396f3d5ff639221bfa5cc9 "net/mlx5e: Eliminate build
> warnings on no previous prototype", why?
> 

Because the trees are not merged yet and byte to byte level the code is
identical nothing to worry about here.

I will drop this patch since we need to re=spin a V2 anyways and Aviad
will have to fix it later.

> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 80b84f6af2a1..59fe0ec5edcd 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -34,6 +34,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include "en.h"
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> > index e159243e0fcf..2ffa59ce7976 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> > @@ -33,6 +33,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include "clock.h"
> >  #include "en.h"
> > 
> >  enum {
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.h
> > b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.h
> > index a8eecedd46c2..c200182aa0af 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.h
> > @@ -30,6 +30,8 @@
> >   * SOFTWARE.
> >   */
> > 
> > +#include 
> > +

Re: [for-next 09/14] {net,IB}/mlx5: Add has_tag to mlx5_flow_act

2018-03-06 Thread Saeed Mahameed

On Mon, 2018-03-05 at 14:07 -0700, Jason Gunthorpe wrote:
> On Mon, Mar 05, 2018 at 12:46:32PM -0800, Saeed Mahameed wrote:
> > From: Matan Barak 
> > 
> > The has_tag member will indicate whether a tag action was specified
> > in flow specification.
> 
> It would be good to describe in the commit message why
> 
>  flow_act.flow_tag != MLX5_FS_DEFAULT_FLOW_TAG
> 
> isn't good enough anymore.

A flow tag 0 = MLX5_FS_DEFAULT_FLOW_TAG is assumed a valid flow tag
that is currently used by RDMA driver, whereas in HW flow_tag = 0 means
that the user doesn't care about flow_tag. HW always provide flow_tag =
0 if all flow_tags requested on a specific flow are 0.

So we need a way (in the driver) to differentiate between a user really
requesting flow_tag = 0 and a user who does not care, in order to be
able to report conflicting flow tags on a specific flow.

We will add this to commit message.

> 
> > Signed-off-by: Matan Barak 
> > Reviewed-by: Aviad Yehezkel 
> > Signed-off-by: Saeed Mahameed 
> >  drivers/infiniband/hw/mlx5/main.c | 3 ++-
> >  drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   | 1 +
> >  drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 2 +-
> >  include/linux/mlx5/fs.h   | 1 +
> >  4 files changed, 5 insertions(+), 2 deletions(-)
> 
> Assuming there is a good reason to do this:
> 
> Acked-by: Jason Gunthorpe 
> 
> Jason

Re: [for-next 10/14] net/mlx5: Add shim layer between fs and cmd

2018-03-06 Thread Saeed Mahameed

On Mon, 2018-03-05 at 14:03 -0700, Jason Gunthorpe wrote:
> On Mon, Mar 05, 2018 at 12:46:33PM -0800, Saeed Mahameed wrote:
> 
> > +static struct mlx5_flow_cmds mlx5_flow_cmds = {
> 
> 'static const' on these new static structs?
> 

Yes, Will fix.

> Jason

Re: Use of Indirect function calls

2018-03-06 Thread Rao Shoaib


David,

Thanks a lot for your prompt response. Do you have a specific solution 
in mind or will the calls be replaced with simple checks ?


Also while I have your attention can I ask your opinion about breaking 
up some TCP functions, mostly control functions into smaller units so 
that if a little different behavior is desired it can be achieved and 
common code could still be shared. Of course you can not say much 
without looking at the code but will you even entertain such a change ?


Regards,

Rao.


On 03/06/2018 08:43 PM, David Miller wrote:

From: Rao Shoaib 
Date: Tue, 6 Mar 2018 19:35:46 -0800


I do not expect any measurable overhead as modern CPU's use
pre-fetching and multiple parallel execution engines.

Please see Spectre and retpolines, all of this parallel execution and
prefetching is essentially disabled to address those vulnerabilities
and side-channel exploits.

Indirect calls are terrible and we are now looking at ways in which
we can remove them from as many parts of the networking as possible.

Re: [RFC v3 net-next 00/18] Time based packet transmission

2018-03-06 Thread Richard Cochran

On Tue, Mar 06, 2018 at 05:12:12PM -0800, Jesus Sanchez-Palencia wrote:
> Design changes since v2:
>  - Now on the dequeue() path, tbs only drops an expired packet if it has the
>skb->tc_drop_if_late flag set. In practical terms, this will define if
>the semantics of txtime on a system is "not earlier than" or "not later
>than" a given timestamp;
>  - Now on the enqueue() path, the qdisc will drop a packet if its clockid
>doesn't match the qdisc's one;
>  - Sorting the packets based on their txtime is now an option for the disc.
>Effectively, this means it can be configured in 4 modes: HW offload or
>SW best-effort, sorting enabled or disabled;

While all of this makes the series and the configuration more complex,
still I like the fact that the interface offers these different modes.

Looking forward to testing this...

Thanks,
Richard

Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params

2018-03-06 Thread Richard Cochran

On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote:
> This is adding 32+1 bits to sk_buff, and possibly holes in this very
> very hot (and already too fat) structure.
> 
> Do we really need 32 bits for a clockid_t ?

Probably we can live with fewer bits.

For clock IDs with a positive sign, the max possible clock value is 16.

For clock IDs with a negative sign, IIRC, three bits are for the type
code (we have also posix timers packed like this) and the are for the
file descriptor.  So maybe we could use 16 bits, allowing 12 bits or
so for encoding the FD.

The downside would be that this forces the application to make sure
and open the dynamic posix clock early enough before the FD count gets
too high.

Thanks,
Richard

Re: [PATCH net v2 RESEND] ipv6: Reflect MTU changes on PMTU of exceptions for MTU-less routes

2018-03-06 Thread David Ahern

On 3/6/18 3:10 AM, Stefano Brivio wrote:
> Currently, administrative MTU changes on a given netdevice are
> not reflected on route exceptions for MTU-less routes, with a
> set PMTU value, for that device:
> 
>  # ip -6 route get 2001:db8::b
>  2001:db8::b from :: dev vti_a proto kernel src 2001:db8::a metric 256 pref 
> medium
>  # ping6 -c 1 -q -s1 2001:db8::b > /dev/null
>  # ip netns exec a ip -6 route get 2001:db8::b
>  2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
>  cache expires 571sec mtu 4926 pref medium
>  # ip link set dev vti_a mtu 3000
>  # ip -6 route get 2001:db8::b
>  2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
>  cache expires 571sec mtu 4926 pref medium
>  # ip link set dev vti_a mtu 9000
>  # ip -6 route get 2001:db8::b
>  2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
>  cache expires 571sec mtu 4926 pref medium
> 
> The first issue is that since commit fb56be83e43d ("net-ipv6: on
> device mtu change do not add mtu to mtu-less routes") we don't
> call rt6_exceptions_update_pmtu() from rt6_mtu_change_route(),
> which handles administrative MTU changes, if the regular route
> is MTU-less.
> 
> However, PMTU exceptions should be always updated, as long as
> RTAX_MTU is not locked. Keep the check for MTU-less main route,
> as introduced by that commit, but, for exceptions,
> call rt6_exceptions_update_pmtu() regardless of that check.
> 
> Once that is fixed, one problem remains: MTU changes are not
> reflected if the new MTU is higher than the previous one,
> because rt6_exceptions_update_pmtu() doesn't allow that. We
> should instead allow PMTU increase if the old PMTU matches the
> local MTU, as that implies that the old MTU was the lowest in the
> path, and PMTU discovery might lead to different results.
> 
> The existing check in rt6_mtu_change_route() correctly took that
> case into account (for regular routes only), so factor it out
> and re-use it also in rt6_exceptions_update_pmtu().
> 
> While at it, fix comments style and grammar, and try to be a bit
> more descriptive.
> 
> Reported-by: Xiumei Mu 
> Fixes: fb56be83e43d ("net-ipv6: on device mtu change do not add mtu to 
> mtu-less routes")
> Fixes: f5bbe7ee79c2 ("ipv6: prepare rt6_mtu_change() for exception table")
> Signed-off-by: Stefano Brivio 
> ---

Acked-by: David Ahern

Re: [PATCH net-next v3] selftests: net: Introduce first PMTU test

2018-03-06 Thread David Ahern

On 3/6/18 2:16 PM, Stefano Brivio wrote:
> One single test implemented so far: test_pmtu_vti6_exception
> checks that the PMTU of a route exception, caused by a tunnel
> exceeding the link layer MTU, is affected by administrative
> changes of the tunnel MTU. Creation of the route exception is
> checked too.
> 
> Requested-by: David Ahern 
> Signed-off-by: Stefano Brivio 
> ---
> v3: Explicitly set veth MTU before causing route exception to ensure we 
> actually
> decrease the PMTU as second step in the test (issue reported by David 
> Ahern)
> 
> v2: Fix error handling for setup_*() functions, make 'ip route get' output
> parsing more robust, sleep after configuring vti6 addresses (all issues
> reported by David Ahern)
> 
>  tools/testing/selftests/net/Makefile |   2 +-
>  tools/testing/selftests/net/pmtu.sh  | 163 
> +++
>  2 files changed, 164 insertions(+), 1 deletion(-)
>  create mode 100755 tools/testing/selftests/net/pmtu.sh
> 

Acked-by: David Ahern

[PATCH] hv_netvsc: fix multicast flags and sync

2018-03-06 Thread Stephen Hemminger

This addresses two problems with recent multicast flags synchronization.
The wrong filter value was being computed (and therefore ARP would
not work on WS2016). And the uc/mc sync logic had locking issues because
it would be called with irq disabled (reported by lockdep).

Fixes: bee9d41b37ea ("hv_netvsc: propagate rx filters to VF")
Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc_drv.c   |  6 --
 drivers/net/hyperv/rndis_filter.c | 23 ---
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index cdb78eefab67..55e7b5aa1711 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -89,14 +89,8 @@ static void netvsc_change_rx_flags(struct net_device *net, 
int change)
 static void netvsc_set_rx_mode(struct net_device *net)
 {
struct net_device_context *ndev_ctx = netdev_priv(net);
-   struct net_device *vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
struct netvsc_device *nvdev = rtnl_dereference(ndev_ctx->nvdev);
 
-   if (vf_netdev) {
-   dev_uc_sync(vf_netdev, net);
-   dev_mc_sync(vf_netdev, net);
-   }
-
rndis_filter_update(nvdev);
 }
 
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 8927c483c217..b8cb2c3eb303 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -854,19 +854,28 @@ static void rndis_set_multicast(struct work_struct *w)
 {
struct rndis_device *rdev
= container_of(w, struct rndis_device, mcast_work);
+   struct net_device *ndev = rdev->ndev;
+   struct net_device_context *ndev_ctx = netdev_priv(ndev);
+   struct net_device *vf_netdev;
u32 filter = NDIS_PACKET_TYPE_DIRECTED;
-   unsigned int flags = rdev->ndev->flags;
 
-   if (flags & IFF_PROMISC) {
+   if (ndev->flags & IFF_PROMISC) {
filter = NDIS_PACKET_TYPE_PROMISCUOUS;
} else {
-   if (flags & IFF_ALLMULTI)
-   flags |= NDIS_PACKET_TYPE_ALL_MULTICAST;
-   if (flags & IFF_BROADCAST)
-   flags |= NDIS_PACKET_TYPE_BROADCAST;
+   if (ndev->flags & IFF_ALLMULTI)
+   filter |= NDIS_PACKET_TYPE_ALL_MULTICAST;
+   if (ndev->flags & IFF_BROADCAST)
+   filter |= NDIS_PACKET_TYPE_BROADCAST;
}
-
rndis_filter_set_packet_filter(rdev, filter);
+
+   rcu_read_lock();
+   vf_netdev = rcu_dereference(ndev_ctx->vf_netdev);
+   if (vf_netdev) {
+   dev_uc_sync(vf_netdev, ndev);
+   dev_mc_sync(vf_netdev, ndev);
+   }
+   rcu_read_unlock();
 }
 
 void rndis_filter_update(struct netvsc_device *nvdev)
-- 
2.16.1

Re: Use of Indirect function calls

2018-03-06 Thread David Miller

From: Rao Shoaib 
Date: Tue, 6 Mar 2018 19:35:46 -0800

> I do not expect any measurable overhead as modern CPU's use
> pre-fetching and multiple parallel execution engines.

Please see Spectre and retpolines, all of this parallel execution and
prefetching is essentially disabled to address those vulnerabilities
and side-channel exploits.

Indirect calls are terrible and we are now looking at ways in which
we can remove them from as many parts of the networking as possible.

Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-06 Thread David Miller

From: John Fastabend 
Date: Tue, 6 Mar 2018 19:25:01 -0800

> What do you think? 

Sounds good from your description, I can't wait to see it :-)

[PATCH v3 net-next 4/5] selftests: fib_tests: Allow user to run a specific test

2018-03-06 Thread David Ahern

Allow a user to run just a specific fib test by setting the TEST
environment variable.

Signed-off-by: David Ahern 
---
 tools/testing/selftests/net/fib_tests.sh | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/net/fib_tests.sh 
b/tools/testing/selftests/net/fib_tests.sh
index 953254439e39..cfdeb35bfed5 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -392,9 +392,13 @@ fib_carrier_test()
 
 fib_test()
 {
-   fib_unreg_test
-   fib_down_test
-   fib_carrier_test
+   if [ -n "$TEST" ]; then
+   eval $TEST
+   else
+   fib_unreg_test
+   fib_down_test
+   fib_carrier_test
+   fi
 }
 
 if [ "$(id -u)" -ne 0 ];then
-- 
2.11.0

[PATCH v3 net-next 2/5] net/ipv6: Address checks need to consider the L3 domain

2018-03-06 Thread David Ahern

ipv6_chk_addr_and_flags determines if an address is a local address. It
is called by ip6_route_info_create to validate a gateway address is not a
local address. It currently does not consider L3 domains and as a result
does not allow a route to be added in one VRF if the nexthop points to
an address in a second VRF. e.g.,

$ ip route add 2001:db8:1::/64 vrf r2 via 2001:db8:102::23
Error: Invalid gateway address.

where 2001:db8:102::23 is an address on an interface in vrf r1.

Resolve by comparing the l3mdev for the passed in device and requiring an
l3mdev match with the device containing an address. The intent of checking
for an address on the specified device versus any device in the domain is
mantained by a new argument to skip the check between the passed in device
and the device with the address.

Update the handful of users of ipv6_chk_addr with a NULL dev argument:
- anycast to call ipv6_chk_addr_and_flags. If the device is given by the
  user, look for the given address across the L3 domain. If the index is
  not given, the default table is presumed so only addresses on devices
  not enslaved are considered.

- ip6_tnl_rcv_ctl - local address must exist on device, remote address
  can not exist in L3 domain; only remote check needs to be updated but
  do both for consistency.

ip6_validate_gw needs to handle 2 cases - one where the device is given
as part of the nexthop spec and the other where the device is resolved.
There is at least 1 VRF case where deferring the check to only after
the route lookup has resolved the device fails with an unintuitive error
"RTNETLINK answers: No route to host" as opposed to the preferred
"Error: Gateway can not be a local address." The 'no route to host'
error is because of the fallback to a full lookup.

Signed-off-by: David Ahern 
---
 include/net/addrconf.h |  4 ++--
 net/ipv6/addrconf.c| 26 ++
 net/ipv6/anycast.c |  9 ++---
 net/ipv6/datagram.c|  5 +++--
 net/ipv6/ip6_tunnel.c  | 12 
 net/ipv6/ndisc.c   |  2 +-
 net/ipv6/route.c   | 37 -
 7 files changed, 70 insertions(+), 25 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index c4185a7b0e90..132e5b95167a 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -69,8 +69,8 @@ int addrconf_set_dstaddr(struct net *net, void __user *arg);
 int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
  const struct net_device *dev, int strict);
 int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
-   const struct net_device *dev, int strict,
-   u32 banned_flags);
+   const struct net_device *dev, bool skip_dev_check,
+   int strict, u32 banned_flags);
 
 #if defined(CONFIG_IPV6_MIP6) || defined(CONFIG_IPV6_MIP6_MODULE)
 int ipv6_chk_home_addr(struct net *net, const struct in6_addr *addr);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index b5fd116c046a..17d5d3f42d21 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1851,22 +1851,40 @@ static int ipv6_count_addresses(const struct inet6_dev 
*idev)
 int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
  const struct net_device *dev, int strict)
 {
-   return ipv6_chk_addr_and_flags(net, addr, dev, strict, IFA_F_TENTATIVE);
+   return ipv6_chk_addr_and_flags(net, addr, dev, !dev,
+  strict, IFA_F_TENTATIVE);
 }
 EXPORT_SYMBOL(ipv6_chk_addr);
 
+/* device argument is used to find the L3 domain of interest. If
+ * skip_dev_check is set, then the ifp device is not checked against
+ * the passed in dev argument. So the 2 cases for addresses checks are:
+ *   1. does the address exist in the L3 domain that dev is part of
+ *  (skip_dev_check = true), or
+ *
+ *   2. does the address exist on the specific device
+ *  (skip_dev_check = false)
+ */
 int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
-   const struct net_device *dev, int strict,
-   u32 banned_flags)
+   const struct net_device *dev, bool skip_dev_check,
+   int strict, u32 banned_flags)
 {
unsigned int hash = inet6_addr_hash(net, addr);
+   const struct net_device *l3mdev;
struct inet6_ifaddr *ifp;
u32 ifp_flags;
 
rcu_read_lock();
+
+   l3mdev = l3mdev_master_dev_rcu(dev);
+
hlist_for_each_entry_rcu(ifp, &inet6_addr_lst[hash], addr_lst) {
if (!net_eq(dev_net(ifp->idev->dev), net))
continue;
+
+   if (l3mdev_master_dev_rcu(ifp->idev->dev) != l3mdev)
+   continue;
+
/* Decouple optimistic from tentative for evaluation here.
 * Ban optimistic addresses explicitly, when r

[PATCH v3 net-next 5/5] selftests: fib_tests: Add IPv6 nexthop spec tests

2018-03-06 Thread David Ahern

Add series of tests for valid and invalid nexthop specs for IPv6.

$ TEST=fib_nexthop_test ./fib_tests.sh
...
IPv6 nexthop tests
TEST: Directly connected nexthop, unicast address  [ OK ]
TEST: Directly connected nexthop, unicast address with device  [ OK ]
TEST: Gateway is linklocal address [ OK ]
TEST: Gateway is linklocal address, no device  [ OK ]
TEST: Gateway can not be local unicast address [ OK ]
TEST: Gateway can not be local unicast address, with device[ OK ]
TEST: Gateway can not be a local linklocal address [ OK ]
TEST: Gateway can be local address in a VRF[ OK ]
TEST: Gateway can be local address in a VRF, with device   [ OK ]
TEST: Gateway can be local linklocal address in a VRF  [ OK ]
TEST: Redirect to VRF lookup   [ OK ]
TEST: VRF route, gateway can be local address in default VRF   [ OK ]
TEST: VRF route, gateway can not be a local address[ OK ]
TEST: VRF route, gateway can not be a local addr with device   [ OK ]

Signed-off-by: David Ahern 
---
 tools/testing/selftests/net/fib_tests.sh | 180 ++-
 1 file changed, 178 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/fib_tests.sh 
b/tools/testing/selftests/net/fib_tests.sh
index cfdeb35bfed5..9164e60d4b66 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -6,6 +6,7 @@
 
 ret=0
 
+VERBOSE=${VERBOSE:=0}
 PAUSE_ON_FAIL=${PAUSE_ON_FAIL:=no}
 IP="ip -netns testns"
 
@@ -16,10 +17,10 @@ log_test()
local msg="$3"
 
if [ ${rc} -eq ${expected} ]; then
-   printf "%-60s  [ OK ]\n" "${msg}"
+   printf "TEST: %-60s  [ OK ]\n" "${msg}"
else
ret=1
-   printf "%-60s  [FAIL]\n" "${msg}"
+   printf "TEST: %-60s  [FAIL]\n" "${msg}"
if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
echo
echo "hit enter to continue, 'q' to quit"
@@ -49,6 +50,28 @@ cleanup()
ip netns del testns
 }
 
+get_linklocal()
+{
+   local dev=$1
+   local addr
+
+   addr=$($IP -6 -br addr show dev ${dev} | \
+   awk '{
+   for (i = 3; i <= NF; ++i) {
+   if ($i ~ /^fe80/)
+   print $i
+   }
+   }'
+   )
+   addr=${addr/\/*}
+
+   [ -z "$addr" ] && return 1
+
+   echo $addr
+
+   return 0
+}
+
 fib_unreg_unicast_test()
 {
echo
@@ -390,6 +413,158 @@ fib_carrier_test()
fib_carrier_unicast_test
 }
 
+
+# Tests on nexthop spec
+
+# run 'ip route add' with given spec
+add_rt()
+{
+   local desc="$1"
+   local erc=$2
+   local vrf=$3
+   local pfx=$4
+   local gw=$5
+   local dev=$6
+   local cmd out rc
+
+   [ "$vrf" = "-" ] && vrf="default"
+   [ -n "$gw" ] && gw="via $gw"
+   [ -n "$dev" ] && dev="dev $dev"
+
+   cmd="$IP route add vrf $vrf $pfx $gw $dev"
+   if [ "$VERBOSE" = "1" ]; then
+   printf "\nCOMMAND: $cmd\n"
+   fi
+
+   out=$(eval $cmd 2>&1)
+   rc=$?
+   if [ "$VERBOSE" = "1" -a -n "$out" ]; then
+   echo "$out"
+   fi
+   log_test $rc $erc "$desc"
+}
+
+fib4_nexthop()
+{
+   echo
+   echo "IPv4 nexthop tests"
+
+   echo "<<< write me >>>"
+}
+
+fib6_nexthop()
+{
+   local lldummy=$(get_linklocal dummy0)
+   local llv1=$(get_linklocal dummy0)
+
+   if [ -z "$lldummy" ]; then
+   echo "Failed to get linklocal address for dummy0"
+   return 1
+   fi
+   if [ -z "$llv1" ]; then
+   echo "Failed to get linklocal address for veth1"
+   return 1
+   fi
+
+   echo
+   echo "IPv6 nexthop tests"
+
+   add_rt "Directly connected nexthop, unicast address" 0 \
+   - 2001:db8:101::/64 2001:db8:1::2
+   add_rt "Directly connected nexthop, unicast address with device" 0 \
+   - 2001:db8:102::/64 2001:db8:1::2 "dummy0"
+   add_rt "Gateway is linklocal address" 0 \
+   - 2001:db8:103::1/64 $llv1 "veth0"
+
+   # fails because LL address requires a device
+   add_rt "Gateway is linklocal address, no device" 2 \
+   - 2001:db8:104::1/64 $llv1
+
+   # local address can not be a gateway
+   add_rt "Gateway can not be local unicast address" 2 \
+   - 2001:db8:105::/64 2001:db8:1::1
+   add_rt "Gateway can not be local unicast address, with device" 2 \
+   - 2001:db8:106::/64 2001:db8:1::1 "dummy0"
+   add_rt "Gateway can not be a local linklocal address" 2 \
+   - 2001:db8:107::1/64 $lldummy "dummy0"
+
+

[PATCH v3 net-next 3/5] selftests: fib_tests: Use an alias for ip command

2018-03-06 Thread David Ahern

Replace 'ip -netns testns' with the alias IP. Shortens the line lengths
and makes running the commands manually a bit easier.

Signed-off-by: David Ahern 
---
 tools/testing/selftests/net/fib_tests.sh | 169 ---
 1 file changed, 85 insertions(+), 84 deletions(-)

diff --git a/tools/testing/selftests/net/fib_tests.sh 
b/tools/testing/selftests/net/fib_tests.sh
index b617985ecdc1..953254439e39 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -7,6 +7,7 @@
 ret=0
 
 PAUSE_ON_FAIL=${PAUSE_ON_FAIL:=no}
+IP="ip -netns testns"
 
 log_test()
 {
@@ -32,19 +33,19 @@ setup()
 {
set -e
ip netns add testns
-   ip -netns testns link set dev lo up
+   $IP link set dev lo up
 
-   ip -netns testns link add dummy0 type dummy
-   ip -netns testns link set dev dummy0 up
-   ip -netns testns address add 198.51.100.1/24 dev dummy0
-   ip -netns testns -6 address add 2001:db8:1::1/64 dev dummy0
+   $IP link add dummy0 type dummy
+   $IP link set dev dummy0 up
+   $IP address add 198.51.100.1/24 dev dummy0
+   $IP -6 address add 2001:db8:1::1/64 dev dummy0
set +e
 
 }
 
 cleanup()
 {
-   ip -netns testns link del dev dummy0 &> /dev/null
+   $IP link del dev dummy0 &> /dev/null
ip netns del testns
 }
 
@@ -56,19 +57,19 @@ fib_unreg_unicast_test()
setup
 
echo "Start point"
-   ip -netns testns route get fibmatch 198.51.100.2 &> /dev/null
+   $IP route get fibmatch 198.51.100.2 &> /dev/null
log_test $? 0 "IPv4 fibmatch"
-   ip -netns testns -6 route get fibmatch 2001:db8:1::2 &> /dev/null
+   $IP -6 route get fibmatch 2001:db8:1::2 &> /dev/null
log_test $? 0 "IPv6 fibmatch"
 
set -e
-   ip -netns testns link del dev dummy0
+   $IP link del dev dummy0
set +e
 
echo "Nexthop device deleted"
-   ip -netns testns route get fibmatch 198.51.100.2 &> /dev/null
+   $IP route get fibmatch 198.51.100.2 &> /dev/null
log_test $? 2 "IPv4 fibmatch - no route"
-   ip -netns testns -6 route get fibmatch 2001:db8:1::2 &> /dev/null
+   $IP -6 route get fibmatch 2001:db8:1::2 &> /dev/null
log_test $? 2 "IPv6 fibmatch - no route"
 
cleanup
@@ -83,43 +84,43 @@ fib_unreg_multipath_test()
setup
 
set -e
-   ip -netns testns link add dummy1 type dummy
-   ip -netns testns link set dev dummy1 up
-   ip -netns testns address add 192.0.2.1/24 dev dummy1
-   ip -netns testns -6 address add 2001:db8:2::1/64 dev dummy1
+   $IP link add dummy1 type dummy
+   $IP link set dev dummy1 up
+   $IP address add 192.0.2.1/24 dev dummy1
+   $IP -6 address add 2001:db8:2::1/64 dev dummy1
 
-   ip -netns testns route add 203.0.113.0/24 \
+   $IP route add 203.0.113.0/24 \
nexthop via 198.51.100.2 dev dummy0 \
nexthop via 192.0.2.2 dev dummy1
-   ip -netns testns -6 route add 2001:db8:3::/64 \
+   $IP -6 route add 2001:db8:3::/64 \
nexthop via 2001:db8:1::2 dev dummy0 \
nexthop via 2001:db8:2::2 dev dummy1
set +e
 
echo "Start point"
-   ip -netns testns route get fibmatch 203.0.113.1 &> /dev/null
+   $IP route get fibmatch 203.0.113.1 &> /dev/null
log_test $? 0 "IPv4 fibmatch"
-   ip -netns testns -6 route get fibmatch 2001:db8:3::1 &> /dev/null
+   $IP -6 route get fibmatch 2001:db8:3::1 &> /dev/null
log_test $? 0 "IPv6 fibmatch"
 
set -e
-   ip -netns testns link del dev dummy0
+   $IP link del dev dummy0
set +e
 
echo "One nexthop device deleted"
-   ip -netns testns route get fibmatch 203.0.113.1 &> /dev/null
+   $IP route get fibmatch 203.0.113.1 &> /dev/null
log_test $? 2 "IPv4 - multipath route removed on delete"
 
-   ip -netns testns -6 route get fibmatch 2001:db8:3::1 &> /dev/null
+   $IP -6 route get fibmatch 2001:db8:3::1 &> /dev/null
# In IPv6 we do not flush the entire multipath route.
log_test $? 0 "IPv6 - multipath down to single path"
 
set -e
-   ip -netns testns link del dev dummy1
+   $IP link del dev dummy1
set +e
 
echo "Second nexthop device deleted"
-   ip -netns testns -6 route get fibmatch 2001:db8:3::1 &> /dev/null
+   $IP -6 route get fibmatch 2001:db8:3::1 &> /dev/null
log_test $? 2 "IPv6 - no route"
 
cleanup
@@ -139,19 +140,19 @@ fib_down_unicast_test()
setup
 
echo "Start point"
-   ip -netns testns route get fibmatch 198.51.100.2 &> /dev/null
+   $IP route get fibmatch 198.51.100.2 &> /dev/null
log_test $? 0 "IPv4 fibmatch"
-   ip -netns testns -6 route get fibmatch 2001:db8:1::2 &> /dev/null
+   $IP -6 route get fibmatch 2001:db8:1::2 &> /dev/null
log_test $? 0 "IPv6 fibmatch"
 
set -e

[PATCH v3 net-next 1/5] net/ipv6: Refactor gateway validation on route add

2018-03-06 Thread David Ahern

Move gateway validation code from ip6_route_info_create into
ip6_validate_gw. Code move plus adjustments to handle the potential
reset of dev and idev and to make checkpatch happy.

Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 120 ++-
 1 file changed, 66 insertions(+), 54 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f0ae58424c45..3851c3ccfd7a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2550,7 +2550,7 @@ static struct rt6_info *ip6_nh_lookup_table(struct net 
*net,
 
 static int ip6_route_check_nh_onlink(struct net *net,
 struct fib6_config *cfg,
-struct net_device *dev,
+const struct net_device *dev,
 struct netlink_ext_ack *extack)
 {
u32 tbid = l3mdev_fib_table(dev) ? : RT_TABLE_MAIN;
@@ -2626,6 +2626,68 @@ static int ip6_route_check_nh(struct net *net,
return err;
 }
 
+static int ip6_validate_gw(struct net *net, struct fib6_config *cfg,
+  struct net_device **_dev, struct inet6_dev **idev,
+  struct netlink_ext_ack *extack)
+{
+   const struct in6_addr *gw_addr = &cfg->fc_gateway;
+   int gwa_type = ipv6_addr_type(gw_addr);
+   const struct net_device *dev = *_dev;
+   int err = -EINVAL;
+
+   /* if gw_addr is local we will fail to detect this in case
+* address is still TENTATIVE (DAD in progress). rt6_lookup()
+* will return already-added prefix route via interface that
+* prefix route was assigned to, which might be non-loopback.
+*/
+   if (ipv6_chk_addr_and_flags(net, gw_addr,
+   gwa_type & IPV6_ADDR_LINKLOCAL ?
+   dev : NULL, 0, 0)) {
+   NL_SET_ERR_MSG(extack, "Invalid gateway address");
+   goto out;
+   }
+
+   if (gwa_type != (IPV6_ADDR_LINKLOCAL | IPV6_ADDR_UNICAST)) {
+   /* IPv6 strictly inhibits using not link-local
+* addresses as nexthop address.
+* Otherwise, router will not able to send redirects.
+* It is very good, but in some (rare!) circumstances
+* (SIT, PtP, NBMA NOARP links) it is handy to allow
+* some exceptions. --ANK
+* We allow IPv4-mapped nexthops to support RFC4798-type
+* addressing
+*/
+   if (!(gwa_type & (IPV6_ADDR_UNICAST | IPV6_ADDR_MAPPED))) {
+   NL_SET_ERR_MSG(extack, "Invalid gateway address");
+   goto out;
+   }
+
+   if (cfg->fc_flags & RTNH_F_ONLINK)
+   err = ip6_route_check_nh_onlink(net, cfg, dev, extack);
+   else
+   err = ip6_route_check_nh(net, cfg, _dev, idev);
+
+   if (err)
+   goto out;
+   }
+
+   /* reload in case device was changed */
+   dev = *_dev;
+
+   err = -EINVAL;
+   if (!dev) {
+   NL_SET_ERR_MSG(extack, "Egress device not specified");
+   goto out;
+   } else if (dev->flags & IFF_LOOPBACK) {
+   NL_SET_ERR_MSG(extack,
+  "Egress device can not be loopback device for 
this route");
+   goto out;
+   }
+   err = 0;
+out:
+   return err;
+}
+
 static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg,
  struct netlink_ext_ack *extack)
 {
@@ -2808,61 +2870,11 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg,
}
 
if (cfg->fc_flags & RTF_GATEWAY) {
-   const struct in6_addr *gw_addr;
-   int gwa_type;
-
-   gw_addr = &cfg->fc_gateway;
-   gwa_type = ipv6_addr_type(gw_addr);
-
-   /* if gw_addr is local we will fail to detect this in case
-* address is still TENTATIVE (DAD in progress). rt6_lookup()
-* will return already-added prefix route via interface that
-* prefix route was assigned to, which might be non-loopback.
-*/
-   err = -EINVAL;
-   if (ipv6_chk_addr_and_flags(net, gw_addr,
-   gwa_type & IPV6_ADDR_LINKLOCAL ?
-   dev : NULL, 0, 0)) {
-   NL_SET_ERR_MSG(extack, "Invalid gateway address");
+   err = ip6_validate_gw(net, cfg, &dev, &idev, extack);
+   if (err)
goto out;
-   }
-   rt->rt6i_gateway = *gw_addr;
-
-   if (gwa_type != (IPV6_ADDR_LINKLOCAL|IPV6_ADDR_UNICAST)) {
-   /* IPv6 strictly inhibits using not link-local
-

[PATCH v3 net-next 0/5] net/ipv6: Address checks need to consider the L3 domain

2018-03-06 Thread David Ahern

IPv6 prohibits a local address from being used as a gateway for a route.
However, it is ok for the local address to be in a different L3 domain
(e.g., VRF); this allows, for example, veth pairs to connect VRFs.

ip6_route_info_create calls ipv6_chk_addr_and_flags for gateway addresses
to determine if the address is a local one, but ipv6_chk_addr_and_flags
does not currently consider L3 domains. As a result routes can not be
added in one VRF with a nexthop that points to a local address in a
second VRF.

Resolve by comparing the l3mdev for the passed in device and requiring an
l3mdev match with the device containing an address. The intent of checking
for an address on the specified device versus any device in the domain is
mantained by a new argument to skip the check between the passed in device
and the device with the address.

Patch 1 moves the gateway validation from ip6_route_info_create into a
helper; the function is long enough and refactoring drops the indent
level.

Patch 2 adds l3mdev checks to ipv6_chk_addr_and_flags and fixes up
a few ipv6_chk_addr callers that pass a NULL device.

Patches 3 and 4 do some refactoring to the fib_tests script and then
patch 5 adds nexthop validation tests.

v3
- set skip_dev_check in ipv6_chk_addr based on dev == NULL

v2
- handle 2 variations of route spec with sane error path
- add test cases

David Ahern (5):
  net/ipv6: Refactor gateway validation on route add
  net/ipv6: Address checks need to consider the L3 domain
  selftests: fib_tests: Use an alias for ip command
  selftests: fib_tests: Allow user to run a specific test
  selftests: fib_tests: Add IPv6 nexthop spec tests

 include/net/addrconf.h   |   4 +-
 net/ipv6/addrconf.c  |  26 ++-
 net/ipv6/anycast.c   |   9 +-
 net/ipv6/datagram.c  |   5 +-
 net/ipv6/ip6_tunnel.c|  12 +-
 net/ipv6/ndisc.c |   2 +-
 net/ipv6/route.c | 139 +++-
 tools/testing/selftests/net/fib_tests.sh | 359 +++
 8 files changed, 397 insertions(+), 159 deletions(-)

-- 
2.11.0

Use of Indirect function calls

2018-03-06 Thread Rao Shoaib

I am working on a change which introduces a couple of indirect function 
calls in the fast path. I used indirect function calls instead of 
"if/else" as it keeps the code cleaner and more readable and provides 
for extensibility.


I do not expect any measurable overhead as modern CPU's use pre-fetching 
and multiple parallel execution engines. Indirect jump prediction has 
gotten very good too. Assembly shows that the difference is an 
instruction to get the table address followed by a call to *callq 
instead of callq. I could not find official Intel numbers[1] but third 
party numbers suggest that *callq may have an over head of 1 clock 
cycle. Over head of accessing the table address is similar to reading a 
variable or testing a flag.


Will a patch that introduces indirect function calls be rejected just 
because it uses indirect function calls in the fast path or is the 
use/impact evaluated on a case by case basis ? I do see usage of 
indirect function calls in TCP fast path and Ethernet drivers so it 
can't be that bad.


Regards,

Shoaib

[1] I believe this is because the instruction is broken down into 
multiple uops and due to perfecting and parallel execution the latency 
is hard to measure.

Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-06 Thread John Fastabend

On 03/06/2018 10:18 AM, John Fastabend wrote:
> On 03/06/2018 07:47 AM, David Miller wrote:
>> From: John Fastabend 
>> Date: Mon, 5 Mar 2018 23:06:01 -0800
>>
>>> On 03/05/2018 10:42 PM, David Miller wrote:
 From: John Fastabend 
 Date: Mon, 5 Mar 2018 22:22:21 -0800

> All I meant by this is if an application uses sendfile() call
> there is no good way to know when/if the kernel side will copy or
> xmit the  data. So a reliable user space application will need to
> only modify the data if it "knows" there are no outstanding sends
> in-flight. So if we assume applications follow this then it
> is OK to avoid the copy. Of course this is not good enough for
> security, but for monitoring/statistics (my use case 1 it works).

 For an application implementing a networking file system, it's pretty
 legitimate for file contents to change before the page gets DMA's to
 the networking card.

>>>
>>> Still there are useful BPF programs that can tolerate this. So I
>>> would prefer to allow BPF programs to operate in the no-copy mode
>>> if wanted. It doesn't have to be the default though as it currently
>>> is. A l7 load balancer is a good example of this.
>>
>> Maybe I'd be ok if it were not the default.  But do you really want to
>> expose a potential attack vector, even if the app gets to choose and
>> say "I'm ok"?
>>
> 
> Yes, because I have use cases where I don't need to read the data, but
> have already "approved" the data. One example applications like
> nginx can serve static http data. Just reading over the code what they
> do, when sendfile is enabled, is a sendmsg call with the header. We want
> to enforce the policy on the header. Then we know the next N bytes are
> OK. Nginx will then send the payload over sendfile syscall. We already
> know the data is good from initial sendmsg call the next N bytes can
> get the verdict SK_PASS without even touching the data. If we do a
> copy in this case we see significant performance degradation.
> 
> The other use case is the L7 load balancer mentioned above. If we are
> using RR policies or some other heuristic if the user modifies the
> payload after the BPF verdict that is also fine. A malicious user
> could rewrite the header and try to game the load balancer but the
> BPF program can always just dev/null (SK_DROP) the application when
> it detects this. This also assumes the load balancer is using the
> header for its heuristic some interesting heuristics may not use
> the header at all.
> 
 And that's perfectly fine, and we everything such that this will work
 properly.

 The card checksums what ends up being DMA'd so nothing from the
 networking side is broken.
>>>
>>> Assuming the card has checksum support correct? Which is why we have
>>> the SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
>>> helpers called by the drivers when they do not support the protocol
>>> being used. So probably OK assumption if using supported protocols and
>>> hardware? Perhaps in general folks just use normal protocols and
>>> hardware so it works.
>>
>> If the hardware doesn't support the checksums, we linearize the SKB
>> (therefore obtain a snapshot of the data), and checksum.  Exactly what
>> would happen if the hardware did the checksum.
>>
>> So OK in that case too.
>>
>> We always guarantee that you will always get a correct checksum on
>> outgoing packets, even if you modify the page contents meanwhile.
>>
> 
> Agreed the checksum is correct, but the user doesn't know if the linearize
> happened while it was modifying the data, potentially creating data with
> a partial update. Because the user modifying the data doesn't block the
> linearize operation in the kernel and vice versa the linearize operation
> can happen in parallel with the user side data modification. So maybe
> I'm still missing something but it seems the data can be in some unknown
> state on the wire.
> 
> Either way though I think its fine to make the default sendpage hook do
> the copy. A flag to avoid the copy can be added later to resolve my use
> cases above. I'll code this up in a v2 today/tomorrow.

Hi,

Thought about this a bit more and chatted with Daniel a bit. I think
a better solution is to set data_start = data_end = 0 by default in the
sendpage case. This will disallow any read/writes into the sendpage
data. Then if the user needs to read/write data we can use a helper
bpf_sk_msg_pull_data(start_byte, end_byte) which can pull the data into a
linear buffer as needed. This will ensure any user writes will not
change data after the BPF verdict (your concern). Also it will minimize
the amount of data that needs to be copied (my concern). In some of my
use cases where no data is needed we can simple not use the helper. Then
on the sendmsg side we can continue to set the (data_start, data_end)
pointers to the first scatterlist element. But, also use this helper to
set the data pointers past

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Greg KH

On Tue, Mar 06, 2018 at 05:07:45PM -0800, Alexei Starovoitov wrote:
> combining multiple answers...
> 
> On 3/6/18 3:05 AM, Greg KH wrote:
> > 
> > Any chance you can add a field to your "umh module" type such that a
> > normal 'modinfo' program will be able to notice it is different easily?
> 
> ok. handling of modinfo turned out to be straightforward.
> kmod tooling worked fine with simple addition of .modinfo section.
> 
> $ modinfo bpfilter
> filename:
> /lib/modules/4.16.0-rc4-00799-g1716f0aa3039-dirty/net/bpfilter/bpfilter.ko
> umh:Y

Nice.  But perhaps spell it out, "user_mode_helper"?  Anyway,
bikesheding now, sorry, whatever you want to call it is fine with me.

> license:GPL
> 
> I will require umh=Y and license to be present.
> umh has to be set to Y for this 'umh modules'
> and taint of kernel will happen if license is not gpl.

Interesting, I like it :)


> Other modinfo like vermagic are not applicable here, since
> umh modules interact with kernel via normal kernel/user abi.

Very true.

> > > Since umh can crash, can be oom-ed by the kernel, killed by admin,
> > > the subsystem that uses them (like bpfilter) need to manage life
> > > time of umh on its own, so module infra doesn't do any accounting
> > > of them. They don't appear in "lsmod" and cannot be "rmmod".
> > > Multiple request_module("umh") will load multiple umh.ko processes.
> > > 
> > > Similar to kernel modules the kernel will be tainted if "umh module"
> > > has invalid signature.
> > 
> > Shouldn't we fail to load the "module" if the signature is not valid if
> > CONFIG_MODULE_SIG_FORCE=y is enabled, like we do for modules?  I run my
> > systems like that, and just "warning" isn't probably a good idea for
> > systems that want to enforce that everything is signed properly?
> 
> CONFIG_MODULE_SIG_FORCE=y is already handled by this patch.
> It's checked first for either .ko or umh.ko (before any elf parsing)
> and returns -ENOKEY to user space without any dmesg message.
> I think it's best to keep it as-is.
> The taint and warning is for 'undef SIG_FORCE' and when module
> is signed, but incorrectly.

Ah, sorry, I missed that, thanks for clearing it up.

> > Other than that, one minor question:
> > 
> > > @@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct 
> > > filename *filename,
> > >   sched_exec();
> > > 
> > >   bprm->file = file;
> > > - if (fd == AT_FDCWD || filename->name[0] == '/') {
> > > + if (!filename) {
> > > + bprm->filename = "/dev/null";
> > 
> > Why the use of "/dev/null" for the filename here, and elsewhere in the
> > code?  While I'm "sure" that everyone really does have /dev/null/
> > mounted in the root namespace, what is the use of it here?
> 
> filename is assumed to be non-null in several places further
> down and instead of hacking it everywhere it's cleaner to assign
> some string to it.
> I'll change it to filename = "none"
> Same in umh part.

Thanks, that makes sense.

greg k-h

Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params

2018-03-06 Thread Eric Dumazet

On Tue, 2018-03-06 at 17:12 -0800, Jesus Sanchez-Palencia wrote:
> Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and
> a drop_if_late flag. With this commit the API becomes:
> 
> 

 * diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
 * index d8340e6e8814..951969ceaf65 100644
 * --- a/include/linux/skbuff.h
 * +++ b/include/linux/skbuff.h
 * @@ -788,6 +788,9 @@ struct sk_buff {
 *  __u8tc_redirected:1;
 *  __u8tc_from_ingress:1;
 *  #endif
 * +__u8tc_drop_if_late:1;
 * +
 * +clockid_t   txtime_clockid;
 *  
 *  #ifdef CONFIG_NET_SCHED
 *  __u16   tc_index;   /* traffic
   control index */


This is adding 32+1 bits to sk_buff, and possibly holes in this very
very hot (and already too fat) structure.

Do we really need 32 bits for a clockid_t ?

Re: [PATCH iproute2 net-next v3] iprule: support for ip_proto, sport and dport match options

2018-03-06 Thread Stephen Hemminger

On Tue,  6 Mar 2018 18:07:59 -0800
Roopa Prabhu  wrote:

> + if (tb[FRA_IP_PROTO]) {
> + SPRINT_BUF(pbuf);
> + fprintf(fp, "ip_proto %s ",
> + inet_proto_n2a(rta_getattr_u8(tb[FRA_IP_PROTO]), pbuf,
> +sizeof(pbuf)));
> + }
> +
> + if (tb[FRA_SPORT_RANGE]) {
> + struct fib_rule_port_range *r = RTA_DATA(tb[FRA_SPORT_RANGE]);
> +
> + if (r->start == r->end)
> + fprintf(fp, "sport %hu ", r->start);
> + else
> + fprintf(fp, "sport %hu-%hu ", r->start, r->end);
> + }
> +
> + if (tb[FRA_DPORT_RANGE]) {
> + struct fib_rule_port_range *r = RTA_DATA(tb[FRA_DPORT_RANGE]);
> +
> + if (r->start == r->end)
> + fprintf(fp, "dport %hu ", r->start);
> + else
> + fprintf(fp, "dport %hu-%hu ", r->start, r->end);
> + }
> +

in net-next this is all JSON now.

Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-06 Thread Michael S. Tsirkin

On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:
> > I definitelly vote for a separate common shared code for both netvsc and
> > virtio_net - even if you use 2 and 3 netdev model, you could share the
> > common code. Strict checks and limitation should be in place.
> 
> Noted. But as I also mentioned there isn't that much "common" code
> between the two models. I think if anything we could probably look at
> peeling out a few bits such as "get__bymac" which really would
> become dev_get_by_mac_and_ops in order to find the device for the
> notifiers. I probably wouldn't even put that in our driver and would
> instead put it in the core code since it almost makes more sense
> there. Beyond that sharing becomes much more challenging due to the
> differences in the Rx and Tx paths that build out of the difference
> between the 2 driver and 3 driver models.

At this point it might be worth it to articulate the advantages
of the 3 netdev model.

If they are compelling, why wouldn't netvsc users want them?

Alex, I think you were one of the strongest proponents of this model,
you should be well placed to provide a summary.

-- 
MST

[PATCH iproute2 net-next v3] iprule: support for ip_proto, sport and dport match options

2018-03-06 Thread Roopa Prabhu

From: Roopa Prabhu 

add support to match on ip_proto, sport and dport ranges.
For ip_proto, this patch currently enumerates, tcp, udp and sctp.
This list can be extended in the future.

example:
$ip rule add sport 666-777 dport 999 ip_proto tcp table 100
$ip rule show
0:  from all lookup local
32765:  from all ip_proto 6 sport 666-777 dport 999 lookup 100
32766:  from all lookup main
32767:  from all lookup default

Signed-off-by: Roopa Prabhu 
---
v2: use inet_proto_* as suggested by David Ahern

v3: fix newlines in usage (feedback from David Ahern)

 include/uapi/linux/fib_rules.h |  8 ++
 ip/iprule.c| 61 ++
 man/man8/ip-rule.8 | 32 +-
 3 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 77d90ae..1809af5 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -35,6 +35,11 @@ struct fib_rule_uid_range {
__u32   end;
 };
 
+struct fib_rule_port_range {
+   __u16   start;
+   __u16   end;
+};
+
 enum {
FRA_UNSPEC,
FRA_DST,/* destination address */
@@ -59,6 +64,9 @@ enum {
FRA_L3MDEV, /* iif or oif is l3mdev goto its table */
FRA_UID_RANGE,  /* UID range */
FRA_PROTOCOL,   /* Originator of the rule */
+   FRA_IP_PROTO,   /* ip proto */
+   FRA_SPORT_RANGE,/* sport range */
+   FRA_DPORT_RANGE,/* dport range */
__FRA_MAX
 };
 
diff --git a/ip/iprule.c b/ip/iprule.c
index 6fdc9b5..a2eae72 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -46,6 +46,9 @@ static void usage(void)
"SELECTOR := [ not ] [ from PREFIX ] [ to PREFIX ] [ tos TOS ] 
[ fwmark FWMARK[/MASK] ]\n"
"[ iif STRING ] [ oif STRING ] [ pref NUMBER ] [ 
l3mdev ]\n"
"[ uidrange NUMBER-NUMBER ]\n"
+   "[ ip_proto PROTOCOL ]\n"
+   "[ sport [ NUMBER | NUMBER-NUMBER ]\n"
+   "[ dport [ NUMBER | NUMBER-NUMBER ] ]\n"
"ACTION := [ table TABLE_ID ]\n"
"  [ protocol PROTO ]\n"
"  [ nat ADDRESS ]\n"
@@ -284,6 +287,31 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
fprintf(fp, "uidrange %u-%u ", r->start, r->end);
}
 
+   if (tb[FRA_IP_PROTO]) {
+   SPRINT_BUF(pbuf);
+   fprintf(fp, "ip_proto %s ",
+   inet_proto_n2a(rta_getattr_u8(tb[FRA_IP_PROTO]), pbuf,
+  sizeof(pbuf)));
+   }
+
+   if (tb[FRA_SPORT_RANGE]) {
+   struct fib_rule_port_range *r = RTA_DATA(tb[FRA_SPORT_RANGE]);
+
+   if (r->start == r->end)
+   fprintf(fp, "sport %hu ", r->start);
+   else
+   fprintf(fp, "sport %hu-%hu ", r->start, r->end);
+   }
+
+   if (tb[FRA_DPORT_RANGE]) {
+   struct fib_rule_port_range *r = RTA_DATA(tb[FRA_DPORT_RANGE]);
+
+   if (r->start == r->end)
+   fprintf(fp, "dport %hu ", r->start);
+   else
+   fprintf(fp, "dport %hu-%hu ", r->start, r->end);
+   }
+
table = frh_get_table(frh, tb);
if (table) {
fprintf(fp, "lookup %s ",
@@ -768,6 +796,39 @@ static int iprule_modify(int cmd, int argc, char **argv)
addattr32(&req.n, sizeof(req), RTA_GATEWAY,
  get_addr32(*argv));
req.frh.action = RTN_NAT;
+   } else if (strcmp(*argv, "ip_proto") == 0) {
+   __u8 ip_proto;
+
+   NEXT_ARG();
+   ip_proto = inet_proto_a2n(*argv);
+   if (ip_proto < 0)
+   invarg("Invalid \"ip_proto\" value\n",
+  *argv);
+   addattr8(&req.n, sizeof(req), FRA_IP_PROTO, ip_proto);
+   } else if (strcmp(*argv, "sport") == 0) {
+   struct fib_rule_port_range r;
+   int ret = 0;
+
+   NEXT_ARG();
+   ret = sscanf(*argv, "%hu-%hu", &r.start, &r.end);
+   if (ret == 1)
+   r.end = r.start;
+   else if (ret != 2)
+   invarg("invalid port range\n", *argv);
+   addattr_l(&req.n, sizeof(req), FRA_SPORT_RANGE, &r,
+ sizeof(r));
+   } else if (strcmp(*argv, "dport") == 0) {
+   struct fib_rule_port_range r;
+   int ret = 0;
+
+   NEXT_ARG();
+   ret = sscanf(*argv, "%hu-%hu", &

Re: Investment

2018-03-06 Thread James Tyler

Thank you for your time,

We are looking for clients in your country with good business or project
that requires financing to execute.

Please get back to me if you are interested in this or you know anybody who
has good business ideas but lack the necessary capital to fund his projects
so we can establish working relationship.

Sincerely,

James Tyler, MBA, CFA
Investment analyst.

[PATCH net-next] openvswitch: fix vport packet length check.

2018-03-06 Thread William Tu

When sending a packet to a tunnel device, the dev's hard_header_len
could be larger than the skb->len in function packet_length().
In the case of ip6gretap/erspan, hard_header_len = LL_MAX_HEADER + t_hlen,
which is around 180, and an ARP packet sent to this tunnel has
skb->len = 42.  This causes the 'unsign int length' to become super
large because it is negative value, causing the later ovs_vport_send
to drop it due to over-mtu size.  The patch fixes it by setting it to 0.

Signed-off-by: William Tu 
---
 net/openvswitch/vport.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index b6c8524032a0..7718d5b4cf8a 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -467,7 +467,7 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff 
*skb,
 static unsigned int packet_length(const struct sk_buff *skb,
  struct net_device *dev)
 {
-   unsigned int length = skb->len - dev->hard_header_len;
+   int length = skb->len - dev->hard_header_len;
 
if (!skb_vlan_tag_present(skb) &&
eth_type_vlan(skb->protocol))
@@ -478,7 +478,7 @@ static unsigned int packet_length(const struct sk_buff *skb,
 * account for 802.1ad. e.g. is_skb_forwardable().
 */
 
-   return length;
+   return length > 0 ? length : 0;
 }
 
 void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
-- 
2.7.4

Re: [RFC v3 iproute2 3/3] tc: Add support for the TBS Qdisc

2018-03-06 Thread Stephen Hemminger

On Tue,  6 Mar 2018 17:16:08 -0800
Jesus Sanchez-Palencia  wrote:

> atic int tbs_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
> +{
> + struct rtattr *tb[TCA_TBS_MAX+1];
> + struct tc_tbs_qopt *qopt;
> +
> + if (opt == NULL)
> + return 0;
> +
> + parse_rtattr_nested(tb, TCA_TBS_MAX, opt);
> +
> + if (tb[TCA_TBS_PARMS] == NULL)
> + return -1;
> +
> + qopt = RTA_DATA(tb[TCA_TBS_PARMS]);
> + if (RTA_PAYLOAD(tb[TCA_TBS_PARMS])  < sizeof(*qopt))
> + return -1;
> +
> + fprintf(f, "clockid ");
> + if (qopt->clockid == CLOCKID_INVALID)
> + fprintf(f, "invalid ");
> + else
> + fprintf(f, "%d ", qopt->clockid);
> +
> + fprintf(f, "delta %d ", qopt->delta);
> + fprintf(f, "offload %s ", (qopt->flags & TC_TBS_OFFLOAD_ON) ?
> + "on" : "off");
> + fprintf(f, "sorting %s", (qopt->flags & TC_TBS_SORTING_ON) ?
> + "on" : "off");
> +
> + return 0;
> +}

All new print code in iproute2 should support JSON output.
Look at other code using json_print.h for simple way to handle this.

[next-queue PATCH v3 3/8] igb: Enable the hardware traffic class feature bit for igb models

2018-03-06 Thread Vinicius Costa Gomes

This will allow functionality depending on the hardware being traffic
class aware to work. In particular the tc-flower offloading checks
verifies that this bit is set.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 0ea32be07d71..3c4d209aad76 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2804,6 +2804,9 @@ static int igb_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
if (hw->mac.type >= e1000_82576)
netdev->features |= NETIF_F_SCTP_CRC;
 
+   if (hw->mac.type >= e1000_i350)
+   netdev->features |= NETIF_F_HW_TC;
+
 #define IGB_GSO_PARTIAL_FEATURES (NETIF_F_GSO_GRE | \
  NETIF_F_GSO_GRE_CSUM | \
  NETIF_F_GSO_IPXIP4 | \
-- 
2.16.2

[next-queue PATCH v3 1/8] igb: Fix not adding filter elements to the list

2018-03-06 Thread Vinicius Costa Gomes

Because the order of the parameters passes to 'hlist_add_behind()' was
inverted, the 'parent' node was added "behind" the 'input', as input
is not in the list, this causes the 'input' node to be lost.

Fixes: 0e71def25281 ("igb: add support of RX network flow classification")
Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 606e6761758f..143f0bb34e4d 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2864,7 +2864,7 @@ static int igb_update_ethtool_nfc_entry(struct 
igb_adapter *adapter,
 
/* add filter to the list */
if (parent)
-   hlist_add_behind(&parent->nfc_node, &input->nfc_node);
+   hlist_add_behind(&input->nfc_node, &parent->nfc_node);
else
hlist_add_head(&input->nfc_node, &adapter->nfc_filter_list);
 
-- 
2.16.2

[next-queue PATCH v3 6/8] igb: Add MAC address support for ethtool nftuple filters

2018-03-06 Thread Vinicius Costa Gomes

This adds the capability of configuring the queue steering of arriving
packets based on their source and destination MAC addresses.

In practical terms this adds support for the following use cases,
characterized by these examples:

$ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
(this will direct packets with destination address "aa:aa:aa:aa:aa:aa"
to the RX queue 0)

$ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 action 3
(this will direct packets with destination address "44:44:44:44:44:44"
to the RX queue 3)

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 35 
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 94fc9a4bed8b..3f98299d4cd0 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2494,6 +2494,23 @@ static int igb_get_ethtool_nfc_entry(struct igb_adapter 
*adapter,
fsp->h_ext.vlan_tci = rule->filter.vlan_tci;
fsp->m_ext.vlan_tci = htons(VLAN_PRIO_MASK);
}
+   if (rule->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR) {
+   ether_addr_copy(fsp->h_u.ether_spec.h_dest,
+   rule->filter.dst_addr);
+   /* As we only support matching by the full
+* mask, return the mask to userspace
+*/
+   eth_broadcast_addr(fsp->m_u.ether_spec.h_dest);
+   }
+   if (rule->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR) {
+   ether_addr_copy(fsp->h_u.ether_spec.h_source,
+   rule->filter.src_addr);
+   /* As we only support matching by the full
+* mask, return the mask to userspace
+*/
+   eth_broadcast_addr(fsp->m_u.ether_spec.h_source);
+   }
+
return 0;
}
return -EINVAL;
@@ -2932,10 +2949,6 @@ static int igb_add_ethtool_nfc_entry(struct igb_adapter 
*adapter,
if ((fsp->flow_type & ~FLOW_EXT) != ETHER_FLOW)
return -EINVAL;
 
-   if (fsp->m_u.ether_spec.h_proto != ETHER_TYPE_FULL_MASK &&
-   fsp->m_ext.vlan_tci != htons(VLAN_PRIO_MASK))
-   return -EINVAL;
-
input = kzalloc(sizeof(*input), GFP_KERNEL);
if (!input)
return -ENOMEM;
@@ -2945,6 +2958,20 @@ static int igb_add_ethtool_nfc_entry(struct igb_adapter 
*adapter,
input->filter.match_flags = IGB_FILTER_FLAG_ETHER_TYPE;
}
 
+   /* Only support matching addresses by the full mask */
+   if (is_broadcast_ether_addr(fsp->m_u.ether_spec.h_source)) {
+   input->filter.match_flags |= IGB_FILTER_FLAG_SRC_MAC_ADDR;
+   ether_addr_copy(input->filter.src_addr,
+   fsp->h_u.ether_spec.h_source);
+   }
+
+   /* Only support matching addresses by the full mask */
+   if (is_broadcast_ether_addr(fsp->m_u.ether_spec.h_dest)) {
+   input->filter.match_flags |= IGB_FILTER_FLAG_DST_MAC_ADDR;
+   ether_addr_copy(input->filter.dst_addr,
+   fsp->h_u.ether_spec.h_dest);
+   }
+
if ((fsp->flow_type & FLOW_EXT) && fsp->m_ext.vlan_tci) {
if (fsp->m_ext.vlan_tci != htons(VLAN_PRIO_MASK)) {
err = -EINVAL;
-- 
2.16.2

[next-queue PATCH v3 7/8] igb: Add the skeletons for tc-flower offloading

2018-03-06 Thread Vinicius Costa Gomes

This adds basic functions needed to implement offloading for filters
created by tc-flower.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 66 +++
 1 file changed, 66 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 71e03b5227df..b5a6bd37bb16 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2496,6 +2497,69 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
return 0;
 }
 
+static int igb_configure_clsflower(struct igb_adapter *adapter,
+  struct tc_cls_flower_offload *cls_flower)
+{
+   return -EOPNOTSUPP;
+}
+
+static int igb_delete_clsflower(struct igb_adapter *adapter,
+   struct tc_cls_flower_offload *cls_flower)
+{
+   return -EOPNOTSUPP;
+}
+
+static int igb_setup_tc_cls_flower(struct igb_adapter *adapter,
+  struct tc_cls_flower_offload *cls_flower)
+{
+   switch (cls_flower->command) {
+   case TC_CLSFLOWER_REPLACE:
+   return igb_configure_clsflower(adapter, cls_flower);
+   case TC_CLSFLOWER_DESTROY:
+   return igb_delete_clsflower(adapter, cls_flower);
+   case TC_CLSFLOWER_STATS:
+   return -EOPNOTSUPP;
+   default:
+   return -EINVAL;
+   }
+}
+
+static int igb_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
+void *cb_priv)
+{
+   struct igb_adapter *adapter = cb_priv;
+
+   if (!tc_cls_can_offload_and_chain0(adapter->netdev, type_data))
+   return -EOPNOTSUPP;
+
+   switch (type) {
+   case TC_SETUP_CLSFLOWER:
+   return igb_setup_tc_cls_flower(adapter, type_data);
+
+   default:
+   return -EOPNOTSUPP;
+   }
+}
+
+static int igb_setup_tc_block(struct igb_adapter *adapter,
+ struct tc_block_offload *f)
+{
+   if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
+   return -EOPNOTSUPP;
+
+   switch (f->command) {
+   case TC_BLOCK_BIND:
+   return tcf_block_cb_register(f->block, igb_setup_tc_block_cb,
+adapter, adapter);
+   case TC_BLOCK_UNBIND:
+   tcf_block_cb_unregister(f->block, igb_setup_tc_block_cb,
+   adapter);
+   return 0;
+   default:
+   return -EOPNOTSUPP;
+   }
+}
+
 static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
void *type_data)
 {
@@ -2504,6 +2568,8 @@ static int igb_setup_tc(struct net_device *dev, enum 
tc_setup_type type,
switch (type) {
case TC_SETUP_QDISC_CBS:
return igb_offload_cbs(adapter, type_data);
+   case TC_SETUP_BLOCK:
+   return igb_setup_tc_block(adapter, type_data);
 
default:
return -EOPNOTSUPP;
-- 
2.16.2

[next-queue PATCH v3 0/8] igb: offloading of receive filters

2018-03-06 Thread Vinicius Costa Gomes

Hi,

Changes from v2:
 - Addressed review comments from Jakub Kicinski, mostly about coding
   style adjustments and more consistent error reporting;

Changes from v1:
 - Addressed review comments from Alexander Duyck and Florian
   Fainelli;
 - Adding and removing cls_flower filters are now proposed in the same
   patch;
 - cls_flower filters are kept in a separated list from "ethtool"
   filters (so that section of the original cover letter is no longer
   valid);
 - The patch adding support for ethtool filters is now independent from
   the rest of the series;

Original cover letter:

This series enables some ethtool and tc-flower filters to be offloaded
to igb-based network controllers. This is useful when the system
configurator want to steer kinds of traffic to a specific hardware
queue.

The first two commits are bug fixes.

The basis of this series is to export the internal API used to
configure address filters, so they can be used by ethtool, and
extending the functionality so an source address can be handled.

Then, we enable the tc-flower offloading implementation to re-use the
same infrastructure as ethtool, and storing them in the per-adapter
"nfc" (Network Filter Config?) list. But for consistency, for
destructive access they are separated, i.e. an filter added by
tc-flower can only be removed by tc-flower, but ethtool can read them
all.

Only support for VLAN Prio, Source and Destination MAC Address, and
Ethertype is enabled for now.

Open question:
  - igb is initialized with the number of traffic classes as 1, if we
  want to use multiple traffic classes we need to increase this value,
  the only way I could find is to use mqprio (for example). Should igb
  be initialized with, say, the number of queues as its "num_tc"?


Vinicius Costa Gomes (8):
  igb: Fix not adding filter elements to the list
  igb: Fix queue selection on MAC filters on i210 and i211
  igb: Enable the hardware traffic class feature bit for igb models
  igb: Add support for MAC address filters specifying source addresses
  igb: Enable nfc filters to specify MAC addresses
  igb: Add MAC address support for ethtool nftuple filters
  igb: Add the skeletons for tc-flower offloading
  igb: Add support for adding offloaded clsflower filters

 drivers/net/ethernet/intel/igb/e1000_defines.h |   2 +
 drivers/net/ethernet/intel/igb/igb.h   |  12 +
 drivers/net/ethernet/intel/igb/igb_ethtool.c   |  65 +-
 drivers/net/ethernet/intel/igb/igb_main.c  | 306 -
 4 files changed, 371 insertions(+), 14 deletions(-)

--
2.16.2

[next-queue PATCH v3 4/8] igb: Add support for MAC address filters specifying source addresses

2018-03-06 Thread Vinicius Costa Gomes

Makes it possible to direct packets to queues based on their source
address. Documents the expected usage of the 'flags' parameter.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  1 +
 drivers/net/ethernet/intel/igb/igb.h   |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c  | 38 ++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 573bf177fd08..c6f552de30dd 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -490,6 +490,7 @@
  * manageability enabled, allowing us room for 15 multicast addresses.
  */
 #define E1000_RAH_AV  0x8000/* Receive descriptor valid */
+#define E1000_RAH_ASEL_SRC_ADDR 0x0001
 #define E1000_RAH_QSEL_ENABLE 0x1000
 #define E1000_RAL_MAC_ADDR_LEN 4
 #define E1000_RAH_MAC_ADDR_LEN 2
diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 1c6b8d9176a8..d5cd5f6708d9 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -472,6 +472,7 @@ struct igb_mac_addr {
 
 #define IGB_MAC_STATE_DEFAULT  0x1
 #define IGB_MAC_STATE_IN_USE   0x2
+#define IGB_MAC_STATE_SRC_ADDR  0x4
 
 /* board specific private data structure */
 struct igb_adapter {
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 3c4d209aad76..1df1c5a99a0d 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6838,8 +6838,13 @@ static void igb_set_default_mac_filter(struct 
igb_adapter *adapter)
igb_rar_set_index(adapter, 0);
 }
 
-static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
- const u8 queue)
+/* Add a MAC filter for 'addr' directing matching traffic to 'queue',
+ * 'flags' is used to indicate what kind of match is made, match is by
+ * default for the destination address, if matching by source address
+ * is desired the flag IGB_MAC_STATE_SRC_ADDR can be used.
+ */
+static int igb_add_mac_filter_flags(struct igb_adapter *adapter, const u8 
*addr,
+   const u8 queue, const u8 flags)
 {
struct e1000_hw *hw = &adapter->hw;
int rar_entries = hw->mac.rar_entry_count -
@@ -6859,7 +6864,7 @@ static int igb_add_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
 
ether_addr_copy(adapter->mac_table[i].addr, addr);
adapter->mac_table[i].queue = queue;
-   adapter->mac_table[i].state |= IGB_MAC_STATE_IN_USE;
+   adapter->mac_table[i].state |= IGB_MAC_STATE_IN_USE | flags;
 
igb_rar_set_index(adapter, i);
return i;
@@ -6868,8 +6873,20 @@ static int igb_add_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
return -ENOSPC;
 }
 
-static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
+static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
  const u8 queue)
+{
+   return igb_add_mac_filter_flags(adapter, addr, queue, 0);
+}
+
+/* Remove a MAC filter for 'addr' directing matching traffic to
+ * 'queue', 'flags' is used to indicate what kind of match need to be
+ * removed, match is by default for the destination address, if
+ * matching by source address is to be removed the flag
+ * IGB_MAC_STATE_SRC_ADDR can be used.
+ */
+static int igb_del_mac_filter_flags(struct igb_adapter *adapter, const u8 
*addr,
+   const u8 queue, const u8 flags)
 {
struct e1000_hw *hw = &adapter->hw;
int rar_entries = hw->mac.rar_entry_count -
@@ -6886,12 +6903,14 @@ static int igb_del_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
for (i = 0; i < rar_entries; i++) {
if (!(adapter->mac_table[i].state & IGB_MAC_STATE_IN_USE))
continue;
+   if ((adapter->mac_table[i].state & flags) != flags)
+   continue;
if (adapter->mac_table[i].queue != queue)
continue;
if (!ether_addr_equal(adapter->mac_table[i].addr, addr))
continue;
 
-   adapter->mac_table[i].state &= ~IGB_MAC_STATE_IN_USE;
+   adapter->mac_table[i].state = 0;
memset(adapter->mac_table[i].addr, 0, ETH_ALEN);
adapter->mac_table[i].queue = 0;
 
@@ -6902,6 +6921,12 @@ static int igb_del_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
return -ENOENT;
 }
 
+static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
+ const u8 queue)
+{
+   return igb_del_mac_filter_flags(adapter, addr, queue, 0);
+}
+
 static int igb_uc_sync(struct net

[next-queue PATCH v3 8/8] igb: Add support for adding offloaded clsflower filters

2018-03-06 Thread Vinicius Costa Gomes

This allows filters added by tc-flower and specifying MAC addresses,
Ethernet types, and the VLAN priority field, to be offloaded to the
controller.

This reuses most of the infrastructure used by ethtool, but clsflower
filters are kept in a separated list, so they are invisible to
ethtool.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb.h  |   2 +
 drivers/net/ethernet/intel/igb/igb_main.c | 188 +-
 2 files changed, 188 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 43ce6d64f693..0edd3a74d043 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -463,6 +463,7 @@ struct igb_nfc_input {
 struct igb_nfc_filter {
struct hlist_node nfc_node;
struct igb_nfc_input filter;
+   unsigned long cookie;
u16 etype_reg_index;
u16 sw_idx;
u16 action;
@@ -601,6 +602,7 @@ struct igb_adapter {
 
/* RX network flow classification support */
struct hlist_head nfc_filter_list;
+   struct hlist_head cls_flower_list;
unsigned int nfc_filter_count;
/* lock for RX network flow classification filter */
spinlock_t nfc_lock;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index b5a6bd37bb16..66174ae7d62d 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2497,16 +2497,197 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
return 0;
 }
 
+#define ETHER_TYPE_FULL_MASK ((__force __be16)~0)
+#define VLAN_PRIO_FULL_MASK (0x07)
+
+static int igb_parse_cls_flower(struct igb_adapter *adapter,
+   struct tc_cls_flower_offload *f,
+   int traffic_class,
+   struct igb_nfc_filter *input)
+{
+   struct netlink_ext_ack *extack = f->common.extack;
+
+   if (f->dissector->used_keys &
+   ~(BIT(FLOW_DISSECTOR_KEY_BASIC) |
+ BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+ BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
+ BIT(FLOW_DISSECTOR_KEY_VLAN))) {
+   NL_SET_ERR_MSG(extack,
+  "Unsupported key used, only BASIC, CONTROL, 
ETH_ADDRS and VLAN are supported");
+   return -EOPNOTSUPP;
+   }
+
+   if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+   struct flow_dissector_key_eth_addrs *key, *mask;
+
+   key = skb_flow_dissector_target(f->dissector,
+   FLOW_DISSECTOR_KEY_ETH_ADDRS,
+   f->key);
+   mask = skb_flow_dissector_target(f->dissector,
+FLOW_DISSECTOR_KEY_ETH_ADDRS,
+f->mask);
+
+   if (!is_zero_ether_addr(mask->dst)) {
+   if (!is_broadcast_ether_addr(mask->dst)) {
+   NL_SET_ERR_MSG(extack, "Only full masks are 
supported for destination MAC address");
+   return -EINVAL;
+   }
+
+   input->filter.match_flags |=
+   IGB_FILTER_FLAG_DST_MAC_ADDR;
+   ether_addr_copy(input->filter.dst_addr, key->dst);
+   }
+
+   if (!is_zero_ether_addr(mask->src)) {
+   if (!is_broadcast_ether_addr(mask->src)) {
+   NL_SET_ERR_MSG(extack, "Only full masks are 
supported for source MAC address");
+   return -EINVAL;
+   }
+
+   input->filter.match_flags |=
+   IGB_FILTER_FLAG_SRC_MAC_ADDR;
+   ether_addr_copy(input->filter.src_addr, key->src);
+   }
+   }
+
+   if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
+   struct flow_dissector_key_basic *key, *mask;
+
+   key = skb_flow_dissector_target(f->dissector,
+   FLOW_DISSECTOR_KEY_BASIC,
+   f->key);
+   mask = skb_flow_dissector_target(f->dissector,
+FLOW_DISSECTOR_KEY_BASIC,
+f->mask);
+
+   if (mask->n_proto) {
+   if (mask->n_proto != ETHER_TYPE_FULL_MASK) {
+   NL_SET_ERR_MSG(extack, "Only full mask is 
supported for EtherType filter");
+   return -EINVAL;
+   }
+
+   input->filter.match_flags |= IGB_FILTER_FLAG_ETHER_TYPE;
+   input->filter.etype = key->n

[next-queue PATCH v3 2/8] igb: Fix queue selection on MAC filters on i210 and i211

2018-03-06 Thread Vinicius Costa Gomes

On the RAH registers there are semantic differences on the meaning of
the "queue" parameter for traffic steering depending on the controller
model: there is the 82575 meaning, which "queue" means a RX Hardware
Queue, and the i350 meaning, where it is a reception pool.

The previous behaviour was having no effect for i210 and i211 based
controllers because the QSEL bit of the RAH register wasn't being set.

This patch separates the condition in discrete cases, so the different
handling is clearer.

Fixes: 83c21335c876 ("igb: improve MAC filter handling")
Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c  | 15 +++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 83cabff1e0ab..573bf177fd08 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -490,6 +490,7 @@
  * manageability enabled, allowing us room for 15 multicast addresses.
  */
 #define E1000_RAH_AV  0x8000/* Receive descriptor valid */
+#define E1000_RAH_QSEL_ENABLE 0x1000
 #define E1000_RAL_MAC_ADDR_LEN 4
 #define E1000_RAH_MAC_ADDR_LEN 2
 #define E1000_RAH_POOL_MASK 0x03FC
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae785369..0ea32be07d71 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8741,12 +8741,19 @@ static void igb_rar_set_index(struct igb_adapter 
*adapter, u32 index)
if (is_valid_ether_addr(addr))
rar_high |= E1000_RAH_AV;
 
-   if (hw->mac.type == e1000_82575)
+   switch (hw->mac.type) {
+   case e1000_82575:
+   case e1000_i210:
+   case e1000_i211:
+   rar_high |= E1000_RAH_QSEL_ENABLE;
rar_high |= E1000_RAH_POOL_1 *
-   adapter->mac_table[index].queue;
-   else
+ adapter->mac_table[index].queue;
+   break;
+   default:
rar_high |= E1000_RAH_POOL_1 <<
-   adapter->mac_table[index].queue;
+   adapter->mac_table[index].queue;
+   break;
+   }
}
 
wr32(E1000_RAL(index), rar_low);
-- 
2.16.2

[next-queue PATCH v3 5/8] igb: Enable nfc filters to specify MAC addresses

2018-03-06 Thread Vinicius Costa Gomes

This allows igb_add_filter()/igb_erase_filter() to work on filters
that include MAC addresses (both source and destination).

For now, this only exposes the functionality, the next commit glues
ethtool into this. Later in this series, these APIs are used to allow
offloading of cls_flower filters.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb.h |  9 +
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 28 
 drivers/net/ethernet/intel/igb/igb_main.c|  8 
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index d5cd5f6708d9..43ce6d64f693 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -440,6 +440,8 @@ struct hwmon_buff {
 enum igb_filter_match_flags {
IGB_FILTER_FLAG_ETHER_TYPE = 0x1,
IGB_FILTER_FLAG_VLAN_TCI   = 0x2,
+   IGB_FILTER_FLAG_SRC_MAC_ADDR   = 0x4,
+   IGB_FILTER_FLAG_DST_MAC_ADDR   = 0x8,
 };
 
 #define IGB_MAX_RXNFC_FILTERS 16
@@ -454,6 +456,8 @@ struct igb_nfc_input {
u8 match_flags;
__be16 etype;
__be16 vlan_tci;
+   u8 src_addr[ETH_ALEN];
+   u8 dst_addr[ETH_ALEN];
 };
 
 struct igb_nfc_filter {
@@ -738,4 +742,9 @@ int igb_add_filter(struct igb_adapter *adapter,
 int igb_erase_filter(struct igb_adapter *adapter,
 struct igb_nfc_filter *input);
 
+int igb_add_mac_filter_flags(struct igb_adapter *adapter, const u8 *addr,
+const u8 queue, const u8 flags);
+int igb_del_mac_filter_flags(struct igb_adapter *adapter, const u8 *addr,
+const u8 queue, const u8 flags);
+
 #endif /* _IGB_H_ */
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 143f0bb34e4d..94fc9a4bed8b 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2775,6 +2775,25 @@ int igb_add_filter(struct igb_adapter *adapter, struct 
igb_nfc_filter *input)
return err;
}
 
+   if (input->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR) {
+   err = igb_add_mac_filter_flags(adapter,
+  input->filter.dst_addr,
+  input->action, 0);
+   err = min_t(int, err, 0);
+   if (err)
+   return err;
+   }
+
+   if (input->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR) {
+   err = igb_add_mac_filter_flags(adapter,
+  input->filter.src_addr,
+  input->action,
+  IGB_MAC_STATE_SRC_ADDR);
+   err = min_t(int, err, 0);
+   if (err)
+   return err;
+   }
+
if (input->filter.match_flags & IGB_FILTER_FLAG_VLAN_TCI)
err = igb_rxnfc_write_vlan_prio_filter(adapter, input);
 
@@ -2823,6 +2842,15 @@ int igb_erase_filter(struct igb_adapter *adapter, struct 
igb_nfc_filter *input)
igb_clear_vlan_prio_filter(adapter,
   ntohs(input->filter.vlan_tci));
 
+   if (input->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR)
+   igb_del_mac_filter_flags(adapter, input->filter.src_addr,
+input->action,
+IGB_MAC_STATE_SRC_ADDR);
+
+   if (input->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR)
+   igb_del_mac_filter_flags(adapter, input->filter.dst_addr,
+input->action, 0);
+
return 0;
 }
 
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 1df1c5a99a0d..71e03b5227df 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6843,8 +6843,8 @@ static void igb_set_default_mac_filter(struct igb_adapter 
*adapter)
  * default for the destination address, if matching by source address
  * is desired the flag IGB_MAC_STATE_SRC_ADDR can be used.
  */
-static int igb_add_mac_filter_flags(struct igb_adapter *adapter, const u8 
*addr,
-   const u8 queue, const u8 flags)
+int igb_add_mac_filter_flags(struct igb_adapter *adapter, const u8 *addr,
+const u8 queue, const u8 flags)
 {
struct e1000_hw *hw = &adapter->hw;
int rar_entries = hw->mac.rar_entry_count -
@@ -6885,8 +6885,8 @@ static int igb_add_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
  * matching by source address is to be removed the flag
  * IGB_MAC_STATE_SRC_ADDR can be used.
  */
-static int igb_del_mac_filter_flags(struct igb_adapter *adapter, const u8 
*add

Re: "wrong" ifindex on received VLAN tagged packet?

2018-03-06 Thread Lawrence Kreeger

Using ETH_P_ALL instead of ETH_P_802_2, is causing mstpd to get 3
copies of the same BPDU.  One from eth0, one from eth0.100, and
another from vlan100 (the bridge).
mstpd will drop the one from vlan100, but since there is also an
instance of spanning tree running on the native VLAN, there is now no
way to differentiate BPDUs coming in
tagged vs untagged because they all show up with eth0.  So, there
isn't some kernel knob to get the BPDUs to only come from eth0.100?

On Tue, Mar 6, 2018 at 4:43 PM, David Ahern  wrote:
> On 3/6/18 3:02 PM, Lawrence Kreeger wrote:
>> Hello,
>>
>> I'm trying to run mstpd on a per VLAN basis using one traditional
>> linux bridge per VLAN.  I'm running it on kernel version 4.12.4.  It
>> works fine for untagged frames, but I'm having a problem with VLAN
>> tagged BPDUs arriving on the socket with the ifindex of the bridge
>> itself, and not the VLAN tagged interface.  For example, I have a
>> tagged interface eth0.100 connected to the bridge "vlan100".  When
>> packets arrive, they have the ifindex of vlan100, which mstpd doesn't
>> recognize as a valid spanning tree interface, so it drops them.  Is
>> there something needed to be set in the kernel to get the ifindex of
>> eth0.100 instead?  This is how mstpd opens the raw socket:
>>
>>
>> /* Berkeley Packet filter code to filter out spanning tree packets.
>>from tcpdump -s 1152 -dd stp
>>  */
>> static struct sock_filter stp_filter[] = {
>> { 0x28, 0, 0, 0x000c },
>> { 0x25, 3, 0, 0x05dc },
>> { 0x30, 0, 0, 0x000e },
>> { 0x15, 0, 1, 0x0042 },
>> { 0x6, 0, 0, 0x0480 },
>> { 0x6, 0, 0, 0x },
>> };
>>
>> /*
>>  * Open up a raw packet socket to catch all 802.2 packets.
>>  * and install a packet filter to only see STP (SAP 42)
>>  *
>>  * Since any bridged devices are already in promiscious mode
>>  * no need to add multicast address.
>>  */
>> int packet_sock_init(void)
>> {
>> int s;
>> struct sock_fprog prog =
>> {
>> .len = sizeof(stp_filter) / sizeof(stp_filter[0]),
>> .filter = stp_filter,
>> };
>>
>> s = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_802_2));
>
> try ETH_P_ALL
>
>> if(s < 0)
>> {
>> ERROR("socket failed: %m");
>> return -1;
>> }
>>
>> if(setsockopt(s, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0)
>> ERROR("setsockopt packet filter failed: %m");
>> else if(fcntl(s, F_SETFL, O_NONBLOCK) < 0)
>> ERROR("fcntl set nonblock failed: %m");
>> else
>> {
>> packet_event.fd = s;
>> packet_event.handler = packet_rcv;
>
> And then packet_rcv using recvfrom:
> struct sockaddr_ll sll;
> char buf[4096];
> socklen_t alen;
> int len;
>
> alen = sizeof(sll);
> len = recvfrom(sd, buf, sizeof(buf), 0,
> (struct sockaddr *)&sll, &alen);
>
> And sll.sll_ifindex will show vlan device indices.
>
>
>>
>> if(0 == add_epoll(&packet_event))
>> return 0;
>> }
>>
>> close(s);
>> return -1;
>> }
>>
>> Thanks, Larry
>>
>

Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Andy Lutomirski

On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün  wrote:
>
> On 06/03/2018 23:46, Tycho Andersen wrote:
>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
> Suppose I'm writing a container manager.  I want to run "mount" in the
> container, but I don't want to allow moun() in general and I want to
> emulate certain mount() actions.  I can write a filter that catches
> mount using seccomp and calls out to the container manager for help.
> This isn't theoretical -- Tycho wants *exactly* this use case to be
> supported.

 Well, I think this use case should be handled with something like
 LD_PRELOAD and a helper library. FYI, I did something like this:
 https://github.com/stemjail/stemshim
>>>
>>> I doubt that will work for containers.  Containers that use user
>>> namespaces and, for example, setuid programs aren't going to honor
>>> LD_PRELOAD.
>>
>> Or anything that calls syscalls directly, like go programs.
>
> That's why the vDSO-like approach. Enforcing an access control is not
> the issue here, patching a buggy userland (without patching its code) is
> the issue isn't it?
>
> As far as I remember, the main problem is to handle file descriptors
> while "emulating" the kernel behavior. This can be done with a "shim"
> code mapped in every processes. Chrome used something like this (in a
> previous sandbox mechanism) as a kind of emulation (with the current
> seccomp-bpf ). I think it should be doable to replace the (userland)
> emulation code with an IPC wrapper receiving file descriptors through
> UNIX socket.
>

Can you explain exactly what you mean by "vDSO-like"?

When a 64-bit program does a syscall, it just executes the SYSCALL
instruction.  The vDSO isn't involved at all.  32-bit programs usually
go through the vDSO, but not always.

It could be possible to force-load a DSO into an entire container and
rig up seccomp to intercept all SYSCALLs not originating from the DSO
such that they merely redirect control to the DSO, but that seems
quite messy.

[RFC v3 iproute2 2/3] uapi pkt_sched: Add tbs info - DO NOT COMMIT

2018-03-06 Thread Jesus Sanchez-Palencia

This should come from the next uapi headers update.
Sending it now just as a convenience so anyone can build tc with tbs
support.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/uapi/linux/pkt_sched.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096a..92af9fa4 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,22 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* TBS */
+struct tc_tbs_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_TBS_SORTING_ON BIT(0)
+#define TC_TBS_OFFLOAD_ON BIT(1)
+};
+
+enum {
+   TCA_TBS_UNSPEC,
+   TCA_TBS_PARMS,
+   __TCA_TBS_MAX,
+};
+
+#define TCA_TBS_MAX (__TCA_TBS_MAX - 1)
+
 #endif
-- 
2.16.2

[RFC v3 iproute2 3/3] tc: Add support for the TBS Qdisc

2018-03-06 Thread Jesus Sanchez-Palencia

From: Vinicius Costa Gomes 

The Time Based Scheduler (TBS) queueing discipline allows precise
control of the transmission time of packets.

The syntax is:

tc qdisc add dev DEV parent NODE tbs delta 
 clockid  [offload] [sorting]

Signed-off-by: Vinicius Costa Gomes 
Signed-off-by: Jesus Sanchez-Palencia 
---
 tc/Makefile |   1 +
 tc/q_tbs.c  | 200 
 2 files changed, 201 insertions(+)
 create mode 100644 tc/q_tbs.c

diff --git a/tc/Makefile b/tc/Makefile
index 3716dd6a..3c87b0dc 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -71,6 +71,7 @@ TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
 TCMODULES += f_matchall.o
 TCMODULES += q_cbs.o
+TCMODULES += q_tbs.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_tbs.c b/tc/q_tbs.c
new file mode 100644
index ..b0823dc9
--- /dev/null
+++ b/tc/q_tbs.c
@@ -0,0 +1,200 @@
+/*
+ * q_tbs.c TBS.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes 
+ * Jesus Sanchez-Palencia 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+/* clockid is invalid if bits 0, 1, 2 are set as described by posix-timers.h */
+#define CLOCKID_INVALID (BIT(0) | BIT(1) | BIT(2))
+#define PTP_MAX_DEV_PATH 16
+
+/* fd to clockid helpers. Copied from posix-timers.h. */
+#define CLOCKFD 3
+static inline clockid_t make_process_cpuclock(const unsigned int pid,
+const clockid_t clock)
+{
+return ((~pid) << 3) | clock;
+}
+
+static inline clockid_t fd_to_clockid(const int fd)
+{
+return make_process_cpuclock((unsigned int) fd, CLOCKFD);
+}
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... tbs delta NANOS clockid CLOCKID [offload] 
[sorting]\n");
+   fprintf(stderr, "CLOCKID must be a valid SYS-V id (i.e. CLOCK_TAI) or \
+a dynamic clock (i.e. /dev/ptp0).\n");
+}
+
+static void explain1(const char *arg, const char *val)
+{
+   fprintf(stderr, "tbs: illegal value for \"%s\": \"%s\"\n", arg, val);
+}
+
+static void explain_clockid(const char *val)
+{
+   fprintf(stderr, "tbs: illegal value for \"clockid\": \"%s\".\n", val);
+   fprintf(stderr, "It must be a valid SYS-V id (i.e. CLOCK_TAI) or "\
+   "dynamic clock (i.e. /dev/ptp0).\n");
+}
+
+static int get_clockid(__s32 *val, const char *arg)
+{
+   const struct static_clockid {
+   const char *name;
+   clockid_t clockid;
+   } clockids_sysv[] = {
+   { "CLOCK_REALTIME", CLOCK_REALTIME },
+   { "CLOCK_TAI", CLOCK_TAI },
+   { "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
+   { "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
+   { NULL }
+   };
+
+   struct ptp_clock_caps capabilities;
+   char ptp_path[PTP_MAX_DEV_PATH];
+   const struct static_clockid *c;
+   int fd_ptp;
+
+   for (c = clockids_sysv; c->name; c++) {
+   if (strncasecmp(c->name, arg, 25) == 0) {
+   *val = c->clockid;
+
+   return 0;
+   }
+   }
+
+   snprintf(ptp_path, sizeof(ptp_path), "%s", arg);
+   fd_ptp = open(ptp_path, O_RDONLY);
+
+   /* Make sure the path provided points to a PTP chardev. */
+   if (fd_ptp < 0 || ioctl(fd_ptp, PTP_CLOCK_GETCAPS, &capabilities) < 0) {
+   return -1;
+   }
+
+   *val = fd_to_clockid(fd_ptp);
+   return 0;
+}
+
+
+static int tbs_parse_opt(struct qdisc_util *qu, int argc,
+char **argv, struct nlmsghdr *n, const char *dev)
+{
+   struct tc_tbs_qopt opt = {
+   .clockid = CLOCKID_INVALID,
+   };
+   struct rtattr *tail;
+
+   while (argc > 0) {
+   if (matches(*argv, "offload") == 0) {
+   if (opt.flags & TC_TBS_OFFLOAD_ON) {
+   fprintf(stderr, "tbs: duplicate \"offload\" 
specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_TBS_OFFLOAD_ON;
+   } else if (matches(*argv, "sorting") == 0) {
+   if (opt.flags & TC_TBS_SORTING_ON) {
+   fprintf(stderr, "tbs: duplicate \"sorting\" 
specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_TBS_SORTING_ON;
+   } else if (matches(*argv, "delta") == 0) {
+   NEXT_ARG();
+   if (opt.delta) {
+

[RFC v3 iproute2 1/3] include: Add ptp_clock.h to linux uapi

2018-03-06 Thread Jesus Sanchez-Palencia

This header will be used by the new tc-tbs qdisc.
It was copied from kernel tag 4.16.0-rc2.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/uapi/linux/ptp_clock.h | 147 +
 1 file changed, 147 insertions(+)
 create mode 100644 include/uapi/linux/ptp_clock.h

diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h
new file mode 100644
index ..3039bf6a
--- /dev/null
+++ b/include/uapi/linux/ptp_clock.h
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+/*
+ * PTP 1588 clock support - user space interface
+ *
+ * Copyright (C) 2010 OMICRON electronics GmbH
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#ifndef _PTP_CLOCK_H_
+#define _PTP_CLOCK_H_
+
+#include 
+#include 
+
+/* PTP_xxx bits, for the flags field within the request structures. */
+#define PTP_ENABLE_FEATURE (1<<0)
+#define PTP_RISING_EDGE(1<<1)
+#define PTP_FALLING_EDGE   (1<<2)
+
+/*
+ * struct ptp_clock_time - represents a time value
+ *
+ * The sign of the seconds field applies to the whole value. The
+ * nanoseconds field is always unsigned. The reserved field is
+ * included for sub-nanosecond resolution, should the demand for
+ * this ever appear.
+ *
+ */
+struct ptp_clock_time {
+   __s64 sec;  /* seconds */
+   __u32 nsec; /* nanoseconds */
+   __u32 reserved;
+};
+
+struct ptp_clock_caps {
+   int max_adj;   /* Maximum frequency adjustment in parts per billon. */
+   int n_alarm;   /* Number of programmable alarms. */
+   int n_ext_ts;  /* Number of external time stamp channels. */
+   int n_per_out; /* Number of programmable periodic signals. */
+   int pps;   /* Whether the clock supports a PPS callback. */
+   int n_pins;/* Number of input/output pins. */
+   /* Whether the clock supports precise system-device cross timestamps */
+   int cross_timestamping;
+   int rsv[13];   /* Reserved for future use. */
+};
+
+struct ptp_extts_request {
+   unsigned int index;  /* Which channel to configure. */
+   unsigned int flags;  /* Bit field for PTP_xxx flags. */
+   unsigned int rsv[2]; /* Reserved for future use. */
+};
+
+struct ptp_perout_request {
+   struct ptp_clock_time start;  /* Absolute start time. */
+   struct ptp_clock_time period; /* Desired period, zero means disable. */
+   unsigned int index;   /* Which channel to configure. */
+   unsigned int flags;   /* Reserved for future use. */
+   unsigned int rsv[4];  /* Reserved for future use. */
+};
+
+#define PTP_MAX_SAMPLES 25 /* Maximum allowed offset measurement samples. */
+
+struct ptp_sys_offset {
+   unsigned int n_samples; /* Desired number of measurements. */
+   unsigned int rsv[3];/* Reserved for future use. */
+   /*
+* Array of interleaved system/phc time stamps. The kernel
+* will provide 2*n_samples + 1 time stamps, with the last
+* one as a system time stamp.
+*/
+   struct ptp_clock_time ts[2 * PTP_MAX_SAMPLES + 1];
+};
+
+struct ptp_sys_offset_precise {
+   struct ptp_clock_time device;
+   struct ptp_clock_time sys_realtime;
+   struct ptp_clock_time sys_monoraw;
+   unsigned int rsv[4];/* Reserved for future use. */
+};
+
+enum ptp_pin_function {
+   PTP_PF_NONE,
+   PTP_PF_EXTTS,
+   PTP_PF_PEROUT,
+   PTP_PF_PHYSYNC,
+};
+
+struct ptp_pin_desc {
+   /*
+* Hardware specific human readable pin name. This field is
+* set by the kernel during the PTP_PIN_GETFUNC ioctl and is
+* ignored for the PTP_PIN_SETFUNC ioctl.
+*/
+   char name[64];
+   /*
+* Pin index in the range of zero to ptp_clock_caps.n_pins - 1.
+*/
+   unsigned int index;
+   /*
+* Which of the PTP_PF_xxx functions to use on this pin.
+*/
+   unsigned int func;
+   /*
+* The specific channel to use for this function.
+* This corresponds to the 'index' field of the
+* PTP_EXTTS_REQUEST and PTP_PEROUT_REQUEST ioctls.
+*/
+   unsigned int chan;
+   /*
+* Reserved for future use.
+*/
+   unsigned int rsv[5];
+};
+
+#define PTP_CLK_MAGIC '='
+
+#define PTP_CLOCK_GETCAPS  _IOR

[RFC v3 net-next 05/18] net: ipv4: raw: Hook into time based transmission.

2018-03-06 Thread Jesus Sanchez-Palencia

From: Richard Cochran 

For raw packets, copy the desired future transmit time from the CMSG
cookie into the skb.

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 net/ipv4/raw.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 54648d20bf0f..8e05970ba7c4 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
 
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc->transmit_time;
skb_dst_set(skb, &rt->dst);
*rtp = NULL;
 
@@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
}
 
ipc.sockc.tsflags = sk->sk_tsflags;
+   ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.opt = NULL;
ipc.tx_flags = 0;
-- 
2.16.2

[RFC v3 net-next 03/18] posix-timers: Add CLOCKID_INVALID mask

2018-03-06 Thread Jesus Sanchez-Palencia

posix-timers.h states that a clockid_t value is invalid if bits 0, 1 and
2 are all set. Add a mask that can be safely used elsewhere even if this
implicit rule's implementation is changed.

This is done in preparation for the upcoming time based transmission
patchset.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/linux/posix-timers.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index c85704fcdbd2..0ba677cc8da6 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -28,6 +28,7 @@ struct cpu_timer_list {
  *
  * A clockid is invalid if bits 2, 1, and 0 are all set.
  */
+#define CLOCKID_INVALIDGENMASK(2, 0)
 #define CPUCLOCK_PID(clock)((pid_t) ~((clock) >> 3))
 #define CPUCLOCK_PERTHREAD(clock) \
(((clock) & (clockid_t) CPUCLOCK_PERTHREAD_MASK) != 0)
-- 
2.16.2

[RFC v3 net-next 00/18] Time based packet transmission

2018-03-06 Thread Jesus Sanchez-Palencia

This series is the v3 of the Time based packet transmission RFC, which was
originally proposed by Richard Cochran (v1: https://lwn.net/Articles/733962/ )
and further developed by us with the addition of the tbs qdisc
(v2: https://lwn.net/Articles/744797/ ).

It introduces a new socket option (SO_TXTIME), a new qdisc (tbs) and
implements support for hw offloading on the igb driver for the Intel
i210 NIC. The tbs qdisc also supports SW best effort that can be used
as a fallback.

The main changes since v2 can be found below.

Fixes since v2:
 - skb->tstamp is only cleared on the forwarding path;
 - ktime_t is no longer the type used for timestamps (s64 is);
 - get_unaligned() is now used for copying data from the cmsg header;
 - added getsockopt() support for SO_TXTIME;
 - restricted SO_TXTIME input range to [0,1];
 - removed ns_capable() check from __sock_cmsg_send();
 - the qdisc  control struct now uses a 32 bitmap for config flags;
 - fixed qdisc backlog decrement bug;
 - 'overlimits' is now incremented on dequeue() drops in addition to the
   'dropped' counter;

Interface changes since v2:
 * CMSG interface:
   - added a per-packet clockid parameter to the cmsg (SCM_CLOCKID);
   - added a per-packet drop_if_late flag to the cmsg (SCM_DROP_IF_LATE);
 * tc-tbs:
   - clockid now receives a string;
 e.g.: CLOCK_REALTIME or /dev/ptp0
   - offload is now a standalone argument (i.e. no more offload 1);
   - sorting is now argument that enables txtime based sorting provided
 by the qdisc;

Design changes since v2:
 - Now on the dequeue() path, tbs only drops an expired packet if it has the
   skb->tc_drop_if_late flag set. In practical terms, this will define if
   the semantics of txtime on a system is "not earlier than" or "not later
   than" a given timestamp;
 - Now on the enqueue() path, the qdisc will drop a packet if its clockid
   doesn't match the qdisc's one;
 - Sorting the packets based on their txtime is now an option for the disc.
   Effectively, this means it can be configured in 4 modes: HW offload or
   SW best-effort, sorting enabled or disabled;


The tbs qdisc is designed so it buffers packets until a configurable time before
their deadline (tx times). If sorting is enabled, regardless of HW offload or SW
fallback modes, the qdisc uses a rbtree internally so the buffered packets are
always 'ordered' by the earliest deadline.

If sorting is disabled, then for HW offload the qdisc will use a 'raw' FIFO
through qdisc_enqueue_tail() / qdisc_dequeue_head(), whereas for SW best-effort,
it will use a 'scheduled' FIFO.

The other configurable parameter from the tbs qdisc is the clockid to be used.
In order to provide that, this series adds a new API to pkt_sched.h (i.e.
qdisc_watchdog_init_clockid()).

The tbs qdisc will drop any packets with a transmission time in the past or
when a deadline is missed if SCM_DROP_IF_LATE is set. Queueing packets in
advance plus configuring the delta parameter for the system correctly makes
all the difference in reducing the number of drops. Moreover, note that the
delta parameter ends up defining the Tx time when SW best-effort is used
given that the timestamps won't be used by the NIC on this case.

Examples:

# SW best-effort with sorting #

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs delta 10 \
   clockid CLOCK_REALTIME sorting

In this example first the mqprio qdisc is setup, then the tbs qdisc is
configured onto the first hw Tx queue using SW best-effort with sorting
enabled. Also, it is configured so the timestamps on each packet are in
reference to the clockid CLOCK_REALTIME and so packets are dequeued from
the qdisc 10 nanoseconds before their transmission time.


# HW offload without sorting #

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs offload

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. It's assumed implicitly
the timestamp in skbuffs are in reference to the interface's PHC and
setting any other valid clockid would be treated as an error. Because
there is no scheduling being performed in the qdisc, setting a delta != 0
would also be considered an error.


# HW offload with sorting #
$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 10 \
   clockid CLOCK_REALTIME sorting

Here, the Qdisc will use HW offload for the txtime control again,
but now sorting will be enabled, and thus there will be scheduling being
performed by the

[RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path

2018-03-06 Thread Jesus Sanchez-Palencia

This is done in preparation for the upcoming time based transmission
patchset. Now that skb->tstamp will be used to hold packet's txtime,
we must ensure that it is being cleared when traversing namespaces.
Also, doing that from skb_scrub_packet() would break our feature when
tunnels are used.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/linux/netdevice.h | 1 +
 net/core/skbuff.c | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dbe6344b727a..7104de2bc957 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3379,6 +3379,7 @@ static __always_inline int dev_forward_skb(struct 
net_device *dev,
 
skb_scrub_packet(skb, true);
skb->priority = 0;
+   skb->tstamp = 0;
return 0;
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c13495ba6..678fc5416ae1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4865,7 +4865,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
  */
 void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-   skb->tstamp = 0;
skb->pkt_type = PACKET_HOST;
skb->skb_iif = 0;
skb->ignore_df = 0;
-- 
2.16.2

[RFC v3 net-next 06/18] net: ipv4: udp: Hook into time based transmission.

2018-03-06 Thread Jesus Sanchez-Palencia

From: Richard Cochran 

For udp packets, copy the desired future transmit time from the CMSG
cookie into the skb.

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 net/ipv4/udp.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3013404d0935..d683bbde526b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -926,6 +926,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t 
len)
}
 
ipc.sockc.tsflags = sk->sk_tsflags;
+   ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.oif = sk->sk_bound_dev_if;
 
@@ -1040,8 +1041,10 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
  sizeof(struct udphdr), &ipc, &rt,
  msg->msg_flags);
err = PTR_ERR(skb);
-   if (!IS_ERR_OR_NULL(skb))
+   if (!IS_ERR_OR_NULL(skb)) {
+   skb->tstamp = ipc.sockc.transmit_time;
err = udp_send_skb(skb, fl4);
+   }
goto out;
}
 
-- 
2.16.2

[RFC v3 net-next 04/18] net: Add a new socket option for a future transmit time.

2018-03-06 Thread Jesus Sanchez-Palencia

From: Richard Cochran 

This patch introduces SO_TXTIME.  User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2).

A new field is added to struct sockcm_cookie, and the tstamp from
skbuffs will be used later on.

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 arch/alpha/include/uapi/asm/socket.h   |  3 +++
 arch/frv/include/uapi/asm/socket.h |  3 +++
 arch/ia64/include/uapi/asm/socket.h|  3 +++
 arch/m32r/include/uapi/asm/socket.h|  3 +++
 arch/mips/include/uapi/asm/socket.h|  3 +++
 arch/mn10300/include/uapi/asm/socket.h |  3 +++
 arch/parisc/include/uapi/asm/socket.h  |  3 +++
 arch/s390/include/uapi/asm/socket.h|  3 +++
 arch/sparc/include/uapi/asm/socket.h   |  3 +++
 arch/xtensa/include/uapi/asm/socket.h  |  3 +++
 include/net/sock.h |  2 ++
 include/uapi/asm-generic/socket.h  |  3 +++
 net/core/sock.c| 21 +
 13 files changed, 56 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index be14f16149d5..065fb372e355 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -112,4 +112,7 @@
 
 #define SO_ZEROCOPY60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h 
b/arch/frv/include/uapi/asm/socket.h
index 9168e78fa32a..0e95f45cd058 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -105,5 +105,8 @@
 
 #define SO_ZEROCOPY60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index 3efba40adc54..c872c4e6bafb 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -114,4 +114,7 @@
 
 #define SO_ZEROCOPY60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h 
b/arch/m32r/include/uapi/asm/socket.h
index cf5018e82c3d..65276c95b8df 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -105,4 +105,7 @@
 
 #define SO_ZEROCOPY60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 49c3d4795963..71370fb3ceef 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -123,4 +123,7 @@
 
 #define SO_ZEROCOPY60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h 
b/arch/mn10300/include/uapi/asm/socket.h
index b35eee132142..d029a40b1b55 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -105,4 +105,7 @@
 
 #define SO_ZEROCOPY60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index 1d0fdc3b5d22..061b9cf2a779 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -104,4 +104,7 @@
 
 #define SO_ZEROCOPY0x4035
 
+#define SO_TXTIME  0x4036
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index 3510c0fd06f4..39d901476ee5 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -111,4 +111,7 @@
 
 #define SO_ZEROCOPY60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index d58520c2e6ff..7ea35e5601b6 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -101,6 +101,9 @@
 
 #define SO_ZEROCOPY0x003e
 
+#define SO_TXTIME  0x003f
+#define SCM_TXTIME SO_TXTIME
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT   0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h 
b/arch/xtensa/include/uapi/asm/socket.h
index 75a07b8119a9..1de07a7f7680 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -116,4 +116,7 @@
 
 #define SO_ZEROCOPY60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME

[RFC v3 net-next 17/18] igb: Refactor igb_offload_cbs()

2018-03-06 Thread Jesus Sanchez-Palencia

Split code into a separate function (igb_offload_apply()) that will be
used by TBS offload implementation.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 9c33f2d18d8c..10d7809a85d7 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2476,6 +2476,19 @@ igb_features_check(struct sk_buff *skb, struct 
net_device *dev,
return features;
 }
 
+static void igb_offload_apply(struct igb_adapter *adapter, s32 queue)
+{
+   if (!is_fqtss_enabled(adapter)) {
+   enable_fqtss(adapter, true);
+   return;
+   }
+
+   igb_config_tx_modes(adapter, queue);
+
+   if (!is_any_cbs_enabled(adapter))
+   enable_fqtss(adapter, false);
+}
+
 static int igb_offload_cbs(struct igb_adapter *adapter,
   struct tc_cbs_qopt_offload *qopt)
 {
@@ -2496,15 +2509,7 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
if (err)
return err;
 
-   if (is_fqtss_enabled(adapter)) {
-   igb_config_tx_modes(adapter, qopt->queue);
-
-   if (!is_any_cbs_enabled(adapter))
-   enable_fqtss(adapter, false);
-
-   } else {
-   enable_fqtss(adapter, true);
-   }
+   igb_offload_apply(adapter, qopt->queue);
 
return 0;
 }
-- 
2.16.2

[RFC v3 net-next 07/18] net: packet: Hook into time based transmission.

2018-03-06 Thread Jesus Sanchez-Palencia

From: Richard Cochran 

For raw layer-2 packets, copy the desired future transmit time from
the CMSG cookie into the skb.

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 net/packet/af_packet.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2c5a6fe5d749..b2115fac2a8d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1976,6 +1976,7 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, &sockc);
@@ -1987,6 +1988,7 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc.transmit_time;
 
sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
@@ -2484,6 +2486,7 @@ static int tpacket_fill_skb(struct packet_sock *po, 
struct sk_buff *skb,
skb->dev = dev;
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
+   skb->tstamp = sockc->transmit_time;
sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
skb_shinfo(skb)->destructor_arg = ph.raw;
 
@@ -2660,6 +2663,7 @@ static int tpacket_snd(struct packet_sock *po, struct 
msghdr *msg)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_put;
 
+   sockc.transmit_time = 0;
sockc.tsflags = po->sk.sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2856,6 +2860,7 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_unlock;
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
@@ -2928,6 +2933,7 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sockc.mark;
+   skb->tstamp = sockc.transmit_time;
 
if (has_vnet_hdr) {
err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
-- 
2.16.2

[RFC v3 net-next 12/18] net/sched: Allow creating a Qdisc watchdog with other clocks

2018-03-06 Thread Jesus Sanchez-Palencia

From: Vinicius Costa Gomes 

This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be
passed, this allows other time references to be used when scheduling
the Qdisc to run.

Signed-off-by: Vinicius Costa Gomes 
---
 include/net/pkt_sched.h |  2 ++
 net/sched/sch_api.c | 11 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 815b92a23936..2466ea143d01 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -72,6 +72,8 @@ struct qdisc_watchdog {
struct Qdisc*qdisc;
 };
 
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
*qdisc,
+clockid_t clockid);
 void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires);
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 68f9d942bed4..beb1dc296bfb 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -596,12 +596,19 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer 
*timer)
return HRTIMER_NORESTART;
 }
 
-void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
*qdisc,
+clockid_t clockid)
 {
-   hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+   hrtimer_init(&wd->timer, clockid, HRTIMER_MODE_ABS_PINNED);
wd->timer.function = qdisc_watchdog;
wd->qdisc = qdisc;
 }
+EXPORT_SYMBOL(qdisc_watchdog_init_clockid);
+
+void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+{
+   qdisc_watchdog_init_clockid(wd, qdisc, CLOCK_MONOTONIC);
+}
 EXPORT_SYMBOL(qdisc_watchdog_init);
 
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires)
-- 
2.16.2

[RFC v3 net-next 18/18] igb: Add support for TBS offload

2018-03-06 Thread Jesus Sanchez-Palencia

Implement HW offload support for SO_TXTIME through igb's Launchtime
feature. This is done by extending igb_setup_tc() so it supports
TC_SETUP_QDISC_TBS and configuring i210 so time based transmit
arbitration is enabled.

The FQTSS transmission mode added before is extended so strict
priority (SP) queues wait for stream reservation (SR) ones.
igb_config_tx_modes() is extended so it can support enabling/disabling
Launchtime following the previous approach used for the credit-based
shaper (CBS).

As the previous flow, FQTSS transmission mode is enabled automatically
by the driver once Launchtime (or CBS, as before) is enabled.
Similarly, it's automatically disabled when the feature is disabled
for the last queue that had it setup on.

The driver just consumes the transmit times from the skbuffs directly,
so no special handling is done in case an 'invalid' time is provided.
We assume this has been handled by the TBS qdisc already.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  16 +++
 drivers/net/ethernet/intel/igb/igb.h   |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c  | 135 ++---
 3 files changed, 137 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 83cabff1e0ab..9e357848c550 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -1066,6 +1066,22 @@
 #define E1000_TQAVCTRL_XMIT_MODE   BIT(0)
 #define E1000_TQAVCTRL_DATAFETCHARBBIT(4)
 #define E1000_TQAVCTRL_DATATRANARB BIT(8)
+#define E1000_TQAVCTRL_DATATRANTIM BIT(9)
+#define E1000_TQAVCTRL_SP_WAIT_SR  BIT(10)
+/* Fetch Time Delta - bits 31:16
+ *
+ * This field holds the value to be reduced from the launch time for
+ * fetch time decision. The FetchTimeDelta value is defined in 32 ns
+ * granularity.
+ *
+ * This field is 16 bits wide, and so the maximum value is:
+ *
+ * 65535 * 32 = 2097120 ~= 2.1 msec
+ *
+ * XXX: We are configuring the max value here since we couldn't come up
+ * with a reason for not doing so.
+ */
+#define E1000_TQAVCTRL_FETCHTIME_DELTA (0x << 16)
 
 /* TX Qav Credit Control fields */
 #define E1000_TQAVCC_IDLESLOPE_MASK0x
diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 1c6b8d9176a8..4e1146efa399 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -281,6 +281,7 @@ struct igb_ring {
u16 count;  /* number of desc. in the ring */
u8 queue_index; /* logical index of the ring*/
u8 reg_idx; /* physical index of the ring */
+   bool launchtime_enable; /* true if LaunchTime is enabled */
bool cbs_enable;/* indicates if CBS is enabled */
s32 idleslope;  /* idleSlope in kbps */
s32 sendslope;  /* sendSlope in kbps */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 10d7809a85d7..fa931f66a1f8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1684,13 +1684,26 @@ static bool is_any_cbs_enabled(struct igb_adapter 
*adapter)
return false;
 }
 
+static bool is_any_txtime_enabled(struct igb_adapter *adapter)
+{
+   int i;
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i]->launchtime_enable)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
  *
- *  Configure CBS for a given hardware queue. Parameters are retrieved
- *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  Configure CBS and Launchtime for a given hardware queue.
+ *  Parameters are retrieved from the correct Tx ring, so
+ *  igb_save_cbs_params() and igb_save_txtime_params() should be used
  *  for setting those correctly prior to this function being called.
  **/
 static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
@@ -1704,10 +1717,20 @@ static void igb_config_tx_modes(struct igb_adapter 
*adapter, int queue)
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
 
-   if (ring->cbs_enable) {
+   /* If any of the Qav features is enabled, configure queues as SR and
+* with HIGH PRIO. If none is, then configure them with LOW PRIO and
+* as SP.
+*/
+   if (ring->cbs_enable || ring->launchtime_enable) {
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
+   } else {
+   set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);

[RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params

2018-03-06 Thread Jesus Sanchez-Palencia

Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and
a drop_if_late flag. With this commit the API becomes:

- use SO_TXTIME to enable the feature on a socket;
- pass the per-packet arguments through the cmsg header using:
  * SCM_CLOCKID for the clockid to be used as the txtime clock source;
  * SCM_TXTIME for the txtime timestamp;
  * SCM_DROP_IF_LATE for the drop flag. This flag will be used by the
traffic control to decide if a delayed packet should be dropped.

Signed-off-by: Jesus Sanchez-Palencia 
---
 arch/alpha/include/uapi/asm/socket.h   |  2 ++
 arch/frv/include/uapi/asm/socket.h |  2 ++
 arch/ia64/include/uapi/asm/socket.h|  2 ++
 arch/m32r/include/uapi/asm/socket.h|  2 ++
 arch/mips/include/uapi/asm/socket.h|  2 ++
 arch/mn10300/include/uapi/asm/socket.h |  2 ++
 arch/parisc/include/uapi/asm/socket.h  |  2 ++
 arch/s390/include/uapi/asm/socket.h|  2 ++
 arch/sparc/include/uapi/asm/socket.h   |  2 ++
 arch/xtensa/include/uapi/asm/socket.h  |  2 ++
 include/linux/skbuff.h |  3 +++
 include/net/sock.h |  2 ++
 include/uapi/asm-generic/socket.h  |  2 ++
 net/core/sock.c| 22 +-
 14 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index 065fb372e355..3399dfefa579 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -114,5 +114,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h 
b/arch/frv/include/uapi/asm/socket.h
index 0e95f45cd058..43b636836722 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -107,6 +107,8 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index c872c4e6bafb..1f06d07aadbe 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -116,5 +116,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h 
b/arch/m32r/include/uapi/asm/socket.h
index 65276c95b8df..69ab380d8d48 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -107,5 +107,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 71370fb3ceef..97da79f58538 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -125,5 +125,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h 
b/arch/mn10300/include/uapi/asm/socket.h
index d029a40b1b55..7c7a174fdfae 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -107,5 +107,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index 061b9cf2a779..7fe86b5cd593 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -106,5 +106,7 @@
 
 #define SO_TXTIME  0x4036
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   0x4037
+#define SCM_CLOCKID0x4038
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index 39d901476ee5..97f90c4a9b8c 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -113,5 +113,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index 7ea35e5601b6..6397c366dd2d 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -103,6 +103,8 @@
 
 #define SO_TXTIME  0x003f
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   0x0040
+#define SCM_C

[RFC v3 net-next 16/18] igb: Only change Tx arbitration when CBS is on

2018-03-06 Thread Jesus Sanchez-Palencia

Currently the data transmission arbitration algorithm - DataTranARB
field on TQAVCTRL reg - is always set to CBS when the Tx mode is
changed from legacy to 'Qav' mode.

Make that configuration a bit more granular in preparation for the
upcoming Launchtime enabling patches, since CBS and Launchtime can be
enabled separately. That is achieved by moving the DataTranARB setup
to igb_config_tx_modes() instead.

Similarly, when disabling CBS we must check if it has been disabled
for all queues, and clear the DataTranARB accordingly.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 49 +--
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 49cfbe4fd2b1..9c33f2d18d8c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1672,6 +1672,18 @@ static void set_queue_mode(struct e1000_hw *hw, int 
queue, enum queue_mode mode)
wr32(E1000_I210_TQAVCC(queue), val);
 }
 
+static bool is_any_cbs_enabled(struct igb_adapter *adapter)
+{
+   int i;
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i]->cbs_enable)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
@@ -1686,7 +1698,7 @@ static void igb_config_tx_modes(struct igb_adapter 
*adapter, int queue)
struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
-   u32 tqavcc;
+   u32 tqavcc, tqavctrl;
u16 value;
 
WARN_ON(hw->mac.type != e1000_i210);
@@ -1696,6 +1708,14 @@ static void igb_config_tx_modes(struct igb_adapter 
*adapter, int queue)
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
 
+   /* Always set data transfer arbitration to credit-based
+* shaper algorithm on TQAVCTRL if CBS is enabled for any of
+* the queues.
+*/
+   tqavctrl = rd32(E1000_I210_TQAVCTRL);
+   tqavctrl |= E1000_TQAVCTRL_DATATRANARB;
+   wr32(E1000_I210_TQAVCTRL, tqavctrl);
+
/* According to i210 datasheet section 7.2.7.7, we should set
 * the 'idleSlope' field from TQAVCC register following the
 * equation:
@@ -1773,6 +1793,16 @@ static void igb_config_tx_modes(struct igb_adapter 
*adapter, int queue)
 
/* Set hiCredit to zero. */
wr32(E1000_I210_TQAVHC(queue), 0);
+
+   /* If CBS is not enabled for any queues anymore, then return to
+* the default state of Data Transmission Arbitration on
+* TQAVCTRL.
+*/
+   if (!is_any_cbs_enabled(adapter)) {
+   tqavctrl = rd32(E1000_I210_TQAVCTRL);
+   tqavctrl &= ~E1000_TQAVCTRL_DATATRANARB;
+   wr32(E1000_I210_TQAVCTRL, tqavctrl);
+   }
}
 
/* XXX: In i210 controller the sendSlope and loCredit parameters from
@@ -1806,18 +1836,6 @@ static int igb_save_cbs_params(struct igb_adapter 
*adapter, int queue,
return 0;
 }
 
-static bool is_any_cbs_enabled(struct igb_adapter *adapter)
-{
-   int i;
-
-   for (i = 0; i < adapter->num_tx_queues; i++) {
-   if (adapter->tx_ring[i]->cbs_enable)
-   return true;
-   }
-
-   return false;
-}
-
 /**
  *  igb_setup_tx_mode - Switch to/from Qav Tx mode when applicable
  *  @adapter: pointer to adapter struct
@@ -1841,11 +1859,10 @@ static void igb_setup_tx_mode(struct igb_adapter 
*adapter)
int i, max_queue;
 
/* Configure TQAVCTRL register: set transmit mode to 'Qav',
-* set data fetch arbitration to 'round robin' and set data
-* transfer arbitration to 'credit shaper algorithm.
+* set data fetch arbitration to 'round robin'.
 */
val = rd32(E1000_I210_TQAVCTRL);
-   val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_DATATRANARB;
+   val |= E1000_TQAVCTRL_XMIT_MODE;
val &= ~E1000_TQAVCTRL_DATAFETCHARB;
wr32(E1000_I210_TQAVCTRL, val);
 
-- 
2.16.2

[RFC v3 net-next 11/18] net: packet: Handle remaining txtime parameters

2018-03-06 Thread Jesus Sanchez-Palencia

Initialize clockid to CLOCKID_INVALID instead of 0 (i.e.
CLOCK_REALTIME), and copy both drop_if_late and clockid from CMSG cookie
into skb.

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/packet/af_packet.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b2115fac2a8d..e455fbf5a356 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -94,6 +94,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -1977,6 +1978,8 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
}
 
sockc.transmit_time = 0;
+   sockc.drop_if_late = 0;
+   sockc.clockid = CLOCKID_INVALID;
sockc.tsflags = sk->sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, &sockc);
@@ -1989,6 +1992,8 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
skb->tstamp = sockc.transmit_time;
+   skb->tc_drop_if_late = sockc.drop_if_late;
+   skb->txtime_clockid = sockc.clockid;
 
sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
@@ -2487,6 +2492,8 @@ static int tpacket_fill_skb(struct packet_sock *po, 
struct sk_buff *skb,
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
skb->tstamp = sockc->transmit_time;
+   skb->tc_drop_if_late = sockc->drop_if_late;
+   skb->txtime_clockid = sockc->clockid;
sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
skb_shinfo(skb)->destructor_arg = ph.raw;
 
@@ -2664,6 +2671,8 @@ static int tpacket_snd(struct packet_sock *po, struct 
msghdr *msg)
goto out_put;
 
sockc.transmit_time = 0;
+   sockc.drop_if_late = 0;
+   sockc.clockid = CLOCKID_INVALID;
sockc.tsflags = po->sk.sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2861,6 +2870,8 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
goto out_unlock;
 
sockc.transmit_time = 0;
+   sockc.drop_if_late = 0;
+   sockc.clockid = CLOCKID_INVALID;
sockc.tsflags = sk->sk_tsflags;
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
@@ -2934,6 +2945,8 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
skb->priority = sk->sk_priority;
skb->mark = sockc.mark;
skb->tstamp = sockc.transmit_time;
+   skb->tc_drop_if_late = sockc.drop_if_late;
+   skb->txtime_clockid = sockc.clockid;
 
if (has_vnet_hdr) {
err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
-- 
2.16.2

[RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS

2018-03-06 Thread Jesus Sanchez-Palencia

Add new queueing modes to tbs qdisc so HW offload is supported.

For hw offload, if sorting is on, then the time sorted list will still
be used, but when sorting is disabled the enqueue / dequeue flow will
be based on a 'raw' FIFO through the usage of qdisc_enqueue_tail() and
qdisc_dequeue_head(). For the 'raw hw offload' mode, the drop_if_late
flag from skbuffs is not used by the Qdisc since this mode implicitly
assumes the PHC clock is being used by applications.

Example 1:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs offload

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. It's assumed the timestamp
in skbuffs are in reference to the interface's PHC and setting any other
valid clockid would be treated as an error. Because there is no
scheduling being performed in the qdisc, setting a delta != 0 would also
be considered an error.

Example 2:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 10 \
   clockid CLOCK_REALTIME sorting

Here, the Qdisc will use HW offload for the txtime control again,
but now sorting will be enabled, and thus there will be scheduling being
performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
reference and packets leave the Qdisc "delta" (10) nanoseconds before
their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by the hrtimer, the system clock
and the PHC clock must be synchronized for this mode to behave as expected.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/net/pkt_sched.h|   5 ++
 include/uapi/linux/pkt_sched.h |   1 +
 net/sched/sch_tbs.c| 159 +++--
 3 files changed, 144 insertions(+), 21 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 2466ea143d01..d042ffda7f21 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -155,4 +155,9 @@ struct tc_cbs_qopt_offload {
s32 sendslope;
 };
 
+struct tc_tbs_qopt_offload {
+   u8 enable;
+   s32 queue;
+};
+
 #endif
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index a33b5b9da81a..92af9fa4dee4 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -941,6 +941,7 @@ struct tc_tbs_qopt {
__s32 clockid;
__u32 flags;
 #define TC_TBS_SORTING_ON BIT(0)
+#define TC_TBS_OFFLOAD_ON BIT(1)
 };
 
 enum {
diff --git a/net/sched/sch_tbs.c b/net/sched/sch_tbs.c
index c19eedda9bc5..2aafa55de42c 100644
--- a/net/sched/sch_tbs.c
+++ b/net/sched/sch_tbs.c
@@ -25,8 +25,10 @@
 #include 
 
 #define SORTING_IS_ON(x) (x->flags & TC_TBS_SORTING_ON)
+#define OFFLOAD_IS_ON(x) (x->flags & TC_TBS_OFFLOAD_ON)
 
 struct tbs_sched_data {
+   bool offload;
bool sorting;
int clockid;
int queue;
@@ -68,25 +70,42 @@ static inline int validate_input_params(struct tc_tbs_qopt 
*qopt,
struct netlink_ext_ack *extack)
 {
/* Check if params comply to the following rules:
-*  * If SW best-effort, then clockid and delta must be valid
-*regardless of sorting enabled or not.
+*  * If SW best-effort, then clockid and delta must be valid.
+*
+*  * If HW offload is ON and sorting is ON, then clockid and delta
+*must be valid.
+*
+*  * If HW offload is ON and sorting is OFF, then clockid and
+*delta must not have been set. The netdevice PHC will be used
+*implictly.
 *
 *  * Dynamic clockids are not supported.
 *  * Delta must be a positive integer.
 */
-   if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
-   qopt->clockid >= MAX_CLOCKS) {
-   NL_SET_ERR_MSG(extack, "Invalid clockid");
-   return -EINVAL;
-   } else if (qopt->clockid < 0 ||
-  !clockid_to_get_time[qopt->clockid]) {
-   NL_SET_ERR_MSG(extack, "Clockid is not supported");
-   return -ENOTSUPP;
-   }
-
-   if (qopt->delta < 0) {
-   NL_SET_ERR_MSG(extack, "Delta must be positive");
-   return -EINVAL;
+   if (!OFFLOAD_IS_ON(qopt) || SORTING_IS_ON(qopt)) {
+   if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
+   qopt->clockid >= MAX_CLOCKS) {
+   NL_SET_ERR_MSG(extack, "Invalid clockid");
+   return -EINVAL;
+   } else if (qopt->clockid < 0 ||
+  !clockid_to_get_time[qopt->clockid]) {
+

[RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-06 Thread Jesus Sanchez-Palencia

From: Vinicius Costa Gomes 

TBS (Time Based Scheduler) uses the information added earlier in this
series (the socket option SO_TXTIME and the new role of
sk_buff->tstamp) to schedule traffic transmission based on absolute
time.

For some workloads, just bandwidth enforcement is not enough, and
precise control of the transmission of packets is necessary.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs delta 10 \
   clockid CLOCK_REALTIME sorting

In this example, the Qdisc will provide SW best-effort for the control
of the transmission time to the network adapter, the time stamp in socket
are in reference to the clockid CLOCK_REALTIME and packets leave the
Qdisc "delta" (10) nanoseconds before its transmission time. It will
also enable sorting of the buffered packets based on their txtime.

The qdisc will drop packets on enqueue() if their skbuff clockid does not
match the clock reference of the Qdisc. Moreover, the tc_drop_if_late
flag from skbuffs will be used on dequeue() to determine if a packet
that has expired while being enqueued should be dropped or not.

Signed-off-by: Jesus Sanchez-Palencia 
Signed-off-by: Vinicius Costa Gomes 
---
 include/linux/netdevice.h  |   1 +
 include/uapi/linux/pkt_sched.h |  17 ++
 net/sched/Kconfig  |  11 +
 net/sched/Makefile |   1 +
 net/sched/sch_tbs.c| 474 +
 5 files changed, 504 insertions(+)
 create mode 100644 net/sched/sch_tbs.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7104de2bc957..09b5b2e08f04 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -781,6 +781,7 @@ enum tc_setup_type {
TC_SETUP_QDISC_CBS,
TC_SETUP_QDISC_RED,
TC_SETUP_QDISC_PRIO,
+   TC_SETUP_QDISC_TBS,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096ae97b..a33b5b9da81a 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,21 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* TBS */
+struct tc_tbs_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_TBS_SORTING_ON BIT(0)
+};
+
+enum {
+   TCA_TBS_UNSPEC,
+   TCA_TBS_PARMS,
+   __TCA_TBS_MAX,
+};
+
+#define TCA_TBS_MAX (__TCA_TBS_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169fb5325..9e68fef78d50 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -183,6 +183,17 @@ config NET_SCH_CBS
  To compile this code as a module, choose M here: the
  module will be called sch_cbs.
 
+config NET_SCH_TBS
+   tristate "Time Based Scheduler (TBS)"
+   ---help---
+ Say Y here if you want to use the Time Based Scheduler (TBS) packet
+ scheduling algorithm.
+
+ See the top of  for more details.
+
+ To compile this code as a module, choose M here: the
+ module will be called sch_tbs.
+
 config NET_SCH_GRED
tristate "Generic Random Early Detection (GRED)"
---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d3804878..f02378a0a8f2 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_NET_SCH_FQ)  += sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)  += sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)  += sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)  += sch_cbs.o
+obj-$(CONFIG_NET_SCH_TBS)  += sch_tbs.o
 
 obj-$(CONFIG_NET_CLS_U32)  += cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)   += cls_route.o
diff --git a/net/sched/sch_tbs.c b/net/sched/sch_tbs.c
new file mode 100644
index ..c19eedda9bc5
--- /dev/null
+++ b/net/sched/sch_tbs.c
@@ -0,0 +1,474 @@
+/*
+ * net/sched/sch_tbs.c Time Based Shaper
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Jesus Sanchez-Palencia 
+ * Vinicius Costa Gomes 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define SORTING_IS_ON(x) (x->flags & TC_TBS_SORTING_ON)
+
+struct tbs_sched_data {
+   bool sorting;
+   int clockid;
+   int queue;
+   s32 delta; /* in ns */
+   ktime_t last; /* The txtime of the last skb sent to the netdevice. */
+   struct rb_root head;
+   struct qdisc_watchdog watchdog;
+   struct Qdisc *qdisc;
+   int (*enqueue)(struct sk_buff *skb, struct Qdisc *sch,
+  struct sk_buff

[RFC v3 net-next 09/18] net: ipv4: raw: Handle remaining txtime parameters

2018-03-06 Thread Jesus Sanchez-Palencia

Initialize clockid to CLOCKID_INVALID instead of 0 (i.e.
CLOCK_REALTIME), and copy both drop_if_late and clockid from CMSG cookie
into skb.

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/ipv4/raw.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8e05970ba7c4..61b6a72b 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -79,6 +79,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct raw_frag_vec {
struct msghdr *msg;
@@ -382,6 +383,8 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
skb->tstamp = sockc->transmit_time;
+   skb->txtime_clockid = sockc->clockid;
+   skb->tc_drop_if_late = sockc->drop_if_late;
skb_dst_set(skb, &rt->dst);
*rtp = NULL;
 
@@ -564,6 +567,8 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
ipc.sockc.tsflags = sk->sk_tsflags;
ipc.sockc.transmit_time = 0;
+   ipc.sockc.drop_if_late = 0;
+   ipc.sockc.clockid = CLOCKID_INVALID;
ipc.addr = inet->inet_saddr;
ipc.opt = NULL;
ipc.tx_flags = 0;
-- 
2.16.2

[RFC v3 net-next 10/18] net: ipv4: udp: Handle remaining txtime parameters

2018-03-06 Thread Jesus Sanchez-Palencia

Initialize clockid to CLOCKID_INVALID instead of 0 (i.e.
CLOCK_REALTIME), and copy both drop_if_late and clockid from CMSG cookie
into skb.

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/ipv4/udp.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d683bbde526b..4bea8d5ab968 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -115,6 +115,7 @@
 #include "udp_impl.h"
 #include 
 #include 
+#include 
 
 struct udp_table udp_table __read_mostly;
 EXPORT_SYMBOL(udp_table);
@@ -927,6 +928,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t 
len)
 
ipc.sockc.tsflags = sk->sk_tsflags;
ipc.sockc.transmit_time = 0;
+   ipc.sockc.drop_if_late = 0;
+   ipc.sockc.clockid = CLOCKID_INVALID;
ipc.addr = inet->inet_saddr;
ipc.oif = sk->sk_bound_dev_if;
 
@@ -1043,6 +1046,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb)) {
skb->tstamp = ipc.sockc.transmit_time;
+   skb->txtime_clockid = ipc.sockc.clockid;
+   skb->tc_drop_if_late = ipc.sockc.drop_if_late;
err = udp_send_skb(skb, fl4);
}
goto out;
-- 
2.16.2

[RFC v3 net-next 01/18] sock: Fix SO_ZEROCOPY switch case

2018-03-06 Thread Jesus Sanchez-Palencia

Fix the SO_ZEROCOPY switch case on sock_setsockopt() avoiding the
ret values to be overwritten by the one set on the default case.

Fixes: 28190752c7092 ("sock: permit SO_ZEROCOPY on PF_RDS socket")
Signed-off-by: Jesus Sanchez-Palencia 
---
 net/core/sock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 507d8c6c4319..27f218bba43f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1062,8 +1062,9 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
ret = -EINVAL;
else
sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
-   break;
}
+   break;
+
default:
ret = -ENOPROTOOPT;
break;
-- 
2.16.2

[RFC v3 net-next 15/18] igb: Refactor igb_configure_cbs()

2018-03-06 Thread Jesus Sanchez-Palencia

Make this function retrieve what it needs from the Tx ring being
addressed since it already relies on what had been saved on it before.
Also, since this function will be used by the upcoming Launchtime
patches rename it to better reflect its intention. Note that
Launchtime is not part of what 802.1Qav specifies, but the i210
datasheet refers to this set of functionality as "Qav Transmission
Mode".

Here we also perform a tiny refactor at is_any_cbs_enabled(), and add
further documentation to igb_setup_tx_mode().

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 54 ++-
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae785369..49cfbe4fd2b1 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1673,23 +1673,17 @@ static void set_queue_mode(struct e1000_hw *hw, int 
queue, enum queue_mode mode)
 }
 
 /**
- *  igb_configure_cbs - Configure Credit-Based Shaper (CBS)
+ *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
- *  @enable: true = enable CBS, false = disable CBS
- *  @idleslope: idleSlope in kbps
- *  @sendslope: sendSlope in kbps
- *  @hicredit: hiCredit in bytes
- *  @locredit: loCredit in bytes
  *
- *  Configure CBS for a given hardware queue. When disabling, idleslope,
- *  sendslope, hicredit, locredit arguments are ignored. Returns 0 if
- *  success. Negative otherwise.
+ *  Configure CBS for a given hardware queue. Parameters are retrieved
+ *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  for setting those correctly prior to this function being called.
  **/
-static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
- bool enable, int idleslope, int sendslope,
- int hicredit, int locredit)
+static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 {
+   struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
u32 tqavcc;
@@ -1698,7 +1692,7 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
 
-   if (enable) {
+   if (ring->cbs_enable) {
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
 
@@ -1759,14 +1753,15 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
 *   calculated value, so the resulting bandwidth might
 *   be slightly higher for some configurations.
 */
-   value = DIV_ROUND_UP_ULL(idleslope * 61034ULL, 100);
+   value = DIV_ROUND_UP_ULL(ring->idleslope * 61034ULL, 100);
 
tqavcc = rd32(E1000_I210_TQAVCC(queue));
tqavcc &= ~E1000_TQAVCC_IDLESLOPE_MASK;
tqavcc |= value;
wr32(E1000_I210_TQAVCC(queue), tqavcc);
 
-   wr32(E1000_I210_TQAVHC(queue), 0x8000 + hicredit * 0x7735);
+   wr32(E1000_I210_TQAVHC(queue),
+0x8000 + ring->hicredit * 0x7735);
} else {
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
@@ -1786,8 +1781,9 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
 */
 
netdev_dbg(netdev, "CBS %s: queue %d idleslope %d sendslope %d hiCredit 
%d locredit %d\n",
-  (enable) ? "enabled" : "disabled", queue,
-  idleslope, sendslope, hicredit, locredit);
+  (ring->cbs_enable) ? "enabled" : "disabled", queue,
+  ring->idleslope, ring->sendslope, ring->hicredit,
+  ring->locredit);
 }
 
 static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
@@ -1812,19 +1808,25 @@ static int igb_save_cbs_params(struct igb_adapter 
*adapter, int queue,
 
 static bool is_any_cbs_enabled(struct igb_adapter *adapter)
 {
-   struct igb_ring *ring;
int i;
 
for (i = 0; i < adapter->num_tx_queues; i++) {
-   ring = adapter->tx_ring[i];
-
-   if (ring->cbs_enable)
+   if (adapter->tx_ring[i]->cbs_enable)
return true;
}
 
return false;
 }
 
+/**
+ *  igb_setup_tx_mode - Switch to/from Qav Tx mode when applicable
+ *  @adapter: pointer to adapter struct
+ *
+ *  Configure TQAVCTRL register switching the controller's Tx mode
+ *  if FQTSS mode is enabled or disabled. Additionally, will issue
+ *  a call to igb_config_tx_modes() per queue so any

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Alexei Starovoitov

combining multiple answers...

On 3/6/18 3:05 AM, Greg KH wrote:

Any chance you can add a field to your "umh module" type such that a
normal 'modinfo' program will be able to notice it is different easily?

ok. handling of modinfo turned out to be straightforward.
kmod tooling worked fine with simple addition of .modinfo section.

$ modinfo bpfilter
filename: 
/lib/modules/4.16.0-rc4-00799-g1716f0aa3039-dirty/net/bpfilter/bpfilter.ko

umh:Y
license:GPL

I will require umh=Y and license to be present.
umh has to be set to Y for this 'umh modules'
and taint of kernel will happen if license is not gpl.
Other modinfo like vermagic are not applicable here, since
umh modules interact with kernel via normal kernel/user abi.

Since umh can crash, can be oom-ed by the kernel, killed by admin,
the subsystem that uses them (like bpfilter) need to manage life
time of umh on its own, so module infra doesn't do any accounting
of them. They don't appear in "lsmod" and cannot be "rmmod".
Multiple request_module("umh") will load multiple umh.ko processes.

Similar to kernel modules the kernel will be tainted if "umh module"
has invalid signature.

Shouldn't we fail to load the "module" if the signature is not valid if
CONFIG_MODULE_SIG_FORCE=y is enabled, like we do for modules?  I run my
systems like that, and just "warning" isn't probably a good idea for
systems that want to enforce that everything is signed properly?

CONFIG_MODULE_SIG_FORCE=y is already handled by this patch.
It's checked first for either .ko or umh.ko (before any elf parsing)
and returns -ENOKEY to user space without any dmesg message.
I think it's best to keep it as-is.
The taint and warning is for 'undef SIG_FORCE' and when module
is signed, but incorrectly.

Other than that, one minor question:

@@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename 
*filename,
sched_exec();

bprm->file = file;
-   if (fd == AT_FDCWD || filename->name[0] == '/') {
+   if (!filename) {
+   bprm->filename = "/dev/null";

Why the use of "/dev/null" for the filename here, and elsewhere in the
code?  While I'm "sure" that everyone really does have /dev/null/
mounted in the root namespace, what is the use of it here?

filename is assumed to be non-null in several places further
down and instead of hacking it everywhere it's cleaner to assign
some string to it.
I'll change it to filename = "none"
Same in umh part.

Also, what "namespace" does this usermode helper run in?  I'm guessing
the "root" one, which is fine with me, but note that people have
complained in the past about other UMH running in that namespace and not
in the specific namespace of the "container" that they wanted it to run
in.

right. this is something we can tweak later if really necessary.
Right now most of the bpf is root-only, so bpfilter.ko would have to run
as cap_sys_admin for now. Later we plan to tighten it to be
cap_net_admin.

On 3/6/18 11:12 AM, Linus Torvalds wrote:
>
> particularly for the early implementation when this is a new thing, I
> really want a message like
>
> executed user process xyz-abc as a pseudo-module
>
> or something in dmesg.
>
> I do *not* want this to be a magical way to hide things.

right. no intent of hiding anything.
The first thing bpfilter.ko does is print 'Starting bpfilter'
into /dev/console.

Long term the health check of 'umh module' and interaction with
the kernel should be standardized and though they're normal processes
seen with 'ps' would be good to see them in lsmod as well.
For now it's indeed the best to do pr_warn() message like above.
Ratelimiting is probably not necessary.

On 3/6/18 12:01 PM, Andy Lutomirski wrote:
>
> I imagine that usermode tooling needs to change regardless
> because the existing tools may get rather confused if a .ko "module"

the goal is to do zero changes to user tooling.
The kmod tools handle this special .ko just fine.
Tested with modprobe, depmod, modinfo, insmod.
scripts/sign-file also works.

[PATCH iproute2-next 3/3] ipmroute: convert to output JSON

2018-03-06 Thread Stephen Hemminger

From: Stephen Hemminger 

Should be no change for non-json case except putting color
on address if desired.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmroute.c | 117 ++
 1 file changed, 77 insertions(+), 40 deletions(-)

diff --git a/ip/ipmroute.c b/ip/ipmroute.c
index 03ca0575e571..8258a4ed0a0b 100644
--- a/ip/ipmroute.c
+++ b/ip/ipmroute.c
@@ -29,6 +29,7 @@
 #include 
 #include "utils.h"
 #include "ip_common.h"
+#include "json_print.h"
 
 static void usage(void) __attribute__((noreturn));
 
@@ -53,13 +54,12 @@ struct rtfilter {
 
 int print_mroute(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 {
-   FILE *fp = (FILE *)arg;
struct rtmsg *r = NLMSG_DATA(n);
int len = n->nlmsg_len;
struct rtattr *tb[RTA_MAX+1];
-   char obuf[256];
-
+   const char *src, *dst;
SPRINT_BUF(b1);
+   SPRINT_BUF(b2);
__u32 table;
int iif = 0;
int family;
@@ -102,30 +102,44 @@ int print_mroute(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
 
family = get_real_family(r->rtm_type, r->rtm_family);
 
+   open_json_object(NULL);
if (n->nlmsg_type == RTM_DELROUTE)
-   fprintf(fp, "Deleted ");
+   print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
if (tb[RTA_SRC])
-   len = snprintf(obuf, sizeof(obuf),
-  "(%s, ", rt_addr_n2a_rta(family, tb[RTA_SRC]));
+   src = rt_addr_n2a_r(family, RTA_PAYLOAD(tb[RTA_SRC]),
+   RTA_DATA(tb[RTA_SRC]), b1, sizeof(b1));
else
-   len = sprintf(obuf, "(unknown, ");
+   src = "unknown";
+
if (tb[RTA_DST])
-   snprintf(obuf + len, sizeof(obuf) - len,
-"%s)", rt_addr_n2a_rta(family, tb[RTA_DST]));
+   dst = rt_addr_n2a_r(family, RTA_PAYLOAD(tb[RTA_DST]),
+   RTA_DATA(tb[RTA_DST]), b2, sizeof(b2));
else
-   snprintf(obuf + len, sizeof(obuf) - len, "unknown) ");
+   dst = "unknown";
+
+   if (is_json_context()) {
+   print_string(PRINT_JSON, "src", NULL, src);
+   print_string(PRINT_JSON, "dst", NULL, dst);
+   } else {
+   char obuf[256];
+
+   snprintf(obuf, sizeof(obuf), "(%s,%s)", src, dst);
+   print_string(PRINT_FP, NULL,
+"%-32s Iif: ", obuf);
+   }
 
-   fprintf(fp, "%-32s Iif: ", obuf);
if (iif)
-   fprintf(fp, "%-10s ", ll_index_to_name(iif));
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "iif", "%-10s ", ll_index_to_name(iif));
else
-   fprintf(fp, "unresolved ");
+   print_string(PRINT_ANY,"iif", "%s ", "unresolved");
 
if (tb[RTA_MULTIPATH]) {
struct rtnexthop *nh = RTA_DATA(tb[RTA_MULTIPATH]);
int first = 1;
 
+   open_json_array(PRINT_JSON, "multipath");
len = RTA_PAYLOAD(tb[RTA_MULTIPATH]);
 
for (;;) {
@@ -134,47 +148,67 @@ int print_mroute(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (nh->rtnh_len > len)
break;
 
+   open_json_object(NULL);
if (first) {
-   fprintf(fp, "Oifs: ");
+   print_string(PRINT_FP, NULL, "Oifs: ", NULL);
first = 0;
}
-   fprintf(fp, "%s", ll_index_to_name(nh->rtnh_ifindex));
+
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "oif", "%s", 
ll_index_to_name(nh->rtnh_ifindex));
+
if (nh->rtnh_hops > 1)
-   fprintf(fp, "(ttl %d) ", nh->rtnh_hops);
+   print_uint(PRINT_ANY,
+  "ttl", "(ttl %u) ", nh->rtnh_hops);
else
-   fprintf(fp, " ");
+   print_string(PRINT_FP, NULL, " ", NULL);
+
+   close_json_object();
len -= NLMSG_ALIGN(nh->rtnh_len);
nh = RTNH_NEXT(nh);
}
+   close_json_array(PRINT_JSON, NULL);
}
-   fprintf(fp, " State: %s",
-   r->rtm_flags & RTNH_F_UNRESOLVED ? "unresolved" : "resolved");
+
+   print_string(PRINT_ANY, "state", " State: %s",
+(r->rtm_flags & RTNH_F_UNRESOLVED) ? "unresolved" : 
"resolved");
+
if (r->rtm_flags & RTNH_F_OFFLOAD)
-   fprintf(fp, " offload");
-   if (show_stats && tb[RTA_MFC_STATS]) {
-   struct rta_mfc_stats *mfcs = RTA_DATA

[PATCH iproute2-next 2/3] ipmroute: don't complain about unicast routes

2018-03-06 Thread Stephen Hemminger

From: Stephen Hemminger 

Every non-multicast route prints an error message.
Kernel doesn't filter out unicast routes, it is up to filter function
to do this.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmroute.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/ip/ipmroute.c b/ip/ipmroute.c
index aa5029b44f41..03ca0575e571 100644
--- a/ip/ipmroute.c
+++ b/ip/ipmroute.c
@@ -75,15 +75,14 @@ int print_mroute(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
return -1;
}
-   if (r->rtm_type != RTN_MULTICAST) {
-   fprintf(stderr, "Not a multicast route (type: %s)\n",
-   rtnl_rtntype_n2a(r->rtm_type, b1, sizeof(b1)));
+
+   if (r->rtm_type != RTN_MULTICAST)
return 0;
-   }
 
parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
table = rtm_get_table(r, tb);
 
+
if (filter.tb > 0 && filter.tb != table)
return 0;
 
-- 
2.16.1

[PATCH iproute2-next 0/3] ip multicast command JSON support

2018-03-06 Thread Stephen Hemminger

From: Stephen Hemminger 

Update maddr and mroute to support JSON.
Fix bug in ipmroute that causes it print error on every unicast route.

Stephen Hemminger (3):
  ipmaddr: json and color support
  ipmroute: don't complain about unicast routes
  ipmroute: convert to output JSON

 ip/ipmaddr.c  |  69 
 ip/ipmroute.c | 124 +-
 2 files changed, 123 insertions(+), 70 deletions(-)

-- 
2.16.1

[PATCH iproute2-next 1/3] ipmaddr: json and color support

2018-03-06 Thread Stephen Hemminger

From: Stephen Hemminger 

Support printing mulitcast addresses in json and color mode.
Output format is unchanged for normal use.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmaddr.c | 69 +---
 1 file changed, 43 insertions(+), 26 deletions(-)

diff --git a/ip/ipmaddr.c b/ip/ipmaddr.c
index d7bf1f99f67e..a48499029e17 100644
--- a/ip/ipmaddr.c
+++ b/ip/ipmaddr.c
@@ -28,6 +28,7 @@
 #include "rt_names.h"
 #include "utils.h"
 #include "ip_common.h"
+#include "json_print.h"
 
 static struct {
char *dev;
@@ -193,50 +194,66 @@ static void read_igmp6(struct ma_info **result_p)
 
 static void print_maddr(FILE *fp, struct ma_info *list)
 {
-   fprintf(fp, "\t");
+   print_string(PRINT_FP, NULL, "\t", NULL);
 
+   open_json_object(NULL);
if (list->addr.family == AF_PACKET) {
SPRINT_BUF(b1);
-   fprintf(fp, "link  %s", ll_addr_n2a((unsigned char 
*)list->addr.data,
-   list->addr.bytelen, 0,
-   b1, sizeof(b1)));
+
+   print_string(PRINT_FP, NULL, "link  ", NULL);
+   print_color_string(PRINT_ANY, COLOR_MAC, "link", "%s",
+  ll_addr_n2a((void *)list->addr.data, 
list->addr.bytelen,
+  0, b1, sizeof(b1)));
} else {
-   switch (list->addr.family) {
-   case AF_INET:
-   fprintf(fp, "inet  ");
-   break;
-   case AF_INET6:
-   fprintf(fp, "inet6 ");
-   break;
-   default:
-   fprintf(fp, "family %d ", list->addr.family);
-   break;
-   }
-   fprintf(fp, "%s",
-   format_host(list->addr.family,
-   -1, list->addr.data));
+   print_string(PRINT_ANY, "family", "%-5s ",
+family_name(list->addr.family));
+   print_color_string(PRINT_ANY, 
ifa_family_color(list->addr.family),
+  "address", "%s",
+  format_host(list->addr.family,
+  -1, list->addr.data));
}
+
if (list->users != 1)
-   fprintf(fp, " users %d", list->users);
+   print_uint(PRINT_ANY, "users", " users %u", list->users);
+
if (list->features)
-   fprintf(fp, " %s", list->features);
-   fprintf(fp, "\n");
+   print_string(PRINT_ANY, "features", " %s", list->features);
+
+   print_string(PRINT_FP, NULL, "\n", NULL);
+   close_json_object();
 }
 
 static void print_mlist(FILE *fp, struct ma_info *list)
 {
int cur_index = 0;
 
+   new_json_obj(json);
for (; list; list = list->next) {
-   if (oneline) {
-   cur_index = list->index;
-   fprintf(fp, "%d:\t%s%s", cur_index, list->name, _SL_);
-   } else if (cur_index != list->index) {
+
+   if (list->index != cur_index || oneline) {
+   if (cur_index) {
+   close_json_array(PRINT_JSON, NULL);
+   close_json_object();
+   }
+   open_json_object(NULL);
+
+   print_uint(PRINT_ANY, "ifindex", "%d:", list->index);
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "ifname", "\t%s", list->name);
+   print_string(PRINT_FP, NULL, "%s", _SL_);
cur_index = list->index;
-   fprintf(fp, "%d:\t%s\n", cur_index, list->name);
+
+   open_json_array(PRINT_JSON, "maddr");
}
+
print_maddr(fp, list);
}
+   if (cur_index) {
+   close_json_array(PRINT_JSON, NULL);
+   close_json_object();
+   }
+
+   delete_json_obj();
 }
 
 static int multiaddr_list(int argc, char **argv)
-- 
2.16.1

Re: [PATCH net-next 2/2] rds: use list structure to track information for zerocopy completion notification

2018-03-06 Thread Willem de Bruijn

On Tue, Mar 6, 2018 at 10:22 AM, Sowmini Varadhan
 wrote:
> Commit 401910db4cd4 ("rds: deliver zerocopy completion notification
> with data") removes support fo r zerocopy completion notification
> on the sk_error_queue, thus we no longer need to track the cookie
> information in sk_buff structures.
>
> This commit removes the struct sk_buff_head rs_zcookie_queue by
> a simpler list that results in a smaller memory footprint as well
> as more efficient memory_allocation time.
>
> Signed-off-by: Sowmini Varadhan 

Acked-by: Willem de Bruijn 

>  static void rds_rm_zerocopy_callback(struct rds_sock *rs,
>  struct rds_znotifier *znotif)
>  {
> -   struct sk_buff *skb, *tail;
> -   unsigned long flags;
> -   struct sk_buff_head *q;
> +   struct rds_msg_zcopy_info *info;
> +   struct rds_msg_zcopy_queue *q;
> u32 cookie = znotif->z_cookie;
> struct rds_zcopy_cookies *ck;
> +   struct list_head *head;
> +   unsigned long flags;
>
> +   mm_unaccount_pinned_pages(&znotif->z_mmp);
> q = &rs->rs_zcookie_queue;
> spin_lock_irqsave(&q->lock, flags);
> -   tail = skb_peek_tail(q);
> -
> -   if (tail && skb_zcookie_add(tail, cookie)) {
> -   spin_unlock_irqrestore(&q->lock, flags);
> -   mm_unaccount_pinned_pages(&znotif->z_mmp);
> -   consume_skb(rds_skb_from_znotifier(znotif));
> -   /* caller invokes rds_wake_sk_sleep() */
> -   return;
> +   head = &q->zcookie_head;
> +   if (!list_empty(head)) {
> +   info = list_entry(head, struct rds_msg_zcopy_info,
> + rs_zcookie_next);
> +   if (info && rds_zcookie_add(info, cookie)) {

small nit: the test for info will always succeed.

Re: [PATCH net-next 1/2] rds: refactor zcopy code into rds_message_zcopy_from_user

2018-03-06 Thread Willem de Bruijn

On Tue, Mar 6, 2018 at 10:22 AM, Sowmini Varadhan
 wrote:
> Move the large block of code predicated on zcopy from
> rds_message_copy_from_user into a new function,
> rds_message_zcopy_from_user()
>
> Signed-off-by: Sowmini Varadhan 

Acked-by: Willem de Bruijn 

> +int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
> +  bool zcopy)
> +{
> +   unsigned long to_copy, nbytes;
> +   unsigned long sg_off;
> +   struct scatterlist *sg;
> +   int ret = 0;
> +
> +   rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
> +
> +   /* now allocate and copy in the data payload.  */
> +   sg = rm->data.op_sg;
> +   sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */

The above lines appear both here and in rds_message_zcopy_from_user.
Not strictly necessary, but not buggy, so no need to revise just for that.

Re: "wrong" ifindex on received VLAN tagged packet?

2018-03-06 Thread David Ahern

On 3/6/18 3:02 PM, Lawrence Kreeger wrote:
> Hello,
> 
> I'm trying to run mstpd on a per VLAN basis using one traditional
> linux bridge per VLAN.  I'm running it on kernel version 4.12.4.  It
> works fine for untagged frames, but I'm having a problem with VLAN
> tagged BPDUs arriving on the socket with the ifindex of the bridge
> itself, and not the VLAN tagged interface.  For example, I have a
> tagged interface eth0.100 connected to the bridge "vlan100".  When
> packets arrive, they have the ifindex of vlan100, which mstpd doesn't
> recognize as a valid spanning tree interface, so it drops them.  Is
> there something needed to be set in the kernel to get the ifindex of
> eth0.100 instead?  This is how mstpd opens the raw socket:
> 
> 
> /* Berkeley Packet filter code to filter out spanning tree packets.
>from tcpdump -s 1152 -dd stp
>  */
> static struct sock_filter stp_filter[] = {
> { 0x28, 0, 0, 0x000c },
> { 0x25, 3, 0, 0x05dc },
> { 0x30, 0, 0, 0x000e },
> { 0x15, 0, 1, 0x0042 },
> { 0x6, 0, 0, 0x0480 },
> { 0x6, 0, 0, 0x },
> };
> 
> /*
>  * Open up a raw packet socket to catch all 802.2 packets.
>  * and install a packet filter to only see STP (SAP 42)
>  *
>  * Since any bridged devices are already in promiscious mode
>  * no need to add multicast address.
>  */
> int packet_sock_init(void)
> {
> int s;
> struct sock_fprog prog =
> {
> .len = sizeof(stp_filter) / sizeof(stp_filter[0]),
> .filter = stp_filter,
> };
> 
> s = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_802_2));

try ETH_P_ALL

> if(s < 0)
> {
> ERROR("socket failed: %m");
> return -1;
> }
> 
> if(setsockopt(s, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0)
> ERROR("setsockopt packet filter failed: %m");
> else if(fcntl(s, F_SETFL, O_NONBLOCK) < 0)
> ERROR("fcntl set nonblock failed: %m");
> else
> {
> packet_event.fd = s;
> packet_event.handler = packet_rcv;

And then packet_rcv using recvfrom:
struct sockaddr_ll sll;
char buf[4096];
socklen_t alen;
int len;

alen = sizeof(sll);
len = recvfrom(sd, buf, sizeof(buf), 0,
(struct sockaddr *)&sll, &alen);

And sll.sll_ifindex will show vlan device indices.


> 
> if(0 == add_epoll(&packet_event))
> return 0;
> }
> 
> close(s);
> return -1;
> }
> 
> Thanks, Larry
>

Re: [PATCH] pci-iov: Add support for unmanaged SR-IOV

2018-03-06 Thread Alexander Duyck

On Tue, Mar 6, 2018 at 12:19 PM, Don Dutile  wrote:
> On 03/05/2018 04:41 PM, Alexander Duyck wrote:
>>
>> On Mon, Mar 5, 2018 at 12:57 PM, Don Dutile  wrote:
>>>
>>> On 03/01/2018 03:22 PM, Alex Williamson wrote:


 On Wed, 28 Feb 2018 16:36:38 -0800
 Alexander Duyck  wrote:

> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson
>  wrote:
>>
>>
>> On Wed, 28 Feb 2018 09:49:21 -0800
>> Alexander Duyck  wrote:
>>
>>>
>>> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck
>>>  wrote:


 On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson
  wrote:
>
>
> On Tue, 27 Feb 2018 11:06:54 -0800
> Alexander Duyck  wrote:
>
>>
>> From: Alexander Duyck 
>>
>> This patch is meant to add support for SR-IOV on devices when the
>> VFs are
>> not managed by the kernel. Examples of recent patches attempting
>> to
>> do this
>> include:
>
>
>
> It appears to enable sriov when the _pf_ is not managed by the
> kernel, but by "managed" we mean that either there is no pf driver
> or
> the pf driver doesn't provide an sriov_configure callback,
> intentionally or otherwise.
>
>>
>> virto - https://patchwork.kernel.org/patch/10241225/
>> pci-stub - https://patchwork.kernel.org/patch/10109935/
>> vfio - https://patchwork.kernel.org/patch/10103353/
>> uio - https://patchwork.kernel.org/patch/9974031/
>
>
>
> So is the goal to get around the issues with enabling sriov on each
> of
> the above drivers by doing it under the covers or are you really
> just
> trying to enable sriov for a truly unmanage (no pf driver) case?
> For
> example, should a driver explicitly not wanting sriov enabled
> implement
> a dummy sriov_configure function?
>
>>
>> Since this is quickly blowing up into a multi-driver problem it is
>> probably
>> best to implement this solution in one spot.
>>
>> This patch is an attempt to do that. What we do with this patch is
>> provide
>> a generic call to enable SR-IOV in the case that the PF driver is
>> either
>> not present, or the PF driver doesn't support configuring SR-IOV.
>>
>> A new sysfs value called sriov_unmanaged_autoprobe has been added.
>> This
>> value is used as the drivers_autoprobe setting of the VFs when
>> they
>> are
>> being managed by an external entity such as userspace or device
>> firmware
>> instead of being managed by the kernel.
>
>
>
> Documentation/ABI/testing/sysfs-bus-pci update is missing.



 I can make sure to update that in the next version.

>>
>> One side effect of this change is that the sriov_drivers_autoprobe
>> and
>> sriov_unmanaged_autoprobe will only apply their updates when
>> SR-IOV
>> is
>> disabled. Attempts to update them when SR-IOV is in use will only
>> update
>> the local value and will not update sriov->autoprobe.
>
>
>
> And we expect users to understand when sriov_drivers_autoprobe
> applies
> vs sriov_unmanaged_autoprobe, even though they're using the same
> interfaces to enable sriov?  Are all combinations expected to work,
> ex.
> unmanaged sriov is enabled, a native pf driver loads, vfs work?
> Not
> only does it seems like there's opportunity to use this
> incorrectly,
> I
> think maybe it might be difficult to use correctly.
>
>>
>> I based my patch set originally on the patch by Mark Rustad but
>> there isn't
>> much left after going through and cleaning out the bits that were
>> no
>> longer
>> needed, and after incorporating the feedback from David Miller.
>>
>> I have included the authors of the original 4 patches above in the
>> Cc here.
>> My hope is to get feedback and/or review on if this works for
>> their
>> use
>> cases.
>>
>> Cc: Mark Rustad 
>> Cc: Maximilian Heyne 
>> Cc: Liang-Min Wang 
>> Cc: David Woodhouse 
>> Signed-off-by: Alexander Duyck 
>> ---
>>drivers/pci/iov.c|   27 +++-
>>drivers/pci/pci-driver.c |2 +
>>drivers/pci/pci-sysfs.c  |   62
>> +-
>>drivers/pci/pci.h|4 ++-
>>include/linux/pci.h  |1 +
>>>

Re: [PATCH] net: don't unnecessarily load kernel modules in dev_ioctl()

2018-03-06 Thread Stephen Hemminger

On Tue, 06 Mar 2018 17:27:44 -0500
Paul Moore  wrote:

> From: Paul Moore 
> 
> Starting with v4.16-rc1 we've been seeing a higher than usual number
> of requests for the kernel to load networking modules, even on events
> which shouldn't trigger a module load (e.g. ioctl(TCGETS)).  Stephen
> Smalley suggested the problem may lie in commit 44c02a2c3dc5
> ("dev_ioctl(): move copyin/copyout to callers") which moves changes
> the network dev_ioctl() function to always call dev_load(),
> regardless of the requested ioctl.
> 
> This patch moves the dev_load() calls back into the individual ioctls
> while preserving the rest of the original patch.
> 
> Reported-by: Dominick Grift 
> Suggested-by: Stephen Smalley 
> Signed-off-by: Paul Moore 
> ---
>  net/core/dev_ioctl.c |7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
> index 0ab1af04296c..a04e1e88bf3a 100644
> --- a/net/core/dev_ioctl.c
> +++ b/net/core/dev_ioctl.c
> @@ -402,8 +402,6 @@ int dev_ioctl(struct net *net, unsigned int cmd, struct 
> ifreq *ifr, bool *need_c
>   if (colon)
>   *colon = 0;
>  
> - dev_load(net, ifr->ifr_name);

Actually dev_load by ethernet name is really a legacy thing that should just 
die,

It was kept around so that some very tunnel configuration using special names.

# ifconfig sit0

which probably several web pages still tell users to do...
We have much better control now with ip commands so that this is just
baggage.

Re: [PATCH iproute2-next v2 00/12] ip more JSON

2018-03-06 Thread David Ahern

On 3/6/18 2:07 PM, Stephen Hemminger wrote:
> From: Stephen Hemminger 
> 
> The ip command implementation of JSON was very spotty. Only address
> and link were originally implemented. After doing route for next,
> went ahead and implemented it for a bunch of the other sub commands.
> 
> Hopefully will reach full coverage soon.
> 
> Stephen Hemminger (12):
>   ipneigh: add color and json support
>   ipaddrlabel: add json support
>   iprule: add json support
>   ipntable: add json support
>   ipnetconf: add JSON support
>   tcp_metrics; make tables const
>   tcp_metrics: add json support
>   ipsr: add json support
>   token: support JSON
>   tuntap: support JSON output
>   fou: break long lines
>   fou: support JSON output
> 

applied to iproute2-next.

glad to see the json support. Thanks for working on it.

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Chris Mason


On 6 Mar 2018, at 11:12, Linus Torvalds wrote:

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  
wrote:
As the first step in development of bpfilter project [1] the 
request_module()
code is extended to allow user mode helpers to be invoked. Idea is 
that
user mode helpers are built as part of the kernel build and installed 
as
traditional kernel modules with .ko file extension into distro 
specified
location, such that from a distribution point of view, they are no 
different
than regular kernel modules. Thus, allow request_module() logic to 
load such

user mode helper (umh) modules via:

[,,]

I like this, but I have one request: can we make sure that this action
is visible in the system messages?

When we load a regular module, at least it shows in lsmod afterwards,
although I have a few times wanted to really see module load as an
event in the logs too.

When we load a module that just executes a user program, and there is
no sign of it in the module list, I think we *really* need to make
that event show to the admin some way.

.. and yes, maybe we'll need to rate-limit the messages, and maybe it
turns out that I'm entirely wrong and people will hate the messages
after they get used to the concept of these pseudo-modules, but
particularly for the early implementation when this is a new thing, I
really want a message like

 executed user process xyz-abc as a pseudo-module

or something in dmesg.

I do *not* want this to be a magical way to hide things.


Especially early on, this makes a lot of sense.  But I wanted to plug 
bps and the hopefully growing set of bpf introspection tools:


https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt

Long term these are probably a good place to tell the admin what's going 
on.


-chris

Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-06 Thread Alexander Duyck

On Tue, Mar 6, 2018 at 2:59 PM, Jiri Pirko  wrote:
> Tue, Mar 06, 2018 at 08:08:21PM CET, alexander.du...@gmail.com wrote:
>>On Mon, Mar 5, 2018 at 7:15 PM, Stephen Hemminger
>> wrote:
>>> On Mon, 5 Mar 2018 14:47:20 -0800
>>> Alexander Duyck  wrote:
>>>
 On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko  wrote:
 > Mon, Mar 05, 2018 at 05:11:32PM CET, step...@networkplumber.org wrote:
 >>On Mon, 5 Mar 2018 10:21:18 +0100
 >>Jiri Pirko  wrote:
 >>
 >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.du...@gmail.com wrote:
 >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko  wrote:
 >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.du...@gmail.com 
 >>> >> wrote:
 >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko  
 >>> >>>wrote:
 >>>
 >>> [...]
 >>>
 >>> >
 >>> >>>Currently we only have agreement from Michael on taking this code, 
 >>> >>>as
 >>> >>>such we are working with virtio only for now. When the time comes 
 >>> >>>that
 >>> >>
 >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it 
 >>> >> will
 >>> >> stay there forever. So what you say is: "We will do it halfway now
 >>> >> and promise to fix it later". That later will never happen, I'm 
 >>> >> pretty
 >>> >> sure. That is why I push for in-driver bonding shared code as a 
 >>> >> part of
 >>> >> this patchset.
 >>> >
 >>> >You want this new approach and a copy of netvsc moved into either core
 >>> >or some module of its own. I say pick an architecture. We are looking
 >>> >at either 2 netdevs or 3. We are not going to support both because
 >>> >that will ultimately lead to a terrible user experience and make
 >>> >things quite confusing.
 >>> >
 >>> >> + if you would be pushing first driver to do this, I would 
 >>> >> understand.
 >>> >> But the first driver is already in. You are pushing second. This is 
 >>> >> the
 >>> >> time to do the sharing, unification of behaviour. Next time is too 
 >>> >> late.
 >>> >
 >>> >That is great, if we want to share then lets share. But what you are
 >>> >essentially telling us is that we need to fork this solution and
 >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
 >>> >point what is the point in merging them together?
 >>>
 >>> Of course, I vote for the same behaviour for netvsc and virtio_net. 
 >>> That
 >>> is my point from the very beginning.
 >>>
 >>> Stephen, what do you think? Could we please make virtio_net and netvsc
 >>> behave the same and to use a single code with well-defined checks and
 >>> restrictions for this feature?
 >>
 >>Eventually, yes both could share common code routines. In reality,
 >>the failover stuff is only a very small part of either driver so
 >>it is not worth stretching to try and cover too much. If you look,
 >>the failover code is just using routines that already exist for
 >>use by bonding, teaming, etc.
 >
 > Yeah, we consern was also about the code that processes the netdev
 > notifications and does auto-enslave and all related stuff.

 The concern was the driver model. If we expose 3 netdevs or 2 with the
 VF driver present. Somehow this is turning into a "merge netvsc into
 virtio" think and that isn't the subject that was being asked.

 Ideally we want one model for this. Either 3 netdevs or 2. The problem
 is 2 causes issues in terms of performance and will limit features of
 virtio, but 2 is the precedent set by netvsc. We need to figure out
 the path forward for this. There is talk about "sharing" but it is
 hard to make these two approaches share code when they are doing two
 very different setups and end up presenting themselves as two very
 different driver models.
>>>
>>> I appreciate this discussion, and it has helped a lot.
>>>
>>> Netvsc is stuck with 2 netdev model for the foreseeable future.
>>> We already failed once with the bonding model, and that created a lot of
>>> pain. The current model is working well and have convinced the major distros
>>> to support the two netdev model and don't want to back.
>>>
>>> Very open to optimizations and ways to smooth out the rough edges.
>>
>>Thank you for clarifying this Stephen.
>>
>>Okay. So with things defined such that we are doing a 2 netdev model
>>for netvsc, and a 3 netdev model for virtio, is it still in our
>>interest for us to try making a shared library between the two? In my
>>mind, the virtnet_bypass becomes the way we go forward for any future
>>solutions. I say we treat the netvsc approach as a "legacy" approach
>>and avoid creating any new libraries or drivers to support it, and
>>instead just focus on the 3 netdev approach as the way this is to be
>>done going forward. That way we avoid anyone else trying to implement
>>something like the 2 netdev solut

[PATCH v3.18] net: fec: introduce fec_ptp_stop and use in probe fail path

2018-03-06 Thread Guenter Roeck

From: Lucas Stach 

[ upstream commit 32cba57ba74be58589aeb4cb6496183e46a5e3e5 ]

This function frees resources and cancels delayed work item that
have been initialized in fec_ptp_init().

Use this to do proper error handling if something goes wrong in
probe function after fec_ptp_init has been called.

Signed-off-by: Lucas Stach 
Acked-by: Fugang Duan 
Signed-off-by: David S. Miller 
[groeck: backport: context changes in .../fec_main.c]
Signed-off-by: Guenter Roeck 
---
Not really sure if I should send this one to David or Greg, since v3.18 is no
longer officially supported, so I am sending it to both. Sorry for the noise.

This patch fixes a crash seen when running sabrelite images in a version
of qemu which fixes https://bugs.launchpad.net/qemu/+bug/1753309.

 drivers/net/ethernet/freescale/fec.h  |  1 +
 drivers/net/ethernet/freescale/fec_main.c |  5 ++---
 drivers/net/ethernet/freescale/fec_ptp.c  | 10 ++
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec.h 
b/drivers/net/ethernet/freescale/fec.h
index 9af296a1ca99..c16b6deff5ea 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -546,6 +546,7 @@ struct fec_enet_private {
 };
 
 void fec_ptp_init(struct platform_device *pdev);
+void fec_ptp_stop(struct platform_device *pdev);
 void fec_ptp_start_cyclecounter(struct net_device *ndev);
 int fec_ptp_set(struct net_device *ndev, struct ifreq *ifr);
 int fec_ptp_get(struct net_device *ndev, struct ifreq *ifr);
diff --git a/drivers/net/ethernet/freescale/fec_main.c 
b/drivers/net/ethernet/freescale/fec_main.c
index 065a7616e961..f1224c2d112a 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3312,6 +3312,7 @@ failed_register:
 failed_mii_init:
 failed_irq:
 failed_init:
+   fec_ptp_stop(pdev);
if (fep->reg_phy)
regulator_disable(fep->reg_phy);
 failed_regulator:
@@ -3331,14 +3332,12 @@ fec_drv_remove(struct platform_device *pdev)
struct net_device *ndev = platform_get_drvdata(pdev);
struct fec_enet_private *fep = netdev_priv(ndev);
 
-   cancel_delayed_work_sync(&fep->time_keep);
cancel_work_sync(&fep->tx_timeout_work);
+   fec_ptp_stop(pdev);
unregister_netdev(ndev);
fec_enet_mii_remove(fep);
if (fep->reg_phy)
regulator_disable(fep->reg_phy);
-   if (fep->ptp_clock)
-   ptp_clock_unregister(fep->ptp_clock);
fec_enet_clk_enable(ndev, false);
of_node_put(fep->phy_node);
free_netdev(ndev);
diff --git a/drivers/net/ethernet/freescale/fec_ptp.c 
b/drivers/net/ethernet/freescale/fec_ptp.c
index 992c8c3db553..9b2dcf4261a5 100644
--- a/drivers/net/ethernet/freescale/fec_ptp.c
+++ b/drivers/net/ethernet/freescale/fec_ptp.c
@@ -620,6 +620,16 @@ void fec_ptp_init(struct platform_device *pdev)
schedule_delayed_work(&fep->time_keep, HZ);
 }
 
+void fec_ptp_stop(struct platform_device *pdev)
+{
+   struct net_device *ndev = platform_get_drvdata(pdev);
+   struct fec_enet_private *fep = netdev_priv(ndev);
+
+   cancel_delayed_work_sync(&fep->time_keep);
+   if (fep->ptp_clock)
+   ptp_clock_unregister(fep->ptp_clock);
+}
+
 /**
  * fec_ptp_check_pps_event
  * @fep: the fec_enet_private structure handle
-- 
2.7.4

Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Mickaël Salaün

On 06/03/2018 23:46, Tycho Andersen wrote:
> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
 Suppose I'm writing a container manager.  I want to run "mount" in the
 container, but I don't want to allow moun() in general and I want to
 emulate certain mount() actions.  I can write a filter that catches
 mount using seccomp and calls out to the container manager for help.
 This isn't theoretical -- Tycho wants *exactly* this use case to be
 supported.
>>>
>>> Well, I think this use case should be handled with something like
>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>> https://github.com/stemjail/stemshim
>>
>> I doubt that will work for containers.  Containers that use user
>> namespaces and, for example, setuid programs aren't going to honor
>> LD_PRELOAD.
> 
> Or anything that calls syscalls directly, like go programs.

That's why the vDSO-like approach. Enforcing an access control is not
the issue here, patching a buggy userland (without patching its code) is
the issue isn't it?

As far as I remember, the main problem is to handle file descriptors
while "emulating" the kernel behavior. This can be done with a "shim"
code mapped in every processes. Chrome used something like this (in a
previous sandbox mechanism) as a kind of emulation (with the current
seccomp-bpf ). I think it should be doable to replace the (userland)
emulation code with an IPC wrapper receiving file descriptors through
UNIX socket.

signature.asc
Description: OpenPGP digital signature

Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-06 Thread Jiri Pirko

Tue, Mar 06, 2018 at 08:08:21PM CET, alexander.du...@gmail.com wrote:
>On Mon, Mar 5, 2018 at 7:15 PM, Stephen Hemminger
> wrote:
>> On Mon, 5 Mar 2018 14:47:20 -0800
>> Alexander Duyck  wrote:
>>
>>> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko  wrote:
>>> > Mon, Mar 05, 2018 at 05:11:32PM CET, step...@networkplumber.org wrote:
>>> >>On Mon, 5 Mar 2018 10:21:18 +0100
>>> >>Jiri Pirko  wrote:
>>> >>
>>> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.du...@gmail.com wrote:
>>> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko  wrote:
>>> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.du...@gmail.com wrote:
>>> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko  wrote:
>>> >>>
>>> >>> [...]
>>> >>>
>>> >>> >
>>> >>> >>>Currently we only have agreement from Michael on taking this code, as
>>> >>> >>>such we are working with virtio only for now. When the time comes 
>>> >>> >>>that
>>> >>> >>
>>> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it 
>>> >>> >> will
>>> >>> >> stay there forever. So what you say is: "We will do it halfway now
>>> >>> >> and promise to fix it later". That later will never happen, I'm 
>>> >>> >> pretty
>>> >>> >> sure. That is why I push for in-driver bonding shared code as a part 
>>> >>> >> of
>>> >>> >> this patchset.
>>> >>> >
>>> >>> >You want this new approach and a copy of netvsc moved into either core
>>> >>> >or some module of its own. I say pick an architecture. We are looking
>>> >>> >at either 2 netdevs or 3. We are not going to support both because
>>> >>> >that will ultimately lead to a terrible user experience and make
>>> >>> >things quite confusing.
>>> >>> >
>>> >>> >> + if you would be pushing first driver to do this, I would 
>>> >>> >> understand.
>>> >>> >> But the first driver is already in. You are pushing second. This is 
>>> >>> >> the
>>> >>> >> time to do the sharing, unification of behaviour. Next time is too 
>>> >>> >> late.
>>> >>> >
>>> >>> >That is great, if we want to share then lets share. But what you are
>>> >>> >essentially telling us is that we need to fork this solution and
>>> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>>> >>> >point what is the point in merging them together?
>>> >>>
>>> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>>> >>> is my point from the very beginning.
>>> >>>
>>> >>> Stephen, what do you think? Could we please make virtio_net and netvsc
>>> >>> behave the same and to use a single code with well-defined checks and
>>> >>> restrictions for this feature?
>>> >>
>>> >>Eventually, yes both could share common code routines. In reality,
>>> >>the failover stuff is only a very small part of either driver so
>>> >>it is not worth stretching to try and cover too much. If you look,
>>> >>the failover code is just using routines that already exist for
>>> >>use by bonding, teaming, etc.
>>> >
>>> > Yeah, we consern was also about the code that processes the netdev
>>> > notifications and does auto-enslave and all related stuff.
>>>
>>> The concern was the driver model. If we expose 3 netdevs or 2 with the
>>> VF driver present. Somehow this is turning into a "merge netvsc into
>>> virtio" think and that isn't the subject that was being asked.
>>>
>>> Ideally we want one model for this. Either 3 netdevs or 2. The problem
>>> is 2 causes issues in terms of performance and will limit features of
>>> virtio, but 2 is the precedent set by netvsc. We need to figure out
>>> the path forward for this. There is talk about "sharing" but it is
>>> hard to make these two approaches share code when they are doing two
>>> very different setups and end up presenting themselves as two very
>>> different driver models.
>>
>> I appreciate this discussion, and it has helped a lot.
>>
>> Netvsc is stuck with 2 netdev model for the foreseeable future.
>> We already failed once with the bonding model, and that created a lot of
>> pain. The current model is working well and have convinced the major distros
>> to support the two netdev model and don't want to back.
>>
>> Very open to optimizations and ways to smooth out the rough edges.
>
>Thank you for clarifying this Stephen.
>
>Okay. So with things defined such that we are doing a 2 netdev model
>for netvsc, and a 3 netdev model for virtio, is it still in our
>interest for us to try making a shared library between the two? In my
>mind, the virtnet_bypass becomes the way we go forward for any future
>solutions. I say we treat the netvsc approach as a "legacy" approach
>and avoid creating any new libraries or drivers to support it, and
>instead just focus on the 3 netdev approach as the way this is to be
>done going forward. That way we avoid anyone else trying to implement
>something like the 2 netdev solution in the future.
>
>So getting back to the code here. Should we split the virtnet_bypass
>code out into a separate module? My preference would be to let this
>incubate as a part of virtio_net until

[PATCH net] macvlan: filter out xfrm feature flags

2018-03-06 Thread Shannon Nelson

Adding a macvlan device on top of a lowerdev that supports
the xfrm offloads fails.
# ip link add link ens1f0 mv0 type macvlan
RTNETLINK answers: Operation not permitted

Tracing down the failure shows that the macvlan device inherits
the NETIF_F_HW_ESP and NETIF_F_HW_ESP_TX_CSUM feature flags from
the lowerdev, but doesn't actually support xfrm so doesn't have
the dev->xfrmdev_ops API filled in.  When the request is made
to add the new macvlan device, the various feature flags are
checked by the feature subsystems, and the xfrm_api_check()
fails the check since the dev->xfrmdev_ops are not set up.
The macvlan creation succeeds when we filter out those flags
in macvlan_fix_features().

This isn't broken for vlans because they use a separate features
connection (vlan_features) for inheriting features.  This is
fine, but I don't think trying to add something like this to
every driver for every new upperdev is a good idea - I think
the upperdev should try to protect itself.

Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Shannon Nelson 
---
 drivers/net/macvlan.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 8fc02d9..76b8fb5 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -844,6 +844,10 @@ static struct lock_class_key macvlan_netdev_addr_lock_key;
 NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_GRO | NETIF_F_RXCSUM | \
 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER)
 
+#define MACVLAN_NON_FEATURES \
+   (NETIF_F_HW_ESP | NETIF_F_HW_ESP_TX_CSUM | NETIF_F_GSO_ESP | \
+NETIF_F_NETNS_LOCAL)
+
 #define MACVLAN_STATE_MASK \
((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
 
@@ -1036,7 +1040,7 @@ static netdev_features_t macvlan_fix_features(struct 
net_device *dev,
lowerdev_features &= (features | ~NETIF_F_LRO);
features = netdev_increment_features(lowerdev_features, features, mask);
features |= ALWAYS_ON_FEATURES;
-   features &= ~NETIF_F_NETNS_LOCAL;
+   features &= ~MACVLAN_NON_FEATURES;
 
return features;
 }
-- 
2.7.4

Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Tycho Andersen

On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
> >> Suppose I'm writing a container manager.  I want to run "mount" in the
> >> container, but I don't want to allow moun() in general and I want to
> >> emulate certain mount() actions.  I can write a filter that catches
> >> mount using seccomp and calls out to the container manager for help.
> >> This isn't theoretical -- Tycho wants *exactly* this use case to be
> >> supported.
> >
> > Well, I think this use case should be handled with something like
> > LD_PRELOAD and a helper library. FYI, I did something like this:
> > https://github.com/stemjail/stemshim
> 
> I doubt that will work for containers.  Containers that use user
> namespaces and, for example, setuid programs aren't going to honor
> LD_PRELOAD.

Or anything that calls syscalls directly, like go programs.

Tycho

Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Andy Lutomirski

On Tue, Mar 6, 2018 at 10:25 PM, Mickaël Salaün  wrote:
>
>
> On 28/02/2018 00:09, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün  wrote:
>>>
>>> On 27/02/2018 05:36, Andy Lutomirski wrote:
 On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün  wrote:
> Hi,
>
>>
>
> ## Why use the seccomp(2) syscall?
>
> Landlock use the same semantic as seccomp to apply access rule
> restrictions. It add a new layer of security for the current process
> which is inherited by its children. It makes sense to use an unique
> access-restricting syscall (that should be allowed by seccomp filters)
> which can only drop privileges. Moreover, a Landlock rule could come
> from outside a process (e.g.  passed through a UNIX socket). It is then
> useful to differentiate the creation/load of Landlock eBPF programs via
> bpf(2), from rule enforcement via seccomp(2).

 This seems like a weak argument to me.  Sure, this is a bit different
 from seccomp(), and maybe shoving it into the seccomp() multiplexer is
 awkward, but surely the bpf() multiplexer is even less applicable.
>>>
>>> I think using the seccomp syscall is fine, and everyone agreed on it.
>>>
>>
>> Ah, sorry, I completely misread what you wrote.  My apologies.  You
>> can disregard most of my email.
>>
>>>

 Also, looking forward, I think you're going to want a bunch of the
 stuff that's under consideration as new seccomp features.  Tycho is
 working on a "user notifier" feature for seccomp where, in addition to
 accepting, rejecting, or kicking to ptrace, you can send a message to
 the creator of the filter and wait for a reply.  I think that Landlock
 will want exactly the same feature.
>>>
>>> I don't think why this may be useful at all her. Landlock does not
>>> filter at the syscall level but handles kernel object and actions as
>>> does an LSM. That is the whole purpose of Landlock.
>>
>> Suppose I'm writing a container manager.  I want to run "mount" in the
>> container, but I don't want to allow moun() in general and I want to
>> emulate certain mount() actions.  I can write a filter that catches
>> mount using seccomp and calls out to the container manager for help.
>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>> supported.
>
> Well, I think this use case should be handled with something like
> LD_PRELOAD and a helper library. FYI, I did something like this:
> https://github.com/stemjail/stemshim

I doubt that will work for containers.  Containers that use user
namespaces and, for example, setuid programs aren't going to honor
LD_PRELOAD.

>
> Otherwise, we should think about enabling a process to (dynamically)
> extend/patch the vDSO (similar to LD_PRELOAD but at the syscall level
> and works with static binaries) for a subset of processes (the same way
> seccomp filters are inherited). It may be more powerful and flexible
> than extending the kernel/seccomp to patch (buggy?) userland.

Egads!

Re: [PATCH] net: don't unnecessarily load kernel modules in dev_ioctl()

2018-03-06 Thread Paul Moore

On Tue, Mar 6, 2018 at 5:27 PM, Paul Moore  wrote:
> From: Paul Moore 
>
> Starting with v4.16-rc1 we've been seeing a higher than usual number
> of requests for the kernel to load networking modules, even on events
> which shouldn't trigger a module load (e.g. ioctl(TCGETS)).  Stephen
> Smalley suggested the problem may lie in commit 44c02a2c3dc5
> ("dev_ioctl(): move copyin/copyout to callers") which moves changes
> the network dev_ioctl() function to always call dev_load(),
> regardless of the requested ioctl.
>
> This patch moves the dev_load() calls back into the individual ioctls
> while preserving the rest of the original patch.
>
> Reported-by: Dominick Grift 
> Suggested-by: Stephen Smalley 
> Signed-off-by: Paul Moore 
> ---
>  net/core/dev_ioctl.c |7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)

In the interest of full disclosure, I've compiled this code but I
haven't booted it yet (test kernel building now).  I just wanted to
post this sooner rather than later in case the networking folks, or
Al, had a different solution they would prefer.

> diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
> index 0ab1af04296c..a04e1e88bf3a 100644
> --- a/net/core/dev_ioctl.c
> +++ b/net/core/dev_ioctl.c
> @@ -402,8 +402,6 @@ int dev_ioctl(struct net *net, unsigned int cmd, struct 
> ifreq *ifr, bool *need_c
> if (colon)
> *colon = 0;
>
> -   dev_load(net, ifr->ifr_name);
> -
> /*
>  *  See which interface the caller is talking about.
>  */
> @@ -423,6 +421,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, struct 
> ifreq *ifr, bool *need_c
> case SIOCGIFMAP:
> case SIOCGIFINDEX:
> case SIOCGIFTXQLEN:
> +   dev_load(net, ifr->ifr_name);
> rcu_read_lock();
> ret = dev_ifsioc_locked(net, ifr, cmd);
> rcu_read_unlock();
> @@ -431,6 +430,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, struct 
> ifreq *ifr, bool *need_c
> return ret;
>
> case SIOCETHTOOL:
> +   dev_load(net, ifr->ifr_name);
> rtnl_lock();
> ret = dev_ethtool(net, ifr);
> rtnl_unlock();
> @@ -447,6 +447,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, struct 
> ifreq *ifr, bool *need_c
> case SIOCGMIIPHY:
> case SIOCGMIIREG:
> case SIOCSIFNAME:
> +   dev_load(net, ifr->ifr_name);
> if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
> return -EPERM;
> rtnl_lock();
> @@ -494,6 +495,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, struct 
> ifreq *ifr, bool *need_c
> /* fall through */
> case SIOCBONDSLAVEINFOQUERY:
> case SIOCBONDINFOQUERY:
> +   dev_load(net, ifr->ifr_name);
> rtnl_lock();
> ret = dev_ifsioc(net, ifr, cmd);
> rtnl_unlock();
> @@ -518,6 +520,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, struct 
> ifreq *ifr, bool *need_c
> cmd == SIOCGHWTSTAMP ||
> (cmd >= SIOCDEVPRIVATE &&
>  cmd <= SIOCDEVPRIVATE + 15)) {
> +   dev_load(net, ifr->ifr_name);
> rtnl_lock();
> ret = dev_ifsioc(net, ifr, cmd);
> rtnl_unlock();
>
> --
> To unsubscribe from this list: send the line "unsubscribe 
> linux-security-module" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
paul moore
www.paul-moore.com

Re: [PATCH bpf-next v8 08/11] landlock: Add ptrace restrictions

2018-03-06 Thread Mickaël Salaün


On 28/02/2018 01:09, Andy Lutomirski wrote:
> On Wed, Feb 28, 2018 at 12:00 AM, Mickaël Salaün  wrote:
>>
>> On 28/02/2018 00:23, Andy Lutomirski wrote:
>>> On Tue, Feb 27, 2018 at 11:02 PM, Andy Lutomirski  wrote:
 On Tue, Feb 27, 2018 at 10:14 PM, Mickaël Salaün  wrote:
>

 I think you're wrong here.  Any sane container trying to use Landlock
 like this would also create a PID namespace.  Problem solved.  I still
 think you should drop this patch.
>>
>> Containers is one use case, another is build-in sandboxing (e.g. for web
>> browser…) and another one is for sandbox managers (e.g. Firejail,
>> Bubblewrap, Flatpack…). In some of these use cases, especially from a
>> developer point of view, you may want/need to debug your applications
>> (without requiring to be root). For nested Landlock access-controls
>> (e.g. container + user session + web browser), it may not be allowed to
>> create a PID namespace, but you still want to have a meaningful
>> access-control.
>>
> 
> The consideration should be exactly the same as for normal seccomp.
> If I'm in a container (using PID namespaces + seccomp) and a run a web
> browser, I can debug the browser.
> 
> If there's a real use case for adding this type of automatic ptrace
> protection, then by all means, let's add it as a general seccomp
> feature.
> 

Right, it makes sense to add this feature to seccomp filters as well.
What do you think Kees?



signature.asc
Description: OpenPGP digital signature

1 2 3 >

1 - 100 of 269 matches

Mail list logo