National Investment Corporation (NIC)

2018-03-21 Thread Michael Childer
Greetings from the United Arab Emirates,

Let me introduce to you the National Investment Corporation (NIC)
Funding Program.

Let me start by introducing myself: I am an investment consultant
working under the mandate of National Investment Corporation (NIC)
here in Abu Dhabi, UAE, reaching out to project owners and business
men and women for funding cooperation between their companies/firms
and National Investment Corporation (NIC).

To introduce the company I represent: NIC is a private investment
company and one of the leading strategic investors, based in Abu
Dhabi, United Arab Emirates.

Since its establishment the company has focused on contributing to the
sustainable development of the region while creating value through
investments in fundamentally growing sectors. Today National Investment
Corporation presents an optimal balance between return on investment
and growth by focusing on essential sectors including oil & gas,
banking & finance, project management, tourism, aviation, real estate,
business investment, marine projects, solar projects and
industrialization.

Furthermore, NIC has put forward unique investment opportunities and
facilitated the development of various projects that meet the local
and international market needs.

National Investment Corporation (NIC) is acting as a lender, and the
loan will be disbursed at a clear 3.5% annual interest rate to project
owners and equity partners for their investment projects. They focus
on seed capital, early-stage and start-up ventures, existing LLCs, and
the completion and expansion of investment projects, with immediate
funding.

NIC can invest in any country, subject to good conduct by both parties.

I hope to hear from you if we share a common goal of a better tomorrow
through investments.

Best Regards,

Michael Childer (Investment Consultant)
National Investment Corporation (NIC)
Marina Mall, Al Marina, Abu Dhabi
United Arab Emirates


Re: Fw: [Bug 199109] New: pptp: kernel printk "recursion detected", and then reboot itself

2018-03-21 Thread xu heng

  On Wed, 21 Mar 2018 16:35:28 +0800 Guillaume Nault wrote:
 > On Wed, Mar 21, 2018 at 09:03:57AM +0800, xu heng wrote: 
 > > Yes, I have tested it for 146390 seconds on my board, it's OK now. Thanks! 
 > >  
 > Feel free to add your Tested-by tag to the patch if you want to. 
 > Thanks for your report. 
 >  
 > Guillaume 
 >  
 > BTW, for your future exchanges on the list, please avoid top-posting. 
 > 

I'm sorry for that, will never do that again. Thanks.

xuheng



Re: [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping

2018-03-21 Thread Jason Wang



On 2018-03-20 22:27, Jesper Dangaard Brouer wrote:

On Tue, 20 Mar 2018 10:26:50 +0800
Jason Wang  wrote:


On 2018-03-19 17:48, Jesper Dangaard Brouer wrote:

On Fri, 16 Mar 2018 16:45:30 +0800
Jason Wang  wrote:
  

On 2018-03-10 00:07, Jesper Dangaard Brouer wrote:

On Fri, 9 Mar 2018 21:07:36 +0800
Jason Wang  wrote:
 

Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.

Signed-off-by: Jesper Dangaard Brouer

A stupid question is, can we manage to unify this ID with NAPI id?

Sorry I don't understand the question?

I mean can we associate the page pool pointer with napi_struct, record the NAPI id
in xdp_mem_info, and do the lookup through the NAPI id?

No. The driver can unreg/reg a new XDP memory model,

Is there an actual use case for this?

I believe this is the common use case.  When attaching an XDP/bpf prog,
the driver usually wants to change the RX-ring memory model
(a different performance trade-off).

Right, but a single driver should only have one XDP memory model.

No! -- a driver can have multiple XDP memory models, based on different
performance trade offs and hardware capabilities.

The mlx5 (100Gbit/s) driver/hardware is a good example, which needs
different memory models.  It already supports multiple RX memory models,
depending on HW support.


So let me correct my question: I'm not familiar with the mlx5e driver, but
if I understand correctly, the driver (mlx5) will not change the memory
model at runtime for each NAPI. So does the NAPI id still work in this case?



So, I predict that we hit a performance
limit around 42Mpps on PCIe (I can measure 36Mpps); this is due to the
PCI-express transactions/sec limit.  The mlx5 HW supports a compressed
descriptor format which delivers packets in several pages (based on
offset and len), thus lowering the needed PCIe transactions.  The
pitfall is that this comes with tail-room limitations, which can be okay
if e.g. the user's use case does not involve cpumap.

Plus, when a driver needs to support AF_XDP zero-copy, that also counts
as another XDP memory model...


Yes, or TAP zero-copy XDP. But it looks to me that we don't even need to
care about the recycling here since the pages belong to userspace.


Thanks


ITS ALL ABOUT FACEBOOK

2018-03-21 Thread Facebook Int'l


Hello,

Facebook is giving out 14,000,000 USD (Fourteen Million Dollars). Please 
respond with your Unique Code (FB/BF14-13M5250UD) using your 
registration email, to the Verification Department at 
dustinmoskovitz.facebo...@gmail.com

Dustin Moskovitz
Facebook Team
Copyright © 2018 Facebook Int'l


Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-21 Thread Richard Cochran
On Thu, Mar 22, 2018 at 01:43:49AM +0100, Andrew Lunn wrote:
> On Wed, Mar 21, 2018 at 03:47:02PM -0700, Richard Cochran wrote:
> > I'm happy to improve the modeling, but the solution should be generic
> > and work for every MAC driver.

Let me qualify that a bit...
 
> Something else to think about. There are a number of MAC drivers which
> don't use phylib. All the intel drivers for example. They have their
> own MDIO and PHY code. And recently there have been a number of MAC
> drivers for hardware capable of > 1GBps which do all the PHY control
> in firmware.
> 
> A phydev is optional, the MAC is mandatory.

So MACs that have a built-in PHY won't work, but we don't care because
there is no way to hang another MII device in there anyhow.

We already require phylib for NETWORK_PHY_TIMESTAMPING, and so we
expect that here, too.

Many of these IP core things will be targeting arm with device tree,
and I want that to "just work" without MAC changes.  

(This is exactly the same situation with DSA, BTW.)

If someone attaches an MII time stamper to a MAC whose driver does
its own thing without phylib, then they are going to have to hack
the MAC driver in any case.  Such hacks will never be acceptable for
mainline because they are design specific.  We really don't have to
worry about this case.

Thanks,
Richard



[PATCH net-next v4 1/5] net: qualcomm: rmnet: Fix casting issues

2018-03-21 Thread Subash Abhinov Kasiviswanathan
Fix warnings which were reported when running with sparse
(make C=1 CF=-D__CHECK_ENDIAN__)

drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c:81:15:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:271:37:
warning: incorrect type in assignment (different base types)
expected unsigned short [unsigned] [usertype] pkt_len
got restricted __be16 [usertype] 
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:287:29:
warning: incorrect type in assignment (different base types)
expected unsigned short [unsigned] [usertype] pkt_len
got restricted __be16 [usertype] 
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:310:22:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:319:13:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:49:18:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:50:18:
warning: cast to restricted __be32
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:74:21:
warning: cast to restricted __be16

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 6ce31e2..4f362df 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -23,8 +23,8 @@ struct rmnet_map_control_command {
struct {
u16 ip_family:2;
u16 reserved:14;
-   u16 flow_control_seq_num;
-   u32 qos_id;
+   __be16 flow_control_seq_num;
+   __be32 qos_id;
} flow_control;
u8 data[0];
};
@@ -44,7 +44,7 @@ struct rmnet_map_header {
u8  reserved_bit:1;
u8  cd_bit:1;
u8  mux_id;
-   u16 pkt_len;
+   __be16 pkt_len;
 }  __aligned(1);
 
 struct rmnet_map_dl_csum_trailer {
-- 
1.9.1



[PATCH net-next v4 4/5] net: qualcomm: rmnet: Export mux_id and flags to netlink

2018-03-21 Thread Subash Abhinov Kasiviswanathan
Define new netlink attributes for rmnet mux_id and flags. These
flags / mux_id were earlier using vlan flags / id respectively.
The flag bits are also moved to uapi and are renamed with
prefix RMNET_FLAG_*.

Also add the rmnet policy to handle the new netlink attributes.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 41 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   | 10 +++---
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c|  2 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   |  2 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h|  6 
 include/uapi/linux/if_link.h   | 21 +++
 6 files changed, 53 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index 096301a..c5b7b2a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -43,6 +43,11 @@
 
 /* Local Definitions and Declarations */
 
+static const struct nla_policy rmnet_policy[IFLA_RMNET_MAX + 1] = {
+   [IFLA_RMNET_MUX_ID] = { .type = NLA_U16 },
+   [IFLA_RMNET_FLAGS]  = { .len = sizeof(struct ifla_rmnet_flags) },
+};
+
 static int rmnet_is_real_dev_registered(const struct net_device *real_dev)
 {
return rcu_access_pointer(real_dev->rx_handler) == rmnet_rx_handler;
@@ -131,7 +136,7 @@ static int rmnet_newlink(struct net *src_net, struct 
net_device *dev,
 struct nlattr *tb[], struct nlattr *data[],
 struct netlink_ext_ack *extack)
 {
-   u32 data_format = RMNET_INGRESS_FORMAT_DEAGGREGATION;
+   u32 data_format = RMNET_FLAGS_INGRESS_DEAGGREGATION;
struct net_device *real_dev;
int mode = RMNET_EPMODE_VND;
struct rmnet_endpoint *ep;
@@ -143,14 +148,14 @@ static int rmnet_newlink(struct net *src_net, struct 
net_device *dev,
if (!real_dev || !dev)
return -ENODEV;
 
-   if (!data[IFLA_VLAN_ID])
+   if (!data[IFLA_RMNET_MUX_ID])
return -EINVAL;
 
ep = kzalloc(sizeof(*ep), GFP_ATOMIC);
if (!ep)
return -ENOMEM;
 
-   mux_id = nla_get_u16(data[IFLA_VLAN_ID]);
+   mux_id = nla_get_u16(data[IFLA_RMNET_MUX_ID]);
 
err = rmnet_register_real_device(real_dev);
if (err)
@@ -165,10 +170,10 @@ static int rmnet_newlink(struct net *src_net, struct 
net_device *dev,
 
hlist_add_head_rcu(&ep->hlnode, &port->muxed_ep[mux_id]);
 
-   if (data[IFLA_VLAN_FLAGS]) {
-   struct ifla_vlan_flags *flags;
+   if (data[IFLA_RMNET_FLAGS]) {
+   struct ifla_rmnet_flags *flags;
 
-   flags = nla_data(data[IFLA_VLAN_FLAGS]);
+   flags = nla_data(data[IFLA_RMNET_FLAGS]);
data_format = flags->flags & flags->mask;
}
 
@@ -276,10 +281,10 @@ static int rmnet_rtnl_validate(struct nlattr *tb[], 
struct nlattr *data[],
 {
u16 mux_id;
 
-   if (!data || !data[IFLA_VLAN_ID])
+   if (!data || !data[IFLA_RMNET_MUX_ID])
return -EINVAL;
 
-   mux_id = nla_get_u16(data[IFLA_VLAN_ID]);
+   mux_id = nla_get_u16(data[IFLA_RMNET_MUX_ID]);
if (mux_id > (RMNET_MAX_LOGICAL_EP - 1))
return -ERANGE;
 
@@ -304,8 +309,8 @@ static int rmnet_changelink(struct net_device *dev, struct 
nlattr *tb[],
 
port = rmnet_get_port_rtnl(real_dev);
 
-   if (data[IFLA_VLAN_ID]) {
-   mux_id = nla_get_u16(data[IFLA_VLAN_ID]);
+   if (data[IFLA_RMNET_MUX_ID]) {
+   mux_id = nla_get_u16(data[IFLA_RMNET_MUX_ID]);
ep = rmnet_get_endpoint(port, priv->mux_id);
 
hlist_del_init_rcu(&ep->hlnode);
@@ -315,10 +320,10 @@ static int rmnet_changelink(struct net_device *dev, 
struct nlattr *tb[],
priv->mux_id = mux_id;
}
 
-   if (data[IFLA_VLAN_FLAGS]) {
-   struct ifla_vlan_flags *flags;
+   if (data[IFLA_RMNET_FLAGS]) {
+   struct ifla_rmnet_flags *flags;
 
-   flags = nla_data(data[IFLA_VLAN_FLAGS]);
+   flags = nla_data(data[IFLA_RMNET_FLAGS]);
port->data_format = flags->flags & flags->mask;
}
 
@@ -327,13 +332,16 @@ static int rmnet_changelink(struct net_device *dev, 
struct nlattr *tb[],
 
 static size_t rmnet_get_size(const struct net_device *dev)
 {
-   return nla_total_size(2) /* IFLA_VLAN_ID */ +
-  nla_total_size(sizeof(struct ifla_vlan_flags)); /* 
IFLA_VLAN_FLAGS */
+   return
+   /* IFLA_RMNET_MUX_ID */
+   nla_total_size(2) +
+   /* IFLA_RMNET_FLAGS */
+   nla_total_size(sizeof(struct ifla_rmnet_flags));
 }
 
 struct rtnl_link_ops rmnet_link_ops __read_mostly = {
.kind   = "rmnet",
.maxtype= __IFLA_RMNET_MAX,

[PATCH net-next v4 0/5] net: qualcomm: rmnet: Updates 2018-03-12

2018-03-21 Thread Subash Abhinov Kasiviswanathan
This series contains some minor updates for rmnet driver.

Patch 1 contains fixes for sparse warnings.
Patch 2 updates the copyright date to 2018.
Patch 3 is a cleanup in receive path.
Patch 4 has the new rmnet netlink attributes in uapi and updates the usage.
Patch 5 has the implementation of the fill_info operation.

v1->v2: Remove the force casts since the data type is changed to __be
types as mentioned by David.
v2->v3: Update copyright in files which actually had changes as
mentioned by Joe.
v3->v4: Add new netlink attributes for mux_id and flags instead of using
the vlan attributes, as mentioned by David. The rmnet specific flags are also
moved to uapi. The netlink updates are done as part of #4, and #5 has the
fill_info operation.

Subash Abhinov Kasiviswanathan (5):
  net: qualcomm: rmnet: Fix casting issues
  net: qualcomm: rmnet: Update copyright year to 2018
  net: qualcomm: rmnet: Remove unnecessary device assignment
  net: qualcomm: rmnet: Export mux_id and flags to netlink
  net: qualcomm: rmnet: Implement fill_info

 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 73 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h |  2 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   | 12 ++--
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|  8 +--
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c|  4 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   |  5 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h|  8 +--
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c|  2 +-
 include/uapi/linux/if_link.h   | 21 +++
 9 files changed, 94 insertions(+), 41 deletions(-)

-- 
1.9.1



[PATCH net-next v4 2/5] net: qualcomm: rmnet: Update copyright year to 2018

2018-03-21 Thread Subash Abhinov Kasiviswanathan
Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c  | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h  | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c| 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c| 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index c494918..096301a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
index 00e4634..0b5b5da 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2014, 2016-2017 The Linux Foundation. All rights 
reserved.
+/* Copyright (c) 2013-2014, 2016-2018 The Linux Foundation. All rights 
reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 601edec..c758248 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 4f362df..884f1f5 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
index b0dbca0..afa2b86 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index c74a6c5..49e420e 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
index de0143e..98365ef 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2014, 2016-2017 The Linux Foundation. All rights 
reserved.
+/* Copyright (c) 2013-2014, 2016-2018 The Linux Foundation. All rights 
reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index 346d310..2ea16a0 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.

[PATCH net-next v4 3/5] net: qualcomm: rmnet: Remove unnecessary device assignment

2018-03-21 Thread Subash Abhinov Kasiviswanathan
Device of the de-aggregated skb is correctly assigned after inspecting
the mux_id, so remove the assignment in rmnet_map_deaggregate().

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index 49e420e..e8f6c79 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -323,7 +323,6 @@ struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb,
if (!skbn)
return NULL;
 
-   skbn->dev = skb->dev;
skb_reserve(skbn, RMNET_MAP_DEAGGR_HEADROOM);
skb_put(skbn, packet_len);
memcpy(skbn->data, skb->data, packet_len);
-- 
1.9.1



[PATCH net-next v4 5/5] net: qualcomm: rmnet: Implement fill_info

2018-03-21 Thread Subash Abhinov Kasiviswanathan
This is needed to query the mux_id and flags of a rmnet device.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index c5b7b2a..38d9356 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -339,6 +339,35 @@ static size_t rmnet_get_size(const struct net_device *dev)
nla_total_size(sizeof(struct ifla_rmnet_flags));
 }
 
+static int rmnet_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+   struct rmnet_priv *priv = netdev_priv(dev);
+   struct net_device *real_dev;
+   struct ifla_rmnet_flags f;
+   struct rmnet_port *port;
+
+   real_dev = priv->real_dev;
+
+   if (!rmnet_is_real_dev_registered(real_dev))
+   return -ENODEV;
+
+   if (nla_put_u16(skb, IFLA_RMNET_MUX_ID, priv->mux_id))
+   goto nla_put_failure;
+
+   port = rmnet_get_port_rtnl(real_dev);
+
+   f.flags = port->data_format;
+   f.mask  = ~0;
+
+   if (nla_put(skb, IFLA_RMNET_FLAGS, sizeof(f), &f))
+   goto nla_put_failure;
+
+   return 0;
+
+nla_put_failure:
+   return -EMSGSIZE;
+}
+
 struct rtnl_link_ops rmnet_link_ops __read_mostly = {
.kind   = "rmnet",
.maxtype= __IFLA_RMNET_MAX,
@@ -350,6 +379,7 @@ struct rtnl_link_ops rmnet_link_ops __read_mostly = {
.get_size   = rmnet_get_size,
.changelink = rmnet_changelink,
.policy = rmnet_policy,
+   .fill_info  = rmnet_fill_info,
 };
 
 /* Needs either rcu_read_lock() or rtnl lock */
-- 
1.9.1



Re: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write

2018-03-21 Thread Linus Torvalds
On Tue, Mar 20, 2018 at 7:42 AM, Alexander Duyck
 wrote:
>
> Instead of framing this as an enhanced version of the read/write ops
> why not look at replacing or extending something like the
> memcpy_fromio or memcpy_toio operations?

Yes, doing something like "memcpy_fromio_avx()" is much more
palatable, in that it works like the crypto functions do - if you do
big chunks, the "kernel_fpu_begin/end()" isn't nearly the issue it can
be otherwise.

Note that we definitely have seen hardware that *depends* on the
regular memcpy_fromio()" not doing big reads. I don't know how
hardware people screw it up, but it's clearly possible.

So it really needs to be an explicitly named function that basically a
driver can use to say "my hardware really likes big aligned accesses"
and explicitly ask for some AVX version if possible.

Linus


linux-next: manual merge of the net-next tree with the mac80211 tree

2018-03-21 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/mac80211/debugfs.c
  include/net/mac80211.h

between commit:

  7c181f4fcdc6 ("mac80211: add ieee80211_hw flag for QoS NDP support")

from the mac80211 tree and commit:

  94ba92713f83 ("mac80211: Call mgd_prep_tx before transmitting 
deauthentication")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc include/net/mac80211.h
index 2b581bd93812,2fd59ed3be00..
--- a/include/net/mac80211.h
+++ b/include/net/mac80211.h
@@@ -2063,9 -2070,14 +2070,17 @@@ struct ieee80211_txq 
   * @IEEE80211_HW_SUPPORTS_TDLS_BUFFER_STA: Hardware supports buffer STA on
   *TDLS links.
   *
 + * @IEEE80211_HW_DOESNT_SUPPORT_QOS_NDP: The driver (or firmware) doesn't
 + *support QoS NDP for AP probing - that's most likely a driver bug.
 + *
+  * @IEEE80211_HW_DEAUTH_NEED_MGD_TX_PREP: The driver requires the
+  *mgd_prepare_tx() callback to be called before transmission of a
+  *deauthentication frame in case the association was completed but no
+  *beacon was heard. This is required in multi-channel scenarios, where the
+  *virtual interface might not be given air time for the transmission of
+  *the frame, as it is not synced with the AP/P2P GO yet, and thus the
+  *deauthentication frame might not be transmitted.
+  *
   * @NUM_IEEE80211_HW_FLAGS: number of hardware flags, used for sizing arrays
   */
  enum ieee80211_hw_flags {
@@@ -2109,7 -2121,7 +2124,8 @@@
IEEE80211_HW_REPORTS_LOW_ACK,
IEEE80211_HW_SUPPORTS_TX_FRAG,
IEEE80211_HW_SUPPORTS_TDLS_BUFFER_STA,
 +  IEEE80211_HW_DOESNT_SUPPORT_QOS_NDP,
+   IEEE80211_HW_DEAUTH_NEED_MGD_TX_PREP,
  
/* keep last, obviously */
NUM_IEEE80211_HW_FLAGS
diff --cc net/mac80211/debugfs.c
index 94c7ee9df33b,a75653affbf7..
--- a/net/mac80211/debugfs.c
+++ b/net/mac80211/debugfs.c
@@@ -212,7 -212,7 +212,8 @@@ static const char *hw_flag_names[] = 
FLAG(REPORTS_LOW_ACK),
FLAG(SUPPORTS_TX_FRAG),
FLAG(SUPPORTS_TDLS_BUFFER_STA),
 +  FLAG(DOESNT_SUPPORT_QOS_NDP),
+   FLAG(DEAUTH_NEED_MGD_TX_PREP),
  #undef FLAG
  };
  




Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-21 Thread Andrew Lunn
On Wed, Mar 21, 2018 at 03:47:02PM -0700, Richard Cochran wrote:
> On Wed, Mar 21, 2018 at 11:16:52PM +0100, Andrew Lunn wrote:
> > The MAC drivers are clients of this device. They then use a phandle
> > and specifier:
> > 
> > eth0: ethernet-controller@72000 {
> > compatible = "marvell,kirkwood-eth";
> > #address-cells = <1>;
> > #size-cells = <0>;
> > reg = <0x72000 0x4000>;
> > 
> > timerstamper = < 2>
> > }
> > 
> > The 2 indicates this MAC is using port 2.
> > 
> > The MAC driver can then do the standard device tree things to follow
> > the phandle to get access to the device and use the API it exports.
> 
> But that would require hacking every last MAC driver.
> 
> I'm happy to improve the modeling, but the solution should be generic
> and work for every MAC driver.

Something else to think about. There are a number of MAC drivers which
don't use phylib. All the intel drivers for example. They have their
own MDIO and PHY code. And recently there have been a number of MAC
drivers for hardware capable of > 1GBps which do all the PHY control
in firmware.

A phydev is optional, the MAC is mandatory.

  Andrew



[next-queue PATCH v5 1/9] igb: Fix not adding filter elements to the list

2018-03-21 Thread Vinicius Costa Gomes
Because the order of the parameters passed to 'hlist_add_behind()' was
inverted, the 'parent' node was added "behind" the 'input' node. As
'input' is not in the list, this causes the 'input' node to be lost.

Fixes: 0e71def25281 ("igb: add support of RX network flow classification")
Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 606e6761758f..143f0bb34e4d 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2864,7 +2864,7 @@ static int igb_update_ethtool_nfc_entry(struct 
igb_adapter *adapter,
 
/* add filter to the list */
if (parent)
-   hlist_add_behind(&parent->nfc_node, &input->nfc_node);
+   hlist_add_behind(&input->nfc_node, &parent->nfc_node);
else
hlist_add_head(&input->nfc_node, &adapter->nfc_filter_list);
 
-- 
2.16.2



[next-queue PATCH v5 3/9] igb: Enable the hardware traffic class feature bit for igb models

2018-03-21 Thread Vinicius Costa Gomes
This will allow functionality that depends on the hardware being traffic
class aware to work. In particular, the tc-flower offloading checks
verify that this bit is set.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index d0e8e796c6fa..9ce29b8bb7da 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2806,6 +2806,9 @@ static int igb_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
if (hw->mac.type >= e1000_82576)
netdev->features |= NETIF_F_SCTP_CRC;
 
+   if (hw->mac.type >= e1000_i350)
+   netdev->features |= NETIF_F_HW_TC;
+
 #define IGB_GSO_PARTIAL_FEATURES (NETIF_F_GSO_GRE | \
  NETIF_F_GSO_GRE_CSUM | \
  NETIF_F_GSO_IPXIP4 | \
-- 
2.16.2



[next-queue PATCH v5 6/9] igb: Enable nfc filters to specify MAC addresses

2018-03-21 Thread Vinicius Costa Gomes
This allows igb_add_filter()/igb_erase_filter() to work on filters
that include MAC addresses (both source and destination).

For now, this only exposes the functionality, the next commit glues
ethtool into this. Later in this series, these APIs are used to allow
offloading of cls_flower filters.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb.h |  4 
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 28 
 2 files changed, 32 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index dfef1702ba21..66165879f12b 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -441,6 +441,8 @@ struct hwmon_buff {
 enum igb_filter_match_flags {
IGB_FILTER_FLAG_ETHER_TYPE = 0x1,
IGB_FILTER_FLAG_VLAN_TCI   = 0x2,
+   IGB_FILTER_FLAG_SRC_MAC_ADDR   = 0x4,
+   IGB_FILTER_FLAG_DST_MAC_ADDR   = 0x8,
 };
 
 #define IGB_MAX_RXNFC_FILTERS 16
@@ -455,6 +457,8 @@ struct igb_nfc_input {
u8 match_flags;
__be16 etype;
__be16 vlan_tci;
+   u8 src_addr[ETH_ALEN];
+   u8 dst_addr[ETH_ALEN];
 };
 
 struct igb_nfc_filter {
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 143f0bb34e4d..4c6a1b78c413 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2775,6 +2775,25 @@ int igb_add_filter(struct igb_adapter *adapter, struct 
igb_nfc_filter *input)
return err;
}
 
+   if (input->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR) {
+   err = igb_add_mac_steering_filter(adapter,
+ input->filter.dst_addr,
+ input->action, 0);
+   err = min_t(int, err, 0);
+   if (err)
+   return err;
+   }
+
+   if (input->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR) {
+   err = igb_add_mac_steering_filter(adapter,
+ input->filter.src_addr,
+ input->action,
+ IGB_MAC_STATE_SRC_ADDR);
+   err = min_t(int, err, 0);
+   if (err)
+   return err;
+   }
+
if (input->filter.match_flags & IGB_FILTER_FLAG_VLAN_TCI)
err = igb_rxnfc_write_vlan_prio_filter(adapter, input);
 
@@ -2823,6 +2842,15 @@ int igb_erase_filter(struct igb_adapter *adapter, struct 
igb_nfc_filter *input)
igb_clear_vlan_prio_filter(adapter,
   ntohs(input->filter.vlan_tci));
 
+   if (input->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR)
+   igb_del_mac_steering_filter(adapter, input->filter.src_addr,
+   input->action,
+   IGB_MAC_STATE_SRC_ADDR);
+
+   if (input->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR)
+   igb_del_mac_steering_filter(adapter, input->filter.dst_addr,
+   input->action, 0);
+
return 0;
 }
 
-- 
2.16.2



[next-queue PATCH v5 8/9] igb: Add the skeletons for tc-flower offloading

2018-03-21 Thread Vinicius Costa Gomes
This adds basic functions needed to implement offloading for filters
created by tc-flower.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 66 +++
 1 file changed, 66 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 52cd891aa579..150231e4db9d 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2497,6 +2498,69 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
return 0;
 }
 
+static int igb_configure_clsflower(struct igb_adapter *adapter,
+  struct tc_cls_flower_offload *cls_flower)
+{
+   return -EOPNOTSUPP;
+}
+
+static int igb_delete_clsflower(struct igb_adapter *adapter,
+   struct tc_cls_flower_offload *cls_flower)
+{
+   return -EOPNOTSUPP;
+}
+
+static int igb_setup_tc_cls_flower(struct igb_adapter *adapter,
+  struct tc_cls_flower_offload *cls_flower)
+{
+   switch (cls_flower->command) {
+   case TC_CLSFLOWER_REPLACE:
+   return igb_configure_clsflower(adapter, cls_flower);
+   case TC_CLSFLOWER_DESTROY:
+   return igb_delete_clsflower(adapter, cls_flower);
+   case TC_CLSFLOWER_STATS:
+   return -EOPNOTSUPP;
+   default:
+   return -EINVAL;
+   }
+}
+
+static int igb_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
+void *cb_priv)
+{
+   struct igb_adapter *adapter = cb_priv;
+
+   if (!tc_cls_can_offload_and_chain0(adapter->netdev, type_data))
+   return -EOPNOTSUPP;
+
+   switch (type) {
+   case TC_SETUP_CLSFLOWER:
+   return igb_setup_tc_cls_flower(adapter, type_data);
+
+   default:
+   return -EOPNOTSUPP;
+   }
+}
+
+static int igb_setup_tc_block(struct igb_adapter *adapter,
+ struct tc_block_offload *f)
+{
+   if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
+   return -EOPNOTSUPP;
+
+   switch (f->command) {
+   case TC_BLOCK_BIND:
+   return tcf_block_cb_register(f->block, igb_setup_tc_block_cb,
+adapter, adapter);
+   case TC_BLOCK_UNBIND:
+   tcf_block_cb_unregister(f->block, igb_setup_tc_block_cb,
+   adapter);
+   return 0;
+   default:
+   return -EOPNOTSUPP;
+   }
+}
+
 static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
void *type_data)
 {
@@ -2505,6 +2569,8 @@ static int igb_setup_tc(struct net_device *dev, enum 
tc_setup_type type,
switch (type) {
case TC_SETUP_QDISC_CBS:
return igb_offload_cbs(adapter, type_data);
+   case TC_SETUP_BLOCK:
+   return igb_setup_tc_block(adapter, type_data);
 
default:
return -EOPNOTSUPP;
-- 
2.16.2



[next-queue PATCH v5 0/9] igb: offloading of receive filters

2018-03-21 Thread Vinicius Costa Gomes
Hi,

Changes from v4:
 - Added a new bit to the MAC address filters' internal representation
 to mean that some MAC address filters are steering filters (i.e. they
 direct traffic to queues);
 - And this is only supported on i210;
 - Added a "Known Issue" section;

Changes from v3:
 - Addressed review comments from Aaron F. Brown and
   Jakub Kicinski;

Changes from v2:
 - Addressed review comments from Jakub Kicinski, mostly about coding
   style adjustments and more consistent error reporting;

Changes from v1:
 - Addressed review comments from Alexander Duyck and Florian
   Fainelli;
 - Adding and removing cls_flower filters are now proposed in the same
   patch;
 - cls_flower filters are kept in a separate list from "ethtool"
   filters (so that section of the original cover letter is no longer
   valid);
 - The patch adding support for ethtool filters is now independent from
   the rest of the series;

Known issue:
 - It seems that the QSEL bits in the RAH registers have no effect
 for source addresses (i.e. steering doesn't work for source address
 filters); everything points to a hardware (or documentation) issue;

Original cover letter:

This series enables some ethtool and tc-flower filters to be offloaded
to igb-based network controllers. This is useful when the system
configurator wants to steer certain kinds of traffic to a specific
hardware queue.

The first two commits are bug fixes.

The basis of this series is to export the internal API used to
configure address filters, so they can be used by ethtool, and
extending the functionality so that source addresses can be handled.

Then, we enable the tc-flower offloading implementation to re-use the
same infrastructure as ethtool, storing the filters in the per-adapter
"nfc" (Network Filter Config?) list. For consistency, destructive
access is kept separate, i.e. a filter added by tc-flower can only be
removed by tc-flower, but ethtool can read them all.

Only support for VLAN Prio, Source and Destination MAC Address, and
Ethertype is enabled for now.

Open question:
  - igb is initialized with the number of traffic classes set to 1; if
  we want to use multiple traffic classes we need to increase this
  value, and the only way I could find is to use mqprio (for example).
  Should igb be initialized with, say, the number of queues as its
  "num_tc"?

Vinicius Costa Gomes (9):
  igb: Fix not adding filter elements to the list
  igb: Fix queue selection on MAC filters on i210
  igb: Enable the hardware traffic class feature bit for igb models
  igb: Add support for MAC address filters specifying source addresses
  igb: Add support for enabling queue steering in filters
  igb: Enable nfc filters to specify MAC addresses
  igb: Add MAC address support for ethtool nftuple filters
  igb: Add the skeletons for tc-flower offloading
  igb: Add support for adding offloaded clsflower filters

 drivers/net/ethernet/intel/igb/e1000_defines.h |   2 +
 drivers/net/ethernet/intel/igb/igb.h   |  13 +
 drivers/net/ethernet/intel/igb/igb_ethtool.c   |  65 -
 drivers/net/ethernet/intel/igb/igb_main.c  | 332 -
 4 files changed, 398 insertions(+), 14 deletions(-)

--
2.16.2


[next-queue PATCH v5 4/9] igb: Add support for MAC address filters specifying source addresses

2018-03-21 Thread Vinicius Costa Gomes
Makes it possible to direct packets to queues based on their source
address. Documents the expected usage of the 'flags' parameter.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  1 +
 drivers/net/ethernet/intel/igb/igb.h   |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c  | 40 ++
 3 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 83cabff1e0ab..a3e5514b044e 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -490,6 +490,7 @@
  * manageability enabled, allowing us room for 15 multicast addresses.
  */
 #define E1000_RAH_AV  0x80000000  /* Receive descriptor valid */
+#define E1000_RAH_ASEL_SRC_ADDR 0x00010000
 #define E1000_RAL_MAC_ADDR_LEN 4
 #define E1000_RAH_MAC_ADDR_LEN 2
 #define E1000_RAH_POOL_MASK 0x03FC
diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 55d6f17d5799..4501b28ff7c5 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -473,6 +473,7 @@ struct igb_mac_addr {
 
 #define IGB_MAC_STATE_DEFAULT  0x1
 #define IGB_MAC_STATE_IN_USE   0x2
+#define IGB_MAC_STATE_SRC_ADDR  0x4
 
 /* board specific private data structure */
 struct igb_adapter {
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 9ce29b8bb7da..a5a681f7fbb2 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6843,8 +6843,14 @@ static void igb_set_default_mac_filter(struct 
igb_adapter *adapter)
igb_rar_set_index(adapter, 0);
 }
 
-static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
- const u8 queue)
+/* Add a MAC filter for 'addr' directing matching traffic to 'queue',
+ * 'flags' is used to indicate what kind of match is made, match is by
+ * default for the destination address, if matching by source address
+ * is desired the flag IGB_MAC_STATE_SRC_ADDR can be used.
+ */
+static int igb_add_mac_filter_flags(struct igb_adapter *adapter,
+   const u8 *addr, const u8 queue,
+   const u8 flags)
 {
	struct e1000_hw *hw = &adapter->hw;
int rar_entries = hw->mac.rar_entry_count -
@@ -6864,7 +6870,7 @@ static int igb_add_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
 
ether_addr_copy(adapter->mac_table[i].addr, addr);
adapter->mac_table[i].queue = queue;
-   adapter->mac_table[i].state |= IGB_MAC_STATE_IN_USE;
+   adapter->mac_table[i].state |= IGB_MAC_STATE_IN_USE | flags;
 
igb_rar_set_index(adapter, i);
return i;
@@ -6873,8 +6879,21 @@ static int igb_add_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
return -ENOSPC;
 }
 
-static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
+static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
  const u8 queue)
+{
+   return igb_add_mac_filter_flags(adapter, addr, queue, 0);
+}
+
+/* Remove a MAC filter for 'addr' directing matching traffic to
+ * 'queue', 'flags' is used to indicate what kind of match need to be
+ * removed, match is by default for the destination address, if
+ * matching by source address is to be removed the flag
+ * IGB_MAC_STATE_SRC_ADDR can be used.
+ */
+static int igb_del_mac_filter_flags(struct igb_adapter *adapter,
+   const u8 *addr, const u8 queue,
+   const u8 flags)
 {
	struct e1000_hw *hw = &adapter->hw;
int rar_entries = hw->mac.rar_entry_count -
@@ -6891,12 +6910,14 @@ static int igb_del_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
for (i = 0; i < rar_entries; i++) {
if (!(adapter->mac_table[i].state & IGB_MAC_STATE_IN_USE))
continue;
+   if ((adapter->mac_table[i].state & flags) != flags)
+   continue;
if (adapter->mac_table[i].queue != queue)
continue;
if (!ether_addr_equal(adapter->mac_table[i].addr, addr))
continue;
 
-   adapter->mac_table[i].state &= ~IGB_MAC_STATE_IN_USE;
+   adapter->mac_table[i].state = 0;
memset(adapter->mac_table[i].addr, 0, ETH_ALEN);
adapter->mac_table[i].queue = 0;
 
@@ -6907,6 +6928,12 @@ static int igb_del_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
return -ENOENT;
 }
 
+static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
+ const u8 queue)
+{
+   return 

[next-queue PATCH v5 7/9] igb: Add MAC address support for ethtool nftuple filters

2018-03-21 Thread Vinicius Costa Gomes
This adds the capability of configuring the queue steering of arriving
packets based on their source and destination MAC addresses.

In practical terms this adds support for the following use cases,
characterized by these examples:

$ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
(this will direct packets with destination address "aa:aa:aa:aa:aa:aa"
to the RX queue 0)

$ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 action 3
(this will direct packets with source address "44:44:44:44:44:44" to
the RX queue 3)

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 35 
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 4c6a1b78c413..27caa413ade2 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2494,6 +2494,23 @@ static int igb_get_ethtool_nfc_entry(struct igb_adapter 
*adapter,
fsp->h_ext.vlan_tci = rule->filter.vlan_tci;
fsp->m_ext.vlan_tci = htons(VLAN_PRIO_MASK);
}
+   if (rule->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR) {
+   ether_addr_copy(fsp->h_u.ether_spec.h_dest,
+   rule->filter.dst_addr);
+   /* As we only support matching by the full
+* mask, return the mask to userspace
+*/
+   eth_broadcast_addr(fsp->m_u.ether_spec.h_dest);
+   }
+   if (rule->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR) {
+   ether_addr_copy(fsp->h_u.ether_spec.h_source,
+   rule->filter.src_addr);
+   /* As we only support matching by the full
+* mask, return the mask to userspace
+*/
+   eth_broadcast_addr(fsp->m_u.ether_spec.h_source);
+   }
+
return 0;
}
return -EINVAL;
@@ -2932,10 +2949,6 @@ static int igb_add_ethtool_nfc_entry(struct igb_adapter 
*adapter,
if ((fsp->flow_type & ~FLOW_EXT) != ETHER_FLOW)
return -EINVAL;
 
-   if (fsp->m_u.ether_spec.h_proto != ETHER_TYPE_FULL_MASK &&
-   fsp->m_ext.vlan_tci != htons(VLAN_PRIO_MASK))
-   return -EINVAL;
-
input = kzalloc(sizeof(*input), GFP_KERNEL);
if (!input)
return -ENOMEM;
@@ -2945,6 +2958,20 @@ static int igb_add_ethtool_nfc_entry(struct igb_adapter 
*adapter,
input->filter.match_flags = IGB_FILTER_FLAG_ETHER_TYPE;
}
 
+   /* Only support matching addresses by the full mask */
+   if (is_broadcast_ether_addr(fsp->m_u.ether_spec.h_source)) {
+   input->filter.match_flags |= IGB_FILTER_FLAG_SRC_MAC_ADDR;
+   ether_addr_copy(input->filter.src_addr,
+   fsp->h_u.ether_spec.h_source);
+   }
+
+   /* Only support matching addresses by the full mask */
+   if (is_broadcast_ether_addr(fsp->m_u.ether_spec.h_dest)) {
+   input->filter.match_flags |= IGB_FILTER_FLAG_DST_MAC_ADDR;
+   ether_addr_copy(input->filter.dst_addr,
+   fsp->h_u.ether_spec.h_dest);
+   }
+
if ((fsp->flow_type & FLOW_EXT) && fsp->m_ext.vlan_tci) {
if (fsp->m_ext.vlan_tci != htons(VLAN_PRIO_MASK)) {
err = -EINVAL;
-- 
2.16.2



[next-queue PATCH v5 2/9] igb: Fix queue selection on MAC filters on i210

2018-03-21 Thread Vinicius Costa Gomes
On the RAH registers there are semantic differences in the meaning of
the "queue" parameter for traffic steering depending on the controller
model: there is the 82575 meaning, in which "queue" means an RX
hardware queue, and the i350 meaning, where it is a reception pool.

The previous behaviour had no effect for i210-based controllers
because the QSEL bit of the RAH register wasn't being set.

This patch separates the condition in discrete cases, so the different
handling is clearer.

Fixes: 83c21335c876 ("igb: improve MAC filter handling")
Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 715bb32e6901..d0e8e796c6fa 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8747,12 +8747,17 @@ static void igb_rar_set_index(struct igb_adapter 
*adapter, u32 index)
if (is_valid_ether_addr(addr))
rar_high |= E1000_RAH_AV;
 
-   if (hw->mac.type == e1000_82575)
+   switch (hw->mac.type) {
+   case e1000_82575:
+   case e1000_i210:
rar_high |= E1000_RAH_POOL_1 *
-   adapter->mac_table[index].queue;
-   else
+ adapter->mac_table[index].queue;
+   break;
+   default:
rar_high |= E1000_RAH_POOL_1 <<
-   adapter->mac_table[index].queue;
+   adapter->mac_table[index].queue;
+   break;
+   }
}
 
wr32(E1000_RAL(index), rar_low);
-- 
2.16.2



[next-queue PATCH v5 5/9] igb: Add support for enabling queue steering in filters

2018-03-21 Thread Vinicius Costa Gomes
On some igb models (82575 and i210) the MAC address filters can
control to which queue the packet will be assigned.

This extends the 'state' with one more state to signify that queue
selection should be enabled for that filter.

As 82575 parts are no longer easily obtained (and this was developed
against i210), only support for the i210 model is enabled.

These functions are exported and will be used in the next patch.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  1 +
 drivers/net/ethernet/intel/igb/igb.h   |  6 ++
 drivers/net/ethernet/intel/igb/igb_main.c  | 26 ++
 3 files changed, 33 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index a3e5514b044e..c6f552de30dd 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -491,6 +491,7 @@
  */
 #define E1000_RAH_AV  0x80000000  /* Receive descriptor valid */
 #define E1000_RAH_ASEL_SRC_ADDR 0x00010000
+#define E1000_RAH_QSEL_ENABLE 0x10000000
 #define E1000_RAL_MAC_ADDR_LEN 4
 #define E1000_RAH_MAC_ADDR_LEN 2
 #define E1000_RAH_POOL_MASK 0x03FC
diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 4501b28ff7c5..dfef1702ba21 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -474,6 +474,7 @@ struct igb_mac_addr {
 #define IGB_MAC_STATE_DEFAULT  0x1
 #define IGB_MAC_STATE_IN_USE   0x2
 #define IGB_MAC_STATE_SRC_ADDR  0x4
+#define IGB_MAC_STATE_QUEUE_STEERING 0x8
 
 /* board specific private data structure */
 struct igb_adapter {
@@ -739,4 +740,9 @@ int igb_add_filter(struct igb_adapter *adapter,
 int igb_erase_filter(struct igb_adapter *adapter,
 struct igb_nfc_filter *input);
 
+int igb_add_mac_steering_filter(struct igb_adapter *adapter,
+   const u8 *addr, u8 queue, u8 flags);
+int igb_del_mac_steering_filter(struct igb_adapter *adapter,
+   const u8 *addr, u8 queue, u8 flags);
+
 #endif /* _IGB_H_ */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index a5a681f7fbb2..52cd891aa579 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6934,6 +6934,28 @@ static int igb_del_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
return igb_del_mac_filter_flags(adapter, addr, queue, 0);
 }
 
+int igb_add_mac_steering_filter(struct igb_adapter *adapter,
+   const u8 *addr, u8 queue, u8 flags)
+{
+   struct e1000_hw *hw = &adapter->hw;
+
+   /* In theory, this should be supported on 82575 as well, but
+* that part wasn't easily accessible during development.
+*/
+   if (hw->mac.type != e1000_i210)
+   return -EOPNOTSUPP;
+
+   return igb_add_mac_filter_flags(adapter, addr, queue,
+   IGB_MAC_STATE_QUEUE_STEERING | flags);
+}
+
+int igb_del_mac_steering_filter(struct igb_adapter *adapter,
+   const u8 *addr, u8 queue, u8 flags)
+{
+   return igb_del_mac_filter_flags(adapter, addr, queue,
+   IGB_MAC_STATE_QUEUE_STEERING | flags);
+}
+
 static int igb_uc_sync(struct net_device *netdev, const unsigned char *addr)
 {
struct igb_adapter *adapter = netdev_priv(netdev);
@@ -8783,6 +8805,10 @@ static void igb_rar_set_index(struct igb_adapter 
*adapter, u32 index)
switch (hw->mac.type) {
case e1000_82575:
case e1000_i210:
+   if (adapter->mac_table[index].state &
+   IGB_MAC_STATE_QUEUE_STEERING)
+   rar_high |= E1000_RAH_QSEL_ENABLE;
+
rar_high |= E1000_RAH_POOL_1 *
  adapter->mac_table[index].queue;
break;
-- 
2.16.2



[next-queue PATCH v5 9/9] igb: Add support for adding offloaded clsflower filters

2018-03-21 Thread Vinicius Costa Gomes
This allows filters added by tc-flower that specify MAC addresses,
Ethernet types, and the VLAN priority field to be offloaded to the
controller.

This reuses most of the infrastructure used by ethtool, but clsflower
filters are kept in a separate list, so they are invisible to
ethtool.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb.h  |   2 +
 drivers/net/ethernet/intel/igb/igb_main.c | 188 +-
 2 files changed, 188 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 66165879f12b..adfef068e866 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -464,6 +464,7 @@ struct igb_nfc_input {
 struct igb_nfc_filter {
struct hlist_node nfc_node;
struct igb_nfc_input filter;
+   unsigned long cookie;
u16 etype_reg_index;
u16 sw_idx;
u16 action;
@@ -603,6 +604,7 @@ struct igb_adapter {
 
/* RX network flow classification support */
struct hlist_head nfc_filter_list;
+   struct hlist_head cls_flower_list;
unsigned int nfc_filter_count;
/* lock for RX network flow classification filter */
spinlock_t nfc_lock;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 150231e4db9d..cc580b17dab3 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2498,16 +2498,197 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
return 0;
 }
 
+#define ETHER_TYPE_FULL_MASK ((__force __be16)~0)
+#define VLAN_PRIO_FULL_MASK (0x07)
+
+static int igb_parse_cls_flower(struct igb_adapter *adapter,
+   struct tc_cls_flower_offload *f,
+   int traffic_class,
+   struct igb_nfc_filter *input)
+{
+   struct netlink_ext_ack *extack = f->common.extack;
+
+   if (f->dissector->used_keys &
+   ~(BIT(FLOW_DISSECTOR_KEY_BASIC) |
+ BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+ BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
+ BIT(FLOW_DISSECTOR_KEY_VLAN))) {
+   NL_SET_ERR_MSG_MOD(extack,
+  "Unsupported key used, only BASIC, CONTROL, 
ETH_ADDRS and VLAN are supported");
+   return -EOPNOTSUPP;
+   }
+
+   if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+   struct flow_dissector_key_eth_addrs *key, *mask;
+
+   key = skb_flow_dissector_target(f->dissector,
+   FLOW_DISSECTOR_KEY_ETH_ADDRS,
+   f->key);
+   mask = skb_flow_dissector_target(f->dissector,
+FLOW_DISSECTOR_KEY_ETH_ADDRS,
+f->mask);
+
+   if (!is_zero_ether_addr(mask->dst)) {
+   if (!is_broadcast_ether_addr(mask->dst)) {
+   NL_SET_ERR_MSG_MOD(extack, "Only full masks are 
supported for destination MAC address");
+   return -EINVAL;
+   }
+
+   input->filter.match_flags |=
+   IGB_FILTER_FLAG_DST_MAC_ADDR;
+   ether_addr_copy(input->filter.dst_addr, key->dst);
+   }
+
+   if (!is_zero_ether_addr(mask->src)) {
+   if (!is_broadcast_ether_addr(mask->src)) {
+   NL_SET_ERR_MSG_MOD(extack, "Only full masks are 
supported for source MAC address");
+   return -EINVAL;
+   }
+
+   input->filter.match_flags |=
+   IGB_FILTER_FLAG_SRC_MAC_ADDR;
+   ether_addr_copy(input->filter.src_addr, key->src);
+   }
+   }
+
+   if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
+   struct flow_dissector_key_basic *key, *mask;
+
+   key = skb_flow_dissector_target(f->dissector,
+   FLOW_DISSECTOR_KEY_BASIC,
+   f->key);
+   mask = skb_flow_dissector_target(f->dissector,
+FLOW_DISSECTOR_KEY_BASIC,
+f->mask);
+
+   if (mask->n_proto) {
+   if (mask->n_proto != ETHER_TYPE_FULL_MASK) {
+   NL_SET_ERR_MSG_MOD(extack, "Only full mask is 
supported for EtherType filter");
+   return -EINVAL;
+   }
+
+   input->filter.match_flags |= IGB_FILTER_FLAG_ETHER_TYPE;
+ 

Re: [PATCH net-next v6 1/2] net: permit skb_segment on head_frag frag_list skb

2018-03-21 Thread Alexander Duyck
On Wed, Mar 21, 2018 at 4:31 PM, Yonghong Song  wrote:
> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
> function skb_segment(), line 3667. The bpf program attaches to
> clsact ingress, calls bpf_skb_change_proto to change protocol
> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
> to send the changed packet out.
>
> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
> 3473 netdev_features_t features)
> 3474 {
> 3475 struct sk_buff *segs = NULL;
> 3476 struct sk_buff *tail = NULL;
> ...
> 3665 while (pos < offset + len) {
> 3666 if (i >= nfrags) {
> 3667 BUG_ON(skb_headlen(list_skb));
> 3668
> 3669 i = 0;
> 3670 nfrags = skb_shinfo(list_skb)->nr_frags;
> 3671 frag = skb_shinfo(list_skb)->frags;
> 3672 frag_skb = list_skb;
> ...
>
> call stack:
> ...
>  #1 [883ffef03558] __crash_kexec at 8110c525
>  #2 [883ffef03620] crash_kexec at 8110d5cc
>  #3 [883ffef03640] oops_end at 8101d7e7
>  #4 [883ffef03668] die at 8101deb2
>  #5 [883ffef03698] do_trap at 8101a700
>  #6 [883ffef036e8] do_error_trap at 8101abfe
>  #7 [883ffef037a0] do_invalid_op at 8101acd0
>  #8 [883ffef037b0] invalid_op at 81a00bab
> [exception RIP: skb_segment+3044]
> RIP: 817e4dd4  RSP: 883ffef03860  RFLAGS: 00010216
> RAX: 2bf6  RBX: 883feb7aaa00  RCX: 0011
> RDX: 883fb87910c0  RSI: 0011  RDI: 883feb7ab500
> RBP: 883ffef03928   R8: 2ce2   R9: 27da
> R10: 01ea  R11: 2d82  R12: 883f90a1ee80
> R13: 883fb8791120  R14: 883feb7abc00  R15: 2ce2
> ORIG_RAX:   CS: 0010  SS: 0018
>  #9 [883ffef03930] tcp_gso_segment at 818713e7
> ---  ---
> ...
>
> The triggering input skb has the following properties:
> list_skb = skb->frag_list;
> skb->nfrags != NULL && skb_headlen(list_skb) != 0
> and skb_segment() is not able to handle a frag_list skb
> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>
> This patch addressed the issue by handling skb_headlen(list_skb) != 0
> case properly if list_skb->head_frag is true, which is expected in
> most cases. The head frag is processed before list_skb->frags
> are processed.
>
> Reported-by: Diptanu Gon Choudhury 
> Signed-off-by: Yonghong Song 

This looks good to me.

Reviewed-by: Alexander Duyck 


Re: [PATCH v4 12/17] net: cxgb4/cxgb4vf: Eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread okaya

On 2018-03-21 19:03, Casey Leedom wrote:
[[ Apologies for the DUPLICATE email.  I forgot to tell my Mail Agent to
   use Plain Text. -- Casey ]]

  I feel very uncomfortable with these proposed changes.  Our team is right
in the middle of trying to tease our way through the various platform
implementations of writel(), writel_relaxed(), __raw_writel(), etc. in order
to support x86, PowerPC, ARM, etc. with a single code base.  This is
complicated by the somewhat ... "fuzzily defined" semantics and varying
platform implementations of all of these APIs.  (And note that I'm just
picking writel() as an example.)

  Additionally, many of the changes aren't even in fast paths and are thus
unneeded for performance.

  Please don't make these changes.  We're trying to get this all sussed out.




I was also given the feedback to look at the performance-critical path
only. I am in the process of revisiting the patches.

If you can point me to the ones that are important, I can try to limit
the changes to those only.

If your team wants to do it, I can drop this patch as well.

I think the semantics of the write API are clear. What was actually
implemented is another story.


I can share a few of my findings.

A portable driver needs to do this.

descriptor update in mem
wmb ()
writel_relaxed ()
mmiowb ()

Using __raw_write() is wrong as it can get reordered.

Using wmb()+writel() is also wrong for performance reasons.

If something is unclear, please ask.





RE: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP time stamping.

2018-03-21 Thread Keller, Jacob E
> -Original Message-
> From: Richard Cochran [mailto:richardcoch...@gmail.com]
> Sent: Wednesday, March 21, 2018 2:26 PM
> To: Keller, Jacob E 
> Cc: netdev@vger.kernel.org; devicet...@vger.kernel.org; Andrew Lunn
> ; David Miller ; Florian Fainelli
> ; Mark Rutland ; Miroslav
> Lichvar ; Rob Herring ; Willem de
> Bruijn 
> Subject: Re: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step
> PTP time stamping.
> 
> On Wed, Mar 21, 2018 at 08:05:36PM +, Keller, Jacob E wrote:
> > I am guessing that we expect all devices which support onestep P2P messages,
> will always support onestep SYNC as well?
> 
> Yes.  Anything else doesn't make sense, don't you think?
> 
> Also, reading 1588, it isn't clear whether supporting only 1-step Sync
> without 1-step P2P is even intended.  There is only a "one-step
> clock", and it is described as doing both.
> 
> Thanks,
> Richard

This was my understanding as well, but given the limited hardware which can do 
sync but not pdelay messages, I just wanted to make sure we were on the same 
page.

Thanks,
Jake


Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-21 Thread Andrew Lunn
On Wed, Mar 21, 2018 at 03:47:02PM -0700, Richard Cochran wrote:
> On Wed, Mar 21, 2018 at 11:16:52PM +0100, Andrew Lunn wrote:
> > The MAC drivers are clients of this device. They then use a phandle
> > and specifier:
> > 
> > eth0: ethernet-controller@72000 {
> > compatible = "marvell,kirkwood-eth";
> > #address-cells = <1>;
> > #size-cells = <0>;
> > reg = <0x72000 0x4000>;
> > 
> > timerstamper = < 2>
> > }
> > 
> > The 2 indicates this MAC is using port 2.
> > 
> > The MAC driver can then do the standard device tree things to follow
> > the phandle to get access to the device and use the API it exports.
> 
> But that would require hacking every last MAC driver.
> 
> I happy to improve the modeling, but the solution should be generic
> and work for every MAC driver.

Well, the solution is generic, in that the phandle can point to a
device anywhere. It could be MMIO, it could be on an MDIO bus,
etc. You just need to make sure your API makes no assumptions about how
the device driver talks to the hardware.

How clever is this device? Can it tell the difference between
1000Base-X and SGMII? Can it figure out that the MAC is repeating
every bit 100 times and so has dropped to 10Mbits? Does it understand
EEE? Does it need to know if RGMII or RGMII-ID is being used?

Can such a device really operate without the MAC being involved?  My
feeling is it needs to understand how the MII bus is being used. It
might also be that the device is less capable than the MAC, so you
need to turn off some of the MAC features. I think you are going to
need the MAC actively involved in this.

Andrew


[PATCH net-next v6 0/2] net: permit skb_segment on head_frag frag_list skb

2018-03-21 Thread Yonghong Song
One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.
 ...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
 ...

The triggering input skb has the following properties:
list_skb = skb->frag_list;
skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.

Patch #1 provides a simple solution to avoid BUG_ON. If
list_skb->head_frag is true, its page-backed frag will
be processed before the list_skb->frags.
Patch #2 provides a test case in test_bpf module which
constructs a skb and calls skb_segment() directly. The test
case is able to trigger the BUG_ON without Patch #1.

The patch has been tested in the following setup:
  ipv6_host <-> nat_server <-> ipv4_host
where nat_server has a bpf program doing ipv4<->ipv6
translation and forwarding through clsact hook
bpf_skb_change_proto.

Changelog:
v5 -> v6:
  . Added back missed BUG_ON(!nfrags) for zero
skb_headlen(skb) case, plus a couple of
cosmetic changes, from Alexander.
v4 -> v5:
  . Replace local variable head_frag with
a static inline function skb_head_frag_to_page_desc
which gets the head_frag on-demand. This makes
code more readable and also does not increase
the stack size, from Alexander.
  . Remove the "if(nfrags)" guard for skb_orphan_frags
and skb_zerocopy_clone as I found that they can
handle zero-frag skb (with non-zero skb_headlen(skb))
properly.
  . Properly release segment list from skb_segment()
in the test, from Eric.
v3 -> v4:
  . Remove dynamic memory allocation and use rewinding
for both index and frag to remove one branch in fast path,
from Alexander.
  . Fix a bunch of issues in test_bpf skb_segment() test,
including proper way to allocate skb, proper function
argument for skb_add_rx_frag and not freeing the skb, etc.,
from Eric.
v2 -> v3:
  . Use starting frag index -1 (instead of 0) to
special process head_frag before other frags in the skb,
from Alexander Duyck.
v1 -> v2:
  . Removed never-hit BUG_ON, spotted by Linyu Yuan.

Yonghong Song (2):
  net: permit skb_segment on head_frag frag_list skb
  net: bpf: add a test for skb_segment in test_bpf module

 lib/test_bpf.c    | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 net/core/skbuff.c | 27 ++++++++++++++++++++++-----
 2 files changed, 113 insertions(+), 7 deletions(-)

-- 
2.9.5



[PATCH net-next v6 2/2] net: bpf: add a test for skb_segment in test_bpf module

2018-03-21 Thread Yonghong Song
Without the previous commit,
"modprobe test_bpf" will have the following errors:
...
[   98.149165] [ cut here ]
[   98.159362] kernel BUG at net/core/skbuff.c:3667!
[   98.169756] invalid opcode:  [#1] SMP PTI
[   98.179370] Modules linked in:
[   98.179371]  test_bpf(+)
...
which triggers the bug the previous commit intends to fix.

The skbs are constructed to mimic what mlx5 may generate.
The packet size/header may not mimic real cases in production. But
the processing flow is similar.

Signed-off-by: Yonghong Song 
---
 lib/test_bpf.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 91 insertions(+), 2 deletions(-)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 2efb213..a468b5c 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6574,6 +6574,93 @@ static bool exclude_test(int test_id)
return test_id < test_range[0] || test_id > test_range[1];
 }
 
+static __init struct sk_buff *build_test_skb(void)
+{
+   u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
+   struct sk_buff *skb[2];
+   struct page *page[2];
+   int i, data_size = 8;
+
+   for (i = 0; i < 2; i++) {
+   page[i] = alloc_page(GFP_KERNEL);
+   if (!page[i]) {
+   if (i == 0)
+   goto err_page0;
+   else
+   goto err_page1;
+   }
+
+   /* this will set skb[i]->head_frag */
+   skb[i] = dev_alloc_skb(headroom + data_size);
+   if (!skb[i]) {
+   if (i == 0)
+   goto err_skb0;
+   else
+   goto err_skb1;
+   }
+
+   skb_reserve(skb[i], headroom);
+   skb_put(skb[i], data_size);
+   skb[i]->protocol = htons(ETH_P_IP);
+   skb_reset_network_header(skb[i]);
+   skb_set_mac_header(skb[i], -ETH_HLEN);
+
+   skb_add_rx_frag(skb[i], 0, page[i], 0, 64, 64);
+   /* skb_headlen(skb[i]): 8, skb[i]->head_frag = 1 */
+   }
+
+   /* setup shinfo */
+   skb_shinfo(skb[0])->gso_size = 1448;
+   skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
+   skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
+   skb_shinfo(skb[0])->gso_segs = 0;
+   skb_shinfo(skb[0])->frag_list = skb[1];
+
+   /* adjust skb[0]'s len */
+   skb[0]->len += skb[1]->len;
+   skb[0]->data_len += skb[1]->data_len;
+   skb[0]->truesize += skb[1]->truesize;
+
+   return skb[0];
+
+err_skb1:
+   __free_page(page[1]);
+err_page1:
+   kfree_skb(skb[0]);
+err_skb0:
+   __free_page(page[0]);
+err_page0:
+   return NULL;
+}
+
+static __init int test_skb_segment(void)
+{
+   netdev_features_t features;
+   struct sk_buff *skb, *segs;
+   int ret = -1;
+
+   features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
+  NETIF_F_IPV6_CSUM;
+   features |= NETIF_F_RXCSUM;
+   skb = build_test_skb();
+   if (!skb) {
+   pr_info("%s: failed to build_test_skb", __func__);
+   goto done;
+   }
+
+   segs = skb_segment(skb, features);
+   if (segs) {
+   kfree_skb_list(segs);
+   ret = 0;
+   pr_info("%s: success in skb_segment!", __func__);
+   } else {
+   pr_info("%s: failed in skb_segment!", __func__);
+   }
+   kfree_skb(skb);
+done:
+   return ret;
+}
+
 static __init int test_bpf(void)
 {
int i, err_cnt = 0, pass_cnt = 0;
@@ -6632,9 +6719,11 @@ static int __init test_bpf_init(void)
return ret;
 
ret = test_bpf();
-
destroy_bpf_tests();
-   return ret;
+   if (ret)
+   return ret;
+
+   return test_skb_segment();
 }
 
 static void __exit test_bpf_exit(void)
-- 
2.9.5



[PATCH net-next v6 1/2] net: permit skb_segment on head_frag frag_list skb

2018-03-21 Thread Yonghong Song
One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473 netdev_features_t features)
3474 {
3475 struct sk_buff *segs = NULL;
3476 struct sk_buff *tail = NULL;
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
3668
3669 i = 0;
3670 nfrags = skb_shinfo(list_skb)->nr_frags;
3671 frag = skb_shinfo(list_skb)->frags;
3672 frag_skb = list_skb;
...

call stack:
...
 #1 [883ffef03558] __crash_kexec at 8110c525
 #2 [883ffef03620] crash_kexec at 8110d5cc
 #3 [883ffef03640] oops_end at 8101d7e7
 #4 [883ffef03668] die at 8101deb2
 #5 [883ffef03698] do_trap at 8101a700
 #6 [883ffef036e8] do_error_trap at 8101abfe
 #7 [883ffef037a0] do_invalid_op at 8101acd0
 #8 [883ffef037b0] invalid_op at 81a00bab
[exception RIP: skb_segment+3044]
RIP: 817e4dd4  RSP: 883ffef03860  RFLAGS: 00010216
RAX: 2bf6  RBX: 883feb7aaa00  RCX: 0011
RDX: 883fb87910c0  RSI: 0011  RDI: 883feb7ab500
RBP: 883ffef03928   R8: 2ce2   R9: 27da
R10: 01ea  R11: 2d82  R12: 883f90a1ee80
R13: 883fb8791120  R14: 883feb7abc00  R15: 2ce2
ORIG_RAX:   CS: 0010  SS: 0018
 #9 [883ffef03930] tcp_gso_segment at 818713e7
---  ---
...

The triggering input skb has the following properties:
list_skb = skb->frag_list;
skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.

This patch addressed the issue by handling skb_headlen(list_skb) != 0
case properly if list_skb->head_frag is true, which is expected in
most cases. The head frag is processed before list_skb->frags
are processed.
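The helper added by the patch derives a page descriptor from the skb head. A hypothetical userspace analog of that arithmetic (ignoring compound pages, which virt_to_head_page() handles in the kernel; names and the page size here are illustrative) looks like:

```c
#include <assert.h>
#include <stdint.h>

#define PG_SIZE 4096u	/* illustrative page size */

/* page base + offset of data within the page + headlen, mirroring
 * what skb_head_frag_to_page_desc() computes for the real skb. */
struct page_desc {
	uintptr_t page;
	unsigned int offset;
	unsigned int size;
};

static struct page_desc to_page_desc(const void *data, unsigned int headlen)
{
	struct page_desc d;

	d.page   = (uintptr_t)data & ~((uintptr_t)PG_SIZE - 1); /* page base */
	d.offset = (unsigned int)((uintptr_t)data - d.page);     /* data - page_address(page) */
	d.size   = headlen;                                      /* skb_headlen() */
	return d;
}
```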

Reported-by: Diptanu Gon Choudhury 
Signed-off-by: Yonghong Song 
---
 net/core/skbuff.c | 27 ++-
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c134..4e1d4e7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3460,6 +3460,19 @@ void *skb_pull_rcsum(struct sk_buff *skb, unsigned int len)
 }
 EXPORT_SYMBOL_GPL(skb_pull_rcsum);
 
+static inline skb_frag_t skb_head_frag_to_page_desc(struct sk_buff *frag_skb)
+{
+   skb_frag_t head_frag;
+   struct page *page;
+
+   page = virt_to_head_page(frag_skb->head);
+   head_frag.page.p = page;
+   head_frag.page_offset = frag_skb->data -
+   (unsigned char *)page_address(page);
+   head_frag.size = skb_headlen(frag_skb);
+   return head_frag;
+}
+
 /**
  * skb_segment - Perform protocol segmentation on skb.
  * @head_skb: buffer to segment
@@ -3664,15 +3677,19 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
while (pos < offset + len) {
if (i >= nfrags) {
-   BUG_ON(skb_headlen(list_skb));
-
i = 0;
nfrags = skb_shinfo(list_skb)->nr_frags;
frag = skb_shinfo(list_skb)->frags;
frag_skb = list_skb;
+   if (!skb_headlen(list_skb)) {
+   BUG_ON(!nfrags);
+   } else {
+   BUG_ON(!list_skb->head_frag);
 
-   BUG_ON(!nfrags);
-
+   /* to make room for head_frag. */
+   i--;
+   frag--;
+   }
if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
skb_zerocopy_clone(nskb, frag_skb,
   GFP_ATOMIC))
@@ -3689,7 +3706,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
goto err;
}
 
-   *nskb_frag = *frag;
+   *nskb_frag = (i < 0) ? skb_head_frag_to_page_desc(frag_skb) : *frag;
__skb_frag_ref(nskb_frag);
size = skb_frag_size(nskb_frag);
 
-- 
2.9.5



Re: [PATCH v4 12/17] net: cxgb4/cxgb4vf: Eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Casey Leedom
[[ Apologies for the DUPLICATE email.  I forgot to tell my Mail Agent to
   use Plain Text. -- Casey ]]

  I feel very uncomfortable with these proposed changes.  Our team is right
in the middle of trying to tease our way through the various platform
implementations of writel(), writel_relaxed(), __raw_writel(), etc. in order
to support x86, PowerPC, ARM, etc. with a single code base.  This is
complicated by the somewhat ... "fuzzily defined" semantics and varying
platform implementations of all of these APIs.  (And note that I'm just
picking writel() as an example.)

  Additionally, many of the changes aren't even in fast paths and are thus
unneeded for performance.

  Please don't make these changes.  We're trying to get this all sussed out.

Casey


  
From: Sinan Kaya 
Sent: Monday, March 19, 2018 7:42:27 PM
To: netdev@vger.kernel.org; ti...@codeaurora.org; sulr...@codeaurora.org
Cc: linux-arm-...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; Sinan 
Kaya; Ganesh GR; Casey Leedom; linux-ker...@vger.kernel.org
Subject: [PATCH v4 12/17] net: cxgb4/cxgb4vf: Eliminate duplicate barriers on 
weakly-ordered archs
  

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up with the CPU observing two barriers back to back before
executing the register write.

Create a new wrapper function with a relaxed write operator. Use the new
wrapper when a write follows a wmb().
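A userspace analog of the pattern, using C11 atomics as stand-ins for wmb()/writel_relaxed() (the doorbell variable and function names are illustrative, not from the driver):

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for a device doorbell register. */
static _Atomic unsigned int doorbell;

/* One explicit release fence (the wmb()) followed by a plain relaxed
 * store (the writel_relaxed()), instead of a fence plus a store that
 * carries its own barrier -- the duplication the patch removes. */
static void ring_doorbell(unsigned int val)
{
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&doorbell, val, memory_order_relaxed);
}
```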

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  6 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/sge.c    | 12 ++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  |  2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/adapter.h  | 14 ++
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c  | 18 ++
 6 files changed, 44 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 9040e13..6bde0b9 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1202,6 +1202,12 @@ static inline void t4_write_reg(struct adapter *adap, u32 reg_addr, u32 val)
 writel(val, adap->regs + reg_addr);
 }
 
+static inline void t4_write_reg_relaxed(struct adapter *adap, u32 reg_addr,
+   u32 val)
+{
+   writel_relaxed(val, adap->regs + reg_addr);
+}
+
 #ifndef readq
 static inline u64 readq(const volatile void __iomem *addr)
 {
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 7b452e8..276472d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -1723,8 +1723,8 @@ int cxgb4_sync_txq_pidx(struct net_device *dev, u16 qid, u16 pidx,
 else
 val = PIDX_T5_V(delta);
 wmb();
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-    QID_V(qid) | val);
+   t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+    QID_V(qid) | val);
 }
 out:
 return ret;
@@ -1902,8 +1902,9 @@ static void enable_txq_db(struct adapter *adap, struct sge_txq *q)
  * are committed before we tell HW about them.
  */
 wmb();
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-    QID_V(q->cntxt_id) | PIDX_V(q->db_pidx_inc));
+   t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+    QID_V(q->cntxt_id) |
+   PIDX_V(q->db_pidx_inc));
 q->db_pidx_inc = 0;
 }
 q->db_disabled = 0;
@@ -2003,8 +2004,8 @@ static void sync_txq_pidx(struct adapter *adap, struct sge_txq *q)
 else
 val = PIDX_T5_V(delta);
 wmb();
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-    QID_V(q->cntxt_id) | val);
+   t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+    QID_V(q->cntxt_id) | val);
 }
 out:
 q->db_disabled = 0;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 6e310a0..7388aac 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -530,11 +530,11 @@ static inline void ring_fl_db(struct adapter *adap, struct sge_fl *q)
  * mechanism.
  */
 if (unlikely(q->bar2_addr == NULL)) {
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-    val | 

Re: [PATCH net-next v5 1/2] net: permit skb_segment on head_frag frag_list skb

2018-03-21 Thread Yonghong Song



On 3/21/18 2:51 PM, Alexander Duyck wrote:

On Wed, Mar 21, 2018 at 1:36 PM, Yonghong Song  wrote:

One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473 netdev_features_t features)
3474 {
3475 struct sk_buff *segs = NULL;
3476 struct sk_buff *tail = NULL;
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
3668
3669 i = 0;
3670 nfrags = skb_shinfo(list_skb)->nr_frags;
3671 frag = skb_shinfo(list_skb)->frags;
3672 frag_skb = list_skb;
...

call stack:
...
  #1 [883ffef03558] __crash_kexec at 8110c525
  #2 [883ffef03620] crash_kexec at 8110d5cc
  #3 [883ffef03640] oops_end at 8101d7e7
  #4 [883ffef03668] die at 8101deb2
  #5 [883ffef03698] do_trap at 8101a700
  #6 [883ffef036e8] do_error_trap at 8101abfe
  #7 [883ffef037a0] do_invalid_op at 8101acd0
  #8 [883ffef037b0] invalid_op at 81a00bab
 [exception RIP: skb_segment+3044]
 RIP: 817e4dd4  RSP: 883ffef03860  RFLAGS: 00010216
 RAX: 2bf6  RBX: 883feb7aaa00  RCX: 0011
 RDX: 883fb87910c0  RSI: 0011  RDI: 883feb7ab500
 RBP: 883ffef03928   R8: 2ce2   R9: 27da
 R10: 01ea  R11: 2d82  R12: 883f90a1ee80
 R13: 883fb8791120  R14: 883feb7abc00  R15: 2ce2
 ORIG_RAX:   CS: 0010  SS: 0018
  #9 [883ffef03930] tcp_gso_segment at 818713e7
---  ---
...

The triggering input skb has the following properties:
 list_skb = skb->frag_list;
 skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.

This patch addressed the issue by handling skb_headlen(list_skb) != 0
case properly if list_skb->head_frag is true, which is expected in
most cases. The head frag is processed before list_skb->frags
are processed.

Reported-by: Diptanu Gon Choudhury 
Signed-off-by: Yonghong Song 
---
  net/core/skbuff.c | 26 ++++++++++++++++++------
  1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c134..23b317a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3460,6 +3460,19 @@ void *skb_pull_rcsum(struct sk_buff *skb, unsigned int len)
  }
  EXPORT_SYMBOL_GPL(skb_pull_rcsum);

+static inline skb_frag_t skb_head_frag_to_page_desc(struct sk_buff *frag_skb)
+{
+   skb_frag_t head_frag;
+   struct page *page;
+
+   page = virt_to_head_page(frag_skb->head);
+   head_frag.page.p = page;
+   head_frag.page_offset = frag_skb->data -
+   (unsigned char *)page_address(page);
+   head_frag.size = skb_headlen(frag_skb);
+   return head_frag;
+}
+
  /**
   * skb_segment - Perform protocol segmentation on skb.
   * @head_skb: buffer to segment
@@ -3664,15 +3677,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,

 while (pos < offset + len) {
 if (i >= nfrags) {
-   BUG_ON(skb_headlen(list_skb));
-
 i = 0;
 nfrags = skb_shinfo(list_skb)->nr_frags;
 frag = skb_shinfo(list_skb)->frags;
-   frag_skb = list_skb;


You could probably leave this line in place. No point in moving it.


The only reason I moved it is to bring the definition closer to its use.
But I am totally fine with leaving it as is.




-
-   BUG_ON(!nfrags);
+   if (skb_headlen(list_skb)) {
+   BUG_ON(!list_skb->head_frag);

+   /* to make room for head_frag. */
+   i--; frag--;


Normally these should be two separate lines one for "i--;" and one for
"frag--;".


Will change. Surprised that checkpatch.pl did not complain about this.




+   }


You could probably place the BUG_ON(!nfrags) in an else statement here
to handle the case where we have a potentially empty skb which would
be a bug.


Yes, this makes sense. Will add this BUG_ON.



+   frag_skb = list_skb;
 

Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-21 Thread Richard Cochran
On Wed, Mar 21, 2018 at 11:16:52PM +0100, Andrew Lunn wrote:
> The MAC drivers are clients of this device. They then use a phandle
> and specifier:
> 
>   eth0: ethernet-controller@72000 {
>   compatible = "marvell,kirkwood-eth";
>   #address-cells = <1>;
>   #size-cells = <0>;
>   reg = <0x72000 0x4000>;
> 
>   timestamper = < 2>
>   }
> 
> The 2 indicates this MAC is using port 2.
> 
> The MAC driver can then do the standard device tree things to follow
> the phandle to get access to the device and use the API it exports.

But that would require hacking every last MAC driver.

I'm happy to improve the modeling, but the solution should be generic
and work for every MAC driver.

Thanks,
Richard


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-21 Thread Thomas Gleixner
On Wed, 21 Mar 2018, Thomas Gleixner wrote:
> If you look at the use cases of TDM in various fields then FIFO mode is
> pretty much useless. In industrial/automotive fieldbus applications the
> various time slices are filled by different threads or even processes.

That brings me to a related question. The TDM cases I'm familiar with which
aim to use this utilize multiple periodic time slices, aka 802.1Qbv
time-aware scheduling.

Simple example:

[1a][1b][1c][1d][1a][1b][1c][1d]...
[2a][2b][2c][2d]
[3a][3b]
[4a][4b]
--> t   


where 1-4 is the slice level and a-d are network nodes.

In most cases the slice levels on a node are handled by different
applications or threads. Some of the protocols utilize dedicated time slice
levels - lets assume '4' in the above example - to run general network
traffic which might even be allowed to have collisions, i.e. [4a-d] would
become [4] and any node can send; the involved components like switches are
supposed to handle that.

I'm not seeing how TBS is going to assist with any of that. It requires
everything to be handled at the application level. Not really useful
especially not for general traffic which does not know about the scheduling
bands at all.

If you look at an industrial control node. It basically does:

queue_first_packet(tx, slice1);
while (!stop) {
if (wait_for_packet(rx) == ERROR)
goto errorhandling;
tx = do_computation(rx);
queue_next_tx(tx, slice1);
}

that's a pretty common pattern for these kind of applications. For audio
sources queue_next() might be triggered by the input sampler which needs to
be synchronized to the network slices anyway in order to work properly.

TBS per current implementation is nice as a proof of concept, but it solves
just a small portion of the complete problem space. I have the suspicion
that this was 'designed' to replace the user space hack in the AVNU stack
with something close to it. Not really a good plan to be honest.

I think what we really want is a strict periodic scheduler which supports
multiple slices as shown above because thats what all relevant TDM use
cases need: A/V, industrial fieldbusses .

  |-|
  | |
  |   TAS   |<- Config
  |1   2   3   4|
  |-|
   |   |   |   |
   |   |   |   |
   |   |   |   |
   |   |   |   |
  [DirectSocket]   [Qdisc FIFO]   [Qdisc Prio] [Qdisc FIFO]
   |   |   |
   |   |   |
[Socket][Socket] [General traffic]


The interesting thing here is that it does not require any time stamp
information brought in from the application. That's especially good for
general network traffic which is routed through a dedicated time slot. If
we don't have that then we need a user space scheduler which does exactly
the same thing and we have to route the general traffic out to user space
and back into the kernel, which is obviously a pointless exercise.

There are all kind of TDM schemes out there which are not directly driven
by applications, but rather route categorized traffic like VLANs through
dedicated time slices. That works pretty well with the above scheme because
in that case the applications might be completely oblivious about the tx
time schedule.

Surely there are protocols which do not utilize every time slice they could
use, so we need a way to tell the number of empty slices between two
consecutive packets. There are also different policies vs. the unused time
slices, like sending dummy frames or just nothing which wants to be
addressed, but I don't think that changes the general approach.
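A minimal sketch of the per-slice arithmetic such a strict periodic scheduler would perform — a cycle base time, a cycle length, and a per-slice offset, all with illustrative names and in arbitrary time units, not any existing kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Find the next launch time at or after "now" for a slice that fires
 * once per cycle at offset "slice_off" from the cycle base. */
static uint64_t next_slot(uint64_t base, uint64_t cycle,
			  uint64_t slice_off, uint64_t now)
{
	uint64_t slot0 = base + slice_off;
	uint64_t n;

	if (now <= slot0)
		return slot0;
	/* round up to the next cycle boundary for this slice */
	n = (now - slot0 + cycle - 1) / cycle;
	return slot0 + n * cycle;
}
```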

There might be some special cases for setup or node hotplug, but the
protocols I'm familiar with handle these in dedicated time slices or
through general traffic so it should just fit in.

I'm surely missing some details, but from my knowledge about the protocols
which want to utilize this, the general direction should be fine.

Feel free to tell me that I'm missing the point completely though :)

Thoughts?

Thanks,

tglx







Re: [PATCH V2 net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-21 Thread Kirill Tkhai
Hi, Saeed,

thanks for addressing some of my remarks, but I've dived into the code
more deeply and found, with sadness, that the patch lacks readability.

It is too big and does not fit the kernel coding style. Please see some
comments below.

Can we do something about the patch length? Is there a way to split it
into several smaller patches? It's difficult to review the logic of the
changes.

On 22.03.2018 00:01, Saeed Mahameed wrote:
> From: Ilya Lesokhin 
> 
> This patch adds a generic infrastructure to offload TLS crypto to a
> network devices. It enables the kernel TLS socket to skip encryption
> and authentication operations on the transmit side of the data path.
> Leaving those computationally expensive operations to the NIC.
> 
> The NIC offload infrastructure builds TLS records and pushes them to
> the TCP layer just like the SW KTLS implementation and using the same API.
> TCP segmentation is mostly unaffected. Currently the only exception is
> that we prevent mixed SKBs where only part of the payload requires
> offload. In the future we are likely to add a similar restriction
> following a change cipher spec record.
> 
> The notable differences between SW KTLS and NIC offloaded TLS
> implementations are as follows:
> 1. The offloaded implementation builds "plaintext TLS record", those
> records contain plaintext instead of ciphertext and place holder bytes
> instead of authentication tags.
> 2. The offloaded implementation maintains a mapping from TCP sequence
> number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
> TLS socket, we can use the tls NIC offload infrastructure to obtain
> enough context to encrypt the payload of the SKB.
> A TLS record is released when the last byte of the record is ack'ed,
> this is done through the new icsk_clean_acked callback.
> 
> The infrastructure should be extendable to support various NIC offload
> implementations.  However it is currently written with the
> implementation below in mind:
> The NIC assumes that packets from each offloaded stream are sent as
> plaintext and in-order. It keeps track of the TLS records in the TCP
> stream. When a packet marked for offload is transmitted, the NIC
> encrypts the payload in-place and puts authentication tags in the
> relevant place holders.
> 
> The responsibility for handling out-of-order packets (i.e. TCP
> retransmission, qdisc drops) falls on the netdev driver.
> 
> The netdev driver keeps track of the expected TCP SN from the NIC's
> perspective.  If the next packet to transmit matches the expected TCP
> SN, the driver advances the expected TCP SN, and transmits the packet
> with TLS offload indication.
> 
> If the next packet to transmit does not match the expected TCP SN. The
> driver calls the TLS layer to obtain the TLS record that includes the
> TCP of the packet for transmission. Using this TLS record, the driver
> posts a work entry on the transmit queue to reconstruct the NIC TLS
> state required for the offload of the out-of-order packet. It updates
> the expected TCP SN accordingly and transmits the now in-order packet.
> The same queue is used for packet transmission and TLS context
> reconstruction to avoid the need for flushing the transmit queue before
> issuing the context reconstruction request.
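The sequence-number-to-record lookup described above can be sketched as follows (illustrative types, not the tls_record_info API; the real code must also handle 32-bit sequence wraparound, which this sketch ignores):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative record list: each record covers TCP sequence numbers
 * [end_seq - len, end_seq). */
struct rec {
	uint32_t end_seq;
	int len;
	struct rec *next;
};

/* Find the record containing "seq"; NULL when no record covers it. */
static struct rec *find_record(struct rec *head, uint32_t seq)
{
	struct rec *r;

	for (r = head; r; r = r->next) {
		uint32_t start = r->end_seq - (uint32_t)r->len;

		if (seq >= start && seq < r->end_seq)
			return r;
	}
	return NULL;
}
```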
> 
> Signed-off-by: Ilya Lesokhin 
> Signed-off-by: Boris Pismenny 
> Signed-off-by: Aviad Yehezkel 
> Signed-off-by: Saeed Mahameed 
> ---
>  include/net/tls.h |  74 +++-
>  net/tls/Kconfig   |  10 +
>  net/tls/Makefile  |   2 +
>  net/tls/tls_device.c  | 793 ++
>  net/tls/tls_device_fallback.c | 415 ++
>  net/tls/tls_main.c|  33 +-
>  6 files changed, 1320 insertions(+), 7 deletions(-)
>  create mode 100644 net/tls/tls_device.c
>  create mode 100644 net/tls/tls_device_fallback.c
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 4913430ab807..0bfb1b0a156a 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -77,6 +77,37 @@ struct tls_sw_context {
>   struct scatterlist sg_aead_out[2];
>  };
>  
> +struct tls_record_info {
> + struct list_head list;
> + u32 end_seq;
> + int len;
> + int num_frags;
> + skb_frag_t frags[MAX_SKB_FRAGS];
> +};
> +
> +struct tls_offload_context {
> + struct crypto_aead *aead_send;
> + spinlock_t lock;/* protects records list */
> + struct list_head records_list;
> + struct tls_record_info *open_record;
> + struct tls_record_info *retransmit_hint;
> + u64 hint_record_sn;
> + u64 unacked_record_sn;
> +
> + struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
> + void (*sk_destruct)(struct sock *sk);
> + u8 driver_state[];
> + /* The TLS layer reserves room for driver specific state
> +  * Currently the belief is that there is not enough
> +  * driver specific state 

[PATCH net-next 1/1] tc-testing: updated police, mirred, skbedit and skbmod with more tests

2018-03-21 Thread Roman Mashak
Added extra test cases for control actions (reclassify, pipe etc.),
cookies, max index value and police args sanity check.

Signed-off-by: Roman Mashak 
---
 .../tc-testing/tc-tests/actions/mirred.json  | 192 +++++++++++++++++++++++
 .../tc-testing/tc-tests/actions/police.json  | 144 ++++++++++++++++
 .../tc-testing/tc-tests/actions/skbedit.json | 168 +++++++++++++++++
 .../tc-testing/tc-tests/actions/skbmod.json  |  26 ++-
 4 files changed, 529 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
index 0fcccf18399b..443c9b3c8664 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
@@ -171,6 +171,198 @@
 ]
 },
 {
+"id": "8917",
+"name": "Add mirred mirror action with control pass",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo pass index 1",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 1",
"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to device lo\\) pass.*index 1 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "1054",
+"name": "Add mirred mirror action with control pipe",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo pipe index 15",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 15",
"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to device lo\\) pipe.*index 15 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "9887",
+"name": "Add mirred mirror action with control continue",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo continue index 15",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 15",
"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to device lo\\) continue.*index 15 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "e4aa",
+"name": "Add mirred mirror action with control reclassify",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo reclassify index 150",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 150",
"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to device lo\\) reclassify.*index 150 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "ece9",
+"name": "Add mirred mirror action with control drop",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo drop index 99",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 99",
"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to device lo\\) drop.*index 99 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "0031",
+"name": "Add mirred mirror action with control jump",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",

Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-21 Thread Andrew Lunn
On Wed, Mar 21, 2018 at 02:57:29PM -0700, Richard Cochran wrote:
> On Wed, Mar 21, 2018 at 10:44:36PM +0100, Andrew Lunn wrote:
> > O.K, so lets do the 20 questions approach.
> 
> :)
> 
> > As far as i can see, this is not an MDIO device. It is not connected
> > to the MDIO bus, it has no MDIO registers, you don't even pass a valid
> > MDIO address in device tree.
> 
> Right.

O.K., so I suggest we stop trying to model this thing as an MDIO
device. It is really an MMIO device.

> There might very well be other products out there that *do*
> use MDIO commands.  I know that there are MII time stamping asics and
> ip cores on the market, but I don't know all of their creative design
> details.

So I suggest we leave the design for those until we actually see one.
  
> > It it actually an MII bus snooper? Does it snoop, or is it actually in
> > the MII bus, and can modify packets, i.e. insert time stamps as frames
> > pass over the MII bus?
> 
> It acts like a "snooper" to provide out of band time stamps, but it
> can also modify packets for the one-step functionality.
>  
> > When the driver talks about having three ports, does that mean it can
> > be on three different MII busses?

O.K., so here is how I think it should be done. It is a device which
offers services to other devices. It is not that different to an
interrupt controller, a GPIO controller, etc. Lets follow how they
work in device tree

The device itself is just another MMIO mapped device in the SoC:

timestamper@6000 {
compatible = "ines,ptp-ctrl";
reg = <0x6000 0x80>;
#address-cells = <1>;
#size-cells = <0>;
};

The MAC drivers are clients of this device. They then use a phandle
and specifier:

eth0: ethernet-controller@72000 {
compatible = "marvell,kirkwood-eth";
#address-cells = <1>;
#size-cells = <0>;
reg = <0x72000 0x4000>;

timerstamper = < 2>
}

The 2 indicates this MAC is using port 2.

The MAC driver can then do the standard device tree things to follow
the phandle to get access to the device and use the API it exports.

Andrew


Re: [PATCH REPOST v4 5/7] ixgbevf: keep writel() closer to wmb()

2018-03-21 Thread okaya

On 2018-03-21 17:54, David Miller wrote:

From: Jeff Kirsher 
Date: Wed, 21 Mar 2018 14:48:08 -0700


On Wed, 2018-03-21 at 14:56 -0400, Sinan Kaya wrote:

Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya 
Reviewed-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)


This patch fails to compile because there is a call to
ixgbevf_write_tail() which you missed cleaning up.


For a change with delicate side effects, it doesn't create much
confidence if the code does not even compile.

Sinan, please put more care into the changes you are making.


I think the issue is that the tree the code is being tested against has 
undelivered code, as Alex mentioned.


I was using linux-next 4.16 rc4 for testing.

I will rebase to Jeff's tree.



Thank you.


Re: [PATCH net v2 0/7] fix idr leak in actions

2018-03-21 Thread David Miller
From: Davide Caratti 
Date: Mon, 19 Mar 2018 15:31:21 +0100

> This series fixes situations where a temporary failure to install a TC
> action results in the permanent impossibility to reuse the configured
> value of 'index'.
> 
> Thanks to Cong Wang for the initial review.
> 
> v2: fix build error in act_ipt.c, reported by kbuild test robot

Series applied, thanks Davide.


Re: [PATCH] qede: fix spelling mistake: "registeration" -> "registration"

2018-03-21 Thread David Miller
From: Colin King 
Date: Mon, 19 Mar 2018 14:57:11 +

> From: Colin Ian King 
> 
> Trivial fix to spelling mistakes in DP_ERR error message text and
> comments
> 
> Signed-off-by: Colin Ian King 

Applied.


Re: [PATCH] bnx2x: fix spelling mistake: "registeration" -> "registration"

2018-03-21 Thread David Miller
From: Colin King 
Date: Mon, 19 Mar 2018 14:32:59 +

> From: Colin Ian King 
> 
> Trivial fix to spelling mistake in BNX2X_ERR error message text
> 
> Signed-off-by: Colin Ian King 

Applied.


[trivial PATCH V2] treewide: Align function definition open/close braces

2018-03-21 Thread Joe Perches
Some functions definitions have either the initial open brace and/or
the closing brace outside of column 1.

Move those braces to column 1.

This allows various function analyzers like gnu complexity to work
properly for these modified functions.

Signed-off-by: Joe Perches 
Acked-by: Andy Shevchenko 
Acked-by: Paul Moore 
Acked-by: Alex Deucher 
Acked-by: Dave Chinner 
Reviewed-by: Darrick J. Wong 
Acked-by: Alexandre Belloni 
Acked-by: Martin K. Petersen 
Acked-by: Takashi Iwai 
Acked-by: Mauro Carvalho Chehab 
---

git diff -w still shows no difference.

This patch was sent back in December and not applied.

As the trivial tree maintainer seems inactive, it'd be nice if
Andrew Morton picks this up.

V2: Remove fs/xfs/libxfs/xfs_alloc.c as it's updated and remerge the rest

 arch/x86/include/asm/atomic64_32.h   |  2 +-
 drivers/acpi/custom_method.c |  2 +-
 drivers/acpi/fan.c   |  2 +-
 drivers/gpu/drm/amd/display/dc/core/dc.c |  2 +-
 drivers/media/i2c/msp3400-kthreads.c |  2 +-
 drivers/message/fusion/mptsas.c  |  2 +-
 drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c |  2 +-
 drivers/net/wireless/ath/ath9k/xmit.c|  2 +-
 drivers/platform/x86/eeepc-laptop.c  |  2 +-
 drivers/rtc/rtc-ab-b5ze-s3.c |  2 +-
 drivers/scsi/dpt_i2o.c   |  2 +-
 drivers/scsi/sym53c8xx_2/sym_glue.c  |  2 +-
 fs/locks.c   |  2 +-
 fs/ocfs2/stack_user.c|  2 +-
 fs/xfs/xfs_export.c  |  2 +-
 kernel/audit.c   |  6 +++---
 kernel/trace/trace_printk.c  |  4 ++--
 lib/raid6/sse2.c | 14 +++---
 sound/soc/fsl/fsl_dma.c  |  2 +-
 19 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/atomic64_32.h 
b/arch/x86/include/asm/atomic64_32.h
index 46e1ef17d92d..92212bf0484f 100644
--- a/arch/x86/include/asm/atomic64_32.h
+++ b/arch/x86/include/asm/atomic64_32.h
@@ -123,7 +123,7 @@ static inline long long arch_atomic64_read(const atomic64_t 
*v)
long long r;
alternative_atomic64(read, "=&A" (r), "c" (v) : "memory");
return r;
- }
+}
 
 /**
  * arch_atomic64_add_return - add and return
diff --git a/drivers/acpi/custom_method.c b/drivers/acpi/custom_method.c
index b33fba70ec51..a07fbe999eb6 100644
--- a/drivers/acpi/custom_method.c
+++ b/drivers/acpi/custom_method.c
@@ -97,7 +97,7 @@ static void __exit acpi_custom_method_exit(void)
 {
if (cm_dentry)
debugfs_remove(cm_dentry);
- }
+}
 
 module_init(acpi_custom_method_init);
 module_exit(acpi_custom_method_exit);
diff --git a/drivers/acpi/fan.c b/drivers/acpi/fan.c
index 6cf4988206f2..3563103590c6 100644
--- a/drivers/acpi/fan.c
+++ b/drivers/acpi/fan.c
@@ -219,7 +219,7 @@ fan_set_cur_state(struct thermal_cooling_device *cdev, 
unsigned long state)
return fan_set_state_acpi4(device, state);
else
return fan_set_state(device, state);
- }
+}
 
 static const struct thermal_cooling_device_ops fan_cooling_ops = {
.get_max_state = fan_get_max_state,
diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c
index 8394d69b963f..e934326a95d3 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -588,7 +588,7 @@ static void disable_dangling_plane(struct dc *dc, struct 
dc_state *context)
  
**/
 
 struct dc *dc_create(const struct dc_init_data *init_params)
- {
+{
struct dc *dc = kzalloc(sizeof(*dc), GFP_KERNEL);
unsigned int full_pipe_count;
 
diff --git a/drivers/media/i2c/msp3400-kthreads.c 
b/drivers/media/i2c/msp3400-kthreads.c
index 4dd01e9f553b..dc6cb8d475b3 100644
--- a/drivers/media/i2c/msp3400-kthreads.c
+++ b/drivers/media/i2c/msp3400-kthreads.c
@@ -885,7 +885,7 @@ static int msp34xxg_modus(struct i2c_client *client)
 }
 
 static void msp34xxg_set_source(struct i2c_client *client, u16 reg, int in)
- {
+{
struct msp_state *state = to_state(i2c_get_clientdata(client));
int source, matrix;
 
diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
index 439ee9c5f535..231f3a1e27bf 100644
--- a/drivers/message/fusion/mptsas.c
+++ b/drivers/message/fusion/mptsas.c
@@ -2967,7 +2967,7 @@ mptsas_exp_repmanufacture_info(MPT_ADAPTER *ioc,
mutex_unlock(&ioc->sas_mgmt.mutex);
 out:
  

Re: [PATCH v2 bpf-next 4/8] tracepoint: compute num_args at build time

2018-03-21 Thread Alexei Starovoitov

On 3/21/18 12:44 PM, Linus Torvalds wrote:

On Wed, Mar 21, 2018 at 11:54 AM, Alexei Starovoitov  wrote:


add fancy macro to compute number of arguments passed into tracepoint
at compile time and store it as part of 'struct tracepoint'.


We should probably do this __COUNT() thing in some generic header, we
just talked last week about another use case entirely.


ok. Not sure which generic header though.
Should I move it to include/linux/kernel.h ?


And wouldn't it be nice to just have some generic infrastructure like this:

/*
 * This counts to ten.
 *
 * Any more than that, and we'd need to take off our shoes
 */
#define __GET_COUNT(_0,_1,_2,_3,_4,_5,_6,_7,_8,_9,_10,_n,...) _n
#define __COUNT(...) \
__GET_COUNT(__VA_ARGS__,10,9,8,7,6,5,4,3,2,1,0)
#define COUNT(...) __COUNT(dummy,##__VA_ARGS__)


since it will be a build time error, it's a good time to discuss
how many arguments we want to support in tracepoints and
in general in other places that would want to use this macro.

Like the only reason my patch is counting till 17 is because of
trace_iwlwifi_dev_ucode_error().
The next offenders are using 12 arguments:
trace_mc_event()
trace_mm_vmscan_lru_shrink_inactive()

Clearly not a very efficient usage of it:
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed,
stat.nr_dirty,  stat.nr_writeback,
stat.nr_congested, stat.nr_immediate,
stat.nr_activate, stat.nr_ref_keep,
stat.nr_unmap_fail,
sc->priority, file);
could have passed &stat instead.

I'd like to refactor that trace_iwlwifi_dev_ucode_error()
and from now on set the limit to 12.
Any offenders should be using tracepoints with <= 12 args
instead of extending the macro.
Does it sound reasonable?


#define __CONCAT(a,b) a##b
#define __CONCATENATE(a,b) __CONCAT(a,b)

and then you can do things like:

#define fn(...) __CONCATENATE(fn,COUNT(__VA_ARGS__))(__VA_ARGS__)

which turns "fn(x,y,z..)" into "fn<count>(x,y,z..)".

That can be useful for things like "max(a,b,c,d)" expanding to
"max4()", and then you can just have the trivial

  #define max3(a,b,c) max2(a, max2(b, c))


I can try that. Not sure my macro-fu is up to that level.
__CAST_TO_U64() macro from the next patch was difficult to make
work across compilers and architectures.



Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-21 Thread Richard Cochran
On Wed, Mar 21, 2018 at 10:44:36PM +0100, Andrew Lunn wrote:
> O.K, so let's do the 20 questions approach.

:)

> As far as i can see, this is not an MDIO device. It is not connected
> to the MDIO bus, it has no MDIO registers, you don't even pass a valid
> MDIO address in device tree.

Right.  There might very well be other products out there that *do*
use MDIO commands.  I know that there are MII time stamping asics and
ip cores on the market, but I don't know all of their creative design
details.
 
> Is it actually an MII bus snooper? Does it snoop, or is it actually in
> the MII bus, and can modify packets, i.e. insert time stamps as frames
> pass over the MII bus?

It acts like a "snooper" to provide out-of-band time stamps, but it
can also modify packets for the one-step functionality.
 
> When the driver talks about having three ports, does that mean it can
> be on three different MII busses?

Yes.

HTH,
Richard


Re: [PATCH REPOST v4 5/7] ixgbevf: keep writel() closer to wmb()

2018-03-21 Thread David Miller
From: Jeff Kirsher 
Date: Wed, 21 Mar 2018 14:48:08 -0700

> On Wed, 2018-03-21 at 14:56 -0400, Sinan Kaya wrote:
>> Remove ixgbevf_write_tail() in favor of moving writel() close to
>> wmb().
>> 
>> Signed-off-by: Sinan Kaya 
>> Reviewed-by: Alexander Duyck 
>> ---
>>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
>>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
>>  2 files changed, 2 insertions(+), 7 deletions(-)
> 
> This patch fails to compile because there is a call to
> ixgbevf_write_tail() which you missed cleaning up.

For a change with delicate side effects, it doesn't create much
confidence if the code does not even compile.

Sinan, please put more care into the changes you are making.

Thank you.


Re: [Intel-wired-lan] [PATCH REPOST v4 5/7] ixgbevf: keep writel() closer to wmb()

2018-03-21 Thread Alexander Duyck
On Wed, Mar 21, 2018 at 2:51 PM,   wrote:
> On 2018-03-21 17:48, Jeff Kirsher wrote:
>>
>> On Wed, 2018-03-21 at 14:56 -0400, Sinan Kaya wrote:
>>>
>>> Remove ixgbevf_write_tail() in favor of moving writel() close to
>>> wmb().
>>>
>>> Signed-off-by: Sinan Kaya 
>>> Reviewed-by: Alexander Duyck 
>>> ---
>>>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
>>>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
>>>  2 files changed, 2 insertions(+), 7 deletions(-)
>>
>>
>> This patch fails to compile because there is a call to
>> ixgbevf_write_tail() which you missed cleaning up.
>
>
> Hah, I did a compile test but maybe I missed something. I will get v6 of
> this patch only and leave the rest of the series as it is.

Actually you might want to just pull Jeff's tree and rebase before you
submit your patches. I suspect the difference is the ixgbevf XDP code
that is present in Jeff's tree and not in Dave's. The alternative is
to wait for Jeff to push the ixgbevf code and then once Dave has
pulled it you could rebase your patches.

Thanks.

- Alex


Re: [PATCH net-next RFC V1 3/5] net: Introduce field for the MII time stamper.

2018-03-21 Thread Richard Cochran
On Wed, Mar 21, 2018 at 12:12:00PM -0700, Florian Fainelli wrote:
> > +static int mdiobus_netdev_notification(struct notifier_block *nb,
> > +  unsigned long msg, void *ptr)
> > +{
> > +   struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
> > +   struct phy_device *phydev = netdev->phydev;
> > +   struct mdio_device *mdev;
> > +   struct mii_bus *bus;
> > +   int i;
> > +
> > +   if (netdev->mdiots || msg != NETDEV_UP || !phydev)
> > +   return NOTIFY_DONE;
> 
> You are still assuming that we have a phy_device somehow, whereas your
> patch series wants to solve that for generic MDIO devices; that is a bit
> confusing.

The phydev is the only thing that associates a netdev with an MII bus.

> > +
> > +   /*
> > +* Examine the MII bus associated with the PHY that is
> > +* attached to the MAC.  If there is a time stamping device
> > +* on the bus, then connect it to the network device.
> > +*/
> > +   bus = phydev->mdio.bus;
> > +
> > +   for (i = 0; i < PHY_MAX_ADDR; i++) {
> > +   mdev = bus->mdio_map[i];
> > +   if (!mdev)
> > +   continue;
> > +   if (mdiodev_supports_timestamping(mdev)) {
> > +   netdev->mdiots = mdev;
> > +   return NOTIFY_OK;
> 
> What guarantees that netdev->mdiots gets cleared?

Why would it need to be cleared?

> Also, why is this done
> with a notifier instead of through phy_{connect,attach,disconnect}?

We have no guarantee the mdio device has been probed yet.

> It
> looks like we still have this requirement of the mdio TS device being a
> phy_device somehow, I am confused here...

We only need the phydev to get from the netdev to the mii bus.
 
> > +   }
> > +   }
> > +
> > +   return NOTIFY_DONE;
> > +}
> > +
> >  #ifdef CONFIG_PM
> >  static int mdio_bus_suspend(struct device *dev)
> >  {
> 
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 5fbb9f1da7fd..223d691aa0b0 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1943,6 +1943,7 @@ struct net_device {
> > struct netprio_map __rcu *priomap;
> >  #endif
> > struct phy_device   *phydev;
> > +   struct mdio_device  *mdiots;
> 
> phy_device embedds a mdio_device, can you find a way to rework the PHY
> PTP code to utilize the phy_device's mdio instance so do not introduce
> yet another pointer in that big structure that net_device already is?

It would be strange and wrong to "steal" the phy's mdio struct, IMHO.
After all, we just got support for non-PHY mdio devices.  The natural
solution is to use it.

Thanks,
Richard


Re: [PATCH net-next v5 1/2] net: permit skb_segment on head_frag frag_list skb

2018-03-21 Thread Alexander Duyck
On Wed, Mar 21, 2018 at 1:36 PM, Yonghong Song  wrote:
> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
> function skb_segment(), line 3667. The bpf program attaches to
> clsact ingress, calls bpf_skb_change_proto to change protocol
> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
> to send the changed packet out.
>
> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
> 3473 netdev_features_t features)
> 3474 {
> 3475 struct sk_buff *segs = NULL;
> 3476 struct sk_buff *tail = NULL;
> ...
> 3665 while (pos < offset + len) {
> 3666 if (i >= nfrags) {
> 3667 BUG_ON(skb_headlen(list_skb));
> 3668
> 3669 i = 0;
> 3670 nfrags = skb_shinfo(list_skb)->nr_frags;
> 3671 frag = skb_shinfo(list_skb)->frags;
> 3672 frag_skb = list_skb;
> ...
>
> call stack:
> ...
>  #1 [883ffef03558] __crash_kexec at 8110c525
>  #2 [883ffef03620] crash_kexec at 8110d5cc
>  #3 [883ffef03640] oops_end at 8101d7e7
>  #4 [883ffef03668] die at 8101deb2
>  #5 [883ffef03698] do_trap at 8101a700
>  #6 [883ffef036e8] do_error_trap at 8101abfe
>  #7 [883ffef037a0] do_invalid_op at 8101acd0
>  #8 [883ffef037b0] invalid_op at 81a00bab
> [exception RIP: skb_segment+3044]
> RIP: 817e4dd4  RSP: 883ffef03860  RFLAGS: 00010216
> RAX: 2bf6  RBX: 883feb7aaa00  RCX: 0011
> RDX: 883fb87910c0  RSI: 0011  RDI: 883feb7ab500
> RBP: 883ffef03928   R8: 2ce2   R9: 27da
> R10: 01ea  R11: 2d82  R12: 883f90a1ee80
> R13: 883fb8791120  R14: 883feb7abc00  R15: 2ce2
> ORIG_RAX:   CS: 0010  SS: 0018
>  #9 [883ffef03930] tcp_gso_segment at 818713e7
> ---  ---
> ...
>
> The triggering input skb has the following properties:
> list_skb = skb->frag_list;
> skb->nfrags != NULL && skb_headlen(list_skb) != 0
> and skb_segment() is not able to handle a frag_list skb
> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>
> This patch addressed the issue by handling skb_headlen(list_skb) != 0
> case properly if list_skb->head_frag is true, which is expected in
> most cases. The head frag is processed before list_skb->frags
> are processed.
>
> Reported-by: Diptanu Gon Choudhury 
> Signed-off-by: Yonghong Song 
> ---
>  net/core/skbuff.c | 26 --
>  1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 715c134..23b317a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3460,6 +3460,19 @@ void *skb_pull_rcsum(struct sk_buff *skb, unsigned int 
> len)
>  }
>  EXPORT_SYMBOL_GPL(skb_pull_rcsum);
>
> +static inline skb_frag_t skb_head_frag_to_page_desc(struct sk_buff *frag_skb)
> +{
> +   skb_frag_t head_frag;
> +   struct page *page;
> +
> +   page = virt_to_head_page(frag_skb->head);
> +   head_frag.page.p = page;
> +   head_frag.page_offset = frag_skb->data -
> +   (unsigned char *)page_address(page);
> +   head_frag.size = skb_headlen(frag_skb);
> +   return head_frag;
> +}
> +
>  /**
>   * skb_segment - Perform protocol segmentation on skb.
>   * @head_skb: buffer to segment
> @@ -3664,15 +3677,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>
> while (pos < offset + len) {
> if (i >= nfrags) {
> -   BUG_ON(skb_headlen(list_skb));
> -
> i = 0;
> nfrags = skb_shinfo(list_skb)->nr_frags;
> frag = skb_shinfo(list_skb)->frags;
> -   frag_skb = list_skb;

You could probably leave this line in place. No point in moving it.

> -
> -   BUG_ON(!nfrags);
> +   if (skb_headlen(list_skb)) {
> +   BUG_ON(!list_skb->head_frag);
>
> +   /* to make room for head_frag. */
> +   i--; frag--;

Normally these should be two separate lines, one for "i--;" and one
for "frag--;".

> +   }

You could probably place the BUG_ON(!nfrags) in an else statement here
to handle the case where we have a potentially empty skb which would
be a bug.

> +   frag_skb = list_skb;
> if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
> skb_zerocopy_clone(nskb, frag_skb,

Re: [PATCH REPOST v4 5/7] ixgbevf: keep writel() closer to wmb()

2018-03-21 Thread okaya

On 2018-03-21 17:48, Jeff Kirsher wrote:

On Wed, 2018-03-21 at 14:56 -0400, Sinan Kaya wrote:

Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya 
Reviewed-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)


This patch fails to compile because there is a call to
ixgbevf_write_tail() which you missed cleaning up.


Hah, I did a compile test but maybe I missed something. I will get v6 of 
this patch only and leave the rest of the series as it is.


Re: [PATCH REPOST v4 5/7] ixgbevf: keep writel() closer to wmb()

2018-03-21 Thread Jeff Kirsher
On Wed, 2018-03-21 at 14:56 -0400, Sinan Kaya wrote:
> Remove ixgbevf_write_tail() in favor of moving writel() close to
> wmb().
> 
> Signed-off-by: Sinan Kaya 
> Reviewed-by: Alexander Duyck 
> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
>  2 files changed, 2 insertions(+), 7 deletions(-)

This patch fails to compile because there is a call to
ixgbevf_write_tail() which you missed cleaning up.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH net-next RFC V1 2/5] net: phy: Move time stamping interface into the generic mdio layer.

2018-03-21 Thread Richard Cochran
On Wed, Mar 21, 2018 at 12:10:07PM -0700, Florian Fainelli wrote:
> > +   phydev->mdio.ts_info = dp83640_ts_info;
> > +   phydev->mdio.hwtstamp = dp83640_hwtstamp;
> > +   phydev->mdio.rxtstamp = dp83640_rxtstamp;
> > +   phydev->mdio.txtstamp = dp83640_txtstamp;
> 
> Why is this implemented a the mdio_device level and not at the
> mdio_driver level? This looks like the wrong level at which this is done.

The question could be asked of:

struct mdio_device {
int (*bus_match)(struct device *dev, struct device_driver *drv);
void (*device_free)(struct mdio_device *mdiodev);
void (*device_remove)(struct mdio_device *mdiodev);
}

I saw how this is done for the phy, etc, but I don't see any benefit
of doing it that way.  It would add an extra layer (or two) of
indirection and save the space of four function pointers.  Is that
trade-off worth it?

Thanks,
Richard


Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-21 Thread Andrew Lunn
Hi Richard

> The only other docs that I have is a PDF of the register layout, but I
> don't think I can redistribute that.  Actually, there really isn't any
> detail in that doc at all.

O.K, so let's do the 20 questions approach.

As far as i can see, this is not an MDIO device. It is not connected
to the MDIO bus, it has no MDIO registers, you don't even pass a valid
MDIO address in device tree.

Is it actually an MII bus snooper? Does it snoop, or is it actually in
the MII bus, and can modify packets, i.e. insert time stamps as frames
pass over the MII bus?

When the driver talks about having three ports, does that mean it can
be on three different MII busses?

Thanks
   Andrew


Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-21 Thread Richard Cochran
On Wed, Mar 21, 2018 at 08:33:15PM +0100, Andrew Lunn wrote:
> Can you point us at some documentation for this.

The overall one-step functionality is described IEEE 1588.

> I think Florian and I want to better understand how this device works,
> in order to understand your other changes.

The device is from here:

   
https://www.zhaw.ch/en/engineering/institutes-centres/ines/products-and-services/ptp-ieee-1588/ptp-hardware/#c43991

The only other docs that I have is a PDF of the register layout, but I
don't think I can redistribute that.  Actually, there really isn't any
detail in that doc at all.

Thanks,
Richard



Re: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP time stamping.

2018-03-21 Thread Richard Cochran
On Wed, Mar 21, 2018 at 08:05:36PM +, Keller, Jacob E wrote:
> I am guessing that we expect all devices which support onestep P2P messages, 
> will always support onestep SYNC as well?

Yes.  Anything else doesn't make sense, don't you think?

Also, reading 1588, it isn't clear whether supporting only 1-step Sync
without 1-step P2P is even intended.  There is only a "one-step
clock", and it is described as doing both.

Thanks,
Richard


Re: [PATCH][next] gre: fix TUNNEL_SEQ bit check on sequence numbering

2018-03-21 Thread William Tu
On Wed, Mar 21, 2018 at 12:34 PM, Colin King  wrote:
> From: Colin Ian King 
>
> The current logic of flags | TUNNEL_SEQ is always non-zero and hence
> sequence numbers are always incremented no matter the setting of the
> TUNNEL_SEQ bit.  Fix this by using & instead of |.
>
> Detected by CoverityScan, CID#1466039 ("Operands don't affect result")
>
> Fixes: 77a5196a804e ("gre: add sequence number for collect md mode.")
> Signed-off-by: Colin Ian King 

Thanks for the fix!
btw, how can I access the CoverityScan result with this CID?

Acked-by: William Tu 


> ---
>  net/ipv4/ip_gre.c  | 2 +-
>  net/ipv6/ip6_gre.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 2fa2ef2e2af9..9ab1aa2f7660 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -550,7 +550,7 @@ static void gre_fb_xmit(struct sk_buff *skb, struct 
> net_device *dev,
> (TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ);
> gre_build_header(skb, tunnel_hlen, flags, proto,
>  tunnel_id_to_key32(tun_info->key.tun_id),
> -(flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
> +(flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
>
> df = key->tun_flags & TUNNEL_DONT_FRAGMENT ?  htons(IP_DF) : 0;
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index 0bcefc480aeb..3a98c694da5f 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -725,7 +725,7 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
> gre_build_header(skb, tunnel->tun_hlen,
>  flags, protocol,
>  tunnel_id_to_key32(tun_info->key.tun_id),
> -(flags | TUNNEL_SEQ) ? 
> htonl(tunnel->o_seqno++)
> +(flags & TUNNEL_SEQ) ? 
> htonl(tunnel->o_seqno++)
>   : 0);
>
> } else {
> --
> 2.15.1
>


Re: [PATCH net-next v2] net: mvpp2: Don't use dynamic allocs for local variables

2018-03-21 Thread Maxime Chevallier
Hello Yan,

On Wed, 21 Mar 2018 19:57:47 +,
Yan Markman  wrote :

> Hi Maxime

Please avoid top-posting on this list.

> Please check the TWO points:
> 
> 1). The mvpp2_prs_flow_find() returns the TID if found;
> TID=0 is a valid "found" value.
> For not-found, use -ENOENT (just like your mvpp2_prs_vlan_find).

This is actually what is used in this patch. You might be refering to
a previous draft version of this patch.

> 2). The original code always uses "mvpp2_prs_entry *pe" storage,
> zero-allocated. Please check the correctness of the new "mvpp2_prs_entry
> pe" without memset(&pe, 0, sizeof(pe))
> in all procedures where pe=kzalloc() has been replaced.

I think we're good in that regard. In places where I didn't memset the
prs_entry, the pe.index field is set, and this is followed by a read
from TCAM that will initialize the prs_entry to the correct value:

pe.index = tid;
mvpp2_prs_hw_read(priv, &pe);

> Thanks
> Yan Markman

[...]

Thanks,

Maxime


Re: [PATCH V2 net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-21 Thread Eric Dumazet


On 03/21/2018 02:01 PM, Saeed Mahameed wrote:
> From: Ilya Lesokhin 
> 
> This patch adds a generic infrastructure to offload TLS crypto to a

...

> +
> +static inline int tls_push_record(struct sock *sk,
> +   struct tls_context *ctx,
> +   struct tls_offload_context *offload_ctx,
> +   struct tls_record_info *record,
> +   struct page_frag *pfrag,
> +   int flags,
> +   unsigned char record_type)
> +{
> + skb_frag_t *frag;
> + struct tcp_sock *tp = tcp_sk(sk);
> + struct page_frag fallback_frag;
> + struct page_frag  *tag_pfrag = pfrag;
> + int i;
> +
> + /* fill prepand */
> + frag = &record->frags[0];
> + tls_fill_prepend(ctx,
> +  skb_frag_address(frag),
> +  record->len - ctx->prepend_size,
> +  record_type);
> +
> + if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag, GFP_KERNEL))) {
> + /* HW doesn't care about the data in the tag
> +  * so in case pfrag has no room
> +  * for a tag and we can't allocate a new pfrag
> +  * just use the page in the first frag
> +  * rather then write a complicated fall back code.
> +  */
> + tag_pfrag = &fallback_frag;
> + tag_pfrag->page = skb_frag_page(frag);
> + tag_pfrag->offset = 0;
> + }
> +

If the HW does not care, why even try to call skb_page_frag_refill()?

If you remove it, then we remove one seldom used path and might uncover bugs

This part looks very suspect to me, to be honest.



[PATCH V2 net-next 03/14] net: Add Software fallback infrastructure for socket dependent offloads

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

With socket dependent offloads we rely on the netdev to transform
the transmitted packets before sending them to the wire.
When a packet from an offloaded socket is rerouted to a different
device we need to detect it and do the transformation in software.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 include/net/sock.h | 21 +
 net/Kconfig|  4 
 net/core/dev.c |  4 
 3 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index b9624581d639..92a0e0c54ac1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -481,6 +481,11 @@ struct sock {
void(*sk_error_report)(struct sock *sk);
int (*sk_backlog_rcv)(struct sock *sk,
  struct sk_buff *skb);
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+   struct sk_buff* (*sk_validate_xmit_skb)(struct sock *sk,
+   struct net_device *dev,
+   struct sk_buff *skb);
+#endif
void(*sk_destruct)(struct sock *sk);
struct sock_reuseport __rcu *sk_reuseport_cb;
struct rcu_head sk_rcu;
@@ -2323,6 +2328,22 @@ static inline bool sk_fullsock(const struct sock *sk)
return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
 }
 
+/* Checks if this SKB belongs to an HW offloaded socket
+ * and whether any SW fallbacks are required based on dev.
+ */
+static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
+  struct net_device *dev)
+{
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+   struct sock *sk = skb->sk;
+
+   if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
+   skb = sk->sk_validate_xmit_skb(sk, dev, skb);
+#endif
+
+   return skb;
+}
+
 /* This helper checks if a socket is a LISTEN or NEW_SYN_RECV
  * SYNACK messages can be attached to either ones (depending on SYNCOOKIE)
  */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..fe84cfe3260e 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -407,6 +407,10 @@ config GRO_CELLS
bool
default n
 
+config SOCK_VALIDATE_XMIT
+   bool
+   default n
+
 config NET_DEVLINK
tristate "Network physical/parent device Netlink interface"
help
diff --git a/net/core/dev.c b/net/core/dev.c
index d8887cc38e7b..244a4c7ab266 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3086,6 +3086,10 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff 
*skb, struct net_device
if (unlikely(!skb))
goto out_null;
 
+   skb = sk_validate_xmit_skb(skb, dev);
+   if (unlikely(!skb))
+   goto out_null;
+
if (netif_needs_gso(skb, features)) {
struct sk_buff *segs;
 
-- 
2.14.3



[PATCH V2 net-next 02/14] net: Rename and export copy_skb_header

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

copy_skb_header is renamed to skb_copy_header and
exported. Exposing this function give more flexibility
in copying SKBs.
skb_copy and skb_copy_expand do not give enough control
over which parts are copied.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c  | 9 +
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d8340e6e8814..dc0f81277723 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1031,6 +1031,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned 
int size,
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t priority);
 struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
   gfp_t gfp_mask, bool fclone);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c13495ba6..9ae1812fb705 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1304,7 +1304,7 @@ static void skb_headers_offset_update(struct sk_buff 
*skb, int off)
skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
__copy_skb_header(new, old);
 
@@ -1312,6 +1312,7 @@ static void copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(skb_copy_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
@@ -1354,7 +1355,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t 
gfp_mask)
 
BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
 
-   copy_skb_header(n, skb);
+   skb_copy_header(n, skb);
return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1418,7 +1419,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, 
int headroom,
skb_clone_fraglist(n);
}
 
-   copy_skb_header(n, skb);
+   skb_copy_header(n, skb);
 out:
return n;
 }
@@ -1598,7 +1599,7 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
BUG_ON(skb_copy_bits(skb, -head_copy_len, n->head + head_copy_off,
 skb->len + head_copy_len));
 
-   copy_skb_header(n, skb);
+   skb_copy_header(n, skb);
 
skb_headers_offset_update(n, newheadroom - oldheadroom);
 
-- 
2.14.3



[PATCH V2 net-next 04/14] net: Add TLS offload netdev ops

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

Add new netdev ops to add and delete a TLS context.

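[Editor's note] The proposed op table can be exercised in isolation. Below is a minimal userspace sketch: the stub struct declarations and the `demo_*` implementations are illustrative stand-ins, and only the shape of `tlsdev_ops` and the add/del call flow mirror the patch.

```c
#include <stddef.h>

/* Opaque stand-ins for the kernel types referenced by the ops. */
struct net_device;
struct sock;
struct tls_crypto_info;
struct tls_context;
typedef unsigned int u32;

enum tls_offload_ctx_dir {
	TLS_OFFLOAD_CTX_DIR_RX,
	TLS_OFFLOAD_CTX_DIR_TX,
};

/* Same shape as the tlsdev_ops added by the patch. */
struct tlsdev_ops {
	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
			   enum tls_offload_ctx_dir direction,
			   struct tls_crypto_info *crypto_info,
			   u32 start_offload_tcp_sn);
	void (*tls_dev_del)(struct net_device *netdev,
			    struct tls_context *ctx,
			    enum tls_offload_ctx_dir direction);
};

static int added; /* tracks live offload contexts in this sketch */

static int demo_tls_dev_add(struct net_device *netdev, struct sock *sk,
			    enum tls_offload_ctx_dir direction,
			    struct tls_crypto_info *crypto_info,
			    u32 start_offload_tcp_sn)
{
	if (direction != TLS_OFFLOAD_CTX_DIR_TX)
		return -1; /* this series only implements TX offload */
	added++;
	return 0;
}

static void demo_tls_dev_del(struct net_device *netdev,
			     struct tls_context *ctx,
			     enum tls_offload_ctx_dir direction)
{
	added--;
}

/* A driver wires the ops up once, then the TLS layer calls them. */
static const struct tlsdev_ops demo_tls_ops = {
	.tls_dev_add = demo_tls_dev_add,
	.tls_dev_del = demo_tls_dev_del,
};
```

In the real series, the mlx5e driver assigns such a table to `netdev->tlsdev_ops` (patch 10).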
Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 include/linux/netdevice.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 913b1cc882cf..e1fef7bb6ed4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -864,6 +864,26 @@ struct xfrmdev_ops {
 };
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+enum tls_offload_ctx_dir {
+   TLS_OFFLOAD_CTX_DIR_RX,
+   TLS_OFFLOAD_CTX_DIR_TX,
+};
+
+struct tls_crypto_info;
+struct tls_context;
+
+struct tlsdev_ops {
+   int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
+  enum tls_offload_ctx_dir direction,
+  struct tls_crypto_info *crypto_info,
+  u32 start_offload_tcp_sn);
+   void (*tls_dev_del)(struct net_device *netdev,
+   struct tls_context *ctx,
+   enum tls_offload_ctx_dir direction);
+};
+#endif
+
 struct dev_ifalias {
struct rcu_head rcuhead;
char ifalias[];
@@ -1748,6 +1768,10 @@ struct net_device {
const struct xfrmdev_ops *xfrmdev_ops;
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+   const struct tlsdev_ops *tlsdev_ops;
+#endif
+
const struct header_ops *header_ops;
 
unsigned intflags;
-- 
2.14.3



[PATCH V2 net-next 12/14] net/mlx5e: TLS, Add error statistics

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

Add statistics for rare TLS-related errors.
Since the errors are rare, we keep a counter per netdev
rather than per SQ.

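[Editor's note] The per-netdev counter scheme can be sketched in userspace with C11 atomics standing in for the kernel's `atomic64_t`. The stats field names follow the patch's `mlx5e_tls_sw_stats`; the report helper is illustrative.

```c
#include <stdatomic.h>

/* One shared counter set per device: every SQ increments the same
 * atomics. For rare error events the cache-line contention is a
 * non-issue, and it keeps ethtool reporting simple. */
struct tls_sw_stats {
	atomic_ullong tx_tls_drop_metadata;
	atomic_ullong tx_tls_drop_resync_alloc;
	atomic_ullong tx_tls_drop_no_sync_data;
	atomic_ullong tx_tls_drop_bypass_required;
};

/* Any TX path reports an error against the device-wide stats. */
static void report_drop_metadata(struct tls_sw_stats *stats)
{
	atomic_fetch_add(&stats->tx_tls_drop_metadata, 1);
}
```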
Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  3 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 22 ++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 22 ++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 24 +++---
 .../mellanox/mlx5/core/en_accel/tls_stats.c| 89 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  4 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 22 ++
 8 files changed, 178 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index ec785f589666..a7135f5d5cf6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,6 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
en_accel/ipsec_stats.o
 
-mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o en_accel/tls_stats.o
 
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7d8696fca826..d397be0b5885 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -795,6 +795,9 @@ struct mlx5e_priv {
 #ifdef CONFIG_MLX5_EN_IPSEC
struct mlx5e_ipsec*ipsec;
 #endif
+#ifdef CONFIG_MLX5_EN_TLS
+   struct mlx5e_tls  *tls;
+#endif
 };
 
 struct mlx5e_profile {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index aa6981c98bdc..d167845271c3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -173,3 +173,25 @@ void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
netdev->hw_features |= NETIF_F_HW_TLS_TX;
	netdev->tlsdev_ops = &mlx5e_tls_ops;
 }
+
+int mlx5e_tls_init(struct mlx5e_priv *priv)
+{
+   struct mlx5e_tls *tls = kzalloc(sizeof(*tls), GFP_KERNEL);
+
+   if (!tls)
+   return -ENOMEM;
+
+   priv->tls = tls;
+   return 0;
+}
+
+void mlx5e_tls_cleanup(struct mlx5e_priv *priv)
+{
+   struct mlx5e_tls *tls = priv->tls;
+
+   if (!tls)
+   return;
+
+   kfree(tls);
+   priv->tls = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
index f7216b9b98e2..b6162178f621 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -38,6 +38,17 @@
 #include 
 #include "en.h"
 
+struct mlx5e_tls_sw_stats {
+   atomic64_t tx_tls_drop_metadata;
+   atomic64_t tx_tls_drop_resync_alloc;
+   atomic64_t tx_tls_drop_no_sync_data;
+   atomic64_t tx_tls_drop_bypass_required;
+};
+
+struct mlx5e_tls {
+   struct mlx5e_tls_sw_stats sw_stats;
+};
+
 struct mlx5e_tls_offload_context {
struct tls_offload_context base;
u32 expected_seq;
@@ -55,10 +66,21 @@ mlx5e_get_tls_tx_context(struct tls_context *tls_ctx)
 }
 
 void mlx5e_tls_build_netdev(struct mlx5e_priv *priv);
+int mlx5e_tls_init(struct mlx5e_priv *priv);
+void mlx5e_tls_cleanup(struct mlx5e_priv *priv);
+
+int mlx5e_tls_get_count(struct mlx5e_priv *priv);
+int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data);
+int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data);
 
 #else
 
 static inline void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) { }
+static inline int mlx5e_tls_init(struct mlx5e_priv *priv) { return 0; }
+static inline void mlx5e_tls_cleanup(struct mlx5e_priv *priv) { }
+static inline int mlx5e_tls_get_count(struct mlx5e_priv *priv) { return 0; }
+static inline int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data) { return 0; }
+static inline int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data) { return 0; }
 
 #endif
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index 49e8d455ebc3..ad2790fb5966 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -164,7 +164,8 @@ static struct sk_buff *
 

[PATCH V2 net-next 11/14] net/mlx5e: TLS, Add Innova TLS TX offload data path

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

Implement the TLS tx offload data path according to the
requirements of the TLS generic NIC offload infrastructure.

Special metadata ethertype is used to pass information to
the hardware.

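[Editor's note] Assuming only what the series defines elsewhere — `MLX5E_METADATA_ETHER_TYPE` (0x8CE4) and an 8-byte header length (patch 08) — the metadata framing can be sketched as below. The six bytes after the ethertype are treated as opaque here, since their layout is driver-internal.

```c
#include <stdint.h>
#include <string.h>

#define MLX5E_METADATA_ETHER_TYPE 0x8CE4  /* from patch 08 of this series */
#define MLX5E_METADATA_ETHER_LEN  8

/* Build the 8-byte metadata pseudo-header: big-endian ethertype
 * followed by six bytes of driver-defined payload (opaque here). */
static void build_metadata(uint8_t out[MLX5E_METADATA_ETHER_LEN],
			   const uint8_t payload[6])
{
	out[0] = MLX5E_METADATA_ETHER_TYPE >> 8;   /* 0x8C */
	out[1] = MLX5E_METADATA_ETHER_TYPE & 0xff; /* 0xE4 */
	memcpy(out + 2, payload, 6);
}
```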
Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  15 ++
 .../mellanox/mlx5/core/en_accel/en_accel.h |  72 ++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c |   2 +
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 272 +
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h |  50 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  10 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c|  37 +--
 10 files changed, 455 insertions(+), 16 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 50872ed30c0b..ec785f589666 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,6 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
en_accel/ipsec_stats.o
 
-mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o
 
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 6660986285bf..7d8696fca826 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -340,6 +340,7 @@ struct mlx5e_sq_dma {
 enum {
MLX5E_SQ_STATE_ENABLED,
MLX5E_SQ_STATE_IPSEC,
+   MLX5E_SQ_STATE_TLS,
 };
 
 struct mlx5e_sq_wqe_info {
@@ -824,6 +825,8 @@ void mlx5e_build_ptys2ethtool_map(void);
 u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
   void *accel_priv, select_queue_fallback_t fallback);
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
+netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+ struct mlx5e_tx_wqe *wqe, u16 pi);
 
 void mlx5e_completion_event(struct mlx5_core_cq *mcq);
 void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
@@ -929,6 +932,18 @@ static inline bool mlx5e_tunnel_inner_ft_supported(struct mlx5_core_dev *mdev)
	       MLX5_CAP_FLOWTABLE_NIC_RX(mdev, ft_field_support.inner_ip_version));
 }
 
+static inline void mlx5e_sq_fetch_wqe(struct mlx5e_txqsq *sq,
+ struct mlx5e_tx_wqe **wqe,
+ u16 *pi)
+{
+   struct mlx5_wq_cyc *wq;
+
+	wq = &sq->wq;
+   *pi = sq->pc & wq->sz_m1;
+   *wqe = mlx5_wq_cyc_get_wqe(wq, *pi);
+   memset(*wqe, 0, sizeof(**wqe));
+}
+
 static inline
 struct mlx5e_tx_wqe *mlx5e_post_nop(struct mlx5_wq_cyc *wq, u32 sqn, u16 *pc)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
new file mode 100644
index ..68fcb40a2847
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -0,0 +1,72 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A 

[PATCH V2 net-next 09/14] net/mlx5: Accel, Add TLS tx offload interface

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

Add routines for manipulating TLS TX offload contexts.

In Innova TLS, TLS contexts are added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.

Add implementation for Innova TLS (FPGA-based) hardware.

These routines will be used by the TLS offload support in a later patch.

mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs
to work directly with mlx5_core rather than Innova FPGA or other mlx5
acceleration providers.

In the future, when IPSec/TLS or any other acceleration gets integrated
into ConnectX chip, mlx5/accel layer will provide the integrated
acceleration, rather than the Innova one.

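[Editor's note] The indirection described above can be sketched in userspace. The provider table below is an illustrative stand-in (the patch dispatches with direct calls to the FPGA routines); only the idea — ULP-facing `accel_*` entry points independent of the backing provider — mirrors the series.

```c
/* A provider is whatever backs the acceleration: today the Innova
 * FPGA, in the future possibly the ConnectX chip itself. */
struct accel_provider {
	int (*add_tx_flow)(unsigned int start_offload_tcp_sn);
};

static int fpga_add_tx_flow(unsigned int start_offload_tcp_sn)
{
	(void)start_offload_tcp_sn;
	return 0; /* would send a command message over the SBU connection */
}

static const struct accel_provider fpga_provider = {
	.add_tx_flow = fpga_add_tx_flow,
};

/* Middle layer: the API mlx5e and other ULPs call, which forwards
 * to the provider without exposing provider details upward. */
static int accel_tls_add_tx_flow(const struct accel_provider *p,
				 unsigned int start_offload_tcp_sn)
{
	return p->add_tx_flow(start_offload_tcp_sn);
}
```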
Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   4 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c|  71 +++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h|  86 
 .../net/ethernet/mellanox/mlx5/core/fpga/core.h|   1 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 563 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  68 +++
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  11 +
 include/linux/mlx5/mlx5_ifc.h  |  16 -
 include/linux/mlx5/mlx5_ifc_fpga.h |  77 +++
 9 files changed, 879 insertions(+), 18 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index c805769d92a9..9989e5265a45 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -8,10 +8,10 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
fs_counters.o rl.o lag.o dev.o wq.o lib/gid.o lib/clock.o \
diag/fs_tracepoint.o
 
-mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o
+mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o accel/tls.o
 
 mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o fpga/conn.o fpga/sdk.o \
-   fpga/ipsec.o
+   fpga/ipsec.o fpga/tls.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o 
\
en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o \
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
new file mode 100644
index ..77ac19f38cbe
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include 
+
+#include "accel/tls.h"
+#include "mlx5_core.h"
+#include "fpga/tls.h"
+
+int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+  struct tls_crypto_info *crypto_info,
+  u32 start_offload_tcp_sn, u32 *p_swid)
+{
+   return mlx5_fpga_tls_add_tx_flow(mdev, flow, crypto_info,
+start_offload_tcp_sn, p_swid);
+}
+
+void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid)
+{
+   mlx5_fpga_tls_del_tx_flow(mdev, swid, 

[PATCH V2 net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

This patch adds a generic infrastructure to offload TLS crypto to
network devices. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path,
leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to
the TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS records"; these
records contain plaintext instead of ciphertext, and placeholder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is acked;
this is done through the new icsk_clean_acked callback.

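[Editor's note] The TCP-sequence-to-record mapping described in point 2 amounts to a search over records ordered by their ending sequence number. A minimal userspace sketch (an array in place of the kernel's linked list, with a simplified wrap-around-safe comparison):

```c
#include <stdint.h>
#include <stddef.h>

struct record {
	uint32_t end_seq; /* TCP seq one past the record's last byte */
};

/* 32-bit sequence-space comparison (wrap-around safe), in the
 * spirit of the kernel's before() macro. */
static int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Records are contiguous and kept in transmit order, so the first
 * record with seq < end_seq is the one covering seq. Returns NULL
 * when seq is past the last record. */
static struct record *find_record(struct record *recs, size_t n,
				  uint32_t seq)
{
	for (size_t i = 0; i < n; i++)
		if (seq_before(seq, recs[i].end_seq))
			return &recs[i];
	return NULL;
}
```

The kernel version additionally caches a retransmit hint so repeated lookups near the same sequence number stay cheap.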
The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN, the
driver calls the TLS layer to obtain the TLS record that includes the
TCP sequence of the packet for transmission. Using this TLS record, the
driver posts a work entry on the transmit queue to reconstruct the NIC
TLS state required for the offload of the out-of-order packet. It
updates the expected TCP SN accordingly and transmits the now in-order
packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.

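[Editor's note] The driver-side expected-SN bookkeeping above can be sketched as follows. This is a userspace approximation: the resync step is reduced to a counter at the point where a real driver would fetch the TLS record and post a context-reconstruction work entry on the same queue.

```c
#include <stdbool.h>
#include <stdint.h>

/* Driver-side view of one offloaded stream: the TCP sequence number
 * the NIC expects to see next on this stream. */
struct offload_stream {
	uint32_t expected_tcp_sn;
	unsigned int resyncs; /* context reconstructions posted */
};

/* Transmit a packet covering [seq, seq + len).
 * In order: advance the expected SN and send with the offload flag.
 * Out of order (e.g. a retransmission): first post a resync work
 * entry, after which the packet is in order again. */
static bool tx_packet(struct offload_stream *s, uint32_t seq, uint32_t len)
{
	if (seq != s->expected_tcp_sn) {
		/* a real driver would look up the TLS record covering
		 * 'seq' and post a reconstruction WQE here */
		s->resyncs++;
		s->expected_tcp_sn = seq;
	}
	s->expected_tcp_sn += len;
	return true; /* sent with TLS offload indication */
}
```

Because the resync entry shares the transmit queue with data, ordering is preserved without flushing the queue, exactly the property the paragraph above calls out.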
Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 include/net/tls.h |  74 +++-
 net/tls/Kconfig   |  10 +
 net/tls/Makefile  |   2 +
 net/tls/tls_device.c  | 793 ++
 net/tls/tls_device_fallback.c | 415 ++
 net/tls/tls_main.c|  33 +-
 6 files changed, 1320 insertions(+), 7 deletions(-)
 create mode 100644 net/tls/tls_device.c
 create mode 100644 net/tls/tls_device_fallback.c

diff --git a/include/net/tls.h b/include/net/tls.h
index 4913430ab807..0bfb1b0a156a 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -77,6 +77,37 @@ struct tls_sw_context {
struct scatterlist sg_aead_out[2];
 };
 
+struct tls_record_info {
+   struct list_head list;
+   u32 end_seq;
+   int len;
+   int num_frags;
+   skb_frag_t frags[MAX_SKB_FRAGS];
+};
+
+struct tls_offload_context {
+   struct crypto_aead *aead_send;
+   spinlock_t lock;/* protects records list */
+   struct list_head records_list;
+   struct tls_record_info *open_record;
+   struct tls_record_info *retransmit_hint;
+   u64 hint_record_sn;
+   u64 unacked_record_sn;
+
+   struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
+   void (*sk_destruct)(struct sock *sk);
+   u8 driver_state[];
+   /* The TLS layer reserves room for driver specific state
+* Currently the belief is that there is not enough
+* driver specific state to justify another layer of indirection
+*/
+#define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *)))
+};
+
+#define TLS_OFFLOAD_CONTEXT_SIZE                                       \
+   (ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +   \
+TLS_DRIVER_STATE_SIZE)
+
 enum {
TLS_PENDING_CLOSED_RECORD
 };
@@ -87,6 +118,10 @@ struct tls_context {
struct tls12_crypto_info_aes_gcm_128 crypto_send_aes_gcm_128;
};
 
+   struct list_head list;
+   struct net_device *netdev;
+   

[PATCH V2 net-next 13/14] MAINTAINERS: Update mlx5 innova driver maintainers

2018-03-21 Thread Saeed Mahameed
From: Boris Pismenny 

Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 MAINTAINERS | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 214c9bca232a..cd4067ccf959 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8913,26 +8913,17 @@ W:  http://www.mellanox.com
 Q: http://patchwork.ozlabs.org/project/netdev/list/
 F: drivers/net/ethernet/mellanox/mlx5/core/en_*
 
-MELLANOX ETHERNET INNOVA DRIVER
-M: Ilan Tayari 
-R: Boris Pismenny 
+MELLANOX ETHERNET INNOVA DRIVERS
+M: Boris Pismenny 
 L: netdev@vger.kernel.org
 S: Supported
 W: http://www.mellanox.com
 Q: http://patchwork.ozlabs.org/project/netdev/list/
+F: drivers/net/ethernet/mellanox/mlx5/core/en_accel/*
+F: drivers/net/ethernet/mellanox/mlx5/core/accel/*
 F: drivers/net/ethernet/mellanox/mlx5/core/fpga/*
 F: include/linux/mlx5/mlx5_ifc_fpga.h
 
-MELLANOX ETHERNET INNOVA IPSEC DRIVER
-M: Ilan Tayari 
-R: Boris Pismenny 
-L: netdev@vger.kernel.org
-S: Supported
-W: http://www.mellanox.com
-Q: http://patchwork.ozlabs.org/project/netdev/list/
-F: drivers/net/ethernet/mellanox/mlx5/core/en_ipsec/*
-F: drivers/net/ethernet/mellanox/mlx5/core/ipsec*
-
 MELLANOX ETHERNET SWITCH DRIVERS
 M: Jiri Pirko 
 M: Ido Schimmel 
-- 
2.14.3



[PATCH V2 net-next 14/14] MAINTAINERS: Update TLS maintainers

2018-03-21 Thread Saeed Mahameed
From: Boris Pismenny 

Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index cd4067ccf959..285ea4e6c580 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9711,7 +9711,7 @@ F: net/netfilter/xt_CONNSECMARK.c
 F: net/netfilter/xt_SECMARK.c
 
 NETWORKING [TLS]
-M: Ilya Lesokhin 
+M: Boris Pismenny 
 M: Aviad Yehezkel 
 M: Dave Watson 
 L: netdev@vger.kernel.org
-- 
2.14.3



[PATCH V2 net-next 10/14] net/mlx5e: TLS, Add Innova TLS TX support

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

Add NETIF_F_HW_TLS_TX capability and expose tlsdev_ops to work with the
TLS generic NIC offload infrastructure.
The NETIF_F_HW_TLS_TX capability will be added in the next patch.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig|  11 ++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 173 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  65 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 +
 5 files changed, 254 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 25deaa5a534c..6befd2c381b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -85,3 +85,14 @@ config MLX5_EN_IPSEC
  Build support for IPsec cryptography-offload accelaration in the NIC.
  Note: Support for hardware with this capability needs to be selected
  for this option to become available.
+
+config MLX5_EN_TLS
+   bool "TLS cryptography-offload accelaration"
+   depends on MLX5_CORE_EN
+   depends on TLS_DEVICE
+   depends on MLX5_ACCEL
+   default n
+   ---help---
+ Build support for TLS cryptography-offload accelaration in the NIC.
+ Note: Support for hardware with this capability needs to be selected
+ for this option to become available.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9989e5265a45..50872ed30c0b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,4 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
en_accel/ipsec_stats.o
 
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o
+
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
new file mode 100644
index ..38d88108a55a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -0,0 +1,173 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include 
+#include 
+#include "en_accel/tls.h"
+#include "accel/tls.h"
+
+static void mlx5e_tls_set_ipv4_flow(void *flow, struct sock *sk)
+{
+   struct inet_sock *inet = inet_sk(sk);
+
+   MLX5_SET(tls_flow, flow, ipv6, 0);
+   memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
+	       &inet->inet_daddr, MLX5_FLD_SZ_BYTES(ipv4_layout, ipv4));
+   memcpy(MLX5_ADDR_OF(tls_flow, flow, src_ipv4_src_ipv6.ipv4_layout.ipv4),
+	       &inet->inet_rcv_saddr, MLX5_FLD_SZ_BYTES(ipv4_layout, ipv4));
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static void mlx5e_tls_set_ipv6_flow(void *flow, struct sock *sk)
+{
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   MLX5_SET(tls_flow, flow, ipv6, 1);
+   memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
+	       &sk->sk_v6_daddr, MLX5_FLD_SZ_BYTES(ipv6_layout, 

[PATCH V2 net-next 07/14] net/tls: Support TLS device offload with IPv6

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

Previously get_netdev_for_sock worked only with IPv4.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 net/tls/tls_device.c | 49 -
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index e623280ea019..c35fc107d9c5 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -34,6 +34,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -99,13 +104,55 @@ static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
	spin_unlock_irqrestore(&tls_device_lock, flags);
 }
 
+static inline struct net_device *ipv6_get_netdev(struct sock *sk)
+{
+   struct net_device *dev = NULL;
+#if IS_ENABLED(CONFIG_IPV6)
+   struct inet_sock *inet = inet_sk(sk);
+   struct ipv6_pinfo *np = inet6_sk(sk);
+   struct flowi6 _fl6, *fl6 = &_fl6;
+   struct dst_entry *dst;
+
+   memset(fl6, 0, sizeof(*fl6));
+   fl6->flowi6_proto = sk->sk_protocol;
+   fl6->daddr = sk->sk_v6_daddr;
+   fl6->saddr = np->saddr;
+   fl6->flowlabel = np->flow_label;
+   IP6_ECN_flow_xmit(sk, fl6->flowlabel);
+   fl6->flowi6_oif = sk->sk_bound_dev_if;
+   fl6->flowi6_mark = sk->sk_mark;
+   fl6->fl6_sport = inet->inet_sport;
+   fl6->fl6_dport = inet->inet_dport;
+   fl6->flowi6_uid = sk->sk_uid;
+   security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
+
+	if (ipv6_stub->ipv6_dst_lookup(sock_net(sk), sk, &dst, fl6) < 0)
+   return NULL;
+
+   dev = dst->dev;
+   dev_hold(dev);
+   dst_release(dst);
+
+#endif
+   return dev;
+}
+
 /* We assume that the socket is already connected */
 static struct net_device *get_netdev_for_sock(struct sock *sk)
 {
struct inet_sock *inet = inet_sk(sk);
struct net_device *netdev = NULL;
 
-   netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
+   if (sk->sk_family == AF_INET)
+   netdev = dev_get_by_index(sock_net(sk),
+ inet->cork.fl.flowi_oif);
+   else if (sk->sk_family == AF_INET6) {
+   netdev = ipv6_get_netdev(sk);
+   if (!netdev && !sk->sk_ipv6only &&
+	    ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED)
+   netdev = dev_get_by_index(sock_net(sk),
+ inet->cork.fl.flowi_oif);
+   }
 
return netdev;
 }
-- 
2.14.3



[PATCH V2 net-next 08/14] net/mlx5e: Move defines out of ipsec code

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

The defines are not IPSEC specific.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h | 3 ---
 drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c | 5 +----
 drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h   | 2 ++
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 4c9360b25532..6660986285bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,9 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+#define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
+#define MLX5E_METADATA_ETHER_LEN 8
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
index 1198fc1eba4c..93bf10e6508c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
@@ -45,9 +45,6 @@
 #define MLX5E_IPSEC_SADB_RX_BITS 10
 #define MLX5E_IPSEC_ESN_SCOPE_MID 0x8000L
 
-#define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
-#define MLX5E_METADATA_ETHER_LEN 8
-
 struct mlx5e_priv;
 
 struct mlx5e_ipsec_sw_stats {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
index 4f1568528738..a6b672840e34 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
@@ -43,9 +43,6 @@
 #include "fpga/sdk.h"
 #include "fpga/core.h"
 
-#define SBU_QP_QUEUE_SIZE 8
-#define MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC   (60 * 1000)
-
 enum mlx5_fpga_ipsec_cmd_status {
MLX5_FPGA_IPSEC_CMD_PENDING,
MLX5_FPGA_IPSEC_CMD_SEND_FAIL,
@@ -258,7 +255,7 @@ static int mlx5_fpga_ipsec_cmd_wait(void *ctx)
 {
struct mlx5_fpga_ipsec_cmd_context *context = ctx;
unsigned long timeout =
-   msecs_to_jiffies(MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC);
+   msecs_to_jiffies(MLX5_FPGA_CMD_TIMEOUT_MSEC);
int res;
 
	res = wait_for_completion_timeout(&context->complete, timeout);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h b/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
index baa537e54a49..a0573cc2fc9b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
@@ -41,6 +41,8 @@
  * DOC: Innova SDK
  * This header defines the in-kernel API for Innova FPGA client drivers.
  */
+#define SBU_QP_QUEUE_SIZE 8
+#define MLX5_FPGA_CMD_TIMEOUT_MSEC (60 * 1000)
 
 enum mlx5_fpga_access_type {
MLX5_FPGA_ACCESS_TYPE_I2C = 0x0,
-- 
2.14.3



[PATCH V2 net-next 05/14] net: Add TLS TX offload features

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

This patch adds a netdev feature to configure TLS TX offloads.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 include/linux/netdev_features.h | 2 ++
 net/core/ethtool.c  | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index db84c516bcfb..18dc34202080 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -77,6 +77,7 @@ enum {
	NETIF_F_HW_ESP_BIT,		/* Hardware ESP transformation offload */
NETIF_F_HW_ESP_TX_CSUM_BIT, /* ESP with TX checksum offload */
NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */
+   NETIF_F_HW_TLS_TX_BIT,  /* Hardware TLS TX offload */
 
NETIF_F_GRO_HW_BIT, /* Hardware Generic receive offload */
 
@@ -145,6 +146,7 @@ enum {
 #define NETIF_F_HW_ESP __NETIF_F(HW_ESP)
 #define NETIF_F_HW_ESP_TX_CSUM __NETIF_F(HW_ESP_TX_CSUM)
 #define NETIF_F_RX_UDP_TUNNEL_PORT	__NETIF_F(RX_UDP_TUNNEL_PORT)
+#define NETIF_F_HW_TLS_TX  __NETIF_F(HW_TLS_TX)
 
 #define for_each_netdev_feature(mask_addr, bit)\
for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 157cd9efa4be..9f07f9fe39ca 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -107,6 +107,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
[NETIF_F_HW_ESP_BIT] =   "esp-hw-offload",
[NETIF_F_HW_ESP_TX_CSUM_BIT] =   "esp-tx-csum-hw-offload",
[NETIF_F_RX_UDP_TUNNEL_PORT_BIT] =   "rx-udp_tunnel-port-offload",
+   [NETIF_F_HW_TLS_TX_BIT] =   "tls-hw-tx-offload",
 };
 
 static const char
-- 
2.14.3



[PATCH V2 net-next 01/14] tcp: Add clean acked data hook

2018-03-21 Thread Saeed Mahameed
From: Ilya Lesokhin 

Called when a TCP segment is acknowledged.
Could be used by application protocols which hold additional
metadata associated with the stream data.

This is required by TLS device offload to release
metadata associated with acknowledged TLS records.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 include/net/inet_connection_sock.h | 2 ++
 include/net/tcp.h  | 5 +
 net/ipv4/tcp.c | 5 +
 net/ipv4/tcp_input.c   | 6 ++
 4 files changed, 18 insertions(+)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index b68fea022a82..2ab6667275df 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
  * @icsk_af_ops   Operations which are AF_INET{4,6} specific
  * @icsk_ulp_ops  Pluggable ULP control hook
  * @icsk_ulp_data ULP private data
+ * @icsk_clean_acked  Clean acked data hook
  * @icsk_listen_portaddr_node  hash to the portaddr listener hashtable
  * @icsk_ca_state:Congestion control state
  * @icsk_retransmits: Number of unrecovered [RTO] timeouts
@@ -102,6 +103,7 @@ struct inet_connection_sock {
const struct inet_connection_sock_af_ops *icsk_af_ops;
const struct tcp_ulp_ops  *icsk_ulp_ops;
void  *icsk_ulp_data;
+   void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
struct hlist_node icsk_listen_portaddr_node;
unsigned int  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
__u8  icsk_ca_state:6,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9c9b3768b350..dba03b205680 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2101,4 +2101,9 @@ static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 #if IS_ENABLED(CONFIG_SMC)
 extern struct static_key_false tcp_have_smc;
 #endif
+
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+extern struct static_key_false clean_acked_data_enabled;
+#endif
+
 #endif /* _TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e553f84bde83..70056bb760d2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -297,6 +297,11 @@ DEFINE_STATIC_KEY_FALSE(tcp_have_smc);
 EXPORT_SYMBOL(tcp_have_smc);
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+DEFINE_STATIC_KEY_FALSE(clean_acked_data_enabled);
+EXPORT_SYMBOL_GPL(clean_acked_data_enabled);
+#endif
+
 /*
  * Current number of TCP sockets.
  */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 451ef3012636..21f5c647f4be 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3542,6 +3542,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if (after(ack, prior_snd_una)) {
flag |= FLAG_SND_UNA_ADVANCED;
icsk->icsk_retransmits = 0;
+
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+   if (static_branch_unlikely(&clean_acked_data_enabled))
+   if (icsk->icsk_clean_acked)
+   icsk->icsk_clean_acked(sk, ack);
+#endif
}
 
prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
-- 
2.14.3



[PATCH V2 net-next 00/14] TLS offload, netdev & MLX5 support

2018-03-21 Thread Saeed Mahameed
Hi Dave,

The following series from Ilya and Boris provides TLS TX inline crypto
offload.

v1->v2:
   - Added IS_ENABLED(CONFIG_TLS_DEVICE) and a STATIC_KEY for icsk_clean_acked
   - File license fix
   - Fix spelling, comment by DaveW
   - Move memory allocations out of tls_set_device_offload and other misc fixes,
comments by Kirill.

Boris says:
===
This series adds a generic infrastructure to offload TLS crypto to
network devices. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path,
leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to the
TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS records"; these
records contain plaintext instead of ciphertext and placeholder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus, given a TCP SKB sent from a NIC-offloaded
TLS socket, we can use the TLS NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is acked;
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations. However, it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in place and puts authentication tags in the
relevant placeholders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN, the
driver calls the TLS layer to obtain the TLS record that includes the
TCP sequence number of the packet for transmission. Using this TLS
record, the driver posts a work entry on the transmit queue to
reconstruct the NIC TLS state required for the offload of the
out-of-order packet. It updates the expected TCP SN accordingly and
transmits the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.

Expected TCP SN is accessed without a lock, under the assumption that
TCP doesn't transmit SKBs from different TX queues concurrently.

We assume that packets are not rerouted to a different network device.

Paper: https://www.netdevconf.org/1.2/papers/netdevconf-TLS.pdf

===

The series is based on latest net-next:
0466080c751e ("Merge branch 'dsa-mv88e6xxx-some-fixes'")

Thanks,
Saeed.

---

Boris Pismenny (2):
  MAINTAINERS: Update mlx5 innova driver maintainers
  MAINTAINERS: Update TLS maintainers

Ilya Lesokhin (12):
  tcp: Add clean acked data hook
  net: Rename and export copy_skb_header
  net: Add Software fallback infrastructure for socket dependent
offloads
  net: Add TLS offload netdev ops
  net: Add TLS TX offload features
  net/tls: Add generic NIC offload infrastructure
  net/tls: Support TLS device offload with IPv6
  net/mlx5e: Move defines out of ipsec code
  net/mlx5: Accel, Add TLS tx offload interface
  net/mlx5e: TLS, Add Innova TLS TX support
  net/mlx5e: TLS, Add Innova TLS TX offload data path
  net/mlx5e: TLS, Add error statistics

 MAINTAINERS|  19 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig|  11 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   6 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c|  71 ++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h|  86 +++
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  21 +
 .../mellanox/mlx5/core/en_accel/en_accel.h |  72 ++
 .../ethernet/mellanox/mlx5/core/en_accel/ipsec.h   |   3 -
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 197 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  87 +++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 278 +++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h |  50 ++
 .../mellanox/mlx5/core/en_accel/tls_stats.c   


Re: [bug, bisected] pfifo_fast causes packet reordering

2018-03-21 Thread John Fastabend
On 03/21/2018 12:44 PM, Jakob Unterwurzacher wrote:
> On 21.03.18 19:43, John Fastabend wrote:
>> Thats my theory at least. Are you able to test a patch if I generate
>> one to fix this?
> 
> Yes, no problem.

Can you try this,

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d4907b5..1e596bd 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -30,6 +30,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
__QDISC_STATE_SCHED,
__QDISC_STATE_DEACTIVATED,
+   __QDISC_STATE_RUNNING,
 };
 
 struct qdisc_size_table {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 190570f..cf7c37d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -377,20 +377,26 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
struct netdev_queue *txq;
struct net_device *dev;
struct sk_buff *skb;
-   bool validate;
+   bool more, validate;
 
/* Dequeue packet */
+   if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+   return false;
+
	skb = dequeue_skb(q, &validate, packets);
-   if (unlikely(!skb))
+   if (unlikely(!skb)) {
+   clear_bit(__QDISC_STATE_RUNNING, &q->state);
return false;
+   }
 
if (!(q->flags & TCQ_F_NOLOCK))
root_lock = qdisc_lock(q);
 
dev = qdisc_dev(q);
txq = skb_get_tx_queue(dev, skb);
-
-   return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+   more = sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+   clear_bit(__QDISC_STATE_RUNNING, &q->state);
+   return more;
 }


> 
> I just tested with the flag change you suggested (see below, I had to keep 
> TCQ_F_CPUSTATS to prevent a crash) and I have NOT seen OOO so far.
> 

Right, because the code expects per-CPU stats; if the CPUSTATS flag is
removed it will crash.

> Thanks,
> Jakob
> 
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 190570f21b20..51b68ef4977b 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -792,7 +792,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
>     .dump   =   pfifo_fast_dump,
>     .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
>     .owner  =   THIS_MODULE,
> -   .static_flags   =   TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
> +   .static_flags   =   TCQ_F_CPUSTATS,
>  };
>  EXPORT_SYMBOL(pfifo_fast_ops);



Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-21 Thread Saeed Mahameed
On Wed, 2018-03-21 at 19:31 +0300, Kirill Tkhai wrote:
> On 21.03.2018 18:53, Boris Pismenny wrote:
> > ...
> > > 
> > > Other patches have two licenses in header. Can I distribute this
> > > file under GPL license terms?
> > > 
> > 
> > Sure, I'll update the license to match other files under net/tls.
> > 
> > > > +#include 
> > > > +#include 
> > > > +#include 
> > > > +#include 
> > > > +#include 
> > > > +
> > > > +#include 
> > > > +#include 
> > > > +
> > > > +/* device_offload_lock is used to synchronize tls_dev_add
> > > > + * against NETDEV_DOWN notifications.
> > > > + */
> > > > +DEFINE_STATIC_PERCPU_RWSEM(device_offload_lock);
> > > > +
> > > > +static void tls_device_gc_task(struct work_struct *work);
> > > > +
> > > > +static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
> > > > +static LIST_HEAD(tls_device_gc_list);
> > > > +static LIST_HEAD(tls_device_list);
> > > > +static DEFINE_SPINLOCK(tls_device_lock);
> > > > +
> > > > +static void tls_device_free_ctx(struct tls_context *ctx)
> > > > +{
> > > > +struct tls_offload_context *offlad_ctx = tls_offload_ctx(ctx);
> > > > +
> > > > +kfree(offlad_ctx);
> > > > +kfree(ctx);
> > > > +}
> > > > +
> > > > +static void tls_device_gc_task(struct work_struct *work)
> > > > +{
> > > > +struct tls_context *ctx, *tmp;
> > > > +struct list_head gc_list;
> > > > +unsigned long flags;
> > > > +
> > > > +spin_lock_irqsave(&tls_device_lock, flags);
> > > > +INIT_LIST_HEAD(&gc_list);
> > > 
> > > This is stack variable, and it should be initialized outside of
> > > global spinlock.
> > > There is LIST_HEAD() primitive for that in kernel.
> > > There is one more similar place below.
> > > 
> > 
> > Sure.
> > 
> > > > +list_splice_init(&tls_device_gc_list, &gc_list);
> > > > +spin_unlock_irqrestore(&tls_device_lock, flags);
> > > > +
> > > > +list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
> > > > +struct net_device *netdev = ctx->netdev;
> > > > +
> > > > +if (netdev) {
> > > > +netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> > > > +TLS_OFFLOAD_CTX_DIR_TX);
> > > > +dev_put(netdev);
> > > > +}
> > > 
> > > How is it possible that we meet a NULL netdev here?
> > 
> > This can happen in tls_device_down. tls_device_down is called
> > whenever a netdev that is used for TLS inline crypto offload goes
> > down. It gets called via the NETDEV_DOWN event of the netdevice
> > notifier.
> > 
> > This flow is somewhat similar to the xfrm_device netdev notifier.
> > However, we do not destroy the socket (as in destroying the
> > xfrm_state in xfrm_device). Instead, we cleanup the netdev state
> > and allow software fallback to handle the rest of the traffic.
> > 
> > > > +
> > > > +list_del(&ctx->list);
> > > > +tls_device_free_ctx(ctx);
> > > > +}
> > > > +}
> > > > +
> > > > +static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
> > > > +{
> > > > +unsigned long flags;
> > > > +
> > > > +spin_lock_irqsave(&tls_device_lock, flags);
> > > > +list_move_tail(&ctx->list, &tls_device_gc_list);
> > > > +
> > > > +/* schedule_work inside the spinlock
> > > > + * to make sure tls_device_down waits for that work.
> > > > + */
> > > > +schedule_work(&tls_device_gc_work);
> > > > +
> > > > +spin_unlock_irqrestore(&tls_device_lock, flags);
> > > > +}
> > > > +
> > > > +/* We assume that the socket is already connected */
> > > > +static struct net_device *get_netdev_for_sock(struct sock *sk)
> > > > +{
> > > > +struct inet_sock *inet = inet_sk(sk);
> > > > +struct net_device *netdev = NULL;
> > > > +
> > > > +netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
> > > > +
> > > > +return netdev;
> > > > +}
> > > > +
> > > > +static int attach_sock_to_netdev(struct sock *sk, struct net_device *netdev,
> > > > + struct tls_context *ctx)
> > > > +{
> > > > +int rc;
> > > > +
> > > > +rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,
> > > > + &ctx->crypto_send,
> > > > + tcp_sk(sk)->write_seq);
> > > > +if (rc) {
> > > > +pr_err_ratelimited("The netdev has refused to offload this socket\n");
> > > > +goto out;
> > > > +}
> > > > +
> > > > +rc = 0;
> > > > +out:
> > > > +return rc;
> > > > +}
> > > > +
> > > > +static void destroy_record(struct tls_record_info *record)
> > > > +{
> > > > +skb_frag_t *frag;
> > > > +int nr_frags = record->num_frags;
> > > > +
> > > > +while (nr_frags > 0) {
> > > > +frag = &record->frags[nr_frags - 1];
> > > > +__skb_frag_unref(frag);
> > > > +--nr_frags;
> > > > +}
> > > > +kfree(record);
> > > > +}
> > > > +
> > > > +static void delete_all_records(struct tls_offload_context *offload_ctx)
> > > > +{
> > > > +struct tls_record_info *info, *temp;
> > > > +

[PATCH net-next v5 2/2] net: bpf: add a test for skb_segment in test_bpf module

2018-03-21 Thread Yonghong Song
Without the previous commit,
"modprobe test_bpf" will have the following errors:
...
[   98.149165] [ cut here ]
[   98.159362] kernel BUG at net/core/skbuff.c:3667!
[   98.169756] invalid opcode:  [#1] SMP PTI
[   98.179370] Modules linked in:
[   98.179371]  test_bpf(+)
...
which triggers the bug the previous commit intends to fix.

The skbs are constructed to mimic what mlx5 may generate.
The packet size/header may not mimic real cases in production. But
the processing flow is similar.

Signed-off-by: Yonghong Song 
---
 lib/test_bpf.c | 93 --
 1 file changed, 91 insertions(+), 2 deletions(-)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 2efb213..a468b5c 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6574,6 +6574,93 @@ static bool exclude_test(int test_id)
return test_id < test_range[0] || test_id > test_range[1];
 }
 
+static __init struct sk_buff *build_test_skb(void)
+{
+   u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
+   struct sk_buff *skb[2];
+   struct page *page[2];
+   int i, data_size = 8;
+
+   for (i = 0; i < 2; i++) {
+   page[i] = alloc_page(GFP_KERNEL);
+   if (!page[i]) {
+   if (i == 0)
+   goto err_page0;
+   else
+   goto err_page1;
+   }
+
+   /* this will set skb[i]->head_frag */
+   skb[i] = dev_alloc_skb(headroom + data_size);
+   if (!skb[i]) {
+   if (i == 0)
+   goto err_skb0;
+   else
+   goto err_skb1;
+   }
+
+   skb_reserve(skb[i], headroom);
+   skb_put(skb[i], data_size);
+   skb[i]->protocol = htons(ETH_P_IP);
+   skb_reset_network_header(skb[i]);
+   skb_set_mac_header(skb[i], -ETH_HLEN);
+
+   skb_add_rx_frag(skb[i], 0, page[i], 0, 64, 64);
+   // skb_headlen(skb[i]): 8, skb[i]->head_frag = 1
+   }
+
+   /* setup shinfo */
+   skb_shinfo(skb[0])->gso_size = 1448;
+   skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
+   skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
+   skb_shinfo(skb[0])->gso_segs = 0;
+   skb_shinfo(skb[0])->frag_list = skb[1];
+
+   /* adjust skb[0]'s len */
+   skb[0]->len += skb[1]->len;
+   skb[0]->data_len += skb[1]->data_len;
+   skb[0]->truesize += skb[1]->truesize;
+
+   return skb[0];
+
+err_skb1:
+   __free_page(page[1]);
+err_page1:
+   kfree_skb(skb[0]);
+err_skb0:
+   __free_page(page[0]);
+err_page0:
+   return NULL;
+}
+
+static __init int test_skb_segment(void)
+{
+   netdev_features_t features;
+   struct sk_buff *skb, *segs;
+   int ret = -1;
+
+   features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
+  NETIF_F_IPV6_CSUM;
+   features |= NETIF_F_RXCSUM;
+   skb = build_test_skb();
+   if (!skb) {
+   pr_info("%s: failed to build_test_skb", __func__);
+   goto done;
+   }
+
+   segs = skb_segment(skb, features);
+   if (segs) {
+   kfree_skb_list(segs);
+   ret = 0;
+   pr_info("%s: success in skb_segment!", __func__);
+   } else {
+   pr_info("%s: failed in skb_segment!", __func__);
+   }
+   kfree_skb(skb);
+done:
+   return ret;
+}
+
 static __init int test_bpf(void)
 {
int i, err_cnt = 0, pass_cnt = 0;
@@ -6632,9 +6719,11 @@ static int __init test_bpf_init(void)
return ret;
 
ret = test_bpf();
-
destroy_bpf_tests();
-   return ret;
+   if (ret)
+   return ret;
+
+   return test_skb_segment();
 }
 
 static void __exit test_bpf_exit(void)
-- 
2.9.5



[PATCH net-next v5 1/2] net: permit skb_segment on head_frag frag_list skb

2018-03-21 Thread Yonghong Song
One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473 netdev_features_t features)
3474 {
3475 struct sk_buff *segs = NULL;
3476 struct sk_buff *tail = NULL;
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
3668
3669 i = 0;
3670 nfrags = skb_shinfo(list_skb)->nr_frags;
3671 frag = skb_shinfo(list_skb)->frags;
3672 frag_skb = list_skb;
...

call stack:
...
 #1 [883ffef03558] __crash_kexec at 8110c525
 #2 [883ffef03620] crash_kexec at 8110d5cc
 #3 [883ffef03640] oops_end at 8101d7e7
 #4 [883ffef03668] die at 8101deb2
 #5 [883ffef03698] do_trap at 8101a700
 #6 [883ffef036e8] do_error_trap at 8101abfe
 #7 [883ffef037a0] do_invalid_op at 8101acd0
 #8 [883ffef037b0] invalid_op at 81a00bab
[exception RIP: skb_segment+3044]
RIP: 817e4dd4  RSP: 883ffef03860  RFLAGS: 00010216
RAX: 2bf6  RBX: 883feb7aaa00  RCX: 0011
RDX: 883fb87910c0  RSI: 0011  RDI: 883feb7ab500
RBP: 883ffef03928   R8: 2ce2   R9: 27da
R10: 01ea  R11: 2d82  R12: 883f90a1ee80
R13: 883fb8791120  R14: 883feb7abc00  R15: 2ce2
ORIG_RAX:   CS: 0010  SS: 0018
 #9 [883ffef03930] tcp_gso_segment at 818713e7
---  ---
...

The triggering input skb has the following properties:
list_skb = skb->frag_list;
skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.

This patch addressed the issue by handling skb_headlen(list_skb) != 0
case properly if list_skb->head_frag is true, which is expected in
most cases. The head frag is processed before list_skb->frags
are processed.

Reported-by: Diptanu Gon Choudhury 
Signed-off-by: Yonghong Song 
---
 net/core/skbuff.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c134..23b317a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3460,6 +3460,19 @@ void *skb_pull_rcsum(struct sk_buff *skb, unsigned int len)
 }
 EXPORT_SYMBOL_GPL(skb_pull_rcsum);
 
+static inline skb_frag_t skb_head_frag_to_page_desc(struct sk_buff *frag_skb)
+{
+   skb_frag_t head_frag;
+   struct page *page;
+
+   page = virt_to_head_page(frag_skb->head);
+   head_frag.page.p = page;
+   head_frag.page_offset = frag_skb->data -
+   (unsigned char *)page_address(page);
+   head_frag.size = skb_headlen(frag_skb);
+   return head_frag;
+}
+
 /**
  * skb_segment - Perform protocol segmentation on skb.
  * @head_skb: buffer to segment
@@ -3664,15 +3677,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
while (pos < offset + len) {
if (i >= nfrags) {
-   BUG_ON(skb_headlen(list_skb));
-
i = 0;
nfrags = skb_shinfo(list_skb)->nr_frags;
frag = skb_shinfo(list_skb)->frags;
-   frag_skb = list_skb;
-
-   BUG_ON(!nfrags);
+   if (skb_headlen(list_skb)) {
+   BUG_ON(!list_skb->head_frag);
 
+   /* to make room for head_frag. */
+   i--; frag--;
+   }
+   frag_skb = list_skb;
if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
skb_zerocopy_clone(nskb, frag_skb,
   GFP_ATOMIC))
@@ -3689,7 +3703,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
goto err;
}
 
-   *nskb_frag = *frag;
+   *nskb_frag = (i < 0) ? skb_head_frag_to_page_desc(frag_skb) : *frag;
__skb_frag_ref(nskb_frag);
size = skb_frag_size(nskb_frag);
 
-- 
2.9.5



[PATCH net-next v5 0/2] net: permit skb_segment on head_frag frag_list skb

2018-03-21 Thread Yonghong Song
One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.
 ...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
 ...

The triggering input skb has the following properties:
list_skb = skb->frag_list;
skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.

Patch #1 provides a simple solution to avoid BUG_ON. If
list_skb->head_frag is true, its page-backed frag will
be processed before the list_skb->frags.
Patch #2 provides a test case in test_bpf module which
constructs a skb and calls skb_segment() directly. The test
case is able to trigger the BUG_ON without Patch #1.

The patch has been tested in the following setup:
  ipv6_host <-> nat_server <-> ipv4_host
where nat_server has a bpf program doing ipv4<->ipv6
translation and forwarding through clsact hook
bpf_skb_change_proto.

Changelog:
v4 -> v5:
  . Replace local variable head_frag with
a static inline function skb_head_frag_to_page_desc
which gets the head_frag on-demand. This makes
code more readable and also does not increase
the stack size, from Alexander.
  . Remove the "if(nfrags)" guard for skb_orphan_frags
and skb_zerocopy_clone as I found that they can
handle zero-frag skb (with non-zero skb_headlen(skb))
properly.
  . Properly release segment list from skb_segment()
in the test, from Eric.
v3 -> v4:
  . Remove dynamic memory allocation and use rewinding
for both index and frag to remove one branch in fast path,
from Alexander.
  . Fix a bunch of issues in test_bpf skb_segment() test,
including proper way to allocate skb, proper function
argument for skb_add_rx_frag and not freeing skb, etc.,
from Eric.
v2 -> v3:
  . Use starting frag index -1 (instead of 0) to
special process head_frag before other frags in the skb,
from Alexander Duyck.
v1 -> v2:
  . Removed never-hit BUG_ON, spotted by Linyu Yuan.

Yonghong Song (2):
  net: permit skb_segment on head_frag frag_list skb
  net: bpf: add a test for skb_segment in test_bpf module

 lib/test_bpf.c| 93 +--
 net/core/skbuff.c | 26 
 2 files changed, 111 insertions(+), 8 deletions(-)

-- 
2.9.5



Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation

2018-03-21 Thread Marcelo Ricardo Leitner
On Wed, Mar 21, 2018 at 12:44:29PM -0700, Florian Fainelli wrote:
> On 03/21/2018 12:37 PM, Randy Dunlap wrote:
> > On 03/21/2018 11:33 AM, Tal Gilboa wrote:
> >> Net DIM is a generic algorithm, purposed for dynamically
> >> optimizing network devices interrupt moderation. This
> >> document describes how it works and how to use it.
> >>
> >> Signed-off-by: Tal Gilboa 
> >> ---
> >>  Documentation/networking/net_dim.txt | 174 +++
> >>  1 file changed, 174 insertions(+)
> >>  create mode 100644 Documentation/networking/net_dim.txt
> >>
> >> diff --git a/Documentation/networking/net_dim.txt 
> >> b/Documentation/networking/net_dim.txt
> >> new file mode 100644
> >> index 000..9cb31c5
> >> --- /dev/null
> >> +++ b/Documentation/networking/net_dim.txt
> >> @@ -0,0 +1,174 @@
> >> +Net DIM - Generic Network Dynamic Interrupt Moderation
> >> +==
> >> +
> >> +Author:
> >> +  Tal Gilboa 
> >> +
> >> +
> >> +Contents
> >> +=
> >> +
> >> +- Assumptions
> >> +- Introduction
> >> +- The Net DIM Algorithm
> >> +- Registering a Network Device to DIM
> >> +- Example
> >> +
> >> +Part 0: Assumptions
> >> +==
> >> +
> >> +This document assumes the reader has basic knowledge in network drivers
> >> +and in general interrupt moderation.
> >> +
> >> +
> >> +Part I: Introduction
> >> +==
> >> +
> >> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
> >> +interrupt moderation configuration of a channel in order to optimize 
> >> packet
> >> +processing. The mechanism includes an algorithm which decides if and how 
> >> to
> >> +change moderation parameters for a channel, usually by performing an 
> >> analysis on
> >> +runtime data sampled from the system. Net DIM is such a mechanism. In each
> >> +iteration of the algorithm, it analyses a given sample of the data, 
> >> compares it
> >> +to the previous sample and if required, it can decide to change some of 
> >> the
> >> +interrupt moderation configuration fields. The data sample is composed of 
> >> data
> >> +bandwidth, the number of packets and the number of events. The time 
> >> between
> >> +samples is also measured. Net DIM compares the current and the previous 
> >> data and
> >> +returns an adjusted interrupt moderation configuration object. In some 
> >> cases,
> >> +the algorithm might decide not to change anything. The configuration 
> >> fields are
> >> +the minimum duration (microseconds) allowed between events and the maximum
> >> +number of wanted packets per event. The Net DIM algorithm ascribes 
> >> importance to
> >> +increase bandwidth over reducing interrupt rate.
> >> +
> >> +
> >> +Part II: The Net DIM Algorithm
> >> +===
> >> +
> >> +Each iteration of the Net DIM algorithm follows these steps:
> >> +1. Calculates new data sample.
Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation

2018-03-21 Thread Marcelo Ricardo Leitner
On Wed, Mar 21, 2018 at 08:33:45PM +0200, Tal Gilboa wrote:
> Net DIM is a generic algorithm, purposed for dynamically
> optimizing network devices interrupt moderation. This
> document describes how it works and how to use it.
> 
> Signed-off-by: Tal Gilboa 
> ---
>  Documentation/networking/net_dim.txt | 174 
> +++
>  1 file changed, 174 insertions(+)
>  create mode 100644 Documentation/networking/net_dim.txt
> 
> diff --git a/Documentation/networking/net_dim.txt 
> b/Documentation/networking/net_dim.txt
> new file mode 100644
> index 000..9cb31c5
> --- /dev/null
> +++ b/Documentation/networking/net_dim.txt
> @@ -0,0 +1,174 @@
> +Net DIM - Generic Network Dynamic Interrupt Moderation
> +==
> +
> +Author:
> + Tal Gilboa 
> +
> +
> +Contents
> +=
> +
> +- Assumptions
> +- Introduction
> +- The Net DIM Algorithm
> +- Registering a Network Device to DIM
> +- Example
> +
> +Part 0: Assumptions
> +==
> +
> +This document assumes the reader has basic knowledge in network drivers
> +and in general interrupt moderation.
> +
> +
> +Part I: Introduction
> +==
> +
> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
> +interrupt moderation configuration of a channel in order to optimize packet
> +processing. The mechanism includes an algorithm which decides if and how to
> +change moderation parameters for a channel, usually by performing an 
> analysis on
> +runtime data sampled from the system. Net DIM is such a mechanism. In each
> +iteration of the algorithm, it analyses a given sample of the data, compares 
> it
> +to the previous sample and if required, it can decide to change some of the
> +interrupt moderation configuration fields. The data sample is composed of 
> data
> +bandwidth, the number of packets and the number of events. The time between
> +samples is also measured. Net DIM compares the current and the previous data 
> and
> +returns an adjusted interrupt moderation configuration object. In some cases,
> +the algorithm might decide not to change anything. The configuration fields 
> are
> +the minimum duration (microseconds) allowed between events and the maximum
> +number of wanted packets per event. The Net DIM algorithm ascribes
> importance to
> +increasing bandwidth over reducing interrupt rate.
> +
> +
> +Part II: The Net DIM Algorithm
> +===
> +
> +Each iteration of the Net DIM algorithm follows these steps:
> +1. Calculates new data sample.
> +2. Compares it to previous sample.
> +3. Makes a decision - suggests interrupt moderation configuration fields.
> +4. Applies a schedule work function, which applies suggested configuration.
> +
> +The first two steps are straightforward, both the new and the previous data 
> are
> +supplied by the driver registered to Net DIM. The previous data is the new 
> data
> +supplied to the previous iteration. The comparison step checks the difference
> +between the new and previous data and decides on the result of the last step.
> +A step would result as "better" if bandwidth increases and as "worse" if
> +bandwidth reduces. If there is no change in bandwidth, the packet rate is
> +compared in a similar fashion - increase == "better" and decrease == "worse".
> +In case there is no change in the packet rate as well, the interrupt rate is
> +compared. Here the algorithm tries to optimize for lower interrupt rate so an
> +increase in the interrupt rate is considered "worse" and a decrease is
> +considered "better". Step #2 has an optimization for avoiding false results: 
> it
> +only considers a difference between samples as valid if it is greater than a
> +certain percentage. Also, since Net DIM does not measure anything by itself, 
> it
> +assumes the data provided by the driver is valid.
> +
> +Step #3 decides on the suggested configuration based on the result from step 
> #2
> +and the internal state of the algorithm. The states reflect the "direction" 
> of
> +the algorithm: is it going left (reducing moderation), right (increasing
> +moderation) or standing still. Another optimization is that if a decision
> +to stay still is made multiple times, the interval between iterations of the
> +algorithm would increase in order to reduce calculation overhead. Also, after

I wonder if this increased interval can lead to packet drops due to
some impulse? Like, the card is receiving a low volume of packets and
suddenly a new flow starts at line rate, for example. If the max
interval is not too aggressive, this wouldn't be a problem.

(sorry, I didn't read much of the implementation nor the drivers
already using it)

> +"parking" on one of the most left or most right decisions, the algorithm may
> +decide to verify this decision by taking a step in the other direction. This 
> is
> +done in order to avoid getting stuck in a "deep sleep" 
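
The "better"/"worse" decision chain described in the quoted document (bandwidth first, then packet rate, then interrupt rate, with a change only counted when it exceeds a minimum percentage) can be sketched in plain C. The names and the 5% threshold below are illustrative assumptions for the sketch, not the kernel's actual net_dim code:

```c
#include <assert.h>

enum dim_verdict { DIM_WORSE, DIM_SAME, DIM_BETTER };

/* Treat a difference as significant only if it exceeds ~5% of the old
 * value -- the document's "certain percentage" guard against noise. */
static int significant(unsigned long oldv, unsigned long newv)
{
	unsigned long diff = newv > oldv ? newv - oldv : oldv - newv;

	return diff * 20 > oldv;	/* diff > 5% of old */
}

/* Compare two samples: bandwidth, then packets/ms, then events/ms
 * (interrupt rate, for which a decrease counts as "better"). */
static enum dim_verdict dim_compare(unsigned long old_bw, unsigned long new_bw,
				    unsigned long old_ppms, unsigned long new_ppms,
				    unsigned long old_epms, unsigned long new_epms)
{
	if (significant(old_bw, new_bw))
		return new_bw > old_bw ? DIM_BETTER : DIM_WORSE;
	if (significant(old_ppms, new_ppms))
		return new_ppms > old_ppms ? DIM_BETTER : DIM_WORSE;
	if (significant(old_epms, new_epms))
		return new_epms < old_epms ? DIM_BETTER : DIM_WORSE;
	return DIM_SAME;
}
```

The real algorithm layers its left/right/park state machine and profile selection on top of this verdict.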

Re: [PATCH net-next v4 2/2] net: bpf: add a test for skb_segment in test_bpf module

2018-03-21 Thread Yonghong Song



On 3/21/18 8:26 AM, Eric Dumazet wrote:



On 03/20/2018 11:47 PM, Yonghong Song wrote:

+static __init int test_skb_segment(void)
+{
+   netdev_features_t features;
+   struct sk_buff *skb;
+   int ret = -1;
+
+   features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
+  NETIF_F_IPV6_CSUM;
+   features |= NETIF_F_RXCSUM;
+   skb = build_test_skb();
+   if (!skb) {
+   pr_info("%s: failed to build_test_skb", __func__);
+   goto done;
+   }
+
+   if (skb_segment(skb, features)) {
+   ret = 0;
+   pr_info("%s: success in skb_segment!", __func__);
+   } else {
+   pr_info("%s: failed in skb_segment!", __func__);
+   }
+   kfree_skb(skb);


If skb_segment() was successful, the (original) skb was already freed.

kfree_skb(old_skb) should thus panic the box, if you run this code
on a kernel having some debugging features like KASAN


I tried with KASAN. It does not panic.
Looking at the code in net/core/dev.c: validate_xmit_skb:

static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct 
net_device *dev, bool *again)

...

if (netif_needs_gso(skb, features)) {
struct sk_buff *segs;

segs = skb_gso_segment(skb, features);
if (IS_ERR(segs)) {
goto out_kfree_skb;
} else if (segs) {
consume_skb(skb);
skb = segs;
}
...
out_kfree_skb:
kfree_skb(skb);

which also indicates kfree_skb/consume_skb probably is the right way
to free skb after skb_gso_segment/skb_segment.

This probably explains why my above kfree_skb(skb) does not crash.



So you must store in a variable the return of skb_segment(),
to be able to free skb(s), using kfree_skb_list()


Totally agree. Will make the change. Thanks!





+done:
+   return ret;
+}
+
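
Eric's ownership point can be modeled outside the kernel: a hypothetical segment() that, like skb_gso_segment(), consumes its input on success and hands back a list the caller must free with a list-freeing helper (the analogue of kfree_skb_list()). All names here are illustrative; this only mirrors the lifetime rules discussed above.

```c
#include <assert.h>
#include <stdlib.h>

struct buf { struct buf *next; };

static int live_bufs;	/* allocation counter, lets us detect leaks */

static struct buf *buf_alloc(void)
{
	live_bufs++;
	return calloc(1, sizeof(struct buf));
}

static void buf_free(struct buf *b)
{
	live_bufs--;
	free(b);
}

static void buf_free_list(struct buf *b)
{
	while (b) {
		struct buf *next = b->next;

		buf_free(b);
		b = next;
	}
}

/* Like skb_gso_segment(): on success the input buffer is consumed and a
 * list of segments is returned; the caller owns only that list and must
 * NOT free the input again. */
static struct buf *segment(struct buf *in, int nsegs)
{
	struct buf *head = NULL;

	for (int i = 0; i < nsegs; i++) {
		struct buf *s = buf_alloc();

		s->next = head;
		head = s;
	}
	buf_free(in);	/* input consumed here */
	return head;
}
```

Freeing `in` after a successful segment() in this model is exactly the double free the review warns about.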


Re: [PATCH net-next v3 1/2] net: permit skb_segment on head_frag frag_list skb

2018-03-21 Thread Yonghong Song



On 3/21/18 7:59 AM, Alexander Duyck wrote:

On Tue, Mar 20, 2018 at 10:02 PM, Yonghong Song  wrote:



On 3/20/18 4:50 PM, Alexander Duyck wrote:


On Tue, Mar 20, 2018 at 4:21 PM, Yonghong Song  wrote:


One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473 netdev_features_t features)
3474 {
3475 struct sk_buff *segs = NULL;
3476 struct sk_buff *tail = NULL;
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
3668
3669 i = 0;
3670 nfrags =
skb_shinfo(list_skb)->nr_frags;
3671 frag = skb_shinfo(list_skb)->frags;
3672 frag_skb = list_skb;
...

call stack:
...
   #1 [883ffef03558] __crash_kexec at 8110c525
   #2 [883ffef03620] crash_kexec at 8110d5cc
   #3 [883ffef03640] oops_end at 8101d7e7
   #4 [883ffef03668] die at 8101deb2
   #5 [883ffef03698] do_trap at 8101a700
   #6 [883ffef036e8] do_error_trap at 8101abfe
   #7 [883ffef037a0] do_invalid_op at 8101acd0
   #8 [883ffef037b0] invalid_op at 81a00bab
  [exception RIP: skb_segment+3044]
  RIP: 817e4dd4  RSP: 883ffef03860  RFLAGS: 00010216
  RAX: 2bf6  RBX: 883feb7aaa00  RCX: 0011
  RDX: 883fb87910c0  RSI: 0011  RDI: 883feb7ab500
  RBP: 883ffef03928   R8: 2ce2   R9: 27da
  R10: 01ea  R11: 2d82  R12: 883f90a1ee80
  R13: 883fb8791120  R14: 883feb7abc00  R15: 2ce2
  ORIG_RAX:   CS: 0010  SS: 0018
   #9 [883ffef03930] tcp_gso_segment at 818713e7
---  ---
...

The triggering input skb has the following properties:
  list_skb = skb->frag_list;
  skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.

This patch addressed the issue by handling skb_headlen(list_skb) != 0
case properly if list_skb->head_frag is true, which is expected in
most cases. The head frag is processed before list_skb->frags
are processed.

Reported-by: Diptanu Gon Choudhury 
Signed-off-by: Yonghong Song 
---
   net/core/skbuff.c | 51
+--
   1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c134..59bbc06 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff
*head_skb,
  struct sk_buff *segs = NULL;
  struct sk_buff *tail = NULL;
  struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
-   skb_frag_t *frag = skb_shinfo(head_skb)->frags;
+   skb_frag_t *frag = skb_shinfo(head_skb)->frags, *head_frag =
NULL;



I think you misunderstood me. I wasn't saying you allocate head_frag.
I was saying you could move the declaration down.



Sorry for my misunderstanding. I did understand your intention of moving
the declaration down in order to save stack space. I thought that we cannot
really move the declaration down (although it works in C, semantically it is
not quite right; more on that later), so I moved on to
use runtime allocation. But indeed skb_frag_t is not big (16 bytes), it
could live on the stack.




  unsigned int mss = skb_shinfo(head_skb)->gso_size;
  unsigned int doffset = head_skb->data -
skb_mac_header(head_skb);
  struct sk_buff *frag_skb = head_skb;
@@ -3664,19 +3664,39 @@ struct sk_buff *skb_segment(struct sk_buff
*head_skb,

  while (pos < offset + len) {



So right here in the loop you could add a "skb_frag_t head_frag;" just
so we declare it here and save ourselves the stack space.



I actually tried to move "skb_frag_t head_frag". The stack size remains the
same, 0xc0. This is related to how the C compiler allocates stack space:
the placement of a declaration does not decide the stack size as long as the
declaration dictates the usage. The stack size is really determined by
liveness analysis.

Further, we have code like:
 do {

while (pos < offset + len) {
if (i >= nfrags) {
...
head_frag = ...
}
... = head_frag; // head_frag access guaranteed after
 

RE: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP time stamping.

2018-03-21 Thread Keller, Jacob E


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Richard Cochran
> Sent: Wednesday, March 21, 2018 11:58 AM
> To: netdev@vger.kernel.org
> Cc: devicet...@vger.kernel.org; Andrew Lunn ; David Miller
> ; Florian Fainelli ; Mark Rutland
> ; Miroslav Lichvar ; Rob
> Herring ; Willem de Bruijn 
> Subject: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP
> +
> + /*
> +  * Same as HWTSTAMP_TX_ONESTEP_SYNC, but also enables time
> +  * stamp insertion directly into PDelay_Resp packets. In this
> +  * case, neither transmitted Sync nor PDelay_Resp packets will
> +  * receive a time stamp via the socket error queue.
> +  */
> + HWTSTAMP_TX_ONESTEP_P2P,
>  };
> 

I am guessing that we expect all devices which support onestep P2P messages
will always support onestep SYNC as well?

Thanks,
Jake



RE: [PATCH net-next v2] net: mvpp2: Don't use dynamic allocs for local variables

2018-03-21 Thread Yan Markman
Hi Maxime

Please check the TWO points:

1). The mvpp2_prs_flow_find() returns the TID if found.
The TID=0 is a valid "found" value.
For not-found, use -ENOENT (just like your mvpp2_prs_vlan_find).

2). The original code always uses "mvpp2_prs_entry *pe" storage that is zero-allocated.
   Please check the correctness of the new "mvpp2_prs_entry pe" without
memset(&pe, 0, sizeof(pe));
   in all procedures where pe = kzalloc() has been replaced.

Thanks
Yan Markman
Tel. 05-44732819


-Original Message-
From: Maxime Chevallier [mailto:maxime.chevall...@bootlin.com] 
Sent: Wednesday, March 21, 2018 5:14 PM
To: da...@davemloft.net
Cc: Maxime Chevallier ; netdev@vger.kernel.org; 
linux-ker...@vger.kernel.org; Antoine Tenart ; 
thomas.petazz...@bootlin.com; gregory.clem...@bootlin.com; 
miquel.ray...@bootlin.com; Nadav Haklai ; Stefan Chulski 
; Yan Markman ; m...@semihalf.com
Subject: [PATCH net-next v2] net: mvpp2: Don't use dynamic allocs for local 
variables

Some helper functions that search for given entries in the TCAM filter on the PPv2
controller make use of dynamically allocated temporary variables, allocated with
GFP_KERNEL. These functions can be called in atomic context, and dynamic allocation
is not really needed in these cases anyway.

This commit gets rid of dynamic allocs and use stack allocation in the 
following functions, and where they're used :
 - mvpp2_prs_flow_find
 - mvpp2_prs_vlan_find
 - mvpp2_prs_double_vlan_find
 - mvpp2_prs_mac_da_range_find

For all these functions, instead of returning a temporary object representing
the TCAM entry, we simply return the TCAM id that matches the requested entry.

Signed-off-by: Maxime Chevallier 
---
V2: Remove unnecessary brackets, following Antoine Tenart's review.

 drivers/net/ethernet/marvell/mvpp2.c | 289 +++
 1 file changed, 127 insertions(+), 162 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index 9bd35f2291d6..28e33e139178 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -1913,16 +1913,11 @@ static void mvpp2_prs_sram_offset_set(struct 
mvpp2_prs_entry *pe,  }
 
 /* Find parser flow entry */
-static struct mvpp2_prs_entry *mvpp2_prs_flow_find(struct mvpp2 *priv, int 
flow)
+static int mvpp2_prs_flow_find(struct mvpp2 *priv, int flow)
 {
-   struct mvpp2_prs_entry *pe;
+   struct mvpp2_prs_entry pe;
int tid;
 
-   pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-   if (!pe)
-   return NULL;
-   mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_FLOWS);
-
/* Go through the all entires with MVPP2_PRS_LU_FLOWS */
for (tid = MVPP2_PRS_TCAM_SRAM_SIZE - 1; tid >= 0; tid--) {
u8 bits;
@@ -1931,17 +1926,16 @@ static struct mvpp2_prs_entry 
*mvpp2_prs_flow_find(struct mvpp2 *priv, int flow)
priv->prs_shadow[tid].lu != MVPP2_PRS_LU_FLOWS)
continue;
 
-   pe->index = tid;
-   mvpp2_prs_hw_read(priv, pe);
-   bits = mvpp2_prs_sram_ai_get(pe);
+   pe.index = tid;
+   mvpp2_prs_hw_read(priv, &pe);
+   bits = mvpp2_prs_sram_ai_get(&pe);
 
/* Sram store classification lookup ID in AI bits [5:0] */
if ((bits & MVPP2_PRS_FLOW_ID_MASK) == flow)
-   return pe;
+   return tid;
}
-   kfree(pe);
 
-   return NULL;
+   return -ENOENT;
 }
 
 /* Return first free tcam index, seeking from start to end */ @@ -2189,16 
+2183,12 @@ static void mvpp2_prs_dsa_tag_ethertype_set(struct mvpp2 *priv, int 
port,  }
 
 /* Search for existing single/triple vlan entry */ -static struct 
mvpp2_prs_entry *mvpp2_prs_vlan_find(struct mvpp2 *priv,
-  unsigned short tpid, int ai)
+static int mvpp2_prs_vlan_find(struct mvpp2 *priv, unsigned short tpid, 
+int ai)
 {
-   struct mvpp2_prs_entry *pe;
+   struct mvpp2_prs_entry pe;
int tid;
 
-   pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-   if (!pe)
-   return NULL;
-   mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_VLAN);
+   memset(&pe, 0, sizeof(pe));
 
/* Go through the all entries with MVPP2_PRS_LU_VLAN */
for (tid = MVPP2_PE_FIRST_FREE_TID;
@@ -2210,19 +2200,19 @@ static struct mvpp2_prs_entry 
*mvpp2_prs_vlan_find(struct mvpp2 *priv,
priv->prs_shadow[tid].lu != MVPP2_PRS_LU_VLAN)
continue;
 
-   pe->index = tid;
+   pe.index = tid;
 
-   mvpp2_prs_hw_read(priv, pe);
-   match = mvpp2_prs_tcam_data_cmp(pe, 0, swab16(tpid));
+   mvpp2_prs_hw_read(priv, &pe);
+   match = mvpp2_prs_tcam_data_cmp(&pe, 0, 
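
The refactoring pattern in this patch, returning a TCAM index instead of a heap-allocated temporary entry, can be sketched generically. The table layout and names below are illustrative only; note Yan's point above that 0 must remain a valid "found" index, so a miss is reported as -ENOENT rather than 0 or NULL.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TABLE_SIZE 8

struct entry { int in_use; int flow; };

static struct entry table[TABLE_SIZE];

/* Stand-in for mvpp2_prs_hw_read(): fills the entry for a given index */
static void hw_read(struct entry *e, int tid)
{
	*e = table[tid];
}

/* Search with a stack temporary instead of kzalloc(), safe in atomic
 * context. 0 is a valid hit, so a miss is -ENOENT, never 0 or NULL. */
static int flow_find(int flow)
{
	struct entry e;

	memset(&e, 0, sizeof(e));	/* mirrors memset(&pe, 0, sizeof(pe)) */
	for (int tid = 0; tid < TABLE_SIZE; tid++) {
		hw_read(&e, tid);
		if (e.in_use && e.flow == flow)
			return tid;
	}
	return -ENOENT;
}
```

Returning the index also avoids the old API's ambiguity between "allocation failed" and "not found", both of which were NULL.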

Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation

2018-03-21 Thread Florian Fainelli
On 03/21/2018 12:37 PM, Randy Dunlap wrote:
> On 03/21/2018 11:33 AM, Tal Gilboa wrote:
>> Net DIM is a generic algorithm, purposed for dynamically
>> optimizing network devices interrupt moderation. This
>> document describes how it works and how to use it.
>>
>> Signed-off-by: Tal Gilboa 
>> ---
>>  Documentation/networking/net_dim.txt | 174 
>> +++
>>  1 file changed, 174 insertions(+)
>>  create mode 100644 Documentation/networking/net_dim.txt
>>
>> diff --git a/Documentation/networking/net_dim.txt 
>> b/Documentation/networking/net_dim.txt
>> new file mode 100644
>> index 000..9cb31c5
>> --- /dev/null
>> +++ b/Documentation/networking/net_dim.txt
>> @@ -0,0 +1,174 @@
>> +Net DIM - Generic Network Dynamic Interrupt Moderation
>> +==
>> +
>> +Author:
>> +Tal Gilboa 
>> +
>> +
>> +Contents
>> +=
>> +
>> +- Assumptions
>> +- Introduction
>> +- The Net DIM Algorithm
>> +- Registering a Network Device to DIM
>> +- Example
>> +
>> +Part 0: Assumptions
>> +==
>> +
>> +This document assumes the reader has basic knowledge in network drivers
>> +and in general interrupt moderation.
>> +
>> +
>> +Part I: Introduction
>> +==
>> +
>> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
>> +interrupt moderation configuration of a channel in order to optimize packet
>> +processing. The mechanism includes an algorithm which decides if and how to
>> +change moderation parameters for a channel, usually by performing an 
>> analysis on
>> +runtime data sampled from the system. Net DIM is such a mechanism. In each
>> +iteration of the algorithm, it analyses a given sample of the data, 
>> compares it
>> +to the previous sample and if required, it can decide to change some of the
>> +interrupt moderation configuration fields. The data sample is composed of 
>> data
>> +bandwidth, the number of packets and the number of events. The time between
>> +samples is also measured. Net DIM compares the current and the previous 
>> data and
>> +returns an adjusted interrupt moderation configuration object. In some 
>> cases,
>> +the algorithm might decide not to change anything. The configuration fields 
>> are
>> +the minimum duration (microseconds) allowed between events and the maximum
>> +number of wanted packets per event. The Net DIM algorithm ascribes
>> importance to
>> +increasing bandwidth over reducing interrupt rate.
>> +
>> +
>> +Part II: The Net DIM Algorithm
>> +===
>> +
>> +Each iteration of the Net DIM algorithm follows these steps:
>> +1. Calculates new data sample.
>> +2. Compares it to previous sample.
>> +3. Makes a decision - suggests interrupt moderation configuration fields.
>> +4. Applies a schedule work function, which applies suggested configuration.
>> +
>> +The first two steps are straightforward, both the new and the previous data 
>> are
>> +supplied by the driver registered to Net DIM. The previous data is the new 
>> data
>> +supplied to the previous iteration. The comparison step checks the 
>> difference
>> +between the new and previous data and decides on the result of the last 
>> step.
>> +A step would result as "better" if bandwidth increases and as "worse" if
>> +bandwidth reduces. If there is no change in bandwidth, the packet rate is
>> +compared in a similar fashion - increase == "better" and decrease == 
>> "worse".
>> +In case there is no change in the packet rate as well, the interrupt rate is
>> +compared. Here the algorithm tries to optimize for lower interrupt rate so 
>> an
>> +increase in the interrupt rate is considered "worse" and a decrease is
>> +considered "better". Step #2 has an optimization for avoiding false 
>> results: it
>> +only considers a difference between samples as valid if it is greater than a
>> +certain percentage. Also, since Net DIM does not measure anything by 
>> itself, it
>> +assumes the data provided by the driver is valid.
>> +
>> +Step #3 decides on the suggested configuration based on the result from 
>> step #2
>> +and the internal state of the algorithm. The states reflect the "direction" 
>> of
>> +the algorithm: is it going left (reducing moderation), right (increasing
>> +moderation) or standing still. Another optimization is that if a decision
>> +to stay still is made multiple times, the interval between iterations of the
>> +algorithm would increase in order to reduce calculation overhead. Also, 
>> after
>> +"parking" on one of the most left or most right decisions, the algorithm may
>> +decide to verify this decision by taking a step in the other direction. 
>> This is
>> +done in order to avoid getting stuck in a "deep sleep" scenario. Once a
>> +decision is made, an interrupt moderation configuration is selected from
>> +the predefined profiles.
> 
> I think a short description of the predefined profiles could 

Re: [PATCH v2 bpf-next 4/8] tracepoint: compute num_args at build time

2018-03-21 Thread Linus Torvalds
On Wed, Mar 21, 2018 at 11:54 AM, Alexei Starovoitov  wrote:
>
> add fancy macro to compute number of arguments passed into tracepoint
> at compile time and store it as part of 'struct tracepoint'.

We should probably do this __COUNT() thing in some generic header; we
just talked last week about another use case entirely.

And wouldn't it be nice to just have some generic infrastructure like this:

/*
 * This counts to ten.
 *
 * Any more than that, and we'd need to take off our shoes
 */
#define __GET_COUNT(_0,_1,_2,_3,_4,_5,_6,_7,_8,_9,_10,_n,...) _n
#define __COUNT(...) \
__GET_COUNT(__VA_ARGS__,10,9,8,7,6,5,4,3,2,1,0)
#define COUNT(...) __COUNT(dummy,##__VA_ARGS__)

#define __CONCAT(a,b) a##b
#define __CONCATENATE(a,b) __CONCAT(a,b)

and then you can do things like:

#define fn(...) __CONCATENATE(fn,COUNT(__VA_ARGS__))(__VA_ARGS__)

which turns "fn(x,y,z)" into "fn3(x,y,z)".

That can be useful for things like "max(a,b,c,d)" expanding to
"max4()", and then you can just have the trivial

  #define max3(a,b,c) max2(a,max2(b,c))

etc (with proper parentheses, of course).

And I'd rather not have that function name concatenation be part of
the counting logic, because we actually may have different ways of
counting, so the concatenation is separate.

In particular, the other situation this came up for, the counting was
in arguments _pairs_, so you'd use a "COUNT_PAIR()" instead of
"COUNT()".

NOTE NOTE NOTE! The above was slightly tested and then cut-and-pasted.
I might have screwed up at any point. Think of it as pseudo-code.

 Linus


Re: [bug, bisected] pfifo_fast causes packet reordering

2018-03-21 Thread Jakob Unterwurzacher

On 21.03.18 19:43, John Fastabend wrote:

Thats my theory at least. Are you able to test a patch if I generate
one to fix this?


Yes, no problem.

I just tested with the flag change you suggested (see below, I had to 
keep TCQ_F_CPUSTATS to prevent a crash) and I have NOT seen OOO so far.


Thanks,
Jakob


diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 190570f21b20..51b68ef4977b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -792,7 +792,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.dump   =   pfifo_fast_dump,
.change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
.owner  =   THIS_MODULE,
-   .static_flags   =   TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
+   .static_flags   =   TCQ_F_CPUSTATS,
 };
 EXPORT_SYMBOL(pfifo_fast_ops);


Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation

2018-03-21 Thread Randy Dunlap
On 03/21/2018 11:33 AM, Tal Gilboa wrote:
> Net DIM is a generic algorithm, purposed for dynamically
> optimizing network devices interrupt moderation. This
> document describes how it works and how to use it.
> 
> Signed-off-by: Tal Gilboa 
> ---
>  Documentation/networking/net_dim.txt | 174 
> +++
>  1 file changed, 174 insertions(+)
>  create mode 100644 Documentation/networking/net_dim.txt
> 
> diff --git a/Documentation/networking/net_dim.txt 
> b/Documentation/networking/net_dim.txt
> new file mode 100644
> index 000..9cb31c5
> --- /dev/null
> +++ b/Documentation/networking/net_dim.txt
> @@ -0,0 +1,174 @@
> +Net DIM - Generic Network Dynamic Interrupt Moderation
> +==
> +
> +Author:
> + Tal Gilboa 
> +
> +
> +Contents
> +=
> +
> +- Assumptions
> +- Introduction
> +- The Net DIM Algorithm
> +- Registering a Network Device to DIM
> +- Example
> +
> +Part 0: Assumptions
> +==
> +
> +This document assumes the reader has basic knowledge in network drivers
> +and in general interrupt moderation.
> +
> +
> +Part I: Introduction
> +==
> +
> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
> +interrupt moderation configuration of a channel in order to optimize packet
> +processing. The mechanism includes an algorithm which decides if and how to
> +change moderation parameters for a channel, usually by performing an 
> analysis on
> +runtime data sampled from the system. Net DIM is such a mechanism. In each
> +iteration of the algorithm, it analyses a given sample of the data, compares 
> it
> +to the previous sample and if required, it can decide to change some of the
> +interrupt moderation configuration fields. The data sample is composed of 
> data
> +bandwidth, the number of packets and the number of events. The time between
> +samples is also measured. Net DIM compares the current and the previous data 
> and
> +returns an adjusted interrupt moderation configuration object. In some cases,
> +the algorithm might decide not to change anything. The configuration fields 
> are
> +the minimum duration (microseconds) allowed between events and the maximum
> +number of wanted packets per event. The Net DIM algorithm ascribes
> importance to
> +increasing bandwidth over reducing interrupt rate.
> +
> +
> +Part II: The Net DIM Algorithm
> +===
> +
> +Each iteration of the Net DIM algorithm follows these steps:
> +1. Calculates new data sample.
> +2. Compares it to previous sample.
> +3. Makes a decision - suggests interrupt moderation configuration fields.
> +4. Applies a schedule work function, which applies suggested configuration.
> +
> +The first two steps are straightforward, both the new and the previous data 
> are
> +supplied by the driver registered to Net DIM. The previous data is the new 
> data
> +supplied to the previous iteration. The comparison step checks the difference
> +between the new and previous data and decides on the result of the last step.
> +A step would result as "better" if bandwidth increases and as "worse" if
> +bandwidth reduces. If there is no change in bandwidth, the packet rate is
> +compared in a similar fashion - increase == "better" and decrease == "worse".
> +In case there is no change in the packet rate as well, the interrupt rate is
> +compared. Here the algorithm tries to optimize for lower interrupt rate so an
> +increase in the interrupt rate is considered "worse" and a decrease is
> +considered "better". Step #2 has an optimization for avoiding false results: 
> it
> +only considers a difference between samples as valid if it is greater than a
> +certain percentage. Also, since Net DIM does not measure anything by itself, 
> it
> +assumes the data provided by the driver is valid.
> +
> +Step #3 decides on the suggested configuration based on the result from step 
> #2
> +and the internal state of the algorithm. The states reflect the "direction" 
> of
> +the algorithm: is it going left (reducing moderation), right (increasing
> +moderation) or standing still. Another optimization is that if a decision
> +to stay still is made multiple times, the interval between iterations of the
> +algorithm would increase in order to reduce calculation overhead. Also, after
> +"parking" on one of the most left or most right decisions, the algorithm may
> +decide to verify this decision by taking a step in the other direction. This 
> is
> +done in order to avoid getting stuck in a "deep sleep" scenario. Once a
> +decision is made, an interrupt moderation configuration is selected from
> +the predefined profiles.

I think a short description of the predefined profiles could help.

> +
> +The last step is to notify the registered driver that it should apply the
> +suggested configuration. This is done by scheduling a work function, defined 
> by
> +the Net 
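
The "predefined profiles" the quoted document mentions are tables of (usec, packets) coalescing pairs, with the algorithm's left/right moves selecting an adjacent profile. A minimal sketch of that idea follows; the table values are made up for illustration and are not the kernel's actual net_dim tables.

```c
#include <assert.h>

/* Illustrative profile table: (usecs between interrupts, max packets
 * per interrupt).  Real values live in the net_dim implementation. */
struct moder_profile { unsigned int usec; unsigned int pkts; };

static const struct moder_profile profiles[] = {
	{   1,   2 },	/* leftmost: almost no moderation, lowest latency */
	{   8,  64 },
	{  64, 128 },
	{ 128, 256 },
	{ 256, 512 },	/* rightmost: heavy moderation, fewest interrupts */
};

#define NUM_PROFILES (sizeof(profiles) / sizeof(profiles[0]))

/* Moving "right" increases moderation, "left" decreases it; the level
 * parks at the table edges instead of running off the end. */
static unsigned int step(unsigned int level, int go_right)
{
	if (go_right)
		return level + 1 < NUM_PROFILES ? level + 1 : level;
	return level > 0 ? level - 1 : 0;
}
```

The "parking" behavior discussed earlier in the thread corresponds to repeatedly landing on index 0 or NUM_PROFILES - 1 here.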

[PATCH][next] gre: fix TUNNEL_SEQ bit check on sequence numbering

2018-03-21 Thread Colin King
From: Colin Ian King 

The current logic of flags | TUNNEL_SEQ is always non-zero and hence
sequence numbers are always incremented no matter the setting of the
TUNNEL_SEQ bit.  Fix this by using & instead of |.

Detected by CoverityScan, CID#1466039 ("Operands don't affect result")

Fixes: 77a5196a804e ("gre: add sequence number for collect md mode.")
Signed-off-by: Colin Ian King 
---
 net/ipv4/ip_gre.c  | 2 +-
 net/ipv6/ip6_gre.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2fa2ef2e2af9..9ab1aa2f7660 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -550,7 +550,7 @@ static void gre_fb_xmit(struct sk_buff *skb, struct 
net_device *dev,
(TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ);
gre_build_header(skb, tunnel_hlen, flags, proto,
 tunnel_id_to_key32(tun_info->key.tun_id),
-(flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
+(flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
 
df = key->tun_flags & TUNNEL_DONT_FRAGMENT ?  htons(IP_DF) : 0;
 
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 0bcefc480aeb..3a98c694da5f 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -725,7 +725,7 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
gre_build_header(skb, tunnel->tun_hlen,
 flags, protocol,
 tunnel_id_to_key32(tun_info->key.tun_id),
-(flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
+(flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
  : 0);
 
} else {
-- 
2.15.1
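
The bug Colin's patch fixes is easy to demonstrate in isolation: with `|`, the condition is non-zero for any flags value, so the sequence number was incremented unconditionally. The flag bit values below are illustrative, not the kernel's TUNNEL_* definitions.

```c
#include <assert.h>

#define TUNNEL_CSUM 0x01	/* illustrative bit values */
#define TUNNEL_SEQ  0x02

/* Buggy test: flags | TUNNEL_SEQ is non-zero no matter what flags is */
static int seq_enabled_buggy(unsigned int flags)
{
	return (flags | TUNNEL_SEQ) != 0;
}

/* Fixed test: true only when the TUNNEL_SEQ bit is actually set */
static int seq_enabled_fixed(unsigned int flags)
{
	return (flags & TUNNEL_SEQ) != 0;
}
```

This is the "Operands don't affect result" pattern CoverityScan flags: the `| TUNNEL_SEQ` term forces the expression truthy regardless of `flags`.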



Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation

2018-03-21 Thread Florian Fainelli
Hi Tal,

On 03/21/2018 11:33 AM, Tal Gilboa wrote:
> Net DIM is a generic algorithm, purposed for dynamically
> optimizing network devices interrupt moderation. This
> document describes how it works and how to use it.

Thanks a lot for providing this documentation, this is very helpful! A
few things that could be good to be expanded upon a little bit:

- HW must support configuring a timeout per channel
- HW must support configuring a number of packets before getting an
interrupt per channel

Does that sound about right?

[snip]

> +In order to use Net DIM from a networking driver, the driver needs to call 
> the
> +main net_dim() function. The recommended method is to call net_dim() on each
> +interrupt.

I would make it a bit clearer that this is on each invocation of the
interrupt service routine function. With NAPI + net DIM working in
concert you would not actually get one packet per interrupt
consistently; it would largely depend on the rate, right?

> Since Net DIM has a built-in moderation and it might decide to skip
> +iterations under certain conditions, there is no need to moderate the 
> net_dim()
> +calls as well. As mentioned above, the driver needs to provide an object of 
> type
> +struct net_dim to the net_dim() function call. It is advised for each entity
> +using Net DIM to hold a struct net_dim as part of its data structure and use 
> it
> +as the main Net DIM API object. The struct net_dim_sample should hold the 
> latest
> +bytes, packets and interrupts count. No need to perform any calculations, 
> just
> +include the raw data.
> +
> +The net_dim() call itself does not return anything. Instead Net DIM relies on
> +the driver to provide a callback function, which is called when the algorithm
> +decides to make a change in the interrupt moderation parameters. This
> +callback will be scheduled and run in a separate thread in order not to add
> +overhead to the data flow. After the work is done, Net DIM algorithm needs
> +to be set to the proper state in order to move to the next iteration.
> +
> +
> +Part IV: Example
> +================
> +
> +The following code demonstrates how to register a driver to Net DIM. The
> +actual usage is not complete but it should make the outline of the usage
> +clear.

It could be worth saying a word or two about reflecting the use of Net
DIM within the driver into ethtool_coalesce::use_adaptive_rx_coalesce
and ethtool_coalesce::use_adaptive_tx_coalesce?

> +
> +my_driver.c:
> +
> +#include 
> +
> +/* Callback for net DIM to schedule on a decision to change moderation */
> +void my_driver_do_dim_work(struct work_struct *work)
> +{
> + /* Get struct net_dim from struct work_struct */
> + struct net_dim *dim = container_of(work, struct net_dim,
> +work);
> + /* Do interrupt moderation related stuff */
> + ...
> +
> + /* Signal net DIM work is done and it should move to next iteration */
> + dim->state = NET_DIM_START_MEASURE;
> +}
> +
> +/* My driver's interrupt handler */
> +int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
> +{
> + ...
> + /* A struct to hold current measured data */
> + struct net_dim_sample dim_sample;
> + ...
> + /* Initiate data sample struct with current data */
> + net_dim_sample(my_entity->events,
> +my_entity->packets,
> +my_entity->bytes,
> +&dim_sample);
> + /* Call net DIM */
> + net_dim(&my_entity->dim, dim_sample);
> + ...
> +}
> +
> +/* My entity's initialization function (my_entity was already allocated) */
> +int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
> +{
> + ...
> + /* Initiate struct work_struct with my driver's callback function */
> + INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
> + ...
> +}
> 


-- 
Florian


Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-21 Thread Andrew Lunn
On Wed, Mar 21, 2018 at 11:58:18AM -0700, Richard Cochran wrote:
> The InES at the ZHAW offers a PTP time stamping IP core.  The FPGA
> logic recognizes and time stamps PTP frames on the MII bus.  This
> patch adds a driver for the core along with a device tree binding to
> allow hooking the driver to MAC devices.

Hi Richard

Can you point us at some documentation for this?

I think Florian and I want to better understand how this device works,
in order to understand your other changes.

   Andrew


Re: [PATCH net-next RFC V1 3/5] net: Introduce field for the MII time stamper.

2018-03-21 Thread Florian Fainelli
On 03/21/2018 11:58 AM, Richard Cochran wrote:
> This patch adds a new field to the network device structure to reference
> a time stamping device on the MII bus.  By decoupling the time stamping
> function from the PHY device, we pave the way to allowing a non-PHY
> device to take this role.
> 
> Signed-off-by: Richard Cochran 
> ---
>  drivers/net/phy/mdio_bus.c | 51 +-
>  include/linux/netdevice.h  |  1 +
>  2 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 24b5511222c8..fdac8c8ac272 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -717,6 +717,47 @@ static int mdio_uevent(struct device *dev, struct kobj_uevent_env *env)
>   return 0;
>  }
>  
> +static bool mdiodev_supports_timestamping(struct mdio_device *mdiodev)
> +{
> + if (mdiodev->ts_info  && mdiodev->hwtstamp &&
> + mdiodev->rxtstamp && mdiodev->txtstamp)
> + return true;
> + else
> + return false;
> +}
> +
> +static int mdiobus_netdev_notification(struct notifier_block *nb,
> +unsigned long msg, void *ptr)
> +{
> + struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
> + struct phy_device *phydev = netdev->phydev;
> + struct mdio_device *mdev;
> + struct mii_bus *bus;
> + int i;
> +
> + if (netdev->mdiots || msg != NETDEV_UP || !phydev)
> + return NOTIFY_DONE;

You are still assuming that we have a phy_device somehow, whereas your
patch series wants to solve that for generic MDIO devices; that is a bit
confusing.

> +
> + /*
> +  * Examine the MII bus associated with the PHY that is
> +  * attached to the MAC.  If there is a time stamping device
> +  * on the bus, then connect it to the network device.
> +  */
> + bus = phydev->mdio.bus;
> +
> + for (i = 0; i < PHY_MAX_ADDR; i++) {
> + mdev = bus->mdio_map[i];
> + if (!mdev)
> + continue;
> + if (mdiodev_supports_timestamping(mdev)) {
> + netdev->mdiots = mdev;
> + return NOTIFY_OK;

What guarantees that netdev->mdiots gets cleared? Also, why is this done
with a notifier instead of through phy_{connect,attach,disconnect}? It
looks like we still have this requirement of the mdio TS device being a
phy_device somehow; I am confused here...

> + }
> + }
> +
> + return NOTIFY_DONE;
> +}
> +
>  #ifdef CONFIG_PM
>  static int mdio_bus_suspend(struct device *dev)
>  {

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5fbb9f1da7fd..223d691aa0b0 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1943,6 +1943,7 @@ struct net_device {
>   struct netprio_map __rcu *priomap;
>  #endif
>   struct phy_device   *phydev;
> + struct mdio_device  *mdiots;

phy_device embeds a mdio_device; can you find a way to rework the PHY
PTP code to utilize the phy_device's mdio instance, so as not to introduce
yet another pointer into that big structure that net_device already is?

>   struct lock_class_key   *qdisc_tx_busylock;
>   struct lock_class_key   *qdisc_running_key;
>   boolproto_down;
> 


-- 
Florian


Re: [PATCH net-next RFC V1 2/5] net: phy: Move time stamping interface into the generic mdio layer.

2018-03-21 Thread Florian Fainelli
On 03/21/2018 11:58 AM, Richard Cochran wrote:
> There are different ways of obtaining hardware time stamps on network
> packets.  The ingress and egress times can be measured in the MAC, in
> the PHY, or by a device listening on the MII bus.  Up until now, the
> kernel has support for MAC and PHY time stamping, but not for other
> MII bus devices.
> 
> This patch moves the PHY time stamping interface into the generic
> mdio device in order to support MII time stamping hardware.
> 
> Signed-off-by: Richard Cochran 
> ---
>  drivers/net/phy/dp83640.c | 29 -
>  drivers/net/phy/phy.c |  4 ++--
>  include/linux/mdio.h  | 23 +++
>  include/linux/phy.h   | 23 ---
>  net/core/ethtool.c|  4 ++--
>  net/core/timestamping.c   |  8 
>  6 files changed, 51 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
> index 654f42d00092..79aeb5eb471a 100644
> --- a/drivers/net/phy/dp83640.c
> +++ b/drivers/net/phy/dp83640.c
> @@ -215,6 +215,10 @@ static LIST_HEAD(phyter_clocks);
>  static DEFINE_MUTEX(phyter_clocks_lock);
>  
>  static void rx_timestamp_work(struct work_struct *work);
> +static  int dp83640_ts_info(struct mdio_device *m, struct ethtool_ts_info *i);
> +static  int dp83640_hwtstamp(struct mdio_device *m, struct ifreq *i);
> +static bool dp83640_rxtstamp(struct mdio_device *m, struct sk_buff *s, int t);
> +static void dp83640_txtstamp(struct mdio_device *m, struct sk_buff *s, int t);
>  
>  /* extended register access functions */
>  
> @@ -1162,6 +1166,12 @@ static int dp83640_probe(struct phy_device *phydev)
>   list_add_tail(&dp83640->list, &clock->phylist);
>  
>   dp83640_clock_put(clock);
> +
> + phydev->mdio.ts_info = dp83640_ts_info;
> + phydev->mdio.hwtstamp = dp83640_hwtstamp;
> + phydev->mdio.rxtstamp = dp83640_rxtstamp;
> + phydev->mdio.txtstamp = dp83640_txtstamp;

Why is this implemented at the mdio_device level and not at the
mdio_driver level? This looks like the wrong level at which to do this.
--
Florian


Re: [PATCH v4 00/17] netdev: Eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
On 3/21/2018 10:56 AM, David Miller wrote:
> From: Sinan Kaya 
> Date: Mon, 19 Mar 2018 22:42:15 -0400
> 
>> Code includes wmb() followed by writel() in multiple places. writel()
>> already has a barrier on some architectures like arm64.
>>
>> This ends up with the CPU observing two barriers back to back before
>> executing the register write.
>>
>> Since the code already has an explicit barrier call, change writel() to
>> writel_relaxed().
>>
>> I did a regex search for wmb() followed by writel() in each drivers
>> directory.
>> I scrubbed the ones I care about in this series.
>>
>> I considered "ease of change", "popular usage" and "performance critical
>> path" as the determining criteria for my filtering.
> 
> I agree that for performance sensitive operations, specifically writing
> doorbell registers in the hot paths or RX and TX packet processing, this
> is a good change.
> 
> However, in configuration paths and whatnot, it is much less urgent and
> useful.
> 
> Therefore I think it would work better if you concentrated solely on
> hot code path cases.
> 
> You can, on a driver by driver basis, submit the other transformations
> in the slow paths, and let the driver maintainers decide whether to
> take those on or not.
> 
> Also, please stick exactly to the case where we have:
> 
>   wmb/mb/etc.
>   writel()
> 

OK

> Because I see some changes where we have:
> 
>   writel()
> 
>   barrier()
> 
>   writel()
> 

barrier() on ARM is a write barrier. Apparently, it is a compiler barrier
on Intel. I briefly discussed the barrier() behavior on the rdma mailing
list [1].

Our conclusion was that the code should have used wmb() if it really needed
to synchronize memory contents to the device, so the barrier() is already
wrong. It just guarantees that the code doesn't get moved by the compiler.
writel() already has a compiler barrier inside, so it won't move to begin
with.

Like you suggested, we decided to leave these changes alone and even
skip those drivers.

I'll take another look at the patches.

> for example, and you are turning that second writel() into a relaxed
> one as well.  The above is using a compile barrier, not a memory
> barrier, so effectively it is two writel()'s in sequence which is
> not what this patch set is about.
> 
> If anything, that compile barrier() is superfluous and could be
> removed.  But that is also a separate change from what this patch
> series is doing here.
> 

agreed, I'll remove such changes.

> Finally, it makes it that much easier if we can see the preceding
> memory barrier in the context of the patch that adjusts the writel
> into a writel_relaxed.
> 
> In one case, a macro DOORBELL() is changed to use writel().  This
> makes it so that the patch reviewer has to scan over the entire
> driver in question to see exactly how DOORBELL() is used and whether
> it fits the criteria for the writel_relaxed() transformation.
> 
> I would suggest that you adjust the name of the macro in a situation
> like this, f.e. to DOORBELL_RELAXED().

makes sense.

> 
> Thank you.
> 

[1] https://patchwork.kernel.org/project/LKML/list/?submitter=145491

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH REPOST v4 4/7] igb: eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up with the CPU observing two barriers back to back before
executing the register write.

Since the code already has an explicit barrier call, change writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya 
Reviewed-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae7..82aea92 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5671,7 +5671,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -8072,7 +8072,7 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
-- 
2.7.4



[PATCH net-next RFC V1 0/5] Peer to Peer One-Step time stamping

2018-03-21 Thread Richard Cochran
This series adds support for PTP (IEEE 1588) P2P one-step time
stamping along with a driver for a hardware device that supports this.

If the hardware supports p2p one-step, it subtracts the ingress time
stamp value from the Pdelay_Request correction field.  The user space
software stack then simply copies the correction field into the
Pdelay_Response, and on transmission the hardware adds the egress time
stamp into the correction field.

- Patch 1 adds the new option.
- Patches 2-4 add support for MII time stamping in non-PHY devices.
- Patch 5 adds a driver implementing the new option.

Earlier today I posted user space support as an RFC on the
linuxptp-devel list.  Comments and review are most welcome.

Thanks,
Richard

Richard Cochran (5):
  net: Introduce peer to peer one step PTP time stamping.
  net: phy: Move time stamping interface into the generic mdio layer.
  net: Introduce field for the MII time stamper.
  net: Use the generic MII time stamper when available.
  net: mdio: Add a driver for InES time stamping IP core.

 Documentation/devicetree/bindings/net/ines-ptp.txt |  42 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c   |   1 +
 drivers/net/phy/Makefile   |   1 +
 drivers/net/phy/dp83640.c  |  29 +-
 drivers/net/phy/ines_ptp.c | 857 +
 drivers/net/phy/mdio_bus.c |  51 +-
 drivers/net/phy/phy.c  |   6 +-
 drivers/ptp/Kconfig|  10 +
 include/linux/mdio.h   |  23 +
 include/linux/netdevice.h  |   1 +
 include/linux/phy.h|  23 -
 include/uapi/linux/net_tstamp.h|   8 +
 net/Kconfig|   8 +-
 net/core/dev_ioctl.c   |   1 +
 net/core/ethtool.c |   5 +-
 net/core/timestamping.c|  36 +-
 16 files changed, 1034 insertions(+), 68 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/ines-ptp.txt
 create mode 100644 drivers/net/phy/ines_ptp.c

-- 
2.11.0



[PATCH REPOST v4 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up with the CPU observing two barriers back to back before
executing the register write.

Since the code already has an explicit barrier call, change writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya 
Reviewed-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 8 
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e554aa6cf..9455869 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -185,7 +185,7 @@ static int i40e_program_fdir_filter(struct i40e_fdir_filter *fdir_data,
/* Mark the data descriptor to be watched */
first->next_to_watch = tx_desc;
 
-   writel(tx_ring->next_to_use, tx_ring->tail);
+   writel_relaxed(tx_ring->next_to_use, tx_ring->tail);
return 0;
 
 dma_fail:
@@ -1375,7 +1375,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2258,7 +2258,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 */
wmb();
 
-   writel(xdp_ring->next_to_use, xdp_ring->tail);
+   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
}
 
rx_ring->skb = skb;
@@ -3286,7 +3286,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 357d605..56eea20 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -667,7 +667,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2243,7 +2243,7 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-21 Thread Richard Cochran
The InES at the ZHAW offers a PTP time stamping IP core.  The FPGA
logic recognizes and time stamps PTP frames on the MII bus.  This
patch adds a driver for the core along with a device tree binding to
allow hooking the driver to MAC devices.

Signed-off-by: Richard Cochran 
---
 Documentation/devicetree/bindings/net/ines-ptp.txt |  42 +
 drivers/net/phy/Makefile   |   1 +
 drivers/net/phy/ines_ptp.c | 857 +
 drivers/ptp/Kconfig|  10 +
 4 files changed, 910 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/ines-ptp.txt
 create mode 100644 drivers/net/phy/ines_ptp.c

diff --git a/Documentation/devicetree/bindings/net/ines-ptp.txt b/Documentation/devicetree/bindings/net/ines-ptp.txt
new file mode 100644
index ..ed7b1d773ded
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/ines-ptp.txt
@@ -0,0 +1,42 @@
+ZHAW InES PTP time stamping IP core
+
+The IP core needs two different kinds of nodes.  The control node
+lives somewhere in the memory map and specifies the address of the
+control registers.  There can be up to three port nodes placed on the
+mdio bus.  They associate a particular MAC with a port index within
+the IP core.
+
+Required properties of the control node:
+
+- compatible:  "ines,ptp-ctrl"
+- reg: physical address and size of the register bank
+- phandle: globally unique handle for the ports to point to
+
+Required properties of the port nodes:
+
+- compatible:  "ines,ptp-port"
+- ctrl-handle: points to the control node
+- port-index:  port channel within the IP core
+- reg: phy address. This is required even though the
+   device does not respond to mdio operations
+
+Example:
+
+   timestamper@6000 {
+   compatible = "ines,ptp-ctrl";
+   reg = <0x6000 0x80>;
+   phandle = <0x10>;
+   };
+
+   ethernet@8000 {
+   ...
+   mdio {
+   ...
+   timestamper@1f {
+   compatible = "ines,ptp-port";
+   ctrl-handle = <0x10>;
+   port-index = <0>;
+   reg = <0x1f>;
+   };
+   };
+   };
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 01acbcb2c798..e286bb822295 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_DP83848_PHY) += dp83848.o
 obj-$(CONFIG_DP83867_PHY)  += dp83867.o
 obj-$(CONFIG_FIXED_PHY)+= fixed_phy.o
 obj-$(CONFIG_ICPLUS_PHY)   += icplus.o
+obj-$(CONFIG_INES_PTP_TSTAMP)  += ines_ptp.o
 obj-$(CONFIG_INTEL_XWAY_PHY)   += intel-xway.o
 obj-$(CONFIG_LSI_ET1011C_PHY)  += et1011c.o
 obj-$(CONFIG_LXT_PHY)  += lxt.o
diff --git a/drivers/net/phy/ines_ptp.c b/drivers/net/phy/ines_ptp.c
new file mode 100644
index ..4f66459d4417
--- /dev/null
+++ b/drivers/net/phy/ines_ptp.c
@@ -0,0 +1,857 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018 MOSER-BAER AG
+ */
+#define pr_fmt(fmt) "InES_PTP: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_DESCRIPTION("Driver for the ZHAW InES PTP time stamping IP core");
+MODULE_AUTHOR("Richard Cochran ");
+MODULE_VERSION("1.0");
+MODULE_LICENSE("GPL");
+
+/* GLOBAL register */
+#define MCAST_MAC_SELECT_SHIFT 2
+#define MCAST_MAC_SELECT_MASK  0x3
+#define IO_RESET   BIT(1)
+#define PTP_RESET  BIT(0)
+
+/* VERSION register */
+#define IF_MAJOR_VER_SHIFT 12
+#define IF_MAJOR_VER_MASK  0xf
+#define IF_MINOR_VER_SHIFT 8
+#define IF_MINOR_VER_MASK  0xf
+#define FPGA_MAJOR_VER_SHIFT   4
+#define FPGA_MAJOR_VER_MASK0xf
+#define FPGA_MINOR_VER_SHIFT   0
+#define FPGA_MINOR_VER_MASK0xf
+
+/* INT_STAT register */
+#define RX_INTR_STATUS_3   BIT(5)
+#define RX_INTR_STATUS_2   BIT(4)
+#define RX_INTR_STATUS_1   BIT(3)
+#define TX_INTR_STATUS_3   BIT(2)
+#define TX_INTR_STATUS_2   BIT(1)
+#define TX_INTR_STATUS_1   BIT(0)
+
+/* INT_MSK register */
+#define RX_INTR_MASK_3 BIT(5)
+#define RX_INTR_MASK_2 BIT(4)
+#define RX_INTR_MASK_1 BIT(3)
+#define TX_INTR_MASK_3 BIT(2)
+#define TX_INTR_MASK_2 BIT(1)
+#define TX_INTR_MASK_1 BIT(0)
+
+/* BUF_STAT register */
+#define RX_FIFO_NE_3   BIT(5)
+#define RX_FIFO_NE_2   BIT(4)
+#define RX_FIFO_NE_1   BIT(3)
+#define TX_FIFO_NE_3   BIT(2)
+#define TX_FIFO_NE_2   BIT(1)
+#define TX_FIFO_NE_1   BIT(0)
+
+/* PORT_CONF register */
+#define CM_ONE_STEPBIT(6)
+#define 

[PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP time stamping.

2018-03-21 Thread Richard Cochran
The 1588 standard defines one step operation for both Sync and
PDelay_Resp messages.  Up until now, hardware with P2P one step has
been rare, and kernel support was lacking.  This patch adds support for
the mode in anticipation of new hardware developments.

Signed-off-by: Richard Cochran 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 1 +
 include/uapi/linux/net_tstamp.h  | 8 
 net/core/dev_ioctl.c | 1 +
 3 files changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af4aadb..c6295e5c16af 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -15379,6 +15379,7 @@ int bnx2x_configure_ptp_filters(struct bnx2x *bp)
   NIG_REG_P0_TLLH_PTP_RULE_MASK, 0x3EEE);
break;
case HWTSTAMP_TX_ONESTEP_SYNC:
+   case HWTSTAMP_TX_ONESTEP_P2P:
BNX2X_ERR("One-step timestamping is not supported\n");
return -ERANGE;
}
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 4fe104b2411f..f89b5a836c2a 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -90,6 +90,14 @@ enum hwtstamp_tx_types {
 * queue.
 */
HWTSTAMP_TX_ONESTEP_SYNC,
+
+   /*
+* Same as HWTSTAMP_TX_ONESTEP_SYNC, but also enables time
+* stamp insertion directly into PDelay_Resp packets. In this
+* case, neither transmitted Sync nor PDelay_Resp packets will
+* receive a time stamp via the socket error queue.
+*/
+   HWTSTAMP_TX_ONESTEP_P2P,
 };
 
 /* possible values for hwtstamp_config->rx_filter */
diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 0ab1af04296c..cdda085e4b47 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -187,6 +187,7 @@ static int net_hwtstamp_validate(struct ifreq *ifr)
case HWTSTAMP_TX_OFF:
case HWTSTAMP_TX_ON:
case HWTSTAMP_TX_ONESTEP_SYNC:
+   case HWTSTAMP_TX_ONESTEP_P2P:
tx_type_valid = 1;
break;
}
-- 
2.11.0


