RE: [PATCH net-next] net/tls: Calculate nsg for zerocopy path without skb_cow_data.

2018-08-06 Thread Vakul Garg



> -----Original Message-----
> From: Doron Roberts-Kedes [mailto:doro...@fb.com]
> Sent: Tuesday, August 7, 2018 12:02 AM
> To: Vakul Garg 
> Cc: David S . Miller ; Dave Watson
> ; Boris Pismenny ; Aviad
> Yehezkel ; netdev@vger.kernel.org
> Subject: Re: [PATCH net-next] net/tls: Calculate nsg for zerocopy path
> without skb_cow_data.
> 
> On Fri, Aug 03, 2018 at 11:49:02AM -0700, Doron Roberts-Kedes wrote:
> > On Fri, Aug 03, 2018 at 01:23:33AM +, Vakul Garg wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Doron Roberts-Kedes [mailto:doro...@fb.com]
> > > > Sent: Friday, August 3, 2018 6:00 AM
> > > > To: David S . Miller 
> > > > Cc: Dave Watson ; Vakul Garg
> > > > ; Boris Pismenny ;
> Aviad
> > > > Yehezkel ; netdev@vger.kernel.org; Doron
> > > > Roberts-Kedes 
> > > > Subject: [PATCH net-next] net/tls: Calculate nsg for zerocopy path
> > > > without skb_cow_data.
> > > >
> > > > decrypt_skb fails if the number of sg elements required to map is
> > > > greater than MAX_SKB_FRAGS. As noted by Vakul Garg, nsg must
> > > > always be calculated, but skb_cow_data adds unnecessary memcpy's
> > > > for the zerocopy case.
> > > >
> > > > The new function skb_nsg calculates the number of scatterlist
> > > > elements required to map the skb without the extra overhead of
> > > > skb_cow_data. This function mimics the structure of skb_to_sgvec.
> > > >
> > > > Fixes: c46234ebb4d1 ("tls: RX path for ktls")
> > > > Signed-off-by: Doron Roberts-Kedes 
> > > > ---
> > > >  net/tls/tls_sw.c | 89 ++--
> > > >  1 file changed, 86 insertions(+), 3 deletions(-)
> > > >
> > > > memcpy(iv, tls_ctx->rx.iv, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
> > > > if (!sgout) {
> > > > -   nsg = skb_cow_data(skb, 0, &unused) + 1;
> > > > +   nsg = skb_cow_data(skb, 0, &unused);
> > > > +   } else {
> > > > +   nsg = skb_nsg(skb,
> > > > + rxm->offset + tls_ctx->rx.prepend_size,
> > > > + rxm->full_len - tls_ctx->rx.prepend_size);
> > > > +   if (nsg <= 0)
> > > > +   return nsg;
> > > Comparison should be (nsg < 1). TLS forbids '0' sized records.
> >
> > Yes true, v2 incoming
> >
> 
> Glancing at this a second time, I actually don't believe this should be
> changed. nsg <= 0 is equivalent to nsg < 1. 

Correct.


> Returning 0 if the record is 0 sized is the proper behavior here, since
> decrypting a zero-length skb is a no-op. It is true that zero sized TLS
> records are forbidden, but it is confusing to enforce that in this part
> of the code. I would be surprised to learn that tls_sw_recvmsg could be
> invoked with a len equal to 0, but if I'm wrong, and that case does need
> to be handled, then it should be in a different patch.
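
A condensed view of the control flow under discussion, since the quoting
above is hard to follow (a simplified sketch, not the exact tls_sw.c code;
"unused" is assumed to be the dummy sk_buff pointer decrypt_skb passes to
skb_cow_data()):

	if (!sgout) {
		/* copy path: skb_cow_data() may linearize and memcpy */
		nsg = skb_cow_data(skb, 0, &unused);
	} else {
		/* zerocopy path: only count the scatterlist entries */
		nsg = skb_nsg(skb,
			      rxm->offset + tls_ctx->rx.prepend_size,
			      rxm->full_len - tls_ctx->rx.prepend_size);
		if (nsg <= 0)	/* same test as nsg < 1 for an int */
			return nsg;
	}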


Re: [PATCH net-next 2/2] ibmvnic: Update firmware error reporting with cause string

2018-08-06 Thread Thomas Falcon

On 08/06/2018 10:48 PM, Nathan Fontenot wrote:


On 08/06/2018 09:39 PM, Thomas Falcon wrote:

Print a string instead of the error code. Since there is a
possibility that the driver can recover, classify it as a
warning instead of an error.

Signed-off-by: Thomas Falcon 
---
  drivers/net/ethernet/ibm/ibmvnic.c | 34 ++
  1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c

index 109e4a58efad..dafdd4ade705 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3825,15 +3825,41 @@ static void handle_query_ip_offload_rsp(struct ibmvnic_adapter *adapter)

  ibmvnic_send_crq(adapter, &crq);
  }

+static const char *ibmvnic_fw_err_cause(u16 cause)
+{
+    switch (cause) {
+    case ADAPTER_PROBLEM:
+    return "adapter problem";
+    case BUS_PROBLEM:
+    return "bus problem";
+    case FW_PROBLEM:
+    return "firmware problem";
+    case DD_PROBLEM:
+    return "device driver problem";
+    case EEH_RECOVERY:
+    return "EEH recovery";
+    case FW_UPDATED:
+    return "firmware updated";
+    case LOW_MEMORY:
+    return "low Memory";
+    default:
+    return "unknown";
+    }
+}
+
  static void handle_error_indication(union ibmvnic_crq *crq,
  struct ibmvnic_adapter *adapter)
  {
  struct device *dev = &adapter->vdev->dev;
+    u16 cause;
+
+    cause = be16_to_cpu(crq->error_indication.error_cause);

-    dev_err(dev, "Firmware reports %serror, cause %d\n",
-    crq->error_indication.flags
-    & IBMVNIC_FATAL_ERROR ? "FATAL " : "",
-    be16_to_cpu(crq->error_indication.error_cause));
+    dev_warn_ratelimited(dev,
+ "Firmware reports %serror, cause: %s. Starting recovery...\n",


   ^^^

You're going to want a space after the %s here.

-Nathan



It does look odd at first glance, but there is a space after "FATAL" below.

Thanks,
Tom


+ crq->error_indication.flags
+    & IBMVNIC_FATAL_ERROR ? "FATAL " : "",
+ ibmvnic_fw_err_cause(cause));

  if (crq->error_indication.flags & IBMVNIC_FATAL_ERROR)
  ibmvnic_reset(adapter, VNIC_RESET_FATAL);
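
For illustration, with this change a fatal firmware problem would be logged
roughly as follows (the "30000003" device name is hypothetical, derived only
from the format string in the patch):

  ibmvnic 30000003: Firmware reports FATAL error, cause: firmware problem. Starting recovery...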





Re: [PATCH net-next 2/2] ibmvnic: Update firmware error reporting with cause string

2018-08-06 Thread Nathan Fontenot



On 08/06/2018 09:39 PM, Thomas Falcon wrote:

Print a string instead of the error code. Since there is a
possibility that the driver can recover, classify it as a
warning instead of an error.

Signed-off-by: Thomas Falcon 
---
  drivers/net/ethernet/ibm/ibmvnic.c | 34 ++
  1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 109e4a58efad..dafdd4ade705 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3825,15 +3825,41 @@ static void handle_query_ip_offload_rsp(struct ibmvnic_adapter *adapter)
	ibmvnic_send_crq(adapter, &crq);
  }

+static const char *ibmvnic_fw_err_cause(u16 cause)
+{
+   switch (cause) {
+   case ADAPTER_PROBLEM:
+   return "adapter problem";
+   case BUS_PROBLEM:
+   return "bus problem";
+   case FW_PROBLEM:
+   return "firmware problem";
+   case DD_PROBLEM:
+   return "device driver problem";
+   case EEH_RECOVERY:
+   return "EEH recovery";
+   case FW_UPDATED:
+   return "firmware updated";
+   case LOW_MEMORY:
+   return "low Memory";
+   default:
+   return "unknown";
+   }
+}
+
  static void handle_error_indication(union ibmvnic_crq *crq,
struct ibmvnic_adapter *adapter)
  {
	struct device *dev = &adapter->vdev->dev;
+   u16 cause;
+
+   cause = be16_to_cpu(crq->error_indication.error_cause);

-   dev_err(dev, "Firmware reports %serror, cause %d\n",
-   crq->error_indication.flags
-   & IBMVNIC_FATAL_ERROR ? "FATAL " : "",
-   be16_to_cpu(crq->error_indication.error_cause));
+   dev_warn_ratelimited(dev,
+"Firmware reports %serror, cause: %s. Starting recovery...\n",


   ^^^

You're going to want a space after the %s here.

-Nathan


+crq->error_indication.flags
+   & IBMVNIC_FATAL_ERROR ? "FATAL " : "",
+ibmvnic_fw_err_cause(cause));

if (crq->error_indication.flags & IBMVNIC_FATAL_ERROR)
ibmvnic_reset(adapter, VNIC_RESET_FATAL);





[PATCH net-next 0/2] ibmvnic: Update firmware error reporting

2018-08-06 Thread Thomas Falcon
This patch set cleans out a lot of dead code from the ibmvnic driver
and adds some new code. The error ID field of the descriptor is not filled
in by firmware, so do not print it and do not use it to query for
more detailed information. Remove the unused code written for this.
Finally, update the message to print a string explaining the error
cause instead of just the error code.

Thomas Falcon (2):
  ibmvnic: Remove code to request error information
  ibmvnic: Update firmware error reporting with cause string

 drivers/net/ethernet/ibm/ibmvnic.c | 168 ++---
 drivers/net/ethernet/ibm/ibmvnic.h |  33 
 2 files changed, 26 insertions(+), 175 deletions(-)

-- 
2.12.3



[PATCH net-next 2/2] ibmvnic: Update firmware error reporting with cause string

2018-08-06 Thread Thomas Falcon
Print a string instead of the error code. Since there is a
possibility that the driver can recover, classify it as a
warning instead of an error.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 34 ++
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 109e4a58efad..dafdd4ade705 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3825,15 +3825,41 @@ static void handle_query_ip_offload_rsp(struct ibmvnic_adapter *adapter)
	ibmvnic_send_crq(adapter, &crq);
 }
 
+static const char *ibmvnic_fw_err_cause(u16 cause)
+{
+   switch (cause) {
+   case ADAPTER_PROBLEM:
+   return "adapter problem";
+   case BUS_PROBLEM:
+   return "bus problem";
+   case FW_PROBLEM:
+   return "firmware problem";
+   case DD_PROBLEM:
+   return "device driver problem";
+   case EEH_RECOVERY:
+   return "EEH recovery";
+   case FW_UPDATED:
+   return "firmware updated";
+   case LOW_MEMORY:
+   return "low Memory";
+   default:
+   return "unknown";
+   }
+}
+
 static void handle_error_indication(union ibmvnic_crq *crq,
struct ibmvnic_adapter *adapter)
 {
	struct device *dev = &adapter->vdev->dev;
+   u16 cause;
+
+   cause = be16_to_cpu(crq->error_indication.error_cause);
 
-   dev_err(dev, "Firmware reports %serror, cause %d\n",
-   crq->error_indication.flags
-   & IBMVNIC_FATAL_ERROR ? "FATAL " : "",
-   be16_to_cpu(crq->error_indication.error_cause));
+   dev_warn_ratelimited(dev,
+"Firmware reports %serror, cause: %s. Starting recovery...\n",
+crq->error_indication.flags
+   & IBMVNIC_FATAL_ERROR ? "FATAL " : "",
+ibmvnic_fw_err_cause(cause));
 
if (crq->error_indication.flags & IBMVNIC_FATAL_ERROR)
ibmvnic_reset(adapter, VNIC_RESET_FATAL);
-- 
2.12.3



[PATCH net-next 1/2] ibmvnic: Remove code to request error information

2018-08-06 Thread Thomas Falcon
When backing device firmware reports an error, it provides an
error ID, which is meant to be queried for more detailed error
information. Currently, however, an error ID is not provided by
the Virtual I/O server and there are not any plans to do so. For
now, it is always unfilled or zero, so request_error_information
will never be called.  Remove it.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 144 +
 drivers/net/ethernet/ibm/ibmvnic.h |  33 -
 2 files changed, 1 insertion(+), 176 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index ffe7acbeaa22..109e4a58efad 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -718,23 +718,6 @@ static int init_tx_pools(struct net_device *netdev)
return 0;
 }
 
-static void release_error_buffers(struct ibmvnic_adapter *adapter)
-{
-   struct device *dev = &adapter->vdev->dev;
-   struct ibmvnic_error_buff *error_buff, *tmp;
-   unsigned long flags;
-
-   spin_lock_irqsave(&adapter->error_list_lock, flags);
-   list_for_each_entry_safe(error_buff, tmp, &adapter->errors, list) {
-   list_del(&error_buff->list);
-   dma_unmap_single(dev, error_buff->dma, error_buff->len,
-DMA_FROM_DEVICE);
-   kfree(error_buff->buff);
-   kfree(error_buff);
-   }
-   spin_unlock_irqrestore(&adapter->error_list_lock, flags);
-}
-
 static void ibmvnic_napi_enable(struct ibmvnic_adapter *adapter)
 {
int i;
@@ -896,7 +879,6 @@ static void release_resources(struct ibmvnic_adapter *adapter)
release_tx_pools(adapter);
release_rx_pools(adapter);
 
-   release_error_buffers(adapter);
release_napi(adapter);
release_login_rsp_buffer(adapter);
 }
@@ -3843,133 +3825,16 @@ static void handle_query_ip_offload_rsp(struct ibmvnic_adapter *adapter)
	ibmvnic_send_crq(adapter, &crq);
 }
 
-static void handle_error_info_rsp(union ibmvnic_crq *crq,
- struct ibmvnic_adapter *adapter)
-{
-   struct device *dev = &adapter->vdev->dev;
-   struct ibmvnic_error_buff *error_buff, *tmp;
-   unsigned long flags;
-   bool found = false;
-   int i;
-
-   if (!crq->request_error_rsp.rc.code) {
-   dev_info(dev, "Request Error Rsp returned with rc=%x\n",
-crq->request_error_rsp.rc.code);
-   return;
-   }
-
-   spin_lock_irqsave(&adapter->error_list_lock, flags);
-   list_for_each_entry_safe(error_buff, tmp, &adapter->errors, list)
-   if (error_buff->error_id == crq->request_error_rsp.error_id) {
-   found = true;
-   list_del(&error_buff->list);
-   break;
-   }
-   spin_unlock_irqrestore(&adapter->error_list_lock, flags);
-
-   if (!found) {
-   dev_err(dev, "Couldn't find error id %x\n",
-   be32_to_cpu(crq->request_error_rsp.error_id));
-   return;
-   }
-
-   dev_err(dev, "Detailed info for error id %x:",
-   be32_to_cpu(crq->request_error_rsp.error_id));
-
-   for (i = 0; i < error_buff->len; i++) {
-   pr_cont("%02x", (int)error_buff->buff[i]);
-   if (i % 8 == 7)
-   pr_cont(" ");
-   }
-   pr_cont("\n");
-
-   dma_unmap_single(dev, error_buff->dma, error_buff->len,
-DMA_FROM_DEVICE);
-   kfree(error_buff->buff);
-   kfree(error_buff);
-}
-
-static void request_error_information(struct ibmvnic_adapter *adapter,
- union ibmvnic_crq *err_crq)
-{
-   struct device *dev = &adapter->vdev->dev;
-   struct net_device *netdev = adapter->netdev;
-   struct ibmvnic_error_buff *error_buff;
-   unsigned long timeout = msecs_to_jiffies(30000);
-   union ibmvnic_crq crq;
-   unsigned long flags;
-   int rc, detail_len;
-
-   error_buff = kmalloc(sizeof(*error_buff), GFP_ATOMIC);
-   if (!error_buff)
-   return;
-
-   detail_len = be32_to_cpu(err_crq->error_indication.detail_error_sz);
-   error_buff->buff = kmalloc(detail_len, GFP_ATOMIC);
-   if (!error_buff->buff) {
-   kfree(error_buff);
-   return;
-   }
-
-   error_buff->dma = dma_map_single(dev, error_buff->buff, detail_len,
-DMA_FROM_DEVICE);
-   if (dma_mapping_error(dev, error_buff->dma)) {
-   netdev_err(netdev, "Couldn't map error buffer\n");
-   kfree(error_buff->buff);
-   kfree(error_buff);
-   return;
-   }
-
-   error_buff->len = detail_len;
-   error_buff->error_id = err_crq->error_indication.error_id;
-
-   spin_lock_irqsave(&adapter->error_list_lock, flags);
-   list_add_tail(&error_buff->list, &adapter->errors);
-   spin_unlock_irqrestore(&adapter->error_list_lock, 

Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-06 Thread Jakub Kicinski
On Mon, 6 Aug 2018 16:01:25 +0300, Eran Ben Elisha wrote:
> >> Hi Dave,
> >> I would like to re-state that this feature was not meant to be a generic
> >> one. This feature was added in order to resolve a HW bug which exists in
> >> a small portion of our devices.  
> > 
> > Would you mind describing the HW bug in more detail?  To an outside
> > reviewer it really looks like you're adding a feature.  What are you
> > working around?  Is the lack of full AQM on the PCIe side of the chip
> > considered a bug?  
> 
> In multiple function environment, there is an issue with buffer 
> allocation per function which may lead to starvation. 

Multi-function?  I thought you have a PF per uplink on all mlx5
silicon.  Does the problem occur in single host scenarios as well?
What if with single function the host is too slow with taking packets
off the RX ring?  Can the problem occur?  Would this feature help in
such scenario as well?

> There is a HW WA to mitigate this starvation by identifying this
> state and applying early drop/mark.

If I understand you correctly you presently have one shared buffer with
no way to place limits (quotas) on how much of it can be consumed by a
traffic for a single PF.  It remains unclear why this have not been a
problem for you until now.

To avoid the starvation you are adding AQM which is a *feature* that
may help you avoid queue build up in the NIC.  But even if you could
place quotas, why would you not expose the AQM scheme?  It looks very
useful.

> >> Those params will be used only on those current HWs and won't be in
> >> use for our future devices.  
> > 
> > I'm glad that is your plan today, however, customers may get used to
> > the simple interface you're adding now.  This means the API you are
> > adding is effectively becoming an API other drivers may need to
> > implement to keep compatibility with someone's proprietary
> > orchestration.  
> 
> This issue was refactored, thus no need to have this WA at all in
> future NICs. So I don't believe we will end up in the situation you are
> describing. It is less likely that other vendors will be facing the
> same issue and will have to support such a param. It was born out of a
> bug and not as a feature which others may follow.

Sure other vendors may have buffers quotas configurable by e.g.
devlink-sb.  But the AQM you are adding is a feature which is
potentially already supported by others.

> >> During the discussions, several alternatives were offered to be
> >> used by various members of the community. These alternatives
> >> include TC and enhancements to PCI configuration tools.
> >>
> >> Regarding the TC, from my perspective, this is not an option as:
> >> 1) The HW mechanism handles multiple functions and therefore
> >> cannot be configured as a regular TC  
> > 
> > Could you elaborate?  What are the multiple functions?  You seem to
> > be adding a knob to enable ECN marking and a knob for choosing
> > between some predefined slopes.  
> 
> PSB, the slopes are dynamic and enabled in a dynamic way.
> Indeed, we are adding a very specific knob for very non standard 
> specific issue which can be used in addition to standard ECN marking.
> 
> > In what way would your solution not behave like a RED offload?  
> 
> Existing algos (RED, PIE, etc.) are static and configurable. Our HW WA is 
> dynamic (dynamic slope), adjusted and auto enabled.

You mean like Adaptive RED?  The lack of documentation is making this
conversation harder to have.  What's dynamic and aggressive?  These 
are not antonyms.  Are the parameters to the algorithm configurable?

Will you not want to expose the actual threshold and adjustment values
so the customers can tweak them on their own depending on the workload?

> > With TC offload you'd also get a well-defined set of statistics, I
> > presume right now you're planning on adding a set of ethtool -S
> > counters?
> >   
> >> 2) No PF + representors modeling can be applied here, this is a
> >> MultiHost environment where one host is not aware to the other
> >> hosts, and each is running on its own pci/driver. It is a device
> >> working mode configuration.  
> > 
> > Yes, the multihost part makes it less pleasant.  But this is a
> > problem we have to tackle separately, at some point.  It's not a
> > center of attention here.  
> 
> Agree, however the multihost part makes it non-transparent if we
> choose a solution which is not based on direct vendor configuration.
> This will lead to a bad user experience.

In my experience multi-host is not a major issue in practice.  And
switchdev mode gives some visibility into statistics of others hosts 
etc., which people appreciate.


[net-next:master 1745/1753] drivers/net/ieee802154/mac802154_hwsim.c:39:14: sparse: incorrect type in initializer (different address spaces)

2018-08-06 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   de7de576eca204de9a38e2f6dafe6b7c1ddc85c1
commit: f25da51fdc381ca2863248c7060b3662632f0872 [1745/1753] ieee802154: hwsim: add replacement for fakelb
reproduce:
# apt-get install sparse
git checkout f25da51fdc381ca2863248c7060b3662632f0872
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/ieee802154/mac802154_hwsim.c:39:14: sparse: incorrect type in 
>> initializer (different address spaces) @@expected struct list_head *next 
>> @@got struct list_head struct list_head *next @@
   drivers/net/ieee802154/mac802154_hwsim.c:39:14:expected struct list_head 
*next
   drivers/net/ieee802154/mac802154_hwsim.c:39:14:got struct list_head 
[noderef] *
>> drivers/net/ieee802154/mac802154_hwsim.c:39:14: sparse: incorrect type in 
>> initializer (different address spaces) @@expected struct list_head *prev 
>> @@got struct list_head struct list_head *prev @@
   drivers/net/ieee802154/mac802154_hwsim.c:39:14:expected struct list_head 
*prev
   drivers/net/ieee802154/mac802154_hwsim.c:39:14:got struct list_head 
[noderef] *
>> drivers/net/ieee802154/mac802154_hwsim.c:799:18: sparse: incorrect type in 
>> assignment (different address spaces) @@expected struct hwsim_pib 
>> [noderef] *pib @@got  hwsim_pib [noderef] *pib @@
   drivers/net/ieee802154/mac802154_hwsim.c:799:18:expected struct 
hwsim_pib [noderef] *pib
   drivers/net/ieee802154/mac802154_hwsim.c:799:18:got struct hwsim_pib 
*[assigned] pib
>> drivers/net/ieee802154/mac802154_hwsim.c:801:25: sparse: incorrect type in 
>> argument 1 (different modifiers) @@expected struct list_head *list @@
>> got struct lisstruct list_head *list @@
   drivers/net/ieee802154/mac802154_hwsim.c:801:25:expected struct 
list_head *list
   drivers/net/ieee802154/mac802154_hwsim.c:801:25:got struct list_head 
[noderef] *
>> drivers/net/ieee802154/mac802154_hwsim.c:835:9: sparse: incorrect type in 
>> argument 1 (different address spaces) @@expected struct callback_head 
>> *head @@got struct callback_hstruct callback_head *head @@
   drivers/net/ieee802154/mac802154_hwsim.c:835:9:expected struct 
callback_head *head
   drivers/net/ieee802154/mac802154_hwsim.c:835:9:got struct callback_head 
[noderef] *
>> drivers/net/ieee802154/mac802154_hwsim.c:113:17: sparse: incorrect type in 
>> assignment (different address spaces) @@expected struct hwsim_pib 
>> *pib_old @@got struct hwsim_pib struct hwsim_pib *pib_old @@
   drivers/net/ieee802154/mac802154_hwsim.c:113:17:expected struct 
hwsim_pib *pib_old
   drivers/net/ieee802154/mac802154_hwsim.c:113:17:got struct hwsim_pib 
[noderef] *pib
>> drivers/net/ieee802154/mac802154_hwsim.c:130:9: sparse: incompatible types 
>> in comparison expression (different modifiers)
>> drivers/net/ieee802154/mac802154_hwsim.c:144:33: sparse: incompatible types 
>> in comparison expression (different address spaces)
>> drivers/net/ieee802154/mac802154_hwsim.c:161:40: sparse: incorrect type in 
>> argument 2 (different address spaces) @@expected struct list_head *head 
>> @@got struct list_head struct list_head *head @@
   drivers/net/ieee802154/mac802154_hwsim.c:161:40:expected struct 
list_head *head
   drivers/net/ieee802154/mac802154_hwsim.c:161:40:got struct list_head 
[noderef] *
>> drivers/net/ieee802154/mac802154_hwsim.c:232:25: sparse: incorrect type in 
>> argument 1 (different modifiers) @@expected struct list_head const *head 
>> @@got strustruct list_head const *head @@
   drivers/net/ieee802154/mac802154_hwsim.c:232:25:expected struct 
list_head const *head
   drivers/net/ieee802154/mac802154_hwsim.c:232:25:got struct list_head 
[noderef] *
   drivers/net/ieee802154/mac802154_hwsim.c:243:9: sparse: incompatible types 
in comparison expression (different modifiers)
   drivers/net/ieee802154/mac802154_hwsim.c:260:25: sparse: incompatible types 
in comparison expression (different address spaces)
   drivers/net/ieee802154/mac802154_hwsim.c:460:9: sparse: incompatible types 
in comparison expression (different modifiers)
   drivers/net/ieee802154/mac802154_hwsim.c:515:9: sparse: incompatible types 
in comparison expression (different modifiers)
   drivers/net/ieee802154/mac802154_hwsim.c:573:9: sparse: incompatible types 
in comparison expression (different modifiers)
   drivers/net/ieee802154/mac802154_hwsim.c:695:17: sparse: incompatible types 
in comparison expression (different modifiers)
>> drivers/net/ieee802154/mac802154_hwsim.c:717:41: sparse: incorrect type in 
>> argument 2 (different modifiers) @@expected struct list_head *head @@
>> got struct lisstruct list_head *head @@
   drivers/net/ieee802154/mac802154_hwsim.c:717:41:expected struct 
list_head *head
   drivers/net/ieee802154/mac802154_hwsim.c:717:41:got 

Re: e1000e driver stuck at 10Mbps after reconnection

2018-08-06 Thread Alexander Duyck
On Mon, Aug 6, 2018 at 4:59 AM, Camille Bordignon
 wrote:
> Hello,
>
> Recently we experienced some issues with intel NIC (I219-LM and I219-V).
> It seems that after a wire reconnection, auto-negotiation "fails" and
> link speed drops to 10 Mbps.
>
> From kernel logs:
> [17616.346150] e1000e: enp0s31f6 NIC Link is Down
> [17627.003322] e1000e: enp0s31f6 NIC Link is Up 10 Mbps Full Duplex, Flow Control: None
> [17627.003325] e1000e :00:1f.6 enp0s31f6: 10/100 speed: disabling TSO
>
>
> $ethtool enp0s31f6
> Settings for enp0s31f6:
> Supported ports: [ TP ]
> Supported link modes:   10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Supported pause frame use: No
> Supports auto-negotiation: Yes
> Supported FEC modes: Not reported
> Advertised link modes:  10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Advertised pause frame use: No
> Advertised auto-negotiation: Yes
> Advertised FEC modes: Not reported
> Speed: 10Mb/s
> Duplex: Full
> Port: Twisted Pair
> PHYAD: 1
> Transceiver: internal
> Auto-negotiation: on
> MDI-X: on (auto)
> Supports Wake-on: pumbg
> Wake-on: g
> Current message level: 0x0007 (7)
>drv probe link
> Link detected: yes
>
>
> Notice that if the disconnection lasts less than about 5 seconds,
> nothing wrong happens.
> And if, after the last failure, disconnection / connection occurs again
> and lasts less than 5 seconds, link speed is back to 1000 Mbps.
>
> [18075.350678] e1000e: enp0s31f6 NIC Link is Down
> [18078.716245] e1000e: enp0s31f6 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
>
> The following patch seems to fix this issue.
> However I don't clearly understand why.
>
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 3ba0c90e7055..763c013960f1 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -5069,7 +5069,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
> case e1000_media_type_copper:
> if (hw->mac.get_link_status) {
> ret_val = hw->mac.ops.check_for_link(hw);
> -   link_active = !hw->mac.get_link_status;
> +   link_active = false;
> } else {
> link_active = true;
> }
>
> Maybe this is related to watchdog task.
>
> I found this fix by comparing with the last commit that works fine:
> commit 0b76aae741abb9d16d2c0e67f8b1e766576f897d.
> However I don't know if this information is relevant.
>
> Thank you.
> Camille Bordignon

What kernel were you testing this on? I know there have been a number
of changes over the past few months in this area and it would be
useful to know exactly what code base you started out with and what
the latest version of the kernel is you have tested.

Looking over the code change the net effect of it should be to add a 2
second delay from the time the link has changed until you actually
check the speed/duplex configuration. It is possible we could be
seeing some sort of timing issue and adding the 2 second delay after
the link event is enough time for things to stabilize and detect the
link at 1000 instead of 10/100.

- Alex
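
To make the timing concrete, here is a toy model of the behavior Alex
describes (plain userspace C, not driver code): the watchdog tick that
services a link change reports the link as down, so speed/duplex is only
sampled on the following tick, roughly two seconds later.

#include <stdbool.h>
#include <stdio.h>

/* Models hw->mac.get_link_status: set on a link-change event and
 * cleared by check_for_link() once the PHY reports link up. */
static bool get_link_status = true;

static bool has_link_patched(void)
{
	if (get_link_status) {
		get_link_status = false; /* check_for_link() side effect */
		return false;            /* patched: report down this tick */
	}
	return true;
}

int main(void)
{
	/* The watchdog runs every ~2 s, so the speed/duplex read happens
	 * one tick after the link event, once autoneg has settled. */
	for (int tick = 0; tick < 3; tick++)
		printf("t=%ds: link %s\n", 2 * tick,
		       has_link_patched() ? "up, sample speed/duplex" : "down");
	return 0;
}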


Re: [PATCH net-next 2/3] net: dsa: bcm_sf2: Propagate ethtool::rxnfc to CPU port

2018-08-06 Thread Florian Fainelli



On 08/06/2018 03:32 PM, Andrew Lunn wrote:
>>  int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
>>struct ethtool_rxnfc *nfc, u32 *rule_locs)
>>  {
>> +struct net_device *p = ds->ports[port].cpu_dp->master;
>>  struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
>>  int ret = 0;
>>  
>> @@ -1214,12 +1223,20 @@ int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
>>  
>>  mutex_unlock(&priv->cfp.lock);
> 
> Hi Florian
> 
> I think you should be testing ret here. If you have had a real error,
> you probably should be returning it, rather than overwriting it with
> what ethtool returns below.

Hu, right, thanks for spotting it, this is done in the set_rxnfc() part,
will submit a v2 shortly.

> 
>> +/* Pass up the commands to the attached master network device */
>> +if (p->ethtool_ops->get_rxnfc) {
>> +ret = p->ethtool_ops->get_rxnfc(p, nfc, rule_locs);
>> +if (ret == -EOPNOTSUPP)
>> +ret = 0;
>> +}
>> +
>>  return ret;
>>  }
> 
>Andrew
> 

-- 
Florian


Re: [PATCH net-next 2/3] net: dsa: bcm_sf2: Propagate ethtool::rxnfc to CPU port

2018-08-06 Thread Andrew Lunn
>  int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
> struct ethtool_rxnfc *nfc, u32 *rule_locs)
>  {
> + struct net_device *p = ds->ports[port].cpu_dp->master;
>   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
>   int ret = 0;
>  
> @@ -1214,12 +1223,20 @@ int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
>  
>   mutex_unlock(&priv->cfp.lock);

Hi Florian

I think you should be testing ret here. If you have had a real error,
you probably should be returning it, rather than overwriting it with
what ethtool returns below.

> + /* Pass up the commands to the attached master network device */
> + if (p->ethtool_ops->get_rxnfc) {
> + ret = p->ethtool_ops->get_rxnfc(p, nfc, rule_locs);
> + if (ret == -EOPNOTSUPP)
> + ret = 0;
> + }
> +
>   return ret;
>  }

   Andrew
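
Concretely, the set_rxnfc() half of the same patch already checks ret before
deferring to the master device; the equivalent shape for bcm_sf2_get_rxnfc()
(a sketch of the promised v2, not the actual submission) would be:

	mutex_unlock(&priv->cfp.lock);

	/* Bail out on a real CFP error before consulting the master
	 * device, mirroring what bcm_sf2_set_rxnfc() already does. */
	if (ret)
		return ret;

	/* Pass up the commands to the attached master network device */
	if (p->ethtool_ops->get_rxnfc) {
		ret = p->ethtool_ops->get_rxnfc(p, nfc, rule_locs);
		if (ret == -EOPNOTSUPP)
			ret = 0;
	}

	return ret;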


[PATCH net-next 2/3] net: dsa: bcm_sf2: Propagate ethtool::rxnfc to CPU port

2018-08-06 Thread Florian Fainelli
Allow propagating ethtool::rxnfc programming to the CPU/management port
such that it is possible for such a CPU to perform e.g: Wake-on-LAN
using filters configured by the switch. We need a tiny bit of
cooperation between the switch driver, which is able to do the full flow
matching, and the CPU/management port, which might not. The CPU/management
driver needs to return -EOPNOTSUPP to indicate a non-critical error, and
any other error code otherwise.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2_cfp.c | 40 ---
 1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c
index 1e37b65aab93..e41ea0f3d5c7 100644
--- a/drivers/net/dsa/bcm_sf2_cfp.c
+++ b/drivers/net/dsa/bcm_sf2_cfp.c
@@ -732,6 +732,8 @@ static int bcm_sf2_cfp_rule_set(struct dsa_switch *ds, int port,
struct ethtool_rx_flow_spec *fs)
 {
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
+   s8 cpu_port = ds->ports[port].cpu_dp->index;
+   __u64 ring_cookie = fs->ring_cookie;
unsigned int queue_num, port_num;
int ret = -EINVAL;
 
@@ -748,13 +750,19 @@ static int bcm_sf2_cfp_rule_set(struct dsa_switch *ds, int port,
fs->location > bcm_sf2_cfp_rule_size(priv))
return -EINVAL;
 
+   /* This rule is a Wake-on-LAN filter and we must specifically
+* target the CPU port in order for it to be working.
+*/
+   if (ring_cookie == RX_CLS_FLOW_WAKE)
+   ring_cookie = cpu_port * SF2_NUM_EGRESS_QUEUES;
+
/* We do not support discarding packets, check that the
 * destination port is enabled and that we are within the
 * number of ports supported by the switch
 */
-   port_num = fs->ring_cookie / SF2_NUM_EGRESS_QUEUES;
+   port_num = ring_cookie / SF2_NUM_EGRESS_QUEUES;
 
-   if (fs->ring_cookie == RX_CLS_FLOW_DISC ||
+   if (ring_cookie == RX_CLS_FLOW_DISC ||
!(dsa_is_user_port(ds, port_num) ||
  dsa_is_cpu_port(ds, port_num)) ||
port_num >= priv->hw_params.num_ports)
@@ -763,7 +771,7 @@ static int bcm_sf2_cfp_rule_set(struct dsa_switch *ds, int port,
 * We have a small oddity where Port 6 just does not have a
 * valid bit here (so we substract by one).
 */
-   queue_num = fs->ring_cookie % SF2_NUM_EGRESS_QUEUES;
+   queue_num = ring_cookie % SF2_NUM_EGRESS_QUEUES;
if (port_num >= 7)
port_num -= 1;
 
@@ -1188,6 +1196,7 @@ static int bcm_sf2_cfp_rule_get_all(struct bcm_sf2_priv *priv,
 int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
  struct ethtool_rxnfc *nfc, u32 *rule_locs)
 {
+   struct net_device *p = ds->ports[port].cpu_dp->master;
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
int ret = 0;
 
@@ -1214,12 +1223,20 @@ int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
 
	mutex_unlock(&priv->cfp.lock);
 
+   /* Pass up the commands to the attached master network device */
+   if (p->ethtool_ops->get_rxnfc) {
+   ret = p->ethtool_ops->get_rxnfc(p, nfc, rule_locs);
+   if (ret == -EOPNOTSUPP)
+   ret = 0;
+   }
+
return ret;
 }
 
 int bcm_sf2_set_rxnfc(struct dsa_switch *ds, int port,
  struct ethtool_rxnfc *nfc)
 {
+   struct net_device *p = ds->ports[port].cpu_dp->master;
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
int ret = 0;
 
@@ -1240,6 +1257,23 @@ int bcm_sf2_set_rxnfc(struct dsa_switch *ds, int port,
 
	mutex_unlock(&priv->cfp.lock);
 
+   if (ret)
+   return ret;
+
+   /* Pass up the commands to the attached master network device.
+* This can fail, so rollback the operation if we need to.
+*/
+   if (p->ethtool_ops->set_rxnfc) {
+   ret = p->ethtool_ops->set_rxnfc(p, nfc);
+   if (ret && ret != -EOPNOTSUPP) {
+   mutex_lock(&priv->cfp.lock);
+   bcm_sf2_cfp_rule_del(priv, port, nfc->fs.location);
+   mutex_unlock(&priv->cfp.lock);
+   } else {
+   ret = 0;
+   }
+   }
+
return ret;
 }
 
-- 
2.17.1



[PATCH net-next 0/3] net: Support Wake-on-LAN using filters

2018-08-06 Thread Florian Fainelli
Hi David, John,

This is technically a v2, but this patch series builds on your feedback
and defines the following:

- a new WAKE_* bit: WAKE_FILTER which can be enabled alongside other type
  of Wake-on-LAN to support waking up on a programmed filter (match + action)
- a new RX_CLS_FLOW_WAKE flow action which can be specified by a user when
  inserting a flow using ethtool::rxnfc, similar to the existing RX_CLS_FLOW_DISC

The bcm_sf2 and bcmsysport drivers are updated accordingly to work in concert to
allow matching packets at the switch level, identified by their filter location
to be used as a match by the SYSTEM PORT (CPU/management controller) during
Wake-on-LAN.

Let me know if this looks better than the previous incarnation of the patch
series.

Attached is also the ethtool patch that I would be submitting once the uapi
changes are committed.

Thank you!

Florian Fainelli (3):
  ethtool: Add WAKE_FILTER and RX_CLS_FLOW_WAKE
  net: dsa: bcm_sf2: Propagate ethtool::rxnfc to CPU port
  net: systemport: Add support for WAKE_FILTER

 drivers/net/dsa/bcm_sf2_cfp.c  |  40 -
 drivers/net/ethernet/broadcom/bcmsysport.c | 193 -
 drivers/net/ethernet/broadcom/bcmsysport.h |  11 +-
 include/uapi/linux/ethtool.h   |   5 +-
 4 files changed, 236 insertions(+), 13 deletions(-)

-- 
2.17.1



[PATCH] ethtool: Add support for Wake-on-LAN using filters

2018-08-06 Thread Florian Fainelli
Define a way to specify that a flow's action is to be used for
Wake-on-LAN purposes (using -2 as an action value) and define a new
Wake-on-LAN flag: 'f' which enables the Ethernet adapter for Wake-on-LAN
using filters.

Signed-off-by: Florian Fainelli 
---
 ethtool-copy.h | 2 ++
 ethtool.8.in   | 4 +++-
 ethtool.c  | 7 ++-
 rxclass.c  | 8 +---
 4 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 8cc61e9ab40b..943af3882aa1 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -1628,6 +1628,7 @@ static __inline__ int ethtool_validate_duplex(__u8 duplex)
 #define WAKE_ARP   (1 << 4)
 #define WAKE_MAGIC (1 << 5)
 #define WAKE_MAGICSECURE   (1 << 6) /* only meaningful if WAKE_MAGIC */
+#define WAKE_FILTER(1 << 7)
 
 /* L2-L4 network traffic flow types */
 #define	TCP_V4_FLOW	0x01	/* hash or spec (tcp_ip4_spec) */
@@ -1665,6 +1666,7 @@ static __inline__ int ethtool_validate_duplex(__u8 duplex)
 #define	RXH_DISCARD	(1 << 31)
 
 #define	RX_CLS_FLOW_DISC	0xffffffffffffffffULL
+#define RX_CLS_FLOW_WAKE	0xfffffffffffffffeULL
 
 /* Special RX classification rule insert location values */
 #define RX_CLS_LOC_SPECIAL 0x8000  /* flag */
diff --git a/ethtool.8.in b/ethtool.8.in
index 0a366aa536ae..61923eed7c27 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -58,7 +58,7 @@
 .\"
 .\"\(*WO - wol flags
 .\"
-.ds WO \fBp\fP|\fBu\fP|\fBm\fP|\fBb\fP|\fBa\fP|\fBg\fP|\fBs\fP|\fBd\fP...
+.ds WO \fBp\fP|\fBu\fP|\fBm\fP|\fBb\fP|\fBa\fP|\fBg\fP|\fBs\fP|\fBd|\fBf\fP...
 .\"
 .\"\(*FL - flow type values
 .\"
@@ -679,6 +679,7 @@ b   Wake on broadcast messages
 a  Wake on ARP
 g  Wake on MagicPacket\[tm]
 s  Enable SecureOn\[tm] password for MagicPacket\[tm]
+f  Wake on filter(s)
 d  T{
 Disable (wake on nothing).  This option clears all previous options.
 T}
@@ -870,6 +871,7 @@ Specifies the Rx queue to send packets to, or some other action.
 nokeep;
 lB l.
 -1 Drop the matched flow
+-2 Use the matched flow as a Wake-on-LAN filter
0 or higher	Rx queue to route the flow
 .TE
 .TP
diff --git a/ethtool.c b/ethtool.c
index fb93ae898312..5e91ef9cecdb 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -931,6 +931,9 @@ static int parse_wolopts(char *optstr, u32 *data)
case 's':
*data |= WAKE_MAGICSECURE;
break;
+   case 'f':
+   *data |= WAKE_FILTER;
+   break;
case 'd':
*data = 0;
break;
@@ -964,6 +967,8 @@ static char *unparse_wolopts(int wolopts)
*p++ = 'g';
if (wolopts & WAKE_MAGICSECURE)
*p++ = 's';
+   if (wolopts & WAKE_FILTER)
+   *p++ = 'f';
} else {
*p = 'd';
}
@@ -4224,7 +4229,7 @@ static int flow_spec_to_ntuple(struct ethtool_rx_flow_spec *fsp,
return -1;
 
/* verify ring cookie can transfer to action */
-   if (fsp->ring_cookie > INT_MAX && fsp->ring_cookie < (u64)(-2))
+   if (fsp->ring_cookie > INT_MAX && fsp->ring_cookie < (u64)(-3))
return -1;
 
/* verify only one field is setting data field */
diff --git a/rxclass.c b/rxclass.c
index 42d122d1ed86..79972651e706 100644
--- a/rxclass.c
+++ b/rxclass.c
@@ -251,7 +251,11 @@ static void rxclass_print_nfc_rule(struct ethtool_rx_flow_spec *fsp,
if (fsp->flow_type & FLOW_RSS)
fprintf(stdout, "\tRSS Context ID: %u\n", rss_context);
 
-   if (fsp->ring_cookie != RX_CLS_FLOW_DISC) {
+   if (fsp->ring_cookie == RX_CLS_FLOW_DISC) {
+   fprintf(stdout, "\tAction: Drop\n");
+   } else if (fsp->ring_cookie == RX_CLS_FLOW_WAKE) {
+   fprintf(stdout, "\tAction: Wake-on-LAN\n");
+   } else {
u64 vf = ethtool_get_flow_spec_ring_vf(fsp->ring_cookie);
u64 queue = ethtool_get_flow_spec_ring(fsp->ring_cookie);
 
@@ -266,8 +270,6 @@ static void rxclass_print_nfc_rule(struct ethtool_rx_flow_spec *fsp,
else
fprintf(stdout, "\tAction: Direct to queue %llu\n",
queue);
-   } else {
-   fprintf(stdout, "\tAction: Drop\n");
}
 
fprintf(stdout, "\n");
-- 
2.17.1
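
Taken together with the kernel series above, the intended usage would look
something like this (interface name and match are illustrative only):

  # mark a filter as a wake-up source: action -2 maps to RX_CLS_FLOW_WAKE
  ethtool -N eth0 flow-type udp4 dst-port 9 action -2
  # enable Wake-on-LAN from programmed filters via the new 'f' flag
  ethtool -s eth0 wol f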



[PATCH net-next 3/3] net: systemport: Add support for WAKE_FILTER

2018-08-06 Thread Florian Fainelli
The SYSTEMPORT MAC allows up to 8 filters to be programmed to wake-up
from LAN. Verify that we have up to 8 filters and program them to the
appropriate RXCHK entries to be matched (along with their masks).

We need to update the entry and exit to Wake-on-LAN mode to keep the
RXCHK engine running to match during suspend, but this is otherwise
fairly similar to Magic Packet detection.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 193 -
 drivers/net/ethernet/broadcom/bcmsysport.h |  11 +-
 2 files changed, 195 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
index 284581c9680e..ca47309d4494 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -521,7 +521,7 @@ static void bcm_sysport_get_wol(struct net_device *dev,
struct bcm_sysport_priv *priv = netdev_priv(dev);
u32 reg;
 
-   wol->supported = WAKE_MAGIC | WAKE_MAGICSECURE;
+   wol->supported = WAKE_MAGIC | WAKE_MAGICSECURE | WAKE_FILTER;
wol->wolopts = priv->wolopts;
 
if (!(priv->wolopts & WAKE_MAGICSECURE))
@@ -539,7 +539,7 @@ static int bcm_sysport_set_wol(struct net_device *dev,
 {
struct bcm_sysport_priv *priv = netdev_priv(dev);
	struct device *kdev = &priv->pdev->dev;
-   u32 supported = WAKE_MAGIC | WAKE_MAGICSECURE;
+   u32 supported = WAKE_MAGIC | WAKE_MAGICSECURE | WAKE_FILTER;
 
if (!device_can_wakeup(kdev))
return -ENOTSUPP;
@@ -1043,7 +1043,7 @@ static int bcm_sysport_poll(struct napi_struct *napi, int budget)
 
 static void mpd_enable_set(struct bcm_sysport_priv *priv, bool enable)
 {
-   u32 reg;
+   u32 reg, bit;
 
reg = umac_readl(priv, UMAC_MPD_CTRL);
if (enable)
@@ -1051,12 +1051,32 @@ static void mpd_enable_set(struct bcm_sysport_priv *priv, bool enable)
else
reg &= ~MPD_EN;
umac_writel(priv, reg, UMAC_MPD_CTRL);
+
+   if (priv->is_lite)
+   bit = RBUF_ACPI_EN_LITE;
+   else
+   bit = RBUF_ACPI_EN;
+
+   reg = rbuf_readl(priv, RBUF_CONTROL);
+   if (enable)
+   reg |= bit;
+   else
+   reg &= ~bit;
+   rbuf_writel(priv, reg, RBUF_CONTROL);
 }
 
 static void bcm_sysport_resume_from_wol(struct bcm_sysport_priv *priv)
 {
+   u32 reg;
+
/* Stop monitoring MPD interrupt */
-   intrl2_0_mask_set(priv, INTRL2_0_MPD);
+   intrl2_0_mask_set(priv, INTRL2_0_MPD | INTRL2_0_BRCM_MATCH_TAG);
+
+   /* Disable RXCHK, active filters and Broadcom tag matching */
+   reg = rxchk_readl(priv, RXCHK_CONTROL);
+   reg &= ~(RXCHK_BRCM_TAG_MATCH_MASK <<
+RXCHK_BRCM_TAG_MATCH_SHIFT | RXCHK_EN | RXCHK_BRCM_TAG_EN);
+   rxchk_writel(priv, reg, RXCHK_CONTROL);
 
/* Clear the MagicPacket detection logic */
mpd_enable_set(priv, false);
@@ -1085,6 +1105,7 @@ static irqreturn_t bcm_sysport_rx_isr(int irq, void *dev_id)
struct bcm_sysport_priv *priv = netdev_priv(dev);
struct bcm_sysport_tx_ring *txr;
unsigned int ring, ring_bit;
+   u32 reg;
 
priv->irq0_stat = intrl2_0_readl(priv, INTRL2_CPU_STATUS) &
  ~intrl2_0_readl(priv, INTRL2_CPU_MASK_STATUS);
@@ -1111,7 +1132,14 @@ static irqreturn_t bcm_sysport_rx_isr(int irq, void *dev_id)
bcm_sysport_tx_reclaim_all(priv);
 
if (priv->irq0_stat & INTRL2_0_MPD)
-   netdev_info(priv->netdev, "Wake-on-LAN interrupt!\n");
+   netdev_info(priv->netdev, "Wake-on-LAN (MPD) interrupt!\n");
+
+   if (priv->irq0_stat & INTRL2_0_BRCM_MATCH_TAG) {
+   reg = rxchk_readl(priv, RXCHK_BRCM_TAG_MATCH_STATUS) &
+ RXCHK_BRCM_TAG_MATCH_MASK;
+   netdev_info(priv->netdev,
+   "Wake-on-LAN (filters 0x%02x) interrupt!\n", reg);
+   }
 
if (!priv->is_lite)
goto out;
@@ -2096,6 +2124,132 @@ static int bcm_sysport_stop(struct net_device *dev)
return 0;
 }
 
+static int bcm_sysport_rule_find(struct bcm_sysport_priv *priv,
+u64 location)
+{
+   unsigned int index;
+   u32 reg;
+
+   for_each_set_bit(index, priv->filters, RXCHK_BRCM_TAG_MAX) {
+   reg = rxchk_readl(priv, RXCHK_BRCM_TAG(index));
+   reg >>= RXCHK_BRCM_TAG_CID_SHIFT;
+   reg &= RXCHK_BRCM_TAG_CID_MASK;
+   if (reg == location)
+   return index;
+   }
+
+   return -EINVAL;
+}
+
+static int bcm_sysport_rule_get(struct bcm_sysport_priv *priv,
+   struct ethtool_rxnfc *nfc)
+{
+   int index;
+
+   /* This is not a rule that we know about */
+   index = bcm_sysport_rule_find(priv, nfc->fs.location);
+   if (index < 0)
+   return 

[PATCH net-next 1/3] ethtool: Add WAKE_FILTER and RX_CLS_FLOW_WAKE

2018-08-06 Thread Florian Fainelli
Add the ability to specify through ethtool::rxnfc that a rule location is
special and will be used to participate in Wake-on-LAN, by e.g: having a
specific pattern be matched. When this is the case, fs->ring_cookie must
be set to the special value RX_CLS_FLOW_WAKE.

We also define an additional ethtool::wolinfo flag: WAKE_FILTER which
can be used to configure an Ethernet adapter to allow Wake-on-LAN using
previously programmed filters.

Signed-off-by: Florian Fainelli 
---
 include/uapi/linux/ethtool.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 813282cc8af6..dc69391d2bba 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -870,7 +870,8 @@ struct ethtool_flow_ext {
  * includes the %FLOW_EXT or %FLOW_MAC_EXT flag
  * (see  ethtool_flow_ext description).
  * @ring_cookie: RX ring/queue index to deliver to, or %RX_CLS_FLOW_DISC
- * if packets should be discarded
+ * if packets should be discarded, or %RX_CLS_FLOW_WAKE if the
+ * packets should be used for Wake-on-LAN with %WAKE_FILTER
  * @location: Location of rule in the table.  Locations must be
  * numbered such that a flow matching multiple rules will be
  * classified according to the first (lowest numbered) rule.
@@ -1634,6 +1635,7 @@ static inline int ethtool_validate_duplex(__u8 duplex)
 #define WAKE_ARP   (1 << 4)
 #define WAKE_MAGIC (1 << 5)
 #define WAKE_MAGICSECURE   (1 << 6) /* only meaningful if WAKE_MAGIC */
+#define WAKE_FILTER(1 << 7)
 
 /* L2-L4 network traffic flow types */
 #define	TCP_V4_FLOW	0x01	/* hash or spec (tcp_ip4_spec) */
@@ -1671,6 +1673,7 @@ static inline int ethtool_validate_duplex(__u8 duplex)
 #define	RXH_DISCARD	(1 << 31)
 
 #define	RX_CLS_FLOW_DISC	0xffffffffffffffffULL
+#define RX_CLS_FLOW_WAKE	0xfffffffffffffffeULL
 
 /* Special RX classification rule insert location values */
 #define RX_CLS_LOC_SPECIAL 0x8000  /* flag */
-- 
2.17.1



Re: [endianness bug?] cxgb4_next_header .match_val/.match_mask should be net-endian

2018-08-06 Thread Al Viro
On Tue, Aug 07, 2018 at 03:51:34AM +0800, kbuild test robot wrote:
> Hi Al,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on net-next/master]
> [also build test WARNING on v4.18-rc8 next-20180806]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Al-Viro/cxgb4_next_header-match_val-match_mask-should-be-net-endian/20180806-183127
> reproduce:
> # apt-get install sparse
> make ARCH=x86_64 allmodconfig
> make C=1 CF=-D__CHECK_ENDIAN__

Yes, there are arseloads of misannotations.  And I do have followups
that *do* annotate that (which is exactly how that bug had been found
in the first place).  But I'd rather separate fixes from trivial
annotations - the former might need to be backported, the latter would
only cause extra PITA for backport.


Re: [PATCH net] packet: refine ring v3 block size test to hold one frame

2018-08-06 Thread David Miller
From: Willem de Bruijn 
Date: Mon,  6 Aug 2018 10:38:34 -0400

> From: Willem de Bruijn 
> 
> TPACKET_V3 stores variable length frames in fixed length blocks.
> Blocks must be able to store a block header, optional private space
> and at least one minimum sized frame.
> 
> Frames, even for a zero snaplen packet, store metadata headers and
> optional reserved space.
> 
> In the block size bounds check, ensure that the frame of the
> chosen configuration fits. This includes sockaddr_ll and optional
> tp_reserve.
> 
> Syzbot was able to construct a ring with insufficient room for the
> sockaddr_ll in the header of a zero-length frame, triggering an
> out-of-bounds write in dev_parse_header.
> 
> Convert the comparison to less than, as zero is a valid snap len.
> This matches the test for minimum tp_frame_size immediately below.
> 
> Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
> Fixes: eb73190f4fbe ("net/packet: refine check for priv area size")
> Reported-by: syzbot 
> Signed-off-by: Willem de Bruijn 

Good catch, applied and queued up for -stable, thanks.
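
For context, the shape of the corrected bound (reconstructed from the
description above, not copied from the commit; TPACKET3_HDRLEN already
accounts for the sockaddr_ll mentioned in the message):

	/* A block must fit its header, the optional private area, and at
	 * least one minimum-sized frame.  Note '<' rather than '<=':
	 * a zero snaplen frame exactly filling the block is valid. */
	min_frame_size = TPACKET3_HDRLEN + po->tp_reserve;
	if (req->tp_block_size <
	    BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv) + min_frame_size)
		goto out;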


Re: [PATCH bpf] bpf: btf: Change tools/lib/bpf/btf to LGPL

2018-08-06 Thread Jakub Kicinski
On Sun, 5 Aug 2018 17:19:13 -0700, Martin KaFai Lau wrote:
> This patch changes the tools/lib/bpf/btf.[ch] to LGPL which
> is in line with libbpf also.

FWIW:

Reported-by: David Beckett 

And in case removing code in 6534770d6f17 ("tools: bpf: fix BTF code
added twice to different trees") counts as a contribution:

Acked-by: Jakub Kicinski 

for the license change :)

> Signed-off-by: Martin KaFai Lau 
> ---
>  tools/lib/bpf/btf.c | 2 +-
>  tools/lib/bpf/btf.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
> index 2d270c560df3..c36a3a76986a 100644
> --- a/tools/lib/bpf/btf.c
> +++ b/tools/lib/bpf/btf.c
> @@ -1,4 +1,4 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> +// SPDX-License-Identifier: LGPL-2.1
>  /* Copyright (c) 2018 Facebook */
>  
>  #include 
> diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
> index e2a09a155f84..caac3a404dc5 100644
> --- a/tools/lib/bpf/btf.h
> +++ b/tools/lib/bpf/btf.h
> @@ -1,4 +1,4 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> +/* SPDX-License-Identifier: LGPL-2.1 */
>  /* Copyright (c) 2018 Facebook */
>  
>  #ifndef __BPF_BTF_H



Re: pull-request: ieee802154-next 2018-08-06

2018-08-06 Thread David Miller
From: Stefan Schmidt 
Date: Mon,  6 Aug 2018 21:38:50 +0200

> An update from ieee802154 for *net-next*
> 
> Romuald added a socket option to get the LQI value of the received datagram.
> Alexander added a new hardware simulation driver modelled after hwsim of the
> wireless people. It allows runtime configuration for new nodes and edges over 
> a
> netlink interface (a config utlity is making its way into wpan-tools).
> We also have three fixes in here. One from Colin which is more of a cleanup 
> and
> two from Alex fixing tailroom and single frame space problems.
> I would normally put the last two into my fixes tree, but given we are already
> in -rc8 I simply put them here and added a cc: stable to them.
> 
> Please pull, or let me know if there are any problems.

Pulled, thank you.


Re: [PATCH net-next] ipv4: frags: precedence bug in ip_expire()

2018-08-06 Thread David Miller
From: Dan Carpenter 
Date: Mon, 6 Aug 2018 22:17:35 +0300

> We accidentally removed the parentheses here, but they are required
> because '!' has higher precedence than '&'.
> 
> Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
> Signed-off-by: Dan Carpenter 

Ugh, good catch, applied, thanks!
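
The class of bug in miniature: '!' binds tighter than '&', so dropping the
parentheses changes the meaning entirely (standalone demo; the flag values
are illustrative, not the kernel's):

#include <stdio.h>

#define INET_FRAG_FIRST_IN 1	/* illustrative bit values */
#define INET_FRAG_COMPLETE 4

int main(void)
{
	unsigned int flags = INET_FRAG_FIRST_IN;	/* COMPLETE bit clear */

	/* Without parens this parses as (!flags) & INET_FRAG_COMPLETE,
	 * which is 0 for any non-zero flags, whatever the bit's state. */
	printf("buggy:   %d\n", !flags & INET_FRAG_COMPLETE);	/* prints 0 */
	/* With parens we actually test whether the bit is clear. */
	printf("correct: %d\n", !(flags & INET_FRAG_COMPLETE));	/* prints 1 */
	return 0;
}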


[PATCH net-next] liquidio: avoided acquiring post_lock for data only queues

2018-08-06 Thread Felix Manlunas
From: Intiyaz Basha 

All control commands (soft commands) go through only Queue 0
(control and data queue). So only queue-0 needs post_lock;
other queues are only data queues and do not need post_lock.

Added a flag to indicate the queue can be used for soft commands.

If this flag is set, post_lock must be acquired before posting
a command to the queue.
If this flag is clear, post_lock is invalid for the queue.

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/octeon_iq.h   | 10 ++
 .../net/ethernet/cavium/liquidio/request_manager.c | 22 +++---
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_iq.h b/drivers/net/ethernet/cavium/liquidio/octeon_iq.h
index 5fed7b6..2327062 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_iq.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_iq.h
@@ -82,6 +82,16 @@ struct octeon_instr_queue {
/** A spinlock to protect while posting on the ring.  */
spinlock_t post_lock;
 
+   /** This flag indicates if the queue can be used for soft commands.
+*  If this flag is set, post_lock must be acquired before posting
+*  a command to the queue.
+*  If this flag is clear, post_lock is invalid for the queue.
+*  All control commands (soft commands) will go through only Queue 0
+*  (control and data queue). So only queue-0 needs post_lock,
+*  other queues are only data queues and do not need post_lock
+*/
+   bool allow_soft_cmds;
+
u32 pkt_in_done;
 
/** A spinlock to protect access to the input ring.*/
diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c b/drivers/net/ethernet/cavium/liquidio/request_manager.c
index d5d9e47..8f746e1 100644
--- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
+++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
@@ -126,7 +126,12 @@ int octeon_init_instr_queue(struct octeon_device *oct,
 
/* Initialize the spinlock for this instruction queue */
	spin_lock_init(&iq->lock);
-   spin_lock_init(&iq->post_lock);
+   if (iq_no == 0) {
+   iq->allow_soft_cmds = true;
+   spin_lock_init(&iq->post_lock);
+   } else {
+   iq->allow_soft_cmds = false;
+   }

	spin_lock_init(&iq->iq_flush_running_lock);
 
@@ -566,7 +571,8 @@ octeon_send_command(struct octeon_device *oct, u32 iq_no,
/* Get the lock and prevent other tasks and tx interrupt handler from
 * running.
 */
-   spin_lock_bh(&iq->post_lock);
+   if (iq->allow_soft_cmds)
+   spin_lock_bh(&iq->post_lock);
 
st = __post_command2(iq, cmd);
 
@@ -583,7 +589,8 @@ octeon_send_command(struct octeon_device *oct, u32 iq_no,
INCR_INSTRQUEUE_PKT_COUNT(oct, iq_no, instr_dropped, 1);
}
 
-   spin_unlock_bh(&iq->post_lock);
+   if (iq->allow_soft_cmds)
+   spin_unlock_bh(&iq->post_lock);
 
/* This is only done here to expedite packets being flushed
 * for cases where there are no IQ completion interrupts.
@@ -702,11 +709,20 @@ octeon_prepare_soft_command(struct octeon_device *oct,
 int octeon_send_soft_command(struct octeon_device *oct,
 struct octeon_soft_command *sc)
 {
+   struct octeon_instr_queue *iq;
struct octeon_instr_ih2 *ih2;
struct octeon_instr_ih3 *ih3;
struct octeon_instr_irh *irh;
u32 len;
 
+   iq = oct->instr_queue[sc->iq_no];
+   if (!iq->allow_soft_cmds) {
+   dev_err(&oct->pci_dev->dev, "Soft commands are not allowed on Queue %d\n",
+   sc->iq_no);
+   INCR_INSTRQUEUE_PKT_COUNT(oct, sc->iq_no, instr_dropped, 1);
+   return IQ_SEND_FAILED;
+   }
+
if (OCTEON_CN23XX_PF(oct) || OCTEON_CN23XX_VF(oct)) {
	ih3 =  (struct octeon_instr_ih3 *)&sc->cmd.cmd3.ih3;
if (ih3->dlengsz) {
-- 
2.9.0



Re: [endianness bug?] cxgb4_next_header .match_val/.match_mask should be net-endian

2018-08-06 Thread kbuild test robot
Hi Al,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
[also build test WARNING on v4.18-rc8 next-20180806]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Al-Viro/cxgb4_next_header-match_val-match_mask-should-be-net-endian/20180806-183127
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h:262:40: sparse: incorrect type in initializer (different base types)
      expected unsigned int [unsigned] [usertype] match_val
      got restricted __be32 [usertype] match_val
>> drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h:263:25: sparse: incorrect type in initializer (different base types)
      expected unsigned int [unsigned] [usertype] match_mask
      got restricted __be32 [usertype] match_mask
   drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h:265:40: sparse: incorrect type in initializer (different base types)
      expected unsigned int [unsigned] [usertype] match_val
      got restricted __be32 [usertype] match_val
   drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h:266:25: sparse: incorrect type in initializer (different base types)
      expected unsigned int [unsigned] [usertype] match_mask
      got restricted __be32 [usertype] match_mask
   drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h:275:40: sparse: incorrect type in initializer (different base types)
      expected unsigned int [unsigned] [usertype] match_val
      got restricted __be32 [usertype] match_val
   drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h:276:25: sparse: incorrect type in initializer (different base types)
      expected unsigned int [unsigned] [usertype] match_mask
      got restricted __be32 [usertype] match_mask
   drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h:278:40: sparse: incorrect type in initializer (different base types)
      expected unsigned int [unsigned] [usertype] match_val
      got restricted __be32 [usertype] match_val
   drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h:279:25: sparse: incorrect type in initializer (different base types)
      expected unsigned int [unsigned] [usertype] match_mask
      got restricted __be32 [usertype] match_mask
   drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c:56:21: sparse: incorrect type in assignment (different base types)
      expected unsigned int [unsigned] [usertype] val
      got restricted __be32 [usertype] val
   drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c:57:22: sparse: incorrect type in assignment (different base types) @

[PATCH net-next,v3] net/tls: Calculate nsg for zerocopy path without skb_cow_data.

2018-08-06 Thread Doron Roberts-Kedes
decrypt_skb fails if the number of sg elements required to map is
greater than MAX_SKB_FRAGS. As noted by Vakul Garg, nsg must always be
calculated, but skb_cow_data adds unnecessary memcpy's for the zerocopy
case.

The new function skb_nsg calculates the number of scatterlist elements
required to map the skb without the extra overhead of skb_cow_data. This
function mimics the structure of skb_to_sgvec.

Fixes: c46234ebb4d1 ("tls: RX path for ktls")
Signed-off-by: Doron Roberts-Kedes 
---
 net/tls/tls_sw.c | 96 ++--
 1 file changed, 93 insertions(+), 3 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ff3a6904a722..eb87f931a0d6 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -43,6 +43,80 @@
 
 #define MAX_IV_SIZE	TLS_CIPHER_AES_GCM_128_IV_SIZE
 
+static int __skb_nsg(struct sk_buff *skb, int offset, int len,
+unsigned int recursion_level)
+{
+   int start = skb_headlen(skb);
+   int i, copy = start - offset;
+   struct sk_buff *frag_iter;
+   int elt = 0;
+
+   if (unlikely(recursion_level >= 24))
+   return -EMSGSIZE;
+
+   if (copy > 0) {
+   if (copy > len)
+   copy = len;
+   elt++;
+   len -= copy;
+   if (len == 0)
+   return elt;
+   offset += copy;
+   }
+
+   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+   int end;
+
+   WARN_ON(start > offset + len);
+
+   end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
+   copy = end - offset;
+   if (copy > 0) {
+   if (copy > len)
+   copy = len;
+   elt++;
+   len -= copy;
+   if (len == 0)
+   return elt;
+   offset += copy;
+   }
+   start = end;
+   }
+
+   skb_walk_frags(skb, frag_iter) {
+   int end, ret;
+
+   WARN_ON(start > offset + len);
+
+   end = start + frag_iter->len;
+   copy = end - offset;
+   if (copy > 0) {
+   if (copy > len)
+   copy = len;
+   ret = __skb_nsg(frag_iter, offset - start, copy,
+   recursion_level + 1);
+   if (unlikely(ret < 0))
+   return ret;
+   elt += ret;
+   len -= copy;
+   if (len == 0)
+   return elt;
+   offset += copy;
+   }
+   start = end;
+   }
+   BUG_ON(len);
+   return elt;
+}
+
+/* Return the number of scatterlist elements required to completely map the
+ * skb, or -EMSGSIZE if the recursion depth is exceeded.
+ */
+static int skb_nsg(struct sk_buff *skb, int offset, int len)
+{
+   return __skb_nsg(skb, offset, len, 0);
+}
+
 static int tls_do_decryption(struct sock *sk,
 struct scatterlist *sgin,
 struct scatterlist *sgout,
@@ -693,7 +767,7 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
struct scatterlist sgin_arr[MAX_SKB_FRAGS + 2];
struct scatterlist *sgin = &sgin_arr[0];
struct strp_msg *rxm = strp_msg(skb);
-   int ret, nsg = ARRAY_SIZE(sgin_arr);
+   int ret, nsg;
struct sk_buff *unused;
 
ret = skb_copy_bits(skb, rxm->offset + TLS_HEADER_SIZE,
@@ -704,11 +778,27 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
 
memcpy(iv, tls_ctx->rx.iv, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
if (!sgout) {
-   nsg = skb_cow_data(skb, 0, &unused) + 1;
+   nsg = skb_cow_data(skb, 0, &unused);
+   } else {
+   nsg = skb_nsg(skb,
+ rxm->offset + tls_ctx->rx.prepend_size,
+ rxm->full_len - tls_ctx->rx.prepend_size);
+   if (nsg <= 0)
+   return nsg;
+   }
+
+   // We need one extra for ctx->rx_aad_ciphertext
+   nsg++;
+
+   if (nsg > ARRAY_SIZE(sgin_arr)) {
sgin = kmalloc_array(nsg, sizeof(*sgin), sk->sk_allocation);
-   sgout = sgin;
+   if (!sgin)
+   return -ENOMEM;
}
 
+   if (!sgout)
+   sgout = sgin;
+
sg_init_table(sgin, nsg);
sg_set_buf(&sgin[0], ctx->rx_aad_ciphertext, TLS_AAD_SPACE_SIZE);
 
-- 
2.17.1
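
A standalone way to see what the new helper counts: the sketch below (plain
userspace C, not from the patch; the fragment sizes are hypothetical) applies
the same rule as __skb_nsg to a flat array of fragment lengths, one
scatterlist element per buffer region overlapping [offset, offset + len).

#include <stdio.h>

/* count how many fragments the byte range [offset, offset+len) touches */
static int count_sg(const int *frag, int nfrags, int offset, int len)
{
	int start = 0, elt = 0;

	for (int i = 0; i < nfrags && len > 0; i++) {
		int end = start + frag[i];
		int copy = end - offset;

		if (copy > 0) {
			if (copy > len)
				copy = len;
			elt++;
			len -= copy;
			offset += copy;
		}
		start = end;
	}
	return elt;
}

int main(void)
{
	int frags[] = { 64, 1500, 1500 };	/* linear head + two page frags */

	/* skip a 13-byte record header, map the next 1600 payload bytes:
	 * 51 bytes of head + 1500 + 49 => 3 elements
	 */
	printf("%d\n", count_sg(frags, 3, 13, 1600));
	return 0;
}

This also shows why a payload spanning the linear area plus many frags can
need more than MAX_SKB_FRAGS + 2 entries, forcing the kmalloc_array path.
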



pull-request: ieee802154-next 2018-08-06

2018-08-06 Thread Stefan Schmidt
Hello Dave.

An update from ieee802154 for *net-next*

Romuald added a socket option to get the LQI value of the received datagram.
Alexander added a new hardware simulation driver modelled after hwsim of the
wireless people. It allows runtime configuration for new nodes and edges over a
netlink interface (a config utility is making its way into wpan-tools).
We also have three fixes in here. One from Colin which is more of a cleanup and
two from Alex fixing tailroom and single frame space problems.
I would normally put the last two into my fixes tree, but given we are already
in -rc8 I simply put them here and added a cc: stable to them.

Please pull, or let me know if there are any problems.

regards
Stefan Schmidt


The following changes since commit 981467033a37d916649647fa3afe1fe99bba1817:

  tc-testing: remove duplicate spaces in skbedit match patterns (2018-08-05 
17:39:24 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next.git 
ieee802154-for-davem-2018-08-06

for you to fetch changes up to be10d5d1c2d15252624e965202508f30a218a46a:

  ieee802154: fakelb: add deprecated msg while probe (2018-08-06 11:21:15 +0200)


Alexander Aring (4):
  net: 6lowpan: fix reserved space for single frames
  net: mac802154: tx: expand tailroom if necessary
  ieee802154: hwsim: add replacement for fakelb
  ieee802154: fakelb: add deprecated msg while probe

Colin Ian King (1):
  net: ieee802154: 6lowpan: remove redundant pointers 'fq' and 'net'

Romuald CARI (1):
  ieee802154: add rx LQI from userspace

Stefan Schmidt (1):
  Merge remote-tracking branch 'net-next/master'

 drivers/net/ieee802154/Kconfig   |  11 +
 drivers/net/ieee802154/Makefile  |   1 +
 drivers/net/ieee802154/fakelb.c  |   3 +
 drivers/net/ieee802154/mac802154_hwsim.c | 919 +++
 drivers/net/ieee802154/mac802154_hwsim.h |  73 +++
 include/net/af_ieee802154.h  |   1 +
 net/ieee802154/6lowpan/reassembly.c  |   5 -
 net/ieee802154/6lowpan/tx.c  |  21 +-
 net/ieee802154/socket.c  |  17 +
 net/mac802154/tx.c   |  15 +-
 10 files changed, 1057 insertions(+), 9 deletions(-)
 create mode 100644 drivers/net/ieee802154/mac802154_hwsim.c
 create mode 100644 drivers/net/ieee802154/mac802154_hwsim.h


Re: [PATCH net-next] ipv4: frags: precedence bug in ip_expire()

2018-08-06 Thread Peter Oskolkov
Ack. Thanks, Dan!
On Mon, Aug 6, 2018 at 12:17 PM Dan Carpenter  wrote:
>
> We accidentally removed the parentheses here, but they are required
> because '!' has higher precedence than '&'.
>
> Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
> Signed-off-by: Dan Carpenter 
>
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index 0e8f8de77e71..7cb7ed761d8c 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -154,7 +154,7 @@ static void ip_expire(struct timer_list *t)
> __IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
> __IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
>
> -   if (!qp->q.flags & INET_FRAG_FIRST_IN)
> +   if (!(qp->q.flags & INET_FRAG_FIRST_IN))
> goto out;
>
> /* sk_buff::dev and sk_buff::rbnode are unionized. So we
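
The operator-precedence trap is easy to reproduce in isolation: '!' binds
tighter than '&', so the unparenthesized form reduces to (0 or 1) & 0x4,
which is always 0. A two-line check (plain C, flag value hypothetical):

#include <stdio.h>

int main(void)
{
	unsigned int set = 0x4, unset = 0x0;	/* e.g. INET_FRAG_FIRST_IN */

	printf("%d %d\n", !set & 0x4, !(set & 0x4));		/* 0 0 */
	printf("%d %d\n", !unset & 0x4, !(unset & 0x4));	/* 0 1 */
	return 0;
}

With the bit clear, only the parenthesized form evaluates true, which is
exactly the branch the fix restores.
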


Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-08-06 Thread Steve Wise



On 8/1/2018 9:27 AM, Max Gurtovoy wrote:
>
>
> On 8/1/2018 8:12 AM, Sagi Grimberg wrote:
>> Hi Max,
>
> Hi,
>
>>
>>> Yes, since nvmf is the only user of this function.
>>> Still waiting for comments on the suggested patch :)
>>>
>>
>> Sorry for the late response (but I'm on vacation so I have
>> an excuse ;))
>
> NP :) currently the code works..
>
>>
>> I'm thinking that we should avoid trying to find an assignment
>> when stuff like irqbalance daemon is running and changing
>> the affinitization.
>
> but this is exactly what Steve complained and Leon try to fix (and
> break the connection establishment).
> If this is the case and we all agree then we're good without Leon's
> patch and without our suggestions.
>

I don't agree.  Currently setting certain affinity mappings breaks nvme
connectivity.  I don't think that is desirable.  And mlx5 is broken in
that it doesn't allow changing the affinity but silently ignores the
change, which misleads the admin or irqbalance...
 


>>
>> This extension was made to apply optimal affinity assignment
>> when the device irq affinity is lined up in a vector per
>> core.
>>
>> I'm thinking that when we identify this is not the case, we immediately
>> fallback to the default mapping.
>>
>> 1. when we get a mask, if its weight != 1, we fallback.
>> 2. if a queue was left unmapped, we fallback.
>>
>> Maybe something like the following:
>
> did you test it ? I think it will not work since you need to map all
> the queues and all the CPUs.
>
>> -- 
>> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
>> index 996167f1de18..1ada6211c55e 100644
>> --- a/block/blk-mq-rdma.c
>> +++ b/block/blk-mq-rdma.c
>> @@ -35,17 +35,26 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set
>> *set,
>>  const struct cpumask *mask;
>>  unsigned int queue, cpu;
>>
>> +   /* reset all CPUs mapping */
>> +   for_each_possible_cpu(cpu)
>> +   set->mq_map[cpu] = UINT_MAX;
>> +
>>  for (queue = 0; queue < set->nr_hw_queues; queue++) {
>>  mask = ib_get_vector_affinity(dev, first_vec + queue);
>>  if (!mask)
>>  goto fallback;
>>
>> -   for_each_cpu(cpu, mask)
>> -   set->mq_map[cpu] = queue;
>> +   if (cpumask_weight(mask) != 1)
>> +   goto fallback;
>> +
>> +   cpu = cpumask_first(mask);
>> +   if (set->mq_map[cpu] != UINT_MAX)
>> +   goto fallback;
>> +
>> +   set->mq_map[cpu] = queue;
>>  }
>>
>>  return 0;
>> -
>>   fallback:
>>  return blk_mq_map_queues(set);
>>   }
>> -- 
>
> see attached another algorithem that can improve the mapping (although
> it's not a short one)...
>
> it will try to map according to affinity mask, and also in case the
> mask weight > 1 it will try to be better than the naive mapping I
> suggest in the previous email.
>

Let me know if you want me to try this or any particular fix.

Steve.


[PATCH net-next] ipv4: frags: precedence bug in ip_expire()

2018-08-06 Thread Dan Carpenter
We accidentally removed the parentheses here, but they are required
because '!' has higher precedence than '&'.

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Signed-off-by: Dan Carpenter 

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 0e8f8de77e71..7cb7ed761d8c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -154,7 +154,7 @@ static void ip_expire(struct timer_list *t)
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!qp->q.flags & INET_FRAG_FIRST_IN)
+   if (!(qp->q.flags & INET_FRAG_FIRST_IN))
goto out;
 
/* sk_buff::dev and sk_buff::rbnode are unionized. So we


[PATCH net-next,v2] net/tls: Calculate nsg for zerocopy path without skb_cow_data.

2018-08-06 Thread Doron Roberts-Kedes
decrypt_skb fails if the number of sg elements required to map is
greater than MAX_SKB_FRAGS. As noted by Vakul Garg, nsg must always be
calculated, but skb_cow_data adds unnecessary memcpy's for the zerocopy
case.

The new function skb_nsg calculates the number of scatterlist elements
required to map the skb without the extra overhead of skb_cow_data. This
function mimics the structure of skb_to_sgvec.

Fixes: c46234ebb4d1 ("tls: RX path for ktls")
Signed-off-by: Doron Roberts-Kedes 
---
 net/tls/tls_sw.c | 93 ++--
 1 file changed, 90 insertions(+), 3 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ff3a6904a722..e1c465a2ce0b 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -43,6 +43,76 @@
 
 #define MAX_IV_SIZE	TLS_CIPHER_AES_GCM_128_IV_SIZE
 
+static int __skb_nsg(struct sk_buff *skb, int offset, int len,
+  unsigned int recursion_level)
+{
+int start = skb_headlen(skb);
+int i, copy = start - offset;
+struct sk_buff *frag_iter;
+int elt = 0;
+
+if (unlikely(recursion_level >= 24))
+return -EMSGSIZE;
+
+if (copy > 0) {
+if (copy > len)
+copy = len;
+elt++;
+if ((len -= copy) == 0)
+return elt;
+offset += copy;
+}
+
+for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+int end;
+
+WARN_ON(start > offset + len);
+
+end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
+if ((copy = end - offset) > 0) {
+if (copy > len)
+copy = len;
+elt++;
+if (!(len -= copy))
+return elt;
+offset += copy;
+}
+start = end;
+}
+
+skb_walk_frags(skb, frag_iter) {
+int end, ret;
+
+WARN_ON(start > offset + len);
+
+end = start + frag_iter->len;
+if ((copy = end - offset) > 0) {
+
+if (copy > len)
+copy = len;
+ret = __skb_nsg(frag_iter, offset - start, copy,
+   recursion_level + 1);
+if (unlikely(ret < 0))
+return ret;
+elt += ret;
+if ((len -= copy) == 0)
+return elt;
+offset += copy;
+}
+start = end;
+}
+BUG_ON(len);
+return elt;
+}
+
+/* Return the number of scatterlist elements required to completely map the
+ * skb, or -EMSGSIZE if the recursion depth is exceeded.
+ */
+static int skb_nsg(struct sk_buff *skb, int offset, int len)
+{
+   return __skb_nsg(skb, offset, len, 0);
+}
+
 static int tls_do_decryption(struct sock *sk,
 struct scatterlist *sgin,
 struct scatterlist *sgout,
@@ -693,7 +763,7 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
struct scatterlist sgin_arr[MAX_SKB_FRAGS + 2];
struct scatterlist *sgin = &sgin_arr[0];
struct strp_msg *rxm = strp_msg(skb);
-   int ret, nsg = ARRAY_SIZE(sgin_arr);
+   int ret, nsg;
struct sk_buff *unused;
 
ret = skb_copy_bits(skb, rxm->offset + TLS_HEADER_SIZE,
@@ -704,11 +774,28 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
 
memcpy(iv, tls_ctx->rx.iv, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
if (!sgout) {
-   nsg = skb_cow_data(skb, 0, &unused) + 1;
+   nsg = skb_cow_data(skb, 0, &unused);
+   } else {
+   nsg = skb_nsg(skb,
+ rxm->offset + tls_ctx->rx.prepend_size,
+ rxm->full_len - tls_ctx->rx.prepend_size);
+   if (nsg <= 0)
+   return nsg;
+   }
+
+   // We need one extra for ctx->rx_aad_ciphertext
+   nsg++;
+
+   if (nsg > ARRAY_SIZE(sgin_arr)) {
sgin = kmalloc_array(nsg, sizeof(*sgin), sk->sk_allocation);
-   sgout = sgin;
+   if (!sgin)
+   return -ENOMEM;
}
 
+
+   if (!sgout)
+   sgout = sgin;
+
sg_init_table(sgin, nsg);
sg_set_buf(&sgin[0], ctx->rx_aad_ciphertext, TLS_AAD_SPACE_SIZE);
 
-- 
2.17.1



Re: [PATCH net-next] net/tls: Calculate nsg for zerocopy path without skb_cow_data.

2018-08-06 Thread Doron Roberts-Kedes
On Fri, Aug 03, 2018 at 11:49:02AM -0700, Doron Roberts-Kedes wrote:
> On Fri, Aug 03, 2018 at 01:23:33AM +, Vakul Garg wrote:
> > 
> > 
> > > -Original Message-
> > > From: Doron Roberts-Kedes [mailto:doro...@fb.com]
> > > Sent: Friday, August 3, 2018 6:00 AM
> > > To: David S . Miller 
> > > Cc: Dave Watson ; Vakul Garg
> > > ; Boris Pismenny ; Aviad
> > > Yehezkel ; netdev@vger.kernel.org; Doron
> > > Roberts-Kedes 
> > > Subject: [PATCH net-next] net/tls: Calculate nsg for zerocopy path without
> > > skb_cow_data.
> > > 
> > > decrypt_skb fails if the number of sg elements required to map is greater
> > > than MAX_SKB_FRAGS. As noted by Vakul Garg, nsg must always be
> > > calculated, but skb_cow_data adds unnecessary memcpy's for the zerocopy
> > > case.
> > > 
> > > The new function skb_nsg calculates the number of scatterlist elements
> > > required to map the skb without the extra overhead of skb_cow_data. This
> > > function mimics the structure of skb_to_sgvec.
> > > 
> > > Fixes: c46234ebb4d1 ("tls: RX path for ktls")
> > > Signed-off-by: Doron Roberts-Kedes 
> > > ---
> > >  net/tls/tls_sw.c | 89
> > > ++--
> > >  1 file changed, 86 insertions(+), 3 deletions(-)
> > > 
> > >   memcpy(iv, tls_ctx->rx.iv, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
> > >   if (!sgout) {
> > > - nsg = skb_cow_data(skb, 0, &unused) + 1;
> > > + nsg = skb_cow_data(skb, 0, &unused);
> > > + } else {
> > > + nsg = skb_nsg(skb,
> > > +   rxm->offset + tls_ctx->rx.prepend_size,
> > > +   rxm->full_len - tls_ctx->rx.prepend_size);
> > > + if (nsg <= 0)
> > > + return nsg;
> > Comparison should be (nsg < 1). TLS forbids '0' sized records.
> 
> Yes true, v2 incoming
>

Glancing at this a second time, I actually don't believe this should be
changed. nsg <= 0 is equivalent to nsg < 1. Returning 0 if the record is
0 sized is the proper behavior here, since decrypting a zero-length skb
is a no-op. It is true that zero sized TLS records are forbidden, but it
is confusing to enforce that in this part of the code. I would be
surprised to learn that tls_sw_recvmsg could be invoked with a len equal
to 0, but if I'm wrong, and that case does need to be handled, then it
should be in a different patch. 


[Patch net] vsock: split dwork to avoid reinitializations

2018-08-06 Thread Cong Wang
syzbot reported that we reinitialize an active delayed
work in vsock_stream_connect():

ODEBUG: init active (active state 0) object type: timer_list hint:
delayed_work_timer_fn+0x0/0x90 kernel/workqueue.c:1414
WARNING: CPU: 1 PID: 11518 at lib/debugobjects.c:329
debug_print_object+0x16a/0x210 lib/debugobjects.c:326

The pattern is apparently wrong: we should only initialize
the delayed work once and can repeatedly schedule it. So we
have to move the initializations out to the allocation side.
And to avoid confusion, we can split the shared dwork
into two, instead of re-using the same one.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Reported-by: 
Cc: Andy King 
Cc: Stefan Hajnoczi 
Cc: Jorgen Hansen 
Signed-off-by: Cong Wang 
---
 include/net/af_vsock.h |  4 ++--
 net/vmw_vsock/af_vsock.c   | 15 ---
 net/vmw_vsock/vmci_transport.c |  3 +--
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 9324ac2d9ff2..43913ae79f64 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -64,7 +64,8 @@ struct vsock_sock {
struct list_head pending_links;
struct list_head accept_queue;
bool rejected;
-   struct delayed_work dwork;
+   struct delayed_work connect_work;
+   struct delayed_work pending_work;
struct delayed_work close_work;
bool close_work_scheduled;
u32 peer_shutdown;
@@ -77,7 +78,6 @@ struct vsock_sock {
 
 s64 vsock_stream_has_data(struct vsock_sock *vsk);
 s64 vsock_stream_has_space(struct vsock_sock *vsk);
-void vsock_pending_work(struct work_struct *work);
 struct sock *__vsock_create(struct net *net,
struct socket *sock,
struct sock *parent,
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index c1076c19b858..ab27a2872935 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -451,14 +451,14 @@ static int vsock_send_shutdown(struct sock *sk, int mode)
return transport->shutdown(vsock_sk(sk), mode);
 }
 
-void vsock_pending_work(struct work_struct *work)
+static void vsock_pending_work(struct work_struct *work)
 {
struct sock *sk;
struct sock *listener;
struct vsock_sock *vsk;
bool cleanup;
 
-   vsk = container_of(work, struct vsock_sock, dwork.work);
+   vsk = container_of(work, struct vsock_sock, pending_work.work);
sk = sk_vsock(vsk);
listener = vsk->listener;
cleanup = true;
@@ -498,7 +498,6 @@ void vsock_pending_work(struct work_struct *work)
sock_put(sk);
sock_put(listener);
 }
-EXPORT_SYMBOL_GPL(vsock_pending_work);
 
 / SOCKET OPERATIONS /
 
@@ -597,6 +596,8 @@ static int __vsock_bind(struct sock *sk, struct sockaddr_vm 
*addr)
return retval;
 }
 
+static void vsock_connect_timeout(struct work_struct *work);
+
 struct sock *__vsock_create(struct net *net,
struct socket *sock,
struct sock *parent,
@@ -638,6 +639,8 @@ struct sock *__vsock_create(struct net *net,
vsk->sent_request = false;
vsk->ignore_connecting_rst = false;
vsk->peer_shutdown = 0;
+   INIT_DELAYED_WORK(&vsk->connect_work, vsock_connect_timeout);
+   INIT_DELAYED_WORK(&vsk->pending_work, vsock_pending_work);
 
psk = parent ? vsock_sk(parent) : NULL;
if (parent) {
@@ -1117,7 +1120,7 @@ static void vsock_connect_timeout(struct work_struct 
*work)
struct vsock_sock *vsk;
int cancel = 0;
 
-   vsk = container_of(work, struct vsock_sock, dwork.work);
+   vsk = container_of(work, struct vsock_sock, connect_work.work);
sk = sk_vsock(vsk);
 
lock_sock(sk);
@@ -1221,9 +1224,7 @@ static int vsock_stream_connect(struct socket *sock, 
struct sockaddr *addr,
 * timeout fires.
 */
sock_hold(sk);
-   INIT_DELAYED_WORK(&vsk->dwork,
- vsock_connect_timeout);
-   schedule_delayed_work(&vsk->dwork, timeout);
+   schedule_delayed_work(&vsk->connect_work, timeout);
 
/* Skip ahead to preserve error code set above. */
goto out_wait;
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index a7a73ffe675b..cb332adb84cd 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -1094,8 +1094,7 @@ static int vmci_transport_recv_listen(struct sock *sk,
vpending->listener = sk;
sock_hold(sk);
sock_hold(pending);
-   INIT_DELAYED_WORK(&vpending->dwork, vsock_pending_work);
-   schedule_delayed_work(&vpending->dwork, HZ);
+   schedule_delayed_work(&vpending->pending_work, HZ);
 
 out:
return err;
-- 
2.14.4
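
The init-once/schedule-many pattern the patch adopts can be sketched in a
few lines (hypothetical kernel-module code, not taken from the patch): the
INIT_DELAYED_WORK() calls happen exactly once at allocation, and later
paths only schedule.

#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct conn {
	struct delayed_work connect_work;
	struct delayed_work pending_work;
};

static void connect_timeout(struct work_struct *work)
{
	struct conn *c = container_of(work, struct conn, connect_work.work);

	pr_info("connect timed out for %p\n", c);
}

static void pending_fn(struct work_struct *work)
{
}

static struct conn *conn_alloc(void)
{
	struct conn *c = kzalloc(sizeof(*c), GFP_KERNEL);

	if (!c)
		return NULL;
	/* initialize exactly once, at allocation time */
	INIT_DELAYED_WORK(&c->connect_work, connect_timeout);
	INIT_DELAYED_WORK(&c->pending_work, pending_fn);
	return c;
}

static void conn_connect(struct conn *c)
{
	/* safe to call repeatedly; never re-INIT an active work */
	schedule_delayed_work(&c->connect_work, HZ);
}
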



Re: [PATCH] mellanox: fix the dport endianness in call of __inet6_lookup_established()

2018-08-06 Thread David Miller
From: Al Viro 
Date: Sat, 4 Aug 2018 21:41:27 +0100

> __inet6_lookup_established() expects th->dport passed in host-endian,
> not net-endian.  The reason is a micro-optimization in __inet6_lookup(),
> but if you use the lower-level helpers, you have to play by their
> rules...
> 
> Signed-off-by: Al Viro 

Applied to net-next, thanks Al.


[PATCH iproute2-next 0/3] support delivering packets in delayed

2018-08-06 Thread Yousuk Seung
This series adds support for the new "slot" netem parameter for
slotting. Slotting is an approximation of shared media that gather up
packets within a varying delay window before delivering them nearly at
once.

Dave Taht (2):
  tc: support conversions to or from 64 bit nanosecond-based time
  q_netem: support delivering packets in delayed time slots

Yousuk Seung (1):
  q_netem: slotting with non-uniform distribution

 man/man8/tc-netem.8 |  40 +++-
 tc/q_netem.c| 112 +++-
 tc/tc_core.h|   4 ++
 tc/tc_util.c|  55 ++
 tc/tc_util.h|   3 ++
 5 files changed, 212 insertions(+), 2 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH iproute2-next 3/3] q_netem: slotting with non-uniform distribution

2018-08-06 Thread Yousuk Seung
Extend slotting with support for non-uniform distributions. This is
similar to netem's non-uniform distribution delay feature.

Syntax:
   slot distribution DISTRIBUTION DELAY JITTER [packets MAX_PACKETS] \
  [bytes MAX_BYTES]

The syntax and use of the distribution table is the same as in the
non-uniform distribution delay feature. A file DISTRIBUTION must be
present in TC_LIB_DIR (e.g. /usr/lib/tc) containing numbers scaled by
NETEM_DIST_SCALE. A random value x is selected from the table and it
takes DELAY + ( x * JITTER ) as delay. Correlation between values is not
supported.

Examples:
  Normal distribution delay with mean = 800us and stdev = 100us.
  > tc qdisc add dev eth0 root netem slot distribution normal \
800us 100us

  Optionally set the max slot size in bytes and/or packets.
  > tc qdisc add dev eth0 root netem slot distribution normal \
800us 100us bytes 64k packets 42

Signed-off-by: Yousuk Seung 
Signed-off-by: Neal Cardwell 
Signed-off-by: Dave Taht 
---
 man/man8/tc-netem.8 | 20 
 tc/q_netem.c| 75 +
 2 files changed, 76 insertions(+), 19 deletions(-)

diff --git a/man/man8/tc-netem.8 b/man/man8/tc-netem.8
index 8d485b026751..09cf042f 100644
--- a/man/man8/tc-netem.8
+++ b/man/man8/tc-netem.8
@@ -53,9 +53,13 @@ NetEm \- Network Emulator
 .IR RATE " [ " PACKETOVERHEAD " [ " CELLSIZE " [ " CELLOVERHEAD " "
 
 .IR SLOT " := "
-.BR slot
-.IR MIN_DELAY " [ " MAX_DELAY " ] ["
-.BR packets
+.BR slot " { "
+.IR MIN_DELAY " [ " MAX_DELAY " ] |"
+.br
+.RB "   " distribution " { "uniform " | " normal " | " pareto " | " paretonormal " | "
+.IR FILE " } " DELAY " " JITTER " } "
+.br
+.RB " [ " packets
 .IR PACKETS " ] [ "
 .BR bytes
 .IR BYTES " ]"
@@ -172,9 +176,13 @@ an artificial packet compression (bursts). Another 
influence factor are network
 adapter buffers which can also add artificial delay.
 
 .SS slot
-defer delivering accumulated packets to within a slot, with each available slot
-configured with a minimum delay to acquire, and an optional maximum delay.  Slot
-delays can be specified in nanoseconds, microseconds, milliseconds or seconds
+defer delivering accumulated packets to within a slot. Each available slot can be
+configured with a minimum delay to acquire, and an optional maximum delay.
+Alternatively it can be configured with the distribution similar to
+.BR distribution
+for
+.BR delay
+option. Slot delays can be specified in nanoseconds, microseconds, milliseconds or seconds
 (e.g. 800us). Values for the optional parameters
 .I BYTES
 will limit the number of bytes delivered per slot, and/or
diff --git a/tc/q_netem.c b/tc/q_netem.c
index f52a36b6c31c..e655e1a82e12 100644
--- a/tc/q_netem.c
+++ b/tc/q_netem.c
@@ -43,7 +43,9 @@ static void explain(void)
 " [ rate RATE [PACKETOVERHEAD] [CELLSIZE] [CELLOVERHEAD]]\n" \
 " [ slot MIN_DELAY [MAX_DELAY] [packets MAX_PACKETS]" \
 " [bytes MAX_BYTES]]\n" \
-   );
+" [ slot distribution" \
+" {uniform|normal|pareto|paretonormal|custom} DELAY JITTER" \
+" [packets MAX_PACKETS] [bytes MAX_BYTES]]\n");
 }
 
 static void explain1(const char *arg)
@@ -159,6 +161,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
   struct nlmsghdr *n, const char *dev)
 {
int dist_size = 0;
+   int slot_dist_size = 0;
struct rtattr *tail;
struct tc_netem_qopt opt = { .limit = 1000 };
struct tc_netem_corr cor = {};
@@ -169,6 +172,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
struct tc_netem_rate rate = {};
struct tc_netem_slot slot = {};
__s16 *dist_data = NULL;
+   __s16 *slot_dist_data = NULL;
__u16 loss_type = NETEM_LOSS_UNSPEC;
int present[__TCA_NETEM_MAX] = {};
__u64 rate64 = 0;
@@ -417,21 +421,53 @@ static int netem_parse_opt(struct qdisc_util *qu, int 
argc, char **argv,
}
}
} else if (matches(*argv, "slot") == 0) {
-   NEXT_ARG();
-   present[TCA_NETEM_SLOT] = 1;
-   if (get_time64(&slot.min_delay, *argv)) {
-   explain1("slot min_delay");
-   return -1;
-   }
if (NEXT_IS_NUMBER()) {
NEXT_ARG();
-   if (get_time64(&slot.max_delay, *argv)) {
-   explain1("slot min_delay max_delay");
+   present[TCA_NETEM_SLOT] = 1;
+   if (get_time64(&slot.min_delay, *argv)) {
+   explain1("slot min_delay");
+   return -1;
+   }
+   if (NEXT_IS_NUMBER()) {
+  

[PATCH iproute2-next 2/3] q_netem: support delivering packets in delayed time slots

2018-08-06 Thread Yousuk Seung
From: Dave Taht 

Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.

It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.

The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.

The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.

Examples of use:

tc qdisc add dev eth0 root netem delay 200us \
slot 800us 10ms bytes 64k packets 42

A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:

tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
 limit 1
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
 slot 800us 10ms bytes 64k packets 42 limit 512

Signed-off-by: Yousuk Seung 
Signed-off-by: Dave Taht 
Signed-off-by: Neal Cardwell 
---
 man/man8/tc-netem.8 | 32 ++-
 tc/q_netem.c| 63 -
 2 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-netem.8 b/man/man8/tc-netem.8
index f2cd86b6ed8a..8d485b026751 100644
--- a/man/man8/tc-netem.8
+++ b/man/man8/tc-netem.8
@@ -8,7 +8,8 @@ NetEm \- Network Emulator
 .I OPTIONS
 
 .IR OPTIONS " := [ " LIMIT " ] [ " DELAY " ] [ " LOSS \
-" ] [ " CORRUPT " ] [ " DUPLICATION " ] [ " REORDERING " ][ " RATE " ]"
+" ] [ " CORRUPT " ] [ " DUPLICATION " ] [ " REORDERING " ] [ " RATE \
+" ] [ " SLOT " ]"
 
 .IR LIMIT " := "
 .B limit
@@ -51,6 +52,14 @@ NetEm \- Network Emulator
 .B rate
 .IR RATE " [ " PACKETOVERHEAD " [ " CELLSIZE " [ " CELLOVERHEAD " "
 
+.IR SLOT " := "
+.BR slot
+.IR MIN_DELAY " [ " MAX_DELAY " ] ["
+.BR packets
+.IR PACKETS " ] [ "
+.BR bytes
+.IR BYTES " ]"
+
 
 .SH DESCRIPTION
 NetEm is an enhancement of the Linux traffic control facilities
@@ -162,6 +171,27 @@ granularity avoid a perfect shaping at a specific level. 
This will show up in
 an artificial packet compression (bursts). Another influence factor are network
 adapter buffers which can also add artificial delay.
 
+.SS slot
+defer delivering accumulated packets to within a slot, with each available slot
+configured with a minimum delay to acquire, and an optional maximum delay.  Slot
+delays can be specified in nanoseconds, microseconds, milliseconds or seconds
+(e.g. 800us). Values for the optional parameters
+.I BYTES
+will limit the number of bytes delivered per slot, and/or
+.I PACKETS
+will limit the number of packets delivered per slot.
+
+These slot options can provide a crude approximation of bursty MACs such as
+DOCSIS, WiFi, and LTE.
+
+Note that slotting is limited by several factors: the kernel clock granularity,
+as with a rate, and attempts to deliver many packets within a slot will be
+smeared by the timer resolution, and by the underlying native bandwidth also.
+
+It is possible to combine slotting with a rate, in which case complex behaviors
+where either the rate, or the slot limits on bytes or packets per slot, govern
+the actual delivered rate.
+
 .SH LIMITATIONS
 The main known limitation of Netem are related to timer granularity, since
 Linux is not a real-time operating system.
diff --git a/tc/q_netem.c b/tc/q_netem.c
index 9f9a9b3df255..f52a36b6c31c 100644
--- a/tc/q_netem.c
+++ b/tc/q_netem.c
@@ -40,7 +40,10 @@ static void explain(void)
 " [ loss gemodel PERCENT [R [1-H [1-K]]]\n" \
 " [ ecn ]\n" \
 " [ reorder PRECENT [CORRELATION] [ gap DISTANCE ]]\n" \
-" [ rate RATE [PACKETOVERHEAD] [CELLSIZE] [CELLOVERHEAD]]\n");
+" [ rate RATE [PACKETOVERHEAD] [CELLSIZE] [CELLOVERHEAD]]\n" \
+" [ slot MIN_DELAY [MAX_DELAY] [packets MAX_PACKETS]" \
+" [bytes MAX_BYTES]]\n" \
+   );
 }
 
 static void explain1(const char *arg)
@@ -164,6 +167,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
struct tc_netem_gimodel gimodel;
struct tc_netem_gemodel gemodel;
struct tc_netem_rate rate = {};
+   struct tc_netem_slot slot = {};
__s16 *dist_data = NULL;
__u16 loss_type = NETEM_LOSS_UNSPEC;
int present[__TCA_NETEM_MAX] = {};
@@ -412,6 +416,44 @@ static int netem_parse_opt(struct qdisc_util *qu, int 
argc, char **argv,
return -1;
}
}
+   } else if (matches(*argv, "slot") == 0) {
+   NEXT_ARG();
+   present[TCA_NETEM_SLOT] = 1;
+   if (get_time64(&slot.min_delay, *argv)) {
+  

[PATCH iproute2-next 1/3] tc: support conversions to or from 64 bit nanosecond-based time

2018-08-06 Thread Yousuk Seung
From: Dave Taht 

Using a 32 bit field to represent time in nanoseconds results in a
maximum value of about 4.3 seconds, which is well below many observed
delays in WiFi and LTE, and barely in the ballpark for a trip past the
Earth's moon, Luna.

Using 64 bit time fields in nanoseconds allows us to simulate
network diameters of several hundred light-years. However, only
conversions to and from ns, us, ms, and seconds are provided.
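
The limits quoted above are quick to verify (plain userspace C, for
illustration only): a u32 nanosecond counter tops out near 4.3 seconds,
while a signed 64 bit one covers roughly 292 years.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	printf("u32 max: %.3f s\n", (double)UINT32_MAX / 1e9);
	printf("s64 max: %.0f years\n",
	       (double)INT64_MAX / 1e9 / 3600 / 24 / 365.25);
	return 0;
}
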

Signed-off-by: Yousuk Seung 
Signed-off-by: Dave Taht 
Signed-off-by: Neal Cardwell 
---
 tc/tc_core.h |  4 
 tc/tc_util.c | 55 
 tc/tc_util.h |  3 +++
 3 files changed, 62 insertions(+)

diff --git a/tc/tc_core.h b/tc/tc_core.h
index 1dfa9a4f773b..a0fe0923d171 100644
--- a/tc/tc_core.h
+++ b/tc/tc_core.h
@@ -7,6 +7,10 @@
 
 #define TIME_UNITS_PER_SEC 1000000
 
+#define NSEC_PER_USEC 1000
+#define NSEC_PER_MSEC 1000000
+#define NSEC_PER_SEC 1000000000LL
+
 enum link_layer {
LINKLAYER_UNSPEC,
LINKLAYER_ETHERNET,
diff --git a/tc/tc_util.c b/tc/tc_util.c
index d7578528a31b..c39c9046dcae 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -385,6 +385,61 @@ char *sprint_ticks(__u32 ticks, char *buf)
return sprint_time(tc_core_tick2time(ticks), buf);
 }
 
+/* 64 bit times are represented internally in nanoseconds */
+int get_time64(__s64 *time, const char *str)
+{
+   double nsec;
+   char *p;
+
+   nsec = strtod(str, );
+   if (p == str)
+   return -1;
+
+   if (*p) {
+   if (strcasecmp(p, "s") == 0 ||
+   strcasecmp(p, "sec") == 0 ||
+   strcasecmp(p, "secs") == 0)
+   nsec *= NSEC_PER_SEC;
+   else if (strcasecmp(p, "ms") == 0 ||
+strcasecmp(p, "msec") == 0 ||
+strcasecmp(p, "msecs") == 0)
+   nsec *= NSEC_PER_MSEC;
+   else if (strcasecmp(p, "us") == 0 ||
+strcasecmp(p, "usec") == 0 ||
+strcasecmp(p, "usecs") == 0)
+   nsec *= NSEC_PER_USEC;
+   else if (strcasecmp(p, "ns") == 0 ||
+strcasecmp(p, "nsec") == 0 ||
+strcasecmp(p, "nsecs") == 0)
+   nsec *= 1;
+   else
+   return -1;
+   }
+
+   *time = nsec;
+   return 0;
+}
+
+void print_time64(char *buf, int len, __s64 time)
+{
+   double nsec = time;
+
+   if (time >= NSEC_PER_SEC)
+   snprintf(buf, len, "%.3fs", nsec/NSEC_PER_SEC);
+   else if (time >= NSEC_PER_MSEC)
+   snprintf(buf, len, "%.3fms", nsec/NSEC_PER_MSEC);
+   else if (time >= NSEC_PER_USEC)
+   snprintf(buf, len, "%.3fus", nsec/NSEC_PER_USEC);
+   else
+   snprintf(buf, len, "%lldns", time);
+}
+
+char *sprint_time64(__s64 time, char *buf)
+{
+   print_time64(buf, SPRINT_BSIZE-1, time);
+   return buf;
+}
+
 int get_size(unsigned int *size, const char *str)
 {
double sz;
diff --git a/tc/tc_util.h b/tc/tc_util.h
index 6632c4f9c528..87be951c622d 100644
--- a/tc/tc_util.h
+++ b/tc/tc_util.h
@@ -82,12 +82,14 @@ int get_percent_rate64(__u64 *rate, const char *str, const 
char *dev);
 int get_size(unsigned int *size, const char *str);
 int get_size_and_cell(unsigned int *size, int *cell_log, char *str);
 int get_time(unsigned int *time, const char *str);
+int get_time64(__s64 *time, const char *str);
 int get_linklayer(unsigned int *val, const char *arg);
 
 void print_rate(char *buf, int len, __u64 rate);
 void print_size(char *buf, int len, __u32 size);
 void print_qdisc_handle(char *buf, int len, __u32 h);
 void print_time(char *buf, int len, __u32 time);
+void print_time64(char *buf, int len, __s64 time);
 void print_linklayer(char *buf, int len, unsigned int linklayer);
 void print_devname(enum output_type type, int ifindex);
 
@@ -96,6 +98,7 @@ char *sprint_size(__u32 size, char *buf);
 char *sprint_qdisc_handle(__u32 h, char *buf);
 char *sprint_tc_classid(__u32 h, char *buf);
 char *sprint_time(__u32 time, char *buf);
+char *sprint_time64(__s64 time, char *buf);
 char *sprint_ticks(__u32 ticks, char *buf);
 char *sprint_linklayer(unsigned int linklayer, char *buf);
 
-- 
2.18.0.597.ga71716f1ad-goog



Re: [PATCH v3 net-next 5/9] net: stmmac: Add MDIO related functions for XGMAC2

2018-08-06 Thread Florian Fainelli
On August 6, 2018 12:59:54 AM PDT, Jose Abreu  wrote:
>On 03-08-2018 20:06, Florian Fainelli wrote:
>> On 08/03/2018 08:50 AM, Jose Abreu wrote:
>>> Add the MDIO related functionalities for the new IP block XGMAC2.
>>>
>>> Signed-off-by: Jose Abreu 
>>> Cc: David S. Miller 
>>> Cc: Joao Pinto 
>>> Cc: Giuseppe Cavallaro 
>>> Cc: Alexandre Torgue 
>>> Cc: Andrew Lunn 
>>> ---
>>> +static int stmmac_xgmac2_c22_format(struct stmmac_priv *priv, int
>phyaddr,
>>> +   int phyreg, u32 *hw_addr)
>>> +{
>>> +   unsigned int mii_data = priv->hw->mii.data;
>>> +   u32 tmp;
>>> +
>>> +   /* HW does not support C22 addr >= 4 */
>>> +   if (phyaddr >= 4)
>>> +   return -ENODEV;
>> It would be nice if this could be moved at probe time so you don't
>have
>> to wait until you connect to the PHY, read its PHY OUI and find out
>it
>> has a MDIO address >= 4. Not a blocker, but something that could be
>> improved further on.
>>
>> In premise you could even scan the MDIO bus' device tree node, and
>find
>> that out ahead of time.
>
>Oh, but I use phylib ... I only provide the read/write callbacks
>so I think we should avoid duplicating the code that's already in
>the phylib ... No?

You can always extract the code that scans a MDIO bus into a helper function 
and make it parametrized with a callback of some kind. In that case I would be 
fine with you open coding the MDIO bus scan to find out if there is an address 
>= 4.
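
For instance, a rough sketch of such a probe-time check (hypothetical
helper, assuming an OF-described MDIO bus; not part of the submitted patch)
could walk the bus node and reject any PHY with a C22 address the hardware
cannot reach:

#include <linux/of.h>

static int xgmac2_check_c22_addrs(struct device_node *mdio_np)
{
	struct device_node *child;
	u32 addr;

	for_each_available_child_of_node(mdio_np, child) {
		if (of_property_read_u32(child, "reg", &addr))
			continue;
		if (addr >= 4) {
			of_node_put(child);
			return -ENODEV;	/* HW cannot address this PHY */
		}
	}
	return 0;
}
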

>
>>
>>> +   /* Wait until any existing MII operation is complete */
>>> +   if (readl_poll_timeout(priv->ioaddr + mii_data, tmp,
>>> +  !(tmp & MII_XGMAC_BUSY), 100, 1))
>>> +   return -EBUSY;
>>> +
>>> +   /* Set port as Clause 22 */
>>> +   tmp = readl(priv->ioaddr + XGMAC_MDIO_C22P);
>>> +   tmp |= BIT(phyaddr);
>> Since the registers are being Read/Modify/Write here, don't you need
>to
>> clear the previous address bits as well?
>>
>> You probably did not encounter any problems in your testing if you
>had
>> only one PHY on the MDIO bus, but this is not something that is
>> necessarily true, e.g: if you have an Ethernet switch, several MDIO
>bus
>> addresses are going to be responding.
>>
>> Your MDIO bus implementation must be able to support one transaction
>> with one PHY address and the next transaction with another PHY
>address ,
>> etc...
>>
>> That is something that should be easy to fix and be resubmitted as
>part
>> of v4.
>
>I'm not following you here... I only set/unset the bit for the
>corresponding phyaddr that phylib wants to read/write. Why would
>I clear the remaining addresses?

Because this is all about transactions, the HW must be in a state that it will 
be able to perform that transaction correctly. So here for instance if you 
needed to support a C45 transaction you would have to clear that bit for that 
particular PHY address. Since you don't appear to support those yet then yes 
the code appears fine though it would not hurt if you did clear all other PHY's 
c22 bits to make it clear what this does.

-- 
Florian


[PATCH net] packet: refine ring v3 block size test to hold one frame

2018-08-06 Thread Willem de Bruijn
From: Willem de Bruijn 

TPACKET_V3 stores variable length frames in fixed length blocks.
Blocks must be able to store a block header, optional private space
and at least one minimum sized frame.

Frames, even for a zero snaplen packet, store metadata headers and
optional reserved space.

In the block size bounds check, ensure that the frame of the
chosen configuration fits. This includes sockaddr_ll and optional
tp_reserve.

Syzbot was able to construct a ring with insufficient room for the
sockaddr_ll in the header of a zero-length frame, triggering an
out-of-bounds write in dev_parse_header.

Convert the comparison to less than, as zero is a valid snap len.
This matches the test for minimum tp_frame_size immediately below.

Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
Fixes: eb73190f4fbe ("net/packet: refine check for priv area size")
Reported-by: syzbot 
Signed-off-by: Willem de Bruijn 
---
 net/packet/af_packet.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 9b27d0cd766d5..e6445d8f3f57f 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4226,6 +4226,8 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
}
 
if (req->tp_block_nr) {
+   unsigned int min_frame_size;
+
/* Sanity tests and some calculations */
err = -EBUSY;
if (unlikely(rb->pg_vec))
@@ -4248,12 +4250,12 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
goto out;
if (unlikely(!PAGE_ALIGNED(req->tp_block_size)))
goto out;
+   min_frame_size = po->tp_hdrlen + po->tp_reserve;
if (po->tp_version >= TPACKET_V3 &&
-   req->tp_block_size <=
-   BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv) + sizeof(struct tpacket3_hdr))
+   req->tp_block_size <
+   BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv) + min_frame_size)
goto out;
-   if (unlikely(req->tp_frame_size < po->tp_hdrlen +
-   po->tp_reserve))
+   if (unlikely(req->tp_frame_size < min_frame_size))
goto out;
if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
goto out;
-- 
2.18.0.597.ga71716f1ad-goog
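
For reference, the bound the patch enforces can be computed ahead of time
from uapi definitions (userspace C sketch; the tp_sizeof_priv and
tp_reserve values are hypothetical, and the alignment below only
approximates the kernel's internal BLK_PLUS_PRIV macro):

#include <stdio.h>
#include <linux/if_packet.h>

int main(void)
{
	unsigned int tp_sizeof_priv = 64;	/* hypothetical */
	unsigned int tp_reserve = 128;		/* hypothetical */

	/* frame header (incl. sockaddr_ll) + reserved headroom */
	unsigned int min_frame = TPACKET3_HDRLEN + tp_reserve;

	unsigned int min_block =
		TPACKET_ALIGN(sizeof(struct tpacket_block_desc)) +
		TPACKET_ALIGN(tp_sizeof_priv) + min_frame;

	printf("tp_block_size must be >= %u\n", min_block);
	return 0;
}
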



[PATCH bpf-next 3/3] bpf: add sample for BPF_MAP_TYPE_QUEUE

2018-08-06 Thread Mauricio Vasquez B
The example is made of two parts: an eBPF program that consumes elements
from a FIFO queue and prints them on the screen, and a user space
application that inserts new elements into the queue each time it is
executed.

Signed-off-by: Mauricio Vasquez B 
---
 samples/bpf/.gitignore |1 +
 samples/bpf/Makefile   |3 ++
 samples/bpf/test_map_in_map_user.c |9 +-
 samples/bpf/test_queuemap.sh   |   37 +
 samples/bpf/test_queuemap_kern.c   |   51 +++
 samples/bpf/test_queuemap_user.c   |   53 
 6 files changed, 147 insertions(+), 7 deletions(-)
 create mode 100755 samples/bpf/test_queuemap.sh
 create mode 100644 samples/bpf/test_queuemap_kern.c
 create mode 100644 samples/bpf/test_queuemap_user.c

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index 8ae4940025f8..d7e518c1b3ed 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -26,6 +26,7 @@ test_lru_dist
 test_map_in_map
 test_overhead
 test_probe_write_user
+test_queuemap
 trace_event
 trace_output
 tracex1
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index f88d5683d6ee..624f4f4b81db 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
 hostprogs-y += xdp_sample_pkts
+hostprogs-y += test_queuemap
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ xdpsock-objs := xdpsock_user.o
 xdp_fwd-objs := xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
+test_queuemap-objs := bpf_load.o test_queuemap_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -166,6 +168,7 @@ always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
 always += xdp_sample_pkts_kern.o
+always += test_queuemap_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/test_map_in_map_user.c 
b/samples/bpf/test_map_in_map_user.c
index e308858f7bcf..28edac94234e 100644
--- a/samples/bpf/test_map_in_map_user.c
+++ b/samples/bpf/test_map_in_map_user.c
@@ -1,10 +1,5 @@
-/*
- * Copyright (c) 2017 Facebook
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of version 2 of the GNU General Public
- * License as published by the Free Software Foundation.
- */
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Politecnico di Torino */
 #include 
 #include 
 #include 
diff --git a/samples/bpf/test_queuemap.sh b/samples/bpf/test_queuemap.sh
new file mode 100755
index ..ed08c1fa8c2c
--- /dev/null
+++ b/samples/bpf/test_queuemap.sh
@@ -0,0 +1,37 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+[[ -z $TC ]] && TC='tc'
+[[ -z $IP ]] && IP='ip'
+
+TEST_QUEUE_USER='./test_queuemap'
+TEST_QUEUE_BPF='./test_queuemap_kern.o'
+
+function config {
+   $IP netns add ns1
+   $IP link add ve1 type veth peer name vens1
+   $IP link set dev ve1 up
+   $IP link set dev ve1 mtu 1500
+   $IP link set dev vens1 netns ns1
+
+   $IP -n ns1 link set dev lo up
+   $IP -n ns1 link set dev vens1 up
+   $IP -n ns1 addr add 10.1.1.101/24 dev vens1
+
+   $IP addr add 10.1.1.1/24 dev ve1
+   $TC qdisc add dev ve1 clsact
+   $TC filter add dev ve1 ingress bpf da obj $TEST_QUEUE_BPF sec test_queue
+}
+
+function cleanup {
+   set +e
+   [[ -z $DEBUG ]] || set +x
+   $IP netns delete ns1 >& /dev/null
+   $IP link del ve1 >& /dev/null
+   rm -f /sys/fs/bpf/tc/globals/queue
+   [[ -z $DEBUG ]] || set -x
+   set -e
+}
+
+cleanup
+config
diff --git a/samples/bpf/test_queuemap_kern.c b/samples/bpf/test_queuemap_kern.c
new file mode 100644
index ..2b496dafaffd
--- /dev/null
+++ b/samples/bpf/test_queuemap_kern.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Politecnico di Torino */
+#define KBUILD_MODNAME "foo"
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+#define PIN_GLOBAL_NS  2
+
+struct bpf_elf_map {
+   __u32 type;
+   __u32 key_size;
+   __u32 value_size;
+   __u32 max_entries;
+   __u32 flags;
+   __u32 id;
+   __u32 pinning;
+};
+
+/* map #0 */
+struct bpf_elf_map SEC("maps") queue = {
+   .type = BPF_MAP_TYPE_QUEUE,
+   .key_size = 0,
+   .value_size = sizeof(u32),
+   .flags = BPF_F_QUEUE_FIFO,
+   .max_entries = 1024,
+   .pinning = PIN_GLOBAL_NS,
+};
+
+SEC("test_queue")
+int _test_queue(struct __sk_buff *skb)
+{
+   char msg[] = "element is %u\n";
+   char msg_no[] = "there are no elements\n";
+
+   u32 *val = bpf_map_lookup_elem(&queue, NULL);
+
+   if (!val) {
+   bpf_trace_printk(msg_no, sizeof(msg_no));
+   

[PATCH bpf-next 2/3] selftests/bpf: add test cases for BPF_MAP_TYPE_QUEUE

2018-08-06 Thread Mauricio Vasquez B
Signed-off-by: Mauricio Vasquez B 
---
 tools/include/uapi/linux/bpf.h  |5 ++
 tools/testing/selftests/bpf/test_maps.c |   72 +++
 2 files changed, 77 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0ebaaf7f3568..2c171c40eb45 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -120,6 +120,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CPUMAP,
BPF_MAP_TYPE_XSKMAP,
BPF_MAP_TYPE_SOCKHASH,
+   BPF_MAP_TYPE_QUEUE,
 };
 
 enum bpf_prog_type {
@@ -255,6 +256,10 @@ enum bpf_attach_type {
 /* Flag for stack_map, store build_id+offset instead of pointer */
 #define BPF_F_STACK_BUILD_ID   (1U << 5)
 
+/* Flags for queue_map, type of queue */
+#define BPF_F_QUEUE_FIFO   (1U << 16)
+#define BPF_F_QUEUE_LIFO   (2U << 16)
+
 enum bpf_stack_build_id_status {
/* user space need an empty entry to identify end of a trace */
BPF_STACK_BUILD_ID_EMPTY = 0,
diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 6c253343a6f9..34567b017dbb 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -457,6 +457,77 @@ static void test_devmap(int task, void *data)
close(fd);
 }
 
+static void test_queuemap(int task, void *data)
+{
+   __u32 value;
+   int fd, i;
+
+   /* test FIFO */
+   fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, sizeof(value), 32,
+   BPF_F_QUEUE_FIFO);
+   if (fd < 0) {
+   printf("Failed to create queuemap '%s'!\n", strerror(errno));
+   exit(1);
+   }
+
+   /* Push 32 elements */
+   for (i = 0; i < 32; i++) {
+   value = 1000 - i * 3;
+   assert(bpf_map_update_elem(fd, NULL, &value, 0) == 0);
+   }
+
+   /* Check that element cannot be pushed due to max_entries limit */
+   value = 1000;
+   assert(bpf_map_update_elem(fd, NULL, &value, 0) == -1 &&
+  errno == E2BIG);
+
+   /* Pop all elements */
+   for (i = 0; i < 32; i++)
+   assert(bpf_map_lookup_elem(fd, NULL, &value) == 0 &&
+  value == (1000 - i * 3));
+
+   /* Check that there are no elements left */
+   assert(bpf_map_lookup_elem(fd, NULL, &value) == -1 && errno == ENOENT);
+
+   assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
+   assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
+
+   close(fd);
+
+   /* test LIFO */
+   fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, sizeof(value), 32,
+   BPF_F_QUEUE_LIFO);
+   if (fd < 0) {
+   printf("Failed to create queuemap '%s'!\n", strerror(errno));
+   exit(1);
+   }
+
+   /* Push 32 elements */
+   for (i = 0; i < 32; i++) {
+   value = 1000 - i * 3;
+   assert(bpf_map_update_elem(fd, NULL, &value, 0) == 0);
+   }
+
+   /* Check that element cannot be pushed due to max_entries limit */
+   value = 1000;
+   assert(bpf_map_update_elem(fd, NULL, &value, 0) == -1 &&
+  errno == E2BIG);
+
+   /* Pop all elements */
+   for (i = 31; i >= 0; i--)
+   assert(bpf_map_lookup_elem(fd, NULL, &value) == 0 &&
+  value == (1000 - i * 3));
+
+   /* Check that there are no elements left */
+   assert(bpf_map_lookup_elem(fd, NULL, &value) == -1 &&
+  errno == ENOENT);
+
+   assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
+   assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
+
+   close(fd);
+}
+
 #include 
 #include 
 #include 
@@ -1162,6 +1233,7 @@ static void run_all_tests(void)
test_arraymap_percpu_many_keys();
 
test_devmap(0, NULL);
+   test_queuemap(0, NULL);
test_sockmap(0, NULL);
 
test_map_large();



[PATCH bpf-next 1/3] bpf: add bpf queue map

2018-08-06 Thread Mauricio Vasquez B
Bpf queue implements LIFO/FIFO data containers for eBPF programs.

It allows to push an element to the queue by using the update operation
and to pop an element from the queue by using the lookup operation.

A use case for this is to keep track of a pool of elements, like
network ports in a SNAT.

Signed-off-by: Mauricio Vasquez B 
---
 include/linux/bpf_types.h |1 
 include/uapi/linux/bpf.h  |5 +
 kernel/bpf/Makefile   |2 
 kernel/bpf/queuemap.c |  287 +
 kernel/bpf/syscall.c  |   61 +++---
 kernel/bpf/verifier.c |   16 ++-
 6 files changed, 353 insertions(+), 19 deletions(-)
 create mode 100644 kernel/bpf/queuemap.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index c5700c2d5549..6c7a62f3fe43 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -58,3 +58,4 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
 #endif
 #endif
+BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0ebaaf7f3568..2c171c40eb45 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -120,6 +120,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CPUMAP,
BPF_MAP_TYPE_XSKMAP,
BPF_MAP_TYPE_SOCKHASH,
+   BPF_MAP_TYPE_QUEUE,
 };
 
 enum bpf_prog_type {
@@ -255,6 +256,10 @@ enum bpf_attach_type {
 /* Flag for stack_map, store build_id+offset instead of pointer */
 #define BPF_F_STACK_BUILD_ID   (1U << 5)
 
+/* Flags for queue_map, type of queue */
+#define BPF_F_QUEUE_FIFO   (1U << 16)
+#define BPF_F_QUEUE_LIFO   (2U << 16)
+
 enum bpf_stack_build_id_status {
/* user space need an empty entry to identify end of a trace */
BPF_STACK_BUILD_ID_EMPTY = 0,
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f27f5496d6fe..30f02ef66635 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -2,7 +2,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o queuemap.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
 ifeq ($(CONFIG_NET),y)
diff --git a/kernel/bpf/queuemap.c b/kernel/bpf/queuemap.c
new file mode 100644
index ..ab30af43b4cc
--- /dev/null
+++ b/kernel/bpf/queuemap.c
@@ -0,0 +1,287 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * queuemap.c: BPF queue map
+ *
+ * Copyright (c) 2018 Politecnico di Torino
+ */
+#include 
+#include 
+#include 
+#include "percpu_freelist.h"
+
+#define QUEUE_CREATE_FLAG_MASK \
+   (BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | \
+BPF_F_QUEUE_FIFO | BPF_F_QUEUE_LIFO)
+
+enum queue_type {
+   QUEUE_FIFO = (BPF_F_QUEUE_FIFO >> 16),
+   QUEUE_LIFO = (BPF_F_QUEUE_LIFO >> 16),
+};
+
+struct bpf_queue {
+   struct bpf_map map;
+   struct list_head head;
+   struct pcpu_freelist freelist;
+   void *nodes;
+   enum queue_type type;
+   raw_spinlock_t lock;
+   atomic_t count;
+   u32 node_size;
+};
+
+struct queue_node {
+   struct pcpu_freelist_node fnode;
+   struct bpf_queue *queue;
+   struct list_head list;
+   struct rcu_head rcu;
+   char element[0] __aligned(8);
+};
+
+static bool queue_map_is_prealloc(struct bpf_queue *queue)
+{
+   return !(queue->map.map_flags & BPF_F_NO_PREALLOC);
+}
+
+/* Called from syscall */
+static int queue_map_alloc_check(union bpf_attr *attr)
+{
+   /* check sanity of attributes */
+   if (attr->max_entries == 0 || attr->key_size != 0 ||
+   attr->value_size == 0 || attr->map_flags & ~QUEUE_CREATE_FLAG_MASK)
+   return -EINVAL;
+
+   if ((attr->map_flags >> 16) != QUEUE_FIFO &&
+   (attr->map_flags >> 16) != QUEUE_LIFO) {
+   return -EINVAL;
+   }
+
+   if (attr->value_size > KMALLOC_MAX_SIZE)
+   /* if value_size is bigger, the user space won't be able to
+* access the elements.
+*/
+   return -E2BIG;
+
+   return 0;
+}
+
+static int prealloc_init(struct bpf_queue *queue)
+{
+   u32 node_size = sizeof(struct queue_node) +
+   round_up(queue->map.value_size, 8);
+   u32 num_entries = queue->map.max_entries;
+   int err;
+
+   queue->nodes = bpf_map_area_alloc(node_size * num_entries,
+ queue->map.numa_node);
+   if (!queue->nodes)
+   return -ENOMEM;
+
+   err = pcpu_freelist_init(&queue->freelist);
+   if (err)
+   goto free_nodes;
+
+   pcpu_freelist_populate(&queue->freelist,
+  queue->nodes +
+  offsetof(struct 

[PATCH bpf-next 0/3] Implement bpf map queue

2018-08-06 Thread Mauricio Vasquez B
Bpf queue map is a new kind of map that provides a LIFO/FIFO queue
implementation.

In some applications, like a SNAT, it is necessary to keep track of
a pool of free elements, network ports in this case; a queue can be
used for that purpose.
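
A minimal userspace sketch of that port-pool idea (mirroring the calls
exercised in patch 2/3, and assuming the uapi additions from patch 1/3 are
installed; the port range below is hypothetical):

#include <stdio.h>
#include <bpf/bpf.h>

int main(void)
{
	__u32 port;
	int fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, sizeof(port),
				1024, BPF_F_QUEUE_FIFO);

	if (fd < 0)
		return 1;

	/* seed the pool of free ports: update == push */
	for (port = 32768; port < 32768 + 1024; port++)
		bpf_map_update_elem(fd, NULL, &port, 0);

	/* allocate one for a new flow: lookup == pop */
	if (bpf_map_lookup_elem(fd, NULL, &port) == 0)
		printf("allocated port %u\n", port);
	return 0;
}
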

Signed-off-by: Mauricio Vasquez B 
---
Mauricio Vasquez B (3):
  bpf: add bpf queue map
  selftests/bpf: add test cases for BPF_MAP_TYPE_QUEUE
  bpf: add sample for BPF_MAP_TYPE_QUEUE


 include/linux/bpf_types.h   |1 
 include/uapi/linux/bpf.h|5 +
 kernel/bpf/Makefile |2 
 kernel/bpf/queuemap.c   |  287 +++
 kernel/bpf/syscall.c|   61 +--
 kernel/bpf/verifier.c   |   16 +-
 samples/bpf/.gitignore  |1 
 samples/bpf/Makefile|3 
 samples/bpf/test_map_in_map_user.c  |9 -
 samples/bpf/test_queuemap.sh|   37 
 samples/bpf/test_queuemap_kern.c|   51 ++
 samples/bpf/test_queuemap_user.c|   53 ++
 tools/include/uapi/linux/bpf.h  |5 +
 tools/testing/selftests/bpf/test_maps.c |   72 
 14 files changed, 577 insertions(+), 26 deletions(-)
 create mode 100644 kernel/bpf/queuemap.c
 create mode 100755 samples/bpf/test_queuemap.sh
 create mode 100644 samples/bpf/test_queuemap_kern.c
 create mode 100644 samples/bpf/test_queuemap_user.c

--



Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-06 Thread Eran Ben Elisha


Hi Dave,
I would like to re-state that this feature was not meant to be a generic
one. This feature was added in order to resolve a HW bug which exists in
a small portion of our devices.


Would you mind describing the HW bug in more detail?  To an outside
reviewer it really looks like you're adding a feature.  What are you
working around?  Is the lack of full AQM on the PCIe side of the chip
considered a bug?


In a multi-function environment, there is an issue with per-function 
buffer allocation which may lead to starvation. There is a HW WA to 
mitigate this starvation by identifying this state and applying early 
drop/mark.





Those params will be used only on those current HWs and won't be in
use for our future devices.


I'm glad that is your plan today, however, customers may get used to
the simple interface you're adding now.  This means the API you are
adding is effectively becoming an API other drivers may need to
implement to keep compatibility with someone's proprietary
orchestration.


This issue was refactored, thus no need to have this WA at all in future 
NICs. So I don't believe we will end up in the situation you are describing.
It is less likely that other vendors will be facing the same issue and 
will have to support such a param. It was born out of a bug and is not a 
feature which others may follow.





During the discussions, several alternatives where offered to be used by
various members of the community. These alternatives includes TC and
enhancements to PCI configuration tools.

Regarding the TC, from my perspective, this is not an option as:
1) The HW mechanism handles multiple functions and therefore cannot be
configured as a regular TC qdisc


Could you elaborate?  What are the multiple functions?  You seem to be
adding a knob to enable ECN marking and a knob for choosing between
some predefined slopes.


PSB, the slopes are dynamic and enabled automatically.
Indeed, we are adding a very specific knob for a very non-standard,
specific issue, which can be used in addition to standard ECN marking.




In what way would your solution not behave like a RED offload?


Existing algorithms (RED, PIE, etc.) are static and configurable. Our HW WA
is dynamic (dynamic slope), self-adjusting and auto-enabled.




With TC offload you'd also get a well-defined set of statistics, I
presume right now you're planning on adding a set of ethtool -S
counters?


2) No PF + representors modeling can be applied here, this is a
MultiHost environment where one host is not aware of the other hosts,
and each is running on its own pci/driver. It is a device working mode
configuration.


Yes, the multihost part makes it less pleasant.  But this is a problem
we have to tackle separately, at some point.  It's not a center of
attention here.


Agree, however the multihost part makes it non-transparent if we choose a
solution which is not based on direct vendor configuration. This will
lead to a bad user experience.





3) The current HW WA is very limited; it may have a similar algorithm
to WRED, but it is being used for a much simpler, different use case (PCI
bus congestion).


No one is requesting full RED offload here..  if someone sets the
parameters you can't support you simply won't offload them.  And ignore
the parameters which only make sense in software terms.  Look at the
docs for mlxsw:

https://github.com/Mellanox/mlxsw/wiki/Queues-Management#offloading-red

It says "not offloaded" in a number of places.


It cannot be compared to a standard TC capability (RED/WRED), and
defining it as an offload fully controlled by the user would be a big
misuse.


It's generally preferable to implement a subset of an existing well-defined
API than to create vendor knobs, hence hardly a misuse.


As written above, this is not the case here.




(for example, drop rate cannot be configured)


I don't know what "configuring drop rate" means in case of RED..


Regarding the PCI config tools, there was a consensus that such a tool is
not acceptable as it is not part of the PCI spec.


As I said, this has nothing to do with PCI being the transport.  The
port you're running over could be serial, SPI or anything else.  You
have congestion on a port of a device, that's a networking problem.


Since module params/sysfs/debugfs/etc. are no longer acceptable, and
current drivers still need a way to do some configuration of
the device/driver which cannot use a standard Linux tool or be shared by
other vendors, devlink params were developed (under the assumption that
this tool will be helpful for those needs, and those only).

From my perspective, Devlink is the tool to configure the device for
handling such unexpected bugs, i.e "PCIe buffer congestion handling
workaround".


Hm.  Are you calling it a bug because you had to work around silicon
limitation in firmware?  Hm.  I'm very intrigued by the framing :)



Re: [endianness bug] cxgb4: mk_act_open_req() buggers ->{local,peer}_ip on big-endian hosts

2018-08-06 Thread Rahul Lakkireddy
On Sunday, August 08/05/18, 2018 at 22:52:38 +0530, Al Viro wrote:
>   Unlike fs.val.lport and fs.val.fport, cxgb4_process_flow_match()
> sets fs.val.{l,f}ip to net-endian values without conversion - they come
> straight from flow_dissector_key_ipv4_addrs ->dst and ->src resp.  So
> the assignment in mk_act_open_req() ought to be a straight copy.
> 
>   As far as I know, T4 PCIe cards do exist, so it's not as if that
> thing could only be found on little-endian systems...
> 
> Signed-off-by: Al Viro 
> ---
> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c 
> b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
> index 00fc5f1afb1d..7dddb9e748b8 100644
> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
> @@ -1038,10 +1038,8 @@ static void mk_act_open_req(struct filter_entry *f, 
> struct sk_buff *skb,
>   OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, qid_filterid));
>   req->local_port = cpu_to_be16(f->fs.val.lport);
>   req->peer_port = cpu_to_be16(f->fs.val.fport);
> - req->local_ip = f->fs.val.lip[0] | f->fs.val.lip[1] << 8 |
> - f->fs.val.lip[2] << 16 | f->fs.val.lip[3] << 24;
> - req->peer_ip = f->fs.val.fip[0] | f->fs.val.fip[1] << 8 |
> - f->fs.val.fip[2] << 16 | f->fs.val.fip[3] << 24;
> + memcpy(&req->local_ip, f->fs.val.lip, 4);
> + memcpy(&req->peer_ip, f->fs.val.fip, 4);
>   req->opt0 = cpu_to_be64(NAGLE_V(f->fs.newvlan == VLAN_REMOVE ||
>   f->fs.newvlan == VLAN_REWRITE) |
>   DELACK_V(f->fs.hitcnts) |

Thanks for the fix, Al!

Acked-by: Rahul Lakkireddy 
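
To see why the old shift-assembly was endian-dependent while the
straight copy is not, a small standalone sketch (illustration only,
not part of the patch):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
	uint8_t ip[4] = { 10, 0, 0, 1 };   /* net-endian, like fs.val.lip */
	uint32_t shifted, copied;

	/* old code: builds a host-endian integer from the bytes */
	shifted = ip[0] | ip[1] << 8 | ip[2] << 16 | ip[3] << 24;

	/* fixed code: keeps the wire byte order on any host */
	memcpy(&copied, ip, 4);

	/* identical on little-endian; different on big-endian */
	printf("shifted=%08x copied=%08x\n", shifted, copied);
	return 0;
}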


Re: [endianness bug?] cxgb4_next_header .match_val/.match_mask should be net-endian

2018-08-06 Thread Rahul Lakkireddy
On Sunday, August 08/05/18, 2018 at 20:58:11 +0530, Al Viro wrote:
>   AFAICS, cxgb4_next_header() expects to find match_val/match_mask in 
> struct cxgb4_next_header stored as big-endian:
> 
> /* Found a possible candidate.  Find a key that
>  * matches the corresponding offset, value, and
>  * mask to jump to next header.
>  */
> for (j = 0; j < cls->knode.sel->nkeys; j++) {
> off = cls->knode.sel->keys[j].off;
> val = cls->knode.sel->keys[j].val;
> mask = cls->knode.sel->keys[j].mask;
> 
> if (next[i].match_off == off &&
> next[i].match_val == val &&
> next[i].match_mask == mask) {
> found = true;
> break;
> }
> }
> 
> Here ->keys[] is struct tc_u32_key and there mask and val are definitely
> __be32.  match_val and match_mask are never changed after initialization
> and they are set to:
> 
> * .match_off = 8, .match_val = 0x600, .match_mask = 0xFF00
>   meant to check for IPV4.Protocol == TCP, i.e. octet at offset 9 being 6
> * .match_off = 8, .match_val = 0x1100, .match_mask = 0xFF00
>   meant to check for IPV4.Protocol == UDP, i.e. octet at offset 9 being 
> 0x11
> * .match_off = 4, .match_val = 0x6, .match_mask = 0xFF
>   IPV6.NextHeader == TCP, i.e. octet at offset 6 being 6
> * .match_off = 4, .match_val = 0x11, .match_mask = 0xFF
>   IPV6.NextHeader == UDP, i.e. octet at offset 6 being 0x11
> 
> On little-endian host those do yield the right values - e.g. 0x1100 is
> {0, 17, 0, 0}, etc.  On big-endian, though, these will end up checking
> in IPv4 case the octet at offset 10 (i.e. upper 16 bits of checksum) and for 
> IPv6
> - the octet at offset 5 (i.e.  the lower 8 bits of payload length).
> 
> Unless I'm misreading that code, it needs the following to do the right
> thing both on l-e and b-e.  Comments?
> 
> Signed-off-by: Al Viro 
> ---
> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h 
> b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h
> index a4b99edcc339..ec226b1cebf4 100644
> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h
> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h
> @@ -259,11 +259,11 @@ struct cxgb4_next_header {
>   */
>  static const struct cxgb4_next_header cxgb4_ipv4_jumps[] = {
>   { .offset = 0, .offoff = 0, .shift = 6, .mask = 0xF,
> -   .match_off = 8, .match_val = 0x600, .match_mask = 0xFF00,
> -   .jump = cxgb4_tcp_fields },
> +   .match_off = 8, .match_val = htonl(6 << 16),
> +   .match_mask = htonl(0xff<<16), .jump = cxgb4_tcp_fields },
>   { .offset = 0, .offoff = 0, .shift = 6, .mask = 0xF,
> -   .match_off = 8, .match_val = 0x1100, .match_mask = 0xFF00,
> -   .jump = cxgb4_udp_fields },
> +   .match_off = 8, .match_val = htonl(17 << 16),
> +   .match_mask = htonl(0xff<<16), .jump = cxgb4_udp_fields },
>   { .jump = NULL }
>  };
>  
> @@ -272,11 +272,11 @@ static const struct cxgb4_next_header 
> cxgb4_ipv4_jumps[] = {
>   */
>  static const struct cxgb4_next_header cxgb4_ipv6_jumps[] = {
>   { .offset = 0x28, .offoff = 0, .shift = 0, .mask = 0,
> -   .match_off = 4, .match_val = 0x6, .match_mask = 0xFF,
> -   .jump = cxgb4_tcp_fields },
> +   .match_off = 4, .match_val = htonl(6 << 8),
> +   .match_mask = htonl(0xff << 8), .jump = cxgb4_tcp_fields },
>   { .offset = 0x28, .offoff = 0, .shift = 0, .mask = 0,
> -   .match_off = 4, .match_val = 0x11, .match_mask = 0xFF,
> -   .jump = cxgb4_udp_fields },
> +   .match_off = 4, .match_val = htonl(17 << 8),
> +   .match_mask = htonl(0xff << 8), .jump = cxgb4_udp_fields },
>   { .jump = NULL }
>  };
>  

Your observation is correct. The current logic is broken for
Big-Endian. Thanks for fixing it up Al!

As you've already found in the other email thread, ->mask also
needs to be fixed. Should be ".mask = htons(0xf << 8)".

Thanks,
Rahul
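
A quick sanity check of the fixed initializers (illustration only, not
from the patch): htonl() places the protocol value at the same packet
octet on either host type.

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main(void)
{
	uint32_t val = htonl(6 << 16);     /* IPPROTO_TCP, .match_off = 8 */
	uint8_t *b = (uint8_t *)&val;

	/* prints "00 06 00 00" on any host: the 6 sits at octet
	 * 8 + 1 = 9, i.e. the IPv4 Protocol field
	 */
	printf("%02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);
	return 0;
}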


[PATCH net-next] ip6_tunnel: collect_md xmit: Use ip_tunnel_key's provided src address

2018-08-06 Thread Shmulik Ladkani
When using an ip6tnl device in collect_md mode, the xmit methods ignore
the ipv6.src field present in skb_tunnel_info's key, both for route
calculation purposes (flowi6 construction) and for assigning the
packet's final ipv6h->saddr.

This makes it impossible specifying a desired ipv6 local address in the
encapsulating header (for example, when using tc action tunnel_key).

This is also not aligned with behavior of ipip (ipv4) in collect_md
mode, where the key->u.ipv4.src gets used.

Fix, by assigning fl6.saddr with given key->u.ipv6.src.
In case ipv6.src is not specified, ip6_tnl_xmit uses existing saddr
selection code.

Signed-off-by: Shmulik Ladkani 
Reviewed-by: Eyal Birger 
Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
---
 net/ipv6/ip6_tunnel.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 00e138a44cbb..820cebe0c687 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1113,7 +1113,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
dst = NULL;
goto tx_err_link_failure;
}
-   if (t->parms.collect_md &&
+   if (t->parms.collect_md && ipv6_addr_any(&fl6->saddr) &&
    ipv6_dev_get_saddr(net, ip6_dst_idev(dst)->dev,
       &fl6->daddr, 0, &fl6->saddr))
goto tx_err_link_failure;
@@ -1255,6 +1255,7 @@ ip4ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev)
key = &tun_info->key;
memset(&fl6, 0, sizeof(fl6));
fl6.flowi6_proto = IPPROTO_IPIP;
+   fl6.saddr = key->u.ipv6.src;
fl6.daddr = key->u.ipv6.dst;
fl6.flowlabel = key->label;
dsfield =  key->tos;
@@ -1326,6 +1327,7 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev)
key = &tun_info->key;
memset(&fl6, 0, sizeof(fl6));
fl6.flowi6_proto = IPPROTO_IPV6;
+   fl6.saddr = key->u.ipv6.src;
fl6.daddr = key->u.ipv6.dst;
fl6.flowlabel = key->label;
dsfield = key->tos;
-- 
2.18.0



e1000e driver stuck at 10Mbps after reconnection

2018-08-06 Thread Camille Bordignon
Hello,

Recently we experienced some issues with Intel NICs (I219-LM and I219-V).
It seems that after a wire reconnection, auto-negotiation "fails" and
link speed drops to 10 Mbps.

From kernel logs:
[17616.346150] e1000e: enp0s31f6 NIC Link is Down
[17627.003322] e1000e: enp0s31f6 NIC Link is Up 10 Mbps Full Duplex, Flow 
Control: None
[17627.003325] e1000e :00:1f.6 enp0s31f6: 10/100 speed: disabling TSO


$ethtool enp0s31f6
Settings for enp0s31f6:
Supported ports: [ TP ]
Supported link modes:   10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes:  10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 10Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: on (auto)
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x0007 (7)
   drv probe link
Link detected: yes


Notice that if the disconnection lasts less than about 5 seconds,
nothing wrong happens.
And if, after the last failure, disconnection / connection occurs again and
lasts less than 5 seconds, link speed is back to 1000 Mbps.

[18075.350678] e1000e: enp0s31f6 NIC Link is Down
[18078.716245] e1000e: enp0s31f6 NIC Link is Up 1000 Mbps Full Duplex, Flow 
Control: None

The following patch seems to fix this issue.
However I don't clearly understand why.

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 3ba0c90e7055..763c013960f1 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -5069,7 +5069,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
case e1000_media_type_copper:
if (hw->mac.get_link_status) {
ret_val = hw->mac.ops.check_for_link(hw);
-   link_active = !hw->mac.get_link_status;
+   link_active = false;
} else {
link_active = true;
}

Maybe this is related to watchdog task.

I found this fix by comparing with the last commit that works fine:
commit 0b76aae741abb9d16d2c0e67f8b1e766576f897d.
However I don't know if this information is relevant.

Thank you.
Camille Bordignon


[PATCH RFC net-next 0/1] net/tls: Combined memory allocation for decryption request

2018-08-06 Thread Vakul Garg
This patch does a combined memory allocation from the heap for the
scatterlists, aead_request, aad and iv of the tls record decryption path.
In the present code, aead_request is allocated from the heap, while the
scatterlists are conditionally allocated on the heap or on the stack. This
is inefficient as it may require multiple kmalloc/kfree calls.

The initialization vector passed in the decryption request is allocated on
the stack. This is a problem since stack memory is not dma-able from
crypto accelerators.

Doing one combined memory allocation for each decryption request fixes
both the above issues. It also paves the way to submitting multiple
async decryption requests while the previous one is pending, i.e. being
processed or queued.
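
The idea, in a simplified kernel-style sketch (names are illustrative;
the real layout and sizes are computed in the patch itself):

/* one allocation carved into aead_req || sgin[] || sgout[] || aad || iv */
aead_size = sizeof(*aead_req) + crypto_aead_reqsize(tfm);
mem = kmalloc(aead_size +
	      (n_sgin + n_sgout) * sizeof(struct scatterlist) +
	      TLS_AAD_SPACE_SIZE + crypto_aead_ivsize(tfm),
	      sk->sk_allocation);

aead_req = (struct aead_request *)mem;
sgin = (struct scatterlist *)(mem + aead_size);
sgout = sgin + n_sgin;
aad = (u8 *)(sgout + n_sgout);
iv = aad + TLS_AAD_SPACE_SIZE;

A single kfree(mem) then releases everything, and none of the buffers
live on the stack, so they are safe for DMA.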

This patch has been built over Doron Roberts-Kedes's patch:

"net/tls: Calculate nsg for zerocopy path without skb_cow_data"


Vakul Garg (1):
  net/tls: Combined memory allocation for decryption request

 include/net/tls.h |   4 -
 net/tls/tls_sw.c  | 256 +++---
 2 files changed, 147 insertions(+), 113 deletions(-)

-- 
2.13.6



[PATCH RFC net-next 1/1] net/tls: Combined memory allocation for decryption request

2018-08-06 Thread Vakul Garg
For preparing the decryption request, several memory chunks are required
(aead_req, sgin, sgout, iv, aad). For submitting the decrypt request to
an accelerator, the buffers which are read by the accelerator must be
dma-able and must not come from the stack. The buffers for aad and iv
could each be kmalloced separately, but that is inefficient.
This patch does a combined allocation for preparing the decryption request
and then segments it into aead_req || sgin || sgout || iv || aad.

Signed-off-by: Vakul Garg 
---
 include/net/tls.h |   4 -
 net/tls/tls_sw.c  | 257 +++---
 2 files changed, 148 insertions(+), 113 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index d8b3b6578c01..d5c683e8bb22 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -124,10 +124,6 @@ struct tls_sw_context_rx {
struct sk_buff *recv_pkt;
u8 control;
bool decrypted;
-
-   char rx_aad_ciphertext[TLS_AAD_SPACE_SIZE];
-   char rx_aad_plaintext[TLS_AAD_SPACE_SIZE];
-
 };
 
 struct tls_record_info {
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index cd5ed2d1dbe8..a478b06fc015 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -118,19 +118,13 @@ static int tls_do_decryption(struct sock *sk,
 struct scatterlist *sgout,
 char *iv_recv,
 size_t data_len,
-struct sk_buff *skb,
-gfp_t flags)
+struct aead_request *aead_req)
 {
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
-   struct aead_request *aead_req;
-
int ret;
 
-   aead_req = aead_request_alloc(ctx->aead_recv, flags);
-   if (!aead_req)
-   return -ENOMEM;
-
+   aead_request_set_tfm(aead_req, ctx->aead_recv);
aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
aead_request_set_crypt(aead_req, sgin, sgout,
   data_len + tls_ctx->rx.tag_size,
@@ -139,8 +133,6 @@ static int tls_do_decryption(struct sock *sk,
  crypto_req_done, &ctx->async_wait);
 
ret = crypto_wait_req(crypto_aead_decrypt(aead_req), &ctx->async_wait);
-
-   aead_request_free(aead_req);
return ret;
 }
 
@@ -727,8 +719,138 @@ static struct sk_buff *tls_wait_data(struct sock *sk, int 
flags,
return skb;
 }
 
+/* This function decrypts the input skb into either out_iov or in out_sg
+ * or in skb buffers itself. The input parameter 'zc' indicates if
+ * zero-copy mode needs to be tried or not. With zero-copy mode, either
+ * out_iov or out_sg must be non-NULL. In case both out_iov and out_sg are
+ * NULL, then the decryption happens inside skb buffers itself, i.e.
+ * zero-copy gets disabled and 'zc' is updated.
+ */
+
+static int decrypt_internal(struct sock *sk, struct sk_buff *skb,
+   struct iov_iter *out_iov,
+   struct scatterlist *out_sg,
+   int *chunk, bool *zc)
+{
+   struct tls_context *tls_ctx = tls_get_ctx(sk);
+   struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
+   struct strp_msg *rxm = strp_msg(skb);
+   int n_sgin, n_sgout, nsg, mem_size, aead_size, err, pages = 0;
+   struct aead_request *aead_req;
+   struct sk_buff *unused;
+   u8 *aad, *iv, *mem = NULL;
+   struct scatterlist *sgin = NULL;
+   struct scatterlist *sgout = NULL;
+   const int data_len = rxm->full_len - tls_ctx->rx.overhead_size;
+
+   if (*zc && (out_iov || out_sg)) {
+   if (out_iov)
+   n_sgout = iov_iter_npages(out_iov, INT_MAX) + 1;
+   else if (out_sg)
+   n_sgout = sg_nents(out_sg);
+   else
+   goto no_zerocopy;
+
+   n_sgin = skb_nsg(skb, rxm->offset + tls_ctx->rx.prepend_size,
+rxm->full_len - tls_ctx->rx.prepend_size);
+   } else {
+no_zerocopy:
+   n_sgout = 0;
+   *zc = false;
+   n_sgin = skb_cow_data(skb, 0, &unused);
+   }
+
+   if (n_sgin < 1)
+   return -EBADMSG;
+
+   /* Increment to accommodate AAD */
+   n_sgin = n_sgin + 1;
+
+   nsg = n_sgin + n_sgout;
+
+   aead_size = sizeof(*aead_req) + crypto_aead_reqsize(ctx->aead_recv);
+   mem_size = aead_size + (nsg * sizeof(struct scatterlist));
+   mem_size = mem_size + TLS_AAD_SPACE_SIZE;
+   mem_size = mem_size + crypto_aead_ivsize(ctx->aead_recv);
+
+   /* Allocate a single block of memory which contains
+* aead_req || sgin[] || sgout[] || aad || iv.
+* This order achieves correct alignment for aead_req, sgin, sgout.
+*/
+   mem = kmalloc(mem_size, sk->sk_allocation);
+   if (!mem)
+   return -ENOMEM;
+
+   /* 

RE: [PATCH v2 0/2] net/sctp: Avoid allocating high order memory with kmalloc()

2018-08-06 Thread David Laight
From: Michael Tuexen
> Sent: 03 August 2018 21:57
...
> >> Given how useless SCTP streams are, does anything actually use
> >> more than about 4?
> >
> > Maybe Michael can help us with that. I'm also curious now.
> In the context of SIGTRAN I have seen 17 streams...

Ok, I've seen 17 there as well, 5 is probably more common.

> In the context of WebRTC I have seen more streams. In general,
> the streams concept seems to be useful. QUIC has lots of streams.
> 
> So I'm wondering why they are considered useless.
> David, can you elaborate on this?

I don't think a lot of people know what they actually are.

Streams just allow some receive data to be forwarded to applications when
receive message(s) on other stream(s) are lost and have to be retransmitted.

I suspect some people think that the separate streams have separate flow
control, not just separate data sequences.

M2PA separates control messages (stream 0) from user data (stream 1).
I think the spec even suggests this is so control messages get through when
user data is flow controlled off - not true (it would be true for ISO
transport's 'expedited data').

M3UA will use 16 streams (one for each (ITU) SLS), but uses stream 0 for 
control.
If a data message is lost then data for the other sls can be passed to the
userpart/mtp3 - this might save bursty processing when the SACK-requested
retransmission arrives. But I doubt you'd want to run M3UA on anything lossy
enough for more than 4 data streams to make sense.

Even M3UA separating control onto stream 0 and data onto 1-n doesn't seem
useful to me.

If QUIC is using 'lots of streams' is it just using the stream-id as a qualifier
for the data? Rather than requiring the 'not head of line blocking' feature
of sctp streams?

Thought:
Could we let the application set large stream-ids, but actually mask them
down to (say) 32 for the protocol code?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH wpan 2/2] net: mac802154: tx: expand tailroom if necessary

2018-08-06 Thread Stefan Schmidt
Hello.

On 07/02/2018 10:32 PM, Alexander Aring wrote:
> This patch is necessary in case of AF_PACKET or other socket interfaces
> which, as far as I am aware, do not allocate the necessary room.
> 
> Reported-by: David Palma 
> Reported-by: Rabi Narayan Sahoo 
> Signed-off-by: Alexander Aring 
> ---
>  net/mac802154/tx.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
> index 7e253455f9dd..bcd1a5e6ebf4 100644
> --- a/net/mac802154/tx.c
> +++ b/net/mac802154/tx.c
> @@ -63,8 +63,21 @@ ieee802154_tx(struct ieee802154_local *local, struct 
> sk_buff *skb)
>   int ret;
>  
>   if (!(local->hw.flags & IEEE802154_HW_TX_OMIT_CKSUM)) {
> - u16 crc = crc_ccitt(0, skb->data, skb->len);
> + struct sk_buff *nskb;
> + u16 crc;
> +
> + if (unlikely(skb_tailroom(skb) < IEEE802154_FCS_LEN)) {
> + nskb = skb_copy_expand(skb, 0, IEEE802154_FCS_LEN,
> +GFP_ATOMIC);
> + if (likely(nskb)) {
> + consume_skb(skb);
> + skb = nskb;
> + } else {
> + goto err_tx;
> + }
> + }
>  
> + crc = crc_ccitt(0, skb->data, skb->len);
>   put_unaligned_le16(crc, skb_put(skb, 2));
>   }
>  
> 

This patch has been applied to the wpan-next tree and will be
part of the next pull request to net-next. Thanks!

I know you submitted this for wpan instead of wpan-next, but with rc8
being out I will not submit another pull request for 4.18. Instead I
added cc stable to the patch to make sure it gets picked into the stable
tree once 4.18 is out.

regards
Stefan Schmidt


Re: [PATCHv2 wpan] net: 6lowpan: fix reserved space for single frames

2018-08-06 Thread Stefan Schmidt
Hello.

On 07/14/2018 06:52 PM, Alexander Aring wrote:
> This patch adds handling to take care of tail and headroom for
> single 6lowpan frames. We need to be sure we have an skb with the right
> head and tailroom for single frames. This patch does it by using
> skb_copy_expand() if head and tailroom are not sufficiently allocated by
> the upper layer.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195059
> Reported-by: David Palma 
> Reported-by: Rabi Narayan Sahoo 
> Signed-off-by: Alexander Aring 
> ---
> changes since v1:
>  skb to nskb, pointed out by stefan
> 
>  net/ieee802154/6lowpan/tx.c | 21 ++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ieee802154/6lowpan/tx.c b/net/ieee802154/6lowpan/tx.c
> index e6ff5128e61a..ca53efa17be1 100644
> --- a/net/ieee802154/6lowpan/tx.c
> +++ b/net/ieee802154/6lowpan/tx.c
> @@ -265,9 +265,24 @@ netdev_tx_t lowpan_xmit(struct sk_buff *skb, struct 
> net_device *ldev)
>   /* We must take a copy of the skb before we modify/replace the ipv6
>* header as the header could be used elsewhere
>*/
> - skb = skb_unshare(skb, GFP_ATOMIC);
> - if (!skb)
> - return NET_XMIT_DROP;
> + if (unlikely(skb_headroom(skb) < ldev->needed_headroom ||
> +  skb_tailroom(skb) < ldev->needed_tailroom)) {
> + struct sk_buff *nskb;
> +
> + nskb = skb_copy_expand(skb, ldev->needed_headroom,
> +ldev->needed_tailroom, GFP_ATOMIC);
> + if (likely(nskb)) {
> + consume_skb(skb);
> + skb = nskb;
> + } else {
> + kfree_skb(skb);
> + return NET_XMIT_DROP;
> + }
> + } else {
> + skb = skb_unshare(skb, GFP_ATOMIC);
> + if (!skb)
> + return NET_XMIT_DROP;
> + }
>  
>   ret = lowpan_header(skb, ldev, _size, _offset);
>   if (ret < 0) {
> 

This patch has been applied to the wpan-next tree and will be
part of the next pull request to net-next. Thanks!

I know you submitted this for wpan instead of wpan-next, but with rc8
being out I will not submit another pull request for 4.18. Instead I
added cc stable to the patch to make sure it gets picked into the stable
tree once 4.18 is out.

regards
Stefan Schmidt


Re: [PATCH] mellanox: fix the dport endianness in call of __inet6_lookup_established()

2018-08-06 Thread Boris Pismenny




On 8/6/2018 3:27 AM, David Miller wrote:

From: Al Viro 
Date: Sat, 4 Aug 2018 21:41:27 +0100


__inet6_lookup_established() expects th->dport passed in host-endian,
not net-endian.  The reason is a micro-optimization in __inet6_lookup(),
but if you use the lower-level helpers, you have to play by their
rules...
 
Signed-off-by: Al Viro 


Mellanox folks, please review.



Looks good to me. Thanks Al.


Re: [stmmac][bug?] endianness of Flexible RX Parser code

2018-08-06 Thread Jose Abreu
Hi Al,

On 04-08-2018 02:19, Al Viro wrote:
>   The values passed in struct tc_u32_sel ->mask and ->val are
> 32bit net-endian.  Your tc_fill_entry() does this:
>
> data = sel->keys[0].val;
> mask = sel->keys[0].mask;
>
> ...
> entry->frag_ptr = frag;
> entry->val.match_en = (mask << (rem * 8)) &
> GENMASK(31, rem * 8);
> entry->val.match_data = (data << (rem * 8)) &
> GENMASK(31, rem * 8);
> entry->val.frame_offset = real_off;
> entry->prio = prio;
>
> frag->val.match_en = (mask >> (rem * 8)) &
> GENMASK(rem * 8 - 1, 0);
> frag->val.match_data = (data >> (rem * 8)) &
> GENMASK(rem * 8 - 1, 0);
> frag->val.frame_offset = real_off + 1;
> frag->prio = prio;
> frag->is_frag = true;
>
> and that looks very odd.  rem here is offset modulo 4.  Suppose offset is
> equal to 5, val contains {V0, V1, V2, V3} and mask - {M0, M1, M2, M3}.
> Then on little-endian host we get
> entry->val.match_en:  {0, M0, M1, M2}
> entry->val.match_data:{0, V0, V1, V2}
> entry->val.frame_offset = 1;
> frag->val.match_en:   {M3, 0, 0, 0}
> frag->val.match_data: {V3, 0, 0, 0}
> frag->val.frame_offset = 2;
> and on big-endian
> entry->val.match_en:  {M1, M2, M3, 0}
> entry->val.match_data:{V1, V2, V3, 0}
> entry->val.frame_offset = 1;
> frag->val.match_en:   {0, 0, 0, M0}
> frag->val.match_data: {0, 0, 0, V0}
> frag->val.frame_offset = 2;
>
> Little-endian variant looks like we mask octets 5, 6, 7 and 8 with
> M0..M3 resp. and want V0..V3 in those.  On big-endian, though, we
> look at the octets 11, 4, 5 and 6 instead.
>
> I don't know the hardware (and it might be pulling any kind of weird
> endianness-dependent stunts), but that really smells like a bug.

There is a feature in HW that supports Byte-Invariant Big-Endian
data transfer. It's not activated by default though ...

> It looks like that code is trying to do something like
>
> data = ntohl(sel->keys[0].val);
> mask = ntohl(sel->keys[0].mask);
>   shift = rem * 8;
>
>   entry->val.match_en = htonl(mask >> shift);
>   entry->val.match_data = htonl(data >> shift);
>   entry->val.frame_offset = real_off;
>   ...
>   frag->val.match_en = htonl(mask << (32 - shift));
>   frag->val.match_data = htonl(data << (32 - shift));
>   frag->val.frame_offset = real_off + 1;
>
> Comments?

Looks good. Can you send a formal patch and a simple command that
I can use to validate this situation?

I used at the time at least two commands: one for validating
simple match by using 1 entry (i.e. plain match) and another to
validate 2 entries (i.e. two matches) and did not encounter any
problem ...

Thanks and Best Regards,
Jose Miguel Abreu


[PATCH net-next-internal] net: sched: cls_flower: set correct offload data in fl_reoffload

2018-08-06 Thread Vlad Buslov
fl_reoffload implementation sets following members of struct
tc_cls_flower_offload incorrectly:
 - masked key instead of mask
 - key instead of masked key

Fix fl_reoffload to provide correct data to offload callback.

Fixes: 31533cba4327 ("net: sched: cls_flower: implement offload tcf_proto_op")
Signed-off-by: Vlad Buslov 
Acked-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e8bd08ba998a..cbf165c33656 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1174,8 +1174,8 @@ static int fl_reoffload(struct tcf_proto *tp, bool add, 
tc_setup_cb_t *cb,
TC_CLSFLOWER_REPLACE : TC_CLSFLOWER_DESTROY;
cls_flower.cookie = (unsigned long)f;
cls_flower.dissector = &mask->dissector;
-   cls_flower.mask = &f->mkey;
-   cls_flower.key = &f->key;
+   cls_flower.mask = &mask->key;
+   cls_flower.key = &f->mkey;
cls_flower.exts = &f->exts;
cls_flower.classid = f->res.classid;
 
-- 
2.7.5



Re: [PATCH v8 bpf-next 00/10] veth: Driver XDP

2018-08-06 Thread Björn Töpel

On 2018-08-03 11:45, Jesper Dangaard Brouer wrote:

On Fri,  3 Aug 2018 16:58:08 +0900
Toshiaki Makita  wrote:


This patch set introduces driver XDP for veth.
Basically this is used in conjunction with redirect action of another XDP
program.

   NIC ---> veth===veth
  (XDP) (redirect)(XDP)



I was playing with V7 on my testlab yesterday and I noticed one
fundamental issue.  You are not updating the "ifconfig" stats counters
when in XDP mode.  This makes receive or send via XDP invisible to
sysadm/management tools.  This for sure is going to cause confusion...

I took a closer look at other drivers. The ixgbe driver is doing the
right thing.  The i40e driver has a bug where RX/TX stats get swapped
(strange!).


Indeed! Thanks for finding/reporting this! I'll have look!


Björn


The mlx5 driver is not updating the regular RX/TX
counters, but A LOT of other ethtool stats counters (which are the ones
I usually monitor when testing).

So, given other drivers also didn't get this right, we need to have a
discussion outside your/this patchset.  Thus, I don't want to
stop/stall this patchset, but this is something we need to fixup in a
followup patchset to other drivers as well.

Thus, I'm acking the patchset, but I request that we do a joint effort
of fixing this as followup patches.

Acked-by: Jesper Dangaard Brouer 



[net-next:master 457/471] ERROR: "__aeabi_ldivmod" [drivers/ptp/ptp_qoriq.ko] undefined!

2018-08-06 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   981467033a37d916649647fa3afe1fe99bba1817
commit: 91305f2812624c0cf7ccbb44133b66d3b24676e4 [457/471] ptp_qoriq: support 
automatic configuration for ptp timer
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 91305f2812624c0cf7ccbb44133b66d3b24676e4
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=arm 

All errors (new ones prefixed by >>):

>> ERROR: "__aeabi_ldivmod" [drivers/ptp/ptp_qoriq.ko] undefined!
>> ERROR: "__aeabi_uldivmod" [drivers/ptp/ptp_qoriq.ko] undefined!

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH v3 net-next 5/9] net: stmmac: Add MDIO related functions for XGMAC2

2018-08-06 Thread Jose Abreu
On 03-08-2018 20:06, Florian Fainelli wrote:
> On 08/03/2018 08:50 AM, Jose Abreu wrote:
>> Add the MDIO related funcionalities for the new IP block XGMAC2.
>>
>> Signed-off-by: Jose Abreu 
>> Cc: David S. Miller 
>> Cc: Joao Pinto 
>> Cc: Giuseppe Cavallaro 
>> Cc: Alexandre Torgue 
>> Cc: Andrew Lunn 
>> ---
>> +static int stmmac_xgmac2_c22_format(struct stmmac_priv *priv, int phyaddr,
>> +int phyreg, u32 *hw_addr)
>> +{
>> +unsigned int mii_data = priv->hw->mii.data;
>> +u32 tmp;
>> +
>> +/* HW does not support C22 addr >= 4 */
>> +if (phyaddr >= 4)
>> +return -ENODEV;
> It would be nice if this could be moved at probe time so you don't have
> to wait until you connect to the PHY, read its PHY OUI and find out it
> has a MDIO address >= 4. Not a blocker, but something that could be
> improved further on.
>
> In premise you could even scan the MDIO bus' device tree node, and find
> that out ahead of time.

Oh, but I use phylib ... I only provide the read/write callbacks
so I think we should avoid duplicating the code that's already in
the phylib ... No?

>
>> +/* Wait until any existing MII operation is complete */
>> +if (readl_poll_timeout(priv->ioaddr + mii_data, tmp,
>> +   !(tmp & MII_XGMAC_BUSY), 100, 1))
>> +return -EBUSY;
>> +
>> +/* Set port as Clause 22 */
>> +tmp = readl(priv->ioaddr + XGMAC_MDIO_C22P);
>> +tmp |= BIT(phyaddr);
> Since the registers are being Read/Modify/Write here, don't you need to
> clear the previous address bits as well?
>
> You probably did not encounter any problems in your testing if you had
> only one PHY on the MDIO bus, but this is not something that is
> necessarily true, e.g: if you have an Ethernet switch, several MDIO bus
> addresses are going to be responding.
>
> Your MDIO bus implementation must be able to support one transaction
> with one PHY address and the next transaction with another PHY address ,
> etc...
>
> That is something that should be easy to fix and be resubmitted as part
> of v4.

I'm not following you here... I only set/unset the bit for the
corresponding phyaddr that phylib wants to read/write. Why would
I clear the remaining addresses?

Thanks and Best Regards,
Jose Miguel Abreu
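
To make the read/modify/write concern concrete, the clearing Florian
asks about would look roughly like this (sketch only; the mask is
illustrative, not from the patch):

	/* drop any previously latched C22 port bits before selecting
	 * the PHY this transaction targets
	 */
	tmp = readl(priv->ioaddr + XGMAC_MDIO_C22P);
	tmp &= ~GENMASK(3, 0);	/* HW supports C22 addresses 0..3 */
	tmp |= BIT(phyaddr);
	writel(tmp, priv->ioaddr + XGMAC_MDIO_C22P);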


Re: [PATCH v3 net-next 3/9] net: stmmac: Add DMA related callbacks for XGMAC2

2018-08-06 Thread Jose Abreu
On 03-08-2018 19:58, Florian Fainelli wrote:
> On 08/03/2018 08:50 AM, Jose Abreu wrote:
>> Add the DMA related callbacks for the new IP block XGMAC2.
>>
>> Signed-off-by: Jose Abreu 
>> Cc: David S. Miller 
>> Cc: Joao Pinto 
>> Cc: Giuseppe Cavallaro 
>> Cc: Alexandre Torgue 
>> ---
>> +value &= ~XGMAC_RD_OSR_LMT;
>> +value |= (axi->axi_rd_osr_lmt << XGMAC_RD_OSR_LMT_SHIFT) &
>> +XGMAC_RD_OSR_LMT;
>> +
>> +for (i = 0; i < AXI_BLEN; i++) {
>> +if (axi->axi_blen[i])
>> +value &= ~XGMAC_UNDEF;
> Should not you be you clearing all XGMAC_BLEN* values since you do a
> logical or here? I am assuming this is not something that would likely
> change from one open/close but still?

Yeah, this won't change between open/close but I will add the
mask anyway.

Thanks and Best Regards,
Jose Miguel Abreu


Re: [PATCH v3 net-next 1/9] net: stmmac: Add XGMAC 2.10 HWIF entry

2018-08-06 Thread Jose Abreu
Hi Florian,

On 03-08-2018 19:54, Florian Fainelli wrote:
> On 08/03/2018 08:50 AM, Jose Abreu wrote:
>> @@ -87,6 +88,7 @@ static const struct stmmac_hwif_entry {
>>  {
>>  .gmac = false,
>>  .gmac4 = false,
>> +.xgmac = false,
> In a future clean-up you would likely want to remove this and replace it
> with an enumeration, which is less error prone than having to define a
> boolean for each of these previous generations only to say "this is not an
> xgmac".

It's a good idea! I really don't like the pattern of "if
(priv->plat->has_something)" in the code. I will think about
adding this but for now let's merge xgmac2 and I will try to
refactor the code later because I will need to change many files
and the patchset would be bigger.

Thanks and Best Regards,
Jose Miguel Abreu


[PATCH net-next 05/14] net: sched: act_ipt: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Use tcf spinlock to protect ipt action private data from concurrent
modification during dump. Ipt init already takes tcf spinlock when
modifying ipt state.

Signed-off-by: Vlad Buslov 
---
 net/sched/act_ipt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c
index 0dc787a57798..e149f0e66cb6 100644
--- a/net/sched/act_ipt.c
+++ b/net/sched/act_ipt.c
@@ -288,6 +288,7 @@ static int tcf_ipt_dump(struct sk_buff *skb, struct 
tc_action *a, int bind,
 * for foolproof you need to not assume this
 */
 
+   spin_lock_bh(&ipt->tcf_lock);
t = kmemdup(ipt->tcfi_t, ipt->tcfi_t->u.user.target_size, GFP_ATOMIC);
if (unlikely(!t))
goto nla_put_failure;
@@ -307,10 +308,12 @@ static int tcf_ipt_dump(struct sk_buff *skb, struct 
tc_action *a, int bind,
if (nla_put_64bit(skb, TCA_IPT_TM, sizeof(tm), &tm, TCA_IPT_PAD))
goto nla_put_failure;
 
+   spin_unlock_bh(&ipt->tcf_lock);
kfree(t);
return skb->len;
 
 nla_put_failure:
+   spin_unlock_bh(&ipt->tcf_lock);
nlmsg_trim(skb, b);
kfree(t);
return -1;
-- 
2.7.5



[PATCH net-next 06/14] net: sched: act_pedit: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Rearrange pedit init code to only access pedit action data while holding
tcf spinlock. Change keys allocation type to atomic to allow it to execute
while holding tcf spinlock. Take tcf spinlock in dump function when
accessing pedit action data.
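
The reason for the allocation change, in one line: code running under a
spinlock may not sleep, so the allocation has to switch from GFP_KERNEL
to GFP_ATOMIC (sketch of the pattern):

	spin_lock_bh(&p->tcf_lock);
	keys = kmalloc(ksize, GFP_ATOMIC);	/* must not sleep under the lock */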

Signed-off-by: Vlad Buslov 
---
 net/sched/act_pedit.c | 40 
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index 43ba999b2d23..3f62da72ab6a 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -187,44 +187,38 @@ static int tcf_pedit_init(struct net *net, struct nlattr 
*nla,
tcf_idr_cleanup(tn, parm->index);
goto out_free;
}
-   p = to_pedit(*a);
-   keys = kmalloc(ksize, GFP_KERNEL);
-   if (!keys) {
-   tcf_idr_release(*a, bind);
-   ret = -ENOMEM;
-   goto out_free;
-   }
ret = ACT_P_CREATED;
} else if (err > 0) {
if (bind)
goto out_free;
if (!ovr) {
-   tcf_idr_release(*a, bind);
ret = -EEXIST;
-   goto out_free;
-   }
-   p = to_pedit(*a);
-   if (p->tcfp_nkeys && p->tcfp_nkeys != parm->nkeys) {
-   keys = kmalloc(ksize, GFP_KERNEL);
-   if (!keys) {
-   ret = -ENOMEM;
-   goto out_free;
-   }
+   goto out_release;
}
} else {
return err;
}
 
+   p = to_pedit(*a);
+   spin_lock_bh(&p->tcf_lock);
-   p->tcfp_flags = parm->flags;
-   p->tcf_action = parm->action;
-   if (keys) {
+
+   if (ret == ACT_P_CREATED ||
+   (p->tcfp_nkeys && p->tcfp_nkeys != parm->nkeys)) {
+   keys = kmalloc(ksize, GFP_ATOMIC);
+   if (!keys) {
+   spin_unlock_bh(&p->tcf_lock);
+   ret = -ENOMEM;
+   goto out_release;
+   }
kfree(p->tcfp_keys);
p->tcfp_keys = keys;
p->tcfp_nkeys = parm->nkeys;
}
memcpy(p->tcfp_keys, parm->keys, ksize);
 
+   p->tcfp_flags = parm->flags;
+   p->tcf_action = parm->action;
+
kfree(p->tcfp_keys_ex);
p->tcfp_keys_ex = keys_ex;
 
@@ -232,6 +226,9 @@ static int tcf_pedit_init(struct net *net, struct nlattr 
*nla,
if (ret == ACT_P_CREATED)
tcf_idr_insert(tn, *a);
return ret;
+
+out_release:
+   tcf_idr_release(*a, bind);
 out_free:
kfree(keys_ex);
return ret;
@@ -410,6 +407,7 @@ static int tcf_pedit_dump(struct sk_buff *skb, struct 
tc_action *a,
if (unlikely(!opt))
return -ENOBUFS;
 
+   spin_lock_bh(&p->tcf_lock);
memcpy(opt->keys, p->tcfp_keys,
   p->tcfp_nkeys * sizeof(struct tc_pedit_key));
opt->index = p->tcf_index;
@@ -432,11 +430,13 @@ static int tcf_pedit_dump(struct sk_buff *skb, struct 
tc_action *a,
tcf_tm_dump(&t, &p->tcf_tm);
if (nla_put_64bit(skb, TCA_PEDIT_TM, sizeof(t), &t, TCA_PEDIT_PAD))
goto nla_put_failure;
+   spin_unlock_bh(&p->tcf_lock);
 
kfree(opt);
return skb->len;
 
 nla_put_failure:
+   spin_unlock_bh(&p->tcf_lock);
nlmsg_trim(skb, b);
kfree(opt);
return -1;
-- 
2.7.5



[PATCH net-next 14/14] net: sched: act_police: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Use tcf spinlock to protect police action private data from concurrent
modification during dump. (init already uses tcf spinlock when changing
police action state)

Pass tcf spinlock as estimator lock argument to gen_replace_estimator()
during action init.

Signed-off-by: Vlad Buslov 
---
 net/sched/act_police.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index 3eb5fe60c62c..4777da7caadf 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -138,7 +138,8 @@ static int tcf_act_police_init(struct net *net, struct 
nlattr *nla,
 
if (est) {
err = gen_replace_estimator(&police->tcf_bstats, NULL,
-   &police->tcf_rate_est, NULL,
+   &police->tcf_rate_est,
+   &police->tcf_lock,
&police->tcf_lock,
NULL, est);
if (err)
@@ -274,14 +275,15 @@ static int tcf_act_police_dump(struct sk_buff *skb, 
struct tc_action *a,
struct tcf_police *police = to_police(a);
struct tc_police opt = {
.index = police->tcf_index,
-   .action = police->tcf_action,
-   .mtu = police->tcfp_mtu,
-   .burst = PSCHED_NS2TICKS(police->tcfp_burst),
.refcnt = refcount_read(&police->tcf_refcnt) - ref,
.bindcnt = atomic_read(&police->tcf_bindcnt) - bind,
};
struct tcf_t t;
 
+   spin_lock_bh(&police->tcf_lock);
+   opt.action = police->tcf_action;
+   opt.mtu = police->tcfp_mtu;
+   opt.burst = PSCHED_NS2TICKS(police->tcfp_burst);
if (police->rate_present)
psched_ratecfg_getrate(&opt.rate, &police->rate);
if (police->peak_present)
@@ -301,10 +303,12 @@ static int tcf_act_police_dump(struct sk_buff *skb, 
struct tc_action *a,
t.expires = jiffies_to_clock_t(police->tcf_tm.expires);
if (nla_put_64bit(skb, TCA_POLICE_TM, sizeof(t), &t, TCA_POLICE_PAD))
goto nla_put_failure;
+   spin_unlock_bh(&police->tcf_lock);
 
return skb->len;
 
 nla_put_failure:
+   spin_unlock_bh(&police->tcf_lock);
nlmsg_trim(skb, b);
return -1;
 }
-- 
2.7.5



[PATCH net-next 12/14] net: sched: act_mirred: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Re-introduce mirred list spinlock, that was removed some time ago, in order
to protect it from concurrent modifications, instead of relying on rtnl
lock.

Use tcf spinlock to protect mirred action private data from concurrent
modification in init and dump. Rearrange access to mirred data in order to
be performed only while holding the lock.

Rearrange net dev access to always hold reference while working with it,
instead of relying on rtnl lock. Change the get dev function to increment
the net device reference before returning it to the caller, instead of
assuming that the caller is protected with rtnl lock.

Provide rcu version of mirred dev and tunnel info access functions. (to be
used by unlocked drivers)

Signed-off-by: Vlad Buslov 
---
 include/net/tc_act/tc_mirred.h |  5 +++
 include/net/tc_act/tc_tunnel_key.h | 33 ---
 net/sched/act_mirred.c | 82 +-
 net/sched/cls_api.c|  1 +
 4 files changed, 88 insertions(+), 33 deletions(-)

diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h
index a2e9cbca5c9e..cb30be55e444 100644
--- a/include/net/tc_act/tc_mirred.h
+++ b/include/net/tc_act/tc_mirred.h
@@ -37,4 +37,9 @@ static inline struct net_device *tcf_mirred_dev(const struct 
tc_action *a)
return rtnl_dereference(to_mirred(a)->tcfm_dev);
 }
 
+static inline struct net_device *tcf_mirred_dev_rcu(const struct tc_action *a)
+{
+   return rcu_dereference(to_mirred(a)->tcfm_dev);
+}
+
 #endif /* __NET_TC_MIR_H */
diff --git a/include/net/tc_act/tc_tunnel_key.h 
b/include/net/tc_act/tc_tunnel_key.h
index 46b8c7f1c8d5..e6e475d788c6 100644
--- a/include/net/tc_act/tc_tunnel_key.h
+++ b/include/net/tc_act/tc_tunnel_key.h
@@ -30,26 +30,47 @@ struct tcf_tunnel_key {
 
 static inline bool is_tcf_tunnel_set(const struct tc_action *a)
 {
+   bool ret = false;
 #ifdef CONFIG_NET_CLS_ACT
struct tcf_tunnel_key *t = to_tunnel_key(a);
-   struct tcf_tunnel_key_params *params = rtnl_dereference(t->params);
+   struct tcf_tunnel_key_params *params;
 
+   rcu_read_lock();
+   params = rcu_dereference(t->params);
if (a->ops && a->ops->type == TCA_ACT_TUNNEL_KEY)
-   return params->tcft_action == TCA_TUNNEL_KEY_ACT_SET;
+   ret = params->tcft_action == TCA_TUNNEL_KEY_ACT_SET;
+   rcu_read_unlock();
 #endif
-   return false;
+   return ret;
 }
 
 static inline bool is_tcf_tunnel_release(const struct tc_action *a)
 {
+   bool ret = false;
 #ifdef CONFIG_NET_CLS_ACT
struct tcf_tunnel_key *t = to_tunnel_key(a);
-   struct tcf_tunnel_key_params *params = rtnl_dereference(t->params);
+   struct tcf_tunnel_key_params *params;
 
+   rcu_read_lock();
+   params = rcu_dereference(t->params);
if (a->ops && a->ops->type == TCA_ACT_TUNNEL_KEY)
-   return params->tcft_action == TCA_TUNNEL_KEY_ACT_RELEASE;
+   ret = params->tcft_action == TCA_TUNNEL_KEY_ACT_RELEASE;
+   rcu_read_unlock();
+#endif
+   return ret;
+}
+
+static inline
+struct ip_tunnel_info *tcf_tunnel_info_rcu(const struct tc_action *a)
+{
+#ifdef CONFIG_NET_CLS_ACT
+   struct tcf_tunnel_key *t = to_tunnel_key(a);
+   struct tcf_tunnel_key_params *params = rcu_dereference(t->params);
+
+   return &params->tcft_enc_metadata->u.tun_info;
+#else
+   return NULL;
 #endif
-   return false;
 }
 
 static inline struct ip_tunnel_info *tcf_tunnel_info(const struct tc_action *a)
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index b26d060da08e..9f622114f5a5 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -30,6 +30,7 @@
 #include 
 
 static LIST_HEAD(mirred_list);
+static DEFINE_SPINLOCK(mirred_list_lock);
 
 static bool tcf_mirred_is_act_redirect(int action)
 {
@@ -62,13 +63,23 @@ static bool tcf_mirred_can_reinsert(int action)
return false;
 }
 
+static struct net_device *tcf_mirred_dev_dereference(struct tcf_mirred *m)
+{
+   return rcu_dereference_protected(m->tcfm_dev,
+lockdep_is_held(&m->tcf_lock));
+}
+
 static void tcf_mirred_release(struct tc_action *a)
 {
struct tcf_mirred *m = to_mirred(a);
struct net_device *dev;
 
+   spin_lock(&mirred_list_lock);
list_del(&m->tcfm_list);
-   dev = rtnl_dereference(m->tcfm_dev);
+   spin_unlock(&mirred_list_lock);
+
+   /* last reference to action, no need to lock */
+   dev = rcu_dereference_protected(m->tcfm_dev, 1);
if (dev)
dev_put(dev);
 }
@@ -128,22 +139,9 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
NL_SET_ERR_MSG_MOD(extack, "Unknown mirred option");
return -EINVAL;
}
-   if (parm->ifindex) {
-   dev = __dev_get_by_index(net, parm->ifindex);
-   if (dev == NULL) {
-   if (exists)
-   tcf_idr_release(*a, bind);
-

[PATCH net-next 10/14] net: sched: act_tunnel_key: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Use tcf lock to protect tunnel key action struct private data from
concurrent modification in init and dump. Use rcu swap operation to
reassign params pointer under protection of tcf lock. (old params value is
not used by init, so there is no need of standalone rcu dereference step)

Remove rtnl lock assertion that is no longer required.
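
For reference, rcu_swap_protected() is roughly (simplified from
include/linux/rcupdate.h):

#define rcu_swap_protected(rcu_ptr, ptr, c) do {			\
	typeof(ptr) __tmp = rcu_dereference_protected((rcu_ptr), (c));	\
	rcu_assign_pointer((rcu_ptr), (ptr));				\
	(ptr) = __tmp;							\
} while (0)

which is why the old parameters come back in 'params_new' and can be
released with kfree_rcu() after the swap.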

Signed-off-by: Vlad Buslov 
---
 net/sched/act_tunnel_key.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index d42d9e112789..ba2ae9f75ef5 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -204,7 +204,6 @@ static int tunnel_key_init(struct net *net, struct nlattr 
*nla,
 {
struct tc_action_net *tn = net_generic(net, tunnel_key_net_id);
struct nlattr *tb[TCA_TUNNEL_KEY_MAX + 1];
-   struct tcf_tunnel_key_params *params_old;
struct tcf_tunnel_key_params *params_new;
struct metadata_dst *metadata = NULL;
struct tc_tunnel_key *parm;
@@ -346,24 +345,22 @@ static int tunnel_key_init(struct net *net, struct nlattr 
*nla,
 
t = to_tunnel_key(*a);
 
-   ASSERT_RTNL();
params_new = kzalloc(sizeof(*params_new), GFP_KERNEL);
if (unlikely(!params_new)) {
tcf_idr_release(*a, bind);
NL_SET_ERR_MSG(extack, "Cannot allocate tunnel key parameters");
return -ENOMEM;
}
-
-   params_old = rtnl_dereference(t->params);
-
-   t->tcf_action = parm->action;
params_new->tcft_action = parm->t_action;
params_new->tcft_enc_metadata = metadata;
 
-   rcu_assign_pointer(t->params, params_new);
-
-   if (params_old)
-   kfree_rcu(params_old, rcu);
+   spin_lock(&t->tcf_lock);
+   t->tcf_action = parm->action;
+   rcu_swap_protected(t->params, params_new,
+  lockdep_is_held(&t->tcf_lock));
+   spin_unlock(&t->tcf_lock);
+   if (params_new)
+   kfree_rcu(params_new, rcu);
 
if (ret == ACT_P_CREATED)
tcf_idr_insert(tn, *a);
@@ -485,12 +482,13 @@ static int tunnel_key_dump(struct sk_buff *skb, struct 
tc_action *a,
.index= t->tcf_index,
.refcnt   = refcount_read(&t->tcf_refcnt) - ref,
.bindcnt  = atomic_read(&t->tcf_bindcnt) - bind,
-   .action   = t->tcf_action,
};
struct tcf_t tm;
 
-   params = rtnl_dereference(t->params);
-
+   spin_lock(&t->tcf_lock);
+   params = rcu_dereference_protected(t->params,
+  lockdep_is_held(&t->tcf_lock));
+   opt.action   = t->tcf_action;
opt.t_action = params->tcft_action;
 
if (nla_put(skb, TCA_TUNNEL_KEY_PARMS, sizeof(opt), &opt))
@@ -522,10 +520,12 @@ static int tunnel_key_dump(struct sk_buff *skb, struct 
tc_action *a,
if (nla_put_64bit(skb, TCA_TUNNEL_KEY_TM, sizeof(tm),
  &tm, TCA_TUNNEL_KEY_PAD))
goto nla_put_failure;
+   spin_unlock(&t->tcf_lock);
 
return skb->len;
 
 nla_put_failure:
+   spin_unlock(&t->tcf_lock);
nlmsg_trim(skb, b);
return -1;
 }
-- 
2.7.5



[PATCH net-next 09/14] net: sched: act_skbmod: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf
spinlock to protect private skbmod data from concurrent modification during
dump.

Signed-off-by: Vlad Buslov 
---
 net/sched/act_skbmod.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
index c437c6d51a71..e9c86ade3b40 100644
--- a/net/sched/act_skbmod.c
+++ b/net/sched/act_skbmod.c
@@ -156,7 +156,6 @@ static int tcf_skbmod_init(struct net *net, struct nlattr 
*nla,
 
d = to_skbmod(*a);
 
-   ASSERT_RTNL();
p = kzalloc(sizeof(struct tcf_skbmod_params), GFP_KERNEL);
if (unlikely(!p)) {
tcf_idr_release(*a, bind);
@@ -166,10 +165,10 @@ static int tcf_skbmod_init(struct net *net, struct nlattr 
*nla,
p->flags = lflags;
d->tcf_action = parm->action;
 
-   p_old = rtnl_dereference(d->skbmod_p);
-
if (ovr)
spin_lock_bh(&d->tcf_lock);
+   /* Protected by tcf_lock if overwriting existing action. */
+   p_old = rcu_dereference_protected(d->skbmod_p, 1);
 
if (lflags & SKBMOD_F_DMAC)
ether_addr_copy(p->eth_dst, daddr);
@@ -205,15 +204,18 @@ static int tcf_skbmod_dump(struct sk_buff *skb, struct 
tc_action *a,
 {
struct tcf_skbmod *d = to_skbmod(a);
unsigned char *b = skb_tail_pointer(skb);
-   struct tcf_skbmod_params  *p = rtnl_dereference(d->skbmod_p);
+   struct tcf_skbmod_params  *p;
struct tc_skbmod opt = {
.index   = d->tcf_index,
.refcnt  = refcount_read(&d->tcf_refcnt) - ref,
.bindcnt = atomic_read(&d->tcf_bindcnt) - bind,
-   .action  = d->tcf_action,
};
struct tcf_t t;
 
+   spin_lock_bh(&d->tcf_lock);
+   opt.action = d->tcf_action;
+   p = rcu_dereference_protected(d->skbmod_p,
+ lockdep_is_held(&d->tcf_lock));
opt.flags  = p->flags;
if (nla_put(skb, TCA_SKBMOD_PARMS, sizeof(opt), &opt))
goto nla_put_failure;
@@ -231,8 +233,10 @@ static int tcf_skbmod_dump(struct sk_buff *skb, struct 
tc_action *a,
if (nla_put_64bit(skb, TCA_SKBMOD_TM, sizeof(t), &t, TCA_SKBMOD_PAD))
goto nla_put_failure;
 
+   spin_unlock_bh(&d->tcf_lock);
return skb->len;
 nla_put_failure:
+   spin_unlock_bh(&d->tcf_lock);
nlmsg_trim(skb, b);
return -1;
 }
-- 
2.7.5



[PATCH net-next 00/14] Remove rtnl lock dependency from all action implementations

2018-08-06 Thread Vlad Buslov
Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a second step to remove
rtnl lock dependency from TC rules update path.

Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
Handlers registered with this flag are called without RTNL taken. End
goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER,
etc.) to be registered with UNLOCKED flag to allow parallel execution.
However, there is no intention to completely remove or split rtnl lock
itself. This patch set addresses specific problems in implementation of
tc actions that prevent their control path from being executed
concurrently. Additional changes are required to refactor classifiers
API and individual classifiers for parallel execution. This patch set
lays groundwork to eventually register rule update handlers as
rtnl-unlocked.

Action API is already prepared for parallel execution with previous
patch set, which means that action ops that use action API for their
implementation do not require additional modifications. (delete, search,
etc.) Action API implements concurrency-safe reference counting and
guarantees that cleanup/delete is called only once, after last reference
to action is released.

The goal of this change is to update specific actions APIs that access
action private state directly, in order to be independent from external
locking. General approach is to re-use existing tcf_lock spinlock (used
by some action implementation to synchronize control path with data
path) to protect action private state from concurrent modification. If
action has rcu-protected pointer, tcf spinlock is used to protect its
update code, instead of relying on rtnl lock.

Some actions need to determine rtnl mutex status in order to release it.
For example, ife action can load additional kernel modules(meta ops) and
must make sure that no locks are held during module load. In such cases
'rtnl_held' argument is used to conditionally release rtnl mutex.
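
The dump-side pattern the series converges on looks roughly like this
(generic sketch, not any one action's exact code; TCA_FOO_PARMS is a
placeholder):

	spin_lock_bh(&act->tcf_lock);
	opt.action = act->tcf_action;	/* snapshot state under the lock */
	params = rcu_dereference_protected(act->params,
					   lockdep_is_held(&act->tcf_lock));
	if (nla_put(skb, TCA_FOO_PARMS, sizeof(opt), &opt))
		goto nla_put_failure;
	spin_unlock_bh(&act->tcf_lock);
	return skb->len;

nla_put_failure:
	spin_unlock_bh(&act->tcf_lock);
	nlmsg_trim(skb, b);
	return -1;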

Vlad Buslov (14):
  net: sched: act_bpf: remove dependency on rtnl lock
  net: sched: act_csum: remove dependency on rtnl lock
  net: sched: act_gact: remove dependency on rtnl lock
  net: sched: act_ife: remove dependency on rtnl lock
  net: sched: act_ipt: remove dependency on rtnl lock
  net: sched: act_pedit: remove dependency on rtnl lock
  net: sched: act_sample: remove dependency on rtnl lock
  net: sched: act_simple: remove dependency on rtnl lock
  net: sched: act_skbmod: remove dependency on rtnl lock
  net: sched: act_tunnel_key: remove dependency on rtnl lock
  net: sched: act_vlan: remove dependency on rtnl lock
  net: sched: act_mirred: remove dependency on rtnl lock
  net: core: add new/replace rate estimator lock parameter
  net: sched: act_police: remove dependency on rtnl lock

 include/net/gen_stats.h|  2 +
 include/net/tc_act/tc_mirred.h |  5 +++
 include/net/tc_act/tc_tunnel_key.h | 33 ---
 net/core/gen_estimator.c   | 58 ---
 net/netfilter/xt_RATEEST.c |  2 +-
 net/sched/act_api.c|  2 +-
 net/sched/act_bpf.c| 10 +++--
 net/sched/act_csum.c   | 24 ++-
 net/sched/act_gact.c   | 10 -
 net/sched/act_ife.c| 40 ---
 net/sched/act_ipt.c|  3 ++
 net/sched/act_mirred.c | 82 +-
 net/sched/act_pedit.c  | 40 +--
 net/sched/act_police.c | 10 +++--
 net/sched/act_sample.c | 12 +-
 net/sched/act_simple.c |  6 ++-
 net/sched/act_skbmod.c | 14 ---
 net/sched/act_tunnel_key.c | 26 ++--
 net/sched/act_vlan.c   | 27 +++--
 net/sched/cls_api.c|  1 +
 net/sched/sch_api.c|  2 +
 net/sched/sch_cbq.c|  4 +-
 net/sched/sch_drr.c|  4 +-
 net/sched/sch_hfsc.c   |  4 +-
 net/sched/sch_htb.c|  4 +-
 net/sched/sch_qfq.c|  4 +-
 26 files changed, 285 insertions(+), 144 deletions(-)

-- 
2.7.5



[PATCH net-next 11/14] net: sched: act_vlan: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Use tcf spinlock to protect vlan action private data from concurrent
modification during dump and init. Use rcu swap operation to reassign
params pointer under protection of tcf lock. (old params value is not used
by init, so there is no need of standalone rcu dereference step)

Remove rtnl assertion that is no longer necessary.

Signed-off-by: Vlad Buslov 
---
 net/sched/act_vlan.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 15a0ee214c9c..5bde17fe3608 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -109,7 +109,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr 
*nla,
 {
struct tc_action_net *tn = net_generic(net, vlan_net_id);
struct nlattr *tb[TCA_VLAN_MAX + 1];
-   struct tcf_vlan_params *p, *p_old;
+   struct tcf_vlan_params *p;
struct tc_vlan *parm;
struct tcf_vlan *v;
int action;
@@ -202,26 +202,24 @@ static int tcf_vlan_init(struct net *net, struct nlattr 
*nla,
 
v = to_vlan(*a);
 
-   ASSERT_RTNL();
p = kzalloc(sizeof(*p), GFP_KERNEL);
if (!p) {
tcf_idr_release(*a, bind);
return -ENOMEM;
}
 
-   v->tcf_action = parm->action;
-
-   p_old = rtnl_dereference(v->vlan_p);
-
p->tcfv_action = action;
p->tcfv_push_vid = push_vid;
p->tcfv_push_prio = push_prio;
p->tcfv_push_proto = push_proto;
 
-   rcu_assign_pointer(v->vlan_p, p);
+   spin_lock(&v->tcf_lock);
+   v->tcf_action = parm->action;
+   rcu_swap_protected(v->vlan_p, p, lockdep_is_held(&v->tcf_lock));
+   spin_unlock(&v->tcf_lock);
 
-   if (p_old)
-   kfree_rcu(p_old, rcu);
+   if (p)
+   kfree_rcu(p, rcu);
 
if (ret == ACT_P_CREATED)
tcf_idr_insert(tn, *a);
@@ -243,16 +241,18 @@ static int tcf_vlan_dump(struct sk_buff *skb, struct 
tc_action *a,
 {
unsigned char *b = skb_tail_pointer(skb);
struct tcf_vlan *v = to_vlan(a);
-   struct tcf_vlan_params *p = rtnl_dereference(v->vlan_p);
+   struct tcf_vlan_params *p;
struct tc_vlan opt = {
.index= v->tcf_index,
.refcnt   = refcount_read(&v->tcf_refcnt) - ref,
.bindcnt  = atomic_read(&v->tcf_bindcnt) - bind,
-   .action   = v->tcf_action,
-   .v_action = p->tcfv_action,
};
struct tcf_t t;
 
+   spin_lock(&v->tcf_lock);
+   opt.action = v->tcf_action;
+   p = rcu_dereference_protected(v->vlan_p, lockdep_is_held(&v->tcf_lock));
+   opt.v_action = p->tcfv_action;
if (nla_put(skb, TCA_VLAN_PARMS, sizeof(opt), &opt))
goto nla_put_failure;
 
@@ -268,9 +268,12 @@ static int tcf_vlan_dump(struct sk_buff *skb, struct tc_action *a,
tcf_tm_dump(&t, &v->tcf_tm);
if (nla_put_64bit(skb, TCA_VLAN_TM, sizeof(t), &t, TCA_VLAN_PAD))
goto nla_put_failure;
+   spin_unlock(&v->tcf_lock);
+
return skb->len;
 
 nla_put_failure:
+   spin_unlock(&v->tcf_lock);
nlmsg_trim(skb, b);
return -1;
 }
-- 
2.7.5



[PATCH net-next 03/14] net: sched: act_gact: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Use tcf spinlock to protect gact action private state from concurrent
modification during dump and init. Remove rtnl assertion that is no longer
necessary.
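
One consequence of taking the lock in dump is that every exit path,
including the netlink failure path, must now drop it. In sketch form
(illustrative, mirroring the hunks below):

	spin_lock(&gact->tcf_lock);
	opt.action = gact->tcf_action;	/* snapshot fields under the lock */
	if (nla_put(skb, TCA_GACT_PARMS, sizeof(opt), &opt))
		goto nla_put_failure;
	/* ... */
	spin_unlock(&gact->tcf_lock);
	return skb->len;

nla_put_failure:
	spin_unlock(&gact->tcf_lock);	/* unlock on the error path too */
	nlmsg_trim(skb, b);
	return -1;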

Signed-off-by: Vlad Buslov 
---
 net/sched/act_gact.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index 661b72b9147d..bfccd34a3968 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -113,7 +113,7 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla,
 
gact = to_gact(*a);
 
-   ASSERT_RTNL();
+   spin_lock(&gact->tcf_lock);
gact->tcf_action = parm->action;
 #ifdef CONFIG_GACT_PROB
if (p_parm) {
@@ -126,6 +126,8 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla,
gact->tcfg_ptype   = p_parm->ptype;
}
 #endif
+   spin_unlock(&gact->tcf_lock);
+
if (ret == ACT_P_CREATED)
tcf_idr_insert(tn, *a);
return ret;
@@ -178,10 +180,11 @@ static int tcf_gact_dump(struct sk_buff *skb, struct tc_action *a,
.index   = gact->tcf_index,
.refcnt  = refcount_read(&gact->tcf_refcnt) - ref,
.bindcnt = atomic_read(&gact->tcf_bindcnt) - bind,
-   .action  = gact->tcf_action,
};
struct tcf_t t;
 
+   spin_lock(&gact->tcf_lock);
+   opt.action = gact->tcf_action;
if (nla_put(skb, TCA_GACT_PARMS, sizeof(opt), &opt))
goto nla_put_failure;
 #ifdef CONFIG_GACT_PROB
@@ -199,9 +202,12 @@ static int tcf_gact_dump(struct sk_buff *skb, struct tc_action *a,
tcf_tm_dump(&t, &gact->tcf_tm);
if (nla_put_64bit(skb, TCA_GACT_TM, sizeof(t), &t, TCA_GACT_PAD))
goto nla_put_failure;
+   spin_unlock(&gact->tcf_lock);
+
return skb->len;
 
 nla_put_failure:
+   spin_unlock(&gact->tcf_lock);
nlmsg_trim(skb, b);
return -1;
 }
-- 
2.7.5



[PATCH net-next 08/14] net: sched: act_simple: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Use tcf spinlock to protect private simple action data from concurrent
modification during dump. (simple init already uses tcf spinlock when
changing action state)
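
Note the _bh lock variants: act_simple also takes tcf_lock from its
packet-processing path, which runs in softirq context, so the dump side
presumably has to disable bottom halves to avoid deadlocking against it
on the same CPU.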

Signed-off-by: Vlad Buslov 
---
 net/sched/act_simple.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
index aa51152e0066..18e4452574cd 100644
--- a/net/sched/act_simple.c
+++ b/net/sched/act_simple.c
@@ -156,10 +156,11 @@ static int tcf_simp_dump(struct sk_buff *skb, struct tc_action *a,
.index   = d->tcf_index,
.refcnt  = refcount_read(&d->tcf_refcnt) - ref,
.bindcnt = atomic_read(&d->tcf_bindcnt) - bind,
-   .action  = d->tcf_action,
};
struct tcf_t t;
 
+   spin_lock_bh(&d->tcf_lock);
+   opt.action = d->tcf_action;
if (nla_put(skb, TCA_DEF_PARMS, sizeof(opt), &opt) ||
nla_put_string(skb, TCA_DEF_DATA, d->tcfd_defdata))
goto nla_put_failure;
@@ -167,9 +168,12 @@ static int tcf_simp_dump(struct sk_buff *skb, struct tc_action *a,
tcf_tm_dump(&t, &d->tcf_tm);
if (nla_put_64bit(skb, TCA_DEF_TM, sizeof(t), &t, TCA_DEF_PAD))
goto nla_put_failure;
+   spin_unlock_bh(&d->tcf_lock);
+
return skb->len;
 
 nla_put_failure:
+   spin_unlock_bh(&d->tcf_lock);
nlmsg_trim(skb, b);
return -1;
 }
-- 
2.7.5



[PATCH net-next 04/14] net: sched: act_ife: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Use the tcf spinlock and rcu to protect the params pointer from concurrent
modification during dump and init. Use an rcu swap operation to reassign
the params pointer under protection of the tcf lock. (The old params value
is not used by init, so there is no need for a standalone rcu dereference
step.)

The ife action has meta-actions that are compiled as standalone modules.
The rtnl mutex must be released while loading a kernel module, since
request_module() can sleep for a long time and module init code may itself
need to take rtnl. In order to support execution without the rtnl mutex,
propagate an 'rtnl_held' argument to the meta action loading functions and
conditionally release the rtnl lock around the module request depending on
that argument.
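
The module-load dance then looks like this (condensed from the hunk
below):

	if (exists)
		spin_unlock_bh(&ife->tcf_lock);	/* request_module() can sleep */
	if (rtnl_held)
		rtnl_unlock();
	request_module("ife-meta-%s", ife_meta_id2name(metaid));
	if (rtnl_held)
		rtnl_lock();
	if (exists)
		spin_lock_bh(&ife->tcf_lock);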

Signed-off-by: Vlad Buslov 
---
 net/sched/act_ife.c | 40 +---
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index df4060e32d43..5d200495e467 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -268,7 +268,8 @@ static const char *ife_meta_id2name(u32 metaid)
  * under ife->tcf_lock for existing action
 */
 static int load_metaops_and_vet(struct tcf_ife_info *ife, u32 metaid,
-   void *val, int len, bool exists)
+   void *val, int len, bool exists,
+   bool rtnl_held)
 {
struct tcf_meta_ops *ops = find_ife_oplist(metaid);
int ret = 0;
@@ -278,9 +279,11 @@ static int load_metaops_and_vet(struct tcf_ife_info *ife, u32 metaid,
 #ifdef CONFIG_MODULES
if (exists)
spin_unlock_bh(&ife->tcf_lock);
-   rtnl_unlock();
+   if (rtnl_held)
+   rtnl_unlock();
request_module("ife-meta-%s", ife_meta_id2name(metaid));
-   rtnl_lock();
+   if (rtnl_held)
+   rtnl_lock();
if (exists)
spin_lock_bh(&ife->tcf_lock);
ops = find_ife_oplist(metaid);
@@ -421,7 +424,7 @@ static void tcf_ife_cleanup(struct tc_action *a)
 
 /* under ife->tcf_lock for existing action */
 static int populate_metalist(struct tcf_ife_info *ife, struct nlattr **tb,
-bool exists)
+bool exists, bool rtnl_held)
 {
int len = 0;
int rc = 0;
@@ -433,7 +436,8 @@ static int populate_metalist(struct tcf_ife_info *ife, struct nlattr **tb,
val = nla_data(tb[i]);
len = nla_len(tb[i]);
 
-   rc = load_metaops_and_vet(ife, i, val, len, exists);
+   rc = load_metaops_and_vet(ife, i, val, len, exists,
+ rtnl_held);
if (rc != 0)
return rc;
 
@@ -454,7 +458,7 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
struct tc_action_net *tn = net_generic(net, ife_net_id);
struct nlattr *tb[TCA_IFE_MAX + 1];
struct nlattr *tb2[IFE_META_MAX + 1];
-   struct tcf_ife_params *p, *p_old;
+   struct tcf_ife_params *p;
struct tcf_ife_info *ife;
u16 ife_type = ETH_P_IFE;
struct tc_ife *parm;
@@ -558,7 +562,7 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
return err;
}
 
-   err = populate_metalist(ife, tb2, exists);
+   err = populate_metalist(ife, tb2, exists, rtnl_held);
if (err)
goto metadata_parse_err;
 
@@ -581,13 +585,13 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
}
 
ife->tcf_action = parm->action;
+   /* protected by tcf_lock when modifying existing action */
+   rcu_swap_protected(ife->params, p, 1);
+
if (exists)
spin_unlock_bh(&ife->tcf_lock);
-
-   p_old = rtnl_dereference(ife->params);
-   rcu_assign_pointer(ife->params, p);
-   if (p_old)
-   kfree_rcu(p_old, rcu);
+   if (p)
+   kfree_rcu(p, rcu);
 
if (ret == ACT_P_CREATED)
tcf_idr_insert(tn, *a);
@@ -600,16 +604,20 @@ static int tcf_ife_dump(struct sk_buff *skb, struct tc_action *a, int bind,
 {
unsigned char *b = skb_tail_pointer(skb);
struct tcf_ife_info *ife = to_ife(a);
-   struct tcf_ife_params *p = rtnl_dereference(ife->params);
+   struct tcf_ife_params *p;
struct tc_ife opt = {
.index = ife->tcf_index,
.refcnt = refcount_read(&ife->tcf_refcnt) - ref,
.bindcnt = atomic_read(&ife->tcf_bindcnt) - bind,
-   .action = ife->tcf_action,
-   .flags = p->flags,
};
struct tcf_t t;
 
+   spin_lock_bh(&ife->tcf_lock);
+   opt.action = ife->tcf_action;
+   p = rcu_dereference_protected(ife->params,
+ lockdep_is_held(&ife->tcf_lock));
+   opt.flags = p->flags;
+
if (nla_put(skb, TCA_IFE_PARMS, 

[PATCH net-next 01/14] net: sched: act_bpf: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Use tcf spinlock to protect bpf action private data from concurrent
modification during dump and init. Remove rtnl lock assertion that is no
longer necessary.

Signed-off-by: Vlad Buslov 
---
 net/sched/act_bpf.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 6203eb075c9a..9e8a33f9fee3 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -143,11 +143,12 @@ static int tcf_bpf_dump(struct sk_buff *skb, struct tc_action *act,
.index   = prog->tcf_index,
.refcnt  = refcount_read(&prog->tcf_refcnt) - ref,
.bindcnt = atomic_read(&prog->tcf_bindcnt) - bind,
-   .action  = prog->tcf_action,
};
struct tcf_t tm;
int ret;
 
+   spin_lock(&prog->tcf_lock);
+   opt.action = prog->tcf_action;
if (nla_put(skb, TCA_ACT_BPF_PARMS, sizeof(opt), &opt))
goto nla_put_failure;
 
@@ -163,9 +164,11 @@ static int tcf_bpf_dump(struct sk_buff *skb, struct tc_action *act,
  TCA_ACT_BPF_PAD))
goto nla_put_failure;
 
+   spin_unlock(&prog->tcf_lock);
return skb->len;
 
 nla_put_failure:
+   spin_unlock(&prog->tcf_lock);
nlmsg_trim(skb, tp);
return -1;
 }
@@ -264,7 +267,7 @@ static void tcf_bpf_prog_fill_cfg(const struct tcf_bpf *prog,
 {
cfg->is_ebpf = tcf_bpf_is_ebpf(prog);
/* updates to prog->filter are prevented, since it's called either
-* with rtnl lock or during final cleanup in rcu callback
+* with tcf lock or during final cleanup in rcu callback
 */
cfg->filter = rcu_dereference_protected(prog->filter, 1);
 
@@ -336,8 +339,8 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
goto out;
 
prog = to_bpf(*act);
-   ASSERT_RTNL();
 
+   spin_lock(&prog->tcf_lock);
if (res != ACT_P_CREATED)
tcf_bpf_prog_fill_cfg(prog, &cfg);
 
@@ -349,6 +352,7 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 
prog->tcf_action = parm->action;
rcu_assign_pointer(prog->filter, cfg.filter);
+   spin_unlock(&prog->tcf_lock);
 
if (res == ACT_P_CREATED) {
tcf_idr_insert(tn, *act);
-- 
2.7.5



[PATCH net-next 13/14] net: core: add new/replace rate estimator lock parameter

2018-08-06 Thread Vlad Buslov
Extend the rate estimator 'new' and 'replace' APIs with an additional
spinlock parameter, to be used by rtnl-unlocked actions to protect the
rate_est pointer from concurrent modification.

Extract the code that requires synchronization from gen_new_estimator
into a new __replace_estimator function, so that it can be used by both
the locked and unlocked paths.
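
A hypothetical call site for an rtnl-unlocked action would then look
roughly like this ('a' and 'est_attr' are illustrative names, not taken
from the series):

	/* Non-NULL rate_est_lock: the estimator core serializes the
	 * rate_est pointer swap itself. Passing NULL keeps the old
	 * behavior, where the caller (e.g. rtnl) provides exclusion.
	 */
	err = gen_replace_estimator(&a->tcf_bstats, NULL,
				    &a->tcf_rate_est,
				    &a->tcf_lock,	/* new: rate_est_lock */
				    &a->tcf_lock,	/* stats_lock */
				    NULL,		/* running seqcount */
				    est_attr);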

Signed-off-by: Vlad Buslov 
---
 include/net/gen_stats.h    |  2 ++
 net/core/gen_estimator.c   | 58 +++---
 net/netfilter/xt_RATEEST.c |  2 +-
 net/sched/act_api.c        |  2 +-
 net/sched/act_police.c     |  2 +-
 net/sched/sch_api.c        |  2 ++
 net/sched/sch_cbq.c        |  4 ++--
 net/sched/sch_drr.c        |  4 ++--
 net/sched/sch_hfsc.c       |  4 ++--
 net/sched/sch_htb.c        |  4 ++--
 net/sched/sch_qfq.c        |  4 ++--
 11 files changed, 61 insertions(+), 27 deletions(-)

diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index 0304ba2ae353..d1ef63d16abc 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -59,12 +59,14 @@ int gnet_stats_finish_copy(struct gnet_dump *d);
 int gen_new_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
  struct net_rate_estimator __rcu **rate_est,
+ spinlock_t *rate_est_lock,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
 void gen_kill_estimator(struct net_rate_estimator __rcu **ptr);
 int gen_replace_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
  struct net_rate_estimator __rcu **ptr,
+ spinlock_t *rate_est_lock,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
 bool gen_estimator_active(struct net_rate_estimator __rcu **ptr);
diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 98fd12721221..012e37011147 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -107,11 +107,43 @@ static void est_timer(struct timer_list *t)
mod_timer(&est->timer, est->next_jiffies);
 }
 
+static void __replace_estimator(struct net_rate_estimator __rcu **rate_est,
+   struct net_rate_estimator *new)
+{
+   struct net_rate_estimator *old = rcu_dereference_protected(*rate_est,
+  1);
+
+   if (old) {
+   del_timer_sync(&old->timer);
+   new->avbps = old->avbps;
+   new->avpps = old->avpps;
+   }
+
+   new->next_jiffies = jiffies + ((HZ / 4) << new->intvl_log);
+   timer_setup(&new->timer, est_timer, 0);
+   mod_timer(&new->timer, new->next_jiffies);
+
+   rcu_assign_pointer(*rate_est, new);
+
+   if (old)
+   kfree_rcu(old, rcu);
+}
+
+static void replace_estimator(struct net_rate_estimator __rcu **rate_est,
+ struct net_rate_estimator *new,
+ spinlock_t *rate_est_lock)
+{
+   spin_lock_bh(rate_est_lock);
+   __replace_estimator(rate_est, new);
+   spin_unlock_bh(rate_est_lock);
+}
+
 /**
  * gen_new_estimator - create a new rate estimator
  * @bstats: basic statistics
  * @cpu_bstats: bstats per cpu
  * @rate_est: rate estimator statistics
+ * @rate_est_lock: rate_est lock (might be NULL)
  * @stats_lock: statistics lock
  * @running: qdisc running seqcount
  * @opt: rate estimator configuration TLV
@@ -128,12 +160,13 @@ static void est_timer(struct timer_list *t)
 int gen_new_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
  struct net_rate_estimator __rcu **rate_est,
+ spinlock_t *rate_est_lock,
  spinlock_t *stats_lock,
  seqcount_t *running,
  struct nlattr *opt)
 {
struct gnet_estimator *parm = nla_data(opt);
-   struct net_rate_estimator *old, *est;
+   struct net_rate_estimator *est;
struct gnet_stats_basic_packed b;
int intvl_log;
 
@@ -167,20 +200,15 @@ int gen_new_estimator(struct gnet_stats_basic_packed *bstats,
local_bh_enable();
est->last_bytes = b.bytes;
est->last_packets = b.packets;
-   old = rcu_dereference_protected(*rate_est, 1);
-   if (old) {
-   del_timer_sync(&old->timer);
-   est->avbps = old->avbps;
-   est->avpps = old->avpps;
-   }
 
-   est->next_jiffies = jiffies + ((HZ/4) << intvl_log);
-   timer_setup(&est->timer, est_timer, 0);
-   mod_timer(&est->timer, est->next_jiffies);
+   if (rate_est_lock)
+   replace_estimator(rate_est, est, rate_est_lock);
+   else
+   /* If no spinlock argument provided,
+* 

[PATCH net-next 07/14] net: sched: act_sample: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Use tcf spinlock to protect private sample action data from concurrent
modification during dump and init.
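
The cleanup path can get away without the lock: tcf_sample_cleanup()
only runs once the last reference to the action is gone, so nothing can
race with it, and rcu_dereference_protected(..., 1) simply documents
that no further protection is needed.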

Signed-off-by: Vlad Buslov 
---
 net/sched/act_sample.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 2608ccc83e5e..81071afe1b43 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -80,11 +80,13 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla,
}
s = to_sample(*a);
 
+   spin_lock(&s->tcf_lock);
s->tcf_action = parm->action;
s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
s->psample_group_num = nla_get_u32(tb[TCA_SAMPLE_PSAMPLE_GROUP]);
psample_group = psample_group_get(net, s->psample_group_num);
if (!psample_group) {
+   spin_unlock(&s->tcf_lock);
tcf_idr_release(*a, bind);
return -ENOMEM;
}
@@ -94,6 +96,7 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla,
s->truncate = true;
s->trunc_size = nla_get_u32(tb[TCA_SAMPLE_TRUNC_SIZE]);
}
+   spin_unlock(&s->tcf_lock);
 
if (ret == ACT_P_CREATED)
tcf_idr_insert(tn, *a);
@@ -105,7 +108,8 @@ static void tcf_sample_cleanup(struct tc_action *a)
struct tcf_sample *s = to_sample(a);
struct psample_group *psample_group;
 
-   psample_group = rtnl_dereference(s->psample_group);
+   /* last reference to action, no need to lock */
+   psample_group = rcu_dereference_protected(s->psample_group, 1);
RCU_INIT_POINTER(s->psample_group, NULL);
if (psample_group)
psample_group_put(psample_group);
@@ -174,12 +178,13 @@ static int tcf_sample_dump(struct sk_buff *skb, struct tc_action *a,
struct tcf_sample *s = to_sample(a);
struct tc_sample opt = {
.index  = s->tcf_index,
-   .action = s->tcf_action,
.refcnt = refcount_read(&s->tcf_refcnt) - ref,
.bindcnt  = atomic_read(&s->tcf_bindcnt) - bind,
};
struct tcf_t t;
 
+   spin_lock(&s->tcf_lock);
+   opt.action = s->tcf_action;
if (nla_put(skb, TCA_SAMPLE_PARMS, sizeof(opt), &opt))
goto nla_put_failure;
 
@@ -196,9 +201,12 @@ static int tcf_sample_dump(struct sk_buff *skb, struct tc_action *a,
 
if (nla_put_u32(skb, TCA_SAMPLE_PSAMPLE_GROUP, s->psample_group_num))
goto nla_put_failure;
+   spin_unlock(&s->tcf_lock);
+
return skb->len;
 
 nla_put_failure:
+   spin_unlock(&s->tcf_lock);
nlmsg_trim(skb, b);
return -1;
 }
-- 
2.7.5



[PATCH net-next 02/14] net: sched: act_csum: remove dependency on rtnl lock

2018-08-06 Thread Vlad Buslov
Use the tcf lock to protect the csum action struct private data from
concurrent modification in init and dump. Use an rcu swap operation to
reassign the params pointer under protection of the tcf lock. (The old
params value is not used by init, so there is no need for a standalone
rcu dereference step.)

Remove rtnl assertion that is no longer necessary.
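
For context, the swap is safe without stalling the datapath because the
act path never takes tcf_lock and reads params under RCU instead. A
sketch of the reader side, assuming the already-rcu-ified act path:

	struct tcf_csum_params *params;
	int action;

	params = rcu_dereference_bh(p->params);
	action = READ_ONCE(p->tcf_action);
	/* ... checksum work driven by params->update_flags ... */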

Signed-off-by: Vlad Buslov 
---
 net/sched/act_csum.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 648a3a35b720..f01c59ba6d12 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -50,7 +50,7 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
 struct netlink_ext_ack *extack)
 {
struct tc_action_net *tn = net_generic(net, csum_net_id);
-   struct tcf_csum_params *params_old, *params_new;
+   struct tcf_csum_params *params_new;
struct nlattr *tb[TCA_CSUM_MAX + 1];
struct tc_csum *parm;
struct tcf_csum *p;
@@ -88,20 +88,22 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
}
 
p = to_tcf_csum(*a);
-   ASSERT_RTNL();
 
params_new = kzalloc(sizeof(*params_new), GFP_KERNEL);
if (unlikely(!params_new)) {
tcf_idr_release(*a, bind);
return -ENOMEM;
}
-   params_old = rtnl_dereference(p->params);
+   params_new->update_flags = parm->update_flags;
 
+   spin_lock(&p->tcf_lock);
p->tcf_action = parm->action;
-   params_new->update_flags = parm->update_flags;
-   rcu_assign_pointer(p->params, params_new);
-   if (params_old)
-   kfree_rcu(params_old, rcu);
+   rcu_swap_protected(p->params, params_new,
+  lockdep_is_held(&p->tcf_lock));
+   spin_unlock(&p->tcf_lock);
+
+   if (params_new)
+   kfree_rcu(params_new, rcu);
 
if (ret == ACT_P_CREATED)
tcf_idr_insert(tn, *a);
@@ -599,11 +601,13 @@ static int tcf_csum_dump(struct sk_buff *skb, struct tc_action *a, int bind,
.index   = p->tcf_index,
.refcnt  = refcount_read(&p->tcf_refcnt) - ref,
.bindcnt = atomic_read(&p->tcf_bindcnt) - bind,
-   .action  = p->tcf_action,
};
struct tcf_t t;
 
-   params = rtnl_dereference(p->params);
+   spin_lock(&p->tcf_lock);
+   params = rcu_dereference_protected(p->params,
+  lockdep_is_held(&p->tcf_lock));
+   opt.action = p->tcf_action;
opt.update_flags = params->update_flags;
 
if (nla_put(skb, TCA_CSUM_PARMS, sizeof(opt), &opt))
@@ -612,10 +616,12 @@ static int tcf_csum_dump(struct sk_buff *skb, struct tc_action *a, int bind,
tcf_tm_dump(&t, &p->tcf_tm);
if (nla_put_64bit(skb, TCA_CSUM_TM, sizeof(t), &t, TCA_CSUM_PAD))
goto nla_put_failure;
+   spin_unlock(&p->tcf_lock);
 
return skb->len;
 
 nla_put_failure:
+   spin_unlock(&p->tcf_lock);
nlmsg_trim(skb, b);
return -1;
 }
-- 
2.7.5



Re: [PATCH bpf] bpf: btf: Change tools/lib/bpf/btf to LGPL

2018-08-06 Thread Daniel Borkmann
On 08/06/2018 06:45 AM, Alexei Starovoitov wrote:
> On Sun, Aug 05, 2018 at 05:19:13PM -0700, Martin KaFai Lau wrote:
>> This patch changes the tools/lib/bpf/btf.[ch] to LGPL which
>> is in line with libbpf as well.
>>
>> Signed-off-by: Martin KaFai Lau 
> 
> technically it's bpf tree material, but since it changes
> "comment" only and we're very late into the release,
> I think bpf-next would be fine too.
> I'm ok whichever way.
> Acked-by: Alexei Starovoitov 

I think bpf tree is more appropriate, thus applied there, thanks Martin!