Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-06 Thread achiad shochat
On 5 December 2017 at 21:20, Michael S. Tsirkin  wrote:
> On Tue, Dec 05, 2017 at 11:59:17AM +0200, achiad shochat wrote:
>> Then we'll have a single solution for both netvsc and virtio (and any
>> other PV device).
>> And we could handle the VF DMA dirty-page issue agnostically.
>
> For the record, I won't block patches adding this just to virtio
> on the basis that they must be generic. It's not a lot
> of code, implementation can come first, prettify later.

It's not a lot of code either way.
So I fail to understand why not do it right from the beginning.
For the record...

>
> But we do need to have a discussion about how devices are paired.
> I am not sure using just MAC works. E.g. some passthrough
> devices don't give host ability to set the MAC.
> Are these worth worrying about?
>
> --
> MST


Re: [PATCH net-next] virtio_net: Disable interrupts if napi_complete_done rescheduled napi

2017-12-06 Thread Jason Wang



On 2017-12-07 13:09, Michael S. Tsirkin wrote:

On Thu, Dec 07, 2017 at 01:15:15PM +0900, Toshiaki Makita wrote:

Since commit 39e6c8208d7b ("net: solve a NAPI race") napi has been able
to be rescheduled within napi_complete_done() even in non-busypoll case,
but virtnet_poll() always enabled interrupts before complete, and when
napi was rescheduled within napi_complete_done() it did not disable
interrupts.
This caused more interrupts when event idx is disabled.

According to commit cbdadbbf0c79 ("virtio_net: fix race in RX VQ
processing") we cannot place virtqueue_enable_cb_prepare() after
NAPI_STATE_SCHED is cleared, so disable interrupts again if
napi_complete_done() returned false.

Tested with vhost-user of OVS 2.7 on host, which does not have the event
idx feature.

* Before patch:

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.150.253 () port 0 AF_INET
Socket  Message  Elapsed  Messages
SizeSize Time Okay Errors   Throughput
bytes   bytessecs#  #   10^6bits/sec

2129921472   60.00 32763206  06430.32
212992   60.00 23384299   4589.56

Interrupts on guest: 9872369
Packets/interrupt:   2.37

* After patch

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.150.253 () port 0 AF_INET
Socket  Message  Elapsed  Messages
SizeSize Time Okay Errors   Throughput
bytes   bytessecs#  #   10^6bits/sec

2129921472   60.00 32794646  06436.49
212992   60.00 32793501   6436.27

Interrupts on guest: 4941299
Packets/interrupt:   6.64

Signed-off-by: Toshiaki Makita 

Acked-by: Michael S. Tsirkin 

It might make sense in net and not net-next, since tx napi regressed
performance in some configs; this might bring it back at least partially.
Jason - what do you think?


Not sure; the regression I saw was tested with event idx on, and
virtqueue_disable_cb() does almost nothing for event idx (or is even a
little bit slower).


The patch looks good.

Acked-by: Jason Wang 





---
  drivers/net/virtio_net.c | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 19a985e..c0db48d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -261,9 +261,12 @@ static void virtqueue_napi_complete(struct napi_struct 
*napi,
int opaque;
  
  	opaque = virtqueue_enable_cb_prepare(vq);

-   if (napi_complete_done(napi, processed) &&
-   unlikely(virtqueue_poll(vq, opaque)))
-   virtqueue_napi_schedule(napi, vq);
+   if (napi_complete_done(napi, processed)) {
+   if (unlikely(virtqueue_poll(vq, opaque)))
+   virtqueue_napi_schedule(napi, vq);
+   } else {
+   virtqueue_disable_cb(vq);
+   }
  }
  
  static void skb_xmit_done(struct virtqueue *vq)

--
1.8.3.1





[PATCH net-next 3/3] net: dsa: mediatek: update MAINTAINERS entry with MediaTek switch driver

2017-12-06 Thread sean.wang
From: Sean Wang 

I work for MediaTek, maintaining the SoCs targeted at home gateways,
and will keep extending and testing the MediaTek switch functionality.

Signed-off-by: Sean Wang 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index c0edf30..070fd91 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8723,6 +8723,13 @@ L:   netdev@vger.kernel.org
 S: Maintained
 F: drivers/net/ethernet/mediatek/
 
+MEDIATEK SWITCH DRIVER
+M: Sean Wang 
+L: netdev@vger.kernel.org
+S: Maintained
+F: drivers/net/dsa/mt7530.*
+F: net/dsa/tag_mtk.c
+
 MEDIATEK JPEG DRIVER
 M: Rick Chang 
 M: Bin Liu 
-- 
2.7.4



[PATCH net-next 1/3] net: dsa: mediatek: add VLAN support for MT7530

2017-12-06 Thread sean.wang
From: Sean Wang 

MT7530 can treat each port as either a VLAN-unaware or a VLAN-aware port
by putting the ingress port into port matrix mode or port security mode,
respectively. Each port acts as a VLAN-unaware port when the device is
initially created and whenever a port joins or leaves a bridge at
runtime. This patch fills in the required callbacks for VLAN operations
by moving a port into port security mode when it is configured as
VLAN-aware. In that mode the port recognizes the VID of incoming packets
and looks it up in the VLAN table to validate the packet and decide
which port it should be forwarded to. VIDs in the range 1 to 4094 are
valid for the hardware.

Signed-off-by: Sean Wang 
---
 drivers/net/dsa/mt7530.c | 292 ++-
 drivers/net/dsa/mt7530.h |  83 +-
 2 files changed, 368 insertions(+), 7 deletions(-)

diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index 2820d69..a7c5370 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -805,6 +805,69 @@ mt7530_port_bridge_join(struct dsa_switch *ds, int port,
 }
 
 static void
+mt7530_port_set_vlan_unware(struct dsa_switch *ds, int port)
+{
+   struct mt7530_priv *priv = ds->priv;
+   int i;
+   bool all_user_ports_removed = true;
+
+   /* When a port is removed from the bridge, the port would be set up
+* back to the default as is at initial boot which is a VLAN-unware
+* port.
+*/
+   mt7530_rmw(priv, MT7530_PCR_P(port), PCR_PORT_VLAN_MASK,
+  MT7530_PORT_MATRIX_MODE);
+   mt7530_rmw(priv, MT7530_PVC_P(port), VLAN_ATTR_MASK,
+  VLAN_ATTR(MT7530_VLAN_TRANSPARENT));
+
+   priv->ports[port].vlan_filtering = false;
+
+   for (i = 0; i < MT7530_NUM_PORTS; i++) {
+   if (dsa_is_user_port(ds, i) &&
+   priv->ports[i].vlan_filtering) {
+   all_user_ports_removed = false;
+   break;
+   }
+   }
+
+   /* CPU port also does the same thing until all user ports belonging to
+* the CPU port get out of VLAN filtering mode.
+*/
+   if (all_user_ports_removed) {
+   mt7530_write(priv, MT7530_PCR_P(MT7530_CPU_PORT),
+PCR_MATRIX(dsa_user_ports(priv->ds)));
+   mt7530_write(priv, MT7530_PVC_P(MT7530_CPU_PORT),
+PORT_SPEC_TAG);
+   }
+}
+
+static void
+mt7530_port_set_vlan_ware(struct dsa_switch *ds, int port)
+{
+   struct mt7530_priv *priv = ds->priv;
+
+   /* The real fabric path would be decided on the membership in the
+* entry of VLAN table. PCR_MATRIX set up here with ALL_MEMBERS
+* means potential VLAN can be consisting of certain subset of all
+* ports.
+*/
+   mt7530_rmw(priv, MT7530_PCR_P(port),
+  PCR_MATRIX_MASK, PCR_MATRIX(MT7530_ALL_MEMBERS));
+
+   /* Trapped into security mode allows packet forwarding through VLAN
+* table lookup.
+*/
+   mt7530_rmw(priv, MT7530_PCR_P(port), PCR_PORT_VLAN_MASK,
+  MT7530_PORT_SECURITY_MODE);
+
+   /* Set the port as a user port which is to be able to recognize VID
+* from incoming packets before fetching entry within the VLAN table.
+*/
+   mt7530_rmw(priv, MT7530_PVC_P(port), VLAN_ATTR_MASK,
+  VLAN_ATTR(MT7530_VLAN_USER));
+}
+
+static void
 mt7530_port_bridge_leave(struct dsa_switch *ds, int port,
 struct net_device *bridge)
 {
@@ -817,8 +880,11 @@ mt7530_port_bridge_leave(struct dsa_switch *ds, int port,
/* Remove this port from the port matrix of the other ports
 * in the same bridge. If the port is disabled, port matrix
 * is kept and not being setup until the port becomes enabled.
+* And the other port's port matrix cannot be broken when the
+* other port is still a VLAN-ware port.
 */
-   if (dsa_is_user_port(ds, i) && i != port) {
+   if (!priv->ports[i].vlan_filtering &&
+   dsa_is_user_port(ds, i) && i != port) {
if (dsa_to_port(ds, i)->bridge_dev != bridge)
continue;
if (priv->ports[i].enable)
@@ -836,6 +902,8 @@ mt7530_port_bridge_leave(struct dsa_switch *ds, int port,
   PCR_MATRIX(BIT(MT7530_CPU_PORT)));
priv->ports[port].pm = PCR_MATRIX(BIT(MT7530_CPU_PORT));
 
+   mt7530_port_set_vlan_unware(ds, port);
+
	mutex_unlock(&priv->reg_mutex);
 }
 
@@ -906,6 +974,224 @@ mt7530_port_fdb_dump(struct dsa_switch *ds, int port,
return 0;
 }
 
+static int

[PATCH net-next 0/3] add VLAN support to DSA MT7530

2017-12-06 Thread sean.wang
From: Sean Wang 

The patchset adds VLAN support to the DSA MT7530 driver by filling in
the required callbacks in patch 1 and merging the special tag with the
VLAN tag in patch 2, which allows the hardware to handle packets that
carry a VID from the CPU port.

Sean Wang (3):
  net: dsa: mediatek: add VLAN support for MT7530
  net: dsa: mediatek: combine MediaTek tag with VLAN tag
  net: dsa: mediatek: update MAINTAINERS entry with MediaTek switch
driver

 MAINTAINERS  |   7 ++
 drivers/net/dsa/mt7530.c | 292 ++-
 drivers/net/dsa/mt7530.h |  83 +-
 net/dsa/tag_mtk.c|  38 --
 4 files changed, 404 insertions(+), 16 deletions(-)

-- 
2.7.4



[PATCH net-next 2/3] net: dsa: mediatek: combine MediaTek tag with VLAN tag

2017-12-06 Thread sean.wang
From: Sean Wang 

In order to let the MT7530 switch properly recognize packets carrying
both the special tag and a VLAN tag, the special tag information has to
be combined with the existing VLAN tag.

Signed-off-by: Sean Wang 
---
 net/dsa/tag_mtk.c | 38 +-
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/net/dsa/tag_mtk.c b/net/dsa/tag_mtk.c
index 8475434..11535bc 100644
--- a/net/dsa/tag_mtk.c
+++ b/net/dsa/tag_mtk.c
@@ -13,10 +13,13 @@
  */
 
 #include 
+#include 
 
 #include "dsa_priv.h"
 
 #define MTK_HDR_LEN4
+#define MTK_HDR_XMIT_UNTAGGED  0
+#define MTK_HDR_XMIT_TAGGED_TPID_8100  1
 #define MTK_HDR_RECV_SOURCE_PORT_MASK  GENMASK(2, 0)
 #define MTK_HDR_XMIT_DP_BIT_MASK   GENMASK(5, 0)
 
@@ -25,20 +28,37 @@ static struct sk_buff *mtk_tag_xmit(struct sk_buff *skb,
 {
struct dsa_port *dp = dsa_slave_to_port(dev);
u8 *mtk_tag;
+   bool is_vlan_skb = true;
 
-   if (skb_cow_head(skb, MTK_HDR_LEN) < 0)
-   return NULL;
-
-   skb_push(skb, MTK_HDR_LEN);
+   /* Build the special tag after the MAC Source Address. If VLAN header
+* is present, it's required that VLAN header and special tag is
+* being combined. Only in this way we can allow the switch can parse
+* the both special and VLAN tag at the same time and then look up VLAN
+* table with VID.
+*/
+   if (!skb_vlan_tagged(skb)) {
+   if (skb_cow_head(skb, MTK_HDR_LEN) < 0)
+   return NULL;
 
-   memmove(skb->data, skb->data + MTK_HDR_LEN, 2 * ETH_ALEN);
+   skb_push(skb, MTK_HDR_LEN);
+   memmove(skb->data, skb->data + MTK_HDR_LEN, 2 * ETH_ALEN);
+   is_vlan_skb = false;
+   }
 
-   /* Build the tag after the MAC Source Address */
mtk_tag = skb->data + 2 * ETH_ALEN;
-   mtk_tag[0] = 0;
+
+   /* Mark tag attribute on special tag insertion to notify hardware
+* whether that's a combined special tag with 802.1Q header.
+*/
+   mtk_tag[0] = is_vlan_skb ? MTK_HDR_XMIT_TAGGED_TPID_8100 :
+MTK_HDR_XMIT_UNTAGGED;
mtk_tag[1] = (1 << dp->index) & MTK_HDR_XMIT_DP_BIT_MASK;
-   mtk_tag[2] = 0;
-   mtk_tag[3] = 0;
+
+   /* Tag control information is kept for 802.1Q */
+   if (!is_vlan_skb) {
+   mtk_tag[2] = 0;
+   mtk_tag[3] = 0;
+   }
 
return skb;
 }
-- 
2.7.4



Re: [PATCH net-next] netdevsim: check return value of debugfs_create_dir

2017-12-06 Thread Jakub Kicinski
On Thu, 7 Dec 2017 13:10:39 +0900, Prashant Bhole wrote:
> > From: Jakub Kicinski [mailto:jakub.kicin...@netronome.com]
> > 
> > On Thu,  7 Dec 2017 10:02:13 +0900, Prashant Bhole wrote:  
> > > - Handled debugfs_create_dir failure in nsim_init()
> > > - Fixed return value of nsim_module_init() when debugfs_create_dir
> > > fails
> > >
> > > Signed-off-by: Prashant Bhole   
> > 
> > Why?  Failing to expose the state via DebugFS is not fatal to the driver.  
> 
> Ok, my intention was to handle the return code properly, which is not needed
> as per your comment.
> Shall I remove the existing handling in nsim_module_init() in a separate
> patch? 

I was going back and forth on the error handling quite a bit while
writing that code.  In the end I decided to keep the module_init check
and the check for the bpf prog directory.  The former is mostly useful
to make sure there is no duplicate directory with the same name, the
latter to limit possible false positives in the selftests.

> Because it will prevent netdevsim from loading when debugfs is disabled.

Note that netdevsim depends on DEBUG_FS:

config NETDEVSIM
tristate "Simulated networking device"
depends on DEBUG_FS
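
For reference, a minimal sketch of the module_init check described
above, under the assumption that failing to create the top-level
directory (e.g. because of a duplicate name) should abort module load;
the names and error code are illustrative, not copied from the driver:

#include <linux/debugfs.h>

static struct dentry *nsim_ddir;

static int __init nsim_module_init(void)
{
	/* A failure here usually means a duplicate directory name. */
	nsim_ddir = debugfs_create_dir("netdevsim", NULL);
	if (IS_ERR_OR_NULL(nsim_ddir))
		return -ENOMEM;

	/* ... rest of module init (register link ops, etc.) ... */
	return 0;
}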


Re: [Intel-wired-lan] [next-queue 03/10] ixgbe: add ipsec engine start and stop routines

2017-12-06 Thread Shannon Nelson

On 12/5/2017 8:22 AM, Alexander Duyck wrote:

On Mon, Dec 4, 2017 at 9:35 PM, Shannon Nelson
 wrote:

Add in the code for running and stopping the hardware ipsec
encryption/decryption engine.  It is good to keep the engine
off when not in use in order to save on the power draw.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 140 +
  1 file changed, 140 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 14dd011..38a1a16 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -148,10 +148,150 @@ void ixgbe_ipsec_clear_hw_tables(struct ixgbe_adapter 
*adapter)
  }

  /**
+ * ixgbe_ipsec_stop_data
+ * @adapter: board private structure
+ **/
+static void ixgbe_ipsec_stop_data(struct ixgbe_adapter *adapter)
+{
+   struct ixgbe_hw *hw = &adapter->hw;
+   bool link = adapter->link_up;
+   u32 t_rdy, r_rdy;
+   u32 reg;
+
+   /* halt data paths */
+   reg = IXGBE_READ_REG(hw, IXGBE_SECTXCTRL);
+   reg |= IXGBE_SECTXCTRL_TX_DIS;
+   IXGBE_WRITE_REG(hw, IXGBE_SECTXCTRL, reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_SECRXCTRL);
+   reg |= IXGBE_SECRXCTRL_RX_DIS;
+   IXGBE_WRITE_REG(hw, IXGBE_SECRXCTRL, reg);
+
+   IXGBE_WRITE_FLUSH(hw);
+
+   /* If the tx fifo doesn't have link, but still has data,
+* we can't clear the tx sec block.  Set the MAC loopback
+* before block clear
+*/
+   if (!link) {
+   reg = IXGBE_READ_REG(hw, IXGBE_MACC);
+   reg |= IXGBE_MACC_FLU;
+   IXGBE_WRITE_REG(hw, IXGBE_MACC, reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_HLREG0);
+   reg |= IXGBE_HLREG0_LPBK;
+   IXGBE_WRITE_REG(hw, IXGBE_HLREG0, reg);
+
+   IXGBE_WRITE_FLUSH(hw);
+   mdelay(3);
+   }
+
+   /* wait for the paths to empty */
+   do {
+   mdelay(10);
+   t_rdy = IXGBE_READ_REG(hw, IXGBE_SECTXSTAT) &
+   IXGBE_SECTXSTAT_SECTX_RDY;
+   r_rdy = IXGBE_READ_REG(hw, IXGBE_SECRXSTAT) &
+   IXGBE_SECRXSTAT_SECRX_RDY;
+   } while (!t_rdy && !r_rdy);


This piece seems buggy to me. There should be some sort of limit on
how long you are willing to delay. Otherwise a surprise remove can
cause this to spin forever when the register reads return all 1's.


Yep - I had meant to limit that.  Thanks.
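
A rough sketch of a bounded version of that loop, keeping the original
exit condition and only adding an iteration cap; the limit value is an
assumption, not taken from the datasheet:

	int limit = 20;

	/* wait for the paths to empty, but give up after ~200 ms so a
	 * hung or surprise-removed device cannot keep us spinning here
	 */
	do {
		mdelay(10);
		t_rdy = IXGBE_READ_REG(hw, IXGBE_SECTXSTAT) &
			IXGBE_SECTXSTAT_SECTX_RDY;
		r_rdy = IXGBE_READ_REG(hw, IXGBE_SECRXSTAT) &
			IXGBE_SECRXSTAT_SECRX_RDY;
	} while (!t_rdy && !r_rdy && limit--);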




+
+   /* undo loopback if we played with it earlier */
+   if (!link) {
+   reg = IXGBE_READ_REG(hw, IXGBE_MACC);
+   reg &= ~IXGBE_MACC_FLU;
+   IXGBE_WRITE_REG(hw, IXGBE_MACC, reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_HLREG0);
+   reg &= ~IXGBE_HLREG0_LPBK;
+   IXGBE_WRITE_REG(hw, IXGBE_HLREG0, reg);
+
+   IXGBE_WRITE_FLUSH(hw);
+   }
+}
+
+/**
+ * ixgbe_ipsec_stop_engine
+ * @adapter: board private structure
+ **/
+static void ixgbe_ipsec_stop_engine(struct ixgbe_adapter *adapter)
+{
+   struct ixgbe_hw *hw = &adapter->hw;
+   u32 reg;
+
+   ixgbe_ipsec_stop_data(adapter);
+
+   /* disable Rx and Tx SA lookup */
+   IXGBE_WRITE_REG(hw, IXGBE_IPSTXIDX, 0);
+   IXGBE_WRITE_REG(hw, IXGBE_IPSRXIDX, 0);
+
+   /* disable the Rx and Tx engines and full packet store-n-forward */
+   reg = IXGBE_READ_REG(hw, IXGBE_SECTXCTRL);
+   reg |= IXGBE_SECTXCTRL_SECTX_DIS;
+   reg &= ~IXGBE_SECTXCTRL_STORE_FORWARD;
+   IXGBE_WRITE_REG(hw, IXGBE_SECTXCTRL, reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_SECRXCTRL);
+   reg |= IXGBE_SECRXCTRL_SECRX_DIS;
+   IXGBE_WRITE_REG(hw, IXGBE_SECRXCTRL, reg);
+
+   /* restore the "tx security buffer almost full threshold" to 0x250 */
+   IXGBE_WRITE_REG(hw, IXGBE_SECTXBUFFAF, 0x250);
+
+   /* Set minimum IFG between packets back to the default 0x1 */
+   reg = IXGBE_READ_REG(hw, IXGBE_SECTXMINIFG);
+   reg = (reg & 0xfff0) | 0x1;
+   IXGBE_WRITE_REG(hw, IXGBE_SECTXMINIFG, reg);
+
+   /* final set for normal (no ipsec offload) processing */
+   IXGBE_WRITE_REG(hw, IXGBE_SECTXCTRL, IXGBE_SECTXCTRL_SECTX_DIS);
+   IXGBE_WRITE_REG(hw, IXGBE_SECRXCTRL, IXGBE_SECRXCTRL_SECRX_DIS);
+
+   IXGBE_WRITE_FLUSH(hw);
+}
+
+/**
+ * ixgbe_ipsec_start_engine
+ * @adapter: board private structure
+ *
+ * NOTE: this increases power consumption whether being used or not
+ **/
+static void ixgbe_ipsec_start_engine(struct ixgbe_adapter *adapter)
+{
+   struct ixgbe_hw *hw = &adapter->hw;
+   u32 reg;
+
+   ixgbe_ipsec_stop_data(adapter);
+
+   /* Set minimum IFG between packets to 3 */
+   reg = IXGBE_READ_REG(hw, IXGBE_SECTXMINIFG);
+   reg = (reg & 0xfff0) | 0x3;
+   IXGBE_WRITE_REG(hw, IXGBE_SECTXMINIFG, reg);
+
+   /* Set "tx 

Re: [Intel-wired-lan] [next-queue 10/10] ixgbe: register ipsec offload with the xfrm subsystem

2017-12-06 Thread Shannon Nelson

On 12/5/2017 12:11 PM, Alexander Duyck wrote:

On Mon, Dec 4, 2017 at 9:35 PM, Shannon Nelson
 wrote:

With all the support code in place we can now link in the ipsec
offload operations and set the ESP feature flag for the XFRM
subsystem to see.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 4 
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 4 
  2 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index d1220bf..0d5497b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -884,6 +884,10 @@ void ixgbe_init_ipsec_offload(struct ixgbe_adapter 
*adapter)
 ixgbe_ipsec_clear_hw_tables(adapter);
 ixgbe_ipsec_stop_engine(adapter);

+   adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops;
+   adapter->netdev->features |= NETIF_F_HW_ESP;
+   adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP;
+
 return;
  err:
 if (ipsec) {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index c857594..9231351 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9799,6 +9799,10 @@ ixgbe_features_check(struct sk_buff *skb, struct 
net_device *dev,
 if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID))
 features &= ~NETIF_F_TSO;

+   /* IPsec offload doesn't get along well with others *yet* */
+   if (skb->sp)
+   features &= ~(NETIF_F_TSO | NETIF_F_HW_CSUM_BIT);


I'm pretty sure the feature flag stripping here isn't correct. The


Well, first of all that NETIF_F_HW_CSUM_BIT should be NETIF_F_HW_CSUM.


feature bits you want to strip would probably be consistent with the
network_hdr_len check bits included before the MANGLEID check.

We should do some digging into this as it may be a kernel issue. I'm
just wondering if ipsec updates any headers such as the transport
offset or skb checksum start. If either of those are updated that
would explain the issues with getting the offloads to work.


Doing this got the TSO and such out of my way so I didn't have to turn 
tx csum off with ethtool, but you're right, this can be tweaked a little.


There will be more digging later when I work on getting TSO and CSUM 
working with ipsec offload, but I want to get these patches out first 
now that they're working, and then tweak for more performance.
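
A minimal sketch of the immediate correction mentioned above (mask with
the feature flag itself rather than its bit number); whether additional
flags need to be stripped is left to the follow-up work:

	/* IPsec offload doesn't get along well with others *yet* */
	if (skb->sp)
		features &= ~(NETIF_F_TSO | NETIF_F_HW_CSUM);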



Again, thanks for your time and thoughts.

sln




+
 return features;
  }

--
2.7.4



Re: [Intel-wired-lan] [next-queue 07/10] ixgbe: process the Rx ipsec offload

2017-12-06 Thread Shannon Nelson

On 12/5/2017 9:40 AM, Alexander Duyck wrote:

On Mon, Dec 4, 2017 at 9:35 PM, Shannon Nelson
 wrote:

If the chip sees and decrypts an ipsec offload, set up the skb
sp pointer with the related SA info.  Since the chip is rude
enough to keep to itself the table index it used for the
decryption, we have to do our own table lookup, using the
hash for speed.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  6 ++
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 89 ++
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  3 +
  3 files changed, 98 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 7e8bca7..77f07dc 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -1009,9 +1009,15 @@ s32 ixgbe_negotiate_fc(struct ixgbe_hw *hw, u32 adv_reg, 
u32 lp_reg,
u32 adv_sym, u32 adv_asm, u32 lp_sym, u32 lp_asm);
  #ifdef CONFIG_XFRM_OFFLOAD
  void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter);
+void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
+   union ixgbe_adv_rx_desc *rx_desc,
+   struct sk_buff *skb);
  void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter);
  #else
  static inline void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter) { 
};
+static inline void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
+ union ixgbe_adv_rx_desc *rx_desc,
+ struct sk_buff *skb) { };
  static inline void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter) { };
  #endif /* CONFIG_XFRM_OFFLOAD */
  #endif /* _IXGBE_H_ */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index b93ee7f..fd06d9b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -379,6 +379,35 @@ static int ixgbe_ipsec_find_empty_idx(struct ixgbe_ipsec 
*ipsec, bool rxtable)
  }

  /**
+ * ixgbe_ipsec_find_rx_state - find the state that matches
+ * @ipsec: pointer to ipsec struct
+ * @daddr: inbound address to match
+ * @proto: protocol to match
+ * @spi: SPI to match
+ *
+ * Returns a pointer to the matching SA state information
+ **/
+static struct xfrm_state *ixgbe_ipsec_find_rx_state(struct ixgbe_ipsec *ipsec,
+   __be32 daddr, u8 proto,
+   __be32 spi)
+{
+   struct rx_sa *rsa;
+   struct xfrm_state *ret = NULL;
+
+   rcu_read_lock();
+   hash_for_each_possible_rcu(ipsec->rx_sa_list, rsa, hlist, spi)
+   if (spi == rsa->xs->id.spi &&
+   daddr == rsa->xs->id.daddr.a4 &&
+   proto == rsa->xs->id.proto) {
+   ret = rsa->xs;
+   xfrm_state_hold(ret);
+   break;
+   }
+   rcu_read_unlock();
+   return ret;
+}
+


You need to choose a bucket, not just walk through all buckets.


I may be wrong, but I believe that is what is happening here, where the 
spi is the hash key.  As the function description says "iterate over all 
possible objects hashing to the same bucket".  Besides, I basically 
cribbed this directly from our Mellanox friends (thanks!).



Otherwise you might as well have just used a linked list. You might
look at using something like jhash_3words to generate a hash which you
then use to choose the bucket.
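
As a rough illustration of that suggestion — the helper name and the
zero initval below are assumptions, not part of the driver:

#include <linux/jhash.h>

/* Fold daddr, spi and proto into a single key so the bucket choice
 * depends on all three fields rather than on the raw spi alone.
 */
static u32 ixgbe_ipsec_rx_hash(__be32 daddr, __be32 spi, u8 proto)
{
	return jhash_3words((__force u32)daddr, (__force u32)spi, proto, 0);
}

Both the add path and the lookup path would then pass this value as the
key when indexing rx_sa_list.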


+/**
   * ixgbe_ipsec_parse_proto_keys - find the key and salt based on the protocol
   * @xs: pointer to xfrm_state struct
   * @mykey: pointer to key array to populate
@@ -680,6 +709,66 @@ static const struct xfrmdev_ops ixgbe_xfrmdev_ops = {
  };

  /**
+ * ixgbe_ipsec_rx - decode ipsec bits from Rx descriptor
+ * @rx_ring: receiving ring
+ * @rx_desc: receive data descriptor
+ * @skb: current data packet
+ *
+ * Determine if there was an ipsec encapsulation noticed, and if so set up
+ * the resulting status for later in the receive stack.
+ **/
+void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
+   union ixgbe_adv_rx_desc *rx_desc,
+   struct sk_buff *skb)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(rx_ring->netdev);
+   u16 pkt_info = le16_to_cpu(rx_desc->wb.lower.lo_dword.hs_rss.pkt_info);
+   u16 ipsec_pkt_types = IXGBE_RXDADV_PKTTYPE_IPSEC_AH |
+   IXGBE_RXDADV_PKTTYPE_IPSEC_ESP;
+   struct ixgbe_ipsec *ipsec = adapter->ipsec;
+   struct xfrm_offload *xo = NULL;
+   struct xfrm_state *xs = NULL;
+   struct iphdr *iph;
+   u8 *c_hdr;
+   __be32 spi;
+   u8 proto;
+
+   /* we can assume no vlan header in the way, b/c the
+* hw won't recognize the IPsec packet and anyway the
+* currently vlan device doesn't 

Re: [Intel-wired-lan] [next-queue 09/10] ixgbe: ipsec offload stats

2017-12-06 Thread Shannon Nelson

On 12/5/2017 11:53 AM, Alexander Duyck wrote:

On Mon, Dec 4, 2017 at 9:35 PM, Shannon Nelson
 wrote:

Add a simple statistic to count the ipsec offloads.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe.h |  1 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 28 ++--
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c   |  3 +++
  3 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 68097fe..bb66c85 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -265,6 +265,7 @@ struct ixgbe_rx_buffer {
  struct ixgbe_queue_stats {
 u64 packets;
 u64 bytes;
+   u64 ipsec_offloads;
  };

  struct ixgbe_tx_queue_stats {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index c3e7a81..dddbc74 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -1233,34 +1233,34 @@ static void ixgbe_get_ethtool_stats(struct net_device 
*netdev,
 for (j = 0; j < netdev->num_tx_queues; j++) {
 ring = adapter->tx_ring[j];
 if (!ring) {
-   data[i] = 0;
-   data[i+1] = 0;
-   i += 2;
+   data[i++] = 0;
+   data[i++] = 0;
+   data[i++] = 0;
 continue;
 }

 do {
 start = u64_stats_fetch_begin_irq(&ring->syncp);
-   data[i]   = ring->stats.packets;
-   data[i+1] = ring->stats.bytes;
+   data[i++] = ring->stats.packets;
+   data[i++] = ring->stats.bytes;
+   data[i++] = ring->stats.ipsec_offloads;
 } while (u64_stats_fetch_retry_irq(&ring->syncp, start));
-   i += 2;
 }
 for (j = 0; j < IXGBE_NUM_RX_QUEUES; j++) {
 ring = adapter->rx_ring[j];
 if (!ring) {
-   data[i] = 0;
-   data[i+1] = 0;
-   i += 2;
+   data[i++] = 0;
+   data[i++] = 0;
+   data[i++] = 0;
 continue;
 }

 do {
 start = u64_stats_fetch_begin_irq(&ring->syncp);
-   data[i]   = ring->stats.packets;
-   data[i+1] = ring->stats.bytes;
+   data[i++] = ring->stats.packets;
+   data[i++] = ring->stats.bytes;
+   data[i++] = ring->stats.ipsec_offloads;
 } while (u64_stats_fetch_retry_irq(&ring->syncp, start));
-   i += 2;
 }

 for (j = 0; j < IXGBE_MAX_PACKET_BUFFERS; j++) {
@@ -1297,12 +1297,16 @@ static void ixgbe_get_strings(struct net_device 
*netdev, u32 stringset,
 p += ETH_GSTRING_LEN;
 sprintf(p, "tx_queue_%u_bytes", i);
 p += ETH_GSTRING_LEN;
+   sprintf(p, "tx_queue_%u_ipsec_offloads", i);
+   p += ETH_GSTRING_LEN;
 }
 for (i = 0; i < IXGBE_NUM_RX_QUEUES; i++) {
 sprintf(p, "rx_queue_%u_packets", i);
 p += ETH_GSTRING_LEN;
 sprintf(p, "rx_queue_%u_bytes", i);
 p += ETH_GSTRING_LEN;
+   sprintf(p, "rx_queue_%u_ipsec_offloads", i);
+   p += ETH_GSTRING_LEN;
 }
 for (i = 0; i < IXGBE_MAX_PACKET_BUFFERS; i++) {
 sprintf(p, "tx_pb_%u_pxon", i);


I probably wouldn't bother reporting this per ring. It might make more
sense to handle this as an adapter statistic.


I agree, it really messes up the output.  However, I like seeing it per 
ring while I'm testing.  I'll move it.





diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 2a0dd7a..d1220bf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -782,6 +782,7 @@ int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring, struct 
sk_buff *skb,
 if (tsa->encrypt)
 itd->flags |= IXGBE_ADVTXD_TUCMD_IPSEC_ENCRYPT_EN;

+   tx_ring->stats.ipsec_offloads++;
 return 1;


Instead of doing this here you may want to make it a part of the Tx
clean-up path. You should still have the flag bit set so you could
test for the IPSEC flag bit and if it is set on the tx_buffer
following the transmit you could then increment it 

Re: [Intel-wired-lan] [next-queue 08/10] ixgbe: process the Tx ipsec offload

2017-12-06 Thread Shannon Nelson

On 12/5/2017 10:13 AM, Alexander Duyck wrote:

On Mon, Dec 4, 2017 at 9:35 PM, Shannon Nelson
 wrote:

If the skb has a security association referenced in the skb, then
set up the Tx descriptor with the ipsec offload bits.  While we're
here, we fix an oddly named field in the context descriptor struct.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe.h   | 10 +++-
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 77 ++
  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |  4 +-
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 38 ++---
  drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |  2 +-
  5 files changed, 118 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 77f07dc..68097fe 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -171,10 +171,11 @@ enum ixgbe_tx_flags {
 IXGBE_TX_FLAGS_CC   = 0x08,
 IXGBE_TX_FLAGS_IPV4 = 0x10,
 IXGBE_TX_FLAGS_CSUM = 0x20,
+   IXGBE_TX_FLAGS_IPSEC= 0x40,

 /* software defined flags */
-   IXGBE_TX_FLAGS_SW_VLAN  = 0x40,
-   IXGBE_TX_FLAGS_FCOE = 0x80,
+   IXGBE_TX_FLAGS_SW_VLAN  = 0x80,
+   IXGBE_TX_FLAGS_FCOE = 0x100,
  };

  /* VLAN info */
@@ -1012,12 +1013,17 @@ void ixgbe_init_ipsec_offload(struct ixgbe_adapter 
*adapter);
  void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
 union ixgbe_adv_rx_desc *rx_desc,
 struct sk_buff *skb);
+int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring, struct sk_buff *skb,
+  __be16 protocol, struct ixgbe_ipsec_tx_data *itd);
  void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter);
  #else
  static inline void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter) { 
};
  static inline void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
   union ixgbe_adv_rx_desc *rx_desc,
   struct sk_buff *skb) { };
+static inline int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
+struct sk_buff *skb, __be16 protocol,
+struct ixgbe_ipsec_tx_data *itd) { return 0; };
  static inline void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter) { };
  #endif /* CONFIG_XFRM_OFFLOAD */
  #endif /* _IXGBE_H_ */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index fd06d9b..2a0dd7a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -703,12 +703,89 @@ static void ixgbe_ipsec_del_sa(struct xfrm_state *xs)
 }
  }

+/**
+ * ixgbe_ipsec_offload_ok - can this packet use the xfrm hw offload
+ * @skb: current data packet
+ * @xs: pointer to transformer state struct
+ **/
+static bool ixgbe_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
+{
+   if (xs->props.family == AF_INET) {
+   /* Offload with IPv4 options is not supported yet */
+   if (ip_hdr(skb)->ihl > 5)


I would make this ihl != 5 instead of "> 5" since smaller values would
be invalid as well.


Sure




+   return false;
+   } else {
+   /* Offload with IPv6 extension headers is not support yet */
+   if (ipv6_ext_hdr(ipv6_hdr(skb)->nexthdr))
+   return false;
+   }
+
+   return true;
+}
+
  static const struct xfrmdev_ops ixgbe_xfrmdev_ops = {
 .xdo_dev_state_add = ixgbe_ipsec_add_sa,
 .xdo_dev_state_delete = ixgbe_ipsec_del_sa,
+   .xdo_dev_offload_ok = ixgbe_ipsec_offload_ok,
  };

  /**
+ * ixgbe_ipsec_tx - setup Tx flags for ipsec offload
+ * @tx_ring: outgoing context
+ * @skb: current data packet
+ * @protocol: network protocol
+ * @itd: ipsec Tx data for later use in building context descriptor
+ **/
+int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring, struct sk_buff *skb,
+  __be16 protocol, struct ixgbe_ipsec_tx_data *itd)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(tx_ring->netdev);
+   struct ixgbe_ipsec *ipsec = adapter->ipsec;
+   struct xfrm_state *xs;
+   struct tx_sa *tsa;
+
+   if (!skb->sp->len) {
+   netdev_err(tx_ring->netdev, "%s: no xfrm state len = %d\n",
+  __func__, skb->sp->len);
+   return 0;
+   }
+
+   xs = xfrm_input_state(skb);
+   if (!xs) {
+   netdev_err(tx_ring->netdev, "%s: no xfrm_input_state() xs = 
%p\n",
+  __func__, xs);
+   return 0;
+   }
+
+   itd->sa_idx = xs->xso.offload_handle - IXGBE_IPSEC_BASE_TX_INDEX;
+   if (itd->sa_idx > IXGBE_IPSEC_MAX_SA_COUNT) {
+   netdev_err(tx_ring->netdev, "%s: bad sa_idx=%d handle=%lu\n",
+  

Re: [Intel-wired-lan] [next-queue 06/10] ixgbe: restore offloaded SAs after a reset

2017-12-06 Thread Shannon Nelson

On 12/5/2017 9:30 AM, Alexander Duyck wrote:

On Mon, Dec 4, 2017 at 9:35 PM, Shannon Nelson
 wrote:

On a chip reset most of the table contents are lost, so they must be
restored.  This scans the driver's ipsec tables and restores both
the filled and empty table slots to their pre-reset values.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  2 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 53 ++
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  1 +
  3 files changed, 56 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 9487750..7e8bca7 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -1009,7 +1009,9 @@ s32 ixgbe_negotiate_fc(struct ixgbe_hw *hw, u32 adv_reg, 
u32 lp_reg,
u32 adv_sym, u32 adv_asm, u32 lp_sym, u32 lp_asm);
  #ifdef CONFIG_XFRM_OFFLOAD
  void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter);
+void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter);
  #else
  static inline void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter) { 
};
+static inline void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter) { };
  #endif /* CONFIG_XFRM_OFFLOAD */
  #endif /* _IXGBE_H_ */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 7b01d92..b93ee7f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -292,6 +292,59 @@ static void ixgbe_ipsec_start_engine(struct ixgbe_adapter 
*adapter)
  }

  /**
+ * ixgbe_ipsec_restore - restore the ipsec HW settings after a reset
+ * @adapter: board private structure
+ **/
+void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter)
+{
+   struct ixgbe_ipsec *ipsec = adapter->ipsec;
+   struct ixgbe_hw *hw = &adapter->hw;
+   u32 zbuf[4] = {0, 0, 0, 0};


zbuf should be a static const.


Yep




+   int i;
+
+   if (!(adapter->flags2 & IXGBE_FLAG2_IPSEC_ENABLED))
+   return;
+
+   /* clean up the engine settings */
+   ixgbe_ipsec_stop_engine(adapter);
+
+   /* start the engine */
+   ixgbe_ipsec_start_engine(adapter);
+
+   /* reload the IP addrs */
+   for (i = 0; i < IXGBE_IPSEC_MAX_RX_IP_COUNT; i++) {
+   struct rx_ip_sa *ipsa = &ipsec->ip_tbl[i];
+
+   if (ipsa->used)
+   ixgbe_ipsec_set_rx_ip(hw, i, ipsa->ipaddr);
+   else
+   ixgbe_ipsec_set_rx_ip(hw, i, zbuf);


If we are doing a restore do we actually need to write the zero
values? If we did a reset I thought you had a function that was going
through and zeroing everything out. If so this now becomes redundant.


Currently ixgbe_ipsec_clear_hw_tables() only gets run at probe.  It 
should probably get run at remove as well.  Doing this is a bit of 
safety paranoia, making sure the CAM memory structures that don't 
get cleared on reset have exactly what I expect in them.





+   }
+
+   /* reload the Rx keys */
+   for (i = 0; i < IXGBE_IPSEC_MAX_SA_COUNT; i++) {
+   struct rx_sa *rsa = &ipsec->rx_tbl[i];
+
+   if (rsa->used)
+   ixgbe_ipsec_set_rx_sa(hw, i, rsa->xs->id.spi,
+ rsa->key, rsa->salt,
+ rsa->mode, rsa->iptbl_ind);
+   else
+   ixgbe_ipsec_set_rx_sa(hw, i, 0, zbuf, 0, 0, 0);


same here


+   }
+
+   /* reload the Tx keys */
+   for (i = 0; i < IXGBE_IPSEC_MAX_SA_COUNT; i++) {
+   struct tx_sa *tsa = &ipsec->tx_tbl[i];
+
+   if (tsa->used)
+   ixgbe_ipsec_set_tx_sa(hw, i, tsa->key, tsa->salt);
+   else
+   ixgbe_ipsec_set_tx_sa(hw, i, zbuf, 0);


and here


+   }
+}
+
+/**
   * ixgbe_ipsec_find_empty_idx - find the first unused security parameter index
   * @ipsec: pointer to ipsec struct
   * @rxtable: true if we need to look in the Rx table
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 01fd89b..6eabf92 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5347,6 +5347,7 @@ static void ixgbe_configure(struct ixgbe_adapter *adapter)

 ixgbe_set_rx_mode(adapter->netdev);
 ixgbe_restore_vlan(adapter);
+   ixgbe_ipsec_restore(adapter);

 switch (hw->mac.type) {
 case ixgbe_mac_82599EB:
--
2.7.4



Re: [Intel-wired-lan] [next-queue 05/10] ixgbe: implement ipsec add and remove of offloaded SA

2017-12-06 Thread Shannon Nelson

On 12/5/2017 9:26 AM, Alexander Duyck wrote:

On Mon, Dec 4, 2017 at 9:35 PM, Shannon Nelson
 wrote:

Add the functions for setting up and removing offloaded SAs (Security
Associations) with the x540 hardware.  We set up the callback structure
but we don't yet set the hardware feature bit to be sure the XFRM service
won't actually try to use us for an offload yet.

The software tables are made up to mimic the hardware tables to make it
easier to track what's in the hardware, and the SA table index is used
for the XFRM offload handle.  However, there is a hashing field in the
Rx SA tracking that will be used to facilitate faster table searches in
the Rx fast path.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 377 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |   6 +
  2 files changed, 383 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 38a1a16..7b01d92 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -26,6 +26,8 @@
   
**/

  #include "ixgbe.h"
+#include 
+#include 

  /**
   * ixgbe_ipsec_set_tx_sa - set the Tx SA registers
@@ -128,6 +130,7 @@ static void ixgbe_ipsec_set_rx_ip(struct ixgbe_hw *hw, u16 
idx, u32 addr[])
   **/
  void ixgbe_ipsec_clear_hw_tables(struct ixgbe_adapter *adapter)
  {
+   struct ixgbe_ipsec *ipsec = adapter->ipsec;
	struct ixgbe_hw *hw = &adapter->hw;
 u32 buf[4] = {0, 0, 0, 0};
 u16 idx;
@@ -139,9 +142,11 @@ void ixgbe_ipsec_clear_hw_tables(struct ixgbe_adapter 
*adapter)
 /* scrub the tables */
 for (idx = 0; idx < IXGBE_IPSEC_MAX_SA_COUNT; idx++)
 ixgbe_ipsec_set_tx_sa(hw, idx, buf, 0);
+   ipsec->num_tx_sa = 0;

 for (idx = 0; idx < IXGBE_IPSEC_MAX_SA_COUNT; idx++)
 ixgbe_ipsec_set_rx_sa(hw, idx, 0, buf, 0, 0, 0);
+   ipsec->num_rx_sa = 0;

 for (idx = 0; idx < IXGBE_IPSEC_MAX_RX_IP_COUNT; idx++)
 ixgbe_ipsec_set_rx_ip(hw, idx, buf);
@@ -287,11 +292,383 @@ static void ixgbe_ipsec_start_engine(struct 
ixgbe_adapter *adapter)
  }

  /**
+ * ixgbe_ipsec_find_empty_idx - find the first unused security parameter index
+ * @ipsec: pointer to ipsec struct
+ * @rxtable: true if we need to look in the Rx table
+ *
+ * Returns the first unused index in either the Rx or Tx SA table
+ **/
+static int ixgbe_ipsec_find_empty_idx(struct ixgbe_ipsec *ipsec, bool rxtable)
+{
+   u32 i;
+
+   if (rxtable) {
+   if (ipsec->num_rx_sa == IXGBE_IPSEC_MAX_SA_COUNT)
+   return -ENOSPC;
+
+   /* search rx sa table */
+   for (i = 0; i < IXGBE_IPSEC_MAX_SA_COUNT; i++) {
+   if (!ipsec->rx_tbl[i].used)
+   return i;
+   }
+   } else {
+   if (ipsec->num_rx_sa == IXGBE_IPSEC_MAX_SA_COUNT)
+   return -ENOSPC;


Should this be num_tx_sa?


Hmm - can you say cut-and-paste?  Will fix.




+
+   /* search tx sa table */
+   for (i = 0; i < IXGBE_IPSEC_MAX_SA_COUNT; i++) {
+   if (!ipsec->tx_tbl[i].used)
+   return i;
+   }
+   }
+
+   return -ENOSPC;
+}
+
+/**
+ * ixgbe_ipsec_parse_proto_keys - find the key and salt based on the protocol
+ * @xs: pointer to xfrm_state struct
+ * @mykey: pointer to key array to populate
+ * @mysalt: pointer to salt value to populate
+ *
+ * This copies the protocol keys and salt to our own data tables.  The
+ * 82599 family only supports the one algorithm.
+ **/
+static int ixgbe_ipsec_parse_proto_keys(struct xfrm_state *xs,
+   u32 *mykey, u32 *mysalt)
+{
+   struct net_device *dev = xs->xso.dev;
+   unsigned char *key_data;
+   char *alg_name = NULL;
+   char *aes_gcm_name = "rfc4106(gcm(aes))";


aes_gcm_name should probably be a static const char array instead of a pointer.


Sure.
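
For reference, the suggested form is a one-liner; whether it lives at
file or function scope is a matter of taste:

	static const char aes_gcm_name[] = "rfc4106(gcm(aes))";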




+   int key_len;
+
+   if (xs->aead) {
+   key_data = &xs->aead->alg_key[0];
+   key_len = xs->aead->alg_key_len;
+   alg_name = xs->aead->alg_name;
+   } else {
+   netdev_err(dev, "Unsupported IPsec algorithm\n");
+   return -EINVAL;
+   }
+
+   if (strcmp(alg_name, aes_gcm_name)) {
+   netdev_err(dev, "Unsupported IPsec algorithm - please use %s\n",
+  aes_gcm_name);
+   return -EINVAL;
+   }
+
+   /* 160 accounts for 16 byte key and 4 byte salt */
+   if (key_len == 128) {
+   netdev_info(dev, "IPsec hw offload parameters missing 32 bit salt 
value\n");
+   } else if (key_len != 160) {
+   

Re: [Intel-wired-lan] [next-queue 02/10] ixgbe: add ipsec register access routines

2017-12-06 Thread Shannon Nelson
Thanks, Alex, for your detailed comments, I do appreciate the time and 
thought you put into them.


Responses below...

sln

On 12/5/2017 8:56 AM, Alexander Duyck wrote:

On Mon, Dec 4, 2017 at 9:35 PM, Shannon Nelson
 wrote:

Add a few routines to make access to the ipsec registers just a little
easier, and throw in the beginnings of an initialization.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/Makefile  |   1 +
  drivers/net/ethernet/intel/ixgbe/ixgbe.h   |   6 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 157 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h |  50 
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |   1 +
  5 files changed, 215 insertions(+)
  create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
  create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h

diff --git a/drivers/net/ethernet/intel/ixgbe/Makefile 
b/drivers/net/ethernet/intel/ixgbe/Makefile
index 35e6fa6..8319465 100644
--- a/drivers/net/ethernet/intel/ixgbe/Makefile
+++ b/drivers/net/ethernet/intel/ixgbe/Makefile
@@ -42,3 +42,4 @@ ixgbe-$(CONFIG_IXGBE_DCB) +=  ixgbe_dcb.o ixgbe_dcb_82598.o \
  ixgbe-$(CONFIG_IXGBE_HWMON) += ixgbe_sysfs.o
  ixgbe-$(CONFIG_DEBUG_FS) += ixgbe_debugfs.o
  ixgbe-$(CONFIG_FCOE:m=y) += ixgbe_fcoe.o
+ixgbe-$(CONFIG_XFRM_OFFLOAD) += ixgbe_ipsec.o
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index dd55787..1e11462 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -52,6 +52,7 @@
  #ifdef CONFIG_IXGBE_DCA
  #include 
  #endif
+#include "ixgbe_ipsec.h"

  #include 

@@ -1001,4 +1002,9 @@ void ixgbe_store_key(struct ixgbe_adapter *adapter);
  void ixgbe_store_reta(struct ixgbe_adapter *adapter);
  s32 ixgbe_negotiate_fc(struct ixgbe_hw *hw, u32 adv_reg, u32 lp_reg,
u32 adv_sym, u32 adv_asm, u32 lp_sym, u32 lp_asm);
+#ifdef CONFIG_XFRM_OFFLOAD
+void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter);
+#else
+static inline void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter) { };
+#endif /* CONFIG_XFRM_OFFLOAD */
  #endif /* _IXGBE_H_ */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
new file mode 100644
index 000..14dd011
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -0,0 +1,157 @@
+/***
+ *
+ * Intel 10 Gigabit PCI Express Linux driver
+ * Copyright(c) 2017 Oracle and/or its affiliates. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see .
+ *
+ * The full GNU General Public License is included in this distribution in
+ * the file called "COPYING".
+ *
+ * Contact Information:
+ * Linux NICS 
+ * e1000-devel Mailing List 
+ * Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+ *
+ 
**/
+
+#include "ixgbe.h"
+
+/**
+ * ixgbe_ipsec_set_tx_sa - set the Tx SA registers
+ * @hw: hw specific details
+ * @idx: register index to write
+ * @key: key byte array
+ * @salt: salt bytes
+ **/
+static void ixgbe_ipsec_set_tx_sa(struct ixgbe_hw *hw, u16 idx,
+ u32 key[], u32 salt)
+{
+   u32 reg;
+   int i;
+
+   for (i = 0; i < 4; i++)
+   IXGBE_WRITE_REG(hw, IXGBE_IPSTXKEY(i), cpu_to_be32(key[3-i]));
+   IXGBE_WRITE_REG(hw, IXGBE_IPSTXSALT, cpu_to_be32(salt));
+   IXGBE_WRITE_FLUSH(hw);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_IPSTXIDX);
+   reg &= IXGBE_RXTXIDX_IPS_EN;
+   reg |= idx << 3 | IXGBE_RXTXIDX_IDX_WRITE;
+   IXGBE_WRITE_REG(hw, IXGBE_IPSTXIDX, reg);
+   IXGBE_WRITE_FLUSH(hw);
+}
+


So there are a few things here to unpack.

The first is the carry-forward of the IPS bit. I'm not sure that is
the best way to go. Do we really expect to be updating SA values if
IPsec offload is not enabled?


In order to save on energy, we don't enable the engine until we have the 
first SA successfully stored in the tables, so the enable bit will be 
off for that one.


Also, the datasheet specifically says for the Rx table "Software should 
not make changes in the Rx SA 

Re: [Intel-wired-lan] [next-queue 04/10] ixgbe: add ipsec data structures

2017-12-06 Thread Shannon Nelson

On 12/5/2017 9:03 AM, Alexander Duyck wrote:

On Mon, Dec 4, 2017 at 9:35 PM, Shannon Nelson
 wrote:

Set up the data structures to be used by the ipsec offload.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  5 
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h | 40 ++
  2 files changed, 45 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 1e11462..9487750 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -622,6 +622,7 @@ struct ixgbe_adapter {
  #define IXGBE_FLAG2_EEE_CAPABLEBIT(14)
  #define IXGBE_FLAG2_EEE_ENABLEDBIT(15)
  #define IXGBE_FLAG2_RX_LEGACY  BIT(16)
+#define IXGBE_FLAG2_IPSEC_ENABLED  BIT(17)

 /* Tx fast path data */
 int num_tx_queues;
@@ -772,6 +773,10 @@ struct ixgbe_adapter {

  #define IXGBE_RSS_KEY_SIZE 40  /* size of RSS Hash Key in bytes */
 u32 *rss_key;
+
+#ifdef CONFIG_XFRM
+   struct ixgbe_ipsec *ipsec;
+#endif /* CONFIG_XFRM */
  };

  static inline u8 ixgbe_max_rss_indices(struct ixgbe_adapter *adapter)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h
index 017b13f..cb9a4be 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h
@@ -47,4 +47,44 @@
  #define IXGBE_RXMOD_DECRYPT0x0008
  #define IXGBE_RXMOD_IPV6   0x0010

+struct rx_sa {
+   struct hlist_node hlist;
+   struct xfrm_state *xs;
+   u32 ipaddr[4];


ipaddr should be stored as a __be32, not a u32.


Yep




+   u32 key[4];
+   u32 salt;
+   u32 mode;
+   u8  iptbl_ind;
+   bool used;
+   bool decrypt;
+};
+
+struct rx_ip_sa {
+   u32 ipaddr[4];


Same thing here.


Yep




+   u32 ref_cnt;
+   bool used;
+};
+
+struct tx_sa {
+   struct xfrm_state *xs;
+   u32 key[4];
+   u32 salt;
+   bool encrypt;
+   bool used;
+};
+
+struct ixgbe_ipsec_tx_data {
+   u32 flags;
+   u16 trailer_len;
+   u16 sa_idx;
+};
+
+struct ixgbe_ipsec {
+   u16 num_rx_sa;
+   u16 num_tx_sa;
+   struct rx_ip_sa *ip_tbl;
+   struct rx_sa *rx_tbl;
+   struct tx_sa *tx_tbl;
+   DECLARE_HASHTABLE(rx_sa_list, 8);


The hash table seems a bit on the small side. You might look at
increasing this to something like 32 in order to try and cut down on
the load in each bucket since the upper limit is 1K or so isn't it?


I probably can - didn't want to use too much memory real estate if not 
really needed.  I did a little perf timing with a mostly full table and 
didn't see much degradation.  Maybe boost it to 16?


sln




+};
  #endif /* _IXGBE_IPSEC_H_ */
--
2.7.4



Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-12-06 Thread Sergey Senozhatsky
On (12/07/17 16:17), Tobin C. Harding wrote:
[..]
> > hm, indeed. and !CONFIG_KALLSYMS config turns %pS/%ps
> > into special_hex_number().
> 
> But totally misses this :(
> 
> "" would be better returned when !CONFIG_KALLSYMS, right?

I guess I'll take back my comment.

I assume there are tons of embedded devices that have !CONFIG_KALLSYMS
in 'release' builds, yet those devices still warn/oops sometimes; having
pointers/hex numbers is really the only way to make any sense out of
backtraces... yet it, basically, means that we are leaking kernel pointers.
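
For context, a quick illustration of how the specifiers discussed in
this thread behave, assuming the %p hashing and %px patches from this
series are applied; the pointer itself is just an example:

	void *ptr = &init_task;			/* illustrative */

	pr_info("hashed:   %p\n",  ptr);	/* hashed value */
	pr_info("symbolic: %pS\n", ptr);	/* symbol+offset if CONFIG_KALLSYMS=y,
						 * a plain hex value otherwise */
	pr_info("raw:      %px\n", ptr);	/* unmodified address (leaks it) */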

-ss


Re: [PATCH net-next 1/6] net: sched: sch_api: handle generic qdisc errors

2017-12-06 Thread David Ahern
On 12/6/17 9:08 AM, Alexander Aring wrote:
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index a48ca41b7ecf..7b52a16d2fea 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -449,7 +449,8 @@ static const struct nla_policy stab_policy[TCA_STAB_MAX + 
> 1] = {
>   [TCA_STAB_DATA] = { .type = NLA_BINARY },
>  };
>  
> -static struct qdisc_size_table *qdisc_get_stab(struct nlattr *opt)
> +static struct qdisc_size_table *qdisc_get_stab(struct nlattr *opt,
> +struct netlink_ext_ack *extack)
>  {
>   struct nlattr *tb[TCA_STAB_MAX + 1];
>   struct qdisc_size_table *stab;
> @@ -458,23 +459,31 @@ static struct qdisc_size_table *qdisc_get_stab(struct 
> nlattr *opt)
>   u16 *tab = NULL;
>   int err;
>  
> - err = nla_parse_nested(tb, TCA_STAB_MAX, opt, stab_policy, NULL);
> - if (err < 0)
> + err = nla_parse_nested(tb, TCA_STAB_MAX, opt, stab_policy, extack);
> + if (err < 0) {
> + NL_SET_ERR_MSG(extack, "Failed to parse stab netlink msg");

you don't want to set extack here; it should be set in nla_parse_nested
since it is passed as an arg.


>   return ERR_PTR(err);
> - if (!tb[TCA_STAB_BASE])
> + }
> + if (!tb[TCA_STAB_BASE]) {
> + NL_SET_ERR_MSG(extack, "Stab base is missing");

"stab base" is the name of the netlink attribute, but it does not
describe *what* is missing. The key here is returning a message that
tells an average user what they did wrong.


>   return ERR_PTR(-EINVAL);
> + }
>  
>   s = nla_data(tb[TCA_STAB_BASE]);
>  
>   if (s->tsize > 0) {
> - if (!tb[TCA_STAB_DATA])
> + if (!tb[TCA_STAB_DATA]) {
> + NL_SET_ERR_MSG(extack, "Stab data is missing");

ditto here.

Looking at the code for the tc command, stab appears to be table size;
perhaps a tc author can elaborate on what STAB really means.


>   return ERR_PTR(-EINVAL);
> + }
>   tab = nla_data(tb[TCA_STAB_DATA]);
>   tsize = nla_len(tb[TCA_STAB_DATA]) / sizeof(u16);
>   }
>  
> - if (tsize != s->tsize || (!tab && tsize > 0))
> + if (tsize != s->tsize || (!tab && tsize > 0)) {
> + NL_SET_ERR_MSG(extack, "Invalid data length of stab data");
>   return ERR_PTR(-EINVAL);
> + }
>  
>   list_for_each_entry(stab, &qdisc_stab_list, list) {
>   if (memcmp(&stab->szopts, s, sizeof(*s)))
> @@ -486,8 +495,10 @@ static struct qdisc_size_table *qdisc_get_stab(struct 
> nlattr *opt)
>   }
>  
>   stab = kmalloc(sizeof(*stab) + tsize * sizeof(u16), GFP_KERNEL);
> - if (!stab)
> + if (!stab) {
> + NL_SET_ERR_MSG(extack, "Failed to allocate memory for stab 
> data");

ENOMEM does not need a text message. Which memory allocation failed is
not really relevant.


>   return ERR_PTR(-ENOMEM);
> + }
>  
>   stab->refcnt = 1;
>   stab->szopts = *s;
> @@ -896,7 +907,8 @@ static void notify_and_destroy(struct net *net, struct 
> sk_buff *skb,
>  
>  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  struct sk_buff *skb, struct nlmsghdr *n, u32 classid,
> -struct Qdisc *new, struct Qdisc *old)
> +struct Qdisc *new, struct Qdisc *old,
> +struct netlink_ext_ack *extack)
>  {
>   struct Qdisc *q = old;
>   struct net *net = dev_net(dev);
> @@ -911,8 +923,10 @@ static int qdisc_graft(struct net_device *dev, struct 
> Qdisc *parent,
>   (new && new->flags & TCQ_F_INGRESS)) {
>   num_q = 1;
>   ingress = 1;
> - if (!dev_ingress_queue(dev))
> + if (!dev_ingress_queue(dev)) {
> + NL_SET_ERR_MSG(extack, "Cannot find ingress 
> queue for specified netdev");

netdev is a kernel term. 'device' is in the tc man page so a relevant
term for the message.


I think the above gives you an idea. Apply those comments to the rest of
this patch and the next ones.
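
To put the review points above in code form, a rough sketch against the
function being patched; the message strings are examples, not required
wording:

	err = nla_parse_nested(tb, TCA_STAB_MAX, opt, stab_policy, extack);
	if (err < 0)
		return ERR_PTR(err);		/* extack already set by the parser */

	if (!tb[TCA_STAB_BASE]) {
		/* describe what the user got wrong, not the attribute name */
		NL_SET_ERR_MSG(extack, "Size table base attribute is missing");
		return ERR_PTR(-EINVAL);
	}

	...

	stab = kmalloc(sizeof(*stab) + tsize * sizeof(u16), GFP_KERNEL);
	if (!stab)
		return ERR_PTR(-ENOMEM);	/* no message needed for ENOMEM */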


Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-12-06 Thread Tobin C. Harding
On Wed, Dec 06, 2017 at 05:45:37PM +0900, Sergey Senozhatsky wrote:
> On (12/06/17 09:32), Geert Uytterhoeven wrote:
> [..]
> > >> show_fault_oops(struct pt_regs *regs, unsigned long error_code,
> > >> unsigned long address)
> > >> ...
> > >> printk(KERN_CONT " at %p\n", (void *) address);
> > >> printk(KERN_ALERT "IP: %pS\n", (void *)regs->ip);
> > >
> > > So %pS isn't %p, and shows the symbolic name.
> > 
> > If the symbolic name is available.
> > Else it prints the non-hashed pointer value (FTR).


Well, this

[RFC 0/3] kallsyms: don't leak address when printing symbol

_tries_ to fix that

> hm, indeed. and !CONFIG_KALLSYMS config turns %pS/%ps
> into special_hex_number().

But totally misses this :(

"" would be better returned when !CONFIG_KALLSYMS, right?

thanks,
Tobin.


Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-12-06 Thread Tobin C. Harding
On Wed, Dec 06, 2017 at 09:32:14AM +0100, Geert Uytterhoeven wrote:
> Hi Linus,
> 
> On Wed, Dec 6, 2017 at 2:59 AM, Linus Torvalds
>  wrote:
> > On Tue, Dec 5, 2017 at 5:36 PM, Sergey Senozhatsky
> >  wrote:
> >> I see some %p-s being used in _supposedly_ important output,
> >> like arch/x86/mm/fault.c
> >>
> >> show_fault_oops(struct pt_regs *regs, unsigned long error_code,
> >> unsigned long address)
> >> ...
> >> printk(KERN_CONT " at %p\n", (void *) address);
> >> printk(KERN_ALERT "IP: %pS\n", (void *)regs->ip);
> >
> > So %pS isn't %p, and shows the symbolic name.
> 
> If the symbolic name is available.
> Else it prints the non-hashed pointer value (FTR).

I'm trying to fix this :)

[RFC 0/3] kallsyms: don't leak address when printing symbol

thanks,
Tobin.


Re: [PATCH net-next] virtio_net: Disable interrupts if napi_complete_done rescheduled napi

2017-12-06 Thread Michael S. Tsirkin
On Thu, Dec 07, 2017 at 01:15:15PM +0900, Toshiaki Makita wrote:
> Since commit 39e6c8208d7b ("net: solve a NAPI race") napi has been able
> to be rescheduled within napi_complete_done() even in non-busypoll case,
> but virtnet_poll() always enabled interrupts before complete, and when
> napi was rescheduled within napi_complete_done() it did not disable
> interrupts.
> This caused more interrupts when event idx is disabled.
> 
> According to commit cbdadbbf0c79 ("virtio_net: fix race in RX VQ
> processing") we cannot place virtqueue_enable_cb_prepare() after
> NAPI_STATE_SCHED is cleared, so disable interrupts again if
> napi_complete_done() returned false.
> 
> Tested with vhost-user of OVS 2.7 on host, which does not have the event
> idx feature.
> 
> * Before patch:
> 
> $ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
> 192.168.150.253 () port 0 AF_INET
> Socket  Message  Elapsed  Messages
> SizeSize Time Okay Errors   Throughput
> bytes   bytessecs#  #   10^6bits/sec
> 
> 2129921472   60.00 32763206  06430.32
> 212992   60.00 23384299   4589.56
> 
> Interrupts on guest: 9872369
> Packets/interrupt:   2.37
> 
> * After patch
> 
> $ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
> 192.168.150.253 () port 0 AF_INET
> Socket  Message  Elapsed  Messages
> SizeSize Time Okay Errors   Throughput
> bytes   bytessecs#  #   10^6bits/sec
> 
> 2129921472   60.00 32794646  06436.49
> 212992   60.00 32793501   6436.27
> 
> Interrupts on guest: 4941299
> Packets/interrupt:   6.64
> 
> Signed-off-by: Toshiaki Makita 

Acked-by: Michael S. Tsirkin 

it might make sense in net and not -next since tx napi regressed performance
in some configs, this might bring it back at least partially.
Jason - what do you think?

> ---
>  drivers/net/virtio_net.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 19a985e..c0db48d 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -261,9 +261,12 @@ static void virtqueue_napi_complete(struct napi_struct 
> *napi,
>   int opaque;
>  
>   opaque = virtqueue_enable_cb_prepare(vq);
> - if (napi_complete_done(napi, processed) &&
> - unlikely(virtqueue_poll(vq, opaque)))
> - virtqueue_napi_schedule(napi, vq);
> + if (napi_complete_done(napi, processed)) {
> + if (unlikely(virtqueue_poll(vq, opaque)))
> + virtqueue_napi_schedule(napi, vq);
> + } else {
> + virtqueue_disable_cb(vq);
> + }
>  }
>  
>  static void skb_xmit_done(struct virtqueue *vq)
> -- 
> 1.8.3.1
> 


[PATCH net-next] virtio_net: Disable interrupts if napi_complete_done rescheduled napi

2017-12-06 Thread Toshiaki Makita
Since commit 39e6c8208d7b ("net: solve a NAPI race") napi has been able
to be rescheduled within napi_complete_done() even in non-busypoll case,
but virtnet_poll() always enabled interrupts before complete, and when
napi was rescheduled within napi_complete_done() it did not disable
interrupts.
This caused more interrupts when event idx is disabled.

According to commit cbdadbbf0c79 ("virtio_net: fix race in RX VQ
processing") we cannot place virtqueue_enable_cb_prepare() after
NAPI_STATE_SCHED is cleared, so disable interrupts again if
napi_complete_done() returned false.

Tested with vhost-user of OVS 2.7 on host, which does not have the event
idx feature.

* Before patch:

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.150.253 () port 0 AF_INET
Socket  Message  Elapsed  Messages
SizeSize Time Okay Errors   Throughput
bytes   bytessecs#  #   10^6bits/sec

2129921472   60.00 32763206  06430.32
212992   60.00 23384299   4589.56

Interrupts on guest: 9872369
Packets/interrupt:   2.37

* After patch

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.150.253 () port 0 AF_INET
Socket  Message  Elapsed  Messages
SizeSize Time Okay Errors   Throughput
bytes   bytessecs#  #   10^6bits/sec

2129921472   60.00 32794646  06436.49
212992   60.00 32793501   6436.27

Interrupts on guest: 4941299
Packets/interrupt:   6.64

Signed-off-by: Toshiaki Makita 
---
 drivers/net/virtio_net.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 19a985e..c0db48d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -261,9 +261,12 @@ static void virtqueue_napi_complete(struct napi_struct 
*napi,
int opaque;
 
opaque = virtqueue_enable_cb_prepare(vq);
-   if (napi_complete_done(napi, processed) &&
-   unlikely(virtqueue_poll(vq, opaque)))
-   virtqueue_napi_schedule(napi, vq);
+   if (napi_complete_done(napi, processed)) {
+   if (unlikely(virtqueue_poll(vq, opaque)))
+   virtqueue_napi_schedule(napi, vq);
+   } else {
+   virtqueue_disable_cb(vq);
+   }
 }
 
 static void skb_xmit_done(struct virtqueue *vq)
-- 
1.8.3.1




[PATCH net-next 1/1] forcedeth: remove unnecessary variable

2017-12-06 Thread Zhu Yanjun
Since both tx_ring and first_tx are the head of the tx ring, it is not
necessary to use two variables, so first_tx is removed.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 53614ed..cadea67 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -822,7 +822,7 @@ struct fe_priv {
/*
 * tx specific fields.
 */
-   union ring_type get_tx, put_tx, first_tx, last_tx;
+   union ring_type get_tx, put_tx, last_tx;
struct nv_skb_map *get_tx_ctx, *put_tx_ctx;
struct nv_skb_map *first_tx_ctx, *last_tx_ctx;
struct nv_skb_map *tx_skb;
@@ -1932,7 +1932,8 @@ static void nv_init_tx(struct net_device *dev)
struct fe_priv *np = netdev_priv(dev);
int i;
 
-   np->get_tx = np->put_tx = np->first_tx = np->tx_ring;
+   np->get_tx = np->tx_ring;
+   np->put_tx = np->tx_ring;
 
if (!nv_optimized(np))
np->last_tx.orig = &np->tx_ring.orig[np->tx_ring_size-1];
@@ -2248,7 +2249,7 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
offset += bcnt;
size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.orig))
-   put_tx = np->first_tx.orig;
+   put_tx = np->tx_ring.orig;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (size);
@@ -2294,13 +2295,13 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
offset += bcnt;
frag_size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.orig))
-   put_tx = np->first_tx.orig;
+   put_tx = np->tx_ring.orig;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (frag_size);
}
 
-   if (unlikely(put_tx == np->first_tx.orig))
+   if (unlikely(put_tx == np->tx_ring.orig))
prev_tx = np->last_tx.orig;
else
prev_tx = put_tx - 1;
@@ -2406,7 +2407,7 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
offset += bcnt;
size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.ex))
-   put_tx = np->first_tx.ex;
+   put_tx = np->tx_ring.ex;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (size);
@@ -2452,13 +2453,13 @@ static netdev_tx_t nv_start_xmit_optimized(struct 
sk_buff *skb,
offset += bcnt;
frag_size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.ex))
-   put_tx = np->first_tx.ex;
+   put_tx = np->tx_ring.ex;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (frag_size);
}
 
-   if (unlikely(put_tx == np->first_tx.ex))
+   if (unlikely(put_tx == np->tx_ring.ex))
prev_tx = np->last_tx.ex;
else
prev_tx = put_tx - 1;
@@ -2597,7 +2598,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
}
}
if (unlikely(np->get_tx.orig++ == np->last_tx.orig))
-   np->get_tx.orig = np->first_tx.orig;
+   np->get_tx.orig = np->tx_ring.orig;
if (unlikely(np->get_tx_ctx++ == np->last_tx_ctx))
np->get_tx_ctx = np->first_tx_ctx;
}
@@ -2651,7 +2652,7 @@ static int nv_tx_done_optimized(struct net_device *dev, 
int limit)
}
 
if (unlikely(np->get_tx.ex++ == np->last_tx.ex))
-   np->get_tx.ex = np->first_tx.ex;
+   np->get_tx.ex = np->tx_ring.ex;
if (unlikely(np->get_tx_ctx++ == np->last_tx_ctx))
np->get_tx_ctx = np->first_tx_ctx;
}
-- 
2.7.4



RE: [PATCH net-next] netdevsim: check return value of debugfs_create_dir

2017-12-06 Thread Prashant Bhole

> From: Jakub Kicinski [mailto:jakub.kicin...@netronome.com]
> 
> On Thu,  7 Dec 2017 10:02:13 +0900, Prashant Bhole wrote:
> > - Handled debugfs_create_dir failure in nsim_init()
> > - Fixed return value of nsim_module_init() when debugfs_create_dir
> > fails
> >
> > Signed-off-by: Prashant Bhole 
> 
> Why?  Failing to expose the state via DebugFS is not fatal to the driver.

Ok, my intention was to handle the return code properly, which is not needed
as per your comment.
Shall I remove the existing handling in nsim_module_init() in a separate
patch?  As it stands, it prevents netdevsim from loading when debugfs is
disabled.

Thanks,
Prashant




[PATCH v2 net] netlink: Relax attr validation for fixed length types

2017-12-06 Thread David Ahern
Commit 28033ae4e0f5 ("net: netlink: Update attr validation to require
exact length for some types") requires attributes using types NLA_U* and
NLA_S* to have an exact length. This change is exposing bugs in various
userspace commands that are sending attributes with an invalid length
(e.g., attribute has type NLA_U8 and userspace sends NLA_U32). While
the commands are clearly broken and need to be fixed, users are arguing
that the sudden change in enforcement is breaking older commands on
newer kernels for use cases that otherwise "worked".

Relax the validation to print a warning message similar to what is done
for messages containing extra bytes after parsing.

Fixes: 28033ae4e0f5 ("net: netlink: Update attr validation to require exact 
length for some types")
Signed-off-by: David Ahern 
Reviewed-by: Johannes Berg 
---
v2
- updated the comment before nla_attr_len and removed the outdated
  comment before use of nla_attr_len

 lib/nlattr.c | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/lib/nlattr.c b/lib/nlattr.c
index 8bf78b4b78f0..dfa55c873c13 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -15,7 +15,11 @@
 #include 
 #include 
 
-/* for these data types attribute length must be exactly given size */
+/* For these data types, attribute length should be exactly the given
+ * size. However, to maintain compatibility with broken commands, if the
+ * attribute length does not match the expected size a warning is emitted
+ * to the user that the command is sending invalid data and needs to be fixed.
+ */
 static const u8 nla_attr_len[NLA_TYPE_MAX+1] = {
[NLA_U8]= sizeof(u8),
[NLA_U16]   = sizeof(u16),
@@ -28,8 +32,16 @@ static const u8 nla_attr_len[NLA_TYPE_MAX+1] = {
 };
 
 static const u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
+   [NLA_U8]= sizeof(u8),
+   [NLA_U16]   = sizeof(u16),
+   [NLA_U32]   = sizeof(u32),
+   [NLA_U64]   = sizeof(u64),
[NLA_MSECS] = sizeof(u64),
[NLA_NESTED]= NLA_HDRLEN,
+   [NLA_S8]= sizeof(s8),
+   [NLA_S16]   = sizeof(s16),
+   [NLA_S32]   = sizeof(s32),
+   [NLA_S64]   = sizeof(s64),
 };
 
 static int validate_nla_bitfield32(const struct nlattr *nla,
@@ -69,11 +81,9 @@ static int validate_nla(const struct nlattr *nla, int 
maxtype,
 
BUG_ON(pt->type > NLA_TYPE_MAX);
 
-   /* for data types NLA_U* and NLA_S* require exact length */
-   if (nla_attr_len[pt->type]) {
-   if (attrlen != nla_attr_len[pt->type])
-   return -ERANGE;
-   return 0;
+   if (nla_attr_len[pt->type] && attrlen != nla_attr_len[pt->type]) {
+   pr_warn_ratelimited("netlink: '%s': attribute type %d has an 
invalid length.\n",
+   current->comm, type);
}
 
switch (pt->type) {
-- 
2.11.0
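
For reference, a minimal sketch of how a netlink handler is affected by this
change (the attribute set, policy and handler below are made up purely for
illustration, they are not from this patch):

	/* Hypothetical attribute set for some netlink command. */
	enum {
		MYATTR_UNSPEC,
		MYATTR_MODE,		/* NLA_U8 */
		__MYATTR_MAX,
	};
	#define MYATTR_MAX (__MYATTR_MAX - 1)

	static const struct nla_policy my_policy[MYATTR_MAX + 1] = {
		[MYATTR_MODE] = { .type = NLA_U8 },
	};

	static int my_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
			   struct netlink_ext_ack *extack)
	{
		struct nlattr *tb[MYATTR_MAX + 1];
		int err;

		/* A 4-byte MYATTR_MODE from a broken tool was accepted
		 * before 28033ae4e0f5, rejected with -ERANGE after it, and
		 * with this patch is accepted again -- but the ratelimited
		 * warning names the offending process so it can be fixed.
		 */
		err = nla_parse(tb, MYATTR_MAX, nlmsg_attrdata(nlh, 0),
				nlmsg_attrlen(nlh, 0), my_policy, extack);
		if (err < 0)
			return err;

		return 0;
	}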



RE: [PATCH 2/2] net: fec: optimize IRQ handler

2017-12-06 Thread Andy Duan
From: Lucas Stach  Sent: Thursday, December 07, 2017 
1:25 AM
>fep->work_rx and fep->work_tx are both non-zero, as long as the NAPI
>softirq hasn't finished its work. So if the current IRQ does not signal any RX 
>or
>TX completion, but some unrelated event, the path to schedule the NAPI
>context is still entered.
>
>The handler works correctly as in this case napi_schedule_prep() will reject
>the scheduling attempt, but the flow can still be optimized by not trying to
>schedule if the IRQ doesn't signal RX or TX completion.
>
>Signed-off-by: Lucas Stach 
>---
> drivers/net/ethernet/freescale/fec_main.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/ethernet/freescale/fec_main.c
>b/drivers/net/ethernet/freescale/fec_main.c
>index 0b70c07eb703..2043e140e9bd 100644
>--- a/drivers/net/ethernet/freescale/fec_main.c
>+++ b/drivers/net/ethernet/freescale/fec_main.c
>@@ -1587,14 +1587,14 @@ fec_enet_interrupt(int irq, void *dev_id)
>   int_events = readl_relaxed(fep->hwp + FEC_IEVENT) &
>readl_relaxed(fep->hwp + FEC_IMASK);
>   writel(int_events, fep->hwp + FEC_IEVENT);
>-  fec_enet_collect_events(fep, int_events);
>
>-  if ((fep->work_tx || fep->work_rx) && fep->link) {
>+  if ((int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) && fep->link) {
>   ret = IRQ_HANDLED;
>
>   if (napi_schedule_prep(&fep->napi)) {
>   /* Disable the NAPI interrupts */
>   writel(FEC_NAPI_IMASK, fep->hwp + FEC_IMASK);
>+  fec_enet_collect_events(fep, int_events);
>   __napi_schedule(&fep->napi);
>   }
>   }
>--
>2.11.0

The patch seems fine, and for patch #1 it is better to cache imask, as per
David's comment.  Thanks.

Acked-by: Fugang Duan 


Re: [PATCH net-next] netdevsim: check return value of debugfs_create_dir

2017-12-06 Thread Jakub Kicinski
On Thu,  7 Dec 2017 10:02:13 +0900, Prashant Bhole wrote:
> - Handled debugfs_create_dir failure in nsim_init()
> - Fixed return value of nsim_module_init() when debugfs_create_dir fails
> 
> Signed-off-by: Prashant Bhole 

Why?  Failing to expose the state via DebugFS is not fatal to the
driver.
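
For illustration, the non-fatal handling would look roughly like this (just
a sketch of the idea, not a tested patch):

	ns->ddir = debugfs_create_dir(netdev_name(dev), nsim_ddir);
	/* Losing DebugFS only loses the debug interface; the netdev itself
	 * still works, so don't fail probe.  debugfs_remove_recursive()
	 * ignores NULL/ERR_PTR dentries, so teardown stays trivial too.
	 */
	if (IS_ERR_OR_NULL(ns->ddir))
		ns->ddir = NULL;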


Re: [PATCH net v2] adding missing rcu_read_unlock in ipxip6_rcv

2017-12-06 Thread Alexei Starovoitov
On Wed, Dec 06, 2017 at 05:15:43PM -0800, Nikita V. Shirokov wrote:
> commit 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
> introduced new exit point in  ipxip6_rcv. however rcu_read_unlock is
> missing there. this diff is fixing this
> 
> v1->v2:
>  instead of doing rcu_read_unlock in place, we are going to "drop"
>  section (to prevent skb leakage)
> 
> Signed-off-by: Nikita V. Shirokov 

thanks for fixing my mess.
Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Acked-by: Alexei Starovoitov 



Re: [PATCH net 1/2] bpf/verifier: fix bounds calculation on BPF_RSH

2017-12-06 Thread Alexei Starovoitov
On Tue, Dec 05, 2017 at 11:44:14AM -0800, Alexei Starovoitov wrote:
> On Tue, Dec 05, 2017 at 07:15:18PM +, Edward Cree wrote:
> > Incorrect signed bounds were being computed, although this had no effect
> >  since the propagation in __reg_deduce_bounds() happened to overwrite them.
> > 
> > Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max 
> > values")
> > Reported-by: Jann Horn 
> > Signed-off-by: Edward Cree 
> 
> Acked-by: Alexei Starovoitov 

Turned out this one is an incomplete fix. The more complete set
from Jann and Ed will be coming.



[PATCH net v2] adding missing rcu_read_unlock in ipxip6_rcv

2017-12-06 Thread Nikita V. Shirokov
commit 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
introduced a new exit point in ipxip6_rcv; however, rcu_read_unlock is
missing there. This diff fixes that.

v1->v2:
 instead of doing rcu_read_unlock in place, we jump to the "drop"
 section (to prevent an skb leak)

Signed-off-by: Nikita V. Shirokov 
---
 net/ipv6/ip6_tunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 3d3092a..db84f52 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -904,7 +904,7 @@ static int ipxip6_rcv(struct sk_buff *skb, u8 ipproto,
if (t->parms.collect_md) {
tun_dst = ipv6_tun_rx_dst(skb, 0, 0, 0);
if (!tun_dst)
-   return 0;
+   goto drop;
}
ret = __ip6_tnl_rcv(t, skb, tpi, tun_dst, dscp_ecn_decapsulate,
log_ecn_error);
-- 
2.9.5



[PATCH net-next] netdevsim: check return value of debugfs_create_dir

2017-12-06 Thread Prashant Bhole
- Handled debugfs_create_dir failure in nsim_init()
- Fixed return value of nsim_module_init() when debugfs_create_dir fails

Signed-off-by: Prashant Bhole 
---
 drivers/net/netdevsim/netdev.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index eb8c679fca9f..6c4e08f46bcb 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -147,10 +147,15 @@ struct device_type nsim_dev_type = {
 static int nsim_init(struct net_device *dev)
 {
struct netdevsim *ns = netdev_priv(dev);
-   int err;
+   int err = -ENOMEM;
 
ns->netdev = dev;
ns->ddir = debugfs_create_dir(netdev_name(dev), nsim_ddir);
+   if (IS_ERR_OR_NULL(ns->ddir)) {
+   if (ns->ddir)
+   err = PTR_ERR(ns->ddir);
+   goto err;
+   }
 
err = nsim_bpf_init(ns);
if (err)
@@ -171,6 +176,7 @@ static int nsim_init(struct net_device *dev)
nsim_bpf_uninit(ns);
 err_debugfs_destroy:
debugfs_remove_recursive(ns->ddir);
+err:
return err;
 }
 
@@ -466,11 +472,14 @@ struct dentry *nsim_ddir;
 
 static int __init nsim_module_init(void)
 {
-   int err;
+   int err = -ENOMEM;
 
nsim_ddir = debugfs_create_dir(DRV_NAME, NULL);
-   if (IS_ERR(nsim_ddir))
-   return PTR_ERR(nsim_ddir);
+   if (IS_ERR_OR_NULL(nsim_ddir)) {
+   if (nsim_ddir)
+   err = PTR_ERR(nsim_ddir);
+   goto err;
+   }
 
err = bus_register(&nsim_bus);
if (err)
@@ -486,6 +495,7 @@ static int __init nsim_module_init(void)
bus_unregister(&nsim_bus);
 err_debugfs_destroy:
debugfs_remove_recursive(nsim_ddir);
+err:
return err;
 }
 
-- 
2.13.6




Re: [PATCH net] adding missing rcu_read_unlock in ipxip6_rcv

2017-12-06 Thread Nikita Shirokov


On 12/6/17, 12:50 PM, "David Miller"  wrote:

From: "Nikita V. Shirokov" 
Date: Wed,  6 Dec 2017 10:19:33 -0800

> commit 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
> introduced new exit point in  ipxip6_rcv. however rcu_read_unlock is
> missing there. this diff is fixing this
> 
> Signed-off-by: Nikita V. Shirokov 
 ...
> @@ -903,8 +903,10 @@ static int ipxip6_rcv(struct sk_buff *skb, u8 
ipproto,
>   goto drop;
>   if (t->parms.collect_md) {
>   tun_dst = ipv6_tun_rx_dst(skb, 0, 0, 0);
> - if (!tun_dst)
> + if (!tun_dst) {
> + rcu_read_unlock();
>   return 0;
> + }
>   }
>   ret = __ip6_tnl_rcv(t, skb, tpi, tun_dst, dscp_ecn_decapsulate,
>   log_ecn_error);

Shouldn't it branch to 'drop' otherwise we leak the skb?


Fair point, will rework.

--
Nikita



Re: [PATCH net-next] bridge: ebtables: Avoid resetting limit rule state

2017-12-06 Thread Pablo Neira Ayuso
Hi Linus,

On Mon, Dec 04, 2017 at 05:53:35AM +0100, Linus Lüssing wrote:
> Hi Pablo,
> 
> Thanks for your reply!
> 
> On Tue, Nov 28, 2017 at 12:30:08AM +0100, Pablo Neira Ayuso wrote:
> > [...]
> > > diff --git a/net/bridge/netfilter/ebt_limit.c 
> > > b/net/bridge/netfilter/ebt_limit.c
> > > index 61a9f1be1263..f74b48633feb 100644
> > > --- a/net/bridge/netfilter/ebt_limit.c
> > > +++ b/net/bridge/netfilter/ebt_limit.c
> > > @@ -69,6 +69,10 @@ static int ebt_limit_mt_check(const struct 
> > > xt_mtchk_param *par)
> > >  {
> > >   struct ebt_limit_info *info = par->matchinfo;
> > >  
> > > + /* Do not reset state on unrelated table changes */
> > > + if (info->prev)
> > > + return 0;
> > 
> > What kernel version are you using? I suspect you don't have this
> > applied?
> 
> I'm indeed using a 4.4.102 kernel, as LEDE is still in the process
> of updating to 4.14. So 4.4 with LEDE is where I got the measurement
> results from.
> 
> > 
> > commit ec23189049651b16dc2ffab35a4371dc1f491aca
> > Author: Willem de Bruijn 
> > Date:   Mon Jan 2 17:19:46 2017 -0500
> > 
> > xtables: extend matches and targets with .usersize
> 
> And so, no I do not have this patch. I looked at it now, but it
> does not seem to have any relation with .matchinfo, does it?
> 
> I also had a quick look at a 4.15-rc1 kernel in a VM now. I still
> end up in ebt_limit_mt_check() with the variables being reset
> when editing the table somewhere.

My question is if your fix would work with 4.15-rc1.


[PATCH net 1/2] net: dsa: mv88e6xxx: Fix interrupt masking on removal

2017-12-06 Thread Andrew Lunn
When removing the interrupt handling code, we should mask the
generation of interrupts. The code however unmasked all
interrupts. This can then cause a new interrupt. We then get into a
deadlock where the interrupt thread is waiting to run, and the code
continues, trying to remove the interrupt handler, which means waiting
for the thread to complete. On a UP machine this deadlocks.

Fix so we really mask interrupts in the hardware. The same error is
made in the error path when installing the interrupt handling code.

Fixes: 3460a5770ce9 ("net: dsa: mv88e6xxx: Mask g1 interrupts and free 
interrupt")
Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 8171055fde7a..70004264f60d 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -339,7 +339,7 @@ static void mv88e6xxx_g1_irq_free(struct mv88e6xxx_chip 
*chip)
u16 mask;
 
mv88e6xxx_g1_read(chip, MV88E6XXX_G1_CTL1, &mask);
-   mask |= GENMASK(chip->g1_irq.nirqs, 0);
+   mask &= ~GENMASK(chip->g1_irq.nirqs, 0);
mv88e6xxx_g1_write(chip, MV88E6XXX_G1_CTL1, mask);
 
free_irq(chip->irq, chip);
@@ -395,7 +395,7 @@ static int mv88e6xxx_g1_irq_setup(struct mv88e6xxx_chip 
*chip)
return 0;
 
 out_disable:
-   mask |= GENMASK(chip->g1_irq.nirqs, 0);
+   mask &= ~GENMASK(chip->g1_irq.nirqs, 0);
mv88e6xxx_g1_write(chip, MV88E6XXX_G1_CTL1, mask);
 
 out_mapping:
-- 
2.15.1



[PATCH net 2/2] net: dsa: mv88e6xxx: Unregister MDIO bus on error path

2017-12-06 Thread Andrew Lunn
The MDIO busses need to be unregistered before they are freed,
otherwise BUG() is called. Add a call to the unregister code if the
registration fails, since we can have multiple busses, of which some
may correctly register before one fails. This requires moving the code
around a little.

Fixes: a3c53be55c95 ("net: dsa: mv88e6xxx: Support multiple MDIO busses")
Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 70004264f60d..66d33e97cbc5 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2177,6 +2177,19 @@ static const struct of_device_id 
mv88e6xxx_mdio_external_match[] = {
{ },
 };
 
+static void mv88e6xxx_mdios_unregister(struct mv88e6xxx_chip *chip)
+
+{
+   struct mv88e6xxx_mdio_bus *mdio_bus;
+   struct mii_bus *bus;
+
+   list_for_each_entry(mdio_bus, &chip->mdios, list) {
+   bus = mdio_bus->bus;
+
+   mdiobus_unregister(bus);
+   }
+}
+
 static int mv88e6xxx_mdios_register(struct mv88e6xxx_chip *chip,
struct device_node *np)
 {
@@ -2201,27 +2214,16 @@ static int mv88e6xxx_mdios_register(struct 
mv88e6xxx_chip *chip,
match = of_match_node(mv88e6xxx_mdio_external_match, child);
if (match) {
err = mv88e6xxx_mdio_register(chip, child, true);
-   if (err)
+   if (err) {
+   mv88e6xxx_mdios_unregister(chip);
return err;
+   }
}
}
 
return 0;
 }
 
-static void mv88e6xxx_mdios_unregister(struct mv88e6xxx_chip *chip)
-
-{
-   struct mv88e6xxx_mdio_bus *mdio_bus;
-   struct mii_bus *bus;
-
-   list_for_each_entry(mdio_bus, &chip->mdios, list) {
-   bus = mdio_bus->bus;
-
-   mdiobus_unregister(bus);
-   }
-}
-
 static int mv88e6xxx_get_eeprom_len(struct dsa_switch *ds)
 {
struct mv88e6xxx_chip *chip = ds->priv;
-- 
2.15.1



[PATCH net 0/2] mv88e6xxx error patch fixes

2017-12-06 Thread Andrew Lunn
While trying to bring up a new PHY on a board, I exercised the error
paths a bit, and discovered some bugs. The unwind for interrupt
handling deadlocks, and the MDIO code hits a BUG() when a registered
MDIO device is freed without first being unregistered.

Andrew Lunn (2):
  net: dsa: mv88e6xxx: Fix interrupt masking on removal
  net: dsa: mv88e6xxx: Unregister MDIO bus on error path

 drivers/net/dsa/mv88e6xxx/chip.c | 34 ++
 1 file changed, 18 insertions(+), 16 deletions(-)

-- 
2.15.1



linux-next: manual merge of the vfs tree with the bpf tree

2017-12-06 Thread Stephen Rothwell
Hi Al,

Today's linux-next merge of the vfs tree got conflicts in:

  arch/alpha/include/uapi/asm/Kbuild
  arch/mn10300/include/uapi/asm/Kbuild
  arch/score/include/uapi/asm/Kbuild

between commit:

  c895f6f703ad ("bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program 
type")

from the bpf tree and commit:

  d759be8953fe ("switch wrapper poll.h instances to generic-y")

from the vfs tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

I do wonder if those poll.h additions are necessary given that poll.h
is listed in include/uapi/asm-generic/Kbuild.asm ...

-- 
Cheers,
Stephen Rothwell

diff --cc arch/alpha/include/uapi/asm/Kbuild
index 14a2e9af97e9,574fe90c8b58..
--- a/arch/alpha/include/uapi/asm/Kbuild
+++ b/arch/alpha/include/uapi/asm/Kbuild
@@@ -1,4 -1,4 +1,5 @@@
  # UAPI Header export list
  include include/uapi/asm-generic/Kbuild.asm
  
 +generic-y += bpf_perf_event.h
+ generic-y += poll.h
diff --cc arch/mn10300/include/uapi/asm/Kbuild
index 81271d3af47c,162fbdbb5fd5..
--- a/arch/mn10300/include/uapi/asm/Kbuild
+++ b/arch/mn10300/include/uapi/asm/Kbuild
@@@ -1,5 -1,5 +1,6 @@@
  # UAPI Header export list
  include include/uapi/asm-generic/Kbuild.asm
  
 +generic-y += bpf_perf_event.h
+ generic-y += poll.h
  generic-y += siginfo.h
diff --cc arch/score/include/uapi/asm/Kbuild
index 81271d3af47c,162fbdbb5fd5..
--- a/arch/score/include/uapi/asm/Kbuild
+++ b/arch/score/include/uapi/asm/Kbuild
@@@ -1,5 -1,5 +1,6 @@@
  # UAPI Header export list
  include include/uapi/asm-generic/Kbuild.asm
  
 +generic-y += bpf_perf_event.h
+ generic-y += poll.h
  generic-y += siginfo.h


Re: [PATCHv2 2/2] userns: control capabilities of some user namespaces

2017-12-06 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> On Wed, Nov 29, 2017 at 9:57 AM, Serge E. Hallyn  wrote:
> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> >> On Tue, Nov 28, 2017 at 3:04 PM, Serge E. Hallyn  wrote:
> >> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> >> > ...
> >> >> >> diff --git a/security/commoncap.c b/security/commoncap.c
> >> >> >> index fc46f5b85251..89103f16ac37 100644
> >> >> >> --- a/security/commoncap.c
> >> >> >> +++ b/security/commoncap.c
> >> >> >> @@ -73,6 +73,14 @@ int cap_capable(const struct cred *cred, struct 
> >> >> >> user_namespace *targ_ns,
> >> >> >>  {
> >> >> >>   struct user_namespace *ns = targ_ns;
> >> >> >>
> >> >> >> + /* If the capability is controlled and user-ns that process
> >> >> >> +  * belongs-to is 'controlled' then return EPERM and no need
> >> >> >> +  * to check the user-ns hierarchy.
> >> >> >> +  */
> >> >> >> + if (is_user_ns_controlled(cred->user_ns) &&
> >> >> >> + is_capability_controlled(cap))
> >> >> >> + return -EPERM;
> >> >> >
> >> >> > I'd be curious to see the performance impact on this on a regular
> >> >> > workload (kernel build?) in a controlled ns.
> >> >> >
> >> >> Should it affect? If at all, it should be +ve since, the recursive
> >> >> user-ns hierarchy lookup is avoided with the above check if the
> >> >> capability is controlled.
> >> >
> >> > Yes but I expect that to be the rare case for normal lxc installs
> >> > (which are of course what I am interested in)
> >> >
> >> >>  The additional cost otherwise is this check
> >> >> per cap_capable() call.
> >> >
> >> > And pipeline refetching?
> >> >
> >> > Capability calls also shouldn't be all that frequent, but still I'm
> >> > left wondering...
> >>
> >> Correct, and capability checks are part of the control-path and not
> >> the data-path so shouldn't matter but I guess it doesn't hurt to
> >> find-out the number. Do you have any workload in mind, that we can use
> >> for this test/benchmark?
> >
> > I suppose if you did both (a) a kernel build and (b) a webserve
> > like https://github.com/m3ng9i/ran , being hit for a minute by a
> > heavy load of requests, those two together would be re-assuring.
> >
> Well, I did (a) and (b). Here are the results.
> 
> (a0) I used the ubuntu-artful (17.10) vm instance with standard kernel
> to compile the kernel
> 
> mahesh@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s clean
> mahesh@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s
> real 6m47.525s
> user 22m37.424s
> sys 2m44.745s
> 
> (b0) Now in an user-namespce create by an user that does not have
> SYS_ADMIN (just for apples-to-apples comparison)
> mahesh@mahesh-vm0-artful:~$ sysctl -q kernel.controlled_userns_caps_whitelist
> sysctl: cannot stat /proc/sys/kernel/controlled_userns_caps_whitelist:
> No such file or directory
> mahesh@mahesh-vm0-artful:~$ id
> uid=1000(mahesh) gid=1000(mahesh)
> groups=1000(mahesh),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),118(lpadmin),128(sambashare)
> mahesh@mahesh-vm0-artful:~/Work/Linux$ unshare -Uf -- bash
> nobody@mahesh-vm0-artful:~/Work/Linux$ id
> uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
> nobody@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s clean
> nobody@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s
> real 9m10.115s

Got some serious noise in this run?

But the numbers look good - thanks!


Re: [PATCH V11 3/5] printk: hash addresses printed with %p

2017-12-06 Thread Linus Torvalds
On Wed, Dec 6, 2017 at 3:21 PM, Kees Cook  wrote:
> On Wed, Dec 6, 2017 at 2:31 AM, David Laight  wrote:
>>
>> I can also image issues where you want to know whether 2 pointers point
>> into the same structure (like an skb).
>
> This is already retained due to the hashing. i.e. the same pointer
> value will produce the same hash value, so that kind of direct
> comparison still works.

DavidL isn't talking about the _same_ pointer, he's talking about it
being offset from the same base.

But realistically, I don't think we ever hash things like that anyway.

Things like that do show up in register dumps etc during oopses, but
those aren't hashed.
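
(For anyone following along, a trivial demonstration snippet of what the
hashing does and does not preserve - not taken from any of these patches,
just something to drop into a test module:)

	static struct work_struct demo;
	void *base  = &demo;
	void *field = &demo.entry;

	/* Same pointer value => same hashed output, so "these two lines
	 * refer to the same object" can still be read off a log ...
	 */
	pr_info("base:  %p\n", base);
	pr_info("base:  %p\n", base);

	/* ... but the hash preserves neither offsets nor alignment: 'field'
	 * is only a few bytes past 'base', yet the printed values look
	 * completely unrelated, and the low bits mean nothing.
	 */
	pr_info("field: %p\n", field);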

Linus


Re: [PATCH v3 net-next 0/1] net: dsa: microchip: Add Microchip KSZ8795 DSA driver

2017-12-06 Thread Andrew Lunn
On Tue, Dec 05, 2017 at 05:47:30PM -0800, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> This patch requires the previous patches for Microchip KSZ9477 DSA driver.

Please don't submit so many patch sets, which all depend on the
previous one. Wait until they get accepted.

I'm not going to review this or the following patchsets, until the
previous ones are accepted.

 Andrew


Re: [PATCH v3 net-next] net: dsa: Modify tag_ksz.c so that tail tag code can be used by other KSZ switch drivers

2017-12-06 Thread Andrew Lunn
On Tue, Dec 05, 2017 at 05:47:14PM -0800, tristram...@microchip.com wrote:
> +static int ksz9477_get_tag(u8 *tag, int *port)
> +{
> + *port = tag[0] & 7;
> + return KSZ_EGRESS_TAG_LEN;
> +}
> +
> +static void ksz9477_set_tag(void *ptr, u8 *addr, int p)
> +{
> + u16 *tag = (u16 *)ptr;
> +
> + *tag = 1 << p;
> + *tag = cpu_to_be16(*tag);
> +}

Hi Tristram

It would be more normal to create an ops structure

struct ksz_tag_ops {
   int (*get_tag)(u8 *tag, int *port);
   void (*set_tag)(void *ptr, u8 *addr, int p)
};

static const struct ksz_tag_ops ksz9477_tag_ops = {
   .get_tag = ksz9477_get_tag,
   .set_tag = ksz9477_set_tag,
}

and then pass this to the functions.

It might even be possible to make this more generic, use it for other
tag drivers.
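
To make that concrete, the chip-specific entry points would then collapse
into thin wrappers around shared helpers that take the ops as a parameter,
along these lines (rough sketch only; ksz_common_rcv/ksz_common_xmit are
made-up helper names, not the current tag_ksz.c interface):

struct ksz_tag_ops {
	int	(*get_tag)(u8 *tag, int *port);
	void	(*set_tag)(void *ptr, u8 *addr, int p);
};

static const struct ksz_tag_ops ksz9477_tag_ops = {
	.get_tag = ksz9477_get_tag,
	.set_tag = ksz9477_set_tag,
};

/* All the skb surgery lives once in the common helpers, parameterized
 * by the ops; other KSZ families only provide their own ksz_tag_ops.
 */
static struct sk_buff *ksz9477_rcv(struct sk_buff *skb, struct net_device *dev,
				   struct packet_type *pt)
{
	return ksz_common_rcv(skb, dev, pt, &ksz9477_tag_ops);
}

static struct sk_buff *ksz9477_xmit(struct sk_buff *skb, struct net_device *dev)
{
	return ksz_common_xmit(skb, dev, &ksz9477_tag_ops);
}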

Andrew


Re: [PATCH V11 3/5] printk: hash addresses printed with %p

2017-12-06 Thread Kees Cook
On Wed, Dec 6, 2017 at 2:31 AM, David Laight  wrote:
> From: David Miller
>> Sent: 05 December 2017 20:31
> ...
>> > Would it make sense to keep the 3 lowest bits of the address?
>> >
>> > Currently printed pointers no longer have any correlation with the actual
>> > alignment in memory of the object, which is a typical cause of a class of 
>> > bugs.
>>
>> Yeah, this is driving people nuts who wonder why pointers are aligned
>> all weird now.
>
> I can also image issues where you want to know whether 2 pointers point
> into the same structure (like an skb).

This is already retained due to the hashing. i.e. the same pointer
value will produce the same hash value, so that kind of direct
comparison still works.

> Useful if you suspect that code is using a stale pointer because the
> skb got reallocated by some internal function.
> I'm not sure such pointers are printed by default though.

I hope not. :) skb used to be exported for INET_DIAG, but that got
fixed a while back.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v3 net-next] net: dsa: microchip: Add MIB counter reading support

2017-12-06 Thread Andrew Lunn
> + for (i = 0; i < dev->mib_port_cnt; i++) {
> + p = &dev->ports[i];
> + if (!p->on)
> + continue;
> + mib = &p->mib;
> + mutex_lock(&mib->cnt_mutex);
> +
> + /* read only dropped counters when link is not up */
> + if (p->link_down)
> + p->link_down = 0;
> + else if (!p->link_up)
> + mib->cnt_ptr = dev->reg_mib_cnt;

So this is the code you were referring to, when I asked if it can be
up and down at the same time.

I'm having a hard time understanding this. Please try to implement this
some other way to make it clear what is going on.

 Andrew


[PATCH] net: ethernet: arc: fix error handling in emac_rockchip_probe

2017-12-06 Thread Branislav Radocaj
If clk_set_rate() fails, we should disable clk before return.
Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Branislav Radocaj 
---
 drivers/net/ethernet/arc/emac_rockchip.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/arc/emac_rockchip.c 
b/drivers/net/ethernet/arc/emac_rockchip.c
index e278e3d96ee0..c6163874e4e7 100644
--- a/drivers/net/ethernet/arc/emac_rockchip.c
+++ b/drivers/net/ethernet/arc/emac_rockchip.c
@@ -220,9 +220,11 @@ static int emac_rockchip_probe(struct platform_device 
*pdev)
 
/* RMII TX/RX needs always a rate of 25MHz */
err = clk_set_rate(priv->macclk, 25000000);
-   if (err)
+   if (err) {
dev_err(dev,
"failed to change mac clock rate (%d)\n", err);
+   goto out_clk_disable_macclk;
+   }
}
 
err = arc_emac_probe(ndev, interface);
@@ -232,7 +234,8 @@ static int emac_rockchip_probe(struct platform_device *pdev)
}
 
return 0;
-
+out_clk_disable_macclk:
+   clk_disable_unprepare(priv->macclk);
 out_regulator_disable:
if (priv->regulator)
regulator_disable(priv->regulator);
-- 
2.11.0



[PATCH net-next v3] net: dsa: Allow compiling out legacy support

2017-12-06 Thread Florian Fainelli
Introduce a configuration option, CONFIG_NET_DSA_LEGACY, that allows compiling
out support for the old platform device and Device Tree binding registration.
Support for these configurations is scheduled to be removed in 4.17.

Signed-off-by: Florian Fainelli 
---
Changes in v3:
- rebase against latest net-next

Changes in v2:
- make the option enabled by default
- make the .probe function part of NET_DSA_LEGACY
- make mv88e6060 depend on NET_DSA_LEGACY
- move dsa_legacy_fdb_{add,del} out of net/dsa/legacy.c

 drivers/net/dsa/Kconfig  |  2 +-
 drivers/net/dsa/mv88e6xxx/chip.c |  4 
 include/net/dsa.h| 11 +++
 net/dsa/Kconfig  |  9 +
 net/dsa/Makefile |  3 ++-
 net/dsa/dsa_priv.h   |  9 +
 net/dsa/legacy.c | 20 
 net/dsa/slave.c  | 20 
 8 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index 83a9bc892a3b..2b81b97e994f 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -33,7 +33,7 @@ config NET_DSA_MT7530
 
 config NET_DSA_MV88E6060
tristate "Marvell 88E6060 ethernet switch chip support"
-   depends on NET_DSA
+   depends on NET_DSA && NET_DSA_LEGACY
select NET_DSA_TAG_TRAILER
---help---
  This enables support for the Marvell 88E6060 ethernet switch
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 29b79d6d2925..24e5d98f15a1 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -3755,6 +3755,7 @@ static enum dsa_tag_protocol 
mv88e6xxx_get_tag_protocol(struct dsa_switch *ds,
return chip->info->tag_protocol;
 }
 
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
 static const char *mv88e6xxx_drv_probe(struct device *dsa_dev,
   struct device *host_dev, int sw_addr,
   void **priv)
@@ -3802,6 +3803,7 @@ static const char *mv88e6xxx_drv_probe(struct device 
*dsa_dev,
 
return NULL;
 }
+#endif
 
 static int mv88e6xxx_port_mdb_prepare(struct dsa_switch *ds, int port,
  const struct switchdev_obj_port_mdb *mdb)
@@ -3841,7 +3843,9 @@ static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, 
int port,
 }
 
 static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
.probe  = mv88e6xxx_drv_probe,
+#endif
.get_tag_protocol   = mv88e6xxx_get_tag_protocol,
.setup  = mv88e6xxx_setup,
.adjust_link= mv88e6xxx_adjust_link,
diff --git a/include/net/dsa.h b/include/net/dsa.h
index d29feccaefab..6cb602dd970c 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -321,12 +321,14 @@ static inline unsigned int dsa_upstream_port(struct 
dsa_switch *ds, int port)
 typedef int dsa_fdb_dump_cb_t(const unsigned char *addr, u16 vid,
  bool is_static, void *data);
 struct dsa_switch_ops {
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
/*
 * Legacy probing.
 */
const char  *(*probe)(struct device *dsa_dev,
  struct device *host_dev, int sw_addr,
  void **priv);
+#endif
 
enum dsa_tag_protocol (*get_tag_protocol)(struct dsa_switch *ds,
  int port);
@@ -474,11 +476,20 @@ struct dsa_switch_driver {
const struct dsa_switch_ops *ops;
 };
 
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
 /* Legacy driver registration */
 void register_switch_driver(struct dsa_switch_driver *type);
 void unregister_switch_driver(struct dsa_switch_driver *type);
 struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev);
 
+#else
+static inline void register_switch_driver(struct dsa_switch_driver *type) { }
+static inline void unregister_switch_driver(struct dsa_switch_driver *type) { }
+static inline struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev)
+{
+   return NULL;
+}
+#endif
 struct net_device *dsa_dev_to_net_device(struct device *dev);
 
 /* Keep inline for faster access in hot path */
diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index 03c3bdf25468..bbf2c82cf7b2 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -16,6 +16,15 @@ config NET_DSA
 
 if NET_DSA
 
+config NET_DSA_LEGACY
+   bool "Support for older platform device and Device Tree registration"
+   default y
+   ---help---
+ Say Y if you want to enable support for the older platform device and
+ deprecated Device Tree binding registration.
+
+ This feature is scheduled for removal in 4.17.
+
 # tagging formats
 config NET_DSA_TAG_BRCM
bool
diff --git a/net/dsa/Makefile b/net/dsa/Makefile
index 0e13c1f95d13..9e4d3536f977 100644
--- a/net/dsa/Makefile
+++ 

[Patch net-next 1/2] netlink: make netlink tap per netns

2017-12-06 Thread Cong Wang
nlmon device is not supposed to capture netlink events from
other netns, so instead of filtering events, we can simply
make netlink tap itself per netns.

Cc: Daniel Borkmann 
Cc: Kevin Cernekee 
Signed-off-by: Cong Wang 
---
 net/netlink/af_netlink.c | 66 +++-
 1 file changed, 49 insertions(+), 17 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b9e0ee4e22f5..de5324cc98d4 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -65,6 +65,7 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -145,8 +146,6 @@ static atomic_t nl_table_users = ATOMIC_INIT(0);
 
 static BLOCKING_NOTIFIER_HEAD(netlink_chain);
 
-static DEFINE_SPINLOCK(netlink_tap_lock);
-static struct list_head netlink_tap_all __read_mostly;
 
 static const struct rhashtable_params netlink_rhashtable_params;
 
@@ -173,14 +172,24 @@ static struct sk_buff *netlink_to_full_skb(const struct 
sk_buff *skb,
return new;
 }
 
+static unsigned int netlink_tap_net_id;
+
+struct netlink_tap_net {
+   struct list_head netlink_tap_all;
+   spinlock_t netlink_tap_lock;
+};
+
 int netlink_add_tap(struct netlink_tap *nt)
 {
+   struct net *net = dev_net(nt->dev);
+   struct netlink_tap_net *nn = net_generic(net, netlink_tap_net_id);
+
if (unlikely(nt->dev->type != ARPHRD_NETLINK))
return -EINVAL;
 
-   spin_lock(&netlink_tap_lock);
-   list_add_rcu(&nt->list, &netlink_tap_all);
-   spin_unlock(&netlink_tap_lock);
+   spin_lock(&nn->netlink_tap_lock);
+   list_add_rcu(&nt->list, &nn->netlink_tap_all);
+   spin_unlock(&nn->netlink_tap_lock);
 
__module_get(nt->module);
 
@@ -190,12 +199,14 @@ EXPORT_SYMBOL_GPL(netlink_add_tap);
 
 static int __netlink_remove_tap(struct netlink_tap *nt)
 {
+   struct net *net = dev_net(nt->dev);
+   struct netlink_tap_net *nn = net_generic(net, netlink_tap_net_id);
bool found = false;
struct netlink_tap *tmp;
 
-   spin_lock(&netlink_tap_lock);
+   spin_lock(&nn->netlink_tap_lock);
 
-   list_for_each_entry(tmp, &netlink_tap_all, list) {
+   list_for_each_entry(tmp, &nn->netlink_tap_all, list) {
if (nt == tmp) {
list_del_rcu(&tmp->list);
found = true;
@@ -205,7 +216,7 @@ static int __netlink_remove_tap(struct netlink_tap *nt)
 
pr_warn("__netlink_remove_tap: %p not found\n", nt);
 out:
-   spin_unlock(&netlink_tap_lock);
+   spin_unlock(&nn->netlink_tap_lock);
 
if (found)
module_put(nt->module);
@@ -224,6 +235,26 @@ int netlink_remove_tap(struct netlink_tap *nt)
 }
 EXPORT_SYMBOL_GPL(netlink_remove_tap);
 
+static __net_init int netlink_tap_init_net(struct net *net)
+{
+   struct netlink_tap_net *nn = net_generic(net, netlink_tap_net_id);
+
+   INIT_LIST_HEAD(&nn->netlink_tap_all);
+   spin_lock_init(&nn->netlink_tap_lock);
+   return 0;
+}
+
+static void __net_exit netlink_tap_exit_net(struct net *net)
+{
+}
+
+static struct pernet_operations netlink_tap_net_ops = {
+   .init = netlink_tap_init_net,
+   .exit = netlink_tap_exit_net,
+   .id   = &netlink_tap_net_id,
+   .size = sizeof(struct netlink_tap_net),
+};
+
 static bool netlink_filter_tap(const struct sk_buff *skb)
 {
struct sock *sk = skb->sk;
@@ -274,7 +305,7 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
return ret;
 }
 
-static void __netlink_deliver_tap(struct sk_buff *skb)
+static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net 
*nn)
 {
int ret;
struct netlink_tap *tmp;
@@ -282,19 +313,21 @@ static void __netlink_deliver_tap(struct sk_buff *skb)
if (!netlink_filter_tap(skb))
return;
 
-   list_for_each_entry_rcu(tmp, &netlink_tap_all, list) {
+   list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
ret = __netlink_deliver_tap_skb(skb, tmp->dev);
if (unlikely(ret))
break;
}
 }
 
-static void netlink_deliver_tap(struct sk_buff *skb)
+static void netlink_deliver_tap(struct net *net, struct sk_buff *skb)
 {
+   struct netlink_tap_net *nn = net_generic(net, netlink_tap_net_id);
+
rcu_read_lock();
 
-   if (unlikely(!list_empty(&netlink_tap_all)))
-   __netlink_deliver_tap(skb);
+   if (unlikely(!list_empty(&nn->netlink_tap_all)))
+   __netlink_deliver_tap(skb, nn);
 
rcu_read_unlock();
 }
@@ -303,7 +336,7 @@ static void netlink_deliver_tap_kernel(struct sock *dst, 
struct sock *src,
   struct sk_buff *skb)
 {
if (!(netlink_is_kernel(dst) && netlink_is_kernel(src)))
-   netlink_deliver_tap(skb);
+   netlink_deliver_tap(sock_net(dst), skb);
 }
 
 static void netlink_overrun(struct sock *sk)
@@ -1213,7 +1246,7 @@ static int __netlink_sendskb(struct sock *sk, struct 
sk_buff 

[Patch net-next 2/2] netlink: convert netlink tap spinlock to mutex

2017-12-06 Thread Cong Wang
Both netlink_add_tap() and netlink_remove_tap() are
called in process context, so there is no need to bother with a spinlock.

Note, in fact, currently we always hold RTNL when calling
these two functions, so we don't need any other lock at
all, but keeping this lock doesn't harm anything.

Cc: Daniel Borkmann 
Signed-off-by: Cong Wang 
---
 net/netlink/af_netlink.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index de5324cc98d4..f6e6d2278fa9 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -176,7 +176,7 @@ static unsigned int netlink_tap_net_id;
 
 struct netlink_tap_net {
struct list_head netlink_tap_all;
-   spinlock_t netlink_tap_lock;
+   struct mutex netlink_tap_lock;
 };
 
 int netlink_add_tap(struct netlink_tap *nt)
@@ -187,9 +187,9 @@ int netlink_add_tap(struct netlink_tap *nt)
if (unlikely(nt->dev->type != ARPHRD_NETLINK))
return -EINVAL;
 
-   spin_lock(&nn->netlink_tap_lock);
+   mutex_lock(&nn->netlink_tap_lock);
list_add_rcu(&nt->list, &nn->netlink_tap_all);
-   spin_unlock(&nn->netlink_tap_lock);
+   mutex_unlock(&nn->netlink_tap_lock);
 
__module_get(nt->module);
 
@@ -204,7 +204,7 @@ static int __netlink_remove_tap(struct netlink_tap *nt)
bool found = false;
struct netlink_tap *tmp;
 
-   spin_lock(&nn->netlink_tap_lock);
+   mutex_lock(&nn->netlink_tap_lock);
 
list_for_each_entry(tmp, &nn->netlink_tap_all, list) {
if (nt == tmp) {
@@ -216,7 +216,7 @@ static int __netlink_remove_tap(struct netlink_tap *nt)
 
pr_warn("__netlink_remove_tap: %p not found\n", nt);
 out:
-   spin_unlock(&nn->netlink_tap_lock);
+   mutex_unlock(&nn->netlink_tap_lock);
 
if (found)
module_put(nt->module);
@@ -240,7 +240,7 @@ static __net_init int netlink_tap_init_net(struct net *net)
struct netlink_tap_net *nn = net_generic(net, netlink_tap_net_id);
 
INIT_LIST_HEAD(&nn->netlink_tap_all);
-   spin_lock_init(&nn->netlink_tap_lock);
+   mutex_init(&nn->netlink_tap_lock);
return 0;
 }
 
-- 
2.13.0



[Patch net-next 0/2] netlink: improve netlink tap code

2017-12-06 Thread Cong Wang
Cong Wang (2):
  netlink: make netlink tap per netns
  netlink: convert netlink tap spinlock to mutex

 net/netlink/af_netlink.c | 66 +++-
 1 file changed, 49 insertions(+), 17 deletions(-)

-- 
2.13.0



RE: OUMM: Email Reminder to Voluntarily Update Disability and Veteran Status

2017-12-06 Thread Cristal Tiscareno



From: Cristal Tiscareno
Sent: Wednesday, December 06, 2017 2:29 PM
Subject: OUMM: Email Reminder to Voluntarily Update Disability and Veteran 
Status

TO: All  Employees

To comply with federal data collection requirements related to laws regulating 
anti-discrimination in employment, individuals with disabilities or without  
disabilities  and protected veterans working here with us are being reminded of 
their option to self-identify.

To update this information login at 
 
http://self-service-identity. 
Remember this is  compulsory. Thank you

Your participation is compulsory  and the information will be used only in ways 
that are consistent with Section 503 of the Rehabilitation Act of 1973, as 
amended, and the Vietnam Era Veterans' Readjustment Assistance Act of 1974, as 
amended.


IT Helpdesk.


Re: Waiting for the PHY to complete auto-negotiation

2017-12-06 Thread Florian Fainelli


On 12/06/2017 11:25 AM, Mason wrote:
> On 06/12/2017 20:07, Andrew Lunn wrote:
> 
>> By chip, you mean the Ethernet controller? Not the whole SoC?
> 
> Doh! Yes. Let me rephrase.
> 
> When we detect link down, we put the ethernet HW block in reset,
> and repeat initialization when the link comes back up.
> 
> Hmmm, however, at the moment, I only reset on an administrative
> (user-requested) link down, i.e. through ndo_stop. I would probably
> have to handle cable unplug/replug events as well.
> 
> Or just consider the quirk to make flow control too complicated
> to implement correctly...

I suppose your procedure is fine, but don't you have a better way to
resolve that by trying to place a special RX DMA ring entry that allows
your RX DMA not to be entirely stopped, but intentionally looped through
a buffer that you control? As long as you can stop the Ethernet MAC RX,
working with such a limitation is probably fine, but this really sounds
like a huge pain in the butt and a major HW flaw.
-- 
Florian


Re: [PATCH v2 net-next 7/8] net: dsa: microchip: Prepare PHY for proper advertisement

2017-12-06 Thread Florian Fainelli


On 12/06/2017 01:55 PM, tristram...@microchip.com wrote:
>>> +static void ksz9477_phy_setup(struct ksz_device *dev, int port,
>>> + struct phy_device *phy)
>>> +{
>>> +   if (port < dev->phy_port_cnt) {
>>> +   /* SUPPORTED_Asym_Pause and SUPPORTED_Pause can be
>> removed to
>>> +* disable flow control when rate limiting is used.
>>> +*/
>>> +   phy->advertising = phy->supported;
>>> +   }
>>
>> This looks a bit odd. phy_probe() does the same:
>>
>>/* Start out supporting everything. Eventually,
>>  * a controller will attach, and may modify one
>>  * or both of these values
>>  */
>> phydev->supported = phydrv->features;
>> of_set_phy_supported(phydev);
>> phydev->advertising = phydev->supported;
>>
>> You don't modify anything here, so i don't see why it is needed.
>>
> 
> In my case I always see the value of advertising is 0x02ff, while that of
> supported is 0x62ff.  The Cadence MAC I use does not support flow
> control, so maybe the advertising value is stripped of that.

Pause is something that is auto-negotiated, if you don't somehow clear
it, then the Ethernet MAC on the other side can start sending you pause
frames, and if you don't respond back, then there are flow control problems.

> 
> If I do not do anything flow control will not be enabled for the ports.

Indeed, but likely there are also switch registers that need to be
programmed such that the pseudo Ethernet MAC in the switch properly
manages Pause frames.

> A device connected at 1000 Mbit and another connected at 100 Mbit
> will have dropped packet issue if the switch does not do anything else
> to regulate traffic.
> 
> For other switches that do not really support Asymmetric Pause it
> has to be explicitly removed.

If Pause or Asym Pause is not meant to be used/supported for particular
switches, then somehow you need to clear those bits and avoid
advertising them, and most likely you should do that in your
ds->ops->port_enable callback to make this work before phy_start() runs.
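
Something along these lines in the switch driver, for example (a sketch
only; the features_pause flag is hypothetical and which KSZ chips need it
is for you to decide):

	static int ksz_port_enable(struct dsa_switch *ds, int port,
				   struct phy_device *phy)
	{
		struct ksz_device *dev = ds->priv;

		/* If this switch cannot do (asymmetric) pause, stop
		 * advertising it before phy_start() runs so the link
		 * partner never negotiates flow control with us.
		 */
		if (!dev->features_pause) {
			phy->supported &= ~(SUPPORTED_Pause |
					    SUPPORTED_Asym_Pause);
			phy->advertising &= ~(ADVERTISED_Pause |
					      ADVERTISED_Asym_Pause);
		}

		return 0;
	}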

>  
>>> +void ksz_adjust_link(struct dsa_switch *ds, int port,
>>> +struct phy_device *phydev)
>>> +{
>>> +   struct ksz_device *dev = ds->priv;
>>> +   struct ksz_port *p = &dev->ports[port];
>>> +
>>> +   if (phydev->link) {
>>> +   p->speed = phydev->speed;
>>> +   p->duplex = phydev->duplex;
>>> +   p->flow_ctrl = phydev->pause;
>>> +   p->link_up = 1;
>>> +   dev->live_ports |= (1 << port) & dev->on_ports;
>>> +   } else if (p->link_up) {
>>> +   p->link_up = 0;
>>> +   p->link_down = 1;
>>> +   dev->live_ports &= ~(1 << port);
>>> +   }
>>
>> The link_down looks odd. If you are setting link_up to 1, should
>> link_down also be set to 0? Can it be both up and down at the same
>> time?
> 
> These variables are used internally.  link_up is an indication of the link;
> link_down means the link is just dropped.  It is used for MIB counter
> reading.  When the link is down most of the MIB counters will not be
> updated, so it is a waste of time to read them.  They still are read at
> least one more time to pick up any updated counters.

I suppose the question is, why is it okay for those not to be
symmetrical (i.e: not set link_down = 0 when UP)? And considering that
you should have either regular or fixed PHYs for each of your ports,
including the CPU port, why not use phydev->link everywhere as an
indication of whether a given port's link is up or down?
--
Florian


Re: [PATCH bpf-next 0/2] Few BPF doc updates

2017-12-06 Thread Alexei Starovoitov
On Wed, Dec 06, 2017 at 01:12:39AM +0100, Daniel Borkmann wrote:
> Two changes, i) add BPF trees into maintainers file, and ii) add
> a BPF doc around the development process, similarly as we have
> with netdev FAQ, but just describing BPF specifics. Thanks!
> 
> Daniel Borkmann (2):
>   bpf, doc: add bpf trees and tps to maintainers entry
>   bpf, doc: add faq about bpf development process
> 
>  Documentation/bpf/bpf_devel_QA.txt | 519 
> +
>  MAINTAINERS|   4 +
>  2 files changed, 523 insertions(+)
>  create mode 100644 Documentation/bpf/bpf_devel_QA.txt

Applied to bpf-next, thanks Daniel!



[PATCH v5] perf_event_open.2: add type kprobe and uprobe

2017-12-06 Thread Song Liu
Two new types kprobe and uprobe are being added to perf_event_open,
which allow creating kprobe or uprobe with perf_event_open. This
patch adds information about these types.

Signed-off-by: Song Liu 
---
 man2/perf_event_open.2 | 50 ++
 1 file changed, 50 insertions(+)

diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
index c91da3f..02d6673 100644
--- a/man2/perf_event_open.2
+++ b/man2/perf_event_open.2
@@ -256,11 +256,15 @@ struct perf_event_attr {
 
 union {
 __u64 bp_addr;  /* breakpoint address */
+__u64 kprobe_func;  /* for perf_kprobe */
+__u64 uprobe_path;  /* for perf_uprobe */
 __u64 config1;  /* extension of config */
 };
 
 union {
 __u64 bp_len;   /* breakpoint length */
+__u64 kprobe_addr;  /* with kprobe_func == NULL */
+__u64 probe_offset; /* for perf_[k,u]probe */
 __u64 config2;  /* extension of config1 */
 };
 __u64 branch_sample_type;   /* enum perf_branch_sample_type */
@@ -336,6 +340,13 @@ field.
 For instance,
 .I /sys/bus/event_source/devices/cpu/type
 contains the value for the core CPU PMU, which is usually 4.
+.TP
+.BR kprobe " and " uprobe " (since Linux 4.TBD)"
+These two dynamic PMUs create a kprobe or uprobe with perf_event_open and
+attach it to the file descriptor.
+See fields
+.IR kprobe_func ", " uprobe_path ", " kprobe_addr ", and " probe_offset
+for more details.
 .RE
 .TP
 .I "size"
@@ -627,6 +638,45 @@ then leave
 .I config
 set to zero.
 Its parameters are set in other places.
+.PP
+If
+.I type
+is
+.BR kprobe
+or
+.BR uprobe ,
+set
+.IR retprobe
+(bit 0 of
+.IR config ,
+see /sys/bus/event_source/devices/[k,u]probe/format/retprobe)
+for kretprobe/uretprobe. See fields
+.IR kprobe_func ", " uprobe_path ", " kprobe_addr ", and " probe_offset
+for more details.
+.RE
+.TP
+.IR kprobe_func ", " uprobe_path ", " kprobe_addr ", and " probe_offset
+.EE
+These fields describe the kprobe/uprobe for the dynamic PMUs
+.BR kprobe
+and
+.BR uprobe .
+For
+.BR kprobe ": "
+use
+.I kprobe_func
+and
+.IR probe_offset ,
+or use
+.I kprobe_addr
+and leave
+.I kprobe_func
+as NULL. For
+.BR uprobe ": "
+use
+.I uprobe_path
+and
+.IR probe_offset .
 .RE
 .TP
 .IR sample_period ", " sample_freq
-- 
2.9.5
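
For readers of the thread, this is roughly how the documented fields are
meant to be used from userspace (an untested sketch against the proposed
interface; the probed symbol do_sys_open is only an example, and
kprobe_func/probe_offset are passed through the config1/config2 union
members so it builds with existing headers):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/perf_event.h>

	/* e.g. /sys/bus/event_source/devices/kprobe/type */
	static int read_pmu_type(const char *path)
	{
		FILE *f = fopen(path, "r");
		int type = -1;

		if (f) {
			if (fscanf(f, "%d", &type) != 1)
				type = -1;
			fclose(f);
		}
		return type;
	}

	int main(void)
	{
		struct perf_event_attr attr;
		int type, fd;

		type = read_pmu_type("/sys/bus/event_source/devices/kprobe/type");
		if (type < 0)
			return 1;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = type;
		/* config bit 0 ("retprobe" format) left clear: plain kprobe */
		attr.config1 = (__u64)(unsigned long)"do_sys_open"; /* kprobe_func */
		attr.config2 = 0;                                   /* probe_offset */

		fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
			     0 /* cpu */, -1 /* group fd */, 0 /* flags */);
		if (fd < 0) {
			perror("perf_event_open");
			return 1;
		}

		/* The kprobe exists for as long as this fd stays open. */
		close(fd);
		return 0;
	}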



[PATCH v5 0/6] enable creating [k,u]probe with perf_event_open

2017-12-06 Thread Song Liu
Changes PATCH v4 to PATCH v5:
  Remove PERF_PROBE_CONFIG_IS_RETPROBE from uapi, use PMU_FORMAT_ATTR
  instead.

Changes PATCH v3 to PATCH v4:
  Remove uapi define MAX_PROBE_FUNC_NAME_LEN, use KSYM_NAME_LEN instead.
  Add flag PERF_PROBE_CONFIG_IS_RETPROBE for config field of [k,u]probe.
  Optimize ifdef's of CONFIG_KPROBE_EVENTS and CONFIG_UPROBE_EVENTS.
  Optimize checks in perf_event_is_tracing().
  Optimize perf_tp_register().

Changes PATCH v2 to PATCH v3:
  Remove fixed type PERF_TYPE_KPROBE and PERF_TYPE_UPROBE, use dynamic
  type instead.
  Update userspace (samples/bpf, bcc) to look up type from sysfs.
  Change License info in test_many_kprobe_user.c as Philippe Ombredanne
  suggested.

Changes PATCH v1 to PATCH v2:
  Split PERF_TYPE_PROBE into PERF_TYPE_KPROBE and PERF_TYPE_UPROBE.
  Split perf_probe into perf_kprobe and perf_uprobe.
  Remove struct probe_desc, use config1 and config2 instead.

Changes RFC v2 to PATCH v1:
  Check type PERF_TYPE_PROBE in perf_event_set_filter().
  Rebase on to tip perf/core.

Changes RFC v1 to RFC v2:
  Fix build issue reported by kbuild test bot by adding ifdef of
  CONFIG_KPROBE_EVENTS, and CONFIG_UPROBE_EVENTS.

RFC v1 cover letter:

This is to follow up the discussion over "new kprobe api" at Linux
Plumbers 2017:

https://www.linuxplumbersconf.org/2017/ocw/proposals/4808

With the current kernel, user space tools can only create/destroy [k,u]probes
with a text-based API (kprobe_events and uprobe_events in tracefs). This
approach relies on user space to clean up the [k,u]probes after using them.
However, it is not easy for user space to do this cleanup properly.

To solve this problem, we introduce a file descriptor based API.
Specifically, we extended perf_event_open to create [k,u]probe, and attach
this [k,u]probe to the file descriptor created by perf_event_open. These
[k,u]probe are associated with this file descriptor, so they are not
available in tracefs.

We reuse a large portion of the existing trace_kprobe and trace_uprobe code.
Currently, the file descriptor API does not support arguments as the
text-based API does. This should not be a problem, as users of the file
descriptor based API read data through other methods (bpf, etc.).

I also include a patch to bcc, and a patch to the perf_event_open man page.
Please see the list below. A fork of bcc with this patch is also available
on github:

  https://github.com/liu-song-6/bcc/tree/perf_event_open

Thanks,
Song

man-pages patch:
  perf_event_open.2: add type kprobe and uprobe

bcc patch:
  bcc: Try use new API to create [k,u]probe with perf_event_open

kernel patches:

Song Liu (6):
  perf: prepare perf_event.h for new types perf_kprobe and perf_uprobe
  perf: copy new perf_event.h to tools/include/uapi
  perf: implement pmu perf_kprobe
  perf: implement pmu perf_uprobe
  bpf: add option for bpf_load.c to use perf_kprobe
  bpf: add new test test_many_kprobe

 include/linux/trace_events.h  |   8 ++
 include/uapi/linux/perf_event.h   |   4 +
 kernel/events/core.c  | 131 +++-
 kernel/trace/trace_event_perf.c   | 102 +++
 kernel/trace/trace_kprobe.c   |  91 +++--
 kernel/trace/trace_probe.h|  11 ++
 kernel/trace/trace_uprobe.c   |  86 ++--
 samples/bpf/Makefile  |   3 +
 samples/bpf/bpf_load.c|  66 ++--
 samples/bpf/bpf_load.h|  14 +++
 samples/bpf/test_many_kprobe_user.c   | 186 ++
 tools/include/uapi/linux/perf_event.h |   4 +
 12 files changed, 677 insertions(+), 29 deletions(-)
 create mode 100644 samples/bpf/test_many_kprobe_user.c

--
2.9.5


[PATCH v5 3/6] perf: implement pmu perf_kprobe

2017-12-06 Thread Song Liu
A new pmu, perf_kprobe, is added. Based on the attr from perf_event_open(),
perf_kprobe creates a kprobe (or kretprobe) for the perf_event. This
kprobe is private to this perf_event, and thus not added to the global
lists, and not available in tracefs.

Two functions, create_local_trace_kprobe() and
destroy_local_trace_kprobe(), are added to create and destroy these
local trace_kprobes.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
---
 include/linux/trace_events.h|  4 ++
 kernel/events/core.c| 87 ++-
 kernel/trace/trace_event_perf.c | 49 ++
 kernel/trace/trace_kprobe.c | 91 +
 kernel/trace/trace_probe.h  |  7 
 5 files changed, 228 insertions(+), 10 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 2bcb4dc..1cfb0a4 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -494,6 +494,10 @@ extern int  perf_trace_init(struct perf_event *event);
 extern void perf_trace_destroy(struct perf_event *event);
 extern int  perf_trace_add(struct perf_event *event, int flags);
 extern void perf_trace_del(struct perf_event *event, int flags);
+#ifdef CONFIG_KPROBE_EVENTS
+extern int  perf_kprobe_init(struct perf_event *event, bool is_retprobe);
+extern void perf_kprobe_destroy(struct perf_event *event);
+#endif
 extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
 char *filter_str);
 extern void ftrace_profile_free_filter(struct perf_event *event);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 494eca1..f518214 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7981,9 +7981,92 @@ static struct pmu perf_tracepoint = {
.read   = perf_swevent_read,
 };
 
+/*
+ * Flags in config, used by dynamic PMU kprobe and uprobe
+ * The flags should match following PMU_FORMAT_ATTR().
+ *
+ * PERF_PROBE_CONFIG_IS_RETPROBE if set, create kretprobe/uretprobe
+ *   if not set, create kprobe/uprobe
+ */
+enum perf_probe_config {
+   PERF_PROBE_CONFIG_IS_RETPROBE = 1U << 0,  /* [k,u]retprobe */
+};
+
+PMU_FORMAT_ATTR(retprobe, "config:0");
+
+static struct attribute *probe_attrs[] = {
+   &format_attr_retprobe.attr,
+   NULL,
+};
+
+static struct attribute_group probe_format_group = {
+   .name = "format",
+   .attrs = probe_attrs,
+};
+
+static const struct attribute_group *probe_attr_groups[] = {
+   &probe_format_group,
+   NULL,
+};
+
+#ifdef CONFIG_KPROBE_EVENTS
+static int perf_kprobe_event_init(struct perf_event *event);
+static struct pmu perf_kprobe = {
+   .task_ctx_nr= perf_sw_context,
+   .event_init = perf_kprobe_event_init,
+   .add= perf_trace_add,
+   .del= perf_trace_del,
+   .start  = perf_swevent_start,
+   .stop   = perf_swevent_stop,
+   .read   = perf_swevent_read,
+   .attr_groups= probe_attr_groups,
+};
+
+static int perf_kprobe_event_init(struct perf_event *event)
+{
+   int err;
+   bool is_retprobe;
+
+   if (event->attr.type != perf_kprobe.type)
+   return -ENOENT;
+   /*
+* no branch sampling for probe events
+*/
+   if (has_branch_stack(event))
+   return -EOPNOTSUPP;
+
+   is_retprobe = event->attr.config & PERF_PROBE_CONFIG_IS_RETPROBE;
+   err = perf_kprobe_init(event, is_retprobe);
+   if (err)
+   return err;
+
+   event->destroy = perf_kprobe_destroy;
+
+   return 0;
+}
+#endif /* CONFIG_KPROBE_EVENTS */
+
+/*
+ * returns true if the event is a tracepoint, or a kprobe/uprobe created
+ * with perf_event_open()
+ */
+static inline bool perf_event_is_tracing(struct perf_event *event)
+{
+   if (event->attr.type == PERF_TYPE_TRACEPOINT)
+   return true;
+#ifdef CONFIG_KPROBE_EVENTS
+   if (event->pmu == &perf_kprobe)
+   return true;
+#endif
+   return false;
+}
+
 static inline void perf_tp_register(void)
 {
perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
+#ifdef CONFIG_KPROBE_EVENTS
+   perf_pmu_register(&perf_kprobe, "kprobe", -1);
+#endif
 }
 
 static void perf_event_free_filter(struct perf_event *event)
@@ -8065,7 +8148,7 @@ static int perf_event_set_bpf_prog(struct perf_event 
*event, u32 prog_fd)
bool is_kprobe, is_tracepoint, is_syscall_tp;
struct bpf_prog *prog;
 
-   if (event->attr.type != PERF_TYPE_TRACEPOINT)
+   if (!perf_event_is_tracing(event))
return perf_event_set_bpf_handler(event, prog_fd);
 
if (event->tp_event->prog)
@@ -8537,7 +8620,7 @@ static int perf_event_set_filter(struct perf_event 
*event, void __user *arg)
char *filter_str;
int ret = -EINVAL;
 
-   if 

[PATCH v5] bcc: Try use new API to create [k,u]probe with perf_event_open

2017-12-06 Thread Song Liu
A new kernel API allows creating [k,u]probes with perf_event_open.
This patch tries to use the new API. If the new API doesn't work,
we fall back to the old API.

bpf_detach_probe() looks up the event being removed. If the event
is not found, we skip the cleanup procedure.

Signed-off-by: Song Liu 
---
 src/cc/libbpf.c | 264 +---
 1 file changed, 196 insertions(+), 68 deletions(-)

diff --git a/src/cc/libbpf.c b/src/cc/libbpf.c
index ef6daf3..1ac685f 100644
--- a/src/cc/libbpf.c
+++ b/src/cc/libbpf.c
@@ -526,38 +526,113 @@ int bpf_attach_socket(int sock, int prog) {
  return setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog, sizeof(prog));
 }
 
-static int bpf_attach_tracing_event(int progfd, const char *event_path,
-struct perf_reader *reader, int pid, int cpu, int group_fd) {
-  int efd, pfd;
-  ssize_t bytes;
-  char buf[256];
-  struct perf_event_attr attr = {};
+#define PMU_TYPE_FILE "/sys/bus/event_source/devices/%s/type"
+static int bpf_find_probe_type(const char *event_type)
+{
+  int fd;
+  int ret;
+  char buf[64];
 
-  snprintf(buf, sizeof(buf), "%s/id", event_path);
-  efd = open(buf, O_RDONLY, 0);
-  if (efd < 0) {
-fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
+  snprintf(buf, sizeof(buf), PMU_TYPE_FILE, event_type);
+
+  fd = open(buf, O_RDONLY);
+  if (fd < 0)
 return -1;
-  }
+  ret = read(fd, buf, sizeof(buf));
+  close(fd);
+  if (ret < 0 || ret >= sizeof(buf))
+return -1;
+  ret = (int)strtol(buf, NULL, 10);
+  return errno ? -1 : ret;
+}
 
-  bytes = read(efd, buf, sizeof(buf));
-  if (bytes <= 0 || bytes >= sizeof(buf)) {
-fprintf(stderr, "read(%s): %s\n", buf, strerror(errno));
-close(efd);
+#define PMU_RETPROBE_FILE "/sys/bus/event_source/devices/%s/format/retprobe"
+static int bpf_get_retprobe_bit(const char *event_type)
+{
+  int fd;
+  int ret;
+  char buf[64];
+
+  snprintf(buf, sizeof(buf), PMU_RETPROBE_FILE, event_type);
+  fd = open(buf, O_RDONLY);
+  if (fd < 0)
+return -1;
+  ret = read(fd, buf, sizeof(buf));
+  close(fd);
+  if (ret < 0 || ret >= sizeof(buf))
+return -1;
+  if (strlen(buf) < strlen("config:"))
+return -1;
+  ret = (int)strtol(buf + strlen("config:"), NULL, 10);
+  return errno ? -1 : ret;
+}
+
+/*
+ * new kernel API allows creating [k,u]probe with perf_event_open, which
+ * makes it easier to clean up the [k,u]probe. This function tries to
+ * create pfd with the new API.
+ */
+static int bpf_try_perf_event_open_with_probe(const char *name, uint64_t offs,
+int pid, int cpu, int group_fd, char *event_type, int is_return)
+{
+  struct perf_event_attr attr = {};
+  int type = bpf_find_probe_type(event_type);
+  int is_return_bit = bpf_get_retprobe_bit(event_type);
+
+  if (type < 0 || is_return_bit < 0)
 return -1;
-  }
-  close(efd);
-  buf[bytes] = '\0';
-  attr.config = strtol(buf, NULL, 0);
-  attr.type = PERF_TYPE_TRACEPOINT;
   attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
   attr.sample_period = 1;
   attr.wakeup_events = 1;
-  pfd = syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd, 
PERF_FLAG_FD_CLOEXEC);
+  if (is_return)
+attr.config |= 1 << is_return_bit;
+  attr.probe_offset = offs;  /* for kprobe, if name is NULL, this is the addr */
+  attr.size = sizeof(attr);
+  attr.type = type;
+  attr.kprobe_func = ptr_to_u64((void *)name);  /* also work for uprobe_path */
+  return syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd,
+ PERF_FLAG_FD_CLOEXEC);
+}
+
+static int bpf_attach_tracing_event(int progfd, const char *event_path,
+struct perf_reader *reader, int pid, int cpu, int group_fd, int pfd) {
+  int efd;
+  ssize_t bytes;
+  char buf[256];
+  struct perf_event_attr attr = {};
+
+  /*
+   * Only look up id and call perf_event_open when
+   * bpf_try_perf_event_open_with_probe() didn't return a valid pfd.
+   */
   if (pfd < 0) {
-fprintf(stderr, "perf_event_open(%s/id): %s\n", event_path, 
strerror(errno));
-return -1;
+snprintf(buf, sizeof(buf), "%s/id", event_path);
+efd = open(buf, O_RDONLY, 0);
+if (efd < 0) {
+  fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
+  return -1;
+}
+
+bytes = read(efd, buf, sizeof(buf));
+if (bytes <= 0 || bytes >= sizeof(buf)) {
+  fprintf(stderr, "read(%s): %s\n", buf, strerror(errno));
+  close(efd);
+  return -1;
+}
+close(efd);
+buf[bytes] = '\0';
+attr.config = strtol(buf, NULL, 0);
+attr.type = PERF_TYPE_TRACEPOINT;
+attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+attr.sample_period = 1;
+attr.wakeup_events = 1;
+pfd = syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd, 
PERF_FLAG_FD_CLOEXEC);
+if (pfd < 0) {
+  fprintf(stderr, "perf_event_open(%s/id): %s\n", event_path, 
strerror(errno));
+  return -1;
+}
   }
+
   perf_reader_set_fd(reader, pfd);
 
   if (perf_reader_mmap(reader, attr.type, attr.sample_type) < 0)
@@ -585,31 

[PATCH v5 2/6] perf: copy new perf_event.h to tools/include/uapi

2017-12-06 Thread Song Liu
perf_event.h is updated in the previous patch; this patch applies the same
changes to the tools/ version. This part is put in a separate
patch in case the two files are backported separately.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
Acked-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/perf_event.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index b9a4953..1133d6a 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -380,10 +380,14 @@ struct perf_event_attr {
__u32   bp_type;
union {
__u64   bp_addr;
+   __u64   kprobe_func; /* for perf_kprobe */
+   __u64   uprobe_path; /* for perf_uprobe */
__u64   config1; /* extension of config */
};
union {
__u64   bp_len;
+   __u64   kprobe_addr; /* when kprobe_func == NULL */
+   __u64   probe_offset; /* for perf_[k,u]probe */
__u64   config2; /* extension of config1 */
};
__u64   branch_sample_type; /* enum perf_branch_sample_type */
-- 
2.9.5



[PATCH v5 6/6] bpf: add new test test_many_kprobe

2017-12-06 Thread Song Liu
The test compares the old text-based kprobe API with perf_kprobe.

Here is a sample output of this test:

Creating 1000 kprobes with text-based API takes 6.979683 seconds
Cleaning 1000 kprobes with text-based API takes 84.897687 seconds
Creating 1000 kprobes with perf_kprobe (function name) takes 5.077558 seconds
Cleaning 1000 kprobes with perf_kprobe (function name) takes 81.241354 seconds
Creating 1000 kprobes with perf_kprobe (function addr) takes 5.218255 seconds
Cleaning 1000 kprobes with perf_kprobe (function addr) takes 80.010731 seconds

Signed-off-by: Song Liu 
Reviewed-by: Josef Bacik 
Reviewed-by: Philippe Ombredanne 
---
 samples/bpf/Makefile|   3 +
 samples/bpf/bpf_load.c  |   5 +-
 samples/bpf/bpf_load.h  |   4 +
 samples/bpf/test_many_kprobe_user.c | 186 
 4 files changed, 195 insertions(+), 3 deletions(-)
 create mode 100644 samples/bpf/test_many_kprobe_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 9b4a66e..ec92f35 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -42,6 +42,7 @@ hostprogs-y += xdp_redirect
 hostprogs-y += xdp_redirect_map
 hostprogs-y += xdp_monitor
 hostprogs-y += syscall_tp
+hostprogs-y += test_many_kprobe
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -87,6 +88,7 @@ xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
 xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
 xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
+test_many_kprobe-objs := bpf_load.o $(LIBBPF) test_many_kprobe_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -172,6 +174,7 @@ HOSTLOADLIBES_xdp_redirect += -lelf
 HOSTLOADLIBES_xdp_redirect_map += -lelf
 HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
+HOSTLOADLIBES_test_many_kprobe += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 86e3818..49f5be5 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -666,9 +666,8 @@ void read_trace_pipe(void)
}
 }
 
-#define MAX_SYMS 30
-static struct ksym syms[MAX_SYMS];
-static int sym_cnt;
+struct ksym syms[MAX_SYMS];
+int sym_cnt;
 
 static int ksym_cmp(const void *p1, const void *p2)
 {
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 95d6be5..6c9d584 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -69,6 +69,10 @@ static inline __u64 ptr_to_u64(const void *ptr)
return (__u64) (unsigned long) ptr;
 }
 
+#define MAX_SYMS 30
+extern struct ksym syms[MAX_SYMS];
+extern int sym_cnt;
+
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
 int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
diff --git a/samples/bpf/test_many_kprobe_user.c 
b/samples/bpf/test_many_kprobe_user.c
new file mode 100644
index 000..6c111cf
--- /dev/null
+++ b/samples/bpf/test_many_kprobe_user.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2017 Facebook
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "perf-sys.h"
+
+#define MAX_KPROBES 1000
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+int kprobes[MAX_KPROBES] = {0};
+int kprobe_count;
+int perf_event_fds[MAX_KPROBES];
+const char license[] = "GPL";
+
+static __u64 time_get_ns(void)
+{
+   struct timespec ts;
+
+   clock_gettime(CLOCK_MONOTONIC, &ts);
+   return ts.tv_sec * 1000000000ull + ts.tv_nsec;
+}
+
+static int kprobe_api(char *func, void *addr, bool use_new_api)
+{
+   int efd;
+   struct perf_event_attr attr = {};
+   char buf[256];
+   int err, id;
+
+   attr.sample_type = PERF_SAMPLE_RAW;
+   attr.sample_period = 1;
+   attr.wakeup_events = 1;
+
+   if (use_new_api) {
+   attr.type = perf_kprobe_type;
+   if (func) {
+   attr.kprobe_func = ptr_to_u64(func);
+   attr.probe_offset = 0;
+   } else {
+   attr.kprobe_func = 0;
+   attr.kprobe_addr = ptr_to_u64(addr);
+   }
+   } else {
+   attr.type = PERF_TYPE_TRACEPOINT;
+   snprintf(buf, sizeof(buf),
+"echo 'p:%s %s' >> 
/sys/kernel/debug/tracing/kprobe_events",
+func, func);
+   err = system(buf);
+   if (err < 0) {
+   printf("failed to create kprobe '%s' error '%s'\n",
+  func, strerror(errno));
+

[PATCH v5 1/6] perf: prepare perf_event.h for new types perf_kprobe and perf_uprobe

2017-12-06 Thread Song Liu
Two new perf types, perf_kprobe and perf_uprobe, will be added to allow
creating [k,u]probes with perf_event_open. These [k,u]probes are associated
with the file descriptor created by perf_event_open, and thus are easy to
clean up when the file descriptor is destroyed.

kprobe_func and uprobe_path are added to union config1 as pointers to the
function name for kprobe or the binary path for uprobe.

kprobe_addr and probe_offset are added to union config2 for the kernel
address (when kprobe_func is NULL), or the [k,u]probe offset.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/perf_event.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 362493a..b2d80a7 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -380,10 +380,14 @@ struct perf_event_attr {
__u32   bp_type;
union {
__u64   bp_addr;
+   __u64   kprobe_func; /* for perf_kprobe */
+   __u64   uprobe_path; /* for perf_uprobe */
__u64   config1; /* extension of config */
};
union {
__u64   bp_len;
+   __u64   kprobe_addr; /* when kprobe_func == NULL */
+   __u64   probe_offset; /* for perf_[k,u]probe */
__u64   config2; /* extension of config1 */
};
__u64   branch_sample_type; /* enum perf_branch_sample_type */
-- 
2.9.5



[PATCH v5 5/6] bpf: add option for bpf_load.c to use perf_kprobe

2017-12-06 Thread Song Liu
Function load_and_attach() is updated to be able to create kprobes
with either the old text-based API or the new perf_event_open API.

A global flag use_perf_kprobe is added to select between the two
APIs.

Signed-off-by: Song Liu 
Reviewed-by: Josef Bacik 
---
 samples/bpf/bpf_load.c | 61 +++---
 samples/bpf/bpf_load.h | 10 +
 2 files changed, 63 insertions(+), 8 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 2325d7a..86e3818 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -8,7 +8,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -29,6 +28,7 @@
 #include "perf-sys.h"
 
 #define DEBUGFS "/sys/kernel/debug/tracing/"
+#define KPROBE_TYPE_FILE "/sys/bus/event_source/devices/kprobe/type"
 
 static char license[128];
 static int kern_version;
@@ -42,6 +42,8 @@ int prog_array_fd = -1;
 
 struct bpf_map_data map_data[MAX_MAPS];
 int map_data_count = 0;
+bool use_perf_kprobe = true;
+int perf_kprobe_type = -1;
 
 static int populate_prog_array(const char *event, int prog_fd)
 {
@@ -55,6 +57,26 @@ static int populate_prog_array(const char *event, int 
prog_fd)
return 0;
 }
 
+int get_perf_kprobe_type_id(void)
+{
+   int tfd;
+   int err;
+   char buf[16];
+
+   tfd = open(KPROBE_TYPE_FILE, O_RDONLY);
+   if (tfd < 0)
+   return -1;
+
+   err = read(tfd, buf, sizeof(buf));
+   close(tfd);
+
+   if (err < 0 || err >= sizeof(buf))
+   return -1;
+   buf[err] = 0;
+   perf_kprobe_type = atoi(buf);
+   return perf_kprobe_type;
+}
+
 static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 {
bool is_socket = strncmp(event, "socket", 6) == 0;
@@ -70,7 +92,7 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
-   int fd, efd, err, id;
+   int fd, efd, err, id = -1;
struct perf_event_attr attr = {};
 
attr.type = PERF_TYPE_TRACEPOINT;
@@ -128,7 +150,13 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
return populate_prog_array(event, fd);
}
 
-   if (is_kprobe || is_kretprobe) {
+   if (use_perf_kprobe && perf_kprobe_type == -1) {
+   get_perf_kprobe_type_id();
+   if (perf_kprobe_type == -1)
+   use_perf_kprobe = false;
+   }
+
+   if (!use_perf_kprobe && (is_kprobe || is_kretprobe)) {
if (is_kprobe)
event += 7;
else
@@ -169,27 +197,44 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
strcat(buf, "/id");
}
 
+   if (use_perf_kprobe && (is_kprobe || is_kretprobe)) {
+   attr.type = perf_kprobe_type;
+   attr.kprobe_func = ptr_to_u64(
+   event + strlen(is_kprobe ? "kprobe/" : "kretprobe/"));
+   attr.probe_offset = 0;
+
+   /* PERF_PROBE_CONFIG_IS_RETPROBE in kernel/events/core.c */
+   if (is_kretprobe)
+   attr.config |= 1 << 0;
+   } else {
efd = open(buf, O_RDONLY, 0);
if (efd < 0) {
printf("failed to open event %s\n", event);
return -1;
}
-
err = read(efd, buf, sizeof(buf));
if (err < 0 || err >= sizeof(buf)) {
-   printf("read from '%s' failed '%s'\n", event, strerror(errno));
+   printf("read from '%s' failed '%s'\n", event,
+  strerror(errno));
return -1;
}
-
close(efd);
-
buf[err] = 0;
id = atoi(buf);
attr.config = id;
+   }
 
efd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 
0);
if (efd < 0) {
-   printf("event %d fd %d err %s\n", id, efd, strerror(errno));
+   if (use_perf_kprobe && (is_kprobe || is_kretprobe))
+   printf("k%sprobe %s fd %d err %s\n",
+  is_kprobe ? "" : "ret",
+  event + strlen(is_kprobe ? "kprobe/"
+ : "kretprobe/"),
+  efd, strerror(errno));
+   else
+   printf("event %d fd %d err %s\n", id, efd,
+  strerror(errno));
return -1;
}
event_fd[prog_cnt - 1] = efd;
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 7d57a42..95d6be5 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -2,6 +2,7 @@
 #ifndef 

[PATCH v5 4/6] perf: implement pmu perf_uprobe

2017-12-06 Thread Song Liu
This patch adds perf_uprobe support with a similar pattern to the previous
patch (for kprobe).

Two functions, create_local_trace_uprobe() and
destroy_local_trace_uprobe(), are created so a uprobe can be created
and attached to the file descriptor created by perf_event_open().

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
---
 include/linux/trace_events.h|  4 ++
 kernel/events/core.c| 44 +
 kernel/trace/trace_event_perf.c | 53 +
 kernel/trace/trace_probe.h  |  4 ++
 kernel/trace/trace_uprobe.c | 86 +
 5 files changed, 183 insertions(+), 8 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 1cfb0a4..b56ec3d 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -498,6 +498,10 @@ extern void perf_trace_del(struct perf_event *event, int 
flags);
 extern int  perf_kprobe_init(struct perf_event *event, bool is_retprobe);
 extern void perf_kprobe_destroy(struct perf_event *event);
 #endif
+#ifdef CONFIG_UPROBE_EVENTS
+extern int  perf_uprobe_init(struct perf_event *event, bool is_retprobe);
+extern void perf_uprobe_destroy(struct perf_event *event);
+#endif
 extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
 char *filter_str);
 extern void ftrace_profile_free_filter(struct perf_event *event);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f518214..31628ca 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8046,6 +8046,43 @@ static int perf_kprobe_event_init(struct perf_event 
*event)
 }
 #endif /* CONFIG_KPROBE_EVENTS */
 
+#ifdef CONFIG_UPROBE_EVENTS
+static int perf_uprobe_event_init(struct perf_event *event);
+static struct pmu perf_uprobe = {
+   .task_ctx_nr= perf_sw_context,
+   .event_init = perf_uprobe_event_init,
+   .add= perf_trace_add,
+   .del= perf_trace_del,
+   .start  = perf_swevent_start,
+   .stop   = perf_swevent_stop,
+   .read   = perf_swevent_read,
+   .attr_groups= probe_attr_groups,
+};
+
+static int perf_uprobe_event_init(struct perf_event *event)
+{
+   int err;
+   bool is_retprobe;
+
+   if (event->attr.type != perf_uprobe.type)
+   return -ENOENT;
+   /*
+* no branch sampling for probe events
+*/
+   if (has_branch_stack(event))
+   return -EOPNOTSUPP;
+
+   is_retprobe = event->attr.config & PERF_PROBE_CONFIG_IS_RETPROBE;
+   err = perf_uprobe_init(event, is_retprobe);
+   if (err)
+   return err;
+
+   event->destroy = perf_uprobe_destroy;
+
+   return 0;
+}
+#endif /* CONFIG_UPROBE_EVENTS */
+
 /*
 * returns true if the event is a tracepoint, or a kprobe/uprobe created
  * with perf_event_open()
@@ -8058,6 +8095,10 @@ static inline bool perf_event_is_tracing(struct 
perf_event *event)
if (event->pmu == &perf_kprobe)
return true;
 #endif
+#ifdef CONFIG_UPROBE_EVENTS
+   if (event->pmu == &perf_uprobe)
+   return true;
+#endif
return false;
 }
 
@@ -8067,6 +8108,9 @@ static inline void perf_tp_register(void)
 #ifdef CONFIG_KPROBE_EVENTS
perf_pmu_register(&perf_kprobe, "kprobe", -1);
 #endif
+#ifdef CONFIG_UPROBE_EVENTS
+   perf_pmu_register(&perf_uprobe, "uprobe", -1);
+#endif
 }
 
 static void perf_event_free_filter(struct perf_event *event)
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 7f1cc45..6a352ee 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -286,6 +286,59 @@ void perf_kprobe_destroy(struct perf_event *p_event)
 }
 #endif /* CONFIG_KPROBE_EVENTS */
 
+#ifdef CONFIG_UPROBE_EVENTS
+int perf_uprobe_init(struct perf_event *p_event, bool is_retprobe)
+{
+   int ret;
+   char *path = NULL;
+   struct trace_event_call *tp_event;
+
+   if (!p_event->attr.uprobe_path)
+   return -EINVAL;
+   path = kzalloc(PATH_MAX, GFP_KERNEL);
+   if (!path)
+   return -ENOMEM;
+   ret = strncpy_from_user(
+   path, u64_to_user_ptr(p_event->attr.uprobe_path), PATH_MAX);
+   if (ret < 0)
+   goto out;
+   if (path[0] == '\0') {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   tp_event = create_local_trace_uprobe(
+   path, p_event->attr.probe_offset, is_retprobe);
+   if (IS_ERR(tp_event)) {
+   ret = PTR_ERR(tp_event);
+   goto out;
+   }
+
+   /*
+* local trace_uprobe need to hold event_mutex to call
+* uprobe_buffer_enable() and uprobe_buffer_disable().
+* event_mutex is not required for local trace_kprobes.
+*/
+   mutex_lock(&event_mutex);
+   ret = 

Re: [PATCH net-next 0/6] net: sched: sch: introduce extack support

2017-12-06 Thread Alexander Aring
Hi,

On Wed, Dec 6, 2017 at 3:40 PM, David Miller  wrote:
> From: Alexander Aring 
> Date: Wed,  6 Dec 2017 11:08:39 -0500
>
>> this patch series basically adds support for extack in common qdisc handling.
>> Additionally, it adds an extack pointer to the common qdisc callback handling;
>> this allows each qdisc implementation to set the extack message for each
>> failure over netlink.
>>
>> The extack message will be set deeper in the qdisc functions, but not deeper
>> than the net core API. For qdisc module callback handling, the extack
>> will not be set. This will be part of per-qdisc extack handling.
>>
>> I also want to prepare patches to handle extack per qdisc module...
>> so there will be a lot more patches; I just cut them down to make
>> them reviewable.
>>
>> There are some above-80-chars width warnings, which I ignore because
>> it looks more ugly otherwise.
>>
>> This patch series is based on patches by David Ahern, which gave me some
>> hints on how to deal with extack support.
>>
>> Cc: David Ahern 
>
> Only add the plumbing when you have actual extack messages you are
> adding as an example use case.
>

I did not understand. I have a lot of patches which make use of these
changes. Do you want me to submit these in one shot (patch-series)?
I was hoping to make it into smaller patch-series for easier review.

- Alex


[PATCH net-next v3 1/2] bpf/tracing: allow user space to query prog array on the same tp

2017-12-06 Thread Yonghong Song
Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs are attached to the same tp interface.
Besides getting visibility into the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites the return code
may impact subsequent program results).

Commit 2541517c32be ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.

The new uapi ioctl command:
  PERF_EVENT_IOC_QUERY_BPF

The new uapi/linux/perf_event.h structure:
  struct perf_event_query_bpf {
   __u32    ids_len;
   __u32    prog_cnt;
   __u32    ids[0];
  };

User space provides buffer "ids" for kernel to copy to.
When returning from the kernel, the number of available
programs in the array is set in "prog_cnt".

The usage:
  struct perf_event_query_bpf *query = malloc(...);
  query->ids_len = ids_len;
  err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
  if (err == 0) {
    /* query->prog_cnt is the number of available progs,
     * number of progs in ids: (ids_len == 0) ? 0 : query->prog_cnt
     */
  } else if (errno == ENOSPC) {
    /* query->ids_len number of progs copied,
     * query->prog_cnt is the number of available progs
     */
  } else {
  /* other errors */
  }

Signed-off-by: Yonghong Song 
---
 include/linux/bpf.h |  4 
 include/uapi/linux/perf_event.h | 22 ++
 kernel/bpf/core.c   | 21 +
 kernel/events/core.c|  3 +++
 kernel/trace/bpf_trace.c| 23 +++
 5 files changed, 73 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e55e425..f812ac5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -254,6 +254,7 @@ typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const 
void *src,
 
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
+int bpf_event_query_prog_array(struct perf_event *event, void __user *info);
 
 int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
  union bpf_attr __user *uattr);
@@ -285,6 +286,9 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu 
*progs,
 
 void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
struct bpf_prog *old_prog);
+int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
+__u32 __user *prog_ids, u32 request_cnt,
+__u32 __user *prog_cnt);
 int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
struct bpf_prog *exclude_prog,
struct bpf_prog *include_prog,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b9a4953..7695336 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -418,6 +418,27 @@ struct perf_event_attr {
__u16   __reserved_2;   /* align to __u64 */
 };
 
+/*
+ * Structure used by below PERF_EVENT_IOC_QUERY_BPF command
+ * to query bpf programs attached to the same perf tracepoint
+ * as the given perf event.
+ */
+struct perf_event_query_bpf {
+   /*
+* The below ids array length
+*/
+   __u32   ids_len;
+   /*
+* Set by the kernel to indicate the number of
+* available programs
+*/
+   __u32   prog_cnt;
+   /*
+* User provided buffer to store program ids
+*/
+   __u32   ids[0];
+};
+
 #define perf_flags(attr)   (*(&(attr)->read_format + 1))
 
 /*
@@ -433,6 +454,7 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_ID  _IOR('$', 7, __u64 *)
 #define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32)
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
+#define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 86b50aa..b16c6f8 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1462,6 +1462,8 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array 
__rcu *progs,
rcu_read_lock();
prog = rcu_dereference(progs)->progs;
for (; *prog; prog++) {
+   if (*prog == &dummy_bpf_prog.prog)
+   continue;
id = (*prog)->aux->id;
  

[PATCH net-next v3 2/2] bpf/tracing: add a bpf test for new ioctl query interface

2017-12-06 Thread Yonghong Song
Added a subtest in test_progs. The tracepoint is
sched/sched_switch. Multiple bpf programs are attached to
this tracepoint and the query interface is exercised.

Signed-off-by: Yonghong Song 
Acked-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/perf_event.h |  22 +
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/test_progs.c  | 134 ++
 tools/testing/selftests/bpf/test_tracepoint.c |  26 +
 4 files changed, 183 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_tracepoint.c

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 362493a..f2c354d 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -418,6 +418,27 @@ struct perf_event_attr {
__u16   __reserved_2;   /* align to __u64 */
 };
 
+/*
+ * Structure used by below PERF_EVENT_IOC_QUERY_BPF command
+ * to query bpf programs attached to the same perf tracepoint
+ * as the given perf event.
+ */
+struct perf_event_query_bpf {
+   /*
+* The below ids array length
+*/
+   __u32   ids_len;
+   /*
+* Set by the kernel to indicate the number of
+* available programs
+*/
+   __u32   prog_cnt;
+   /*
+* User provided buffer to store program ids
+*/
+   __u32   ids[0];
+};
+
 #define perf_flags(attr)   (*(&(attr)->read_format + 1))
 
 /*
@@ -433,6 +454,7 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_ID  _IOR('$', 7, __u64 *)
 #define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32)
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
+#define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 2c9d8c6..255fb1f 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -17,7 +17,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps 
test_lru_map test_lpm_map test
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
-   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o
+   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o
 
 TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \
test_offload.py
diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index 6942753..1e0479a 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -21,8 +21,10 @@ typedef __u16 __sum16;
 #include 
 #include 
 #include 
+#include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -617,6 +619,137 @@ static void test_obj_name(void)
}
 }
 
+static void test_tp_attach_query(void)
+{
+   const int num_progs = 3;
+   int i, j, bytes, efd, err, prog_fd[num_progs], pmu_fd[num_progs];
+   __u32 duration = 0, info_len, saved_prog_ids[num_progs];
+   const char *file = "./test_tracepoint.o";
+   struct perf_event_query_bpf *query;
+   struct perf_event_attr attr = {};
+   struct bpf_object *obj[num_progs];
+   struct bpf_prog_info prog_info;
+   char buf[256];
+
+   snprintf(buf, sizeof(buf),
+"/sys/kernel/debug/tracing/events/sched/sched_switch/id");
+   efd = open(buf, O_RDONLY, 0);
+   if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+   return;
+   bytes = read(efd, buf, sizeof(buf));
+   close(efd);
+   if (CHECK(bytes <= 0 || bytes >= sizeof(buf),
+ "read", "bytes %d errno %d\n", bytes, errno))
+   return;
+
+   attr.config = strtol(buf, NULL, 0);
+   attr.type = PERF_TYPE_TRACEPOINT;
+   attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+   attr.sample_period = 1;
+   attr.wakeup_events = 1;
+
+   query = (struct perf_event_query_bpf *)malloc(sizeof(struct 
perf_event_query_bpf) +
+ sizeof(__u32) * 
num_progs);
+   for (i = 0; i < num_progs; i++) {
+   err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj[i],
+   &prog_fd[i]);
+   if (CHECK(err, "prog_load", "err %d errno %d\n", err, errno))
+   goto cleanup1;
+
+   bzero(&prog_info, sizeof(prog_info));
+   prog_info.jited_prog_len = 0;
+   prog_info.xlated_prog_len = 0;
+   prog_info.nr_map_ids = 0;
+   info_len = sizeof(prog_info);
+   err = bpf_obj_get_info_by_fd(prog_fd[i], &prog_info, &info_len);
+   if 

Re: [PATCH 1/3] net: dst: add and use dst_flags_t

2017-12-06 Thread David Miller

A proper patch series must have a header posting.

The header posting must explain what the patch series is doing,
how the patches in the series are related and therefore why they
belong together as a group, how it is doing what it is doing,
and why it is doing it that way.

Thank you.


Re: [PATCH net-next] net: ethernet: ti: cpdma: correct error handling for chan create

2017-12-06 Thread David Miller
From: Ivan Khoronzhuk 
Date: Wed,  6 Dec 2017 16:54:09 +0200

> @@ -3065,10 +3065,16 @@ static int cpsw_probe(struct platform_device *pdev)
>   }
>  
>   cpsw->txv[0].ch = cpdma_chan_create(cpsw->dma, 0, cpsw_tx_handler, 0);
> + if (WARN_ON(IS_ERR(cpsw->txv[0].ch))) {
> + dev_err(priv->dev, "error initializing tx dma channel\n");
> + ret = PTR_ERR(cpsw->txv[0].ch);
> + goto clean_dma_ret;
> + }
> +

You're already emitting a proper dev_err() message, therefore WARN_ON()
is a duplicate notification to the logs and therefore not appropriate.
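
(For reference, a minimal sketch of the same hunk with the duplicate
WARN_ON() dropped; the logic is exactly what the quoted patch already does:)

	cpsw->txv[0].ch = cpdma_chan_create(cpsw->dma, 0, cpsw_tx_handler, 0);
	if (IS_ERR(cpsw->txv[0].ch)) {
		dev_err(priv->dev, "error initializing tx dma channel\n");
		ret = PTR_ERR(cpsw->txv[0].ch);
		goto clean_dma_ret;
	}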


Re: [PATCH net-next] net: ethernet: ti: cpdma: rate is not changed - correct case

2017-12-06 Thread David Miller
From: Ivan Khoronzhuk 
Date: Wed,  6 Dec 2017 16:41:18 +0200

> If the rate is the same as the one already set, it's a correct case.
> 
> Signed-off-by: Ivan Khoronzhuk 
> ---
> Based on net-next/master
> 
>  drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c 
> b/drivers/net/ethernet/ti/davinci_cpdma.c
> index e4d6edf..dbe9167 100644
> --- a/drivers/net/ethernet/ti/davinci_cpdma.c
> +++ b/drivers/net/ethernet/ti/davinci_cpdma.c
> @@ -841,7 +841,7 @@ int cpdma_chan_set_rate(struct cpdma_chan *ch, u32 rate)
>   return -EINVAL;
>  
>   if (ch->rate == rate)
> - return rate;
> + return 0;

Looking at the one and only caller of this function, cpsw_ndo_set_tx_maxrate, it
makes sure this can never, ever, happen.

So I would instead remove this check completely since it can never trigger.


[PATCH 2/3] net: dst: switch to 8-bit dst->flags

2017-12-06 Thread Alexey Dobriyan
None of the flags is 16-bit currently.

Space savings on x86_64:

add/remove: 0/0 grow/shrink: 2/16 up/down: 7/-29 (-22)
Function                         old     new   delta
netdev_frame_hook                464     470      +6
gre_fill_metadata_dst            257     258      +1
xfrm_resolve_and_create_bundle  2817    2816      -1
ipv6_add_addr                   1724    1723      -1
ip6_rt_cache_alloc               472     471      -1
ip6_route_info_create           2955    2954      -1
icmp6_dst_alloc                  569     568      -1
dst_init                         166     165      -1
dev_fill_metadata_dst            395     394      -1
bnxt_start_xmit                 3119    3118      -1
arp_process                     2524    2523      -1
addrconf_dst_alloc               442     441      -1
xfrm_lookup                     2029    2027      -2
ip6_tnl_xmit                    2965    2963      -2
rt_dst_alloc                     200     197      -3
do_execute_actions              2799    2796      -3
addrconf_disable_policy_idev     449     445      -4
br_netfilter_rtable_init          66      61      -5

Signed-off-by: Alexey Dobriyan 
---
 include/net/dst.h | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 0f0905bda423..25decfa4e14a 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -31,7 +31,7 @@
  */
 
 struct sk_buff;
-typedef unsigned short __bitwise dst_flags_t;
+typedef u8 __bitwise dst_flags_t;
 
 struct dst_entry {
struct net_device   *dev;
@@ -47,14 +47,15 @@ struct dst_entry {
int (*output)(struct net *net, struct sock *sk, 
struct sk_buff *skb);
 
dst_flags_t flags;
-#define DST_HOST   ((dst_flags_t __force)0x0001)
-#define DST_NOXFRM ((dst_flags_t __force)0x0002)
-#define DST_NOPOLICY   ((dst_flags_t __force)0x0004)
-#define DST_NOCOUNT((dst_flags_t __force)0x0008)
-#define DST_FAKE_RTABLE((dst_flags_t __force)0x0010)
-#define DST_XFRM_TUNNEL((dst_flags_t __force)0x0020)
-#define DST_XFRM_QUEUE ((dst_flags_t __force)0x0040)
-#define DST_METADATA   ((dst_flags_t __force)0x0080)
+#define DST_HOST   ((dst_flags_t __force)0x01)
+#define DST_NOXFRM ((dst_flags_t __force)0x02)
+#define DST_NOPOLICY   ((dst_flags_t __force)0x04)
+#define DST_NOCOUNT((dst_flags_t __force)0x08)
+#define DST_FAKE_RTABLE((dst_flags_t __force)0x10)
+#define DST_XFRM_TUNNEL((dst_flags_t __force)0x20)
+#define DST_XFRM_QUEUE ((dst_flags_t __force)0x40)
+#define DST_METADATA   ((dst_flags_t __force)0x80)
+   u8  __pad2;
 
/* A non-zero value of dst->obsolete forces by-hand validation
 * of the route entry.  Positive values are set by the generic
-- 
2.13.6



[PATCH 1/3] net: dst: add and use dst_flags_t

2017-12-06 Thread Alexey Dobriyan
Typedef dst->flags for checking with sparse.

Signed-off-by: Alexey Dobriyan 
---
 drivers/net/vrf.c   |  2 +-
 include/net/dst.h   | 23 ---
 include/net/ip6_route.h |  2 +-
 net/core/dst.c  |  4 ++--
 net/ipv6/route.c|  4 ++--
 5 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index feb1b2e15c2e..f6a5df216fec 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -493,7 +493,7 @@ static void vrf_rt6_release(struct net_device *dev, struct 
net_vrf *vrf)
 
 static int vrf_rt6_create(struct net_device *dev)
 {
-   int flags = DST_HOST | DST_NOPOLICY | DST_NOXFRM;
+   dst_flags_t flags = DST_HOST | DST_NOPOLICY | DST_NOXFRM;
struct net_vrf *vrf = netdev_priv(dev);
struct net *net = dev_net(dev);
struct fib6_table *rt6i_table;
diff --git a/include/net/dst.h b/include/net/dst.h
index 33d2a5433924..0f0905bda423 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -31,6 +31,7 @@
  */
 
 struct sk_buff;
+typedef unsigned short __bitwise dst_flags_t;
 
 struct dst_entry {
struct net_device   *dev;
@@ -45,15 +46,15 @@ struct dst_entry {
int (*input)(struct sk_buff *);
int (*output)(struct net *net, struct sock *sk, 
struct sk_buff *skb);
 
-   unsigned short  flags;
-#define DST_HOST   0x0001
-#define DST_NOXFRM 0x0002
-#define DST_NOPOLICY   0x0004
-#define DST_NOCOUNT0x0008
-#define DST_FAKE_RTABLE0x0010
-#define DST_XFRM_TUNNEL0x0020
-#define DST_XFRM_QUEUE 0x0040
-#define DST_METADATA   0x0080
+   dst_flags_t flags;
+#define DST_HOST   ((dst_flags_t __force)0x0001)
+#define DST_NOXFRM ((dst_flags_t __force)0x0002)
+#define DST_NOPOLICY   ((dst_flags_t __force)0x0004)
+#define DST_NOCOUNT((dst_flags_t __force)0x0008)
+#define DST_FAKE_RTABLE((dst_flags_t __force)0x0010)
+#define DST_XFRM_TUNNEL((dst_flags_t __force)0x0020)
+#define DST_XFRM_QUEUE ((dst_flags_t __force)0x0040)
+#define DST_METADATA   ((dst_flags_t __force)0x0080)
 
/* A non-zero value of dst->obsolete forces by-hand validation
 * of the route entry.  Positive values are set by the generic
@@ -388,10 +389,10 @@ static inline int dst_discard(struct sk_buff *skb)
return dst_discard_out(&init_net, skb->sk, skb);
 }
 void *dst_alloc(struct dst_ops *ops, struct net_device *dev, int initial_ref,
-   int initial_obsolete, unsigned short flags);
+   int initial_obsolete, dst_flags_t flags);
 void dst_init(struct dst_entry *dst, struct dst_ops *ops,
  struct net_device *dev, int initial_ref, int initial_obsolete,
- unsigned short flags);
+ dst_flags_t flags);
 struct dst_entry *dst_destroy(struct dst_entry *dst);
 void dst_dev_put(struct dst_entry *dst);
 
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 18e442ea93d8..eec7e7ac564b 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -131,7 +131,7 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
const struct in6_addr *addr, bool anycast);
 
 struct rt6_info *ip6_dst_alloc(struct net *net, struct net_device *dev,
-  int flags);
+  dst_flags_t flags);
 
 /*
  * support functions for ND
diff --git a/net/core/dst.c b/net/core/dst.c
index 007aa0b08291..d7cce39c3552 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -61,7 +61,7 @@ const struct dst_metrics dst_default_metrics = {
 
 void dst_init(struct dst_entry *dst, struct dst_ops *ops,
  struct net_device *dev, int initial_ref, int initial_obsolete,
- unsigned short flags)
+ dst_flags_t flags)
 {
dst->dev = dev;
if (dev)
@@ -92,7 +92,7 @@ void dst_init(struct dst_entry *dst, struct dst_ops *ops,
 EXPORT_SYMBOL(dst_init);
 
 void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
-   int initial_ref, int initial_obsolete, unsigned short flags)
+   int initial_ref, int initial_obsolete, dst_flags_t flags)
 {
struct dst_entry *dst;
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b3f4d19b3ca5..1d6d53a5d951 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -356,7 +356,7 @@ static void rt6_info_init(struct rt6_info *rt)
 /* allocate dst with ip6_dst_ops */
 static struct rt6_info *__ip6_dst_alloc(struct net *net,
struct net_device *dev,
-   int flags)
+   dst_flags_t flags)
 {
struct rt6_info *rt = dst_alloc(&net->ipv6.ip6_dst_ops, dev,
1, 

Re: add "net: fec: Allow reception of frames bigger than 1522 bytes" to stable

2017-12-06 Thread David Miller
From: Uwe Kleine-König 
Date: Wed, 6 Dec 2017 22:24:50 +0100

> Hello,
> 
> on a 4.9.x kernel using the freescale/fec driver "behind" a switch I
> had problems with packet reception.
> 
> Commit fbbeefdd2104 ("net: fec: Allow reception of frames bigger than
> 1522 bytes") fixed it for me and so I suggest to queue that one for
> stable. It can be cherry-picked to 4.9.x without conflict and IMHO
> should be applied to all stable branches before v4.14.

Ok, queued up.


Re: [PATCH net-next 2/4] bnxt_en: Use NETIF_F_GRO_HW.

2017-12-06 Thread Michael Chan
On Tue, Dec 5, 2017 at 10:10 AM, Marcelo Ricardo Leitner
 wrote:
> On Mon, Dec 04, 2017 at 04:07:15PM -0800, Michael Chan wrote:
>> As already pointed out, GRO_HW is a subset of GRO.  Packets that
>> cannot be aggregated in hardware (due to hardware resource limitations
>> or protocol types that it doesn't handle) can just be passed to the
>> stack for GRO aggregation.
>
> How would the parameters/limits work in this case? I mean, currently
> we have the default weight of 64 packets per napi poll cycle, the
> budget of 300 per cycle and also the time constrain,
> net.core.netdev_budget_usecs.

Good point.  Currently, it is no different than LRO.  Each aggregated
packet is counted as 1.  With LRO, you don't necessarily know how many
packets were merged.  With GRO_HW, we know and it's possible to count
the original packets towards the NAPI budget.

>
> With GRO_HW, this 64 limit may be exceeded. I'm looking at qede code
> and it works by counting each completion as 1 rcv_pkts
> (qede_fp.c:1318). So if it now gets 64 packets, it's up to 64*MTU
> approx, GRO'ed or not. But with GRO_HW, it seems it may be much more than
> that, which may not be fair to other interfaces in the system.
> Drivers supporting GRO_HW probably should account for this.

Right.  We can make this adjustment for GRO_HW in a future patchset.

>
> And how can one control how much time a packet may spend on NIC queue
> waiting to be GRO'ed? Does it use the coalescing parameters for that?
>

The GRO_HW timer is currently not exposed.  It's different from
interrupt coalescing.  It's possible to make this a tunable parameter
in the future.


Re: [PATCH V2] netlink: Add netns check on taps

2017-12-06 Thread David Miller
From: Kevin Cernekee 
Date: Wed,  6 Dec 2017 12:12:27 -0800

> Currently, a nlmon link inside a child namespace can observe systemwide
> netlink activity.  Filter the traffic so that nlmon can only sniff
> netlink messages from its own netns.
> 
> Test case:
> 
> vpnns -- bash -c "ip link add nlmon0 type nlmon; \
>   ip link set nlmon0 up; \
>   tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
> sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
> spi 0x1 mode transport \
> auth sha1 0x616263313233 \
> enc aes 0x
> grep --binary abc123 /tmp/nlmon.pcap
> 
> Signed-off-by: Kevin Cernekee 

Applied and queued up for -stable, thanks Kevin.


Re: [PATCH] usbnet: fix alignment for frames with no ethernet header

2017-12-06 Thread David Miller
From: Bjørn Mork 
Date: Wed,  6 Dec 2017 20:21:24 +0100

> The qmi_wwan minidriver support a 'raw-ip' mode where frames are
> received without any ethernet header. This causes alignment issues
> because the skbs allocated by usbnet are "IP aligned".
> 
> Fix by allowing minidrivers to disable the additional alignment
> offset. This is implemented using a per-device flag, since the same
> minidriver also supports 'ethernet' mode.
> 
> Fixes: 32f7adf633b9 ("net: qmi_wwan: support "raw IP" mode")
> Reported-and-tested-by: Jay Foster 
> Signed-off-by: Bjørn Mork 

Looks good, applied and queued up for -stable.


Re: [PATCH net] tcp: use current time in tcp_rcv_space_adjust()

2017-12-06 Thread David Miller
From: Eric Dumazet 
Date: Wed, 06 Dec 2017 11:08:19 -0800

> From: Eric Dumazet 
> 
> When I switched rcv_rtt_est to high resolution timestamps, I forgot
> that tp->tcp_mstamp needed to be refreshed in tcp_rcv_space_adjust()
> 
> Using an old timestamp leads to autotuning lags.
> 
> Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high 
> resolution timestamps")
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks Eric.


Re: pull-request: bpf 2017-12-06

2017-12-06 Thread David Miller
From: Daniel Borkmann 
Date: Wed,  6 Dec 2017 19:56:25 +0100

> Hi David,
> 
> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Fixing broken uapi for BPF tracing programs for s390 and arm64
>architectures due to pt_regs being in-kernel only, and not part
>of uapi right now. A wrapper is added that exports pt_regs in
>an asm-generic way. For arm64 this maps to existing user_pt_regs
>structure and for s390 a user_pt_regs structure exporting the
>beginning of pt_regs is added and uapi-exported, thus fixing the
>BPF issues seen in perf (and BPF selftests), all from Hendrik.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.


Re: [PATCH] netlink: Add netns check on taps

2017-12-06 Thread Daniel Borkmann
On 12/06/2017 08:40 PM, David Miller wrote:
> From: Kevin Cernekee 
> Date: Tue,  5 Dec 2017 14:46:22 -0800
> 
>> Currently, a nlmon link inside a child namespace can observe systemwide
>> netlink activity.  Filter the traffic so that in a non-init netns,
>> nlmon can only sniff netlink messages from its own netns.
>>
>> Test case:
>>
>> vpnns -- bash -c "ip link add nlmon0 type nlmon; \
>>   ip link set nlmon0 up; \
>>   tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
>> sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
>> spi 0x1 mode transport \
>> auth sha1 0x616263313233 \
>> enc aes 0x
>> grep abc123 /tmp/nlmon.pcap
>>
>> Signed-off-by: Kevin Cernekee 
> 
> Daniel, what behavior did you intend this to have?
> 
> Taps can see their own namespace only, or init_net is special
> and can see all netlink activity.
> 
> I think letting init_net see everything could be confusing,
> because there is no way to distinguish netlink events by
> namespace just by looking at the messages that arrive at
> the tap right?

Yeah, only snooping from its own netns makes sense, let's limit
it to this.


Re: [PATCH net] adding missing rcu_read_unlock in ipxip6_rcv

2017-12-06 Thread David Miller
From: "Nikita V. Shirokov" 
Date: Wed,  6 Dec 2017 10:19:33 -0800

> commit 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
> introduced a new exit point in ipxip6_rcv(). However, rcu_read_unlock() is
> missing there. This diff fixes that.
> 
> Signed-off-by: Nikita V. Shirokov 
 ...
> @@ -903,8 +903,10 @@ static int ipxip6_rcv(struct sk_buff *skb, u8 ipproto,
>   goto drop;
>   if (t->parms.collect_md) {
>   tun_dst = ipv6_tun_rx_dst(skb, 0, 0, 0);
> - if (!tun_dst)
> + if (!tun_dst) {
> + rcu_read_unlock();
>   return 0;
> + }
>   }
>   ret = __ip6_tnl_rcv(t, skb, tpi, tun_dst, dscp_ecn_decapsulate,
>   log_ecn_error);

Shouldn't it branch to 'drop' otherwise we leak the skb?
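
That is, something along these lines (a sketch of the suggested alternative
only, assuming the existing 'drop' label in ipxip6_rcv() frees the skb):

	if (t->parms.collect_md) {
		tun_dst = ipv6_tun_rx_dst(skb, 0, 0, 0);
		if (!tun_dst)
			goto drop;
	}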


Re: [PATCH 1/2] net: fec: don't ack masked interrupt events

2017-12-06 Thread David Miller
From: Lucas Stach 
Date: Wed,  6 Dec 2017 18:24:58 +0100

> The FEC doesn't have a real interrupt status register, that takes
> into account the mask status of the IRQ. The driver reads the raw
> interrupt event register, which also reports events for masked
> IRQs.
> 
> The driver needs to apply the current mask itself, to avoid acking
> IRQs that are currently masked, as NAPI relies on the masking to
> hide the IRQs. The current behavior of just acking all interrupts
> regardless of their mask status opens the driver up the "rotting
> packet" race-window, as described in the original NAPI-HOWTO, which
> has been observed in the wild.
> 
> Signed-off-by: Lucas Stach 
> ---
>  drivers/net/ethernet/freescale/fec_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c 
> b/drivers/net/ethernet/freescale/fec_main.c
> index 610573855213..0b70c07eb703 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -1584,7 +1584,8 @@ fec_enet_interrupt(int irq, void *dev_id)
>   uint int_events;
>   irqreturn_t ret = IRQ_NONE;
>  
> - int_events = readl(fep->hwp + FEC_IEVENT);
> + int_events = readl_relaxed(fep->hwp + FEC_IEVENT) &
> +  readl_relaxed(fep->hwp + FEC_IMASK);

Adding a full new MMIO register read to every interrupt is going to make
interrupts significantly more expensive.

You should cache this mask in software in order to avoid the expensive
MMIO read.
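
For illustration only, a sketch of that idea (untested; "events_mask" is a
hypothetical field added to struct fec_enet_private, and FEC_DEFAULT_IMASK
stands for whatever value the driver writes when it enables interrupts):

	/* wherever the driver writes FEC_IMASK, remember the value */
	fep->events_mask = FEC_DEFAULT_IMASK;
	writel(fep->events_mask, fep->hwp + FEC_IMASK);

	/* interrupt handler: apply the cached mask, no extra MMIO read */
	int_events = readl(fep->hwp + FEC_IEVENT) & fep->events_mask;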

Thanks.


Re: [PATCH 13/45] drivers: net: dsa: remove duplicate includes

2017-12-06 Thread David Miller
From: Pravin Shedge 
Date: Wed,  6 Dec 2017 22:28:40 +0530

> These duplicate includes have been found with scripts/checkincludes.pl but
> they have been removed manually to avoid removing false positives.
> 
> Signed-off-by: Pravin Shedge 

Applied.


Re: [PATCH] net: ethernet: arc: fix error handling in emac_rockchip_probe

2017-12-06 Thread David Miller
From: Branislav Radocaj 
Date: Wed,  6 Dec 2017 18:24:33 +0100

> If clk_set_rate() fails, we should disable clk
> before return.
> 
> Found by Linux Driver Verification project (linuxtesting.org).
> 
> Signed-off-by: Branislav Radocaj 

Please eliminate this indentation of your commit message and
resubmit.

Thank you.


Re: [PATCH net] rds: Fix NULL pointer dereference in __rds_rdma_map

2017-12-06 Thread David Miller
From: Håkon Bugge 
Date: Wed,  6 Dec 2017 17:18:28 +0100

> This is a fix for syzkaller719569, where memory registration was
> attempted without any underlying transport being loaded.
> 
> Analysis of the case reveals that it is the setsockopt() RDS_GET_MR
> (2) and RDS_GET_MR_FOR_DEST (7) that are vulnerable.
> 
> Here is an example stack trace when the bug is hit:
 ...
> The fix is to check the existence of an underlying transport in
> __rds_rdma_map().
> 
> Signed-off-by: Håkon Bugge 
> Reported-by: syzbot 

Applied and queued up for -stable, thanks.


Re: [PATCH net-next 0/6] net: sched: sch: introduce extack support

2017-12-06 Thread David Miller
From: Alexander Aring 
Date: Wed,  6 Dec 2017 11:08:39 -0500

> this patch series basically adds support for extack in common qdisc handling.
> Additionally, it adds an extack pointer to the common qdisc callback handling;
> this allows each qdisc implementation to set the extack message for each
> failure over netlink.
>
> The extack message will be set deeper in the qdisc functions, but not deeper
> than the net core API. For qdisc module callback handling, the extack
> will not be set. This will be part of per-qdisc extack handling.
>
> I also want to prepare patches to handle extack per qdisc module...
> so there will be a lot more patches; I just cut them down to make
> them reviewable.
>
> There are some above-80-chars width warnings, which I ignore because
> it looks more ugly otherwise.
>
> This patch series is based on patches by David Ahern, which gave me some
> hints on how to deal with extack support.
> 
> Cc: David Ahern 

Only add the plumbing when you have actual extack messages you are
adding as an example use case.

Thank you.


Re: [PATCH] net_sched: use macvlan real dev trans_start in dev_trans_start()

2017-12-06 Thread David Miller
From: Chris Dion 
Date: Wed,  6 Dec 2017 10:50:28 -0500

> Macvlan devices are similar to vlans and do not update their
> own trans_start. In order for arp monitoring to work for a bond device
> when the slaves are macvlans, obtain its real device.
> 
> Signed-off-by: Chris Dion 

Looks good, applied.


Re: [PATCH v4 net-next] net/tcp: trace all TCP/IP state transition with tcp_set_state tracepoint

2017-12-06 Thread David Miller
From: Yafang Shao 
Date: Wed,  6 Dec 2017 15:01:29 +

> v3->v4: Do not trace DCCP socket

This is not sufficient.

Your test will match SCTP stream sockets.
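
One way to make the guard strict enough, sketched here as an assumption
rather than the final patch (tracepoint and variable names assumed), is to
key off the protocol itself:

/* Sketch: only genuine TCP sockets hit the tracepoint; DCCP and SCTP
 * stream sockets are excluded by the protocol check. */
if (sk->sk_protocol == IPPROTO_TCP)
	trace_tcp_set_state(sk, oldstate, state);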


Re: [PATCH net-next] ipvlan: Eliminate duplicated codes with existing function

2017-12-06 Thread David Miller
From: gfree.w...@vip.163.com
Date: Wed,  6 Dec 2017 19:04:26 +0800

> From: Gao Feng 
> 
> The recv flow of ipvlan l2 mode performs the same as l3 mode for
> non-multicast packets, so use the existing func ipvlan_handle_mode_l3
> instead of these duplicated statements in the non-multicast case.
> 
> Signed-off-by: Gao Feng 

Applied, thanks.


[PATCH V2] netlink: Add netns check on taps

2017-12-06 Thread Kevin Cernekee
Currently, a nlmon link inside a child namespace can observe systemwide
netlink activity.  Filter the traffic so that nlmon can only sniff
netlink messages from its own netns.

Test case:

vpnns -- bash -c "ip link add nlmon0 type nlmon; \
  ip link set nlmon0 up; \
  tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
spi 0x1 mode transport \
auth sha1 0x616263313233 \
enc aes 0x
grep --binary abc123 /tmp/nlmon.pcap

Signed-off-by: Kevin Cernekee 
---
 net/netlink/af_netlink.c | 3 +++
 1 file changed, 3 insertions(+)


V1->V2: Drop the special exception for init_net.
Compile-tested only, will retest later today.


diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b9e0ee4..79cc1bf 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -253,6 +253,9 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
struct sock *sk = skb->sk;
int ret = -ENOMEM;
 
+   if (!net_eq(dev_net(dev), sock_net(sk)))
+   return 0;
+
dev_hold(dev);
 
if (is_vmalloc_addr(skb->head))
-- 
2.7.4



Re: [PATCH V2] selinux: Add SCTP support

2017-12-06 Thread David Miller
From: Richard Haines 
Date: Wed,  6 Dec 2017 10:17:36 +

> The SELinux SCTP implementation is explained in:
> Documentation/security/SELinux-sctp.rst
> 
> Signed-off-by: Richard Haines 

No general objections from the networking side, so:

Acked-by: David S. Miller 

from me.  But if I were you I'd wait for some review from the
SCTP maintainers and experts. :-)


Re: [patch net-next] mlxsw: spectrum: handle NETIF_F_HW_TC changes correctly

2017-12-06 Thread David Miller
From: Jiri Pirko 
Date: Wed,  6 Dec 2017 09:41:12 +0100

> From: Jiri Pirko 
> 
> Currently, whenever the NETIF_F_HW_TC feature changes, we silently
> always allow it, but we do not actually disable the flows in HW
> on disable. That breaks users' expectations. So just forbid
> disabling the feature in case there are any filters offloaded.
> 
> Signed-off-by: Jiri Pirko 
> Reviewed-by: Ido Schimmel 

Applied, thanks Jiri.


Re: [PATCH] xen-netback: Fix logging message with spurious period after newline

2017-12-06 Thread David Miller
From: Joe Perches 
Date: Tue,  5 Dec 2017 22:40:25 -0800

> Using a period after a newline causes bad output.
> 
> Signed-off-by: Joe Perches 

Applied.


Re: [PATCH net-next] tun: avoid unnecessary READ_ONCE in tun_net_xmit

2017-12-06 Thread David Miller
From: Willem de Bruijn 
Date: Tue,  5 Dec 2017 22:11:17 -0500

> From: Willem de Bruijn 
> 
> The statement no longer serves a purpose.
> 
> Commit fa35864e0bb7 ("tuntap: Fix for a race in accessing numqueues")
> added the ACCESS_ONCE to avoid a race condition with skb_queue_len.
> 
> Commit 436accebb530 ("tuntap: remove unnecessary sk_receive_queue
> length check during xmit") removed the affected skb_queue_len check.
> 
> Commit 96f84061620c ("tun: add eBPF based queue selection method")
> split the function, reading the field a second time in the callee.
> The temp variable is now only read once, so just remove it.
> 
> Signed-off-by: Willem de Bruijn 

Applied, thanks Willem.


Re: [PATCH/RFC v2] leds: trigger: Introduce a NETDEV trigger

2017-12-06 Thread Ben Whitten
Hi Jacek,

On 5 December 2017 at 20:38, Jacek Anaszewski
 wrote:
> Hi Ben,
>
> On 12/05/2017 12:19 PM, Ben Whitten wrote:
>> From: Ben Whitten 
>>
>> The patch was converted to led_blink_oneshot; in doing so we found that the
>> behaviour has changed. As I don't want to break 'userspace' LED behaviour, this
>> patch shouldn't be merged as is. Open to suggestions.
>>
>> Given an interval of 50ms and heavy throughput, the previous implementation
>> produced a blink with a 100ms period and 50% duty cycle. The led_blink_oneshot
>> version produces a blink with a 140ms period and 57% duty cycle.
>
> Please check if the LED class driver you're testing the trigger with
> implements blink_set op. If yes it would be good to check if it doesn't
> align the delay intervals to the hardware capabilities instead of
> failing and relying on a LED core software blink fallback.

The LEDs are gpio-leds set up from the device tree on an embedded system
(sama5 based). So as far as I can tell blink_set is NULL in this case, and
the blink then relies on software timers.
I assume jiffies resolution plays a part here: taking a jiffy or two to
queue up a flash may add 10ms to the desired 50ms delay_on/delay_off that
I am seeing. The extra time may then be due to the stats workqueue not
aligning with the blink timer when kicking it off again.

Best regards,
Ben


Re: [PATCH] ARM64: dts: meson-axg: add ethernet mac controller

2017-12-06 Thread Kevin Hilman
Yixun Lan  writes:

> Add DT info for the stmmac ethernet MAC found in
> Amlogic's Meson-AXG SoC, and also describe the ethernet
> pinctrl & clock information here.
>
> This is tested in the S400 dev board which use a RTL8211F PHY,
> and the pins connect to the 'eth_rgmii_y_pins' group.
>
> Signed-off-by: Yixun Lan 

This doesn't apply cleanly to mainline.  Are there some dependencies
here that I'm not aware of?

Also...

> ---
>  arch/arm64/boot/dts/amlogic/meson-axg-s400.dts |  7 
>  arch/arm64/boot/dts/amlogic/meson-axg.dtsi | 53 
> ++
>  2 files changed, 60 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts b/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts
> index 9eb6aaee155d..7b39a9fe2b0f 100644
> --- a/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts
> +++ b/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts
> @@ -21,3 +21,10 @@
>   status = "okay";
>  };
>  
> +&ethmac {
> + status = "okay";
> + phy-mode = "rgmii";
> +
> + pinctrl-0 = <&eth_rgmii_y_pins>;
> + pinctrl-names = "default";
> +};
> diff --git a/arch/arm64/boot/dts/amlogic/meson-axg.dtsi b/arch/arm64/boot/dts/amlogic/meson-axg.dtsi
> index 65945c6c8b65..57faaa9d8013 100644
> --- a/arch/arm64/boot/dts/amlogic/meson-axg.dtsi
> +++ b/arch/arm64/boot/dts/amlogic/meson-axg.dtsi
> @@ -157,6 +157,19 @@
>   #address-cells = <0>;
>   };
>  
> + ethmac: ethernet@ff3f {
> + compatible = "amlogic,meson-axg-dwmac", "amlogic,meson-gxbb-dwmac", "snps,dwmac";

The new "meson-axg-dwmac" is not documented in the binding.  Can you add
that?

Kevin


Re: [PATCH net-next] rds: debug: fix null check on static array

2017-12-06 Thread David Miller
From: Prashant Bhole 
Date: Wed,  6 Dec 2017 10:47:04 +0900

> t_name cannot be NULL since it is an array field of a struct.
> Replacing null check on static array with string length check using
> strnlen()
> 
> Signed-off-by: Prashant Bhole 

Applied, thank you.
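
For reference, the change boils down to something like the following sketch
(the consumer shown is hypothetical):

/* Sketch: t_name is a fixed-size array member of struct rds_transport, so
 * "if (trans->t_name)" is always true; test for a non-empty name instead. */
if (strnlen(trans->t_name, sizeof(trans->t_name)))
	pr_debug("transport %s\n", trans->t_name);	/* hypothetical consumer */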


Re: [PATCH net] tcp: use current time in tcp_rcv_space_adjust()

2017-12-06 Thread Neal Cardwell
On Wed, Dec 6, 2017 at 2:08 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> When I switched rcv_rtt_est to high resolution timestamps, I forgot
> that tp->tcp_mstamp needed to be refreshed in tcp_rcv_space_adjust()
>
> Using an old timestamp leads to autotuning lags.
>
> Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps")
> Signed-off-by: Eric Dumazet 
> Cc: Wei Wang 
> Cc: Neal Cardwell 
> Cc: Yuchung Cheng 
> ---

Acked-by: Neal Cardwell 

neal
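
The fix Eric describes is essentially a one-liner of this shape (a sketch;
exact placement in the function is assumed):

/* Sketch: refresh tp->tcp_mstamp at the start of tcp_rcv_space_adjust() so
 * the autotuning math uses the current time rather than a stale timestamp. */
static void tcp_rcv_space_adjust(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	tcp_mstamp_refresh(tp);
	/* ... existing receive-buffer autotuning logic ... */
}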


Re: [Patch net-next v2] act_mirred: get rid of tcfm_ifindex from struct tcf_mirred

2017-12-06 Thread David Miller
From: Cong Wang 
Date: Tue,  5 Dec 2017 16:17:26 -0800

> tcfm_dev always points to the correct netdev and we already
> hold a refcnt, so there is no need to use tcfm_ifindex to look it up again.
> 
> If we were to support moving the target netdev across netns, using a
> pointer would be better than an ifindex.
> 
> This also fixes dumping an obsolete ifindex: now, after the
> target device is gone, we just dump 0 as the ifindex.
> 
> Cc: Jiri Pirko 
> Cc: Jamal Hadi Salim 
> Signed-off-by: Cong Wang 

Both act_mirred patches applied, thanks Cong.


Re: [PATCH net-next 0/2] add ip6erspan collect_md mode

2017-12-06 Thread David Miller
From: William Tu 
Date: Tue,  5 Dec 2017 15:15:43 -0800

> Similar to erspan collect_md mode in ipv4, the first patch adds
> support for ip6erspan collect metadata mode.  The second patch
> adds the test case using bpf_skb_[gs]et_tunnel_key helpers.
> 
> The corresponding iproute2 patch:
> https://marc.info/?l=linux-netdev=151251545410047=2

Series applied, thanks William.


Re: [PATCH net] net: thunderx: Fix TCP/UDP checksum offload for IPv4 pkts

2017-12-06 Thread David Miller
From: Florian Westphal 
Date: Wed,  6 Dec 2017 01:04:50 +0100

> Offload IP header checksum to NIC.
> 
> This fixes a previous patch which ended up disabling checksum
> offloading for both IPv4 and IPv6 packets.  So L3 checksum offload
> was getting disabled for IPv4 pkts, and HW is dropping these pkts
> for some reason.
> 
> Without this patch, IPv4 TSO appears to be broken:
> 
> Without this patch I get ~16kbyte/s, with the patch close to 2mbyte/s
> when copying files via scp from test box to my home workstation.
> 
> Looking at tcpdump on sender it looks like hardware drops IPv4 TSO skbs.
> This patch restores performance for me, ipv6 looks good too.
> 
> Fixes: fa6d7cb5d76c ("net: thunderx: Fix TCP/UDP checksum offload for IPv6 pkts")
> Cc: Sunil Goutham 
> Cc: Aleksey Makarov 
> Cc: Eric Dumazet 
> Signed-off-by: Florian Westphal 

Applied and queued up for -stable, thanks Florian.


Re: [PATCH] netlink: Add netns check on taps

2017-12-06 Thread David Miller
From: Kevin Cernekee 
Date: Tue,  5 Dec 2017 14:46:22 -0800

> Currently, a nlmon link inside a child namespace can observe systemwide
> netlink activity.  Filter the traffic so that in a non-init netns,
> nlmon can only sniff netlink messages from its own netns.
> 
> Test case:
> 
> vpnns -- bash -c "ip link add nlmon0 type nlmon; \
>   ip link set nlmon0 up; \
>   tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
> sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
> spi 0x1 mode transport \
> auth sha1 0x616263313233 \
> enc aes 0x
> grep abc123 /tmp/nlmon.pcap
> 
> Signed-off-by: Kevin Cernekee 

Daniel, what behavior did you intend this to have?

Should taps see only their own namespace, or should init_net be special
and able to see all netlink activity?

I think letting init_net see everything could be confusing,
because there is no way to distinguish netlink events by
namespace just by looking at the messages that arrive at
the tap, right?

So maybe own-namespace-only is the way to go.

Thanks.


RE: dsa: dsa_slave_port_obj_del calls multiple times with SWITCHDEV_OBJ_ID_HOST_MDB obj id

2017-12-06 Thread Tristram.Ha
> SWITCHDEV_OBJ_ID_HOST_MDB is used when there is a join/leave on the
> bridge interface. It happens for each interface in the bridge, and it
> means packets which match the group that ingress on that interface
> should be forwarded to the CPU.
> 
> > As the base driver cannot find an entry with that host port, it returns
> > an error and so users will see a lot of failures from the DSA switch.
> 
> You have a few options:
> 
> 1) Just forward all multicast traffic to the cpu, and ignore
>SWITCHDEV_OBJ_ID_HOST_MDB.
> 
> 2) Implement SWITCHDEV_OBJ_ID_HOST_MDB so you setup your tables to
>just forward the requested multicast to the cpu.
> 
> 3) You can also forward a bit too much, e.g. if you cannot set filters
>per ingress port, just send all the traffic for the group from any
>port.
> 
> 
> The bridge will discard whatever it does not need.
> 
> > Is this a new behavior that the driver needs to handle?  In previous
> > versions I do not think I saw that.
> 
> SWITCHDEV_OBJ_ID_HOST_MDB is new. However,
> dsa_slave_port_obj_del()
> can be called for all sorts of objects, and you should only be
> reacting on those you support. So adding a new object should not have
> changed anything.
> 
> > A typical operation is a PC connected to a port in a switch that wants
> > to send multicast packets.  It broadcasts an IGMP membership join
> > message.  Function dsa_slave_port_obj_add is called to set up an entry
> > in the lookup table.  When an IGMP membership leave message is received,
> > dsa_slave_port_obj_del will be called after a delay.  But then it is
> > called for each port with the host port as the parameter.
> 
> Correct. switchdev is a generic API. It also needs to work for Top of
> Rack switches, which generally have a match/action architecture. I can
> imagine that this match/action happens per port, so we need to call
> switchdev per port. However, switches supported by DSA tend to have
> central management of all ports, so one call would be sufficient. With
> a DSA driver, you just need to expect redundant calls, and do the
> right thing.

The base DSA driver only knows about port_mdb_add and port_mdb_del.
It does not really know about SWITCHDEV_OBJ_ID_HOST_MDB.

It is fine to use port_mdb_add and port_mdb_del for
SWITCHDEV_OBJ_ID_HOST_MDB as hardware can program an entry for
the host port, but receiving many port_mdb_del calls with the same
parameter makes it difficult to handle.  Example:

port_mdb_add 2 xx:xx:xx:xx:xx:xx
port_mdb_del 2 xx:xx:xx:xx:xx:xx
port_mdb_del 4 xx:xx:xx:xx:xx:xx; lan1
port_mdb_del 4 xx:xx:xx:xx:xx:xx; lan2
port_mdb_del 4 xx:xx:xx:xx:xx:xx; lan3

There is no indication in the port_mdb_del call to show this call is for
port 1, 2, and so on.

port_mdb_add can be used to add an entry for the host.  Again, it is
called as many times as there are ports.  I think the port_mdb_* call
needs an extra parameter to make it useful in this situation.

An easy fix is not to return any error in port_mdb_del, or ignore host port.
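
A minimal sketch of that "ignore the host port" workaround in a DSA driver
(driver prefix and hardware helper are made-up names):

/* Sketch: tolerate the redundant per-port deletions that target the
 * CPU/host port instead of returning an error for each of them. */
static int foo_port_mdb_del(struct dsa_switch *ds, int port,
			    const struct switchdev_obj_port_mdb *mdb)
{
	if (dsa_is_cpu_port(ds, port))
		return 0;	/* host entry is handled once elsewhere */

	return foo_hw_mdb_del(ds, port, mdb->addr, mdb->vid);	/* hypothetical */
}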

It seems port_mdb_prepare is used to indicate whether an entry can be added
before the actual port_mdb_add call, since port_mdb_add returns void and so
cannot report a failure to add.  Currently there is no implementation behind
that port_mdb_prepare call, and the Marvell driver just prints a warning
inside its port_mdb_add function.



Re: [PATCH net-next] bnxt: Don't print message, if DAC isn't connected on both ends

2017-12-06 Thread Michael Chan
On Tue, Dec 5, 2017 at 4:33 AM, Thomas Bogendoerfer
 wrote:
> bnxt driver spams logfiles with
>
> [  541.003065] bnxt_en :5d:00.1 eth5: Link speed -1 no longer supported
>
> if a direct attached cable (DAC) is plugged into the bnxt card and is
> unplugged on the other side. This patch removes the code printing this
> message, since it doesn't provide any useful information.
>
> Signed-off-by: Thomas Bogendoerfer 
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 9 -
>  1 file changed, 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
> b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 8c1dd60eab6f..8a2319ed79dc 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -1701,18 +1701,9 @@ static int bnxt_async_event_process(struct bnxt *bp,
> /* TODO CHIMP_FW: Define event id's for link change, error etc */
> switch (event_id) {
> case ASYNC_EVENT_CMPL_EVENT_ID_LINK_SPEED_CFG_CHANGE: {
> -   u32 data1 = le32_to_cpu(cmpl->event_data1);
> -   struct bnxt_link_info *link_info = &bp->link_info;
> -
> if (BNXT_VF(bp))
> goto async_event_process_exit;
> -   if (data1 & 0x2) {
> -   u16 fw_speed = link_info->force_link_speed;
> -   u32 speed = bnxt_fw_to_ethtool_speed(fw_speed);
>
> -   netdev_warn(bp->dev, "Link speed %d no longer supported\n",
> -   speed);
> -   }

This is supposed to provide useful information to the user under some
conditions.  In your particular situation, it is not useful since the
speed is -1.  Let me try to modify this code a bit to reduce the spam.
I will post a revised patch in the next 2 hours.

Thanks.

> set_bit(BNXT_LINK_SPEED_CHNG_SP_EVENT, &bp->sp_event);
> /* fall thru */
> }
> --
> 2.12.3
>
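
One possible shape for the reduced-spam warning, sketched here as an
assumption rather than Michael's actual revision:

/* Sketch: only warn when the firmware speed maps to a valid ethtool speed,
 * so an unplugged DAC (which shows up as SPEED_UNKNOWN, i.e. -1) stays quiet. */
if (data1 & 0x2) {
	u16 fw_speed = link_info->force_link_speed;
	u32 speed = bnxt_fw_to_ethtool_speed(fw_speed);

	if (speed != (u32)SPEED_UNKNOWN)
		netdev_warn(bp->dev, "Link speed %d no longer supported\n",
			    speed);
}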

