[PATCH net 1/3] lan78xx: Set ASD in MAC_CR when EEE is enabled.

2018-03-22 Thread Raghuram Chary J
Description:
EEE does not work with lan7800 when AutoSpeed is not set.
(This can happen when the EEPROM is not populated or is configured incorrectly.)

Root-Cause:
When EEE is enabled, the ASD bit in the MAC_CR register is not set,
i.e. it is left in its default state, causing EEE to fail.

Fix:
Set the ASD bit when the EEPROM is not present.

Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Signed-off-by: Raghuram Chary J 
---
 drivers/net/usb/lan78xx.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 11176070b345..e2d26f9c0f6a 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2351,6 +2351,7 @@ static int lan78xx_reset(struct lan78xx_net *dev)
u32 buf;
int ret = 0;
unsigned long timeout;
+   u8 sig;
 
ret = lan78xx_read_reg(dev, HW_CFG, &buf);
buf |= HW_CFG_LRST_;
@@ -2450,6 +2451,15 @@ static int lan78xx_reset(struct lan78xx_net *dev)
/* LAN7801 only has RGMII mode */
if (dev->chipid == ID_REV_CHIP_ID_7801_)
buf &= ~MAC_CR_GMII_EN_;
+
+   if(dev->chipid == ID_REV_CHIP_ID_7800_) {
+   ret = lan78xx_read_raw_eeprom(dev, 0, 1, &sig);
+   if ((!ret) && (sig != EEPROM_INDICATOR)) {
+   /*Implies there is no external eeprom. Set mac speed*/
+   netdev_info(dev->net, "No External EEPROM. Setting MAC Speed \n");
+   buf |= MAC_CR_AUTO_DUPLEX_ | MAC_CR_AUTO_SPEED_;
+   }
+   }
ret = lan78xx_write_reg(dev, MAC_CR, buf);
 
ret = lan78xx_read_reg(dev, MAC_TX, &buf);
-- 
2.16.2



[RFC PATCH] net: stmmac: dwmac-sun8i: sun8i_syscon_reg_field can be static

2018-03-22 Thread kbuild test robot

Fixes: e3c10deef23c ("net: stmmac: dwmac-sun8i: Use regmap_field for syscon register access")
Signed-off-by: Fengguang Wu 
---
 dwmac-sun8i.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
index de93f0f..bbc0514 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
@@ -79,7 +79,7 @@ struct sunxi_priv_data {
 };
 
 /* EMAC clock register @ 0x30 in the "system control" address range */
-const struct reg_field sun8i_syscon_reg_field = {
+static const struct reg_field sun8i_syscon_reg_field = {
.reg = 0x30,
.lsb = 0,
.msb = 31,


Re: [PATCH net-next 06/12] net: stmmac: dwmac-sun8i: Use regmap_field for syscon register access

2018-03-22 Thread kbuild test robot
Hi Chen-Yu,

I love your patch! Perhaps something to improve:

[auto build test WARNING on next-20180309]
[also build test WARNING on v4.16-rc6]
[cannot apply to v4.16-rc4 v4.16-rc3 v4.16-rc2]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Chen-Yu-Tsai/ARM-sun8i-r40-Add-Ethernet-support/20180318-161723
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c:82:24: sparse: symbol 
>> 'sun8i_syscon_reg_field' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[RFC PATCH] net: stmmac: dwmac-sun8i: sun8i_ccu_reg_field can be static

2018-03-22 Thread kbuild test robot

Fixes: 0e59c15b2797 ("net: stmmac: dwmac-sun8i: Add support for GMAC on Allwinner R40 SoC")
Signed-off-by: Fengguang Wu 
---
 dwmac-sun8i.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
index be6705e8..622fb2b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
@@ -98,7 +98,7 @@ const struct reg_field sun8i_syscon_reg_field = {
 };
 
 /* EMAC clock register @ 0x164 in the CCU address range */
-const struct reg_field sun8i_ccu_reg_field = {
+static const struct reg_field sun8i_ccu_reg_field = {
.reg = 0x164,
.lsb = 0,
.msb = 31,


Re: [PATCH net-next 09/12] net: stmmac: dwmac-sun8i: Add support for GMAC on Allwinner R40 SoC

2018-03-22 Thread kbuild test robot
Hi Chen-Yu,

I love your patch! Perhaps something to improve:

[auto build test WARNING on next-20180309]
[also build test WARNING on v4.16-rc6]
[cannot apply to v4.16-rc4 v4.16-rc3 v4.16-rc2]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Chen-Yu-Tsai/ARM-sun8i-r40-Add-Ethernet-support/20180318-161723
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c:94:24: sparse: symbol 
'sun8i_syscon_reg_field' was not declared. Should it be static?
>> drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c:101:24: sparse: symbol 
>> 'sun8i_ccu_reg_field' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[PATCH] ixgbe: tweak page counting for XDP_REDIRECT

2018-03-22 Thread Björn Töpel
From: Björn Töpel 

The current page counting scheme assumes that the reference count
cannot decrease until the received frame is sent to the upper layers
of the networking stack. This assumption does not hold for the
XDP_REDIRECT action, since a page (pointed to by the xdp_buff) can have
its reference count decreased via the xdp_do_redirect call.

To work around that, we now start off with a large page count and then
never allow the refcount to fall below two.
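
A minimal user-space model of that counting scheme, for illustration only
(plain integers stand in for the kernel's page_ref_* helpers; none of the
names below are driver code):

#include <stdio.h>

#define BIAS_MAX 65535			/* models USHRT_MAX */

struct rx_buffer {
	unsigned int page_refcount;	/* models page_ref_count(page) */
	unsigned short pagecnt_bias;	/* references the driver still owns */
};

static void alloc_mapped_page(struct rx_buffer *bi)
{
	bi->page_refcount = 1;			/* fresh page */
	bi->page_refcount += BIAS_MAX - 1;	/* page_ref_add(page, USHRT_MAX - 1) */
	bi->pagecnt_bias = BIAS_MAX;
}

static void reuse_check(struct rx_buffer *bi)
{
	/* refill the bias before the real refcount can be bled down to one */
	if (bi->pagecnt_bias == 1) {
		bi->page_refcount += BIAS_MAX - 1;
		bi->pagecnt_bias = BIAS_MAX;
	}
}

int main(void)
{
	struct rx_buffer bi;

	alloc_mapped_page(&bi);
	bi.pagecnt_bias--;	/* frame handed up or redirected */
	bi.page_refcount--;	/* xdp_do_redirect() may drop a reference early */
	reuse_check(&bi);
	printf("refcount=%u bias=%u\n", bi.page_refcount, bi.pagecnt_bias);
	return 0;
}

Because the bias is refilled at 1 instead of 0, the real reference count
never sinks below two, even when xdp_do_redirect() releases a reference
before the driver's own accounting catches up.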

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fb80edbd2739..8dbb2ce06287 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1629,7 +1629,8 @@ static bool ixgbe_alloc_mapped_page(struct ixgbe_ring 
*rx_ring,
bi->dma = dma;
bi->page = page;
bi->page_offset = ixgbe_rx_offset(rx_ring);
-   bi->pagecnt_bias = 1;
+   page_ref_add(page, USHRT_MAX - 1);
+   bi->pagecnt_bias = USHRT_MAX;
rx_ring->rx_stats.alloc_rx_page++;
 
return true;
@@ -2039,8 +2040,8 @@ static bool ixgbe_can_reuse_rx_page(struct 
ixgbe_rx_buffer *rx_buffer)
 * the pagecnt_bias and page count so that we fully restock the
 * number of references the driver holds.
 */
-   if (unlikely(!pagecnt_bias)) {
-   page_ref_add(page, USHRT_MAX);
+   if (unlikely(pagecnt_bias == 1)) {
+   page_ref_add(page, USHRT_MAX - 1);
rx_buffer->pagecnt_bias = USHRT_MAX;
}
 
-- 
2.7.4



[PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS

2018-03-22 Thread Jay Vosburgh
The operstate update logic will leave an interface in the
default UNKNOWN operstate if the interface carrier state never changes
from the default carrier up state set at creation.  This includes the
case of an explicit call to netif_carrier_on, as the carrier on to on
transition has no effect on operstate.

This affects virtio-net for the case that the virtio peer does
not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
updates).  Without this feature, the virtio specification states that
"the link should be assumed active," so, logically, the operstate should
be UP instead of UNKNOWN.  This has an impact on user space applications
that use the operstate to make availability decisions for the interface.

Resolve this by changing the virtio probe logic slightly to call
netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
cases, and then the existing call to netif_carrier_on for the "without"
case will cause an operstate transition.
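
As a toy model of the carrier/operstate interaction described above
(simplified; the real logic lives in netif_carrier_on()/off() and the
linkwatch code, and the names here are illustrative):

#include <stdio.h>

enum operstate { OPER_UNKNOWN, OPER_DOWN, OPER_UP };

struct dev_model {
	int carrier_ok;
	enum operstate oper;
};

static void carrier_off(struct dev_model *d)
{
	if (d->carrier_ok) {		/* only on -> off transitions count */
		d->carrier_ok = 0;
		d->oper = OPER_DOWN;
	}
}

static void carrier_on(struct dev_model *d)
{
	if (!d->carrier_ok) {		/* only off -> on transitions count */
		d->carrier_ok = 1;
		d->oper = OPER_UP;
	}
}

int main(void)
{
	/* "carrier up, operstate UNKNOWN", as at netdev creation */
	struct dev_model d = { .carrier_ok = 1, .oper = OPER_UNKNOWN };

	carrier_on(&d);		/* on -> on: no transition, stays UNKNOWN */
	printf("carrier_on only: %d\n", d.oper);

	carrier_off(&d);	/* force a real edge... */
	carrier_on(&d);		/* ...and operstate becomes UP */
	printf("off then on:    %d\n", d.oper);
	return 0;
}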

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Ben Hutchings 
Fixes: 167c25e4c550 ("virtio-net: init link state correctly")
Signed-off-by: Jay Vosburgh 

---

I considered resolving this by changing linkwatch_init_dev to
unconditionally call rfc2863_policy, as that would always set operstate
for all interfaces.

This would not have any impact on most cases (as most drivers
call netif_carrier_off during probe), except for the loopback device,
which currently has an operstate of UNKNOWN (because it never does any
carrier state transitions).  This change would add a round trip on the
dev_base_lock for every loopback device creation, which could have a
negative impact when creating many loopback devices, e.g., when
concurrently creating large numbers of containers.


 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 23374603e4d9..7b187ec7411e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 
/* Assume link up if device can't report link status,
   otherwise get link status from config. */
+   netif_carrier_off(dev);
if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
-   netif_carrier_off(dev);
schedule_work(&vi->config_work);
} else {
vi->status = VIRTIO_NET_S_LINK_UP;
-- 
2.14.1



[PATCH v2 1/2] i40e: tweak page counting for XDP_REDIRECT

2018-03-22 Thread Björn Töpel
From: Björn Töpel 

This commit tweaks the page counting for XDP_REDIRECT to function
properly. XDP_REDIRECT support will be added in a future commit.

The current page counting scheme assumes that the reference count
cannot decrease until the received frame is sent to the upper layers
of the networking stack. This assumption does not hold for the
XDP_REDIRECT action, since a page (pointed to by the xdp_buff) can have
its reference count decreased via the xdp_do_redirect call.

To work around that, we now start off with a large page count and then
never allow the refcount to fall below two.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e8eef9a56b6b..2f817d1466eb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1588,9 +1588,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring 
*rx_ring,
bi->dma = dma;
bi->page = page;
bi->page_offset = i40e_rx_offset(rx_ring);
-
-   /* initialize pagecnt_bias to 1 representing we fully own page */
-   bi->pagecnt_bias = 1;
+   page_ref_add(page, USHRT_MAX - 1);
+   bi->pagecnt_bias = USHRT_MAX;
 
return true;
 }
@@ -1956,8 +1955,8 @@ static bool i40e_can_reuse_rx_page(struct i40e_rx_buffer 
*rx_buffer)
 * the pagecnt_bias and page count so that we fully restock the
 * number of references the driver holds.
 */
-   if (unlikely(!pagecnt_bias)) {
-   page_ref_add(page, USHRT_MAX);
+   if (unlikely(pagecnt_bias == 1)) {
+   page_ref_add(page, USHRT_MAX - 1);
rx_buffer->pagecnt_bias = USHRT_MAX;
}
 
-- 
2.7.4



[PATCH v2 2/2] i40e: add support for XDP_REDIRECT

2018-03-22 Thread Björn Töpel
From: Björn Töpel 

The driver now acts upon the XDP_REDIRECT return action. Two new ndos
are implemented, ndo_xdp_xmit and ndo_xdp_flush.

The XDP_REDIRECT action enables an XDP program to redirect frames to
other netdevs.
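
For context, the user-visible counterpart might look like the minimal XDP
program below (a sketch only: the target ifindex 3 and the section names
are arbitrary; bpf_redirect() is the standard helper that yields
XDP_REDIRECT):

#include <linux/bpf.h>

#ifndef SEC
#define SEC(name) __attribute__((section(name), used))
#endif

static int (*bpf_redirect)(int ifindex, int flags) =
	(void *) BPF_FUNC_redirect;

SEC("xdp")
int xdp_redirect_all(struct xdp_md *ctx)
{
	/* the driver's i40e_run_xdp() then calls xdp_do_redirect() */
	return bpf_redirect(3, 0);
}

char _license[] SEC("license") = "GPL";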

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |  2 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 74 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  2 +
 3 files changed, 68 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 79ab52276d12..2fb4261b4fd9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11815,6 +11815,8 @@ static const struct net_device_ops i40e_netdev_ops = {
.ndo_bridge_getlink = i40e_ndo_bridge_getlink,
.ndo_bridge_setlink = i40e_ndo_bridge_setlink,
.ndo_bpf= i40e_xdp,
+   .ndo_xdp_xmit   = i40e_xdp_xmit,
+   .ndo_xdp_flush  = i40e_xdp_flush,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 2f817d1466eb..0168611312df 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2214,7 +2214,7 @@ static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
 static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
struct xdp_buff *xdp)
 {
-   int result = I40E_XDP_PASS;
+   int err, result = I40E_XDP_PASS;
struct i40e_ring *xdp_ring;
struct bpf_prog *xdp_prog;
u32 act;
@@ -2233,6 +2233,10 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring 
*rx_ring,
xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
result = i40e_xmit_xdp_ring(xdp, xdp_ring);
break;
+   case XDP_REDIRECT:
+   err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+   result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
+   break;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -2268,6 +2272,15 @@ static void i40e_rx_buffer_flip(struct i40e_ring 
*rx_ring,
 #endif
 }
 
+static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+{
+   /* Force memory writes to complete before letting h/w
+* know there are new descriptors to fetch.
+*/
+   wmb();
+   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+}
+
 /**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
@@ -2402,16 +2415,11 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, 
int budget)
}
 
if (xdp_xmit) {
-   struct i40e_ring *xdp_ring;
+   struct i40e_ring *xdp_ring =
+   rx_ring->vsi->xdp_rings[rx_ring->queue_index];
 
-   xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
-
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.
-*/
-   wmb();
-
-   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+   i40e_xdp_ring_update_tail(xdp_ring);
+   xdp_do_flush_map();
}
 
rx_ring->skb = skb;
@@ -3659,3 +3667,49 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, 
struct net_device *netdev)
 
return i40e_xmit_frame_ring(skb, tx_ring);
 }
+
+/**
+ * i40e_xdp_xmit - Implements ndo_xdp_xmit
+ * @dev: netdev
+ * @xdp: XDP buffer
+ *
+ * Returns Zero if sent, else an error code
+ **/
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+{
+   struct i40e_netdev_priv *np = netdev_priv(dev);
+   unsigned int queue_index = smp_processor_id();
+   struct i40e_vsi *vsi = np->vsi;
+   int err;
+
+   if (test_bit(__I40E_VSI_DOWN, vsi->state))
+   return -EINVAL;
+
+   if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
+   return -EINVAL;
+
+   err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
+   if (err != I40E_XDP_TX)
+   return -ENOMEM;
+
+   return 0;
+}
+
+/**
+ * i40e_xdp_flush - Implements ndo_xdp_flush
+ * @dev: netdev
+ **/
+void i40e_xdp_flush(struct net_device *dev)
+{
+   struct i40e_netdev_priv *np = netdev_priv(dev);
+   unsigned int queue_index = smp_processor_id();
+   struct i40e_vsi *vsi = np->vsi;
+
+   if (test_bit(__I40E_VSI_DOWN, vsi->state))
+   return;
+
+   if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
+   return;
+
+   i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.

Re: [PATCH][next] net: mvpp2: use correct index on array mvpp2_pools

2018-03-22 Thread Antoine Tenart
Hi Colin,

On Wed, Mar 21, 2018 at 05:31:15PM +, Colin King wrote:
> From: Colin Ian King 
> 
> Array mvpp2_pools is being indexed by long_log_pool, however this
> looks like a cut-n-paste bug and in fact should be short_log_pool.
> 
> Detected by CoverityScan, CID#1466113 ("Copy-paste error")
> 
> Fixes: 576193f2d579 ("net: mvpp2: jumbo frames support")
> Signed-off-by: Colin Ian King 

Acked-by: Antoine Tenart 

Thanks!
Antoine

> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
> b/drivers/net/ethernet/marvell/mvpp2.c
> index 9bd35f2291d6..f8bc3d4a39ff 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -4632,7 +4632,7 @@ static int mvpp2_swf_bm_pool_init(struct mvpp2_port 
> *port)
>   if (!port->pool_short) {
>   port->pool_short =
>   mvpp2_bm_pool_use(port, short_log_pool,
> -   mvpp2_pools[long_log_pool].pkt_size);
> +   mvpp2_pools[short_log_pool].pkt_size);
>   if (!port->pool_short)
>   return -ENOMEM;
>  
> -- 
> 2.15.1
> 

-- 
Antoine Ténart, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com


Re: [v2] vhost: add vsock compat ioctl

2018-03-22 Thread Stefan Hajnoczi
On Fri, Mar 16, 2018 at 7:30 PM, David Miller  wrote:
> Although the top level ioctls are probably size and layout compatible,
> I do not think that the deeper ioctls can be called by compat binaries
> without some translations in order for them to work.

I audited the vhost ioctl code when reviewing this patch and was
unable to find anything that would break for a 32-bit userspace
process.

drivers/vhost/net.c does the same thing already, which doesn't prove
it's correct but makes me more confident I didn't miss something while
auditing the vhost ioctl code.

Did you have a specific ioctl in mind?

Stefan


Re: [PATCH net 1/3] lan78xx: Set ASD in MAC_CR when EEE is enabled.

2018-03-22 Thread Sergei Shtylyov

Hello!

   Only stylistic comments.

On 3/22/2018 10:41 AM, Raghuram Chary J wrote:


Description:
EEE does not work with lan7800 when AutoSpeed is not set.
(This can happen when EEPROM is not populated or configured incorrectly)

Root-Cause:
When EEE is enabled, the mac config register ASD is not set
i.e in default state,causing EEE fail.


   Need a period after "i.e" and a space after comma.


Fix:
Set the register when eeprom is not present.

Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Signed-off-by: Raghuram Chary J 
---
  drivers/net/usb/lan78xx.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 11176070b345..e2d26f9c0f6a 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2351,6 +2351,7 @@ static int lan78xx_reset(struct lan78xx_net *dev)
u32 buf;
int ret = 0;
unsigned long timeout;
+   u8 sig;
  
  	ret = lan78xx_read_reg(dev, HW_CFG, &buf);

buf |= HW_CFG_LRST_;
@@ -2450,6 +2451,15 @@ static int lan78xx_reset(struct lan78xx_net *dev)
/* LAN7801 only has RGMII mode */
if (dev->chipid == ID_REV_CHIP_ID_7801_)
buf &= ~MAC_CR_GMII_EN_;
+
+   if(dev->chipid == ID_REV_CHIP_ID_7800_) {


   Please run your patches thru scripts/checkpatch.pl -- it would have 
complained because of a missing space between *if* and (.



+   ret = lan78xx_read_raw_eeprom(dev, 0, 1, &sig);
+   if ((!ret) && (sig != EEPROM_INDICATOR)) {


   No need for the inner parens here, especially the 1st pair...


+   /*Implies there is no external eeprom. Set mac speed*/


   Please add space after /*  and before */.


+   netdev_info(dev->net, "No External EEPROM. Setting MAC Speed \n");
+   buf |= MAC_CR_AUTO_DUPLEX_ | MAC_CR_AUTO_SPEED_;
+   }
+   }
ret = lan78xx_write_reg(dev, MAC_CR, buf);
  
  	ret = lan78xx_read_reg(dev, MAC_TX, &buf);


MBR, Sergei


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-22 Thread Ingo Molnar

* Linus Torvalds  wrote:

> And the real worry is things like AVX-512 etc, which is exactly when
> things like "save and restore one ymm register" will quite likely
> clear the upper bits of the zmm register.

Yeah, I think the only valid save/restore pattern is to 100% correctly 
enumerate 
the width of the vector registers, and use full width instructions.

Using partial registers, even though it's possible in some cases, is probably a 
bad idea not just due to most instructions auto-zeroing the upper portion to reduce 
false dependencies, but also because 'mixed' use of partial and full register 
access is known to result in penalties on a wide range of Intel CPUs, at least 
according to the Agner PDFs. On AMD CPUs there's no penalty.

So what I think could be done at best is to define a full register save/restore 
API, which falls back to XSAVE*/XRSTOR* if we don't have the routines for the 
native vector register width. (I.e. if old kernel is used on very new CPU.)

Note that the actual AVX code could still use partial width; it's the save/restore 
primitives that have to handle full width registers.
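
In schematic form such an API might look like the sketch below. This is
purely hypothetical: kernel_vec_begin()/kernel_vec_end() and the
save_*/restore_* helpers do not exist; only static_cpu_has() and the
X86_FEATURE_* flags are real:

/* hypothetical sketch only -- none of these helpers exist today */
struct vec_state {
	u8 buf[32 * 64] __aligned(64);	/* room for 32 full-width zmm registers */
};

static void kernel_vec_begin(struct vec_state *st)
{
	preempt_disable();
	if (static_cpu_has(X86_FEATURE_AVX512F))
		save_zmm_full(st);	/* native full-width save routine */
	else if (static_cpu_has(X86_FEATURE_AVX))
		save_ymm_full(st);
	else
		xsave_fallback(st);	/* width not enumerated: XSAVE* the lot */
}

static void kernel_vec_end(struct vec_state *st)
{
	restore_vec_full(st);		/* symmetric full-width restore */
	preempt_enable();
}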

> And yes, we can have some statically patched code that takes that into 
> account, 
> and saves the whole zmm register when AVX512 is on, but the whole *point* of 
> the 
> dynamic XSAVES thing is actually that Intel wants to be able enable new 
> user-space features without having to wait for OS support. Literally. That's 
> why 
> and how it was designed.

This aspect wouldn't be hurt AFAICS: to me it appears that due to glibc using 
vector instructions in its memset() the AVX bits get used early on and to the 
maximum, so the XINUSE for them is set for every task.

The optionality of other XSAVE based features like MPX wouldn't be hurt if the 
kernel only uses vector registers.

> And saving a couple of zmm registers is actually pretty hard. They're big. Do 
> you want to allocate 128 bytes of stack space, preferably 64-byte aligned, 
> for a 
> save area? No. So now it needs to be some kind of per-thread (or maybe 
> per-CPU, 
> if we're willing to continue to not preempt) special save area too.

Hm, that's indeed a nasty complication:

 - While a single 128 bytes slot might work - in practice at least two vector
   registers are needed to have enough parallelism and hide latencies.

 - &current->thread.fpu.state.xsave is available almost all the time: with our 
   current 'direct' FPU context switching code the only time there's live data in
   &current->thread.fpu is when the task is not running. But it's not IRQ-safe.

We could probably allow irq save/restore sections to use it, as 
local_irq_save()/restore() is still *much* faster than a 1-1.5K FPU context 
save/restore pattern.

But I was hoping for a less restrictive model ... :-/

To have a better model and avoid the local_irq_save()/restore we could perhaps 
change the IRQ model to have a per IRQ 'current' value (we have separate IRQ 
stacks already), but that's quite a bit of work to transform all code that 
operates on the interrupted task (scheduler and timer code).

But it's work that would be useful for other reasons as well.

With such a separation in place &current->thread.fpu.state.xsave would become a 
generic, natural vector register save area.

> And even then, it doesn't solve the real worry of "maybe there will be odd 
> interactions with future extensions that we don't even know of".

Yes, that's true, but I think we could avoid these dangers by using CPU model 
based enumeration. The cost would be that vector ops would only be available on 
new CPU models after an explicit opt-in. In many cases it will be a single new 
constant to an existing switch() statement, easily backported as well.
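
As a hypothetical illustration of that shape (the function and the
audited-model list are invented for the example; the INTEL_FAM6_*
constants are the real ones from asm/intel-family.h):

/* hypothetical sketch: vector ops stay off until a model is audited */
static bool cpu_vector_ops_allowed(void)
{
	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
		return false;

	switch (boot_cpu_data.x86_model) {
	case INTEL_FAM6_HASWELL_CORE:
	case INTEL_FAM6_SKYLAKE_DESKTOP:	/* each new model: one-line opt-in */
		return true;
	default:
		return false;
	}
}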

> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?

Ok, so that's not what I'd use it for, I'd use it:

 - Speed up existing AVX (crypto, RAID) routines for smaller buffer sizes.
   Right now the XSAVE*+XRSTOR* cost is significant:

 x86/fpu: Cost of: XSAVE   insn:   104 cycles
 x86/fpu: Cost of: XRSTOR  insn:    80 cycles

   ... and that's with just 128-bit AVX and a ~0.8K XSAVE area. The Agner PDF 
   lists Skylake XSAVE+XRSTOR costs at 107+122 cycles, plus there's probably a
   significant amount of L1 cache churn caused by XSAVE/XRSTOR.

   Most of the relevant vector instructions have a single cycle cost
   on the other hand.

 - To use vector ops in bulk, well-aligned memcpy(), which in many workloads
   is a fair chunk of all memset() activity. A usage profile on a typical 
system:

galatea:~> cat /proc/sched_debug  | grep hist | grep -E 
'[[:digit:]]{4,}$' | grep '0\]'
hist[0x0000]:1514272
hist[0x0010]:1905248
hist[0x0020]:  99471
hist[0x0030]: 343309
hist[0x0040]: 177874
hist[0x0080]: 190052
hist[0x00a0]:   5258
hist[0x00b0]:   2387
hist[0x00c0

Re: [PATCH 1/2] bpf: Remove struct bpf_verifier_env argument from print_bpf_insn

2018-03-22 Thread Daniel Borkmann
On 03/21/2018 07:37 PM, Jiri Olsa wrote:
> On Wed, Mar 21, 2018 at 05:25:33PM +, Quentin Monnet wrote:
>> 2018-03-21 16:02 UTC+0100 ~ Jiri Olsa 
>>> We use print_bpf_insn in user space (bpftool and soon perf),
>>> so it'd be nice to keep it generic and strip it off the kernel
>>> struct bpf_verifier_env argument.
>>>
>>> This argument can be safely removed, because its users can
>>> use the struct bpf_insn_cbs::private_data to pass it.
>>>
>>> Signed-off-by: Jiri Olsa 
>>> ---
>>>  kernel/bpf/disasm.c   | 52 
>>> +--
>>>  kernel/bpf/disasm.h   |  5 +
>>>  kernel/bpf/verifier.c |  6 +++---
>>>  3 files changed, 30 insertions(+), 33 deletions(-)
>>
>> [...]
>>
>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index c6eff108aa99..9f27d3fa7259 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>>> @@ -202,8 +202,7 @@ EXPORT_SYMBOL_GPL(bpf_verifier_log_write);
>>>   * generic for symbol export. The function was renamed, but not the calls 
>>> in
>>>   * the verifier to avoid complicating backports. Hence the alias below.
>>>   */
>>> -static __printf(2, 3) void verbose(struct bpf_verifier_env *env,
>>> -  const char *fmt, ...)
>>> +static __printf(2, 3) void verbose(void *private_data, const char *fmt, 
>>> ...)
>>> __attribute__((alias("bpf_verifier_log_write")));
>>
>> Just as a note, verbose() will be aliased to a function whose prototype
>> differs (bpf_verifier_log_write() still expects a struct
>> bpf_verifier_env as its first argument). I am not so familiar with
>> function aliases, could this change be a concern?
> 
> yea, but as it was pointer for pointer switch I did not
> see any problem with that.. I'll check more

Ok, holding off for now until we have clarification. Other option could also
be to make it void *private_data everywhere and for the kernel writer then
do struct bpf_verifier_env *env = private_data.
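
For reference, the pattern being discussed round-trips the typed pointer
through void *private_data, roughly like this stand-alone sketch (the
struct is a stand-in, not the kernel definition):

#include <stdarg.h>
#include <stdio.h>

struct verifier_env_model { int log_level; };

/* generic signature shared by kernel and user-space callers */
static void verbose(void *private_data, const char *fmt, ...)
{
	/* the kernel-side writer recovers its typed pointer from private_data */
	struct verifier_env_model *env = private_data;
	va_list args;

	if (!env->log_level)
		return;
	va_start(args, fmt);
	vprintf(fmt, args);
	va_end(args);
}

int main(void)
{
	struct verifier_env_model env = { .log_level = 1 };

	verbose(&env, "R%d pointer arithmetic\n", 3);
	return 0;
}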


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-22 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> On Wed, Mar 21, 2018 at 6:32 AM, Ingo Molnar  wrote:
> >
> > * Linus Torvalds  wrote:
> >
> >> And even if you ignore that "maintenance problems down the line" issue
> >> ("we can fix them when they happen") I don't want to see games like
> >> this, because I'm pretty sure it breaks the optimized xsave by tagging
> >> the state as being dirty.
> >
> > That's true - and it would penalize the context switch cost of the affected 
> > task
> > for the rest of its lifetime, as I don't think there's much that clears 
> > XINUSE
> > other than a FINIT, which is rarely done by user-space.
> >
> >> So no. Don't use vector stuff in the kernel. It's not worth the pain.
> >
> > I agree, but:
> >
> >> The *only* valid use is pretty much crypto, and even there it has had 
> >> issues.
> >> Benchmarks use big arrays and/or dense working sets etc to "prove" how 
> >> good the
> >> vector version is, and then you end up in situations where it's used once 
> >> per
> >> fairly small packet for an interrupt, and it's actually much worse than 
> >> doing it
> >> by hand.
> >
> > That's mainly because the XSAVE/XRESTOR done by kernel_fpu_begin()/end() is 
> > so
> > expensive, so this argument is somewhat circular.
> 
> If we do the deferred restore, then the XSAVE/XRSTOR happens at most
> once per kernel entry, which isn't so bad IMO.  Also, with PTI, kernel
> entries are already so slow that this will be mostly in the noise :(

For performance/scalability work we should just ignore the PTI overhead: it 
doesn't exist on AMD CPUs and Intel has announced Meltdown-fixed CPUs, to be 
released later this year:

   https://www.anandtech.com/show/12533/intel-spectre-meltdown

By the time any kernel changes we are talking about today get to distros and 
users 
the newest hardware won't have the Meltdown bug.

Thanks,

Ingo


Re: [PATCH V2 net-next 07/14] net/tls: Support TLS device offload with IPv6

2018-03-22 Thread Sergei Shtylyov

Hello!

On 3/22/2018 12:01 AM, Saeed Mahameed wrote:


From: Ilya Lesokhin 

Previously get_netdev_for_sock worked only with IPv4.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 


   Only stylistic comments...


---
  net/tls/tls_device.c | 49 -
  1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index e623280ea019..c35fc107d9c5 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -34,6 +34,11 @@
  #include 
  #include 
  #include 
+#include 
+#include 
+#include 
+#include 
+#include 
  
  #include 

  #include 
@@ -99,13 +104,55 @@ static void tls_device_queue_ctx_destruction(struct 
tls_context *ctx)
spin_unlock_irqrestore(&tls_device_lock, flags);
  }
  
+static inline struct net_device *ipv6_get_netdev(struct sock *sk)

+{
+   struct net_device *dev = NULL;
+#if IS_ENABLED(CONFIG_IPV6)
+   struct inet_sock *inet = inet_sk(sk);
+   struct ipv6_pinfo *np = inet6_sk(sk);
+   struct flowi6 _fl6, *fl6 = &_fl6;
+   struct dst_entry *dst;
+
+   memset(fl6, 0, sizeof(*fl6));
+   fl6->flowi6_proto = sk->sk_protocol;
+   fl6->daddr = sk->sk_v6_daddr;
+   fl6->saddr = np->saddr;
+   fl6->flowlabel = np->flow_label;
+   IP6_ECN_flow_xmit(sk, fl6->flowlabel);
+   fl6->flowi6_oif = sk->sk_bound_dev_if;
+   fl6->flowi6_mark = sk->sk_mark;
+   fl6->fl6_sport = inet->inet_sport;
+   fl6->fl6_dport = inet->inet_dport;
+   fl6->flowi6_uid = sk->sk_uid;
+   security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
+
+   if (ipv6_stub->ipv6_dst_lookup(sock_net(sk), sk, &dst, fl6) < 0)
+   return NULL;
+
+   dev = dst->dev;
+   dev_hold(dev);
+   dst_release(dst);
+
+#endif


   I think the above empty line should be outside #if as you need an empty 
line between the declaration and other statements.



+   return dev;
+}
+
  /* We assume that the socket is already connected */
  static struct net_device *get_netdev_for_sock(struct sock *sk)
  {
struct inet_sock *inet = inet_sk(sk);
struct net_device *netdev = NULL;
  
-	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);

+   if (sk->sk_family == AF_INET)
+   netdev = dev_get_by_index(sock_net(sk),
+ inet->cork.fl.flowi_oif);
+   else if (sk->sk_family == AF_INET6) {


   Need {} in the 1st *if* branch since you have it in the 2nd.


+   netdev = ipv6_get_netdev(sk);
+   if (!netdev && !sk->sk_ipv6only &&
+   ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED)
+   netdev = dev_get_by_index(sock_net(sk),
+ inet->cork.fl.flowi_oif);
+   }
  
  	return netdev;

  }


MBR, Sergei




Re: [PATCH v2 bpf-next 5/8] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-22 Thread Daniel Borkmann
On 03/21/2018 07:54 PM, Alexei Starovoitov wrote:
[...]
> @@ -546,6 +556,53 @@ extern void ftrace_profile_free_filter(struct perf_event 
> *event);
>  void perf_trace_buf_update(void *record, u16 type);
>  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>  
> +void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
> +void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
> +void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3);
> +void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4);
> +void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4, u64 arg5);
> +void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4, u64 arg5, u64 arg6);
> +void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7);
> +void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> + u64 arg8);
> +void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> + u64 arg8, u64 arg9);
> +void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10);
> +void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10, u64 arg11);
> +void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12);
> +void bpf_trace_run13(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +  u64 arg13);
> +void bpf_trace_run14(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +  u64 arg13, u64 arg14);
> +void bpf_trace_run15(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +  u64 arg13, u64 arg14, u64 arg15);
> +void bpf_trace_run16(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +  u64 arg13, u64 arg14, u64 arg15, u64 arg16);
> +void bpf_trace_run17(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +  u64 arg13, u64 arg14, u64 arg15, u64 arg16, u64 arg17);
>  void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
>  struct trace_event_call *call, u64 count,
>  struct pt_regs *regs, struct hlist_head *head,
[...]
> @@ -896,3 +976,206 @@ int perf_event_query_prog_array(struct perf_event 
> *event, void __user *info)
>  
>   return ret;
>  }
> +
> +static __always_inline
> +void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
> +{
> + rcu_read_lock();
> + preempt_disable();
> + (void) BPF_PROG_RUN(prog, args);
> + preempt_enable();
> + rcu_read_unlock();
> +}
> +
> +#define EVAL1(FN, X) FN(X)
> +#define EVAL2(FN, X, Y...) FN(X) EVAL1(FN, Y)
> +#define EVAL3(FN, X, Y...) FN(X) EVAL2(FN, Y)
> +#define EVAL4(FN, X, Y...) FN(X) EVAL3(FN, Y)
> +#define EVAL5(FN, X, Y...) FN(X) EVAL4(FN, Y)
> +#define EVAL6(FN, X, Y...) FN(X) EVAL5(FN, Y)
> +
> +#define COPY(X) args[X - 1] = arg##X;
> +
> +void bpf_trace_run1(struct bpf_prog *prog, u64 arg1)
> +{
> + u64 args[1];
> +
> + EVAL1(COPY, 1);
> + __bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run1);
> +void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2)
> +{
> + u64 args[2];
> +
> + EVAL2(COPY, 1, 2);
> + __bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run2);
> +void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3)
> +{
> + u64 args[3];
> +
> + EVAL3(COPY, 1, 2, 3);
> + __bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run3);
> +void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4)
> +{
> + u64 args[4];
> +
> + EVAL4(COPY, 1, 2,

[PATCH net-next v3 1/5] net: Revert "ipv4: get rid of ip_ra_lock"

2018-03-22 Thread Kirill Tkhai
This reverts commit ba3f571d5dde. The commit was made
after 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control",
and killed ip_ra_lock, which became useless after rtnl_lock()
started being used to destroy every raw ipv4 socket. This scales
very badly, and the next patch in the series reverts 1215e51edad1.
ip_ra_lock will be used again.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/ip_sockglue.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 74c962b9b09c..be7c3b71914d 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -334,6 +334,7 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, 
struct ipcm_cookie *ipc,
sent to multicast group to reach destination designated router.
  */
 struct ip_ra_chain __rcu *ip_ra_chain;
+static DEFINE_SPINLOCK(ip_ra_lock);
 
 
 static void ip_ra_destroy_rcu(struct rcu_head *head)
@@ -355,17 +356,21 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 
new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
+   spin_lock_bh(&ip_ra_lock);
for (rap = &ip_ra_chain;
-(ra = rtnl_dereference(*rap)) != NULL;
+(ra = rcu_dereference_protected(*rap,
+   lockdep_is_held(&ip_ra_lock))) != NULL;
 rap = &ra->next) {
if (ra->sk == sk) {
if (on) {
+   spin_unlock_bh(&ip_ra_lock);
kfree(new_ra);
return -EADDRINUSE;
}
/* dont let ip_call_ra_chain() use sk again */
ra->sk = NULL;
RCU_INIT_POINTER(*rap, ra->next);
+   spin_unlock_bh(&ip_ra_lock);
 
if (ra->destructor)
ra->destructor(sk);
@@ -379,14 +384,17 @@ int ip_ra_control(struct sock *sk, unsigned char on,
return 0;
}
}
-   if (!new_ra)
+   if (!new_ra) {
+   spin_unlock_bh(&ip_ra_lock);
return -ENOBUFS;
+   }
new_ra->sk = sk;
new_ra->destructor = destructor;
 
RCU_INIT_POINTER(new_ra->next, ra);
rcu_assign_pointer(*rap, new_ra);
sock_hold(sk);
+   spin_unlock_bh(&ip_ra_lock);
 
return 0;
 }



[PATCH net-next v3 2/5] net: Move IP_ROUTER_ALERT out of lock_sock(sk)

2018-03-22 Thread Kirill Tkhai
ip_ra_control() does not need sk_lock. Who are the other
users of ip_ra_chain? ip_mroute_setsockopt() doesn't take
sk_lock, while parallel IP_ROUTER_ALERT syscalls are
synchronized by ip_ra_lock. So, we may move this command
out of sk_lock.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/ip_sockglue.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index be7c3b71914d..dcbf6afe27e7 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -647,6 +647,8 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 
/* If optlen==0, it is equivalent to val == 0 */
 
+   if (optname == IP_ROUTER_ALERT)
+   return ip_ra_control(sk, val ? 1 : 0, NULL);
if (ip_mroute_opt(optname))
return ip_mroute_setsockopt(sk, optname, optval, optlen);
 
@@ -1157,9 +1159,6 @@ static int do_ip_setsockopt(struct sock *sk, int level,
goto e_inval;
inet->mc_all = val;
break;
-   case IP_ROUTER_ALERT:
-   err = ip_ra_control(sk, val ? 1 : 0, NULL);
-   break;
 
case IP_FREEBIND:
if (optlen < 1)



[PATCH net-next v3 3/5] net: Revert "ipv4: fix a deadlock in ip_ra_control"

2018-03-22 Thread Kirill Tkhai
This reverts commit 1215e51edad1.
Since raw_close() is used on every RAW socket destruction,
the changes made by 1215e51edad1 scale badly. This is clearly
seen in an endless unshare(CLONE_NEWNET) test, where the
cleanup_net() kwork spends a lot of time waiting for the
rtnl_lock() introduced by this commit.

The previous patch moved IP_ROUTER_ALERT out of rtnl_lock(),
so we revert this patch.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/ip_sockglue.c |1 -
 net/ipv4/ipmr.c|   11 +--
 net/ipv4/raw.c |2 --
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index dcbf6afe27e7..bf5f44b27b7e 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -594,7 +594,6 @@ static bool setsockopt_needs_rtnl(int optname)
case MCAST_LEAVE_GROUP:
case MCAST_LEAVE_SOURCE_GROUP:
case MCAST_UNBLOCK_SOURCE:
-   case IP_ROUTER_ALERT:
return true;
}
return false;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index d752a70855d8..f6be5db16da2 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1399,7 +1399,7 @@ static void mrtsock_destruct(struct sock *sk)
struct net *net = sock_net(sk);
struct mr_table *mrt;
 
-   ASSERT_RTNL();
+   rtnl_lock();
ipmr_for_each_table(mrt, net) {
if (sk == rtnl_dereference(mrt->mroute_sk)) {
IPV4_DEVCONF_ALL(net, MC_FORWARDING)--;
@@ -1411,6 +1411,7 @@ static void mrtsock_destruct(struct sock *sk)
mroute_clean_tables(mrt, false);
}
}
+   rtnl_unlock();
 }
 
 /* Socket options and virtual interface manipulation. The whole
@@ -1475,8 +1476,13 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval,
if (sk != rcu_access_pointer(mrt->mroute_sk)) {
ret = -EACCES;
} else {
+   /* We need to unlock here because mrtsock_destruct takes
+* care of rtnl itself and we can't change that due to
+* the IP_ROUTER_ALERT setsockopt which runs without it.
+*/
+   rtnl_unlock();
ret = ip_ra_control(sk, 0, NULL);
-   goto out_unlock;
+   goto out;
}
break;
case MRT_ADD_VIF:
@@ -1588,6 +1594,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval,
}
 out_unlock:
rtnl_unlock();
+out:
return ret;
 }
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 54648d20bf0f..720bef7da2f6 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -711,9 +711,7 @@ static void raw_close(struct sock *sk, long timeout)
/*
 * Raw sockets may have direct kernel references. Kill them.
 */
-   rtnl_lock();
ip_ra_control(sk, 0, NULL);
-   rtnl_unlock();
 
sk_common_release(sk);
 }



[PATCH net-next v3 0/5] Rework ip_ra_chain protection

2018-03-22 Thread Kirill Tkhai
Commit 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control"
made rtnl_lock() be used in raw_close(). This function is called
on every RAW socket destruction, so that rtnl_mutex is taken
every time. This scales very badly. I observe cleanup_net()
spending a lot of time in rtnl_lock(), and raw_close() is one
of the biggest rtnl users (since we have percpu net->ipv4.icmp_sk).

This patchset reworks the locking: reverts the problem commit
and its descendant, and introduces rtnl-independent locking.
This may have a continuation, and someone may work on killing
rtnl_lock() in mrtsock_destruct() in the future.

Thanks,
Kirill

---
v3: Change patches order: [2/5] and [3/5].
v2: Fix sparse warning [4/5], as reported by kbuild test robot.

---

Kirill Tkhai (5):
  net: Revert "ipv4: get rid of ip_ra_lock"
  net: Move IP_ROUTER_ALERT out of lock_sock(sk)
  net: Revert "ipv4: fix a deadlock in ip_ra_control"
  net: Make ip_ra_chain per struct net
  net: Replace ip_ra_lock with per-net mutex


 include/net/ip.h |   13 +++--
 include/net/netns/ipv4.h |2 ++
 net/core/net_namespace.c |1 +
 net/ipv4/ip_input.c  |5 ++---
 net/ipv4/ip_sockglue.c   |   34 +-
 net/ipv4/ipmr.c  |   11 +--
 net/ipv4/raw.c   |2 --
 7 files changed, 38 insertions(+), 30 deletions(-)

--
Signed-off-by: Kirill Tkhai 


[PATCH net-next v3 5/5] net: Replace ip_ra_lock with per-net mutex

2018-03-22 Thread Kirill Tkhai
Since ra_chain is per-net, we may use per-net mutexes
to protect them in ip_ra_control(). This improves
scalability.

Signed-off-by: Kirill Tkhai 
---
 include/net/netns/ipv4.h |1 +
 net/core/net_namespace.c |1 +
 net/ipv4/ip_sockglue.c   |   15 ++-
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 97d7ee6667c7..8491bc9c86b1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -50,6 +50,7 @@ struct netns_ipv4 {
struct ipv4_devconf *devconf_all;
struct ipv4_devconf *devconf_dflt;
struct ip_ra_chain __rcu *ra_chain;
+   struct mutexra_mutex;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_rules_ops*rules_ops;
boolfib_has_custom_rules;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index c340d5cfbdec..95ba2c53bd9a 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -301,6 +301,7 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
net->user_ns = user_ns;
idr_init(&net->netns_ids);
spin_lock_init(&net->nsid_lock);
+   mutex_init(&net->ipv4.ra_mutex);
 
list_for_each_entry(ops, &pernet_list, list) {
error = ops_init(ops, net);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index f36d35fe924b..5ad2d8ed3a3f 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -322,9 +322,6 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, 
struct ipcm_cookie *ipc,
return 0;
 }
 
-static DEFINE_SPINLOCK(ip_ra_lock);
-
-
 static void ip_ra_destroy_rcu(struct rcu_head *head)
 {
struct ip_ra_chain *ra = container_of(head, struct ip_ra_chain, rcu);
@@ -345,21 +342,21 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 
new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
-   spin_lock_bh(&ip_ra_lock);
+   mutex_lock(&net->ipv4.ra_mutex);
for (rap = &net->ipv4.ra_chain;
 (ra = rcu_dereference_protected(*rap,
-   lockdep_is_held(&ip_ra_lock))) != NULL;
+   lockdep_is_held(&net->ipv4.ra_mutex))) != NULL;
 rap = &ra->next) {
if (ra->sk == sk) {
if (on) {
-   spin_unlock_bh(&ip_ra_lock);
+   mutex_unlock(&net->ipv4.ra_mutex);
kfree(new_ra);
return -EADDRINUSE;
}
/* dont let ip_call_ra_chain() use sk again */
ra->sk = NULL;
RCU_INIT_POINTER(*rap, ra->next);
-   spin_unlock_bh(&ip_ra_lock);
+   mutex_unlock(&net->ipv4.ra_mutex);
 
if (ra->destructor)
ra->destructor(sk);
@@ -374,7 +371,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
}
}
if (!new_ra) {
-   spin_unlock_bh(&ip_ra_lock);
+   mutex_unlock(&net->ipv4.ra_mutex);
return -ENOBUFS;
}
new_ra->sk = sk;
@@ -383,7 +380,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
RCU_INIT_POINTER(new_ra->next, ra);
rcu_assign_pointer(*rap, new_ra);
sock_hold(sk);
-   spin_unlock_bh(&ip_ra_lock);
+   mutex_unlock(&net->ipv4.ra_mutex);
 
return 0;
 }



[PATCH net-next v3 4/5] net: Make ip_ra_chain per struct net

2018-03-22 Thread Kirill Tkhai
This is an optimization which makes ip_call_ra_chain()
iterate over fewer sockets to find the ones it's looking for.

Signed-off-by: Kirill Tkhai 
---
 include/net/ip.h |   13 +++--
 include/net/netns/ipv4.h |1 +
 net/ipv4/ip_input.c  |5 ++---
 net/ipv4/ip_sockglue.c   |   15 ++-
 4 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index fe63ba95d12b..d53b5a9eae34 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -91,6 +91,17 @@ static inline int inet_sdif(struct sk_buff *skb)
return 0;
 }
 
+/* Special input handler for packets caught by router alert option.
+   They are selected only by protocol field, and then processed likely
+   local ones; but only if someone wants them! Otherwise, router
+   not running rsvpd will kill RSVP.
+
+   It is user level problem, what it will make with them.
+   I have no idea, how it will masquearde or NAT them (it is joke, joke :-)),
+   but receiver should be enough clever f.e. to forward mtrace requests,
+   sent to multicast group to reach destination designated router.
+ */
+
 struct ip_ra_chain {
struct ip_ra_chain __rcu *next;
struct sock *sk;
@@ -101,8 +112,6 @@ struct ip_ra_chain {
struct rcu_head rcu;
 };
 
-extern struct ip_ra_chain __rcu *ip_ra_chain;
-
 /* IP flags. */
 #define IP_CE  0x8000  /* Flag: "Congestion"   */
 #define IP_DF  0x4000  /* Flag: "Don't Fragment"   */
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 382bfd7583cf..97d7ee6667c7 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -49,6 +49,7 @@ struct netns_ipv4 {
 #endif
struct ipv4_devconf *devconf_all;
struct ipv4_devconf *devconf_dflt;
+   struct ip_ra_chain __rcu *ra_chain;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_rules_ops*rules_ops;
boolfib_has_custom_rules;
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 57fc13c6ab2b..7582713dd18f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -159,7 +159,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
struct net_device *dev = skb->dev;
struct net *net = dev_net(dev);
 
-   for (ra = rcu_dereference(ip_ra_chain); ra; ra = 
rcu_dereference(ra->next)) {
+   for (ra = rcu_dereference(net->ipv4.ra_chain); ra; ra = 
rcu_dereference(ra->next)) {
struct sock *sk = ra->sk;
 
/* If socket is bound to an interface, only report
@@ -167,8 +167,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
 */
if (sk && inet_sk(sk)->inet_num == protocol &&
(!sk->sk_bound_dev_if ||
-sk->sk_bound_dev_if == dev->ifindex) &&
-   net_eq(sock_net(sk), net)) {
+sk->sk_bound_dev_if == dev->ifindex)) {
if (ip_is_fragment(ip_hdr(skb))) {
if (ip_defrag(net, skb, 
IP_DEFRAG_CALL_RA_CHAIN))
return true;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index bf5f44b27b7e..f36d35fe924b 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -322,18 +322,6 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, 
struct ipcm_cookie *ipc,
return 0;
 }
 
-
-/* Special input handler for packets caught by router alert option.
-   They are selected only by protocol field, and then processed likely
-   local ones; but only if someone wants them! Otherwise, router
-   not running rsvpd will kill RSVP.
-
-   It is user level problem, what it will make with them.
-   I have no idea, how it will masquearde or NAT them (it is joke, joke :-)),
-   but receiver should be enough clever f.e. to forward mtrace requests,
-   sent to multicast group to reach destination designated router.
- */
-struct ip_ra_chain __rcu *ip_ra_chain;
 static DEFINE_SPINLOCK(ip_ra_lock);
 
 
@@ -350,6 +338,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 {
struct ip_ra_chain *ra, *new_ra;
struct ip_ra_chain __rcu **rap;
+   struct net *net = sock_net(sk);
 
if (sk->sk_type != SOCK_RAW || inet_sk(sk)->inet_num == IPPROTO_RAW)
return -EINVAL;
@@ -357,7 +346,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
spin_lock_bh(&ip_ra_lock);
-   for (rap = &ip_ra_chain;
+   for (rap = &net->ipv4.ra_chain;
 (ra = rcu_dereference_protected(*rap,
lockdep_is_held(&ip_ra_lock))) != NULL;
 rap = &ra->next) {



Re: [PATCH net-next v3 0/5] Rework ip_ra_chain protection

2018-03-22 Thread Kirill Tkhai
On 22.03.2018 12:44, Kirill Tkhai wrote:
> Commit 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control"
> made rtnl_lock() be used in raw_close(). This function is called
> on every RAW socket destruction, so that rtnl_mutex is taken
> every time. This scales very badly. I observe cleanup_net()
> spending a lot of time in rtnl_lock(), and raw_close() is one
> of the biggest rtnl users (since we have percpu net->ipv4.icmp_sk).
> 
> This patchset reworks the locking: reverts the problem commit
> and its descendant, and introduces rtnl-independent locking.
> This may have a continuation, and someone may work on killing
> rtnl_lock() in mrtsock_destruct() in the future.
> 
> Thanks,
> Kirill
> 
> ---
> v3: Change patches order: [2/5] and [3/5].
> v2: Fix sparse warning [4/5], as reported by kbuild test robot.
> 
> ---
> 
> Kirill Tkhai (5):
>   net: Revert "ipv4: get rid of ip_ra_lock"
>   net: Move IP_ROUTER_ALERT out of lock_sock(sk)
>   net: Revert "ipv4: fix a deadlock in ip_ra_control"
>   net: Make ip_ra_chain per struct net
>   net: Replace ip_ra_lock with per-net mutex
> 
> 
>  include/net/ip.h |   13 +++--
>  include/net/netns/ipv4.h |2 ++
>  net/core/net_namespace.c |1 +
>  net/ipv4/ip_input.c  |5 ++---
>  net/ipv4/ip_sockglue.c   |   34 +-
>  net/ipv4/ipmr.c  |   11 +--
>  net/ipv4/raw.c   |2 --
>  7 files changed, 38 insertions(+), 30 deletions(-)
> 
> --
> Signed-off-by: Kirill Tkhai 

JFI: I used the below program to test:

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 

int main()
{
int sk, v, i = 0;

if (unshare(CLONE_NEWNET)) {
perror("unshare");
return 1;
}
sk = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
if (sk < 0) {
perror("socket");
return 1;
}
for (i = 0; i < 3; i++)
fork();

while (1) {
setsockopt(sk, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
setsockopt(sk, IPPROTO_IP, MRT_DONE, (void *)&v, sizeof(v));
v = (i++)%2;
setsockopt(sk, IPPROTO_IP, IP_ROUTER_ALERT, (void *)&v, 
sizeof(v));
}

return 0;
}


RE: DTS for our Configuration

2018-03-22 Thread Alayev Michael
Hi Andrew,

>I think this is a problem with the macb driver. To me, it looks like you are 
>going to have to >make some changes to the driver to make this work. Normally 
>the MDIO bus children are >placed within a container node, often called 
>'mdio-bus' or simply 'mdio'. See for example 
>>Documentation/devicetree/bindings/net/fsl-fec.txt.  The macb driver does not 
>do this. It >passed the main DT node of the device to of_mdiobus_register(). 
>It then walks all the >children assuming they devices on the MDIO bus. But the 
>first child it finds is the 'fixed-link'. >This is not supposed to be a child 
>of the bus, which is why it goes wrong.

As you understand, I prefer not to change the driver. 
Is there a way for me to bypass this issue?
Can I use some other property than 'fixed-link'?

>Please include the full panic details. The stack trace can be very useful.
Please, see below:

You also asked for the linux log of your suggested dts:
libphy: Fixed MDIO Bus: probed
CAN device driver interface
libphy: MACB_mii_bus: probed
mdio_bus e000b000.ethernet-: /amba/ethernet@e000b000/fixed-link has 
invalid PHY address
mv88e6085 e000b000.ethernet-:1d: switch 0xa10 detected: Marvell 
88E6390X, revision 1
libphy: mv88e6xxx SMI: probed
mv88e6085 e000b000.ethernet-:1c: switch 0xa10 detected: Marvell 
88E6390X, revision 1
libphy: mv88e6xxx SMI: probed
mv88e6085: probe of e000b000.ethernet-:1c failed with error -16
mdio_bus e000b000.ethernet-: scan phy fixed-link at address 1
Unable to handle kernel NULL pointer dereference at virtual address 0004
pgd = c0004000
[0004] *pgd=
Internal error: Oops - BUG: 17 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.14.0-xilinx #15
Hardware name: Xilinx Zynq Platform
task: df43b840 task.stack: df43c000
PC is at dsa_unregister_switch+0x10/0x48
LR is at dsa_unregister_switch+0x10/0x48
pc : []    lr : []    psr: 6013
sp : df43dd58  ip :   fp : 
r10:   r9 : dd009c78  r8 : 0034
r7 : dd0d1434  r6 : c091dd68  r5 :   r4 : dd0d1210
r3 : df43b840  r2 :   r1 :   r0 : c0936cdc
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 18c5387d  Table: 1f65c04a  DAC: 0051
Process swapper/0 (pid: 1, stack limit = 0xdf43c210)
Stack: (0xdf43dd58 to 0xdf43e000)
dd40:   dd0d1210 
dd60: c091dd68 c0418ac0 dd0d1400  c091dd68 c04127ac dd0d1400 c03b72b8
dd80: dd0d1400 df4ff530 c091c448 dd0d1460 dfbeb3c4 c03b6420 dd0d1400 dd0d145c
dda0: dd009c78 c03b3acc dd009e8c a013 dd1d2800 dd009e80 dd009e8c dd0d1400
ddc0: dd0d1400 dd009e84 dd009e8c c0412868 dd009c00 c0412624 ffed 0001
dde0: dd009c00 dfbeb67c dfbeb3c4 c04bafac  c04b378c dfbecb5c 001c
de00: c06ffa6d df77f000 df77c000 df77c4c0 dfbeb3c4 df524a10  c0423104
de20:  df43de5c 0001 c0229cd4 dfbeb3c4 c0421054 0001 0002
de40: 0001 0001 dd1c8898 dd1c9cc0 dd1c9c40 dd1c9bc0  
de60: 0001 0001 c091ddc8 c0422974 df524a10 c091ddc8  
de80: c091ddc8   c03b84d4 df524a10 c09523dc c09523e0 c03b6e84
dea0: df524a10 df524a44 c091ddc8 c0918a28  c0938000 c083383c c03b7080
dec0:  c091ddc8 c03b7000 c03b5790 df472f58 df4edb34 c091ddc8 
dee0: dd1c5300 c03b64f8 c06ff8d2 c06ff8d3  c091ddc8 c081aea4 00a8
df00: c083be5c c03b7808 0006 c081aea4 00a8 c0101900  df43df24
df20:    c075d548 00a8 0006 0006 
df40: cccd   c0938000 c083383c 0006 c0833830 00a8
df60: 0006 c0833834 00a8 c083be5c c0938000 c0800d40 0006 0006
df80:  c0800594  c05e1e64    
dfa0:  c05e1e6c  c0106fd0    
dfc0:        
dfe0:     0013   
[] (dsa_unregister_switch) from [] 
(mv88e6xxx_remove+0x1c/0x68)
[] (mv88e6xxx_remove) from [] (mdio_remove+0x18/0x28)
[] (mdio_remove) from [] 
(device_release_driver_internal+0x128/0x1d0)
[] (device_release_driver_internal) from [] 
(bus_remove_device+0xcc/0xdc)
[] (bus_remove_device) from [] (device_del+0x1bc/0x258)
[] (device_del) from [] (mdio_device_remove+0xc/0x18)
[] (mdio_device_remove) from [] 
(mdiobus_unregister+0x40/0x74)
[] (mdiobus_unregister) from [] 
(of_mdiobus_register+0x234/0x254)
[] (of_mdiobus_register) from [] (macb_probe+0x790/0xb88)
[] (macb_probe) from [] (platform_drv_probe+0x50/0xa0)
[] (platform_drv_probe) from [] 
(driver_probe_device+0x13c/0x2b8)
[] (driver_probe_device) from [] (__driver_attach+0x80/0xa4)
[] (__driver_attach) from [] (bus_for_each_dev+0x68/0x8c)
[] (bus_for_each_dev) from [] (bus_add_driver+0xc8/0x1dc)
[] (bus_add_driver) from [] (driver_register+0x9

RE: [PATCH v4 11/17] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

2018-03-22 Thread Kalluru, Sudarsana
-Original Message-
From: Sinan Kaya [mailto:ok...@codeaurora.org] 
Sent: 20 March 2018 08:12
To: netdev@vger.kernel.org; ti...@codeaurora.org; sulr...@codeaurora.org
Cc: linux-arm-...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; Sinan 
Kaya ; Elior, Ariel ; Dept-Eng 
Everest Linux L2 ; 
linux-ker...@vger.kernel.org
Subject: [PATCH v4 11/17] bnx2x: Eliminate duplicate barriers on weakly-ordered 
archs

Code includes wmb() followed by writel(). writel() already has a barrier on 
some architectures like arm64.

This ends up with the CPU observing two barriers back to back before 
executing the register write.

Since the code already has an explicit barrier call, change writel() to 
writel_relaxed().
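
In pattern form (an illustrative sketch only; "reg" and "val" stand in for
the driver's actual doorbell address and value):

	/* order prior descriptor/memory writes */
	wmb();
	/*
	 * writel() would imply a second barrier on e.g. arm64;
	 * writel_relaxed() omits it, since the wmb() above already
	 * provides the required ordering.
	 */
	writel_relaxed(val, reg);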

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h   |  9 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  | 21 +++--  
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c |  2 +-  
drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c  |  2 +-
 5 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 352beff..ac38db9 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -166,6 +166,12 @@ do {   \
 #define REG_RD8(bp, offset)	readb(REG_ADDR(bp, offset))
 #define REG_RD16(bp, offset)   readw(REG_ADDR(bp, offset))
 
+#define REG_WR_RELAXED(bp, offset, val)	writel_relaxed((u32)val,\
+  REG_ADDR(bp, offset))
+
+#define REG_WR16_RELAXED(bp, offset, val) \
+   writew_relaxed((u16)val, REG_ADDR(bp, offset))
+
 #define REG_WR(bp, offset, val)	writel((u32)val, REG_ADDR(bp, offset))
 #define REG_WR8(bp, offset, val)   writeb((u8)val, REG_ADDR(bp, offset))
 #define REG_WR16(bp, offset, val)  writew((u16)val, REG_ADDR(bp, offset))
@@ -760,7 +766,8 @@ struct bnx2x_fastpath {
 #endif
 #define DOORBELL(bp, cid, val) \
 do { \
-	writel((u32)(val), bp->doorbells + (bp->db_size * (cid))); \
+	writel_relaxed((u32)(val),	\
+		       bp->doorbells + (bp->db_size * (cid))); \
 } while (0)
 
 /* TX CSUM helpers */
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index a5265e1..a8ce5c5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -522,8 +522,8 @@ static inline void bnx2x_update_rx_prod(struct bnx2x *bp,
wmb();
 
for (i = 0; i < sizeof(rx_prods)/4; i++)
-   REG_WR(bp, fp->ustorm_rx_prods_offset + i*4,
-  ((u32 *)&rx_prods)[i]);
+   REG_WR_RELAXED(bp, fp->ustorm_rx_prods_offset + i * 4,
+  ((u32 *)&rx_prods)[i]);
 
mmiowb(); /* keep prod updates ordered */
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af..2dea1b6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -1608,8 +1608,8 @@ static void bnx2x_hc_int_enable(struct bnx2x *bp)
} else
val = 0x;
 
-   REG_WR(bp, HC_REG_TRAILING_EDGE_0 + port*8, val);
-   REG_WR(bp, HC_REG_LEADING_EDGE_0 + port*8, val);
+   REG_WR_RELAXED(bp, HC_REG_TRAILING_EDGE_0 + port * 8, val);
+   REG_WR_RELAXED(bp, HC_REG_LEADING_EDGE_0 + port * 8, val);
}
 
	/* Make sure that interrupts are indeed enabled from here on */
@@ -1672,8 +1672,8 @@ static void bnx2x_igu_int_enable(struct bnx2x *bp)
} else
val = 0x;
 
-   REG_WR(bp, IGU_REG_TRAILING_EDGE_LATCH, val);
-   REG_WR(bp, IGU_REG_LEADING_EDGE_LATCH, val);
+   REG_WR_RELAXED(bp, IGU_REG_TRAILING_EDGE_LATCH, val);
+   REG_WR_RELAXED(bp, IGU_REG_LEADING_EDGE_LATCH, val);
 
/* Make sure that interrupts are indeed enabled from here on */
mmiowb();
@@ -3817,8 +3817,8 @@ static void bnx2x_sp_prod_update(struct bnx2x *bp)
 */
mb();
 
-   REG_WR16(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
-bp->spq_prod_idx);
+   REG_WR16_RELAXED(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
+bp->spq_prod_idx);
mmiowb();
 }
 
@@ -7761,7 +7761,7 @@ void bnx2x_igu_clear_sb_gen(struct bnx2x *bp, u8 func, u8 idu_sb_id, bool is_pf)
barrier();
DP(NETIF_MSG_HW, "write 0x%08x to IGU(via GRC) addr 0x%x\n",
  ctl, igu_addr_ctl);
-   REG_WR(bp, igu_addr_ctl, ctl);
+   REG_WR_RELAXED(bp, igu_addr_ctl, ctl);
mmiowb();
ba

[PATCH nf] netfilter: drop template ct when conntrack is skipped.

2018-03-22 Thread Paolo Abeni
The ipv4 nf_ct code currently skips the nf_conntrack_in() call
for fragmented packets. As a result, later matches/targets can end
up manipulating the template ct entry instead of 'real' ones.

Exploiting the above, syzbot found a way to trigger the following
splat:

WARNING: CPU: 1 PID: 4242 at net/netfilter/xt_cluster.c:55
xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 4242 Comm: syzkaller027971 Not tainted 4.16.0-rc2+ #243
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x24d lib/dump_stack.c:53
  panic+0x1e4/0x41c kernel/panic.c:183
  __warn+0x1dc/0x200 kernel/panic.c:547
  report_bug+0x211/0x2d0 lib/bug.c:184
  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
  fixup_bug arch/x86/kernel/traps.c:247 [inline]
  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
  invalid_op+0x58/0x80 arch/x86/entry/entry_64.S:957
RIP: 0010:xt_cluster_hash net/netfilter/xt_cluster.c:55 [inline]
RIP: 0010:xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
RSP: 0018:8801d2f6f2d0 EFLAGS: 00010293
RAX: 8801af700540 RBX:  RCX: 84a2d1e1
RDX:  RSI: 8801d2f6f478 RDI: 8801cafd336a
RBP: 8801d2f6f2e8 R08:  R09: 0001
R10:  R11:  R12: 8801b03b3d18
R13: 8801cafd3300 R14: dc00 R15: 8801d2f6f478
  ipt_do_table+0xa91/0x19b0 net/ipv4/netfilter/ip_tables.c:296
  iptable_filter_hook+0x65/0x80 net/ipv4/netfilter/iptable_filter.c:41
  nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
  nf_hook_slow+0xba/0x1a0 net/netfilter/core.c:483
  nf_hook include/linux/netfilter.h:243 [inline]
  NF_HOOK include/linux/netfilter.h:286 [inline]
  raw_send_hdrinc.isra.17+0xf39/0x1880 net/ipv4/raw.c:432
  raw_sendmsg+0x14cd/0x26b0 net/ipv4/raw.c:669
  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xca/0x110 net/socket.c:639
  SYSC_sendto+0x361/0x5c0 net/socket.c:1748
  SyS_sendto+0x40/0x50 net/socket.c:1716
  do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x441b49
RSP: 002b:75ca8b18 EFLAGS: 0216 ORIG_RAX: 002c
RAX: ffda RBX: 004002c8 RCX: 00441b49
RDX: 0030 RSI: 20ff7000 RDI: 0003
RBP: 006cc018 R08: 2066354c R09: 0010
R10:  R11: 0216 R12: 00403470
R13: 00403500 R14:  R15: 
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

Instead of adding checks for template ct on every target/match
manipulating skb->_nfct, simply drop the template ct when skipping
nf_conntrack_in().

Reported-and-tested-by: syzbot+0346441ae0545cfce...@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni 
---
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c 
b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index b50721d9d30e..9db988f9a4d7 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -154,8 +154,20 @@ static unsigned int ipv4_conntrack_local(void *priv,
 struct sk_buff *skb,
 const struct nf_hook_state *state)
 {
-   if (ip_is_fragment(ip_hdr(skb))) /* IP_NODEFRAG setsockopt set */
+   if (ip_is_fragment(ip_hdr(skb))) { /* IP_NODEFRAG setsockopt set */
+   enum ip_conntrack_info ctinfo;
+   struct nf_conn *tmpl;
+
+   tmpl = nf_ct_get(skb, &ctinfo);
+   if (tmpl && nf_ct_is_template(tmpl)) {
+   /* when skipping ct, clear templates to avoid fooling
+* later targets/matches
+*/
+   skb->_nfct = 0;
+   nf_ct_put(tmpl);
+   }
return NF_ACCEPT;
+   }
 
return nf_conntrack_in(state->net, PF_INET, state->hook, skb);
 }
-- 
2.14.3



Re: [bug, bisected] pfifo_fast causes packet reordering

2018-03-22 Thread Jakob Unterwurzacher

On 21.03.18 21:52, John Fastabend wrote:

Can you try this,

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d4907b5..1e596bd 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -30,6 +30,7 @@ struct qdisc_rate_table {
  enum qdisc_state_t {
 __QDISC_STATE_SCHED,
 __QDISC_STATE_DEACTIVATED,
+   __QDISC_STATE_RUNNING,
  };
[...]


Tested, looks good. No OOO observed, no side effects observed, iperf 
numbers on Gigabit Ethernet look the same.


Thanks,
Jakob


Re: [PATCH 1/2] r8169: reinstate ALDPS for power saving

2018-03-22 Thread Kai-Heng Feng

Kai Heng Feng  wrote:


Hopefully Hayes (or Realtek) can shed more light on the issue. Apparently
ALDPS and ASPM for r8169 is enabled in different commercial products, just
not in Linux mainline.


Hayes and Realtek folks,

How do we move this patch forward?
Did you find the root cause that made this patch get reverted?

I guess ALDPS is no longer needed after commit a92a08499b1f ("r8169: 
improve runtime pm in general and suspend unused ports"); the device now 
gets runtime suspended when the link is down.


OTOH, ASPM is still quite useful though. When it's enabled, it can save 1W  
power usage, which is quite substantial for a laptop.


So, I'd like to hear your feedback and make ASPM for r8169 eventually gets  
upstreamed.


Kai-Heng



Re: [PATCH net-next v2 2/2] dt: bindings: add new dt entries for brcmfmac

2018-03-22 Thread Ulf Hansson
On 20 March 2018 at 10:55, Kalle Valo  wrote:
> Arend van Spriel  writes:
>
 If I get it right, you mean something like this:

 mmc3: mmc@1c12000 {
 ...
  broken-sg-support;
 sd-head-align = <4>;
 sd-sgentry-align = <512>;

  brcmf: wifi@1 {
  ...
  };
 };

 Where should the dt bindings documentation for these entries reside?
 In the generic MMC bindings? Well, this is a very special case and the
 mmc-linux maintainer is unlikely to accept these changes.
 Also, extra kernel code modification might be required. It could make
 a quite trivial change much more complex.
>>>
>>> If the MMC maintainers are not copied on this patch series, it will
>>> likely be hard for them to identify this patch series and chime in...
>>
>> The main question is whether this is indeed a "very special case" as
>> Alexey claims it to be, or whether it is likely to be applicable to other
>> device and host combinations as you are suggesting.
>>
>> If these properties are imposed by the host or host controller it
>> would make sense to have these in the mmc bindings.
>
> BTW, last year we were discussing something similar (I mean related to
> alignment requirements) with ath10k SDIO patches and at the time the
> patch submitter was proposing to have a bounce buffer in ath10k to
> workaround that. I don't remember the details anymore, they are on the
> ath10k mailing list archive if anyone is curious to know, but I would
> not be surprised if they are similar as here. So there might be a need
> to solve this in a generic way (but not sure of course as I haven't
> checked the details).

I recall something about these as well, here are the patches. Perhaps
I should pick some of them up...

https://patchwork.kernel.org/patch/10123137/
https://patchwork.kernel.org/patch/10123139/
https://patchwork.kernel.org/patch/10123141/
https://patchwork.kernel.org/patch/10123143/

Kind regards
Uffe


Drop count for VLAN tagged packets when interface is in promiscuous mode

2018-03-22 Thread Mikael Arvids
Hi,

I have questions regarding how packet drops are counted in net/core/dev.c.

We open a raw socket (with ETH_P_ALL) in promiscuous mode to capture all 
packets we receive from a mirrored port on a switch, and in order to ensure 
that we are not missing any packets we check the rx_dropped statistics on the 
interface (in addition to the PACKET_STATISTICS on the socket).
Under certain circumstances we could see dropped packets on the interface, even 
though we were not missing any packets in the capture. After some investigation 
we concluded that the drop counter was incremented for VLAN tagged PTP packets 
(ether_type 0x88f7), even though these were captured on the raw socket.

It turns out that packets with an unknown VLAN tag and ether_type other than IP 
(0x0800) and ARP (0x0806) will increment the drop counter, even when those 
packets have been processed (by the raw socket). Is this intended?
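
For reference, a minimal sketch of the kind of capture setup described
above (interface name and error handling are illustrative):

	#include <arpa/inet.h>
	#include <linux/if_ether.h>
	#include <net/if.h>
	#include <netpacket/packet.h>
	#include <sys/socket.h>
	#include <unistd.h>

	/* Open an ETH_P_ALL raw socket and put the interface in promiscuous mode. */
	static int open_capture(const char *ifname)
	{
		int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
		struct packet_mreq mr = {
			.mr_ifindex = (int)if_nametoindex(ifname),
			.mr_type = PACKET_MR_PROMISC,
		};

		if (fd < 0)
			return -1;
		if (setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP,
			       &mr, sizeof(mr)) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}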

We have currently patched net/core/dev.c to not increment the drop counter when 
deliver_skb has been called for the vlan packets, which solves our particular 
case, but I'm wondering if there could be a more generic solution to this?

diff --git a/components/linux-kernel/xilinx-v2016.3/net/core/dev.c 
b/components/linux-kernel/xilinx-v2016.3/net/core/dev.c
index 5c925ac..9d04a1c 100644
--- a/components/linux-kernel/xilinx-v2016.3/net/core/dev.c
+++ b/components/linux-kernel/xilinx-v2016.3/net/core/dev.c
@@ -4028,6 +4028,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	bool deliver_exact = false;
 	int ret = NET_RX_DROP;
 	__be16 type;
+	bool prevent_drop_cnt_inc = false;
 
 	net_timestamp_check(!netdev_tstamp_prequeue, skb);
 
@@ -4098,6 +4099,7 @@ ncls:
 	if (pt_prev) {
 		ret = deliver_skb(skb, pt_prev, orig_dev);
 		pt_prev = NULL;
+		prevent_drop_cnt_inc = true;
 	}
 	if (vlan_do_receive(&skb))
 		goto another_round;
@@ -4160,8 +4162,10 @@ ncls:
 		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
 drop:
-		if (!deliver_exact)
-			atomic_long_inc(&skb->dev->rx_dropped);
+		if (!deliver_exact) {
+			if (!prevent_drop_cnt_inc)
+				atomic_long_inc(&skb->dev->rx_dropped);
+		}
 		else
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-22 Thread David Laight
From: Sent: 21 March 2018 18:16
> To: Ingo Molnar
...
> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?
> 
> Yes, yes, I can find an Intel white-paper that talks about setting WC
> and then using xmm and ymm instructions to write a single 64-byte
> burst over PCIe, and I assume that is where the code and idea came
> from. But I don't actually see any reason why a burst of 8 regular
> quad-word writes wouldn't cause a 64-byte burst write too.

The big gain is from wide PCIe reads, not writes.
Writes to uncached locations (without WC) are almost certainly required
to generate the 'obvious' PCIe TLP, otherwise things are likely to break.

> So right now this is for _one_ odd rdma controller, with absolutely
> _zero_ performance numbers, and a very high likelihood that it does
> not matter in the least.
> 
> And if there are some atomicity concerns ("we need to guarantee a
> single atomic access for race conditions with the hardware"), they are
> probably bogus and misplaced anyway, since
> 
>  (a) we can't guarantee that AVX2 exists in the first place

Any code would need to be in memcpy_fromio(), not in every driver that
might benefit.
Then fallback code can be used if the registers aren't available.
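
Roughly (a sketch; memcpy_fromio_avx() is hypothetical here, the runtime
feature check and fallback are the point):

	if (boot_cpu_has(X86_FEATURE_AVX2))
		memcpy_fromio_avx(dst, src, len);	/* hypothetical wide variant */
	else
		memcpy_fromio(dst, src, len);		/* existing generic copy */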

>  (b) we can't guarantee that %ymm register write will show up on any
> bus as a single write transaction anyway

Misaligned 8 byte accesses generate a single PCIe TLP.
I can look at what happens for AVX2 transfers later.
I've got code that mmap()s PCIe addresses into user space, and can
look at the TLP (indirectly through tracing on an fpga target).
Just need to set something up that uses AVX copies.

> So as far as I can tell, there are basically *zero* upsides, and a lot
> of potential downsides.

There are some upsides.
I'll do a performance measurement for reads.

David



RE: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write

2018-03-22 Thread David Laight
From: Linus Torvalds
> Sent: 22 March 2018 01:27
> On Tue, Mar 20, 2018 at 7:42 AM, Alexander Duyck
>  wrote:
> >
> > Instead of framing this as an enhanced version of the read/write ops
> > why not look at replacing or extending something like the
> > memcpy_fromio or memcpy_toio operations?
> 
> Yes, doing something like "memcpy_fromio_avx()" is much more
> palatable, in that it works like the crypto functions do - if you do
> big chunks, the "kernel_fpu_begin/end()" isn't nearly the issue it can
> be otherwise.
> 
> Note that we definitely have seen hardware that *depends* on the
> regular memcpy_fromio()" not doing big reads. I don't know how
> hardware people screw it up, but it's clearly possible.

I wonder if that hardware works with the current kernel on recent cpus?
I bet it doesn't like the byte accesses that get generated either.

> So it really needs to be an explicitly named function that basically a
> driver can use to say "my hardware really likes big aligned accesses"
> and explicitly ask for some AVX version if possible.

For x86 being able to request a copy done as 'rep movsx' (for each x)
would be useful.
For io copies the cost of the memory access is probably much smaller
than the io access, so really fancy copies are unlikely to make much
difference unless the width of the io access changes.
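
For completeness, the kernel_fpu_begin/end() pattern mentioned above is
roughly (a sketch; the AVX body is elided and any memcpy_fromio_avx()
helper is hypothetical):

	kernel_fpu_begin();	/* FPU/vector state saved; ymm regs usable */
	/* ... 32-byte vmovdqu copies against the IO mapping ... */
	kernel_fpu_end();	/* state restored */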

David



[patch net-next RFC 05/12] dsa: use devlink helper to generate physical port name

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Since devlink knows the info needed to generate the physical port name
in a generic way for all devlink users, use the helper to do the job.

Signed-off-by: Jiri Pirko 
---
 net/dsa/slave.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 18561af7a8f1..8d71dd672e52 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -727,10 +728,7 @@ static int dsa_slave_get_phys_port_name(struct net_device *dev,
 {
struct dsa_port *dp = dsa_slave_to_port(dev);
 
-   if (snprintf(name, len, "p%d", dp->index) >= len)
-   return -EINVAL;
-
-   return 0;
+   return devlink_port_get_phys_port_name(&dp->devlink_port, name, len);
 }
 
 static struct dsa_mall_tc_entry *
-- 
2.14.3



[patch net-next RFC 02/12] devlink: extend attrs_set for setting port flavours

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Devlink ports can have a specific flavour according to their purpose of use.
This patch extends attrs_set so the driver can say which flavour a port
has. The initial flavours are:
physical, pf_rep, vf_rep, cpu, dsa
The user can query this to see right away what the purpose of each port is.
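
In driver terms the call then becomes, e.g. (a sketch mirroring the
mlxsw/nfp hunks below):

	devlink_port_attrs_set(devlink_port, DEVLINK_PORT_FLAVOUR_PHYSICAL,
			       port_number, split, split_subport_number);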

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core.c   |  4 ++--
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c | 12 +---
 include/net/devlink.h|  3 +++
 include/uapi/linux/devlink.h | 19 +++
 net/core/devlink.c   |  5 +
 5 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index dc924d5fb3b7..0b6e646fed75 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1721,8 +1721,8 @@ void mlxsw_core_port_eth_set(struct mlxsw_core 
*mlxsw_core, u8 local_port,
struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port;
 
mlxsw_core_port->port_driver_priv = port_driver_priv;
-   devlink_port_attrs_set(devlink_port, port_number,
-  split, split_port_subnumber);
+   devlink_port_attrs_set(devlink_port, DEVLINK_PORT_FLAVOUR_PHYSICAL,
+  port_number, split, split_port_subnumber);
devlink_port_type_eth_set(devlink_port, dev);
 }
 EXPORT_SYMBOL(mlxsw_core_port_eth_set);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c 
b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
index 3c0f0560f834..e3a46faaadc6 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
@@ -175,15 +175,21 @@ static int nfp_devlink_port_attrs_set(struct nfp_port *port)
if (ret)
return ret;
 
-   devlink_port_attrs_set(&port->dl_port, eth_port.label_port,
+   devlink_port_attrs_set(&port->dl_port,
+  DEVLINK_PORT_FLAVOUR_PHYSICAL,
+  eth_port.label_port,
   eth_port.is_split,
   eth_port.label_subport);
break;
case NFP_PORT_PF_PORT:
-   devlink_port_attrs_set(&port->dl_port, port->pf_id, false, 0);
+   devlink_port_attrs_set(&port->dl_port,
+  DEVLINK_PORT_FLAVOUR_PF_REP,
+  port->pf_id, false, 0);
break;
case NFP_PORT_VF_PORT:
-   devlink_port_attrs_set(&port->dl_port, port->vf_id, false, 0);
+   devlink_port_attrs_set(&port->dl_port,
+  DEVLINK_PORT_FLAVOUR_VF_REP,
+  port->vf_id, false, 0);
break;
default:
break;
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 29c3bc260a3e..900295afc521 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -37,6 +37,7 @@ struct devlink {
 
 struct devlink_port_attrs {
bool set;
+   enum devlink_port_flavour flavour;
u32 port_number; /* same value as "split group" */
bool split;
u32 split_subport_number;
@@ -380,6 +381,7 @@ void devlink_port_type_ib_set(struct devlink_port 
*devlink_port,
  struct ib_device *ibdev);
 void devlink_port_type_clear(struct devlink_port *devlink_port);
 void devlink_port_attrs_set(struct devlink_port *devlink_port,
+   enum devlink_port_flavour flavour,
u32 port_number, bool split,
u32 split_subport_number);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
@@ -476,6 +478,7 @@ static inline void devlink_port_type_clear(struct devlink_port *devlink_port)
 }
 
 static inline void devlink_port_attrs_set(struct devlink_port *devlink_port,
+ enum devlink_port_flavour flavour,
  u32 port_number, bool split,
  u32 split_subport_number)
 {
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 15b031a5ee7a..74d0e620059b 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -132,6 +132,24 @@ enum devlink_eswitch_encap_mode {
DEVLINK_ESWITCH_ENCAP_MODE_BASIC,
 };
 
+enum devlink_port_flavour {
+   DEVLINK_PORT_FLAVOUR_PHYSICAL, /* Any kind of a port physically
+   * facing the user.
+   */
+   DEVLINK_PORT_FLAVOUR_PF_REP, /* Port represents a SR-IOV physical
+ * function counterpart port of
+

[patch net-next RFC 09/12] nfp: register devlink port for VF/PF representors

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Drivers should always register devlink port instance for all their
ports. So fix nfp and register devlink port for VF and PF representors.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
index e6445f6707cb..eff07e9a175d 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
@@ -270,6 +270,8 @@ const struct net_device_ops nfp_repr_netdev_ops = {
 
 static void nfp_repr_clean(struct nfp_repr *repr)
 {
+   if (repr->port)
+   nfp_devlink_port_unregister(repr->port);
unregister_netdev(repr->netdev);
nfp_app_repr_clean(repr->app, repr->netdev);
dst_release((struct dst_entry *)repr->dst);
@@ -330,8 +332,14 @@ int nfp_repr_init(struct nfp_app *app, struct net_device *netdev,
 
/* This is incorrect - the id has to be figured out differently */
port->eth_id = cmsg_port_id;
+   err = nfp_devlink_port_register(app, port);
+   if (err)
+   goto err_netdev_clean;
+
return 0;
 
+err_netdev_clean:
+   unregister_netdev(netdev);
 err_repr_clean:
nfp_app_repr_clean(app, netdev);
 err_clean:
-- 
2.14.3



[patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

This patchset resolves 2 issues we have right now:
1) There are many netdevices / ports in the system, for port, pf, vf
   representation, but the user has no way to see which is which
2) The ndo_get_phys_port_name is implemented in each driver separately,
   which may lead to inconsistent names between drivers.

This patchset introduces port flavours which should address the first
problem. I'm testing this with Netronome nfp hardware. When the user
has 2 physical ports, 1 pf, and 4 vfs, he should see something like this:
# devlink port
pci/:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
pci/:05:00.0/268435456: type eth netdev eth0 flavour physical number 0
pci/:05:00.0/268435460: type eth netdev enp5s0np1 flavour physical number 1
pci/:05:00.0/536875008: type eth netdev eth2 flavour pf_rep number 536875008
pci/:05:00.0/536870912: type eth netdev eth1 flavour vf_rep number 0
pci/:05:00.0/536870976: type eth netdev eth3 flavour vf_rep number 1
pci/:05:00.0/536871040: type eth netdev eth4 flavour vf_rep number 2
pci/:05:00.0/536871104: type eth netdev eth5 flavour vf_rep number 3

The indexes are weird numbers now. That needs to be fixed. Also, netdev
renaming does not work correctly for me now for some reason.
Also, there is one extra port whose purpose I don't understand -
something nfp specific perhaps.

The desired output should look like this:
# devlink port
pci/:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
pci/:05:00.0/1: type eth netdev enp5s0np1 flavour physical number 1
pci/:05:00.0/2: type eth netdev enp5s0npf0 flavour pf_rep number 0
pci/:05:00.0/3: type eth netdev enp5s0nvf0 flavour vf_rep number 0
pci/:05:00.0/4: type eth netdev enp5s0nvf1 flavour vf_rep number 1
pci/:05:00.0/5: type eth netdev enp5s0nvf2 flavour vf_rep number 2
pci/:05:00.0/6: type eth netdev enp5s0nvf3 flavour vf_rep number 3

As you can see, the netdev names are generated according to the flavour
and port number. In case the port is split, the split subnumber is also
included.
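
On the driver side the delegation then collapses to a one-liner, roughly
(a sketch with a hypothetical foo driver; patches 05, 06 and 11 do the
real conversions):

	static int foo_get_phys_port_name(struct net_device *dev,
					  char *name, size_t len)
	{
		struct foo_port *port = netdev_priv(dev);	/* hypothetical priv */

		return devlink_port_get_phys_port_name(&port->dl_port, name, len);
	}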

I tested this for mlxsw and nfp. I have no way to test this on DSA hw,
so I would really appreciate it if the DSA guys could test this. Thanks!

Jiri Pirko (12):
  devlink: introduce devlink_port_attrs_set
  devlink: extend attrs_set for setting port flavours
  devlink: introduce a helper to generate physical port names
  dsa: set devlink port attrs for dsa ports
  dsa: use devlink helper to generate physical port name
  mlxsw: use devlink helper to generate physical port name
  nfp: flower: fix error path during representor creation
  nfp: set eth_id for representors to avoid port index conflict
  nfp: register devlink port for VF/PF representors
  nfp: flower: create port for flower vnic
  nfp: use devlink helper to generate physical port name
  nfp: flower: set sysfs link to device for representors

 drivers/net/ethernet/mellanox/mlxsw/core.c| 18 -
 drivers/net/ethernet/mellanox/mlxsw/core.h|  5 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c| 21 ++
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c| 11 +--
 drivers/net/ethernet/netronome/nfp/flower/main.c  | 17 -
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c  | 45 +--
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 19 -
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.h |  1 +
 drivers/net/ethernet/netronome/nfp/nfp_port.c | 30 +---
 include/net/devlink.h | 32 ++--
 include/uapi/linux/devlink.h  | 22 ++
 net/core/devlink.c| 92 ---
 net/dsa/dsa2.c| 23 ++
 net/dsa/slave.c   |  6 +-
 14 files changed, 252 insertions(+), 90 deletions(-)

-- 
2.14.3



[patch net-next RFC 08/12] nfp: set eth_id for representors to avoid port index conflict

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

This is incorrect and needs to be done differently.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
index d98cbc173dca..e6445f6707cb 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
@@ -328,6 +328,8 @@ int nfp_repr_init(struct nfp_app *app, struct net_device *netdev,
if (err)
goto err_repr_clean;
 
+   /* This is incorrect - the id has to be figured out differently */
+   port->eth_id = cmsg_port_id;
return 0;
 
 err_repr_clean:
-- 
2.14.3



[patch net-next RFC 03/12] devlink: introduce a helper to generate physical port names

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Each driver implements physical port name generation by itself. However
as devlink has all the needed info, it can easily do the job for all its
users. So implement this helper in devlink.

Signed-off-by: Jiri Pirko 
---
 include/net/devlink.h |  9 +
 net/core/devlink.c| 39 +++
 2 files changed, 48 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 900295afc521..2e5bfe7723b4 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -384,6 +384,8 @@ void devlink_port_attrs_set(struct devlink_port 
*devlink_port,
enum devlink_port_flavour flavour,
u32 port_number, bool split,
u32 split_subport_number);
+int devlink_port_get_phys_port_name(struct devlink_port *devlink_port,
+   char *name, size_t len);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
u32 size, u16 ingress_pools_count,
u16 egress_pools_count, u16 ingress_tc_count,
@@ -484,6 +486,13 @@ static inline void devlink_port_attrs_set(struct devlink_port *devlink_port,
 {
 }
 
+static inline int
+devlink_port_get_phys_port_name(struct devlink_port *devlink_port,
+   char *name, size_t len)
+{
+   return -EOPNOTSUPP;
+}
+
 static inline int devlink_sb_register(struct devlink *devlink,
  unsigned int sb_index, u32 size,
  u16 ingress_pools_count,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 782476a1ff8f..4ba69383ab58 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -3017,6 +3017,45 @@ void devlink_port_attrs_set(struct devlink_port *devlink_port,
 }
 EXPORT_SYMBOL_GPL(devlink_port_attrs_set);
 
+int devlink_port_get_phys_port_name(struct devlink_port *devlink_port,
+   char *name, size_t len)
+{
+   struct devlink_port_attrs *attrs = &devlink_port->attrs;
+   int n = 0;
+
+   if (!attrs->set)
+   return -EOPNOTSUPP;
+
+   switch (attrs->flavour) {
+   case DEVLINK_PORT_FLAVOUR_PHYSICAL:
+   if (!attrs->split)
+   n = snprintf(name, len, "p%u", attrs->port_number);
+   else
+   n = snprintf(name, len, "p%us%u", attrs->port_number,
+attrs->split_subport_number);
+   break;
+   case DEVLINK_PORT_FLAVOUR_PF_REP:
+   n = snprintf(name, len, "pf%d", attrs->port_number);
+   break;
+   case DEVLINK_PORT_FLAVOUR_VF_REP:
+   n = snprintf(name, len, "vf%d", attrs->port_number);
+   break;
+   case DEVLINK_PORT_FLAVOUR_CPU:
+   case DEVLINK_PORT_FLAVOUR_DSA:
+   /* As CPU and DSA ports do not have a netdevice associated,
+* this case should never happen.
+*/
+   WARN_ON(1);
+   return -EINVAL;
+   }
+
+   if (n >= len)
+   return -EINVAL;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(devlink_port_get_phys_port_name);
+
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
u32 size, u16 ingress_pools_count,
u16 egress_pools_count, u16 ingress_tc_count,
-- 
2.14.3



[patch net-next RFC 07/12] nfp: flower: fix error path during representor creation

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Don't store the repr pointer in the reprs array until the representor is
successfully created. This avoids a message about "representor
destruction" even when it was never created. It also cleans up the flow.
Also, check the return value after port alloc.
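
The resulting flow follows the usual publish-last pattern (a condensed
sketch, not the literal driver code):

	repr = nfp_repr_alloc(app);
	port = nfp_port_alloc(app, port_type, repr);	/* may fail */
	err = nfp_repr_init(app, repr->netdev, ...);	/* may fail */
	/* only a fully set up representor becomes visible to readers */
	RCU_INIT_POINTER(reprs->reprs[i], repr);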

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/netronome/nfp/flower/main.c  | 13 +++--
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c |  9 +++--
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.h |  1 +
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c 
b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 742d6f1575b5..aed8df0e9d41 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -247,12 +247,16 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app,
err = -ENOMEM;
goto err_reprs_clean;
}
-   RCU_INIT_POINTER(reprs->reprs[i], repr);
 
/* For now we only support 1 PF */
WARN_ON(repr_type == NFP_REPR_TYPE_PF && i);
 
port = nfp_port_alloc(app, port_type, repr);
+   if (IS_ERR(port)) {
+   err = PTR_ERR(port);
+   nfp_repr_free(repr);
+   goto err_reprs_clean;
+   }
if (repr_type == NFP_REPR_TYPE_PF) {
port->pf_id = i;
port->vnic = priv->nn->dp.ctrl_bar;
@@ -271,9 +275,11 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app,
port_id, port, priv->nn->dp.netdev);
if (err) {
nfp_port_free(port);
+   nfp_repr_free(repr);
goto err_reprs_clean;
}
 
+   RCU_INIT_POINTER(reprs->reprs[i], repr);
nfp_info(app->cpp, "%s%d Representor(%s) created\n",
 repr_type == NFP_REPR_TYPE_PF ? "PF" : "VF", i,
 repr->name);
@@ -344,16 +350,17 @@ nfp_flower_spawn_phy_reprs(struct nfp_app *app, struct nfp_flower_priv *priv)
err = -ENOMEM;
goto err_reprs_clean;
}
-   RCU_INIT_POINTER(reprs->reprs[phys_port], repr);
 
port = nfp_port_alloc(app, NFP_PORT_PHYS_PORT, repr);
if (IS_ERR(port)) {
err = PTR_ERR(port);
+   nfp_repr_free(repr);
goto err_reprs_clean;
}
err = nfp_port_init_phy_port(app->pf, app, port, i);
if (err) {
nfp_port_free(port);
+   nfp_repr_free(repr);
goto err_reprs_clean;
}
 
@@ -365,6 +372,7 @@ nfp_flower_spawn_phy_reprs(struct nfp_app *app, struct nfp_flower_priv *priv)
cmsg_port_id, port, priv->nn->dp.netdev);
if (err) {
nfp_port_free(port);
+   nfp_repr_free(repr);
goto err_reprs_clean;
}
 
@@ -373,6 +381,7 @@ nfp_flower_spawn_phy_reprs(struct nfp_app *app, struct nfp_flower_priv *priv)
 eth_tbl->ports[i].base,
 phys_port);
 
+   RCU_INIT_POINTER(reprs->reprs[phys_port], repr);
nfp_info(app->cpp, "Phys Port %d Representor(%s) created\n",
 phys_port, repr->name);
}
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
index 619570524d2a..d98cbc173dca 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
@@ -337,12 +337,17 @@ int nfp_repr_init(struct nfp_app *app, struct net_device *netdev,
return err;
 }
 
-static void nfp_repr_free(struct nfp_repr *repr)
+static void __nfp_repr_free(struct nfp_repr *repr)
 {
free_percpu(repr->stats);
free_netdev(repr->netdev);
 }
 
+void nfp_repr_free(struct net_device *netdev)
+{
+   __nfp_repr_free(netdev_priv(netdev));
+}
+
 struct net_device *nfp_repr_alloc(struct nfp_app *app)
 {
struct net_device *netdev;
@@ -374,7 +379,7 @@ static void nfp_repr_clean_and_free(struct nfp_repr *repr)
nfp_info(repr->app->cpp, "Destroying Representor(%s)\n",
 repr->netdev->name);
nfp_repr_clean(repr);
-   nfp_repr_free(repr);
+   __nfp_repr_free(repr);
 }
 
 void nfp_reprs_clean_and_free(struct nfp_app *app, struct nfp_reprs *reprs)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.h
index a621e8ff528e..cd756a15445f 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp

[patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Set the attrs and allow the port flavour to be exposed to the user via devlink.

Signed-off-by: Jiri Pirko 
---
 net/dsa/dsa2.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index adf50fbc4c13..49453690696d 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -270,7 +270,27 @@ static int dsa_port_setup(struct dsa_port *dp)
case DSA_PORT_TYPE_UNUSED:
break;
case DSA_PORT_TYPE_CPU:
+   /* dp->index is used now as port_number. However
+* CPU ports should have separate numbering
+* independent from front panel port numbers.
+*/
+   devlink_port_attrs_set(&dp->devlink_port,
+  DEVLINK_PORT_FLAVOUR_CPU,
+  dp->index, false, 0);
+   err = dsa_port_link_register_of(dp);
+   if (err) {
dev_err(ds->dev, "failed to setup link for port %d.%d\n",
ds->index, dp->index);
+   return err;
+   }
case DSA_PORT_TYPE_DSA:
+   /* dp->index is used now as port_number. However
+* DSA ports should have separate numbering
+* independent from front panel port numbers.
+*/
+   devlink_port_attrs_set(&dp->devlink_port,
+  DEVLINK_PORT_FLAVOUR_DSA,
+  dp->index, false, 0);
err = dsa_port_link_register_of(dp);
if (err) {
dev_err(ds->dev, "failed to setup link for port %d.%d\n",
@@ -279,6 +299,9 @@ static int dsa_port_setup(struct dsa_port *dp)
}
break;
case DSA_PORT_TYPE_USER:
+   devlink_port_attrs_set(&dp->devlink_port,
+  DEVLINK_PORT_FLAVOUR_PHYSICAL,
+  dp->index, false, 0);
err = dsa_slave_create(dp);
if (err)
dev_err(ds->dev, "failed to create slave for port %d.%d\n",
-- 
2.14.3



[patch net-next RFC 01/12] devlink: introduce devlink_port_attrs_set

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Change the existing setter for split port information into a more generic
attrs setter. Along with that, allow setting the port number and subport
number for split ports.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core.c   |  7 ++--
 drivers/net/ethernet/mellanox/mlxsw/core.h   |  3 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c   |  4 +--
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c   |  2 +-
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c | 39 +++-
 include/net/devlink.h| 20 +++
 include/uapi/linux/devlink.h |  3 ++
 net/core/devlink.c   | 46 ++--
 8 files changed, 93 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 3529b545675d..dc924d5fb3b7 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1713,15 +1713,16 @@ EXPORT_SYMBOL(mlxsw_core_port_fini);
 
 void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port,
 void *port_driver_priv, struct net_device *dev,
-bool split, u32 split_group)
+u32 port_number, bool split,
+u32 split_port_subnumber)
 {
struct mlxsw_core_port *mlxsw_core_port =
&mlxsw_core->ports[local_port];
struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port;
 
mlxsw_core_port->port_driver_priv = port_driver_priv;
-   if (split)
-   devlink_port_split_set(devlink_port, split_group);
+   devlink_port_attrs_set(devlink_port, port_number,
+  split, split_port_subnumber);
devlink_port_type_eth_set(devlink_port, dev);
 }
 EXPORT_SYMBOL(mlxsw_core_port_eth_set);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h 
b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 5ddafd74dc00..10589abae67f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -201,7 +201,8 @@ int mlxsw_core_port_init(struct mlxsw_core *mlxsw_core, u8 
local_port);
 void mlxsw_core_port_fini(struct mlxsw_core *mlxsw_core, u8 local_port);
 void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port,
 void *port_driver_priv, struct net_device *dev,
-bool split, u32 split_group);
+u32 port_number, bool split,
+u32 split_port_subnumber);
 void mlxsw_core_port_ib_set(struct mlxsw_core *mlxsw_core, u8 local_port,
void *port_driver_priv);
 void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index a120602bca26..59be0bf14127 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -2927,8 +2927,8 @@ static int mlxsw_sp_port_create(struct mlxsw_sp *mlxsw_sp, u8 local_port,
}
 
mlxsw_core_port_eth_set(mlxsw_sp->core, mlxsw_sp_port->local_port,
-   mlxsw_sp_port, dev, mlxsw_sp_port->split,
-   module);
+   mlxsw_sp_port, dev, module + 1,
+   mlxsw_sp_port->split, lane / width);
mlxsw_core_schedule_dw(&mlxsw_sp_port->periodic_hw_stats.update_dw, 0);
return 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index f3c29bbf07e2..eddfcef320f1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -1149,7 +1149,7 @@ static int __mlxsw_sx_port_eth_create(struct mlxsw_sx *mlxsw_sx, u8 local_port,
}
 
mlxsw_core_port_eth_set(mlxsw_sx->core, mlxsw_sx_port->local_port,
-   mlxsw_sx_port, dev, false, 0);
+   mlxsw_sx_port, dev, module + 1, false, 0);
mlxsw_sx->ports[local_port] = mlxsw_sx_port;
return 0;
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c 
b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
index eb0fc614673d..3c0f0560f834 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
@@ -162,22 +162,45 @@ const struct devlink_ops nfp_devlink_ops = {
.eswitch_mode_get   = nfp_devlink_eswitch_mode_get,
 };
 
-int nfp_devlink_port_register(struct nfp_app *app, struct nfp_port *port)
+static int nfp_devlink_port_attrs_set(struct nfp_port *port)
 {
struct nfp_eth_table_port eth_port;
+   int ret;
+
+   switch (port->type) {
+ 

[patch net-next RFC 11/12] nfp: use devlink helper to generate physical port name

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Since devlink knows the info needed to generate the physical port name
in a generic way for all devlink users, use the helper to do the job.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/netronome/nfp/nfp_port.c | 30 ++-
 1 file changed, 2 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_port.c 
b/drivers/net/ethernet/netronome/nfp/nfp_port.c
index 7bd8be5c833b..01dac8533ef6 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_port.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_port.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "nfpcore/nfp_cpp.h"
 #include "nfpcore/nfp_nsp.h"
@@ -160,40 +161,13 @@ struct nfp_eth_table_port *nfp_port_get_eth_port(struct nfp_port *port)
 int
 nfp_port_get_phys_port_name(struct net_device *netdev, char *name, size_t len)
 {
-   struct nfp_eth_table_port *eth_port;
struct nfp_port *port;
-   int n;
 
port = nfp_port_from_netdev(netdev);
if (!port)
return -EOPNOTSUPP;
 
-   switch (port->type) {
-   case NFP_PORT_PHYS_PORT:
-   eth_port = __nfp_port_get_eth_port(port);
-   if (!eth_port)
-   return -EOPNOTSUPP;
-
-   if (!eth_port->is_split)
-   n = snprintf(name, len, "p%d", eth_port->label_port);
-   else
-   n = snprintf(name, len, "p%ds%d", eth_port->label_port,
-eth_port->label_subport);
-   break;
-   case NFP_PORT_PF_PORT:
-   n = snprintf(name, len, "pf%d", port->pf_id);
-   break;
-   case NFP_PORT_VF_PORT:
-   n = snprintf(name, len, "pf%dvf%d", port->pf_id, port->vf_id);
-   break;
-   default:
-   return -EOPNOTSUPP;
-   }
-
-   if (n >= len)
-   return -EINVAL;
-
-   return 0;
+   return devlink_port_get_phys_port_name(&port->dl_port, name, len);
 }
 
 /**
-- 
2.14.3



[patch net-next RFC 10/12] nfp: flower: create port for flower vnic

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/netronome/nfp/flower/main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c 
b/drivers/net/ethernet/netronome/nfp/flower/main.c
index aed8df0e9d41..1890af7e6196 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -427,10 +427,9 @@ static int nfp_flower_vnic_alloc(struct nfp_app *app, struct nfp_net *nn,
goto err_invalid_port;
}
 
-   eth_hw_addr_random(nn->dp.netdev);
netif_keep_dst(nn->dp.netdev);
 
-   return 0;
+   return nfp_app_nic_vnic_alloc(app, nn, id);
 
 err_invalid_port:
nn->port = nfp_port_alloc(app, NFP_PORT_INVALID, nn->dp.netdev);
-- 
2.14.3



[patch net-next RFC 06/12] mlxsw: use devlink helper to generate physical port name

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Since devlink knows the info needed to generate the physical port name
in a generic way for all devlink users, use the helper to do the job.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 11 +++
 drivers/net/ethernet/mellanox/mlxsw/core.h |  2 ++
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 17 +++--
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c |  9 +++--
 4 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 0b6e646fed75..7a49eb2c8db4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1762,6 +1762,17 @@ enum devlink_port_type mlxsw_core_port_type_get(struct 
mlxsw_core *mlxsw_core,
 }
 EXPORT_SYMBOL(mlxsw_core_port_type_get);
 
+int mlxsw_core_port_get_phys_port_name(struct mlxsw_core *mlxsw_core,
+  u8 local_port, char *name, size_t len)
+{
+   struct mlxsw_core_port *mlxsw_core_port =
+   &mlxsw_core->ports[local_port];
+   struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port;
+
+   return devlink_port_get_phys_port_name(devlink_port, name, len);
+}
+EXPORT_SYMBOL(mlxsw_core_port_get_phys_port_name);
+
 static void mlxsw_core_buf_dump_dbg(struct mlxsw_core *mlxsw_core,
const char *buf, size_t size)
 {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h 
b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 10589abae67f..09703688ea9a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -209,6 +209,8 @@ void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, 
u8 local_port,
   void *port_driver_priv);
 enum devlink_port_type mlxsw_core_port_type_get(struct mlxsw_core *mlxsw_core,
u8 local_port);
+int mlxsw_core_port_get_phys_port_name(struct mlxsw_core *mlxsw_core,
+  u8 local_port, char *name, size_t len);
 
 int mlxsw_core_schedule_dw(struct delayed_work *dwork, unsigned long delay);
 bool mlxsw_core_schedule_work(struct work_struct *work);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 59be0bf14127..64ea94d4ee14 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1238,21 +1238,10 @@ static int mlxsw_sp_port_get_phys_port_name(struct net_device *dev, char *name,
size_t len)
 {
struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
-   u8 module = mlxsw_sp_port->mapping.module;
-   u8 width = mlxsw_sp_port->mapping.width;
-   u8 lane = mlxsw_sp_port->mapping.lane;
-   int err;
-
-   if (!mlxsw_sp_port->split)
-   err = snprintf(name, len, "p%d", module + 1);
-   else
-   err = snprintf(name, len, "p%ds%d", module + 1,
-  lane / width);
 
-   if (err >= len)
-   return -EINVAL;
-
-   return 0;
+   return mlxsw_core_port_get_phys_port_name(mlxsw_sp_port->mlxsw_sp->core,
+ mlxsw_sp_port->local_port,
+ name, len);
 }
 
 static struct mlxsw_sp_port_mall_tc_entry *
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index eddfcef320f1..96d4c073d9d6 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -417,13 +417,10 @@ static int mlxsw_sx_port_get_phys_port_name(struct net_device *dev, char *name,
size_t len)
 {
struct mlxsw_sx_port *mlxsw_sx_port = netdev_priv(dev);
-   int err;
-
-   err = snprintf(name, len, "p%d", mlxsw_sx_port->mapping.module + 1);
-   if (err >= len)
-   return -EINVAL;
 
-   return 0;
+   return mlxsw_core_port_get_phys_port_name(mlxsw_sx_port->mlxsw_sx->core,
+ mlxsw_sx_port->local_port,
+ name, len);
 }
 
 static const struct net_device_ops mlxsw_sx_port_netdev_ops = {
-- 
2.14.3



[patch net-next RFC 12/12] nfp: flower: set sysfs link to device for representors

2018-03-22 Thread Jiri Pirko
From: Jiri Pirko 

Do this so sysfs has the "device" link correctly set.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/netronome/nfp/flower/main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c 
b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 1890af7e6196..9751708585e4 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -267,6 +267,7 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app,
app->pf->vf_cfg_mem + i * NFP_NET_CFG_BAR_SZ;
}
 
+   SET_NETDEV_DEV(repr, &priv->nn->pdev->dev);
eth_hw_addr_random(repr);
 
port_id = nfp_flower_cmsg_pcie_port(nfp_pcie, vnic_type,
-- 
2.14.3



Re: [PATCH nf] netfilter: drop template ct when conntrack is skipped.

2018-03-22 Thread Florian Westphal
Paolo Abeni  wrote:
> The ipv4 nf_ct code currently skips the nf_conntrak_in() call
> for fragmented packets. As a results later matches/target can end
> up manipulating template ct entry instead of 'real' ones.
> 
> Exploiting the above, syzbot found a way to trigger the following
> splat:
> 
> WARNING: CPU: 1 PID: 4242 at net/netfilter/xt_cluster.c:55
> xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
> Kernel panic - not syncing: panic_on_warn set ...

Right, template has l3 protocol 0.

> Instead of adding checks for template ct on every target/match
> manipulating skb->_nfct, simply drop the template ct when skipping
> nf_conntrack_in().

Fixes: 7b4fdf77a450ec ("netfilter: don't track fragmented packets")
Acked-by: Florian Westphal 



Re: [PATCH/RFC 2/3] net/sched: act_tunnel_key: add extended ack support

2018-03-22 Thread Simon Horman
On Fri, Mar 09, 2018 at 12:22:48PM +0100, Jiri Benc wrote:
> On Tue,  6 Mar 2018 18:08:04 +0100, Simon Horman wrote:
> > -   if (!tb[TCA_TUNNEL_KEY_PARMS])
> > +   if (!tb[TCA_TUNNEL_KEY_PARMS]) {
> > +   NL_SET_ERR_MSG(extack, "Missing tunnel key parameter");
> 
> "parameters" (it's not just one parameter)
> 
> > @@ -107,6 +109,7 @@ static int tunnel_key_init(struct net *net, struct 
> > nlattr *nla,
> > break;
> > case TCA_TUNNEL_KEY_ACT_SET:
> > if (!tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) {
> > +   NL_SET_ERR_MSG(extack, "Missing tunnel key enc id");
> 
> This is not very helpful to the user. What's "enc"? I guess "Missing
> tunnel key id" would be enough and better.
> 
> > @@ -144,6 +147,7 @@ static int tunnel_key_init(struct net *net, struct 
> > nlattr *nla,
> >   0, flags,
> >   key_id, 0);
> > } else {
> > +   NL_SET_ERR_MSG(extack, "Missing both ipv4 and ipv6 enc 
> > src and dst");
> 
> We don't need both but only one of them. And again, "enc" does not have
> a clear meaning.
> 
> "Missing either IPv4 or IPv6 source and destination address"?

Sure, I'll work on making the messages clearer.

> In addition, it makes little sense to me to add extacks to just some of
> the errors in the tunnel_key_init function. Please add extacks to all
> of them.

At the time I wrote the patch I tried to cover those errors that could
result from user input. I can extend the coverage if that is preferred.


Re: [PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS

2018-03-22 Thread Michael S. Tsirkin
On Thu, Mar 22, 2018 at 09:05:52AM +, Jay Vosburgh wrote:
>   The operstate update logic will leave an interface in the
> default UNKNOWN operstate if the interface carrier state never changes
> from the default carrier up state set at creation.  This includes the
> case of an explicit call to netif_carrier_on, as the carrier on to on
> transition has no effect on operstate.
> 
>   This affects virtio-net for the case that the virtio peer does
> not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
> updates).  Without this feature, the virtio specification states that
> "the link should be assumed active," so, logically, the operstate should
> be UP instead of UNKNOWN.  This has impact on user space applications
> that use the operstate to make availability decisions for the interface.
> 
>   Resolve this by changing the virtio probe logic slightly to call
> netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
> cases, and then the existing call to netif_carrier_on for the "without"
> case will cause an operstate transition.
> 
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Ben Hutchings 
> Fixes: 167c25e4c550 ("virtio-net: init link state correctly")

I'd say that's an abuse of this notation. operstate was UNKNOWN
even before that fix.

> Signed-off-by: Jay Vosburgh 

Acked-by: Michael S. Tsirkin 


> ---
> 
>   I considered resolving this by changing linkwatch_init_dev to
> unconditionally call rfc2863_policy, as that would always set operstate
> for all interfaces.
> 
>   This would not have any impact on most cases (as most drivers
> call netif_carrier_off during probe), except for the loopback device,
> which currently has an operstate of UNKNOWN (because it never does any
> carrier state transitions).  This change would add a round trip on the
> dev_base_lock for every loopback device creation, which could have a
> negative impact when creating many loopback devices, e.g., when
> concurrently creating large numbers of containers.
> 
> 
>  drivers/net/virtio_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 23374603e4d9..7b187ec7411e 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>  
>   /* Assume link up if device can't report link status,
>  otherwise get link status from config. */
> + netif_carrier_off(dev);
>   if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
> - netif_carrier_off(dev);
>   schedule_work(&vi->config_work);
>   } else {
>   vi->status = VIRTIO_NET_S_LINK_UP;
> -- 
> 2.14.1


Re: [PATCH/RFC 3/3] net/sched: add tunnel option support to act_tunnel_key

2018-03-22 Thread Simon Horman
On Fri, Mar 09, 2018 at 12:53:17PM +0100, Jiri Benc wrote:
> On Tue,  6 Mar 2018 18:08:05 +0100, Simon Horman wrote:
> > +static int
> > +tunnel_key_copy_geneve_opt(const struct nlattr *nla, int dst_len, void 
> > *dst,
> > +  struct netlink_ext_ack *extack)
> > +{
> > +   struct nlattr *tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1];
> > +   int err, data_len, opt_len;
> > +   u8 *data;
> > +
> > +   err = nla_parse_nested(tb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX,
> > +  nla, geneve_opt_policy, extack);
> > +   if (err < 0)
> > +   return err;
> > +
> > +   if (!tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS] ||
> > +   !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE] ||
> > +   !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]) {
> > +   NL_SET_ERR_MSG(extack, "Missing tunnel key enc geneve option 
> > class, type or data");
> 
> I think it's obvious by now that I don't like the "enc" :-)
> 
> > +   return -EINVAL;
> > +   }
> > +
> > +   data = nla_data(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
> > +   data_len = nla_len(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
> > +   if (data_len < 4) {
> > +   NL_SET_ERR_MSG(extack, "Tunnel key enc geneve option data is 
> > less than 4 bytes long");
> > +   return -ERANGE;
> > +   }
> > +   if (data_len % 4) {
> > +   NL_SET_ERR_MSG(extack, "Tunnel key enc geneve option data is 
> > not a multiple of 4 bytes long");
> > +   return -ERANGE;
> > +   }
> > +
> > +   opt_len = sizeof(struct geneve_opt) + data_len;
> > +   if (dst) {
> > +   struct geneve_opt *opt = dst;
> > +   u16 class;
> > +
> > +   if (dst_len < opt_len) {
> > +   NL_SET_ERR_MSG(extack, "Tunnel key enc geneve option 
> > data length is longer than the supplied data");
> 
> I don't understand this error. What can the user do about it?
> Furthermore, the error is not propagated to the user space (see also
> below).
> 
> Shouldn't this be WARN_ON or something?

Sure, that is fine by me.

> 
> > +   return -EINVAL;
> > +   }
> > +
> > +   class = nla_get_u16(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS]);
> > +   put_unaligned_be16(class, &opt->opt_class);
> 
> How can this be unaligned, given we check that the option length is a
> multiple of 4 bytes and the option header is 4 bytes?

True.

> > +
> > +   opt->type = nla_get_u8(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE]);
> > +   opt->length = data_len / 4; /* length is in units of 4 bytes */
> > +   opt->r1 = 0;
> > +   opt->r2 = 0;
> > +   opt->r3 = 0;
> > +
> > +   memcpy(opt + 1, data, data_len);
> > +   }
> > +
> > +   return opt_len;
> > +}
> > +
> > +static int tunnel_key_copy_opts(const struct nlattr *nla, u8 *dst,
> > +   int dst_len, struct netlink_ext_ack *extack)
> 
> Please be consistent with parameter ordering, dst and dst_len are in a
> different order here and in tunnel_key_copy_geneve_opt.

Sure, will do.

> [...]
> > @@ -157,6 +292,11 @@ static int tunnel_key_init(struct net *net, struct 
> > nlattr *nla,
> > goto err_out;
> > }
> >  
> > +   if (opts_len)
> > +   tunnel_key_opts_set(tb[TCA_TUNNEL_KEY_ENC_OPTS],
> > +   &metadata->u.tun_info, opts_len,
> > +   extack);
> 
> You need to propagate the error here. The previous validation is not
> enough as errors while copying to tun_info were not covered.

Thanks, sorry for missing that.

> > +
> > metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
> > break;
> > default:
> > @@ -221,6 +361,53 @@ static void tunnel_key_release(struct tc_action *a)
> > kfree_rcu(params, rcu);
> >  }
> >  
> > +static int tunnel_key_geneve_opts_dump(struct sk_buff *skb,
> > +  const struct ip_tunnel_info *info)
> > +{
> > +   int len = info->options_len;
> > +   u8 *src = (u8 *)(info + 1);
> > +
> > +   while (len > 0) {
> > +   struct geneve_opt *opt = (struct geneve_opt *)src;
> > +   u16 class;
> > +
> > +   class = get_unaligned_be16(&opt->opt_class);
> 
> I don't think this can be unaligned.

Thanks, I'm not sure why I thought otherwise.

> > +   if (nla_put_u16(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS,
> > +   class) ||
> > +   nla_put_u8(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE,
> > +  opt->type) ||
> > +   nla_put(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA,
> > +   opt->length * 4, opt + 1))
> > +   return -EMSGSIZE;
> > +
> > +   len -= sizeof(struct geneve_opt) + opt->length * 4;
> > +   src += sizeof(struct geneve_opt) + opt->length * 4;
> > +   }
> 
> All of this needs to be nested in TCA_TUNNEL_KEY_ENC_OPTS_GENEVE.

Agreed.
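Something along these lines should do it, using the standard netlink
nesting helpers (untested sketch, just to illustrate the structure):

	struct nlattr *start;

	start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS_GENEVE);
	if (!start)
		return -EMSGSIZE;

	/* ... emit the per-option CLASS/TYPE/DATA attributes here ... */

	nla_nest_end(skb, start);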

> > +
> > +   return 0;

Re: [PATCH nf] netfilter: drop template ct when conntrack is skipped.

2018-03-22 Thread Pablo Neira Ayuso
On Thu, Mar 22, 2018 at 11:08:50AM +0100, Paolo Abeni wrote:
> The ipv4 nf_ct code currently skips the nf_conntrak_in() call
> for fragmented packets. As a results later matches/target can end
> up manipulating template ct entry instead of 'real' ones.
> 
> Exploiting the above, syzbot found a way to trigger the following
> splat:
> 
> WARNING: CPU: 1 PID: 4242 at net/netfilter/xt_cluster.c:55
> xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 1 PID: 4242 Comm: syzkaller027971 Not tainted 4.16.0-rc2+ #243
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:17 [inline]
>   dump_stack+0x194/0x24d lib/dump_stack.c:53
>   panic+0x1e4/0x41c kernel/panic.c:183
>   __warn+0x1dc/0x200 kernel/panic.c:547
>   report_bug+0x211/0x2d0 lib/bug.c:184
>   fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>   fixup_bug arch/x86/kernel/traps.c:247 [inline]
>   do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>   do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>   invalid_op+0x58/0x80 arch/x86/entry/entry_64.S:957
> RIP: 0010:xt_cluster_hash net/netfilter/xt_cluster.c:55 [inline]
> RIP: 0010:xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
> RSP: 0018:8801d2f6f2d0 EFLAGS: 00010293
> RAX: 8801af700540 RBX:  RCX: 84a2d1e1
> RDX:  RSI: 8801d2f6f478 RDI: 8801cafd336a
> RBP: 8801d2f6f2e8 R08:  R09: 0001
> R10:  R11:  R12: 8801b03b3d18
> R13: 8801cafd3300 R14: dc00 R15: 8801d2f6f478
>   ipt_do_table+0xa91/0x19b0 net/ipv4/netfilter/ip_tables.c:296
>   iptable_filter_hook+0x65/0x80 net/ipv4/netfilter/iptable_filter.c:41
>   nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
>   nf_hook_slow+0xba/0x1a0 net/netfilter/core.c:483
>   nf_hook include/linux/netfilter.h:243 [inline]
>   NF_HOOK include/linux/netfilter.h:286 [inline]
>   raw_send_hdrinc.isra.17+0xf39/0x1880 net/ipv4/raw.c:432
>   raw_sendmsg+0x14cd/0x26b0 net/ipv4/raw.c:669
>   inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
>   sock_sendmsg_nosec net/socket.c:629 [inline]
>   sock_sendmsg+0xca/0x110 net/socket.c:639
>   SYSC_sendto+0x361/0x5c0 net/socket.c:1748
>   SyS_sendto+0x40/0x50 net/socket.c:1716
>   do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x441b49
> RSP: 002b:75ca8b18 EFLAGS: 0216 ORIG_RAX: 002c
> RAX: ffda RBX: 004002c8 RCX: 00441b49
> RDX: 0030 RSI: 20ff7000 RDI: 0003
> RBP: 006cc018 R08: 2066354c R09: 0010
> R10:  R11: 0216 R12: 00403470
> R13: 00403500 R14:  R15: 
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
> 
> Instead of adding checks for template ct on every target/match
> manipulating skb->_nfct, simply drop the template ct when skipping
> nf_conntrack_in().

Applied, thanks.
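For reference, the approach amounts to something like this in the IPv4
conntrack hook, before the early return for fragments (sketch; see the
applied patch for the exact code):

	if (ip_is_fragment(ip_hdr(skb))) { /* IP_NODEFRAG setsockopt set */
		enum ip_conntrack_info ctinfo;
		struct nf_conn *tmpl;

		tmpl = nf_ct_get(skb, &ctinfo);
		if (tmpl && nf_ct_is_template(tmpl)) {
			/* when skipping ct, clear templates to avoid
			 * fooling later targets/matches
			 */
			skb->_nfct = 0;
			nf_ct_put(tmpl);
		}
		return NF_ACCEPT;
	}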


Re: [PATCH v2 2/2] i40e: add support for XDP_REDIRECT

2018-03-22 Thread Jesper Dangaard Brouer

On Thu, 22 Mar 2018 10:03:07 +0100 Björn Töpel  wrote:

> +/**
> + * i40e_xdp_xmit - Implements ndo_xdp_xmit
> + * @dev: netdev
> + * @xdp: XDP buffer
> + *
> + * Returns Zero if sent, else an error code
> + **/
> +int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
> +{

The return code is used by the XDP redirect tracepoint... this is the
only way we have to debug/troubleshoot runtime issues with XDP. Thus,
these need to be consistent across drivers and distinguishable.

> + struct i40e_netdev_priv *np = netdev_priv(dev);
> + unsigned int queue_index = smp_processor_id();
> + struct i40e_vsi *vsi = np->vsi;
> + int err;
> +
> + if (test_bit(__I40E_VSI_DOWN, vsi->state))
> + return -EINVAL;

Should be: -ENETDOWN

> +
> + if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
> + return -EINVAL;

Should be: -ENXIO

> + err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
> + if (err != I40E_XDP_TX)
> + return -ENOMEM;

Should be: -ENOSPC

The ENOSPC return code is important, as this can be used as a feedback
to a XDP_REDIRECT load-balancer facility.
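With all three suggestions applied, the function would then read roughly
(sketch):

	int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
	{
		struct i40e_netdev_priv *np = netdev_priv(dev);
		unsigned int queue_index = smp_processor_id();
		struct i40e_vsi *vsi = np->vsi;
		int err;

		if (test_bit(__I40E_VSI_DOWN, vsi->state))
			return -ENETDOWN;

		if (!i40e_enabled_xdp_vsi(vsi) ||
		    queue_index >= vsi->num_queue_pairs)
			return -ENXIO;

		err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
		if (err != I40E_XDP_TX)
			return -ENOSPC;

		return 0;
	}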

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS

2018-03-22 Thread Jay Vosburgh
Michael S. Tsirkin  wrote:

>On Thu, Mar 22, 2018 at 09:05:52AM +, Jay Vosburgh wrote:
>>  The operstate update logic will leave an interface in the
>> default UNKNOWN operstate if the interface carrier state never changes
>> from the default carrier up state set at creation.  This includes the
>> case of an explicit call to netif_carrier_on, as the carrier on to on
>> transition has no effect on operstate.
>> 
>>  This affects virtio-net for the case that the virtio peer does
>> not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
>> updates).  Without this feature, the virtio specification states that
>> "the link should be assumed active," so, logically, the operstate should
>> be UP instead of UNKNOWN.  This has impact on user space applications
>> that use the operstate to make availability decisions for the interface.
>> 
>>  Resolve this by changing the virtio probe logic slightly to call
>> netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
>> cases, and then the existing call to netif_carrier_on for the "without"
>> case will cause an operstate transition.
>> 
>> Cc: "Michael S. Tsirkin" 
>> Cc: Jason Wang 
>> Cc: Ben Hutchings 
>> Fixes: 167c25e4c550 ("virtio-net: init link state correctly")
>
>I'd say that's an abuse of this notation. operstate was UNKNOWN
>even before that fix.

I went back to the commit that added the dependency on
VIRTIO_NET_F_STATUS (and that this patch would thus apply on top of).
If that's an issue, I can resubmit without it.

-J

>> Signed-off-by: Jay Vosburgh 
>
>Acked-by: Michael S. Tsirkin 
>
>
>> ---
>> 
>>  I considered resolving this by changing linkwatch_init_dev to
>> unconditionally call rfc2863_policy, as that would always set operstate
>> for all interfaces.
>> 
>>  This would not have any impact on most cases (as most drivers
>> call netif_carrier_off during probe), except for the loopback device,
>> which currently has an operstate of UNKNOWN (because it never does any
>> carrier state transitions).  This change would add a round trip on the
>> dev_base_lock for every loopback device creation, which could have a
>> negative impact when creating many loopback devices, e.g., when
>> concurrently creating large numbers of containers.
>> 
>> 
>>  drivers/net/virtio_net.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 23374603e4d9..7b187ec7411e 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>>  
>>  /* Assume link up if device can't report link status,
>> otherwise get link status from config. */
>> +netif_carrier_off(dev);
>>  if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
>> -netif_carrier_off(dev);
>>  schedule_work(&vi->config_work);
>>  } else {
>>  vi->status = VIRTIO_NET_S_LINK_UP;
>> -- 
>> 2.14.1


Re: [PATCH v2 2/2] i40e: add support for XDP_REDIRECT

2018-03-22 Thread Björn Töpel
2018-03-22 12:58 GMT+01:00 Jesper Dangaard Brouer :
>
> On Thu, 22 Mar 2018 10:03:07 +0100 Björn Töpel  wrote:
>
>> +/**
>> + * i40e_xdp_xmit - Implements ndo_xdp_xmit
>> + * @dev: netdev
>> + * @xdp: XDP buffer
>> + *
>> + * Returns Zero if sent, else an error code
>> + **/
>> +int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
>> +{
>
> The return code is used by the XDP redirect tracepoint... this is the
> only way we have to debug/troubleshoot runtime issues with XDP. Thus,
> these need to be consistent across drivers and distinguishable.
>

Thanks for pointing this out! I'll address all your comments and do a
respin (but I'll wait for Alex' comments, if any).


Björn

>> + struct i40e_netdev_priv *np = netdev_priv(dev);
>> + unsigned int queue_index = smp_processor_id();
>> + struct i40e_vsi *vsi = np->vsi;
>> + int err;
>> +
>> + if (test_bit(__I40E_VSI_DOWN, vsi->state))
>> + return -EINVAL;
>
> Should be: -ENETDOWN
>
>> +
>> + if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>> + return -EINVAL;
>
> Should be: -ENXIO
>
>> + err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
>> + if (err != I40E_XDP_TX)
>> + return -ENOMEM;
>
> Should be: -ENOSPC
>
> The ENOSPC return code is important, as this can be used as a feedback
> to a XDP_REDIRECT load-balancer facility.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


[RESEND PATCH net-next 1/1] tc-testing: updated police, mirred, skbedit and skbmod with more tests

2018-03-22 Thread Roman Mashak
Added extra test cases for control actions (reclassify, pipe etc.),
cookies, max index value and police args sanity check.

Signed-off-by: Roman Mashak 
---
 .../tc-testing/tc-tests/actions/mirred.json| 192 +
 .../tc-testing/tc-tests/actions/police.json| 144 
 .../tc-testing/tc-tests/actions/skbedit.json   | 168 ++
 .../tc-testing/tc-tests/actions/skbmod.json|  26 ++-
 4 files changed, 529 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json 
b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
index 0fcccf18399b..443c9b3c8664 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
@@ -171,6 +171,198 @@
 ]
 },
 {
+"id": "8917",
+"name": "Add mirred mirror action with control pass",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo 
pass index 1",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 1",
+"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to 
device lo\\) pass.*index 1 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "1054",
+"name": "Add mirred mirror action with control pipe",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo 
pipe index 15",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 15",
+"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to 
device lo\\) pipe.*index 15 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "9887",
+"name": "Add mirred mirror action with control continue",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo 
continue index 15",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 15",
+"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to 
device lo\\) continue.*index 15 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "e4aa",
+"name": "Add mirred mirror action with control reclassify",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo 
reclassify index 150",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 150",
+"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to 
device lo\\) reclassify.*index 150 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "ece9",
+"name": "Add mirred mirror action with control drop",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo 
drop index 99",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 99",
+"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to 
device lo\\) drop.*index 99 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "0031",
+"name": "Add mirred mirror action with control jump",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,

Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-22 Thread Boris Pismenny

...


Can't we move this check in tls_dev_event() and use it for all types of events?
Then we avoid duplicate code.



No. Not all events require this check. Also, the result is different for 
different events.


No. You always return NOTIFY_DONE, in case of !(netdev->features & 
NETIF_F_HW_TLS_TX).
See below:

static int tls_check_dev_ops(struct net_device *dev)
{
	if (!dev->tlsdev_ops)
		return NOTIFY_BAD;

	return NOTIFY_DONE;
}

static int tls_device_down(struct net_device *netdev)
{
	struct tls_context *ctx, *tmp;
	struct list_head list;
	unsigned long flags;

	...
	return NOTIFY_DONE;
}

static int tls_dev_event(struct notifier_block *this, unsigned long event,
			 void *ptr)
{
	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);

	if (!(netdev->features & NETIF_F_HW_TLS_TX))
		return NOTIFY_DONE;

	switch (event) {
	case NETDEV_REGISTER:
	case NETDEV_FEAT_CHANGE:
		return tls_check_dev_ops(netdev);

	case NETDEV_DOWN:
		return tls_device_down(netdev);
	}
	return NOTIFY_DONE;
}
 


Sure, will fix in V3.


+
+    /* Request a write lock to block new offload attempts
+ */
+    percpu_down_write(&device_offload_lock);


What is the reason percpu_rwsem is chosen here? It looks like this primitive
gives more advantages to readers than a plain rwsem does. But it also gives
disadvantages to writers. It would be fine, unless tls_device_down() is called
with rtnl_lock() held from a netdevice notifier. But since netdevice notifiers
are called with rtnl_lock() held, percpu_rwsem will increase the time
rtnl_lock() is locked.

We use a rwsem to allow multiple (reader) invocations of 
tls_set_device_offload, which is triggered by the user (presumably) during the 
TLS handshake. This might be considered a fast path.

However, we must block all calls to tls_set_device_offload while we are 
processing NETDEV_DOWN events (writer).

As you've mentioned, the percpu rwsem is more efficient for readers, especially 
on NUMA systems, where cache-line bouncing occurs during reader acquire and 
reduces performance.


Hm, and who are the readers? It's used from do_tls_setsockopt_tx(), but it 
doesn't
seem to be performance critical. Who else?



It depends on whether you consider the TLS handshake code as critical.
The readers are TCP connections processing the CCS message of the TLS 
handshake. They are providing key material to the kernel to start using 
Kernel TLS.





Can't we use plain rwsem here instead?



It's a performance tradeoff. I'm not certain that the percpu rwsem write-side 
acquire is significantly worse than using the global rwsem.

For now, while all of this is experimental, can we agree to focus on the 
performance of readers? We can change it later if it becomes a problem.


Same as above.
  


Replaced with rwsem from V2.
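For completeness, the plain-rwsem pattern that replaces the percpu
variant is simply (sketch; names follow the discussion above):

	static DECLARE_RWSEM(device_offload_lock);

	/* reader side, e.g. tls_set_device_offload() */
	down_read(&device_offload_lock);
	/* ... install the offload state ... */
	up_read(&device_offload_lock);

	/* writer side, e.g. tls_device_down(); blocks new offloads */
	down_write(&device_offload_lock);
	/* ... tear down contexts for the going-down netdev ... */
	up_write(&device_offload_lock);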


Re: [PATCH V2 net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-22 Thread Boris Pismenny



On 3/21/2018 11:10 PM, Eric Dumazet wrote:



On 03/21/2018 02:01 PM, Saeed Mahameed wrote:

From: Ilya Lesokhin 

This patch adds a generic infrastructure to offload TLS crypto to a


...


+
+static inline int tls_push_record(struct sock *sk,
+ struct tls_context *ctx,
+ struct tls_offload_context *offload_ctx,
+ struct tls_record_info *record,
+ struct page_frag *pfrag,
+ int flags,
+ unsigned char record_type)
+{
+   skb_frag_t *frag;
+   struct tcp_sock *tp = tcp_sk(sk);
+   struct page_frag fallback_frag;
+   struct page_frag  *tag_pfrag = pfrag;
+   int i;
+
+   /* fill prepend */
+   frag = &record->frags[0];
+   tls_fill_prepend(ctx,
+skb_frag_address(frag),
+record->len - ctx->prepend_size,
+record_type);
+
+   if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag, GFP_KERNEL))) {
+   /* HW doesn't care about the data in the tag
+* so in case pfrag has no room
+* for a tag and we can't allocate a new pfrag
+* just use the page in the first frag
+* rather than write a complicated fallback code.
+*/
+   tag_pfrag = &fallback_frag;
+   tag_pfrag->page = skb_frag_page(frag);
+   tag_pfrag->offset = 0;
+   }
+


If HW does not care, why even try to call skb_page_frag_refill()?



There's no particular reason for allocating memory here. I'll remove it 
for V3.



If you remove it, then we remove one seldom used path and might uncover bugs

This part looks very suspect to me, to be honest.



HW doesn't care, because it generates the tag, and nothing along the 
network stack touches the data here.


Would you prefer that we allocate it anyway and wait for memory if it is 
not available?




RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-22 Thread David Laight
From: David Laight
> Sent: 22 March 2018 10:36
...
> Any code would need to be in memcpy_fromio(), not in every driver that
> might benefit.
> Then fallback code can be used if the registers aren't available.
> 
> >  (b) we can't guarantee that %ymm register write will show up on any
> > bus as a single write transaction anyway
> 
> Misaligned 8 byte accesses generate a single PCIe TLP.
> I can look at what happens for AVX2 transfers later.
> I've got code that mmap()s PCIe addresses into user space, and can
> look at the TLP (indirectly through tracing on an fpga target).
> Just need to set something up that uses AVX copies.

On my i7-7700, reads into xmm registers generate 16-byte TLPs and
reads into ymm registers 32-byte TLPs.
I don't think I've a system that supports AVX-512.

With my lethargic fpga slave, 32-byte reads happen every 144 clocks
and 16-byte ones every 126 (+/- the odd clock).
Single-byte ones happen every 108, 8-byte ones every 117.
So I have (about) 110 clocks of overhead on every read cycle.
These clocks are the 62.5MHz clock on the fpga.

So if we needed to do PIO reads, using the AVX2 (or better, AVX-512)
registers would make a significant difference.
Fortunately we can 'dma' most of the data we need to transfer.

I've traced writes before, they are a lot faster and are limited
by things in the fpga fabric (they appear back to back).
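For reference, the read side of such a user-space test boils down to
something like this (sketch; the function name is illustrative):

	#include <immintrin.h>

	/* Copy one 32-byte chunk from a WC-mapped PCIe BAR via a ymm
	 * register; as measured above this appears on the bus as a
	 * single 32-byte TLP.  'src' is assumed 32-byte aligned.
	 */
	static inline void wc_read_32(const void *src, void *dst)
	{
		__m256i v = _mm256_load_si256((const __m256i *)src);
		_mm256_store_si256((__m256i *)dst, v);
	}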

David



Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-22 Thread Kirill Tkhai
On 22.03.2018 15:38, Boris Pismenny wrote:
> ...

 Can't we move this check in tls_dev_event() and use it for all types of 
 events?
 Then we avoid duplicate code.

>>>
>>> No. Not all events require this check. Also, the result is different for 
>>> different events.
>>
>> No. You always return NOTIFY_DONE, in case of !(netdev->features & 
>> NETIF_F_HW_TLS_TX).
>> See below:
>>
>> static int tls_check_dev_ops(struct net_device *dev)
>> {
>> 	if (!dev->tlsdev_ops)
>> 		return NOTIFY_BAD;
>>
>> 	return NOTIFY_DONE;
>> }
>>
>> static int tls_device_down(struct net_device *netdev)
>> {
>> 	struct tls_context *ctx, *tmp;
>> 	struct list_head list;
>> 	unsigned long flags;
>>
>> 	...
>> 	return NOTIFY_DONE;
>> }
>>
>> static int tls_dev_event(struct notifier_block *this, unsigned long event,
>> 			 void *ptr)
>> {
>> 	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
>>
>> 	if (!(netdev->features & NETIF_F_HW_TLS_TX))
>> 		return NOTIFY_DONE;
>>
>> 	switch (event) {
>> 	case NETDEV_REGISTER:
>> 	case NETDEV_FEAT_CHANGE:
>> 		return tls_check_dev_ops(netdev);
>>
>> 	case NETDEV_DOWN:
>> 		return tls_device_down(netdev);
>> 	}
>> 	return NOTIFY_DONE;
>> }
>>  
> 
> Sure, will fix in V3.
> 
> +
> +    /* Request a write lock to block new offload attempts
> + */
> +    percpu_down_write(&device_offload_lock);

 What is the reason percpu_rwsem is chosen here? It looks like this 
 primitive
 gives more advantages to readers than plain rwsem does. But it also gives
 disadvantages to writers. It would be fine, unless tls_device_down() is 
 called
 with rtnl_lock() held from a netdevice notifier. But since netdevice notifiers
 are called with rtnl_lock() held, percpu_rwsem will increase the time 
 rtnl_lock()
 is locked.
>>> We use a rwsem to allow multiple (reader) invocations of 
>>> tls_set_device_offload, which is triggered by the user (presumably) during 
>>> the TLS handshake. This might be considered a fast path.
>>>
>>> However, we must block all calls to tls_set_device_offload while we are 
>>> processing NETDEV_DOWN events (writer).
>>>
>>> As you've mentioned, the percpu rwsem is more efficient for readers, 
>>> especially on NUMA systems, where cache-line bouncing occurs during reader 
>>> acquire and reduces performance.
>>
>> Hm, and who are the readers? It's used from do_tls_setsockopt_tx(), but it 
>> doesn't
>> seem to be performance critical. Who else?
>>
> 
> It depends on whether you consider the TLS handshake code as critical.
> The readers are TCP connections processing the CCS message of the TLS 
> handshake. They are providing key material to the kernel to start using 
> Kernel TLS.

The thing is that rtnl_lock() is critical for the rest of the system,
while the TLS handshake is a small subset of the actions the system performs.

rtnl_lock() is used almost everywhere, from netlink messages
to netdev ioctls.

Currently, you can't even close a raw socket without the rtnl lock.
So all of this is a big reason to avoid doing RCU waits under it.

Kirill


 Can't we use plain rwsem here instead?

>>>
>>> It's a performance tradeoff. I'm not certain that the percpu rwsem write-
>>> side acquire is significantly worse than using the global rwsem.
>>>
>>> For now, while all of this is experimental, can we agree to focus on the 
>>> performance of readers? We can change it later if it becomes a problem.
>>
>> Same as above.
>>   
> 
> Replaced with rwsem from V2.


Re: [PATCH net-next 1/1] net/ipv4: disable SMC TCP option with SYN Cookies

2018-03-22 Thread Ursula Braun


On 03/20/2018 05:43 PM, Eric Dumazet wrote:
> 
> 
> On 03/20/2018 09:21 AM, Eric Dumazet wrote:
>>
>>
>> On 03/20/2018 08:53 AM, Ursula Braun wrote:
>>> From: Hans Wippel 
>>>
>>> Currently, the SMC experimental TCP option in a SYN packet is lost on
>>> the server side when SYN Cookies are active. However, the corresponding
>>> SYNACK sent back to the client contains the SMC option. This causes an
>>> inconsistent view of the SMC capabilities on the client and server.
>>>
>>> This patch disables the SMC option in the SYNACK when SYN Cookies are
>>> active to avoid this issue.
>>>
>>> Signed-off-by: Hans Wippel 
>>> Signed-off-by: Ursula Braun 
>>> ---
>>>  net/ipv4/tcp_output.c | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>> index 383cac0ff0ec..22894514feae 100644
>>> --- a/net/ipv4/tcp_output.c
>>> +++ b/net/ipv4/tcp_output.c
>>> @@ -3199,6 +3199,8 @@ struct sk_buff *tcp_make_synack(const struct sock 
>>> *sk, struct dst_entry *dst,
>>> /* Under synflood, we do not attach skb to a socket,
>>>  * to avoid false sharing.
>>>  */
>>> +   if (IS_ENABLED(CONFIG_SMC))
>>> +   ireq->smc_ok = 0;
>>> break;
>>> case TCP_SYNACK_FASTOPEN:
>>> /* sk is a const pointer, because we want to express multiple
>>>
>>
>> I disagree with net-next qualification.
>>
>> This fixes a bug, so please send it for net tree, and including an 
>> appropriate Fixes: tag.
>>

Okay, I will send it for the net tree.

> 
> Also, please do not add the fix in tcp_make_synack()
> 
> tcp_make_synack() builds an skb, and really should not modify ireq, ideally.
> The only reason ireq is not const is because of the skb_set_owner_w().
> 
> I would clear it in cookie_v4_check()/cookie_v6_check()
> 
> (We could have a common helper to allocate a TCP ireq btw, but this will wait 
> a future patch for net-next)
>

We moved the clear to cookie_v4_check()/cookie_v6_check. However, this does not 
seem to
be sufficient to prevent the SYNACK from containing the SMC experimental option.
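The clear we tried there was essentially (sketch):

	if (IS_ENABLED(CONFIG_SMC))
		ireq->smc_ok = 0;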
We found that an additional check in tcp_conn_request() helps:

--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6248,6 +6248,9 @@ int tcp_conn_request(struct request_sock
if (want_cookie && !tmp_opt.saw_tstamp)
tcp_clear_options(&tmp_opt);
 
+   if (IS_ENABLED(CONFIG_SMC) && want_cookie && tmp_opt.smc_ok)
+   tmp_opt.smc_ok = 0;
+
tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
tcp_openreq_init(req, &tmp_opt, skb, sk);
inet_rsk(req)->no_srccheck = inet_sk(sk)->transparent;

Do you think this could be the right place for clearing the smc_ok bit?

  



Re: [PATCH 1/2] bpf: Remove struct bpf_verifier_env argument from print_bpf_insn

2018-03-22 Thread Jiri Olsa
On Thu, Mar 22, 2018 at 10:34:18AM +0100, Daniel Borkmann wrote:
> On 03/21/2018 07:37 PM, Jiri Olsa wrote:
> > On Wed, Mar 21, 2018 at 05:25:33PM +, Quentin Monnet wrote:
> >> 2018-03-21 16:02 UTC+0100 ~ Jiri Olsa 
> >>> We use print_bpf_insn in user space (bpftool and soon perf),
> >>> so it'd be nice to keep it generic and strip it off the kernel
> >>> struct bpf_verifier_env argument.
> >>>
> >>> This argument can be safely removed, because its users can
> >>> use the struct bpf_insn_cbs::private_data to pass it.
> >>>
> >>> Signed-off-by: Jiri Olsa 
> >>> ---
> >>>  kernel/bpf/disasm.c   | 52 
> >>> +--
> >>>  kernel/bpf/disasm.h   |  5 +
> >>>  kernel/bpf/verifier.c |  6 +++---
> >>>  3 files changed, 30 insertions(+), 33 deletions(-)
> >>
> >> [...]
> >>
> >>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >>> index c6eff108aa99..9f27d3fa7259 100644
> >>> --- a/kernel/bpf/verifier.c
> >>> +++ b/kernel/bpf/verifier.c
> >>> @@ -202,8 +202,7 @@ EXPORT_SYMBOL_GPL(bpf_verifier_log_write);
> >>>   * generic for symbol export. The function was renamed, but not the 
> >>> calls in
> >>>   * the verifier to avoid complicating backports. Hence the alias below.
> >>>   */
> >>> -static __printf(2, 3) void verbose(struct bpf_verifier_env *env,
> >>> -const char *fmt, ...)
> >>> +static __printf(2, 3) void verbose(void *private_data, const char *fmt, 
> >>> ...)
> >>>   __attribute__((alias("bpf_verifier_log_write")));
> >>
> >> Just as a note, verbose() will be aliased to a function whose prototype
> >> differs (bpf_verifier_log_write() still expects a struct
> >> bpf_verifier_env as its first argument). I am not so familiar with
> >> function aliases, could this change be a concern?
> > 
> > yea, but as it was pointer for pointer switch I did not
> > see any problem with that.. I'll check more
> 
> Ok, holding off for now until we have clarification. Other option could also
> be to make it void *private_data everywhere and for the kernel writer then
> do struct bpf_verifier_env *env = private_data.

can't find much info about the alias behaviour for this
case.. so how about having separate function for the
print_cb like below.. I still need to test it

thanks,
jirka


---
 kernel/bpf/disasm.c   | 52 +--
 kernel/bpf/disasm.h   |  5 +
 kernel/bpf/verifier.c | 41 +---
 3 files changed, 57 insertions(+), 41 deletions(-)

diff --git a/kernel/bpf/disasm.c b/kernel/bpf/disasm.c
index 8740406df2cd..d6b76377cb6e 100644
--- a/kernel/bpf/disasm.c
+++ b/kernel/bpf/disasm.c
@@ -113,16 +113,16 @@ static const char *const bpf_jmp_string[16] = {
 };
 
 static void print_bpf_end_insn(bpf_insn_print_t verbose,
-  struct bpf_verifier_env *env,
+  void *private_data,
   const struct bpf_insn *insn)
 {
-   verbose(env, "(%02x) r%d = %s%d r%d\n", insn->code, insn->dst_reg,
+   verbose(private_data, "(%02x) r%d = %s%d r%d\n",
+   insn->code, insn->dst_reg,
BPF_SRC(insn->code) == BPF_TO_BE ? "be" : "le",
insn->imm, insn->dst_reg);
 }
 
 void print_bpf_insn(const struct bpf_insn_cbs *cbs,
-   struct bpf_verifier_env *env,
const struct bpf_insn *insn,
bool allow_ptr_leaks)
 {
@@ -132,23 +132,23 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs,
if (class == BPF_ALU || class == BPF_ALU64) {
if (BPF_OP(insn->code) == BPF_END) {
if (class == BPF_ALU64)
-   verbose(env, "BUG_alu64_%02x\n", insn->code);
+   verbose(cbs->private_data, "BUG_alu64_%02x\n", 
insn->code);
else
-   print_bpf_end_insn(verbose, env, insn);
+   print_bpf_end_insn(verbose, cbs->private_data, 
insn);
} else if (BPF_OP(insn->code) == BPF_NEG) {
-   verbose(env, "(%02x) r%d = %s-r%d\n",
+   verbose(cbs->private_data, "(%02x) r%d = %s-r%d\n",
insn->code, insn->dst_reg,
class == BPF_ALU ? "(u32) " : "",
insn->dst_reg);
} else if (BPF_SRC(insn->code) == BPF_X) {
-   verbose(env, "(%02x) %sr%d %s %sr%d\n",
+   verbose(cbs->private_data, "(%02x) %sr%d %s %sr%d\n",
insn->code, class == BPF_ALU ? "(u32) " : "",
insn->dst_reg,
bpf_alu_string[BPF_OP(insn->code) >> 4],
class == BPF_ALU ? "(u32) " : "",
insn->src_reg);
} else {
-

Re: [PATCH v2 bpf-next 4/8] tracepoint: compute num_args at build time

2018-03-22 Thread Steven Rostedt
On Wed, 21 Mar 2018 15:05:46 -0700
Alexei Starovoitov  wrote:

> Like the only reason my patch is counting till 17 is because of
> trace_iwlwifi_dev_ucode_error().
> The next offenders are using 12 arguments:
> trace_mc_event()
> trace_mm_vmscan_lru_shrink_inactive()
> 
> Clearly not a very efficient usage of it:
>  trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>  nr_scanned, nr_reclaimed,
>  stat.nr_dirty,  stat.nr_writeback,
>  stat.nr_congested, stat.nr_immediate,
>  stat.nr_activate, stat.nr_ref_keep,
>  stat.nr_unmap_fail,
>  sc->priority, file);
> could have passed &stat instead.

Yes they should have, and if I was on the Cc for that patch, I would
have yelled at them and told them that's exactly what they needed to do.

Perhaps I should add something to keep any tracepoint from having more
than 6 arguments. That should force a clean up quickly.

I think I may start doing that.
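For example, the call quoted above could then shrink to something like
this (sketch; the TRACE_EVENT definition would take the struct pointer
and pick out the fields it needs):

	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, nr_scanned,
					    nr_reclaimed, &stat,
					    sc->priority, file);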

-- Steve


RE: [PATCH net 1/1] qede: Fix barrier usage after tx doorbell write.

2018-03-22 Thread Chopra, Manish
> -Original Message-
> From: Elior, Ariel
> Sent: Wednesday, March 21, 2018 7:10 PM
> To: da...@davemloft.net
> Cc: netdev@vger.kernel.org; Kalderon, Michal ;
> Chopra, Manish 
> Subject: RE: [PATCH net 1/1] qede: Fix barrier usage after tx doorbell write.
> 
> > Subject: [PATCH net 1/1] qede: Fix barrier usage after tx doorbell write.
> >
> > Since commit c5ad119fb6c09b0297446be05bd66602fa564758
> > ("net: sched: pfifo_fast use skb_array") driver is exposed to an issue
> > where it is hitting NULL skbs while handling TX completions. Driver
> > uses mmiowb() to flush the writes to the doorbell bar which is a
> > write-combined bar, however on x86
> > mmiowb() does not flush the write combined buffer.
> >
> > This patch fixes this problem by replacing mmiowb() with wmb() after
> > the write combined doorbell write so that writes are flushed and
> > synchronized from more than one processor.
> >
> > Signed-off-by: Ariel Elior 
> > Signed-off-by: Manish Chopra 
> > ---
> >  drivers/net/ethernet/qlogic/qede/qede_fp.c |   10 --
> >  1 files changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c
> > b/drivers/net/ethernet/qlogic/qede/qede_fp.c
> > index dafc079..2e921ca 100644
> > --- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
> > +++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
> > @@ -320,13 +320,11 @@ static inline void
> > qede_update_tx_producer(struct qede_tx_queue *txq)
> > barrier();
> > writel(txq->tx_db.raw, txq->doorbell_addr);
> >
> > -   /* mmiowb is needed to synchronize doorbell writes from more than
> one
> > -* processor. It guarantees that the write arrives to the device before
> > -* the queue lock is released and another start_xmit is called (possibly
> > -* on another CPU). Without this barrier, the next doorbell can bypass
> > -* this doorbell. This is applicable to IA64/Altix systems.
> > +   /* Fence required to flush the write combined buffer, since another
> > +* CPU may write to the same doorbell address and data may be lost
> > +* due to relaxed order nature of write combined bar.
> >  */
> > -   mmiowb();
> > +   wmb();
> >  }
> >
> >  static int qede_xdp_xmit(struct qede_dev *edev, struct qede_fastpath
> > *fp,
> > --
> > 1.7.1
> 
> Hi Dave,
> This patch appears as "superseded" in patchwork. I am not really sure why 
> that is
> - I noticed some other barrier work is going on, but none of it will solve 
> this
> issue. This patch solves an important bug in the driver - please consider 
> applying
> it.
> Thanks,
> Ariel

Hi Dave,

The other patch series in progress, which converts to writel_relaxed(),
doesn't solve this issue.
I think that series may have caused this fix to be marked superseded,
since the changes are in the same area/function.
This is an important fix for the barrier after the write-combined doorbell
write.
Please let me know whether I should re-submit this fix.

Thanks,
Manish


Re: [PATCH 28/28] random: convert to ->poll_mask

2018-03-22 Thread Theodore Y. Ts'o
On Wed, Mar 21, 2018 at 08:40:32AM +0100, Christoph Hellwig wrote:
> The big change is that random_read_wait and random_write_wait are merged
> into a single waitqueue that uses keyed wakeups.  Because wait_event_*
> doesn't know about that, this will lead to occasional spurious wakeups
> in _random_read and add_hwgenerator_randomness, but wait_event_* is
> designed to handle these and we are not in a hot path there.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Theodore Ts'o 



[RFC PATCH 5/5] net: macb: Add WOL support with ARP

2018-03-22 Thread harinikatakamlinux
From: Harini Katakam 

This patch enables ARP wake event support in GEM through the following:

-> WOL capability can be selected based on the SoC/GEM IP version rather
than a devicetree property alone. Hence add a new capability property and
set the device as "wakeup capable" in probe in this case.
-> Wake source selection can be done via ethtool or by enabling wakeup
in /sys/devices/platform//ethx/power/
This patch adds ARP as the default wake source; the existing selection of
WOL using magic packet remains unchanged.
-> When GEM is the wake device with ARP as the wake event, the current
IP address to match is written to the WOL register along with other
configuration required for the MAC to recognize an ARP event.
-> While RX needs to remain enabled, there is no need to process the
actual wake packet - hence tie off all RX queues to avoid unnecessary
processing by DMA in the background. This tie off is done using a
dummy buffer descriptor with used bit set. (There is no other provision
to disable RX DMA in the GEM IP version in ZynqMP)
-> TX is disabled and all interrupts except WOL on Q0 are disabled.
Clear the WOL interrupt as no other action is required from driver.
Power management of the SoC will already have got the event and will
take care of initiating resume.
-> Upon resume ARP WOL config is cleared and macb is reinitialized.

Signed-off-by: Harini Katakam 
---
 drivers/net/ethernet/cadence/macb.h  |   6 ++
 drivers/net/ethernet/cadence/macb_main.c | 130 +--
 2 files changed, 131 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h 
b/drivers/net/ethernet/cadence/macb.h
index 9e7fb14..e18ff34 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -93,6 +93,7 @@
 #define GEM_SA3T   0x009C /* Specific3 Top */
 #define GEM_SA4B   0x00A0 /* Specific4 Bottom */
 #define GEM_SA4T   0x00A4 /* Specific4 Top */
+#define GEM_WOL0x00B8 /* Wake on LAN */
 #define GEM_EFTSH  0x00e8 /* PTP Event Frame Transmitted Seconds 
Register 47:32 */
 #define GEM_EFRSH  0x00ec /* PTP Event Frame Received Seconds 
Register 47:32 */
 #define GEM_PEFTSH 0x00f0 /* PTP Peer Event Frame Transmitted 
Seconds Register 47:32 */
@@ -398,6 +399,8 @@
 #define MACB_PDRSFT_SIZE   1
 #define MACB_SRI_OFFSET26 /* TSU Seconds Register Increment */
 #define MACB_SRI_SIZE  1
+#define GEM_WOL_OFFSET 28 /* Enable wake-on-lan interrupt in GEM */
+#define GEM_WOL_SIZE   1
 
 /* Timer increment fields */
 #define MACB_TI_CNS_OFFSET 0
@@ -635,6 +638,7 @@
 #define MACB_CAPS_USRIO_DISABLED   0x0010
 #define MACB_CAPS_JUMBO0x0020
 #define MACB_CAPS_GEM_HAS_PTP  0x0040
+#define MACB_CAPS_WOL  0x0080
 #define MACB_CAPS_FIFO_MODE0x1000
 #define MACB_CAPS_GIGABIT_MODE_AVAILABLE   0x2000
 #define MACB_CAPS_SG_DISABLED  0x4000
@@ -1147,6 +1151,8 @@ struct macb {
unsigned intnum_queues;
unsigned intqueue_mask;
struct macb_queue   queues[MACB_MAX_QUEUES];
+   dma_addr_t  rx_ring_tieoff_dma;
+   struct macb_dma_desc*rx_ring_tieoff;
 
spinlock_t  lock;
struct platform_device  *pdev;
diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index bca91bd..9902654 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "macb.h"
 
 #define MACB_RX_BUFFER_SIZE128
@@ -1400,6 +1401,12 @@ static irqreturn_t macb_interrupt(int irq, void *dev_id)
spin_lock(&bp->lock);
 
while (status) {
+   if (status & GEM_BIT(WOL)) {
+   if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
+   queue_writel(queue, ISR, GEM_BIT(WOL));
+   break;
+   }
+
/* close possible race with dev_close */
if (unlikely(!netif_running(dev))) {
queue_writel(queue, IDR, -1);
@@ -1900,6 +1907,12 @@ static void macb_free_consistent(struct macb *bp)
queue->rx_ring = NULL;
}
 
+   if (bp->rx_ring_tieoff) {
+   dma_free_coherent(&bp->pdev->dev, macb_dma_desc_get_size(bp),
+ bp->rx_ring_tieoff, bp->rx_ring_tieoff_dma);
+   bp->rx_ring_tieoff = NULL;
+   }
+
for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
kfree(queue->tx_skb);
queue->tx_skb = NULL;
@@ -1979,6 +1992,14 @@ static int macb_alloc_consistent(struct macb *bp)
   "Allocated RX

[RFC PATCH 4/5] net: macb: Add support for suspend/resume with full power down

2018-03-22 Thread harinikatakamlinux
From: Harini Katakam 

When the macb device is suspended and the system is powered down, the clocks
are removed, so macb should be closed gracefully and restored upon resume.
This patch does so by switching off the net device, suspending the PHY and
performing the necessary cleanup of interrupts and BDs.
Upon resume, all of these are reinitialized.

The macb device is reset only when GEM is not a wake device.
Even when GEM is a wake device, TX queues can be stopped and the PTP device
can be closed (the TSU clock will be disabled in pm_runtime_suspend) as
wake event detection has no dependency on these.

Signed-off-by: Kedareswara rao Appana 
Signed-off-by: Harini Katakam 
---
 drivers/net/ethernet/cadence/macb_main.c | 38 ++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index ce75088..bca91bd 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -4167,16 +4167,33 @@ static int __maybe_unused macb_suspend(struct device 
*dev)
struct platform_device *pdev = to_platform_device(dev);
struct net_device *netdev = platform_get_drvdata(pdev);
struct macb *bp = netdev_priv(netdev);
+   struct macb_queue *queue = bp->queues;
+   unsigned long flags;
+   unsigned int q;
+
+   if (!netif_running(netdev))
+   return 0;
 
-   netif_carrier_off(netdev);
-   netif_device_detach(netdev);
 
if (bp->wol & MACB_WOL_ENABLED) {
macb_writel(bp, IER, MACB_BIT(WOL));
macb_writel(bp, WOL, MACB_BIT(MAG));
enable_irq_wake(bp->queues[0].irq);
+   netif_device_detach(netdev);
+   } else {
+   netif_device_detach(netdev);
+   for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, 
++queue)
+   napi_disable(&queue->napi);
+   phy_stop(netdev->phydev);
+   phy_suspend(netdev->phydev);
+   spin_lock_irqsave(&bp->lock, flags);
+   macb_reset_hw(bp);
+   spin_unlock_irqrestore(&bp->lock, flags);
}
 
+   netif_carrier_off(netdev);
+   if (bp->ptp_info)
+   bp->ptp_info->ptp_remove(netdev);
pm_runtime_force_suspend(dev);
 
return 0;
@@ -4187,6 +4204,11 @@ static int __maybe_unused macb_resume(struct device *dev)
struct platform_device *pdev = to_platform_device(dev);
struct net_device *netdev = platform_get_drvdata(pdev);
struct macb *bp = netdev_priv(netdev);
+   struct macb_queue *queue = bp->queues;
+   unsigned int q;
+
+   if (!netif_running(netdev))
+   return 0;
 
pm_runtime_force_resume(dev);
 
@@ -4194,9 +4216,21 @@ static int __maybe_unused macb_resume(struct device *dev)
macb_writel(bp, IDR, MACB_BIT(WOL));
macb_writel(bp, WOL, 0);
disable_irq_wake(bp->queues[0].irq);
+   } else {
+   macb_writel(bp, NCR, MACB_BIT(MPE));
+   for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, 
++queue)
+   napi_enable(&queue->napi);
+   netif_carrier_on(netdev);
+   phy_resume(netdev->phydev);
+   phy_start(netdev->phydev);
}
 
+   bp->macbgem_ops.mog_init_rings(bp);
+   macb_init_hw(bp);
+   macb_set_rx_mode(netdev);
netif_device_attach(netdev);
+   if (bp->ptp_info)
+   bp->ptp_info->ptp_init(netdev);
 
return 0;
 }
-- 
2.7.4



[RFC PATCH 2/5] net: macb: Support clock management for tsu_clk

2018-03-22 Thread harinikatakamlinux
From: Harini Katakam 

The TSU clock needs to be enabled/disabled according to its presence in the
devicetree, and it should also be controlled during suspend/resume (WOL has
no dependency on this clock).

Signed-off-by: Harini Katakam 
---
 drivers/net/ethernet/cadence/macb.h  |  3 ++-
 drivers/net/ethernet/cadence/macb_main.c | 30 +-
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h 
b/drivers/net/ethernet/cadence/macb.h
index 8665982..9e7fb14 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -1074,7 +1074,7 @@ struct macb_config {
unsigned intdma_burst_length;
int (*clk_init)(struct platform_device *pdev, struct clk **pclk,
struct clk **hclk, struct clk **tx_clk,
-   struct clk **rx_clk);
+   struct clk **rx_clk, struct clk **tsu_clk);
int (*init)(struct platform_device *pdev);
int jumbo_max_len;
 };
@@ -1154,6 +1154,7 @@ struct macb {
struct clk  *hclk;
struct clk  *tx_clk;
struct clk  *rx_clk;
+   struct clk  *tsu_clk;
struct net_device   *dev;
union {
struct macb_stats   macb;
diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index f4030c1..ae61927 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -3245,7 +3245,7 @@ static void macb_probe_queues(void __iomem *mem,
 
 static int macb_clk_init(struct platform_device *pdev, struct clk **pclk,
 struct clk **hclk, struct clk **tx_clk,
-struct clk **rx_clk)
+struct clk **rx_clk, struct clk **tsu_clk)
 {
struct macb_platform_data *pdata;
int err;
@@ -3279,6 +3279,10 @@ static int macb_clk_init(struct platform_device *pdev, 
struct clk **pclk,
if (IS_ERR(*rx_clk))
*rx_clk = NULL;
 
+   *tsu_clk = devm_clk_get(&pdev->dev, "tsu_clk");
+   if (IS_ERR(*tsu_clk))
+   *tsu_clk = NULL;
+
err = clk_prepare_enable(*pclk);
if (err) {
dev_err(&pdev->dev, "failed to enable pclk (%u)\n", err);
@@ -3303,8 +3307,17 @@ static int macb_clk_init(struct platform_device *pdev, 
struct clk **pclk,
goto err_disable_txclk;
}
 
+   err = clk_prepare_enable(*tsu_clk);
+   if (err) {
+   dev_err(&pdev->dev, "failed to enable tsu_clk (%u)\n", err);
+   goto err_disable_rxclk;
+   }
+
return 0;
 
+err_disable_rxclk:
+   clk_disable_unprepare(*rx_clk);
+
 err_disable_txclk:
clk_disable_unprepare(*tx_clk);
 
@@ -3754,13 +3767,14 @@ static const struct net_device_ops at91ether_netdev_ops 
= {
 
 static int at91ether_clk_init(struct platform_device *pdev, struct clk **pclk,
  struct clk **hclk, struct clk **tx_clk,
- struct clk **rx_clk)
+ struct clk **rx_clk, struct clk **tsu_clk)
 {
int err;
 
*hclk = NULL;
*tx_clk = NULL;
*rx_clk = NULL;
+   *tsu_clk = NULL;
 
*pclk = devm_clk_get(&pdev->dev, "ether_clk");
if (IS_ERR(*pclk))
@@ -3898,11 +3912,12 @@ static int macb_probe(struct platform_device *pdev)
 {
const struct macb_config *macb_config = &default_gem_config;
int (*clk_init)(struct platform_device *, struct clk **,
-   struct clk **, struct clk **,  struct clk **)
- = macb_config->clk_init;
+   struct clk **, struct clk **,  struct clk **,
+   struct clk **) = macb_config->clk_init;
int (*init)(struct platform_device *) = macb_config->init;
struct device_node *np = pdev->dev.of_node;
struct clk *pclk, *hclk = NULL, *tx_clk = NULL, *rx_clk = NULL;
+   struct clk *tsu_clk = NULL;
unsigned int queue_mask, num_queues;
struct macb_platform_data *pdata;
bool native_io;
@@ -3930,7 +3945,7 @@ static int macb_probe(struct platform_device *pdev)
}
}
 
-   err = clk_init(pdev, &pclk, &hclk, &tx_clk, &rx_clk);
+   err = clk_init(pdev, &pclk, &hclk, &tx_clk, &rx_clk, &tsu_clk);
if (err)
return err;
 
@@ -3967,6 +3982,7 @@ static int macb_probe(struct platform_device *pdev)
bp->hclk = hclk;
bp->tx_clk = tx_clk;
bp->rx_clk = rx_clk;
+   bp->tsu_clk = tsu_clk;
if (macb_config)
bp->jumbo_max_len = macb_config->jumbo_max_len;
 
@@ -4064,6 +4080,7 @@ static int macb_probe(struct platform_device *pdev)
clk_disable_unprepare(hclk);
clk_disable_unprepare(pclk);
clk_disable_unprepare(rx_

Re: DTS for our Configuration

2018-03-22 Thread Andrew Lunn
> As you understand, I prefer not to change the driver. 

Actually, i don't understand why you prefer not to change the driver.

> Is there a way for me to bypass this issue?
> Can I use other property than 'fixed-link'?

My quick look at the driver makes me think you are going to have to
change it. But the changes can be mainlined, if done correctly.

I think it is time you rolled up your sleeves and started really
debugging this yourselves, and made the needed changes.

  Andrew


[RFC PATCH 3/5] net: macb: Add pm runtime support

2018-03-22 Thread harinikatakamlinux
From: Harini Katakam 

Add runtime pm functions and move clock handling there.
Enable clocks in mdio read/write functions.

Signed-off-by: Shubhrajyoti Datta 
Signed-off-by: Harini Katakam 
---
 drivers/net/ethernet/cadence/macb_main.c | 105 ++-
 1 file changed, 90 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index ae61927..ce75088 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "macb.h"
 
 #define MACB_RX_BUFFER_SIZE128
@@ -77,6 +78,7 @@
  * 1 frame time (10 Mbits/s, full-duplex, ignoring collisions)
  */
 #define MACB_HALT_TIMEOUT  1230
+#define MACB_PM_TIMEOUT  100 /* ms */
 
 /* DMA buffer descriptor might be different size
  * depends on hardware configuration:
@@ -321,8 +323,13 @@ static int macb_mdio_read(struct mii_bus *bus, int mii_id, 
int regnum)
 {
struct macb *bp = bus->priv;
int value;
+   int err;
ulong timeout;
 
+   err = pm_runtime_get_sync(&bp->pdev->dev);
+   if (err < 0)
+   return err;
+
timeout = jiffies + msecs_to_jiffies(1000);
/* wait for end of transfer */
do {
@@ -334,6 +341,8 @@ static int macb_mdio_read(struct mii_bus *bus, int mii_id, 
int regnum)
 
if (time_after_eq(jiffies, timeout)) {
netdev_err(bp->dev, "wait for end of transfer timed out\n");
+   pm_runtime_mark_last_busy(&bp->pdev->dev);
+   pm_runtime_put_autosuspend(&bp->pdev->dev);
return -ETIMEDOUT;
}
 
@@ -354,11 +363,15 @@ static int macb_mdio_read(struct mii_bus *bus, int 
mii_id, int regnum)
 
if (time_after_eq(jiffies, timeout)) {
netdev_err(bp->dev, "wait for end of transfer timed out\n");
+   pm_runtime_mark_last_busy(&bp->pdev->dev);
+   pm_runtime_put_autosuspend(&bp->pdev->dev);
return -ETIMEDOUT;
}
 
value = MACB_BFEXT(DATA, macb_readl(bp, MAN));
 
+   pm_runtime_mark_last_busy(&bp->pdev->dev);
+   pm_runtime_put_autosuspend(&bp->pdev->dev);
return value;
 }
 
@@ -366,8 +379,13 @@ static int macb_mdio_write(struct mii_bus *bus, int 
mii_id, int regnum,
   u16 value)
 {
struct macb *bp = bus->priv;
+   int err;
ulong timeout;
 
+   err = pm_runtime_get_sync(&bp->pdev->dev);
+   if (err < 0)
+   return err;
+
timeout = jiffies + msecs_to_jiffies(1000);
/* wait for end of transfer */
do {
@@ -379,6 +397,8 @@ static int macb_mdio_write(struct mii_bus *bus, int mii_id, 
int regnum,
 
if (time_after_eq(jiffies, timeout)) {
netdev_err(bp->dev, "wait for end of transfer timed out\n");
+   pm_runtime_mark_last_busy(&bp->pdev->dev);
+   pm_runtime_put_autosuspend(&bp->pdev->dev);
return -ETIMEDOUT;
}
 
@@ -400,9 +420,13 @@ static int macb_mdio_write(struct mii_bus *bus, int 
mii_id, int regnum,
 
if (time_after_eq(jiffies, timeout)) {
netdev_err(bp->dev, "wait for end of transfer timed out\n");
+   pm_runtime_mark_last_busy(&bp->pdev->dev);
+   pm_runtime_put_autosuspend(&bp->pdev->dev);
return -ETIMEDOUT;
}
 
+   pm_runtime_mark_last_busy(&bp->pdev->dev);
+   pm_runtime_put_autosuspend(&bp->pdev->dev);
return 0;
 }
 
@@ -2338,6 +2362,10 @@ static int macb_open(struct net_device *dev)
 
netdev_dbg(bp->dev, "open\n");
 
+   err = pm_runtime_get_sync(&bp->pdev->dev);
+   if (err < 0)
+   return err;
+
/* carrier starts down */
netif_carrier_off(dev);
 
@@ -2397,6 +2425,8 @@ static int macb_close(struct net_device *dev)
if (bp->ptp_info)
bp->ptp_info->ptp_remove(dev);
 
+   pm_runtime_put(&bp->pdev->dev);
+
return 0;
 }
 
@@ -3949,6 +3979,11 @@ static int macb_probe(struct platform_device *pdev)
if (err)
return err;
 
+   pm_runtime_set_autosuspend_delay(&pdev->dev, MACB_PM_TIMEOUT);
+   pm_runtime_use_autosuspend(&pdev->dev);
+   pm_runtime_get_noresume(&pdev->dev);
+   pm_runtime_set_active(&pdev->dev);
+   pm_runtime_enable(&pdev->dev);
native_io = hw_is_native_io(mem);
 
macb_probe_queues(mem, native_io, &queue_mask, &num_queues);
@@ -4062,6 +4097,9 @@ static int macb_probe(struct platform_device *pdev)
macb_is_gem(bp) ? "GEM" : "MACB", macb_readl(bp, MID),
dev->base_addr, dev->irq, dev->dev_addr);
 
+   pm_runtime_mark_last_busy(&bp->pdev->dev);
+   pm_runtime_put_autosuspend(&bp->pdev->dev);
+
return 0;
 
 err_out_unregister_mdio:
@@ -4081,6 +4119,9 @@ static int macb_probe(struct platform_device *pd

[RFC PATCH 0/5] Macb power management support for ZynqMP

2018-03-22 Thread harinikatakamlinux
From: Harini Katakam 

This series adds support for macb suspend/resume with system power down
and wake on LAN with ARP packets.
In relation to the above, this series also updates the MDIO read/write
functions for PM and adds TSU clock management.

Harini Katakam (5):
  net: macb: Check MDIO state before read/write and use timeouts
  net: macb: Support clock management for tsu_clk
  net: macb: Add pm runtime support
  net: macb: Add support for suspend/resume with full power down
  net: macb: Add WOL support with ARP

 drivers/net/ethernet/cadence/macb.h  |   9 +-
 drivers/net/ethernet/cadence/macb_main.c | 349 ---
 2 files changed, 332 insertions(+), 26 deletions(-)

-- 
2.7.4



[RFC PATCH 1/5] net: macb: Check MDIO state before read/write and use timeouts

2018-03-22 Thread harinikatakamlinux
From: Harini Katakam 

Replace the while loop in MDIO read/write functions with a timeout.
In addition, add a check that the MDIO bus is idle before initiating a new
operation, to make sure there is no ongoing MDIO transfer.

Signed-off-by: Shubhrajyoti Datta 
Signed-off-by: Harini Katakam 
---
 drivers/net/ethernet/cadence/macb_main.c | 54 ++--
 1 file changed, 52 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index d09bd43..f4030c1 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -321,6 +321,21 @@ static int macb_mdio_read(struct mii_bus *bus, int mii_id, 
int regnum)
 {
struct macb *bp = bus->priv;
int value;
+   ulong timeout;
+
+   timeout = jiffies + msecs_to_jiffies(1000);
+   /* wait for end of transfer */
+   do {
+   if (MACB_BFEXT(IDLE, macb_readl(bp, NSR)))
+   break;
+
+   cpu_relax();
+   } while (!time_after_eq(jiffies, timeout));
+
+   if (time_after_eq(jiffies, timeout)) {
+   netdev_err(bp->dev, "wait for end of transfer timed out\n");
+   return -ETIMEDOUT;
+   }
 
macb_writel(bp, MAN, (MACB_BF(SOF, MACB_MAN_SOF)
  | MACB_BF(RW, MACB_MAN_READ)
@@ -328,9 +343,19 @@ static int macb_mdio_read(struct mii_bus *bus, int mii_id, 
int regnum)
  | MACB_BF(REGA, regnum)
  | MACB_BF(CODE, MACB_MAN_CODE)));
 
+   timeout = jiffies + msecs_to_jiffies(1000);
/* wait for end of transfer */
-   while (!MACB_BFEXT(IDLE, macb_readl(bp, NSR)))
+   do {
+   if (MACB_BFEXT(IDLE, macb_readl(bp, NSR)))
+   break;
+
cpu_relax();
+   } while (!time_after_eq(jiffies, timeout));
+
+   if (time_after_eq(jiffies, timeout)) {
+   netdev_err(bp->dev, "wait for end of transfer timed out\n");
+   return -ETIMEDOUT;
+   }
 
value = MACB_BFEXT(DATA, macb_readl(bp, MAN));
 
@@ -341,6 +366,21 @@ static int macb_mdio_write(struct mii_bus *bus, int 
mii_id, int regnum,
   u16 value)
 {
struct macb *bp = bus->priv;
+   ulong timeout;
+
+   timeout = jiffies + msecs_to_jiffies(1000);
+   /* wait for end of transfer */
+   do {
+   if (MACB_BFEXT(IDLE, macb_readl(bp, NSR)))
+   break;
+
+   cpu_relax();
+   } while (!time_after_eq(jiffies, timeout));
+
+   if (time_after_eq(jiffies, timeout)) {
+   netdev_err(bp->dev, "wait for end of transfer timed out\n");
+   return -ETIMEDOUT;
+   }
 
macb_writel(bp, MAN, (MACB_BF(SOF, MACB_MAN_SOF)
  | MACB_BF(RW, MACB_MAN_WRITE)
@@ -349,9 +389,19 @@ static int macb_mdio_write(struct mii_bus *bus, int 
mii_id, int regnum,
  | MACB_BF(CODE, MACB_MAN_CODE)
  | MACB_BF(DATA, value)));
 
+   timeout = jiffies + msecs_to_jiffies(1000);
/* wait for end of transfer */
-   while (!MACB_BFEXT(IDLE, macb_readl(bp, NSR)))
+   do {
+   if (MACB_BFEXT(IDLE, macb_readl(bp, NSR)))
+   break;
+
cpu_relax();
+   } while (!time_after_eq(jiffies, timeout));
+
+   if (time_after_eq(jiffies, timeout)) {
+   netdev_err(bp->dev, "wait for end of transfer timed out\n");
+   return -ETIMEDOUT;
+   }
 
return 0;
 }
-- 
2.7.4
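As a possible follow-up, the four identical wait loops above could be
collapsed into one helper; a rough sketch (not part of the posted patch):

	static int macb_mdio_wait_for_idle(struct macb *bp)
	{
		ulong timeout = jiffies + msecs_to_jiffies(1000);

		do {
			if (MACB_BFEXT(IDLE, macb_readl(bp, NSR)))
				return 0;
			cpu_relax();
		} while (!time_after_eq(jiffies, timeout));

		netdev_err(bp->dev, "wait for end of transfer timed out\n");
		return -ETIMEDOUT;
	}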



[iproute PATCH] man: ip-route.8: ssthresh parameter is NUMBER

2018-03-22 Thread Phil Sutter
The synopsis section was inconsistent with the help text and the later
description of the ssthresh parameter.

Signed-off-by: Phil Sutter 
---
 man/man8/ip-route.8.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
index 487a87489a46a..b28f3d2c7d117 100644
--- a/man/man8/ip-route.8.in
+++ b/man/man8/ip-route.8.in
@@ -125,7 +125,7 @@ replace " } "
 .B  cwnd
 .IR NUMBER " ] [ "
 .B  ssthresh
-.IR REALM " ] [ "
+.IR NUMBER " ] [ "
 .B  realms
 .IR REALM " ] [ "
 .B  rto_min
-- 
2.16.1



Re: [PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS

2018-03-22 Thread Michael S. Tsirkin
On Thu, Mar 22, 2018 at 12:02:10PM +, Jay Vosburgh wrote:
> Michael S. Tsirkin  wrote:
> 
> >On Thu, Mar 22, 2018 at 09:05:52AM +, Jay Vosburgh wrote:
> >>The operstate update logic will leave an interface in the
> >> default UNKNOWN operstate if the interface carrier state never changes
> >> from the default carrier up state set at creation.  This includes the
> >> case of an explicit call to netif_carrier_on, as the carrier on to on
> >> transition has no effect on operstate.
> >> 
> >>This affects virtio-net for the case that the virtio peer does
> >> not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
> >> updates).  Without this feature, the virtio specification states that
> >> "the link should be assumed active," so, logically, the operstate should
> >> be UP instead of UNKNOWN.  This has impact on user space applications
> >> that use the operstate to make availability decisions for the interface.
> >> 
> >>Resolve this by changing the virtio probe logic slightly to call
> >> netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
> >> cases, and then the existing call to netif_carrier_on for the "without"
> >> case will cause an operstate transition.
> >> 
> >> Cc: "Michael S. Tsirkin" 
> >> Cc: Jason Wang 
> >> Cc: Ben Hutchings 
> >> Fixes: 167c25e4c550 ("virtio-net: init link state correctly")
> >
> >I'd say that's an abuse of this notation. operstate was UNKNOWN
> >even before that fix.
> 
>   I went back to the commit that added the dependency on
> VIRTIO_NET_F_STATUS (and that this patch would thus apply on top of).
> If that's an issue, I can resubmit without it.
> 
>   -J

The patch can be trivially backported to any version that has virtio.

The issue was present since the original version of virtio.
VIRTIO_NET_F_STATUS fixed it for new devices.
So the tag is incorrectly blaming a partial fix for not being a full
one.

Also, I think it's more appropriate for net-next - it's a
minor ABI change (previously presence of VIRTIO_NET_F_STATUS
could be detected by looking at operstate, now it can't).
Hopefully this makes more apps work than it breaks.

So yes, pls repost without Fixes and with net-next unless
davem can make the change himself.

> >> Signed-off-by: Jay Vosburgh 
> >
> >Acked-by: Michael S. Tsirkin 
> >
> >
> >> ---
> >> 
> >>I considered resolving this by changing linkwatch_init_dev to
> >> unconditionally call rfc2863_policy, as that would always set operstate
> >> for all interfaces.
> >> 
> >>This would not have any impact on most cases (as most drivers
> >> call netif_carrier_off during probe), except for the loopback device,
> >> which currently has an operstate of UNKNOWN (because it never does any
> >> carrier state transitions).  This change would add a round trip on the
> >> dev_base_lock for every loopback device creation, which could have a
> >> negative impact when creating many loopback devices, e.g., when
> >> concurrently creating large numbers of containers.
> >> 
> >> 
> >>  drivers/net/virtio_net.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >> 
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 23374603e4d9..7b187ec7411e 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>  
> >>/* Assume link up if device can't report link status,
> >>   otherwise get link status from config. */
> >> +  netif_carrier_off(dev);
> >>if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
> >> -  netif_carrier_off(dev);
> >>schedule_work(&vi->config_work);
> >>} else {
> >>vi->status = VIRTIO_NET_S_LINK_UP;
> >> -- 
> >> 2.14.1
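
For reference, a short sketch of the resulting probe-time sequence and why
it moves operstate (not a verbatim kernel excerpt; the transition handling
lives in the linkwatch/rfc2863_policy code):

        netif_carrier_off(dev);  /* known starting state: carrier off */
        /* ... for the "no VIRTIO_NET_F_STATUS" case ... */
        netif_carrier_on(dev);   /* genuine off->on edge: fires a linkwatch
                                  * event, which runs rfc2863_policy() and
                                  * moves operstate from UNKNOWN to UP */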


Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]

2018-03-22 Thread Marco Berizzi
On 19 March 2018 at 11:07 Jamal Hadi Salim  wrote:
> 
> On 18-03-15 08:48 PM, Cong Wang wrote:
> 
> > On Wed, Mar 14, 2018 at 1:10 AM, Marco Berizzi  wrote:
> > 
> > > > On 9 March 2018 at 0:14 Cong Wang  wrote:
> > > > 
> > > > On Thu, Mar 8, 2018 at 8:02 AM, Marco Berizzi  wrote:
> > > > 
> > > > > > Marco Berizzi wrote:
> > > > > > 
> > > > > > Hello everyone,
> > > > > > 
> > > > > > Yesterday I got this error on a slackware linux 4.16-rc4 system
> > > > > > running as a traffic shaping gateway and netfilter nat.
> > > > > > The error has been arisen after a partial ISP network outage,
> > > > > > so unfortunately it will not trivial for me to reproduce it again.
> > > > > 
> > > > > Hello everyone,
> > > > > 
> > > > > I'm getting this error twice/day, so fortunately I'm able to
> > > > > reproduce it.
> > > > 
> > > > IIRC, there was a patch for this, but it got lost...
> > > > 
> > > > I will take a look anyway.
> > > 
> > > ok, thanks for the response. Let me know when there will be a patch
> > > available to test.
> > 
> > It has been reported here:
> > https://bugzilla.kernel.org/show_bug.cgi?id=109581
> > 
> > And there is a workaround from Konstantin:
> > https://patchwork.ozlabs.org/patch/803885/
> > 
> > Unfortunately I don't think that is a real fix; we probably need to
> > fix HFSC itself rather than just work around the qlen==0 case. It is
> > not trivial, since the HFSC implementation is not easy to understand.
> > Maybe Jamal knows better than me.
> 
> Sorry for the latency - I looked at this on the plane and it is very
> specific to fq/codel. It is not clear to me why codel needs this but
> I note it has been there from the initial commit and from that
> perspective the patch looks reasonable. In any case:
> Punting it to Eric (on Cc).

Hello everyone,

Just asking: is there any chance of getting this patch in before 4.16 final?


[bpf-next V4 PATCH 00/15] XDP redirect memory return API

2018-03-22 Thread Jesper Dangaard Brouer
This patchset works towards supporting different XDP RX-ring memory
allocators, as this will be needed by the AF_XDP zero-copy mode.

The patchset uses mlx5 as the sample driver, which gets XDP_REDIRECT
RX-mode implemented, but not ndo_xdp_xmit (as this API is subject to
change throughout the patchset).

A new struct xdp_frame is introduced (modeled after cpumap xdp_pkt).
And both ndo_xdp_xmit and the new xdp_return_frame end-up using this.

Support for a driver-supplied allocator is implemented, and a
refurbished version of page_pool is the first return allocator type
introduced.  This will be an integration point for AF_XDP zero-copy.

The mlx5 driver evolves into using the page_pool, and sees a performance
increase (with ndo_xdp_xmit out the ixgbe driver) from 6 Mpps to 12 Mpps.


The patchset stops at the 15-patch limit, but more API changes are
planned, specifically extending the ndo_xdp_xmit and xdp_return_frame
APIs to support bulking, as this will address some known limits.

V2: Updated according to Tariq's feedback
V3: Feedback from Jason Wang and Alex Duyck
V4: Feedback from Tariq and Jason

---

Jesper Dangaard Brouer (15):
  mlx5: basic XDP_REDIRECT forward support
  xdp: introduce xdp_return_frame API and use in cpumap
  ixgbe: use xdp_return_frame API
  xdp: move struct xdp_buff from filter.h to xdp.h
  xdp: introduce a new xdp_frame type
  tun: convert to use generic xdp_frame and xdp_return_frame API
  virtio_net: convert to use generic xdp_frame and xdp_return_frame API
  bpf: cpumap convert to use generic xdp_frame
  mlx5: register a memory model when XDP is enabled
  xdp: rhashtable with allocator ID to pointer mapping
  page_pool: refurbish version of page_pool code
  xdp: allow page_pool as an allocator type in xdp_return_frame
  mlx5: use page_pool for xdp_return_frame call
  xdp: transition into using xdp_frame for return API
  xdp: transition into using xdp_frame for ndo_xdp_xmit


 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |4 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   42 ++-
 drivers/net/tun.c |   60 ++--
 drivers/net/virtio_net.c  |   52 ++-
 drivers/vhost/net.c   |7 
 include/linux/filter.h|   24 --
 include/linux/if_tun.h|4 
 include/linux/netdevice.h |4 
 include/net/page_pool.h   |  133 
 include/net/xdp.h |   83 +
 kernel/bpf/cpumap.c   |  132 +++-
 net/core/Makefile |1 
 net/core/filter.c |   17 +
 net/core/page_pool.c  |  329 +
 net/core/xdp.c|  257 
 18 files changed, 1050 insertions(+), 176 deletions(-)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

--


[bpf-next V4 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support

2018-03-22 Thread Jesper Dangaard Brouer
This implements basic XDP redirect support in mlx5 driver.

Notice that the ndo_xdp_xmit() is NOT implemented, because that API
need some changes that this patchset is working towards.

The main purpose of this patch is to have different drivers doing
XDP_REDIRECT, to show how different memory models behave in a
cross-driver world.

Update(pre-RFCv2 Tariq): Need to DMA unmap the page before xdp_do_redirect,
as the return API that would keep this mapped does not exist yet.

Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing,
introduce the xdpsq.db.redirect_flush boolean.

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h|1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |   27 ---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 4c9360b25532..28cc26debeda 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -398,6 +398,7 @@ struct mlx5e_xdpsq {
struct {
struct mlx5e_dma_info *di;
bool   doorbell;
+   bool   redirect_flush;
} db;
 
/* read only */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8cce90dc461d..6dcc3e8fbd3e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -236,14 +236,20 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq 
*rq,
return 0;
 }
 
+static inline void mlx5e_page_dma_unmap(struct mlx5e_rq *rq,
+   struct mlx5e_dma_info *dma_info)
+{
+   dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
+  rq->buff.map_dir);
+}
+
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
bool recycle)
 {
if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
return;
 
-   dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
-  rq->buff.map_dir);
+   mlx5e_page_dma_unmap(rq, dma_info);
put_page(dma_info->page);
 }
 
@@ -822,9 +828,10 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
   struct mlx5e_dma_info *di,
   void *va, u16 *rx_headroom, u32 *len)
 {
-   const struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+   struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
struct xdp_buff xdp;
u32 act;
+   int err;
 
if (!prog)
return false;
@@ -845,6 +852,15 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
if (unlikely(!mlx5e_xmit_xdp_frame(rq, di, &xdp)))
trace_xdp_exception(rq->netdev, prog, act);
return true;
+   case XDP_REDIRECT:
+   /* When XDP enabled then page-refcnt==1 here */
+   err = xdp_do_redirect(rq->netdev, &xdp, prog);
+   if (!err) {
+   rq->wqe.xdp_xmit = true; /* XDP xmit owns page */
+   rq->xdpsq.db.redirect_flush = true;
+   mlx5e_page_dma_unmap(rq, di);
+   }
+   return true;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -1107,6 +1123,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
xdpsq->db.doorbell = false;
}
 
+   if (xdpsq->db.redirect_flush) {
+   xdp_do_flush_map();
+   xdpsq->db.redirect_flush = false;
+   }
+
mlx5_cqwq_update_db_record(&cq->wq);
 
/* ensure cq space is freed before enabling more cqes */
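
For reference, the generic per-NAPI-poll contract this patch instantiates,
as a sketch with placeholder names (budget, need_flush, xdp_prog, netdev):
xdp_do_redirect() only queues the packet towards its target, so one
xdp_do_flush_map() per poll, rather than per packet, is what actually
kicks the queued redirects:

        while (budget--) {
                act = bpf_prog_run_xdp(xdp_prog, &xdp);
                if (act == XDP_REDIRECT &&
                    xdp_do_redirect(netdev, &xdp, xdp_prog) == 0)
                        need_flush = true;  /* queued, not yet transmitted */
        }
        if (need_flush)
                xdp_do_flush_map();         /* flush once per NAPI poll */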



[bpf-next V4 PATCH 03/15] ixgbe: use xdp_return_frame API

2018-03-22 Thread Jesper Dangaard Brouer
Extend struct ixgbe_tx_buffer to store the xdp_mem_info.

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |6 --
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index c1e3a0039ea5..cbc20f199364 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -249,6 +249,7 @@ struct ixgbe_tx_buffer {
DEFINE_DMA_UNMAP_ADDR(dma);
DEFINE_DMA_UNMAP_LEN(len);
u32 tx_flags;
+   struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 85369423452d..45520eb503ee 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector 
*q_vector,
 
/* free the skb */
if (ring_is_xdp(tx_ring))
-   page_frag_free(tx_buffer->data);
+   xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
else
napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -5787,7 +5787,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring 
*tx_ring)
 
/* Free all the Tx ring sk_buffs */
if (ring_is_xdp(tx_ring))
-   page_frag_free(tx_buffer->data);
+   xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
else
dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8351,6 +8351,8 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter 
*adapter,
dma_unmap_len_set(tx_buffer, len, len);
dma_unmap_addr_set(tx_buffer, dma, dma);
tx_buffer->data = xdp->data;
+   tx_buffer->xdp_mem = xdp->rxq->mem;
+
tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
/* put descriptor type bits */



[bpf-next V4 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap

2018-03-22 Thread Jesper Dangaard Brouer
Introduce an xdp_return_frame API, and convert over cpumap as
the first user, given it has a queued XDP frame structure to leverage.

V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.

Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/xdp.h   |   28 
 kernel/bpf/cpumap.c |   60 +++
 net/core/xdp.c  |   18 +++
 3 files changed, 82 insertions(+), 24 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index b2362ddfa694..15b546325e31 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -33,16 +33,44 @@
  * also mandatory during RX-ring setup.
  */
 
+enum mem_type {
+   MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
+   MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */
+   MEM_TYPE_MAX,
+};
+
+struct xdp_mem_info {
+   u32 type; /* enum mem_type, but known size type */
+   /* u32 id; will be added later in this patchset */
+};
+
 struct xdp_rxq_info {
struct net_device *dev;
u32 queue_index;
u32 reg_state;
+   struct xdp_mem_info mem;
 } cacheline_aligned; /* perf critical, avoid false-sharing */
 
+
+static inline
+void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+{
+   if (mem->type == MEM_TYPE_PAGE_SHARED)
+   page_frag_free(data);
+
+   if (mem->type == MEM_TYPE_PAGE_ORDER0) {
+   struct page *page = virt_to_page(data); /* Assumes order0 page*/
+
+   put_page(page);
+   }
+}
+
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 struct net_device *dev, u32 queue_index);
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
 void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
 bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
+int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
+  enum mem_type type, void *allocator);
 
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a4bb0b34375a..3e4bbcbe3e86 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -137,27 +138,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
return ERR_PTR(err);
 }
 
-static void __cpu_map_queue_destructor(void *ptr)
-{
-   /* The tear-down procedure should have made sure that queue is
-* empty.  See __cpu_map_entry_replace() and work-queue
-* invoked cpu_map_kthread_stop(). Catch any broken behaviour
-* gracefully and warn once.
-*/
-   if (WARN_ON_ONCE(ptr))
-   page_frag_free(ptr);
-}
-
-static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
-{
-   if (atomic_dec_and_test(&rcpu->refcnt)) {
-   /* The queue should be empty at this point */
-   ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
-   kfree(rcpu->queue);
-   kfree(rcpu);
-   }
-}
-
 static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
 {
atomic_inc(&rcpu->refcnt);
@@ -188,6 +168,10 @@ struct xdp_pkt {
u16 len;
u16 headroom;
u16 metasize;
+   /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+* while mem info is valid on remote CPU.
+*/
+   struct xdp_mem_info mem;
struct net_device *dev_rx;
 };
 
@@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff 
*xdp)
xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
xdp_pkt->metasize = metasize;
 
+   /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
+   xdp_pkt->mem = xdp->rxq->mem;
+
return xdp_pkt;
 }
 
@@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct 
bpf_cpu_map_entry *rcpu,
return skb;
 }
 
+static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
+{
+   /* The tear-down procedure should have made sure that queue is
+* empty.  See __cpu_map_entry_replace() and work-queue
+* invoked cpu_map_kthread_stop(). Catch any broken behaviour
+* gracefully and warn once.
+*/
+   struct xdp_pkt *xdp_pkt;
+
+   while ((xdp_pkt = ptr_ring_consume(ring)))
+   if (WARN_ON_ONCE(xdp_pkt))
+   xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+}
+
+static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
+{
+   if (atomic_dec_and_test(&rcpu->refcnt)) {
+   /* The queue should be empty at this point */
+   __cpu_map_ring_cleanup(rcpu->queue);
+   ptr_ring_cleanup(rcpu->queue, NULL);
+   kfree(rcpu->queue);
+   kfree(rcpu);
+   }
+}
+
 static int cpu_map_kthread_run(void *data)
 {
struct bpf_cpu_map_entry *rcpu = data;
@@ -307,7 +319,7 @@ static int cpu_map_kthread_run(void *data)
 
skb =
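
Usage-wise the API is symmetric with how a frame is queued: the producer
saves the xdp_mem_info while the rxq pointer is still valid, and hands it
back on release.  A minimal sketch (the tx_buf bookkeeping struct is an
assumption; compare the ixgbe patch in this series):

        /* at enqueue time: rxq is only valid inside the NAPI poll */
        tx_buf->data = xdp->data;
        tx_buf->mem  = xdp->rxq->mem;

        /* at release time, possibly on another CPU */
        xdp_return_frame(tx_buf->data, &tx_buf->mem);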

[bpf-next V4 PATCH 04/15] xdp: move struct xdp_buff from filter.h to xdp.h

2018-03-22 Thread Jesper Dangaard Brouer
This is done to prepare for the next patch, and it is also
nice to move this XDP related struct out of filter.h.

Signed-off-by: Jesper Dangaard Brouer 
---
 include/linux/filter.h |   24 +---
 include/net/xdp.h  |   22 ++
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 109d05ccea9a..340ba653dd80 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -30,6 +30,7 @@ struct sock;
 struct seccomp_data;
 struct bpf_prog_aux;
 struct xdp_rxq_info;
+struct xdp_buff;
 
 /* ArgX, context and stack frame pointer register positions. Note,
  * Arg1, Arg2, Arg3, etc are used as argument mappings of function
@@ -499,14 +500,6 @@ struct bpf_skb_data_end {
void *data_end;
 };
 
-struct xdp_buff {
-   void *data;
-   void *data_end;
-   void *data_meta;
-   void *data_hard_start;
-   struct xdp_rxq_info *rxq;
-};
-
 struct sk_msg_buff {
void *data;
void *data_end;
@@ -769,21 +762,6 @@ int xdp_do_redirect(struct net_device *dev,
struct bpf_prog *prog);
 void xdp_do_flush_map(void);
 
-/* Drivers not supporting XDP metadata can use this helper, which
- * rejects any room expansion for metadata as a result.
- */
-static __always_inline void
-xdp_set_data_meta_invalid(struct xdp_buff *xdp)
-{
-   xdp->data_meta = xdp->data + 1;
-}
-
-static __always_inline bool
-xdp_data_meta_unsupported(const struct xdp_buff *xdp)
-{
-   return unlikely(xdp->data_meta > xdp->data);
-}
-
 void bpf_warn_invalid_xdp_action(u32 act);
 
 struct sock *do_sk_redirect_map(struct sk_buff *skb);
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 15b546325e31..1ee154fe0be6 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -51,6 +51,13 @@ struct xdp_rxq_info {
struct xdp_mem_info mem;
 } cacheline_aligned; /* perf critical, avoid false-sharing */
 
+struct xdp_buff {
+   void *data;
+   void *data_end;
+   void *data_meta;
+   void *data_hard_start;
+   struct xdp_rxq_info *rxq;
+};
 
 static inline
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
@@ -73,4 +80,19 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
   enum mem_type type, void *allocator);
 
+/* Drivers not supporting XDP metadata can use this helper, which
+ * rejects any room expansion for metadata as a result.
+ */
+static __always_inline void
+xdp_set_data_meta_invalid(struct xdp_buff *xdp)
+{
+   xdp->data_meta = xdp->data + 1;
+}
+
+static __always_inline bool
+xdp_data_meta_unsupported(const struct xdp_buff *xdp)
+{
+   return unlikely(xdp->data_meta > xdp->data);
+}
+
 #endif /* __LINUX_NET_XDP_H__ */



[bpf-next V4 PATCH 05/15] xdp: introduce a new xdp_frame type

2018-03-22 Thread Jesper Dangaard Brouer
This is needed to convert the tuntap and virtio_net drivers.

This is a generalization of what is done inside cpumap, which will be
converted later.

Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/xdp.h |   40 
 1 file changed, 40 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 1ee154fe0be6..13f71a15c79f 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -59,6 +59,46 @@ struct xdp_buff {
struct xdp_rxq_info *rxq;
 };
 
+struct xdp_frame {
+   void *data;
+   u16 len;
+   u16 headroom;
+   u16 metasize;
+   /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+* while mem info is valid on remote CPU.
+*/
+   struct xdp_mem_info mem;
+};
+
+/* Convert xdp_buff to xdp_frame */
+static inline
+struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
+{
+   struct xdp_frame *xdp_frame;
+   int metasize;
+   int headroom;
+
+   /* Assure headroom is available for storing info */
+   headroom = xdp->data - xdp->data_hard_start;
+   metasize = xdp->data - xdp->data_meta;
+   metasize = metasize > 0 ? metasize : 0;
+   if (unlikely((headroom - metasize) < sizeof(*xdp_frame)))
+   return NULL;
+
+   /* Store info in top of packet */
+   xdp_frame = xdp->data_hard_start;
+
+   xdp_frame->data = xdp->data;
+   xdp_frame->len  = xdp->data_end - xdp->data;
+   xdp_frame->headroom = headroom - sizeof(*xdp_frame);
+   xdp_frame->metasize = metasize;
+
+   /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
+   xdp_frame->mem = xdp->rxq->mem;
+
+   return xdp_frame;
+}
+
 static inline
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {
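
A usage sketch of the conversion in an ndo_xdp_xmit-style path, mirroring
what the tun and virtio_net patches later in this series do; -EOVERFLOW
signals that the headroom was too small to hold the xdp_frame:

        struct xdp_frame *xdpf;

        xdpf = convert_to_xdp_frame(xdp);
        if (unlikely(!xdpf))
                return -EOVERFLOW;  /* no headroom for the frame info */

        /* queue xdpf; on release, call
         * xdp_return_frame(xdpf->data, &xdpf->mem);
         */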



[bpf-next V4 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API

2018-03-22 Thread Jesper Dangaard Brouer
The tuntap driver invented its own driver-specific way of queuing
XDP packets, by storing the xdp_buff information in the top of
the XDP frame data.

Convert it over to use the more generic xdp_frame structure.  The
main problem with the in-driver method is that the xdp_rxq_info pointer
cannot be trusted/used when dequeueing the frame.

V3: Remove check based on feedback from Jason

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/tun.c  |   43 ---
 drivers/vhost/net.c|7 ---
 include/linux/if_tun.h |4 ++--
 3 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index baeafa004463..6750980d9f30 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -248,11 +248,11 @@ struct veth {
__be16 h_vlan_TCI;
 };
 
-bool tun_is_xdp_buff(void *ptr)
+bool tun_is_xdp_frame(void *ptr)
 {
return (unsigned long)ptr & TUN_XDP_FLAG;
 }
-EXPORT_SYMBOL(tun_is_xdp_buff);
+EXPORT_SYMBOL(tun_is_xdp_frame);
 
 void *tun_xdp_to_ptr(void *ptr)
 {
@@ -660,10 +660,10 @@ static void tun_ptr_free(void *ptr)
 {
if (!ptr)
return;
-   if (tun_is_xdp_buff(ptr)) {
-   struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+   if (tun_is_xdp_frame(ptr)) {
+   struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-   put_page(virt_to_head_page(xdp->data));
+   xdp_return_frame(xdpf->data, &xdpf->mem);
} else {
__skb_array_destroy_skb(ptr);
}
@@ -1290,17 +1290,14 @@ static const struct net_device_ops tun_netdev_ops = {
 static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 {
struct tun_struct *tun = netdev_priv(dev);
-   struct xdp_buff *buff = xdp->data_hard_start;
-   int headroom = xdp->data - xdp->data_hard_start;
+   struct xdp_frame *frame;
struct tun_file *tfile;
u32 numqueues;
int ret = 0;
 
-   /* Assure headroom is available and buff is properly aligned */
-   if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
-   return -ENOSPC;
-
-   *buff = *xdp;
+   frame = convert_to_xdp_frame(xdp);
+   if (unlikely(!frame))
+   return -EOVERFLOW;
 
rcu_read_lock();
 
@@ -1315,7 +1312,7 @@ static int tun_xdp_xmit(struct net_device *dev, struct 
xdp_buff *xdp)
/* Encode the XDP flag into lowest bit for consumer to differ
 * XDP buffer from sk_buff.
 */
-   if (ptr_ring_produce(&tfile->tx_ring, tun_xdp_to_ptr(buff))) {
+   if (ptr_ring_produce(&tfile->tx_ring, tun_xdp_to_ptr(frame))) {
this_cpu_inc(tun->pcpu_stats->tx_dropped);
ret = -ENOSPC;
}
@@ -1993,11 +1990,11 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
 
 static ssize_t tun_put_user_xdp(struct tun_struct *tun,
struct tun_file *tfile,
-   struct xdp_buff *xdp,
+   struct xdp_frame *xdp_frame,
struct iov_iter *iter)
 {
int vnet_hdr_sz = 0;
-   size_t size = xdp->data_end - xdp->data;
+   size_t size = xdp_frame->len;
struct tun_pcpu_stats *stats;
size_t ret;
 
@@ -2013,7 +2010,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso));
}
 
-   ret = copy_to_iter(xdp->data, size, iter) + vnet_hdr_sz;
+   ret = copy_to_iter(xdp_frame->data, size, iter) + vnet_hdr_sz;
 
stats = get_cpu_ptr(tun->pcpu_stats);
u64_stats_update_begin(&stats->syncp);
@@ -2181,11 +2178,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, 
struct tun_file *tfile,
return err;
}
 
-   if (tun_is_xdp_buff(ptr)) {
-   struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+   if (tun_is_xdp_frame(ptr)) {
+   struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-   ret = tun_put_user_xdp(tun, tfile, xdp, to);
-   put_page(virt_to_head_page(xdp->data));
+   ret = tun_put_user_xdp(tun, tfile, xdpf, to);
+   xdp_return_frame(xdpf->data, &xdpf->mem);
} else {
struct sk_buff *skb = ptr;
 
@@ -2424,10 +2421,10 @@ static int tun_recvmsg(struct socket *sock, struct 
msghdr *m, size_t total_len,
 static int tun_ptr_peek_len(void *ptr)
 {
if (likely(ptr)) {
-   if (tun_is_xdp_buff(ptr)) {
-   struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+   if (tun_is_xdp_frame(ptr)) {
+   struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-   return xdp->data_end - xdp->data;
+   return xdpf->len;
}
return __skb_array_len_with_tag(ptr);
} else {
diff --git a/drivers/vhost/net.c b/dri

[bpf-next V4 PATCH 09/15] mlx5: register a memory model when XDP is enabled

2018-03-22 Thread Jesper Dangaard Brouer
Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
This enables a different memory model, thus activating another code path
in the xdp_return_frame API.

V2: Fixed issues pointed out by Tariq.

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index da94c8cba5ee..2e4ca0f15b62 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -506,6 +506,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
rq->mkey_be = c->mkey_be;
}
 
+   /* This must only be activate for order-0 pages */
+   if (rq->xdp_prog) {
+   err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+MEM_TYPE_PAGE_ORDER0, NULL);
+   if (err)
+   goto err_rq_wq_destroy;
+   }
+
for (i = 0; i < wq_sz; i++) {
struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
 



[bpf-next V4 PATCH 14/15] xdp: transition into using xdp_frame for return API

2018-03-22 Thread Jesper Dangaard Brouer
Changing the xdp_return_frame() API to take a struct xdp_frame as its
argument seems like a natural choice. But there are some subtle
performance details here that need extra care, and the handling chosen
below is deliberate.

De-referencing the xdp_frame on a remote CPU during DMA-TX
completion results in the cache line changing to "Shared"
state. Later, when the page is reused for RX, this xdp_frame
cache line is written, which changes the state to "Modified".

This situation already happens (naturally) for virtio_net, tun and
cpumap, as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache lines
(with pointers) between CPUs. Thus, the only option is to
de-reference the xdp_frame.

It is only the ixgbe driver that had an optimization by which it could
avoid de-referencing the xdp_frame.  The driver already has a
TX-ring queue, which (in case of remote DMA-TX completion) has to be
transferred between CPUs anyhow.  In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing the xdp_frame.

To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate for the lost optimization in the
ixgbe driver.

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h|4 +---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   17 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |1 +
 drivers/net/tun.c   |4 ++--
 drivers/net/virtio_net.c|2 +-
 include/net/xdp.h   |2 +-
 kernel/bpf/cpumap.c |6 +++---
 net/core/xdp.c  |4 +++-
 8 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index cbc20f199364..dfbc15a45cb4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -240,8 +240,7 @@ struct ixgbe_tx_buffer {
unsigned long time_stamp;
union {
struct sk_buff *skb;
-   /* XDP uses address ptr on irq_clean */
-   void *data;
+   struct xdp_frame *xdpf;
};
unsigned int bytecount;
unsigned short gso_segs;
@@ -249,7 +248,6 @@ struct ixgbe_tx_buffer {
DEFINE_DMA_UNMAP_ADDR(dma);
DEFINE_DMA_UNMAP_LEN(len);
u32 tx_flags;
-   struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ff069597fccf..e6e9b28ecfba 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector 
*q_vector,
 
/* free the skb */
if (ring_is_xdp(tx_ring))
-   xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+   xdp_return_frame(tx_buffer->xdpf);
else
napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -2376,6 +2376,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
xdp.data_hard_start = xdp.data -
  ixgbe_rx_offset(rx_ring);
xdp.data_end = xdp.data + size;
+   prefetchw(xdp.data_hard_start); /* xdp_frame write */
 
skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
}
@@ -5787,7 +5788,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring 
*tx_ring)
 
/* Free all the Tx ring sk_buffs */
if (ring_is_xdp(tx_ring))
-   xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+   xdp_return_frame(tx_buffer->xdpf);
else
dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8333,16 +8334,21 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter 
*adapter,
struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
struct ixgbe_tx_buffer *tx_buffer;
union ixgbe_adv_tx_desc *tx_desc;
+   struct xdp_frame *xdpf;
u32 len, cmd_type;
dma_addr_t dma;
u16 i;
 
-   len = xdp->data_end - xdp->data;
+   xdpf = convert_to_xdp_frame(xdp);
+   if (unlikely(!xdpf))
+   return -EOVERFLOW;
+
+   len = xdpf->len;
 
if (unlikely(!ixgbe_desc_unused(ring)))
return IXGBE_XDP_CONSUMED;
 
-   dma = dma_map_single(ring->dev, xdp->data, len, DMA_TO_DEVICE);
+   dma = dma_map_single(ring->dev, xdpf->data, len, DMA_TO_DEVICE);
if (dma_mapping_error(ring->dev, dma))
return IXGBE_XDP_CONSUMED;

[bpf-next V4 PATCH 07/15] virtio_net: convert to use generic xdp_frame and xdp_return_frame API

2018-03-22 Thread Jesper Dangaard Brouer
The virtio_net driver assumes XDP frames are always released based on
page refcnt (via put_page).  Thus, it only queues the XDP data pointer
address and uses virt_to_head_page() to retrieve the struct page.

Use the XDP return API to get away from such assumptions. Instead,
queue an xdp_frame, which allows us to use the xdp_return_frame API
when releasing the frame.

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/virtio_net.c |   31 +--
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 23374603e4d9..6c4220450506 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -419,30 +419,41 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
   struct xdp_buff *xdp)
 {
struct virtio_net_hdr_mrg_rxbuf *hdr;
-   unsigned int len;
+   struct xdp_frame *xdpf, *xdpf_sent;
struct send_queue *sq;
+   unsigned int len;
unsigned int qp;
-   void *xdp_sent;
int err;
 
qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();
sq = &vi->sq[qp];
 
/* Free up any pending old buffers before queueing new ones. */
-   while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   struct page *sent_page = virt_to_head_page(xdp_sent);
+   while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
+   xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
 
-   put_page(sent_page);
-   }
+   xdpf = convert_to_xdp_frame(xdp);
+   if (unlikely(!xdpf))
+   return -EOVERFLOW;
+
+   /* virtqueue want to use data area in-front of packet */
+   if (unlikely(xdpf->metasize > 0))
+   return -EOPNOTSUPP;
+
+   if (unlikely(xdpf->headroom < vi->hdr_len))
+   return -EOVERFLOW;
 
-   xdp->data -= vi->hdr_len;
+   /* Make room for virtqueue hdr (also change xdpf->headroom?) */
+   xdpf->data -= vi->hdr_len;
/* Zero header and leave csum up to XDP layers */
-   hdr = xdp->data;
+   hdr = xdpf->data;
memset(hdr, 0, vi->hdr_len);
+   hdr->hdr.hdr_len = xdpf->len; /* Q: is this needed? */
+   xdpf->len   += vi->hdr_len;
 
-   sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+   sg_init_one(sq->sg, xdpf->data, xdpf->len);
 
-   err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp->data, GFP_ATOMIC);
+   err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
if (unlikely(err))
return false; /* Caller handle free/refcnt */
 



[bpf-next V4 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call

2018-03-22 Thread Jesper Dangaard Brouer
This patch shows how it is possible to have both the driver-local page
cache, which uses an elevated refcnt for "catching"/avoiding SKB
put_page, and, at the same time, have pages returned to the
page_pool from ndo_xdp_xmit DMA completion.

Performance is surprisingly good. Tested DMA-TX completion on ixgbe,
which calls "xdp_return_frame", which in turn calls page_pool_put_page().
Stats show DMA-TX completion runs on CPU#9 and mlx5 RX runs on CPU#5.
(Internally page_pool uses ptr_ring, which is what gives the good
cross-CPU performance).

Show adapter(s) (ixgbe2 mlx5p2) statistics (ONLY that changed!)
Ethtool(ixgbe2  ) stat:732863573 (732,863,573) <= tx_bytes /sec
Ethtool(ixgbe2  ) stat:781724427 (781,724,427) <= tx_bytes_nic /sec
Ethtool(ixgbe2  ) stat: 12214393 ( 12,214,393) <= tx_packets /sec
Ethtool(ixgbe2  ) stat: 12214435 ( 12,214,435) <= tx_pkts_nic /sec
Ethtool(mlx5p2  ) stat: 12211786 ( 12,211,786) <= rx3_cache_empty /sec
Ethtool(mlx5p2  ) stat: 36506736 ( 36,506,736) <= rx_64_bytes_phy /sec
Ethtool(mlx5p2  ) stat:   2336430575 (  2,336,430,575) <= rx_bytes_phy /sec
Ethtool(mlx5p2  ) stat: 12211786 ( 12,211,786) <= rx_cache_empty /sec
Ethtool(mlx5p2  ) stat: 22823073 ( 22,823,073) <= rx_discards_phy /sec
Ethtool(mlx5p2  ) stat:  1471860 (  1,471,860) <= rx_out_of_buffer /sec
Ethtool(mlx5p2  ) stat: 36506715 ( 36,506,715) <= rx_packets_phy /sec
Ethtool(mlx5p2  ) stat:   2336542282 (  2,336,542,282) <= rx_prio0_bytes /sec
Ethtool(mlx5p2  ) stat: 13683921 ( 13,683,921) <= rx_prio0_packets /sec
Ethtool(mlx5p2  ) stat:821015537 (821,015,537) <= 
rx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat: 13683608 ( 13,683,608) <= 
rx_vport_unicast_packets /sec

Before this patch: single-flow performance was 6 Mpps, and if I started
two flows the collective performance dropped to 4 Mpps, because we hit the
page allocator lock (further negative scaling occurs).

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes to not return NULL, only
   an ERR_PTR, as this simplifies error handling in drivers.
 - Save a branch in mlx5e_page_release
 - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   41 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   16 ++--
 3 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 28cc26debeda..ab91166f7c5a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,8 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+struct page_pool;
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
@@ -535,6 +537,7 @@ struct mlx5e_rq {
/* XDP */
struct bpf_prog   *xdp_prog;
struct mlx5e_xdpsq xdpsq;
+   struct page_pool  *page_pool;
 
/* control */
struct mlx5_wq_ctrlwq_ctrl;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2e4ca0f15b62..bf17e6d614d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "eswitch.h"
 #include "en.h"
 #include "en_tc.h"
@@ -387,10 +388,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
  struct mlx5e_rq_param *rqp,
  struct mlx5e_rq *rq)
 {
+   struct page_pool_params pp_params = { 0 };
struct mlx5_core_dev *mdev = c->mdev;
void *rqc = rqp->rqc;
void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
-   u32 byte_count;
+   u32 byte_count, pool_size;
int npages;
int wq_sz;
int err;
@@ -429,10 +431,13 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
rq->buff.headroom = params->rq_headroom;
+   pool_size = 1 << params->log_rq_size;
 
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 
+   pool_size = pool_size * MLX5_MPWRQ_PAGES_PER_WQE;
+
rq->post_wqes = mlx5e_post_rx_mpwqes;
rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
@@ -506,13 +511,31 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
rq->mkey_be = c->mkey_be;
}
 
-   /* This must only be activate for order-0 pages */
-   if (rq->xdp_prog) {
-   err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
-MEM_TYPE_PAGE_ORDER0, NULL);

[bpf-next V4 PATCH 11/15] page_pool: refurbish version of page_pool code

2018-03-22 Thread Jesper Dangaard Brouer
Need a fast page recycle mechanism for the ndo_xdp_xmit API, for returning
pages at DMA-TX completion time, which has good cross-CPU
performance, given DMA-TX completion can happen on a remote CPU.

Refurbish my page_pool code, which was presented[1] at MM-summit 2016.
The page_pool code is adapted to not depend on the page allocator and
on integration into struct page.  The DMA mapping feature is kept,
even though it will not be activated/used in this patchset.

[1] 
http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes to not return NULL, only
   an ERR_PTR, as this simplifies error handling in drivers.

V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
  e.g. remove a WARN as pointed out by Tariq
  e.g. quicker fallback if ptr_ring is empty.

Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/page_pool.h |  133 +++
 net/core/Makefile   |1 
 net/core/page_pool.c|  329 +++
 3 files changed, 463 insertions(+)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
new file mode 100644
index ..1ff11e641b2e
--- /dev/null
+++ b/include/net/page_pool.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+ *
+ * page_pool.h
+ * Author: Jesper Dangaard Brouer 
+ * Copyright (C) 2016 Red Hat, Inc.
+ */
+
+/**
+ * DOC: page_pool allocator
+ *
+ * This page_pool allocator is optimized for the XDP mode that
+ * uses one-frame-per-page, but have fallbacks that act like the
+ * regular page allocator APIs.
+ *
+ * Basic use involve replacing alloc_pages() calls with the
+ * page_pool_alloc_pages() call.  Drivers should likely use
+ * page_pool_dev_alloc_pages() replacing dev_alloc_pages().
+ *
+ * If page_pool handles DMA mapping (use page->private), then API user
+ * is responsible for invoking page_pool_put_page() once.  In-case of
+ * elevated refcnt, the DMA state is released, assuming other users of
+ * the page will eventually call put_page().
+ *
+ * If no DMA mapping is done, then it can act as shim-layer that
+ * fall-through to alloc_page.  As no state is kept on the page, the
+ * regular put_page() call is sufficient.
+ */
+#ifndef _NET_PAGE_POOL_H
+#define _NET_PAGE_POOL_H
+
+#include  /* Needed by ptr_ring */
+#include 
+#include 
+
+#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */
+#define PP_FLAG_ALLPP_FLAG_DMA_MAP
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark is related to the network
+ * use-case.  The NAPI budget is 64 packets.  After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Keeping room for more objects, is due to XDP_DROP use-case.  As
+ * XDP_DROP allows the opportunity to recycle objects directly into
+ * this array, as it shares the same softirq/NAPI protection.  If
+ * cache is already full (or partly full) then the XDP_DROP recycles
+ * would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE128
+#define PP_ALLOC_CACHE_REFILL  64
+struct pp_alloc_cache {
+   u32 count cacheline_aligned_in_smp;
+   void *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+struct page_pool_params {
+   u32 size; /* caller sets size of struct */
+   unsigned intorder;
+   unsigned long   flags;
+   struct device   *dev; /* device, for DMA pre-mapping purposes */
+   int nid;  /* Numa node id to allocate from pages from */
+   enum dma_data_direction dma_dir; /* DMA mapping direction */
+   unsigned intpool_size;
+   charend_marker[0]; /* must be last struct member */
+};
+#definePAGE_POOL_PARAMS_SIZE   offsetof(struct page_pool_params, 
end_marker)
+
+struct page_pool {
+   struct page_pool_params p;
+
+   /*
+* Data structure for allocation side
+*
+* Drivers allocation side usually already perform some kind
+* of resource protection.  Piggyback on this protection, and
+* require driver to protect allocation side.
+*
+* For NIC drivers this means, allocate a page_pool per
+* RX-queue. As the RX-queue is already protected by
+* Softirq/BH scheduling and napi_schedule. NAPI schedule
+* guarantee that a single napi_struct will only be scheduled
+* on a single CPU (see napi_schedule).
+*/
+   struct pp_alloc_cache alloc;
+
+   /* Data structure for storing recycled pages.
+*
+* Returning/freeing pages is more complicated synchronization
+* wise, because free's can happen o
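
A minimal driver-side usage sketch based on the API declared above, one
pool per RX ring; the surrounding names, and leaving DMA mapping to the
driver (no PP_FLAG_DMA_MAP), are assumptions:

        struct page_pool_params pp_params = { 0 };
        struct page_pool *pool;
        struct page *page;

        pp_params.size      = PAGE_POOL_PARAMS_SIZE;
        pp_params.order     = 0;
        pp_params.flags     = 0;            /* pool does no DMA mapping */
        pp_params.pool_size = rx_ring_size; /* sizing of the recycle ring */
        pp_params.nid       = NUMA_NO_NODE;

        pool = page_pool_create(&pp_params);
        if (IS_ERR(pool))
                return PTR_ERR(pool);

        page = page_pool_dev_alloc_pages(pool); /* vs dev_alloc_pages() */
        /* ... attach page to an RX descriptor, run XDP ... */
        page_pool_put_page(pool, page);         /* recycle, e.g. XDP_DROP */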

[bpf-next V4 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit

2018-03-22 Thread Jesper Dangaard Brouer
Change the ndo_xdp_xmit API to take a struct xdp_frame instead of a struct
xdp_buff.  This brings xdp_return_frame and ndo_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.

V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP")

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 +++--
 drivers/net/tun.c |   19 ---
 drivers/net/virtio_net.c  |   24 ++--
 include/linux/netdevice.h |4 ++--
 net/core/filter.c |   17 +++--
 5 files changed, 54 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index e6e9b28ecfba..f78096ed4c86 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2252,7 +2252,7 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring 
*rx_ring,
 #define IXGBE_XDP_TX 2
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-  struct xdp_buff *xdp);
+  struct xdp_frame *xdpf);
 
 static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 struct ixgbe_ring *rx_ring,
@@ -2260,6 +2260,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter 
*adapter,
 {
int err, result = IXGBE_XDP_PASS;
struct bpf_prog *xdp_prog;
+   struct xdp_frame *xdpf;
u32 act;
 
rcu_read_lock();
@@ -2273,7 +2274,12 @@ static struct sk_buff *ixgbe_run_xdp(struct 
ixgbe_adapter *adapter,
case XDP_PASS:
break;
case XDP_TX:
-   result = ixgbe_xmit_xdp_ring(adapter, xdp);
+   xdpf = convert_to_xdp_frame(xdp);
+   if (unlikely(!xdpf)) {
+   result = IXGBE_XDP_CONSUMED;
+   break;
+   }
+   result = ixgbe_xmit_xdp_ring(adapter, xdpf);
break;
case XDP_REDIRECT:
err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
@@ -8329,20 +8335,15 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
 }
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-  struct xdp_buff *xdp)
+  struct xdp_frame *xdpf)
 {
struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
struct ixgbe_tx_buffer *tx_buffer;
union ixgbe_adv_tx_desc *tx_desc;
-   struct xdp_frame *xdpf;
u32 len, cmd_type;
dma_addr_t dma;
u16 i;
 
-   xdpf = convert_to_xdp_frame(xdp);
-   if (unlikely(!xdpf))
-   return -EOVERFLOW;
-
len = xdpf->len;
 
if (unlikely(!ixgbe_desc_unused(ring)))
@@ -9995,7 +9996,7 @@ static int ixgbe_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
}
 }
 
-static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
struct ixgbe_adapter *adapter = netdev_priv(dev);
struct ixgbe_ring *ring;
@@ -10011,7 +10012,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, 
struct xdp_buff *xdp)
if (unlikely(!ring))
return -ENXIO;
 
-   err = ixgbe_xmit_xdp_ring(adapter, xdp);
+   err = ixgbe_xmit_xdp_ring(adapter, xdpf);
if (err != IXGBE_XDP_TX)
return -ENOSPC;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a7e42ae1b220..da0402ebc5ce 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1293,18 +1293,13 @@ static const struct net_device_ops tun_netdev_ops = {
.ndo_get_stats64= tun_net_get_stats64,
 };
 
-static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int tun_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
 {
struct tun_struct *tun = netdev_priv(dev);
-   struct xdp_frame *frame;
struct tun_file *tfile;
u32 numqueues;
int ret = 0;
 
-   frame = convert_to_xdp_frame(xdp);
-   if (unlikely(!frame))
-   return -EOVERFLOW;
-
rcu_read_lock();
 
numqueues = READ_ONCE(tun->numqueues);
@@ -1328,6 +1323,16 @@ static int tun_xdp_xmit(struct net_device *dev, struct 
xdp_buff *xdp)
return ret;
 }
 
+static int tun_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+   struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+   if (unlikely(!frame))
+   return -EOVERFLOW;
+
+   return tun_xdp_xmit(dev, frame);
+}
+
 static void tun_xdp_flush(struct net_device *dev)
 {
struct tun_struct *tun = netdev_priv(dev);
@@ -1675,7 +1680,7 @@ static struct sk_buff *
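
A sketch of the include/linux/netdevice.h change accounted for in the
diffstat above (pieced together from the driver hunks; the exact member
layout is an assumption):

        /* in struct net_device_ops */
        int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_frame *xdpf);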

[bpf-next V4 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping

2018-03-22 Thread Jesper Dangaard Brouer
Use the IDA infrastructure to get a cyclically increasing ID number,
which is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable for creating the
ID-to-pointer lookup table, because this is faster.

The problem being solved here is that the xdp_rxq_info
pointer (stored in the xdp_buff) cannot be used directly, as its
guaranteed lifetime is too short.  The info is needed on a
(potentially) remote CPU during DMA-TX completion time.  The
xdp_mem_info is stored in the xdp_frame when it gets converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator.  Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID-to-pointer map.  The removal and validity of the
allocator and of the helper struct xdp_mem_allocator are guarded by
RCU.  The allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().

It is up for debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function.  In any case,
this must happen via RCU freeing.


V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
  XDP_REDIRECT, even though it's not strictly necessary when
  allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |9 +
 drivers/net/tun.c |6 +
 drivers/net/virtio_net.c  |7 +
 include/net/xdp.h |   15 --
 net/core/xdp.c|  230 -
 5 files changed, 248 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 45520eb503ee..ff069597fccf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6360,7 +6360,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter 
*adapter,
struct device *dev = rx_ring->dev;
int orig_node = dev_to_node(dev);
int ring_node = -1;
-   int size;
+   int size, err;
 
size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count;
 
@@ -6397,6 +6397,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter 
*adapter,
 rx_ring->queue_index) < 0)
goto err;
 
+   err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
+MEM_TYPE_PAGE_SHARED, NULL);
+   if (err) {
+   xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+   goto err;
+   }
+
rx_ring->xdp_prog = adapter->xdp_prog;
 
return 0;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6750980d9f30..81fddf9cc58f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -846,6 +846,12 @@ static int tun_attach(struct tun_struct *tun, struct file 
*file,
   tun->dev, tfile->queue_index);
if (err < 0)
goto out;
+   err = xdp_rxq_info_reg_mem_model(&tfile->xdp_rxq,
+MEM_TYPE_PAGE_SHARED, NULL);
+   if (err < 0) {
+   xdp_rxq_info_unreg(&tfile->xdp_rxq);
+   goto out;
+   }
err = 0;
}
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6c4220450506..48c86accd3b8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1312,6 +1312,13 @@ static int virtnet_open(struct net_device *dev)
if (err < 0)
return err;
 
+   err = xdp_rxq_info_reg_mem_model(&vi->rq[i].xdp_rxq,
+MEM_TYPE_PAGE_SHARED, NULL);
+   if (err < 0) {
+   xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
+   return err;
+   }
+
virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
}
diff --git a/include/net/xdp.h b/include/net/xdp.h
index bc0cb97e20dc..859aa9b737fe 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -41,7 +41,7 @@ enum mem_type {
 
 struct xdp_mem_info {
u32 type; /* enum mem_type, but known size type */
-   /* u32 id; will be added later in this patchset */
+   u32 id;
 };
 
 struct xdp_rxq_info {
@@ -100,18 +100,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff 
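
For reference, a sketch of how the registration side pairs the cyclic ID
with the rhashtable entry (mem_id_next, MEM_ID_MAX and the error handling
are assumptions; the exact retry policy in net/core/xdp.c may differ):

        /* cyclic ID allocation: resume searching after the last grant */
        id = ida_simple_get(&mem_id_pool, mem_id_next, MEM_ID_MAX,
                            GFP_KERNEL);
        if (id < 0)
                return id;
        mem_id_next = id + 1;

        /* publish the ID -> allocator mapping; lookups run under RCU */
        xa->mem.id    = id;
        xa->allocator = allocator;
        err = rhashtable_insert_fast(mem_id_ht, &xa->node, mem_id_rht_params);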

[bpf-next V4 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame

2018-03-22 Thread Jesper Dangaard Brouer
New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.

The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed).  For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.

As suggested by Saeed, to maintain a symmetric API it is the driver's
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver has called xdp_rxq_info_unreg(), it is the driver's
responsibility to free the page_pool, but with an RCU free call.  This
is done easily via the page_pool helper page_pool_destroy_rcu() (which
avoids touching any driver code during the RCU callback, which could
happen after the driver has been unloaded).

Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/xdp.h |3 +++
 net/core/xdp.c|   23 ---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 859aa9b737fe..98b55eaf8fd7 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -36,6 +36,7 @@
 enum mem_type {
MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */
+   MEM_TYPE_PAGE_POOL,
MEM_TYPE_MAX,
 };
 
@@ -44,6 +45,8 @@ struct xdp_mem_info {
u32 id;
 };
 
+struct page_pool;
+
 struct xdp_rxq_info {
struct net_device *dev;
u32 queue_index;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 06a5b39491ad..fe8e87abc266 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht;
 
 struct xdp_mem_allocator {
struct xdp_mem_info mem;
-   void *allocator;
+   union {
+   void *allocator;
+   struct page_pool *page_pool;
+   };
struct rhash_head node;
struct rcu_head rcu;
 };
@@ -74,7 +78,9 @@ void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
/* Allow this ID to be reused */
ida_simple_remove(&mem_id_pool, xa->mem.id);
 
-   /* TODO: Depending on allocator type/pointer free resources */
+   /* Notice, driver is expected to free the *allocator,
+* e.g. page_pool, and MUST also use RCU free.
+*/
 
/* Poison memory */
xa->mem.id = 0x;
@@ -290,11 +296,21 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {
-   struct xdp_mem_allocator *xa;
+   struct xdp_mem_allocator *xa = NULL;
 
rcu_read_lock();
if (mem->id)
xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+
+   if (mem->type == MEM_TYPE_PAGE_POOL) {
+   struct page *page = virt_to_head_page(data);
+
+   if (xa)
+   page_pool_put_page(xa->page_pool, page);
+   else
+   put_page(page);
+   rcu_read_unlock();
+   return;
+   }
rcu_read_unlock();
 
if (mem->type == MEM_TYPE_PAGE_SHARED) {
@@ -306,6 +322,7 @@ void xdp_return_frame(void *data, struct xdp_mem_info *mem)
struct page *page = virt_to_page(data); /* Assumes order0 page*/
 
put_page(page);
+   return;
}
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);
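
The resulting driver teardown ordering, as a short sketch of what the
commit message prescribes (the rq naming is an assumption):

        xdp_rxq_info_unreg(&rq->xdp_rxq);     /* unpublish the mem model */
        page_pool_destroy_rcu(rq->page_pool); /* RCU-deferred pool free */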



[bpf-next V4 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame

2018-03-22 Thread Jesper Dangaard Brouer
The generic xdp_frame format was inspired by cpumap's own internal
xdp_pkt format.  It is now time to convert cpumap over to the generic
xdp_frame format.  The cpumap needs one extra field, dev_rx.

Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/xdp.h   |1 +
 kernel/bpf/cpumap.c |  100 ++-
 2 files changed, 29 insertions(+), 72 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 13f71a15c79f..bc0cb97e20dc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -68,6 +68,7 @@ struct xdp_frame {
 * while mem info is valid on remote CPU.
 */
struct xdp_mem_info mem;
+   struct net_device *dev_rx; /* used by cpumap */
 };
 
 /* Convert xdp_buff to xdp_frame */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 3e4bbcbe3e86..bcdc4dea5ce7 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -159,52 +159,8 @@ static void cpu_map_kthread_stop(struct work_struct *work)
kthread_stop(rcpu->kthread);
 }
 
-/* For now, xdp_pkt is a cpumap internal data structure, with info
- * carried between enqueue to dequeue. It is mapped into the top
- * headroom of the packet, to avoid allocating separate mem.
- */
-struct xdp_pkt {
-   void *data;
-   u16 len;
-   u16 headroom;
-   u16 metasize;
-   /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
-* while mem info is valid on remote CPU.
-*/
-   struct xdp_mem_info mem;
-   struct net_device *dev_rx;
-};
-
-/* Convert xdp_buff to xdp_pkt */
-static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
-{
-   struct xdp_pkt *xdp_pkt;
-   int metasize;
-   int headroom;
-
-   /* Assure headroom is available for storing info */
-   headroom = xdp->data - xdp->data_hard_start;
-   metasize = xdp->data - xdp->data_meta;
-   metasize = metasize > 0 ? metasize : 0;
-   if (unlikely((headroom - metasize) < sizeof(*xdp_pkt)))
-   return NULL;
-
-   /* Store info in top of packet */
-   xdp_pkt = xdp->data_hard_start;
-
-   xdp_pkt->data = xdp->data;
-   xdp_pkt->len  = xdp->data_end - xdp->data;
-   xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
-   xdp_pkt->metasize = metasize;
-
-   /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
-   xdp_pkt->mem = xdp->rxq->mem;
-
-   return xdp_pkt;
-}
-
 static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
-struct xdp_pkt *xdp_pkt)
+struct xdp_frame *xdpf)
 {
unsigned int frame_size;
void *pkt_data_start;
@@ -219,7 +175,7 @@ static struct sk_buff *cpu_map_build_skb(struct 
bpf_cpu_map_entry *rcpu,
 * would be preferred to set frame_size to 2048 or 4096
 * depending on the driver.
 *   frame_size = 2048;
-*   frame_len  = frame_size - sizeof(*xdp_pkt);
+*   frame_len  = frame_size - sizeof(*xdp_frame);
 *
 * Instead, with info avail, skb_shared_info in placed after
 * packet len.  This, unfortunately fakes the truesize.
@@ -227,21 +183,21 @@ static struct sk_buff *cpu_map_build_skb(struct 
bpf_cpu_map_entry *rcpu,
 * is not at a fixed memory location, with mixed length
 * packets, which is bad for cache-line hotness.
 */
-   frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom +
+   frame_size = SKB_DATA_ALIGN(xdpf->len) + xdpf->headroom +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-   pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;
+   pkt_data_start = xdpf->data - xdpf->headroom;
skb = build_skb(pkt_data_start, frame_size);
if (!skb)
return NULL;
 
-   skb_reserve(skb, xdp_pkt->headroom);
-   __skb_put(skb, xdp_pkt->len);
-   if (xdp_pkt->metasize)
-   skb_metadata_set(skb, xdp_pkt->metasize);
+   skb_reserve(skb, xdpf->headroom);
+   __skb_put(skb, xdpf->len);
+   if (xdpf->metasize)
+   skb_metadata_set(skb, xdpf->metasize);
 
/* Essential SKB info: protocol and skb->dev */
-   skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);
+   skb->protocol = eth_type_trans(skb, xdpf->dev_rx);
 
/* Optional SKB info, currently missing:
 * - HW checksum info   (skb->ip_summed)
@@ -259,11 +215,11 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
 * gracefully and warn once.
 */
-   struct xdp_pkt *xdp_pkt;
+   struct xdp_frame *xdpf;
 
-   while ((xdp_pkt = ptr_ring_consume(ring)))
-   if (WARN_ON_ONCE(xdp_pkt))
-   xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+   while ((xdpf = ptr_ring_consume(ring)))
+   if (WARN_ON_ONCE(xdpf))
+
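
For reference, the generic helper cpumap is being converted to mirrors
the removed convert_to_xdp_pkt() almost line for line; a sketch of
convert_to_xdp_frame() from include/net/xdp.h, reconstructed from the
removed code above and the struct xdp_frame fields used in this patch:

/* Convert xdp_buff to xdp_frame */
static inline
struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
{
	struct xdp_frame *xdp_frame;
	int metasize;
	int headroom;

	/* Assure headroom is available for storing info */
	headroom = xdp->data - xdp->data_hard_start;
	metasize = xdp->data - xdp->data_meta;
	metasize = metasize > 0 ? metasize : 0;
	if (unlikely((headroom - metasize) < sizeof(*xdp_frame)))
		return NULL;

	/* Store info in top of packet */
	xdp_frame = xdp->data_hard_start;

	xdp_frame->data = xdp->data;
	xdp_frame->len  = xdp->data_end - xdp->data;
	xdp_frame->headroom = headroom - sizeof(*xdp_frame);
	xdp_frame->metasize = metasize;

	/* rxq only valid until napi_schedule ends; convert to xdp_mem_info */
	xdp_frame->mem = xdp->rxq->mem;

	return xdp_frame;
}

cpumap's enqueue path then only has to fill in the one extra field,
xdpf->dev_rx, before pushing the frame onto the ptr_ring.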

Re: Fwd: Kernel panic when using KVM and mlx4_en driver (when bonding and sriov enabled)

2018-03-22 Thread Tariq Toukan



On 20/03/2018 10:14 PM, kvaps wrote:

Hello, I have a bug with the new HPE ProLiant m710x Server Cartridges,
which have a Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
Ethernet controller.

When I use bonding + VFs and KVM, the kernel gets stuck with these
messages on the console:

[ 1011.070739] kvm [16361]: vcpu0, guest rIP: 0x810644d8
disabled perfctr wrmsr: 0xc2 data 0x
[ 1011.528347] cache_from_obj: Wrong slab cache. kmalloc-256 but
object is from kmalloc-192
[ 1011.927642] general protection fault:  [#1] SMP PTI
[ 1012.185439] cache_from_obj: Wrong slab cache. kmalloc-256 but
object is from kmalloc-192

I've already reported this bug on launchpad:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1755268
But since the bug is present in the latest kernel, I was advised to
contact you directly.



Thanks for that!

I will check the details below and let you know of any questions/updates 
I have.


Regards,
Tariq


=== Steps to reproduce ===

I have the next network configuration:

eno1 (physical)    eno1d1 (physical)    eno2 (virtual function)    eno2d1 (virtual function)
      |                  |
      +----- bond0 -----+
               |
        vmbr0 (bridge)


After my machine is booted, I can run these commands:

# wget 
http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-virt-3.7.0-x86_64.iso
-O alpine.iso
# qemu-system-x86_64 -machine pc-i440fx-xenial,accel=kvm,usb=off -boot
d -cdrom alpine.iso -m 512 -nographic -device e1000,netdev=net0
-netdev tap,id=net0

And the kernel will crash.

=== System information ===

##
# Network config #
##

This is my /etc/network/interfaces file:

auto lo
iface lo inet loopback

auto openibd
iface openibd inet manual
 pre-up /etc/init.d/openibd start
 pre-down /etc/init.d/openibd force-stop

auto bond0
iface bond0 inet manual
 pre-up ip link add bond0 type bond || true
 pre-up ip link set bond0 down
 pre-up ip link set bond0 type bond mode active-backup
arp_interval 2000 arp_ip_target 10.36.0.1 arp_validate 3 primary eno1
 pre-up ip link set eno1 down
 pre-up ip link set eno1d1 down
 pre-up ip link set eno1 master bond0
 pre-up ip link set eno1d1 master bond0
 pre-up ip link set bond0 up
 pre-down ip link del bond0

auto vmbr0
iface vmbr0 inet static
 address 10.36.128.217
 netmask 255.255.0.0
 gateway 10.36.0.1
 bridge_ports bond0
 bridge_stp off
 bridge_fd 0


##
# Kernel version #
##

Latest kernel that I've tested:

# cat /proc/version
Linux version 4.16.0-041600rc6-generic (kernel@gloin) (gcc version
7.2.0 (Ubuntu 7.2.0-8ubuntu3.2)) #201803182230 SMP Mon Mar 19 02:32:18
UTC 2018

##
# Driver version #
##

Both drivers that I tested:

# Mellanox driver on stock and hwe kernels:

# ethtool -i eno1
driver: mlx4_en
version: 4.3-1.0.1
firmware-version: 2.40.5540
expansion-rom-version:
bus-info: :11:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Built-in driver from latest kernel:

# ethtool -i eno1
driver: mlx4_en
version: 4.0-0
firmware-version: 2.42.5004
expansion-rom-version:
bus-info: :11:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

###
# NIC Details #
###

# mst status
MST modules:

 MST PCI module loaded
 MST PCI configuration module loaded

MST devices:

/dev/mst/mt4103_pci_cr0 - PCI direct access.
domain:bus:dev.fn=:11:00.0
bar=0x7f10 size=0x10
Chip revision is: 00
/dev/mst/mt4103_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=:11:00.0
addr.reg=88 data.reg=92
Chip revision is: 00

# ibv_devinfo
hca_id: mlx4_1
 transport: InfiniBand (0)
 fw_ver: 2.40.5540
 node_guid: 0014:0500:d300:bc52
 sys_image_guid: f403:4303:00fd:102d
 vendor_id: 0x02c9
 vendor_part_id: 4100
 hw_ver: 0x0
 board_id: HP_1690110017
 phys_port_cnt: 2
 Device ports:
 port: 1
 state: PORT_DOWN (1)
 max_mtu: 4096 (5)
 active_mtu: 1024 (3)
 sm_lid: 0
 port_lid: 0
 port_lmc: 0x00
 link_layer: Ethernet

 port: 2
 state: PORT_DOWN (1)
 max_mtu: 4096 (5)
 active_mtu: 1024 (3)
 sm_lid: 0
 port_lid: 0
   

[PATCH net-next 5/9] net: hns3: Add support to reset the enet/ring mgmt layer

2018-03-22 Thread Salil Mehta
After the VF driver knows that the hardware reset has been performed
successfully, it should proceed and reset the enet layer. This
primarily consists of bringing down the interface, clearing the
TX/RX rings, disassociating the vectors from the rings, etc.

Signed-off-by: Salil Mehta 
---
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 103 -
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |   3 +
 2 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index b648311..bd45b11 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -2,6 +2,7 @@
 // Copyright (c) 2016-2017 Hisilicon Limited.
 
 #include 
+#include 
 #include "hclgevf_cmd.h"
 #include "hclgevf_main.h"
 #include "hclge_mbx.h"
@@ -832,6 +833,101 @@ static void hclgevf_reset_tqp(struct hnae3_handle 
*handle, u16 queue_id)
 2, true, NULL, 0);
 }
 
+static int hclgevf_notify_client(struct hclgevf_dev *hdev,
+enum hnae3_reset_notify_type type)
+{
+   struct hnae3_client *client = hdev->nic_client;
+   struct hnae3_handle *handle = &hdev->nic;
+
+   if (!client->ops->reset_notify)
+   return -EOPNOTSUPP;
+
+   return client->ops->reset_notify(handle, type);
+}
+
+static int hclgevf_reset_wait(struct hclgevf_dev *hdev)
+{
+#define HCLGEVF_RESET_WAIT_MS  500
+#define HCLGEVF_RESET_WAIT_CNT 20
+   u32 val, cnt = 0;
+
+   /* wait to check the hardware reset completion status */
+   val = hclgevf_read_dev(&hdev->hw, HCLGEVF_FUN_RST_ING);
+   while (hnae_get_bit(val, HCLGEVF_FUN_RST_ING_B) &&
+   (cnt < HCLGEVF_RESET_WAIT_CNT)) {
+   msleep(HCLGEVF_RESET_WAIT_MS);
+   val = hclgevf_read_dev(&hdev->hw, HCLGEVF_FUN_RST_ING);
+   cnt++;
+   }
+
+   /* hardware completion status should be available by this time */
+   if (cnt >= HCLGEVF_RESET_WAIT_CNT) {
+   dev_warn(&hdev->pdev->dev,
+"could'nt get reset done status from h/w, timeout!\n");
+   return -EBUSY;
+   }
+
+   /* we will wait a bit more to let the reset of the stack complete. This
+* might happen in case the reset assertion was made by the PF. Yes, this
+* also means we might end up waiting a bit more even for a VF reset.
+*/
+   msleep(5000);
+
+   return 0;
+}
+
+static int hclgevf_reset_stack(struct hclgevf_dev *hdev)
+{
+   /* uninitialize the nic client */
+   hclgevf_notify_client(hdev, HNAE3_UNINIT_CLIENT);
+
+   /* re-initialize the hclge device - add code here */
+
+   /* bring up the nic client again */
+   hclgevf_notify_client(hdev, HNAE3_INIT_CLIENT);
+
+   return 0;
+}
+
+static int hclgevf_reset(struct hclgevf_dev *hdev)
+{
+   int ret;
+
+   rtnl_lock();
+
+   /* bring down the nic to stop any ongoing TX/RX */
+   hclgevf_notify_client(hdev, HNAE3_DOWN_CLIENT);
+
+   /* check if VF could successfully fetch the hardware reset completion
+* status from the hardware
+*/
+   ret = hclgevf_reset_wait(hdev);
+   if (ret) {
+   /* can't do much in this situation, will disable VF */
+   dev_err(&hdev->pdev->dev,
+   "VF failed(=%d) to fetch H/W reset completion status\n",
+   ret);
+
+   dev_warn(&hdev->pdev->dev, "VF reset failed, disabling VF!\n");
+   hclgevf_notify_client(hdev, HNAE3_UNINIT_CLIENT);
+
+   rtnl_unlock();
+   return ret;
+   }
+
+   /* now, re-initialize the nic client and ae device */
+   ret = hclgevf_reset_stack(hdev);
+   if (ret)
+   dev_err(&hdev->pdev->dev, "failed to reset VF stack\n");
+
+   /* bring up the nic to enable TX/RX again */
+   hclgevf_notify_client(hdev, HNAE3_UP_CLIENT);
+
+   rtnl_unlock();
+
+   return ret;
+}
+
 static int hclgevf_do_reset(struct hclgevf_dev *hdev)
 {
int status;
@@ -940,10 +1036,9 @@ static void hclgevf_reset_service_task(struct work_struct 
*work)
 */
hdev->reset_attempts = 0;
 
-   /* code to check/wait for hardware reset completion and the
-* further initiating software stack reset would be added here
-*/
-
+   ret = hclgevf_reset(hdev);
+   if (ret)
+   dev_err(&hdev->pdev->dev, "VF stack reset failed.\n");
} else if (test_and_clear_bit(HCLGEVF_RESET_REQUESTED,
  &hdev->reset_state)) {
/* we could be here when either of below happens:
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mai
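
The notify stages driven by hclgevf_notify_client() above come from
hnae3.h; the assumed definition (ordering as in the existing driver)
is:

enum hnae3_reset_notify_type {
	HNAE3_UP_CLIENT,	/* restart TX/RX after re-init */
	HNAE3_DOWN_CLIENT,	/* quiesce TX/RX before the reset */
	HNAE3_INIT_CLIENT,	/* re-initialize rings and vectors */
	HNAE3_UNINIT_CLIENT,	/* tear down rings and vectors */
};

so hclgevf_reset() runs DOWN, waits for the hardware, then UNINIT,
INIT and finally UP.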

[PATCH net-next 7/9] net: hns3: Changes to support ARQ(Asynchronous Receive Queue)

2018-03-22 Thread Salil Mehta
The current mailbox CRQ can consist of both synchronous and async
responses from the PF. Synchronous responses are time critical
and should be handed over to the waiting tasks/context as quickly
as possible, otherwise a timeout occurs.

The above problem gets accentuated if the CRQ contains even a
single async message. Hence, it is important to handle synchronous
messages quickly and defer the handling of async messages.
This patch introduces a separate ARQ (async receive queue) for the
async messages. These messages are processed later, from the
mailbox task, while synchronous messages still get processed in
the context of the mailbox interrupt.

The ARQ is important as VF reset introduces some new async messages
like MBX_ASSERTING_RESET, which add to the pressure on the
responses for synchronous messages, making them time out even
more quickly.

Signed-off-by: Salil Mehta 
---
 drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h| 15 
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c   |  6 ++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 16 +++--
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |  5 ++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c   | 83 +++---
 5 files changed, 111 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h 
b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
index e6e1d22..f3e90c2 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
@@ -85,6 +85,21 @@ struct hclge_mbx_pf_to_vf_cmd {
u16 msg[8];
 };
 
+/* used by VF to store the received Async responses from PF */
+struct hclgevf_mbx_arq_ring {
+#define HCLGE_MBX_MAX_ARQ_MSG_SIZE 8
+#define HCLGE_MBX_MAX_ARQ_MSG_NUM  1024
+   struct hclgevf_dev *hdev;
+   u32 head;
+   u32 tail;
+   u32 count;
+   u16 msg_q[HCLGE_MBX_MAX_ARQ_MSG_NUM][HCLGE_MBX_MAX_ARQ_MSG_SIZE];
+};
+
 #define hclge_mbx_ring_ptr_move_crq(crq) \
(crq->next_to_use = (crq->next_to_use + 1) % crq->desc_num)
+#define hclge_mbx_tail_ptr_move_arq(arq) \
+   (arq.tail = (arq.tail + 1) % HCLGE_MBX_MAX_ARQ_MSG_NUM)
+#define hclge_mbx_head_ptr_move_arq(arq) \
+   (arq.head = (arq.head + 1) % HCLGE_MBX_MAX_ARQ_MSG_NUM)
 #endif
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c
index 85985e7..1bbfe13 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c
@@ -315,6 +315,12 @@ int hclgevf_cmd_init(struct hclgevf_dev *hdev)
goto err_csq;
}
 
+   /* initialize the pointers of async rx queue of mailbox */
+   hdev->arq.hdev = hdev;
+   hdev->arq.head = 0;
+   hdev->arq.tail = 0;
+   hdev->arq.count = 0;
+
/* get firmware version */
ret = hclgevf_cmd_query_firmware_version(&hdev->hw, &version);
if (ret) {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 6dd7561..2b84264 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1010,10 +1010,13 @@ void hclgevf_reset_task_schedule(struct hclgevf_dev 
*hdev)
}
 }
 
-static void hclgevf_mbx_task_schedule(struct hclgevf_dev *hdev)
+void hclgevf_mbx_task_schedule(struct hclgevf_dev *hdev)
 {
-   if (!test_and_set_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state))
+   if (!test_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state) &&
+   !test_bit(HCLGEVF_STATE_MBX_HANDLING, &hdev->state)) {
+   set_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state);
schedule_work(&hdev->mbx_service_task);
+   }
 }
 
 static void hclgevf_task_schedule(struct hclgevf_dev *hdev)
@@ -1025,6 +1028,10 @@ static void hclgevf_task_schedule(struct hclgevf_dev 
*hdev)
 
 static void hclgevf_deferred_task_schedule(struct hclgevf_dev *hdev)
 {
+   /* if we have any pending mailbox event then schedule the mbx task */
+   if (hdev->mbx_event_pending)
+   hclgevf_mbx_task_schedule(hdev);
+
if (test_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state))
hclgevf_reset_task_schedule(hdev);
 }
@@ -1118,7 +1125,7 @@ static void hclgevf_mailbox_service_task(struct 
work_struct *work)
 
clear_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state);
 
-   hclgevf_mbx_handler(hdev);
+   hclgevf_mbx_async_handler(hdev);
 
clear_bit(HCLGEVF_STATE_MBX_HANDLING, &hdev->state);
 }
@@ -1178,8 +1185,7 @@ static irqreturn_t hclgevf_misc_irq_handle(int irq, void 
*data)
if (!hclgevf_check_event_cause(hdev, &clearval))
goto skip_sched;
 
-   /* schedule the VF mailbox service task, if not already scheduled */
-   hclgevf_mbx_task_schedule(hdev);
+   hclgevf_mbx_handler(
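
The drain side of this ring, hclgevf_mbx_async_handler(), runs from
the mailbox service task scheduled above; a condensed sketch (assumed
shape, the dispatch cases follow patch 8 of this series):

void hclgevf_mbx_async_handler(struct hclgevf_dev *hdev)
{
	u16 *msg_q;

	/* we are at the start of async processing, safe to clear this */
	hdev->mbx_event_pending = false;

	/* process all queued async messages */
	while (hdev->arq.count) {
		msg_q = hdev->arq.msg_q[hdev->arq.head];

		switch (msg_q[0]) {
		case HCLGE_MBX_LINK_STAT_CHANGE:
			/* update link state/speed/duplex from msg_q[1..] */
			break;
		case HCLGE_MBX_ASSERTING_RESET:
			/* schedule the VF reset task, see patch 8 */
			break;
		}

		/* consume one entry: advance head, decrement count */
		hclge_mbx_head_ptr_move_arq(hdev->arq);
		hdev->arq.count--;
	}
}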

Re: [PATCH net-next 1/1] net/ipv4: disable SMC TCP option with SYN Cookies

2018-03-22 Thread Eric Dumazet


On 03/22/2018 06:23 AM, Ursula Braun wrote:

> We moved the clear to cookie_v4_check()/cookie_v6_check. However, this does 
> not seem to
> be sufficient to prevent the SYNACK from containing the SMC experimental 
> option.
> We found that an additional check in tcp_conn_request() helps:
> 
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6248,6 +6248,9 @@ int tcp_conn_request(struct request_sock
>   if (want_cookie && !tmp_opt.saw_tstamp)
>   tcp_clear_options(&tmp_opt);
>  
> + if (IS_ENABLED(CONFIG_SMC) && want_cookie && tmp_opt.smc_ok)
> + tmp_opt.smc_ok = 0;
> +
>   tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
>   tcp_openreq_init(req, &tmp_opt, skb, sk);
>   inet_rsk(req)->no_srccheck = inet_sk(sk)->transparent;
> 
> Do you think this could be the right place for clearing the smc_ok bit?


Yes, but since tmp_opt is private to this thread/cpu, there is no false
sharing to be afraid of:

if (IS_ENABLED(CONFIG_SMC) && want_cookie)
tmp_opt.smc_ok = 0;




[PATCH net-next 9/9] net: hns3: Changes required in PF mailbox to support VF reset

2018-03-22 Thread Salil Mehta
The PF needs to assert the VF reset when it receives a reset
request from the VF. After receiving the request, the PF
acknowledges it by replying with a MBX_ASSERTING_RESET message
to the VF. The VF then goes into the pending state and waits
for the hardware to complete the reset.

This patch contains code to handle the received VF message,
inform the VF of the assertion, and reset the VF using the cmdq
interface.
Signed-off-by: Salil Mehta 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c|  2 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  1 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c | 42 ++
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index a3e00da..bede411 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2749,7 +2749,7 @@ static int hclge_reset_wait(struct hclge_dev *hdev)
return 0;
 }
 
-static int hclge_func_reset_cmd(struct hclge_dev *hdev, int func_id)
+int hclge_func_reset_cmd(struct hclge_dev *hdev, int func_id)
 {
struct hclge_desc desc;
struct hclge_reset_cmd *req = (struct hclge_reset_cmd *)desc.data;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index 8c14d10..0f4157e 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -657,4 +657,5 @@ void hclge_mbx_handler(struct hclge_dev *hdev);
 void hclge_reset_tqp(struct hnae3_handle *handle, u16 queue_id);
 void hclge_reset_vf_queue(struct hclge_vport *vport, u16 queue_id);
 int hclge_cfg_flowctrl(struct hclge_dev *hdev);
+int hclge_func_reset_cmd(struct hclge_dev *hdev, int func_id);
 #endif
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
index 949da0c..3901333 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
@@ -79,6 +79,18 @@ static int hclge_send_mbx_msg(struct hclge_vport *vport, u8 
*msg, u16 msg_len,
return status;
 }
 
+int hclge_inform_reset_assert_to_vf(struct hclge_vport *vport)
+{
+   u8 msg_data[2];
+   u8 dest_vfid;
+
+   dest_vfid = (u8)vport->vport_id;
+
+   /* send this requested info to VF */
+   return hclge_send_mbx_msg(vport, msg_data, sizeof(u8),
+ HCLGE_MBX_ASSERTING_RESET, dest_vfid);
+}
+
 static void hclge_free_vector_ring_chain(struct hnae3_ring_chain_node *head)
 {
struct hnae3_ring_chain_node *chain_tmp, *chain;
@@ -339,6 +351,33 @@ static void hclge_mbx_reset_vf_queue(struct hclge_vport 
*vport,
hclge_gen_resp_to_vf(vport, mbx_req, 0, NULL, 0);
 }
 
+static void hclge_reset_vf(struct hclge_vport *vport,
+  struct hclge_mbx_vf_to_pf_cmd *mbx_req)
+{
+   struct hclge_dev *hdev = vport->back;
+   int ret;
+
+   dev_warn(&hdev->pdev->dev, "PF received VF reset request from VF %d!",
+mbx_req->mbx_src_vfid);
+
+   /* Acknowledge VF that PF is now about to assert the reset for the VF.
+* On receiving this message VF will get into pending state and will
+* start polling for the hardware reset completion status.
+*/
+   ret = hclge_inform_reset_assert_to_vf(vport);
+   if (ret) {
+   dev_err(&hdev->pdev->dev,
+   "PF fail(%d) to inform VF(%d)of reset, reset failed!\n",
+   ret, vport->vport_id);
+   return;
+   }
+
+   dev_warn(&hdev->pdev->dev, "PF is now resetting VF %d.\n",
+mbx_req->mbx_src_vfid);
+   /* reset this virtual function */
+   hclge_func_reset_cmd(hdev, mbx_req->mbx_src_vfid);
+}
+
 void hclge_mbx_handler(struct hclge_dev *hdev)
 {
struct hclge_cmq_ring *crq = &hdev->hw.cmq.crq;
@@ -416,6 +455,9 @@ void hclge_mbx_handler(struct hclge_dev *hdev)
case HCLGE_MBX_QUEUE_RESET:
hclge_mbx_reset_vf_queue(vport, req);
break;
+   case HCLGE_MBX_RESET:
+   hclge_reset_vf(vport, req);
+   break;
default:
dev_err(&hdev->pdev->dev,
"un-supported mailbox message, code = %d\n",
-- 
2.7.4
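
For reference, the PF-side command this patch calls into (already
present in hclge_main.c; only its static linkage is dropped above)
builds a function-reset descriptor over the cmdq roughly as follows.
This is a sketch; the opcode and bit names are assumptions based on
the existing driver:

int hclge_func_reset_cmd(struct hclge_dev *hdev, int func_id)
{
	struct hclge_desc desc;
	struct hclge_reset_cmd *req = (struct hclge_reset_cmd *)desc.data;
	int ret;

	/* build a "trigger function reset" descriptor for this VF id */
	hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_CFG_RST_TRIGGER, false);
	hnae_set_bit(req->mac_func_reset, HCLGE_CFG_RESET_FUNC_B, 1);
	req->fun_reset_vfid = func_id;

	ret = hclge_cmd_send(&hdev->hw, &desc, 1);
	if (ret)
		dev_err(&hdev->pdev->dev,
			"send function reset cmd fail, status=%d\n", ret);

	return ret;
}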




[PATCH net-next 8/9] net: hns3: Add *Asserting Reset* mailbox message & handling in VF

2018-03-22 Thread Salil Mehta
The Reset Asserting message is forwarded by the PF to inform the
VF about the hardware reset which is about to happen. This might
be due to an earlier VF reset request received by the PF, or
because the PF for any reason decides to undergo a reset. This
message results in the VF going into the pending state, in which
it polls the hardware for reset completion and then resets/tears
down its own stack.

Signed-off-by: Salil Mehta 
---
 drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h  |  1 +
 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c | 12 
 2 files changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h 
b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
index f3e90c2..519e2bd 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
@@ -11,6 +11,7 @@
 
 enum HCLGE_MBX_OPCODE {
HCLGE_MBX_RESET = 0x01, /* (VF -> PF) assert reset */
+   HCLGE_MBX_ASSERTING_RESET,  /* (PF -> VF) PF is asserting reset */
HCLGE_MBX_SET_UNICAST,  /* (VF -> PF) set UC addr */
HCLGE_MBX_SET_MULTICAST,/* (VF -> PF) set MC addr */
HCLGE_MBX_SET_VLAN, /* (VF -> PF) set VLAN */
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
index 7687911..a286184 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
@@ -170,6 +170,7 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
}
break;
case HCLGE_MBX_LINK_STAT_CHANGE:
+   case HCLGE_MBX_ASSERTING_RESET:
/* set this mbx event as pending. This is required as we
 * might lose the interrupt event when the mbx task is busy
 * handling. This shall be cleared when mbx task just
@@ -242,6 +243,17 @@ void hclgevf_mbx_async_handler(struct hclgevf_dev *hdev)
hclgevf_update_speed_duplex(hdev, speed, duplex);
 
break;
+   case HCLGE_MBX_ASSERTING_RESET:
+   /* PF has asserted the reset, hence the VF should go into the
+* pending state and poll the hardware reset status till it
+* has been completely reset. After this the stack should
+* eventually be re-initialized.
+*/
+   hdev->nic.reset_level = HNAE3_VF_RESET;
+   set_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state);
+   hclgevf_reset_task_schedule(hdev);
+
+   break;
default:
dev_err(&hdev->pdev->dev,
"fetched unsupported(%d) message from arq\n",
-- 
2.7.4




[PATCH net-next 4/9] net: hns3: Add support to request VF Reset to PF

2018-03-22 Thread Salil Mehta
The VF driver depends upon the PF to eventually reset the hardware.
This request is made using a mailbox command. This patch adds the
function required to achieve the above.

Signed-off-by: Salil Mehta 
---
 .../net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 0d204e2..b648311 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -832,6 +832,20 @@ static void hclgevf_reset_tqp(struct hnae3_handle *handle, 
u16 queue_id)
 2, true, NULL, 0);
 }
 
+static int hclgevf_do_reset(struct hclgevf_dev *hdev)
+{
+   int status;
+   u8 respmsg;
+
+   status = hclgevf_send_mbx_msg(hdev, HCLGE_MBX_RESET, 0, NULL,
+ 0, false, &respmsg, sizeof(u8));
+   if (status)
+   dev_err(&hdev->pdev->dev,
+   "VF reset request to PF failed(=%d)\n", status);
+
+   return status;
+}
+
 static void hclgevf_reset_event(struct hnae3_handle *handle)
 {
struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
@@ -910,6 +924,7 @@ static void hclgevf_reset_service_task(struct work_struct 
*work)
 {
struct hclgevf_dev *hdev =
container_of(work, struct hclgevf_dev, rst_service_task);
+   int ret;
 
if (test_and_set_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state))
return;
@@ -965,6 +980,10 @@ static void hclgevf_reset_service_task(struct work_struct 
*work)
hdev->reset_attempts++;
 
/* request PF for resetting this VF via mailbox */
+   ret = hclgevf_do_reset(hdev);
+   if (ret)
+   dev_warn(&hdev->pdev->dev,
+"VF rst fail, stack will call\n");
}
}
 
-- 
2.7.4
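
The mailbox helper used above is the existing VF send routine in
hclgevf_mbx.c; its assumed prototype is:

int hclgevf_send_mbx_msg(struct hclgevf_dev *hdev, u16 code, u16 subcode,
			 const u8 *msg_data, u8 msg_len, bool need_resp,
			 u8 *resp_data, u16 resp_len);

so the call in hclgevf_do_reset() sends opcode HCLGE_MBX_RESET with no
payload and, with need_resp false, does not block waiting for a
synchronous response; the acknowledgement instead arrives later as the
async MBX_ASSERTING_RESET message (patch 8).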




[PATCH net-next 2/9] net: hns3: Add VF Reset Service Task to support event handling

2018-03-22 Thread Salil Mehta
VF reset involves handling different reset-related events from
the stack, the physical function, the mailbox etc. The reset
service task is used to service such reset event requests and,
later, to handle the waits for hardware completion and to
initiate the stack resets.

Signed-off-by: Salil Mehta 
---
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 31 ++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |  4 +++
 2 files changed, 35 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 6c3881d..cdb6e7a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -867,6 +867,15 @@ static void hclgevf_get_misc_vector(struct hclgevf_dev 
*hdev)
hdev->num_msi_used += 1;
 }
 
+void hclgevf_reset_task_schedule(struct hclgevf_dev *hdev)
+{
+   if (!test_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state) &&
+   !test_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state)) {
+   set_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state);
+   schedule_work(&hdev->rst_service_task);
+   }
+}
+
 static void hclgevf_mbx_task_schedule(struct hclgevf_dev *hdev)
 {
if (!test_and_set_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state))
@@ -889,6 +898,24 @@ static void hclgevf_service_timer(struct timer_list *t)
hclgevf_task_schedule(hdev);
 }
 
+static void hclgevf_reset_service_task(struct work_struct *work)
+{
+   struct hclgevf_dev *hdev =
+   container_of(work, struct hclgevf_dev, rst_service_task);
+
+   if (test_and_set_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state))
+   return;
+
+   clear_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state);
+
+   /* body of the reset service task will consist of hclge device
+* reset state handling. This code shall be added in subsequent
+* patches.
+*/
+
+   clear_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state);
+}
+
 static void hclgevf_mailbox_service_task(struct work_struct *work)
 {
struct hclgevf_dev *hdev;
@@ -1097,6 +1124,8 @@ static void hclgevf_state_init(struct hclgevf_dev *hdev)
INIT_WORK(&hdev->service_task, hclgevf_service_task);
clear_bit(HCLGEVF_STATE_SERVICE_SCHED, &hdev->state);
 
+   INIT_WORK(&hdev->rst_service_task, hclgevf_reset_service_task);
+
mutex_init(&hdev->mbx_resp.mbx_mutex);
 
/* bring the device down */
@@ -1113,6 +1142,8 @@ static void hclgevf_state_uninit(struct hclgevf_dev *hdev)
cancel_work_sync(&hdev->service_task);
if (hdev->mbx_service_task.func)
cancel_work_sync(&hdev->mbx_service_task);
+   if (hdev->rst_service_task.func)
+   cancel_work_sync(&hdev->rst_service_task);
 
mutex_destroy(&hdev->mbx_resp.mbx_mutex);
 }
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
index 0eaea06..8b5fa67 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
@@ -52,6 +52,8 @@ enum hclgevf_states {
HCLGEVF_STATE_DISABLED,
/* task states */
HCLGEVF_STATE_SERVICE_SCHED,
+   HCLGEVF_STATE_RST_SERVICE_SCHED,
+   HCLGEVF_STATE_RST_HANDLING,
HCLGEVF_STATE_MBX_SERVICE_SCHED,
HCLGEVF_STATE_MBX_HANDLING,
 };
@@ -146,6 +148,7 @@ struct hclgevf_dev {
 
struct timer_list service_timer;
struct work_struct service_task;
+   struct work_struct rst_service_task;
struct work_struct mbx_service_task;
 
struct hclgevf_tqp *htqp;
@@ -165,4 +168,5 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev);
 void hclgevf_update_link_status(struct hclgevf_dev *hdev, int link_state);
 void hclgevf_update_speed_duplex(struct hclgevf_dev *hdev, u32 speed,
 u8 duplex);
+void hclgevf_reset_task_schedule(struct hclgevf_dev *hdev);
 #endif
-- 
2.7.4




[PATCH net-next 3/9] net: hns3: Add VF Reset device state and its handling

2018-03-22 Thread Salil Mehta
This introduces the hclge device reset states of "requested" and
"pending" and also their handling in the context of the reset
service task.

The device gets into the requested state because of a VF reset
request asserted from the upper layers, for example due to watchdog
timeout expiration. The requested state eventually results in
forwarding the VF reset request to the PF, which actually resets
the VF.

The device will get into the pending state if:
1. VF receives the acknowledgement from PF for the VF reset
   request it originally sent to PF.
2. Reset Service Task detects that after asserting VF reset for
   certain times the data-path is not working and device then
   decides to assert full VF reset (this means also resetting the
   PCIe interface).
3. PF intimates the VF that it has undergone reset.
The pending state results in the VF polling for the hardware reset
completion status and then resetting the stack/enet layer, which
in turn means reinitializing the ring management/enet layer.

Note: support for 3. will be added later as a separate patch. This
decision should not affect VF reset as its event handling
is generic in nature.

Signed-off-by: Salil Mehta 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  1 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 67 --
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |  5 ++
 3 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 56f9e650..37ec1b3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -119,6 +119,7 @@ enum hnae3_reset_notify_type {
 
 enum hnae3_reset_type {
HNAE3_VF_RESET,
+   HNAE3_VF_FULL_RESET,
HNAE3_FUNC_RESET,
HNAE3_CORE_RESET,
HNAE3_GLOBAL_RESET,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index cdb6e7a..0d204e2 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -840,7 +840,9 @@ static void hclgevf_reset_event(struct hnae3_handle *handle)
 
handle->reset_level = HNAE3_VF_RESET;
 
-   /* request VF reset here. Code added later */
+   /* reset of this VF requested */
+   set_bit(HCLGEVF_RESET_REQUESTED, &hdev->reset_state);
+   hclgevf_reset_task_schedule(hdev);
 
handle->last_reset_time = jiffies;
 }
@@ -889,6 +891,12 @@ static void hclgevf_task_schedule(struct hclgevf_dev *hdev)
schedule_work(&hdev->service_task);
 }
 
+static void hclgevf_deferred_task_schedule(struct hclgevf_dev *hdev)
+{
+   if (test_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state))
+   hclgevf_reset_task_schedule(hdev);
+}
+
 static void hclgevf_service_timer(struct timer_list *t)
 {
struct hclgevf_dev *hdev = from_timer(hdev, t, service_timer);
@@ -908,10 +916,57 @@ static void hclgevf_reset_service_task(struct work_struct 
*work)
 
clear_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state);
 
-   /* body of the reset service task will constitute of hclge device
-* reset state handling. This code shall be added subsequently in
-* next patches.
-*/
+   if (test_and_clear_bit(HCLGEVF_RESET_PENDING,
+  &hdev->reset_state)) {
+   /* PF has intimated that it is about to reset the hardware.
+* We now have to poll & check if hardware has actually completed
+* the reset sequence. On hardware reset completion, VF needs to
+* reset the client and ae device.
+*/
+   hdev->reset_attempts = 0;
+
+   /* code to check/wait for hardware reset completion and the
+* further initiating software stack reset would be added here
+*/
+
+   } else if (test_and_clear_bit(HCLGEVF_RESET_REQUESTED,
+ &hdev->reset_state)) {
+   /* we could be here when either of below happens:
+* 1. reset was initiated due to watchdog timeout due to
+*a. IMP was earlier reset and our TX got choked down and
+*   which resulted in watchdog reacting and inducing VF
+*   reset. This also means our cmdq would be unreliable.
+*b. problem in TX due to other lower layer(example link
+*   layer not functioning properly etc.)
+* 2. VF reset might have been initiated due to some config
+*change.
+*
+* NOTE: There's no clear way to detect the above cases other than to
+* react to the response of PF for this reset request. PF will ack the
+* 1b and 2. cases but we will not get any intimation about 1a
+* from PF a

[PATCH net-next 1/9] net: hns3: Changes to make enet watchdog timeout func common for PF/VF

2018-03-22 Thread Salil Mehta
The HNS3 driver's enet layer, used for ring management and stack
interaction, is common to both VF and PF. The PF already supports
reset functionality to handle the network stack watchdog timeout
trigger, but the existing code is not generic enough to be used to
support VF reset as well.
This patch does the following:
1. Makes the existing watchdog timeout handler in the enet layer
   generic, i.e. suitable for both VF and PF,
2. Introduces the new reset event handler for the VF code, and
3. Changes the existing reset event handler of the PF code to
   initialize the reset level.

Signed-off-by: Salil Mehta 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  7 +++--
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 30 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|  2 --
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 36 --
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 14 +
 5 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 9daa88d..56f9e650 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -118,6 +118,7 @@ enum hnae3_reset_notify_type {
 };
 
 enum hnae3_reset_type {
+   HNAE3_VF_RESET,
HNAE3_FUNC_RESET,
HNAE3_CORE_RESET,
HNAE3_GLOBAL_RESET,
@@ -400,8 +401,7 @@ struct hnae3_ae_ops {
int (*set_vf_vlan_filter)(struct hnae3_handle *handle, int vfid,
  u16 vlan, u8 qos, __be16 proto);
int (*enable_hw_strip_rxvtag)(struct hnae3_handle *handle, bool enable);
-   void (*reset_event)(struct hnae3_handle *handle,
-   enum hnae3_reset_type reset);
+   void (*reset_event)(struct hnae3_handle *handle);
void (*get_channels)(struct hnae3_handle *handle,
 struct ethtool_channels *ch);
void (*get_tqps_and_rss_info)(struct hnae3_handle *h,
@@ -495,6 +495,9 @@ struct hnae3_handle {
struct hnae3_ae_algo *ae_algo;  /* the class who provides this handle */
u64 flags; /* Indicate the capabilities for this handle*/
 
+   unsigned long last_reset_time;
+   enum hnae3_reset_type reset_level;
+
union {
struct net_device *netdev; /* first member */
struct hnae3_knic_private_info kinfo;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 0b4a676..40a3eb7 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -320,7 +320,7 @@ static int hns3_nic_net_open(struct net_device *netdev)
return ret;
}
 
-   priv->last_reset_time = jiffies;
+   priv->ae_handle->last_reset_time = jiffies;
return 0;
 }
 
@@ -1543,7 +1543,6 @@ static bool hns3_get_tx_timeo_queue_info(struct 
net_device *ndev)
 static void hns3_nic_net_timeout(struct net_device *ndev)
 {
struct hns3_nic_priv *priv = netdev_priv(ndev);
-   unsigned long last_reset_time = priv->last_reset_time;
struct hnae3_handle *h = priv->ae_handle;
 
if (!hns3_get_tx_timeo_queue_info(ndev))
@@ -1551,24 +1550,12 @@ static void hns3_nic_net_timeout(struct net_device 
*ndev)
 
priv->tx_timeout_count++;
 
-   /* This timeout is far away enough from last timeout,
-* if timeout again,set the reset type to PF reset
-*/
-   if (time_after(jiffies, (last_reset_time + 20 * HZ)))
-   priv->reset_level = HNAE3_FUNC_RESET;
-
-   /* Don't do any new action before the next timeout */
-   else if (time_before(jiffies, (last_reset_time + ndev->watchdog_timeo)))
+   if (time_before(jiffies, (h->last_reset_time + ndev->watchdog_timeo)))
return;
 
-   priv->last_reset_time = jiffies;
-
+   /* request the reset */
if (h->ae_algo->ops->reset_event)
-   h->ae_algo->ops->reset_event(h, priv->reset_level);
-
-   priv->reset_level++;
-   if (priv->reset_level > HNAE3_GLOBAL_RESET)
-   priv->reset_level = HNAE3_GLOBAL_RESET;
+   h->ae_algo->ops->reset_event(h);
 }
 
 static const struct net_device_ops hns3_nic_netdev_ops = {
@@ -3122,8 +3109,8 @@ static int hns3_client_init(struct hnae3_handle *handle)
priv->dev = &pdev->dev;
priv->netdev = netdev;
priv->ae_handle = handle;
-   priv->last_reset_time = jiffies;
-   priv->reset_level = HNAE3_FUNC_RESET;
+   priv->ae_handle->reset_level = HNAE3_NONE_RESET;
+   priv->ae_handle->last_reset_time = jiffies;
priv->tx_timeout_count = 0;
 
handle->kinfo.netdev = netdev;
@@ -3355,7 +3342,6 @@ static int hns3_reset_notify_down_enet(struct 
hnae3_handle *handle)
 static int hns3_reset_notify_up_enet(struct hnae3_handle *handle)
 {
struct hnae3_knic_private_info *k

  1   2   3   4   >