Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

2017-03-15 Thread Alexei Starovoitov
On Wed, Mar 15, 2017 at 07:48:04PM -0700, Eric Dumazet wrote:
> On Wed, Mar 15, 2017 at 6:56 PM, Alexei Starovoitov wrote:
> > On Wed, Mar 15, 2017 at 06:07:16PM -0700, Eric Dumazet wrote:
> >> On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
> >>
> >> > yes. and we have 'xdp_tx_full' counter for it that we monitor.
> >> > When tx ring and mtu are sized properly, we don't expect to see it
> >> > incrementing at all. This is something in our control. 'Our' means
> >> > humans that set up the environment.
> >> > 'cache empty' condition is something ephemeral. Packets will be dropped
> >> > and we won't have any tools to address it. These packets are real
> >> > people's requests. Any drop needs to be categorized.
> >> > Like there is 'rx_fifo_errors' counter that mlx4 reports when
> >> > hw is dropping packets before they reach the driver. We see it
> >> > incrementing depending on the traffic pattern though overall Mpps
> >> > through the nic is not too high; this is something we are
> >> > actively investigating too.
> >>
> >>
> >> This all looks nice, except that the current mlx4 driver does not have a
> >> counter of failed allocations.
> >>
> >> You are asking me to keep some nonexistent functionality.
> >>
> >> Look in David's net tree:
> >>
> >> mlx4_en_refill_rx_buffers()
> >> ...
> >>if (mlx4_en_prepare_rx_desc(...))
> >>   break;
> >>
> >>
> >> So in case of memory pressure, mlx4 stops working and not a single
> >> counter is incremented/reported.
> >
> > Not quite. That is exactly the case I'm asking to keep.
> > In case of memory pressure (like in the above case rx fifo
> > won't be replenished) and in case of hw receiving
> > faster than the driver can drain the rx ring,
> > the hw will increment 'rx_fifo_errors' counter.
> 
> In the current mlx4 driver, if napi_get_frags() fails, no counter is incremented.
> 
> So you are describing quite a different behavior, where _cpu_ cannot
> keep up and rx_fifo_errors is incremented.
> 
> But in case of _memory_ pressure (and normal traffic), rx_fifo_errors
> won't be incremented.

When there is no room in the rx fifo, the hw will increment the counter.
That's the same as oom causing alloc failures and the rx ring not being replenished.
When there is nothing free in the rx ring to dma the packet to, the hw will
increment that counter.

> And even if xdp_prog 'decides' to return XDP_PASS, the fine packet
> will be dropped anyway.

Yes. And no counter will be incremented when the packet is dropped further
up the stack. So?
The discussion is specifically about changing the xdp rings' behavior for the xdp_tx case.

> > And that's what we monitor already and what I described in previous email.
> >
> >> Is it really what you want?
> >
> > almost. see below.
> >
> >>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +
> >>  1 file changed, 18 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> >> index cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> >> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> >> @@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >> if (xdp_prog) {
> >> struct xdp_buff xdp;
> >> struct page *npage;
> >> -   dma_addr_t ndma, dma;
> >> +   dma_addr_t dma;
> >> void *orig_data;
> >> u32 act;
> >>
> >> +   /* Make sure we have one page ready to replace this one, per Alexei request */
> >
> > do you have to add snarky comments?
> 
> Is request a bad or offensive word?
> 
> What would be the best way to say that you asked to move this code
> here, while in my opinion it was better where it was?

So, in other words, you're saying that you disagree with
all of my previous explanations of why the drop after xdp_tx is wrong?
And you think it's fine to execute the program,
do map lookups and rewrites, and at the last step drop the result
because the driver couldn't allocate a page to keep the rx ring populated, right?
I can try to see why that would be reasonable if you provide the arguments.

> >
> >> +   if (unlikely(!ring->page_cache.index)) {
> >> +   npage = mlx4_alloc_page(priv, ring,
> >> +   &ring->page_cache.buf[0].dma,
> >> +   numa_mem_id(),
> >> +   GFP_ATOMIC | __GFP_MEMALLOC);
> >> +   if (!npage) {
> >> +   /* replace this by a new 
> >> ring->rx_alloc_failed++
> >> +  

Re: [PATCH net-next 0/7] drivers: net: xgene: Bug fixes and errata workarounds

2017-03-15 Thread David Miller
From: Iyappan Subramanian 
Date: Wed, 15 Mar 2017 13:27:14 -0700

> This patch set addresses bug fixes and errata workarounds.
> 
> Signed-off-by: Iyappan Subramanian 
> Signed-off-by: Quan Nguyen 

Series applied.


Re: linux-next: WARNING: CPU: 1 PID: 19544 at net/bridge/br_fdb.c:109 br_fdb_find+0x19d/0x1b0

2017-03-15 Thread Cong Wang
On Wed, Mar 15, 2017 at 6:12 PM, Andrei Vagin  wrote:
> Hello,
>
> We execute CRIU tests for linux-next and here is a new warning:
> [  178.930950] [ cut here ]
> [  178.930960] WARNING: CPU: 1 PID: 19544 at net/bridge/br_fdb.c:109
> br_fdb_find+0x19d/0x1b0
> [  178.930961] Modules linked in:
> [  178.930966] CPU: 1 PID: 19544 Comm: criu Not tainted
> 4.11.0-rc1-next-20170310 #1
> [  178.930968] Hardware name: Google Google Compute Engine/Google
> Compute Engine, BIOS Google 01/01/2011
> [  178.930970] Call Trace:
> [  178.930976]  dump_stack+0x85/0xc9
> [  178.930982]  __warn+0xd1/0xf0
> [  178.930988]  warn_slowpath_null+0x1d/0x20
> [  178.930990]  br_fdb_find+0x19d/0x1b0
> [  178.930994]  br_fdb_change_mac_address+0x38/0x80
> [  178.930999]  br_stp_change_bridge_id+0x44/0x140
> [  178.931003]  br_dev_newlink+0x3f/0x70
> [  178.931009]  rtnl_newlink+0x68e/0x830
> [  178.931014]  ? netlink_broadcast_filtered+0x134/0x3e0
> [  178.931025]  ? get_partial_node.isra.76+0x4b/0x2a0
> [  178.931032]  ? nla_parse+0xa3/0x100
> [  178.931035]  ? nla_strlcpy+0x5b/0x70
> [  178.931037]  ? rtnl_link_ops_get+0x39/0x50
> [  178.931040]  ? rtnl_newlink+0x158/0x830
> [  178.931064]  rtnetlink_rcv_msg+0x95/0x230
> [  178.931068]  ? rtnl_newlink+0x830/0x830
> [  178.931072]  netlink_rcv_skb+0xa7/0xc0
> [  178.931075]  rtnetlink_rcv+0x2a/0x40
> [  178.931078]  netlink_unicast+0x15b/0x210
> [  178.931082]  netlink_sendmsg+0x31a/0x3a0
> [  178.931089]  sock_sendmsg+0x38/0x50
> [  178.931092]  ___sys_sendmsg+0x26c/0x280
> [  178.931099]  ? __generic_file_write_iter+0x19b/0x1e0
> [  178.931105]  ? up_write+0x1f/0x40
> [  178.931110]  ? ext4_file_write_iter+0xa4/0x390
> [  178.931120]  __sys_sendmsg+0x45/0x80
> [  178.931126]  SyS_sendmsg+0x12/0x20
> [  178.931130]  entry_SYSCALL_64_fastpath+0x23/0xc2
> [  178.931132] RIP: 0033:0x2b71009f99a0
> [  178.931134] RSP: 002b:7ffed7413c58 EFLAGS: 0246 ORIG_RAX: 002e
> [  178.931137] RAX: ffda RBX:  RCX: 2b71009f99a0
> [  178.931139] RDX:  RSI: 7ffed7413c90 RDI: 0002
> [  178.931140] RBP:  R08:  R09: 
> [  178.931142] R10: 2b710180cfe0 R11: 0246 R12: 0001
> [  178.931144] R13: 0004 R14: 7ffed74134b0 R15: 0003
> [  178.931152] ---[ end trace 61d5dd5e3b9abaf8 ]---
> [  178.931453] IPv6: ADDRCONF(NETDEV_UP): zdtmbr0: li
>
> All logs are here:
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/211220073/log.txt


Could the attached patch fix this false alarm?

Thanks.
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 4f598dc..ab09c3c 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -106,7 +106,7 @@ static struct net_bridge_fdb_entry *br_fdb_find(struct net_bridge *br,
struct hlist_head *head = &br->hash[br_mac_hash(addr, vid)];
struct net_bridge_fdb_entry *fdb;
 
-   WARN_ON_ONCE(!br_hash_lock_held(br));
+   lockdep_assert_held(&br->hash_lock);
 
rcu_read_lock();
fdb = fdb_find_rcu(head, addr, vid);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 2288fca..6136818 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -531,15 +531,6 @@ int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
 int br_fdb_external_learn_del(struct net_bridge *br, struct net_bridge_port *p,
  const unsigned char *addr, u16 vid);
 
-static inline bool br_hash_lock_held(struct net_bridge *br)
-{
-#ifdef CONFIG_LOCKDEP
-   return lockdep_is_held(&br->hash_lock);
-#else
-   return true;
-#endif
-}
-
 /* br_forward.c */
 enum br_pkt_type {
BR_PKT_UNICAST,
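For reference, the two assertions differ in one detail: lockdep_assert_held()
is guarded by debug_locks, so it stays quiet once lockdep has disabled itself
after an earlier splat, while the open-coded WARN_ON_ONCE(!br_hash_lock_held())
can then fire spuriously. Roughly, as defined in <linux/lockdep.h> around this
kernel version (approximate, shown for illustration only):

#ifdef CONFIG_LOCKDEP
#define lockdep_assert_held(l)	do {				\
		WARN_ON(debug_locks && !lockdep_is_held(l));	\
	} while (0)
#else
#define lockdep_assert_held(l)	do { (void)(l); } while (0)
#endif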


[PATCH v2] net: mvneta: support suspend and resume

2017-03-15 Thread Jane Li
Add basic support for handling suspend and resume.

Signed-off-by: Jane Li 
---
Since v1:
- add mvneta_conf_mbus_windows() and mvneta_bm_port_init() in mvneta_resume()

 drivers/net/ethernet/marvell/mvneta.c | 62 ---
 1 file changed, 58 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 61dd446..250017b 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -431,6 +431,7 @@ struct mvneta_port {
/* Flags for special SoC configurations */
bool neta_armada3700;
u16 rx_offset_correction;
+   const struct mbus_dram_target_info *dram_target_info;
 };
 
 /* The mvneta_tx_desc and mvneta_rx_desc structures describe the
@@ -4118,7 +4119,6 @@ static int mvneta_port_power_up(struct mvneta_port *pp, int phy_mode)
 /* Device initialization routine */
 static int mvneta_probe(struct platform_device *pdev)
 {
-   const struct mbus_dram_target_info *dram_target_info;
struct resource *res;
struct device_node *dn = pdev->dev.of_node;
struct device_node *phy_node;
@@ -4267,13 +4267,13 @@ static int mvneta_probe(struct platform_device *pdev)
 
pp->tx_csum_limit = tx_csum_limit;
 
-   dram_target_info = mv_mbus_dram_info();
+   pp->dram_target_info = mv_mbus_dram_info();
/* Armada3700 requires setting default configuration of Mbus
 * windows, however without using filled mbus_dram_target_info
 * structure.
 */
-   if (dram_target_info || pp->neta_armada3700)
-   mvneta_conf_mbus_windows(pp, dram_target_info);
+   if (pp->dram_target_info || pp->neta_armada3700)
+   mvneta_conf_mbus_windows(pp, pp->dram_target_info);
 
pp->tx_ring_size = MVNETA_MAX_TXD;
pp->rx_ring_size = MVNETA_MAX_RXD;
@@ -4405,6 +4405,59 @@ static int mvneta_remove(struct platform_device *pdev)
return 0;
 }
 
+#ifdef CONFIG_PM_SLEEP
+static int mvneta_suspend(struct device *device)
+{
+   struct net_device *dev = dev_get_drvdata(device);
+   struct mvneta_port *pp = netdev_priv(dev);
+
+   if (netif_running(dev))
+   mvneta_stop(dev);
+   netif_device_detach(dev);
+   clk_disable_unprepare(pp->clk_bus);
+   clk_disable_unprepare(pp->clk);
+   return 0;
+}
+
+static int mvneta_resume(struct device *device)
+{
+   struct platform_device *pdev = to_platform_device(device);
+   struct net_device *dev = dev_get_drvdata(device);
+   struct mvneta_port *pp = netdev_priv(dev);
+   int err;
+
+   clk_prepare_enable(pp->clk);
+   clk_prepare_enable(pp->clk_bus);
+   if (pp->dram_target_info || pp->neta_armada3700)
+   mvneta_conf_mbus_windows(pp, pp->dram_target_info);
+   if (pp->bm_priv) {
+   err = mvneta_bm_port_init(pdev, pp);
+   if (err < 0) {
+   dev_info(&pdev->dev, "use SW buffer management\n");
+   pp->bm_priv = NULL;
+   }
+   }
+   mvneta_defaults_set(pp);
+   err = mvneta_port_power_up(pp, pp->phy_interface);
+   if (err < 0) {
+   dev_err(device, "can't power up port\n");
+   return err;
+   }
+
+   if (pp->use_inband_status)
+   mvneta_fixed_link_update(pp, dev->phydev);
+
+   netif_device_attach(dev);
+   if (netif_running(dev))
+   mvneta_open(dev);
+   return 0;
+}
+#endif
+
+static const struct dev_pm_ops mvneta_pm_ops = {
+   SET_LATE_SYSTEM_SLEEP_PM_OPS(mvneta_suspend, mvneta_resume)
+};
+
 static const struct of_device_id mvneta_match[] = {
{ .compatible = "marvell,armada-370-neta" },
{ .compatible = "marvell,armada-xp-neta" },
@@ -4419,6 +4472,7 @@ static int mvneta_remove(struct platform_device *pdev)
.driver = {
.name = MVNETA_DRIVER_NAME,
.of_match_table = mvneta_match,
+   .pm = &mvneta_pm_ops,
},
 };
 
-- 
1.9.1



Re: [PATCH v2] net: mvneta: support suspend and resume

2017-03-15 Thread Jisheng Zhang
On Thu, 16 Mar 2017 11:19:10 +0800 Jane Li wrote:

> Add basic support for handling suspend and resume.
> 
> Signed-off-by: Jane Li 
> ---
> Since v1:
> - add mvneta_conf_mbus_windows() and mvneta_bm_port_init() in mvneta_resume()
> 
>  drivers/net/ethernet/marvell/mvneta.c | 62 ---
>  1 file changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 61dd446..250017b 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -431,6 +431,7 @@ struct mvneta_port {
>   /* Flags for special SoC configurations */
>   bool neta_armada3700;
>   u16 rx_offset_correction;
> + const struct mbus_dram_target_info *dram_target_info;
>  };
>  
>  /* The mvneta_tx_desc and mvneta_rx_desc structures describe the
> @@ -4118,7 +4119,6 @@ static int mvneta_port_power_up(struct mvneta_port *pp, int phy_mode)
>  /* Device initialization routine */
>  static int mvneta_probe(struct platform_device *pdev)
>  {
> - const struct mbus_dram_target_info *dram_target_info;
>   struct resource *res;
>   struct device_node *dn = pdev->dev.of_node;
>   struct device_node *phy_node;
> @@ -4267,13 +4267,13 @@ static int mvneta_probe(struct platform_device *pdev)
>  
>   pp->tx_csum_limit = tx_csum_limit;
>  
> - dram_target_info = mv_mbus_dram_info();
> + pp->dram_target_info = mv_mbus_dram_info();
>   /* Armada3700 requires setting default configuration of Mbus
>* windows, however without using filled mbus_dram_target_info
>* structure.
>*/
> - if (dram_target_info || pp->neta_armada3700)
> - mvneta_conf_mbus_windows(pp, dram_target_info);
> + if (pp->dram_target_info || pp->neta_armada3700)
> + mvneta_conf_mbus_windows(pp, pp->dram_target_info);
>  
>   pp->tx_ring_size = MVNETA_MAX_TXD;
>   pp->rx_ring_size = MVNETA_MAX_RXD;
> @@ -4405,6 +4405,59 @@ static int mvneta_remove(struct platform_device *pdev)
>   return 0;
>  }
>  
> +#ifdef CONFIG_PM_SLEEP
> +static int mvneta_suspend(struct device *device)
> +{
> + struct net_device *dev = dev_get_drvdata(device);
> + struct mvneta_port *pp = netdev_priv(dev);
> +
> + if (netif_running(dev))
> + mvneta_stop(dev);
> + netif_device_detach(dev);
> + clk_disable_unprepare(pp->clk_bus);
> + clk_disable_unprepare(pp->clk);
> + return 0;
> +}
> +
> +static int mvneta_resume(struct device *device)
> +{
> + struct platform_device *pdev = to_platform_device(device);
> + struct net_device *dev = dev_get_drvdata(device);
> + struct mvneta_port *pp = netdev_priv(dev);
> + int err;
> +
> + clk_prepare_enable(pp->clk);
> + clk_prepare_enable(pp->clk_bus);
> + if (pp->dram_target_info || pp->neta_armada3700)
> + mvneta_conf_mbus_windows(pp, pp->dram_target_info);
> + if (pp->bm_priv) {
> + err = mvneta_bm_port_init(pdev, pp);
> + if (err < 0) {
> + dev_info(&pdev->dev, "use SW buffer management\n");
> + pp->bm_priv = NULL;
> + }
> + }
> + mvneta_defaults_set(pp);
> + err = mvneta_port_power_up(pp, pp->phy_interface);
> + if (err < 0) {
> + dev_err(device, "can't power up port\n");
> + return err;
> + }
> +
> + if (pp->use_inband_status)
> + mvneta_fixed_link_update(pp, dev->phydev);
> +
> + netif_device_attach(dev);
> + if (netif_running(dev))
> + mvneta_open(dev);
> + return 0;
> +}
> +#endif
> +
> +static const struct dev_pm_ops mvneta_pm_ops = {
> + SET_LATE_SYSTEM_SLEEP_PM_OPS(mvneta_suspend, mvneta_resume)
> +};

we could make use of SIMPLE_DEV_PM_OPS to simplify the code, along these lines:
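A minimal sketch (SIMPLE_DEV_PM_OPS comes from <linux/pm.h>; note that it wires
the callbacks via SET_SYSTEM_SLEEP_PM_OPS, i.e. the default sleep phase rather
than the _LATE_ variant used in the patch):

/* one line replaces the explicit dev_pm_ops definition above */
static SIMPLE_DEV_PM_OPS(mvneta_pm_ops, mvneta_suspend, mvneta_resume);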

Thanks


Re: [PATCH] net: mvneta: support suspend and resume

2017-03-15 Thread Jane Li

Hi Jisheng,


On March 15, 2017 at 15:18, Jisheng Zhang wrote:

Hi Jane,

On Wed, 15 Mar 2017 15:08:34 +0800 Jane Li  wrote:


Add basic support for handling suspend and resume.

Signed-off-by: Jane Li 
---
  drivers/net/ethernet/marvell/mvneta.c | 44 +++
  1 file changed, 44 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 61dd446..4f16342 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -4405,6 +4405,49 @@ static int mvneta_remove(struct platform_device *pdev)
return 0;
  }
  
+#ifdef CONFIG_PM_SLEEP

+static int mvneta_suspend(struct device *device)
+{
+   struct net_device *dev = dev_get_drvdata(device);
+   struct mvneta_port *pp = netdev_priv(dev);
+
+   if (netif_running(dev))
+   mvneta_stop(dev);
+   netif_device_detach(dev);
+   clk_disable_unprepare(pp->clk_bus);
+   clk_disable_unprepare(pp->clk);
+   return 0;
+}
+
+static int mvneta_resume(struct device *device)
+{
+   struct net_device *dev = dev_get_drvdata(device);
+   struct mvneta_port *pp = netdev_priv(dev);
+   int err;
+
+   clk_prepare_enable(pp->clk);
+   clk_prepare_enable(pp->clk_bus);

we may be missing the necessary register settings from mvneta_bm_port_init() and
mvneta_conf_mbus_windows(); those registers also need to be restored.


Done. Add them in patch v2.

+   mvneta_defaults_set(pp);

before restoring the default settings, would it be safer to call mvneta_port_disable()?

Thanks,
Jisheng

During suspend, mvneta_port_disable() has already been executed via the path
below, so there seems to be no need to call mvneta_port_disable() again during resume.

mvneta_suspend() -> mvneta_stop() -> mvneta_stop_dev() -> mvneta_port_disable()


Thanks,
Jane

+   err = mvneta_port_power_up(pp, pp->phy_interface);
+   if (err < 0) {
+   dev_err(device, "can't power up port\n");
+   return err;
+   }
+
+   if (pp->use_inband_status)
+   mvneta_fixed_link_update(pp, dev->phydev);
+
+   netif_device_attach(dev);
+   if (netif_running(dev))
+   mvneta_open(dev);
+   return 0;
+}
+#endif
+
+static const struct dev_pm_ops mvneta_pm_ops = {
+   SET_LATE_SYSTEM_SLEEP_PM_OPS(mvneta_suspend, mvneta_resume)
+};
+
  static const struct of_device_id mvneta_match[] = {
{ .compatible = "marvell,armada-370-neta" },
{ .compatible = "marvell,armada-xp-neta" },
@@ -4419,6 +4462,7 @@ static int mvneta_remove(struct platform_device *pdev)
.driver = {
.name = MVNETA_DRIVER_NAME,
.of_match_table = mvneta_match,
+   .pm = &mvneta_pm_ops,
},
  };
  




Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

2017-03-15 Thread Eric Dumazet
On Wed, Mar 15, 2017 at 6:56 PM, Alexei Starovoitov wrote:
> On Wed, Mar 15, 2017 at 06:07:16PM -0700, Eric Dumazet wrote:
>> On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
>>
>> > yes. and we have 'xdp_tx_full' counter for it that we monitor.
>> > When tx ring and mtu are sized properly, we don't expect to see it
>> > incrementing at all. This is something in our control. 'Our' means
>> > humans that set up the environment.
>> > 'cache empty' condition is something ephemeral. Packets will be dropped
>> > and we won't have any tools to address it. These packets are real
>> > people's requests. Any drop needs to be categorized.
>> > Like there is 'rx_fifo_errors' counter that mlx4 reports when
>> > hw is dropping packets before they reach the driver. We see it
>> > incrementing depending on the traffic pattern though overall Mpps
>> > through the nic is not too high; this is something we are
>> > actively investigating too.
>>
>>
>> This all looks nice, except that the current mlx4 driver does not have a
>> counter of failed allocations.
>>
>> You are asking me to keep some nonexistent functionality.
>>
>> Look in David's net tree:
>>
>> mlx4_en_refill_rx_buffers()
>> ...
>>if (mlx4_en_prepare_rx_desc(...))
>>   break;
>>
>>
>> So in case of memory pressure, mlx4 stops working and not a single
>> counter is incremented/reported.
>
> Not quite. That is exactly the case I'm asking to keep.
> In case of memory pressure (like in the above case rx fifo
> won't be replenished) and in case of hw receiving
> faster than the driver can drain the rx ring,
> the hw will increment 'rx_fifo_errors' counter.

In the current mlx4 driver, if napi_get_frags() fails, no counter is incremented.

So you are describing quite a different behavior, where _cpu_ cannot
keep up and rx_fifo_errors is incremented.

But in case of _memory_ pressure (and normal traffic), rx_fifo_errors
won't be incremented.

And even if xdp_prog 'decides' to return XDP_PASS, the fine packet
will be dropped anyway.


> And that's what we monitor already and what I described in previous email.
>
>> Is it really what you want?
>
> almost. see below.
>
>>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +
>>  1 file changed, 18 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> index cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> @@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> if (xdp_prog) {
>> struct xdp_buff xdp;
>> struct page *npage;
>> -   dma_addr_t ndma, dma;
>> +   dma_addr_t dma;
>> void *orig_data;
>> u32 act;
>>
>> +   /* Make sure we have one page ready to replace this one, per Alexei request */
>
> do you have to add snarky comments?

Is request a bad or offensive word?

What would be the best way to say that you asked to move this code
here, while in my opinion it was better where it was?

>
>> +   if (unlikely(!ring->page_cache.index)) {
>> +   npage = mlx4_alloc_page(priv, ring,
>> +   &ring->page_cache.buf[0].dma,
>> +   numa_mem_id(),
>> +   GFP_ATOMIC | __GFP_MEMALLOC);
>> +   if (!npage) {
>> +   /* replace this by a new ring->rx_alloc_failed++
>> +*/
>> +   ring->xdp_drop++;
>
> counting it as 'xdp_drop' is incorrect.

I added a comment to make that very clear.

If you do not read the comment, what can I say?

So the comment is :

 replace this by a new ring->rx_alloc_failed++

This of course will require other changes in other files (folding
stats at ethtool -S) that are irrelevant for the discussion we have right now.

I won't provide a full patch without knowing exactly what you are requesting.


> 'xdp_drop' should be incremented only when the program is actually doing it,
> otherwise that's confusing to the user.
> If you add a new counter 'rx_alloc_failed' (as you imply above)
> then it's similar to the existing state.
> Before: for both hw receives too much and oom with rx fifo empty - we
> will see 'rx_fifo_errors' counter.
> After: most rx_fifo_errors would mean hw receive issues and oom will
> be covered by this new counter.
>
> Another thought... if we do this 'replenish rx ring immediately'
> why do it for xdp rings only? Let's do it for all rings?

Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

2017-03-15 Thread Alexei Starovoitov
On Wed, Mar 15, 2017 at 06:07:16PM -0700, Eric Dumazet wrote:
> On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
> 
> > yes. and we have 'xdp_tx_full' counter for it that we monitor.
> > When tx ring and mtu are sized properly, we don't expect to see it
> > incrementing at all. This is something in our control. 'Our' means
> > humans that set up the environment.
> > 'cache empty' condition is something ephemeral. Packets will be dropped
> > and we won't have any tools to address it. These packets are real
> > people's requests. Any drop needs to be categorized.
> > Like there is 'rx_fifo_errors' counter that mlx4 reports when
> > hw is dropping packets before they reach the driver. We see it
> > incrementing depending on the traffic pattern though overall Mpps
> > through the nic is not too high; this is something we are
> > actively investigating too.
> 
> 
> This all looks nice, except that the current mlx4 driver does not have a
> counter of failed allocations.
> 
> You are asking me to keep some nonexistent functionality.
> 
> Look in David's net tree:
> 
> mlx4_en_refill_rx_buffers()
> ...
>if (mlx4_en_prepare_rx_desc(...))
>   break;
> 
> 
> So in case of memory pressure, mlx4 stops working and not a single
> counter is incremented/reported.

Not quite. That is exactly the case I'm asking to keep.
In case of memory pressure (like in the above case rx fifo
won't be replenished) and in case of hw receiving
faster than the driver can drain the rx ring,
the hw will increment 'rx_fifo_errors' counter.
And that's what we monitor already and what I described in previous email.

> Is it really what you want?

almost. see below.

>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +
>  1 file changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> if (xdp_prog) {
> struct xdp_buff xdp;
> struct page *npage;
> -   dma_addr_t ndma, dma;
> +   dma_addr_t dma;
> void *orig_data;
> u32 act;
>  
> +   /* Make sure we have one page ready to replace this one, per Alexei request */

do you have to add snarky comments?

> +   if (unlikely(!ring->page_cache.index)) {
> +   npage = mlx4_alloc_page(priv, ring,
> +   &ring->page_cache.buf[0].dma,
> +   numa_mem_id(),
> +   GFP_ATOMIC | __GFP_MEMALLOC);
> +   if (!npage) {
> +   /* replace this by a new ring->rx_alloc_failed++
> +*/
> +   ring->xdp_drop++;

counting it as 'xdp_drop' is incorrect.
'xdp_drop' should be incremented only when the program is actually doing it,
otherwise that's confusing to the user.
If you add a new counter 'rx_alloc_failed' (as you imply above)
then it's similar to the existing state.
Before: for both hw receives too much and oom with rx fifo empty - we
will see 'rx_fifo_errors' counter.
After: most rx_fifo_errors would mean hw receive issues and oom will
be covered by this new counter.

Another thought... if we do this 'replenish rx ring immediately'
why do it for xdp rings only? Let's do it for all rings?
the above 'if (unlikely(!ring->page_cache.index)) ..alloc_page'
can move before 'if (xdp_prog)' and simplify the rest?
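A rough sketch of that restructuring (hypothetical and untested; names follow
the quoted mlx4 diff, and rx_alloc_failed is the not-yet-existing counter
discussed in this thread):

	/* replenish the one-page cache for every ring, before the XDP
	 * branch, so XDP and non-XDP paths share the early-drop-on-oom
	 * behavior
	 */
	if (unlikely(!ring->page_cache.index)) {
		npage = mlx4_alloc_page(priv, ring,
					&ring->page_cache.buf[0].dma,
					numa_mem_id(),
					GFP_ATOMIC | __GFP_MEMALLOC);
		if (!npage) {
			ring->rx_alloc_failed++;	/* new counter, not xdp_drop */
			goto next;
		}
		ring->page_cache.buf[0].page = npage;
		ring->page_cache.index = 1;
	}
	if (xdp_prog) {
		/* ... existing XDP path, now guaranteed a replacement page ... */
	}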



[PATCH net-next 1/6] bpf: move fixup_bpf_calls() function

2017-03-15 Thread Alexei Starovoitov
No functional change.
Move fixup_bpf_calls() to verifier.c;
it's being refactored in the next patch.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/syscall.c  | 56 --
 kernel/bpf/verifier.c | 57 +++
 2 files changed, 57 insertions(+), 56 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7af0dcc5d755..48c914b983bd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -586,59 +586,6 @@ void bpf_register_prog_type(struct bpf_prog_type_list *tl)
list_add(&tl->list_node, &bpf_prog_types);
 }
 
-/* fixup insn->imm field of bpf_call instructions:
- * if (insn->imm == BPF_FUNC_map_lookup_elem)
- *  insn->imm = bpf_map_lookup_elem - __bpf_call_base;
- * else if (insn->imm == BPF_FUNC_map_update_elem)
- *  insn->imm = bpf_map_update_elem - __bpf_call_base;
- * else ...
- *
- * this function is called after eBPF program passed verification
- */
-static void fixup_bpf_calls(struct bpf_prog *prog)
-{
-   const struct bpf_func_proto *fn;
-   int i;
-
-   for (i = 0; i < prog->len; i++) {
-   struct bpf_insn *insn = &prog->insnsi[i];
-
-   if (insn->code == (BPF_JMP | BPF_CALL)) {
-   /* we reach here when program has bpf_call instructions
-* and it passed bpf_check(), means that
-* ops->get_func_proto must have been supplied, check it
-*/
-   BUG_ON(!prog->aux->ops->get_func_proto);
-
-   if (insn->imm == BPF_FUNC_get_route_realm)
-   prog->dst_needed = 1;
-   if (insn->imm == BPF_FUNC_get_prandom_u32)
-   bpf_user_rnd_init_once();
-   if (insn->imm == BPF_FUNC_xdp_adjust_head)
-   prog->xdp_adjust_head = 1;
-   if (insn->imm == BPF_FUNC_tail_call) {
-   /* mark bpf_tail_call as different opcode
-* to avoid conditional branch in
-* interpeter for every normal call
-* and to prevent accidental JITing by
-* JIT compiler that doesn't support
-* bpf_tail_call yet
-*/
-   insn->imm = 0;
-   insn->code |= BPF_X;
-   continue;
-   }
-
-   fn = prog->aux->ops->get_func_proto(insn->imm);
-   /* all functions that have prototype and verifier allowed
-* programs to call them, must be real in-kernel functions
-*/
-   BUG_ON(!fn->func);
-   insn->imm = fn->func - __bpf_call_base;
-   }
-   }
-}
-
 /* drop refcnt on maps used by eBPF program and free auxilary data */
 static void free_used_maps(struct bpf_prog_aux *aux)
 {
@@ -892,9 +839,6 @@ static int bpf_prog_load(union bpf_attr *attr)
if (err < 0)
goto free_used_maps;
 
-   /* fixup BPF_CALL->imm field */
-   fixup_bpf_calls(prog);
-
/* eBPF program is ready to be JITed */
prog = bpf_prog_select_runtime(prog, &err);
if (err < 0)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 796b68d00119..e41da6c57053 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3233,6 +3233,60 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
return 0;
 }
 
+/* fixup insn->imm field of bpf_call instructions:
+ * if (insn->imm == BPF_FUNC_map_lookup_elem)
+ *  insn->imm = bpf_map_lookup_elem - __bpf_call_base;
+ * else if (insn->imm == BPF_FUNC_map_update_elem)
+ *  insn->imm = bpf_map_update_elem - __bpf_call_base;
+ * else ...
+ *
+ * this function is called after eBPF program passed verification
+ */
+static void fixup_bpf_calls(struct bpf_prog *prog)
+{
+   const struct bpf_func_proto *fn;
+   int i;
+
+   for (i = 0; i < prog->len; i++) {
+   struct bpf_insn *insn = &prog->insnsi[i];
+
+   if (insn->code == (BPF_JMP | BPF_CALL)) {
+   /* we reach here when program has bpf_call instructions
+* and it passed bpf_check(), means that
+* ops->get_func_proto must have been supplied, check it
+*/
+   BUG_ON(!prog->aux->ops->get_func_proto);
+
+   if (insn->imm == BPF_FUNC_get_route_realm)
+   prog->dst_needed = 1;
+   if (insn->imm == BPF_FUNC_get_prandom_u32)
+   bpf_user_rnd_init_once();
+  

[PATCH net-next 5/6] bpf: inline htab_map_lookup_elem()

2017-03-15 Thread Alexei Starovoitov
Optimize:
bpf_call
  bpf_map_lookup_elem
map->ops->map_lookup_elem
  htab_map_lookup_elem
__htab_map_lookup_elem
into:
bpf_call
  __htab_map_lookup_elem

to improve performance of JITed programs.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/hashtab.c | 31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index afe5bab376c9..000153acb6d5 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -419,7 +419,11 @@ static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
return NULL;
 }
 
-/* Called from syscall or from eBPF program */
+/* Called from syscall or from eBPF program directly, so
+ * arguments have to match bpf_map_lookup_elem() exactly.
+ * The return value is adjusted by BPF instructions
+ * in htab_map_gen_lookup().
+ */
 static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
 {
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
@@ -451,6 +455,30 @@ static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
return NULL;
 }
 
+/* inline bpf_map_lookup_elem() call.
+ * Instead of:
+ * bpf_prog
+ *   bpf_map_lookup_elem
+ * map->ops->map_lookup_elem
+ *   htab_map_lookup_elem
+ * __htab_map_lookup_elem
+ * do:
+ * bpf_prog
+ *   __htab_map_lookup_elem
+ */
+static u32 htab_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
+{
+   struct bpf_insn *insn = insn_buf;
+   const int ret = BPF_REG_0;
+
+   *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem);
+   *insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
+   *insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
+   offsetof(struct htab_elem, key) +
+   round_up(map->key_size, 8));
+   return insn - insn_buf;
+}
+
 static void *htab_lru_map_lookup_elem(struct bpf_map *map, void *key)
 {
struct htab_elem *l = __htab_map_lookup_elem(map, key);
@@ -1062,6 +1090,7 @@ static const struct bpf_map_ops htab_ops = {
.map_lookup_elem = htab_map_lookup_elem,
.map_update_elem = htab_map_update_elem,
.map_delete_elem = htab_map_delete_elem,
+   .map_gen_lookup = htab_map_gen_lookup,
 };
 
 static struct bpf_map_type_list htab_type __ro_after_init = {
-- 
2.8.0
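For readers following along, the three instructions emitted by
htab_map_gen_lookup() correspond roughly to this C logic (an illustrative
sketch, not kernel code; the value offset mirrors htab_map_lookup_elem()
in the patch above):

static void *inlined_htab_lookup(struct bpf_map *map, void *key)
{
	/* BPF_EMIT_CALL: direct call, with bpf_map_lookup_elem() gone */
	struct htab_elem *l = __htab_map_lookup_elem(map, key);

	if (!l)		/* BPF_JMP_IMM(BPF_JEQ, ret, 0, 1) */
		return NULL;
	/* BPF_ALU64_IMM(BPF_ADD, ...): step past the element header and
	 * the 8-byte-aligned key to reach the value, i.e. the same address
	 * as 'l->key + round_up(map->key_size, 8)' in htab_map_lookup_elem()
	 */
	return (void *)((char *)l + offsetof(struct htab_elem, key) +
			round_up(map->key_size, 8));
}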



[PATCH net-next 4/6] bpf: add helper inlining infra and optimize map_array lookup

2017-03-15 Thread Alexei Starovoitov
Optimize bpf_call -> bpf_map_lookup_elem() -> array_map_lookup_elem()
into a sequence of bpf instructions.
When the JIT is on, that sequence of bpf instructions becomes a sequence
of native cpu instructions with significantly faster performance
than an indirect call plus two functions' prologues/epilogues.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 include/linux/bpf.h  |  1 +
 include/linux/bpf_verifier.h |  5 -
 include/linux/filter.h   | 10 ++
 kernel/bpf/arraymap.c| 29 +
 kernel/bpf/verifier.c| 36 +---
 5 files changed, 77 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 909fc033173a..da8c64ca8dc9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -35,6 +35,7 @@ struct bpf_map_ops {
void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
int fd);
void (*map_fd_put_ptr)(void *ptr);
+   u32 (*map_gen_lookup)(struct bpf_map *map, struct bpf_insn *insn_buf);
 };
 
 struct bpf_map {
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index a13b031dc6b8..5efb4db44e1e 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -66,7 +66,10 @@ struct bpf_verifier_state_list {
 };
 
 struct bpf_insn_aux_data {
-   enum bpf_reg_type ptr_type; /* pointer type for load/store insns */
+   union {
+   enum bpf_reg_type ptr_type; /* pointer type for load/store insns */
+   struct bpf_map *map_ptr;/* pointer for call insn into lookup_elem */
+   };
 };
 
 #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fbf7b39e8103..dffa072b7b79 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -693,6 +693,11 @@ static inline bool bpf_jit_is_ebpf(void)
 # endif
 }
 
+static inline bool ebpf_jit_enabled(void)
+{
+   return bpf_jit_enable && bpf_jit_is_ebpf();
+}
+
 static inline bool bpf_prog_ebpf_jited(const struct bpf_prog *fp)
 {
return fp->jited && bpf_jit_is_ebpf();
@@ -753,6 +758,11 @@ void bpf_prog_kallsyms_del(struct bpf_prog *fp);
 
 #else /* CONFIG_BPF_JIT */
 
+static inline bool ebpf_jit_enabled(void)
+{
+   return false;
+}
+
 static inline bool bpf_prog_ebpf_jited(const struct bpf_prog *fp)
 {
return false;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 6b6f41f0b211..bcf9955fac95 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1,4 +1,5 @@
 /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ * Copyright (c) 2016,2017 Facebook
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -113,6 +114,33 @@ static void *array_map_lookup_elem(struct bpf_map *map, void *key)
return array->value + array->elem_size * index;
 }
 
+/* emit BPF instructions equivalent to C code of array_map_lookup_elem() */
+static u32 array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
+{
+   struct bpf_array *array = container_of(map, struct bpf_array, map);
+   struct bpf_insn *insn = insn_buf;
+   u32 elem_size = array->elem_size;
+   const int ret = BPF_REG_0;
+   const int map_ptr = BPF_REG_1;
+   const int index = BPF_REG_2;
+
+   *insn++ = BPF_ALU64_IMM(BPF_ADD, map_ptr, offsetof(struct bpf_array, value));
+   *insn++ = BPF_LDX_MEM(BPF_W, ret, index, 0);
+   *insn++ = BPF_JMP_IMM(BPF_JGE, ret, array->map.max_entries,
+ elem_size == 1 ? 2 : 3);
+   if (elem_size == 1) {
+   /* nop */
+   } else if (is_power_of_2(elem_size)) {
+   *insn++ = BPF_ALU64_IMM(BPF_LSH, ret, ilog2(elem_size));
+   } else {
+   *insn++ = BPF_ALU64_IMM(BPF_MUL, ret, elem_size);
+   }
+   *insn++ = BPF_ALU64_REG(BPF_ADD, ret, map_ptr);
+   *insn++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1);
+   *insn++ = BPF_MOV64_IMM(ret, 0);
+   return insn - insn_buf;
+}
+
 /* Called from eBPF program */
 static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
 {
@@ -267,6 +295,7 @@ static const struct bpf_map_ops array_ops = {
.map_lookup_elem = array_map_lookup_elem,
.map_update_elem = array_map_update_elem,
.map_delete_elem = array_map_delete_elem,
+   .map_gen_lookup = array_map_gen_lookup,
 };
 
 static struct bpf_map_type_list array_type __ro_after_init = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2990fda1c6a5..90bf46787603 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1273,7 +1273,7 @@ static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
}
 }
 
-static int check_call(struct bpf_verifier_env *env, int func_id)

[PATCH net-next 2/6] bpf: refactor fixup_bpf_calls()

2017-03-15 Thread Alexei Starovoitov
Reduce the indentation and make it iterate over instructions, similar to
convert_ctx_accesses(). Also convert the hard BUG_ON into a soft verifier error.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c | 76 ---
 1 file changed, 35 insertions(+), 41 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e41da6c57053..5dfa9b8111da 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3233,59 +3233,53 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
return 0;
 }
 
-/* fixup insn->imm field of bpf_call instructions:
- * if (insn->imm == BPF_FUNC_map_lookup_elem)
- *  insn->imm = bpf_map_lookup_elem - __bpf_call_base;
- * else if (insn->imm == BPF_FUNC_map_update_elem)
- *  insn->imm = bpf_map_update_elem - __bpf_call_base;
- * else ...
+/* fixup insn->imm field of bpf_call instructions
  *
  * this function is called after eBPF program passed verification
  */
-static void fixup_bpf_calls(struct bpf_prog *prog)
+static int fixup_bpf_calls(struct bpf_verifier_env *env)
 {
+   struct bpf_prog *prog = env->prog;
+   struct bpf_insn *insn = prog->insnsi;
const struct bpf_func_proto *fn;
+   const int insn_cnt = prog->len;
int i;
 
-   for (i = 0; i < prog->len; i++) {
-   struct bpf_insn *insn = &prog->insnsi[i];
+   for (i = 0; i < insn_cnt; i++, insn++) {
+   if (insn->code != (BPF_JMP | BPF_CALL))
+   continue;
 
-   if (insn->code == (BPF_JMP | BPF_CALL)) {
-   /* we reach here when program has bpf_call instructions
-* and it passed bpf_check(), means that
-* ops->get_func_proto must have been supplied, check it
+   if (insn->imm == BPF_FUNC_get_route_realm)
+   prog->dst_needed = 1;
+   if (insn->imm == BPF_FUNC_get_prandom_u32)
+   bpf_user_rnd_init_once();
+   if (insn->imm == BPF_FUNC_xdp_adjust_head)
+   prog->xdp_adjust_head = 1;
+   if (insn->imm == BPF_FUNC_tail_call) {
+   /* mark bpf_tail_call as different opcode to avoid
+* conditional branch in the interpeter for every normal
+* call and to prevent accidental JITing by JIT compiler
+* that doesn't support bpf_tail_call yet
 */
-   BUG_ON(!prog->aux->ops->get_func_proto);
-
-   if (insn->imm == BPF_FUNC_get_route_realm)
-   prog->dst_needed = 1;
-   if (insn->imm == BPF_FUNC_get_prandom_u32)
-   bpf_user_rnd_init_once();
-   if (insn->imm == BPF_FUNC_xdp_adjust_head)
-   prog->xdp_adjust_head = 1;
-   if (insn->imm == BPF_FUNC_tail_call) {
-   /* mark bpf_tail_call as different opcode
-* to avoid conditional branch in
-* interpeter for every normal call
-* and to prevent accidental JITing by
-* JIT compiler that doesn't support
-* bpf_tail_call yet
-*/
-   insn->imm = 0;
-   insn->code |= BPF_X;
-   continue;
-   }
+   insn->imm = 0;
+   insn->code |= BPF_X;
+   continue;
+   }
 
-   fn = prog->aux->ops->get_func_proto(insn->imm);
-   /* all functions that have prototype and verifier allowed
-* programs to call them, must be real in-kernel functions
-*/
-   BUG_ON(!fn->func);
-   insn->imm = fn->func - __bpf_call_base;
+   fn = prog->aux->ops->get_func_proto(insn->imm);
+   /* all functions that have prototype and verifier allowed
+* programs to call them, must be real in-kernel functions
+*/
+   if (!fn->func) {
+   verbose("kernel subsystem misconfigured func %s#%d\n",
+   func_id_name(insn->imm), insn->imm);
+   return -EFAULT;
}
+   insn->imm = fn->func - __bpf_call_base;
}
-}
 
+   return 0;
+}
 
 static void free_states(struct bpf_verifier_env *env)
 {
@@ -3383,7 +3377,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
ret = convert_ctx_accesses(env);
 
if (ret == 0)
-   

[PATCH net-next 0/6] bpf: inline bpf_map_lookup_elem()

2017-03-15 Thread Alexei Starovoitov
bpf_map_lookup_elem() is one of the most frequently used helper functions.
Improve JITed program performance by inlining this helper.

bpf_map_type    before    after
hash            58M       74M
array           174M      280M

The values are lookups per second, in ideal conditions, as measured by the
micro-benchmark in patch 6.

The 'perf report' for HASH map type:
before:
54.23%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
14.24%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
 8.84%  map_perf_test  [kernel.kallsyms]  [k] htab_map_lookup_elem
 5.93%  map_perf_test  [kernel.kallsyms]  [k] bpf_map_lookup_elem
 2.30%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
 1.49%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler

after:
60.03%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
18.07%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
 2.91%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
 1.94%  map_perf_test  [kernel.kallsyms]  [k] _einittext
 1.90%  map_perf_test  [kernel.kallsyms]  [k] __audit_syscall_exit
 1.72%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler

so the cost of htab_map_lookup_elem() and bpf_map_lookup_elem()
is gone after inlining.

'per-cpu' and 'lru' map types can be optimized similarly in the future.

Note that sparse will complain that bpf is addictive ;)
kernel/bpf/hashtab.c:438:19: sparse: subtraction of functions? Share your drugs
kernel/bpf/verifier.c:3342:38: sparse: subtraction of functions? Share your drugs
it's not a new warning, just in new places.

Alexei Starovoitov (6):
  bpf: move fixup_bpf_calls() function
  bpf: refactor fixup_bpf_calls()
  bpf: adjust insn_aux_data when patching insns
  bpf: add helper inlining infra and optimize map_array lookup
  bpf: inline htab_map_lookup_elem()
  samples/bpf: add map_lookup microbenchmark

 include/linux/bpf.h  |   1 +
 include/linux/bpf_verifier.h |   5 +-
 include/linux/filter.h   |  10 +++
 kernel/bpf/arraymap.c|  29 +
 kernel/bpf/hashtab.c |  31 +-
 kernel/bpf/syscall.c |  56 -
 kernel/bpf/verifier.c| 129 ---
 samples/bpf/map_perf_test_kern.c |  33 ++
 samples/bpf/map_perf_test_user.c |  32 ++
 9 files changed, 261 insertions(+), 65 deletions(-)

-- 
2.8.0
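As a companion to the numbers above, the array-map gain (patch 4) comes from
emitting logic roughly equivalent to this C inline into the program (an
illustrative sketch of what array_map_gen_lookup() generates, not kernel code;
it mirrors array_map_lookup_elem()):

static void *inlined_array_lookup(struct bpf_array *array, u32 index)
{
	if (index >= array->map.max_entries)	/* BPF_JMP_IMM(BPF_JGE, ...) */
		return NULL;			/* BPF_MOV64_IMM(ret, 0) */
	/* shift or multiply by elem_size, then add the base pointer */
	return array->value + array->elem_size * index;
}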



[PATCH net-next 6/6] samples/bpf: add map_lookup microbenchmark

2017-03-15 Thread Alexei Starovoitov
$ map_perf_test 128
speed of HASH bpf_map_lookup_elem() in lookups per second
        w/o JIT  w/JIT
before  46M      58M
after   42M      74M

perf report
before:
54.23%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
14.24%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
 8.84%  map_perf_test  [kernel.kallsyms]  [k] htab_map_lookup_elem
 5.93%  map_perf_test  [kernel.kallsyms]  [k] bpf_map_lookup_elem
 2.30%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
 1.49%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler

after:
60.03%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
18.07%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
 2.91%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
 1.94%  map_perf_test  [kernel.kallsyms]  [k] _einittext
 1.90%  map_perf_test  [kernel.kallsyms]  [k] __audit_syscall_exit
 1.72%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler

Notice that bpf_map_lookup_elem() and htab_map_lookup_elem() are trivial
functions, yet they take a sizeable amount of cpu time.
htab_map_gen_lookup() removes bpf_map_lookup_elem() and converts
htab_map_lookup_elem() into three BPF insns, which causes the cpu time
for bpf_prog_da4fc6a3f41761a2() to increase slightly.

$ map_perf_test 256
speed of ARRAY bpf_map_lookup_elem() in lookups per second
        w/o JIT  w/JIT
before  97M      174M
after   64M      280M

before:
37.33%  map_perf_test  [kernel.kallsyms]  [k] array_map_lookup_elem
13.95%  map_perf_test  [kernel.kallsyms]  [k] bpf_map_lookup_elem
 6.54%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
 4.57%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler

after:
32.86%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
 6.54%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler

array_map_gen_lookup() removes calls to array_map_lookup_elem()
and bpf_map_lookup_elem() and replaces them with 7 bpf insns.

The performance without JIT is slower, since executing extra insns
in the interpreter is slower than running native C code,
but with JIT the performance gains are obvious,
since native C->x86 code is replaced with fewer bpf->x86 instructions.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 samples/bpf/map_perf_test_kern.c | 33 +
 samples/bpf/map_perf_test_user.c | 32 
 2 files changed, 65 insertions(+)

diff --git a/samples/bpf/map_perf_test_kern.c b/samples/bpf/map_perf_test_kern.c
index a91872a97742..9da2a3441b0a 100644
--- a/samples/bpf/map_perf_test_kern.c
+++ b/samples/bpf/map_perf_test_kern.c
@@ -65,6 +65,13 @@ struct bpf_map_def SEC("maps") lpm_trie_map_alloc = {
.map_flags = BPF_F_NO_PREALLOC,
 };
 
+struct bpf_map_def SEC("maps") array_map = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(long),
+   .max_entries = MAX_ENTRIES,
+};
+
 SEC("kprobe/sys_getuid")
 int stress_hmap(struct pt_regs *ctx)
 {
@@ -165,5 +172,31 @@ int stress_lpm_trie_map_alloc(struct pt_regs *ctx)
return 0;
 }
 
+SEC("kprobe/sys_getpgid")
+int stress_hash_map_lookup(struct pt_regs *ctx)
+{
+   u32 key = 1, i;
+   long *value;
+
+#pragma clang loop unroll(full)
+   for (i = 0; i < 64; ++i)
+   value = bpf_map_lookup_elem(&hash_map, &key);
+
+   return 0;
+}
+
+SEC("kprobe/sys_getpgrp")
+int stress_array_map_lookup(struct pt_regs *ctx)
+{
+   u32 key = 1, i;
+   long *value;
+
+#pragma clang loop unroll(full)
+   for (i = 0; i < 64; ++i)
+   value = bpf_map_lookup_elem(&array_map, &key);
+
+   return 0;
+}
+
 char _license[] SEC("license") = "GPL";
 u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index 680260a91f50..e29ff318a793 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -38,6 +38,8 @@ static __u64 time_get_ns(void)
 #define LRU_HASH_PREALLOC  (1 << 4)
 #define PERCPU_LRU_HASH_PREALLOC   (1 << 5)
 #define LPM_KMALLOC   (1 << 6)
+#define HASH_LOOKUP   (1 << 7)
+#define ARRAY_LOOKUP   (1 << 8)
 
 static int test_flags = ~0;
 
@@ -125,6 +127,30 @@ static void test_lpm_kmalloc(int cpu)
   cpu, MAX_CNT * 1000000000ll / (time_get_ns() - start_time));
 }
 
+static void test_hash_lookup(int cpu)
+{
+   __u64 start_time;
+   int i;
+
+   start_time = time_get_ns();
+   for (i = 0; i < MAX_CNT; i++)
+   syscall(__NR_getpgid, 0);
+   printf("%d:hash_lookup %lld lookups per sec\n",
+  cpu, MAX_CNT * 1000000000ll * 64 / (time_get_ns() - start_time));
+}
+
+static void test_array_lookup(int cpu)
+{
+   __u64 start_time;
+   int i;

[PATCH net-next 3/6] bpf: adjust insn_aux_data when patching insns

2017-03-15 Thread Alexei Starovoitov
convert_ctx_accesses() replaces a single bpf instruction with a set of
instructions. Adjust the corresponding insn_aux_data while patching.
This is needed to make sure subsequent 'for (all insn)' loops
have matching insn and insn_aux_data.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c | 44 +++-
 1 file changed, 39 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dfa9b8111da..2990fda1c6a5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3162,6 +3162,41 @@ static void convert_pseudo_ld_imm64(struct bpf_verifier_env *env)
insn->src_reg = 0;
 }
 
+/* single env->prog->insni[off] instruction was replaced with the range
+ * insni[off, off + cnt).  Adjust corresponding insn_aux_data by copying
+ * [0, off) and [off, end) to new locations, so the patched range stays zero
+ */
+static int adjust_insn_aux_data(struct bpf_verifier_env *env, u32 prog_len,
+   u32 off, u32 cnt)
+{
+   struct bpf_insn_aux_data *new_data, *old_data = env->insn_aux_data;
+
+   if (cnt == 1)
+   return 0;
+   new_data = vzalloc(sizeof(struct bpf_insn_aux_data) * prog_len);
+   if (!new_data)
+   return -ENOMEM;
+   memcpy(new_data, old_data, sizeof(struct bpf_insn_aux_data) * off);
+   memcpy(new_data + off + cnt - 1, old_data + off,
+  sizeof(struct bpf_insn_aux_data) * (prog_len - off - cnt + 1));
+   env->insn_aux_data = new_data;
+   vfree(old_data);
+   return 0;
+}
+
+static struct bpf_prog *bpf_patch_insn_data(struct bpf_verifier_env *env, u32 off,
+   const struct bpf_insn *patch, u32 len)
+{
+   struct bpf_prog *new_prog;
+
+   new_prog = bpf_patch_insn_single(env->prog, off, patch, len);
+   if (!new_prog)
+   return NULL;
+   if (adjust_insn_aux_data(env, new_prog->len, off, len))
+   return NULL;
+   return new_prog;
+}
+
 /* convert load instructions that access fields of 'struct __sk_buff'
  * into sequence of instructions that access fields of 'struct sk_buff'
  */
@@ -3181,10 +3216,10 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
verbose("bpf verifier is misconfigured\n");
return -EINVAL;
} else if (cnt) {
-   new_prog = bpf_patch_insn_single(env->prog, 0,
-insn_buf, cnt);
+   new_prog = bpf_patch_insn_data(env, 0, insn_buf, cnt);
if (!new_prog)
return -ENOMEM;
+
env->prog = new_prog;
delta += cnt - 1;
}
@@ -3209,7 +3244,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
else
continue;
 
-   if (env->insn_aux_data[i].ptr_type != PTR_TO_CTX)
+   if (env->insn_aux_data[i + delta].ptr_type != PTR_TO_CTX)
continue;
 
cnt = ops->convert_ctx_access(type, insn, insn_buf, env->prog);
@@ -3218,8 +3253,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
return -EINVAL;
}
 
-   new_prog = bpf_patch_insn_single(env->prog, i + delta, insn_buf,
-cnt);
+   new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
if (!new_prog)
return -ENOMEM;
 
-- 
2.8.0
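A small worked example of the index arithmetic in adjust_insn_aux_data()
(numbers chosen for illustration only):

/* old prog len = 10, off = 3, cnt = 4  =>  new prog_len = 13
 *
 *   new_data[0..2]  = old_data[0..2]   first memcpy: the [0, off) prefix
 *   new_data[6..12] = old_data[3..9]   second memcpy: aux for the patched
 *                                      insn and everything after it,
 *                                      shifted by cnt - 1
 *   new_data[3..5]  stay zeroed (vzalloc), so the freshly patched insns
 *                                      start with clean aux data
 */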



[PATCH] net: ipv6: set route type for anycast routes

2017-03-15 Thread David Ahern
Anycast routes have the RTF_ANYCAST flag set, but when dumping routes
for userspace the route type is not set to RTN_ANYCAST. Make it so.

Fixes: 58c4fb86eabcb ("[IPV6]: Flag RTF_ANYCAST for anycast routes")
CC: Hideaki YOSHIFUJI 
Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 35c58b669ebd..9db1418993f2 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3423,6 +3423,8 @@ static int rt6_fill_node(struct net *net,
}
else if (rt->rt6i_flags & RTF_LOCAL)
rtm->rtm_type = RTN_LOCAL;
+   else if (rt->rt6i_flags & RTF_ANYCAST)
+   rtm->rtm_type = RTN_ANYCAST;
else if (rt->dst.dev && (rt->dst.dev->flags & IFF_LOOPBACK))
rtm->rtm_type = RTN_LOCAL;
else
-- 
2.1.4



linux-next: WARNING: CPU: 1 PID: 19544 at net/bridge/br_fdb.c:109 br_fdb_find+0x19d/0x1b0

2017-03-15 Thread Andrei Vagin
Hello,

We execute CRIU tests for linux-next and here is a new warning:
[  178.930950] [ cut here ]
[  178.930960] WARNING: CPU: 1 PID: 19544 at net/bridge/br_fdb.c:109
br_fdb_find+0x19d/0x1b0
[  178.930961] Modules linked in:
[  178.930966] CPU: 1 PID: 19544 Comm: criu Not tainted
4.11.0-rc1-next-20170310 #1
[  178.930968] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[  178.930970] Call Trace:
[  178.930976]  dump_stack+0x85/0xc9
[  178.930982]  __warn+0xd1/0xf0
[  178.930988]  warn_slowpath_null+0x1d/0x20
[  178.930990]  br_fdb_find+0x19d/0x1b0
[  178.930994]  br_fdb_change_mac_address+0x38/0x80
[  178.930999]  br_stp_change_bridge_id+0x44/0x140
[  178.931003]  br_dev_newlink+0x3f/0x70
[  178.931009]  rtnl_newlink+0x68e/0x830
[  178.931014]  ? netlink_broadcast_filtered+0x134/0x3e0
[  178.931025]  ? get_partial_node.isra.76+0x4b/0x2a0
[  178.931032]  ? nla_parse+0xa3/0x100
[  178.931035]  ? nla_strlcpy+0x5b/0x70
[  178.931037]  ? rtnl_link_ops_get+0x39/0x50
[  178.931040]  ? rtnl_newlink+0x158/0x830
[  178.931064]  rtnetlink_rcv_msg+0x95/0x230
[  178.931068]  ? rtnl_newlink+0x830/0x830
[  178.931072]  netlink_rcv_skb+0xa7/0xc0
[  178.931075]  rtnetlink_rcv+0x2a/0x40
[  178.931078]  netlink_unicast+0x15b/0x210
[  178.931082]  netlink_sendmsg+0x31a/0x3a0
[  178.931089]  sock_sendmsg+0x38/0x50
[  178.931092]  ___sys_sendmsg+0x26c/0x280
[  178.931099]  ? __generic_file_write_iter+0x19b/0x1e0
[  178.931105]  ? up_write+0x1f/0x40
[  178.931110]  ? ext4_file_write_iter+0xa4/0x390
[  178.931120]  __sys_sendmsg+0x45/0x80
[  178.931126]  SyS_sendmsg+0x12/0x20
[  178.931130]  entry_SYSCALL_64_fastpath+0x23/0xc2
[  178.931132] RIP: 0033:0x2b71009f99a0
[  178.931134] RSP: 002b:7ffed7413c58 EFLAGS: 0246 ORIG_RAX: 002e
[  178.931137] RAX: ffda RBX:  RCX: 2b71009f99a0
[  178.931139] RDX:  RSI: 7ffed7413c90 RDI: 0002
[  178.931140] RBP:  R08:  R09: 
[  178.931142] R10: 2b710180cfe0 R11: 0246 R12: 0001
[  178.931144] R13: 0004 R14: 7ffed74134b0 R15: 0003
[  178.931152] ---[ end trace 61d5dd5e3b9abaf8 ]---
[  178.931453] IPv6: ADDRCONF(NETDEV_UP): zdtmbr0: li

All logs are here:
https://s3.amazonaws.com/archive.travis-ci.org/jobs/211220073/log.txt


Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

2017-03-15 Thread Eric Dumazet
On Wed, 2017-03-15 at 18:07 -0700, Eric Dumazet wrote:
> On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
> 
> > yes. and we have 'xdp_tx_full' counter for it that we monitor.
> > When tx ring and mtu are sized properly, we don't expect to see it
> > incrementing at all. This is something in our control. 'Our' means
> > humans that set up the environment.
> > 'cache empty' condition is something ephemeral. Packets will be dropped
> > and we won't have any tools to address it. These packets are real
> > people's requests. Any drop needs to be categorized.
> > Like there is 'rx_fifo_errors' counter that mlx4 reports when
> > hw is dropping packets before they reach the driver. We see it
> > incrementing depending on the traffic pattern though overall Mpps
> > through the nic is not too high; this is something we are
> > actively investigating too.
> 
> 
> This all looks nice, except that the current mlx4 driver does not have a
> counter of failed allocations.
> 
> You are asking me to keep some nonexistent functionality.
> 
> Look in David's net tree:
> 
> mlx4_en_refill_rx_buffers()
> ...
>if (mlx4_en_prepare_rx_desc(...))
>   break;
> 
> 
> So in case of memory pressure, mlx4 stops working and not a single
> counter is incremented/reported.
> 
> So I guess your monitoring was not really tested.
> 
> Just to show you what you are asking, here is a diff against the latest
> version.
> 
> You want to make sure a fresh page is there before calling the XDP program.
> 
> Is it really what you want?
> 
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +
>  1 file changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> if (xdp_prog) {
> struct xdp_buff xdp;
> struct page *npage;
> -   dma_addr_t ndma, dma;
> +   dma_addr_t dma;
> void *orig_data;
> u32 act;
>  
> +   /* Make sure we have one page ready to replace this one, per Alexei request */
> +   if (unlikely(!ring->page_cache.index)) {
> +   npage = mlx4_alloc_page(priv, ring,
> +   &ring->page_cache.buf[0].dma,
> +   numa_mem_id(),
> +   GFP_ATOMIC | __GFP_MEMALLOC);
> +   if (!npage) {
> +   /* replace this by a new ring->rx_alloc_failed++
> +*/
> +   ring->xdp_drop++;
> +   goto next;
> +   }
> +   ring->page_cache.buf[0].page = npage;

Plus a missing :
ring->page_cache.index = 1;

> +   }
> dma = frags[0].dma + frags[0].page_offset;
> dma_sync_single_for_cpu(priv->ddev, dma,
> priv->frag_info[0].frag_size,
> @@ -820,29 +834,13 @@ int mlx4_en_process_rx_cq(struct net_device *dev, 
> struct mlx4_en_cq *cq, int bud
> case XDP_PASS:
> break;
> case XDP_TX:
> -   /* Make sure we have one page ready to 
> replace this one */
> -   npage = NULL;
> -   if (!ring->page_cache.index) {
> -   npage = mlx4_alloc_page(priv, ring,
> > -   &ndma, numa_mem_id(),
> -   GFP_ATOMIC | 
> __GFP_MEMALLOC);
> -   if (!npage) {
> -   ring->xdp_drop++;
> -   goto next;
> -   }
> -   }
> if (likely(!mlx4_en_xmit_frame(ring, frags, 
> dev,
> length, cq->ring,
> &doorbell_pending))) {
> -   if (ring->page_cache.index) {
> -   u32 idx = 
> --ring->page_cache.index;
> +   
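
For reference, this is what the pre-XDP refill check above looks like
with the missing line folded in. A sketch assembled from the quoted
diff only, not a tested patch; all field and helper names are as quoted
in this thread.

    if (unlikely(!ring->page_cache.index)) {
            npage = mlx4_alloc_page(priv, ring,
                                    &ring->page_cache.buf[0].dma,
                                    numa_mem_id(),
                                    GFP_ATOMIC | __GFP_MEMALLOC);
            if (!npage) {
                    /* ideally a dedicated ring->rx_alloc_failed++ */
                    ring->xdp_drop++;
                    goto next;
            }
            ring->page_cache.buf[0].page = npage;
            ring->page_cache.index = 1;  /* the line the diff was missing */
    }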

Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

2017-03-15 Thread Eric Dumazet
On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:

> yes. and we have 'xdp_tx_full' counter for it that we monitor.
> When tx ring and mtu are sized properly, we don't expect to see it
> incrementing at all. This is something in our control. 'Our' means
> humans that setup the environment.
> 'cache empty' condition is something ephemeral. Packets will be dropped
> and we won't have any tools to address it. These packets are real
> people requests. Any drop needs to be categorized.
> Like there is 'rx_fifo_errors' counter that mlx4 reports when
> hw is dropping packets before they reach the driver. We see it
> incrementing depending on the traffic pattern, though overall Mpps
> through the nic is not too high, and this is something we are
> actively investigating too.


This all looks nice, except that current mlx4 driver does not have a
counter of failed allocations.

You are asking me to keep some nonexistent functionality.

Look in David's net tree:

mlx4_en_refill_rx_buffers()
...
   if (mlx4_en_prepare_rx_desc(...))
  break;


So in case of memory pressure, mlx4 stops working and not a single
counter is incremented/reported.

So I guess your supervision was not really tested.

Just to show you what you are asking, here is a diff against latest
version.

You want to make sure a fresh page is there before calling XDP program.

Is it really what you want ?

 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 
cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7
 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
if (xdp_prog) {
struct xdp_buff xdp;
struct page *npage;
-   dma_addr_t ndma, dma;
+   dma_addr_t dma;
void *orig_data;
u32 act;
 
+   /* Make sure we have one page ready to replace this 
one, per Alexei's request */
+   if (unlikely(!ring->page_cache.index)) {
+   npage = mlx4_alloc_page(priv, ring,
+   &ring->page_cache.buf[0].dma,
+   numa_mem_id(),
+   GFP_ATOMIC | 
__GFP_MEMALLOC);
+   if (!npage) {
+   /* replace this by a new 
ring->rx_alloc_failed++
+*/
+   ring->xdp_drop++;
+   goto next;
+   }
+   ring->page_cache.buf[0].page = npage;
+   }
dma = frags[0].dma + frags[0].page_offset;
dma_sync_single_for_cpu(priv->ddev, dma,
priv->frag_info[0].frag_size,
@@ -820,29 +834,13 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
case XDP_PASS:
break;
case XDP_TX:
-   /* Make sure we have one page ready to replace 
this one */
-   npage = NULL;
-   if (!ring->page_cache.index) {
-   npage = mlx4_alloc_page(priv, ring,
-   &ndma, numa_mem_id(),
-   GFP_ATOMIC | 
__GFP_MEMALLOC);
-   if (!npage) {
-   ring->xdp_drop++;
-   goto next;
-   }
-   }
if (likely(!mlx4_en_xmit_frame(ring, frags, dev,
length, cq->ring,
&doorbell_pending))) {
-   if (ring->page_cache.index) {
-   u32 idx = 
--ring->page_cache.index;
+   u32 idx = --ring->page_cache.index;
 
-   frags[0].page = 
ring->page_cache.buf[idx].page;
-   frags[0].dma = 
ring->page_cache.buf[idx].dma;
-   } else {
-   

Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

2017-03-15 Thread Alexei Starovoitov
On Wed, Mar 15, 2017 at 04:34:51PM -0700, Eric Dumazet wrote:
> > > > > -/* We recover from out of memory by scheduling our napi poll
> > > > > - * function (mlx4_en_process_cq), which tries to allocate
> > > > > - * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
> > > > > +/* Under memory pressure, each ring->rx_alloc_order might be lowered
> > > > > + * to very small values. Periodically increase it to its initial value for
> > > > > + * optimal allocations, in case stress is over.
> > > > >   */
> > > > > + for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
> > > > > + ring = priv->rx_ring[ring_ind];
> > > > > + order = min_t(unsigned int, ring->rx_alloc_order + 1,
> > > > > +   ring->rx_pref_alloc_order);
> > > > > + WRITE_ONCE(ring->rx_alloc_order, order);
> > > > 
> > > > when recycling is effective, in a matter of a few seconds it will
> > > > increase the order back to 10, and the first time the driver needs
> > > > to allocate, it will start that tedious failure loop all over again.
> > > > How about removing this periodic mlx4_en_recover_from_oom() completely
> > > > and switch to increase the order inside mlx4_alloc_page().
> > > > Like N successful __alloc_pages_node() with order X will bump it
> > > > into order X+1. If it fails next time it will do only one failed 
> > > > attempt.
> > > 
> > > I wanted to do the increase out of line. (not in the data path)
> > > 
> > > We probably could increase only if ring->rx_alloc_pages got a
> > > significant increase since the last mlx4_en_recover_from_oom() call.
> > > 
> > > (That would require a new ring->prior_rx_alloc_pages out of hot cache
> > > lines)
> > 
> > right. rx_alloc_pages can also be reduced to 16-bit and this
> > new one prior_rx_alloc_pages to 16-bit too, no?
> 
> Think about arches not having atomic 8-bit or 16-bit reads or writes.
> 
> READ_ONCE()/WRITE_ONCE() will not be usable.

I mean if you really want to squeeze space these two:
+   unsigned int pre_allocated_count;
+   unsigned int rx_alloc_order;

can become 16-bit and have room for 'rx_alloc_pages_without_fail'
that will count to a small N and then bump 'rx_alloc_order' by 1.
And since _oom() will be gone, there is no need for READ_ONCE()/WRITE_ONCE().

anyway, looking forward to your next version.
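
For illustration, the scheme sketched above (bump the order after N
consecutive successes, back off on failure) could look roughly like the
userspace model below. try_alloc_pages() is a stand-in for
__alloc_pages_node(); the threshold, the failure model and all names
are assumptions, not driver code.

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE       4096
#define MAX_ORDER_USED  10
#define BUMP_AFTER_N    16      /* assumed threshold */

struct ring_alloc_state {
    unsigned int order;         /* current allocation order */
    unsigned int ok_in_a_row;   /* successes since last failure */
};

static void *try_alloc_pages(unsigned int order)
{
    /* stand-in: pretend anything above order-3 fails */
    if (order > 3)
        return NULL;
    return malloc((size_t)PAGE_SIZE << order);
}

static void *ring_alloc(struct ring_alloc_state *s)
{
    void *p = try_alloc_pages(s->order);

    if (p) {
        if (++s->ok_in_a_row >= BUMP_AFTER_N &&
            s->order < MAX_ORDER_USED) {
            s->order++;                 /* bump after N successes */
            s->ok_in_a_row = 0;
        }
        return p;
    }
    s->ok_in_a_row = 0;
    if (s->order > 0)
        s->order--;                     /* only one failed attempt */
    return try_alloc_pages(s->order);   /* single retry, lower order */
}

int main(void)
{
    struct ring_alloc_state s = { .order = MAX_ORDER_USED };

    for (int i = 0; i < 40; i++) {
        void *p = ring_alloc(&s);

        printf("alloc %2d: order now %u (%s)\n",
               i, s.order, p ? "ok" : "failed");
        free(p);
    }
    return 0;
}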



[PATCH net-next] net/8021q: create device with all possible features in wanted_features

2017-03-15 Thread Andrei Vagin
wanted_features is a set of features which should be enabled if the
hardware allows it.

Currently when a vlan device is created, its wanted_features is set to
current features of its base device.

The problem is that the base device can get new features and they are
not propagated to the vlans of this device.

If we look at bonding devices, they don't have this problem, and this
patch suggests fixing this issue the same way it is handled for bonding
devices.

We met this problem when we tried to create a vlan device over a bonding
device. When a system is booting, real devices require time to be
initialized, so bonding devices are created without slaves; then vlan
devices are created, and only then are ethernet devices added to the
bonding device. As a result we have vlan devices with scatter-gather
disabled.

* create a bonding device
  $ ip link add bond0 type bond
  $ ethtool -k bond0 | grep scatter
  scatter-gather: off
tx-scatter-gather: off [requested on]
tx-scatter-gather-fraglist: off [requested on]

* create a vlan device
  $ ip link add link bond0 name bond0.10 type vlan id 10
  $ ethtool -k bond0.10 | grep scatter
  scatter-gather: off
tx-scatter-gather: off
tx-scatter-gather-fraglist: off

* Add a slave device to bond0
  $ ip link set dev eth0 master bond0

And now we can see that the bond0 device has got the scatter-gather
feature, but bond0.10 hasn't.
[root@laptop linux-task-diag]# ethtool -k bond0 | grep scatter
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
[root@laptop linux-task-diag]# ethtool -k bond0.10 | grep scatter
scatter-gather: off
tx-scatter-gather: off
tx-scatter-gather-fraglist: off

With this patch the vlan device will get all new features from the
bonding device.

Here is a call trace showing how the features set in this patch reach
dev->wanted_features.

register_netdevice
   vlan_dev_init
...
dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
   NETIF_F_FRAGLIST | NETIF_F_GSO_SOFTWARE |
   NETIF_F_HIGHDMA | NETIF_F_SCTP_CRC |
   NETIF_F_ALL_FCOE;

dev->features |= dev->hw_features;
...
dev->wanted_features = dev->features & dev->hw_features;
__netdev_update_features(dev);
vlan_dev_fix_features
   ...

Cc: Alexey Kuznetsov 
Cc: Patrick McHardy 
Cc: "David S. Miller" 
Signed-off-by: Andrei Vagin 
---
 net/8021q/vlan_dev.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 10da6c5..b9ad2f8 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -562,8 +562,7 @@ static int vlan_dev_init(struct net_device *dev)
   NETIF_F_HIGHDMA | NETIF_F_SCTP_CRC |
   NETIF_F_ALL_FCOE;
 
-   dev->features |= real_dev->vlan_features | NETIF_F_LLTX |
-NETIF_F_GSO_SOFTWARE;
+   dev->features |= dev->hw_features | NETIF_F_LLTX;
dev->gso_max_size = real_dev->gso_max_size;
dev->gso_max_segs = real_dev->gso_max_segs;
if (dev->features & NETIF_F_VLAN_FEATURES)
-- 
2.9.3
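
The effect of the one-line change can be modeled in a few lines of
userspace C (features reduced to a plain bitmask; names follow the
changelog, nothing here is kernel code):

#include <stdio.h>

typedef unsigned long features_t;

#define F_HW_CSUM   (1UL << 0)
#define F_SG        (1UL << 1)  /* scatter-gather */

int main(void)
{
    features_t hw_features = F_HW_CSUM | F_SG;
    features_t base_at_init = F_HW_CSUM;          /* bond0, no slaves yet */
    features_t base_later = F_HW_CSUM | F_SG;     /* slave added later */

    /* before: wanted follows the base device's features at init time,
     * so F_SG is never requested even after the slave shows up */
    features_t wanted_old = base_at_init & hw_features;

    /* after: want everything hw_features allows; the runtime state is
     * still masked against the base device in fix_features */
    features_t wanted_new = hw_features;

    printf("old: %#lx\n", wanted_old & base_later);  /* 0x1, no SG */
    printf("new: %#lx\n", wanted_new & base_later);  /* 0x3, SG on */
    return 0;
}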



Re: [PATCH v2 net-next 4/5] sunvnet: count multicast packets

2017-03-15 Thread Shannon Nelson

On 3/15/2017 1:50 AM, David Laight wrote:

From: Shannon Nelson

Sent: 14 March 2017 17:25

...

+   if (unlikely(is_multicast_ether_addr(eth_hdr(skb)->h_dest)))
+   dev->stats.multicast++;


I'd guess that:
dev->stats.multicast += is_multicast_ether_addr(eth_hdr(skb)->h_dest);
generates faster code.
Especially if is_multicast_ether_addr(x) is (*x >> 7).

David


Hi David, thanks for the comment.  My local instruction level 
performance guru is on vacation this week so I can't do a quick check 
with him today on this.  However, I'm not too worried here since the 
inline code for is_multicast_ether_addr() is simply


return 0x01 & addr[0];

and objdump tells me that on sparc it compiles down to a simple single 
byte load and compare:


325c:   c2 08 80 03 ldub  [ %g2 + %g3 ], %g1
3260:   80 88 60 01 btst  1, %g1
3264:   32 60 00 b3 bne,a,pn   %xcc, 3530 
3268:   c2 5c 61 68 ldx  [ %l1 + 0x168 ], %g1
dev->stats.multicast++;

I don't think this driver will ever be used on anything but sparc, so 
I'm not worried about how x86 might compile this.


sln
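
For the curious, the two forms under discussion, in a runnable toy
(is_multicast() is the open-coded "0x01 & addr[0]" test quoted above;
the rest is purely illustrative):

#include <stdint.h>
#include <stdio.h>

static inline int is_multicast(const uint8_t *addr)
{
    return 0x01 & addr[0];
}

int main(void)
{
    const uint8_t mcast[6] = { 0x01, 0x00, 0x5e, 0x00, 0x00, 0x01 };
    const uint8_t ucast[6] = { 0x02, 0xab, 0xcd, 0xef, 0x01, 0x02 };
    unsigned long a = 0, b = 0;

    /* branchy form, as in the patch */
    if (is_multicast(mcast))
        a++;
    if (is_multicast(ucast))
        a++;

    /* branchless form, as suggested by David Laight */
    b += is_multicast(mcast);
    b += is_multicast(ucast);

    printf("branchy=%lu branchless=%lu\n", a, b);  /* both print 1 */
    return 0;
}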



Re: [PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-15 Thread Eric Dumazet
On Wed, 2017-03-15 at 16:45 -0700, David Miller wrote:

> Ok, I guess we can remove it in that case.  I'm still a bit disappointed
> as I was always hoping someone would find a way to make this work even
> in the presence of NAT.

Nat are good, Nat are good.

I can't find this hilarious video we watched in Copenhagen ;)






Re: [PATCH net] amd-xgbe: Fix jumbo MTU processing on newer hardware

2017-03-15 Thread David Miller
From: Tom Lendacky 
Date: Wed, 15 Mar 2017 17:52:44 -0500

> On 3/15/2017 5:41 PM, David Miller wrote:
>> From: Tom Lendacky 
>> Date: Wed, 15 Mar 2017 17:40:51 -0500
>>
>>> On 3/15/2017 5:37 PM, David Miller wrote:
 From: Tom Lendacky 
 Date: Wed, 15 Mar 2017 15:11:23 -0500

> Newer hardware does not provide a cumulative payload length when
> multiple
> descriptors are needed to handle the data. Once the MTU increases
> beyond
> the size that can be handled by a single descriptor, the SKB does not
> get
> built properly by the driver.
>
> The driver will now calculate the size of the data buffers used by the
> hardware.  The first buffer of the first descriptor is for packet
> headers
> or packet headers and data when the headers can't be split. Subsequent
> descriptors in a multi-descriptor chain will not use the first
> buffer. The
> second buffer is used by all the descriptors in the chain for payload
> data.
> Based on whether the driver is processing the first, intermediate, or
> last
> descriptor it can calculate the buffer usage and build the SKB
> properly.
>
> Tested and verified on both old and new hardware.
>
> Signed-off-by: Tom Lendacky 

 Applied, thanks Tom.
>>>
>>> Thanks David.  This is another patch for 4.10 stable. Can you please
>>> queue it up?
>>
>> Can you properly state this in your patch postings, instead of always
>> mentioning it later?
>>
> 
> Sorry, yes, I can do that.  I didn't realize you preferred it that
> way.
> Do you want the "Cc" tag to stable included in the patch or just
> mention the stable targets in the patch description?  I know you
> coordinate the stable submissions and I don't want to mess anything
> up.

Just put a note after a "---" delimiter, requesting that I queue it up
for -stable.

Thanks.
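
A toy model of the per-descriptor accounting the changelog describes:
the first descriptor may use buffer 1 (headers) plus buffer 2, later
descriptors in the chain only buffer 2, and the last one holds whatever
remains. Buffer sizes and names here are illustrative assumptions, not
the driver's actual fields.

#include <stdio.h>

static unsigned int desc_capacity(unsigned int idx,
                                  unsigned int buf1, unsigned int buf2)
{
    /* only the first descriptor uses the header buffer */
    return idx == 0 ? buf1 + buf2 : buf2;
}

int main(void)
{
    unsigned int buf1 = 256, buf2 = 2048;  /* assumed buffer sizes */
    unsigned int left = 9000, idx = 0;     /* jumbo frame */

    while (left) {
        unsigned int cap = desc_capacity(idx, buf1, buf2);
        unsigned int used = left < cap ? left : cap;

        printf("desc %u: %u bytes\n", idx, used);
        left -= used;
        idx++;
    }
    return 0;
}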


Re: [PATCH net] MAINTAINERS: remove MACVLAN and VLAN entries

2017-03-15 Thread David Miller
From: Pablo Neira Ayuso 
Date: Wed, 15 Mar 2017 18:39:46 +0100

> The macvlan.c file seems to be covered by both the VLAN and MACVLAN
> DRIVER entries, so remove the MACVLAN DRIVER entry since it is redundant.
> 
> I propose with this patch to remove the VLAN (802.1Q) entry as well, so
> this just falls under NETWORKING [GENERAL].
> 
> Signed-off-by: Pablo Neira Ayuso 

Applied.


Re: [PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-15 Thread David Miller
From: Eric Dumazet 
Date: Wed, 15 Mar 2017 15:59:01 -0700

> On Wed, 2017-03-15 at 15:40 -0700, David Miller wrote:
>> From: Soheil Hassas Yeganeh 
>> Date: Wed, 15 Mar 2017 16:30:45 -0400
>> 
>> > Note that this cache was already broken for caching timestamps of
>> > multiple machines behind a NAT sharing the same address.
>> 
>> That's the documented, well established, limitation of time-wait
>> recycling.
>> 
>> People who enable it, need to consider this issue.
>> 
>> This limitation of the feature does not give us a reason to break the
>> feature even further as a matter of convenience, or to remove it
>> altogether for the same reason.
>> 
>> Please, instead, fix the bug that was introduced.
>> 
>> Thank you.
> 
> You mean revert Florian's nice patches?
> 
> This would kill timestamp randomization, and thus prevent some
> organizations from turning TCP timestamps on.
> 
> TCP timestamps are more useful than this obscure tw_recycle thing that
> is hurting innocent users.

Ok, I guess we can remove it in that case.  I'm still a bit disappointed
as I was always hoping someone would find a way to make this work even
in the presence of NAT.

I must be too optimistic.


Re: [PATCH net] amd-xgbe: Fix jumbo MTU processing on newer hardware

2017-03-15 Thread Tom Lendacky

On 3/15/2017 5:52 PM, Tom Lendacky wrote:

On 3/15/2017 5:41 PM, David Miller wrote:

From: Tom Lendacky 
Date: Wed, 15 Mar 2017 17:40:51 -0500


On 3/15/2017 5:37 PM, David Miller wrote:

From: Tom Lendacky 
Date: Wed, 15 Mar 2017 15:11:23 -0500


Newer hardware does not provide a cumulative payload length when
multiple
descriptors are needed to handle the data. Once the MTU increases
beyond
the size that can be handled by a single descriptor, the SKB does not
get
built properly by the driver.

The driver will now calculate the size of the data buffers used by the
hardware.  The first buffer of the first descriptor is for packet
headers
or packet headers and data when the headers can't be split. Subsequent
descriptors in a multi-descriptor chain will not use the first
buffer. The
second buffer is used by all the descriptors in the chain for payload
data.
Based on whether the driver is processing the first, intermediate, or
last
descriptor it can calculate the buffer usage and build the SKB
properly.

Tested and verified on both old and new hardware.

Signed-off-by: Tom Lendacky 


Applied, thanks Tom.


Thanks David.  This is another patch for 4.10 stable. Can you please
queue it up?


Can you properly state this in your patch postings, instead of always
mentioning it later?



Sorry, yes, I can do that.  I didn't realize you preferred it that way.
Do you want the "Cc" tag to stable included in the patch or just
mention the stable targets in the patch description?  I know you
coordinate the stable submissions and I don't want to mess anything up.



Never mind, just found the section in the netdev-FAQ.txt file that talks
about it.  Sorry to bother you.

Thanks,
Tom


Thanks,
Tom



Thank you.



Re: [PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-15 Thread David Miller
From: Florian Westphal 
Date: Wed, 15 Mar 2017 23:57:26 +0100

> AFAIU we only have two alternatives, removal of the randomization feature
> or switching to an offset computed via hash(saddr, daddr, secret).
> 
> Unless there are more comments I'll look into doing the latter tomorrow.

No, I'll apply the removal.


Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

2017-03-15 Thread Eric Dumazet
On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
> On Wed, Mar 15, 2017 at 06:21:29AM -0700, Eric Dumazet wrote:
> > On Tue, 2017-03-14 at 21:06 -0700, Alexei Starovoitov wrote:
> > > On Tue, Mar 14, 2017 at 08:11:43AM -0700, Eric Dumazet wrote:
> > > > +static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
> > > > +   struct mlx4_en_rx_ring *ring,
> > > > +   dma_addr_t *dma,
> > > > +   unsigned int node, gfp_t gfp)
> > > >  {
> > > > +   if (unlikely(!ring->pre_allocated_count)) {
> > > > +   unsigned int order = READ_ONCE(ring->rx_alloc_order);
> > > > +
> > > > +   page = __alloc_pages_node(node, (gfp & 
> > > > ~__GFP_DIRECT_RECLAIM) |
> > > > +   __GFP_NOMEMALLOC |
> > > > +   __GFP_NOWARN |
> > > > +   __GFP_NORETRY,
> > > > + order);
> > > > +   if (page) {
> > > > +   split_page(page, order);
> > > > +   ring->pre_allocated_count = 1U << order;
> > > > +   } else {
> > > > +   if (order > 1)
> > > > +   ring->rx_alloc_order--;
> > > > +   page = __alloc_pages_node(node, gfp, 0);
> > > > +   if (unlikely(!page))
> > > > +   return NULL;
> > > > +   ring->pre_allocated_count = 1U;
> > > > }
> > > > +   ring->pre_allocated = page;
> > > > +   ring->rx_alloc_pages += ring->pre_allocated_count;
> > > > }
> > > > +   page = ring->pre_allocated++;
> > > > +   ring->pre_allocated_count--;
> > > 
> > > do you think this style of allocation can be moved into net common?
> > > If it's a good thing then other drivers should be using it too, right?
> > 
> > Yes, we might do this once this proves to work well.
> 
> In theory it looks quite promising :)
> 
> > 
> > > 
> > > > +   ring->cons = 0;
> > > > +   ring->prod = 0;
> > > > +
> > > > +   /* Page recycling works best if we have enough pages in the 
> > > > pool.
> > > > +* Apply a factor of two on the minimal allocations required to
> > > > +* populate RX rings.
> > > > +*/
> > > 
> > > i'm not sure how the above comment matches the code below.
> > > If my math is correct a ring of 1k elements will ask for 1024
> > > contiguous pages.
> > 
> > On x86 it might be the case, unless you use MTU=900 ?
> > 
> > On PowerPC, PAGE_SIZE=65536
> > 
> > 65536/1536 = 42  
> > 
> > So for 1024 elements, we need 1024/42 = ~24 pages.
> > 
> > Thus allocating 48 pages is the goal.
> > Rounded to next power of two (although nothing in my code really needs
> > this additional constraint, a page pool does not have to have a power of
> > two entries)
> 
> on powerpc asking for roundup(48) = 64 contiguous pages sounds reasonable.
> on x86 it's roundup(1024 * 1500 * 2 / 4096) = 1024 contiguous pages.



> I thought it has zero chance of succeeding unless it's a fresh boot?

Don't you load your NIC at boot time ?

Anyway, not a big deal, order-0 pages will then be allocated,
but each order-0 page also decreases the order to something that might be
available, like order-5, after 5 failed allocations.

> Should it be something like ?
> order = min_t(u32, ilog2(pages_per_ring), 5); 
> 
> > Later, we might chose a different factor, maybe an elastic one.
> > 
> > > 
> > > > +retry:
> > > > +   total = 0;
> > > > +   pages_per_ring = new_size * stride_bytes * 2 / PAGE_SIZE;
> > > > +   pages_per_ring = roundup_pow_of_two(pages_per_ring);
> > > > +
> > > > +   order = min_t(u32, ilog2(pages_per_ring), MAX_ORDER - 1);
> > > 
> > > if you're sure it doesn't hurt the rest of the system,
> > > why use MAX_ORDER - 1? why not MAX_ORDER?
> > 
> > alloc_page(GFP_...,   MAX_ORDER)  never succeeds ;)
> 
> but MAX_ORDER - 1 also almost never succeeds ?

It does on 100% of our hosts at boot time.

And if not, we will automatically converge to whatever order-X pages are
readily in buddy allocator.

> 
> > Because of the __GFP_NOWARN you would not see the error I guess, but we
> > would get one pesky order-0 page in the ring buffer ;)
> > 
> > > 
> > > >  
> > > > -/* We recover from out of memory by scheduling our napi poll
> > > > - * function (mlx4_en_process_cq), which tries to allocate
> > > > - * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
> > > > +/* Under memory pressure, each ring->rx_alloc_order might be lowered
> > > > + * to very small values. Periodically increase it to its initial value for
> > > > + * optimal allocations, in case stress is over.
> > > >   */
> > > > +   for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
> > > > +   ring = 

Re: [PATCH net] amd-xgbe: Fix jumbo MTU processing on newer hardware

2017-03-15 Thread Tom Lendacky

On 3/15/2017 5:37 PM, David Miller wrote:

From: Tom Lendacky 
Date: Wed, 15 Mar 2017 15:11:23 -0500


Newer hardware does not provide a cumulative payload length when multiple
descriptors are needed to handle the data. Once the MTU increases beyond
the size that can be handled by a single descriptor, the SKB does not get
built properly by the driver.

The driver will now calculate the size of the data buffers used by the
hardware.  The first buffer of the first descriptor is for packet headers
or packet headers and data when the headers can't be split. Subsequent
descriptors in a multi-descriptor chain will not use the first buffer. The
second buffer is used by all the descriptors in the chain for payload data.
Based on whether the driver is processing the first, intermediate, or last
descriptor it can calculate the buffer usage and build the SKB properly.

Tested and verified on both old and new hardware.

Signed-off-by: Tom Lendacky 


Applied, thanks Tom.


Thanks David.  This is another patch for 4.10 stable. Can you please
queue it up?

Thanks,
Tom





Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

2017-03-15 Thread Alexei Starovoitov
On Wed, Mar 15, 2017 at 06:21:29AM -0700, Eric Dumazet wrote:
> On Tue, 2017-03-14 at 21:06 -0700, Alexei Starovoitov wrote:
> > On Tue, Mar 14, 2017 at 08:11:43AM -0700, Eric Dumazet wrote:
> > > +static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
> > > + struct mlx4_en_rx_ring *ring,
> > > + dma_addr_t *dma,
> > > + unsigned int node, gfp_t gfp)
> > >  {
> > > + if (unlikely(!ring->pre_allocated_count)) {
> > > + unsigned int order = READ_ONCE(ring->rx_alloc_order);
> > > +
> > > + page = __alloc_pages_node(node, (gfp & ~__GFP_DIRECT_RECLAIM) |
> > > + __GFP_NOMEMALLOC |
> > > + __GFP_NOWARN |
> > > + __GFP_NORETRY,
> > > +   order);
> > > + if (page) {
> > > + split_page(page, order);
> > > + ring->pre_allocated_count = 1U << order;
> > > + } else {
> > > + if (order > 1)
> > > + ring->rx_alloc_order--;
> > > + page = __alloc_pages_node(node, gfp, 0);
> > > + if (unlikely(!page))
> > > + return NULL;
> > > + ring->pre_allocated_count = 1U;
> > >   }
> > > + ring->pre_allocated = page;
> > > + ring->rx_alloc_pages += ring->pre_allocated_count;
> > >   }
> > > + page = ring->pre_allocated++;
> > > + ring->pre_allocated_count--;
> > 
> > do you think this style of allocation can be moved into net common?
> > If it's a good thing then other drivers should be using it too, right?
> 
> Yes, we might do this once this proves to work well.

In theory it looks quite promising :)

> 
> > 
> > > + ring->cons = 0;
> > > + ring->prod = 0;
> > > +
> > > + /* Page recycling works best if we have enough pages in the pool.
> > > +  * Apply a factor of two on the minimal allocations required to
> > > +  * populate RX rings.
> > > +  */
> > 
> > i'm not sure how the above comment matches the code below.
> > If my math is correct a ring of 1k elements will ask for 1024
> > contiguous pages.
> 
> On x86 it might be the case, unless you use MTU=900 ?
> 
> On PowerPC, PAGE_SIZE=65536
> 
> 65536/1536 = 42  
> 
> So for 1024 elements, we need 1024/42 = ~24 pages.
> 
> Thus allocating 48 pages is the goal.
> Rounded to next power of two (although nothing in my code really needs
> this additional constraint, a page pool does not have to have a power of
> two entries)

on powerpc asking for roundup(48) = 64 contiguous pages sounds reasonable.
on x86 it's roundup(1024 * 1500 * 2 / 4096) = 1024 contiguous pages.
I thought it has zero chance of succeeding unless it's a fresh boot?
Should it be something like ?
order = min_t(u32, ilog2(pages_per_ring), 5); 

> Later, we might chose a different factor, maybe an elastic one.
> 
> > 
> > > +retry:
> > > + total = 0;
> > > + pages_per_ring = new_size * stride_bytes * 2 / PAGE_SIZE;
> > > + pages_per_ring = roundup_pow_of_two(pages_per_ring);
> > > +
> > > + order = min_t(u32, ilog2(pages_per_ring), MAX_ORDER - 1);
> > 
> > if you're sure it doesn't hurt the rest of the system,
> > why use MAX_ORDER - 1? why not MAX_ORDER?
> 
> alloc_page(GFP_...,   MAX_ORDER)  never succeeds ;)

but MAX_ORDER - 1 also almost never succeeds ?

> Because of the __GFP_NOWARN you would not see the error I guess, but we
> would get one pesky order-0 page in the ring buffer ;)
> 
> > 
> > >  
> > > -/* We recover from out of memory by scheduling our napi poll
> > > - * function (mlx4_en_process_cq), which tries to allocate
> > > - * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
> > > +/* Under memory pressure, each ring->rx_alloc_order might be lowered
> > > + * to very small values. Periodically increase it to its initial value for
> > > + * optimal allocations, in case stress is over.
> > >   */
> > > + for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
> > > + ring = priv->rx_ring[ring_ind];
> > > + order = min_t(unsigned int, ring->rx_alloc_order + 1,
> > > +   ring->rx_pref_alloc_order);
> > > + WRITE_ONCE(ring->rx_alloc_order, order);
> > 
> > when recycling is effective, in a matter of a few seconds it will
> > increase the order back to 10, and the first time the driver needs
> > to allocate, it will start that tedious failure loop all over again.
> > How about removing this periodic mlx4_en_recover_from_oom() completely
> > and switch to increase the order inside mlx4_alloc_page().
> > Like N successful __alloc_pages_node() with order X will bump it
> > into order X+1. If it fails next time it will do only one failed attempt.
> 
> I wanted to do the increase out of line. (not in the data path)
> 
> We probably could increase only if ring->rx_alloc_pages got a
> significant increase since the last 
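
The sizing arithmetic debated in this thread, reproduced as a runnable
program using the same formulas as the patch hunk (ring of 1024 entries,
1536-byte stride, factor of two, rounded up to a power of two):

#include <stdio.h>

static unsigned long roundup_pow_of_two(unsigned long v)
{
    unsigned long r = 1;

    while (r < v)
        r <<= 1;
    return r;
}

static void show(const char *arch, unsigned long page_size)
{
    unsigned long entries = 1024, stride = 1536;
    unsigned long pages = entries * stride * 2 / page_size;

    printf("%s: %lu pages -> pool of %lu\n",
           arch, pages, roundup_pow_of_two(pages));
}

int main(void)
{
    show("x86, 4K pages", 4096);      /* 768 -> 1024 */
    show("ppc64, 64K pages", 65536);  /* 48 -> 64 */
    return 0;
}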

Re: [PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-15 Thread Eric Dumazet
On Wed, 2017-03-15 at 15:40 -0700, David Miller wrote:
> From: Soheil Hassas Yeganeh 
> Date: Wed, 15 Mar 2017 16:30:45 -0400
> 
> > Note that this cache was already broken for caching timestamps of
> > multiple machines behind a NAT sharing the same address.
> 
> That's the documented, well established, limitation of time-wait
> recycling.
> 
> People who enable it, need to consider this issue.
> 
> This limitation of the feature does not give us a reason to break the
> feature even further as a matter of convenience, or to remove it
> altogether for the same reason.
> 
> Please, instead, fix the bug that was introduced.
> 
> Thank you.

You mean revert Florian's nice patches?

This would kill timestamp randomization, and thus prevent some
organizations from turning TCP timestamps on.

TCP timestamps are more useful than this obscure tw_recycle thing that
is hurting innocent users.




Re: [PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-15 Thread Florian Westphal
David Miller  wrote:
> From: Soheil Hassas Yeganeh 
> Date: Wed, 15 Mar 2017 16:30:45 -0400
> 
> > Note that this cache was already broken for caching timestamps of
> > multiple machines behind a NAT sharing the same address.
> 
> That's the documented, well established, limitation of time-wait
> recycling.

Sigh.

"don't enable this if you connect your machine to the internet".
We're not in the 1990s anymore.  Even I am behind ipv4 CG-NAT nowadays.

So I disagree and would remove this thing.

> This limitation of the feature does not give us a reason to break the
> feature even further as a matter of convenience, or to remove it
> altogether for the same reason.
> 
> Please, instead, fix the bug that was introduced.

AFAIU we only have two alternatives, removal of the randomization feature
or switching to an offset computed via hash(saddr, daddr, secret).

Unless there are more comments I'll look into doing the latter tomorrow.
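
A minimal sketch of that alternative: derive the offset from
hash(saddr, daddr, secret) so it is stable per address pair (keeping
PAWS happy across connections) while still hiding jiffies. The mixing
function below is a toy stand-in; a real implementation would use a
keyed hash such as siphash with a boot-time secret.

#include <stdint.h>
#include <stdio.h>

static uint32_t ts_offset(uint32_t saddr, uint32_t daddr, uint64_t secret)
{
    uint64_t h = saddr;

    h = h * 0x9e3779b97f4a7c15ULL + daddr;
    h ^= secret;
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 29;
    return (uint32_t)h;
}

int main(void)
{
    uint64_t secret = 0x1234567890abcdefULL;  /* random at boot */

    /* same address pair -> same offset on every connection */
    printf("%08x\n", (unsigned)ts_offset(0x0a000001, 0x0a000002, secret));
    printf("%08x\n", (unsigned)ts_offset(0x0a000001, 0x0a000002, secret));
    /* different destination -> unrelated offset */
    printf("%08x\n", (unsigned)ts_offset(0x0a000001, 0x0a000003, secret));
    return 0;
}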


Re: [PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-15 Thread Willy Tarreau
Hi David,

On Wed, Mar 15, 2017 at 03:40:44PM -0700, David Miller wrote:
> From: Soheil Hassas Yeganeh 
> Date: Wed, 15 Mar 2017 16:30:45 -0400
> 
> > Note that this cache was already broken for caching timestamps of
> > multiple machines behind a NAT sharing the same address.
> 
> That's the documented, well established, limitation of time-wait
> recycling.
> 
> People who enable it, need to consider this issue.
> 
> This limitation of the feature does not give us a reason to break the
> feature even further as a matter of convenience, or to remove it
> altogether for the same reason.
> 
> Please, instead, fix the bug that was introduced.

At least I can say I've seen many people enable it without understanding
its impact, confusing it with tcp_tw_reuse, and copy-pasting it from
random blogs and complaining about issues in production.

I agree that it's hard to arbitrate between stupidity and flexibility
though :-/

Regards,
Willy


Re: [PATCH net] amd-xgbe: Fix jumbo MTU processing on newer hardware

2017-03-15 Thread Tom Lendacky

On 3/15/2017 5:41 PM, David Miller wrote:

From: Tom Lendacky 
Date: Wed, 15 Mar 2017 17:40:51 -0500


On 3/15/2017 5:37 PM, David Miller wrote:

From: Tom Lendacky 
Date: Wed, 15 Mar 2017 15:11:23 -0500


Newer hardware does not provide a cumulative payload length when
multiple
descriptors are needed to handle the data. Once the MTU increases
beyond
the size that can be handled by a single descriptor, the SKB does not
get
built properly by the driver.

The driver will now calculate the size of the data buffers used by the
hardware.  The first buffer of the first descriptor is for packet
headers
or packet headers and data when the headers can't be split. Subsequent
descriptors in a multi-descriptor chain will not use the first
buffer. The
second buffer is used by all the descriptors in the chain for payload
data.
Based on whether the driver is processing the first, intermediate, or
last
descriptor it can calculate the buffer usage and build the SKB
properly.

Tested and verified on both old and new hardware.

Signed-off-by: Tom Lendacky 


Applied, thanks Tom.


Thanks David.  This is another patch for 4.10 stable. Can you please
queue it up?


Can you properly state this in your patch postings, instead of always
mentioning it later?



Sorry, yes, I can do that.  I didn't realize you preferred it that way.
Do you want the "Cc" tag to stable included in the patch or just
mention the stable targets in the patch description?  I know you
coordinate the stable submissions and I don't want to mess anything up.

Thanks,
Tom



Thank you.



[PATCH net] ibmvnic: Free tx/rx scrq pointer array when releasing sub-crqs

2017-03-15 Thread Nathan Fontenot
The pointer array for the tx/rx sub crqs should be freed when
releasing the tx/rx sub crqs.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 5f11b4d..b23d654 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1257,6 +1257,7 @@ static void release_sub_crqs(struct ibmvnic_adapter 
*adapter)
release_sub_crq_queue(adapter,
  adapter->tx_scrq[i]);
}
+   kfree(adapter->tx_scrq);
adapter->tx_scrq = NULL;
}
 
@@ -1269,6 +1270,7 @@ static void release_sub_crqs(struct ibmvnic_adapter 
*adapter)
release_sub_crq_queue(adapter,
  adapter->rx_scrq[i]);
}
+   kfree(adapter->rx_scrq);
adapter->rx_scrq = NULL;
}
 }



Re: net/udp: slab-out-of-bounds Read in udp_recvmsg

2017-03-15 Thread Eric Dumazet
On Wed, 2017-03-15 at 15:08 -0700, David Miller wrote:
> From: Eric Dumazet 
> Date: Wed, 15 Mar 2017 09:10:33 -0700
> 
> > @@ -692,12 +692,17 @@ void __sock_recv_timestamp(struct msghdr *msg, struct 
> > sock *sk,
> > ktime_to_timespec_cond(shhwtstamps->hwtstamp, tss.ts + 2))
> > empty = 0;
> > if (!empty) {
> > +   unsigned int hlen = skb_headlen(skb);
> > +
> > put_cmsg(msg, SOL_SOCKET,
> >  SCM_TIMESTAMPING, sizeof(tss), &tss);
> >  
> > -   if (skb->len && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS))
> > +   if (hlen &&
> > +   (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
> > +   sk->sk_protocol == IPPROTO_TCP &&
> > +   sk->sk_type == SOCK_STREAM)
> > put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING_OPT_STATS,
> > -skb->len, skb->data);
> > +hlen, skb->data);
> 
> Hmmm, what is the true intention of SOF_TIMESTAMPING_OPT_STATS then?  The
> existing code seems to want to dump the entire SKB into the cmsg, and if
> that's the case then the fix is to linearize the skb before the put_cmsg()
> or have a way to put a non-linear SKB into a cmsg.

I simply matched the conditions in __skb_tstamp_tx() which builds the
skb :

+   if (tsonly) {
+#ifdef CONFIG_INET
+   if ((sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
+   sk->sk_protocol == IPPROTO_TCP &&
+   sk->sk_type == SOCK_STREAM)
+   skb = tcp_get_timestamping_opt_stats(sk);
+   else
+#endif
+   skb = alloc_skb(0, GFP_ATOMIC);
+   } else {


And note that I should have also used the #ifdef


A proper fix would be to find a bit in skb->cb[] to avoid duplicating
the test...
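
One way to keep the two call sites in sync until such a bit exists
would be a shared helper along these lines (kernel-style fragment,
names illustrative; conditions copied from the hunks quoted above):

static bool sk_wants_tcp_opt_stats(const struct sock *sk)
{
    return (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
           sk->sk_protocol == IPPROTO_TCP &&
           sk->sk_type == SOCK_STREAM;
}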






Re: [PATCH net] net: properly release sk_frag.page

2017-03-15 Thread David Miller
From: Eric Dumazet 
Date: Wed, 15 Mar 2017 13:21:28 -0700

> From: Eric Dumazet 
> 
> I mistakenly added the code to release sk->sk_frag in
> sk_common_release() instead of sk_destruct()
> 
> TCP sockets using sk->sk_allocation == GFP_ATOMIC do no call
> sk_common_release() at close time, thus leaking one (order-3) page.
> 
> iSCSI is using such sockets.
> 
> Fixes: 5640f7685831 ("net: use a per task frag allocator")
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks Eric.


Re: [PATCH net] amd-xgbe: Fix jumbo MTU processing on newer hardware

2017-03-15 Thread David Miller
From: Tom Lendacky 
Date: Wed, 15 Mar 2017 15:11:23 -0500

> Newer hardware does not provide a cumulative payload length when multiple
> descriptors are needed to handle the data. Once the MTU increases beyond
> the size that can be handled by a single descriptor, the SKB does not get
> built properly by the driver.
> 
> The driver will now calculate the size of the data buffers used by the
> hardware.  The first buffer of the first descriptor is for packet headers
> or packet headers and data when the headers can't be split. Subsequent
> descriptors in a multi-descriptor chain will not use the first buffer. The
> second buffer is used by all the descriptors in the chain for payload data.
> Based on whether the driver is processing the first, intermediate, or last
> descriptor it can calculate the buffer usage and build the SKB properly.
> 
> Tested and verified on both old and new hardware.
> 
> Signed-off-by: Tom Lendacky 

Applied, thanks Tom.


Re: [PATCH net] amd-xgbe: Fix jumbo MTU processing on newer hardware

2017-03-15 Thread David Miller
From: Tom Lendacky 
Date: Wed, 15 Mar 2017 17:40:51 -0500

> On 3/15/2017 5:37 PM, David Miller wrote:
>> From: Tom Lendacky 
>> Date: Wed, 15 Mar 2017 15:11:23 -0500
>>
>>> Newer hardware does not provide a cumulative payload length when
>>> multiple
>>> descriptors are needed to handle the data. Once the MTU increases
>>> beyond
>>> the size that can be handled by a single descriptor, the SKB does not
>>> get
>>> built properly by the driver.
>>>
>>> The driver will now calculate the size of the data buffers used by the
>>> hardware.  The first buffer of the first descriptor is for packet
>>> headers
>>> or packet headers and data when the headers can't be split. Subsequent
>>> descriptors in a multi-descriptor chain will not use the first
>>> buffer. The
>>> second buffer is used by all the descriptors in the chain for payload
>>> data.
>>> Based on whether the driver is processing the first, intermediate, or
>>> last
>>> descriptor it can calculate the buffer usage and build the SKB
>>> properly.
>>>
>>> Tested and verified on both old and new hardware.
>>>
>>> Signed-off-by: Tom Lendacky 
>>
>> Applied, thanks Tom.
> 
> Thanks David.  This is another patch for 4.10 stable. Can you please
> queue it up?

Can you properly state this in your patch postings, instead of always
mentioning it later?

Thank you.


Re: [PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-15 Thread David Miller
From: Soheil Hassas Yeganeh 
Date: Wed, 15 Mar 2017 16:30:45 -0400

> Note that this cache was already broken for caching timestamps of
> multiple machines behind a NAT sharing the same address.

That's the documented, well established, limitation of time-wait
recycling.

People who enable it, need to consider this issue.

This limitation of the feature does not give us a reason to break the
feature even further as a matter of convenience, or to remove it
altogether for the same reason.

Please, instead, fix the bug that was introduced.

Thank you.


Re: [PATCH net] net: bcmgenet: Do not suspend PHY if Wake-on-LAN is enabled

2017-03-15 Thread David Miller
From: Florian Fainelli 
Date: Wed, 15 Mar 2017 12:57:21 -0700

> Suspending the PHY would be putting it in a low power state where it
> may no longer allow us to do Wake-on-LAN.
> 
> Fixes: cc013fb48898 ("net: bcmgenet: correctly suspend and resume PHY device")
> Signed-off-by: Florian Fainelli 

Applied and queued up for -stable, thanks!


Re: [PATCH net-next v2 0/3] net: dsa: check out-of-range ageing time

2017-03-15 Thread David Miller
From: Vivien Didelot 
Date: Wed, 15 Mar 2017 15:53:47 -0400

> The ageing time limits supported by DSA drivers vary depending on the
> switch model. If a driver returns -ERANGE for out-of-range values, the
> switchdev commit phase will fail with the following stacktrace:
 ...
> This patchset fixes this by adding ageing_time_min and ageing_time_max
> fields to the dsa_switch structure, which can optionally be set by a DSA
> driver.
> 
> If provided, the DSA core will check for out-of-range values in the
> SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME prepare phase and return -ERANGE
> accordingly.
> 
> Finally set these limits in the mv88e6xxx driver.

Series applied, thanks Vivien.
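
The prepare-phase check described in the cover letter boils down to a
range test along these lines (runnable model; treat 0 as "no limit set
by the driver", and note the limit values below are made up, not the
mv88e6xxx ones):

#include <errno.h>
#include <stdio.h>

static int ageing_time_check(unsigned int msecs,
                             unsigned int min, unsigned int max)
{
    if (min && msecs < min)
        return -ERANGE;
    if (max && msecs > max)
        return -ERANGE;
    return 0;
}

int main(void)
{
    printf("%d\n", ageing_time_check(300000, 15000, 3825000));  /* 0 */
    printf("%d\n", ageing_time_check(1000, 15000, 3825000));    /* -ERANGE */
    return 0;
}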


Re: [Bug 194723] connect() to localhost stalls after 4.9 -> 4.10 upgrade

2017-03-15 Thread David Miller
From: Eric Dumazet 
Date: Wed, 15 Mar 2017 14:40:34 -0700

> Finally time to get rid of buggy tw_recycle, that apparently some
> distros set to one.

It's not buggy, it's just not designed in a way that it can work in
the presence of NAT.

So yes, indeed, NO DISTRO SHOULD CHANGE THE VALUE TO "1" by default.

Is CentOS the only culprit?

But removing it is unnecessary, people who know what they are doing
can legitimately enable it if they know that there is no NAT going
on in their paths.


Re: [net-next PATCH 0/2] Add support for passing more information in mqprio offload

2017-03-15 Thread David Miller
From: Alexander Duyck 
Date: Wed, 15 Mar 2017 10:39:12 -0700

> This patch series lays the groundwork for future work to allow us to make
> full use of the mqprio options when offloading them to hardware.
> 
> Currently when we specify the hardware offload for mqprio the queue
> configuration is completely ignored and the hardware is only notified of
> the total number of traffic classes.  The problem is this leads to multiple
> issues, one specific issue being you can pass the queue configuration you
> want and it is totally ignored by the hardware.
> 
> What I am planning to do is add support for "hw" values in the
> configuration greater than 1.  So for example we might have one mode of
> mqprio offload that uses 1 and only offloads the TC counts like we
> currently do.  Then we might look at adding an option 2 which would factor
> in the TCs and the queue count information. This way we can select between
> the type of offload we actually want and existing drivers that don't
> support this can just fall back to their legacy configuration.

This looks fine, series applied, thanks Alex.


Re: [PATCH 00/10] Netfilter fixes for net

2017-03-15 Thread David Miller
From: Pablo Neira Ayuso 
Date: Wed, 15 Mar 2017 18:01:02 +0100

> The following patchset contains Netfilter fixes for your net tree, a
> rather large batch of fixes targeted to nf_tables, conntrack and bridge
> netfilter. More specifically, they are:
> 
> 1) Don't track fragmented packets if the socket option IP_NODEFRAG is set.
>From Florian Westphal.
> 
> 2) SCTP protocol tracker assumes that ICMP error messages contain the
>checksum field, which results in packet drops. From Ying Xue.
> 
> 3) Fix inconsistent handling of AH traffic from nf_tables.
> 
> 4) Fix new bitmap set representation with big endian. Fix mismatches in
>nf_tables due to incorrect big endian handling too. Both patches
>from Liping Zhang.
> 
> 5) Bridge netfilter doesn't honor maximum fragment size field, cap to
>largest fragment seen. From Florian Westphal.
> 
> 6) Fake conntrack entry needs to be aligned to 8 bytes since the 3 LSB
>bits are now used to store the ctinfo. From Steven Rostedt.
> 
> 7) Fix element comments with the bitmap set type. Revert the flush
>field in the nft_set_iter structure, not required anymore after
>fixing up element comments.
> 
> 8) Missing error on invalid conntrack direction from nft_ct, also from
>Liping Zhang.
> 
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks Pablo!


Re: [PATCH net] net/openvswitch: Set the ipv6 source tunnel key address attribute correctly

2017-03-15 Thread David Miller
From: Or Gerlitz 
Date: Wed, 15 Mar 2017 18:10:47 +0200

> When dealing with the ipv6 source tunnel key address attribute
> (OVS_TUNNEL_KEY_ATTR_IPV6_SRC) we are wrongly setting the tunnel
> dst ip; fix that.
> 
> Fixes: 6b26ba3a7d95 ('openvswitch: netlink attributes for IPv6 tunneling')
> Signed-off-by: Or Gerlitz 
> Reported-by: Paul Blakey 

Applied and queued up for -stable, thanks!


Re: net/udp: slab-out-of-bounds Read in udp_recvmsg

2017-03-15 Thread David Miller
From: Eric Dumazet 
Date: Wed, 15 Mar 2017 09:10:33 -0700

> @@ -692,12 +692,17 @@ void __sock_recv_timestamp(struct msghdr *msg, struct 
> sock *sk,
>   ktime_to_timespec_cond(shhwtstamps->hwtstamp, tss.ts + 2))
>   empty = 0;
>   if (!empty) {
> + unsigned int hlen = skb_headlen(skb);
> +
>   put_cmsg(msg, SOL_SOCKET,
>SCM_TIMESTAMPING, sizeof(tss), &tss);
>  
> - if (skb->len && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS))
> + if (hlen &&
> + (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
> + sk->sk_protocol == IPPROTO_TCP &&
> + sk->sk_type == SOCK_STREAM)
>   put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING_OPT_STATS,
> -  skb->len, skb->data);
> +  hlen, skb->data);

Hmmm, what is the true intention of SOF_TIMESTAMPING_OPT_STATS then?  The
existing code seems to want to dump the entire SKB into the cmsg, and if
that's the case then the fix is to linearize the skb before the put_cmsg()
or have a way to put a non-linear SKB into a cmsg.


Re: [PATCH] netxen_nic: remove redundant check if retries is zero

2017-03-15 Thread David Miller
From: Colin King 
Date: Wed, 15 Mar 2017 15:31:58 +

> From: Colin Ian King 
> 
> At the end of the timeout loop, retries will always be zero, so
> the check for zero is redundant; remove it.  Also replace
> printk with pr_err as recommended by checkpatch.
> 
> Signed-off-by: Colin Ian King 

Applied.


Re: [PATCH net] bridge: ebtables: fix reception of frames DNAT-ed to bridge device

2017-03-15 Thread Pablo Neira Ayuso
On Wed, Mar 15, 2017 at 10:16:19PM +0100, Linus Lüssing wrote:
> On Wed, Mar 15, 2017 at 07:15:39PM +0100, Pablo Neira Ayuso wrote:
> > Could you update ebtables dnat to check if the ethernet address
> > matches the one of the input bridge interface, so we mangle the
> > ->pkt_type accordingly from there, instead of doing this from the
> > core?
> 
> Actually, that was the approach I thought about and went for first
> (and it would probably work for me). Just checking against the
> bridge device's net_device::dev_addr.
> 
> I scratched it though, as I was afraid that the issue might still
> exist for people using some other upper device on top of the bridge
> device. For instance, macvlan? And iterating over the
> net_device::dev_addrs list seemed too costly for fast path to me.

I was more thinking of following the simple approach that we follow in
ebt_redirect_tg() by taking the input interface.

Anyway, I'm ok with this.
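
For reference, the idea discussed above amounts to a fragment along
these lines in the dnat target (a sketch of the approach only; the
placement and exact condition are assumptions):

    /* after ebt_dnat has rewritten the destination MAC to 'mac' */
    if (ether_addr_equal(mac, in->dev_addr))
        skb->pkt_type = PACKET_HOST;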


Re: [PATCH] nl80211: fix dumpit error path RTNL deadlocks

2017-03-15 Thread David Miller
From: Johannes Berg 
Date: Wed, 15 Mar 2017 14:29:13 +0100

> From: Johannes Berg 
> 
> Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out
> to be perfectly accurate - there are various error paths that miss unlock
> of the RTNL.
> 
> To fix those, change the locking a bit to not be conditional in all those
> nl80211_prepare_*_dump() functions, but make those require the RTNL to
> start with, and fix the buggy error paths. This also let me use sparse
> (by appropriately overriding the rtnl_lock/rtnl_unlock functions) to
> validate the changes.
> 
> Reported-by: Sowmini Varadhan 
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Johannes Berg 

Johannes, I assume I will get this in a future pull request?


Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit

2017-03-15 Thread David Miller
From: Shubham Bansal 
Date: Wed, 15 Mar 2017 17:43:44 +0530

>> You can't truncate, but you'll have to build 64-bit ops using 32-bit 
>> registers.
> 
> A small example would help a lot.

You can simply perform 64-bit operations in C code and see what gcc
outputs for that code on this 32-bit target.


[PATCH v2] net: sun: sungem: fix a possible null dereference

2017-03-15 Thread Philippe Reynes
The function gem_begin_auto_negotiation dereferences
the pointer ep before testing if it's null. This
patch adds a check on ep before dereferencing it.

Fixes: 92552fdd557 ("net: sun: sungem: use new api
ethtool_{get|set}_link_ksettings")

Reported-by: Dan Carpenter 
Signed-off-by: Philippe Reynes 
---
Changelog:
v2:
- use Fixes tag (thanks Sergei Shtylyov)

 drivers/net/ethernet/sun/sungem.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/sun/sungem.c 
b/drivers/net/ethernet/sun/sungem.c
index dbfca04..fa607d0 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -1259,8 +1259,9 @@ static void gem_begin_auto_negotiation(struct gem *gp,
int duplex;
u32 advertising;
 
-   ethtool_convert_link_mode_to_legacy_u32(&advertising,
-   ep->link_modes.advertising);
+   if (ep)
+   ethtool_convert_link_mode_to_legacy_u32(
+   &advertising, ep->link_modes.advertising);
 
if (gp->phy_type != phy_mii_mdio0 &&
gp->phy_type != phy_mii_mdio1)
-- 
1.7.4.4



Re: [PATCH] net: sun: sungem: fix a possible null dereference

2017-03-15 Thread Philippe Reynes
Hi Sergei,

On 3/15/17, Sergei Shtylyov  wrote:
> Hello!
>
> On 3/15/2017 12:23 AM, Philippe Reynes wrote:
>
>> The function gem_begin_auto_negotiation dereferences
>> the pointer ep before testing if it's null. This
>> patch adds a check on ep before dereferencing it.
>>
>> This issue was added by the patch 92552fdd557:
>> "net: sun: sungem: use new api ethtool_{get|set}_link_ksettings".
>
> There's a Fixes: tag for that now, described in
> Documentation/process/submitting-patches.rst...

Thanks a lot for pointing me to this tag.
I'm sending a v2 of this patch with this tag.

>> Reported-by: Dan Carpenter 
>> Signed-off-by: Philippe Reynes 
> [...]
>
> MBR, Sergei
>

Thanks,
Philippe


Re: [PATCH v3 net-next 00/11] net: stmmac: prepare dma operations for multiple queues

2017-03-15 Thread David Miller
From: Joao Pinto 
Date: Wed, 15 Mar 2017 11:04:44 +

> As agreed with David Miller, this patch-set is the second of 3 to enable
> multiple queues in stmmac.
> 
> This second one concentrates on dma operations, adding functionality such as:
> a) DMA Operation Mode configuration per channel and done in the multiple
> queues configuration function
> b) DMA IRQ enable and Disable by channel
> c) DMA start and stop by channel
> d) RX and TX ring length configuration by channel
> e) RX and TX set tail pointer by channel
> f) DMA Channel initialization broken into Channel common, RX and TX
> initialization
> g) TSO being configured for all available channels
> h) DMA interrupt treatment by channel

Series applied.


Re: [PATCH 1/1] gtp: support SGSN-side tunnels

2017-03-15 Thread Pablo Neira Ayuso
Hi Harald,

On Wed, Mar 15, 2017 at 08:10:38PM +0100, Harald Welte wrote:
> I've modified the patch slightly, see below (compile-tested, but not
> otherwise tested yet).  Basically I renamed the flags attribute to 'role',
> expanded the commit log and removed unrelated cosmetic changes.
> 
> I've also prepared a corresponding change to libgtpnl into the
> laforge/sgsn-role branch, see
> http://git.osmocom.org/libgtpnl/commit/?h=laforge/sgsn-role
> 
> This is not yet tested in any way, but I'm planning to add some
> associated support to the command line tools and then give it some
> testing (both against the kernel GTP in GGSN mode, as well as an
> independent userspace GTP implementation).

Thanks Harald.

> > It would be good if we provide a way to configure GTP via iproute2 for
> > testing purposes.
> 
> I don't really care about which tool is used, as long as it is easily
> available [and FOSS, of course].
>
> > We would need to create some dummy socket from the
> > kernel too, though, so we don't need any userspace daemon for this
> > testing mode.
> 
> I don't really like that latter idea. It sounds too much like a hack to
> me.  But then, I don't have enough fantasy right now to imagine what an
> actual implementation would look like.

It's not that far away, we can just create the udp socket from
kernelspace via udp_sock_create() in the test mode. So we don't need
to pass the file descriptor from userspace. But I'm not asking you to work
on this, just an idea.

> To me, it is perfectly fine to run a simple, small utility in userspace
> even for testing.

No problem.


Re: Fw: [Bug 194723] connect() to localhost stalls after 4.9 -> 4.10 upgrade

2017-03-15 Thread Eric Dumazet
On Wed, 2017-03-15 at 13:36 -0700, Stephen Hemminger wrote:
> 
> Begin forwarded message:
> 
> Date: Wed, 15 Mar 2017 19:41:59 +
> From: bugzilla-dae...@bugzilla.kernel.org
> To: step...@networkplumber.org
> Subject: [Bug 194723] connect() to localhost stalls after 4.9 -> 4.10 upgrade
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=194723
> 
> --- Comment #15 from Lutz Vieweg (l...@5t9.de) ---
> At last, bisecting converged:
> 
> git bisect start
> # bad: [c470abd4fde40ea6a0846a2beab642a578c0b8cd] Linux 4.10
> git bisect bad c470abd4fde40ea6a0846a2beab642a578c0b8cd
> # good: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
> git bisect good 69973b830859bc6529a7a0468ba0d80ee5117826
> # bad: [f4000cd99750065d5177555c0a805c97174d1b9f] Merge tag 'arm64-upstream' 
> of
> git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> git bisect bad f4000cd99750065d5177555c0a805c97174d1b9f
> # bad: [7079efc9d3e7f1f7cdd34082ec58209026315057] Merge tag 'fbdev-4.10' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
> git bisect bad 7079efc9d3e7f1f7cdd34082ec58209026315057
> # bad: [669bb4c58c3091cd54650e37c5f4e345dd12c564] Merge branch 'for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32
> git bisect bad 669bb4c58c3091cd54650e37c5f4e345dd12c564
> # good: [7a8bca043cf1bb0433aa43d008b6c4de6c07d6a2] Merge branch 'sfc-tso-v2'
> git bisect good 7a8bca043cf1bb0433aa43d008b6c4de6c07d6a2
> # bad: [4f4f907a6729ae9e132810711c3a05e48311a948] Merge branch 'mvneta-64bit'
> git bisect bad 4f4f907a6729ae9e132810711c3a05e48311a948
> # good: [33f8a0458b2ce4546b681c5fae04427e3077a543] Merge tag
> 'wireless-drivers-next-for-davem-2016-11-25' of
> git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
> git bisect good 33f8a0458b2ce4546b681c5fae04427e3077a543
> # good: [80439a1704e811697ee01fd09dd95dd10790bc93] qede: Remove 'num_tc'.
> git bisect good 80439a1704e811697ee01fd09dd95dd10790bc93
> # good: [5067b6020770ef7c8102f47079c9e577d175ef2c] net/mlx5e: Remove flow 
> encap
> entry in the correct place
> git bisect good 5067b6020770ef7c8102f47079c9e577d175ef2c
> # bad: [7091d8c7055d7310339435ae3af2fb490a92524d] net/sched: cls_flower: Add
> offload support using egress Hardware device
> git bisect bad 7091d8c7055d7310339435ae3af2fb490a92524d
> # good: [b14945ac3efdf5217182a344b037f96d6b0afae1] net: atarilance: use %8ph
> for printing hex string
> git bisect good b14945ac3efdf5217182a344b037f96d6b0afae1
> # bad: [25429d7b7dca01dc4f17205de023a30ca09390d0] tcp: allow to turn tcp
> timestamp randomization off
> git bisect bad 25429d7b7dca01dc4f17205de023a30ca09390d0
> # good: [1d6cff4fca4366d0529dbce170e0f33cfe213790] qed: Add iSCSI out of order
> packet handling.
> git bisect good 1d6cff4fca4366d0529dbce170e0f33cfe213790
> # bad: [95a22caee396cef0bb2ca8fafdd82966a49367bb] tcp: randomize tcp timestamp
> offsets for each connection
> git bisect bad 95a22caee396cef0bb2ca8fafdd82966a49367bb
> 
> 
> So the culprit seems to be this change: 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=95a22caee396cef0bb2ca8fafdd82966a49367bb
> 
> "tcp: randomize tcp timestamp offsets for each connection
> jiffies based timestamps allow for easy inference of number of devices
> behind NAT translators and also makes tracking of hosts simpler.
> 
> commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
> added the main infrastructure that is needed for per-connection ts
> randomization, in particular writing/reading the on-wire tcp header
> format takes the offset into account so rest of stack can use normal
> tcp_time_stamp (jiffies).
> 
> So only two items are left:
>  - add a tsoffset for request sockets
>  - extend the tcp isn generator to also return another 32bit number
>in addition to the ISN.
> 
> Re-use of ISN generator also means timestamps are still monotonically
> increasing for same connection quadruple, i.e. PAWS will still work.
> 
> Includes fixes from Eric Dumazet.
> 
> Signed-off-by: Florian Westphal 
> Acked-by: Eric Dumazet 
> Acked-by: Yuchung Cheng 
> Signed-off-by: David S. Miller 
> "
> 
> I will try to attract some attention from above mentioned people.
> 

Finally time to get rid of buggy tw_recycle, that apparently some
distros set to one.






Re: [PATCH] net: ethernet: aquantia: set net_device mtu when mtu is changed

2017-03-15 Thread Jarod Wilson

On 2017-03-13 7:07 PM, David Arcari wrote:

When the aquantia device mtu is changed the net_device structure is not
updated.  As a result the ip command does not properly reflect the mtu change.

Commit 5513e16421cb incorrectly assumed that __dev_set_mtu() was making the
assignment ndev->mtu = new_mtu;  This is not true in the case where the driver
has a ndo_change_mtu routine.

Fixes: 5513e16421cb ("net: ethernet: aquantia: Fixes for aq_ndev_change_mtu")

Cc: Pavel Belous 
Signed-off-by: David Arcari 


Would help if I replied to the proper version of the patch. To 
reiterate, patch looks appropriate and necessary.


Reviewed-by: Jarod Wilson 

--
Jarod Wilson
ja...@redhat.com


Re: [PATCH net] bridge: ebtables: fix reception of frames DNAT-ed to bridge device

2017-03-15 Thread Linus Lüssing
On Wed, Mar 15, 2017 at 07:15:39PM +0100, Pablo Neira Ayuso wrote:
> Could you update ebtables dnat to check if the ethernet address
> matches the one of the input bridge interface, so we mangle the
> ->pkt_type accordingly from there, instead of doing this from the
> core?

Actually, that was the approach I thought about and went for first
(and it would probably work for me). Just checking against the
bridge device's net_device::dev_addr.

I scratched it though, as I was afraid that the issue might still
exist for people using some other upper device on top of the bridge
device. For instance, macvlan? And iterating over the
net_device::dev_addrs list seemed too costly for fast path to me.
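
For illustration, that first variant would have looked roughly like
this (a minimal sketch under the assumptions above; br_fixup_pkt_type
is a hypothetical helper, not the submitted patch):

#include <linux/etherdevice.h>
#include <linux/if_packet.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* After an ebtables dnat, re-classify the frame as PACKET_HOST when
 * the new destination MAC is the bridge device's own address, so the
 * stack delivers it locally.
 */
static void br_fixup_pkt_type(struct sk_buff *skb,
			      const struct net_device *br_dev)
{
	if (ether_addr_equal(eth_hdr(skb)->h_dest, br_dev->dev_addr))
		skb->pkt_type = PACKET_HOST;
}

As said, this only covers the bridge device itself; an upper device
(e.g. macvlan) stacked on top of the bridge would still see the frame
with the wrong pkt_type.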


Re: [PATCH net-next v2 0/3] net: dsa: check out-of-range ageing time

2017-03-15 Thread Andrew Lunn
> This patchset fixes this by adding ageing_time_min and ageing_time_max
> fields to the dsa_switch structure, which can optionally be set by a DSA
> driver.
> 
> If provided, the DSA core will check for out-of-range values in the
> SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME prepare phase and return -ERANGE
> accordingly.
> 
> Finally set these limits in the mv88e6xxx driver.

Hi Vivien

Thanks for doing it this way. Nicely done.

Reviewed-by: Andrew Lunn 

Andrew



Re: [PATCH] net: ethernet: aquantia: set net_device mtu when mtu is changed

2017-03-15 Thread Jarod Wilson

On 2017-03-08 4:33 PM, David Arcari wrote:

When the aquantia device MTU is changed, the net_device structure is not
updated.  As a result the ip command does not properly reflect the MTU change.

Commit 5513e16421cb incorrectly assumed that __dev_set_mtu() was making the
assignment ndev->mtu = new_mtu. This is not true in the case where the driver
has an ndo_change_mtu routine.

Fixes: 5513e16421cb ("net: ethernet: aquantia: Fixes for aq_ndev_change_mtu")

Cc: Pavel Belous 
Signed-off-by: David Arcari 


Looks like an appropriate and necessary fix to me.

Reviewed-by: Jarod Wilson 

--
Jarod Wilson
ja...@redhat.com


Fw: [Bug 194723] connect() to localhost stalls after 4.9 -> 4.10 upgrade

2017-03-15 Thread Stephen Hemminger


Begin forwarded message:

Date: Wed, 15 Mar 2017 19:41:59 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 194723] connect() to localhost stalls after 4.9 -> 4.10 upgrade


https://bugzilla.kernel.org/show_bug.cgi?id=194723

--- Comment #15 from Lutz Vieweg (l...@5t9.de) ---
At last, bisecting converged:

git bisect start
# bad: [c470abd4fde40ea6a0846a2beab642a578c0b8cd] Linux 4.10
git bisect bad c470abd4fde40ea6a0846a2beab642a578c0b8cd
# good: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
git bisect good 69973b830859bc6529a7a0468ba0d80ee5117826
# bad: [f4000cd99750065d5177555c0a805c97174d1b9f] Merge tag 'arm64-upstream' of
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
git bisect bad f4000cd99750065d5177555c0a805c97174d1b9f
# bad: [7079efc9d3e7f1f7cdd34082ec58209026315057] Merge tag 'fbdev-4.10' of
git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
git bisect bad 7079efc9d3e7f1f7cdd34082ec58209026315057
# bad: [669bb4c58c3091cd54650e37c5f4e345dd12c564] Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32
git bisect bad 669bb4c58c3091cd54650e37c5f4e345dd12c564
# good: [7a8bca043cf1bb0433aa43d008b6c4de6c07d6a2] Merge branch 'sfc-tso-v2'
git bisect good 7a8bca043cf1bb0433aa43d008b6c4de6c07d6a2
# bad: [4f4f907a6729ae9e132810711c3a05e48311a948] Merge branch 'mvneta-64bit'
git bisect bad 4f4f907a6729ae9e132810711c3a05e48311a948
# good: [33f8a0458b2ce4546b681c5fae04427e3077a543] Merge tag
'wireless-drivers-next-for-davem-2016-11-25' of
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
git bisect good 33f8a0458b2ce4546b681c5fae04427e3077a543
# good: [80439a1704e811697ee01fd09dd95dd10790bc93] qede: Remove 'num_tc'.
git bisect good 80439a1704e811697ee01fd09dd95dd10790bc93
# good: [5067b6020770ef7c8102f47079c9e577d175ef2c] net/mlx5e: Remove flow encap
entry in the correct place
git bisect good 5067b6020770ef7c8102f47079c9e577d175ef2c
# bad: [7091d8c7055d7310339435ae3af2fb490a92524d] net/sched: cls_flower: Add
offload support using egress Hardware device
git bisect bad 7091d8c7055d7310339435ae3af2fb490a92524d
# good: [b14945ac3efdf5217182a344b037f96d6b0afae1] net: atarilance: use %8ph
for printing hex string
git bisect good b14945ac3efdf5217182a344b037f96d6b0afae1
# bad: [25429d7b7dca01dc4f17205de023a30ca09390d0] tcp: allow to turn tcp
timestamp randomization off
git bisect bad 25429d7b7dca01dc4f17205de023a30ca09390d0
# good: [1d6cff4fca4366d0529dbce170e0f33cfe213790] qed: Add iSCSI out of order
packet handling.
git bisect good 1d6cff4fca4366d0529dbce170e0f33cfe213790
# bad: [95a22caee396cef0bb2ca8fafdd82966a49367bb] tcp: randomize tcp timestamp
offsets for each connection
git bisect bad 95a22caee396cef0bb2ca8fafdd82966a49367bb


So the culprit seems to be this change: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=95a22caee396cef0bb2ca8fafdd82966a49367bb

"tcp: randomize tcp timestamp offsets for each connection
jiffies based timestamps allow for easy inference of number of devices
behind NAT translators and also makes tracking of hosts simpler.

commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
added the main infrastructure that is needed for per-connection ts
randomization, in particular writing/reading the on-wire tcp header
format takes the offset into account so rest of stack can use normal
tcp_time_stamp (jiffies).

So only two items are left:
 - add a tsoffset for request sockets
 - extend the tcp isn generator to also return another 32bit number
   in addition to the ISN.

Re-use of ISN generator also means timestamps are still monotonically
increasing for same connection quadruple, i.e. PAWS will still work.

Includes fixes from Eric Dumazet.

Signed-off-by: Florian Westphal 
Acked-by: Eric Dumazet 
Acked-by: Yuchung Cheng 
Signed-off-by: David S. Miller 
"

I will try to attract some attention from above mentioned people.

-- 
You are receiving this mail because:
You are the assignee for the bug.


[PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-15 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection)
randomizes TCP timestamps per connection. After this commit,
there is no guarantee that the timestamps received from the
same destination are monotonically increasing. As a result,
the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts
in struct tcp_metrics_block) is broken and cannot be relied upon.

Remove the per-destination timestamp cache and all related code
paths.

Note that this cache was already broken for caching timestamps of
multiple machines behind a NAT sharing the same address.
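
To make the broken invariant concrete, here is a toy user-space
demonstration (illustrative values only, not kernel code): two
connections from the same host now carry independent random offsets, so
a per-destination "highest timestamp seen" cache such as tcpm_ts can
reject perfectly valid segments.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t now = 1000;		/* shared tcp_time_stamp (jiffies) */
	uint32_t off_a = 0x7f3a11c2;	/* random offset, connection A */
	uint32_t off_b = 0x02c89e40;	/* random offset, connection B */

	uint32_t cached = now + off_a;	/* cache updated from connection A */
	uint32_t tsval_b = now + off_b;	/* valid timestamp from connection B */

	/* A PAWS-style comparison against the cached value misfires: */
	bool accepted = (int32_t)(tsval_b - cached) >= 0;
	printf("connection B accepted? %s\n", accepted ? "yes" : "no");
	return 0;
}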

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Cc: Lutz Vieweg 
Cc: Florian Westphal 
---
 include/net/tcp.h|   6 +-
 net/ipv4/tcp_input.c |   6 +-
 net/ipv4/tcp_ipv4.c  |   4 --
 net/ipv4/tcp_metrics.c   | 147 ++-
 net/ipv4/tcp_minisocks.c |  22 ++-
 net/ipv6/tcp_ipv6.c  |   5 --
 6 files changed, 11 insertions(+), 179 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index bede8f7fa742..c81f3b958d44 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -406,11 +406,7 @@ void tcp_clear_retrans(struct tcp_sock *tp);
 void tcp_update_metrics(struct sock *sk);
 void tcp_init_metrics(struct sock *sk);
 void tcp_metrics_init(void);
-bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst,
-   bool paws_check, bool timestamps);
-bool tcp_remember_stamp(struct sock *sk);
-bool tcp_tw_remember_stamp(struct inet_timewait_sock *tw);
-void tcp_fetch_timewait_stamp(struct sock *sk, struct dst_entry *dst);
+bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst);
 void tcp_disable_fack(struct tcp_sock *tp);
 void tcp_close(struct sock *sk, long timeout);
 void tcp_init_sock(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96b67a8b18c3..aafec0676d3e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6342,8 +6342,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
dst = af_ops->route_req(sk, &fl, req, &strict);
 
if (dst && strict &&
-   !tcp_peer_is_proven(req, dst, true,
-   tmp_opt.saw_tstamp)) {
+   !tcp_peer_is_proven(req, dst)) {
NET_INC_STATS(sock_net(sk), 
LINUX_MIB_PAWSPASSIVEREJECTED);
goto drop_and_release;
}
@@ -6352,8 +6351,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
else if (!net->ipv4.sysctl_tcp_syncookies &&
 (net->ipv4.sysctl_max_syn_backlog - 
inet_csk_reqsk_queue_len(sk) <
  (net->ipv4.sysctl_max_syn_backlog >> 2)) &&
-!tcp_peer_is_proven(req, dst, false,
-tmp_opt.saw_tstamp)) {
+!tcp_peer_is_proven(req, dst)) {
/* Without syncookies last quarter of
 * backlog is filled with destinations,
 * proven to be alive.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 08d870e45658..d8b401fff9fe 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -198,10 +198,6 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr 
*uaddr, int addr_len)
tp->write_seq  = 0;
}
 
-   if (tcp_death_row->sysctl_tw_recycle &&
-   !tp->rx_opt.ts_recent_stamp && fl4->daddr == daddr)
-   tcp_fetch_timewait_stamp(sk, &rt->dst);
-
inet->inet_dport = usin->sin_port;
sk_daddr_set(sk, daddr);
 
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 0f46e5fe31ad..9d0d4f39e42b 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -45,8 +45,6 @@ struct tcp_metrics_block {
struct inetpeer_addrtcpm_saddr;
struct inetpeer_addrtcpm_daddr;
unsigned long   tcpm_stamp;
-   u32 tcpm_ts;
-   u32 tcpm_ts_stamp;
u32 tcpm_lock;
u32 tcpm_vals[TCP_METRIC_MAX_KERNEL + 1];
struct tcp_fastopen_metrics tcpm_fastopen;
@@ -123,8 +121,6 @@ static void tcpm_suck_dst(struct tcp_metrics_block *tm,
tm->tcpm_vals[TCP_METRIC_SSTHRESH] = dst_metric_raw(dst, RTAX_SSTHRESH);
tm->tcpm_vals[TCP_METRIC_CWND] = dst_metric_raw(dst, RTAX_CWND);
tm->tcpm_vals[TCP_METRIC_REORDERING] = dst_metric_raw(dst, 
RTAX_REORDERING);
-   tm->tcpm_ts = 0;
-   tm->tcpm_ts_stamp = 

[PATCH net-next 2/2] tcp: remove tcp_tw_recycle

2017-03-15 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

The tcp_tw_recycle was already broken for connections
behind NAT, since the per-destination timestamp is not
monotonically increasing for multiple machines behind
a single destination address.

After the randomization of TCP timestamp offsets
in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
for each connection), the tcp_tw_recycle is broken for all
types of connections for the same reason: the timestamps
received from a single machine are not monotonically increasing
anymore.

Remove tcp_tw_recycle, since it is not functional. Also, remove
the PAWSPassive SNMP counter since it is only used for
tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
since the strict argument is only set when tcp_tw_recycle is
enabled.

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Cc: Lutz Vieweg 
Cc: Florian Westphal 
---
 Documentation/networking/ip-sysctl.txt |  5 -
 include/net/netns/ipv4.h   |  1 -
 include/net/tcp.h  |  3 +--
 include/uapi/linux/snmp.h  |  1 -
 net/ipv4/proc.c|  1 -
 net/ipv4/sysctl_net_ipv4.c |  7 ---
 net/ipv4/tcp_input.c   | 30 +-
 net/ipv4/tcp_ipv4.c| 15 ++-
 net/ipv6/tcp_ipv6.c|  5 +
 9 files changed, 9 insertions(+), 59 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index ab0230461377..ed3d0791eb27 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -640,11 +640,6 @@ tcp_tso_win_divisor - INTEGER
building larger TSO frames.
Default: 3
 
-tcp_tw_recycle - BOOLEAN
-   Enable fast recycling TIME-WAIT sockets. Default value is 0.
-   It should not be changed without advice/request of technical
-   experts.
-
 tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 622d2da27135..2e9d649ba169 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -33,7 +33,6 @@ struct inet_timewait_death_row {
atomic_ttw_count;
 
struct inet_hashinfo*hashinfo cacheline_aligned_in_smp;
-   int sysctl_tw_recycle;
int sysctl_max_tw_buckets;
 };
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c81f3b958d44..e614ad4d613e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1810,8 +1810,7 @@ struct tcp_request_sock_ops {
 __u16 *mss);
 #endif
struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
-  const struct request_sock *req,
-  bool *strict);
+  const struct request_sock *req);
__u32 (*init_seq_tsoff)(const struct sk_buff *skb, u32 *tsoff);
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
   struct flowi *fl, struct request_sock *req,
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 3b2bed7ca9a4..cec0e171d20c 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -177,7 +177,6 @@ enum
LINUX_MIB_TIMEWAITED,   /* TimeWaited */
LINUX_MIB_TIMEWAITRECYCLED, /* TimeWaitRecycled */
LINUX_MIB_TIMEWAITKILLED,   /* TimeWaitKilled */
-   LINUX_MIB_PAWSPASSIVEREJECTED,  /* PAWSPassiveRejected */
LINUX_MIB_PAWSACTIVEREJECTED,   /* PAWSActiveRejected */
LINUX_MIB_PAWSESTABREJECTED,/* PAWSEstabRejected */
LINUX_MIB_DELAYEDACKS,  /* DelayedACKs */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 69cf49e8356d..4ccbf464d1ac 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -199,7 +199,6 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TW", LINUX_MIB_TIMEWAITED),
SNMP_MIB_ITEM("TWRecycled", LINUX_MIB_TIMEWAITRECYCLED),
SNMP_MIB_ITEM("TWKilled", LINUX_MIB_TIMEWAITKILLED),
-   SNMP_MIB_ITEM("PAWSPassive", LINUX_MIB_PAWSPASSIVEREJECTED),
SNMP_MIB_ITEM("PAWSActive", LINUX_MIB_PAWSACTIVEREJECTED),
SNMP_MIB_ITEM("PAWSEstab", LINUX_MIB_PAWSESTABREJECTED),
SNMP_MIB_ITEM("DelayedACKs", LINUX_MIB_DELAYEDACKS),
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d6880a6149ee..11aaef0939b2 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -981,13 +981,6 @@ static 

[PATCH net] net: properly release sk_frag.page

2017-03-15 Thread Eric Dumazet
From: Eric Dumazet 

I mistakenly added the code to release sk->sk_frag in
sk_common_release() instead of sk_destruct()

TCP sockets using sk->sk_allocation == GFP_ATOMIC do not call
sk_common_release() at close time, thus leaking one (order-3) page.

iSCSI is using such sockets.

Fixes: 5640f7685831 ("net: use a per task frag allocator")
Signed-off-by: Eric Dumazet 
---
 net/core/sock.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 
a96d5f7a5734a52dfd6a2df8490c7bd7f5f6599a..acb0d413749968f24ffc7df3e366b095f80e10f4
 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1442,6 +1442,11 @@ static void __sk_destruct(struct rcu_head *head)
pr_debug("%s: optmem leakage (%d bytes) detected\n",
 __func__, atomic_read(&sk->sk_omem_alloc));
 
+   if (sk->sk_frag.page) {
+   put_page(sk->sk_frag.page);
+   sk->sk_frag.page = NULL;
+   }
+
if (sk->sk_peer_cred)
put_cred(sk->sk_peer_cred);
put_pid(sk->sk_peer_pid);
@@ -2787,11 +2792,6 @@ void sk_common_release(struct sock *sk)
 
sk_refcnt_debug_release(sk);
 
-   if (sk->sk_frag.page) {
-   put_page(sk->sk_frag.page);
-   sk->sk_frag.page = NULL;
-   }
-
sock_put(sk);
 }
 EXPORT_SYMBOL(sk_common_release);




[PATCH net-next 5/7] drivers: net: xgene: Add workaround for errata 10GE_1

2017-03-15 Thread Iyappan Subramanian
From: Quan Nguyen 

This patch implements a workaround for errata 10GE_1:
10Gb Ethernet port FIFO threshold default values are incorrect.

Signed-off-by: Quan Nguyen 
Signed-off-by: Toan Le 
Signed-off-by: Iyappan Subramanian 
Tested-by: Fushen Chen 
---
 drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 7 +++
 drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h | 5 +
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c
index ece19e6..423240c 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c
@@ -341,8 +341,15 @@ static void xgene_xgmac_init(struct xgene_enet_pdata 
*pdata)
 
xgene_enet_rd_csr(pdata, XG_RSIF_CONFIG_REG_ADDR, &data);
data |= CFG_RSIF_FPBUFF_TIMEOUT_EN;
+   /* Errata 10GE_1 - FIFO threshold default value incorrect */
+   RSIF_CLE_BUFF_THRESH_SET(&data, XG_RSIF_CLE_BUFF_THRESH);
xgene_enet_wr_csr(pdata, XG_RSIF_CONFIG_REG_ADDR, data);
 
+   /* Errata 10GE_1 - FIFO threshold default value incorrect */
+   xgene_enet_rd_csr(pdata, XG_RSIF_CONFIG1_REG_ADDR, &data);
+   RSIF_PLC_CLE_BUFF_THRESH_SET(&data, XG_RSIF_PLC_CLE_BUFF_THRESH);
+   xgene_enet_wr_csr(pdata, XG_RSIF_CONFIG1_REG_ADDR, data);
+
xgene_enet_rd_csr(pdata, XG_ENET_SPARE_CFG_REG_ADDR, &data);
data |= BIT(12);
xgene_enet_wr_csr(pdata, XG_ENET_SPARE_CFG_REG_ADDR, data);
diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h 
b/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h
index 03b847a..e644a42 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h
@@ -65,6 +65,11 @@
 #define XG_DEF_PAUSE_THRES 0x390
 #define XG_DEF_PAUSE_OFF_THRES 0x2c0
 #define XG_RSIF_CONFIG_REG_ADDR0x00a0
+#define XG_RSIF_CLE_BUFF_THRESH0x3
+#define RSIF_CLE_BUFF_THRESH_SET(dst, val) xgene_set_bits(dst, val, 0, 3)
+#define XG_RSIF_CONFIG1_REG_ADDR   0x00b8
+#define XG_RSIF_PLC_CLE_BUFF_THRESH0x1
+#define RSIF_PLC_CLE_BUFF_THRESH_SET(dst, val) xgene_set_bits(dst, val, 0, 2)
 #define XCLE_BYPASS_REG0_ADDR   0x0160
 #define XCLE_BYPASS_REG1_ADDR   0x0164
 #define XG_CFG_BYPASS_ADDR 0x0204
-- 
1.9.1



[PATCH net-next 6/7] drivers: net: xgene: Add workaround for errata 10GE_8/ENET_11

2017-03-15 Thread Iyappan Subramanian
This patch implements a workaround for errata 10GE_8 and ENET_11:
"HW reports length error for valid 64 byte frames with len <46 bytes"
by recovering such frames from the error path.

Signed-off-by: Iyappan Subramanian 
Signed-off-by: Quan Nguyen 
Signed-off-by: Toan Le 
Tested-by: Fushen Chen 
---
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.c   |  2 +-
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.h   |  1 +
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 43 +++-
 drivers/net/ethernet/apm/xgene/xgene_enet_main.h |  1 +
 4 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
index c72a17e..2a835e0 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
@@ -494,7 +494,7 @@ static void xgene_gmac_set_speed(struct xgene_enet_pdata 
*pdata)
break;
}
 
-   mc2 |= FULL_DUPLEX2 | PAD_CRC;
+   mc2 |= FULL_DUPLEX2 | PAD_CRC | LENGTH_CHK;
xgene_enet_wr_mcx_mac(pdata, MAC_CONFIG_2_ADDR, mc2);
xgene_enet_wr_mcx_mac(pdata, INTERFACE_CONTROL_ADDR, intf_ctl);
xgene_enet_wr_csr(pdata, RGMII_REG_0_ADDR, rgmii);
diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h 
b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
index b6cd625..d250bfe 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
@@ -216,6 +216,7 @@ enum xgene_enet_rm {
 #define ENET_GHD_MODE  BIT(26)
 #define FULL_DUPLEX2   BIT(0)
 #define PAD_CRCBIT(2)
+#define LENGTH_CHK BIT(4)
 #define SCAN_AUTO_INCR BIT(5)
 #define TBYT_ADDR  0x38
 #define TPKT_ADDR  0x39
diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
index e881365..5f37ed3 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
@@ -658,12 +658,24 @@ static void xgene_enet_free_pagepool(struct 
xgene_enet_desc_ring *buf_pool,
buf_pool->head = head;
 }
 
+/* Errata 10GE_8 and ENET_11 - allow packet with length <=64B */
+static bool xgene_enet_errata_10GE_8(struct sk_buff *skb, u32 len, u8 status)
+{
+   if (status == INGRESS_PKT_LEN && len == ETHER_MIN_PACKET) {
+   if (ntohs(eth_hdr(skb)->h_proto) < 46)
+   return true;
+   }
+
+   return false;
+}
+
 static int xgene_enet_rx_frame(struct xgene_enet_desc_ring *rx_ring,
   struct xgene_enet_raw_desc *raw_desc,
   struct xgene_enet_raw_desc *exp_desc)
 {
struct xgene_enet_desc_ring *buf_pool, *page_pool;
u32 datalen, frag_size, skb_index;
+   struct xgene_enet_pdata *pdata;
struct net_device *ndev;
dma_addr_t dma_addr;
struct sk_buff *skb;
@@ -676,6 +688,7 @@ static int xgene_enet_rx_frame(struct xgene_enet_desc_ring 
*rx_ring,
bool nv;
 
ndev = rx_ring->ndev;
+   pdata = netdev_priv(ndev);
dev = ndev_to_dev(rx_ring->ndev);
buf_pool = rx_ring->buf_pool;
page_pool = rx_ring->page_pool;
@@ -686,30 +699,29 @@ static int xgene_enet_rx_frame(struct 
xgene_enet_desc_ring *rx_ring,
skb = buf_pool->rx_skb[skb_index];
buf_pool->rx_skb[skb_index] = NULL;
 
+   datalen = xgene_enet_get_data_len(le64_to_cpu(raw_desc->m1));
+   skb_put(skb, datalen);
+   prefetch(skb->data - NET_IP_ALIGN);
+   skb->protocol = eth_type_trans(skb, ndev);
+
/* checking for error */
status = (GET_VAL(ELERR, le64_to_cpu(raw_desc->m0)) << LERR_LEN) |
  GET_VAL(LERR, le64_to_cpu(raw_desc->m0));
if (unlikely(status)) {
-   dev_kfree_skb_any(skb);
-   xgene_enet_free_pagepool(page_pool, raw_desc, exp_desc);
-   xgene_enet_parse_error(rx_ring, netdev_priv(rx_ring->ndev),
-  status);
-   ret = -EIO;
-   goto out;
+   if (!xgene_enet_errata_10GE_8(skb, datalen, status)) {
+   dev_kfree_skb_any(skb);
+   xgene_enet_free_pagepool(page_pool, raw_desc, exp_desc);
+   xgene_enet_parse_error(rx_ring, pdata, status);
+   goto out;
+   }
}
 
-   /* strip off CRC as HW isn't doing this */
-   datalen = xgene_enet_get_data_len(le64_to_cpu(raw_desc->m1));
-
nv = GET_VAL(NV, le64_to_cpu(raw_desc->m0));
-   if (!nv)
+   if (!nv) {
+   /* strip off CRC as HW isn't doing this */
datalen -= 4;
-
-   skb_put(skb, datalen);
-   prefetch(skb->data - NET_IP_ALIGN);
-
-   

[PATCH net-next 4/7] drivers: net: xgene: Fix Rx checksum validation logic

2017-03-15 Thread Iyappan Subramanian
This patch fixes Rx checksum validation logic and
adds NETIF_F_RXCSUM flag.

Signed-off-by: Iyappan Subramanian 
---
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 27 +++-
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
index ec43278..e881365 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
@@ -601,14 +601,24 @@ static netdev_tx_t xgene_enet_start_xmit(struct sk_buff 
*skb,
return NETDEV_TX_OK;
 }
 
-static void xgene_enet_skip_csum(struct sk_buff *skb)
+static void xgene_enet_rx_csum(struct sk_buff *skb)
 {
+   struct net_device *ndev = skb->dev;
struct iphdr *iph = ip_hdr(skb);
 
-   if (!ip_is_fragment(iph) ||
-   (iph->protocol != IPPROTO_TCP && iph->protocol != IPPROTO_UDP)) {
-   skb->ip_summed = CHECKSUM_UNNECESSARY;
-   }
+   if (!(ndev->features & NETIF_F_RXCSUM))
+   return;
+
+   if (skb->protocol != htons(ETH_P_IP))
+   return;
+
+   if (ip_is_fragment(iph))
+   return;
+
+   if (iph->protocol != IPPROTO_TCP && iph->protocol != IPPROTO_UDP)
+   return;
+
+   skb->ip_summed = CHECKSUM_UNNECESSARY;
 }
 
 static void xgene_enet_free_pagepool(struct xgene_enet_desc_ring *buf_pool,
@@ -729,10 +739,7 @@ static int xgene_enet_rx_frame(struct xgene_enet_desc_ring 
*rx_ring,
 skip_jumbo:
skb_checksum_none_assert(skb);
skb->protocol = eth_type_trans(skb, ndev);
-   if (likely((ndev->features & NETIF_F_IP_CSUM) &&
-  skb->protocol == htons(ETH_P_IP))) {
-   xgene_enet_skip_csum(skb);
-   }
+   xgene_enet_rx_csum(skb);
 
rx_ring->rx_packets++;
rx_ring->rx_bytes += datalen;
@@ -2039,7 +2046,7 @@ static int xgene_enet_probe(struct platform_device *pdev)
xgene_enet_setup_ops(pdata);
 
if (pdata->phy_mode == PHY_INTERFACE_MODE_XGMII) {
-   ndev->features |= NETIF_F_TSO;
+   ndev->features |= NETIF_F_TSO | NETIF_F_RXCSUM;
spin_lock_init(>mss_lock);
}
ndev->hw_features = ndev->features;
-- 
1.9.1



[PATCH net-next 3/7] drivers: net: xgene: Fix wrong logical operation

2017-03-15 Thread Iyappan Subramanian
From: Quan Nguyen 

This patch fixes the wrong logical OR operation by changing it to a
bit-wise OR operation.

Fixes: 3bb502f83080 ("drivers: net: xgene: fix statistics counters race 
condition")
Signed-off-by: Iyappan Subramanian 
Signed-off-by: Quan Nguyen 
---
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
index b3568c4..ec43278 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
@@ -677,9 +677,9 @@ static int xgene_enet_rx_frame(struct xgene_enet_desc_ring 
*rx_ring,
buf_pool->rx_skb[skb_index] = NULL;
 
/* checking for error */
-   status = (GET_VAL(ELERR, le64_to_cpu(raw_desc->m0)) << LERR_LEN) ||
+   status = (GET_VAL(ELERR, le64_to_cpu(raw_desc->m0)) << LERR_LEN) |
  GET_VAL(LERR, le64_to_cpu(raw_desc->m0));
-   if (unlikely(status > 2)) {
+   if (unlikely(status)) {
dev_kfree_skb_any(skb);
xgene_enet_free_pagepool(page_pool, raw_desc, exp_desc);
xgene_enet_parse_error(rx_ring, netdev_priv(rx_ring->ndev),
-- 
1.9.1



[PATCH net-next 0/7] drivers: net: xgene: Bug fixes and errata workarounds

2017-03-15 Thread Iyappan Subramanian
This patch set contains bug fixes and errata workarounds.

Signed-off-by: Iyappan Subramanian 
Signed-off-by: Quan Nguyen 
---

Iyappan Subramanian (3):
  drivers: net: xgene: Fix Rx checksum validation logic
  drivers: net: xgene: Add workaround for errata 10GE_8/ENET_11
  MAINTAINERS: Update X-Gene SoC ethernet maintainer

Quan Nguyen (4):
  drivers: net: phy: xgene: Fix mdio write
  drivers: net: xgene: Fix hardware checksum setting
  drivers: net: xgene: Fix wrong logical operation
  drivers: net: xgene: Add workaround for errata 10GE_1

 MAINTAINERS   |  1 +
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.c|  3 +-
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.h|  2 +
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c  | 74 ++-
 drivers/net/ethernet/apm/xgene/xgene_enet_main.h  |  1 +
 drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c |  7 +++
 drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h |  5 ++
 drivers/net/phy/mdio-xgene.c  |  2 +-
 8 files changed, 65 insertions(+), 30 deletions(-)

-- 
1.9.1



[PATCH net-next 2/7] drivers: net: xgene: Fix hardware checksum setting

2017-03-15 Thread Iyappan Subramanian
From: Quan Nguyen 

This patch fixes the hardware checksum settings by properly programming
the classifier. Otherwise, packets may be received with checksum errors
on the X-Gene1 SoC.

Signed-off-by: Quan Nguyen 
Signed-off-by: Iyappan Subramanian 
---
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.c | 1 +
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
index 06e6816..c72a17e 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
@@ -623,6 +623,7 @@ static void xgene_enet_cle_bypass(struct xgene_enet_pdata 
*pdata,
xgene_enet_rd_csr(pdata, CLE_BYPASS_REG0_0_ADDR, &cb);
cb |= CFG_CLE_BYPASS_EN0;
CFG_CLE_IP_PROTOCOL0_SET(&cb, 3);
+   CFG_CLE_IP_HDR_LEN_SET(&cb, 0);
xgene_enet_wr_csr(pdata, CLE_BYPASS_REG0_0_ADDR, cb);
 
xgene_enet_rd_csr(pdata, CLE_BYPASS_REG1_0_ADDR, &cb);
diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h 
b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
index 5f83037..b6cd625 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
@@ -163,6 +163,7 @@ enum xgene_enet_rm {
 #define CFG_RXCLK_MUXSEL0_SET(dst, val)xgene_set_bits(dst, val, 26, 3)
 
 #define CFG_CLE_IP_PROTOCOL0_SET(dst, val) xgene_set_bits(dst, val, 16, 2)
+#define CFG_CLE_IP_HDR_LEN_SET(dst, val)   xgene_set_bits(dst, val, 8, 5)
 #define CFG_CLE_DSTQID0_SET(dst, val)  xgene_set_bits(dst, val, 0, 12)
 #define CFG_CLE_FPSEL0_SET(dst, val)   xgene_set_bits(dst, val, 16, 4)
 #define CFG_CLE_NXTFPSEL0_SET(dst, val)xgene_set_bits(dst, 
val, 20, 4)
-- 
1.9.1



[PATCH net-next 1/7] drivers: net: phy: xgene: Fix mdio write

2017-03-15 Thread Iyappan Subramanian
From: Quan Nguyen 

This patch fixes a typo in the argument to xgene_enet_wr_mdio_csr().

Signed-off-by: Quan Nguyen 
Signed-off-by: Iyappan Subramanian 
---
 drivers/net/phy/mdio-xgene.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/mdio-xgene.c b/drivers/net/phy/mdio-xgene.c
index f095051..3e2ac07 100644
--- a/drivers/net/phy/mdio-xgene.c
+++ b/drivers/net/phy/mdio-xgene.c
@@ -229,7 +229,7 @@ static int xgene_xfi_mdio_write(struct mii_bus *bus, int 
phy_id,
 
val = SET_VAL(HSTPHYADX, phy_id) | SET_VAL(HSTREGADX, reg) |
  SET_VAL(HSTMIIMWRDAT, data);
-   xgene_enet_wr_mdio_csr(addr, MIIM_FIELD_ADDR, data);
+   xgene_enet_wr_mdio_csr(addr, MIIM_FIELD_ADDR, val);
 
val = HSTLDCMD | SET_VAL(HSTMIIMCMD, MIIM_CMD_LEGACY_WRITE);
xgene_enet_wr_mdio_csr(addr, MIIM_COMMAND_ADDR, val);
-- 
1.9.1



[PATCH net-next 7/7] MAINTAINERS: Update X-Gene SoC ethernet maintainer

2017-03-15 Thread Iyappan Subramanian
Signed-off-by: Iyappan Subramanian 
Signed-off-by: Keyur Chudgar 
Signed-off-by: Quan Nguyen 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index cefda30..632e762 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -896,6 +896,7 @@ F:  arch/arm64/boot/dts/apm/
 APPLIED MICRO (APM) X-GENE SOC ETHERNET DRIVER
 M: Iyappan Subramanian 
 M: Keyur Chudgar 
+M: Quan Nguyen 
 S: Supported
 F: drivers/net/ethernet/apm/xgene/
 F: drivers/net/phy/mdio-xgene.c
-- 
1.9.1



Re: [PATCH net-next v2 2/3] net: dsa: check out-of-range ageing time value

2017-03-15 Thread Florian Fainelli
On 03/15/2017 12:53 PM, Vivien Didelot wrote:
> If a DSA switch driver cannot program an ageing time value due to it
> being out-of-range, switchdev will raise a stack trace before failing.
> 
> To fix this, add ageing_time_min and ageing_time_max members to the
> dsa_switch in order for the switch drivers to optionally specify their
> supported ageing time limits.
> 
> The DSA core will now check for provided ageing time limits and return
> -ERANGE from the switchdev prepare phase if the value is out-of-range.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 

You could simplify the two changes (and remove the checks for
ds->ageing_time_{min,max}) by setting ds->ageing_time_max to ~0 by
default. Absolutely not critical and the code is clear as-is.
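
For reference, a minimal sketch of that simplification (assuming the
default is applied once at switch registration time, before any ageing
time is programmed; dsa_ageing_defaults is a hypothetical helper):

#include <net/dsa.h>

static void dsa_ageing_defaults(struct dsa_switch *ds)
{
	/* With this default, the range checks can run unconditionally:
	 * an unset minimum (0) never rejects small values and an unset
	 * maximum (~0) never rejects large ones.
	 */
	if (!ds->ageing_time_max)
		ds->ageing_time_max = ~0U;
}

dsa_slave_ageing_time() could then drop the ds->ageing_time_{min,max}
guards and compare against the limits directly.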
-- 
Florian


Re: [PATCH net-next v2 1/3] net: dsa: dsa_fastest_ageing_time return unsigned

2017-03-15 Thread Florian Fainelli
On 03/15/2017 12:53 PM, Vivien Didelot wrote:
> The ageing time is defined as unsigned int, so make
> dsa_fastest_ageing_time return an unsigned int instead of int.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 
-- 
Florian


[PATCH net] amd-xgbe: Fix jumbo MTU processing on newer hardware

2017-03-15 Thread Tom Lendacky
Newer hardware does not provide a cumulative payload length when multiple
descriptors are needed to handle the data. Once the MTU increases beyond
the size that can be handled by a single descriptor, the SKB does not get
built properly by the driver.

The driver will now calculate the size of the data buffers used by the
hardware.  The first buffer of the first descriptor is for packet headers
or packet headers and data when the headers can't be split. Subsequent
descriptors in a multi-descriptor chain will not use the first buffer. The
second buffer is used by all the descriptors in the chain for payload data.
Based on whether the driver is processing the first, intermediate, or last
descriptor it can calculate the buffer usage and build the SKB properly.

Tested and verified on both old and new hardware.
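
As a rough illustration of that accounting (a sketch only; the helper
and its parameters are hypothetical, not the driver's API):

/* How much payload a given Rx descriptor contributed: intermediate
 * descriptors fill the whole second (data) buffer, while the last
 * descriptor holds the remainder of the packet length it reports.
 */
static unsigned int xgbe_buf2_used(int last, unsigned int buf2_size,
				   unsigned int copied, unsigned int pkt_len)
{
	if (!last)
		return buf2_size;
	return pkt_len - copied;
}

The first descriptor additionally consumes the header buffer (or header
plus data when the header could not be split), which is why the driver
now tracks FIRST/LAST attributes instead of a single INCOMPLETE flag.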

Signed-off-by: Tom Lendacky 
---
 drivers/net/ethernet/amd/xgbe/xgbe-common.h |6 +-
 drivers/net/ethernet/amd/xgbe/xgbe-dev.c|   20 +++--
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c|  102 +--
 3 files changed, 78 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-common.h 
b/drivers/net/ethernet/amd/xgbe/xgbe-common.h
index 8a280e7..86f1626 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-common.h
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-common.h
@@ -1148,8 +1148,8 @@
 #define RX_PACKET_ATTRIBUTES_CSUM_DONE_WIDTH   1
 #define RX_PACKET_ATTRIBUTES_VLAN_CTAG_INDEX   1
 #define RX_PACKET_ATTRIBUTES_VLAN_CTAG_WIDTH   1
-#define RX_PACKET_ATTRIBUTES_INCOMPLETE_INDEX  2
-#define RX_PACKET_ATTRIBUTES_INCOMPLETE_WIDTH  1
+#define RX_PACKET_ATTRIBUTES_LAST_INDEX2
+#define RX_PACKET_ATTRIBUTES_LAST_WIDTH1
 #define RX_PACKET_ATTRIBUTES_CONTEXT_NEXT_INDEX3
 #define RX_PACKET_ATTRIBUTES_CONTEXT_NEXT_WIDTH1
 #define RX_PACKET_ATTRIBUTES_CONTEXT_INDEX 4
@@ -1158,6 +1158,8 @@
 #define RX_PACKET_ATTRIBUTES_RX_TSTAMP_WIDTH   1
 #define RX_PACKET_ATTRIBUTES_RSS_HASH_INDEX6
 #define RX_PACKET_ATTRIBUTES_RSS_HASH_WIDTH1
+#define RX_PACKET_ATTRIBUTES_FIRST_INDEX   7
+#define RX_PACKET_ATTRIBUTES_FIRST_WIDTH   1
 
 #define RX_NORMAL_DESC0_OVT_INDEX  0
 #define RX_NORMAL_DESC0_OVT_WIDTH  16
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-dev.c 
b/drivers/net/ethernet/amd/xgbe/xgbe-dev.c
index 937f37a..24a687c 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-dev.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-dev.c
@@ -1896,10 +1896,15 @@ static int xgbe_dev_read(struct xgbe_channel *channel)
 
/* Get the header length */
if (XGMAC_GET_BITS_LE(rdesc->desc3, RX_NORMAL_DESC3, FD)) {
+   XGMAC_SET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES,
+  FIRST, 1);
rdata->rx.hdr_len = XGMAC_GET_BITS_LE(rdesc->desc2,
  RX_NORMAL_DESC2, HL);
if (rdata->rx.hdr_len)
pdata->ext_stats.rx_split_header_packets++;
+   } else {
+   XGMAC_SET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES,
+  FIRST, 0);
}
 
/* Get the RSS hash */
@@ -1922,19 +1927,16 @@ static int xgbe_dev_read(struct xgbe_channel *channel)
}
}
 
-   /* Get the packet length */
-   rdata->rx.len = XGMAC_GET_BITS_LE(rdesc->desc3, RX_NORMAL_DESC3, PL);
-
-   if (!XGMAC_GET_BITS_LE(rdesc->desc3, RX_NORMAL_DESC3, LD)) {
-   /* Not all the data has been transferred for this packet */
-   XGMAC_SET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES,
-  INCOMPLETE, 1);
+   /* Not all the data has been transferred for this packet */
+   if (!XGMAC_GET_BITS_LE(rdesc->desc3, RX_NORMAL_DESC3, LD))
return 0;
-   }
 
/* This is the last of the data for this packet */
XGMAC_SET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES,
-  INCOMPLETE, 0);
+  LAST, 1);
+
+   /* Get the packet length */
+   rdata->rx.len = XGMAC_GET_BITS_LE(rdesc->desc3, RX_NORMAL_DESC3, PL);
 
/* Set checksum done indicator as appropriate */
if (netdev->features & NETIF_F_RXCSUM)
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c 
b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index ffea985..a713abd 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1971,13 +1971,12 @@ static struct sk_buff *xgbe_create_skb(struct 
xgbe_prv_data *pdata,
 {
struct sk_buff *skb;
u8 *packet;
-   unsigned int copy_len;
 
skb = napi_alloc_skb(napi, rdata->rx.hdr.dma_len);
if (!skb)
return NULL;
 
-   /* Start with the header buffer which may contain just the header
+   /* Pull in the header buffer which may contain just the header
 * or the header plus 


Re: [PATCH net-next v2 3/3] net: dsa: mv88e6xxx: specify ageing time limits

2017-03-15 Thread Florian Fainelli
On 03/15/2017 12:53 PM, Vivien Didelot wrote:
> Now that DSA has ageing time limits, specify them when registering a
> switch so that out-of-range values are handled correctly by the core.
> 
> Signed-off-by: Vivien Didelot 
> Reported-by: Jason Cobham 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [net-next v2 00/12][pull request] 40GbE Intel Wired LAN Driver Updates 2017-03-15

2017-03-15 Thread David Miller
From: Jeff Kirsher 
Date: Wed, 15 Mar 2017 02:25:33 -0700

> This series contains updates to i40e and i40evf only.

Pulled, thanks Jeff.


[PATCH net] net: bcmgenet: Do not suspend PHY if Wake-on-LAN is enabled

2017-03-15 Thread Florian Fainelli
Suspending the PHY would be putting it in a low power state where it
may no longer allow us to do Wake-on-LAN.

Fixes: cc013fb48898 ("net: bcmgenet: correctly suspend and resume PHY device")
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 69015fa50f20..365895ed3c3e 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -3481,7 +3481,8 @@ static int bcmgenet_suspend(struct device *d)
 
bcmgenet_netif_stop(dev);
 
-   phy_suspend(priv->phydev);
+   if (!device_may_wakeup(d))
+   phy_suspend(priv->phydev);
 
netif_device_detach(dev);
 
@@ -3578,7 +3579,8 @@ static int bcmgenet_resume(struct device *d)
 
netif_device_attach(dev);
 
-   phy_resume(priv->phydev);
+   if (!device_may_wakeup(d))
+   phy_resume(priv->phydev);
 
if (priv->eee.eee_enabled)
bcmgenet_eee_enable_set(dev, true);
-- 
2.9.3



[PATCH net-next v2 3/3] net: dsa: mv88e6xxx: specify ageing time limits

2017-03-15 Thread Vivien Didelot
Now that DSA has ageing time limits, specify them when registering a
switch so that out-of-range values are handled correctly by the core.

Signed-off-by: Vivien Didelot 
Reported-by: Jason Cobham 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 3354f99df378..2bca297d9296 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -4253,6 +4253,8 @@ static int mv88e6xxx_register_switch(struct 
mv88e6xxx_chip *chip)
 
ds->priv = chip;
ds->ops = _switch_ops;
+   ds->ageing_time_min = chip->info->age_time_coeff;
+   ds->ageing_time_max = chip->info->age_time_coeff * U8_MAX;
 
dev_set_drvdata(dev, ds);
 
-- 
2.12.0



[PATCH net-next v2 1/3] net: dsa: dsa_fastest_ageing_time return unsigned

2017-03-15 Thread Vivien Didelot
The ageing time is defined as unsigned int, so make
dsa_fastest_ageing_time return an unsigned int instead of int.

Signed-off-by: Vivien Didelot 
---
 net/dsa/slave.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index c34872e1febc..cec47e843570 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -419,8 +419,8 @@ static int dsa_slave_vlan_filtering(struct net_device *dev,
return 0;
 }
 
-static int dsa_fastest_ageing_time(struct dsa_switch *ds,
-  unsigned int ageing_time)
+static unsigned int dsa_fastest_ageing_time(struct dsa_switch *ds,
+   unsigned int ageing_time)
 {
int i;
 
-- 
2.12.0



[PATCH net-next v2 2/3] net: dsa: check out-of-range ageing time value

2017-03-15 Thread Vivien Didelot
If a DSA switch driver cannot program an ageing time value due to it
being out-of-range, switchdev will raise a stack trace before failing.

To fix this, add ageing_time_min and ageing_time_max members to the
dsa_switch in order for the switch drivers to optionally specify their
supported ageing time limits.

The DSA core will now check for provided ageing time limits and return
-ERANGE from the switchdev prepare phase if the value is out-of-range.

Signed-off-by: Vivien Didelot 
---
 include/net/dsa.h | 4 
 net/dsa/slave.c   | 8 ++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index bf0e42c2a6f7..e42897fd7a96 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -233,6 +233,10 @@ struct dsa_switch {
u32 phys_mii_mask;
struct mii_bus  *slave_mii_bus;
 
+   /* Ageing Time limits in msecs */
+   unsigned int ageing_time_min;
+   unsigned int ageing_time_max;
+
/* Dynamically allocated ports, keep last */
size_t num_ports;
struct dsa_port ports[];
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index cec47e843570..78128acfbf63 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -443,9 +443,13 @@ static int dsa_slave_ageing_time(struct net_device *dev,
unsigned long ageing_jiffies = clock_t_to_jiffies(attr->u.ageing_time);
unsigned int ageing_time = jiffies_to_msecs(ageing_jiffies);
 
-   /* bridge skips -EOPNOTSUPP, so skip the prepare phase */
-   if (switchdev_trans_ph_prepare(trans))
+   if (switchdev_trans_ph_prepare(trans)) {
+   if (ds->ageing_time_min && ageing_time < ds->ageing_time_min)
+   return -ERANGE;
+   if (ds->ageing_time_max && ageing_time > ds->ageing_time_max)
+   return -ERANGE;
return 0;
+   }
 
/* Keep the fastest ageing time in case of multiple bridges */
p->dp->ageing_time = ageing_time;
-- 
2.12.0



[PATCH net-next v2 0/3] net: dsa: check out-of-range ageing time

2017-03-15 Thread Vivien Didelot
The ageing time limits supported by DSA drivers vary depending on the
switch model. If a driver returns -ERANGE for out-of-range values, the
switchdev commit phase will fail with the following stacktrace:

# brctl setageing br0 4
[ 8530.082179] WARNING: CPU: 0 PID: 910 at net/switchdev/switchdev.c:291 
switchdev_port_attr_set_now+0xbc/0xc0
[ 8530.090679] br0: Commit of attribute (id=5) failed.
[ 8530.094256] Modules linked in:
[ 8530.096032] CPU: 0 PID: 910 Comm: kworker/0:4 Tainted: GW   
4.10.0 #361
[ 8530.102412] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
[ 8530.107571] Workqueue: events switchdev_deferred_process_work
[ 8530.112039] Backtrace:
[ 8530.113224] [<8010ca34>] (dump_backtrace) from [<8010cd3c>] 
(show_stack+0x20/0x24)
[ 8530.119521]  r6: r5:80834da0 r4:80ca7e48 r3:8120ca3c
[ 8530.123908] [<8010cd1c>] (show_stack) from [<8037ad40>] 
(dump_stack+0x24/0x28)
[ 8530.129873] [<8037ad1c>] (dump_stack) from [<80118de4>] 
(__warn+0xf4/0x10c)
[ 8530.135545] [<80118cf0>] (__warn) from [<80118e44>] 
(warn_slowpath_fmt+0x48/0x50)
[ 8530.141760]  r9: r8:81252bec r7:80f19d90 r6:9dc3c000 r5:80ca7e7c 
r4:80834de8
[ 8530.148235] [<80118e00>] (warn_slowpath_fmt) from [<80670b20>] 
(switchdev_port_attr_set_now+0xbc/0xc0)
[ 8530.156240]  r3:9dc3c000 r2:80834de8
[ 8530.158539]  r4:ffde
[ 8530.159788] [<80670a64>] (switchdev_port_attr_set_now) from [<80670b44>] 
(switchdev_port_attr_set_deferred+0x20/0x6c)
[ 8530.169118]  r7:806705a8 r6:9dc3c000 r5:80f19d90 r4:80f19d80
[ 8530.173500] [<80670b24>] (switchdev_port_attr_set_deferred) from 
[<80670580>] (switchdev_deferred_process+0x50/0xe8)
[ 8530.182742]  r6:80ca6000 r5:81252bec r4:80f19d80 r3:80670b24
[ 8530.187115] [<80670530>] (switchdev_deferred_process) from [<80670930>] 
(switchdev_deferred_process_work+0x1c/0x24)
[ 8530.196277]  r8: r7:9ffdc100 r6:8120ad6c r5:9ddefc00 r4:81252bf4 
r3:9de343c0
[ 8530.202756] [<80670914>] (switchdev_deferred_process_work) from 
[<8012f770>] (process_one_work+0x120/0x3b0)
[ 8530.211231] [<8012f650>] (process_one_work) from [<8012fa70>] 
(worker_thread+0x70/0x534)
[ 8530.218046]  r10:9ddefc00 r9:8120ad6c r8:80ca6038 r7:8120ad80 
r6:81211f80 r5:9ddefc18
[ 8530.224579]  r4:8120ad6c
[ 8530.225830] [<8012fa00>] (worker_thread) from [<80135640>] 
(kthread+0x114/0x144)
[ 8530.231955]  r10:9f4e9e94 r9:9de1fe58 r8:8012fa00 r7:9ddefc00 
r6:9de1fdc0 r5:
[ 8530.238497]  r4:9de1fe40
[ 8530.239750] [<8013552c>] (kthread) from [<80108cd8>] 
(ret_from_fork+0x14/0x3c)
[ 8530.245679]  r10: r9: r8: r7: 
r6: r5:8013552c
[ 8530.252234]  r4:9de1fdc0 r3:80ca6000
[ 8530.254512] ---[ end trace 87475cc71b80ef73 ]---
[ 8530.257852] br0: failed (err=-34) to set attribute (id=5)

This patchset fixes this by adding ageing_time_min and ageing_time_max
fields to the dsa_switch structure, which can optionally be set by a DSA
driver.

If provided, the DSA core will check for out-of-range values in the
SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME prepare phase and return -ERANGE
accordingly.

Finally set these limits in the mv88e6xxx driver.

Vivien Didelot (3):
  net: dsa: dsa_fastest_ageing_time return unsigned
  net: dsa: check out-of-range ageing time value
  net: dsa: mv88e6xxx: specify ageing time limits

 drivers/net/dsa/mv88e6xxx/chip.c |  2 ++
 include/net/dsa.h|  4 
 net/dsa/slave.c  | 12 
 3 files changed, 14 insertions(+), 4 deletions(-)

-- 
2.12.0



Re: [PATCH net] fjes: Fix wrong netdevice feature flags

2017-03-15 Thread David Miller
From: Taku Izumi 
Date: Wed, 15 Mar 2017 13:47:50 +0900

> This patch fixes netdev->features for the Extended Socket network device.
> 
> Currently the Extended Socket network device's netdev->features claims
> NETIF_F_HW_CSUM; however, this is completely wrong, as there is no
> checksum offload capability in the hardware.
> That causes invalid TCP/UDP checksums and packet rejection when IP
> forwarding from the Extended Socket network device to another network
> device.
> 
> NETIF_F_HW_CSUM should be omitted.
> 
> Signed-off-by: Taku Izumi 

Applied, thanks.


Re: [PATCH 1/1] gtp: support SGSN-side tunnels

2017-03-15 Thread Harald Welte
On Wed, Mar 15, 2017 at 08:10:38PM +0100, Harald Welte wrote:
> I've modified the patch slightly, see below (compile-tested, but not
> otherwise tested yet).  Basically renamed the flags attribute to 'role',
> expanded the commit log and removed unrelated cosmetic changes.

I also have a version against current net-next/master, in case anyone is
interested.

From 3274a3303d1ec997392a07a92666d57b13997658 Mon Sep 17 00:00:00 2001
From: Jonas Bonn 
Date: Wed, 15 Mar 2017 20:24:28 +0100
Subject: [PATCH] gtp: support SGSN-side tunnels

The GTP-tunnel driver is explicitly GGSN-side as it searches for PDP
contexts based on the incoming packets _destination_ address.  For
real-world use cases, this is sufficient, as the other side of a GTP
tunnel is not in fact implemented by GTP, but by the protocol stacking
of a mobile station / user equipment on the radio interface (like PDCP,
SNDCP).

However, if we want to simulate the mobile station, radio access network
and SGSN (for example to test the GGSN side implementation), then we
want to be identifying PDP contexts based on _source_ address.

This patch adds a "role" attribute at GTP-link creation time to specify
whether we behave like the GGSN or SGSN role of the tunnel; this
attribute is then used to determine which part of the IP packet to use
in determining the PDP context.

Signed-off-by: Jonas Bonn 
Signed-off-by: Harald Welte 
---
 drivers/net/gtp.c| 46 +---
 include/uapi/linux/gtp.h |  2 +-
 include/uapi/linux/if_link.h |  7 +++
 3 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c
index 3e1854f34420..3ab593b9be85 100644
--- a/drivers/net/gtp.c
+++ b/drivers/net/gtp.c
@@ -74,6 +74,7 @@ struct gtp_dev {
 
struct net_device   *dev;
 
+   enum ifla_gtp_role  role;
unsigned inthash_size;
struct hlist_head   *tid_hash;
struct hlist_head   *addr_hash;
@@ -154,8 +155,8 @@ static struct pdp_ctx *ipv4_pdp_find(struct gtp_dev *gtp, 
__be32 ms_addr)
return NULL;
 }
 
-static bool gtp_check_src_ms_ipv4(struct sk_buff *skb, struct pdp_ctx *pctx,
- unsigned int hdrlen)
+static bool gtp_check_ms_ipv4(struct sk_buff *skb, struct pdp_ctx *pctx,
+ unsigned int hdrlen, enum ifla_gtp_role role)
 {
struct iphdr *iph;
 
@@ -164,27 +165,31 @@ static bool gtp_check_src_ms_ipv4(struct sk_buff *skb, 
struct pdp_ctx *pctx,
 
iph = (struct iphdr *)(skb->data + hdrlen);
 
-   return iph->saddr == pctx->ms_addr_ip4.s_addr;
+   if (role == GTP_ROLE_SGSN)
+   return iph->daddr == pctx->ms_addr_ip4.s_addr;
+   else
+   return iph->saddr == pctx->ms_addr_ip4.s_addr;
 }
 
 /* Check if the inner IP source address in this packet is assigned to any
  * existing mobile subscriber.
  */
-static bool gtp_check_src_ms(struct sk_buff *skb, struct pdp_ctx *pctx,
-unsigned int hdrlen)
+static bool gtp_check_ms(struct sk_buff *skb, struct pdp_ctx *pctx,
+unsigned int hdrlen, enum ifla_gtp_role role)
 {
switch (ntohs(skb->protocol)) {
case ETH_P_IP:
-   return gtp_check_src_ms_ipv4(skb, pctx, hdrlen);
+   return gtp_check_ms_ipv4(skb, pctx, hdrlen, role);
}
return false;
 }
 
-static int gtp_rx(struct pdp_ctx *pctx, struct sk_buff *skb, unsigned int 
hdrlen)
+static int gtp_rx(struct pdp_ctx *pctx, struct sk_buff *skb, unsigned int 
hdrlen,
+ enum ifla_gtp_role role)
 {
struct pcpu_sw_netstats *stats;
 
-   if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
+   if (!gtp_check_ms(skb, pctx, hdrlen, role)) {
netdev_dbg(pctx->dev, "No PDP ctx for this MS\n");
return 1;
}
@@ -239,7 +244,7 @@ static int gtp0_udp_encap_recv(struct gtp_dev *gtp, struct 
sk_buff *skb)
return 1;
}
 
-   return gtp_rx(pctx, skb, hdrlen);
+   return gtp_rx(pctx, skb, hdrlen, gtp->role);
 }
 
 static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, struct sk_buff *skb)
@@ -281,7 +286,7 @@ static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, struct 
sk_buff *skb)
return 1;
}
 
-   return gtp_rx(pctx, skb, hdrlen);
+   return gtp_rx(pctx, skb, hdrlen, gtp->role);
 }
 
 static void gtp_encap_destroy(struct sock *sk)
@@ -481,7 +486,11 @@ static int gtp_build_skb_ip4(struct sk_buff *skb, struct 
net_device *dev,
 * Prepend PDP header with TEI/TID from PDP ctx.
 */
iph = ip_hdr(skb);
-   pctx = ipv4_pdp_find(gtp, iph->daddr);
+   if (gtp->role == GTP_ROLE_SGSN)
+   pctx = ipv4_pdp_find(gtp, iph->saddr);
+   else
+   pctx = ipv4_pdp_find(gtp, iph->daddr);
+
if (!pctx) {

Re: [PATCH 1/1] gtp: support SGSN-side tunnels

2017-03-15 Thread Harald Welte
Hi Pablo,

On Wed, Mar 15, 2017 at 06:23:48PM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 15, 2017 at 05:39:16PM +0100, Harald Welte wrote:
> > 
> > I would definitely like to see this move forward, particularly in order
> > to test the GGSN-side code.
> 
> Agreed.

I've modified the patch slightly, see below (compile-tested, but not
otherwise tested yet).  Basically renamed the flags attribute to 'role',
expanded the commit log and removed unrelated cosmetic changes.

I've also prepared a corresponding change to libgtpnl into the
laforge/sgsn-role branch, see
http://git.osmocom.org/libgtpnl/commit/?h=laforge/sgsn-role

This is not yet tested in any way, but I'm planning to add some
associated support to the command line tools and then give it some
testing (both against the kernel GTP in GGSN mode, as well as an
independent userspace GTP implementation).

> It would be good if we provide a way to configure GTP via iproute2 for
> testing purposes.

I don't really care about which tool is used, as long as it is easily
available [and FOSS, of course].

> We would need to create some dummy socket from
> kernel too though so we don't need any userspace daemon for this
> testing mode.

I don't really like that latter idea. It sounds too much like a hack to
me.  But then, I don't have enough imagination right now to picture what
an actual implementation would look like.

To me, it is perfectly fine to run a simple, small utility in userspace
even for testing.

Regards,
Harald

From 63920950f9498069993def78e178bde85c174e0c Mon Sep 17 00:00:00 2001
From: Jonas Bonn 
Date: Wed, 15 Mar 2017 17:52:28 +0100
Subject: [PATCH] gtp: support SGSN-side tunnels

The GTP-tunnel driver is explicitly GGSN-side as it searches for PDP
contexts based on the incoming packets _destination_ address.  For
real-world use cases, this is sufficient, as the other side of a GTP
tunnel is not in fact implemented by GTP, but by the protocol stacking
of a mobile station / user equipment on the radio interface (like PDCP,
SNDCP).

However, if we want to simulate the mobile station, radio access network
and SGSN (for example to test the GGSN side implementation), then we
want to be identifying PDP contexts based on _source_ address.

This patch adds a "role" attribute at GTP-link creation time to specify
whether we behave like the GGSN or SGSN role of the tunnel; this
attribute is then used to determine which part of the IP packet to use
in determining the PDP context.

Signed-off-by: Jonas Bonn 
Signed-off-by: Harald Welte 

diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c
index 99d3df788ce8..9aef4217f6e1 100644
--- a/drivers/net/gtp.c
+++ b/drivers/net/gtp.c
@@ -71,6 +71,7 @@ struct gtp_dev {
 
struct net_device   *dev;
 
+   enum ifla_gtp_role  role;
unsigned inthash_size;
struct hlist_head   *tid_hash;
struct hlist_head   *addr_hash;
@@ -149,8 +150,8 @@ static struct pdp_ctx *ipv4_pdp_find(struct gtp_dev *gtp, 
__be32 ms_addr)
return NULL;
 }
 
-static bool gtp_check_src_ms_ipv4(struct sk_buff *skb, struct pdp_ctx *pctx,
- unsigned int hdrlen)
+static bool gtp_check_ms_ipv4(struct sk_buff *skb, struct pdp_ctx *pctx,
+ unsigned int hdrlen, enum ifla_gtp_role role)
 {
struct iphdr *iph;
 
@@ -159,18 +160,21 @@ static bool gtp_check_src_ms_ipv4(struct sk_buff *skb, 
struct pdp_ctx *pctx,
 
iph = (struct iphdr *)(skb->data + hdrlen);
 
-   return iph->saddr == pctx->ms_addr_ip4.s_addr;
+   if (role == GTP_ROLE_SGSN)
+   return iph->daddr == pctx->ms_addr_ip4.s_addr;
+   else
+   return iph->saddr == pctx->ms_addr_ip4.s_addr;
 }
 
 /* Check if the inner IP source address in this packet is assigned to any
  * existing mobile subscriber.
  */
-static bool gtp_check_src_ms(struct sk_buff *skb, struct pdp_ctx *pctx,
-unsigned int hdrlen)
+static bool gtp_check_ms(struct sk_buff *skb, struct pdp_ctx *pctx,
+unsigned int hdrlen, enum ifla_gtp_role role)
 {
switch (ntohs(skb->protocol)) {
case ETH_P_IP:
-   return gtp_check_src_ms_ipv4(skb, pctx, hdrlen);
+   return gtp_check_ms_ipv4(skb, pctx, hdrlen, role);
}
return false;
 }
@@ -204,7 +208,7 @@ static int gtp0_udp_encap_recv(struct gtp_dev *gtp, struct 
sk_buff *skb,
goto out_rcu;
}
 
-   if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
+   if (!gtp_check_ms(skb, pctx, hdrlen, gtp->role)) {
netdev_dbg(gtp->dev, "No PDP ctx for this MS\n");
ret = -1;
goto out_rcu;
@@ -261,7 +265,7 @@ static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, struct 
sk_buff *skb,
goto out_rcu;
}
 
-   if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
+   if 

Re: [RFC v1 for accelerated IPoIB 14/25] net/mlx5: Enable flow-steering for IB link

2017-03-15 Thread Leon Romanovsky
On Mon, Mar 13, 2017 at 08:31:25PM +0200, Erez Shitrit wrote:
>
> Get the relevant capabilities if the device supports
> ipoib_enhanced_offloads and init the flow steering table accordingly.
>
> Signed-off-by: Erez Shitrit 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 10 +-
>  drivers/net/ethernet/mellanox/mlx5/core/fw.c  |  3 ++-
>  2 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> index fa4edd88daf1..dd21fc557281 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> @@ -1991,9 +1991,6 @@ int mlx5_init_fs(struct mlx5_core_dev *dev)
>   struct mlx5_flow_steering *steering;
>   int err = 0;
>
> - if (MLX5_CAP_GEN(dev, port_type) != MLX5_CAP_PORT_TYPE_ETH)
> - return 0;
> -
>   err = mlx5_init_fc_stats(dev);
>   if (err)
>   return err;
> @@ -2004,8 +2001,11 @@ int mlx5_init_fs(struct mlx5_core_dev *dev)
>   steering->dev = dev;
>   dev->priv.steering = steering;
>
> - if (MLX5_CAP_GEN(dev, nic_flow_table) &&
> - MLX5_CAP_FLOWTABLE_NIC_RX(dev, ft_support)) {
> + if ((((MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_ETH) &&
> +   (MLX5_CAP_GEN(dev, nic_flow_table))) ||
> +  ((MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_IB) &&
> +   MLX5_CAP_GEN(dev, ipoib_enhanced_offloads)))
> + && MLX5_CAP_FLOWTABLE_NIC_RX(dev, ft_support)) {

Erez,

Please calculate the result outside of "if.." and do it in steps,
it is pretty hard to count all these brackets.
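
For instance, something along these lines (a sketch of the suggested
style only; the local variable names are made up):

	bool eth_ft_support = (MLX5_CAP_GEN(dev, port_type) ==
			       MLX5_CAP_PORT_TYPE_ETH) &&
			      MLX5_CAP_GEN(dev, nic_flow_table);
	bool ipoib_ft_support = (MLX5_CAP_GEN(dev, port_type) ==
				 MLX5_CAP_PORT_TYPE_IB) &&
				MLX5_CAP_GEN(dev, ipoib_enhanced_offloads);

	if ((eth_ft_support || ipoib_ft_support) &&
	    MLX5_CAP_FLOWTABLE_NIC_RX(dev, ft_support)) {
		/* ... set up the flow steering tables ... */
	}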

Thanks




[PATCH net] MAINTAINERS: remove MACVLAN and VLAN entries

2017-03-15 Thread Pablo Neira Ayuso
The macvlan.c file is listed under both the VLAN and MACVLAN DRIVER
entries, so remove the MACVLAN DRIVER entry since it is redundant.

With this patch I also propose to remove the VLAN (802.1Q) entry, so
this code just falls under NETWORKING [GENERAL].

Signed-off-by: Pablo Neira Ayuso 
---
@David, please just place this in net-next if you find it more pertinent.
Thanks.

 MAINTAINERS | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index c265a5fe4848..4d68d9657ed0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7774,13 +7774,6 @@ F:   include/net/mac80211.h
 F: net/mac80211/
 F: drivers/net/wireless/mac80211_hwsim.[ch]
 
-MACVLAN DRIVER
-M: Patrick McHardy 
-L: netdev@vger.kernel.org
-S: Maintained
-F: drivers/net/macvlan.c
-F: include/linux/if_macvlan.h
-
 MAILBOX API
 M: Jassi Brar 
 L: linux-ker...@vger.kernel.org
@@ -13384,14 +13377,6 @@ W: https://linuxtv.org
 S: Maintained
 F: drivers/media/platform/vivid/*
 
-VLAN (802.1Q)
-M: Patrick McHardy 
-L: netdev@vger.kernel.org
-S: Maintained
-F: drivers/net/macvlan.c
-F: include/linux/if_*vlan.h
-F: net/8021q/
-
 VLYNQ BUS
 M: Florian Fainelli 
 L: openwrt-de...@lists.openwrt.org (subscribers-only)
-- 
2.1.4



Re: net/sctp: recursive locking in sctp_do_peeloff

2017-03-15 Thread Cong Wang
On Wed, Mar 15, 2017 at 5:52 AM, Marcelo Ricardo Leitner
 wrote:
> On Tue, Mar 14, 2017 at 09:52:15PM -0700, Cong Wang wrote:
>> Instead of checking for the status of the sock, I believe the following
>> one-line fix should do the trick too. Can you give it a try?
>>
>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>> index 0f378ea..4de62d4 100644
>> --- a/net/sctp/socket.c
>> +++ b/net/sctp/socket.c
>> @@ -1494,7 +1494,7 @@ static void sctp_close(struct sock *sk, long timeout)
>>
>> pr_debug("%s: sk:%p, timeout:%ld\n", __func__, sk, timeout);
>>
>> -   lock_sock(sk);
>> +   lock_sock_nested(sk, SINGLE_DEPTH_NESTING);
>> sk->sk_shutdown = SHUTDOWN_MASK;
>> sk->sk_state = SCTP_SS_CLOSING;
>
> I refrained from doing this just because it will change the lock signature
> for the first level too, as sctp_close() can be called directly, and that
> might avoid some other lockdep detections.

I know, but for the first level it is fine to use a different class;
it is merely to make lockdep happy. There is no real deadlock here
since they are two different socks anyway.
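
To spell out the nesting (a simplified sketch of the reported chain; the
report comes from the peeloff getsockopt path):

    lock_sock(sk);                   /* socket being peeled from */
    err = sctp_do_peeloff(sk, id, &newsock);
    ...
    sock_release(newsock);           /* error path: calls sctp_close(),
                                      * which takes lock_sock() on the
                                      * new sock: same lockdep class,
                                      * second acquisition */
    release_sock(sk);

lock_sock_nested(sk, SINGLE_DEPTH_NESTING) in sctp_close() marks that
inner acquisition as one nesting level deep, which silences the false
positive while the two locks stay distinct at runtime.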

>
> Then you probably also need:
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 465a9c8464f9..02506b4406d2 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1543,7 +1543,7 @@ static void sctp_close(struct sock *sk, long timeout)
>  * held and that should be grabbed before socket lock.
>  */
> spin_lock_bh(&net->sctp.addr_wq_lock);
> -   bh_lock_sock(sk);
> +   bh_lock_sock_nested(sk);
>
> /* Hold the sock, since sk_common_release() will put sock_put()
>  * and we have just a little more cleanup.
>
> because sctp_close will re-lock the socket a little later (for backlog
> processing).
>

Ah, of course I missed the re-lock. Dmitry, please add this piece too.

Thanks.


Re: [PATCH net] bridge: ebtables: fix reception of frames DNAT-ed to bridge device

2017-03-15 Thread Pablo Neira Ayuso
On Wed, Mar 15, 2017 at 03:27:20PM +0100, Linus Lüssing wrote:
> On Wed, Mar 15, 2017 at 11:42:11AM +0100, Pablo Neira Ayuso wrote:
> > I'm missing then why redirect is not just enough for Linus' usecase.
> 
> For my usecase, the MAC address is configured by the user from a
> Web-UI. It may or may not be the one from the bridge device.
> 
> Besides, I found it counterintuitive that DNAT did not work here,
> and it took me some time to find out why. At least I didn't read about
> any such known limitation of the dnat target in the ebtables
> manpage.

Could you update ebtables dnat to check whether the ethernet address
matches that of the input bridge interface, so we mangle ->pkt_type
accordingly from there, instead of doing this from the bridge core?
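
Something like this, perhaps (a rough sketch against ebt_dnat.c, not a
final patch; note that in PRE_ROUTING the input device is the bridge
port, which would still need to be mapped to its bridge device):

    static unsigned int
    ebt_dnat_tg(struct sk_buff *skb, const struct xt_action_param *par)
    {
            const struct ebt_nat_info *info = par->targinfo;
            const struct net_device *dev = xt_in(par);

            if (!skb_make_writable(skb, 0))
                    return EBT_DROP;

            ether_addr_copy(eth_hdr(skb)->h_dest, info->mac);

            /* Re-classify after mangling the destination MAC so a frame
             * now addressed to the bridge is delivered locally again. */
            if (is_broadcast_ether_addr(info->mac))
                    skb->pkt_type = PACKET_BROADCAST;
            else if (is_multicast_ether_addr(info->mac))
                    skb->pkt_type = PACKET_MULTICAST;
            else if (dev && ether_addr_equal(info->mac, dev->dev_addr))
                    skb->pkt_type = PACKET_HOST;

            return info->target;
    }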


[net-next PATCH 2/2] mqprio: Modify mqprio to pass user parameters via ndo_setup_tc.

2017-03-15 Thread Alexander Duyck
From: Amritha Nambiar 

The configurable priority-to-traffic-class mapping and the user-specified
queue ranges are used to configure the traffic classes, overriding the
hardware defaults when the 'hw' option is set to 0. However, when the 'hw'
option is non-zero, the hardware QOS defaults are used.

This patch makes it so that we can pass the data the user provided to
ndo_setup_tc. This allows us to pull in the queue configuration if the
user requested it, as well as any additional hardware offload type
requested by using a value other than 1 for the hw value.

Finally, it also provides a means for the device driver to report the
offload level it supports via the qopt->hw value. Previously we always
assumed the value to be 1; in the future, values beyond 1 may be
supported.
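
The per-driver conversion below follows one pattern; for a hypothetical
driver "foo" it boils down to this (a sketch; foo_hw_set_num_tc stands in
for the driver's own helper):

    static int foo_setup_tc(struct net_device *dev, u32 handle,
                            __be16 proto, struct tc_to_netdev *tc)
    {
            u8 num_tc;

            if (tc->type != TC_SETUP_MQPRIO)
                    return -EINVAL;

            /* report back the offload level actually provided */
            tc->mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;
            num_tc = tc->mqprio->num_tc;

            return foo_hw_set_num_tc(dev, num_tc);
    }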

Signed-off-by: Amritha Nambiar 
Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c  |3 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |5 -
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |4 +++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c|   16 ++--
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c   |4 +++-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |7 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |4 +++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c|4 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |4 +++-
 drivers/net/ethernet/sfc/falcon/tx.c  |4 +++-
 drivers/net/ethernet/sfc/tx.c |4 +++-
 drivers/net/ethernet/ti/netcp_core.c  |   12 
 include/linux/netdevice.h |2 +-
 net/sched/sch_mqprio.c|   17 +++--
 14 files changed, 62 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index 248f60d171a5..983dd3026c7a 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1854,7 +1854,8 @@ static int xgbe_setup_tc(struct net_device *netdev, u32 handle, __be16 proto,
if (tc_to_netdev->type != TC_SETUP_MQPRIO)
return -EINVAL;
 
-   tc = tc_to_netdev->tc;
+   tc_to_netdev->mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;
+   tc = tc_to_netdev->mqprio->num_tc;
 
if (tc > pdata->hw_feat.tc_cnt)
return -EINVAL;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 9e8c06130c09..ad3e0631877e 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4277,7 +4277,10 @@ int __bnx2x_setup_tc(struct net_device *dev, u32 handle, __be16 proto,
 {
if (tc->type != TC_SETUP_MQPRIO)
return -EINVAL;
-   return bnx2x_setup_tc(dev, tc->tc);
+
+   tc->mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;
+
+   return bnx2x_setup_tc(dev, tc->mqprio->num_tc);
 }
 
 /* called with rtnl_lock */
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 235733e91c79..5e2515c6aee4 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6894,7 +6894,9 @@ static int bnxt_setup_tc(struct net_device *dev, u32 handle, __be16 proto,
if (ntc->type != TC_SETUP_MQPRIO)
return -EINVAL;
 
-   return bnxt_setup_mq_tc(dev, ntc->tc);
+   ntc->mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;
+
+   return bnxt_setup_mq_tc(dev, ntc->mqprio->num_tc);
 }
 
 #ifdef CONFIG_RFS_ACCEL
diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index aa769cbc7425..d4bb8bf86a45 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -346,33 +346,37 @@ static int dpaa_setup_tc(struct net_device *net_dev, u32 handle, __be16 proto,
 struct tc_to_netdev *tc)
 {
struct dpaa_priv *priv = netdev_priv(net_dev);
+   u8 num_tc;
int i;
 
if (tc->type != TC_SETUP_MQPRIO)
return -EINVAL;
 
-   if (tc->tc == priv->num_tc)
+   tc->mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;
+   num_tc = tc->mqprio->num_tc;
+
+   if (num_tc == priv->num_tc)
return 0;
 
-   if (!tc->tc) {
+   if (!num_tc) {
netdev_reset_tc(net_dev);
goto out;
}
 
-   if (tc->tc > DPAA_TC_NUM) {
+   if (num_tc > DPAA_TC_NUM) {
netdev_err(net_dev, "Too many traffic classes: max %d supported.\n",
   DPAA_TC_NUM);
return -EINVAL;
}
 
-   netdev_set_num_tc(net_dev, tc->tc);
+   netdev_set_num_tc(net_dev, num_tc);

Re: [RFC v1 for accelerated IPoIB 05/25] IB/ipoib: Support ipoib acceleration options callbacks

2017-03-15 Thread Leon Romanovsky
On Wed, Mar 15, 2017 at 09:58:02AM -0600, Jason Gunthorpe wrote:
> On Wed, Mar 15, 2017 at 08:47:51AM +0200, Leon Romanovsky wrote:
> > On Tue, Mar 14, 2017 at 10:00:21AM -0600, Jason Gunthorpe wrote:
> > > On Tue, Mar 14, 2017 at 04:42:55PM +0200, Erez Shitrit wrote:
> > > > >> +   if (!hca->alloc_rdma_netdev)
> > > > >> +   dev = ipoib_create_netdev_default(hca, name, ipoib_setup_common);
> > > > >> +   else
> > > > >> +   dev = hca->alloc_rdma_netdev(hca, port, RDMA_NETDEV_IPOIB,
> > > > >> +name, NET_NAME_UNKNOWN,
> > > > >> +ipoib_setup_common);
> > > > >> +   if (!dev) {
> > > > >> +   kfree(priv);
> > > > >> +   return NULL;
> > > > >> +   }
> > > > >
> > > > >
> > > > > This will break ipoib on hfi1, as hfi1 will define alloc_rdma_netdev
> > > > > for the OPA_VNIC type. We should probably look for a dedicated return
> > > > > type (-ENODEV?) to determine if the driver supports the specified rdma
> > > > > netdev type, or use an ib device attribute to indicate that the driver
> > > > > supports the ipoib rdma netdev.
> > > >
> > > > sorry, I don't understand that. We are in the ipoib driver, so the type
> > > > is RDMA_NETDEV_IPOIB; if hfi wants to implement it, it should use the
> > > > same flag, and use OPA_VNIC for vnic.
> > >
> > > He means it should look like this:
> > >
> > >  if (hca->alloc_rdma_netdev)
> > >  dev = hca->alloc_rdma_netdev(hca, port, RDMA_NETDEV_IPOIB,
> > > name, NET_NAME_UNKNOWN,
> > > ipoib_setup_common);
> > >
> > >  if (IS_ERR(dev) && PTR_ERR(dev) != -ENOTSUP)
> > >   goto out;
> > >
> > >  dev = ipoib_create_netdev_default(hca, name, ipoib_setup_common);
> > >  if (IS_ERR(dev))
> > >   goto out;
> > >
> > >  WARN_ON(dev == NULL);
> > >
> > >   [...]
> > >
> > > out:
> > >   return PTR_ERR(dev);
> > >
> > > And I'm confused why 'ipoib_create_netdev_default' doesn't need the
> > > same function signature as hca->alloc_rdma_netdev
> >
> > And now, I'm confused.
> > In your proposal, "dev" will be overwritten; in Erez's proposal,
> > "dev" will be one of two: the default one or the device-specific one.
>
> Well, no, Erez's version allowed dev to be ERR_PTR too. More like this then:
>
> struct rdma_netdev *get_netdev(..)
> {
>    if (hca->alloc_rdma_netdev) {
>        dev = hca->alloc_rdma_netdev(hca, port, RDMA_NETDEV_IPOIB,
>                                     name, NET_NAME_UNKNOWN,
>                                     ipoib_setup_common);
>
>        if (!IS_ERR(dev) || PTR_ERR(dev) != -ENOTSUP)
>            return dev;
>    }
>
>    return ipoib_create_netdev_default(hca, name, ipoib_setup_common);
> }

Thanks, I agree with you, the interfaces should properly handle errors
paths from day one.

>
> Jason




[net-next PATCH 1/2] mqprio: Change handling of hw u8 to allow for multiple hardware offload modes

2017-03-15 Thread Alexander Duyck
From: Alexander Duyck 

This patch is meant to allow for support of multiple hardware offload types
for a single device. There is currently no bounds checking for the hw
member of the mqprio_qopt structure.  This results in us being able to pass
values from 1 to 255 with all being treated the same.  On retrieving the
value it is returned as 1 for anything 1 or greater being set.

With this change we add limited bounds checking by defining an enum and
using its values to limit the reported hardware offloads.
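
To illustrate how the enum bounds the uapi going forward, a hypothetical
future offload level would simply extend it and automatically raise the
clamp (TC_MQPRIO_HW_OFFLOAD_QUEUES below is illustrative and not part of
this series):

    enum {
            TC_MQPRIO_HW_OFFLOAD_NONE,      /* no offload requested */
            TC_MQPRIO_HW_OFFLOAD_TCS,       /* offload TCs, no queue counts */
            TC_MQPRIO_HW_OFFLOAD_QUEUES,    /* hypothetical: TCs + queues */
            __TC_MQPRIO_HW_OFFLOAD_MAX
    };

    /* the clamp in mqprio_parse_opt() tracks the last real value */
    #define TC_MQPRIO_HW_OFFLOAD_MAX (__TC_MQPRIO_HW_OFFLOAD_MAX - 1)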

Signed-off-by: Alexander Duyck 
---
 include/uapi/linux/pkt_sched.h |8 
 net/sched/sch_mqprio.c |   26 --
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index df7451d35131..099bf5528fed 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -617,6 +617,14 @@ struct tc_drr_stats {
 #define TC_QOPT_BITMASK 15
 #define TC_QOPT_MAX_QUEUE 16
 
+enum {
+   TC_MQPRIO_HW_OFFLOAD_NONE,  /* no offload requested */
+   TC_MQPRIO_HW_OFFLOAD_TCS,   /* offload TCs, no queue counts */
+   __TC_MQPRIO_HW_OFFLOAD_MAX
+};
+
+#define TC_MQPRIO_HW_OFFLOAD_MAX (__TC_MQPRIO_HW_OFFLOAD_MAX - 1)
+
 struct tc_mqprio_qopt {
__u8num_tc;
__u8prio_tc_map[TC_QOPT_BITMASK + 1];
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index b851e209da4d..5f55bf149d9f 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -21,7 +21,7 @@
 
 struct mqprio_sched {
struct Qdisc**qdiscs;
-   int hw_owned;
+   int hw_offload;
 };
 
 static void mqprio_destroy(struct Qdisc *sch)
@@ -39,7 +39,7 @@ static void mqprio_destroy(struct Qdisc *sch)
kfree(priv->qdiscs);
}
 
-   if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
+   if (priv->hw_offload && dev->netdev_ops->ndo_setup_tc)
dev->netdev_ops->ndo_setup_tc(dev, sch->handle, 0, &tc);
else
netdev_set_num_tc(dev, 0);
@@ -59,15 +59,20 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
return -EINVAL;
}
 
-   /* net_device does not support requested operation */
-   if (qopt->hw && !dev->netdev_ops->ndo_setup_tc)
-   return -EINVAL;
+   /* Limit qopt->hw to maximum supported offload value.  Drivers have
+* the option of overriding this later if they don't support a given
+* offload type.
+*/
+   if (qopt->hw > TC_MQPRIO_HW_OFFLOAD_MAX)
+   qopt->hw = TC_MQPRIO_HW_OFFLOAD_MAX;
 
-   /* if hw owned qcount and qoffset are taken from LLD so
-* no reason to verify them here
+   /* If hardware offload is requested we will leave it to the device
+* to either populate the queue counts itself or to validate the
+* provided queue counts.  If ndo_setup_tc is not present then
+* hardware doesn't support offload and we should return an error.
 */
if (qopt->hw)
-   return 0;
+   return dev->netdev_ops->ndo_setup_tc ? 0 : -EINVAL;
 
for (i = 0; i < qopt->num_tc; i++) {
unsigned int last = qopt->offset[i] + qopt->count[i];
@@ -142,10 +147,11 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
struct tc_to_netdev tc = {.type = TC_SETUP_MQPRIO,
  { .tc = qopt->num_tc }};
 
-   priv->hw_owned = 1;
err = dev->netdev_ops->ndo_setup_tc(dev, sch->handle, 0, &tc);
if (err)
return err;
+
+   priv->hw_offload = qopt->hw;
} else {
netdev_set_num_tc(dev, qopt->num_tc);
for (i = 0; i < qopt->num_tc; i++)
@@ -243,7 +249,7 @@ static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 
opt.num_tc = netdev_get_num_tc(dev);
memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
-   opt.hw = priv->hw_owned;
+   opt.hw = priv->hw_offload;
 
for (i = 0; i < netdev_get_num_tc(dev); i++) {
opt.count[i] = dev->tc_to_txq[i].count;



Re: [PATCH net] net/openvswitch: Set the ipv6 source tunnel key address attribute correctly

2017-03-15 Thread Joe Stringer
On 15 March 2017 at 09:28, Jiri Benc  wrote:
> On Wed, 15 Mar 2017 18:10:47 +0200, Or Gerlitz wrote:
>> When dealing with ipv6 source tunnel key address attribute
>> (OVS_TUNNEL_KEY_ATTR_IPV6_SRC) we are wrongly setting the tunnel
>> dst ip, fix that.
>>
>> Fixes: 6b26ba3a7d95 ('openvswitch: netlink attributes for IPv6 tunneling')
>> Signed-off-by: Or Gerlitz 
>> Reported-by: Paul Blakey 
>
> Acked-by: Jiri Benc 

Thanks, good spotting.

Acked-by: Joe Stringer 


[net-next PATCH 0/2] Add support for passing more information in mqprio offload

2017-03-15 Thread Alexander Duyck
This patch series lays the groundwork for future work to allow us to make
full use of the mqprio options when offloading them to hardware.

Currently, when we specify the hardware offload for mqprio, the queue
configuration is completely ignored and the hardware is only notified of
the total number of traffic classes.  This leads to multiple issues, one
specific issue being that you can pass whatever queue configuration you
want and it is totally ignored by the hardware.

What I am planning to do is add support for "hw" values in the
configuration greater than 1.  So, for example, we might have one mode of
mqprio offload that uses 1 and only offloads the TC counts, as we
currently do.  Then we might look at adding an option 2 which would factor
in the TCs and the queue count information. This way we can select the
type of offload we actually want, and existing drivers that don't support
this can just fall back to their legacy configuration.

---

Alexander Duyck (1):
  mqprio: Change handling of hw u8 to allow for multiple hardware offload 
modes

Amritha Nambiar (1):
  mqprio: Modify mqprio to pass user parameters via ndo_setup_tc.


 drivers/net/ethernet/amd/xgbe/xgbe-drv.c  |3 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |5 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |4 ++
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c|   16 +
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c   |4 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c   |7 +++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |4 ++
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c|4 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |4 ++
 drivers/net/ethernet/sfc/falcon/tx.c  |4 ++
 drivers/net/ethernet/sfc/tx.c |4 ++
 drivers/net/ethernet/ti/netcp_core.c  |   12 --
 include/linux/netdevice.h |2 +
 include/uapi/linux/pkt_sched.h|8 
 net/sched/sch_mqprio.c|   39 +
 15 files changed, 84 insertions(+), 36 deletions(-)

--


Re: [PATCH 1/1] gtp: support SGSN-side tunnels

2017-03-15 Thread Pablo Neira Ayuso
On Wed, Mar 15, 2017 at 05:39:16PM +0100, Harald Welte wrote:
> Hi Jonas,
> 
> are you working on the review feedback that was provided back in early
> February?  I think there were some comments like
> * remove unrelated cosmetic change in comment
> * change from FLAGS to a dedicated MODE netlink attribute
> * add libgtpnl code and some usage information or even sample scripts
> 
> I would definitely like to see this move forward, particularly in order
> to test the GGSN-side code.

Agreed.

It would be good if we provided a way to configure GTP via iproute2 for
testing purposes. We would need to create some dummy socket from the
kernel too, though, so we don't need any userspace daemon for this
testing mode.


Re: [PATCH 1/1] gtp: support SGSN-side tunnels

2017-03-15 Thread Harald Welte
Hi Jonas,

are you working on the review feedback that was provided back in early
February?  I think there were some comments like
* remove unrelated cosmetic change in comment
* change from FLAGS to a dedicated MODE netlink attribute
* add libgtpnl code and some usage information or even sample scripts

I would definitely like to see this move forward, particularly in order
to test the GGSN-side code.

Regards,
Harald
-- 
- Harald Welte    http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)


[PATCH 04/10] netfilter: nft_set_bitmap: fetch the element key based on the set->klen

2017-03-15 Thread Pablo Neira Ayuso
From: Liping Zhang 

Currently we just treat the element key as a u32 integer, regardless of
the set key length.

This is incorrect; for example, the tcp port number is only 16 bits.
So when we use the nft_payload expr to get the tcp dport and store
it in dreg, the dport will be stored in bits 0~15, and bits 16~31
will be padded with zeros.

So reg->data[dreg] will look like this:
  0  15   31
  +-+-+-+-+-+-+-+-+-+-+-+-+
  | tcp dport |  0|
  +-+-+-+-+-+-+-+-+-+-+-+-+
But on big-endian systems, if we treat this register as a u32
integer, the element key will be larger than 65535, so the following
lookup in the bitmap set will cause an out-of-bounds access.

Another issue is that if we add an element with a comment to a bitmap
set (although the comment will be ignored eventually), the element will
vanish strangely. Because we treat the element key as a u32 integer,
the comment becomes part of the element key, so the element key will
also be larger than 65535 and an out-of-bounds access will happen:
  # nft add element t s { 1 comment test }

Since set->klen is 1 or 2, it's fine to treat the element key as a u8 or
u16 integer.
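
A concrete illustration (assumed values: big-endian host, klen == 2,
tcp dport 22):

    /* reg->data[dreg] bytes in memory:  00 16 00 00
     *
     * read as u16:  0x0016     = 22       -> fits the 65536-slot bitmap
     * read as u32:  0x00160000 = 1441792  -> far past the bitmap (OOB)
     */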

Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_set_bitmap.c | 27 +--
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/net/netfilter/nft_set_bitmap.c b/net/netfilter/nft_set_bitmap.c
index 152d226552c1..9b024e22717b 100644
--- a/net/netfilter/nft_set_bitmap.c
+++ b/net/netfilter/nft_set_bitmap.c
@@ -45,9 +45,17 @@ struct nft_bitmap {
u8  bitmap[];
 };
 
-static inline void nft_bitmap_location(u32 key, u32 *idx, u32 *off)
+static inline void nft_bitmap_location(const struct nft_set *set,
+  const void *key,
+  u32 *idx, u32 *off)
 {
-   u32 k = (key << 1);
+   u32 k;
+
+   if (set->klen == 2)
+   k = *(u16 *)key;
+   else
+   k = *(u8 *)key;
+   k <<= 1;
 
*idx = k / BITS_PER_BYTE;
*off = k % BITS_PER_BYTE;
@@ -69,7 +77,7 @@ static bool nft_bitmap_lookup(const struct net *net, const struct nft_set *set,
u8 genmask = nft_genmask_cur(net);
u32 idx, off;
 
-   nft_bitmap_location(*key, &idx, &off);
+   nft_bitmap_location(set, key, &idx, &off);
 
return nft_bitmap_active(priv->bitmap, idx, off, genmask);
 }
@@ -83,7 +91,7 @@ static int nft_bitmap_insert(const struct net *net, const struct nft_set *set,
u8 genmask = nft_genmask_next(net);
u32 idx, off;
 
-   nft_bitmap_location(nft_set_ext_key(ext)->data[0], &idx, &off);
+   nft_bitmap_location(set, nft_set_ext_key(ext), &idx, &off);
if (nft_bitmap_active(priv->bitmap, idx, off, genmask))
return -EEXIST;
 
@@ -102,7 +110,7 @@ static void nft_bitmap_remove(const struct net *net,
u8 genmask = nft_genmask_next(net);
u32 idx, off;
 
-   nft_bitmap_location(nft_set_ext_key(ext)->data[0], &idx, &off);
+   nft_bitmap_location(set, nft_set_ext_key(ext), &idx, &off);
/* Enter 00 state. */
priv->bitmap[idx] &= ~(genmask << off);
 }
@@ -116,7 +124,7 @@ static void nft_bitmap_activate(const struct net *net,
u8 genmask = nft_genmask_next(net);
u32 idx, off;
 
-   nft_bitmap_location(nft_set_ext_key(ext)->data[0], &idx, &off);
+   nft_bitmap_location(set, nft_set_ext_key(ext), &idx, &off);
/* Enter 11 state. */
priv->bitmap[idx] |= (genmask << off);
 }
@@ -128,7 +136,7 @@ static bool nft_bitmap_flush(const struct net *net,
u8 genmask = nft_genmask_next(net);
u32 idx, off;
 
-   nft_bitmap_location(nft_set_ext_key(ext)->data[0], &idx, &off);
+   nft_bitmap_location(set, nft_set_ext_key(ext), &idx, &off);
/* Enter 10 state, similar to deactivation. */
priv->bitmap[idx] &= ~(genmask << off);
 
@@ -161,10 +169,9 @@ static void *nft_bitmap_deactivate(const struct net *net,
struct nft_bitmap *priv = nft_set_priv(set);
u8 genmask = nft_genmask_next(net);
struct nft_set_ext *ext;
-   u32 idx, off, key = 0;
+   u32 idx, off;
 
-   memcpy(&key, elem->key.val.data, set->klen);
-   nft_bitmap_location(key, &idx, &off);
+   nft_bitmap_location(set, elem->key.val.data, &idx, &off);
 
if (!nft_bitmap_active(priv->bitmap, idx, off, genmask))
return NULL;
-- 
2.1.4



[PATCH 06/10] netfilter: bridge: honor frag_max_size when refragmenting

2017-03-15 Thread Pablo Neira Ayuso
From: Florian Westphal 

Consider a bridge with an MTU of 9000, but an end host sending smaller
packets to another host with an MTU below 9000.

In this case, after reassembly, bridge netfilter with defrag enabled
would refragment the packet and then attempt to send it as long as it
was below 9k, which can exceed what the smaller-MTU destination
accepts.

Instead we have to cap the refragmentation size at the largest
fragment size seen.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/bridge/br_netfilter_hooks.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 95087e6e8258..3c5185021c1c 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -721,18 +721,20 @@ static unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 
static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-   struct nf_bridge_info *nf_bridge;
-   unsigned int mtu_reserved;
+   struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+   unsigned int mtu, mtu_reserved;
 
mtu_reserved = nf_bridge_mtu_reduction(skb);
+   mtu = skb->dev->mtu;
+
+   if (nf_bridge->frag_max_size && nf_bridge->frag_max_size < mtu)
+   mtu = nf_bridge->frag_max_size;
 
-   if (skb_is_gso(skb) || skb->len + mtu_reserved <= skb->dev->mtu) {
+   if (skb_is_gso(skb) || skb->len + mtu_reserved <= mtu) {
nf_bridge_info_free(skb);
return br_dev_queue_push_xmit(net, sk, skb);
}
 
-   nf_bridge = nf_bridge_info_get(skb);
-
/* This is wrong! We should preserve the original fragment
 * boundaries by preserving frag_list rather than refragmenting.
 */
-- 
2.1.4


