RE: [patch net] net: fec: fix compile with CONFIG_M5272

2016-12-04 Thread Andy Duan
From: Nikita Yushchenko  Sent: Sunday, 
December 04, 2016 11:18 PM
 >To: David S. Miller ; Andy Duan
 >; Troy Kisky ;
 >Andrew Lunn ; Eric Nelson ; Philippe
 >Reynes ; Johannes Berg ;
 >netdev@vger.kernel.org
 >Cc: Chris Healy ; Fabio Estevam
 >; linux-ker...@vger.kernel.org; Nikita
 >Yushchenko 
 >Subject: [patch net] net: fec: fix compile with CONFIG_M5272
 >
 >Commit 4dfb80d18d05 ("net: fec: cache statistics while device is down")
 >introduced unconditional statistics-related actions.
 >
 >However, when the driver is compiled with CONFIG_M5272, statistics-related
 >definitions do not exist, which results in build errors.
 >
 >Fix that by adding needed #if !defined(CONFIG_M5272).
 >
 >Fixes: 4dfb80d18d05 ("net: fec: cache statistics while device is down")
 >Signed-off-by: Nikita Yushchenko 
 >---
 > drivers/net/ethernet/freescale/fec_main.c | 12 +++++++++---
 > 1 file changed, 9 insertions(+), 3 deletions(-)
 >
 >diff --git a/drivers/net/ethernet/freescale/fec_main.c
 >b/drivers/net/ethernet/freescale/fec_main.c
 >index 6a20c24a2003..89e902767abb 100644
 >--- a/drivers/net/ethernet/freescale/fec_main.c
 >+++ b/drivers/net/ethernet/freescale/fec_main.c
 >@@ -2884,7 +2884,9 @@ fec_enet_close(struct net_device *ndev)
 >  if (fep->quirks & FEC_QUIRK_ERR006687)
 >  imx6q_cpuidle_fec_irqs_unused();
 >
 >+#if !defined(CONFIG_M5272)
 >  fec_enet_update_ethtool_stats(ndev);
 >+#endif
It is better to define fec_enet_update_ethtool_stats() for CONFIG_M5272 as an
empty stub, like static void fec_enet_update_ethtool_stats(struct net_device *dev) {},
or to return directly at the top of fec_enet_update_ethtool_stats():
@@ -2315,6 +2315,10 @@ static void fec_enet_update_ethtool_stats(struct net_device *dev)
struct fec_enet_private *fep = netdev_priv(dev);
int i;

+#if defined(CONFIG_M5272)
+   return;
+#endif
+
>
 >  fec_enet_clk_enable(ndev, false);
 >  pinctrl_pm_select_sleep_state(&fep->pdev->dev);
 >@@ -3192,7 +3194,9 @@ static int fec_enet_init(struct net_device *ndev)
 >
 >  fec_restart(ndev);
 >
 >+#if !defined(CONFIG_M5272)
 >  fec_enet_update_ethtool_stats(ndev);
 >+#endif
 >
ditto

 >  return 0;
 > }
 >@@ -3292,9 +3296,11 @@ fec_probe(struct platform_device *pdev)
 >  fec_enet_get_queue_num(pdev, &num_tx_qs, &num_rx_qs);
 >
 >  /* Init network device */
 >- ndev = alloc_etherdev_mqs(sizeof(struct fec_enet_private) +
 >-   ARRAY_SIZE(fec_stats) * sizeof(u64),
 >-   num_tx_qs, num_rx_qs);
 >+ ndev = alloc_etherdev_mqs(sizeof(struct fec_enet_private)
 >+#if !defined(CONFIG_M5272)
 >+   + ARRAY_SIZE(fec_stats) * sizeof(u64)
 >+#endif
 >+   , num_tx_qs, num_rx_qs);
 >  if (!ndev)
 >  return -ENOMEM;
 >
 >--
 >2.1.4


Re: [PATCH v2 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog

2016-12-04 Thread Jesper Dangaard Brouer

On Sat, 3 Dec 2016 19:17:23 -0800 Martin KaFai Lau  wrote:

> This patch allows XDP prog to extend/remove the packet
> data at the head (like adding or removing header).  It is
> done by adding a new XDP helper bpf_xdp_adjust_head().
> 
> It also renames bpf_helper_changes_skb_data() to
> bpf_helper_changes_pkt_data() to better reflect
> that XDP prog does not work on skb.
> 
> Acked-by: Alexei Starovoitov 
> Signed-off-by: Martin KaFai Lau 
> ---
[...]
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 56b43587d200..ccef948cf58a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2234,7 +2234,34 @@ static const struct bpf_func_proto bpf_skb_change_head_proto = {
>   .arg3_type  = ARG_ANYTHING,
>  };
>  
> -bool bpf_helper_changes_skb_data(void *func)
> +BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
> +{
> + /* Both mlx4 and mlx5 driver align each packet to PAGE_SIZE when
> +  * XDP prog is set.
> +  * If the above is not true for the other drivers to support
> +  * bpf_xdp_adjust_head, struct xdp_buff can be extended.
> +  */
> + unsigned long addr = (unsigned long)xdp->data & PAGE_MASK;
> + void *data_hard_start = (void *)addr;
> + void *data = xdp->data + offset;
> +
> + if (unlikely(data < data_hard_start || data > xdp->data_end - ETH_HLEN))
> + return -EINVAL;
> +
> + xdp->data = data;
> +
> + return 0;
> +}

Thanks for adjusting this, I like Daniel's suggestion.

For this patch:

Acked-by: Jesper Dangaard Brouer 

It still looks like 3/4 needs some adjustments based on Saeed's
comments.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[v2 PATCH] netlink: Do not schedule work from sk_destruct

2016-12-04 Thread Herbert Xu
On Mon, Dec 05, 2016 at 03:19:46PM +0800, Herbert Xu wrote:
>
> Thanks for the patch.  It'll obviously work, but I wanted to avoid that
> because it penalises the common path for the rare case.
> 
> Andrey, please try this patch and let me know if it's any better.
> 
> ---8<---
> Subject: netlink: Do not schedule work from sk_destruct

Crap, I screwed it up again.  Here is a v2 which moves the atomic
call into the RCU callback as otherwise the socket can be freed from
another path while we await the RCU callback.

---8<---
It is wrong to schedule a work from sk_destruct using the socket
as the memory reserve because the socket will be freed immediately
after the return from sk_destruct.

Instead we should do the deferral prior to sk_free.

This patch does just that.

Fixes: 707693c8a498 ("netlink: Call cb->done from a worker thread")
Signed-off-by: Herbert Xu 

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 602e5eb..463f5cf 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -322,11 +322,13 @@ static void netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
sk_mem_charge(sk, skb->truesize);
 }
 
-static void __netlink_sock_destruct(struct sock *sk)
+static void netlink_sock_destruct(struct sock *sk)
 {
struct netlink_sock *nlk = nlk_sk(sk);
 
if (nlk->cb_running) {
+   if (nlk->cb.done)
+   nlk->cb.done(&nlk->cb);
module_put(nlk->cb.module);
kfree_skb(nlk->cb.skb);
}
@@ -348,21 +350,7 @@ static void netlink_sock_destruct_work(struct work_struct *work)
struct netlink_sock *nlk = container_of(work, struct netlink_sock,
work);
 
-   nlk->cb.done(&nlk->cb);
-   __netlink_sock_destruct(&nlk->sk);
-}
-
-static void netlink_sock_destruct(struct sock *sk)
-{
-   struct netlink_sock *nlk = nlk_sk(sk);
-
-   if (nlk->cb_running && nlk->cb.done) {
-   INIT_WORK(&nlk->work, netlink_sock_destruct_work);
-   schedule_work(&nlk->work);
-   return;
-   }
-
-   __netlink_sock_destruct(sk);
+   sk_free(&nlk->sk);
 }
 
 /* This lock without WQ_FLAG_EXCLUSIVE is good on UP and it is _very_ bad on
@@ -664,11 +652,21 @@ static int netlink_create(struct net *net, struct socket *sock, int protocol,
goto out;
 }
 
-static void deferred_put_nlk_sk(struct rcu_head *head)
+static void deferred_free_nlk_sk(struct rcu_head *head)
 {
struct netlink_sock *nlk = container_of(head, struct netlink_sock, rcu);
+   struct sock *sk = &nlk->sk;
+
+   if (!atomic_dec_and_test(&sk->sk_refcnt))
+   return;
+
+   if (nlk->cb_running && nlk->cb.done) {
+   INIT_WORK(&nlk->work, netlink_sock_destruct_work);
+   schedule_work(&nlk->work);
+   return;
+   }
 
-   sock_put(&nlk->sk);
+   sk_free(sk);
 }
 
 static int netlink_release(struct socket *sock)
@@ -743,7 +741,7 @@ static int netlink_release(struct socket *sock)
local_bh_disable();
sock_prot_inuse_add(sock_net(sk), &netlink_proto, -1);
local_bh_enable();
-   call_rcu(&nlk->rcu, deferred_put_nlk_sk);
+   call_rcu(&nlk->rcu, deferred_free_nlk_sk);
return 0;
 }
 
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: net: use-after-free in worker_thread

2016-12-04 Thread Herbert Xu
On Sat, Dec 03, 2016 at 05:49:07AM -0800, Eric Dumazet wrote:
>
> @@ -600,6 +600,7 @@ static int __netlink_create(struct net *net, struct socket *sock,
>   }
>   init_waitqueue_head(>wait);
>  
> + sock_set_flag(sk, SOCK_RCU_FREE);
>   sk->sk_destruct = netlink_sock_destruct;
>   sk->sk_protocol = protocol;
>   return 0;

It's not necessarily a big deal but I just wanted to point out
that SOCK_RCU_FREE is not equivalent to the call_rcu thing that
netlink does.  The latter only does the RCU deferral for the socket
release call which is the only place where it's needed while
SOCK_RCU_FREE will force every path to do an RCU deferral.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH net-next 3/3] net: ethoc: Demote packet dropped error message to debug

2016-12-04 Thread Tobias Klauser
On 2016-12-04 at 21:40:30 +0100, Florian Fainelli  wrote:
> Spamming the console with "net eth1: packet dropped" messages can happen
> fairly frequently if the adapter is busy transmitting, so demote the
> message to a debug print.
> 
> Signed-off-by: Florian Fainelli 

Reviewed-by: Tobias Klauser 


Re: net: use-after-free in worker_thread

2016-12-04 Thread Herbert Xu
On Sat, Dec 03, 2016 at 10:14:48AM -0800, Cong Wang wrote:
> On Sat, Dec 3, 2016 at 9:41 AM, Cong Wang  wrote:
> > On Sat, Dec 3, 2016 at 4:56 AM, Andrey Konovalov  
> > wrote:
> >> Hi!
> >>
> >> I'm seeing lots of the following error reports while running the
> >> syzkaller fuzzer.
> >>
> >> Reports appeared when I updated to 3c49de52 (Dec 2) from 2caceb32 (Dec 1).
> >>
> >> ==
> >> BUG: KASAN: use-after-free in worker_thread+0x17d8/0x18a0
> >> Read of size 8 at addr 880067f3ecd8 by task kworker/3:1/774
> >>
> >> page:ea00019fce00 count:1 mapcount:0 mapping:  (null)
> >> index:0x880067f39c10 compound_mapcount: 0
> >> flags: 0x5004080(slab|head)
> >> page dumped because: kasan: bad access detected
> >>
> >> CPU: 3 PID: 774 Comm: kworker/3:1 Not tainted 4.9.0-rc7+ #66
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 
> >> 01/01/2011
> >>  88006c267838 81f882da 6c25e338 11000d84ce9a
> >>  ed000d84ce92 88006c25e340 41b58ab3 8541e198
> >>  81f88048 0001 41b58ab3 853d3ee8
> >> Call Trace:
> >>  [< inline >] __dump_stack lib/dump_stack.c:15
> >>  [] dump_stack+0x292/0x398 lib/dump_stack.c:51
> >>  [< inline >] describe_address mm/kasan/report.c:262
> >>  [] kasan_report_error+0x121/0x560 mm/kasan/report.c:368
> >>  [< inline >] kasan_report mm/kasan/report.c:390
> >>  [] __asan_report_load8_noabort+0x3e/0x40
> >> mm/kasan/report.c:411
> >>  [] worker_thread+0x17d8/0x18a0 kernel/workqueue.c:2228
> >>  [] kthread+0x323/0x3e0 kernel/kthread.c:209
> >>  [] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
> >
> > Heck... this is the pending work vs. sk_destruct() race. :-/
> > We can't wait for the work in RCU callback, let me think about it...

Sorry, my patch was obviously crap as it was trying to delay the
freeing of a socket from sk_destruct which can't be done.

> Please try the attached patch, I only did compile test, I can't access
> my desktop now, so can't do further tests.

Thanks for the patch.  It'll obviously work, but I wanted to avoid that
because it penalises the common path for the rare case.

Andrey, please try this patch and let me know if it's any better.

---8<---
Subject: netlink: Do not schedule work from sk_destruct

It is wrong to schedule a work from sk_destruct using the socket
as the memory reserve because the socket will be freed immediately
after the return from sk_destruct.

Instead we should do the deferral prior to sk_free.

This patch does just that.

Fixes: 707693c8a498 ("netlink: Call cb->done from a worker thread")
Signed-off-by: Herbert Xu 

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 602e5eb..8a642c5 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -322,11 +322,13 @@ static void netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
sk_mem_charge(sk, skb->truesize);
 }
 
-static void __netlink_sock_destruct(struct sock *sk)
+static void netlink_sock_destruct(struct sock *sk)
 {
struct netlink_sock *nlk = nlk_sk(sk);
 
if (nlk->cb_running) {
+   if (nlk->cb.done)
+   nlk->cb.done(&nlk->cb);
module_put(nlk->cb.module);
kfree_skb(nlk->cb.skb);
}
@@ -348,21 +350,7 @@ static void netlink_sock_destruct_work(struct work_struct *work)
struct netlink_sock *nlk = container_of(work, struct netlink_sock,
work);
 
-   nlk->cb.done(&nlk->cb);
-   __netlink_sock_destruct(&nlk->sk);
-}
-
-static void netlink_sock_destruct(struct sock *sk)
-{
-   struct netlink_sock *nlk = nlk_sk(sk);
-
-   if (nlk->cb_running && nlk->cb.done) {
-   INIT_WORK(&nlk->work, netlink_sock_destruct_work);
-   schedule_work(&nlk->work);
-   return;
-   }
-
-   __netlink_sock_destruct(sk);
+   sk_free(&nlk->sk);
 }
 
 /* This lock without WQ_FLAG_EXCLUSIVE is good on UP and it is _very_ bad on
@@ -664,11 +652,17 @@ static int netlink_create(struct net *net, struct socket *sock, int protocol,
goto out;
 }
 
-static void deferred_put_nlk_sk(struct rcu_head *head)
+static void deferred_free_nlk_sk(struct rcu_head *head)
 {
struct netlink_sock *nlk = container_of(head, struct netlink_sock, rcu);
 
-   sock_put(&nlk->sk);
+   if (nlk->cb_running && nlk->cb.done) {
+   INIT_WORK(&nlk->work, netlink_sock_destruct_work);
+   schedule_work(&nlk->work);
+   return;
+   }
+
+   sk_free(&nlk->sk);
 }
 
 static int netlink_release(struct socket *sock)
@@ -743,7 +737,9 @@ static int netlink_release(struct socket *sock)
local_bh_disable();
sock_prot_inuse_add(sock_net(sk), &netlink_proto, -1);
local_bh_enable();
-   call_rcu(&nlk->rcu, deferred_put_nlk_sk);
+   call_rcu(&nlk->rcu, deferred_free_nlk_sk);

Re: [PATCH net-next 1/3] net: ethoc: Account for duplex changes

2016-12-04 Thread Tobias Klauser
On 2016-12-04 at 21:40:28 +0100, Florian Fainelli  wrote:
> ethoc_mdio_poll(), which is our PHYLIB adjust_link callback, does nothing;
> we should at least react to duplex changes and change MODER accordingly.
> Speed changes are not a problem, since the OpenCores Ethernet core seems
> to react okay without us telling it.
> 
> Signed-off-by: Florian Fainelli 

Reviewed-by: Tobias Klauser 


Re: [PATCH net-next 2/3] net: ethoc: Utilize of_get_mac_address()

2016-12-04 Thread Tobias Klauser
On 2016-12-04 at 21:40:29 +0100, Florian Fainelli  wrote:
> Do not open code getting the MAC address exclusively from the
> "local-mac-address" property, but instead use of_get_mac_address() which
> looks up the MAC address using the 3 typical property names.
> 
> Signed-off-by: Florian Fainelli 

Reviewed-by: Tobias Klauser 


[net-next][PATCH 03/18] RDS: IB: include faddr in connection log

2016-12-04 Thread Santosh Shilimkar
Also use pr_* for it.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_cm.c   | 19 +--
 net/rds/ib_recv.c |  4 ++--
 net/rds/ib_send.c |  4 ++--
 3 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 5b2ab95..b9da1e5 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -113,19 +113,18 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_event *event)
}
 
if (conn->c_version < RDS_PROTOCOL(3, 1)) {
-   printk(KERN_NOTICE "RDS/IB: Connection to %pI4 version %u.%u failed,"
-  " no longer supported\n",
-  &conn->c_faddr,
-  RDS_PROTOCOL_MAJOR(conn->c_version),
-  RDS_PROTOCOL_MINOR(conn->c_version));
+   pr_notice("RDS/IB: Connection <%pI4,%pI4> version %u.%u no longer supported\n",
+ &conn->c_laddr, &conn->c_faddr,
+ RDS_PROTOCOL_MAJOR(conn->c_version),
+ RDS_PROTOCOL_MINOR(conn->c_version));
rds_conn_destroy(conn);
return;
} else {
-   printk(KERN_NOTICE "RDS/IB: connected to %pI4 version %u.%u%s\n",
-  &conn->c_faddr,
-  RDS_PROTOCOL_MAJOR(conn->c_version),
-  RDS_PROTOCOL_MINOR(conn->c_version),
-  ic->i_flowctl ? ", flow control" : "");
+   pr_notice("RDS/IB: connected <%pI4,%pI4> version %u.%u%s\n",
+ &conn->c_laddr, &conn->c_faddr,
+ RDS_PROTOCOL_MAJOR(conn->c_version),
+ RDS_PROTOCOL_MINOR(conn->c_version),
+ ic->i_flowctl ? ", flow control" : "");
}
 
/*
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 606a11f..6803b75 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -980,8 +980,8 @@ void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic,
} else {
/* We expect errors as the qp is drained during shutdown */
if (rds_conn_up(conn) || rds_conn_connecting(conn))
-   rds_ib_conn_error(conn, "recv completion on %pI4 had status %u (%s), disconnecting and reconnecting\n",
- &conn->c_faddr,
+   rds_ib_conn_error(conn, "recv completion on <%pI4,%pI4> had status %u (%s), disconnecting and reconnecting\n",
+ &conn->c_laddr, &conn->c_faddr,
  wc->status,
  ib_wc_status_msg(wc->status));
}
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 84d90c9..19eca5c 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -300,8 +300,8 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 
/* We expect errors as the qp is drained during shutdown */
if (wc->status != IB_WC_SUCCESS && rds_conn_up(conn)) {
-   rds_ib_conn_error(conn, "send completion on %pI4 had status %u (%s), disconnecting and reconnecting\n",
- &conn->c_faddr, wc->status,
+   rds_ib_conn_error(conn, "send completion on <%pI4,%pI4> had status %u (%s), disconnecting and reconnecting\n",
+ &conn->c_laddr, &conn->c_faddr, wc->status,
  ib_wc_status_msg(wc->status));
}
 }
-- 
1.9.1



[net-next][PATCH 18/18] RDS: IB: add missing connection cache usage info

2016-12-04 Thread Santosh Shilimkar
rds-tools already supports it.

Signed-off-by: Santosh Shilimkar 
---
 include/uapi/linux/rds.h | 1 +
 net/rds/ib.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 3833113..410ae3c 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -183,6 +183,7 @@ struct rds_info_rdma_connection {
uint32_t max_send_sge;
uint32_t rdma_mr_max;
uint32_t rdma_mr_size;
+   uint32_t cache_allocs;
 };
 
 /* RDS message Receive Path Latency points */
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 8d70884..b5e2699 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -313,6 +313,7 @@ static int rds_ib_conn_info_visitor(struct rds_connection *conn,
iinfo->max_send_wr = ic->i_send_ring.w_nr;
iinfo->max_recv_wr = ic->i_recv_ring.w_nr;
iinfo->max_send_sge = rds_ibdev->max_sge;
+   iinfo->cache_allocs = atomic_read(&ic->i_cache_allocs);
rds_ib_get_mr_info(rds_ibdev, iinfo);
}
return 1;
-- 
1.9.1



[net-next][PATCH 01/18] RDS: log the address on bind failure

2016-12-04 Thread Santosh Shilimkar
It's useful to know the IP address when RDS fails to bind a
connection, so add it to the error message.

Orabug: 21894138
Reviewed-by: Wei Lin Guay 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/bind.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index 095f6ce..3a915be 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -176,8 +176,8 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
if (!trans) {
ret = -EADDRNOTAVAIL;
rds_remove_bound(rs);
-   printk_ratelimited(KERN_INFO "RDS: rds_bind() could not find a transport, "
-   "load rds_tcp or rds_rdma?\n");
+   pr_info_ratelimited("RDS: %s could not find a transport for %pI4, load rds_tcp or rds_rdma?\n",
+   __func__, &sin->sin_addr.s_addr);
goto out;
}
 
-- 
1.9.1



[net-next][PATCH 05/18] RDS: RDMA: fix the ib_map_mr_sg_zbva() argument

2016-12-04 Thread Santosh Shilimkar
Fixes warning: Using plain integer as NULL pointer

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_frmr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index d921adc..66b3d62 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -104,14 +104,15 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
struct rds_ib_frmr *frmr = &ibmr->u.frmr;
struct ib_send_wr *failed_wr;
struct ib_reg_wr reg_wr;
-   int ret;
+   int ret, off = 0;
 
while (atomic_dec_return(&ibmr->ic->i_fastreg_wrs) <= 0) {
atomic_inc(&ibmr->ic->i_fastreg_wrs);
cpu_relax();
}
 
-   ret = ib_map_mr_sg_zbva(frmr->mr, ibmr->sg, ibmr->sg_len, 0, PAGE_SIZE);
+   ret = ib_map_mr_sg_zbva(frmr->mr, ibmr->sg, ibmr->sg_len,
+   &off, PAGE_SIZE);
if (unlikely(ret != ibmr->sg_len))
return ret < 0 ? ret : -EINVAL;
 
-- 
1.9.1



[net-next][PATCH 11/18] RDS: IB: add few useful cache stats

2016-12-04 Thread Santosh Shilimkar
Tracks the ib receive cache total, incoming and frag allocations.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   | 7 +++
 net/rds/ib_recv.c  | 6 ++
 net/rds/ib_stats.c | 2 ++
 3 files changed, 15 insertions(+)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 97e7696..4987387 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -151,6 +151,7 @@ struct rds_ib_connection {
u64 i_ack_recv; /* last ACK received */
struct rds_ib_refill_cache i_cache_incs;
struct rds_ib_refill_cache i_cache_frags;
+   atomic_t i_cache_allocs;
 
/* sending acks */
unsigned long   i_ack_flags;
@@ -254,6 +255,8 @@ struct rds_ib_statistics {
uint64_t s_ib_rx_refill_from_cq;
uint64_t s_ib_rx_refill_from_thread;
uint64_t s_ib_rx_alloc_limit;
+   uint64_t s_ib_rx_total_frags;
+   uint64_t s_ib_rx_total_incs;
uint64_t s_ib_rx_credit_updates;
uint64_t s_ib_ack_sent;
uint64_t s_ib_ack_send_failure;
@@ -276,6 +279,8 @@ struct rds_ib_statistics {
uint64_t s_ib_rdma_mr_1m_reused;
uint64_t s_ib_atomic_cswp;
uint64_t s_ib_atomic_fadd;
+   uint64_t s_ib_recv_added_to_cache;
+   uint64_t s_ib_recv_removed_from_cache;
 };
 
 extern struct workqueue_struct *rds_ib_wq;
@@ -406,6 +411,8 @@ int rds_ib_send_grab_credits(struct rds_ib_connection *ic, u32 wanted,
 /* ib_stats.c */
 DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats);
 #define rds_ib_stats_inc(member) rds_stats_inc_which(rds_ib_stats, member)
+#define rds_ib_stats_add(member, count) \
+   rds_stats_add_which(rds_ib_stats, member, count)
 unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter,
unsigned int avail);
 
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 6803b75..4b0f126 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -194,6 +194,8 @@ static void rds_ib_frag_free(struct rds_ib_connection *ic,
rdsdebug("frag %p page %p\n", frag, sg_page(>f_sg));
 
rds_ib_recv_cache_put(&frag->f_cache_entry, &ic->i_cache_frags);
+   atomic_add(RDS_FRAG_SIZE / SZ_1K, &ic->i_cache_allocs);
+   rds_ib_stats_add(s_ib_recv_added_to_cache, RDS_FRAG_SIZE);
 }
 
 /* Recycle inc after freeing attached frags */
@@ -261,6 +263,7 @@ static struct rds_ib_incoming *rds_ib_refill_one_inc(struct rds_ib_connection *ic,
atomic_dec(&rds_ib_allocation);
return NULL;
}
+   rds_ib_stats_inc(s_ib_rx_total_incs);
}
INIT_LIST_HEAD(&ibinc->ii_frags);
rds_inc_init(&ibinc->ii_inc, ic->conn, ic->conn->c_faddr);
@@ -278,6 +281,8 @@ static struct rds_page_frag *rds_ib_refill_one_frag(struct rds_ib_connection *ic,
cache_item = rds_ib_recv_cache_get(&ic->i_cache_frags);
if (cache_item) {
frag = container_of(cache_item, struct rds_page_frag, f_cache_entry);
+   atomic_sub(RDS_FRAG_SIZE / SZ_1K, &ic->i_cache_allocs);
+   rds_ib_stats_add(s_ib_recv_removed_from_cache, RDS_FRAG_SIZE);
} else {
frag = kmem_cache_alloc(rds_ib_frag_slab, slab_mask);
if (!frag)
@@ -290,6 +295,7 @@ static struct rds_page_frag *rds_ib_refill_one_frag(struct rds_ib_connection *ic,
kmem_cache_free(rds_ib_frag_slab, frag);
return NULL;
}
+   rds_ib_stats_inc(s_ib_rx_total_frags);
}
 
INIT_LIST_HEAD(&frag->f_item);
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index 7e78dca..9252ad1 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -55,6 +55,8 @@
"ib_rx_refill_from_cq",
"ib_rx_refill_from_thread",
"ib_rx_alloc_limit",
+   "ib_rx_total_frags",
+   "ib_rx_total_incs",
"ib_rx_credit_updates",
"ib_ack_sent",
"ib_ack_send_failure",
-- 
1.9.1



[net-next][PATCH 15/18] RDS: add stat for socket recv memory usage

2016-12-04 Thread Santosh Shilimkar
From: Venkat Venkatsubra 

Tracks the receive-side memory added to sockets and removed from sockets.

Signed-off-by: Venkat Venkatsubra 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rds.h  | 3 +++
 net/rds/recv.c | 4 
 2 files changed, 7 insertions(+)

diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0bb8213..8ccd5a9 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -631,6 +631,9 @@ struct rds_statistics {
uint64_t s_cong_update_received;
uint64_t s_cong_send_error;
uint64_t s_cong_send_blocked;
+   uint64_t s_recv_bytes_added_to_socket;
+   uint64_t s_recv_bytes_removed_from_socket;
+
 };
 
 /* af_rds.c */
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 9d0666e..ba19eee 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -94,6 +94,10 @@ static void rds_recv_rcvbuf_delta(struct rds_sock *rs, struct sock *sk,
return;
 
rs->rs_rcv_bytes += delta;
+   if (delta > 0)
+   rds_stats_add(s_recv_bytes_added_to_socket, delta);
+   else
+   rds_stats_add(s_recv_bytes_removed_from_socket, -delta);
now_congested = rs->rs_rcv_bytes > rds_sk_rcvbuf(rs);
 
rdsdebug("rs %p (%pI4:%u) recv bytes %d buf %d "
-- 
1.9.1



[net-next][PATCH 04/18] RDS: IB: make the transport retry count smallest

2016-12-04 Thread Santosh Shilimkar
Transport retry is not very useful, since it indicates packet loss in the
fabric, so it is better to fail over fast rather than retry for longer.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 45ac8e8..f4e8121 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -16,7 +16,7 @@
 #define RDS_IB_DEFAULT_SEND_WR 256
 #define RDS_IB_DEFAULT_FR_WR   512
 
-#define RDS_IB_DEFAULT_RETRY_COUNT 2
+#define RDS_IB_DEFAULT_RETRY_COUNT 1
 
 #define RDS_IB_SUPPORTED_PROTOCOLS 0x0003  /* minor versions 
supported */
 
-- 
1.9.1



[net-next][PATCH 16/18] RDS: make message size limit compliant with spec

2016-12-04 Thread Santosh Shilimkar
From: Avinash Repaka 

RDS supports a max message size of 1M, but the code doesn't check this
in all cases. This patch fixes it for RDMA & non-RDMA and for the RDS MR
size, and the limit is enforced irrespective of the underlying transport.

Signed-off-by: Avinash Repaka 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma.c |  9 -
 net/rds/rds.h  |  3 +++
 net/rds/send.c | 31 +++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index dd508e0..6e8db33 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -40,7 +40,6 @@
 /*
  * XXX
  *  - build with sparse
- *  - should we limit the size of a mr region?  let transport return failure?
  *  - should we detect duplicate keys on a socket?  hmm.
  *  - an rdma is an mlock, apply rlimit?
  */
@@ -200,6 +199,14 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args,
goto out;
}
 
+   /* Restrict the size of mr irrespective of underlying transport
+* To account for unaligned mr regions, subtract one from nr_pages
+*/
+   if ((nr_pages - 1) > (RDS_MAX_MSG_SIZE >> PAGE_SHIFT)) {
+   ret = -EMSGSIZE;
+   goto out;
+   }
+
rdsdebug("RDS: get_mr addr %llx len %llu nr_pages %u\n",
args->vec.addr, args->vec.bytes, nr_pages);
 
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 8ccd5a9..f713194 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -50,6 +50,9 @@ void rdsdebug(char *fmt, ...)
 #define RDS_FRAG_SHIFT 12
 #define RDS_FRAG_SIZE  ((unsigned int)(1 << RDS_FRAG_SHIFT))
 
+/* Used to limit both RDMA and non-RDMA RDS message to 1MB */
+#define RDS_MAX_MSG_SIZE   ((unsigned int)(1 << 20))
+
 #define RDS_CONG_MAP_BYTES (65536 / 8)
 #define RDS_CONG_MAP_PAGES (PAGE_ALIGN(RDS_CONG_MAP_BYTES) / PAGE_SIZE)
 #define RDS_CONG_MAP_PAGE_BITS (PAGE_SIZE * 8)
diff --git a/net/rds/send.c b/net/rds/send.c
index 45e025b..5cc6403 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -994,6 +994,26 @@ static int rds_send_mprds_hash(struct rds_sock *rs, struct rds_connection *conn)
return hash;
 }
 
+static int rds_rdma_bytes(struct msghdr *msg, size_t *rdma_bytes)
+{
+   struct rds_rdma_args *args;
+   struct cmsghdr *cmsg;
+
+   for_each_cmsghdr(cmsg, msg) {
+   if (!CMSG_OK(msg, cmsg))
+   return -EINVAL;
+
+   if (cmsg->cmsg_level != SOL_RDS)
+   continue;
+
+   if (cmsg->cmsg_type == RDS_CMSG_RDMA_ARGS) {
+   args = CMSG_DATA(cmsg);
+   *rdma_bytes += args->remote_vec.bytes;
+   }
+   }
+   return 0;
+}
+
 int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 {
struct sock *sk = sock->sk;
@@ -1008,6 +1028,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
int nonblock = msg->msg_flags & MSG_DONTWAIT;
long timeo = sock_sndtimeo(sk, nonblock);
struct rds_conn_path *cpath;
+   size_t total_payload_len = payload_len, rdma_payload_len = 0;
 
/* Mirror Linux UDP mirror of BSD error message compatibility */
/* XXX: Perhaps MSG_MORE someday */
@@ -1040,6 +1061,16 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
}
release_sock(sk);
 
ret = rds_rdma_bytes(msg, &rdma_payload_len);
+   if (ret)
+   goto out;
+
+   total_payload_len += rdma_payload_len;
+   if (max_t(size_t, payload_len, rdma_payload_len) > RDS_MAX_MSG_SIZE) {
+   ret = -EMSGSIZE;
+   goto out;
+   }
+
if (payload_len > rds_sk_sndbuf(rs)) {
ret = -EMSGSIZE;
goto out;
-- 
1.9.1



[net-next][PATCH 00/18] net: RDS updates

2016-12-04 Thread Santosh Shilimkar
Series consists of:
 - RDMA transport fixes for map failure, listen sequence, handler panic and
   composite message notification.
 - Couple of sparse fixes.
 - Message logging improvements for bind failure, use once mr semantics
   and connection remote address, active end point.
 - Performance improvement for RDMA transport by reducing the post send
   pressure on the queue and spreading the CQ vectors.
 - Useful statistics for socket send/recv usage and receive cache usage.
   rds-tools already equipped to parse this info.
 - Additional RDS CMSG used by application to track the RDS message
   stages for certain type of traffic to find out latency spots.
   Can be enabled/disabled per socket.

Series generated against 'net-next'. Patchset is also available on
below git tree.

The following changes since commit adc176c5472214971d77c1a61c83db9b01e9cdc7:

  ipv6 addrconf: Implemented enhanced DAD (RFC7527) (2016-12-03 23:21:37 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.10/net-next/rds

for you to fetch changes up to 35b76a2744f63f936a8b2e5c306d47618883144b:

  RDS: IB: add missing connection cache usage info (2016-12-04 17:06:30 -0800)



Avinash Repaka (1):
  RDS: make message size limit compliant with spec

Qing Huang (1):
  RDS: RDMA: start rdma listening after init

Santosh Shilimkar (15):
  RDS: log the address on bind failure
  RDS: mark few internal functions static to make sparse build happy
  RDS: IB: include faddr in connection log
  RDS: IB: make the transport retry count smallest
  RDS: RDMA: fix the ib_map_mr_sg_zbva() argument
  RDS: RDMA: return appropriate error on rdma map failures
  RDS: IB: split the mr registration and invalidation path
  RDS: RDMA: silence the use_once mr log flood
  RDS: IB: track and log active side endpoint in connection
  RDS: IB: add few useful cache stats
  RDS: IB: Add vector spreading for cqs
  RDS: RDMA: Fix the composite message user notification
  RDS: IB: fix panic due to handlers running post teardown
  RDS: add receive message trace used by application
  RDS: IB: add missing connection cache usage info

Venkat Venkatsubra (1):
  RDS: add stat for socket recv memory usage

 include/uapi/linux/rds.h | 34 ++
 net/rds/af_rds.c | 28 +++
 net/rds/bind.c   |  4 +--
 net/rds/connection.c |  2 +-
 net/rds/ib.c | 12 +++
 net/rds/ib.h | 22 ++--
 net/rds/ib_cm.c  | 89 ++--
 net/rds/ib_frmr.c| 16 +
 net/rds/ib_recv.c| 14 ++--
 net/rds/ib_send.c| 29 +---
 net/rds/ib_stats.c   |  2 ++
 net/rds/rdma.c   | 22 ++--
 net/rds/rdma_transport.c | 11 ++
 net/rds/rds.h| 17 +
 net/rds/recv.c   | 36 ++--
 net/rds/send.c   | 50 ---
 net/rds/tcp_listen.c |  1 +
 net/rds/tcp_recv.c   |  5 +++
 18 files changed, 333 insertions(+), 61 deletions(-)

-- 
1.9.1



[net-next][PATCH 02/18] RDS: mark few internal functions static to make sparse build happy

2016-12-04 Thread Santosh Shilimkar
Fixes below warnings:
warning: symbol 'rds_send_probe' was not declared. Should it be static?
warning: symbol 'rds_send_ping' was not declared. Should it be static?
warning: symbol 'rds_tcp_accept_one_path' was not declared. Should it be static?
warning: symbol 'rds_walk_conn_path_info' was not declared. Should it be static?

Signed-off-by: Santosh Shilimkar 
---
 net/rds/connection.c | 2 +-
 net/rds/send.c   | 4 ++--
 net/rds/tcp_listen.c | 1 +
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index fe9d31c..26533b2 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -545,7 +545,7 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
 }
 EXPORT_SYMBOL_GPL(rds_for_each_conn_info);
 
-void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
+static void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
 struct rds_info_iterator *iter,
 struct rds_info_lengths *lens,
 int (*visitor)(struct rds_conn_path *, void *),
diff --git a/net/rds/send.c b/net/rds/send.c
index 77c8c6e..bb13c56 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1169,7 +1169,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
  * or
  *   RDS_FLAG_HB_PONG|RDS_FLAG_ACK_REQUIRED
  */
-int
+static int
 rds_send_probe(struct rds_conn_path *cp, __be16 sport,
   __be16 dport, u8 h_flags)
 {
@@ -1238,7 +1238,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
return rds_send_probe(cp, 0, dport, 0);
 }
 
-void
+static void
 rds_send_ping(struct rds_connection *conn)
 {
unsigned long flags;
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index f74bab3..67d0929 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -79,6 +79,7 @@ int rds_tcp_keepalive(struct socket *sock)
  * smaller ip address, we recycle conns in RDS_CONN_ERROR on the passive side
  * by moving them to CONNECTING in this function.
  */
+static
 struct rds_tcp_connection *rds_tcp_accept_one_path(struct rds_connection *conn)
 {
int i;
-- 
1.9.1



[net-next][PATCH 06/18] RDS: RDMA: start rdma listening after init

2016-12-04 Thread Santosh Shilimkar
From: Qing Huang 

This prevents RDS from handling incoming rdma packets before RDS
completes initializing its recv/send components.
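The ordering the patch enforces, and its unwinding on failure, can be sketched in plain C. This is a hedged model, not kernel code: `fake_ib_init`, `fake_listen_init` and friends are illustrative stand-ins for `rds_ib_init()`/`rds_rdma_listen_init()`.

```c
#include <assert.h>

/* Hypothetical stand-ins for rds_ib_init()/rds_rdma_listen_init():
 * each returns 0 on success, nonzero on failure. */
static int transport_ready;
static int listening;

static int fake_ib_init(void) { transport_ready = 1; return 0; }
static void fake_ib_exit(void) { transport_ready = 0; }

/* Listening only succeeds once the transport is initialized, which is
 * the invariant the patch establishes. */
static int fake_listen_init(void)
{
    return transport_ready ? (listening = 1, 0) : -1;
}

/* Mirrors the patched rds_rdma_init(): bring the transport up first,
 * only then open the listen endpoint, and tear the transport back down
 * if listening fails. */
static int init_in_order(void)
{
    int ret = fake_ib_init();
    if (ret)
        return ret;
    ret = fake_listen_init();
    if (ret)
        fake_ib_exit();
    return ret;
}
```

Note how the error path shrinks to a single `rds_ib_exit()` call, which is why the patch could delete the `err_ib_init:` label.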

Signed-off-by: Qing Huang 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 345f090..250c6b8 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -203,18 +203,13 @@ static int rds_rdma_init(void)
 {
int ret;
 
-   ret = rds_rdma_listen_init();
+   ret = rds_ib_init();
if (ret)
goto out;
 
-   ret = rds_ib_init();
+   ret = rds_rdma_listen_init();
if (ret)
-   goto err_ib_init;
-
-   goto out;
-
-err_ib_init:
-   rds_rdma_listen_stop();
+   rds_ib_exit();
 out:
return ret;
 }
-- 
1.9.1



[net-next][PATCH 10/18] RDS: IB: track and log active side endpoint in connection

2016-12-04 Thread Santosh Shilimkar
Useful to know the active and passive endpoints in an
RDS IB connection.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  3 +++
 net/rds/ib_cm.c | 11 +++
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index f14c26d..97e7696 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -181,6 +181,9 @@ struct rds_ib_connection {
 
/* Batched completions */
	unsigned int		i_unsignaled_wrs;
+
+   /* Endpoint role in connection */
+   int i_active_side;
 };
 
 /* This assumes that atomic_t is at least 32 bits */
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 3002acf..4d1bf04 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -120,16 +120,17 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
rds_conn_destroy(conn);
return;
} else {
-   pr_notice("RDS/IB: connected <%pI4,%pI4> version %u.%u%s\n",
+   pr_notice("RDS/IB: %s conn connected <%pI4,%pI4> version %u.%u%s\n",
+ ic->i_active_side ? "Active" : "Passive",
  &conn->c_laddr, &conn->c_faddr,
  RDS_PROTOCOL_MAJOR(conn->c_version),
  RDS_PROTOCOL_MINOR(conn->c_version),
  ic->i_flowctl ? ", flow control" : "");
}
 
-   /*
-* Init rings and fill recv. this needs to wait until protocol negotiation
-* is complete, since ring layout is different from 3.0 to 3.1.
+   /* Init rings and fill recv. this needs to wait until protocol
+* negotiation is complete, since ring layout is different
+* from 3.1 to 4.1.
 */
rds_ib_send_init_ring(ic);
rds_ib_recv_init_ring(ic);
@@ -685,6 +686,7 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id)
if (ic->i_cm_id == cm_id)
ret = 0;
}
+   ic->i_active_side = true;
return ret;
 }
 
@@ -859,6 +861,7 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
ic->i_sends = NULL;
vfree(ic->i_recvs);
ic->i_recvs = NULL;
+   ic->i_active_side = false;
 }
 
 int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp)
-- 
1.9.1



[net-next][PATCH 17/18] RDS: add receive message trace used by application

2016-12-04 Thread Santosh Shilimkar
Add a socket option to tap receive path latency at various stages,
in nanoseconds. It can be enabled on selected sockets using the
SO_RDS_MSG_RXPATH_LATENCY socket option. RDS will return the data
to the application with RDS_CMSG_RXPATH_LATENCY in the defined
format. Scope is left to add more trace points in the future
without needing to change the interface.
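A userspace application might fill the option roughly as below. This is a hedged sketch: the uapi definitions added by this patch are replicated locally instead of being pulled from `<linux/rds.h>`, and the actual RDS socket setup plus the `setsockopt()`/`recvmsg()` calls are only shown in comments.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the uapi definitions this patch adds (normally
 * they come from <linux/rds.h>). */
enum rds_message_rxpath_latency {
    RDS_MSG_RX_HDR_TO_DGRAM_START = 0,
    RDS_MSG_RX_DGRAM_REASSEMBLE,
    RDS_MSG_RX_DGRAM_DELIVERED,
    RDS_MSG_RX_DGRAM_TRACE_MAX
};

struct rds_rx_trace_so {
    uint8_t rx_traces;
    uint8_t rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
};

/* Fill the option the way rds_recv_track_latency() expects it:
 * rx_traces counts the entries, rx_trace_pos[] names the points. */
static struct rds_rx_trace_so make_trace_opt(void)
{
    struct rds_rx_trace_so t;

    memset(&t, 0, sizeof(t));
    t.rx_traces = 2;
    t.rx_trace_pos[0] = RDS_MSG_RX_HDR_TO_DGRAM_START;
    t.rx_trace_pos[1] = RDS_MSG_RX_DGRAM_DELIVERED;
    return t;
}

/* On a real RDS socket one would then do (sketch, not run here):
 *   setsockopt(fd, SOL_RDS, SO_RDS_MSG_RXPATH_LATENCY, &t, sizeof(t));
 * and read RDS_CMSG_RXPATH_LATENCY ancillary data from recvmsg(). */
```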

Reviewed-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
---
 include/uapi/linux/rds.h | 33 +
 net/rds/af_rds.c | 28 
 net/rds/ib_recv.c|  4 
 net/rds/rds.h| 10 ++
 net/rds/recv.c   | 32 +---
 net/rds/tcp_recv.c   |  5 +
 6 files changed, 109 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 0f9265c..3833113 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -52,6 +52,13 @@
 #define RDS_GET_MR_FOR_DEST	7
 #define SO_RDS_TRANSPORT   8
 
+/* Socket option to tap receive path latency
+ * SO_RDS: SO_RDS_MSG_RXPATH_LATENCY
+ * Format used struct rds_rx_trace_so
+ */
+#define SO_RDS_MSG_RXPATH_LATENCY  10
+
+
 /* supported values for SO_RDS_TRANSPORT */
 #define	RDS_TRANS_IB	0
 #define	RDS_TRANS_IWARP	1
@@ -77,6 +84,12 @@
  * the same as for the GET_MR setsockopt.
  * RDS_CMSG_RDMA_STATUS (recvmsg)
  * Returns the status of a completed RDMA operation.
+ * RDS_CMSG_RXPATH_LATENCY(recvmsg)
+ * Returns rds message latencies in various stages of receive
+ * path in nS. Its set per socket using SO_RDS_MSG_RXPATH_LATENCY
+ * socket option. Legitimate points are defined in
+ * enum rds_message_rxpath_latency. More points can be added in
+ * future. CMSG format is struct rds_cmsg_rx_trace.
  */
 #define RDS_CMSG_RDMA_ARGS 1
 #define RDS_CMSG_RDMA_DEST 2
@@ -87,6 +100,7 @@
 #define RDS_CMSG_ATOMIC_CSWP   7
 #define RDS_CMSG_MASKED_ATOMIC_FADD8
 #define RDS_CMSG_MASKED_ATOMIC_CSWP9
+#define RDS_CMSG_RXPATH_LATENCY	11
 
 #define RDS_INFO_FIRST 1
 #define RDS_INFO_COUNTERS  1
@@ -171,6 +185,25 @@ struct rds_info_rdma_connection {
	uint32_t	rdma_mr_size;
 };
 
+/* RDS message Receive Path Latency points */
+enum rds_message_rxpath_latency {
+   RDS_MSG_RX_HDR_TO_DGRAM_START = 0,
+   RDS_MSG_RX_DGRAM_REASSEMBLE,
+   RDS_MSG_RX_DGRAM_DELIVERED,
+   RDS_MSG_RX_DGRAM_TRACE_MAX
+};
+
+struct rds_rx_trace_so {
+   u8 rx_traces;
+   u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
+};
+
+struct rds_cmsg_rx_trace {
+   u8 rx_traces;
+   u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
+   u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
+};
+
 /*
  * Congestion monitoring.
  * Congestion control in RDS happens at the host connection
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 2ac1e61..fd821740 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -298,6 +298,30 @@ static int rds_enable_recvtstamp(struct sock *sk, char __user *optval,
return 0;
 }
 
+static int rds_recv_track_latency(struct rds_sock *rs, char __user *optval,
+ int optlen)
+{
+   struct rds_rx_trace_so trace;
+   int i;
+
+   if (optlen != sizeof(struct rds_rx_trace_so))
+   return -EFAULT;
+
+   if (copy_from_user(&trace, optval, sizeof(trace)))
+   return -EFAULT;
+
+   rs->rs_rx_traces = trace.rx_traces;
+   for (i = 0; i < rs->rs_rx_traces; i++) {
+   if (trace.rx_trace_pos[i] > RDS_MSG_RX_DGRAM_TRACE_MAX) {
+   rs->rs_rx_traces = 0;
+   return -EFAULT;
+   }
+   rs->rs_rx_trace[i] = trace.rx_trace_pos[i];
+   }
+
+   return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
 {
@@ -338,6 +362,9 @@ static int rds_setsockopt(struct socket *sock, int level, int optname,
ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
release_sock(sock->sk);
break;
+   case SO_RDS_MSG_RXPATH_LATENCY:
+   ret = rds_recv_track_latency(rs, optval, optlen);
+   break;
default:
ret = -ENOPROTOOPT;
}
@@ -484,6 +511,7 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol)
	INIT_LIST_HEAD(&rs->rs_cong_list);
	spin_lock_init(&rs->rs_rdma_lock);
	rs->rs_rdma_keys = RB_ROOT;
+	rs->rs_rx_traces = 0;

	spin_lock_bh(&rds_sock_lock);
	list_add_tail(&rs->rs_item, &rds_sock_list);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 4b0f126..e10624a 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -911,8 +911,12 @@ static void 

[net-next][PATCH 07/18] RDS: RDMA: return appropriate error on rdma map failures

2016-12-04 Thread Santosh Shilimkar
The first message to a remote node should prompt a new
connection even if it is an RDMA operation. For an RDMA
operation the MR mapping can fail because the connection
is not yet up.

Since connection establishment is asynchronous, we make
sure a map failure caused by the unavailable connection
reaches the user with an appropriate error code. Before
returning to the user, trigger the connection so that it
is ready for the next retry.
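The error handling the patch adds can be modeled as a tiny standalone function. This is a hedged sketch: `connect_if_down()` is an illustrative stub for `rds_conn_connect_if_down()`, and the surrounding send path is elided.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative stand-in for rds_conn_connect_if_down(). */
static int reconnect_triggered;
static void connect_if_down(void) { reconnect_triggered = 1; }

/* Mirrors the patch's handling: a map failure caused by a not-yet-up
 * connection (-ENODEV from the MR path) is reported to the caller as
 * -EAGAIN, and the connection is kicked so the retry can succeed. */
static int handle_cmsg_ret(int ret)
{
    if (ret == -ENODEV)
        ret = -EAGAIN;  /* the get_mr() case in rds_cmsg_send() */
    if (ret == -EAGAIN)
        connect_if_down();
    return ret;
}
```

Other errors pass through unchanged, so only the "connection not up yet" case becomes retryable.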

Signed-off-by: Santosh Shilimkar 
---
 net/rds/send.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index bb13c56..0a6f38b 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -945,6 +945,11 @@ static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
ret = rds_cmsg_rdma_map(rs, rm, cmsg);
if (!ret)
*allocated_mr = 1;
+   else if (ret == -ENODEV)
+   /* Accommodate the get_mr() case which can fail
+* if connection isn't established yet.
+*/
+   ret = -EAGAIN;
break;
case RDS_CMSG_ATOMIC_CSWP:
case RDS_CMSG_ATOMIC_FADD:
@@ -1082,8 +1087,12 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 
/* Parse any control messages the user may have included. */
	ret = rds_cmsg_send(rs, rm, msg, &allocated_mr);
-   if (ret)
+   if (ret) {
+   /* Trigger connection so that its ready for the next retry */
+		if (ret == -EAGAIN)
+   rds_conn_connect_if_down(conn);
goto out;
+   }
 
if (rm->rdma.op_active && !conn->c_trans->xmit_rdma) {
printk_ratelimited(KERN_NOTICE "rdma_op %p conn xmit_rdma %p\n",
-- 
1.9.1



[net-next][PATCH 13/18] RDS: RDMA: Fix the composite message user notification

2016-12-04 Thread Santosh Shilimkar
When an application sends an RDS RDMA composite message consisting
of an RDMA transfer followed by a non-RDMA payload, it expects to
be notified *only* when the full message gets delivered. RDS RDMA
notification doesn't behave this way though.

Thanks to Venkat for debugging and root causing the issue,
where only the first part of the message (the RDMA) was
successfully delivered but the remainder of the payload failed
to deliver. In that case, the application should not be notified
with a false positive of message delivery success.

Fix this case by making sure the user gets notified only after
the full message delivery.

Reviewed-by: Venkat Venkatsubra 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_send.c | 25 +++--
 net/rds/rdma.c| 10 ++
 net/rds/rds.h |  1 +
 net/rds/send.c|  4 +++-
 4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 19eca5c..5e72de1 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -69,16 +69,6 @@ static void rds_ib_send_complete(struct rds_message *rm,
complete(rm, notify_status);
 }
 
-static void rds_ib_send_unmap_data(struct rds_ib_connection *ic,
-  struct rm_data_op *op,
-  int wc_status)
-{
-   if (op->op_nents)
-   ib_dma_unmap_sg(ic->i_cm_id->device,
-   op->op_sg, op->op_nents,
-   DMA_TO_DEVICE);
-}
-
 static void rds_ib_send_unmap_rdma(struct rds_ib_connection *ic,
   struct rm_rdma_op *op,
   int wc_status)
@@ -139,6 +129,21 @@ static void rds_ib_send_unmap_atomic(struct rds_ib_connection *ic,
rds_ib_stats_inc(s_ib_atomic_fadd);
 }
 
+static void rds_ib_send_unmap_data(struct rds_ib_connection *ic,
+  struct rm_data_op *op,
+  int wc_status)
+{
+   struct rds_message *rm = container_of(op, struct rds_message, data);
+
+   if (op->op_nents)
+   ib_dma_unmap_sg(ic->i_cm_id->device,
+   op->op_sg, op->op_nents,
+   DMA_TO_DEVICE);
+
+   if (rm->rdma.op_active && rm->data.op_notify)
+   rds_ib_send_unmap_rdma(ic, &rm->rdma, wc_status);
+}
+
 /*
  * Unmap the resources associated with a struct send_work.
  *
diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 8151c49..dd508e0 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -627,6 +627,16 @@ int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm,
}
op->op_notifier->n_user_token = args->user_token;
op->op_notifier->n_status = RDS_RDMA_SUCCESS;
+
+   /* Enable rdma notification on data operation for composite
+* rds messages and make sure notification is enabled only
+* for the data operation which follows it so that application
+* gets notified only after full message gets delivered.
+*/
+   if (rm->data.op_sg) {
+   rm->rdma.op_notify = 0;
+   rm->data.op_notify = !!(args->flags & RDS_RDMA_NOTIFY_ME);
+   }
}
 
/* The cookie contains the R_Key of the remote memory region, and
diff --git a/net/rds/rds.h b/net/rds/rds.h
index ebbf909..0bb8213 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -419,6 +419,7 @@ struct rds_message {
} rdma;
struct rm_data_op {
		unsigned int		op_active:1;
+		unsigned int		op_notify:1;
		unsigned int		op_nents;
		unsigned int		op_count;
		unsigned int		op_dmasg;
diff --git a/net/rds/send.c b/net/rds/send.c
index 0a6f38b..45e025b 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -476,12 +476,14 @@ void rds_rdma_send_complete(struct rds_message *rm, int status)
struct rm_rdma_op *ro;
struct rds_notifier *notifier;
unsigned long flags;
+   unsigned int notify = 0;
 
	spin_lock_irqsave(&rm->m_rs_lock, flags);

+	notify = rm->rdma.op_notify | rm->data.op_notify;
	ro = &rm->rdma;
	if (test_bit(RDS_MSG_ON_SOCK, &rm->m_flags) &&
-	    ro->op_active && ro->op_notify && ro->op_notifier) {
+	    ro->op_active && notify && ro->op_notifier) {
notifier = ro->op_notifier;
rs = rm->m_rs;
sock_hold(rds_rs_to_sk(rs));
-- 
1.9.1



[net-next][PATCH 08/18] RDS: IB: split the mr registration and invalidation path

2016-12-04 Thread Santosh Shilimkar
MR invalidation in RDS is done in a background thread, not in
the data path like registration. So break the dependency between
them, which helps remove the performance bottleneck.
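The split amounts to giving registration and invalidation their own work-request budgets. A hedged sketch of that idea, with the busy-wait of `rds_ib_post_inv()` reduced to a non-blocking try; the budget sizes mirror `RDS_IB_DEFAULT_FR_WR`/`RDS_IB_DEFAULT_FR_INV_WR`, but everything else here is illustrative:

```c
#include <assert.h>
#include <stdatomic.h>

#define FR_WR     256   /* registration budget (data path) */
#define FR_INV_WR 256   /* invalidation budget (background) */

static atomic_int fastreg_wrs   = FR_WR;
static atomic_int fastunreg_wrs = FR_INV_WR;

/* Returns 1 if a work-request slot was taken, 0 if the budget is
 * exhausted (the kernel code instead spins with cpu_relax()). */
static int take_slot(atomic_int *budget)
{
    if (atomic_fetch_sub(budget, 1) <= 0) {
        atomic_fetch_add(budget, 1);  /* undo; caller must retry */
        return 0;
    }
    return 1;
}

static void put_slot(atomic_int *budget)
{
    atomic_fetch_add(budget, 1);      /* completion handler's job */
}
```

Because the two counters are independent, a burst of background invalidations can no longer starve data-path registrations of slots.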

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  |  4 +++-
 net/rds/ib_cm.c   |  9 +++--
 net/rds/ib_frmr.c | 11 ++-
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index f4e8121..f14c26d 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,7 +14,8 @@
 
 #define RDS_IB_DEFAULT_RECV_WR 1024
 #define RDS_IB_DEFAULT_SEND_WR 256
-#define RDS_IB_DEFAULT_FR_WR   512
+#define RDS_IB_DEFAULT_FR_WR   256
+#define RDS_IB_DEFAULT_FR_INV_WR   256
 
 #define RDS_IB_DEFAULT_RETRY_COUNT 1
 
@@ -125,6 +126,7 @@ struct rds_ib_connection {
 
/* To control the number of wrs from fastreg */
	atomic_t		i_fastreg_wrs;
+	atomic_t		i_fastunreg_wrs;
 
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index b9da1e5..3002acf 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -382,7 +382,10 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 * completion queue and send queue. This extra space is used for FRMR
 * registration and invalidation work requests
 */
-   fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+   fr_queue_space = rds_ibdev->use_fastreg ?
+(RDS_IB_DEFAULT_FR_WR + 1) +
+(RDS_IB_DEFAULT_FR_INV_WR + 1)
+: 0;
 
/* add the conn now so that connection establishment has the dev */
rds_ib_add_conn(rds_ibdev, conn);
@@ -444,6 +447,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
	atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
+	atomic_set(&ic->i_fastunreg_wrs, RDS_IB_DEFAULT_FR_INV_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -766,7 +770,8 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
wait_event(rds_ib_ring_empty_wait,
		   rds_ib_ring_empty(&ic->i_recv_ring) &&
		   (atomic_read(&ic->i_signaled_sends) == 0) &&
-		   (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR));
+		   (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR) &&
+		   (atomic_read(&ic->i_fastunreg_wrs) == RDS_IB_DEFAULT_FR_INV_WR));
	tasklet_kill(&ic->i_send_tasklet);
	tasklet_kill(&ic->i_recv_tasklet);
 
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 66b3d62..48332a6 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -241,8 +241,8 @@ static int rds_ib_post_inv(struct rds_ib_mr *ibmr)
if (frmr->fr_state != FRMR_IS_INUSE)
goto out;
 
-	while (atomic_dec_return(&ibmr->ic->i_fastreg_wrs) <= 0) {
-		atomic_inc(&ibmr->ic->i_fastreg_wrs);
+	while (atomic_dec_return(&ibmr->ic->i_fastunreg_wrs) <= 0) {
+		atomic_inc(&ibmr->ic->i_fastunreg_wrs);
cpu_relax();
}
 
@@ -261,7 +261,7 @@ static int rds_ib_post_inv(struct rds_ib_mr *ibmr)
if (unlikely(ret)) {
frmr->fr_state = FRMR_IS_STALE;
frmr->fr_inv = false;
-		atomic_inc(&ibmr->ic->i_fastreg_wrs);
+		atomic_inc(&ibmr->ic->i_fastunreg_wrs);
pr_err("RDS/IB: %s returned error(%d)\n", __func__, ret);
goto out;
}
@@ -289,9 +289,10 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
if (frmr->fr_inv) {
frmr->fr_state = FRMR_IS_FREE;
frmr->fr_inv = false;
+		atomic_inc(&ic->i_fastreg_wrs);
+	} else {
+		atomic_inc(&ic->i_fastunreg_wrs);
	}
-
-	atomic_inc(&ic->i_fastreg_wrs);
 }
 
 void rds_ib_unreg_frmr(struct list_head *list, unsigned int *nfreed,
-- 
1.9.1



[net-next][PATCH 09/18] RDS: RDMA: silence the use_once mr log flood

2016-12-04 Thread Santosh Shilimkar
In the absence of extension headers, this message will keep
flooding the console. Since the MRs can be cleaned up even
without use_once, it is not really an error case, so make it
a debug message.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 4c93bad..8151c49 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -415,7 +415,8 @@ void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force)
	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
	mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL);
if (!mr) {
-	printk(KERN_ERR "rds: trying to unuse MR with unknown r_key %u!\n", r_key);
+   pr_debug("rds: trying to unuse MR with unknown r_key %u!\n",
+r_key);
	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
return;
}
-- 
1.9.1



[net-next][PATCH 14/18] RDS: IB: fix panic due to handlers running post teardown

2016-12-04 Thread Santosh Shilimkar
The shutdown code's reaping loop takes care of emptying the
CQs before they are destroyed, and once the tasklets are
killed, the handlers are not expected to run.

But because of core tasklet code issues, a tasklet handler
could still run even after tasklet_kill. The RDS IB shutdown
code already reaps the CQs before freeing the cq/qp resources,
so the handlers have nothing left to do post shutdown.

On the other hand, any handler running after teardown and
trying to access the already freed qp/cq resources causes
issues. This patch fixes the race by making sure the handlers
return without any action post teardown.
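The fix boils down to a quiesce flag set by shutdown and checked at handler entry. A minimal single-threaded model of that pattern; the names are illustrative and the kernel's tasklet machinery is not modeled:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int cq_quiesce;
static int polls;

/* Models rds_ib_tasklet_fn_send()/_recv(): bail out once shutdown
 * has marked the CQs quiesced, so freed qp/cq state is never touched. */
static void tasklet_handler(void)
{
    if (atomic_load(&cq_quiesce))   /* post-teardown: nothing to do */
        return;
    polls++;                        /* would poll_scq()/poll_rcq() here */
}

/* Models the shutdown side: after tasklet_kill() in the real code,
 * raise the flag before destroying the qp/cq resources. */
static void shutdown_path(void)
{
    atomic_store(&cq_quiesce, 1);
}
```

A handler invocation that slips through after `shutdown_path()` is now a harmless no-op instead of a use-after-free.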

Reviewed-by: Wengang 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  1 +
 net/rds/ib_cm.c | 12 
 2 files changed, 13 insertions(+)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 4b133b8..8efd1eb 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -185,6 +185,7 @@ struct rds_ib_connection {
 
/* Endpoint role in connection */
int i_active_side;
+	atomic_t		i_cq_quiesce;
 
/* Send/Recv vectors */
int i_scq_vector;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 33c8584..ce3775a 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -128,6 +128,8 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
  ic->i_flowctl ? ", flow control" : "");
}
 
+	atomic_set(&ic->i_cq_quiesce, 0);
+
/* Init rings and fill recv. this needs to wait until protocol
 * negotiation is complete, since ring layout is different
 * from 3.1 to 4.1.
@@ -267,6 +269,10 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
+   /* if cq has been already reaped, ignore incoming cq event */
+	if (atomic_read(&ic->i_cq_quiesce))
+   return;
+
poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
@@ -308,6 +314,10 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
+   /* if cq has been already reaped, ignore incoming cq event */
+	if (atomic_read(&ic->i_cq_quiesce))
+   return;
+
	memset(&state, 0, sizeof(state));
	poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
@@ -804,6 +814,8 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
	tasklet_kill(&ic->i_send_tasklet);
	tasklet_kill(&ic->i_recv_tasklet);

+	atomic_set(&ic->i_cq_quiesce, 1);
+
/* first destroy the ib state that generates callbacks */
if (ic->i_cm_id->qp)
rdma_destroy_qp(ic->i_cm_id);
-- 
1.9.1



[net-next][PATCH 12/18] RDS: IB: Add vector spreading for cqs

2016-12-04 Thread Santosh Shilimkar
Based on the available device vectors, allocate CQs accordingly
to get a better spread of completion vectors, which helps
performance a great deal.
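The least-loaded pick that the patch's `ibdev_get_unused_vector()` performs can be reproduced as a standalone sketch over a plain load table (the device bookkeeping is elided; only the selection logic is shown):

```c
#include <assert.h>

/* Scan the per-vector load table, take the index with the smallest
 * load, and bump its count; mirrors ibdev_get_unused_vector(). */
static int get_unused_vector(int *load, int nvec)
{
    int index = nvec - 1;
    int min = load[index];
    int i;

    for (i = nvec - 1; i >= 0; i--) {
        if (load[i] < min) {
            index = i;
            min = load[i];
        }
    }
    load[index]++;
    return index;
}

/* Mirrors ibdev_put_vector(): release the vector when the CQ goes
 * away (or when ib_create_cq() fails). */
static void put_vector(int *load, int index)
{
    load[index]--;
}
```

Successive CQ allocations therefore land on whichever completion vector currently carries the fewest CQs, spreading interrupt work across vectors.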

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c| 11 +++
 net/rds/ib.h|  5 +
 net/rds/ib_cm.c | 40 +---
 3 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 5680d90..8d70884 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -111,6 +111,9 @@ static void rds_ib_dev_free(struct work_struct *work)
kfree(i_ipaddr);
}
 
+   if (rds_ibdev->vector_load)
+   kfree(rds_ibdev->vector_load);
+
kfree(rds_ibdev);
 }
 
@@ -159,6 +162,14 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
 
+   rds_ibdev->vector_load = kzalloc(sizeof(int) * device->num_comp_vectors,
+GFP_KERNEL);
+   if (!rds_ibdev->vector_load) {
+   pr_err("RDS/IB: %s failed to allocate vector memory\n",
+   __func__);
+   goto put_dev;
+   }
+
rds_ibdev->dev = device;
rds_ibdev->pd = ib_alloc_pd(device, 0);
if (IS_ERR(rds_ibdev->pd)) {
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 4987387..4b133b8 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -185,6 +185,10 @@ struct rds_ib_connection {
 
/* Endpoint role in connection */
int i_active_side;
+
+   /* Send/Recv vectors */
+   int i_scq_vector;
+   int i_rcq_vector;
 };
 
 /* This assumes that atomic_t is at least 32 bits */
@@ -227,6 +231,7 @@ struct rds_ib_device {
spinlock_t  spinlock;   /* protect the above */
	atomic_t		refcount;
struct work_struct  free_work;
+   int *vector_load;
 };
 
 #define ibdev_to_node(ibdev) dev_to_node(ibdev->dma_device)
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 4d1bf04..33c8584 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -358,6 +358,28 @@ static void rds_ib_cq_comp_handler_send(struct ib_cq *cq, void *context)
	tasklet_schedule(&ic->i_send_tasklet);
 }
 
+static inline int ibdev_get_unused_vector(struct rds_ib_device *rds_ibdev)
+{
+   int min = rds_ibdev->vector_load[rds_ibdev->dev->num_comp_vectors - 1];
+   int index = rds_ibdev->dev->num_comp_vectors - 1;
+   int i;
+
+   for (i = rds_ibdev->dev->num_comp_vectors - 1; i >= 0; i--) {
+   if (rds_ibdev->vector_load[i] < min) {
+   index = i;
+   min = rds_ibdev->vector_load[i];
+   }
+   }
+
+   rds_ibdev->vector_load[index]++;
+   return index;
+}
+
+static inline void ibdev_put_vector(struct rds_ib_device *rds_ibdev, int index)
+{
+   rds_ibdev->vector_load[index]--;
+}
+
 /*
  * This needs to be very careful to not leave IS_ERR pointers around for
  * cleanup to trip over.
@@ -399,25 +421,30 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
/* Protection domain and memory range */
ic->i_pd = rds_ibdev->pd;
 
+   ic->i_scq_vector = ibdev_get_unused_vector(rds_ibdev);
cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;
-
+   cq_attr.comp_vector = ic->i_scq_vector;
ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
				     &cq_attr);
if (IS_ERR(ic->i_send_cq)) {
ret = PTR_ERR(ic->i_send_cq);
ic->i_send_cq = NULL;
+   ibdev_put_vector(rds_ibdev, ic->i_scq_vector);
rdsdebug("ib_create_cq send failed: %d\n", ret);
goto out;
}
 
+   ic->i_rcq_vector = ibdev_get_unused_vector(rds_ibdev);
cq_attr.cqe = ic->i_recv_ring.w_nr;
+   cq_attr.comp_vector = ic->i_rcq_vector;
ic->i_recv_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_recv,
 rds_ib_cq_event_handler, conn,
				     &cq_attr);
if (IS_ERR(ic->i_recv_cq)) {
ret = PTR_ERR(ic->i_recv_cq);
ic->i_recv_cq = NULL;
+   ibdev_put_vector(rds_ibdev, ic->i_rcq_vector);
rdsdebug("ib_create_cq recv failed: %d\n", ret);
goto out;
}
@@ -780,10 +807,17 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
/* first destroy the ib state that generates callbacks */
if (ic->i_cm_id->qp)
rdma_destroy_qp(ic->i_cm_id);
-   if (ic->i_send_cq)
+   if (ic->i_send_cq) {
+   if (ic->rds_ibdev)
+

[PATCH 07/28] bnxt_en: Add interface to support RDMA driver.

2016-12-04 Thread Selvin Xavier
From: Michael Chan 

Since the network driver and RDMA driver operate on the same PCI function,
we need to create an interface to allow the RDMA driver to share resources
with the network driver.

1. Create a new bnxt_en_dev struct which will be returned by
bnxt_ulp_probe() upon success.  After that, all calls from the RDMA driver
to bnxt_en will pass a pointer to this struct.

2. This struct contains additional function pointers to register, request
msix, send fw messages, register for async events.

3. If the RDMA driver wants to enable RDMA on the function, it needs to
call the function pointer bnxt_register_device().  A ulp_ops structure
is passed for upcalls from bnxt_en to the RDMA driver.

4. MSIX is requested by calling the function pointer bnxt_request_msix().

5. The RDMA driver can call firmware APIs using the bnxt_send_fw_msg()
function pointer.

6. 1 stats context is reserved when the RDMA driver registers.  MSIX
and completion rings are reserved when the RDMA driver requests for
MSIX.

7. When the RDMA driver calls bnxt_unregister_device(), all RDMA resources
will be cleaned up.
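The registration contract in points 3, 6 and 7 can be modeled with a toy struct-of-function-pointers interface. This is a hedged sketch of the pattern, not the bnxt_en API: the struct layouts, the `-1` error value, and the single-ULP assumption are all illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Callback table the (hypothetical) ULP hands to the netdev driver,
 * analogous in spirit to the ulp_ops described above. */
struct ulp_ops {
    void (*async_event)(int event_id);
};

/* Shared-device handle returned to the ULP, analogous to bnxt_en_dev. */
struct en_dev {
    const struct ulp_ops *ulp;      /* set while a ULP is registered */
    int stat_ctxs_reserved;
};

static int register_ulp(struct en_dev *edev, const struct ulp_ops *ops)
{
    if (edev->ulp)
        return -1;                  /* function already claimed */
    edev->ulp = ops;
    edev->stat_ctxs_reserved = 1;   /* one stats context, as in point 6 */
    return 0;
}

static void unregister_ulp(struct en_dev *edev)
{
    edev->ulp = NULL;               /* point 7: resources cleaned up */
    edev->stat_ctxs_reserved = 0;
}
```

The netdev side can then safely dispatch events through `edev->ulp->async_event()` whenever the pointer is non-NULL, which is the shape of the `bnxt_ulp_async_events()` call added to `bnxt_async_event_process()`.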

Cc: David Miller 
Cc: 
Signed-off-by: Michael Chan 
Signed-off-by: Selvin Xavier 
---
 drivers/net/ethernet/broadcom/bnxt/Makefile   |   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  34 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |   6 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c | 288 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h |  91 
 5 files changed, 417 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h

diff --git a/drivers/net/ethernet/broadcom/bnxt/Makefile b/drivers/net/ethernet/broadcom/bnxt/Makefile
index b233a86..6082ed1 100644
--- a/drivers/net/ethernet/broadcom/bnxt/Makefile
+++ b/drivers/net/ethernet/broadcom/bnxt/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_BNXT) += bnxt_en.o
 
-bnxt_en-y := bnxt.o bnxt_sriov.o bnxt_ethtool.o bnxt_dcb.o
+bnxt_en-y := bnxt.o bnxt_sriov.o bnxt_ethtool.o bnxt_dcb.o bnxt_ulp.o
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c26735ea..19f26b8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -52,6 +52,7 @@
 
 #include "bnxt_hsi.h"
 #include "bnxt.h"
+#include "bnxt_ulp.h"
 #include "bnxt_sriov.h"
 #include "bnxt_ethtool.h"
 #include "bnxt_dcb.h"
@@ -1528,12 +1529,11 @@ static int bnxt_async_event_process(struct bnxt *bp,
		set_bit(BNXT_RESET_TASK_SILENT_SP_EVENT, &bp->sp_event);
break;
default:
-   netdev_err(bp->dev, "unhandled ASYNC event (id 0x%x)\n",
-  event_id);
goto async_event_process_exit;
}
	schedule_work(&bp->sp_task);
 async_event_process_exit:
+   bnxt_ulp_async_events(bp, cmpl);
return 0;
 }
 
@@ -3547,7 +3547,7 @@ static int bnxt_hwrm_vnic_ctx_alloc(struct bnxt *bp, u16 vnic_id, u16 ctx_idx)
return rc;
 }
 
-static int bnxt_hwrm_vnic_cfg(struct bnxt *bp, u16 vnic_id)
+int bnxt_hwrm_vnic_cfg(struct bnxt *bp, u16 vnic_id)
 {
unsigned int ring = 0, grp_idx;
	struct bnxt_vnic_info *vnic = &bp->vnic_info[vnic_id];
@@ -3595,6 +3595,9 @@ static int bnxt_hwrm_vnic_cfg(struct bnxt *bp, u16 vnic_id)
 #endif
if ((bp->flags & BNXT_FLAG_STRIP_VLAN) || def_vlan)
req.flags |= cpu_to_le32(VNIC_CFG_REQ_FLAGS_VLAN_STRIP_MODE);
+   if (!vnic_id && bnxt_ulp_registered(bp->edev, BNXT_ROCE_ULP))
+   req.flags |=
+   cpu_to_le32(VNIC_CFG_REQ_FLAGS_ROCE_DUAL_VNIC_MODE);
 
	return hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
 }
@@ -4842,6 +4845,16 @@ unsigned int bnxt_get_max_func_stat_ctxs(struct bnxt *bp)
 #endif
 }
 
+void bnxt_set_max_func_stat_ctxs(struct bnxt *bp, unsigned int max)
+{
+   if (BNXT_PF(bp))
+   bp->pf.max_stat_ctxs = max;
+#if defined(CONFIG_BNXT_SRIOV)
+   else
+   bp->vf.max_stat_ctxs = max;
+#endif
+}
+
 unsigned int bnxt_get_max_func_cp_rings(struct bnxt *bp)
 {
if (BNXT_PF(bp))
@@ -4851,6 +4864,16 @@ unsigned int bnxt_get_max_func_cp_rings(struct bnxt *bp)
 #endif
 }
 
+void bnxt_set_max_func_cp_rings(struct bnxt *bp, unsigned int max)
+{
+   if (BNXT_PF(bp))
+   bp->pf.max_cp_rings = max;
+#if defined(CONFIG_BNXT_SRIOV)
+   else
+   bp->vf.max_cp_rings = max;
+#endif
+}
+
 static unsigned int bnxt_get_max_func_irqs(struct bnxt *bp)
 {
if (BNXT_PF(bp))
@@ -6767,6 +6790,8 @@ static void bnxt_remove_one(struct pci_dev *pdev)
pci_iounmap(pdev, bp->bar2);
pci_iounmap(pdev, bp->bar1);
pci_iounmap(pdev, bp->bar0);
+   kfree(bp->edev);
+   bp->edev = 

[PATCH 04/28] bnxt_en: Improve completion ring allocation for VFs.

2016-12-04 Thread Selvin Xavier
From: Michael Chan 

All available remaining completion rings not used by the PF should be
made available for the VFs so that there are enough rings in the VF to
support RDMA.  The earlier workaround code of capping the rings by the
statistics context is removed.

When SRIOV is disabled, call a new function bnxt_restore_pf_fw_resources()
to restore FW resources.  Later on we need to add some logic to account
for RDMA resources.

Cc: David Miller 
Cc: 
Signed-off-by: Michael Chan 
Signed-off-by: Selvin Xavier 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c   |  8 +++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h   |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 14 --
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 8b00ef4..1f6be83 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4152,7 +4152,7 @@ static int bnxt_hwrm_func_qcfg(struct bnxt *bp)
return rc;
 }
 
-int bnxt_hwrm_func_qcaps(struct bnxt *bp)
+static int bnxt_hwrm_func_qcaps(struct bnxt *bp)
 {
int rc = 0;
struct hwrm_func_qcaps_input req = {0};
@@ -6856,6 +6856,12 @@ static int bnxt_set_dflt_rings(struct bnxt *bp)
return rc;
 }
 
+void bnxt_restore_pf_fw_resources(struct bnxt *bp)
+{
+   ASSERT_RTNL();
+   bnxt_hwrm_func_qcaps(bp);
+}
+
 static void bnxt_parse_log_pcie_link(struct bnxt *bp)
 {
enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 0ee2cc4..43a4b17 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1234,7 +1234,6 @@ int _hwrm_send_message(struct bnxt *, void *, u32, int);
 int hwrm_send_message(struct bnxt *, void *, u32, int);
 int hwrm_send_message_silent(struct bnxt *, void *, u32, int);
 int bnxt_hwrm_set_coal(struct bnxt *);
-int bnxt_hwrm_func_qcaps(struct bnxt *);
 void bnxt_set_max_func_irqs(struct bnxt *bp, unsigned int max);
 void bnxt_tx_disable(struct bnxt *bp);
 void bnxt_tx_enable(struct bnxt *bp);
@@ -1245,4 +1244,5 @@ int bnxt_open_nic(struct bnxt *, bool, bool);
 int bnxt_close_nic(struct bnxt *, bool, bool);
 int bnxt_setup_mq_tc(struct net_device *dev, u8 tc);
 int bnxt_get_max_rings(struct bnxt *, int *, int *, bool);
+void bnxt_restore_pf_fw_resources(struct bnxt *bp);
 #endif
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
index bff626a..c696025 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
@@ -420,15 +420,7 @@ static int bnxt_hwrm_func_cfg(struct bnxt *bp, int num_vfs)
	bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1);
 
/* Remaining rings are distributed equally amongs VF's for now */
-   /* TODO: the following workaroud is needed to restrict total number
-* of vf_cp_rings not exceed number of HW ring groups. This WA should
-* be removed once new HWRM provides HW ring groups capability in
-* hwrm_func_qcap.
-*/
-   vf_cp_rings = min_t(u16, pf->max_cp_rings, pf->max_stat_ctxs);
-   vf_cp_rings = (vf_cp_rings - bp->cp_nr_rings) / num_vfs;
-   /* TODO: restore this logic below once the WA above is removed */
-   /* vf_cp_rings = (pf->max_cp_rings - bp->cp_nr_rings) / num_vfs; */
+   vf_cp_rings = (pf->max_cp_rings - bp->cp_nr_rings) / num_vfs;
vf_stat_ctx = (pf->max_stat_ctxs - bp->num_stat_ctxs) / num_vfs;
if (bp->flags & BNXT_FLAG_AGG_RINGS)
vf_rx_rings = (pf->max_rx_rings - bp->rx_nr_rings * 2) /
@@ -590,7 +582,9 @@ void bnxt_sriov_disable(struct bnxt *bp)
 
bp->pf.active_vfs = 0;
/* Reclaim all resources for the PF. */
-   bnxt_hwrm_func_qcaps(bp);
+   rtnl_lock();
+   bnxt_restore_pf_fw_resources(bp);
+   rtnl_unlock();
 }
 
 int bnxt_sriov_configure(struct pci_dev *pdev, int num_vfs)
-- 
2.5.5



[PATCH 03/28] bnxt_en: Move function reset to bnxt_init_one().

2016-12-04 Thread Selvin Xavier
From: Michael Chan 

Now that MSIX is enabled in bnxt_init_one(), resources may be allocated by
the RDMA driver before the network device is opened.  So we cannot do
function reset in bnxt_open() which will clear all the resources.

The proper place to do function reset now is in bnxt_init_one().
If we get AER, we'll do function reset as well.

Cc: David Miller 
Cc: 
Signed-off-by: Michael Chan 
Signed-off-by: Selvin Xavier 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 25 ++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  1 -
 2 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9178bf8..8b00ef4 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -5613,22 +5613,7 @@ int bnxt_open_nic(struct bnxt *bp, bool irq_re_init, 
bool link_re_init)
 static int bnxt_open(struct net_device *dev)
 {
struct bnxt *bp = netdev_priv(dev);
-   int rc = 0;
 
-   if (!test_bit(BNXT_STATE_FN_RST_DONE, &bp->state)) {
-   rc = bnxt_hwrm_func_reset(bp);
-   if (rc) {
-   netdev_err(bp->dev, "hwrm chip reset failure rc: %x\n",
-  rc);
-   rc = -EBUSY;
-   return rc;
-   }
-   /* Do func_reset during the 1st PF open only to prevent killing
-* the VFs when the PF is brought down and up.
-*/
-   if (BNXT_PF(bp))
-   set_bit(BNXT_STATE_FN_RST_DONE, &bp->state);
-   }
return __bnxt_open_nic(bp, true, true);
 }
 
@@ -7028,6 +7013,10 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if (rc)
goto init_err;
 
+   rc = bnxt_hwrm_func_reset(bp);
+   if (rc)
+   goto init_err;
+
rc = bnxt_init_int_mode(bp);
if (rc)
goto init_err;
@@ -7069,7 +7058,6 @@ static pci_ers_result_t bnxt_io_error_detected(struct 
pci_dev *pdev,
   pci_channel_state_t state)
 {
struct net_device *netdev = pci_get_drvdata(pdev);
-   struct bnxt *bp = netdev_priv(netdev);
 
netdev_info(netdev, "PCI I/O error detected\n");
 
@@ -7084,8 +7072,6 @@ static pci_ers_result_t bnxt_io_error_detected(struct 
pci_dev *pdev,
if (netif_running(netdev))
bnxt_close(netdev);
 
-   /* So that func_reset will be done during slot_reset */
-   clear_bit(BNXT_STATE_FN_RST_DONE, &bp->state);
pci_disable_device(pdev);
rtnl_unlock();
 
@@ -7119,7 +7105,8 @@ static pci_ers_result_t bnxt_io_slot_reset(struct pci_dev 
*pdev)
} else {
pci_set_master(pdev);
 
-   if (netif_running(netdev))
+   err = bnxt_hwrm_func_reset(bp);
+   if (!err && netif_running(netdev))
err = bnxt_open(netdev);
 
if (!err)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 1461355..0ee2cc4 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1021,7 +1021,6 @@ struct bnxt {
unsigned long   state;
 #define BNXT_STATE_OPEN		0
 #define BNXT_STATE_IN_SP_TASK  1
-#define BNXT_STATE_FN_RST_DONE 2
 
struct bnxt_irq *irq_tbl;
int total_irqs;
-- 
2.5.5



[PATCH 06/28] bnxt_en: Refactor the driver registration function with firmware.

2016-12-04 Thread Selvin Xavier
From: Michael Chan 

The driver register function with firmware consists of passing version
information and registering for async events.  To support the RDMA driver,
the async events that we need to register may change.  Separate the
driver register function into 2 parts so that we can just update the
async events for the RDMA driver.

Cc: David Miller 
Cc: 
Signed-off-by: Michael Chan 
Signed-off-by: Selvin Xavier 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 34 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  2 ++
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 7218d65..c26735ea 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3117,27 +3117,46 @@ int hwrm_send_message_silent(struct bnxt *bp, void 
*msg, u32 msg_len,
return rc;
 }
 
-static int bnxt_hwrm_func_drv_rgtr(struct bnxt *bp)
+int bnxt_hwrm_func_rgtr_async_events(struct bnxt *bp, unsigned long *bmap,
+int bmap_size)
 {
struct hwrm_func_drv_rgtr_input req = {0};
-   int i;
DECLARE_BITMAP(async_events_bmap, 256);
u32 *events = (u32 *)async_events_bmap;
+   int i;
 
	bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_DRV_RGTR, -1, -1);
 
req.enables =
-   cpu_to_le32(FUNC_DRV_RGTR_REQ_ENABLES_OS_TYPE |
-   FUNC_DRV_RGTR_REQ_ENABLES_VER |
-   FUNC_DRV_RGTR_REQ_ENABLES_ASYNC_EVENT_FWD);
+   cpu_to_le32(FUNC_DRV_RGTR_REQ_ENABLES_ASYNC_EVENT_FWD);
 
memset(async_events_bmap, 0, sizeof(async_events_bmap));
for (i = 0; i < ARRAY_SIZE(bnxt_async_events_arr); i++)
__set_bit(bnxt_async_events_arr[i], async_events_bmap);
 
+   if (bmap && bmap_size) {
+   for (i = 0; i < bmap_size; i++) {
+   if (test_bit(i, bmap))
+   __set_bit(i, async_events_bmap);
+   }
+   }
+
for (i = 0; i < 8; i++)
req.async_event_fwd[i] |= cpu_to_le32(events[i]);
 
+   return hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+}
+
+static int bnxt_hwrm_func_drv_rgtr(struct bnxt *bp)
+{
+   struct hwrm_func_drv_rgtr_input req = {0};
+
+   bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_DRV_RGTR, -1, -1);
+
+   req.enables =
+   cpu_to_le32(FUNC_DRV_RGTR_REQ_ENABLES_OS_TYPE |
+   FUNC_DRV_RGTR_REQ_ENABLES_VER);
+
req.os_type = cpu_to_le16(FUNC_DRV_RGTR_REQ_OS_TYPE_LINUX);
req.ver_maj = DRV_VER_MAJ;
req.ver_min = DRV_VER_MIN;
@@ -3146,6 +3165,7 @@ static int bnxt_hwrm_func_drv_rgtr(struct bnxt *bp)
if (BNXT_PF(bp)) {
DECLARE_BITMAP(vf_req_snif_bmap, 256);
u32 *data = (u32 *)vf_req_snif_bmap;
+   int i;
 
memset(vf_req_snif_bmap, 0, sizeof(vf_req_snif_bmap));
for (i = 0; i < ARRAY_SIZE(bnxt_vf_req_snif); i++)
@@ -7023,6 +7043,10 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if (rc)
goto init_err;
 
+   rc = bnxt_hwrm_func_rgtr_async_events(bp, NULL, 0);
+   if (rc)
+   goto init_err;
+
/* Get the MAX capabilities for this function */
rc = bnxt_hwrm_func_qcaps(bp);
if (rc) {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index d796836..eec2415 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1240,6 +1240,8 @@ void bnxt_hwrm_cmd_hdr_init(struct bnxt *, void *, u16, 
u16, u16);
 int _hwrm_send_message(struct bnxt *, void *, u32, int);
 int hwrm_send_message(struct bnxt *, void *, u32, int);
 int hwrm_send_message_silent(struct bnxt *, void *, u32, int);
+int bnxt_hwrm_func_rgtr_async_events(struct bnxt *bp, unsigned long *bmap,
+int bmap_size);
 int bnxt_hwrm_set_coal(struct bnxt *);
 unsigned int bnxt_get_max_func_stat_ctxs(struct bnxt *bp);
 unsigned int bnxt_get_max_func_cp_rings(struct bnxt *bp);
-- 
2.5.5



[PATCH 05/28] bnxt_en: Reserve RDMA resources by default.

2016-12-04 Thread Selvin Xavier
From: Michael Chan 

If the device supports RDMA, we'll setup network default rings so that
there are enough minimum resources for RDMA, if possible.  However, the
user can still increase network rings to the max if he wants.  The actual
RDMA resources won't be reserved until the RDMA driver registers.

Cc: David Miller 
Cc: 
Signed-off-by: Michael Chan 
Signed-off-by: Selvin Xavier 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 58 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  9 +
 2 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1f6be83..7218d65 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4166,6 +4166,11 @@ static int bnxt_hwrm_func_qcaps(struct bnxt *bp)
if (rc)
goto hwrm_func_qcaps_exit;
 
+   if (resp->flags & cpu_to_le32(FUNC_QCAPS_RESP_FLAGS_ROCE_V1_SUPPORTED))
+   bp->flags |= BNXT_FLAG_ROCEV1_CAP;
+   if (resp->flags & cpu_to_le32(FUNC_QCAPS_RESP_FLAGS_ROCE_V2_SUPPORTED))
+   bp->flags |= BNXT_FLAG_ROCEV2_CAP;
+
bp->tx_push_thresh = 0;
if (resp->flags &
cpu_to_le32(FUNC_QCAPS_RESP_FLAGS_PUSH_MODE_SUPPORTED))
@@ -4808,6 +4813,24 @@ static int bnxt_setup_int_mode(struct bnxt *bp)
return rc;
 }
 
+unsigned int bnxt_get_max_func_stat_ctxs(struct bnxt *bp)
+{
+   if (BNXT_PF(bp))
+   return bp->pf.max_stat_ctxs;
+#if defined(CONFIG_BNXT_SRIOV)
+   return bp->vf.max_stat_ctxs;
+#endif
+}
+
+unsigned int bnxt_get_max_func_cp_rings(struct bnxt *bp)
+{
+   if (BNXT_PF(bp))
+   return bp->pf.max_cp_rings;
+#if defined(CONFIG_BNXT_SRIOV)
+   return bp->vf.max_cp_rings;
+#endif
+}
+
 static unsigned int bnxt_get_max_func_irqs(struct bnxt *bp)
 {
if (BNXT_PF(bp))
@@ -6832,6 +6855,39 @@ int bnxt_get_max_rings(struct bnxt *bp, int *max_rx, int 
*max_tx, bool shared)
return bnxt_trim_rings(bp, max_rx, max_tx, cp, shared);
 }
 
+static int bnxt_get_dflt_rings(struct bnxt *bp, int *max_rx, int *max_tx,
+  bool shared)
+{
+   int rc;
+
+   rc = bnxt_get_max_rings(bp, max_rx, max_tx, shared);
+   if (rc)
+   return rc;
+
+   if (bp->flags & BNXT_FLAG_ROCE_CAP) {
+   int max_cp, max_stat, max_irq;
+
+   /* Reserve minimum resources for RoCE */
+   max_cp = bnxt_get_max_func_cp_rings(bp);
+   max_stat = bnxt_get_max_func_stat_ctxs(bp);
+   max_irq = bnxt_get_max_func_irqs(bp);
+   if (max_cp <= BNXT_MIN_ROCE_CP_RINGS ||
+   max_irq <= BNXT_MIN_ROCE_CP_RINGS ||
+   max_stat <= BNXT_MIN_ROCE_STAT_CTXS)
+   return 0;
+
+   max_cp -= BNXT_MIN_ROCE_CP_RINGS;
+   max_irq -= BNXT_MIN_ROCE_CP_RINGS;
+   max_stat -= BNXT_MIN_ROCE_STAT_CTXS;
+   max_cp = min_t(int, max_cp, max_irq);
+   max_cp = min_t(int, max_cp, max_stat);
+   rc = bnxt_trim_rings(bp, max_rx, max_tx, max_cp, shared);
+   if (rc)
+   rc = 0;
+   }
+   return rc;
+}
+
 static int bnxt_set_dflt_rings(struct bnxt *bp)
 {
int dflt_rings, max_rx_rings, max_tx_rings, rc;
@@ -6840,7 +6896,7 @@ static int bnxt_set_dflt_rings(struct bnxt *bp)
if (sh)
bp->flags |= BNXT_FLAG_SHARED_RINGS;
dflt_rings = netif_get_num_default_rss_queues();
-   rc = bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, sh);
+   rc = bnxt_get_dflt_rings(bp, &max_rx_rings, &max_tx_rings, sh);
if (rc)
return rc;
bp->rx_nr_rings = min_t(int, dflt_rings, max_rx_rings);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 43a4b17..d796836 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -387,6 +387,9 @@ struct rx_tpa_end_cmp_ext {
 #define DB_KEY_TX_PUSH (0x4 << 28)
 #define DB_LONG_TX_PUSH	(0x2 << 24)
 
+#define BNXT_MIN_ROCE_CP_RINGS 2
+#define BNXT_MIN_ROCE_STAT_CTXS	1
+
 #define INVALID_HW_RING_ID ((u16)-1)
 
 /* The hardware supports certain page sizes.  Use the supported page sizes
@@ -953,6 +956,10 @@ struct bnxt {
	#define BNXT_FLAG_PORT_STATS	0x400
#define BNXT_FLAG_UDP_RSS_CAP   0x800
#define BNXT_FLAG_EEE_CAP   0x1000
+	#define BNXT_FLAG_ROCEV1_CAP	0x8000
+	#define BNXT_FLAG_ROCEV2_CAP	0x10000
+	#define BNXT_FLAG_ROCE_CAP	(BNXT_FLAG_ROCEV1_CAP | \
+					 BNXT_FLAG_ROCEV2_CAP)

[PATCH 01/28] bnxt_en: Add bnxt_set_max_func_irqs().

2016-12-04 Thread Selvin Xavier
From: Michael Chan 

By refactoring existing code into this new function.  The new function
will be used in subsequent patches.

Cc: David Miller 
Cc: 
Signed-off-by: Michael Chan 
Signed-off-by: Selvin Xavier 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 17 +++--
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  1 +
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index e8ab5fd..6cdfe3e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4743,6 +4743,16 @@ static int bnxt_trim_rings(struct bnxt *bp, int *rx, int 
*tx, int max,
return 0;
 }
 
+void bnxt_set_max_func_irqs(struct bnxt *bp, unsigned int max_irqs)
+{
+   if (BNXT_PF(bp))
+   bp->pf.max_irqs = max_irqs;
+#if defined(CONFIG_BNXT_SRIOV)
+   else
+   bp->vf.max_irqs = max_irqs;
+#endif
+}
+
 static int bnxt_setup_msix(struct bnxt *bp)
 {
struct msix_entry *msix_ent;
@@ -6949,12 +6959,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
 
bnxt_set_tpa_flags(bp);
bnxt_set_ring_params(bp);
-   if (BNXT_PF(bp))
-   bp->pf.max_irqs = max_irqs;
-#if defined(CONFIG_BNXT_SRIOV)
-   else
-   bp->vf.max_irqs = max_irqs;
-#endif
+   bnxt_set_max_func_irqs(bp, max_irqs);
bnxt_set_dflt_rings(bp);
 
/* Default RSS hash cfg. */
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index b4abc1b..8327d0d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1235,6 +1235,7 @@ int hwrm_send_message(struct bnxt *, void *, u32, int);
 int hwrm_send_message_silent(struct bnxt *, void *, u32, int);
 int bnxt_hwrm_set_coal(struct bnxt *);
 int bnxt_hwrm_func_qcaps(struct bnxt *);
+void bnxt_set_max_func_irqs(struct bnxt *bp, unsigned int max);
 void bnxt_tx_disable(struct bnxt *bp);
 void bnxt_tx_enable(struct bnxt *bp);
 int bnxt_hwrm_set_pause(struct bnxt *);
-- 
2.5.5



[PATCH 02/28] bnxt_en: Enable MSIX early in bnxt_init_one().

2016-12-04 Thread Selvin Xavier
From: Michael Chan 

To better support the new RDMA driver, we need to move pci_enable_msix()
from bnxt_open() to bnxt_init_one().  This way, MSIX vectors are available
to the RDMA driver whether the network device is up or down.

Part of the existing bnxt_setup_int_mode() function is now refactored into
a new bnxt_init_int_mode().  bnxt_init_int_mode() is called during
bnxt_init_one() to enable MSIX.  The remaining logic in
bnxt_setup_int_mode() to map the IRQs to the completion rings is called
during bnxt_open().

Cc: David Miller 
Cc: 
Signed-off-by: Michael Chan 
Signed-off-by: Selvin Xavier 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 183 +++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |   1 +
 2 files changed, 115 insertions(+), 69 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6cdfe3e..9178bf8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4743,6 +4743,80 @@ static int bnxt_trim_rings(struct bnxt *bp, int *rx, int 
*tx, int max,
return 0;
 }
 
+static void bnxt_setup_msix(struct bnxt *bp)
+{
+   const int len = sizeof(bp->irq_tbl[0].name);
+   struct net_device *dev = bp->dev;
+   int tcs, i;
+
+   tcs = netdev_get_num_tc(dev);
+   if (tcs > 1) {
+   bp->tx_nr_rings_per_tc = bp->tx_nr_rings / tcs;
+   if (bp->tx_nr_rings_per_tc == 0) {
+   netdev_reset_tc(dev);
+   bp->tx_nr_rings_per_tc = bp->tx_nr_rings;
+   } else {
+   int i, off, count;
+
+   bp->tx_nr_rings = bp->tx_nr_rings_per_tc * tcs;
+   for (i = 0; i < tcs; i++) {
+   count = bp->tx_nr_rings_per_tc;
+   off = i * count;
+   netdev_set_tc_queue(dev, i, count, off);
+   }
+   }
+   }
+
+   for (i = 0; i < bp->cp_nr_rings; i++) {
+   char *attr;
+
+   if (bp->flags & BNXT_FLAG_SHARED_RINGS)
+   attr = "TxRx";
+   else if (i < bp->rx_nr_rings)
+   attr = "rx";
+   else
+   attr = "tx";
+
+   snprintf(bp->irq_tbl[i].name, len, "%s-%s-%d", dev->name, attr,
+i);
+   bp->irq_tbl[i].handler = bnxt_msix;
+   }
+}
+
+static void bnxt_setup_inta(struct bnxt *bp)
+{
+   const int len = sizeof(bp->irq_tbl[0].name);
+
+   if (netdev_get_num_tc(bp->dev))
+   netdev_reset_tc(bp->dev);
+
+   snprintf(bp->irq_tbl[0].name, len, "%s-%s-%d", bp->dev->name, "TxRx",
+0);
+   bp->irq_tbl[0].handler = bnxt_inta;
+}
+
+static int bnxt_setup_int_mode(struct bnxt *bp)
+{
+   int rc;
+
+   if (bp->flags & BNXT_FLAG_USING_MSIX)
+   bnxt_setup_msix(bp);
+   else
+   bnxt_setup_inta(bp);
+
+   rc = bnxt_set_real_num_queues(bp);
+   return rc;
+}
+
+static unsigned int bnxt_get_max_func_irqs(struct bnxt *bp)
+{
+   if (BNXT_PF(bp))
+   return bp->pf.max_irqs;
+#if defined(CONFIG_BNXT_SRIOV)
+   return bp->vf.max_irqs;
+#endif
+}
+
 void bnxt_set_max_func_irqs(struct bnxt *bp, unsigned int max_irqs)
 {
if (BNXT_PF(bp))
@@ -4753,16 +4827,12 @@ void bnxt_set_max_func_irqs(struct bnxt *bp, unsigned 
int max_irqs)
 #endif
 }
 
-static int bnxt_setup_msix(struct bnxt *bp)
+static int bnxt_init_msix(struct bnxt *bp)
 {
-   struct msix_entry *msix_ent;
-   struct net_device *dev = bp->dev;
int i, total_vecs, rc = 0, min = 1;
-   const int len = sizeof(bp->irq_tbl[0].name);
-
-   bp->flags &= ~BNXT_FLAG_USING_MSIX;
-   total_vecs = bp->cp_nr_rings;
+   struct msix_entry *msix_ent;
 
+   total_vecs = bnxt_get_max_func_irqs(bp);
msix_ent = kcalloc(total_vecs, sizeof(struct msix_entry), GFP_KERNEL);
if (!msix_ent)
return -ENOMEM;
@@ -4783,8 +4853,10 @@ static int bnxt_setup_msix(struct bnxt *bp)
 
bp->irq_tbl = kcalloc(total_vecs, sizeof(struct bnxt_irq), GFP_KERNEL);
if (bp->irq_tbl) {
-   int tcs;
+   for (i = 0; i < total_vecs; i++)
+   bp->irq_tbl[i].vector = msix_ent[i].vector;
 
+   bp->total_irqs = total_vecs;
/* Trim rings based upon num of vectors allocated */
rc = bnxt_trim_rings(bp, >rx_nr_rings, >tx_nr_rings,
 total_vecs, min == 1);
@@ -4792,43 +4864,10 @@ static int bnxt_setup_msix(struct bnxt *bp)
goto msix_setup_exit;
 
bp->tx_nr_rings_per_tc = bp->tx_nr_rings;
-   tcs = 

Re: [PATCH V2 net 10/20] net/ena: remove redundant logic in napi callback for busy poll mode

2016-12-04 Thread Eric Dumazet
On Sun, 2016-12-04 at 15:19 +0200, Netanel Belgazal wrote:
> sk_busy_loop can call the napi callback a few million times a sec.
> For each call there is an interrupt unmask.
> We want to reduce the number of unmasks.
> 
> Add an atomic variable that will tell the napi handler if
> it was called from irq context or not.
> Unmask the interrupt only from irq context.
> 
> A scenario where the driver is left with a missed unmask isn't feasible.
> When ena_intr_msix_io is called, the driver has 2 options:
> 1) Before napi completes and calls napi_complete_done
> 2) After calling napi_complete_done
> 
> In the former case the napi will unmask the interrupt as needed.
> In the latter case napi_complete_done will remove napi from the schedule
> list so napi will be rescheduled (by ena_intr_msix_io) and the interrupt
> will be unmasked as desired in the 2nd napi call.
> 
> Signed-off-by: Netanel Belgazal 
> ---


This looks very complicated to me.

I guess you missed the recent patches that happened on net-next ?

2e713283751f494596655d9125c168aeb913f71d net/mlx4_en: use napi_complete_done() 
return value
364b6055738b4c752c30ccaaf25c624e69d76195 net: busy-poll: return busypolling 
status to drivers
21cb84c48ca0619181106f0f44f3802a989de024 net: busy-poll: remove need_resched() 
from sk_can_busy_loop()
217f6974368188fd8bd7804bf5a036aa5762c5e4 net: busy-poll: allow preemption in 
sk_busy_loop()

napi_complete_done() return code can be used by a driver,
no need to add yet another atomic operation in fast path.

Anyway, this looks wrong :

@@ -1186,6 +1201,7 @@ static irqreturn_t ena_intr_msix_io(int irq, void *data)
 {
struct ena_napi *ena_napi = data;
 
+   atomic_set(&ena_napi->unmask_interrupt, 1);
	napi_schedule(&ena_napi->napi);
 
You probably wanted :

if (napi_schedule_prep(n)) {
	atomic_set(&ena_napi->unmask_interrupt, 1);
__napi_schedule(n);
}



Please rework this napi poll using core infrastructure.

busypoll logic should be centralized, not reimplemented in different ways in a 
driver.

Thanks.





Re: [PATCH] iproute2: ss: escape all null bytes in abstract unix domain socket

2016-12-04 Thread Isaac Boukris
On Sat, Dec 3, 2016 at 1:24 AM, Eric Dumazet  wrote:
> On Fri, 2016-12-02 at 15:18 -0800, Stephen Hemminger wrote:
>> name[i] = '@';
>> >
>> > ss.c: In function 'unix_show_sock':
>> > ss.c:3128:4: error: 'for' loop initial declarations are only allowed in 
>> > C99 mode
>> > ss.c:3128:4: note: use option -std=c99 or -std=gnu99 to compile your code
>> > make[1]: *** [ss.o] Error 1
>> >
>> >
>> >
>>
>> Thanks, fixed by patch from Simon
>
> Right, thanks !


Thanks for notifying me. Sorry for the bug and thanks for the fix!


Re: [PATCH V2 net 13/20] net/ena: change driver's default timeouts

2016-12-04 Thread Matt Wilson
On Sun, Dec 04, 2016 at 03:19:31PM +0200, Netanel Belgazal wrote:

... because? (they turned out to be too aggressive, I believe.)

> Signed-off-by: Netanel Belgazal 
> ---
>  drivers/net/ethernet/amazon/ena/ena_com.c| 4 ++--
>  drivers/net/ethernet/amazon/ena/ena_netdev.h | 7 ---
>  2 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
> b/drivers/net/ethernet/amazon/ena/ena_com.c
> index 4147d6e..a550c8a 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_com.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
> @@ -36,9 +36,9 @@
>  
> /*/
>  
>  /* Timeout in micro-sec */
> -#define ADMIN_CMD_TIMEOUT_US (100)
> +#define ADMIN_CMD_TIMEOUT_US (300)
>  
> -#define ENA_ASYNC_QUEUE_DEPTH 4
> +#define ENA_ASYNC_QUEUE_DEPTH 16

Why is this changed at the same time?

>  #define ENA_ADMIN_QUEUE_DEPTH 32
>  
>  #define MIN_ENA_VER (((ENA_COMMON_SPEC_VERSION_MAJOR) << \
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h 
> b/drivers/net/ethernet/amazon/ena/ena_netdev.h
> index c081fd3..ed42e07 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
> @@ -39,6 +39,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

This change seems unrelated.

>  #include "ena_com.h"
>  #include "ena_eth_com.h"
> @@ -100,7 +101,7 @@
>  /* Number of queues to check for missing queues per timer service */
>  #define ENA_MONITORED_TX_QUEUES  4
>  /* Max timeout packets before device reset */
> -#define MAX_NUM_OF_TIMEOUTED_PACKETS 32
> +#define MAX_NUM_OF_TIMEOUTED_PACKETS 128
>  
>  #define ENA_TX_RING_IDX_NEXT(idx, ring_size) (((idx) + 1) & ((ring_size) - 
> 1))
>  
> @@ -116,9 +117,9 @@
>  #define ENA_IO_IRQ_IDX(q)(ENA_IO_IRQ_FIRST_IDX + (q))
>  
>  /* ENA device should send keep alive msg every 1 sec.
> - * We wait for 3 sec just to be on the safe side.
> + * We wait for 6 sec just to be on the safe side.
>   */
> -#define ENA_DEVICE_KALIVE_TIMEOUT(3 * HZ)
> +#define ENA_DEVICE_KALIVE_TIMEOUT(6 * HZ)
>  
>  #define ENA_MMIO_DISABLE_REG_READBIT(0)
>  


Re: [PATCH V2 net 08/20] net/ena: add hardware hints capability to the driver

2016-12-04 Thread Matt Wilson
On Sun, Dec 04, 2016 at 03:19:26PM +0200, Netanel Belgazal wrote:
> The ENA device can update the ena driver about the desired timeouts.
> The hardware hints are transmitted as Asynchronous event to the driver.

This is really a new feature, not a bugfix - correct? If it is a new
feature, submit it separately. If the built-in defaults need to be
changed, submit that as a bugfix.

--msw

> In case the device does not support this capability, the driver
> will use its own defines.
> 
> Signed-off-by: Netanel Belgazal 
> ---
>  drivers/net/ethernet/amazon/ena/ena_admin_defs.h | 31 +
>  drivers/net/ethernet/amazon/ena/ena_com.c| 41 ---
>  drivers/net/ethernet/amazon/ena/ena_com.h|  5 ++
>  drivers/net/ethernet/amazon/ena/ena_ethtool.c|  1 -
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 86 
> +++-
>  drivers/net/ethernet/amazon/ena/ena_netdev.h | 19 +-
>  drivers/net/ethernet/amazon/ena/ena_regs_defs.h  |  2 +
>  7 files changed, 157 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h 
> b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
> index 6d70bf5..35ae511 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
> +++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
> @@ -70,6 +70,8 @@ enum ena_admin_aq_feature_id {
>  
>   ENA_ADMIN_MAX_QUEUES_NUM= 2,
>  
> + ENA_ADMIN_HW_HINTS  = 3,
> +
>   ENA_ADMIN_RSS_HASH_FUNCTION = 10,
>  
>   ENA_ADMIN_STATELESS_OFFLOAD_CONFIG  = 11,
> @@ -749,6 +751,31 @@ struct ena_admin_feature_rss_ind_table {
>   struct ena_admin_rss_ind_table_entry inline_entry;
>  };
>  
> +/* When hint value is 0, driver should use it's own predefined value */
> +struct ena_admin_ena_hw_hints {
> + /* value in ms */
> + u16 mmio_read_timeout;
> +
> + /* value in ms */
> + u16 driver_watchdog_timeout;
> +
> + /* Per packet tx completion timeout. value in ms */
> + u16 missing_tx_completion_timeout;
> +
> + u16 missed_tx_completion_count_threshold_to_reset;
> +
> + /* value in ms */
> + u16 admin_completion_tx_timeout;
> +
> + u16 netdev_wd_timeout;
> +
> + u16 max_tx_sgl_size;
> +
> + u16 max_rx_sgl_size;
> +
> + u16 reserved[8];
> +};
> +
>  struct ena_admin_get_feat_cmd {
>   struct ena_admin_aq_common_desc aq_common_descriptor;
>  
> @@ -782,6 +809,8 @@ struct ena_admin_get_feat_resp {
>   struct ena_admin_feature_rss_ind_table ind_table;
>  
>   struct ena_admin_feature_intr_moder_desc intr_moderation;
> +
> + struct ena_admin_ena_hw_hints hw_hints;
>   } u;
>  };
>  
> @@ -857,6 +886,8 @@ enum ena_admin_aenq_notification_syndrom {
>   ENA_ADMIN_SUSPEND   = 0,
>  
>   ENA_ADMIN_RESUME= 1,
> +
> + ENA_ADMIN_UPDATE_HINTS  = 2,
>  };
>  
>  struct ena_admin_aenq_entry {
> diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
> b/drivers/net/ethernet/amazon/ena/ena_com.c
> index 46aad3a..f1e4f04 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_com.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
> @@ -508,15 +508,13 @@ static int ena_com_comp_status_to_errno(u8 comp_status)
>  static int ena_com_wait_and_process_admin_cq_polling(struct ena_comp_ctx 
> *comp_ctx,
>struct ena_com_admin_queue 
> *admin_queue)
>  {
> - unsigned long flags;
> - u32 start_time;
> + unsigned long flags, timeout;
>   int ret;
>  
> - start_time = ((u32)jiffies_to_usecs(jiffies));
> + timeout = jiffies + usecs_to_jiffies(admin_queue->completion_timeout);
>  
>   while (comp_ctx->status == ENA_CMD_SUBMITTED) {
> - if ((((u32)jiffies_to_usecs(jiffies)) - start_time) >
> - ADMIN_CMD_TIMEOUT_US) {
> + if (time_is_before_jiffies(timeout)) {
>   pr_err("Wait for completion (polling) timeout\n");
>   /* ENA didn't have any completion */
>   spin_lock_irqsave(&admin_queue->q_lock, flags);
> @@ -560,7 +558,8 @@ static int 
> ena_com_wait_and_process_admin_cq_interrupts(struct ena_comp_ctx *com
>   int ret;
>  
> - wait_for_completion_timeout(&comp_ctx->wait_event,
> - usecs_to_jiffies(ADMIN_CMD_TIMEOUT_US));
> + usecs_to_jiffies(
> + admin_queue->completion_timeout));
>  
>   /* In case the command wasn't completed find out the root cause.
>* There might be 2 kinds of errors
> @@ -600,12 +599,14 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev 
> *ena_dev, u16 offset)
>   struct ena_com_mmio_read *mmio_read = &ena_dev->mmio_read;
>   volatile struct ena_admin_ena_mmio_req_read_less_resp *read_resp =
>   mmio_read->read_resp;
> - u32 mmio_read_reg, ret;
> + u32 

Re: [PATCH V2 net 06/20] net/ena: fix NULL dereference when removing the driver after device reset faild

2016-12-04 Thread Matt Wilson
On Sun, Dec 04, 2016 at 03:19:24PM +0200, Netanel Belgazal wrote:
> If for some reason the device stop responding and the device reset failed
> to recover the device, the mmio register read datastructure will not be
> reinitialized.

If for some reason the device stops responding, and the device reset
fails to recover the device, the MMIO register read data structure
will not be reinitialized.

> On driver removal, the driver will also tries to reset the device
> but this time the mmio data structure will be NULL.

On driver removal, the driver will also try to reset the device, but
this time the MMIO data structure will be NULL.

> To solve this issue perform the device reset in the remove function only if
> the device is runnig.

To solve this issue, perform the device reset in the remove function
only if the device is running.

Do you have an example of the NULL pointer dereference that you can
paste in? It can be helpful for those searching for a fix for a bug
they've experienced.

--msw

> Signed-off-by: Netanel Belgazal 
> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
> b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 224302c..ad5f78f 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -2516,6 +2516,8 @@ static void ena_fw_reset_device(struct work_struct 
> *work)
>  err:
>   rtnl_unlock();
>  
> + clear_bit(ENA_FLAG_DEVICE_RUNNING, >flags);
> +
>   dev_err(>dev,
>   "Reset attempt failed. Can not reset the device\n");
>  }
> @@ -3126,7 +3128,9 @@ static void ena_remove(struct pci_dev *pdev)
>  
>   cancel_work_sync(>resume_io_task);
>  
> - ena_com_dev_reset(ena_dev);
> + /* Reset the device only if the device is running. */
> + if (test_bit(ENA_FLAG_DEVICE_RUNNING, >flags))
> + ena_com_dev_reset(ena_dev);
>  
>   ena_free_mgmnt_irq(adapter);
>  


Re: [PATCH V2 net 05/20] net/ena: fix RSS default hash configuration

2016-12-04 Thread Matt Wilson
On Sun, Dec 04, 2016 at 03:19:23PM +0200, Netanel Belgazal wrote:
> ENA default hash configure IPv4_frag hash twice instead of

configure -> configures. You may want to include "erroneously". What
is the consequence of this bug?

> configure non ip packets.

configuring non-IP packets.

--msw

> Signed-off-by: Netanel Belgazal 
> ---
>  drivers/net/ethernet/amazon/ena/ena_com.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
> b/drivers/net/ethernet/amazon/ena/ena_com.c
> index 3066d9c..46aad3a 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_com.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
> @@ -2184,7 +2184,7 @@ int ena_com_set_default_hash_ctrl(struct ena_com_dev 
> *ena_dev)
>   hash_ctrl->selected_fields[ENA_ADMIN_RSS_IP4_FRAG].fields =
>   ENA_ADMIN_RSS_L3_SA | ENA_ADMIN_RSS_L3_DA;
>  
> - hash_ctrl->selected_fields[ENA_ADMIN_RSS_IP4_FRAG].fields =
> + hash_ctrl->selected_fields[ENA_ADMIN_RSS_NOT_IP].fields =
>   ENA_ADMIN_RSS_L2_DA | ENA_ADMIN_RSS_L2_SA;
>  
>   for (i = 0; i < ENA_ADMIN_RSS_PROTO_NUM; i++) {


Re: [PATCH V2 net 04/20] net/ena: fix ethtool RSS flow configuration

2016-12-04 Thread Matt Wilson
On Sun, Dec 04, 2016 at 03:19:22PM +0200, Netanel Belgazal wrote:
> ena_flow_data_to_flow_hash and ena_flow_hash_to_flow_type
> treat the ena_flow_hash_to_flow_type enum as power of two values.
> 
> Change the values of ena_admin_flow_hash_fields to be power of two values.

Then I generally prefer BIT(0), BIT(1), BIT(2), etc.

Also it would be helpful to include some comments about the
consequences of the current state of the code.

--msw

> Signed-off-by: Netanel Belgazal 
> ---
>  drivers/net/ethernet/amazon/ena/ena_admin_defs.h | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h 
> b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
> index a46e749..f48c886 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
> +++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
> @@ -631,22 +631,22 @@ enum ena_admin_flow_hash_proto {
>  /* RSS flow hash fields */
>  enum ena_admin_flow_hash_fields {
>   /* Ethernet Dest Addr */
> - ENA_ADMIN_RSS_L2_DA = 0,
> + ENA_ADMIN_RSS_L2_DA = 0x1,
>  
>   /* Ethernet Src Addr */
> - ENA_ADMIN_RSS_L2_SA = 1,
> + ENA_ADMIN_RSS_L2_SA = 0x2,
>  
>   /* ipv4/6 Dest Addr */
> - ENA_ADMIN_RSS_L3_DA = 2,
> + ENA_ADMIN_RSS_L3_DA = 0x4,
>  
>   /* ipv4/6 Src Addr */
> - ENA_ADMIN_RSS_L3_SA = 5,
> + ENA_ADMIN_RSS_L3_SA = 0x8,
>  
>   /* tcp/udp Dest Port */
> - ENA_ADMIN_RSS_L4_DP = 6,
> + ENA_ADMIN_RSS_L4_DP = 0x10,
>  
>   /* tcp/udp Src Port */
> - ENA_ADMIN_RSS_L4_SP = 7,
> + ENA_ADMIN_RSS_L4_SP = 0x20,
>  };
>  
>  struct ena_admin_proto_input {


Re: [PATCH V2 net 03/20] net/ena: fix queues number calculation

2016-12-04 Thread Matt Wilson
On Sun, Dec 04, 2016 at 03:19:21PM +0200, Netanel Belgazal wrote:
> The ENA driver tries to open a queue per vCPU.
> To determine how many vCPUs the instance have it uses num_possible_cpus
> while it should have use num_online_cpus instead.

use () when referring to functions: num_possible_cpus(), num_online_cpus().

> Signed-off-by: Netanel Belgazal 

Reviewed-by: Matt Wilson 

> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
> b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 397c9bc..224302c 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -2667,7 +2667,7 @@ static int ena_calc_io_queue_num(struct pci_dev *pdev,
>   io_sq_num = get_feat_ctx->max_queues.max_sq_num;
>   }
>  
> - io_queue_num = min_t(int, num_possible_cpus(), ENA_MAX_NUM_IO_QUEUES);
> + io_queue_num = min_t(int, num_online_cpus(), ENA_MAX_NUM_IO_QUEUES);
>   io_queue_num = min_t(int, io_queue_num, io_sq_num);
>   io_queue_num = min_t(int, io_queue_num,
>get_feat_ctx->max_queues.max_cq_num);


Re: [PATCH V2 net 02/20] net/ena: fix error handling when probe fails

2016-12-04 Thread Matt Wilson
On Sun, Dec 04, 2016 at 03:19:20PM +0200, Netanel Belgazal wrote:
> When driver fails in probe, it will release all resources, including
> adapter.
> In case of probe failure, ena_remove should not try to free the adapter
> resources.

Please word wrap your commit message around 75 columns.

> Signed-off-by: Netanel Belgazal 

Reviewed-by: Matt Wilson 

> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
> b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 33a760e..397c9bc 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -3054,6 +3054,7 @@ static int ena_probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
>  err_free_region:
>   ena_release_bars(ena_dev, pdev);
>  err_free_ena_dev:
> + pci_set_drvdata(pdev, NULL);
>   vfree(ena_dev);
>  err_disable_device:
>   pci_disable_device(pdev);


Re: [PATCH V2 net 01/20] net/ena: remove ntuple filter support from device feature list

2016-12-04 Thread Matt Wilson
On Sun, Dec 04, 2016 at 03:19:19PM +0200, Netanel Belgazal wrote:
> Remove NETIF_F_NTUPLE from netdev->features.
> The ENA device driver does not support ntuple filtering.
> 
> Signed-off-by: Netanel Belgazal 

Reviewed-by: Matt Wilson 

> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
> b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index bfeaec5..33a760e 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -2729,7 +2729,6 @@ static void ena_set_dev_offloads(struct 
> ena_com_dev_get_features_ctx *feat,
>   netdev->features =
>   dev_features |
>   NETIF_F_SG |
> - NETIF_F_NTUPLE |
>   NETIF_F_RXHASH |
>   NETIF_F_HIGHDMA;
>  


Re: "af_unix: conditionally use freezable blocking calls in read" is wrong

2016-12-04 Thread Al Viro
On Sun, Dec 04, 2016 at 09:42:14PM -0500, David Miller wrote:
> > I've run into that converting AF_UNIX to generic_file_splice_read();
> > I can kludge around that ("freezable unless ->msg_iter is ITER_PIPE"), but
> > that only delays trouble.
> > 
> > Note that the only other user of freezable_schedule_timeout() is
> > a very different story - it's a kernel thread, which *does* have a 
> > guaranteed
> > locking environment.  Making such assumptions in unix_stream_recvmsg(),
> > OTOH, is insane...
> 
> We have to otherwise Android phones drain their batteries in 10
> minutes.
> 
> I'm not going to revert this and be responsible for that.
> 
> So you have to find a way to make the freezable calls legitimate.

Oh, well...  As I said, I can kludge around that - call from
generic_file_splice_read() can be distinguished by looking at the
->msg_iter->type; it still means unpleasantness for kernel_recvmsg()
users - in effect, it can only be called with locks held if you know that
the socket is not an AF_UNIX one.

BTW, how do they deal with plain pipes?


[PATCH] ixgbe: initialize u64_stats_sync structures early at ixgbe_probe

2016-12-04 Thread Liwei Song
Fix the following call trace:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 71 PID: 1 Comm: swapper/0 Not tainted 4.8.8-WR9.0.0.1_standard #11
Hardware name: Intel Corporation S2600WTT/S2600WTT,
BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
 00200086 00200086 eb5e1ab8 c144dd70   eb5e1af8 c10af89a
 c1d23de4 eb5e1af8 0009 eb5d8600 eb5d8638 eb5e1af8 c10b14d8 0009
 000a c1d32911   e44c826c eb5d8000 eb5e1b74 c10b214e
Call Trace:
 [] dump_stack+0x5f/0x8f
 [] register_lock_class+0x25a/0x4c0
 [] ? check_irq_usage+0x88/0xc0
 [] __lock_acquire+0x5e/0x17a0
 [] ? _raw_spin_unlock_irqrestore+0x3b/0x70
 [] ? rcu_read_lock_sched_held+0x8a/0x90
 [] lock_acquire+0x9f/0x1f0
 [] ? dev_get_stats+0x5f/0x110
 [] ixgbe_get_stats64+0x113/0x320
 [] ? dev_get_stats+0x5f/0x110
 [] dev_get_stats+0x5f/0x110
 [] rtnl_fill_stats+0x40/0x105
 [] rtnl_fill_ifinfo+0x4c5/0xd20
 [] ? __kmalloc_node_track_caller+0x1a5/0x410
 [] ? __kmalloc_reserve.isra.42+0x27/0x80
 [] ? __alloc_skb+0x6f/0x270
 [] rtmsg_ifinfo_build_skb+0x71/0xd0
 [] rtmsg_ifinfo.part.23+0x1a/0x50
 [] ? call_netdevice_notifiers_info+0x2d/0x60
 [] rtmsg_ifinfo+0x2b/0x40
 [] register_netdevice+0x3d7/0x4d0
 [] register_netdev+0x17/0x30
 [] ixgbe_probe+0x118d/0x1610
 [] local_pci_probe+0x32/0x80
 [] ? pci_match_device+0xd2/0x100
 [] pci_device_probe+0xc0/0x110
 [] driver_probe_device+0x1c5/0x280
 [] ? pci_match_device+0xd2/0x100
 [] __driver_attach+0x89/0x90
 [] ? driver_probe_device+0x280/0x280
 [] bus_for_each_dev+0x4f/0x80
 [] driver_attach+0x1e/0x20
 [] ? driver_probe_device+0x280/0x280
 [] bus_add_driver+0x1a7/0x220
 [] driver_register+0x59/0xe0
 [] ? igb_init_module+0x49/0x49
 [] __pci_register_driver+0x4a/0x50
 [] ixgbe_init_module+0xa5/0xc4
 [] do_one_initcall+0x35/0x150
 [] ? parameq+0x18/0x70
 [] ? repair_env_string+0x12/0x51
 [] ? parse_args+0x260/0x3b0
 [] ? __usermodehelper_set_disable_depth+0x43/0x50
 [] kernel_init_freeable+0x19b/0x267
 [] ? set_debug_rodata+0xf/0xf
 [] ? trace_hardirqs_on+0xb/0x10
 [] ? _raw_spin_unlock_irq+0x32/0x50
 [] ? finish_task_switch+0xab/0x1f0
 [] ? finish_task_switch+0x69/0x1f0
 [] kernel_init+0x10/0x110
 [] ? schedule_tail+0x25/0x80
 [] ret_from_kernel_thread+0xe/0x24
 [] ? rest_init+0x130/0x130

This call trace occurred on a 32-bit kernel with CONFIG_PROVE_LOCKING
enabled.

It happens during the ixgbe driver's hardware probe: by the time
ixgbe_get_stats64() is called, the seqcount/seqlock has still not been
initialized. It was initialized in the TX/RX resource setup routines,
but that is too late, so lockdep emits this warning.

Fix this by moving the u64_stats_init() calls to the probe stage,
after the RX/TX ring structures are allocated and before the
seqcounts can first be read.

Signed-off-by: Liwei Song 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fee1f29..ab1d114 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5811,8 +5811,6 @@ int ixgbe_setup_tx_resources(struct ixgbe_ring *tx_ring)
if (!tx_ring->tx_buffer_info)
goto err;
 
-   u64_stats_init(_ring->syncp);
-
/* round up to nearest 4K */
tx_ring->size = tx_ring->count * sizeof(union ixgbe_adv_tx_desc);
tx_ring->size = ALIGN(tx_ring->size, 4096);
@@ -5895,8 +5893,6 @@ int ixgbe_setup_rx_resources(struct ixgbe_ring *rx_ring)
if (!rx_ring->rx_buffer_info)
goto err;
 
-   u64_stats_init(_ring->syncp);
-
/* Round up to nearest 4K */
rx_ring->size = rx_ring->count * sizeof(union ixgbe_adv_rx_desc);
rx_ring->size = ALIGN(rx_ring->size, 4096);
@@ -9686,6 +9682,12 @@ static int ixgbe_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if (err)
goto err_sw_init;
 
+   for (i = 0; i < adapter->num_rx_queues; i++)
+   u64_stats_init(>rx_ring[i]->syncp);
+
+   for (i = 0; i < adapter->num_tx_queues; i++)
+   u64_stats_init(>tx_ring[i]->syncp);
+
/* WOL not supported for all devices */
adapter->wol = 0;
hw->eeprom.ops.read(hw, 0x2c, >eeprom_cap);
-- 
2.7.4



Re: [PATCH] net: ping: check minimum size on ICMP header length

2016-12-04 Thread Lorenzo Colitti
On Sat, Dec 3, 2016 at 9:58 AM, Kees Cook  wrote:
> -   if (len > 0x)
> +   if (len > 0x || len < icmph_len)
> return -EMSGSIZE;

EMSGSIZE usually means the message is too long. Maybe use EINVAL?
That's what the code will return if the passed-in ICMP header is
invalid (e.g., is not an echo request).


Re: [PATCH V2 net 00/20] Increase ENA driver version to 1.1.2

2016-12-04 Thread Matt Wilson
On Sun, Dec 04, 2016 at 09:37:43PM -0500, David Miller wrote:
> 
> It is not appropriate to submit so many patches at one time.

Indeed, https://www.kernel.org/doc/Documentation/SubmittingPatches
recommends submitting no more than 15 or so at once.

> Please keep your patch series to no more than about a dozen
> at a time.

How about 15 from SubmittingPatches? The first 15 in the series are
all important bugfixes. Should Netanel resubmit a series with just the
bugfixes and a new cover letter? Or are you willing to consider the
first 15 of this series as posted?

> Also, group your changes logically and tie an appropriately
> descriptive cover letter.
> 
> "Increase driver version to X.Y.Z" tells the reader absolutely
> nothing.  Someone reading that Subject line in the GIT logs
> will have no idea what the overall purpose of the patch series
> is and what it accomplishes.

You're right, the cover letter subject needs to be better. There is
only one commit submitted with the subject "increase driver version to
1.1.2" - patch 20/20. It is logically similar to commits like:

commit b8b2372de9cc00d5ed667c7b8db29b6cfbf037f5
Author: Manish Chopra 
Date:   Wed Aug 3 04:02:04 2016 -0400

qlcnic: Update version to 5.3.65

Signed-off-by: Manish Chopra 
Signed-off-by: David S. Miller 

[...]

commit ae33256c55d2fefcad8712e750b846461994a1af
Author: Bimmy Pujari 
Date:   Mon Jun 20 09:10:39 2016 -0700

i40e/i40evf-bump version to 1.6.11

Signed-off-by: Bimmy Pujari 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 

[...]

commit 5264cc63ba10ebfa0e54e3e641cce2656c7a60e8
Author: Jacob Keller 
Date:   Tue Jun 7 16:09:02 2016 -0700

fm10k: bump version number

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 

[...]

commit a58a3e68037647de78e3461194239a1104f76003
Author: Michael Chan 
Date:   Fri Jul 1 18:46:20 2016 -0400

bnxt_en: Update firmware spec. to 1.3.0.

And update driver version to 1.3.0.

Signed-off-by: Michael Chan 
Signed-off-by: David S. Miller 

> You really need to describe the high level purpose of the patch set.
> Is it adding a new feature?  What is that feature?  Why are you
> adding that feature?  How is that feature implemented?  Why is
> it implemented that way?

The priority is to get bug fixes to the ENA driver in 4.9. Let's focus
on the first 15.

--msw


[PATCH net] net: ep93xx_eth: Do not crash unloading module

2016-12-04 Thread Florian Fainelli
When we unload the ep93xx_eth module, whether we have opened the
network interface or not, we will hit either a kernel paging request
error or a plain NULL pointer dereference, because:

- if ep93xx_open has been called, we have created a valid DMA mapping
  for ep->descs, when we call ep93xx_stop, we also call
  ep93xx_free_buffers, ep->descs now has a stale value

- if ep93xx_open has not been called, we have a NULL pointer for
  ep->descs, so performing any operation against that address just won't
  work

Fix this by adding a NULL pointer check on ep->descs, which indicates
whether ep93xx_free_buffers() has already torn down the descriptors
and freed the DMA cookie.

Fixes: 1d22e05df818 ("[PATCH] Cirrus Logic ep93xx ethernet driver")
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/cirrus/ep93xx_eth.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/cirrus/ep93xx_eth.c 
b/drivers/net/ethernet/cirrus/ep93xx_eth.c
index de9f7c97d916..9a161e981529 100644
--- a/drivers/net/ethernet/cirrus/ep93xx_eth.c
+++ b/drivers/net/ethernet/cirrus/ep93xx_eth.c
@@ -468,6 +468,9 @@ static void ep93xx_free_buffers(struct ep93xx_priv *ep)
struct device *dev = ep->dev->dev.parent;
int i;
 
+   if (!ep->descs)
+   return;
+
for (i = 0; i < RX_QUEUE_ENTRIES; i++) {
dma_addr_t d;
 
@@ -490,6 +493,7 @@ static void ep93xx_free_buffers(struct ep93xx_priv *ep)
 
dma_free_coherent(dev, sizeof(struct ep93xx_descs), ep->descs,
ep->descs_dma_addr);
+   ep->descs = NULL;
 }
 
 static int ep93xx_alloc_buffers(struct ep93xx_priv *ep)
-- 
2.9.3



Re: [Patch net-next] act_mirred: fix a typo in get_dev

2016-12-04 Thread Hadar Hen Zion

On 12/3/2016 8:36 PM, Cong Wang wrote:

Cc: Hadar Hen Zion 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
  net/sched/act_mirred.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index bb09ba3..2d9fa6e 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -321,7 +321,7 @@ static int tcf_mirred_device(const struct tc_action *a, 
struct net *net,
int ifindex = tcf_mirred_ifindex(a);
  
  	*mirred_dev = __dev_get_by_index(net, ifindex);

-   if (!mirred_dev)
+   if (!*mirred_dev)
return -EINVAL;
return 0;
  }

Thank you for this fix! good catch.
I know it's already applied.

Hadar



Re: [RFC PATCH 2/2] Documentation: devictree: Add macb mdio bindings

2016-12-04 Thread Harini Katakam
Hi Florian,

On Sun, Dec 4, 2016 at 4:10 AM, Florian Fainelli  wrote:
> Le 12/03/16 à 13:35, Rob Herring a écrit :
>> On Mon, Nov 28, 2016 at 03:19:27PM +0530, Harini Katakam wrote:
>>> +- reg: Address and length of the register set of MAC to be used
>>> +- clock-names: Tuple listing input clock names.
>>> +Required elements: 'pclk', 'hclk'
>>> +Optional elements: 'tx_clk'
>>> +- clocks: Phandles to input clocks.
>
> You are also missing mandatory properties:
>
> #address-cells = <1> and #size-cells = <0>
>
> Where is patch 1? Can you make sure you have the same recipient list for
> both patches in this series so we can review both the binding and driver?
>

Thanks for review, I'll update.

I did send the cover letter, patch 1 and 2 to the same recipient list.
I can see them on the mailing list. The first patch is called:
[RFC PATCH 1/2] net: macb: Add MDIO driver for accessing multiple PHY devices
I hope you can find it.

Regards,
Harini


Re: [RFC PATCH 2/2] Documentation: devictree: Add macb mdio bindings

2016-12-04 Thread Harini Katakam
Hi Rob,


Thanks for the review.
On Sun, Dec 4, 2016 at 3:05 AM, Rob Herring  wrote:
> On Mon, Nov 28, 2016 at 03:19:27PM +0530, Harini Katakam wrote:

>> +Required properties:
>> +- compatible: Should be "cdns,macb-mdio"
>
> Only one version ever? This needs more specific compatible strings.
>

This MDIO block is part of the Cadence MAC, in a way.
I can use the revision number from the Cadence spec I was working with,
but the binding is not necessarily specific to that version.

I'll take care of the other comments in the next version.

Regards,
Harini


Re: "af_unix: conditionally use freezable blocking calls in read" is wrong

2016-12-04 Thread David Miller
From: Al Viro 
Date: Sun, 4 Dec 2016 21:04:55 +

>   Could we please kill that kludge?  "af_unix: use freezable blocking
> calls in read" had been wrong to start with; having a method make assumptions
> of that sort ("nobody will call me while holding locks I hadn't thought of")
> is asking for serious trouble.  splice is just a place where lockdep has
> caught that - we *can't* assume that nobody will ever call kernel_recvmsg()
> while holding some locks.
> 
>   I've run into that converting AF_UNIX to generic_file_splice_read();
> I can kludge around that ("freezable unless ->msg_iter is ITER_PIPE"), but
> that only delays trouble.
> 
>   Note that the only other user of freezable_schedule_timeout() is
> a very different story - it's a kernel thread, which *does* have a guaranteed
> locking environment.  Making such assumptions in unix_stream_recvmsg(),
> OTOH, is insane...

We have to otherwise Android phones drain their batteries in 10
minutes.

I'm not going to revert this and be responsible for that.

So you have to find a way to make the freezable calls legitimate.


Re: [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc

2016-12-04 Thread Eric Dumazet
On Sat, Dec 3, 2016 at 5:37 PM, David Miller  wrote:

>
> Sorry, just noticed by visual inspection.  I expected the
> struct sock part to show up in the same patch as the one
> that removed it from tcp_sock and adjusted the users.
>
> I'll re-review this series, thanks.

Yes, I wanted to have, after patch 7, the final cache line disposition
of struct sock.
(That is quite critical for future bisections, if needed, or for the
performance tests I mentioned.)

I could have used a 'unsigned long _temp_padding', but just chose the
final name for the field.

Thanks.


[RFC] udp: some improvements on RX path.

2016-12-04 Thread Eric Dumazet
We currently access 3 cache lines from an skb in the receive queue while
holding the receive queue lock:

1st cache line (contains the ->next / ->prev pointers)
2nd cache line (skb->peeked)
3rd cache line (skb->truesize)

I believe we could get rid of skb->peeked completely.

I will cook a patch, but basically the idea is that the last owner of
an skb (right before skb->users becomes 0) can take 'ownership' and
thus update the stats.

The 3rd cache line miss is easily avoided by the following patch.

But I also want to work on the idea I gave few days back, having a
separate queue and use splice to transfer the 'softirq queue' into
a calm queue in a different cache line.

I expect a 50 % performance increase under load, maybe 1.5 Mpps.

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 16d88ba9ff1c..37d4e8da6482 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1191,7 +1191,13 @@ static void udp_rmem_release(struct sock *sk, int size, 
int partial)
 /* Note: called with sk_receive_queue.lock held */
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 {
-   udp_rmem_release(sk, skb->truesize, 1);
+   /* HACK HACK HACK :
+* Instead of using skb->truesize here, find a copy of it in skb->dev.
+* This avoids a cache line miss in this path,
+* while sk_receive_queue lock is held.
+* Look at __udp_enqueue_schedule_skb() to find where this copy is done.
+*/
+   udp_rmem_release(sk, (int)(unsigned long)skb->dev, 1);
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
@@ -1201,6 +1207,11 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct 
sk_buff *skb)
int rmem, delta, amt, err = -ENOMEM;
int size = skb->truesize;
 
+   /* help udp_skb_destructor() to get skb->truesize from skb->dev
+* without a cache line miss.
+*/
+   skb->dev = (struct net_device *)(unsigned long)size;
+
/* try to avoid the costly atomic add/sub pair when the receive
 * queue is full; always allow at least a packet
 */
@@ -1233,7 +1244,6 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct 
sk_buff *skb)
/* no need to setup a destructor, we will explicitly release the
 * forward allocated memory on dequeue
 */
-   skb->dev = NULL;
sock_skb_set_dropcount(sk, skb);
 
__skb_queue_tail(list, skb);


Re: [PATCH V2 net 00/20] Increase ENA driver version to 1.1.2

2016-12-04 Thread David Miller

It is not appropriate to submit so many patches at one time.

Please keep your patch series to no more than about a dozen
at a time.

Also, group your changes logically and tie an appropriately
descriptive cover letter.

"Increase driver version to X.Y.Z" tells the reader absolutely
nothing.  Someone reading that Subject line in the GIT logs
will have no idea what the overall purpose of the patch series
is and what it accomplishes.

You really need to describe the high level purpose of the patch set.
Is it adding a new feature?  What is that feature?  Why are you
adding that feature?  How is that feature implemented?  Why is
it implemented that way?


Re: [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc

2016-12-04 Thread David Miller
From: Eric Dumazet 
Date: Sat, 03 Dec 2016 17:13:51 -0800

> On Sat, 2016-12-03 at 19:16 -0500, David Miller wrote:
>> From: Eric Dumazet 
>> Date: Sat,  3 Dec 2016 11:14:57 -0800
>> 
>> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> > index d8be083ab0b0..fc5848dad7a4 100644
>> > --- a/include/linux/tcp.h
>> > +++ b/include/linux/tcp.h
>> > @@ -186,7 +186,6 @@ struct tcp_sock {
>> >u32 tsoffset;   /* timestamp offset */
>> >  
>> >struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
>> > -  unsigned long   tsq_flags;
>> >  
>> >/* Data for direct copy to user */
>> >struct {
>> 
>> Hmmm, did you forget to "git add include/net/sock.h" before making
>> this commit?
> 
> sk_tsq_flags was added in prior patch in the series ( 7/8 net:
> reorganize struct sock for better data locality)
> 
> What is the problem with this part ?

Sorry, just noticed by visual inspection.  I expected the
struct sock part to show up in the same patch as the one
that removed it from tcp_sock and adjusted the users.

I'll re-review this series, thanks.


Re: [PATCH v2 main-v4.9-rc7] net/ipv6: allow sysctl to change link-local address generation mode

2016-12-04 Thread Roopa Prabhu
On 12/4/16, 2:31 PM, Felix Jia wrote:
> Removed the rtnl lock and switched to an RCU read lock to iterate
> through the netdev list.
>
> The address generation mode for IPv6 link-local can only be configured
> by netlink messages. This patch adds the ability to change the address
> generation mode via sysctl.
>
> A possible improvement is to remove the addrgenmode variable from the
> idev structure and use the sysctl storage for the flag.
>
> The patch is based on v4.9-rc7 from mainline.
>
> Signed-off-by: Felix Jia 
> Cc: Carl Smith 
> ---
>  include/linux/ipv6.h  |  1 +
>  include/uapi/linux/ipv6.h |  1 +
>  net/ipv6/addrconf.c   | 73 
> ++-
>  3 files changed, 74 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> index a064997..0d9e5d4 100644
> --- a/include/linux/ipv6.h
> +++ b/include/linux/ipv6.h
> @@ -64,6 +64,7 @@ struct ipv6_devconf {
>   } stable_secret;
>   __s32   use_oif_addrs_only;
>   __s32   keep_addr_on_down;
> + __s32   addrgenmode;
>  
>   struct ctl_table_header *sysctl_header;
>  };
> diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
> index 8c27723..0524e2c 100644
> --- a/include/uapi/linux/ipv6.h
> +++ b/include/uapi/linux/ipv6.h
> @@ -178,6 +178,7 @@ enum {
>   DEVCONF_DROP_UNSOLICITED_NA,
>   DEVCONF_KEEP_ADDR_ON_DOWN,
>   DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
> + DEVCONF_ADDRGENMODE,
>   DEVCONF_MAX
>  };
>  
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 4bc5ba3..2b83cc7 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -238,6 +238,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
>   .use_oif_addrs_only = 0,
>   .ignore_routes_with_linkdown = 0,
>   .keep_addr_on_down  = 0,
> + .addrgenmode = IN6_ADDR_GEN_MODE_EUI64,
>  };
>  
>  static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
> @@ -284,6 +285,7 @@ static struct ipv6_devconf ipv6_devconf_dflt 
> __read_mostly = {
>   .use_oif_addrs_only = 0,
>   .ignore_routes_with_linkdown = 0,
>   .keep_addr_on_down  = 0,
> + .addrgenmode = IN6_ADDR_GEN_MODE_EUI64,
>  };
>  
>  /* Check if a valid qdisc is available */
> @@ -378,7 +380,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device 
> *dev)
>   if (ndev->cnf.stable_secret.initialized)
>   ndev->addr_gen_mode = IN6_ADDR_GEN_MODE_STABLE_PRIVACY;
>   else
> - ndev->addr_gen_mode = IN6_ADDR_GEN_MODE_EUI64;
> + ndev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
>  
>   ndev->cnf.mtu6 = dev->mtu;
>   ndev->nd_parms = neigh_parms_alloc(dev, _tbl);
> @@ -4950,6 +4952,7 @@ static inline void ipv6_store_devconf(struct 
> ipv6_devconf *cnf,
>   array[DEVCONF_DROP_UNICAST_IN_L2_MULTICAST] = 
> cnf->drop_unicast_in_l2_multicast;
>   array[DEVCONF_DROP_UNSOLICITED_NA] = cnf->drop_unsolicited_na;
>   array[DEVCONF_KEEP_ADDR_ON_DOWN] = cnf->keep_addr_on_down;
> + array[DEVCONF_ADDRGENMODE] = cnf->addrgenmode;
>  }
>  
>  static inline size_t inet6_ifla6_size(void)
> @@ -5496,6 +5499,67 @@ int addrconf_sysctl_mtu(struct ctl_table *ctl, int 
> write,
>   return proc_dointvec_minmax(, write, buffer, lenp, ppos);
>  }
>  
> +static void addrconf_addrgenmode_change(struct net *net)
> +{
> + struct net_device *dev;
> + struct inet6_dev *idev;
> +
> + rcu_read_lock();
> + for_each_netdev_rcu(net, dev) {
> + idev = __in6_dev_get(dev);
> + if (idev) {
> + idev->cnf.addrgenmode = ipv6_devconf_dflt.addrgenmode;
> + idev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
> + addrconf_dev_config(idev->dev);
> + }
> + }
> + rcu_read_unlock();
> +}
> +
> +static int addrconf_sysctl_addrgenmode(struct ctl_table *ctl, int write,
> + void __user 
> *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int ret;
> + int new_val;
> + struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
> + struct net *net = (struct net *)ctl->extra2;
> +
> + if (write) { /* sysctl write request */
> + ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
> + new_val = *((int *)ctl->data);
> +
>
unless I missed it, I don't see a check for valid values for new_val.
The netlink attribute  is checked  for valid values in the existing equivalent 
netlink code.


Re: [PATCH v3 net-next 3/3] openvswitch: Fix skb->protocol for vlan frames.

2016-12-04 Thread Pravin Shelar
On Fri, Dec 2, 2016 at 1:42 AM, Jiri Benc  wrote:
> On Thu, 1 Dec 2016 12:31:09 -0800, Pravin Shelar wrote:
>> On Wed, Nov 30, 2016 at 6:30 AM, Jiri Benc  wrote:
>> > I'm not opposed to changing this but I'm afraid it needs much deeper
>> > review. Because with this in place, no core kernel functions that
>> > depend on skb->protocol may be called from within openvswitch.
>> >
>> Can you give specific example where it does not work?
>
> I can't, I haven't reviewed the usage. I'm just saying that the stack
> does not expect skb->protocol being ETH_P_8021Q for e.g. IPv4 packets.
> It may not be relevant for the calls used by openvswitch but we should
> be sure about that. Especially defragmentation and conntrack is worth
> looking at.
>
> Again, I'm not saying this is wrong nor that there is an actual
> problem. I'm just pointing out that openvswitch has different
> expectations about skb wrt. vlans than the rest of the kernel and we
> should be reasonably sure the behavior is correct when passing between
> the two.
>
I agree that conntrack does not expect skb->protocol to be a vlan
protocol. We could accelerate the vlan tag if there is a vlan header in
the packet itself. That would make the packet consistent across upcalls.

>> skb-protocol value is set by the caller, so it should not be
>> arbitrary. is it missing in any case?
>
> It's not set exactly by the caller, because that's what this patch is
> removing. It is set by whoever handed over the packet to openvswitch.
> The point is we don't know *what* it is set to. It may as well be
> ETH_P_8021Q, breaking the conditions here. It should not happen in
> practice but still, it seems weird to depend on the fact that the
> packet coming to ovs has never skb->protocol equal to ETH_P_8021Q nor
> ETH_P_8021AD.
>

We are somewhat dependent on this, at least for L3 packets injected
back by vswitchd. For the rest of the entry points, I think we have to
trust that the networking stack sets skb->protocol to the correct
value. If that is not true in some case, it is a bug and we will need
to fix it.


Re: [PATCH v2 net-next 3/4] mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active

2016-12-04 Thread Saeed Mahameed
On Sun, Dec 4, 2016 at 5:17 AM, Martin KaFai Lau  wrote:
> Reserve XDP_PACKET_HEADROOM and honor bpf_xdp_adjust_head()
> when XDP prog is active.  This patch only affects the code
> path when XDP is active.
>
> Signed-off-by: Martin KaFai Lau 
> ---

Hi Martin, Sorry for the late review, i have some comments below

>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 17 +++--
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 23 +--
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c |  9 +
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  3 ++-
>  4 files changed, 39 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 311c14153b8b..094a13b52cf6 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -51,7 +51,8 @@
>  #include "mlx4_en.h"
>  #include "en_port.h"
>
> -#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN)))
> +#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) - \
> +  XDP_PACKET_HEADROOM))
>
>  int mlx4_en_setup_tc(struct net_device *dev, u8 up)
>  {
> @@ -1551,6 +1552,7 @@ int mlx4_en_start_port(struct net_device *dev)
> struct mlx4_en_tx_ring *tx_ring;
> int rx_index = 0;
> int err = 0;
> +   int mtu;
> int i, t;
> int j;
> u8 mc_list[16] = {0};
> @@ -1684,8 +1686,12 @@ int mlx4_en_start_port(struct net_device *dev)
> }
>
> /* Configure port */
> +   mtu = priv->rx_skb_size + ETH_FCS_LEN;
> +   if (priv->tx_ring_num[TX_XDP])
> +   mtu += XDP_PACKET_HEADROOM;
> +

Why would the physical MTU account for the headroom you reserve for the
XDP prog? This is the wire MTU; it shouldn't be changed. Please keep it
as before: any headroom you reserve in packet buffers is needed only
for the FWD or modify cases (the HW or wire should not care about it).

> err = mlx4_SET_PORT_general(mdev->dev, priv->port,
> -   priv->rx_skb_size + ETH_FCS_LEN,
> +   mtu,
> priv->prof->tx_pause,
> priv->prof->tx_ppp,
> priv->prof->rx_pause,
> @@ -2255,6 +2261,13 @@ static bool mlx4_en_check_xdp_mtu(struct net_device 
> *dev, int mtu)
>  {
> struct mlx4_en_priv *priv = netdev_priv(dev);
>
> +   if (mtu + XDP_PACKET_HEADROOM > priv->max_mtu) {
> +   en_err(priv,
> +  "Device max mtu:%d does not allow %d bytes reserved 
> headroom for XDP prog\n",
> +  priv->max_mtu, XDP_PACKET_HEADROOM);
> +   return false;
> +   }
> +
> if (mtu > MLX4_EN_MAX_XDP_MTU) {
> en_err(priv, "mtu:%d > max:%d when XDP prog is attached\n",
>mtu, MLX4_EN_MAX_XDP_MTU);
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 23e9d04d1ef4..324771ac929e 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -96,7 +96,6 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
> struct mlx4_en_rx_alloc page_alloc[MLX4_EN_MAX_RX_FRAGS];
> const struct mlx4_en_frag_info *frag_info;
> struct page *page;
> -   dma_addr_t dma;
> int i;
>
> for (i = 0; i < priv->num_frags; i++) {
> @@ -115,9 +114,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>
> for (i = 0; i < priv->num_frags; i++) {
> frags[i] = ring_alloc[i];
> -   dma = ring_alloc[i].dma + ring_alloc[i].page_offset;
> +   frags[i].page_offset += priv->frag_info[i].rx_headroom;

I don't see any need for headroom on frag_info other than frag0 (which
is where the packet starts).
What is the meaning of a headroom for a frag in the middle of a packet?

If you agree with me, then you can use XDP_PACKET_HEADROOM as-is where
needed (i.e. frag0 page offset) and remove
"priv->frag_info[i].rx_headroom".

...

After going through the code a little bit, I see that this code is
shared between XDP and the common path, and you didn't want to add
boolean conditions.

OK, I see what you did here.

Maybe we can pass headroom as a function parameter and split frag0
handling from the rest? If it is too much, then I am OK with the code
as it is.

> +   rx_desc->data[i].addr = cpu_to_be64(frags[i].dma +
> +   frags[i].page_offset);
> ring_alloc[i] = page_alloc[i];
> -   rx_desc->data[i].addr = cpu_to_be64(dma);
> }
>
> return 0;
> @@ -250,7 +250,8 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv 
> *priv,
>
>  

Re: [PATCH net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs

2016-12-04 Thread Eric Dumazet
On Mon, 2016-12-05 at 01:31 +0200, Saeed Mahameed wrote:

> Alexei, we should start considering PPC archs for XDP use cases;
> demanding a page per packet on those archs is a bit of a heavy
> requirement

Well, 'little' is an understatement ;)

Note that PPC had serious problems before commit bd68a2a854ad5a85f0
("net: set SK_MEM_QUANTUM to 4096")

So I suspect one page per frame will likely be a huge problem
for hosts dealing with 10^5 or more TCP sockets.

Either skb->truesize is set to 64KB and TCP window must be really tiny,
or skb->truesize is set to ~2KB and OOM is waiting to happen.





Re: [PATCH v1 net-next 1/5] net: dsa: mv88e6xxx: Reserved Management frames to CPU

2016-12-04 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

>> You can have several implementations in the same file (e.g. global1.c),
>> so again the only value is the function name, not the struct member.
>
> The structure member having the g1_ prefix has a lot of value.
>
> if (chip->info->ops->set_cpu_port) {
> err = chip->info->ops->set_cpu_port(chip, upstream_port);
> if (err)
> return err;
> }
>
> Where do I need to go look for set_cpu_port? I have no idea.

In your chip's ops definition, as for any ops structure. Same as for
your example right below, which is unfortunately not a solution per se.

>
> if (chip->info->ops->g1_set_cpu_port) {
> err = chip->info->ops->g1_set_cpu_port(chip, upstream_port);
> if (err)
> return err;
> }
>
> Humm, the hint tells me it is in global1.c. And I also know that all
> of them are in global1.c.

Until a new chip relocates a feature somewhere else.

Then you'll have to rename the structure member(s) because you have a
policy saying "no prefix means different set of registers".


   Vivien


Re: [PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.

2016-12-04 Thread Pravin Shelar
On Fri, Dec 2, 2016 at 1:25 AM, Jiri Benc  wrote:
> On Thu, 1 Dec 2016 11:50:00 -0800, Pravin Shelar wrote:
>> This is not changing any behavior compared to current OVS vlan checks.
>> Single vlan header is not considered for MTU check.
>
> It is changing it.
>
> Consider the case when there's an interface with MTU 1500 forwarding to
> an interface with MTU 1496. Obviously, full-sized vlan frames
> ingressing on the first interface are not forwardable to the second
> one. Yet, if the vlan tag is accelerated (and thus not counted in
> skb->len), is_skb_forwardable happily returns true because of the check
>
> len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
> if (skb->len <= len)
>
OK, this case would be allowed due to this patch. But the core Linux
stack and bridge are using this check, so why not just use the same
forwarding check in OVS too? That makes it consistent with core
networking forwarding expectations.


Re: [PATCH net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs

2016-12-04 Thread Saeed Mahameed
On Sat, Dec 3, 2016 at 2:53 AM, Alexei Starovoitov  wrote:
> On 12/2/16 4:38 PM, Eric Dumazet wrote:
>>
>> On Fri, 2016-12-02 at 15:23 -0800, Martin KaFai Lau wrote:
>>>
>>> When XDP prog is attached, it is currently limiting
>>> MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN) which is 1514
>>> in x86.
>>>
>>> AFAICT, since mlx4 is doing one page per packet for XDP,
>>> we can at least raise the MTU limitation up to
>>> PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this patch is
>>> doing.  It will be useful in the next patch which allows
>>> XDP program to extend the packet by adding new header(s).
>>>
>>> Signed-off-by: Martin KaFai Lau 
>>> ---
>>
>>
>> Have you tested your patch on a host with PAGE_SIZE = 64 KB ?
>>
>> Looks XDP really kills arches with bigger pages :(
>
>
> I'm afraid xdp mlx[45] support was not tested on arches
> with 64k pages at all. Not just this patch.

Yep, in mlx5 page-per-packet became the default, with or without XDP,
unlike mlx4. Currently we allow a full 64KB page per packet, which is
wrong and needs to be fixed.

I will get to this task soon.

> I think people who care about such archs should test?

We do test mlx5 and mlx4 on the PPC arch. Other than requiring more
memory than we need, we don't see any issues. And we don't test XDP on
those archs.

> Note page per packet is not a hard requirement for all drivers
> and all archs. For mlx[45] it was the easiest and the most
> convenient way to achieve desired performance.
> If there are ways to do the same performance differently,
> I'm all ears :)
>

With bigger pages, i.e. PAGE_SIZE > 8K, my current low-hanging-fruit
options for mlx5 are:
1. start sharing pages for multiple packets.
2. go back to the SKB allocator (allocate a ring of SKBs in advance
rather than a page per packet).

This means that the default RX memory scheme will be different from
XDP's on such archs (XDP will still use page per packet).

Alexei, we should start considering PPC archs for XDP use cases;
demanding a page per packet on those archs is a bit of a heavy
requirement


Re: [PATCH] mlx4: Use kernel sizeof and alloc styles

2016-12-04 Thread Joe Perches
On Sun, 2016-12-04 at 12:58 -0800, Eric Dumazet wrote:
> On Sun, 2016-12-04 at 12:11 -0800, Joe Perches wrote:
> > Convert sizeof foo to sizeof(foo) and allocations with multiplications
> > to the appropriate kcalloc/kmalloc_array styles.
> > 
> > Signed-off-by: Joe Perches 
> > ---
> 
> Gah.
> 
This is one of the hottest NIC drivers on Linux at this moment,
with XDP and other efforts going on.
> 
Some kmalloc() calls are becoming kmalloc_node() in some dev branches,
and there is no kmalloc_array_node() yet.

Well, that kmalloc_array_node is, like this patch, pretty trivial to add.
Something like the attached for kmalloc_array_node and kcalloc_node.

> This kind of patch is making rebases/backports very painful.

That's really not an issue for me.

> Could we wait ~6 months before doing such cleanup/changes please ?

This is certainly a trivial patch that could be
done at almost any time.

> If you believe a bug needs a fix, please send a patch to address it.
> 
> Thanks.

No worries.

 include/linux/slab.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12bad198..d98c07713c03 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -647,6 +647,37 @@ static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
 	return kmalloc_node(size, flags | __GFP_ZERO, node);
 }
 
+/**
+ * kmalloc_array_node - allocate memory for an array
+ * from a particular memory node.
+ * @n: number of elements.
+ * @size: element size.
+ * @flags: the type of memory to allocate (see kmalloc).
+ * @node: memory node from which to allocate
+ */
+static inline void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
+   int node)
+{
+	if (size != 0 && n > SIZE_MAX / size)
+		return NULL;
+	if (__builtin_constant_p(n) && __builtin_constant_p(size))
+		return kmalloc_node(n * size, flags, node);
+	return __kmalloc_node(n * size, flags, node);
+}
+
+/**
+ * kcalloc_node - allocate memory for an array from a particular memory node.
+ * The memory is set to zero.
+ * @n: number of elements.
+ * @size: element size.
+ * @flags: the type of memory to allocate (see kmalloc).
+ * @node: memory node from which to allocate
+ */
+static inline void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
+{
+	return kmalloc_array_node(n, size, flags | __GFP_ZERO, node);
+}
+
 unsigned int kmem_cache_size(struct kmem_cache *s);
 void __init kmem_cache_init_late(void);
 


Re: [PATCH v1 net-next 1/5] net: dsa: mv88e6xxx: Reserved Management frames to CPU

2016-12-04 Thread Andrew Lunn
> You can have several implementations in the same file (e.g. global1.c),
> so again the only value is the function name, not the struct member.

The structure member having the g1_ prefix has a lot of value.

if (chip->info->ops->set_cpu_port) {
err = chip->info->ops->set_cpu_port(chip, upstream_port);
if (err)
return err;
}

Where do I need to go look for set_cpu_port? I have no idea.

if (chip->info->ops->g1_set_cpu_port) {
err = chip->info->ops->g1_set_cpu_port(chip, upstream_port);
if (err)
return err;
}

Humm, the hint tells me it is in global1.c. And I also know that all
of them are in global1.c.

These ops do make the code simpler. But the downside is that it is
harder to find the actual code, now that it is spread over multiple
files. And these hints help negate the downside a little.

   Andrew


[PATCH] net: calxeda: xgmac: use new api ethtool_{get|set}_link_ksettings

2016-12-04 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to the new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/calxeda/xgmac.c |   17 -
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/calxeda/xgmac.c 
b/drivers/net/ethernet/calxeda/xgmac.c
index 6e72366..ce7de6f 100644
--- a/drivers/net/ethernet/calxeda/xgmac.c
+++ b/drivers/net/ethernet/calxeda/xgmac.c
@@ -1530,15 +1530,14 @@ static int xgmac_set_features(struct net_device *dev, 
netdev_features_t features
.ndo_set_features = xgmac_set_features,
 };
 
-static int xgmac_ethtool_getsettings(struct net_device *dev,
- struct ethtool_cmd *cmd)
+static int xgmac_ethtool_get_link_ksettings(struct net_device *dev,
+   struct ethtool_link_ksettings *cmd)
 {
-   cmd->autoneg = 0;
-   cmd->duplex = DUPLEX_FULL;
-   ethtool_cmd_speed_set(cmd, 1);
-   cmd->supported = 0;
-   cmd->advertising = 0;
-   cmd->transceiver = XCVR_INTERNAL;
+   cmd->base.autoneg = 0;
+   cmd->base.duplex = DUPLEX_FULL;
+   cmd->base.speed = 1;
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported, 0);
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising, 0);
return 0;
 }
 
@@ -1681,7 +1680,6 @@ static int xgmac_set_wol(struct net_device *dev,
 }
 
 static const struct ethtool_ops xgmac_ethtool_ops = {
-   .get_settings = xgmac_ethtool_getsettings,
.get_link = ethtool_op_get_link,
.get_pauseparam = xgmac_get_pauseparam,
.set_pauseparam = xgmac_set_pauseparam,
@@ -1690,6 +1688,7 @@ static int xgmac_set_wol(struct net_device *dev,
.get_wol = xgmac_get_wol,
.set_wol = xgmac_set_wol,
.get_sset_count = xgmac_get_sset_count,
+   .get_link_ksettings = xgmac_ethtool_get_link_ksettings,
 };
 
 /**
-- 
1.7.4.4



[PATCH v2 main-v4.9-rc7] net/ipv6: allow sysctl to change link-local address generation mode

2016-12-04 Thread Felix Jia
Removed the rtnl lock and switched to using the RCU lock to iterate
through the netdev list.

The address generation mode for IPv6 link-local can only be configured
by netlink messages. This patch adds the ability to change the address
generation mode via sysctl.

A possible improvement is to remove the addrgenmode variable from the
idev structure and use the sysctl storage for the flag.
The patch is based on v4.9-rc7 from mainline.

Signed-off-by: Felix Jia 
Cc: Carl Smith 
---
 include/linux/ipv6.h  |  1 +
 include/uapi/linux/ipv6.h |  1 +
 net/ipv6/addrconf.c   | 73 ++-
 3 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index a064997..0d9e5d4 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -64,6 +64,7 @@ struct ipv6_devconf {
} stable_secret;
__s32   use_oif_addrs_only;
__s32   keep_addr_on_down;
+   __s32   addrgenmode;
 
struct ctl_table_header *sysctl_header;
 };
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 8c27723..0524e2c 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -178,6 +178,7 @@ enum {
DEVCONF_DROP_UNSOLICITED_NA,
DEVCONF_KEEP_ADDR_ON_DOWN,
DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
+   DEVCONF_ADDRGENMODE,
DEVCONF_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4bc5ba3..2b83cc7 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -238,6 +238,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
.use_oif_addrs_only = 0,
.ignore_routes_with_linkdown = 0,
.keep_addr_on_down  = 0,
+   .addrgenmode = IN6_ADDR_GEN_MODE_EUI64,
 };
 
 static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
@@ -284,6 +285,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly 
= {
.use_oif_addrs_only = 0,
.ignore_routes_with_linkdown = 0,
.keep_addr_on_down  = 0,
+   .addrgenmode = IN6_ADDR_GEN_MODE_EUI64,
 };
 
 /* Check if a valid qdisc is available */
@@ -378,7 +380,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device 
*dev)
if (ndev->cnf.stable_secret.initialized)
ndev->addr_gen_mode = IN6_ADDR_GEN_MODE_STABLE_PRIVACY;
else
-   ndev->addr_gen_mode = IN6_ADDR_GEN_MODE_EUI64;
+   ndev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
 
ndev->cnf.mtu6 = dev->mtu;
ndev->nd_parms = neigh_parms_alloc(dev, &nd_tbl);
@@ -4950,6 +4952,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf 
*cnf,
array[DEVCONF_DROP_UNICAST_IN_L2_MULTICAST] = 
cnf->drop_unicast_in_l2_multicast;
array[DEVCONF_DROP_UNSOLICITED_NA] = cnf->drop_unsolicited_na;
array[DEVCONF_KEEP_ADDR_ON_DOWN] = cnf->keep_addr_on_down;
+   array[DEVCONF_ADDRGENMODE] = cnf->addrgenmode;
 }
 
 static inline size_t inet6_ifla6_size(void)
@@ -5496,6 +5499,67 @@ int addrconf_sysctl_mtu(struct ctl_table *ctl, int write,
return proc_dointvec_minmax(&lctl, write, buffer, lenp, ppos);
 }
 
+static void addrconf_addrgenmode_change(struct net *net)
+{
+   struct net_device *dev;
+   struct inet6_dev *idev;
+
+   rcu_read_lock();
+   for_each_netdev_rcu(net, dev) {
+   idev = __in6_dev_get(dev);
+   if (idev) {
+   idev->cnf.addrgenmode = ipv6_devconf_dflt.addrgenmode;
+   idev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
+   addrconf_dev_config(idev->dev);
+   }
+   }
+   rcu_read_unlock();
+}
+
+static int addrconf_sysctl_addrgenmode(struct ctl_table *ctl, int write,
+   void __user 
*buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret;
+   int new_val;
+   struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
+   struct net *net = (struct net *)ctl->extra2;
+
+   if (write) { /* sysctl write request */
+   ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
+   new_val = *((int *)ctl->data);
+
+   /* request for the all */
+   if (&net->ipv6.devconf_all->addrgenmode == ctl->data) {
+   ipv6_devconf_dflt.addrgenmode = new_val;
+   addrconf_addrgenmode_change(net);
+
+   /* request for default */
+   } else if (&net->ipv6.devconf_dflt->addrgenmode == ctl->data) {
+   ipv6_devconf_dflt.addrgenmode = new_val;
+
+   /* request for individual inet device */
+   } else {
+   if (!idev) {
+   return ret;
+   }
+   if (idev->addr_gen_mode != new_val) {
+   

[PATCH net-next 0/3] Minor BPF cleanups and digest

2016-12-04 Thread Daniel Borkmann
First two patches are minor cleanups, and the third one adds
a prog digest. For details, please see individual patches.
After this one, I have a set with tracepoint support that makes
use of this facility as well.

Thanks!

Daniel Borkmann (3):
  bpf: remove type arg from __is_valid_{,xdp_}access
  bpf, cls: consolidate prog deletion path
  bpf: add prog_digest and expose it via fdinfo/netlink

 include/linux/bpf.h|  1 +
 include/linux/filter.h |  7 +++-
 include/uapi/linux/pkt_cls.h   |  1 +
 include/uapi/linux/tc_act/tc_bpf.h |  1 +
 kernel/bpf/core.c  | 65 ++
 kernel/bpf/syscall.c   | 24 +-
 kernel/bpf/verifier.c  |  2 ++
 net/core/filter.c  | 15 -
 net/sched/act_bpf.c|  9 ++
 net/sched/cls_bpf.c| 38 --
 10 files changed, 135 insertions(+), 28 deletions(-)

-- 
1.9.3



[PATCH net-next 1/3] bpf: remove type arg from __is_valid_{,xdp_}access

2016-12-04 Thread Daniel Borkmann
Commit d691f9e8d440 ("bpf: allow programs to write to certain skb
fields") pushed access type check outside of __is_valid_access()
to have different restrictions for socket filters and tc programs.
type is thus not used anymore within __is_valid_access() and should
be removed as a function argument. Same for __is_valid_xdp_access()
introduced by 6a773a15a1e8 ("bpf: add XDP prog type for early driver
filter").

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 net/core/filter.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 56b4358..b751202 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2748,7 +2748,7 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const 
void *src_buff,
}
 }
 
-static bool __is_valid_access(int off, int size, enum bpf_access_type type)
+static bool __is_valid_access(int off, int size)
 {
if (off < 0 || off >= sizeof(struct __sk_buff))
return false;
@@ -2782,7 +2782,7 @@ static bool sk_filter_is_valid_access(int off, int size,
}
}
 
-   return __is_valid_access(off, size, type);
+   return __is_valid_access(off, size);
 }
 
 static bool lwt_is_valid_access(int off, int size,
@@ -2815,7 +2815,7 @@ static bool lwt_is_valid_access(int off, int size,
break;
}
 
-   return __is_valid_access(off, size, type);
+   return __is_valid_access(off, size);
 }
 
 static bool sock_filter_is_valid_access(int off, int size,
@@ -2833,11 +2833,9 @@ static bool sock_filter_is_valid_access(int off, int 
size,
 
if (off < 0 || off + size > sizeof(struct bpf_sock))
return false;
-
/* The verifier guarantees that size > 0. */
if (off % size != 0)
return false;
-
if (size != sizeof(__u32))
return false;
 
@@ -2910,11 +2908,10 @@ static bool tc_cls_act_is_valid_access(int off, int 
size,
break;
}
 
-   return __is_valid_access(off, size, type);
+   return __is_valid_access(off, size);
 }
 
-static bool __is_valid_xdp_access(int off, int size,
- enum bpf_access_type type)
+static bool __is_valid_xdp_access(int off, int size)
 {
if (off < 0 || off >= sizeof(struct xdp_md))
return false;
@@ -2942,7 +2939,7 @@ static bool xdp_is_valid_access(int off, int size,
break;
}
 
-   return __is_valid_xdp_access(off, size, type);
+   return __is_valid_xdp_access(off, size);
 }
 
 void bpf_warn_invalid_xdp_action(u32 act)
-- 
1.9.3



[PATCH net-next 3/3] bpf: add prog_digest and expose it via fdinfo/netlink

2016-12-04 Thread Daniel Borkmann
When loading a BPF program via bpf(2), calculate the digest over
the program's instruction stream and store it in struct bpf_prog's
digest member. This is done at a point in time before any instructions
are rewritten by the verifier. Any unstable map file descriptor
number part of the imm field will be zeroed for the hash.

fdinfo example output for progs:

  # cat /proc/1590/fdinfo/5
  pos:  0
  flags:0202
  mnt_id:   11
  prog_type:1
  prog_jited:   1
  prog_digest:  b27e8b06da22707513aa97363dfb11c7c3675d28
  memlock:  4096

When programs are pinned and retrieved by an ELF loader, the loader
can check the program's digest through fdinfo and compare it against
one that was generated over the ELF file's program section to see
if the program needs to be reloaded. Furthermore, this can also be
exposed through other means such as netlink in case of a tc cls/act
dump (or xdp in future), but also through tracepoints or other
facilities to identify the program. Other than that, the digest can
also serve as a base name for the work in progress kallsyms support
of programs. The digest doesn't depend/select the crypto layer, since
we need to keep dependencies to a minimum. iproute2 will get support
for this facility.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h|  1 +
 include/linux/filter.h |  7 +++-
 include/uapi/linux/pkt_cls.h   |  1 +
 include/uapi/linux/tc_act/tc_bpf.h |  1 +
 kernel/bpf/core.c  | 65 ++
 kernel/bpf/syscall.c   | 24 +-
 kernel/bpf/verifier.c  |  2 ++
 net/sched/act_bpf.c|  9 ++
 net/sched/cls_bpf.c|  8 +
 9 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 69d0a7f..8796ff0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -216,6 +216,7 @@ struct bpf_event_entry {
 u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog 
*fp);
+void bpf_prog_calc_digest(struct bpf_prog *fp);
 
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
 
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9733813..f078d2b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -56,6 +57,9 @@
 /* BPF program can access up to 512 bytes of stack space. */
 #define MAX_BPF_STACK  512
 
+/* Maximum BPF program size in bytes. */
+#define MAX_BPF_SIZE   (BPF_MAXINSNS * sizeof(struct bpf_insn))
+
 /* Helper macros for filter block array initializers. */
 
 /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
@@ -404,8 +408,9 @@ struct bpf_prog {
cb_access:1,/* Is control block accessed? */
dst_needed:1;   /* Do we need dst entry? */
kmemcheck_bitfield_end(meta);
-   u32 len;/* Number of filter blocks */
enum bpf_prog_type  type;   /* Type of BPF program */
+   u32 len;/* Number of filter blocks */
+   u32 digest[SHA_DIGEST_WORDS]; /* Program digest */
struct bpf_prog_aux *aux;   /* Auxiliary fields */
struct sock_fprog_kern  *orig_prog; /* Original BPF program */
unsigned int(*bpf_func)(const void *ctx,
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 86786d4..1adc0b6 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -397,6 +397,7 @@ enum {
TCA_BPF_NAME,
TCA_BPF_FLAGS,
TCA_BPF_FLAGS_GEN,
+   TCA_BPF_DIGEST,
__TCA_BPF_MAX,
 };
 
diff --git a/include/uapi/linux/tc_act/tc_bpf.h 
b/include/uapi/linux/tc_act/tc_bpf.h
index 063d9d4..a6b88a6 100644
--- a/include/uapi/linux/tc_act/tc_bpf.h
+++ b/include/uapi/linux/tc_act/tc_bpf.h
@@ -27,6 +27,7 @@ enum {
TCA_ACT_BPF_FD,
TCA_ACT_BPF_NAME,
TCA_ACT_BPF_PAD,
+   TCA_ACT_BPF_DIGEST,
__TCA_ACT_BPF_MAX,
 };
 #define TCA_ACT_BPF_MAX (__TCA_ACT_BPF_MAX - 1)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 82a0414..bdcc9f4 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -136,6 +136,71 @@ void __bpf_prog_free(struct bpf_prog *fp)
vfree(fp);
 }
 
+#define SHA_BPF_RAW_SIZE   \
+   round_up(MAX_BPF_SIZE + sizeof(__be64) + 1, SHA_MESSAGE_BYTES)
+
+/* Called under verifier mutex. */
+void bpf_prog_calc_digest(struct bpf_prog *fp)
+{
+   const u32 bits_offset = SHA_MESSAGE_BYTES - sizeof(__be64);
+   static u32 ws[SHA_WORKSPACE_WORDS];
+   static u8 raw[SHA_BPF_RAW_SIZE];
+   struct bpf_insn 

[PATCH net-next 2/3] bpf, cls: consolidate prog deletion path

2016-12-04 Thread Daniel Borkmann
Commit 18cdb37ebf4c ("net: sched: do not use tcf_proto 'tp' argument from
call_rcu") removed the last usage of tp from cls_bpf_delete_prog(), so also
remove it from the function as argument to not give a wrong impression. tp
is illegal to access from this callback, since it could already have been
freed.

Refactor the deletion code a bit, so that cls_bpf_destroy() can call into
the same code for prog deletion as cls_bpf_delete() op, instead of having
it unnecessarily duplicated.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 net/sched/cls_bpf.c | 30 +-
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index c37aa8b..f70e03d 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -241,7 +241,7 @@ static int cls_bpf_init(struct tcf_proto *tp)
return 0;
 }
 
-static void cls_bpf_delete_prog(struct tcf_proto *tp, struct cls_bpf_prog 
*prog)
+static void __cls_bpf_delete_prog(struct cls_bpf_prog *prog)
 {
tcf_exts_destroy(&prog->exts);
 
@@ -255,22 +255,22 @@ static void cls_bpf_delete_prog(struct tcf_proto *tp, 
struct cls_bpf_prog *prog)
kfree(prog);
 }
 
-static void __cls_bpf_delete_prog(struct rcu_head *rcu)
+static void cls_bpf_delete_prog_rcu(struct rcu_head *rcu)
 {
-   struct cls_bpf_prog *prog = container_of(rcu, struct cls_bpf_prog, rcu);
-
-   cls_bpf_delete_prog(prog->tp, prog);
+   __cls_bpf_delete_prog(container_of(rcu, struct cls_bpf_prog, rcu));
 }
 
-static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg)
+static void __cls_bpf_delete(struct tcf_proto *tp, struct cls_bpf_prog *prog)
 {
-   struct cls_bpf_prog *prog = (struct cls_bpf_prog *) arg;
-
cls_bpf_stop_offload(tp, prog);
list_del_rcu(&prog->link);
tcf_unbind_filter(tp, &prog->res);
-   call_rcu(&prog->rcu, __cls_bpf_delete_prog);
+   call_rcu(&prog->rcu, cls_bpf_delete_prog_rcu);
+}
 
+static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg)
+{
+   __cls_bpf_delete(tp, (struct cls_bpf_prog *) arg);
return 0;
 }
 
@@ -282,12 +282,8 @@ static bool cls_bpf_destroy(struct tcf_proto *tp, bool 
force)
if (!force && !list_empty(&head->plist))
return false;
 
-   list_for_each_entry_safe(prog, tmp, &head->plist, link) {
-   cls_bpf_stop_offload(tp, prog);
-   list_del_rcu(&prog->link);
-   tcf_unbind_filter(tp, &prog->res);
-   call_rcu(&prog->rcu, __cls_bpf_delete_prog);
-   }
+   list_for_each_entry_safe(prog, tmp, &head->plist, link)
+   __cls_bpf_delete(tp, prog);
 
kfree_rcu(head, rcu);
return true;
@@ -511,14 +507,14 @@ static int cls_bpf_change(struct net *net, struct sk_buff 
*in_skb,
 
ret = cls_bpf_offload(tp, prog, oldprog);
if (ret) {
-   cls_bpf_delete_prog(tp, prog);
+   __cls_bpf_delete_prog(prog);
return ret;
}
 
if (oldprog) {
list_replace_rcu(&oldprog->link, &prog->link);
tcf_unbind_filter(tp, &oldprog->res);
-   call_rcu(&oldprog->rcu, __cls_bpf_delete_prog);
+   call_rcu(&oldprog->rcu, cls_bpf_delete_prog_rcu);
} else {
list_add_rcu(>link, >plist);
}
-- 
1.9.3



Re: [PATCH v1 net-next 1/5] net: dsa: mv88e6xxx: Reserved Management frames to CPU

2016-12-04 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

>> The mv88e6xxx_ops actually implements the *features*. They can be
>> prefixed for clarity (e.g. .ppu_*, port_*, .atu_*, etc.). They don't
>> describe the register layout.
>> 
>> But we can discuss two ways of seeing this structure implementation:
>
> or
>
> 3) We have a prefix for us humans to help us find the code. Now we
> have ops, I cannot simply do M-. and have Emacs take me to the
> implementation. I have to search for it a bit. Having the hint g1_
> tells me to go look in global1.c. Having the hint g2_ tells me to go
> look in global2.c. Having the port_ tells me to go look in port.c.
> Having no prefix tells me the code is scattered around and grep is my
> friend.

Just to be clear:

I totally agree for an implementation (e.g. mv88e6095_g1_set_cpu_port);
that's why I've been doing it since I started splitting the code into
device-specific files. But I disagree for an mv88e6xxx_ops member.

You can have several implementations in the same file (e.g. global1.c),
so again the only value is the function name, not the struct member.


Thanks,

 Vivien


Re: [PATCH net-next 2/3] net/act_pedit: Support using offset relative to the conventional network headers

2016-12-04 Thread Or Gerlitz
On Fri, Dec 2, 2016 at 12:40 PM, Amir Vadai  wrote:
> On Thu, Dec 01, 2016 at 02:41:14PM -0500, David Miller wrote:
>> From: Amir Vadai 
>> Date: Wed, 30 Nov 2016 11:09:27 +0200

>> > +static int pedit_skb_hdr_offset(struct sk_buff *skb,
>> > +   enum pedit_header_type htype, int *hoffset)
>> > +{
>> > +   int ret = -1;
>> > +
>> > +   switch (htype) {
>> > +   case PEDIT_HDR_TYPE_ETH:
>> > +   if (skb_mac_header_was_set(skb)) {
>> > +   *hoffset = skb_mac_offset(skb);
>> > +   ret = 0;
>> > +   }
>> > +   break;
>> > +   case PEDIT_HDR_TYPE_RAW:
>> > +   case PEDIT_HDR_TYPE_IP4:
>> > +   case PEDIT_HDR_TYPE_IP6:
>> > +   *hoffset = skb_network_offset(skb);
>> > +   ret = 0;
>> > +   break;
>> > +   case PEDIT_HDR_TYPE_TCP:
>> > +   case PEDIT_HDR_TYPE_UDP:
>> > +   if (skb_transport_header_was_set(skb)) {
>> > +   *hoffset = skb_transport_offset(skb);
>> > +   ret = 0;
>> > +   }
>> > +   break;
>> > +   };
>> > +
>> > +   return ret;
>> > +}
>> > +

>> The only distinction between the cases is "L2", "L3", and "L4".

>> Therefore I don't see any reason to break it down into IP4 vs. IP6 vs.
>> RAW, for example.  They all map to the same thing.

>> So why not just have PEDIT_HDR_TYPE_L2, PEDIT_HDR_TYPE_L3, and
>> PEDIT_HDR_TYPE_L4?  It definitely seems more straightforward
>> and cleaner that way.

> Yeah, it isn't by mistake. The next step will be to implement hardware
> offloading of the action, and for that we would like to keep the
> information about the specific header type.

Hi Dave,

I see that this patch is marked as "Changes Requested" in your patchwork.

Just wanted to note, as Amir explained here and as mentioned in the
change log, that this was done on purpose, as a heads-up for HW offloads.
Typically HW APIs let you do things based on the header type they have
parsed, so that's why we added this small redundancy, e.g. an IPv4/IPv6
header ID instead of a network header ID. While SW-wise both IPv4 and
IPv6 use the same code path, for HW offloads the HW driver could choose
to use the IPv4/IPv6 header ID info.

Or.


Re: [PATCH] mlx4: Use kernel sizeof and alloc styles

2016-12-04 Thread Eric Dumazet
On Sun, 2016-12-04 at 12:11 -0800, Joe Perches wrote:
> Convert sizeof foo to sizeof(foo) and allocations with multiplications
> to the appropriate kcalloc/kmalloc_array styles.
> 
> Signed-off-by: Joe Perches 
> ---

Gah.

This is one of the hottest NIC drivers on Linux at this moment,
with XDP and other efforts going on.

Some kmalloc() calls are becoming kmalloc_node() in some dev branches,
and there is no kmalloc_array_node() yet.

This kind of patch is making rebases/backports very painful.

Could we wait ~6 months before doing such cleanup/changes please ?

If you believe a bug needs a fix, please send a patch to address it.

Thanks.




"af_unix: conditionally use freezable blocking calls in read" is wrong

2016-12-04 Thread Al Viro
Could we please kill that kludge?  "af_unix: use freezable blocking
calls in read" had been wrong to start with; having a method make assumptions
of that sort ("nobody will call me while holding locks I hadn't thought of")
is asking for serious trouble.  splice is just a place where lockdep has
caught that - we *can't* assume that nobody will ever call kernel_recvmsg()
while holding some locks.

I've run into that converting AF_UNIX to generic_file_splice_read();
I can kludge around that ("freezable unless ->msg_iter is ITER_PIPE"), but
that only delays trouble.

Note that the only other user of freezable_schedule_timeout() is
a very different story - it's a kernel thread, which *does* have a guaranteed
locking environment.  Making such assumptions in unix_stream_recvmsg(),
OTOH, is insane...


[PATCH net-next 1/3] net: ethoc: Account for duplex changes

2016-12-04 Thread Florian Fainelli
ethoc_mdio_poll(), which is our PHYLIB adjust_link callback, does nothing;
we should at least react to duplex changes and change MODER accordingly.
Speed changes are not a problem, since the OpenCores Ethernet core seems
to react okay without us telling it.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/ethoc.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c
index 6456c180114b..877c02a36c85 100644
--- a/drivers/net/ethernet/ethoc.c
+++ b/drivers/net/ethernet/ethoc.c
@@ -221,6 +221,9 @@ struct ethoc {
struct mii_bus *mdio;
struct clk *clk;
s8 phy_id;
+
+   int old_link;
+   int old_duplex;
 };
 
 /**
@@ -667,6 +670,32 @@ static int ethoc_mdio_write(struct mii_bus *bus, int phy, 
int reg, u16 val)
 
 static void ethoc_mdio_poll(struct net_device *dev)
 {
+   struct ethoc *priv = netdev_priv(dev);
+   struct phy_device *phydev = dev->phydev;
+   bool changed = false;
+   u32 mode;
+
+   if (priv->old_link != phydev->link) {
+   changed = true;
+   priv->old_link = phydev->link;
+   }
+
+   if (priv->old_duplex != phydev->duplex) {
+   changed = true;
+   priv->old_duplex = phydev->duplex;
+   }
+
+   if (!changed)
+   return;
+
+   mode = ethoc_read(priv, MODER);
+   if (phydev->duplex == DUPLEX_FULL)
+   mode |= MODER_FULLD;
+   else
+   mode &= ~MODER_FULLD;
+   ethoc_write(priv, MODER, mode);
+
+   phy_print_status(phydev);
 }
 
 static int ethoc_mdio_probe(struct net_device *dev)
@@ -685,6 +714,9 @@ static int ethoc_mdio_probe(struct net_device *dev)
return -ENXIO;
}
 
+   priv->old_duplex = -1;
+   priv->old_link = -1;
+
err = phy_connect_direct(dev, phy, ethoc_mdio_poll,
 PHY_INTERFACE_MODE_GMII);
if (err) {
@@ -721,6 +753,9 @@ static int ethoc_open(struct net_device *dev)
netif_start_queue(dev);
}
 
+   priv->old_link = -1;
+   priv->old_duplex = -1;
+
phy_start(dev->phydev);
napi_enable(&priv->napi);
 
-- 
2.9.3



[PATCH net-next 0/3] net: ethoc: Misc improvements

2016-12-04 Thread Florian Fainelli
Hi all,

This patch series fixes/improves a few things:

- implement a proper PHYLIB adjust_link callback to set the duplex mode
  accordingly
- do not open code the fetching of a MAC address in OF/DT environments
- demote an error message that occurs more frequently than expected in low
  CPU/memory/bandwidth environments

Tested on a Cirrus Logic EP93xx / TS7300 board.

Florian Fainelli (3):
  net: ethoc: Account for duplex changes
  net: ethoc: Utilize of_get_mac_address()
  net: ethoc: Demote packet dropped error message to debug

 drivers/net/ethernet/ethoc.c | 44 +++-
 1 file changed, 39 insertions(+), 5 deletions(-)

-- 
2.9.3



[PATCH net-next 2/3] net: ethoc: Utilize of_get_mac_address()

2016-12-04 Thread Florian Fainelli
Do not open code getting the MAC address exclusively from the
"local-mac-address" property, but instead use of_get_mac_address() which
looks up the MAC address using the 3 typical property names.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/ethoc.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c
index 877c02a36c85..8d0cb5ce87ee 100644
--- a/drivers/net/ethernet/ethoc.c
+++ b/drivers/net/ethernet/ethoc.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -1158,11 +1159,9 @@ static int ethoc_probe(struct platform_device *pdev)
memcpy(netdev->dev_addr, pdata->hwaddr, IFHWADDRLEN);
priv->phy_id = pdata->phy_id;
} else {
-   const uint8_t *mac;
+   const void *mac;
 
-   mac = of_get_property(pdev->dev.of_node,
- "local-mac-address",
- NULL);
+   mac = of_get_mac_address(pdev->dev.of_node);
if (mac)
memcpy(netdev->dev_addr, mac, IFHWADDRLEN);
priv->phy_id = -1;
-- 
2.9.3



[PATCH net-next 3/3] net: ethoc: Demote packet dropped error message to debug

2016-12-04 Thread Florian Fainelli
Spamming the console with "net eth1: packet dropped" messages can
happen fairly frequently if the adapter is busy transmitting; demote
the message to a debug print.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/ethoc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c
index 8d0cb5ce87ee..45abc81f6f55 100644
--- a/drivers/net/ethernet/ethoc.c
+++ b/drivers/net/ethernet/ethoc.c
@@ -576,7 +576,7 @@ static irqreturn_t ethoc_interrupt(int irq, void *dev_id)
 
/* We always handle the dropped packet interrupt */
if (pending & INT_MASK_BUSY) {
-   dev_err(>dev, "packet dropped\n");
+   dev_dbg(>dev, "packet dropped\n");
dev->stats.rx_dropped++;
}
 
-- 
2.9.3



Re: [PATCH v1 net-next 1/5] net: dsa: mv88e6xxx: Reserved Management frames to CPU

2016-12-04 Thread Andrew Lunn
> > +int mv88e6095_g2_mgmt_rsvd2cpu(struct mv88e6xxx_chip *chip)
> > +{
> > +   int err;
> > +
> > +   /* Consider the frames with reserved multicast destination
> > +* addresses matching 01:80:c2:00:00:2x as MGMT.
> > +*/
> > +   if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_G2_MGMT_EN_2X)) {
> 
> Please don't just move the old code. You shouldn't need flags anymore.

Hi Vivien

My aim is to get the 6390 supported. Refactoring is secondary. If it
helps with implementing the 6390 code, I will refactor stuff. If it
does not help, I will leave it alone. It does not help here, so I left
it alone. Please feel free to submit a follow-up patch refactoring
this further.

> The mv88e6xxx_ops actually implements the *features*. They can be
> prefixed for clarity (e.g. .ppu_*, port_*, .atu_*, etc.). They don't
> describe the register layout.
> 
> But we can discuss two ways of seeing this structure implementation:

or

3) We have a prefix for us humans, to help us find the code. Now that
we have ops, I cannot simply do M-. and have Emacs take me to the
implementation. I have to search for it a bit. Having the hint g1_
tells me to go look in global1.c. Having the hint g2_ tells me to go
look in global2.c. Having port_ tells me to go look in port.c.
Having no prefix tells me the code is scattered around and grep is my
friend.

The prefix is just a hint about where the function is in the source
code. Nothing more.

  Andrew


Re: [GIT PULL nf-next 0/2] IPVS Updates for v4.10

2016-12-04 Thread Pablo Neira Ayuso
On Tue, Nov 15, 2016 at 10:01:41AM +0100, Simon Horman wrote:
> Hi Pablo,
> 
> please consider these enhancements to the IPVS for v4.10.
> 
> * Decrement the IP ttl in all the modes in order to prevent infinite
>   route loops. Thanks to Dwip Banerjee.
> * Use IS_ERR_OR_NULL macro. Clean-up from Gao Feng.
> 
> 
> The following changes since commit 7d384846b9987f7b611357adf3cdfecfdcf0c402:
> 
>   Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 
> (2016-11-13 22:41:25 -0500)
> 
> are available in the git repository at:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git 
> tags/ipvs-for-v4.10

Pulled, thanks Simon.


[PATCH] mlx4: Use kernel sizeof and alloc styles

2016-12-04 Thread Joe Perches
Convert sizeof foo to sizeof(foo) and allocations with multiplications
to the appropriate kcalloc/kmalloc_array styles.

Signed-off-by: Joe Perches 
---
 drivers/net/ethernet/mellanox/mlx4/alloc.c |  6 +++---
 drivers/net/ethernet/mellanox/mlx4/cmd.c   |  8 
 drivers/net/ethernet/mellanox/mlx4/en_resources.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/eq.c| 20 +-
 drivers/net/ethernet/mellanox/mlx4/fw.c|  2 +-
 drivers/net/ethernet/mellanox/mlx4/icm.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/icm.h   |  4 ++--
 drivers/net/ethernet/mellanox/mlx4/intf.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/main.c  | 12 +--
 drivers/net/ethernet/mellanox/mlx4/mcg.c   | 12 +--
 drivers/net/ethernet/mellanox/mlx4/mr.c| 18 
 drivers/net/ethernet/mellanox/mlx4/qp.c| 12 +--
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  | 24 +++---
 15 files changed, 63 insertions(+), 65 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/alloc.c 
b/drivers/net/ethernet/mellanox/mlx4/alloc.c
index 249a4584401a..c25de2740c78 100644
--- a/drivers/net/ethernet/mellanox/mlx4/alloc.c
+++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c
@@ -185,8 +185,8 @@ int mlx4_bitmap_init(struct mlx4_bitmap *bitmap, u32 num, 
u32 mask,
bitmap->avail = num - reserved_top - reserved_bot;
bitmap->effective_len = bitmap->avail;
spin_lock_init(&bitmap->lock);
-   bitmap->table = kzalloc(BITS_TO_LONGS(bitmap->max) *
-   sizeof (long), GFP_KERNEL);
+   bitmap->table = kcalloc(BITS_TO_LONGS(bitmap->max), sizeof(long),
+   GFP_KERNEL);
if (!bitmap->table)
return -ENOMEM;
 
@@ -668,7 +668,7 @@ static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct 
device *dma_device,
 {
struct mlx4_db_pgdir *pgdir;
 
-   pgdir = kzalloc(sizeof *pgdir, gfp);
+   pgdir = kzalloc(sizeof(*pgdir), gfp);
if (!pgdir)
return NULL;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index e36bebcab3f2..86e03e47ca47 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -2616,9 +2616,9 @@ int mlx4_cmd_use_events(struct mlx4_dev *dev)
int i;
int err = 0;
 
-   priv->cmd.context = kmalloc(priv->cmd.max_cmds *
-  sizeof (struct mlx4_cmd_context),
-  GFP_KERNEL);
+   priv->cmd.context = kmalloc_array(priv->cmd.max_cmds,
+ sizeof(struct mlx4_cmd_context),
+ GFP_KERNEL);
if (!priv->cmd.context)
return -ENOMEM;
 
@@ -2675,7 +2675,7 @@ struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct 
mlx4_dev *dev)
 {
struct mlx4_cmd_mailbox *mailbox;
 
-   mailbox = kmalloc(sizeof *mailbox, GFP_KERNEL);
+   mailbox = kmalloc(sizeof(*mailbox), GFP_KERNEL);
if (!mailbox)
return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_resources.c 
b/drivers/net/ethernet/mellanox/mlx4/en_resources.c
index a6b0db0e0383..10966dc5792c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_resources.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_resources.c
@@ -44,7 +44,7 @@ void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int 
size, int stride,
struct mlx4_en_dev *mdev = priv->mdev;
struct net_device *dev = priv->dev;
 
-   memset(context, 0, sizeof *context);
+   memset(context, 0, sizeof(*context));
context->flags = cpu_to_be32(7 << 16 | rss << MLX4_RSS_QPC_FLAG_OFFSET);
context->pd = cpu_to_be32(mdev->priv_pdn);
context->mtu_msgmax = 0xff;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 6562f78b07f4..616d3febe7ce 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1235,7 +1235,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv 
*priv, int qpn,
}
qp->event = mlx4_en_sqp_event;
 
-   memset(context, 0, sizeof *context);
+   memset(context, 0, sizeof(*context));
mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0,
qpn, ring->cqn, -1, context);
context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 5de3cbe24f2b..233317e5fe72 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -660,7 

Re: [PATCH v1 net-next 1/5] net: dsa: mv88e6xxx: Reserved Management frames to CPU

2016-12-04 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> + /* Some generations have the configuration of sending reserved
> +  * management frames to the CPU in global2, others in
> +  * global1. Hence it does not fit the two setup functions
> +  * above.
> +  */
> + if (chip->info->ops->mgmt_rsvd2cpu) {
> + err = chip->info->ops->mgmt_rsvd2cpu(chip);
> + if (err)
> + goto unlock;
> + }

Correct. The old mv88e6xxx_g*_setup() are bad. Ideally we'll get rid of
them once we have the complete set of features implemented in the API.

> +/* Offset 0x02: Management Enable 2x */
> +/* Offset 0x03: Management Enable 0x */
> +
> +int mv88e6095_g2_mgmt_rsvd2cpu(struct mv88e6xxx_chip *chip)
> +{
> + int err;
> +
> + /* Consider the frames with reserved multicast destination
> +  * addresses matching 01:80:c2:00:00:2x as MGMT.
> +  */
> + if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_G2_MGMT_EN_2X)) {

Please don't just move the old code. You shouldn't need flags anymore.

> + err = mv88e6xxx_g2_write(chip, GLOBAL2_MGMT_EN_2X, 0x);
> + if (err)
> + return err;
> + }
> +
> + /* Consider the frames with reserved multicast destination
> +  * addresses matching 01:80:c2:00:00:0x as MGMT.
> +  */
> + if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_G2_MGMT_EN_0X))
> + return mv88e6xxx_g2_write(chip, GLOBAL2_MGMT_EN_0X, 0x);
> +
> + return 0;
> +}

[...]

>   int (*g1_set_cpu_port)(struct mv88e6xxx_chip *chip, int port);
>   int (*g1_set_egress_port)(struct mv88e6xxx_chip *chip, int port);
> +
> + /* Can be either in g1 or g2, so don't use a prefix */

We have to discuss this and find an agreement.

The mv88e6xxx_ops actually implements the *features*. They can be
prefixed for clarity (e.g. .ppu_*, port_*, .atu_*, etc.). They don't
describe the register layout.

But we can discuss two ways of seeing this structure implementation:

1) Either we describe the exact register layout, and provide
.g1_mgmt_rsv2cpu and .g2_mgmt_rsvd2cpu. One or the other being NULL.

This will have the impact of making each feature's location explicit. One
can say it'll ease adding support for new chips, but that's not true when
a feature at the same location is implemented differently (e.g. port
speed, switch reset). This will make the structure unnecessarily big, as
well as cluttering the wrapping code.

2) We describe the feature. No *location* prefix.

To explain this point, please understand what Marvell does with their
chips, using the example of Rsvd2CPU feature and (old-to-new) models:

  - 6060 doesn't have the feature
  - 6185 introduced one G2 register for 0x MAC addresses
  - 6352 added a second G2 register for 2x MAC addresses
  - 6390 packed the feature in a single-register indirect table in G1

That's all. They are just relocating features to free a few registers.

So .g1_set_cpu_port, .g1_reset, or whatever location-prefixed operation
does not make sense, unless you want to describe the register layout.
But we should not mix 1) and 2).

Thanks,

Vivien


Re: [net-next PATCH v4 1/6] net: virtio dynamically disable/enable LRO

2016-12-04 Thread John Fastabend
On 16-12-03 09:36 PM, Michael S. Tsirkin wrote:
> On Fri, Dec 02, 2016 at 12:49:45PM -0800, John Fastabend wrote:
>> This adds support for dynamically setting the LRO feature flag. The
>> message to control guest features in the backend uses the
>> CTRL_GUEST_OFFLOADS msg type.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  drivers/net/virtio_net.c |   45 
>> -
>>  1 file changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index a21d93a..d814e7cb 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1419,6 +1419,41 @@ static void virtnet_init_settings(struct net_device 
>> *dev)
>>  .set_settings = virtnet_set_settings,
>>  };
>>  
>> +static int virtnet_set_features(struct net_device *netdev,
>> +netdev_features_t features)
>> +{
>> +struct virtnet_info *vi = netdev_priv(netdev);
>> +struct virtio_device *vdev = vi->vdev;
>> +struct scatterlist sg;
>> +u64 offloads = 0;
>> +
>> +if (features & NETIF_F_LRO)
>> +offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) |
>> +(1 << VIRTIO_NET_F_GUEST_TSO6);
>> +
>> +if (features & NETIF_F_RXCSUM)
>> +offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM);
>> +
>> +if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
>> +sg_init_one(&sg, &offloads, sizeof(uint64_t));
>> +if (!virtnet_send_command(vi,
>> +  VIRTIO_NET_CTRL_GUEST_OFFLOADS,
>> +  VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
>> +  &sg)) {
>> +dev_warn(&netdev->dev,
>> + "Failed to set guest offloads by virtnet command.\n");
>> +return -EINVAL;
>> +}
>> +} else if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) &&
>> +   !virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> 
> No need for VIRTIO_F_VERSION_1 here.
> 

Yep Dave pointed it out as well. Also I found a bug on driver unload
path where XDP tx queues are not cleaned up correctly so will need a v5
for that.

Thanks,
John


[PATCH v2 net-next] bnx2x: ethtool -x full support

2016-12-04 Thread Eric Dumazet
From: Eric Dumazet 

Implement ethtool -x full support, so that rss key can be fetched
instead of assuming it matches /proc/sys/net/core/netdev_rss_key
content.

We might add "ethtool --rxfh" support later to set a different rss key.

Tested:

lpk51:~# ethtool --show-rxfh eth0 | grep -A1 'hash key'
RSS hash key:
8b:a9:3a:ff:3e:f8:44:bd:5a:44:b7:b5:6d:e8:2d:f0:f0:72:98:54:03:86:8f:39:a4:42:5a:b3:84:71:5c:4f:1c:18:d6:a3:04:68:85:ac
lpk51:~# cat /proc/sys/net/core/netdev_rss_key
8b:a9:3a:ff:3e:f8:44:bd:5a:44:b7:b5:6d:e8:2d:f0:f0:72:98:54:03:86:8f:39:a4:42:5a:b3:84:71:5c:4f:1c:18:d6:a3:04:68:85:ac:22:1f:50:76:d4:c8:a5:20:7b:61:3c:0c

Signed-off-by: Eric Dumazet 
Cc: Ariel Elior 
---
v2: support CONFIG_BNX2X_SRIOV=y

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |2 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |   47 ++
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c  |2 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.h  |5 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c|5 -
 5 files changed, 38 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index ed42c1009685..28af24ae0092 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -2106,7 +2106,7 @@ int bnx2x_rss(struct bnx2x *bp, struct 
bnx2x_rss_config_obj *rss_obj,
 
if (config_hash) {
/* RSS keys */
-   netdev_rss_key_fill(params.rss_key, T_ETH_RSS_KEY * 4);
+   netdev_rss_key_fill(&rss_obj->rss_key, T_ETH_RSS_KEY * 4);
__set_bit(BNX2X_RSS_SET_SRCH, &params.rss_flags);
}
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 85a7800bfc12..28bc9479fc74 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -3421,6 +3421,13 @@ static u32 bnx2x_get_rxfh_indir_size(struct net_device 
*dev)
return T_ETH_INDIRECTION_TABLE_SIZE;
 }
 
+static u32 bnx2x_get_rxfh_key_size(struct net_device *dev)
+{
+   struct bnx2x *bp = netdev_priv(dev);
+
+   return (bp->port.pmf || !CHIP_IS_E1x(bp)) ? T_ETH_RSS_KEY * 4 : 0;
+}
+
 static int bnx2x_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
  u8 *hfunc)
 {
@@ -3430,23 +3437,30 @@ static int bnx2x_get_rxfh(struct net_device *dev, u32 
*indir, u8 *key,
 
if (hfunc)
*hfunc = ETH_RSS_HASH_TOP;
-   if (!indir)
-   return 0;
 
-   /* Get the current configuration of the RSS indirection table */
-   bnx2x_get_rss_ind_table(&bp->rss_conf_obj, ind_table);
-
-   /*
-* We can't use a memcpy() as an internal storage of an
-* indirection table is a u8 array while indir->ring_index
-* points to an array of u32.
-*
-* Indirection table contains the FW Client IDs, so we need to
-* align the returned table to the Client ID of the leading RSS
-* queue.
-*/
-   for (i = 0; i < T_ETH_INDIRECTION_TABLE_SIZE; i++)
-   indir[i] = ind_table[i] - bp->fp->cl_id;
+   if (key) {
+   if (bp->port.pmf || !CHIP_IS_E1x(bp))
+   memcpy(key, &bp->rss_conf_obj.rss_key, T_ETH_RSS_KEY * 4);
+   else
+   memset(key, 0, T_ETH_RSS_KEY * 4);
+   }
+
+   if (indir) {
+   /* Get the current configuration of the RSS indirection table */
+   bnx2x_get_rss_ind_table(&bp->rss_conf_obj, ind_table);
+
+   /*
+* We can't use a memcpy() as an internal storage of an
+* indirection table is a u8 array while indir->ring_index
+* points to an array of u32.
+*
+* Indirection table contains the FW Client IDs, so we need to
+* align the returned table to the Client ID of the leading RSS
+* queue.
+*/
+   for (i = 0; i < T_ETH_INDIRECTION_TABLE_SIZE; i++)
+   indir[i] = ind_table[i] - bp->fp->cl_id;
+   }
 
return 0;
 }
@@ -3628,6 +3642,7 @@ static const struct ethtool_ops bnx2x_ethtool_ops = {
.get_ethtool_stats  = bnx2x_get_ethtool_stats,
.get_rxnfc  = bnx2x_get_rxnfc,
.set_rxnfc  = bnx2x_set_rxnfc,
+   .get_rxfh_key_size  = bnx2x_get_rxfh_key_size,
.get_rxfh_indir_size= bnx2x_get_rxfh_indir_size,
.get_rxfh   = bnx2x_get_rxfh,
.set_rxfh   = bnx2x_set_rxfh,
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index cea6bdcde33f..3b1becc23160 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ 

Re: Trigger EHOSTUNREACH

2016-12-04 Thread Neal Cardwell
On Sun, Dec 4, 2016 at 7:04 AM, Marco Zunino  wrote:
> Hallo everyone, hope you are having a good day
> we are building a networking testing tool to simulate network error
> condition, and we are having difficulties triggering the EHOSTUNREACH
> socket error.
>
> We are trying to trigger this error by sending an ICMP packet type=3
> code=3 on an open STREAM socket, but it has no effect.
>
> Based on RFC1122 and the code here
>
> https://github.com/torvalds/linux/blob/e76d21c40bd6c67fd4e2c1540d77e113df962b4d/net/ipv4/tcp_ipv4.c#L353
>
> I would expect the this ICMP packet to abort the socket connection
> with a EHOSTUNREACH error on the client side, but this does not
> happen.

In my quick tests with packetdrill, it looks like Linux will not
immediately pass EHOSTUNREACH to the application unless the
application has requested this with setsockopt(SOL_IP, IP_RECVERR).

Specifically, the following packetdrill test passes for me:
---
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 
   +0 > S. 0:0(0) ack 1 
+.020 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4
   +0 setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
   +0 write(4, ..., 1000) = 1000
   +0 > P. 1:1001(1000) ack 1

+.010 < icmp unreachable host_unreachable [1:1461(1460)]

   +0 write(4, ..., 1) = -1 EHOSTUNREACH (No route to host)
---

But without the setsockopt(SOL_IP, IP_RECVERR) there is no error upon
the second write().

My reading of RFC 1122 is that this is consistent with the RFC.

RFC 1122 section 3.2.2.1 says:

A Destination Unreachable message that is received with code
0 (Net), 1 (Host), or 5 (Bad Source Route) may result from a
routing transient and MUST therefore be interpreted as only
a hint, not proof, that the specified destination is
unreachable [IP:11].

So it seems that the RFC is suggesting that by default an ICMP host
unreachable should not cause an immediate error for the connection.
Instead, it should be used as a hint as to the cause of the problem if
TCP's normal reliable delivery mechanisms ultimately timeout and fail.

neal


[PATCH v3 net-next] net_sched: gen_estimator: complete rewrite of rate estimators

2016-12-04 Thread Eric Dumazet
From: Eric Dumazet 

1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
  gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)

Signed-off-by: Eric Dumazet 
---
v3: Renamed some parameters to please make htmldocs
v2: Removed unwanted changes to tcp_output.c

 include/net/act_api.h  |2 
 include/net/gen_stats.h|   17 -
 include/net/netfilter/xt_rateest.h |   10 
 include/net/sch_generic.h  |2 
 net/core/gen_estimator.c   |  299 +--
 net/core/gen_stats.c   |   20 -
 net/netfilter/xt_RATEEST.c |4 
 net/netfilter/xt_rateest.c |   28 +-
 net/sched/act_api.c|9 
 net/sched/act_police.c |   21 +
 net/sched/sch_api.c|2 
 net/sched/sch_cbq.c|6 
 net/sched/sch_drr.c|6 
 net/sched/sch_generic.c|2 
 net/sched/sch_hfsc.c   |6 
 net/sched/sch_htb.c|6 
 net/sched/sch_qfq.c|8 
 17 files changed, 182 insertions(+), 266 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 
9dddf77a69ccbcb003cfa66bcc0de337f78f3dae..1d716449209e4753a297c61a287077a1eb96e6d8
 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -36,7 +36,7 @@ struct tc_action {
struct tcf_t  tcfa_tm;
struct gnet_stats_basic_packed  tcfa_bstats;
struct gnet_stats_queue tcfa_qstats;
-   struct gnet_stats_rate_est64tcfa_rate_est;
+   struct net_rate_estimator __rcu *tcfa_rate_est;
spinlock_t  tcfa_lock;
struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index 
231e121cc7d9c72075e7e6dde3655d631f64a1c4..8b7aa370e7a4af61fcb71ed751dba72ebead6143
 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -11,6 +11,8 @@ struct gnet_stats_basic_cpu {
struct u64_stats_sync syncp;
 };
 
+struct net_rate_estimator;
+
 struct gnet_dump {
spinlock_t *  lock;
struct sk_buff *  skb;
@@ -42,8 +44,7 @@ void __gnet_stats_copy_basic(const seqcount_t *running,
 struct gnet_stats_basic_cpu __percpu *cpu,
 struct gnet_stats_basic_packed *b);
 int gnet_stats_copy_rate_est(struct gnet_dump *d,
-const struct gnet_stats_basic_packed *b,
-struct gnet_stats_rate_est64 *r);
+struct net_rate_estimator __rcu **ptr);
 int gnet_stats_copy_queue(struct gnet_dump *d,
  struct gnet_stats_queue __percpu *cpu_q,
  struct gnet_stats_queue *q, __u32 qlen);
@@ -53,16 +54,16 @@ int gnet_stats_finish_copy(struct gnet_dump *d);
 
 int gen_new_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
- struct gnet_stats_rate_est64 *rate_est,
+ struct net_rate_estimator __rcu **rate_est,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
-void gen_kill_estimator(struct gnet_stats_basic_packed *bstats,
-   struct gnet_stats_rate_est64 *rate_est);
+void gen_kill_estimator(struct net_rate_estimator __rcu **ptr);
 int gen_replace_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
- struct gnet_stats_rate_est64 *rate_est,
+ struct net_rate_estimator __rcu **ptr,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
-bool gen_estimator_active(const struct gnet_stats_basic_packed *bstats,
- const struct gnet_stats_rate_est64 *rate_est);
+bool gen_estimator_active(struct net_rate_estimator __rcu **ptr);
+bool gen_estimator_read(struct 

Re: [PATCN v2 net-next] net_sched: gen_estimator: complete rewrite of rate estimators

2016-12-04 Thread Eric Dumazet
On Sat, 2016-12-03 at 23:18 -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> 1) Old code was hard to maintain, due to complex lock chains.
>(We probably will be able to remove some kfree_rcu() in callers)
> 
> 2) Using a single timer to update all estimators does not scale.
> 
> 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
>is not supposed to work well)

A v3 is under way, fixing a last "make htmldocs" warning.




Re: [flamebait] xdp Was: Re: bpf bounded loops. Was: [flamebait] xdp

2016-12-04 Thread Hannes Frederic Sowa
Hello,

On 03.12.2016 00:34, Alexei Starovoitov wrote:
> On Fri, Dec 02, 2016 at 08:42:41PM +0100, Hannes Frederic Sowa wrote:
>> On Fri, Dec 2, 2016, at 20:25, Hannes Frederic Sowa wrote:
>>> On 02.12.2016 19:39, Alexei Starovoitov wrote:
 On Thu, Dec 01, 2016 at 10:27:12PM +0100, Hannes Frederic Sowa wrote:
> like") and the problematic of parsing DNS packets in XDP due to string
> processing and looping inside eBPF.

 Hannes,
 Not too long ago you proposed a very interesting idea to add
 support for bounded loops without adding any new bpf instructions and
 changing llvm (which was way better than my 'rep' like instructions
 I was experimenting with). I thought systemtap guys also wanted bounded
 loops and you were cooperating on the design, so I gave up on my work and
 was expecting an imminent patch from you. I guess it sounds like you now
 believe that bounded loops are impossible, or do I misunderstand your
 statement ?
>>>
>>> Your argument was that it would need a new verifier as the current first
>>> pass checks that we indeed can lay out the basic blocks as a DAG which
>>> the second pass depends on. This would be violated.
> 
> yes. today the main part of verifier depends on cfg check that confirms DAG
> property of the program. This was done as a simplification for the algorithm,
> so any programmer that understands C can understand the verifier code.
> It certainly was the case, since most of the people who hacked
> verifier had zero compiler background.
> Now I'm thinking to introduce proper compiler technologies to it.
> On one side it will make the bar to understand higher and on the other
> side it will cleanup the logic and reuse tens of years of data flow
> analysis theory and will make verifier more robust and mathematically
> solid.

See below.

>>> Because eBPF is available by non privileged users this would need a lot
>>> of effort to rewrite and verify (or indeed keep two verifiers in the
>>> kernel for priv and non-priv). The verifier itself is exposed to
>>> unprivileged users.
> 
> I certainly hear your concerns that people unfamiliar with it are simply
> scared that more and more verification logic being added. So I don't mind
> freezing current verifier for unpriv and let proper data flow analysis
> to be done in root only component.
> 
>>> Also, by design, if we keep the current limits, this would not give you
>>> more instructions to operate on compared to the flattened version of the
>>> program, it would merely reduce the numbers of optimizations in LLVM
>>> that let the verifier reject the program.
> 
> I think we most likely will keep 4k insn limit (since there were no
> requests to increase it). The bounded loops will improve performance
> and reduce I-cache misses.

I agree that bounded loops will increase performance and in general I
see lifting this limitation as something good if it works out.

>> The only solution to protect the verifier, which I saw, would be to
>> limit it by time and space, thus making loading of eBPF programs
>> dependent on how fast and hot (thermal throttling) one CPU thread is.
> 
> the verifier already has time and space limits.
> See no reason to rely on physical cpu sensors.

Time and space are bounded by the DAG property. They are still bounded in
the directed cyclic case (by some arbitrary upper limit), but can suffer a
combinatorial explosion because of the switch from proving properties
for each node+state to proving properties for each path+state.

Compiler algorithms may be of help here, but historically they have focused
on other properties, mainly optimization, and thus are mostly heuristics.

Compiler developers don't write their algorithms under the assumption that
they will execute in a security- and resource-sensitive environment (only
the generated code has to be safe). I believe that optimization algorithms
increase the attack surface, as their big-O worst case is an additional
cost on top of the worst-case verification path. I don't think compiler
engineers consider the code being optimized attacking the optimization
algorithm itself.

Verification of a malicious BPF program (a complexity bomb) in the kernel
should neither disrupt the system nor create a security threat (we are also
mostly voluntarily preemptible in this code and hold a lock). Verification
might fail when memory fragmentation becomes more probable, as the huge
state table for path-sensitive verification cannot be allocated.

In user space, instead of verification of many properties of program state
(which you actually need in the BPF verifier), the development effort
concentrates on sanitizers. Otherwise, look at how good gcc is at finding
uninitialized variables.

Most mathematical proofs written for compiler optimizations that I know of
are written to show equivalence between the program text and its optimized
form. Compile time has only recently become a more important aspect of the
open-source compilers, at least.

I am happy to be shown wrong.

[patch net] net: fec: fix compile with CONFIG_M5272

2016-12-04 Thread Nikita Yushchenko
Commit 4dfb80d18d05 ("net: fec: cache statistics while device is down")
introduced unconditional statistics-related actions.

However, when the driver is compiled with CONFIG_M5272, statistics-related
definitions do not exist, which results in build errors.

Fix that by adding the needed #if !defined(CONFIG_M5272) guards.

Fixes: 4dfb80d18d05 ("net: fec: cache statistics while device is down")
Signed-off-by: Nikita Yushchenko 
---
 drivers/net/ethernet/freescale/fec_main.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 6a20c24a2003..89e902767abb 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -2884,7 +2884,9 @@ fec_enet_close(struct net_device *ndev)
if (fep->quirks & FEC_QUIRK_ERR006687)
imx6q_cpuidle_fec_irqs_unused();
 
+#if !defined(CONFIG_M5272)
fec_enet_update_ethtool_stats(ndev);
+#endif
 
fec_enet_clk_enable(ndev, false);
	pinctrl_pm_select_sleep_state(&fep->pdev->dev);
@@ -3192,7 +3194,9 @@ static int fec_enet_init(struct net_device *ndev)
 
fec_restart(ndev);
 
+#if !defined(CONFIG_M5272)
fec_enet_update_ethtool_stats(ndev);
+#endif
 
return 0;
 }
@@ -3292,9 +3296,11 @@ fec_probe(struct platform_device *pdev)
	fec_enet_get_queue_num(pdev, &num_tx_qs, &num_rx_qs);
 
/* Init network device */
-   ndev = alloc_etherdev_mqs(sizeof(struct fec_enet_private) +
- ARRAY_SIZE(fec_stats) * sizeof(u64),
- num_tx_qs, num_rx_qs);
+   ndev = alloc_etherdev_mqs(sizeof(struct fec_enet_private)
+#if !defined(CONFIG_M5272)
+ + ARRAY_SIZE(fec_stats) * sizeof(u64)
+#endif
+ , num_tx_qs, num_rx_qs);
if (!ndev)
return -ENOMEM;
 
-- 
2.1.4



Re: [PATCH v2 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog

2016-12-04 Thread Daniel Borkmann

On 12/04/2016 04:17 AM, Martin KaFai Lau wrote:

This patch allows an XDP prog to extend/remove the packet
data at the head (like adding or removing a header).  It is
done by adding a new XDP helper, bpf_xdp_adjust_head().

It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that an XDP prog does not work on an skb.

Acked-by: Alexei Starovoitov 
Signed-off-by: Martin KaFai Lau 


Yeah, looks good like that:

Acked-by: Daniel Borkmann 

Should there one day be different requirements wrt min length,
we could always pass dev pointer via struct xdp_buff and then
use dev->hard_header_len, for example, but not needed right now.


[PATCH net-next] net/sched: cls_flower: Set the filter Hardware device for all use-cases

2016-12-04 Thread Hadar Hen Zion
Check whether the device returned by the tcf_exts_get_dev() function supports
tc offload; in case the rule can't be offloaded, set the filter's hw_dev
parameter to the original device given by the user.

The filter's hw_dev parameter should always be set by the
fl_hw_replace_filter() function, since this pointer is used when dumping
stats and destroying the filter for each flower rule (offloaded or not).

Fixes: 7091d8c7055d ('net/sched: cls_flower: Add offload support using egress Hardware device')
Signed-off-by: Hadar Hen Zion 
Reported-by: Simon Horman 
---
 net/sched/cls_flower.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index c5cea78..29a9e6d 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -236,8 +236,11 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
int err;
 
if (!tc_can_offload(dev, tp)) {
-   if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev))
+   if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
+   (f->hw_dev && !tc_can_offload(f->hw_dev, tp))) {
+   f->hw_dev = dev;
return tc_skip_sw(f->flags) ? -EINVAL : 0;
+   }
dev = f->hw_dev;
tc->egress_dev = true;
} else {
-- 
1.8.3.1





[PATCH net 0/2] bnx2x: fixes series

2016-12-04 Thread Yuval Mintz
Hi Dave,

Two unrelated fixes for bnx2x - the first one is nice-to-have,
while the other fixes fatal behaviour in older adapters.

Please consider applying them to `net'.

Thanks,
Yuval

Yuval Mintz (2):
  bnx2x: Correct ringparam estimate when interface is down
  bnx2x: Prevent tunnel configuration for 577xx adapters.

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 8 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c| 4 ++--
 2 files changed, 10 insertions(+), 2 deletions(-)

-- 
1.9.3



[PATCH net 2/2] bnx2x: Prevent tunnel config for 577xx

2016-12-04 Thread Yuval Mintz
Only the 578xx adapters are capable of configuring UDP ports for
the purpose of tunnelling - doing the same on 577xx might lead to
a firmware assertion.
We're already not claiming support for any related feature for such
devices, but we also need to prevent the configuration of the UDP
ports to the device in this case.

Fixes: f34fa14cc033 ("bnx2x: Add vxlan RSS support")
Reported-by: Anikina Anna 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 0cee4c0..42f4611 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -10138,7 +10138,7 @@ static void __bnx2x_add_udp_port(struct bnx2x *bp, u16 port,
 {
	struct bnx2x_udp_tunnel *udp_port = &bp->udp_tunnel_ports[type];
 
-   if (!netif_running(bp->dev) || !IS_PF(bp))
+   if (!netif_running(bp->dev) || !IS_PF(bp) || CHIP_IS_E1x(bp))
return;
 
if (udp_port->count && udp_port->dst_port == port) {
@@ -10163,7 +10163,7 @@ static void __bnx2x_del_udp_port(struct bnx2x *bp, u16 port,
 {
	struct bnx2x_udp_tunnel *udp_port = &bp->udp_tunnel_ports[type];
 
-   if (!IS_PF(bp))
+   if (!IS_PF(bp) || CHIP_IS_E1x(bp))
return;
 
if (!udp_port->count || udp_port->dst_port != port) {
-- 
1.9.3



[PATCH net 1/2] bnx2x: Correct ringparam estimate when DOWN

2016-12-04 Thread Yuval Mintz
Until the interface is up [and assuming ringparams weren't explicitly
configured], when queried for the size of its rings bnx2x would claim
they're of the maximal size by default.
That is incorrect, as by default the maximal number of buffers is
equally divided between the various rx rings.

This prevents the user from actually setting the number of elements
on each rx ring to the maximal size prior to bringing the
interface up.

To fix this, make a rough estimation of the number of buffers.
It won't always be accurate, but it is much better than the
current estimate and allows users to increase the number of
buffers during early initialization of the interface.

Reported-by: Seymour, Shane 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 85a7800..5f19427 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -1872,8 +1872,16 @@ static void bnx2x_get_ringparam(struct net_device *dev,
 
ering->rx_max_pending = MAX_RX_AVAIL;
 
+   /* If size isn't already set, we give an estimation of the number
+* of buffers we'll have. We're neglecting some possible conditions
+* [we couldn't know for certain at this point if number of queues
+* might shrink] but the number would be correct for the likely
+* scenario.
+*/
if (bp->rx_ring_size)
ering->rx_pending = bp->rx_ring_size;
+   else if (BNX2X_NUM_RX_QUEUES(bp))
+   ering->rx_pending = MAX_RX_AVAIL / BNX2X_NUM_RX_QUEUES(bp);
else
ering->rx_pending = MAX_RX_AVAIL;
 
-- 
1.9.3



[PATCH V2 net 02/20] net/ena: fix error handling when probe fails

2016-12-04 Thread Netanel Belgazal
When the driver fails in probe, it releases all of its resources,
including the adapter.
In case of probe failure, ena_remove() should not try to free the
adapter resources again.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 33a760e..397c9bc 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -3054,6 +3054,7 @@ static int ena_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 err_free_region:
ena_release_bars(ena_dev, pdev);
 err_free_ena_dev:
+   pci_set_drvdata(pdev, NULL);
vfree(ena_dev);
 err_disable_device:
pci_disable_device(pdev);
-- 
2.7.4



[PATCH V2 net 00/20] Increase ENA driver version to 1.1.2

2016-12-04 Thread Netanel Belgazal
Changes between V1 and V2:
* reorder the patches so the bug fixes will appear first.
* fix the commit message of the ntuple filter removal. The first patch
mistakenly stated that it removes RFS.
* add another bug fix (fix RSS default hash configuration).
* split the driver's version increase to a dedicated patch.
* add this patchset description.

This patchset contains mainly bug fixes.
Most of them are critical for the driver and system functionality.
In addition to the bug fixes, this patchset also introduces some minor
improvements listed below.

Bug fixes:
net/ena: remove ntuple filter support from device feature list
net/ena: fix error handling when probe fails
net/ena: fix queues number calculation
net/ena: fix ethtool RSS flow configuration
net/ena: fix RSS default hash configuration
net/ena: fix NULL dereference when removing the driver after device
reset failed
net/ena: refactor ena_get_stats64 to be atomic context safe
net/ena: add hardware hints capability to the driver
net/ena: fix potential access to freed memory during device reset
net/ena: remove redundant logic in napi callback for busy poll mode
net/ena: use READ_ONCE to access completion descriptors
net/ena: reduce the severity of ena printouts
net/ena: change driver's default timeouts
net/ena: change condition for host attribute configuration
net/ena: change sizeof() argument to be the type pointer

Other improvments:
net/ena: use napi_schedule_irqoff when possible
net/ena: add IPv6 extended protocols to ena_admin_flow_hash_proto
net/ena: remove affinity hint from the driver
net/ena: restructure skb allocation
net/ena: increase driver version to 1.1.2

Netanel Belgazal (20):
  net/ena: remove ntuple filter support from device feature list
  net/ena: fix error handling when probe fails
  net/ena: fix queues number calculation
  net/ena: fix ethtool RSS flow configuration
  net/ena: fix RSS default hash configuration
  net/ena: fix NULL dereference when removing the driver after device
reset failed
  net/ena: refactor ena_get_stats64 to be atomic context safe
  net/ena: add hardware hints capability to the driver
  net/ena: fix potential access to freed memory during device reset
  net/ena: remove redundant logic in napi callback for busy poll mode
  net/ena: use READ_ONCE to access completion descriptors
  net/ena: reduce the severity of ena printouts
  net/ena: change driver's default timeouts
  net/ena: change condition for host attribute configuration
  net/ena: change sizeof() argument to be the type pointer
  net/ena: use napi_schedule_irqoff when possible
  net/ena: add IPv6 extended protocols to ena_admin_flow_hash_proto
  net/ena: remove affinity hint from the driver
  net/ena: restructure skb allocation
  net/ena: increase driver version to 1.1.2

 drivers/net/ethernet/amazon/ena/ena_admin_defs.h |  57 +++-
 drivers/net/ethernet/amazon/ena/ena_com.c|  98 ---
 drivers/net/ethernet/amazon/ena/ena_com.h|   6 +
 drivers/net/ethernet/amazon/ena/ena_eth_com.c|   8 +-
 drivers/net/ethernet/amazon/ena/ena_ethtool.c|   1 -
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 326 ---
 drivers/net/ethernet/amazon/ena/ena_netdev.h |  30 ++-
 drivers/net/ethernet/amazon/ena/ena_regs_defs.h  |   2 +
 8 files changed, 385 insertions(+), 143 deletions(-)

-- 
2.7.4



[PATCH V2 net 03/20] net/ena: fix queues number calculation

2016-12-04 Thread Netanel Belgazal
The ENA driver tries to open a queue per vCPU.
To determine how many vCPUs the instance has, it uses num_possible_cpus()
while it should have used num_online_cpus() instead.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 397c9bc..224302c 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2667,7 +2667,7 @@ static int ena_calc_io_queue_num(struct pci_dev *pdev,
io_sq_num = get_feat_ctx->max_queues.max_sq_num;
}
 
-   io_queue_num = min_t(int, num_possible_cpus(), ENA_MAX_NUM_IO_QUEUES);
+   io_queue_num = min_t(int, num_online_cpus(), ENA_MAX_NUM_IO_QUEUES);
io_queue_num = min_t(int, io_queue_num, io_sq_num);
io_queue_num = min_t(int, io_queue_num,
 get_feat_ctx->max_queues.max_cq_num);
-- 
2.7.4


