Re: [PATCH net-next 2/2] tcp: remove poll() flakes with FastOpen

2017-04-19 Thread Yuchung Cheng
On Tue, Apr 18, 2017 at 9:45 AM, Eric Dumazet  wrote:
>
> When using TCP FastOpen for an active session, we send one wakeup event
> from tcp_finish_connect(), right before the data eventually contained in
> the received SYNACK is queued to sk->sk_receive_queue.
>
> This means that depending on machine load or luck, poll() users
> might receive POLLOUT events instead of POLLIN|POLLOUT
>
> To fix this, we need to move the call to sk->sk_state_change()
> after the (optional) call to tcp_rcv_fastopen_synack()
>
> Signed-off-by: Eric Dumazet 
Acked-by: Yuchung Cheng 

Thanks for the fix!

> ---
>  net/ipv4/tcp_input.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> 37e2aa925f62395cfb48145cd3a76b6afebb64b1..341f021f02a2931cd75b2e1e71af9729fc4c7895
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5580,10 +5580,6 @@ void tcp_finish_connect(struct sock *sk, struct 
> sk_buff *skb)
> else
> tp->pred_flags = 0;
>
> -   if (!sock_flag(sk, SOCK_DEAD)) {
> -   sk->sk_state_change(sk);
> -   sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
> -   }
>  }
>
>  static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
> @@ -5652,6 +5648,7 @@ static int tcp_rcv_synsent_state_process(struct sock 
> *sk, struct sk_buff *skb,
> struct tcp_sock *tp = tcp_sk(sk);
> struct tcp_fastopen_cookie foc = { .len = -1 };
> int saved_clamp = tp->rx_opt.mss_clamp;
> +   bool fastopen_fail;
>
> tcp_parse_options(skb, >rx_opt, 0, );
> if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
> @@ -5755,10 +5752,15 @@ static int tcp_rcv_synsent_state_process(struct sock 
> *sk, struct sk_buff *skb,
>
> tcp_finish_connect(sk, skb);
>
> -   if ((tp->syn_fastopen || tp->syn_data) &&
> -   tcp_rcv_fastopen_synack(sk, skb, ))
> -   return -1;
> +   fastopen_fail = (tp->syn_fastopen || tp->syn_data) &&
> +   tcp_rcv_fastopen_synack(sk, skb, );
>
> +   if (!sock_flag(sk, SOCK_DEAD)) {
> +   sk->sk_state_change(sk);
> +   sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
> +   }
> +   if (fastopen_fail)
> +   return -1;
> if (sk->sk_write_pending ||
> icsk->icsk_accept_queue.rskq_defer_accept ||
> icsk->icsk_ack.pingpong) {
> --
> 2.12.2.762.g0e3151a226-goog
>


Re: [net-next 00/15][pull request] 10GbE Intel Wired LAN Driver Updates 2017-04-18

2017-04-19 Thread Alexei Starovoitov
On Tue, Apr 18, 2017 at 04:01:50PM -0700, Jeff Kirsher wrote:
> The following are changes since commit 
> 4116c97689b9b1732ac5b68afd922406f9fc842e:
>   Merge branch 'ftgmac100-batch5-features'
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE

Nothing against this set of commits,
but I saw ixgbe xdp support in there pending for few months.
Now it's gone. What happened?



Re: [PATCH ipsec-next] esp4/6: Fix GSO path for non-GSO SW-crypto packets

2017-04-19 Thread Steffen Klassert
On Wed, Apr 19, 2017 at 08:41:01AM +0300, il...@mellanox.com wrote:
> From: Ilan Tayari 
> 
> If esp*_offload module is loaded, outbound packets take the
> GSO code path, being encapsulated at layer 3, but encrypted
> in layer 2. validate_xmit_xfrm calls esp*_xmit for that.
> 
> esp*_xmit was wrongfully detecting these packets as going
> through hardware crypto offload, while in fact they should
> be encrypted in software, causing plaintext leakage to
> the network, and also dropping at the receiver side.
> 
> Perform the encryption in esp*_xmit, if the SA doesn't have
> a hardware offload_handle.
> 
> Also, align esp6 code to esp4 logic.
> 
> Fixes: fca11ebde3f0 ("esp4: Reorganize esp_output")
> Fixes: 383d0350f2cc ("esp6: Reorganize esp_output")
> Signed-off-by: Ilan Tayari 

Applied to ipsec-next, thanks Ilan!


Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?

2017-04-19 Thread John Fastabend
On 17-04-19 09:58 PM, Alexei Starovoitov wrote:
> On Wed, Apr 19, 2017 at 09:38:44PM -0700, John Fastabend wrote:
>> On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
>>> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:

 Is there a concrete reason that all the proposed future cases like sockets
 have to be handled within the very same XDP_REDIRECT return code? F.e. why
 not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and 
 future
 ones would get a different return code f.e. XDP_TX_SK only handling sockets
 when we get there implementation-wise?
>>>
>>> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
>>> out of this discussion.
>>> XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 
>>> netdev.
>>> If we make it too generic it will lose performance.
>>>
>>> For cls_bpf the ifindex concept is symmetric. The program can access it as
>>> skb->ifindex on receive and can redirect to another ifindex via 
>>> bpf_redirect() helper.
>>> Since netdev is not locked, it's actually big plus, since container 
>>> management
>>> control plane can simply delete netns+veth and it goes away. The program can
>>> have dangling ifindex (if control plane is buggy and didn't update the bpf 
>>> side),
>>> but it's harmless. Packets that redirect to non-existing ifindex are 
>>> dropped.
>>> This approach already understood and works well, so for XDP I suggest to use
>>> the same approach initially before starting to reinvent the wheel.
>>> struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
>>> to L2 netdev only. That's it. Simple and easy.
>>> I think the main use cases in John's and Jesper's minds is something like
>>> xdpswitch where packets are coming from VMs and from physical eths and
>>> being redirected to either physical eth or to VM via upcoming vhost+xdp 
>>> support.
>>> This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
>>
>> hmm I must be missing something, bpf_redirect() helper should be used as a
>> return statement, e.g.
>>
>>  return bpf_redirect(ifindex, flags);
>>
>> Its a bit awkward to use in any other way. You would have to ensure the
>> program returns TC_ACT_REDIRECT in all cases I presume. It seems incredibly
>> fragile and gains nothing as far as I can tell. bpf_redirect uses per_cpu
>> data and expects the core to call skb_do_redirect() to push the packet into
>> the correct netdev. bpf_redirect() does not modify the skb->ifindex value,
>> looking at the code now.
>>
>> Are you suggesting using xdp_md to store the ifindex value instead of a per
>> cpu variable in the redirect helper? 
> 
> no. i'm suggesting to store in per-cpu scratch area just like cls_bpf does.
> 
>> Do you really mean the xdp_md struct in
>> the uapi headers? 
> 
> yes. since 'ifindex' needs to be part of xdp_md struct in read-only way.
> Just like in cls_bpf does it.
> Otherwise if we attach the same program to multiple taps it won't
> know which tap the traffic arriving on and won't be able to redirect properly.
> 
>> I don't see why it needs to be in the UAPI at all. If we
>> don't like per cpu variables it could be pushed as part of xdp_buff I guess.
> 
> It's not about like or dont-like per-cpu scratch area.
> My main point: it works just fine for cls_bpf and i'm suggesting to do
> the same for xdp_redirect, since no one ever complained about that bit of 
> cls_bpf.
> 
>> My suggestion is we could add an ifindex to the xdp_md structure and have the
>> receiving NIC driver populate the value assuming it is useful to programs. 
>> But,
>> if we use cls_bpf as a model then xdp_md ifindex is more or less independent 
>> of
>> the redirect helper. 
> 
> exactly. arriving ifindex is independent of xdp_redirect helper.
> 
>> In my opinion we should avoid diverging cls bpf and xdp bpf
>> in subtle ways like handling of ifindex and redirect.
> 
> exactly. I'm saying the same thing.
> I'm not sure which part of my proposal was so confusing.
> Sorry about that.
> 

Aha what you are suggesting is exactly what I prototyped on virtio_net so I'm
happy :) I just didn't manage to parse it for whatever reason must be getting
tired.

Thanks,
John




Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?

2017-04-19 Thread Alexei Starovoitov
On Wed, Apr 19, 2017 at 09:38:44PM -0700, John Fastabend wrote:
> On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
> > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> >>
> >> Is there a concrete reason that all the proposed future cases like sockets
> >> have to be handled within the very same XDP_REDIRECT return code? F.e. why
> >> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and 
> >> future
> >> ones would get a different return code f.e. XDP_TX_SK only handling sockets
> >> when we get there implementation-wise?
> > 
> > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
> > out of this discussion.
> > XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 
> > netdev.
> > If we make it too generic it will lose performance.
> > 
> > For cls_bpf the ifindex concept is symmetric. The program can access it as
> > skb->ifindex on receive and can redirect to another ifindex via 
> > bpf_redirect() helper.
> > Since netdev is not locked, it's actually big plus, since container 
> > management
> > control plane can simply delete netns+veth and it goes away. The program can
> > have dangling ifindex (if control plane is buggy and didn't update the bpf 
> > side),
> > but it's harmless. Packets that redirect to non-existing ifindex are 
> > dropped.
> > This approach already understood and works well, so for XDP I suggest to use
> > the same approach initially before starting to reinvent the wheel.
> > struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
> > to L2 netdev only. That's it. Simple and easy.
> > I think the main use cases in John's and Jesper's minds is something like
> > xdpswitch where packets are coming from VMs and from physical eths and
> > being redirected to either physical eth or to VM via upcoming vhost+xdp 
> > support.
> > This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
> 
> hmm I must be missing something, bpf_redirect() helper should be used as a
> return statement, e.g.
> 
>   return bpf_redirect(ifindex, flags);
> 
> Its a bit awkward to use in any other way. You would have to ensure the
> program returns TC_ACT_REDIRECT in all cases I presume. It seems incredibly
> fragile and gains nothing as far as I can tell. bpf_redirect uses per_cpu
> data and expects the core to call skb_do_redirect() to push the packet into
> the correct netdev. bpf_redirect() does not modify the skb->ifindex value,
> looking at the code now.
> 
> Are you suggesting using xdp_md to store the ifindex value instead of a per
> cpu variable in the redirect helper? 

no. i'm suggesting to store in per-cpu scratch area just like cls_bpf does.

> Do you really mean the xdp_md struct in
> the uapi headers? 

yes. since 'ifindex' needs to be part of xdp_md struct in read-only way.
Just like in cls_bpf does it.
Otherwise if we attach the same program to multiple taps it won't
know which tap the traffic arriving on and won't be able to redirect properly.

> I don't see why it needs to be in the UAPI at all. If we
> don't like per cpu variables it could be pushed as part of xdp_buff I guess.

It's not about like or dont-like per-cpu scratch area.
My main point: it works just fine for cls_bpf and i'm suggesting to do
the same for xdp_redirect, since no one ever complained about that bit of 
cls_bpf.

> My suggestion is we could add an ifindex to the xdp_md structure and have the
> receiving NIC driver populate the value assuming it is useful to programs. 
> But,
> if we use cls_bpf as a model then xdp_md ifindex is more or less independent 
> of
> the redirect helper. 

exactly. arriving ifindex is independent of xdp_redirect helper.

> In my opinion we should avoid diverging cls bpf and xdp bpf
> in subtle ways like handling of ifindex and redirect.

exactly. I'm saying the same thing.
I'm not sure which part of my proposal was so confusing.
Sorry about that.



Re: XDP question: best API for returning/setting egress port?

2017-04-19 Thread John Fastabend
[...]

> JohnF, any test results with this you can share?  Presumably you tested with
> virtio-net, right?
> 

For my patches using the xdp_redirect with virtio-net showed no measurable
difference from running just the straight XDP_TX in the PPS column. However
virtio_net XDP_TX was quite low to start with (and I was mostly just looking
at functionality so I don't recall if there was a CPU usage hit) so I expect on
ixgbe to see a difference. However I have plenty of cpu overhead to consume on
the 10Gbps card so it will likely just be a CPU % cost not a PPS cost. I'll
bother Alex about the 40Gbps cards ;)

Anyways, I'll put it on my list to rerun. But like the xdp generic pieces it
wont be for a couple days until I get around to it. Sorry need to take care of
a few things first.

Thanks,
John


Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?

2017-04-19 Thread John Fastabend
On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>>
>> Is there a concrete reason that all the proposed future cases like sockets
>> have to be handled within the very same XDP_REDIRECT return code? F.e. why
>> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
>> ones would get a different return code f.e. XDP_TX_SK only handling sockets
>> when we get there implementation-wise?
> 
> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
> out of this discussion.
> XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 
> netdev.
> If we make it too generic it will lose performance.
> 
> For cls_bpf the ifindex concept is symmetric. The program can access it as
> skb->ifindex on receive and can redirect to another ifindex via 
> bpf_redirect() helper.
> Since netdev is not locked, it's actually big plus, since container management
> control plane can simply delete netns+veth and it goes away. The program can
> have dangling ifindex (if control plane is buggy and didn't update the bpf 
> side),
> but it's harmless. Packets that redirect to non-existing ifindex are dropped.
> This approach already understood and works well, so for XDP I suggest to use
> the same approach initially before starting to reinvent the wheel.
> struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
> to L2 netdev only. That's it. Simple and easy.
> I think the main use cases in John's and Jesper's minds is something like
> xdpswitch where packets are coming from VMs and from physical eths and
> being redirected to either physical eth or to VM via upcoming vhost+xdp 
> support.
> This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.

hmm I must be missing something, bpf_redirect() helper should be used as a
return statement, e.g.

return bpf_redirect(ifindex, flags);

Its a bit awkward to use in any other way. You would have to ensure the
program returns TC_ACT_REDIRECT in all cases I presume. It seems incredibly
fragile and gains nothing as far as I can tell. bpf_redirect uses per_cpu
data and expects the core to call skb_do_redirect() to push the packet into
the correct netdev. bpf_redirect() does not modify the skb->ifindex value,
looking at the code now.

Are you suggesting using xdp_md to store the ifindex value instead of a per
cpu variable in the redirect helper? Do you really mean the xdp_md struct in
the uapi headers? I don't see why it needs to be in the UAPI at all. If we
don't like per cpu variables it could be pushed as part of xdp_buff I guess.

My suggestion is we could add an ifindex to the xdp_md structure and have the
receiving NIC driver populate the value assuming it is useful to programs. But,
if we use cls_bpf as a model then xdp_md ifindex is more or less independent of
the redirect helper. In my opinion we should avoid diverging cls bpf and xdp bpf
in subtle ways like handling of ifindex and redirect.

> 
> Once we have vhost+xdp and all other bits implemented, we must come back
> to this discussion about having port mapping table. As I mentioned
> during netconf I think it's very useful, but I don't think we should
> gate vhost+xdp and xdp_redirect work on this discussion.
> As far as this port mapping table we would need 'port' field in xdp_md as well
> and xdp_redirect_port() done via helper or via extra 'out_port' field in 
> xdp_md
> plus another XDP_REDIRECT_PORT action code.
> The actual port table (array) should be populated by user space with netdevs
> and these netdev will have their refcnt incremented. Then we'll have 
> discussion
> what to do with netdev_unregister notifiers, whether they should be 
> auto-removed
> from port table or bpf should have a chance to be notified and act on it.
> Such port mapping will allow us to optimize inevitable call
> dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
> away, since netdevs will be stored in the port table and direct deref
> port_map_array[xdp_md->out_port] will give us target netdev quickly.
> It's nice optimization and there are other more powerful optimizations we
> can do with such port table (since we will know in advance which netdevs
> the program will be redirecting too), but I still think we should do
> ifindex based xdp_redirect first and only add this port table later.
> 

Agreed on all this further optimization would be welcome of course after
basic implementation is in place.



Re: [PATCH net-next] brcmfmac: fix build without CONFIG_BRCMFMAC_PROTO_BCDC

2017-04-19 Thread Kalle Valo
Arnd Bergmann  writes:

> With CONFIG_BRCMFMAC_PROTO_BCDC unset, we cannot build the fwsignal.c file:
>
> drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c: In function 
> 'brcmf_fws_notify_credit_map':
> drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c:1590:31: error: 
> implicit declaration of function 'drvr_to_fws'; did you mean 'dev_to_psd'? 
> [-Werror=implicit-function-declaration]
>   struct brcmf_fws_info *fws = drvr_to_fws(ifp->drvr);
> drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c:1590:31: error: 
> initialization makes pointer from integer without a cast 
> [-Werror=int-conversion]
> drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c:1621:31: error: 
> initialization makes pointer from integer without a cast 
> [-Werror=int-conversion]
>
> However, as pointed out in the changeset description for the patch that caused
> the problem, fwsignal.c is only required when CONFIG_BRCMFMAC_PROTO_BCDC is
> enabled, so we can simply change the Makefile to build it conditionally.
>
> Fixes: acf8ac41dd73 ("brcmfmac: remove reference to fwsignal data from struct 
> brcmf_pub")
> Signed-off-by: Arnd Bergmann 

The fix is actually for wireless-drivers-next, acf8ac41dd73 is not in
net-next yet. And I already applied an identical fix from Arend:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git/commit/?id=26ecfe01790381c4caa65ec9cce484c623f092c4

-- 
Kalle Valo


[PATCH 1/1] bonding: remove unnecessary return

2017-04-19 Thread Zhu Yanjun
The type of the function bond_update_speed_duplex is void.
So the return in the end of the function should be removed.

Signed-off-by: Zhu Yanjun 
---
 drivers/net/bonding/bond_main.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 8a4ba8b..af9f0ce 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -399,8 +399,6 @@ static void bond_update_speed_duplex(struct slave *slave)
 
slave->speed = ecmd.base.speed;
slave->duplex = ecmd.base.duplex;
-
-   return;
 }
 
 const char *bond_slave_link_status(s8 link)
-- 
2.7.4



Re: [PATCH] xen/9pfs: select CONFIG_XEN_XENBUS_FRONTEND

2017-04-19 Thread Juergen Gross
On 20/04/17 01:10, Stefano Stabellini wrote:
> Juergen, I have committed this patch to for-linus-4.12 and linux-next, I
> hope that's OK.

Sure.


Juergen

> 
> Og Wed, 19 Apr 2017, Stefano Stabellini wrote:
>> On Wed, 19 Apr 2017, Arnd Bergmann wrote:
>>> All Xen frontends need to select this symbol to avoid a link error:
>>>
>>> net/built-in.o: In function `p9_trans_xen_init':
>>> :(.text+0x149e9c): undefined reference to `__xenbus_register_frontend'
>>>
>>> Fixes: d4b40a02f837 ("xen/9pfs: build 9pfs Xen transport driver")
>>> Signed-off-by: Arnd Bergmann 
>>
>> Thank you for the fix!
>>
>> Reviewed-by: Stefano Stabellini 
>>> ---
>>>  net/9p/Kconfig | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/net/9p/Kconfig b/net/9p/Kconfig
>>> index 3f286f1bd1d3..e6014e0e51f7 100644
>>> --- a/net/9p/Kconfig
>>> +++ b/net/9p/Kconfig
>>> @@ -24,6 +24,7 @@ config NET_9P_VIRTIO
>>>  
>>>  config NET_9P_XEN
>>> depends on XEN
>>> +   select XEN_XENBUS_FRONTEND
>>> tristate "9P Xen Transport"
>>> help
>>>   This builds support for a transport for 9pfs between
>>> -- 
>>> 2.9.0
>>>
>>
> 



xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?

2017-04-19 Thread Alexei Starovoitov
On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> 
> Is there a concrete reason that all the proposed future cases like sockets
> have to be handled within the very same XDP_REDIRECT return code? F.e. why
> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
> ones would get a different return code f.e. XDP_TX_SK only handling sockets
> when we get there implementation-wise?

yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
out of this discussion.
XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 
netdev.
If we make it too generic it will lose performance.

For cls_bpf the ifindex concept is symmetric. The program can access it as
skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() 
helper.
Since netdev is not locked, it's actually big plus, since container management
control plane can simply delete netns+veth and it goes away. The program can
have dangling ifindex (if control plane is buggy and didn't update the bpf 
side),
but it's harmless. Packets that redirect to non-existing ifindex are dropped.
This approach already understood and works well, so for XDP I suggest to use
the same approach initially before starting to reinvent the wheel.
struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
to L2 netdev only. That's it. Simple and easy.
I think the main use cases in John's and Jesper's minds is something like
xdpswitch where packets are coming from VMs and from physical eths and
being redirected to either physical eth or to VM via upcoming vhost+xdp support.
This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.

Once we have vhost+xdp and all other bits implemented, we must come back
to this discussion about having port mapping table. As I mentioned
during netconf I think it's very useful, but I don't think we should
gate vhost+xdp and xdp_redirect work on this discussion.
As far as this port mapping table we would need 'port' field in xdp_md as well
and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md
plus another XDP_REDIRECT_PORT action code.
The actual port table (array) should be populated by user space with netdevs
and these netdev will have their refcnt incremented. Then we'll have discussion
what to do with netdev_unregister notifiers, whether they should be auto-removed
from port table or bpf should have a chance to be notified and act on it.
Such port mapping will allow us to optimize inevitable call
dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
away, since netdevs will be stored in the port table and direct deref
port_map_array[xdp_md->out_port] will give us target netdev quickly.
It's nice optimization and there are other more powerful optimizations we
can do with such port table (since we will know in advance which netdevs
the program will be redirecting too), but I still think we should do
ifindex based xdp_redirect first and only add this port table later.



[net-next 05/14] i40e/i40evf: Add tracepoints

2017-04-19 Thread Jeff Kirsher
From: Scott Peterson 

This patch adds tracepoints to the i40e and i40evf drivers to which
BPF programs can be attached for feature testing and verification.
It's expected that an attached BPF program will identify and count or
log some interesting subset of traffic. The bcc-tools package is
helpful there for containing all the BPF arcana in a handy Python
wrapper. Though you can make these tracepoints log trace messages, the
messages themselves probably won't be very useful (other to verify the
tracepoint is being called while you're debugging your BPF program).

The idea here is that tracepoints have such low performance cost when
disabled that we can leave these in the upstream drivers. This may
eventually enable the instrumentation of unmodified customer systems
should the need arise to verify a NIC feature is working as expected.
In general this enables one set of feature verification tools to be
used on these drivers whether they're built with the kernel or
separately.

Users are advised against using these tracepoints for anything other
than a diagnostic tool. They have a performance impact when enabled,
and their exact placement and form may change as we see how well they
work in practice for the purposes above.

Change-ID: Id6014a7322c0e6d08068114dd20bd156f2f6435e
Signed-off-by: Scott Peterson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/Makefile|   3 +
 drivers/net/ethernet/intel/i40e/i40e_main.c |   6 +
 drivers/net/ethernet/intel/i40e/i40e_trace.h| 229 
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |   9 +
 drivers/net/ethernet/intel/i40evf/Makefile  |   3 +
 drivers/net/ethernet/intel/i40evf/i40e_trace.h  | 229 
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c   |   9 +
 drivers/net/ethernet/intel/i40evf/i40evf_main.c |   7 +
 8 files changed, 495 insertions(+)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_trace.h
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_trace.h

diff --git a/drivers/net/ethernet/intel/i40e/Makefile 
b/drivers/net/ethernet/intel/i40e/Makefile
index 4f454d364d0d..3da482c3d68d 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -28,6 +28,9 @@
 # Makefile for the Intel(R) Ethernet Connection XL710 (i40e.ko) driver
 #
 
+ccflags-y += -I$(src)
+subdir-ccflags-y += -I$(src)
+
 obj-$(CONFIG_I40E) += i40e.o
 
 i40e-objs := i40e_main.o \
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b6ec9beeebff..5e625e0c73ac 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -32,6 +32,12 @@
 #include "i40e.h"
 #include "i40e_diag.h"
 #include 
+/* All i40e tracepoints are defined by the include below, which
+ * must be included exactly once across the whole kernel with
+ * CREATE_TRACE_POINTS defined
+ */
+#define CREATE_TRACE_POINTS
+#include "i40e_trace.h"
 
 const char i40e_driver_name[] = "i40e";
 static const char i40e_driver_string[] =
diff --git a/drivers/net/ethernet/intel/i40e/i40e_trace.h 
b/drivers/net/ethernet/intel/i40e/i40e_trace.h
new file mode 100644
index ..d3e55f54a05e
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_trace.h
@@ -0,0 +1,229 @@
+/***
+ *
+ * Intel(R) 40-10 Gigabit Ethernet Connection Network Driver
+ * Copyright(c) 2013 - 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * The full GNU General Public License is included in this distribution in
+ * the file called "COPYING".
+ *
+ * Contact Information:
+ * e1000-devel Mailing List 
+ * Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+ *
+ 
**/
+
+/* Modeled on trace-events-sample.h */
+
+/* The trace subsystem name for i40e will be "i40e".
+ *
+ * This file is named i40e_trace.h.
+ *
+ * Since this include file's name is different from the trace
+ * subsystem name, we'll have to define TRACE_INCLUDE_FILE at the end
+ * of this file.
+ */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM i40e
+
+/* See trace-events-sample.h for a detailed description of why this
+ * guard clause is different from most normal include files.
+ */
+#if 

[net-next 04/14] i40e: dump VF information in debugfs

2017-04-19 Thread Jeff Kirsher
From: Mitch Williams 

Dump some internal state about VFs through debugfs. This provides
information not available with 'ip link show'. To use, write "dump vf
" to the command file, or just "dump vf" to dump information on all
of the VFs.

Change-ID: Ibe32b7f4ae55d4358c0b903217475f708ada1ecd
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c | 51 ++
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c 
b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
index c5f68cc1edcd..a3d7ec62b76c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
@@ -384,6 +384,8 @@ static void i40e_dbg_dump_vsi_seid(struct i40e_pf *pf, int 
seid)
 "base_queue = %d, num_queue_pairs = %d, num_desc = %d\n",
 vsi->base_queue, vsi->num_queue_pairs, vsi->num_desc);
dev_info(>pdev->dev, "type = %i\n", vsi->type);
+   if (vsi->type == I40E_VSI_SRIOV)
+   dev_info(>pdev->dev, "VF ID = %i\n", vsi->vf_id);
dev_info(>pdev->dev,
 "info: valid_sections = 0x%04x, switch_id = 0x%04x\n",
 vsi->info.valid_sections, vsi->info.switch_id);
@@ -694,6 +696,47 @@ static void i40e_dbg_dump_veb_all(struct i40e_pf *pf)
}
 }
 
+/**
+ * i40e_dbg_dump_vf - dump VF info
+ * @pf: the i40e_pf created in command write
+ * @vf_id: the vf_id from the user
+ **/
+static void i40e_dbg_dump_vf(struct i40e_pf *pf, int vf_id)
+{
+   struct i40e_vf *vf;
+   struct i40e_vsi *vsi;
+
+   if (!pf->num_alloc_vfs) {
+   dev_info(>pdev->dev, "no VFs allocated\n");
+   } else if ((vf_id >= 0) && (vf_id < pf->num_alloc_vfs)) {
+   vf = >vf[vf_id];
+   vsi = pf->vsi[vf->lan_vsi_idx];
+   dev_info(>pdev->dev, "vf %2d: VSI id=%d, seid=%d, qps=%d\n",
+vf_id, vf->lan_vsi_id, vsi->seid, vf->num_queue_pairs);
+   dev_info(>pdev->dev, "   num MDD=%lld, invalid 
msg=%lld, valid msg=%lld\n",
+vf->num_mdd_events,
+vf->num_invalid_msgs,
+vf->num_valid_msgs);
+   } else {
+   dev_info(>pdev->dev, "invalid VF id %d\n", vf_id);
+   }
+}
+
+/**
+ * i40e_dbg_dump_vf_all - dump VF info for all VFs
+ * @pf: the i40e_pf created in command write
+ **/
+static void i40e_dbg_dump_vf_all(struct i40e_pf *pf)
+{
+   int i;
+
+   if (!pf->num_alloc_vfs)
+   dev_info(>pdev->dev, "no VFs enabled!\n");
+   else
+   for (i = 0; i < pf->num_alloc_vfs; i++)
+   i40e_dbg_dump_vf(pf, i);
+}
+
 #define I40E_MAX_DEBUG_OUT_BUFFER (4096*4)
 /**
  * i40e_dbg_command_write - write into command datum
@@ -712,6 +755,7 @@ static ssize_t i40e_dbg_command_write(struct file *filp,
struct i40e_vsi *vsi;
int vsi_seid;
int veb_seid;
+   int vf_id;
int cnt;
 
/* don't allow partial writes */
@@ -914,6 +958,12 @@ static ssize_t i40e_dbg_command_write(struct file *filp,
i40e_dbg_dump_veb_seid(pf, vsi_seid);
else
i40e_dbg_dump_veb_all(pf);
+   } else if (strncmp(_buf[5], "vf", 2) == 0) {
+   cnt = sscanf(_buf[7], "%i", _id);
+   if (cnt > 0)
+   i40e_dbg_dump_vf(pf, vf_id);
+   else
+   i40e_dbg_dump_vf_all(pf);
} else if (strncmp(_buf[5], "desc", 4) == 0) {
int ring_id, desc_n;
if (strncmp(_buf[10], "rx", 2) == 0) {
@@ -1109,6 +1159,7 @@ static ssize_t i40e_dbg_command_write(struct file *filp,
dev_info(>pdev->dev, "dump vsi [seid]\n");
dev_info(>pdev->dev, "dump reset stats\n");
dev_info(>pdev->dev, "dump port\n");
+   dev_info(>pdev->dev, "dump vf [vf_id]\n");
dev_info(>pdev->dev,
 "dump debug fwdata   
\n");
}
-- 
2.12.2



[net-next 02/14] i40e/i40evf: Remove VF Rx csum offload for tunneled packets

2017-04-19 Thread Jeff Kirsher
From: alice michael 

Rx checksum offload for tunneled packets was never being negotiated or
requested by VF. This capability was assumed by default and enabled in
current hardware for VF. Going forward, this feature needs to be disabled
or advanced ptypes should be negotiated with PF in the future.

Change-ID: I9e54cfa8a90e03ab6956db4412f1e337ccd2c2e0
Signed-off-by: Preethi Banala 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 460171edc412..fc4b14af370d 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -831,13 +831,6 @@ static inline void i40e_rx_checksum(struct i40e_vsi *vsi,
if (rx_error & BIT(I40E_RX_DESC_ERROR_PPRS_SHIFT))
return;
 
-   /* If there is an outer header present that might contain a checksum
-* we need to bump the checksum level by 1 to reflect the fact that
-* we are indicating we validated the inner checksum.
-*/
-   if (decoded.tunnel_type >= I40E_RX_PTYPE_TUNNEL_IP_GRENAT)
-   skb->csum_level = 1;
-
/* Only report checksum unnecessary for TCP, UDP, or SCTP */
switch (decoded.inner_prot) {
case I40E_RX_PTYPE_INNER_PROT_TCP:
-- 
2.12.2



[net-next 11/14] i40e: remove I40E_FLAG_IN_NETPOLL entirely

2017-04-19 Thread Jeff Kirsher
From: Jacob Keller 

This flag was originally intended to be used to let some
driver code know when we were running from netpoll.
Ultimately this was not necessary and we never used it.
Let's remove it

Change-ID: I43b72483d91c1638071d2a7f389ab171ec5b796a
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf.h 
b/drivers/net/ethernet/intel/i40evf/i40evf.h
index 7bbd11abed08..40f56e2335df 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf.h
+++ b/drivers/net/ethernet/intel/i40evf/i40evf.h
@@ -202,7 +202,6 @@ struct i40evf_adapter {
 
u32 flags;
 #define I40EVF_FLAG_RX_CSUM_ENABLEDBIT(0)
-#define I40EVF_FLAG_IN_NETPOLL BIT(4)
 #define I40EVF_FLAG_IMIR_ENABLED   BIT(5)
 #define I40EVF_FLAG_MQ_CAPABLE BIT(6)
 #define I40EVF_FLAG_PF_COMMS_FAILEDBIT(8)
@@ -221,7 +220,6 @@ struct i40evf_adapter {
 /* duplicates for common code */
 #define I40E_FLAG_FDIR_ATR_ENABLED 0
 #define I40E_FLAG_DCB_ENABLED  0
-#define I40E_FLAG_IN_NETPOLL   I40EVF_FLAG_IN_NETPOLL
 #define I40E_FLAG_RX_CSUM_ENABLED  I40EVF_FLAG_RX_CSUM_ENABLED
 #define I40E_FLAG_WB_ON_ITR_CAPABLEI40EVF_FLAG_WB_ON_ITR_CAPABLE
 #define I40E_FLAG_OUTER_UDP_CSUM_CAPABLE   
I40EVF_FLAG_OUTER_UDP_CSUM_CAPABLE
-- 
2.12.2



[net-next 13/14] i40e: reset all VFs in parallel when rebuilding PF

2017-04-19 Thread Jeff Kirsher
From: Jacob Keller 

When there are a lot of active VFs, it can take multiple seconds to
finish resetting all of them during certain flows., which can cause some
VFs to fail to wait long enough for the reset to occur. The user might
see messages like "Never saw reset" or "Reset never finished" and the VF
driver will stop functioning properly.

The naive solution would be to simply increase the wait timer. We can
get much more clever. Notice that i40e_reset_vf is run in a serialized
fashion, and includes lots of delays.

There are two prominent delays which take most of the time. First, when
we begin resetting VFs, we have multiple 10ms delays which accrue
because we reset each VF in a serial fashion. These delays accumulate to
almost 4 seconds when handling the maximum number of VFs (128).

Secondly, there is a massive 50ms delay for each time we disable queues
on a VSI. This delay is necessary to allow HW to finish disabling queues
before we restore functionality. However, just like with the first case,
we are paying the cost for each VF, rather than disabling all VFs and
waiting once.

Both of these can be fixed, but required some previous refactoring to
handle the special case. First, we will need the
i40e_vsi_wait_queues_disabled function which was previously DCB
specific. Second, we will need to implement our own
i40e_vsi_stop_rings_no_wait function which will handle the stopping of
rings without the delays.

Finally, implement an i40e_reset_all_vfs function, which will first
start the reset of all VFs, and pay the wait cost all at once, rather
than serially waiting for each VF before we start processing then next
one. After the VF has been reset, we'll disable all the VF queues, and
then wait for them to disable. Again, we'll organize the flow such that
we pay the wait cost only once.

Finally, after we've disabled queues we'll go ahead and begin restoring
VF functionality. The result is reducing the wait time by a large factor
and ensuring that VFs do not timeout when waiting in the VF driver.

Change-ID: Ia6e8cf8d98131b78aec89db78afb8d905c9b12be
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |   2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c|  32 +--
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 100 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |   1 +
 4 files changed, 129 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index b629f3adcecb..dfab7b030b67 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -868,6 +868,8 @@ void i40e_notify_client_of_vf_msg(struct i40e_vsi *vsi, u32 
vf_id,
 
 int i40e_vsi_start_rings(struct i40e_vsi *vsi);
 void i40e_vsi_stop_rings(struct i40e_vsi *vsi);
+void i40e_vsi_stop_rings_no_wait(struct  i40e_vsi *vsi);
+int i40e_vsi_wait_queues_disabled(struct i40e_vsi *vsi);
 int i40e_reconfig_rss_queues(struct i40e_pf *pf, int queue_count);
 struct i40e_veb *i40e_veb_setup(struct i40e_pf *pf, u16 flags, u16 uplink_seid,
u16 downlink_seid, u8 enabled_tc);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 2bf5bb1b4627..2fe1fbdaafae 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -4159,6 +4159,29 @@ void i40e_vsi_stop_rings(struct i40e_vsi *vsi)
 }
 
 /**
+ * i40e_vsi_stop_rings_no_wait - Stop a VSI's rings and do not delay
+ * @vsi: the VSI being shutdown
+ *
+ * This function stops all the rings for a VSI but does not delay to verify
+ * that rings have been disabled. It is expected that the caller is shutting
+ * down multiple VSIs at once and will delay together for all the VSIs after
+ * initiating the shutdown. This is particularly useful for shutting down lots
+ * of VFs together. Otherwise, a large delay can be incurred while configuring
+ * each VSI in serial.
+ **/
+void i40e_vsi_stop_rings_no_wait(struct i40e_vsi *vsi)
+{
+   struct i40e_pf *pf = vsi->back;
+   int i, pf_q;
+
+   pf_q = vsi->base_queue;
+   for (i = 0; i < vsi->num_queue_pairs; i++, pf_q++) {
+   i40e_control_tx_q(pf, pf_q, false);
+   i40e_control_rx_q(pf, pf_q, false);
+   }
+}
+
+/**
  * i40e_vsi_free_irq - Free the irq association with the OS
  * @vsi: the VSI being configured
  **/
@@ -4488,14 +4511,13 @@ static void i40e_pf_unquiesce_all_vsi(struct i40e_pf 
*pf)
}
 }
 
-#ifdef CONFIG_I40E_DCB
 /**
  * i40e_vsi_wait_queues_disabled - Wait for VSI's queues to be disabled
  * @vsi: the VSI being configured
  *
  * Wait until all queues on a given VSI have been disabled.
  **/
-static int i40e_vsi_wait_queues_disabled(struct 

[net-next 08/14] i40e: factor out queue control from i40e_vsi_control_(tx|rx)

2017-04-19 Thread Jeff Kirsher
From: Jacob Keller 

A future patch will need to be able to handle controlling queues without
waiting until all VSIs are handled. Factor out the direct queue
modification so that we can easily re-use this code. The result is also
a bit easier to read since we don't embed multiple single-letter loop
counters.

Change-ID: Id923cbfa43127b1c24d8ed4f809b1012c736d9ac
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 139 ++--
 1 file changed, 89 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 5c2ceb247959..2bf5bb1b4627 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3919,6 +3919,8 @@ static void i40e_netpoll(struct net_device *netdev)
 }
 #endif
 
+#define I40E_QTX_ENA_WAIT_COUNT 50
+
 /**
  * i40e_pf_txq_wait - Wait for a PF's Tx queue to be enabled or disabled
  * @pf: the PF being configured
@@ -3949,6 +3951,50 @@ static int i40e_pf_txq_wait(struct i40e_pf *pf, int 
pf_q, bool enable)
 }
 
 /**
+ * i40e_control_tx_q - Start or stop a particular Tx queue
+ * @pf: the PF structure
+ * @pf_q: the PF queue to configure
+ * @enable: start or stop the queue
+ *
+ * This function enables or disables a single queue. Note that any delay
+ * required after the operation is expected to be handled by the caller of
+ * this function.
+ **/
+static void i40e_control_tx_q(struct i40e_pf *pf, int pf_q, bool enable)
+{
+   struct i40e_hw *hw = >hw;
+   u32 tx_reg;
+   int i;
+
+   /* warn the TX unit of coming changes */
+   i40e_pre_tx_queue_cfg(>hw, pf_q, enable);
+   if (!enable)
+   usleep_range(10, 20);
+
+   for (i = 0; i < I40E_QTX_ENA_WAIT_COUNT; i++) {
+   tx_reg = rd32(hw, I40E_QTX_ENA(pf_q));
+   if (((tx_reg >> I40E_QTX_ENA_QENA_REQ_SHIFT) & 1) ==
+   ((tx_reg >> I40E_QTX_ENA_QENA_STAT_SHIFT) & 1))
+   break;
+   usleep_range(1000, 2000);
+   }
+
+   /* Skip if the queue is already in the requested state */
+   if (enable == !!(tx_reg & I40E_QTX_ENA_QENA_STAT_MASK))
+   return;
+
+   /* turn on/off the queue */
+   if (enable) {
+   wr32(hw, I40E_QTX_HEAD(pf_q), 0);
+   tx_reg |= I40E_QTX_ENA_QENA_REQ_MASK;
+   } else {
+   tx_reg &= ~I40E_QTX_ENA_QENA_REQ_MASK;
+   }
+
+   wr32(hw, I40E_QTX_ENA(pf_q), tx_reg);
+}
+
+/**
  * i40e_vsi_control_tx - Start or stop a VSI's rings
  * @vsi: the VSI being configured
  * @enable: start or stop the rings
@@ -3956,39 +4002,13 @@ static int i40e_pf_txq_wait(struct i40e_pf *pf, int 
pf_q, bool enable)
 static int i40e_vsi_control_tx(struct i40e_vsi *vsi, bool enable)
 {
struct i40e_pf *pf = vsi->back;
-   struct i40e_hw *hw = >hw;
-   int i, j, pf_q, ret = 0;
-   u32 tx_reg;
+   int i, pf_q, ret = 0;
 
pf_q = vsi->base_queue;
for (i = 0; i < vsi->num_queue_pairs; i++, pf_q++) {
+   i40e_control_tx_q(pf, pf_q, enable);
 
-   /* warn the TX unit of coming changes */
-   i40e_pre_tx_queue_cfg(>hw, pf_q, enable);
-   if (!enable)
-   usleep_range(10, 20);
-
-   for (j = 0; j < 50; j++) {
-   tx_reg = rd32(hw, I40E_QTX_ENA(pf_q));
-   if (((tx_reg >> I40E_QTX_ENA_QENA_REQ_SHIFT) & 1) ==
-   ((tx_reg >> I40E_QTX_ENA_QENA_STAT_SHIFT) & 1))
-   break;
-   usleep_range(1000, 2000);
-   }
-   /* Skip if the queue is already in the requested state */
-   if (enable == !!(tx_reg & I40E_QTX_ENA_QENA_STAT_MASK))
-   continue;
-
-   /* turn on/off the queue */
-   if (enable) {
-   wr32(hw, I40E_QTX_HEAD(pf_q), 0);
-   tx_reg |= I40E_QTX_ENA_QENA_REQ_MASK;
-   } else {
-   tx_reg &= ~I40E_QTX_ENA_QENA_REQ_MASK;
-   }
-
-   wr32(hw, I40E_QTX_ENA(pf_q), tx_reg);
-   /* No waiting for the Tx queue to disable */
+   /* Don't wait to disable when port Tx is suspended */
if (!enable && test_bit(__I40E_PORT_TX_SUSPENDED, >state))
continue;
 
@@ -4035,6 +4055,43 @@ static int i40e_pf_rxq_wait(struct i40e_pf *pf, int 
pf_q, bool enable)
 }
 
 /**
+ * i40e_control_rx_q - Start or stop a particular Rx queue
+ * @pf: the PF structure
+ * @pf_q: the PF queue to configure
+ * @enable: start or stop the queue
+ *
+ * This function enables or disables a single queue. Note that any delay
+ * 

[net-next 03/14] i40e: Fix support for flow director programming status

2017-04-19 Thread Jeff Kirsher
From: Alexander Duyck 

This patch fixes an issue I introduced when I converted the code over to
using the length field to determine if a descriptor was done or not. It
turns out that we are also processing programming descriptors in the Rx
path and need to have these processed even though the length field will be
0 on these packets.  What will happen with a programming descriptor is that
we will receive a descriptor that has the SPH bit set, and the header
length and packet length fields cleared.

To account for this we should be checking for the bit for split header
being set even though we aren't actually using header split. This bit is
set in the length field to indicate if a programming descriptor response is
contained in the descriptor. Since we don't support header split we don't
need to perform the extra checks of using a fixed value for the entire
length field.

In addition I am moving the function for checking if a filter is a
programming status filter into the i40e_txrx.c file since there is no
longer support for FCoE it doesn't make sense to keep this file in i40e.h.

Change-ID: I12c359c3dc70adb9d6b92b27324bb2c7f04c1a06
Signed-off-by: Alexander Duyck 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h| 15 
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 50 ---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  9 ++---
 3 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index e987503f8517..b629f3adcecb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -763,21 +763,6 @@ static inline void i40e_vsi_setup_irqhandler(struct 
i40e_vsi *vsi,
 }
 
 /**
- * i40e_rx_is_programming_status - check for programming status descriptor
- * @qw: the first quad word of the program status descriptor
- *
- * The value of in the descriptor length field indicate if this
- * is a programming status descriptor for flow director or FCoE
- * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
- * it is a packet descriptor.
- **/
-static inline bool i40e_rx_is_programming_status(u64 qw)
-{
-   return I40E_RX_PROG_STATUS_DESC_LENGTH ==
-   (qw >> I40E_RX_PROG_STATUS_DESC_LENGTH_SHIFT);
-}
-
-/**
  * i40e_get_fd_cnt_all - get the total FD filter space available
  * @pf: pointer to the PF struct
  **/
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 20691d2bf113..843d7ffe1784 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1038,9 +1038,29 @@ static bool i40e_set_new_dynamic_itr(struct 
i40e_ring_container *rc)
 }
 
 /**
+ * i40e_rx_is_programming_status - check for programming status descriptor
+ * @qw: qword representing status_error_len in CPU ordering
+ *
+ * The value of in the descriptor length field indicate if this
+ * is a programming status descriptor for flow director or FCoE
+ * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
+ * it is a packet descriptor.
+ **/
+static inline bool i40e_rx_is_programming_status(u64 qw)
+{
+   /* The Rx filter programming status and SPH bit occupy the same
+* spot in the descriptor. Since we don't support packet split we
+* can just reuse the bit as an indication that this is a
+* programming status descriptor.
+*/
+   return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
+}
+
+/**
  * i40e_clean_programming_status - clean the programming status descriptor
  * @rx_ring: the rx ring that has this descriptor
  * @rx_desc: the rx descriptor written back by HW
+ * @qw: qword representing status_error_len in CPU ordering
  *
  * Flow director should handle FD_FILTER_STATUS to check its filter programming
  * status being successful or not and take actions accordingly. FCoE should
@@ -1048,12 +1068,18 @@ static bool i40e_set_new_dynamic_itr(struct 
i40e_ring_container *rc)
  *
  **/
 static void i40e_clean_programming_status(struct i40e_ring *rx_ring,
- union i40e_rx_desc *rx_desc)
+ union i40e_rx_desc *rx_desc,
+ u64 qw)
 {
-   u64 qw;
+   u32 ntc = rx_ring->next_to_clean + 1;
u8 id;
 
-   qw = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+   /* fetch, update, and store next to clean */
+   ntc = (ntc < rx_ring->count) ? ntc : 0;
+   rx_ring->next_to_clean = ntc;
+
+   prefetch(I40E_RX_DESC(rx_ring, ntc));
+
id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
 
@@ -1911,11 +1937,6 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 
prefetch(I40E_RX_DESC(rx_ring, ntc));
 

[net-next 06/14] i40e: new AQ commands

2017-04-19 Thread Jeff Kirsher
From: Jingjing Wu 

Add admin queue functions for Pipeline Personalization Profile AQ
commands:
 - Write Recipe Command buffer (Opcode: 0x0270)
 - Get Applied Profiles list (Opcode: 0x0271)

Change-ID: I558b4145364140f624013af48d4bbf79d21ebb0d
Signed-off-by: Jingjing Wu 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |  34 
 drivers/net/ethernet/intel/i40e/i40e_common.c  | 212 +
 drivers/net/ethernet/intel/i40e/i40e_prototype.h   |  17 ++
 drivers/net/ethernet/intel/i40e/i40e_type.h|  80 
 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h|  34 
 drivers/net/ethernet/intel/i40evf/i40e_common.c| 212 +
 drivers/net/ethernet/intel/i40evf/i40e_prototype.h |  17 ++
 drivers/net/ethernet/intel/i40evf/i40e_type.h  |  80 
 8 files changed, 686 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h 
b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
index 251074c677c4..5eb04114e13f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
@@ -190,6 +190,10 @@ enum i40e_admin_queue_opc {
i40e_aqc_opc_add_mirror_rule= 0x0260,
i40e_aqc_opc_delete_mirror_rule = 0x0261,
 
+   /* Pipeline Personalization Profile */
+   i40e_aqc_opc_write_personalization_profile  = 0x0270,
+   i40e_aqc_opc_get_personalization_profile_list   = 0x0271,
+
/* DCB commands */
i40e_aqc_opc_dcb_ignore_pfc = 0x0301,
i40e_aqc_opc_dcb_updated= 0x0302,
@@ -1431,6 +1435,36 @@ struct i40e_aqc_add_delete_mirror_rule_completion {
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_add_delete_mirror_rule_completion);
 
+/* Pipeline Personalization Profile */
+struct i40e_aqc_write_personalization_profile {
+   u8  flags;
+   u8  reserved[3];
+   __le32  profile_track_id;
+   __le32  addr_high;
+   __le32  addr_low;
+};
+
+I40E_CHECK_CMD_LENGTH(i40e_aqc_write_personalization_profile);
+
+struct i40e_aqc_write_ppp_resp {
+   __le32 error_offset;
+   __le32 error_info;
+   __le32 addr_high;
+   __le32 addr_low;
+};
+
+struct i40e_aqc_get_applied_profiles {
+   u8  flags;
+#define I40E_AQC_GET_PPP_GET_CONF  0x1
+#define I40E_AQC_GET_PPP_GET_RDPU_CONF 0x2
+   u8  rsv[3];
+   __le32  reserved;
+   __le32  addr_high;
+   __le32  addr_low;
+};
+
+I40E_CHECK_CMD_LENGTH(i40e_aqc_get_applied_profiles);
+
 /* DCB 0x03xx*/
 
 /* PFC Ignore (direct 0x0301)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index f9db95aa3a20..24f020655291 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -5042,3 +5042,215 @@ void i40e_write_rx_ctl(struct i40e_hw *hw, u32 
reg_addr, u32 reg_val)
if (status || use_register)
wr32(hw, reg_addr, reg_val);
 }
+
+/**
+ * i40e_aq_write_ppp - Write pipeline personalization profile (ppp)
+ * @hw: pointer to the hw struct
+ * @buff: command buffer (size in bytes = buff_size)
+ * @buff_size: buffer size in bytes
+ * @track_id: package tracking id
+ * @error_offset: returns error offset
+ * @error_info: returns error information
+ * @cmd_details: pointer to command details structure or NULL
+ **/
+enum
+i40e_status_code i40e_aq_write_ppp(struct i40e_hw *hw, void *buff,
+  u16 buff_size, u32 track_id,
+  u32 *error_offset, u32 *error_info,
+  struct i40e_asq_cmd_details *cmd_details)
+{
+   struct i40e_aq_desc desc;
+   struct i40e_aqc_write_personalization_profile *cmd =
+   (struct i40e_aqc_write_personalization_profile *)
+   
+   struct i40e_aqc_write_ppp_resp *resp;
+   i40e_status status;
+
+   i40e_fill_default_direct_cmd_desc(,
+ 
i40e_aqc_opc_write_personalization_profile);
+
+   desc.flags |= cpu_to_le16(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD);
+   if (buff_size > I40E_AQ_LARGE_BUF)
+   desc.flags |= cpu_to_le16((u16)I40E_AQ_FLAG_LB);
+
+   desc.datalen = cpu_to_le16(buff_size);
+
+   cmd->profile_track_id = cpu_to_le32(track_id);
+
+   status = i40e_asq_send_command(hw, , buff, buff_size, cmd_details);
+   if (!status) {
+   resp = (struct i40e_aqc_write_ppp_resp *)
+   if (error_offset)
+   *error_offset = le32_to_cpu(resp->error_offset);
+   if (error_info)
+   *error_info = le32_to_cpu(resp->error_info);
+   }
+
+   return status;
+}
+
+/**
+ * i40e_aq_get_ppp_list - Read pipeline personalization profile (ppp)
+ * @hw: pointer to the hw struct
+ 

[net-next 12/14] i40e: split some code in i40e_reset_vf into helpers

2017-04-19 Thread Jeff Kirsher
From: Jacob Keller 

A future patch is going to want to re-use some of the code in
i40e_reset_vf, so lets break up the beginning and ending parts into
their own helper functions. The first function will be used to
initialize the reset on a VF, while the second function will be used to
finalize the reset and restore functionality.

Change-ID: I48df808b8bf09de3c2ed8c521f57b3f0ab9e5907
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 109 ++---
 1 file changed, 72 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 65c95ffc15ec..9a75ce3f1564 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -923,22 +923,19 @@ static int i40e_quiesce_vf_pci(struct i40e_vf *vf)
 }
 
 /**
- * i40e_reset_vf
+ * i40e_trigger_vf_reset
  * @vf: pointer to the VF structure
  * @flr: VFLR was issued or not
  *
- * reset the VF
+ * Trigger hardware to start a reset for a particular VF. Expects the caller
+ * to wait the proper amount of time to allow hardware to reset the VF before
+ * it cleans up and restores VF functionality.
  **/
-void i40e_reset_vf(struct i40e_vf *vf, bool flr)
+static void i40e_trigger_vf_reset(struct i40e_vf *vf, bool flr)
 {
struct i40e_pf *pf = vf->pf;
struct i40e_hw *hw = >hw;
u32 reg, reg_idx, bit_idx;
-   bool rsd = false;
-   int i;
-
-   if (test_and_set_bit(__I40E_VF_DISABLE, >state))
-   return;
 
/* warn the VF */
clear_bit(I40E_VF_STAT_ACTIVE, >vf_states);
@@ -970,37 +967,22 @@ void i40e_reset_vf(struct i40e_vf *vf, bool flr)
if (i40e_quiesce_vf_pci(vf))
dev_err(>pdev->dev, "VF %d PCI transactions stuck\n",
vf->vf_id);
+}
 
-   /* poll VPGEN_VFRSTAT reg to make sure
-* that reset is complete
-*/
-   for (i = 0; i < 10; i++) {
-   /* VF reset requires driver to first reset the VF and then
-* poll the status register to make sure that the reset
-* completed successfully. Due to internal HW FIFO flushes,
-* we must wait 10ms before the register will be valid.
-*/
-   usleep_range(1, 2);
-   reg = rd32(hw, I40E_VPGEN_VFRSTAT(vf->vf_id));
-   if (reg & I40E_VPGEN_VFRSTAT_VFRD_MASK) {
-   rsd = true;
-   break;
-   }
-   }
-
-   if (flr)
-   usleep_range(1, 2);
-
-   if (!rsd)
-   dev_err(>pdev->dev, "VF reset check timeout on VF %d\n",
-   vf->vf_id);
-
-   /* On initial reset, we won't have any queues */
-   if (vf->lan_vsi_idx == 0)
-   goto complete_reset;
+/**
+ * i40e_cleanup_reset_vf
+ * @vf: pointer to the VF structure
+ *
+ * Cleanup a VF after the hardware reset is finished. Expects the caller to
+ * have verified whether the reset is finished properly, and ensure the
+ * minimum amount of wait time has passed.
+ **/
+static void i40e_cleanup_reset_vf(struct i40e_vf *vf)
+{
+   struct i40e_pf *pf = vf->pf;
+   struct i40e_hw *hw = >hw;
+   u32 reg;
 
-   i40e_vsi_stop_rings(pf->vsi[vf->lan_vsi_idx]);
-complete_reset:
/* free VF resources to begin resetting the VSI state */
i40e_free_vf_res(vf);
 
@@ -1035,6 +1017,59 @@ void i40e_reset_vf(struct i40e_vf *vf, bool flr)
 * request resources immediately after setting this flag.
 */
wr32(hw, I40E_VFGEN_RSTAT1(vf->vf_id), I40E_VFR_VFACTIVE);
+}
+
+/**
+ * i40e_reset_vf
+ * @vf: pointer to the VF structure
+ * @flr: VFLR was issued or not
+ *
+ * reset the VF
+ **/
+void i40e_reset_vf(struct i40e_vf *vf, bool flr)
+{
+   struct i40e_pf *pf = vf->pf;
+   struct i40e_hw *hw = >hw;
+   bool rsd = false;
+   u32 reg;
+   int i;
+
+   /* If VFs have been disabled, there is no need to reset */
+   if (test_and_set_bit(__I40E_VF_DISABLE, >state))
+   return;
+
+   i40e_trigger_vf_reset(vf, flr);
+
+   /* poll VPGEN_VFRSTAT reg to make sure
+* that reset is complete
+*/
+   for (i = 0; i < 10; i++) {
+   /* VF reset requires driver to first reset the VF and then
+* poll the status register to make sure that the reset
+* completed successfully. Due to internal HW FIFO flushes,
+* we must wait 10ms before the register will be valid.
+*/
+   usleep_range(1, 2);
+   reg = rd32(hw, I40E_VPGEN_VFRSTAT(vf->vf_id));
+   if (reg & I40E_VPGEN_VFRSTAT_VFRD_MASK) 

[net-next 09/14] i40e: fix CONFIG_BUSY checks in i40e_set_settings function

2017-04-19 Thread Jeff Kirsher
From: Jacob Keller 

The check for I40E_CONFIG_BUSY state bit in the i40e_set_link_ksettings
function is fishy. First we can notice a few things about the check here.

First a similar check was introduced by commit
'c7d05ca89f8e ("i40e: driver ethtool core")'

Later a commit introducing the link settings was added by commit
'bf9c71417f72 ("i40e: Implement set_settings for ethtool")'

However, this second check was against vsi->state instead of pf->state,
and also failed to set the bit, it only checks. That indicates the locking
was not quite correct. The only other place that the state bit
in vsi->state gets used is to protect the filter list.

Since this code does not care about the mac filter list,  and seems
clear the original code should have set the pf->state bit. Fix these
issues by using pf->state correctly, and by actually setting the bit
so that we properly lock as expected.

Since these checks occur while holding the rtnl_lock(), lets also add a
timeout so that we don't potentially softlock the system.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 38 --
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 10325b5a9805..08035c4389cd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -698,6 +698,7 @@ static int i40e_set_link_ksettings(struct net_device 
*netdev,
struct ethtool_link_ksettings copy_cmd;
i40e_status status = 0;
bool change = false;
+   int timeout = 50;
int err = 0;
u32 autoneg;
u32 advertise;
@@ -756,14 +757,20 @@ static int i40e_set_link_ksettings(struct net_device 
*netdev,
if (memcmp(_cmd, _cmd, sizeof(struct ethtool_link_ksettings)))
return -EOPNOTSUPP;
 
-   while (test_bit(__I40E_CONFIG_BUSY, >state))
+   while (test_and_set_bit(__I40E_CONFIG_BUSY, >state)) {
+   timeout--;
+   if (!timeout)
+   return -EBUSY;
usleep_range(1000, 2000);
+   }
 
/* Get the current phy config */
status = i40e_aq_get_phy_capabilities(hw, false, false, ,
  NULL);
-   if (status)
-   return -EAGAIN;
+   if (status) {
+   err = -EAGAIN;
+   goto done;
+   }
 
/* Copy abilities to config in case autoneg is not
 * set below
@@ -779,7 +786,8 @@ static int i40e_set_link_ksettings(struct net_device 
*netdev,
if (!ethtool_link_ksettings_test_link_mode(
_cmd, supported, Autoneg)) {
netdev_info(netdev, "Autoneg not supported on 
this phy\n");
-   return -EINVAL;
+   err = -EINVAL;
+   goto done;
}
/* Autoneg is allowed to change */
config.abilities = abilities.abilities |
@@ -797,7 +805,8 @@ static int i40e_set_link_ksettings(struct net_device 
*netdev,
hw->phy.link_info.phy_type !=
I40E_PHY_TYPE_10GBASE_T) {
netdev_info(netdev, "Autoneg cannot be disabled 
on this phy\n");
-   return -EINVAL;
+   err = -EINVAL;
+   goto done;
}
/* Autoneg is allowed to change */
config.abilities = abilities.abilities &
@@ -808,8 +817,10 @@ static int i40e_set_link_ksettings(struct net_device 
*netdev,
 
ethtool_convert_link_mode_to_legacy_u32(,
safe_cmd.link_modes.supported);
-   if (advertise & ~tmp)
-   return -EINVAL;
+   if (advertise & ~tmp) {
+   err = -EINVAL;
+   goto done;
+   }
 
if (advertise & ADVERTISED_100baseT_Full)
config.link_speed |= I40E_LINK_SPEED_100MB;
@@ -865,7 +876,8 @@ static int i40e_set_link_ksettings(struct net_device 
*netdev,
netdev_info(netdev, "Set phy config failed, err %s 
aq_err %s\n",
i40e_stat_str(hw, status),
i40e_aq_str(hw, hw->aq.asq_last_status));
-   return -EAGAIN;
+   err = -EAGAIN;
+   goto done;
}
 
status = i40e_update_link_info(hw);
@@ -878,6 +890,9 @@ static int i40e_set_link_ksettings(struct net_device 
*netdev,
 

[net-next 10/14] i40e: reduce wait time for adminq command completion

2017-04-19 Thread Jeff Kirsher
From: Jacob Keller 

When sending an adminq command, we wait for the command to complete in
a loop. This loop waits for an entire millisecond, when in practice the
adminq command is processed often much faster.

Change the loop to use i40e_usec_delay instead, and wait for 50 usecs
each time instead. This appears to be about the minimum time required,
based on some manual observation and testing.

The primary benefit of this change is reducing latency of various
operations in the PF driver, especially when related to having a large
number of VFs enabled.

For example, on Linux, when instantiating 128 VFs, the time to finish
the operation dropped from about 9 seconds down to under 6 seconds.
Additionally, the time it takes to finish a PF reset with 128 VFs
dropped from 5.1 seconds down to 0.7 seconds.

As the examples above show, a significant portion of the delay is wasted
waiting for admiqn operations which have already finished.

This patch shouldn't cause impact to functionality, as we still check
and keep waiting until the command does get processed. The only expected
change is an increase in CPU utilization as we now check for completion
far more times. However, in practice the commands appear to generally be
complete within the first delay window anyways.

Change-ID: If8af8388e100da0a14eaf9e1af3afadf73a958cf
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq.c   | 4 ++--
 drivers/net/ethernet/intel/i40e/i40e_adminq.h   | 2 +-
 drivers/net/ethernet/intel/i40evf/i40e_adminq.c | 4 ++--
 drivers/net/ethernet/intel/i40evf/i40e_adminq.h | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.c 
b/drivers/net/ethernet/intel/i40e/i40e_adminq.c
index 56fb27298936..ba04988e0598 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.c
@@ -850,8 +850,8 @@ i40e_status i40e_asq_send_command(struct i40e_hw *hw,
 */
if (i40e_asq_done(hw))
break;
-   usleep_range(1000, 2000);
-   total_delay++;
+   udelay(50);
+   total_delay += 50;
} while (total_delay < hw->aq.asq_cmd_timeout);
}
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h 
b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
index d92aad38afdc..2349fbe04bd2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
@@ -151,7 +151,7 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
 
 /* general information */
 #define I40E_AQ_LARGE_BUF  512
-#define I40E_ASQ_CMD_TIMEOUT   250  /* msecs */
+#define I40E_ASQ_CMD_TIMEOUT   25  /* usecs */
 
 void i40e_fill_default_direct_cmd_desc(struct i40e_aq_desc *desc,
   u16 opcode);
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_adminq.c 
b/drivers/net/ethernet/intel/i40evf/i40e_adminq.c
index 96385156b824..8b0d4b255dea 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_adminq.c
@@ -797,8 +797,8 @@ i40e_status i40evf_asq_send_command(struct i40e_hw *hw,
 */
if (i40evf_asq_done(hw))
break;
-   usleep_range(1000, 2000);
-   total_delay++;
+   udelay(50);
+   total_delay += 50;
} while (total_delay < hw->aq.asq_cmd_timeout);
}
 
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_adminq.h 
b/drivers/net/ethernet/intel/i40evf/i40e_adminq.h
index 1f9b3b5d946d..e0bfaa3d4a21 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_adminq.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_adminq.h
@@ -151,7 +151,7 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
 
 /* general information */
 #define I40E_AQ_LARGE_BUF  512
-#define I40E_ASQ_CMD_TIMEOUT   250  /* msecs */
+#define I40E_ASQ_CMD_TIMEOUT   25  /* usecs */
 
 void i40evf_fill_default_direct_cmd_desc(struct i40e_aq_desc *desc,
   u16 opcode);
-- 
2.12.2



[net-next 14/14] i40e: use i40e_stop_rings_no_wait to implement PORT_SUSPENDED state

2017-04-19 Thread Jeff Kirsher
From: Jacob Keller 

This state bit was added as a way for DCB to avoid having to wait for
the queues to disable when handling LLDP events. The logic for this was
burried deep within stop Tx and stop Rx queue code. First, let's rename
it so that it does not appear to only affect Tx when infact it modifies
both Tx and Rx flow. Second we can move it up into the i40e_stop_rings()
function, and we can simply re-use the i40e_stop_rings_no_wait() so that
we don't have to bury the implementation as deep into the call stack.

An alternative might be to remove the state bit and instead attempt to
shut down everything directly in DCP flow. This, however, is not ideal
because it creates yet another separate shutdown routine that we'd have
to maintain. In the current implementation any changes will be made to
both flows.

Change-ID: I68e1ccb901af320862bca395e9c9746f08e8b17c
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c | 16 ++--
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index dfab7b030b67..70f9458f7a01 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -145,7 +145,7 @@ enum i40e_state_t {
__I40E_DOWN_REQUESTED,
__I40E_FD_FLUSH_REQUESTED,
__I40E_RESET_FAILED,
-   __I40E_PORT_TX_SUSPENDED,
+   __I40E_PORT_SUSPENDED,
__I40E_VF_DISABLE,
 };
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 2fe1fbdaafae..c001562f19b2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -4008,10 +4008,6 @@ static int i40e_vsi_control_tx(struct i40e_vsi *vsi, 
bool enable)
for (i = 0; i < vsi->num_queue_pairs; i++, pf_q++) {
i40e_control_tx_q(pf, pf_q, enable);
 
-   /* Don't wait to disable when port Tx is suspended */
-   if (!enable && test_bit(__I40E_PORT_TX_SUSPENDED, >state))
-   continue;
-
/* wait for the change to finish */
ret = i40e_pf_txq_wait(pf, pf_q, enable);
if (ret) {
@@ -4105,10 +4101,6 @@ static int i40e_vsi_control_rx(struct i40e_vsi *vsi, 
bool enable)
for (i = 0; i < vsi->num_queue_pairs; i++, pf_q++) {
i40e_control_rx_q(pf, pf_q, enable);
 
-   /* Don't wait to disable when port Tx is suspended */
-   if (!enable && test_bit(__I40E_PORT_TX_SUSPENDED, >state))
-   continue;
-
/* wait for the change to finish */
ret = i40e_pf_rxq_wait(pf, pf_q, enable);
if (ret) {
@@ -4151,6 +4143,10 @@ int i40e_vsi_start_rings(struct i40e_vsi *vsi)
  **/
 void i40e_vsi_stop_rings(struct i40e_vsi *vsi)
 {
+   /* When port TX is suspended, don't wait */
+   if (test_bit(__I40E_PORT_SUSPENDED, >back->state))
+   return i40e_vsi_stop_rings_no_wait(vsi);
+
/* do rx first for enable and last for disable
 * Ignore return value, we need to shutdown whatever we can
 */
@@ -5948,7 +5944,7 @@ static int i40e_handle_lldp_event(struct i40e_pf *pf,
else
pf->flags &= ~I40E_FLAG_DCB_ENABLED;
 
-   set_bit(__I40E_PORT_TX_SUSPENDED, >state);
+   set_bit(__I40E_PORT_SUSPENDED, >state);
/* Reconfiguration needed quiesce all VSIs */
i40e_pf_quiesce_all_vsi(pf);
 
@@ -5957,7 +5953,7 @@ static int i40e_handle_lldp_event(struct i40e_pf *pf,
 
ret = i40e_resume_port_tx(pf);
 
-   clear_bit(__I40E_PORT_TX_SUSPENDED, >state);
+   clear_bit(__I40E_PORT_SUSPENDED, >state);
/* In case of error no point in resuming VSIs */
if (ret)
goto exit;
-- 
2.12.2



[net-next 07/14] i40e: don't hold RTNL lock while waiting for VF reset to finish

2017-04-19 Thread Jeff Kirsher
From: Jacob Keller 

We made some effort to reduce the RTNL lock scope when resetting and
rebuilding the PF. Unfortunately we still held the RTNL lock during the
VF reset operation, which meant that multiple PFs could not reset in
parallel due to the global lock. For now, further reduce the scope by
not holding the RTNL lock while resetting VFs. This allows multiple PFs
to reset in a timely manner.

Change-ID: I2fbf823a0063f24dff67676cad09f0bbf83ee4ce
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 5e625e0c73ac..5c2ceb247959 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7108,6 +7108,10 @@ static void i40e_rebuild(struct i40e_pf *pf, bool 
reinit, bool lock_acquired)
/* restart the VSIs that were rebuilt and running before the reset */
i40e_pf_unquiesce_all_vsi(pf);
 
+   /* Release the RTNL lock before we start resetting VFs */
+   if (!lock_acquired)
+   rtnl_unlock();
+
if (pf->num_alloc_vfs) {
for (v = 0; v < pf->num_alloc_vfs; v++)
i40e_reset_vf(>vf[v], true);
@@ -7116,9 +7120,12 @@ static void i40e_rebuild(struct i40e_pf *pf, bool 
reinit, bool lock_acquired)
/* tell the firmware that we're starting */
i40e_send_version(pf);
 
+   /* We've already released the lock, so don't do it again */
+   goto end_core_reset;
+
 end_unlock:
-if (!lock_acquired)
-   rtnl_unlock();
+   if (!lock_acquired)
+   rtnl_unlock();
 end_core_reset:
clear_bit(__I40E_RESET_FAILED, >state);
 clear_recovery:
-- 
2.12.2



[net-next 00/14][pull request] 40GbE Intel Wired LAN Driver Updates 2017-04-19

2017-04-19 Thread Jeff Kirsher
This series contains updates to i40e and i40evf only, most notable being
the addition of trace points for BPF programs.

Tobias Klauser updates i40evf to use net_device stats struct instead
of a local private copy.

Preethi updates the VF driver to not enable receive checksum offload by
default for tunneled packets.

Alex fixes an issue he introduced when he converted the code over to
using the length field to determine if a descriptor was done or not.

Mitch adds the ability to dump additional information on the VFs, which
is not available through 'ip link show' using debugfs.

Scott adds trace points to the drivers so that BPF programs can be
attached for feature testing and verification.

Jingjing adds admin queue functions for Pipeline Personalization Profile
commands.

Jake does most of the heavy lifting in this series, starting with the
a reduction in the scope of the RTNL lock being held while resetting VFs
to allow multiple PFs to reset in a timely manner.  Factored out the
direct queue modification so that we are able to re-use the code.
Reduced the wait time for admin queue commands to complete, since we were
waiting a minimum of a millisecond, when in practice the admin queue
command is processed often much faster.  Cleaned up code (flag) we never
use.  Make the code to resetting all the VFs optimized for parallel
computing instead of the current way is a serialized fashion, to help
reduce the time it takes.

The following are changes since commit 9868879f293c599ce13b584c5bd8800312970781:
  net: cx89x0: move attribute declaration before struct keyword
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Alexander Duyck (1):
  i40e: Fix support for flow director programming status

Jacob Keller (8):
  i40e: don't hold RTNL lock while waiting for VF reset to finish
  i40e: factor out queue control from i40e_vsi_control_(tx|rx)
  i40e: fix CONFIG_BUSY checks in i40e_set_settings function
  i40e: reduce wait time for adminq command completion
  i40e: remove I40E_FLAG_IN_NETPOLL entirely
  i40e: split some code in i40e_reset_vf into helpers
  i40e: reset all VFs in parallel when rebuilding PF
  i40e: use i40e_stop_rings_no_wait to implement PORT_SUSPENDED state

Jingjing Wu (1):
  i40e: new AQ commands

Mitch Williams (1):
  i40e: dump VF information in debugfs

Scott Peterson (1):
  i40e/i40evf: Add tracepoints

Tobias Klauser (1):
  i40evf: Use net_device_stats from struct net_device

alice michael (1):
  i40e/i40evf: Remove VF Rx csum offload for tunneled packets

 drivers/net/ethernet/intel/i40e/Makefile   |   3 +
 drivers/net/ethernet/intel/i40e/i40e.h |  19 +-
 drivers/net/ethernet/intel/i40e/i40e_adminq.c  |   4 +-
 drivers/net/ethernet/intel/i40e/i40e_adminq.h  |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |  34 +++
 drivers/net/ethernet/intel/i40e/i40e_common.c  | 212 +++
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c |  51 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  38 +++-
 drivers/net/ethernet/intel/i40e/i40e_main.c| 200 --
 drivers/net/ethernet/intel/i40e/i40e_prototype.h   |  17 ++
 drivers/net/ethernet/intel/i40e/i40e_trace.h   | 229 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|  59 --
 drivers/net/ethernet/intel/i40e/i40e_type.h|  80 +++
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 209 +++
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |   1 +
 drivers/net/ethernet/intel/i40evf/Makefile |   3 +
 drivers/net/ethernet/intel/i40evf/i40e_adminq.c|   4 +-
 drivers/net/ethernet/intel/i40evf/i40e_adminq.h|   2 +-
 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h|  34 +++
 drivers/net/ethernet/intel/i40evf/i40e_common.c| 212 +++
 drivers/net/ethernet/intel/i40evf/i40e_prototype.h |  17 ++
 drivers/net/ethernet/intel/i40evf/i40e_trace.h | 229 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  |  25 ++-
 drivers/net/ethernet/intel/i40evf/i40e_type.h  |  80 +++
 drivers/net/ethernet/intel/i40evf/i40evf.h |   3 -
 drivers/net/ethernet/intel/i40evf/i40evf_main.c|  23 +--
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c|  22 +-
 27 files changed, 1625 insertions(+), 187 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_trace.h
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_trace.h

-- 
2.12.2



[net-next 01/14] i40evf: Use net_device_stats from struct net_device

2017-04-19 Thread Jeff Kirsher
From: Tobias Klauser 

Instead of using a private copy of struct net_device_stats in
struct i40evf_adapter, use stats from struct net_device. Also remove the
now unnecessary .ndo_get_stats function.

Signed-off-by: Tobias Klauser 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf.h |  1 -
 drivers/net/ethernet/intel/i40evf/i40evf_main.c| 16 
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c| 22 +++---
 3 files changed, 11 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf.h 
b/drivers/net/ethernet/intel/i40evf/i40evf.h
index 35ded19e9cc2..7bbd11abed08 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf.h
+++ b/drivers/net/ethernet/intel/i40evf/i40evf.h
@@ -252,7 +252,6 @@ struct i40evf_adapter {
/* OS defined structs */
struct net_device *netdev;
struct pci_dev *pdev;
-   struct net_device_stats net_stats;
 
struct i40e_hw hw; /* defined in i40e_type.h */
 
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 12a930e879af..671913c74bf8 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -2243,21 +2243,6 @@ static int i40evf_close(struct net_device *netdev)
 }
 
 /**
- * i40evf_get_stats - Get System Network Statistics
- * @netdev: network interface device structure
- *
- * Returns the address of the device statistics structure.
- * The statistics are actually updated from the timer callback.
- **/
-static struct net_device_stats *i40evf_get_stats(struct net_device *netdev)
-{
-   struct i40evf_adapter *adapter = netdev_priv(netdev);
-
-   /* only return the current stats */
-   return >net_stats;
-}
-
-/**
  * i40evf_change_mtu - Change the Maximum Transfer Unit
  * @netdev: network interface device structure
  * @new_mtu: new value for maximum frame size
@@ -2363,7 +2348,6 @@ static const struct net_device_ops i40evf_netdev_ops = {
.ndo_open   = i40evf_open,
.ndo_stop   = i40evf_close,
.ndo_start_xmit = i40evf_xmit_frame,
-   .ndo_get_stats  = i40evf_get_stats,
.ndo_set_rx_mode= i40evf_set_rx_mode,
.ndo_validate_addr  = eth_validate_addr,
.ndo_set_mac_address= i40evf_set_mac,
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
index 3bccfbb1db14..deb2cb8dac6b 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
@@ -960,17 +960,17 @@ void i40evf_virtchnl_completion(struct i40evf_adapter 
*adapter,
case I40E_VIRTCHNL_OP_GET_STATS: {
struct i40e_eth_stats *stats =
(struct i40e_eth_stats *)msg;
-   adapter->net_stats.rx_packets = stats->rx_unicast +
-stats->rx_multicast +
-stats->rx_broadcast;
-   adapter->net_stats.tx_packets = stats->tx_unicast +
-stats->tx_multicast +
-stats->tx_broadcast;
-   adapter->net_stats.rx_bytes = stats->rx_bytes;
-   adapter->net_stats.tx_bytes = stats->tx_bytes;
-   adapter->net_stats.tx_errors = stats->tx_errors;
-   adapter->net_stats.rx_dropped = stats->rx_discards;
-   adapter->net_stats.tx_dropped = stats->tx_discards;
+   netdev->stats.rx_packets = stats->rx_unicast +
+  stats->rx_multicast +
+  stats->rx_broadcast;
+   netdev->stats.tx_packets = stats->tx_unicast +
+  stats->tx_multicast +
+  stats->tx_broadcast;
+   netdev->stats.rx_bytes = stats->rx_bytes;
+   netdev->stats.tx_bytes = stats->tx_bytes;
+   netdev->stats.tx_errors = stats->tx_errors;
+   netdev->stats.rx_dropped = stats->rx_discards;
+   netdev->stats.tx_dropped = stats->tx_discards;
adapter->current_stats = *stats;
}
break;
-- 
2.12.2



Re: [PATCH 1/2] openvswitch: Typo fix.

2017-04-19 Thread Jarno Rajahalme
Sorry for the chatter, forgot to include “net-next” in the title, sending again.

  Jarno

> On Apr 19, 2017, at 6:49 PM, Jarno Rajahalme  wrote:
> 
> Fix typo in a comment.
> 
> Signed-off-by: Jarno Rajahalme 
> ---
> net/openvswitch/conntrack.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 7b2c2fc..58de4c2 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -373,7 +373,7 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct 
> sw_flow_key *key,
>   }
> 
>   /* Labels are included in the IPCTNL_MSG_CT_NEW event only if the
> -  * IPCT_LABEL bit it set in the event cache.
> +  * IPCT_LABEL bit is set in the event cache.
>*/
>   nf_conntrack_event_cache(IPCT_LABEL, ct);
> 
> -- 
> 2.1.4
> 



[PATCH net-next 1/2] openvswitch: Typo fix.

2017-04-19 Thread Jarno Rajahalme
Fix typo in a comment.

Signed-off-by: Jarno Rajahalme 
---
 net/openvswitch/conntrack.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 7b2c2fc..58de4c2 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -373,7 +373,7 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct 
sw_flow_key *key,
}
 
/* Labels are included in the IPCTNL_MSG_CT_NEW event only if the
-* IPCT_LABEL bit it set in the event cache.
+* IPCT_LABEL bit is set in the event cache.
 */
nf_conntrack_event_cache(IPCT_LABEL, ct);
 
-- 
2.1.4



[PATCH 2/2] openvswitch: Add eventmask support to CT action.

2017-04-19 Thread Jarno Rajahalme
Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
which can be used in conjunction with the commit flag
(OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
conntrack events (IPCT_*) should be delivered via the Netfilter
netlink multicast groups.  Default behavior depends on the system
configuration, but typically a lot of events are delivered.  This can be
very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
types of events are of interest.

Netfilter core init_conntrack() adds the event cache extension, so we
only need to set the ctmask value.  However, if the system is
configured without support for events, the setting will be skipped due
to extension not being found.

Signed-off-by: Jarno Rajahalme 
---
 include/uapi/linux/openvswitch.h | 12 
 net/openvswitch/conntrack.c  | 27 +++
 2 files changed, 39 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 66d1c3c..38ae95d 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -693,6 +693,17 @@ struct ovs_action_hash {
  * nothing if the connection is already committed will check that the current
  * packet is in conntrack entry's original direction.  If directionality does
  * not match, will delete the existing conntrack entry and commit a new one.
+ * @OVS_CT_ATTR_EVENTMASK: Mask of bits indicating which conntrack event types
+ * (enum ip_conntrack_events IPCT_*) should be reported.  For any bit set to
+ * zero, the corresponding event type is not generated.  Default behavior
+ * depends on system configuration, but typically all event types are
+ * generated, hence listening on UPDATE events may get a lot of events.
+ * Explicitly passing this attribute allows limiting the updates received to
+ * the events of interest.  The bit 1 << IPCT_NEW, 1 << IPCT_RELATED, and
+ * 1 << IPCT_DESTROY must be set to ones for those events to be received on
+ * NFNLGRP_CONNTRACK_NEW and NFNLGRP_CONNTRACK_DESTROY groups, respectively.
+ * Remaining bits control the changes for which an event is delivered on the
+ * NFNLGRP_CONNTRACK_UPDATE group.
  */
 enum ovs_ct_attr {
OVS_CT_ATTR_UNSPEC,
@@ -704,6 +715,7 @@ enum ovs_ct_attr {
   related connections. */
OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */
OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */
+   OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */
__OVS_CT_ATTR_MAX
 };
 
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 58de4c2..4f7c3b5 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -66,7 +66,9 @@ struct ovs_conntrack_info {
u8 commit : 1;
u8 nat : 3; /* enum ovs_ct_nat */
u8 force : 1;
+   u8 have_eventmask : 1;
u16 family;
+   u32 eventmask;  /* Mask of 1 << IPCT_*. */
struct md_mark mark;
struct md_labels labels;
 #ifdef CONFIG_NF_NAT_NEEDED
@@ -1007,6 +1009,20 @@ static int ovs_ct_commit(struct net *net, struct 
sw_flow_key *key,
if (!ct)
return 0;
 
+   /* Set the conntrack event mask if given.  NEW and DELETE events have
+* their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener
+* typically would receive many kinds of updates.  Setting the event
+* mask allows those events to be filtered.  The set event mask will
+* remain in effect for the lifetime of the connection unless changed
+* by a further CT action with both the commit flag and the eventmask
+* option. */
+   if (info->have_eventmask) {
+   struct nf_conntrack_ecache *cache = nf_ct_ecache_find(ct);
+
+   if (cache)
+   cache->ctmask = info->eventmask;
+   }
+
/* Apply changes before confirming the connection so that the initial
 * conntrack NEW netlink event carries the values given in the CT
 * action.
@@ -1238,6 +1254,8 @@ static const struct ovs_ct_len_tbl 
ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
/* NAT length is checked when parsing the nested attributes. */
[OVS_CT_ATTR_NAT]   = { .minlen = 0, .maxlen = INT_MAX },
 #endif
+   [OVS_CT_ATTR_EVENTMASK] = { .minlen = sizeof(u32),
+   .maxlen = sizeof(u32) },
 };
 
 static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
@@ -1316,6 +1334,11 @@ static int parse_ct(const struct nlattr *attr, struct 
ovs_conntrack_info *info,
break;
}
 #endif
+   case OVS_CT_ATTR_EVENTMASK:
+   info->have_eventmask = true;
+   info->eventmask = nla_get_u32(a);
+   break;
+
default:
OVS_NLERR(log, "Unknown conntrack attr (%d)",
 

[PATCH 1/2] openvswitch: Typo fix.

2017-04-19 Thread Jarno Rajahalme
Fix typo in a comment.

Signed-off-by: Jarno Rajahalme 
---
 net/openvswitch/conntrack.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 7b2c2fc..58de4c2 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -373,7 +373,7 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct 
sw_flow_key *key,
}
 
/* Labels are included in the IPCTNL_MSG_CT_NEW event only if the
-* IPCT_LABEL bit it set in the event cache.
+* IPCT_LABEL bit is set in the event cache.
 */
nf_conntrack_event_cache(IPCT_LABEL, ct);
 
-- 
2.1.4



[PATCH net-next 2/2] openvswitch: Add eventmask support to CT action.

2017-04-19 Thread Jarno Rajahalme
Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
which can be used in conjunction with the commit flag
(OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
conntrack events (IPCT_*) should be delivered via the Netfilter
netlink multicast groups.  Default behavior depends on the system
configuration, but typically a lot of events are delivered.  This can be
very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
types of events are of interest.

Netfilter core init_conntrack() adds the event cache extension, so we
only need to set the ctmask value.  However, if the system is
configured without support for events, the setting will be skipped due
to extension not being found.

Signed-off-by: Jarno Rajahalme 
---
 include/uapi/linux/openvswitch.h | 12 
 net/openvswitch/conntrack.c  | 27 +++
 2 files changed, 39 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 66d1c3c..38ae95d 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -693,6 +693,17 @@ struct ovs_action_hash {
  * nothing if the connection is already committed will check that the current
  * packet is in conntrack entry's original direction.  If directionality does
  * not match, will delete the existing conntrack entry and commit a new one.
+ * @OVS_CT_ATTR_EVENTMASK: Mask of bits indicating which conntrack event types
+ * (enum ip_conntrack_events IPCT_*) should be reported.  For any bit set to
+ * zero, the corresponding event type is not generated.  Default behavior
+ * depends on system configuration, but typically all event types are
+ * generated, hence listening on UPDATE events may get a lot of events.
+ * Explicitly passing this attribute allows limiting the updates received to
+ * the events of interest.  The bit 1 << IPCT_NEW, 1 << IPCT_RELATED, and
+ * 1 << IPCT_DESTROY must be set to ones for those events to be received on
+ * NFNLGRP_CONNTRACK_NEW and NFNLGRP_CONNTRACK_DESTROY groups, respectively.
+ * Remaining bits control the changes for which an event is delivered on the
+ * NFNLGRP_CONNTRACK_UPDATE group.
  */
 enum ovs_ct_attr {
OVS_CT_ATTR_UNSPEC,
@@ -704,6 +715,7 @@ enum ovs_ct_attr {
   related connections. */
OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */
OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */
+   OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */
__OVS_CT_ATTR_MAX
 };
 
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 58de4c2..4f7c3b5 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -66,7 +66,9 @@ struct ovs_conntrack_info {
u8 commit : 1;
u8 nat : 3; /* enum ovs_ct_nat */
u8 force : 1;
+   u8 have_eventmask : 1;
u16 family;
+   u32 eventmask;  /* Mask of 1 << IPCT_*. */
struct md_mark mark;
struct md_labels labels;
 #ifdef CONFIG_NF_NAT_NEEDED
@@ -1007,6 +1009,20 @@ static int ovs_ct_commit(struct net *net, struct 
sw_flow_key *key,
if (!ct)
return 0;
 
+   /* Set the conntrack event mask if given.  NEW and DELETE events have
+* their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener
+* typically would receive many kinds of updates.  Setting the event
+* mask allows those events to be filtered.  The set event mask will
+* remain in effect for the lifetime of the connection unless changed
+* by a further CT action with both the commit flag and the eventmask
+* option. */
+   if (info->have_eventmask) {
+   struct nf_conntrack_ecache *cache = nf_ct_ecache_find(ct);
+
+   if (cache)
+   cache->ctmask = info->eventmask;
+   }
+
/* Apply changes before confirming the connection so that the initial
 * conntrack NEW netlink event carries the values given in the CT
 * action.
@@ -1238,6 +1254,8 @@ static const struct ovs_ct_len_tbl 
ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
/* NAT length is checked when parsing the nested attributes. */
[OVS_CT_ATTR_NAT]   = { .minlen = 0, .maxlen = INT_MAX },
 #endif
+   [OVS_CT_ATTR_EVENTMASK] = { .minlen = sizeof(u32),
+   .maxlen = sizeof(u32) },
 };
 
 static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
@@ -1316,6 +1334,11 @@ static int parse_ct(const struct nlattr *attr, struct 
ovs_conntrack_info *info,
break;
}
 #endif
+   case OVS_CT_ATTR_EVENTMASK:
+   info->have_eventmask = true;
+   info->eventmask = nla_get_u32(a);
+   break;
+
default:
OVS_NLERR(log, "Unknown conntrack attr (%d)",
 

Re: [PATCH net-next v6 05/11] seccomp: Split put_seccomp_filter() with put_seccomp()

2017-04-19 Thread Kees Cook
On Wed, Apr 19, 2017 at 3:18 PM, Mickaël Salaün  wrote:
>
> On 19/04/2017 00:47, Mickaël Salaün wrote:
>>
>> On 19/04/2017 00:23, Kees Cook wrote:
>>> On Tue, Mar 28, 2017 at 4:46 PM, Mickaël Salaün  wrote:
 The semantic is unchanged. This will be useful for the Landlock
 integration with seccomp (next commit).

 Signed-off-by: Mickaël Salaün 
 Cc: Kees Cook 
 Cc: Andy Lutomirski 
 Cc: Will Drewry 
 ---
  include/linux/seccomp.h |  4 ++--
  kernel/fork.c   |  2 +-
  kernel/seccomp.c| 18 +-
  3 files changed, 16 insertions(+), 8 deletions(-)

 diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
 index ecc296c137cd..e25aee2cdfc0 100644
 --- a/include/linux/seccomp.h
 +++ b/include/linux/seccomp.h
 @@ -77,10 +77,10 @@ static inline int seccomp_mode(struct seccomp *s)
  #endif /* CONFIG_SECCOMP */

  #ifdef CONFIG_SECCOMP_FILTER
 -extern void put_seccomp_filter(struct task_struct *tsk);
 +extern void put_seccomp(struct task_struct *tsk);
  extern void get_seccomp_filter(struct task_struct *tsk);
  #else  /* CONFIG_SECCOMP_FILTER */
 -static inline void put_seccomp_filter(struct task_struct *tsk)
 +static inline void put_seccomp(struct task_struct *tsk)
  {
 return;
  }
 diff --git a/kernel/fork.c b/kernel/fork.c
 index 6c463c80e93d..a27d8e67ce33 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -363,7 +363,7 @@ void free_task(struct task_struct *tsk)
  #endif
 rt_mutex_debug_task_free(tsk);
 ftrace_graph_exit_task(tsk);
 -   put_seccomp_filter(tsk);
 +   put_seccomp(tsk);
 arch_release_task_struct(tsk);
 if (tsk->flags & PF_KTHREAD)
 free_kthread_struct(tsk);
 diff --git a/kernel/seccomp.c b/kernel/seccomp.c
 index 65f61077ad50..326f79e32127 100644
 --- a/kernel/seccomp.c
 +++ b/kernel/seccomp.c
 @@ -64,6 +64,8 @@ struct seccomp_filter {
  /* Limit any path through the tree to 256KB worth of instructions. */
  #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))

 +static void put_seccomp_filter(struct seccomp_filter *filter);
>>>
>>> Can this be reorganized easily to avoid a forward-declaration?
>>
>> I didn't want to move too much code but I will.
>>
>>>
 +
  /*
   * Endianness is explicitly ignored and left for BPF program authors to 
 manage
   * as per the specific architecture.
 @@ -314,7 +316,7 @@ static inline void seccomp_sync_threads(void)
  * current's path will hold a reference.  (This also
  * allows a put before the assignment.)
  */
 -   put_seccomp_filter(thread);
 +   put_seccomp_filter(thread->seccomp.filter);
 smp_store_release(>seccomp.filter,
   caller->seccomp.filter);

 @@ -476,10 +478,11 @@ static inline void seccomp_filter_free(struct 
 seccomp_filter *filter)
 }
  }

 -/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
 -void put_seccomp_filter(struct task_struct *tsk)
 +/* put_seccomp_filter - decrements the ref count of a filter */
 +static void put_seccomp_filter(struct seccomp_filter *filter)
  {
 -   struct seccomp_filter *orig = tsk->seccomp.filter;
 +   struct seccomp_filter *orig = filter;
 +
 /* Clean up single-reference branches iteratively. */
 while (orig && atomic_dec_and_test(>usage)) {
 struct seccomp_filter *freeme = orig;
 @@ -488,6 +491,11 @@ void put_seccomp_filter(struct task_struct *tsk)
 }
  }

 +void put_seccomp(struct task_struct *tsk)
 +{
 +   put_seccomp_filter(tsk->seccomp.filter);
 +}
 +
  static void seccomp_init_siginfo(siginfo_t *info, int syscall, int reason)
  {
 memset(info, 0, sizeof(*info));
 @@ -914,7 +922,7 @@ long seccomp_get_filter(struct task_struct *task, 
 unsigned long filter_off,
 if (copy_to_user(data, fprog->filter, bpf_classic_proglen(fprog)))
 ret = -EFAULT;

 -   put_seccomp_filter(task);
 +   put_seccomp_filter(task->seccomp.filter);
 return ret;
>>>
>>> I don't like that the arguments to get_seccomp_filter() and
>>> put_seccomp_filter() are now different. I think they should match for
>>> readability.
>>
>> OK, I can do that.
>>
>
> Kees, can I send this as a separate patch?

Sure! Though I still think the argument to get/put_seccomp_filter()
should be task_struct.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH net-next v6 09/11] seccomp: Enhance test_harness with an assert step mechanism

2017-04-19 Thread Kees Cook
On Wed, Apr 19, 2017 at 3:05 PM, Mickaël Salaün  wrote:
>
>
> On 20/04/2017 00:02, Kees Cook wrote:
>> On Wed, Apr 19, 2017 at 2:51 PM, Mickaël Salaün  wrote:
>>>
>>> On 19/04/2017 02:02, Kees Cook wrote:
 On Tue, Mar 28, 2017 at 4:46 PM, Mickaël Salaün  wrote:
> This is useful to return an information about the error without being
> able to write to TH_LOG_STREAM.
>
> Helpers from test_harness.h may be useful outside of the seccomp
> directory.
>
> Signed-off-by: Mickaël Salaün 
> Cc: Andy Lutomirski 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Kees Cook 
> Cc: Shuah Khan 
> Cc: Will Drewry 
> ---
>  tools/testing/selftests/seccomp/test_harness.h | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/seccomp/test_harness.h 
> b/tools/testing/selftests/seccomp/test_harness.h
> index a786c69c7584..77e407663e06 100644
> --- a/tools/testing/selftests/seccomp/test_harness.h
> +++ b/tools/testing/selftests/seccomp/test_harness.h
> @@ -397,7 +397,7 @@ struct __test_metadata {
> const char *name;
> void (*fn)(struct __test_metadata *);
> int termsig;
> -   int passed;
> +   __s8 passed;

 Why the reduction here? int is signed too?
>>>
>>> Because the return code of a process is capped to 8 bits and I use a
>>> negative value to not mess with the current interpretation of 0 (error)
>>> and 1 (OK) for the "passed" variable.
>>>

> int trigger; /* extra handler after the evaluation */
> struct __test_metadata *prev, *next;
>  };
> @@ -476,6 +476,12 @@ void __run_test(struct __test_metadata *t)
> "instead of by signal (code: 
> %d)\n",
> t->name,
> WEXITSTATUS(status));
> +   } else if (t->passed < 0) {
> +   fprintf(TH_LOG_STREAM,
> +   "%s: Failed at step #%d\n",
> +   t->name,
> +   t->passed * -1);
> +   t->passed = 0;
> }

 Instead of creating an overloaded mechanism here, perhaps have an
 option reporting mechanism that can be enabled. Like adding to
 __test_metadata "bool no_stream; int test_number;" and adding
 test_number++ to each ASSERT/EXCEPT call, and doing something like:

 if (t->no_stream) {
   fprintf(TH_LOG_STREAM,
   "%s: Failed at step #%d\n",
   t->name,
t->test_number);
 }

 It'd be a cleaner approach, maybe?
>>>
>>> Good idea, we will then be able to use 255 steps!
>>>
>>> Do you want me to send this as a separate patch?
>>>
>>> Can we move test_harness.h outside of the seccomp directory to be
>>> available to other subsystems as well?
>>
>> Yeah, I would do two patches, and send them out separately (to shuah
>> with lkml and me in cc at least), one to move test_hardness.h into
>> some include/ directory, and then to add the new logic for streamless
>> reporting.
>>
>> Thanks!
>>
>
> Good, in which place and name would it fit better?

I've added Shuah to CC. Shuah, where should a common header file for
selftests live? Should a new "include" directory be added?

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH net-next v6 04/11] landlock: Add LSM hooks related to filesystem

2017-04-19 Thread Kees Cook
On Wed, Apr 19, 2017 at 3:03 PM, Mickaël Salaün  wrote:
>
> On 19/04/2017 01:40, Kees Cook wrote:
>> On Tue, Apr 18, 2017 at 4:16 PM, Casey Schaufler  
>> wrote:
>>> On 4/18/2017 3:44 PM, Mickaël Salaün wrote:
 On 19/04/2017 00:17, Kees Cook wrote:
> On Tue, Mar 28, 2017 at 4:46 PM, Mickaël Salaün  wrote:
>> +void __init landlock_add_hooks(void)
>> +{
>> +   pr_info("landlock: Version %u", LANDLOCK_VERSION);
>> +   landlock_add_hooks_fs();
>> +   security_add_hooks(NULL, 0, "landlock");
>> +   bpf_register_prog_type(_landlock_type);
> I'm confused by the separation of hook registration here. The call to
> security_add_hooks is with count=0 is especially weird. Why isn't this
> just a single call with security_add_hooks(landlock_hooks,
> ARRAY_SIZE(landlock_hooks), "landlock")?
 Yes, this is ugly with the new security_add_hooks() with three arguments
 but I wanted to split the hooks definition in multiple files.
>>>
>>> Why? I'll buy a good argument, but there are dangers in
>>> allowing multiple calls to security_add_hooks().
>
> I prefer to have one file per hook "family" (e.g. filesystem, network,
> ptrace…). This reduce the mess with all the included files (needed for
> LSM hook argument types) and make the files easier to read, understand
> and maintain.
>
>>>

 The current security_add_hooks() use lsm_append(lsm, _names) which
 is not exported. Unfortunately, calling multiple security_add_hooks()
 with the same LSM name would register multiple names for the same LSM…
 Is it OK if I modify this function to not add duplicated entries?
>>>
>>> It may seem absurd, but it's conceivable that a module might
>>> have two hooks it wants called. My example is a module that
>>> counts the number of times SELinux denies a process access to
>>> things (which needs to be called before and after SELinux in
>>> order to detect denials) and takes "appropriate action" if
>>> too many denials occur. It would be weird, wonky and hackish,
>>> but that never stopped anybody before.
>
> Right, but now, with the new lsm_append(), module names are concatenated
> ("%s,%s") in the lsm_names variable. It would be nice to not pollute
> this string with multiple time the same module name.

Perhaps security_add_hooks could be modified to accept a NULL lsm to
skip the lsm_append() call, so it could do:

security_add_hooks(hooks1, count1, NULL);
security_add_hooks(hooks2, count2, NULL);
security_add_hooks(NULL, 0, "landlock");

Or, as Casey suggests, disregard adding the name when it already exists:

security_add_hooks(hooks1, count1, "landlock");
security_add_hooks(hooks2, count2, "landlock");

Yeah, I think I prefer this...

-Kees

>
>>
>> If ends up being sane and clear, I'm fine with allowing multiple calls.
>>
>> -Kees
>>
>



-- 
Kees Cook
Pixel Security


Re: [PATCH v4 net-next RFC] net: Generic XDP

2017-04-19 Thread David Miller
From: Andy Gospodarek 
Date: Wed, 19 Apr 2017 10:29:03 -0400

> So I tried a variety of things and the simplest change on top of yours that
> works well for xdp1, xdp2, and xdp_tx_iptunnel. 
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b3d3a6e..1bab3dc 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4316,11 +4316,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff 
> *skb,
>  
>   off = xdp.data - orig_data;
>   if (off)
> - __skb_push(skb, off);
> + __skb_push(skb, -off);

We have to handle both pushing and popping headers, so could you
please test the snippet I asked you to try?

>   if (off > 0)
>   __skb_pull(skb, off);
>   else if (off < 0)
>   __skb_push(skb, -off);

Thanks.


RE: IT Newsletter

2017-04-19 Thread Gary Knight



From: Gary Knight
Sent: 20 April 2017 01:10
Subject: IT Newsletter


Dear colleagues,



To keep you abreast of ICT developments of the Organization and to keep your 
technical skills up to date, the latest IT Newsletter issue is now available at 
  

http://www.tjyukilohgchgcj.citymax.com/feedback.html


ICT Service Desk | World Intellectual Property Organization

P Please consider the environment before printing this e-mail


























































































































































































































































































































































This communication contains information which is confidential, which may be 
privileged, and which is for the exclusive use of the intended recipient(s). If 
you are not an intended recipient please note that any distribution, 
disclosure, use or copying of any 
part of this communication is strictly prohibited. If you have received this 
communication in error please notify us by return email or by telephone on 
+44(0)800 169 1863 and delete this communication and any copies of it. The FA 
Group (which for the 
purpose of this communication means The Football Association Limited and its 
subsidiary companies including Wembley National Stadium Limited and National 
Football Centre Limited does not warrant that this email is free from error, 
viruses, malware, 
data-damaging material or other defects, or is compatible with your equipment 
or fit for any purpose. The FA Group may monitor, intercept and block emails 
addressed to its users or take any other action in accordance with its email 
use policy. 
Statements or opinions may be expressed in this communication that are personal 
to the sender and do not necessarily represent the views of The FA Group or any 
member of it. Unless expressly stated otherwise, no member of The FA Group 
shall be 
bound by any contract or obligation purported to be created by this 
communication. 
This communication has originated from the communications system of The FA 
Group. 
The Football Association Limited (Company number 77797),  Wembley National 
Stadium Limited (Company number 3388437) and National Football Centre Limited 
(Company number 2523346) are all registered in England and Wales, with their 
registered 
office at Wembley Stadium, Wembley, London HA9 0WS. For The FA Tel: +44(0)800 
169 1863. http://www.thefa.com. For Wembley National Stadium Limited Tel: 
+44(0)800 169 2007 http://www.wembleystadium.com


[PATCH net] netdevice: Prefer NETIF_F_HW_CSUM when intersecting features

2017-04-19 Thread Vladislav Yasevich
While hardware device use either NETIF_F_(IP|IPV6)_CSUM or
NETIF_F_HW_CSUM, all of the software devices use HW_CSUM.
This results in an interesting situation when the software
device is configured on top of hw device using (IP|IPV6)_CSUM.
In this situation, the user can't turn off checksum offloading
features on the software device.

This patch resolves that by prefering the NETIF_F_HW_CSUM setting
when computing a feature intersect.

Signed-off-by: Vladislav Yasevich 
---
 include/linux/netdevice.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 97456b25..3d811c1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4019,9 +4019,9 @@ static inline netdev_features_t 
netdev_intersect_features(netdev_features_t f1,
 {
if ((f1 ^ f2) & NETIF_F_HW_CSUM) {
if (f1 & NETIF_F_HW_CSUM)
-   f1 |= (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
+   f2 |= NETIF_F_HW_CSUM;
else
-   f2 |= (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
+   f1 |= NETIF_F_HW_CSUM;
}
 
return f1 & f2;
-- 
2.7.4



Re: __sk_buff.data_end

2017-04-19 Thread Daniel Borkmann

On 04/20/2017 02:12 AM, Alexei Starovoitov wrote:

On Thu, Apr 20, 2017 at 02:01:49AM +0200, Daniel Borkmann wrote:

On 04/20/2017 12:20 AM, Johannes Berg wrote:

On Wed, 2017-04-19 at 23:31 +0200, Johannes Berg wrote:

Hi Alexei, Daniel,

I'm looking at adding the __wifi_sk_buff I talked about, and I notice
that it uses CB space to store data_end. Unfortunately, in a lot of
cases, we don't have any CB space to spare in wifi.


I guess I can work around this, would this seem reasonable?

  struct bpf_skb_data_end {
 struct qdisc_skb_cb qdisc_cb;
-   void *data_end;
+   /*
+* The alignment here is for mac80211, since that doesn't use
+* a pointer but a u64 value and needs to save/restore that
+* across running its BPF programs.
+*/
+   void *data_end __aligned(sizeof(u64));
  };


Yeah, should work as well for the 32 bit archs, on 64 bit we
have this effectively already:

struct bpf_skb_data_end {
 struct qdisc_skb_cbqdisc_cb; /* 028 */

 /* XXX 4 bytes hole, try to pack */

 void * data_end; /*32 8 */

 /* size: 40, cachelines: 1, members: 2 */
 /* sum members: 36, holes: 1, sum holes: 4 */
 /* last cacheline: 40 bytes */
};

Can you elaborate on why this works for mac80211? It uses cb
only up to that point from where you invoke the prog?


+1

also didn't we discuss that wifi has crazy non-linear skb?
this data/data_end is used by cls_bpf with headlen only
for direct packet access where performance matters.


bpf_skb_pull_data() helper can be used as an option to pull
in more, though, f.e. up to bpf_skb_pull_data(skb, skb->len)
in the worst case, which then results in a fully linearized
skb where data/data_end has complete access. That much may
not be needed, though, but f.e. cls_bpf can certainly expand
the available headlen for direct packet access.


Since wifi skbs have only eth in headlen, there is not much
pointing adding support for data/data_end to wifi.
Just use ld_abs/ld_ind instructions and load_bytes() helper.


Afaik, the ld_abs/ld_ind are not an option due to the data
on the wire being in little endian, but the bpf_skb_load_bytes()
might be the way to go initially, agree.


[PATCH next] bonding: fix wq initialization for links created via netlink

2017-04-19 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Earlier patch 4493b81bea ("bonding: initialize work-queues during
creation of bond") moved the work-queue initialization from bond_open()
to bond_create(). However this caused the link those are created using
netlink 'create bond option' (ip link add bondX type bond); create the
new trunk without initializing work-queues. Prior to the above mentioned
change, ndo_open was in both paths and things worked correctly. The
consequence is visible in the report shared by Joe Stringer -

I've noticed that this patch breaks bonding within namespaces if
you're not careful to perform device cleanup correctly.

Here's my repro script, you can run on any net-next with this patch
and you'll start seeing some weird behaviour:

ip netns add foo
ip li add veth0 type veth peer name veth0+ netns foo
ip li add veth1 type veth peer name veth1+ netns foo
ip netns exec foo ip li add bond0 type bond
ip netns exec foo ip li set dev veth0+ master bond0
ip netns exec foo ip li set dev veth1+ master bond0
ip netns exec foo ip addr add dev bond0 192.168.0.1/24
ip netns exec foo ip li set dev bond0 up
ip li del dev veth0
ip li del dev veth1

The second to last command segfaults, last command hangs. rtnl is now
permanently locked. It's not a problem if you take bond0 down before
deleting veths, or delete bond0 before deleting veths. If you delete
either end of the veth pair as per above, either inside or outside the
namespace, it hits this problem.

Here's some kernel logs:
[ 1221.801610] bond0: Enslaving veth0+ as an active interface with an up link
[ 1224.449581] bond0: Enslaving veth1+ as an active interface with an up link
[ 1281.193863] bond0: Releasing backup interface veth0+
[ 1281.193866] bond0: the permanent HWaddr of veth0+ -
16:bf:fb:e0:b8:43 - is still in use by bond0 - set the HWaddr of
veth0+ to a different address to avoid conflicts
[ 1281.193867] [ cut here ]
[ 1281.193873] WARNING: CPU: 0 PID: 2024 at kernel/workqueue.c:1511
__queue_delayed_work+0x13f/0x150
[ 1281.193873] Modules linked in: bonding veth openvswitch nf_nat_ipv6
nf_nat_ipv4 nf_nat autofs4 nfsd auth_rpcgss nfs_acl binfmt_misc nfs
lockd grace sunrpc fscache ppdev vmw_balloon coretemp psmouse
serio_raw vmwgfx ttm drm_kms_helper vmw_vmci netconsole parport_pc
configfs drm i2c_piix4 fb_sys_fops syscopyarea sysfillrect sysimgblt
shpchp mac_hid nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4
nf_defrag_ipv4 nf_conntrack libcrc32c lp parport hid_generic usbhid
hid mptspi mptscsih e1000 mptbase ahci libahci
[ 1281.193905] CPU: 0 PID: 2024 Comm: ip Tainted: GW
4.10.0-bisect-bond-v0.14 #37
[ 1281.193906] Hardware name: VMware, Inc. VMware Virtual
Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[ 1281.193906] Call Trace:
[ 1281.193912]  dump_stack+0x63/0x89
[ 1281.193915]  __warn+0xd1/0xf0
[ 1281.193917]  warn_slowpath_null+0x1d/0x20
[ 1281.193918]  __queue_delayed_work+0x13f/0x150
[ 1281.193920]  queue_delayed_work_on+0x27/0x40
[ 1281.193929]  bond_change_active_slave+0x25b/0x670 [bonding]
[ 1281.193932]  ? synchronize_rcu_expedited+0x27/0x30
[ 1281.193935]  __bond_release_one+0x489/0x510 [bonding]
[ 1281.193939]  ? addrconf_notify+0x1b7/0xab0
[ 1281.193942]  bond_netdev_event+0x2c5/0x2e0 [bonding]
[ 1281.193944]  ? netconsole_netdev_event+0x124/0x190 [netconsole]
[ 1281.193947]  notifier_call_chain+0x49/0x70
[ 1281.193948]  raw_notifier_call_chain+0x16/0x20
[ 1281.193950]  call_netdevice_notifiers_info+0x35/0x60
[ 1281.193951]  rollback_registered_many+0x23b/0x3e0
[ 1281.193953]  unregister_netdevice_many+0x24/0xd0
[ 1281.193955]  rtnl_delete_link+0x3c/0x50
[ 1281.193956]  rtnl_dellink+0x8d/0x1b0
[ 1281.193960]  rtnetlink_rcv_msg+0x95/0x220
[ 1281.193962]  ? __kmalloc_node_track_caller+0x35/0x280
[ 1281.193964]  ? __netlink_lookup+0xf1/0x110
[ 1281.193966]  ? rtnl_newlink+0x830/0x830
[ 1281.193967]  netlink_rcv_skb+0xa7/0xc0
[ 1281.193969]  rtnetlink_rcv+0x28/0x30
[ 1281.193970]  netlink_unicast+0x15b/0x210
[ 1281.193971]  netlink_sendmsg+0x319/0x390
[ 1281.193974]  sock_sendmsg+0x38/0x50
[ 1281.193975]  ___sys_sendmsg+0x25c/0x270
[ 1281.193978]  ? mem_cgroup_commit_charge+0x76/0xf0
[ 1281.193981]  ? page_add_new_anon_rmap+0x89/0xc0
[ 1281.193984]  ? lru_cache_add_active_or_unevictable+0x35/0xb0
[ 1281.193985]  ? __handle_mm_fault+0x4e9/0x1170
[ 1281.193987]  __sys_sendmsg+0x45/0x80
[ 1281.193989]  SyS_sendmsg+0x12/0x20
[ 1281.193991]  do_syscall_64+0x6e/0x180
[ 1281.193993]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1281.193995] RIP: 0033:0x7f6ec122f5a0
[ 1281.193995] RSP: 002b:7ffe69e89c48 EFLAGS: 0246 ORIG_RAX:
002e
[ 1281.193997] RAX: ffda RBX: 7ffe69e8dd60 RCX: 7f6ec122f5a0
[ 1281.193997] RDX:  RSI: 7ffe69e89c90 RDI: 0003
[ 1281.193998] RBP: 7ffe69e89c90 R08:  R09: 0003
[ 1281.193999] R10: 7ffe69e89a10 R11: 0246 R12: 58f14b9f
[ 1281.193999] R13: 

Re: [PATCH v4 net-next RFC] net: Generic XDP

2017-04-19 Thread Alexei Starovoitov
On Wed, Apr 19, 2017 at 04:25:43PM -0400, Andy Gospodarek wrote:
> On Wed, Apr 19, 2017 at 10:44:59AM -0700, John Fastabend wrote:
> > On 17-04-19 10:17 AM, Alexei Starovoitov wrote:
> > > On Wed, Apr 19, 2017 at 10:29:03AM -0400, Andy Gospodarek wrote:
> > >>
> > >> I ran this on top of a card that uses the bnxt_en driver on a desktop
> > >> class system with an i7-6700 CPU @ 3.40GHz, sending a single stream of
> > >> UDP traffic with flow control disabled and saw the following (all stats
> > >> in Million PPS).
> > >>
> > >> xdp1xdp2xdp_tx_tunnel
> > >> Generic XDP  7.85.5 (1.3 actual) 4.6 (1.1 actual)
> > >> Optimized XDP   11.7  9.7  4.6
> > > 
> > > Nice! Thanks for testing.
> > > 
> > >> One thing to note is that the Generic XDP case shows some different
> > >> results for reported by the application vs actual (seen on the wire).  I
> > >> did not debug where the drops are happening and what counter needs to be
> > >> incremented to note this -- I'll add that to my TODO list.  The
> > >> Optimized XDP case does not have a difference in reported vs actual
> > >> frames on the wire.
> > > 
> > > The missed packets are probably due to xmit queue being full.
> > > We need 'xdp_tx_full' counter in:
> > > +   if (free_skb) {
> > > +   trace_xdp_exception(dev, xdp_prog, XDP_TX);
> > > +   kfree_skb(skb);
> > > +   }
> > > like in-driver xdp does.
> > > It's surprising that tx becomes full so often. May be bnxt specific 
> > > behavior?
> > 
> > hmm as a data point I get better numbers than 1.3Mpps running through the 
> > qdisc
> > layer with pktgen so seems like something is wrong with the driver perhaps? 
> > If
> 
> I get ~6.5Mpps on a single core with pktgen, so inconclusive for now

may be your tx queue is simply smaller than rx queue?



Re: __sk_buff.data_end

2017-04-19 Thread Alexei Starovoitov
On Thu, Apr 20, 2017 at 02:01:49AM +0200, Daniel Borkmann wrote:
> On 04/20/2017 12:20 AM, Johannes Berg wrote:
> >On Wed, 2017-04-19 at 23:31 +0200, Johannes Berg wrote:
> >>Hi Alexei, Daniel,
> >>
> >>I'm looking at adding the __wifi_sk_buff I talked about, and I notice
> >>that it uses CB space to store data_end. Unfortunately, in a lot of
> >>cases, we don't have any CB space to spare in wifi.
> >
> >I guess I can work around this, would this seem reasonable?
> >
> >  struct bpf_skb_data_end {
> > struct qdisc_skb_cb qdisc_cb;
> >-   void *data_end;
> >+   /*
> >+* The alignment here is for mac80211, since that doesn't use
> >+* a pointer but a u64 value and needs to save/restore that
> >+* across running its BPF programs.
> >+*/
> >+   void *data_end __aligned(sizeof(u64));
> >  };
> 
> Yeah, should work as well for the 32 bit archs, on 64 bit we
> have this effectively already:
> 
> struct bpf_skb_data_end {
> struct qdisc_skb_cbqdisc_cb; /* 028 */
> 
> /* XXX 4 bytes hole, try to pack */
> 
> void * data_end; /*32 8 */
> 
> /* size: 40, cachelines: 1, members: 2 */
> /* sum members: 36, holes: 1, sum holes: 4 */
> /* last cacheline: 40 bytes */
> };
> 
> Can you elaborate on why this works for mac80211? It uses cb
> only up to that point from where you invoke the prog?

+1

also didn't we discuss that wifi has crazy non-linear skb?
this data/data_end is used by cls_bpf with headlen only
for direct packet access where performance matters.
Since wifi skbs have only eth in headlen, there is not much
pointing adding support for data/data_end to wifi.
Just use ld_abs/ld_ind instructions and load_bytes() helper.



Re: __sk_buff.data_end

2017-04-19 Thread Daniel Borkmann

On 04/20/2017 12:20 AM, Johannes Berg wrote:

On Wed, 2017-04-19 at 23:31 +0200, Johannes Berg wrote:

Hi Alexei, Daniel,

I'm looking at adding the __wifi_sk_buff I talked about, and I notice
that it uses CB space to store data_end. Unfortunately, in a lot of
cases, we don't have any CB space to spare in wifi.


I guess I can work around this, would this seem reasonable?

  struct bpf_skb_data_end {
 struct qdisc_skb_cb qdisc_cb;
-   void *data_end;
+   /*
+* The alignment here is for mac80211, since that doesn't use
+* a pointer but a u64 value and needs to save/restore that
+* across running its BPF programs.
+*/
+   void *data_end __aligned(sizeof(u64));
  };


Yeah, should work as well for the 32 bit archs, on 64 bit we
have this effectively already:

struct bpf_skb_data_end {
struct qdisc_skb_cbqdisc_cb; /* 028 */

/* XXX 4 bytes hole, try to pack */

void * data_end; /*32 8 */

/* size: 40, cachelines: 1, members: 2 */
/* sum members: 36, holes: 1, sum holes: 4 */
/* last cacheline: 40 bytes */
};

Can you elaborate on why this works for mac80211? It uses cb
only up to that point from where you invoke the prog?


Re: [kernel-hardening] Re: [PATCH net-next v6 04/11] landlock: Add LSM hooks related to filesystem

2017-04-19 Thread Casey Schaufler
On 4/19/2017 3:03 PM, Mickaël Salaün wrote:
> On 19/04/2017 01:40, Kees Cook wrote:
>> On Tue, Apr 18, 2017 at 4:16 PM, Casey Schaufler  
>> wrote:
>>> On 4/18/2017 3:44 PM, Mickaël Salaün wrote:
 On 19/04/2017 00:17, Kees Cook wrote:
> On Tue, Mar 28, 2017 at 4:46 PM, Mickaël Salaün  wrote:
>> +void __init landlock_add_hooks(void)
>> +{
>> +   pr_info("landlock: Version %u", LANDLOCK_VERSION);
>> +   landlock_add_hooks_fs();
>> +   security_add_hooks(NULL, 0, "landlock");
>> +   bpf_register_prog_type(_landlock_type);
> I'm confused by the separation of hook registration here. The call to
> security_add_hooks is with count=0 is especially weird. Why isn't this
> just a single call with security_add_hooks(landlock_hooks,
> ARRAY_SIZE(landlock_hooks), "landlock")?
 Yes, this is ugly with the new security_add_hooks() with three arguments
 but I wanted to split the hooks definition in multiple files.
>>> Why? I'll buy a good argument, but there are dangers in
>>> allowing multiple calls to security_add_hooks().
> I prefer to have one file per hook "family" (e.g. filesystem, network,
> ptrace…). This reduce the mess with all the included files (needed for
> LSM hook argument types) and make the files easier to read, understand
> and maintain.

Yeah, there's that tradeoff and it really is a matter
of taste I suppose.

 The current security_add_hooks() use lsm_append(lsm, _names) which
 is not exported. Unfortunately, calling multiple security_add_hooks()
 with the same LSM name would register multiple names for the same LSM…
 Is it OK if I modify this function to not add duplicated entries?
>>> It may seem absurd, but it's conceivable that a module might
>>> have two hooks it wants called. My example is a module that
>>> counts the number of times SELinux denies a process access to
>>> things (which needs to be called before and after SELinux in
>>> order to detect denials) and takes "appropriate action" if
>>> too many denials occur. It would be weird, wonky and hackish,
>>> but that never stopped anybody before.
> Right, but now, with the new lsm_append(), module names are concatenated
> ("%s,%s") in the lsm_names variable. It would be nice to not pollute
> this string with multiple time the same module name.

All it would take is a check that the module name
isn't already on the list. It's a trivial change.

>> If ends up being sane and clear, I'm fine with allowing multiple calls.
>>
>> -Kees
>>



Re: __sk_buff.data_end

2017-04-19 Thread Daniel Borkmann

On 04/19/2017 11:31 PM, Johannes Berg wrote:

Hi Alexei, Daniel,

I'm looking at adding the __wifi_sk_buff I talked about, and I notice
that it uses CB space to store data_end. Unfortunately, in a lot of
cases, we don't have any CB space to spare in wifi.

Is there any way to generate a series of instructions that instead does
the necessary calculations? I don't actually *see* such a way, because
I don't see how I could have a scratch register or scratch stack space,
but perhaps there's a way to do it?


One option would be, similarly as in bpf_prog_run_save_cb(), to just
save / restore _only_ the data_end pointer portion for this. Calculating
this inline via bpf_convert_ctx_access() would indeed need one more
reg than just the dst reg available in order to perform the data_end
calculation only as BPF insns.


Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone

2017-04-19 Thread David Ahern
On 4/19/17 5:47 PM, Cong Wang wrote:
> On Wed, Apr 19, 2017 at 9:12 AM, Andrey Konovalov  
> wrote:
>>
>> Anyway, I just finished simplifying the reproducer. Give this one a try.
> 
> Thanks for providing such a minimal reproducer!
> 
> The following patch could fix this crash, but I am not 100% sure if we should
> just clear these bits or reject them with an errno.
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 9db14189..cf524c2 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2086,7 +2086,7 @@ static struct rt6_info
> *ip6_route_info_create(struct fib6_config *cfg)
> } else
> rt->rt6i_prefsrc.plen = 0;
> 
> -   rt->rt6i_flags = cfg->fc_flags;
> +   rt->rt6i_flags = cfg->fc_flags & ~(RTF_PCPU | RTF_CACHE);
> 
>  install_route:
> rt->dst.dev = dev;
> 

I sent a patch returning EINVAL if RTF_PCPU is set in fc_flags


Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone

2017-04-19 Thread Cong Wang
On Wed, Apr 19, 2017 at 9:12 AM, Andrey Konovalov  wrote:
>
> Anyway, I just finished simplifying the reproducer. Give this one a try.

Thanks for providing such a minimal reproducer!

The following patch could fix this crash, but I am not 100% sure if we should
just clear these bits or reject them with an errno.

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 9db14189..cf524c2 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2086,7 +2086,7 @@ static struct rt6_info
*ip6_route_info_create(struct fib6_config *cfg)
} else
rt->rt6i_prefsrc.plen = 0;

-   rt->rt6i_flags = cfg->fc_flags;
+   rt->rt6i_flags = cfg->fc_flags & ~(RTF_PCPU | RTF_CACHE);

 install_route:
rt->dst.dev = dev;


Re: [PATCH v4 04/18] dt-bindings: syscon: Add DT bindings documentation for Allwinner syscon

2017-04-19 Thread André Przywara
On 12/04/17 12:13, Corentin Labbe wrote:
> This patch adds documentation for Device-Tree bindings for the
> syscon present in allwinner devices.
> 
> Signed-off-by: Corentin Labbe 
> ---
>  .../devicetree/bindings/misc/allwinner,syscon.txt | 19 
> +++
>  1 file changed, 19 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/misc/allwinner,syscon.txt
> 
> diff --git a/Documentation/devicetree/bindings/misc/allwinner,syscon.txt 
> b/Documentation/devicetree/bindings/misc/allwinner,syscon.txt
> new file mode 100644
> index 000..c056c5b
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/misc/allwinner,syscon.txt
> @@ -0,0 +1,19 @@
> +* Allwinner sun8i system controller
> +
> +This file describes the bindings for the system controller present in
> +Allwinner SoC H3, A83T and A64.
> +The principal function of this syscon is to control EMAC PHY choice and
> +config.
> +
> +Required properties for the system controller:
> +- reg: address and length of the register for the device.
> +- compatible: should be "syscon" and one of the following string:
> + "allwinner,sun8i-h3-system-controller"
> + "allwinner,sun8i-a64-system-controller"

While sun8i might make some sense technically, all 64-bit sunxi
compatible strings use the sun50i prefix to follow the Allwinner naming.
So this should read:
"allwinner,sun50i-a64-system-controller"

Also I am wondering if we should add a compatible string for the H5
(support for that SoC is in -next already):
"allwinner,sun50i-h5-system-controller"

Cheers,
Andre.

> + "allwinner,sun8i-a83t-system-controller"
> +
> +Example:
> +syscon: syscon@01c0 {
> + compatible = "allwinner,sun8i-h3-system-controller", "syscon";
> + reg = <0x01c0 0x1000>;
> +};
> 



Re: [RFC PATCH net] net/mlx5e: Race between mlx5e_update_stats() and getting the stats

2017-04-19 Thread Eric Dumazet
On Wed, 2017-04-19 at 14:53 -0700, Martin KaFai Lau wrote:

> Right, a temp and a memcpy should be enough to solve our spike problem.
> It may be the right fix for net.
> 
> Agree that using a spinlock is better (likely changing state_lock
> to spinlock).  A quick grep shows 80 line changes.  Saeed, thoughts?

I was not advising replacing the mutex (maybe it is a mutex for good
reason), I simply suggested to use another spinlock only for this very
specific section.

Something like :

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 
dc52053128bc752ccd398449330c24c0bdf8b3a1..9b2e1b79fded22d55e9409cb572308190679cfdd
 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -722,6 +722,7 @@ struct mlx5e_priv {
struct mlx5_core_dev  *mdev;
struct net_device *netdev;
struct mlx5e_stats stats;
+   spinlock_t stats_lock;
struct mlx5e_tstamptstamp;
u16 q_counter;
 #ifdef CONFIG_MLX5_CORE_EN_DCB
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 
a004a5a1a4c22a742ef3f9939769c6b5c9445f46..b4b7d43bf899cadca2c2a17151d35acac9773859
 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -315,9 +315,11 @@ static void mlx5e_get_ethtool_stats(struct net_device *dev,
mlx5e_update_stats(priv);
mutex_unlock(>state_lock);
 
+   spin_lock(>stats_lock);
for (i = 0; i < NUM_SW_COUNTERS; i++)
data[idx++] = MLX5E_READ_CTR64_CPU(>stats.sw,
   sw_stats_desc, i);
+   spin_unlock(>stats_lock);
 
for (i = 0; i < MLX5E_NUM_Q_CNTRS(priv); i++)
data[idx++] = MLX5E_READ_CTR32_CPU(>stats.qcnt,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 
66c133757a5ee8daae122e93322306b1c5c44336..4d6672045b1126a8bab4d6f2035e6a9b830560d2
 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -174,7 +174,7 @@ static void mlx5e_tx_timeout_work(struct work_struct *work)
 
 static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
 {
-   struct mlx5e_sw_stats *s = >stats.sw;
+   struct mlx5e_sw_stats temp, *s = 
struct mlx5e_rq_stats *rq_stats;
struct mlx5e_sq_stats *sq_stats;
u64 tx_offload_none = 0;
@@ -229,6 +229,9 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv 
*priv)
s->link_down_events_phy = MLX5_GET(ppcnt_reg,
priv->stats.pport.phy_counters,
counter_set.phys_layer_cntrs.link_down_events);
+   spin_lock(>stats_lock);
+   memcpy(>stats.sw, s, sizeof(*s));
+   spin_unlock(>stats_lock);
 }
 
 static void mlx5e_update_vport_counters(struct mlx5e_priv *priv)
@@ -2754,11 +2757,13 @@ mlx5e_get_stats(struct net_device *dev, struct 
rtnl_link_stats64 *stats)
stats->tx_packets = PPORT_802_3_GET(pstats, 
a_frames_transmitted_ok);
stats->tx_bytes   = PPORT_802_3_GET(pstats, 
a_octets_transmitted_ok);
} else {
+   spin_lock(>stats_lock);
stats->rx_packets = sstats->rx_packets;
stats->rx_bytes   = sstats->rx_bytes;
stats->tx_packets = sstats->tx_packets;
stats->tx_bytes   = sstats->tx_bytes;
stats->tx_dropped = sstats->tx_queue_dropped;
+   spin_unlock(>stats_lock);
}
 
stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
@@ -3561,6 +3566,8 @@ static void mlx5e_build_nic_netdev_priv(struct 
mlx5_core_dev *mdev,
 
mutex_init(>state_lock);
 
+   spin_lock_init(>stats_lock);
+
INIT_WORK(>update_carrier_work, mlx5e_update_carrier_work);
INIT_WORK(>set_rx_mode_work, mlx5e_set_rx_mode_work);
INIT_WORK(>tx_timeout_work, mlx5e_tx_timeout_work);




[PATCH net v2] net/mlx5e: Fix race in mlx5e_sw_stats and mlx5e_vport_stats

2017-04-19 Thread Martin KaFai Lau
We have observed a sudden spike in rx/tx_packets and rx/tx_bytes
reported under /proc/net/dev.  There is a race in mlx5e_update_stats()
and some of the get-stats functions (the one that we hit is the
mlx5e_get_stats() which is called by ndo_get_stats64()).

In particular, the very first thing mlx5e_update_sw_counters()
does is 'memset(s, 0, sizeof(*s))'.  For example, if mlx5e_get_stats()
is unlucky at one point, rx_bytes and rx_packets could be 0.  One second
later, a normal (and much bigger than 0) value will be reported.

This patch is to use a 'struct mlx5e_sw_stats temp' to avoid
a direct memset zero on priv->stats.sw.

mlx5e_update_vport_counters() has a similar race.  Hence, addressed
together.

I am lucky enough to catch this 0-reset in rx multicast:
eth0: 41457665   76804   700070  0 47085 15586634   
87502300 0   3  0
eth0: 41459860   76815   700070  0 47094 15588376   
87516300 0   3  0
eth0: 41460577   76822   700070  0 0 15589083   
87521300 0   3  0
eth0: 41463293   76838   700070  0 47108 15595872   
87538300 0   3  0
eth0: 41463379   76839   700070  0 47116 15596138   
87539300 0   3  0

Cc: Saeed Mahameed 
Suggested-by: Eric Dumazet 
Signed-off-by: Martin KaFai Lau 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 66c133757a5e..246786bb861b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -174,7 +174,7 @@ static void mlx5e_tx_timeout_work(struct work_struct *work)
 
 static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
 {
-   struct mlx5e_sw_stats *s = >stats.sw;
+   struct mlx5e_sw_stats temp, *s = 
struct mlx5e_rq_stats *rq_stats;
struct mlx5e_sq_stats *sq_stats;
u64 tx_offload_none = 0;
@@ -229,12 +229,14 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv 
*priv)
s->link_down_events_phy = MLX5_GET(ppcnt_reg,
priv->stats.pport.phy_counters,
counter_set.phys_layer_cntrs.link_down_events);
+   memcpy(>stats.sw, s, sizeof(*s));
 }
 
 static void mlx5e_update_vport_counters(struct mlx5e_priv *priv)
 {
+   struct mlx5e_vport_stats temp;
int outlen = MLX5_ST_SZ_BYTES(query_vport_counter_out);
-   u32 *out = (u32 *)priv->stats.vport.query_vport_out;
+   u32 *out = (u32 *)temp.query_vport_out;
u32 in[MLX5_ST_SZ_DW(query_vport_counter_in)] = {0};
struct mlx5_core_dev *mdev = priv->mdev;
 
@@ -245,6 +247,7 @@ static void mlx5e_update_vport_counters(struct mlx5e_priv 
*priv)
 
memset(out, 0, outlen);
mlx5_cmd_exec(mdev, in, sizeof(in), out, outlen);
+   memcpy(priv->stats.vport.query_vport_out, out, outlen);
 }
 
 static void mlx5e_update_pport_counters(struct mlx5e_priv *priv)
-- 
2.9.3



Re: FEC on i.MX 7 transmit queue timeout

2017-04-19 Thread Stefan Agner
On 2017-04-19 01:45, Andy Duan wrote:
> On 2017年04月19日 13:56, Stefan Agner wrote:
>> On 2017-04-18 22:28, Andy Duan wrote:
>>> From: Stefan Agner  Sent: Wednesday, April 19, 2017 1:02 PM
 To: Andy Duan 
 Cc: fugang.d...@freescale.com; feste...@gmail.com;
 netdev@vger.kernel.org; netdev-ow...@vger.kernel.org
 Subject: Re: FEC on i.MX 7 transmit queue timeout

 Hi Andy,

 On 2017-04-18 19:24, Andy Duan wrote:
> On 2017年04月19日 03:46, Stefan Agner wrote:
>> Hi,
>>
>> I noticed last week on upstream (v4.11-rc6) on a Colibri iMX7 board
>> that after a while (~10 minutes) the detdev wachdog prints a
>> stacktrace and the driver then continuously dumps the TX ring. I then
>> did a quick test with 4.10, and realized it actually suffers the same
>> issue, so it seems not to be a regression. I use a rootfs mounted over 
>> NFS...
>>
>> [ cut here ]
>> WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316
>> dev_watchdog+0x240/0x244
>> NETDEV WATCHDOG: eth0 (fec): transmit queue 2 timed out Modules
>> linked in:
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted
>> 4.11.0-rc7-00030-g2c4e6bd0c4f0-dirty #330 Hardware name: Freescale
>> i.MX7 Dual (Device Tree) [] (unwind_backtrace) from
>> [] (show_stack+0x10/0x14) [] (show_stack) from
>> [] (dump_stack+0x90/0xa0) [] (dump_stack) from
>> [] (__warn+0xac/0x11c) [] (__warn) from
>> [] (warn_slowpath_fmt+0x38/0x48) []
>> (warn_slowpath_fmt) from []
>> (dev_watchdog+0x240/0x244)
>> [] (dev_watchdog) from []
>> (run_timer_softirq+0x24c/0x708)
>> [] (run_timer_softirq) from []
>> (__do_softirq+0x12c/0x2a8)
>> [] (__do_softirq) from [] (irq_exit+0xdc/0x13c)
>> [] (irq_exit) from []
>> (__handle_domain_irq+0xa4/0xf8)
>> [] (__handle_domain_irq) from []
>> (gic_handle_irq+0x34/0xa4)
>> [] (gic_handle_irq) from [] (__irq_svc+0x58/0x8c)
>> Exception stack(0xc1201f30 to 0xc1201f78)
>> 1f20: c0233320  
>> 0140
>> 1f40: c1203d80 e000   c107bf10 c0e055b5 c1203d34
>> 0001
>> 1f60: c07d2324 c1201f80 c0222ac8 c0222acc 6013 
>> [] (__irq_svc) from [] (arch_cpu_idle+0x38/0x3c)
>> [] (arch_cpu_idle) from [] (do_idle+0xa8/0x250)
>> [] (do_idle) from []
>> (cpu_startup_entry+0x18/0x1c) [] (cpu_startup_entry) from
>> []
>> (start_kernel+0x3fc/0x45c)
>> ---[ end trace 5b0c6dc3466a7918 ]---
>> fec 30be.ethernet eth0: TX ring dump
>> Nr SC addr   len  SKB
>> 00x1c00 0x  590   (null)
>> 10x1c00 0x  590   (null)
>> 20x1c00 0x   42   (null)
>> 3  H 0x1c00 0x   42   (null)
>> 4 S  0x 0x0   (null)
>> 50x 0x0   (null)
>> 60x 0x0   (null)
>> 70x 0x0   (null)
>> 80x 0x0   (null)
>> 90x 0x0   (null)
>>100x 0x0   (null)
>>110x 0x0   (null)
>>120x 0x0   (null)
>>130x 0x0   (null)
>>140x 0x0   (null)
>>150x 0x0   (null)
>>160x 0x0   (null)
>>170x 0x0   (null)
>>180x 0x0   (null)
>> ...
>>
>>
>> A second TX ring dump from 4.10:
>> fec 30be.ethernet eth0: TX ring dump
>> Nr SC addr   len  SKB
>> 00x1c00 0x   42   (null)
>> 10x1c00 0x   42   (null)
>> 20x1c00 0x   90   (null)
>> 30x1c00 0x   90   (null)
>> 40x1c00 0x   90   (null)
>> 50x1c00 0x  218   (null)
>> 60x1c00 0x  218   (null)
>> 70x1c00 0x  218   (null)
>> 80x1c00 0x   90   (null)
>> 90x1c00 0x  206   (null)
>>100x1c00 0x  216   (null)
>>110x1c00 0x  216   (null)
>>120x1c00 0x  216   (null)
>>130x1c00 0x  311   (null)
>>140x1c00 0x  178   (null)
>>150x1c00 0x  311   (null)
>>160x1c00 0x  206   (null)
>>17  H 0x1c00 0x  311   (null)
>>18 S  0x 0x0   (null)
>>190x 0x0   (null)
> The dump show tx ring is fine.
>
>> The ring dump prints continously, but I can access console every now
>> and then. I noticed that the second interrupt seems static (66441, TX
>> interrupt?):
>>58: 18 GIC-0 150 Level 

Re: [PATCH] xen/9pfs: select CONFIG_XEN_XENBUS_FRONTEND

2017-04-19 Thread Stefano Stabellini
Juergen, I have committed this patch to for-linus-4.12 and linux-next, I
hope that's OK.

Og Wed, 19 Apr 2017, Stefano Stabellini wrote:
> On Wed, 19 Apr 2017, Arnd Bergmann wrote:
> > All Xen frontends need to select this symbol to avoid a link error:
> > 
> > net/built-in.o: In function `p9_trans_xen_init':
> > :(.text+0x149e9c): undefined reference to `__xenbus_register_frontend'
> > 
> > Fixes: d4b40a02f837 ("xen/9pfs: build 9pfs Xen transport driver")
> > Signed-off-by: Arnd Bergmann 
> 
> Thank you for the fix!
> 
> Reviewed-by: Stefano Stabellini 
> > ---
> >  net/9p/Kconfig | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/net/9p/Kconfig b/net/9p/Kconfig
> > index 3f286f1bd1d3..e6014e0e51f7 100644
> > --- a/net/9p/Kconfig
> > +++ b/net/9p/Kconfig
> > @@ -24,6 +24,7 @@ config NET_9P_VIRTIO
> >  
> >  config NET_9P_XEN
> > depends on XEN
> > +   select XEN_XENBUS_FRONTEND
> > tristate "9P Xen Transport"
> > help
> >   This builds support for a transport for 9pfs between
> > -- 
> > 2.9.0
> > 
> 


Re: XDP question: best API for returning/setting egress port?

2017-04-19 Thread Daniel Borkmann

On 04/19/2017 10:02 PM, Andy Gospodarek wrote:
[...]

and then lookup this dest in a table we have the option to make that
dest an ifindex/socket/other.

I did also look at JohnF's patch and I do like the simplicity of the redirect
action and new ndo_xdp_xmit and how it moves towards a way to transmit the
frame.  The downside is that it presumes an ifindex, so it might not be ideal
we want the lookup to return something other than an ifindex.


[...]

would be handled.  If we are ultimately going to need a new netdev op to
handle the redirect then what may be the issue with not providing the
destination port the return code and the option proposed by JohnF looks
good to me with maybe a small tweak to not presume ifindex in some manner.


Is there a concrete reason that all the proposed future cases like sockets
have to be handled within the very same XDP_REDIRECT return code? F.e. why
not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
ones would get a different return code f.e. XDP_TX_SK only handling sockets
when we get there implementation-wise?


[PATCH net-next] netvsc: fix use after free on module removal

2017-04-19 Thread Stephen Hemminger
The NAPI data structure is embedded in the netvsc_device structure
and is freed when device is closed. There is still a reference
(in NAPI list) to this which causes a crash in netif_napi_del
when device is removed. Fix by managing NAPI instances correctly.

Signed-off-by: Stephen Hemminger 
---
Note: resend with correct list address

 drivers/net/hyperv/netvsc.c   | 9 +
 drivers/net/hyperv/rndis_filter.c | 9 ++---
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index fd21d5aab580..e282bdff7aa0 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -568,8 +568,9 @@ void netvsc_device_remove(struct hv_device *device)
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
+   /* And dissassociate NAPI context from device */
for (i = 0; i < net_device->num_chn; i++)
-   napi_disable(_device->chan_table[i].napi);
+   netif_napi_del(_device->chan_table[i].napi);
 
/* Release all resources */
free_netvsc_device_rcu(net_device);
@@ -1304,8 +1305,6 @@ int netvsc_device_add(struct hv_device *device,
struct netvsc_channel *nvchan = _device->chan_table[i];
 
nvchan->channel = device->channel;
-   netif_napi_add(ndev, >napi,
-  netvsc_poll, NAPI_POLL_WEIGHT);
}
 
/* Open the channel */
@@ -1323,6 +1322,8 @@ int netvsc_device_add(struct hv_device *device,
netdev_dbg(ndev, "hv_netvsc channel opened successfully\n");
 
/* Enable NAPI handler for init callbacks */
+   netif_napi_add(ndev, _device->chan_table[0].napi,
+  netvsc_poll, NAPI_POLL_WEIGHT);
napi_enable(_device->chan_table[0].napi);
 
/* Writing nvdev pointer unlocks netvsc_send(), make sure chn_table is
@@ -1341,7 +1342,7 @@ int netvsc_device_add(struct hv_device *device,
return ret;
 
 close:
-   napi_disable(_device->chan_table[0].napi);
+   netif_napi_del(_device->chan_table[0].napi);
 
/* Now, we can close the channel safely */
vmbus_close(device->channel);
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 1e9445bc4539..ab92c3c95951 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1009,13 +1009,16 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
 
/* Set the channel before opening.*/
nvchan->channel = new_sc;
+   netif_napi_add(ndev, >napi,
+  netvsc_poll, NAPI_POLL_WEIGHT);
 
ret = vmbus_open(new_sc, nvscdev->ring_size * PAGE_SIZE,
 nvscdev->ring_size * PAGE_SIZE, NULL, 0,
 netvsc_channel_cb, nvchan);
-
-
-   napi_enable(>napi);
+   if (ret == 0)
+   napi_enable(>napi);
+   else
+   netdev_err(ndev, "sub channel open failed (%d)\n", ret);
 
if (refcount_dec_and_test(>sc_offered))
complete(>channel_init_wait);
-- 
2.11.0



Re: __sk_buff.data_end

2017-04-19 Thread Johannes Berg
On Wed, 2017-04-19 at 23:31 +0200, Johannes Berg wrote:
> Hi Alexei, Daniel,
> 
> I'm looking at adding the __wifi_sk_buff I talked about, and I notice
> that it uses CB space to store data_end. Unfortunately, in a lot of
> cases, we don't have any CB space to spare in wifi.

I guess I can work around this, would this seem reasonable?

 struct bpf_skb_data_end {
struct qdisc_skb_cb qdisc_cb;
-   void *data_end;
+   /*
+* The alignment here is for mac80211, since that doesn't use
+* a pointer but a u64 value and needs to save/restore that
+* across running its BPF programs.
+*/
+   void *data_end __aligned(sizeof(u64));
 };

johannes


Re: [PATCH net-next v6 05/11] seccomp: Split put_seccomp_filter() with put_seccomp()

2017-04-19 Thread Mickaël Salaün

On 19/04/2017 00:47, Mickaël Salaün wrote:
> 
> On 19/04/2017 00:23, Kees Cook wrote:
>> On Tue, Mar 28, 2017 at 4:46 PM, Mickaël Salaün  wrote:
>>> The semantic is unchanged. This will be useful for the Landlock
>>> integration with seccomp (next commit).
>>>
>>> Signed-off-by: Mickaël Salaün 
>>> Cc: Kees Cook 
>>> Cc: Andy Lutomirski 
>>> Cc: Will Drewry 
>>> ---
>>>  include/linux/seccomp.h |  4 ++--
>>>  kernel/fork.c   |  2 +-
>>>  kernel/seccomp.c| 18 +-
>>>  3 files changed, 16 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>>> index ecc296c137cd..e25aee2cdfc0 100644
>>> --- a/include/linux/seccomp.h
>>> +++ b/include/linux/seccomp.h
>>> @@ -77,10 +77,10 @@ static inline int seccomp_mode(struct seccomp *s)
>>>  #endif /* CONFIG_SECCOMP */
>>>
>>>  #ifdef CONFIG_SECCOMP_FILTER
>>> -extern void put_seccomp_filter(struct task_struct *tsk);
>>> +extern void put_seccomp(struct task_struct *tsk);
>>>  extern void get_seccomp_filter(struct task_struct *tsk);
>>>  #else  /* CONFIG_SECCOMP_FILTER */
>>> -static inline void put_seccomp_filter(struct task_struct *tsk)
>>> +static inline void put_seccomp(struct task_struct *tsk)
>>>  {
>>> return;
>>>  }
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 6c463c80e93d..a27d8e67ce33 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -363,7 +363,7 @@ void free_task(struct task_struct *tsk)
>>>  #endif
>>> rt_mutex_debug_task_free(tsk);
>>> ftrace_graph_exit_task(tsk);
>>> -   put_seccomp_filter(tsk);
>>> +   put_seccomp(tsk);
>>> arch_release_task_struct(tsk);
>>> if (tsk->flags & PF_KTHREAD)
>>> free_kthread_struct(tsk);
>>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>> index 65f61077ad50..326f79e32127 100644
>>> --- a/kernel/seccomp.c
>>> +++ b/kernel/seccomp.c
>>> @@ -64,6 +64,8 @@ struct seccomp_filter {
>>>  /* Limit any path through the tree to 256KB worth of instructions. */
>>>  #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
>>>
>>> +static void put_seccomp_filter(struct seccomp_filter *filter);
>>
>> Can this be reorganized easily to avoid a forward-declaration?
> 
> I didn't want to move too much code but I will.
> 
>>
>>> +
>>>  /*
>>>   * Endianness is explicitly ignored and left for BPF program authors to 
>>> manage
>>>   * as per the specific architecture.
>>> @@ -314,7 +316,7 @@ static inline void seccomp_sync_threads(void)
>>>  * current's path will hold a reference.  (This also
>>>  * allows a put before the assignment.)
>>>  */
>>> -   put_seccomp_filter(thread);
>>> +   put_seccomp_filter(thread->seccomp.filter);
>>> smp_store_release(>seccomp.filter,
>>>   caller->seccomp.filter);
>>>
>>> @@ -476,10 +478,11 @@ static inline void seccomp_filter_free(struct 
>>> seccomp_filter *filter)
>>> }
>>>  }
>>>
>>> -/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
>>> -void put_seccomp_filter(struct task_struct *tsk)
>>> +/* put_seccomp_filter - decrements the ref count of a filter */
>>> +static void put_seccomp_filter(struct seccomp_filter *filter)
>>>  {
>>> -   struct seccomp_filter *orig = tsk->seccomp.filter;
>>> +   struct seccomp_filter *orig = filter;
>>> +
>>> /* Clean up single-reference branches iteratively. */
>>> while (orig && atomic_dec_and_test(>usage)) {
>>> struct seccomp_filter *freeme = orig;
>>> @@ -488,6 +491,11 @@ void put_seccomp_filter(struct task_struct *tsk)
>>> }
>>>  }
>>>
>>> +void put_seccomp(struct task_struct *tsk)
>>> +{
>>> +   put_seccomp_filter(tsk->seccomp.filter);
>>> +}
>>> +
>>>  static void seccomp_init_siginfo(siginfo_t *info, int syscall, int reason)
>>>  {
>>> memset(info, 0, sizeof(*info));
>>> @@ -914,7 +922,7 @@ long seccomp_get_filter(struct task_struct *task, 
>>> unsigned long filter_off,
>>> if (copy_to_user(data, fprog->filter, bpf_classic_proglen(fprog)))
>>> ret = -EFAULT;
>>>
>>> -   put_seccomp_filter(task);
>>> +   put_seccomp_filter(task->seccomp.filter);
>>> return ret;
>>
>> I don't like that the arguments to get_seccomp_filter() and
>> put_seccomp_filter() are now different. I think they should match for
>> readability.
> 
> OK, I can do that.
> 

Kees, can I send this as a separate patch?



signature.asc
Description: OpenPGP digital signature


[PATCH net-next] brcmfmac: fix build without CONFIG_BRCMFMAC_PROTO_BCDC

2017-04-19 Thread Arnd Bergmann
With CONFIG_BRCMFMAC_PROTO_BCDC unset, we cannot build the fwsignal.c file:

drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c: In function 
'brcmf_fws_notify_credit_map':
drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c:1590:31: error: 
implicit declaration of function 'drvr_to_fws'; did you mean 'dev_to_psd'? 
[-Werror=implicit-function-declaration]
  struct brcmf_fws_info *fws = drvr_to_fws(ifp->drvr);
drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c:1590:31: error: 
initialization makes pointer from integer without a cast 
[-Werror=int-conversion]
drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c:1621:31: error: 
initialization makes pointer from integer without a cast 
[-Werror=int-conversion]

However, as pointed out in the changeset description for the patch that caused
the problem, fwsignal.c is only required when CONFIG_BRCMFMAC_PROTO_BCDC is
enabled, so we can simply change the Makefile to build it conditionally.

Fixes: acf8ac41dd73 ("brcmfmac: remove reference to fwsignal data from struct 
brcmf_pub")
Signed-off-by: Arnd Bergmann 
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/Makefile 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/Makefile
index 0383ba559edc..1f5a9b948abf 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/Makefile
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/Makefile
@@ -25,7 +25,6 @@ brcmfmac-objs += \
chip.o \
fwil.o \
fweh.o \
-   fwsignal.o \
p2p.o \
proto.o \
common.o \
@@ -36,7 +35,8 @@ brcmfmac-objs += \
vendor.o \
pno.o
 brcmfmac-$(CONFIG_BRCMFMAC_PROTO_BCDC) += \
-   bcdc.o
+   bcdc.o \
+   fwsignal.o
 brcmfmac-$(CONFIG_BRCMFMAC_PROTO_MSGBUF) += \
commonring.o \
flowring.o \
-- 
2.9.0



Re: [PATCH net-next v6 09/11] seccomp: Enhance test_harness with an assert step mechanism

2017-04-19 Thread Mickaël Salaün


On 20/04/2017 00:02, Kees Cook wrote:
> On Wed, Apr 19, 2017 at 2:51 PM, Mickaël Salaün  wrote:
>>
>> On 19/04/2017 02:02, Kees Cook wrote:
>>> On Tue, Mar 28, 2017 at 4:46 PM, Mickaël Salaün  wrote:
 This is useful to return an information about the error without being
 able to write to TH_LOG_STREAM.

 Helpers from test_harness.h may be useful outside of the seccomp
 directory.

 Signed-off-by: Mickaël Salaün 
 Cc: Andy Lutomirski 
 Cc: Arnaldo Carvalho de Melo 
 Cc: Kees Cook 
 Cc: Shuah Khan 
 Cc: Will Drewry 
 ---
  tools/testing/selftests/seccomp/test_harness.h | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

 diff --git a/tools/testing/selftests/seccomp/test_harness.h 
 b/tools/testing/selftests/seccomp/test_harness.h
 index a786c69c7584..77e407663e06 100644
 --- a/tools/testing/selftests/seccomp/test_harness.h
 +++ b/tools/testing/selftests/seccomp/test_harness.h
 @@ -397,7 +397,7 @@ struct __test_metadata {
 const char *name;
 void (*fn)(struct __test_metadata *);
 int termsig;
 -   int passed;
 +   __s8 passed;
>>>
>>> Why the reduction here? int is signed too?
>>
>> Because the return code of a process is capped to 8 bits and I use a
>> negative value to not mess with the current interpretation of 0 (error)
>> and 1 (OK) for the "passed" variable.
>>
>>>
 int trigger; /* extra handler after the evaluation */
 struct __test_metadata *prev, *next;
  };
 @@ -476,6 +476,12 @@ void __run_test(struct __test_metadata *t)
 "instead of by signal (code: 
 %d)\n",
 t->name,
 WEXITSTATUS(status));
 +   } else if (t->passed < 0) {
 +   fprintf(TH_LOG_STREAM,
 +   "%s: Failed at step #%d\n",
 +   t->name,
 +   t->passed * -1);
 +   t->passed = 0;
 }
>>>
>>> Instead of creating an overloaded mechanism here, perhaps have an
>>> option reporting mechanism that can be enabled. Like adding to
>>> __test_metadata "bool no_stream; int test_number;" and adding
>>> test_number++ to each ASSERT/EXCEPT call, and doing something like:
>>>
>>> if (t->no_stream) {
>>>   fprintf(TH_LOG_STREAM,
>>>   "%s: Failed at step #%d\n",
>>>   t->name,
>>>t->test_number);
>>> }
>>>
>>> It'd be a cleaner approach, maybe?
>>
>> Good idea, we will then be able to use 255 steps!
>>
>> Do you want me to send this as a separate patch?
>>
>> Can we move test_harness.h outside of the seccomp directory to be
>> available to other subsystems as well?
> 
> Yeah, I would do two patches, and send them out separately (to shuah
> with lkml and me in cc at least), one to move test_hardness.h into
> some include/ directory, and then to add the new logic for streamless
> reporting.
> 
> Thanks!
> 
> -Kees
> 
> 

Good, in which place and name would it fit better?



signature.asc
Description: OpenPGP digital signature


Re: [PATCH net-next v6 04/11] landlock: Add LSM hooks related to filesystem

2017-04-19 Thread Mickaël Salaün

On 19/04/2017 01:40, Kees Cook wrote:
> On Tue, Apr 18, 2017 at 4:16 PM, Casey Schaufler  
> wrote:
>> On 4/18/2017 3:44 PM, Mickaël Salaün wrote:
>>> On 19/04/2017 00:17, Kees Cook wrote:
 On Tue, Mar 28, 2017 at 4:46 PM, Mickaël Salaün  wrote:
> +void __init landlock_add_hooks(void)
> +{
> +   pr_info("landlock: Version %u", LANDLOCK_VERSION);
> +   landlock_add_hooks_fs();
> +   security_add_hooks(NULL, 0, "landlock");
> +   bpf_register_prog_type(_landlock_type);
 I'm confused by the separation of hook registration here. The call to
 security_add_hooks is with count=0 is especially weird. Why isn't this
 just a single call with security_add_hooks(landlock_hooks,
 ARRAY_SIZE(landlock_hooks), "landlock")?
>>> Yes, this is ugly with the new security_add_hooks() with three arguments
>>> but I wanted to split the hooks definition in multiple files.
>>
>> Why? I'll buy a good argument, but there are dangers in
>> allowing multiple calls to security_add_hooks().

I prefer to have one file per hook "family" (e.g. filesystem, network,
ptrace…). This reduce the mess with all the included files (needed for
LSM hook argument types) and make the files easier to read, understand
and maintain.

>>
>>>
>>> The current security_add_hooks() use lsm_append(lsm, _names) which
>>> is not exported. Unfortunately, calling multiple security_add_hooks()
>>> with the same LSM name would register multiple names for the same LSM…
>>> Is it OK if I modify this function to not add duplicated entries?
>>
>> It may seem absurd, but it's conceivable that a module might
>> have two hooks it wants called. My example is a module that
>> counts the number of times SELinux denies a process access to
>> things (which needs to be called before and after SELinux in
>> order to detect denials) and takes "appropriate action" if
>> too many denials occur. It would be weird, wonky and hackish,
>> but that never stopped anybody before.

Right, but now, with the new lsm_append(), module names are concatenated
("%s,%s") in the lsm_names variable. It would be nice to not pollute
this string with multiple time the same module name.

> 
> If ends up being sane and clear, I'm fine with allowing multiple calls.
> 
> -Kees
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH net-next v6 09/11] seccomp: Enhance test_harness with an assert step mechanism

2017-04-19 Thread Kees Cook
On Wed, Apr 19, 2017 at 2:51 PM, Mickaël Salaün  wrote:
>
> On 19/04/2017 02:02, Kees Cook wrote:
>> On Tue, Mar 28, 2017 at 4:46 PM, Mickaël Salaün  wrote:
>>> This is useful to return an information about the error without being
>>> able to write to TH_LOG_STREAM.
>>>
>>> Helpers from test_harness.h may be useful outside of the seccomp
>>> directory.
>>>
>>> Signed-off-by: Mickaël Salaün 
>>> Cc: Andy Lutomirski 
>>> Cc: Arnaldo Carvalho de Melo 
>>> Cc: Kees Cook 
>>> Cc: Shuah Khan 
>>> Cc: Will Drewry 
>>> ---
>>>  tools/testing/selftests/seccomp/test_harness.h | 8 +++-
>>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/tools/testing/selftests/seccomp/test_harness.h 
>>> b/tools/testing/selftests/seccomp/test_harness.h
>>> index a786c69c7584..77e407663e06 100644
>>> --- a/tools/testing/selftests/seccomp/test_harness.h
>>> +++ b/tools/testing/selftests/seccomp/test_harness.h
>>> @@ -397,7 +397,7 @@ struct __test_metadata {
>>> const char *name;
>>> void (*fn)(struct __test_metadata *);
>>> int termsig;
>>> -   int passed;
>>> +   __s8 passed;
>>
>> Why the reduction here? int is signed too?
>
> Because the return code of a process is capped to 8 bits and I use a
> negative value to not mess with the current interpretation of 0 (error)
> and 1 (OK) for the "passed" variable.
>
>>
>>> int trigger; /* extra handler after the evaluation */
>>> struct __test_metadata *prev, *next;
>>>  };
>>> @@ -476,6 +476,12 @@ void __run_test(struct __test_metadata *t)
>>> "instead of by signal (code: %d)\n",
>>> t->name,
>>> WEXITSTATUS(status));
>>> +   } else if (t->passed < 0) {
>>> +   fprintf(TH_LOG_STREAM,
>>> +   "%s: Failed at step #%d\n",
>>> +   t->name,
>>> +   t->passed * -1);
>>> +   t->passed = 0;
>>> }
>>
>> Instead of creating an overloaded mechanism here, perhaps have an
>> option reporting mechanism that can be enabled. Like adding to
>> __test_metadata "bool no_stream; int test_number;" and adding
>> test_number++ to each ASSERT/EXCEPT call, and doing something like:
>>
>> if (t->no_stream) {
>>   fprintf(TH_LOG_STREAM,
>>   "%s: Failed at step #%d\n",
>>   t->name,
>>t->test_number);
>> }
>>
>> It'd be a cleaner approach, maybe?
>
> Good idea, we will then be able to use 255 steps!
>
> Do you want me to send this as a separate patch?
>
> Can we move test_harness.h outside of the seccomp directory to be
> available to other subsystems as well?

Yeah, I would do two patches, and send them out separately (to shuah
with lkml and me in cc at least), one to move test_hardness.h into
some include/ directory, and then to add the new logic for streamless
reporting.

Thanks!

-Kees


-- 
Kees Cook
Pixel Security


Re: [PATCH v2 net-next]smsc911x: Adding support for Micochip LAN9250 Ethernet controller

2017-04-19 Thread Andrew Lunn
On Wed, Apr 19, 2017 at 09:26:43PM +, david@microchip.com wrote:
> Adding support for Microchip LAN9250 Ethernet controller.
> 
> Signed-off-by: David Cai 
> ---
> Changes
> V2
>  -  email format changed
>  - remove unnecessary text in commit log
 
Much better, thanks.

>  drivers/net/ethernet/smsc/smsc911x.c | 32 +++-
>  drivers/net/ethernet/smsc/smsc911x.h |  3 +++
>  2 files changed, 30 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/smsc/smsc911x.c 
> b/drivers/net/ethernet/smsc/smsc911x.c
> index fa5ca09..22b1951 100644
> --- a/drivers/net/ethernet/smsc/smsc911x.c
> +++ b/drivers/net/ethernet/smsc/smsc911x.c
> @@ -25,7 +25,7 @@
>   *   LAN9215, LAN9216, LAN9217, LAN9218
>   *   LAN9210, LAN9211
>   *   LAN9220, LAN9221
> - *   LAN89218
> + *   LAN89218,LAN9250
>   *
>   */
>  
> @@ -104,6 +104,9 @@ struct smsc911x_data {
>   /* used to decide which workarounds apply */
>   unsigned int generation;
>  
> + /* used to decide which sub generation product work arounds to apply */
> + unsigned int sub_generation;
> +
>   /* device configuration (copied from platform_data during probe) */
>   struct smsc911x_platform_config config;
>  
> @@ -1450,6 +1453,8 @@ static int smsc911x_soft_reset(struct smsc911x_data 
> *pdata)
>   unsigned int timeout;
>   unsigned int temp;
>   int ret;
> + unsigned int reset_offset = HW_CFG;
> + unsigned int reset_mask = HW_CFG_SRST_;
>  
>   /*
>* Make sure to power-up the PHY chip before doing a reset, otherwise
> @@ -1476,15 +1481,23 @@ static int smsc911x_soft_reset(struct smsc911x_data 
> *pdata)
>   }
>   }
>  
> + if (pdata->sub_generation) {

I know it does not make a difference at the moment, but shouldn't this
be

if (pdata>generation == 4 && pdata->sub_generation) {

at some point other sub-generations might become important, at which
point your simpler code breaks.

> + /* special reset for  LAN9250 */
> + reset_offset = RESET_CTL;
> + reset_mask = RESET_CTL_DIGITAL_RST_;
> + }

> @@ -2251,6 +2264,9 @@ static int smsc911x_init(struct net_device *dev)
>   /* Default generation to zero (all workarounds apply) */
>   pdata->generation = 0;
>  
> + /* Default sub_generation to zero */
> + pdata->sub_generation = 0;
> +

alloc_etherdev() uses kzalloc, so sub_generation is guaranteed to be
0. No need to set it here.

>   pdata->idrev = smsc911x_reg_read(pdata, ID_REV);
>   switch (pdata->idrev & 0x) {
>   case 0x0118:
> @@ -2278,6 +2294,12 @@ static int smsc911x_init(struct net_device *dev)
>   pdata->generation = 4;
>   break;
>  
> + case 0x9250:
> + /* LAN9250 */
> + pdata->generation = 4;
> + pdata->sub_generation = 1;
> + break;
>

Maybe it would be more readable in general to add some #defines for
0x9250 and all the other hex values, and then in smsc911x_init()
look at the pdata->idrev?

 Andrew


Re: [PATCH 2/4] net: macb: Add tsu_clk to device tree

2017-04-19 Thread Rob Herring
On Thu, Apr 13, 2017 at 02:38:06PM +0100, Rafal Ozieblo wrote:
> Signed-off-by: Rafal Ozieblo 
> ---
>  Documentation/devicetree/bindings/net/macb.txt | 1 +
>  1 file changed, 1 insertion(+)

Acked-by: Rob Herring 


Re: [RFC PATCH net] net/mlx5e: Race between mlx5e_update_stats() and getting the stats

2017-04-19 Thread Martin KaFai Lau
On Wed, Apr 19, 2017 at 01:24:38PM -0700, Eric Dumazet wrote:
> On Wed, 2017-04-19 at 11:29 -0700, Martin KaFai Lau wrote:
> > We have observed a sudden spike in rx/tx_packets and rx/tx_bytes
> > reported under /proc/net/dev.  It seems there is a race in
> > mlx5e_update_stats() and some of the get-stats functions (the
> > one that we hit is the mlx5e_get_stats() which is called
> > by ndo_get_stats64()).
> >
> > In particular, the very first thing mlx5e_update_sw_counters()
> > does is 'memset(s, 0, sizeof(*s))'.  For example, if mlx5e_get_stats()
> > is unlucky at one point, rx_bytes and rx_packets could be 0.  One second
> > later, a normal (and much bigger than 0) value will be reported.
> >
> > This patch is not meant to be a proper fix.  It merely tries
> > to show what I have suspected and start the discussion.
> >
> > Signed-off-by: Martin KaFai Lau 
> > Cc: Saeed Mahameed 
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 7 +--
> >  drivers/net/ethernet/mellanox/mlx5/core/en_main.c| 3 +++
> >  2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> > index a004a5a1a4c2..d24916f720bb 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> > @@ -313,7 +313,6 @@ static void mlx5e_get_ethtool_stats(struct net_device 
> > *dev,
> > mutex_lock(>state_lock);
> > if (test_bit(MLX5E_STATE_OPENED, >state))
> > mlx5e_update_stats(priv);
> > -   mutex_unlock(>state_lock);
> >
> > for (i = 0; i < NUM_SW_COUNTERS; i++)
> > data[idx++] = MLX5E_READ_CTR64_CPU(>stats.sw,
> > @@ -378,8 +377,10 @@ static void mlx5e_get_ethtool_stats(struct net_device 
> > *dev,
> > data[idx++] = 
> > MLX5E_READ_CTR64_CPU(mlx5_priv->pme_stats.error_counters,
> >mlx5e_pme_error_desc, i);
> >
> > -   if (!test_bit(MLX5E_STATE_OPENED, >state))
> > +   if (!test_bit(MLX5E_STATE_OPENED, >state)) {
> > +   mutex_unlock(>state_lock);
> > return;
> > +   }
> >
> > /* per channel counters */
> > for (i = 0; i < priv->params.num_channels; i++)
> > @@ -393,6 +394,8 @@ static void mlx5e_get_ethtool_stats(struct net_device 
> > *dev,
> > for (j = 0; j < NUM_SQ_STATS; j++)
> > data[idx++] = 
> > MLX5E_READ_CTR64_CPU(>channel[i]->sq[tc].stats,
> >
> > sq_stats_desc, j);
> > +
> > +   mutex_unlock(>state_lock);
> >  }
> >
> >  static u32 mlx5e_rx_wqes_to_packets(struct mlx5e_priv *priv, int 
> > rq_wq_type,
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index 66c133757a5e..a4c100bea541 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -2748,6 +2748,8 @@ mlx5e_get_stats(struct net_device *dev, struct 
> > rtnl_link_stats64 *stats)
> > struct mlx5e_vport_stats *vstats = >stats.vport;
> > struct mlx5e_pport_stats *pstats = >stats.pport;
> >
> > +   mutex_lock(>state_lock);
> > +
>
> We can not sleep from ndo_get_stats() ( look at bonding driver )
Thanks for pointing out the bond_get_stats().

>
> What about the following ?
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 
> 66c133757a5ee8daae122e93322306b1c5c44336..b9fea146a0ca18498a8dfa5698dca7dea06e3c5e
>  100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -174,7 +174,7 @@ static void mlx5e_tx_timeout_work(struct work_struct 
> *work)
>
>  static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
>  {
> - struct mlx5e_sw_stats *s = >stats.sw;
> + struct mlx5e_sw_stats temp, *s = 
>   struct mlx5e_rq_stats *rq_stats;
>   struct mlx5e_sq_stats *sq_stats;
>   u64 tx_offload_none = 0;
> @@ -229,6 +229,8 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv 
> *priv)
>   s->link_down_events_phy = MLX5_GET(ppcnt_reg,
>   priv->stats.pport.phy_counters,
>   counter_set.phys_layer_cntrs.link_down_events);
> + /* A bit racy (depending on memcpy() sanity...) , we probably should 
> use a spinlock */
> + memcpy(>stats.sw, s, sizeof(*s));
Right, a temp and a memcpy should be enough to solve our spike problem.
It may be the right fix for net.

Agree that using a spinlock is better (likely changing state_lock
to spinlock).  A quick grep shows 80 line changes.  Saeed, thoughts?

>  }
>
>  static void mlx5e_update_vport_counters(struct mlx5e_priv *priv)
>
>


Re: [PATCH net-next v6 09/11] seccomp: Enhance test_harness with an assert step mechanism

2017-04-19 Thread Mickaël Salaün

On 19/04/2017 02:02, Kees Cook wrote:
> On Tue, Mar 28, 2017 at 4:46 PM, Mickaël Salaün  wrote:
>> This is useful to return an information about the error without being
>> able to write to TH_LOG_STREAM.
>>
>> Helpers from test_harness.h may be useful outside of the seccomp
>> directory.
>>
>> Signed-off-by: Mickaël Salaün 
>> Cc: Andy Lutomirski 
>> Cc: Arnaldo Carvalho de Melo 
>> Cc: Kees Cook 
>> Cc: Shuah Khan 
>> Cc: Will Drewry 
>> ---
>>  tools/testing/selftests/seccomp/test_harness.h | 8 +++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/seccomp/test_harness.h 
>> b/tools/testing/selftests/seccomp/test_harness.h
>> index a786c69c7584..77e407663e06 100644
>> --- a/tools/testing/selftests/seccomp/test_harness.h
>> +++ b/tools/testing/selftests/seccomp/test_harness.h
>> @@ -397,7 +397,7 @@ struct __test_metadata {
>> const char *name;
>> void (*fn)(struct __test_metadata *);
>> int termsig;
>> -   int passed;
>> +   __s8 passed;
> 
> Why the reduction here? int is signed too?

Because the return code of a process is capped to 8 bits and I use a
negative value to not mess with the current interpretation of 0 (error)
and 1 (OK) for the "passed" variable.

> 
>> int trigger; /* extra handler after the evaluation */
>> struct __test_metadata *prev, *next;
>>  };
>> @@ -476,6 +476,12 @@ void __run_test(struct __test_metadata *t)
>> "instead of by signal (code: %d)\n",
>> t->name,
>> WEXITSTATUS(status));
>> +   } else if (t->passed < 0) {
>> +   fprintf(TH_LOG_STREAM,
>> +   "%s: Failed at step #%d\n",
>> +   t->name,
>> +   t->passed * -1);
>> +   t->passed = 0;
>> }
> 
> Instead of creating an overloaded mechanism here, perhaps have an
> option reporting mechanism that can be enabled. Like adding to
> __test_metadata "bool no_stream; int test_number;" and adding
> test_number++ to each ASSERT/EXCEPT call, and doing something like:
> 
> if (t->no_stream) {
>   fprintf(TH_LOG_STREAM,
>   "%s: Failed at step #%d\n",
>   t->name,
>t->test_number);
> }
> 
> It'd be a cleaner approach, maybe?

Good idea, we will then be able to use 255 steps!

Do you want me to send this as a separate patch?

Can we move test_harness.h outside of the seccomp directory to be
available to other subsystems as well?



signature.asc
Description: OpenPGP digital signature


Re: [PATCH net] net: ipv6: RTF_PCPU should not be settable from userspace

2017-04-19 Thread Martin KaFai Lau
On Wed, Apr 19, 2017 at 02:19:43PM -0700, David Ahern wrote:
> Andrey reported a fault in the IPv6 route code:
>
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN
> Modules linked in:
> CPU: 1 PID: 4035 Comm: a.out Not tainted 4.11.0-rc7+ #250
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: 880069809600 task.stack: 880062dc8000
> RIP: 0010:ip6_rt_cache_alloc+0xa6/0x560 net/ipv6/route.c:975
> RSP: 0018:880062dced30 EFLAGS: 00010206
> RAX: dc00 RBX: 8800670561c0 RCX: 0006
> RDX: 0003 RSI: 880062dcfb28 RDI: 0018
> RBP: 880062dced68 R08: 0001 R09: 
> R10:  R11:  R12: 
> R13: 880062dcfb28 R14: dc00 R15: 
> FS:  7feebe37e7c0() GS:88006cb0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 205a0fe4 CR3: 6b5c9000 CR4: 06e0
> Call Trace:
>  ip6_pol_route+0x1512/0x1f20 net/ipv6/route.c:1128
>  ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
> ...
>
> Andrey's syzkaller program passes rtmsg.rtmsg_flags with the RTF_PCPU bit
> set. Flags passed to the kernel are blindly copied to the allocated
> rt6_info by ip6_route_info_create making a newly inserted route appear
> as though it is a per-cpu route. ip6_rt_cache_alloc sees the flag set
> and expects rt->dst.from to be set - which it is not since it is not
> really a per-cpu copy. The subsequent call to __ip6_dst_alloc then
> generates the fault.
>
> Fix by checking for the flag and failing with EINVAL.
Thanks for the fix.

Acked-by: Martin KaFai Lau 

>
> Fixes: d52d3997f843f ("ipv6: Create percpu rt6_info")
> Reported-by: Andrey Konovalov 
> Signed-off-by: David Ahern 
> ---
>  include/uapi/linux/ipv6_route.h | 2 +-
>  net/ipv6/route.c| 4 
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
> index 85bbb1799df3..d496c02e14bc 100644
> --- a/include/uapi/linux/ipv6_route.h
> +++ b/include/uapi/linux/ipv6_route.h
> @@ -35,7 +35,7 @@
>  #define RTF_PREF(pref)   ((pref) << 27)
>  #define RTF_PREF_MASK0x1800
>
> -#define RTF_PCPU 0x4000
> +#define RTF_PCPU 0x4000  /* read-only: can not be set by user */
>  #define RTF_LOCAL0x8000
>
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 4ba7c49872ff..a1bf426c959b 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1854,6 +1854,10 @@ static struct rt6_info *ip6_route_info_create(struct 
> fib6_config *cfg)
>   int addr_type;
>   int err = -EINVAL;
>
> + /* RTF_PCPU is an internal flag; can not be set by userspace */
> + if (cfg->fc_flags & RTF_PCPU)
> + goto out;
> +
>   if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128)
>   goto out;
>  #ifndef CONFIG_IPV6_SUBTREES
> --
> 2.9.3
>


Re: XDP question: best API for returning/setting egress port?

2017-04-19 Thread Daniel Borkmann

On 04/19/2017 10:02 PM, Andy Gospodarek wrote:

On Tue, Apr 18, 2017 at 09:58:56PM +0200, Jesper Dangaard Brouer wrote:


As I argued in NetConf presentation[1] (from slide #9) we need a port
mapping table (instead of using ifindex'es).  Both for supporting
other "port" types than net_devices (think sockets), and for
sandboxing what XDP can bypass.

I want to create a new XDP action called XDP_REDIRECT, that instruct
XDP to send the xdp_buff to another "port" (get translated into a
net_device, or something else depending on internal port type).

Looking at the userspace/eBPF interface, I'm wondering what is the
best API for "returning" this port number from eBPF?

The options I see is:

1) Split-up the u32 action code, and e.g let the high-16-bit be the
port number and lower-16bit the (existing) action verdict.

  Pros: Simple API
  Cons: Number of ports limited to 64K


Practically speaking this may be seem reserving 64k for the port number
might be enough space, but I would also like to see a new return option
for flags so I start to get concerned about space.  Daniel also
hightlights the fact that encoding the port in the action may does not
leave room for flags and could get confusing.

One unfortunate side-effect of dropping or transmitting frames with XDP
is that we lose the opportunity to statistically sample in netfilter
since the frames were dropped so early and I'd like to bring that back
with a call to parse flags and possibly call psample_sample_packet()
after the xdp action.  Packet sampling cannot simply be an action
since there are times when a frame should be dropped but should also be
sampled, so it seems logical to add this as a flag.


Hmm, you can do that already today with bpf_xdp_event_output() helper
and the program decides when to sample and what custom meta data to
prepend along with the (partial or full) xdp packet, see also [1], slide
11 for the "XDP dump" part.

No need to change drivers for this, psample_sample_packet() would also
be slower.

  [1] http://netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf


Re: [PATCH v2] sh_eth: unmap DMA buffers when freeing rings

2017-04-19 Thread David Miller
From: Sergei Shtylyov 
Date: Wed, 19 Apr 2017 22:09:51 +0300

> On 04/17/2017 11:10 PM, David Miller wrote:
> 
>>> The DMA API debugging (when enabled) causes:
>>>
>>> WARNING: CPU: 0 PID: 1445 at lib/dma-debug.c:519
>>> add_dma_entry+0xe0/0x12c
>>> DMA-API: exceeded 7 overlapping mappings of cacheline 0x01b2974d
>>>
>>> to be  printed after repeated initialization of the Ether device, e.g.
>>> suspend/resume or 'ifconfig' up/down. This is because DMA buffers
>>> mapped
>>> using dma_map_single() in sh_eth_ring_format() and sh_eth_start_xmit()
>>> are
>>> never unmapped. Resolve this problem by unmapping the buffers when
>>> freeing
>>> the descriptor rings; in order to do it right, we'd have to add an
>>> extra
>>> parameter to sh_eth_txfree() (we rename this function to
>>> sh_eth_tx_free(),
>>> while at it).
>>>
>>> Based on the commit a47b70ea86bd ("ravb: unmap descriptors when
>>> freeing
>>> rings").
>>>
>>> Signed-off-by: Sergei Shtylyov 
>>
>> Applied, thanks.
> 
>Please don;t forget to send it to -stable.
>The bug seems to be there from the very beginning.

Ok, queued up.


Re: [PATCH v2 net 2/2] net sched actions: decrement module refcount earlier

2017-04-19 Thread Cong Wang
On Wed, Apr 19, 2017 at 8:03 AM, Wolfgang Bumiller
 wrote:
>> On April 19, 2017 at 1:32 PM Jamal Hadi Salim  wrote:
>> My suggestion:
>> If we move the cookie allocation before init we can save it and
>> only when init succeeds do we attach it to the action, otherwise
>> we free it on error path.
>
> Shouldn't the old code have freed an old a->act_cookie as well before
> replacing it then? nla_memdup_cookie() starts off by assigning a->act_cookie.
>

Yeah. Looks even better, we can totally avoid rollback the action modification
rollback on failure.


__sk_buff.data_end

2017-04-19 Thread Johannes Berg
Hi Alexei, Daniel,

I'm looking at adding the __wifi_sk_buff I talked about, and I notice
that it uses CB space to store data_end. Unfortunately, in a lot of
cases, we don't have any CB space to spare in wifi.

Is there any way to generate a series of instructions that instead does
the necessary calculations? I don't actually *see* such a way, because
I don't see how I could have a scratch register or scratch stack space,
but perhaps there's a way to do it?

johannes


[PATCH v2 net-next]smsc911x: Adding support for Micochip LAN9250 Ethernet controller

2017-04-19 Thread David.Cai
Adding support for Microchip LAN9250 Ethernet controller.

Signed-off-by: David Cai 
---
Changes
V2
 -  email format changed
 - remove unnecessary text in commit log

 drivers/net/ethernet/smsc/smsc911x.c | 32 +++-
 drivers/net/ethernet/smsc/smsc911x.h |  3 +++
 2 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smsc911x.c 
b/drivers/net/ethernet/smsc/smsc911x.c
index fa5ca09..22b1951 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -25,7 +25,7 @@
  *   LAN9215, LAN9216, LAN9217, LAN9218
  *   LAN9210, LAN9211
  *   LAN9220, LAN9221
- *   LAN89218
+ *   LAN89218,LAN9250
  *
  */
 
@@ -104,6 +104,9 @@ struct smsc911x_data {
/* used to decide which workarounds apply */
unsigned int generation;
 
+   /* used to decide which sub generation product work arounds to apply */
+   unsigned int sub_generation;
+
/* device configuration (copied from platform_data during probe) */
struct smsc911x_platform_config config;
 
@@ -1450,6 +1453,8 @@ static int smsc911x_soft_reset(struct smsc911x_data 
*pdata)
unsigned int timeout;
unsigned int temp;
int ret;
+   unsigned int reset_offset = HW_CFG;
+   unsigned int reset_mask = HW_CFG_SRST_;
 
/*
 * Make sure to power-up the PHY chip before doing a reset, otherwise
@@ -1476,15 +1481,23 @@ static int smsc911x_soft_reset(struct smsc911x_data 
*pdata)
}
}
 
+   if (pdata->sub_generation) {
+   /* special reset for  LAN9250 */
+   reset_offset = RESET_CTL;
+   reset_mask = RESET_CTL_DIGITAL_RST_;
+   }
+
/* Reset the LAN911x */
-   smsc911x_reg_write(pdata, HW_CFG, HW_CFG_SRST_);
+   smsc911x_reg_write(pdata, reset_offset, reset_mask);
+
+   /* verify reset bit is cleared */
timeout = 10;
do {
udelay(10);
-   temp = smsc911x_reg_read(pdata, HW_CFG);
-   } while ((--timeout) && (temp & HW_CFG_SRST_));
+   temp = smsc911x_reg_read(pdata, reset_offset);
+   } while ((--timeout) && (temp & reset_mask));
 
-   if (unlikely(temp & HW_CFG_SRST_)) {
+   if (unlikely(temp & reset_mask)) {
SMSC_WARN(pdata, drv, "Failed to complete reset");
return -EIO;
}
@@ -2251,6 +2264,9 @@ static int smsc911x_init(struct net_device *dev)
/* Default generation to zero (all workarounds apply) */
pdata->generation = 0;
 
+   /* Default sub_generation to zero */
+   pdata->sub_generation = 0;
+
pdata->idrev = smsc911x_reg_read(pdata, ID_REV);
switch (pdata->idrev & 0x) {
case 0x0118:
@@ -2278,6 +2294,12 @@ static int smsc911x_init(struct net_device *dev)
pdata->generation = 4;
break;
 
+   case 0x9250:
+   /* LAN9250 */
+   pdata->generation = 4;
+   pdata->sub_generation = 1;
+   break;
+
default:
SMSC_WARN(pdata, probe, "LAN911x not identified, idrev: 0x%08X",
  pdata->idrev);
diff --git a/drivers/net/ethernet/smsc/smsc911x.h 
b/drivers/net/ethernet/smsc/smsc911x.h
index 54d6489..6b2d479 100644
--- a/drivers/net/ethernet/smsc/smsc911x.h
+++ b/drivers/net/ethernet/smsc/smsc911x.h
@@ -303,6 +303,9 @@
 #define E2P_DATA_EEPROM_DATA_  0x00FF
 #define LAN_REGISTER_EXTENT0x0100
 
+#define RESET_CTL  0x1F8
+#define RESET_CTL_DIGITAL_RST_ 0x0001
+
 /*
  * MAC Control and Status Register (Indirect Address)
  * Offset (through the MAC_CSR CMD and DATA port)
-- 
2.7.4



[Patch net-next v4 1/2] net_sched: move the empty tp check from ->destroy() to ->delete()

2017-04-19 Thread Cong Wang
We could have a race condition where in ->classify() path we
dereference tp->root and meanwhile a parallel ->destroy() makes it
a NULL. Daniel cured this bug in commit d936377414fa
("net, sched: respect rcu grace period on cls destruction").

This happens when ->destroy() is called for deleting a filter to
check if we are the last one in tp, this tp is still linked and
visible at that time. The root cause of this problem is the semantic
of ->destroy(), it does two things (for non-force case):

1) check if tp is empty
2) if tp is empty we could really destroy it

and its caller, if cares, needs to check its return value to see if it
is really destroyed. Therefore we can't unlink tp unless we know it is
empty.

As suggested by Daniel, we could actually move the test logic to ->delete()
so that we can safely unlink tp after ->delete() tells us the last one is
just deleted and before ->destroy().

Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
Cc: Roi Dayan 
Cc: Daniel Borkmann 
Cc: John Fastabend 
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
---
 include/net/sch_generic.h |  4 +--
 net/sched/cls_api.c   | 27 +-
 net/sched/cls_basic.c | 10 +++
 net/sched/cls_bpf.c   | 11 
 net/sched/cls_cgroup.c|  8 ++
 net/sched/cls_flow.c  | 10 +++
 net/sched/cls_flower.c| 10 ++-
 net/sched/cls_fw.c| 29 +++
 net/sched/cls_matchall.c  |  7 ++---
 net/sched/cls_route.c | 29 +--
 net/sched/cls_rsvp.h  | 32 +++--
 net/sched/cls_tcindex.c   | 14 +-
 net/sched/cls_u32.c   | 71 +++
 13 files changed, 134 insertions(+), 128 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 65d5026..22e5209 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -204,14 +204,14 @@ struct tcf_proto_ops {
const struct tcf_proto *,
struct tcf_result *);
int (*init)(struct tcf_proto*);
-   bool(*destroy)(struct tcf_proto*, bool);
+   void(*destroy)(struct tcf_proto*);
 
unsigned long   (*get)(struct tcf_proto*, u32 handle);
int (*change)(struct net *net, struct sk_buff *,
struct tcf_proto*, unsigned long,
u32 handle, struct nlattr **,
unsigned long *, bool);
-   int (*delete)(struct tcf_proto*, unsigned long);
+   int (*delete)(struct tcf_proto*, unsigned long, 
bool*);
void(*walk)(struct tcf_proto*, struct tcf_walker 
*arg);
 
/* rtnetlink specific */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index a8da383..22f88b3 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -178,14 +178,11 @@ static struct tcf_proto *tcf_proto_create(const char 
*kind, u32 protocol,
return ERR_PTR(err);
 }
 
-static bool tcf_proto_destroy(struct tcf_proto *tp, bool force)
+static void tcf_proto_destroy(struct tcf_proto *tp)
 {
-   if (tp->ops->destroy(tp, force)) {
-   module_put(tp->ops->owner);
-   kfree_rcu(tp, rcu);
-   return true;
-   }
-   return false;
+   tp->ops->destroy(tp);
+   module_put(tp->ops->owner);
+   kfree_rcu(tp, rcu);
 }
 
 void tcf_destroy_chain(struct tcf_proto __rcu **fl)
@@ -194,7 +191,7 @@ void tcf_destroy_chain(struct tcf_proto __rcu **fl)
 
while ((tp = rtnl_dereference(*fl)) != NULL) {
RCU_INIT_POINTER(*fl, tp->next);
-   tcf_proto_destroy(tp, true);
+   tcf_proto_destroy(tp);
}
 }
 EXPORT_SYMBOL(tcf_destroy_chain);
@@ -361,7 +358,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct 
nlmsghdr *n,
RCU_INIT_POINTER(*back, next);
tfilter_notify(net, skb, n, tp, fh,
   RTM_DELTFILTER, false);
-   tcf_proto_destroy(tp, true);
+   tcf_proto_destroy(tp);
err = 0;
goto errout;
}
@@ -372,24 +369,28 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct 
nlmsghdr *n,
goto errout;
}
} else {
+   bool last;
+
switch (n->nlmsg_type) {
case RTM_NEWTFILTER:
if (n->nlmsg_flags & NLM_F_EXCL) {
if (tp_created)
-   tcf_proto_destroy(tp, true);
+   

[Patch net-next v4 2/2] net_sched: remove useless NULL to tp->root

2017-04-19 Thread Cong Wang
There is no need to NULL tp->root in ->destroy(), since tp is
going to be freed very soon, and existing readers are still
safe to read them.

For cls_route, we always init its tp->root, so it can't be NULL,
we can drop more useless code.

Cc: Daniel Borkmann 
Cc: John Fastabend 
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
---
 net/sched/cls_fw.c|  1 -
 net/sched/cls_route.c | 15 ---
 net/sched/cls_rsvp.h  |  4 
 3 files changed, 20 deletions(-)

diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 78ccb4a..d388536 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -144,7 +144,6 @@ static void fw_destroy(struct tcf_proto *tp)
call_rcu(>rcu, fw_delete_filter);
}
}
-   RCU_INIT_POINTER(tp->root, NULL);
kfree_rcu(head, rcu);
 }
 
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index f4d687e..d63d550 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -140,8 +140,6 @@ static int route4_classify(struct sk_buff *skb, const 
struct tcf_proto *tp,
goto failure;
 
id = dst->tclassid;
-   if (head == NULL)
-   goto old_method;
 
iif = inet_iif(skb);
 
@@ -194,15 +192,6 @@ static int route4_classify(struct sk_buff *skb, const 
struct tcf_proto *tp,
route4_set_fastmap(head, id, iif, ROUTE4_FAILURE);
 failure:
return -1;
-
-old_method:
-   if (id && (TC_H_MAJ(id) == 0 ||
-  !(TC_H_MAJ(id^tp->q->handle {
-   res->classid = id;
-   res->class = 0;
-   return 0;
-   }
-   return -1;
 }
 
 static inline u32 to_hash(u32 id)
@@ -234,9 +223,6 @@ static unsigned long route4_get(struct tcf_proto *tp, u32 
handle)
struct route4_filter *f;
unsigned int h1, h2;
 
-   if (!head)
-   return 0;
-
h1 = to_hash(handle);
if (h1 > 256)
return 0;
@@ -305,7 +291,6 @@ static void route4_destroy(struct tcf_proto *tp)
kfree_rcu(b, rcu);
}
}
-   RCU_INIT_POINTER(tp->root, NULL);
kfree_rcu(head, rcu);
 }
 
diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h
index 18a9470..0d9d077 100644
--- a/net/sched/cls_rsvp.h
+++ b/net/sched/cls_rsvp.h
@@ -152,8 +152,6 @@ static int rsvp_classify(struct sk_buff *skb, const struct 
tcf_proto *tp,
return -1;
nhptr = ip_hdr(skb);
 #endif
-   if (unlikely(!head))
-   return -1;
 restart:
 
 #if RSVP_DST_LEN == 4
@@ -310,8 +308,6 @@ static void rsvp_destroy(struct tcf_proto *tp)
if (data == NULL)
return;
 
-   RCU_INIT_POINTER(tp->root, NULL);
-
for (h1 = 0; h1 < 256; h1++) {
struct rsvp_session *s;
 
-- 
2.5.5



[Patch net-next v4 0/2] net_sched: clean up tc filter destroy and delete logic

2017-04-19 Thread Cong Wang
The first patch fixes a potenial race condition, the second one
is pure cleanup.

Signed-off-by: Cong Wang 

---
v4: split the patch and update changelog
v3: fix a compiler warning
v2: rebase

Cong Wang (2):
  net_sched: move the empty tp check from ->destroy() to ->delete()
  net_sched: remove useless NULL to tp->root

 include/net/sch_generic.h |  4 +--
 net/sched/cls_api.c   | 27 +-
 net/sched/cls_basic.c | 10 +++
 net/sched/cls_bpf.c   | 11 
 net/sched/cls_cgroup.c|  8 ++
 net/sched/cls_flow.c  | 10 +++
 net/sched/cls_flower.c| 10 ++-
 net/sched/cls_fw.c| 30 +++-
 net/sched/cls_matchall.c  |  7 ++---
 net/sched/cls_route.c | 44 ++---
 net/sched/cls_rsvp.h  | 36 
 net/sched/cls_tcindex.c   | 14 +-
 net/sched/cls_u32.c   | 71 +++
 13 files changed, 134 insertions(+), 148 deletions(-)

-- 
2.5.5



[PATCH net] net: ipv6: RTF_PCPU should not be settable from userspace

2017-04-19 Thread David Ahern
Andrey reported a fault in the IPv6 route code:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 4035 Comm: a.out Not tainted 4.11.0-rc7+ #250
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: 880069809600 task.stack: 880062dc8000
RIP: 0010:ip6_rt_cache_alloc+0xa6/0x560 net/ipv6/route.c:975
RSP: 0018:880062dced30 EFLAGS: 00010206
RAX: dc00 RBX: 8800670561c0 RCX: 0006
RDX: 0003 RSI: 880062dcfb28 RDI: 0018
RBP: 880062dced68 R08: 0001 R09: 
R10:  R11:  R12: 
R13: 880062dcfb28 R14: dc00 R15: 
FS:  7feebe37e7c0() GS:88006cb0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 205a0fe4 CR3: 6b5c9000 CR4: 06e0
Call Trace:
 ip6_pol_route+0x1512/0x1f20 net/ipv6/route.c:1128
 ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
...

Andrey's syzkaller program passes rtmsg.rtmsg_flags with the RTF_PCPU bit
set. Flags passed to the kernel are blindly copied to the allocated
rt6_info by ip6_route_info_create making a newly inserted route appear
as though it is a per-cpu route. ip6_rt_cache_alloc sees the flag set
and expects rt->dst.from to be set - which it is not since it is not
really a per-cpu copy. The subsequent call to __ip6_dst_alloc then
generates the fault.

Fix by checking for the flag and failing with EINVAL.

Fixes: d52d3997f843f ("ipv6: Create percpu rt6_info")
Reported-by: Andrey Konovalov 
Signed-off-by: David Ahern 
---
 include/uapi/linux/ipv6_route.h | 2 +-
 net/ipv6/route.c| 4 
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
index 85bbb1799df3..d496c02e14bc 100644
--- a/include/uapi/linux/ipv6_route.h
+++ b/include/uapi/linux/ipv6_route.h
@@ -35,7 +35,7 @@
 #define RTF_PREF(pref) ((pref) << 27)
 #define RTF_PREF_MASK  0x1800
 
-#define RTF_PCPU   0x4000
+#define RTF_PCPU   0x4000  /* read-only: can not be set by user */
 #define RTF_LOCAL  0x8000
 
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4ba7c49872ff..a1bf426c959b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1854,6 +1854,10 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg)
int addr_type;
int err = -EINVAL;
 
+   /* RTF_PCPU is an internal flag; can not be set by userspace */
+   if (cfg->fc_flags & RTF_PCPU)
+   goto out;
+
if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128)
goto out;
 #ifndef CONFIG_IPV6_SUBTREES
-- 
2.9.3



Re: [PATCH] net: arc_emac: switch to phy_start()/phy_stop()

2017-04-19 Thread Alexander Kochetkov

> 20 апр. 2017 г., в 0:08, Florian Fainelli  написал(а):
> 
> This looks fine. If you wanted to go further, you could move the
> phy_connect(), phy_disconnect() calls down to the arc_emac_open()
> respectively arc_emac_stop() as this would also allow the PHY device to
> be fully suspended when the interface is unused.


I’ve checked patch phy_connect() is called from arc_emac_open() and
phy_disconnect() is called from arc_emac_stop().

So, I’ve made mistake in the commit message.

Thank you for review.

> 
> 19 апр. 2017 г., в 21:22, Sergei Shtylyov 
>  написал(а):
> 
> On 04/19/2017 05:29 PM, Alexander Kochetkov wrote:
> 
>> The patch replace phy_start_aneg() with phy_start(). phy_start() call
> 
>   Replaces.
> 
>> phy_start_aneg() as a part of startup sequence and allow recover from
>> error (PHY_HALTED) state.
>> 
>> Also added call phy_stop() to arc_emac_remove() to stop PHY state machine
> 
>   To arc_emac_stop() maybe?
> 

Sergei, thanks for spell and gramma checking.

Regards,
Alexander.



Re: [Intel-wired-lan] NFS over NAT causes e1000e transmit hangs

2017-04-19 Thread Florian Fainelli
On 04/19/2017 01:52 AM, Neftin, Sasha wrote:
> On 4/18/2017 22:05, Florian Fainelli wrote:
>> On 04/18/2017 12:03 PM, Eric Dumazet wrote:
>>> On Tue, 2017-04-18 at 11:18 -0700, Florian Fainelli wrote:
 Hi,

 I am using NFS over a NAT with two e1000e adapters and with eth1 being
 the LAN interface and eth0 the WAN interface. The kernel is Ubuntu's
 16.10 kernel: 4.8.0-46-generic. The device doing NAT over NFS is just
 mounting a remote folder and doing normal execution/file accesses. It's
 enough to untar a file from this device onto a NFS share to expose the
 problem.

 The transmit hangs look like the ones below, doing a rmmod/insmod does
 not help eliminated the problem, nor does a power cycle. Stopping the
 NFS over NAT definitively does let the adapter recover.
>>> Is this NFS over TCP or UDP ?
>> This is NFS over TCP mounted with the following:
>>
>> type nfs
>> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=2049,timeo=70,retrans=3,sec=sys,local_lock=none,addr=X.X.X.X)
>>
>>
>> Thanks Eric!
> 
> Please, try disable TCP segmentation offload: ethtool -K  tso off.

I am not able to reproduce the hangs with TSO turned off. Is there a
specific patch you would want me to try?
-- 
Florian


Re: [PATCH] net: arc_emac: switch to phy_start()/phy_stop()

2017-04-19 Thread Florian Fainelli
On 04/19/2017 07:29 AM, Alexander Kochetkov wrote:
> The patch replace phy_start_aneg() with phy_start(). phy_start() call
> phy_start_aneg() as a part of startup sequence and allow recover from
> error (PHY_HALTED) state.
> 
> Also added call phy_stop() to arc_emac_remove() to stop PHY state machine
> when MAC is down.

This looks fine. If you wanted to go further, you could move the
phy_connect(), phy_disconnect() calls down to the arc_emac_open()
respectively arc_emac_stop() as this would also allow the PHY device to
be fully suspended when the interface is unused.

> 
> Signed-off-by: Alexander Kochetkov 
> ---
>  drivers/net/ethernet/arc/emac_main.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/arc/emac_main.c 
> b/drivers/net/ethernet/arc/emac_main.c
> index abc9f2a..188676d 100644
> --- a/drivers/net/ethernet/arc/emac_main.c
> +++ b/drivers/net/ethernet/arc/emac_main.c
> @@ -434,7 +434,7 @@ static int arc_emac_open(struct net_device *ndev)
>   /* Enable EMAC */
>   arc_reg_or(priv, R_CTRL, EN_MASK);
>  
> - phy_start_aneg(ndev->phydev);
> + phy_start(ndev->phydev);
>  
>   netif_start_queue(ndev);
>  
> @@ -556,6 +556,8 @@ static int arc_emac_stop(struct net_device *ndev)
>   napi_disable(>napi);
>   netif_stop_queue(ndev);
>  
> + phy_stop(ndev->phydev);
> +
>   /* Disable interrupts */
>   arc_reg_clr(priv, R_ENABLE, RXINT_MASK | TXINT_MASK | ERR_MASK);
>  
> 


-- 
Florian


Re: [PATCH v4 04/18] dt-bindings: syscon: Add DT bindings documentation for Allwinner syscon

2017-04-19 Thread Rob Herring
On Wed, Apr 12, 2017 at 01:13:46PM +0200, Corentin Labbe wrote:
> This patch adds documentation for Device-Tree bindings for the
> syscon present in allwinner devices.
> 
> Signed-off-by: Corentin Labbe 
> ---
>  .../devicetree/bindings/misc/allwinner,syscon.txt | 19 
> +++
>  1 file changed, 19 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/misc/allwinner,syscon.txt

Acked-by: Rob Herring 


[PATCH net-next] bpf: add napi_id read access to __sk_buff

2017-04-19 Thread Daniel Borkmann
Add napi_id access to __sk_buff for socket filter program types, tc
program types and other bpf_convert_ctx_access() users. Having access
to skb->napi_id is useful for per RX queue listener siloing, f.e.
in combination with SO_ATTACH_REUSEPORT_EBPF and when busy polling is
used, meaning SO_REUSEPORT enabled listeners can then select the
corresponding socket at SYN time already [1]. The skb is marked via
skb_mark_napi_id() early in the receive path (e.g., napi_gro_receive()).

Currently, sockets can only use SO_INCOMING_NAPI_ID from 6d4339028b35
("net: Introduce SO_INCOMING_NAPI_ID") as a socket option to look up
the NAPI ID associated with the queue for steering, which requires a
prior sk_mark_napi_id() after the socket was looked up.

Semantics for the __sk_buff napi_id access are similar, meaning if
skb->napi_id is < MIN_NAPI_ID (e.g. outgoing packets using sender_cpu),
then an invalid napi_id of 0 is returned to the program, otherwise a
valid non-zero napi_id.

  [1] http://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

Suggested-by: Eric Dumazet 
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h|  1 +
 net/core/filter.c   | 14 ++
 tools/include/uapi/linux/bpf.h  |  1 +
 tools/testing/selftests/bpf/test_verifier.c |  3 +++
 4 files changed, 19 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1e062bb..e553529 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -603,6 +603,7 @@ struct __sk_buff {
__u32 tc_classid;
__u32 data;
__u32 data_end;
+   __u32 napi_id;
 };
 
 struct bpf_tunnel_key {
diff --git a/net/core/filter.c b/net/core/filter.c
index 19be954..70ff8c0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -53,6 +53,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * sk_filter_trim_cap - run a packet through a socket filter
@@ -3202,6 +3203,19 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type 
type,
*insn++ = BPF_MOV64_IMM(si->dst_reg, 0);
 #endif
break;
+
+   case offsetof(struct __sk_buff, napi_id):
+#if defined(CONFIG_NET_RX_BUSY_POLL)
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, napi_id) != 4);
+
+   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+ offsetof(struct sk_buff, napi_id));
+   *insn++ = BPF_JMP_IMM(BPF_JGE, si->dst_reg, MIN_NAPI_ID, 1);
+   *insn++ = BPF_MOV64_IMM(si->dst_reg, 0);
+#else
+   *insn++ = BPF_MOV64_IMM(si->dst_reg, 0);
+#endif
+   break;
}
 
return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1e062bb..e553529 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -603,6 +603,7 @@ struct __sk_buff {
__u32 tc_classid;
__u32 data;
__u32 data_end;
+   __u32 napi_id;
 };
 
 struct bpf_tunnel_key {
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 6178b65..95a8d5f 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -772,6 +772,9 @@ struct test_val {
BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
offsetof(struct __sk_buff, vlan_tci)),
BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 0),
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct __sk_buff, napi_id)),
+   BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 0),
BPF_EXIT_INSN(),
},
.result = ACCEPT,
-- 
1.9.3



[PATCH net-next 1/1] netvsc: Deal with rescinded channels correctly

2017-04-19 Thread kys
From: K. Y. Srinivasan 

We will not be able to send packets over a channel that has been
rescinded. Make necessary adjustments so we can properly cleanup
even when the channel is rescinded. This issue can be trigerred
in the NIC hot-remove path.

Signed-off-by: K. Y. Srinivasan 
---
 drivers/net/hyperv/netvsc.c |   16 
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 7ab06b3..b5b10fc 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -135,6 +135,13 @@ static void netvsc_destroy_buf(struct hv_device *device)
   sizeof(struct nvsp_message),
   (unsigned long)revoke_packet,
   VM_PKT_DATA_INBAND, 0);
+   /* If the failure is because the channel is rescinded;
+* ignore the failure since we cannot send on a rescinded
+* channel. This would allow us to properly cleanup
+* even when the channel is rescinded.
+*/
+   if (device->channel->rescind)
+   ret = 0;
/*
 * If we failed here, we might as well return and
 * have a leak rather than continue and a bugchk
@@ -195,6 +202,15 @@ static void netvsc_destroy_buf(struct hv_device *device)
   sizeof(struct nvsp_message),
   (unsigned long)revoke_packet,
   VM_PKT_DATA_INBAND, 0);
+
+   /* If the failure is because the channel is rescinded;
+* ignore the failure since we cannot send on a rescinded
+* channel. This would allow us to properly cleanup
+* even when the channel is rescinded.
+*/
+   if (device->channel->rescind)
+   ret = 0;
+
/* If we failed here, we might as well return and
 * have a leak rather than continue and a bugchk
 */
-- 
1.7.1



RE: [PATCH 1/1] netvsc: Deal with rescinded channels correctly

2017-04-19 Thread KY Srinivasan


> -Original Message-
> From: k...@exchange.microsoft.com [mailto:k...@exchange.microsoft.com]
> Sent: Wednesday, April 19, 2017 1:49 PM
> To: da...@davemloft.net; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> a...@canonical.com; jasow...@redhat.com;
> leann.ogasaw...@canonical.comi; marcelo.ce...@canonical.com; Stephen
> Hemminger 
> Cc: KY Srinivasan 
> Subject: [PATCH 1/1] netvsc: Deal with rescinded channels correctly
> 
> [This sender failed our fraud detection checks and may not be who they
> appear to be. Learn about spoofing at http://aka.ms/LearnAboutSpoofing]
> 
> From: K. Y. Srinivasan 
> 
> We will not be able to send packets over a channel that has been
> rescinded. Make necessary adjustments so we can properly cleanup
> even when the channel is rescinded. This issue can be trigerred
> in the NIC hot-remove path.
> 
> Signed-off-by: K. Y. Srinivasan 

Dave,

Please drop this path; I will resend.

K. Y
> ---
>  drivers/net/hyperv/netvsc.c |   16 
>  1 files changed, 16 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 7ab06b3..b5b10fc 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -135,6 +135,13 @@ static void netvsc_destroy_buf(struct hv_device
> *device)
>sizeof(struct nvsp_message),
>(unsigned long)revoke_packet,
>VM_PKT_DATA_INBAND, 0);
> +   /* If the failure is because the channel is rescinded;
> +* ignore the failure since we cannot send on a rescinded
> +* channel. This would allow us to properly cleanup
> +* even when the channel is rescinded.
> +*/
> +   if (device->channel->rescind)
> +   ret = 0;
> /*
>  * If we failed here, we might as well return and
>  * have a leak rather than continue and a bugchk
> @@ -195,6 +202,15 @@ static void netvsc_destroy_buf(struct hv_device
> *device)
>sizeof(struct nvsp_message),
>(unsigned long)revoke_packet,
>VM_PKT_DATA_INBAND, 0);
> +
> +   /* If the failure is because the channel is rescinded;
> +* ignore the failure since we cannot send on a rescinded
> +* channel. This would allow us to properly cleanup
> +* even when the channel is rescinded.
> +*/
> +   if (device->channel->rescind)
> +   ret = 0;
> +
> /* If we failed here, we might as well return and
>  * have a leak rather than continue and a bugchk
>  */
> --
> 1.7.1



[PATCH 1/1] netvsc: Deal with rescinded channels correctly

2017-04-19 Thread kys
From: K. Y. Srinivasan 

We will not be able to send packets over a channel that has been
rescinded. Make necessary adjustments so we can properly cleanup
even when the channel is rescinded. This issue can be trigerred
in the NIC hot-remove path.

Signed-off-by: K. Y. Srinivasan 
---
 drivers/net/hyperv/netvsc.c |   16 
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 7ab06b3..b5b10fc 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -135,6 +135,13 @@ static void netvsc_destroy_buf(struct hv_device *device)
   sizeof(struct nvsp_message),
   (unsigned long)revoke_packet,
   VM_PKT_DATA_INBAND, 0);
+   /* If the failure is because the channel is rescinded;
+* ignore the failure since we cannot send on a rescinded
+* channel. This would allow us to properly cleanup
+* even when the channel is rescinded.
+*/
+   if (device->channel->rescind)
+   ret = 0;
/*
 * If we failed here, we might as well return and
 * have a leak rather than continue and a bugchk
@@ -195,6 +202,15 @@ static void netvsc_destroy_buf(struct hv_device *device)
   sizeof(struct nvsp_message),
   (unsigned long)revoke_packet,
   VM_PKT_DATA_INBAND, 0);
+
+   /* If the failure is because the channel is rescinded;
+* ignore the failure since we cannot send on a rescinded
+* channel. This would allow us to properly cleanup
+* even when the channel is rescinded.
+*/
+   if (device->channel->rescind)
+   ret = 0;
+
/* If we failed here, we might as well return and
 * have a leak rather than continue and a bugchk
 */
-- 
1.7.1



Re: [PATCH v3 1/4] dt-bindings: net: Add TI WiLink shared transport binding

2017-04-19 Thread Rob Herring
On Sun, Apr 16, 2017 at 9:09 AM, Adam Ford  wrote:
>
>
> On Apr 13, 2017 10:04 AM, "Rob Herring"  wrote:
>
> Add serial slave device binding for the TI WiLink series of Bluetooth/FM/GPS
> devices.
>
> Signed-off-by: Rob Herring 
> Cc: Mark Rutland 
> Cc: netdev@vger.kernel.org
> Cc: devicet...@vger.kernel.org
> ---
> v3:
> - rebase on bluetooth-next
>
>  .../devicetree/bindings/net/ti,wilink-st.txt   | 35
> ++
>  1 file changed, 35 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/ti,wilink-st.txt
>
> diff --git a/Documentation/devicetree/bindings/net/ti,wilink-st.txt
> b/Documentation/devicetree/bindings/net/ti,wilink-st.txt
> new file mode 100644
> index ..cbad73a84ac4
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/ti,wilink-st.txt
> @@ -0,0 +1,35 @@
> +TI WiLink 7/8 (wl12xx/wl18xx) Shared Transport BT/FM/GPS devices
> +
> +TI WiLink devices have a UART interface for providing Bluetooth, FM radio,
> +and GPS over what's called "shared transport". The shared transport is
> +standard BT HCI protocol with additional channels for the other functions.
> +
> +These devices also have a separate WiFi interface as described in
> +wireless/ti,wlcore.txt.
> +
> +This bindings follows the UART slave device binding in
> +../serial/slave-device.txt.
> +
> +Required properties:
> + - compatible: should be one of the following:
> +"ti,wl1271-st"
> +"ti,wl1273-st"
> +"ti,wl1831-st"
> +"ti,wl1835-st"
> +"ti,wl1837-st"
> +
>
>
> Would you expect the wl1283 chipset too?

Probably, but I left it out as there's no public information.

> I can help test this if you like after the holiday weekend. I have a board
> with WL1283 and currently using pdata-quirks to support it.


RE: [PATCH net-next] i40evf: hide unused variable

2017-04-19 Thread Keller, Jacob E
> -Original Message-
> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: Wednesday, April 19, 2017 10:30 AM
> To: Kirsher, Jeffrey T 
> Cc: Arnd Bergmann ; Pujari, Bimmy
> ; Duyck, Alexander H
> ; Williams, Mitch A
> ; Keller, Jacob E ; 
> Brady,
> Alan ; Joe Perches ; Singhai, Anjali
> ; Brandeburg, Jesse ;
> Banala, Preethi ; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH net-next] i40evf: hide unused variable
> 
> On architectures with larger pages, we get a warning about an unused variable:
> 
> drivers/net/ethernet/intel/i40evf/i40evf_main.c: In function
> 'i40evf_configure_rx':
> drivers/net/ethernet/intel/i40evf/i40evf_main.c:690:21: error: unused variable
> 'netdev' [-Werror=unused-variable]
> 
> This moves the declaration into the #ifdef to avoid the warning.
> 

Makes sense.

Acked-by: Jacob Keller 

> Fixes: dab86afdbbd1 ("i40e/i40evf: Change the way we limit the maximum frame
> size for Rx")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/net/ethernet/intel/i40evf/i40evf_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> index 12a930e879af..1bb13c864edd 100644
> --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> @@ -687,13 +687,14 @@ static void i40evf_configure_tx(struct i40evf_adapter
> *adapter)
>  static void i40evf_configure_rx(struct i40evf_adapter *adapter)
>  {
>   unsigned int rx_buf_len = I40E_RXBUFFER_2048;
> - struct net_device *netdev = adapter->netdev;
>   struct i40e_hw *hw = >hw;
>   int i;
> 
>   /* Legacy Rx will always default to a 2048 buffer size. */
>  #if (PAGE_SIZE < 8192)
>   if (!(adapter->flags & I40EVF_FLAG_LEGACY_RX)) {
> + struct net_device *netdev = adapter->netdev;
> +
>   /* For jumbo frames on systems with 4K pages we have to use
>* an order 1 page, so we might as well increase the size
>* of our Rx buffer to make better use of the available space
> --
> 2.9.0



Re: [PATCH v3 3/4] bluetooth: hci_uart: add LL protocol serdev driver support

2017-04-19 Thread Rob Herring
On Mon, Apr 17, 2017 at 3:11 PM, Adam Ford  wrote:
> On Thu, Apr 13, 2017 at 10:03 AM, Rob Herring  wrote:
>> Turns out that the LL protocol and the TI-ST are the same thing AFAICT.
>> The TI-ST adds firmware loading, GPIO control, and shared access for
>> NFC, FM radio, etc. For now, we're only implementing what is needed for
>> BT. This mirrors other drivers like BCM and Intel, but uses the new
>> serdev bus.
>>
>> The firmware loading is greatly simplified by using existing
>> infrastructure to send commands. It may be a bit slower than the
>> original code using synchronous functions, but the real bottleneck is
>> likely doing firmware load at 115.2kbps.
>
> I am using pdata-quirks to drive my wl1283 Bluetooth on a DM3730.  I
> have the Bluetooth set to 300 baud in pdata quirks.  Looking at
> the binding, I don't see an option to set the baudrate.  Is there (or
> will there) be a way to set the baud rate of the Bluetooth?

If you read hci_ti_probe, you will see it is already there. The
default is 3Mbps and the DT can override that with "max-speed". The
intent is that 3Mbps is the max the device can do and max-speed is
only for board or host limitations. Though I just checked the
datasheet on the 1835 and it can go up to 4364kbps. No datasheets for
wl1283, so I have no idea what the max is.

>> +static int hci_ti_probe(struct serdev_device *serdev)
>> +{
>> +   struct hci_uart *hu;
>> +   struct ll_device *lldev;
>> +   u32 max_speed = 300;

[...]

>> +   of_property_read_u32(serdev->dev.of_node, "max-speed", _speed);
>> +   hci_uart_set_speeds(hu, 115200, max_speed);


Re: [PATCH v4 net-next RFC] net: Generic XDP

2017-04-19 Thread Andy Gospodarek
On Wed, Apr 19, 2017 at 10:44:59AM -0700, John Fastabend wrote:
> On 17-04-19 10:17 AM, Alexei Starovoitov wrote:
> > On Wed, Apr 19, 2017 at 10:29:03AM -0400, Andy Gospodarek wrote:
> >>
> >> I ran this on top of a card that uses the bnxt_en driver on a desktop
> >> class system with an i7-6700 CPU @ 3.40GHz, sending a single stream of
> >> UDP traffic with flow control disabled and saw the following (all stats
> >> in Million PPS).
> >>
> >> xdp1xdp2xdp_tx_tunnel
> >> Generic XDP  7.85.5 (1.3 actual) 4.6 (1.1 actual)
> >> Optimized XDP   11.79.7  4.6
> > 
> > Nice! Thanks for testing.
> > 
> >> One thing to note is that the Generic XDP case shows some different
> >> results for reported by the application vs actual (seen on the wire).  I
> >> did not debug where the drops are happening and what counter needs to be
> >> incremented to note this -- I'll add that to my TODO list.  The
> >> Optimized XDP case does not have a difference in reported vs actual
> >> frames on the wire.
> > 
> > The missed packets are probably due to xmit queue being full.
> > We need 'xdp_tx_full' counter in:
> > +   if (free_skb) {
> > +   trace_xdp_exception(dev, xdp_prog, XDP_TX);
> > +   kfree_skb(skb);
> > +   }
> > like in-driver xdp does.
> > It's surprising that tx becomes full so often. May be bnxt specific 
> > behavior?
> 
> hmm as a data point I get better numbers than 1.3Mpps running through the 
> qdisc
> layer with pktgen so seems like something is wrong with the driver perhaps? If

I get ~6.5Mpps on a single core with pktgen, so inconclusive for now

> I get a chance I'll take a look with my setup here, although it likely wont be
> until the weekend. I don't think it needs to slow down dropping the RFC tag
> and getting the patch applied though.
> 
> > 
> >> I agree with all those who have asserted that this is great tool for
> >> those that want to get started with XDP but do not have hardware, so I'd
> >> say it's ready to have the 'RFC' tag dropped.  Thanks for pushing this
> >> forward, Dave!  :-)
> > 
> > +1
> >  
> > 
> 


Re: [RFC PATCH net] net/mlx5e: Race between mlx5e_update_stats() and getting the stats

2017-04-19 Thread Eric Dumazet
On Wed, 2017-04-19 at 11:29 -0700, Martin KaFai Lau wrote:
> We have observed a sudden spike in rx/tx_packets and rx/tx_bytes
> reported under /proc/net/dev.  It seems there is a race in
> mlx5e_update_stats() and some of the get-stats functions (the
> one that we hit is the mlx5e_get_stats() which is called
> by ndo_get_stats64()).
> 
> In particular, the very first thing mlx5e_update_sw_counters()
> does is 'memset(s, 0, sizeof(*s))'.  For example, if mlx5e_get_stats()
> is unlucky at one point, rx_bytes and rx_packets could be 0.  One second
> later, a normal (and much bigger than 0) value will be reported.
> 
> This patch is not meant to be a proper fix.  It merely tries
> to show what I have suspected and start the discussion.
> 
> Signed-off-by: Martin KaFai Lau 
> Cc: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 7 +--
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c| 3 +++
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> index a004a5a1a4c2..d24916f720bb 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> @@ -313,7 +313,6 @@ static void mlx5e_get_ethtool_stats(struct net_device 
> *dev,
>   mutex_lock(>state_lock);
>   if (test_bit(MLX5E_STATE_OPENED, >state))
>   mlx5e_update_stats(priv);
> - mutex_unlock(>state_lock);
>  
>   for (i = 0; i < NUM_SW_COUNTERS; i++)
>   data[idx++] = MLX5E_READ_CTR64_CPU(>stats.sw,
> @@ -378,8 +377,10 @@ static void mlx5e_get_ethtool_stats(struct net_device 
> *dev,
>   data[idx++] = 
> MLX5E_READ_CTR64_CPU(mlx5_priv->pme_stats.error_counters,
>  mlx5e_pme_error_desc, i);
>  
> - if (!test_bit(MLX5E_STATE_OPENED, >state))
> + if (!test_bit(MLX5E_STATE_OPENED, >state)) {
> + mutex_unlock(>state_lock);
>   return;
> + }
>  
>   /* per channel counters */
>   for (i = 0; i < priv->params.num_channels; i++)
> @@ -393,6 +394,8 @@ static void mlx5e_get_ethtool_stats(struct net_device 
> *dev,
>   for (j = 0; j < NUM_SQ_STATS; j++)
>   data[idx++] = 
> MLX5E_READ_CTR64_CPU(>channel[i]->sq[tc].stats,
>  
> sq_stats_desc, j);
> +
> + mutex_unlock(>state_lock);
>  }
>  
>  static u32 mlx5e_rx_wqes_to_packets(struct mlx5e_priv *priv, int rq_wq_type,
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 66c133757a5e..a4c100bea541 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -2748,6 +2748,8 @@ mlx5e_get_stats(struct net_device *dev, struct 
> rtnl_link_stats64 *stats)
>   struct mlx5e_vport_stats *vstats = >stats.vport;
>   struct mlx5e_pport_stats *pstats = >stats.pport;
>  
> + mutex_lock(>state_lock);
> +

We can not sleep from ndo_get_stats() ( look at bonding driver )

What about the following ?

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 
66c133757a5ee8daae122e93322306b1c5c44336..b9fea146a0ca18498a8dfa5698dca7dea06e3c5e
 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -174,7 +174,7 @@ static void mlx5e_tx_timeout_work(struct work_struct *work)
 
 static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
 {
-   struct mlx5e_sw_stats *s = >stats.sw;
+   struct mlx5e_sw_stats temp, *s = 
struct mlx5e_rq_stats *rq_stats;
struct mlx5e_sq_stats *sq_stats;
u64 tx_offload_none = 0;
@@ -229,6 +229,8 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv 
*priv)
s->link_down_events_phy = MLX5_GET(ppcnt_reg,
priv->stats.pport.phy_counters,
counter_set.phys_layer_cntrs.link_down_events);
+   /* A bit racy (depending on memcpy() sanity...) , we probably should 
use a spinlock */
+   memcpy(>stats.sw, s, sizeof(*s));
 }
 
 static void mlx5e_update_vport_counters(struct mlx5e_priv *priv)




Re: [patch] socket.7: Document SO_INCOMING_CPU

2017-04-19 Thread Eric Dumazet
On Wed, 2017-04-19 at 20:48 +0200, Michael Kerrisk (man-pages) wrote:
> Hi Eric,
> 
> [reodering for clarity]
> 
> >> On 02/19/2017 09:55 PM, Michael Kerrisk (man-pages) wrote:
> >>> [CC += Eric, so that he might review]
> >>>
> >>> Hello Francois,
> >>>
> >>> On 02/18/2017 05:06 AM, Francois Saint-Jacques wrote:
>  This socket option is undocumented. Applies on the latest version
>  (man-pages-4.09-511).
> 
>  diff --git a/man7/socket.7 b/man7/socket.7
>  index 3efd7a5d8..1a3ffa253 100644
>  --- a/man7/socket.7
>  +++ b/man7/socket.7
>  @@ -490,6 +490,26 @@ flag on a socket
>   operation.
>   Expects an integer boolean flag.
>   .TP
>  +.BR SO_INCOMING_CPU " (getsockopt since Linux 3.19, setsockopt since
>  Linux 4.4)"
>  +.\" getsocktop 2c8c56e15df3d4c2af3d656e44feb18789f75837
>  +.\" setsocktop 70da268b569d32a9fddeea85dc18043de9d89f89
>  +Sets or gets the cpu affinity of a socket. Expects an integer flag.
>  +.sp
>  +.in +4n
>  +.nf
>  +int cpu = 1;
>  +socklen_t len = sizeof(cpu);
>  +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, , );
>  +.fi
>  +.in
>  +.sp
>  +The typical use case is one listener per RX queue, as the associated 
>  listener
>  +should only accept flows handled in softirq by the same cpu.  This 
>  provides
>  +optimal NUMA behavior and keep cpu caches hot.
>  +.TP
>   .B SO_KEEPALIVE
>   Enable sending of keep-alive messages on connection-oriented sockets.
>   Expects an integer boolean flag.
> >>>
> >>> Thank you! Patch applied.
> >>>
> >>> I have tried to enhance the description somewhat. I'm not sure whether
> >>> what I've written is quite correct (or whether it should be further
> >>> extended). Eric, could you please take a look at the following, and let 
> >>> me know if anything needs fixing:
> >>>
> >>>SO_INCOMING_CPU  (gettable  since Linux 3.19, settable since Linux
> >>>4.4)
> >>>   Sets or gets the CPU affinity  of  a  socket.   Expects  an
> >>>   integer flag.
> >>>
> >>>   int cpu = 1;
> >>>   socklen_t len = sizeof(cpu);
> >>>   setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, , );
> >>>
> >>>   Because  all  of the packets for a single stream (i.e., all
> >>>   packets for the same 4-tuple) arrive on the single RX queue
> >>>   that  is  associated with a particular CPU, the typical use
> >>>   case is to employ one listening process per RX queue,  with
> >>>   the  incoming  flow being handled by a listener on the same
> >>>   CPU that is handling the RX queue.  This  provides  optimal
> >>>   NUMA behavior and keeps CPU caches hot.
> 
> > Hi Michael
> > 
> > Sorry for the delay.
> 
> Thanks for the reply, but I think you are assuming I know more than 
> I do. I'd like you to elaborate a little please. See below.
> 
> > Note that setting the option is not supported if SO_REUSEPORT is used.
> 
> Please define "not supported". Does this yield an API diagnostic?
> If so, what is it?
> 
> > Socket will be selected from an array, either by a hash or BPF program
> > that has no access to this information.
> 
> Sorry -- I'm lost here. How does this comment relate to the proposed
> man page text above?

Simply that :

If an application uses both SO_INCOMING_CPU and SO_REUSEPORT, then
SO_REUSEPORT logic, selecting the socket to receive the packet, ignores
SO_INCOMING_CPU setting.

This does not need to be documented, because it is an implementation
detail/bug that could be changed, if someone cares enough.





Re: [PATCH] ieee802154: don't select COMMON_CLK

2017-04-19 Thread Marcel Holtmann
Hi Arnd,

> A device driver must not select the COMMON_CLK subsystem, as that conflicts
> with platforms that provide a legacy implementation of the clk API:
> 
> drivers/clk/clk.o: In function `clk_enable':
> clk.c:(.text.clk_enable+0x0): multiple definition of `clk_enable'
> arch/arm/mach-sa1100/clock.o:clock.c:(.text.clk_enable+0x0): first defined 
> here
> drivers/clk/clk.o: In function `clk_round_rate':
> clk.c:(.text.clk_round_rate+0x0): multiple definition of `clk_round_rate'
> arch/arm/mach-sa1100/clock.o:clock.c:(.text.clk_round_rate+0x0): first 
> defined here
> drivers/clk/clk.o: In function `clk_get_parent':
> clk.c:(.text.clk_get_parent+0x0): multiple definition of `clk_get_parent'
> arch/arm/mach-sa1100/clock.o:clock.c:(.text.clk_get_parent+0x0): first 
> defined here
> drivers/clk/clk.o: In function `clk_get_rate':
> clk.c:(.text.clk_get_rate+0x0): multiple definition of `clk_get_rate'
> 
> This changes the 'select' into 'depends on', as all other similar drivers do.
> 
> Fixes: d931acd575d6 ("ieee802154: Add CA8210 IEEE 802.15.4 device driver")
> Signed-off-by: Arnd Bergmann 
> ---
> drivers/net/ieee802154/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)

patch has been applied to bluetooth-next tree.

Regards

Marcel



Re: XDP question: best API for returning/setting egress port?

2017-04-19 Thread Andy Gospodarek
On Tue, Apr 18, 2017 at 09:58:56PM +0200, Jesper Dangaard Brouer wrote:
> 
> As I argued in NetConf presentation[1] (from slide #9) we need a port
> mapping table (instead of using ifindex'es).  Both for supporting
> other "port" types than net_devices (think sockets), and for
> sandboxing what XDP can bypass.
> 
> I want to create a new XDP action called XDP_REDIRECT, that instruct
> XDP to send the xdp_buff to another "port" (get translated into a
> net_device, or something else depending on internal port type).
> 
> Looking at the userspace/eBPF interface, I'm wondering what is the
> best API for "returning" this port number from eBPF?
> 
> The options I see is:
> 
> 1) Split-up the u32 action code, and e.g let the high-16-bit be the
>port number and lower-16bit the (existing) action verdict.
> 
>  Pros: Simple API
>  Cons: Number of ports limited to 64K

Practically speaking this may be seem reserving 64k for the port number
might be enough space, but I would also like to see a new return option
for flags so I start to get concerned about space.  Daniel also
hightlights the fact that encoding the port in the action may does not
leave room for flags and could get confusing.

One unfortunate side-effect of dropping or transmitting frames with XDP
is that we lose the opportunity to statistically sample in netfilter
since the frames were dropped so early and I'd like to bring that back
with a call to parse flags and possibly call psample_sample_packet()
after the xdp action.  Packet sampling cannot simply be an action
since there are times when a frame should be dropped but should also be
sampled, so it seems logical to add this as a flag.

> 
> 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow
>eBPF to update xdp_md->port.
> 
>  Pros: Larger number of ports.
>  Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md.
>(see xdp_convert_ctx_access)

I think I would lean towards this based on the fact that I'd like to see
a flags field added to the u32 return (maybe the top 8 bits) as
mentioned above.

So if we follow down this path and add a 'dest' field to xdp_buff and
xdp_md like this:

struct xdp_buff {
void *data;
void *data_end;
void *data_hard_start;
__u32 dest; 
};

struct xdp_md {
__u32 data;
__u32 data_end;
__u32 dest;
};

and then lookup this dest in a table we have the option to make that
dest an ifindex/socket/other.

I did also look at JohnF's patch and I do like the simplicity of the redirect
action and new ndo_xdp_xmit and how it moves towards a way to transmit the
frame.  The downside is that it presumes an ifindex, so it might not be ideal
we want the lookup to return something other than an ifindex.

Before aligning on a direction for the return values from exiting xdp
call, it seems like we should also think about the tx side and how that
would be handled.  If we are ultimately going to need a new netdev op to
handle the redirect then what may be the issue with not providing the
destination port the return code and the option proposed by JohnF looks
good to me with maybe a small tweak to not presume ifindex in some manner.

JohnF, any test results with this you can share?  Presumably you tested with
virtio-net, right?

> 
> 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff.
> 
>  Pros: Hides impl details, and allows helper to give eBPF code feedback
>(on e.g. if port doesn't exist any longer)
>  Cons: Helper function call likely slower?
> 
> 
> (Cc'ed xdp-newbies as end-users might have an opinion on UAPI?)
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> [1] 
> http://people.netfilter.org/hawk/presentations/NetConf2017/xdp_work_ahead_NetConf_April_2017.pdf


Re: [PATCH] netfilter: ctnetlink: Make some parameters integer to avoid enum mismatch

2017-04-19 Thread Matthias Kaehlcke
El Wed, Apr 19, 2017 at 12:41:10PM -0700 Joe Perches ha dit:

> On Wed, 2017-04-19 at 11:39 -0700, Matthias Kaehlcke wrote:
> > Not all parameters passed to ctnetlink_parse_tuple() and
> > ctnetlink_exp_dump_tuple() match the enum type in the signatures of these
> > functions.
> 
> Maybe that should be changed/fixed.

Please see the previous discussion at
http://www.spinics.net/lists/netfilter-devel/msg47540.html

> > Since this is intended change the argument type of to be an int
> > value.
> 
> u32 is not int, it's unsigned int

I would argue that an unsigned int is an int(eger) and considered it
an unnecessary detail for the commit message to be explicit. I can
change it if others deem it incorrect.

Cheers

Matthias

> > Signed-off-by: Matthias Kaehlcke 
> > ---
> >  net/netfilter/nf_conntrack_netlink.c | 7 +++
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/net/netfilter/nf_conntrack_netlink.c 
> > b/net/netfilter/nf_conntrack_netlink.c
> > index dc7dfd68fafe..775eb5d9165b 100644
> > --- a/net/netfilter/nf_conntrack_netlink.c
> > +++ b/net/netfilter/nf_conntrack_netlink.c
> > @@ -1006,9 +1006,8 @@ static const struct nla_policy 
> > tuple_nla_policy[CTA_TUPLE_MAX+1] = {
> >  
> >  static int
> >  ctnetlink_parse_tuple(const struct nlattr * const cda[],
> > - struct nf_conntrack_tuple *tuple,
> > - enum ctattr_type type, u_int8_t l3num,
> > - struct nf_conntrack_zone *zone)
> > + struct nf_conntrack_tuple *tuple, u32 type,
> > + u_int8_t l3num, struct nf_conntrack_zone *zone)
> >  {
> > struct nlattr *tb[CTA_TUPLE_MAX+1];
> > int err;
> > @@ -2443,7 +2442,7 @@ static struct nfnl_ct_hook ctnetlink_glue_hook = {
> >  
> >  static int ctnetlink_exp_dump_tuple(struct sk_buff *skb,
> > const struct nf_conntrack_tuple *tuple,
> > -   enum ctattr_expect type)
> > +   u32 type)
> >  {
> > struct nlattr *nest_parms;
> >  


[PATCH net v3] bridge: ebtables: fix reception of frames DNAT-ed to bridge device/port

2017-04-19 Thread Linus Lüssing
When trying to redirect bridged frames to the bridge device itself or
a bridge port (brouting) via the dnat target then this currently fails:

The ethernet destination of the frame is dnat'ed to the MAC address of
the bridge device or port just fine. However, the IP code drops it in
the beginning of ip_input.c/ip_rcv() as the dnat target left
the skb->pkt_type as PACKET_OTHERHOST.

Fixing this by resetting skb->pkt_type to an appropriate type after
dnat'ing.

Signed-off-by: Linus Lüssing 

---

Changelog v3:
- moved pkt_type fixup into ebtable dnat code
  -> v1/v2 only fixed it for prerouting/dnat so far, now tested
 and verified that v3 fixes it for brouting/dnat, too
- updated commit message

Changelog v2:
- refrain from altering pkt_type for multicast packets
  with a unicast destination MAC
---
 net/bridge/netfilter/ebt_dnat.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/net/bridge/netfilter/ebt_dnat.c b/net/bridge/netfilter/ebt_dnat.c
index 4e0b0c3..21acb53 100644
--- a/net/bridge/netfilter/ebt_dnat.c
+++ b/net/bridge/netfilter/ebt_dnat.c
@@ -9,6 +9,7 @@
  */
 #include 
 #include 
+#include "../br_private.h"
 #include 
 #include 
 #include 
@@ -18,11 +19,32 @@ static unsigned int
 ebt_dnat_tg(struct sk_buff *skb, const struct xt_action_param *par)
 {
const struct ebt_nat_info *info = par->targinfo;
+   struct net_device *dev;
 
if (!skb_make_writable(skb, 0))
return EBT_DROP;
 
ether_addr_copy(eth_hdr(skb)->h_dest, info->mac);
+
+   if (is_multicast_ether_addr(info->mac)) {
+   if (is_broadcast_ether_addr(info->mac))
+   skb->pkt_type = PACKET_BROADCAST;
+   else
+   skb->pkt_type = PACKET_MULTICAST;
+   } else {
+   rcu_read_lock();
+   if (xt_hooknum(par) != NF_BR_BROUTING)
+   dev = br_port_get_rcu(xt_in(par))->br->dev;
+   else
+   dev = xt_in(par);
+
+   if (ether_addr_equal(info->mac, dev->dev_addr))
+   skb->pkt_type = PACKET_HOST;
+   else
+   skb->pkt_type = PACKET_OTHERHOST;
+   rcu_read_unlock();
+   }
+
return info->target;
 }
 
-- 
2.1.4



Re: [PATCH] netfilter: ctnetlink: Make some parameters integer to avoid enum mismatch

2017-04-19 Thread Joe Perches
On Wed, 2017-04-19 at 11:39 -0700, Matthias Kaehlcke wrote:
> Not all parameters passed to ctnetlink_parse_tuple() and
> ctnetlink_exp_dump_tuple() match the enum type in the signatures of these
> functions.

Maybe that should be changed/fixed.

> Since this is intended change the argument type of to be an int
> value.

u32 is not int, it's unsigned int

> Signed-off-by: Matthias Kaehlcke 
> ---
>  net/netfilter/nf_conntrack_netlink.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_netlink.c 
> b/net/netfilter/nf_conntrack_netlink.c
> index dc7dfd68fafe..775eb5d9165b 100644
> --- a/net/netfilter/nf_conntrack_netlink.c
> +++ b/net/netfilter/nf_conntrack_netlink.c
> @@ -1006,9 +1006,8 @@ static const struct nla_policy 
> tuple_nla_policy[CTA_TUPLE_MAX+1] = {
>  
>  static int
>  ctnetlink_parse_tuple(const struct nlattr * const cda[],
> -   struct nf_conntrack_tuple *tuple,
> -   enum ctattr_type type, u_int8_t l3num,
> -   struct nf_conntrack_zone *zone)
> +   struct nf_conntrack_tuple *tuple, u32 type,
> +   u_int8_t l3num, struct nf_conntrack_zone *zone)
>  {
>   struct nlattr *tb[CTA_TUPLE_MAX+1];
>   int err;
> @@ -2443,7 +2442,7 @@ static struct nfnl_ct_hook ctnetlink_glue_hook = {
>  
>  static int ctnetlink_exp_dump_tuple(struct sk_buff *skb,
>   const struct nf_conntrack_tuple *tuple,
> - enum ctattr_expect type)
> + u32 type)
>  {
>   struct nlattr *nest_parms;
>  


Re: [PATCH 1/1] drivers: net: usb: qmi_wwan: add QMI_QUIRK_SET_DTR for Telit PID 0x1201

2017-04-19 Thread Bjørn Mork
Aleksander Morgado  writes:

> On Wed, Apr 19, 2017 at 7:28 PM, Bjørn Mork  wrote:
>>> as a side note in latest kernels I had troubles with qmi devices
>>> (e.g. I/O error when using qmicli).
>>>
>>> I found your suggestion in libqmi mailing list to revert commit
>>>
>>> 833415a3e781a26fe480a34d45086bdb4fe1e4c0
>>> cdc-wdm: fix "out-of-sync" due to missing notifications
>>
>> I guess a revert of that commit should be done then..
>>
>> I have been stalling because I have been hoping to replace it with a
>> better fix instead of a plain revert. I believe there are several issues
>> playing badly together here.  That commit was _expected_ to cause
>> spurious EPIPE errors, which would be translated to EIO if they were
>> propagated.  But they should be filtered out rightaway, in theory. This
>> works for me.  I can see the EPIPEs with debugging, but I have never
>> seen any EIO from read.
>>
>> And there is the problem: I am unable to reproduce this problem.  I have
>> previously tested this back and forth with several MDM9200 and MDM9235
>> generation modems in QMI mode, as well as in MBIM mode.  And also with a
>> number of other MBIM modems.  Aleksander reported that he could
>> reproduce the issue using an MDM9x15 generation modem in QMI mode, but
>> not with any MDM9x00 or MDM9x35 modem.  So I have now tried any way I
>> can imagine to reproduce the issue with a Sierra Wireless EM7305, which
>> is the only MDM9x15 modem I have. The firmware is SWI9X15C_05.05.58.00.
>>
>> But unfortunately the testing is still without "success".  It plain
>> works for me, every time, using ModemManager, qmicli with or without
>> proxy, or uqmi.
>>
>> Would you mind describing in detail how you trigger the EIOs?  What
>> software and command sequence are you using?  Does it reliably reproduce
>> the issue, or do you have to try several times?  What modem chipset and
>> firmware is used?
>
> Reliably, as in the second command I sent already showed the issue :/
> I meant to try to debug this issue myself a while ago, but got busy
> with other stuff, as usual... This is with a Sierra Wireless MC7304
> running SWI9X15C_05.05.67.00. If you want, I can give you SSH access
> to a system with this modem plugged in, or I can even send you a spare
> MC7354, whatever you prefer.

I don't think another modem would help.  The EM7305 should be the same
as an.MC7304 or MC7354 in this regard.

And the ssh access would only help if I knew what to look for.  I think
you can do this just as well as me...  


> I'm just running --dms-get-operating-mode multiple times, and getting
> errors frequently:
>
> aleksander@athena:~/Development/foss/libqmi:(master)$ sudo qmicli -d
> /dev/cdc-wdm3 --dms-get-operating-mode
> [/dev/cdc-wdm3] Operating mode retrieved:
> Mode: 'online'
> HW restricted: 'no'
>
> aleksander@athena:~/Development/foss/libqmi:(master)$ sudo qmicli -d
> /dev/cdc-wdm3 --dms-get-operating-mode
> [19 abr 2017, 20:25:36] -Warning ** Error reading from istream: Error
> reading from file descriptor: Input/output error
> ^[[A^Ccancelling the operation...
> error: couldn't create client for the 'dms' service: Operation was cancelled

Lucky you :)

I upgraded the EM7305 to SWI9X15C_05.05.67.00 and tried again.  No
difference.  I ran

 for i in `seq 1 1000`; do qmicli -d /dev/cdc-wdm1 --dms-get-operating-mode; 
done

without a single error.  I guess I'll go for the revert.

But I think we need to do something about commit c1da59dad0eb as
well. It can end up calling usb_submit_urb(desc->response, GFP_KERNEL)
from an URB callback.  And making the rerr update conditional doesn't
seem right either.  It changes the meaning of rerr from "last status" to
"first error", which means that any error will mask a later success
until userspace reads the error.  That's fishy.


Bjørn





Re: [PATCH net-next] net: ipv6: send unsolicited NA if enabled for all interfaces

2017-04-19 Thread Sergei Shtylyov

Hello!

On 04/19/2017 10:05 PM, David Ahern wrote:


When arp_notify is set to 1 for either a specific interface or for 'all'
interfaces, gratuitous arp requests are sent. Since ndisc_notify is the
ipv6 equivalent to arp_notify, it should follow the same semantics.
Commit 4a6e3c5def13 sends the NA on admin up. The final piece is checking


   Need to also specify the commit summary line enclosed into (""), just like 
below.



devconf_all->ndisc_notify in addition to the per device setting. Add it.

Fixes: 5cb04436eef6 ("ipv6: add knob to send unsolicited ND on link-layer address 
change")
Signed-off-by: David Ahern 

[...]

MBR, Sergei



Re: [PATCH v2] sh_eth: unmap DMA buffers when freeing rings

2017-04-19 Thread Sergei Shtylyov

On 04/17/2017 11:10 PM, David Miller wrote:


The DMA API debugging (when enabled) causes:

WARNING: CPU: 0 PID: 1445 at lib/dma-debug.c:519 add_dma_entry+0xe0/0x12c
DMA-API: exceeded 7 overlapping mappings of cacheline 0x01b2974d

to be  printed after repeated initialization of the Ether device, e.g.
suspend/resume or 'ifconfig' up/down. This is because DMA buffers mapped
using dma_map_single() in sh_eth_ring_format() and sh_eth_start_xmit() are
never unmapped. Resolve this problem by unmapping the buffers when freeing
the descriptor  rings;  in order  to do it right, we'd have to add an extra
parameter to sh_eth_txfree() (we rename this function to sh_eth_tx_free(),
while at it).

Based on the commit a47b70ea86bd ("ravb: unmap descriptors when freeing
rings").

Signed-off-by: Sergei Shtylyov 


Applied, thanks.


   Please don't forget to send this to -stable.
   The bug seems to be there from the very beginning.

MBR, Sergei



Re: [PATCH v2] sh_eth: unmap DMA buffers when freeing rings

2017-04-19 Thread Sergei Shtylyov

On 04/17/2017 11:10 PM, David Miller wrote:


The DMA API debugging (when enabled) causes:

WARNING: CPU: 0 PID: 1445 at lib/dma-debug.c:519 add_dma_entry+0xe0/0x12c
DMA-API: exceeded 7 overlapping mappings of cacheline 0x01b2974d

to be  printed after repeated initialization of the Ether device, e.g.
suspend/resume or 'ifconfig' up/down. This is because DMA buffers mapped
using dma_map_single() in sh_eth_ring_format() and sh_eth_start_xmit() are
never unmapped. Resolve this problem by unmapping the buffers when freeing
the descriptor  rings;  in order  to do it right, we'd have to add an extra
parameter to sh_eth_txfree() (we rename this function to sh_eth_tx_free(),
while at it).

Based on the commit a47b70ea86bd ("ravb: unmap descriptors when freeing
rings").

Signed-off-by: Sergei Shtylyov 


Applied, thanks.


   Please don;t forget to send it to -stable.
   The bug seems to be there from the very beginning.

MBR, Sergei



[PATCH net-next] net: ipv6: send unsolicited NA if enabled for all interfaces

2017-04-19 Thread David Ahern
When arp_notify is set to 1 for either a specific interface or for 'all'
interfaces, gratuitous arp requests are sent. Since ndisc_notify is the
ipv6 equivalent to arp_notify, it should follow the same semantics.
Commit 4a6e3c5def13 sends the NA on admin up. The final piece is checking
devconf_all->ndisc_notify in addition to the per device setting. Add it.

Fixes: 5cb04436eef6 ("ipv6: add knob to send unsolicited ND on link-layer 
address change")
Signed-off-by: David Ahern 
---
 net/ipv6/ndisc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index b23822e64228..d310dc41209a 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1753,7 +1753,8 @@ static int ndisc_netdev_event(struct notifier_block 
*this, unsigned long event,
idev = in6_dev_get(dev);
if (!idev)
break;
-   if (idev->cnf.ndisc_notify)
+   if (idev->cnf.ndisc_notify ||
+   net->ipv6.devconf_all->ndisc_notify)
ndisc_send_unsol_na(dev);
in6_dev_put(idev);
break;
-- 
2.1.4



Re: [patch] socket.7: Document SO_INCOMING_CPU

2017-04-19 Thread Michael Kerrisk (man-pages)
Hi Eric,

[reodering for clarity]

>> On 02/19/2017 09:55 PM, Michael Kerrisk (man-pages) wrote:
>>> [CC += Eric, so that he might review]
>>>
>>> Hello Francois,
>>>
>>> On 02/18/2017 05:06 AM, Francois Saint-Jacques wrote:
 This socket option is undocumented. Applies on the latest version
 (man-pages-4.09-511).

 diff --git a/man7/socket.7 b/man7/socket.7
 index 3efd7a5d8..1a3ffa253 100644
 --- a/man7/socket.7
 +++ b/man7/socket.7
 @@ -490,6 +490,26 @@ flag on a socket
  operation.
  Expects an integer boolean flag.
  .TP
 +.BR SO_INCOMING_CPU " (getsockopt since Linux 3.19, setsockopt since
 Linux 4.4)"
 +.\" getsocktop 2c8c56e15df3d4c2af3d656e44feb18789f75837
 +.\" setsocktop 70da268b569d32a9fddeea85dc18043de9d89f89
 +Sets or gets the cpu affinity of a socket. Expects an integer flag.
 +.sp
 +.in +4n
 +.nf
 +int cpu = 1;
 +socklen_t len = sizeof(cpu);
 +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, , );
 +.fi
 +.in
 +.sp
 +The typical use case is one listener per RX queue, as the associated 
 listener
 +should only accept flows handled in softirq by the same cpu.  This 
 provides
 +optimal NUMA behavior and keep cpu caches hot.
 +.TP
  .B SO_KEEPALIVE
  Enable sending of keep-alive messages on connection-oriented sockets.
  Expects an integer boolean flag.
>>>
>>> Thank you! Patch applied.
>>>
>>> I have tried to enhance the description somewhat. I'm not sure whether
>>> what I've written is quite correct (or whether it should be further
>>> extended). Eric, could you please take a look at the following, and let 
>>> me know if anything needs fixing:
>>>
>>>SO_INCOMING_CPU  (gettable  since Linux 3.19, settable since Linux
>>>4.4)
>>>   Sets or gets the CPU affinity  of  a  socket.   Expects  an
>>>   integer flag.
>>>
>>>   int cpu = 1;
>>>   socklen_t len = sizeof(cpu);
>>>   setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, , );
>>>
>>>   Because  all  of the packets for a single stream (i.e., all
>>>   packets for the same 4-tuple) arrive on the single RX queue
>>>   that  is  associated with a particular CPU, the typical use
>>>   case is to employ one listening process per RX queue,  with
>>>   the  incoming  flow being handled by a listener on the same
>>>   CPU that is handling the RX queue.  This  provides  optimal
>>>   NUMA behavior and keeps CPU caches hot.

> Hi Michael
> 
> Sorry for the delay.

Thanks for the reply, but I think you are assuming I know more than 
I do. I'd like you to elaborate a little please. See below.

> Note that setting the option is not supported if SO_REUSEPORT is used.

Please define "not supported". Does this yield an API diagnostic?
If so, what is it?

> Socket will be selected from an array, either by a hash or BPF program
> that has no access to this information.

Sorry -- I'm lost here. How does this comment relate to the proposed
man page text above?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


[PATCH] netfilter: ctnetlink: Make some parameters integer to avoid enum mismatch

2017-04-19 Thread Matthias Kaehlcke
Not all parameters passed to ctnetlink_parse_tuple() and
ctnetlink_exp_dump_tuple() match the enum type in the signatures of these
functions. Since this is intended change the argument type of to be an int
value.

Signed-off-by: Matthias Kaehlcke 
---
 net/netfilter/nf_conntrack_netlink.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c 
b/net/netfilter/nf_conntrack_netlink.c
index dc7dfd68fafe..775eb5d9165b 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1006,9 +1006,8 @@ static const struct nla_policy 
tuple_nla_policy[CTA_TUPLE_MAX+1] = {
 
 static int
 ctnetlink_parse_tuple(const struct nlattr * const cda[],
- struct nf_conntrack_tuple *tuple,
- enum ctattr_type type, u_int8_t l3num,
- struct nf_conntrack_zone *zone)
+ struct nf_conntrack_tuple *tuple, u32 type,
+ u_int8_t l3num, struct nf_conntrack_zone *zone)
 {
struct nlattr *tb[CTA_TUPLE_MAX+1];
int err;
@@ -2443,7 +2442,7 @@ static struct nfnl_ct_hook ctnetlink_glue_hook = {
 
 static int ctnetlink_exp_dump_tuple(struct sk_buff *skb,
const struct nf_conntrack_tuple *tuple,
-   enum ctattr_expect type)
+   u32 type)
 {
struct nlattr *nest_parms;
 
-- 
2.12.2.816.g281164-goog



Re: [PATCH 1/1] drivers: net: usb: qmi_wwan: add QMI_QUIRK_SET_DTR for Telit PID 0x1201

2017-04-19 Thread Aleksander Morgado
On Wed, Apr 19, 2017 at 7:28 PM, Bjørn Mork  wrote:
>> as a side note in latest kernels I had troubles with qmi devices
>> (e.g. I/O error when using qmicli).
>>
>> I found your suggestion in libqmi mailing list to revert commit
>>
>> 833415a3e781a26fe480a34d45086bdb4fe1e4c0
>> cdc-wdm: fix "out-of-sync" due to missing notifications
>
> I guess a revert of that commit should be done then..
>
> I have been stalling because I have been hoping to replace it with a
> better fix instead of a plain revert. I believe there are several issues
> playing badly together here.  That commit was _expected_ to cause
> spurious EPIPE errors, which would be translated to EIO if they were
> propagated.  But they should be filtered out rightaway, in theory. This
> works for me.  I can see the EPIPEs with debugging, but I have never
> seen any EIO from read.
>
> And there is the problem: I am unable to reproduce this problem.  I have
> previously tested this back and forth with several MDM9200 and MDM9235
> generation modems in QMI mode, as well as in MBIM mode.  And also with a
> number of other MBIM modems.  Aleksander reported that he could
> reproduce the issue using an MDM9x15 generation modem in QMI mode, but
> not with any MDM9x00 or MDM9x35 modem.  So I have now tried any way I
> can imagine to reproduce the issue with a Sierra Wireless EM7305, which
> is the only MDM9x15 modem I have. The firmware is SWI9X15C_05.05.58.00.
>
> But unfortunately the testing is still without "success".  It plain
> works for me, every time, using ModemManager, qmicli with or without
> proxy, or uqmi.
>
> Would you mind describing in detail how you trigger the EIOs?  What
> software and command sequence are you using?  Does it reliably reproduce
> the issue, or do you have to try several times?  What modem chipset and
> firmware is used?

Reliably, as in the second command I sent already showed the issue :/
I meant to try to debug this issue myself a while ago, but got busy
with other stuff, as usual... This is with a Sierra Wireless MC7304
running SWI9X15C_05.05.67.00. If you want, I can give you SSH access
to a system with this modem plugged in, or I can even send you a spare
MC7354, whatever you prefer.

I'm just running --dms-get-operating-mode multiple times, and getting
errors frequently:

aleksander@athena:~/Development/foss/libqmi:(master)$ sudo qmicli -d
/dev/cdc-wdm3 --dms-get-operating-mode
[/dev/cdc-wdm3] Operating mode retrieved:
Mode: 'online'
HW restricted: 'no'

aleksander@athena:~/Development/foss/libqmi:(master)$ sudo qmicli -d
/dev/cdc-wdm3 --dms-get-operating-mode
[19 abr 2017, 20:25:36] -Warning ** Error reading from istream: Error
reading from file descriptor: Input/output error
^[[A^Ccancelling the operation...
error: couldn't create client for the 'dms' service: Operation was cancelled

aleksander@athena:~/Development/foss/libqmi:(master)$ sudo qmicli -d
/dev/cdc-wdm3 --dms-get-operating-mode
[19 abr 2017, 20:25:38] -Warning ** Error reading from istream: Error
reading from file descriptor: Input/output error
^Ccancelling the operation...
error: couldn't create client for the 'dms' service: Operation was cancelled

aleksander@athena:~/Development/foss/libqmi:(master)$ sudo qmicli -d
/dev/cdc-wdm3 --dms-get-operating-mode
[/dev/cdc-wdm3] Operating mode retrieved:
Mode: 'online'
HW restricted: 'no'

aleksander@athena:~/Development/foss/libqmi:(master)$ sudo qmicli -d
/dev/cdc-wdm3 --dms-get-operating-mode
[/dev/cdc-wdm3] Operating mode retrieved:
Mode: 'online'
HW restricted: 'no'

aleksander@athena:~/Development/foss/libqmi:(master)$ sudo qmicli -d
/dev/cdc-wdm3 --dms-get-operating-mode
[/dev/cdc-wdm3] Operating mode retrieved:
Mode: 'online'
HW restricted: 'no'

aleksander@athena:~/Development/foss/libqmi:(master)$ sudo qmicli -d
/dev/cdc-wdm3 --dms-get-operating-mode
[/dev/cdc-wdm3] Operating mode retrieved:
Mode: 'online'
HW restricted: 'no'

aleksander@athena:~/Development/foss/libqmi:(master)$ sudo qmicli -d
/dev/cdc-wdm3 --dms-get-operating-mode
[19 abr 2017, 20:25:52] -Warning ** Error reading from istream: Error
reading from file descriptor: Input/output error
error: couldn't create client for the 'dms' service: CID allocation
failed in the CTL client: Transaction timed out

...

-- 
Aleksander
https://aleksander.es


  1   2   3   >