Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Eric Dumazet



On 12/08/2018 12:14 PM, Ilias Apalodimas wrote:
> On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote:
>>>  
>>> I want to make sure you guys thought about splice() stuff, and
>>> skb_try_coalesce(), and GRO, and skb cloning, and ...
>>
>> Thanks for the pointers. To Ilias, we need to double check skb_try_coalesce()
>> code path, as it does look like we don't handle this correctly.
>>
> 
> Noted. We did check skb cloning, but not this one indeed

There is also TCP RX zero copy to consider ( tcp_zerocopy_receive() )
but more generally all paths that can grab pages from skbs.






Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Willy Tarreau
On Sat, Dec 08, 2018 at 10:14:47PM +0200, Ilias Apalodimas wrote:
> On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote:
> > >  
> > > I want to make sure you guys thought about splice() stuff, and
> > > skb_try_coalesce(), and GRO, and skb cloning, and ...
> > 
> > Thanks for the pointers. To Ilias, we need to double check the
> > skb_try_coalesce() code path, as it does look like we don't handle this
> > correctly.
> > 
> 
> Noted. We did check skb cloning, but not this one indeed

I'll try to run some tests on my clearfog later, but for now I'm still
100% busy until haproxy 1.9 gets released.

Cheers,
Willy


Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Andrew Lunn
> That is not how XDP works. XDP must run on the "master" physical driver
> interface, as the "slave" interface is a virtual DSA interface.  I did
> mention DSA because I'm handling that on the EspressoBin implementation.

Hi Jesper

It is going to be interesting to see how you do that.

As a user, i want to attach the XDP program to a slave interface. So i
assume you somehow redirect that to the master, with some extra meta
data to allow XDP on the master to detect packets for a specific
slave?

   Andrew


Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Jesper Dangaard Brouer
On Sat, 8 Dec 2018 19:43:38 +0100
Björn Töpel  wrote:

> On Sat, 8 Dec 2018 at 15:52, Jesper Dangaard Brouer wrote:
> >
> > On Fri, 7 Dec 2018 15:01:55 +0100
> > Björn Töpel  wrote:
> >  
> > > On Fri, 7 Dec 2018 at 14:42, Jesper Dangaard Brouer wrote:
> > > >
> > > > On Fri,  7 Dec 2018 12:44:24 +0100
> > > > Björn Töpel  wrote:
> > > >  
> > > > > The rationale behind attach is performance and ease of use. Many XDP
> > > > > socket users just need a simple way of creating/binding a socket and
> > > > > receiving frames right away without loading an XDP program.
> > > > >
> > > > > XDP_ATTACH adds a mechanism we call "builtin XDP program" that simply
> > > > > is a kernel provided XDP program that is installed to the netdev when
> > > > > XDP_ATTACH is being passed as a bind() flag.
> > > > >
> > > > > The builtin program is the simplest program possible to redirect a
> > > > > frame to an attached socket. In restricted C it would look like this:
> > > > >
> > > > >   SEC("xdp")
> > > > >   int xdp_prog(struct xdp_md *ctx)
> > > > >   {
> > > > > return bpf_xsk_redirect(ctx);
> > > > >   }
> > > > >
> > > > > The builtin program loaded via XDP_ATTACH behaves, from an
> > > > > install-to-netdev/uninstall-from-netdev point of view, differently
> > > > > from regular XDP programs. The easiest way to look at it is as a
> > > > > 2-level hierarchy, where regular XDP programs have precedence over the
> > > > > builtin one.
> > > > >
> > > > > If no regular XDP program is installed to the netdev, the builtin will
> > > > > be installed. If the builtin program is installed, and a regular one is
> > > > > installed, the regular XDP program will have precedence over the builtin
> > > > > one.
> > > > >
> > > > > Further, if a regular program is installed, and later removed, the
> > > > > builtin one will automatically be installed.
> > > > >
> > > > > The sxdp_flags field of struct sockaddr_xdp gets two new options
> > > > > XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which maps to the
> > > > > corresponding XDP netlink install flags.
> > > > >
> > > > > The builtin XDP program functionally adds even more complexity to the
> > > > > already hard to read dev_change_xdp_fd. Maybe it would be simpler to
> > > > > store the program in the struct net_device together with the install
> > > > > flags instead of calling the ndo_bpf multiple times?  
> > > >
> > > > (As far as I can see from reading the code, correct me if I'm wrong.)
> > > >
> > > > If an AF_XDP program uses XDP_ATTACH, then it installs the
> > > > builtin-program as the XDP program on the "entire" device.  That means
> > > > all RX-queues will call this XDP-bpf program (indirect call), and it is
> > > > actually only relevant for the specific queue_index.  Yes, the helper
> > > > call does check that the 'xdp->rxq->queue_index' for an attached 'xsk'
> > > > and return XDP_PASS if it is NULL:
> > > >  
> > >
> > > Yes, you are correct. The builtin XDP program, just like a regular XDP
> > > program, affects the whole netdev. So, yes the non-AF_XDP queues would
> > > get a performance hit from this. Just to reiterate -- this isn't new
> > > for this series. This has always been the case for XDP when acting on
> > > just one queue.
> > >  
> > > > +BPF_CALL_1(bpf_xdp_xsk_redirect, struct xdp_buff *, xdp)
> > > > +{
> > > > +   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> > > > +   struct xdp_sock *xsk;
> > > > +
> > > > +   xsk = READ_ONCE(xdp->rxq->dev->_rx[xdp->rxq->queue_index].xsk);
> > > > +   if (xsk) {
> > > > +   ri->xsk = xsk;
> > > > +   return XDP_REDIRECT;
> > > > +   }
> > > > +
> > > > +   return XDP_PASS;
> > > > +}
> > > >
> > > > Why does every normal XDP_PASS packet have to pay this overhead (indirect
> > > > call), when someone loads an AF_XDP socket program?  The AF_XDP socket
> > > > is tied hard and only relevant to a specific RX-queue (which is why we
> > > > get a performance boost due to SPSC queues).
> > > >
> > > > I acknowledge there is a need for this, but this use-case shows there
> > > > is a need for attaching XDP programs per RX-queue basis.
> > > >  
> > >
> > > From my AF_XDP perspective, having a program per queue would make
> > > sense. The discussion of a per-queue program has come up before, and I think
> > > the conclusion was that it would be too complex from a
> > > configuration/tooling point-of-view. Again, for AF_XDP this would be
> > > great.  
> >
> > I really think it is about time that we introduce these per-queue XDP
> > programs.  From a configuration/tooling point-of-view, your proposed
> > solution has the same issues we have to solve (like listing
> > introspection needing to show the per-queue XDP/BPF programs). And you have
> > almost solved it, by keeping the per-queue program in the net_device
> > struct.
> >  
> 
> I just want to emphasize; My series is only a convenience method for
> loading a (sticky/hierarchical) XDP program and a new 

Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Jesper Dangaard Brouer
On Sat, 8 Dec 2018 17:55:05 +0100
Andrew Lunn  wrote:

> On Sat, Dec 08, 2018 at 04:12:12PM +0100, Jesper Dangaard Brouer wrote:
> > On Fri, 7 Dec 2018 13:21:08 -0800
> > Alexei Starovoitov  wrote:
> >   
> > > for production I suspect the users would want
> > > an easy way to stay safe when they're playing with AF_XDP.
> > > So another builtin program that redirects ssh and ping traffic
> > > back to the kernel would be a nice addition.  
> > 
> > Are you saying a builtin program that needs to parse different kinds of
> > Eth-type headers (DSA, VLAN, QinQ)
> 
> Hi Jesper
> 
> Parsing DSA headers is quite hard, since most of them are not
> discoverable, you need to know they are there, and where they actually
> are.
> 
> I also don't think there are any use cases for actually trying to
> parse them on the master interface. You are more likely to want to run
> the eBPF program on the slave interface, once the DSA header has been
> removed, and it is clear what interface the frame is actually for.

That is not how XDP works. XDP must run on the "master" physical driver
interface, as the "slave" interface is a virtual DSA interface.  I did
mention DSA because I'm handling that on the EspressoBin implementation.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Ilias Apalodimas
On Sat, Dec 08, 2018 at 12:21:10PM -0800, David Miller wrote:
> From: Ilias Apalodimas 
> Date: Sat, 8 Dec 2018 16:57:28 +0200
> 
> > The patchset speeds up the mvneta driver on the default network
> > stack. The only change that was needed was to adapt the driver to
> > using the page_pool API. The speed improvements we are seeing on
> > specific workloads (i.e. 256b < packet < 400b) are almost 3x.
> > 
> > Lots of high speed drivers are doing similar recycling tricks themselves (and
> > there's no common code, everyone is doing something similar though). All we
> > are trying to do is provide a unified API to make that easier for the rest.
> > Another advantage is that if some drivers switch to the API, adding XDP
> > functionality on them is pretty trivial.
> 
> Yeah this is a very important point moving forward.
> 
> Jesse Brandeberg brought the following up to me at LPC and I'd like to
> develop it further.
> 
> Right now we tell driver authors to write a new driver as SKB based,
> and once they've done all of that work we tell them to basically
> shoe-horn XDP support into that somewhat different framework.
> 
> Instead, the model should be the other way around, because with a raw
> meta-data free set of data buffers we can always construct an SKB or
> pass it to XDP.

Yeah exactly, and it gets even worse. If the driver writer doesn't go through
the 'proper' path, i.e. allocate buffers and use build_skb, you end up having to
rewrite dma/memory management for the normal stack. So it's more than
'shoe-horning' XDP, it's re-writing and re-testing the whole thing.
The API also offers dma mapping capabilities (configurable). So you remove
potential nasty bugs there as well.

> 
> So drivers should be targetting some raw data buffer kind of interface
> which takes care of all of this stuff.  If the buffers get wrapped
> into an SKB and get pushed into the traditional networking stack, the
> driver shouldn't know or care.  Likewise if it ends up being processed
> with XDP, it should not need to know or care.
> 
> All of those details should be behind a common layer.  Then we can
> control:
> 
> 1) Buffer handling, recycling, "fast paths"
> 
> 2) Statistics
> 
> 3) XDP feature sets
> 
> We can consolidate behavior and semantics across all of the drivers
> if we do this.  No more talk about "supporting all XDP features",
> and the inconsistencies we have because of that.
> 
> The whole common statistics discussion could be resolved with this
> common layer as well.
> 
> We'd be able to control and properly optimize everything.


/Ilias


Re: [PATCH net-next] ip: silence udp zerocopy smatch false positive

2018-12-08 Thread David Miller
From: Willem de Bruijn 
Date: Sat,  8 Dec 2018 06:22:46 -0500

> From: Willem de Bruijn 
> 
> extra_uref is used in __ip(6)_append_data only if uarg is set.
> 
> Smatch sees that the variable is passed to sock_zerocopy_put_abort.
> This function accesses it only when uarg is set, but smatch cannot
> infer this.
> 
> Make this dependency explicit.
> 
> Fixes: 52900d22288e ("udp: elide zerocopy operation in hot path")
> Signed-off-by: Willem de Bruijn 

I looked and can't figure out a better way to fix this :)

Applied, thanks Willem.


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread David Miller
From: Ilias Apalodimas 
Date: Sat, 8 Dec 2018 16:57:28 +0200

> The patchset speeds up the mvneta driver on the default network
> stack. The only change that was needed was to adapt the driver to
> using the page_pool API. The speed improvements we are seeing on
> specific workloads (i.e. 256b < packet < 400b) are almost 3x.
> 
> Lots of high speed drivers are doing similar recycling tricks themselves (and
> there's no common code, everyone is doing something similar though). All we
> are trying to do is provide a unified API to make that easier for the rest.
> Another advantage is that if some drivers switch to the API, adding XDP
> functionality on them is pretty trivial.

Yeah this is a very important point moving forward.

Jesse Brandeberg brought the following up to me at LPC and I'd like to
develop it further.

Right now we tell driver authors to write a new driver as SKB based,
and once they've done all of that work we tell them to basically
shoe-horn XDP support into that somewhat different framework.

Instead, the model should be the other way around, because with a raw
meta-data free set of data buffers we can always construct an SKB or
pass it to XDP.

So drivers should be targetting some raw data buffer kind of interface
which takes care of all of this stuff.  If the buffers get wrapped
into an SKB and get pushed into the traditional networking stack, the
driver shouldn't know or care.  Likewise if it ends up being processed
with XDP, it should not need to know or care.

All of those details should be behind a common layer.  Then we can
control:

1) Buffer handling, recycling, "fast paths"

2) Statistics

3) XDP feature sets

We can consolidate behavior and semantics across all of the drivers
if we do this.  No more talk about "supporting all XDP features",
and the inconsistencies we have because of that.

The whole common statistics discussion could be resolved with this
common layer as well.

We'd be able to control and properly optimize everything.


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Ilias Apalodimas
On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote:
> >  
> > I want to make sure you guys thought about splice() stuff, and
> > skb_try_coalesce(), and GRO, and skb cloning, and ...
> 
> Thanks for the pointers. To Ilias, we need to double check skb_try_coalesce()
> code path, as it does look like we don't handle this correctly.
> 

Noted. We did check skb cloning, but not this one indeed

/Ilias


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Jesper Dangaard Brouer
On Sat, 8 Dec 2018 11:26:56 -0800
Eric Dumazet  wrote:

> On 12/08/2018 06:57 AM, Ilias Apalodimas wrote:
> > Hi Eric,   
>  This patch is changing struct sk_buff, and is thus per-definition
>  controversial.
> 
>  Place a new member 'mem_info' of type struct xdp_mem_info, just after
>  members (flags) head_frag and pfmemalloc, And not in between
>  headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
>  Copying mem_info during skb_clone() is required.  This makes sure that
>  pages are correctly freed or recycled during the altered
>  skb_free_head() invocation.  
> >>>
> >>> I read this to mean that this 'info' isn't accessed/needed until skb
> >>> is freed.  Any reason its not added at the end?
> >>>
> >>> This would avoid moving other fields that are probably accessed
> >>> more frequently during processing.
> >>>  
> >>
> >> But I do not get why the patch is needed.
> >>
> >> Adding extra cost for each skb destruction is costly.
> >>
> >> I though XDP was all about _not_ having skbs.  
> > 
> > You hit the only point i don't personally like in the code, xdp info in the
> > skb. Considering the benefits i am convinced this is ok and here's why:
> >
> >> Please let's do not slow down the non XDP stack only to make XDP more
> >> appealing.
> >
> > We are not trying to do that, on the contrary. The patchset has nothing
> > towards speeding XDP and slowing down anything else. The patchset speeds up
> > the mvneta driver on the default network stack. The only change that was
> > needed was to adapt the driver to using the page_pool API. The speed
> > improvements we are seeing on specific workloads (i.e. 256b < packet < 400b)
> > are almost 3x.
> >
> > Lots of high speed drivers are doing similar recycling tricks themselves (and
> > there's no common code, everyone is doing something similar though). All we
> > are trying to do is provide a unified API to make that easier for the rest.
> > Another advantage is that if some drivers switch to the API, adding XDP
> > functionality on them is pretty trivial.
> >
> > Moreover our tests are only performed on systems without or with SMMU
> > disabled. On a platform i have access to, enabling and disabling the SMMU
> > has some performance impact. By keeping the buffers mapped we believe this
> > impact will be substantially less (i'll come back with results once i have
> > them on this).
> >
> > I do understand your point, but the potential advantages in my head
> > outweigh that by a lot.
> >
> > I got other concerns on the patchset though. Like how much memory is it
> > 'ok' to keep mapped, keeping in mind we are using the streaming DMA API.
> > Are we going to affect anyone else negatively by doing so ?
> >
>  
> I want to make sure you guys thought about splice() stuff, and
> skb_try_coalesce(), and GRO, and skb cloning, and ...

Thanks for the pointers. To Ilias, we need to double check skb_try_coalesce()
code path, as it does look like we don't handle this correctly.

> I do not see how an essential property of page fragments would be
> attached to sk_buff, without adding extra code all over the places.
> 
> Page recycling in drivers is fine, but should operate on pages that
> the driver owns.
> 

I think we have addressed this.  We are only recycling pages that the
driver owns.  In the case of fragments, we release the DMA mapping and
don't recycle the page; instead the normal code path is taken (which is
missing in the case of skb_try_coalesce).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Sat, 8 Dec 2018 12:36:10 +0100

> The annoying part is actually that depending on the kernel config
> options CONFIG_XFRM, CONFIG_NF_CONNTRACK and CONFIG_BRIDGE_NETFILTER,
> whether there is a cache-line split, where mem_info gets moved into the
> next cacheline.

Note that Florian Westphal's work (trying to help MP-TCP) would
eliminate this variability.


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Eric Dumazet



On 12/08/2018 06:57 AM, Ilias Apalodimas wrote:
> Hi Eric, 
 This patch is changing struct sk_buff, and is thus per-definition
 controversial.

 Place a new member 'mem_info' of type struct xdp_mem_info, just after
 members (flags) head_frag and pfmemalloc, And not in between
 headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
 Copying mem_info during skb_clone() is required.  This makes sure that
 pages are correctly freed or recycled during the altered
 skb_free_head() invocation.
>>>
>>> I read this to mean that this 'info' isn't accessed/needed until skb
>>> is freed.  Any reason its not added at the end?
>>>
>>> This would avoid moving other fields that are probably accessed
>>> more frequently during processing.
>>>
>>
>> But I do not get why the patch is needed.
>>
>> Adding extra cost for each skb destruction is costly.
>>
>> I though XDP was all about _not_ having skbs.
> 
> You hit the only point i don't personally like in the code, xdp info in the
> skb. Considering the benefits i am convinced this is ok and here's why:
> 
>> Please let's do not slow down the non XDP stack only to make XDP more
>> appealing.
> 
> We are not trying to do that, on the contrary. The patchset has nothing
> towards speeding XDP and slowing down anything else. The patchset speeds up
> the mvneta driver on the default network stack. The only change that was
> needed was to adapt the driver to using the page_pool API. The speed
> improvements we are seeing on specific workloads (i.e. 256b < packet < 400b)
> are almost 3x.
> 
> Lots of high speed drivers are doing similar recycling tricks themselves (and
> there's no common code, everyone is doing something similar though). All we
> are trying to do is provide a unified API to make that easier for the rest.
> Another advantage is that if some drivers switch to the API, adding XDP
> functionality on them is pretty trivial.
> 
> Moreover our tests are only performed on systems without or with SMMU
> disabled. On a platform i have access to, enabling and disabling the SMMU
> has some performance impact. By keeping the buffers mapped we believe this
> impact will be substantially less (i'll come back with results once i have
> them on this).
> 
> I do understand your point, but the potential advantages in my head
> outweigh that by a lot.
> 
> I got other concerns on the patchset though. Like how much memory is it
> 'ok' to keep mapped, keeping in mind we are using the streaming DMA API.
> Are we going to affect anyone else negatively by doing so ?
> 


I want to make sure you guys thought about splice() stuff, and
skb_try_coalesce(), and GRO, and skb cloning, and ...

I do not see how an essential property of page fragments would be attached
to sk_buff, without adding extra code all over the places.

Page recycling in drivers is fine, but should operate on pages that the
driver owns.

Messing with skb->head seems not needed for performance, since skb->head
can be regular slab allocated.



Fw: [Bug 201933] New: r8169 hangs with Oops/null pointer deref on shutdown

2018-12-08 Thread Stephen Hemminger



Begin forwarded message:

Date: Sat, 08 Dec 2018 18:20:01 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 201933] New: r8169 hangs with Oops/null pointer deref on shutdown


https://bugzilla.kernel.org/show_bug.cgi?id=201933

Bug ID: 201933
   Summary: r8169 hangs with Oops/null pointer deref on shutdown
   Product: Networking
   Version: 2.5
Kernel Version: 4.20
  Hardware: x86-64
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: Other
  Assignee: step...@networkplumber.org
  Reporter: adf.li...@gmail.com
Regression: No

On 4.20-rc? halt/reboot r8169 hangs. ip li set down seems to get same error.

Dec  8 16:14:56 ph4 kernel: r8169 :06:00.0 enp6s0: Link is Down
Dec  8 16:14:57 ph4 kernel: BUG: unable to handle kernel NULL pointer
dereference at 03a0
Dec  8 16:14:57 ph4 kernel: PGD 22acfc067 P4D 22acfc067 PUD 232a33067 PMD 0 
Dec  8 16:14:57 ph4 kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Dec  8 16:14:57 ph4 kernel: CPU: 2 PID: 22126 Comm: ip Not tainted 4.20.0-rc5
#1
Dec  8 16:14:57 ph4 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled
By O.E.M./970 Extreme3 R2.0, BIOS P1.60 06/05/2014
Dec  8 16:14:57 ph4 kernel: RIP: 0010:queue_work_on+0x4/0x20
Dec  8 16:14:57 ph4 kernel: Code: ff 48 c7 c7 20 55 de 81 e8 02 58 04 00 c6 05
aa 1c 3e 01 01 e9 75 ff ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 53 9c 5b
fa  48 0f ba 2a 00 73 06 31 c0 53 9d 5b c3 e8 99 fb ff ff b8 01 00
Dec  8 16:14:57 ph4 kernel: RSP: 0018:c9d473a8 EFLAGS: 00010002
Dec  8 16:14:57 ph4 kernel: RAX: 8882362c4000 RBX: 0002 RCX:

Dec  8 16:14:57 ph4 kernel: RDX: 03a0 RSI: 88823680d200 RDI:
0008
Dec  8 16:14:57 ph4 kernel: RBP: 8882362c4840 R08: 888236400248 R09:
888236400340
Dec  8 16:14:57 ph4 kernel: R10:  R11: 82043b28 R12:
8882341ca960
Dec  8 16:14:57 ph4 kernel: R13: 8882341ca828 R14: 001a R15:
8882362c4840
Dec  8 16:14:57 ph4 kernel: FS:  7f5fc097f700()
GS:888237b0() knlGS:
Dec  8 16:14:57 ph4 kernel: CS:  0010 DS:  ES:  CR0: 80050033
Dec  8 16:14:57 ph4 kernel: CR2: 03a0 CR3: 00023344 CR4:
06e0
Dec  8 16:14:57 ph4 kernel: Call Trace:
Dec  8 16:14:57 ph4 kernel:  rtl8169_interrupt+0xba/0x260 [r8169]
Dec  8 16:14:57 ph4 kernel:  ? _raw_spin_unlock_irqrestore+0xf/0x30
Dec  8 16:14:57 ph4 kernel:  __free_irq+0x177/0x2d0
Dec  8 16:14:57 ph4 kernel:  free_irq+0x29/0x60
Dec  8 16:14:57 ph4 kernel:  pci_free_irq+0x13/0x20
Dec  8 16:14:57 ph4 kernel:  rtl8169_close+0xe8/0x1f0 [r8169]
Dec  8 16:14:57 ph4 kernel:  __dev_close_many+0x8f/0x100
Dec  8 16:14:57 ph4 kernel:  __dev_change_flags+0xe9/0x210
Dec  8 16:14:57 ph4 kernel:  dev_change_flags+0x29/0x70
Dec  8 16:14:57 ph4 kernel:  do_setlink+0x310/0xf30
Dec  8 16:14:57 ph4 kernel:  ? __nla_parse+0x48/0x190
Dec  8 16:14:57 ph4 kernel:  rtnl_newlink+0x571/0x880
Dec  8 16:14:57 ph4 kernel:  ? __nla_parse+0x48/0x190
Dec  8 16:14:57 ph4 kernel:  ? nla_parse+0x13/0x20
Dec  8 16:14:57 ph4 kernel:  ? rtnl_newlink+0x167/0x880
Dec  8 16:14:57 ph4 kernel:  ? preempt_count_add+0x55/0xb0
Dec  8 16:14:57 ph4 kernel:  ? get_page_from_freelist+0xdf0/0x1130
Dec  8 16:14:57 ph4 kernel:  ? kmem_cache_alloc+0x37/0x180
Dec  8 16:14:57 ph4 kernel:  rtnetlink_rcv_msg+0x141/0x380
Dec  8 16:14:57 ph4 kernel:  ? rtnl_calcit.isra.25+0x120/0x120
Dec  8 16:14:57 ph4 kernel:  netlink_rcv_skb+0x48/0x120
Dec  8 16:14:57 ph4 kernel:  netlink_unicast+0x1be/0x270
Dec  8 16:14:57 ph4 kernel:  netlink_sendmsg+0x2b7/0x3f0
Dec  8 16:14:57 ph4 kernel:  ? netlink_unicast+0x270/0x270
Dec  8 16:14:57 ph4 kernel:  ___sys_sendmsg+0x10b/0x310
Dec  8 16:14:57 ph4 kernel:  ? dev_ioctl+0x19d/0x410
Dec  8 16:14:57 ph4 kernel:  ? preempt_count_add+0x74/0xb0
Dec  8 16:14:57 ph4 kernel:  ? __inode_wait_for_writeback+0x7a/0xe0
Dec  8 16:14:57 ph4 kernel:  ? init_wait_var_entry+0x40/0x40
Dec  8 16:14:57 ph4 kernel:  ? preempt_count_add+0x74/0xb0
Dec  8 16:14:57 ph4 kernel:  ? _raw_spin_lock+0xe/0x30
Dec  8 16:14:57 ph4 kernel:  ? _raw_spin_unlock+0xd/0x30
Dec  8 16:14:57 ph4 kernel:  ? __dentry_kill+0x11a/0x160
Dec  8 16:14:57 ph4 kernel:  __sys_sendmsg+0x64/0xb0
Dec  8 16:14:57 ph4 kernel:  do_syscall_64+0x6b/0x1c0
Dec  8 16:14:57 ph4 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec  8 16:14:57 ph4 kernel: RIP: 0033:0x7f5fc00c1110
Dec  8 16:14:57 ph4 kernel: Code: 8b 15 84 3d 2b 00 f7 d8 64 89 02 48 c7 c0 ff
ff ff ff eb cd 66 0f 1f 44 00 00 83 3d d9 96 2b 00 00 75 10 b8 2e 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae b4 00 00 48 89 04 24
Dec  8 16:14:57 ph4 kernel: RSP: 002b:7ffd05cb11e8 EFLAGS: 0246
ORIG_RAX: 002e
Dec  8 16:14:57 ph4 kernel: RAX: ffda 

Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Björn Töpel
On Sat, 8 Dec 2018 at 16:12, Jesper Dangaard Brouer wrote:
>
> On Fri, 7 Dec 2018 13:21:08 -0800
> Alexei Starovoitov  wrote:
>
> > for production I suspect the users would want
> > an easy way to stay safe when they're playing with AF_XDP.
> > So another builtin program that redirects ssh and ping traffic
> > back to the kernel would be a nice addition.
>
> Are you saying a builtin program that needs to parse different kinds of
> Eth-type headers (DSA, VLAN, QinQ) and find the TCP port to match port
> 22 to return XDP_PASS, or else call the AF_XDP redirect? That seems to be
> pure overhead for this fast-path builtin program for AF_XDP.
>
> Would a solution be to install a NIC hardware filter that redirects SSH
> port 22 to another RX-queue, and then have a builtin program that
> returns XDP_PASS installed on that RX-queue?  And change Björn's
> semantics, such that RX-queue programs take precedence over the global
> XDP program. This would also be a good fail safe in general for XDP.
>

Exactly this; I'd say this is the most common way of using AF_XDP,
i.e. steer a certain flow to a Rx queue, and *all* packets on that
queue are pulled to the AF_XDP socket. This is why the builtin is the
way it is, it's what is being used, not only for benchmarking
scenarios.
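For reference, the flow-steering setup being discussed is typically done with ethtool n-tuple filters. A sketch — the interface name, queue number, and NIC support for ntuple filtering are all assumptions:

```shell
# Enable hardware n-tuple filtering (requires NIC support).
ethtool -K eth0 ntuple on

# Steer inbound SSH (TCP dst-port 22) to RX queue 3, away from the
# AF_XDP queue, so management traffic stays on the normal stack.
ethtool -N eth0 flow-type tcp4 dst-port 22 action 3

# Inspect the installed rules.
ethtool -n eth0
```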

> If the RX-queues take precedence, I can use this fail safe approach.
> E.g. when I want to test my new global XDP program, I'll use ethtool to
> match my management IP and send that to a specific RX-queue and my
> fail-safe BPF program.
>

Interesting take to have a *per queue* XDP program that overrides the
regular one. Maybe this is a better way to use builtins?

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Björn Töpel
On Sat, 8 Dec 2018 at 15:52, Jesper Dangaard Brouer wrote:
>
> On Fri, 7 Dec 2018 15:01:55 +0100
> Björn Töpel  wrote:
>
> > On Fri, 7 Dec 2018 at 14:42, Jesper Dangaard Brouer wrote:
> > >
> > > On Fri,  7 Dec 2018 12:44:24 +0100
> > > Björn Töpel  wrote:
> > >
> > > > The rationale behind attach is performance and ease of use. Many XDP
> > > > socket users just need a simple way of creating/binding a socket and
> > > > receiving frames right away without loading an XDP program.
> > > >
> > > > XDP_ATTACH adds a mechanism we call "builtin XDP program" that simply
> > > > is a kernel provided XDP program that is installed to the netdev when
> > > > XDP_ATTACH is being passed as a bind() flag.
> > > >
> > > > The builtin program is the simplest program possible to redirect a
> > > > frame to an attached socket. In restricted C it would look like this:
> > > >
> > > >   SEC("xdp")
> > > >   int xdp_prog(struct xdp_md *ctx)
> > > >   {
> > > > return bpf_xsk_redirect(ctx);
> > > >   }
> > > >
> > > > The builtin program loaded via XDP_ATTACH behaves, from an
> > > > install-to-netdev/uninstall-from-netdev point of view, differently
> > > > from regular XDP programs. The easiest way to look at it is as a
> > > > 2-level hierarchy, where regular XDP programs have precedence over the
> > > > builtin one.
> > > >
> > > > If no regular XDP program is installed to the netdev, the builtin will
> > > > be installed. If the builtin program is installed, and a regular one is
> > > > installed, the regular XDP program will have precedence over the builtin
> > > > one.
> > > >
> > > > Further, if a regular program is installed, and later removed, the
> > > > builtin one will automatically be installed.
> > > >
> > > > The sxdp_flags field of struct sockaddr_xdp gets two new options
> > > > XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which maps to the
> > > > corresponding XDP netlink install flags.
> > > >
> > > > The builtin XDP program functionally adds even more complexity to the
> > > > already hard to read dev_change_xdp_fd. Maybe it would be simpler to
> > > > store the program in the struct net_device together with the install
> > > > flags instead of calling the ndo_bpf multiple times?
> > >
> > > (As far as I can see from reading the code, correct me if I'm wrong.)
> > >
> > > If an AF_XDP program uses XDP_ATTACH, then it installs the
> > > builtin-program as the XDP program on the "entire" device.  That means
> > > all RX-queues will call this XDP-bpf program (indirect call), and it is
> > > actually only relevant for the specific queue_index.  Yes, the helper
> > > call does check that the 'xdp->rxq->queue_index' for an attached 'xsk'
> > > and return XDP_PASS if it is NULL:
> > >
> >
> > Yes, you are correct. The builtin XDP program, just like a regular XDP
> > program, affects the whole netdev. So, yes the non-AF_XDP queues would
> > get a performance hit from this. Just to reiterate -- this isn't new
> > for this series. This has always been the case for XDP when acting on
> > just one queue.
> >
> > > +BPF_CALL_1(bpf_xdp_xsk_redirect, struct xdp_buff *, xdp)
> > > +{
> > > +   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> > > +   struct xdp_sock *xsk;
> > > +
> > > +   xsk = READ_ONCE(xdp->rxq->dev->_rx[xdp->rxq->queue_index].xsk);
> > > +   if (xsk) {
> > > +   ri->xsk = xsk;
> > > +   return XDP_REDIRECT;
> > > +   }
> > > +
> > > +   return XDP_PASS;
> > > +}
> > >
> > > Why does every normal XDP_PASS packet have to pay this overhead (indirect
> > > call), when someone loads an AF_XDP socket program?  The AF_XDP socket
> > > is tied hard and only relevant to a specific RX-queue (which is why we
> > > get a performance boost due to SPSC queues).
> > >
> > > I acknowledge there is a need for this, but this use-case shows there
> > > is a need for attaching XDP programs per RX-queue basis.
> > >
> >
> > From my AF_XDP perspective, having a program per queue would make
> > sense. The discussion of a per-queue has been up before, and I think
> > the conclusion was that it would be too complex from a
> > configuration/tooling point-of-view. Again, for AF_XDP this would be
> > great.
>
> I really think it is about time that we introduce these per-queue XDP
> programs.  From a configuration/tooling point-of-view, your proposed
> solution have the same issues, we have to solve. (Like listing
> introspection need to show the per-queue XDP/BPF programs). And you have
> almost solved it, by keeping the per-queue program in net_device
> struct.
>

I just want to emphasize: my series is only a convenience method for
loading a (sticky/hierarchical) XDP program and a new bpf-function.
Nothing more. I'm not sure what you mean when you say that the "proposed
solution have the same issues". Do you mean XDP in general?

*I'm* all in for a per-queue XDP program -- but I might be biased due
to AF_XDP :-P Let's explore in this thread what that would look like
and 

linux-next: BUG: KASAN: use-after-free in neigh_mark_dead+0x192/0x1c0

2018-12-08 Thread Andrei Vagin
Hi David,

Our tests detected a new problem on net-next (v4.20-rc4-1241-g83af01ba1c2d).

[   92.949574] 
==
[   92.949794] BUG: KASAN: use-after-free in neigh_mark_dead+0x192/0x1c0
[   92.949950] Write of size 8 at addr 8881d9bc9438 by task kworker/u4:1/37
[   92.950108]
[   92.950254] CPU: 1 PID: 37 Comm: kworker/u4:1 Not tainted 4.20.0-rc4+ #1
[   92.950257] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[   92.950263] Workqueue: netns cleanup_net
[   92.950267] Call Trace:
[   92.950276]  dump_stack+0x5b/0x8b
[   92.950284]  print_address_description+0x6a/0x270
[   92.950290]  kasan_report+0x260/0x380
[   92.950295]  ? neigh_mark_dead+0x192/0x1c0
[   92.950301]  neigh_mark_dead+0x192/0x1c0
[   92.950307]  neigh_flush_dev+0x1ad/0x510
[   92.950318]  __neigh_ifdown+0x49/0x2a0
[   92.950327]  neigh_ifdown+0xc/0x10
[   92.950333]  fib_netdev_event+0x1f5/0x300
[   92.950342]  notifier_call_chain+0xbf/0x130
[   92.950352]  dev_close_many+0x249/0x4d0
[   92.950360]  ? __dev_close_many+0x280/0x280
[   92.950364]  ? default_device_exit_batch+0xf3/0x3b0
[   92.950374]  rollback_registered_many+0x34c/0xd00
[   92.950378]  ? default_device_exit_batch+0xf3/0x3b0
[   92.950383]  ? wait_for_completion+0x2d0/0x2d0
[   92.950386]  ? __mutex_lock+0x31f/0x1000
[   92.950391]  ? generic_xdp_install+0x3c0/0x3c0
[   92.950397]  ? default_device_exit_batch+0x1b6/0x3b0
[   92.950403]  ? find_held_lock+0x32/0x1c0
[   92.950408]  ? default_device_exit_batch+0x1b6/0x3b0
[   92.950416]  ? rtnl_is_locked+0x16/0x30
[   92.950420]  ? unregister_netdevice_queue+0x69/0x3f0
[   92.950424]  ? rollback_registered_many+0xd00/0xd00
[   92.950432]  unregister_netdevice_many.part.124+0x17/0x210
[   92.950439]  default_device_exit_batch+0x2d6/0x3b0
[   92.950444]  ? unregister_netdevice_many+0x40/0x40
[   92.950450]  ? dev_change_net_namespace+0xb90/0xb90
[   92.950459]  ? do_wait_intr_irq+0x210/0x210
[   92.950466]  ? ops_exit_list.isra.6+0x94/0x140
[   92.950473]  cleanup_net+0x309/0x650
[   92.950478]  ? process_one_work+0x8ae/0x16c0
[   92.950482]  ? net_drop_ns+0x60/0x60
[   92.950490]  ? lock_acquire+0xfe/0x290
[   92.950497]  ? strscpy+0x96/0x300
[   92.950505]  process_one_work+0x96c/0x16c0
[   92.950516]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
[   92.950520]  ? do_raw_spin_lock+0x120/0x290
[   92.950532]  worker_thread+0x87/0xe80
[   92.950541]  ? __kthread_parkme+0x82/0xf0
[   92.950546]  ? process_one_work+0x16c0/0x16c0
[   92.950550]  kthread+0x2e9/0x3a0
[   92.950554]  ? kthread_park+0x120/0x120
[   92.950560]  ret_from_fork+0x35/0x40
[   92.950572]
[   92.950713] Allocated by task 8903:
[   92.950863]  kasan_kmalloc+0xa0/0xd0
[   92.950868]  __kmalloc+0x132/0x250
[   92.950872]  ___neigh_create+0x4d6/0x18e0
[   92.950877]  ip_finish_output2+0x4e7/0xff0
[   92.950880]  ip_output+0x1c5/0x3f0
[   92.950884]  __ip_queue_xmit+0x7c9/0x1990
[   92.950888]  __tcp_transmit_skb+0x1872/0x33e0
[   92.950891]  do_tcp_setsockopt.isra.41+0x1712/0x1c70
[   92.950895]  __sys_setsockopt+0x114/0x1e0
[   92.950899]  __x64_sys_setsockopt+0xba/0x150
[   92.950904]  do_syscall_64+0x94/0x280
[   92.950908]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   92.950909]
[   92.951049] Freed by task 8936:
[   92.951200]  __kasan_slab_free+0x130/0x180
[   92.951204]  kfree+0xc2/0x1e0
[   92.951209]  rcu_process_callbacks+0x647/0xbe0
[   92.951213]  __do_softirq+0x199/0x658
[   92.951215]
[   92.951356] The buggy address belongs to the object at 8881d9bc9200
[   92.951356]  which belongs to the cache kmalloc-1k of size 1024
[   92.951530] The buggy address is located 568 bytes inside of
[   92.951530]  1024-byte region [8881d9bc9200, 8881d9bc9600)
[   92.951702] The buggy address belongs to the page:
[   92.951856] page:ea000766f200 count:1 mapcount:0
mapping:8881dac02a00 index:0x0 compound_mapcount: 0
[   92.952026] flags: 0x17fff810200(slab|head)
[   92.952185] raw: 017fff810200 dead0100 dead0200
8881dac02a00
[   92.952351] raw:  000e000e 0001

[   92.952515] page dumped because: kasan: bad access detected
[   92.952662]
[   92.952815] Memory state around the buggy address:
[   92.952981]  8881d9bc9300: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[   92.953168]  8881d9bc9380: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[   92.953348] >8881d9bc9400: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[   92.953527] ^
[   92.953688]  8881d9bc9480: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[   92.953871]  8881d9bc9500: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[   92.954048] 
==
[   92.954233] Disabling lock debugging due to kernel taint

Here are more details (config, dmesg, etc):

Re: [PATCH] net: dsa: ksz: Increase the tag alignment

2018-12-08 Thread Florian Fainelli
On 12/7/18 8:25 PM, Marek Vasut wrote:
> On 12/08/2018 01:52 AM, tristram...@microchip.com wrote:
>>> -   padlen = (skb->len >= ETH_ZLEN) ? 0 : ETH_ZLEN - skb->len;
>>> +   padlen = (skb->len >= VLAN_ETH_ZLEN) ? 0 : VLAN_ETH_ZLEN - skb->len;
>>
>> The requirement is the tail tag should be at the end of frame before FCS.
>> When the length is less than 60 the MAC controller will pad the frame to
>> legal size.  That is why this function makes sure the padding is done
>> manually.  Increasing the size just increases the length of the frame and the
>> chance to allocate new socket buffer.
>>
>> Example of using ping size of 18 will have the sizes of request and response
>> differ by 4 bytes.  Not that it matters much.
>>
>> Are you concerned the MAC controller will remove the VLAN tag and so the
>> frame will not be sent? Or the switch removes the VLAN tag and is not able
>> to send?
> 
> With TI CPSW in dual-ethernet configuration, which adds internal VLAN
> tag at the end of the frame, the KSZ switch fails. The CPU will send out
> packets and the switch will reject them as corrupted. It needs this
> extra VLAN tag padding.

Oh, so they add the internal VLAN at the end of the frame, not the
beginning? That is quite surprising, but it would not be the only
oddity found with CPSW, I am afraid. The way I would approach this is

- within the tag driver you need to make sure there is enough room to
insert the KSZ tag

- within the ethernet MAC driver (which comes last in the transmit
path), you need to make sure there is enough room to insert that trailer
VLAN tag to permit internal transmission
-- 
Florian


Re: [PATCH net-next,v5 11/12] qede: place ethtool_rx_flow_spec after code after TC flower codebase

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:40:01PM CET, pa...@netfilter.org wrote:
>This is a preparation patch to reuse the existing TC flower codebase
>from ethtool_rx_flow_spec.
>
>This patch is merely moving the core ethtool_rx_flow_spec parser after
>tc flower offload driver code so we can skip a few forward function
>declarations in the follow up patch.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko 


Re: [PATCH net-next,v5 10/12] dsa: bcm_sf2: use flow_rule infrastructure

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:40:00PM CET, pa...@netfilter.org wrote:
>Update this driver to use the flow_rule infrastructure, hence we can use
>the same code to populate hardware IR from ethtool_rx_flow and the
>cls_flower interfaces.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko 


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Andrew Lunn
> I got other concerns on the patchset though. Like how much memory is
> it 'ok' to keep mapped keeping in mind we are using the streaming
> DMA API. Are we going to affect anyone else negatively by doing so ?

For mvneta, you can expect the target to have between 512 MB and
3 GB. You can take a look at the DT files to get a better feel for this.
All of it should be low enough that it is DMA'able.

These SoCs are often used for WiFi access points, and NAS boxes.  So i
guess the big memory usage on these boxes is the block cache for a NAS
device. So if you want to see the impact of the driver using more
memory, see if disk performance is affected.

Andrew


Re: [PATCH v2 net-next 0/2] platform data controls for mdio-gpio

2018-12-08 Thread Florian Fainelli
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> Soon to be mainlined is an x86 platform with a Marvell switch, and a
> bit-banging MDIO bus. In order to make this work, the phy_mask of the
> MDIO bus needs to be set to prevent scanning for PHYs, and the
> phy_ignore_ta_mask needs to be set because the switch has broken
> turnaround.
> 
> Add a platform_data structure with these parameters.

Looks good thanks Andrew. I do wonder if we could define a common
mii_bus_platform_data structure eventually which is comprised of these
two members (if nothing else) and maybe update the common
mdiobus_register() code path to look for these members. If a subsequent
platform data/device MDIO bus shows up we could do that at that time.

Thanks!

> 
> Andrew Lunn (2):
>   net: phy: mdio-gpio: Add platform_data support for phy_mask
>   net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data
> 
>  MAINTAINERS |  1 +
>  drivers/net/phy/mdio-gpio.c |  7 +++
>  include/linux/platform_data/mdio-gpio.h | 14 ++
>  3 files changed, 22 insertions(+)
>  create mode 100644 include/linux/platform_data/mdio-gpio.h
> 


-- 
Florian


Re: [PATCH v2 net-next 2/2] net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data

2018-12-08 Thread Florian Fainelli
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> The Marvell 6390 Ethernet switch family does not perform MDIO
> turnaround correctly. Many hardware MDIO bus masters don't care about
this, but the bitbanging implementation in Linux does by default. Add
phy_ignore_ta_mask to the platform data so that the bitbanging code
> can be told which devices are known to get TA wrong.
> 
> v2
> --
> int -> u32 in platform data structure
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH v2 net-next 1/2] net: phy: mdio-gpio: Add platform_data support for phy_mask

2018-12-08 Thread Florian Fainelli
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> It is sometimes necessary to instantiate a bit-banging MDIO bus as a
> platform device, without the aid of device tree.
> 
> When device tree is being used, the bus is not scanned for devices,
> only those devices which are in device tree are probed. Without device
> tree, by default, all addresses on the bus are scanned. This may then
> find a device which is not a PHY, e.g. a switch. And the switch may
> have registers containing values which look like a PHY. So during the
> scan, a PHY device is wrongly created.
> 
> After the bus has been registered, a search is made for
> mdio_board_info structures which indicates devices on the bus, and the
> driver which should be used for them. This is typically used to
> instantiate Ethernet switches from platform drivers.  However, if the
> scanning of the bus has created a PHY device at the same location as
> indicated into the board info for a switch, the switch device is not
> created, since the address is already busy.
> 
> This can be avoided by setting the phy_mask of the mdio bus. This mask
> prevents addresses on the bus being scanned.
> 
> v2
> --
> int -> u32 in platform data structure
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Andrew Lunn
On Sat, Dec 08, 2018 at 04:12:12PM +0100, Jesper Dangaard Brouer wrote:
> On Fri, 7 Dec 2018 13:21:08 -0800
> Alexei Starovoitov  wrote:
> 
> > for production I suspect the users would want
> > an easy way to stay safe when they're playing with AF_XDP.
> > So another builtin program that redirects ssh and ping traffic
> > back to the kernel would be a nice addition.
> 
> Are you saying a builtin program that needs to parse different kinds of
> Eth-type headers (DSA, VLAN, QinQ)

Hi Jesper

Parsing DSA headers is quite hard, since most of them are not
discoverable, you need to know they are there, and where they actually
are.

I also don't think there are any use cases for actually trying to
parse them on the master interface. You are more likely to want to run
the eBPF program on the slave interface, once the DSA header has been
removed, and it is clear what interface the frame is actually for.

 Andrew


[PATCH net-next] net: dsa: Make dsa_master_set_mtu() static

2018-12-08 Thread Andrew Lunn
Add the missing static keyword.

Signed-off-by: Andrew Lunn 
---
 net/dsa/master.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/master.c b/net/dsa/master.c
index a25242e71fb2..d7d5145aa235 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -158,7 +158,7 @@ static void dsa_master_ethtool_teardown(struct net_device *dev)
cpu_dp->orig_ethtool_ops = NULL;
 }
 
-void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
+static void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
 {
unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead;
int err;
-- 
2.19.1



[PATCH net-next] net: dsa: Restore MTU on master device on unload

2018-12-08 Thread Andrew Lunn
A previous change tries to set the MTU on the master device to take
into account the DSA overheads. This patch tries to reset the master
device back to the default MTU.

Signed-off-by: Andrew Lunn 
---
 net/dsa/master.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/net/dsa/master.c b/net/dsa/master.c
index 42f525bc68e2..a25242e71fb2 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -172,6 +172,18 @@ void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
rtnl_unlock();
 }
 
+static void dsa_master_reset_mtu(struct net_device *dev)
+{
+   int err;
+
+   rtnl_lock();
+   err = dev_set_mtu(dev, ETH_DATA_LEN);
+   if (err)
+   netdev_dbg(dev,
+  "Unable to reset MTU to exclude DSA overheads\n");
+   rtnl_unlock();
+}
+
 int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp)
 {
dsa_master_set_mtu(dev,  cpu_dp);
@@ -190,6 +202,7 @@ int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp)
 void dsa_master_teardown(struct net_device *dev)
 {
dsa_master_ethtool_teardown(dev);
+   dsa_master_reset_mtu(dev);
 
dev->dsa_ptr = NULL;
 
-- 
2.19.1



[PATCH v2 net-next 0/2] platform data controls for mdio-gpio

2018-12-08 Thread Andrew Lunn
Soon to be mainlined is an x86 platform with a Marvell switch, and a
bit-banging MDIO bus. In order to make this work, the phy_mask of the
MDIO bus needs to be set to prevent scanning for PHYs, and the
phy_ignore_ta_mask needs to be set because the switch has broken
turnaround.

Add a platform_data structure with these parameters.

Andrew Lunn (2):
  net: phy: mdio-gpio: Add platform_data support for phy_mask
  net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data

 MAINTAINERS |  1 +
 drivers/net/phy/mdio-gpio.c |  7 +++
 include/linux/platform_data/mdio-gpio.h | 14 ++
 3 files changed, 22 insertions(+)
 create mode 100644 include/linux/platform_data/mdio-gpio.h

-- 
2.19.1



[PATCH v2 net-next 2/2] net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data

2018-12-08 Thread Andrew Lunn
The Marvell 6390 Ethernet switch family does not perform MDIO
turnaround correctly. Many hardware MDIO bus masters don't care about
this, but the bitbanging implementation in Linux does by default. Add
phy_ignore_ta_mask to the platform data so that the bitbanging code
can be told which devices are known to get TA wrong.

v2
--
int -> u32 in platform data structure

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/mdio-gpio.c | 4 +++-
 include/linux/platform_data/mdio-gpio.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c
index 1e296dd4067a..ea9a0e339778 100644
--- a/drivers/net/phy/mdio-gpio.c
+++ b/drivers/net/phy/mdio-gpio.c
@@ -130,8 +130,10 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
else
strncpy(new_bus->id, "gpio", MII_BUS_ID_SIZE);
 
-   if (pdata)
+   if (pdata) {
new_bus->phy_mask = pdata->phy_mask;
+   new_bus->phy_ignore_ta_mask = pdata->phy_ignore_ta_mask;
+   }
 
dev_set_drvdata(dev, new_bus);
 
diff --git a/include/linux/platform_data/mdio-gpio.h b/include/linux/platform_data/mdio-gpio.h
index a5d5ff5e174c..13874fa6e767 100644
--- a/include/linux/platform_data/mdio-gpio.h
+++ b/include/linux/platform_data/mdio-gpio.h
@@ -8,6 +8,7 @@
 
 struct mdio_gpio_platform_data {
u32 phy_mask;
+   u32 phy_ignore_ta_mask;
 };
 
 #endif /* __LINUX_MDIO_GPIO_PDATA_H */
-- 
2.19.1



[PATCH v2 net-next 1/2] net: phy: mdio-gpio: Add platform_data support for phy_mask

2018-12-08 Thread Andrew Lunn
It is sometimes necessary to instantiate a bit-banging MDIO bus as a
platform device, without the aid of device tree.

When device tree is being used, the bus is not scanned for devices,
only those devices which are in device tree are probed. Without device
tree, by default, all addresses on the bus are scanned. This may then
find a device which is not a PHY, e.g. a switch. And the switch may
have registers containing values which look like a PHY. So during the
scan, a PHY device is wrongly created.

After the bus has been registered, a search is made for
mdio_board_info structures which indicates devices on the bus, and the
driver which should be used for them. This is typically used to
instantiate Ethernet switches from platform drivers.  However, if the
scanning of the bus has created a PHY device at the same location as
indicated into the board info for a switch, the switch device is not
created, since the address is already busy.

This can be avoided by setting the phy_mask of the mdio bus. This mask
prevents addresses on the bus being scanned.

v2
--
int -> u32 in platform data structure

Signed-off-by: Andrew Lunn 
---
 MAINTAINERS |  1 +
 drivers/net/phy/mdio-gpio.c |  5 +
 include/linux/platform_data/mdio-gpio.h | 13 +
 3 files changed, 19 insertions(+)
 create mode 100644 include/linux/platform_data/mdio-gpio.h

diff --git a/MAINTAINERS b/MAINTAINERS
index fb88b6863d10..9d3b899f9ba2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5610,6 +5610,7 @@ F:include/linux/of_net.h
 F: include/linux/phy.h
 F: include/linux/phy_fixed.h
 F: include/linux/platform_data/mdio-bcm-unimac.h
+F: include/linux/platform_data/mdio-gpio.h
 F: include/trace/events/mdio.h
 F: include/uapi/linux/mdio.h
 F: include/uapi/linux/mii.h
diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c
index 0fbcedcdf6e2..1e296dd4067a 100644
--- a/drivers/net/phy/mdio-gpio.c
+++ b/drivers/net/phy/mdio-gpio.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -112,6 +113,7 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
  struct mdio_gpio_info *bitbang,
  int bus_id)
 {
+   struct mdio_gpio_platform_data *pdata = dev_get_platdata(dev);
struct mii_bus *new_bus;
 
bitbang->ctrl.ops = &mdio_gpio_ops;
@@ -128,6 +130,9 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
else
strncpy(new_bus->id, "gpio", MII_BUS_ID_SIZE);
 
+   if (pdata)
+   new_bus->phy_mask = pdata->phy_mask;
+
dev_set_drvdata(dev, new_bus);
 
return new_bus;
diff --git a/include/linux/platform_data/mdio-gpio.h b/include/linux/platform_data/mdio-gpio.h
new file mode 100644
index ..a5d5ff5e174c
--- /dev/null
+++ b/include/linux/platform_data/mdio-gpio.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * MDIO-GPIO bus platform data structure
+ */
+
+#ifndef __LINUX_MDIO_GPIO_PDATA_H
+#define __LINUX_MDIO_GPIO_PDATA_H
+
+struct mdio_gpio_platform_data {
+   u32 phy_mask;
+};
+
+#endif /* __LINUX_MDIO_GPIO_PDATA_H */
-- 
2.19.1



Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Jesper Dangaard Brouer
On Fri, 7 Dec 2018 13:21:08 -0800
Alexei Starovoitov  wrote:

> for production I suspect the users would want
> an easy way to stay safe when they're playing with AF_XDP.
> So another builtin program that redirects ssh and ping traffic
> back to the kernel would be a nice addition.

Are you saying a builtin program that needs to parse different kinds of
Eth-type headers (DSA, VLAN, QinQ) and find the TCP port, to match port
22 and return XDP_PASS, or else call the AF_XDP redirect? That seems to be
pure overhead for this fast-path builtin program for AF_XDP.

Would a solution be to install a NIC hardware filter that redirects SSH
port 22 to another RX-queue, and then have a builtin program that
returns XDP_PASS installed on that RX-queue?  And change Bjørn's
semantics, such that RX-queue programs take precedence over the global
XDP program. This would also be a good fail-safe in general for XDP.

If the RX-queues take precedence, I can use this fail-safe approach.
E.g. when I want to test my new global XDP program, I'll use ethtool to
match my management IP and send that to a specific RX-queue and my
fail-safe BPF program.
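
As a sketch of that fail-safe wiring with ethtool's ntuple filters
(interface name, queue number and management IP are placeholders, and the
queue must exist on the NIC):

```shell
# Steer SSH (TCP dst port 22) to RX queue 2, where the fail-safe
# XDP_PASS program would be installed.
ethtool -N eth0 flow-type tcp4 dst-port 22 action 2

# Likewise steer traffic for the management IP to the fail-safe queue.
ethtool -N eth0 flow-type ip4 dst-ip 192.0.2.1 action 2

# Verify the installed filters.
ethtool -n eth0
```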

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Ilias Apalodimas
Hi Eric, 
> >> This patch is changing struct sk_buff, and is thus per-definition
> >> controversial.
> >>
> >> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> >> members (flags) head_frag and pfmemalloc, And not in between
> >> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> >> Copying mem_info during skb_clone() is required.  This makes sure that
> >> pages are correctly freed or recycled during the altered
> >> skb_free_head() invocation.
> > 
> > I read this to mean that this 'info' isn't accessed/needed until skb
> > is freed.  Any reason its not added at the end?
> > 
> > This would avoid moving other fields that are probably accessed
> > more frequently during processing.
> > 
> 
> But I do not get why the patch is needed.
> 
> Adding extra cost for each skb destruction is costly.
> 
> I thought XDP was all about _not_ having skbs.

You hit the only point I don't personally like in the code, the xdp info in the
skb. Considering the benefits I am convinced this is ok, and here's why:

> Please let's do not slow down the non XDP stack only to make XDP more 
> appealing.

We are not trying to do that, on the contrary. The patchset does nothing to
speed up XDP while slowing down anything else. The patchset speeds up the
mvneta driver on the default network stack. The only change that was needed was
to adapt the driver to using the page_pool API. The speed improvements we are
seeing on specific workloads (i.e 256b < packet < 400b) are almost 3x.

Lots of high speed drivers are doing similar recycling tricks themselves (and
there's no common code, everyone is doing something similar though). All we are
trying to do is provide a unified API to make that easier for the rest. Another
advantage is that if some drivers switch to the API, adding XDP
functionality to them is pretty trivial.

Moreover, our tests are only performed on systems without an SMMU, or with the
SMMU disabled. On a platform I have access to, enabling and disabling the SMMU
has some performance impact. By keeping the buffers mapped we believe this
impact will be substantially less (I'll come back with results once I have
them on this).

I do understand your point, but in my head the potential advantages
outweigh that by a lot.

I have other concerns on the patchset though, like how much memory it is 'ok'
to keep mapped, keeping in mind we are using the streaming DMA API. Are we
going to affect anyone else negatively by doing so?

Thanks
/Ilias


Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Jesper Dangaard Brouer
On Fri, 7 Dec 2018 15:01:55 +0100
Björn Töpel  wrote:

On Fri, 7 Dec 2018 at 14:42, Jesper Dangaard Brouer wrote:
> >
> > On Fri,  7 Dec 2018 12:44:24 +0100
> > Björn Töpel  wrote:
> >  
> > > The rationale behind attach is performance and ease of use. Many XDP
> > > socket users just need a simple way of creating/binding a socket and
> > > receiving frames right away without loading an XDP program.
> > >
> > > XDP_ATTACH adds a mechanism we call "builtin XDP program" that simply
> > > is a kernel provided XDP program that is installed to the netdev when
> > > XDP_ATTACH is being passed as a bind() flag.
> > >
> > > The builtin program is the simplest program possible to redirect a
> > > frame to an attached socket. In restricted C it would look like this:
> > >
> > >   SEC("xdp")
> > >   int xdp_prog(struct xdp_md *ctx)
> > >   {
> > > return bpf_xsk_redirect(ctx);
> > >   }
> > >
> > > The builtin program loaded via XDP_ATTACH behaves, from an
> > > install-to-netdev/uninstall-from-netdev point of view, differently
> > > from regular XDP programs. The easiest way to look at it is as a
> > > 2-level hierarchy, where regular XDP programs has precedence over the
> > > builtin one.
> > >
> > > If no regular XDP program is installed to the netdev, the builtin will
> > > be install. If the builtin program is installed, and a regular is
> > > installed, regular XDP program will have precedence over the builtin
> > > one.
> > >
> > > Further, if a regular program is installed, and later removed, the
> > > builtin one will automatically be installed.
> > >
> > > The sxdp_flags field of struct sockaddr_xdp gets two new options
> > > XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which maps to the
> > > corresponding XDP netlink install flags.
> > >
> > > The builtin XDP program functionally adds even more complexity to the
> > > already hard to read dev_change_xdp_fd. Maybe it would be simpler to
> > > store the program in the struct net_device together with the install
> > > flags instead of calling the ndo_bpf multiple times?  
> >
> > (As far as I can see from reading the code, correct me if I'm wrong.)
> >
> > If an AF_XDP program uses XDP_ATTACH, then it installs the
> > builtin-program as the XDP program on the "entire" device.  That means
> > all RX-queues will call this XDP-bpf program (indirect call), and it is
> > actually only relevant for the specific queue_index.  Yes, the helper
> > call does check 'xdp->rxq->queue_index' for an attached 'xsk'
> > and returns XDP_PASS if it is NULL:
> >  
> 
> Yes, you are correct. The builtin XDP program, just like a regular XDP
> program, affects the whole netdev. So, yes the non-AF_XDP queues would
> get a performance hit from this. Just to reiterate -- this isn't new
> for this series. This has always been the case for XDP when acting on
> just one queue.
> 
> > +BPF_CALL_1(bpf_xdp_xsk_redirect, struct xdp_buff *, xdp)
> > +{
> > +   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> > +   struct xdp_sock *xsk;
> > +
> > +   xsk = READ_ONCE(xdp->rxq->dev->_rx[xdp->rxq->queue_index].xsk);
> > +   if (xsk) {
> > +   ri->xsk = xsk;
> > +   return XDP_REDIRECT;
> > +   }
> > +
> > +   return XDP_PASS;
> > +}
> >
> > Why does every normal XDP_PASS packet have to pay this overhead (indirect
> > call), when someone loads an AF_XDP socket program?  The AF_XDP socket
> > is tied hard and only relevant to a specific RX-queue (which is why we
> > get a performance boost due to SPSC queues).
> >
> > I acknowledge there is a need for this, but this use-case shows there
> > is a need for attaching XDP programs per RX-queue basis.
> >  
> 
> From my AF_XDP perspective, having a program per queue would make
> sense. The discussion of a per-queue has been up before, and I think
> the conclusion was that it would be too complex from a
> configuration/tooling point-of-view. Again, for AF_XDP this would be
> great.

I really think it is about time that we introduce these per-queue XDP
programs.  From a configuration/tooling point-of-view, your proposed
solution has the same issues we have to solve (like listing
introspection needs to show the per-queue XDP/BPF programs). And you have
almost solved it, by keeping the per-queue program in the net_device
struct.

Alexei already requested more types of builtin programs.  But as the
XDP-driver API can only load one XDP program for the entire NIC, how can you
use two builtin programs at the same time?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Jesper Dangaard Brouer
On Sat, 8 Dec 2018 04:29:17 -0800
Eric Dumazet  wrote:

> On 12/08/2018 01:57 AM, Florian Westphal wrote:
> > Jesper Dangaard Brouer  wrote:  
> >> From: Ilias Apalodimas 
> >>
> >> This patch is changing struct sk_buff, and is thus per-definition
> >> controversial.
> >>
> >> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> >> members (flags) head_frag and pfmemalloc, And not in between
> >> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> >> Copying mem_info during skb_clone() is required.  This makes sure that
> >> pages are correctly freed or recycled during the altered
> >> skb_free_head() invocation.  
> > 
> > I read this to mean that this 'info' isn't accessed/needed until skb
> > is freed.  Any reason its not added at the end?
> > 
> > This would avoid moving other fields that are probably accessed
> > more frequently during processing.
> >   
> 
> But I do not get why the patch is needed.
> 
> Adding extra cost for each skb destruction is costly.
> 
> I thought XDP was all about _not_ having skbs.
> 
> Please let's do not slow down the non XDP stack only to make XDP more
> appealing.

In general this work is about better cooperation between XDP and
netstack.  The patch is needed because the page_pool keeps pages
DMA mapped and needs a return hook.  I doubt that the extra
compare-and-branch will affect your use-case on super-scalar CPUs.  We
(Ilias and I) are actually testing this stuff on low-end ARM64 in-order
execution CPUs, which is actually nice as performance effects of our
code changes are not hidden by speculative execution units.  I'm
enforcing (and Ilias agrees) that we do benchmark-driven development.
I actually invite people to monitor our progress here[1].  So, trust
me, I am as concerned as you about any performance regression, and am
vigilantly measuring this stuff.  (This is more than you can say about
a lot of the other stuff that gets accepted on this list).


[1] https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/
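The recycle decision being discussed can be reduced to a small userspace
sketch.  This is NOT the kernel implementation — all names (mem_info,
page_pool, buf_alloc/buf_free) are illustrative stand-ins — but it shows
the single compare-and-branch the free path gains: consult the tag and
either return the buffer to its pool (keeping the DMA mapping, in the
real driver case) or fall back to a normal free.

```c
/* Userspace sketch of the recycle decision at skb_free_head():
 * a buffer tagged as pool-owned is returned to its pool instead of
 * the allocator.  Names mirror the patch but are stand-ins only.
 */
#include <assert.h>
#include <stdlib.h>

enum mem_type { MEM_TYPE_PAGE_ORDER0, MEM_TYPE_PAGE_POOL };

struct page_pool {
	void *cache[16];	/* tiny recycle cache */
	int count;
};

struct mem_info {
	enum mem_type type;
	struct page_pool *pool;
};

static void *buf_alloc(struct page_pool *pp)
{
	if (pp->count)
		return pp->cache[--pp->count];	/* recycled: no new "mapping" */
	return malloc(4096);
}

/* The one extra compare-and-branch on the free path discussed above. */
static void buf_free(struct mem_info *mi, void *buf)
{
	if (mi->type == MEM_TYPE_PAGE_POOL && mi->pool->count < 16)
		mi->pool->cache[mi->pool->count++] = buf;	/* recycle */
	else
		free(buf);					/* normal release */
}
```

On an in-order CPU this branch is exactly the kind of cost the
benchmarks above are meant to catch.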
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next,v5 09/12] ethtool: add ethtool_rx_flow_spec to flow_rule structure translator

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:39:59PM CET, pa...@netfilter.org wrote:
>This patch adds a function to translate the ethtool_rx_flow_spec
>structure to the flow_rule representation.
>
>This allows us to reuse code from the driver side given that both flower
>and ethtool_rx_flow interfaces use the same representation.
>
>This patch also includes support for the flow type flags FLOW_EXT,
>FLOW_MAC_EXT and FLOW_RSS.
>
>The ethtool_rx_flow_spec_input wrapper structure is used to convey the
>rss_context field, that is away from the ethtool_rx_flow_spec structure,
>and the ethtool_rx_flow_spec structure.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko 


Re: [PATCH net-next,v5 08/12] flow_offload: add wake-up-on-lan and queue to flow_action

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:39:58PM CET, pa...@netfilter.org wrote:
>These actions need to be added to support the ethtool_rx_flow interface.
>The queue action includes a field to specify the RSS context, that is
>set via FLOW_RSS flow type flag and the rss_context field in struct
>ethtool_rxnfc, plus the corresponding queue index. FLOW_RSS implies that
>rss_context is non-zero, therefore, queue.ctx == 0 means that FLOW_RSS
>was not set.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko 


Re: [PATCH net-next,v5 06/12] drivers: net: use flow action infrastructure

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:39:56PM CET, pa...@netfilter.org wrote:
>This patch updates drivers to use the new flow action infrastructure.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko 


Re: [PATCH net-next,v5 07/12] cls_flower: don't expose TC actions to drivers anymore

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:39:57PM CET, pa...@netfilter.org wrote:
>Now that drivers have been converted to use the flow action
>infrastructure, remove this field from the tc_cls_flower_offload
>structure.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko 


Re: [PATCH net-next,v5 05/12] flow_offload: add statistics retrieval infrastructure and use it

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:39:55PM CET, pa...@netfilter.org wrote:
>This patch provides the flow_stats structure that acts as container for
>tc_cls_flower_offload, then we can use to restore the statistics on the
>existing TC actions. Hence, tcf_exts_stats_update() is not used from
>drivers anymore.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko 


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Eric Dumazet



On 12/08/2018 04:29 AM, Eric Dumazet wrote:
> 

> But I do not get why the patch is needed.
> 
> Adding extra cost for each skb destruction is costly.
> 
> I thought XDP was all about _not_ having skbs.
> 
> Please let's not slow down the non-XDP stack only to make XDP more 
> appealing.
> 

I have a similar concern with napi_consume_skb():

This added a cost for locally generated TCP traffic, since most TX packets are 
fast clones.




Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Eric Dumazet



On 12/08/2018 01:57 AM, Florian Westphal wrote:
> Jesper Dangaard Brouer  wrote:
>> From: Ilias Apalodimas 
>>
>> This patch is changing struct sk_buff, and is thus per-definition
>> controversial.
>>
>> Place a new member 'mem_info' of type struct xdp_mem_info, just after
>> members (flags) head_frag and pfmemalloc, and not in between
>> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
>> Copying mem_info during skb_clone() is required.  This makes sure that
>> pages are correctly freed or recycled during the altered
>> skb_free_head() invocation.
> 
> I read this to mean that this 'info' isn't accessed/needed until skb
> is freed.  Any reason its not added at the end?
> 
> This would avoid moving other fields that are probably accessed
> more frequently during processing.
> 

But I do not get why the patch is needed.

Adding extra cost for each skb destruction is costly.

I thought XDP was all about _not_ having skbs.

Please let's not slow down the non-XDP stack only to make XDP more appealing.


Re: [PATCH 5/5] net: dsa: ksz: Add Microchip KSZ8795 DSA driver

2018-12-08 Thread Andrew Lunn
> +static int ksz8795_valid_dyn_entry(struct ksz_device *dev, u8 *data)
> +{
> + int timeout = 100;
> +
> + do {
> + ksz_read8(dev, REG_IND_DATA_CHECK, data);
> + timeout--;
> + } while ((*data & DYNAMIC_MAC_TABLE_NOT_READY) && timeout);

readx_poll_timeout()?
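The real helper lives in include/linux/iopoll.h and takes microsecond
sleep/timeout budgets; the following is only a simplified userspace
rendition of the shape the suggestion implies — bounded by iteration
count, with a fake register standing in for ksz_read8().

```c
/* Simplified userspace rendition of the readx_poll_timeout() shape:
 * poll an accessor until a condition holds or the budget runs out.
 * Not the kernel macro -- an illustrative sketch only.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define poll_timeout(op, arg, val, cond, max_tries)		\
({								\
	int __tries = (max_tries);				\
	int __ret = -1;	/* stands in for -ETIMEDOUT */		\
	while (__tries--) {					\
		(val) = (op)(arg);				\
		if (cond) {					\
			__ret = 0;				\
			break;					\
		}						\
	}							\
	__ret;							\
})

/* Fake "register" whose NOT_READY bit clears on the third read. */
static int reads;
static uint8_t fake_read8(void *unused)
{
	(void)unused;
	return ++reads >= 3 ? 0x00 : 0x80;	/* 0x80 = NOT_READY */
}
```

The caller then gets a proper error code on timeout instead of silently
falling out of a do/while with a stale counter.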

> +static inline void ksz8795_from_vlan(u16 vlan, u8 *fid, u8 *member, u8 *valid)

Please don't use inline in C code, just in headers. Leave the compiler
to decide if it should be inlined.

> +static void ksz8795_r_vlan_table(struct ksz_device *dev, u16 vid, u16 *vlan)
> +{
> + u64 buf;
> + u16 *data = (u16 *)&buf;
> + u16 addr;
> + int index;

The networking code uses reverse christmas tree. So you need to change
the order of these declarations, and do the assignment in the body of
the function. Please review all the functions.
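As an illustration of the requested style — a made-up placeholder body,
not the actual KSZ8795 register access — the declarations go longest
line first and the assignment moves into the function body:

```c
/* Illustration of "reverse christmas tree" ordering: declarations
 * sorted longest-first, assignments moved into the body.  The body
 * is a placeholder, not real driver code.
 */
#include <assert.h>

typedef unsigned short u16;
typedef unsigned long long u64;

static u16 example_r_vlan_table(u16 vid)
{
	u16 *data;
	int index;
	u16 addr;
	u64 buf;

	buf = 0;
	data = (u16 *)&buf;	/* was: u16 *data = (u16 *)&buf; in the decls */
	addr = vid / 4;
	index = vid & 3;
	return (u16)(data[index] + addr);
}
```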

> +static const u8 stp_multicast_addr[] = {
> + 0x01, 0x80, 0xC2, 0x00, 0x00, 0x00
> +};

include/linux/etherdevice.h defines eth_stp_addr.

Andrew


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Jesper Dangaard Brouer
On Sat, 8 Dec 2018 10:57:58 +0100
Florian Westphal  wrote:

> Jesper Dangaard Brouer  wrote:
> > From: Ilias Apalodimas 
> > 
> > This patch is changing struct sk_buff, and is thus per-definition
> > controversial.
> > 
> > Place a new member 'mem_info' of type struct xdp_mem_info, just after
> > members (flags) head_frag and pfmemalloc, and not in between
> > headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> > Copying mem_info during skb_clone() is required.  This makes sure that
> > pages are correctly freed or recycled during the altered
> > skb_free_head() invocation.  
> 
> I read this to mean that this 'info' isn't accessed/needed until skb
> is freed.  Any reason its not added at the end?
>
> This would avoid moving other fields that are probably accessed
> more frequently during processing.

It is placed here because it is close to skb->head_frag, which is
also used when the SKB is freed.  That placement was chosen both
because I think we can move the skb->head_frag bit into mem_info, and
due to cacheline accesses (e.g. skb->cloned and the destructor
pointer are also read, and they sit on that cache-line).

The annoying part is actually that depending on the kernel config
options CONFIG_XFRM, CONFIG_NF_CONNTRACK and CONFIG_BRIDGE_NETFILTER,
whether there is a cache-line split, where mem_info gets moved into the
next cacheline.

I do like the idea of moving it to the end, iff we also move
skb->head_frag into mem_info, as skb->head (on the end cacheline) is
also accessed during free.  But given that we still need to read the
cache-line containing skb->{cloned,destructor}, I don't think it will
be a win.  I think the skb->cloned value is read during the lifetime
of the SKB.
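The cache-line arithmetic behind this concern can be checked
mechanically with offsetof().  The struct below is a stand-in, not the
real sk_buff layout, but it shows how an 8-byte field placed near byte
56 straddles the 64-byte boundary — exactly the config-dependent split
described above:

```c
/* Which 64-byte cache line does a field land on?  The struct is a
 * stand-in for illustration; it is NOT the real sk_buff layout.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fake_skb {
	char     headers[56];	/* fields ahead of the flags byte */
	uint8_t  head_frag_etc;	/* flags, near the end of line 0 */
	uint32_t mem_info[2];	/* candidate placement */
	char     more[60];
	void    *head;
};

#define CACHELINE_OF(field) (offsetof(struct fake_skb, field) / 64)
#define FIELD_END(field) \
	(offsetof(struct fake_skb, field) + \
	 sizeof(((struct fake_skb *)0)->field) - 1)
```

A static check like this in the patch (e.g. a BUILD_BUG_ON in the
kernel) would make the config-dependent split visible at compile time.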

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH 3/5] net: dsa: ksz: Factor out common tag code

2018-12-08 Thread Andrew Lunn
On Fri, Dec 07, 2018 at 07:18:43PM +0100, Marek Vasut wrote:
> From: Tristram Ha 
> 
Factor out common code from tag_ksz.c, so that the code can be used
with other KSZ family switches which use differently sized tags.

I prefer this implementation over what Tristram recently submitted. It
is also what we suggested a while back. However, since then we have
had Spectra/meltdown, and we now know a function call through a
pointer is expensive. This is the hot path, every frame comes through
here, so it is worth taking the time to optimize this. Could you try
to remove the ksz_tag_ops structure. xmit looks simple, since it is a
tail call, so you can do that in ksz9477_xmit. Receive looks a bit
more complex.

I also think for the moment we need it ignore PTP until we have the
big picture sorted out. If need be, the code can be refactored yet
again. But i don't want PTP holding up getting more switches
supported.

> Signed-off-by: Tristram Ha 
> Signed-off-by: Marek Vasut 
> Cc: Vivien Didelot 
> Cc: Woojung Huh 
> Cc: David S. Miller 
> ---
>  net/dsa/tag_ksz.c | 125 +-
>  1 file changed, 90 insertions(+), 35 deletions(-)
> 
> diff --git a/net/dsa/tag_ksz.c b/net/dsa/tag_ksz.c
> index 036bc62198f28..d94bad1ab7e53 100644
> --- a/net/dsa/tag_ksz.c
> +++ b/net/dsa/tag_ksz.c
> @@ -14,34 +14,30 @@
>  #include 
>  #include "dsa_priv.h"
>  
> -/* For Ingress (Host -> KSZ), 2 bytes are added before FCS.
> - * ---
> - * DA(6bytes)|SA(6bytes)||Data(nbytes)|tag0(1byte)|tag1(1byte)|FCS(4bytes)
> - * ---
> - * tag0 : Prioritization (not used now)
> - * tag1 : each bit represents port (eg, 0x01=port1, 0x02=port2, 0x10=port5)
> - *
> - * For Egress (KSZ -> Host), 1 byte is added before FCS.
> - * ---
> - * DA(6bytes)|SA(6bytes)||Data(nbytes)|tag0(1byte)|FCS(4bytes)
> - * ---
> - * tag0 : zero-based value represents port
> - * (eg, 0x00=port1, 0x02=port3, 0x06=port7)
> - */
> +struct ksz_tag_ops {
> + void(*get_tag)(u8 *tag, unsigned int *port, unsigned int *len);
> + void(*set_tag)(struct sk_buff *skb, struct net_device *dev);
> +};
>  
> -#define  KSZ_INGRESS_TAG_LEN 2
> -#define  KSZ_EGRESS_TAG_LEN  1
> +/* Typically only one byte is used for tail tag. */
> +#define KSZ_EGRESS_TAG_LEN   1
>  
> -static struct sk_buff *ksz_xmit(struct sk_buff *skb, struct net_device *dev)
> +/* Frames with the following address may need to be sent even when the port is
> + * closed.
> + */
> +static const u8 special_mult_addr[] = {
> + 0x01, 0x80, 0xC2, 0x00, 0x00, 0x00
> +};
> +

This is new functionality, not part of the refactoring the
code. Please add that as a new patch. Also, you can probably make use
of is_link_local_ether_addr().
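For reference, the check that helper performs (see
include/linux/etherdevice.h) boils down to matching the reserved
01:80:C2:00:00:00-0F block used by STP, LLDP and other link-local
protocols.  A byte-wise userspace sketch — not the kernel's optimized
16-bit-compare version:

```c
/* Userspace sketch of what is_link_local_ether_addr() tests: the
 * reserved 01:80:C2:00:00:0X multicast block.  Illustrative only.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool link_local_ether_addr(const uint8_t a[6])
{
	return a[0] == 0x01 && a[1] == 0x80 && a[2] == 0xc2 &&
	       a[3] == 0x00 && a[4] == 0x00 && (a[5] & 0xf0) == 0x00;
}
```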

   Andrew


[PATCH net-next] ip: silence udp zerocopy smatch false positive

2018-12-08 Thread Willem de Bruijn
From: Willem de Bruijn 

extra_uref is used in __ip(6)_append_data only if uarg is set.

Smatch sees that the variable is passed to sock_zerocopy_put_abort.
This function accesses it only when uarg is set, but smatch cannot
infer this.

Make this dependency explicit.

Fixes: 52900d22288e ("udp: elide zerocopy operation in hot path")
Signed-off-by: Willem de Bruijn 
---
 net/ipv4/ip_output.c  | 3 ++-
 net/ipv6/ip6_output.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 78f028bdad30b..ab6618036afed 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1130,7 +1130,8 @@ static int __ip_append_data(struct sock *sk,
 error_efault:
err = -EFAULT;
 error:
-   sock_zerocopy_put_abort(uarg, extra_uref);
+   if (uarg)
+   sock_zerocopy_put_abort(uarg, extra_uref);
cork->length -= length;
IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6592a39e5ec10..be25b34dc5c2f 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1575,7 +1575,8 @@ static int __ip6_append_data(struct sock *sk,
 error_efault:
err = -EFAULT;
 error:
-   sock_zerocopy_put_abort(uarg, extra_uref);
+   if (uarg)
+   sock_zerocopy_put_abort(uarg, extra_uref);
cork->length -= length;
IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
-- 
2.20.0.rc2.403.gdbc3b29805-goog
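The pattern smatch trips over can be reduced to a small sketch.  The
callee already accesses the variable only when the pointer is set, so
the added call-site guard changes nothing at runtime — it merely makes
the dependency visible to a checker that cannot prove the invariant.
All names below are illustrative, not the kernel functions.

```c
/* Reduced sketch of the commit's fix: 'extra' is only meaningful when
 * 'uarg' is non-NULL; the explicit guard at the call site encodes the
 * invariant a static checker cannot infer on its own.
 */
#include <assert.h>
#include <stddef.h>

static int aborted;

static void put_abort(const int *uarg, int extra)
{
	if (uarg)			/* the invariant smatch can't see */
		aborted += *uarg + extra;
}

static void error_path(const int *uarg, int extra)
{
	if (uarg)			/* explicit guard from the patch */
		put_abort(uarg, extra);
}
```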



Re: [PATCH 1/5] net: dsa: ksz: Add MIB counter reading support

2018-12-08 Thread Andrew Lunn
> +static void ksz9477_phy_setup(struct ksz_device *dev, int port,
> +   struct phy_device *phy)
> +{
> + if (port < dev->phy_port_cnt) {
> + /* SUPPORTED_Asym_Pause and SUPPORTED_Pause can be removed to
> +  * disable flow control when rate limiting is used.
> +  */
> + phy->advertising = phy->supported;
> + }
> +}
> +

This has nothing to do with MIB counters.

Other than that, is this the same as Tristram's recent submission?
Maybe once the comments have been addressed, we can merge that
version?

Thanks
Andrew


Re: [PATCH V2] net: dsa: ksz: Add reset GPIO handling

2018-12-08 Thread Andrew Lunn
> This actually is an individual patch, it doesn't depend on anything.
> Or do you mean a series with the DT documentation change ?

Yes, i mean together with the DT documentation change. Those two
belong together, they are one functional change.

Part of this also has to do with scalability. It takes less effort to
merge one patchset of two patches than two individual patches. The
truth is, developer time is cheap, maintainer time is expensive, so
the process is optimized towards making the maintainers life easy.

So sometimes you do combine orthogonal changes together into one
patchset, if there is a high purpose, eg. adding support for a new
device on a new board. However, given the situation of two overlapping
patchsets, it might be better to submit smaller patchsets.

   Andrew


Re: [PATCH] net: dsa: ksz: Fix port membership

2018-12-08 Thread Andrew Lunn
On Sat, Dec 08, 2018 at 05:23:25AM +0100, Marek Vasut wrote:
> On 12/08/2018 01:13 AM, tristram...@microchip.com wrote:
> >> Do you have a git tree with all the KSZ patches based on -next
> >> somewhere, so I don't have to look for them in random MLs ?
> > 
> > I just sent it this Monday and the subject for that patch is
> > "[PATCH RFC 6/6] net: dsa: microchip: Add switch offload forwarding 
> > support."
> 
> Is all that collected in some git tree somewhere, so I don't have to
> look for various patches in varying states of decay throughout the ML?

Hi Marek, Tristram

It is unfortunate that we have two patchsets being submitted at the
same time, with overlapping functionality. Some degree of cooperation
is needed to handle this.

Could i ask you both to publish a git tree of your patches. And then
figure out how to submit them. Maybe as smaller sets, with the easy,
non-overlapping bits first?

Andrew


RE: [PATCH bpf-next 7/7] samples: bpf: add support for XDP_ATTACH to xdpsock

2018-12-08 Thread Zhang, Qi Z
Hi:

Seems there are only 6 patches of the patch set in patchwork

https://patchwork.ozlabs.org/project/netdev/list/?submitter=70569
https://patchwork.ozlabs.org/project/netdev/list/?series=80389

Anyone can help to check why patch 7/7 is missing?

Thanks
Qi

> -Original Message-
> From: Björn Töpel [mailto:bjorn.to...@gmail.com]
> Sent: Friday, December 7, 2018 7:45 PM
> To: bjorn.to...@gmail.com; Karlsson, Magnus ;
> magnus.karls...@gmail.com; a...@kernel.org; dan...@iogearbox.net;
> netdev@vger.kernel.org
> Cc: Topel, Bjorn ; bro...@redhat.com;
> u9012...@gmail.com; Zhang, Qi Z 
> Subject: [PATCH bpf-next 7/7] samples: bpf: add support for XDP_ATTACH to
> xdpsock
> 
> From: Björn Töpel 
> 
> Teach the sample xdpsock application about XDP_ATTACH.
> 
> Signed-off-by: Björn Töpel 
> ---
>  samples/bpf/xdpsock_user.c | 108 +++--
>  1 file changed, 67 insertions(+), 41 deletions(-)
> 
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c index
> 57ecadc58403..12f908b60468 100644
> --- a/samples/bpf/xdpsock_user.c
> +++ b/samples/bpf/xdpsock_user.c
> @@ -633,7 +633,8 @@ static void int_exit(int sig)  {
>   (void)sig;
>   dump_stats();
> - bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
> + if (!(opt_xdp_bind_flags & XDP_ATTACH))
> + bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
>   exit(EXIT_SUCCESS);
>  }
> 
> @@ -650,6 +651,7 @@ static struct option long_options[] = {
>   {"interval", required_argument, 0, 'n'},
>   {"zero-copy", no_argument, 0, 'z'},
>   {"copy", no_argument, 0, 'c'},
> + {"attach", no_argument, 0, 'a'},
>   {0, 0, 0, 0}
>  };
> 
> @@ -670,6 +672,7 @@ static void usage(const char *prog)
>   "  -n, --interval=n Specify statistics update interval 
> (default 1
> sec).\n"
>   "  -z, --zero-copy  Force zero-copy mode.\n"
>   "  -c, --copy   Force copy mode.\n"
> + "  -a, --attach XDP_ATTACH mode.\n"
>   "\n";
>   fprintf(stderr, str, prog);
>   exit(EXIT_FAILURE);
> @@ -682,7 +685,7 @@ static void parse_command_line(int argc, char **argv)
>   opterr = 0;
> 
>   for (;;) {
> - c = getopt_long(argc, argv, "rtli:q:psSNn:cz", long_options,
> + c = getopt_long(argc, argv, "rtli:q:psSNn:cza", long_options,
> &option_index);
>   if (c == -1)
>   break;
> @@ -725,11 +728,22 @@ static void parse_command_line(int argc, char
> **argv)
>   case 'c':
>   opt_xdp_bind_flags |= XDP_COPY;
>   break;
> + case 'a':
> + opt_xdp_bind_flags |= XDP_ATTACH;
> + break;
>   default:
>   usage(basename(argv[0]));
>   }
>   }
> 
> + if (opt_xdp_bind_flags & XDP_ATTACH) {
> + if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
> + opt_xdp_bind_flags |= XDP_BUILTIN_DRV_MODE;
> + if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
> + opt_xdp_bind_flags |= XDP_BUILTIN_SKB_MODE;
> +
> + }
> +
>   opt_ifindex = if_nametoindex(opt_if);
>   if (!opt_ifindex) {
>   fprintf(stderr, "ERROR: interface \"%s\" does not exist\n", @@
> -903,7 +917,7 @@ int main(int argc, char **argv)
>   struct bpf_prog_load_attr prog_load_attr = {
>   .prog_type  = BPF_PROG_TYPE_XDP,
>   };
> - int prog_fd, qidconf_map, xsks_map;
> + int prog_fd, qidconf_map, xsks_map = -1;
>   struct bpf_object *obj;
>   char xdp_filename[256];
>   struct bpf_map *map;
> @@ -918,59 +932,71 @@ int main(int argc, char **argv)
>   exit(EXIT_FAILURE);
>   }
> 
> - snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
> - prog_load_attr.file = xdp_filename;
> + if (!(opt_xdp_bind_flags & XDP_ATTACH)) {
> + snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o",
> +  argv[0]);
> + prog_load_attr.file = xdp_filename;
> 
> - if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> - exit(EXIT_FAILURE);
> - if (prog_fd < 0) {
> - fprintf(stderr, "ERROR: no program found: %s\n",
> - strerror(prog_fd));
> - exit(EXIT_FAILURE);
> - }
> + if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> + exit(EXIT_FAILURE);
> + if (prog_fd < 0) {
> + fprintf(stderr, "ERROR: no program found: %s\n",
> + strerror(prog_fd));
> + exit(EXIT_FAILURE);
> + }
> 
> - map = bpf_object__find_map_by_name(obj, "qidconf_map");
> - qidconf_map = bpf_map__fd(map);
> - if (qidconf_map < 0) {
> - fprintf(stderr, "ERROR: no qidconf map found: %s\n",
> - 

Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Florian Westphal
Jesper Dangaard Brouer  wrote:
> From: Ilias Apalodimas 
> 
> This patch is changing struct sk_buff, and is thus per-definition
> controversial.
> 
> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> members (flags) head_frag and pfmemalloc, and not in between
> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> Copying mem_info during skb_clone() is required.  This makes sure that
> pages are correctly freed or recycled during the altered
> skb_free_head() invocation.

I read this to mean that this 'info' isn't accessed/needed until skb
is freed.  Any reason its not added at the end?

This would avoid moving other fields that are probably accessed
more frequently during processing.


Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Björn Töpel
Den fre 7 dec. 2018 kl 22:21 skrev Alexei Starovoitov
:
>
> On Fri, Dec 07, 2018 at 12:44:24PM +0100, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > Hi!
> >
> > This patch set adds support for a new XDP socket bind option,
> > XDP_ATTACH.
> >
> > The rationale behind attach is performance and ease of use. Many XDP
> > socket users just need a simple way of creating/binding a socket and
> > receiving frames right away without loading an XDP program.
> >
> > XDP_ATTACH adds a mechanism we call "builtin XDP program" that simply
> > is a kernel provided XDP program that is installed to the netdev when
> > XDP_ATTACH is being passed as a bind() flag.
> >
> > The builtin program is the simplest program possible to redirect a
> > frame to an attached socket. In restricted C it would look like this:
> >
> >   SEC("xdp")
> >   int xdp_prog(struct xdp_md *ctx)
> >   {
> > return bpf_xsk_redirect(ctx);
> >   }
> >
> > The builtin program loaded via XDP_ATTACH behaves, from an
> > install-to-netdev/uninstall-from-netdev point of view, differently
> > from regular XDP programs. The easiest way to look at it is as a
> > 2-level hierarchy, where regular XDP programs has precedence over the
> > builtin one.
>
> The feature makes sense to me.
> May be XDP_ATTACH_BUILTIN would be a better name ?

Yes, agree, or maybe XDP_BUILTIN_ATTACH? Regardless, I'll change the
name for the next revision.

> Also I think it needs another parameter to say which builtin
> program to use.

Yup, I had a plan to add the parameter when it's actually more than
*one* builtin, but you're right, let's do it right away. I'll add a
builtin prog enum field to the struct sockaddr_xdp.

> This unconditional xsk_redirect is fine for performance
> benchmarking, but for production I suspect the users would want
> an easy way to stay safe when they're playing with AF_XDP.

For setups that don't direct the flows explicitly by HW filters,  yes!

> So another builtin program that redirects ssh and ping traffic
> back to the kernel would be a nice addition.
>

I suspect AF_XDP users would prefer redirecting packets to the kernel
via the CPUMAP instead of XDP_PASS -- not paying for the ipstack on
the AF_XDP core. Another builtin would be a tcpdump-like behavior, but
that would require an XDP clone (which Magnus is actually
experimenting with!).

I'll address your input and get back with a new revision. Thanks for
spending time on the series!


Björn


Re: [PATCH net-next,v5 04/12] cls_api: add translator to flow_action representation

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:39:54PM CET, pa...@netfilter.org wrote:
>This patch implements a new function to translate from native TC action
>to the new flow_action representation. Moreover, this patch also updates
>cls_flower to use this new function.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko 


Re: [PATCH net-next,v5 03/12] flow_offload: add flow action infrastructure

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:39:53PM CET, pa...@netfilter.org wrote:
>This new infrastructure defines the NIC actions that you can perform
>from existing network drivers. This infrastructure allows us to avoid a
>direct dependency with the native software TC action representation.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko 


Re: [PATCH net-next,v5 02/12] net/mlx5e: support for two independent packet edit actions

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:39:52PM CET, pa...@netfilter.org wrote:
>This patch adds pedit_headers_action structure to store the result of
>parsing tc pedit actions. Then, it calls alloc_tc_pedit_action() to
>populate the mlx5e hardware intermediate representation once all actions
>have been parsed.
>
>This patch comes in preparation for the new flow_action infrastructure,
>where each packet mangling comes in an separated action, ie. not packed
>as in tc pedit.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko 


Re: [PATCH net-next,v5 01/12] flow_offload: add flow_rule and flow_match structures and use them

2018-12-08 Thread Jiri Pirko
Thu, Dec 06, 2018 at 11:39:51PM CET, pa...@netfilter.org wrote:
>This patch wraps the dissector key and mask - that flower uses to
>represent the matching side - around the flow_match structure.
>
>To avoid a follow up patch that would edit the same LoCs in the drivers,
>this patch also wraps this new flow match structure around the flow rule
>object. This new structure will also contain the flow actions in follow
>up patches.
>
>This introduces two new interfaces:
>
>   bool flow_rule_match_key(rule, dissector_id)
>
>that returns true if a given matching key is set, and:
>
>   flow_rule_match_XYZ(rule, &match);
>
>To fetch the matching side XYZ into the match container structure, and
>retrieve the key and the mask with a single call.
>
>Signed-off-by: Pablo Neira Ayuso 

Acked-by: Jiri Pirko