Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On 12/08/2018 12:14 PM, Ilias Apalodimas wrote: > On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote: >>> >>> I want to make sure you guys thought about splice() stuff, and >>> skb_try_coalesce(), and GRO, and skb cloning, and ... >> >> Thanks for the pointers. To Ilias, we need to double check skb_try_coalesce() >> code path, as it does look like we don't handle this correctly. >> > > Noted. We did check skb cloning, but not this one indeed. There is also TCP RX zero copy to consider (tcp_zerocopy_receive()), but more generally all paths that can grab pages from skbs.
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On Sat, Dec 08, 2018 at 10:14:47PM +0200, Ilias Apalodimas wrote: > On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote: > > > > > > I want to make sure you guys thought about splice() stuff, and > > > skb_try_coalesce(), and GRO, and skb cloning, and ... > > > > Thanks for the pointers. To Ilias, we need to double check > > skb_try_coalesce() > > code path, as it does look like we don't handle this correctly. > > > > Noted. We did check skb cloning, but not this one indeed I'll try to run some tests on my clearfog later, but for now I'm still 100% busy until haproxy 1.9 gets released. Cheers, Willy
Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets
> That is not how XDP works. XDP must run on the "master" physical driver > interface, as the "slave" interface is a virtual DSA interface. I did > mention DSA because I'm handling that on the EspressoBin implementation. Hi Jesper It is going to be interesting to see how you do that. As a user, I want to attach the XDP program to a slave interface. So I assume you somehow redirect that to the master, with some extra metadata to allow XDP on the master to detect packets for a specific slave? Andrew
Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets
On Sat, 8 Dec 2018 19:43:38 +0100 Björn Töpel wrote: > Den lör 8 dec. 2018 kl 15:52 skrev Jesper Dangaard Brouer : > > > > On Fri, 7 Dec 2018 15:01:55 +0100 > > Björn Töpel wrote: > > > > > Den fre 7 dec. 2018 kl 14:42 skrev Jesper Dangaard Brouer > > > : > > > > > > > > On Fri, 7 Dec 2018 12:44:24 +0100 > > > > Björn Töpel wrote: > > > > > > > > > The rationale behind attach is performance and ease of use. Many XDP > > > > > socket users just need a simple way of creating/binding a socket and > > > > > receiving frames right away without loading an XDP program. > > > > > > > > > > XDP_ATTACH adds a mechanism we call "builtin XDP program" that simply > > > > > is a kernel provided XDP program that is installed to the netdev when > > > > > XDP_ATTACH is being passed as a bind() flag. > > > > > > > > > > The builtin program is the simplest program possible to redirect a > > > > > frame to an attached socket. In restricted C it would look like this: > > > > > > > > > > SEC("xdp") > > > > > int xdp_prog(struct xdp_md *ctx) > > > > > { > > > > > return bpf_xsk_redirect(ctx); > > > > > } > > > > > > > > > > The builtin program loaded via XDP_ATTACH behaves, from an > > > > > install-to-netdev/uninstall-from-netdev point of view, differently > > > > > from regular XDP programs. The easiest way to look at it is as a > > > > > 2-level hierarchy, where regular XDP programs has precedence over the > > > > > builtin one. > > > > > > > > > > If no regular XDP program is installed to the netdev, the builtin will > > > > > be install. If the builtin program is installed, and a regular is > > > > > installed, regular XDP program will have precedence over the builtin > > > > > one. > > > > > > > > > > Further, if a regular program is installed, and later removed, the > > > > > builtin one will automatically be installed. 
> > > > > > > > > > The sxdp_flags field of struct sockaddr_xdp gets two new options > > > > > XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which maps to the > > > > > corresponding XDP netlink install flags. > > > > > > > > > > The builtin XDP program functionally adds even more complexity to the > > > > > already hard to read dev_change_xdp_fd. Maybe it would be simpler to > > > > > store the program in the struct net_device together with the install > > > > > flags instead of calling the ndo_bpf multiple times? > > > > > > > > (As far as I can see from reading the code, correct me if I'm wrong.) > > > > > > > > If an AF_XDP program uses XDP_ATTACH, then it installs the > > > > builtin-program as the XDP program on the "entire" device. That means > > > > all RX-queues will call this XDP-bpf program (indirect call), and it is > > > > actually only relevant for the specific queue_index. Yes, the helper > > > > call does check that the 'xdp->rxq->queue_index' for an attached 'xsk' > > > > and return XDP_PASS if it is NULL: > > > > > > > > > > Yes, you are correct. The builtin XDP program, just like a regular XDP > > > program, affects the whole netdev. So, yes the non-AF_XDP queues would > > > get a performance hit from this. Just to reiterate -- this isn't new > > > for this series. This has always been the case for XDP when acting on > > > just one queue. > > > > > > > +BPF_CALL_1(bpf_xdp_xsk_redirect, struct xdp_buff *, xdp) > > > > +{ > > > > + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); > > > > + struct xdp_sock *xsk; > > > > + > > > > + xsk = READ_ONCE(xdp->rxq->dev->_rx[xdp->rxq->queue_index].xsk); > > > > + if (xsk) { > > > > + ri->xsk = xsk; > > > > + return XDP_REDIRECT; > > > > + } > > > > + > > > > + return XDP_PASS; > > > > +} > > > > > > > > Why does every normal XDP_PASS packet have to pay this overhead (indirect > > > > call), when someone loads an AF_XDP socket program? 
The AF_XDP socket > > > > is tied hard and only relevant to a specific RX-queue (which is why we > > > > get a performance boost due to SPSC queues). > > > > > > > > I acknowledge there is a need for this, but this use-case shows there > > > > is a need for attaching XDP programs per RX-queue basis. > > > > > > > > > > From my AF_XDP perspective, having a program per queue would make > > > sense. The discussion of a per-queue has been up before, and I think > > > the conclusion was that it would be too complex from a > > > configuration/tooling point-of-view. Again, for AF_XDP this would be > > > great. > > > > I really think it is about time that we introduce these per-queue XDP > > programs. From a configuration/tooling point-of-view, your proposed > > solution have the same issues, we have to solve. (Like listing > > introspection need to show the per-queue XDP/BPF programs). And you have > > almost solved it, by keeping the per-queue program in net_device > > struct. > > > > I just to emphasize; My series is only a convenience method for > loading a (sticky/hierarchical) XDP program and a new
Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets
On Sat, 8 Dec 2018 17:55:05 +0100 Andrew Lunn wrote: > On Sat, Dec 08, 2018 at 04:12:12PM +0100, Jesper Dangaard Brouer wrote: > > On Fri, 7 Dec 2018 13:21:08 -0800 > > Alexei Starovoitov wrote: > > > > > for production I suspect the users would want > > > an easy way to stay safe when they're playing with AF_XDP. > > > So another builtin program that redirects ssh and ping traffic > > > back to the kernel would be a nice addition. > > > > Are you saying a builtin program that needs to parse different kinds of > > Eth-type headers (DSA, VLAN, QinQ) > > Hi Jesper > > Parsing DSA headers is quite hard, since most of them are not > discoverable, you need to know they are there, and where they actually > are. > > I also don't think there are any use cases for actually trying to > parse them on the master interface. You are more likely to want to run > the eBPF program on the slave interface, once the DSA header has been > removed, and it is clear what interface the frame is actually for. That is not how XDP works. XDP must run on the "master" physical driver interface, as the "slave" interface is a virtual DSA interface. I did mention DSA because I'm handling that on the EspressoBin implementation. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On Sat, Dec 08, 2018 at 12:21:10PM -0800, David Miller wrote: > From: Ilias Apalodimas > Date: Sat, 8 Dec 2018 16:57:28 +0200 > > > The patchset speeds up the mvneta driver on the default network > > stack. The only change that was needed was to adapt the driver to > > using the page_pool API. The speed improvements we are seeing on > > specific workloads (i.e. 256b < packet < 400b) are almost 3x. > > > > Lots of high speed drivers are doing similar recycling tricks themselves > > (and > > there's no common code, everyone is doing something similar though). All we > > are > > trying to do is provide a unified API to make that easier for the rest. > > Another > > advantage is that if some drivers switch to the API, adding XDP > > functionality on them is pretty trivial. > > Yeah this is a very important point moving forward. > > Jesse Brandeberg brought the following up to me at LPC and I'd like to > develop it further. > > Right now we tell driver authors to write a new driver as SKB based, > and once they've done all of that work we tell them to basically > shoe-horn XDP support into that somewhat different framework. > > Instead, the model should be the other way around, because with a raw > meta-data free set of data buffers we can always construct an SKB or > pass it to XDP. Yeah exactly, and it gets even worse. If the driver writer doesn't go through the 'proper' path, i.e. allocate buffers and use build_skb, you end up having to rewrite dma/memory management for the normal stack. So it's more than 'shoe-horning' XDP, it's re-writing and re-testing the whole thing. The API also offers dma mapping capabilities (configurable). So you remove potential nasty bugs there as well. > > So drivers should be targeting some raw data buffer kind of interface > which takes care of all of this stuff. If the buffers get wrapped > into an SKB and get pushed into the traditional networking stack, the > driver shouldn't know or care. 
Likewise if it ends up being processed > with XDP, it should not need to know or care. > > All of those details should be behind a common layer. Then we can > control: > > 1) Buffer handling, recycling, "fast paths" > > 2) Statistics > > 3) XDP feature sets > > We can consolidate behavior and semantics across all of the drivers > if we do this. No more talk about "supporting all XDP features", > and the inconsistencies we have because of that. > > The whole common statistics discussion could be resolved with this > common layer as well. > > We'd be able to control and properly optimize everything. /Ilias
Re: [PATCH net-next] ip: silence udp zerocopy smatch false positive
From: Willem de Bruijn Date: Sat, 8 Dec 2018 06:22:46 -0500 > From: Willem de Bruijn > > extra_uref is used in __ip(6)_append_data only if uarg is set. > > Smatch sees that the variable is passed to sock_zerocopy_put_abort. > This function accesses it only when uarg is set, but smatch cannot > infer this. > > Make this dependency explicit. > > Fixes: 52900d22288e ("udp: elide zerocopy operation in hot path") > Signed-off-by: Willem de Bruijn I looked and can't figure out a better way to fix this :) Applied, thanks Willem.
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
From: Ilias Apalodimas Date: Sat, 8 Dec 2018 16:57:28 +0200 > The patchset speeds up the mvneta driver on the default network > stack. The only change that was needed was to adapt the driver to > using the page_pool API. The speed improvements we are seeing on > specific workloads (i.e. 256b < packet < 400b) are almost 3x. > > Lots of high speed drivers are doing similar recycling tricks themselves (and > there's no common code, everyone is doing something similar though). All we > are > trying to do is provide a unified API to make that easier for the rest. > Another > advantage is that if some drivers switch to the API, adding XDP > functionality on them is pretty trivial. Yeah this is a very important point moving forward. Jesse Brandeberg brought the following up to me at LPC and I'd like to develop it further. Right now we tell driver authors to write a new driver as SKB based, and once they've done all of that work we tell them to basically shoe-horn XDP support into that somewhat different framework. Instead, the model should be the other way around, because with a raw metadata-free set of data buffers we can always construct an SKB or pass it to XDP. So drivers should be targeting some raw data buffer kind of interface which takes care of all of this stuff. If the buffers get wrapped into an SKB and get pushed into the traditional networking stack, the driver shouldn't know or care. Likewise if it ends up being processed with XDP, it should not need to know or care. All of those details should be behind a common layer. Then we can control: 1) Buffer handling, recycling, "fast paths" 2) Statistics 3) XDP feature sets We can consolidate behavior and semantics across all of the drivers if we do this. No more talk about "supporting all XDP features", and the inconsistencies we have because of that. The whole common statistics discussion could be resolved with this common layer as well. We'd be able to control and properly optimize everything.
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote: > > > > I want to make sure you guys thought about splice() stuff, and > > skb_try_coalesce(), and GRO, and skb cloning, and ... > > Thanks for the pointers. To Ilias, we need to double check skb_try_coalesce() > code path, as it does look like we don't handle this correctly. > Noted. We did check skb cloning, but not this one indeed /Ilias
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On Sat, 8 Dec 2018 11:26:56 -0800 Eric Dumazet wrote: > On 12/08/2018 06:57 AM, Ilias Apalodimas wrote: > > Hi Eric, > This patch is changing struct sk_buff, and is thus per-definition > controversial. > > Place a new member 'mem_info' of type struct xdp_mem_info, just after > members (flags) head_frag and pfmemalloc, And not in between > headers_start/end to ensure skb_copy() and pskb_copy() work as-is. > Copying mem_info during skb_clone() is required. This makes sure that > pages are correctly freed or recycled during the altered > skb_free_head() invocation. > >>> > >>> I read this to mean that this 'info' isn't accessed/needed until skb > >>> is freed. Any reason it's not added at the end? > >>> > >>> This would avoid moving other fields that are probably accessed > >>> more frequently during processing. > >>> > >> > >> But I do not get why the patch is needed. > >> > >> Adding extra cost for each skb destruction is costly. > >> > >> I thought XDP was all about _not_ having skbs. > > > > You hit the only point I don't personally like in the code, xdp info in the > > skb. Considering the benefits I am convinced this is ok and here's why: > > > >> Please let's do not slow down the non XDP stack only to make XDP more > >> appealing. > > > > We are not trying to do that, on the contrary. The patchset has nothing > > towards > > speeding XDP and slowing down anything else. The patchset speeds up the > > mvneta driver on the default network stack. The only change that was needed > > was > > to adapt the driver to using the page_pool API. The speed improvements we > > are > > seeing on specific workloads (i.e. 256b < packet < 400b) are almost 3x. > > > > Lots of high speed drivers are doing similar recycling tricks themselves > > (and > > there's no common code, everyone is doing something similar though). All we > > are > > trying to do is provide a unified API to make that easier for the rest. 
> > Another > > advantage is that if some drivers switch to the API, adding XDP > > functionality on them is pretty trivial. > > > > Moreover our tests are only performed on systems without or with SMMU > > disabled. > > On a platform I have access to, enabling and disabling the SMMU has some > > performance impact. By keeping the buffers mapped we believe this impact > > will be substantially less (I'll come back with results once I have them on > > this). > > > > I do understand your point, but the potential advantages in my head > > outweigh that by a lot. > > > > I got other concerns on the patchset though. Like how much memory is it > > 'ok' to > > keep mapped keeping in mind we are using the streaming DMA API. Are we > > going to > > affect anyone else negatively by doing so ? > > > > I want to make sure you guys thought about splice() stuff, and > skb_try_coalesce(), and GRO, and skb cloning, and ... Thanks for the pointers. To Ilias, we need to double check skb_try_coalesce() code path, as it does look like we don't handle this correctly. > I do not see how an essential property of page fragments would be > attached to sk_buff, without adding extra code all over the place. > > Page recycling in drivers is fine, but should operate on pages that > the driver owns. > I think we have addressed this. We are only recycling pages that the driver owns. In case of fragments, we release the DMA-mapping and don't recycle the page, instead the normal code path is taken (which is missing in case of skb_try_coalesce). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
From: Jesper Dangaard Brouer Date: Sat, 8 Dec 2018 12:36:10 +0100 > The annoying part is actually that depending on the kernel config > options CONFIG_XFRM, CONFIG_NF_CONNTRACK and CONFIG_BRIDGE_NETFILTER, > whether there is a cache-line split, where mem_info gets moved into the > next cacheline. Note that Florian Westphal's work (trying to help MP-TCP) would eliminate this variability.
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On 12/08/2018 06:57 AM, Ilias Apalodimas wrote: > Hi Eric, This patch is changing struct sk_buff, and is thus per-definition controversial. Place a new member 'mem_info' of type struct xdp_mem_info, just after members (flags) head_frag and pfmemalloc, And not in between headers_start/end to ensure skb_copy() and pskb_copy() work as-is. Copying mem_info during skb_clone() is required. This makes sure that pages are correctly freed or recycled during the altered skb_free_head() invocation. >>> >>> I read this to mean that this 'info' isn't accessed/needed until skb >>> is freed. Any reason it's not added at the end? >>> >>> This would avoid moving other fields that are probably accessed >>> more frequently during processing. >>> >> >> But I do not get why the patch is needed. >> >> Adding extra cost for each skb destruction is costly. >> >> I thought XDP was all about _not_ having skbs. > > You hit the only point I don't personally like in the code, xdp info in the > skb. Considering the benefits I am convinced this is ok and here's why: > >> Please let's do not slow down the non XDP stack only to make XDP more >> appealing. > > We are not trying to do that, on the contrary. The patchset has nothing > towards > speeding XDP and slowing down anything else. The patchset speeds up the > mvneta driver on the default network stack. The only change that was needed > was > to adapt the driver to using the page_pool API. The speed improvements we are > seeing on specific workloads (i.e. 256b < packet < 400b) are almost 3x. > > Lots of high speed drivers are doing similar recycling tricks themselves (and > there's no common code, everyone is doing something similar though). All we > are > trying to do is provide a unified API to make that easier for the rest. > Another > advantage is that if some drivers switch to the API, adding XDP > functionality on them is pretty trivial. > > Moreover our tests are only performed on systems without or with SMMU > disabled. 
> On a platform I have access to, enabling and disabling the SMMU has some > performance impact. By keeping the buffers mapped we believe this impact > will be substantially less (I'll come back with results once I have them on > this). > > I do understand your point, but the potential advantages in my head > outweigh that by a lot. > > I got other concerns on the patchset though. Like how much memory is it 'ok' > to > keep mapped keeping in mind we are using the streaming DMA API. Are we going > to > affect anyone else negatively by doing so ? > I want to make sure you guys thought about splice() stuff, and skb_try_coalesce(), and GRO, and skb cloning, and ... I do not see how an essential property of page fragments would be attached to sk_buff, without adding extra code all over the place. Page recycling in drivers is fine, but should operate on pages that the driver owns. Messing with skb->head seems not needed for performance, since skb->head can be regular slab allocated.
Fw: [Bug 201933] New: r8169 hangs with Oops/null pointer deref on shutdown
Begin forwarded message: Date: Sat, 08 Dec 2018 18:20:01 + From: bugzilla-dae...@bugzilla.kernel.org To: step...@networkplumber.org Subject: [Bug 201933] New: r8169 hangs with Oops/null pointer deref on shutdown https://bugzilla.kernel.org/show_bug.cgi?id=201933 Bug ID: 201933 Summary: r8169 hangs with Oops/null pointer deref on shutdown Product: Networking Version: 2.5 Kernel Version: 4.20 Hardware: x86-64 OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: Other Assignee: step...@networkplumber.org Reporter: adf.li...@gmail.com Regression: No On 4.20-rc? halt/reboot r8169 hangs. ip li set down seems to get same error. Dec 8 16:14:56 ph4 kernel: r8169 :06:00.0 enp6s0: Link is Down Dec 8 16:14:57 ph4 kernel: BUG: unable to handle kernel NULL pointer dereference at 03a0 Dec 8 16:14:57 ph4 kernel: PGD 22acfc067 P4D 22acfc067 PUD 232a33067 PMD 0 Dec 8 16:14:57 ph4 kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI Dec 8 16:14:57 ph4 kernel: CPU: 2 PID: 22126 Comm: ip Not tainted 4.20.0-rc5 #1 Dec 8 16:14:57 ph4 kernel: Hardware name: To Be Filled By O.E.M. 
To Be Filled By O.E.M./970 Extreme3 R2.0, BIOS P1.60 06/05/2014 Dec 8 16:14:57 ph4 kernel: RIP: 0010:queue_work_on+0x4/0x20 Dec 8 16:14:57 ph4 kernel: Code: ff 48 c7 c7 20 55 de 81 e8 02 58 04 00 c6 05 aa 1c 3e 01 01 e9 75 ff ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 53 9c 5b fa 48 0f ba 2a 00 73 06 31 c0 53 9d 5b c3 e8 99 fb ff ff b8 01 00 Dec 8 16:14:57 ph4 kernel: RSP: 0018:c9d473a8 EFLAGS: 00010002 Dec 8 16:14:57 ph4 kernel: RAX: 8882362c4000 RBX: 0002 RCX: Dec 8 16:14:57 ph4 kernel: RDX: 03a0 RSI: 88823680d200 RDI: 0008 Dec 8 16:14:57 ph4 kernel: RBP: 8882362c4840 R08: 888236400248 R09: 888236400340 Dec 8 16:14:57 ph4 kernel: R10: R11: 82043b28 R12: 8882341ca960 Dec 8 16:14:57 ph4 kernel: R13: 8882341ca828 R14: 001a R15: 8882362c4840 Dec 8 16:14:57 ph4 kernel: FS: 7f5fc097f700() GS:888237b0() knlGS: Dec 8 16:14:57 ph4 kernel: CS: 0010 DS: ES: CR0: 80050033 Dec 8 16:14:57 ph4 kernel: CR2: 03a0 CR3: 00023344 CR4: 06e0 Dec 8 16:14:57 ph4 kernel: Call Trace: Dec 8 16:14:57 ph4 kernel: rtl8169_interrupt+0xba/0x260 [r8169] Dec 8 16:14:57 ph4 kernel: ? _raw_spin_unlock_irqrestore+0xf/0x30 Dec 8 16:14:57 ph4 kernel: __free_irq+0x177/0x2d0 Dec 8 16:14:57 ph4 kernel: free_irq+0x29/0x60 Dec 8 16:14:57 ph4 kernel: pci_free_irq+0x13/0x20 Dec 8 16:14:57 ph4 kernel: rtl8169_close+0xe8/0x1f0 [r8169] Dec 8 16:14:57 ph4 kernel: __dev_close_many+0x8f/0x100 Dec 8 16:14:57 ph4 kernel: __dev_change_flags+0xe9/0x210 Dec 8 16:14:57 ph4 kernel: dev_change_flags+0x29/0x70 Dec 8 16:14:57 ph4 kernel: do_setlink+0x310/0xf30 Dec 8 16:14:57 ph4 kernel: ? __nla_parse+0x48/0x190 Dec 8 16:14:57 ph4 kernel: rtnl_newlink+0x571/0x880 Dec 8 16:14:57 ph4 kernel: ? __nla_parse+0x48/0x190 Dec 8 16:14:57 ph4 kernel: ? nla_parse+0x13/0x20 Dec 8 16:14:57 ph4 kernel: ? rtnl_newlink+0x167/0x880 Dec 8 16:14:57 ph4 kernel: ? preempt_count_add+0x55/0xb0 Dec 8 16:14:57 ph4 kernel: ? get_page_from_freelist+0xdf0/0x1130 Dec 8 16:14:57 ph4 kernel: ? 
kmem_cache_alloc+0x37/0x180 Dec 8 16:14:57 ph4 kernel: rtnetlink_rcv_msg+0x141/0x380 Dec 8 16:14:57 ph4 kernel: ? rtnl_calcit.isra.25+0x120/0x120 Dec 8 16:14:57 ph4 kernel: netlink_rcv_skb+0x48/0x120 Dec 8 16:14:57 ph4 kernel: netlink_unicast+0x1be/0x270 Dec 8 16:14:57 ph4 kernel: netlink_sendmsg+0x2b7/0x3f0 Dec 8 16:14:57 ph4 kernel: ? netlink_unicast+0x270/0x270 Dec 8 16:14:57 ph4 kernel: ___sys_sendmsg+0x10b/0x310 Dec 8 16:14:57 ph4 kernel: ? dev_ioctl+0x19d/0x410 Dec 8 16:14:57 ph4 kernel: ? preempt_count_add+0x74/0xb0 Dec 8 16:14:57 ph4 kernel: ? __inode_wait_for_writeback+0x7a/0xe0 Dec 8 16:14:57 ph4 kernel: ? init_wait_var_entry+0x40/0x40 Dec 8 16:14:57 ph4 kernel: ? preempt_count_add+0x74/0xb0 Dec 8 16:14:57 ph4 kernel: ? _raw_spin_lock+0xe/0x30 Dec 8 16:14:57 ph4 kernel: ? _raw_spin_unlock+0xd/0x30 Dec 8 16:14:57 ph4 kernel: ? __dentry_kill+0x11a/0x160 Dec 8 16:14:57 ph4 kernel: __sys_sendmsg+0x64/0xb0 Dec 8 16:14:57 ph4 kernel: do_syscall_64+0x6b/0x1c0 Dec 8 16:14:57 ph4 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 Dec 8 16:14:57 ph4 kernel: RIP: 0033:0x7f5fc00c1110 Dec 8 16:14:57 ph4 kernel: Code: 8b 15 84 3d 2b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 83 3d d9 96 2b 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae b4 00 00 48 89 04 24 Dec 8 16:14:57 ph4 kernel: RSP: 002b:7ffd05cb11e8 EFLAGS: 0246 ORIG_RAX: 002e Dec 8 16:14:57 ph4 kernel: RAX: ffda
Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets
Den lör 8 dec. 2018 kl 16:12 skrev Jesper Dangaard Brouer : > > On Fri, 7 Dec 2018 13:21:08 -0800 > Alexei Starovoitov wrote: > > > for production I suspect the users would want > > an easy way to stay safe when they're playing with AF_XDP. > > So another builtin program that redirects ssh and ping traffic > > back to the kernel would be a nice addition. > > Are you saying a builtin program that needs to parse different kinds of > Eth-type headers (DSA, VLAN, QinQ) and find the TCP port to match port > 22 to return XDP_PASS, or else call AF_XDP redirect. That seems to be > pure overhead for this fast-path builtin program for AF_XDP. > > Would a solution be to install a NIC hardware filter that redirects SSH > port 22 to another RX-queue. And then have a builtin program that > returns XDP_PASS installed on that RX-queue. And change Björn's > semantics, such that RX-queue programs take precedence over the global > XDP program. This would also be a good fail safe in general for XDP. > Exactly this; I'd say this is the most common way of using AF_XDP, i.e. steer a certain flow to a Rx queue, and *all* packets on that queue are pulled to the AF_XDP socket. This is why the builtin is the way it is, it's what is being used, not only for benchmarking scenarios. > If the RX-queues take precedence, I can use this fail safe approach. > E.g. when I want to test my new global XDP program, I'll use ethtool > to match my management IP and send that to a specific RX-queue and my > fail-safe BPF program. > Interesting take to have a *per queue* XDP program that overrides the regular one. Maybe this is a better way to use builtins? > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer
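The hardware-filter fail-safe sketched above can be configured with ethtool's n-tuple flow steering. A sketch (eth0 and queue 2 are placeholder values; the NIC must support ntuple filtering, and the filter survives only as long as the driver keeps it):

```shell
# Enable n-tuple receive filters on the NIC.
ethtool -K eth0 ntuple on

# Steer inbound SSH (TCP dst port 22) to RX queue 2, where a builtin
# XDP_PASS program would keep management traffic reaching the kernel.
ethtool -N eth0 flow-type tcp4 dst-port 22 action 2

# List installed filters to verify the rule took effect.
ethtool -n eth0
```

With RX-queue programs taking precedence over the global one, this keeps an SSH lifeline open while experimenting with a new global XDP program.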
Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets
Den lör 8 dec. 2018 kl 15:52 skrev Jesper Dangaard Brouer : > > On Fri, 7 Dec 2018 15:01:55 +0100 > Björn Töpel wrote: > > > Den fre 7 dec. 2018 kl 14:42 skrev Jesper Dangaard Brouer > > : > > > > > > On Fri, 7 Dec 2018 12:44:24 +0100 > > > Björn Töpel wrote: > > > > > > > The rationale behind attach is performance and ease of use. Many XDP > > > > socket users just need a simple way of creating/binding a socket and > > > > receiving frames right away without loading an XDP program. > > > > > > > > XDP_ATTACH adds a mechanism we call "builtin XDP program" that simply > > > > is a kernel provided XDP program that is installed to the netdev when > > > > XDP_ATTACH is being passed as a bind() flag. > > > > > > > > The builtin program is the simplest program possible to redirect a > > > > frame to an attached socket. In restricted C it would look like this: > > > > > > > > SEC("xdp") > > > > int xdp_prog(struct xdp_md *ctx) > > > > { > > > > return bpf_xsk_redirect(ctx); > > > > } > > > > > > > > The builtin program loaded via XDP_ATTACH behaves, from an > > > > install-to-netdev/uninstall-from-netdev point of view, differently > > > > from regular XDP programs. The easiest way to look at it is as a > > > > 2-level hierarchy, where regular XDP programs has precedence over the > > > > builtin one. > > > > > > > > If no regular XDP program is installed to the netdev, the builtin will > > > > be install. If the builtin program is installed, and a regular is > > > > installed, regular XDP program will have precedence over the builtin > > > > one. > > > > > > > > Further, if a regular program is installed, and later removed, the > > > > builtin one will automatically be installed. > > > > > > > > The sxdp_flags field of struct sockaddr_xdp gets two new options > > > > XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which maps to the > > > > corresponding XDP netlink install flags. 
> > > > > > > > The builtin XDP program functionally adds even more complexity to the > > > > already hard to read dev_change_xdp_fd. Maybe it would be simpler to > > > > store the program in the struct net_device together with the install > > > > flags instead of calling the ndo_bpf multiple times? > > > > > > (As far as I can see from reading the code, correct me if I'm wrong.) > > > > > > If an AF_XDP program uses XDP_ATTACH, then it installs the > > > builtin-program as the XDP program on the "entire" device. That means > > > all RX-queues will call this XDP-bpf program (indirect call), and it is > > > actually only relevant for the specific queue_index. Yes, the helper > > > call does check that the 'xdp->rxq->queue_index' for an attached 'xsk' > > > and return XDP_PASS if it is NULL: > > > > > > > Yes, you are correct. The builtin XDP program, just like a regular XDP > > program, affects the whole netdev. So, yes the non-AF_XDP queues would > > get a performance hit from this. Just to reiterate -- this isn't new > > for this series. This has always been the case for XDP when acting on > > just one queue. > > > > > +BPF_CALL_1(bpf_xdp_xsk_redirect, struct xdp_buff *, xdp) > > > +{ > > > + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); > > > + struct xdp_sock *xsk; > > > + > > > + xsk = READ_ONCE(xdp->rxq->dev->_rx[xdp->rxq->queue_index].xsk); > > > + if (xsk) { > > > + ri->xsk = xsk; > > > + return XDP_REDIRECT; > > > + } > > > + > > > + return XDP_PASS; > > > +} > > > > > > Why does every normal XDP_PASS packet have to pay this overhead (indirect > > > call), when someone loads an AF_XDP socket program? The AF_XDP socket > > > is tied hard and only relevant to a specific RX-queue (which is why we > > > get a performance boost due to SPSC queues). > > > > > > I acknowledge there is a need for this, but this use-case shows there > > > is a need for attaching XDP programs per RX-queue basis. 
> > > > > > > > From my AF_XDP perspective, having a program per queue would make > > sense. The discussion of a per-queue has been up before, and I think > > the conclusion was that it would be too complex from a > > configuration/tooling point-of-view. Again, for AF_XDP this would be > > great. > > I really think it is about time that we introduce these per-queue XDP > programs. From a configuration/tooling point-of-view, your proposed > solution have the same issues, we have to solve. (Like listing > introspection need to show the per-queue XDP/BPF programs). And you have > almost solved it, by keeping the per-queue program in net_device > struct. > I just want to emphasize; My series is only a convenience method for loading a (sticky/hierarchical) XDP program and a new bpf-function. Nothing more. I'm not sure what you mean when you say that "proposed solution have the same issues". Do you mean XDP in general? *I'm* all in for a per-queue XDP program -- but I might be biased due to AF_XDP :-P Let's explore in this thread what that would look like and
linux-next: BUG: KASAN: use-after-free in neigh_mark_dead+0x192/0x1c0
Hi David,

Our tests detected a new problem on net-next (v4.20-rc4-1241-g83af01ba1c2d).

[ 92.949574] ==================================================================
[ 92.949794] BUG: KASAN: use-after-free in neigh_mark_dead+0x192/0x1c0
[ 92.949950] Write of size 8 at addr 8881d9bc9438 by task kworker/u4:1/37
[ 92.950108]
[ 92.950254] CPU: 1 PID: 37 Comm: kworker/u4:1 Not tainted 4.20.0-rc4+ #1
[ 92.950257] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 92.950263] Workqueue: netns cleanup_net
[ 92.950267] Call Trace:
[ 92.950276]  dump_stack+0x5b/0x8b
[ 92.950284]  print_address_description+0x6a/0x270
[ 92.950290]  kasan_report+0x260/0x380
[ 92.950295]  ? neigh_mark_dead+0x192/0x1c0
[ 92.950301]  neigh_mark_dead+0x192/0x1c0
[ 92.950307]  neigh_flush_dev+0x1ad/0x510
[ 92.950318]  __neigh_ifdown+0x49/0x2a0
[ 92.950327]  neigh_ifdown+0xc/0x10
[ 92.950333]  fib_netdev_event+0x1f5/0x300
[ 92.950342]  notifier_call_chain+0xbf/0x130
[ 92.950352]  dev_close_many+0x249/0x4d0
[ 92.950360]  ? __dev_close_many+0x280/0x280
[ 92.950364]  ? default_device_exit_batch+0xf3/0x3b0
[ 92.950374]  rollback_registered_many+0x34c/0xd00
[ 92.950378]  ? default_device_exit_batch+0xf3/0x3b0
[ 92.950383]  ? wait_for_completion+0x2d0/0x2d0
[ 92.950386]  ? __mutex_lock+0x31f/0x1000
[ 92.950391]  ? generic_xdp_install+0x3c0/0x3c0
[ 92.950397]  ? default_device_exit_batch+0x1b6/0x3b0
[ 92.950403]  ? find_held_lock+0x32/0x1c0
[ 92.950408]  ? default_device_exit_batch+0x1b6/0x3b0
[ 92.950416]  ? rtnl_is_locked+0x16/0x30
[ 92.950420]  ? unregister_netdevice_queue+0x69/0x3f0
[ 92.950424]  ? rollback_registered_many+0xd00/0xd00
[ 92.950432]  unregister_netdevice_many.part.124+0x17/0x210
[ 92.950439]  default_device_exit_batch+0x2d6/0x3b0
[ 92.950444]  ? unregister_netdevice_many+0x40/0x40
[ 92.950450]  ? dev_change_net_namespace+0xb90/0xb90
[ 92.950459]  ? do_wait_intr_irq+0x210/0x210
[ 92.950466]  ? ops_exit_list.isra.6+0x94/0x140
[ 92.950473]  cleanup_net+0x309/0x650
[ 92.950478]  ? process_one_work+0x8ae/0x16c0
[ 92.950482]  ? net_drop_ns+0x60/0x60
[ 92.950490]  ? lock_acquire+0xfe/0x290
[ 92.950497]  ? strscpy+0x96/0x300
[ 92.950505]  process_one_work+0x96c/0x16c0
[ 92.950516]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
[ 92.950520]  ? do_raw_spin_lock+0x120/0x290
[ 92.950532]  worker_thread+0x87/0xe80
[ 92.950541]  ? __kthread_parkme+0x82/0xf0
[ 92.950546]  ? process_one_work+0x16c0/0x16c0
[ 92.950550]  kthread+0x2e9/0x3a0
[ 92.950554]  ? kthread_park+0x120/0x120
[ 92.950560]  ret_from_fork+0x35/0x40
[ 92.950572]
[ 92.950713] Allocated by task 8903:
[ 92.950863]  kasan_kmalloc+0xa0/0xd0
[ 92.950868]  __kmalloc+0x132/0x250
[ 92.950872]  ___neigh_create+0x4d6/0x18e0
[ 92.950877]  ip_finish_output2+0x4e7/0xff0
[ 92.950880]  ip_output+0x1c5/0x3f0
[ 92.950884]  __ip_queue_xmit+0x7c9/0x1990
[ 92.950888]  __tcp_transmit_skb+0x1872/0x33e0
[ 92.950891]  do_tcp_setsockopt.isra.41+0x1712/0x1c70
[ 92.950895]  __sys_setsockopt+0x114/0x1e0
[ 92.950899]  __x64_sys_setsockopt+0xba/0x150
[ 92.950904]  do_syscall_64+0x94/0x280
[ 92.950908]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 92.950909]
[ 92.951049] Freed by task 8936:
[ 92.951200]  __kasan_slab_free+0x130/0x180
[ 92.951204]  kfree+0xc2/0x1e0
[ 92.951209]  rcu_process_callbacks+0x647/0xbe0
[ 92.951213]  __do_softirq+0x199/0x658
[ 92.951215]
[ 92.951356] The buggy address belongs to the object at 8881d9bc9200
[ 92.951356]  which belongs to the cache kmalloc-1k of size 1024
[ 92.951530] The buggy address is located 568 bytes inside of
[ 92.951530]  1024-byte region [8881d9bc9200, 8881d9bc9600)
[ 92.951702] The buggy address belongs to the page:
[ 92.951856] page:ea000766f200 count:1 mapcount:0 mapping:8881dac02a00 index:0x0 compound_mapcount: 0
[ 92.952026] flags: 0x17fff810200(slab|head)
[ 92.952185] raw: 017fff810200 dead0100 dead0200 8881dac02a00
[ 92.952351] raw: 000e000e 0001
[ 92.952515] page dumped because: kasan: bad access detected
[ 92.952662]
[ 92.952815] Memory state around the buggy address:
[ 92.952981]  8881d9bc9300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 92.953168]  8881d9bc9380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 92.953348] >8881d9bc9400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 92.953527]                ^
[ 92.953688]  8881d9bc9480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 92.953871]  8881d9bc9500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 92.954048] ==================================================================
[ 92.954233] Disabling lock debugging due to kernel taint

Here are more details (config, dmesg, etc.):
Re: [PATCH] net: dsa: ksz: Increase the tag alignment
On 12/7/18 8:25 PM, Marek Vasut wrote:
> On 12/08/2018 01:52 AM, tristram...@microchip.com wrote:
>>> -	padlen = (skb->len >= ETH_ZLEN) ? 0 : ETH_ZLEN - skb->len;
>>> +	padlen = (skb->len >= VLAN_ETH_ZLEN) ? 0 : VLAN_ETH_ZLEN - skb->len;
>>
>> The requirement is the tail tag should be at the end of the frame before
>> FCS. When the length is less than 60 the MAC controller will pad the frame
>> to legal size. That is why this function makes sure the padding is done
>> manually. Increasing the size just increases the length of the frame and
>> the chance to allocate a new socket buffer.
>>
>> Example: using a ping size of 18 will have the sizes of request and
>> response differ by 4 bytes. Not that it matters much.
>>
>> Are you concerned the MAC controller will remove the VLAN tag and so the
>> frame will not be sent? Or the switch removes the VLAN tag and is not able
>> to send?
>
> With TI CPSW in dual-ethernet configuration, which adds an internal VLAN
> tag at the end of the frame, the KSZ switch fails. The CPU will send out
> packets and the switch will reject them as corrupted. It needs this
> extra VLAN tag padding.

Oh, so they add the internal VLAN at the end of the frame, not the beginning?
That is quite surprising, but that would not be the one single oddity found
with CPSW, I am afraid.

The way I would approach this is with layering, where the padding needs to
occur:

- within the tag driver, you need to make sure there is enough room to
  insert the KSZ tag

- within the Ethernet MAC driver (which comes last in the transmit path),
  you need to make sure there is enough room to insert that trailer VLAN
  tag to permit internal transmission

-- 
Florian
Re: [PATCH net-next,v5 11/12] qede: place ethtool_rx_flow_spec after code after TC flower codebase
Thu, Dec 06, 2018 at 11:40:01PM CET, pa...@netfilter.org wrote: >This is a preparation patch to reuse the existing TC flower codebase >from ethtool_rx_flow_spec. > >This patch is merely moving the core ethtool_rx_flow_spec parser after >tc flower offload driver code so we can skip a few forward function >declarations in the follow up patch. > >Signed-off-by: Pablo Neira Ayuso Acked-by: Jiri Pirko
Re: [PATCH net-next,v5 10/12] dsa: bcm_sf2: use flow_rule infrastructure
Thu, Dec 06, 2018 at 11:40:00PM CET, pa...@netfilter.org wrote: >Update this driver to use the flow_rule infrastructure, hence we can use >the same code to populate hardware IR from ethtool_rx_flow and the >cls_flower interfaces. > >Signed-off-by: Pablo Neira Ayuso Acked-by: Jiri Pirko
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
> I got other concerns on the patchset though. Like how much memory is
> it 'ok' to keep mapped keeping in mind we are using the streaming
> DMA API. Are we going to affect anyone else negatively by doing so ?

For mvneta, you can expect the target to have between 512 MB and 3 GB of
RAM. You can take a look at the DT files to get a better feel for this.
All of it should be low enough that it is DMA'able.

These SoCs are often used for WiFi access points, and NAS boxes. So I guess
the big memory user on these boxes is the block cache for a NAS device. So
if you want to see the impact of the driver using more memory, see if disk
performance is affected.

	Andrew
Re: [PATCH v2 net-next 0/2] platform data controls for mdio-gpio
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> Soon to be mainlined is an x86 platform with a Marvell switch, and a
> bit-banging MDIO bus. In order to make this work, the phy_mask of the
> MDIO bus needs to be set to prevent scanning for PHYs, and the
> phy_ignore_ta_mask needs to be set because the switch has broken
> turnaround.
>
> Add a platform_data structure with these parameters.

Looks good, thanks Andrew. I do wonder if we could eventually define a
common mii_bus_platform_data structure comprised of these two members (if
nothing else) and maybe update the common mdiobus_register() code path to
look for these members. If a subsequent platform data/device MDIO bus shows
up, we could do that at that time.

Thanks!

> Andrew Lunn (2):
>   net: phy: mdio-gpio: Add platform_data support for phy_mask
>   net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data
>
>  MAINTAINERS                             |  1 +
>  drivers/net/phy/mdio-gpio.c             |  7 +++++++
>  include/linux/platform_data/mdio-gpio.h | 14 ++++++++++++++
>  3 files changed, 22 insertions(+)
>  create mode 100644 include/linux/platform_data/mdio-gpio.h

-- 
Florian
Re: [PATCH v2 net-next 2/2] net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> The Marvell 6390 Ethernet switch family does not perform MDIO
> turnaround correctly. Many hardware MDIO bus masters don't care about
> this, but the bitbanging implementation in Linux does by default. Add
> phy_ignore_ta_mask to the platform data so that the bitbanging code
> can be told which devices are known to get TA wrong.
>
> v2
> --
> int -> u32 in platform data structure
>
> Signed-off-by: Andrew Lunn

Reviewed-by: Florian Fainelli

-- 
Florian
Re: [PATCH v2 net-next 1/2] net: phy: mdio-gpio: Add platform_data support for phy_mask
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> It is sometimes necessary to instantiate a bit-banging MDIO bus as a
> platform device, without the aid of device tree.
>
> When device tree is being used, the bus is not scanned for devices,
> only those devices which are in device tree are probed. Without device
> tree, by default, all addresses on the bus are scanned. This may then
> find a device which is not a PHY, e.g. a switch. And the switch may
> have registers containing values which look like a PHY. So during the
> scan, a PHY device is wrongly created.
>
> After the bus has been registered, a search is made for
> mdio_board_info structures which indicate devices on the bus, and the
> driver which should be used for them. This is typically used to
> instantiate Ethernet switches from platform drivers. However, if the
> scanning of the bus has created a PHY device at the same location as
> indicated in the board info for a switch, the switch device is not
> created, since the address is already busy.
>
> This can be avoided by setting the phy_mask of the mdio bus. This mask
> prevents addresses on the bus being scanned.
>
> v2
> --
> int -> u32 in platform data structure
>
> Signed-off-by: Andrew Lunn

Reviewed-by: Florian Fainelli

-- 
Florian
Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets
On Sat, Dec 08, 2018 at 04:12:12PM +0100, Jesper Dangaard Brouer wrote:
> On Fri, 7 Dec 2018 13:21:08 -0800
> Alexei Starovoitov wrote:
>
> > for production I suspect the users would want
> > an easy way to stay safe when they're playing with AF_XDP.
> > So another builtin program that redirects ssh and ping traffic
> > back to the kernel would be a nice addition.
>
> Are you saying a builtin program that needs to parse different kinds of
> Eth-type headers (DSA, VLAN, QinQ)

Hi Jesper

Parsing DSA headers is quite hard, since most of them are not discoverable;
you need to know they are there, and where they actually are. I also don't
think there are any use cases for actually trying to parse them on the
master interface. You are more likely to want to run the eBPF program on
the slave interface, once the DSA header has been removed, and it is clear
which interface the frame is actually for.

	Andrew
[PATCH net-next] net: dsa: Make dsa_master_set_mtu() static
Add the missing static keyword.

Signed-off-by: Andrew Lunn
---
 net/dsa/master.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/master.c b/net/dsa/master.c
index a25242e71fb2..d7d5145aa235 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -158,7 +158,7 @@ static void dsa_master_ethtool_teardown(struct net_device *dev)
 	cpu_dp->orig_ethtool_ops = NULL;
 }
 
-void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
+static void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
 {
 	unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead;
 	int err;
-- 
2.19.1
[PATCH net-next] net: dsa: Restore MTU on master device on unload
A previous change set the MTU on the master device to take into account the
DSA overheads. This patch resets the master device back to the default MTU
when DSA is torn down.

Signed-off-by: Andrew Lunn
---
 net/dsa/master.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/dsa/master.c b/net/dsa/master.c
index 42f525bc68e2..a25242e71fb2 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -172,6 +172,18 @@ void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
 	rtnl_unlock();
 }
 
+static void dsa_master_reset_mtu(struct net_device *dev)
+{
+	int err;
+
+	rtnl_lock();
+	err = dev_set_mtu(dev, ETH_DATA_LEN);
+	if (err)
+		netdev_dbg(dev,
+			   "Unable to reset MTU to exclude DSA overheads\n");
+	rtnl_unlock();
+}
+
 int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp)
 {
 	dsa_master_set_mtu(dev, cpu_dp);
@@ -190,6 +202,7 @@ int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp)
 void dsa_master_teardown(struct net_device *dev)
 {
 	dsa_master_ethtool_teardown(dev);
+	dsa_master_reset_mtu(dev);
 
 	dev->dsa_ptr = NULL;
-- 
2.19.1
[PATCH v2 net-next 0/2] platform data controls for mdio-gpio
Soon to be mainlined is an x86 platform with a Marvell switch, and a
bit-banging MDIO bus. In order to make this work, the phy_mask of the
MDIO bus needs to be set to prevent scanning for PHYs, and the
phy_ignore_ta_mask needs to be set because the switch has broken
turnaround.

Add a platform_data structure with these parameters.

Andrew Lunn (2):
  net: phy: mdio-gpio: Add platform_data support for phy_mask
  net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data

 MAINTAINERS                             |  1 +
 drivers/net/phy/mdio-gpio.c             |  7 +++++++
 include/linux/platform_data/mdio-gpio.h | 14 ++++++++++++++
 3 files changed, 22 insertions(+)
 create mode 100644 include/linux/platform_data/mdio-gpio.h

-- 
2.19.1
[PATCH v2 net-next 2/2] net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data
The Marvell 6390 Ethernet switch family does not perform MDIO
turnaround correctly. Many hardware MDIO bus masters don't care about
this, but the bitbanging implementation in Linux does by default. Add
phy_ignore_ta_mask to the platform data so that the bitbanging code
can be told which devices are known to get TA wrong.

v2
--
int -> u32 in platform data structure

Signed-off-by: Andrew Lunn
---
 drivers/net/phy/mdio-gpio.c             | 4 +++-
 include/linux/platform_data/mdio-gpio.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c
index 1e296dd4067a..ea9a0e339778 100644
--- a/drivers/net/phy/mdio-gpio.c
+++ b/drivers/net/phy/mdio-gpio.c
@@ -130,8 +130,10 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
 	else
 		strncpy(new_bus->id, "gpio", MII_BUS_ID_SIZE);
 
-	if (pdata)
+	if (pdata) {
 		new_bus->phy_mask = pdata->phy_mask;
+		new_bus->phy_ignore_ta_mask = pdata->phy_ignore_ta_mask;
+	}
 
 	dev_set_drvdata(dev, new_bus);
 
diff --git a/include/linux/platform_data/mdio-gpio.h b/include/linux/platform_data/mdio-gpio.h
index a5d5ff5e174c..13874fa6e767 100644
--- a/include/linux/platform_data/mdio-gpio.h
+++ b/include/linux/platform_data/mdio-gpio.h
@@ -8,6 +8,7 @@
 struct mdio_gpio_platform_data {
 	u32 phy_mask;
+	u32 phy_ignore_ta_mask;
 };
 
 #endif /* __LINUX_MDIO_GPIO_PDATA_H */
-- 
2.19.1
[PATCH v2 net-next 1/2] net: phy: mdio-gpio: Add platform_data support for phy_mask
It is sometimes necessary to instantiate a bit-banging MDIO bus as a
platform device, without the aid of device tree.

When device tree is being used, the bus is not scanned for devices,
only those devices which are in device tree are probed. Without device
tree, by default, all addresses on the bus are scanned. This may then
find a device which is not a PHY, e.g. a switch. And the switch may
have registers containing values which look like a PHY. So during the
scan, a PHY device is wrongly created.

After the bus has been registered, a search is made for
mdio_board_info structures which indicate devices on the bus, and the
driver which should be used for them. This is typically used to
instantiate Ethernet switches from platform drivers. However, if the
scanning of the bus has created a PHY device at the same location as
indicated in the board info for a switch, the switch device is not
created, since the address is already busy.

This can be avoided by setting the phy_mask of the mdio bus. This mask
prevents addresses on the bus being scanned.
v2
--
int -> u32 in platform data structure

Signed-off-by: Andrew Lunn
---
 MAINTAINERS                             |  1 +
 drivers/net/phy/mdio-gpio.c             |  5 +++++
 include/linux/platform_data/mdio-gpio.h | 13 +++++++++++++
 3 files changed, 19 insertions(+)
 create mode 100644 include/linux/platform_data/mdio-gpio.h

diff --git a/MAINTAINERS b/MAINTAINERS
index fb88b6863d10..9d3b899f9ba2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5610,6 +5610,7 @@ F:	include/linux/of_net.h
 F:	include/linux/phy.h
 F:	include/linux/phy_fixed.h
 F:	include/linux/platform_data/mdio-bcm-unimac.h
+F:	include/linux/platform_data/mdio-gpio.h
 F:	include/trace/events/mdio.h
 F:	include/uapi/linux/mdio.h
 F:	include/uapi/linux/mii.h
diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c
index 0fbcedcdf6e2..1e296dd4067a 100644
--- a/drivers/net/phy/mdio-gpio.c
+++ b/drivers/net/phy/mdio-gpio.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include <linux/platform_data/mdio-gpio.h>
 #include
 #include
 #include
@@ -112,6 +113,7 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
 					  struct mdio_gpio_info *bitbang,
 					  int bus_id)
 {
+	struct mdio_gpio_platform_data *pdata = dev_get_platdata(dev);
 	struct mii_bus *new_bus;
 
 	bitbang->ctrl.ops = &mdio_gpio_ops;
@@ -128,6 +130,9 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
 	else
 		strncpy(new_bus->id, "gpio", MII_BUS_ID_SIZE);
 
+	if (pdata)
+		new_bus->phy_mask = pdata->phy_mask;
+
 	dev_set_drvdata(dev, new_bus);
 
 	return new_bus;
diff --git a/include/linux/platform_data/mdio-gpio.h b/include/linux/platform_data/mdio-gpio.h
new file mode 100644
index 000000000000..a5d5ff5e174c
--- /dev/null
+++ b/include/linux/platform_data/mdio-gpio.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * MDIO-GPIO bus platform data structure
+ */
+
+#ifndef __LINUX_MDIO_GPIO_PDATA_H
+#define __LINUX_MDIO_GPIO_PDATA_H
+
+struct mdio_gpio_platform_data {
+	u32 phy_mask;
+};
+
+#endif /* __LINUX_MDIO_GPIO_PDATA_H */
-- 
2.19.1
Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets
On Fri, 7 Dec 2018 13:21:08 -0800
Alexei Starovoitov wrote:

> for production I suspect the users would want
> an easy way to stay safe when they're playing with AF_XDP.
> So another builtin program that redirects ssh and ping traffic
> back to the kernel would be a nice addition.

Are you saying a builtin program that needs to parse different kinds of
Eth-type headers (DSA, VLAN, QinQ) and find the TCP port to match port 22
to return XDP_PASS, or else call the AF_XDP redirect? That seems to be pure
overhead for this fast-path builtin program for AF_XDP.

Would a solution be to install a NIC hardware filter that redirects SSH
port 22 to another RX-queue, and then have a builtin program that returns
XDP_PASS installed on that RX-queue? And change Björn's semantics, such
that RX-queue programs take precedence over the global XDP program.

This would also be a good fail-safe in general for XDP. If the RX-queues
take precedence, I can use this fail-safe approach. E.g. when I want to
test my new global XDP program, I'll use ethtool to match my management IP
and send that to a specific RX-queue and my fail-safe BPF program.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
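The hardware-filter half of this idea is plain ethtool n-tuple steering. A sketch of what that configuration could look like — the interface name and queue number are examples, and the NIC must support ntuple filters:

```shell
# Enable n-tuple filters, then steer inbound SSH to RX queue 1, where a
# fail-safe (XDP_PASS) per-queue program would live. eth0 and "action 1"
# are illustrative values.
ethtool -K eth0 ntuple on
ethtool -N eth0 flow-type tcp4 dst-port 22 action 1

# Inspect the installed filters:
ethtool -n eth0
```

The same `-N` syntax can match a management IP (`dst-ip`) for the "test my new global XDP program" scenario described above.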
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
Hi Eric,

> >> This patch is changing struct sk_buff, and is thus per-definition
> >> controversial.
> >>
> >> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> >> members (flags) head_frag and pfmemalloc, and not in between
> >> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> >> Copying mem_info during skb_clone() is required. This makes sure that
> >> pages are correctly freed or recycled during the altered
> >> skb_free_head() invocation.
> >
> > I read this to mean that this 'info' isn't accessed/needed until skb
> > is freed. Any reason it's not added at the end?
> >
> > This would avoid moving other fields that are probably accessed
> > more frequently during processing.
>
> But I do not get why the patch is needed.
>
> Adding extra cost for each skb destruction is costly.
>
> I thought XDP was all about _not_ having skbs.

You hit the only point I don't personally like in the code, xdp info in the
skb. Considering the benefits I am convinced this is ok, and here's why:

> Please let's do not slow down the non XDP stack only to make XDP more
> appealing.

We are not trying to do that; on the contrary. The patchset has nothing
towards speeding up XDP and slowing down anything else. The patchset speeds
up the mvneta driver on the default network stack. The only change that was
needed was to adapt the driver to using the page_pool API.

The speed improvements we are seeing on specific workloads (i.e. 256b <
packet < 400b) are almost 3x. Lots of high speed drivers are doing similar
recycling tricks themselves (and there's no common code; everyone is doing
something similar though). All we are trying to do is provide a unified API
to make that easier for the rest. Another advantage is that if some drivers
switch to the API, adding XDP functionality on them is pretty trivial.

Moreover, our tests are only performed on systems without an SMMU, or with
the SMMU disabled.
On a platform I have access to, enabling and disabling the SMMU has some
performance impact. By keeping the buffers mapped, we believe this impact
will be substantially less (I'll come back with results once I have them on
this). I do understand your point, but the potential advantages in my head
outweigh that by a lot.

I've got other concerns on the patchset though, like how much memory it is
'ok' to keep mapped, keeping in mind we are using the streaming DMA API.
Are we going to affect anyone else negatively by doing so?

Thanks
/Ilias
Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets
On Fri, 7 Dec 2018 15:01:55 +0100 Björn Töpel wrote: > Den fre 7 dec. 2018 kl 14:42 skrev Jesper Dangaard Brouer : > > > > On Fri, 7 Dec 2018 12:44:24 +0100 > > Björn Töpel wrote: > > > > > The rationale behind attach is performance and ease of use. Many XDP > > > socket users just need a simple way of creating/binding a socket and > > > receiving frames right away without loading an XDP program. > > > > > > XDP_ATTACH adds a mechanism we call "builtin XDP program" that simply > > > is a kernel provided XDP program that is installed to the netdev when > > > XDP_ATTACH is being passed as a bind() flag. > > > > > > The builtin program is the simplest program possible to redirect a > > > frame to an attached socket. In restricted C it would look like this: > > > > > > SEC("xdp") > > > int xdp_prog(struct xdp_md *ctx) > > > { > > > return bpf_xsk_redirect(ctx); > > > } > > > > > > The builtin program loaded via XDP_ATTACH behaves, from an > > > install-to-netdev/uninstall-from-netdev point of view, differently > > > from regular XDP programs. The easiest way to look at it is as a > > > 2-level hierarchy, where regular XDP programs has precedence over the > > > builtin one. > > > > > > If no regular XDP program is installed to the netdev, the builtin will > > > be install. If the builtin program is installed, and a regular is > > > installed, regular XDP program will have precedence over the builtin > > > one. > > > > > > Further, if a regular program is installed, and later removed, the > > > builtin one will automatically be installed. > > > > > > The sxdp_flags field of struct sockaddr_xdp gets two new options > > > XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which maps to the > > > corresponding XDP netlink install flags. > > > > > > The builtin XDP program functionally adds even more complexity to the > > > already hard to read dev_change_xdp_fd. 
Maybe it would be simpler to
> > > store the program in the struct net_device together with the install
> > > flags instead of calling the ndo_bpf multiple times?
> >
> > (As far as I can see from reading the code, correct me if I'm wrong.)
> >
> > If an AF_XDP program uses XDP_ATTACH, then it installs the
> > builtin-program as the XDP program on the "entire" device. That means
> > all RX-queues will call this XDP-bpf program (indirect call), and it is
> > actually only relevant for the specific queue_index. Yes, the helper
> > call does check that the 'xdp->rxq->queue_index' for an attached 'xsk'
> > and return XDP_PASS if it is NULL:
> >
>
> Yes, you are correct. The builtin XDP program, just like a regular XDP
> program, affects the whole netdev. So, yes the non-AF_XDP queues would
> get a performance hit from this. Just to reiterate -- this isn't new
> for this series. This has always been the case for XDP when acting on
> just one queue.
>
> > +BPF_CALL_1(bpf_xdp_xsk_redirect, struct xdp_buff *, xdp)
> > +{
> > +	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> > +	struct xdp_sock *xsk;
> > +
> > +	xsk = READ_ONCE(xdp->rxq->dev->_rx[xdp->rxq->queue_index].xsk);
> > +	if (xsk) {
> > +		ri->xsk = xsk;
> > +		return XDP_REDIRECT;
> > +	}
> > +
> > +	return XDP_PASS;
> > +}
> >
> > Why does every normal XDP_PASS packet have to pay this overhead (indirect
> > call), when someone loads an AF_XDP socket program? The AF_XDP socket
> > is tied hard and only relevant to a specific RX-queue (which is why we
> > get a performance boost due to SPSC queues).
> >
> > I acknowledge there is a need for this, but this use-case shows there
> > is a need for attaching XDP programs per RX-queue basis.
>
> From my AF_XDP perspective, having a program per queue would make
> sense. The discussion of a per-queue has been up before, and I think
> the conclusion was that it would be too complex from a
> configuration/tooling point-of-view.
Again, for AF_XDP this would be
> great.

I really think it is about time that we introduce these per-queue XDP
programs. From a configuration/tooling point-of-view, your proposed
solution has the same issues we have to solve (like listing introspection
needs to show the per-queue XDP/BPF programs). And you have almost solved
it, by keeping the per-queue program in the net_device struct.

Alexei already requested more types of builtin programs. But as the
XDP-driver API can only load XDP for the entire NIC, how can you use two
builtin programs at the same time?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On Sat, 8 Dec 2018 04:29:17 -0800 Eric Dumazet wrote: > On 12/08/2018 01:57 AM, Florian Westphal wrote: > > Jesper Dangaard Brouer wrote: > >> From: Ilias Apalodimas > >> > >> This patch is changing struct sk_buff, and is thus per-definition > >> controversial. > >> > >> Place a new member 'mem_info' of type struct xdp_mem_info, just after > >> members (flags) head_frag and pfmemalloc, And not in between > >> headers_start/end to ensure skb_copy() and pskb_copy() work as-is. > >> Copying mem_info during skb_clone() is required. This makes sure that > >> pages are correctly freed or recycled during the altered > >> skb_free_head() invocation. > > > > I read this to mean that this 'info' isn't accessed/needed until skb > > is freed. Any reason its not added at the end? > > > > This would avoid moving other fields that are probably accessed > > more frequently during processing. > > > > But I do not get why the patch is needed. > > Adding extra cost for each skb destruction is costly. > > I though XDP was all about _not_ having skbs. > > Please let's do not slow down the non XDP stack only to make XDP more > appealing. In general this work is about better cooperation between XDP and netstack. The patch is needed because the page_pool is keeping pages DMA mapped and need a return hook. I doubt that the extra compare-and-branch will affect your use-case on super-scalar CPUs. We (Ilias and I) are actually testing this stuff on low-end ARM64 in-order execution CPUs, which is actually nice as performance effects of our code changes are not hidden by speculative execution units. I'm enforcing (and Ilias agrees) that we do benchmark driven development. I actually invite people to monitor our progress here[1]. So, trust me, I am as concerned as you about any performance regression, and is vigilantly measuring this stuff. (This is more than you can say about a lot of the other stuff that gets accepted on this list). 
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/ -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next,v5 09/12] ethtool: add ethtool_rx_flow_spec to flow_rule structure translator
Thu, Dec 06, 2018 at 11:39:59PM CET, pa...@netfilter.org wrote: >This patch adds a function to translate the ethtool_rx_flow_spec >structure to the flow_rule representation. > >This allows us to reuse code from the driver side given that both flower >and ethtool_rx_flow interfaces use the same representation. > >This patch also includes support for the flow type flags FLOW_EXT, >FLOW_MAC_EXT and FLOW_RSS. > >The ethtool_rx_flow_spec_input wrapper structure is used to convey the >rss_context field, that is away from the ethtool_rx_flow_spec structure, >and the ethtool_rx_flow_spec structure. > >Signed-off-by: Pablo Neira Ayuso Acked-by: Jiri Pirko
Re: [PATCH net-next,v5 08/12] flow_offload: add wake-up-on-lan and queue to flow_action
Thu, Dec 06, 2018 at 11:39:58PM CET, pa...@netfilter.org wrote: >These actions need to be added to support the ethtool_rx_flow interface. >The queue action includes a field to specify the RSS context, that is >set via FLOW_RSS flow type flag and the rss_context field in struct >ethtool_rxnfc, plus the corresponding queue index. FLOW_RSS implies that >rss_context is non-zero, therefore, queue.ctx == 0 means that FLOW_RSS >was not set. > >Signed-off-by: Pablo Neira Ayuso Acked-by: Jiri Pirko
Re: [PATCH net-next,v5 06/12] drivers: net: use flow action infrastructure
Thu, Dec 06, 2018 at 11:39:56PM CET, pa...@netfilter.org wrote: >This patch updates drivers to use the new flow action infrastructure. > >Signed-off-by: Pablo Neira Ayuso Acked-by: Jiri Pirko
Re: [PATCH net-next,v5 07/12] cls_flower: don't expose TC actions to drivers anymore
Thu, Dec 06, 2018 at 11:39:57PM CET, pa...@netfilter.org wrote: >Now that drivers have been converted to use the flow action >infrastructure, remove this field from the tc_cls_flower_offload >structure. > >Signed-off-by: Pablo Neira Ayuso Acked-by: Jiri Pirko
Re: [PATCH net-next,v5 05/12] flow_offload: add statistics retrieval infrastructure and use it
Thu, Dec 06, 2018 at 11:39:55PM CET, pa...@netfilter.org wrote: >This patch provides the flow_stats structure that acts as container for >tc_cls_flower_offload, then we can use to restore the statistics on the >existing TC actions. Hence, tcf_exts_stats_update() is not used from >drivers anymore. > >Signed-off-by: Pablo Neira Ayuso Acked-by: Jiri Pirko
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On 12/08/2018 04:29 AM, Eric Dumazet wrote:
>
> But I do not get why the patch is needed.
>
> Adding extra cost for each skb destruction is costly.
>
> I thought XDP was all about _not_ having skbs.
>
> Please let's not slow down the non-XDP stack only to make XDP more appealing.
>

I have a similar concern with napi_consume_skb(): this added a cost for
locally generated TCP traffic, since most TX packets are fast clones.
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On 12/08/2018 01:57 AM, Florian Westphal wrote:
> Jesper Dangaard Brouer wrote:
>> From: Ilias Apalodimas
>>
>> This patch is changing struct sk_buff, and is thus per-definition
>> controversial.
>>
>> Place a new member 'mem_info' of type struct xdp_mem_info, just after
>> members (flags) head_frag and pfmemalloc, And not in between
>> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
>> Copying mem_info during skb_clone() is required. This makes sure that
>> pages are correctly freed or recycled during the altered
>> skb_free_head() invocation.
>
> I read this to mean that this 'info' isn't accessed/needed until skb
> is freed. Any reason it's not added at the end?
>
> This would avoid moving other fields that are probably accessed
> more frequently during processing.

But I do not get why the patch is needed.

Adding extra cost for each skb destruction is costly.

I thought XDP was all about _not_ having skbs.

Please let's not slow down the non-XDP stack only to make XDP more appealing.
Re: [PATCH 5/5] net: dsa: ksz: Add Microchip KSZ8795 DSA driver
> +static int ksz8795_valid_dyn_entry(struct ksz_device *dev, u8 *data)
> +{
> +	int timeout = 100;
> +
> +	do {
> +		ksz_read8(dev, REG_IND_DATA_CHECK, data);
> +		timeout--;
> +	} while ((*data & DYNAMIC_MAC_TABLE_NOT_READY) && timeout);

readx_poll_timeout()?

> +static inline void ksz8795_from_vlan(u16 vlan, u8 *fid, u8 *member, u8 *valid)

Please don't use inline in C code, just in headers. Leave the compiler to
decide if it should be inlined.

> +static void ksz8795_r_vlan_table(struct ksz_device *dev, u16 vid, u16 *vlan)
> +{
> +	u64 buf;
> +	u16 *data = (u16 *)
> +	u16 addr;
> +	int index;

The networking code uses reverse christmas tree. So you need to change the
order of these declarations, and do the assignment in the body of the
function. Please review all the functions.

> +static const u8 stp_multicast_addr[] = {
> +	0x01, 0x80, 0xC2, 0x00, 0x00, 0x00
> +};

include/linux/etherdevice.h defines eth_stp_addr.

Andrew
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On Sat, 8 Dec 2018 10:57:58 +0100 Florian Westphal wrote:
> Jesper Dangaard Brouer wrote:
> > From: Ilias Apalodimas
> >
> > This patch is changing struct sk_buff, and is thus per-definition
> > controversial.
> >
> > Place a new member 'mem_info' of type struct xdp_mem_info, just after
> > members (flags) head_frag and pfmemalloc, And not in between
> > headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> > Copying mem_info during skb_clone() is required. This makes sure that
> > pages are correctly freed or recycled during the altered
> > skb_free_head() invocation.
>
> I read this to mean that this 'info' isn't accessed/needed until skb
> is freed. Any reason it's not added at the end?
>
> This would avoid moving other fields that are probably accessed
> more frequently during processing.

It is placed here because it is close to skb->head_frag, which is also
used when the SKB is freed. This is the reason, both because I think we
can move the skb->head_frag bit into mem_info, and due to cacheline
accesses (e.g. skb->cloned and the destructor pointer are also read,
and they sit on that cache-line).

The annoying part is actually that, depending on the kernel config
options CONFIG_XFRM, CONFIG_NF_CONNTRACK and CONFIG_BRIDGE_NETFILTER,
there may be a cache-line split where mem_info gets moved into the next
cacheline.

I do like the idea of moving it to the end, iff we also move
skb->head_frag into mem_info, as skb->head (on the end-cacheline) is
also accessed during free. But given we still need to read the
cache-line containing skb->{cloned,destructor}, I don't think it will
be a win. I think the skb->cloned value is read during the lifetime of
the SKB.

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH 3/5] net: dsa: ksz: Factor out common tag code
On Fri, Dec 07, 2018 at 07:18:43PM +0100, Marek Vasut wrote:
> From: Tristram Ha
>
> Factor out common code from the tag_ksz, so that the code can be used
> with other KSZ family switches which use differently sized tags.

I prefer this implementation over what Tristram recently submitted. It
is also what we suggested a while back. However, since then we have had
Spectre/Meltdown, and we now know a function call through a pointer is
expensive. This is the hot path, every frame comes through here, so it
is worth taking the time to optimize this.

Could you try to remove the ksz_tag_ops structure. xmit looks simple,
since it is a tail call, so you can do that in ksz9477_xmit. Receive
looks a bit more complex.

I also think for the moment we need to ignore PTP until we have the big
picture sorted out. If need be, the code can be refactored yet again.
But i don't want PTP holding up getting more switches supported.

> Signed-off-by: Tristram Ha
> Signed-off-by: Marek Vasut
> Cc: Vivien Didelot
> Cc: Woojung Huh
> Cc: David S. Miller
> ---
> net/dsa/tag_ksz.c | 125 +-
> 1 file changed, 90 insertions(+), 35 deletions(-)
>
> diff --git a/net/dsa/tag_ksz.c b/net/dsa/tag_ksz.c
> index 036bc62198f28..d94bad1ab7e53 100644
> --- a/net/dsa/tag_ksz.c
> +++ b/net/dsa/tag_ksz.c
> @@ -14,34 +14,30 @@
> #include
> #include "dsa_priv.h"
>
> -/* For Ingress (Host -> KSZ), 2 bytes are added before FCS.
> - *
> - * ---
> - * DA(6bytes)|SA(6bytes)||Data(nbytes)|tag0(1byte)|tag1(1byte)|FCS(4bytes)
> - * ---
> - * tag0 : Prioritization (not used now)
> - * tag1 : each bit represents port (eg, 0x01=port1, 0x02=port2, 0x10=port5)
> - *
> - * For Egress (KSZ -> Host), 1 byte is added before FCS.
> - *
> - * ---
> - * DA(6bytes)|SA(6bytes)||Data(nbytes)|tag0(1byte)|FCS(4bytes)
> - * ---
> - * tag0 : zero-based value represents port
> - * (eg, 0x00=port1, 0x02=port3, 0x06=port7)
> - */
> +struct ksz_tag_ops {
> +	void	(*get_tag)(u8 *tag, unsigned int *port, unsigned int *len);
> +	void	(*set_tag)(struct sk_buff *skb, struct net_device *dev);
> +};
>
> -#define KSZ_INGRESS_TAG_LEN	2
> -#define KSZ_EGRESS_TAG_LEN	1
> +/* Typically only one byte is used for tail tag. */
> +#define KSZ_EGRESS_TAG_LEN	1
>
> -static struct sk_buff *ksz_xmit(struct sk_buff *skb, struct net_device *dev)
> +/* Frames with following addresses may need to be sent even when the port is
> + * closed.
> + */
> +static const u8 special_mult_addr[] = {
> +	0x01, 0x80, 0xC2, 0x00, 0x00, 0x00
> +};
> +

This is new functionality, not part of refactoring the code. Please add
that as a new patch. Also, you can probably make use of
is_link_local_ether_addr().

Andrew
[PATCH net-next] ip: silence udp zerocopy smatch false positive
From: Willem de Bruijn

extra_uref is used in __ip(6)_append_data only if uarg is set. Smatch
sees that the variable is passed to sock_zerocopy_put_abort. This
function accesses it only when uarg is set, but smatch cannot infer
this. Make this dependency explicit.

Fixes: 52900d22288e ("udp: elide zerocopy operation in hot path")
Signed-off-by: Willem de Bruijn
---
 net/ipv4/ip_output.c  | 3 ++-
 net/ipv6/ip6_output.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 78f028bdad30b..ab6618036afed 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1130,7 +1130,8 @@ static int __ip_append_data(struct sock *sk,
 error_efault:
 	err = -EFAULT;
 error:
-	sock_zerocopy_put_abort(uarg, extra_uref);
+	if (uarg)
+		sock_zerocopy_put_abort(uarg, extra_uref);
 	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6592a39e5ec10..be25b34dc5c2f 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1575,7 +1575,8 @@ static int __ip6_append_data(struct sock *sk,
 error_efault:
 	err = -EFAULT;
 error:
-	sock_zerocopy_put_abort(uarg, extra_uref);
+	if (uarg)
+		sock_zerocopy_put_abort(uarg, extra_uref);
 	cork->length -= length;
 	IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
-- 
2.20.0.rc2.403.gdbc3b29805-goog
Re: [PATCH 1/5] net: dsa: ksz: Add MIB counter reading support
> +static void ksz9477_phy_setup(struct ksz_device *dev, int port,
> +			      struct phy_device *phy)
> +{
> +	if (port < dev->phy_port_cnt) {
> +		/* SUPPORTED_Asym_Pause and SUPPORTED_Pause can be removed to
> +		 * disable flow control when rate limiting is used.
> +		 */
> +		phy->advertising = phy->supported;
> +	}
> +}
> +

This has nothing to do with MIB counters.

Other than that, is this the same as Tristram's recent submission? Maybe
once the comments have been addressed, we can merge that version?

	Thanks
		Andrew
Re: [PATCH V2] net: dsa: ksz: Add reset GPIO handling
> This actually is an individual patch, it doesn't depend on anything.
> Or do you mean a series with the DT documentation change ?

Yes, i mean together with the DT documentation change. Those two belong
together, they are one functional change.

Part of this is also to do with scalability. It takes less effort to
merge one patchset of two patches than two individual patches. The
truth is, developer time is cheap, maintainer time is expensive, so the
process is optimized towards making the maintainer's life easy. So
sometimes you do combine orthogonal changes together into one patchset,
if there is a higher purpose, eg. adding support for a new device on a
new board. However, given the situation of two overlapping patchsets,
it might be better to submit smaller patchsets.

Andrew
Re: [PATCH] net: dsa: ksz: Fix port membership
On Sat, Dec 08, 2018 at 05:23:25AM +0100, Marek Vasut wrote:
> On 12/08/2018 01:13 AM, tristram...@microchip.com wrote:
> >> Do you have a git tree with all the KSZ patches based on -next
> >> somewhere, so I don't have to look for them in random MLs ?
> >
> > I just sent it this Monday and the subject for that patch is
> > "[PATCH RFC 6/6] net: dsa: microchip: Add switch offload forwarding
> > support."
>
> Is all that collected in some git tree somewhere, so I don't have to
> look for various patches in varying states of decay throughout the ML?

Hi Marek, Tristram

It is unfortunate that we have two patchsets being submitted at the
same time, with overlapping functionality. Some degree of cooperation
is needed to handle this.

Could i ask you both to publish a git tree of your patches. And then
figure out how to submit them. Maybe as smaller sets, with the easy,
non-overlapping bits first?

Andrew
RE: [PATCH bpf-next 7/7] samples: bpf: add support for XDP_ATTACH to xdpsock
Hi:

Seems there are only 6 patches of the patch set in patchwork
https://patchwork.ozlabs.org/project/netdev/list/?submitter=70569
https://patchwork.ozlabs.org/project/netdev/list/?series=80389

Anyone can help to check why patch 7/7 is missing?

Thanks
Qi

> -----Original Message-----
> From: Björn Töpel [mailto:bjorn.to...@gmail.com]
> Sent: Friday, December 7, 2018 7:45 PM
> To: bjorn.to...@gmail.com; Karlsson, Magnus ;
>     magnus.karls...@gmail.com; a...@kernel.org; dan...@iogearbox.net;
>     netdev@vger.kernel.org
> Cc: Topel, Bjorn ; bro...@redhat.com; u9012...@gmail.com; Zhang, Qi Z
> Subject: [PATCH bpf-next 7/7] samples: bpf: add support for XDP_ATTACH to xdpsock
>
> From: Björn Töpel
>
> Teach the sample xdpsock application about XDP_ATTACH.
>
> Signed-off-by: Björn Töpel
> ---
> samples/bpf/xdpsock_user.c | 108 +++--
> 1 file changed, 67 insertions(+), 41 deletions(-)
>
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> index 57ecadc58403..12f908b60468 100644
> --- a/samples/bpf/xdpsock_user.c
> +++ b/samples/bpf/xdpsock_user.c
> @@ -633,7 +633,8 @@ static void int_exit(int sig)
> {
> 	(void)sig;
> 	dump_stats();
> -	bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
> +	if (!(opt_xdp_bind_flags & XDP_ATTACH))
> +		bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
> 	exit(EXIT_SUCCESS);
> }
>
> @@ -650,6 +651,7 @@ static struct option long_options[] = {
> 	{"interval", required_argument, 0, 'n'},
> 	{"zero-copy", no_argument, 0, 'z'},
> 	{"copy", no_argument, 0, 'c'},
> +	{"attach", no_argument, 0, 'a'},
> 	{0, 0, 0, 0}
> };
>
> @@ -670,6 +672,7 @@ static void usage(const char *prog)
> 	"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
> 	"  -z, --zero-copy	Force zero-copy mode.\n"
> 	"  -c, --copy		Force copy mode.\n"
> +	"  -a, --attach		XDP_ATTACH mode.\n"
> 	"\n";
> 	fprintf(stderr, str, prog);
> 	exit(EXIT_FAILURE);
> @@ -682,7 +685,7 @@ static void parse_command_line(int argc, char **argv)
> 	opterr = 0;
>
> 	for (;;) {
> -		c = getopt_long(argc, argv, "rtli:q:psSNn:cz", long_options,
> +		c = getopt_long(argc, argv, "rtli:q:psSNn:cza", long_options,
> 				_index);
> 		if (c == -1)
> 			break;
> @@ -725,11 +728,22 @@ static void parse_command_line(int argc, char **argv)
> 	case 'c':
> 		opt_xdp_bind_flags |= XDP_COPY;
> 		break;
> +	case 'a':
> +		opt_xdp_bind_flags |= XDP_ATTACH;
> +		break;
> 	default:
> 		usage(basename(argv[0]));
> 	}
> 	}
>
> +	if (opt_xdp_bind_flags & XDP_ATTACH) {
> +		if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
> +			opt_xdp_bind_flags |= XDP_BUILTIN_DRV_MODE;
> +		if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
> +			opt_xdp_bind_flags |= XDP_BUILTIN_SKB_MODE;
> +
> +	}
> +
> 	opt_ifindex = if_nametoindex(opt_if);
> 	if (!opt_ifindex) {
> 		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
> @@ -903,7 +917,7 @@ int main(int argc, char **argv)
> 	struct bpf_prog_load_attr prog_load_attr = {
> 		.prog_type = BPF_PROG_TYPE_XDP,
> 	};
> -	int prog_fd, qidconf_map, xsks_map;
> +	int prog_fd, qidconf_map, xsks_map = -1;
> 	struct bpf_object *obj;
> 	char xdp_filename[256];
> 	struct bpf_map *map;
> @@ -918,59 +932,71 @@ int main(int argc, char **argv)
> 		exit(EXIT_FAILURE);
> 	}
>
> -	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
> -	prog_load_attr.file = xdp_filename;
>
> -	if (bpf_prog_load_xattr(_load_attr, , _fd))
> -		exit(EXIT_FAILURE);
> -	if (prog_fd < 0) {
> -		fprintf(stderr, "ERROR: no program found: %s\n",
> -			strerror(prog_fd));
> -		exit(EXIT_FAILURE);
> -	}
> +	if (!(opt_xdp_bind_flags & XDP_ATTACH)) {
> +		snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o",
> +			 argv[0]);
> +		prog_load_attr.file = xdp_filename;
>
> +		if (bpf_prog_load_xattr(_load_attr, , _fd))
> +			exit(EXIT_FAILURE);
> +		if (prog_fd < 0) {
> +			fprintf(stderr, "ERROR: no program found: %s\n",
> +				strerror(prog_fd));
> +			exit(EXIT_FAILURE);
> +		}
>
> -	map = bpf_object__find_map_by_name(obj, "qidconf_map");
> -	qidconf_map = bpf_map__fd(map);
> -	if (qidconf_map < 0) {
> -		fprintf(stderr, "ERROR: no qidconf map found: %s\n",
> -
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
Jesper Dangaard Brouer wrote:
> From: Ilias Apalodimas
>
> This patch is changing struct sk_buff, and is thus per-definition
> controversial.
>
> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> members (flags) head_frag and pfmemalloc, And not in between
> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> Copying mem_info during skb_clone() is required. This makes sure that
> pages are correctly freed or recycled during the altered
> skb_free_head() invocation.

I read this to mean that this 'info' isn't accessed/needed until skb is
freed. Any reason it's not added at the end?

This would avoid moving other fields that are probably accessed more
frequently during processing.
Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets
On Fri, 7 Dec 2018 at 22:21, Alexei Starovoitov wrote:
>
> On Fri, Dec 07, 2018 at 12:44:24PM +0100, Björn Töpel wrote:
> > From: Björn Töpel
> >
> > Hi!
> >
> > This patch set adds support for a new XDP socket bind option,
> > XDP_ATTACH.
> >
> > The rationale behind attach is performance and ease of use. Many XDP
> > socket users just need a simple way of creating/binding a socket and
> > receiving frames right away without loading an XDP program.
> >
> > XDP_ATTACH adds a mechanism we call "builtin XDP program" that simply
> > is a kernel provided XDP program that is installed to the netdev when
> > XDP_ATTACH is being passed as a bind() flag.
> >
> > The builtin program is the simplest program possible to redirect a
> > frame to an attached socket. In restricted C it would look like this:
> >
> > SEC("xdp")
> > int xdp_prog(struct xdp_md *ctx)
> > {
> >         return bpf_xsk_redirect(ctx);
> > }
> >
> > The builtin program loaded via XDP_ATTACH behaves, from an
> > install-to-netdev/uninstall-from-netdev point of view, differently
> > from regular XDP programs. The easiest way to look at it is as a
> > 2-level hierarchy, where regular XDP programs has precedence over the
> > builtin one.
>
> The feature makes sense to me.
> May be XDP_ATTACH_BUILTIN would be a better name ?

Yes, agree, or maybe XDP_BUILTIN_ATTACH? Regardless, I'll change the
name for the next revision.

> Also I think it needs another parameter to say which builtin
> program to use.

Yup, I had a plan to add the parameter when it's actually more than
*one* builtin, but you're right, let's do it right away. I'll add a
builtin prog enum field to the struct sockaddr_xdp.

> This unconditional xsk_redirect is fine for performance
> benchmarking, but for production I suspect the users would want
> an easy way to stay safe when they're playing with AF_XDP.

For setups that don't direct the flows explicitly by HW filters, yes!

> So another builtin program that redirects ssh and ping traffic
> back to the kernel would be a nice addition.

I suspect AF_XDP users would prefer redirecting packets to the kernel
via the CPUMAP instead of XDP_PASS -- not paying for the ip stack on
the AF_XDP core. Another builtin would be a tcpdump-like behavior, but
that would require an XDP clone (which Magnus is actually experimenting
with!).

I'll address your input and get back with a new revision. Thanks for
spending time on the series!

Björn
Re: [PATCH net-next,v5 04/12] cls_api: add translator to flow_action representation
Thu, Dec 06, 2018 at 11:39:54PM CET, pa...@netfilter.org wrote:
>This patch implements a new function to translate from native TC action
>to the new flow_action representation. Moreover, this patch also updates
>cls_flower to use this new function.
>
>Signed-off-by: Pablo Neira Ayuso

Acked-by: Jiri Pirko
Re: [PATCH net-next,v5 03/12] flow_offload: add flow action infrastructure
Thu, Dec 06, 2018 at 11:39:53PM CET, pa...@netfilter.org wrote:
>This new infrastructure defines the nic actions that you can perform
>from existing network drivers. This infrastructure allows us to avoid a
>direct dependency with the native software TC action representation.
>
>Signed-off-by: Pablo Neira Ayuso

Acked-by: Jiri Pirko
Re: [PATCH net-next,v5 02/12] net/mlx5e: support for two independent packet edit actions
Thu, Dec 06, 2018 at 11:39:52PM CET, pa...@netfilter.org wrote:
>This patch adds pedit_headers_action structure to store the result of
>parsing tc pedit actions. Then, it calls alloc_tc_pedit_action() to
>populate the mlx5e hardware intermediate representation once all actions
>have been parsed.
>
>This patch comes in preparation for the new flow_action infrastructure,
>where each packet mangling comes in an separated action, ie. not packed
>as in tc pedit.
>
>Signed-off-by: Pablo Neira Ayuso

Acked-by: Jiri Pirko
Re: [PATCH net-next,v5 01/12] flow_offload: add flow_rule and flow_match structures and use them
Thu, Dec 06, 2018 at 11:39:51PM CET, pa...@netfilter.org wrote:
>This patch wraps the dissector key and mask - that flower uses to
>represent the matching side - around the flow_match structure.
>
>To avoid a follow up patch that would edit the same LoCs in the drivers,
>this patch also wraps this new flow match structure around the flow rule
>object. This new structure will also contain the flow actions in follow
>up patches.
>
>This introduces two new interfaces:
>
>  bool flow_rule_match_key(rule, dissector_id)
>
>that returns true if a given matching key is set on, and:
>
>  flow_rule_match_XYZ(rule, );
>
>To fetch the matching side XYZ into the match container structure, to
>retrieve the key and the mask with one single call.
>
>Signed-off-by: Pablo Neira Ayuso

Acked-by: Jiri Pirko