Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
On 28 Jul 2020, at 13:27, Christoph Hellwig wrote:

> On Tue, Jul 28, 2020 at 01:18:48PM -0400, Chris Mason wrote:
>>> come after in the future.
>>
>> Jonathan, I think we need to do a better job talking about patches
>> that are just meant to enable possible users vs patches that we
>> actually hope the upstream kernel to take. Obviously code that only
>> supports out of tree drivers isn't a good fit for the upstream kernel.
>> From the point of view of experimenting with these patches, GPUs
>> benefit a lot from this functionality so I think it does make sense to
>> have the enabling patches somewhere, just not in this series.
>
> Sorry, but his crap is built only for this use case, and that is what
> really pissed people off as it very much looks intentional.

No, we've had workloads asking for better zero copy solutions for ages. The goal is to address both this specialized workload and the general case zero copy tx/rx.

-chris
Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
On 28 Jul 2020, at 12:31, Greg KH wrote:

> On Mon, Jul 27, 2020 at 03:44:44PM -0700, Jonathan Lemon wrote:
>> From: Jonathan Lemon
>>
>> This provides the interface between the netgpu core module and the
>> nvidia kernel driver. This should be built as an external module,
>> pointing to the nvidia build. For example:
>>
>>   export NV_PACKAGE_DIR=/w/nvidia/NVIDIA-Linux-x86_64-440.64
>>
>>   make -C ${kdir} M=`pwd` O=obj $*
>
> Ok, now you are just trolling us.
>
> Nice job, I shouldn't have read the previous patches.
>
> Please, go get a lawyer to sign-off on this patch, with their corporate
> email address on it. That's the only way we could possibly consider
> something like this.
>
> Oh, and we need you to use your corporate email address too, as you are
> not putting copyright notices on this code, we will need to know who to
> come after in the future.

Jonathan, I think we need to do a better job talking about patches that are just meant to enable possible users vs patches that we actually hope the upstream kernel to take. Obviously code that only supports out of tree drivers isn't a good fit for the upstream kernel. From the point of view of experimenting with these patches, GPUs benefit a lot from this functionality so I think it does make sense to have the enabling patches somewhere, just not in this series.

We're finding it more common to have pcie switch hops between a [ GPU, NIC ] pair and the CPU, which gives a huge advantage to out of tree drivers or extensions that can DMA directly between the GPU/NIC without having to copy through the CPU. I'd love to have an alternative built on TCP because that's where we invest the vast majority of our tuning, security and interoperability testing. It's just more predictable overall. This isn't a new story, but if we can layer on APIs that enable this cleanly for in-tree drivers, we can work with the vendors to use better supported APIs and have a more stable kernel.

Obviously this is an RFC and there's a long road ahead, but as long as the upstream kernel doesn't provide an answer, out of tree drivers are going to fill in the weak spots. Other possible use cases would also include other GPUs or my favorite: NVME <-> filesystem <-> NIC with io_uring driving the IO and without copies.

-chris
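For context, the general-case zero copy tx path that is already in-tree is MSG_ZEROCOPY (kernel 4.14+). A minimal userspace sketch, assuming fd is an already-connected TCP socket, with error handling trimmed:

#include <errno.h>
#include <linux/errqueue.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hedged sketch: with MSG_ZEROCOPY the buffer's pages are pinned and
 * referenced by the stack rather than copied, so buf must stay
 * untouched until a completion arrives on the socket's error queue. */
static ssize_t send_zc(int fd, const void *buf, size_t len)
{
	int one = 1;
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
		return -errno;

	return sendmsg(fd, &msg, MSG_ZEROCOPY);
}

/* Reap one completion: a cmsg carrying struct sock_extended_err with
 * ee_origin == SO_EE_ORIGIN_ZEROCOPY and the finished send range in
 * [ee_info, ee_data]. */
static int reap_zc(int fd)
{
	char control[100];
	struct msghdr msg = { .msg_control = control,
			      .msg_controllen = sizeof(control) };

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return -errno;
	return 0;
}

The pin-until-completion contract is part of why this only pays off for large sends, and why device (GPU) memory wants the different registration model this RFC proposes.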
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On 6 Mar 2018, at 11:12, Linus Torvalds wrote:

> On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov wrote:
>> As the first step in development of bpfilter project [1] the
>> request_module() code is extended to allow user mode helpers to be
>> invoked. Idea is that user mode helpers are built as part of the
>> kernel build and installed as traditional kernel modules with .ko file
>> extension into distro specified location, such that from a
>> distribution point of view, they are no different than regular kernel
>> modules. Thus, allow request_module() logic to load such user mode
>> helper (umh) modules via: [,,]
>
> I like this, but I have one request: can we make sure that this action
> is visible in the system messages?
>
> When we load a regular module, at least it shows in lsmod afterwards,
> although I have a few times wanted to really see module load as an
> event in the logs too. When we load a module that just executes a user
> program, and there is no sign of it in the module list, I think we
> *really* need to make that event show to the admin some way.
>
> .. and yes, maybe we'll need to rate-limit the messages, and maybe it
> turns out that I'm entirely wrong and people will hate the messages
> after they get used to the concept of these pseudo-modules, but
> particularly for the early implementation when this is a new thing, I
> really want a message like "executed user process xyz-abc as a
> pseudo-module" or something in dmesg. I do *not* want this to be a
> magical way to hide things.

Especially early on, this makes a lot of sense. But I wanted to plug bps and the hopefully growing set of bpf introspection tools:

https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt

Long term these are probably a good place to tell the admin what's going on.

-chris
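A hypothetical sketch of the message Linus is asking for; the function name and call site are illustrative, not code from the patch under discussion:

#include <linux/printk.h>
#include <linux/types.h>

/* Hypothetical only: make every umh "pseudo-module" exec visible in
 * dmesg, rate-limited as Linus suggests so a respawning helper cannot
 * flood the logs. */
static void report_umh_exec(const char *name, pid_t pid)
{
	pr_info_ratelimited("umh: executed user process %s (pid %d) as a pseudo-module\n",
			    name, pid);
}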
Re: [PATCH RFC 4/4] xfs: Transmit flow steering
On 08/30/2016 08:00 PM, Tom Herbert wrote:
> XFS maintains a per device flow table that is indexed by the skbuff
> hash. The XFS table is only consulted when there is no queue saved in
> a transmit socket for an skbuff.
>
> Each entry in the flow table contains a queue index and a queue
> pointer. The queue pointer is set when a queue is chosen using a flow
> table entry. This pointer is set to the head pointer in the transmit
> queue (which is maintained by BQL).
>
> The new function get_xfs_index looks up flows in the XFS table. The
> entry returned gives the last queue a matching flow used. The returned
> queue is compared against the normal XPS queue. If they are different,
> then we only switch if the tail pointer in the TX queue has advanced
> past the pointer saved in the entry. In this way OOO should be avoided
> when XPS wants to use a different queue.

I'd love for Dave Chinner to get some networking bug reports, but maybe we shouldn't call it XFS? At least CONFIG_XFS should be something else. It doesn't conflict now because we have CONFIG_XFS_FS, but even CONFIG_XFS_NET sounds like it's related to the filesystem instead of transmit flows.

[ Sorry, four patches in and all I do is complain about the name ]

-chris

> Signed-off-by: Tom Herbert
> ---
>  net/Kconfig    |  6 ++++++
>  net/core/dev.c | 93 ++++++++++++++++++++++++++++++++++++++----------
>  2 files changed, 84 insertions(+), 15 deletions(-)
>
> diff --git a/net/Kconfig b/net/Kconfig
> index 7b6cd34..5e3eddf 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -255,6 +255,12 @@ config XPS
>  	depends on SMP
>  	default y
>  
> +config XFS
> +	bool
> +	depends on XPS
> +	depends on BQL
> +	default y
> +
>  config HWBM
>  	bool
>
> ...
>
> -static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
> +/* Must be called with RCU read_lock */
> +static int get_xfs_index(struct net_device *dev, struct sk_buff *skb)
>  {
> -	struct sock *sk = skb->sk;
> -	int queue_index = sk_tx_queue_get(sk);
> +#ifdef CONFIG_XFS
> +	struct xps_dev_flow_table *flow_table;
> +	struct xps_dev_flow ent;
> +	int queue_index;
> +	struct netdev_queue *txq;
> +	u32 hash;
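Since the relevant hunk is truncated above, here is a sketch of the queue-switch test the changelog describes. The flow-entry layout and field names are assumptions; the dql counters are BQL's real head/tail pointers, which is why the config depends on BQL:

#include <linux/netdevice.h>

/* Hedged sketch of the OOO-avoidance rule: a flow may move to a new
 * queue only after the old queue has completed everything the flow
 * queued there.  BQL's dql.num_queued advances at enqueue (head) and
 * dql.num_completed at TX completion (tail). */
struct xfs_flow_sketch {
	u16 queue_index;	/* last queue this flow used */
	unsigned int queue_ptr;	/* dql.num_queued snapshot taken when
				 * the queue was chosen */
};

static bool xfs_flow_may_switch(const struct xfs_flow_sketch *ent,
				const struct netdev_queue *old_txq)
{
	/* Once the tail counter passes the head snapshot, every packet
	 * the flow put on the old queue has been sent, so picking a
	 * different queue cannot reorder the flow. */
	return (int)(old_txq->dql.num_completed - ent->queue_ptr) >= 0;
}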
Re: [RFC] net: use atomic allocation for order-3 page allocation
On 06/11/2015 05:22 PM, Eric Dumazet wrote:
> On Thu, 2015-06-11 at 17:16 -0400, Chris Mason wrote:
>> On 06/11/2015 04:48 PM, Eric Dumazet wrote:
>>
>> networking is asking for 32KB, and the MM layer is doing what it can
>> to provide it. Are the gains from getting 32KB contig bigger than the
>> cost of moving pages around if the MM has to actually go into
>> compaction? Should we start disk IO to give back 32KB contig?
>>
>> I think we want to tell the MM to compact in the background and give
>> networking 32KB if it happens to have it available. If not, fall back
>> to smaller allocations without doing anything expensive.
>
> Exactly my point. (And I mentioned this about 4 months ago)

Sorry, reading this again I wasn't very clear. I agree with Shaohua's patch because it is telling the allocator that we don't want to wait for reclaim or compaction to find contiguous pages.

But, is there any fallback to a single page allocation somewhere else? If this is the only way to get memory, we might want to add a single alloc_page path that won't trigger compaction but is at least able to wait for kswapd to make progress.

-chris
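A sketch of the fallback Chris is describing, using the 2015-era __GFP_WAIT flag; an illustration only, not the patch under discussion:

#include <linux/gfp.h>

/* Hedged sketch: try the high-order allocation atomically (no direct
 * reclaim, no compaction), then fall back to a single page that keeps
 * __GFP_WAIT and so may still sleep while kswapd makes progress. */
static struct page *frag_alloc_sketch(gfp_t gfp, unsigned int order)
{
	struct page *page = NULL;

	if (order)
		/* Clearing __GFP_WAIT forbids direct reclaim and
		 * compaction; this succeeds only if contiguous pages
		 * happen to be free right now. */
		page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
				   __GFP_NOWARN | __GFP_NORETRY, order);

	if (!page)
		/* Order-0 never triggers compaction, and with
		 * __GFP_WAIT intact it can wait for kswapd. */
		page = alloc_pages(gfp, 0);

	return page;
}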
Re: [RFC] net: use atomic allocation for order-3 page allocation
On 06/11/2015 04:48 PM, Eric Dumazet wrote:
> On Thu, 2015-06-11 at 13:24 -0700, Shaohua Li wrote:
>> We saw excessive memory compaction triggered by skb_page_frag_refill.
>> This causes performance issues. Commit 5640f7685831e0 introduces the
>> order-3 allocation to improve performance. But memory compaction has
>> high overhead. The benefit of order-3 allocation can't compensate for
>> the overhead of memory compaction.
>>
>> This patch makes the order-3 page allocation atomic. If there is no
>> memory pressure and memory isn't fragmented, the allocation will still
>> succeed, so we don't sacrifice the order-3 benefit here. If the atomic
>> allocation fails, compaction will not be triggered and we will fall
>> back to order-0 immediately.
>>
>> The mellanox driver does a similar thing; if this is accepted, we must
>> fix the driver too.
>>
>> Cc: Eric Dumazet
>> Signed-off-by: Shaohua Li
>> ---
>>  net/core/sock.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 292f422..e9855a4 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -1883,7 +1883,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>>
>>  	pfrag->offset = 0;
>>  	if (SKB_FRAG_PAGE_ORDER) {
>> -		pfrag->page = alloc_pages(gfp | __GFP_COMP |
>> +		pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
>> 					  __GFP_NOWARN | __GFP_NORETRY,
>> 					  SKB_FRAG_PAGE_ORDER);
>>  		if (likely(pfrag->page)) {
>
> This is not a specific networking issue, but an mm one.
>
> You really need to start a discussion with mm experts.
>
> Your changelog does not exactly explain what _is_ the problem.
>
> If the problem lies in the mm layer, it might be time to fix it,
> instead of working around the bug by never triggering it from this
> particular point, which is a safe point where a process is willing to
> wait a bit.
>
> Memory compaction is either working as intended, or not.
>
> If we enabled it but never run it because it hurts, what is the point
> of enabling it?

networking is asking for 32KB, and the MM layer is doing what it can to provide it. Are the gains from getting 32KB contig bigger than the cost of moving pages around if the MM has to actually go into compaction? Should we start disk IO to give back 32KB contig?

I think we want to tell the MM to compact in the background and give networking 32KB if it happens to have it available. If not, fall back to smaller allocations without doing anything expensive.

-chris