Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu

2020-07-28 Thread Chris Mason

On 28 Jul 2020, at 13:27, Christoph Hellwig wrote:


On Tue, Jul 28, 2020 at 01:18:48PM -0400, Chris Mason wrote:

come after in the future.


Jonathan, I think we need to do a better job talking about patches that 
are just meant to enable possible users vs patches that we actually hope 
the upstream kernel to take.  Obviously code that only supports out of 
tree drivers isn’t a good fit for the upstream kernel.  From the point 
of view of experimenting with these patches, GPUs benefit a lot from 
this functionality so I think it does make sense to have the enabling 
patches somewhere, just not in this series.


Sorry, but this crap is built only for this use case, and that is what
really pissed people off as it very much looks intentional.


No, we’ve had workloads asking for better zero copy solutions for 
ages.  The goal is to address both this specialized workload and the 
general case zero copy tx/rx.


-chris


Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu

2020-07-28 Thread Chris Mason

On 28 Jul 2020, at 12:31, Greg KH wrote:


On Mon, Jul 27, 2020 at 03:44:44PM -0700, Jonathan Lemon wrote:

From: Jonathan Lemon 

This provides the interface between the netgpu core module and the
nvidia kernel driver.  This should be built as an external module,
pointing to the nvidia build.  For example:

export NV_PACKAGE_DIR=/w/nvidia/NVIDIA-Linux-x86_64-440.64
make -C ${kdir} M=`pwd` O=obj $*


Ok, now you are just trolling us.

Nice job, I shouldn't have read the previous patches.

Please, go get a lawyer to sign-off on this patch, with their corporate 
email address on it.  That's the only way we could possibly consider 
something like this.

Oh, and we need you to use your corporate email address too, as you are 
not putting copyright notices on this code, we will need to know who to 
come after in the future.


Jonathan, I think we need to do a better job talking about patches that 
are just meant to enable possible users vs patches that we actually hope 
the upstream kernel to take.  Obviously code that only supports out of 
tree drivers isn’t a good fit for the upstream kernel.  From the point 
of view of experimenting with these patches, GPUs benefit a lot from 
this functionality so I think it does make sense to have the enabling 
patches somewhere, just not in this series.


We’re finding it more common to have PCIe switch hops between a [ GPU, 
NIC ] pair and the CPU, which gives a huge advantage to out of tree 
drivers or extensions that can DMA directly between the GPU/NIC without 
having to copy through the CPU.  I’d love to have an alternative built 
on TCP because that’s where we invest the vast majority of our tuning, 
security and interoperability testing.  It’s just more predictable 
overall.
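As an aside on the topology point, the kernel already has PCI 
peer-to-peer DMA helpers that encode this switch-distance reasoning.  A 
minimal sketch of how a driver could ask whether a NIC/GPU pair can DMA 
peer to peer (my illustration, not part of this series; nic_gpu_can_p2p 
is a made-up helper):

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/*
 * Ask the P2PDMA core whether the NIC can reach the GPU's memory
 * without bouncing through host RAM.  A negative distance means the
 * path (e.g. through a root complex that won't forward peer TLPs)
 * rules out P2P; a small non-negative distance typically means both
 * devices hang off the same PCIe switch.
 */
static bool nic_gpu_can_p2p(struct pci_dev *gpu, struct pci_dev *nic)
{
	struct device *client = &nic->dev;

	return pci_p2pdma_distance_many(gpu, &client, 1, true) >= 0;
}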


This isn’t a new story, but if we can layer on APIs that enable this 
cleanly for in-tree drivers, we can work with the vendors to use better 
supported APIs and have a more stable kernel.  Obviously this is an RFC 
and there’s a long road ahead, but as long as the upstream kernel 
doesn’t provide an answer, out of tree drivers are going to fill in 
the weak spots.


Other possible use cases would also include other GPUs or my 
favorite:


NVMe <-> filesystem <-> NIC with io_uring driving the IO and without 
copies.
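To make that flow concrete, a userspace sketch with liburing (my 
illustration, not part of this series): a linked read-then-send chain. 
It still copies through a user buffer, so it shows the chaining 
plumbing rather than the zero-copy part this work is after.

#include <liburing.h>

/*
 * Queue a file read linked to a socket send: the send runs only if
 * the read succeeds.  Note a short read would still issue a full
 * length send here; real code would handle that.
 */
static int send_file_chunk(struct io_uring *ring, int file_fd,
			   int sock_fd, void *buf, unsigned len, off_t off)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, file_fd, buf, len, off);
	sqe->flags |= IOSQE_IO_LINK;	/* chain to the next SQE */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_send(sqe, sock_fd, buf, len, 0);

	return io_uring_submit(ring);
}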


-chris


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Chris Mason

On 6 Mar 2018, at 11:12, Linus Torvalds wrote:

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  wrote:

As the first step in development of bpfilter project [1] the 
request_module() code is extended to allow user mode helpers to be 
invoked.  Idea is that user mode helpers are built as part of the 
kernel build and installed as traditional kernel modules with .ko file 
extension into distro specified location, such that from a distribution 
point of view, they are no different than regular kernel modules.  
Thus, allow request_module() logic to load such user mode helper (umh) 
modules via:

[,,]
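For reference, the rough shape of the umh loading described above, as I 
read the posted patches (symbol and struct names taken from the 
bpfilter example; details may differ from the final version):

#include <linux/umh.h>

/* ELF blob linked into the .ko by the kernel build */
extern char bpfilter_umh_start[];
extern char bpfilter_umh_end[];

static struct umh_info info;

static int __init load_umh(void)
{
	/*
	 * Exec the embedded blob as a user process; on success,
	 * info.pid and the info.pipe_{to,from}_umh file pointers
	 * are usable for talking to it.
	 */
	return fork_usermode_blob(bpfilter_umh_start,
				  bpfilter_umh_end - bpfilter_umh_start,
				  &info);
}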

I like this, but I have one request: can we make sure that this action
is visible in the system messages?

When we load a regular module, at least it shows in lsmod afterwards,
although I have a few times wanted to really see module load as an
event in the logs too.

When we load a module that just executes a user program, and there is
no sign of it in the module list, I think we *really* need to make
that event show to the admin some way.

.. and yes, maybe we'll need to rate-limit the messages, and maybe it
turns out that I'm entirely wrong and people will hate the messages
after they get used to the concept of these pseudo-modules, but
particularly for the early implementation when this is a new thing, I
really want a message like

 executed user process xyz-abc as a pseudo-module

or something in dmesg.

I do *not* want this to be a magical way to hide things.
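Something as small as this would cover it (illustrative only, nothing 
like it exists in the posted patch; rate limited per the caveat above):

#include <linux/printk.h>

/* Emitted at the point where the umh module's blob is execed. */
static void log_umh_exec(const char *name, pid_t pid)
{
	pr_notice_ratelimited("umh: executed user process %s as a pseudo-module (pid %d)\n",
			      name, pid);
}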


Especially early on, this makes a lot of sense.  But I wanted to plug 
bps and the hopefully growing set of bpf introspection tools:


https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt

Long term these are probably a good place to tell the admin what's going 
on.


-chris


Re: [PATCH RFC 4/4] xfs: Transmit flow steering

2016-08-31 Thread Chris Mason



On 08/30/2016 08:00 PM, Tom Herbert wrote:

XFS maintains a per device flow table that is indexed by the skbuff
hash. The XFS table is only consulted when there is no queue saved in
a transmit socket for an skbuff.

Each entry in the flow table contains a queue index and a queue
pointer. The queue pointer is set when a queue is chosen using a
flow table entry. This pointer is set to the head pointer in the
transmit queue (which is maintained by BQL).

The new function get_xfs_index() looks up flows in the XFS table.
The entry returned gives the last queue a matching flow used. The
returned queue is compared against the normal XPS queue. If they
are different, then we only switch if the tail pointer in the TX
queue has advanced past the pointer saved in the entry. In this
way OOO (out of order transmission) should be avoided when XPS
wants to use a different queue.



I'd love for Dave Chinner to get some networking bug reports, but maybe 
we shouldn't call it XFS?


At least CONFIG_XFS should be something else.  It doesn't conflict now 
because we have CONFIG_XFS_FS, but even CONFIG_XFS_NET sounds like it's 
related to the filesystem instead of transmit flows.


[ Sorry, four patches in and all I do is complain about the name ]

-chris


Signed-off-by: Tom Herbert 
---
 net/Kconfig    |  6 ++++++
 net/core/dev.c | 93 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 84 insertions(+), 15 deletions(-)

diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..5e3eddf 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -255,6 +255,12 @@ config XPS
 	depends on SMP
 	default y
 
+config XFS
+	bool
+	depends on XPS
+	depends on BQL
+	default y
+
 config HWBM
 	bool


...


-static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
+/* Must be called with RCU read_lock */
+static int get_xfs_index(struct net_device *dev, struct sk_buff *skb)
 {
-   struct sock *sk = skb->sk;
-   int queue_index = sk_tx_queue_get(sk);
+#ifdef CONFIG_XFS
+   struct xps_dev_flow_table *flow_table;
+   struct xps_dev_flow ent;
+   int queue_index;
+   struct netdev_queue *txq;
+   u32 hash;


Re: [RFC] net: use atomic allocation for order-3 page allocation

2015-06-11 Thread Chris Mason
On 06/11/2015 05:22 PM, Eric Dumazet wrote:
> On Thu, 2015-06-11 at 17:16 -0400, Chris Mason wrote:
>> On 06/11/2015 04:48 PM, Eric Dumazet wrote:

>>
>> networking is asking for 32KB, and the MM layer is doing what it can to
>> provide it.  Are the gains from getting 32KB contig bigger than the cost
>> of moving pages around if the MM has to actually go into compaction?
>> Should we start disk IO to give back 32KB contig?
>>
>> I think we want to tell the MM to compact in the background and give
>> networking 32KB if it happens to have it available.  If not, fall back
>> to smaller allocations without doing anything expensive.
> 
> Exactly my point. (And I mentioned this about 4 months ago)

Sorry, reading this again I wasn't very clear.  I agree with Shaohua's
patch because it is telling the allocator that we don't want to wait for
reclaim or compaction to find contiguous pages.

But, is there any fallback to a single page allocation somewhere else?
If this is the only way to get memory, we might want to add a single
alloc_page path that won't trigger compaction but is at least able to
wait for kswapd to make progress.

-chris






Re: [RFC] net: use atomic allocation for order-3 page allocation

2015-06-11 Thread Chris Mason
On 06/11/2015 04:48 PM, Eric Dumazet wrote:
> On Thu, 2015-06-11 at 13:24 -0700, Shaohua Li wrote:
>> We saw excessive memory compaction triggered by skb_page_frag_refill.
>> This causes performance issues. Commit 5640f7685831e0 introduces the
>> order-3 allocation to improve performance. But memory compaction has
>> high overhead. The benefit of order-3 allocation can't compensate for
>> the overhead of memory compaction.
>>
>> This patch makes the order-3 page allocation atomic. If there is no
>> memory pressure and memory isn't fragmented, the allocation will still
>> succeed, so we don't sacrifice the order-3 benefit here. If the atomic
>> allocation fails, compaction will not be triggered and we will fall
>> back to order-0 immediately.
>>
>> The mellanox driver does a similar thing; if this is accepted, we must
>> fix the driver too.
>>
>> Cc: Eric Dumazet 
>> Signed-off-by: Shaohua Li 
>> ---
>>  net/core/sock.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 292f422..e9855a4 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -1883,7 +1883,7 @@ bool skb_page_frag_refill(unsigned int sz, struct 
>> page_frag *pfrag, gfp_t gfp)
>>  
>>  pfrag->offset = 0;
>>  if (SKB_FRAG_PAGE_ORDER) {
>> -pfrag->page = alloc_pages(gfp | __GFP_COMP |
>> +pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
>>__GFP_NOWARN | __GFP_NORETRY,
>>SKB_FRAG_PAGE_ORDER);
>>  if (likely(pfrag->page)) {
> 
> This is not a specific networking issue, but mm one.
> 
> You really need to start a discussion with mm experts.
> 
> Your changelog does not exactly explains what _is_ the problem.
> 
> If the problem lies in mm layer, it might be time to fix it, instead of
> work around the bug by never triggering it from this particular point,
> which is a safe point where a process is willing to wait a bit.
> 
> Memory compaction is either working as intending, or not.
> 
> If we enabled it but never run it because it hurts, what is the point
> enabling it ?

networking is asking for 32KB, and the MM layer is doing what it can to
provide it.  Are the gains from getting 32KB contig bigger than the cost
of moving pages around if the MM has to actually go into compaction?
Should we start disk IO to give back 32KB contig?

I think we want to tell the MM to compact in the background and give
networking 32KB if it happens to have it available.  If not, fall back
to smaller allocations without doing anything expensive.

-chris
