Re: [PATCH 7/9] vhost: allow userspace to create workers

2021-06-17 Thread Mike Christie
On 6/10/21 3:06 AM, Stefan Hajnoczi wrote:
> On Wed, Jun 09, 2021 at 04:03:55PM -0500, Mike Christie wrote:
>> On 6/7/21 10:19 AM, Stefan Hajnoczi wrote:
>>> My concern is that threads should probably be accounted against
>>> RLIMIT_NPROC and max_threads rather than something indirect like 128 *
>>> RLIMIT_NOFILE (a userspace process can only have RLIMIT_NOFILE
>>> vhost-user file descriptors open).
>>>
>>
>> Ah ok, I see what you want I think.
>>
>> Ok, I think the options are:
>>
>> 0. Nothing. Just use existing indirect/RLIMIT_NOFILE.
>>
>> 1. Do something like io_uring's create_io_thread/copy_process. If we call
>> copy_process from the vhost ioctl context, then the userspace process that
>> did the ioctl will have its process count incremented and checked against
>> its rlimit.
>>
>> The drawbacks:
>> - This gets a little more complicated than just calling copy_process though.
>> We end up duplicating a lot of the kthread API.
>> - We have to deal with new error cases like the parent exiting early.
>> - I think all devs sharing a worker have to have the same owner.
>> kthread_use_mm and kthread_unuse_mm to switch between mm's for different
>> owners' devs seem to be causing lots of errors. I'm still looking into this
>> one though.
>>
>> 2. It's not really what you want, but for unbound work io_uring has a check
>> for RLIMIT_NPROC in the io_uring code. It does:
>>
>> wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
>>  task_rlimit(current, RLIMIT_NPROC);
>>
>> then does:
>>
>> if (!ret && acct->nr_workers < acct->max_workers) {
>>
>> Drawbacks:
>> In vhost.c, we could do something similar. It would make sure that vhost.c
>> does not create more worker threads than the rlimit value, but we wouldn't
>> be incrementing the userspace process's process count. The userspace process
>> could then create RLIMIT_NPROC threads and vhost.c could also create
>> RLIMIT_NPROC threads, so we end up with 2 * RLIMIT_NPROC threads.
> 
> Yes, in that case we might as well go with Option 0, so I think this
> option can be eliminated.
> 
>> 3. Change the kthread and copy_process code so we can pass in the thread
>> (or its creds or some struct that has the values that need to be checked) that
>> needs to be checked and updated.
>>
>> Drawback:
>> This might be considered too ugly for how special-case vhost is. For
>> example, we need checks/code like the io_thread/PF_IO_WORKER code in
>> copy_process for io_uring. I can see adding that for io_uring because it
>> affects so many users, but I can see how vhost is not special enough.
> 
> This seems like the most general solution. If you try it and get
> negative feedback then maybe the maintainers can help suggest how to
> solve this problem :).

Hey, I implemented #3 here:

https://github.com/mikechristie/linux/commit/76f7a555a85147420a22d0163c15259e01e02193

in this patchset:

https://github.com/mikechristie/linux/commits/kthread-node-user

but before I post I wanted to bring up an option 4 someone mentioned to me
offlist.

Again it's io_uring. Check out fs/io_uring.c:__io_account_mem(). For
RLIMIT_MEMLOCK it just does the check and increments the user's counter
itself. It's simple like option 2, and it handles the issue where the process
doing the ioctl wasn't having its RLIMIT_NPROC checked/updated.
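
A rough sketch of what option 4 could look like on the vhost side, loosely
modeled on the __io_account_mem() pattern (editor's illustration, not a
posted patch; the per-user counter stands in for whatever count the
RLIMIT_NPROC check should be charged against, and the helper names are
hypothetical):

static int vhost_worker_account(struct user_struct *user)
{
	unsigned long limit = task_rlimit(current, RLIMIT_NPROC);

	/* Check and charge the worker to the ioctl caller's user, the way
	 * io_uring checks and charges RLIMIT_MEMLOCK pages in
	 * __io_account_mem().
	 */
	if (atomic_long_inc_return(&user->vhost_workers) > limit) {
		atomic_long_dec(&user->vhost_workers);
		return -EAGAIN;
	}
	return 0;
}

static void vhost_worker_unaccount(struct user_struct *user)
{
	atomic_long_dec(&user->vhost_workers);
}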





Re: [PATCH v6 1/2] x86/sev: Make sure IRQs are disabled while GHCB is active

2021-06-17 Thread Peter Zijlstra


On Wed, Jun 16, 2021 at 08:49:12PM +0200, Joerg Roedel wrote:
> @@ -514,7 +523,7 @@ void noinstr __sev_es_nmi_complete(void)
>   struct ghcb_state state;
>   struct ghcb *ghcb;
>  
> - ghcb = sev_es_get_ghcb();
> + ghcb = __sev_get_ghcb();
>  
>   vc_ghcb_invalidate(ghcb);
>   ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_NMI_COMPLETE);
> @@ -524,7 +533,7 @@ void noinstr __sev_es_nmi_complete(void)
>   sev_es_wr_ghcb_msr(__pa_nodebug(ghcb));
>   VMGEXIT();
>  
> - sev_es_put_ghcb();
> + __sev_put_ghcb();
>  }

I'm getting (with all of v6.1 applied):

vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0x1bf: call to panic() leaves .noinstr.text section

$ ./scripts/faddr2line defconfig-build/vmlinux __sev_es_nmi_complete+0x1bf
__sev_es_nmi_complete+0x1bf/0x1d0:
__sev_get_ghcb at arch/x86/kernel/sev.c:223
(inlined by) __sev_es_nmi_complete at arch/x86/kernel/sev.c:519




Re: [PATCH v6 1/2] x86/sev: Make sure IRQs are disabled while GHCB is active

2021-06-17 Thread Peter Zijlstra
On Wed, Jun 16, 2021 at 08:49:12PM +0200, Joerg Roedel wrote:

>  static void sev_es_ap_hlt_loop(void)
>  {
>   struct ghcb_state state;
> + unsigned long flags;
>   struct ghcb *ghcb;
>  
> - ghcb = sev_es_get_ghcb();
> + local_irq_save(flags);
> +
> + ghcb = __sev_get_ghcb();
>  
>   while (true) {
>   vc_ghcb_invalidate(ghcb);
> @@ -692,7 +704,9 @@ static void sev_es_ap_hlt_loop(void)
>   break;
>   }
>  
> - sev_es_put_ghcb();
> + __sev_put_ghcb();
> +
> + local_irq_restore(flags);
>  }

I think this is broken, at this point RCU is quite dead on this CPU and
local_irq_save/restore include tracing and all sorts.

Also, shouldn't IRQs already be disabled by the time we get here?


[PATCH v6.1 2/2] x86/sev: Split up runtime #VC handler for correct state tracking

2021-06-17 Thread Joerg Roedel
From: Joerg Roedel 

Split up the #VC handler code into a from-user and a from-kernel part.
This allows clean and correct state tracking, as the #VC handler needs
to enter NMI-state when raised from kernel mode and plain IRQ state when
raised from user-mode.

Fixes: 62441a1fb532 ("x86/sev-es: Correctly track IRQ states in runtime #VC handler")
Suggested-by: Peter Zijlstra 
Signed-off-by: Joerg Roedel 
---
 arch/x86/entry/entry_64.S   |   4 +-
 arch/x86/include/asm/idtentry.h |  29 +++
 arch/x86/kernel/sev.c   | 148 +---
 3 files changed, 91 insertions(+), 90 deletions(-)
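
For orientation, a simplified sketch of the shape this split produces (the
real handlers live in the sev.c part of this patch; the common #VC dispatch
is elided here):

DEFINE_IDTENTRY_VC_KERNEL(exc_vmm_communication)
{
	irqentry_state_t irq_state = irqentry_nmi_enter(regs);

	/* ... common #VC handling, raised-from-kernel path ... */

	irqentry_nmi_exit(regs, irq_state);
}

DEFINE_IDTENTRY_VC_USER(exc_vmm_communication)
{
	irqentry_enter_from_user_mode(regs);

	/* ... common #VC handling, raised-from-user path ... */

	irqentry_exit_to_user_mode(regs);
}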

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a16a5294d55f..1886aaf19914 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -506,7 +506,7 @@ SYM_CODE_START(\asmsym)
 
movq%rsp, %rdi  /* pt_regs pointer */
 
-   call\cfunc
+   callkernel_\cfunc
 
/*
 * No need to switch back to the IST stack. The current stack is either
@@ -517,7 +517,7 @@ SYM_CODE_START(\asmsym)
 
/* Switch to the regular task stack */
 .Lfrom_usermode_switch_stack_\@:
-   idtentry_body safe_stack_\cfunc, has_error_code=1
+   idtentry_body user_\cfunc, has_error_code=1
 
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 73d45b0dfff2..cd9f3e304944 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -312,8 +312,8 @@ static __always_inline void __##func(struct pt_regs *regs)
  */
 #define DECLARE_IDTENTRY_VC(vector, func)  \
DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func);   \
-   __visible noinstr void ist_##func(struct pt_regs *regs, unsigned long error_code);  \
-   __visible noinstr void safe_stack_##func(struct pt_regs *regs, unsigned long error_code)
+   __visible noinstr void kernel_##func(struct pt_regs *regs, unsigned long error_code);   \
+   __visible noinstr void   user_##func(struct pt_regs *regs, unsigned long error_code)
 
 /**
  * DEFINE_IDTENTRY_IST - Emit code for IST entry points
@@ -355,33 +355,24 @@ static __always_inline void __##func(struct pt_regs *regs)
DEFINE_IDTENTRY_RAW_ERRORCODE(func)
 
 /**
- * DEFINE_IDTENTRY_VC_SAFE_STACK - Emit code for VMM communication handler
-  which runs on a safe stack.
+ * DEFINE_IDTENTRY_VC_KERNEL - Emit code for VMM communication handler
+  when raised from kernel mode
  * @func:  Function name of the entry point
  *
  * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
  */
-#define DEFINE_IDTENTRY_VC_SAFE_STACK(func)\
-   DEFINE_IDTENTRY_RAW_ERRORCODE(safe_stack_##func)
+#define DEFINE_IDTENTRY_VC_KERNEL(func)\
+   DEFINE_IDTENTRY_RAW_ERRORCODE(kernel_##func)
 
 /**
- * DEFINE_IDTENTRY_VC_IST - Emit code for VMM communication handler
-   which runs on the VC fall-back stack
+ * DEFINE_IDTENTRY_VC_USER - Emit code for VMM communication handler
+when raised from user mode
  * @func:  Function name of the entry point
  *
  * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
  */
-#define DEFINE_IDTENTRY_VC_IST(func)   \
-   DEFINE_IDTENTRY_RAW_ERRORCODE(ist_##func)
-
-/**
- * DEFINE_IDTENTRY_VC - Emit code for VMM communication handler
- * @func:  Function name of the entry point
- *
- * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
- */
-#define DEFINE_IDTENTRY_VC(func)   \
-   DEFINE_IDTENTRY_RAW_ERRORCODE(func)
+#define DEFINE_IDTENTRY_VC_USER(func)  \
+   DEFINE_IDTENTRY_RAW_ERRORCODE(user_##func)
 
 #else  /* CONFIG_X86_64 */
 
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index e0d12728bcb7..fe98cec2973e 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -796,7 +796,7 @@ void __init sev_es_init_vc_handling(void)
sev_es_setup_play_dead();
 
/* Secondary CPUs use the runtime #VC handler */
-   initial_vc_handler = (unsigned long)safe_stack_exc_vmm_communication;
+   initial_vc_handler = (unsigned long)kernel_exc_vmm_communication;
 }
 
 static void __init vc_early_forward_exception(struct es_em_ctxt *ctxt)
@@ -1234,14 +1234,6 @@ static enum es_result vc_handle_trap_ac(struct ghcb *ghcb,
return ES_EXCEPTION;
 }
 
-static __always_inline void vc_handle_trap_db(struct pt_regs *regs)
-{
-   if (user_mode(regs))
-   noist_exc_debug(regs);
-   else
-   exc_debug(regs);
-}
-
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 struct ghcb *ghcb,
 unsigned long exit_code)
@@ -1337,41 +1329,13 @@ static __always_inline bool 

Re: [PATCH v4 3/6] ACPI: Add driver for the VIOT table

2021-06-17 Thread Rafael J. Wysocki
On Thu, Jun 10, 2021 at 10:03 AM Jean-Philippe Brucker wrote:
>
> The ACPI Virtual I/O Translation Table describes the topology of
> para-virtual platforms, similarly to vendor tables DMAR, IVRS and IORT.
> For now it describes the relation between virtio-iommu and the endpoints
> it manages.
>
> Three steps are needed to configure DMA of endpoints:
>
> (1) acpi_viot_init(): parse the VIOT table, find or create the fwnode
> associated to each vIOMMU device.
>
> (2) When probing the vIOMMU device, the driver registers its IOMMU ops
> within the IOMMU subsystem. This step doesn't require any
> intervention from the VIOT driver.
>
> (3) viot_iommu_configure(): before binding the endpoint to a driver,
> find the associated IOMMU ops. Register them, along with the
> endpoint ID, into the device's iommu_fwspec.
>
> If step (3) happens before step (2), it is deferred until the IOMMU is
> initialized, then retried.
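
[A compressed sketch of the step (3) contract: the -EPROBE_DEFER convention
below is what the scan.c hunk in this patch relies on, but the lookup helper
and the fields here are hypothetical, not the code of this patch.]

int viot_iommu_configure(struct device *dev)
{
	struct viot_endpoint *ep = viot_find_endpoint(dev); /* hypothetical lookup */

	if (!ep)
		return -ENODEV;		/* endpoint not described by the VIOT */

	if (!ep->viommu_ready)		/* vIOMMU driver not registered yet */
		return -EPROBE_DEFER;	/* step (3) gets retried after step (2) */

	/* Record which vIOMMU translates this endpoint, and under which ID */
	return iommu_fwspec_add_ids(dev, &ep->endpoint_id, 1);
}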
>
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  drivers/acpi/Kconfig  |   3 +
>  drivers/iommu/Kconfig |   1 +
>  drivers/acpi/Makefile |   2 +
>  include/linux/acpi_viot.h |  19 ++
>  drivers/acpi/bus.c|   2 +
>  drivers/acpi/scan.c   |   3 +
>  drivers/acpi/viot.c   | 364 ++
>  MAINTAINERS   |   8 +
>  8 files changed, 402 insertions(+)
>  create mode 100644 include/linux/acpi_viot.h
>  create mode 100644 drivers/acpi/viot.c
>
> diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
> index eedec61e3476..3758c6940ed7 100644
> --- a/drivers/acpi/Kconfig
> +++ b/drivers/acpi/Kconfig
> @@ -526,6 +526,9 @@ endif
>
>  source "drivers/acpi/pmic/Kconfig"
>
> +config ACPI_VIOT
> +   bool
> +
>  endif  # ACPI
>
>  config X86_PM_TIMER
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 1f111b399bca..aff8a4830dd1 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -403,6 +403,7 @@ config VIRTIO_IOMMU
> depends on ARM64
> select IOMMU_API
> select INTERVAL_TREE
> +   select ACPI_VIOT if ACPI
> help
>   Para-virtualised IOMMU driver with virtio.
>
> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> index 700b41adf2db..a6e644c48987 100644
> --- a/drivers/acpi/Makefile
> +++ b/drivers/acpi/Makefile
> @@ -118,3 +118,5 @@ video-objs  += acpi_video.o video_detect.o
>  obj-y  += dptf/
>
>  obj-$(CONFIG_ARM64)+= arm64/
> +
> +obj-$(CONFIG_ACPI_VIOT)+= viot.o
> diff --git a/include/linux/acpi_viot.h b/include/linux/acpi_viot.h
> new file mode 100644
> index ..1eb8ee5b0e5f
> --- /dev/null
> +++ b/include/linux/acpi_viot.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#ifndef __ACPI_VIOT_H__
> +#define __ACPI_VIOT_H__
> +
> +#include 
> +
> +#ifdef CONFIG_ACPI_VIOT
> +void __init acpi_viot_init(void);
> +int viot_iommu_configure(struct device *dev);
> +#else
> +static inline void acpi_viot_init(void) {}
> +static inline int viot_iommu_configure(struct device *dev)
> +{
> +   return -ENODEV;
> +}
> +#endif
> +
> +#endif /* __ACPI_VIOT_H__ */
> diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
> index be7da23fad76..b835ca702ff0 100644
> --- a/drivers/acpi/bus.c
> +++ b/drivers/acpi/bus.c
> @@ -27,6 +27,7 @@
>  #include 
>  #endif
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1339,6 +1340,7 @@ static int __init acpi_init(void)
> pci_mmcfg_late_init();
> acpi_iort_init();
> acpi_scan_init();
> +   acpi_viot_init();

Is there a specific reason to call it right here?

In particular, does it need to be called after acpi_scan_init()?  And
does it need to be called before the subsequent functions?  If so,
then why?

> acpi_ec_init();
> acpi_debugfs_init();
> acpi_sleep_proc_init();
> diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
> index 0c53c8533300..4fa684fdfda8 100644
> --- a/drivers/acpi/scan.c
> +++ b/drivers/acpi/scan.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1556,6 +1557,8 @@ static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev,
> return ops;
>
> err = iort_iommu_configure_id(dev, id_in);
> +   if (err && err != -EPROBE_DEFER)
> +   err = viot_iommu_configure(dev);
>
> /*
>  * If we have reason to believe the IOMMU driver missed the initial
> diff --git a/drivers/acpi/viot.c b/drivers/acpi/viot.c
> new file mode 100644
> index ..892cd9fa7b6d
> --- /dev/null
> +++ b/drivers/acpi/viot.c
> @@ -0,0 +1,364 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Virtual I/O topology
> + *
> + * The Virtual I/O Translation Table (VIOT) describes the topology of
> + * para-virtual IOMMUs and the endpoints they manage. The OS uses it to
> + * initialize devices in the right order, 

Re: [PATCH] drivers: gpio: add virtio-gpio guest driver

2021-06-17 Thread Viresh Kumar
On 17-06-21, 11:54, Enrico Weigelt, metux IT consult wrote:
> Actually, I am subscribed to the list. We already had debates on it,
> including on your postings (but also other things).

Right.

> And the ascii
> version of the spec actually landed on the list last year, we had
> discussions about it there.

I tried to search for it earlier, but never found anything on virtio
list.  Maybe I missed it then.

> I've just had the problem that my patches didn't go through, which is
> very strange, since I actually am on the list and other mails of mine
> went through all the time. I'm now suspecting it's triggered by some
> subtle difference between my regular mail clients and git send-email.
> 
> > Since you started this all and still want to do it, I will take my
> > patches back and let you finish with what you started. I will help
> > review them.
> 
> Thank you very much.
> 
> Please don't get me wrong, I really don't want any kind of power play, I
> just wanna get a technically good solution. If there have been any
> misunderstandings at that point, I'm officially saying sorry here.

It's okay, we are both trying to make things better here :)

> Let's be friends.
> 
> You mentioned you've been missing things in my spec. Please come forward and
> tell us what exactly you're missing and what your use cases are.

I have sent a detailed review of your spec patch, let's do it there
point by point :)

> Note that I've intentionally left off certain "more sophisticated"
> functionality we find on *some* gpio controllers, e.g. per-line irq
> masking or pinmux settings, for several reasons, e.g.:
> 
> * those are only implemented by some hardware
> * often implemented in or at least need to be coordinated with other
>   pieces of hw (e.g. in SoCs, pinmux is usually done in a separate
>   device)
> * it shall be possible to support even the most simple devices and
>   have the more sophisticated things totally optional. minimum
>   requirements for silicon implementations should be the lowest possible
>   (IOW: minimal number of logic gates)
> 
> >> You sound like a politician that tries to push a hidden agenda,
> >> made by some secret interest group in the back room, against the
> >> people - like "resistance is futile".
> >
> > :)
> 
> Perhaps I've been a bit overreacting at that point. But: this is really
> the kind of talking we've heard from politicians and corporate leaders
> for many years, whenever they wanna push something through that we the
> people don't want. Politicians use that as a social engineering tool for
> demotivating any resistance. Over here in Germany this has even become
> a meme, and folks from CCC made a radio show about and named after it
> (the German word is "alternativlos" - in English: without any
> alternative). No idea about other countries, maybe it's a cultural
> issue, but over here, that kind of talking has become a red light.
> 
> Of course, I never intended to accuse you of being one of these people.
> Sorry if there's been any misunderstanding.

It sounded strange yesterday to be honest, but I have gone past it
already :)
 
> Let's get back to your implementation: you've mentioned you're routing
> raw virtio traffic into userland, to some other process (outside VMMs
> like qemu) - how exactly are you doing that?
> 
> That could be interesting for completely different scenarios. For
> example, I'm currently exploring how to get VirGL running between separate
> processes running under the same kernel instance (for now we only have the
> driver side inside the VM and the device outside it), meaning driver and
> device are running as separate processes.
> 
> The primary use case is containers that shall have really generic GPU
> drivers, not knowing anything about the actual hardware on the host.
> Currently, container workloads wanting to use a GPU need to have special
> drivers for exactly the HW the host happens to have. This makes generic,
> portable container images a tough problem.
> 
> I haven't dug deeply into the matter, but some virtio-tap transport
> could be a relatively easy (probably not the most efficient) way to
> solve this problem. In that scenario it would look like this:
> 
> * we have a "virgl server" (could be some X or Wayland application, or a
>   completely custom compositor) that opens up the device end of a
>   "virtio-tap" transport and attaches its virtio-gpu device emulation to it.
> * "virtio-tap" now creates a driver end, the kernel probes a virtio-gpu
>   instance on this (also leading to a new DRI device)
> * container runtime picks the new DRI device and maps it into the
>   container(s)
>   [ yet an open question whether one DRI device for many containers
>     is enough ]
> * container application sees that virtio-gpu DRI device and speaks to
>   it (mesa->virgl backend)
> * the "virgl server" receives buffers and commands via virtio and
>   sends them to the host's GL or Gallium API.
> 
> Once we're already there, we might think about whether it could make sense
> putting 

Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()

2021-06-17 Thread He Zhe



On 6/15/21 10:13 PM, Xie Yongji wrote:
> Increase the recursion depth of eventfd_signal() to 1. This
> is the maximum recursion depth we have found so far, which
> can be triggered with the following call chain:
>
> kvm_io_bus_write[kvm]
>   --> ioeventfd_write   [kvm]
> --> eventfd_signal  [eventfd]
>   --> vhost_poll_wakeup [vhost]
> --> vduse_vdpa_kick_vq  [vduse]
>   --> eventfd_signal[eventfd]
>
> Signed-off-by: Xie Yongji 
> Acked-by: Jason Wang 

The fix had been posted one year ago.

https://lore.kernel.org/lkml/20200410114720.24838-1-zhe...@windriver.com/


> ---
>  fs/eventfd.c| 2 +-
>  include/linux/eventfd.h | 5 -
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index e265b6dd4f34..cc7cd1dbedd3 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>* it returns true, the eventfd_signal() call should be deferred to a
>* safe context.
>*/
> - if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> + if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>   return 0;
>  
>   spin_lock_irqsave(&ctx->wqh.lock, flags);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index fa0a524baed0..886d99cd38ef 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -29,6 +29,9 @@
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>  
> +/* Maximum recursion depth */
> +#define EFD_WAKE_DEPTH 1
> +
>  struct eventfd_ctx;
>  struct file;
>  
> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>  
>  static inline bool eventfd_signal_count(void)
>  {
> - return this_cpu_read(eventfd_wake_count);
> + return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;

The count is just a count. How deep is acceptable should be decided where
eventfd_signal_count() is called.


Zhe

>  }
>  
>  #else /* CONFIG_EVENTFD */
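
A sketch of the alternative Zhe is suggesting above (editor's illustration,
not a posted patch): keep the helper as a plain counter read and let each
call site decide how much nesting it tolerates.

/* fs/eventfd.c side: report the raw per-CPU nesting count. */
static inline int eventfd_signal_count(void)
{
	return this_cpu_read(eventfd_wake_count);
}

/* Call-site side, e.g. a driver that knows it legitimately nests one level.
 * example_signal() and schedule_deferred_signal() are hypothetical names.
 */
static void example_signal(struct eventfd_ctx *ctx)
{
	if (eventfd_signal_count() > 1) {
		/* Too deep for *this* caller: defer to a safe context. */
		schedule_deferred_signal(ctx);
		return;
	}
	eventfd_signal(ctx, 1);
}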



Re: [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx

2021-06-17 Thread Jason Wang


On 2021/6/17 2:37 PM, Xuan Zhuo wrote:

On Thu, 17 Jun 2021 14:03:29 +0800, Jason Wang  wrote:

On 2021/6/17 1:53 PM, Xuan Zhuo wrote:

On Thu, 17 Jun 2021 11:23:52 +0800, Jason Wang  wrote:

On 2021/6/10 4:22 PM, Xuan Zhuo wrote:

Compared to the case of xsk tx, the case of xsk zc rx is more
complicated.

When we process the buf received by vq, we may encounter ordinary
buffers, or xsk buffers. What makes the situation more complicated is
that in the case of mergeable, when num_buffer > 1, we may still
encounter the case where xsk buffer is mixed with ordinary buffer.

Another thing that makes the situation more complicated is that when we
get an xsk buffer from vq, the xsk bound to this xsk buffer may have
been unbound.

Signed-off-by: Xuan Zhuo 

This is somehow similar to the case of tx where we don't have per vq reset.

[...]


-   if (vi->mergeable_rx_bufs)
+   if (is_xsk_ctx(ctx))
+   skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
+   else if (vi->mergeable_rx_bufs)
skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit,
stats);
else if (vi->big_packets)
@@ -1175,6 +1296,14 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
int err;
bool oom;

+   /* Because virtio-net does not yet support flow direct,

Note that this is not the case any more. RSS has been supported by
virtio spec and qemu/vhost/tap now. We just need some work on the
virtio-net driver part (e.g. the ethtool interface).

Oh, are there any plans? Who is doing this work, can I help?


Qemu and the spec already support RSS.

TAP support is ready via a steering eBPF program; you can try to play with it
using the current qemu master.

The only thing missing is the Linux driver; I think Yuri or Andrew is
working on this.

I feel that in the case of xsk, the flow director is more appropriate.

Users may still want to steer packets to a certain channel based on
information such as port/ip/tcp/udp, and then have xsk process them.

I will try to push the flow director to the spec.



That would be fine. For the backend implementation, it could still be 
implemented via steering eBPF.
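
For reference, a minimal sketch of that TAP-side hook from userspace
(editor's illustration): prog_fd is assumed to already hold a loaded BPF
program (tun expects a socket-filter type program) whose return value picks
the queue, which is roughly how qemu's eBPF RSS support drives tap.

#include <sys/ioctl.h>
#include <linux/if_tun.h>

static int tap_attach_steering_ebpf(int tap_fd, int prog_fd)
{
	/* The attached program's return value selects the queue (and hence
	 * the virtqueue pair) each packet is delivered to.
	 */
	return ioctl(tap_fd, TUNSETSTEERINGEBPF, &prog_fd);
}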


Thanks




Thanks.


Thanks



Thanks.


Thanks





Re: [PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold

2021-06-17 Thread Xuan Zhuo
On Thu, 17 Jun 2021 14:00:40 +0800, Jason Wang  wrote:
>
> On 2021/6/17 1:56 PM, Xuan Zhuo wrote:
> > On Thu, 17 Jun 2021 11:08:34 +0800, Jason Wang  wrote:
> >> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> >>> After testing, the performance of calling kick every time is not stable.
> >>> And if all the packets are sent and kicked again, the performance is not
> >>> good. So add a module parameter to specify how many packets are sent to
> >>> call a kick.
> >>>
> >>> 8 is a relatively stable value with the best performance.
> >>>
> >>> Here is the pps of the test of xsk_kick_thr under different values (from
> >>> 1 to 64).
> >>>
> >>> thr  PPS             thr  PPS             thr  PPS
> >>> 1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
> >>> 2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
> >>> 3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
> >>> 4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
> >>> 5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
> >>> 6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
> >>> 7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
> >>> 8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709
> >>
> >> I wonder what's the number for the case of non zc xsk?
> >
> > These data are used to compare the situation of sending different numbers
> > of packets to virtio at one time. I think it has nothing to do with
> > non-zerocopy xsk.
>
>
> Yes, but it would be helpful to see how much we can gain from zerocopy.

Okay, I will add the performance data of non-zerocopy xsk in the next patch set.

Thanks.

>
> Thanks
>
>
> >
> > Thanks.
> >
> >> Thanks
> >>
> >>
> >>> 9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
> >>> 10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
> >>> 11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
> >>> 12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
> >>> 13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
> >>> 14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
> >>> 15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
> >>> 16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
> >>> 17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
> >>> 18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
> >>> 19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
> >>> 20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
> >>> 21   2845655.55789 | 43  2752951.93264 |
> >>> 22   3450389.84365 | 44  2753107.27164 |
> >>>
> >>> It can be found that when the value of xsk_kick_thr is relatively small,
> >>> the performance is not good, and when its value is greater than 13, the
> >>> performance will be more irregular and unstable. It looks similar from 3
> >>> to 13, I chose 8 as the default value.
> >>>
> >>> The test environment is qemu + vhost-net. I modified vhost-net to drop
> >>> the packets sent by vm directly, so that the cpu of vm can run higher.
> >>> By default, the processes in the vm and the cpu of softirqd are too low,
> >>> and there is no obvious difference in the test data.
> >>>
> >>> During the test, the cpu of softirq reached 100%. Each xsk_kick_thr was
> >>> run for 300s, the pps of every second was recorded, and the average of
> >>> the pps was finally taken. The vhost process cpu on the host has also
> >>> reached 100%.
> >>>
> >>> Signed-off-by: Xuan Zhuo 
> >>> Reviewed-by: Dust Li 
> >>> ---
> >>>drivers/net/virtio/virtio_net.c |  1 +
> >>>drivers/net/virtio/xsk.c| 18 --
> >>>drivers/net/virtio/xsk.h|  2 ++
> >>>3 files changed, 19 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> >>> index 9503133e71f0..dfe509939b45 100644
> >>> --- a/drivers/net/virtio/virtio_net.c
> >>> +++ b/drivers/net/virtio/virtio_net.c
> >>> @@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true;
> >>>module_param(csum, bool, 0444);
> >>>module_param(gso, bool, 0444);
> >>>module_param(napi_tx, bool, 0644);
> >>> +module_param(xsk_kick_thr, int, 0644);
> >>>
> >>>/* FIXME: MTU in config. */
> >>>#define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> >>> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> >>> index 3973c82d1ad2..2f3ba6ab4798 100644
> >>> --- a/drivers/net/virtio/xsk.c
> >>> +++ b/drivers/net/virtio/xsk.c
> >>> @@ -5,6 +5,8 @@
> >>>
> >>>#include "virtio_net.h"
> >>>
> >>> +int xsk_kick_thr = 8;
> >>> +
> >>>static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
> >>>
> >>>    static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
> >>> @@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> >>>   struct xdp_desc desc;
> >>>   int err, packet = 0;
> >>>   int ret = -EAGAIN;
> >>> + int need_kick = 0;
> >>>
> >>>   while (budget-- > 

Re: [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx

2021-06-17 Thread Xuan Zhuo
On Thu, 17 Jun 2021 14:03:29 +0800, Jason Wang  wrote:
>
> On 2021/6/17 1:53 PM, Xuan Zhuo wrote:
> > On Thu, 17 Jun 2021 11:23:52 +0800, Jason Wang  wrote:
> >> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> >>> Compared to the case of xsk tx, the case of xsk zc rx is more
> >>> complicated.
> >>>
> >>> When we process the buf received by vq, we may encounter ordinary
> >>> buffers, or xsk buffers. What makes the situation more complicated is
> >>> that in the case of mergeable, when num_buffer > 1, we may still
> >>> encounter the case where xsk buffer is mixed with ordinary buffer.
> >>>
> >>> Another thing that makes the situation more complicated is that when we
> >>> get an xsk buffer from vq, the xsk bound to this xsk buffer may have
> >>> been unbound.
> >>>
> >>> Signed-off-by: Xuan Zhuo 
> >>
> >> This is somehow similar to the case of tx where we don't have per vq reset.
> >>
> >> [...]
> >>
> >>> - if (vi->mergeable_rx_bufs)
> >>> + if (is_xsk_ctx(ctx))
> >>> + skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
> >>> + else if (vi->mergeable_rx_bufs)
> >>>   skb = receive_mergeable(dev, vi, rq, buf, ctx, len, 
> >>> xdp_xmit,
> >>>   stats);
> >>>   else if (vi->big_packets)
> >>> @@ -1175,6 +1296,14 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
> >>>   int err;
> >>>   bool oom;
> >>>
> >>> + /* Because virtio-net does not yet support flow direct,
> >>
> >> Note that this is not the case any more. RSS has been supported by
> >> virtio spec and qemu/vhost/tap now. We just need some work on the
> >> virtio-net driver part (e.g. the ethtool interface).
> > Oh, are there any plans? Who is doing this work, can I help?
>
>
> Qemu and the spec already support RSS.
>
> TAP support is ready via a steering eBPF program; you can try to play with
> it using the current qemu master.
>
> The only thing missing is the Linux driver; I think Yuri or Andrew is
> working on this.

I feel that in the case of xsk, the flow director is more appropriate.

Users may still want to steer packets to a certain channel based on
information such as port/ip/tcp/udp, and then have xsk process them.

I will try to push the flow director to the spec.

Thanks.
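
As an aside, a sketch of the user-side pairing being described here, assuming
libbpf's AF_XDP helpers (<bpf/xsk.h>) as of 2021 and an already-initialized
umem: steer the interesting flow to a known queue (for example with an
ethtool ntuple rule) and bind the zero-copy AF_XDP socket to that same queue
index.

#include <bpf/xsk.h>
#include <linux/if_xdp.h>

static int bind_xsk_to_queue(struct xsk_umem *umem, const char *ifname,
			     __u32 queue_id, struct xsk_socket **xsk,
			     struct xsk_ring_cons *rx, struct xsk_ring_prod *tx)
{
	const struct xsk_socket_config cfg = {
		.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
		.bind_flags = XDP_ZEROCOPY,	/* request zero copy on this queue */
	};

	/* Packets steered to queue_id by the device end up in this socket. */
	return xsk_socket__create(xsk, ifname, queue_id, umem, rx, tx, &cfg);
}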

>
> Thanks
>
>
> >
> > Thanks.
> >
> >> Thanks
> >>
> >>
>

Re: [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx

2021-06-17 Thread Jason Wang


On 2021/6/17 1:53 PM, Xuan Zhuo wrote:

On Thu, 17 Jun 2021 11:23:52 +0800, Jason Wang  wrote:

On 2021/6/10 4:22 PM, Xuan Zhuo wrote:

Compared to the case of xsk tx, the case of xsk zc rx is more
complicated.

When we process the buf received by vq, we may encounter ordinary
buffers, or xsk buffers. What makes the situation more complicated is
that in the case of mergeable, when num_buffer > 1, we may still
encounter the case where xsk buffer is mixed with ordinary buffer.

Another thing that makes the situation more complicated is that when we
get an xsk buffer from vq, the xsk bound to this xsk buffer may have
been unbound.

Signed-off-by: Xuan Zhuo 


This is somehow similar to the case of tx where we don't have per vq reset.

[...]


-   if (vi->mergeable_rx_bufs)
+   if (is_xsk_ctx(ctx))
+   skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
+   else if (vi->mergeable_rx_bufs)
skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit,
stats);
else if (vi->big_packets)
@@ -1175,6 +1296,14 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
int err;
bool oom;

+   /* Because virtio-net does not yet support flow direct,


Note that this is not the case any more. RSS has been supported by
virtio spec and qemu/vhost/tap now. We just need some work on the
virtio-net driver part (e.g. the ethtool interface).

Oh, are there any plans? Who is doing this work, can I help?



Qemu and the spec already support RSS.

TAP support is ready via a steering eBPF program; you can try to play with it
using the current qemu master.


The only thing missing is the Linux driver; I think Yuri or Andrew is
working on this.


Thanks




Thanks.


Thanks





Re: [PATCH net-next v5 14/15] virtio-net: xsk direct xmit inside xsk wakeup

2021-06-17 Thread Jason Wang


On 2021/6/17 1:55 PM, Xuan Zhuo wrote:

On Thu, 17 Jun 2021 11:07:17 +0800, Jason Wang  wrote:

On 2021/6/10 4:22 PM, Xuan Zhuo wrote:

Calling virtqueue_napi_schedule() in wakeup results in napi running on
the current cpu. If the application is not busy, then there is no
problem. But if the application itself is busy, it will cause a lot of
scheduling.

If the application is continuously sending data packets, due to the
continuous scheduling between the application and napi, the data packet
transmission will not be smooth, and there will be an obvious delay in
the transmission (you can use tcpdump to see it). When pressing a
channel to 100% (vhost reaches 100%), the cpu where the application is
located reaches 100%.

This patch sends a small amount of data directly in wakeup. The purpose
of this is to trigger the tx interrupt. The tx interrupt will be
raised on the cpu of its affinity and then trigger the napi machinery,
so napi can continue to consume the xsk tx queue. Two cpus end up
running: cpu0 runs the application, cpu1 runs the napi consumption.
Again pressing a channel to 100%, the utilization of cpu0 is 12.7%
and the utilization of cpu1 is 2.9%.

Signed-off-by: Xuan Zhuo 
---
   drivers/net/virtio/xsk.c | 28 +++-
   1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index 36cda2dcf8e7..3973c82d1ad2 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -547,6 +547,7 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
   {
struct virtnet_info *vi = netdev_priv(dev);
struct xsk_buff_pool *pool;
+   struct netdev_queue *txq;
struct send_queue *sq;

if (!netif_running(dev))
@@ -559,11 +560,28 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)

rcu_read_lock();
pool = rcu_dereference(sq->xsk.pool);
-   if (pool) {
-   local_bh_disable();
-   virtqueue_napi_schedule(&sq->napi, sq->vq);
-   local_bh_enable();
-   }
+   if (!pool)
+   goto end;
+
+   if (napi_if_scheduled_mark_missed(&sq->napi))
+   goto end;
+
+   txq = netdev_get_tx_queue(dev, qid);
+
+   __netif_tx_lock_bh(txq);
+
+   /* Send part of the packet directly to reduce the delay in sending the
+* packet, and this can actively trigger the tx interrupts.
+*
+* If no packet is sent out, the ring of the device is full. In this
+* case, we will still get a tx interrupt response. Then we will deal
+* with the subsequent packet sending work.
+*/
+   virtnet_xsk_run(sq, pool, sq->napi.weight, false);


This looks tricky, and it won't be efficient since there could be some
contention on the tx lock.

I wonder if we can simulate the interrupt via IPI like what RPS did.

Let me try.


In the long run, we may want to extend the spec to support interrupt
triggering through the driver.

Can we submit this with reset queue?



We need separate features. And it looks to me it's not as urgent as reset.

Thanks




Thanks.


Thanks



+
+   __netif_tx_unlock_bh(txq);
+
+end:
rcu_read_unlock();
return 0;
   }



Re: [PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold

2021-06-17 Thread Jason Wang


On 2021/6/17 1:56 PM, Xuan Zhuo wrote:

On Thu, 17 Jun 2021 11:08:34 +0800, Jason Wang  wrote:

On 2021/6/10 4:22 PM, Xuan Zhuo wrote:

After testing, the performance of calling kick every time is not stable.
And if all the packets are sent and kicked again, the performance is not
good. So add a module parameter to specify how many packets are sent to
call a kick.

8 is a relatively stable value with the best performance.

Here is the pps of the test of xsk_kick_thr under different values (from
1 to 64).

thr  PPS             thr  PPS             thr  PPS
1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709


I wonder what's the number for the case of non zc xsk?


These data are used to compare the situation of sending different numbers of
packets to virtio at one time. I think it has nothing to do with non-zerocopy
xsk.



Yes, but it would be helpful to see how much we can gain from zerocopy.

Thanks




Thanks.


Thanks



9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
21   2845655.55789 | 43  2752951.93264 |
22   3450389.84365 | 44  2753107.27164 |

It can be found that when the value of xsk_kick_thr is relatively small,
the performance is not good, and when its value is greater than 13, the
performance will be more irregular and unstable. It looks similar from 3
to 13, I chose 8 as the default value.

The test environment is qemu + vhost-net. I modified vhost-net to drop
the packets sent by vm directly, so that the cpu of vm can run higher.
By default, the processes in the vm and the cpu of softirqd are too low,
and there is no obvious difference in the test data.

During the test, the cpu of softirq reached 100%. Each xsk_kick_thr was
run for 300s, the pps of every second was recorded, and the average of
the pps was finally taken. The vhost process cpu on the host has also
reached 100%.

Signed-off-by: Xuan Zhuo 
Reviewed-by: Dust Li 
---
   drivers/net/virtio/virtio_net.c |  1 +
   drivers/net/virtio/xsk.c| 18 --
   drivers/net/virtio/xsk.h|  2 ++
   3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
index 9503133e71f0..dfe509939b45 100644
--- a/drivers/net/virtio/virtio_net.c
+++ b/drivers/net/virtio/virtio_net.c
@@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true;
   module_param(csum, bool, 0444);
   module_param(gso, bool, 0444);
   module_param(napi_tx, bool, 0644);
+module_param(xsk_kick_thr, int, 0644);

   /* FIXME: MTU in config. */
   #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index 3973c82d1ad2..2f3ba6ab4798 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -5,6 +5,8 @@

   #include "virtio_net.h"

+int xsk_kick_thr = 8;
+
   static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;

   static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
@@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
struct xdp_desc desc;
int err, packet = 0;
int ret = -EAGAIN;
+   int need_kick = 0;

while (budget-- > 0) {
if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
@@ -475,11 +478,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
}

++packet;
+   ++need_kick;
+   if (need_kick > xsk_kick_thr) {
+   if (virtqueue_kick_prepare(sq->vq) &&
+   virtqueue_notify(sq->vq))
+   ++stats->kicks;
+
+   need_kick = 0;
+   }
}

if (packet) {
-   if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
-   ++stats->kicks;
+   if (need_kick) {
+   if (virtqueue_kick_prepare(sq->vq) &&
+