[PATCH v4 1/1] vringh: IOMEM support

2023-06-01 Thread Shunsuke Mie
Introduce a new memory accessor for vringh. It enables vringh to access
virtio rings located on an io-memory region.

Signed-off-by: Shunsuke Mie 
---
 drivers/vhost/vringh.c | 201 +
 include/linux/vringh.h |  32 +++
 2 files changed, 233 insertions(+)

diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index 955d938eb663..6e89dcd871b4 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -1604,4 +1604,205 @@ EXPORT_SYMBOL(vringh_need_notify_iotlb);
 
 #endif
 
+static inline int getu16_iomem(const struct vringh *vrh, u16 *val,
+  const __virtio16 *p)
+{
+   *val = vringh16_to_cpu(vrh, ioread16(p));
+   return 0;
+}
+
+static inline int putu16_iomem(const struct vringh *vrh, __virtio16 *p, u16 val)
+{
+   iowrite16(cpu_to_vringh16(vrh, val), p);
+   return 0;
+}
+
+static inline int copydesc_iomem(const struct vringh *vrh, void *dst,
+const void *src, size_t len)
+{
+   memcpy_fromio(dst, src, len);
+   return 0;
+}
+
+static int putused_iomem(const struct vringh *vrh, struct vring_used_elem *dst,
+const struct vring_used_elem *src, unsigned int num)
+{
+   memcpy_toio(dst, src, num * sizeof(*dst));
+   return 0;
+}
+
+static inline int xfer_from_iomem(const struct vringh *vrh, void *src,
+ void *dst, size_t len)
+{
+   memcpy_fromio(dst, src, len);
+   return 0;
+}
+
+static inline int xfer_to_iomem(const struct vringh *vrh, void *dst, void *src,
+   size_t len)
+{
+   memcpy_toio(dst, src, len);
+   return 0;
+}
+
+/**
+ * vringh_init_iomem - initialize a vringh for a vring on io-memory.
+ * @vrh: the vringh to initialize.
+ * @features: the feature bits for this ring.
+ * @num: the number of elements.
+ * @weak_barriers: true if we only need memory barriers, not I/O.
+ * @desc: the userspace descriptor pointer.
+ * @avail: the userspace avail pointer.
+ * @used: the userspace used pointer.
+ *
+ * Returns an error if num is invalid: you should check pointers
+ * yourself!
+ */
+int vringh_init_iomem(struct vringh *vrh, u64 features, unsigned int num,
+ bool weak_barriers, struct vring_desc *desc,
+ struct vring_avail *avail, struct vring_used *used)
+{
+   return vringh_init_kern(vrh, features, num, weak_barriers, desc, avail,
+   used);
+}
+EXPORT_SYMBOL(vringh_init_iomem);
+
+/**
+ * vringh_getdesc_iomem - get next available descriptor from vring on io-memory.
+ * @vrh: the vring on io-memory.
+ * @riov: where to put the readable descriptors (or NULL)
+ * @wiov: where to put the writable descriptors (or NULL)
+ * @head: head index we received, for passing to vringh_complete_iomem().
+ * @gfp: flags for allocating larger riov/wiov.
+ *
+ * Returns 0 if there was no descriptor, 1 if there was, or -errno.
+ *
+ * The same notes as for vringh_getdesc_kern() apply here; please see its
+ * documentation.
+ */
+int vringh_getdesc_iomem(struct vringh *vrh, struct vringh_kiov *riov,
+struct vringh_kiov *wiov, u16 *head, gfp_t gfp)
+{
+   int err;
+
+   err = __vringh_get_head(vrh, getu16_iomem, &vrh->last_avail_idx);
+   if (err < 0)
+   return err;
+
+   /* Empty... */
+   if (err == vrh->vring.num)
+   return 0;
+
+   *head = err;
+   err = __vringh_iov(vrh, *head, riov, wiov, no_range_check, NULL, gfp,
+  copydesc_iomem);
+   if (err)
+   return err;
+
+   return 1;
+}
+EXPORT_SYMBOL(vringh_getdesc_iomem);
+
+/**
+ * vringh_iov_pull_iomem - copy bytes from vring_iov.
+ * @riov: the riov as passed to vringh_getdesc_iomem() (updated as we consume)
+ * @dst: the place to copy.
+ * @len: the maximum length to copy.
+ *
+ * Returns the bytes copied <= len or a negative errno.
+ */
+ssize_t vringh_iov_pull_iomem(struct vringh *vrh, struct vringh_kiov *riov,
+ void *dst, size_t len)
+{
+   return vringh_iov_xfer(vrh, riov, dst, len, xfer_from_iomem);
+}
+EXPORT_SYMBOL(vringh_iov_pull_iomem);
+
+/**
+ * vringh_iov_push_iomem - copy bytes into vring_iov.
+ * @wiov: the wiov as passed to vringh_getdesc_iomem() (updated as we consume)
+ * @src: the place to copy from.
+ * @len: the maximum length to copy.
+ *
+ * Returns the bytes copied <= len or a negative errno.
+ */
+ssize_t vringh_iov_push_iomem(struct vringh *vrh, struct vringh_kiov *wiov,
+ const void *src, size_t len)
+{
+   return vringh_iov_xfer(vrh, wiov, (void *)src, len, xfer_to_iomem);
+}
+EXPORT_SYMBOL(vringh_iov_push_iomem);
+
+/**
+ * vringh_abandon_iomem - we've decided not to handle the descriptor(s).
+ * @vrh: the vring.
+ * @num: the number of descriptors to put back (ie. num
+ *  vringh_getdesc_iomem() to undo).
+ *
+ * The next vringh_get_kern() will return the old descriptor(s) again.

[PATCH v4 0/1] Introduce a vringh accessor for IO memory

2023-06-01 Thread Shunsuke Mie
Vringh is a host-side implementation of virtio rings. It supports vrings
located in three kinds of memory: userspace, kernel space, and memory
translated through an IOTLB.

This patch introduces a new accessor for vrings on IO memory regions. The
accessor is used by the proposed PCIe endpoint virtio-{net[1], console[2]}
function drivers, which emulate a virtio device and access the virtio ring in
PCIe host memory mapped into a local IO memory region (a rough usage sketch
follows the reference links below).

- [1] PCIe endpoint virtio-net function driver
link: https://lore.kernel.org/linux-pci/20230203100418.2981144-1-...@igel.co.jp/
- [2] PCIe endpoint virtio-console function driver
link: https://lore.kernel.org/linux-pci/20230427104428.862643-4-...@igel.co.jp/
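
A rough sketch of how an endpoint function driver might drive a device-side
vring through the new accessors is shown below. The function name
epf_handle_rx() and the surrounding setup (the ring already mapped via
ioremap(), vrh/riov already initialized with vringh_init_iomem()) are
illustrative assumptions, not code from this series:

	/* Rough usage sketch only; error handling kept minimal. */
	static int epf_handle_rx(struct vringh *vrh, struct vringh_kiov *riov,
				 void *buf, size_t buflen)
	{
		u16 head;
		int err;

		err = vringh_getdesc_iomem(vrh, riov, NULL, &head, GFP_KERNEL);
		if (err <= 0)		/* 0: no descriptor available, <0: error */
			return err;

		/* Pull the data described by the readable descriptors out of
		 * the mapped host memory. */
		err = vringh_iov_pull_iomem(vrh, riov, buf, buflen);
		if (err < 0)
			return err;

		/* Push the chain onto the used ring; 0 bytes written back. */
		vringh_complete_iomem(vrh, head, 0);
		return 1;
	}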

Changes from:

v3: 
https://lore.kernel.org/virtualization/20230425102250.3847395-1-...@igel.co.jp/
- Remove the kconfig option that was added for this support
- Add comments to exported functions
- Remove duplicated newlines

rfc v2: 
https://lore.kernel.org/virtualization/20230202090934.549556-8-...@igel.co.jp/
- Focus on adding IO memory APIs
- Rebase on next-20230414

rfc v1: 
https://lore.kernel.org/virtualization/20221227022528.609839-1-...@igel.co.jp/
- Initial patchset

Shunsuke Mie (1):
  vringh: IOMEM support

 drivers/vhost/vringh.c | 201 +
 include/linux/vringh.h |  32 +++
 2 files changed, 233 insertions(+)

-- 
2.25.1



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-01 Thread Jason Wang
On Thu, Jun 1, 2023 at 3:43 PM Oleg Nesterov  wrote:
>
> On 06/01, Jason Wang wrote:
> >
> > On Wed, May 31, 2023 at 5:14 PM Oleg Nesterov  wrote:
> > >
> > > > > I don't understand you. OK, to simplify, suppose we have 2 global vars
> > > > >
> > > > > void *PTR = something_non_null;
> > > > > unsigned long FLAGS = -1ul;
> > > > >
> > > > > Now I think this code
> > > > >
> > > > > CPU_0                           CPU_1
> > > > >
> > > > > void *ptr = PTR;                if (!test_and_set_bit(0, FLAGS))
> > > > > clear_bit(0, FLAGS);                    PTR = NULL;
> > > > > BUG_ON(!ptr);
> > > > >
> > > > > is racy and can hit the BUG_ON(!ptr).
> > > >
> > > > This seems different to the above case?
> > >
> > > not sure,
> > >
> > > > And you can hit BUG_ON with
> > > > the following execution sequence:
> > > >
> > > > [cpu 0] clear_bit(0, FLAGS);
> > > > [cpu 1] if (!test_and_set_bit(0, FLAGS))
> > > > [cpu 1] PTR = NULL;
> > > > [cpu 0] BUG_ON(!ptr)
> > >
> > > I don't understand this part... yes, we can hit this BUG_ON() without mb 
> > > in
> > > between, this is what I tried to say.
> >
> > I may miss something,
>
> Or me... note that CPU_0 loads the global "PTR" into the local "ptr" before 
> clear_bit.
> Since you have mentioned the program order: yes this lacks READ_ONCE() or 
> barrier(),
> but the same is true for the code in vhost_worker(). So I still don't 
> understand.
>
> > but the above is the sequence that is executed
> > by the processor (for each CPU, it's just the program order). So where
> > do you expect to place an mb can help?
>
> before clear_bit:
>
> CPU_0
>
> void *ptr = PTR;
> mb();   // implies compiler barrier as well
> clear_bit(0, FLAGS);
> BUG_ON(!ptr);
>
> just in case... mb() in the code above is only for illustration, we can use
> smp_mb__before_atomic() + clear_bit(). Or just clear_bit_unlock(), iiuc the
> one-way barrier is fine in this case.

Ok, but it seems different, in the case of vhost we had a condition
above the clear_bit().

>
>
> > > > In vhost code, there's a condition before the clear_bit() which sits
> > > > inside llist_for_each_entry_safe():
> > > >
> > > > #define llist_for_each_entry_safe(pos, n, node, member)                  \
> > > >         for (pos = llist_entry((node), typeof(*pos), member);            \
> > > >              member_address_is_nonnull(pos, member) &&                   \
> > > >                 (n = llist_entry(pos->member.next, typeof(*n), member), true); \
> > > >              pos = n)
> > > >
> > > > The clear_bit() is a store which is not speculated, so there's a
> > > > control dependency, the store can't be executed until the condition
> > > > expression is evaluated which requires pos->member.next
> > > > (work->node.next) to be loaded.
> > >
> > > But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that 
> > > we have
> > > something like
> > >
> > > n = llist_entry(...);
> > > if (n)
> > > clear_bit(...);
> > >
> > > so I do not see how can we rely on the load-store control dependency.
> >
> > Just to make sure we are on the same page, the condition expression is
> >
> > member_address_is_nonnull(pos, member) && (n =
> > llist_entry(pos->member.next, typeof(*n), member), true)
> >
> > So it's something like:
> >
> > if (work->node && (work_next = work->node->next, true))
> > clear_bit(&work->flags);
> >
> > So two loads from both work->node and work->node->next, and there's a
> > store which is clear_bit, then it's a load-store control dependencies?
>
> I guess you missed the comma expression...

Probably not, see below:

> Let me rewrite your pseudo-code
> above, it is equivalent to
>
> if (work->node) {
> if ((work_next = work->node->next, true))
> clear_bit(&work->flags);
> }
>
> another rewrite:
>
> if (work->node) {
> work_next = work->node->next;
> if ((work, true))
> clear_bit(&work->flags);
> }
>
> and the final rewrite:
>
> if (work->node) {
> work_next = work->node->next;
> if (true)
> clear_bit(&work->flags);
> }
>
> so again, I do not see the load-store control dependency.

This kind of optimization is suspicious. Especially considering it's
the control expression of the loop but not a condition.

Looking at the assembly (x86):

   0x81d46c5b <+75>:callq  0x81689ac0 
   0x81d46c60 <+80>:mov%rax,%r15
   0x81d46c63 <+83>:test   %rax,%rax
   0x81d46c66 <+86>:je 0x81d46c3a 
   0x81d46c68 <+88>:mov%r15,%rdi
   0x81d46c6b <+91>:mov(%r15),%r15
   0x81d46c6e <+94>:lock andb $0xfd,0x10(%rdi)
   0x81d46c73 

Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-01 Thread Eric W. Biederman
Mike Christie  writes:

> When switching from kthreads to vhost_tasks two bugs were added:
> 1. The vhost worker tasks now show up as processes, so scripts doing
> ps or ps a would incorrectly detect the vhost task as another
> process.  2. kthreads disabled freezing by setting PF_NOFREEZE, but
> vhost tasks didn't disable it or add support for it.
>
> To fix both bugs, this switches the vhost task to be a thread in the
> process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
> get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
> SIGKILL/STOP support is required because CLONE_THREAD requires
> CLONE_SIGHAND which requires those 2 signals to be supported.
>
> This is a modified version of the patch written by Mike Christie
>  which was a modified version of patch
> originally written by Linus.
>
> Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
> Including ignoring signals, setting up the register state, and having
> get_signal return instead of calling do_group_exit.
>
> Tidied up the vhost_task abstraction so that the definition of
> vhost_task only needs to be visible inside of vhost_task.c.  Making
> it easier to review the code and tell what needs to be done where.
> As part of this the main loop has been moved from vhost_worker into
> vhost_task_fn.  vhost_worker now returns true if work was done.

Please see below for a race in that tidying up.

> The main loop has been updated to call get_signal which handles
> SIGSTOP, freezing, and collects the message that tells the thread to
> exit as part of process exit.  This collection clears
> __fatal_signal_pending.  This collection is not guaranteed to
> clear signal_pending() so clear that explicitly so the schedule()
> sleeps.
>
> For now the vhost thread continues to exist and run work until the
> last file descriptor is closed and the release function is called as
> part of freeing struct file.  To avoid hangs in the coredump
> rendezvous and when killing threads in a multi-threaded exec.  The
> coredump code and de_thread have been modified to ignore vhost threads.
>
> Removing the special case for exec appears to require teaching
> vhost_dev_flush how to directly complete transactions in case
> the vhost thread is no longer running.
>
> Removing the special case for coredump rendezvous requires either the
> above fix needed for exec or moving the coredump rendezvous into
> get_signal.



In just fixing the hang after exec I am afraid I may have introduced
something worse.

Two different sighand_struct's (and their associated locks) pointing
at the same signal_struct.  (Should be fixable?)

I am worried about what happens with that vhost task after an exec.
It retains its existing cred (and technically the old mm) but shares the
signal_struct, so it might be possible to use permission checks against
the old vhost task's cred to bypass permission checks on the new task's
cred.  In particular for execs that gain privilege.

It doesn't look like that is an issue for signals and suid exec as
kill_ok_by_cred seems to deliberately allow the same thing.


We may be ok, but the way that vhost task remains after exec smells
like the setup for a local privilege escalation.


Oleg do you have any insights?

Does anyone see why using the vhost task to modify the process should
not result in privilege escalation?


> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index a92af08e7864..074273020849 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
> vhost_work *work)
>* test_and_set_bit() implies a memory barrier.
>*/
>   llist_add(&work->node, &dev->worker->work_list);
> - wake_up_process(dev->worker->vtsk->task);
> + vhost_task_wake(dev->worker->vtsk);
>   }
>  }
>  EXPORT_SYMBOL_GPL(vhost_work_queue);
> @@ -333,31 +333,19 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>   __vhost_vq_meta_reset(vq);
>  }
>  
> -static int vhost_worker(void *data)
> +static bool vhost_worker(void *data)
>  {
>   struct vhost_worker *worker = data;
>   struct vhost_work *work, *work_next;
>   struct llist_node *node;
>  
> - for (;;) {
> - /* mb paired w/ kthread_stop */
> - set_current_state(TASK_INTERRUPTIBLE);
> -
> - if (vhost_task_should_stop(worker->vtsk)) {
> - __set_current_state(TASK_RUNNING);
> - break;
> - }
> -
> - node = llist_del_all(&worker->work_list);
> - if (!node)
> - schedule();
> -
> + node = llist_del_all(&worker->work_list);
> + if (node) {
>   node = llist_reverse_order(node);
>   /* make sure flag is seen after deletion */
>   smp_wmb();
>   llist_for_each_entry_safe(work, work_next, node, node) {
>   

Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-01 Thread Michael S. Tsirkin
On Thu, Jun 01, 2023 at 01:32:32PM -0500, Mike Christie wrote:
> When switching from kthreads to vhost_tasks two bugs were added:
> 1. The vhost worker tasks now show up as processes, so scripts doing
> ps or ps a would incorrectly detect the vhost task as another
> process.  2. kthreads disabled freezing by setting PF_NOFREEZE, but
> vhost tasks didn't disable it or add support for it.
> 
> To fix both bugs, this switches the vhost task to be a thread in the
> process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
> get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
> SIGKILL/STOP support is required because CLONE_THREAD requires
> CLONE_SIGHAND which requires those 2 signals to be supported.
> 
> This is a modified version of the patch written by Mike Christie
>  which was a modified version of patch
> originally written by Linus.
> 
> Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
> Including ignoring signals, setting up the register state, and having
> get_signal return instead of calling do_group_exit.
> 
> Tidied up the vhost_task abstraction so that the definition of
> vhost_task only needs to be visible inside of vhost_task.c.  Making
> it easier to review the code and tell what needs to be done where.
> As part of this the main loop has been moved from vhost_worker into
> vhost_task_fn.  vhost_worker now returns true if work was done.
> 
> The main loop has been updated to call get_signal which handles
> SIGSTOP, freezing, and collects the message that tells the thread to
> exit as part of process exit.  This collection clears
> __fatal_signal_pending.  This collection is not guaranteed to
> clear signal_pending() so clear that explicitly so the schedule()
> sleeps.
> 
> For now the vhost thread continues to exist and run work until the
> last file descriptor is closed and the release function is called as
> part of freeing struct file.  To avoid hangs in the coredump
> rendezvous and when killing threads in a multi-threaded exec.  The
> coredump code and de_thread have been modified to ignore vhost threads.
> 
> Removing the special case for exec appears to require teaching
> vhost_dev_flush how to directly complete transactions in case
> the vhost thread is no longer running.
> 
> Removing the special case for coredump rendezvous requires either the
> above fix needed for exec or moving the coredump rendezvous into
> get_signal.
> 
> Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
> Signed-off-by: Eric W. Biederman 
> Co-developed-by: Mike Christie 
> Signed-off-by: Mike Christie 


Acked-by: Michael S. Tsirkin 


> ---
>  arch/x86/include/asm/fpu/sched.h |  2 +-
>  arch/x86/kernel/fpu/context.h|  2 +-
>  arch/x86/kernel/fpu/core.c   |  2 +-
>  drivers/vhost/vhost.c| 22 ++--
>  fs/coredump.c|  4 +-
>  include/linux/sched/task.h   |  1 -
>  include/linux/sched/vhost_task.h | 15 ++
>  kernel/exit.c|  5 +-
>  kernel/fork.c| 13 ++---
>  kernel/signal.c  |  8 +--
>  kernel/vhost_task.c  | 92 +---
>  11 files changed, 89 insertions(+), 77 deletions(-)
> 
> diff --git a/arch/x86/include/asm/fpu/sched.h 
> b/arch/x86/include/asm/fpu/sched.h
> index c2d6cd78ed0c..78fcde7b1f07 100644
> --- a/arch/x86/include/asm/fpu/sched.h
> +++ b/arch/x86/include/asm/fpu/sched.h
> @@ -39,7 +39,7 @@ extern void fpu_flush_thread(void);
>  static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
>  {
>   if (cpu_feature_enabled(X86_FEATURE_FPU) &&
> - !(current->flags & (PF_KTHREAD | PF_IO_WORKER))) {
> + !(current->flags & (PF_KTHREAD | PF_USER_WORKER))) {
>   save_fpregs_to_fpstate(old_fpu);
>   /*
>* The save operation preserved register state, so the
> diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h
> index 9fcfa5c4dad7..af5cbdd9bd29 100644
> --- a/arch/x86/kernel/fpu/context.h
> +++ b/arch/x86/kernel/fpu/context.h
> @@ -57,7 +57,7 @@ static inline void fpregs_restore_userregs(void)
>   struct fpu *fpu = >thread.fpu;
>   int cpu = smp_processor_id();
>  
> - if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_IO_WORKER)))
> + if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_USER_WORKER)))
>   return;
>  
>   if (!fpregs_state_valid(fpu, cpu)) {
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index caf33486dc5e..1015af1ae562 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -426,7 +426,7 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
>  
>   this_cpu_write(in_kernel_fpu, true);
>  
> - if (!(current->flags & (PF_KTHREAD | PF_IO_WORKER)) &&
> + if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
>   !test_thread_flag(TIF_NEED_FPU_LOAD)) {
>   

Re: [PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-06-01 Thread Vivek Goyal
On Thu, Jun 01, 2023 at 10:08:50AM -0400, Stefan Hajnoczi wrote:
> On Wed, May 31, 2023 at 04:49:39PM -0400, Vivek Goyal wrote:
> > On Wed, May 31, 2023 at 10:34:15PM +0200, Peter-Jan Gootzen wrote:
> > > On 31/05/2023 21:18, Vivek Goyal wrote:
> > > > On Wed, May 31, 2023 at 07:10:32PM +0200, Peter-Jan Gootzen wrote:
> > > >> When the Virtio queue is full, a work item is scheduled
> > > >> to execute in 1ms that retries adding the request to the queue.
> > > >> This is a large amount of time on the scale on which a
> > > >> virtio-fs device can operate. When using a DPU this is around
> > > >> 40us baseline without going to a remote server (4k, QD=1).
> > > >> This patch queues requests when the Virtio queue is full,
> > > >> and when a completed request is taken off, immediately fills
> > > >> it back up with queued requests.
> > > >>
> > > >> This reduces the 99.9th percentile latencies in our tests by
> > > >> 60x and slightly increases the overall throughput, when using a
> > > >> queue depth 2x the size of the Virtio queue size, with a
> > > >> DPU-powered virtio-fs device.
> > > >>
> > > >> Signed-off-by: Peter-Jan Gootzen 
> > > >> ---
> > > >> V1 -> V2: Not scheduling dispatch work anymore when not needed
> > > >> and changed delayed_work structs to work_struct structs
> > > >>
> > > >>  fs/fuse/virtio_fs.c | 32 +---
> > > >>  1 file changed, 17 insertions(+), 15 deletions(-)
> > > >>
> > > >> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > > >> index 4d8d4f16c727..a676297db09b 100644
> > > >> --- a/fs/fuse/virtio_fs.c
> > > >> +++ b/fs/fuse/virtio_fs.c
> > > >> @@ -45,7 +45,7 @@ struct virtio_fs_vq {
> > > >>struct work_struct done_work;
> > > >>struct list_head queued_reqs;
> > > >>struct list_head end_reqs;  /* End these requests */
> > > >> -  struct delayed_work dispatch_work;
> > > >> +  struct work_struct dispatch_work;
> > > >>struct fuse_dev *fud;
> > > >>bool connected;
> > > >>long in_flight;
> > > >> @@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct 
> > > >> virtio_fs_vq *fsvq)
> > > >>}
> > > >>  
> > > >>flush_work(>done_work);
> > > >> -  flush_delayed_work(>dispatch_work);
> > > >> +  flush_work(>dispatch_work);
> > > >>  }
> > > >>  
> > > >>  static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
> > > >> @@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct 
> > > >> work_struct *work)
> > > >>dec_in_flight_req(fsvq);
> > > >>}
> > > >>} while (!virtqueue_enable_cb(vq) && 
> > > >> likely(!virtqueue_is_broken(vq)));
> > > >> +
> > > >> +  if (!list_empty(>queued_reqs))
> > > >> +  schedule_work(>dispatch_work);
> > > >>spin_unlock(>lock);
> > > >>  }
> > > >>  
> > > >> @@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct 
> > > >> work_struct *work)
> > > >>  {
> > > >>struct fuse_req *req;
> > > >>struct virtio_fs_vq *fsvq = container_of(work, struct 
> > > >> virtio_fs_vq,
> > > >> -   dispatch_work.work);
> > > >> +   dispatch_work);
> > > >>int ret;
> > > >>  
> > > >>pr_debug("virtio-fs: worker %s called.\n", __func__);
> > > >> @@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct 
> > > >> work_struct *work)
> > > >>if (ret == -ENOMEM || ret == -ENOSPC) {
> > > >>spin_lock(>lock);
> > > >>list_add_tail(>list, 
> > > >> >queued_reqs);
> > > >> -  
> > > >> schedule_delayed_work(>dispatch_work,
> > > >> -
> > > >> msecs_to_jiffies(1));
> > > > 
> > > > Virtqueue being full is only one of the reasons for failure to queue
> > > > the request. What if virtqueue is empty but we could not queue the
> > > > request because lack of memory (-ENOMEM). In that case we will queue
> > > > the request and it might not be dispatched because there is no 
> > > > completion.
> > > > (Assume there is no further new request coming). That means deadlock?
> > > > 
> > > > Thanks
> > > > Vivek
> > > > 
> > > 
> > > Good catch that will deadlock.
> > > 
> > > Is default kernel behavior to indefinitely retry a file system
> > > request until memory is available?
> > 
> > As of now that seems to be the behavior. I think I had copied this
> > code from another driver. 
> > 
> > But I guess one can argue that if memory is not available, then
> > return -ENOMEM to user space instead of retrying in kernel.
> > 
> > Stefan, Miklos, WDYT?
> 
> My understanding is that file system syscalls may return ENOMEM, so this
> is okay.

Ok. Fair enough. Thanks.

One more question. How do we know the virtqueue is full? Is -ENOSPC the
correct error code to check for that?

[PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-01 Thread Mike Christie
When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks now show up as processes, so scripts doing
ps or ps a would incorrectly detect the vhost task as another
process.  2. kthreads disabled freezing by setting PF_NOFREEZE, but
vhost tasks didn't disable it or add support for it.

To fix both bugs, this switches the vhost task to be a thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be supported.

This is a modified version of the patch written by Mike Christie
 which was a modified version of patch
originally written by Linus.

Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
Including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.

Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c.  Making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn.  vhost_worker now returns true if work was done.

The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit.  This collection clears
__fatal_signal_pending.  This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.
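
(In rough, simplified form the resulting loop looks something like the sketch
below. This is only an illustration of the flow described above; field and
flag names are approximations, not the exact code from this patch.)

	static int vhost_task_fn(void *data)
	{
		struct vhost_task *vtsk = data;
		bool dead = false;

		for (;;) {
			bool did_work;

			if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
				break;

			if (!dead && signal_pending(current)) {
				struct ksignal ksig;

				/* Handles SIGSTOP/freezing; returns true once
				 * the fatal process-exit signal is collected. */
				dead = get_signal(&ksig);
				if (dead)
					clear_thread_flag(TIF_SIGPENDING);
			}

			/* vhost_worker(): returns true if work was done. */
			did_work = vtsk->fn(vtsk->data);
			if (!did_work) {
				set_current_state(TASK_INTERRUPTIBLE);
				schedule();
			}
		}
		return 0;
	}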

For now the vhost thread continues to exist and run work until the
last file descriptor is closed and the release function is called as
part of freeing struct file.  To avoid hangs in the coredump
rendezvous and when killing threads in a multi-threaded exec.  The
coredump code and de_thread have been modified to ignore vhost threads.

Removing the special case for exec appears to require teaching
vhost_dev_flush how to directly complete transactions in case
the vhost thread is no longer running.

Removing the special case for coredump rendezvous requires either the
above fix needed for exec or moving the coredump rendezvous into
get_signal.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman 
Co-developed-by: Mike Christie 
Signed-off-by: Mike Christie 
---
 arch/x86/include/asm/fpu/sched.h |  2 +-
 arch/x86/kernel/fpu/context.h|  2 +-
 arch/x86/kernel/fpu/core.c   |  2 +-
 drivers/vhost/vhost.c| 22 ++--
 fs/coredump.c|  4 +-
 include/linux/sched/task.h   |  1 -
 include/linux/sched/vhost_task.h | 15 ++
 kernel/exit.c|  5 +-
 kernel/fork.c| 13 ++---
 kernel/signal.c  |  8 +--
 kernel/vhost_task.c  | 92 +---
 11 files changed, 89 insertions(+), 77 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index c2d6cd78ed0c..78fcde7b1f07 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -39,7 +39,7 @@ extern void fpu_flush_thread(void);
 static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
 {
if (cpu_feature_enabled(X86_FEATURE_FPU) &&
-   !(current->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+   !(current->flags & (PF_KTHREAD | PF_USER_WORKER))) {
save_fpregs_to_fpstate(old_fpu);
/*
 * The save operation preserved register state, so the
diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h
index 9fcfa5c4dad7..af5cbdd9bd29 100644
--- a/arch/x86/kernel/fpu/context.h
+++ b/arch/x86/kernel/fpu/context.h
@@ -57,7 +57,7 @@ static inline void fpregs_restore_userregs(void)
struct fpu *fpu = >thread.fpu;
int cpu = smp_processor_id();
 
-   if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_IO_WORKER)))
+   if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_USER_WORKER)))
return;
 
if (!fpregs_state_valid(fpu, cpu)) {
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index caf33486dc5e..1015af1ae562 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -426,7 +426,7 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 
this_cpu_write(in_kernel_fpu, true);
 
-   if (!(current->flags & (PF_KTHREAD | PF_IO_WORKER)) &&
+   if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
!test_thread_flag(TIF_NEED_FPU_LOAD)) {
set_thread_flag(TIF_NEED_FPU_LOAD);
save_fpregs_to_fpstate(>thread.fpu);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..074273020849 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -256,7 +256,7 @@ void vhost_work_queue(struct 

Re: [PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-06-01 Thread Stefan Hajnoczi
On Thu, Jun 01, 2023 at 04:49:06PM +0200, Peter-Jan Gootzen wrote:
> On 01/06/2023 16:08, Stefan Hajnoczi wrote:
> > On Wed, May 31, 2023 at 04:49:39PM -0400, Vivek Goyal wrote:
> >> On Wed, May 31, 2023 at 10:34:15PM +0200, Peter-Jan Gootzen wrote:
> >>> On 31/05/2023 21:18, Vivek Goyal wrote:
>  On Wed, May 31, 2023 at 07:10:32PM +0200, Peter-Jan Gootzen wrote:
> > When the Virtio queue is full, a work item is scheduled
> > to execute in 1ms that retries adding the request to the queue.
> > This is a large amount of time on the scale on which a
> > virtio-fs device can operate. When using a DPU this is around
> > 40us baseline without going to a remote server (4k, QD=1).
> > This patch queues requests when the Virtio queue is full,
> > and when a completed request is taken off, immediately fills
> > it back up with queued requests.
> >
> > This reduces the 99.9th percentile latencies in our tests by
> > 60x and slightly increases the overall throughput, when using a
> > queue depth 2x the size of the Virtio queue size, with a
> > DPU-powered virtio-fs device.
> >
> > Signed-off-by: Peter-Jan Gootzen 
> > ---
> > V1 -> V2: Not scheduling dispatch work anymore when not needed
> > and changed delayed_work structs to work_struct structs
> >
> >  fs/fuse/virtio_fs.c | 32 +---
> >  1 file changed, 17 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 4d8d4f16c727..a676297db09b 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -45,7 +45,7 @@ struct virtio_fs_vq {
> > struct work_struct done_work;
> > struct list_head queued_reqs;
> > struct list_head end_reqs;  /* End these requests */
> > -   struct delayed_work dispatch_work;
> > +   struct work_struct dispatch_work;
> > struct fuse_dev *fud;
> > bool connected;
> > long in_flight;
> > @@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct 
> > virtio_fs_vq *fsvq)
> > }
> >  
> > flush_work(>done_work);
> > -   flush_delayed_work(>dispatch_work);
> > +   flush_work(>dispatch_work);
> >  }
> >  
> >  static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
> > @@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct 
> > work_struct *work)
> > dec_in_flight_req(fsvq);
> > }
> > } while (!virtqueue_enable_cb(vq) && 
> > likely(!virtqueue_is_broken(vq)));
> > +
> > +   if (!list_empty(>queued_reqs))
> > +   schedule_work(>dispatch_work);
> > spin_unlock(>lock);
> >  }
> >  
> > @@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct 
> > work_struct *work)
> >  {
> > struct fuse_req *req;
> > struct virtio_fs_vq *fsvq = container_of(work, struct 
> > virtio_fs_vq,
> > -dispatch_work.work);
> > +dispatch_work);
> > int ret;
> >  
> > pr_debug("virtio-fs: worker %s called.\n", __func__);
> > @@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct 
> > work_struct *work)
> > if (ret == -ENOMEM || ret == -ENOSPC) {
> > spin_lock(>lock);
> > list_add_tail(>list, 
> > >queued_reqs);
> > -   
> > schedule_delayed_work(>dispatch_work,
> > - 
> > msecs_to_jiffies(1));
> 
>  Virtqueue being full is only one of the reasons for failure to queue
>  the request. What if virtqueue is empty but we could not queue the
>  request because lack of memory (-ENOMEM). In that case we will queue
>  the request and it might not be dispatched because there is no 
>  completion.
>  (Assume there is no further new request coming). That means deadlock?
> 
>  Thanks
>  Vivek
> 
> >>>
> >>> Good catch that will deadlock.
> >>>
> >>> Is default kernel behavior to indefinitely retry a file system
> >>> request until memory is available?
> >>
> >> As of now that seems to be the behavior. I think I had copied this
> >> code from another driver. 
> >>
> >> But I guess one can argue that if memory is not available, then
> >> return -ENOMEM to user space instead of retrying in kernel.
> >>
> >> Stefan, Miklos, WDYT?
> > 
> > My understanding is that file system syscalls may return ENOMEM, so this
> > is okay.
> > 
> > Stefan
> 
> Then I propose only handling -ENOSPC as a special case and letting all
> other errors go through to userspace.
> 
> 

Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-06-01 Thread Mike Christie
On 6/1/23 2:47 AM, Stefano Garzarella wrote:
>>
>> static void vhost_worker_free(struct vhost_dev *dev)
>> {
>> -    struct vhost_worker *worker = dev->worker;
>> +    struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk);
>>
>> -    if (!worker)
>> +    if (!vtsk)
>>     return;
>>
>> -    dev->worker = NULL;
>> -    WARN_ON(!llist_empty(&worker->work_list));
>> -    vhost_task_stop(worker->vtsk);
>> -    kfree(worker);
>> +    vhost_task_stop(vtsk);
>> +    WARN_ON(!llist_empty(&dev->worker.work_list));
>> +    WRITE_ONCE(dev->worker.vtsk, NULL);
> 
> The patch LGTM, I just wonder if we should set dev->worker to zero here,

We might want to just set kcov_handle to zero for now.

In 6.3 and older, I think we could do:

1. vhost_dev_set_owner could successfully set dev->worker.
2. vhost_transport_send_pkt runs vhost_work_queue and sees worker
is set and adds the vhost_work to the work_list.
3. vhost_dev_set_owner fails in vhost_attach_cgroups, so we stop
the worker before the work can be run and set worker to NULL.
4. We clear kcov_handle and return.

We leave the work on the work_list.

5. Userspace can then retry vhost_dev_set_owner. If that works, then the
work gets executed ok eventually.

OR

Userspace can just close the device. vhost_vsock_dev_release would
eventually call vhost_dev_cleanup (vhost_dev_flush won't see a worker
so will just return), and that will hit the WARN_ON but we would
proceed ok.

If I do a memset of the worker, then if userspace were to retry
VHOST_SET_OWNER, we would lose the queued work since the work_list would
get zeroed. I think it's unlikely this ever happens, but you know best,
so let me know if this is a real issue.
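
For illustration, that minimal variant would look roughly like the sketch
below (untested, based on the patch quoted above; only kcov_handle is
cleared, the rest of dev->worker is left intact):

	static void vhost_worker_free(struct vhost_dev *dev)
	{
		struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk);

		if (!vtsk)
			return;

		vhost_task_stop(vtsk);
		WARN_ON(!llist_empty(&dev->worker.work_list));
		WRITE_ONCE(dev->worker.vtsk, NULL);
		/* keep work_list as-is so a retried VHOST_SET_OWNER can still
		 * run work that was queued in the meantime */
		dev->worker.kcov_handle = 0;
	}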


Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND

2023-06-01 Thread Mike Christie
On 6/1/23 5:47 AM, Christian Brauner wrote:
> On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
>> On 19.05.23 14:15, Christian Brauner wrote:
>>> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
 On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the
> normal testing, haven't converted vsock and vdpa, and I know you guys
> will not like the first patch. However, I think it better shows what
 Just to summarize the core idea behind my proposal is that no signal
 handling changes are needed unless there's a bug in the current way
 io_uring workers already work. All that should be needed is
 s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>> [...]
 So it feels like this should be achievable by adding a callback to
 struct vhost_worker that gets called when vhost_worker() gets SIGKILL
 and that all the users of vhost workers are forced to implement.

 Yes, it is more work but I think that's the right thing to do and not to
 complicate our signal handling.

 Worst case if this can't be done fast enough we'll have to revert the
 vhost parts. I think the user worker parts are mostly sane and are
>>> As mentioned, if we can't settle this cleanly before -rc4 we should
>>> revert the vhost parts unless Linus wants to have it earlier.
>> Meanwhile -rc5 is just a few days away and there are still a lot of
>> discussions in the patch-set proposed to address the issues[1]. Which is
>> kinda great (albeit also why I haven't given it a spin yet), but on the
>> other hand makes me wonder:
> You might've missed it in the thread but it seems everyone is currently
> operating under the assumption that the preferred way is to fix this is
> rather than revert. See the mail in [1]:
> 
> "So I'd really like to finish this. Even if we end up with a hack or
> two in signal handling that we can hopefully fix up later by having
> vhost fix up some of its current assumptions."
> 
> which is why no revert was send for -rc4. And there's a temporary fix we
> seem to have converged on.
> 
> @Mike, do you want to prepare an updated version of the temporary fix.
> If @Linus prefers to just apply it directly he can just grab it from the
> list rather than delaying it. Make sure to grab a Co-developed-by line
> on this, @Mike.

Yes, I'll send it within a couple hours.


Re: [PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-06-01 Thread Peter-Jan Gootzen via Virtualization
On 01/06/2023 16:08, Stefan Hajnoczi wrote:
> On Wed, May 31, 2023 at 04:49:39PM -0400, Vivek Goyal wrote:
>> On Wed, May 31, 2023 at 10:34:15PM +0200, Peter-Jan Gootzen wrote:
>>> On 31/05/2023 21:18, Vivek Goyal wrote:
 On Wed, May 31, 2023 at 07:10:32PM +0200, Peter-Jan Gootzen wrote:
> When the Virtio queue is full, a work item is scheduled
> to execute in 1ms that retries adding the request to the queue.
> This is a large amount of time on the scale on which a
> virtio-fs device can operate. When using a DPU this is around
> 40us baseline without going to a remote server (4k, QD=1).
> This patch queues requests when the Virtio queue is full,
> and when a completed request is taken off, immediately fills
> it back up with queued requests.
>
> This reduces the 99.9th percentile latencies in our tests by
> 60x and slightly increases the overall throughput, when using a
> queue depth 2x the size of the Virtio queue size, with a
> DPU-powered virtio-fs device.
>
> Signed-off-by: Peter-Jan Gootzen 
> ---
> V1 -> V2: Not scheduling dispatch work anymore when not needed
> and changed delayed_work structs to work_struct structs
>
>  fs/fuse/virtio_fs.c | 32 +---
>  1 file changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 4d8d4f16c727..a676297db09b 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -45,7 +45,7 @@ struct virtio_fs_vq {
>   struct work_struct done_work;
>   struct list_head queued_reqs;
>   struct list_head end_reqs;  /* End these requests */
> - struct delayed_work dispatch_work;
> + struct work_struct dispatch_work;
>   struct fuse_dev *fud;
>   bool connected;
>   long in_flight;
> @@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq 
> *fsvq)
>   }
>  
>   flush_work(&fsvq->done_work);
> - flush_delayed_work(&fsvq->dispatch_work);
> + flush_work(&fsvq->dispatch_work);
>  }
>  
>  static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
> @@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct 
> work_struct *work)
>   dec_in_flight_req(fsvq);
>   }
>   } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
> +
> + if (!list_empty(&fsvq->queued_reqs))
> + schedule_work(&fsvq->dispatch_work);
>   spin_unlock(&fsvq->lock);
>  }
>  
> @@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct 
> work_struct *work)
>  {
>   struct fuse_req *req;
>   struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
> -  dispatch_work.work);
> +  dispatch_work);
>   int ret;
>  
>   pr_debug("virtio-fs: worker %s called.\n", __func__);
> @@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct 
> work_struct *work)
>   if (ret == -ENOMEM || ret == -ENOSPC) {
>   spin_lock(&fsvq->lock);
>   list_add_tail(&req->list, &fsvq->queued_reqs);
> - schedule_delayed_work(&fsvq->dispatch_work,
> -   msecs_to_jiffies(1));

 Virtqueue being full is only one of the reasons for failure to queue
 the request. What if virtqueue is empty but we could not queue the
 request because lack of memory (-ENOMEM). In that case we will queue
 the request and it might not be dispatched because there is no completion.
 (Assume there is no further new request coming). That means deadlock?

 Thanks
 Vivek

>>>
>>> Good catch that will deadlock.
>>>
>>> Is default kernel behavior to indefinitely retry a file system
>>> request until memory is available?
>>
>> As of now that seems to be the behavior. I think I had copied this
>> code from another driver. 
>>
>> But I guess one can argue that if memory is not available, then
>> return -ENOMEM to user space instead of retrying in kernel.
>>
>> Stefan, Miklos, WDYT?
> 
> My understanding is that file system syscalls may return ENOMEM, so this
> is okay.
> 
> Stefan

Then I propose only handling -ENOSPC as a special case and letting all
other errors go through to userspace.
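
Concretely, the dispatch-work error path in virtio_fs_request_dispatch_work()
would then look roughly like the sketch below (untested, written against the
code quoted above; the helper names are the ones the driver already uses):

	ret = virtio_fs_enqueue_req(fsvq, req, true);
	if (ret < 0) {
		if (ret == -ENOSPC) {
			/* Virtqueue full: keep the request queued; it will be
			 * re-dispatched from the done_work handler. */
			spin_lock(&fsvq->lock);
			list_add_tail(&req->list, &fsvq->queued_reqs);
			spin_unlock(&fsvq->lock);
			return;
		}
		/* Any other error (e.g. -ENOMEM) is reported to userspace. */
		req->out.h.error = ret;
		spin_lock(&fsvq->lock);
		dec_in_flight_req(fsvq);
		spin_unlock(&fsvq->lock);
		fuse_request_end(req);
	}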

Noob Linux contributor question: how often should I send in a new revision of
the patch? Should I wait for more comments or send in a V3 with that fix now?

Best,
Peter-Jan


Re: [PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-06-01 Thread Stefan Hajnoczi
On Wed, May 31, 2023 at 04:49:39PM -0400, Vivek Goyal wrote:
> On Wed, May 31, 2023 at 10:34:15PM +0200, Peter-Jan Gootzen wrote:
> > On 31/05/2023 21:18, Vivek Goyal wrote:
> > > On Wed, May 31, 2023 at 07:10:32PM +0200, Peter-Jan Gootzen wrote:
> > >> When the Virtio queue is full, a work item is scheduled
> > >> to execute in 1ms that retries adding the request to the queue.
> > >> This is a large amount of time on the scale on which a
> > >> virtio-fs device can operate. When using a DPU this is around
> > >> 40us baseline without going to a remote server (4k, QD=1).
> > >> This patch queues requests when the Virtio queue is full,
> > >> and when a completed request is taken off, immediately fills
> > >> it back up with queued requests.
> > >>
> > >> This reduces the 99.9th percentile latencies in our tests by
> > >> 60x and slightly increases the overall throughput, when using a
> > >> queue depth 2x the size of the Virtio queue size, with a
> > >> DPU-powered virtio-fs device.
> > >>
> > >> Signed-off-by: Peter-Jan Gootzen 
> > >> ---
> > >> V1 -> V2: Not scheduling dispatch work anymore when not needed
> > >> and changed delayed_work structs to work_struct structs
> > >>
> > >>  fs/fuse/virtio_fs.c | 32 +---
> > >>  1 file changed, 17 insertions(+), 15 deletions(-)
> > >>
> > >> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > >> index 4d8d4f16c727..a676297db09b 100644
> > >> --- a/fs/fuse/virtio_fs.c
> > >> +++ b/fs/fuse/virtio_fs.c
> > >> @@ -45,7 +45,7 @@ struct virtio_fs_vq {
> > >>  struct work_struct done_work;
> > >>  struct list_head queued_reqs;
> > >>  struct list_head end_reqs;  /* End these requests */
> > >> -struct delayed_work dispatch_work;
> > >> +struct work_struct dispatch_work;
> > >>  struct fuse_dev *fud;
> > >>  bool connected;
> > >>  long in_flight;
> > >> @@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct 
> > >> virtio_fs_vq *fsvq)
> > >>  }
> > >>  
> > >>  flush_work(>done_work);
> > >> -flush_delayed_work(>dispatch_work);
> > >> +flush_work(>dispatch_work);
> > >>  }
> > >>  
> > >>  static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
> > >> @@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct 
> > >> work_struct *work)
> > >>  dec_in_flight_req(fsvq);
> > >>  }
> > >>  } while (!virtqueue_enable_cb(vq) && 
> > >> likely(!virtqueue_is_broken(vq)));
> > >> +
> > >> +if (!list_empty(>queued_reqs))
> > >> +schedule_work(>dispatch_work);
> > >>  spin_unlock(>lock);
> > >>  }
> > >>  
> > >> @@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct 
> > >> work_struct *work)
> > >>  {
> > >>  struct fuse_req *req;
> > >>  struct virtio_fs_vq *fsvq = container_of(work, struct 
> > >> virtio_fs_vq,
> > >> - dispatch_work.work);
> > >> + dispatch_work);
> > >>  int ret;
> > >>  
> > >>  pr_debug("virtio-fs: worker %s called.\n", __func__);
> > >> @@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct 
> > >> work_struct *work)
> > >>  if (ret == -ENOMEM || ret == -ENOSPC) {
> > >>  spin_lock(>lock);
> > >>  list_add_tail(>list, 
> > >> >queued_reqs);
> > >> -
> > >> schedule_delayed_work(>dispatch_work,
> > >> -  
> > >> msecs_to_jiffies(1));
> > > 
> > > Virtqueue being full is only one of the reasons for failure to queue
> > > the request. What if virtqueue is empty but we could not queue the
> > > request because lack of memory (-ENOMEM). In that case we will queue
> > > the request and it might not be dispatched because there is no completion.
> > > (Assume there is no further new request coming). That means deadlock?
> > > 
> > > Thanks
> > > Vivek
> > > 
> > 
> > Good catch that will deadlock.
> > 
> > Is default kernel behavior to indefinitely retry a file system
> > request until memory is available?
> 
> As of now that seems to be the behavior. I think I had copied this
> code from another driver. 
> 
> But I guess one can argue that if memory is not available, then
> return -ENOMEM to user space instead of retrying in kernel.
> 
> Stefan, Miklos, WDYT?

My understanding is that file system syscalls may return ENOMEM, so this
is okay.

Stefan



Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND

2023-06-01 Thread Linus Torvalds
On Thu, Jun 1, 2023 at 6:47 AM Christian Brauner  wrote:
>
> @Mike, do you want to prepare an updated version of the temporary fix.
> If @Linus prefers to just apply it directly he can just grab it from the
> list rather than delaying it. Make sure to grab a Co-developed-by line
> on this, @Mike.

Yeah, let's apply the known "fix the immediate regression" patch wrt
vhost ps output and the freezer. That gets rid of the regression.

I think that we can - and should - then treat the questions about core
dumping and execve as separate issues.

vhost wouldn't have done execve since it's nonsensical and has never
worked anyway since it always left the old mm ref behind, and
similarly core dumping has never been an issue.

So on those things we don't have any "semantic" issues, we just need
to make sure we don't do crazy things like hang uninterruptibly.

Linus

Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND

2023-06-01 Thread Thorsten Leemhuis
On 01.06.23 12:47, Christian Brauner wrote:
> On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
>> On 19.05.23 14:15, Christian Brauner wrote:
>>> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
 On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the
> normal testing, haven't converted vsock and vdpa, and I know you guys
> will not like the first patch. However, I think it better shows what

 Just to summarize the core idea behind my proposal is that no signal
 handling changes are needed unless there's a bug in the current way
 io_uring workers already work. All that should be needed is
 s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>> [...]
 So it feels like this should be achievable by adding a callback to
 struct vhost_worker that gets called when vhost_worker() gets SIGKILL
 and that all the users of vhost workers are forced to implement.

 Yes, it is more work but I think that's the right thing to do and not to
 complicate our signal handling.

 Worst case if this can't be done fast enough we'll have to revert the
 vhost parts. I think the user worker parts are mostly sane and are
>>>
>>> As mentioned, if we can't settle this cleanly before -rc4 we should
>>> revert the vhost parts unless Linus wants to have it earlier.
>>
>> Meanwhile -rc5 is just a few days away and there are still a lot of
>> discussions in the patch-set proposed to address the issues[1]. Which is
>> kinda great (albeit also why I haven't given it a spin yet), but on the
>> other hand makes me wonder:
> 
> You might've missed it in the thread but it seems everyone is currently
> operating under the assumption that the preferred way is to fix this is
> rather than revert. 

I saw that, but that was also a week ago already, so I slowly started to
wonder if plans might have/should be changed. Anyway: if that's still
the plan forward it's totally fine for me if it's fine for Linus. :-D

BTW: I for now didn't sit down to test Mike's patches, as due to all the
discussions I assumed new ones would be coming sooner or later anyway.
If it's worth giving them a shot, please let me know.

> [...]

Thx for the update!

Ciao, Thorsten


Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction

2023-06-01 Thread Juergen Gross via Virtualization

On 28.04.23 11:50, Hou Wenlong wrote:

Similar to the alternative patching, use relative reference for original
instruction rather than absolute one, which saves 8 bytes for one entry
on x86_64.  And it could generate R_X86_64_PC32 relocation instead of
R_X86_64_64 relocation, which also reduces relocation metadata on
relocatable builds. And the alignment could be hard coded to be 4 now.
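
For readers not already familiar with the trick, a minimal illustration of
the absolute vs. PC-relative reference trade-off follows; the struct and
field names here are made up for illustration and are not the actual
paravirt structures:

	/* Illustration only: hypothetical structures, not arch/x86 code. */
	struct site_abs {
		void *instr;		/* 8 bytes, needs an R_X86_64_64 relocation */
	};

	struct site_rel {
		s32 instr_offset;	/* 4 bytes, can use R_X86_64_PC32:
					 * offset = instr - &site->instr_offset */
	};

	/* Recovering the absolute address from the relative form at runtime: */
	static inline void *site_instr(struct site_rel *site)
	{
		return (void *)&site->instr_offset + site->instr_offset;
	}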

Signed-off-by: Hou Wenlong 
Cc: Thomas Garnier 
Cc: Lai Jiangshan 
Cc: Kees Cook 


Reviewed-by: Juergen Gross 

I think this patch should be taken even without the series.


Juergen



Re: [PATCH net] virtio/vsock: fix sock refcnt bug on owner set failure

2023-06-01 Thread Stefano Garzarella

On Wed, May 31, 2023 at 07:47:32PM +, Bobby Eshleman wrote:

Previous to setting the owner the socket is found via
vsock_find_connected_socket(), which returns sk after a call to
sock_hold().

If setting the owner fails, then sock_put() needs to be called.

Fixes: f9d2b1e146e0 ("virtio/vsock: fix leaks due to missing skb owner")
Signed-off-by: Bobby Eshleman 
---
net/vmw_vsock/virtio_transport_common.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index b769fc258931..f01cd6adc5cb 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1343,6 +1343,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,

if (!skb_set_owner_sk_safe(skb, sk)) {
WARN_ONCE(1, "receiving vsock socket has sk_refcnt == 0\n");
+   sock_put(sk);


Did you have any warning, issue here?

IIUC skb_set_owner_sk_safe() can return false only if the ref counter
is 0, so calling a sock_put() on it should have no effect except to
produce a warning.

Thanks,
Stefano



Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND

2023-06-01 Thread Thorsten Leemhuis
On 19.05.23 14:15, Christian Brauner wrote:
> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>> CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the
>>> normal testing, haven't converted vsock and vdpa, and I know you guys
>>> will not like the first patch. However, I think it better shows what
>>
>> Just to summarize the core idea behind my proposal is that no signal
>> handling changes are needed unless there's a bug in the current way
>> io_uring workers already work. All that should be needed is
>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
[...]
>> So it feels like this should be achievable by adding a callback to
>> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
>> and that all the users of vhost workers are forced to implement.
>>
>> Yes, it is more work but I think that's the right thing to do and not to
>> complicate our signal handling.
>>
>> Worst case if this can't be done fast enough we'll have to revert the
>> vhost parts. I think the user worker parts are mostly sane and are
> 
> As mentioned, if we can't settle this cleanly before -rc4 we should
> revert the vhost parts unless Linus wants to have it earlier.

Meanwhile -rc5 is just a few days away and there are still a lot of
discussions in the patch-set proposed to address the issues[1]. Which is
kinda great (albeit also why I haven't given it a spin yet), but on the
other hand makes me wonder:

Is it maybe time to revert the vhost parts for 6.4 and try again next cycle?

[1]
https://lore.kernel.org/all/20230522025124.5863-1-michael.chris...@oracle.com/

Ciao, Thorsten "not sure if I'm asking because I'm affected, or because
it's my duty as regression tracker" Leemhuis


Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-06-01 Thread Stefano Garzarella

On Wed, May 31, 2023 at 11:27:12AM -0500, Mike Christie wrote:

On 5/31/23 10:15 AM, Mike Christie wrote:

rcu would work for your case and for what Jason had requested.

Yeah, so you already have some patches?

Do you want to send it to solve this problem?


Yeah, I'll break them out and send them later today when I can retest
rebased patches.



Just one question. Do you core vhost developers consider RCU more complex,
or switching to READ_ONCE/WRITE_ONCE? I am asking because, for this immediate
regression fix, we could just switch to the latter, as below, to fix
the crash if we think that is simpler.

I think RCU is just a little more complex/invasive because it will have the
extra synchronize_rcu calls.


Yes, you may be right, in this case we should just need
READ_ONCE/WRITE_ONCE if dev->worker is no longer a pointer.




diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..03fd47a22a73 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -235,7 +235,7 @@ void vhost_dev_flush(struct vhost_dev *dev)
{
struct vhost_flush_struct flush;

-   if (dev->worker) {
+   if (READ_ONCE(dev->worker.vtsk)) {
		init_completion(&flush.wait_event);
		vhost_work_init(&flush.work, vhost_flush_work);

@@ -247,7 +247,9 @@ EXPORT_SYMBOL_GPL(vhost_dev_flush);

void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
{
-   if (!dev->worker)
+   struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk);
+
+   if (!vtsk)
return;

	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
@@ -255,8 +257,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 * sure it was not in the list.
 * test_and_set_bit() implies a memory barrier.
 */
-   llist_add(&work->node, &dev->worker->work_list);
-   wake_up_process(dev->worker->vtsk->task);
+   llist_add(&work->node, &dev->worker.work_list);
+   wake_up_process(vtsk->task);
}
}
EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -264,7 +266,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
/* A lockless hint for busy polling code to exit the loop */
bool vhost_has_work(struct vhost_dev *dev)
{
-   return dev->worker && !llist_empty(&dev->worker->work_list);
+   return !llist_empty(&dev->worker.work_list);
}
EXPORT_SYMBOL_GPL(vhost_has_work);

@@ -468,7 +470,7 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   dev->worker = NULL;
+   memset(&dev->worker, 0, sizeof(dev->worker));
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -542,46 +544,38 @@ static void vhost_detach_mm(struct vhost_dev *dev)

static void vhost_worker_free(struct vhost_dev *dev)
{
-   struct vhost_worker *worker = dev->worker;
+   struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk);

-   if (!worker)
+   if (!vtsk)
return;

-   dev->worker = NULL;
-   WARN_ON(!llist_empty(&worker->work_list));
-   vhost_task_stop(worker->vtsk);
-   kfree(worker);
+   vhost_task_stop(vtsk);
+   WARN_ON(!llist_empty(&dev->worker.work_list));
+   WRITE_ONCE(dev->worker.vtsk, NULL);


The patch LGTM, I just wonder if we should set dev->worker to zero here,
but maybe we don't need to.

Thanks,
Stefano


}

static int vhost_worker_create(struct vhost_dev *dev)
{
-   struct vhost_worker *worker;
struct vhost_task *vtsk;
char name[TASK_COMM_LEN];
int ret;

-   worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
-   if (!worker)
-   return -ENOMEM;
-
-   dev->worker = worker;
-   worker->kcov_handle = kcov_common_handle();
-   init_llist_head(&worker->work_list);
+   dev->worker.kcov_handle = kcov_common_handle();
+   init_llist_head(&dev->worker.work_list);
snprintf(name, sizeof(name), "vhost-%d", current->pid);

-   vtsk = vhost_task_create(vhost_worker, worker, name);
+   vtsk = vhost_task_create(vhost_worker, &dev->worker, name);
if (!vtsk) {
ret = -ENOMEM;
goto free_worker;
}

-   worker->vtsk = vtsk;
+   WRITE_ONCE(dev->worker.vtsk, vtsk);
vhost_task_start(vtsk);
return 0;

free_worker:
-   kfree(worker);
-   dev->worker = NULL;
+   WRITE_ONCE(dev->worker.vtsk, NULL);
return ret;
}

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0308638cdeee..305ec8593d46 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -154,7 +154,7 @@ struct vhost_dev {
struct vhost_virtqueue **vqs;
int nvqs;
struct eventfd_ctx *log_ctx;
-   struct vhost_worker *worker;
+   struct vhost_worker worker;
struct vhost_iotlb *umem;
struct vhost_iotlb *iotlb;
spinlock_t iotlb_lock;




Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-01 Thread Oleg Nesterov
On 06/01, Jason Wang wrote:
>
> On Wed, May 31, 2023 at 5:14 PM Oleg Nesterov  wrote:
> >
> > > > I don't understand you. OK, to simplify, suppose we have 2 global vars
> > > >
> > > > void *PTR = something_non_null;
> > > > unsigned long FLAGS = -1ul;
> > > >
> > > > Now I think this code
> > > >
> > > > CPU_0                           CPU_1
> > > >
> > > > void *ptr = PTR;                if (!test_and_set_bit(0, FLAGS))
> > > > clear_bit(0, FLAGS);                    PTR = NULL;
> > > > BUG_ON(!ptr);
> > > >
> > > > is racy and can hit the BUG_ON(!ptr).
> > >
> > > This seems different to the above case?
> >
> > not sure,
> >
> > > And you can hit BUG_ON with
> > > the following execution sequence:
> > >
> > > [cpu 0] clear_bit(0, FLAGS);
> > > [cpu 1] if (!test_and_set_bit(0, FLAGS))
> > > [cpu 1] PTR = NULL;
> > > [cpu 0] BUG_ON(!ptr)
> >
> > I don't understand this part... yes, we can hit this BUG_ON() without mb in
> > between, this is what I tried to say.
>
> I may miss something,

Or me... note that CPU_0 loads the global "PTR" into the local "ptr" before 
clear_bit.
Since you have mentioned the program order: yes this lacks READ_ONCE() or 
barrier(),
but the same is true for the code in vhost_worker(). So I still don't 
understand.

> but the above is the sequence that is executed
> by the processor (for each CPU, it's just the program order). So where
> do you expect to place an mb can help?

before clear_bit:

CPU_0

void *ptr = PTR;
mb();   // implies compiler barrier as well
clear_bit(0, FLAGS);
BUG_ON(!ptr);

just in case... mb() in the code above is only for illustration, we can use
smp_mb__before_atomic() + clear_bit(). Or just clear_bit_unlock(), iiuc the
one-way barrier is fine in this case.
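
Spelled out against the simplified example above, that would be (again, just
for illustration):

	CPU_0

	void *ptr = PTR;
	smp_mb__before_atomic();	/* order the load of PTR before the */
	clear_bit(0, FLAGS);		/* store part of clear_bit()        */
	BUG_ON(!ptr);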


> > > In vhost code, there's a condition before the clear_bit() which sits
> > > inside llist_for_each_entry_safe():
> > >
> > > #define llist_for_each_entry_safe(pos, n, node, member)                  \
> > >         for (pos = llist_entry((node), typeof(*pos), member);            \
> > >              member_address_is_nonnull(pos, member) &&                   \
> > >                 (n = llist_entry(pos->member.next, typeof(*n), member), true); \
> > >              pos = n)
> > >
> > > The clear_bit() is a store which is not speculated, so there's a
> > > control dependency, the store can't be executed until the condition
> > > expression is evaluated which requires pos->member.next
> > > (work->node.next) to be loaded.
> >
> > But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that 
> > we have
> > something like
> >
> > n = llist_entry(...);
> > if (n)
> > clear_bit(...);
> >
> > so I do not see how can we rely on the load-store control dependency.
>
> Just to make sure we are on the same page, the condition expression is
>
> member_address_is_nonnull(pos, member) && (n =
> llist_entry(pos->member.next, typeof(*n), member), true)
>
> So it's something like:
>
> if (work->node && (work_next = work->node->next, true))
> clear_bit(&work->flags);
>
> So two loads from both work->node and work->node->next, and there's a
> store which is clear_bit, then it's a load-store control dependencies?

I guess you missed the comma expression... Let me rewrite your pseudo-code
above, it is equivalent to

	if (work->node) {
		if ((work_next = work->node->next, true))
			clear_bit(&work->flags);
	}

another rewrite:

	if (work->node) {
		work_next = work->node->next;
		if ((work, true))
			clear_bit(&work->flags);
	}

and the final rewrite:

	if (work->node) {
		work_next = work->node->next;
		if (true)
			clear_bit(&work->flags);
	}

so again, I do not see the load-store control dependency. Not to mention this
code lacks READ_ONCE().


If we had something like

	if (work->node) {
		work_next = READ_ONCE(work->node->next);
		if (work_next)
			clear_bit(&work->flags);
	}

instead, then yes, we could rely on the LOAD-STORE dependency.

Oleg.
