Re: [PATCH v2 0/4] vhost: Cleanup

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:56PM +1000, Gavin Shan wrote:
> This is suggested by Michael S. Tsirkin according to [1] and the goal
> is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it,
> the caller of the function needn't worry about memory barriers. Since
> we're here, other cleanups are also applied.
> 
> [1] 
> https://lore.kernel.org/virtualization/20240327155750-mutt-send-email-...@kernel.org/


Patch 1 makes some sense, gave some comments. Rest I think we should
just drop.

> PATCH[1] improves vhost_get_avail_idx() so that smp_rmb() is applied if
>  needed. Besides, the sanity checks on the retrieved available
>  queue index are also squeezed to vhost_get_avail_idx()
> PATCH[2] drops the local variable @last_avail_idx since it's equivalent
>  to vq->last_avail_idx
> PATCH[3] improves vhost_get_avail_head(), similar to what we're doing
>  for vhost_get_avail_idx(), so that the relevant sanity checks
>  on the head are squeezed to vhost_get_avail_head()
> PATCH[4] Reformat vhost_{get, put}_user() by using tab instead of space
>  as the terminator for each line
> 
> Gavin Shan (3):
>   vhost: Drop variable last_avail_idx in vhost_get_vq_desc()
>   vhost: Improve vhost_get_avail_head()
>   vhost: Reformat vhost_{get, put}_user()
> 
> Michael S. Tsirkin (1):
>   vhost: Improve vhost_get_avail_idx() with smp_rmb()
> 
>  drivers/vhost/vhost.c | 215 +++---
>  1 file changed, 97 insertions(+), 118 deletions(-)
> 
> Changelog
> =
> v2:
>   * Improve vhost_get_avail_idx() as Michael suggested in [1]
> as above (Michael)
>   * Correct @head's type from 'unsigned int' to 'int'
> (l...@intel.com)
> 
> -- 
> 2.44.0




Re: [PATCH v2 4/4] vhost: Reformat vhost_{get, put}_user()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:14:00PM +1000, Gavin Shan wrote:
> Reformat the macros to use tab as the terminator for each line so
> that it looks clean.
> 
> No functional change intended.
> 
> Signed-off-by: Gavin Shan 

Just messes up history for no real gain.

> ---
>  drivers/vhost/vhost.c | 60 +--
>  1 file changed, 30 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 4ddb9ec2fe46..c1ed5e750521 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1207,21 +1207,22 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>   return __vhost_get_user_slow(vq, addr, size, type);
>  }
>  
> -#define vhost_put_user(vq, x, ptr)   \
> -({ \
> - int ret; \
> - if (!vq->iotlb) { \
> - ret = __put_user(x, ptr); \
> - } else { \
> - __typeof__(ptr) to = \
> +#define vhost_put_user(vq, x, ptr)   \
> +({   \
> + int ret;\
> + if (!vq->iotlb) {   \
> + ret = __put_user(x, ptr);   \
> + } else {\
> + __typeof__(ptr) to =\
>   (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> -   sizeof(*ptr), VHOST_ADDR_USED); \
> - if (to != NULL) \
> - ret = __put_user(x, to); \
> - else \
> - ret = -EFAULT;  \
> - } \
> - ret; \
> + sizeof(*ptr),   \
> + VHOST_ADDR_USED);   \
> + if (to != NULL) \
> + ret = __put_user(x, to);\
> + else\
> + ret = -EFAULT;  \
> + }   \
> + ret;\
>  })
>  
>  static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> @@ -1252,22 +1253,21 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> 			       &vq->used->idx);
>  }
>  
> -#define vhost_get_user(vq, x, ptr, type) \
> -({ \
> - int ret; \
> - if (!vq->iotlb) { \
> - ret = __get_user(x, ptr); \
> - } else { \
> - __typeof__(ptr) from = \
> - (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> -sizeof(*ptr), \
> -type); \
> - if (from != NULL) \
> - ret = __get_user(x, from); \
> - else \
> - ret = -EFAULT; \
> - } \
> - ret; \
> +#define vhost_get_user(vq, x, ptr, type) \
> +({   \
> + int ret;\
> + if (!vq->iotlb) {   \
> + ret = __get_user(x, ptr);   \
> + } else {\
> + __typeof__(ptr) from =  \
> + (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> + sizeof(*ptr), type);\
> + if (from != NULL)   \
> + ret = __get_user(x, from);  \
> + else\
> + ret = -EFAULT;  \
> + }   \
> + ret;\
>  })
>  
>  #define vhost_get_avail(vq, x, ptr) \
> -- 
> 2.44.0




Re: [PATCH v2 3/4] vhost: Improve vhost_get_avail_head()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:59PM +1000, Gavin Shan wrote:
> Improve vhost_get_avail_head() so that the head or errno is returned.
> With it, the relevant sanity checks are squeezed to vhost_get_avail_head()
> and vhost_get_vq_desc() is further simplified.
> 
> No functional change intended.
> 
> Signed-off-by: Gavin Shan 

I don't see what this moving code around achieves.

> ---
>  drivers/vhost/vhost.c | 50 ++-
>  1 file changed, 26 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index b278c0333a66..4ddb9ec2fe46 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1322,11 +1322,27 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
>   return 1;
>  }
>  
> -static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> -__virtio16 *head, int idx)
> +static inline int vhost_get_avail_head(struct vhost_virtqueue *vq)
>  {
> - return vhost_get_avail(vq, *head,
> -			       &vq->avail->ring[idx & (vq->num - 1)]);
> + __virtio16 head;
> + int r;
> +
> + r = vhost_get_avail(vq, head,
> +			    &vq->avail->ring[vq->last_avail_idx & (vq->num - 1)]);
> + if (unlikely(r)) {
> + vq_err(vq, "Failed to read head: index %u address %p\n",
> +vq->last_avail_idx,
> +		       &vq->avail->ring[vq->last_avail_idx & (vq->num - 1)]);
> + return r;
> + }
> +
> + r = vhost16_to_cpu(vq, head);
> + if (unlikely(r >= vq->num)) {
> + vq_err(vq, "Invalid head %d (%u)\n", r, vq->num);
> + return -EINVAL;
> + }
> +
> + return r;
>  }
>  
>  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> @@ -2523,9 +2539,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> struct vhost_log *log, unsigned int *log_num)
>  {
>   struct vring_desc desc;
> - unsigned int i, head, found = 0;
> - __virtio16 ring_head;
> - int ret, access;
> + unsigned int i, found = 0;
> + int head, ret, access;
>  
>   if (vq->avail_idx == vq->last_avail_idx) {
>   ret = vhost_get_avail_idx(vq);
> @@ -2536,23 +2551,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   return vq->num;
>   }
>  
> - /* Grab the next descriptor number they're advertising, and increment
> -  * the index we've seen. */
> -	if (unlikely(vhost_get_avail_head(vq, &ring_head, vq->last_avail_idx))) {
> - vq_err(vq, "Failed to read head: idx %d address %p\n",
> -vq->last_avail_idx,
> -		       &vq->avail->ring[vq->last_avail_idx % vq->num]);
> - return -EFAULT;
> - }
> -
> - head = vhost16_to_cpu(vq, ring_head);
> -
> - /* If their number is silly, that's an error. */
> - if (unlikely(head >= vq->num)) {
> - vq_err(vq, "Guest says index %u > %u is available",
> -head, vq->num);
> - return -EINVAL;
> - }
> + /* Grab the next descriptor number they're advertising */
> + head = vhost_get_avail_head(vq);
> + if (unlikely(head < 0))
> + return head;
>  
>   /* When we start there are none of either input nor output. */
>   *out_num = *in_num = 0;
> -- 
> 2.44.0




Re: [PATCH v2 2/4] vhost: Drop variable last_avail_idx in vhost_get_vq_desc()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:58PM +1000, Gavin Shan wrote:
> The local variable @last_avail_idx is equivalent to vq->last_avail_idx.
> So the code can be simplified a bit by dropping the local variable
> @last_avail_idx.
> 
> No functional change intended.
> 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vhost/vhost.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 7aa623117aab..b278c0333a66 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2524,7 +2524,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  {
>   struct vring_desc desc;
>   unsigned int i, head, found = 0;
> - u16 last_avail_idx = vq->last_avail_idx;
>   __virtio16 ring_head;
>   int ret, access;
>  
> @@ -2539,10 +2538,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  
>   /* Grab the next descriptor number they're advertising, and increment
>* the index we've seen. */
> -	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
> +	if (unlikely(vhost_get_avail_head(vq, &ring_head, vq->last_avail_idx))) {
>   vq_err(vq, "Failed to read head: idx %d address %p\n",
> -last_avail_idx,
> -		       &vq->avail->ring[last_avail_idx % vq->num]);
> +vq->last_avail_idx,
> +		       &vq->avail->ring[vq->last_avail_idx % vq->num]);
>   return -EFAULT;
>   }

I don't see the big advantage and the line is long now.

>  
> -- 
> 2.44.0




Re: [PATCH v2 1/4] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:57PM +1000, Gavin Shan wrote:
> From: "Michael S. Tsirkin" 
> 
> All the callers of vhost_get_avail_idx() are concerned to the memory

*with* the memory barrier

> barrier, imposed by smp_rmb() to ensure the order of the available
> ring entry read and avail_idx read.
> 
> Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> the avail_idx is advanced.

accessed, not advanced. guest advances it.

> With it, the callers needn't worry
> about the memory barrier.
> 
> No functional change intended.

I'd add:

As a side benefit, we also validate the index on all paths now, which
will hopefully help catch future errors earlier.

Note: current code is inconsistent in how it handles errors:
some places treat it as an empty ring, others as non-empty.
This patch does not attempt to change the existing behaviour.



> Signed-off-by: Michael S. Tsirkin 
> [gshan: repainted vhost_get_avail_idx()]

?repainted?

> Reviewed-by: Gavin Shan 
> Acked-by: Will Deacon 
> ---
>  drivers/vhost/vhost.c | 106 +-
>  1 file changed, 42 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 8995730ce0bf..7aa623117aab 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1290,10 +1290,36 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
> 		mutex_unlock(&d->vqs[i]->mutex);
>  }
>  
> -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> -   __virtio16 *idx)
> +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
>  {
> -	return vhost_get_avail(vq, *idx, &vq->avail->idx);
> + __virtio16 idx;
> + int r;
> +
> +	r = vhost_get_avail(vq, idx, &vq->avail->idx);
> + if (unlikely(r < 0)) {
> + vq_err(vq, "Failed to access available index at %p (%d)\n",
> +		       &vq->avail->idx, r);
> + return r;
> + }
> +
> +	/* Check it isn't doing very strange things with available indexes */
> + vq->avail_idx = vhost16_to_cpu(vq, idx);
> + if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) {
> + vq_err(vq, "Invalid available index change from %u to %u",
> +vq->last_avail_idx, vq->avail_idx);
> + return -EINVAL;
> + }
> +
> + /* We're done if there is nothing new */
> + if (vq->avail_idx == vq->last_avail_idx)
> + return 0;
> +
> + /*
> +  * We updated vq->avail_idx so we need a memory barrier between
> +  * the index read above and the caller reading avail ring entries.
> +  */
> + smp_rmb();
> + return 1;
>  }
>  
>  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> @@ -2498,38 +2524,17 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  {
>   struct vring_desc desc;
>   unsigned int i, head, found = 0;
> - u16 last_avail_idx;
> - __virtio16 avail_idx;
> + u16 last_avail_idx = vq->last_avail_idx;
>   __virtio16 ring_head;
>   int ret, access;
>  
> - /* Check it isn't doing very strange things with descriptor numbers. */
> - last_avail_idx = vq->last_avail_idx;
> -
>   if (vq->avail_idx == vq->last_avail_idx) {
> -		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> - vq_err(vq, "Failed to access avail idx at %p\n",
> -			       &vq->avail->idx);
> - return -EFAULT;
> - }
> - vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> -
> - if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
> - vq_err(vq, "Guest moved avail index from %u to %u",
> - last_avail_idx, vq->avail_idx);
> - return -EFAULT;
> - }
> + ret = vhost_get_avail_idx(vq);
> + if (unlikely(ret < 0))
> + return ret;
>  
> - /* If there's nothing new since last we looked, return
> -  * invalid.
> -  */
> - if (vq->avail_idx == last_avail_idx)
> + if (!ret)
>   return vq->num;
> -
> - /* Only get avail ring entries after they have been
> -  * exposed by guest.
> -  */
> - smp_rmb();
>   }
>  
>   /* Grab the next descriptor number they're advertising, and increment
> @@ -2790,35 +2795,20 @@ EXPORT_SYMBOL_GPL(vhost

Re: [PATCH 0/4] vhost: Cleanup

2024-04-29 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 01:24:03PM +1000, Gavin Shan wrote:
> This is suggested by Michael S. Tsirkin according to [1] and the goal
> is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it,
> the caller of the function needn't worry about memory barriers. Since
> we're here, other cleanups are also applied.


Gavin, I suggested another approach:
1. Start with the patch I sent (vhost: order avail ring reads after
   index updates); just do a diff against the latest and simplify error
   handling a bit.
2. Do any other cleanups on top.

> [1] 
> https://lore.kernel.org/virtualization/20240327075940-mutt-send-email-...@kernel.org/
> 
> PATCH[1] drops the local variable @last_avail_idx since it's equivalent
>  to vq->last_avail_idx
> PATCH[2] improves vhost_get_avail_idx() so that smp_rmb() is applied if
>  needed. Besides, the sanity checks on the retrieved available
>  queue index are also squeezed to vhost_get_avail_idx()
> PATCH[3] improves vhost_get_avail_head(), similar to what we're doing
>  for vhost_get_avail_idx(), so that the relevant sanity checks
>  on the head are squeezed to vhost_get_avail_head()
> PATCH[4] Reformat vhost_{get, put}_user() by using tab instead of space
>  as the terminator for each line
> 
> Gavin Shan (4):
>   vhost: Drop variable last_avail_idx in vhost_get_vq_desc()
>   vhost: Improve vhost_get_avail_idx() with smp_rmb()
>   vhost: Improve vhost_get_avail_head()
>   vhost: Reformat vhost_{get, put}_user()
> 
>  drivers/vhost/vhost.c | 199 +++---
>  1 file changed, 88 insertions(+), 111 deletions(-)
> 
> -- 
> 2.44.0




Re: Reply: [PATCH v5] vp_vdpa: don't allocate unused msix vectors

2024-04-25 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 08:42:57AM +, Angus Chen wrote:
> Hi mst.
> 
> > -Original Message-
> > From: Michael S. Tsirkin 
> > Sent: Tuesday, April 23, 2024 4:35 PM
> > To: Gavin Liu 
> > Cc: jasow...@redhat.com; Angus Chen ;
> > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com;
> > linux-kernel@vger.kernel.org; Heng Qi 
> > Subject: Re: Reply: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
> > 
> > On Tue, Apr 23, 2024 at 01:39:17AM +, Gavin Liu wrote:
> > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote:
> > > > From: Yuxue Liu 
> > > >
> > > > When there is a ctlq and it doesn't require interrupt callbacks, the
> > > > original method of calculating vectors wastes hardware msi or msix
> > > > resources as well as system IRQ resources.
> > > >
> > > > When conducting performance testing using testpmd in the guest os, it
> > > > was found that the performance was lower compared to directly using
> > > > vfio-pci to passthrough the device
> > > >
> > > > In scenarios where the virtio device in the guest os does not utilize
> > > > interrupts, the vdpa driver still configures the hardware's msix
> > > > vector. Therefore, the hardware still sends interrupts to the host os.
> > >
> > > > I just have a question on this part. How come hardware sends
> > > > interrupts? Does the guest driver not disable them?
> > >
> > > 1: Assuming the guest OS's virtio device is using PMD mode, QEMU sets
> > >    the call fd to -1.
> > > 2: On the host side, the vhost_vdpa program will set
> > >    vp_vdpa->vring[i].cb.callback to invalid.
> > > 3: Before the modification, the vp_vdpa_request_irq function does not
> > >    check whether vp_vdpa->vring[i].cb.callback is valid. Instead, it
> > >    enables the hardware's MSI-X interrupts based on the number of
> > >    queues of the device.
> > >
> > 
> > So MSIX is enabled but why would it trigger? virtio PMD in poll mode
> > presumably suppresses interrupts after all.
> Virtio PMD is in the guest, but on the host side the MSI-X is enabled, so
> the device will trigger interrupts normally. I analysed this bug before,
> and I think Gavin is right. Did I make it clear?

Not really. The guest presumably disables interrupts (it's polling), so
why does the device still send them?


> > 
> > >
> > >
> > > - Original Message -
> > > From: Michael S. Tsirkin m...@redhat.com
> > > Sent: April 22, 2024 20:09
> > > To: Gavin Liu gavin@jaguarmicro.com
> > > Cc: jasow...@redhat.com; Angus Chen angus.c...@jaguarmicro.com;
> > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com;
> > linux-kernel@vger.kernel.org; Heng Qi hen...@linux.alibaba.com
> > > Subject: Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote:
> > > > From: Yuxue Liu 
> > > >
> > > > When there is a ctlq and it doesn't require interrupt callbacks, the
> > > > original method of calculating vectors wastes hardware msi or msix
> > > > resources as well as system IRQ resources.
> > > >
> > > > When conducting performance testing using testpmd in the guest os, it
> > > > was found that the performance was lower compared to directly using
> > > > vfio-pci to passthrough the device
> > > >
> > > > In scenarios where the virtio device in the guest os does not utilize
> > > > interrupts, the vdpa driver still configures the hardware's msix
> > > > vector. Therefore, the hardware still sends interrupts to the host os.
> > >
> > > I just have a question on this part. How come hardware sends
> > > interrupts? Does the guest driver not disable them?
> > >
> > > > Because of this unnecessary
> > > > action by the hardware, hardware performance decreases, and it also
> > > > affects the performance of the host os.
> > > >
> > > > Before modification:(interrupt mode)
> > > >  32:  0   0  0  0 PCI-MSI 32768-edge  vp-vdpa[0000:00:02.0]-0
> > > >

[GIT PULL] virtio: bugfix

2024-04-25 Thread Michael S. Tsirkin
The following changes since commit 0bbac3facb5d6cc0171c45c9873a2dc96bea9680:

  Linux 6.9-rc4 (2024-04-14 13:38:39 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 98a821546b3919a10a58faa12ebe5e9a55cd638e:

  vDPA: code clean for vhost_vdpa uapi (2024-04-22 17:07:13 -0400)


virtio: bugfix

enum renames for vdpa uapi - we better do this now before
the names have been in any releases.

Signed-off-by: Michael S. Tsirkin 


Zhu Lingshan (1):
  vDPA: code clean for vhost_vdpa uapi

 drivers/vdpa/vdpa.c   | 6 +++---
 include/uapi/linux/vdpa.h | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-25 Thread Michael S. Tsirkin
On Thu, Apr 25, 2024 at 09:35:58AM +0800, Jason Wang wrote:
> On Wed, Apr 24, 2024 at 5:51 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Apr 24, 2024 at 08:44:10AM +0800, Jason Wang wrote:
> > > On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote:
> > > > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> > > > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > > > > > > > Add the function vduse_alloc_reconnnect_info_mem
> > > > > > > > > and vduse_free_reconnnect_info_mem
> > > > > > > > > These functions allow vduse to allocate and free memory for 
> > > > > > > > > reconnection
> > > > > > > > > information. The amount of memory allocated is vq_num pages.
> > > > > > > > > Each VQS will map its own page where the reconnection 
> > > > > > > > > information will be saved
> > > > > > > > >
> > > > > > > > > Signed-off-by: Cindy Lu 
> > > > > > > > > ---
> > > > > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 
> > > > > > > > > ++
> > > > > > > > >  1 file changed, 40 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > index ef3c9681941e..2da659d5f4a8 100644
> > > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > > > > > > > >   int irq_effective_cpu;
> > > > > > > > >   struct cpumask irq_affinity;
> > > > > > > > >   struct kobject kobj;
> > > > > > > > > + unsigned long vdpa_reconnect_vaddr;
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > >  struct vduse_dev;
> > > > > > > > > @@ -1105,6 +1106,38 @@ static void 
> > > > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq)
> > > > > > > > >
> > > > > > > > >   vq->irq_effective_cpu = curr_cpu;
> > > > > > > > >  }
> > > > > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev 
> > > > > > > > > *dev)
> > > > > > > > > +{
> > > > > > > > > + unsigned long vaddr = 0;
> > > > > > > > > + struct vduse_virtqueue *vq;
> > > > > > > > > +
> > > > > > > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > > > > > > + /*page 0~ vq_num save the reconnect info for 
> > > > > > > > > vq*/
> > > > > > > > > + vq = dev->vqs[i];
> > > > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL);
> > > > > > > >
> > > > > > > >
> > > > > > > > I don't get why you insist on stealing kernel memory for 
> > > > > > > > something
> > > > > > > > that is just used by userspace to store data for its own use.
> > > > > > > > Userspace does not lack ways to persist data, for example,
> > > > > > > > create a regular file anywhere in the filesystem.
> > > > > > >
> > > > > > > Good point. So the motivation here is to:
> > > > > > >
> > > > > > > 1) be self contained, no dependency for high speed persist data
> > > > > > > storage like tmpfs
> > > > > >
> > > > > > No idea what this means.
> > > > >
> > > > > I mean a regular file may slow down the datapath performance, so
> > > > > usually the application will try to use tmpfs or similar, which is a
> > > > > dependency for implementing the reconnection.
> > > >
> > > > Are we worried about systems without tmpfs now?
> > >
> > > Yes.
> >
> > Why? Who ships these?
> 
> Not sure, but it could be disabled or unmounted. I'm not sure making
> VDUSE depend on TMPFS is a good idea.
> 
> Thanks

Don't disable or unmount it then?
The use-case needs to be much clearer if we are adding a way for
userspace to pin kernel memory for unlimited time.

-- 
MST




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-24 Thread Michael S. Tsirkin
On Wed, Apr 24, 2024 at 08:44:10AM +0800, Jason Wang wrote:
> On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin  wrote:
> >
> > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote:
> > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > > > > > Add the function vduse_alloc_reconnnect_info_mem
> > > > > > > and vduse_free_reconnnect_info_mem
> > > > > > > These functions allow vduse to allocate and free memory for 
> > > > > > > reconnection
> > > > > > > information. The amount of memory allocated is vq_num pages.
> > > > > > > Each VQS will map its own page where the reconnection information 
> > > > > > > will be saved
> > > > > > >
> > > > > > > Signed-off-by: Cindy Lu 
> > > > > > > ---
> > > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 
> > > > > > > ++
> > > > > > >  1 file changed, 40 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > index ef3c9681941e..2da659d5f4a8 100644
> > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > > > > > >   int irq_effective_cpu;
> > > > > > >   struct cpumask irq_affinity;
> > > > > > >   struct kobject kobj;
> > > > > > > + unsigned long vdpa_reconnect_vaddr;
> > > > > > >  };
> > > > > > >
> > > > > > >  struct vduse_dev;
> > > > > > > @@ -1105,6 +1106,38 @@ static void 
> > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq)
> > > > > > >
> > > > > > >   vq->irq_effective_cpu = curr_cpu;
> > > > > > >  }
> > > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> > > > > > > +{
> > > > > > > + unsigned long vaddr = 0;
> > > > > > > + struct vduse_virtqueue *vq;
> > > > > > > +
> > > > > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > > > > + /*page 0~ vq_num save the reconnect info for vq*/
> > > > > > > + vq = dev->vqs[i];
> > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL);
> > > > > >
> > > > > >
> > > > > > I don't get why you insist on stealing kernel memory for something
> > > > > > that is just used by userspace to store data for its own use.
> > > > > > Userspace does not lack ways to persist data, for example,
> > > > > > create a regular file anywhere in the filesystem.
> > > > >
> > > > > Good point. So the motivation here is to:
> > > > >
> > > > > 1) be self contained, no dependency for high speed persist data
> > > > > storage like tmpfs
> > > >
> > > > No idea what this means.
> > >
> > > I mean a regular file may slow down the datapath performance, so
> > > usually the application will try to use tmpfs or similar, which is a
> > > dependency for implementing the reconnection.
> >
> > Are we worried about systems without tmpfs now?
> 
> Yes.

Why? Who ships these?


> >
> >
> > > >
> > > > > 2) standardize the format in uAPI which allows reconnection from
> > > > > arbitrary userspace, unfortunately, such effort was removed in new
> > > > > versions
> > > >
> > > > And I don't see why that has to live in the kernel tree either.
> > >
> > > I can't find a better place, any idea?
> > >
> > > Thanks
> >
> >
> > Well anywhere on github really. with libvhost-user maybe?

Re: [PATCH v3 2/4] virtio_balloon: introduce oom-kill invocations

2024-04-23 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 11:41:07AM +0800, zhenwei pi wrote:
> When the guest OS runs under critical memory pressure, the guest
> starts to kill processes. A guest monitor agent may scan 'oom_kill'
> from /proc/vmstat, and reports the OOM KILL event. However, the agent
> may be killed and we will lose this critical event (and the later
> events).
> 
> For now we can also grep for magic words in guest kernel log from host
> side. Rather than this unstable way, virtio balloon reports OOM-KILL
> invocations instead.
> 
> Acked-by: David Hildenbrand 
> Signed-off-by: zhenwei pi 
> ---
>  drivers/virtio/virtio_balloon.c | 1 +
>  include/uapi/linux/virtio_balloon.h | 6 --
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 1710e3098ecd..f7a47eaa0936 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -330,6 +330,7 @@ static inline unsigned int update_balloon_vm_stats(struct 
> virtio_balloon *vb)
>   pages_to_bytes(events[PSWPOUT]));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
>  
>  #ifdef CONFIG_HUGETLB_PAGE
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
> diff --git a/include/uapi/linux/virtio_balloon.h 
> b/include/uapi/linux/virtio_balloon.h
> index ddaa45e723c4..b17bbe033697 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -71,7 +71,8 @@ struct virtio_balloon_config {
>  #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
>  #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
>  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation 
> failures */
> -#define VIRTIO_BALLOON_S_NR   10
> +#define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
> +#define VIRTIO_BALLOON_S_NR   11
>  
>  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
>   VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \

Looks like a useful extension. But
any UAPI extension has to go to the virtio spec first.

> @@ -83,7 +84,8 @@ struct virtio_balloon_config {
>   VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \
>   VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
>   VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
> - VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \
> + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
> + VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \
>  }
>  
>  #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")
> -- 
> 2.34.1




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-23 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote:
> On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin  wrote:
> >
> > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > > > Add the function vduse_alloc_reconnnect_info_mem
> > > > > and vduse_free_reconnnect_info_mem
> > > > > These functions allow vduse to allocate and free memory for 
> > > > > reconnection
> > > > > information. The amount of memory allocated is vq_num pages.
> > > > > Each VQS will map its own page where the reconnection information 
> > > > > will be saved
> > > > >
> > > > > Signed-off-by: Cindy Lu 
> > > > > ---
> > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 
> > > > > ++
> > > > >  1 file changed, 40 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > index ef3c9681941e..2da659d5f4a8 100644
> > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > > > >   int irq_effective_cpu;
> > > > >   struct cpumask irq_affinity;
> > > > >   struct kobject kobj;
> > > > > + unsigned long vdpa_reconnect_vaddr;
> > > > >  };
> > > > >
> > > > >  struct vduse_dev;
> > > > > @@ -1105,6 +1106,38 @@ static void 
> > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq)
> > > > >
> > > > >   vq->irq_effective_cpu = curr_cpu;
> > > > >  }
> > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> > > > > +{
> > > > > + unsigned long vaddr = 0;
> > > > > + struct vduse_virtqueue *vq;
> > > > > +
> > > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > > + /*page 0~ vq_num save the reconnect info for vq*/
> > > > > + vq = dev->vqs[i];
> > > > > + vaddr = get_zeroed_page(GFP_KERNEL);
> > > >
> > > >
> > > > I don't get why you insist on stealing kernel memory for something
> > > > that is just used by userspace to store data for its own use.
> > > > Userspace does not lack ways to persist data, for example,
> > > > create a regular file anywhere in the filesystem.
> > >
> > > Good point. So the motivation here is to:
> > >
> > > 1) be self contained, no dependency for high speed persist data
> > > storage like tmpfs
> >
> > No idea what this means.
> 
> I mean a regular file may slow down the datapath performance, so
> usually the application will try to use tmpfs and other which is a
> dependency for implementing the reconnection.

Are we worried about systems without tmpfs now?


> >
> > > 2) standardize the format in uAPI which allows reconnection from
> > > arbitrary userspace, unfortunately, such effort was removed in new
> > > versions
> >
> > And I don't see why that has to live in the kernel tree either.
> 
> I can't find a better place, any idea?
> 
> Thanks


Well anywhere on github really. with libvhost-user maybe?
It's harmless enough in Documentation
if you like but ties you to the kernel release cycle in a way that
is completely unnecessary.

> >
> > > If the above doesn't make sense, we don't need to offer those pages by 
> > > VDUSE.
> > >
> > > Thanks
> > >
> > >
> > > >
> > > >
> > > >
> > > > > + if (vaddr == 0)
> > > > > + return -ENOMEM;
> > > > > +
> > > > > + vq->vdpa_reconnect_vaddr = vaddr;
> > > > > + }
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev)
> > > > > +{
> > > > > + struct vduse_virtqueue *vq;
> > > > > +
> > > > &g

Re: Reply: [PATCH v5] vp_vdpa: don't allocate unused msix vectors

2024-04-23 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 01:39:17AM +, Gavin Liu wrote:
> On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote:
> > From: Yuxue Liu 
> >
> > When there is a ctlq and it doesn't require interrupt callbacks, the 
> > original method of calculating vectors wastes hardware msi or msix 
> > resources as well as system IRQ resources.
> >
> > When conducting performance testing using testpmd in the guest os, it 
> > was found that the performance was lower compared to directly using 
> > vfio-pci to passthrough the device
> >
> > In scenarios where the virtio device in the guest os does not utilize 
> > interrupts, the vdpa driver still configures the hardware's msix 
> > vector. Therefore, the hardware still sends interrupts to the host os.
> 
> >I just have a question on this part. If the hardware sends interrupts, why
> >doesn't the guest driver disable them?
>
> 1: Assuming the guest OS's Virtio device is using PMD mode, QEMU sets the
> call fd to -1
> 2: On the host side, the vhost_vdpa program will set
> vp_vdpa->vring[i].cb.callback to invalid
> 3: Before the modification, the vp_vdpa_request_irq function does not check
> whether vp_vdpa->vring[i].cb.callback is valid. Instead, it enables the
> hardware's MSIX interrupts based on the number of queues of the device
> 

So MSIX is enabled but why would it trigger? virtio PMD in poll mode
presumably suppresses interrupts after all.

> 
> 
> - Original Message -
> From: Michael S. Tsirkin m...@redhat.com
> Sent: April 22, 2024 20:09
> To: Gavin Liu gavin@jaguarmicro.com
> Cc: jasow...@redhat.com; Angus Chen angus.c...@jaguarmicro.com; 
> virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; 
> linux-kernel@vger.kernel.org; Heng Qi hen...@linux.alibaba.com
> Subject: Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
> 
> 
> 
> External Mail: This email originated from OUTSIDE of the organization!
> Do not click links, open attachments or provide ANY information unless you 
> recognize the sender and know the content is safe.
> 
> 
> On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote:
> > From: Yuxue Liu 
> >
> > When there is a ctlq and it doesn't require interrupt callbacks, the 
> > original method of calculating vectors wastes hardware msi or msix 
> > resources as well as system IRQ resources.
> >
> > When conducting performance testing using testpmd in the guest os, it 
> > was found that the performance was lower compared to directly using 
> > vfio-pci to passthrough the device
> >
> > In scenarios where the virtio device in the guest os does not utilize 
> > interrupts, the vdpa driver still configures the hardware's msix 
> > vector. Therefore, the hardware still sends interrupts to the host os.
> 
> I just have a question on this part. If the hardware sends interrupts, why
> doesn't the guest driver disable them?
> 
> > Because of this unnecessary
> > action by the hardware, hardware performance decreases, and it also 
> > affects the performance of the host os.
> >
> > Before modification:(interrupt mode)
> >  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
> >  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
> >  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
> >  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> >
> > After modification:(interrupt mode)
> >  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
> >  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
> >  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config
> >
> > Before modification:(virtio pmd mode for guest os)
> >  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
> >  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
> >  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
> >  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> >
> > After modification:(virtio pmd mode for guest os)
> >  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config
> >
> > To verify the use of the virtio PMD mode in the guest operating 
> > system, the following patch needs to be applied to QEMU:
> > https://lore.kernel.org/all/20240408073311.2049-1-yuxue.liu@jaguarmicr
> > o.com
> >
> > Signed-off-by: Yuxue Liu 
> > Acked-by: Jason Wang 
> > Reviewed-by: Heng Qi 
> > ---
> > V5: modify the description of the printout when an exception occurs
> > V4: update the title and 

Re: [PATCH 0/3] Improve memory statistics for virtio balloon

2024-04-22 Thread Michael S. Tsirkin
On Thu, Apr 18, 2024 at 02:25:59PM +0800, zhenwei pi wrote:
> RFC -> v1:
> - several text changes: oom-kill -> oom-kills, SCAN_ASYNC -> ASYN_SCAN.
> - move vm events codes into '#ifdef CONFIG_VM_EVENT_COUNTERS'
> 
> RFC version:
> Link: 
> https://lore.kernel.org/lkml/20240415084113.1203428-1-pizhen...@bytedance.com/T/#m1898963b3c27a989b1123db475135c3ca687ca84


Make sure this builds without introducing new warnings please. 

> zhenwei pi (3):
>   virtio_balloon: introduce oom-kill invocations
>   virtio_balloon: introduce memory allocation stall counter
>   virtio_balloon: introduce memory scan/reclaim info
> 
>  drivers/virtio/virtio_balloon.c | 30 -
>  include/uapi/linux/virtio_balloon.h | 16 +--
>  2 files changed, 43 insertions(+), 3 deletions(-)
> 
> -- 
> 2.34.1




Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-22 Thread Michael S. Tsirkin
On Mon, Apr 08, 2024 at 02:15:24PM +1000, Gavin Shan wrote:
> Hi Michael,
> 
> On 3/30/24 19:02, Gavin Shan wrote:
> > On 3/28/24 19:31, Michael S. Tsirkin wrote:
> > > On Thu, Mar 28, 2024 at 10:21:49AM +1000, Gavin Shan wrote:
> > > > All the callers of vhost_get_avail_idx() are concerned with the memory
> > > > barrier, imposed by smp_rmb() to ensure the order of the available
> > > > ring entry read and avail_idx read.
> > > > 
> > > > Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> > > > the avail_idx is advanced. With it, the callers needn't to worry
> > > > about the memory barrier.
> > > > 
> > > > Suggested-by: Michael S. Tsirkin 
> > > > Signed-off-by: Gavin Shan 
> > > 
> > > Previous patches are ok. This one I feel needs more work -
> > > first more code such as sanity checking should go into
> > > this function, second there's actually a difference
> > > between comparing to last_avail_idx and just comparing
> > > to the previous value of avail_idx.
> > > I will pick patches 1-2 and post a cleanup on top so you can
> > > take a look, ok?
> > > 
> > 
> > Thanks, Michael. It's fine to me.
> > 
> 
> A kindly ping.
> 
> If it's ok to you, could you please merge PATCH[1-2]? Our downstream
> 9.4 need the fixes, especially for NVidia's grace-hopper and grace-grace
> platforms.
> 
> For PATCH[3], I also can help with the improvement if you don't have time
> for it. Please let me know.
> 
> Thanks,
> Gavin

1-2 are upstream, go ahead and post the cleanup.

-- 
MST
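The direction discussed in this thread can be modeled in a standalone userspace sketch (the names and struct layout below are illustrative, not the merged kernel code): the helper re-reads the shared avail_idx and issues the acquire barrier itself whenever the index has advanced, so callers no longer need their own smp_rmb() before reading ring entries.

```c
#include <stdint.h>

/* Illustrative stand-in for struct vhost_virtqueue; only the fields the
 * sketch needs. shared_avail_idx models the guest-visible avail->idx. */
struct toy_vq {
	uint16_t avail_idx;      /* last value read from the guest */
	uint16_t last_avail_idx; /* next entry to process */
	const volatile uint16_t *shared_avail_idx;
};

/* Refresh the cached avail_idx. The acquire fence (smp_rmb() in the
 * kernel) runs only when new entries appeared, ordering the index read
 * before any subsequent reads of the ring entries themselves.
 * Returns how many entries are pending. */
static uint16_t toy_get_avail_idx(struct toy_vq *vq)
{
	vq->avail_idx = *vq->shared_avail_idx;
	if (vq->avail_idx != vq->last_avail_idx)
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
	return (uint16_t)(vq->avail_idx - vq->last_avail_idx);
}
```

With the barrier folded into the helper, every caller that previously open-coded smp_rmb() after reading avail_idx can drop it, which is the cleanup being requested above.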




Re: [PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio

2024-04-22 Thread Michael S. Tsirkin
On Tue, Apr 09, 2024 at 09:48:08AM +0800, Hou Tao wrote:
> Hi,
> 
> On 4/8/2024 3:45 PM, Michael S. Tsirkin wrote:
> > On Wed, Feb 28, 2024 at 10:41:20PM +0800, Hou Tao wrote:
> >> From: Hou Tao 
> >>
> >> Hi,
> >>
> >> The patch set aims to fix the warning related to an abnormal size
> >> parameter of kmalloc() in virtiofs. The warning occurred when attempting
> >> to insert a 10MB sized kernel module kept in a virtiofs with cache
> >> disabled. As analyzed in patch #1, the root cause is that the length of
> >> the read buffer is not limited, and the read buffer is passed directly to
> >> virtiofs through out_args[0].value. Therefore patch #1 limits the
> >> length of the read buffer passed to virtiofs by using max_pages. However
> >> it is not enough, because now the maximal value of max_pages is 256.
> >> Consequently, when reading a 10MB-sized kernel module, the length of the
> >> bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will
> >> try to allocate 2MB from memory subsystem. The request for 2MB of
> >> physically contiguous memory significantly stresses the memory subsystem
> >> and may fail indefinitely on hosts with fragmented memory. To address
> >> this, patch #2~#5 use scattered pages in a bio_vec to replace the
> >> kmalloc-allocated bounce buffer when the length of the bounce buffer for
> >> KVEC_ITER dio is larger than PAGE_SIZE. The final issue with the
> >> allocation of the bounce buffer and sg array in virtiofs is that
> >> GFP_ATOMIC is used even when the allocation occurs in a kworker context.
> >> Therefore the last patch uses GFP_NOFS for the allocation of both sg
> >> array and bounce buffer when initiated by the kworker. For more details,
> >> please check the individual patches.
> >>
> >> As usual, comments are always welcome.
> >>
> >> Change Log:
> > Bernd should I just merge the patchset as is?
> > It seems to fix a real problem and no one has the
> > time to work on a better fix  WDYT?
> 
> Sorry for the long delay. I am just start to prepare for v3. In v3, I
> plan to avoid the unnecessary memory copy between fuse args and bio_vec.
> Will post it before next week.

Didn't happen before this week apparently.

> >
> >
> >> v2:
> >>   * limit the length of ITER_KVEC dio by max_pages instead of the
> >> newly-introduced max_nopage_rw. Using max_pages make the ITER_KVEC
> >> dio being consistent with other rw operations.
> >>   * replace kmalloc-allocated bounce buffer by using a bounce buffer
> >> backed by scattered pages when the length of the bounce buffer for
> >> KVEC_ITER dio is larger than PAGE_SIZE, so even on hosts with
> >> fragmented memory, the KVEC_ITER dio can be handled normally by
> >> virtiofs. (Bernd Schubert)
> >>   * merge the GFP_NOFS patch [1] into this patch-set and use
> >> memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS
> >> (Benjamin Coddington)
> >>
> >> v1: 
> >> https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/
> >>
> >> [1]: 
> >> https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/
> >>
> >> Hou Tao (6):
> >>   fuse: limit the length of ITER_KVEC dio by max_pages
> >>   virtiofs: move alloc/free of argbuf into separated helpers
> >>   virtiofs: factor out more common methods for argbuf
> >>   virtiofs: support bounce buffer backed by scattered pages
> >>   virtiofs: use scattered bounce buffer for ITER_KVEC dio
> >>   virtiofs: use GFP_NOFS when enqueuing request through kworker
> >>
> >>  fs/fuse/file.c  |  12 +-
> >>  fs/fuse/virtio_fs.c | 336 +---
> >>  2 files changed, 296 insertions(+), 52 deletions(-)
> >>
> >> -- 
> >> 2.29.2
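The patch #1 idea above, capping an ITER_KVEC read to max_pages worth of data so the virtiofs bounce buffer stays bounded, reduces to a clamp like this (a standalone sketch under assumed names, not the FUSE code itself):

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/* Clamp a direct-IO length to what max_pages of bounce buffer can hold,
 * so a 10MB read is served in max_pages-sized chunks instead of
 * demanding one huge physically contiguous allocation. */
static size_t toy_clamp_dio_len(size_t len, unsigned int max_pages)
{
	size_t cap = (size_t)max_pages * TOY_PAGE_SIZE;

	return len < cap ? len : cap;
}
```

With max_pages = 256 this caps each request's bounce buffer near 1MB, which is why patches #2-#5 are still needed to avoid the remaining large contiguous allocation.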




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-22 Thread Michael S. Tsirkin
On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin  wrote:
> >
> > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > Add the function vduse_alloc_reconnnect_info_mem
> > > and vduse_free_reconnnect_info_mem.
> > > These functions allow vduse to allocate and free memory for reconnection
> > > information. The amount of memory allocated is vq_num pages.
> > > Each VQ will map its own page where the reconnection information will be
> > > saved.
> > >
> > > Signed-off-by: Cindy Lu 
> > > ---
> > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 ++
> > >  1 file changed, 40 insertions(+)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index ef3c9681941e..2da659d5f4a8 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > >   int irq_effective_cpu;
> > >   struct cpumask irq_affinity;
> > >   struct kobject kobj;
> > > + unsigned long vdpa_reconnect_vaddr;
> > >  };
> > >
> > >  struct vduse_dev;
> > > @@ -1105,6 +1106,38 @@ static void vduse_vq_update_effective_cpu(struct 
> > > vduse_virtqueue *vq)
> > >
> > >   vq->irq_effective_cpu = curr_cpu;
> > >  }
> > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> > > +{
> > > + unsigned long vaddr = 0;
> > > + struct vduse_virtqueue *vq;
> > > +
> > > + for (int i = 0; i < dev->vq_num; i++) {
> > > + /*page 0~ vq_num save the reconnect info for vq*/
> > > + vq = dev->vqs[i];
> > > + vaddr = get_zeroed_page(GFP_KERNEL);
> >
> >
> > I don't get why you insist on stealing kernel memory for something
> > that is just used by userspace to store data for its own use.
> > Userspace does not lack ways to persist data, for example,
> > create a regular file anywhere in the filesystem.
> 
> Good point. So the motivation here is to:
> 
> 1) be self contained, no dependency for high speed persist data
> storage like tmpfs

No idea what this means.

> 2) standardize the format in uAPI which allows reconnection from
> arbitrary userspace, unfortunately, such effort was removed in new
> versions

And I don't see why that has to live in the kernel tree either.

> If the above doesn't make sense, we don't need to offer those pages by VDUSE.
> 
> Thanks
> 
> 
> >
> >
> >
> > > + if (vaddr == 0)
> > > + return -ENOMEM;
> > > +
> > > + vq->vdpa_reconnect_vaddr = vaddr;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev)
> > > +{
> > > + struct vduse_virtqueue *vq;
> > > +
> > > + for (int i = 0; i < dev->vq_num; i++) {
> > > + vq = dev->vqs[i];
> > > +
> > > + if (vq->vdpa_reconnect_vaddr)
> > > + free_page(vq->vdpa_reconnect_vaddr);
> > > + vq->vdpa_reconnect_vaddr = 0;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > >
> > >  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >   unsigned long arg)
> > > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name)
> > >   mutex_unlock(&dev->lock);
> > >   return -EBUSY;
> > >   }
> > > + vduse_free_reconnnect_info_mem(dev);
> > > +
> > >   dev->connected = true;
> > >   mutex_unlock(&dev->lock);
> > >
> > > @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct 
> > > vduse_dev_config *config,
> > >   ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num);
> > >   if (ret)
> > >   goto err_vqs;
> > > + ret = vduse_alloc_reconnnect_info_mem(dev);
> > > + if (ret < 0)
> > > + goto err_mem;
> > >
> > >   __module_get(THIS_MODULE);
> > >
> > >   return 0;
> > >  err_vqs:
> > >   device_destroy(&vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
> > > +err_mem:
> > > + vduse_free_reconnnect_info_mem(dev);
> > >  err_dev:
> > >   idr_remove(&vduse_idr, dev->minor);
> > >  err_idr:
> > > --
> > > 2.43.0
> >
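The regular-file alternative suggested in this thread needs no kernel support at all: userspace can keep per-queue reconnect state in an mmap()ed file that survives a restart. A minimal sketch (the record layout is hypothetical; the thread never settled on a format):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical per-queue reconnect record; one page per queue, matching
 * the granularity used by the VDUSE patch above. */
struct toy_vq_state {
	unsigned short last_avail_idx;
	unsigned short last_used_idx;
};

/* Map @size bytes of @path read/write, creating and extending the file
 * as needed. Returns MAP_FAILED on error. The data persists across
 * process restarts because the backing store is an ordinary file. */
static void *toy_map_state(const char *path, size_t size)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	void *p;

	if (fd < 0)
		return MAP_FAILED;
	if (ftruncate(fd, (off_t)size) < 0) {
		close(fd);
		return MAP_FAILED;
	}
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd); /* the mapping keeps the file data reachable */
	return p;
}
```

Placing the file on tmpfs gives page-cache-speed access, but any filesystem works, which is the point being made about not stealing kernel memory for this.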




Re: [PATCH virt] virt: fix uninit-value in vhost_vsock_dev_open

2024-04-22 Thread Michael S. Tsirkin
On Mon, Apr 22, 2024 at 09:00:31AM -0400, Stefan Hajnoczi wrote:
> On Sun, Apr 21, 2024 at 12:06:06PM +0900, Jeongjun Park wrote:
> > static bool vhost_transport_seqpacket_allow(u32 remote_cid)
> > {
> > 
> > vsock = vhost_vsock_get(remote_cid);
> > 
> > if (vsock)
> > seqpacket_allow = vsock->seqpacket_allow;
> > 
> > }
> > 
> > I think this is due to reading a previously created uninitialized 
> > vsock->seqpacket_allow inside vhost_transport_seqpacket_allow(), 
> > which is executed by the function pointer present in the if statement.
> 
> CCing Arseny, author of commit ced7b713711f ("vhost/vsock: support
> SEQPACKET for transport").
> 
> Looks like a genuine bug in the commit. vhost_vsock_set_features() sets
> seqpacket_allow to true when the feature is negotiated. The assumption
> is that the field defaults to false.
> 
> The rest of the vhost_vsock.ko code is written to initialize the
> vhost_vsock fields, so you could argue seqpacket_allow should just be
> explicitly initialized to false.
> 
> However, eliminating this class of errors by zeroing seems reasonable in
> this code path. vhost_vsock_dev_open() is not performance-critical.
> 
> Acked-by: Stefan Hajnoczi 



But now that it's explained, the bugfix as proposed is incomplete:
userspace can set features twice and the second time will leak
old VIRTIO_VSOCK_F_SEQPACKET bit value.

And I am pretty sure the Fixes tag is wrong.

So I wrote this, but I actually don't have a setup for
seqpacket to test this. Arseny could you help test maybe?
Thanks!


commit bcc17a060d93b198d8a17a9b87b593f41337ee28
Author: Michael S. Tsirkin 
Date:   Mon Apr 22 10:03:13 2024 -0400

vhost/vsock: always initialize seqpacket_allow

There are two issues around seqpacket_allow:
1. seqpacket_allow is not initialized when socket is
created. Thus if features are never set, it will be
read uninitialized.
2. if VIRTIO_VSOCK_F_SEQPACKET is set and then cleared,
then seqpacket_allow will not be cleared appropriately
(existing apps I know about don't usually do this but
it's legal and there's no way to be sure no one relies
on this).

To fix:
- initialize seqpacket_allow after allocation
- set it unconditionally in set_features

Reported-by: syzbot+6c21aeb59d0e82eb2...@syzkaller.appspotmail.com
Reported-by: Jeongjun Park 
Fixes: ced7b713711f ("vhost/vsock: support SEQPACKET for transport").
Cc: Arseny Krasnov 
Cc: David S. Miller 
Cc: Stefan Hajnoczi 
Signed-off-by: Michael S. Tsirkin 

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index ec20ecff85c7..bf664ec9341b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -667,6 +667,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct 
file *file)
}
 
vsock->guest_cid = 0; /* no CID assigned yet */
+   vsock->seqpacket_allow = false;
 
atomic_set(&vsock->queued_replies, 0);
 
@@ -810,8 +811,7 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
goto err;
}
 
-   if (features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET))
-   vsock->seqpacket_allow = true;
+   vsock->seqpacket_allow = features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET);
 
for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
vq = &vsock->vqs[i];

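The stale-bit leak described in the commit message is easy to see in isolation. A standalone sketch of the two variants (VIRTIO_VSOCK_F_SEQPACKET is feature bit 1 in the virtio spec; names here are illustrative):

```c
#include <stdbool.h>

#define TOY_VSOCK_F_SEQPACKET 1 /* VIRTIO_VSOCK_F_SEQPACKET */

struct toy_vsock { bool seqpacket_allow; };

/* Pre-fix behaviour: the flag is only ever set, so negotiating features
 * a second time without the bit leaves the stale 'true' behind. */
static void toy_set_features_buggy(struct toy_vsock *v, unsigned long long f)
{
	if (f & (1ULL << TOY_VSOCK_F_SEQPACKET))
		v->seqpacket_allow = true;
}

/* Fixed behaviour: assign unconditionally, as the patch above does, so
 * clearing the feature bit also clears the flag. */
static void toy_set_features_fixed(struct toy_vsock *v, unsigned long long f)
{
	v->seqpacket_allow = f & (1ULL << TOY_VSOCK_F_SEQPACKET);
}
```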



Re: [syzbot] [virt?] [net?] KMSAN: uninit-value in vsock_assign_transport (2)

2024-04-22 Thread Michael S. Tsirkin
On Fri, Apr 19, 2024 at 02:39:20AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:8cd26fd90c1a Merge tag 'for-6.9-rc4-tag' of git://git.kern..
> git tree:   upstream
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=102d27cd18
> kernel config:  https://syzkaller.appspot.com/x/.config?x=87a805e655619c64
> dashboard link: https://syzkaller.appspot.com/bug?extid=6c21aeb59d0e82eb2782
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=16e38c3b18
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10e62fed18
> 
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/488822aee24a/disk-8cd26fd9.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/ba40e322ba00/vmlinux-8cd26fd9.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/f30af1dfbc30/bzImage-8cd26fd9.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+6c21aeb59d0e82eb2...@syzkaller.appspotmail.com
> 
> =
> BUG: KMSAN: uninit-value in vsock_assign_transport+0xb2a/0xb90 
> net/vmw_vsock/af_vsock.c:500
>  vsock_assign_transport+0xb2a/0xb90 net/vmw_vsock/af_vsock.c:500
>  vsock_connect+0x544/0x1560 net/vmw_vsock/af_vsock.c:1393
>  __sys_connect_file net/socket.c:2048 [inline]
>  __sys_connect+0x606/0x690 net/socket.c:2065
>  __do_sys_connect net/socket.c:2075 [inline]
>  __se_sys_connect net/socket.c:2072 [inline]
>  __x64_sys_connect+0x91/0xe0 net/socket.c:2072
>  x64_sys_call+0x3356/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:43
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Uninit was created at:
>  __kmalloc_large_node+0x231/0x370 mm/slub.c:3921
>  __do_kmalloc_node mm/slub.c:3954 [inline]
>  __kmalloc_node+0xb07/0x1060 mm/slub.c:3973
>  kmalloc_node include/linux/slab.h:648 [inline]
>  kvmalloc_node+0xc0/0x2d0 mm/util.c:634
>  kvmalloc include/linux/slab.h:766 [inline]
>  vhost_vsock_dev_open+0x44/0x510 drivers/vhost/vsock.c:659
>  misc_open+0x66b/0x760 drivers/char/misc.c:165
>  chrdev_open+0xa5f/0xb80 fs/char_dev.c:414
>  do_dentry_open+0x11f1/0x2120 fs/open.c:955
>  vfs_open+0x7e/0xa0 fs/open.c:1089
>  do_open fs/namei.c:3642 [inline]
>  path_openat+0x4a3c/0x5b00 fs/namei.c:3799
>  do_filp_open+0x20e/0x590 fs/namei.c:3826
>  do_sys_openat2+0x1bf/0x2f0 fs/open.c:1406
>  do_sys_open fs/open.c:1421 [inline]
>  __do_sys_openat fs/open.c:1437 [inline]
>  __se_sys_openat fs/open.c:1432 [inline]
>  __x64_sys_openat+0x2a1/0x310 fs/open.c:1432
>  x64_sys_call+0x3a64/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:258
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> CPU: 1 PID: 5021 Comm: syz-executor390 Not tainted 
> 6.9.0-rc4-syzkaller-00038-g8cd26fd90c1a #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 03/27/2024
> =
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> 
> If the report is already addressed, let syzbot know by replying with:
> #syz fix: exact-commit-title
> 
> If you want syzbot to run the reproducer, reply with:
> #syz test: git://repo/address.git branch-or-commit-hash
> If you attach or paste a git patch, syzbot will apply it before testing.
> 
> If you want to overwrite report's subsystems, reply with:
> #syz set subsystems: new-subsystem
> (See the list of subsystem names on the web dashboard)
> 
> If the report is a duplicate of another one, reply with:
> #syz dup: exact-subject-of-another-report
> 
> If you want to undo deduplication, reply with:
> #syz undup


#syz test: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git 
bcc17a060d93b198d8a17a9b87b593f41337ee28






Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors

2024-04-22 Thread Michael S. Tsirkin
On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote:
> From: Yuxue Liu 
> 
> When there is a ctlq and it doesn't require interrupt
> callbacks, the original method of calculating vectors
> wastes hardware msi or msix resources as well as system
> IRQ resources.
> 
> When conducting performance testing using testpmd in the
> guest os, it was found that the performance was lower compared
> to directly using vfio-pci to passthrough the device
> 
> In scenarios where the virtio device in the guest os does
> not utilize interrupts, the vdpa driver still configures
> the hardware's msix vector. Therefore, the hardware still
> sends interrupts to the host os.

I just have a question on this part. If the hardware
sends interrupts, why doesn't the guest driver disable them?

> Because of this unnecessary
> action by the hardware, hardware performance decreases, and
> it also affects the performance of the host os.
> 
> Before modification:(interrupt mode)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(interrupt mode)
>  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
>  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
>  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config
> 
> Before modification:(virtio pmd mode for guest os)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(virtio pmd mode for guest os)
>  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config
> 
> To verify the use of the virtio PMD mode in the guest operating
> system, the following patch needs to be applied to QEMU:
> https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com
> 
> Signed-off-by: Yuxue Liu 
> Acked-by: Jason Wang 
> Reviewed-by: Heng Qi 
> ---
> V5: modify the description of the printout when an exception occurs
> V4: update the title and assign values to uninitialized variables
> V3: delete unused variables and add validation records
> V2: fix when allocating IRQs, scan all queues
> 
>  drivers/vdpa/virtio_pci/vp_vdpa.c | 22 --
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
> b/drivers/vdpa/virtio_pci/vp_vdpa.c
> index df5f4a3bccb5..8de0224e9ec2 100644
> --- a/drivers/vdpa/virtio_pci/vp_vdpa.c
> +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
> @@ -160,7 +160,13 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   struct pci_dev *pdev = mdev->pci_dev;
>   int i, ret, irq;
>   int queues = vp_vdpa->queues;
> - int vectors = queues + 1;
> + int vectors = 1;
> + int msix_vec = 0;
> +
> + for (i = 0; i < queues; i++) {
> + if (vp_vdpa->vring[i].cb.callback)
> + vectors++;
> + }
>  
>   ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
>   if (ret != vectors) {
> @@ -173,9 +179,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   vp_vdpa->vectors = vectors;
>  
>   for (i = 0; i < queues; i++) {
> + if (!vp_vdpa->vring[i].cb.callback)
> + continue;
> +
>   snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
>   "vp-vdpa[%s]-%d\n", pci_name(pdev), i);
> - irq = pci_irq_vector(pdev, i);
> + irq = pci_irq_vector(pdev, msix_vec);
>   ret = devm_request_irq(&pdev->dev, irq,
>  vp_vdpa_vq_handler,
>  0, vp_vdpa->vring[i].msix_name,
> @@ -185,21 +194,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   "vp_vdpa: fail to request irq for vq %d\n", i);
>   goto err;
>   }
> - vp_modern_queue_vector(mdev, i, i);
> + vp_modern_queue_vector(mdev, i, msix_vec);
>   vp_vdpa->vring[i].irq = irq;
> + msix_vec++;
>   }
>  
>   snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",
>pci_name(pdev));
> - irq = pci_irq_vector(pdev, queues);
> + irq = pci_irq_vector(pdev, msix_vec);
>   ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0,
>  vp_vdpa->msix_name, vp_vdpa);
>   if (ret) {
>   dev_err(&pdev->dev,
> - "vp_vdpa: fail to request irq for vq %d\n", i);
> + "vp_vdpa: fail to request irq for config: %d\n", ret);
>   goto err;
>   }
> - 
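Abstracted from the diff above, the allocation policy is simply one vector per virtqueue that has an interrupt callback, plus one for the config interrupt (a standalone sketch, not the driver code):

```c
#include <stdbool.h>

/* Count the MSI-X vectors vp_vdpa needs after the patch: virtqueues
 * without an interrupt callback (e.g. a ctlq driven by polling, or all
 * data queues under a virtio PMD guest) get no vector; the config
 * interrupt always gets one. */
static int toy_vectors_needed(const bool *has_callback, int queues)
{
	int vectors = 1; /* config interrupt */

	for (int i = 0; i < queues; i++)
		if (has_callback[i])
			vectors++;
	return vectors;
}
```

This matches the before/after /proc/interrupts listings in the commit message: three queues with no callbacks collapse to a single config vector.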

Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function

2024-04-22 Thread Michael S. Tsirkin
On Mon, Apr 22, 2024 at 03:42:51PM +0800, zhenwei pi wrote:
> All the VM-event-related statistics depend on
> 'CONFIG_VM_EVENT_COUNTERS', once any stack variable is required by any
> VM events in future, we would have codes like:
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>   unsigned long foo;
>  #endif
>   ...
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>   foo = events[XXX] + events[YYY];
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
>  #endif
> 
> Separate vm events into a single function, also remove
> 'CONFIG_VM_EVENT_COUNTERS' from 'update_balloon_stats'.
> 
> Signed-off-by: zhenwei pi 
> ---
>  drivers/virtio/virtio_balloon.c | 44 ++---
>  1 file changed, 29 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 1f5b3dd31fcf..59fe157e5722 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -316,34 +316,48 @@ static inline void update_stat(struct virtio_balloon 
> *vb, int idx,
>  
>  #define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
>  
> -static unsigned int update_balloon_stats(struct virtio_balloon *vb)
> +/* Return the number of entries filled by vm events */
> +static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb,
> +unsigned int start)
>  {
> +#ifdef CONFIG_VM_EVENT_COUNTERS
>   unsigned long events[NR_VM_EVENT_ITEMS];
> - struct sysinfo i;
> - unsigned int idx = 0;
> - long available;
> - unsigned long caches;
> + unsigned int idx = start;
>  
>   all_vm_events(events);
> - si_meminfo(&i);
> -
> - available = si_mem_available();
> - caches = global_node_page_state(NR_FILE_PAGES);
> -
> -#ifdef CONFIG_VM_EVENT_COUNTERS
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
> - pages_to_bytes(events[PSWPIN]));
> + pages_to_bytes(events[PSWPIN]));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
> - pages_to_bytes(events[PSWPOUT]));
> + pages_to_bytes(events[PSWPOUT]));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
> +
>  #ifdef CONFIG_HUGETLB_PAGE
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
>   events[HTLB_BUDDY_PGALLOC]);
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
>   events[HTLB_BUDDY_PGALLOC_FAIL]);
> -#endif
> -#endif
> +#endif /* CONFIG_HUGETLB_PAGE */
> +
> + return idx - start;
> +#else /* CONFIG_VM_EVENT_COUNTERS */
> +
> + return 0;
> +#endif /* CONFIG_VM_EVENT_COUNTERS */
> +}
> +

Generally the preferred style is this:

#ifdef .

static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb,
   unsigned int start)
{

}

#else /* CONFIG_VM_EVENT_COUNTERS */

static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb,
   unsigned int start)
{
return 0;
}

#endif

however given it was a spaghetti of ifdefs even before that,
the patch's ok I think.


> +static unsigned int update_balloon_stats(struct virtio_balloon *vb)
> +{
> + struct sysinfo i;
> + unsigned int idx = 0;
> + long available;
> + unsigned long caches;
> +
> + idx += update_balloon_vm_stats(vb, idx);
> +
> + si_meminfo(&i);
> + available = si_mem_available();
> + caches = global_node_page_state(NR_FILE_PAGES);
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
>   pages_to_bytes(i.freeram));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,
> -- 
> 2.34.1
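The refactoring pattern under review, where each helper returns how many entries it filled so the caller can chain sections with idx +=, can be sketched standalone (toy names; tags and values are placeholders):

```c
/* Toy tag/value pair standing in for struct virtio_balloon_stat. */
struct toy_stat { int tag; unsigned long long val; };

/* Fill the VM-event section starting at @start and return the number of
 * entries written; a !CONFIG_VM_EVENT_COUNTERS stub would return 0 and
 * the caller's arithmetic still works unchanged. */
static unsigned int toy_fill_vm_stats(struct toy_stat *stats, unsigned int start)
{
	unsigned int idx = start;

	stats[idx].tag = 0; stats[idx].val = 111; idx++; /* e.g. SWAP_IN  */
	stats[idx].tag = 1; stats[idx].val = 222; idx++; /* e.g. SWAP_OUT */
	return idx - start;
}
```

Returning the count rather than the final index keeps each section independent of where the caller placed it, which is what makes the #ifdef split clean.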




Re: [PATCH virt] virt: fix uninit-value in vhost_vsock_dev_open

2024-04-20 Thread Michael S. Tsirkin
On Sat, Apr 20, 2024 at 05:57:50PM +0900, Jeongjun Park wrote:
> Change vhost_vsock_dev_open() to use kvzalloc() instead of kvmalloc()
> to avoid uninit state.
> 
> Reported-by: syzbot+6c21aeb59d0e82eb2...@syzkaller.appspotmail.com
> Fixes: dcda9b04713c ("mm, tree wide: replace __GFP_REPEAT by 
> __GFP_RETRY_MAYFAIL with more useful semantic")
> Signed-off-by: Jeongjun Park 

What value exactly is used uninitialized?

> ---
>  drivers/vhost/vsock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index ec20ecff85c7..652ef97a444b 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -656,7 +656,7 @@ static int vhost_vsock_dev_open(struct inode *inode, 
> struct file *file)
>   /* This struct is large and allocation could fail, fall back to vmalloc
>* if there is no other way.
>*/
> - vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> + vsock = kvzalloc(sizeof(*vsock), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>   if (!vsock)
>   return -ENOMEM;
>  
> -- 
> 2.34.1
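For context on why the zeroing matters here: kvmalloc() returns uninitialized memory, so any field of the large struct that the open path forgets to set is read back as garbage later. A userspace sketch of the difference, using calloc() as the analogue of kvzalloc(); the struct and its fields are invented for illustration:

```c
#include <stdlib.h>

/* Stand-in for a large driver-private struct; the fields are hypothetical. */
struct fake_vsock {
	int seqpacket_allow;	/* flag that must not be read uninitialized */
	char heavy_state[4096];
};

/* kvmalloc() analogue: contents are indeterminate. */
static struct fake_vsock *open_unzeroed(void)
{
	return malloc(sizeof(struct fake_vsock));
}

/* kvzalloc() analogue: every byte starts at zero. */
static struct fake_vsock *open_zeroed(void)
{
	return calloc(1, sizeof(struct fake_vsock));
}
```

With the zeroed variant, code that checks seqpacket_allow before it is explicitly assigned sees a well-defined 0 instead of whatever the allocator left behind.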




Re: [PATCH 1/1] virtio: Add support for the virtio suspend feature

2024-04-18 Thread Michael S. Tsirkin
On Thu, Apr 18, 2024 at 03:14:37PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 4/17/2024 4:54 PM, David Stevens wrote:
> > Add support for the VIRTIO_F_SUSPEND feature. When this feature is
> > negotiated, power management can use it to suspend virtio devices
> > instead of resorting to resetting the devices entirely.
> > 
> > Signed-off-by: David Stevens 
> > ---
> >   drivers/virtio/virtio.c| 32 ++
> >   drivers/virtio/virtio_pci_common.c | 29 +++
> >   drivers/virtio/virtio_pci_modern.c | 19 ++
> >   include/linux/virtio.h |  2 ++
> >   include/uapi/linux/virtio_config.h | 10 +-
> >   5 files changed, 74 insertions(+), 18 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> > index f4080692b351..cd11495a5098 100644
> > --- a/drivers/virtio/virtio.c
> > +++ b/drivers/virtio/virtio.c
> > @@ -1,5 +1,6 @@
> >   // SPDX-License-Identifier: GPL-2.0-only
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -580,6 +581,37 @@ int virtio_device_restore(struct virtio_device *dev)
> > return ret;
> >   }
> >   EXPORT_SYMBOL_GPL(virtio_device_restore);
> > +
> > +static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool 
> > enabled)
> > +{
> > +   u8 status, target;
> > +
> > +   status = dev->config->get_status(dev);
> > +   if (enabled)
> > +   target = status | VIRTIO_CONFIG_S_SUSPEND;
> > +   else
> > +   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
> > +   dev->config->set_status(dev, target);
> I think it is better to verify whether the device SUSPEND bit is
> already set or clear, we can just return if status == target.
> 
> Thanks
> Zhu Lingshan
> > +
> > +   while ((status = dev->config->get_status(dev)) != target) {
> > +   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
> > +   return -EIO;
> > +   mdelay(10);

Bad device state (set by surprise removal) should also
be handled here I think.
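Folding both review points together (skip the write when the bit already matches, and bail out when the device reports failure), the resulting control flow could look roughly like this userspace simulation; the S_SUSPEND bit value, the fake device, and the bounded retry count are all hypothetical:

```c
#define S_SUSPEND	0x10	/* hypothetical value for VIRTIO_CONFIG_S_SUSPEND */
#define S_NEEDS_RESET	0x40	/* matches VIRTIO_CONFIG_S_NEEDS_RESET */

/* Fake device: acknowledges the new status immediately. */
static unsigned char fake_status;
static unsigned char get_status(void) { return fake_status; }
static void set_status(unsigned char s) { fake_status = s; }

static int set_suspend_bit(int enabled)
{
	unsigned char status = get_status();
	unsigned char target = enabled ? (status | S_SUSPEND)
				       : (status & (unsigned char)~S_SUSPEND);

	if (status == target)	/* already in the requested state: nothing to do */
		return 0;

	set_status(target);
	for (int tries = 0; tries < 100; tries++) {
		status = get_status();
		if (status == target)
			return 0;
		if (status & S_NEEDS_RESET)	/* device gave up */
			return -5;		/* -EIO */
		/* a real driver would mdelay() here */
	}
	return -110;	/* -ETIMEDOUT: bound the wait instead of spinning forever */
}
```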


> > +   }
> > +   return 0;
> > +}
> > +
> > +int virtio_device_suspend(struct virtio_device *dev)
> > +{
> > +   return virtio_device_set_suspend_bit(dev, true);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_device_suspend);
> > +
> > +int virtio_device_resume(struct virtio_device *dev)
> > +{
> > +   return virtio_device_set_suspend_bit(dev, false);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_device_resume);
> >   #endif
> >   static int virtio_init(void)
> > diff --git a/drivers/virtio/virtio_pci_common.c 
> > b/drivers/virtio/virtio_pci_common.c
> > index b655fccaf773..4d542de05970 100644
> > --- a/drivers/virtio/virtio_pci_common.c
> > +++ b/drivers/virtio/virtio_pci_common.c
> > @@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
> > return virtio_device_restore(&vp_dev->vdev);
> >   }
> > -static bool vp_supports_pm_no_reset(struct device *dev)
> > +static int virtio_pci_suspend(struct device *dev)
> >   {
> > struct pci_dev *pci_dev = to_pci_dev(dev);
> > -   u16 pmcsr;
> > -
> > -   if (!pci_dev->pm_cap)
> > -   return false;
> > -
> > -   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > -   if (PCI_POSSIBLE_ERROR(pmcsr)) {
> > -   dev_err(dev, "Unable to query pmcsr");
> > -   return false;
> > -   }
> > +   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> > -   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
> > -}
> > +   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
> > +   return virtio_device_suspend(&vp_dev->vdev);
> > -static int virtio_pci_suspend(struct device *dev)
> > -{
> > -   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
> > +   return virtio_pci_freeze(dev);
> >   }
> >   static int virtio_pci_resume(struct device *dev)
> >   {
> > -   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
> > +   struct pci_dev *pci_dev = to_pci_dev(dev);
> > +   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> > +
> > +   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
> > +   return virtio_device_resume(&vp_dev->vdev);
> > +
> > +   return virtio_pci_restore(dev);
> >   }
> >   static const struct dev_pm_ops virtio_pci_pm_ops = {
> > diff --git a/drivers/virtio/virtio_pci_modern.c 
> > b/drivers/virtio/virtio_pci_modern.c
> > index f62b530aa3b5..ac8734526b8d 100644
> > --- a/drivers/virtio/virtio_pci_modern.c
> > +++ b/drivers/virtio/virtio_pci_modern.c
> > @@ -209,6 +209,22 @@ static void vp_modern_avq_deactivate(struct 
> > virtio_device *vdev)
> > __virtqueue_break(admin_vq->info.vq);
> >   }
> > +static bool vp_supports_pm_no_reset(struct pci_dev *pci_dev)
> > +{
> > +   u16 pmcsr;
> > +
> > +   if (!pci_dev->pm_cap)
> > +   return false;
> > +
> > +   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > +   if (PCI_POSSIBLE_ERROR(pmcsr)) {
> > +   dev_err(&pci_dev->dev, "Unable to query pmcsr");
> > +   return false;

Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-17 Thread Michael S. Tsirkin
On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> Add the functions vduse_alloc_reconnnect_info_mem
> and vduse_free_reconnnect_info_mem.
> These functions allow vduse to allocate and free memory for reconnection
> information. The amount of memory allocated is vq_num pages.
> Each VQ will map its own page where the reconnection information will be 
> saved
> 
> Signed-off-by: Cindy Lu 
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 40 ++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index ef3c9681941e..2da659d5f4a8 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -65,6 +65,7 @@ struct vduse_virtqueue {
>   int irq_effective_cpu;
>   struct cpumask irq_affinity;
>   struct kobject kobj;
> + unsigned long vdpa_reconnect_vaddr;
>  };
>  
>  struct vduse_dev;
> @@ -1105,6 +1106,38 @@ static void vduse_vq_update_effective_cpu(struct 
> vduse_virtqueue *vq)
>  
>   vq->irq_effective_cpu = curr_cpu;
>  }
> +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> +{
> + unsigned long vaddr = 0;
> + struct vduse_virtqueue *vq;
> +
> + for (int i = 0; i < dev->vq_num; i++) {
> + /* pages 0 ~ vq_num save the reconnect info for each vq */
> + vq = dev->vqs[i];
> + vaddr = get_zeroed_page(GFP_KERNEL);


I don't get why you insist on stealing kernel memory for something
that is just used by userspace to store data for its own use.
Userspace does not lack ways to persist data, for example,
create a regular file anywhere in the filesystem.
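The userspace alternative hinted at here is straightforward: keep the per-vq reconnect state in a plain file and mmap() it, so it survives the application restarting. A minimal sketch under stated assumptions (the file path and the record layout are invented for illustration; a real implementation would pick its own):

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

struct vq_reconnect {		/* hypothetical per-vq record */
	uint16_t last_avail_idx;
	uint16_t last_used_idx;
};

/* Persist one record per page-sized slot in the given file. */
static int save_state(const char *path, int vq, struct vq_reconnect st)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)(vq + 1) * 4096) < 0) {
		close(fd);
		return -1;
	}
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
		       fd, (off_t)vq * 4096);
	close(fd);
	if (p == MAP_FAILED)
		return -1;
	*(struct vq_reconnect *)p = st;
	munmap(p, 4096);	/* MAP_SHARED: contents persist in the file */
	return 0;
}

static int load_state(const char *path, int vq, struct vq_reconnect *st)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	void *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, (off_t)vq * 4096);
	close(fd);
	if (p == MAP_FAILED)
		return -1;
	*st = *(struct vq_reconnect *)p;
	munmap(p, 4096);
	return 0;
}
```

The mapping survives process exit because the pages are backed by the file, not by kernel memory stolen per device.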



> + if (vaddr == 0)
> + return -ENOMEM;
> +
> + vq->vdpa_reconnect_vaddr = vaddr;
> + }
> +
> + return 0;
> +}
> +
> +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev)
> +{
> + struct vduse_virtqueue *vq;
> +
> + for (int i = 0; i < dev->vq_num; i++) {
> + vq = dev->vqs[i];
> +
> + if (vq->vdpa_reconnect_vaddr)
> + free_page(vq->vdpa_reconnect_vaddr);
> + vq->vdpa_reconnect_vaddr = 0;
> + }
> +
> + return 0;
> +}
>  
>  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>   unsigned long arg)
> @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name)
>   mutex_unlock(&dev->lock);
>   return -EBUSY;
>   }
> + vduse_free_reconnnect_info_mem(dev);
> +
>   dev->connected = true;
>   mutex_unlock(&dev->lock);
>  
> @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct vduse_dev_config 
> *config,
>   ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num);
>   if (ret)
>   goto err_vqs;
> + ret = vduse_alloc_reconnnect_info_mem(dev);
> + if (ret < 0)
> + goto err_mem;
>  
>   __module_get(THIS_MODULE);
>  
>   return 0;
>  err_vqs:
>   device_destroy(&vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
> +err_mem:
> + vduse_free_reconnnect_info_mem(dev);
>  err_dev:
>   idr_remove(&vduse_idr, dev->minor);
>  err_idr:
> -- 
> 2.43.0




Re: [PATCH] vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API

2024-04-14 Thread Michael S. Tsirkin
On Sun, Apr 14, 2024 at 10:59:06AM +0200, Christophe JAILLET wrote:
> Le 14/04/2024 à 10:35, Michael S. Tsirkin a écrit :
> > On Mon, Jan 15, 2024 at 09:35:50PM +0100, Christophe JAILLET wrote:
> > > ida_alloc() and ida_free() should be preferred to the deprecated
> > > ida_simple_get() and ida_simple_remove().
> > > 
> > > Note that the upper limit of ida_simple_get() is exclusive, buInputt the 
> > > one of
> > 
> > What's buInputt? But?
> 
> Yes, sorry. It is "but".
> 
> Let me know if I should send a v2, or if it can be fixed when it is applied.
> 
> CJ

Yes it's easier if you do. Thanks!

> > 
> > > ida_alloc_max() is inclusive. So a -1 has been added when needed.
> > > 
> > > Signed-off-by: Christophe JAILLET 
> > 
> > 
> > Jason, wanna ack?
> > 
> > > ---
> > >   drivers/vhost/vdpa.c | 6 +++---
> > >   1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > index bc4a51e4638b..849b9d2dd51f 100644
> > > --- a/drivers/vhost/vdpa.c
> > > +++ b/drivers/vhost/vdpa.c
> > > @@ -1534,7 +1534,7 @@ static void vhost_vdpa_release_dev(struct device 
> > > *device)
> > >   struct vhost_vdpa *v =
> > >  container_of(device, struct vhost_vdpa, dev);
> > > - ida_simple_remove(&vhost_vdpa_ida, v->minor);
> > > + ida_free(&vhost_vdpa_ida, v->minor);
> > >   kfree(v->vqs);
> > >   kfree(v);
> > >   }
> > > @@ -1557,8 +1557,8 @@ static int vhost_vdpa_probe(struct vdpa_device 
> > > *vdpa)
> > >   if (!v)
> > >   return -ENOMEM;
> > > - minor = ida_simple_get(&vhost_vdpa_ida, 0,
> > > -VHOST_VDPA_DEV_MAX, GFP_KERNEL);
> > > + minor = ida_alloc_max(&vhost_vdpa_ida, VHOST_VDPA_DEV_MAX - 1,
> > > +   GFP_KERNEL);
> > >   if (minor < 0) {
> > >   kfree(v);
> > >   return minor;
> > > -- 
> > > 2.43.0
> > 
> > 
> > 




Re: [PATCH] vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API

2024-04-14 Thread Michael S. Tsirkin
On Mon, Jan 15, 2024 at 09:35:50PM +0100, Christophe JAILLET wrote:
> ida_alloc() and ida_free() should be preferred to the deprecated
> ida_simple_get() and ida_simple_remove().
> 
> Note that the upper limit of ida_simple_get() is exclusive, buInputt the one 
> of

What's buInputt? But?

> ida_alloc_max() is inclusive. So a -1 has been added when needed.
> 
> Signed-off-by: Christophe JAILLET 


Jason, wanna ack?

> ---
>  drivers/vhost/vdpa.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index bc4a51e4638b..849b9d2dd51f 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1534,7 +1534,7 @@ static void vhost_vdpa_release_dev(struct device 
> *device)
>   struct vhost_vdpa *v =
>  container_of(device, struct vhost_vdpa, dev);
>  
> - ida_simple_remove(&vhost_vdpa_ida, v->minor);
> + ida_free(&vhost_vdpa_ida, v->minor);
>   kfree(v->vqs);
>   kfree(v);
>  }
> @@ -1557,8 +1557,8 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
>   if (!v)
>   return -ENOMEM;
>  
> - minor = ida_simple_get(&vhost_vdpa_ida, 0,
> -VHOST_VDPA_DEV_MAX, GFP_KERNEL);
> + minor = ida_alloc_max(&vhost_vdpa_ida, VHOST_VDPA_DEV_MAX - 1,
> +   GFP_KERNEL);
>   if (minor < 0) {
>   kfree(v);
>   return minor;
> -- 
> 2.43.0
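The exclusive-vs-inclusive point in the commit message is easy to get wrong, so here is a toy bitmap allocator that mirrors the two calling conventions (ida_simple_get()'s end is exclusive, ida_alloc_max()'s max is inclusive). The allocator itself is of course a stand-in, not the real IDA:

```c
#define DEV_MAX 8	/* plays the role of VHOST_VDPA_DEV_MAX */

static int alloc_in_range(unsigned char *used, int lo, int hi_incl)
{
	for (int i = lo; i <= hi_incl; i++)
		if (!used[i]) {
			used[i] = 1;
			return i;
		}
	return -1;	/* analogue of -ENOSPC */
}

/* ida_simple_get(ida, start, end, gfp): 'end' is EXCLUSIVE */
static int sim_ida_simple_get(unsigned char *used, int start, int end)
{
	return alloc_in_range(used, start, end - 1);
}

/* ida_alloc_max(ida, max, gfp): 'max' is INCLUSIVE */
static int sim_ida_alloc_max(unsigned char *used, int max)
{
	return alloc_in_range(used, 0, max);
}
```

Passing DEV_MAX - 1 to the inclusive variant yields exactly the same ID range as passing DEV_MAX to the exclusive one, which is the -1 the patch adds.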




[GIT PULL] virtio: bugfixes

2024-04-14 Thread Michael S. Tsirkin
The following changes since commit fec50db7033ea478773b159e0e2efb135270e3b7:

  Linux 6.9-rc3 (2024-04-07 13:22:46 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 76f408535aab39c33e0a1dcada9fba5631c65595:

  vhost: correct misleading printing information (2024-04-08 04:11:04 -0400)


virtio: bugfixes

Some small, obvious (in hindsight) bugfixes:

- new ioctl in vhost-vdpa has a wrong # - not too late to fix

- vhost has apparently been lacking an smp_rmb() -
  due to code duplication :( The duplication will be fixed in
  the next merge cycle, this is a minimal fix.

- an error message in vhost talks about guest moving used index -
  which of course never happens, guest only ever moves the
  available index.

- i2c-virtio didn't set the driver owner so it did not get
  refcounted correctly.

Signed-off-by: Michael S. Tsirkin 


Gavin Shan (2):
  vhost: Add smp_rmb() in vhost_vq_avail_empty()
  vhost: Add smp_rmb() in vhost_enable_notify()

Krzysztof Kozlowski (1):
  virtio: store owner from modules with register_virtio_driver()

Michael S. Tsirkin (1):
  vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE

Xianting Tian (1):
  vhost: correct misleading printing information

 .../driver-api/virtio/writing_virtio_drivers.rst   |  1 -
 drivers/vhost/vhost.c  | 30 ++
 drivers/virtio/virtio.c|  6 +++--
 include/linux/virtio.h |  7 +++--
 include/uapi/linux/vhost.h | 15 ++-
 5 files changed, 42 insertions(+), 17 deletions(-)
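The "guest moving used index" confusion fixed above comes down to split-ring ownership: the driver (guest) only ever advances avail->idx, the device only ever advances used->idx, and each side needs read-barrier/acquire semantics when reading the index the *other* side writes — which is what the missing smp_rmb() provided. A userspace sketch of that ownership split, simplified from the split-virtqueue layout (ring size and types are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

struct split_ring {
	_Atomic uint16_t avail_idx;	/* written by the driver only */
	_Atomic uint16_t used_idx;	/* written by the device only */
	uint16_t avail_ring[8];
	uint16_t used_ring[8];
};

/* Driver side: fill the ring entry, then advance avail_idx with release. */
static void driver_post(struct split_ring *r, uint16_t head)
{
	uint16_t idx = atomic_load_explicit(&r->avail_idx, memory_order_relaxed);

	r->avail_ring[idx % 8] = head;
	atomic_store_explicit(&r->avail_idx, idx + 1, memory_order_release);
}

/*
 * Device side: acquire-load avail_idx before reading the ring entry,
 * so the entry is guaranteed to be observed after the index update.
 */
static int device_fetch(struct split_ring *r, uint16_t last_avail,
			uint16_t *head)
{
	if (atomic_load_explicit(&r->avail_idx, memory_order_acquire) == last_avail)
		return -1;	/* ring empty */
	*head = r->avail_ring[last_avail % 8];
	return 0;
}
```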




Re: [PATCH v4] vp_vdpa: don't allocate unused msix vectors

2024-04-09 Thread Michael S. Tsirkin
Good and clear subject, I like it.

On Tue, Apr 09, 2024 at 04:58:18PM +0800, lyx634449800 wrote:
> From: Yuxue Liu 
> 
> When there is a ctlq and it doesn't require interrupt
> callbacks, the original method of calculating vectors
> wastes hardware msi or msix resources as well as system
> IRQ resources.
> 
> When conducting performance testing using testpmd in the
> guest os, it was found that the performance was lower compared
> to directly using vfio-pci to passthrough the device
> 
> In scenarios where the virtio device in the guest os does
> not utilize interrupts, the vdpa driver still configures
> the hardware's msix vector. Therefore, the hardware still
> sends interrupts to the host os. Because of this unnecessary
> action by the hardware, hardware performance decreases, and
> it also affects the performance of the host os.
> 
> Before modification:(interrupt mode)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(interrupt mode)
>  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
>  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
>  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config
> 
> Before modification:(virtio pmd mode for guest os)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(virtio pmd mode for guest os)
>  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config
> 
> To verify the use of the virtio PMD mode in the guest operating
> system, the following patch needs to be applied to QEMU:
> https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com
> 
> Signed-off-by: Yuxue Liu 
> Acked-by: Jason Wang 

Much better, thanks!
A couple of small tweaks to polish it up and it'll be ready.

> ---
> V4: Update the title and assign values to uninitialized variables
> V3: delete unused variables and add validation records
> V2: fix when allocating IRQs, scan all queues
> 
>  drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +--
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
> b/drivers/vdpa/virtio_pci/vp_vdpa.c
> index df5f4a3bccb5..74bc8adfc7e8 100644
> --- a/drivers/vdpa/virtio_pci/vp_vdpa.c
> +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
> @@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   struct pci_dev *pdev = mdev->pci_dev;
>   int i, ret, irq;
>   int queues = vp_vdpa->queues;
> - int vectors = queues + 1;
> + int vectors = 0;
> + int msix_vec = 0;
> +
> + for (i = 0; i < queues; i++) {
> + if (vp_vdpa->vring[i].cb.callback)
> + vectors++;
> + }
> + vectors++;


Actually even easier: int vectors = 1; and then we do not need
this last line.
Sorry I only noticed now.
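The counting suggested here boils down to: start at 1 (the config vector) and add one per queue that actually has a callback. As a standalone sketch, with simplified stand-ins for the driver's types:

```c
#include <stddef.h>

typedef void (*vq_callback_t)(void *priv);

/* One MSI-X vector per queue that has a callback, plus one for config. */
static int needed_vectors(const vq_callback_t *cbs, int queues)
{
	int vectors = 1;	/* config interrupt */

	for (int i = 0; i < queues; i++)
		if (cbs[i])
			vectors++;
	return vectors;
}

static void dummy_cb(void *priv) { (void)priv; }
```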

>  
>   ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
>   if (ret != vectors) {
> @@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   vp_vdpa->vectors = vectors;
>  
>   for (i = 0; i < queues; i++) {
> + if (!vp_vdpa->vring[i].cb.callback)
> + continue;
> +
>   snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
>   "vp-vdpa[%s]-%d\n", pci_name(pdev), i);
> - irq = pci_irq_vector(pdev, i);
> + irq = pci_irq_vector(pdev, msix_vec);
>   ret = devm_request_irq(&pdev->dev, irq,
>  vp_vdpa_vq_handler,
>  0, vp_vdpa->vring[i].msix_name,
> @@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   "vp_vdpa: fail to request irq for vq %d\n", i);
>   goto err;
>   }
> - vp_modern_queue_vector(mdev, i, i);
> + vp_modern_queue_vector(mdev, i, msix_vec);
>   vp_vdpa->vring[i].irq = irq;
> + msix_vec++;
>   }
>  
>   snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",
>pci_name(pdev));
> - irq = pci_irq_vector(pdev, queues);
> + irq = pci_irq_vector(pdev, msix_vec);
>   ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0,
>  vp_vdpa->msix_name, vp_vdpa);
>   if (ret) {
>   dev_err(>dev,
> - "vp_vdpa: fail to request irq for vq %d\n", i);
> + "vp_vdpa: fail to request irq for config, ret %d\n", 
> ret);

As long as we are here 

Re: [PATCH] drivers/virtio: delayed configuration descriptor flags

2024-04-09 Thread Michael S. Tsirkin
On Tue, Apr 09, 2024 at 01:02:52AM +0800, ni.liqiang wrote:
> In our testing of the virtio hardware accelerator, we found that
> configuring the flags of the descriptor after addr and len,
> as implemented in DPDK, seems to be more friendly to the hardware.
> 
> In our Virtio hardware implementation tests, using the default
> open-source code, the hardware's bulk reads ensure performance
> but correctness is compromised. If we refer to the implementation code
> of DPDK, placing the flags configuration of the descriptor
> after addr and len, virtio backend can function properly based on
> our hardware accelerator.
> 
> I am somewhat puzzled by this. From a software process perspective,
> it seems that there should be no difference whether
> the flags configuration of the descriptor is before or after addr and len.
> However, this is not the case according to experimental test results.


You should be aware of the following, from the PCI Express spec.
Note especially the second paragraph, and the last paragraph:

2.4.2. Update Ordering and Granularity Observed by a Read Transaction
If a Requester using a single transaction reads a block of data from a 
Completer, and the
Completer's data buffer is concurrently being updated, the ordering of multiple 
updates and
granularity of each update reflected in the data returned by the read is 
outside the scope of this
specification. This applies both to updates performed by PCI Express write 
transactions and
updates performed by other mechanisms such as host CPUs updating host memory.
If a Requester using a single transaction reads a block of data from a 
Completer, and the
Completer's data buffer is concurrently being updated by one or more entities 
not on the PCI
Express fabric, the ordering of multiple updates and granularity of each update 
reflected in the data
returned by the read is outside the scope of this specification.




As an example of update ordering, assume that the block of data is in host 
memory, and a host CPU
writes first to location A and then to a different location B. A Requester 
reading that data block
with a single read transaction is not guaranteed to observe those updates in 
order. In other words,
the Requester may observe an updated value in location B and an old value in 
location A, regardless
of the placement of locations A and B within the data block. Unless a Completer 
makes its own
guarantees (outside this specification) with respect to update ordering, a 
Requester that relies on
update ordering must observe the update to location B via one read transaction 
before initiating a
subsequent read to location A to return its updated value.




As an example of update granularity, if a host CPU writes a QWORD to host 
memory, a Requester
reading that QWORD from host memory may observe a portion of the QWORD updated 
and
another portion of it containing the old value.
While not required by this specification, it is strongly recommended that host 
platforms guarantee
that when a host CPU writes aligned DWORDs or aligned QWORDs to host memory, 
the update
granularity observed by a PCI Express read will not be smaller than a DWORD.


IMPLEMENTATION NOTE
No Ordering Required Between Cachelines
A Root Complex serving as a Completer to a single Memory Read that requests 
multiple cachelines
from host memory is permitted to fetch multiple cachelines concurrently, to 
help facilitate multi-
cacheline completions, subject to Max_Payload_Size. No ordering relationship 
between these
cacheline fetches is required.





Now I suspect that what is going on is that your Root complex
reads descriptors out of order, so the second descriptor is invalid
but the 1st one is valid.
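The software-side discipline behind the DPDK ordering is: fill in addr/len/id first, then publish the flags last with release semantics, and have the consumer check the flags with acquire semantics before trusting the body. A userspace sketch with C11 atomics — this models only the CPU-side ordering, not the PCIe read-ordering issue quoted above, and the field layout is simplified from the packed-ring descriptor:

```c
#include <stdatomic.h>
#include <stdint.h>

#define F_AVAIL 0x80	/* hypothetical "descriptor is valid" flag */

struct pdesc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	_Atomic uint16_t flags;
};

/* Producer: body first, flags last (release orders the body before flags). */
static void publish(struct pdesc *d, uint64_t addr, uint32_t len, uint16_t id)
{
	d->addr = addr;
	d->len = len;
	d->id = id;
	atomic_store_explicit(&d->flags, F_AVAIL, memory_order_release);
}

/* Consumer: check flags with acquire before reading the body. */
static int consume(struct pdesc *d, uint64_t *addr, uint32_t *len)
{
	if (!(atomic_load_explicit(&d->flags, memory_order_acquire) & F_AVAIL))
		return -1;	/* not published yet */
	*addr = d->addr;
	*len = d->len;
	return 0;
}
```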




> We would like to know if such a change in the configuration order
> is reasonable and acceptable?

We need to understand the root cause and how robust the fix is
before answering this.


> Thanks.
> 
> Signed-off-by: ni.liqiang 
> Reviewed-by: jin.qi 
> Tested-by: jin.qi 
> Cc: ni.liqiang 
> ---
>  drivers/virtio/virtio_ring.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 6f7e5010a673..bea2c2fb084e 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -1472,15 +1472,16 @@ static inline int virtqueue_add_packed(struct 
> virtqueue *_vq,
>   flags = cpu_to_le16(vq->packed.avail_used_flags |
>   (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
>   (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> - if (i == head)
> - head_flags = flags;
> - else
> - desc[i].flags = flags;
>  
>   desc[i].addr = cpu_to_le64(addr);
>   desc[i].len = cpu_to_le32(sg->length);
>   desc[i].id = cpu_to_le16(id);
>  
> +

Re: [PATCH v3] vp_vdpa: fix the method of calculating vectors

2024-04-08 Thread Michael S. Tsirkin
better subject:

 vp_vdpa: don't allocate unused msix vectors

to make it clear it's not a bugfix.




more comments below, but most importantly this
looks like it adds a bug.

On Tue, Apr 09, 2024 at 09:49:35AM +0800, lyx634449800 wrote:
> When there is a ctlq and it doesn't require interrupt
> callbacks, the original method of calculating vectors
> wastes hardware msi or msix resources as well as system
> IRQ resources.
> 
> When conducting performance testing using testpmd in the
> guest os, it was found that the performance was lower compared
> to directly using vfio-pci to passthrough the device
> 
> In scenarios where the virtio device in the guest os does
> not utilize interrupts, the vdpa driver still configures
> the hardware's msix vector. Therefore, the hardware still
> sends interrupts to the host os. Because of this unnecessary
> action by the hardware, hardware performance decreases, and
> it also affects the performance of the host os.
> 
> Before modification:(interrupt mode)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(interrupt mode)
>  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
>  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
>  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config
> 
> Before modification:(virtio pmd mode for guest os)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(virtio pmd mode for guest os)
>  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config
> 
> To verify the use of the virtio PMD mode in the guest operating
> system, the following patch needs to be applied to QEMU:
> https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com
> 
> Signed-off-by: lyx634449800 


Bad S.O.B format. Should be

Signed-off-by: Real Name 


> ---
> 
> V3: delete unused variables and add validation records
> V2: fix when allocating IRQs, scan all queues
> 
>  drivers/vdpa/virtio_pci/vp_vdpa.c | 35 +++
>  1 file changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
> b/drivers/vdpa/virtio_pci/vp_vdpa.c
> index df5f4a3bccb5..cd3aeb3b8f21 100644
> --- a/drivers/vdpa/virtio_pci/vp_vdpa.c
> +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
> @@ -160,22 +160,31 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   struct pci_dev *pdev = mdev->pci_dev;
>   int i, ret, irq;
>   int queues = vp_vdpa->queues;
> - int vectors = queues + 1;
> + int msix_vec, allocated_vectors = 0;


I would actually call allocated_vectors -> vectors, make the patch
smaller.

>  
> - ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
> - if (ret != vectors) {
> + for (i = 0; i < queues; i++) {
> + if (vp_vdpa->vring[i].cb.callback)
> + allocated_vectors++;
> + }
> + allocated_vectors = allocated_vectors + 1;

better: 
allocated_vectors++; /* extra one for config */

> +
> + ret = pci_alloc_irq_vectors(pdev, allocated_vectors, allocated_vectors,
> + PCI_IRQ_MSIX);
> + if (ret != allocated_vectors) {
>   dev_err(&pdev->dev,
>   "vp_vdpa: fail to allocate irq vectors want %d but 
> %d\n",
> - vectors, ret);
> + allocated_vectors, ret);
>   return ret;
>   }
> -
> - vp_vdpa->vectors = vectors;
> + vp_vdpa->vectors = allocated_vectors;
>  
>   for (i = 0; i < queues; i++) {
> + if (!vp_vdpa->vring[i].cb.callback)
> + continue;
> +
>   snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
>   "vp-vdpa[%s]-%d\n", pci_name(pdev), i);
> - irq = pci_irq_vector(pdev, i);
> + irq = pci_irq_vector(pdev, msix_vec);

Using uninitialized msix_vec here?

I would expect the compiler to warn about it.


Pay attention to compiler warnings, please.


>   ret = devm_request_irq(&pdev->dev, irq,
>  vp_vdpa_vq_handler,
>  0, vp_vdpa->vring[i].msix_name,
> @@ -185,23 +194,23 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   "vp_vdpa: fail to request irq for vq %d\n", i);
>   goto err;
>   }
> - vp_modern_queue_vector(mdev, i, i);
> + vp_modern_queue_vector(mdev, i, msix_vec);
>   

Re: [PATCH v3] Documentation: Add reconnect process for VDUSE

2024-04-08 Thread Michael S. Tsirkin
On Mon, Apr 08, 2024 at 08:39:21PM +0800, Cindy Lu wrote:
> On Mon, Apr 8, 2024 at 3:40 PM Michael S. Tsirkin  wrote:
> >
> > On Thu, Apr 04, 2024 at 01:56:31PM +0800, Cindy Lu wrote:
> > > Add a document explaining the reconnect process, including what the
> > > Userspace App needs to do and how it works with the kernel.
> > >
> > > Signed-off-by: Cindy Lu 
> > > ---
> > >  Documentation/userspace-api/vduse.rst | 41 +++
> > >  1 file changed, 41 insertions(+)
> > >
> > > diff --git a/Documentation/userspace-api/vduse.rst 
> > > b/Documentation/userspace-api/vduse.rst
> > > index bdb880e01132..7faa83462e78 100644
> > > --- a/Documentation/userspace-api/vduse.rst
> > > +++ b/Documentation/userspace-api/vduse.rst
> > > @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> > > after the used ring is filled.
> > >
> > >  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> > > +
> > > +HOW VDUSE devices reconnection works
> > > +
> > > +1. What is reconnection?
> > > +
> > > +   When the userspace application loads, it should establish a connection
> > > +   to the vduse kernel device. Sometimes the userspace application exits,
> > > +   and we want to support restarting it and connecting to the kernel device again.
> > > +
> > > +2. How can I support reconnection in a userspace application?
> > > +
> > > +2.1 During initialization, the userspace application should first verify 
> > > the
> > > +existence of the device "/dev/vduse/vduse_name".
> > > +If it doesn't exist, it means this is the first-time for connection. 
> > > goto step 2.2
> > > +If it exists, it means this is a reconnection, and we should goto 
> > > step 2.3
> > > +
> > > +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > > +/dev/vduse/control.
> > > +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
> > > +the reconnect information. The total memory size is 
> > > PAGE_SIZE * vq_number.
> >
> > Confused. Where is that allocation, in code?
> >
> > Thanks!
> >
> this should be allocated in vduse_create_dev(),

I mean, it's not allocated there ATM, right? Is this just a doc patch
that will become part of a larger patchset?

> I will rewrite
> this part to make it clearer
> will send a new version soon
> Thanks
> cindy
> 
> > > +2.3 Check if the information is suitable for reconnect
> > > +If this is reconnection :
> > > +Before attempting to reconnect, The userspace application needs to 
> > > use the
> > > +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> > > VDUSE_DEV_GET_FEATURES...)
> > > +to get the information from kernel.
> > > +Please review the information and confirm if it is suitable to 
> > > reconnect.
> > > +
> > > +2.4 Userspace application needs to mmap the memory to userspace
> > > +The userspace application requires mapping one page for every vq. 
> > > These pages
> > > +should be used to save vq-related information during system running. 
> > > Additionally,
> > > +the application must define its own structure to store information 
> > > for reconnection.
> > > +
> > > +2.5 Completed the initialization and running the application.
> > > +While the application is running, it is important to store relevant 
> > > information
> > > +about reconnections in mapped pages. When calling the ioctl 
> > > VDUSE_VQ_GET_INFO to
> > > +get vq information, it's necessary to check whether it's a 
> > > reconnection. If it is
> > > +a reconnection, the vq-related information must be get from the 
> > > mapped pages.
> > > +
> > > +2.6 When the Userspace application exits, it is necessary to unmap all 
> > > the
> > > +pages for reconnection
> > > --
> > > 2.43.0
> >




Re: [PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio

2024-04-08 Thread Michael S. Tsirkin
On Wed, Feb 28, 2024 at 10:41:20PM +0800, Hou Tao wrote:
> From: Hou Tao 
> 
> Hi,
> 
> The patch set aims to fix the warning related to an abnormal size
> parameter of kmalloc() in virtiofs. The warning occurred when attempting
> to insert a 10MB sized kernel module kept in a virtiofs with cache
> disabled. As analyzed in patch #1, the root cause is that the length of
> the read buffer is not limited, and the read buffer is passed directly to
> virtiofs through out_args[0].value. Therefore patch #1 limits the
> length of the read buffer passed to virtiofs by using max_pages. However
> it is not enough, because now the maximal value of max_pages is 256.
> Consequently, when reading a 10MB-sized kernel module, the length of the
> bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will
> try to allocate 2MB from memory subsystem. The request for 2MB of
> physically contiguous memory significantly stresses the memory subsystem
> and may fail indefinitely on hosts with fragmented memory. To address
> this, patch #2~#5 use scattered pages in a bio_vec to replace the
> kmalloc-allocated bounce buffer when the length of the bounce buffer for
> KVEC_ITER dio is larger than PAGE_SIZE. The final issue with the
> allocation of the bounce buffer and sg array in virtiofs is that
> GFP_ATOMIC is used even when the allocation occurs in a kworker context.
> Therefore the last patch uses GFP_NOFS for the allocation of both sg
> array and bounce buffer when initiated by the kworker. For more details,
> please check the individual patches.
> 
> As usual, comments are always welcome.
> 
> Change Log:

Bernd, should I just merge the patchset as is?
It seems to fix a real problem and no one has the
time to work on a better fix. WDYT?


> v2:
>   * limit the length of ITER_KVEC dio by max_pages instead of the
> newly-introduced max_nopage_rw. Using max_pages make the ITER_KVEC
> dio being consistent with other rw operations.
>   * replace kmalloc-allocated bounce buffer by using a bounce buffer
> backed by scattered pages when the length of the bounce buffer for
> KVEC_ITER dio is larger than PAGE_SIZE, so even on hosts with
> fragmented memory, the KVEC_ITER dio can be handled normally by
> virtiofs. (Bernd Schubert)
>   * merge the GFP_NOFS patch [1] into this patch-set and use
> memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS
> (Benjamin Coddington)
> 
> v1: 
> https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/
> 
> [1]: 
> https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/
> 
> Hou Tao (6):
>   fuse: limit the length of ITER_KVEC dio by max_pages
>   virtiofs: move alloc/free of argbuf into separated helpers
>   virtiofs: factor out more common methods for argbuf
>   virtiofs: support bounce buffer backed by scattered pages
>   virtiofs: use scattered bounce buffer for ITER_KVEC dio
>   virtiofs: use GFP_NOFS when enqueuing request through kworker
> 
>  fs/fuse/file.c  |  12 +-
>  fs/fuse/virtio_fs.c | 336 +---
>  2 files changed, 296 insertions(+), 52 deletions(-)
> 
> -- 
> 2.29.2
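For illustration, here is a hedged userspace model of the idea in patches #2~#5 above: back the bounce buffer with page-sized chunks instead of one large physically contiguous allocation. The kernel patches use bio_vec pages; this sketch only mirrors the split-into-pages approach, and all names below are illustrative, not the driver's real structures.

```c
/*
 * Hedged sketch (not the kernel code): back a bounce buffer with
 * order-0 page-sized chunks instead of one kmalloc of len bytes,
 * so no single large physically-contiguous allocation is needed.
 */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096u

struct scattered_buf {
	unsigned nr_pages;
	unsigned char **pages;
};

static struct scattered_buf *sbuf_alloc(size_t len)
{
	struct scattered_buf *b = malloc(sizeof(*b));
	unsigned i;

	b->nr_pages = (unsigned)((len + PAGE_SZ - 1) / PAGE_SZ);
	b->pages = calloc(b->nr_pages, sizeof(*b->pages));
	for (i = 0; i < b->nr_pages; i++)
		b->pages[i] = malloc(PAGE_SZ);	/* each chunk is one page */
	return b;
}

/* copy a flat source into the per-page chunks, page by page */
static void sbuf_copy_in(struct scattered_buf *b, const void *src, size_t len)
{
	const unsigned char *p = src;
	unsigned i;

	for (i = 0; len; i++) {
		size_t n = len < PAGE_SZ ? len : PAGE_SZ;

		memcpy(b->pages[i], p, n);
		p += n;
		len -= n;
	}
}

static void sbuf_free(struct scattered_buf *b)
{
	unsigned i;

	for (i = 0; i < b->nr_pages; i++)
		free(b->pages[i]);
	free(b->pages);
	free(b);
}
```

A 10 MB transfer then needs 2560 order-0 pages rather than one 2 MB-plus allocation, which fragmented hosts can satisfy far more reliably.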




Re: [PATCH v3] Documentation: Add reconnect process for VDUSE

2024-04-08 Thread Michael S. Tsirkin
On Thu, Apr 04, 2024 at 01:56:31PM +0800, Cindy Lu wrote:
> Add a document explaining the reconnect process, including what the
> Userspace App needs to do and how it works with the kernel.
> 
> Signed-off-by: Cindy Lu 
> ---
>  Documentation/userspace-api/vduse.rst | 41 +++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/Documentation/userspace-api/vduse.rst 
> b/Documentation/userspace-api/vduse.rst
> index bdb880e01132..7faa83462e78 100644
> --- a/Documentation/userspace-api/vduse.rst
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> after the used ring is filled.
>  
>  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> +
> +HOW VDUSE devices reconnection works
> +
> +1. What is reconnection?
> +
> +   When the userspace application loads, it should establish a connection
> > > +   to the vduse kernel device. Sometimes, the userspace application exits,
> +   and we want to support its restart and connect to the kernel device again
> +
> +2. How can I support reconnection in a userspace application?
> +
> +2.1 During initialization, the userspace application should first verify the
> +existence of the device "/dev/vduse/vduse_name".
> +If it doesn't exist, it means this is the first-time for connection. 
> goto step 2.2
> +If it exists, it means this is a reconnection, and we should goto step 
> 2.3
> +
> +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +/dev/vduse/control.
> +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
> > +the reconnect information. The total memory size is PAGE_SIZE*vq_number.

Confused. Where is that allocation, in code?

Thanks!

> +2.3 Check if the information is suitable for reconnect
> +If this is reconnection :
> +Before attempting to reconnect, The userspace application needs to use 
> the
> +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> VDUSE_DEV_GET_FEATURES...)
> +to get the information from kernel.
> +Please review the information and confirm if it is suitable to reconnect.
> +
> +2.4 Userspace application needs to mmap the memory to userspace
> +The userspace application requires mapping one page for every vq. These 
> pages
> +should be used to save vq-related information during system running. 
> Additionally,
> +the application must define its own structure to store information for 
> reconnection.
> +
> +2.5 Completed the initialization and running the application.
> +While the application is running, it is important to store relevant 
> information
> +about reconnections in mapped pages. When calling the ioctl 
> VDUSE_VQ_GET_INFO to
> +get vq information, it's necessary to check whether it's a reconnection. 
> If it is
> +a reconnection, the vq-related information must be get from the mapped 
> pages.
> +
> +2.6 When the Userspace application exits, it is necessary to unmap all the
> +pages for reconnection
> -- 
> 2.43.0
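A minimal userspace sketch of step 2.1 of the quoted document — probing for the VDUSE device node to choose between first connection (ioctl(VDUSE_CREATE_DEV)) and reconnection (the GET_* ioctls). Only the probe itself is shown; the helper name and enum are illustrative, not part of the VDUSE uAPI.

```c
/*
 * Hedged sketch of step 2.1 only: probe for the VDUSE device node to
 * decide between first connection (VDUSE_CREATE_DEV) and reconnection
 * (VDUSE_DEV_GET_CONFIG/STATUS/FEATURES). Names are illustrative.
 */
#include <assert.h>
#include <unistd.h>

enum vduse_start { VDUSE_FIRST_CONNECT, VDUSE_RECONNECT };

static enum vduse_start vduse_probe(const char *dev_path)
{
	/* node still present => the kernel instance survived our exit */
	return access(dev_path, F_OK) == 0 ? VDUSE_RECONNECT
					   : VDUSE_FIRST_CONNECT;
}
```

In a real application, the path would be "/dev/vduse/vduse_name" and the reconnect branch would go on to validate the retrieved config before mmap()ing the per-vq pages.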




Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-08 Thread Michael S. Tsirkin
On Mon, Apr 08, 2024 at 02:15:24PM +1000, Gavin Shan wrote:
> Hi Michael,
> 
> On 3/30/24 19:02, Gavin Shan wrote:
> > On 3/28/24 19:31, Michael S. Tsirkin wrote:
> > > On Thu, Mar 28, 2024 at 10:21:49AM +1000, Gavin Shan wrote:
> > > > All the callers of vhost_get_avail_idx() are concerned about the memory
> > > > barrier, imposed by smp_rmb() to ensure the order of the available
> > > > ring entry read and avail_idx read.
> > > > 
> > > > Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> > > > the avail_idx is advanced. With it, the callers needn't worry
> > > > about the memory barrier.
> > > > 
> > > > Suggested-by: Michael S. Tsirkin 
> > > > Signed-off-by: Gavin Shan 
> > > 
> > > Previous patches are ok. This one I feel needs more work -
> > > first more code such as sanity checking should go into
> > > this function, second there's actually a difference
> > > between comparing to last_avail_idx and just comparing
> > > to the previous value of avail_idx.
> > > I will pick patches 1-2 and post a cleanup on top so you can
> > > take a look, ok?
> > > 
> > 
> > Thanks, Michael. It's fine to me.
> > 
> 
> A kindly ping.
> 
> If it's ok to you, could you please merge PATCH[1-2]? Our downstream
> 9.4 needs the fixes, especially for NVidia's grace-hopper and grace-grace
> platforms.
> 
> For PATCH[3], I also can help with the improvement if you don't have time
> for it. Please let me know.
> 
> Thanks,
> Gavin

The thing to do is basically diff with the patch I wrote :)
We can also do a bit more cleanups on top of *that*, like unifying
error handling.

-- 
MST




Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-08 Thread Michael S. Tsirkin
On Mon, Apr 08, 2024 at 02:15:24PM +1000, Gavin Shan wrote:
> Hi Michael,
> 
> On 3/30/24 19:02, Gavin Shan wrote:
> > On 3/28/24 19:31, Michael S. Tsirkin wrote:
> > > On Thu, Mar 28, 2024 at 10:21:49AM +1000, Gavin Shan wrote:
> > > > All the callers of vhost_get_avail_idx() are concerned about the memory
> > > > barrier, imposed by smp_rmb() to ensure the order of the available
> > > > ring entry read and avail_idx read.
> > > > 
> > > > Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> > > > the avail_idx is advanced. With it, the callers needn't worry
> > > > about the memory barrier.
> > > > 
> > > > Suggested-by: Michael S. Tsirkin 
> > > > Signed-off-by: Gavin Shan 
> > > 
> > > Previous patches are ok. This one I feel needs more work -
> > > first more code such as sanity checking should go into
> > > this function, second there's actually a difference
> > > between comparing to last_avail_idx and just comparing
> > > to the previous value of avail_idx.
> > > I will pick patches 1-2 and post a cleanup on top so you can
> > > take a look, ok?
> > > 
> > 
> > Thanks, Michael. It's fine to me.
> > 
> 
> A kindly ping.
> 
> If it's ok to you, could you please merge PATCH[1-2]? Our downstream
> 9.4 needs the fixes, especially for NVidia's grace-hopper and grace-grace
> platforms.

Yes - in the next rc hopefully.

> For PATCH[3], I also can help with the improvement if you don't have time
> for it. Please let me know.
> 
> Thanks,
> Gavin


That would be great.

-- 
MST




[PATCH] vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE

2024-04-02 Thread Michael S. Tsirkin
VDPA_GET_VRING_SIZE by mistake uses the already occupied
ioctl # 0x80 and we never noticed - it happens to work
because the direction and size are different, but confuses
tools such as perf which like to look at just the number,
and breaks the extra robustness of the ioctl numbering macros.

To fix, sort the entries and renumber the ioctl - not too late
since it wasn't in any released kernels yet.

Cc: Arnaldo Carvalho de Melo 
Reported-by: Namhyung Kim 
Fixes: 1496c47065f9 ("vhost-vdpa: uapi to support reporting per vq size")
Cc: "Zhu Lingshan" 
Signed-off-by: Michael S. Tsirkin 
---

Build tested only - userspace patches using this will have to adjust.
I will merge this in a week or so unless I hear otherwise,
and afterwards perf can update their header.
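For readers wondering why the duplicate number went unnoticed: the full ioctl command value packs direction, type, number and size into one word, so the two definitions collided only in the number field. Below is a hedged userspace sketch of that encoding, mirroring include/uapi/asm-generic/ioctl.h; the constants are illustrative copies, not the uapi header itself.

```c
/*
 * Hedged sketch of the _IOC() encoding: direction, type, number and
 * size each occupy their own bit field, so _IOR(0xAF, 0x80, __u32)
 * and _IOWR(0xAF, 0x80, struct vhost_vring_state) yield distinct
 * command values despite sharing ioctl number 0x80.
 */
#include <assert.h>
#include <stdint.h>

#define IOC_NRBITS    8
#define IOC_TYPEBITS  8
#define IOC_SIZEBITS  14
#define IOC_NRSHIFT   0
#define IOC_TYPESHIFT (IOC_NRSHIFT + IOC_NRBITS)	/* 8 */
#define IOC_SIZESHIFT (IOC_TYPESHIFT + IOC_TYPEBITS)	/* 16 */
#define IOC_DIRSHIFT  (IOC_SIZESHIFT + IOC_SIZEBITS)	/* 30 */

#define IOC_WRITE 1U
#define IOC_READ  2U

static uint32_t ioc(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
	return (dir << IOC_DIRSHIFT) | (type << IOC_TYPESHIFT) |
	       (nr << IOC_NRSHIFT) | (size << IOC_SIZESHIFT);
}

/* VHOST_VIRTIO is 0xAF; vhost_vring_state is two __u32s (8 bytes) */
uint32_t get_vqs_count_cmd(void)
{
	return ioc(IOC_READ, 0xAF, 0x80, sizeof(uint32_t));
}

uint32_t get_vring_size_cmd(void)
{
	return ioc(IOC_READ | IOC_WRITE, 0xAF, 0x80, 8);
}
```

Tools like perf that key off only the low 8 bits (the number field) see the same 0x80 for both, which is exactly the confusion the renumbering fixes.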

 include/uapi/linux/vhost.h | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index bea697390613..b95dd84eef2d 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -179,12 +179,6 @@
 /* Get the config size */
 #define VHOST_VDPA_GET_CONFIG_SIZE _IOR(VHOST_VIRTIO, 0x79, __u32)
 
-/* Get the count of all virtqueues */
-#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
-
-/* Get the number of virtqueue groups. */
-#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
-
 /* Get the number of address spaces. */
 #define VHOST_VDPA_GET_AS_NUM  _IOR(VHOST_VIRTIO, 0x7A, unsigned int)
 
@@ -228,10 +222,17 @@
 #define VHOST_VDPA_GET_VRING_DESC_GROUP_IOWR(VHOST_VIRTIO, 0x7F,   
\
  struct vhost_vring_state)
 
+
+/* Get the count of all virtqueues */
+#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
+
+/* Get the number of virtqueue groups. */
+#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
+
 /* Get the queue size of a specific virtqueue.
  * userspace set the vring index in vhost_vring_state.index
  * kernel set the queue size in vhost_vring_state.num
  */
-#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x80,   \
+#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x82,   \
  struct vhost_vring_state)
 #endif
-- 
MST




Re: [syzbot] [virtualization?] bpf boot error: WARNING: refcount bug in __free_pages_ok

2024-03-31 Thread Michael S. Tsirkin
On Sat, Mar 30, 2024 at 08:37:19AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:6dae957c8eef bpf: fix possible file descriptor leaks in ve..
> git tree:   bpf
> console output: https://syzkaller.appspot.com/x/log.txt?x=14ec025e18
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7b667bc37450fdcd
> dashboard link: https://syzkaller.appspot.com/bug?extid=689655a7402cc18ace0a
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> 
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/94b03853b65f/disk-6dae957c.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/7375c1b6b108/vmlinux-6dae957c.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/126013ac11e1/bzImage-6dae957c.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+689655a7402cc18ac...@syzkaller.appspotmail.com
> 
> Key type pkcs7_test registered
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> io scheduler mq-deadline registered
> io scheduler kyber registered
> io scheduler bfq registered
> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> ACPI: button: Power Button [PWRF]
> input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> ACPI: button: Sleep Button [SLPF]
> ioatdma: Intel(R) QuickData Technology Driver 5.00
> ACPI: \_SB_.LNKC: Enabled at IRQ 11
> virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKD: Enabled at IRQ 10
> virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKB: Enabled at IRQ 10
> virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> N_HDLC line discipline registered with maxframe=4096
> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> Non-volatile memory driver v1.3
> Linux agpgart interface v0.103
> ACPI: bus type drm_connector registered
> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> Console: switching to colour frame buffer device 128x48
> platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> usbcore: registered new interface driver udl
> brd: module loaded
> loop: module loaded
> zram: Added device: zram0
> null_blk: disk nullb0 created
> null_blk: module loaded
> Guest personality initialized and is inactive
> VMCI host device registered (name=vmci, major=10, minor=118)
> Initialized host personality
> usbcore: registered new interface driver rtsx_usb
> usbcore: registered new interface driver viperboard
> usbcore: registered new interface driver dln2
> usbcore: registered new interface driver pn533_usb
> nfcsim 0.2 initialized
> usbcore: registered new interface driver port100
> usbcore: registered new interface driver nfcmrvl
> Loading iSCSI transport class v2.0-870.
> virtio_scsi virtio0: 1/0/0 default/read/poll queues
> [ cut here ]
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 1 PID: 1 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 
> lib/refcount.c:31
> Modules linked in:
> CPU: 1 PID: 1 Comm: swapper/0 Not tainted 
> 6.9.0-rc1-syzkaller-00160-g6dae957c8eef #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 03/27/2024
> RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> Code: b2 00 00 00 e8 97 cf e9 fc 5b 5d c3 cc cc cc cc e8 8b cf e9 fc c6 05 8e 
> 73 e8 0a 01 90 48 c7 c7 e0 33 1f 8c e8 c7 6b ac fc 90 <0f> 0b 90 90 eb d9 e8 
> 6b cf e9 fc c6 05 6b 73 e8 0a 01 90 48 c7 c7
> RSP: :c9066e18 EFLAGS: 00010246
> RAX: eee901a1fb7e2300 RBX: 888146687e7c RCX: 8880166d
> RDX:  RSI:  RDI: 
> RBP: 0004 R08: 815800c2 R09: fbfff1c396e0
> R10: dc00 R11: fbfff1c396e0 R12: ea000502edc0
> R13: ea000502edc8 R14: 1d4000a05db9 R15: 
> FS:  () GS:8880b950() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2:  CR3: 0e132000 CR4: 003506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  reset_page_owner include/linux/page_owner.h:25 [inline]
>  free_pages_prepare mm/page_alloc.c:1141 [inline]
>  __free_pages_ok+0xc60/0xd90 mm/page_alloc.c:1270
>  make_alloc_exact+0xa3/0xf0 mm/page_alloc.c:4829
>  vring_alloc_queue drivers/virtio/virtio_ring.c:319 

Re: [PATCH net v3] virtio_net: Do not send RSS key if it is not supported

2024-03-31 Thread Michael S. Tsirkin
On Fri, Mar 29, 2024 at 10:16:41AM -0700, Breno Leitao wrote:
> There is a bug when setting the RSS options in virtio_net that can break
> the whole machine, getting the kernel into an infinite loop.
> 
> Running the following command in any QEMU virtual machine with virtionet
> will reproduce this problem:
> 
> # ethtool -X eth0  hfunc toeplitz
> 
> This is how the problem happens:
> 
> 1) ethtool_set_rxfh() calls virtnet_set_rxfh()
> 
> 2) virtnet_set_rxfh() calls virtnet_commit_rss_command()
> 
> 3) virtnet_commit_rss_command() populates 4 entries for the rss
> scatter-gather
> 
> 4) Since the command above does not have a key, then the last
> scatter-gather entry will be zeroed, since rss_key_size == 0.
> sg_buf_size = vi->rss_key_size;
> 
> 5) This buffer is passed to qemu, but qemu is not happy with a buffer
> with zero length, and do the following in virtqueue_map_desc() (QEMU
> function):
> 
>   if (!sz) {
>   virtio_error(vdev, "virtio: zero sized buffers are not allowed");
> 
> 6) virtio_error() (also QEMU function) set the device as broken
> 
> vdev->broken = true;
> 
> 7) Qemu bails out, and does not respond to this crazy kernel.
> 
> 8) The kernel is waiting for the response to come back (function
> virtnet_send_command())
> 
> 9) The kernel is waiting doing the following :
> 
>   while (!virtqueue_get_buf(vi->cvq, ) &&
>!virtqueue_is_broken(vi->cvq))
> cpu_relax();
> 
> 10) None of the following functions above is true, thus, the kernel
> loops here forever. Keeping in mind that virtqueue_is_broken() does
> not look at the qemu `vdev->broken`, so, it never realizes that the
> vitio is broken at QEMU side.
> 
> Fix it by not sending RSS commands if the feature is not available in
> the device.
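The failure chain above can be modeled in a few lines of userspace C. This is a hedged model, not the driver's real structures; field and function names are illustrative.

```c
/*
 * Hedged model of the bug: with neither VIRTIO_NET_F_RSS nor
 * VIRTIO_NET_F_HASH_REPORT negotiated, rss_key_size stays 0, so the
 * last scatter-gather entry sent to the device would be zero-sized.
 */
#include <assert.h>
#include <errno.h>

struct model {
	int has_rss;
	int has_rss_hash_report;
	unsigned int rss_key_size;
};

/* step 4 above: the last sg entry's length is the key size */
static int would_send_zero_sg(const struct model *m)
{
	unsigned int sg_buf_size = m->rss_key_size;

	return sg_buf_size == 0;
}

/* the fix: refuse the key update instead of building the command */
static int set_rxfh_key(const struct model *m)
{
	if (!m->has_rss && !m->has_rss_hash_report)
		return -EOPNOTSUPP;
	return 0;
}
```

With the guard in place, the zero-length buffer is never built, so QEMU never marks the device broken and the busy-wait in step 9 never starts.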
> 
> Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
> Cc: sta...@vger.kernel.org

net has its own stable process, don't CC stable on net patches.


> Cc: qemu-de...@nongnu.org
> Signed-off-by: Breno Leitao 
> ---
> Changelog:
> 
> V2:
>   * Moved from creating a valid packet, by rejecting the request
> completely
> V3:
>   * Got some good feedback from and Xuan Zhuo and Heng Qi, and reworked
> the rejection path.
> 
> ---
>  drivers/net/virtio_net.c | 22 ++
>  1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c22d1118a133..c4a21ec51adf 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3807,6 +3807,7 @@ static int virtnet_set_rxfh(struct net_device *dev,
>   struct netlink_ext_ack *extack)
>  {
>   struct virtnet_info *vi = netdev_priv(dev);
> + bool update = false;
>   int i;
>  
>   if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
> @@ -3814,13 +3815,24 @@ static int virtnet_set_rxfh(struct net_device *dev,
>   return -EOPNOTSUPP;
>  
>   if (rxfh->indir) {
> + if (!vi->has_rss)
> + return -EOPNOTSUPP;
> +
>   for (i = 0; i < vi->rss_indir_table_size; ++i)
>   vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
> + update = true;
>   }
> - if (rxfh->key)
> +
> + if (rxfh->key) {
> + if (!vi->has_rss && !vi->has_rss_hash_report)
> + return -EOPNOTSUPP;


What's the logic here? Is it || or &&? A comment can't hurt.

> +
>   memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
> + update = true;
> + }
>  
> - virtnet_commit_rss_command(vi);
> + if (update)
> + virtnet_commit_rss_command(vi);
>  
>   return 0;
>  }
> @@ -4729,13 +4741,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>   if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
>   vi->has_rss_hash_report = true;
>  
> - if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
>   vi->has_rss = true;
>  
> - if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_indir_table_size =
>   virtio_cread16(vdev, offsetof(struct virtio_net_config,
>   rss_max_indirection_table_length));
> + }
> +
> + if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_key_size =
>   virtio_cread8(vdev, offsetof(struct virtio_net_config, 
> rss_max_key_size));
>  
> -- 
> 2.43.0




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-29 Thread Michael S. Tsirkin
On Fri, Mar 29, 2024 at 06:39:33PM +0800, Jason Wang wrote:
> On Fri, Mar 29, 2024 at 5:13 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Mar 27, 2024 at 05:08:57PM +0800, Jason Wang wrote:
> > > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > > From: Rong Wang 
> > > > >
> > > > > Once enable iommu domain for one device, the MSI
> > > > > translation tables have to be there for software-managed MSI.
> > > > > Otherwise, platform with software-managed MSI without an
> > > > > irq bypass function, can not get a correct memory write event
> > > > > from pcie, will not get irqs.
> > > > > The solution is to obtain the MSI phy base address from
> > > > > iommu reserved region, and set it to iommu MSI cookie,
> > > > > then translation tables will be created while request irq.
> > > > >
> > > > > Change log
> > > > > --
> > > > >
> > > > > v1->v2:
> > > > > - add resv iotlb to avoid overlap mapping.
> > > > > v2->v3:
> > > > > - there is no need to export the iommu symbol anymore.
> > > > >
> > > > > Signed-off-by: Rong Wang 
> > > >
> > > > There's no interest to keep extending vhost iotlb -
> > > > we should just switch over to iommufd which supports
> > > > this already.
> > >
> > > IOMMUFD is good but VFIO supports this before IOMMUFD.
> >
> > You mean VFIO migrated to IOMMUFD but of course they keep supporting
> > their old UAPI?
> 
> I meant VFIO supported software managed MSI before IOMMUFD.

And then they switched over and stopped adding new IOMMU
related features. And so should vdpa?


> > OK and point being?
> >
> > > This patch
> > > makes vDPA run without a backporting of full IOMMUFD in the production
> > > environment. I think it's worth.
> >
> > Where do we stop? saying no to features is the only tool maintainers
> > have to make cleanups happen, otherwise people will just keep piling
> > stuff up.
> 
> I think we should not have more features than VFIO without IOMMUFD.
> 
> Thanks
> 
> >
> > > If you worry about the extension, we can just use the vhost iotlb
> > > existing facility to do this.
> > >
> > > Thanks
> > >
> > > >
> > > > > ---
> > > > >  drivers/vhost/vdpa.c | 59 
> > > > > +---
> > > > >  1 file changed, 56 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > > > index ba52d128aeb7..28b56b10372b 100644
> > > > > --- a/drivers/vhost/vdpa.c
> > > > > +++ b/drivers/vhost/vdpa.c
> > > > > @@ -49,6 +49,7 @@ struct vhost_vdpa {
> > > > >   struct completion completion;
> > > > >   struct vdpa_device *vdpa;
> > > > >   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> > > > > + struct vhost_iotlb resv_iotlb;
> > > > >   struct device dev;
> > > > >   struct cdev cdev;
> > > > >   atomic_t opened;
> > > > > @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa 
> > > > > *v)
> > > > >  static int vhost_vdpa_reset(struct vhost_vdpa *v)
> > > > >  {
> > > > >   v->in_batch = 0;
> > > > > + vhost_iotlb_reset(>resv_iotlb);
> > > > >   return _compat_vdpa_reset(v);
> > > > >  }
> > > > >
> > > > > @@ -1219,10 +1221,15 @@ static int 
> > > > > vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> > > > >   msg->iova + msg->size - 1 > v->range.last)
> > > > >   return -EINVAL;
> > > > >
> > > > > + if (vhost_iotlb_itree_first(>resv_iotlb, msg->iova,
> > > > > + msg->iova + msg->size - 1))
> > > > > + return -EINVAL;
> > > > > +
> > > > >   if (vhost_iotlb_itree_first(iotlb, msg->iova,
> > > > >   msg->iova + msg->size - 1))
> > > > >   return -EEXIST;

Re: [PATCH v2] Documentation: Add reconnect process for VDUSE

2024-03-29 Thread Michael S. Tsirkin
On Fri, Mar 29, 2024 at 05:38:25PM +0800, Cindy Lu wrote:
> Add a document explaining the reconnect process, including what the
> Userspace App needs to do and how it works with the kernel.
> 
> Signed-off-by: Cindy Lu 
> ---
>  Documentation/userspace-api/vduse.rst | 41 +++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/Documentation/userspace-api/vduse.rst 
> b/Documentation/userspace-api/vduse.rst
> index bdb880e01132..f903aed714d1 100644
> --- a/Documentation/userspace-api/vduse.rst
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> after the used ring is filled.
>  
>  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> +
> +HOW VDUSE devices reconnectoin works

typo

> +
> +1. What is reconnection?
> +
> +   When the userspace application loads, it should establish a connection
> > +   to the vduse kernel device. Sometimes, the userspace application exits,
> +   and we want to support its restart and connect to the kernel device again
> +
> +2. How can I support reconnection in a userspace application?
> +
> +2.1 During initialization, the userspace application should first verify the
> +existence of the device "/dev/vduse/vduse_name".
> +If it doesn't exist, it means this is the first-time for connection. 
> goto step 2.2
> +If it exists, it means this is a reconnection, and we should goto step 
> 2.3
> +
> +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +/dev/vduse/control.
> +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
> > +the reconnect information. The total memory size is PAGE_SIZE*vq_number.
> +
> +2.3 Check if the information is suitable for reconnect
> +If this is reconnection :
> +Before attempting to reconnect, The userspace application needs to use 
> the
> +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> VDUSE_DEV_GET_FEATURES...)
> +to get the information from kernel.
> +Please review the information and confirm if it is suitable to reconnect.
> +
> +2.4 Userspace application needs to mmap the memory to userspace
> +The userspace application requires mapping one page for every vq. These 
> pages
> +should be used to save vq-related information during system running. 
> Additionally,
> +the application must define its own structure to store information for 
> reconnection.
> +
> +2.5 Completed the initialization and running the application.
> +While the application is running, it is important to store relevant 
> information
> +about reconnections in mapped pages. When calling the ioctl 
> VDUSE_VQ_GET_INFO to
> +get vq information, it's necessary to check whether it's a reconnection. 
> If it is
> +a reconnection, the vq-related information must be get from the mapped 
> pages.
> +


I don't get it. So this is just a way for the application to allocate
memory? Why do we need this new way to do it?
Why not just mmap a file anywhere at all?


> +2.6 When the Userspace application exits, it is necessary to unmap all the
> +pages for reconnection
> -- 
> 2.43.0




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-29 Thread Michael S. Tsirkin
On Fri, Mar 29, 2024 at 11:55:50AM +0800, Jason Wang wrote:
> On Wed, Mar 27, 2024 at 5:08 PM Jason Wang  wrote:
> >
> > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  wrote:
> > >
> > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > From: Rong Wang 
> > > >
> > > > Once enable iommu domain for one device, the MSI
> > > > translation tables have to be there for software-managed MSI.
> > > > Otherwise, platform with software-managed MSI without an
> > > > irq bypass function, can not get a correct memory write event
> > > > from pcie, will not get irqs.
> > > > The solution is to obtain the MSI phy base address from
> > > > iommu reserved region, and set it to iommu MSI cookie,
> > > > then translation tables will be created while request irq.
> > > >
> > > > Change log
> > > > --
> > > >
> > > > v1->v2:
> > > > - add resv iotlb to avoid overlap mapping.
> > > > v2->v3:
> > > > - there is no need to export the iommu symbol anymore.
> > > >
> > > > Signed-off-by: Rong Wang 
> > >
> > > There's no interest to keep extending vhost iotlb -
> > > we should just switch over to iommufd which supports
> > > this already.
> >
> > IOMMUFD is good but VFIO supports this before IOMMUFD. This patch
> > makes vDPA run without a backporting of full IOMMUFD in the production
> > environment. I think it's worth.
> >
> > If you worry about the extension, we can just use the vhost iotlb
> > existing facility to do this.
> >
> > Thanks
> 
> Btw, Wang Rong,
> 
> It looks that Cindy does have the bandwidth in working for IOMMUFD support.

I think you mean she does not.

> Do you have the will to do that?
> 
> Thanks




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-29 Thread Michael S. Tsirkin
On Wed, Mar 27, 2024 at 05:08:57PM +0800, Jason Wang wrote:
> On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > From: Rong Wang 
> > >
> > > Once enable iommu domain for one device, the MSI
> > > translation tables have to be there for software-managed MSI.
> > > Otherwise, platform with software-managed MSI without an
> > > irq bypass function, can not get a correct memory write event
> > > from pcie, will not get irqs.
> > > The solution is to obtain the MSI phy base address from
> > > iommu reserved region, and set it to iommu MSI cookie,
> > > then translation tables will be created while request irq.
> > >
> > > Change log
> > > --
> > >
> > > v1->v2:
> > > - add resv iotlb to avoid overlap mapping.
> > > v2->v3:
> > > - there is no need to export the iommu symbol anymore.
> > >
> > > Signed-off-by: Rong Wang 
> >
> > There's no interest to keep extending vhost iotlb -
> > we should just switch over to iommufd which supports
> > this already.
> 
> IOMMUFD is good but VFIO supports this before IOMMUFD.

You mean VFIO migrated to IOMMUFD but of course they keep supporting
their old UAPI? OK and point being?

> This patch
> makes vDPA run without a backporting of full IOMMUFD in the production
> environment. I think it's worth.

Where do we stop? saying no to features is the only tool maintainers
have to make cleanups happen, otherwise people will just keep piling
stuff up.

> If you worry about the extension, we can just use the vhost iotlb
> existing facility to do this.
> 
> Thanks
> 
> >
> > > ---
> > >  drivers/vhost/vdpa.c | 59 +---
> > >  1 file changed, 56 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > index ba52d128aeb7..28b56b10372b 100644
> > > --- a/drivers/vhost/vdpa.c
> > > +++ b/drivers/vhost/vdpa.c
> > > @@ -49,6 +49,7 @@ struct vhost_vdpa {
> > >   struct completion completion;
> > >   struct vdpa_device *vdpa;
> > >   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> > > + struct vhost_iotlb resv_iotlb;
> > >   struct device dev;
> > >   struct cdev cdev;
> > >   atomic_t opened;
> > > @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
> > >  static int vhost_vdpa_reset(struct vhost_vdpa *v)
> > >  {
> > >   v->in_batch = 0;
> > > + vhost_iotlb_reset(>resv_iotlb);
> > >   return _compat_vdpa_reset(v);
> > >  }
> > >
> > > @@ -1219,10 +1221,15 @@ static int vhost_vdpa_process_iotlb_update(struct 
> > > vhost_vdpa *v,
> > >   msg->iova + msg->size - 1 > v->range.last)
> > >   return -EINVAL;
> > >
> > > + if (vhost_iotlb_itree_first(>resv_iotlb, msg->iova,
> > > + msg->iova + msg->size - 1))
> > > + return -EINVAL;
> > > +
> > >   if (vhost_iotlb_itree_first(iotlb, msg->iova,
> > >   msg->iova + msg->size - 1))
> > >   return -EEXIST;
> > >
> > > +
> > >   if (vdpa->use_va)
> > >   return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
> > >msg->uaddr, msg->perm);
> > > @@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct 
> > > kiocb *iocb,
> > >   return vhost_chr_write_iter(dev, from);
> > >  }
> > >
> > > +static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, 
> > > struct device *dma_dev,
> > > + struct vhost_iotlb *resv_iotlb)
> > > +{
> > > + struct list_head dev_resv_regions;
> > > + phys_addr_t resv_msi_base = 0;
> > > + struct iommu_resv_region *region;
> > > + int ret = 0;
> > > + bool with_sw_msi = false;
> > > + bool with_hw_msi = false;
> > > +
> > > + INIT_LIST_HEAD(&dev_resv_regions);
> > > + iommu_get_resv_regions(dma_dev, &dev_resv_regions);
> > > +
> > > + list_for_each_entry(region, &dev_resv_regions, list) {
> > > + ret = vhost_iotlb_add_range_ctx(resv_iotlb, region->start,
> > > +   

Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-03-28 Thread Michael S. Tsirkin
On Thu, Mar 28, 2024 at 10:21:49AM +1000, Gavin Shan wrote:
> All the callers of vhost_get_avail_idx() are concerned about the memory
> barrier, imposed by smp_rmb() to ensure the order of the available
> ring entry read and avail_idx read.
> 
> Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> the avail_idx is advanced. With it, the callers needn't worry
> about the memory barrier.
> 
> Suggested-by: Michael S. Tsirkin 
> Signed-off-by: Gavin Shan 

Previous patches are ok. This one I feel needs more work -
first more code such as sanity checking should go into
this function, second there's actually a difference
between comparing to last_avail_idx and just comparing
to the previous value of avail_idx.
I will pick patches 1-2 and post a cleanup on top so you can
take a look, ok?


> ---
>  drivers/vhost/vhost.c | 75 +++
>  1 file changed, 26 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 32686c79c41d..e6882f4f6ce2 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1290,10 +1290,28 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>   mutex_unlock(&d->vqs[i]->mutex);
>  }
>  
> -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> -   __virtio16 *idx)
> +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
>  {
> - return vhost_get_avail(vq, *idx, &vq->avail->idx);
> + __virtio16 avail_idx;
> + int r;
> +
> + r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
> + if (unlikely(r)) {
> + vq_err(vq, "Failed to access avail idx at %p\n",
> +        &vq->avail->idx);
> + return r;
> + }
> +
> + vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> + if (vq->avail_idx != vq->last_avail_idx) {
> + /* Ensure the available ring entry read happens
> +  * before the avail_idx read when the avail_idx
> +  * is advanced.
> +  */
> + smp_rmb();
> + }
> +
> + return 0;
>  }
>  
>  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> @@ -2499,7 +2517,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   struct vring_desc desc;
>   unsigned int i, head, found = 0;
>   u16 last_avail_idx;
> - __virtio16 avail_idx;
>   __virtio16 ring_head;
>   int ret, access;
>  
> @@ -2507,12 +2524,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   last_avail_idx = vq->last_avail_idx;
>  
>   if (vq->avail_idx == vq->last_avail_idx) {
> - if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> - vq_err(vq, "Failed to access avail idx at %p\n",
> - &vq->avail->idx);
> + if (unlikely(vhost_get_avail_idx(vq)))
>   return -EFAULT;
> - }
> - vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>  
>   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
>   vq_err(vq, "Guest moved used index from %u to %u",
> @@ -2525,11 +2538,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>*/
>   if (vq->avail_idx == last_avail_idx)
>   return vq->num;
> -
> - /* Only get avail ring entries after they have been
> -  * exposed by guest.
> -  */
> - smp_rmb();
>   }
>  
>   /* Grab the next descriptor number they're advertising, and increment
> @@ -2790,35 +2798,19 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
>  /* return true if we're sure that avaiable ring is empty */
>  bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  {
> - __virtio16 avail_idx;
> - int r;
> -
>   if (vq->avail_idx != vq->last_avail_idx)
>   return false;
>  
> - r = vhost_get_avail_idx(vq, &avail_idx);
> - if (unlikely(r))
> - return false;
> -
> - vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> - if (vq->avail_idx != vq->last_avail_idx) {
> - /* Since we have updated avail_idx, the following
> -  * call to vhost_get_vq_desc() will read available
> -  * ring entries. Make sure that read happens after
> -  * the avail_idx read.
> -  */
> - smp_rmb();
> + if (unlikely(vhost_get_avail_idx(vq)))
>   return false;
> - }
>  
> - retu

Re: [PATCH untested] vhost: order avail ring reads after index updates

2024-03-27 Thread Michael S. Tsirkin
On Wed, Mar 27, 2024 at 07:52:02PM +, Will Deacon wrote:
> On Wed, Mar 27, 2024 at 01:26:23PM -0400, Michael S. Tsirkin wrote:
> > vhost_get_vq_desc (correctly) uses smp_rmb to order
> > avail ring reads after index reads.
> > However, over time we added two more places that read the
> > index and do not bother with barriers.
> > Since vhost_get_vq_desc when it was written assumed it is the
> > only reader when it sees a new index value is cached
> > it does not bother with a barrier either, as a result,
> > on the nvidia-gracehopper platform (arm64) available ring
> > entry reads have been observed bypassing ring reads, causing
> > a ring corruption.
> > 
> > To fix, factor out the correct index access code from vhost_get_vq_desc.
> > As a side benefit, we also validate the index on all paths now, which
> > will hopefully help catch future errors earlier.
> > 
> > Note: current code is inconsistent in how it handles errors:
> > some places treat it as an empty ring, others - non empty.
> > This patch does not attempt to change the existing behaviour.
> > 
> > Cc: sta...@vger.kernel.org
> > Reported-by: Gavin Shan 
> > Reported-by: Will Deacon 
> > Suggested-by: Will Deacon 
> > Fixes: 275bf960ac69 ("vhost: better detection of available buffers")
> > Cc: "Jason Wang" 
> > Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
> > Cc: "Stefano Garzarella" 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> > 
> > I think it's better to bite the bullet and clean up the code.
> > Note: this is still only built, not tested.
> > Gavin could you help test please?
> > Especially on the arm platform you have?
> > 
> > Will thanks so much for finding this race!
> 
> No problem, and I was also hoping that the smp_rmb() could be
> consolidated into a single helper like you've done here.
> 
> One minor comment below:
> 
> >  drivers/vhost/vhost.c | 80 +++
> >  1 file changed, 42 insertions(+), 38 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..26b70b1fd9ff 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev 
> > *d)
> > mutex_unlock(&d->vqs[i]->mutex);
> >  }
> >  
> > -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> > - __virtio16 *idx)
> > +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
> >  {
> > -   return vhost_get_avail(vq, *idx, &vq->avail->idx);
> > +   __virtio16 idx;
> > +   u16 avail_idx;
> > +   int r = vhost_get_avail(vq, idx, &vq->avail->idx);
> > +
> > +   if (unlikely(r < 0)) {
> > +   vq_err(vq, "Failed to access avail idx at %p: %d\n",
> > +  &vq->avail->idx, r);
> > +   return -EFAULT;
> > +   }
> > +
> > +   avail_idx = vhost16_to_cpu(vq, idx);
> > +
> > +   /* Check it isn't doing very strange things with descriptor numbers. */
> > +   if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
> > +   vq_err(vq, "Guest moved used index from %u to %u",
> > +  vq->last_avail_idx, vq->avail_idx);
> > +   return -EFAULT;
> > +   }
> > +
> > +   /* Nothing new? We are done. */
> > +   if (avail_idx == vq->avail_idx)
> > +   return 0;
> > +
> > +   vq->avail_idx = avail_idx;
> > +
> > +   /* We updated vq->avail_idx so we need a memory barrier between
> > +* the index read above and the caller reading avail ring entries.
> > +*/
> > +   smp_rmb();
> 
> I think you could use smp_acquire__after_ctrl_dep() if you're feeling
> brave, but to be honest I'd prefer we went in the opposite direction
> and used READ/WRITE_ONCE + smp_load_acquire()/smp_store_release() across
> the board. It's just a thankless, error-prone task to get there :(

Let's just say that's a separate patch, I tried hard to make this one
a bugfix only, no other functional changes at all.

> So, for the patch as-is:
> 
> Acked-by: Will Deacon 
> 
> (I've not tested it either though, so definitely wait for Gavin on that!)
> 
> Cheers,
> 
> Will




[PATCH untested] vhost: order avail ring reads after index updates

2024-03-27 Thread Michael S. Tsirkin
vhost_get_vq_desc (correctly) uses smp_rmb to order
avail ring reads after index reads.
However, over time we added two more places that read the
index and do not bother with barriers.
Since vhost_get_vq_desc when it was written assumed it is the
only reader when it sees a new index value is cached
it does not bother with a barrier either, as a result,
on the nvidia-gracehopper platform (arm64) available ring
entry reads have been observed bypassing ring reads, causing
a ring corruption.

To fix, factor out the correct index access code from vhost_get_vq_desc.
As a side benefit, we also validate the index on all paths now, which
will hopefully help catch future errors earlier.

Note: current code is inconsistent in how it handles errors:
some places treat it as an empty ring, others - non empty.
This patch does not attempt to change the existing behaviour.

Cc: sta...@vger.kernel.org
Reported-by: Gavin Shan 
Reported-by: Will Deacon 
Suggested-by: Will Deacon 
Fixes: 275bf960ac69 ("vhost: better detection of available buffers")
Cc: "Jason Wang" 
Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
Cc: "Stefano Garzarella" 
Signed-off-by: Michael S. Tsirkin 
---

I think it's better to bite the bullet and clean up the code.
Note: this is still only built, not tested.
Gavin could you help test please?
Especially on the arm platform you have?

Will thanks so much for finding this race!


 drivers/vhost/vhost.c | 80 +++
 1 file changed, 42 insertions(+), 38 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..26b70b1fd9ff 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
mutex_unlock(&d->vqs[i]->mutex);
 }
 
-static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
- __virtio16 *idx)
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
 {
-   return vhost_get_avail(vq, *idx, &vq->avail->idx);
+   __virtio16 idx;
+   u16 avail_idx;
+   int r = vhost_get_avail(vq, idx, &vq->avail->idx);
+
+   if (unlikely(r < 0)) {
+   vq_err(vq, "Failed to access avail idx at %p: %d\n",
+  &vq->avail->idx, r);
+   return -EFAULT;
+   }
+
+   avail_idx = vhost16_to_cpu(vq, idx);
+
+   /* Check it isn't doing very strange things with descriptor numbers. */
+   if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
+   vq_err(vq, "Guest moved used index from %u to %u",
+  vq->last_avail_idx, vq->avail_idx);
+   return -EFAULT;
+   }
+
+   /* Nothing new? We are done. */
+   if (avail_idx == vq->avail_idx)
+   return 0;
+
+   vq->avail_idx = avail_idx;
+
+   /* We updated vq->avail_idx so we need a memory barrier between
+* the index read above and the caller reading avail ring entries.
+*/
+   smp_rmb();
+   return 1;
 }
 
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
@@ -2498,38 +2526,21 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 {
struct vring_desc desc;
unsigned int i, head, found = 0;
-   u16 last_avail_idx;
-   __virtio16 avail_idx;
+   u16 last_avail_idx = vq->last_avail_idx;
__virtio16 ring_head;
int ret, access;
 
-   /* Check it isn't doing very strange things with descriptor numbers. */
-   last_avail_idx = vq->last_avail_idx;
 
if (vq->avail_idx == vq->last_avail_idx) {
-   if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
-   vq_err(vq, "Failed to access avail idx at %p\n",
-   &vq->avail->idx);
-   return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
-   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-   vq_err(vq, "Guest moved used index from %u to %u",
-   last_avail_idx, vq->avail_idx);
-   return -EFAULT;
-   }
+   ret = vhost_get_avail_idx(vq);
+   if (unlikely(ret < 0))
+   return ret;
 
/* If there's nothing new since last we looked, return
 * invalid.
 */
-   if (vq->avail_idx == last_avail_idx)
+   if (!ret)
return vq->num;
-
-   /* Only get avail ring entries after they have been
-* exposed by guest.
-*/
-   smp_rmb();
}
 
/* Grab the next descriptor number they're advertising, 

Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-27 Thread Michael S. Tsirkin
On Wed, Mar 27, 2024 at 09:38:45AM +1000, Gavin Shan wrote:
> A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> Will Deacon . Otherwise, it's not guaranteed that the
> available ring entries pushed by the guest can be observed by vhost
> in time, leading to stale available ring entries fetched by vhost
> in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> grace-hopper (ARM64) platform.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host -cpu host  \
>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
>   -m 4096M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=4096M\
>:   \
>   -netdev tap,id=vnet0,vhost=true  \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>:
>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
>   virtio_net virtio0: output.0:id 100 is not a head!
> 
> Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> should be safe until vq->avail_idx is changed by commit 275bf960ac697
> ("vhost: better detection of available buffers").
> 
> Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> Cc:  # v4.11+
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vhost/vhost.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..00445ab172b3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, 
> struct vhost_virtqueue *vq)
>   r = vhost_get_avail_idx(vq, &avail_idx);
>   if (unlikely(r))
>   return false;
> +
>   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> + if (vq->avail_idx != vq->last_avail_idx) {
> + /* Similar to what's done in vhost_get_vq_desc(), we need
> +  * to ensure the available ring entries have been exposed
> +  * by guest.
> +  */

A slightly clearer comment:

/* Since we have updated avail_idx, the following call to
 * vhost_get_vq_desc will read available ring entries.
 * Make sure that read happens after the avail_idx read.
 */

Pls repost with that, and I will apply.

Also add suggested-by for will.


> + smp_rmb();
> + return false;
> + }
>  
> - return vq->avail_idx == vq->last_avail_idx;
> + return true;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

As a follow-up patch, we should clean out code duplication that
accumulated with 3 places reading avail idx in essentially
the same way - this duplication is what causes the mess in
the 1st place.






> -- 
> 2.44.0




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-27 Thread Michael S. Tsirkin
On Tue, Mar 26, 2024 at 03:46:29PM +, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 11:43:13AM +, Will Deacon wrote:
> > On Tue, Mar 26, 2024 at 09:38:55AM +, Keir Fraser wrote:
> > > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > > Secondly, the debugging code is enhanced so that the available head 
> > > > > for
> > > > > (last_avail_idx - 1) is read for twice and recorded. It means the 
> > > > > available
> > > > > head for one specific available index is read for twice. I do see the
> > > > > available heads are different from the consecutive reads. More details
> > > > > are shared as below.
> > > > > 
> > > > > From the guest side
> > > > > ===
> > > > > 
> > > > > virtio_net virtio0: output.0:id 86 is not a head!
> > > > > head to be released: 047 062 112
> > > > > 
> > > > > avail_idx:
> > > > > 000  49665
> > > > > 001  49666  <--
> > > > >  :
> > > > > 015  49664
> > > > 
> > > > what are these #s 49665 and so on?
> > > > and how large is the ring?
> > > > I am guessing 49664 is the index ring size is 16 and
> > > > 49664 % 16 == 0
> > > 
> > > More than that, 49664 % 256 == 0
> > > 
> > > So again there seems to be an error in the vicinity of roll-over of
> > > the idx low byte, as I observed in the earlier log. Surely this is
> > > more than coincidence?
> > 
> > Yeah, I'd still really like to see the disassembly for both sides of the
> > protocol here. Gavin, is that something you're able to provide? Worst
> > case, the host and guest vmlinux objects would be a starting point.
> > 
> > Personally, I'd be fairly surprised if this was a hardware issue.
> 
> Ok, long shot after eyeballing the vhost code, but does the diff below
> help at all? It looks like vhost_vq_avail_empty() can advance the value
> saved in 'vq->avail_idx' but without the read barrier, possibly confusing
> vhost_get_vq_desc() in polling mode.
> 
> Will
> 
> --->8
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..87bff710331a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
> vhost_virtqueue *vq)
> return false;
> vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>  
> +   smp_rmb();
> return vq->avail_idx == vq->last_avail_idx;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

Oh wow you are right.

We have:

if (vq->avail_idx == vq->last_avail_idx) {
if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
vq_err(vq, "Failed to access avail idx at %p\n",
&vq->avail->idx);
return -EFAULT;
}
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);

if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
vq_err(vq, "Guest moved used index from %u to %u",
last_avail_idx, vq->avail_idx);
return -EFAULT;
}

/* If there's nothing new since last we looked, return
 * invalid.
 */
if (vq->avail_idx == last_avail_idx)
return vq->num;

/* Only get avail ring entries after they have been
 * exposed by guest.
 */
smp_rmb();
}


and so the rmb only happens if avail_idx is not advanced.

Actually there is a bunch of code duplication where we assign to
avail_idx, too.

Will thanks a lot for looking into this! I kept looking into
the virtio side for some reason, the fact that it did not
trigger with qemu should have been a big hint!


-- 
MST




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-26 Thread Michael S. Tsirkin
On Mon, Mar 25, 2024 at 05:34:29PM +1000, Gavin Shan wrote:
> 
> On 3/20/24 17:14, Michael S. Tsirkin wrote:
> > On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> > > On 3/20/24 10:49, Michael S. Tsirkin wrote:>
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 6f7e5010a673..79456706d0bd 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct 
> > > > virtqueue *_vq,
> > > > /* Put entry in available array (but don't update avail->idx 
> > > > until they
> > > >  * do sync). */
> > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > -   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > head);
> > > > +   u16 headwithflag = head | (vq->split.avail_idx_shadow & 
> > > > ~(vq->split.vring.num - 1));
> > > > +   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > headwithflag);
> > > > /* Descriptors and available array need to be set before we 
> > > > expose the
> > > >  * new available array entries. */
> > > > 
> 
> Ok, Michael. I continued with my debugging code. It still looks like a
> hardware bug on NVidia's grace-hopper. I really think NVidia needs to be
> involved for the discussion, as suggested by you.

Do you have a support contact at Nvidia to report this?

> Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70.
> Note that I have only one vCPU in my configuration.

Interesting but is guest built with CONFIG_SMP set?

> Secondly, the debugging code is enhanced so that the available head for
> (last_avail_idx - 1) is read for twice and recorded. It means the available
> head for one specific available index is read for twice. I do see the
> available heads are different from the consecutive reads. More details
> are shared as below.
> 
> From the guest side
> ===
> 
> virtio_net virtio0: output.0:id 86 is not a head!
> head to be released: 047 062 112
> 
> avail_idx:
> 000  49665
> 001  49666  <--
>  :
> 015  49664

what are these #s 49665 and so on?
and how large is the ring?
I am guessing 49664 is the index, the ring size is 16, and
49664 % 16 == 0

> avail_head:


is this the avail ring contents?

> 000  062
> 001  047  <--
>  :
> 015  112


What are these arrows pointing at, btw?


> From the host side
> ==
> 
> avail_idx
> 000  49663
> 001  49666  <---
>  :
> 
> avail_head
> 000  062  (062)
> 001  047  (047)  <---
>  :
> 015  086  (112)  // head 086 is returned from the first read,
>  // but head 112 is returned from the second read
> 
> vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 
> 49664
> 
> Thanks,
> Gavin

OK thanks so this proves it is actually the avail ring value.

-- 
MST




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-21 Thread Michael S. Tsirkin
On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> From: Rong Wang 
> 
> Once enable iommu domain for one device, the MSI
> translation tables have to be there for software-managed MSI.
> Otherwise, platform with software-managed MSI without an
> irq bypass function, can not get a correct memory write event
> from pcie, will not get irqs.
> The solution is to obtain the MSI phy base address from
> iommu reserved region, and set it to iommu MSI cookie,
> then translation tables will be created while request irq.
> 
> Change log
> --
> 
> v1->v2:
> - add resv iotlb to avoid overlap mapping.
> v2->v3:
> - there is no need to export the iommu symbol anymore.
> 
> Signed-off-by: Rong Wang 

There's in interest to keep extending vhost iotlb -
we should just switch over to iommufd which supports
this already.

> ---
>  drivers/vhost/vdpa.c | 59 +---
>  1 file changed, 56 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ba52d128aeb7..28b56b10372b 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -49,6 +49,7 @@ struct vhost_vdpa {
>   struct completion completion;
>   struct vdpa_device *vdpa;
>   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> + struct vhost_iotlb resv_iotlb;
>   struct device dev;
>   struct cdev cdev;
>   atomic_t opened;
> @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
>  static int vhost_vdpa_reset(struct vhost_vdpa *v)
>  {
>   v->in_batch = 0;
> + vhost_iotlb_reset(&v->resv_iotlb);
>   return _compat_vdpa_reset(v);
>  }
>  
> @@ -1219,10 +1221,15 @@ static int vhost_vdpa_process_iotlb_update(struct 
> vhost_vdpa *v,
>   msg->iova + msg->size - 1 > v->range.last)
>   return -EINVAL;
>  
> + if (vhost_iotlb_itree_first(&v->resv_iotlb, msg->iova,
> + msg->iova + msg->size - 1))
> + return -EINVAL;
> +
>   if (vhost_iotlb_itree_first(iotlb, msg->iova,
>   msg->iova + msg->size - 1))
>   return -EEXIST;
>  
> +
>   if (vdpa->use_va)
>   return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
>msg->uaddr, msg->perm);
> @@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct kiocb 
> *iocb,
>   return vhost_chr_write_iter(dev, from);
>  }
>  
> +static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, struct 
> device *dma_dev,
> + struct vhost_iotlb *resv_iotlb)
> +{
> + struct list_head dev_resv_regions;
> + phys_addr_t resv_msi_base = 0;
> + struct iommu_resv_region *region;
> + int ret = 0;
> + bool with_sw_msi = false;
> + bool with_hw_msi = false;
> +
> + INIT_LIST_HEAD(&dev_resv_regions);
> + iommu_get_resv_regions(dma_dev, &dev_resv_regions);
> +
> + list_for_each_entry(region, &dev_resv_regions, list) {
> + ret = vhost_iotlb_add_range_ctx(resv_iotlb, region->start,
> + region->start + region->length - 1,
> + 0, 0, NULL);
> + if (ret) {
> + vhost_iotlb_reset(resv_iotlb);
> + break;
> + }
> +
> + if (region->type == IOMMU_RESV_MSI)
> + with_hw_msi = true;
> +
> + if (region->type == IOMMU_RESV_SW_MSI) {
> + resv_msi_base = region->start;
> + with_sw_msi = true;
> + }
> + }
> +
> + if (!ret && !with_hw_msi && with_sw_msi)
> + ret = iommu_get_msi_cookie(domain, resv_msi_base);
> +
> + iommu_put_resv_regions(dma_dev, &dev_resv_regions);
> +
> + return ret;
> +}
> +
>  static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
>  {
>   struct vdpa_device *vdpa = v->vdpa;
> @@ -1335,11 +1381,16 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa 
> *v)
>  
>   ret = iommu_attach_device(v->domain, dma_dev);
>   if (ret)
> - goto err_attach;
> + goto err_alloc_domain;
>  
> - return 0;
> + ret = vhost_vdpa_resv_iommu_region(v->domain, dma_dev, &v->resv_iotlb);
> + if (ret)
> + goto err_attach_device;
>  
> -err_attach:
> + return 0;
> +err_attach_device:
> + iommu_detach_device(v->domain, dma_dev);
> +err_alloc_domain:
>   iommu_domain_free(v->domain);
>   v->domain = NULL;
>   return ret;
> @@ -1595,6 +1646,8 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
>   goto err;
>   }
>  
> + vhost_iotlb_init(&v->resv_iotlb, 0, 0);
> +
>   r = dev_set_name(>dev, "vhost-vdpa-%u", minor);
>   if (r)
>   goto err;
> -- 
> 2.27.0
> 




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-20 Thread Michael S. Tsirkin
On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> On 3/20/24 10:49, Michael S. Tsirkin wrote:>
> > I think you are wasting the time with these tests. Even if it helps what
> > does this tell us? Try setting a flag as I suggested elsewhere.
> > Then check it in vhost.
> > Or here's another idea - possibly easier. Copy the high bits from index
> > into ring itself. Then vhost can check that head is synchronized with
> > index.
> > 
> > Warning: completely untested, not even compiled. But should give you
> > the idea. If this works btw we should consider making this official in
> > the spec.
> > 
> > 
> >   static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 6f7e5010a673..79456706d0bd 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue 
> > *_vq,
> > /* Put entry in available array (but don't update avail->idx until they
> >  * do sync). */
> > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > -   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > +   u16 headwithflag = head | (vq->split.avail_idx_shadow & 
> > ~(vq->split.vring.num - 1));
> > +   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > headwithflag);
> > /* Descriptors and available array need to be set before we expose the
> >  * new available array entries. */
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..bd8f7c763caa 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct 
> > vhost_virtqueue *vq,
> >   static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> >__virtio16 *head, int idx)
> >   {
> > -   return vhost_get_avail(vq, *head,
> > +   unsigned i = idx;
> > +   unsigned flag = i & ~(vq->num - 1);
> > +   unsigned val = vhost_get_avail(vq, *head,
> >    &vq->avail->ring[idx & (vq->num - 1)]);
> > +   unsigned valflag = val & ~(vq->num - 1);
> > +
> > +   WARN_ON(valflag != flag);
> > +
> > +   return val & (vq->num - 1);
> >   }
> 
> Thanks, Michael. The code is already self-explanatory.

Apparently not. See below.

> Since vq->num is 256, I just
> squeezed the last_avail_idx to the high byte. Unfortunately, I'm unable to hit
> the WARN_ON(). Does it mean the low byte is stale (or corrupted) while the 
> high
> byte is still correct and valid?


I would find this very surprising.

> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] =
> cpu_to_virtio16(_vq->vdev, head | (avail << 8));
> 
> 
> head = vhost16_to_cpu(vq, ring_head);
> WARN_ON((head >> 8) != (vq->last_avail_idx % vq->num));
> head = head & 0xff;

This code misses the point of the test.
The high value you store now is exactly the same each time you
go around the ring. E.g. at beginning of ring you now always
store 0 as high byte. So a stale value will not be detected.

The value you are interested in should change
each time you go around the ring a full circle.
Thus you want exactly the *high byte* of avail idx -
this is what my patch did - your patch instead
stored and compared the low byte.


The advantage of this debugging patch is that it will detect the issue 
immediately
not after guest detected the problem in the used ring.
For example, you can add code to re-read the value, or dump the whole
ring.

> One question: Does QEMU has any chance writing data to the available queue 
> when
> vhost is enabled? My previous understanding is no, the queue is totally owned 
> by
> vhost instead of QEMU.

It shouldn't do it normally.

> Before this patch was posted, I had debugging code to record last 16 
> transactions
> to the available and used queue from guest and host side. It did reveal the 
> wrong
> head was fetched from the available queue.

Oh nice that's a very good hint. And is this still reproducible?

> [   11.785745]  virtqueue_get_buf_ctx_split 
> [   11

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Wed, Mar 20, 2024 at 09:56:58AM +1000, Gavin Shan wrote:
> On 3/20/24 04:22, Will Deacon wrote:
> > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > On 3/19/24 02:59, Will Deacon wrote:
> > > > >drivers/virtio/virtio_ring.c | 12 +---
> > > > >1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > virtqueue *_vq,
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > head);
> > > > > - /* Descriptors and available array need to be set before we 
> > > > > expose the
> > > > > -  * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > +  * Descriptors and available array need to be set before we 
> > > > > expose
> > > > > +  * the new available array entries. virtio_wmb() should be 
> > > > > enough
> > > > > +  * to ensure the order theoretically. However, a stronger 
> > > > > barrier
> > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > +  * architectures, but performance loss is expected.
> > > > > +  */
> > > > > + virtio_mb(false);
> > > > >   vq->split.avail_idx_shadow++;
> > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >   
> > > > > vq->split.avail_idx_shadow);
> > > > 
> > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > here, especially when ordering accesses to coherent memory.
> > > > 
> > > > In practice, either the larger timing difference from the DSB or the fact
> > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > (e.g. via __smp_mb())?
> > > > 
> > > > We definitely shouldn't take changes like this without a proper
> > > > explanation of what is going on.
> > > > 
> > > 
> > > Thanks for your comments, Will.
> > > 
> > > Yes, DMB should work for us. However, it seems this instruction has 
> > > issues on
> > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > work
> > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > before we fully understand the root cause.
> > > 
> > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > issue like
> > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > 
> > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > {
> > >  :
> > >  /* Put entry in available array (but don't update avail->idx 
> > > until they
> > >   * do sync). */
> > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >  vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > 
> > >  /* Descriptors and available array need to be set before we 
> > > expose the
> > >   * new available array entries. */
> > >  // Broken: virtio_wmb(vq->weak_barriers);
> > >  // Broken: __dma_mb();
> > >  // Work:   __mb();
> > >  // Work:   __smp_mb();
> > 
> > It's pretty weird that __dma_mb() is "broken" but __smp_mb() "works". How
> > confident are you in that result?
> > 
> 
> Yes, __dma_mb() is even stronger than __smp_mb(). I retried the test, showing
> that both __dma_mb() and __smp_mb() work for us. I had too many tests 
> yesterday
> and something may have been messed up.
> 
> Instruction   Times the issue was hit in 10 tests
> -
> __smp_wmb() 8
> __smp_mb()  0
> __dma_wmb() 7
> __dma_mb()  0
> __mb()  0
> __wmb() 0
> 
> It's strange that __smp_mb() works, but __smp_wmb() fails. It seems we need a
> read barrier here. I will try WRITE_ONCE() + __smp_wmb() as suggested by 
> Michael
> in another reply. Will update the result soon.
> 
> Thanks,
> Gavin


I think you are wasting your time with these tests. Even if it helps, what
does this tell us? Try setting a flag as I suggested elsewhere, then check
it in vhost.
Or here's another idea, possibly easier: copy the high bits from the index
into the ring itself. Then vhost can check that the head is synchronized
with the index.

Warning: completely untested, not even compiled. But should give you
the idea. If this works btw we should consider making this official in
the spec.


 static inline int vhost_get_avail_flags(struct 

Re: [GIT PULL] virtio: features, fixes

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 11:03:44AM -0700, Linus Torvalds wrote:
> On Tue, 19 Mar 2024 at 00:41, Michael S. Tsirkin  wrote:
> >
> > virtio: features, fixes
> >
> > Per vq sizes in vdpa.
> > Info query for block devices support in vdpa.
> > DMA sync callbacks in vduse.
> >
> > Fixes, cleanups.
> 
> Grr. I thought the merge message was a bit too terse, but I let it slide.
> 
> But only after pushing it out do I notice that not only was the pull
> request message overly terse, you had also rebased this all just
> moments before sending the pull request and didn't even give a hint of
> a reason for that.
> 
> So I missed that, and the merge is out now, but this was NOT OK.
> 
> Yes, rebasing happens. But last-minute rebasing needs to be explained,
> not some kind of nasty surprise after-the-fact.
> 
> And that pull request explanation was really borderline even *without*
> that issue.
> 
> Linus

OK, thanks Linus, and sorry. I did that rebase for testing, then I thought:
hey, history looks much nicer now, why don't I switch to that? Just goes to
show not to do this sort of thing past midnight; I write better merge
messages at sane hours, too.

-- 
MST




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 01:19:23PM -0400, Stefan Hajnoczi wrote:
> On Tue, Mar 19, 2024 at 03:40:53AM -0400, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
> > > Hello,
> > > 
> > > syzbot found the following issue on:
> > > 
> > > HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of 
> > > git://git.kernel.org/p..
> > > git tree:   upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> > > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > > Debian) 2.40
> > > 
> > > Downloadable assets:
> > > disk image: 
> > > https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
> > > vmlinux: 
> > > https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
> > > kernel image: 
> > > https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
> > > 
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> > > 
> > > Key type pkcs7_test registered
> > > Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> > > io scheduler mq-deadline registered
> > > io scheduler kyber registered
> > > io scheduler bfq registered
> > > input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> > > ACPI: button: Power Button [PWRF]
> > > input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> > > ACPI: button: Sleep Button [SLPF]
> > > ioatdma: Intel(R) QuickData Technology Driver 5.00
> > > ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > > virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> > > ACPI: \_SB_.LNKD: Enabled at IRQ 10
> > > virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> > > ACPI: \_SB_.LNKB: Enabled at IRQ 10
> > > virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> > > virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> > > N_HDLC line discipline registered with maxframe=4096
> > > Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > > 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > > 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> > > 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> > > 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> > > Non-volatile memory driver v1.3
> > > Linux agpgart interface v0.103
> > > ACPI: bus type drm_connector registered
> > > [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> > > [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> > > Console: switching to colour frame buffer device 128x48
> > > platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> > > usbcore: registered new interface driver udl
> > > brd: module loaded
> > > loop: module loaded
> > > zram: Added device: zram0
> > > null_blk: disk nullb0 created
> > > null_blk: module loaded
> > > Guest personality initialized and is inactive
> > > VMCI host device registered (name=vmci, major=10, minor=118)
> > > Initialized host personality
> > > usbcore: registered new interface driver rtsx_usb
> > > usbcore: registered new interface driver viperboard
> > > usbcore: registered new interface driver dln2
> > > usbcore: registered new interface driver pn533_usb
> > > nfcsim 0.2 initialized
> > > usbcore: registered new interface driver port100
> > > usbcore: registered new interface driver nfcmrvl
> > > Loading iSCSI transport class v2.0-870.
> > > virtio_scsi virtio0: 1/0/0 default/read/poll queues
> > > [ cut here ]
> > > refcount_t: decrement hit 0; leaking memory.
> > > WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 
> > > refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> > > Modules linked in:
> > > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> > > 6.8.0-syzkaller-11567-gb3603fcb79b1 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > > Google 02/29/2024
> > > RIP: 0010:refcount_warn_s

Re: [PATCH v7 0/3] vduse: add support for networking devices

2024-03-19 Thread Michael S. Tsirkin
On Thu, Feb 29, 2024 at 11:16:04AM +0100, Maxime Coquelin wrote:
> Hello Michael,
> 
> On 2/1/24 09:40, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 09:34:11AM +0100, Maxime Coquelin wrote:
> > > Hi Jason,
> > > 
> > > It looks like all patches got acked by you.
> > > Any blocker to queue the series for next release?
> > > 
> > > Thanks,
> > > Maxime
> > 
> > I think it's good enough at this point. Will put it in
> > linux-next shortly.
> > 
> 
> I fetched linux-next and it seems the series is not in yet.
> Is there anything to be reworked on my side?
> 
> Thanks,
> Maxime

I am sorry, I messed up. It was in the wrong branch and was not pushed,
so of course it did not get tested, and I kept wondering why. I pushed it
in my tree, but it is too late to put it upstream for this cycle. Assuming
Linus merges my tree with no drama, I will make an effort not to rebase my
tree below these patches, so they will keep their hashes; you can use them
already.

-- 
MST




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 06:08:27PM +1000, Gavin Shan wrote:
> On 3/19/24 17:09, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
> > > 
> > > On 3/19/24 16:43, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> > > > > On 3/19/24 16:09, Michael S. Tsirkin wrote:
> > > > > 
> > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > > @@ -687,9 +687,15 @@ static inline int 
> > > > > > > > > virtqueue_add_split(struct virtqueue *_vq,
> > > > > > > > >   avail = vq->split.avail_idx_shadow & 
> > > > > > > > > (vq->split.vring.num - 1);
> > > > > > > > >   vq->split.vring.avail->ring[avail] = 
> > > > > > > > > cpu_to_virtio16(_vq->vdev, head);
> > > > > > > > > - /* Descriptors and available array need to be set 
> > > > > > > > > before we expose the
> > > > > > > > > -  * new available array entries. */
> > > > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > > > + /*
> > > > > > > > > +  * Descriptors and available array need to be set 
> > > > > > > > > before we expose
> > > > > > > > > +  * the new available array entries. virtio_wmb() should 
> > > > > > > > > be enough
> > > > > > > > > +  * to ensure the order theoretically. However, a 
> > > > > > > > > stronger barrier
> > > > > > > > > +  * is needed by ARM64. Otherwise, the stale data can be 
> > > > > > > > > observed
> > > > > > > > > +  * by the host (vhost). A stronger barrier should work 
> > > > > > > > > for other
> > > > > > > > > +  * architectures, but performance loss is expected.
> > > > > > > > > +  */
> > > > > > > > > + virtio_mb(false);
> > > > > > > > >   vq->split.avail_idx_shadow++;
> > > > > > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > > > >   
> > > > > > > > > vq->split.avail_idx_shadow);
> > > > > > > > 
> > > > > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > > > > solution
> > > > > > > > here, especially when ordering accesses to coherent memory.
> > > > > > > > 
> > > > > > > > In practice, either the larger timing difference from the DSB or 
> > > > > > > > the fact
> > > > > > > > that you're going from a Store->Store barrier to a full barrier 
> > > > > > > > is what
> > > > > > > > makes things "work" for you. Have you tried, for example, a DMB 
> > > > > > > > SY
> > > > > > > > (e.g. via __smp_mb())?
> > > > > > > > 
> > > > > > > > We definitely shouldn't take changes like this without a proper
> > > > > > > > explanation of what is going on.
> > > > > > > > 
> > > > > > > 
> > > > > > > Thanks for your comments, Will.
> > > > > > > 
> > > > > > > Yes, DMB should work for us. However, it seems this instruction 
> > > > > > > has issues on
> > > > > > > NVidia's grace-hopper. It's hard for me to understand how DMB and 
> > > > > > > DSB work
> > > > > > > from hardware level. I agree it's not the solution to replace DMB 
> > > > > > > with DSB
> > > > > > > before we fully understand the root cause.
> > > > > > > 
>

Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 09:21:06AM +0100, Tobias Huschle wrote:
> On 2024-03-15 11:31, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> > > On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> > > >
> > 
> > Could you remind me pls, what is the kworker doing specifically that
> > vhost is relying on?
> 
> The kworker is handling the actual data moving in memory, if I'm not
> mistaken.

I think that is the vhost process itself. Maybe you mean the
guest thread versus the vhost thread then?




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > > virtqueue *_vq,
> > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > > head);
> > > > > > -   /* Descriptors and available array need to be set before we 
> > > > > > expose the
> > > > > > -* new available array entries. */
> > > > > > -   virtio_wmb(vq->weak_barriers);
> > > > > > +   /*
> > > > > > +* Descriptors and available array need to be set before we 
> > > > > > expose
> > > > > > +* the new available array entries. virtio_wmb() should be 
> > > > > > enough
> > > > > > +* to ensure the order theoretically. However, a stronger 
> > > > > > barrier
> > > > > > +* is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > +* by the host (vhost). A stronger barrier should work for other
> > > > > > +* architectures, but performance loss is expected.
> > > > > > +*/
> > > > > > +   virtio_mb(false);
> > > > > > vq->split.avail_idx_shadow++;
> > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > 
> > > > > > vq->split.avail_idx_shadow);
> > > > > 
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > > 
> > > > > In practice, either the larger timing difference from the DSB or the 
> > > > > fact
> > > > > that you're going from a Store->Store barrier to a full barrier is 
> > > > > what
> > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > (e.g. via __smp_mb())?
> > > > > 
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > > 
> > > > 
> > > > Thanks for your comments, Will.
> > > > 
> > > > Yes, DMB should work for us. However, it seems this instruction has 
> > > > issues on
> > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > > work
> > > > from hardware level. I agree it's not the solution to replace DMB with 
> > > > DSB
> > > > before we fully understand the root cause.
> > > > 
> > > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > > issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > 
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > >  :
> > > >  /* Put entry in available array (but don't update avail->idx 
> > > > until they
> > > >   * do sync). */
> > > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > >  vq->split.vring.avail->ring[avail] = 
> > > > cpu_to_virtio16(_vq->vdev, head);
> > > > 
> > > >  /* Descriptors and available array need to be set before we 
> > > > expose the
> > > >   * new available array entries. */
> > > >  // Broken: virtio_wmb(vq->weak_barriers);
> > > >  // Broken: __dma_mb();
> > > >  // Work:   __mb();
> >

[GIT PULL] virtio: features, fixes

2024-03-19 Thread Michael S. Tsirkin
The following changes since commit e8f897f4afef0031fe618a8e94127a0934896aba:

  Linux 6.8 (2024-03-10 13:38:09 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 5da7137de79ca6ffae3ace77050588cdf5263d33:

  virtio_net: rename free_old_xmit_skbs to free_old_xmit (2024-03-19 03:19:22 
-0400)


virtio: features, fixes

Per vq sizes in vdpa.
Info query for block devices support in vdpa.
DMA sync callbacks in vduse.

Fixes, cleanups.

Signed-off-by: Michael S. Tsirkin 


Andrew Melnychenko (1):
  vhost: Added pad cleanup if vnet_hdr is not present.

David Hildenbrand (1):
  virtio: reenable config if freezing device failed

Jason Wang (2):
  virtio-net: convert rx mode setting to use workqueue
  virtio-net: add cond_resched() to the command waiting loop

Jonah Palmer (1):
  vdpa/mlx5: Allow CVQ size changes

Maxime Coquelin (1):
  vduse: implement DMA sync callbacks

Ricardo B. Marliere (2):
  vdpa: make vdpa_bus const
  virtio: make virtio_bus const

Shannon Nelson (1):
  vdpa/pds: fixes for VF vdpa flr-aer handling

Steve Sistare (2):
  vdpa_sim: reset must not run
  vdpa: skip suspend/resume ops if not DRIVER_OK

Suzuki K Poulose (1):
  virtio: uapi: Drop __packed attribute in linux/virtio_pci.h

Xuan Zhuo (3):
  virtio: packed: fix unmap leak for indirect desc table
  virtio_net: unify the code for recycling the xmit ptr
  virtio_net: rename free_old_xmit_skbs to free_old_xmit

Zhu Lingshan (20):
  vhost-vdpa: uapi to support reporting per vq size
  vDPA: introduce get_vq_size to vdpa_config_ops
  vDPA/ifcvf: implement vdpa_config_ops.get_vq_size
  vp_vdpa: implement vdpa_config_ops.get_vq_size
  eni_vdpa: implement vdpa_config_ops.get_vq_size
  vdpa_sim: implement vdpa_config_ops.get_vq_size for vDPA simulator
  vduse: implement vdpa_config_ops.get_vq_size for vduse
  virtio_vdpa: create vqs with the actual size
  vDPA/ifcvf: get_max_vq_size to return max size
  vDPA/ifcvf: implement vdpa_config_ops.get_vq_num_min
  vDPA: report virtio-block capacity to user space
  vDPA: report virtio-block max segment size to user space
  vDPA: report virtio-block block-size to user space
  vDPA: report virtio-block max segments in a request to user space
  vDPA: report virtio-block MQ info to user space
  vDPA: report virtio-block topology info to user space
  vDPA: report virtio-block discarding configuration to user space
  vDPA: report virtio-block write zeroes configuration to user space
  vDPA: report virtio-block read-only info to user space
  vDPA: report virtio-blk flush info to user space

 drivers/net/virtio_net.c | 151 +++-
 drivers/vdpa/alibaba/eni_vdpa.c  |   8 ++
 drivers/vdpa/ifcvf/ifcvf_base.c  |  11 +-
 drivers/vdpa/ifcvf/ifcvf_base.h  |   2 +
 drivers/vdpa/ifcvf/ifcvf_main.c  |  15 +++
 drivers/vdpa/mlx5/net/mlx5_vnet.c|  13 ++-
 drivers/vdpa/pds/aux_drv.c   |   2 +-
 drivers/vdpa/pds/vdpa_dev.c  |  20 +++-
 drivers/vdpa/pds/vdpa_dev.h  |   1 +
 drivers/vdpa/vdpa.c  | 214 ++-
 drivers/vdpa/vdpa_sim/vdpa_sim.c |  15 ++-
 drivers/vdpa/vdpa_user/iova_domain.c |  27 -
 drivers/vdpa/vdpa_user/iova_domain.h |   8 ++
 drivers/vdpa/vdpa_user/vduse_dev.c   |  34 ++
 drivers/vdpa/virtio_pci/vp_vdpa.c|   8 ++
 drivers/vhost/net.c  |   3 +
 drivers/vhost/vdpa.c |  14 +++
 drivers/virtio/virtio.c  |   6 +-
 drivers/virtio/virtio_ring.c |   6 +-
 drivers/virtio/virtio_vdpa.c |   5 +-
 include/linux/vdpa.h |   6 +
 include/uapi/linux/vdpa.h|  17 +++
 include/uapi/linux/vhost.h   |   7 ++
 include/uapi/linux/virtio_pci.h  |  10 +-
 24 files changed, 521 insertions(+), 82 deletions(-)




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of git://git.kernel.org/p..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
> kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
> dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> 
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> 
> Key type pkcs7_test registered
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> io scheduler mq-deadline registered
> io scheduler kyber registered
> io scheduler bfq registered
> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> ACPI: button: Power Button [PWRF]
> input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> ACPI: button: Sleep Button [SLPF]
> ioatdma: Intel(R) QuickData Technology Driver 5.00
> ACPI: \_SB_.LNKC: Enabled at IRQ 11
> virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKD: Enabled at IRQ 10
> virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKB: Enabled at IRQ 10
> virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> N_HDLC line discipline registered with maxframe=4096
> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> Non-volatile memory driver v1.3
> Linux agpgart interface v0.103
> ACPI: bus type drm_connector registered
> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> Console: switching to colour frame buffer device 128x48
> platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> usbcore: registered new interface driver udl
> brd: module loaded
> loop: module loaded
> zram: Added device: zram0
> null_blk: disk nullb0 created
> null_blk: module loaded
> Guest personality initialized and is inactive
> VMCI host device registered (name=vmci, major=10, minor=118)
> Initialized host personality
> usbcore: registered new interface driver rtsx_usb
> usbcore: registered new interface driver viperboard
> usbcore: registered new interface driver dln2
> usbcore: registered new interface driver pn533_usb
> nfcsim 0.2 initialized
> usbcore: registered new interface driver port100
> usbcore: registered new interface driver nfcmrvl
> Loading iSCSI transport class v2.0-870.
> virtio_scsi virtio0: 1/0/0 default/read/poll queues
> [ cut here ]
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 
> lib/refcount.c:31
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-syzkaller-11567-gb3603fcb79b1 
> #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 02/29/2024
> RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> Code: b2 00 00 00 e8 57 d4 f2 fc 5b 5d c3 cc cc cc cc e8 4b d4 f2 fc c6 05 0c 
> f9 ef 0a 01 90 48 c7 c7 a0 5d 1e 8c e8 b7 75 b5 fc 90 <0f> 0b 90 90 eb d9 e8 
> 2b d4 f2 fc c6 05 e9 f8 ef 0a 01 90 48 c7 c7
> RSP: :c9066e18 EFLAGS: 00010246
> RAX: 76f86e452fcad900 RBX: 8880210d2aec RCX: 888016ac8000
> RDX:  RSI:  RDI: 
> RBP: 0004 R08: 8157ffe2 R09: fbfff1c396e0
> R10: dc00 R11: fbfff1c396e0 R12: ea000502cdc0
> R13: ea000502cdc8 R14: 1d4000a059b9 R15: 
> FS:  () GS:8880b940() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 88823000 CR3: 0e132000 CR4: 003506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  reset_page_owner include/linux/page_owner.h:25 [inline]
>  free_pages_prepare mm/page_alloc.c:1141 [inline]
>  __free_pages_ok+0xc54/0xd80 mm/page_alloc.c:1270
>  make_alloc_exact+0xa3/0xf0 mm/page_alloc.c:4829
>  vring_alloc_queue drivers/virtio/virtio_ring.c:319 

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Mon, Mar 18, 2024 at 04:59:24PM +, Will Deacon wrote:
> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > The issue is reported by Yihuang Yu who have 'netperf' test on
> > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > client is started in the VM hosted by grace-hopper machine,
> > while the 'netperf' server is running on grace-grace machine.
> > 
> > The VM is started with virtio-net and vhost has been enabled.
> > We observe an error message spew from the VM and then a soft-lockup
> > report. The error message indicates the data associated with
> > the descriptor (index: 135) has been released, and the queue
> > is marked as broken. It eventually leads to the endless effort
> > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > and soft-lockup. The stale index 135 is fetched from the available
> > ring and published to the used ring by vhost, meaning we have
> > disordered writes to the available ring element and available index.
> > 
> >   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> >   -accel kvm -machine virt,gic-version=host\
> >  : \
> >   -netdev tap,id=vnet0,vhost=on\
> >   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > 
> >   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > 
> > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
> > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb'
> > on ARM64. It should work for other architectures, but performance loss is
> > expected.
> > 
> > Cc: sta...@vger.kernel.org
> > Reported-by: Yihuang Yu 
> > Signed-off-by: Gavin Shan 
> > ---
> >  drivers/virtio/virtio_ring.c | 12 +---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 49299b1f9ec7..7d852811c912 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
> > *_vq,
> > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> >  
> > -   /* Descriptors and available array need to be set before we expose the
> > -* new available array entries. */
> > -   virtio_wmb(vq->weak_barriers);
> > +   /*
> > +* Descriptors and available array need to be set before we expose
> > +* the new available array entries. virtio_wmb() should be enough
> > +* to ensure the order theoretically. However, a stronger barrier
> > +* is needed by ARM64. Otherwise, the stale data can be observed
> > +* by the host (vhost). A stronger barrier should work for other
> > +* architectures, but performance loss is expected.
> > +*/
> > +   virtio_mb(false);
> > vq->split.avail_idx_shadow++;
> > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > vq->split.avail_idx_shadow);
> 
> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> here, especially when ordering accesses to coherent memory.
> 
> In practice, either the larger timing difference from the DSB or the fact
> that you're going from a Store->Store barrier to a full barrier is what
> makes things "work" for you. Have you tried, for example, a DMB SY
> (e.g. via __smp_mb())?
> 
> We definitely shouldn't take changes like this without a proper
> explanation of what is going on.
> 
> Will

Just making sure: so on this system, how do
smp_wmb() and wmb() differ? smp_wmb is normally for synchronizing
with kernel running on another CPU and we are doing something
unusual in virtio when we use it to synchronize with host
as opposed to the guest - e.g. CONFIG_SMP is special cased
because of this:

#define virt_wmb() do { kcsan_wmb(); __smp_wmb(); } while (0)

Note __smp_wmb not smp_wmb which would be a NOP on UP.


-- 
MST




Re: [PATCH v3] vduse: Fix off by one in vduse_dev_mmap()

2024-03-19 Thread Michael S. Tsirkin
On Wed, Feb 28, 2024 at 09:24:07PM +0300, Dan Carpenter wrote:
> The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> reading one element beyond the end of the array.
> 
> Add an array_index_nospec() as well to prevent speculation issues.
> 
> Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> Signed-off-by: Dan Carpenter 

Thanks a lot!
I assume this will be squashed in the relevant patch when that is
re-spun.

> ---
> v2: add array_index_nospec()
> v3: I accidentally corrupted v2.  Try again.
> 
>  drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index b7a1fb88c506..eb914084c650 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1532,9 +1532,10 @@ static int vduse_dev_mmap(struct file *file, struct 
> vm_area_struct *vma)
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
>  
> - if (index > dev->vq_num)
> + if (index >= dev->vq_num)
>   return -EINVAL;
>  
> + index = array_index_nospec(index, dev->vq_num);
>   vq = dev->vqs[index];
>   vaddr = vq->vdpa_reconnect_vaddr;
>   if (vaddr == 0)
> -- 
> 2.43.0




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
> 
> On 3/19/24 16:43, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> > > On 3/19/24 16:09, Michael S. Tsirkin wrote:
> > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > > > virtqueue *_vq,
> > > > > > >   avail = vq->split.avail_idx_shadow & 
> > > > > > > (vq->split.vring.num - 1);
> > > > > > >   vq->split.vring.avail->ring[avail] = 
> > > > > > > cpu_to_virtio16(_vq->vdev, head);
> > > > > > > - /* Descriptors and available array need to be set before we 
> > > > > > > expose the
> > > > > > > -  * new available array entries. */
> > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > + /*
> > > > > > > +  * Descriptors and available array need to be set before we 
> > > > > > > expose
> > > > > > > +  * the new available array entries. virtio_wmb() should be 
> > > > > > > enough
> > > > > > > +  * to ensure the ordering in theory. However, a stronger 
> > > > > > > barrier
> > > > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > > > +  * architectures, but performance loss is expected.
> > > > > > > +  */
> > > > > > > + virtio_mb(false);
> > > > > > >   vq->split.avail_idx_shadow++;
> > > > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > >   
> > > > > > > vq->split.avail_idx_shadow);
> > > > > > 
> > > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > > solution
> > > > > > here, especially when ordering accesses to coherent memory.
> > > > > > 
> > > > > > In practice, either the larger timing difference from the DSB or the 
> > > > > > fact
> > > > > > that you're going from a Store->Store barrier to a full barrier is 
> > > > > > what
> > > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > > (e.g. via __smp_mb()).
> > > > > > 
> > > > > > We definitely shouldn't take changes like this without a proper
> > > > > > explanation of what is going on.
> > > > > > 
> > > > > 
> > > > > Thanks for your comments, Will.
> > > > > 
> > > > > Yes, DMB should work for us. However, it seems this instruction has 
> > > > > issues on
> > > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > > > works
> > > > > from hardware level. I agree it's not the solution to replace DMB 
> > > > > with DSB
> > > > > before we fully understand the root cause.
> > > > > 
> > > > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > > > issue like
> > > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) 
> > > > > doesn't.
> > > > > 
> > > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > > {
> > > > >   :
> > > > >   /* Put entry in available array (but don't update 
> > > > > avail->idx until they
> > > > >* do sync). */
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 
> > > > > 1);
> > > > >   vq->split.vring.avail->ring[avail] = 
> > > > > > cpu_to_virtio16(_vq->vdev, head);

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > > virtqueue *_vq,
> > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > > head);
> > > > > > -   /* Descriptors and available array need to be set before we 
> > > > > > expose the
> > > > > > -* new available array entries. */
> > > > > > -   virtio_wmb(vq->weak_barriers);
> > > > > > +   /*
> > > > > > +* Descriptors and available array need to be set before we 
> > > > > > expose
> > > > > > +* the new available array entries. virtio_wmb() should be 
> > > > > > enough
> > > > > > +* to ensure the ordering in theory. However, a stronger 
> > > > > > barrier
> > > > > > +* is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > +* by the host (vhost). A stronger barrier should work for other
> > > > > > +* architectures, but performance loss is expected.
> > > > > > +*/
> > > > > > +   virtio_mb(false);
> > > > > > vq->split.avail_idx_shadow++;
> > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > 
> > > > > > vq->split.avail_idx_shadow);
> > > > > 
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > > 
> > > > > In practice, either the larger timing difference from the DSB or the 
> > > > > fact
> > > > > that you're going from a Store->Store barrier to a full barrier is 
> > > > > what
> > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > (e.g. via __smp_mb()).
> > > > > 
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > > 
> > > > 
> > > > Thanks for your comments, Will.
> > > > 
> > > > Yes, DMB should work for us. However, it seems this instruction has 
> > > > issues on
> > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > > works
> > > > from hardware level. I agree it's not the solution to replace DMB with 
> > > > DSB
> > > > before we fully understand the root cause.
> > > > 
> > > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > > issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > 
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > >  :
> > > >  /* Put entry in available array (but don't update avail->idx 
> > > > until they
> > > >   * do sync). */
> > > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > >  vq->split.vring.avail->ring[avail] = 
> > > > cpu_to_virtio16(_vq->vdev, head);
> > > > 
> > > >  /* Descriptors and available array need to be set before we 
> > > > expose the
> > > >   * new available array entries. */
> > > >  // Broken: virtio_wmb(vq->weak_barriers);
> > > >  // Broken: __dma_mb();
> > > >  // Work:   __mb(

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> On 3/19/24 16:09, Michael S. Tsirkin wrote:
> 
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > virtqueue *_vq,
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > head);
> > > > > - /* Descriptors and available array need to be set before we 
> > > > > expose the
> > > > > -  * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > +  * Descriptors and available array need to be set before we 
> > > > > expose
> > > > > +  * the new available array entries. virtio_wmb() should be 
> > > > > enough
> > > > > +  * to ensure the ordering in theory. However, a stronger 
> > > > > barrier
> > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > +  * architectures, but performance loss is expected.
> > > > > +  */
> > > > > + virtio_mb(false);
> > > > >   vq->split.avail_idx_shadow++;
> > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >   
> > > > > vq->split.avail_idx_shadow);
> > > > 
> > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > here, especially when ordering accesses to coherent memory.
> > > > 
> > > > In practice, either the larger timing difference from the DSB or the fact
> > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > (e.g. via __smp_mb()).
> > > > 
> > > > We definitely shouldn't take changes like this without a proper
> > > > explanation of what is going on.
> > > > 
> > > 
> > > Thanks for your comments, Will.
> > > 
> > > Yes, DMB should work for us. However, it seems this instruction has 
> > > issues on
> > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > works
> > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > before we fully understand the root cause.
> > > 
> > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > issue like
> > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > 
> > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > {
> > >  :
> > >  /* Put entry in available array (but don't update avail->idx 
> > > until they
> > >   * do sync). */
> > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >  vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > 
> > >  /* Descriptors and available array need to be set before we 
> > > expose the
> > >   * new available array entries. */
> > >  // Broken: virtio_wmb(vq->weak_barriers);
> > >  // Broken: __dma_mb();
> > >  // Work:   __mb();
> > >  // Work:   __smp_mb();
> > >  // Work:   __ndelay(100);
> > >  // Work:   __ndelay(10);
> > >  // Broken: __ndelay(9);
> > > 
> > > vq->split.avail_idx_shadow++;
> > >  vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >  
> > > vq->split.avail_idx_shadow);
> > 
> > What if you stick __ndelay here?
> > 
> 
>/* Put entry in available array (but don't update avail->idx until they
>  * do sync). */

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue is reported by Yihuang Yu, who ran 'netperf' tests on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in a VM hosted by the grace-hopper machine,
> while the 'netperf' server is running on the grace-grace machine.
> 
> The VM is started with virtio-net and vhost has been enabled.
> We observe an error message spewed from the VM and then a soft-lockup
> report. The error message indicates the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to an endless effort
> to fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordered writes to the available ring element and available index.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host\
>  : \
>   -netdev tap,id=vnet0,vhost=on\
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> 
>   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> 
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
> virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb'
> on ARM64. It should work for other architectures, but performance loss is
> expected.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/virtio/virtio_ring.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
> *_vq,
>   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
> - /* Descriptors and available array need to be set before we expose the
> -  * new available array entries. */
> - virtio_wmb(vq->weak_barriers);
> + /*
> +  * Descriptors and available array need to be set before we expose
> +  * the new available array entries. virtio_wmb() should be enough
> +  * to ensure the ordering in theory. However, a stronger barrier
> +  * is needed by ARM64. Otherwise, the stale data can be observed
> +  * by the host (vhost). A stronger barrier should work for other
> +  * architectures, but performance loss is expected.
> +  */
> + virtio_mb(false);
>   vq->split.avail_idx_shadow++;
>   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>   vq->split.avail_idx_shadow);



Something else to try, is to disassemble the code and check the compiler is not 
broken.

It also might help to replace the assignment above with WRITE_ONCE -
it has technically always been the right thing to do; it's just a big
change (it has to be done everywhere if done at all), so we never bothered
and we never hit a compiler that would split or speculate stores ...


> -- 
> 2.44.0




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > On 3/19/24 02:59, Will Deacon wrote:
> > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > The issue is reported by Yihuang Yu, who ran 'netperf' tests on
> > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > client is started in the VM hosted by grace-hopper machine,
> > > > while the 'netperf' server is running on grace-grace machine.
> > > > 
> > > > The VM is started with virtio-net and vhost has been enabled.
> > > > We observe an error message spewed from the VM and then soft-lockup
> > > > report. The error message indicates the data associated with
> > > > the descriptor (index: 135) has been released, and the queue
> > > > is marked as broken. It eventually leads to the endless effort
> > > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > > and soft-lockup. The stale index 135 is fetched from the available
> > > > ring and published to the used ring by vhost, meaning we have
> > > > disordered writes to the available ring element and available index.
> > > > 
> > > >/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  
> > > > \
> > > >-accel kvm -machine virt,gic-version=host
> > > > \
> > > >   : 
> > > > \
> > > >-netdev tap,id=vnet0,vhost=on
> > > > \
> > > >-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 
> > > > \
> > > > 
> > > >[   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > > 
> > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> > > > ARM64. It should work for other architectures, but performance loss is
> > > > expected.
> > > > 
> > > > Cc: sta...@vger.kernel.org
> > > > Reported-by: Yihuang Yu 
> > > > Signed-off-by: Gavin Shan 
> > > > ---
> > > >   drivers/virtio/virtio_ring.c | 12 +---
> > > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 49299b1f9ec7..7d852811c912 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > virtqueue *_vq,
> > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > head);
> > > > -   /* Descriptors and available array need to be set before we 
> > > > expose the
> > > > -* new available array entries. */
> > > > -   virtio_wmb(vq->weak_barriers);
> > > > +   /*
> > > > +* Descriptors and available array need to be set before we 
> > > > expose
> > > > +* the new available array entries. virtio_wmb() should be 
> > > > enough
> > > > +* to ensure the ordering in theory. However, a stronger 
> > > > barrier
> > > > +* is needed by ARM64. Otherwise, the stale data can be observed
> > > > +* by the host (vhost). A stronger barrier should work for other
> > > > +* architectures, but performance loss is expected.
> > > > +*/
> > > > +   virtio_mb(false);
> > > > vq->split.avail_idx_shadow++;
> > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > 
> > > > vq->split.avail_idx_shadow);
> > > 
> > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > here, especially when ordering accesses to coherent memory.
> > > 
> > > In practice, either the larger timing difference from the DSB or the fact
> > > that you're going from a Store->Store barrier to a full barrier is what
> > > makes things "work" for you.

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> On 3/19/24 02:59, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu, who ran 'netperf' tests on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > > 
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe an error message spewed from the VM and then soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordered writes to the available ring element and available index.
> > > 
> > >/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> > >-accel kvm -machine virt,gic-version=host\
> > >   : \
> > >-netdev tap,id=vnet0,vhost=on\
> > >-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > 
> > >[   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > 
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> > > ARM64. It should work for other architectures, but performance loss is
> > > expected.
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Reported-by: Yihuang Yu 
> > > Signed-off-by: Gavin Shan 
> > > ---
> > >   drivers/virtio/virtio_ring.c | 12 +---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > virtqueue *_vq,
> > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > - /* Descriptors and available array need to be set before we expose the
> > > -  * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > +  * Descriptors and available array need to be set before we expose
> > > +  * the new available array entries. virtio_wmb() should be enough
> > > +  * to ensure the ordering in theory. However, a stronger barrier
> > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > +  * by the host (vhost). A stronger barrier should work for other
> > > +  * architectures, but performance loss is expected.
> > > +  */
> > > + virtio_mb(false);
> > >   vq->split.avail_idx_shadow++;
> > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >   
> > > vq->split.avail_idx_shadow);
> > 
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> > 
> > In practice, either the larger timing difference from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smp_mb()).
> > 
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> > 
> 
> Thanks for your comments, Will.
> 
> Yes, DMB should work for us. However, it seems this instruction has issues on
> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> from hardware level. I agree it's not the solution to replace DMB with DSB
> before we fully understand the root cause.
> 
> I tried the possible replacement like below. __smp_mb() can avoid the issue 
> like
> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> 
> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> {
> :
> /* Put entry in available array (but don't update avail->idx until 
> they
>  * do sync). */
> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> 
> /* Descriptors and available array need to be set before we expose the
>  * new available array entries. */
> // Broken: virtio_wmb(vq->weak_barriers);
> // Broken: __dma_mb();
> // Work:   __mb();
> // Work:   __smp_mb();
>

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-18 Thread Michael S. Tsirkin
On Mon, Mar 18, 2024 at 09:41:45AM +1000, Gavin Shan wrote:
> On 3/18/24 02:50, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
> > > 
> > > On 3/15/24 21:05, Michael S. Tsirkin wrote:
> > > > On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> > > > > > > Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper 
> > > > > > > platform. I tried
> > > > > to reproduce it with my own driver where one thread writes to the 
> > > > > shared buffer
> > > > > and another thread reads from the buffer. I don't hit the 
> > > > > out-of-order issue so
> > > > > far.
> > > > 
> > > > Make sure the 2 areas you are accessing are in different cache lines.
> > > > 
> > > 
> > > Yes, I already put those 2 areas to separate cache lines.
> > > 
> > > > 
> > > > > My driver may be not correct somewhere and I will update if I can 
> > > > > reproduce
> > > > > the issue with my driver in the future.
> > > > 
> > > > Then maybe your change is just making virtio slower and masks the bug
> > > > that is actually elsewhere?
> > > > 
> > > > You don't really need a driver. Here's a simple test: without barriers
> > > > assertion will fail. With barriers it will not.
> > > > (Warning: didn't bother testing too much, could be buggy.)
> > > > 
> > > > ---
> > > > 
> > > > #include <stdio.h>
> > > > #include <stdlib.h>
> > > > #include <pthread.h>
> > > > #include <assert.h>
> > > > 
> > > > #define FIRST values[0]
> > > > #define SECOND values[64]
> > > > 
> > > > volatile int values[100] = {};
> > > > 
> > > > void* writer_thread(void* arg) {
> > > > while (1) {
> > > > FIRST++;
> > > > // NEED smp_wmb here
> > >  __asm__ volatile("dmb ishst" : : : "memory");
> > > > SECOND++;
> > > > }
> > > > }
> > > > 
> > > > void* reader_thread(void* arg) {
> > > >   while (1) {
> > > > int first = FIRST;
> > > > // NEED smp_rmb here
> > >  __asm__ volatile("dmb ishld" : : : "memory");
> > > > int second = SECOND;
> > > > assert(first - second == 1 || first - second == 0);
> > > >   }
> > > > }
> > > > 
> > > > int main() {
> > > >   pthread_t writer, reader;
> > > > 
> > > >   pthread_create(&writer, NULL, writer_thread, NULL);
> > > >   pthread_create(&reader, NULL, reader_thread, NULL);
> > > > 
> > > >   pthread_join(writer, NULL);
> > > >   pthread_join(reader, NULL);
> > > > 
> > > >   return 0;
> > > > }
> > > > 
> > > 
> > > Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
> > > the assert on both of them. After replacing 'dmb' with 'dsb', I can
> > > hit assert on both of them too. I need to look at the code closely.
> > > 
> > > [root@virt-mtcollins-02 test]# ./a
> > > a: a.c:26: reader_thread: Assertion `first - second == 1 || first - 
> > > second == 0' failed.
> > > Aborted (core dumped)
> > > 
> > > [root@nvidia-grace-hopper-05 test]# ./a
> > > a: a.c:26: reader_thread: Assertion `first - second == 1 || first - 
> > > second == 0' failed.
> > > Aborted (core dumped)
> > > 
> > > Thanks,
> > > Gavin
> > 
> > 
> > Actually this test is broken. No need for ordering; it's a simple race.
> > The following works on x86 (though x86 does not need
> > barriers).
> > 
> > 
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <pthread.h>
> > #include <assert.h>
> > 
> > #if 0
> > #define x86_rmb()  asm volatile("lfence":::"memory")
> > #define x86_mb()  asm volatile("mfence":::"memory")
> > #define x86_smb()  asm volatile("sfence":::"memory")
> > #else
> > #define x86_rmb()  asm volatile("":::"memory")
> > #define x86_mb()  asm volatile("":::"memory")
> > #define x86_smb()  asm volatile("":::"memory")
> > #endif
> > 
> >

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-17 Thread Michael S. Tsirkin
On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
> 
> On 3/15/24 21:05, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> > > > > Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper 
> > > > > platform. I tried
> > > to reproduce it with my own driver where one thread writes to the shared 
> > > buffer
> > > and another thread reads from the buffer. I don't hit the out-of-order 
> > > issue so
> > > far.
> > 
> > Make sure the 2 areas you are accessing are in different cache lines.
> > 
> 
> Yes, I already put those 2 areas to separate cache lines.
> 
> > 
> > > My driver may be not correct somewhere and I will update if I can 
> > > reproduce
> > > the issue with my driver in the future.
> > 
> > Then maybe your change is just making virtio slower and masks the bug
> > that is actually elsewhere?
> > 
> > You don't really need a driver. Here's a simple test: without barriers
> > assertion will fail. With barriers it will not.
> > (Warning: didn't bother testing too much, could be buggy.)
> > 
> > ---
> > 
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <pthread.h>
> > #include <assert.h>
> > 
> > #define FIRST values[0]
> > #define SECOND values[64]
> > 
> > volatile int values[100] = {};
> > 
> > void* writer_thread(void* arg) {
> > while (1) {
> > FIRST++;
> > // NEED smp_wmb here
> __asm__ volatile("dmb ishst" : : : "memory");
> > SECOND++;
> > }
> > }
> > 
> > void* reader_thread(void* arg) {
> >  while (1) {
> > int first = FIRST;
> > // NEED smp_rmb here
> __asm__ volatile("dmb ishld" : : : "memory");
> > int second = SECOND;
> > assert(first - second == 1 || first - second == 0);
> >  }
> > }
> > 
> > int main() {
> >  pthread_t writer, reader;
> > 
> >  pthread_create(&writer, NULL, writer_thread, NULL);
> >  pthread_create(&reader, NULL, reader_thread, NULL);
> > 
> >  pthread_join(writer, NULL);
> >  pthread_join(reader, NULL);
> > 
> >  return 0;
> > }
> > 
> 
> Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
> the assert on both of them. After replacing 'dmb' with 'dsb', I can
> hit assert on both of them too. I need to look at the code closely.
> 
> [root@virt-mtcollins-02 test]# ./a
> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 
> 0' failed.
> Aborted (core dumped)
> 
> [root@nvidia-grace-hopper-05 test]# ./a
> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 
> 0' failed.
> Aborted (core dumped)
> 
> Thanks,
> Gavin


Actually this test is broken. No need for ordering; it's a simple race.
The following works on x86 (though x86 does not need
barriers).


#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <assert.h>

#if 0
#define x86_rmb()  asm volatile("lfence":::"memory")
#define x86_mb()  asm volatile("mfence":::"memory")
#define x86_smb()  asm volatile("sfence":::"memory")
#else
#define x86_rmb()  asm volatile("":::"memory")
#define x86_mb()  asm volatile("":::"memory")
#define x86_smb()  asm volatile("":::"memory")
#endif

#define FIRST values[0]
#define SECOND values[640]
#define FLAG values[1280]

volatile unsigned values[2000] = {};

void* writer_thread(void* arg) {
while (1) {
/* Now synchronize with reader */
while(FLAG);
FIRST++;
x86_smb();
SECOND++;
x86_smb();
FLAG = 1;
}
}

void* reader_thread(void* arg) {
while (1) {
/* Now synchronize with writer */
while(!FLAG);
x86_rmb();
unsigned first = FIRST;
x86_rmb();
unsigned second = SECOND;
assert(first - second == 1 || first - second == 0);
FLAG = 0;

if (!(first % 100))
printf("%d\n", first);
   }
}

int main() {
pthread_t writer, reader;

pthread_create(&writer, NULL, writer_thread, NULL);
pthread_create(&reader, NULL, reader_thread, NULL);

pthread_join(writer, NULL);
pthread_join(reader, NULL);

return 0;
}




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-15 Thread Michael S. Tsirkin
On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> 
> + Will, Catalin and Matt from Nvidia
> 
> On 3/14/24 22:59, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
> > > On 3/14/24 21:50, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> > > > > On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > > > > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > > > > The issue is reported by Yihuang Yu, who ran 'netperf' tests on
> > > > > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > > > > client is started in the VM hosted by grace-hopper machine,
> > > > > > > while the 'netperf' server is running on grace-grace machine.
> > > > > > > 
> > > > > > > The VM is started with virtio-net and vhost has been enabled.
> > > > > > > We observe an error message spewed from the VM and then soft-lockup
> > > > > > > report. The error message indicates the data associated with
> > > > > > > the descriptor (index: 135) has been released, and the queue
> > > > > > > is marked as broken. It eventually leads to the endless effort
> > > > > > > to fetch free buffer (skb) in 
> > > > > > > drivers/net/virtio_net.c::start_xmit()
> > > > > > > and soft-lockup. The stale index 135 is fetched from the available
> > > > > > > ring and published to the used ring by vhost, meaning we have
> > > > > > > disordered writes to the available ring element and available index.
> > > > > > > 
> > > > > > >  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  
> > > > > > > \
> > > > > > >  -accel kvm -machine virt,gic-version=host
> > > > > > > \
> > > > > > > : 
> > > > > > > \
> > > > > > >  -netdev tap,id=vnet0,vhost=on
> > > > > > > \
> > > > > > >  -device 
> > > > > > > virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > > > > > 
> > > > > > >  [   19.993158] virtio_net virtio1: output.0:id 135 is not a 
> > > > > > > head!
> > > > > > > 
> > > > > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
> > > > > > > stronger virtio_mb(false), equivalent to replacing the 'dmb'
> > > > > > > instruction with 'dsb' on ARM64. It should work for other
> > > > > > > architectures, but a performance loss is expected.
> > > > > > > 
> > > > > > > Cc: sta...@vger.kernel.org
> > > > > > > Reported-by: Yihuang Yu 
> > > > > > > Signed-off-by: Gavin Shan 
> > > > > > > ---
> > > > > > > drivers/virtio/virtio_ring.c | 12 +---
> > > > > > > 1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > > - /* Descriptors and available array need to be set before we expose the
> > > > > > > -  * new available array entries. */
> > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > + /*

Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-15 Thread Michael S. Tsirkin
On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> > 
> > Thanks a lot! To clarify it is not that I am opposed to changing vhost.
> > I would like however for some documentation to exist saying that if you
> > do abc then call API xyz. Then I hope we can feel a bit safer that
> > future scheduler changes will not break vhost (though as usual, nothing
> > is for sure).  Right now we are going by the documentation and that says
> > cond_resched so we do that.
> > 
> > -- 
> > MST
> > 
> 
> Here I'd like to add that we have two different problems:
> 
> 1. cond_resched not working as expected
>This appears to me to be a bug in the scheduler where it lets the cgroup, 
>which the vhost is running in, loop endlessly. In EEVDF terms, the cgroup
>is allowed to surpass its own deadline without consequences. One of my RFCs
>mentioned above addresses this issue (not happy yet with the implementation).
>This issue only appears in that specific scenario, so it's not a general 
>issue, rather a corner case.
>But, this fix will still allow the vhost to reach its deadline, which is
>one full time slice. This brings down the max delays from 300+ms to whatever
>the timeslice is. This is not enough to fix the regression.
> 
> 2. vhost relying on kworker being scheduled on wake up
>This is the bigger issue for the regression. There are rare cases, where
>the vhost runs only for a very short amount of time before it wakes up 
>the kworker. Simultaneously, the kworker takes longer than usual to 
>complete its work and takes longer than the vhost did before. We
>are talking 4digit to low 5digit nanosecond values.
>With those two being the only tasks on the CPU, the scheduler now assumes
>that the kworker wants to unfairly consume more than the vhost and denies
>it being scheduled on wakeup.
>In the regular cases, the kworker is faster than the vhost, so the 
>scheduler assumes that the kworker needs help, which benefits the
>scenario we are looking at.
>In the bad case, this means, unfortunately, that cond_resched cannot work
>as well as before for this particular case!
>So, let's assume that problem 1 from above is fixed. It will take one 
>full time slice to get the need_resched flag set by the scheduler
>because vhost surpasses its deadline. Before, the scheduler cannot know
>that the kworker should actually run. The kworker itself is unable
>to communicate that by itself since it's not getting scheduled and there 
>is no external entity that could intervene.
>Hence my argumentation that cond_resched still works as expected. The
>crucial part is that the wake up behavior has changed which is why I'm 
>a bit reluctant to propose a documentation change on cond_resched.
>I could see proposing a doc change, that cond_resched should not be
>used if a task heavily relies on a woken up task being scheduled.

Could you remind me pls, what is the kworker doing specifically that
vhost is relying on?

-- 
MST




Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 12:46:54PM +0100, Tobias Huschle wrote:
> On Tue, Mar 12, 2024 at 09:45:57AM +, Luis Machado wrote:
> > On 3/11/24 17:05, Michael S. Tsirkin wrote:
> > > 
> > > Are we going anywhere with this btw?
> > > 
> > >
> > 
> > I think Tobias had a couple other threads related to this, with other 
> > potential fixes:
> > 
> > https://lore.kernel.org/lkml/20240228161018.14253-1-husc...@linux.ibm.com/
> > 
> > https://lore.kernel.org/lkml/20240228161023.14310-1-husc...@linux.ibm.com/
> > 
> 
> Sorry, Michael, should have provided those threads here as well.
> 
> The more I look into this issue, the more things to ponder upon I find.
> It seems like this issue can (maybe) be fixed on the scheduler side after all.
> 
> The root cause of this regression remains that the mentioned kworker gets
> a negative lag value and is therefore not eligible to run on wake up.
> This negative lag is potentially assigned incorrectly. But I'm not sure yet.
> 
> Anytime I find something that can address the symptom, there is a potential
> root cause on another level, and I would like to avoid just addressing a
> symptom to fix the issue, whereas it would be better to find the actual
> root cause.
> 
> I would nevertheless still argue, that vhost relies rather heavily on the fact
> that the kworker gets scheduled on wake up every time. But I don't have a
> proposal at hand that accounts for potential side effects if opting for
> explicitly initiating a schedule.
> Maybe the assumption, that said kworker should always be selected on wake 
> up is valid. In that case the explicit schedule would merely be a safety 
> net.
> 
> I will let you know if something comes up on the scheduler side. There are
> some more ideas on my side how this could be approached.

Thanks a lot! To clarify it is not that I am opposed to changing vhost.
I would like however for some documentation to exist saying that if you
do abc then call API xyz. Then I hope we can feel a bit safer that
future scheduler changes will not break vhost (though as usual, nothing
is for sure).  Right now we are going by the documentation and that says
cond_resched so we do that.

-- 
MST




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
> On 3/14/24 21:50, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> > > On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > > The issue is reported by Yihuang Yu, who ran a 'netperf' test on
> > > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > > client is started in the VM hosted by grace-hopper machine,
> > > > > while the 'netperf' server is running on grace-grace machine.
> > > > > 
> > > > > The VM is started with virtio-net and vhost has been enabled.
> > > > > We observe an error message spew from the VM and then a soft-lockup
> > > > > report. The error message indicates the data associated with
> > > > > the descriptor (index: 135) has been released, and the queue
> > > > > is marked as broken. It eventually leads to the endless effort
> > > > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > > > and soft-lockup. The stale index 135 is fetched from the available
> > > > > ring and published to the used ring by vhost, meaning we have
> > > > > disordered writes to the available ring element and available index.
> > > > > 
> > > > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> > > > > -accel kvm -machine virt,gic-version=host \
> > > > > : \
> > > > > -netdev tap,id=vnet0,vhost=on \
> > > > > -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > > > 
> > > > > [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > > > 
> > > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
> > > > > stronger virtio_mb(false), equivalent to replacing the 'dmb'
> > > > > instruction with 'dsb' on ARM64. It should work for other
> > > > > architectures, but a performance loss is expected.
> > > > > 
> > > > > Cc: sta...@vger.kernel.org
> > > > > Reported-by: Yihuang Yu 
> > > > > Signed-off-by: Gavin Shan 
> > > > > ---
> > > > >drivers/virtio/virtio_ring.c | 12 +---
> > > > >1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > - /* Descriptors and available array need to be set before we expose the
> > > > > -  * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > +  * Descriptors and available array need to be set before we expose
> > > > > +  * the new available array entries. virtio_wmb() should be enough
> > > > > +  * to ensure the order theoretically. However, a stronger barrier
> > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > +  * architectures, but performance loss is expected.
> > > > > +  */
> > > > > + virtio_mb(false);
> > > > 
> > > > 
> > > > I don't get what is going on here. Any explanation why virtio_wmb is not
> > > > enough besides "it does not work"?
> > > > 
> > > 
> > > The change is replacing instructi

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu, who ran a 'netperf' test on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > > 
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe an error message spew from the VM and then a soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordered writes to the available ring element and available index.
> > > 
> > >/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> > >-accel kvm -machine virt,gic-version=host\
> > >   : \
> > >-netdev tap,id=vnet0,vhost=on\
> > >-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > 
> > >[   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > 
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
> > > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb'
> > > on ARM64. It should work for other architectures, but a performance loss
> > > is expected.
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Reported-by: Yihuang Yu 
> > > Signed-off-by: Gavin Shan 
> > > ---
> > >   drivers/virtio/virtio_ring.c | 12 +---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > - /* Descriptors and available array need to be set before we expose the
> > > -  * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > +  * Descriptors and available array need to be set before we expose
> > > +  * the new available array entries. virtio_wmb() should be enough
> > > +  * to ensure the order theoretically. However, a stronger barrier
> > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > +  * by the host (vhost). A stronger barrier should work for other
> > > +  * architectures, but performance loss is expected.
> > > +  */
> > > + virtio_mb(false);
> > 
> > 
> > I don't get what is going on here. Any explanation why virtio_wmb is not
> > enough besides "it does not work"?
> > 
> 
> The change is replacing instruction "dmb" with "dsb". "dsb" is a stronger
> barrier than "dmb" because "dsb" ensures that all memory accesses raised
> before this instruction are completed when the "dsb" instruction completes.
> However, "dmb" doesn't guarantee the order of completion of the memory
> accesses.
>
> So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, 
> vq->split.avail_idx_shadow)'
> can be completed before 'vq->split.vring.avail->ring[avail] = 
> cpu_to_virtio16(_vq->vdev, head)'.

Completed as observed by which CPU?
We have 2 writes that we want observed by another CPU in order.
So if CPU observes a new value of idx we want it to see
new value in ring.
This is standard use of smp_wmb()
How are these 2 writes different?

What DMB does, it seems, is ensure that the effects of
'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
are observed after the effects of
'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue is reported by Yihuang Yu, who ran a 'netperf' test on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in the VM hosted by grace-hopper machine,
> while the 'netperf' server is running on grace-grace machine.
> 
> The VM is started with virtio-net and vhost has been enabled.
> We observe an error message spew from the VM and then a soft-lockup
> report. The error message indicates the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to the endless effort
> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordered writes to the available ring element and available index.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host\
>  : \
>   -netdev tap,id=vnet0,vhost=on\
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> 
>   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> 
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
> virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> ARM64. It should work for other architectures, but a performance loss is
> expected.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/virtio/virtio_ring.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
> - /* Descriptors and available array need to be set before we expose the
> -  * new available array entries. */
> - virtio_wmb(vq->weak_barriers);
> + /*
> +  * Descriptors and available array need to be set before we expose
> +  * the new available array entries. virtio_wmb() should be enough
> +  * to ensure the order theoretically. However, a stronger barrier
> +  * is needed by ARM64. Otherwise, the stale data can be observed
> +  * by the host (vhost). A stronger barrier should work for other
> +  * architectures, but performance loss is expected.
> +  */
> + virtio_mb(false);


I don't get what is going on here. Any explanation why virtio_wmb is not
enough besides "it does not work"?

>   vq->split.avail_idx_shadow++;
>   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>   vq->split.avail_idx_shadow);
> -- 
> 2.44.0




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-11 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > >  Summary 
> > > 
> > > In my (non-vhost experience) opinion the way to go would be either
> > > replacing the cond_resched with a hard schedule or setting the
> > > need_resched flag within vhost if a data transfer was successfully
> > > initiated. It will be necessary to check if this causes problems with
> > > other workloads/benchmarks.
> > 
> > Yes but conceptually I am still in the dark on whether the fact that
> > periodically invoking cond_resched is no longer sufficient to be nice to
> > others is a bug, or intentional.  So you feel it is intentional?
> 
> I would assume that cond_resched is still a valid concept.
> But, in this particular scenario we have the following problem:
> 
> So far (with CFS) we had:
> 1. vhost initiates data transfer
> 2. kworker is woken up
> 3. CFS gives priority to woken up task and schedules it
> 4. kworker runs
> 
> Now (with EEVDF) we have:
> 0. In some cases, kworker has accumulated negative lag 
> 1. vhost initiates data transfer
> 2. kworker is woken up
> -3a. EEVDF does not schedule kworker if it has negative lag
> -4a. vhost continues running, kworker on same CPU starves
> --
> -3b. EEVDF schedules kworker if it has positive or no lag
> -4b. kworker runs
> 
> In the 3a/4a case, the kworker is given no chance to set the
> necessary flag. The flag can only be set by another CPU now.
> The schedule of the kworker was not caused by cond_resched, but
> rather by the wakeup path of the scheduler.
> 
> cond_resched works successfully once the load balancer (I suppose) 
> decides to migrate the vhost off to another CPU. In that case, the
> load balancer on another CPU sets that flag and we are good.
> That then eventually allows the scheduler to pick kworker, but very
> late.

Are we going anywhere with this btw?


> > I propose a two patch series then:
> > 
> > patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> > 
> > If you're doing longer computations: first think userspace. If you
> > **really** want to do it in kernel you should regularly check if you need
> > to give up the CPU (remember there is cooperative multitasking per CPU).
> > Idiom::
> > 
> > cond_resched(); /* Will sleep */
> > 
> > 
> > replace cond_resched -> schedule
> > 
> > 
> > Since apparently cond_resched is no longer sufficient to
> > make the scheduler check whether you need to give up the CPU.
> > 
> > patch 2: make this change for vhost.
> > 
> > WDYT?
> 
> For patch 1, I would like to see some feedback from Peter (or someone else
> from the scheduler maintainers).
> For patch 2, I would prefer to do some more testing first if this might have
> an negative effect on other benchmarks.
> 
> I also stumbled upon something in the scheduler code that I want to verify.
> Maybe a cgroup thing, will check that out again.
> 
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.
> 
> > 
> > -- 
> > MST
> > 
> > 




Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-01 Thread Michael S. Tsirkin
On Fri, Mar 01, 2024 at 11:45:52AM +, wangyunjian wrote:
> > -Original Message-
> > From: Paolo Abeni [mailto:pab...@redhat.com]
> > Sent: Thursday, February 29, 2024 7:13 PM
> > To: wangyunjian ; m...@redhat.com;
> > willemdebruijn.ker...@gmail.com; jasow...@redhat.com; k...@kernel.org;
> > bj...@kernel.org; magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> > jonathan.le...@gmail.com; da...@davemloft.net
> > Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> > 
> > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> > 
> > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > >   }
> > >  }
> > >
> > > +static void tun_peek_xsk(struct tun_file *tfile)
> > > +{
> > > + struct xsk_buff_pool *pool;
> > > + u32 i, batch, budget;
> > > + void *frame;
> > > +
> > > + if (!ptr_ring_empty(&tfile->tx_ring))
> > > + return;
> > > +
> > > + spin_lock(&tfile->pool_lock);
> > > + pool = tfile->xsk_pool;
> > > + if (!pool) {
> > > + spin_unlock(&tfile->pool_lock);
> > > + return;
> > > + }
> > > +
> > > + if (tfile->nb_descs) {
> > > + xsk_tx_completed(pool, tfile->nb_descs);
> > > + if (xsk_uses_need_wakeup(pool))
> > > + xsk_set_tx_need_wakeup(pool);
> > > + }
> > > +
> > > + spin_lock(&tfile->tx_ring.producer_lock);
> > > + budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> > > +
> > > + batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > + if (!batch) {
> > 
> > This branch looks like an unneeded "optimization". The generic loop below
> > should have the same effect with no measurable perf delta - and smaller 
> > code.
> > Just remove this.
> > 
> > > + tfile->nb_descs = 0;
> > > + spin_unlock(&tfile->tx_ring.producer_lock);
> > > + spin_unlock(&tfile->pool_lock);
> > > + return;
> > > + }
> > > +
> > > + tfile->nb_descs = batch;
> > > + for (i = 0; i < batch; i++) {
> > > + /* Encode the XDP DESC flag into lowest bit for consumer to 
> > > differ
> > > +  * XDP desc from XDP buffer and sk_buff.
> > > +  */
> > > + frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> > > + /* The budget must be less than or equal to tx_ring.size,
> > > +  * so enqueuing will not fail.
> > > +  */
> > > + __ptr_ring_produce(>tx_ring, frame);
> > > + }
> > > + spin_unlock(&tfile->tx_ring.producer_lock);
> > > + spin_unlock(&tfile->pool_lock);
> > 
> > More related to the general design: it looks wrong. What if
> > get_rx_bufs() will fail (ENOBUF) after successful peeking? With no more
> > incoming packets, later peek will return 0 and it looks like that the
> > half-processed packets will stay in the ring forever???
> > 
> > I think the 'ring produce' part should be moved into tun_do_read().
> 
> Currently, vhost-net obtains a batch of descriptors/sk_buffs from the
> ptr_ring and enqueues the batch to the virtqueue's queue,
> and then consumes the descriptors/sk_buffs from the virtqueue's queue in
> sequence. As a result, TUN does not know whether the batch descriptors have
> been used up, and thus does not know when to return the batch descriptors.
> 
> So, I think it's reasonable that when vhost-net finds the ptr_ring empty,
> it calls peek_len to get new xsk descs and return the descriptors.
> 
> Thanks

What you need to think about is that if you peek, another call
in parallel can get the same value at the same time.


> > 
> > Cheers,
> > 
> > Paolo
> 




Re: [PATCH net-next v2 0/3] tun: AF_XDP Tx zero-copy support

2024-02-28 Thread Michael S. Tsirkin
On Wed, Feb 28, 2024 at 07:04:41PM +0800, Yunjian Wang wrote:
> Hi all:
> 
> Now, some drivers support the zero-copy feature of AF_XDP sockets,
> which can significantly reduce CPU utilization for XDP programs.
> 
> This patch set allows TUN to also support the AF_XDP Tx zero-copy
> feature. It is based on Linux 6.8.0+(openEuler 23.09) and has
> successfully passed Netperf and Netserver stress testing with
> multiple streams between VM A and VM B, using AF_XDP and OVS.
> 
> The performance testing was performed on a Intel E5-2620 2.40GHz
> machine. Traffic was generated/sent through TUN (testpmd txonly
> with AF_XDP) to VM (testpmd rxonly in guest).
> 
> +------+---------+-----------+---------+
> |      |   copy  | zero-copy | speedup |
> +------+---------+-----------+---------+
> | UDP  |   Mpps  |   Mpps    |    %    |
> | 64   |   2.5   |   4.0     |   60%   |
> | 512  |   2.1   |   3.6     |   71%   |
> | 1024 |   1.9   |   3.3     |   73%   |
> +------+---------+-----------+---------+
> 
> Yunjian Wang (3):
>   xsk: Remove non-zero 'dma_page' check in xp_assign_dev
>   vhost_net: Call peek_len when using xdp
>   tun: AF_XDP Tx zero-copy support


threading broken pls repost.

vhost bits look ok though:

Acked-by: Michael S. Tsirkin 


>  drivers/net/tun.c   | 177 ++--
>  drivers/vhost/net.c |  21 +++--
>  include/linux/if_tun.h  |  32 
>  net/xdp/xsk_buff_pool.c |   7 --
>  4 files changed, 220 insertions(+), 17 deletions(-)
> 
> -- 
> 2.41.0




Re: [PATCH] vduse: Fix off by one in vduse_dev_mmap()

2024-02-27 Thread Michael S. Tsirkin
On Tue, Feb 27, 2024 at 06:21:46PM +0300, Dan Carpenter wrote:
> The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> reading one element beyond the end of the array.
> 
> Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> Signed-off-by: Dan Carpenter 


Oh wow and does this not come from userspace? If yes we
need the speculation magic macro when using the index, do we not?

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index b7a1fb88c506..9150c8281953 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1532,7 +1532,7 @@ static int vduse_dev_mmap(struct file *file, struct 
> vm_area_struct *vma)
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
>  
> - if (index > dev->vq_num)
> + if (index >= dev->vq_num)
>   return -EINVAL;
>  
>   vq = dev->vqs[index];
> -- 
> 2.43.0




Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-02-25 Thread Michael S. Tsirkin
On Fri, Feb 23, 2024 at 10:42:37AM +0100, Miklos Szeredi wrote:
> On Wed, 3 Jan 2024 at 11:58, Hou Tao  wrote:
> >
> > From: Hou Tao 
> >
> > When trying to insert a 10MB kernel module kept in a virtiofs with cache
> > disabled, the following warning was reported:
> >
> >   [ cut here ]
> >   WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
> >   Modules linked in:
> >   CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
> >   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
> >   RIP: 0010:__alloc_pages+0x2c4/0x360
> >   ..
> >   Call Trace:
> >
> >? __warn+0x8f/0x150
> >? __alloc_pages+0x2c4/0x360
> >__kmalloc_large_node+0x86/0x160
> >__kmalloc+0xcd/0x140
> >virtio_fs_enqueue_req+0x240/0x6d0
> >virtio_fs_wake_pending_and_unlock+0x7f/0x190
> >queue_request_and_unlock+0x58/0x70
> >fuse_simple_request+0x18b/0x2e0
> >fuse_direct_io+0x58a/0x850
> >fuse_file_read_iter+0xdb/0x130
> >__kernel_read+0xf3/0x260
> >kernel_read+0x45/0x60
> >kernel_read_file+0x1ad/0x2b0
> >init_module_from_file+0x6a/0xe0
> >idempotent_init_module+0x179/0x230
> >__x64_sys_finit_module+0x5d/0xb0
> >do_syscall_64+0x36/0xb0
> >entry_SYSCALL_64_after_hwframe+0x6e/0x76
> >..
> >
> >   ---[ end trace  ]---
> >
> > The warning happened as follow. In copy_args_to_argbuf(), virtiofs uses
> > kmalloc-ed memory as a bounce buffer for fuse args, but
> 
> So this seems to be the special case in fuse_get_user_pages() when the
> read/write requests get a piece of kernel memory.
> 
> I don't really understand the comment in virtio_fs_enqueue_req():  /*
> Use a bounce buffer since stack args cannot be mapped */
> 
> Stefan, can you explain?  What's special about the arg being on the stack?

virtio core wants DMA'able addresses.

See Documentation/core-api/dma-api-howto.rst :

...


This rule also means that you may use neither kernel image addresses
(items in data/text/bss segments), nor module image addresses, nor
stack addresses for DMA.



> What if the arg is not on the stack (as is probably the case for big
> args like this)?   Do we need the bounce buffer in that case?
> 
> Thanks,
> Miklos




Re: [PATCH -next] VDUSE: fix another doc underline warning

2024-02-23 Thread Michael S. Tsirkin
On Thu, Feb 22, 2024 at 10:23:41PM -0800, Randy Dunlap wrote:
> Extend the underline for a heading to prevent a documentation
> build warning. Also spell "reconnection" correctly.
> 
> Documentation/userspace-api/vduse.rst:236: WARNING: Title underline too short.
> HOW VDUSE devices reconnectoin works
> 
> 
> Fixes: 2b3fd606c662 ("Documentation: Add reconnect process for VDUSE")
> Signed-off-by: Randy Dunlap 
> Cc: Cindy Lu 
> Cc: Michael S. Tsirkin 
> Cc: Jason Wang 
> Cc: Xuan Zhuo 
> Cc: virtualizat...@lists.linux.dev
> Cc: Jonathan Corbet 

Thanks, I fixed this in my tree already.

> ---
>  Documentation/userspace-api/vduse.rst |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff -- a/Documentation/userspace-api/vduse.rst 
> b/Documentation/userspace-api/vduse.rst
> --- a/Documentation/userspace-api/vduse.rst
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -232,8 +232,8 @@ able to start the dataplane processing a
>  
>  For more details on the uAPI, please see include/uapi/linux/vduse.h.
>  
> -HOW VDUSE devices reconnectoin works
> -
> +HOW VDUSE devices reconnection works
> +
>  0. Userspace APP checks if the device /dev/vduse/vduse_name exists.
> If it does not exist, need to create the instance.goto step 1
> If it does exist, it means this is a reconnect and goto step 3.




Re: [PATCH] vduse: implement DMA sync callbacks

2024-02-22 Thread Michael S. Tsirkin
On Tue, Feb 20, 2024 at 01:01:36AM -0800, Christoph Hellwig wrote:
> On Mon, Feb 19, 2024 at 06:06:06PM +0100, Maxime Coquelin wrote:
> > Since commit 295525e29a5b ("virtio_net: merge dma
> > operations when filling mergeable buffers"), VDUSE device
> > require support for DMA's .sync_single_for_cpu() operation
> > as the memory is non-coherent between the device and CPU
> > because of the use of a bounce buffer.
> > 
> > This patch implements both .sync_single_for_cpu() and
> > sync_single_for_device() callbacks, and also skip bounce
> > buffer copies during DMA map and unmap operations if the
> > DMA_ATTR_SKIP_CPU_SYNC attribute is set to avoid extra
> > copies of the same buffer.
> 
> vduse really needs to get out of implementing fake DMA operations for
> something that is not DMA.

In a sense ... but on the other hand, the "fake DMA" metaphor seems to
work surprisingly well, like in this instance - internal bounce buffer
looks a bit like non-coherent DMA.  A way to make this all prettier
would I guess be to actually wrap all of DMA with virtio wrappers which
would all go if () dma_... else vduse_...; or something to this end.  A
lot of work for sure, and is it really worth it? if the only crazy
driver is vduse I'd maybe rather keep the crazy hacks local there ...

-- 
MST
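[Editorial aside: the `if () dma_... else vduse_...` dispatch mentioned above could look roughly like the sketch below. All names here are hypothetical stand-ins, not the kernel's actual API; it only shows the shape of the idea.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Hypothetical device handle: all the wrapper needs to know is whether
 * the device is backed by vduse's bounce buffer. */
struct vdev {
	bool is_vduse;
};

/* Stand-ins for the two backends (illustrative, not real kernel calls). */
static dma_addr_t real_dma_map(void *cpu_addr, size_t len)
{
	(void)len;
	return (dma_addr_t)(uintptr_t)cpu_addr;     /* identity-map toy */
}

static dma_addr_t vduse_bounce_map(void *cpu_addr, size_t len)
{
	(void)len;
	return (dma_addr_t)(uintptr_t)cpu_addr | 1; /* tagged toy address */
}

/* Single entry point: callers never branch on the device type themselves. */
static dma_addr_t virtio_map_single(struct vdev *dev, void *cpu_addr, size_t len)
{
	if (dev->is_vduse)
		return vduse_bounce_map(cpu_addr, len);
	return real_dma_map(cpu_addr, len);
}
```

The trade-off discussed in the thread is exactly this: centralizing the branch in one wrapper versus keeping the vduse special-casing local to the vduse driver.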




Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-22 Thread Michael S. Tsirkin
On Fri, Feb 09, 2024 at 01:57:25PM +0100, Paolo Abeni wrote:
> On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  wrote:
> > > 
> > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > > > > 
> > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer 
> > > > > >  wrote:
> > > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > > [...]
> > > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct 
> > > > > > > > xdp_buff *xdp)
> > > > > > > >   }
> > > > > > > >   }
> > > > > > > > 
> > > > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff 
> > > > > > > > *virtnet_xdp,
> > > > > > > > +  struct net_device *dev,
> > > > > > > > +  struct 
> > > > > > > > virtio_net_hdr_v1_hash *hdr_hash)
> > > > > > > > +{
> > > > > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > > > > + virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > > > > + }
> > > > > > > > +}
> > > > > > > > +
> > > > > > > 
> > > > > > > Would it be possible to store a pointer to hdr_hash in 
> > > > > > > virtnet_xdp_buff,
> > > > > > > with the purpose of delaying extracting this, until and only if 
> > > > > > > XDP
> > > > > > > bpf_prog calls the kfunc?
> > > > > > > 
> > > > > > 
> > > > > > That seems to be the way v1 works,
> > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > > . But it was pointed out that the inline header may be overwritten 
> > > > > > by
> > > > > > the xdp prog, so the hash is copied out to maintain its integrity.
> > > > > 
> > > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > > contents/buffer?
> > > > > 
> > > > 
> > > > Normally, an XDP program accesses only the packet data. However,
> > > > there's also an XDP RX Metadata area, referenced by the data_meta
> > > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > > > point somewhere ahead of the data buffer, thereby granting the XDP
> > > > program access to the virtio header located immediately before the
> > > 
> > > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
> > > xdp->data_hard_start:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
> > > 
> > > and virtio net set such field after the virtio_net_hdr:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
> > > 
> > > I don't see how the virtio hdr could be touched? Possibly even more
> > > important: if such thing is possible, I think is should be somewhat
> > > denied (for the same reason an H/W nic should prevent XDP from
> > > modifying its own buffer descriptor).
> > 
> > Thank you for highlighting this concern. The header layout differs
> > slightly between small and mergeable mode. Taking 'mergeable mode' as
> > an example, after calling xdp_prepare_buff the layout of xdp_buff
> > would be as depicted in the diagram below,
> > 
> >   buf
> >|
> >v
> > +--+--+-+
> > | xdp headroom | virtio header| packet  |
> > | (256 bytes)  | (20 bytes)   | content |
> > +--+--+-+
> > ^ ^
> > | |
> >  data_hard_startdata
> >   data_meta
> > 
> > If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
> > towards 'data_hard_start', it would point to the inline header, thus
> > potentially allowing the XDP program to access the inline header.
> 
> I see. That layout was completely unexpected to me.
> 
> AFAICS the virtio_net driver tries to avoid accessing/using the
> virtio_net_hdr after the XDP program execution, so nothing tragic
> should happen.
> 
> @Michael, @Jason, I guess the above is like that by design? Isn't it a
> bit fragile?
> 
> Thanks!
> 
> Paolo

I agree it is all a bit fragile, not sure how to do better without extra
copies though ...

-- 
MST
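[Editorial aside: the layout concern in this thread reduces to a few lines of bounds arithmetic. The sketch below uses toy offsets modeled on the diagram above; it is a simplified model of the range check, not the real `bpf_xdp_adjust_meta()` implementation, which also enforces alignment.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
	HEADROOM     = 256, /* xdp headroom in the diagram */
	VNET_HDR_LEN = 20,  /* inline virtio header */
};

/* Offsets are measured from data_hard_start (== buf).  A simplified
 * model of the check bpf_xdp_adjust_meta() performs: data_meta may sit
 * anywhere in [data_hard_start, data]. */
static bool meta_move_allowed(ptrdiff_t meta_off, ptrdiff_t data_off)
{
	return meta_off >= 0 && meta_off <= data_off;
}
```

Because the inline virtio header occupies the last `VNET_HDR_LEN` bytes before `data`, a move of `data_meta` back by that amount lands inside the header yet still passes the bounds check; that is the write window the thread is worried about.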




Re: [PATCH 1/1] vhost: Added pad cleanup if vnet_hdr is not present.

2024-02-22 Thread Michael S. Tsirkin
On Mon, Jan 15, 2024 at 05:32:25PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 15, 2024 at 09:48:40PM +0200, Andrew Melnychenko wrote:
> > When the Qemu launched with vhost but without tap vnet_hdr,
> > vhost tries to copy vnet_hdr from socket iter with size 0
> > to the page that may contain some trash.
> > That trash can be interpreted as unpredictable values for
> > vnet_hdr.
> > That leads to dropping some packets and in some cases to
> > stalling vhost routine when the vhost_net tries to process
> > packets and fails in a loop.
> > 
> > Qemu options:
> >   -netdev tap,vhost=on,vnet_hdr=off,...
> > 
> > Signed-off-by: Andrew Melnychenko 
> > ---
> >  drivers/vhost/net.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index f2ed7167c848..57411ac2d08b 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -735,6 +735,9 @@ static int vhost_net_build_xdp(struct 
> > vhost_net_virtqueue *nvq,
> > hdr = buf;
> > gso = &hdr->gso;
> >  
> > +   if (!sock_hlen)
> > +   memset(buf, 0, pad);
> > +
> > if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> > vhost16_to_cpu(vq, gso->csum_start) +
> > vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> 
> 
> Hmm need to analyse it to make sure there are no cases where we leak
> some data to guest here in case where sock_hlen is set ...


Could you post this analysis pls?

> > -- 
> > 2.43.0
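[Editorial aside: a toy model of what the `memset()` in the patch prevents. This is simplified for illustration — the real code parses a `virtio_net_hdr` here, not raw bytes, and `PAD` is an illustrative size, not vhost's actual value.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAD 32 /* illustrative pad size */

/* When the socket provides no vnet header (sock_hlen == 0), nothing
 * ever writes the header area, so leftover page contents get parsed
 * as gso flags; zeroing makes the header read as "no offloads". */
static void build_hdr(unsigned char *buf, size_t sock_hlen)
{
	if (!sock_hlen)
		memset(buf, 0, PAD);        /* the fix from the patch above */
	else
		memset(buf, 0xaa, sock_hlen); /* stand-in for copying from the socket iter */
}
```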




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-22 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.

As I predicted, if you want attention from sched guys you need to
send a patch in their area.

-- 
MST




Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-02-22 Thread Michael S. Tsirkin
On Wed, Jan 03, 2024 at 06:59:29PM +0800, Hou Tao wrote:
> From: Hou Tao 
> 
> When trying to insert a 10MB kernel module kept in a virtiofs with cache
> disabled, the following warning was reported:
> 
>   [ cut here ]
>   WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
>   Modules linked in:
>   CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
>   RIP: 0010:__alloc_pages+0x2c4/0x360
>   ..
>   Call Trace:
>  <TASK>
>? __warn+0x8f/0x150
>? __alloc_pages+0x2c4/0x360
>__kmalloc_large_node+0x86/0x160
>__kmalloc+0xcd/0x140
>virtio_fs_enqueue_req+0x240/0x6d0
>virtio_fs_wake_pending_and_unlock+0x7f/0x190
>queue_request_and_unlock+0x58/0x70
>fuse_simple_request+0x18b/0x2e0
>fuse_direct_io+0x58a/0x850
>fuse_file_read_iter+0xdb/0x130
>__kernel_read+0xf3/0x260
>kernel_read+0x45/0x60
>kernel_read_file+0x1ad/0x2b0
>init_module_from_file+0x6a/0xe0
>idempotent_init_module+0x179/0x230
>__x64_sys_finit_module+0x5d/0xb0
>do_syscall_64+0x36/0xb0
>entry_SYSCALL_64_after_hwframe+0x6e/0x76
>..
>  </TASK>
>   ---[ end trace  ]---
> 
> The warning happened as follow. In copy_args_to_argbuf(), virtiofs uses
> kmalloc-ed memory as bound buffer for fuse args, but
> fuse_get_user_pages() only limits the length of fuse arg by max_read or
> max_write for IOV_KVEC io (e.g., kernel_read_file from finit_module()).
> For virtiofs, max_read is UINT_MAX, so a big read request which is about
> 10MB is passed to copy_args_to_argbuf(), kmalloc() is called in turn
> with len=10MB, and triggers the warning in __alloc_pages():
> WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp)).
> 
> A feasible solution is to limit the value of max_read for virtiofs, so
> the length passed to kmalloc() will be limited. However it will affect
> the max read size for ITER_IOVEC io and the value of max_write also needs
> limitation. So instead of limiting the values of max_read and max_write,
> introducing max_nopage_rw to cap both the values of max_read and
> max_write when the fuse dio read/write request is initiated from kernel.
> 
> Considering that fuse read/write request from kernel is uncommon and to
> decrease the demand for large contiguous pages, set max_nopage_rw as
> 256KB instead of KMALLOC_MAX_SIZE - 4096 or similar.
> 
> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
> Signed-off-by: Hou Tao 


So what should I do with this patch? It includes fuse changes
but of course I can merge too if no one wants to bother either way...


> ---
>  fs/fuse/file.c  | 12 +++-
>  fs/fuse/fuse_i.h|  3 +++
>  fs/fuse/inode.c |  1 +
>  fs/fuse/virtio_fs.c |  6 ++
>  4 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a660f1f21540..f1beb7c0b782 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1422,6 +1422,16 @@ static int fuse_get_user_pages(struct fuse_args_pages 
> *ap, struct iov_iter *ii,
>   return ret < 0 ? ret : 0;
>  }
>  
> +static size_t fuse_max_dio_rw_size(const struct fuse_conn *fc,
> +const struct iov_iter *iter, int write)
> +{
> + unsigned int nmax = write ? fc->max_write : fc->max_read;
> +
> + if (iov_iter_is_kvec(iter))
> + nmax = min(nmax, fc->max_nopage_rw);
> + return nmax;
> +}
> +
>  ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>  loff_t *ppos, int flags)
>  {
> @@ -1432,7 +1442,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct 
> iov_iter *iter,
>   struct inode *inode = mapping->host;
>   struct fuse_file *ff = file->private_data;
>   struct fuse_conn *fc = ff->fm->fc;
> - size_t nmax = write ? fc->max_write : fc->max_read;
> + size_t nmax = fuse_max_dio_rw_size(fc, iter, write);
>   loff_t pos = *ppos;
>   size_t count = iov_iter_count(iter);
>   pgoff_t idx_from = pos >> PAGE_SHIFT;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 1df83eebda92..fc753cd34211 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -594,6 +594,9 @@ struct fuse_conn {
>   /** Constrain ->max_pages to this value during feature negotiation */
>   unsigned int max_pages_limit;
>  
> + /** Maximum read/write size when there is no page in request */
> + unsigned int max_nopage_rw;
> +
>   /** Input queue */
>   struct fuse_iqueue iq;
>  
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2a6d44f91729..4cbbcb4a4b71 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -923,6 +923,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct 
> fuse_mount *fm,
>   fc->user_ns = get_user_ns(user_ns);
>   fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
>   fc->max_pages_limit = FUSE_MAX_MAX_PAGES;
> + fc->max_nopage_rw = UINT_MAX;
>  
>   INIT_LIST_HEAD(&fc->mounts);
>   
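[Editorial aside: the capping rule the commit message describes boils down to the sketch below. The 256KB value stands in for the proposed `max_nopage_rw` default; the function mirrors the shape of the patch's `fuse_max_dio_rw_size()`, with illustrative naming.]

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* ITER_KVEC (kernel-originated) dio is additionally capped by
 * max_nopage_rw; ordinary user iov keeps the negotiated
 * max_read/max_write. */
static unsigned int max_dio_size(unsigned int negotiated_max,
				 unsigned int max_nopage_rw,
				 bool is_kvec)
{
	if (is_kvec && max_nopage_rw < negotiated_max)
		return max_nopage_rw;
	return negotiated_max;
}
```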

Re: [PATCH net-next v4 2/2] virtio-net: add cond_resched() to the command waiting loop

2024-02-22 Thread Michael S. Tsirkin
On Tue, Jul 25, 2023 at 11:03:11AM +0800, Jason Wang wrote:
> On Mon, Jul 24, 2023 at 3:18 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Jul 24, 2023 at 02:52:49PM +0800, Jason Wang wrote:
> > > On Mon, Jul 24, 2023 at 2:46 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Fri, Jul 21, 2023 at 10:18:03PM +0200, Maxime Coquelin wrote:
> > > > >
> > > > >
> > > > > On 7/21/23 17:10, Michael S. Tsirkin wrote:
> > > > > > On Fri, Jul 21, 2023 at 04:58:04PM +0200, Maxime Coquelin wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 7/21/23 16:45, Michael S. Tsirkin wrote:
> > > > > > > > On Fri, Jul 21, 2023 at 04:37:00PM +0200, Maxime Coquelin wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 7/20/23 23:02, Michael S. Tsirkin wrote:
> > > > > > > > > > On Thu, Jul 20, 2023 at 01:26:20PM -0700, Shannon Nelson 
> > > > > > > > > > wrote:
> > > > > > > > > > > On 7/20/23 1:38 AM, Jason Wang wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Adding cond_resched() to the command waiting loop for a 
> > > > > > > > > > > > better
> > > > > > > > > > > > co-operation with the scheduler. This allows to give 
> > > > > > > > > > > > CPU a breath to
> > > > > > > > > > > > run other task(workqueue) instead of busy looping when 
> > > > > > > > > > > > preemption is
> > > > > > > > > > > > not allowed on a device whose CVQ might be slow.
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Jason Wang 
> > > > > > > > > > >
> > > > > > > > > > > This still leaves hung processes, but at least it doesn't 
> > > > > > > > > > > pin the CPU any
> > > > > > > > > > > more.  Thanks.
> > > > > > > > > > > Reviewed-by: Shannon Nelson 
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'd like to see a full solution
> > > > > > > > > > 1- block until interrupt
> > > > > > > > >
> > > > > > > > > Would it make sense to also have a timeout?
> > > > > > > > > And when timeout expires, set FAILED bit in device status?
> > > > > > > >
> > > > > > > > virtio spec does not set any limits on the timing of vq
> > > > > > > > processing.
> > > > > > >
> > > > > > > Indeed, but I thought the driver could decide it is too long for 
> > > > > > > it.
> > > > > > >
> > > > > > > The issue is we keep waiting with rtnl locked, it can quickly 
> > > > > > > make the
> > > > > > > system unusable.
> > > > > >
> > > > > > if this is a problem we should find a way not to keep rtnl
> > > > > > locked indefinitely.
> > > > >
> > > > > From the tests I have done, I think it is. With OVS, a 
> > > > > reconfiguration is
> > > > > performed when the VDUSE device is added, and when a MLX5 device is
> > > > > in the same bridge, it ends up doing an ioctl() that tries to take the
> > > > > rtnl lock. In this configuration, it is not possible to kill OVS 
> > > > > because
> > > > > it is stuck trying to acquire rtnl lock for mlx5 that is held by 
> > > > > virtio-
> > > > > net.
> > > >
> > > > So for sure, we can queue up the work and process it later.
> > > > The somewhat tricky part is limiting the memory consumption.
> > >
> > > And it needs to sync with rtnl somehow, e.g device unregistering which
> > > seems not easy.
> > >
> > > Thanks
> >
> > since when does device unregister need to send cvq commands?
> 
> It doesn't do this n

Re: [syzbot] [virtualization?] linux-next boot error: WARNING: refcount bug in __free_pages_ok

2024-02-22 Thread Michael S. Tsirkin
On Thu, Feb 22, 2024 at 11:06:55AM +0800, Lei Yang wrote:
> Hi All
> 
> I hit a similar issue when doing a regression testing from my side.
> For the error messages please help review the attachment.
> 
> The latest commit:
> commit c02197fc9076e7d991c8f6adc11759c5ba52ddc6 (HEAD -> master,
> origin/master, origin/HEAD)
> Merge: f2667e0c3240 0846dd77c834
> Author: Linus Torvalds 
> Date:   Sat Feb 17 16:59:31 2024 -0800
> 
> Merge tag 'powerpc-6.8-3' of
> git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
> 
> Pull powerpc fixes from Michael Ellerman:
>  "This is a bit of a big batch for rc4, but just due to holiday hangover
>   and because I didn't send any fixes last week due to a late revert
>   request. I think next week should be back to normal.
> 
> Regards
> Lei

It all looks like a generic bug dealing with some refcounting
in the allocator.  So, any chance of a bisect there?

-- 
MST




Re: [syzbot] [virtualization?] linux-next boot error: WARNING: refcount bug in __free_pages_ok

2024-02-18 Thread Michael S. Tsirkin
On Sun, Feb 18, 2024 at 09:06:18PM -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:d37e1e4c52bc Add linux-next specific files for 20240216
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=171ca65218
> kernel config:  https://syzkaller.appspot.com/x/.config?x=4bc446d42a7d56c0
> dashboard link: https://syzkaller.appspot.com/bug?extid=6f3c38e8a6a0297caa5a
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> 
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/14d0894504b9/disk-d37e1e4c.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/6cda61e084ee/vmlinux-d37e1e4c.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/720c85283c05/bzImage-d37e1e4c.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+6f3c38e8a6a0297ca...@syzkaller.appspotmail.com
> 
> Key type pkcs7_test registered
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> io scheduler mq-deadline registered
> io scheduler kyber registered
> io scheduler bfq registered
> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> ACPI: button: Power Button [PWRF]
> input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> ACPI: button: Sleep Button [SLPF]
> ioatdma: Intel(R) QuickData Technology Driver 5.00
> ACPI: \_SB_.LNKC: Enabled at IRQ 11
> virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKD: Enabled at IRQ 10
> virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKB: Enabled at IRQ 10
> virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> N_HDLC line discipline registered with maxframe=4096
> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> Non-volatile memory driver v1.3
> Linux agpgart interface v0.103
> ACPI: bus type drm_connector registered
> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> Console: switching to colour frame buffer device 128x48
> platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> usbcore: registered new interface driver udl
> brd: module loaded
> loop: module loaded
> zram: Added device: zram0
> null_blk: disk nullb0 created
> null_blk: module loaded
> Guest personality initialized and is inactive
> VMCI host device registered (name=vmci, major=10, minor=118)
> Initialized host personality
> usbcore: registered new interface driver rtsx_usb
> usbcore: registered new interface driver viperboard
> usbcore: registered new interface driver dln2
> usbcore: registered new interface driver pn533_usb
> nfcsim 0.2 initialized
> usbcore: registered new interface driver port100
> usbcore: registered new interface driver nfcmrvl
> Loading iSCSI transport class v2.0-870.
> virtio_scsi virtio0: 1/0/0 default/read/poll queues
> [ cut here ]
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 
> lib/refcount.c:31
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc4-next-20240216-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/25/2024
> RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> Code: b2 00 00 00 e8 b7 94 f0 fc 5b 5d c3 cc cc cc cc e8 ab 94 f0 fc c6 05 c6 
> 16 ce 0a 01 90 48 c7 c7 a0 5a fe 8b e8 67 69 b4 fc 90 <0f> 0b 90 90 eb d9 e8 
> 8b 94 f0 fc c6 05 a3 16 ce 0a 01 90 48 c7 c7
> RSP: :c9066e10 EFLAGS: 00010246
> RAX: 15c2c224c9b50400 RBX: 888020827d2c RCX: 8880162d8000
> RDX:  RSI:  RDI: 
> RBP: 0004 R08: 8157b942 R09: fbfff1bf95cc
> R10: dc00 R11: fbfff1bf95cc R12: ea000502fdc0
> R13: ea000502fdc8 R14: 1d4000a05fb9 R15: 
> FS:  () GS:8880b940() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 88823000 CR3: 0df32000 CR4: 003506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  <TASK>
>  reset_page_owner include/linux/page_owner.h:24 [inline]
>  free_pages_prepare mm/page_alloc.c:1140 [inline]
>  __free_pages_ok+0xc42/0xd70 mm/page_alloc.c:1269
>  make_alloc_exact+0xc4/0x140 mm/page_alloc.c:4847
>  vring_alloc_queue drivers/virtio/virtio_ring.c:319 

Re: [v4 PATCH] ALSA: virtio: Fix "Coverity: virtsnd_kctl_tlv_op(): Uninitialized variables" warning.

2024-02-16 Thread Michael S. Tsirkin
On Fri, Feb 16, 2024 at 02:42:37PM +0100, Takashi Iwai wrote:
> On Fri, 16 Feb 2024 12:27:48 +0100,
> Michael S. Tsirkin wrote:
> > 
> > On Fri, Feb 16, 2024 at 11:06:43AM +0100, Aiswarya Cyriac wrote:
> > > This commit fixes the following warning when building virtio_snd driver.
> > > 
> > > "
> > > *** CID 1583619:  Uninitialized variables  (UNINIT)
> > > sound/virtio/virtio_kctl.c:294 in virtsnd_kctl_tlv_op()
> > > 288
> > > 289   break;
> > > 290   }
> > > 291
> > > 292   kfree(tlv);
> > > 293
> > > vvv CID 1583619:  Uninitialized variables  (UNINIT)
> > > vvv Using uninitialized value "rc".
> > > 294   return rc;
> > > 295 }
> > > 296
> > > 297 /**
> > > 298  * virtsnd_kctl_get_enum_items() - Query items for the ENUMERATED 
> > > element type.
> > > 299  * @snd: VirtIO sound device.
> > > "
> > > 
> > > This warning is caused by the absence of the "default" branch in the
> > > switch-block, and is a false positive because the kernel calls
> > > virtsnd_kctl_tlv_op() only with values for op_flag processed in
> > > this block.
> > > 
> > > Also, this commit unifies the cleanup path for all possible control
> > > paths in the callback function.
> > > 
> > > Signed-off-by: Anton Yakovlev 
> > > Signed-off-by: Aiswarya Cyriac 
> > > Reported-by: coverity-bot 
> > > Addresses-Coverity-ID: 1583619 ("Uninitialized variables")
> > > Fixes: d6568e3de42d ("ALSA: virtio: add support for audio controls")
> > 
> > 
> > 
> > > ---
> > >  sound/virtio/virtio_kctl.c | 19 +++
> > >  1 file changed, 15 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/sound/virtio/virtio_kctl.c b/sound/virtio/virtio_kctl.c
> > > index 0c6ac74aca1e..7aa79c05b464 100644
> > > --- a/sound/virtio/virtio_kctl.c
> > > +++ b/sound/virtio/virtio_kctl.c
> > > @@ -253,8 +253,8 @@ static int virtsnd_kctl_tlv_op(struct snd_kcontrol 
> > > *kcontrol, int op_flag,
> > >  
> > >   tlv = kzalloc(size, GFP_KERNEL);
> > >   if (!tlv) {
> > > - virtsnd_ctl_msg_unref(msg);
> > > - return -ENOMEM;
> > > + rc = -ENOMEM;
> > > + goto on_msg_unref;
> > >   }
> > >  
> > >   sg_init_one(&sg, tlv, size);
> > > @@ -281,14 +281,25 @@ static int virtsnd_kctl_tlv_op(struct snd_kcontrol 
> > > *kcontrol, int op_flag,
> > >   hdr->hdr.code =
> > >   cpu_to_le32(VIRTIO_SND_R_CTL_TLV_COMMAND);
> > >  
> > > - if (copy_from_user(tlv, utlv, size))
> > > + if (copy_from_user(tlv, utlv, size)) {
> > >   rc = -EFAULT;
> > > - else
> > > + goto on_msg_unref;
> > > + } else {
> > >   rc = virtsnd_ctl_msg_send(snd, msg, &sg, NULL, false);
> > > + }
> > >  
> > >   break;
> > > + default:
> > > + rc = -EINVAL;
> > > + /* We never get here - we listed all values for op_flag */
> > > + WARN_ON(1);
> > > + goto on_msg_unref;
> > >   }
> > > + kfree(tlv);
> > > + return rc;
> > >  
> > > +on_msg_unref:
> > > + virtsnd_ctl_msg_unref(msg);
> > >   kfree(tlv);
> > >  
> > >   return rc;
> > 
> > I don't really like adding code for a false-positive but ALSA
> > maintainers seem to like this. If yes, this seems like as good
> > a way as any to do it.
> 
> Err, no, you misunderstood the situation.
> 
> I took the v1 patch quickly because:
> - It was with Anton's SOB, who is another maintainer of the driver
> - I assumed you lost interest in this driver since you haven't reacted
>   to the previous patches for long time
> - The change there was small and simple enough
> 
> Now, it grows unnecessarily large, and yet you complained.  Why should
> I take it, then?
> 
> This is a subtle cosmetic issue that isn't worth wasting too much
> time and energy.  If we want to shut up the compile warning, and this
> is a case where it can't happen, just put the "default:" to the
> existing case.  If you want to be user-friendly, put some comment.
> That's all.  It'll be a one-liner.
> 
> OTOH, if we do care and want to catch any potential logical mistake,
> you can put WARN().  But, this doesn't have to go out as an error.
> Simply putting WARN() for the default and going through would work,
> too.
> 
> Or we can keep this lengthy changes if we want, too.
> 
> So, I really don't mind which way to fix as long as it works correctly
> (and doesn't look too ugly).  Please make agreement among you guys,
> and resubmit if needed.
> 
> 
> thanks,
> 
> Takashi

OK sorry about too verbose.  I mean since Anton wants it, I ack this.

Acked-by: Michael S. Tsirkin 


-- 
MST




Re: [v4 PATCH] ALSA: virtio: Fix "Coverity: virtsnd_kctl_tlv_op(): Uninitialized variables" warning.

2024-02-16 Thread Michael S. Tsirkin
On Fri, Feb 16, 2024 at 11:06:43AM +0100, Aiswarya Cyriac wrote:
> This commit fixes the following warning when building virtio_snd driver.
> 
> "
> *** CID 1583619:  Uninitialized variables  (UNINIT)
> sound/virtio/virtio_kctl.c:294 in virtsnd_kctl_tlv_op()
> 288
> 289   break;
> 290   }
> 291
> 292   kfree(tlv);
> 293
> vvv CID 1583619:  Uninitialized variables  (UNINIT)
> vvv Using uninitialized value "rc".
> 294   return rc;
> 295 }
> 296
> 297 /**
> 298  * virtsnd_kctl_get_enum_items() - Query items for the ENUMERATED 
> element type.
> 299  * @snd: VirtIO sound device.
> "
> 
> This warning is caused by the absence of the "default" branch in the
> switch-block, and is a false positive because the kernel calls
> virtsnd_kctl_tlv_op() only with values for op_flag processed in
> this block.
> 
> Also, this commit unifies the cleanup path for all possible control
> paths in the callback function.
> 
> Signed-off-by: Anton Yakovlev 
> Signed-off-by: Aiswarya Cyriac 
> Reported-by: coverity-bot 
> Addresses-Coverity-ID: 1583619 ("Uninitialized variables")
> Fixes: d6568e3de42d ("ALSA: virtio: add support for audio controls")



> ---
>  sound/virtio/virtio_kctl.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/sound/virtio/virtio_kctl.c b/sound/virtio/virtio_kctl.c
> index 0c6ac74aca1e..7aa79c05b464 100644
> --- a/sound/virtio/virtio_kctl.c
> +++ b/sound/virtio/virtio_kctl.c
> @@ -253,8 +253,8 @@ static int virtsnd_kctl_tlv_op(struct snd_kcontrol 
> *kcontrol, int op_flag,
>  
>   tlv = kzalloc(size, GFP_KERNEL);
>   if (!tlv) {
> - virtsnd_ctl_msg_unref(msg);
> - return -ENOMEM;
> + rc = -ENOMEM;
> + goto on_msg_unref;
>   }
>  
>   sg_init_one(&sg, tlv, size);
> @@ -281,14 +281,25 @@ static int virtsnd_kctl_tlv_op(struct snd_kcontrol 
> *kcontrol, int op_flag,
>   hdr->hdr.code =
>   cpu_to_le32(VIRTIO_SND_R_CTL_TLV_COMMAND);
>  
> - if (copy_from_user(tlv, utlv, size))
> + if (copy_from_user(tlv, utlv, size)) {
>   rc = -EFAULT;
> - else
> + goto on_msg_unref;
> + } else {
>   rc = virtsnd_ctl_msg_send(snd, msg, &sg, NULL, false);
> + }
>  
>   break;
> + default:
> + rc = -EINVAL;
> + /* We never get here - we listed all values for op_flag */
> + WARN_ON(1);
> + goto on_msg_unref;
>   }
> + kfree(tlv);
> + return rc;
>  
> +on_msg_unref:
> + virtsnd_ctl_msg_unref(msg);
>   kfree(tlv);
>  
>   return rc;

I don't really like adding code for a false-positive but ALSA
maintainers seem to like this. If yes, this seems like as good
a way as any to do it.

Acked-by: Michael S. Tsirkin 


> -- 
> 2.43.2




Re: [v3 PATCH] ALSA: virtio: Fix "Coverity: virtsnd_kctl_tlv_op(): Uninitialized variables" warning.

2024-02-14 Thread Michael S. Tsirkin
On Wed, Feb 14, 2024 at 03:01:10PM +0100, Aiswarya Cyriac wrote:
> This commit fixes the following warning when building virtio_snd driver.
> 
> "
> *** CID 1583619:  Uninitialized variables  (UNINIT)
> sound/virtio/virtio_kctl.c:294 in virtsnd_kctl_tlv_op()
> 288
> 289   break;
> 290   }
> 291
> 292   kfree(tlv);
> 293
> vvv CID 1583619:  Uninitialized variables  (UNINIT)
> vvv Using uninitialized value "rc".
> 294   return rc;
> 295 }
> 296
> 297 /**
> 298  * virtsnd_kctl_get_enum_items() - Query items for the ENUMERATED 
> element type.
> 299  * @snd: VirtIO sound device.
> "
> 
> This warning is caused by the absence of the "default" branch in the
> switch-block, and is a false positive because the kernel calls
> virtsnd_kctl_tlv_op() only with values for op_flag processed in
> this block.
> 
> Also, this commit unifies the cleanup path for all possible control
> paths in the callback function.
> 
> Signed-off-by: Anton Yakovlev 
> Signed-off-by: Aiswarya Cyriac 
> Reported-by: coverity-bot 
> Addresses-Coverity-ID: 1583619 ("Uninitialized variables")
> Fixes: d6568e3de42d ("ALSA: virtio: add support for audio controls")
> ---
>  sound/virtio/virtio_kctl.c | 25 +
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/sound/virtio/virtio_kctl.c b/sound/virtio/virtio_kctl.c
> index 0c6ac74aca1e..40606eb381af 100644
> --- a/sound/virtio/virtio_kctl.c
> +++ b/sound/virtio/virtio_kctl.c
> @@ -253,8 +253,8 @@ static int virtsnd_kctl_tlv_op(struct snd_kcontrol 
> *kcontrol, int op_flag,
>  
>   tlv = kzalloc(size, GFP_KERNEL);
>   if (!tlv) {
> - virtsnd_ctl_msg_unref(msg);
> - return -ENOMEM;
> + rc = -ENOMEM;
> + goto on_cleanup;
>   }
>  
>   sg_init_one(&sg, tlv, size);
> @@ -266,6 +266,11 @@ static int virtsnd_kctl_tlv_op(struct snd_kcontrol 
> *kcontrol, int op_flag,
>   case SNDRV_CTL_TLV_OP_READ:
>   hdr->hdr.code = cpu_to_le32(VIRTIO_SND_R_CTL_TLV_READ);
>  
> + /* Since virtsnd_ctl_msg_send() drops the reference, we increase
> +  * the counter to be consistent with the on_cleanup path.
> +  */


This is not how multi-line comments should look.


Adding overhead here is just a waste of cycles.
Instead, separate error handling and normal exit paths.
Then you will not need to increase the refcount here.

> + virtsnd_ctl_msg_ref(msg);
> +
> + rc = virtsnd_ctl_msg_send(snd, msg, NULL, &sg, false);
>   if (!rc) {
>   if (copy_to_user(utlv, tlv, size))
> @@ -281,14 +286,26 @@ static int virtsnd_kctl_tlv_op(struct snd_kcontrol 
> *kcontrol, int op_flag,
>   hdr->hdr.code =
>   cpu_to_le32(VIRTIO_SND_R_CTL_TLV_COMMAND);
>  
> - if (copy_from_user(tlv, utlv, size))
> + if (copy_from_user(tlv, utlv, size)) {
>   rc = -EFAULT;
> - else
> + } else {
> + /* Same as the comment above */

Same thing.
Besides, this kind of cross referencing breaks immediately when
someone adds a comment in the middle.

> + virtsnd_ctl_msg_ref(msg);
> +
> + rc = virtsnd_ctl_msg_send(snd, msg, &sg, NULL, false);
> + }
> +
> + break;
> + default:
> + rc = -EINVAL;


/* We never get here - we listed all values for op_flag */

> + WARN_ON(1);
>  
>   break;
>   }
>  
> +on_cleanup:
> + virtsnd_ctl_msg_unref(msg);
> +
>   kfree(tlv);
>  
>   return rc;

on_cleanup is not informative, coding style says:
"Choose label names which say what the goto does or why the goto
exists."

And saving on duplication here by paying elsewhere does not make sense.
So you do this instead:


kfree(tlv);
return rc;

on_error:
virtsnd_ctl_msg_unref(msg);
kfree(tlv);
return rc;


This is very idiomatic.
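[Editorial aside: a self-contained sketch of that shape — a plain success exit plus a labeled error exit that additionally drops a reference, then shares the free. Toy refcount and stand-in names, not the driver's actual code.]

```c
#include <assert.h>
#include <stdlib.h>

struct msg { int refs; };

static void msg_unref(struct msg *m) { m->refs--; }

/* Normal path: a (stubbed) successful send consumes the reference.
 * Error path: the labeled exit drops it explicitly. */
static int do_tlv_op(struct msg *m, int force_fail)
{
	char *tlv = malloc(16);
	int rc;

	if (!tlv) {
		rc = -1;
		goto on_error;
	}
	if (force_fail) {
		rc = -2;
		goto on_error;
	}

	m->refs--;      /* stand-in for a send that took ownership of the ref */
	rc = 0;

	free(tlv);
	return rc;

on_error:
	msg_unref(m);   /* the send never ran, so drop the reference here */
	free(tlv);      /* free(NULL) is safe, so this covers the ENOMEM path too */
	return rc;
}
```

Both exits leave the reference count balanced, which is the whole point of separating them instead of funneling everything through one cleanup label.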

> -- 
> 2.43.0




Re: [PATCH] ALSA: virtio: Fix "Coverity: virtsnd_kctl_tlv_op(): Uninitialized variables" warning.

2024-02-14 Thread Michael S. Tsirkin
On Wed, Feb 14, 2024 at 09:08:26AM +, Aiswarya Cyriac wrote:
> Hi Michael,
> 
> Thank you for reviewing. I have updated my response inline
> 
> On Tue, Feb 13, 2024 at 09:51:30AM +0100, Aiswarya Cyriac wrote:
> >> Fix the following warning when building virtio_snd driver.
> >>
> >> "
> >> *** CID 1583619:  Uninitialized variables  (UNINIT)
> >> sound/virtio/virtio_kctl.c:294 in virtsnd_kctl_tlv_op()
> >> 288
> >> 289 break;
> >> 290   }
> >> 291
> >> 292   kfree(tlv);
> >> 293
> >> vvv CID 1583619:  Uninitialized variables  (UNINIT)
> >> vvv Using uninitialized value "rc".
> >> 294   return rc;
> >> 295 }
> >> 296
> >> 297 /**
> >> 298  * virtsnd_kctl_get_enum_items() - Query items for the ENUMERATED 
> >> element type.
> >> 299  * @snd: VirtIO sound device.
> >> "
> >>
> >> Signed-off-by: Anton Yakovlev 
> >> Signed-off-by: Aiswarya Cyriac 
> >> Reported-by: coverity-bot 
> >> Addresses-Coverity-ID: 1583619 ("Uninitialized variables")
> >> Fixes: d6568e3de42d ("ALSA: virtio: add support for audio controls")
> 
> >I don't know enough about ALSA to say whether the patch is correct.  But
> >the commit log needs work: please, do not "fix warnings" - analyse the
> >code and explain whether there is a real issue and if yes what is it
> >and how it can trigger. Is an invalid op_flag ever passed?
> >If it's just a coverity false positive it might be ok to
> >work around that but document this.
> 
> This warning is caused by the absence of the "default" branch in the
> switch-block, and is a false positive because the kernel calls
> virtsnd_kctl_tlv_op() only with values for op_flag processed in
> this block.

Well we don't normally have functions validate inputs.
In this case I am not really sure we should bother
with adding dead code. If you really want to, add BUG_ON.



> I will update the fix and send a v2 patch
> 
> >> ---
> >>  sound/virtio/virtio_kctl.c | 5 +
> >>  1 file changed, 5 insertions(+)
> >>
> >> diff --git a/sound/virtio/virtio_kctl.c b/sound/virtio/virtio_kctl.c
> >> index 0c6ac74aca1e..d7a160c5db03 100644
> >> --- a/sound/virtio/virtio_kctl.c
> >> +++ b/sound/virtio/virtio_kctl.c
> >> @@ -286,6 +286,11 @@ static int virtsnd_kctl_tlv_op(struct snd_kcontrol 
> >> *kcontrol, int op_flag,
> >>else
> >>rc = virtsnd_ctl_msg_send(snd, msg, , NULL, 
> >> false);
> >>
> >> + break;
> >> + default:
> >> + virtsnd_ctl_msg_unref(msg);
> >> + rc = -EINVAL;
> >> +
> 
> >There's already virtsnd_ctl_msg_unref call above.
> >Also don't we need virtsnd_ctl_msg_unref on other error paths
> >such as EFAULT?
> >Unify error handling to fix it all then?
> 
> This also need to be handled and virtsnd_ctl_msg_unref needed in case of 
> EFAULT as well.
> I will update the patch.
> 
> 
> Thanks,
> Aiswarya Cyriac
> Software Engineer
> OpenSynergy GmbH
> Rotherstr. 20, 10245 Berlin
> 
> EMail: aiswarya.cyr...@opensynergy.com
> 
> www.opensynergy.com
> Handelsregister/Commercial Registry: Amtsgericht Charlottenburg, HRB 108616B
> Geschäftsführer/Managing Director: Régis Adjamah
> 
> 
> From: Michael S. Tsirkin 
> Sent: Tuesday, February 13, 2024 10:06 AM
> To: Aiswarya Cyriac
> Cc: jasow...@redhat.com; pe...@perex.cz; ti...@suse.com; 
> linux-kernel@vger.kernel.org; alsa-de...@alsa-project.org; 
> virtualizat...@lists.linux-foundation.org; virtio-...@lists.oasis-open.org; 
> Anton Yakovlev; coverity-bot
> Subject: Re: [PATCH] ALSA: virtio: Fix "Coverity: virtsnd_kctl_tlv_op(): 
> Uninitialized variables" warning.
> 
> On Tue, Feb 13, 2024 at 09:51:30AM +0100, Aiswarya Cyriac wrote:
> > Fix the following warning when building virtio_snd driver.
> >
> > "
> > *** CID 1583619:  Uninitialized variables  (UNINIT)
> > sound/virtio/virtio_kctl.c:294 in virtsnd_kctl_tlv_op()
> > 288
> > 289 break;
> > 290   }
> > 291
> > 292   kfree(tlv);
> > 293
> > vvv CID 1583619:  Uninitialized variables  (UNINIT)
> > vvv Using uninitialized value "rc".
> > 294   return rc;
> > 295 }
> > 296
> > 297 /**
> > 298  * virtsnd_kctl_get_enum_items(

Re: [PATCH] ALSA: virtio: Fix "Coverity: virtsnd_kctl_tlv_op(): Uninitialized variables" warning.

2024-02-13 Thread Michael S. Tsirkin
On Tue, Feb 13, 2024 at 10:02:24AM +0100, Takashi Iwai wrote:
> On Tue, 13 Feb 2024 09:51:30 +0100,
> Aiswarya Cyriac wrote:
> > 
> > Fix the following warning when building virtio_snd driver.
> > 
> > "
> > *** CID 1583619:  Uninitialized variables  (UNINIT)
> > sound/virtio/virtio_kctl.c:294 in virtsnd_kctl_tlv_op()
> > 288
> > 289 break;
> > 290   }
> > 291
> > 292   kfree(tlv);
> > 293
> > vvv CID 1583619:  Uninitialized variables  (UNINIT)
> > vvv Using uninitialized value "rc".
> > 294   return rc;
> > 295 }
> > 296
> > 297 /**
> > 298  * virtsnd_kctl_get_enum_items() - Query items for the ENUMERATED 
> > element type.
> > 299  * @snd: VirtIO sound device.
> > "
> > 
> > Signed-off-by: Anton Yakovlev 
> > Signed-off-by: Aiswarya Cyriac 
> > Reported-by: coverity-bot 
> > Addresses-Coverity-ID: 1583619 ("Uninitialized variables")
> > Fixes: d6568e3de42d ("ALSA: virtio: add support for audio controls")
> 
> Thanks, applied.
> 
> 
> Takashi

Why did you apply it directly? The patch isn't great IMHO.
Why not give people a couple of days to review?

-- 
MST




Re: [PATCH] ALSA: virtio: Fix "Coverity: virtsnd_kctl_tlv_op(): Uninitialized variables" warning.

2024-02-13 Thread Michael S. Tsirkin
On Tue, Feb 13, 2024 at 09:51:30AM +0100, Aiswarya Cyriac wrote:
> Fix the following warning when building virtio_snd driver.
> 
> "
> *** CID 1583619:  Uninitialized variables  (UNINIT)
> sound/virtio/virtio_kctl.c:294 in virtsnd_kctl_tlv_op()
> 288
> 289 break;
> 290   }
> 291
> 292   kfree(tlv);
> 293
> vvv CID 1583619:  Uninitialized variables  (UNINIT)
> vvv Using uninitialized value "rc".
> 294   return rc;
> 295 }
> 296
> 297 /**
> 298  * virtsnd_kctl_get_enum_items() - Query items for the ENUMERATED 
> element type.
> 299  * @snd: VirtIO sound device.
> "
> 
> Signed-off-by: Anton Yakovlev 
> Signed-off-by: Aiswarya Cyriac 
> Reported-by: coverity-bot 
> Addresses-Coverity-ID: 1583619 ("Uninitialized variables")
> Fixes: d6568e3de42d ("ALSA: virtio: add support for audio controls")

I don't know enough about ALSA to say whether the patch is correct.  But
the commit log needs work: please, do not "fix warnings" - analyse the
code and explain whether there is a real issue and if yes what is it
and how it can trigger. Is an invalid op_flag ever passed?
If it's just a coverity false positive it might be ok to
work around that but document this.


> ---
>  sound/virtio/virtio_kctl.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/sound/virtio/virtio_kctl.c b/sound/virtio/virtio_kctl.c
> index 0c6ac74aca1e..d7a160c5db03 100644
> --- a/sound/virtio/virtio_kctl.c
> +++ b/sound/virtio/virtio_kctl.c
> @@ -286,6 +286,11 @@ static int virtsnd_kctl_tlv_op(struct snd_kcontrol 
> *kcontrol, int op_flag,
>   else
>   rc = virtsnd_ctl_msg_send(snd, msg, , NULL, false);
>  
> + break;
> + default:
> + virtsnd_ctl_msg_unref(msg);
> + rc = -EINVAL;
> +

There's already virtsnd_ctl_msg_unref call above.
Also don't we need virtsnd_ctl_msg_unref on other error paths
such as EFAULT?
Unify error handling to fix it all then?

>   break;
>   }
>  
> -- 
> 2.43.0




Re: [PATCH V1] vdpa: suspend and resume require DRIVER_OK

2024-02-12 Thread Michael S. Tsirkin
On Mon, Feb 12, 2024 at 11:37:12AM -0500, Steven Sistare wrote:
> On 2/12/2024 10:56 AM, Michael S. Tsirkin wrote:
> > On Mon, Feb 12, 2024 at 09:56:31AM -0500, Steven Sistare wrote:
> >> On 2/12/2024 3:19 AM, Michael S. Tsirkin wrote:
> >>> On Fri, Feb 09, 2024 at 02:29:59PM -0800, Steve Sistare wrote:
> >>>> Calling suspend or resume requires VIRTIO_CONFIG_S_DRIVER_OK, for all
> >>>> vdpa devices.
> >>>>
> >>>> Suggested-by: Eugenio Perez Martin "
> >>>> Signed-off-by: Steve Sistare 
> >>>
> >>> I don't think failing suspend or resume makes sense though -
> >>> e.g. practically failing suspend will just prevent sleeping I think -
> >>> why should guest not having driver loaded prevent system suspend?
> >>
> >> Got it, my fix is too heavy handed.
> >>
> >>> there's also state such as features set which does need to be
> >>> preserved.
> >>>
> >>> I think the thing to do is to skip invoking suspend/resume callback
> >>
> >> OK.
> >>
> >>>  and in
> >>> fact checking suspend/resume altogether.
> >>
> >> Currently ops->suspend, vhost_vdpa_can_suspend(), and 
> >> VHOST_BACKEND_F_SUSPEND
> >> are equivalent.  Hence if !ops->suspend, then the driver does not 
> >> support
> >> it, and indeed may break if suspend is used, so system suspend must be 
> >> blocked,
> >> AFAICT.  Yielding:
> > 
> > If DRIVER_OK is not set then there's nothing to be done for migration.
> > So callback not needed.
> 
> OK, I missed your point.  Next attempt:
> 
>vhost_vdpa_suspend()
>if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
>return 0;
> 
>if (!ops->suspend)
>return -EOPNOTSUPP;

right

> - Steve
> >> vhost_vdpa_suspend()
> >> if (!ops->suspend)
> >> return -EOPNOTSUPP;
> >>
> >> if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> >> return 0;
> >>
> >> - Steve
> >>
> >>>> ---
> >>>>  drivers/vhost/vdpa.c | 6 ++
> >>>>  1 file changed, 6 insertions(+)
> >>>>
> >>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>>> index bc4a51e4638b..ce1882acfc3b 100644
> >>>> --- a/drivers/vhost/vdpa.c
> >>>> +++ b/drivers/vhost/vdpa.c
> >>>> @@ -598,6 +598,9 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> >>>>  if (!ops->suspend)
> >>>>  return -EOPNOTSUPP;
> >>>>  
> >>>> +if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> >>>> +return -EINVAL;
> >>>> +
> >>>>  ret = ops->suspend(vdpa);
> >>>>  if (!ret)
> >>>>  v->suspended = true;
> >>>> @@ -618,6 +621,9 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
> >>>>  if (!ops->resume)
> >>>>  return -EOPNOTSUPP;
> >>>>  
> >>>> +if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> >>>> +return -EINVAL;
> >>>> +
> >>>>  ret = ops->resume(vdpa);
> >>>>  if (!ret)
> >>>>  v->suspended = false;
> >>>> -- 
> >>>> 2.39.3
> >>>
> > 




Re: [PATCH V1] vdpa: suspend and resume require DRIVER_OK

2024-02-12 Thread Michael S. Tsirkin
On Mon, Feb 12, 2024 at 09:56:31AM -0500, Steven Sistare wrote:
> On 2/12/2024 3:19 AM, Michael S. Tsirkin wrote:
> > On Fri, Feb 09, 2024 at 02:29:59PM -0800, Steve Sistare wrote:
> >> Calling suspend or resume requires VIRTIO_CONFIG_S_DRIVER_OK, for all
> >> vdpa devices.
> >>
> >> Suggested-by: Eugenio Perez Martin "
> >> Signed-off-by: Steve Sistare 
> > 
> > I don't think failing suspend or resume makes sense though -
> > e.g. practically failing suspend will just prevent sleeping I think -
> > why should guest not having driver loaded prevent system suspend?
> 
> Got it, my fix is too heavy handed.
> 
> > there's also state such as features set which does need to be
> > preserved.
> > 
> > I think the thing to do is to skip invoking suspend/resume callback
> 
> OK.
> 
> >  and in
> > fact checking suspend/resume altogether.
> 
> Currently ops->suspend, vhost_vdpa_can_suspend(), and VHOST_BACKEND_F_SUSPEND
> are equivalent.  Hence if !ops->suspend, then the driver does not support
> it, and indeed may break if suspend is used, so system suspend must be 
> blocked,
> AFAICT.  Yielding:

If DRIVER_OK is not set then there's nothing to be done for migration.
So callback not needed.


> vhost_vdpa_suspend()
> if (!ops->suspend)
> return -EOPNOTSUPP;
> 
> if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> return 0;
> 
> - Steve
> 
> >> ---
> >>  drivers/vhost/vdpa.c | 6 ++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index bc4a51e4638b..ce1882acfc3b 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -598,6 +598,9 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> >>if (!ops->suspend)
> >>return -EOPNOTSUPP;
> >>  
> >> +  if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> >> +  return -EINVAL;
> >> +
> >>ret = ops->suspend(vdpa);
> >>if (!ret)
> >>v->suspended = true;
> >> @@ -618,6 +621,9 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
> >>if (!ops->resume)
> >>return -EOPNOTSUPP;
> >>  
> >> +  if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> >> +  return -EINVAL;
> >> +
> >>ret = ops->resume(vdpa);
> >>if (!ret)
> >>v->suspended = false;
> >> -- 
> >> 2.39.3
> > 




Re: [PATCH V1] vdpa: suspend and resume require DRIVER_OK

2024-02-12 Thread Michael S. Tsirkin
On Fri, Feb 09, 2024 at 02:29:59PM -0800, Steve Sistare wrote:
> Calling suspend or resume requires VIRTIO_CONFIG_S_DRIVER_OK, for all
> vdpa devices.
> 
> Suggested-by: Eugenio Perez Martin "
> Signed-off-by: Steve Sistare 

I don't think failing suspend or resume makes sense though -
e.g. practically failing suspend will just prevent sleeping I think -
why should guest not having driver loaded prevent
system suspend?

there's also state such as features set which does need to be
preserved.

I think the thing to do is to skip invoking suspend/resume callback, and in
fact checking suspend/resume altogether.

> ---
>  drivers/vhost/vdpa.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index bc4a51e4638b..ce1882acfc3b 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -598,6 +598,9 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
>   if (!ops->suspend)
>   return -EOPNOTSUPP;
>  
> + if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> + return -EINVAL;
> +
>   ret = ops->suspend(vdpa);
>   if (!ret)
>   v->suspended = true;
> @@ -618,6 +621,9 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
>   if (!ops->resume)
>   return -EOPNOTSUPP;
>  
> + if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> + return -EINVAL;
> +
>   ret = ops->resume(vdpa);
>   if (!ret)
>   v->suspended = false;
> -- 
> 2.39.3




Re: [PATCH] vhost-vdpa: fail enabling virtqueue in certain conditions

2024-02-06 Thread Michael S. Tsirkin
better @subj: try late vq enable only if negotiated

On Tue, Feb 06, 2024 at 03:51:54PM +0100, Stefano Garzarella wrote:
> If VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is not negotiated, we expect
> the driver to enable virtqueue before setting DRIVER_OK. If the driver
> tries anyway, better to fail right away as soon as we get the ioctl.
> Let's also update the documentation to make it clearer.
> 
> We had a problem in QEMU for not meeting this requirement, see
> https://lore.kernel.org/qemu-devel/20240202132521.32714-1-kw...@redhat.com/
> 
> Fixes: 9f09fd6171fe ("vdpa: accept VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK 
> backend feature")
> Cc: epere...@redhat.com
> Signed-off-by: Stefano Garzarella 
> ---
>  include/uapi/linux/vhost_types.h | 3 ++-
>  drivers/vhost/vdpa.c | 4 
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/vhost_types.h 
> b/include/uapi/linux/vhost_types.h
> index d7656908f730..5df49b6021a7 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -182,7 +182,8 @@ struct vhost_vdpa_iova_range {
>  /* Device can be resumed */
>  #define VHOST_BACKEND_F_RESUME  0x5
>  /* Device supports the driver enabling virtqueues both before and after
> - * DRIVER_OK
> + * DRIVER_OK. If this feature is not negotiated, the virtqueues must be
> + * enabled before setting DRIVER_OK.
>   */
>  #define VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK  0x6
>  /* Device may expose the virtqueue's descriptor area, driver area and
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index bc4a51e4638b..1fba305ba8c1 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -651,6 +651,10 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, 
> unsigned int cmd,
>   case VHOST_VDPA_SET_VRING_ENABLE:
>   if (copy_from_user(, argp, sizeof(s)))
>   return -EFAULT;
> + if (!vhost_backend_has_feature(vq,
> + VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) &&
> + (ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> + return -EINVAL;
>   ops->set_vq_ready(vdpa, idx, s.num);
>   return 0;
>   case VHOST_VDPA_GET_VRING_GROUP:
> -- 
> 2.43.0
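The gating logic added by the patch can be modelled in a few lines of userspace C (simplified: the real handler also performs the enable itself; the feature bit position 6 matches VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK in vhost_types.h):

```c
#define VIRTIO_CONFIG_S_DRIVER_OK 4
#define BACKEND_F_ENABLE_AFTER_DRIVER_OK (1ULL << 6)
#define EINVAL 22

/*
 * A late vring enable (after DRIVER_OK) is only legal if the
 * ENABLE_AFTER_DRIVER_OK backend feature was negotiated; otherwise
 * the ioctl fails right away instead of silently misbehaving.
 */
static int vring_enable_allowed(unsigned long long negotiated_features,
                                unsigned char status)
{
    if (!(negotiated_features & BACKEND_F_ENABLE_AFTER_DRIVER_OK) &&
        (status & VIRTIO_CONFIG_S_DRIVER_OK))
        return -EINVAL;

    return 0;
}
```

Early enables (before DRIVER_OK) remain unconditionally allowed, which is the pre-existing contract the documentation update spells out.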




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-01 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > >  Summary 
> > > 
> > > In my (non-vhost experience) opinion the way to go would be either
> > > replacing the cond_resched with a hard schedule or setting the
> > > need_resched flag within vhost if the a data transfer was successfully
> > > initiated. It will be necessary to check if this causes problems with
> > > other workloads/benchmarks.
> > 
> > Yes but conceptually I am still in the dark on whether the fact that
> > periodically invoking cond_resched is no longer sufficient to be nice to
> > others is a bug, or intentional.  So you feel it is intentional?
> 
> I would assume that cond_resched is still a valid concept.
> But, in this particular scenario we have the following problem:
> 
> So far (with CFS) we had:
> 1. vhost initiates data transfer
> 2. kworker is woken up
> 3. CFS gives priority to woken up task and schedules it
> 4. kworker runs
> 
> Now (with EEVDF) we have:
> 0. In some cases, kworker has accumulated negative lag 
> 1. vhost initiates data transfer
> 2. kworker is woken up
> -3a. EEVDF does not schedule kworker if it has negative lag
> -4a. vhost continues running, kworker on same CPU starves
> --
> -3b. EEVDF schedules kworker if it has positive or no lag
> -4b. kworker runs
> 
> In the 3a/4a case, the kworker is given no chance to set the
> necessary flag. The flag can only be set by another CPU now.
> The schedule of the kworker was not caused by cond_resched, but
> rather by the wakeup path of the scheduler.
> 
> cond_resched works successfully once the load balancer (I suppose) 
> decides to migrate the vhost off to another CPU. In that case, the
> load balancer on another CPU sets that flag and we are good.
> That then eventually allows the scheduler to pick kworker, but very
> late.

I don't really understand what is special about vhost though.
Wouldn't it apply to any kernel code?
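The lag behaviour described in the quoted 3a/4a case can be reduced to a toy model (this is an illustration of EEVDF's eligibility idea, not the kernel implementation): a woken task is only picked immediately if its lag is non-negative, i.e. its virtual runtime has not run ahead of the queue average.

```c
/*
 * Toy model: lag = avg_vruntime - vruntime. A task with negative lag
 * (it has run ahead of its fair share) is not eligible, so its wakeup
 * does not preempt the current task -- the vhost/kworker starvation
 * pattern described above.
 */
struct toy_task { long vruntime; };

static int eligible(const struct toy_task *t, long avg_vruntime)
{
    return avg_vruntime - t->vruntime >= 0;   /* lag >= 0 */
}
```

Under CFS the woken kworker would have been placed to run promptly either way; under this rule, accumulated negative lag keeps it off the CPU until the average catches up.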

> > I propose a two patch series then:
> > 
> > patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> > 
> > If you're doing longer computations: first think userspace. If you
> > **really** want to do it in kernel you should regularly check if you need
> > to give up the CPU (remember there is cooperative multitasking per CPU).
> > Idiom::
> > 
> > cond_resched(); /* Will sleep */
> > 
> > 
> > replace cond_resched -> schedule
> > 
> > 
> > Since apparently cond_resched is no longer sufficient to
> > make the scheduler check whether you need to give up the CPU.
> > 
> > patch 2: make this change for vhost.
> > 
> > WDYT?
> 
> For patch 1, I would like to see some feedback from Peter (or someone else
> from the scheduler maintainers).

I am guessing once you post it you will see feedback.

> For patch 2, I would prefer to do some more testing first if this might have
> an negative effect on other benchmarks.
> 
> I also stumbled upon something in the scheduler code that I want to verify.
> Maybe a cgroup thing, will check that out again.
> 
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.
> 
> > 
> > -- 
> > MST
> > 
> > 



